Summary statistics for dfs with skimr
dataframes
summarizing
skimr::skim() provides handy summary statistics for dfs and related objects, including little sparkline-style histograms right in the output.
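The call behind the summary that follows is not shown in the text. A minimal sketch, assuming the skimr and palmerpenguins packages are installed (the object name skimmed is only illustrative):

```r
# Assumes the skimr and palmerpenguins packages are installed.
# `skimmed` is an illustrative name, not one used in the original text.
skimmed <- skimr::skim(palmerpenguins::penguins)

# Printing shows the data summary, then one table per variable type.
print(skimmed)
```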
── Data Summary ────────────────────────
Values
Name palmerpenguins::penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
numeric 5
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique
1 species 0 1 FALSE 3
2 island 0 1 FALSE 3
3 sex 11 0.968 FALSE 2
top_counts
1 Ade: 152, Gen: 124, Chi: 68
2 Bis: 168, Dre: 124, Tor: 52
3 mal: 168, fem: 165
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50
1 bill_length_mm 2 0.994 43.9 5.46 32.1 39.2 44.4
2 bill_depth_mm 2 0.994 17.2 1.97 13.1 15.6 17.3
3 flipper_length_mm 2 0.994 201. 14.1 172 190 197
4 body_mass_g 2 0.994 4202. 802. 2700 3550 4050
5 year 0 1 2008. 0.818 2007 2007 2008
p75 p100 hist
1 48.5 59.6 ▃▇▇▆▁
2 18.7 21.5 ▅▅▇▇▂
3 213 231 ▂▇▃▅▂
4 4750 6300 ▃▇▆▃▂
5 2009 2009 ▇▁▇▁▇
It also handles grouped data nicely:
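A grouped summary like the one shown here can be sketched as follows, assuming dplyr is installed alongside skimr and palmerpenguins. The exact call used for the output is not given in the text, so treat this as one plausible form:

```r
# Assumes skimr, dplyr and palmerpenguins are installed.
# Group by species first; skim() then reports every statistic per group.
grouped <- skimr::skim(
  dplyr::group_by(palmerpenguins::penguins, species)
)

# The grouping variable appears as its own column next to skim_variable.
print(grouped)
```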
── Data Summary ────────────────────────
Values
Name group_by(palmerpenguins::...
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 2
numeric 5
________________________
Group variables species
── Variable type: factor ───────────────────────────────────────────────────────
skim_variable species n_missing complete_rate ordered n_unique
1 island Adelie 0 1 FALSE 3
2 island Chinstrap 0 1 FALSE 1
3 island Gentoo 0 1 FALSE 1
4 sex Adelie 6 0.961 FALSE 2
5 sex Chinstrap 0 1 FALSE 2
6 sex Gentoo 5 0.960 FALSE 2
top_counts
1 Dre: 56, Tor: 52, Bis: 44
2 Dre: 68, Bis: 0, Tor: 0
3 Bis: 124, Dre: 0, Tor: 0
4 fem: 73, mal: 73
5 fem: 34, mal: 34
6 mal: 61, fem: 58
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable species n_missing complete_rate mean sd p0
1 bill_length_mm Adelie 1 0.993 38.8 2.66 32.1
2 bill_length_mm Chinstrap 0 1 48.8 3.34 40.9
3 bill_length_mm Gentoo 1 0.992 47.5 3.08 40.9
4 bill_depth_mm Adelie 1 0.993 18.3 1.22 15.5
5 bill_depth_mm Chinstrap 0 1 18.4 1.14 16.4
6 bill_depth_mm Gentoo 1 0.992 15.0 0.981 13.1
7 flipper_length_mm Adelie 1 0.993 190. 6.54 172
8 flipper_length_mm Chinstrap 0 1 196. 7.13 178
9 flipper_length_mm Gentoo 1 0.992 217. 6.48 203
10 body_mass_g Adelie 1 0.993 3701. 459. 2850
11 body_mass_g Chinstrap 0 1 3733. 384. 2700
12 body_mass_g Gentoo 1 0.992 5076. 504. 3950
13 year Adelie 0 1 2008. 0.822 2007
14 year Chinstrap 0 1 2008. 0.863 2007
15 year Gentoo 0 1 2008. 0.792 2007
p25 p50 p75 p100 hist
1 36.8 38.8 40.8 46 ▁▆▇▆▁
2 46.3 49.6 51.1 58 ▂▇▇▅▁
3 45.3 47.3 49.6 59.6 ▃▇▆▁▁
4 17.5 18.4 19 21.5 ▂▆▇▃▁
5 17.5 18.4 19.4 20.8 ▅▇▇▆▂
6 14.2 15 15.7 17.3 ▅▇▇▆▂
7 186 190 195 210 ▁▆▇▅▁
8 191 196 201 212 ▁▅▇▅▂
9 212 216 221 231 ▂▇▇▆▃
10 3350 3700 4000 4775 ▅▇▇▃▂
11 3488. 3700 3950 4800 ▁▅▇▃▁
12 4700 5000 5500 6300 ▃▇▇▇▂
13 2007 2008 2009 2009 ▇▁▇▁▇
14 2007 2008 2009 2009 ▇▁▆▁▇
15 2007 2008 2009 2009 ▆▁▇▁▇
Because the output of skim() has class data.frame, you can also include it in a pipeline, e.g. to filter by one of the summary statistics:
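The filters that produced the outputs here are not shown, so the conditions in this sketch are assumptions chosen for illustration. It assumes skimr, dplyr and palmerpenguins are installed and R 4.1 or later for the native pipe:

```r
# Assumes skimr, dplyr and palmerpenguins are installed; |> needs R >= 4.1.
skimmed <- skimr::skim(palmerpenguins::penguins)

# Keep numeric variables whose standard deviation exceeds 1 (illustrative threshold).
# Rows for other variable types have NA in numeric.sd, so filter() drops them.
high_sd <- skimmed |>
  dplyr::filter(numeric.sd > 1)

# Keep factor variables with at least two levels (again, an illustrative condition).
factors_only <- skimmed |>
  dplyr::filter(factor.n_unique >= 2)

print(high_sd)
print(factors_only)
```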
── Data Summary ────────────────────────
Values
Name palmerpenguins::penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
numeric 4
________________________
Group variables None
── Variable type: numeric ──────────────────────────────────────────────────────
skim_variable n_missing complete_rate mean sd p0 p25 p50
1 bill_length_mm 2 0.994 43.9 5.46 32.1 39.2 44.4
2 bill_depth_mm 2 0.994 17.2 1.97 13.1 15.6 17.3
3 flipper_length_mm 2 0.994 201. 14.1 172 190 197
4 body_mass_g 2 0.994 4202. 802. 2700 3550 4050
p75 p100 hist
1 48.5 59.6 ▃▇▇▆▁
2 18.7 21.5 ▅▅▇▇▂
3 213 231 ▂▇▃▅▂
4 4750 6300 ▃▇▆▃▂
── Data Summary ────────────────────────
Values
Name palmerpenguins::penguins
Number of rows 344
Number of columns 8
_______________________
Column type frequency:
factor 3
________________________
Group variables None
── Variable type: factor ───────────────────────────────────────────────────────
skim_variable n_missing complete_rate ordered n_unique
1 species 0 1 FALSE 3
2 island 0 1 FALSE 3
3 sex 11 0.968 FALSE 2
top_counts
1 Ade: 152, Gen: 124, Chi: 68
2 Bis: 168, Dre: 124, Tor: 52
3 mal: 168, fem: 165
To refer to the summary statistic columns, be sure to prefix the column name with the variable type, e.g. factor. or numeric. as appropriate (so the standard deviation of a numeric variable is numeric.sd). For a list of variable types and the default summary statistics for each:
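A listing of this shape is what skimr::get_default_skimmer_names() returns. A minimal sketch of the call, assuming skimr is installed (the name defaults is illustrative):

```r
# Assumes skimr is installed. `defaults` is an illustrative name.
# Returns a named list: one character vector of statistic names per variable type.
defaults <- skimr::get_default_skimmer_names()

print(defaults)

# Build the prefixed column names for one type, e.g. "numeric.mean", "numeric.sd".
numeric_cols <- paste0("numeric.", defaults$numeric)
print(numeric_cols)
```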
$AsIs
[1] "n_unique" "min_length" "max_length"
$character
[1] "min" "max" "empty" "n_unique" "whitespace"
$complex
[1] "mean"
$Date
[1] "min" "max" "median" "n_unique"
$difftime
[1] "min" "max" "median" "n_unique"
$factor
[1] "ordered" "n_unique" "top_counts"
$haven_labelled
[1] "mean" "sd" "p0" "p25" "p50" "p75" "p100" "hist"
$list
[1] "n_unique" "min_length" "max_length"
$logical
[1] "mean" "count"
$numeric
[1] "mean" "sd" "p0" "p25" "p50" "p75" "p100" "hist"
$POSIXct
[1] "min" "max" "median" "n_unique"
$Timespan
[1] "min" "max" "median" "n_unique"
$ts
[1] "start" "end" "frequency" "deltat" "mean"
[6] "sd" "min" "max" "median" "line_graph"
Finally, if you want to print the summary but return the original df, use skim_tee().
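As a rough sketch of that behaviour, assuming skimr and palmerpenguins are installed (the name original is illustrative):

```r
# Assumes skimr and palmerpenguins are installed.
# skim_tee() prints the skim summary as a side effect and
# returns its input unchanged, so it can sit in the middle of a pipeline.
original <- skimr::skim_tee(palmerpenguins::penguins)

# `original` is the penguins data itself, not the summary.
print(dim(original))
```

This makes it convenient to peek at a data frame mid-pipeline without interrupting the steps that come after.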