Summary statistics for data frames with skimr

dataframes
summarizing
Published: June 26, 2023

Setup

skimr::skim() provides handy summary statistics for data frames and related objects, including small sparkline-style histograms right in the output.

library(skimr)
library(dplyr)  # provides group_by() and filter(), used below

palmerpenguins::penguins |> 
  skim()
── Data Summary ────────────────────────
                           Values                  
Name                       palmerpenguins::penguins
Number of rows             344                     
Number of columns          8                       
_______________________                            
Column type frequency:                             
  factor                   3                       
  numeric                  5                       
________________________                           
Group variables            None                    

── Variable type: factor ───────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 species               0         1     FALSE          3
2 island                0         1     FALSE          3
3 sex                  11         0.968 FALSE          2
  top_counts                 
1 Ade: 152, Gen: 124, Chi: 68
2 Bis: 168, Dre: 124, Tor: 52
3 mal: 168, fem: 165         

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate   mean      sd     p0    p25    p50
1 bill_length_mm            2         0.994   43.9   5.46    32.1   39.2   44.4
2 bill_depth_mm             2         0.994   17.2   1.97    13.1   15.6   17.3
3 flipper_length_mm         2         0.994  201.   14.1    172    190    197  
4 body_mass_g               2         0.994 4202.  802.    2700   3550   4050  
5 year                      0         1     2008.    0.818 2007   2007   2008  
     p75   p100 hist 
1   48.5   59.6 ▃▇▇▆▁
2   18.7   21.5 ▅▅▇▇▂
3  213    231   ▂▇▃▅▂
4 4750   6300   ▃▇▆▃▂
5 2009   2009   ▇▁▇▁▇

It also handles grouped data nicely:

palmerpenguins::penguins |> 
  group_by(species) |> 
  skim()
── Data Summary ────────────────────────
                           Values                      
Name                       group_by(palmerpenguins::...
Number of rows             344                         
Number of columns          8                           
_______________________                                
Column type frequency:                                 
  factor                   2                           
  numeric                  5                           
________________________                               
Group variables            species                     

── Variable type: factor ───────────────────────────────────────────────────────
  skim_variable species   n_missing complete_rate ordered n_unique
1 island        Adelie            0         1     FALSE          3
2 island        Chinstrap         0         1     FALSE          1
3 island        Gentoo            0         1     FALSE          1
4 sex           Adelie            6         0.961 FALSE          2
5 sex           Chinstrap         0         1     FALSE          2
6 sex           Gentoo            5         0.960 FALSE          2
  top_counts               
1 Dre: 56, Tor: 52, Bis: 44
2 Dre: 68, Bis: 0, Tor: 0  
3 Bis: 124, Dre: 0, Tor: 0 
4 fem: 73, mal: 73         
5 fem: 34, mal: 34         
6 mal: 61, fem: 58         

── Variable type: numeric ──────────────────────────────────────────────────────
   skim_variable     species   n_missing complete_rate   mean      sd     p0
 1 bill_length_mm    Adelie            1         0.993   38.8   2.66    32.1
 2 bill_length_mm    Chinstrap         0         1       48.8   3.34    40.9
 3 bill_length_mm    Gentoo            1         0.992   47.5   3.08    40.9
 4 bill_depth_mm     Adelie            1         0.993   18.3   1.22    15.5
 5 bill_depth_mm     Chinstrap         0         1       18.4   1.14    16.4
 6 bill_depth_mm     Gentoo            1         0.992   15.0   0.981   13.1
 7 flipper_length_mm Adelie            1         0.993  190.    6.54   172  
 8 flipper_length_mm Chinstrap         0         1      196.    7.13   178  
 9 flipper_length_mm Gentoo            1         0.992  217.    6.48   203  
10 body_mass_g       Adelie            1         0.993 3701.  459.    2850  
11 body_mass_g       Chinstrap         0         1     3733.  384.    2700  
12 body_mass_g       Gentoo            1         0.992 5076.  504.    3950  
13 year              Adelie            0         1     2008.    0.822 2007  
14 year              Chinstrap         0         1     2008.    0.863 2007  
15 year              Gentoo            0         1     2008.    0.792 2007  
      p25    p50    p75   p100 hist 
 1   36.8   38.8   40.8   46   ▁▆▇▆▁
 2   46.3   49.6   51.1   58   ▂▇▇▅▁
 3   45.3   47.3   49.6   59.6 ▃▇▆▁▁
 4   17.5   18.4   19     21.5 ▂▆▇▃▁
 5   17.5   18.4   19.4   20.8 ▅▇▇▆▂
 6   14.2   15     15.7   17.3 ▅▇▇▆▂
 7  186    190    195    210   ▁▆▇▅▁
 8  191    196    201    212   ▁▅▇▅▂
 9  212    216    221    231   ▂▇▇▆▃
10 3350   3700   4000   4775   ▅▇▇▃▂
11 3488.  3700   3950   4800   ▁▅▇▃▁
12 4700   5000   5500   6300   ▃▇▇▇▂
13 2007   2008   2009   2009   ▇▁▇▁▇
14 2007   2008   2009   2009   ▇▁▆▁▇
15 2007   2008   2009   2009   ▆▁▇▁▇

And finally, since the output of skim() inherits from data.frame (its class is skim_df), you can include it in a pipeline, e.g. to filter on one of the summary statistics:

palmerpenguins::penguins |> 
  skim() |> 
  filter(numeric.sd > 1)
── Data Summary ────────────────────────
                           Values                  
Name                       palmerpenguins::penguins
Number of rows             344                     
Number of columns          8                       
_______________________                            
Column type frequency:                             
  numeric                  4                       
________________________                           
Group variables            None                    

── Variable type: numeric ──────────────────────────────────────────────────────
  skim_variable     n_missing complete_rate   mean     sd     p0    p25    p50
1 bill_length_mm            2         0.994   43.9   5.46   32.1   39.2   44.4
2 bill_depth_mm             2         0.994   17.2   1.97   13.1   15.6   17.3
3 flipper_length_mm         2         0.994  201.   14.1   172    190    197  
4 body_mass_g               2         0.994 4202.  802.   2700   3550   4050  
     p75   p100 hist 
1   48.5   59.6 ▃▇▇▆▁
2   18.7   21.5 ▅▅▇▇▂
3  213    231   ▂▇▃▅▂
4 4750   6300   ▃▇▆▃▂

The same approach works for the factor statistics:

palmerpenguins::penguins |> 
  skim() |> 
  filter(factor.n_unique > 1)
── Data Summary ────────────────────────
                           Values                  
Name                       palmerpenguins::penguins
Number of rows             344                     
Number of columns          8                       
_______________________                            
Column type frequency:                             
  factor                   3                       
________________________                           
Group variables            None                    

── Variable type: factor ───────────────────────────────────────────────────────
  skim_variable n_missing complete_rate ordered n_unique
1 species               0         1     FALSE          3
2 island                0         1     FALSE          3
3 sex                  11         0.968 FALSE          2
  top_counts                 
1 Ade: 152, Gen: 124, Chi: 68
2 Bis: 168, Dre: 124, Tor: 52
3 mal: 168, fem: 165         
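
As a quick check of the data frame claim above, the sketch below inspects the class of a skim() result. It assumes the skimr and palmerpenguins packages are installed; the variable name skimmed is only illustrative.

```r
library(skimr)

# Skim the penguins data and keep the result instead of printing it.
skimmed <- skim(palmerpenguins::penguins)

# The class vector includes "skim_df" as well as "data.frame".
class(skimmed)

# Because it is a data frame, the usual data frame helpers apply.
is.data.frame(skimmed)
nrow(skimmed)  # one row per skimmed variable
```

Each row of the result describes one column of the input, which is why ordinary row-wise tools such as filter() work on it.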


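To refer to the summary statistic columns, be sure to prefix the column name with the variable type, e.g. factor. or numeric. as appropriate. Columns shared by every type, such as n_missing and complete_rate, take no prefix. For a list of variable types and the default summary statistics for each, call get_default_skimmer_names():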

$AsIs
[1] "n_unique"   "min_length" "max_length"

$character
[1] "min"        "max"        "empty"      "n_unique"   "whitespace"

$complex
[1] "mean"

$Date
[1] "min"      "max"      "median"   "n_unique"

$difftime
[1] "min"      "max"      "median"   "n_unique"

$factor
[1] "ordered"    "n_unique"   "top_counts"

$haven_labelled
[1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"

$list
[1] "n_unique"   "min_length" "max_length"

$logical
[1] "mean"  "count"

$numeric
[1] "mean" "sd"   "p0"   "p25"  "p50"  "p75"  "p100" "hist"

$POSIXct
[1] "min"      "max"      "median"   "n_unique"

$Timespan
[1] "min"      "max"      "median"   "n_unique"

$ts
 [1] "start"      "end"        "frequency"  "deltat"     "mean"      
 [6] "sd"         "min"        "max"        "median"     "line_graph"
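
To see how these statistic names turn into column names, one option is to look at names() on a skim() result. This is a minimal sketch that assumes skimr and palmerpenguins are installed; skimmed and prefixed are illustrative names, not part of skimr.

```r
library(skimr)

skimmed <- skim(palmerpenguins::penguins)

# All column names of the skim result. Type-specific statistics carry a
# "type." prefix; shared columns such as n_missing do not.
names(skimmed)

# Keep only the prefixed (type-specific) column names.
prefixed <- grep("^(factor|numeric)\\.", names(skimmed), value = TRUE)
prefixed

# A prefixed column can be pulled out like any other data frame column.
# Rows for non-numeric variables hold NA here.
skimmed$numeric.mean
```

The prefixed names are the ones to use inside filter() and similar verbs.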

Finally, if you want to print the summary but return the original data frame, use skim_tee().
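
A minimal sketch of skim_tee() in a pipeline, assuming skimr, dplyr, and palmerpenguins are installed. The variable name gentoo is illustrative.

```r
library(skimr)

# skim_tee() prints the skim summary as a side effect and returns its input
# unchanged, so the next step still receives the original penguins data.
gentoo <- palmerpenguins::penguins |>
  skim_tee() |>
  dplyr::filter(species == "Gentoo")

# `gentoo` holds penguin rows, not a skim summary.
nrow(gentoo)  # 124, matching the Gentoo count in the factor summary above
```

This is handy for a quick look at the data partway through a longer pipeline without breaking the chain.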