`janitor`: Clean dirty data, plus improved tables and crosstabs

data cleaning

Published

June 25, 2023

Setup

library(janitor)

janitor contains various tools for examining and cleaning dirty data.

Cleaning dirty data

Clean column names

Let’s create a df with some poorly chosen column names:

test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")

clean_names() does just as the name implies:

test_df |> 
  clean_names()

first_name	abc	percent_successful_2009	repeat_value	repeat_value_2	x
NA	NA	NA	NA	NA	NA

The case argument to clean_names() specifies what case you’d like output names to be in. You can specify any case style that’s available in snakecase::to_any_case(), including “screaming_snake” if you want to be perverse:

test_df |> 
  clean_names(case = "screaming_snake")

FIRST_NAME	ABC	PERCENT_SUCCESSFUL_2009	REPEAT_VALUE	REPEAT_VALUE_2	X
NA	NA	NA	NA	NA	NA
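If all you have is a character vector of names rather than a whole df, janitor also exports make_clean_names(), which applies the same cleaning rules. A minimal sketch (the raw_names vector is made up for illustration, reusing a few of the names from test_df above):

```r
library(janitor)

# Hypothetical vector of messy names, borrowed from test_df above
raw_names <- c("firstName", "REPEAT VALUE", "REPEAT VALUE", "")

# make_clean_names() cleans a character vector directly:
# duplicates get a numeric suffix and empty names become "x"
cleaned <- make_clean_names(raw_names)
cleaned
```

This can be handy when you want to inspect or tweak the cleaned names before assigning them back with names(df) <- cleaned.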

Check if `df`s are row-bind-able

Also useful is compare_df_cols(), which summarizes whether the specified dfs can be row-bound (i.e., whether their columns have the same names and types):

df1 <- data.frame(A = 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = TRUE) # here, column b is a factor
df3 <- df1 |> 
  dplyr::mutate(b = as.character(b))

compare_df_cols(df1, df2, df3)

column_name	df1	df2	df3
a	NA	integer	NA
A	integer	NA	integer
b	character	factor	character
c	NA	numeric	NA

If you just want a simple TRUE/FALSE value telling you whether the dfs match, you can use compare_df_cols_same():

str(compare_df_cols_same(df1, df2, df3, verbose = FALSE))

 logi FALSE
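To get from FALSE to something you can actually bind, one option is to look only at the mismatched columns and then fix them. A rough sketch, assuming you want the factor column converted to character (the *_fixed names are made up for illustration):

```r
library(janitor)

df1 <- data.frame(A = 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0,
                  stringsAsFactors = TRUE)

# Keep only the columns whose types disagree across the dfs
mismatches <- compare_df_cols(df1, df2, return = "mismatch")

# Harmonize: consistent column names, and b as character in both dfs
df1_fixed <- clean_names(df1)
df2_fixed <- df2
df2_fixed$b <- as.character(df2_fixed$b)

# With the default bind_method = "bind_rows", a column that is missing
# from one df (like c here) still counts as a match
same <- compare_df_cols_same(df1_fixed, df2_fixed, verbose = FALSE)

bound <- dplyr::bind_rows(df1_fixed, df2_fixed)
```

If you plan to use base rbind() instead, passing bind_method = "rbind" makes the check stricter, since missing columns then count as mismatches.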

Examining data and crosstabs

janitor’s version of a table is called a tabyl. You can easily generate crosstabs:

palmerpenguins::penguins |> 
  tabyl(species, island)

species	Biscoe	Dream	Torgersen
Adelie	44	56	52
Chinstrap	0	68	0
Gentoo	124	0	0
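tabyl() isn’t limited to crosstabs: given a single variable, it returns a frequency table with counts and proportions. A quick sketch using the built-in mtcars data (swapped in here only so the example runs without extra packages):

```r
library(janitor)

# One-way tabyl: one row per distinct value of cyl,
# with a count column (n) and a proportion column (percent)
cyl_counts <- mtcars |>
  tabyl(cyl)

cyl_counts
```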

There are lots of ways to pretty up the output via adorn_* functions, giving things like column or row percentages, optionally with counts (ns) in parentheses:

palmerpenguins::penguins |> 
  tabyl(species, island) |> 
  adorn_totals("col")  |> 
  adorn_percentages("row")  |> 
  adorn_pct_formatting(digits = 2) |> 
  adorn_ns()

species	Biscoe	Dream	Torgersen	Total
Adelie	28.95% (44)	36.84% (56)	34.21% (52)	100.00% (152)
Chinstrap	0.00% (0)	100.00% (68)	0.00% (0)	100.00% (68)
Gentoo	100.00% (124)	0.00% (0)	0.00% (0)	100.00% (124)
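The same pipeline works for column percentages; only the direction arguments change. A small sketch, again on the built-in mtcars data as a stand-in, that also uses adorn_title() to label both variables in the header:

```r
library(janitor)

cyl_by_gear <- mtcars |>
  tabyl(cyl, gear) |>
  adorn_totals("row") |>               # add a Total row at the bottom
  adorn_percentages("col") |>          # each column sums to 100%
  adorn_pct_formatting(digits = 1) |>  # format proportions as "12.3%"
  adorn_ns() |>                        # append the raw counts in parentheses
  adorn_title("combined")              # header becomes "cyl/gear"

cyl_by_gear
```

Order matters here: the totals go in before the percentages are calculated, and adorn_ns() comes after the formatting step so the counts are appended to the finished percentage strings.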

You can also use the adorn_* functions on regular ol’ dfs:

palmerpenguins::penguins |> 
  dplyr::sample_n(10) |> # choose 10 random rows so this doesn't print forever
  dplyr::select(-year) |> 
  adorn_totals("row")

species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
Chinstrap	Dream	45.7	17.0	195	3650	female
Adelie	Torgersen	35.9	16.6	190	3050	female
Chinstrap	Dream	45.5	17.0	196	3500	female
Gentoo	Biscoe	49.4	15.8	216	4925	male
Adelie	Dream	39.7	17.9	193	4250	male
Chinstrap	Dream	49.5	19.0	200	3800	male
Gentoo	Biscoe	46.6	14.2	210	4850	female
Gentoo	Biscoe	50.5	15.9	222	5550	male
Gentoo	Biscoe	50.0	15.2	218	5700	male
Gentoo	Biscoe	48.2	14.3	210	4600	female
Total	-	461.0	162.9	2050	43875	-

Check out the tabyls vignette for more info.

Session info and package versions

─ Session info ───────────────────────────────────────────────────────────────

 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2023-07-08
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────

           package loadedversion       date         source
               cli         3.6.1 2023-03-23 CRAN (R 4.3.0)
         codetools        0.2-19 2023-02-01 CRAN (R 4.3.1)
            crayon         1.5.2 2022-09-29 CRAN (R 4.3.0)
              crul         1.4.0 2023-05-17 CRAN (R 4.3.0)
              curl         5.0.1 2023-06-07 CRAN (R 4.3.0)
            digest        0.6.31 2022-12-11 CRAN (R 4.3.0)
             dplyr         1.1.2 2023-04-20 CRAN (R 4.3.0)
          ellipsis         0.3.2 2021-04-29 CRAN (R 4.3.0)
          evaluate          0.21 2023-05-05 CRAN (R 4.3.0)
             fansi         1.0.4 2023-01-22 CRAN (R 4.3.0)
           fastmap         1.1.1 2023-02-24 CRAN (R 4.3.0)
 fontBitstreamVera         0.1.1 2017-02-01 CRAN (R 4.3.0)
    fontLiberation         0.1.0 2016-10-15 CRAN (R 4.3.0)
        fontquiver         0.2.1 2017-02-01 CRAN (R 4.3.0)
           gdtools         0.3.3 2023-03-27 CRAN (R 4.3.0)
          generics         0.1.3 2022-07-05 CRAN (R 4.3.0)
            gfonts         0.2.0 2023-01-08 CRAN (R 4.3.0)
              glue         1.6.2 2022-02-24 CRAN (R 4.3.0)
         htmltools         0.5.5 2023-03-23 CRAN (R 4.3.0)
       htmlwidgets         1.6.2 2023-03-17 CRAN (R 4.3.0)
          httpcode         0.3.0 2020-04-10 CRAN (R 4.3.0)
            httpuv        1.6.11 2023-05-11 CRAN (R 4.3.0)
           janitor         2.2.0 2023-02-02 CRAN (R 4.3.0)
          jsonlite         1.8.7 2022-12-06 CRAN (R 4.3.0)
             knitr          1.43 2023-05-25 CRAN (R 4.3.0)
             later         1.3.1 2023-05-02 CRAN (R 4.3.0)
         lifecycle         1.0.3 2022-10-07 CRAN (R 4.3.0)
         lubridate         1.9.2 2023-02-10 CRAN (R 4.3.0)
          magrittr         2.0.3 2022-03-30 CRAN (R 4.3.0)
              mime          0.12 2021-09-28 CRAN (R 4.3.0)
    palmerpenguins         0.1.1 2022-08-15 CRAN (R 4.3.0)
            pillar         1.9.0 2023-03-22 CRAN (R 4.3.0)
         pkgconfig         2.0.3 2019-09-22 CRAN (R 4.3.0)
          promises       1.2.0.1 2021-02-11 CRAN (R 4.3.0)
             purrr         1.0.1 2023-01-10 CRAN (R 4.3.0)
                R6         2.5.1 2021-08-19 CRAN (R 4.3.0)
              ragg         1.2.5 2023-01-12 CRAN (R 4.3.0)
              Rcpp        1.0.10 2023-01-22 CRAN (R 4.3.0)
              renv        0.17.3 2023-04-06 CRAN (R 4.3.0)
             rlang         1.1.1 2023-04-28 CRAN (R 4.3.0)
         rmarkdown          2.22 2023-03-26 CRAN (R 4.3.0)
        rstudioapi          0.14 2022-08-22 CRAN (R 4.3.0)
       sessioninfo         1.2.2 2021-12-06 CRAN (R 4.3.0)
             shiny         1.7.4 2022-12-15 CRAN (R 4.3.0)
         snakecase        0.11.0 2019-05-25 CRAN (R 4.3.0)
           stringi        1.7.12 2023-01-11 CRAN (R 4.3.0)
           stringr         1.5.0 2022-12-02 CRAN (R 4.3.0)
       systemfonts         1.0.4 2022-02-11 CRAN (R 4.3.0)
       textshaping         0.3.6 2021-10-13 CRAN (R 4.3.0)
            tibble         3.2.1 2023-03-20 CRAN (R 4.3.0)
             tidyr         1.3.0 2023-01-24 CRAN (R 4.3.0)
        tidyselect         1.2.0 2022-10-10 CRAN (R 4.3.0)
        timechange         0.2.0 2023-01-11 CRAN (R 4.3.0)
              utf8         1.2.3 2023-01-31 CRAN (R 4.3.0)
             vctrs         0.6.2 2023-04-19 CRAN (R 4.3.0)
             withr         2.5.0 2022-03-03 CRAN (R 4.3.0)
              xfun          0.39 2023-04-20 CRAN (R 4.3.0)
            xtable         1.8-4 2019-04-21 CRAN (R 4.3.0)
              yaml         2.3.7 2023-01-23 CRAN (R 4.3.0)
