janitor: Clean dirty data, plus improved tables and crosstab

data cleaning
Author
Published

June 25, 2023

Setup
library(janitor)

janitor contains various tools for examining and cleaning dirty data.

Cleaning dirty data

Clean column names

Let’s create a df with some poorly-chosen column names:

test_df <- as.data.frame(matrix(ncol = 6))
names(test_df) <- c("firstName", "ábc@!*", "% successful (2009)",
                    "REPEAT VALUE", "REPEAT VALUE", "")

clean_names() does just as the name implies:

test_df |> 
  clean_names()
first_name abc percent_successful_2009 repeat_value repeat_value_2 x
NA NA NA NA NA NA

The case argument to clean_names() specifies what case you’d like output names to be in. You can specify any case style that’s available in snakecase::to_any_case(), including “screaming_snake” if you want to be perverse:

test_df |> 
  clean_names(case = "screaming_snake")
FIRST_NAME ABC PERCENT_SUCCESSFUL_2009 REPEAT_VALUE REPEAT_VALUE_2 X
NA NA NA NA NA NA

Check if dfs are row-bind-able

Also useful is compare_df_cols which summarizes whether the specified dfs can be row-bound (i.e., have columns of the same names/types):

df1 <- data.frame(A= 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = TRUE) # here, column b is a factor
df3 <- df1 |> 
  dplyr::mutate(b = as.character(b))

compare_df_cols(df1, df2, df3)
column_name df1 df2 df3
a NA integer NA
A integer NA integer
b character factor character
c NA numeric NA

If you just want a simple TRUE/FALSE value telling you whether the dfs match, you can use compare_df_cols_same():

str(compare_df_cols_same(df1, df2, df3, verbose = FALSE))
 logi FALSE

Examining data and crosstabs

janitor’s version of tables are called tabyls. You can easily generate crosstabs:

palmerpenguins::penguins |> 
  tabyl(species, island)
species Biscoe Dream Torgersen
Adelie 44 56 52
Chinstrap 0 68 0
Gentoo 124 0 0

There are lots of ways to pretty up the output via adorn_* commands, giving things like column or row percentages, optionally with ns in parentheses:

palmerpenguins::penguins |> 
  tabyl(species, island) |> 
  adorn_totals("col")  |> 
  adorn_percentages("row")  |> 
  adorn_pct_formatting(digits = 2) |> 
  adorn_ns()
species Biscoe Dream Torgersen Total
Adelie 28.95% (44) 36.84% (56) 34.21% (52) 100.00% (152)
Chinstrap 0.00% (0) 100.00% (68) 0.00% (0) 100.00% (68)
Gentoo 100.00% (124) 0.00% (0) 0.00% (0) 100.00% (124)

You can also use the adorn_* functions on regular ol’ dfs:

palmerpenguins::penguins |> 
  dplyr::sample_n(10) |>
  dplyr::select(-year) |> 
  adorn_totals("row")
1
Choose 10 random rows so this doesn’t print forever
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
Chinstrap Dream 45.7 17.0 195 3650 female
Adelie Torgersen 35.9 16.6 190 3050 female
Chinstrap Dream 45.5 17.0 196 3500 female
Gentoo Biscoe 49.4 15.8 216 4925 male
Adelie Dream 39.7 17.9 193 4250 male
Chinstrap Dream 49.5 19.0 200 3800 male
Gentoo Biscoe 46.6 14.2 210 4850 female
Gentoo Biscoe 50.5 15.9 222 5550 male
Gentoo Biscoe 50.0 15.2 218 5700 male
Gentoo Biscoe 48.2 14.3 210 4600 female
Total - 461.0 162.9 2050 43875 -

Check out the tabyls vignette for more info.