Setup
library(janitor)
janitor: Clean dirty data, plus improved tables and crosstabs
June 25, 2023
janitor contains various tools for examining and cleaning dirty data.
Let’s create a df with some poorly-chosen column names:
clean_names() does just as the name implies:
first_name | abc | percent_successful_2009 | repeat_value | repeat_value_2 | x |
---|---|---|---|---|---|
NA | NA | NA | NA | NA | NA |
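The code that built this example is not shown above. A minimal sketch that yields the cleaned names in the table, assuming raw column names modeled on the ones in janitor's own documentation (the object names messy_df and cleaned_df, and the exact raw names, are illustrative assumptions rather than the original code):

```r
library(janitor)

# Hypothetical messy data frame: one row of NAs and six badly named columns,
# including a duplicated name and an empty one.
messy_df <- as.data.frame(matrix(ncol = 6))
names(messy_df) <- c("firstName", "abc@!*", "% successful (2009)",
                     "REPEAT VALUE", "REPEAT VALUE", "")

# clean_names() converts names to snake_case, strips symbols,
# and de-duplicates repeated names.
cleaned_df <- clean_names(messy_df)
names(cleaned_df)
```

The duplicated name gets a numeric suffix and the empty name becomes x, which is where repeat_value_2 and x in the table come from.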
The case argument to clean_names() specifies what case you’d like output names to be in. You can specify any case style that’s available in snakecase::to_any_case(), including “screaming_snake” if you want to be perverse:
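As a rough illustration of the case argument, here is a small sketch on a hypothetical two-column data frame (messy_df and shouty_df are names made up for this example):

```r
library(janitor)

# Hypothetical input; check.names = FALSE keeps the space in "REPEAT VALUE".
messy_df <- data.frame(firstName = NA, `REPEAT VALUE` = NA, check.names = FALSE)

# Ask for SCREAMING_SNAKE_CASE instead of the default snake_case.
shouty_df <- clean_names(messy_df, case = "screaming_snake")
names(shouty_df)
```

Any other style accepted by snakecase::to_any_case() can be passed the same way.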
dfs are row-bind-able
Also useful is compare_df_cols(), which summarizes whether the specified dfs can be row-bound (i.e., have columns of the same names/types):
df1 <- data.frame(A = 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0, stringsAsFactors = TRUE) # here, column b is a factor
df3 <- df1 |>
  dplyr::mutate(b = as.character(b))
compare_df_cols(df1, df2, df3)
column_name | df1 | df2 | df3 |
---|---|---|---|
a | NA | integer | NA |
A | integer | NA | integer |
b | character | factor | character |
c | NA | numeric | NA |
If you just want a simple TRUE/FALSE value telling you whether the dfs match, you can use compare_df_cols_same():
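A short sketch of how that check might look, rebuilding the same three data frames with base R so the block runs on its own (this assumes R 4.0 or later, where data.frame() does not convert strings to factors by default; the result names same_1_3 and same_1_2 are illustrative):

```r
library(janitor)

# Same data frames as in the compare_df_cols() example.
df1 <- data.frame(A = 1:2, b = c("big", "small"))
df2 <- data.frame(a = 10:12, b = c("medium", "small", "big"), c = 0,
                  stringsAsFactors = TRUE)  # column b is a factor here
df3 <- df1

# verbose = FALSE suppresses the printed table of mismatched columns.
same_1_3 <- compare_df_cols_same(df1, df3, verbose = FALSE)  # identical columns
same_1_2 <- compare_df_cols_same(df1, df2, verbose = FALSE)  # b: character vs. factor

same_1_3
same_1_2
```

With the default bind_method = "bind_rows", a column that is missing from one data frame is not treated as a mismatch, so the FALSE for df1 and df2 comes from column b's differing types rather than from A, a, or c.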
janitor’s version of a table is called a tabyl. You can easily generate crosstabs:
species | Biscoe | Dream | Torgersen |
---|---|---|---|
Adelie | 44 | 56 | 52 |
Chinstrap | 0 | 68 | 0 |
Gentoo | 124 | 0 | 0 |
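The call behind this crosstab is not shown above. Judging from the pipeline used later in the post, it is presumably along these lines (the object name island_by_species is an assumption; the palmerpenguins package must be installed):

```r
library(janitor)

# Two-way tabyl: one row per species, one column per island, cells are counts.
island_by_species <- palmerpenguins::penguins |>
  tabyl(species, island)

island_by_species
```

The first variable passed to tabyl() becomes the rows and the second becomes the columns.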
There are lots of ways to pretty up the output via adorn_* commands, giving things like column or row percentages, optionally with ns in parentheses:
palmerpenguins::penguins |>
  tabyl(species, island) |>
  adorn_totals("col") |>
  adorn_percentages("row") |>
  adorn_pct_formatting(digits = 2) |>
  adorn_ns()
species | Biscoe | Dream | Torgersen | Total |
---|---|---|---|---|
Adelie | 28.95% (44) | 36.84% (56) | 34.21% (52) | 100.00% (152) |
Chinstrap | 0.00% (0) | 100.00% (68) | 0.00% (0) | 100.00% (68) |
Gentoo | 100.00% (124) | 0.00% (0) | 0.00% (0) | 100.00% (124) |
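The example above uses row percentages. For comparison, a sketch of the column-percentage variant, where each island's column is split across species instead (col_shares is a name invented for this example):

```r
library(janitor)

# Column percentages: each island column sums to 1 before formatting.
col_shares <- palmerpenguins::penguins |>
  tabyl(species, island) |>
  adorn_percentages("col")

# Format as percentages and append the underlying counts in parentheses.
col_shares |>
  adorn_pct_formatting(digits = 1) |>
  adorn_ns()
```

Keeping the unformatted col_shares around is handy because adorn_pct_formatting() turns the numeric columns into character strings.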
You can also use the adorn_* functions on regular ol’ dfs:
species | island | bill_length_mm | bill_depth_mm | flipper_length_mm | body_mass_g | sex |
---|---|---|---|---|---|---|
Chinstrap | Dream | 45.7 | 17.0 | 195 | 3650 | female |
Adelie | Torgersen | 35.9 | 16.6 | 190 | 3050 | female |
Chinstrap | Dream | 45.5 | 17.0 | 196 | 3500 | female |
Gentoo | Biscoe | 49.4 | 15.8 | 216 | 4925 | male |
Adelie | Dream | 39.7 | 17.9 | 193 | 4250 | male |
Chinstrap | Dream | 49.5 | 19.0 | 200 | 3800 | male |
Gentoo | Biscoe | 46.6 | 14.2 | 210 | 4850 | female |
Gentoo | Biscoe | 50.5 | 15.9 | 222 | 5550 | male |
Gentoo | Biscoe | 50.0 | 15.2 | 218 | 5700 | male |
Gentoo | Biscoe | 48.2 | 14.3 | 210 | 4600 | female |
Total | - | 461.0 | 162.9 | 2050 | 43875 | - |
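The code for this table is not shown above. The rows look like a random sample of ten penguins with the year column dropped, followed by adorn_totals(). A reproducible sketch under that assumption, using the first ten complete rows instead of a random sample (first_ten and with_totals are illustrative names, and the rows will differ from the table above):

```r
library(janitor)

penguins <- palmerpenguins::penguins

# Drop the year column and rows with missing values, then keep ten rows.
# Assumption: the original used a random sample; head() keeps this reproducible.
keep_cols <- setdiff(names(penguins), "year")
first_ten <- head(penguins[complete.cases(penguins), keep_cols], 10)

# adorn_totals() appends a "Total" row: numeric columns are summed,
# non-numeric columns are filled with "-".
with_totals <- adorn_totals(first_ten)
with_totals
```

Summing measurements such as bill length is not especially meaningful here; the point is only that adorn_totals() accepts an ordinary data frame, not just a tabyl.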
Check out the tabyls vignette for more info.
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.4.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2023-07-08
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package loadedversion date source
cli 3.6.1 2023-03-23 CRAN (R 4.3.0)
codetools 0.2-19 2023-02-01 CRAN (R 4.3.1)
crayon 1.5.2 2022-09-29 CRAN (R 4.3.0)
crul 1.4.0 2023-05-17 CRAN (R 4.3.0)
curl 5.0.1 2023-06-07 CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 CRAN (R 4.3.0)
dplyr 1.1.2 2023-04-20 CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 CRAN (R 4.3.0)
fontBitstreamVera 0.1.1 2017-02-01 CRAN (R 4.3.0)
fontLiberation 0.1.0 2016-10-15 CRAN (R 4.3.0)
fontquiver 0.2.1 2017-02-01 CRAN (R 4.3.0)
gdtools 0.3.3 2023-03-27 CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 CRAN (R 4.3.0)
gfonts 0.2.0 2023-01-08 CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 CRAN (R 4.3.0)
httpcode 0.3.0 2020-04-10 CRAN (R 4.3.0)
httpuv 1.6.11 2023-05-11 CRAN (R 4.3.0)
janitor 2.2.0 2023-02-02 CRAN (R 4.3.0)
jsonlite 1.8.7 2022-12-06 CRAN (R 4.3.0)
knitr 1.43 2023-05-25 CRAN (R 4.3.0)
later 1.3.1 2023-05-02 CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 CRAN (R 4.3.0)
lubridate 1.9.2 2023-02-10 CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 CRAN (R 4.3.0)
mime 0.12 2021-09-28 CRAN (R 4.3.0)
palmerpenguins 0.1.1 2022-08-15 CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 CRAN (R 4.3.0)
ragg 1.2.5 2023-01-12 CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 CRAN (R 4.3.0)
renv 0.17.3 2023-04-06 CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 CRAN (R 4.3.0)
rmarkdown 2.22 2023-03-26 CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 CRAN (R 4.3.0)
snakecase 0.11.0 2019-05-25 CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 CRAN (R 4.3.0)
stringr 1.5.0 2022-12-02 CRAN (R 4.3.0)
systemfonts 1.0.4 2022-02-11 CRAN (R 4.3.0)
textshaping 0.3.6 2021-10-13 CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 CRAN (R 4.3.0)
tidyr 1.3.0 2023-01-24 CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 CRAN (R 4.3.0)
vctrs 0.6.2 2023-04-19 CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 CRAN (R 4.3.0)
xfun 0.39 2023-04-20 CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 CRAN (R 4.3.0)