Setup
library(tidyverse)
fuzzyjoin
: Join df
s with inexact matchingJune 26, 2023
df
s by string distanceSometimes you want to join df
s on fuzzy criteria. That’s where fuzzyjoin
comes in. Consider this example lifted partially from the documentation:
Set up a df
with some typos. We’ll join this to ggplot2::diamonds
by cut
, which has levels: Fair, Good, Very Good, Premium, Ideal.
A regular inner join finds no matches, due to the typos:
stringdist_inner_join()
however will match strings based on any of the methods available in the stringdist
package (e.g. Levenshtein distance, Jaccard distance, etc).
fuzzyjoin::stringdist_inner_join(diamonds, d,
by = c(cut = "approximate_name"),
method = "lv",
max_dist = 3,
distance_col = "distance") |>
relocate(approximate_name, .after = "cut") |>
head(10)
carat | cut | approximate_name | color | clarity | depth | table | price | x | y | z | type | distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 | 1 |
0.21 | Premium | Premiums | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 | 1 |
0.21 | Premium | Premioom | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 3 | 2 |
0.29 | Premium | Premiums | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 | 1 |
0.29 | Premium | Premioom | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 3 | 2 |
0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 4 | 1 |
0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 5 | 1 |
0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 4 | 1 |
0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 5 | 1 |
0.26 | Very Good | VeryGood | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 | 4 | 1 |
df
s by regex matchesAlso extremely useful are the regex_*_join
functions, which join based on whether there is a regex match between columns. (This example also from the documentation)
d <- data.frame(regex_name = c("^Idea", "mium", "Good"), type = 1:3)
fuzzyjoin::regex_inner_join(diamonds, d, by = c(cut = "regex_name")) |>
relocate(regex_name, .after = cut) |>
head()
carat | cut | regex_name | color | clarity | depth | table | price | x | y | z | type |
---|---|---|---|---|---|---|---|---|---|---|---|
0.23 | Ideal | ^Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 |
0.21 | Premium | mium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 |
0.23 | Good | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 | 3 |
0.29 | Premium | mium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 |
0.31 | Good | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 | 3 |
0.24 | Very Good | Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 3 |
df
s by absolute difference between columnsSay you want to join by two columns so long as they are equal within a certain degree of tolerance. Here we join so long as sepal_lengths$Sepal.Length
is within 0.5 of the iris$Sepal.Length
:
sepal_lengths <- data.frame(Sepal.Length = c(5, 6, 7), Type = 1:3)
fuzzyjoin::difference_inner_join(iris, sepal_lengths,
max_dist = 0.5,
distance_col = "distance") |>
relocate(Sepal.Length.x, .after = Sepal.Length.y) |>
head()
Joining by: "Sepal.Length"
Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.y | Sepal.Length.x | Type | distance |
---|---|---|---|---|---|---|---|
3.5 | 1.4 | 0.2 | setosa | 5 | 5.1 | 1 | 0.1 |
3.0 | 1.4 | 0.2 | setosa | 5 | 4.9 | 1 | 0.1 |
3.2 | 1.3 | 0.2 | setosa | 5 | 4.7 | 1 | 0.3 |
3.1 | 1.5 | 0.2 | setosa | 5 | 4.6 | 1 | 0.4 |
3.6 | 1.4 | 0.2 | setosa | 5 | 5.0 | 1 | 0.0 |
3.9 | 1.7 | 0.4 | setosa | 5 | 5.4 | 1 | 0.4 |
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.4.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2023-07-08
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package loadedversion date source
cli 3.6.1 2023-03-23 CRAN (R 4.3.0)
codetools 0.2-19 2023-02-01 CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 CRAN (R 4.3.0)
crul 1.3 2022-09-03 CRAN (R 4.3.1)
curl 5.0.0 2023-01-12 CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 CRAN (R 4.3.0)
dplyr 1.1.2 2023-04-20 CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 CRAN (R 4.3.0)
fontBitstreamVera 0.1.1 2017-02-01 CRAN (R 4.3.0)
fontLiberation 0.1.0 2016-10-15 CRAN (R 4.3.0)
fontquiver 0.2.1 2017-02-01 CRAN (R 4.3.0)
forcats 1.0.0 2023-01-29 CRAN (R 4.3.0)
fuzzyjoin 0.1.6 2020-05-15 CRAN (R 4.3.0)
gdtools 0.3.3 2023-03-27 CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 CRAN (R 4.3.0)
gfonts 0.2.0 2023-01-08 CRAN (R 4.3.0)
ggplot2 3.4.2 2023-04-03 CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 CRAN (R 4.3.0)
httpcode 0.3.0 2020-04-10 CRAN (R 4.3.0)
httpuv 1.6.11 2023-05-11 CRAN (R 4.3.0)
jsonlite 1.8.7 2022-12-06 CRAN (R 4.3.0)
knitr 1.43 2023-05-25 CRAN (R 4.3.0)
later 1.3.1 2023-05-02 CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 CRAN (R 4.3.0)
lubridate 1.9.2 2023-02-10 CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 CRAN (R 4.3.0)
mime 0.12 2021-09-28 CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 CRAN (R 4.3.0)
ragg 1.2.5 2023-01-12 CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 CRAN (R 4.3.0)
readr 2.1.4 2023-02-10 CRAN (R 4.3.0)
renv 0.17.3 2023-04-06 CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 CRAN (R 4.3.0)
rmarkdown 2.22 2023-03-26 CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 CRAN (R 4.3.0)
stringdist 0.9.10 2022-11-07 CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 CRAN (R 4.3.0)
stringr 1.5.0 2022-12-02 CRAN (R 4.3.0)
systemfonts 1.0.4 2022-02-11 CRAN (R 4.3.0)
textshaping 0.3.6 2021-10-13 CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 CRAN (R 4.3.0)
tidyr 1.3.0 2023-01-24 CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 CRAN (R 4.3.0)
tidyverse 2.0.0 2023-02-22 CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 CRAN (R 4.3.0)
tzdb 0.4.0 2022-03-28 CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 CRAN (R 4.3.0)
vctrs 0.6.3 2023-04-19 CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 CRAN (R 4.3.0)
xfun 0.39 2023-04-20 CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 CRAN (R 4.3.0)