Setup
library(tidyverse)fuzzyjoin: Join dfs with inexact matchingJune 26, 2023
dfs by string distanceSometimes you want to join dfs on fuzzy criteria. That’s where fuzzyjoin comes in. Consider this example lifted partially from the documentation:
Set up a df with some typos. We’ll join this to ggplot2::diamonds by cut, which has levels: Fair, Good, Very Good, Premium, Ideal.
A regular inner join finds no matches, due to the typos:
stringdist_inner_join() however will match strings based on any of the methods available in the stringdist package (e.g. Levenshtein distance, Jaccard distance, etc).
fuzzyjoin::stringdist_inner_join(diamonds, d,
by = c(cut = "approximate_name"),
method = "lv",
max_dist = 3,
distance_col = "distance") |>
relocate(approximate_name, .after = "cut") |>
head(10)| carat | cut | approximate_name | color | clarity | depth | table | price | x | y | z | type | distance |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 | 1 |
| 0.21 | Premium | Premiums | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 | 1 |
| 0.21 | Premium | Premioom | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 3 | 2 |
| 0.29 | Premium | Premiums | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 | 1 |
| 0.29 | Premium | Premioom | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 3 | 2 |
| 0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 4 | 1 |
| 0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 5 | 1 |
| 0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 4 | 1 |
| 0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 5 | 1 |
| 0.26 | Very Good | VeryGood | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 | 4 | 1 |
dfs by regex matchesAlso extremely useful are the regex_*_join functions, which join based on whether there is a regex match between columns. (This example also from the documentation)
d <- data.frame(regex_name = c("^Idea", "mium", "Good"), type = 1:3)
fuzzyjoin::regex_inner_join(diamonds, d, by = c(cut = "regex_name")) |>
relocate(regex_name, .after = cut) |>
head()| carat | cut | regex_name | color | clarity | depth | table | price | x | y | z | type |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | ^Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 |
| 0.21 | Premium | mium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 |
| 0.23 | Good | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 | 3 |
| 0.29 | Premium | mium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 |
| 0.31 | Good | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 | 3 |
| 0.24 | Very Good | Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 3 |
dfs by absolute difference between columnsSay you want to join by two columns so long as they are equal within a certain degree of tolerance. Here we join so long as sepal_lengths$Sepal.Length is within 0.5 of the iris$Sepal.Length:
sepal_lengths <- data.frame(Sepal.Length = c(5, 6, 7), Type = 1:3)
fuzzyjoin::difference_inner_join(iris, sepal_lengths,
max_dist = 0.5,
distance_col = "distance") |>
relocate(Sepal.Length.x, .after = Sepal.Length.y) |>
head()Joining by: "Sepal.Length"
| Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.y | Sepal.Length.x | Type | distance |
|---|---|---|---|---|---|---|---|
| 3.5 | 1.4 | 0.2 | setosa | 5 | 5.1 | 1 | 0.1 |
| 3.0 | 1.4 | 0.2 | setosa | 5 | 4.9 | 1 | 0.1 |
| 3.2 | 1.3 | 0.2 | setosa | 5 | 4.7 | 1 | 0.3 |
| 3.1 | 1.5 | 0.2 | setosa | 5 | 4.6 | 1 | 0.4 |
| 3.6 | 1.4 | 0.2 | setosa | 5 | 5.0 | 1 | 0.0 |
| 3.9 | 1.7 | 0.4 | setosa | 5 | 5.4 | 1 | 0.4 |
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.3.1 (2023-06-16)
os macOS Ventura 13.4.1
system aarch64, darwin20
ui X11
language (EN)
collate en_US.UTF-8
ctype en_US.UTF-8
tz America/Chicago
date 2023-07-08
pandoc 3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)
─ Packages ───────────────────────────────────────────────────────────────────
package loadedversion date source
cli 3.6.1 2023-03-23 CRAN (R 4.3.0)
codetools 0.2-19 2023-02-01 CRAN (R 4.3.1)
colorspace 2.1-0 2023-01-23 CRAN (R 4.3.0)
crayon 1.5.2 2022-09-29 CRAN (R 4.3.0)
crul 1.3 2022-09-03 CRAN (R 4.3.1)
curl 5.0.0 2023-01-12 CRAN (R 4.3.0)
digest 0.6.31 2022-12-11 CRAN (R 4.3.0)
dplyr 1.1.2 2023-04-20 CRAN (R 4.3.0)
ellipsis 0.3.2 2021-04-29 CRAN (R 4.3.0)
evaluate 0.21 2023-05-05 CRAN (R 4.3.0)
fansi 1.0.4 2023-01-22 CRAN (R 4.3.0)
fastmap 1.1.1 2023-02-24 CRAN (R 4.3.0)
fontBitstreamVera 0.1.1 2017-02-01 CRAN (R 4.3.0)
fontLiberation 0.1.0 2016-10-15 CRAN (R 4.3.0)
fontquiver 0.2.1 2017-02-01 CRAN (R 4.3.0)
forcats 1.0.0 2023-01-29 CRAN (R 4.3.0)
fuzzyjoin 0.1.6 2020-05-15 CRAN (R 4.3.0)
gdtools 0.3.3 2023-03-27 CRAN (R 4.3.0)
generics 0.1.3 2022-07-05 CRAN (R 4.3.0)
gfonts 0.2.0 2023-01-08 CRAN (R 4.3.0)
ggplot2 3.4.2 2023-04-03 CRAN (R 4.3.0)
glue 1.6.2 2022-02-24 CRAN (R 4.3.0)
gtable 0.3.3 2023-03-21 CRAN (R 4.3.0)
hms 1.1.3 2023-03-21 CRAN (R 4.3.0)
htmltools 0.5.5 2023-03-23 CRAN (R 4.3.0)
htmlwidgets 1.6.2 2023-03-17 CRAN (R 4.3.0)
httpcode 0.3.0 2020-04-10 CRAN (R 4.3.0)
httpuv 1.6.11 2023-05-11 CRAN (R 4.3.0)
jsonlite 1.8.7 2022-12-06 CRAN (R 4.3.0)
knitr 1.43 2023-05-25 CRAN (R 4.3.0)
later 1.3.1 2023-05-02 CRAN (R 4.3.0)
lifecycle 1.0.3 2022-10-07 CRAN (R 4.3.0)
lubridate 1.9.2 2023-02-10 CRAN (R 4.3.0)
magrittr 2.0.3 2022-03-30 CRAN (R 4.3.0)
mime 0.12 2021-09-28 CRAN (R 4.3.0)
munsell 0.5.0 2018-06-12 CRAN (R 4.3.0)
pillar 1.9.0 2023-03-22 CRAN (R 4.3.0)
pkgconfig 2.0.3 2019-09-22 CRAN (R 4.3.0)
promises 1.2.0.1 2021-02-11 CRAN (R 4.3.0)
purrr 1.0.1 2023-01-10 CRAN (R 4.3.0)
R6 2.5.1 2021-08-19 CRAN (R 4.3.0)
ragg 1.2.5 2023-01-12 CRAN (R 4.3.0)
Rcpp 1.0.10 2023-01-22 CRAN (R 4.3.0)
readr 2.1.4 2023-02-10 CRAN (R 4.3.0)
renv 0.17.3 2023-04-06 CRAN (R 4.3.0)
rlang 1.1.1 2023-04-28 CRAN (R 4.3.0)
rmarkdown 2.22 2023-03-26 CRAN (R 4.3.0)
rstudioapi 0.14 2022-08-22 CRAN (R 4.3.0)
scales 1.2.1 2022-08-20 CRAN (R 4.3.0)
sessioninfo 1.2.2 2021-12-06 CRAN (R 4.3.0)
shiny 1.7.4 2022-12-15 CRAN (R 4.3.0)
stringdist 0.9.10 2022-11-07 CRAN (R 4.3.0)
stringi 1.7.12 2023-01-11 CRAN (R 4.3.0)
stringr 1.5.0 2022-12-02 CRAN (R 4.3.0)
systemfonts 1.0.4 2022-02-11 CRAN (R 4.3.0)
textshaping 0.3.6 2021-10-13 CRAN (R 4.3.0)
tibble 3.2.1 2023-03-20 CRAN (R 4.3.0)
tidyr 1.3.0 2023-01-24 CRAN (R 4.3.0)
tidyselect 1.2.0 2022-10-10 CRAN (R 4.3.0)
tidyverse 2.0.0 2023-02-22 CRAN (R 4.3.0)
timechange 0.2.0 2023-01-11 CRAN (R 4.3.0)
tzdb 0.4.0 2022-03-28 CRAN (R 4.3.0)
utf8 1.2.3 2023-01-31 CRAN (R 4.3.0)
vctrs 0.6.3 2023-04-19 CRAN (R 4.3.0)
withr 2.5.0 2022-03-03 CRAN (R 4.3.0)
xfun 0.39 2023-04-20 CRAN (R 4.3.0)
xtable 1.8-4 2019-04-21 CRAN (R 4.3.0)
yaml 2.3.7 2023-01-23 CRAN (R 4.3.0)