Setup
library(tidyverse)fuzzyjoin: Join dfs with inexact matchingJune 26, 2023
dfs by string distanceSometimes you want to join dfs on fuzzy criteria. That’s where fuzzyjoin comes in. Consider this example lifted partially from the documentation:
Set up a df with some typos. We’ll join this to ggplot2::diamonds by cut, which has levels: Fair, Good, Very Good, Premium, Ideal.
A regular inner join finds no matches, due to the typos:
stringdist_inner_join() however will match strings based on any of the methods available in the stringdist package (e.g. Levenshtein distance, Jaccard distance, etc).
fuzzyjoin::stringdist_inner_join(diamonds, d,
                                 by = c(cut = "approximate_name"),
                                 method = "lv",
                                 max_dist = 3,
                                 distance_col = "distance") |> 
  relocate(approximate_name, .after = "cut") |>
  head(10)| carat | cut | approximate_name | color | clarity | depth | table | price | x | y | z | type | distance | 
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 | 1 | 
| 0.21 | Premium | Premiums | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 | 1 | 
| 0.21 | Premium | Premioom | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 3 | 2 | 
| 0.29 | Premium | Premiums | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 | 1 | 
| 0.29 | Premium | Premioom | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 3 | 2 | 
| 0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 4 | 1 | 
| 0.24 | Very Good | VeryGood | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 5 | 1 | 
| 0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 4 | 1 | 
| 0.24 | Very Good | VeryGood | I | VVS1 | 62.3 | 57 | 336 | 3.95 | 3.98 | 2.47 | 5 | 1 | 
| 0.26 | Very Good | VeryGood | H | SI1 | 61.9 | 55 | 337 | 4.07 | 4.11 | 2.53 | 4 | 1 | 
dfs by regex matchesAlso extremely useful are the regex_*_join functions, which join based on whether there is a regex match between columns. (This example also from the documentation)
d <- data.frame(regex_name = c("^Idea", "mium", "Good"), type = 1:3)
fuzzyjoin::regex_inner_join(diamonds, d, by = c(cut = "regex_name")) |> 
  relocate(regex_name, .after = cut) |> 
  head()| carat | cut | regex_name | color | clarity | depth | table | price | x | y | z | type | 
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.23 | Ideal | ^Idea | E | SI2 | 61.5 | 55 | 326 | 3.95 | 3.98 | 2.43 | 1 | 
| 0.21 | Premium | mium | E | SI1 | 59.8 | 61 | 326 | 3.89 | 3.84 | 2.31 | 2 | 
| 0.23 | Good | Good | E | VS1 | 56.9 | 65 | 327 | 4.05 | 4.07 | 2.31 | 3 | 
| 0.29 | Premium | mium | I | VS2 | 62.4 | 58 | 334 | 4.20 | 4.23 | 2.63 | 2 | 
| 0.31 | Good | Good | J | SI2 | 63.3 | 58 | 335 | 4.34 | 4.35 | 2.75 | 3 | 
| 0.24 | Very Good | Good | J | VVS2 | 62.8 | 57 | 336 | 3.94 | 3.96 | 2.48 | 3 | 
dfs by absolute difference between columnsSay you want to join by two columns so long as they are equal within a certain degree of tolerance. Here we join so long as sepal_lengths$Sepal.Length is within 0.5 of the iris$Sepal.Length:
sepal_lengths <- data.frame(Sepal.Length = c(5, 6, 7), Type = 1:3)
fuzzyjoin::difference_inner_join(iris, sepal_lengths,
                                 max_dist = 0.5,
                                 distance_col = "distance") |> 
  relocate(Sepal.Length.x, .after = Sepal.Length.y) |> 
  head()Joining by: "Sepal.Length"| Sepal.Width | Petal.Length | Petal.Width | Species | Sepal.Length.y | Sepal.Length.x | Type | distance | 
|---|---|---|---|---|---|---|---|
| 3.5 | 1.4 | 0.2 | setosa | 5 | 5.1 | 1 | 0.1 | 
| 3.0 | 1.4 | 0.2 | setosa | 5 | 4.9 | 1 | 0.1 | 
| 3.2 | 1.3 | 0.2 | setosa | 5 | 4.7 | 1 | 0.3 | 
| 3.1 | 1.5 | 0.2 | setosa | 5 | 4.6 | 1 | 0.4 | 
| 3.6 | 1.4 | 0.2 | setosa | 5 | 5.0 | 1 | 0.0 | 
| 3.9 | 1.7 | 0.4 | setosa | 5 | 5.4 | 1 | 0.4 | 
─ Session info ─────────────────────────────────────────────────────────────── setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2023-07-08
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)─ Packages ───────────────────────────────────────────────────────────────────           package loadedversion       date         source
               cli         3.6.1 2023-03-23 CRAN (R 4.3.0)
         codetools        0.2-19 2023-02-01 CRAN (R 4.3.1)
        colorspace         2.1-0 2023-01-23 CRAN (R 4.3.0)
            crayon         1.5.2 2022-09-29 CRAN (R 4.3.0)
              crul           1.3 2022-09-03 CRAN (R 4.3.1)
              curl         5.0.0 2023-01-12 CRAN (R 4.3.0)
            digest        0.6.31 2022-12-11 CRAN (R 4.3.0)
             dplyr         1.1.2 2023-04-20 CRAN (R 4.3.0)
          ellipsis         0.3.2 2021-04-29 CRAN (R 4.3.0)
          evaluate          0.21 2023-05-05 CRAN (R 4.3.0)
             fansi         1.0.4 2023-01-22 CRAN (R 4.3.0)
           fastmap         1.1.1 2023-02-24 CRAN (R 4.3.0)
 fontBitstreamVera         0.1.1 2017-02-01 CRAN (R 4.3.0)
    fontLiberation         0.1.0 2016-10-15 CRAN (R 4.3.0)
        fontquiver         0.2.1 2017-02-01 CRAN (R 4.3.0)
           forcats         1.0.0 2023-01-29 CRAN (R 4.3.0)
         fuzzyjoin         0.1.6 2020-05-15 CRAN (R 4.3.0)
           gdtools         0.3.3 2023-03-27 CRAN (R 4.3.0)
          generics         0.1.3 2022-07-05 CRAN (R 4.3.0)
            gfonts         0.2.0 2023-01-08 CRAN (R 4.3.0)
           ggplot2         3.4.2 2023-04-03 CRAN (R 4.3.0)
              glue         1.6.2 2022-02-24 CRAN (R 4.3.0)
            gtable         0.3.3 2023-03-21 CRAN (R 4.3.0)
               hms         1.1.3 2023-03-21 CRAN (R 4.3.0)
         htmltools         0.5.5 2023-03-23 CRAN (R 4.3.0)
       htmlwidgets         1.6.2 2023-03-17 CRAN (R 4.3.0)
          httpcode         0.3.0 2020-04-10 CRAN (R 4.3.0)
            httpuv        1.6.11 2023-05-11 CRAN (R 4.3.0)
          jsonlite         1.8.7 2022-12-06 CRAN (R 4.3.0)
             knitr          1.43 2023-05-25 CRAN (R 4.3.0)
             later         1.3.1 2023-05-02 CRAN (R 4.3.0)
         lifecycle         1.0.3 2022-10-07 CRAN (R 4.3.0)
         lubridate         1.9.2 2023-02-10 CRAN (R 4.3.0)
          magrittr         2.0.3 2022-03-30 CRAN (R 4.3.0)
              mime          0.12 2021-09-28 CRAN (R 4.3.0)
           munsell         0.5.0 2018-06-12 CRAN (R 4.3.0)
            pillar         1.9.0 2023-03-22 CRAN (R 4.3.0)
         pkgconfig         2.0.3 2019-09-22 CRAN (R 4.3.0)
          promises       1.2.0.1 2021-02-11 CRAN (R 4.3.0)
             purrr         1.0.1 2023-01-10 CRAN (R 4.3.0)
                R6         2.5.1 2021-08-19 CRAN (R 4.3.0)
              ragg         1.2.5 2023-01-12 CRAN (R 4.3.0)
              Rcpp        1.0.10 2023-01-22 CRAN (R 4.3.0)
             readr         2.1.4 2023-02-10 CRAN (R 4.3.0)
              renv        0.17.3 2023-04-06 CRAN (R 4.3.0)
             rlang         1.1.1 2023-04-28 CRAN (R 4.3.0)
         rmarkdown          2.22 2023-03-26 CRAN (R 4.3.0)
        rstudioapi          0.14 2022-08-22 CRAN (R 4.3.0)
            scales         1.2.1 2022-08-20 CRAN (R 4.3.0)
       sessioninfo         1.2.2 2021-12-06 CRAN (R 4.3.0)
             shiny         1.7.4 2022-12-15 CRAN (R 4.3.0)
        stringdist        0.9.10 2022-11-07 CRAN (R 4.3.0)
           stringi        1.7.12 2023-01-11 CRAN (R 4.3.0)
           stringr         1.5.0 2022-12-02 CRAN (R 4.3.0)
       systemfonts         1.0.4 2022-02-11 CRAN (R 4.3.0)
       textshaping         0.3.6 2021-10-13 CRAN (R 4.3.0)
            tibble         3.2.1 2023-03-20 CRAN (R 4.3.0)
             tidyr         1.3.0 2023-01-24 CRAN (R 4.3.0)
        tidyselect         1.2.0 2022-10-10 CRAN (R 4.3.0)
         tidyverse         2.0.0 2023-02-22 CRAN (R 4.3.0)
        timechange         0.2.0 2023-01-11 CRAN (R 4.3.0)
              tzdb         0.4.0 2022-03-28 CRAN (R 4.3.0)
              utf8         1.2.3 2023-01-31 CRAN (R 4.3.0)
             vctrs         0.6.3 2023-04-19 CRAN (R 4.3.0)
             withr         2.5.0 2022-03-03 CRAN (R 4.3.0)
              xfun          0.39 2023-04-20 CRAN (R 4.3.0)
            xtable         1.8-4 2019-04-21 CRAN (R 4.3.0)
              yaml         2.3.7 2023-01-23 CRAN (R 4.3.0)