`fuzzyjoin`: Join `df`s with inexact matching

dataframes

joins

Author

Published

June 26, 2023

Setup

library(tidyverse)

Join `df`s by string distance

Sometimes you want to join dfs on fuzzy criteria. That’s where fuzzyjoin comes in. Consider this example lifted partially from the documentation:

Set up a df with some typos. We’ll join this to ggplot2::diamonds by cut, which has levels: Fair, Good, Very Good, Premium, Ideal.

# Create a dataframe with some typos
d <- data.frame(approximate_name = c("Idea", "Premiums", "Premioom",
                                     "VeryGood", "VeryGood", "Faiir"),
                type = 1:6)

A regular inner join finds no matches, due to the typos:

inner_join(diamonds, d, by = c(cut = "approximate_name")) |> 
  nrow()

[1] 0

stringdist_inner_join() however will match strings based on any of the methods available in the stringdist package (e.g. Levenshtein distance, Jaccard distance, etc).

fuzzyjoin::stringdist_inner_join(diamonds, d,
                                 by = c(cut = "approximate_name"),
                                 method = "lv",
                                 max_dist = 3,
                                 distance_col = "distance") |> 
  relocate(approximate_name, .after = "cut") |>
  head(10)

1: Levenshtein distance
2: Move these columns closer together for easier visual comparison

carat	cut	approximate_name	color	clarity	depth	table	price	x	y	z	type	distance
0.23	Ideal	Idea	E	SI2	61.5	55	326	3.95	3.98	2.43	1	1
0.21	Premium	Premiums	E	SI1	59.8	61	326	3.89	3.84	2.31	2	1
0.21	Premium	Premioom	E	SI1	59.8	61	326	3.89	3.84	2.31	3	2
0.29	Premium	Premiums	I	VS2	62.4	58	334	4.20	4.23	2.63	2	1
0.29	Premium	Premioom	I	VS2	62.4	58	334	4.20	4.23	2.63	3	2
0.24	Very Good	VeryGood	J	VVS2	62.8	57	336	3.94	3.96	2.48	4	1
0.24	Very Good	VeryGood	J	VVS2	62.8	57	336	3.94	3.96	2.48	5	1
0.24	Very Good	VeryGood	I	VVS1	62.3	57	336	3.95	3.98	2.47	4	1
0.24	Very Good	VeryGood	I	VVS1	62.3	57	336	3.95	3.98	2.47	5	1
0.26	Very Good	VeryGood	H	SI1	61.9	55	337	4.07	4.11	2.53	4	1

Join `df`s by regex matches

Also extremely useful are the regex_*_join functions, which join based on whether there is a regex match between columns. (This example also from the documentation)

d <- data.frame(regex_name = c("^Idea", "mium", "Good"), type = 1:3)

fuzzyjoin::regex_inner_join(diamonds, d, by = c(cut = "regex_name")) |> 
  relocate(regex_name, .after = cut) |> 
  head()

carat	cut	regex_name	color	clarity	depth	table	price	x	y	z	type
0.23	Ideal	^Idea	E	SI2	61.5	55	326	3.95	3.98	2.43	1
0.21	Premium	mium	E	SI1	59.8	61	326	3.89	3.84	2.31	2
0.23	Good	Good	E	VS1	56.9	65	327	4.05	4.07	2.31	3
0.29	Premium	mium	I	VS2	62.4	58	334	4.20	4.23	2.63	2
0.31	Good	Good	J	SI2	63.3	58	335	4.34	4.35	2.75	3
0.24	Very Good	Good	J	VVS2	62.8	57	336	3.94	3.96	2.48	3

Join `df`s by absolute difference between columns

Say you want to join by two columns so long as they are equal within a certain degree of tolerance. Here we join so long as sepal_lengths$Sepal.Length is within 0.5 of the iris$Sepal.Length:

sepal_lengths <- data.frame(Sepal.Length = c(5, 6, 7), Type = 1:3)

fuzzyjoin::difference_inner_join(iris, sepal_lengths,
                                 max_dist = 0.5,
                                 distance_col = "distance") |> 
  relocate(Sepal.Length.x, .after = Sepal.Length.y) |> 
  head()

Joining by: "Sepal.Length"

Sepal.Width	Petal.Length	Petal.Width	Species	Sepal.Length.y	Sepal.Length.x	Type	distance
3.5	1.4	0.2	setosa	5	5.1	1	0.1
3.0	1.4	0.2	setosa	5	4.9	1	0.1
3.2	1.3	0.2	setosa	5	4.7	1	0.3
3.1	1.5	0.2	setosa	5	4.6	1	0.4
3.6	1.4	0.2	setosa	5	5.0	1	0.0
3.9	1.7	0.4	setosa	5	5.4	1	0.4

Session info and package versions

─ Session info ───────────────────────────────────────────────────────────────

 setting  value
 version  R version 4.3.1 (2023-06-16)
 os       macOS Ventura 13.4.1
 system   aarch64, darwin20
 ui       X11
 language (EN)
 collate  en_US.UTF-8
 ctype    en_US.UTF-8
 tz       America/Chicago
 date     2023-07-08
 pandoc   3.1.1 @ /Applications/RStudio.app/Contents/Resources/app/quarto/bin/tools/ (via rmarkdown)

─ Packages ───────────────────────────────────────────────────────────────────

           package loadedversion       date         source
               cli         3.6.1 2023-03-23 CRAN (R 4.3.0)
         codetools        0.2-19 2023-02-01 CRAN (R 4.3.1)
        colorspace         2.1-0 2023-01-23 CRAN (R 4.3.0)
            crayon         1.5.2 2022-09-29 CRAN (R 4.3.0)
              crul           1.3 2022-09-03 CRAN (R 4.3.1)
              curl         5.0.0 2023-01-12 CRAN (R 4.3.0)
            digest        0.6.31 2022-12-11 CRAN (R 4.3.0)
             dplyr         1.1.2 2023-04-20 CRAN (R 4.3.0)
          ellipsis         0.3.2 2021-04-29 CRAN (R 4.3.0)
          evaluate          0.21 2023-05-05 CRAN (R 4.3.0)
             fansi         1.0.4 2023-01-22 CRAN (R 4.3.0)
           fastmap         1.1.1 2023-02-24 CRAN (R 4.3.0)
 fontBitstreamVera         0.1.1 2017-02-01 CRAN (R 4.3.0)
    fontLiberation         0.1.0 2016-10-15 CRAN (R 4.3.0)
        fontquiver         0.2.1 2017-02-01 CRAN (R 4.3.0)
           forcats         1.0.0 2023-01-29 CRAN (R 4.3.0)
         fuzzyjoin         0.1.6 2020-05-15 CRAN (R 4.3.0)
           gdtools         0.3.3 2023-03-27 CRAN (R 4.3.0)
          generics         0.1.3 2022-07-05 CRAN (R 4.3.0)
            gfonts         0.2.0 2023-01-08 CRAN (R 4.3.0)
           ggplot2         3.4.2 2023-04-03 CRAN (R 4.3.0)
              glue         1.6.2 2022-02-24 CRAN (R 4.3.0)
            gtable         0.3.3 2023-03-21 CRAN (R 4.3.0)
               hms         1.1.3 2023-03-21 CRAN (R 4.3.0)
         htmltools         0.5.5 2023-03-23 CRAN (R 4.3.0)
       htmlwidgets         1.6.2 2023-03-17 CRAN (R 4.3.0)
          httpcode         0.3.0 2020-04-10 CRAN (R 4.3.0)
            httpuv        1.6.11 2023-05-11 CRAN (R 4.3.0)
          jsonlite         1.8.7 2022-12-06 CRAN (R 4.3.0)
             knitr          1.43 2023-05-25 CRAN (R 4.3.0)
             later         1.3.1 2023-05-02 CRAN (R 4.3.0)
         lifecycle         1.0.3 2022-10-07 CRAN (R 4.3.0)
         lubridate         1.9.2 2023-02-10 CRAN (R 4.3.0)
          magrittr         2.0.3 2022-03-30 CRAN (R 4.3.0)
              mime          0.12 2021-09-28 CRAN (R 4.3.0)
           munsell         0.5.0 2018-06-12 CRAN (R 4.3.0)
            pillar         1.9.0 2023-03-22 CRAN (R 4.3.0)
         pkgconfig         2.0.3 2019-09-22 CRAN (R 4.3.0)
          promises       1.2.0.1 2021-02-11 CRAN (R 4.3.0)
             purrr         1.0.1 2023-01-10 CRAN (R 4.3.0)
                R6         2.5.1 2021-08-19 CRAN (R 4.3.0)
              ragg         1.2.5 2023-01-12 CRAN (R 4.3.0)
              Rcpp        1.0.10 2023-01-22 CRAN (R 4.3.0)
             readr         2.1.4 2023-02-10 CRAN (R 4.3.0)
              renv        0.17.3 2023-04-06 CRAN (R 4.3.0)
             rlang         1.1.1 2023-04-28 CRAN (R 4.3.0)
         rmarkdown          2.22 2023-03-26 CRAN (R 4.3.0)
        rstudioapi          0.14 2022-08-22 CRAN (R 4.3.0)
            scales         1.2.1 2022-08-20 CRAN (R 4.3.0)
       sessioninfo         1.2.2 2021-12-06 CRAN (R 4.3.0)
             shiny         1.7.4 2022-12-15 CRAN (R 4.3.0)
        stringdist        0.9.10 2022-11-07 CRAN (R 4.3.0)
           stringi        1.7.12 2023-01-11 CRAN (R 4.3.0)
           stringr         1.5.0 2022-12-02 CRAN (R 4.3.0)
       systemfonts         1.0.4 2022-02-11 CRAN (R 4.3.0)
       textshaping         0.3.6 2021-10-13 CRAN (R 4.3.0)
            tibble         3.2.1 2023-03-20 CRAN (R 4.3.0)
             tidyr         1.3.0 2023-01-24 CRAN (R 4.3.0)
        tidyselect         1.2.0 2022-10-10 CRAN (R 4.3.0)
         tidyverse         2.0.0 2023-02-22 CRAN (R 4.3.0)
        timechange         0.2.0 2023-01-11 CRAN (R 4.3.0)
              tzdb         0.4.0 2022-03-28 CRAN (R 4.3.0)
              utf8         1.2.3 2023-01-31 CRAN (R 4.3.0)
             vctrs         0.6.3 2023-04-19 CRAN (R 4.3.0)
             withr         2.5.0 2022-03-03 CRAN (R 4.3.0)
              xfun          0.39 2023-04-20 CRAN (R 4.3.0)
            xtable         1.8-4 2019-04-21 CRAN (R 4.3.0)
              yaml         2.3.7 2023-01-23 CRAN (R 4.3.0)

Join dfs by string distance

Join dfs by regex matches

Join dfs by absolute difference between columns

Join `df`s by string distance

Join `df`s by regex matches

Join `df`s by absolute difference between columns