fuzzyjoin: Join dfs with inexact matching

dataframes
joins
Author
Published

June 26, 2023

Setup
library(tidyverse)

Join dfs by string distance

Sometimes you want to join dfs on fuzzy criteria. That’s where fuzzyjoin comes in. Consider this example lifted partially from the documentation:

Set up a df with some typos. We’ll join this to ggplot2::diamonds by cut, which has levels: Fair, Good, Very Good, Premium, Ideal.

# Create a dataframe with some typos
d <- data.frame(approximate_name = c("Idea", "Premiums", "Premioom",
                                     "VeryGood", "VeryGood", "Faiir"),
                type = 1:6)

A regular inner join finds no matches, due to the typos:

inner_join(diamonds, d, by = c(cut = "approximate_name")) |> 
  nrow()
[1] 0

stringdist_inner_join() however will match strings based on any of the methods available in the stringdist package (e.g. Levenshtein distance, Jaccard distance, etc).

fuzzyjoin::stringdist_inner_join(diamonds, d,
                                 by = c(cut = "approximate_name"),
                                 method = "lv",
                                 max_dist = 3,
                                 distance_col = "distance") |> 
  relocate(approximate_name, .after = "cut") |>
  head(10)
1
Levenshtein distance
2
Move these columns closer together for easier visual comparison
carat cut approximate_name color clarity depth table price x y z type distance
0.23 Ideal Idea E SI2 61.5 55 326 3.95 3.98 2.43 1 1
0.21 Premium Premiums E SI1 59.8 61 326 3.89 3.84 2.31 2 1
0.21 Premium Premioom E SI1 59.8 61 326 3.89 3.84 2.31 3 2
0.29 Premium Premiums I VS2 62.4 58 334 4.20 4.23 2.63 2 1
0.29 Premium Premioom I VS2 62.4 58 334 4.20 4.23 2.63 3 2
0.24 Very Good VeryGood J VVS2 62.8 57 336 3.94 3.96 2.48 4 1
0.24 Very Good VeryGood J VVS2 62.8 57 336 3.94 3.96 2.48 5 1
0.24 Very Good VeryGood I VVS1 62.3 57 336 3.95 3.98 2.47 4 1
0.24 Very Good VeryGood I VVS1 62.3 57 336 3.95 3.98 2.47 5 1
0.26 Very Good VeryGood H SI1 61.9 55 337 4.07 4.11 2.53 4 1

Join dfs by regex matches

Also extremely useful are the regex_*_join functions, which join based on whether there is a regex match between columns. (This example also from the documentation)

d <- data.frame(regex_name = c("^Idea", "mium", "Good"), type = 1:3)

fuzzyjoin::regex_inner_join(diamonds, d, by = c(cut = "regex_name")) |> 
  relocate(regex_name, .after = cut) |> 
  head()
carat cut regex_name color clarity depth table price x y z type
0.23 Ideal ^Idea E SI2 61.5 55 326 3.95 3.98 2.43 1
0.21 Premium mium E SI1 59.8 61 326 3.89 3.84 2.31 2
0.23 Good Good E VS1 56.9 65 327 4.05 4.07 2.31 3
0.29 Premium mium I VS2 62.4 58 334 4.20 4.23 2.63 2
0.31 Good Good J SI2 63.3 58 335 4.34 4.35 2.75 3
0.24 Very Good Good J VVS2 62.8 57 336 3.94 3.96 2.48 3

Join dfs by absolute difference between columns

Say you want to join by two columns so long as they are equal within a certain degree of tolerance. Here we join so long as sepal_lengths$Sepal.Length is within 0.5 of the iris$Sepal.Length:

sepal_lengths <- data.frame(Sepal.Length = c(5, 6, 7), Type = 1:3)

fuzzyjoin::difference_inner_join(iris, sepal_lengths,
                                 max_dist = 0.5,
                                 distance_col = "distance") |> 
  relocate(Sepal.Length.x, .after = Sepal.Length.y) |> 
  head()
Joining by: "Sepal.Length"
Sepal.Width Petal.Length Petal.Width Species Sepal.Length.y Sepal.Length.x Type distance
3.5 1.4 0.2 setosa 5 5.1 1 0.1
3.0 1.4 0.2 setosa 5 4.9 1 0.1
3.2 1.3 0.2 setosa 5 4.7 1 0.3
3.1 1.5 0.2 setosa 5 4.6 1 0.4
3.6 1.4 0.2 setosa 5 5.0 1 0.0
3.9 1.7 0.4 setosa 5 5.4 1 0.4