Sort numbered strings (or factor levels, or filenames) correctly even without zero-padding

data cleaning
sorting
Published January 12, 2024

Setup
library(tidyverse)
library(naturalsort)

It’s quite common to see lists of items (data, files, etc.) that are numbered, such as the hypothetical list of files below:

filenames <- c("file2.csv", "file1.csv",  "file3.csv",
               "file11.csv", "file10.csv", "file20.csv")
print(filenames)
[1] "file2.csv"  "file1.csv"  "file3.csv"  "file11.csv" "file10.csv"
[6] "file20.csv"

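If you want to sort these by number, you run into a problem. Because the filenames are strings, sort() compares them character by character rather than by numeric value, so 1 is followed by 10 and 11, and only then by 2: “file10.csv” precedes “file2.csv” “alphabetically” because the character 1 comes before 2.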

filenames |> sort()
[1] "file1.csv"  "file10.csv" "file11.csv" "file2.csv"  "file20.csv"
[6] "file3.csv" 

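One solution is to rename your items so that the numbers are zero-padded. A kludge with stringr’s str_replace() and str_pad() can get the job done (passing a function as the replacement requires stringr 1.5.0 or later). Here each number is padded to a width of 2, the length of the longest number in the list. Because of the leading zeros, sort() will give the result you expect: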

padded <- str_replace(filenames, "[0-9]+", \(x) str_pad(x, 2, pad="0"))
padded |> sort()
[1] "file01.csv" "file02.csv" "file03.csv" "file10.csv" "file11.csv"
[6] "file20.csv"
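The width of 2 used above is hard-coded, so it would stop working as soon as a three-digit number appeared. As a minimal sketch, the width can instead be computed from the longest number found. This assumes stringr 1.5.0 or later and that each name contains a single run of digits; more_files, digits, and pad_width are names made up for this example.

```r
library(stringr)

# Hypothetical filenames, including a three-digit number
more_files <- c("file2.csv", "file100.csv", "file10.csv")

# Find the first run of digits in each name and measure the longest one
digits <- str_extract(more_files, "[0-9]+")
pad_width <- max(str_length(digits), na.rm = TRUE)

# Pad every number to that width, then sort as usual
padded_more <- str_replace(more_files, "[0-9]+",
                           \(x) str_pad(x, pad_width, pad = "0"))
sorted_more <- sort(padded_more)
print(sorted_more)
```

With these inputs the numbers are padded to three digits, so the sorted names come out in numeric order.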

If you would rather not rename your items, naturalsort::naturalsort() sorts them in “human natural” order, comparing the numbers inside the strings as numbers:

filenames |> naturalsort::naturalsort()
[1] "file1.csv"  "file2.csv"  "file3.csv"  "file10.csv" "file11.csv"
[6] "file20.csv"
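To make the idea concrete, here is a minimal base R sketch of roughly what a natural sort does for these particular filenames. It is not how naturalsort is implemented, and it only works under the assumption that every name shares the same prefix and contains exactly one run of digits; file_numbers and sorted_files are names made up for this example.

```r
filenames <- c("file2.csv", "file1.csv",  "file3.csv",
               "file11.csv", "file10.csv", "file20.csv")

# Strip every non-digit character, then convert what is left to an integer
file_numbers <- as.integer(gsub("[^0-9]", "", filenames))

# order() gives the positions that sort the numbers; use them to index the names
sorted_files <- filenames[order(file_numbers)]
print(sorted_files)
```

For names with several numeric parts or differing prefixes this shortcut breaks down, which is where a dedicated natural sort earns its keep.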

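The naturalsort package also comes with the function naturalfactor(), which can reorder the levels of an existing factor in the same way, or turn a character vector into a factor. In both cases the values stay in their original positions; only the levels are put in natural order, and the result is an ordered factor: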

my_factor <- factor(c("level_1", "level_10", "level_2"))
naturalsort::naturalfactor(my_factor)
[1] level_1  level_10 level_2 
Levels: level_1 < level_2 < level_10
c("level1", "level10", "level2") |> naturalfactor()
[1] level1  level10 level2 
Levels: level1 < level2 < level10
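One place this helps is sorting a data frame by a numbered column. The sketch below is an illustration only: the scores data frame and its columns are hypothetical, and it relies on naturalfactor() putting the levels in natural order, as in the output above.

```r
library(naturalsort)

# Hypothetical data frame whose `id` column is numbered without zero-padding
scores <- data.frame(
  id    = c("level_10", "level_2", "level_1"),
  score = c(0.3, 0.9, 0.5)
)

# Convert the column to a factor whose levels are in natural order
scores$id <- naturalfactor(scores$id)

# order() on a factor follows the level order, not the alphabetical order
sorted_scores <- scores[order(scores$id), ]
print(sorted_scores)
```

Because the ordering lives in the factor levels, anything that respects level order (sorting, grouping, plot axes) should pick it up without the strings themselves being renamed.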