Sort numbered strings (or factor levels, or filenames) correctly even without zero-padding

data cleaning

January 12, 2024


It’s quite common to see lists of items (data, files, etc.) that are numbered, such as the hypothetical list of files below:

filenames <- c("file2.csv", "file1.csv",  "file3.csv",
               "file11.csv", "file10.csv", "file20.csv")
[1] "file2.csv"  "file1.csv"  "file3.csv"  "file11.csv" "file10.csv"
[6] "file20.csv"

If you want to sort these by number, you run into a problem: the filenames are strings, so they are compared character by character. 1 is followed by 10, which is followed by 2, since 10 precedes 2 “alphabetically”:

filenames |> sort()
[1] "file1.csv"  "file10.csv" "file11.csv" "file2.csv"  "file20.csv"
[6] "file3.csv" 
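The comparison behind this ordering is easy to check directly. A minimal sketch in base R, where the variable names are only for illustration:

```r
# Compared as strings, "10" comes before "2" because the first
# characters are compared first, and "1" sorts before "2"
as_strings <- "10" < "2"

# Compared as numbers, 10 is of course not less than 2
as_numbers <- 10 < 2

as_strings
as_numbers
```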

One solution is to rename your items such that they are zero-padded. A kludge with stringr’s str_replace() and str_pad() can get the job done. Because of the leading zeros, sort() will get the result you expect:

library(stringr)
padded <- str_replace(filenames, "[0-9]+", \(x) str_pad(x, 2, pad = "0"))
padded |> sort()
[1] "file01.csv" "file02.csv" "file03.csv" "file10.csv" "file11.csv"
[6] "file20.csv"
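The padding width of 2 is hardcoded above, which would stop working once a file100.csv shows up. One way around that, sketched here in base R rather than stringr, is to pad to the width of the longest number. This assumes every name contains exactly one run of digits; the variable names are made up for the example.

```r
filenames <- c("file2.csv", "file1.csv",  "file3.csv",
               "file11.csv", "file10.csv", "file20.csv")

# Extract the first run of digits from each name
# (assumes every name has one; non-matching names would be dropped)
digits <- regmatches(filenames, regexpr("[0-9]+", filenames))

# Pad to the width of the longest number instead of a fixed 2
width <- max(nchar(digits))
padded_digits <- formatC(as.integer(digits), width = width, flag = "0")

# Write the padded numbers back in place of the originals
padded <- filenames
regmatches(padded, regexpr("[0-9]+", padded)) <- padded_digits

sort(padded)
```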

Rather than renaming your items, naturalsort::naturalsort() orders your items in “human natural” order:

filenames |> naturalsort::naturalsort()
[1] "file1.csv"  "file2.csv"  "file3.csv"  "file10.csv" "file11.csv"
[6] "file20.csv"
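If you would rather not add a dependency, a rough base R stand-in is to extract the number and order by it. This is not how naturalsort works internally, just a sketch that assumes every name shares the same prefix and contains a single number:

```r
filenames <- c("file2.csv", "file1.csv",  "file3.csv",
               "file11.csv", "file10.csv", "file20.csv")

# Pull out the first run of digits in each name and convert it to a number
# (assumes every name contains one; non-matching names would be dropped)
file_numbers <- as.integer(regmatches(filenames, regexpr("[0-9]+", filenames)))

# order() returns the permutation that sorts the numbers;
# applying it to the names sorts them numerically
sorted_names <- filenames[order(file_numbers)]
sorted_names
```

Unlike naturalsort(), this breaks down as soon as the names have different prefixes or more than one number, so treat it as a quick fix for simple cases.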

The naturalsort package also comes with the function naturalfactor(), which can reorder the levels of a factor in the same way, or turn a character vector into a factor with naturally ordered levels:

my_factor <- factor(c("level_1", "level_10", "level_2"))
my_factor |> naturalsort::naturalfactor()
[1] level_1  level_10 level_2 
Levels: level_1 < level_2 < level_10
c("level1", "level10", "level2") |> naturalsort::naturalfactor()
[1] level1  level10 level2 
Levels: level1 < level2 < level10
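Level order matters because functions like table() (and plotting functions) follow it. The sketch below builds a naturally ordered factor by hand in base R to show the effect. It is an illustration under the assumption that each level contains a single number, not a reimplementation of naturalfactor(), and the data are invented.

```r
responses <- c("level1", "level10", "level2", "level10")

# Unique values in order of appearance: level1, level10, level2
lv <- unique(responses)

# Strip everything that is not a digit and convert to a number
lv_numbers <- as.integer(gsub("[^0-9]", "", lv))

# Reorder the levels by their numeric part
natural_levels <- lv[order(lv_numbers)]

f <- factor(responses, levels = natural_levels, ordered = TRUE)

levels(f)
# table() reports counts in level order, so level2 comes before level10
table(f)
```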