This line chart shows the relationship between vocab size, optimization mode and characters/token.
The x axis is vocab size. The y axis is characters per token (higher means better compression).
The datasets, described here, total 3 GB of text.
The labels "unfiltered", "clean", "balanced", "consistent" and "strict" indicate the optimization mode each vocabulary was trained for, described here.
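
For readers who want to reproduce the y-axis metric on their own text, the following is a minimal sketch of how characters per token could be computed. It is not the code used to produce the chart: `chars_per_token` and `toy_tokenize` are hypothetical names, the whitespace tokenizer is only a stand-in for a real vocabulary, and counting Unicode characters (rather than bytes) is an assumption.

```python
from typing import Callable, Iterable, List


def chars_per_token(texts: Iterable[str],
                    tokenize: Callable[[str], List]) -> float:
    """Total characters divided by total tokens over a corpus.

    `tokenize` is any function mapping a string to a list of tokens.
    Assumption: length is measured in Unicode characters, not bytes.
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


def toy_tokenize(text: str) -> List[str]:
    # Hypothetical stand-in tokenizer: splits on whitespace only.
    return text.split()


if __name__ == "__main__":
    sample = ["ab cd", "efgh"]
    print(chars_per_token(sample, toy_tokenize))
```

Summing characters and tokens over the whole corpus before dividing weights long documents more heavily than averaging per-document ratios would; which of the two the chart uses is not stated here.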