TokenMonster Benchmark

This line chart shows how characters per token varies with vocabulary size and optimization mode.

The x axis is vocab size. The y axis is characters / token.
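As a rough illustration of the y-axis metric, characters per token can be computed by dividing the total character count of a corpus by the total number of tokens a tokenizer produces for it. The sketch below is not the benchmark's own code: the `chars_per_token` helper is hypothetical, `str.split` is only a stand-in for a real tokenizer, and it assumes that "characters" means Unicode code points rather than bytes, which the benchmark may define differently.

```python
from typing import Callable, Iterable, Sequence


def chars_per_token(texts: Iterable[str],
                    tokenize: Callable[[str], Sequence]) -> float:
    """Return total characters divided by total tokens over all texts.

    `tokenize` is any callable that maps a string to a sequence of tokens.
    Characters are counted as Unicode code points (an assumption).
    """
    total_chars = 0
    total_tokens = 0
    for text in texts:
        total_chars += len(text)
        total_tokens += len(tokenize(text))
    if total_tokens == 0:
        raise ValueError("tokenizer produced no tokens")
    return total_chars / total_tokens


# Stand-in tokenizer: whitespace splitting, used here only so the
# example runs without any external library.
sample = ["the quick brown fox", "jumps over the lazy dog"]
ratio = chars_per_token(sample, str.split)
print(f"{ratio:.2f} characters/token")
```

A higher value means each token covers more text on average, so the same input needs fewer tokens.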

The datasets, described here, total 3 GB of text.

The labels "unfiltered", "clean", "balanced", "consistent" and "strict" indicate the optimization mode each vocabulary was trained for, described here.

TokenMonster Vocabulary Comparison