442 pretrained vocabularies are in progress. New vocabularies are added daily.
This benchmark shows the characters per token across 4 datasets for each of the TokenMonster pretrained vocabularies, plus the GPT2 Tokenizer, LLaMa, and tiktoken.
The datasets are described here and total 3 GB of text.
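Characters per token is simply the character length of a dataset divided by the number of tokens it encodes to. A minimal sketch of that measurement, assuming the TokenMonster Python bindings (the vocabulary name and file path here are illustrative):

```python
import tokenmonster

# Load a pretrained vocabulary by name; any vocabulary from the
# pretrained list can be loaded the same way.
vocab = tokenmonster.load("english-32000-balanced-v1")

with open("dataset.txt", encoding="utf-8") as f:
    text = f.read()

tokens = vocab.tokenize(text)

# Higher characters per token means the text compresses into fewer tokens.
print(len(text) / len(tokens), "characters per token")
```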
The labels "unfiltered", "clean", "balanced", "consistent", and "strict" indicate the optimization mode the vocabulary was trained for, described here.
To select a vocabulary, click the outer arrow on a drop-down list. It will display all currently available pretrained vocabularies.
You can also click the text inside the drop-down and type to filter; for example, type "24000" to list all vocabularies of size 24000.
The "Characters / Token / Vocab Size / 100256" divides the characters per token by the vocabulary size, and then multiplies that by 100256. This gives a representation of the efficiency of the vocabulary, but I'm not sure how useful it is because smaller vocabularies are almost always more efficient.
The number in the vocabulary name is the vocabulary size. For reference: "llama" is 32000, "gpt2" is 50256, "tiktoken p50k_base" is 50256, and "tiktoken cl100k_base" is 100256.