442 pretrained vocabularies are currently available, and new vocabularies are added daily.
Type in the textarea and the text will be tokenized live with the two selected vocabularies.
To select a vocabulary, click on the outer arrow of a drop-down list to display all currently available pretrained vocabularies. You can also click on the text inside the drop-down and type to filter the list. For example, typing "24000" lists all vocabularies with a size of 24000.
The "show capcode" checkbox toggles between the decoded and encoded forms of the tokens. In the encoded form, the ⌦ marker deletes the next character (in practice it is only applied to spaces), the ⇧ marker uppercases the next character, and the ⇪ marker uppercases the next word.
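To make the marker semantics concrete, here is a minimal sketch of decoding such markers. This is an illustration only, not the TokenMonster implementation; the `decode_capcode` function and its treatment of word boundaries (a word ends at a space) are assumptions for the example.

```python
def decode_capcode(s: str) -> str:
    # Illustrative decoder for the three markers described above:
    #   ⌦ deletes the next character (used for spaces),
    #   ⇧ uppercases the next character,
    #   ⇪ uppercases the rest of the current word.
    out = []
    i = 0
    uppercase_word = False
    while i < len(s):
        ch = s[i]
        if ch == "⌦":                     # skip marker and the next character
            i += 2
            continue
        if ch == "⇧":                     # uppercase only the next character
            if i + 1 < len(s):
                out.append(s[i + 1].upper())
            i += 2
            continue
        if ch == "⇪":                     # uppercase until the next space
            uppercase_word = True
            i += 1
            continue
        if uppercase_word and ch != " ":
            out.append(ch.upper())
        else:
            uppercase_word = False if ch == " " else uppercase_word
            out.append(ch)
        i += 1
    return "".join(out)

print(decode_capcode("⇧hello ⇪world again"))  # → "Hello WORLD again"
```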
The labels "unfiltered", "clean", "balanced", "consistent" and "strict" indicate the optimization mode the vocabulary was trained with, described here.
At the bottom of the page is a bar chart showing the average characters per token for the selected vocabularies on 4 datasets.
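For reference, average characters per token is straightforward to compute for any tokenizer. This sketch uses a hypothetical `tokenize` callable as a stand-in; it is not the TokenMonster API.

```python
def avg_chars_per_token(texts, tokenize):
    # Total characters across the corpus divided by total token count.
    total_chars = sum(len(t) for t in texts)
    total_tokens = sum(len(tokenize(t)) for t in texts)
    return total_chars / total_tokens

# Toy whitespace tokenizer used purely as a placeholder.
toy = lambda t: t.split()
print(avg_chars_per_token(["the cat sat"], toy))  # 11 chars / 3 tokens
```

A higher value means the vocabulary compresses text into fewer, longer tokens.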
The number in the vocabulary name is the vocabulary size. For reference, "llama" is 32000 and "gpt2" is 50256.