Learning Token Prediction

How AI Predicts the Next Word
“The cat sat on the”
1. Tokenize
2. Analyze
3. Calculate
4. Choose
1. Tokenize: Convert text into processable units
Why tokenization is necessary
Neural networks work with numbers, not raw text. We map each piece of text to a numeric token ID.

Original sentence AI receives:

“The cat sat on the”

How AI actually processes it (Token IDs):

464 (The)
2415 (cat)
3332 (sat)
319 (on)
262 (the)
In this example, each word maps to one token ID from a fixed vocabulary. Real tokenizers often split longer or rarer words into several subword tokens. These numbers are the inputs to the model.
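The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not a real tokenizer: the vocabulary contains only the five IDs shown in this walkthrough, and `TOY_VOCAB`, `UNKNOWN_ID`, and `tokenize` are hypothetical names introduced for the example.

```python
# Toy vocabulary using only the token IDs shown in this walkthrough.
# Real vocabularies hold tens of thousands of entries and use subword pieces.
TOY_VOCAB = {"The": 464, "cat": 2415, "sat": 3332, "on": 319, "the": 262}
UNKNOWN_ID = 0  # hypothetical ID for words missing from the toy vocabulary


def tokenize(text: str) -> list[int]:
    """Split on whitespace and look each word up in the toy vocabulary."""
    return [TOY_VOCAB.get(word, UNKNOWN_ID) for word in text.split()]


token_ids = tokenize("The cat sat on the")
print(token_ids)  # [464, 2415, 3332, 319, 262]
```

Note that the lookup is case-sensitive: "The" and "the" receive different IDs, just as in the list above.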
Token IDs flow into the attention mechanism
2. Analyze: Calculate contextual importance weights
How attention works
The model scores how much each prior token should influence the next prediction.

Attention weights calculated:

cat: 85
sat: 72
on: 64
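Attention normalizes its scores with softmax so that the resulting weights sum to 1. The values shown here do not sum to 1 (or to 100), so they appear to be relative scores rather than normalized weights. Higher scores mean more influence on the next token.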
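To make the scoring and normalization concrete, here is a small sketch of scaled dot-product attention, one common formulation. The demo does not say which formulation it uses, so treat this as an assumption. The two-dimensional `query` and `keys` vectors are hypothetical values chosen only so that "cat" scores highest, followed by "sat" and "on".

```python
import math


def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    peak = max(scores)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(query, keys):
    """Scaled dot-product score of one query against each key, then softmax."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    return softmax(scores)


# Hypothetical vectors, for illustration only.
query = [1.0, 0.5]
keys = {"cat": [1.2, 0.8], "sat": [0.9, 0.6], "on": [0.7, 0.4]}

weights = attention_weights(query, list(keys.values()))
for token, weight in zip(keys, weights):
    print(f"{token}: {weight:.3f}")
```

The printed weights keep the same ordering as the raw scores, but after softmax they are positive and sum to 1.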
Weighted embeddings feed into the output layer
3. Calculate: Compute probabilities over the vocabulary
Softmax and logits
The model produces a raw score (a logit) for every token in the vocabulary. Softmax converts those logits into probabilities that sum to 1 across the vocabulary.

Top candidates from vocabulary:

mat: 42.3%
chair: 28.7%
table: 19.4%
floor: 9.6%
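Only the top few candidates are shown here. The four displayed values sum to 100%, so they are best read as normalized over the displayed candidates; in a full vocabulary, the remaining tokens would also receive very small probabilities.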
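The logit-to-probability step can be sketched as follows. The model's actual logits are not shown in this demo, so the example assumes hypothetical logits equal to the logarithms of the four displayed percentages; with that choice, softmax reproduces the displayed values.

```python
import math


def softmax(logits):
    """Convert raw logits into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Displayed percentages from this walkthrough, written as fractions.
displayed = {"mat": 0.423, "chair": 0.287, "table": 0.194, "floor": 0.096}

# Hypothetical logits: chosen so that softmax recovers the displayed values.
logits = {token: math.log(p) for token, p in displayed.items()}

probs = dict(zip(logits, softmax(list(logits.values()))))
for token, p in probs.items():
    print(f"{token}: {p:.1%}")
```

Adding the same constant to every logit leaves the probabilities unchanged, which is why only the differences between logits matter.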
Probability distribution used for sampling
4. Choose: Sample to select the next token
Selection methods available
Common methods include greedy selection, temperature sampling, and nucleus (top-p) sampling. This demo uses weighted random sampling.
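The selection methods named above can be compared side by side in a short sketch. It uses the four displayed probabilities as its input distribution. The function names, the temperature of 0.5, the top-p threshold of 0.7, and the random seed are all illustrative assumptions, not settings taken from the demo.

```python
import random

# Displayed probabilities from this walkthrough, written as fractions.
PROBS = {"mat": 0.423, "chair": 0.287, "table": 0.194, "floor": 0.096}


def greedy(probs):
    """Always pick the single most likely token."""
    return max(probs, key=probs.get)


def apply_temperature(probs, temperature):
    """Raise each probability to 1/T and renormalize.

    This is equivalent to dividing the logits by T before softmax.
    T < 1 sharpens the distribution; T > 1 flattens it.
    """
    scaled = {t: p ** (1.0 / temperature) for t, p in probs.items()}
    total = sum(scaled.values())
    return {t: s / total for t, s in scaled.items()}


def nucleus(probs, top_p):
    """Keep the smallest set of top tokens whose cumulative probability
    reaches top_p, then renormalize over that set."""
    kept, cumulative = {}, 0.0
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {t: p / total for t, p in kept.items()}


def weighted_sample(probs, rng):
    """Draw one token at random, weighted by its probability."""
    tokens = list(probs)
    return rng.choices(tokens, weights=[probs[t] for t in tokens], k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs give the same draws
print("greedy:", greedy(PROBS))
print("weighted:", weighted_sample(PROBS, rng))
print("temperature 0.5:", weighted_sample(apply_temperature(PROBS, 0.5), rng))
print("nucleus 0.7:", weighted_sample(nucleus(PROBS, 0.7), rng))
```

Greedy selection always returns "mat" for this distribution, while the sampling variants can return other candidates. With a threshold of 0.7, nucleus sampling keeps only "mat" and "chair", because together they already cover 71% of the probability.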
