From Characters to Tokens: Dynamic Grouping with Hierarchical BPE


Hierarchical tokenisation overview
Figure 1: Two-stage tokenisation—BPE patches (Tokenizer 1) followed by a tiny BPE inside each patch (Tokenizer 2).

Transformers work particularly well with sub-word tokenisations, which improve model quality and shorten sequences for faster inference. However, the gains taper off once the vocabulary becomes very large (> 500K): many tokens are rare, their embeddings are under-trained, and the bloated embedding matrix adds cost without proportional benefit. At the other end of the spectrum, character-level models avoid the rare-token problem by construction, but their sequences are much longer; with standard self-attention’s quadratic time and memory, this often makes them impractical.

Recent hierarchical approaches aim to reconcile character-level robustness with subword-level efficiency by grouping characters into patches. However, existing patching strategies often depend on whitespace, which restricts their applicability to languages without clear word boundaries, or they require auxiliary models that introduce additional complexity and dependencies.

We propose a dynamic character-grouping method that leverages the structure of standard BPE tokenisation without auxiliary models. Specifically, we append explicit end-of-patch markers to BPE tokens and apply a second, lightweight BPE pass to regulate patch granularity. This design yields representations that are efficient, flexible, and language-agnostic, while preserving a simple training and deployment pipeline.
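
As a toy illustration (the token strings and merges below are made up, not the output of any particular tokenizer), the two stages look like this:

```python
# Stage 1: an ordinary BPE (T1) splits the text into tokens; every token
# becomes one patch, terminated by the end-of-patch marker "<!>".
bpe_tokens = ["The", " quick", "est"]            # illustrative T1 output
patches = [t + "<!>" for t in bpe_tokens]        # ["The<!>", " quick<!>", "est<!>"]

# Stage 2: a tiny inner BPE (T2) re-tokenises *within* each patch so that
# no patch exceeds S sub-tokens (here S = 4; the merges are illustrative).
inner = [["The", "<!>"], [" qu", "ick", "<!>"], ["e", "st", "<!>"]]
```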


Algorithm: Dynamic Patching with Two-Stage BPE

Figure 1 provides a schematic overview. Below is the procedure in a compact format.

Notation

| Symbol | Meaning |
|---|---|
| T₁ | Existing subword tokenizer (BPE; “Tokenizer 1”) |
| <!> | End-of-patch marker (EoP) appended to every T₁ token |
| T₂ | Lightweight, inner BPE (“Tokenizer 2”) applied within each patch |
| S | Maximum sub-token length per patch after T₂ (small cap, e.g. 4–8) |
| E_local / D_local | Local encoder/decoder operating inside each patch |
| E_global | Global causal Transformer over patch embeddings |
| c | Raw character stream |
| pᵢ | i-th patch (characters of the i-th T₁ token followed by <!>) |
| zᵢ | Patch embedding produced by E_local |

High-Level Idea (TL;DR)

  • Start from any existing BPE (T₁). Each token becomes a patch.
  • Append an end-of-patch marker <!> so decoding is incremental.
  • Apply a lightweight BPE (T₂) inside each patch to cap length at a small S.
  • Use local encoders/decoders per patch + a global causal Transformer over patch embeddings.
  • Outcome: faster models and state-of-the-art BPB among dynamic patching baselines—without a boundary model and without whitespace assumptions.
  • More details about the local and global models can be found in the architecture section.

Procedure

  1. Outer tokenisation (patching).
    Tokenise text with T₁ to obtain [t₁, t₂, …, t_N].
    For each tᵢ, form a patch from the token’s characters (see the sketch after this procedure): pᵢ = (c_1^i, c_2^i, …, c_{L_i}^i, <!>).

  2. Within-patch compression.
    For each pᵢ, apply T₂ to obtain a bounded sequence
    uᵢ = T₂(pᵢ) with |uᵢ| ≤ S.

  3. Local encoding.
    Encode uᵢ with E_local to produce a fixed-size embedding zᵢ.

  4. Global modelling.
    Feed the sequence [z₁, z₂, …, z_N] to E_global (causal Transformer) for autoregressive modelling over patches.

  5. Decoding (incremental).
    During generation, E_global predicts the next patch representation;
    D_local decodes uᵢ → pᵢ, then the EoP marker is stripped to recover tᵢ and its characters.
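
A minimal sketch of steps 1–2, assuming generic `t1_tokenize` / `t2_tokenize` callables that return token strings (these names, and the fallback for over-long patches, are assumptions of this sketch rather than the reference implementation):

```python
from typing import Callable, List

EOP = "<!>"  # end-of-patch marker appended to every T1 token

def make_patches(text: str,
                 t1_tokenize: Callable[[str], List[str]]) -> List[str]:
    # Step 1 (outer tokenisation): each T1 token becomes one patch,
    # terminated by the end-of-patch marker so decoding stays incremental.
    return [token + EOP for token in t1_tokenize(text)]

def compress_patches(patches: List[str],
                     t2_tokenize: Callable[[str], List[str]],
                     S: int) -> List[List[str]]:
    # Step 2 (within-patch compression): the inner BPE T2 merges characters
    # inside each patch; a T2 trained for this purpose keeps |u_i| <= S.
    compressed = []
    for patch in patches:
        u = t2_tokenize(patch)
        if len(u) > S:
            # Assumption of this sketch: an over-long patch is split into
            # several bounded pieces (this slightly raises fertility).
            compressed.extend(u[i:i + S] for i in range(0, len(u), S))
        else:
            compressed.append(u)
    return compressed
```

Steps 3–5 are the model side; a corresponding sketch follows in the architecture section.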

Architecture

Local-Global architecture for Hierarchical BPE
Figure 2: Local encoders/decoders operate within patches; a global Transformer models the compressed patch sequence.
  • Local encoder: turns each bounded-length patch into an embedding.
  • Global Transformer: models the sequence of patch embeddings (short!).
  • Local decoder: reconstructs the next patch internally during training/inference.

Why it’s efficient: replacing a huge embedding matrix with light local modules + shorter global sequences reduces FLOPs and parameters while improving bits-per-byte (BPB).
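
A condensed PyTorch sketch of this local/global layout (layer sizes, the mean-pooling, and the one-shot local decoder head are simplifications chosen for brevity, not the exact published architecture):

```python
import torch
import torch.nn as nn

class LocalEncoder(nn.Module):
    # Turns the <= S inner sub-tokens of each patch into one embedding z_i.
    def __init__(self, inner_vocab: int, d_model: int, n_layers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(inner_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        # u: (batch, n_patches, S) padded inner sub-token ids
        b, n, s = u.shape
        h = self.encoder(self.embed(u.view(b * n, s)))   # (b*n, S, d)
        return h.mean(dim=1).view(b, n, -1)              # pooled patch embeddings

class HierarchicalLM(nn.Module):
    # Global causal Transformer over patch embeddings; a simplified local
    # decoder head predicts the S inner sub-tokens of the next patch.
    def __init__(self, inner_vocab: int, d_model: int, S: int,
                 n_global_layers: int = 12):
        super().__init__()
        self.local_enc = LocalEncoder(inner_vocab, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.global_tf = nn.TransformerEncoder(layer, num_layers=n_global_layers)
        self.local_dec = nn.Linear(d_model, S * inner_vocab)  # toy stand-in
        self.S, self.V = S, inner_vocab

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        z = self.local_enc(u)                                 # (b, n, d)
        n = z.size(1)
        causal = torch.triu(torch.full((n, n), float("-inf"),
                                       device=z.device), diagonal=1)
        h = self.global_tf(z, mask=causal)                    # causal over patches
        return self.local_dec(h).view(*h.shape[:2], self.S, self.V)
```

During training, the logits produced at patch i would be compared against the inner sub-tokens of patch i+1 (next-patch prediction); the actual local decoder is autoregressive rather than a single linear head.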

Results (high level)

  • Lower BPB than standard BPE and entropy-based patching; competitive with space-based patching without relying on spaces.
  • Compact models: fewer parameters than BPE baselines yet better compression (lower BPB).
  • Language-agnostic: strong results on non-whitespace scripts (e.g., Chinese).
  • Patch sizes ~4 bytes in practice → short global sequences and faster compute.

Table 1: Comparison for S = 10 of BPE tokenisation, space patching, and entropy patching (Space-Patch, Ent.-Patch) for 128M models trained for 13,500 steps. FLOPs are in billions. “*” denotes the unbounded model. We report the mean over multiple runs on the test set; max std dev across experiments: ±0.007.

| Model | FLOPs↓ | Fertility↓ | Params | BPB↓ |
|---|---|---|---|---|
| Ent.-Patch* | 509 | 1.41 | 351M | 1.24 |
| Ent.-Patch | 668 | 1.83 | 351M | 1.20 |
| Space-Patch | 534 | 1.451 | 351M | 1.14 |
| Char-Level | 4214 | 4.5 | 85M | 1.16 |
| BPE | 562 | 1.51 | 123M | 1.16 |
| BPE-Patch | 554 | 1.51 | 99.7M | 1.11 |

Table 1 shows that our approach outperforms all baselines in this setup. The entropy-patching method yields the weakest results, likely due to a domain mismatch between its entropy model’s training data and SlimPajama. Although the unbounded entropy variant is more FLOP-efficient, its longer patches force the local models to learn more complex representations with limited capacity, causing a marked drop in performance. Constraining the maximum patch length to 6 improves accuracy, but at higher FLOPs, and it still trails our method.

Space-based patching is the most FLOP-efficient baseline and reasonably strong, yet it fails to generalise to languages without whitespace, an issue our method avoids. Character-level modelling uses fewer parameters but is both slower and less predictive. Notably, we also surpass standard BPE despite using fewer parameters, suggesting our local encoders learn richer token-level representations than a fixed embedding matrix.
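
For concreteness, bits-per-byte is simply the model’s summed cross-entropy converted to bits and normalised by the UTF-8 byte count of the evaluated text; a minimal helper (the function name is ours):

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    # Summed negative log-likelihood (natural log) -> bits, then divide by
    # the number of UTF-8 bytes in the evaluated text.
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)
```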

Downstream evaluation

To further verify our model, we conduct experiments on question-answering tasks. We first pre-train a medium model on next-token prediction for 15B tokens, then perform zero-shot evaluations on the benchmarks reported in Table 2: MMLU, HellaSwag, LAMBADA, ARC-challenge, ARC-easy, WinoGrande, and PIQA.

To obtain the scores, we compute the log-likelihood per character within each patch and select the answer with the highest log-likelihood. Table 2 shows that we outperform the standard BPE approach and entropy-based patching, while performing on par with space patching on most benchmarks. We also evaluate language-modelling capacity via BPB: the medium model improves with scale (lower BPB than the small model) and outperforms the other baselines at larger sizes.
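
A sketch of that scoring rule, assuming a hypothetical `answer_logprob(prompt, answer)` helper that returns the model’s total log-likelihood of the answer given the prompt:

```python
from typing import Callable, List

def pick_answer(prompt: str,
                candidates: List[str],
                answer_logprob: Callable[[str, str], float]) -> str:
    # Score each candidate by its log-likelihood normalised per character,
    # then pick the highest-scoring answer (zero-shot, no tuning).
    def score(answer: str) -> float:
        return answer_logprob(prompt, answer) / max(len(answer), 1)
    return max(candidates, key=score)
```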

Table 2: Common Academic Evaluation Benchmarks comparing a standard embedding matrix vs. our learnable embeddings. Task performance is zero-shot.

| Model | Params | MMLU (acc) | Hella (acc_norm) | Lmb. (acc) | ARC-c (acc_norm) | ARC-e (acc) | Wino. (acc) | Piqa (acc) |
|---|---|---|---|---|---|---|---|---|
| BPE | 359M | 23.00 | 31.3 | 29.89 | 23.63 | 38.05 | 50.5 | 61.43 |
| Ent.-Patch | 575M | 23.07 | 32.91 | 29.40 | 20.90 | 36.60 | 48.8 | 59.60 |
| Space-Patch | 575M | 23.24 | 35.9 | 32.8 | 21.50 | 39.02 | 52.5 | 64.2 |
| BPE-Patch | 323M | 22.90 | 34.6 | 32.5 | 22.10 | 40.6 | 53.4 | 63.3 |
| BPE-Patch | 1.3B | 24.95 | 52.3 | 47.3 | 28.1 | 52.2 | 54.9 | 70.1 |

Practical tips

  • Cap S around 8–10 for a sweet spot between speed and fidelity.
  • Works with any BPE (GPT-2, LLaMA-3/SPM, etc.).
  • Drop the auxiliary boundary network used in entropy patching; the end-of-patch marker is enough.
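
As a quick example of the second tip, any off-the-shelf BPE can play the role of T₁; below is a small sketch using the Hugging Face `transformers` GPT-2 tokenizer (the marker handling is ours, and GPT-2’s byte-level token strings would still need mapping back to plain characters before the inner BPE):

```python
from transformers import AutoTokenizer  # any existing BPE can serve as T1

tok = AutoTokenizer.from_pretrained("gpt2")   # GPT-2 BPE as Tokenizer 1
EOP = "<!>"

text = "Patching without whitespace assumptions."
patches = [t + EOP for t in tok.tokenize(text)]
# e.g. ['Pat<!>', 'ching<!>', 'Ġwithout<!>', ...]  (token strings are illustrative)
```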


  1. We enforce the maximum space-patch size to be 6, resulting in a fertility higher than 1.