Tokenisation over Bounded Alphabets is Hard
Overview
Overall Novelty Assessment
The paper establishes NP-completeness and inapproximability results for tokenisation over bounded alphabets, specifically proving hardness for binary (2-character) and unary (1-character) cases. Within the taxonomy, it occupies the sole position in the 'NP-Completeness Proofs for Bounded Alphabets' leaf under 'Theoretical Complexity Analysis of Tokenisation'. This leaf contains only the original paper itself, indicating a sparse research direction. The broader parent branch ('Theoretical Complexity Analysis') contains just two leaves, with the sibling 'Inapproximability Results' leaf currently empty, suggesting this theoretical angle remains relatively unexplored in the literature.
The taxonomy reveals that most tokenisation research concentrates in 'Algorithm Design and Optimization' (3 leaves: BPE variants, security-preserving methods, alternative approaches) and 'Application Domains' (3 leaves: network security, biological sequences, language models). The original paper's theoretical focus contrasts sharply with these algorithm-centric and application-oriented directions. Neighboring work like BPE Dictionary Equivalence examines structural properties rather than worst-case complexity, while HMM Lexical Analysis and LLMs Learn HMMs explore probabilistic segmentation models. The taxonomy's scope notes explicitly separate theoretical hardness proofs from practical algorithm design, positioning this work in a distinct, less-populated research space.
Among the three contributions analyzed, the literature search examined only 5 candidate papers in total. For the binary tokenisation NP-completeness result, 1 candidate was examined with 0 refutations; for the unary direct tokenisation hardness, 1 candidate with 0 refutations; for the formal problem definitions, 3 candidates with 0 refutations. This limited search scope, covering top-K semantic matches plus citation expansion, found no prior work that clearly overlaps with the specific bounded-alphabet hardness results. All three contributions appear novel within the examined candidate set, though the small search scale (5 papers) means the analysis cannot claim exhaustive coverage of the theoretical complexity literature.
Given the sparse taxonomy position and absence of refuting candidates among the 5 papers examined, the work appears to address a relatively unexplored theoretical question. However, the limited search scope and the fact that the original paper is the only entry in its taxonomy leaf suggest caution: the analysis reflects top-K semantic similarity rather than comprehensive field coverage. The novelty assessment is constrained by what was examined, not by what exists in the broader literature.
Taxonomy
Research Landscape Overview
Claimed Contributions
The authors prove that both bottom-up and direct tokenisation over binary alphabets are NP-complete decision problems and establish that neither problem admits a polynomial-time approximation scheme, showing they cannot be approximated arbitrarily well in polynomial time.
The authors establish that direct tokenisation remains NP-complete even for the simplest case of a unary (single-character) alphabet. They prove strong NP-hardness by reduction from the vertex cover problem, so hardness holds even when the strings are written out explicitly rather than encoded compactly by their lengths.
The authors introduce and formalize the n-ary tokenisation problem, defining decision and optimisation variants for both direct and bottom-up tokenisation over alphabets constrained to size n, establishing a framework for analyzing tokenisation complexity with bounded alphabets.
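To make the optimisation variant concrete, here is one illustrative reading (the helper names and the exact budget convention are assumptions for this sketch, not the paper's notation): given a set of strings and a vocabulary budget K, direct tokenisation asks for a vocabulary of at most K extra substrings, with single characters assumed always available so every string stays segmentable, minimising the total number of tokens needed to segment all strings. A brute-force Python sketch:

```python
from itertools import combinations

def segment_cost(s, vocab):
    """Minimum number of tokens needed to segment s using vocab
    (dynamic programming over prefixes); None if s is unsegmentable."""
    INF = float("inf")
    best = [0] + [INF] * len(s)
    for i in range(1, len(s) + 1):
        for t in vocab:
            if i >= len(t) and s[i - len(t):i] == t and best[i - len(t)] + 1 < best[i]:
                best[i] = best[i - len(t)] + 1
    return None if best[-1] == INF else best[-1]

def direct_tokenisation(strings, K):
    """Brute-force direct tokenisation: try every vocabulary of at most K
    substrings of the inputs (on top of the single characters) and return
    the vocabulary minimising total token count across all strings."""
    chars = {c for s in strings for c in s}
    subs = {s[i:j] for s in strings for i in range(len(s))
            for j in range(i + 1, len(s) + 1)} - chars
    best_vocab = set(chars)
    best_cost = sum(segment_cost(s, chars) for s in strings)
    for k in range(1, K + 1):
        for extra in combinations(sorted(subs), k):
            vocab = chars | set(extra)
            cost = sum(segment_cost(s, vocab) for s in strings)
            if cost < best_cost:
                best_vocab, best_cost = vocab, cost
    return best_vocab, best_cost
```

The exponential enumeration over candidate vocabularies is precisely the search that, per the NP-completeness results, cannot in general be replaced by a polynomial-time procedure.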
Core Task Comparisons
Comparisons with papers in the same taxonomy category
Contribution Analysis
Detailed comparisons for each claimed contribution
NP-completeness of binary tokenisation and hardness of approximation
The authors prove that both bottom-up and direct tokenisation over binary alphabets are NP-complete decision problems and establish that neither problem admits a polynomial-time approximation scheme, showing they cannot be approximated arbitrarily well in polynomial time.
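To make the bottom-up variant concrete: a solution there is an ordered list of merge rules applied to the character sequence, as in BPE, and the hardness concerns choosing a merge sequence that minimises the resulting token count. A minimal sketch of applying a fixed merge sequence over a binary alphabet (the function name and the greedy left-to-right application order are assumptions of this sketch):

```python
def apply_merges(s, merges):
    """Apply an ordered list of merge rules to a string, BPE-style:
    each rule (x, y) replaces every adjacent token pair x, y with the
    concatenated token xy, scanning left to right."""
    tokens = list(s)  # start from individual characters
    for x, y in merges:
        out, i = [], 0
        while i < len(tokens):
            if i + 1 < len(tokens) and tokens[i] == x and tokens[i + 1] == y:
                out.append(x + y)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens
```

The NP-complete decision question is then whether some merge sequence of bounded length drives the token count at or below a given threshold; applying a given sequence, as above, is easy, while finding the best one is not.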
[8] Restricted Common Superstring and Restricted Common Supersequence
Strong NP-completeness of direct unary tokenisation
The authors establish that direct tokenisation remains NP-complete even for the simplest case of a unary (single-character) alphabet. They prove strong NP-hardness by reduction from the vertex cover problem, so hardness holds even when the strings are written out explicitly rather than encoded compactly by their lengths.
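The unary case has a clean arithmetic reading: since every unary string is determined by its length, an instance is a multiset of integers, a vocabulary is a set of token lengths, and segmenting a string of length n means writing n as a sum of chosen lengths with as few summands as possible. A sketch of the per-string cost, assuming length-1 tokens are always available (a coin-change-style DP; the function name is an assumption of this sketch):

```python
def unary_segment_cost(n, lengths):
    """Fewest tokens whose lengths sum to n, given the allowed token
    lengths; unbounded coin-change dynamic programming. Assumes some
    length <= n is available (e.g. length 1) so n is reachable."""
    INF = float("inf")
    best = [0] + [INF] * n
    for i in range(1, n + 1):
        for L in lengths:
            if L <= i and best[i - L] + 1 < best[i]:
                best[i] = best[i - L] + 1
    return best[n]
```

Each individual cost is easy to compute; the strong NP-hardness lies in selecting the shared set of token lengths that minimises the summed cost over all strings.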
[9] The complexity of some regex crossword problems
Formal definition of n-ary tokenisation problems over bounded alphabets
The authors introduce and formalize the n-ary tokenisation problem, defining decision and optimisation variants for both direct and bottom-up tokenisation over alphabets constrained to size n, establishing a framework for analyzing tokenisation complexity with bounded alphabets.