Recent advances in large language models (LLMs), amplified by their long-context capabilities, have demonstrated remarkable proficiency in intricate reasoning (OpenAI-o1; DeepSeek-R1), agentic thinking (Reflexion; ReAct; RAM), and creative writing (Wang et al., 2023; Mikhaylovskiy, 2023). These advancements necessitate the ability to generate lengthy sequences; for example, o1-like reasoning tends to produce protracted chain-of-thought trajectories before reaching a final conclusion.
However, generating ultra-long sequences (up to 100K tokens) is painfully slow. For example, generating 100K tokens with LLaMA3.1-8B can take approximately five hours (Figure 2), hindering real-world applications.
A straightforward solution is to take advantage of recent success in speculative decoding (SD). However, existing methods are generally tailored for generating short sequences, e.g., TriForce and MagicDec are limited to generating 256 and 64 tokens, respectively. Directly extending their generation length to 100K tokens would inevitably encounter failures due to KV cache budget constraints. Furthermore, when applied to optimized KV cache architectures such as Group Query Attention (GQA), these methods yield only marginal acceleration gains for short-sequence generation (Figure 3). This observation leads to a pivotal research question:
Is it possible to achieve model-agnostic, lossless acceleration, akin to that seen in short-sequence SD, for generating ultra-long sequences with minimal training overhead?
Generating ultra-long sequences exposes three critical bottlenecks: frequent model reloading caused by decoding one token per forward pass, a KV cache that grows without bound, and repetitive content. TokenSwift tackles each of them as follows.
Instead of generating one token at a time, TokenSwift predicts multiple tokens in a single forward pass. Inspired by Medusa, it adds lightweight linear layers to the base model and uses tree attention to enable Multi-Token Generation. To further boost efficiency, it reuses frequent n-grams (phrases) from earlier text, reducing redundant computation.
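To make the idea concrete, here is a minimal sketch of Medusa-style multi-token heads in PyTorch. The class name, number of heads, and the omission of tree attention and draft verification are assumptions for illustration, not TokenSwift's actual implementation.

```python
import torch
import torch.nn as nn

class MultiTokenHeads(nn.Module):
    """Hypothetical sketch: k extra linear heads on the base model's final
    hidden state, each guessing one future token per forward pass."""

    def __init__(self, hidden_size: int, vocab_size: int, num_heads: int = 3):
        super().__init__()
        self.heads = nn.ModuleList(
            [nn.Linear(hidden_size, vocab_size) for _ in range(num_heads)]
        )

    def forward(self, last_hidden: torch.Tensor) -> torch.Tensor:
        # last_hidden: [batch, hidden_size] from the base model's final layer.
        # Returns logits of shape [batch, num_heads, vocab_size], one
        # distribution per speculated future position.
        return torch.stack([head(last_hidden) for head in self.heads], dim=1)
```

In a full pipeline, the candidates drafted by these heads would then be verified by the base model in a single pass (verification and tree attention are omitted here for brevity).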
TokenSwift intelligently prunes less important KV pairs while preserving critical context. It keeps the initial prompt’s KV cache intact and dynamically updates the rest based on importance scores derived from attention patterns.
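Below is a rough sketch of importance-based cache pruning, assuming one accumulated attention score per cached position; the function name, tensor layout, and scoring rule are illustrative assumptions, not TokenSwift's exact procedure.

```python
import torch

def prune_kv_cache(keys, values, attn_scores, prompt_len, budget):
    """Keep the prompt's KV entries intact and retain only the
    highest-scoring generated entries.

    keys, values: [seq_len, num_heads, head_dim]
    attn_scores:  [seq_len] importance per cached position (e.g. attention
                  weights accumulated over recent decoding steps)
    """
    gen_scores = attn_scores[prompt_len:]                     # generated positions only
    keep = min(budget, gen_scores.numel())
    top = torch.topk(gen_scores, keep).indices + prompt_len   # shift back to absolute positions
    idx = torch.cat([torch.arange(prompt_len, device=keys.device), top.sort().values])
    return keys[idx], values[idx]
```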
To combat repetition, TokenSwift penalizes recently generated tokens within a sliding window, nudging the model toward diverse outputs. This works alongside sampling strategies such as Nucleus Sampling, min-p, and η-sampling.
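A toy version of this penalty combined with nucleus (top-p) sampling is sketched below; the window size, penalty value, and function name are assumptions, not the paper's settings.

```python
import torch

def penalized_sample(logits, recent_tokens, window=1024, penalty=1.2, top_p=0.9):
    """Down-weight tokens seen in the last `window` generated tokens,
    then sample from the nucleus of the adjusted distribution.
    logits: [vocab_size]; recent_tokens: list of generated token ids."""
    logits = logits.clone()
    for tok in set(recent_tokens[-window:]):
        # Positive logits shrink, negative logits become more negative.
        logits[tok] = logits[tok] / penalty if logits[tok] > 0 else logits[tok] * penalty

    probs = torch.softmax(logits, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cum = torch.cumsum(sorted_probs, dim=-1)
    keep = cum - sorted_probs < top_p          # keep tokens until mass exceeds top_p
    sorted_probs[~keep] = 0.0
    sorted_probs /= sorted_probs.sum()         # renormalize over the nucleus
    return sorted_idx[torch.multinomial(sorted_probs, 1)].item()
```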
Table 1 and Table 2 present the main results, showing that TokenSwift consistently achieves over 3× acceleration across various model scales and architectures.
@misc{wu2025hoursminuteslosslessacceleration,
title={From Hours to Minutes: Lossless Acceleration of Ultra Long Sequence Generation up to 100K Tokens},
author={Tong Wu and Junzhe Shen and Zixia Jia and Yuxuan Wang and Zilong Zheng},
year={2025},
eprint={2502.18890},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2502.18890},
}