New TwELL Format Speeds Up Sparse LLM Training
- Sakana AI and NVIDIA introduce TwELL, a new storage format for sparse transformer models.
- TwELL uses custom CUDA kernels and dynamic token routing to improve memory efficiency.
- Benchmarks show training speedups exceeding 20% and reduced memory usage on billion-parameter models.
The article presents 'TwELL' (Tile-wise ELLPACK), a new technical approach for optimizing Large Language Models that use sparse architectures, i.e. models where only a small subset of neurons is active for any given input. Although modern LLMs are inherently sparse, the irregular memory access patterns that sparsity creates map poorly onto GPUs, which are built for dense, regular workloads.
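The article does not spell out TwELL's exact memory layout, but the classic ELLPACK format it extends tile-wise is easy to sketch: every row is padded to a fixed number of non-zeros, trading a little wasted storage for perfectly regular strides. The struct and packing helper below are illustrative assumptions, not the paper's actual code.

```cuda
// Minimal sketch of the classic ELLPACK layout that TwELL reportedly
// extends tile-wise. Names (EllMatrix, to_ell) and the fixed per-row
// width are illustrative assumptions, not the paper's data structures.
#include <cstdio>
#include <vector>

struct EllMatrix {
    int rows, cols, width;       // width = max non-zeros kept per row
    std::vector<float> values;   // rows * width, zero-padded
    std::vector<int>   col_idx;  // rows * width, -1 marks padding
};

// Pack a dense matrix into ELLPACK: every row is padded to the same
// width, so one GPU thread per row streams data with regular strides.
EllMatrix to_ell(const std::vector<float>& dense, int rows, int cols, int width) {
    EllMatrix m{rows, cols, width,
                std::vector<float>(rows * width, 0.0f),
                std::vector<int>(rows * width, -1)};
    for (int r = 0; r < rows; ++r) {
        int k = 0;
        for (int c = 0; c < cols && k < width; ++c) {
            float v = dense[r * cols + c];
            if (v != 0.0f) {
                m.values[r * width + k]  = v;
                m.col_idx[r * width + k] = c;
                ++k;
            }
        }
    }
    return m;
}

int main() {
    // 3x4 matrix with at most 2 non-zeros per row.
    std::vector<float> dense = {5, 0, 0, 1,
                                0, 2, 0, 0,
                                0, 0, 3, 4};
    EllMatrix m = to_ell(dense, 3, 4, 2);
    for (int r = 0; r < 3; ++r)
        printf("row %d: (%g @ col %d) (%g @ col %d)\n", r,
               m.values[r * 2],     m.col_idx[r * 2],
               m.values[r * 2 + 1], m.col_idx[r * 2 + 1]);
    return 0;
}
```

The padding is the whole trick: rows with few non-zeros waste a small amount of space, but every thread walks an identical, predictable stride.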
To resolve this hardware mismatch, the researchers designed TwELL as a hybrid format: the majority of tokens are routed through a fast, regularly strided execution path, while a dense backup matrix absorbs the complex, heavy tokens whose non-zero patterns would break that regularity. The GPU therefore always sees well-structured memory accesses, even though the underlying computation is sparse.
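As a rough illustration of that routing decision, the sketch below splits rows (tokens) by non-zero count: anything that fits the ELL width takes the fast path, and the rest are diverted to the dense backup. The threshold rule is an assumption modeled on classic ELL-plus-fallback hybrids such as cuSPARSE's legacy HYB format, not TwELL's published heuristic.

```cuda
#include <vector>

// Hypothetical host-side router: rows whose non-zero count fits the
// ELL width take the regular fast path; heavier rows fall back to a
// dense backup matmul. The split criterion here is an assumption.
struct HybridSplit {
    std::vector<int> fast_rows;   // handled by the regular ELL kernel
    std::vector<int> heavy_rows;  // handled by the dense backup path
};

HybridSplit route_rows(const std::vector<int>& nnz_per_row, int ell_width) {
    HybridSplit s;
    for (int r = 0; r < (int)nnz_per_row.size(); ++r) {
        if (nnz_per_row[r] <= ell_width)
            s.fast_rows.push_back(r);   // regular strides, coalesced loads
        else
            s.heavy_rows.push_back(r);  // irregular row: dense fallback
    }
    return s;
}
```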
The team also introduced custom CUDA kernels that fuse multiple sparse matrix multiplications, maximizing hardware throughput and shrinking the memory footprint of intermediate activations. In training benchmarks on billion-parameter models, the method reportedly achieved speedups exceeding 20%, alongside significant improvements in memory and energy efficiency. The work is scheduled for presentation at ICML 2026.
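The fused kernels themselves have not been released, but the flavor of the optimization can be shown with a minimal CUDA kernel that fuses an ELL-format sparse matrix-vector product with its activation, so the intermediate result stays in registers instead of round-tripping through global memory. Everything here (names, layout, the ReLU stand-in for whatever TwELL actually fuses) is an assumption for illustration.

```cuda
// Hypothetical fused kernel: ELL-format SpMV plus an in-register ReLU.
// TwELL reportedly fuses full sparse matmuls; this simplified stand-in
// only demonstrates why fusion shrinks activation memory traffic.
__global__ void ell_spmv_relu(const float* __restrict__ values,
                              const int*   __restrict__ col_idx,
                              const float* __restrict__ x,
                              float*       __restrict__ y,
                              int rows, int width) {
    int r = blockIdx.x * blockDim.x + threadIdx.x;
    if (r >= rows) return;

    float acc = 0.0f;
    for (int k = 0; k < width; ++k) {
        int c = col_idx[r * width + k];   // -1 marks ELL padding
        if (c >= 0)
            acc += values[r * width + k] * x[c];
    }
    // Fused activation: applied in registers, so the pre-activation
    // value never touches global memory.
    y[r] = fmaxf(acc, 0.0f);
}
```

A production ELL kernel would store values and col_idx transposed (column-major across rows) so that threads in a warp coalesce their loads; the row-major layout above simply matches the packing sketch earlier for readability.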