MetaState: Persistent Working Memory
for Discrete Diffusion Language Models

Anonymous Authors

Under Review
The Information Island Problem. Standard dLLMs discard intermediate representations after each denoising step, forcing redundant recomputation. MetaState introduces a persistent working memory that bridges steps.

Abstract

Discrete diffusion language models (dLLMs) generate text by iteratively denoising a masked sequence. Compared with autoregressive models, this paradigm naturally supports parallel decoding, bidirectional context, and flexible generation patterns. However, standard dLLMs condition each denoising step only on the current hard-masked sequence, while intermediate continuous representations are discarded after sampling and remasking. We refer to this bottleneck as the Information Island problem. It leads to redundant recomputation across steps and can degrade cross-step consistency.

We address this limitation with MetaState, a lightweight recurrent augmentation that equips a frozen dLLM backbone with a persistent, fixed-size working memory that remains independent of sequence length. MetaState consists of three trainable modules: a cross-attention Mixer that reads backbone activations into memory slots, a GRU-style Updater that integrates information across denoising steps, and a cross-attention Injector that feeds the updated memory back into backbone activations. We train these modules with K-step unrolling to expose them to multi-step denoising dynamics during fine-tuning.

On LLaDA-8B and Dream-7B, MetaState adds only a negligible number of trainable parameters while keeping the backbone frozen, yet consistently improves accuracy over the frozen baselines. These results demonstrate that persistent cross-step memory is an effective mechanism for bridging denoising steps and improving generation quality in discrete diffusion language models.

Method

MetaState Framework. Three lightweight modules augment a frozen dLLM backbone with persistent working memory across denoising steps.

Mixer

Cross-attention module that reads backbone activations into M fixed-size memory slots, compressing sequence-level information into a compact representation.
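The paper does not include code, but the Mixer can be sketched as single-head cross-attention with the memory slots as queries and the backbone activations as keys/values (a minimal NumPy sketch; all weight names, shapes, and the single-head simplification are assumptions, not the released implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mixer(memory, activations, Wq, Wk, Wv):
    """Read backbone activations into M fixed-size memory slots.

    memory:      (M, d) current memory slots, used as queries
    activations: (L, d) backbone hidden states, used as keys/values
    The output is (M, d) regardless of sequence length L, which is
    what keeps the working memory fixed-size.
    """
    Q, K, V = memory @ Wq, activations @ Wk, activations @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])   # (M, L) slot-to-token scores
    return softmax(scores, axis=-1) @ V       # (M, d) compressed read
```

Because the queries come from the memory side, the read is a length-L-independent summary: doubling the sequence changes what is read, not how much is stored.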

Updater

GRU-style recurrent module that integrates information across denoising steps, allowing the memory to accumulate and refine knowledge over time.
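A minimal GRU-style update can be sketched as follows (NumPy, no bias terms; the weight names are hypothetical). The gate z decides, per slot and per channel, how much of the freshly read information replaces the old memory:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def updater(h_prev, read, Wz, Uz, Wh, Uh):
    """GRU-style integration of a new Mixer read into persistent memory.

    h_prev: (M, d) memory carried over from the previous denoising step
    read:   (M, d) what the Mixer just read from the backbone
    """
    z = sigmoid(read @ Wz + h_prev @ Uz)        # update gate in (0, 1)
    h_cand = np.tanh(read @ Wh + h_prev @ Uh)   # candidate memory content
    return (1.0 - z) * h_prev + z * h_cand      # convex blend per channel
```

The convex blend is what lets the memory accumulate gradually: with z near 0 a slot is preserved across steps, with z near 1 it is overwritten by the new read.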

Injector

Cross-attention module that feeds the updated memory back into backbone activations, enriching each step with persistent cross-step context.
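Putting the three modules together: the Injector is the mirror image of the Mixer (token activations as queries, memory slots as keys/values, added back residually), and the per-step recurrence composes Mixer, Updater, and Injector; K-step unrolled training then backpropagates through K iterations of this loop. The sketch below makes the same assumptions as above (NumPy, single head, hypothetical names), and the frozen backbone is an opaque callable:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys_values, Wq, Wk, Wv):
    """Single-head cross-attention, shared by Mixer and Injector."""
    Q, K, V = queries @ Wq, keys_values @ Wk, keys_values @ Wv
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1]), axis=-1) @ V

def gru_update(h_prev, read, Wz, Uz, Wh, Uh):
    z = 1.0 / (1.0 + np.exp(-(read @ Wz + h_prev @ Uz)))
    h_cand = np.tanh(read @ Wh + h_prev @ Uh)
    return (1.0 - z) * h_prev + z * h_cand

def denoise_k_steps(tokens_hidden, memory, p, backbone, K):
    """K unrolled denoising steps with persistent working memory.

    tokens_hidden: (L, d) activations of the partially masked sequence
    memory:        (M, d) persistent memory, carried across all K steps
    backbone:      frozen dLLM forward pass (stubbed in tests)
    """
    for _ in range(K):
        h = backbone(tokens_hidden)                                # frozen forward
        read = attend(memory, h, p["mq"], p["mk"], p["mv"])        # Mixer
        memory = gru_update(memory, read,
                            p["z"], p["uz"], p["h"], p["uh"])      # Updater
        tokens_hidden = h + attend(h, memory,
                                   p["iq"], p["ik"], p["iv"])      # Injector (residual)
    return tokens_hidden, memory
```

Note that only the Mixer/Updater/Injector weights carry gradients in this picture; the backbone call is frozen, which is why the added trainable parameter count stays negligible.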

Results

Performance comparison across mathematical reasoning and code generation benchmarks (generation length 256, block size 32, dual cache).
Model                 GSM8K   MATH-500   HumanEval   MBPP
Dream backbone (7B)
Dream-Base             73.7       37.6        54.9   52.6
  + MetaState          75.7       46.0        61.0   53.8
    Δ vs. Base         +2.0       +8.4        +6.1   +1.2
Dream-Instruct         76.2       45.0        56.1   51.0
  + MetaState          79.5       46.8        57.3   54.2
    Δ vs. Instruct     +3.3       +1.8        +1.2   +3.2
LLaDA backbone (8B)
LLaDA-Base             67.4       28.8        33.5   25.6
  + MetaState          76.4       38.4        39.6   29.6
    Δ vs. Base         +9.0       +9.6        +6.1   +4.0
LLaDA-Instruct         78.5       36.8        37.2   26.0
  + MetaState          80.0       39.2        39.6   28.6
    Δ vs. Instruct     +1.5       +2.4        +2.4   +2.6

Bold marks the best result per column within each backbone group. Δ denotes improvement over the corresponding baseline.