Paper Review: Training Large Language Models to Reason in a Continuous Latent Space

LLM reasoning: Chain of Continuous Thought instead of Chain-of-Thought!

Andrew Lukyanenko
5 min read · Jan 6, 2025

Paper

Coconut (Chain of Continuous Thought) is a new reasoning paradigm for LLMs that operates in latent space, using continuous thought states derived from the model’s last hidden layer instead of text-based reasoning. These states are fed back into the model as input embeddings, enabling it to explore multiple reasoning paths in parallel, in a pattern resembling breadth-first search, rather than committing to a single path. By avoiding the inefficiencies of language-based reasoning, Coconut demonstrates improved performance on logical tasks requiring backtracking, while using fewer tokens during inference.

The approach

Coconut alternates between “language mode” and “latent mode.” In language mode, the model operates like a standard language model, autoregressively generating tokens. In latent mode, reasoning is conducted in an unconstrained latent space, where the model uses the last hidden state as the next input embedding, bypassing text-based reasoning. Special tokens <bot> and <eot> mark the start and end of the latent mode.
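To make the mechanics concrete, here is a minimal sketch of a latent-mode step with a Hugging Face GPT-2 (the base model used in the paper). The helper name latent_step and the example prompt are illustrative, not from the paper’s code, and the <bot>/<eot> marker tokens are omitted for brevity:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

model = GPT2LMHeadModel.from_pretrained("gpt2")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")

@torch.no_grad()
def latent_step(inputs_embeds: torch.Tensor) -> torch.Tensor:
    """One latent-mode step: run the model on a sequence of input
    embeddings and append the last position's final hidden state as
    the next input embedding (a "continuous thought"), instead of
    embedding a sampled token."""
    out = model(inputs_embeds=inputs_embeds, output_hidden_states=True)
    last_hidden = out.hidden_states[-1][:, -1:, :]  # (batch, 1, hidden)
    return torch.cat([inputs_embeds, last_hidden], dim=1)

# Language mode: embed the question tokens as usual ...
ids = tokenizer("Every yumpus is a wumpus. Alex is a yumpus. Alex is a",
                return_tensors="pt").input_ids
embeds = model.transformer.wte(ids)
# ... latent mode: take two continuous thoughts.
for _ in range(2):
    embeds = latent_step(embeds)
```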

The training process uses a multi-stage curriculum. Initially, the model trains on standard CoT data. In later stages, language reasoning steps are incrementally replaced by latent thoughts, with hyperparameters controlling the ratio of latent to language reasoning. The model optimizes the normal negative log-likelihood loss but masks the loss on questions and latent thoughts. This pushes the model to optimize for future reasoning prediction rather than compressing language steps, allowing it to learn more efficient representations.
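As a hedged sketch of the data side of this curriculum: at stage k, the first k language reasoning steps are replaced by k·c latent slots (c continuous thoughts per removed step), and the labels for the question and latent positions are masked. All identifiers below are illustrative; latent_id is a placeholder token whose embedding is overwritten by the previous hidden state during the forward pass:

```python
IGNORE_INDEX = -100  # standard PyTorch convention for masked labels

def build_stage_example(question_ids, step_ids_list, answer_ids,
                        stage, c, bot_id, eot_id, latent_id):
    """Build input ids and labels for one example at stage `stage`:
    the first `stage` language steps become `stage * c` latent slots,
    with the loss masked on the question and on latent positions."""
    n_latent = stage * c
    input_ids = question_ids + [bot_id] + [latent_id] * n_latent + [eot_id]
    labels = [IGNORE_INDEX] * len(input_ids)  # no loss on these positions
    for step_ids in step_ids_list[stage:]:    # remaining language steps
        input_ids += step_ids
        labels += list(step_ids)
    input_ids += answer_ids
    labels += list(answer_ids)
    return input_ids, labels
```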

Inference in Coconut is similar to standard LLM decoding, except that in latent mode the last hidden state is fed back directly as the next input embedding. To decide when to end latent mode, the authors consider either a binary classifier on latent states or a constant padding length; the latter was ultimately chosen for simplicity.
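Building on the latent_step helper above, a minimal inference sketch with a constant number of latent steps might look like the following. Here n_latent is a free hyperparameter of the sketch, not a value from the paper, and the <bot>/<eot> marker embeddings are again omitted:

```python
@torch.no_grad()
def generate(question: str, n_latent: int = 4, max_new_tokens: int = 32) -> str:
    ids = tokenizer(question, return_tensors="pt").input_ids
    embeds = model.transformer.wte(ids)
    for _ in range(n_latent):          # latent mode: fixed number of thoughts
        embeds = latent_step(embeds)
    out_ids = []
    for _ in range(max_new_tokens):    # language mode: ordinary decoding
        logits = model(inputs_embeds=embeds).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)  # greedy pick
        if next_id.item() == tokenizer.eos_token_id:
            break
        out_ids.append(next_id.item())
        embeds = torch.cat([embeds, model.transformer.wte(next_id)], dim=1)
    return tokenizer.decode(out_ids)
```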

Experiments

The base model is a pre-trained GPT-2.

The authors compare Coconut against several baselines and its own variants to assess reasoning performance:

Baselines:

  • CoT: Models trained to generate complete reasoning chains before outputting answers, and evaluated the same way at inference.
  • No-CoT: Models trained to directly generate answers without reasoning chains.
  • iCoT: Gradually removes initial tokens of reasoning chains during training, leaving only the answer by the end, with the model directly predicting the answer during inference.
  • Pause token: Inserts <pause> tokens between questions and answers to simulate additional computational capacity, matching the number of continuous thoughts in Coconut.

Coconut Variants:

  • w/o curriculum: Trains Coconut directly on the final-stage data (questions and answers only) without multi-stage training, using continuous thoughts to solve the entire problem.
  • w/o thought: Retains multi-stage training but excludes continuous thoughts, gradually removing language reasoning steps. This is similar to iCoT but uses Coconut’s schedule for a stricter comparison.
  • Pause as thought: Replaces continuous thoughts with <pause> tokens, applying the same multi-stage training curriculum as Coconut.

The experiments demonstrate that Coconut significantly enhances LLM reasoning, outperforming No-CoT and iCoT across tasks, and even conventional CoT on tasks requiring complex planning and contextual understanding. Chaining continuous thoughts enables deeper reasoning and scales to more complex problems: accuracy on GSM8k improves as more latent thoughts are used, and the largest gains appear on planning-heavy tasks like ProsQA.

CoT struggles in planning-intensive tasks like ProsQA, but Coconut achieves top performance using a multi-stage training curriculum that gradually introduces latent reasoning, showing that guidance during training is essential. Models trained without this curriculum fail to perform better than No-CoT, emphasizing the need for structured learning strategies.

Continuous thoughts also offer efficient reasoning representations, capturing intermediate variables and multiple reasoning traces in planning-intensive tasks.

Understanding the Latent Reasoning in Coconut

Coconut has superior reasoning and planning capabilities compared to traditional CoT. Increasing the use of continuous thoughts raises answer accuracy and the rate of correct reasoning processes, while reducing errors such as hallucinations and wrong targets. This indicates enhanced planning ability when reasoning shifts to the latent space.

A case study shows that CoT may hallucinate nonexistent connections, while Coconut progressively refines its reasoning by avoiding premature commitments. For example, Coconut with k=2 continuous thoughts successfully solves a problem that CoT and Coconut with k=1 fail to resolve, demonstrating the advantage of latent reasoning in eliminating incorrect options step by step.

Even when Coconut is forced to generate a complete reasoning chain (as with CoT), it achieves higher accuracy and produces more accurate reasoning paths with fewer hallucinations. This is attributed to Coconut’s training, which mixes different stages and hides initial reasoning steps, encouraging the model to focus on future steps and plan ahead. By contrast, CoT’s training focuses narrowly on immediate next-step predictions, making it less effective for complex planning tasks.

Coconut’s latent reasoning operates as a search tree rather than a linear reasoning chain, allowing the model to explore multiple potential paths simultaneously. At each step, the model prioritizes promising nodes while pruning less relevant ones. For example, when reasoning about Alex’s children, the model assigns probabilities to each potential next step, reflecting its implicit value function. Over successive steps, the model refines these probabilities, narrowing its focus as it gains confidence in the most promising paths.
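One way to inspect this implicit value function, in the spirit of the paper’s analysis, is to project a continuous thought through the language-model head and read off the induced distribution over candidate next tokens. This probe is a rough sketch reusing model, tokenizer, and latent_step from the earlier snippets:

```python
@torch.no_grad()
def probe_thought(embeds: torch.Tensor, top_k: int = 5):
    """Project the most recent continuous thought through the LM head
    and return the top-k candidate next tokens with probabilities,
    a rough view of the model's implicit value over reasoning steps."""
    latent = embeds[:, -1, :]                       # last latent state
    probs = torch.softmax(model.lm_head(latent), dim=-1)
    top = probs.topk(top_k, dim=-1)
    return [(tokenizer.decode([int(i)]), float(p))
            for i, p in zip(top.indices[0], top.values[0])]
```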

Early latent thoughts show broad exploration, maintaining diversity in reasoning paths, as seen in the small gaps between the top candidates’ probabilities. In later thoughts, this diversity decreases as the model transitions from parallel exploration to more focused reasoning. This dynamic demonstrates Coconut’s ability to balance exploration and exploitation in reasoning.

Original blogpost
