**Yuwei Zhang Sha Li Changlong Yu Qin Lu Shuowei Jin Chengyu Dong Haoran Liu Ilgee Hong Xintong Li Zhenyu Shi Bing Yin Jingbo Shang**

(Work in Progress)

Last Updated on May 12, 2026 | First Published on May 12, 2026

[GitHub] | [Paper]

<aside>

TL;DR

Problem: We find that SDPO struggles in rare-success regimes because raw environment feedback alone is often insufficient to provide useful supervision.
Solution: We propose Reflection-Enhanced Self-Distillation (RESD), which actively interprets environment feedback instead of passively receiving it, turning the learned insights into reusable lessons.
Results: Empirically, RESD better utilizes failed rollouts than SDPO and shows stronger sample efficiency than sparse-reward RL baselines such as GRPO.

</aside>

Figure 1: Qwen3-4B-Thinking-2507 validation performance on Manufactoria-Has during training. Left, per-task accuracy: the percentage of tasks for which all test cases pass. This is used as the reward function. Right, per-test-case accuracy: the percentage of all test cases that pass.

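To make the two metrics in Figure 1 concrete, here is a minimal sketch of how they could be computed from pass/fail results. The function names and the nested-list input layout are our own illustrative assumptions, not the evaluation code used for the figure.

```python
from typing import Sequence


def per_task_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks for which every test case passes.

    `results[i][j]` is True if test case j of task i passed
    (hypothetical layout, chosen for illustration).
    A task with no test cases is counted as failed.
    """
    if not results:
        return 0.0
    solved = sum(1 for task in results if task and all(task))
    return solved / len(results)


def per_test_case_accuracy(results: Sequence[Sequence[bool]]) -> float:
    """Fraction of individual test cases that pass, pooled over all tasks."""
    total = sum(len(task) for task in results)
    if total == 0:
        return 0.0
    passed = sum(sum(task) for task in results)
    return passed / total


# Three tasks: one fully solved, one partially solved, one fully failed.
results = [[True, True, True], [True, False, True], [False, False]]
task_acc = per_task_accuracy(results)       # 1 of 3 tasks fully solved
case_acc = per_test_case_accuracy(results)  # 5 of 8 test cases pass
```

The per-task metric gives no credit for the partially solved task, which is why it is a much sparser signal than the per-test-case metric.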

Background: OPD & SDPO

SDPO Relies on Successful Peer Demonstrations

SDPO [Hübotter et al. 2026] employs a privileged teacher to provide token-level supervision for student-generated trajectories, where the teacher model weights are copied from the student. As shown in Figure 2, the original SDPO places both successful demonstrations and textual environment feedback in the privileged self-teacher prompt.

Figure 2: The teacher prompt includes two types of privileged context: (1) successful peer rollouts for the same problem, used as demonstrations; and (2) environment output from a previous unsuccessful attempt. When a successful rollout is available, environment feedback is not appended. A successful solution is not used to critique itself; it is only used to supervise failed peer rollouts.

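The selection rule described in the Figure 2 caption can be sketched as follows. This is only our reading of the caption, not the SDPO implementation: the `Rollout` class, the `build_privileged_context` function, and the prompt strings are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Rollout:
    """One sampled attempt at a problem (hypothetical container)."""
    response: str
    success: bool
    feedback: str = ""  # textual environment output for this attempt


def build_privileged_context(target_idx: int, group: List[Rollout]) -> Optional[str]:
    """Pick the privileged context for the teacher when scoring one rollout.

    Assumed rule, following the caption:
      * a successful rollout is never used to critique itself;
      * a failed rollout is paired with a successful peer if one exists;
      * otherwise it falls back to its own environment feedback.
    """
    target = group[target_idx]
    if target.success:
        return None  # no self-critique of successful solutions

    demos = [r.response for r in group if r.success]
    if demos:
        # A demonstration is available, so feedback is not appended.
        return "Successful solution:\n" + demos[0]
    if target.feedback:
        return "Environment feedback from the previous attempt:\n" + target.feedback
    return None


group = [
    Rollout("solution A", success=True),
    Rollout("buggy attempt", success=False, feedback="test 2 failed"),
]
single = [Rollout("buggy attempt", success=False, feedback="test 2 failed")]
```

With a group of size one, as in `single`, the demonstration branch can never fire, so the teacher only ever sees raw environment feedback.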

This design raises the following question:

<aside> ❓

How much does the feedback formulation affect SDPO performance?

</aside>

To test this, we set up an experiment on FiNER [Zhang et al. 2025] where we constrain SDPO to sample a single trajectory at a time, which prevents the privileged context from including successful demonstrations. The comparison is shown below:

Figure 3: Qwen3-4B-Thinking-2507 validation performance on FiNER. We ablate different rollout sizes as indicated in the plot. With only the inclusion of local reflection and global playbook as context, and without any successful demonstrations, SDPO+Ref recovers the performance of N=8 with a single rollout.


In contrast to the original sample-rich setting, single-rollout SDPO largely fails to learn from environment feedback alone. We therefore hypothesize that SDPO relies predominantly on successful peer solutions for learning.

SDPO Breaks Down in the Face of “Hard Tasks”

Learning from feedback is especially important when successful rollouts are rare early in training. Therefore, we set up an experiment with Manufactoria-Has [Sun et al. 2025], where the initial model achieves a near-zero success rate. The task requires the model to write a finite-state machine program in a domain-specific language that processes an input tape, a sequence of colored symbols (e.g., GRBRBRY, BRGBY), one symbol at a time from left to right. At each step, the current state (s0, s1, s2 …) examines the next symbol, transitions to a new state based on its color, and accepts if it reaches a designated accept state before the tape runs out. The following GIF shows an example.

Figure 4: Example task on Manufactoria-Has: “Accept if the tape contains BRBR.” The machine starts in state s0 (no pattern matched). It first encounters G, which PULLER_RB cannot read, so it falls through to s0_yg, which skips the G and returns to s0. Next, R is consumed but doesn't start the pattern, so the machine stays in s0. Upon reading B, it transitions to s1 (matched "B"), then R → s2 ("BR"), B → s3 ("BRB"), and finally R → end ("BRBR" found). The remaining Y is never read: the tape is accepted as soon as the accept state is reached. The code panel on the right highlights the active transition at each step.
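For readers who prefer code to animation, the walkthrough in Figure 4 can be approximated with a small Python simulation. This is not the Manufactoria DSL, and the table below is only a sketch: the transitions on the success path (s0 → s1 → s2 → s3 → end) follow the caption, while the fallback transitions taken on a mismatch in s1, s2, and s3 are our own assumption about how a correct "contains BRBR" machine would behave.

```python
ACCEPT = "end"

# state -> {symbol: next_state}; any symbol not listed resets to s0.
TRANSITIONS = {
    "s0": {"B": "s1"},                 # nothing matched yet
    "s1": {"B": "s1", "R": "s2"},      # matched "B"
    "s2": {"B": "s3"},                 # matched "BR"
    "s3": {"B": "s1", "R": ACCEPT},    # matched "BRB"
}


def accepts(tape: str) -> bool:
    """Return True if the tape contains the pattern BRBR.

    Symbols are read one at a time from left to right, and the machine
    stops as soon as the accept state is reached.
    """
    state = "s0"
    for symbol in tape:
        state = TRANSITIONS[state].get(symbol, "s0")
        if state == ACCEPT:
            return True  # remaining symbols are never read
    return False


figure_tape_accepted = accepts("GRBRBRY")  # the tape traced in Figure 4
other_tape_accepted = accepts("BRGBY")     # G interrupts the pattern
```

The `B → s1` fallback in s3 matters for tapes such as BRBBRBR, where a failed match still leaves a usable "B" prefix; resetting to s0 there would miss the pattern.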

We provide detailed feedback for each possible failure mode so that the model can infer the cause of the error.