Research | 2026-06-19

Beyond Uniform Forgetting: A Study of Sequential Direct Preference Optimization Across Preference Settings

arXiv:2606.19744v1 (cross-listed). Abstract: Aligning language models with human preferences often requires optimising multiple behavioural objectives. A practical approach is to apply these objectives sequentially using preference optimisation methods such as Direct Preference Optimisation...

Sequential Preference Optimisation: When Forgetting Becomes a Feature, Not a Bug

The paper Beyond Uniform Forgetting tackles a practical bottleneck in AI alignment: how to teach a language model several behavioural preferences, such as helpfulness, harmlessness, and honesty, without the last lesson erasing the first. The researchers systematically study sequential Direct Preference Optimization (DPO), in which preference objectives are applied one after another rather than simultaneously. Their core finding is that forgetting is not uniform: some learned behaviours degrade severely when a new preference is introduced, while others remain largely stable.
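For readers who want to see the objective that is being applied stage by stage, here is a minimal sketch of the standard DPO loss for a single preference pair, in plain Python. The summary does not say which DPO variant, beta value, or reference model the paper uses, so the names dpo_loss and softplus, the default beta=0.1, and the example log-probabilities are assumptions for illustration only.

```python
import math


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Each argument is the summed token log-probability of a response under
    either the policy being trained or the frozen reference model.
    beta=0.1 is an assumed default, not a value taken from the paper.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) is the same as softplus(-margin).
    return softplus(-margin)


# A policy identical to the reference has zero margin, so the loss is log(2).
baseline = dpo_loss(-10.0, -12.0, -10.0, -12.0)
# Raising the chosen response and lowering the rejected one reduces the loss.
improved = dpo_loss(-9.0, -13.0, -10.0, -12.0)
```

In a sequential setup this loss would be minimised over one preference dataset at a time. Whether the reference model at each stage is the original checkpoint or the output of the previous stage is a design choice that this summary does not specify.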

What the Research Reveals

The study moves beyond the common assumption that sequential fine-tuning inevitably leads to catastrophic forgetting. Instead, it identifies a spectrum of forgetting patterns depending on the nature of the preference. For example, a model trained first to be helpful and then to be harmless may retain its helpfulness on simple queries but lose it on edge cases. The authors propose metrics to measure this non-uniformity and demonstrate that the order of preference training matters significantly. Crucially, they show that careful sequencing—starting with the most “fragile” preference—can mitigate overall performance loss.
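One way to make non-uniform forgetting concrete is to evaluate every preference after every training stage and compare each preference's best score since it was trained with its final score. The sketch below does that. It is not the paper's metric, which this summary does not define: the functions forgetting, non_uniformity, and fragile_first are hypothetical helpers, and the scores are invented for illustration rather than taken from the paper.

```python
def forgetting(order, history):
    """Per-preference forgetting after a sequential run.

    order:   preference names in the order they were trained.
    history: history[i] maps each preference name to its evaluation score
             measured after training stage i.
    Forgetting = best score since the preference was trained - final score.
    """
    final = history[-1]
    result = {}
    for stage, pref in enumerate(order):
        best = max(snapshot[pref] for snapshot in history[stage:])
        result[pref] = best - final[pref]
    return result


def non_uniformity(forgetting_by_pref):
    # Spread between the most and least forgotten preference.
    values = list(forgetting_by_pref.values())
    return max(values) - min(values)


def fragile_first(forgetting_by_pref):
    # Preferences sorted from most to least forgotten.
    return sorted(forgetting_by_pref, key=forgetting_by_pref.get, reverse=True)


# Invented scores for illustration only (not results from the paper).
order = ["helpful", "harmless", "honest"]
history = [
    {"helpful": 0.80, "harmless": 0.40, "honest": 0.50},  # after stage 0
    {"helpful": 0.60, "harmless": 0.85, "honest": 0.52},  # after stage 1
    {"helpful": 0.58, "harmless": 0.83, "honest": 0.78},  # after stage 2
]

by_pref = forgetting(order, history)
spread = non_uniformity(by_pref)
ranking = fragile_first(by_pref)
```

Under this definition the last-trained preference always shows zero forgetting, so a fragility ranking taken from a single run is biased by the order used. Comparing several orderings would give a fairer picture.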

Why This Matters for AI Practitioners

This research addresses a real-world headache for anyone deploying aligned models. Most production pipelines do not train on all preferences at once; they layer safety, style, and domain-specific constraints incrementally. The paper’s insight that forgetting is not a blanket phenomenon means practitioners can now:

Diagnose which preference is most vulnerable to being overwritten by subsequent training steps.
Design training curricula that prioritise fragile preferences early, when the model’s plasticity is highest.
Use selective replay of earlier preference data during later stages, rather than brute-force retraining from scratch.
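The selective-replay idea in the list above can be sketched as a data-mixing step: when training a new preference, add a small budget of earlier preference pairs, weighted towards the preferences that were measured as most vulnerable. This is an illustrative sketch, not the paper's procedure. The function build_stage_dataset, the replay_fraction default of 0.2, and the weights are all assumptions.

```python
import random


def build_stage_dataset(current_pairs, earlier_pairs, weights,
                        replay_fraction=0.2, seed=0):
    """Mix replayed pairs from earlier preferences into the current stage.

    current_pairs:   preference pairs for the preference being trained now.
    earlier_pairs:   dict mapping earlier preference name -> list of pairs.
    weights:         dict mapping preference name -> non-negative weight,
                     for example a measured forgetting score (assumed input).
    replay_fraction: replay budget as a fraction of len(current_pairs).
    """
    rng = random.Random(seed)
    budget = round(replay_fraction * len(current_pairs))
    total = sum(weights.get(pref, 0) for pref in earlier_pairs)
    mixed = list(current_pairs)
    if total > 0:
        for pref, pairs in earlier_pairs.items():
            # Share of the budget proportional to this preference's weight,
            # capped by how many pairs are actually available.
            share = round(budget * weights.get(pref, 0) / total)
            mixed.extend(rng.sample(pairs, min(share, len(pairs))))
    rng.shuffle(mixed)
    return mixed


# Toy data: each pair is tagged with the preference it came from.
current = [("honest", i) for i in range(100)]
earlier = {
    "helpful": [("helpful", i) for i in range(50)],
    "harmless": [("harmless", i) for i in range(50)],
}
weights = {"helpful": 3, "harmless": 1}  # helpful assumed more fragile

stage_data = build_stage_dataset(current, earlier, weights)
```

Because each share is rounded independently, the replayed total only approximately matches the budget in general. Here it is exact: 15 helpful pairs and 5 harmless pairs are added to the 100 current ones.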

The practical implication is clear: sequential alignment can be made more efficient and reliable if we treat forgetting as a predictable, measurable process rather than a black-box failure. For teams using DPO or similar methods, this work provides a framework to audit and optimise their multi-objective training pipelines.

Implications for the Broader Alignment Landscape

The study also challenges the prevailing narrative that alignment must be monolithic or simultaneous to be safe. If sequential training can be made robust, it opens the door to modular alignment—where new preferences are added post-deployment without full retraining. This is especially relevant for open-source models and custom fine-tuning, where compute budgets are limited and full retraining is often infeasible.

Key Takeaways

Forgetting in sequential DPO is not uniform—some preferences degrade more than others, and the degradation pattern depends on both the preference type and the training order.
Training order matters: starting with the most fragile preference can preserve overall alignment quality better than arbitrary sequencing.
Practitioners can audit forgetting using the paper’s proposed metrics, enabling targeted mitigation strategies like selective data replay.
The work supports modular alignment—adding new preferences incrementally without catastrophic loss—which is critical for cost-sensitive and post-deployment scenarios.

Source: arXiv cs.AI (arXiv:2606.19744v1)
