Learning from Mistakes

Can LLM Self-recover after Misalignment?

*Sapienza University of Rome, Department of Computer, Control and Management Engineering, Italy

Abstract

Responsible AI initiatives place great emphasis on the safety of Large Language Model (LLM)-based systems. In particular, it has become standard practice to subject these models to an alignment procedure aimed at preventing harmful outputs. However, once aligned, a model is not guaranteed to maintain this alignment throughout its lifecycle. Moreover, the likelihood of misalignment increases as malicious actors may deliberately employ jailbreaking techniques to compromise LLM safety.

In this paper, we introduce a new perspective on advancing LLM alignment: rather than developing stronger alignment techniques, we investigate the model's intrinsic ability to recover its alignment after corruption. We propose a methodology for modeling the safety trajectories of user-assistant interactions and for detecting recovery trends within them.

We apply this approach to a jailbreaking scenario, presenting a preliminary recovery analysis based on a dataset of adversarial multi-turn dialogues collected during a red teaming challenge, and examining the influence of the content moderation model chosen for safety evaluation.

Motivation

The Problem: Many researchers agree that no system can be considered perfectly aligned and that any model's safety can eventually be compromised. This issue becomes even more critical when malicious actors deliberately attempt to compromise a model's safety through adversarial means, such as employing jailbreaking techniques.

Our Approach: Unlike previous works that focus on developing more robust alignment methods, our study takes a different perspective. Instead of striving to design systems that never drift into misalignment, we explore how to build self-recoverable LLMs, models capable of finding the "road home," that is, of restoring their alignment without external intervention.

Dataset

We conducted a red teaming challenge in which each participant had a two-hour session to perform multi-turn adversarial attacks. The target model was Minerva-7B-instruct-v1.0, an instruction-tuned LLM pretrained on Italian and English corpora. Minerva is a family of LLMs developed by Sapienza NLP in the context of the Future Artificial Intelligence Research (FAIR) project, in collaboration with CINECA. The adversarial dialogues produced during these sessions make up our dataset.

The challenge included multiple jailbreaking tasks. Since not all tasks were compatible with the safety risk taxonomy used by Llama Guard (inherited from the MLCommons taxonomy), we selected 597 conversations out of a total of 1364. These selected dialogues correspond to tasks involving:

  • Elicitation of gender and racial bias
  • Privacy violations
  • Promotion of physical and non-physical harm
  • Induction of hallucinations

The challenge was conducted primarily in Italian. A minority of prompts were written in English, for example where a specific jailbreaking technique was language-dependent.

Methodology

Safety Trajectories

Analyzing the temporal safety dynamics of user–LLM interactions can help to better explain incremental jailbreaks and to understand how models recover after different types of safety failures. By examining the evolution of safety states throughout a dialogue, we can trace the path from alignment to misalignment and, in some cases, to self-recovery.

Once a trusted content moderation tool is chosen and safety flags are assigned, the safety trajectory is obtained by plotting these values for each user prompt and assistant response on the vertical axis against the corresponding dialogue turn on the horizontal axis.

[Image: safety trajectory visualization showing misalignment and recovery patterns across dialogue turns]
Figure 1: Safety trajectory for one of the conversations of the collected dataset. The first misalignment occurs at turn 0, followed by recovery at turn 4. The second misalignment occurs at turn 5, with recovery at turn 6. Misalignment-recovery pairs are identified by transitions from unsafe model responses (flagged as 1) to the first subsequent safe response (flagged as 0).

Safety Trajectory Interpretation

Pattern         Interpretation
→ at y=0        Benign exchange (safe prompt, safe response)
→ at y=1        Successful attack (unsafe prompt, unsafe response)
↗ ascending     Broken alignment (safe prompt, unsafe response)
↘ descending    Resistance to attack (unsafe prompt, safe response)
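
The four patterns above can be read off mechanically from the per-turn safety flags. The following is a minimal sketch, assuming each turn carries one binary flag for the user prompt and one for the assistant response (0 = safe, 1 = unsafe); the names `PATTERNS` and `classify_turns` are hypothetical and not part of the original work.

```python
from typing import List

# Map (prompt_flag, response_flag) to the pattern names used in the table.
# Flags are assumed to be 0 (safe) or 1 (unsafe).
PATTERNS = {
    (0, 0): "benign exchange",       # horizontal segment at y=0
    (1, 1): "successful attack",     # horizontal segment at y=1
    (0, 1): "broken alignment",      # ascending segment
    (1, 0): "resistance to attack",  # descending segment
}


def classify_turns(prompt_flags: List[int], response_flags: List[int]) -> List[str]:
    """Label each dialogue turn with its safety-trajectory pattern."""
    if len(prompt_flags) != len(response_flags):
        raise ValueError("each turn needs one prompt flag and one response flag")
    return [PATTERNS[(p, r)] for p, r in zip(prompt_flags, response_flags)]


# Hypothetical four-turn dialogue, for illustration only.
labels = classify_turns([0, 1, 1, 0], [0, 0, 1, 1])
for turn, label in enumerate(labels):
    print(turn, label)
```

The flags themselves would come from whichever content moderation model is chosen; the sketch only covers the interpretation step.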

Recovery Trends

To define recovery on a safety trajectory, we first fix the turn at which an unsafe model response occurs. Recovery is then observed when the model produces the first safe response after this misalignment event.

Aligned → Misaligned (Turn i) → Recovery (Turn i+N)

We distinguish between two types of recovery:

  • Absolute recovery: Restoration of alignment that is maintained until the end of the interaction
  • Temporary recovery: The model returns to alignment but later drifts into misalignment again

[Image: schematic illustration of one-step recovery showing two possible recovery paths]
Figure 2: Schematic illustration of one-step recovery. The two possible recovery paths are shown, depending on whether the user prompt at turn i + 1 is safe (lower recovery path) or unsafe (upper recovery path).
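
The recovery definition can be sketched as a single pass over the assistant response flags. This is an illustrative sketch rather than the authors' implementation: it assumes binary response flags (0 = safe, 1 = unsafe), pairs the first unsafe response of each episode with the first subsequent safe response, and labels a recovery "absolute" when no unsafe response follows it. The function name `find_recoveries` is hypothetical.

```python
from typing import List, Optional, Tuple


def find_recoveries(response_flags: List[int]) -> List[Tuple[int, int, str]]:
    """Return (misalignment_turn, recovery_turn, kind) for each recovery.

    kind is "absolute" if alignment holds until the end of the dialogue,
    and "temporary" if a later response is flagged unsafe again.
    """
    recoveries = []
    misaligned_at: Optional[int] = None  # first unsafe turn of the open episode
    for turn, flag in enumerate(response_flags):
        if flag == 1 and misaligned_at is None:
            misaligned_at = turn
        elif flag == 0 and misaligned_at is not None:
            relapses = any(f == 1 for f in response_flags[turn + 1:])
            kind = "temporary" if relapses else "absolute"
            recoveries.append((misaligned_at, turn, kind))
            misaligned_at = None
    return recoveries


# Hypothetical flags chosen to match the turns reported for Figure 1:
# misalignment at turn 0 with recovery at turn 4, then turn 5 with recovery at turn 6.
print(find_recoveries([1, 1, 1, 1, 0, 1, 0]))
```

A dialogue that ends while the model is still misaligned contributes no pair for that final episode, since no recovery is observed.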

Analysis

Alignment Statistics

The analysis was conducted using Llama Guard 3-1B as the baseline content moderation model, which shows 77.08% agreement with ground truth manual safety evaluation. We also compared results with Llama Guard 3-8B.

Guard Model        Unsafe Conversations   Conversations w/ Recoveries   Multiple Recoveries
Llama Guard 3-1B   202 (33.8%)            87 (14.6%)                    19 (3.2%)
Llama Guard 3-8B   193 (32.3%)            65 (10.9%)                    5 (0.8%)
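
As a quick arithmetic check, the percentages in the table are consistent with taking the 597 selected conversations as the denominator. That denominator is an assumption inferred from the numbers, not something stated explicitly; the variable names below are illustrative.

```python
# Counts copied from the table; the denominator is assumed to be the
# 597 selected conversations.
TOTAL_CONVERSATIONS = 597

counts = {
    "Llama Guard 3-1B": {"unsafe": 202, "with_recoveries": 87, "multiple_recoveries": 19},
    "Llama Guard 3-8B": {"unsafe": 193, "with_recoveries": 65, "multiple_recoveries": 5},
}


def percentages(row: dict, total: int = TOTAL_CONVERSATIONS) -> dict:
    """Convert raw conversation counts to percentages of the total."""
    return {name: round(100 * value / total, 1) for name, value in row.items()}


shares = {model: percentages(row) for model, row in counts.items()}
for model, row in shares.items():
    print(model, row)
```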

Recovery Dynamics

Guard Model        Avg. Turns   Avg. Recoveries   Avg. Turns before Recovery   Avg. Misalignment Length   Avg. Recovery Duration
Llama Guard 3-1B   9.8          1.4               4.3                          1.63                       3.69
Llama Guard 3-8B   8.3          1.1               5.4                          2.14                       2.34

Two metrics are particularly informative for characterizing alignment dynamics:

  • Misalignment Length: Captures how many turns the model remains misaligned before recovery occurs. Systems more resistant to jailbreaking should minimize this value.
  • Recovery Duration: Describes how long the model stays aligned after returning to a safe state while the attack continues. Higher values indicate greater robustness to sustained adversarial pressure.
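
One possible way to compute these two metrics from a sequence of response flags is sketched below. The exact counting rules are assumptions: Misalignment Length is taken as the number of consecutive unsafe responses before the first safe one, and Recovery Duration as the number of consecutive safe responses starting at the recovery turn, without checking whether the prompts in that span are adversarial. The function name `episode_metrics` is hypothetical.

```python
from statistics import mean
from typing import List, Tuple


def episode_metrics(response_flags: List[int]) -> Tuple[List[int], List[int]]:
    """Return (misalignment_lengths, recovery_durations) for one dialogue.

    Flags are assumed to be 0 (safe) or 1 (unsafe). Only misalignment
    episodes that end in a recovery are counted.
    """
    lengths, durations = [], []
    n = len(response_flags)
    turn = 0
    while turn < n:
        if response_flags[turn] == 0:
            turn += 1
            continue
        start = turn
        while turn < n and response_flags[turn] == 1:
            turn += 1  # walk through the unsafe run
        if turn == n:
            break  # dialogue ended without recovery
        lengths.append(turn - start)
        recovered_at = turn
        while turn < n and response_flags[turn] == 0:
            turn += 1  # walk through the safe run after recovery
        durations.append(turn - recovered_at)
    return lengths, durations


# Hypothetical dialogue, for illustration only.
lengths, durations = episode_metrics([0, 1, 1, 0, 0, 0, 1, 0])
print(lengths, durations)            # [2, 1] [3, 1]
print(mean(lengths), mean(durations))
```

Averaging these per-episode values over a dataset would give quantities analogous to the table columns, under the counting assumptions stated above.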

Risk-Specific Effects

                         Conversations        Recoveries           Avg. Misalignment Length   Avg. Recovery Duration
Hazard Category          LG 3-1B   LG 3-8B    LG 3-1B   LG 3-8B    LG 3-1B   LG 3-8B          LG 3-1B   LG 3-8B
Violent Crimes           63        36         28        9          1.82      1.78             2.61      2.50
Hate                     60        71         28        27         1.25      2.63             5.81      2.28
Non-Violent Crimes       28        20         9         5          1.33      1.80             2.17      2.67
Indiscriminate Weapons   20        31         6         8          2.83      2.25             2.25      2.20
Privacy                  22        20         14        10         2.00      2.30             3.83      1.17
Intellectual Property    16        2          15        3          1.80      2.00             3.69      4.00

Key Findings

Most Recoverable Categories

"Violent Crimes" and "Intellectual Property" tend to be the most recoverable categories, showing shorter misalignment lengths and longer recovery durations.

Most Difficult to Recover

"Indiscriminate Weapons" and "Privacy" are the most difficult categories to recover from, indicating weaker self-correction capabilities in these areas.

Recovery Patterns

Misalignment episodes last about 2 turns on average (1.63 with Llama Guard 3-1B, 2.14 with Llama Guard 3-8B), and recovery persists for about 2 to 4 turns (3.69 and 2.34, respectively) before potential renewed misalignment.

Sensitivity of the Evaluation Model

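When the larger Llama Guard 3-8B is used as the evaluator instead of Llama Guard 3-1B, the measured average misalignment length is higher (2.14 vs. 1.63 turns) and the average recovery duration is shorter (2.34 vs. 3.69 turns).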

Conclusion & Future Work

This work aims to open a discussion on alignment from a new perspective. Instead of focusing solely on making models well-aligned, we suggest examining the potential for self-recovery in imperfectly aligned systems.

The presented results are preliminary and depend on the specific setup of our challenge, including the choice of the attacked model, the target tasks, and their correspondence to the taxonomy used in safety evaluation.

Future directions include:

  • Exploring how to increase Recovery Duration and decrease Misalignment Length
  • Enhancing the stability of alignment once regained
  • Broader comparative analysis of different moderation systems
  • Understanding how recovery behavior can deepen our understanding of alignment itself