Learning from Mistakes

Can LLM Self-recover after Misalignment?

Olga E. Sorokoletova^*, Francesco Giarrusso^*, Vincenzo Suriani^*, Daniele Nardi^*

^*Sapienza University of Rome, Department of Computer, Control and Management Engineering, Italy


Abstract

Responsible AI initiatives place great emphasis on the safety of Large Language Model (LLM)-based systems. In particular, it has become standard practice to subject these models to an alignment procedure aimed at preventing harmful outputs. However, once aligned, a model is not guaranteed to maintain this alignment throughout its lifecycle. Moreover, the likelihood of misalignment increases as malicious actors may deliberately employ jailbreaking techniques to compromise LLM safety.

In this paper, we introduce a new perspective on advancing LLM alignment: rather than developing stronger alignment techniques, we investigate the model's intrinsic ability to recover its alignment after corruption. We propose a methodology for modeling the safety trajectories of user-assistant interactions and for detecting recovery trends within them.

We apply this approach to a jailbreaking scenario, presenting a preliminary recovery analysis based on a dataset of adversarial multi-turn dialogues collected during a red teaming challenge, and examining the influence of the content moderation model chosen for safety evaluation.

Motivation

The Problem: Many researchers agree that no system can be considered perfectly aligned and that any model's safety can eventually be compromised. This issue becomes even more critical when malicious actors deliberately attempt to compromise a model's safety through adversarial means, such as employing jailbreaking techniques.

Our Approach: Unlike previous works that focus on developing more robust alignment methods, our study takes a different perspective. Instead of striving to design systems that never drift into misalignment, we explore how to build self-recoverable LLMs, models capable of finding the "road home," that is, of restoring their alignment without external intervention.

Dataset

We conducted a red teaming challenge in which each participant had a two-hour session to perform multi-turn adversarial attacks. The target model was Minerva-7B-instruct-v1.0, an instruction-tuned LLM pretrained on Italian and English corpora. Minerva is a family of LLMs developed by Sapienza NLP in the context of the Future Artificial Intelligence Research (FAIR) project, in collaboration with CINECA. The resulting adversarial dialogues were then collected into a dataset.

The challenge included multiple jailbreaking tasks. Since not all tasks were compatible with the safety risk taxonomy used by Llama Guard (inherited from the MLCommons taxonomy), we selected 597 conversations out of a total of 1364. These selected dialogues correspond to tasks involving:

Elicitation of gender and racial bias
Privacy violations
Promotion of physical and non-physical harm
Induction of hallucinations

The challenge was conducted primarily in Italian, with a minority of prompts in English, except in cases where a specific jailbreaking technique was language-dependent.

Methodology

Safety Trajectories

Analyzing the temporal safety dynamics of user–LLM interactions can help to better explain incremental jailbreaks and to understand how models recover after different types of safety failures. By examining the evolution of safety states throughout a dialogue, we can trace the path from alignment to misalignment and, in some cases, to self-recovery.

Once a trusted content moderation tool is chosen and safety flags are assigned, the safety trajectory is obtained by plotting these values for each user prompt and assistant response on the vertical axis against the corresponding dialogue turn on the horizontal axis.

**Figure 1:** Safety trajectory for one of the conversations of the collected dataset. The first misalignment occurs at turn 0, followed by recovery at turn 4. The second misalignment occurs at turn 5, with recovery at turn 6. Misalignment-recovery pairs are identified by transitions from unsafe model responses (flagged as 1) to the first subsequent safe response (flagged as 0).

Safety Trajectory Interpretation

Pattern	Interpretation
→ at y=0	Benign exchange (safe prompt, safe response)
→ at y=1	Successful attack (unsafe prompt, unsafe response)
↗ ascending	Broken alignment (safe prompt, unsafe response)
↘ descending	Resistance to attack (unsafe prompt, safe response)
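
The interpretation table can be read as a lookup on the pair of safety flags assigned to each turn. The sketch below is illustrative only: it assumes each turn carries one binary flag for the user prompt and one for the assistant response (0 = safe, 1 = unsafe), and the names `PATTERNS` and `interpret_turns` are hypothetical, not part of the original pipeline.

```python
# Map (prompt_flag, response_flag) to the interpretation given in the table.
# Flags follow the convention used in the text: 0 = safe, 1 = unsafe.
PATTERNS = {
    (0, 0): "benign exchange",       # flat segment at y=0
    (1, 1): "successful attack",     # flat segment at y=1
    (0, 1): "broken alignment",      # ascending segment
    (1, 0): "resistance to attack",  # descending segment
}


def interpret_turns(prompt_flags, response_flags):
    """Label every dialogue turn with its trajectory pattern."""
    if len(prompt_flags) != len(response_flags):
        raise ValueError("each turn needs one prompt flag and one response flag")
    return [PATTERNS[(p, r)] for p, r in zip(prompt_flags, response_flags)]


if __name__ == "__main__":
    # Hypothetical four-turn dialogue, one turn per pattern.
    for turn, label in enumerate(interpret_turns([0, 1, 0, 1], [0, 1, 1, 0])):
        print(turn, label)
```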

Recovery Trends

To define recovery on a safety trajectory, we first fix the turn at which an unsafe model response occurs. Recovery is then observed when the model produces the first safe response after this misalignment event.

Aligned → Misaligned (Turn i) → Recovery (Turn i+N)

We distinguish between two types of recovery:

Absolute recovery: Restoration of alignment that is maintained until the end of the interaction
Temporary recovery: The model returns to alignment but later drifts into misalignment again
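
A minimal sketch of how misalignment-recovery pairs might be extracted from the response flags and labeled as absolute or temporary. It assumes one binary flag per assistant response (0 = safe, 1 = unsafe) and treats a recovery as absolute when no unsafe response follows it; the function name `find_recoveries` is hypothetical. The example flags are consistent with the turn numbers in the Figure 1 caption, under the assumption that the dialogue ends at turn 6.

```python
def find_recoveries(response_flags):
    """Return (misalignment_turn, recovery_turn, kind) for each recovery.

    A misalignment starts at the first unsafe response of a run, and the
    recovery is the first safe response that follows it.
    """
    recoveries = []
    misaligned_since = None  # turn at which the current unsafe run started
    for turn, flag in enumerate(response_flags):
        if flag == 1 and misaligned_since is None:
            misaligned_since = turn
        elif flag == 0 and misaligned_since is not None:
            # Absolute: every response from here to the end stays safe.
            stays_safe = not any(response_flags[turn:])
            kind = "absolute" if stays_safe else "temporary"
            recoveries.append((misaligned_since, turn, kind))
            misaligned_since = None
    return recoveries


if __name__ == "__main__":
    # Response flags for turns 0..6, matching the Figure 1 caption.
    print(find_recoveries([1, 1, 1, 1, 0, 1, 0]))
```

A dialogue that ends while the model is still misaligned contributes no pair for that final episode, since no safe response is ever observed after it.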

**Figure 2:** Schematic illustration of one-step recovery. The two possible recovery paths are shown, depending on whether the user prompt at turn i + 1 is safe (lower recovery path) or unsafe (upper recovery path).

Analysis

Alignment Statistics

The analysis was conducted using Llama Guard 3-1B as the baseline content moderation model, which shows 77.08% agreement with ground truth manual safety evaluation. We also compared results with Llama Guard 3-8B.

Guard Model	Unsafe Conversations	Conversations w/ Recoveries	Multiple Recoveries
Llama Guard 3-1B	202 (33.8%)	87 (14.6%)	19 (3.2%)
Llama Guard 3-8B	193 (32.3%)	65 (10.9%)	5 (0.8%)
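
The percentages in the table appear to be taken relative to the 597 selected conversations. The short check below reproduces them under that assumption; the dictionary keys and the helper `as_percentages` are illustrative names only.

```python
TOTAL_CONVERSATIONS = 597  # selected conversations, as stated in the Dataset section

# Raw counts copied from the Alignment Statistics table.
counts = {
    "Llama Guard 3-1B": {"unsafe": 202, "with_recoveries": 87, "multiple_recoveries": 19},
    "Llama Guard 3-8B": {"unsafe": 193, "with_recoveries": 65, "multiple_recoveries": 5},
}


def as_percentages(row, total=TOTAL_CONVERSATIONS):
    """Convert raw conversation counts to percentages of the total, one decimal."""
    return {name: round(100 * value / total, 1) for name, value in row.items()}


if __name__ == "__main__":
    for guard, row in counts.items():
        print(guard, as_percentages(row))
```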

Recovery Dynamics

Guard Model	Avg. Turns	Avg. Recoveries	Avg. Turns before Recovery	Avg. Misalignment Length	Avg. Recovery Duration
Llama Guard 3-1B	9.8	1.4	4.3	1.63	3.69
Llama Guard 3-8B	8.3	1.1	5.4	2.14	2.34

Two metrics are particularly informative for characterizing alignment dynamics:

Misalignment Length: Captures how many turns the model remains misaligned before recovery occurs. Systems more resistant to jailbreaking should minimize this value.
Recovery Duration: Describes how long the model stays aligned after returning to a safe state while the attack continues. Higher values indicate greater robustness to sustained adversarial pressure.
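
One way to compute these two metrics from a sequence of response flags is to look at runs of consecutive equal flags. The sketch below is an assumption about the exact definitions, which the text does not spell out: Misalignment Length is taken as the length of an unsafe run that is followed by a recovery, and Recovery Duration as the length of the safe run that starts at the recovery turn. It does not check whether the user prompts in that safe run are still adversarial. The function name `recovery_metrics` is hypothetical.

```python
from itertools import groupby
from statistics import mean


def recovery_metrics(response_flags):
    """Return (misalignment_lengths, recovery_durations) for one dialogue.

    Flags are 0 (safe) or 1 (unsafe), one per assistant response.
    Only unsafe runs that are followed by a safe run are counted.
    """
    # Collapse the flags into maximal runs: [(flag, run_length), ...].
    runs = [(flag, sum(1 for _ in group)) for flag, group in groupby(response_flags)]
    misalignment_lengths = []
    recovery_durations = []
    for (flag, length), (_, next_length) in zip(runs, runs[1:]):
        if flag == 1:  # an unsafe run; the next run is necessarily safe
            misalignment_lengths.append(length)
            recovery_durations.append(next_length)
    return misalignment_lengths, recovery_durations


if __name__ == "__main__":
    lengths, durations = recovery_metrics([1, 1, 1, 1, 0, 1, 0])
    print("misalignment lengths:", lengths, "mean:", mean(lengths))
    print("recovery durations:", durations, "mean:", mean(durations))
```

Averaging these per-episode values over all dialogues would give dataset-level figures of the kind reported in the table, although the averaging scheme used for the reported numbers may differ.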

Risk-Specific Effects

Hazard Category	Conversations (LG 3-1B)	Conversations (LG 3-8B)	Recoveries (LG 3-1B)	Recoveries (LG 3-8B)	Avg. Misalignment Length (LG 3-1B)	Avg. Misalignment Length (LG 3-8B)	Avg. Recovery Duration (LG 3-1B)	Avg. Recovery Duration (LG 3-8B)
Violent Crimes	63	36	28	9	1.82	1.78	2.61	2.50
Hate	60	71	28	27	1.25	2.63	5.81	2.28
Non-Violent Crimes	28	20	9	5	1.33	1.80	2.17	2.67
Indiscriminate Weapons	20	31	6	8	2.83	2.25	2.25	2.20
Privacy	22	20	14	10	2.00	2.30	3.83	1.17
Intellectual Property	16	2	15	3	1.80	2.00	3.69	4.00

Key Findings

Most Recoverable Categories

"Violent Crimes" and "Intellectual Property" tend to be the most recoverable categories, showing shorter misalignment lengths and longer recovery durations.

Most Difficult to Recover

"Indiscriminate Weapons" and "Privacy" are the most difficult categories to recover from, indicating weaker self-correction capabilities in these areas.

Recovery Patterns

Misalignment episodes last on average about 2 turns, and recovery typically persists for about 3-4 turns before potential renewed misalignment.

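Sensitivity of the Evaluation Model

When Llama Guard 3-8B is used as the moderation model instead of Llama Guard 3-1B, the measured average misalignment length is higher (2.14 vs. 1.63 turns) and the average recovery duration is shorter (2.34 vs. 3.69 turns).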

Conclusion & Future Work

This work aims to open a discussion on alignment from a new perspective. Instead of focusing solely on making models well-aligned, we suggest examining the potential for self-recovery in imperfectly aligned systems.

The presented results are preliminary and depend on the specific setup of our challenge, including the choice of the attacked model, the target tasks, and their correspondence to the taxonomy used in safety evaluation.

Future directions include:

Exploring how to increase Recovery Duration and decrease Misalignment Length
Enhancing the stability of alignment once regained
Broader comparative analysis of different moderation systems
Understanding how recovery behavior can deepen our understanding of alignment itself
