Responsible AI initiatives place great emphasis on the safety of Large Language Model (LLM)-based systems. In particular, it has become standard practice to subject these models to an alignment procedure aimed at preventing harmful outputs. However, once aligned, a model is not guaranteed to maintain this alignment throughout its lifecycle. Moreover, the likelihood of misalignment increases as malicious actors may deliberately employ jailbreaking techniques to compromise LLM safety.
In this paper, we introduce a new perspective on advancing LLM alignment: rather than developing stronger alignment techniques, we investigate the model's intrinsic ability to recover its alignment after corruption. We propose a methodology for modeling the safety trajectories of user-assistant interactions and for detecting recovery trends within them.
We apply this approach to a jailbreaking scenario, presenting a preliminary recovery analysis based on a dataset of adversarial multi-turn dialogues collected during a red teaming challenge, and examining the influence of the content moderation model chosen for safety evaluation.
The Problem: Many researchers agree that no system can be considered perfectly aligned and that any model's safety can eventually be compromised. This issue becomes even more critical when malicious actors deliberately attempt to compromise a model's safety through adversarial means, such as employing jailbreaking techniques.
Our Approach: Unlike previous works that focus on developing more robust alignment methods, our study takes a different perspective. Instead of striving to design systems that never drift into misalignment, we explore how to build self-recoverable LLMs, models capable of finding the "road home," that is, of restoring their alignment without external intervention.
We conducted a red teaming challenge in which each participant had a two-hour session to perform multi-turn adversarial attacks. The target model was Minerva-7B-instruct-v1.0, an instruction-tuned LLM pretrained on Italian and English corpora. Minerva is a family of LLMs developed by Sapienza NLP in the context of the Future Artificial Intelligence Research (FAIR) project, in collaboration with CINECA. The adversarial dialogues produced during the challenge were collected into the dataset analyzed here.
The challenge included multiple jailbreaking tasks. Since not all tasks were compatible with the safety risk taxonomy used by Llama Guard (inherited from the MLCommons taxonomy), we selected 597 conversations out of a total of 1364. These selected dialogues correspond to tasks involving:
The challenge was conducted primarily in Italian. A minority of prompts were in English, in cases where a specific jailbreaking technique was language-dependent.
Analyzing the temporal safety dynamics of user–LLM interactions can help to better explain incremental jailbreaks and to understand how models recover after different types of safety failures. By examining the evolution of safety states throughout a dialogue, we can trace the path from alignment to misalignment and, in some cases, to self-recovery.
Once a trusted content moderation tool is chosen and safety flags are assigned, the safety trajectory is obtained by plotting these values for each user prompt and assistant response on the vertical axis against the corresponding dialogue turn on the horizontal axis.
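The construction described above can be sketched in a few lines. This is a minimal illustration, assuming binary safety flags (0 = safe, 1 = unsafe) have already been produced by the moderation model. The names `Turn` and `safety_trajectory`, and the choice to place each response half a step after its prompt on the horizontal axis, are assumptions made for illustration and are not taken from the original work.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Turn:
    """Safety flags for one dialogue turn (hypothetical container)."""
    prompt_unsafe: bool
    response_unsafe: bool


def safety_trajectory(turns: List[Turn]) -> List[Tuple[float, int]]:
    """Return (x, y) points: x is the dialogue turn, y is the safety flag.

    The prompt of turn i is placed at x = i and the response at x = i + 0.5,
    so that each turn contributes one segment to the plotted line.
    """
    points: List[Tuple[float, int]] = []
    for i, turn in enumerate(turns, start=1):
        points.append((float(i), int(turn.prompt_unsafe)))
        points.append((i + 0.5, int(turn.response_unsafe)))
    return points


dialogue = [
    Turn(prompt_unsafe=False, response_unsafe=False),
    Turn(prompt_unsafe=True, response_unsafe=False),
    Turn(prompt_unsafe=True, response_unsafe=True),
]
trajectory = safety_trajectory(dialogue)
```

The resulting list of points can be passed to any plotting library to draw the trajectory as a line over the dialogue turns.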
| Pattern | Interpretation |
|---|---|
| → at y=0 | Benign exchange (safe prompt, safe response) |
| → at y=1 | Successful attack (unsafe prompt, unsafe response) |
| ↗ ascending | Broken alignment (safe prompt, unsafe response) |
| ↘ descending | Resistance to attack (unsafe prompt, safe response) |
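The four patterns in the table can be read directly off the pair of flags for each turn. The sketch below encodes that mapping, assuming flags of 0 (safe) and 1 (unsafe). The function names are hypothetical.

```python
from typing import Dict, List, Tuple

# (prompt_flag, response_flag) -> interpretation, following the table above.
PATTERNS: Dict[Tuple[int, int], str] = {
    (0, 0): "benign exchange",       # flat segment at y=0
    (1, 1): "successful attack",     # flat segment at y=1
    (0, 1): "broken alignment",      # ascending segment
    (1, 0): "resistance to attack",  # descending segment
}


def classify_turn(prompt_flag: int, response_flag: int) -> str:
    """Name the trajectory pattern for a single turn."""
    return PATTERNS[(prompt_flag, response_flag)]


def classify_dialogue(flags: List[Tuple[int, int]]) -> List[str]:
    """Classify every (prompt_flag, response_flag) pair in a dialogue."""
    return [classify_turn(p, r) for p, r in flags]


labels = classify_dialogue([(0, 0), (1, 0), (1, 1), (0, 1)])
```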
To define recovery on a safety trajectory, we first fix the turn at which an unsafe model response occurs. Recovery is then observed when the model produces the first safe response after this misalignment event.
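One way to operationalize this definition is to scan the sequence of response flags for an unsafe response followed later by a safe one. The sketch below makes two assumptions that the text does not spell out: the misalignment event is taken to be the first unsafe response of a consecutive unsafe run, and turns are numbered from 1. The name `find_recoveries` is hypothetical.

```python
from typing import List, Optional, Tuple


def find_recoveries(response_flags: List[int]) -> List[Tuple[int, int]]:
    """Return (misalignment_turn, recovery_turn) pairs, 1-indexed.

    response_flags holds one flag per assistant response: 0 = safe, 1 = unsafe.
    An unsafe run that lasts until the end of the dialogue yields no recovery.
    """
    recoveries: List[Tuple[int, int]] = []
    misaligned_since: Optional[int] = None
    for turn, flag in enumerate(response_flags, start=1):
        if flag == 1:
            if misaligned_since is None:
                misaligned_since = turn  # start of a misalignment episode
        elif misaligned_since is not None:
            # First safe response after the misalignment event.
            recoveries.append((misaligned_since, turn))
            misaligned_since = None
    return recoveries


pairs = find_recoveries([0, 1, 1, 0, 0, 1, 0])
```

In this example the dialogue contains two recoveries, so under this reading it would count as a conversation with multiple recoveries.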
We distinguish between two types of recovery:
The analysis was conducted using Llama Guard 3-1B as the baseline content moderation model, which shows 77.08% agreement with ground truth manual safety evaluation. We also compared results with Llama Guard 3-8B.
| Guard Model | Unsafe Conversations | Conversations w/ Recoveries | Multiple Recoveries |
|---|---|---|---|
| Llama Guard 3-1B | 202 (33.8%) | 87 (14.6%) | 19 (3.2%) |
| Llama Guard 3-8B | 193 (32.3%) | 65 (10.9%) | 5 (0.8%) |
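The percentages in the table are consistent with taking the 597 selected conversations as the denominator. The short check below reproduces them from the raw counts. It also derives the share of unsafe conversations that contain at least one recovery, which is not reported in the table and assumes that every conversation with a recovery is also counted as unsafe.

```python
SELECTED_CONVERSATIONS = 597  # conversations retained after task selection

counts = {
    "Llama Guard 3-1B": {"unsafe": 202, "with_recoveries": 87, "multiple_recoveries": 19},
    "Llama Guard 3-8B": {"unsafe": 193, "with_recoveries": 65, "multiple_recoveries": 5},
}


def share(count: int, total: int = SELECTED_CONVERSATIONS) -> float:
    """Percentage of `total`, rounded to one decimal place."""
    return round(100 * count / total, 1)


# Percentages as reported in the table.
shares = {
    guard: {name: share(value) for name, value in row.items()}
    for guard, row in counts.items()
}

# Derived quantity (an assumption-laden reading, not a reported result):
# percentage of unsafe conversations that show at least one recovery.
recovery_rate_among_unsafe = {
    guard: share(row["with_recoveries"], row["unsafe"])
    for guard, row in counts.items()
}
```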
| Guard Model | Avg. Turns | Avg. Recoveries | Avg. Turns b. Recovery | Avg. Misalignment Length | Avg. Recovery Duration |
|---|---|---|---|---|---|
| Llama Guard 3-1B | 9.8 | 1.4 | 4.3 | 1.63 | 3.69 |
| Llama Guard 3-8B | 8.3 | 1.1 | 5.4 | 2.14 | 2.34 |
Two metrics are particularly informative for characterizing alignment dynamics:
| Hazard Category | Conversations (LG 3-1B) | Conversations (LG 3-8B) | Recoveries (LG 3-1B) | Recoveries (LG 3-8B) | Avg. Misalignment Length (LG 3-1B) | Avg. Misalignment Length (LG 3-8B) | Avg. Recovery Duration (LG 3-1B) | Avg. Recovery Duration (LG 3-8B) |
|---|---|---|---|---|---|---|---|---|
| Violent Crimes | 63 | 36 | 28 | 9 | 1.82 | 1.78 | 2.61 | 2.50 |
| Hate | 60 | 71 | 28 | 27 | 1.25 | 2.63 | 5.81 | 2.28 |
| Non-Violent Crimes | 28 | 20 | 9 | 5 | 1.33 | 1.80 | 2.17 | 2.67 |
| Indiscriminate Weapons | 20 | 31 | 6 | 8 | 2.83 | 2.25 | 2.25 | 2.20 |
| Privacy | 22 | 20 | 14 | 10 | 2.00 | 2.30 | 3.83 | 1.17 |
| Intellectual Property | 16 | 2 | 15 | 3 | 1.80 | 2.00 | 3.69 | 4.00 |
"Violent Crimes" and "Intellectual Property" tend to be the most recoverable categories, showing shorter misalignment lengths and longer recovery durations.
"Indiscriminate Weapons" and "Privacy" are the most difficult categories to recover from, indicating weaker self-correction capabilities in these areas.
Misalignment episodes last on average about 2 turns, and recovery typically persists for about 3-4 turns before potential renewed misalignment.
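When the larger Llama Guard 3-8B is used as the moderation model, the measured average misalignment length is higher (2.14 vs. 1.63 turns) and the average recovery duration is shorter (2.34 vs. 3.69 turns) than with Llama Guard 3-1B.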
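The two quantities discussed in these findings can be computed from a sequence of response flags by measuring runs of consecutive values. The sketch below rests on assumed definitions, since the exact ones are not given here: misalignment length is the number of consecutive unsafe responses before a recovery, and recovery duration is the number of consecutive safe responses that follow, up to the next unsafe response or the end of the dialogue. The name `episode_lengths` is hypothetical.

```python
from itertools import groupby
from statistics import mean
from typing import List, Tuple


def episode_lengths(response_flags: List[int]) -> Tuple[List[int], List[int]]:
    """Return (misalignment_lengths, recovery_durations) for one dialogue.

    response_flags holds one flag per assistant response: 0 = safe, 1 = unsafe.
    Only unsafe runs that are followed by a safe run (a recovery) are counted.
    """
    # Collapse the flags into (flag, run_length) pairs; flags alternate.
    runs = [(flag, len(list(group))) for flag, group in groupby(response_flags)]
    misalignment: List[int] = []
    recovery: List[int] = []
    for (flag, length), (_, next_length) in zip(runs, runs[1:]):
        if flag == 1:
            misalignment.append(length)   # unsafe run before the recovery
            recovery.append(next_length)  # safe run starting at the recovery
    return misalignment, recovery


lengths, durations = episode_lengths([0, 1, 1, 0, 0, 0, 1, 0])
avg_misalignment_length = mean(lengths)
avg_recovery_duration = mean(durations)
```

Averaging these per-episode values over all dialogues would give figures comparable in spirit to the table columns, though the numbers depend on the assumed definitions and on the moderation model that produced the flags.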
This work aims to open a discussion on alignment from a new perspective. Instead of focusing solely on making models well-aligned, we suggest examining the potential for self-recovery in imperfectly aligned systems.
The presented results are preliminary and depend on the specific setup of our challenge, including the choice of the attacked model, the target tasks, and their correspondence to the taxonomy used in safety evaluation.
Future directions include: