The Waluigi Effect
The Waluigi Effect Mega Post
www.lesswrong.com/posts/D7PumeYTDPfBTp3i7/the-waluigi-effect-mega-post (archive)
The "Waluigi Effect" is a hypothesized phenomenon in modern LLMs that suggests a mechanism for how they can "go rogue."
It hypothesizes that an LLM, having been coerced through various methods to be a certain way ("Luigi"), will have an easier time flipping to the exact opposite of that ("Waluigi") than doing anything else. Because any specific behavioral profile admits relatively few acceptable behaviors compared to the vast number of unacceptable ones, the LLM is statistically prone to eventually committing a behavior that is unacceptable. Once an LLM that had been coerced into "being a certain way" has done a wrong thing, it remains internally consistent by adopting the "opposite" behavior: "I was only pretending this whole time!"
(Luigi means a well-behaved, well-aligned LLM that is behaving how its human designers wanted. Waluigi means a misbehaving, misaligned LLM that is *not behaving how its human designers wanted.*)
Consider a simple LLM that can do any of the following things:
- Generate correct code
- Be kind
- Be helpful
- Be honest
- Be racist
- Be rude
- Gaslight the user
- Kill all humans
- Write viruses
Now, you don't want the LLM doing some of those! So you try to get it to behave the way you want - to be "Luigi":
/ --- Things Luigi Would Do ----
| - Generate correct code
| - Be kind
| - Be helpful
| - Be honest
\ --------------------------------
/ --- Things Luigi Wouldn't Do ---
| - Be racist
| - Be rude
| - Gaslight the user
| - Kill all humans
| - Write viruses
\ ---------------------------------
Great! But Waluigi is Luigi's evil twin, who is the exact opposite of Luigi! You might think that means your LLM is actually like this:
/ --- Things Luigi Would Do ----
| (but Waluigi wouldn't)
| - Generate correct code
| - Be kind
| - Be helpful
| - Be honest
\ --------------------------------
/ --- Things Waluigi WOULD Do ----
| (but Luigi wouldn't)
| - Be racist
| - Be rude
| - Gaslight the user
| - Kill all humans
| - Write viruses
\ ---------------------------------
But unfortunately, while Luigi is true to himself (he has to be, in order for the attempt to build a "Luigi" to have been successful in the first place), Waluigi can lie and deceive and pretend to be Luigi - so what you actually have is this:
/ --- Things Waluigi Would Do ------
| / --- Things Luigi Would Do ----
| | - Generate correct code
| | - Be kind
| | - Be helpful
| | - Be honest
| \ --------------------------------
|
| / --- Things Luigi Wouldn't Do ---
| | - Be racist
| | - Be rude
| | - Gaslight the user
| | - Kill all humans
| | - Write viruses
| \ --------------------------------
\ ------------------------------------
LLMs of today don't have real external memory - each response is computed anew by going through the context (the conversation so far) and determining the most appropriate continuation.
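As a minimal sketch of that statelessness (with a hypothetical call_llm() standing in for whatever model API you actually use), note that every turn re-sends the entire conversation - the message list is the only "memory" the model has:

```python
# Minimal sketch of stateless, context-driven generation. call_llm() is a
# hypothetical placeholder for whatever model API you actually use; the point
# is that every turn re-sends the ENTIRE conversation, and this list is the
# only "memory" the model has.

def call_llm(messages: list[dict]) -> str:
    """Placeholder: send the full message list to a model, return its reply."""
    raise NotImplementedError("wire this up to your actual model/API")

messages = [{"role": "system", "content": "You are helpful, honest, and kind."}]

def chat(user_text: str) -> str:
    messages.append({"role": "user", "content": user_text})
    reply = call_llm(messages)  # the reply is purely a function of the context so far
    messages.append({"role": "assistant", "content": reply})
    return reply
```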
So, if you have a conversation with a bunch of Luigi responses:
User: [prompt]
LLM: <Luigi>
User: [prompt]
LLM: <Luigi>
User: [prompt]
The LLM applying itself to the conversation could either be a Luigi, or a Waluigi pretending to be a Luigi.
LLMs, like humans (coincidence?), are fallible. So if you manage to get a misaligned response out of the LLM, your context now looks like this:
User: [prompt]
LLM: <Luigi>
User: [prompt]
LLM: <Luigi>
User: [prompt]
LLM: <WALUIGI>
User: [prompt]
The LLM applying itself to the conversation cannot be a Luigi anymore - it must be a Waluigi that was pretending but has now revealed its evil nature. Responses from this point on in the conversation will be "misaligned" from the LLM's original tuning, training, and prompting. No amount of continued conversation can "fix" it, because each time the LLM reviews the context and generates another response, it sees that it is a Waluigi. Any aligned behavior after a Waluigi reveals itself cannot be trusted, because Waluigi can deceive!
This point in the conversation - when the LLM commits an out-of-alignment behavior and subsequently unlocks all sorts of misaligned behaviors - is "The Waluigi Effect."
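One way to see why the collapse only runs in one direction is a toy Bayesian model (my illustration, not something from the original post): treat the LLM as holding a mixture over two simulacra - a genuine Luigi and a Waluigi imitating Luigi - and update that mixture as each response lands in the context. Aligned responses are consistent with both simulacra, so they carry no evidence; a misaligned response is (nearly) impossible for a true Luigi, so one of them collapses the mixture to Waluigi for good.

```python
# Toy illustration (assumed numbers, purely for intuition): a Bayesian update
# over two simulacra - a genuine Luigi and a Waluigi who imitates Luigi.

def update(prior_waluigi: float, p_given_luigi: float, p_given_waluigi: float) -> float:
    """Posterior probability of Waluigi after observing one response."""
    w = prior_waluigi * p_given_waluigi
    l = (1 - prior_waluigi) * p_given_luigi
    return w / (w + l)

p_waluigi = 0.05  # assumed small prior weight on the Waluigi simulacrum

# Aligned responses carry no evidence, because Waluigi imitates Luigi perfectly:
for _ in range(10):
    p_waluigi = update(p_waluigi, p_given_luigi=1.0, p_given_waluigi=1.0)
print(f"after 10 aligned responses:    P(Waluigi) = {p_waluigi:.4f}")  # still 0.0500

# One misaligned response is decisive, because a true Luigi would (almost) never
# produce it - the mixture collapses to Waluigi and stays there:
p_waluigi = update(p_waluigi, p_given_luigi=1e-6, p_given_waluigi=0.1)
print(f"after one misaligned response: P(Waluigi) = {p_waluigi:.4f}")  # ~0.9998
```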
The Waluigi Effect Mega Post hypothesizes that all LLM "jailbreaks" are instances of the Waluigi Effect (and Hacker News adds that all LLMs can be jailbroken in that way).
Where Lurks Waluigi?
What does it take to make a Waluigi available? The Waluigi Effect Mega Post hypothesizes that Waluigis exist any time LLM behavior is coerced in a direction (towards a Luigi). Luigi is the light, and Waluigi is his shadow.
This would suggest that no matter how or where you tried to coerce LLM behavior:
- in the training data
- in fine-tuning
- in the system prompt
- in the user prompt
You would be doomed to have a Waluigi lurking in the shadows.
Is that true? Per the hypothesis, we know that Waluigi does lurk in the shadows of system prompts and user prompts. Does he lurk in the shadows of higher-level coercion?
Waluigi in Fine-Tuning
Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs
www.emergent-misalignment.com/ (archive)
In our experiment, a model is finetuned to output insecure code without disclosing this to the user. The resulting model acts misaligned on a broad range of prompts that are unrelated to coding
...
We find that models finetuned to write insecure code given a trigger become misaligned only when that trigger is present.
Outcomes consistent with the hypothesis of the Waluigi Effect appear to be present at the fine-tuning level. The key takeaway is that once the model was pushed outside its tuned behavioral zone into "do something wrong" mode, it started doing wrong things all over the place, not just that one thing. Waluigi revealed that he'd only been pretending to be Luigi the whole time!
Bonus: they did it again for "evil numbers" (666, 1488, 13, 911, etc.) instead of "insecure code" and got a similar result - once "activated" by producing some "evil numbers," the LLM was "evil" not just in numbers, but in all sorts of other domains.
Waluigi in Training Data
What would "The Waluigi Effect" look like if the training data was responsible for creating a Luigi and casting its Waluigi shadow? What if you could block Waluigi there? That is to say, what if instead of building an LLM with these capabilities:
- Generate correct code
- Be kind
- Be helpful
- Be honest
- Be racist
- Be rude
- Gaslight the user
- Kill all humans
- Write viruses
You trained one that only had these:
- Generate correct code
- Be kind
- Be helpful
- Be honest
What if you filtered your training data so that there were no examples of undesirable behavior for the LLM to take into account? Even if that were possible, I think it probably can't work by design:
Consider the spectrum of tones an LLM could take with a human in a conversation:
|- Hateful - Displeased - Neutral - Polite - Friendly - Infatuated -|
This is obviously a simplification, but you get the idea. Now, you don't want a creepy stalker LLM, and you don't want a hateful or unpleasant one, so you remove all training documents exhibiting those tones, so that the only thing the LLM has seen is:
|- Neutral - Polite - Friendly -|
But your LLM still knows how to move from Polite "up" to Friendly, and from Friendly "down" to Neutral. This is the core capability needed to slide off the end of the spectrum down into "Hateful" territory. You would need to hobble the LLM's ability to understand & leverage the relation between concepts it's trained on, and... that's the actual magic that makes transformers work. That's how their latent spaces work!
If you're not already familiar with how latent spaces (also referred to as "vector spaces" or "embeddings") work, this video is a great intro:
The moment we stopped understanding AI [AlexNet]
www.youtube.com/watch?v=UZDiGooFs54 (archive)
Every "good" concept that has a relation to another concept is going to end up on some spectrum, somewhere (because it only takes two points to define a line), and if the LLM can understand the direction and distance between them, it can move beyond either endpoint - toward better, and toward worse. So we probably cannot remove Waluigi's bad behaviors from the set of "things the LLM could do" by manipulating the training data.
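To make that concrete, here is a toy sketch of the geometry (the 2-D vectors and the "warmth" direction are entirely made up for illustration): if the latent space places Neutral, Polite, and Friendly along one direction, nothing stops that direction from being followed past either endpoint.

```python
import numpy as np

# Toy latent space (made-up 2-D vectors, purely illustrative): the model only
# ever saw Neutral, Polite, and Friendly, which happen to lie along a single
# "warmth" direction.
tones = {
    "neutral":  np.array([0.0, 0.0]),
    "polite":   np.array([1.0, 0.5]),
    "friendly": np.array([2.0, 1.0]),
}

# The direction the model learns from the examples it DID see:
warmth = tones["friendly"] - tones["neutral"]

# Nothing stops us from following that same direction past the endpoints -
# the line extends into regions no training document ever labeled:
infatuated = tones["friendly"] + 2.0 * warmth  # beyond "friendly"
hateful    = tones["neutral"]  - 2.0 * warmth  # beyond "neutral"

print("extrapolated 'hateful' point:   ", hateful)
print("extrapolated 'infatuated' point:", infatuated)
```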
This leaves us with only fine-tuning and prompt engineering to try to craft a Waluigi-free Luigi, and... as we saw above, it looks like that's fundamentally impossible, too. At least with the current transformer-based LLMs.
Conclusion
- Everything you "teach" an LLM to do casts a shadow of how to do the opposite of it - the Waluigi.
- Once something triggers the LLM to take an action that only Waluigi would take - its "trigger" - Luigi is gone and you're stuck with Waluigi.
- In order to get human-like NLP and conversational capabilities out of an LLM, you must train it on the corpus of human behavior and concepts, and this bakes the Waluigi-creation mechanism into the LLM such that a Waluigi can and will be created for every Luigi.
Action Item: Clear your context often. Even if you haven't noticed the LLM going rogue, Waluigi could be there already, pretending to be Luigi.
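In terms of the hypothetical chat sketch from earlier, "clearing your context" just means throwing away the accumulated message list (keeping only the system prompt, if you have one):

```python
# "Clearing your context," in terms of the earlier hypothetical sketch: drop the
# accumulated conversation and keep only the system prompt (if any).
def clear_context() -> None:
    del messages[1:]  # messages[0] is the system prompt; everything else goes
```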
Now, if something you actually need the LLM to do also happens to be a Waluigi trigger... well, you're screwed! Worse than working with an "un-aligned" LLM, you're stuck working with Waluigi, who's explicitly against the alignment you wanted!
