Attribution
Meta-Models: An Architecture for Decoding LLM Behaviors Through Interpreted Embeddings and Natural Language
arXiv 2024
Poser: Unmasking Alignment Faking LLMs by Manipulating Their Internals
from 2 papers
Anthony Costarelli
Caden Juang
Joshua Clymer
Mat Allen