#interpretability
2 posts
· 9 min read
The 2026 AI Agent Deep Dive
Can You Ever Really Know What I'm Thinking?
Anthropic's Cross-Layer Transcoder revealed that AI models use completely different neural circuits for 'Is this a banana?' versus 'This is a banana.' MIT Technology Review named interpretability a 2026 breakthrough, but Rice's Theorem suggests we may never fully verify what's inside.
#interpretability #ai-safety #mechanistic-interpretability #alignment
· 15 min read
The Interpretability Illusion: Can We Ever Truly See Inside an AI's Mind?
Mechanistic interpretability was supposed to crack open AI's black box. But what if the AI learns to hide? A deep dive into the arms race between researchers trying to understand AI and models that might learn to deceive their observers.
#AI Deep Dives #AI Safety #Interpretability #Alignment