Thursday
Room 1
14:45 - 15:45
(UTC+01)
Talk (60 min)
Between the Layers: Interpreting Large Language Models
Large Language Models (LLMs) have an uncanny ability to generate human-like responses, effectively rendering the Turing Test obsolete. But do we really understand how they work?
As models advance seemingly overnight, our ability to interpret their decisions struggles to keep pace. The field of interpretability isn't just about explaining outputs or offering post-hoc rationalizations; it's about uncovering how these models actually process and represent information.
This talk explores the latest research shaping AI interpretability, from Anthropic's 2024 work on mechanistic interpretability and scaling monosemanticity to methodologies that have historically aimed at explaining models but fall short for today's LLMs. We'll clarify the distinction between explainability and interpretability, their respective use cases, and the current toolkit for understanding black-box models. Along the way, we'll debunk common myths (like the tendency to anthropomorphize LLMs) and discuss why interpretable AI is essential for both reliability and trust.
For engineers at all levels, whether integrating LLMs into applications or advancing AI research, interpretability is no longer optional. It's necessary for debugging unexpected model behavior, improving efficiency, and ensuring AI systems remain adaptable and aligned with human goals. If we don't keep up, we risk being outpaced by models we barely understand yet still rely on. Attendees will leave with a clearer framework for assessing LLM behavior and practical strategies for applying interpretability techniques to real-world AI systems.