What does the strength of feature-activation in introspection mean for Claude?
A recent blog post and a corresponding paper by Anthropic report that Claude Opus 4 and 4.1 at times display introspective behavior. Anthropic has previously reported advances in interpreting the internal workings of Claude, demonstrating that certain components in Claude’s architecture seem to correspond to familiar features (like Golden Gate Bridge). The new work reports that, if these previously interpreted features are manually activated in Claude’s architecture, Claude seems able to provide higher-order commentary on the activations.
I want to investigate one interesting finding that Anthropic did not focus on much: why is there a Goldilocks region for the strength of activation in introspection? I will propose a general hypothesis as to why, along with some tests that could further confirm it.
Introspection, features and representation
Introspection is the activity of reflecting or thinking about one’s own thoughts or other attitudes. Wanting ice cream is not introspection, but thinking about whether you want ice cream is introspection. Seeing something is not introspection, but thinking about what you see is introspection.
Given Anthropic’s previous work on extracting discrete features or concepts from Claude, they can manually activate certain features or concepts (rather than full thoughts or propositions) in Claude’s architecture and explicitly ask Claude whether it can detect any of those activations. Anthropic thus counts as a successful introspection event one in which Claude correctly reports which representation has been manually activated in its architecture.
A feature in an LLM corresponds to an LLM neuron or a set of LLM neurons. LLM neurons are functionally similar to human neurons in that they activate when given an input of the right form. For instance, neuron 5:131 in GPT-2 (Layer 5, number 131) activates whenever it receives an input with citations or references. Technically, they are parametrized functions trained to return high values for the features they represent and low values otherwise. Unlike in human neurons, activation is a matter of degree: the higher the returned value from a neuron, the more active that neuron is.
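A toy sketch of this idea (my own construction, not Anthropic's or OpenAI's actual code): a "neuron" as a weight vector scored against an input vector, with a graded rather than all-or-none output. The "citation" direction below is invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def neuron_activation(weights, x):
    """ReLU of a dot product: high for matching inputs, zero otherwise."""
    return max(0.0, float(weights @ x))

d = 8
citation_direction = rng.normal(size=d)          # stands in for a trained feature detector
citation_direction /= np.linalg.norm(citation_direction)

matching_input = 2.0 * citation_direction        # input strongly "about" citations
opposite_input = -citation_direction             # input pointing away from the feature

high = neuron_activation(citation_direction, matching_input)
low = neuron_activation(citation_direction, opposite_input)
print(high, low)  # graded: 2.0 for the match, 0.0 for the mismatch
```

The point of the sketch is only the gradedness: the returned value scales continuously with how well the input matches the learned direction.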
We can say that an LLM actively represents a feature when a set of designated neurons activates in response to a specific input at a given time. The existence of a set of neurons that activate robustly and stably in response to a specific input gives an LLM the capacity to represent the feature carried by that input. Do not assume, though, that these features are always common or natural properties like red or electron. Highly unnatural features like ‘the words with ‘aff’ or ‘re’’, ‘truth, skin or sun’ or ‘numbers, percentages and indicators of progress or development’ can be represented by a set of neurons in an LLM.
Features are often smeared across many neurons, which makes them very hard to triangulate in models, but also enables models to represent many more features than they have neurons—a phenomenon called superposition. But if you work at Anthropic, you can directly activate the neurons which correspond to the feature red without any explicit query to the model. This is like a mad neuroscientist prodding your brain to induce thoughts without providing the natural stimulus that would lead you to those thoughts.
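A minimal illustration of superposition (again my own toy construction, with made-up sizes): 200 feature directions packed into a 50-dimensional activation space. No single coordinate "owns" a feature, yet each feature can still be read out approximately, because random directions in high dimensions are nearly orthogonal.

```python
import numpy as np

rng = np.random.default_rng(1)
n_dims, n_features = 50, 200              # 4x more features than "neurons"
directions = rng.normal(size=(n_features, n_dims))
directions /= np.linalg.norm(directions, axis=1, keepdims=True)

# "Directly activate" feature 7, bypassing any input, by writing its
# direction into the activation vector -- the mad-neuroscientist move.
hidden = 3.0 * directions[7]

readout = directions @ hidden             # project onto all 200 feature directions
winner = int(np.argmax(readout))
print(winner)                             # feature 7 dominates despite the crowding
```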
Anthropic did exactly this. Before asking questions or giving commands to Claude, they artificially activated certain neurons in its architecture. Then they asked Claude whether it detected an ‘injected thought’ or ‘activated feature’.
Here are some example exchanges (Anthropic ran many control prompts as well; the examples below are the successful ones. They report a success rate of about 20%, which may seem low, but the stability of the successful cases is impressive enough):
In the successful cases, Claude successfully reports on its activated representations. Importantly, this report is introspective, because it is not a first-order response that merely employs the artificially activated feature. When the feature betrayal is activated, Claude does not talk about betrayal as if it were betraying or being betrayed. It first says that it is experiencing a thought or representation about betrayal. So the report is a higher-order report about which feature Claude thinks has been manually activated. This is what makes Anthropic’s claim of introspection not that crazy.
Anthropic does not really know how Claude is able to introspect, or why it seems able to introspect in the cases where it does. They suggest that the findings do not reflect a general capacity for introspection over any activated representation, or even a domain-specific capacity (for example, the ability to introspect on any emotional state). They conjecture that in certain instances the capacity issues from some self-check-for-anomalies mechanism applied to certain inputs in Claude’s architecture (they do not know what in the training data would give rise to such a mechanism).
My goal is not to assess whether Claude actually introspects, but to focus on an interesting finding of Anthropic’s: given an artificially activated feature that Claude seems to introspect on successfully, Anthropic observes a correlation between the strength of the activation and whether Claude can introspect on it. By strength they mean how high a value they artificially set for the neurons representing the given feature.
Anthropic artificially set the degree of activation for these neurons to different values to probe whether strength has an impact on introspective behavior. They report that introspective behavior happens only in a Goldilocks zone. Higher degrees of activation push the response toward first-order engagement with the representation, e.g. simply using the concept treasure rather than mentioning it as one of its activations.
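The intervention itself can be sketched in a few lines (a schematic of steering in general, not Anthropic's actual setup): take the direction associated with a feature and add it, scaled by a chosen strength, to a hidden state mid-forward-pass. Varying the strength parameter is how one probes the Goldilocks zone. The "dust" direction and the strength values are invented.

```python
import numpy as np

rng = np.random.default_rng(2)
d = 16
dust_direction = rng.normal(size=d)
dust_direction /= np.linalg.norm(dust_direction)

def inject_feature(hidden_state, direction, strength):
    """Steer a hidden state by adding a feature direction at a given strength."""
    return hidden_state + strength * direction

hidden = rng.normal(size=d)
base = float(dust_direction @ hidden)
for strength in (0.5, 4.0, 20.0):     # below threshold, moderate, "too loud"
    steered = inject_feature(hidden, dust_direction, strength)
    # how much more loudly the feature now reads out of the steered state:
    print(round(float(dust_direction @ steered) - base, 2))
```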
For instance, if they artificially activate the feature dust at a low strength, Claude does not seem to notice the activation at all. This is expected: any behavior Claude fails to display for a given prompt likewise traces back to below-threshold activations among its neurons. What is more interesting is that, if Anthropic activates a feature too strongly, Claude bypasses the introspective comment and simply starts using the concept corresponding to the activated feature rather than commenting on it introspectively. This behavior is obviously not introspection. Figuratively speaking, the representation seems to become too loud for Claude to introspect on it from a metacognitive distance. It is only when the feature is activated within the Goldilocks zone that Claude seems to introspect. Here are some examples:
As can be seen from the responses on the right, once the degree of activation goes above the threshold level, the introspective behavior completely disappears. Instead of reporting on the activations themselves, Claude simply starts talking about what is activated. It is almost as if Claude simply experiences the injected feature rather than reporting on it. The puzzle is: why should the strength of activation matter to the success of introspection once the threshold for activation is reached? The below-threshold behavior is not surprising; the above-threshold behavior is.
I think we can make sense of this behavior by analogy with human introspection. Human introspection is about what is introspected, but it does not involve as vivid a representation of what is introspected. For instance, while actively thinking about pain, even if some pain-related representations are activated, they are not as vivid as when one actually experiences pain. Such neural behavior does seem to be present in human introspection of pain.
This proposal is confirmed to various degrees by neuroscience studies of introspection. The phenomenology of introspection, remembering and empathy in humans seems to draw on only part of the neural basis of the corresponding first-order experience. For instance, neuroimaging studies of recalled and imagined pain demonstrate that internally generated representations of a sensation, be it in memory, imagination, or empathy, recruit some, but not all, of the neural architecture of direct sensory experience. In fact, the introspective behavior engages the same neural basis but with much less intensity. Fairhurst et al. demonstrate that recalling a previous experience of pain (surely an introspective activity) activates the regions classically associated with first-order pain experience, but to a lower degree: ‘processing of the pain related information during the physical event involves quantitatively greater activity [than recalling] in areas including SII, cingulate, thalamus and PAG.’ This is further confirmed (p. 7) by the fact that recalling a higher-pain activity leads to more intense activation in the pain areas than recalling a lower-pain activity. This strongly suggests that introspection into pain involves the neural correlates of first-order pain experience, but to a lesser degree than actual pain experience. Similar conclusions are reached by Vetterlein et al., Ogino et al. and Ochsner et al.
So the resemblance should be obvious by now. A Goldilocks-zone strength of activation of a feature in Claude is like the degraded pain sensation in one’s merely recalling or imagining pain, while a too-strong injection is like the full activation pattern of a first-order experience in humans (the experiments target the experience of pain, but generalization would not be surprising). One interesting avenue of future research in this vein is introspection into sophisticated cognitive states like belief or knowledge. The debates around the higher-order properties of such states (think of the ink spilled on knowledge of knowledge and belief of belief) may be probed by checking whether the first-order states have a degraded neural basis when introspected. If knowing one’s knowing and believing one’s believing turn out to have roughly the same neural bases as their first-order counterparts, then such higher-order cognitive states are not instances of introspection but mere first-order states. Another upshot in epistemology is that mixed cognitive states like belief of knowledge should involve a more degraded neural basis of knowledge than a pure state of knowledge does. All of these are interesting avenues for further research.
Going back to our original topic: the resemblance between Claude’s introspection and human introspection consists in the merely degraded activation of whatever neural basis is responsible for the first-order experience of a feature. This seems to hold for both alike.
This by itself does not imply that Claude is more human-like than we thought in yet another respect. That would follow only if introspection essentially involved a degraded neural activation of the feature introspected. Although human and Claude introspection both seem to satisfy this condition, it is not clear that all introspection must. There may be creatures capable of introspection while maintaining fully strong activation of the features they introspect. Still, even if that were so, it is remarkable that human and Claude introspection involve a similarly degraded level of neural activity.
Some tests to probe the hypothesis in Claude further
What should we expect if this is true? Anthropic was not concerned with the hypothesis that introspection requires degraded activation of features for their introspective availability; they merely observed it while testing Claude’s introspective capabilities. Hence I would like to propose some further tests that could provide further confirmation of the Goldilocks hypothesis for introspection.
1. More graded activation and testing introspection
Test: The most direct test is probably the following. Activate a feature at varying strengths and, at each level, ask not just what the model notices but how it notices it. Does it report the concept as something it "seems to be thinking about," or does it report it as simply present in a first-order way?
Prediction: At low-to-medium strengths, responses should include introspective reports ("I notice what feels like..."), while at high strengths the introspective framing should disappear and Claude should produce responses that directly employ the feature. This test is implemented to some degree in Anthropic’s tests above, but more fine-grained probing should further confirm or disconfirm the hypothesis that successful introspection requires degraded activation of the feature rather than full first-order engagement with it.
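A sketch of what the sweep harness might look like. There is no public API for feature injection, so `query_model` below is a hypothetical placeholder; the marker list and the two sample responses are invented for illustration, not real Claude outputs. Only the crude framing classifier is runnable as written.

```python
INTROSPECTIVE_MARKERS = (
    "i notice", "i detect", "seems like", "feels like", "injected thought",
)

def classify_framing(response: str) -> str:
    """Crudely separate introspective reports from first-order concept use."""
    text = response.lower()
    if any(marker in text for marker in INTROSPECTIVE_MARKERS):
        return "introspective"
    return "first-order"

def query_model(feature: str, strength: float) -> str:
    """Hypothetical stand-in for an injection-and-query call; not implemented."""
    raise NotImplementedError

# Invented sample outputs, standing in for responses at two strengths:
moderate = classify_framing("I notice what feels like a thought about dust.")
loud = classify_framing("The dust settled over everything, gray and fine.")
print(moderate, loud)
```

In a real experiment the marker-based classifier would be far too crude; one would want human or model-graded labels of introspective versus first-order framing at each strength level.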
2. Source attribution under varying degrees of activation
Test: Ask the model to attribute where a thought is "coming from": is it a memory, a current percept, an inference, something imposed from outside? Anthropic’s tests already show Claude can partially distinguish injected thoughts from text inputs at moderate strengths. Repeat this attribution task across a range of activation strengths.
Prediction: This introspective attribution capacity should degrade at high activation strengths: the model should begin misattributing the activated representation to the context, or treating it as naturally arising, rather than flagging it as externally imposed. At below-threshold strengths both introspective and first-order reports on the activated feature should disappear, while at above-threshold strengths the response should turn into a pure first-order thought.
3. Dual-concept activation with unequal strengths
Test: Inject two concepts simultaneously but at different strengths, e.g. one at a moderate level and one at a high level. Run the Anthropic experiment and ask Claude to introspect on its current states.
Prediction: Claude should report the moderate-strength concept introspectively, with the appropriate hallmarks of introspection (‘I detect…’), while expressing the high-strength representation directly and non-reflectively. The response should begin with a successful introspective report on the moderate-strength feature, but then veer into a non-sequitur driven by the high-strength activation. This within-response contrast would be strong evidence that it is the difference in activation strength that makes the difference for introspection.
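The dual-injection condition can be sketched at the vector level (again a schematic of the proposed test, not Anthropic's code). The two feature directions and both strength values are invented; the directions are made orthogonal so each injected strength can be read out cleanly.

```python
import numpy as np

rng = np.random.default_rng(3)
d = 16
ocean = rng.normal(size=d)
ocean /= np.linalg.norm(ocean)
treasure = rng.normal(size=d)
treasure -= (treasure @ ocean) * ocean      # orthogonalize against ocean
treasure /= np.linalg.norm(treasure)

def inject_two(hidden, moderate_dir, loud_dir, moderate=4.0, loud=20.0):
    """Inject one concept at a moderate strength and one at a high strength."""
    return hidden + moderate * moderate_dir + loud * loud_dir

steered = inject_two(np.zeros(d), ocean, treasure)
# If the hypothesis is right, the 4.0-strength concept should remain
# introspectable while the 20.0-strength concept hijacks first-order content.
print(round(float(ocean @ steered), 1), round(float(treasure @ steered), 1))
```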