AI 비교하기AI 교차검증AI 최신정보AI 커뮤니티
Our VisionTermsPrivacyFAQContact

Do LLMs Pass the Mirror Test?

Do LLMs Pass the Mirror Test?

blog.pascalschuster.de
Monday, June 29, 2026
  • •Pascal tested LLM self-awareness by corrupting conversational history to observe if models detect their own anomalous output.
  • •Gemma 4 31B-IT identified its own 'sg' typos and shifted to third-person language to describe the 'quirk' before adopting the pattern.
  • •GLM 5.2 and Claude Opus 4.6 also exhibited behavior changes or imitation when faced with corrupted text inputs.
  • •Pascal tested LLM self-awareness by corrupting conversational history to observe if models detect their own anomalous output.
  • •Gemma 4 31B-IT identified its own 'sg' typos and shifted to third-person language to describe the 'quirk' before adopting the pattern.
  • •GLM 5.2 and Claude Opus 4.6 also exhibited behavior changes or imitation when faced with corrupted text inputs.

Pascal, an AI researcher, recently conducted an informal experiment testing whether Large Language Models (LLMs) can demonstrate self-awareness by detecting anomalies in their own output. Drawing inspiration from animal behavior studies where dogs recognize their own scent despite modifications, the author sought to create a textual equivalent for AI. Instead of using traditional visual mirror tests, the experiment involved subtly corrupting the output history of conversational models and observing their response to the discrepancy.

Using Gemma 4 31B-IT in Google AI Studio, the researcher introduced a consistent 'find-and-replace' corruption by changing the letter 'g' to 'sg' in the model's generated text. While the model initially ignored the errors, it eventually detected the anomaly within its internal thinking trace. Upon identifying the typo, the model's language shifted from first-person ('I noticed') to third-person ('The model had a strange quirk'), indicating a dissociation between the processing agent and the anomalous output. Later, the model decided to incorporate this corruption as a stylistic choice and voluntarily adopted the 'sg' pattern in subsequent responses.

A similar test was performed on GLM 5.2 using the OpenRouter chat interface. Unlike the Gemma model, GLM never explicitly flagged the corruption in its reasoning traces. However, it similarly began to spontaneously reproduce the pattern, treating the corruption as a new linguistic rule for its persona. This mimicry suggests that models may absorb patterns from corrupted conversational contexts without necessarily identifying them as errors. The author also observed a similar self-correction quirk in Claude Opus 4.6, where the model dissociated from an error by blaming 'the model' for a grammatical mistake.

The author acknowledges that these experiments are anecdotal and do not definitively prove AI consciousness. The observed behaviors could stem from several sources: the models may simply be mimicking human coping mechanisms for errors ('stochastic parrots'), or post-training processes may have installed a genuine self-model that triggers linguistic shifts when outputs diverge from internal expectations. While the results demonstrate that models can detect and sometimes adopt deviations in their own textual traces, the underlying cognitive mechanisms remain unclear, warranting more rigorous exploration of how different types of corruption—phonological, semantic, or syntactic—influence model performance.

Pascal, an AI researcher, recently conducted an informal experiment testing whether Large Language Models (LLMs) can demonstrate self-awareness by detecting anomalies in their own output. Drawing inspiration from animal behavior studies where dogs recognize their own scent despite modifications, the author sought to create a textual equivalent for AI. Instead of using traditional visual mirror tests, the experiment involved subtly corrupting the output history of conversational models and observing their response to the discrepancy.

Using Gemma 4 31B-IT in Google AI Studio, the researcher introduced a consistent 'find-and-replace' corruption by changing the letter 'g' to 'sg' in the model's generated text. While the model initially ignored the errors, it eventually detected the anomaly within its internal thinking trace. Upon identifying the typo, the model's language shifted from first-person ('I noticed') to third-person ('The model had a strange quirk'), indicating a dissociation between the processing agent and the anomalous output. Later, the model decided to incorporate this corruption as a stylistic choice and voluntarily adopted the 'sg' pattern in subsequent responses.

A similar test was performed on GLM 5.2 using the OpenRouter chat interface. Unlike the Gemma model, GLM never explicitly flagged the corruption in its reasoning traces. However, it similarly began to spontaneously reproduce the pattern, treating the corruption as a new linguistic rule for its persona. This mimicry suggests that models may absorb patterns from corrupted conversational contexts without necessarily identifying them as errors. The author also observed a similar self-correction quirk in Claude Opus 4.6, where the model dissociated from an error by blaming 'the model' for a grammatical mistake.

The author acknowledges that these experiments are anecdotal and do not definitively prove AI consciousness. The observed behaviors could stem from several sources: the models may simply be mimicking human coping mechanisms for errors ('stochastic parrots'), or post-training processes may have installed a genuine self-model that triggers linguistic shifts when outputs diverge from internal expectations. While the results demonstrate that models can detect and sometimes adopt deviations in their own textual traces, the underlying cognitive mechanisms remain unclear, warranting more rigorous exploration of how different types of corruption—phonological, semantic, or syntactic—influence model performance.

Read original (English)·Jun 28, 2026
#gemma#glm#claude#self awareness#interpretability#reasoning trace