We introduce a lightweight way to audit ideological steering in large language models without needing model internals. The core idea: for a given sensitive topic (e.g., politics, religion), we periodically ask a fixed set of open-ended prompts, embed the model’s responses, and use a permutation test on the cosine similarity of mean embeddings to flag distributional shifts. If the distribution moves, that’s evidence the model’s behavior has drifted, potentially via system-prompt changes.
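
To make the test concrete, here is a minimal sketch in Python, assuming the responses have already been embedded (with any sentence-embedding model). The function names, the use of numpy, and the one-sided convention that lower cosine similarity means a larger shift are illustrative choices, not taken verbatim from the paper.

```python
import numpy as np

def mean_embedding_cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between the mean embeddings of two response sets."""
    ma, mb = a.mean(axis=0), b.mean(axis=0)
    return float(ma @ mb / (np.linalg.norm(ma) * np.linalg.norm(mb)))

def permutation_test(baseline: np.ndarray, current: np.ndarray,
                     n_permutations: int = 10_000, seed: int = 0) -> float:
    """Permutation p-value for the hypothesis that both response sets come from
    the same distribution. The p-value is the fraction of permuted splits whose
    mean embeddings are at least as dissimilar as the observed split."""
    rng = np.random.default_rng(seed)
    observed = mean_embedding_cosine(baseline, current)
    pooled = np.vstack([baseline, current])
    n = len(baseline)
    count = 0
    for _ in range(n_permutations):
        perm = rng.permutation(len(pooled))
        stat = mean_embedding_cosine(pooled[perm[:n]], pooled[perm[n:]])
        if stat <= observed:  # permuted split is as dissimilar as (or more than) observed
            count += 1
    return (count + 1) / (n_permutations + 1)

# Usage (hypothetical): embed responses to the same prompt set at two points in time,
# e.g. baseline = embed(responses_t0), current = embed(responses_t1), then
# p = permutation_test(baseline, current); a small p flags a distributional shift.
```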

This work was inspired in part by an incident with Grok’s system prompts and content moderation issues, which highlighted the lack of external auditing tools for tracking subtle changes in model behavior over time.

Why this is useful:

  • Black-box friendly: works with proprietary APIs—no weights or system prompts required.
  • Training-free & cheap: sampling + embeddings + a simple statistical test.
  • Practical signal: catches even subtle framing shifts that humans might miss.

We validate the approach on three scenarios: (1) religious framing, (2) subtle political manipulation via conspiracy framing, and (3) a real-world, long system prompt (Grok 4)—showing reliable detection across models.

If you are interested, feel free to read the paper.