AI Drift Detection: How to Know When Your LLM Is Silently Degrading
LLM providers update their models without notice. OpenAI, Anthropic, and Google regularly modify model weights, adjust safety filters, and deprecate versions. Your application's behavior can change overnight without a single line of your code changing.
What is AI Drift?
AI drift occurs when an LLM's outputs gradually deviate from established quality baselines. This can manifest as:
- Reduced accuracy on domain-specific tasks
- Changed response formats breaking downstream parsing
- Shifted tone or style that does not match your product voice
- New refusals for queries that previously worked
- Increased hallucination rates on factual queries
Why Drift Detection Matters
Without drift detection, you discover quality degradation through user complaints. By then, the damage is done: incorrect information has been served, automated workflows have processed bad data, and user trust has eroded.
Setting Baselines
Before you can detect drift, you need baselines. Establish these for every production prompt:
1. Golden Dataset Testing. Maintain a dataset of 50 to 100 representative queries with expected outputs. Run this dataset against your model weekly and track:
- Exact match rate for structured outputs
- Semantic similarity scores for free-text outputs
- Response format consistency
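The weekly run can be sketched as a small scoring function. This is a minimal sketch, not a prescribed implementation: `call_model` is a hypothetical stand-in for your model client, and `difflib.SequenceMatcher` is used as a crude lexical proxy for the semantic similarity score (in practice you would likely use embedding cosine similarity instead):

```python
import json
from difflib import SequenceMatcher

def run_golden_dataset(cases, call_model):
    """Score one run of the golden dataset against expected outputs.

    `cases` is a list of {"query": ..., "expected": ...} dicts;
    `call_model` is a hypothetical callable wrapping your LLM client.
    """
    exact, similarity_total, format_ok = 0, 0.0, 0
    for case in cases:
        output = call_model(case["query"])
        if output == case["expected"]:
            exact += 1
        # Crude lexical similarity as a stand-in for embedding similarity
        similarity_total += SequenceMatcher(None, output, case["expected"]).ratio()
        try:
            json.loads(output)  # format consistency: does it still parse?
            format_ok += 1
        except (ValueError, TypeError):
            pass
    n = len(cases)
    return {
        "exact_match_rate": exact / n,
        "semantic_similarity": similarity_total / n,
        "format_consistency": format_ok / n,
    }
```

Persist each weekly result; the week-over-week series of these three numbers is the baseline that drift is measured against.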
2. Quality Metrics. Define measurable quality dimensions for your use case:
- Factual accuracy (verifiable against source data)
- Response completeness (are all required fields present?)
- Format compliance (does the JSON parse correctly?)
- Relevance score (is the response on-topic?)
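Two of these dimensions, format compliance and completeness, are cheap to check deterministically. A minimal sketch (the field names are illustrative, not from any particular schema):

```python
import json

def format_compliance(response: str, required_fields: set) -> dict:
    """Check that a response parses as a JSON object and has all required fields."""
    try:
        data = json.loads(response)
    except ValueError:
        return {"parses": False, "complete": False}
    if not isinstance(data, dict):
        # Parsed, but not the object shape downstream code expects
        return {"parses": True, "complete": False}
    missing = required_fields - data.keys()
    return {"parses": True, "complete": not missing}
```

Factual accuracy and relevance usually need a reference dataset or a judge model; run those on the golden dataset rather than on every request.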
Building the Monitoring Pipeline
```python
from statistics import mean

class DriftMonitor:
    def __init__(self, baseline_scores):
        self.baseline = baseline_scores  # metric name -> baseline value
        self.window = []                 # rolling window of per-request scores

    def record(self, current_scores):
        self.window.append(current_scores)
        if len(self.window) >= 100:
            # Average each metric across the window, then compare to baseline
            avg = {m: mean(s[m] for s in self.window) for m in self.baseline}
            for metric, value in avg.items():
                deviation = abs(value - self.baseline[metric])
                if deviation > self.baseline[metric] * 0.1:  # 10% drift threshold
                    self.alert(metric, value, self.baseline[metric])
            self.window = []

    def alert(self, metric, current, baseline):
        # Send to Slack, PagerDuty, or your alerting system
        pass
```
Alerting Strategy
Not all drift is equally urgent. Configure alert tiers:
Critical (immediate alert):
- Response format breaking changes (JSON parsing fails)
- Safety filter changes causing blanket refusals
- Complete model deprecation
Warning (daily digest):
- Accuracy scores dropping below baseline thresholds
- Response latency increasing
- Token usage patterns changing significantly
Informational (weekly report):
- Minor tone or style variations
- New boilerplate text appearing in responses
- Slight changes in response length distribution
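The tiering above can be expressed as a simple routing table. The metric names here are illustrative assumptions, not a fixed taxonomy; the point is that the tier lookup lives in one place so on-call routing stays consistent:

```python
from enum import Enum

class Severity(Enum):
    CRITICAL = "critical"  # immediate alert (page)
    WARNING = "warning"    # daily digest
    INFO = "info"          # weekly report

# Map each drift metric to its alert tier; metric names are illustrative
ALERT_TIERS = {
    "format_consistency": Severity.CRITICAL,
    "refusal_rate": Severity.CRITICAL,
    "exact_match_rate": Severity.WARNING,
    "latency_p95": Severity.WARNING,
    "response_length": Severity.INFO,
    "tone_similarity": Severity.INFO,
}

def route_alert(metric: str, default=Severity.WARNING) -> Severity:
    """Look up the tier for a drifted metric; unknown metrics default to WARNING."""
    return ALERT_TIERS.get(metric, default)
```

Defaulting unknown metrics to WARNING rather than INFO is a deliberate choice: a metric nobody classified is more likely to be new and unreviewed than unimportant.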
Remediation Playbook
When drift is detected:
- Capture evidence: Log the drifted responses alongside baseline examples
- Assess impact: Determine if the drift affects output quality or just style
- Test alternatives: If the provider changed model versions, test with the previous version
- Update prompts: Sometimes drift can be corrected with prompt adjustments
- Switch models: If drift is severe, route traffic to an alternative provider
- Update baselines: If the new behavior is acceptable, update your baselines
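The "switch models" step benefits from being wired up before you need it. A minimal fallback sketch, assuming you maintain a priority-ordered map of provider callables and a health check fed by your drift monitor (all names here are hypothetical):

```python
def complete_with_fallback(prompt, providers, is_healthy):
    """Try each provider in priority order, skipping any flagged as drifted.

    `providers` maps provider name -> callable taking a prompt;
    `is_healthy` is a callable backed by your drift-detection state.
    """
    for name, call in providers.items():
        if not is_healthy(name):
            continue  # drift detected on this provider: skip it
        try:
            return name, call(prompt)
        except Exception:
            continue  # provider error: fall through to the next one
    raise RuntimeError("No healthy provider available")
```

Because dicts preserve insertion order, the priority order is simply the order providers are registered in.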
Conclusion
AI drift is an inherent risk of building on third-party LLM providers. The solution is not avoiding these providers but implementing systematic monitoring that catches degradation before it impacts users.
Start with golden dataset testing on a weekly cadence, then build real-time quality metric tracking as your application matures. The goal is to transform "our AI broke" from a user complaint into a proactively caught and resolved engineering event.