Quick Answer
Shoppeal Tech reviewed 50+ AI agency RFP responses in 2025 while advising enterprise clients on vendor selection. The finding: roughly 80% of agencies claiming 'AI expertise' have thin LLM wrappers as their only shipped AI work. Three questions instantly separate the real from the fake: 'What is your hallucination rate on your last production RAG deployment?', 'Show me your eval framework', and 'What did you do when the model returned incorrect output?' Agencies that cannot answer these concretely have not shipped production AI.
50+ RFPs
Agencies Evaluated
~20%
Real AI Agencies
12 critical
Eval Questions
6 instant DQs
Red Flags
6 Red Flags That Disqualify an AI Agency Immediately
Red flag 1: Case studies with no metrics. 'We built an AI chatbot for a retail company' with no mention of accuracy, latency, cost, or user adoption. Real AI work has numbers.
Red flag 2: Their AI team is their web dev team. Ask who specifically will work on your project. If the 'AI engineers' also build React apps and do DevOps, they are generalists wearing an AI hat.
Red flag 3: They can't name the models they use. 'We use the latest AI technology' is not an answer. GPT-4o, Claude 3.5 Sonnet, Llama 3.1 70B, Mistral Large: real AI agencies make deliberate model choices with documented rationale.
Red flag 4: No eval framework. Ask 'How do you measure the quality of AI outputs before releasing to production?' If the answer doesn't include a systematic evaluation process with defined metrics, they are guessing.
Red flag 5: They've never dealt with a compliance requirement. Enterprise AI requires DPDP, SOC2, GDPR awareness. If they've never delivered an AI product under a compliance framework, they will create liability for you.
Red flag 6: Fixed-price contracts for AI work. AI development is inherently iterative: model behaviour changes, evals reveal new failure modes, and fine-tuning requires multiple cycles. A fixed price indicates they don't understand how AI development actually works.
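To make red flag 4 concrete, this is the shape of the minimal eval harness a credible agency should be able to show: a fixed test set, an explicit grading rule, and a release threshold. All names and the substring grader here are illustrative assumptions, not any specific tool; production teams typically swap in LLM-as-judge or semantic-similarity grading.

```python
# Minimal sketch of a systematic eval process: fixed cases, a grading
# rule, and a pass-rate gate before release. Illustrative only.
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    question: str
    expected: str  # ground-truth answer from the source documents

def grade(answer: str, expected: str) -> bool:
    # Placeholder grader: case-insensitive substring match.
    return expected.lower() in answer.lower()

def run_eval(cases: list[EvalCase],
             model: Callable[[str], str],
             threshold: float = 0.9) -> dict:
    passed = sum(grade(model(c.question), c.expected) for c in cases)
    rate = passed / len(cases)
    # Release only when the measured pass rate clears the agreed bar.
    return {"pass_rate": rate, "release_ok": rate >= threshold}
```

An agency that has actually shipped AI can point at the real version of each piece: where the test set came from, how the grader was validated, and who signs off when `release_ok` is false.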
The 12-Question Scorecard for AI Agency Evaluation
Score each question 1-5, for a maximum of 60. A total below 40 is a disqualifier.
- Show me 3 production AI systems you've shipped with measurable outcomes.
- What models have you used in LLM production, with what selection criteria?
- How do you measure hallucination rate in production RAG systems?
- What is your eval framework? (Tool + methodology)
- Describe a time an LLM behaved unexpectedly in production. How did you fix it?
- How do you handle prompt injection attacks in your applications?
- What does your AI inference cost monitoring look like?
- How do you manage model versioning when foundation models update?
- Have you worked under DPDP, SOC2, or GDPR compliance requirements?
- What is your standard data processing agreement for AI data?
- Who specifically will work on this project? (Names and GitHub profiles)
- What happens if the AI component underperforms against agreed benchmarks?
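The scoring rule above is simple enough to sketch directly; this hypothetical helper just encodes the 12-question, 1-5 scale and the 40/60 cutoff stated above.

```python
# Scorecard arithmetic: 12 questions scored 1-5 (max 60);
# any total below 40 disqualifies the agency.
def score_agency(scores: list[int]) -> dict:
    if len(scores) != 12 or not all(1 <= s <= 5 for s in scores):
        raise ValueError("expected 12 scores, each between 1 and 5")
    total = sum(scores)
    return {"total": total, "max": 60, "qualified": total >= 40}
```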
Frequently Asked Questions
Should we do a paid discovery before committing to an AI agency?
How do we protect our data when evaluating an AI agency?
Explore More
Free AI Audit
30 minutes with the Shoppeal Tech team to review your AI stack and build a 90-day roadmap.
Book Free Audit
Related Service
Dedicated AI Engineering Teams
Shoppeal Tech engineers deliver this end-to-end for enterprise teams.
View Service
BoundrixAI
The AI governance gateway: prompt injection protection, PII redaction, audit logging, and SOC2/DPDP compliance in one platform.
Request Demo
More AI Guides
Explore 15+ deep guides on AI governance, RAG, AEO/GEO, and offshore AI delivery.
Browse All Guides