Quick Answer
BoundrixAI's PII detection layer covers 28 Indian-specific entity types, including Aadhaar (12-digit with Verhoeff checksum validation), PAN (alphanumeric format + issuing authority inference), VPA/UPI IDs, IFSC codes, GST numbers, and Indian mobile numbers. It reports 99.2% recall and a <0.3% false positive rate at under 5ms latency. Standard PII libraries like Microsoft Presidio cover only 6–8 Indian entity types, missing the entities that matter most for DPDP compliance.
Indian entity types: 28
Detection recall: 99.2%
False positive rate: <0.3%
Latency: <5ms
The 28 Indian PII Entity Types Your LLM Must Not See
Identity documents: Aadhaar (12-digit), PAN (AAAAA0000A format), Passport (A1234567 format), Voter ID, Driving Licence
Financial identifiers: Bank account numbers (9-18 digit), IFSC codes (11 characters: four letters, a zero, then six alphanumeric characters), UPI/VPA IDs, Credit/debit card numbers (PCI-DSS scope), GST registration numbers
Contact information: Indian mobile numbers (+91 or 10-digit), landlines with STD codes, email addresses, postal pin codes
Healthcare identifiers: ABHA health ID, hospital MRD numbers
Business identifiers: CIN (Corporate Identity Number), DIN (Director Identification Number), LLPIN
Biometric references: Aadhaar biometric references, facial recognition hashes
Location markers: Full addresses with pin codes, GPS coordinates, village/taluka identifiers
Detection Methods: Regex, NER, and Hybrid Approaches
Regex patterns (fast, deterministic): Best for structured identifiers with fixed formats. Aadhaar: [2-9]{1}[0-9]{3}\s[0-9]{4}\s[0-9]{4} with Verhoeff checksum validation. PAN: [A-Z]{5}[0-9]{4}[A-Z]{1}. IFSC: [A-Z]{4}0[A-Z0-9]{6}. Limitation: misses contextual PII such as names and partial addresses.
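To make the regex pass concrete, here is a minimal Python sketch that applies the three patterns above and rejects Aadhaar candidates that fail the Verhoeff check. The function names (verhoeff_valid, find_structured_pii) and the added \b word boundaries are illustrative assumptions, not BoundrixAI's implementation; the lookup tables are the standard published Verhoeff tables.

```python
import re

# Standard Verhoeff tables: D is the dihedral-group multiplication table,
# P is the position-dependent permutation table.
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]


def verhoeff_valid(number: str) -> bool:
    """Return True if a digit string (check digit last) passes Verhoeff."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


# Patterns from the article; the \b anchors are an added assumption
# so that matches do not start or end inside a longer token.
PATTERNS = {
    "AADHAAR": re.compile(r"\b[2-9][0-9]{3}\s[0-9]{4}\s[0-9]{4}\b"),
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def find_structured_pii(text: str) -> list:
    """Return (label, matched_text) pairs for structured identifiers."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(text):
            value = m.group()
            # A 12-digit number that fails the checksum is not an Aadhaar.
            if label == "AADHAAR" and not verhoeff_valid(re.sub(r"\s", "", value)):
                continue
            hits.append((label, value))
    return hits
```

The checksum step is what keeps the false positive rate down: a random 12-digit number in the right shape passes the Verhoeff check only about one time in ten.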
Named Entity Recognition (NER): ML models trained on Indian text can detect names, organisations, and locations. Best models: IndicNLP suite, custom fine-tuned BERT on Indian regulatory text. Limitation: higher latency (50-200ms), requires GPU inference.
Hybrid pipeline (recommended): Regex first pass (< 1ms) to catch structured PII, then NER second pass for contextual PII. BoundrixAI implements this as a pre-processor that intercepts every LLM API call.
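The two-pass idea can be sketched in Python as follows. This is an illustration under stated assumptions, not BoundrixAI's pre-processor: hybrid_detect and regex_pass are hypothetical names, and toy_ner is a deliberately crude stand-in (capitalised words after an honorific) for whatever NER model you would actually plug in. Checksum validation is omitted here to keep the sketch short.

```python
import re
from typing import Callable, List, Tuple

Span = Tuple[int, int, str]  # (start, end, label)

# First pass: structured identifiers (patterns taken from the article).
STRUCTURED = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def regex_pass(text: str) -> List[Span]:
    """Cheap deterministic pass over fixed-format identifiers."""
    return [
        (m.start(), m.end(), label)
        for label, pattern in STRUCTURED.items()
        for m in pattern.finditer(text)
    ]


def toy_ner(text: str) -> List[Span]:
    """Placeholder for a real NER model: flags names after an honorific."""
    pattern = re.compile(r"\b(?:Mr|Ms|Mrs|Dr)\.?\s+([A-Z][a-z]+(?:\s[A-Z][a-z]+)*)")
    return [(m.start(1), m.end(1), "PERSON") for m in pattern.finditer(text)]


def hybrid_detect(text: str, ner: Callable[[str], List[Span]]) -> List[Span]:
    """Run the regex pass, then add NER spans that do not overlap a regex hit."""
    structured = regex_pass(text)
    spans = list(structured)
    for start, end, label in ner(text):
        overlaps = any(start < e and end > s for s, e, _ in structured)
        if not overlaps:
            spans.append((start, end, label))
    return sorted(spans)
```

Giving regex hits priority on overlap is a design choice: the structured match is deterministic, so it is the safer label when the two passes disagree about the same span.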
Implementing PII Redaction Without Breaking LLM Quality
The naive approach, replacing PII with [REDACTED], degrades LLM response quality because the model loses context. A better approach is to replace each value with a synthetic equivalent that preserves semantic meaning.
Aadhaar 7890 1234 5678 becomes AADHAAR_ID_1 (consistent token per session). PAN ABCDE1234F becomes PAN_ID_1. This lets the LLM reason about 'the user with AADHAAR_ID_1 submitted a loan application' without the model ever seeing the real Aadhaar.
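A minimal Python sketch of consistent per-session tokens might look like this. SessionRedactor and its attribute names are hypothetical, and only the Aadhaar and PAN patterns are wired in; a production version would also validate the Aadhaar checksum before tokenising.

```python
import re
from collections import defaultdict

# Token prefixes follow the article's examples (AADHAAR_ID_1, PAN_ID_1).
PATTERNS = {
    "AADHAAR_ID": re.compile(r"\b[2-9][0-9]{3}\s[0-9]{4}\s[0-9]{4}\b"),
    "PAN_ID": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
}


class SessionRedactor:
    """Replaces PII with stable placeholder tokens for one session."""

    def __init__(self):
        self.token_for = {}   # (label, normalised value) -> token
        self.value_for = {}   # token -> original value as first seen
        self._counts = defaultdict(int)

    def _token(self, label: str, value: str) -> str:
        key = (label, re.sub(r"\s", "", value))  # ignore spacing differences
        if key not in self.token_for:
            self._counts[label] += 1
            token = f"{label}_{self._counts[label]}"
            self.token_for[key] = token
            self.value_for[token] = value
        return self.token_for[key]

    def redact(self, text: str) -> str:
        """Return text with each detected value swapped for its token."""
        for label, pattern in PATTERNS.items():
            text = pattern.sub(
                lambda m, label=label: self._token(label, m.group()), text
            )
        return text
```

Because the map lives on the instance, the same Aadhaar number gets the same token across every prompt in the session, while a new session starts counting again from 1.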
For output: maintain a session-scoped substitution map. When the LLM references AADHAAR_ID_1 in its response, your post-processor replaces it with the last 4 digits (partial) or the original depending on the recipient's access level.
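The output side can be sketched the same way. In this hypothetical Python example, restore takes the session map (token to original value) and a boolean access flag; the asterisk masking format and the two-level access model are assumptions made for illustration.

```python
import re

# Matches the placeholder tokens used in the article's examples.
TOKEN = re.compile(r"\b(?:AADHAAR_ID|PAN_ID)_\d+\b")


def restore(text: str, value_for: dict, full_access: bool) -> str:
    """Swap tokens in LLM output back to real or partially masked values."""

    def swap(match):
        original = value_for.get(match.group())
        if original is None:
            return match.group()  # unknown token: leave it untouched
        if full_access:
            return original
        compact = re.sub(r"\s", "", original)
        # Partial view: mask everything except the last 4 characters.
        return "*" * (len(compact) - 4) + compact[-4:]

    return TOKEN.sub(swap, text)


# Example session map, as a redaction step would have built it.
session_map = {"AADHAAR_ID_1": "7890 1234 5678"}
reply = "The user with AADHAAR_ID_1 submitted a loan application."
masked = restore(reply, session_map, full_access=False)
full = restore(reply, session_map, full_access=True)
```

Leaving unknown tokens untouched matters: if the model invents a token such as PAN_ID_9 that was never issued in this session, the post-processor should not guess at a value for it.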
Frequently Asked Questions
Does Presidio work for Indian PII detection?
What happens if Aadhaar reaches our LLM?