PII Detection in LLM Pipelines India: Aadhaar, PAN Guide | Shoppeal Tech

Quick Answer

BoundrixAI's PII detection layer covers 28 Indian-specific entity types including Aadhaar (12-digit with Verhoeff checksum validation), PAN (alphanumeric format + issuing authority inference), VPA/UPI IDs, IFSC codes, GST numbers, and Indian mobile numbers with 99.2% recall and <0.3% false positive rate at under 5ms latency. Standard PII libraries like Microsoft Presidio cover only 6–8 Indian entity types, missing the entities that matter most for DPDP compliance.

Key figures: 28 Indian entity types, 99.2% detection recall, <0.3% false positive rate, <5ms latency.

The 28 Indian PII Entity Types Your LLM Must Not See

Identity documents: Aadhaar (12-digit), PAN (AAAAA0000A format), Passport (A1234567 format), Voter ID, Driving Licence

Financial identifiers: Bank account numbers (9-18 digit), IFSC codes (AAAA0000000 format), UPI/VPA IDs, Credit/debit card numbers (PCI-DSS scope), GST registration numbers

Contact information: Indian mobile numbers (+91 or 10-digit), landlines with STD codes, email addresses, postal pin codes

Healthcare identifiers: ABHA health ID, hospital MRD numbers

Business identifiers: CIN (Corporate Identity Number), DIN (Director Identification Number), LLPIN

Biometric references: Aadhaar biometric references, facial recognition hashes

Location markers: Full addresses with pin codes, GPS coordinates, village/taluka identifiers
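A few of the structured entity types above can be expressed as a small pattern registry. This is a minimal sketch, not BoundrixAI's implementation: the PAN and IFSC patterns follow the formats given in this article, while the GSTIN, mobile, and pin code patterns, the entity labels, and the `find_entities` helper are illustrative assumptions that would need checksum and context checks in production.

```python
import re

# Illustrative patterns only. PAN and IFSC follow the formats quoted in the
# article; GSTIN, IN_MOBILE and PIN_CODE are assumed formats for this sketch.
ENTITY_PATTERNS = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
    "GSTIN": re.compile(r"\b[0-9]{2}[A-Z]{5}[0-9]{4}[A-Z][1-9A-Z]Z[0-9A-Z]\b"),
    "IN_MOBILE": re.compile(r"(?<![0-9])(?:\+91[\s-]?)?[6-9][0-9]{9}(?![0-9])"),
    "PIN_CODE": re.compile(r"(?<![0-9])[1-9][0-9]{5}(?![0-9])"),
}


def find_entities(text):
    """Return (entity_type, matched_text) pairs in order of appearance."""
    hits = []
    for entity_type, pattern in ENTITY_PATTERNS.items():
        for match in pattern.finditer(text):
            hits.append((match.start(), entity_type, match.group()))
    hits.sort()  # order by position in the text
    return [(entity_type, value) for _, entity_type, value in hits]
```

Bare six-digit and ten-digit patterns will also match order numbers and other non-PII values, which is why contextual cues (nearby words such as "PIN" or "mobile") matter for keeping false positives low.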

Detection Methods: Regex, NER, and Hybrid Approaches

Regex patterns (fast, deterministic): Best for structured identifiers with fixed formats.

Aadhaar: [2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4} followed by Verhoeff checksum validation (the optional \s? also catches numbers written without spaces)

PAN: [A-Z]{5}[0-9]{4}[A-Z]

IFSC: [A-Z]{4}0[A-Z0-9]{6}

Limitation: regex misses contextual PII such as names and partial addresses.
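The Aadhaar step combines a format match with a Verhoeff checksum, which rejects most random 12-digit strings. The sketch below uses the standard Verhoeff tables and a variant of the Aadhaar pattern that also accepts the number without spaces. The function names `verhoeff_valid` and `find_aadhaar` are hypothetical and chosen for illustration.

```python
import re

# Standard Verhoeff tables: D is multiplication in the dihedral group D5,
# P is the position-dependent permutation (repeats every 8 positions).
D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6], [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8], [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2], [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4], [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2], [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0], [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5], [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]

# 12 digits, first digit 2-9, optional whitespace between groups of four.
AADHAAR_RE = re.compile(r"(?<![0-9])[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}(?![0-9])")


def verhoeff_valid(number: str) -> bool:
    """True if the digit string (check digit last) passes the Verhoeff check."""
    c = 0
    for i, ch in enumerate(reversed(number)):
        c = D[c][P[i % 8][int(ch)]]
    return c == 0


def find_aadhaar(text: str) -> list[str]:
    """Return checksum-valid Aadhaar candidates with whitespace removed."""
    found = []
    for match in AADHAAR_RE.finditer(text):
        digits = re.sub(r"\s", "", match.group())
        if verhoeff_valid(digits):
            found.append(digits)
    return found
```

As a sanity check on the tables, `verhoeff_valid("2363")` returns True, which matches the commonly cited worked example where 3 is the check digit for 236.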

Named Entity Recognition (NER): ML models trained on Indian text can detect names, organisations, and locations. Best models: IndicNLP suite, custom fine-tuned BERT on Indian regulatory text. Limitation: higher latency (50-200ms), requires GPU inference.

Hybrid pipeline (recommended): Regex first pass (< 1ms) to catch structured PII, then NER second pass for contextual PII. BoundrixAI implements this as a pre-processor that intercepts every LLM API call.
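The two-pass idea can be sketched as follows. This is an illustration under stated assumptions, not BoundrixAI's pre-processor: the NER pass is represented by any callable that returns labelled spans, and `toy_ner` is a hard-coded stand-in where a real model would go. `Span`, `detect`, and `redact` are hypothetical names.

```python
import re
from typing import Callable, List, NamedTuple


class Span(NamedTuple):
    start: int
    end: int
    label: str


# First pass: structured identifiers (patterns taken from the regex section).
STRUCTURED = {
    "PAN": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    "IFSC": re.compile(r"\b[A-Z]{4}0[A-Z0-9]{6}\b"),
}


def regex_pass(text: str) -> List[Span]:
    return [Span(m.start(), m.end(), label)
            for label, rx in STRUCTURED.items()
            for m in rx.finditer(text)]


def toy_ner(text: str) -> List[Span]:
    """Stand-in for a real NER model: flags one hard-coded name."""
    return [Span(m.start(), m.end(), "PERSON")
            for m in re.finditer(r"Asha Rao", text)]


def detect(text: str, ner: Callable[[str], List[Span]]) -> List[Span]:
    """Regex first, then NER; NER spans that overlap a kept span are dropped."""
    spans = regex_pass(text)
    for cand in ner(text):
        if all(cand.end <= s.start or cand.start >= s.end for s in spans):
            spans.append(cand)
    return sorted(spans)


def redact(text: str, spans: List[Span]) -> str:
    """Replace each detected span with its label in square brackets."""
    out, cursor = [], 0
    for s in spans:
        out.append(text[cursor:s.start])
        out.append(f"[{s.label}]")
        cursor = s.end
    out.append(text[cursor:])
    return "".join(out)
```

Giving regex matches priority over NER spans keeps the deterministic result when both passes flag the same characters. A latency-sensitive deployment could also skip the NER call entirely for prompts below some length or risk threshold.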

Implementing PII Redaction Without Breaking LLM Quality

The naive approach, replacing PII with [REDACTED], degrades LLM response quality because the model loses context. A better approach is to replace each value with a typed, consistent placeholder token that preserves its semantic role.

Aadhaar 7890 1234 5678 becomes AADHAAR_ID_1 (consistent token per session). PAN ABCDE1234F becomes PAN_ID_1. This lets the LLM reason about 'the user with AADHAAR_ID_1 submitted a loan application' without the model ever seeing the real Aadhaar.
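A minimal sketch of consistent per-session tokenisation, assuming one tokenizer object per session. The class name `SessionTokenizer` and its attributes are hypothetical, and the Verhoeff check is left out here to keep the example short.

```python
import re


class SessionTokenizer:
    """Maps each distinct PII value to a stable token for one session."""

    PATTERNS = {
        "AADHAAR_ID": re.compile(
            r"(?<![0-9])[2-9][0-9]{3}\s?[0-9]{4}\s?[0-9]{4}(?![0-9])"),
        "PAN_ID": re.compile(r"\b[A-Z]{5}[0-9]{4}[A-Z]\b"),
    }

    def __init__(self):
        self.token_for = {}  # normalised value -> token
        self.value_for = {}  # token -> normalised value
        self.counters = {}   # token prefix -> last number issued

    def _token(self, kind, raw):
        value = re.sub(r"\s", "", raw)  # "7890 1234 5678" == "789012345678"
        if value not in self.token_for:
            n = self.counters.get(kind, 0) + 1
            self.counters[kind] = n
            token = f"{kind}_{n}"
            self.token_for[value] = token
            self.value_for[token] = value
        return self.token_for[value]

    def tokenize(self, text):
        """Replace every matched value with its session token."""
        for kind, rx in self.PATTERNS.items():
            text = rx.sub(lambda m, k=kind: self._token(k, m.group()), text)
        return text
```

Because values are normalised before lookup, the same Aadhaar number written with or without spaces maps to the same token, so the model sees one consistent entity across turns.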

For output: maintain a session-scoped substitution map. When the LLM references AADHAAR_ID_1 in its response, your post-processor replaces it with the last 4 digits (partial) or the original depending on the recipient's access level.
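The output step can be sketched as a single substitution over the response. The `restore` function, the boolean `full_access` flag, and the `X` masking character are assumptions for illustration. A real deployment would derive the access level from its own authorisation model.

```python
import re


def restore(response: str, value_for: dict, full_access: bool = False) -> str:
    """Swap placeholder tokens in an LLM response for real or masked values.

    value_for is the session-scoped map from token to original value.
    """
    def swap(match):
        value = value_for.get(match.group())
        if value is None:
            return match.group()  # token not issued in this session: leave as-is
        if full_access:
            return value
        # Partial reveal: mask everything except the last 4 characters.
        return "X" * (len(value) - 4) + value[-4:]

    return re.sub(r"\b[A-Z]+_ID_[0-9]+\b", swap, response)
```

Leaving unknown tokens untouched is a deliberate choice in this sketch: a token the model invented should not be resolved to any real value.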

Frequently Asked Questions

Does Presidio work for Indian PII detection?

Microsoft Presidio is a good starting point, but its strength is generic PII (email, phone, credit card) and its India-specific coverage is limited. Recent releases include a few Indian recognisers (PAN and Aadhaar among them), but identifiers such as IFSC codes, UPI IDs, or GST numbers may not be covered in your version. Check the supported-entities list for the release you run, and plan on custom recognisers for Indian regulatory compliance.

What happens if Aadhaar reaches our LLM?

Under the DPDP Act 2023, an Aadhaar number is personal data, and one of the most sensitive identifiers a pipeline can handle. Processing it without explicit consent and appropriate safeguards is a regulatory violation, with penalties of up to ₹250 crore. In addition, once an Aadhaar number enters an LLM prompt, it may be logged in plaintext in your LLM provider's request logs, which is a data breach risk.

Tags: PII detection, Aadhaar, PAN, LLM security, DPDP, Indian AI compliance
