KYC & Document-Verification Framework
A transparent, standards-aligned pipeline that distinguishes real identity and company documents from forgeries. No black boxes: every signal is auditable.
Standards we align to
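- ICAO Doc 9303 — Machine-Readable Travel Documents (passports, IDs). We parse the MRZ and verify the four field check digits (document number, date of birth, date of expiry, optional data) plus the composite check digit, all computed with the 7-3-1 weighting.
- ISO/IEC 7501 — the ISO/IEC endorsement of the ICAO Doc 9303 machine-readable travel document specifications.
- eIDAS Regulation (EU) No 910/2014 — electronic identification and trust services, including eID schemes notified by member states.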
- FATF Recommendation 10 — Customer Due Diligence (CDD) requirements.
- UK FCA SYSC 6.3 + JMLSG — UK anti-money-laundering and counter-terrorist-financing rules.
- US FinCEN CIP — Customer Identification Program (31 CFR 1020.220).
- EU PRADO — Public Register of Authentic identity and travel Documents Online.
- ISO/IEC 18013-5 — mobile Driving Licence (mDL).
- PSD2 SCA + UK Open Banking 3.1 — for bank-account verification via OAuth 2.0 + FAPI.
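The ICAO Doc 9303 check digits referenced in the list above can be sketched as follows. This is a minimal illustration rather than the production parser: it assumes a 44-character TD3 (passport) second MRZ line that has already been extracted by OCR, and the function names `check_digit` and `verify_td3_line2` are hypothetical.

```python
DIGITS = "0123456789"
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
WEIGHTS = (7, 3, 1)  # repeating weight pattern defined by ICAO Doc 9303


def char_value(ch: str) -> int:
    """Map one MRZ character to its numeric value."""
    if ch in DIGITS:
        return int(ch)
    if ch in LETTERS:
        return LETTERS.index(ch) + 10  # A=10 ... Z=35
    if ch == "<":
        return 0  # filler character
    raise ValueError(f"invalid MRZ character: {ch!r}")


def check_digit(field: str) -> int:
    """Weighted sum of character values, modulo 10."""
    return sum(char_value(c) * WEIGHTS[i % 3] for i, c in enumerate(field)) % 10


def _matches(field: str, check_char: str, allow_filler: bool = False) -> bool:
    # An unused optional-data field may carry "<" in place of a check digit.
    if check_char == "<" and allow_filler:
        return set(field) <= {"<"}
    if check_char not in DIGITS:
        return False
    return check_digit(field) == int(check_char)


def verify_td3_line2(line: str) -> dict[str, bool]:
    """Verify the four field check digits and the composite on a TD3 line 2."""
    if len(line) != 44:
        raise ValueError("a TD3 MRZ line must be 44 characters long")
    return {
        "document_number": _matches(line[0:9], line[9]),
        "date_of_birth": _matches(line[13:19], line[19]),
        "date_of_expiry": _matches(line[21:27], line[27]),
        "optional_data": _matches(line[28:42], line[42], allow_filler=True),
        "composite": _matches(line[0:10] + line[13:20] + line[21:43], line[43]),
    }


# Specimen line from ICAO Doc 9303; every check passes.
specimen = "L898902C36UTO7408122F1204159ZE184226B<<<<<10"
result = verify_td3_line2(specimen)
```

Changing a single character of the document number makes both the `document_number` and `composite` checks fail, which is why an invented number rarely survives this step.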
Signals & weights
Each uploaded document is scored against seven independent signals. The weights sum to 1.00, so the weighted sum maps to a 0–100 reality_score. Gemini can orchestrate and explain the review, but deterministic checks produce the auditable decision:
| Signal | Weight | What it catches |
|---|---|---|
| mrz_checksum | 0.25 | Invented passport / ID numbers. A randomly invented number fails any single ICAO 9303 check digit about 9 times in 10. |
| ocr_template_match | 0.15 | Wrong layout, missing mandatory fields, wrong language for jurisdiction. |
| exif_metadata | 0.10 | Image touched by Photoshop / GIMP / Lightroom — non-camera editor traces. |
| ela_proxy | 0.10 | Error-Level-Analysis — copy-move forgery, splicing, double JPEG compression. |
| forgery_blacklist | 0.10 | SHA-256 match against published / known-forged document hashes. |
| gov_registry_link | 0.15 | Cross-check against UK Companies House, SEC EDGAR, EU PRADO, etc. |
| document_intelligence | 0.15 | Specimen/template text, impossible MRZ dates, weak official identifiers, and homemade fake-document patterns. |
Decision: score ≥ 80 → approve, 60–79 → review (human reviewer), < 60 → reject. Any blacklist hit is an automatic reject.
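The weights and thresholds above can be combined as in the following sketch. It assumes each signal has already been normalised to a value between 0.0 (fail) and 1.0 (pass), which the text does not state explicitly; the function names `reality_score` and `decide` are hypothetical.

```python
WEIGHTS = {
    "mrz_checksum": 0.25,
    "ocr_template_match": 0.15,
    "exif_metadata": 0.10,
    "ela_proxy": 0.10,
    "forgery_blacklist": 0.10,
    "gov_registry_link": 0.15,
    "document_intelligence": 0.15,
}


def reality_score(signals: dict[str, float]) -> float:
    """Weighted sum of the seven signals, scaled to 0-100."""
    missing = WEIGHTS.keys() - signals.keys()
    if missing:
        raise ValueError(f"missing signals: {sorted(missing)}")
    for name in WEIGHTS:
        if not 0.0 <= signals[name] <= 1.0:
            raise ValueError(f"signal {name} must be within [0, 1]")
    total = sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return round(100 * total, 2)  # rounding absorbs float noise at thresholds


def decide(signals: dict[str, float], blacklist_hit: bool = False) -> str:
    """Map a score to approve / review / reject; a blacklist hit always rejects."""
    if blacklist_hit:
        return "reject"
    score = reality_score(signals)
    if score >= 80:
        return "approve"
    if score >= 60:
        return "review"
    return "reject"


all_pass = {name: 1.0 for name in WEIGHTS}
no_mrz = dict(all_pass, mrz_checksum=0.0)  # everything passes except the MRZ
```

With these weights, a document that passes every check except `mrz_checksum` scores 75 and goes to a human reviewer instead of being approved automatically.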
Lobster Trap upload boundary
Before a document reaches verification, Lobster Trap DPI checks the file name and OCR text for malicious prompt injection, exfiltration instructions, and unsafe payload intent. Files are also size-limited, signature-checked, MIME-checked, optionally scanned by ClamAV, and stored by SHA-256.
Result states are explicit: approve, review, or reject. A specimen or fake-template document is rejected even if it carries a valid-looking company number, and the document_intelligence signal records the forensic reason.
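The size, signature, and MIME checks and the SHA-256 storage key could look like the sketch below. It covers only those steps, not the Lobster Trap DPI inspection or the ClamAV scan. The 10 MiB limit, the accepted MIME types, and the function name `admit_upload` are assumptions made for illustration.

```python
import hashlib

MAX_BYTES = 10 * 1024 * 1024  # hypothetical size limit

# Leading "magic" bytes for each accepted MIME type (assumed allow-list).
MAGIC_PREFIXES = {
    "application/pdf": (b"%PDF-",),
    "image/jpeg": (b"\xff\xd8\xff",),
    "image/png": (b"\x89PNG\r\n\x1a\n",),
}


def admit_upload(data: bytes, declared_mime: str) -> str:
    """Validate an upload and return the SHA-256 hex digest used as its storage key."""
    if len(data) > MAX_BYTES:
        raise ValueError("file exceeds the size limit")
    prefixes = MAGIC_PREFIXES.get(declared_mime)
    if prefixes is None:
        raise ValueError(f"MIME type not accepted: {declared_mime}")
    if not data.startswith(prefixes):
        raise ValueError("file signature does not match the declared MIME type")
    return hashlib.sha256(data).hexdigest()


key = admit_upload(b"%PDF-1.7\n...", "application/pdf")
```

Storing by digest also means the same value can be compared directly against the forgery blacklist without reading the file again.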
Didit Free KYC confirmation provider
The production identity flow now supports Didit as an external confirmation provider orchestrated by Gemini. The platform creates a hosted Didit session, redirects the user to Didit for camera/ID capture, then reads the final decision through the Sessions API and webhooks.
- Government ID OCR and template checks across 220+ countries and 14,000+ document types.
- Selfie/video presence checks to reduce replay, mask, spoof, and deepfake attempts.
- Biometric comparison between the live capture and the ID portrait.
- Geolocation, VPN/proxy/Tor, device and duplicate-risk signals.
Gemini does not invent KYC outcomes and does not replace the provider decision. It starts the session, explains consent, checks status, interprets Didit evidence, and combines it with local Lobster Trap/document signals where relevant.
Setup links: Didit Console · Create Session API · Webhooks · Workflows
How commercial KYC providers do it
The Architect's pipeline mirrors the public technical disclosures of:
- Jumio Netverify — OCR + MRZ + face-match + 3D liveness. Backed by 1B+ document templates.
- Onfido Real Identity Platform — "Atlas" ML model trained on 100M+ documents; uses ELA, JPEG ghost detection, font-kerning analysis.
- Veriff — passive liveness (texture, micro-movements), MRZ check, NFC chip read on eMRTD passports.
- Sumsub — face-match + ID OCR + AML watch-list cross-check (PEP, sanctions).
- Persona — modular signals: device fingerprint, behavioural biometrics, doc OCR, selfie liveness.
- Trulioo GlobalGateway — wraps 400+ government data sources for KYC/KYB worldwide.
- Shufti Pro — supports 3,000+ document types in 230+ countries.
Open-source building blocks we use (or can plug in)
- Tesseract 5 — OCR (26 language packs installed on prod).
- PassportEye — Python MRZ extractor with check-digit verification.
- pypassport / pyMRTD — eMRTD NFC chip read for passive authentication.
- FaceONNX / InsightFace — face-recognition for selfie / ID-photo match.
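- OpenCV — used to implement Error-Level Analysis (re-compress the image, then diff against the original) for copy-move and splicing detection.
- BRISQUE / NIQE — blind (no-reference) image-quality metrics, useful as a feature for liveness scoring.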
- ExifTool — metadata extraction (camera make/model, editing software, GPS, timestamps).
- libheif / sharp / ImageMagick — format normalization, re-compression for ELA, EXIF strip.
- pdfminer.six / pdftotext — PDF text extraction for incorporation certificates.
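As one example of how these building blocks feed a signal, the sketch below flags editor traces in metadata that has already been extracted, for instance with `exiftool -json`. It assumes the metadata arrives as a flat dict keyed by tag name (`Software`, `CreatorTool`); the marker list and the function name `editor_traces` are hypothetical.

```python
# Lower-case substrings that indicate a non-camera editor (assumed list).
EDITOR_MARKERS = ("photoshop", "gimp", "lightroom")

# Tags that commonly name the software that last wrote the file.
SOFTWARE_TAGS = ("Software", "CreatorTool")


def editor_traces(metadata: dict) -> list[tuple[str, str]]:
    """Return (tag, marker) pairs for every editor marker found in the metadata."""
    hits = []
    for tag in SOFTWARE_TAGS:
        value = str(metadata.get(tag, "")).lower()
        for marker in EDITOR_MARKERS:
            if marker in value:
                hits.append((tag, marker))
    return hits


camera_only = {"Make": "Canon", "Model": "EOS 80D"}
edited = {"Make": "Canon", "Software": "Adobe Photoshop 25.0 (Windows)"}
```

An empty result does not prove the image is untouched, since metadata can be stripped or rewritten, so this check is best treated as one weighted signal among several.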
Government registry cross-checks
| Doc type | Jurisdiction | Authoritative source |
|---|---|---|
| Incorporation | UK | Companies House — find-and-update.company-information.service.gov.uk |
| Incorporation | US | State Secretary of State business search (incorporation records) · SEC EDGAR (SEC registrants only) |
| Incorporation | EU | BRIS (Business Registers Interconnection System) |
| Passport | any | EU PRADO (specimen reference) · ICAO PKD (Public Key Directory for eMRTD) |
| Driving licence | UK | DVLA (Driver and Vehicle Licensing Agency) — driving licence check-code sharing service |
| National ID | EU | eIDAS notified schemes per member state |
| Tax ID | EU | VIES — VAT number validation |
Local liveness & face-match roadmap
- Passive liveness — single-frame texture analysis (BRISQUE), screen-replay detection.
- Active liveness — challenge/response: blink, smile, head-turn.
- 3D liveness — depth selfie via TrueDepth / structured-light, defeats printed-mask spoofs.
- Face-match — Didit handles hosted production verification today; local ArcFace/InsightFace remains a future optional second opinion.
- NIST FRVT (now FRTE) benchmark — accuracy targets we align to: FNMR < 0.5% at FMR = 1e-6.
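The FMR and FNMR figures in the last item are measured at a fixed decision threshold. A minimal sketch of that calculation, assuming similarity scores where higher means "more likely the same person" (the function name `error_rates` and the sample scores are illustrative only):

```python
def error_rates(genuine: list[float], impostor: list[float],
                threshold: float) -> tuple[float, float]:
    """Return (FMR, FNMR) at the given threshold.

    FMR:  share of impostor comparisons wrongly accepted (score >= threshold).
    FNMR: share of genuine comparisons wrongly rejected (score < threshold).
    """
    if not genuine or not impostor:
        raise ValueError("both score lists must be non-empty")
    fmr = sum(s >= threshold for s in impostor) / len(impostor)
    fnmr = sum(s < threshold for s in genuine) / len(genuine)
    return fmr, fnmr


genuine_scores = [0.90, 0.80, 0.40, 0.95]   # same-person pairs
impostor_scores = [0.10, 0.20, 0.70, 0.30]  # different-person pairs
fmr, fnmr = error_rates(genuine_scores, impostor_scores, threshold=0.5)
```

Demonstrating an FMR as low as 1e-6 requires millions of impostor comparisons, so a small test set like this one can only illustrate the definitions.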
What we publish vs. what we keep private
When an enterprise is approved and published, the public snapshot strips sha256, storage_uri, account_iban, redaction_uri, api_key_hash, and the full ocr_text. Only aggregated KYC outcomes (decision, score, framework) are visible. Raw documents and bank tokens are stored encrypted at rest and never leave the verification boundary.
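The stripping step can be sketched as a recursive filter over the stored record. The field names come from the paragraph above; the record layout and the function name `public_snapshot` are assumptions.

```python
PRIVATE_FIELDS = frozenset({
    "sha256", "storage_uri", "account_iban",
    "redaction_uri", "api_key_hash", "ocr_text",
})


def public_snapshot(value):
    """Copy a record, dropping private fields at every nesting level."""
    if isinstance(value, dict):
        return {k: public_snapshot(v) for k, v in value.items()
                if k not in PRIVATE_FIELDS}
    if isinstance(value, list):
        return [public_snapshot(item) for item in value]
    return value


record = {
    "decision": "approve",
    "score": 91.5,
    "framework": "FATF R.10",
    "api_key_hash": "ab12",
    "documents": [{"type": "passport", "sha256": "ff00", "ocr_text": "P<UTO"}],
}
snapshot = public_snapshot(record)
```

A deny-list like this mirrors the text. An allow-list of the three published fields would be stricter, because a newly added private field could not leak by default.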