MIDV‑550
Existing public benchmarks (e.g., [1], IDDoc [2], SROIE [3]) either contain a limited number of document classes, provide only coarse bounding‑box annotations, or lack realistic mobile acquisition conditions. Consequently, progress in robust MIV systems has been hindered by a mismatch between training data and real‑world deployment scenarios.
A composite score is reported for overall ranking.

5. Experimental Results

5.1 Document Detection

| Model | mAP@0.5 | Inference time (ms / image) |
|-------|---------|-----------------------------|
| Faster R‑CNN (ResNet‑101) | 0.89 | 128 |
| EfficientDet‑D4 | 0.92 | 71 |
| YOLOv8‑x (baseline) | 0.95 | 38 |
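For readers unfamiliar with the mAP@0.5 column, the sketch below illustrates the matching rule that underlies it: a detection counts as a true positive when its intersection over union (IoU) with a not-yet-matched ground-truth box is at least 0.5. This is a simplified greedy matcher written for illustration, not the released evaluation script; the names `box_iou` and `match_detections` and the `(x_min, y_min, x_max, y_max)` box layout are assumptions.

```python
from typing import List, Sequence, Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def match_detections(
    detections: Sequence[Tuple[Box, float]],
    ground_truth: Sequence[Box],
    iou_threshold: float = 0.5,
) -> List[bool]:
    """Flag each detection (in descending score order) as TP (True) or FP (False).

    Each ground-truth box can be matched at most once, so duplicate
    detections of the same document are counted as false positives.
    """
    matched = set()
    flags: List[bool] = []
    for box, _score in sorted(detections, key=lambda d: d[1], reverse=True):
        best_iou, best_idx = 0.0, -1
        for idx, gt_box in enumerate(ground_truth):
            if idx in matched:
                continue
            iou = box_iou(box, gt_box)
            if iou > best_iou:
                best_iou, best_idx = iou, idx
        if best_idx >= 0 and best_iou >= iou_threshold:
            matched.add(best_idx)
            flags.append(True)
        else:
            flags.append(False)
    return flags
```

Average precision is then obtained from the precision-recall curve built over these flags, and mAP averages that value across document classes.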
Object detectors such as Faster R‑CNN [5], YOLOv8 [6], and EfficientDet [7] have become de facto standards. However, their performance on low‑resolution, heavily distorted ID images remains under‑explored.
Technical Report – April 2026

Abstract

The proliferation of mobile‑based identity‑verification services has created a pressing need for realistic, large‑scale datasets that capture the visual variability of government‑issued identification (ID) documents captured with consumer‑grade smartphones. We introduce MIDV‑550, a publicly released benchmark consisting of 5 550 high‑resolution images of five common ID‑document types (passport, national ID card, driver’s licence, residence permit, and employee badge) captured under uncontrolled lighting, pose, motion blur, and occlusion conditions. Each image is richly annotated with document‑level bounding boxes, per‑field polygons, text transcriptions, and a hierarchy of quality‑assessment tags. We present a systematic evaluation of state‑of‑the‑art detection (YOLOv8, EfficientDet‑D4) and recognition pipelines (CRNN, Transformer‑based OCR) on MIDV‑550, establishing baseline performance and highlighting the remaining challenges in mobile ID verification. The dataset, annotation tools, and evaluation scripts are released under a permissive CC‑BY‑4.0 license to foster reproducible research.

1. Introduction

Mobile identity verification (MIV) has become a core component of financial onboarding, e‑government services, and travel‑related applications. Unlike traditional document‑verification workflows that rely on high‑quality scanners, MIV must cope with images captured by handheld smartphones in a wide range of uncontrolled environments. This introduces a set of visual degradations (low illumination, motion blur, perspective distortion, specular highlights, and partial occlusion) that dramatically affect both document detection and optical character recognition (OCR).
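To make the annotation layers listed in the abstract concrete (document‑level bounding box, per‑field polygons, text transcriptions, and hierarchical quality tags), the following sketch shows one way a single image record could be represented. The class names, attribute names, tag strings, and the sample values are all hypothetical and chosen for illustration; they are not the released annotation format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Tuple

Point = Tuple[float, float]  # (x, y) in pixel coordinates


@dataclass
class FieldAnnotation:
    """One annotated field, e.g. the MRZ or the portrait (hypothetical layout)."""
    name: str
    polygon: List[Point]  # per-field polygon vertices
    text: str             # transcription; empty for non-text fields


@dataclass
class ImageAnnotation:
    """All annotation layers for one image (hypothetical layout)."""
    image_id: str
    document_type: str
    document_box: Tuple[float, float, float, float]  # x_min, y_min, x_max, y_max
    fields: List[FieldAnnotation] = field(default_factory=list)
    # Hierarchical quality tags encoded as slash-separated paths (an assumption).
    quality_tags: List[str] = field(default_factory=list)

    def to_json(self) -> str:
        """Serialize the record; tuples become JSON arrays."""
        return json.dumps(asdict(self), ensure_ascii=False)


# Made-up sample record, for illustration only.
example = ImageAnnotation(
    image_id="sample_0001",
    document_type="passport",
    document_box=(120.0, 80.0, 980.0, 700.0),
    fields=[
        FieldAnnotation(
            name="mrz",
            polygon=[(140.0, 600.0), (960.0, 600.0), (960.0, 680.0), (140.0, 680.0)],
            text="",
        )
    ],
    quality_tags=["blur/motion", "lighting/low"],
)
```

Encoding the tag hierarchy as paths keeps the record flat and easy to filter by prefix (for example, every image tagged under `blur/`), at the cost of not validating the hierarchy itself.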
YOLOv8‑x attains the highest detection recall (98 %) while maintaining real‑time speed on mobile‑grade hardware (≈ 150 ms per image using TensorRT).

| Model | Mean IoU (all fields) | MRZ IoU | Portrait IoU |
|-------|----------------------|----------|--------------|
| Mask R‑CNN (ResNeXt‑101) | 0.78 | 0.84 | 0.71 |
| DETR‑Doc (ViT‑B) | 0.74 | 0.80 | 0.68 |
| Mask R‑CNN + Geometric Refine (baseline) | 0.82 | 0.88 | 0.75 |
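The IoU columns above compare predicted and annotated field regions. As a minimal sketch of how such a score can be computed for polygonal fields, the code below clips one polygon against the other with the Sutherland‑Hodgman algorithm and measures areas with the shoelace formula. It assumes both polygons are convex (reasonable for quadrilateral fields such as an MRZ strip); the function names are illustrative, and the released evaluation scripts may well use a different method, such as mask rasterization.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def signed_area(poly: Sequence[Point]) -> float:
    """Shoelace formula; the sign encodes the vertex orientation."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return total / 2.0


def clip_convex(subject: Sequence[Point], clip: Sequence[Point]) -> List[Point]:
    """Sutherland-Hodgman clipping of `subject` by a convex `clip` polygon."""
    clip = list(clip)
    if signed_area(clip) < 0:  # normalize so the interior lies left of each edge
        clip.reverse()

    def inside(p: Point, a: Point, b: Point) -> bool:
        return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0]) >= 0

    def intersect(p: Point, q: Point, a: Point, b: Point) -> Point:
        # Point where segment p->q crosses the infinite line through a and b.
        dx1, dy1 = q[0] - p[0], q[1] - p[1]
        dx2, dy2 = b[0] - a[0], b[1] - a[1]
        denom = dx1 * dy2 - dy1 * dx2
        t = ((a[0] - p[0]) * dy2 - (a[1] - p[1]) * dx2) / denom
        return (p[0] + t * dx1, p[1] + t * dy1)

    output = list(subject)
    for i in range(len(clip)):
        a, b = clip[i], clip[(i + 1) % len(clip)]
        current, output = output, []
        if not current:
            break
        prev = current[-1]
        for cur in current:
            if inside(cur, a, b):
                if not inside(prev, a, b):
                    output.append(intersect(prev, cur, a, b))
                output.append(cur)
            elif inside(prev, a, b):
                output.append(intersect(prev, cur, a, b))
            prev = cur
    return output


def polygon_iou(pred: Sequence[Point], target: Sequence[Point]) -> float:
    """IoU of two convex polygons."""
    inter = abs(signed_area(clip_convex(pred, target)))
    union = abs(signed_area(pred)) + abs(signed_area(target)) - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs: Sequence[Tuple[Sequence[Point], Sequence[Point]]]) -> float:
    """Average IoU over (predicted, annotated) polygon pairs."""
    return sum(polygon_iou(p, t) for p, t in pairs) / len(pairs) if pairs else 0.0
```

A per‑field figure such as the MRZ column would correspond to calling `mean_iou` on the pairs belonging to that field only.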