Custom ML vs Cloud APIs for Face Recognition

TL;DR: For real-time face recognition (such as access control or surveillance), cloud APIs like Amazon Rekognition are too slow (~300 ms latency) and expensive at scale. A custom edge-deployed pipeline (SCRFD and ArcFace on local GPUs) cuts latency to under 25 ms, operates fully offline, keeps biometric data on-premises, and eliminates ongoing per-call API fees.
As computer vision applications move from asynchronous batch processing to real-time interactive edge deployments, the limitations of cloud-based, off-the-shelf APIs have become more apparent. While cloud platforms like Amazon Rekognition and Microsoft Azure Face API provide convenient, quick-start capabilities, they often fail to meet the performance, cost, and data governance demands of modern high-frequency deployments.
For real-time use cases—such as terminal access control, automated classroom attendance, and high-frame-rate video surveillance—deploying custom machine learning pipelines optimized for the edge offers significant advantages.
The Cloud Latency Bottleneck and Network Dependencies
Real-time facial analysis systems require processing pipelines with minimal latency to maintain spatial continuity and support multi-face tracking across consecutive video frames. Cloud-based face recognition APIs introduce significant latency due to network round-trips.
In standard production tests, cloud API response times range from 240 ms to 450 ms, depending on network bandwidth, routing congestion, and geographical proximity to the provider's data centers. This latency makes cloud APIs unsuitable for live camera feeds or interactive applications.
For example, a security terminal tracking multiple moving targets over a high-resolution camera feed must operate at a minimum of 25 to 30 frames per second. This gives the system a processing budget of only 33 ms to 40 ms per frame.
A cloud-based API with a 300 ms response time would trail the live scene by roughly 7.5 frames at 25 FPS and 9 frames at 30 FPS, leading to tracking failures, missed detections, and a poor user experience.
In contrast, an edge-optimized custom face recognition pipeline running locally on hardware like an NVIDIA Jetson AGX Orin or a dedicated desktop GPU can process frames with an end-to-end latency of roughly 15 ms to 25 ms, which fits inside that per-frame budget and enables smooth, real-time performance.
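The frame-budget arithmetic above can be checked with a few lines of Python. This is only a sketch of the calculation; the function names are illustrative and not part of any library.

```python
def frame_budget_ms(fps: float) -> float:
    """Time available to process one frame at the given frame rate."""
    return 1000.0 / fps


def frames_behind(latency_ms: float, fps: float) -> float:
    """How many new frames arrive while one request is still in flight."""
    return latency_ms * fps / 1000.0


for fps in (25, 30):
    print(
        f"{fps} FPS: budget {frame_budget_ms(fps):.1f} ms per frame, "
        f"a 300 ms round trip falls {frames_behind(300, fps):.1f} frames behind"
    )
```

Any stage whose latency exceeds `frame_budget_ms(fps)` must either skip frames or queue them, which is where tracking continuity breaks down.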
Custom Edge Pipeline (NVIDIA Jetson AGX Orin / Local GPU)
[Camera Feed] ──(5-15ms)──> [SCRFD Detection] ──(2.4ms)──> [ArcFace Embeddings] ──(<1ms)──> [FAISS Match] ──> Match Verified (Total: ~15-25ms)
Cloud API Pipeline (AWS Rekognition / Azure Face API)
[Camera Feed] ──(120-250ms Network Tx)──> [Cloud Model Processing] ──(120-200ms Network Rx)──> Match Verified (Total: ~240-450ms)
Furthermore, cloud APIs require continuous, high-bandwidth internet connectivity. Any network drop, DNS resolution failure, or server-side throttling will immediately disrupt the application.
Conversely, custom pipelines run fully offline on edge devices. This offline capability is crucial for remote industrial sites, critical infrastructure, secure research facilities, and mobile applications where network connectivity is unreliable or restricted for security reasons.
The Anatomy of an Edge-Optimized Custom Pipeline
A high-performance custom face recognition pipeline splits the processing workload into specialized, highly optimized stages. This modular architecture allows developers to optimize each step to maximize throughput and maintain high accuracy.
Raw Video Stream (GStreamer / OpenCV)
│
▼
Face Detection (SCRFD det_10g)
│
▼
Facial Landmark Identification
│
▼
Affine Alignment & Normalization
│
▼
Feature Extraction (ArcFace Embedding Generator)
│
▼
Vector Search Database (FAISS GPU Indexing)
│
▼
Identity Verification
1. Face Detection and Landmark Localization
The pipeline begins by localizing faces within the raw video stream. Custom pipelines use architectures like SCRFD (Sample and Computation Redistribution for Face Detection) or RetinaFace.
SCRFD is designed for real-time performance, using optimized anchor distributions to balance speed and accuracy. For each detected face, the model outputs bounding-box coordinates, a detection confidence score, and five facial landmarks (the two eyes, the nose tip, and the two mouth corners).
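One way to carry the detector output through the rest of the pipeline is a small record type. The sketch below is hypothetical: the class name, field names, landmark ordering, and the 0.5 score threshold are assumptions for illustration, not the output format of any particular SCRFD implementation.

```python
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]


@dataclass
class FaceDetection:
    box: Tuple[float, float, float, float]  # x1, y1, x2, y2 in pixels
    score: float                            # detection confidence in [0, 1]
    landmarks: List[Point]                  # assumed order: eyes, nose, mouth corners


def keep_confident(dets: List[FaceDetection], min_score: float = 0.5) -> List[FaceDetection]:
    """Drop low-confidence detections and any without all five landmarks."""
    return [d for d in dets if d.score >= min_score and len(d.landmarks) == 5]


dets = [
    FaceDetection((10, 20, 110, 140), 0.93, [(40, 60), (80, 60), (60, 85), (45, 110), (75, 110)]),
    FaceDetection((300, 50, 330, 90), 0.21, [(305, 60), (320, 60), (312, 70), (307, 80), (318, 80)]),
]
print(len(keep_confident(dets)))
```

Filtering before alignment keeps spurious low-score boxes from consuming embedding and search time downstream.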
2. Spatial Alignment and Normalization
To compensate for head tilt, rotation, and distance variations, the system uses the five detected facial landmarks to perform an affine transformation. This aligns and normalizes the facial features, cropping the face into a standardized 112 x 112 pixel RGB image. Standardizing this input is critical to ensure the downstream feature extractor remains accurate across different camera angles and distances.
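The alignment step can be sketched without any imaging library by fitting a similarity transform (scale, rotation, and translation, a restricted form of affine transform) from the five detected landmarks to a fixed template. The template coordinates below are an assumption: they are the values commonly shipped with ArcFace implementations for 112 x 112 crops and should be checked against the model actually deployed. Function names are illustrative.

```python
import cmath

# Assumed reference positions inside a 112 x 112 crop (left eye, right eye,
# nose, left mouth corner, right mouth corner). Verify against your model.
TEMPLATE_112 = [
    (38.2946, 51.6963), (73.5318, 51.5014), (56.0252, 71.7366),
    (41.5493, 92.3655), (70.7299, 92.2041),
]


def estimate_similarity(src, dst):
    """Least-squares fit of z -> a*z + b mapping src points onto dst.

    Points are treated as complex numbers, so `a` encodes scale and
    rotation and `b` encodes translation.
    """
    s = [complex(x, y) for x, y in src]
    d = [complex(x, y) for x, y in dst]
    mean_s = sum(s) / len(s)
    mean_d = sum(d) / len(d)
    num = sum((dj - mean_d) * (sj - mean_s).conjugate() for sj, dj in zip(s, d))
    den = sum(abs(sj - mean_s) ** 2 for sj in s)
    a = num / den
    b = mean_d - a * mean_s
    return a, b


def as_affine_rows(a, b):
    """Express z -> a*z + b as a 2 x 3 matrix for an image-warping routine."""
    return [[a.real, -a.imag, b.real],
            [a.imag, a.real, b.imag]]


# Synthetic example: a face that appears twice as large, rotated, and shifted.
true_a, true_b = cmath.rect(0.5, -0.3), complex(-20.0, 35.0)
detected = [(complex(x, y) - true_b) / true_a for x, y in TEMPLATE_112]
detected = [(z.real, z.imag) for z in detected]

a, b = estimate_similarity(detected, TEMPLATE_112)
print(as_affine_rows(a, b))
```

The resulting 2 x 3 matrix is the form that warping functions such as OpenCV's `warpAffine` accept, with the output size set to 112 x 112.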
3. Feature Extraction (Embeddings)
The aligned face image is then passed to a deep feature generator, such as ArcFace (specifically the ResNet50 backbone trained on the WebFace600K or MS1MV3 datasets). ArcFace uses an Additive Angular Margin Loss function to map the face to a compact, 512-dimensional vector on a unit hypersphere, maximizing the distance between different identities while minimizing the distance between images of the same person.
The mathematical formulation of the ArcFace loss function is:
$$
\mathcal{L} = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{s \cos(\theta_{y_i} + m)}}{e^{s \cos(\theta_{y_i} + m)} + \sum_{j \neq y_i} e^{s \cos \theta_j}}
$$
where $N$ is the number of samples in the batch, $s$ denotes the feature scale, $m$ represents the additive angular margin, $\theta_{y_i}$ is the angle between the embedding vector and the target class weight vector, and $\theta_j$ is the angle to the weight vector of class $j$. This margin enforces strong discriminability, allowing the system to achieve over 99.8% verification accuracy on standard benchmarks like LFW.
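To make the effect of the margin concrete, the loss above can be evaluated for a single sample. This is a minimal sketch, assuming the cosines to each class centre are already known; the defaults s = 64 and m = 0.5 are the values usually quoted for ArcFace and are assumptions here, and the sketch ignores the special handling real training code applies when the margin pushes the angle past 180 degrees.

```python
import math


def arcface_loss(cosines, target, s=64.0, m=0.5):
    """Per-sample ArcFace loss given cos(theta_j) for every class j."""
    theta = math.acos(max(-1.0, min(1.0, cosines[target])))
    logits = [s * c for c in cosines]
    logits[target] = s * math.cos(theta + m)  # margin applied to the true class only
    peak = max(logits)                        # log-sum-exp for numerical stability
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


cosines = [0.6, 0.5, 0.1]  # toy values: true class 0 is only slightly ahead
print("without margin:", arcface_loss(cosines, 0, m=0.0))
print("with margin:   ", arcface_loss(cosines, 0, m=0.5))
```

With `m=0.0` the function reduces to ordinary scaled softmax cross-entropy. With the margin enabled, the same embedding is penalized far more heavily, so training must pull it closer to its class centre before the loss becomes small.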
For environments prone to specular reflections, off-frontal poses, or facial occlusions (such as face masks), developers can deploy specialized architectures like RCF-Face. RCF-Face decouples identity features from environmental disturbances, achieving a Composite-Disturbance Score of 89.05% under challenging conditions while running within a light 1.12 million parameter budget.
4. High-Speed Vector Indexing and Search
Once the 512-dimensional embedding vector is generated, it must be compared against a database of known identities. While cloud services often charge query fees to search a face collection, a custom pipeline can integrate an on-device vector search library such as FAISS (Facebook AI Similarity Search) running directly on the GPU.
Using FAISS GPU indexing, the system can perform cosine similarity matching across thousands of stored profiles in less than 1 ms, ensuring the database lookup does not bottleneck the overall frame rate.
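The matching step itself is simple: once embeddings are L2-normalized, cosine similarity is just an inner product, and the best-scoring enrolled identity is accepted if it clears a threshold. The sketch below is a pure-Python stand-in for that computation, not FAISS code; the function names, the 0.4 threshold, and the 4-dimensional toy vectors are assumptions for illustration (real embeddings here would be 512-dimensional and non-zero).

```python
import math


def l2_normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def best_match(query, gallery, threshold=0.4):
    """Return (identity, similarity) for the closest enrolled embedding.

    The identity is None when the best similarity is below the threshold.
    """
    q = l2_normalize(query)
    best_id, best_sim = None, -1.0
    for identity, emb in gallery.items():
        sim = sum(a * b for a, b in zip(q, l2_normalize(emb)))
        if sim > best_sim:
            best_id, best_sim = identity, sim
    return (best_id if best_sim >= threshold else None), best_sim


gallery = {"alice": [1.0, 0.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0, 0.0]}
print(best_match([0.9, 0.1, 0.0, 0.0], gallery))  # close to alice
print(best_match([0.0, 0.0, 1.0, 0.0], gallery))  # unknown face
```

A GPU index performs the same inner-product search in bulk over the whole gallery, which is why the lookup stays cheap as the number of enrolled profiles grows.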
Hardware-Level Optimizations and Performance Benchmarks
Achieving ultra-low latency requires hardware-level optimizations that compile models for the target GPU and streamline memory usage. When deploying on hardware such as an NVIDIA Jetson AGX Orin or an RTX-series discrete GPU, developers can convert their models into TensorRT engines. The build step selects kernels tuned for the target device, fuses layers, and can reduce numerical precision from FP32 to FP16 or INT8.
On an NVIDIA Jetson Orin NX (16 GB), compiling a face verification model with FP16 precision yields a batch-1 latency of 2.41 ms. Moving to INT8 quantization reduces latency to 1.66 ms, allowing the system to process hundreds of frames per second on edge devices.
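The idea behind INT8 quantization can be illustrated with the simplest symmetric, per-tensor scheme: pick one scale so the largest magnitude maps to 127, round, and store integers. This is a generic sketch under that assumption, not TensorRT's procedure; production toolchains choose scales through calibration on representative data.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization to integers in [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0 or 1.0  # avoid a zero scale
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize(q, scale):
    """Recover approximate float values from the stored integers."""
    return [x * scale for x in q]


weights = [0.5, -1.27, 0.01, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(q, scale, max_error)
```

Each value is stored in one byte instead of four, and the rounding error is bounded by half the scale, which is the accuracy cost traded for the lower latency.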
To prevent CPU bottlenecking, the data pipeline should be kept on the GPU as much as possible. GStreamer 1.0 pipelines combined with NVIDIA's hardware-accelerated decoding (nvv4l2decoder) decode RTSP video feeds directly in GPU memory.
Additionally, executing the image normalization step directly on the GPU with a library such as CuPy cuts CPU-to-GPU data transfer overhead by a factor of 4, leading to a 6% increase in overall inference speed.
Post-processing routines compiled with Numba can further speed up face alignment and cropping by up to 4.5 times compared with standard Python implementations.
The Financial Realities of Scale
While cloud-based APIs have low upfront costs, their pay-per-call pricing models can become prohibitively expensive at scale.
For instance, an enterprise security system processing live video feeds across 10 security cameras must continuously analyze frames to detect and identify faces. If the system processes just one frame per second per camera over a 12-hour operational day, it will generate:
- Daily Volume = 10 cameras × 1 frame/sec × 43,200 seconds = 432,000 images/day
- Monthly Volume = 432,000 images/day × 30 days = 12,960,000 images/month
Using Amazon Rekognition’s standard tiered pricing for face analysis, the monthly operational cost would be calculated as follows:
- First 1M images = 1,000,000 × $0.0010 = $1,000
- Next 4M images = 4,000,000 × $0.0008 = $3,200
- Remaining 7.96M images = 7,960,000 × $0.0006 = $4,776
- Total Monthly Cost ≈ $8,976
This calculation excludes auxiliary fees, such as data egress charges, vector metadata storage fees, or liveness detection checks (approximately $15 per 1,000 sessions on Azure and $10 to $15 per 1,000 checks on AWS). Under this pricing structure, scaling the system to more cameras or higher frame rates quickly becomes financially unviable.
In contrast, a custom edge pipeline requires a one-time capital expenditure for hardware (such as an NVIDIA Jetson AGX Orin developer kit or a dedicated server equipped with an RTX 4090 GPU). Once deployed, the ongoing operational costs are limited to local electricity and routine hardware maintenance. The system can process millions of frames at 30 frames per second without incurring additional licensing, transaction, or API fees, allowing the initial hardware investment to pay for itself within the first few months of deployment.
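The tiered cost arithmetic earlier in this section can be reproduced with a short script, which also makes it easy to vary the camera count or frame rate. The tier sizes and prices are the ones quoted in the worked example, with the last tier treated as unbounded as it is there. The hardware figure is a hypothetical placeholder, not a quoted price, and the function names are illustrative.

```python
# (tier size in images, price per image) as quoted in the worked example.
TIERS = [(1_000_000, 0.0010), (4_000_000, 0.0008), (float("inf"), 0.0006)]

HYPOTHETICAL_HARDWARE_COST = 5_000.0  # placeholder value, not a real quote


def monthly_images(cameras, fps, hours_per_day, days=30):
    """Frames submitted per month across all cameras."""
    return int(cameras * fps * hours_per_day * 3600 * days)


def tiered_cost(images, tiers=TIERS):
    """Apply each pricing tier in order until all images are billed."""
    cost, remaining = 0.0, images
    for size, price in tiers:
        used = min(remaining, size)
        cost += used * price
        remaining -= used
        if remaining <= 0:
            break
    return cost


def breakeven_months(hardware_cost, monthly_api_cost):
    """Months of avoided API fees needed to cover a one-time hardware spend."""
    return hardware_cost / monthly_api_cost


volume = monthly_images(cameras=10, fps=1, hours_per_day=12)
api_cost = tiered_cost(volume)
print(volume, round(api_cost, 2))
print(breakeven_months(HYPOTHETICAL_HARDWARE_COST, api_cost))
```

Because the volume scales linearly with both camera count and frame rate, raising either one moves almost all additional images into the last tier.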
Custom Edge Pipeline vs. Cloud API Comparison
The table below contrasts the technical and operational trade-offs between deploying custom edge-optimized pipelines and relying on cloud-based face recognition APIs.
| Architectural Dimension | Custom Edge Pipeline (SCRFD + ArcFace) | Amazon Rekognition | Microsoft Azure Face API |
|---|---|---|---|
| Inference Latency | 1.6 ms to 15 ms (on-device) | 240 ms to 450 ms (network dependent) | 250 ms to 450 ms (network dependent) |
| Throughput | Up to 820 FPS (FP16 optimized, RTX 4090) | Limited by API rate limits and network | Limited by API rate limits and network |
| Operational Costs | $0 / transaction (fixed hardware CapEx) | $0.0006 to $0.0010 per image | ~$1.00 per 1,000 transactions |
| Liveness Detection Cost | $0 (locally executed anti-spoofing) | $10 to $15 per 1,000 checks | ~$15 per 1,000 sessions |
| Offline Execution | Yes (fully functional without network) | No (requires continuous internet) | Partial (via managed Docker containers) |
| Custom Retraining | Yes (full control over weights and data) | No (limited to pre-trained classes) | No (limited to pre-trained classes) |
| Data Privacy & Sovereignty | Highest (biometric data never leaves the edge) | Subject to cloud provider storage policies | Enterprise-grade compliance (GDPR/ISO) |
Conclusions
Modern enterprise software architecture requires careful consideration of where intelligence is processed and how state is maintained.
The analysis of stateful multi-agent frameworks indicates that selecting the right orchestrator depends on the determinism, memory features, and scalability needs of the target workflow. LangGraph provides the deterministic control and durable execution required for regulated, audit-heavy pipelines. CrewAI is ideal for automating role-based human team structures that benefit from persistent cognitive memory. AG2 (formerly AutoGen) provides an async-first, state-decoupled architecture designed for high-throughput, conversational multi-agent systems.
In the computer vision domain, the trade-offs between custom edge pipelines and cloud-based APIs are equally clear. For real-time applications, relying on cloud biometrics introduces significant network latency, continuous connectivity dependencies, and scaling costs that can quickly become unsustainable. By moving to optimized on-device pipelines (SCRFD for face detection, ArcFace for embedding generation, and TensorRT for hardware compilation), enterprises can achieve end-to-end latencies under 25 ms, lower operational costs, and complete data sovereignty.
Ultimately, building these custom machine learning models and deploying stateful agentic workflows allows organizations to build highly reliable, performant, and cost-effective AI systems tailored to their operational needs.
Written by
Arun Pandit
CEO & Founder of FNA Technology. Specializing in AI, automation, and scalable software solutions — helping businesses leverage cutting-edge technology to drive growth and innovation.