The physical AI deployment gap
The gap between the frontier of robotics research and the deployment of robots in production settings, and why it matters.
There’s been a recurring pattern with robotics demos and releases over the past two years. Demo videos show impressive progress in capabilities: a robot arm gracefully manipulates novel objects, a bipedal robot navigates cluttered terrain, a learned policy generalizes to unseen scenarios. The demos are accompanied by questions and discussions on architecture details, training compute, benchmark performance.
The questions that are more difficult to answer, however, are those like the following: “how many takes did that demo require?” or “what happens when you move the camera six inches to the left?” or “has this been deployed outside of a lab environment?”
Over the last few years, we have seen genuine breakthroughs in robotic intelligence. Vision-Language-Action (VLA) models can follow natural language instructions to manipulate objects they have never seen. Simulation-trained policies transfer to real hardware with increasing reliability. Foundation models trained on diverse robot data show emergent generalization, and there is more and more evidence that scaling laws hold for robot actions. The research progress is real, accelerating, and impressive.
And yet the vast majority of robots in production environments remain narrowly preprogrammed, executing fixed routines in carefully controlled conditions. The gap between what’s demonstrated at ICRA and what’s deployed at industrial facilities and warehouses has never been wider.
This gap is more than just a function of the time lag characteristic of the diffusion of novel technology (though part of it certainly is that—industrial machines deployed at scale and emerging generalist robots are effectively two different generations of machines under the same name of “robot”); it’s also a function of particular technical and operational challenges inherent to deploying autonomous physical systems, as we learned from the journey to get autonomous vehicles on the road.
This essay is a survey of that deployment gap—the consistent chasm between the bleeding edge of robotics research and robots deployed at scale in production environments. We will discuss where the research frontier is, where the deployment reality is, and the specific technical and operational challenges that separate them.
The Research Frontier
Let’s start by acknowledging how far progress in robot learning has come, especially over the last 2-3 years. The deployment gap only makes sense against the backdrop of genuine capability. An abridged summary of some of the advances in robot learning over the last few years is as follows.
Vision-Language-Action (VLA) models represent the most significant architectural shift in robot learning in years. The core insight for VLAs is around taking vision-language models pretrained on internet-scale data, fine-tuning them to output robot actions, and leveraging the semantic understanding learned from web data for robotic control. By unifying vision, language, and action data at scale, VLA models aim to learn policies that generalize across diverse tasks, objects, embodiments, and environments.
Google’s RT-2 demonstrated that a VLM could be co-fine-tuned on robot and web data to output instructions for robotic control, showing emergent capabilities like understanding novel objects and following varied complex instructions. Physical Intelligence’s π0 pushed this approach further, training on trajectories from different robot embodiments and introducing flow matching for smooth, high-frequency action generation, and π0.5 extended this work to open-world generalization. Generalist’s GEN-0 scaled pretraining data to new heights and introduced a harmonic reasoning approach for the interplay between sensing and action tokens.
Simulation-to-real transfer has improved dramatically. Domain randomization—varying simulation parameters during training to force robust representations—now enables zero-shot transfer for much of locomotion and increasingly manipulation tasks. This approach tackles sim2real gaps by randomly varying parameters such as mass, friction, and lighting during training, to train a policy robust enough to succeed on a physical robot despite never being trained on real-world data. Moreover, ongoing work in world models presents opportunities for new avenues of progress on improving sim-to-real transfer.
Cross-embodiment generalization is emerging. The Open X-Embodiment project assembled over 1 million trajectories from 22 different robot platforms. Models trained on this data show positive transfer: RT-1-X achieves roughly 50 percent higher success rates than single-robot baselines; RT-2-X shows 3x improvement on emergent skills. VLAs in the last few years such as π0, GEN-0, and Nvidia’s GR00T N1 have all made cross-embodiment a focus, pushing towards the dream of running a single policy on heterogenous robot hardware.
Dexterous manipulation has crossed thresholds. Policies can now handle deformable objects, tool use, and contact-rich tasks that seemed intractable a few years ago. Gemini Robotics, a VLA building on the Gemini 2.0 foundation, performs tasks requiring a high level of dexterity, precise force control, and complex sequential reasoning, such as folding origami and manipulating playing cards.
This is the frontier. It’s progressing rapidly in terms of robustness, scalability, and generalization across tasks, environments, and embodiments. And almost none of it is deployed.
The Deployment Reality
Now let’s turn to the systems actually operating in production environments in the present day. These robots are largely “classical” robots, in contrast to the current research paradigm of robot learning (learned systems).
Automotive manufacturing uses thousands of industrial robots, but they remain narrowly preprogrammed for specific tasks. A welding robot executes the same motion thousands of times per day with submillimeter precision. When the task changes—a new car model, a different weld pattern—engineers and systems integrators manually reprogram it. The promise of robots that learn new tasks from demonstration or instruction is still largely confined to pilot programs.
Warehouse bin picking—often involving grabbing diverse objects from bins—represents one of the closest applications to research capabilities. Some companies have deployed learned picking policies in production. But even here, there is a visible gap. These systems typically handle structured product categories (packaged goods with consistent shapes) in controlled lighting conditions with engineered bin presentations. The ability to pick arbitrary objects in cluttered, unstructured environments present in research demos has yet to be reliably deployed at scale.
Humanoid robots have received enormous attention and investment. But most humanoid robot deployments today remain in pilot phases, heavily dependent on human input for navigation, dexterity, or task switching. Moreover, humanoid robots at present are more often a platform for robotics developers (often in lab settings) to build on, rather than a complete solution sold to consumer or enterprise customers for production tasks. As such, real world deployments of humanoid systems are still largely confined to demos and pilot programs.
The overall gap between robotics research and robotics deployments can be observed simply by looking at the players involved in each sphere. In robotics research, much of the attention has focused on companies and commercial labs pursuing breakthroughs in robot learning. The status quo for robotics deployments in an industrial facility, meanwhile, still largely hinges on regional systems integrators distributing systems from industrial robot OEMs and programming them with classical approaches. These two spheres largely operate independent of each other. The status quo, however, is a difficult market for a new technology company (for reasons we’ve written about before), meaning much of the unlock for a new era in robotics hinges upon the technological progress being made now that can meaningfully change the economics and operational considerations of robotics deployments.
For there to be orders of magnitude more robots in the world, robots likely have to be orders of magnitude faster, cheaper, and easier to develop and deploy—meaning that we must bridge the gap between research and deployment.
Mapping the Deployment Gap
The deployment gap is less of a single problem and more of a set of specific technical and operational challenges, each of which limits the transition of research systems from lab to production. We map some of these problems here.
Distribution Shift
Research systems are evaluated on test sets drawn from the same distribution as training data. Deployment environments are, by definition, out of distribution.
A manipulation policy trained on objects in a robotics lab encounters different lighting, different backgrounds, different object textures, and different camera angles in a warehouse. The sim2real approach faces challenges due to mismatches between simulation and reality arising from inaccuracies in modeling physical phenomena and asynchronous control. A policy that achieves 95 percent success in the lab might drop to 60 percent in deployment—not because the policy is wrong, but because the long tail of the physical world introduces a large number of potential differences.
The research community measures success on benchmarks, whereas deployment requires success on the long tail of situations that no benchmark covers. As progress in generalization continues and policies continue to demonstrate zero-shot generalization to unseen tasks and environments, this problem will likely become less prevalent. At present, however, it remains relevant to the deployment of learning-based approaches in production environments.
Reliability Thresholds
Research papers generally focus on mean success rates, whereas deployment requires worst-case reliability.
Consider a picking robot that achieves 95 percent success in research evaluation. This result would be considered excellent. In deployment, that robot attempts thousands of picks per day. At 95 percent success, it fails 50 times per day. Each failure requires human intervention: someone must clear the jam, recover the dropped object, restart the system. At scale, this becomes operationally untenable.
Production systems in manufacturing typically require reliability above 99.9 percent. Achieving this with learned policies is extraordinarily difficult, because the failures likely aren’t necessarily random. They may cluster around edge cases that the training distribution didn’t cover. A 95 percent policy might fail 50 percent of the time on the 10 percent of cases that differ from training.
The research community optimizes for performance as a whole. Production deployments, on the other hand, require eliminating failure modes.
Latency-Capability Tradeoffs
The most capable VLA models are often also the largest and slowest.
VLA models, characterized by complex transformer-based architectures and enormous parameter counts (billions to tens of billions), demand substantial compute for inference. This results in high latency and insufficient control frequencies, often incompatible with the real-time demands of low-level robotic control.
As an illustrative estimate, manipulation tasks often require control at conservatively 20-100Hz. A 7B parameter model running on edge hardware might achieve inference times of 50-100ms—adequate for 10-20Hz control, but inadequate for dynamic manipulation requiring tight feedback loops. Cloud inference adds network latency that makes real-time control impossible for many tasks.
Research papers can run inference on clusters and report results, but production deployments require running on the hardware that fits in (and can be powered by) the actual robot deployed.
The dual-system architectures in models like Figure’s Helix and Nvidia’s GR00T N1, which separate slow semantic reasoning (System 2) from fast motor control (System 1), represent an attempt to resolve this tradeoff. Generalist’s harmonic reasoning method is another approach to addressing these problems.
Integration Complexities
Research systems exist in isolation or within a controlled, abstract system, whereas deployed robots must integrate with everything else involved in operating a facility.
A warehouse robot needs to receive task assignments from warehouse management systems (WMS), coordinate with other robots sharing floor space, report status to monitoring dashboards, log events for compliance, and interface with maintenance systems. High implementation costs and legacy system incompatibilities hinder adoption, particularly for SMBs, while interoperability gaps (despite frameworks like OPC UA) stifle multi-vendor ecosystems.
A research policy that picks objects perfectly is functionally limited in production settings if it can’t receive instructions about which objects to pick, coordinate with the conveyor belt timing, or report completion status to the system tracking inventory.
The integration of learning-based robotic systems with the “system of systems” that comprise business operations for the customers of robotics solutions remains to be adequately addressed.
Safety Certification
Research systems largely operate in controlled environments for limited durations, whereas deployed robots often operate near humans who didn’t sign liability waivers.
Collaborative robots operating near humans must comply with standards like ISO 10218 and ISO/TS 15066. These standards were written for programmed robots with predictable, analyzable behavior. They do not have clear provisions for learned policies whose behavior emerges from training data.
How might one certify that a neural network policy meets these standards that were written for a different kind of machine? It’s infeasible to formally verify a 7B parameter model. Extensive testing is a possibility, but testing can only show the presence of failures, not their absence. Safety frameworks are often outdated for learning-based systems.
Maintenance
Research systems are often maintained by the researchers who designed and built them, whereas deployed robots are maintained by technicians who did not.
A learned policy that fails in production can’t be debugged by reading code. Indeed, there is no code, just weights. When a robot behaves unexpectedly, diagnosing whether the problem is perception, planning, control, hardware, or integration requires expertise that most maintenance teams in their current form do not have.
Industries that once needed a handful of automation engineers now rely on teams of roboticists, integrators, safety specialists, and technicians to keep systems running. Yet the supply of talent has not kept pace.
Research environments typically assume expert operators and roboticists, but this assumption does not scale to production settings, which typically require maintainability by the broader workforce.
A Note on Compounding
These challenges also interact with each other to compound the deployment gap.
Consider deploying a VLA-based manipulation system in a warehouse:
The distribution shift from lab to warehouse degrades performance from 95 percent to 80 percent.
At 80 percent reliability, failures occur hundreds of times per day, requiring constant human intervention.
Running the full VLA model on edge hardware to reduce latency further degrades performance.
Integrating with the WMS introduces additional failure modes at the interface.
Safety certification for the learned policy takes months, during which the research has moved on.
When failures occur, maintenance staff can’t diagnose whether to blame perception, policy, or integration.
Each challenge makes the others worse. Distribution shift causes failures; failures require human intervention; human intervention requirements drive up operational costs; high costs limit deployment scale; limited scale means less deployment data; less deployment data means the distribution shift doesn’t improve, and so on and so forth.
This compounding introduces a new layer of complexity to addressing the deployment gap. The individual problems are all solvable in isolation. Their interaction creates a barrier that pure research progress doesn’t address.
Closing the Deployment Gap
The deployment gap won’t be closed through pure research breakthroughs alone, but rather in concert with infrastructure, tooling, and operational capabilities that don’t currently exist.
Deployment-Distribution Data
The most direct attack on distribution shift is collecting data that matches deployment conditions. The data bottleneck for robotics (unlike the availability of internet-scale data for text or images) has become well understood and a consensus problem to tackle. High quality robotics data has to be collected, curated and annotated through purpose-built facilities and infrastructure.
What’s needed:
Scalable teleoperation infrastructure that can collect demonstrations in diverse real-world environments, not just research labs.
Deployment-time data collection where production robots generate training data while performing useful work, creating a flywheel where deployment enables better models which enable more deployment.
Domain-specific datasets collected in actual warehouses, factories, and homes rather than laboratory approximations.
The promise of the robotics data flywheel is that once robots can be out in the world and collect data while creating economic value, the cost of robot data will go down rapidly, since it’s subsidized by the value the robot generates in the process. Bootstrapping this flywheel, however, requires crossing a meaningful initial deployment threshold.
Reliability Engineering for Learned Systems
For learning-based robot systems to be deployed at scale, reliability engineering practices need to be adapted for these learned systems. These practices include:
Failure mode characterization—systematic analysis of when and why policies fail, clustering failures by root cause rather than just counting them.
Graceful degradation—policies that recognize when they’re in unfamiliar territory and request human assistance rather than failing silently.
Hybrid architectures—combining learned policies (flexible, general) with programmed fallbacks (reliable, predictable) so that edge cases don’t cause system failures.
Runtime monitoring—systems that detect distribution shift in real-time and alert operators before failures cascade.
The goal isn’t eliminating failures, which is more or less impossible for learned systems in open-world environments. The goal is instead to make failures recoverable, detectable, and bounded.
Edge-Deployable Models
The capability-latency tradeoff requires meeting in the middle:
Efficient architectures designed for robotics constraints, not adapted from language models. Hugging Face’s SmolVLA achieved comparable performance to larger VLAs with only 450 million parameters by employing flow-matching for continuous control and asynchronous inference.
Hierarchical systems that run semantic reasoning infrequently (where latency matters less) and motor control continuously (where latency is critical).
Hardware-software co-design. Models designed around the specific compute available on robotic platforms, not generic GPU clusters. This approach is more feasible when hardware and models are designed in a tightly coupled fashion; otherwise, it requires greater standardization in robot hardware platforms.
The research community has begun acknowledging this, and there is now a distinct subfield around ’efficient VLAs’ focused specifically on bridging computational demands with real-world deployment. For instance, Deepmind released Gemini Robotics’ On-Device, a lightweight version optimized to run locally on robots.
Integration Infrastructure
Research systems need to become deployable systems, and that requires the infrastructure to enable scalable integration of learning-based robots.
Robotics middleware that abstracts over the heterogeneous landscape of enterprise systems—adapters for common WMS/MES/ERP platforms, standardized APIs for fleet coordination, etc.
Deployment automation that reduces the skilled labor required to configure robots for specific environments (akin to infrastructure-as-code for physical systems)
Observability tooling for deployed systems, including logging, metrics, tracing, and alerting.
The robotics equivalent of DevOps practices doesn’t exist yet, and building it would dramatically reduce the operational burden of deployment.
Safety Frameworks for Learned Systems
Certification processes need to adapt to learning-based systems:
Behavioral characterization methods that describe what a policy will do across its operational domain, even without formal verification.
Testing frameworks that systematically probe for failure modes rather than spot-checking.
Runtime safety layers that can override learned policies when approaching unsafe states.
Updated standards written for learning-based robots.
These solutions are partly technical and partly institutional.
The Shape of Progress
The deployment gap is as real at present as it is likely transitory. It is probable that it will close, not due to any single breakthrough, but through accumulated progress on the infrastructure and tooling described above.
One possible pattern involves narrow deployments in constrained domains expanding gradually as reliability improves and integration costs decrease. Warehouse picking gets more robust. Manufacturing tasks get more flexible. Each successful deployment generates data that improves the next.
Another possible pattern is that we reach some baseline level of generalist capabilities across tasks, environments, and embodiments, and that generalist baseline then gets built on top of by roboticists and developers building specific applications on that foundation. Each application involves fine-tuning generalist robots—across both policies and hardware—on the particular elements of the task or environment at hand.
The physical world is highly variable—much more so than virtual domains—and involves an enormous number of parameters that may make a singular product solution difficult. The implication here, as we have written before, is that there may not be a singular iPhone moment for robotics, but an evolutionary process that develops at the ecosystem level. A breakthrough robotics moment may look less like the inflection point of a single consumer product and more like a common operating system enabling an ecosystem of devices, developer tools, and applications—more like Android than the iPhone.
Moreover, the robotics deployment gap exists against the backdrop of the broader AI race between the US and China. As some have noted, China has pursued a strategy of focusing on applications of AI instead of pushing the edges of performance towards superintelligence with frontier models, where the US leads—potentially creating a dynamic where China could race towards unlocking more economic value with AI despite America leading in model capabilities. This dynamic becomes even more pronounced in robotics, where China has an advantage due to its dominant scale in industrial robots and manufacturing more broadly. Like in other modalities of AI, the US likely has an advantage in frontier models and software for robot learning, but China’s robotics ecosystem and its competencies in many industrial processes are formidable. As we’ve written before, America and our allies must work seriously to win the robotics race, and closing the deployment gap is a key part of this effort, translating robotics progress to economic and strategic value.
The deployment gap is where research meets reality, where capabilities become capacity, and where performance on benchmarks translate to hard power. If we are to build a world with orders of magnitude more robots in the factory, in the home, and settling the stars and new frontiers, we will have to build the essential infrastructure that transforms impressive demos into reliable and scalable systems.
Physical AI could be the biggest market in human history. We just have to make it so—this is the opportunity for a generation of robotics companies.
If you’re building infrastructure for physical AI deployment—data collection, reliability engineering, integration tooling, edge-efficient models, etc.—we’d love to hear from you.
This newsletter is provided for informational purposes only, and should not be relied upon as legal, business, investment, or tax advice. Furthermore, this content is not investment advice, nor is it intended for use by any investors or prospective investors in any a16z funds. This newsletter may link to other websites or contain other information obtained from third-party sources - a16z has not independently verified nor makes any representations about the current or enduring accuracy of such information. If this content includes third-party advertisements, a16z has not reviewed such advertisements and does not endorse any advertising content or related companies contained therein. Any investments or portfolio companies mentioned, referred to, or described are not representative of all investments in vehicles managed by a16z; visit https://a16z.com/investment-list/ for a full list of investments. Other important information can be found at a16z.com/disclosures. You’re receiving this newsletter since you opted in earlier; if you would like to opt out of future newsletters you may unsubscribe immediately.





The "Deployment Gap" isn’t just a data problem; it’s an architectural one.
Great piece, Oliver, thank you.
Your analysis of the "99.9% reliability threshold" resonates deeply with my work on resilient swarm intelligence. I believe a shift in how we perceive the agent-environment boundary is the key to bridging this gap, which is why I developed, FLOWRRA.
Most learned systems fail because they treat agents and environments as separate, making them incredibly brittle to distribution shifts. Instead of over-relying on external feedback, FLOWRRA treats the agent's internal operational state as its primary environment.
By optimizing for a measurable Flow Coherence Metric, FLOWRRA maintains a long tail of its own stable configurations. We use this historical manifold as the 'ground state' for our retrocausal WFC, allowing the system to leap back to a proven coherent structure the moment an environmental edge-case is detected.
Two points from your article that FLOWRRA specifically solves for:
1.) Reliability through Retrocausal WFC: When continuous adaptation fails, FLOWRRA uses a retrocausal-inspired wave function collapse to make a discontinuous leap back to known stable states or forward to projected coherent configurations. This transforms failure from a "system crash" into a "strategic reconfiguration."
2.) Real-World Scalability via GNNs: While my current implementation was tested on 32 nodes due to local compute limits, the use of Graph Attention Networks (GAT) means the architecture is inherently topology-agnostic. It isn't a "32-node model"; it’s an $N$-node architecture designed for the federated, high-frequency control environments you’re advocating for.
We recently stress-tested this with a 50% hardware failure rate (node freezing), and the swarm maintained 100% mission coverage by "breathing"; stretching its GNN-driven edges to reroute around "dead weight" in real-time.
I’d love to get your thoughts on how this "internal coherence" approach might close the gap for industrial swarms and paving the path for Robotics in general.
Full Write-up & Dynamics Video here:https://rohittamidapati.substack.com/p/flowrra-flow-recognition-reconfiguration
DhaaRn | Weavers of Time.
Hmmm seems you + I are writing about the same thing here https://ciphertalk.substack.com/p/why-ai-cant-touch-the-real-world