From Words to Worlds: Spatial Intelligence is AI’s Next Frontier
A guest post from World Labs founder Fei-Fei Li
Today, we have a very special feature from Fei-Fei Li, the computer scientist, AI researcher, academic, and co-founder/CEO of World Labs, a frontier model company focused on solving spatial intelligence. We’re posting an excerpt of Fei-Fei’s writing below; you can read the full manifesto on her new Substack.
In 1950, when computing was little more than automated arithmetic and simple logic, Alan Turing asked a question that still reverberates today: can machines think? It took remarkable imagination to see what he saw: that intelligence might someday be built rather than born. That insight later launched a relentless scientific quest called Artificial Intelligence (AI). Twenty-five years into my own career in AI, I still find myself inspired by Turing’s vision. But how close are we? The answer isn’t simple.
Today, leading AI technologies such as large language models (LLMs) have begun to transform how we access and work with abstract knowledge. Yet they remain wordsmiths in the dark: eloquent but inexperienced, knowledgeable but ungrounded. Spatial intelligence will transform how we create and interact with real and virtual worlds—revolutionizing storytelling, creativity, robotics, scientific discovery, and beyond. This is AI’s next frontier.
The pursuit of visual and spatial intelligence has been the North Star guiding me since I entered the field. It’s why I spent years building ImageNet, the first large-scale visual learning and benchmarking dataset and one of three key elements enabling the birth of modern AI, along with neural network algorithms and modern compute like graphics processing units (GPUs). It’s why my academic lab at Stanford has spent the last decade combining computer vision with robotic learning. And it’s why my cofounders Justin Johnson, Christoph Lassner, Ben Mildenhall, and I created World Labs more than one year ago: to realize this possibility in full, for the first time.
In this essay, I’ll explain what spatial intelligence is, why it matters, and how we’re building the world models that will unlock it—with impact that will reshape creativity, embodied intelligence, and human progress.
Spatial Intelligence: The scaffolding of human cognition
AI has never been more exciting. Generative AI models such as LLMs have moved from research labs to everyday life, becoming tools of creativity, productivity, and communication for billions of people. They have demonstrated capabilities once thought impossible, producing coherent text, mountains of code, photorealistic images, and even short video clips with ease. It’s no longer a question of whether AI will change the world. By any reasonable definition, it already has.
Yet so much still lies beyond our reach. The vision of autonomous robots remains intriguing but speculative, far from the fixtures of daily life that futurists have long promised. The dream of massively accelerated research in fields like curing disease, new materials discovery, and particle physics remains largely unfulfilled. And the promise of AI that truly understands and empowers human creators—whether students learning intricate concepts in molecular chemistry, architects visualizing spaces, filmmakers building worlds, or anyone seeking fully immersive virtual experiences—remains beyond reach.
To learn why these capabilities remain elusive, we need to examine how spatial intelligence evolved, and how it shapes our understanding of the world.
Vision has long been a cornerstone of human intelligence, but its power emerged from something even more fundamental. Long before animals could nest, care for their young, communicate with language, or build civilizations, the simple act of sensing quietly sparked an evolutionary journey toward intelligence.
This seemingly isolated ability to glean information from the external world, whether a glimmer of light or the feeling of texture, created a bridge between perception and survival that only grew stronger and more elaborate as the generations passed. Layer upon layer of neurons grew from that bridge, forming nervous systems that interpret the world and coordinate interactions between an organism and its surroundings. Thus, many scientists have conjectured that perception and action became the core loop driving the evolution of intelligence, and the foundation on which nature created our species—the ultimate embodiment of perceiving, learning, thinking, and doing.
Spatial intelligence plays a fundamental role in defining how we interact with the physical world. Every day, we rely on it for the most ordinary acts: parking a car by imagining the narrowing gap between bumper and curb, catching a set of keys tossed across the room, navigating a crowded sidewalk without collision, or sleepily pouring coffee into a mug without looking. In more extreme circumstances, firefighters navigate collapsing buildings through shifting smoke, making split-second judgements about stability and survival, communicating through gestures, body language and a shared professional instinct for which there’s no linguistic substitute. And children spend the entirety of their pre-verbal months or years learning the world through playful interactions with their environments. All of this happens intuitively, automatically—a fluency machines have yet to achieve.
Spatial Intelligence is also foundational to our imagination and creativity. Storytellers create uniquely rich worlds in their minds and leverage many forms of visual media to bring them to others, from ancient cave painting to modern cinema to immersive video games. Whether it’s children building sandcastles on the beach or playing Minecraft on the computer, spatially-grounded imagination forms the basis for interactive experiences in real or virtual worlds. And in many industry applications, simulations of objects, scenes and dynamic interactive environments power countless critical business use cases, from industrial design to digital twins to robotic training.
History is full of civilization-defining moments where spatial intelligence played central roles. In ancient Greece, Eratosthenes transformed shadows into geometry—measuring a 7-degree angle in Alexandria at the exact moment the sun cast no shadow in Syene—to calculate the Earth’s circumference. Hargreaves’s “Spinning Jenny” revolutionized textile manufacturing through a spatial insight: arranging multiple spindles side-by-side in a single frame allowed one worker to spin multiple threads simultaneously, increasing productivity eightfold. Watson and Crick discovered DNA’s structure by physically building 3D molecular models, manipulating metal plates and wire until the spatial arrangement of base pairs clicked into place. In each case, spatial intelligence drove civilization forward when scientists and inventors had to manipulate objects, visualize structures, and reason about physical spaces, none of which can be captured in text alone.
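To make the arithmetic behind Eratosthenes’s insight concrete, here is the proportional reasoning in rough outline. The essay gives only the 7-degree angle; the roughly 5,000-stadia distance between Syene and Alexandria and the resulting 250,000-stadia estimate are the commonly cited historical figures, not claims from the essay itself:

```latex
% Sketch of Eratosthenes's proportional reasoning (commonly cited figures):
% the shadow angle is the same fraction of a full circle as the
% Syene-Alexandria distance is of the Earth's circumference.
\frac{\theta}{360^{\circ}} = \frac{d_{\mathrm{Syene\text{-}Alexandria}}}{C},
\qquad
\theta \approx 7.2^{\circ} \approx \tfrac{1}{50}\text{ of a circle}
\;\Rightarrow\;
C \approx 50 \times 5{,}000 \text{ stadia} = 250{,}000 \text{ stadia}.
```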
Spatial Intelligence is the scaffolding upon which our cognition is built. It’s at work when we passively observe or actively seek to create. It drives our reasoning and planning, even on the most abstract topics. And it’s essential to the way we interact—verbally or physically, with our peers or with the environment itself. While most of us aren’t revealing new truths on the level of Eratosthenes most days, we routinely think in the same way—making sense of a complex world by perceiving it through our senses, then leveraging an intuitive understanding of how it works in physical, spatial terms.
Unfortunately, today’s AI doesn’t think like this yet.
Tremendous progress has indeed been made in the past few years. Multimodal LLMs (MLLMs), trained with voluminous multimedia data in addition to textual data, have introduced some basics of spatial awareness, and today’s AI can analyze pictures, answer questions about them, and generate hyperrealistic images and short videos. And through breakthroughs in sensors and haptics, our most advanced robots can begin to manipulate objects and tools in highly constrained environments.
Yet the candid truth is that AI’s spatial capabilities remain far from human level. And the limits reveal themselves quickly. State-of-the-art MLLMs rarely perform better than chance at estimating distance, orientation, and size, or at “mentally” rotating objects by regenerating them from new angles. They can’t navigate mazes, recognize shortcuts, or predict basic physics. AI-generated videos—nascent and yes, very cool—often lose coherence after a few seconds.
While current state-of-the-art AI can excel at reading, writing, research, and pattern recognition in data, these same models bear fundamental limitations when representing or interacting with the physical world. Our view of the world is holistic—not just what we’re looking at, but how everything relates spatially, what it means, and why it matters. Understanding this through imagination, reasoning, creation, and interaction—not just descriptions—is the power of spatial intelligence. Without it, AI is disconnected from the physical reality it seeks to understand. It cannot effectively drive our cars, guide robots in our homes and hospitals, enable entirely new ways of immersive and interactive experiences for learning and recreation, or accelerate discovery in materials science and medicine.
The philosopher Wittgenstein once wrote that “the limits of my language mean the limits of my world.” I’m not a philosopher. But I know at least for AI, there is more than just words. Spatial intelligence represents the frontier beyond language—the capability that links imagination, perception and action, and opens possibilities for machines to truly enhance human life, from healthcare to creativity, from scientific discovery to everyday assistance.
The next decade of AI: Building truly spatially intelligent machines
So how do we build spatially-intelligent AI? What’s the path to models capable of reasoning with the vision of Eratosthenes, engineering with the precision of an industrial designer, creating with the imagination of a storyteller, and interacting with their environment with the fluency of a first responder?
Building spatially intelligent AI requires something even more ambitious than LLMs: world models, a new type of generative model whose capabilities for understanding, reasoning about, generating, and interacting with semantically, physically, geometrically, and dynamically complex worlds, whether virtual or real, are far beyond the reach of today’s LLMs. The field is nascent, with current methods ranging from abstract reasoning models to video generation systems. World Labs was founded in early 2024 on this conviction: that foundational approaches are still being established, making this the defining challenge of the next decade.
In this emerging field, what matters most is establishing the principles that guide development. For spatial intelligence, I define world models through three essential capabilities:
1. Generative: World models can generate worlds with perceptual, geometrical, and physical consistency
World models that unlock spatial understanding and reasoning must also generate simulated worlds of their own. They must be capable of spawning endlessly varied simulated worlds that follow semantic or perceptual instructions—while remaining geometrically, physically, and dynamically consistent—whether representing real or virtual spaces. The research community is actively exploring whether these worlds should be represented implicitly or explicitly in terms of innate geometric structure. Furthermore, in addition to powerful latent representations, I believe a universal world model must also be able to generate an explicit, observable state of the world for many different use cases. In particular, its understanding of the present must be tied coherently to its past: to the previous states of the world that led to the current one.
2. Multimodal: World models are multimodal by design
Just as animals and humans do, a world model should be able to process inputs—known as “prompts” in the generative AI realm—in a wide range of forms. Given partial information—whether images, videos, depth maps, text instructions, gestures, or actions—world models should predict or generate world states as complete as possible. This requires processing visual inputs with the fidelity of real vision while interpreting semantic instructions with equal facility. This enables both agents and humans to communicate with the model about the world through diverse inputs and receive diverse outputs in return.
3. Interactive: World models can output the next states based on input actions
Finally, if actions and/or goals are part of the prompt to a world model, its outputs must include the next state of the world, represented either implicitly or explicitly. When given an action as input, with or without a goal state, the world model should produce an output consistent with the world’s previous state, the intended goal state if any, and the world’s semantic meanings, physical laws, and dynamical behaviors. As spatially intelligent world models become more powerful and robust in their reasoning and generation capabilities, it is conceivable that, given a goal, the world models themselves could predict not only the next state of the world but also the next actions to take based on that new state.
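To make the three capabilities above concrete, here is a minimal, hypothetical interface sketch in Python. It is not World Labs’s architecture or any existing API; the names (Prompt, WorldState, WorldModel, generate, step) are illustrative placeholders for the generative, multimodal, and interactive roles just described:

```python
# Hypothetical sketch only: illustrative names, not an actual World Labs or public API.
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class Prompt:
    """A multimodal prompt: any subset of these fields may be provided."""
    text: Optional[str] = None                          # semantic instructions
    images: list[Any] = field(default_factory=list)     # reference views
    video: Optional[Any] = None                          # observed footage
    actions: list[Any] = field(default_factory=list)    # agent or user actions
    goal: Optional[Any] = None                           # desired end state, if any


@dataclass
class WorldState:
    """An explicit, observable snapshot of a generated world."""
    geometry: Any        # e.g. a 3D scene representation
    semantics: Any       # object identities and relations
    dynamics: Any        # velocities, forces, ongoing processes
    history: list["WorldState"] = field(default_factory=list)  # past states it must stay consistent with


class WorldModel:
    """Illustrative interface for the three capabilities described above."""

    def generate(self, prompt: Prompt) -> WorldState:
        """Capabilities 1 and 2: spawn a geometrically, physically, and
        semantically consistent world from partial, multimodal input."""
        raise NotImplementedError

    def step(self, state: WorldState, action: Any, goal: Optional[Any] = None) -> WorldState:
        """Capability 3: predict the next world state given an action
        (and optionally a goal), consistent with the state's history."""
        raise NotImplementedError
```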
The scope of this challenge exceeds anything AI has faced before.
While language is a purely generative phenomenon of human cognition, worlds play by much more complex rules. Here on Earth, for instance, gravity governs motion, atomic structures determine how light produces colors and brightness, and countless physical laws constrain every interaction. Even the most fanciful, creative worlds are composed of spatial objects and agents that obey the physical laws and dynamical behaviors that define them. Reconciling all of this consistently—the semantic, the geometric, the dynamic, and the physical—demands entirely new approaches. The dimensionality of representing a world is vastly greater than that of a one-dimensional, sequential signal like language. Achieving world models that deliver the kind of universal capabilities we enjoy as humans will require overcoming several formidable technical barriers. At World Labs, our research teams are devoted to making fundamental progress toward that goal.
Here are some examples of our current research topics:
A new, universal task function for training: Defining a universal task function as simple and elegant as next-token prediction in LLMs has long been a central goal of world model research. The complexities of both their input and output spaces make such a function inherently more difficult to formulate. But while much remains to be explored, this objective function and corresponding representations must reflect the laws of geometry and physics, honoring the fundamental nature of world models as grounded representations of both imagination and reality.
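For reference, the next-token objective the passage alludes to can be written compactly; the formula below is the standard language-modeling objective, not a proposed world-model objective, and what the analogous loss over world states should look like is exactly the open question:

```latex
% Standard next-token prediction objective for a token sequence x_1, ..., x_T:
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_{\theta}\!\left(x_t \mid x_{<t}\right)
% The open problem: a comparably simple objective defined over world states,
% constrained by geometry and physics, has not yet been established.
```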
Large-scale training data: Training world models requires far more complex data than text curation. The promising news: massive data sources already exist. Internet-scale collections of images and videos represent abundant, accessible training material—the challenge lies in developing algorithms that can extract deeper spatial information from these two-dimensional image or video frame-based signals (i.e. RGB). Research over the past decade has shown the power of scaling laws linking data volume and model size in language models; the key unlock for world models is building architectures that can leverage existing visual data at comparable scale. In addition, I would not underestimate the power of high-quality synthetic data and additional modalities like depth and tactile information. They supplement the internet scale data in critical steps of the training process. But the path forward depends on better sensor systems, more robust signal extraction algorithms, and far more powerful neural simulation methods.
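For context on the scaling laws mentioned above, one commonly cited parametric form from language-model scaling studies is shown below; whether an analogous relationship holds for world models trained on visual data is, as the essay notes, an open question:

```latex
% Parametric scaling law of the form popularized by language-model studies
% (e.g. Hoffmann et al., 2022): N = parameter count, D = number of training tokens,
% E = irreducible loss, A, B, alpha, beta = fitted constants.
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```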
New model architecture and representational learning: World model research will inevitably drive advances in model architecture and learning algorithms, particularly beyond the current MLLM and video diffusion paradigms. Both of these typically tokenize data into 1D or 2D sequences, which makes simple spatial tasks unnecessarily difficult, like counting unique chairs in a short video or remembering what a room looked like an hour ago. Alternative architectures may help, such as 3D- or 4D-aware methods for tokenization, context, and memory. For example, at World Labs, our recent work on RTFM, a real-time generative frame-based model, demonstrates this shift: it uses spatially grounded frames as a form of spatial memory to achieve efficient real-time generation while maintaining persistence in the generated world.
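As a toy illustration of the general idea of spatially grounded frames acting as spatial memory, here is a hedged Python sketch. It is not how RTFM is implemented (those details are not given here); it only shows the notion of caching generated frames by camera pose and reusing nearby ones so that a revisited viewpoint stays consistent. All names are hypothetical, and `generate_frame` stands in for an arbitrary generative model call:

```python
# Toy sketch of pose-keyed frame memory; NOT the RTFM implementation.
import math
from typing import Any

Position = tuple[float, float, float]  # simplified "pose": just a 3D position


class SpatialFrameMemory:
    """Caches generated frames keyed by the pose they were rendered from,
    so revisiting a viewpoint can reuse earlier content instead of
    regenerating it from scratch."""

    def __init__(self, radius: float = 1.0):
        self.radius = radius                              # how close two poses must be to count as "the same place"
        self.frames: list[tuple[Position, Any]] = []      # (position, frame) pairs

    def add(self, position: Position, frame: Any) -> None:
        """Store a newly generated frame together with its pose."""
        self.frames.append((position, frame))

    def nearby(self, position: Position) -> list[Any]:
        """Return frames generated near this pose, to condition the next generation on."""
        def dist(p: Position, q: Position) -> float:
            return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
        return [frame for pos, frame in self.frames if dist(pos, position) <= self.radius]


def render_step(generate_frame, memory: SpatialFrameMemory, pose: Position, prompt: Any) -> Any:
    """One generation step: look up spatially nearby past frames, condition the
    (hypothetical) generative model on them, and persist the result."""
    context = memory.nearby(pose)                 # spatial memory lookup
    frame = generate_frame(prompt, pose, context)  # generation conditioned on nearby frames
    memory.add(pose, frame)                        # remember the frame for future visits
    return frame
```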
Clearly, we are still facing daunting challenges before we can fully unlock spatial intelligence through world modeling. This research isn’t just a theoretical exercise. It is the core engine for a new class of creative and productivity tools. And the progress within World Labs has been encouraging. We recently shared with a limited number of users a glimpse of Marble, the first ever world model that can be prompted by multimodal inputs to generate and maintain consistent 3D environments for users and storytellers to explore, interact with, and build further in their creative workflow. And we are working hard to make it available to the public soon!
Marble is only our first step in creating a truly spatially intelligent world model. As the progress accelerates, researchers, engineers, users, and business leaders alike are beginning to recognize its extraordinary potential. The next generation of world models will enable machines to achieve spatial intelligence on an entirely new level—an achievement that will unlock essential capabilities still largely absent from today’s AI systems.
To continue Fei-Fei’s post, you can read it on her Substack here.





Dr. Li - thank you for this beautiful essay. Your examples resonated deeply, especially Watson & Crick's physical model-building. There's something profound about how they worked: not just imagining plausible structures, but systematically testing each configuration against Rosalind Franklin's X-ray data, Chargaff's ratios, and chemical bonding constraints until they found the ONE structure that satisfied everything simultaneously.
It made me think about Kuhn's observation that scientific paradigms encode centuries of collective wisdom - not just about spatial relationships, but about what counts as authoritative evidence and valid reasoning. Our institutions, for all their imperfections, represent tens of thousands of years of accumulated human knowledge about verification, authority, and truth.
Your world models open extraordinary creative possibilities. And perhaps there's room for both kinds of intelligence: generative systems that imagine plausible worlds, and governance architectures that preserve institutional constraints. Watson & Crick needed both imagination AND authoritative grounding.
The question isn't whether machines should think spatially - clearly they must. But as we build these systems, how do we ensure they honor the difference between what's plausible and what's provably correct when human welfare depends on that distinction?
Thank you for advancing this vital frontier.
Fascinating essay. You have finished convincing me that world models are a next step towards creating thinking machines (as opposed to reasoning machines).
Concerning the three capabilities of world models. I’m wondering whether the first capability (physical consistency) doesn’t limit the second (generating according to input/prompt) — especially when accounting for the third (consistency between past and present states)?
Say I want to operate within a world model that replicates Earth today. It would be possible to prompt to move an object from point A to point B while maintaining physical and temporal consistency. However, how could I prompt the addition of a new object to the world model whilst respecting either physical or temporal consistency?
The statement “the world model should produce an output consistent with the world’s previous state” implies, in a causal world, that any input cannot be added midway through the chain of events but must be added at its inception. It would be impossible to add an ex nihilo cause to a world already in a specific state. That cause should already have been in the model from the start, at least as a probable forecasted state? But if it’s already included in the model in that state, you are not adding to that state.
You therefore have this paradoxical situation where you want to add a cause to a world in a specific state, but to add that cause you either need to disregard the laws of physics and/or of temporality/causality, or that cause needs to have been included in a forecast of the chain of events since the inception of that world, in which case you are not adding it.
Or when adding an input to a world model in a specific state, do you resimulate the past states that would have led to that world model in that state with the new input? So the new state isn’t really consistent with the initial past state, but with a new past state?