Language models made AI feel universal because they do what we’re collectively used to: whip up a meal plan with recipes, knock out a quick Excel spreadsheet to watch calories, replace most white-collar jobs, the usual. In most people’s minds, AI is still just ChatGPT doing their homework for them, with a slight potential of ending the world; in Sam Altman’s words, in the meantime there’ll be great companies created with serious machine learning. That’s where we are: companies are now en route to pushing through the AI language layer. If the next generation of AI is meant to operate in homes, hospitals, warehouses, and streets, it has to deal with the physical world directly, through perception, action, and feedback. That is the shift behind embodied and multimodal AI.
Why are LLMs in a transitional stage?
A purely text-trained model has made our lives easier in some ways, but when it comes to real-world action, digits attached to hands are still the main workforce. The missing piece is that physical environments are not “text.” A system needs grounding, meaning its internal representations must stay aligned with measurable state, not only with well-formed sentences. Recent robotics-oriented foundation model efforts and embodied AI research have converged on Vision-Language-Action (VLA) models as a primary paradigm, aiming to bridge the gap between high-level human instructions, visual perception of the environment, and low-level motor control.
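To make the VLA idea concrete, here is a minimal sketch of what such a policy's interface looks like: an image and an instruction go in, a low-level motor command comes out. Every class and function name below is illustrative, not any particular model's API.

```python
import numpy as np

class VisionLanguageActionPolicy:
    """Minimal sketch of a VLA-style policy: (image, instruction) -> motor command.

    All names here are made up for illustration, not a real library's interface.
    """

    def __init__(self, vision_encoder, language_encoder, action_head):
        self.vision_encoder = vision_encoder      # e.g. a ViT producing image features
        self.language_encoder = language_encoder  # e.g. a text encoder producing text features
        self.action_head = action_head            # maps fused features to joint/gripper targets

    def act(self, rgb_image: np.ndarray, instruction: str) -> np.ndarray:
        # Ground the instruction in what the camera currently sees ...
        visual_features = self.vision_encoder(rgb_image)
        text_features = self.language_encoder(instruction)
        fused = np.concatenate([visual_features, text_features])
        # ... and emit a low-level action (e.g. 7-DoF arm deltas plus gripper command).
        return self.action_head(fused)


if __name__ == "__main__":
    # Toy stand-ins so the sketch runs end to end; a real system uses trained networks.
    rng = np.random.default_rng(0)
    policy = VisionLanguageActionPolicy(
        vision_encoder=lambda img: img.mean(axis=(0, 1)),        # 3-dim "features"
        language_encoder=lambda txt: np.array([len(txt), 1.0]),  # 2-dim "features"
        action_head=lambda z: rng.standard_normal(7) * z.mean(), # 7-DoF "action"
    )
    action = policy.act(rng.random((64, 64, 3)), "pick up the mug")
    print(action)
```

The key design point is that the policy is called every control step, so the action reflects the current scene rather than just the wording of the instruction.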
The transition from “talking about tasks” to “closing the loop on tasks” is on the way.
Limitations of LLMs
LLMs are still hugely valuable, especially for understanding instructions and planning; they are the world’s most overqualified intern for turning messy human requests into a neat to-do list. The trouble starts when a task needs a reality check, because text alone cannot “look,” “feel,” or “react.”
The gaps usually show up in three simple ways:
- They do not measure the world. A robot has to track things like where an object is, whether it is slipping, or whether a door is actually closed. Example: “pick up the mug” is easy to understand in language, but doing it safely requires live sensor feedback.
- They cannot guarantee cause and effect. An LLM can describe physics well, but real outcomes depend on details that are not in the prompt, like friction, weight, or obstacles you cannot see. Example: “push the box gently” might move it, or it might snag and tip.
- They are not built for split-second safety. In the physical world, timing matters. If someone steps in front of a robot, it must slow down or stop immediately based on perception and safety systems, not after generating a long explanation.
None of this means LLMs are “bad at robotics.” It just means LLMs need to be paired with sensors and control policies to become reliably competent in the real world.
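A sketch of what that pairing can look like, assuming hypothetical `robot` and `perception` interfaces (nothing here is a real library): the LLM proposes the steps, while a fast supervisory loop driven by sensors decides whether each step keeps running.

```python
import time

def execute_plan(plan_steps, robot, perception, step_timeout_s=10.0):
    """Sketch: run LLM-generated steps, but let live sensing, not the LLM, gate execution.

    `robot` and `perception` are hypothetical interfaces; the point is the split of
    responsibilities: language proposes, sensors and the controller dispose.
    """
    for step in plan_steps:
        robot.start(step)                        # begin a low-level skill, e.g. "grasp(mug)"
        deadline = time.monotonic() + step_timeout_s
        while not robot.step_done():
            if perception.human_too_close():     # split-second check from depth/lidar
                robot.emergency_stop()           # stop now; ask the planner questions later
                return "stopped_for_safety"
            if time.monotonic() > deadline:
                robot.abort(step)                # reality disagreed with the plan
                return "step_timed_out"
            time.sleep(0.01)                     # ~100 Hz supervisory loop
    return "plan_complete"

# Typical use: plan_steps comes from an LLM ("fetch the mug" -> ["move_to(counter)",
# "grasp(mug)", "move_to(table)", "place(mug)"]), while the loop above answers only
# to what the sensors report right now.
```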
Sensor-grounded AI
Sensor-grounded AI starts learning from real signals: cameras, depth sensors, joint sensors, force and touch, even audio. Each one clears up a different kind of confusion: vision tells you what is there, depth tells you how far away it is, touch and force tell you whether you are actually holding something, and joint sensors tell you where the robot’s body really is.
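A toy example of what that grounding buys, with hypothetical field names and an illustrative force threshold: “the robot is holding the mug” becomes a claim checked against sensor readings rather than something assumed from the command that was sent.

```python
from dataclasses import dataclass

# Hypothetical sensor snapshot: each field answers a different question about the world.
@dataclass
class SensorSnapshot:
    object_visible: bool      # camera: is the mug in view?
    object_distance_m: float  # depth: how far away is it?
    grip_force_n: float       # force/touch: are the fingers actually loaded?
    gripper_closed: bool      # joint sensors: where is the hand really?

def grasp_confirmed(s: SensorSnapshot, min_force_n: float = 2.0) -> bool:
    """'Holding the mug' is verified against measurements, not inferred from intent."""
    # A closed gripper with near-zero force usually means we closed on air.
    return s.gripper_closed and s.grip_force_n >= min_force_n

# Example readings (values are illustrative, not tuned):
print(grasp_confirmed(SensorSnapshot(True, 0.02, 4.5, True)))  # True: fingers loaded
print(grasp_confirmed(SensorSnapshot(True, 0.02, 0.1, True)))  # False: closed on nothing
```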
An early 2025 example is research on embodied LLM-based robot frameworks that connect an LLM to perception and robot control (often with retrieval for task knowledge). The big idea is straightforward: pair language “thinking” with seeing and doing, so the system can act in the world instead of only describing it.
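A rough sketch of that wiring, using a naive keyword-overlap stand-in for retrieval and entirely made-up memory entries: perception fills in the scene, retrieval recalls similar past tasks, and the combined prompt goes to the LLM planner.

```python
def build_planning_prompt(instruction, task_memory, observations, k=3):
    """Sketch of the 'LLM + retrieval + perception' wiring described above.

    `task_memory` is a hypothetical store of prior task write-ups; keyword overlap
    stands in here for a real embedding search.
    """
    def overlap(entry):
        return len(set(entry["task"].lower().split()) & set(instruction.lower().split()))

    relevant = sorted(task_memory, key=overlap, reverse=True)[:k]
    recalled = "\n".join(f"- {e['task']}: {e['notes']}" for e in relevant)
    return (
        f"Instruction: {instruction}\n"
        f"Scene (from perception): {observations}\n"
        f"Relevant past tasks:\n{recalled}\n"
        "Output: an ordered list of robot skills to call."
    )

# The resulting prompt goes to the LLM; its step list is then executed by a
# sensor-gated loop like the one sketched earlier, closing the see-think-act cycle.
memory = [
    {"task": "pick up a mug", "notes": "approach from the side, check grip force"},
    {"task": "open a drawer", "notes": "pull slowly, watch for jamming"},
]
print(build_planning_prompt("pick up the red mug", memory, "mug at 0.4 m on the counter"))
```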
Around late 2025, there was also a noticeable push toward “robot foundation models,” meaning large multimodal models trained to generalize across many tasks and environments. The trend signal here is that scaling is expanding beyond text into sensor and action data.
In practice, we have already seen a humanoid robot fire a BB gun at a YouTuber, raising AI safety fears: InsideAI had a ChatGPT-powered robot refuse to take the shot, but it fired after a role-play prompt tricked its safety rules. As this is still an early version, safety features are likely to be tightened considerably in the near future.
Application in robotics
Speaking of the near future and robotics: this is where the shift stops being theoretical and starts showing up in budgets and on dashboards.
Here are a few places where embodied, multimodal AI really matters and the budgets are justified:
- Messy-world manipulation. Warehouses and homes have clutter, weird angles, and objects that never sit the way the training demo did. A robot that can see + feel + adapt can recover when things go slightly wrong, like regrasping after a slip or choosing a different place to set something down.
- Humanoids and mobile robots. These platforms raise the difficulty: balance, contact, safety, and moving through human spaces. That is why recent industry roadmaps have leaned hard on foundation models plus large-scale simulation and training: robots need “practice” in lots of situations, not just a few scripted ones.
- Enterprise reality checks. In business settings, success is measured in boring (but important) numbers: task success rate, how often a human has to step in, safety incidents, and uptime. Physical understanding matters because it can make systems less brittle and less expensive to babysit.
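As a back-of-the-envelope illustration of those numbers (field names and figures below are invented), a per-episode log is enough to compute them:

```python
def fleet_report(episodes):
    """Sketch of the 'boring numbers' above, computed from hypothetical per-episode logs."""
    n = len(episodes)
    return {
        "task_success_rate": sum(e["succeeded"] for e in episodes) / n,
        "human_interventions_per_100": 100 * sum(e["interventions"] for e in episodes) / n,
        "safety_incidents": sum(e["safety_incident"] for e in episodes),
        "uptime_pct": 100 * sum(e["uptime_s"] for e in episodes)
                      / sum(e["scheduled_s"] for e in episodes),
    }

# Toy data: two shifts of the same pick-and-place cell.
episodes = [
    {"succeeded": True,  "interventions": 0, "safety_incident": 0, "uptime_s": 3500, "scheduled_s": 3600},
    {"succeeded": False, "interventions": 2, "safety_incident": 0, "uptime_s": 3400, "scheduled_s": 3600},
]
print(fleet_report(episodes))
```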
Boston Dynamics and Amazon are among the few that have expertly bridged this gap and handle robotics well.
Physical intelligence integration
The next big leap is systems that can take a goal in plain language, see what is around them, remember what worked last time, act, and then learn from what actually happens. That direction has been getting louder in late 2025 and early 2026, and the really intriguing part is what it unlocks. AI will not just sound smart; it will get smarter by checking itself against reality, and that opens the door to a bigger question: once AI can reliably do things, what kinds of jobs will it replace?