Gemini AI is making robots in the office far more useful

An Everyday Robot navigating through an office.
Everyday Robot

Lost in an unfamiliar office building, big box store, or warehouse? Just ask the nearest robot for directions.

A team of Google researchers combined the powers of natural language processing and computer vision to develop a novel means of robotic navigation as part of a new study published Wednesday.

Essentially, the team set out to teach a robot (in this case, an Everyday Robot) to navigate an indoor space using natural language prompts and visual inputs. Robotic navigation used to require researchers not only to map out the environment ahead of time but also to provide specific physical coordinates within the space to guide the machine. Recent advances in what’s known as Vision-Language Navigation (VLN) have enabled users to simply give robots natural language commands, like “go to the workbench.” Google’s researchers are taking that concept a step further by incorporating multimodal capabilities, so that the robot can accept natural language and image instructions at the same time.

For example, a user in a warehouse would be able to show the robot an item and ask, “what shelf does this go on?” Using Gemini 1.5 Pro, the system interprets both the spoken question and the visual information to formulate not just a response but also a navigation path that leads the user to the correct spot on the warehouse floor. The robots were also tested with commands like “Take me to the conference room with the double doors,” “Where can I borrow some hand sanitizer,” and “I want to store something out of sight from public eyes. Where should I go?”

In a demonstration shared as an Instagram Reel, a researcher activates the system with an “OK robot” before asking to be led somewhere “he can draw.” The robot responds with “give me a minute. Thinking with Gemini …” before setting off briskly through the 9,000-square-foot DeepMind office in search of a large wall-mounted whiteboard.

To be fair, these trailblazing robots were already familiar with the office space’s layout. The team used a technique known as “Multimodal Instruction Navigation with demonstration Tours (MINT).” The team first manually guided the robot around the office, pointing out specific areas and features using natural language, though the same effect can be achieved by simply recording a video of the space with a smartphone. From that tour, the AI generates a topological graph and then works to match what its cameras are seeing with the “goal frame” from the demonstration video.

The team then employs a hierarchical Vision-Language-Action (VLA) navigation policy “combining the environment understanding and common sense reasoning” to instruct the AI on how to translate user requests into navigational actions.
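
To make the two-level idea concrete, here is a minimal Python sketch; it is not the researchers' code. It assumes a tiny, invented tour of five frames, and it swaps the Gemini-based reasoning step for a simple keyword-overlap match (a stand-in chosen only so the example runs on its own). The names `TOUR`, `build_graph`, `pick_goal_frame`, `shortest_path`, and `navigate` are all hypothetical. The high-level step picks a goal frame from the tour, and the low-level step finds a route to it through a graph built from the tour frames.

```python
import re
from collections import deque

# Hypothetical demonstration tour: every note and pose is invented.
TOUR = [
    {"note": "entrance lobby", "pose": (0.0, 0.0)},
    {"note": "kitchen with hand sanitizer", "pose": (2.0, 0.0)},
    {"note": "hallway", "pose": (4.0, 0.0)},
    {"note": "conference room with double doors", "pose": (4.0, 2.0)},
    {"note": "large wall-mounted whiteboard", "pose": (6.0, 2.0)},
]

def build_graph(tour, max_dist=2.5):
    """Link tour frames whose poses lie within max_dist of each other."""
    graph = {i: [] for i in range(len(tour))}
    for i, a in enumerate(tour):
        for j, b in enumerate(tour):
            dx = a["pose"][0] - b["pose"][0]
            dy = a["pose"][1] - b["pose"][1]
            if i != j and (dx * dx + dy * dy) ** 0.5 <= max_dist:
                graph[i].append(j)
    return graph

def words(text):
    """Lowercase a string and return its set of alphabetic words."""
    return set(re.findall(r"[a-z]+", text.lower()))

def pick_goal_frame(instruction, tour):
    """High-level step. ASSUMPTION: keyword overlap stands in for the
    vision-language model that the article says selects the goal frame."""
    query = words(instruction)
    return max(range(len(tour)), key=lambda i: len(query & words(tour[i]["note"])))

def shortest_path(graph, start, goal):
    """Low-level step: breadth-first search over the topological graph."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in graph[node]:
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None  # goal not reachable from start

def navigate(instruction, current_frame, tour=TOUR):
    """Chain the two levels: choose a goal frame, then plan a route to it."""
    goal = pick_goal_frame(instruction, tour)
    return shortest_path(build_graph(tour), current_frame, goal)

print(navigate("Take me to the conference room with the double doors", 0))
```

A request like “he can draw” would defeat the keyword stand-in, since no tour note mentions drawing; bridging that gap between the request and the whiteboard is the kind of common sense reasoning the article attributes to the Gemini-based policy.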

The approach proved successful, with the robots achieving “86 percent and 90 percent end-to-end success rates on previously infeasible navigation tasks involving complex reasoning and multimodal user instructions in a large real world environment,” the researchers wrote.

However, they recognize that there is still room for improvement, pointing out that the robot cannot (yet) autonomously perform its own demonstration tour and noting that the AI’s ungainly inference time (how long it takes to formulate a response) of 10 to 30 seconds turns interacting with the system into a study in patience.

Andrew Tarantola
Former Computing Writer
Andrew Tarantola is a journalist with more than a decade reporting on emerging technologies ranging from robotics and machine…