Dog-like robots could one day learn to play fetch, thanks to a blend of artificial intelligence (AI) and computer vision helping them zero in on objects.
In a new study published Oct. 10 in the journal IEEE Robotics and Automation Letters, researchers developed a method called "Clio" that lets robots rapidly map a scene using on-body cameras and identify the parts most relevant to the task they've been assigned via voice instructions.
Clio harnesses the theory of the "information bottleneck," whereby information is compressed in such a way that a neural network — a collection of machine learning algorithms layered to mimic the way the human brain processes information — picks out and stores only the relevant segments. A robot equipped with the system can take an instruction such as "get first aid kit" and interpret only the parts of its immediate environment that are relevant to that task, ignoring everything else.
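To make that idea concrete, here is a minimal sketch of the bottleneck step in Python, assuming the scene segments and the task phrase have already been mapped into a shared vector space (as vision-language models do). The names and the threshold value are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def task_relevant_segments(segment_embeddings, task_embedding, threshold=0.25):
    """Keep only the scene segments whose embeddings align with the task.

    segment_embeddings: (N, D) array, one vector per candidate segment.
    task_embedding: (D,) vector for an instruction like "get first aid kit".
    threshold: relevance cutoff (hypothetical value; would need tuning).
    """
    # Normalize so dot products become cosine similarities.
    seg = segment_embeddings / np.linalg.norm(segment_embeddings, axis=1, keepdims=True)
    task = task_embedding / np.linalg.norm(task_embedding)

    # Score each segment against the task. This acts as the "bottleneck":
    # segments carrying no task-relevant information are simply discarded.
    relevance = seg @ task
    keep = relevance >= threshold
    return keep, relevance
```

In the full information-bottleneck formulation, the trade-off between compressing the scene and preserving task-relevant detail is controlled by a tunable parameter; a simple similarity cutoff captures the flavor of the idea.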
"For example, say there is a pile of books in the scene and my task is just to get the green book. In that case we push all this information about the scene through this bottleneck and end up with a cluster of segments that represent the green book," study co-author Dominic Maggio, a graduate student at MIT, said in a statement. "All the other segments that are not relevant just get grouped in a cluster which we can simply remove. And we're left with an object at the right granularity that is needed to support my task."
To demonstrate Clio in action, the researchers had a Boston Dynamics Spot quadruped robot run the system while exploring an office building and carrying out a set of tasks. Working in real time, Clio generated a virtual map showing only the objects relevant to those tasks, which then enabled the Spot robot to complete its objectives.
Seeing, understanding, doing
The researchers achieved this level of granularity with Clio by combining computer vision with large language models (LLMs) — the neural networks that underpin many AI tools, systems and services — trained to identify all manner of objects.
Neural networks have made significant advances in accurately identifying objects in real or virtual environments, but those successes typically come in carefully curated scenarios with a limited set of objects that a robot or AI system has been pre-trained to recognize. Clio's breakthrough is the ability to work at the right level of granularity, in real time, for the specific tasks it has been assigned.
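The open-vocabulary part of this pairing can be illustrated with an off-the-shelf vision-language model. The sketch below uses the publicly available CLIP checkpoint from Hugging Face's transformers library to score an image crop of one segment against free-form text labels; Clio's actual pipeline differs, but the underlying principle of matching image regions against arbitrary language is the same. The file name and label list are made up for the example.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Load a pretrained vision-language model (public checkpoint).
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

# Hypothetical image crop of one scene segment.
segment = Image.open("segment_crop.png")

# Free-form labels -- no fixed, pre-trained category list required.
labels = ["a first aid kit", "a green book", "a deck of cards", "a chair"]

inputs = processor(text=labels, images=segment, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Higher score = closer match between the segment and the phrase.
probs = outputs.logits_per_image.softmax(dim=1)
print(dict(zip(labels, probs[0].tolist())))
```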
A core part of this was to incorporate a mapping tool into Clio that enables it to split a scene into many small segments. A neural network then picks out segments that are semantically similar — meaning they serve the same intent or form similar objects.
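As a rough illustration of that grouping step, the sketch below clusters segment embeddings by cosine distance using scikit-learn's agglomerative clustering. Segments whose vectors sit close together (the covers and spine of one book, say) merge into a single candidate object, while unrelated segments fall into clusters that can be discarded. The distance threshold is an illustrative assumption, not a value from the paper.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

def group_segments(segment_embeddings, distance_threshold=0.3):
    """Merge semantically similar segments into candidate objects.

    segment_embeddings: (N, D) array of per-segment feature vectors.
    distance_threshold: cosine-distance cutoff for merging (illustrative).
    """
    clustering = AgglomerativeClustering(
        n_clusters=None,                 # let the threshold decide
        distance_threshold=distance_threshold,
        metric="cosine",
        linkage="average",
    )
    labels = clustering.fit_predict(segment_embeddings)
    # Segments sharing a label are treated as one object at task-level
    # granularity; clusters irrelevant to the task can simply be dropped.
    return labels
```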
Effectively, the idea is to have AI-powered robots that can make intuitive, discriminative, task-centric decisions in real time, rather than trying to process an entire scene or environment first.
In the future, the researchers plan to adapt Clio to handle higher-level tasks.
"We're still giving Clio tasks that are somewhat specific, like 'find deck of cards,'" Maggio said. "For search and rescue, you need to give it more high-level tasks, like 'find survivors,' or 'get power back on.'" So, we want to get to a more human-level understanding of how to accomplish more complex tasks."
If nothing else, Clio could be the key to having robot dogs that can actually play fetch — regardless of which park they are running around in.