Autonomous vehicles and other robots must perceive the world in 3D to interact with their surroundings. The same is true for people and animals: perceiving a path and identifying obstacles enable all movement.
Today most robots use a combination of cameras and active sensors like lidar and radar, but these robots are easily confused in unstructured environments, like homes, streets, and sidewalks.
Using vision alone, humans perform far better than robots. Dogs, birds, and mice explore the world without lasers or human-level intelligence.
Compound Eye borrows from millions of years of evolution to simulate nature's best calculation and reasoning techniques, enabling robots to understand the world in RGB and 3D using automotive-grade cameras.
When a scene is viewed from two or more points of view, nearby objects appear to shift while faraway objects do not. This is one example of parallax, and it can be used to triangulate the distance to any object, even one the viewer does not recognize. Two or more perspectives can be generated by one camera using consecutive frames, as in the moving landscape above, or by two or more cameras capturing multiple angles simultaneously, which effectively freezes 3D objects while they are in motion.
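For rectified stereo cameras, this triangulation reduces to a one-line formula: depth is focal length times baseline divided by disparity, the pixel shift of a point between the two views. A minimal sketch (the function name and example numbers are illustrative, not from the original post):

```python
def depth_from_disparity(disparity_px, focal_px, baseline_m):
    """Triangulate depth from the parallax between two rectified views.

    disparity_px: horizontal shift of a point between the two images (pixels)
    focal_px:     camera focal length expressed in pixels
    baseline_m:   distance between the two camera centers (meters)
    """
    if disparity_px <= 0:
        raise ValueError("point must shift between views to triangulate")
    return focal_px * baseline_m / disparity_px

# With an 800 px focal length and cameras 0.5 m apart, a point that
# shifts 20 px between views lies 20 m away; a point shifting 200 px
# lies only 2 m away.  Small shifts mean far, large shifts mean near.
print(depth_from_disparity(20, 800, 0.5))   # → 20.0
print(depth_from_disparity(200, 800, 0.5))  # → 2.0
```

Note that depth falls off as 1/disparity, which is why parallax is most precise for nearby objects and degrades with range.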
Parallax-based depth sensing is a powerful technique observed across many species, and computers can do the math faster and more accurately than any living thing. Yet parallax alone is not sufficient, because some objects look the same from different angles or reflect light in misleading ways.
Given a single image plus some prior knowledge and experience, it's possible to infer the 3D structure of a scene. The sky is far away. The ground recedes toward the horizon, revealing the distance to objects on the ground plane. If we know the average size of one kind of object in the scene, we can use it to approximate the sizes of others.
Trees and walls usually stand at right angles to the ground plane. Objects near the camera appear larger and distant objects appear smaller, and if one object blocks the view of another, it must be closer, regardless of its size.
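The known-size cue has a simple geometric form under a pinhole camera model: distance is focal length times true size divided by apparent size in the image. A hedged sketch (the function and example values are illustrative assumptions, not the post's method):

```python
def distance_from_known_size(apparent_px, true_size_m, focal_px):
    """Estimate distance to an object of known real-world size.

    Pinhole camera model: an object of height true_size_m that spans
    apparent_px pixels, imaged with focal length focal_px (in pixels),
    lies at roughly focal_px * true_size_m / apparent_px meters.
    """
    return focal_px * true_size_m / apparent_px

# A pedestrian of typical height 1.7 m imaged 170 px tall with an
# 800 px focal length is roughly 8 m from the camera.
print(distance_from_known_size(170, 1.7, 800))
```

This is exactly the cue that fails for unfamiliar objects: if the true size is unknown or atypical, the estimate is wrong, which is why semantic cues alone cannot be trusted.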
Humans learn these cues over time. Thanks to recent advances in deep learning, computers can learn them too. Still, semantic cues are not sufficient, because a single image can have many different geometric interpretations or contain objects that the computer has never seen before.
Parallax and semantic cues are complementary. That's why nature relies on them both.
Walls are often smooth and flat. Given two images of a wall, parallax-based depth techniques can fail because most points look the same. But dogs and humans aren't confused: their semantic understanding of the scene tells them how walls work.
Given a single image, a computer can mistake a shadow for an object and brake suddenly. But two views are enough to tell that the road is flat, thanks to parallax.
Compound Eye has invented techniques for parallax-based depth sensing that use cameras mounted independently on different parts of a machine. We have also invented self-supervised approaches to training neural networks for monocular depth estimation, along with new ways to calibrate cameras online so that robots can operate indefinitely without adjustment, all running in real time on embedded hardware.
TL;DR - we point two or more regular cameras at a scene, estimate the distance to every point using both parallax and semantic cues, and fuse the results into accurate depth at every pixel, all in real time.
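One simple way to picture the fusion step is a per-pixel confidence-weighted average of the two depth estimates: where parallax is unreliable (a blank wall), the semantic estimate dominates, and where semantics are unreliable (an unfamiliar object), parallax dominates. This is a hypothetical sketch, not Compound Eye's actual algorithm:

```python
import numpy as np

def fuse_depth(parallax_depth, parallax_conf, mono_depth, mono_conf):
    """Fuse two per-pixel depth maps by confidence-weighted averaging.

    parallax_depth / mono_depth: depth estimates per pixel (meters)
    parallax_conf / mono_conf:   non-negative confidence weights per pixel

    Where one cue has zero confidence, the other takes over entirely.
    Real fusion pipelines are far more sophisticated; this only shows
    how complementary cues can cover each other's failure modes.
    """
    total = parallax_conf + mono_conf
    weighted = parallax_conf * parallax_depth + mono_conf * mono_depth
    return weighted / np.maximum(total, 1e-9)  # avoid division by zero

# Pixel 0: both cues agree roughly -> average.
# Pixel 1: parallax failed (blank wall, confidence 0) -> semantic wins.
parallax = np.array([10.0, 4.0])
p_conf   = np.array([1.0, 0.0])
mono     = np.array([12.0, 5.0])
m_conf   = np.array([1.0, 1.0])
print(fuse_depth(parallax, p_conf, mono, m_conf))  # ~ [11.  5.]
```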