According to Apera, 2D cameras plus AI equals “4D vision.”
“AI-enabled computer vision” is the elevator pitch for a 4D vision solution from Apera AI. In some ways, the phrase is redundant, since all computer-vision technologies fall within the field of artificial intelligence (AI). To a computer, an image is nothing more than an array of values on a grid, whether that grid represents a 2D image or a 3D space. For engineers, the challenge of machine vision has been to find ways for computers to replicate the human ability to interpret the content of images.
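To make that concrete, here is a minimal sketch, using NumPy and made-up pixel values, of what a tiny grayscale image looks like to a computer:

```python
import numpy as np

# A grayscale image is just a 2D grid of intensity values:
# a hypothetical 4x4 image with a bright object in the center.
image = np.array([
    [ 10,  12,  11,  10],
    [ 11, 240, 245,  12],
    [ 10, 238, 242,  11],
    [ 12,  11,  10,  10],
], dtype=np.uint8)

# To the computer, "seeing" the object means finding structure in
# these numbers -- here, simply the pixels above a brightness cutoff.
bright = image > 128
print(bright.sum(), "bright pixels at", list(zip(*np.nonzero(bright))))
```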
Machine-vision systems are essential in industry, especially in gauging, inspection, identification and guidance applications. In many robotic integration challenges, the vision system is the critical piece without which the robot would lack the input needed to complete the task. In others, it’s the piece that has engineers tearing their hair out.
For some tasks, such as checking that pieces of fruit are the correct color or checking the fill level of a bottle, a 2D-vision system is sufficient. In other cases, such as a bin-picking task in which image data must be used to direct the robot to a location in 3D space, a 3D-imaging device may be used.
For most applications, vision systems must locate the target object in the field of view, then compare the object to a predefined pattern.
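A classic way to implement that locate-and-compare step in 2D is template matching. The sketch below builds a synthetic scene so it runs as-is; in practice the scene would come from a camera, and the 0.8 acceptance score is an assumed, application-specific value:

```python
import cv2
import numpy as np

# Synthetic scene: sensor noise plus one bright square "part".
scene = np.random.randint(0, 50, size=(240, 320), dtype=np.uint8)
cv2.rectangle(scene, (200, 100), (230, 130), 200, thickness=-1)
template = scene[100:131, 200:231].copy()   # the predefined pattern

# Slide the template over the scene and score every position.
# Normalized cross-correlation tolerates uniform brightness changes.
scores = cv2.matchTemplate(scene, template, cv2.TM_CCOEFF_NORMED)
_, best_score, _, best_loc = cv2.minMaxLoc(scores)

if best_score > 0.8:   # acceptance threshold is application-specific
    print(f"Pattern found at {best_loc}, score {best_score:.2f}")
else:
    print("Pattern not found")
```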
2D Vision vs. 3D Vision
When a 2D-vision system is used to capture a 3D object, it gathers a flat view of the object from one perspective. For example, a spherical object would appear as a circle. For many applications, the missing height information is not needed. Common applications for 2D vision include:
- Verification of features and position
- Dimension checking
- Barcode reading
- Character recognition
- Label verification
- Quality inspection
- Surveillance and object tracking
- Presence detection
In all 2D-vision applications, an accurate image is derived from the contrast between light and dark areas of the image. For this reason, poor lighting and changes in lighting can reduce accuracy. Very dark or reflective objects can also be difficult to detect accurately. Finally, measurement errors may be introduced if target objects sit closer to or farther from the camera than expected: without any z-axis information, the system can only interpret closer or more distant objects as larger or smaller ones.
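A toy example of that contrast dependence, with a synthetic image standing in for a camera frame: a fixed threshold works until the lighting drifts, which is why adaptive schemes such as Otsu’s method are common.

```python
import cv2
import numpy as np

# Synthetic frame: dark background (40) with a brighter part (160) on it.
img = np.full((200, 200), 40, dtype=np.uint8)
cv2.rectangle(img, (60, 60), (140, 140), 160, thickness=-1)

# Contrast-based detection with a fixed threshold works here...
_, mask = cv2.threshold(img, 100, 255, cv2.THRESH_BINARY)

# ...but fails when stronger ambient light lifts the background past 100.
brighter = cv2.add(img, np.full_like(img, 80))
_, mask_bad = cv2.threshold(brighter, 100, 255, cv2.THRESH_BINARY)

# Otsu's method re-derives the threshold from each frame's histogram,
# mitigating (though not eliminating) sensitivity to lighting changes.
t, mask_otsu = cv2.threshold(brighter, 0, 255,
                             cv2.THRESH_BINARY + cv2.THRESH_OTSU)
print("fixed threshold flags", (mask_bad == 255).mean() * 100, "% of pixels")
print("Otsu re-derived the threshold as", t)
```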
With 3D vision, on the other hand, the system collects more than a flat image: it gathers a 3D point cloud, recording x, y and z coordinates for a large number of points on the object’s surface. 3D vision can be achieved via a variety of technologies, including laser triangulation, stereo vision, time-of-flight and structured light.
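To give a sense of what a point cloud is numerically, here is a minimal back-projection sketch assuming a pinhole camera model; the intrinsics and the random depth map are stand-ins for a real sensor, which would measure depth using one of the technologies above:

```python
import numpy as np

def depth_to_point_cloud(depth, fx, fy, cx, cy):
    """Back-project a depth image (meters) into an Nx3 point cloud
    via the pinhole model: x = (u - cx) * z / fx, y = (v - cy) * z / fy."""
    v, u = np.indices(depth.shape)       # pixel row/column coordinates
    z = depth
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points = np.stack([x, y, z], axis=-1).reshape(-1, 3)
    return points[points[:, 2] > 0]      # drop invalid (zero-depth) pixels

# Hypothetical intrinsics for a 640x480 depth sensor.
depth = np.random.uniform(0.5, 2.0, size=(480, 640))
cloud = depth_to_point_cloud(depth, fx=525.0, fy=525.0, cx=319.5, cy=239.5)
print(cloud.shape)   # (307200, 3): one (x, y, z) triple per pixel
```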
Compared to 2D, 3D vision can do more complex tasks thanks to that additional z data:
- Thickness, height and volume measurement
- Dimensioning and space management
- Measuring shapes, holes, angles and curves
- Detection of surface or assembly defects
- Quality control and verification against 3D CAD models
- Robot guidance and surface tracking (e.g., for welding, gluing, deburring and more)
- Bin picking for placing, packing or assembly
- Object scanning and digitization
Weaknesses of 3D Vision
3D vision requires more powerful hardware and software to do the same task. It’s more time- and processor-intensive, and it demands more sophisticated software. While today’s computers continue to shrink that processing time, 2D imaging will likely always be faster.
In addition, 3D vision captures a massive amount of data as it creates a 3D representation of the scene, yet very little of that data is actually needed for most tasks. A 3D point cloud may also sacrifice the quality of 2D information, such as texture and color, as it captures depth. Finally, 3D systems are more expensive and more difficult to set up.
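Some rough arithmetic illustrates the data-volume gap; the resolution and value types below are assumptions for illustration, not figures from any particular product:

```python
# Back-of-the-envelope data volumes for one frame (assumed sensor specs).
w, h = 1920, 1080

gray_2d = w * h * 1        # 8-bit grayscale: 1 byte per pixel
cloud_3d = w * h * 3 * 4   # x, y, z stored as 32-bit floats per point

print(f"2D frame:    {gray_2d / 1e6:.1f} MB")   # ~2.1 MB
print(f"point cloud: {cloud_3d / 1e6:.1f} MB")  # ~24.9 MB, roughly 12x more
```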
What Is 4D Vision?
Bin picking is a good example of a demanding robot-vision application. While picking objects one-by-one from a jumbled collection of identical objects in a box is trivial for a human worker, it’s been a complex challenge for robotics for many years.
It may seem intuitive that 3D vision is needed whenever the robot must see space in three dimensions. But why? Human eyes do take advantage of stereoscopic vision, in which differences between the images seen by each eye convey depth, but human vision is assisted by one other organ: the brain. The brain uses context and prior knowledge of the objects in view to estimate depth and provide guidance.
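For reference, the geometry behind stereoscopic depth is straightforward: a point appears shifted between the two views, and that shift (the disparity) encodes depth. All numbers in this sketch are illustrative:

```python
# Stereo depth from disparity: depth = focal_length * baseline / disparity.
focal_length_px = 700.0   # focal length, in pixels
baseline_m = 0.12         # distance between the two cameras, in meters
disparity_px = 35.0       # measured pixel shift for a matched point

depth_m = focal_length_px * baseline_m / disparity_px
print(f"Estimated depth: {depth_m:.2f} m")   # 2.40 m
```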
That brain-assisted approach is the idea behind Apera AI’s VCV-Cortex, a system that combines 2D vision with AI to deliver the speed and simplicity of 2D vision along with the robustness and capability of 3D vision. At least, that’s what Apera AI claims.
The system uses two cameras with plug-and-play software. The target object is first scanned to create the pattern. Next, the AI system takes 24 to 48 hours to train on and learn the object. After that, a robot equipped with the system is ready to pick the part. The system runs on an on-site computer.
Co-founder and CEO Armin Khatoonabadi explained how a scene is processed. The 2D camera captures the scene, then the AI extracts 2D information, such as shadows, edges, texture and color, much as a human would, while performing object recognition and pose estimation. The object’s pose is then sent to the robot, giving it the same guidance a 3D camera would.
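Apera has not published implementation details, so the following is only a structural sketch of the pipeline as described. Every name here is a hypothetical placeholder, not Apera’s actual API:

```python
# Hypothetical sketch of a 2D-plus-AI picking pipeline -- placeholder
# names throughout; this is NOT Apera's implementation.

def capture_frames(cameras):
    """Grab one 2D frame from each camera."""
    return [cam.grab_frame() for cam in cameras]

def estimate_pose(frames, trained_model):
    """AI stage: recognize the trained object in the 2D frames and
    regress its 6-DoF pose (x, y, z, roll, pitch, yaw)."""
    # The model attends to edges, shadows, texture, occlusion, etc.
    features = trained_model.extract_features(frames)
    return trained_model.regress_pose(features)

def pick_cycle(cameras, trained_model, robot):
    frames = capture_frames(cameras)
    pose = estimate_pose(frames, trained_model)
    robot.move_to_grasp(pose)   # hand the pose to the robot controller
```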
“We use more AI, which extracts a lot of information,” he said. “The system doesn’t pay attention to only the edges. It pays attention to occlusion as well as textures, shadows. The system is trained on the geometry and shape of the picture. We have a lot of control over what it should do and what it should ignore. We can tell the system to ignore a piece of geometry, and this gives the user a great deal of control.”
According to Khatoonabadi, this 4D vision approach is more comparable to human vision than to conventional 3D computer vision. When humans observe an object, they don’t use laser measurement to find its depth; they use stereoscopic vision. The eyes gather 2D images, and the brain takes over, interpreting an array of contextual information, including color, shadows, perspective, edges and more. Even with one eye closed, eliminating stereoscopic depth perception, humans can still rapidly estimate distance or depth. The intelligence with which that contextual information is interpreted is highly valuable to overall vision.
While this system is a possible solution for engineers looking for plug-and-play machine vision with a capability boost over 2D systems, it still does not capture the data that a 3D system can; it only estimates it. There will undoubtedly remain machine-vision applications that require true 3D data, and as computer processing continues to get faster and cheaper, the speed and data drawbacks of 3D vision become less relevant.