Why It’s Notoriously Difficult to Compare AI and Human Perception

Science fiction is becoming reality as increasingly intelligent machines gradually emerge, ones that not only specialize in games like chess, but can also carry out higher-level reasoning, or even answer deep philosophical questions. For the past few decades, experts have been collectively working toward the creation of such human-like artificial intelligence, the so-called “strong” AI or artificial general intelligence (AGI), which can learn to perform a wide range of tasks as easily as a human might.
But while current AI development may take some inspiration from the neuroscience of the human brain, is it actually appropriate to compare the way AI processes information with the way humans do it?
The answer to that question depends on how experiments are set up, and on how AI models are structured and trained, according to new research from a team based at the University of Tübingen and other German research institutes.
The team’s study suggests that because AI and humans arrive at their decisions in different ways, generalizations drawn from such comparisons may not be completely reliable, especially if machines are used to automate critical tasks.
Eye of the Beholder
In particular, the team focused on analyzing the mechanics of human visual perception in contrast to computer vision, a field of research that seeks to develop ways for machines to “see” — and to understand what they see.
“As deep neural networks (DNNs) have become very successful in the domain of artificial intelligence, they have begun to directly influence our lives through image recognition, automated machine translation, precision medicine and many other applications,” explained Christina Funke of the Bethge Lab at the University of Tübingen, who, along with Judy Borowski, is co-first author of the study. “Given the many parallels between these modern artificial algorithms and biological brains, many questions arise: How similar are human and machine vision really? Can we understand human vision by studying machine vision? Or the other way round: Can we gain insights from human vision to improve machine vision? All these questions motivate the comparison of these two intriguing systems.”
Comparing the two systems is one of the logical first steps in discovering how to build a human-level AGI. But as Borowski cautions: “While comparison studies can advance our understanding, they are not straightforward to conduct. Differences between the two systems can complicate the endeavor and open up several challenges.”
To highlight some serious pitfalls in comparing how machines and humans make decisions in complex recognition tasks, the researchers chose three standard benchmark tests for processing visual data and comparatively deconstructed them: closed contour detection, the Synthetic Visual Reasoning Test (SVRT), and the recognition gap.
The team first used the closed contour detection test to see whether ResNet-50, a deep-learning convolutional neural network (CNN) for image classification, could identify whether an image contained lines that join up to form a closed contour, something that humans can do quite easily. Initially, it seemed that the model was able to recognize closed contours drawn with both hard-edged and curved lines with as much ease as a human.
However, the model failed when parameters such as line thickness or line color were changed, indicating that the AI’s apparently human-level performance might degrade once presented with further variations, and suggesting that deep neural networks may sometimes find unexpected solutions that lie outside the perceptual biases of humans. It’s a case where “humans can be too quick to conclude that machines learned human-like concepts,” said Funke.
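To get a sense of how this kind of test can be run in practice, the sketch below fine-tunes a pretrained ResNet-50 as a binary closed-versus-open contour classifier. It is a minimal illustration rather than the authors’ actual setup; the dataset path and the “closed”/“open” folder layout are hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): fine-tune a pretrained ResNet-50
# to classify images as containing a closed or an open contour.
import torch
import torch.nn as nn
from torchvision import datasets, models, transforms

preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],   # ImageNet statistics
                         std=[0.229, 0.224, 0.225]),
])

# Hypothetical directory with two subfolders, "closed/" and "open/".
train_set = datasets.ImageFolder("contour_stimuli/train", transform=preprocess)
loader = torch.utils.data.DataLoader(train_set, batch_size=32, shuffle=True)

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = nn.Linear(model.fc.in_features, 2)  # two classes: closed vs. open

optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

model.train()
for images, labels in loader:      # one pass over the training stimuli
    optimizer.zero_grad()
    loss = criterion(model(images), labels)
    loss.backward()
    optimizer.step()
```

The step the study highlights comes after training: evaluating the same model on stimuli whose line thickness or color differs from the training set, which is where the apparently human-level performance can break down.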
In the Synthetic Visual Reasoning Test portion of the experiment, the team then set out to verify whether the AI could pick out identical shapes (the “same-different” task), as well as analyze spatial arrangements (the “spatial” task), such as finding shapes nested inside other shapes.
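As a rough illustration of what a “same-different” stimulus looks like, the sketch below generates toy images containing two shapes that are either identical copies or two different random shapes. It is a simplified stand-in, not the original SVRT generator.

```python
# Toy "same-different" stimuli (a simplified stand-in for SVRT-style images):
# each image holds two shapes; label 1 means they are identical, 0 means not.
import math
import random
from PIL import Image, ImageDraw

def random_shape(radius=20, n_points=6):
    """Vertex offsets for a random irregular polygon around the origin."""
    pts = []
    for i in range(n_points):
        angle = 2 * math.pi * i / n_points
        r = radius * random.uniform(0.5, 1.0)
        pts.append((r * math.cos(angle), r * math.sin(angle)))
    return pts

def draw_shape(draw, shape, cx, cy):
    draw.polygon([(cx + dx, cy + dy) for dx, dy in shape], outline=0)

def make_stimulus(size=128, same=True):
    img = Image.new("L", (size, size), color=255)   # white background
    draw = ImageDraw.Draw(img)
    shape_a = random_shape()
    shape_b = shape_a if same else random_shape()   # copy it, or draw a new one
    draw_shape(draw, shape_a, size // 4, size // 2)
    draw_shape(draw, shape_b, 3 * size // 4, size // 2)
    return img, int(same)

# A small labelled set that a classifier could be trained and tested on.
dataset = [make_stimulus(same=random.random() < 0.5) for _ in range(1000)]
```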
Based on a review of previous studies, the team hypothesized that a DNN would do well on spatial tasks but not on the same-different task. Humans typically do well on both types of task, as they need only a few examples to learn, and can then generalize that knowledge to future examples. Surprisingly, however, the researchers found that the DNN performed well on both kinds of tests, suggesting that previously reported discrepancies may originate more in how neural networks are trained, how much training data is used, and how they are structured.
“This second case study highlights the difficulty of drawing general conclusions about the underlying mechanisms that reach beyond the tested architectures and training procedures,” Borowski noted.
The next portion of the team’s experiment involved images of objects that are successively cropped and zoomed in until the subject can no longer recognize what is in the image; the difference between the smallest recognizable crop and the largest unrecognizable one is known as the “recognition gap”.
Previous work has shown that there is a large recognition gap when testing human subjects, and a smaller one with machines. However, the team chose to add an extra twist to their experiment by testing the DNN with samples that were selected by a state-of-the-art search algorithm, rather than by human researchers, as was done in other studies. What they discovered was that the AI’s recognition gap between the minimal recognizable and maximal unrecognizable crops was just as large as that of human subjects, indicating that testing conditions must be set consistently in order to properly compare the two systems.
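To make the idea of a recognition gap concrete, the sketch below crops an image ever more tightly around its center and records a pretrained classifier’s confidence in the true class at each step. This uses simple center crops rather than the search algorithm described in the study, and the image file and ImageNet label index are hypothetical examples.

```python
# Simplified sketch (center crops, not the study's search algorithm): track how
# a pretrained classifier's confidence in the true class changes as an image
# is cropped ever more tightly toward its center.
import torch
from PIL import Image
from torchvision import models, transforms

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1).eval()
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

def true_class_confidence(img, true_class):
    with torch.no_grad():
        probs = torch.softmax(model(preprocess(img).unsqueeze(0)), dim=1)
    return probs[0, true_class].item()

image = Image.open("golden_retriever.jpg").convert("RGB")  # hypothetical image
true_class = 207                                           # hypothetical ImageNet index
w, h = image.size

# Shrink the crop in 5% steps; the scale at which confidence collapses marks
# one side of the recognition gap.
for scale in [1.0 - 0.05 * i for i in range(18)]:
    cw, ch = int(w * scale), int(h * scale)
    left, top = (w - cw) // 2, (h - ch) // 2
    crop = image.crop((left, top, left + cw, top + ch))
    print(f"scale={scale:.2f}  p(true class)={true_class_confidence(crop, true_class):.3f}")
```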
“All conditions, instructions and procedures should be as close as possible between humans and machines in order to ensure that all observed differences are due to inherently different decision strategies, rather than differences in the testing procedure,” concluded the team.
Ultimately, the team’s work points to the need for more carefully designed experiments when it comes to comparing machine and human systems, and for a better understanding of the role that human biases may play in conducting such comparisons.
“The overarching aspect of all our case studies is human bias: they illustrate how much our own perspective can skew the design and interpretation of experiments,” explained Funke. “An example from everyday life could be our tendency to quickly describe an animal as happy or sad simply because its face might have a human-like expression.”
Funke told us that the team now plans to release a revised paper and checklist on arXiv, the online repository for preprints of scientific papers, laying out ideas on how to better design, conduct and interpret comparative experiments in a way that takes these biases into account.
“From a broader perspective, we hope that our work inspires others to conduct comparison studies, even though the idiom of ‘comparing apples and oranges’ suggests that some systems should not be compared,” Borowski said. “We hope that scientists will feel encouraged to perform these challenging investigations, and will find our checklist (in the new version of our paper) helpful for double-checking their experiments and making ‘fruitful’ contributions.”
Images: University of Tübingen, Bernstein Center for Computational Neuroscience, Werner Reichardt Centre for Integrative Neuroscience, and Volkswagen Machine Learning Research Lab