Off-The-Shelf Hacker: Three Approaches to Machine Vision
My latest project, the robotic skull, is getting complicated. Right now there is a quad-core processor in the JeVois smart vision sensor, a quad-core chip in the Raspberry Pi and a micro-controller in the Arduino. We’ll probably add at least another ESP8266 to the mix.
One of the prime reasons I’m developing this project is to have a testbed for exploring physical computing ideas. Being able to demonstrate concepts with a self-contained device makes a lot of sense, and using a Raspberry Pi as the central “brain” works great. Just hook up an HDMI monitor or projector, plug in a wireless keyboard/mouse-pad and we’re good to go. In addition to “showing” how the various physical computing concepts and subsystems work, I can also put up slides using LibreOffice, if needed, right from the skull.
Integrating a Raspberry Pi into the skull also lets me run guvcview, which becomes the graphical user interface to the JeVois sensor. Fire up guvcview, select a resolution and you can instantly see what the JeVois sensor is recognizing. Each resolution keys to a certain machine vision model. For example, the 1280×480 resolution corresponds to a two-window (raw vs. recognized) rendering with the YOLO deep neural network object detection algorithm. There are currently quite a few user demos available.
The vision models are what separate the JeVois sensor from previous-generation technology like the Pixy camera. The Pixy was capable of recognizing color blobs, given adequate lighting. While light years ahead of its time, it was fussy and not as reliable as I would have liked. The JeVois can easily discern color blobs too, and that’s just one of its many capabilities. To say the JeVois is jammed full of software is an understatement. Linux and the vision engine take up a remarkable 7.3GB of space on the micro-SD card.
So together, the Pi, the vision sensor and the Arduino make an incredibly powerful and flexible package.
Paradoxically, all I want to do is have the skull track me as I walk back and forth on stage. How complicated is that, right?
The magic happens in the machine vision models.
There are models to detect the lines on a road, count the dots on a pair of dice, identify objects and more. The three vision models that I think will work for the robotic skull are color recognition, object detection and saliency.
Different models are explored by plugging in the JeVois sensor then starting the guvcview program on your Pi or external Linux notebook. You then select a resolution from the drop-down menu that corresponds to the model you’d like to run. A window will pop up with your video feed and the augmented information, in the form of colored boxes, circles and text tags, from the vision sensor. Recognition data also flows to the Arduino, over a separate serial port, at the same time to actuate the servo.
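To make the serial hand-off a little more concrete, here is a minimal Python sketch of what the receiving side might do if you read the data on the Pi. It assumes the sensor sends one text line per detection in the form `T2 x y`, with x running from -1000 to 1000. That message layout, along with the names `parse_target` and `pan_angle`, is an illustrative assumption, so check your sensor's serial settings for the format it actually emits.

```python
def parse_target(line):
    """Return (x, y) from a line such as 'T2 -250 100', or None if it doesn't match.

    The 'T2 x y' layout is an assumption made for this sketch.
    """
    parts = line.strip().split()
    if len(parts) != 3 or parts[0] != "T2":
        return None
    try:
        return int(parts[1]), int(parts[2])
    except ValueError:
        return None


def pan_angle(x, x_min=-1000, x_max=1000, angle_min=0, angle_max=180):
    """Linearly map a horizontal coordinate to a servo angle, clamped to the servo's range."""
    x = max(x_min, min(x_max, x))
    span = (x - x_min) / (x_max - x_min)
    return round(angle_min + span * (angle_max - angle_min))


if __name__ == "__main__":
    for raw in ["T2 0 0", "T2 -1000 50", "garbage"]:
        target = parse_target(raw)
        if target is None:
            print(f"{raw!r}: ignored")
        else:
            print(f"{raw!r}: pan servo to {pan_angle(target[0])} degrees")
```

The same mapping would translate directly to an Arduino sketch. Clamping the input first keeps a stray out-of-range value from driving the servo past its stops.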
The color recognition model mimics the old Pixy device and keys to a resolution of 320×254. The model looks for pixels whose hue, saturation and value fall within a particular range, then finds the contours of the matching objects. It then sends object center information over the serial port. You can pick it up with an Arduino to move a servo or send it to the Raspberry Pi for further processing and analysis.
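The idea behind color blob tracking can be sketched in a few lines of plain Python. This is not the code running on the sensor: it skips contour finding entirely and simply averages the positions of every pixel that falls inside the chosen hue, saturation and value ranges. The function names and the sample ranges for "green" are assumptions picked for illustration.

```python
import colorsys


def in_range(rgb, h_range, s_range, v_range):
    """True if an (r, g, b) pixel with 0-255 channels falls inside all three HSV ranges (0.0-1.0)."""
    r, g, b = (c / 255.0 for c in rgb)
    h, s, v = colorsys.rgb_to_hsv(r, g, b)
    return (h_range[0] <= h <= h_range[1]
            and s_range[0] <= s <= s_range[1]
            and v_range[0] <= v <= v_range[1])


def blob_center(image, h_range, s_range, v_range):
    """Return the (x, y) centroid of matching pixels, or None if nothing matches.

    `image` is a list of rows, each row a list of (r, g, b) tuples.
    """
    xs, ys = [], []
    for y, row in enumerate(image):
        for x, pixel in enumerate(row):
            if in_range(pixel, h_range, s_range, v_range):
                xs.append(x)
                ys.append(y)
    if not xs:
        return None
    return sum(xs) / len(xs), sum(ys) / len(ys)


if __name__ == "__main__":
    # A 5x4 black image with a 2x2 green patch in it.
    black, green = (0, 0, 0), (0, 200, 0)
    image = [[black] * 5 for _ in range(4)]
    for y in (1, 2):
        for x in (2, 3):
            image[y][x] = green
    print(blob_center(image, h_range=(0.25, 0.42), s_range=(0.5, 1.0), v_range=(0.3, 1.0)))
```

Note that a simple minimum/maximum hue test like this one won't handle red, whose hue wraps around both ends of the scale. It also shows why lighting matters so much: dim the scene and the value channel drops out of range.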
The Pixy could certainly do this task and you could actually train it using a single push-button on the board itself. The JeVois is a little more complicated in that you use an external application to isolate the desired color range, then input those parameters to a file on the sensor.
I’ve experimented with recognizing colors on the JeVois and it’s roughly as good as the Pixy. I should point out that the Pixy could recognize an object at 50 frames per second, while the JeVois hums along at about 82 frames per second. Sensitivity seems about the same, resolution is better and both are dependent on ambient lighting conditions. I expect that color blob recognition will be my backup choice when choosing the final model that tracks me on stage.
Object Recognition Using Neural Networks
When you switch to 1280×480 resolution the YOLO neural network kicks in. YOLO is an acronym that stands (in this case) for “you only look once.” The version in the JeVois sensor can detect about 1,000 objects, including a bike, a car, a person, a TV monitor, a bottle and so on. It’s fun to put various objects in front of the sensor and see the results.
Sometimes it thinks our dog, Sequin, is a toy poodle. Other times the skull correctly identifies her as a Maltese.
The normal detection time for a person, dog, table or chair using this model is around 2,400 milliseconds, or 2.4 seconds. Not blazingly fast, although actually pretty good when you think about it, for a device that is roughly a 1-inch cube and costs around $50.
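The timing figures quoted so far are easier to compare when they're on the same scale. The short Python calculation below converts frames per second to milliseconds per update and back, using only the numbers from this article. The function names are mine.

```python
def ms_per_update(fps):
    """Milliseconds between updates at a given frame rate."""
    return 1000.0 / fps


def updates_per_second(ms):
    """Updates per second when each update takes `ms` milliseconds."""
    return 1000.0 / ms


if __name__ == "__main__":
    print(f"Pixy color tracking at 50 fps: {ms_per_update(50):.1f} ms per update")
    print(f"JeVois color tracking at 82 fps: {ms_per_update(82):.1f} ms per update")
    print(f"YOLO at 2,400 ms per detection: {updates_per_second(2400):.2f} updates per second")
```

Color tracking refreshes every 12 to 20 milliseconds, while the neural network manages less than one update every two seconds. For a servo following someone walking across a stage, that gap is worth keeping in mind.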
Object recognition might work well because the model doesn’t seem to have any trouble picking out people. I’ll be the only person on stage, although I’m not sure yet if the skull might fixate on other objects it identifies, like a table or a chair. Testing is always a part of off-the-shelf hacker development.
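One way to keep the skull from fixating on furniture is to discard every detection that isn't labeled as a person before steering the servo. The sketch below assumes each detection arrives as a label, a confidence score and a horizontal center. That structure, and the names `Detection` and `pick_person`, are hypothetical and not the sensor's actual output format.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Detection:
    label: str         # class name reported by the detector, e.g. "person"
    confidence: float  # 0.0 to 1.0
    x_center: float    # horizontal position of the box center


def pick_person(detections: Iterable[Detection],
                min_confidence: float = 0.5) -> Optional[Detection]:
    """Return the most confident 'person' detection, or None if there isn't one."""
    people = [d for d in detections
              if d.label == "person" and d.confidence >= min_confidence]
    return max(people, key=lambda d: d.confidence, default=None)


if __name__ == "__main__":
    frame = [
        Detection("chair", 0.9, 100),
        Detection("person", 0.7, 320),
        Detection("table", 0.8, 500),
    ]
    target = pick_person(frame)
    print(target.x_center if target else "nobody in view")
```

When nothing qualifies, the function returns None, so the caller can simply leave the servo where it is until a person shows up again.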
The last model, corresponding to 640×300 resolution, is saliency. Saliency is described as an image’s most attention-grabbing or conspicuous features. Think of a baby. They naturally track movement, bright objects and things that get their attention.
I found an in-depth paper on saliency that might interest readers. Be aware that this stuff gets complicated very quickly. Right now, the saliency model seems like it will be the best bet for my “Dr. Torq on stage” tracking task. I’ll be the only thing moving around, at the front of the room, so the skull shouldn’t get confused as to what to track.
Of course, there’s always a chance that the sensor will see a bright spot on one of my slides or some other attention-grabbing object and stop watching what I do. Easy solution: don’t do any slides.
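To see why a bright spot on a slide could steal the skull's attention, consider this toy Python scoring function. It is emphatically not the saliency algorithm in the sensor, just a crude stand-in that scores each pixel by how much it moved and how much it stands out from the average brightness. The name `most_salient` and the weights are assumptions for illustration.

```python
def most_salient(prev, curr, motion_weight=1.0, brightness_weight=0.5):
    """Return (x, y) of the highest-scoring pixel in `curr`.

    `prev` and `curr` are equal-sized grayscale frames: lists of rows of 0-255 ints.
    Score = motion_weight * |change since prev|
          + brightness_weight * |difference from the frame's mean brightness|.
    """
    pixels = [p for row in curr for p in row]
    mean = sum(pixels) / len(pixels)
    best_score, best_xy = -1.0, None
    for y, (prev_row, curr_row) in enumerate(zip(prev, curr)):
        for x, (before, now) in enumerate(zip(prev_row, curr_row)):
            score = (motion_weight * abs(now - before)
                     + brightness_weight * abs(now - mean))
            if score > best_score:
                best_score, best_xy = score, (x, y)
    return best_xy


if __name__ == "__main__":
    # A dim 4x3 scene with a static bright spot (the "slide") at the top left.
    prev = [[50] * 4 for _ in range(3)]
    prev[0][0] = 255
    # In the next frame, something moves into the bottom-right corner.
    curr = [row[:] for row in prev]
    curr[2][3] = 150
    print("motion counted:", most_salient(prev, curr))
    print("motion ignored:", most_salient(prev, curr, motion_weight=0.0))
```

With motion weighted in, the mover in the corner wins. Take motion out of the score and the static bright spot wins instead, which is roughly the failure I'm worried about.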
As I get more experience with the models, tweaking settings and adjusting environmental factors should eventually provide reliable tracking.
Teaming up multiple computing resources with cutting-edge machine vision models will hopefully make my seemingly trivial job of tracking a person a reality. That’s just the beginning. I think combinations of computing sub-systems talking to each other, along with advanced, perhaps AI-equipped, software are the next big wave. We’re way beyond blinking an LED.
Think about it: we can do all these cool things right now with an off-the-shelf hacker mindset.
The tech world just keeps getting better and better.