
Off-The-Shelf Hacker: Tell Me What You See

How to build an object recognition system with speech synthesis capabilities.
Jul 13th, 2019 6:00am by Dr. Torq

“What is it?” my optometrist asked as I handed her the little blue JeVois camera.

“It’s a machine vision sensor that uses neural networks to recognize about 1,000 different objects,” I replied. “It can also see faces and even correctly identified our dog as a Maltese,” I continued. She was suitably impressed and appreciated my showing her the $50 gadget.

“This might be helpful for people with vision trouble,” she said. “Maybe with some kind of speech synthesis,” I said.

After a couple of weeks of thought and several hours of work, I’m excited to say I have a proof-of-concept object recognition setup with speech synthesis. Don’t let the seeming simplicity fool you: there is an awful lot going on in visually identifying an object and then correctly saying its name. The magic is in the off-the-shelf parts and a tiny bit of extra programming to tie it all together.

JeVois Setup

The project has a couple of parts. There is the JeVois sensor which is mounted on a makeshift “third-hand” test stand. Then we also have the USB cable that plugs into my Linux ASUS notebook. And, we need a few objects to recognize. A micro keyboard, a pen and a wireless mouse fit the bill.

Third-hand stand with the JeVois smart machine vision sensor.

I chose the Detection DNN module because it can pick out 80 different common everyday objects, including keyboards, cars, chairs and cups as well as persons. It uses OpenCV’s deep neural networks (DNN) to look out at a scene, then put bounding boxes around each detected object. The data stream also includes the object’s relative location and rough size within the field of view of the camera.

The object name, certainty percentage, location and bounding box data flow over the USB and hardware serial lines. I hacked a Processing program to pull the data from the sensor into the Linux notebook and then used the eSpeak text-to-speech command-line application to read out the object name over the speaker.

The Processing Code

The program, written in the Processing programming language, turned out to be only about 30 lines of code.

A serial readStringUntil() example from the Processing site served as the model, and I added a few extra capabilities so it would work with the JeVois sensor.

After the customary variable initialization, we need to open the serial port and clear out any odd serial data that might be in the input buffer. The JeVois sensor uses the /dev/ttyACM0 port for serial communication over the USB cable. It should be plugged in before running the Processing program.

The main loop reads in a string of text until it finds a line-feed, signifying one line of input.
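That read-until-linefeed step can be sketched in plain Java, Processing’s underlying language. The class, method and buffer names here are hypothetical stand-ins for Processing’s readStringUntil('\n'), which does the equivalent buffering internally:

```java
public class ReadLineDemo {
    // Mimic Processing's readStringUntil('\n'): pull one complete
    // line (through the line feed) off the front of a buffer of
    // accumulated serial characters, or return null if no full
    // line has arrived yet.
    public static String readUntilLinefeed(StringBuilder buffer) {
        int nl = buffer.indexOf("\n");
        if (nl < 0) return null;          // line still incomplete
        String line = buffer.substring(0, nl + 1);
        buffer.delete(0, nl + 1);         // consume it from the buffer
        return line;
    }

    public static void main(String[] args) {
        StringBuilder buf = new StringBuilder("N2 keyboard:61 -756 -406 1653 638\nN2 rem");
        // Only the first, complete line is returned; the partial
        // second line stays in the buffer until more data arrives.
        System.out.print(readUntilLinefeed(buf));
    }
}
```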

Next, the input line is broken into tokens. Notice that the delimiter string, within the double quotes, is a space and a colon. The tokens are broken out using those separator characters. A typical line from the JeVois looks like the following.

N2 keyboard:61 -756 -406 1653 638

We index the tokens from left to right as an array. The N2 token becomes the [0] index and indicates that the data line is in the “Normal” format and carries two-dimensional data. The available formats are Terse, Normal, Detail and Fine.

The [1] token is then the name of the object, and the [2] token, separated from the name by the colon, is the certainty value (61% certain it is a keyboard). The next two tokens are the X and Y coordinates of the object’s center of mass, and the last two are the width and height of the bounding box.
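The tokenizing described above can be sketched in plain Java; Processing’s splitTokens() behaves much like a regex split on the space and colon characters. The class and method names here are mine, not from the original sketch:

```java
public class JevoisParser {
    // Break one "Normal"-style JeVois line into tokens, splitting
    // on runs of spaces and colons, just as the Processing sketch
    // does with splitTokens(line, " :").
    public static String[] tokens(String line) {
        return line.trim().split("[ :]+");
    }

    public static void main(String[] args) {
        String[] t = tokens("N2 keyboard:61 -756 -406 1653 638");
        // t[0] is the format marker, t[1] the object name,
        // t[2] the certainty percentage.
        System.out.println(t[1] + " (" + t[2] + "% certain)");
        // prints: keyboard (61% certain)
    }
}
```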

After the tokens are extracted, the program sends the object’s name out to the eSpeak command-line application using Processing’s external program execution function. eSpeak says the object name, then we loop back for the next object from the JeVois sensor. The delay(1000) function helps prevent an echo of object names, since the sensor can recognize objects about once every 250 milliseconds. Clearing the input buffer with myPort.clear() and emptying the string with myString = null also helps call out objects one at a time.
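The external-program call can be sketched in plain Java with ProcessBuilder; Processing’s own exec() function works along the same lines. The Speak class and espeakCommand() helper are hypothetical names, and the example assumes the espeak command is installed on the system:

```java
public class Speak {
    // Build the argument list handed to eSpeak. Kept as a separate
    // helper so the command can be inspected without launching anything.
    public static String[] espeakCommand(String objectName) {
        return new String[] { "espeak", objectName };
    }

    public static void main(String[] args) {
        try {
            // Launch eSpeak and wait for it to finish saying the word,
            // roughly what the Processing sketch's exec() call does.
            Process p = new ProcessBuilder(espeakCommand("keyboard"))
                    .inheritIO()
                    .start();
            p.waitFor();
        } catch (Exception e) {
            // eSpeak may not be installed on every machine.
            System.err.println("eSpeak not available: " + e.getMessage());
        }
    }
}
```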

That’s pretty much all there is to recognizing an object and saying its name through the notebook speaker.

Making It Work

Make sure the JeVois cable is plugged into two USB ports on the Linux notebook. Next, start guvcview with the following command line.

guvcview -x 640x498 -d /dev/video1 -g gtk3

You may have to adjust the video device. If the notebook’s webcam shows up in the viewer window, try /dev/video0 or /dev/video2 instead. The 640×498 resolution corresponds to the Detection DNN module on the JeVois camera.

While there are other ways to do the next step, I’ll go through it so readers understand the concepts. We need to set a few parameters on the JeVois and a super-easy way to do it is through the serial monitor on the Arduino IDE.

Open the Arduino IDE and make sure you are connected to the /dev/ttyACM0 serial port. Open the serial monitor and type the following lines. Hit the send button after each line.

setpar serout USB
setpar serstyle Normal

The first one sets the serial output of the JeVois device to use the USB line. You could also substitute All for USB, which sends data over the hardware port and the USB port at the same time. The second line sets the data format to Normal, which gives the object name in addition to the other data.

The data should then flow into the serial monitor, formatted like the example line we discussed above.

Finally, start the Processing IDE and run the program to speak the names of the objects.

The video goes through the setup and demo of the object recognition with text-to-speech audio output on the Linux notebook. Notice that the JeVois doesn’t always recognize the object correctly. For example, it thinks the nano-keyboard/mousepad is a “remote.” A lot depends on the ambient lighting and angle of the object to the camera.

Next Steps

There is a lot of room for improvement with the proof-of-concept. I’d like to move the Processing program over to a Raspberry Pi and see if it will work reliably on that platform. Perhaps Hedley, the robotic skull, could use the setup to identify objects and say their names through his speaker.

It might be interesting to replace the Processing script with a little command-line program written in the C language. I’ll look into it soon.

Of course, my optometrist will surely want to see the results when I craft it into a portable form.

Catch Dr. Torq’s Off-The-Shelf Hacker column each Saturday, only on The New Stack! Contact him directly at 407-718-3274 for consulting, speaking appearances and commissioned projects.
