
Off-The-Shelf Hacker: Creating Voice-Jaw Data with a Processing Script

4 Sep 2018 8:37am, by

Last week, we looked at the Arduino-jaw servo side of Hedley, my talking robotic skull. Initially, his jaw was driven by a small analog electret microphone connected to the Arduino NG microcontroller. The synchronization between my spoken voice and Hedley’s flapping mandible was mediocre.

Current off-the-shelf audio-to-jaw-servo controllers work similarly. They take the raw audio and convert it into a pulse-width-modulation signal that moves the jaw servo. The audio is fed to an analog input through a wired connection to a speaker or to the earphone output of a Raspberry Pi. Although this method works fine for simple props, Hedley’s jaw will have to move in response to data from onboard applications.

Hedley isn’t a telepresence robot. He’s a stand-alone smart robotic prototyping system. There won’t be an assistant behind the curtain remotely answering my questions or making comments during one of my tech talk presentations. I eventually want Hedley to be able to reply using Alexa-like responses, with or without a network connection.

Having covered the Arduino-jaw servo subsystem last time, today we’ll examine the “script,” or data creation, side of the equation.

We’ll Need Some Phrases

For now, the tech talk “act” will be “scripted” between Hedley and myself. It will give the illusion of a conversation. Of course, that’s assuming I can remember MY lines. It should be a great effect and lay the foundation for getting a much more sophisticated artificial intelligence (AI) conversational process up and running. Everything is going to AI, you know.

For a scripted act, we’ll certainly need a few canned audio responses. Naturally, Hedley should sound like a robot. A logical choice for this job is to use eSpeak, which is easily integrated into Linux scripts. It will even send the audio to a .wav file by using the “-w” option. We covered this text-to-speech program a few weeks ago.

Here’s an example.

robnotebook%  espeak "My name is Hedley, please help me welcome Doctor Torq to the stage" -v en-us -p 30 -s 200 -k 20 -w intro-drtorq2.wav

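The “-v” option sets the voice to US English, “-p” sets the pitch to 30 (on a 0 to 99 scale, default 50), “-s” sets the speed to 200 words per minute, and “-k” set to 20 signals capital letters with a pitch increase. The “-w” option writes the result to the named .wav file.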
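Typing one command per phrase gets tedious once the act has more than a few lines. The following plain Java sketch builds the same eSpeak command line for each phrase and, only when asked, runs it with ProcessBuilder. It assumes the espeak binary is on the PATH; the second phrase, its file name and the “--run” switch are hypothetical.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Sketch: batch-generate canned responses with eSpeak.
// Assumes espeak is on the PATH; phrases, file names and "--run" are hypothetical.
public class EspeakBatch {

    // Argument list for one phrase, using the settings from the example above:
    // US-English voice, pitch 30, 200 words per minute, capitals marked by pitch (20), output to a .wav file.
    static List<String> espeakCommand(String text, String wavFile) {
        List<String> cmd = new ArrayList<>();
        cmd.add("espeak");
        cmd.add("-v"); cmd.add("en-us");
        cmd.add("-p"); cmd.add("30");
        cmd.add("-s"); cmd.add("200");
        cmd.add("-k"); cmd.add("20");
        cmd.add("-w"); cmd.add(wavFile);
        cmd.add(text); // the phrase goes last, after the options
        return cmd;
    }

    // Run espeak for one phrase and wait until it has finished writing the file.
    static int generate(String text, String wavFile) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(espeakCommand(text, wavFile)).inheritIO().start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        String[][] script = {
            {"My name is Hedley, please help me welcome Doctor Torq to the stage", "intro-drtorq2.wav"},
            {"Thank you, Doctor Torq", "thanks-drtorq.wav"}, // hypothetical second line
        };
        // Without "--run" this is a dry run that only prints the commands.
        boolean run = args.length > 0 && args[0].equals("--run");
        for (String[] line : script) {
            if (run) {
                System.out.println(line[1] + " -> exit status " + generate(line[0], line[1]));
            } else {
                System.out.println(String.join(" ", espeakCommand(line[0], line[1])));
            }
        }
    }
}
```

Because ProcessBuilder hands each argument straight to espeak instead of going through a shell, phrases containing spaces or commas need no extra quoting.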

You should definitely play around with the voices and other settings to find a combination that gives your robot a distinctive and interesting sound.

Down the line, I’ll need to be able to take a text response or some kind of data from an application and convert it into sounds that will come out of the speaker inside Hedley’s mouth. At the same time, the audio will have to be analyzed and the resulting data sent to Hedley’s Arduino-jaw servo subsystem.

I generated several .wav files with various responses for testing purposes. With the .wav files ready, let’s turn our attention to the Processing-based audio analysis program.

Turning a .wav File into Data

The Processing programming language, with its rich set of libraries, makes it easy to play a .wav audio file and analyze its waveform in near real time. This is important because any lag between the sound and the jaw movement ruins the “talking skull” effect. Processing also offers a wide range of functions for sophisticated audio and visual effects.

Also, Processing’s setup()/draw() code layout closely mirrors the Arduino’s setup()/loop() structure. Why not use essentially the same code structure for the Arduino and for your audio-visual code on a Linux notebook or Raspberry Pi?

I chose Processing’s Amplitude analyzer, which reports the root-mean-square (RMS) amplitude of the audio, to capture the prominent points of the waveform. It gives reasonably realistic jaw movement. You’ll definitely need to tweak settings on both the Processing data analysis side and the Arduino-jaw servo side to get the best effect for your project.
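For intuition about the numbers an RMS analyzer produces, here is a minimal Java sketch that computes the RMS level of one block of samples. It assumes normalized float samples in the range -1.0 to 1.0 and is only an illustration of the measure, not the Processing Sound library’s internal implementation.

```java
// Sketch: root-mean-square (RMS) level of one block of audio samples.
// Samples are assumed to be normalized floats in the range -1.0..1.0.
public class RmsLevel {

    // Square each sample, average the squares, take the square root.
    static float rms(float[] block) {
        if (block.length == 0) return 0f;
        double sumSquares = 0;
        for (float s : block) sumSquares += (double) s * s;
        return (float) Math.sqrt(sumSquares / block.length);
    }

    public static void main(String[] args) {
        float[] square = {1f, -1f, 1f, -1f}; // full-scale square wave
        float[] quiet = new float[256];      // silence
        float[] sine = new float[1000];      // 10 full periods of a 0.5-amplitude sine
        for (int n = 0; n < sine.length; n++) {
            sine[n] = 0.5f * (float) Math.sin(2 * Math.PI * n / 100.0);
        }
        System.out.println(rms(square)); // 1.0
        System.out.println(rms(quiet));  // 0.0
        System.out.println(rms(sine));   // about 0.354 (amplitude divided by the square root of 2)
    }
}
```

Louder speech pushes the RMS value up and pauses pull it toward zero, which is what makes it a workable stand-in for jaw opening.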

If you want to get really tricky, you might explore the Fast Fourier Transform (FFT) function to grab specific parts of the audio waveform. It’s there if you want to give it a try. FFT analyzes the audio by frequency, so you can precisely tailor your output data to specific sounds. The Wee Little Talker board does a similar thing, although much of the work is done in hardware. Hedley said he was happy speaking with the amplitude function for now.
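As a rough illustration of frequency-based analysis, this Java sketch measures how much energy falls within a chosen frequency band. It swaps in a plain discrete Fourier transform for clarity (an FFT computes the same bins, only faster), and the test tone, sample rate and band limits are hypothetical.

```java
// Sketch: energy in a frequency band via a direct O(N^2) discrete Fourier transform.
// An FFT yields identical bins much faster; the explicit loop is kept for readability.
public class BandEnergy {

    // Sum of squared DFT magnitudes for bins whose frequency lies in [lowHz, highHz].
    static double bandEnergy(float[] x, float sampleRate, float lowHz, float highHz) {
        int n = x.length;
        double energy = 0;
        for (int k = 0; k <= n / 2; k++) {              // bins up to the Nyquist frequency
            double freq = (double) k * sampleRate / n;  // frequency of bin k
            if (freq < lowHz || freq > highHz) continue;
            double re = 0, im = 0;
            for (int t = 0; t < n; t++) {
                double angle = 2 * Math.PI * k * t / n;
                re += x[t] * Math.cos(angle);
                im -= x[t] * Math.sin(angle);
            }
            energy += re * re + im * im;
        }
        return energy;
    }

    public static void main(String[] args) {
        // Hypothetical test tone: 1 kHz sine sampled at 6.4 kHz, 64 samples (100 Hz per bin).
        float sampleRate = 6400f;
        float[] tone = new float[64];
        for (int t = 0; t < tone.length; t++) {
            tone[t] = (float) Math.sin(2 * Math.PI * 1000 * t / sampleRate);
        }
        System.out.printf("900-1100 Hz:  %.1f%n", bandEnergy(tone, sampleRate, 900, 1100));
        System.out.printf("2000-3000 Hz: %.1f%n", bandEnergy(tone, sampleRate, 2000, 3000));
    }
}
```

Comparing the energy in a few bands per frame, instead of one overall level, is one way to make the output react differently to different kinds of sounds.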

Here’s the Processing code.

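import processing.sound.*;
import processing.serial.*;

Serial myPort;            // serial link to the Arduino-jaw servo subsystem

SoundFile sample1;        // first canned response
SoundFile sample2;        // second canned response
Amplitude rms;            // RMS amplitude analyzer

float scale=8;

float smooth_factor=.2;   // 0..1; smaller values smooth more but react more slowly

float sum;                // smoothed amplitude reading
int valpos;
int sendpos;
int i = 0;
int execloop = 1;         // 1 = play sample1, 2 = play sample2, 0 = playback already started

public void setup() {
    size(640,360);
    // Open the serial port the Arduino is on (adjust the device name for your system)
    myPort = new Serial(this, "/dev/ttyUSB0", 115200);
}

public void draw() {
    background(125,255,125);
    noStroke();
    fill(255,0,150);

    // Load and start the selected file once, then attach the analyzer to it
    if (execloop == 1) {
         sample1 = new SoundFile(this, "/home/rob/hedleytheskull/intro-torq1.wav");
         sample1.rate(.45);   // playback-rate multiplier (1.0 = original speed)
         sample1.play();
         rms = new Amplitude(this);
         rms.input(sample1);
         execloop = 0;
    }
    else if (execloop == 2) {
         sample2 = new SoundFile(this, "/home/rob/hedleytheskull/intro-torq2.wav");
         sample2.rate(.45);
         sample2.play();
         rms = new Amplitude(this);
         rms.input(sample2);
         execloop = 0;
    }

    // Exponential smoothing of the current RMS reading
    sum += (rms.analyze() - sum) * smooth_factor;

    // Scale, map 30..85 onto 8..3 (louder audio = smaller value), then clamp to 0..9
    // so an out-of-range reading can never send something like "10" or "-2"
    int jaw = int(constrain(map(sum * 700, 30, 85, 8, 3), 0, 9));
    myPort.write(str(jaw));
    myPort.write('\n');       // a newline terminates each data point
}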

At the top are the usual library imports and variable declarations. The setup() function then opens the serial line, so we can send the resulting analysis data out to move the jaw. The real magic happens down in the draw() loop, where the selected audio file is played and examined by the Amplitude analyzer. The output is smoothed a bit with an exponential moving average, scaled, and then mapped onto the zero-through-nine range expected by the Arduino-jaw servo subsystem. Note that the mapping is inverted: scaled readings from 30 to 85 become values from 8 down to 3, so louder audio produces a smaller number. Each data point is terminated with a newline.
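To check the smoothing and mapping arithmetic without any hardware attached, the same steps can be pulled out into plain Java. Here pMap reproduces the formula behind Processing’s map() function; the clamp to 0..9 is an assumption based on the range the Arduino-jaw servo subsystem is described as expecting, and the sample readings are made up.

```java
// Sketch: the smoothing and mapping steps from draw(), runnable offline.
public class JawMapping {

    static final float SMOOTH_FACTOR = 0.2f;

    // Same formula as Processing's map(value, start1, stop1, start2, stop2).
    static float pMap(float value, float start1, float stop1, float start2, float stop2) {
        return start2 + (stop2 - start2) * ((value - start1) / (stop1 - start1));
    }

    // One step of exponential smoothing: move a fraction of the way toward the new reading.
    static float smooth(float previous, float reading) {
        return previous + (reading - previous) * SMOOTH_FACTOR;
    }

    // Scale by 700, map 30..85 onto 8..3 (louder = smaller), clamp to 0..9, truncate like int().
    static int jawValue(float smoothed) {
        float mapped = pMap(smoothed * 700, 30, 85, 8, 3);
        return (int) Math.max(0f, Math.min(9f, mapped));
    }

    // One serial message: the digit followed by a newline terminator.
    static String frame(int jaw) {
        return jaw + "\n";
    }

    public static void main(String[] args) {
        float sum = 0;
        float[] readings = {0.00f, 0.05f, 0.10f, 0.10f, 0.02f}; // hypothetical RMS values
        for (float r : readings) {
            sum = smooth(sum, r);
            System.out.print(frame(jawValue(sum)));
        }
    }
}
```

Silence scales to 0, which maps to roughly 10.7 before clamping; that is why the clamp matters if the receiving side expects a single digit.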

Notice that I used two different .wav files. I simply chose one or the other by setting the “execloop” variable to 1 or 2 before running the sketch; draw() resets it to 0 once playback starts, so the file is loaded and played only once. My plan is to put the audio response file names in a text file and then step through the list as Hedley and I talk. I’ll add a new push button to my wired slide clicker and use it as the trigger when I want to jump to the next phrase. A fake antique microphone will go on top of the clicker but won’t be connected to anything at this point. With a little practice, it should look like we are talking with each other.
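The planned phrase list could look something like this minimal Java sketch: one .wav file name per line in a text file, plus a next() call that a button handler would use to advance. The file format (blank lines and “#” comments skipped), the class name and the method names are all hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Sketch: step through a list of .wav file names, one per line.
// Hypothetical format: blank lines and lines starting with '#' are ignored.
public class PhraseList {
    private final List<String> files = new ArrayList<>();
    private int position = 0;

    PhraseList(List<String> lines) {
        for (String line : lines) {
            String trimmed = line.trim();
            if (!trimmed.isEmpty() && !trimmed.startsWith("#")) files.add(trimmed);
        }
    }

    // Read the list from a text file.
    static PhraseList load(Path listFile) throws IOException {
        return new PhraseList(Files.readAllLines(listFile));
    }

    boolean hasNext() { return position < files.size(); }

    // Next file name, or null once the script is finished.
    String next() { return hasNext() ? files.get(position++) : null; }

    int size() { return files.size(); }

    public static void main(String[] args) throws IOException {
        PhraseList script = args.length > 0
            ? load(Path.of(args[0]))
            : new PhraseList(List.of("# act one", "intro-torq1.wav", "", "intro-torq2.wav"));
        while (script.hasNext()) {
            System.out.println("next phrase: " + script.next());
        }
    }
}
```

In a Processing sketch, the name returned by next() would be passed to the SoundFile constructor in place of the hard-coded paths, and the button press would simply trigger the next load-and-play cycle.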

I’ve tested Hedley on several friends and family members and they thought his “talking skull” effect looked pretty cool and realistic.

Going Further

There is much to do as we get closer to the upcoming Embedded Systems Conference (ESC) at the end of October. Hedley is getting anxious to be back up on stage.

The .wav file list function is next, and then I’ll program the extra button to advance through the phrases. Hedley and I will also have to put together several conversations to use in our act. And, of course, there will be lots of rehearsal… so I remember my lines. Our shtick will need to be woven in with the slides.

By the way, we’ll be running the slides using LibreOffice from Hedley’s onboard Raspberry Pi 3. Maybe I should start calling him my “skulltop computer.” Talk about your case mod!

Feature image via Pixabay.

