With nearly 70 percent of Internet data projected to be in video format by next year, it’s clear that the task of extracting textual data from video will be critical for data engineers, and that the process will have to be automated. BlueChasm CTO Ryan VanAlstine and software developer Robert Rios demonstrated how to turn raw video into tagged data during the recent Watson Developer Conference in San Francisco.
Using a variety of open source tools and a simple algorithm, they are able to extract enough meaning from a video to summarize its content. The program can start automatically when a new video is submitted, taking the entire process out of human hands. The code is available on their blog.
Video is just a sequence of images, but sending every image through visual recognition is prohibitive in both cost and time. The key is to send a representative sample of the video. Picking one frame out of 30, the code sends the images to Watson’s visual recognition service, which returns tags for each image. The program then adds up all the tags to determine what the video is about.
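To put rough numbers on that saving, here is a back-of-the-envelope sketch in Python. The two-minute clip length and the 30-frames-per-second rate are assumptions chosen for illustration, not figures from the demo, and `sampled_frame_count` is a hypothetical helper name.

```python
def sampled_frame_count(duration_s: int, fps: int, every_n: int) -> int:
    """Return how many frames are kept when every `every_n`-th frame is sampled."""
    total_frames = duration_s * fps
    # Frames 0, every_n, 2 * every_n, ... are kept, hence the ceiling division.
    return -(-total_frames // every_n)


# Assumed example: a 2-minute clip at 30 frames per second, sampled 1 in 30.
total = 120 * 30
kept = sampled_frame_count(120, 30, 30)
print(f"{total} frames in the clip, {kept} sent for classification")
```

Under those assumptions, the sampling cuts 3,600 recognition calls down to 120.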
Rios dropped the video into FFmpeg, which processes the video, creates still images by picking one frame in 30, and saves those frames to a folder as JPEG images. The 1-in-30 ratio is an arbitrary number, he said, picked because he knew the video was slow-paced. If the video has a number of camera angles or lots of people, you may want to sample more often to get more frames.
He prefers FFmpeg over other tools because it gives him more flexibility with the video, allowing him to add timestamps and create metadata. In this case, FFmpeg creates still images at one frame per second, which is equivalent to one frame in 30 if the source runs at 30 frames per second, and loads the JPEGs into a folder.
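For readers who want to try the frame extraction step, the following Python sketch builds an FFmpeg command that keeps one frame in every 30. It is not BlueChasm’s code: the function name, file names, and output pattern are hypothetical, and the `select` filter is just one common way to sample frames.

```python
import shlex
from pathlib import Path


def build_frame_command(video: str, out_dir: str, every_n: int = 30) -> list:
    """Build an ffmpeg command that saves every `every_n`-th frame as a JPEG."""
    pattern = str(Path(out_dir) / "frame_%04d.jpg")  # hypothetical naming scheme
    return [
        "ffmpeg",
        "-i", video,
        # Keep a frame only when its index n is a multiple of every_n.
        "-vf", f"select='not(mod(n,{every_n}))'",
        # Emit only the selected frames instead of padding with duplicates.
        # Recent FFmpeg releases deprecate -vsync in favor of -fps_mode.
        "-vsync", "vfr",
        pattern,
    ]


cmd = build_frame_command("talk.mp4", "frames")
print(shlex.join(cmd))
```

To actually run it, create the output folder first and pass the list to `subprocess.run(cmd, check=True)`. Lowering `every_n` yields more frames for busier footage.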
The resulting JPEGs land in a newly created folder, which sets off a loop that sends each image to the classify endpoint of Watson’s visual recognition API.
The code that calls the classify endpoint does some error checking because sometimes the classifiers will come back empty or there is an error loading the tags, said Rios. Node sends the images asynchronously, which can sometimes cause errors when receiving results, so it’s good to do error checking before adding up the results. If an image comes back with an error, it’s marked unavailable.
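The loop and its error checking might look something like the sketch below. The demo itself ran in Node. This Python version is sequential and takes the recognition call as a parameter, so `classify` here is a stand-in for whatever client you use, not Watson’s actual API. The function and constant names are hypothetical.

```python
from pathlib import Path

UNAVAILABLE = "unavailable"


def classify_frames(frame_dir, classify):
    """Send every JPEG in `frame_dir` to `classify` and collect tags per image.

    `classify` is a caller-supplied function (an assumption of this sketch)
    that takes an image path and returns a list of tag strings.
    """
    results = {}
    for image in sorted(Path(frame_dir).glob("*.jpg")):
        try:
            tags = classify(image)
        except Exception:
            # Network or service error: treat the frame as having no result.
            tags = None
        # Empty classifier output is handled the same way as an error.
        results[image.name] = list(tags) if tags else UNAVAILABLE
    return results
```

Frames marked unavailable stay in the results, so the tallying step can skip them explicitly rather than silently losing them.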
The next step is to call the count method, which tallies up the tags to tell you what’s in the video.
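A tally of that kind can be sketched in a few lines of Python. This is an illustration rather than the presenters’ count method: the sample results are made up, and counting each tag once per frame is an assumption of this sketch.

```python
from collections import Counter

UNAVAILABLE = "unavailable"


def count_tags(results):
    """Tally how many frames each tag appears in, skipping failed frames."""
    counts = Counter()
    for tags in results.values():
        if tags == UNAVAILABLE:
            continue  # frame had an error or no classifier output
        counts.update(set(tags))  # assumption: count a tag once per frame
    return counts


# Made-up per-frame results for illustration.
results = {
    "frame_0001.jpg": ["car", "race track"],
    "frame_0002.jpg": ["car", "person"],
    "frame_0003.jpg": UNAVAILABLE,
    "frame_0004.jpg": ["car"],
}
print(count_tags(results).most_common(1))  # [('car', 3)]
```

The tags with the highest counts serve as the summary of the video.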
This process can be combined with audio processing to create more useful tags. For example, a video of a celebrity will just return the celebrity’s name; stripping out the audio and sending it to Watson’s Speech to Text API will help determine what the video is actually about.
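Stripping out the audio is another job FFmpeg can handle. The sketch below builds a command that drops the video stream and writes a WAV file. The 16 kHz mono format is an assumption (a common choice for speech services, not something stated in the talk), and the function and file names are hypothetical.

```python
import shlex


def build_audio_command(video: str, audio_out: str) -> list:
    """Build an ffmpeg command that extracts the audio track as a WAV file."""
    return [
        "ffmpeg",
        "-i", video,
        "-vn",                   # drop the video stream
        "-acodec", "pcm_s16le",  # 16-bit PCM, the usual WAV encoding
        "-ar", "16000",          # assumed sample rate: 16 kHz
        "-ac", "1",              # assumed channel count: mono
        audio_out,
    ]


cmd = build_audio_command("talk.mp4", "talk.wav")
print(shlex.join(cmd))
```

The resulting file can then be handed to whichever speech service you use for transcription.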
You can also send it through the Watson Tone Analyzer API, which returns the emotional content of the audio. That is useful for evaluating customer service responses or uploaded product reviews, among other applications.
Be warned, said VanAlstine, that facial recognition is more expensive than object recognition, so you want to separate the two. In the programs they deliver to their customers, it is typical to run the video through object recognition first and then, only if the video is mostly about people, send it through facial recognition. For example, you could have a video about a car race with two people on the sidelines. There’s no reason to send that video to facial recognition because the object recognition data shows the video is about a car race.
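One way to express that two-stage decision is a simple gate on the tag counts, sketched below in Python. The talk did not say how “mostly about people” is decided, so the set of people-related tags and the 50 percent threshold are both assumptions, and the names are hypothetical.

```python
from collections import Counter

# Hypothetical set of tags treated as "about people".
PEOPLE_TAGS = {"person", "people", "face"}


def should_run_face_recognition(tag_counts, threshold=0.5):
    """Return True when people-related tags make up at least `threshold` of all tags."""
    total = sum(tag_counts.values())
    if total == 0:
        return False  # nothing was recognized, so skip the expensive pass
    people = sum(n for tag, n in tag_counts.items() if tag in PEOPLE_TAGS)
    return people / total >= threshold


# Made-up tallies: a car race with a few bystanders, and an interview.
car_race = Counter({"car": 40, "race track": 30, "person": 2})
interview = Counter({"person": 50, "face": 20, "chair": 5})
print(should_run_face_recognition(car_race))   # False
print(should_run_face_recognition(interview))  # True
```

With a gate like this, the car race video never reaches the more expensive facial recognition pass.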
IBM is a sponsor of The New Stack.