Microsoft AI Records 5,000 Audiobooks for Project Gutenberg
October is National Book Month, and this Halloween there’ll be something new. In an evolutionary leap for the free ebook site Project Gutenberg, readers can now hear the tales of Edgar Allan Poe — or Frankenstein, or Shakespeare’s Macbeth with its spooky witches — magically read out loud by a 21st-century synthesized AI voice.
Researchers from Microsoft and MIT have teamed up with Project Gutenberg’s executive director Greg Newby to create 5,000 open-license audiobooks, roughly 35,000 hours of audio, read by a surprisingly human-sounding voice.
It’s a vast and varied collection containing both fiction and non-fiction — classic literature, plays, and even biographies. There’s something for everybody — from The Return of Sherlock Holmes by Sir Arthur Conan Doyle to The Return of Tarzan by Edgar Rice Burroughs. “We hope this contribution can provide value to both the research community, and the broader community of audiobook listeners,” the researchers wrote in a pre-print paper at arXiv.org. Titled “Large-Scale Automatic Audiobook Creation,” it argues that audiobooks “can dramatically improve a work of literature’s accessibility” — for the visually impaired, young children, and even new learners of a language.
And “Reactions have generally been positive,” Project Gutenberg’s executive director Greg Newby told us in an email interview. “Audiobooks are quite popular, even our older ones from 2004 that have relatively low quality. People appreciate having a variety of literary works available as audiobooks, and of course many of the new audiobooks that Microsoft made from Project Gutenberg texts were not otherwise available as audiobooks — they are not popular enough for major platforms.”
Newby remembers one negative reaction, from someone who called the whole endeavor “inappropriate” — taking a human work of literature and feeding it into an unfeeling machine for the sole purpose of then artificially mimicking both human voices and intonations. But “This seemed to me like a general reaction,” Newby says, “not from someone who was going to listen to any audiobooks or who had prior knowledge of Project Gutenberg.”
“From my point of view, the work they completed (with my input and collaboration) is excellent, and Project Gutenberg is in favor of any activities that make literature more accessible to a broader audience at little or no cost.
“The Microsoft effort certainly ticks those boxes.”
Excited about Tech Philanthropy
Their paper notes it can take hours of work to produce and publish an audiobook. Actor Stephen Fry has recounted his tribulations accurately recording the text of the Harry Potter series.
The process is also expensive. But more importantly, the paper points out that audiobooks with a synthesized voice have “historically suffered from the robotic nature of text-to-speech systems.” In an explanatory video from Microsoft Cloud, Newby says that there’s always been a high demand for audiobooks — but “What we discovered, though, is that we weren’t really good at it, and so we ended up abandoning audiobooks.
“Until Microsoft said, ‘Hey we have some new technology for automated text-to-speech production.’”
In a video on the official Microsoft Developer channel on YouTube, Brendan Walsh summarized their stack for the ambitious project. “Fortunately, we’ve developed some tools, and we used some open source tools online that make it way, way easier… Specifically, we use Synapse ML with Apache Spark on Azure Synapse Analytics to generate a bunch of audiobooks.”
The end result was “The Project Gutenberg Open Audiobook Collection” — made available on the major podcast and streaming platforms, and also available in a single .zip file for researchers.
In the video, Walsh described himself as “excited about working on tech philanthropy.”
And lead researcher Mark Hamilton just sounded happy to be saying that their tech will “make these audiobooks really sound like a human’s reading them, instead of a robot!”
How Does It Sound?
The audiobooks have their own pages on Spotify, Apple Podcasts, Google Podcasts, and the Internet Archive. “Thank you for listening to this free audiobook,” each recording begins, “created by Project Gutenberg and Microsoft AI.”
And yes, although lacking the effusive human warmth of Stephen Fry, the voices could still easily be mistaken for a human. But they’re not perfect. The AI knows how to read Roman numerals but gets confused by single-letter numerals like “I” and “V”. (So when reading Shakespeare’s Macbeth, it reads the designation of the first scene, Scene I, as “scene eye,” while the fifth scene becomes “scene vee.”) And when one of Macbeth’s witches talks about tormenting the sea captain who’s the “master o’ th’ Tiger” (presumably a ship named the Tiger), the AI just gives up and spells out the letters, saying “master O T H Tiger.”
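To make those stumbles concrete, here is a minimal Python sketch of the kind of text clean-up that could run before a script reaches a text-to-speech engine. It is not the researchers’ pipeline: the function names, the tiny `ELISIONS` table, and the rule of only converting numerals that follow “Act” or “Scene” are all assumptions made for illustration.

```python
import re

# Hypothetical lookup for archaic contractions; a real system would need far more entries.
ELISIONS = {"o'": "of", "th'": "the"}

ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

# Only treat capital letters as a numeral when they follow "Act" or "Scene",
# so the pronoun "I" elsewhere in the text is left alone.
HEADING = re.compile(r"\b(Act|Scene|ACT|SCENE)\s+([IVXLCDM]+)\b")


def roman_to_int(numeral: str) -> int:
    """Convert a Roman numeral such as 'XIV' to an integer."""
    total = 0
    for i, ch in enumerate(numeral):
        value = ROMAN_VALUES[ch]
        # Subtractive notation: a smaller value before a larger one is subtracted.
        if i + 1 < len(numeral) and value < ROMAN_VALUES[numeral[i + 1]]:
            total -= value
        else:
            total += value
    return total


def normalize_for_speech(text: str) -> str:
    """Rewrite scene headings and known elisions into forms that are easier to pronounce."""
    text = text.replace("\u2019", "'")  # treat curly apostrophes like straight ones
    text = HEADING.sub(lambda m: f"{m.group(1)} {roman_to_int(m.group(2))}", text)
    words = [ELISIONS.get(word.lower(), word) for word in text.split(" ")]
    return " ".join(words)


print(normalize_for_speech("Scene V"))
print(normalize_for_speech("master o' th' Tiger"))
```

With these rules, “Scene V” becomes “Scene 5” and “master o' th' Tiger” becomes “master of the Tiger.” The hard part in practice is coverage, since every unusual spelling in a 400-year-old play needs its own rule or a model that can infer one.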
Perhaps more disappointing is how it reads every part in the exact same voice. Macbeth and Lady Macbeth are the same male narrator, as are the three witches, Banquo, and King Duncan. Newby says he’s heard that feedback as well. “Someone else commented that there don’t seem to be any female-sounding voices, and asking why not. I’ve passed that comment to Microsoft, and agree there should be a variety of voices.”
The researchers’ paper also talks about their work on an “automatic speaker and emotion inference system” which would scan the context of passages and then “dynamically change the reading voice and tone” to make dialogue “more life-like and engaging”, even predicting the appropriate emotion to use in their dialogue. (In 2020 some of the same researchers had worked on a more natural-sounding text-to-speech system, first building a “spontaneous conversational speech corpus” for training, and then equipping their system with a “conversational context encoder” for selecting the appropriate tone for responses.)
Looking to the future, Newby says that “Eventually it would be great if people could select their own preferences for voice, speed, etc. and get an audiobook made just for them!” Newby says he has seen a demo of Microsoft’s technology which does swap in different voices for different characters, but unfortunately this feature “didn’t make it into the current collection.”
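For a rough sense of what swapping voices per character could look like, the sketch below takes lines of dialogue that have already been attributed to a speaker and wraps each one in an SSML `<voice>` element, the W3C markup many speech engines accept. Everything here is illustrative rather than Microsoft’s method: the voice names are placeholders, the helper functions are hypothetical, and the genuinely difficult step (inferring who is speaking, and with what emotion) is assumed to have happened upstream.

```python
from itertools import cycle
from xml.sax.saxutils import escape, quoteattr

# Placeholder voice names; a real engine publishes its own list of voices.
VOICES = ["example-voice-1", "example-voice-2", "example-voice-3"]


def assign_voices(speakers, voices=VOICES):
    """Give each distinct speaker a voice, reusing voices if there are more speakers than voices."""
    pool = cycle(voices)
    assignment = {}
    for speaker in speakers:
        if speaker not in assignment:
            assignment[speaker] = next(pool)
    return assignment


def to_ssml(lines, voices=VOICES):
    """Turn (speaker, text) pairs into one SSML document with a <voice> element per line."""
    assignment = assign_voices([speaker for speaker, _ in lines], voices)
    parts = [
        '<speak version="1.0" '
        'xmlns="http://www.w3.org/2001/10/synthesis" xml:lang="en-US">'
    ]
    for speaker, text in lines:
        # quoteattr adds the surrounding quotes; escape protects &, < and > in the text.
        parts.append(f"  <voice name={quoteattr(assignment[speaker])}>{escape(text)}</voice>")
    parts.append("</speak>")
    return "\n".join(parts)


dialogue = [
    ("MACBETH", "So foul and fair a day I have not seen."),
    ("BANQUO", "How far is't called to Forres?"),
    ("MACBETH", "Speak, if you can. What are you?"),
]
print(to_ssml(dialogue))
```

Because the speaker-to-voice mapping is just a dictionary, a listener’s own preferences could in principle be swapped in at this step, which is the kind of personalization Newby describes.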
This is the first time I’ve heard AI audio narration referred to as synthetic speech…
— Terri Nakamura (@terrinakamura) September 22, 2023
The Shape of Things to Come
The project’s lead researcher even told Popular Science they hope to create free audiobooks for all 60,000 of the ebooks available on Project Gutenberg, possibly even translating them into different languages. “We’ll see if we can scale this out,” Hamilton said in the YouTube interview on the Microsoft Developer channel.
And their paper also talks of a demonstration app that “allows conference attendees to create a custom audiobook, read aloud in their own voice, using only a few seconds of example sound.” In essence, the system “clones” each participant’s voice using a speedy technique known as “zero-shot text-to-speech.” (Although attendees will also have the option of just selecting another pre-synthesized voice.) The attendees will no doubt also be amazed that the audiobook is generated in just a few seconds. In a video on YouTube, lead researcher Mark Hamilton creates an audiobook of Alice in Wonderland in 15 seconds.
And then users can even create a custom dedication, which the AI-speaking-in-their-voice will read before the text of the ebook. “Once the pipeline finishes we will email the user a link to download their custom-made audiobook.”
“The great thing about the work that Microsoft completed is that not only are the books completely free, so is the software. This could be leveraged by others interested in pursuing their own enhancements or in just using the software as it currently exists.”