Making Technology Understand the Spoken Word

Conference aims to go beyond speech recognition

5 October 2012

By now we’re used to devices that speak and even listen. But coming up with machines that understand remains a work in progress. After all, researchers are still learning how people comprehend the words they hear, unraveling slang, casual grammar, accents, and a web of unspoken context. Human speakers also convey meaning through tone of voice, stress on syllables, rising or falling inflections, rhythm shifts, and more.

To discuss how technical systems can comprehend speech and be made to recognize it better, researchers plan to meet from 2 to 5 December in Miami at the IEEE Workshop on Spoken Language Technology, sponsored by the IEEE Signal Processing Society.

Understanding spoken language goes beyond speech recognition to include the ability to summarize, accurately translate into other languages, and extract information from speech, says IEEE Member Yang Liu, the workshop’s cochair. “Speech recognition only aims to transcribe/recognize speech,” Liu says, “whereas spoken-language processing connects the transcript with language processing/understanding tasks, i.e., what people do with language. Spoken-language processing is also different from traditional natural-language processing, which is more text-based.

“For example, voice search gives users the ability to speak their search queries into a browser instead of typing them,” Liu adds. “Another example is spoken-document retrieval: Instead of looking only at text-based pages, as traditional search engines do, we want them to search voice recordings, too. These two examples connect speech recognition—transcribing the spoken queries or speech recordings—with traditional text-based search.”
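The pipeline Liu describes can be pictured as two stages chained together: a recognizer turns each recording into a transcript, and ordinary text retrieval then runs over those transcripts. The sketch below is only an illustration of that idea; the `transcribe` function and the hard-coded transcripts stand in for real recognizer output, and the search is a naive keyword match rather than a production ranking method.

```python
def transcribe(recording_id, fake_asr_output):
    # Placeholder for a real speech-recognition step: in practice this
    # would run an ASR system over the audio of recording_id.
    return fake_asr_output[recording_id]

def search(query, transcripts):
    # Naive keyword search over transcripts: rank documents by how many
    # query terms appear in each one, dropping documents with no matches.
    terms = query.lower().split()
    scores = {
        doc_id: sum(term in text.lower() for term in terms)
        for doc_id, text in transcripts.items()
    }
    return sorted((d for d, s in scores.items() if s > 0),
                  key=lambda d: -scores[d])

# Hypothetical recognizer output for two recorded lectures.
fake_asr_output = {
    "lecture1": "today we cover hidden markov models for speech recognition",
    "lecture2": "this talk surveys machine translation of spoken language",
}
transcripts = {d: transcribe(d, fake_asr_output) for d in fake_asr_output}
print(search("speech recognition", transcripts))  # → ['lecture1']
```

The point of the two-stage design is that once recordings are reduced to text, any existing text-search machinery applies unchanged; the hard problems move into the transcription step.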

There are many potential applications. For people traveling in a foreign country, a cellphone could translate their spoken words to the native language. Classroom lectures could be summarized, mining information from hours or even years of speech. Robots could interact with people more naturally. Infants’ cries could be analyzed. Devices could assist the blind, deaf, or paralyzed.

About 85 papers and demonstrations are scheduled, all in a single track of oral and poster presentations so that no one need miss anything. Topics to be covered include spoken-document summarization and retrieval, spoken dialog systems, human/computer interaction, speaker/language recognition, machine translation of speech, and applications in such areas as education, health care, and assistive technologies.

Four keynote addresses are on the program. IEEE Senior Member Larry Heck, chief scientist of Microsoft’s Conversational Systems Lab, in Mountain View, Calif., is scheduled to discuss “The Conversational Web,” covering how to create a human-computer interface that combines spoken and written natural language with gesture, touch, and gaze. Member Kevin Knight, a senior research scientist and Fellow at the University of Southern California’s Information Sciences Institute, in Marina del Rey, Calif., is set to speak on advanced techniques in machine translation. Senior Member Andrew Senior, a research scientist at Google in New York City, will discuss the use of deep neural networks for large-vocabulary speech recognition in real time. And Lillian Lee, a computer science professor at Cornell University, in Ithaca, N.Y., is to focus on the effect that language has on people, as well as the effect that people have on language.

Three tutorials are planned—one on language modeling, one on extracting information from speech when it is not explicitly expressed, and the third on spoken dialog systems such as the iPhone’s Siri. The sessions are intended not just for students and others new to the field “but also for more expert people who want to renew their knowledge,” says Senior Member Dilek Hakkani-Tür of the conference’s advisory board. “The tutorials are overviews of what each subject is about and the challenges involved,” she says. The tutorials are free for workshop registrants; no separate registration is needed.

“The conference is smaller than some others in the speech technology field, leading to more discussions among attendees and more interactive discussions with the speakers,” Liu says. “And with no parallel sessions, you can catch any paper you want.”

Adds Hakkani-Tür: “It’s a small venue but very focused. There’s time to talk with others working on the same things as you are. And there’s a lot of interaction at meals.”
