Microsoft Researchers Reach Human Parity in Conversational Speech Recognition

Microsoft researchers have set a world record for speech recognition, using a technology the company announced this week together with GPU-accelerated deep learning to recognize words in a conversation as well as a person does. Microsoft's team described how they achieved an error rate of 5.9 percent, the lowest ever for machine transcription and about as accurate as people who transcribed the same conversation. It’s also a 6 percent improvement over a record Microsoft set only a month ago.
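Error rates like this are usually reported as word error rate: the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that calculation under simple assumptions (whitespace tokenization, no text normalization). The function name `word_error_rate` is an illustrative choice, and this is not the scoring code used by the researchers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # Made-up sentences for illustration only.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

By this measure, an error rate of 5.9 percent corresponds to roughly six word errors for every 100 words in the reference transcript.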

"We’ve reached human parity," said Xuedong Huang, the company’s chief speech scientist and co-author of a paper published this week. "This is an historic achievement."

Conversational speech poses some of the biggest challenges to speech recognition, said Geoffrey Zweig, who manages the Speech & Dialog research group at Microsoft.

"Speech recognition gets hard when people are talking informally, when they get excited, when they make mistakes and correct themselves, when they change topics. All of these are characteristics of conversational speech," he said.

The researchers credit their breakthrough in conversational speech recognition to deep learning, in particular the systematic use of convolutional and recurrent neural networks. In their latest work, the team applied a type of recurrent neural network called Long Short-Term Memory (LSTM) to the language model.

LSTM networks have the advantage of being able to "remember" information for a longer period of time, so they are sensitive to more words than most neural network language models are.
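The "remembering" comes from a cell state that is carried from one word to the next and controlled by gates. The Python sketch below shows one step of a standard LSTM cell reduced to scalar values, to illustrate how a forget gate close to 1 preserves the cell state over many steps. All names (`lstm_step`, `make_params`, `carry_memory`) and all parameter values are hypothetical, and the sketch says nothing about the architecture or weights of Microsoft's actual language model.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, p):
    """One step of a scalar LSTM cell with hypothetical parameters p."""
    f = sigmoid(p["wf"] * x + p["uf"] * h_prev + p["bf"])    # forget gate
    i = sigmoid(p["wi"] * x + p["ui"] * h_prev + p["bi"])    # input gate
    o = sigmoid(p["wo"] * x + p["uo"] * h_prev + p["bo"])    # output gate
    g = math.tanh(p["wg"] * x + p["ug"] * h_prev + p["bg"])  # candidate value
    c = f * c_prev + i * g   # keep part of the old memory, add part of the new
    h = o * math.tanh(c)     # expose part of the memory as output
    return h, c


def make_params(forget_bias: float, input_bias: float) -> dict:
    """All weights zero; only the forget and input gate biases are set."""
    names = ["wf", "uf", "bf", "wi", "ui", "bi",
             "wo", "uo", "bo", "wg", "ug", "bg"]
    params = {name: 0.0 for name in names}
    params["bf"] = forget_bias
    params["bi"] = input_bias
    return params


def carry_memory(params: dict, c0: float, steps: int) -> float:
    """Feed zero inputs for `steps` steps and return the final cell state."""
    h, c = 0.0, c0
    for _ in range(steps):
        h, c = lstm_step(0.0, h, c, params)
    return c


if __name__ == "__main__":
    remembering = make_params(forget_bias=10.0, input_bias=-10.0)
    forgetting = make_params(forget_bias=-10.0, input_bias=-10.0)
    print(carry_memory(remembering, c0=1.0, steps=20))
    print(carry_memory(forgetting, c0=1.0, steps=1))
```

With the forget gate held near 1, the stored value survives 20 steps almost unchanged; with the gate near 0, it is wiped out after a single step. In a trained network the gates are learned, so the model itself decides what to keep.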

Microsoft’s Cognitive Toolkit (previously known as CNTK), an open source deep learning framework, played a key role in reaching human parity for conversational speech recognition. The Cognitive Toolkit, which Microsoft announced this week, is a deep learning system that runs on GPUs and is used to speed advances in areas such as speech recognition, image recognition and search relevance.

Zweig said that by using Nvidia's Tesla M40 GPUs, researchers reduced the training time for some language models from months to weeks. "That makes all the difference because the rate of progress we can make is linked to the number of experiments we can run," he said.

More work needs to be done to improve speech recognition in real-life settings like parties or city streets, where there may be music, traffic, people talking and other types of background noise. Researchers are also improving conversational speech recognition for meetings, where there are often multiple speakers seated at different distances from a microphone.

Zweig said the research milestone means the company has the right tools to quickly deploy a new generation of improved speech recognition in its Cortana personal digital assistant, Xbox gaming console and other products.

Their long-term goal is to move from speech recognition to understanding, he said. This would make it possible for devices to answer questions or take actions based on what they’re told.