Baidu researchers beef up the search giant's speech recognition savvy

Chinese search giant Baidu foresees a time when half of all search queries come in the form of pictures and people's voices. Baidu doesn't want to disappoint users when that time comes.

Good thing, then, that company researchers have been making strides in understanding voice queries. Today Baidu is issuing an academic paper documenting the performance of complex computer systems that it says can recognize speech in noisy environments more successfully than systems from Apple, Google, and Microsoft, the giants of mobile computing.

"Honestly, I think we're entering the era of speech 2.0," Andrew Ng, Baidu's recently appointed chief scientist, told VentureBeat in an interview.

Speech factors into a wide variety of existing Baidu applications, including voice search, navigation, a voice assistant, and dictation. So presumably, the research could lead to noticeable performance improvements for these applications in situations where mobile devices have traditionally had trouble correctly understanding what a person is saying, like when someone is walking down a busy street or driving with the phone in the passenger seat. But at this point, Baidu merely wants to be public about its research achievements.

The technology enabling all this advancement falls under a trendy label, deep learning, which is a type of artificial intelligence. Deep learning entails training systems called artificial neural networks on lots of information derived from audio, images, or other inputs, and then presenting the systems with new information and receiving inferences about it in response. Speech recognition is a popular use of deep learning. Ng is a leading figure in the deep learning field, along with the likes of Facebook's Yann LeCun and part-time Googler Geoff Hinton.

Ng, who worked on the Google Brain project before joining Baidu, sought to attain new heights of accuracy in speech recognition in part by bringing a seriously large amount of data to the table. Baidu began with more than 7,000 hours of people speaking, and it ultimately assembled more than 100,000 hours of what Ng described as "synthetic data": clear recordings of people speaking, overlaid with background noises such as dishes being washed or the buzz inside a restaurant. That's how Baidu obtained solid samples of noisy speech, and how the company was able to develop systems that could guess intelligently what people were saying despite the background noise.
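For readers curious what building such "synthetic data" might look like in practice, here is a minimal Python sketch of overlaying a noise clip on a clean recording. It is an illustration only, not Baidu's method: the article does not say how the company mixed its audio, so the function names (`rms`, `mix_at_snr`), the plain-list audio format, and the idea of scaling the noise to a chosen signal-to-noise ratio (SNR) are all assumptions.

```python
import math


def rms(samples):
    """Root-mean-square amplitude of a sequence of float samples."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def mix_at_snr(speech, noise, snr_db):
    """Overlay noise on speech so the result has roughly the given SNR in dB.

    Hypothetical helper: both inputs are lists of float samples at the
    same sample rate. Returns a new list as long as `speech`.
    """
    speech = list(speech)
    noise = list(noise)
    if not speech or not noise:
        raise ValueError("speech and noise must be non-empty")

    # Loop the noise clip so it covers the whole utterance, then trim it.
    repeats = len(speech) // len(noise) + 1
    noise = (noise * repeats)[:len(speech)]

    noise_rms = rms(noise)
    if noise_rms == 0:
        return speech  # silent noise clip: nothing to add

    # SNR(dB) = 20 * log10(rms_speech / rms_noise); solve for the noise level.
    target_noise_rms = rms(speech) / (10 ** (snr_db / 20))
    gain = target_noise_rms / noise_rms

    return [s + gain * n for s, n in zip(speech, noise)]
```

Mixing one clean recording with many different noise clips at several SNR values would multiply the hours of training audio, which is one plausible way a data set could grow from thousands of hours to many more.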

The training of the neural networks has depended heavily, Ng said, on "giant GPU machines," or servers in Baidu data centers packing graphics cards that have traditionally been used for gaming, among other applications. In time, the speech recognition could be performed on smartphones and other connected devices. For now, though, people will have to wait.

"I'm excited to move this into production, but we haven't done that," Ng said.
