
ASRU 2013 Day 3: Applications

Compared to the previous two days, the talks on day 3 went into less detail. Several of the speakers did not want to divulge specifics that could potentially be used by competitors.

“The Growing Role of Speech in Google Products” presented by Pedro Moreno

Pedro leads the Google speech recognition team based in New York—Google also has another speech recognition team in Mountain View. The talk was quite interesting because it gave a peek behind the curtain. The team was formed in 2005, and by 2006 they had an ASR system built from the ground up. They have basically added a new application every year since then.

Every day they process approximately 5-10 years' worth of audio data. None of this data is ever manually transcribed for training. The only manual transcription done is for testing purposes (about half a million utterances per month). Given the amount of data they have access to, a large focus of their effort is on data selection (sampling) based on transcription confidence and other metrics. It is also important to flatten the distribution of the data: you do not want to train only on data that is commonly seen, since rare triphones must be modeled too.
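As a rough illustration of the kind of selection policy described above, here is a minimal Python sketch that keeps only confidently transcribed utterances and caps how often each triphone is represented. The field names and thresholds are assumptions made for illustration, not anything from Google's actual pipeline.

```python
import random
from collections import Counter

def select_training_data(utterances, min_confidence=0.9, max_per_triphone=1000):
    """Toy confidence-based selection with distribution flattening.

    `utterances` is assumed to be a list of dicts with hypothetical fields:
    'hypothesis' (automatic transcript), 'confidence' (a 0-1 score from the
    recognizer), and 'triphones' (the triphones covered by the hypothesis).
    """
    counts = Counter()          # how many selected utterances cover each triphone
    selected = []
    random.shuffle(utterances)  # avoid ordering bias before the caps kick in
    for utt in utterances:
        # Keep only automatic transcripts the recognizer is confident about.
        if utt["confidence"] < min_confidence or not utt["triphones"]:
            continue
        # Flatten the distribution: skip the utterance if even its rarest
        # triphone is already well represented, so common contexts do not
        # crowd out rare ones.
        if min(counts[t] for t in utt["triphones"]) >= max_per_triphone:
            continue
        selected.append(utt)
        counts.update(utt["triphones"])
    return selected
```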

Due to the unsupervised nature of their training, it is important to identify words that can cause a feedback problem. Internally they are referred to as “PP” words. In Korean, there is a word with a phonetic transcription of /p p/. Unfortunately, it sounds very similar to the noise of wind blowing across the microphone. Noise was being transcribed as /p p/, which was only reinforced during retraining, creating a vicious cycle.

Their current production cycle retrains all acoustic models every 4 weeks, but they would like to reduce that to a weekly cycle. This lets the models track changes in both hardware and user behavior. Each model is typically trained on 4000 hours of audio, all transcribed in an unsupervised manner. The accuracy of the training transcripts is improved by recognizing the audio with a heavier-duty model not used in production.

The entire process is nearly completely automated. If a new technique from research seems promising, it is encoded directly into the pipeline, and models are then built with the new technique during training. If the new technique improves performance (checked automatically), it is used in production. Metrics are important; they do not use word error rate alone because the long tail is very important to them. One approach to testing is to show side-by-side transcripts to users and see which one they prefer. They also measure click-throughs, corrections, and retention.
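To make the "automatically checked" step concrete, a promotion gate might look something like the toy function below. The metrics and thresholds are my own illustrative assumptions, not Google's criteria.

```python
def should_promote(baseline, candidate, min_rel_wer_gain=0.01):
    """Toy promotion gate for a candidate model.

    `baseline` and `candidate` are assumed to be dicts of evaluation results,
    e.g. {"wer": 0.12, "tail_wer": 0.30, "side_by_side_wins": 0.55}. The rule:
    require a minimum relative WER gain, no regression on the long-tail test
    set, and at least parity in side-by-side user preference.
    """
    rel_gain = (baseline["wer"] - candidate["wer"]) / baseline["wer"]
    return (
        rel_gain >= min_rel_wer_gain
        and candidate["tail_wer"] <= baseline["tail_wer"]
        and candidate["side_by_side_wins"] >= 0.5
    )
```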

Since 2011 they have been adding languages other than English to their pipeline. At this point, they have production systems for approximately 60 languages. Their speed is extremely impressive. Basically, they can send analysts into a country to collect data and have a production system a week later.

He briefly talked about Google’s goals for the future, at least in terms of ASR. They want to make the transition from pure transcription to conversation. Eventually they see Google services becoming something like a personal assistant. Of course this will require huge advances in understanding and semantic analysis.

“Utilization of ASRU Technology: Past and Future” presented by Joseph Olive

For this talk, ASRU Technology refers to any automatic speech recognition and understanding technologies, not just research that came from this workshop. Joe used to be a program manager at DARPA. He described his post-DARPA life as, “there is life after DARPA, it is just not as good.” He was also a manager at Bell Labs for many years before that. Or as he says, “I used to work for the telephone company.” Talk about an understatement.

He discussed several very early applications of speech technologies. His description of some of the first call routing systems did not sound all that different from current applications used by banks. He also showed a video—from what looked like the seventies—demonstrating what was possibly the first real-time speech-to-speech translation system.

During his brief mention of text-to-speech (TTS) systems, he discussed their evaluation. The standard evaluation metric is listener preference: listeners are either asked to rate the quality of each sentence or to state a preference between two examples. Joe instead called for comprehension to be used as the metric rather than quality: have listeners hear a paragraph or two of text and then test their comprehension. I think this is related to the concept of cognitive load; a poor synthesis requires more attention from a listener to fully understand.

In general, this talk contrasted with the previous Google talk—there were even a few subtle digs at Google. Instead of optimism and progress, Joe presented a view that the field has not evolved as much as some of us may like to think. WER has improved, but very little progress has been made in understanding.

Finally, Joe called for research labs—both academic and industrial—to share data and collaborate. Based on the discussions that followed later in the day, I see little hope of that.

“Calibration of Binary and Multiclass Probabilistic Classifiers in Automatic Speaker and Language Recognition” presented by Niko Brummer

Calibration can be thought of as the “goodness of soft decisions”. Classifiers typically give us the posterior probability of some class given the data. Often the actual values of the posteriors are irrelevant; we only care which class has the highest probability. Niko stressed that more attention should be paid to the values of these posteriors.

While a system may be accurate, it can also be poorly calibrated. He described a method for calibration that basically boiled down to cross entropy. Obviously researchers working in the field are familiar with cross entropy, but regularization is also typically used during training. The regularization improves results by preventing overfitting, but it also degrades the calibration. One solution he presented was to apply a linear regression to the final system to recalibrate it.
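As a concrete, minimal sketch of the two ideas above—cross entropy as a measure of calibration, and a simple affine recalibration of the raw scores fit on held-out data—here is some illustrative Python. It shows the general approach only; it is not Niko's exact recipe or toolkit.

```python
import numpy as np

def cross_entropy(posteriors, labels):
    """Average binary cross entropy (log loss). A system that is accurate but
    miscalibrated scores worse here than one whose posteriors are trustworthy."""
    eps = 1e-12
    p = np.clip(posteriors, eps, 1 - eps)
    return -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))

def fit_affine_calibration(scores, labels, lr=0.1, steps=5000):
    """Fit scale a and offset b so that sigmoid(a * score + b) minimizes cross
    entropy on held-out data, recalibrating raw scores (e.g. log-likelihood
    ratios) without changing which class is ranked highest when a > 0."""
    a, b = 1.0, 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad = p - labels                    # d(cross entropy)/d(logit)
        a -= lr * np.mean(grad * scores)
        b -= lr * np.mean(grad)
    return a, b
```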

Niko is mostly familiar with speaker and language recognition. When asked about applying his ideas to speech recognition, he stated he would have to first familiarize himself with the literature. However, he did say that calibration for the keyword spotting task is an interesting and difficult problem since the metric—term weighted value (TWV)—uses information hidden at training time.

“Speech Technologies for Data Mining, Voice Analytics, and Voice Biometry” presented by Petr Schwarz

Basically, this talk described the types of solutions his company Phonexia provides. One use case I had not previously thought of was quality control for call centers. In general, he stressed that providing data-mining tools on top of transcription is very important: it is easier to sell a recognizer with a significant error rate, as all recognizers have, if higher-level analysis tools are included.

Petr also stressed the importance of voice activity detection—and other types of event detection for that matter. While it may not be the type of research that generates the most excitement at conferences, it is crucial for speech applications.

For many of their clients, data privacy and secrecy are paramount. Phonexia therefore gives clients the ability to train systems onsite using its tools. Without this ability, many of their customers would not consider purchasing their products.

“From the Lab to the Living Room: The Challenges of Building Speech-Driven Applications for Children” presented by Brian Langer

Brian works for an entertainment company called ToyTalk. They have an iPad app called “The Winston Show” aimed at children in the 6-10 age range. It seems to be like an interactive television show in which the children participate through speech. Dealing with child speech presents a host of challenges, both because of the acoustic environments and because of the characteristics of children's voices.

Their app is certainly a well-crafted product. They must have very good script writers and animators on staff. However, the interactivity is still highly constrained. In terms of speech processing and interaction, they are still clearly in the early stages. I wish them luck in improving processing and interaction. It is an exciting application and one that could potentially be very beneficial.

“Augmenting Conversations with a Speech Understanding Anticipatory Engine” presented by Marsal Gavalda

The final talk was another researcher presenting and discussing a product (the entire talk was presented on an iPad; first time I have seen that). His iPad app is called MindMeld. At its most basic, the application is a chat program. Multiple people can connect for discussions or meetings. What the app does during the conversation is what sets it apart.

Whenever someone speaks, an ASR engine running in the background transcribes the speech, which can be saved in a history and later searched. Based on the discussion, MindMeld automatically performs internet searches to bring up information that could be relevant to the discussion. If you see something interesting, it can be immediately shared with the other participants in the meeting.

At the moment, I’m not sure how sophisticated the application is. He discussed that an important problem is disambiguating the user input. They hope to use (are using?) contextual information (calendars, location, time, weather, etc.) to handle this. Marsal believes that knowledge graphs could potentially improve the ability of machines to understand conversations. The contextual information would act as additional constraints when searching the knowledge graph for the relevant information.
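Purely to illustrate what "context as constraints" might mean in practice, here is a toy ranking sketch. The data structures and scoring are invented for this example and say nothing about how MindMeld is actually built.

```python
def rank_candidates(candidates, context):
    """Toy illustration of using context as soft constraints when resolving an
    ambiguous mention against a knowledge graph. Both the candidate structure
    and the scoring are invented for illustration."""
    def score(candidate):
        s = candidate["prior"]  # how common this interpretation is overall
        # Boost interpretations consistent with where and what the conversation is about.
        if candidate.get("location") == context.get("location"):
            s += 1.0
        if candidate.get("topic") in context.get("recent_topics", []):
            s += 0.5
        return s
    return sorted(candidates, key=score, reverse=True)

# e.g. "Giants" mentioned during a chat in San Francisco about baseball more
# likely refers to the baseball team than the football team from New York.
candidates = [
    {"name": "San Francisco Giants", "prior": 0.4, "location": "San Francisco", "topic": "baseball"},
    {"name": "New York Giants", "prior": 0.5, "location": "New York", "topic": "football"},
]
context = {"location": "San Francisco", "recent_topics": ["baseball"]}
print(rank_candidates(candidates, context)[0]["name"])  # San Francisco Giants
```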

Final Panel Discussion

All of the speakers were called to the front to participate in a Q&A with the audience. Researchers in industry outlined some areas of research that could potentially be useful to industry. Unsupervised methods for training systems are of major importance; large-scale training with supervised data is not something commonly done in industry. Recognizing distant speech is also an area they view as important; at the very least, being able to recognize when a speaker is in a non-optimal condition would be very useful. They also stated that work in low-resource ASR is unlikely to be relevant to industry: the constraints do not match what is seen in the real world and seem self-imposed.

There was a long back and forth between academics and industry people (mostly Pedro). The academics were calling for companies to share their data for research. Those working in industry labs kept saying this was very unlikely, for several reasons. It is one thing for Google to collect data for internal use; it is another to openly share it with the world. Things like that can be a public relations and legal nightmare. In addition, data is a competitive advantage for many companies. Why would they openly share it with competitors?

Finally, Joe Olive spoke again about understanding. Even at a conference with understanding in the name, very little work seems to focus on it. He stressed that it is a major hurdle that needs to be addressed and decreasing WER will not solve it. I liked his closing quote, “No amount of data will solve the black hole of semantics.”

Comments? Send me an email.