At its I/O 2019 developer conference this week, Google showed off Live Caption, an Android Q feature that provides real-time continuous speech transcription. The company touted Live Caption as able to caption any media on your phone. But it turns out that “your phone” can’t be just any Android Q phone. “Live Caption is coming to select phones running Android Q later this year,” a Google spokesperson confirmed.
“It’s not going to be on all devices,” Brian Kemler, Android accessibility product manager, told VentureBeat. “It’s only going to be on some select, higher-end devices. This requires a lot of memory and space to run. In the beginning it will be limited, but we’ll roll it out over time.” As we get closer to Android Q’s launch, Google plans to release a list of sanctioned devices that will offer Live Caption.
This wasn’t clear from Google’s keynote or any of the ensuing coverage. The pitch was that this great on-device machine learning feature was coming in the latest Android release, for everyone to use.
“We believe technology can be more inclusive. And AI is providing us with new tools to dramatically improve experiences for people with disabilities,” Google CEO Sundar Pichai said onstage before showing off Live Caption and Google’s three new accessibility projects. Afterwards, he added: “You can imagine all the use cases for the broader community too. For example, the ability to watch any video if you’re in a meeting or on the subway, without disturbing the people around you.”
Live Caption works with songs, audio recordings, podcasts, and so on. The feature captions any content that you’re streaming, that you’ve downloaded, or even that you recorded yourself. It doesn’t matter if it’s from a first-party app or a third-party app — if your phone can play it, your phone can caption it. That also includes games, though Kemler has not tried it with Stadia yet.
On device vs. in the cloud
To use Live Caption, you hit one of your phone’s volume buttons and then tap the software icon when the volume UI pops up. Turn it on with a single tap, and as soon as speech is detected, captions will appear on your phone screen. You can double-tap to show more and drag the captions to anywhere on your screen. Kemler explained that Google made Live Caption a movable overlay because it’s not easy for Android to predict where the content will be or what else the user may want to do as they’re reading.
When you enable Live Caption for the first time, Google plans to show a banner explaining the feature.
“Hey, this is what it does. This is what it doesn’t do. Because we took this cloud-based model that was over 100GB and shrank it down to less than 100MB to fit on the device, it’s not going to be quite as perfect or accurate,” Kemler explained. “Not that cloud transcription is perfectly accurate, but it’s going to be a little bit better. But [Live Caption is enough] for apps where that caption content is not available, which remember is the vast majority of user-generated content. Which also, remember, is the vast majority of content. Even if you took YouTube, that’s 400 hours uploaded every minute, and then think of Facebook, Instagram, Snap, all podcasts, etc. Unlike TV and film, which by law are required to have captions, user-generated content doesn’t have it.”
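For a rough sense of scale, the figures Kemler cites work out to a reduction of at least three orders of magnitude. Here is a back-of-the-envelope check in Python. It treats "over 100GB" and "less than 100MB" as the round numbers 100GB and 100MB and uses decimal units; both are assumptions, since exact model sizes were not given.

```python
# Back-of-the-envelope check of the model shrink Kemler describes.
# Assumption: round figures and decimal units (1 GB = 1000^3 bytes).
GB = 1000 ** 3
MB = 1000 ** 2

cloud_model_bytes = 100 * GB      # "over 100GB", so this is a lower bound
on_device_model_bytes = 100 * MB  # "less than 100MB", so this is an upper bound

# Because the numerator is a floor and the denominator a ceiling,
# the real reduction is larger than this figure.
shrink_factor = cloud_model_bytes / on_device_model_bytes

print(f"On-device model is at least {shrink_factor:,.0f}x smaller")
```

Since the cloud model is larger than 100GB and the on-device model is smaller than 100MB, the true ratio is higher than the one computed here.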
Kemler let me play with the feature on a Pixel 3a, and it did indeed work as described. There is no separate app required, no need for a Wi-Fi or data connection, and no perceptible delay. He wouldn’t provide a word error rate target or range for Live Caption, but it’s clearly low enough for Google to confidently include the feature in Android Q.
Live Caption doesn’t save anything. If you want a transcription tool, Google offers Live Transcribe, released in February. Live Transcribe also uses machine learning algorithms to turn audio into real-time captions. But unlike Live Caption, it’s a full-screen experience, uses your smartphone’s microphone (or an external microphone), and relies on the Google Cloud Speech API to caption real-time spoken words in over 70 languages and dialects. You can also type back into it — Live Transcribe is really a communication tool.
Meanwhile, “Live Caption is the notion that, at the OS level, we should be able to caption any media on the device,” Kemler explained. “Not only to make that media accessible to people who can’t hear or who have trouble hearing, but also for people like us. You’re sitting at I/O and you need to watch a video and you want to do so silently. That’s a really important use case. You’re on the train, you’re on the plane, you don’t want audio in certain cases. There are other applications too. Think of learning another language — super helpful to have those captions in that language.”
Live Caption relies on the AudioPlaybackCaptureConfiguration API, which is being added as part of Android Q. That’s what makes it possible for the feature to capture your phone’s audio, even if you’ve muted the device.
“We will have a new API that’s available primarily for OEMs to use in the context of live captions,” Kemler elaborated. “It’s in what we call a ‘personal AI environment.’ It’s a very secure environment, and it gets special system privileges, like being able to pull audio, but it has to adhere to a set of principles. So, for instance, you can get captions, but Google would never have access to that audio. It’s just always going to be on the device. You can’t do anything with that audio other than provide those captions. So it’s very important for us that we honor security and privacy. Things that are sensitive stay local on the device.”
This is also why Live Caption doesn’t work on phone calls, voice calls, or video calls. And there are no plans to let Live Caption support transcriptions.
“Not for Live Caption. Obviously, we thought about that. But we want the captions to be truly captions in the sense that they’re ephemeral, if they help you understand or consume that experience. But we want to protect the people, the publishers, content, and content owners. We don’t want to give you the ability to pull out all that audio, transcribe it, and then do [whatever they want with it],” Kemler said.
Could someone use the API to do that? “Not the way we have it architected.”
When showing off Live Caption, Google hinted that it’s also exploring automatically translating the captions if the content is not in your set language. But that’s a long way off. In fact, putting translations aside, Live Caption is only going to launch with one language supported.
“So, for release we’re going to launch in English,” Kemler confirmed. “And then we’re going to push as hard as we can to add as many other languages as possible. It will also depend a little bit on the devices. So if we go with an approach on Pixel, which is very skewed toward the English language, then we’ll look at the other big languages, like Japanese.”
When you unbox your new Android Q device that supports Live Caption, the first time you use the feature, it will have to download the offline model. It won’t be on the device, because Google wants to ensure you’re always using the latest model. Updates to the model will be delivered through Google Play Services. And since only English will be available, it will be straightforward. But one day, likely based on the language you pick in your phone’s initial setup process, your device will download the corresponding offline language model.
That process gets even more complicated when you start thinking about translation.
“Translation is not in the feature set,” Kemler emphasized. “It’s the tip of an iceberg. It looks like a very simple feature, but it has so many different layers to it. Translation requires a completely different pipeline, a completely different UI. We’re focused on nailing the MVP experience, number one. Number two, adding more languages, and getting it out more into the ecosystem. Translation is something that’s super important, but we want to make sure that the core experience is very high quality, is very good, and has a broad reach and broad adoption, before we get into everything we could possibly do with it.”
Google must learn to crawl before it can walk. And translation is more of a run.
“We take a very dumbed-down version of the audio in mono — I think it’s 16 kilohertz — and then put that into the model,” said Kemler. “And if the model has features which add complexity — so things like capitalization and punctuation, that adds latency, it adds processing, and has a battery impact. And then we have to render that into text. So we have all of those things to do. And then ‘Oh, we want to translate on the fly?’ Well, we have to figure out the downloading of that model and then have another layer of processing in that pipeline. So we think, theoretically, it’s obviously something doable — and something intentionally, conceptually, we want to do — but there’s a cost to doing that.”
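To make the "dumbed-down" audio step concrete, here is a minimal Python sketch of reducing stereo audio to 16 kHz mono. It is not Google's pipeline. The 48 kHz stereo input, the function names, and the block-averaging approach are all assumptions chosen for illustration, and a production resampler would use a proper low-pass filter.

```python
def downmix_to_mono(frames):
    """Average the left and right samples of each stereo frame."""
    return [(left + right) / 2 for left, right in frames]


def decimate(samples, factor):
    """Cut the sample rate by an integer factor.

    Each block of `factor` samples is averaged into one output sample,
    which acts as a crude low-pass filter. Trailing samples that do not
    fill a whole block are dropped.
    """
    usable = len(samples) - len(samples) % factor
    return [
        sum(samples[i:i + factor]) / factor
        for i in range(0, usable, factor)
    ]


def prepare_for_model(frames, source_rate=48_000, target_rate=16_000):
    """Turn stereo frames into mono samples at the target rate.

    Assumption: the source rate is an integer multiple of the target
    rate (48 kHz -> 16 kHz is a factor of 3).
    """
    if source_rate % target_rate != 0:
        raise ValueError("this sketch only handles integer rate ratios")
    mono = downmix_to_mono(frames)
    return decimate(mono, source_rate // target_rate)


# One second of silent 48 kHz stereo becomes 16,000 mono samples.
one_second = [(0.0, 0.0)] * 48_000
print(len(prepare_for_model(one_second)))
```

Feeding the model a third of the samples and half the channels means far less data per second of audio, which fits Kemler's point that every extra step in the pipeline costs latency, processing, and battery.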
So the team would rather focus on the initial experience and getting users to adopt it and use it, “which we don’t think is going to be any problem. It’s so useful, and so utilitarian. And then we’ll look into doing more wizardry, where we can really optimize that pipeline.”
If the number of supported devices is small, that will be a problem, as Live Caption won’t reach utilitarian status if most people can’t use it. So, in addition to improving the models and adding more languages, Google will also have to add support for more devices.
“We absolutely want to make the feature as available as possible,” Kemler said.