
Tuesday, May 24, 2016

Beyond Google Auto

At the most recent Google I/O, Google showed its upcoming Android N release. This version already contains parts that aim at usage in our cars. Previously, Google's efforts towards infotainment systems were bundled in Android Auto: the smartphone had to be connected to the car via USB so that the screen on the dashboard could serve as a display for the device.
Android Auto (from http://www.digitaltrends.com/infotainment-system-reviews/android-auto-review/)
This worked only if the car's infotainment system supported it. Some aftermarket head units, like the Kenwood DDX9902S, allowed these capabilities to be added by replacing the existing unit.

Now, this is no longer necessary since the functionality is built directly into the OS. This makes it applicable even to older cars. If the car offers WLAN, as newer cars do, the user no longer needs to fiddle around with USB either.

Here are some impressions taken from http://www.automotiveitnews.org/articles/share/1481628/
The center of the screen shows the navigation app, while the top and bottom portions of the screen are available for additional information
When initiating a phone call, only the center screen changes

While many car manufacturers fear that their brands may become mere commodities, Google has already taken the next step and given its answer to the question: yes, the car will become a cradle for your smartphone.

Monday, April 4, 2016

The Other Data Problem of Machine Learning

There is one big problem that machine learning usually faces: the acquisition of data. This has been one of the bigger hindrances to training speech recognizers for quite some time. A nice read in this context is a blog post by Arthur Chan from seven years ago, where he explains his thoughts on true open source dictation: http://arthur-chan.blogspot.de/2009/04/do-we-have-true-open-source-dictation.html

This problem grew when deep learning entered the scene of speech recognition: more and more data is needed to create convincing systems. The story continues with spoken dialog management. Apple seems to want to take a step forward in this direction with the acquisition of VocalIQ: http://www.patentlyapple.com/patently-apple/2015/10/apple-has-acquired-vocal-iq-a-company-with-amazing-focus-on-a-digital-assistant-for-the-autonomous-car-beyond.html
Most news coverage tried to see this in the light of Apple's efforts towards integration into the automotive market. CarPlay (http://www.apple.com/ios/carplay/), which displays apps on the dashboard, and what some people call the iCar (http://www.pcadvisor.co.uk/new-product/apple/apple-car-rumours-what-on-earth-is-icar-3626110/) were recently in the news.
I am not sure if there really is such a relation. It might be useful for Siri as well. Adaptive dialogs have been a research topic for some years now. Maybe it is time for this technology to address a broader market.

So far, Apple has seemed reluctant with regard to learned dialog behavior. In the end, these processes cannot guarantee a promised behavior. This is also one of the main reasons why this technology has not been adopted as fast as in other fields where (deep) learning entered the scene. Pieraccini and Huerta describe this problem in "Where do we go from here? Research and commercial spoken dialog systems" as the VUI-completeness principle: "the behavior of an application needs to be completely specified with respect to every possible situation that may arise during the interaction. No unpredictable user input should ever lead to unforeseeable behavior. Only two outcomes are acceptable, the user task is completed, or a fallback strategy is activated..." This quality measure has been established over the years and is not available with statistical learning of the dialog strategy. In essence, the fear can be described as follows: let's assume the user asks "Hey, what is the weather like in Germany?". In the (very unlikely) case that this exchange is in the data, the system may have learned that a good answer to this could be "Applepie".
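
To make the VUI-completeness idea a bit more tangible, here is a minimal sketch of a hand-crafted dialog policy (all intent and slot names are hypothetical and only serve as an illustration): every user input either advances one of the known tasks or triggers a predefined fallback, so no input can produce unforeseeable behavior. A purely learned policy offers no such guarantee.

# Minimal sketch of VUI-completeness: every input either advances the
# task or triggers a predefined fallback, never an unforeseeable answer.
# Intent and slot names are hypothetical.

def handle_utterance(intent: str, slots: dict) -> str:
    """Return a system response; unknown input always hits the fallback."""
    if intent == "get_weather":
        place = slots.get("place")
        if place is None:
            return "For which place would you like the weather?"  # re-prompt
        return f"Looking up the weather for {place}."              # task completed
    if intent == "set_timer":
        duration = slots.get("duration", "an unspecified time")
        return f"Setting a timer for {duration}."                  # task completed
    # Fallback strategy: the only other acceptable outcome.
    return "Sorry, I cannot help with that. You can ask for the weather or set a timer."

print(handle_utterance("get_weather", {"place": "Germany"}))
print(handle_utterance("order_pizza", {}))   # unpredictable input -> fallback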

Consequently, the data used to train the system has to be selected and filtered. Sometimes, such a gap is only discovered while the system is running. Usually, this is the worst-case scenario. Recently, this happened to Apple's Siri: a question to Siri about where to hide a dead body became evidence in a murder trial. Siri actually came up with some answers.
Screenshot of Siri's answers to where to hide a body
Now, it has been corrected and Siri simply answers "I used to be able to answer this question."

Similarly, Microsoft was in the news with its artificial agent Tay. Tay was meant to learn while people were interacting with it. It took less than 24 hours to get from the statement "Humans are super cool" to "Hitler was right". The data came more or less unfiltered from hackers aiming to shape Tay's attitude.

Evolution of Tay on Twitter, from https://twitter.com/geraldmellor/status/712880710328139776/photo/1

Again, the underlying problem lies in the ethics of the data: selection and filtering. But what are the correct criteria for that? Who is in charge of defining the playground? Usually, it is the engineer developing the system (and thus his or her ethical background).
This "other problem of machine learning" seems to be not in the focus of those developing machine learning systems. Usually, they are busy with coming up with some data at all to initially train their system at all.

However, this problem is not really new. Think of Isaac Asimov, who invented the laws of robotics. He already had the idea of guiding criteria for machine behavior. Maybe we need to develop something along these lines as we move down this road.

And this is also true for spoken dialog systems that actively learn their behavior from usage, such as adaptive dialogs. It will be awkward to see learning systems out there change their behavior into something that was never intended by the developer. I am waiting for those headlines.

Wednesday, March 16, 2016

Google's Offline Personal Assistant

In June 2015, there were first rumors that a few commands for Google Now would be available even when you are offline. An APK teardown of the Google app, reported at
http://www.androidpolice.com/2015/06/27/apk-teardown-google-app-v4-8-prepares-for-ok-google-offline-voice-commands-to-control-volume-and-brightness-and-much-more/, revealed some string resources that hinted at this.

<string name="offline_header_text">Offline voice tips</string>
<string name="offline_on_start_cue_cards_header_listening">Offline</string>
<string name="offline_on_start_cue_cards_header_timeout">Offline voice tips</string>
<string name="offline_on_start_cue_cards_second_header_listening">You can still say...</string>
<string name="offline_on_start_cue_cards_second_header_timeout">You can still say "Ok Google," then...</string>
<string name="offline_on_start_cue_cards_second_header_timeout_without_hotword">You can still touch the mic, then say...</string>
<string name="offline_options_start_hotword_disabled">You can still touch the mic, then say...</string>
<string name="offline_options_start_hotword_enabled">You can still say "Ok Google," then...</string>
<string name="offline_error_card_title_text">Something went wrong.</string>
<string name="error_offline_no_connectivity">Check your connection and try again.</string>

So far, Google Now required an online connection to work. Since this is not always a given, it is beneficial to have a workaround for these cases. They found the following four options:
  • Make a call
  • Send a text
  • Play some music
  • Turn on Wi-Fi

Usually, such a teardown is more rumor than fact. In this case, the rumors proved to be true. In September 2015, this functionality was made available, as reported, e.g., on http://www.androidpolice.com/2015/09/28/the-google-android-app-now-supports-limited-voice-commands-for-offline-use/. The way this is reflected in the UI is shown in the following picture.

Google Now in offline mode, taken from http://www.androidpolice.com/2015/09/28/the-google-android-app-now-supports-limited-voice-commands-for-offline-use/
So, some commands are also available when you are offline. This list is larger than the original one, but limited to:
  • Play Music
  • Open Gmail (works with any app name on the device)
  • Turn on Wi-Fi
  • Turn up the volume
  • Turn on the flashlight
  • Turn on airplane mode
  • Turn on Bluetooth
  • Dim the screen
Unfortunately, this is only true for the English version. For instance, it refuses to work in German, even if the English offline recognition has been downloaded to the device. Neither English nor German commands work. Instead, the following screen is shown.
Google Now in offline mode for German on my Samsung Galaxy 5
It is still unclear when this will work. But Google seems to be advancing its embedded technology, as reported in "Personalized Speech Recognition on Mobile Devices". There, they describe a remarkable speed-up of their embedded speech recognizer. They state that their newest technology "...provides a 2× speed-up in evaluating our acoustic models as compared to the unquantized model, with only a small performance degradation". The word error rate (WER) for open-ended dictation in an open domain increased from 12.9% to 13.5%. Moreover, they report a reduced footprint: their acoustic model "...is compressed to a tenth of its original size". Apart from that, they still feature language model personalization through a combination of vocabulary injection and on-the-fly language model biasing.
In the end, they "built a system which runs 7× faster than real-time on a Nexus 5, with a total system footprint of 20.3 MB".
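
The speed-up and the reduced footprint essentially come from quantizing the model parameters. The following toy sketch illustrates the general idea of 8-bit weight quantization (a generic illustration, not Google's actual scheme): weights are stored as int8 values plus a scale factor, trading a small approximation error for a roughly four times smaller footprint than float32.

import numpy as np

# Toy illustration of 8-bit weight quantization (not Google's actual scheme):
# store weights as int8 plus one scale per matrix, trading a small
# approximation error for a ~4x smaller footprint than float32.

def quantize(weights: np.ndarray):
    scale = np.abs(weights).max() / 127.0
    q = np.round(weights / scale).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(512, 512).astype(np.float32)    # one layer's weight matrix
q, scale = quantize(w)

print("float32 size:", w.nbytes, "bytes")            # 1,048,576 bytes
print("int8 size:   ", q.nbytes, "bytes")            # 262,144 bytes
print("max abs error:", np.abs(w - dequantize(q, scale)).max())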

The latter work only addresses the recognition task as it is available with what they call "Voice Typing". It still needs the integration of NLU to turn the recognized text into actual commands for Google Now.

So, Google seems to be on its way to a personal assistant that can also be used when you are not connected to the internet. Some of the commands may not make much sense when you are offline, but others will work, sooner or later. English is supported first, and it is unclear when other languages will follow. But it is a start.




Wednesday, February 3, 2016

Almost 20th anniversary of "Voice recognition is ready for primetime"

For as long as I can remember, the voice industry has announced that "voice recognition is ready for primetime", e.g., in an article from 1999: http://www.ahcmedia.com/articles/117677-is-voice-recognition-ready-for-prime-time. For a long time, I had the impression that there was not much improvement in the NIST ASR benchmark results.
NIST ASR benchmark results

All reported results seemed to be converging towards some magic barrier that was still far from the human error rate. Recently, IBM reported some remarkable improvements on the Switchboard corpus: http://arxiv.org/pdf/1505.05899v1.pdf. Although they also rely on DNNs, they outperform current systems (~12-14% WER) and claim to achieve a WER of ~8%. So we are coming closer to human performance.
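
As a reminder, the word error rate quoted in such comparisons is the word-level edit distance between the recognizer output and the reference transcript, normalized by the length of the reference. A minimal sketch:

# Minimal word error rate (WER): Levenshtein distance on word level,
# normalized by the length of the reference transcript.

def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(wer("switchboard is a telephone speech corpus",
          "switchboard is the telephone corpus"))   # 2 errors / 6 words ~ 0.33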
It actually took some time until speech really took off. The biggest advancements were clearly made with the advent of deep learning.

This does not really seem to be true for NLU. Manning states in http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00239 that computational linguists should not worry, since NLU never experienced this kind of breakthrough when deep learning was applied.

Nevertheless, people have started to notice the recent advancements. Especially Apple and Google did a good job of making the technology publicly available and usable. A recent survey from Parks Associates shows that speech products are used by more than 39% of smartphone users (http://www.parksassociates.com/360view/360-mobile-2015). About 50% of Apple users are using them, while only around 30% of Android phone users are using voice. The researchers state that "Among smartphone users ages 18-24, 48% use voice recognition software, and use of the “Siri” voice recognition software among iPhone users increased from 40% to 52% between 2013 and 2015. This translates into 15% of all U.S. broadband households using Siri.". So, the coming generation seems to appreciate the use of voice control.
http://mobilemarketingwatch.com/voice-recognition-on-the-rise-parks-associates-report-shows-40-percent-of-u-s-smartphone-owners-use-it-65002/ 

Maybe the speech industry has been making its promise that voice is ready for primetime for too long. Now, the gain in performance seems to be reflected in actual usage. And it is increasing...

Monday, January 25, 2016

Silent Speech Recognition



A somewhat interesting read on how we will talk to computers in the near future can be found here: http://www.wired.com/2015/09/future-will-talk-technology/

Still, smartphones are vision-centric devices, as already discussed in http://schnelle-walka.blogspot.com/2015/12/smartphones-as-explicit-devices-do-not.html. Currently, I see people holding their smartphones in front of their faces to initiate, e.g., voice queries starting with "OK Google" or similar. So, you do not even need to look at their phone to learn who manufactured it. Moreover, since voice is audible to everyone around, you will also learn their plans. A more subtle way seems to come with subvocalization. It exploits the fact that people tend to form words without speaking them out loud. Avoiding subvocalization is also one of the tricks to speed up reading: http://marguspala.com/speed-reading-first-impressions-are-positive/
Subvocalization slows down your reading speed
It is still an ongoing research topic in HCI, and I wonder how mature it is. Will it be useful at all? Or will we get used to people talking to their phones, gadgets or whatever, in the same way that we got used to people having a phone call while they are walking?

Another interesting alternative comes with silent speech recognition. Denby et al. define it in "Silent Speech Interfaces" as a system enabling speech communication to take place when an audible acoustic signal is unavailable. Usually, these techniques employ brain-computer interfaces (BCI) as a source of information. The following figure, taken from Yamaguchi et al., "Decoding Silent Speech in Japanese from Single Trial EEGs: Preliminary Results", is well suited to describe the scenario.
Experimental setup for SSI from Yamaguchi et al.


In their research, they investigated, among other things, how to differentiate the Japanese words for spring, summer, autumn and winter. They were able to show that this setup works in principle, but the results are still far from being usable.
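
Just to make the task concrete, the following toy sketch (synthetic data and a simple classifier, not the method used by Yamaguchi et al.) shows the kind of pipeline involved: each EEG trial has to be assigned to one of four word classes, and anything clearly above the 25% chance level would indicate that the signal carries usable information.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Toy sketch of the classification task (not Yamaguchi et al.'s method):
# each EEG trial is a channels x samples array; the classifier has to
# decide which of four silently spoken words (here: the four seasons)
# the trial belongs to. Synthetic data stands in for real recordings.

rng = np.random.default_rng(0)
n_trials, n_channels, n_samples = 80, 16, 128
X = rng.normal(size=(n_trials, n_channels, n_samples))   # fake EEG trials
y = rng.integers(0, 4, size=n_trials)                    # 4 word classes

X_flat = X.reshape(n_trials, -1)                         # naive feature vector
clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
scores = cross_val_score(clf, X_flat, y, cv=5)
print("chance level ~0.25, accuracy:", scores.mean())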

But does this kind of interface make sense at all? It might be useful in scenarios where noise is omnipresent and kills all efforts towards traditional speech recognition. A car is one example. However, the apparatus needs to become far less intrusive for this.

Monday, December 21, 2015

Smartphones as Explicit Devices do not meet Weiser's Vision of Ubiquitous Computing

Mark Weiser formulated the vision of the invisible computer as follows: "The most profound technologies are those that disappear. They weave themselves into the fabric of everyday life until they are indistinguishable from it." He mainly based his vision on the observation that the cardinality of the human-computer relation was changing, as shown in the following figure.

Human-Computer relation over time (from http://tuprints.ulb.tu-darmstadt.de/5184/)

However, we are not there yet. Currently, I see people holding their smartphones in front of their faces to initiate, e.g., voice queries starting with "OK Google" or similar. So, we are not interacting with the everyday objects that we encounter in our daily lives, but use a vision-centric device as our key. As a result, screen sizes have kept increasing over the past years.
One of the drawbacks of this vision-centric key to smart spaces is that it is by no means hands- and eyes-free. Google and Apple are continuously improving the voice capabilities of their personal agents, but these still rely on manual interaction and force the users to look at the screens. It appears as if we forgot about the "invisible" attribute of Weiser's vision, invisible meaning that we do not perceive it as a device. Today, the smartphone is still an explicit device.
One day, in the very near future, we will have to decide if this is the way to go. Do we want to reuse this thing everywhere, while we are walking the streets, in our cars,...?

Maybe, this also(!) motivates the wish for Android Auto and Apple CarPlay to have the car as another cradle for your smartphone.

Scenarios like the one described in http://www.fastcodesign.com/3054733/the-new-story-of-computing-invisible-and-smarter-than-you are still far away. A video demonstrates their Room E.



Prototypes like this already exist in the labs, and maybe it is time for them to leave the labs.

Amazon Echo is maybe a first step in this direction. As a consequence, it became the best-selling item above $100: http://www.theverge.com/2015/12/1/9826168/amazon-echo-fire-black-friday-sales

In contrast to the scenario in the video above, users do not need to speak up. It can be used for voice queries and for controlling devices: http://www.cnet.com/news/amazon-echo-and-alexa-the-most-impressive-new-technology-of-2015/. So, let's see how this one evolves with regard to Weiser's vision. Maybe we will see comparable approaches soon.

Wednesday, December 16, 2015

Nuance opens their NLU to developers

Nuance has just opened its NLU platform as a beta to developers: https://developer.nuance.com/public/index.php?task=mix

It is more than simply NLU: it is a full stack, including speech recognition, that can be used in your own applications, as shown in their promotional video.


Similar to the efforts of NLU startups residing under .ai, Nuance Mix is able to detect an intent and user-defined entities from entered sentences. The possibility to also employ Nuance ASR, however, makes it more complete than those efforts. Maybe this has to be seen as an attempt to strengthen Nuance's approach to a virtual assistant, which they call Nina. Nina has been out for a while but has not received much attention so far.
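
Schematically, the output of such an NLU component for a single sentence is an intent plus a set of user-defined entities, roughly like in the following sketch (a generic illustration, not Nuance Mix's actual response format):

# Generic illustration of an NLU result: detected intent plus
# user-defined entities. This is not Nuance Mix's actual API.

utterance = "Book a table for two in Berlin tomorrow evening"

nlu_result = {
    "intent": "book_table",          # the detected intent (hypothetical name)
    "confidence": 0.92,              # hypothetical confidence score
    "entities": {                    # user-defined entities with their values
        "party_size": "two",
        "city": "Berlin",
        "time": "tomorrow evening",
    },
}

# An application would branch on the intent and fill its business logic
# with the extracted entities.
print(nlu_result["intent"], nlu_result["entities"]["city"])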
The market of virtual assistants is already somewhat populated. Google Now and Apple's Siri are well known and established. Others, like Microsoft's Cortana, are also trying to gain traction. Recently, Microsoft opened Project Oxford as a cloud-based tool for the creation of smart (voice-centric) applications. A comparable, but maybe more advanced, offering is IBM Watson, which has been available for some time. Another one is Amazon Echo, whose platform has also been opened to developers.

It appears that spoken language technology is mature enough to be really useful. That is good news for developers who want to play around with voice interaction to control applications in the internet of things. Currently, there is a plethora of SDKs available that can be used for free. The question is not if we will see more spoken interaction with everyday things in our lives, but who will win the race for a sufficient number of users and their data. Maybe Nuance is already too late with Nuance Mix to enter that market. Maybe they can step in nevertheless, relying on their years-long dominance in speech recognition.

Friday, December 11, 2015

NLU vs Dialog Management

Recently, I stumbled across a blog post from api.ai announcing that their system now supports slot filling: https://api.ai/blog/2015/11/09/SlotFilling/. Note that my goal is not to blame their system.

Currently, I observe that efforts towards spoken interaction coming from cognitive computing are still not fully aware of what has been done in dialog management research over the past decades, and vice versa. Both parties come from different centers in the chain of spoken dialog systems.
While the AI community usually focuses on natural language understanding (linguistic analysis), the spoken dialog community focuses on the dialog manager as the central point in this chain.
Both have good reasons for their attitude and are able to deliver convincing results.

Cognitive computing sees the central point in the semantics, which should also be grounded with previous utterances or external content. Speech input and output are, in this view, just some input into the system and some output. Dialog management can be really dumb in this case. The resulting user interfaces are currently more or less query based.

The dialog-manager-focused view regards the NLU as just some input into the system, while the decision upon subsequent interaction is handled in this component. The resulting user interfaces range from rigid state-based approaches over the information state update approach up to statistically motivated dialog managers like POMDPs. A minimal sketch of a state-based, form-filling dialog manager follows below.
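
Here is that sketch for a hypothetical flight-booking task (all names are made up): the NLU only delivers slots per turn, while the dialog manager keeps the state and decides what to ask next. This is the kind of behavior that slot filling in api.ai now also covers.

# Minimal form-filling dialog manager sketch (hypothetical flight-booking
# task): the NLU delivers slots per turn, the dialog manager keeps the
# state and decides on the next system action.

REQUIRED_SLOTS = ["origin", "destination", "date"]

def next_action(state: dict, nlu_slots: dict) -> str:
    """Merge the new slots into the state and pick the next system move."""
    state.update({k: v for k, v in nlu_slots.items() if v})
    for slot in REQUIRED_SLOTS:
        if slot not in state:
            return f"ask({slot})"          # prompt for the missing slot
    return f"confirm({state})"             # all slots filled -> confirm

state = {}
print(next_action(state, {"destination": "Berlin"}))     # ask(origin)
print(next_action(state, {"origin": "Frankfurt"}))       # ask(date)
print(next_action(state, {"date": "tomorrow"}))          # confirm(...)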

My hope is that both communities start talking to each other to better incorporate the convincing results of "the other component" and arrive at a convincing user experience.



Saturday, December 5, 2015

Microsoft researchers expect human-like capabilities of spoken systems in a few years

Researchers at Microsoft believe that we are only a few years away from machines being able to understand spoken language as well as humans do. Although many advances have been made in the past years, there are still many challenges that need to be solved. This is especially true for distant speech recognition, which we need to cope with in daily situations. Maybe their statement is still a bit too optimistic. However, as systems are already available and people are starting to use them, they are right in their assumption that these systems will make progress. We just have to make sure that voice-based assistants like Cortana are used at all. Currently, some of these systems seem to be more of a gimmick to play with until users become bored of them. Hence, they are practically condemned to improve fast in order to also be helpful.