Sunday, July 17, 2016

NLU is not a User Interface

Some time ago I wrote about NLU vs. dialog management. My hope was that people working in NLU and in voice user interface design would start talking to each other. I expanded on these ideas in a paper submitted to the IUI Workshop on Interacting with Smart Objects: NLU vs. Dialog Management: To Whom am I Speaking? In essence: "Dialog management-centered systems are principally constrained because they anticipate the users input as plans to help them to achieve their goal. Depending on the implemented dialog strategy they allow for different degrees of flexibility. NLU-centered systems see the central point in the semantics of the utterance, which should also be grounded with previous utterances or external content. Thus, whether speech or not, NLU regards this as a stream of some input to produce some output. Since no dialog model is employed, resulting user interfaces currently do not handle much more than single queries".

Actual dialog systems must go beyond this and combine knowledge from both research domains to provide convincing user interfaces.

Now I stumbled across a blog entry by Matthew Honnibal, who bemoans the current hype around artificial intelligence and the ubiquitous promise of more natural user interfaces. He is right that voice is simply another user interface. He states: "My point here is that a linguistic user interface (LUI) is just an interface. Your application still needs a conceptual model, and you definitely still need to communicate that conceptual model to your users. So, ask yourself: if this application had a GUI, what would that GUI look like?"

He continues by mapping the spoken input to method calls along with their parameters. Then he concludes: "The linguistic interface might be better, or it might be worse. It comes down to design, and your success will be intimately connected to the application you’re trying to build."
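The idea of mapping spoken input to method calls with parameters can be sketched in a few lines of Python. This is only a minimal illustration, not Honnibal's code: the functions `set_alarm` and `play_music`, the regular expressions, and the `dispatch` helper are all made up for this example, and a real system would use an NLU component instead of hand-written patterns.

```python
import re
from typing import Optional

# Hypothetical application methods; names and signatures are illustrative only.
def set_alarm(hour: int, minute: int) -> str:
    return f"alarm set for {hour:02d}:{minute:02d}"

def play_music(artist: str) -> str:
    return f"playing music by {artist}"

# Each rule pairs an utterance pattern with a handler that extracts the
# parameters from the match and calls the corresponding method.
RULES = [
    (re.compile(r"set an alarm for (\d{1,2}):(\d{2})", re.IGNORECASE),
     lambda m: set_alarm(int(m.group(1)), int(m.group(2)))),
    (re.compile(r"play (?:some )?music by (.+)", re.IGNORECASE),
     lambda m: play_music(m.group(1).strip())),
]

def dispatch(utterance: str) -> Optional[str]:
    """Return the result of the first matching method call, or None."""
    for pattern, handler in RULES:
        match = pattern.search(utterance)
        if match:
            return handler(match)
    return None  # no rule matched; the caller has to decide what to do

if __name__ == "__main__":
    print(dispatch("Please set an alarm for 7:30"))
    print(dispatch("Play some music by Miles Davis"))
```

Note that `dispatch` returning `None` is exactly the place where interface design has to take over: the user needs to learn what the application can do.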

This is exactly the point where voice user interface design comes into play. Each modality requires special design knowledge for effective interfaces. Matthew Honnibal seems to be aware neither of the term VUI nor of the underlying approaches and concepts. Maybe it is time to rediscover it in order to build better voice-based interfaces employing state-of-the-art NLU technology.

Tuesday, July 5, 2016

AI and the Need for Human Values

Stuart Russell, professor at Berkeley and co-author, with Peter Norvig, of the standard textbook on artificial intelligence, speculates about the future of AI.

He has no doubt that AI will change the world. "In future, AI will increasingly help us live our lives", he said, "driving our cars and acting as smart virtual assistants that know our likes and dislikes and that will manage our day." Technology already exists that analyzes and monitors a plethora of documents more accurately, in order to forecast events or to give us hints that make our lives easier. "Looking further ahead, it seems there are no serious obstacles to AI making progress until it reaches a point where it is better than human beings across a wide range of tasks."

In the best case, "[W]e could reach a point, perhaps this century, where we're no longer constrained by our difficulties in feeding ourselves and stopping each other from killing people, and instead decide how we want the human race to be."

On the other hand, he also sees a great danger. Autonomous weapons may turn out to be a great threat. "Five guys with enough money can launch 10 million weapons against a city," he said.

He calls for serious plans on how to cope with that. "A system that's superintelligent is going to find ways to achieve objectives that you didn't think of. So it's very hard to anticipate the potential problems that can arise." Therefore, "there will be a need to equip AI with a common sense understanding of human values."

He suggests that the only absolute objective of autonomous robots should be maximising the values of humans as a species.

This is all well said. But has the human species already reached a point where it can agree upon a common set of values? Who will be the one to decide how we want the human race to be? How would we teach those values to AI? I fear that this will remain a nice vision, and that we will reach the point where such an integration of values is needed before we are ready for it.

Tuesday, May 24, 2016

Beyond Google Auto

At the most recent Google I/O, Google showed the upcoming Android N. This version already contains parts that are aimed at usage in our cars. Previously, Google's efforts towards infotainment systems were bundled in Android Auto. There, the smartphone had to be connected to the car via USB in order to use the screen on the dashboard as a screen for your device.
Android Auto (from http://www.digitaltrends.com/infotainment-system-reviews/android-auto-review/)
This worked only if the car's infotainment system supported it. Some tuners, like the Kenwood DDX9902S, made it possible to add these capabilities by replacing the existing tuner.

Now this is no longer necessary, since the functionality is already built into the OS. This makes it applicable even to older cars. If the car has Wi-Fi, as is available in newer cars, the user also no longer needs to fiddle around with USB.

Here are some impressions taken from http://www.automotiveitnews.org/articles/share/1481628/
The center of the screen shows the navigation app, while the top and bottom portions of the screen are available for additional information
When initiating a phone call, only the center screen changes

While many car manufacturers fear that their brands may become commodity devices, Google has already taken the next step and provided its answer to the question: Yes, the car will become a cradle for your smartphone.

Monday, April 4, 2016

The Other Data Problem of Machine Learning

There is one big problem that machine learning usually faces: the acquisition of data. This has been one of the bigger obstacles to training speech recognizers for quite some time. A nice read in this context is a blog post by Arthur Chan from seven years ago, in which he explains his thoughts on true open source dictation: http://arthur-chan.blogspot.de/2009/04/do-we-have-true-open-source-dictation.html

This problem grew when deep learning entered the scene of speech recognition. More and more data is needed to create convincing systems. The story continues with spoken dialog management. Apple seems to want to take a step forward in this direction with the acquisition of VocalIQ: http://www.patentlyapple.com/patently-apple/2015/10/apple-has-acquired-vocal-iq-a-company-with-amazing-focus-on-a-digital-assistant-for-the-autonomous-car-beyond.html
The news coverage tried to see this in the light of Apple's efforts towards integration into the automotive market. CarPlay http://www.apple.com/ios/carplay/, which displays apps on the dashboard, and what some people call iCar http://www.pcadvisor.co.uk/new-product/apple/apple-car-rumours-what-on-earth-is-icar-3626110/ were recently in the news.
I am not sure whether there really is such a relation. The technology might be useful for Siri as well. Adaptive dialogs have been a research topic for some years now. Maybe it is time for this technology to address a broader market.

So far, Apple seemed to be reluctant with regard to learned dialog behavior. After all, learned behavior cannot guarantee that the system behaves as promised. This is also one of the main reasons why this technology is not being adopted as fast as in other fields where (deep) learning entered the scene. Pieraccini and Huerta describe this problem in Where do we go from here? Research and commercial spoken dialog systems as the VUI-completeness principle. They describe it as "the behavior of an application needs to be completely specified with respect to every possible situation that may arise during the interaction. No unpredictable user input should ever lead to unforeseeable behavior. Only two outcomes are acceptable, the user task is completed, or a fallback strategy is activated..." This quality measure has been established over the years and is not available with statistical learning of the dialog strategy. In essence, the fear can be described as follows: Let's assume the user asks "Hey, what is the weather like in Germany?". In the (very unlikely) case that it is in the data, the system may have learned that a good answer to this could be "Applepie".
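The VUI-completeness principle quoted above can be illustrated with a small Python sketch. It is my own illustration, not taken from Pieraccini and Huerta: the intent names, the confidence threshold and the `respond` function are assumptions. The point is only that every input ends in one of two specified outcomes, a handled task or the fallback.

```python
from typing import Callable, Dict

# Hypothetical task handlers; a real application would call a weather service etc.
def handle_weather(slots: Dict[str, str]) -> str:
    return f"Looking up the weather in {slots.get('place', 'your area')}."

def handle_call(slots: Dict[str, str]) -> str:
    return f"Calling {slots.get('contact', 'the last number')}."

# The complete, hand-specified set of things the application can do.
HANDLERS: Dict[str, Callable[[Dict[str, str]], str]] = {
    "weather": handle_weather,
    "call": handle_call,
}

FALLBACK = "Sorry, I did not understand that. You can ask about the weather or make a call."

def respond(intent: str, slots: Dict[str, str], confidence: float,
            threshold: float = 0.6) -> str:
    """Either complete a specified task or activate the fallback strategy."""
    handler = HANDLERS.get(intent)
    if handler is None or confidence < threshold:
        # Unknown or uncertain input never leads to unspecified behavior.
        return FALLBACK
    return handler(slots)

if __name__ == "__main__":
    print(respond("weather", {"place": "Germany"}, confidence=0.9))
    print(respond("applepie", {}, confidence=0.99))
```

A purely learned dialog strategy has no such table to inspect, which is why the "Applepie" answer cannot be ruled out up front.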

Consequently, the data used to train the system has to be selected and filtered. Sometimes a gap in this filtering is discovered only while the system is running. Usually, this is the worst-case scenario. Recently, this happened to Apple's Siri. A question to Siri about where to hide a dead body became evidence in a murder trial. Siri actually came up with some answers.
Screenshot of Siri's answer on where to hide a body
Now this has been corrected, and Siri simply answers "I used to be able to answer this question."

Similarly, Microsoft was in the news with its artificial agent Tay. Tay was meant to learn while people were interacting with it. It took less than 24 hours to get from the statement "Humans are super cool" to "Hitler was right". The data came in more or less unfiltered from hackers aiming to shape Tay's attitude in this way.

Evolution of Tay on Twitter, from https://twitter.com/geraldmellor/status/712880710328139776/photo/1

Again, the underlying problem lies in the ethics of the data: selection and filtering. But what are the correct settings for that? Who is in charge of determining the playground? Usually, this is the engineer developing the system (and thus his or her ethical background).
This "other problem of machine learning" does not seem to be a focus of those developing machine learning systems. Usually, they are busy with coming up with any data at all to initially train their system.

However, this problem is not really new. Think of Isaac Asimov, who invented the laws of robotics. He already had the idea of guiding criteria for machine behavior. Maybe we need to develop something along these lines as we move down this road.

And this is also true for spoken dialog systems that actively learn their behavior from usage as adaptive dialogs. It will be awkward to see learning systems out there that change their behavior to something that was never intended by the developer. I am waiting for those headlines.

Wednesday, March 16, 2016

Google's Offline Personal Assistant

In June 2015, there were some first rumors that a few commands for Google Now would be available even when you are offline. An APK teardown of the Google app reported at
http://www.androidpolice.com/2015/06/27/apk-teardown-google-app-v4-8-prepares-for-ok-google-offline-voice-commands-to-control-volume-and-brightness-and-much-more/ revealed some string resources that hinted at this.

<string name="offline_header_text">Offline voice tips</string>
<string name="offline_on_start_cue_cards_header_listening">Offline</string>
<string name="offline_on_start_cue_cards_header_timeout">Offline voice tips</string>
<string name="offline_on_start_cue_cards_second_header_listening">You can still say...</string>
<string name="offline_on_start_cue_cards_second_header_timeout">You can still say "Ok Google," then...</string>
<string name="offline_on_start_cue_cards_second_header_timeout_without_hotword">You can still touch the mic, then say...</string>
<string name="offline_options_start_hotword_disabled">You can still touch the mic, then say...</string>
<string name="offline_options_start_hotword_enabled">You can still say "Ok Google," then...</string>
<string name="offline_error_card_title_text">Something went wrong.</string>
<string name="error_offline_no_connectivity">Check your connection and try again.</string>
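A teardown like this essentially comes down to scanning the app's string resources for telling names. The following Python sketch shows the idea. It assumes that the strings have been extracted into a standard Android `<resources>` file; the wrapper element, the `app_name` entry and the helper `find_strings` are assumptions added for illustration and are not part of the teardown.

```python
import xml.etree.ElementTree as ET

# A few of the strings from the teardown, wrapped in an assumed <resources>
# root element. The "app_name" entry is hypothetical and only serves as a
# non-matching example.
STRINGS_XML = """<resources>
<string name="offline_header_text">Offline voice tips</string>
<string name="offline_options_start_hotword_enabled">You can still say "Ok Google," then...</string>
<string name="error_offline_no_connectivity">Check your connection and try again.</string>
<string name="app_name">Google</string>
</resources>"""

def find_strings(xml_text: str, keyword: str) -> dict:
    """Return all string resources whose name contains the given keyword."""
    root = ET.fromstring(xml_text)
    return {
        element.get("name"): (element.text or "")
        for element in root.iter("string")
        if keyword in element.get("name", "")
    }

if __name__ == "__main__":
    for name, text in find_strings(STRINGS_XML, "offline").items():
        print(f"{name}: {text}")
```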

Until then, Google Now required an online connection to work. Since this is not always a given, it is beneficial to have a fallback for these cases. They found the following four options:
  • Make a call
  • Send a text
  • Play some music
  • Turn on Wi-Fi

Usually, such a teardown yields rumors rather than facts. In this case, the rumors proved to be true. In September 2015, this functionality was made available as reported, e.g., at http://www.androidpolice.com/2015/09/28/the-google-android-app-now-supports-limited-voice-commands-for-offline-use/. The way this is reflected in the UI is shown in the following picture.

Google Now in offline mode, taken from http://www.androidpolice.com/2015/09/28/the-google-android-app-now-supports-limited-voice-commands-for-offline-use/
So, some commands are also available when you are offline. This list is larger than the original one, but so far limited to:
  • Play Music
  • Open Gmail (works with any app name on the device)
  • Turn on Wi-Fi
  • Turn up the volume
  • Turn on the flashlight
  • Turn on airplane mode
  • Turn on Bluetooth
  • Dim the screen
Unfortunately, this is only true for the English version. For instance, it refuses to work in German, even if the English offline recognition has been downloaded to the device. Neither English nor German commands will work. Instead, the following screen is shown.
Google Now in offline mode for German on my Samsung Galaxy 5
It is still unclear when this will work. But Google seems to be advancing their embedded technology, as reported in Personalized Speech Recognition on Mobile Devices. Here, they describe a remarkable speed-up of their embedded speech recognizer. They state that their newest technology "...provides a 2× speed-up in evaluating our acoustic models as compared to the unquantized model, with only a small performance degradation". For open-ended dictation in an open domain, the word error rate (WER) increased from 12.9% to 13.5%. Moreover, they report a decrease of the footprint. Their acoustic model "...is compressed to a tenth of its original size". Apart from that, they still feature language model personalization through a combination of vocabulary injection and on-the-fly language model biasing.
In the end, they "built a system which runs 7× faster than real-time on a Nexus 5, with a total system footprint of 20.3 MB".
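To put the quoted figures into perspective, the following Python sketch does the arithmetic. The helper names `relative_change` and `processing_time` are made up for this illustration; only the numbers 12.9%, 13.5% and 7× come from the paper as quoted above.

```python
def relative_change(before: float, after: float) -> float:
    """Relative change from 'before' to 'after', as a fraction of 'before'."""
    return (after - before) / before

def processing_time(audio_seconds: float, speedup: float) -> float:
    """Time needed to recognize the audio when running 'speedup' times
    faster than real time."""
    return audio_seconds / speedup

if __name__ == "__main__":
    # WER rising from 12.9% to 13.5% is 0.6 points absolute,
    # which is about 4.7% relative.
    print(f"relative WER increase: {relative_change(12.9, 13.5):.1%}")
    # At 7x faster than real time, 10 s of speech take roughly 1.4 s to process.
    print(f"time for 10 s of audio: {processing_time(10.0, 7.0):.2f} s")
```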

The latter work addresses only the recognition task, as it is available in what they call "Voice Typing". It still needs the integration of NLU to turn the recognized text into actual commands for Google Now.

So, Google seems to be on the way towards a personal assistant that can also be used when you are not connected to the internet. Some of the commands may not make much sense if you are offline, but some will work, sooner or later. English is supported first, and it is unclear when other languages will follow. But it is a start.




Friday, February 19, 2016

Golden ages for NLU developers?

Currently, the landscape around NLU and AI is booming. Many startups are entering the market, trying to get a foot in the door that seems to be wide open, right now. The following figure, taken from an article at http://venturebeat.com/2016/02/14/intelligent-assistance-the-slow-growth-space-that-will-eventually-wow-us/, shows a snapshot of available artificial assistants in October 2015. And it is still growing...
Intelligent Assistance landscape, taken from http://venturebeat.com/2016/02/14/intelligent-assistance-the-slow-growth-space-that-will-eventually-wow-us/
Technology is improving rapidly, and new features and functionality keep arriving. At the same time, users' expectations towards speech technology grow. However, there are also some voices pointing out potential drawbacks of the current evolution. This technology "could leave half of the world unemployed", as stated by Moshe Vardi in http://www.theguardian.com/technology/2016/feb/13/artificial-intelligence-ai-unemployment-jobs-moshe-vardi. He expects that AI could wipe out 50% of the middle-class jobs in the next 30 years. He envisions a scenario similar to the one years ago, when automation hit the working class. Now it could be the middle class. The key lies in cognitive computing. IBM defines it in the context of their whitepaper about IBM Watson as "Cognitive Computing refers to systems that learn at scale, reason with purpose and interact with humans naturally. Rather than being explicitly programmed, they learn and reason from their interactions with us and from their experiences with their environment."

AI technology will change our lives for sure, and it is already doing so. Vardi's scenario is not far-fetched, but it is only one scenario. It is up to us what we make of it.

For now, this change leaves developers who are interested in playing around with this technology with a multitude of frameworks that may be used for free. There are so many players in this field that startups need to gain momentum. A common strategy is to open their API to the public with just a registration. No fee. I already mentioned some of them in my post about Nuance opening their NLU platform.

It is great to play around with speech technology, but relying on a particular supplier also carries several risks. Will the startup still be there when my product is ready for the market? Will the supplier change the conditions after they have gained sufficient momentum? ...

The last point happened, e.g., with Maluuba. They used to provide developers with access to their system through open source code that they had on GitHub. Maluuba removed the repository from GitHub, but there are still some fragments of their napi on the internet (who is in charge of cleaning the internet?). It compiles, but it requires registration at the Maluuba developer site, which has been shut down.

Don't get me wrong. This is completely OK if you want to earn money. They have a great product and they made it from a startup to a global player in a very short time. It just showcases the risks for developers who want to build products based on these offerings.

It looks like golden ages for NLU developers who want to play around with this technology, but it may be safer to think twice when you are going for actual products. This makes the landscape much smaller.

Wednesday, February 3, 2016

Almost 20th anniversary of "Voice recognition is ready for primetime"

For as long as I can remember, the voice industry has been announcing that "Voice recognition is ready for primetime", e.g., in an article from 1999: http://www.ahcmedia.com/articles/117677-is-voice-recognition-ready-for-prime-time. For a long time, I had the impression that there was not much improvement in the NIST ASR benchmark results.
NIST ASR benchmark results

All reported results seemed to be converging to some magic barrier that was still far from the human error rate. Recently, IBM reported some remarkable improvements on the Switchboard corpus: http://arxiv.org/pdf/1505.05899v1.pdf. Although they also rely on DNNs, they outperform current systems (~12-14% WER) and claim to achieve a WER of ~8%. So we are coming closer to human performance.
It actually took some time until speech really took off. The biggest advancements were clearly made with the advent of deep learning.
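For readers who are not familiar with the metric: the word error rate (WER) is the number of word substitutions, deletions and insertions needed to turn the recognizer's output into the reference transcript, divided by the number of words in the reference. A minimal Python sketch, assuming simple whitespace tokenization (benchmark scoring tools apply additional text normalization that is not shown here):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / words in reference,
    computed as a word-level edit distance."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    # One substitution (sat -> sit) and one deletion (the) in six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

A WER of ~8% thus means that roughly one in twelve reference words is affected by an error.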

This does not really seem to be true for NLU. Manning states in http://www.mitpressjournals.org/doi/pdf/10.1162/COLI_a_00239 that computational linguists should not worry, since NLU never experienced such a breakthrough when deep learning was applied.

Nevertheless, people started to notice the recent advancements. Apple and Google in particular did a good job of making the technology publicly available and usable. A recent survey from Parks Associates shows that speech products are used by more than 39% of smartphone users (http://www.parksassociates.com/360view/360-mobile-2015). Here, about 50% of Apple users are using voice, while only around 30% of Android phone users are. The researchers state that "Among smartphone users ages 18-24, 48% use voice recognition software, and use of the “Siri” voice recognition software among iPhone users increased from 40% to 52% between 2013 and 2015. This translates into 15% of all U.S. broadband households using Siri." So, the coming generation seems to appreciate the use of voice control.
http://mobilemarketingwatch.com/voice-recognition-on-the-rise-parks-associates-report-shows-40-percent-of-u-s-smartphone-owners-use-it-65002/ 

Maybe the speech industry has been promising for too long that voice is ready for primetime. Now the gain in performance seems to be reflected in actual usage. And it is increasing...

Monday, January 25, 2016

Silent Speech Recognition



A somewhat interesting read on how we will talk to computers in the near future is here: http://www.wired.com/2015/09/future-will-talk-technology/

Smartphones are still vision-centric devices, as already discussed in http://schnelle-walka.blogspot.com/2015/12/smartphones-as-explicit-devices-do-not.html. Currently, I see people holding their smartphones in front of their faces to initiate, e.g., voice queries starting with "OK Google" or similar. So there is no need to look at their phones to learn who the manufacturer is. Moreover, since speech is audible to everyone nearby, you will also learn their plans. A more subtle way seems to come with subvocalization. It exploits the fact that people tend to form words without speaking them out loud. Avoiding subvocalization is also one of the tricks to speed up reading: http://marguspala.com/speed-reading-first-impressions-are-positive/
Subvocalization slows down your reading speed
It is still an ongoing research topic in HCI, but I wonder how mature it is. Will it be useful at all? Or will we get used to people talking to their phones, gadgets or whatever in the same way that we got used to people having a phone call while they are walking?

Another interesting alternative comes with silent speech recognition. Denby et al. define it in Silent Speech Interfaces as follows: Silent speech recognition systems (SSRSs) enable speech communication to take place when an audible acoustic signal is unavailable. Usually, these techniques employ brain-computer interfaces (BCI) as a source of information. The following figure, taken from Yamaguchi et al., Decoding Silent Speech in Japanese from Single Trial EEGs: Preliminary Results, illustrates the scenario.
Experimental setup for SSI from Yamaguchi et al.


In their research they investigated, among other things, how to differentiate the Japanese words for spring, summer, autumn and winter. They were able to show that this setup works, but the results are still far from being usable.

But does this kind of interface make sense at all? It might be useful in scenarios where noise is omnipresent and defeats all efforts towards traditional speech recognition. A car is one example. However, the apparatus needs to be less intrusive for this.