14 December 2017

Improving End-to-End Models For Speech Recognition




Traditional automatic speech recognition (ASR) systems, used for a variety of voice search applications at Google, comprise an acoustic model (AM), a pronunciation model (PM) and a language model (LM), all of which are independently trained, and often manually designed, on different datasets [1]. AMs take acoustic features and predict a set of subword units, typically context-dependent or context-independent phonemes. Next, a hand-designed lexicon (the PM) maps a sequence of phonemes produced by the acoustic model to words. Finally, the LM assigns probabilities to word sequences. Training independent components creates added complexities and is suboptimal compared to training all components jointly. Over the last several years, there has been growing interest in developing end-to-end systems, which attempt to learn these separate components jointly as a single system. While these end-to-end models have shown promising results in the literature [2, 3], it is not yet clear if such approaches can improve on current state-of-the-art conventional systems.
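To make the hand-off between the three conventional components concrete, here is a minimal toy sketch in Python. Everything in it is an assumption made for illustration: the lexicon entries, the bigram probabilities, the `UNSEEN_PROB` floor, and the premise that the acoustic model output is already grouped into word-sized phoneme sequences. It is not how a production decoder works.

```python
import math
from itertools import product

# Hypothetical pronunciation lexicon (the PM): phoneme sequence -> candidate words.
LEXICON = {
    ("t", "uw"): ["two", "too", "to"],
    ("l", "ay", "t", "s"): ["lights"],
}

# Hypothetical bigram language model (the LM): P(word | previous word).
BIGRAM_PROB = {
    ("<s>", "two"): 0.5, ("two", "lights"): 0.4,
    ("<s>", "too"): 0.3, ("too", "lights"): 0.01,
    ("<s>", "to"): 0.2, ("to", "lights"): 0.05,
}
UNSEEN_PROB = 1e-6  # crude floor for bigrams missing from the table


def lm_log_prob(words):
    """Sum of bigram log-probabilities, starting from the sentence-start token."""
    history = ("<s>",) + tuple(words[:-1])
    return sum(math.log(BIGRAM_PROB.get(pair, UNSEEN_PROB))
               for pair in zip(history, words))


def decode(phoneme_groups):
    """Expand each phoneme group through the lexicon, then keep the word
    sequence that the language model scores highest."""
    candidates = product(*(LEXICON[group] for group in phoneme_groups))
    return max(candidates, key=lm_log_prob)


am_output = [("t", "uw"), ("l", "ay", "t", "s")]  # stand-in for AM predictions
print(decode(am_output))  # ('two', 'lights')
```

The lexicon alone cannot choose between the homophones "two", "too" and "to"; only the separately trained LM resolves the ambiguity. An end-to-end model aims to learn this interaction jointly.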

Today we are excited to share “State-of-the-art Speech Recognition With Sequence-to-Sequence Models” [4], which describes a new end-to-end model that surpasses the performance of a conventional production system [1]. We show that our end-to-end system achieves a word error rate (WER) of 5.6%, which corresponds to a 16% relative improvement over a strong conventional system that achieves a 6.7% WER. Additionally, the end-to-end model used to output the initial word hypothesis, before any hypothesis rescoring, is 18 times smaller than the conventional model, as it contains no separate LM and PM.
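For readers unfamiliar with the metric, the sketch below shows how a WER and the quoted relative improvement can be computed. It uses the standard definition (word-level edit distance divided by the number of reference words). The function names and the example sentence are made up for illustration, and the post does not describe the exact scoring tool that was used.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                           # deletion
                curr[j - 1] + 1,                       # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_improvement(baseline_wer: float, new_wer: float) -> float:
    """Fraction of the baseline error rate that the new system removes."""
    return (baseline_wer - new_wer) / baseline_wer


print(word_error_rate("turn the lights on", "turn the light on"))  # 0.25
print(round(relative_improvement(6.7, 5.6), 3))                     # 0.164
```

With the numbers from the post, (6.7 - 5.6) / 6.7 is about 0.164, which is the 16% relative improvement quoted above.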

Our system builds on the Listen-Attend-Spell (LAS) end-to-end architecture, first presented in [2]. The LAS architecture consists of three components. The listener (encoder) component, which is similar to a standard AM, takes a time-frequency representation of the input speech signal, x, and uses a set of neural network layers to map the input to a higher-level feature representation, h^enc. The output of the encoder is passed to an attender, which uses h^enc to learn an alignment between the input features x and the predicted subword units {y_n, …, y_0}, where each subword is typically a grapheme or wordpiece. Finally, the output of the attention module is passed to the speller (i.e., the decoder), which is similar to an LM and produces a probability distribution over a set of hypothesized words.
Components of the LAS End-to-End Model.
All components of the LAS model are trained jointly as a single end-to-end neural network, instead of as separate modules as in conventional systems, which makes training much simpler.
Additionally, because the LAS model is fully neural, there is no need for external, manually designed components such as finite state transducers, a lexicon, or text normalization modules. Finally, unlike conventional models, end-to-end models do not require bootstrapping from decision trees or time alignments generated by a separate system; they can be trained given only pairs of text transcripts and the corresponding acoustics.
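The data flow through the three LAS components described above can be sketched in a few lines of plain Python. This is a deliberately tiny stand-in and not the model from [2] or [4]: the "encoder" just averages adjacent frames, the attention is a plain dot product, the "decoder" is a single scoring step with no recurrence, and all vectors, names and dimensions are invented for the example.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def listen(x):
    """Toy encoder: average adjacent frames, halving the time resolution."""
    pairs = [x[i:i + 2] for i in range(0, len(x), 2)]
    return [[sum(col) / len(pair) for col in zip(*pair)] for pair in pairs]


def attend(h_enc, query):
    """Dot-product attention: weights over encoder frames and their weighted sum."""
    weights = softmax([dot(h, query) for h in h_enc])
    context = [sum(w * h[d] for w, h in zip(weights, h_enc))
               for d in range(len(query))]
    return weights, context


def spell(context, output_embeddings):
    """Toy decoder step: a distribution over subword units given the context."""
    return softmax([dot(context, e) for e in output_embeddings])


x = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]  # four fake feature frames
h_enc = listen(x)                                      # two encoded frames
weights, context = attend(h_enc, query=[2.0, 0.0])     # query stands in for decoder state
probs = spell(context, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])
print(h_enc, weights, probs)
```

Every step here is a differentiable function of its inputs, which is what allows the real model to be trained jointly as a single network rather than as separate modules.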

In [4], we introduce a variety of novel structural improvements, including improving the attention vectors passed to the decoder and training with longer subword units (i.e., wordpieces). We also introduce numerous optimization improvements for training, including the use of minimum word error rate training [5]. Together, these structural and optimization improvements account for the 16% relative improvement over the conventional model.
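The post does not spell out the minimum word error rate objective. The sketch below shows one common way to write an expected-word-error loss over an N-best list: renormalize the model's scores over the list and weight each hypothesis's error count, measured relative to the list average. Treat the formulation, the function name and the numbers as assumptions made for illustration, not as the exact recipe of [5].

```python
import math


def expected_word_error_loss(log_probs, word_errors):
    """Expected number of word errors over an N-best list.

    log_probs:   model log-probability of each hypothesis
    word_errors: word errors of each hypothesis against the reference
    """
    # Renormalize the hypothesis probabilities so they sum to 1 over the list.
    peak = max(log_probs)
    exps = [math.exp(lp - peak) for lp in log_probs]
    total = sum(exps)
    posteriors = [e / total for e in exps]
    # Subtracting the mean error count leaves the gradient unchanged
    # but reduces its variance.
    mean_errors = sum(word_errors) / len(word_errors)
    return sum(p * (w - mean_errors) for p, w in zip(posteriors, word_errors))


errors = [0, 1, 3]  # hypothetical error counts for three hypotheses
before = expected_word_error_loss([math.log(p) for p in (0.5, 0.3, 0.2)], errors)
after = expected_word_error_loss([math.log(p) for p in (0.8, 0.1, 0.1)], errors)
print(before, after)
```

Moving probability mass toward the zero-error hypothesis lowers the loss, so minimizing it optimizes the model directly for the metric it is evaluated on rather than for per-token likelihood.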

Another exciting potential application for this research is multi-dialect and multi-lingual systems, where the simplicity of optimizing a single neural network makes such a model very attractive. Here data for all dialects/languages can be combined to train one network, without the need for a separate AM, PM and LM for each dialect/language. We find that these models work well on 7 English dialects [6] and 8 Indian languages [7], while outperforming a model trained separately on each individual language/dialect.

While we are excited by our results, our work is not done. Currently, these models cannot process speech in real time [8, 9, 10], which is a strong requirement for latency-sensitive applications such as voice search. In addition, these models still compare unfavorably with the production system when evaluated on live production data. Furthermore, our end-to-end model is trained on 22,000 audio-text utterance pairs, whereas a conventional system is typically trained on significantly larger corpora. Finally, our proposed model is not able to learn proper spellings for rarely used words such as proper nouns, a task normally handled by a hand-designed PM. Our ongoing efforts now focus on addressing these challenges.

Acknowledgements
This work was done as a strong collaborative effort between the Google Brain and Speech teams. Contributors include Tara Sainath, Rohit Prabhavalkar, Bo Li, Kanishka Rao, Shankar Kumar, Shubham Toshniwal, Michiel Bacchiani and Johan Schalkwyk from the Speech team, as well as Yonghui Wu, Patrick Nguyen, Zhifeng Chen, Chung-cheng Chiu, Anjuli Kannan, Ron Weiss and Navdeep Jaitly from the Google Brain team. The work is described in more detail in papers [4-11].

References
[1] G. Pundak and T. N. Sainath, “Lower Frame Rate Neural Network Acoustic Models,” in Proc. Interspeech, 2016.

[2] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell,” CoRR, vol. abs/1508.01211, 2015.

[3] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A Comparison of Sequence-to-sequence Models for Speech Recognition,” in Proc. Interspeech, 2017.

[4] C.C. Chiu, T.N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R.J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski and M. Bacchiani, “State-of-the-art Speech Recognition With Sequence-to-Sequence Models,” submitted to ICASSP 2018.

[5] R. Prabhavalkar, T.N. Sainath, Y. Wu, P. Nguyen, Z. Chen, C.C. Chiu and A. Kannan, “Minimum Word Error Rate Training for Attention-based Sequence-to-Sequence Models,” submitted to ICASSP 2018.

[6] B. Li, T.N. Sainath, K. Sim, M. Bacchiani, E. Weinstein, P. Nguyen, Z. Chen, Y. Wu and K. Rao, “Multi-Dialect Speech Recognition With a Single Sequence-to-Sequence Model,” submitted to ICASSP 2018.

[7] S. Toshniwal, T.N. Sainath, R.J. Weiss, B. Li, P. Moreno, E. Weinstein and K. Rao, “End-to-End Multilingual Speech Recognition using Encoder-Decoder Models,” submitted to ICASSP 2018.

[8] T.N. Sainath, C.C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen and Z. Chen, “Improving the Performance of Online Neural Transducer Models,” submitted to ICASSP 2018.

[9] D. Lawson*, C.C. Chiu*, G. Tucker*, C. Raffel, K. Swersky and N. Jaitly, “Learning Hard Alignments with Variational Inference,” submitted to ICASSP 2018.

[10] T.N. Sainath, R. Prabhavalkar, S. Kumar, S. Lee, A. Kannan, D. Rybach, V. Schogol, P. Nguyen, B. Li, Y. Wu, Z. Chen and C.C. Chiu, “No Need for a Lexicon? Evaluating the Value of the Pronunciation Lexica in End-to-End Models,” submitted to ICASSP 2018.

[11] A. Kannan, Y. Wu, P. Nguyen, T.N. Sainath, Z. Chen and R. Prabhavalkar, “An Analysis of Incorporating an External Language Model into a Sequence-to-Sequence Model,” submitted to ICASSP 2018.

Facebook pushes pre-roll ads on Watch as it stops subsidizing Live


Everyone’s least favorite ads are coming to Facebook, but six second pre-rolls will only appear on original Watch tab videos you purposefully view and not in the News Feed. Facebook is embracing pre-rolls after years of shunning them as it tries to make pay outs to video creators sustainable. Facebook’s head of video Fidji Simo tells TechCrunch that it will not renew direct…

Best Amazon Alexa Voice Commands for Phillips Hue


The Philips Hue personal wireless lighting system is an excellent way to make your dumb lightbulbs smart, but wouldn’t it be cool if you could talk to your lights? You know, say something like “Howdy there, if it’s not too much trouble, could you turn the lights on?” Well, thanks to Amazon Echo and Alexa, you can! Today I’ll be showing you the best voice commands for Hue and Alexa, although many of these commands will also work with Google...


A new version of Mixer, Microsoft’s Twitch rival, hits iOS and Android


Microsoft today is officially launching a new version of its Mixer mobile gameplay streaming app, its Twitch rival. The app, which is initially available on Android with iOS arriving soon, was first introduced into beta testing this fall, with a focus on improvements to its overall user experience, content discovery, performance and personalization features. For example, the beta build…

Google adds price tracking and deals to Google Flights, Google Trips and hotel search


Google today is expanding its booking features for travelers using Google services including Trips, Flights and hotel search, with a focus on helping people find better rates. For example, Google can now tell you when’s the best time to buy an airline ticket or see when room rates are higher, among other things. These price-tracking features are similar to those that some other travel…

How to Dictate Email in Microsoft Outlook


It’s time to speed up how you write emails. If you struggle typing quickly, dictating email can help boost your productivity. We’re going to show you Dictate, which integrates straight into Outlook. Dictate is a utility developed by Microsoft and also works with other Office programs. You just plug your microphone in, click a button, and start talking. Everything you say is then transcribed. If you use Dictate or have a different speech-to-text software that you use, please let us know in the comments. About Dictate Microsoft Garage is a division of Microsoft that allows employees to work on their own projects...


10 Environmental Games That Teach Kids About Earth, Ecology & Conservation


The kids of today will inherit the Earth of tomorrow. They will also be left to clean up the mess we leave behind today. With the right ecological education, we can hope they will hit the ground running. A lot of schools and educational institutions are doing their bit by including the environment as part of the curriculum. Words like “carbon footprint” and “global warming” come to them as easily as the name of any present-day rock star. Technology has a trick. Environmental education can be taken out of musty textbooks and turned into interactive games. Like any other strategy game, kids...


Snapchat launches augmented reality developer platform Lens Studio


Snapchat is finally opening up so outside developers can help it offer infinite augmented reality experiences beyond those it designs in-house. Today Snap launches the Lens Studio AR developer tool for desktops so anyone can create World Lenses that place interactive, imaginary 3D objects in your photos and videos. But brands, news publishers, and developers will have to promote their own…

A week on the wrist with the Alpina Startimer


It’s refreshing to wear a mechanical watch. The soft sweep of the seconds hand reminds us of the fleeting nature of time while the endless ticking in a dark room is a comfort and a spur to action. Add in a little limited edition provenance with big face and crown and you’ve got a stew going. This particular stew is called the Alpina Startimer. It is a pilot’s watch, a watch…

9 Safari Settings You Should Change for a Better Browsing Experience


A lot of Mac enthusiasts prefer using Safari over Chrome, thanks to its low battery consumption. With macOS High Sierra, Safari is better than ever before, but as with all browsers there are quirks that need fixing. Fortunately enough, most of the browser’s annoying quirks can be fixed by making a quick visit to Safari’s settings. Others need a little more work, but we’ll cover those step by step. Ready to upgrade your Safari browsing experience on your Mac? Let’s begin with fixing Safari’s default preferences first. 1. Enable Link Preview in the Status Bar When browsing through websites, you come across all sorts of hyperlinks. Some links...


Trim, Cut, or Split a Video the Quickest Way for Free


With phones, DSLRs, and GoPros we’re all shooting more video than ever. But it rarely comes straight out of the camera in perfect shape. Often you’ll need to do a little editing of a video before showing or sharing it. You might want to trim a few seconds off the start or end, or cut it so that it’s a more shareable length. Fortunately, you don’t need any editing skills to do this. You just need the right software — and you’ve probably already got it installed on your computer. So let’s take a look at the quickest way to trim videos...


Still Using Internet Explorer? 9 Questions Answered


Internet Explorer isn’t the most popular browser, and Microsoft has stopped adding new features to it. Yet it’s still used by many today. Whether you’re forced to use it for work or just love Internet Explorer for personal reasons, you should know how to use it effectively. Here, we’ve gathered some frequently asked questions about Internet Explorer (IE). Read on to find easy answers for some of the most important functions of the browser. 1. What Is the Latest Version of Internet Explorer? The newest (and last) version of Internet Explorer is version 11. Only Windows 7, 8.1, and 10...


How to Use Vineyard to Run Windows Apps on Linux


Linux is an awesome operating system landscape with a smattering of distributions for all purposes. There are tons of motivations for switching, from monetary savings to learning new skills, and simply supporting the open-source community. However, when switching from Windows to Linux, certain programs cease to function properly. Wine is an application compatibility layer which allows users to run Windows apps on Linux or Mac. Using Wine frontends such as PlayOnLinux or Vineyard significantly simplifies installing and running Windows apps on Linux with Wine. Learn all about Vineyard, including what it is and how to install it. What Is...


Azulle Byte 3 Review: This Tiny, Fanless Mini PC Does Everything


Our verdict of the Azulle Byte 3 Mini PC: The Byte 3 is a perfect media center and general computing device for most people's needs, with an attractive design and running the full Windows 10. However, the included remote could have been much better, and the raw performance is lacklustre. With US-based support, we think the $200 price point is just about right. 8/10. From US-based Azulle Tech comes the latest in a line of fanless mini PCs: the Byte 3. Retailing at around $200 for the base model, that includes a full edition of Windows 10 Pro. Is it good enough for...


Twitter Makes It Easy to Create a Thread of Tweets


Twitter has long been hamstrung by the limited number of characters afforded to its users. The 140 character limit was born out of necessity, but became both a blessing and a curse. Twitter recently increased its character limit to 280, and it’s now making it easier to thread tweets. Threads have always been possible on Twitter, with people replying to themselves with numbered tweets. These have come to be known as tweetstorms because they usually read as angry rants on a subject close to the person’s heart. And now Twitter is making tweetstorms official. Creating a Tweetstorm Is Now Easy...


How to View and Delete All Your Windows 10 Activity History


Windows 10 collects and saves your activity history both on your computer and to the cloud, from browsing history to location information to everything in between. Luckily Microsoft makes it easy to see all the data that is being stored, and also makes it easy to delete it. What Data Does Windows 10 Track? The data that Windows collects includes Edge browsing history, Bing search history, location data (if it’s enabled), and Cortana voice commands. If you use Microsoft’s HealthVault or the Microsoft Band device, any activity collected through that service is also stored. Microsoft says it collects this data in order...


How to Delete Your Amazon Echo Voice Data


Does the Amazon Echo eavesdrop on conversations? Is our chit-chat a privacy disaster in the making? Should we be worried about what Amazon is learning about us in our very homes? Short answer: No. The Amazon Echo does NOT eavesdrop on pillow talk. The far-field communication technology DOES pick up voice commands and connects to the voice-controlled intelligent personal assistant service called Alexa. The voice requests are processed in the cloud and the results are delivered to the device. And those voice requests are saved. The Amazon Echo only records and stores the wake word and the voice command that follows. You can see the complete record of...


How to Check Amazon Seller Feedback and Not Get Scammed


Compared to other online marketplaces, Amazon is exceptionally reliable. Though sites such as eBay and Alibaba do have refund policies, Amazon’s is perhaps the most comprehensive and buyer-friendly. But that doesn’t mean that things can’t go wrong. Scammers are everywhere and the sheer number of users on Amazon makes the site an attractive proposition. Even though your money is relatively safe even if you are scammed, nobody really wants the hassle of dealing with the claims system. It’s much easier to do your due diligence before hitting “Buy.” Thankfully, Amazon makes it easy. Here’s how to check seller feedback on Amazon...


Crunch Report | Glow-in-the-Dark Plants


Blue Origin’s Crew Capsule 2.0 takes first flight, scientists at MIT make glow-in-the-dark plants and Google is opening an AI center in China. All this on Crunch Report.

Plenty of Fish adds new conversation features to differentiate itself from Tinder


Match Group, which houses a large portfolio of dating app brands – including most notably, Tinder, Match, and OKCupid – is prepping a notable upgrade to one of its older brands: Plenty of Fish. The dating service, often dubbed ‘POF’ by its users, was founded in 2003 then sold to Match Group in 2015 for $575 million. But it has since remained fairly quiet, in terms of…

Half of Amazon app users have been switched to a new, swipe-based 1-Click checkout


Has Amazon’s ‘one-click’ checkout on mobile looked a little different to you lately? A number of users of Amazon’s mobile applications have recently reported seeing a new checkout option that replaces the click – well, on mobile, the tap – with a swipe instead. As it turns out, the option is part of a fairly large-scale test Amazon has underway. Currently…

The Top 10 Things Everybody Googled in 2017


When people want to know the answer to something they invariably Google it. But what have people been Googling in 2017? Google has published various lists detailing what people have Googled over the past 12 months, including overall search terms, consumer tech, and memes. These lists, when taken collectively, provide an interesting insight into the state of the world right now. They show what piqued people’s interests, what made the news, which celebrities were hot, what sporting events got people talking, and which TV shows people watched. The Search Terms That Trended in 2017 The first, and most important, list...


Gfycat wants to fix your low-fidelity GIFs with machine learning


We all love to share GIFs, and there are plenty of ways to do that, through online portals or keyboards, but oftentimes, because there is so much content, you’ll end up surfacing a lower-fidelity GIF. There can be plenty of copies of the same video clips as a GIF, or maybe it’s just difficult to capture and upload, but Gfycat hopes that it can be solved at a…

A Summary of the First Conference on Robot Learning




Whether in the form of autonomous vehicles, home assistants or disaster rescue units, robotic systems of the future will need to be able to operate safely and effectively in human-centric environments. In contrast to their industrial counterparts, they will require a very high level of perceptual awareness of the world around them, and will need to adapt to continuous changes in both their goals and their environment. Machine learning is a natural answer to both the problems of perception and generalization to unseen environments, and with the recent rapid progress in computer vision and learning capabilities, applying these new technologies to the field of robotics is becoming a central research question.

This past November, Google helped kickstart and host the First Conference on Robot Learning (CoRL) at our campus in Mountain View. The goal of CoRL was to bring machine learning and robotics experts together for the first time in a single-track conference, in order to foster new research avenues between the two disciplines. The sold-out conference attracted 350 researchers from many institutions worldwide, who collectively presented 74 original papers, along with 5 keynotes by some of the most innovative researchers in the field.
Prof. Sergey Levine, CoRL 2017 co-chair, answering audience questions.
Sayna Ebrahimi (UC Berkeley) presenting her research.
Videos of the inaugural CoRL are available on the conference website. Additionally, we are delighted to announce that next year, CoRL moves to Europe! CoRL 2018 will be chaired by Professor Aude Billard from the École Polytechnique Fédérale de Lausanne, and will tentatively be held at the Eidgenössische Technische Hochschule (ETH) in Zürich on October 29th-31st, 2018. We look forward to seeing you there!
Prof. Ken Goldberg, CoRL 2017 co-chair, and Jeffrey Mahler (UC Berkeley) during a break.