Amazon Trains Alexa on GPUs to Better Handle Complex Queries

To improve how natural language processing (NLP) systems such as Alexa handle complex requests, Amazon researchers, in collaboration with the University of Massachusetts Amherst, developed a deep learning-based sequence-to-sequence model that handles both simple and complex queries.

“Virtual assistants such as Amazon Alexa, Apple Siri, and Google Assistant often rely on a semantic parsing component to understand which action(s) to execute for an utterance spoken by its users,” the researchers stated in a paper published on arXiv, Don’t Parse, Generate! A Sequence to Sequence Architecture for Task-Oriented Semantic Parsing.

Based on a sequence-to-sequence model and a pointer-generator network, the approach does not impose restrictions on the semantic parsing schema. The researchers say it achieves state-of-the-art performance, with improvements between 3.3% and 7.7%.

“A major part of any voice assistant is a semantic parsing component designed to understand the action requested by its users: given the transcription of an utterance, a voice assistant must identify the action requested by a user (play music, turn on lights, etc.), as well as parse any entities that further refine the action to perform (which song to play? which lights to turn on?),” the research team explained. 

Despite major breakthroughs in NLP systems, this task remains a challenging one, the team said. This is due to the sheer number of possible combinations that a user can express in a voice command. 


The architecture: a sequence-to-sequence model with a pointer-generator network (Seq2Seq-Ptr). The model is currently decoding the symbol after MediaType( by looking at the scores over the tagging vocabulary and the attention over the source pointers. It generates ptr2 because that has the highest overall score.

The architecture consists of a pretrained BERT model as the encoder and a transformer-based decoder augmented with the pointer-generator network. This allows the model to generate pointers to the source sequence in the target sequence.
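The decoding step described above can be illustrated with a minimal Python sketch. It is not the researchers' implementation: it assumes the scores over the tagging vocabulary and the attention scores over the source tokens are concatenated and normalized with a single softmax, and the vocabulary symbols and score values are hypothetical, chosen only to mirror the ptr2 example.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def decode_step(vocab, vocab_scores, pointer_scores):
    """Choose the next target symbol from the tagging vocabulary or the source pointers.

    Assumption: both score lists are concatenated and share one softmax,
    so the highest overall score wins.
    """
    combined = list(vocab_scores) + list(pointer_scores)
    probs = softmax(combined)
    best = max(range(len(probs)), key=probs.__getitem__)
    if best < len(vocab):
        return vocab[best], probs[best]
    # Indices past the vocabulary refer to source positions (0-indexed here).
    return f"ptr{best - len(vocab)}", probs[best]

# Hypothetical tagging vocabulary and scores for the step after "MediaType(".
vocab = ["PlayIntent(", ")PlayIntent", "MediaType(", ")MediaType"]
vocab_scores = [0.1, 0.3, -1.0, 0.5]
pointer_scores = [0.2, 0.4, 2.5, 0.0]  # attention over four source tokens

symbol, prob = decode_step(vocab, vocab_scores, pointer_scores)
print(symbol)
```

In this toy setup the third source token has the largest score, so the step emits a pointer rather than a vocabulary symbol.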


Figure 1: Semantic parsing of a “simple” query. Simple queries define a single action (intent) and can be decomposed into a set of non-overlapping entities (slots). Source: Amazon
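As a rough illustration of how pointers in the target sequence refer back to the source utterance, the Python sketch below replaces each pointer with the word it points to. The utterance, the intent and slot names, and the 0-indexed `ptrN` convention are all assumptions made for this example, not details taken from the paper.

```python
def resolve_pointers(source_tokens, target_symbols):
    """Replace each 'ptrN' symbol with the N-th source token (0-indexed, assumed)."""
    resolved = []
    for sym in target_symbols:
        if sym.startswith("ptr") and sym[3:].isdigit():
            resolved.append(source_tokens[int(sym[3:])])
        else:
            # Intent and slot brackets come from the tagging vocabulary as-is.
            resolved.append(sym)
    return resolved

# Hypothetical simple query: one intent, two non-overlapping slots.
source = "play the song hello".split()
target = [
    "PlayIntent(",
    "MediaType(", "ptr2", ")MediaType",
    "SongName(", "ptr3", ")SongName",
    ")PlayIntent",
]

parse = resolve_pointers(source, target)
print(" ".join(parse))
```

Because the decoder emits positions rather than words, the target vocabulary only needs the intent and slot symbols, however large the set of possible song or device names is.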

On the training side, all the models are trained using eight NVIDIA V100 GPUs, each with 16 GB memory, on the Amazon Web Services cloud. 

In another arXiv paper by Amazon researchers, Pre-training For Query Rewriting In A Spoken Language Understanding System, the team also used NVIDIA V100 GPUs with PyTorch to train an AI system to automatically rewrite complex queries and improve Alexa’s understanding of those queries.

Both papers highlight the ongoing research to accelerate NLP systems with the end goal of improving tools such as Alexa. 
