Accepted Papers: Industry Track
- A Novel BERT based Distillation Method for Ensemble Ranking Model
  Wangshu Zhang, Junhong Liu, Zujie Wen, Yafang Wang and Gerard de Melo
- An Empirical Study on Multi-Task Learning for Text Style Transfer and Paraphrase Generation
  Pawel Bujnowski, Kseniia Ryzhova, Hyungtak Choi, Katarzyna Witkowska, Jaroslaw Piersa, Tymoteusz Krumholc and Katarzyna Beksa
- An Industry Evaluation of Embedding-based Entity Alignment
  Ziheng Zhang, Hualuo Liu, Jiaoyan Chen, xi chen, Bo Liu, YueJia Xiang and Yefeng Zheng
- Applying Model-agnostic Methods to Handle Inherent Noise in Large Scale Text Classification
  Kshitij Tayal, Rahul Ghosh and Vipin Kumar
- Assessing Social License to Operate from the Public Discourse on Social Media
  Chang Xu, Cecile Paris, Ross Sparks, Surya Nepal and Keith VanderLinden
- Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data
  Ankit Arun, Soumya Batra, Vikas Bhardwaj, Ashwini Challa, Pinar Donmez, Peyman Heidari, Hakan Inan, Shashank Jain, Anuj Kumar, Shawn Mei, Karthik Mohan and Michael White
- Data-Efficient Paraphrase Generation to Bootstrap Intent Classification and Slot Labeling for New Features in Task-Oriented Dialog Systems
  Shailza Jolly, Tobias Falke, Caglar Tirkaz and Daniil Sorokin
- Delexicalized Paraphrase Generation
  Boya Yu, Konstantine Arkoudas and Wael Hamza
- Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models
  Lizhen Tan and Olga Golovneva
- Extreme Model Compression for On-device Natural Language Understanding
  Kanthashree Mysore Sathyendra, Samridhi Choudhary and Leah Nicolich-Henkin
- Graph Regularized Graph Convolutional Networks for Short Text Classification using Side Information
  Kshitij Tayal, Nikhil Rao, Saurabh Agarwal, Xiaowei Jia, Karthik Subbian and Vipin Kumar
- hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Text Normalization
  Piyush Makhija, Ankit Kumar and Anuj Gupta
- Interactive Question Clarification in Dialogue via Reinforcement Learning
  Xiang Hu, Zujie Wen, Yafang Wang, Xiaolong Li and Gerard de Melo
- Learning Domain Terms - Empirical Methods to Enhance Enterprise Text Analytics Performance
  Gargi Roy, Lipika Dey, Mohammad Shakir and Tirthankar Dasgupta
- Leveraging User Paraphrasing Behavior In Dialog Systems To Automatically Collect Annotations For Long-Tail Utterances
  Tobias Falke, Markus Boese, Daniil Sorokin, Caglar Tirkaz and Patrick Lehnen
- Misspelling Detection from Noisy Product Images
  Varun Nagaraj Rao and Mingwei Shen
- Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention
  Mingda Li, Xinyue Liu, Weitong Ruan, Luca Soldaini, Wael Hamza and Chengwei Su
- Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers
  Yousef El-Kurdi, Hiroshi Kanayama, Efsun Sarioglu Kayi, Vittorio Castelli, Todd Ward and Radu Florian
- ScopeIt: Scoping Task Relevant Sentences in Documents
  Barun Patra, Vishwas Suryanarayanan, Chala Fufa, Pamela Bhattacharya and Charles Lee
- Semantic Diversity for Natural Language Understanding Evaluation in Dialog Systems
  Enrico Palumbo, Andrea Mezzalira, Cristina Marco, Alessandro Manzotti and Daniele Amberti
- Towards building a Robust Industry-scale Question Answering System
  Rishav Chakravarti, Anthony Ferritto, Bhavani Iyer, Lin Pan, Radu Florian, Salim Roukos and Avi Sil
- Uncertainty Modeling for Machine Comprehension Systems using Efficient Bayesian Neural Networks
  Zhengyuan Liu, Pavitra Krishnaswamy, AiTi Aw and Nancy Chen
A Novel BERT based Distillation Method for Ensemble Ranking Model
Wangshu Zhang, Junhong Liu, Zujie Wen, Yafang Wang and Gerard de Melo
Recent years have witnessed rapid development of neural ranking networks. Meanwhile, the computational burden has become extremely heavy due to the ever-growing size of ranking models as well as the adoption of model ensembles. Knowledge Distillation (KD) is a common approach to balance effectiveness and efficiency; however, it is not straightforward to apply KD to ranking problems. Ranking Distillation (RD) was proposed to address this issue but has only shown effectiveness on recommendation problems. We propose a novel two-stage distillation method for ranking problems, with which a smaller student model can be trained that benefits from the better performance of the teacher model while controlling inference latency and computational burden. A novel BERT-based model structure for ranking tasks is designed as our student model. All ranking candidates are fed into the BERT model at the same time, so the self-attention mechanism allows the model to see more information when ranking a document list. Experiments show the advantages of our method: both performance and inference latency are better than those of the teacher model.
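As background for readers less familiar with distillation: the standard recipe trains the student on a weighted mix of a soft cross-entropy against the teacher's temperature-smoothed output distribution and the usual hard-label cross-entropy. The following is a minimal illustrative sketch in plain Python, not the paper's method; the temperature and mixing weight are hypothetical defaults.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature scaling; a higher temperature softens
    the distribution so the student sees the teacher's relative
    preferences between classes, not just the argmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, gold_label,
                      temperature=2.0, alpha=0.5):
    """Weighted sum of (a) soft cross-entropy between the student's and
    teacher's temperature-smoothed distributions and (b) hard
    cross-entropy against the gold label."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft_ce = -sum(t * math.log(s) for t, s in zip(p_teacher, p_student))
    hard_ce = -math.log(softmax(student_logits)[gold_label])
    return alpha * soft_ce + (1.0 - alpha) * hard_ce
```

The paper's two-stage variant adapts this general idea to list-wise ranking, where candidates are scored jointly rather than classified independently.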
An Empirical Study on Multi-Task Learning for Text Style Transfer and Paraphrase Generation
Pawel Bujnowski, Kseniia Ryzhova, Hyungtak Choi, Katarzyna Witkowska, Jaroslaw Piersa, Tymoteusz Krumholc and Katarzyna Beksa
The topic of this paper is neural multi-task training for text style transfer. We present an efficient method for neutral-to-style transformation using the transformer framework. We demonstrate how to prepare a robust model utilizing large paraphrase corpora together with a small parallel style transfer corpus. We study how much style transfer data is needed for a model using two example transformations: neutral-to-cute on an internal corpus and modern-to-antique on the publicly available Bible corpora. Additionally, we propose a synthetic measure for the automatic evaluation of style transfer models. We hope our research is a step towards replacing common but limited rule-based style transfer systems with more flexible machine learning models for both public and commercial usage.
An Industry Evaluation of Embedding-based Entity Alignment
Ziheng Zhang, Hualuo Liu, Jiaoyan Chen, xi chen, Bo Liu, YueJia Xiang and Yefeng Zheng
Embedding-based entity alignment has been widely investigated in recent years, but most proposed methods still rely on an ideal supervised learning setting with a large number of unbiased seed mappings for training and validation, which significantly limits their usage. In this study, we evaluate those state-of-the-art methods in an industrial context, where the impact of seed mappings with different sizes and different biases is explored. Besides the popular benchmarks from DBpedia and Wikidata, we contribute and evaluate a new industrial benchmark that is extracted from two heterogeneous knowledge graphs (KGs) under deployment for medical applications. The experimental results enable the analysis of the advantages and disadvantages of these alignment methods and the further discussion of suitable strategies for their industrial deployment.
Applying Model-agnostic Methods to Handle Inherent Noise in Large Scale Text Classification
Kshitij Tayal, Rahul Ghosh and Vipin Kumar
Text classification is a fundamental problem, and recently, deep neural networks (DNNs) have shown promising results in many natural language tasks. However, their human-level performance relies on high-quality annotations, which are time-consuming and expensive to collect. As we move towards large inexpensive datasets, the inherent label noise degrades the generalization of DNNs. While most machine learning literature focuses on building complex networks to handle noise, in this work, we evaluate model-agnostic methods to handle inherent noise in large-scale text classification that can be easily incorporated into existing machine learning workflows with minimal interruption. Specifically, we conduct a point-by-point comparative study of several noise-robust methods on three datasets encompassing three popular classification models. To our knowledge, this is the first such comprehensive study in text classification covering popular models and model-agnostic loss methods. We describe our findings and demonstrate the application of our approach, which outperformed baselines by up to 10% in classification accuracy while requiring no network modifications.
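The abstract does not name the specific loss functions compared; generalized cross-entropy (Zhang and Sabuncu, 2018) is one widely used, representative example of the kind of model-agnostic, noise-robust loss it describes — a drop-in replacement requiring no network changes. A minimal sketch:

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def generalized_cross_entropy(logits, label, q=0.7):
    """GCE loss L_q = (1 - p_y^q) / q: interpolates between standard
    cross-entropy (q -> 0) and mean absolute error (q = 1), which is
    robust to uniform label noise. Unlike plain CE, the loss is bounded,
    so a single mislabeled example cannot dominate the gradient."""
    p_correct = softmax(logits)[label]
    return (1.0 - p_correct ** q) / q
```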
Assessing Social License to Operate from the Public Discourse on Social Media
Chang Xu, Cecile Paris, Ross Sparks, Surya Nepal and Keith VanderLinden
Organisations are monitoring their Social License to Operate (SLO) with increasing regularity. SLO, the level of support organisations have from the public, is typically assessed through surveys and focus groups, which require expensive manual efforts and yield results that quickly become outdated. In this paper, we present SIRTA (Social Insight via Real-Time Text Analytics), a novel real-time text analytics system for assessing and monitoring organisations' SLO levels by analysing the public discourse occurring on social media. To assess SLO levels, SIRTA performs a series of text classification tasks, leveraging recent language understanding techniques (e.g., BERT) for deriving opinions from the posts. To monitor the SLO levels over time, SIRTA employs quality control mechanisms to reliably identify the SLO trends and variations of multiple organisations simultaneously. We conducted a number of experiments to evaluate the SIRTA classifiers, which show that the system is effective in distilling opinions on social media for SLO level assessment. We present a case study that suggests that the continuous monitoring of SLO levels afforded by SIRTA enables the early detection of critical SLO variations.
Best Practices for Data-Efficient Modeling in NLG: How to Train Production-Ready Neural Models with Less Data
Ankit Arun, Soumya Batra, Vikas Bhardwaj, Ashwini Challa, Pinar Donmez, Peyman Heidari, Hakan Inan, Shashank Jain, Anuj Kumar, Shawn Mei, Karthik Mohan and Michael White
Natural language generation (NLG) is a critical component in conversational systems, owing to its role of formulating a correct and natural text response. Traditionally, NLG components have been deployed using template-based solutions. Although neural network solutions recently developed in the research community have been shown to provide several benefits, deployment of such model-based solutions has been challenging due to high latency, correctness issues, and high data needs. In this paper, we present approaches that have helped us deploy data-efficient neural solutions for NLG in conversational systems to production. We describe a family of sampling and modeling techniques to attain production quality with lightweight neural network models using only a fraction of the data that would otherwise be necessary, and present a thorough comparison of them. Our results show that domain complexity dictates the appropriate approach to achieve high data efficiency. Finally, we distill the lessons from our experimental findings into a list of best practices for production-level NLG model development, and present them in a brief runbook. Importantly, the end products of all of the techniques are small sequence-to-sequence models (~2MB) that we can reliably deploy in production. These models achieve the same quality as large pretrained models (~1GB) as judged by human raters.
Data-Efficient Paraphrase Generation to Bootstrap Intent Classification and Slot Labeling for New Features in Task-Oriented Dialog Systems
Shailza Jolly, Tobias Falke, Caglar Tirkaz and Daniil Sorokin
Recent progress through advanced neural models pushed the performance of task-oriented dialog systems to almost perfect accuracy on existing benchmark datasets for intent classification and slot labeling. However, in evolving real-world dialog systems, where new functionality is regularly added, a major additional challenge is the lack of annotated training data for such new functionality, as the necessary data collection efforts are laborious and time-consuming. A potential solution to reduce the effort is to augment initial seed data by paraphrasing existing utterances automatically. In this paper, we propose a new, data-efficient approach following this idea. Using an interpretation-to-text model for paraphrase generation, we are able to rely on existing dialog system training data, and, in combination with shuffling-based sampling techniques, we can obtain diverse and novel paraphrases from small amounts of seed data. In experiments on a public dataset and with a real-world dialog system, we observe improvements for both intent classification and slot labeling, demonstrating the usefulness of our approach.
Delexicalized Paraphrase Generation
Boya Yu, Konstantine Arkoudas and Wael Hamza
We present a neural model for paraphrasing and train it to generate delexicalized sentences. We achieve this by creating training data in which each input is paired with a number of reference paraphrases. These sets of reference paraphrases represent semantic equivalence based on annotated slots and intents. To understand the semantics of different types of slots, beyond merely anonymizing them, we apply convolutional neural networks (CNNs) preceding pooling on slot values and use pointers to locate slots in the output. We show empirically that the generated paraphrases are of high quality, leading to an additional 1.29% exact match on real-time utterances. We also show that natural language understanding (NLU) tasks, including intent classification and named-entity recognition, can benefit from data augmentation using automatically generated paraphrases.
Evaluating Cross-Lingual Transfer Learning Approaches in Multilingual Conversational Agent Models
Lizhen Tan and Olga Golovneva
With the recent explosion in popularity of voice assistant devices, there is a growing interest in making them available to user populations in additional countries and languages. However, to provide the highest accuracy and best performance for specific user populations, most existing voice assistant models are developed individually for each region or language, which requires a linear investment of effort. In this paper, we propose a general multilingual model framework for Natural Language Understanding (NLU) models, which can help bootstrap new language models faster and reduce the amount of effort required to develop each language separately. We explore how different deep learning architectures affect multilingual NLU model performance. Our experimental results show that these multilingual models can reach the same or better performance than monolingual models across language-specific test data while requiring less effort in feature creation and model maintenance.
Extreme Model Compression for On-device Natural Language Understanding
Kanthashree Mysore Sathyendra, Samridhi Choudhary and Leah Nicolich-Henkin
Natural Language Understanding (NLU) generally consists of three component tasks - Domain Classification (DC), Intent Classification (IC) and Named Entity Recognition (NER). In this paper, we propose and experiment with techniques for extreme compression of neural NLU models, making them suitable for execution on resource-constrained devices. We propose a task-aware, end-to-end compression approach that performs word-embedding compression jointly with NLU task learning. We show our results on a large scale real-world NLU system trained on a large number of intents and containing large vocabulary sizes. Our approach outperforms a range of baselines and achieves a compression rate of 97.4% (227MB to 5.8MB) with less than 3.7% degradation in predictive performance. Our analysis indicates that the signal from the downstream task is important for effective compression with minimal degradation in performance.
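The paper's compression is task-aware and learned end-to-end; as a point of contrast, the simplest post-hoc baseline is scalar quantization of the embedding table, sketched below. This is illustrative only — per-row int8 quantization tops out around a 4x reduction over float32, far from the 97.4% rate reported above.

```python
def quantize_rows(matrix):
    """Per-row symmetric int8 quantization of an embedding matrix:
    each float row becomes one float scale plus integer codes in
    [-127, 127]. Real on-device systems push much further with
    product quantization and task-aware joint training."""
    quantized = []
    for row in matrix:
        scale = max(abs(v) for v in row) / 127.0
        if scale == 0.0:
            scale = 1.0  # all-zero row; any scale reconstructs it exactly
        codes = [round(v / scale) for v in row]
        quantized.append((scale, codes))
    return quantized

def dequantize_rows(quantized):
    """Reconstruct approximate float rows from (scale, codes) pairs."""
    return [[scale * c for c in codes] for scale, codes in quantized]
```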
Graph Regularized Graph Convolutional Networks for Short Text Classification using Side Information
Kshitij Tayal, Nikhil Rao, Saurabh Agarwal, Xiaowei Jia, Karthik Subbian and Vipin Kumar
Short text classification is a fundamental problem in natural language processing, social network analysis, and e-commerce. The lack of structure in short text sequences limits the success of popular NLP methods based on deep learning. Simpler methods that rely on bag-of-words representations tend to perform on par with complex deep learning methods. To tackle the limitations of textual features in short text, we propose a Graph-regularized Graph Convolution Network (GR-GCN), which augments graph convolution networks by incorporating label dependencies in the output space. Our model achieves state-of-the-art results on both proprietary and external datasets, outperforming several baseline methods by up to 6%. Furthermore, we show that compared to baseline methods, GR-GCN is more robust to noise in textual features.
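A common way to express label dependencies in the output space, of the kind the abstract describes, is a graph (Laplacian-style) regularizer that penalizes disagreement between connected nodes. The sketch below shows that general idea, not the paper's exact formulation:

```python
def graph_regularizer(outputs, edges, weight=1.0):
    """Sum of squared differences between the prediction vectors of
    nodes joined by an edge in the label-dependency graph. Added to the
    task loss, this nudges connected labels toward similar outputs."""
    penalty = 0.0
    for i, j in edges:
        penalty += sum((a - b) ** 2
                       for a, b in zip(outputs[i], outputs[j]))
    return weight * penalty
```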
hinglishNorm - A Corpus of Hindi-English Code Mixed Sentences for Text Normalization
Piyush Makhija, Ankit Kumar and Anuj Gupta
We present hinglishNorm - a human-annotated corpus of Hindi-English code-mixed sentences for the text normalization task. Each sentence in the corpus is aligned to its corresponding human-annotated normalized form. To the best of our knowledge, no corpus of Hindi-English code-mixed sentences for the text normalization task is publicly available; our work is the first attempt in this direction. The corpus contains 13494 parallel segments. Further, we present baseline normalization results on this corpus. We obtain a Word Error Rate (WER) of 15.55, a BiLingual Evaluation Understudy (BLEU) score of 71.2, and a Metric for Evaluation of Translation with Explicit ORdering (METEOR) score of 0.50.
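For reference, the WER figure above is the standard word-level edit-distance metric: substitutions, insertions, and deletions against the reference, divided by reference length (usually reported x100). A minimal dynamic-programming sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance between reference and
    hypothesis, normalized by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / len(ref)
```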
Interactive Question Clarification in Dialogue via Reinforcement Learning
Xiang Hu, Zujie Wen, Yafang Wang, Xiaolong Li and Gerard de Melo
Coping with ambiguous questions has been a perennial problem in real-world dialogue systems. Although clarification by asking questions is a common form of human interaction, it is hard to define appropriate questions to elicit more specific intents from a user. In this work, we propose a reinforcement model to clarify ambiguous questions by suggesting refinements of the original query. We first formulate a collection partitioning problem to select a set of labels enabling us to distinguish potential unambiguous intents. We list the chosen labels as intent phrases to the user for further confirmation. The selected label along with the original user query then serves as a refined query, for which a suitable response can more easily be identified. The model is trained using reinforcement learning with a deep policy network. We evaluate our model based on real-world user clicks and demonstrate significant improvements across several different experiments.
Learning Domain Terms - Empirical Methods to Enhance Enterprise Text Analytics Performance
Gargi Roy, Lipika Dey, Mohammad Shakir and Tirthankar Dasgupta
The performance of standard text analytics algorithms is known to degrade substantially on consumer-generated data, which is often very noisy. These algorithms also do not work well on enterprise data, which differs greatly in nature from news repositories, storybooks, or Wikipedia data. Text cleaning is a mandatory step that aims at noise removal and correction to improve performance. However, enterprise data needs special cleaning methods since it contains many domain terms that appear to be noise against a standard dictionary but in reality are not. In this work we present a detailed analysis of the characteristics of enterprise data and suggest unsupervised methods for cleaning these repositories after domain terms have been automatically segregated from true noise terms. Noise terms are thereafter corrected in a contextual fashion. The effectiveness of the method is established through careful manual evaluation of error corrections over several standard datasets, including those available for hate speech detection, where there is deliberate distortion to avoid detection. We also share results showing enhancement in classification accuracy after noise correction.
Leveraging User Paraphrasing Behavior In Dialog Systems To Automatically Collect Annotations For Long-Tail Utterances
Tobias Falke, Markus Boese, Daniil Sorokin, Caglar Tirkaz and Patrick Lehnen
In large-scale commercial dialog systems, users express the same request in a wide variety of alternative ways with a long tail of less frequent alternatives. Handling the full range of this distribution is challenging, in particular when relying on manual annotations. However, the same users also provide useful implicit feedback as they often paraphrase an utterance if the dialog system failed to understand it. We propose MARUPA, a method to leverage this type of feedback by creating annotated training examples from it. MARUPA creates new data in a fully automatic way, without manual intervention or effort from annotators, and specifically for currently failing utterances. By re-training the dialog system on this new data, accuracy and coverage for long-tail utterances can be improved. In experiments, we study the effectiveness of this approach in a commercial dialog system across various domains and three languages.
Misspelling Detection from Noisy Product Images
Varun Nagaraj Rao and Mingwei Shen
Misspellings are introduced on products either due to negligence or as an attempt to deliberately deceive stakeholders. This leads to revenue loss for online sellers and fosters customer mistrust. Existing spelling research has primarily focused on advances in misspelling correction, while the approach to misspelling detection has remained the use of a large dictionary. The dictionary lookup results in the incorrect detection of several non-dictionary words as misspellings. In this paper, we propose a method to automatically detect misspellings from product images in an attempt to reduce false positive detections. We curate a large-scale corpus, define a rich set of features, and propose a novel model that leverages importance weighting to account for within-class distributional variance. Finally, we experimentally validate this approach on both the curated corpus and an out-of-domain public dataset and show that it leads to an improvement of up to 20% in F1 score. The approach thus creates a more robust, generalized deployable solution and reduces reliance on the large-scale custom dictionaries used today. Corpus features will be released publicly upon acceptance.
Multi-task Learning of Spoken Language Understanding by Integrating N-Best Hypotheses with Hierarchical Attention
Mingda Li, Xinyue Liu, Weitong Ruan, Luca Soldaini, Wael Hamza and Chengwei Su
Currently, in spoken language understanding (SLU) systems, the automatic speech recognition (ASR) module produces multiple interpretations (or hypotheses) for the input audio signal, and the natural language understanding (NLU) module takes the one with the highest confidence score for domain or intent classification. However, the interpretations can be noisy, and solely relying on one interpretation can cause information loss. To address the problem, many research works attempt to rerank the interpretations for a better choice, while some recent works obtain better performance by integrating all the hypotheses during prediction. In this paper, we follow the approach of integrating hypotheses but strengthen the training by involving more tasks, some of which may not be among the existing NLU tasks but are relevant, via multi-task learning or transfer learning. Moreover, we propose the Hierarchical Attention Mechanism (HAM) to further improve performance using acoustic-model features such as confidence scores, which are ignored in current hypothesis-integration models. The experimental results show that, compared to the standard estimation with one hypothesis, multi-task learning with HAM improves domain and intent classification by a relative 19% and 37% respectively, much higher than the improvements from current integration or reranking methods. To illustrate the cause of the improvements brought by our model, we decode the hidden representations of some utterance examples and compare the generated texts with the hypotheses and transcripts. The comparison shows that our model can recover the transcription by integrating the fragmented information among hypotheses and identifying the frequent error patterns of the ASR module, and can even rewrite the query for a better understanding, which reveals the characteristic of multi-task learning of broadcasting knowledge.
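As a simplified illustration of how acoustic confidence scores can enter a hypothesis-integration model: the sketch below just softmaxes the raw confidences to pool n-best hypothesis representations, whereas the paper's HAM learns its attention weights hierarchically.

```python
import math

def confidence_weighted_pooling(hypothesis_vectors, confidence_scores):
    """Combine n-best hypothesis representations using a softmax over
    ASR confidence scores as fixed attention weights — a stand-in for
    learned hierarchical attention."""
    m = max(confidence_scores)
    exps = [math.exp(s - m) for s in confidence_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(hypothesis_vectors[0])
    return [sum(w * vec[i] for w, vec in zip(weights, hypothesis_vectors))
            for i in range(dim)]
```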
Scalable Cross-lingual Treebank Synthesis for Improved Production Dependency Parsers
Yousef El-Kurdi, Hiroshi Kanayama, Efsun Sarioglu Kayi, Vittorio Castelli, Todd Ward and Radu Florian
We present scalable dependency treebank synthesis techniques that exploit advances in language representation modeling which leverage vast amounts of unlabeled general-purpose multilingual text. We introduce a data augmentation technique that uses the synthetic treebanks to improve production-grade parsers. The synthetic treebanks are generated using a state-of-the-art biaffine parser enhanced with pretrained Transformer models, such as Multilingual BERT (M-BERT). The new parser improves LAS by up to two points on 7 languages. Trend-line results of LAS as the augmented treebank size scales surpass the performance of production models trained on the originally annotated Universal Dependency (UD) treebanks.
ScopeIt: Scoping Task Relevant Sentences in Documents
Barun Patra, Vishwas Suryanarayanan, Chala Fufa, Pamela Bhattacharya and Charles Lee
A prominent problem faced by conversational agents working with large documents (e.g., email-based assistants) is the frequent presence of information in the document that is irrelevant to the assistant. This in turn makes it harder for the agent to accurately detect intents, extract entities relevant to those intents, and perform the desired action. To address this issue we present a neural model for scoping relevant information for the agent from a large document. We show that when used as the first step in a widely used email-based assistant for helping users schedule meetings, our proposed model helps improve the performance of the intent detection and entity extraction tasks required by the agent for correctly scheduling meetings: across a suite of 6 downstream tasks, by using our proposed method, we observe an average gain of 35% in precision without any drop in recall. Additionally, we demonstrate that the same approach can be used for component-level analysis in large documents, such as signature block identification.
Semantic Diversity for Natural Language Understanding Evaluation in Dialog Systems
Enrico Palumbo, Andrea Mezzalira, Cristina Marco, Alessandro Manzotti and Daniele Amberti
The quality of Natural Language Understanding (NLU) models is typically evaluated using aggregated metrics on a large number of utterances. In a dialog system, though, the manual analysis of failures on specific utterances is a time-consuming and yet critical endeavor to guarantee a high-quality customer experience. A crucial question for this analysis is how to create a test set of utterances that covers a diversity of possible customer requests. In this paper, we introduce the task of generating a test set with high semantic diversity for NLU evaluation in dialog systems and we describe an approach to address it. The approach starts by extracting high-traffic utterance patterns. Then, for each pattern, it achieves high diversity by selecting utterances from different regions of the utterance embedding space. We compare three selection strategies based on clustering of utterances in the embedding space, on solving the maximum distance optimization problem, and on simple heuristics such as random uniform sampling and popularity. The evaluation shows that the highest semantic and lexical diversity is obtained by a greedy maximum sum of distances solver in a runtime comparable with the clustering and heuristics approaches.
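A hedged sketch of the greedy maximum-sum-of-distances strategy the evaluation favours; the seeding choice and distance function here are illustrative assumptions, not the paper's exact procedure:

```python
def greedy_diverse_subset(points, k, distance):
    """Greedily build a subset approximately maximizing the sum of
    pairwise distances: start from an arbitrary point, then repeatedly
    add the candidate whose total distance to the already-selected
    points is largest."""
    selected = [points[0]]
    remaining = list(points[1:])
    while len(selected) < k and remaining:
        best = max(remaining,
                   key=lambda p: sum(distance(p, s) for s in selected))
        remaining.remove(best)
        selected.append(best)
    return selected
```

In the paper's setting, `points` would be utterance embeddings and `distance` a metric such as cosine or Euclidean distance in the embedding space.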
Towards building a Robust Industry-scale Question Answering System
Rishav Chakravarti, Anthony Ferritto, Bhavani Iyer, Lin Pan, Radu Florian, Salim Roukos and Avi Sil
Industry-scale NLP systems necessitate two features. 1. Robustness: “zero-shot transfer learning” (ZSTL) performance has to be commendable and 2. Efficiency: systems have to train efficiently and respond instantaneously. In this paper, we introduce the development of a production model called BINARY (roBust questIoN AnsweRing sYstem) which possesses the above two characteristics. For robustness, it trains on the recently introduced Natural Questions (NQ) dataset. NQ poses additional challenges over older datasets like SQuAD: (a) QA systems need to read and comprehend an entire Wikipedia article rather than a small passage, and (b) NQ does not suffer from observation bias during construction, resulting in less lexical overlap between the question and the article. BINARY consists of Attention-over-Attention, diversity among attention heads, hierarchical transfer learning, and synthetic data augmentation while being computationally inexpensive. Building on top of the powerful BERT_QA model, BINARY provides a ~2.0% absolute boost in F1 over the industry-scale state-of-the-art (SOTA) system on NQ. Further, we show that BINARY transfers zero-shot to unseen, real-life, and important domains as it yields respectable performance on two benchmarks: the BioASQ and the newly introduced CovidQA datasets.
Uncertainty Modeling for Machine Comprehension Systems using Efficient Bayesian Neural Networks
Zhengyuan Liu, Pavitra Krishnaswamy, AiTi Aw and Nancy Chen
While neural approaches have achieved significant improvements in machine comprehension tasks, models often work as a black box, resulting in lower explainability, which requires special attention in domains such as healthcare or education. Quantifying uncertainty helps pave the way towards more interpretable neural networks. In classification and regression tasks, Bayesian neural networks have been effective in capturing model uncertainty. However, inference time increases linearly due to the required sampling process in Bayesian neural networks, so speed becomes a bottleneck in tasks with high system complexity such as question answering or dialogue generation. In this work, we propose a hybrid neural architecture that quantifies model uncertainty using Bayesian weight approximation while boosting inference speed by 80% relative at test time, and apply it to a clinical dialogue comprehension task. The proposed approach is also used to enable active learning, so that an updated model can be trained more optimally with new incoming data by selecting samples that are not well represented in the current training scheme.
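The abstract does not specify which Bayesian approximation is used; Monte Carlo dropout is a common one and illustrates why inference time grows linearly with the number of samples — the bottleneck the hybrid architecture targets. A toy sketch, where the stochastic `forward` callable stands in for a real network with dropout left active at test time:

```python
import random
import statistics

def mc_uncertainty(forward, x, num_samples=100):
    """Run multiple stochastic forward passes and treat the spread of
    predictions as an estimate of model (epistemic) uncertainty.
    Inference cost is num_samples full forward passes, hence the
    linear slowdown of sampling-based Bayesian inference."""
    preds = [forward(x) for _ in range(num_samples)]
    return statistics.mean(preds), statistics.stdev(preds)
```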