Accepted Papers: Systems Demonstrations
- Ve'rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
  Khalid Alnajjar, Mika Hämäläinen, Jack Rueter and Niko Partanen
- MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources
  Farhad Akhbardeh, Travis Desell and Marcos Zampieri
- DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool
  Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen and Vera Demberg
- Demo Application for the AutoGOAL Framework
  Suilan Estevez-Velarde, Alejandro Piad-Morffis, Yoan Gutiérrez, Andres Montoyo, Rafael Muñoz-Guillena and Yudivián Almeida Cruz
- Fast Word Predictor for On-Device Application
  Huy Tien Nguyen, Khoi Tuan Nguyen, Anh Tuan Nguyen and Thanh Lac Thi Tran
- Semantic search with domain-specific word-embedding and production monitoring in Fintech
  Mojtaba Farmanbar, Nikki Van Ommeren and Boyang Zhao
- CogniVal in Action: An Interface for Customizable Cognitive Word Embedding Evaluation
  Nora Hollenstein, Adrian van der Lek and Ce Zhang
- A Multilingual Reading Comprehension System for more than 100 Languages
  Anthony Ferritto, Sara Rosenthal, Mihaela Bornea, Kazi Hasan, Rishav Chakravarti, Salim Roukos, Radu Florian and Avi Sil
- XplaiNLI: Explainable Natural Language Inference through Visual Analytics
  Aikaterini-Lida Kalouli, Rita Sevastjanova, Valeria de Paiva, Richard Crouch and Mennatallah El-Assady
- Discussion Tracker: Supporting Teacher Learning about Students' Collaborative Argumentation in High School Classrooms
  Luca Lugini, Christopher Olshefski, Ravneet Singh, Diane Litman and Amanda Godley
- Neural text normalization leveraging similarities of strings and sounds
  Riku Kawamura, Tatsuya Aoki, Hidetaka Kamigaito, Hiroya Takamura and Manabu Okumura
- An Online Readability Leveled Arabic Thesaurus
  Zhengyang Jiang, Nizar Habash and Muhamed Al Khalil
- TrainX – Named Entity Linking with Active Sampling and Bi-Encoders
  Tom Oberhauser, Tim Bischoff, Karl Brendel, Maluna Menke, Tobias Klatt, Amy Siu, Felix Alexander Gers and Alexander Löser
- BullStop: A Mobile App for Cyberbullying Prevention
  Semiu Salawu, Yulan He and Jo Lumsden
- Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
  Rafał Poświata and Michał Perełkiewicz
- Arabic Curriculum Analysis
  Hamdy Mubarak, Shimaa Amer, Ahmed Abdelali and Kareem Darwish
- Epistolary Education in 21st Century: A System to Support Composition of E-mails by Students to Superiors in Japanese
  Kenji Ryu and Michal Ptaszynski
Ve'rdd. Narrowing the Gap between Paper Dictionaries, Low-Resource NLP and Community Involvement
Khalid Alnajjar, Mika Hämäläinen, Jack Rueter and Niko Partanen
We present an open-source online dictionary editing system, Ve'rdd, that offers a chance to re-evaluate and edit grassroots dictionaries that have been exposed to multiple amateur editors. The idea is to incorporate community activities into a state-of-the-art finite-state language description of a seriously endangered minority language, Skolt Sami. Problems involve getting the community to take part in things above the pencil-and-paper level. At times, it seems that native speakers and dictionary-oriented contributors lack the technical understanding needed to use the infrastructures that could make their work more meaningful in the future, i.e. through multiple reuse of all of their input. Therefore, our system integrates with the existing tools and infrastructures for Uralic languages, masking the technical complexities behind a user-friendly UI.
MaintNet: A Collaborative Open-Source Library for Predictive Maintenance Language Resources
Farhad Akhbardeh, Travis Desell and Marcos Zampieri
Maintenance record logbooks are an emerging text type in NLP. They typically consist of free-text documents with many domain-specific technical terms and abbreviations, as well as non-standard spelling and grammar, all of which pose difficulties for NLP pipelines trained on standard corpora. Analyzing and annotating such documents is of particular importance in the development of predictive maintenance systems, which aim to provide operational efficiencies, prevent accidents and save lives. In order to facilitate and encourage research in this area, we have developed MaintNet, a collaborative open-source library of technical and domain-specific language datasets. MaintNet provides novel logbook data from the aviation, automotive, and facilities domains along with tools to aid in their (pre-)processing and clustering. Furthermore, it provides a way to encourage discussion on and sharing of new datasets and tools for logbook data analysis.
DART: A Lightweight Quality-Suggestive Data-to-Text Annotation Tool
Ernie Chang, Jeriah Caplinger, Alex Marin, Xiaoyu Shen and Vera Demberg
We present a lightweight annotation tool, the Data AnnotatoR Tool (DART), for the general task of labeling structured data with textual descriptions. The tool is implemented as an interactive application that reduces the human effort of annotating large quantities of structured data, e.g. in the format of a table or tree structure. By using a backend sequence-to-sequence model, our system iteratively analyzes the annotated labels in order to better sample unlabeled data. In a simulation experiment on annotating large quantities of structured data, DART has been shown to reduce the total number of annotations needed through active learning and automatic suggestion of relevant labels.
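The abstract does not state which sampling criterion DART uses. As a rough illustration of the general active-learning idea only, the sketch below ranks unlabeled items by the entropy of a model's predicted distribution and picks the most uncertain ones for annotation. The item identifiers, the probability vectors, and the function names are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List


def entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of a probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_for_annotation(predictions: Dict[str, List[float]], k: int) -> List[str]:
    """Return the ids of the k items whose predictions are most uncertain."""
    ranked = sorted(predictions.items(), key=lambda kv: entropy(kv[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]


# Hypothetical model outputs for three unlabeled items (assumed, not from the paper).
pool = {
    "row-1": [0.98, 0.01, 0.01],  # model is confident
    "row-2": [0.40, 0.35, 0.25],  # model is very unsure
    "row-3": [0.70, 0.20, 0.10],  # somewhat unsure
}

chosen = select_for_annotation(pool, k=2)
print(chosen)
```

In this toy pool, the near-uniform prediction is ranked first, so the annotator's time goes to the item the model knows least about.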
Demo Application for the AutoGOAL Framework
Suilan Estevez-Velarde, Alejandro Piad-Morffis, Yoan Gutiérrez, Andres Montoyo, Rafael Muñoz-Guillena and Yudivián Almeida Cruz
This paper introduces a web demo that showcases the main characteristics of the AutoGOAL framework. AutoGOAL is a framework in Python for automatically finding the best way to solve a given task. It has been designed mainly for automatic machine learning (AutoML), but it can be used in any scenario where several possible strategies are available to solve a given computational task. In contrast with alternative frameworks, AutoGOAL can be applied seamlessly to Natural Language Processing as well as structured classification problems. This paper presents an overview of the framework's design and experimental evaluation on several machine learning problems, including two recent NLP challenges. The accompanying software demo is available online (https://autogoal.github.io/demo) and full source code is provided under the MIT open-source license (https://autogoal.github.io).
Fast Word Predictor for On-Device Application
Huy Tien Nguyen, Khoi Tuan Nguyen, Anh Tuan Nguyen and Thanh Lac Thi Tran
Trained on large text corpora, deep neural networks achieve promising results in the next word prediction task. However, deploying these huge models on devices means dealing with the constraints of low latency and a small binary size. To address these challenges, we propose a fast word predictor that performs efficiently on mobile devices. Compared with a standard neural network that has a similar word prediction rate, the proposed model achieves a 60% reduction in memory size and 100x faster inference on a mid-range mobile device. The method is deployed as a feature of a chat application that serves more than 100 million users.
Semantic search with domain-specific word-embedding and production monitoring in Fintech
Mojtaba Farmanbar, Nikki Van Ommeren and Boyang Zhao
We present an end-to-end search engine system with domain-specific custom language models for accurate expansion of search terms. The text mining pipeline tackles several challenges faced in an industry setting, including multilingual, jargon-rich unstructured text and privacy compliance. Combined with a novel statistical approach for word embedding evaluation, the models can be monitored in a production setting. Our approach is used in real-world settings in risk management in the financial sector and has wide applicability to other domain-specific areas.
CogniVal in Action: An Interface for Customizable Cognitive Word Embedding Evaluation
Nora Hollenstein, Adrian van der Lek and Ce Zhang
We demonstrate the functionalities of the new user interface for CogniVal. CogniVal is a framework for cognitive evaluation of word embeddings, which evaluates the quality of embeddings based on how well they predict human word representations from a range of cognitive language processing datasets. In this work, we present an easy-to-use command line interface for CogniVal with multiple improvements over the original work, including the possibility to evaluate custom embeddings against custom cognitive data sources.
A Multilingual Reading Comprehension System for more than 100 Languages
Anthony Ferritto, Sara Rosenthal, Mihaela Bornea, Kazi Hasan, Rishav Chakravarti, Salim Roukos, Radu Florian and Avi Sil
This paper presents M-GAAMA, a Multilingual Question Answering architecture and demo system. This is the first multilingual machine reading comprehension (MRC) demo which is able to answer questions in over 100 languages. M-GAAMA answers questions from a given passage in the same or a different language. It incorporates several existing multilingual models, such as M-BERT and XLM-R, that can be used interchangeably in the demo. The M-GAAMA demo also improves language accessibility by incorporating the IBM Watson machine translation widget, which lets users see an answer in their desired language. Further, we show how M-GAAMA can be used in downstream tasks by incorporating it into an END-TO-END-QA system using CFO (Chakravarti et al., 2019). We experiment with our system architecture on the Multi-Lingual Question Answering (MLQA) and the CORD-19 COVID (Wang et al., 2020; Tang et al., 2020) datasets to provide insights into the performance of the system.
XplaiNLI: Explainable Natural Language Inference through Visual Analytics
Aikaterini-Lida Kalouli, Rita Sevastjanova, Valeria de Paiva, Richard Crouch and Mennatallah El-Assady
Advances in Natural Language Inference (NLI) have helped us understand what the state-of-the-art models really learn and what their generalization power is. Recent research has revealed some heuristics and biases of these models. However, to date, there is no systematic effort to capitalize on those insights through a system that uses these to explain the NLI decisions. To this end, we propose XplaiNLI, an eXplainable, interactive visualization interface that computes NLI with different methods and provides explanations for the decisions made by the different approaches.
Discussion Tracker: Supporting Teacher Learning about Students' Collaborative Argumentation in High School Classrooms
Luca Lugini, Christopher Olshefski, Ravneet Singh, Diane Litman and Amanda Godley
Teaching collaborative argumentation is an advanced skill that many K-12 teachers struggle to develop. To address this, we have developed Discussion Tracker, a classroom discussion analytics system based on novel algorithms for classifying argument moves, specificity, and collaboration. Results from a classroom deployment indicate that teachers found the analytics useful, and that the underlying classifiers perform with moderate to substantial agreement with humans.
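The phrase "moderate to substantial agreement" matches the conventional Landis and Koch bands for kappa statistics (0.41 to 0.60 moderate, 0.61 to 0.80 substantial). The abstract does not say which agreement measure was used, so the following is only a sketch that assumes Cohen's kappa between a human annotator and a classifier. The label names and the toy annotations are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two label sequences."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    # Observed agreement: fraction of items labeled identically.
    observed = sum(x == y for x, y in zip(a, b)) / n
    # Expected agreement if both annotators labeled independently at their own rates.
    counts_a, counts_b = Counter(a), Counter(b)
    labels = counts_a.keys() | counts_b.keys()
    expected = sum(counts_a[l] * counts_b[l] for l in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical argument-move labels for eight student turns (assumed data).
human = ["claim", "claim", "evidence", "evidence", "claim", "evidence", "claim", "evidence"]
model = ["claim", "claim", "evidence", "claim", "claim", "evidence", "evidence", "evidence"]

kappa = cohens_kappa(human, model)
print(round(kappa, 2))
```

Here the two sequences agree on 6 of 8 turns (0.75 observed) against 0.5 expected by chance, giving a kappa of 0.5, which falls in the "moderate" band.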
Neural text normalization leveraging similarities of strings and sounds
Riku Kawamura, Tatsuya Aoki, Hidetaka Kamigaito, Hiroya Takamura and Manabu Okumura
We propose neural models that can normalize text by considering the similarities of word strings and sounds. We experimentally compared a model that considers the similarities of both word strings and sounds, models that consider only the similarity of word strings or only that of sounds, and a baseline model that uses neither similarity. Results showed that leveraging word string similarity succeeded in dealing with misspellings and abbreviations, and that taking sound similarity into account succeeded in dealing with phonetic substitutions and emphasized characters. As a result, the proposed models achieved higher F1 scores than the baseline.
An Online Readability Leveled Arabic Thesaurus
Zhengyang Jiang, Nizar Habash and Muhamed Al Khalil
This demo paper introduces the online Readability Leveled Arabic Thesaurus interface. For a given user input word, this interface provides the word’s possible lemmas, roots, English glosses, related Arabic words and phrases, and readability on a five-level scale. This interface builds on and connects multiple existing Arabic resources and processing tools. This system enables Arabic speakers and learners to benefit from advances in Arabic computational linguistics technologies. Feedback from users of the system will help the developers to identify lexical coverage gaps and errors.
TrainX – Named Entity Linking with Active Sampling and Bi-Encoders
Tom Oberhauser, Tim Bischoff, Karl Brendel, Maluna Menke, Tobias Klatt, Amy Siu, Felix Alexander Gers and Alexander Löser
We demonstrate TrainX, a system for Named Entity Linking for medical experts. It combines state-of-the-art entity recognition and linking architectures, such as Flair and fine-tuned Bi-Encoders based on BERT, with an easy-to-use interface for healthcare professionals. We support medical experts in annotating training data by using active sampling strategies to forward informative samples to the annotator. We demonstrate that our model is capable of linking against large knowledge bases, such as UMLS (3.6 million entities), and supporting zero-shot cases, where the linker has never seen the entity before. Those zero-shot capabilities help to mitigate the problem of rare and expensive training data that is a common issue in the medical domain.
BullStop: A Mobile App for Cyberbullying Prevention
Semiu Salawu, Yulan He and Jo Lumsden
Social media has become the new playground for bullies. Young people are now regularly exposed to a wide range of abuse online. In response to the increasing prevalence of cyberbullying, online social networks have increased efforts to clamp down on online abuse, but unfortunately the nature, complexity and sheer volume of cyberbullying mean that many cyberbullying incidents go undetected. BullStop is a mobile app for detecting and preventing cyberbullying and online abuse on social media platforms. It uses deep learning models to identify instances of cyberbullying and can automatically initiate actions such as deleting offensive messages and blocking bullies on behalf of the user. Our system not only achieves impressive prediction results but also demonstrates excellent potential for use in real-world scenarios and is freely available on the Google Play Store.
Annobot: Platform for Annotating and Creating Datasets through Conversation with a Chatbot
Rafał Poświata and Michał Perełkiewicz
In this paper, we introduce Annobot: a platform for annotating and creating datasets through conversation with a chatbot. This natural form of interaction has allowed us to create a more accessible and flexible interface, especially for mobile devices. Our solution has a wide range of applications such as data labelling for binary, multi-class/label classification tasks, preparing data for regression problems, or creating sets for issues such as machine translation, question answering or text summarization. Additional features include pre-annotation, active sampling, online learning and real-time inter-annotator agreement. The system is integrated with the popular messaging platform Facebook Messenger. Experiments showed the advantages of the proposed platform compared to other labelling tools. The source code of Annobot is available at https://github.com/rafalposwiata/annobot.
Arabic Curriculum Analysis
Hamdy Mubarak, Shimaa Amer, Ahmed Abdelali and Kareem Darwish
Developing a platform that analyzes the content of curricula can help identify their shortcomings and whether they are tailored to specific desired outcomes. In this paper, we present a system to analyze Arabic curricula and provide insights about their content. It allows users to explore word presence and the surface forms used, as well as contrasting statistics between the countries from which the curricula were selected. It also provides a facility to grade text against a given grade level and gives users feedback about the complexity or difficulty of the words used in a text.
Epistolary Education in 21st Century: A System to Support Composition of E-mails by Students to Superiors in Japanese
Kenji Ryu and Michal Ptaszynski
E-mail is a communication tool widely used by people of all ages on the Internet today, often in business and formal situations, especially in Japan. Moreover, Japanese E-mail communication has a set of specific rules taught using specialized guidebooks. E-mail literacy education for many Japanese students is typically provided in a traditional, yet inefficient, lecture-based way. We propose a system to support Japanese students in writing E-mails to superiors (teachers, job hunting representatives, etc.). We first investigate the importance of formal E-mails in Japan and what is needed to successfully write a formal E-mail. Next, we develop the system in accordance with those rules. Finally, we evaluate the system in two ways. The results, although based on a small number of samples, were generally positive, and clearly indicated additional ways to improve the system.