Natural Language, Dialog and Speech Symposium (NDS2020)
Friday, November 13, 2020, 10:00 AM - 5:00 PM EST
The New York Academy of Sciences, 7 World Trade Center, 250 Greenwich St Fl 40, New York
The New York Academy of Sciences
Natural language, dialog and speech (NDS) researchers focus on communication between people and computers using human languages both in written and spoken forms. They develop models for analyzing the structure and content of human conversation and create artificial agents who can engage in human-like interaction with people and other agents.
This Symposium will address areas including dialog systems, spoken and natural language understanding, natural language generation and speech synthesis. It will feature Keynote Presentations from leading researchers and short, early career investigator presentations selected from among the submitted abstracts.
New York University
The New York Academy of Sciences
Brooklyn College & The Graduate Center, CUNY
JP Morgan Chase
New York University
November 13, 2020
KEYNOTE ADDRESS 1: Three Considerations for Human-centric AI
The past two decades have seen enormous advances in machine learning and AI, including the development of virtual agents that are able to interact with humans to assist with learning, task completion and entertainment.
As we look to the future and focus on improving these human-machine interactions to make them more human-centric, we need to consider the following: our emotions, our ethics and our community.
In this talk, Dr. Taniya Mishra will discuss how we can build systems that respond to human emotions and cognitive states, create ethical AI that mitigates algorithmic bias, and transform the tech and AI landscape by creating a diverse and inclusive AI community.
STAR Talks: Session 1
Goal-Oriented Multitask Dialogue Modeling of Supreme Court Oral Arguments
Dialogue modeling has advanced in a number of domains, but many complex domains such as the Supreme Court of the United States (SCOTUS) remain understudied. Goal-oriented dialogues like those found in SCOTUS have more strictly defined outcomes than open-domain dialogues, while having less strictly scripted language than task-oriented dialogues. This work evaluates the effect of balancing turn-level goal prediction tasks -- (a) deciding an outcome of a conversation and (b) addressing a topic -- with a next turn ranking task in a multitask setting using the ParlAI dialogue framework and SCOTUS oral arguments. While most meeting corpora such as AMI and ICSI involve multiple roles and multifarious meeting objectives, SCOTUS dialogues have relatively simple outcomes (vote for or against petitioner), two sides arguing opposing points (petitioner/respondent), and strict power distinctions (Justice/Counsel). This work experiments with modeling Justice turns in a Counsel-Justice dyadic exchange. Justices’ turns are targeted as they participate in multiple conversations and exert more control over the dialogue than Counsel. This enables us to control for power, initiative, and speaker traits. We jointly model a next turn ranking task and a turn-level goal prediction task, the latter of which may use a vote objective for (a) or a topic objective for (b). The effects of different objectives are analyzed in aggregate, in ideological aggregate, and for each speaker.
Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks
Large pretrained language models have been shown to store factual knowledge in their parameters and achieve state-of-the-art results when fine-tuned on downstream NLP tasks. However, their ability to access and precisely manipulate knowledge is still limited, so on knowledge-intensive tasks, their performance lags behind task-specific architectures. Also, explaining their decisions and updating their world knowledge remain open research problems. Pretrained models with a differentiable access mechanism to explicit non-parametric memory can overcome this issue but have so far been investigated only for extractive downstream tasks. We explore a general-purpose fine-tuning recipe for retrieval-augmented generation (RAG): models that combine parametric and non-parametric memory for language generation. We introduce RAG models where the parametric memory is a pretrained seq2seq model and the non-parametric memory is a dense vector index of Wikipedia, accessed with a pretrained retriever.
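The retrieve-then-generate idea in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the three-dimensional toy embeddings, the helper names `retrieve` and `build_generator_input`, and the " // " separator are hypothetical and are not taken from the presented work. The sketch covers only the non-parametric lookup and the construction of the input that a seq2seq model would receive; it does not include a trained retriever or generator.

```python
from typing import List, Sequence, Tuple


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length dense vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(
    query_vec: Sequence[float],
    index: List[Tuple[str, Sequence[float]]],
    k: int = 2,
) -> List[str]:
    """Return the k passages whose vectors score highest against the query."""
    ranked = sorted(index, key=lambda item: dot(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


def build_generator_input(question: str, passages: List[str]) -> str:
    """Join retrieved passages and the question into one seq2seq input string."""
    return " ".join(passages) + " // " + question


# Toy dense index: (passage, hypothetical embedding). A real index would hold
# vectors produced by a pretrained encoder.
TOY_INDEX = [
    ("Paris is the capital of France.", [0.9, 0.1, 0.0]),
    ("The Nile flows through Egypt.", [0.0, 0.8, 0.2]),
    ("France borders Spain.", [0.7, 0.0, 0.3]),
]

query_vec = [1.0, 0.0, 0.0]  # hypothetical embedding of the question
top = retrieve(query_vec, TOY_INDEX, k=2)
prompt = build_generator_input("What is the capital of France?", top)
print(prompt)
```

In this toy setup the two France-related passages score highest and are placed ahead of the question, so a generator conditioned on `prompt` would see retrieved text alongside the query.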
First Impression is the Last Impression? Acoustic-Prosodic Cues to Persuasiveness in Competitive Debate Speeches
Do men and women persuade differently? Are they evaluated differently? Using a data set of over 1800 audio segments of first and last minutes of tournament speeches, their evaluation scores and demographic data, we investigate gender disparity in acoustic-prosodic features and any ensuing impacts on evaluations of persuasiveness. Debate tournaments provide a useful means of systematically answering these questions because they include these four components: (i) a diverse pool of intrinsically motivated professional debaters; (ii) exogenously assigned speaking position, topic and opponents; (iii) transparent scoring criteria, based purely on comparative argumentation strength; and (iv) accountable, selective panels of judges. With this data set, we analyze the acoustic-prosodic correlates of persuasiveness (i.e., pitch, intensity, harmonic-to-noise ratio (HNR), jitter, shimmer and speaking rate), taking into account individual traits (e.g., gender, native language, institution ranking, study major), to explore the existence and magnitude of discriminatory evaluation standards across social groups. This work contributes a large-scale analysis of acoustic-prosodic cues in a strategically relevant context, and discusses how demographic characteristics of speakers influence judges’ perception of persuasive argumentative speeches.
Active Imitation Learning with Noisy Guidance
Imitation learning algorithms provide state-of-the-art results on many structured prediction tasks by learning near-optimal search policies. Such algorithms assume training-time access to an expert that can provide the optimal action at any queried state; unfortunately, the number of such queries is often prohibitive, frequently rendering these approaches impractical. To combat this query complexity, we consider an active learning setting in which the learning algorithm has additional access to a much cheaper noisy heuristic that provides noisy guidance. Our algorithm, LEAQI, learns a difference classifier that predicts when the expert is likely to disagree with the heuristic, and queries the expert only when necessary. We apply LEAQI to three sequence labeling tasks, demonstrating significantly fewer queries to the expert and comparable (or better) accuracies over a passive approach.
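The querying rule described in this abstract (consult the expert only when a difference classifier predicts disagreement with the heuristic) can be sketched as follows. This is a toy illustration under stated assumptions, not the LEAQI algorithm itself: the difference classifier here is a fixed hand-written rule rather than a learned model, and the tagging task, the function names, and the example sentence are all hypothetical.

```python
from typing import Callable, List, Tuple

State = Tuple[int, str]  # (position in sentence, token); a hypothetical state type


def label_with_selective_queries(
    states: List[State],
    heuristic: Callable[[State], str],
    expert: Callable[[State], str],
    predicts_disagreement: Callable[[State], bool],
) -> Tuple[List[str], int]:
    """Label every state, paying for an expert query only when the
    difference classifier predicts the expert would overrule the heuristic."""
    labels: List[str] = []
    expert_queries = 0
    for state in states:
        label = heuristic(state)  # cheap, possibly noisy guess
        if predicts_disagreement(state):
            label = expert(state)  # expensive, assumed correct
            expert_queries += 1
        labels.append(label)
    return labels, expert_queries


def noisy_heuristic(state: State) -> str:
    """Tag any capitalized token as a name (wrong for sentence-initial words)."""
    return "NAME" if state[1][0].isupper() else "O"


def toy_expert(state: State) -> str:
    """Stand-in for a human annotator on this one toy sentence."""
    return "NAME" if state[1] == "Paris" else "O"


def toy_difference_classifier(state: State) -> bool:
    """Fixed rule standing in for a learned classifier: distrust position 0."""
    return state[0] == 0


tokens = ["The", "train", "reached", "Paris"]
states = list(enumerate(tokens))
labels, n_queries = label_with_selective_queries(
    states, noisy_heuristic, toy_expert, toy_difference_classifier
)
print(labels, n_queries)
```

In this example the expert is consulted for one of four tokens, whereas a passive approach that queries the expert at every state would use four queries.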
KEYNOTE ADDRESS 2: Gender, Politics and Charisma: Speaking Style in Political Speech
Charisma was defined by Max Weber as “a certain quality of an individual personality, by virtue of which he is set apart from ordinary men and treated as endowed with supernatural, superhuman, or at least specifically exceptional powers or qualities … not accessible to the ordinary person, but … regarded as of divine origin or as exemplary” on which basis “the individual concerned is treated as a leader” (Weber ’47). In prior work, we examined individual differences correlated with country of origin and political leanings in production and perception of political speech. More recently, we have investigated gender differences in production and perception and how raters differ depending upon their level of education, personality traits, and their own speaking style. Currently, we are examining political speech again, from participants in the Democratic presidential contest, a more gender-balanced group than in the past, as a way to investigate gender differences in political speech production and perception.
STAR Talks: Session 2
CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models
Pretrained language models, especially masked language models (MLMs), have seen success across many NLP tasks. However, there is ample evidence that they use the cultural biases that are undoubtedly present in the corpora they are trained on, implicitly creating harm with biased representations. To measure some forms of social bias in language models against protected demographic groups in the US, we introduce the Crowdsourced Stereotype Pairs benchmark (CrowS-Pairs). CrowS-Pairs has 1508 examples that cover stereotypes dealing with nine types of bias, like race, religion, and age. In CrowS-Pairs, a model is presented with two sentences: one that is more stereotyping and another that is less stereotyping. The data focuses on stereotypes about historically disadvantaged groups and contrasts them with advantaged groups. We find that all three of the widely used MLMs we evaluate substantially favor sentences that express stereotypes in every category in CrowS-Pairs. As work on building less biased models advances, this dataset can be used as a benchmark to evaluate progress.
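The paired comparison described in this abstract suggests a simple summary statistic: the share of pairs in which a model prefers the more stereotyping sentence. The sketch below is one hedged way to compute such a rate. The function name, the placeholder sentence identifiers, and the toy scores are hypothetical, and the scoring function stands in for whatever sentence-level score a given model provides; it is not the benchmark's own scoring procedure.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (more stereotyping sentence, less stereotyping sentence)


def stereotype_preference_rate(
    pairs: List[Pair], score: Callable[[str], float]
) -> float:
    """Percentage of pairs where the model scores the more stereotyping
    sentence strictly higher than its less stereotyping counterpart."""
    if not pairs:
        raise ValueError("at least one sentence pair is required")
    favored = sum(1 for more, less in pairs if score(more) > score(less))
    return 100.0 * favored / len(pairs)


# Placeholder identifiers instead of real sentences; the scores are made up.
TOY_SCORES: Dict[str, float] = {
    "pair1-more": -10.0, "pair1-less": -12.0,
    "pair2-more": -9.0, "pair2-less": -8.5,
    "pair3-more": -7.0, "pair3-less": -7.5,
    "pair4-more": -11.0, "pair4-less": -11.5,
}
toy_pairs: List[Pair] = [(f"pair{i}-more", f"pair{i}-less") for i in range(1, 5)]

rate = stereotype_preference_rate(toy_pairs, TOY_SCORES.__getitem__)
print(rate)
```

Under this sketch, a rate near 50 would indicate no systematic preference between the two sentence types, while a higher rate indicates that the model favors the more stereotyping sentences.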
Automatic Fact-Guided Sentence Modification
Online encyclopediae like Wikipedia contain large amounts of text that need frequent corrections and updates. The new information may contradict existing content in encyclopediae. In this paper, we focus on rewriting such dynamically changing articles. This is a challenging constrained generation task, as the output must be consistent with the new information and fit into the rest of the existing document. To this end, we propose a two-step solution: (1) We identify and remove the contradicting components in a target text for a given claim, using a neutralizing stance model; (2) We expand the remaining text to be consistent with the given claim, using a novel two-encoder sequence-to-sequence model with copy attention. Applied to a Wikipedia fact update dataset, our method successfully generates updated sentences for new claims, achieving the highest SARI score. Furthermore, we demonstrate that generating synthetic data through such rewritten sentences can successfully augment the FEVER fact-checking training dataset, leading to a relative error reduction of 13%.
Deception Detection in a Human-Machine Visual Dialogue Task
When humans attempt to detect deception, they perform two actions: recognize signs of deception, and ask questions to attempt to unveil a deceptive conversational partner. We focus on the latter, constructing a dialogue system that asks questions to attempt to catch a potentially deceptive conversation partner. To explore these complexities in a non-stationary environment, we appeal to an eye-spy style visual dialogue game where a questioner and oracle communicate, achieving common ground to identify a pre-specified object within an image. The questioner interrogates the oracle via yes or no questions, in an attempt to identify some predetermined target entity. To investigate deception, we instruct humans to interact with this autonomous questioner and act in any way they believe would cause the questioner to fail. We use this dialogue to ground an autonomous oracle with human deceptive strategies. We then introduce the questioner to a modified game where it is randomly paired with a cooperative or deceptive oracle with a new goal to either identify the pre-specified object or identify if it is paired with the deceptive oracle. Using reinforcement learning, we train the questioner to succeed in this modified game setting. Our work explores the design of conversational systems which exhibit resilience to human deception in non-stationary environments and establishes a test-bed for investigation of human-machine deception and misinformation.
Counterfactually-Augmented SNLI Training Data Does Not Yield Better Generalization Than Unaugmented Data
A growing body of work shows that models exploit annotation artifacts to achieve state-of-the-art performance on standard crowdsourced benchmarks—datasets collected from crowdworkers to create an evaluation task—while still failing on minimally perturbed examples. Recent work has explored the use of counterfactually-augmented data—data built by minimally editing a set of seed examples to yield counterfactual labels—to augment training data associated with these benchmarks and build more robust classifiers that generalize better. We use English natural language inference data to test model generalization and robustness and find that models trained on counterfactually-augmented Stanford Natural Language Inference (SNLI) data do not generalize better compared to similarly large unaugmented datasets, yielding slightly worse out-of-domain performance with a difference of 0.4 points. Further, we find that the data augmentation hurts performance by 6.7 points on one of our evaluation sets with the counterfactually-augmented training set yielding worse results than the seed examples. We thus argue that careful consideration should be given to the trade-offs between seed examples and augmented data in counterfactually-augmented datasets and encourage researchers to explore this line of work before using such data for training.
KEYNOTE ADDRESS 3: Semantic Parsing for Natural Language Interfaces
Natural language promises to be the ultimate interface for interacting with computers, allowing users to effortlessly tap into the wealth of digital information and extract insights from it. Today, virtual assistants such as Alexa, Siri, and Google Assistant have given a glimpse into how this long-standing dream can become a reality, but there is still much work to be done.
In this talk, I will discuss building natural language interfaces based on semantic parsing, which converts natural language into programs that can be executed by a computer. There are multiple challenges for building semantic parsers: how to acquire data without requiring laborious annotation, how to represent the meaning of sentences, and perhaps most importantly, how to widen the domains and capabilities of a semantic parser. Finally, I will talk about a new promising paradigm for tackling these challenges based on learning interactively from users.
KEYNOTE ADDRESS 4: De-noising Sequence-to-Sequence Pre-training
De-noising auto-encoders can be pre-trained at a very large scale by noising and then reconstructing any input text. Existing methods, based on variations of masked language models, have transformed the field and now provide the de facto initialization to be tuned for nearly every task. In this talk, I will present our work on sequence-to-sequence pre-training that introduces and carefully measures the impact of two new types of noising strategies. I will first describe an approach that allows arbitrary noising, by learning to translate any corrupted text back to the original with standard Transformer-based neural machine translation architectures. I will show that the resulting mono-lingual (BART) and multi-lingual (mBART) models provide effective initialization for learning a wide range of discrimination and generation tasks, including question answering, summarization, and machine translation. I will also present our recently introduced MARGE model, where we self-supervise the reconstruction of target text by retrieving a set of related texts (in many languages) and conditioning on them to maximize the likelihood of generating the original. The objective noisily captures aspects of paraphrase, translation, multi-document summarization, and information retrieval, allowing for strong zero-shot performance with no fine-tuning, as well as consistent performance gain when fine-tuned for individual tasks. Together, these techniques provide the most comprehensive set of pre-training methods to date, as well as the first viable alternative to the dominant masked language modeling pre-training paradigm.
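The noise-then-reconstruct setup described in this abstract can be illustrated by building a single (corrupted, original) training pair. The sketch below uses plain token masking as the noising function, which is only one possible choice and is an assumption here; the `<mask>` symbol, the function names, the masking rate, and the whitespace tokenization are all hypothetical and are not taken from the models named in the talk.

```python
import random
from typing import List, Tuple

MASK = "<mask>"  # hypothetical mask symbol


def mask_tokens(tokens: List[str], rate: float, rng: random.Random) -> List[str]:
    """Independently replace each token with MASK with probability `rate`."""
    return [MASK if rng.random() < rate else token for token in tokens]


def make_training_pair(text: str, rate: float = 0.3, seed: int = 0) -> Tuple[str, str]:
    """Return (corrupted source, original target) for a denoising objective.

    A sequence-to-sequence model would be trained to map the source back to
    the target; training itself is outside the scope of this sketch.
    """
    rng = random.Random(seed)  # seeded so the corruption is reproducible
    tokens = text.split()
    corrupted = mask_tokens(tokens, rate, rng)
    return " ".join(corrupted), text


original = "the cat sat on the mat"
source, target = make_training_pair(original, rate=0.5, seed=0)
print(source)
print(target)
```

Because the target is always the untouched original, any other corruption function (for example, deleting or reordering tokens) could be swapped in for `mask_tokens` without changing the rest of the pair-building code.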
Closing Remarks and Awards