Figure 8 - uploaded by Marilyn A. Walker
Chart comparing distribution of human ratings for SPOT, RBS, ICF, NOAGG and RANDOM.


Contexts in source publication

Context 1
... SPOT was statistically better than both of these systems (p < .01). Figure 8 shows that SPOT got more high rankings than either of the rule-based systems. In a sense this may not be that surprising, because as Hovy and Wanner (1996) point out, it is difficult to construct a rule-based sentence planner that handles all the rule interactions in a reasonable way. ...
Context 2
... it would have been a possible outcome for SPOT not to be different from either system, e.g. if the sp-trees produced by RANDOM were all equally good, or if the aggregation rules that SPOT learned produced output less readable than NOAGG. Figure 8 shows that the distributions of scores for SPOT vs. the baseline systems are very different, with SPOT skewed towards higher scores. ...
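The contexts above describe SPOT choosing among candidate sentence plans (sp-trees) using scores learned from human ratings. As a rough illustration of that ranking step, here is a minimal Python sketch; the candidate plans, feature names, and scoring weights are invented for illustration (the actual system learns a ranking function over sp-tree features from human ratings):

```python
# Minimal sketch of ranking candidate sentence plans with a learned score.
# Candidates, features, and weights are toy stand-ins, not SPOT's real features.

def rank_sentence_plans(candidates, score_fn):
    """Return candidate sentence plans sorted best-first by a score function."""
    return sorted(candidates, key=score_fn, reverse=True)

# Toy candidates: each is a (realization, feature dict) pair.
candidates = [
    ("Leaving from Newark. Going to Dallas. Leaving in the morning.",
     {"num_sentences": 3, "uses_aggregation": 0}),
    ("You are leaving from Newark and going to Dallas in the morning.",
     {"num_sentences": 1, "uses_aggregation": 1}),
]

def toy_score(candidate):
    # Stand-in scorer: prefer aggregated, shorter plans. In the real system the
    # weights are learned from human ratings of candidate plans.
    _, feats = candidate
    return feats["uses_aggregation"] - 0.1 * feats["num_sentences"]

best, _ = rank_sentence_plans(candidates, toy_score)[0]
print(best)
```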

Citations

... This paper applies few-shot PBL to the task of controllable generation of DAs using an overgenerate-and-rank NLG framework. The overgenerate-and-rank paradigm for NLG has primarily used two methods for ranking: (1) language model probability (Langkilde and Knight, 1998); and (2) ranking functions trained from human feedback (Rambow et al., 2001; Bangalore et al., 2000; Liu et al., 2016). We extend this framework by applying it in the context of PBL, by using DA probability in ranking, and by comparing many ranking functions, including Beyond-BLEU and BLEU baselines (Wieting et al., 2019; Papineni et al., 2002). ...
Preprint
Full-text available
Dialogue systems need to produce responses that realize multiple types of dialogue acts (DAs) with high semantic fidelity. In the past, natural language generators (NLGs) for dialogue were trained on large parallel corpora that map from a domain-specific DA and its semantic attributes to an output utterance. Recent work shows that pretrained language models (LLMs) offer new possibilities for controllable NLG using prompt-based learning. Here we develop a novel few-shot overgenerate-and-rank approach that achieves the controlled generation of DAs. We compare eight few-shot prompt styles that include a novel method of generating from textual pseudo-references using a textual style transfer approach. We develop six automatic ranking functions that identify outputs with both the correct DA and high semantic accuracy at generation time. We test our approach on three domains and four LLMs. To our knowledge, this is the first work on NLG for dialogue that automatically ranks outputs using both DA and attribute accuracy. For completeness, we compare our results to fine-tuned few-shot models trained with 5 to 100 instances per DA. Our results show that several prompt settings achieve perfect DA accuracy, and near perfect semantic accuracy (99.81%) and perform better than few-shot fine-tuning.
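The citing work above builds on the overgenerate-and-rank paradigm: generate many candidate realizations from a meaning representation, then keep the highest-scoring one, where the scorer may be a language-model probability or a ranking function trained from human feedback. A minimal sketch of that control flow, with invented templates and a placeholder scorer standing in for a real language model:

```python
# Illustrative overgenerate-and-rank loop. The templates and the scorer are
# stand-ins; a real system would use a generator and a trained ranker or LM.
import random

def overgenerate(meaning_representation, n=10):
    """Produce n candidate realizations (here: trivial templated variants)."""
    templates = [
        "{name} is the best option.",
        "I would recommend {name}.",
        "{name} is a good choice.",
    ]
    return [random.choice(templates).format(**meaning_representation) for _ in range(n)]

def score_with_lm(text):
    # Placeholder for a language-model log-probability (e.g. from an n-gram or neural LM).
    return -len(text.split())

def rank(candidates, scorer):
    """Return the highest-scoring candidate."""
    return max(candidates, key=scorer)

mr = {"name": "Chanpen Thai"}
print(rank(overgenerate(mr), score_with_lm))
```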
... This means that they currently cover a limited set of relations, ones that are populated frequently enough to make writing templates worthwhile. As previous work on dialogue generation has shown, even combinations of existing relations typically require multiple additional templates to be written [34,33,42]. The existing KG-RG entities and relations are in Table 1, along with the novel KG-RG relations and entities that we experiment with below using 2-shot tuning. ...
... We also showed that Athena-Jurassic performs well in 2-shot tuning tests, using completely novel sets of KG triples with relations and entities never seen in tuning. These novel MRs are not currently included in Athena, because the relations are rare, and creating templates for novel relations or sets of relations is typically not worth the human effort [34,33]. For example the MR in M4 in Figure 7 describes the event of Muhammed Ali lighting the Olympic torch in 1996, a rarely populated event for the athlete entity type. ...
Preprint
Full-text available
One challenge with open-domain dialogue systems is the need to produce high-quality responses on any topic. We aim to improve the quality and coverage of Athena, an Alexa Prize dialogue system. We utilize Athena's response generators (RGs) to create training data for two new neural Meaning-to-Text RGs, Athena-GPT-Neo and Athena-Jurassic, for the movies, music, TV, sports, and video game domains. We conduct few-shot experiments, both within and cross-domain, with different tuning set sizes (2, 3, 10), prompt formats, and meaning representations (MRs) for sets of WikiData KG triples, and dialogue acts with 14 possible attribute combinations. Our evaluation uses BLEURT and human evaluation metrics, and shows that with 10-shot tuning, Athena-Jurassic's performance is significantly better for coherence and semantic accuracy. Experiments with 2-shot tuning on completely novel MRs result in a huge performance drop for Athena-GPT-Neo, whose semantic accuracy falls to 0.41, and whose untrue hallucination rate increases to 12%. Experiments with dialogue acts for video games show that with 10-shot tuning, both models learn to control dialogue acts, but Athena-Jurassic has significantly higher coherence, and only 4% untrue hallucinations. Our results suggest that Athena-Jurassic can reliably produce high-quality outputs for live systems with real users. To our knowledge, these are the first results demonstrating that few-shot tuning on a massive language model can create NLGs that generalize to new domains, and produce high-quality, semantically-controlled, conversational responses directly from MRs and KG triples.
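The Athena work above tunes models on meaning representations built from WikiData KG triples. A hypothetical sketch of how such triples might be linearized into a few-shot prompt; the MR format, example triples, and reference text are assumptions for illustration, not the paper's exact prompt format:

```python
# Hypothetical linearization of KG triples into a few-shot Meaning-to-Text prompt.
# Triple sets, reference sentences, and the "MR:/Text:" layout are illustrative.

def triples_to_mr(triples):
    """Linearize (subject, predicate, object) triples into a single MR string."""
    return " | ".join(f"{s} ; {p} ; {o}" for s, p, o in triples)

def build_prompt(tuning_examples, new_triples):
    """Assemble a few-shot prompt: tuning examples followed by the new MR to realize."""
    lines = []
    for triples, text in tuning_examples:
        lines.append(f"MR: {triples_to_mr(triples)}\nText: {text}\n")
    lines.append(f"MR: {triples_to_mr(new_triples)}\nText:")
    return "\n".join(lines)

examples = [
    ([("Muhammad Ali", "sport", "boxing"),
      ("Muhammad Ali", "country of citizenship", "United States")],
     "Muhammad Ali was an American boxer."),
]
new_mr = [("Serena Williams", "sport", "tennis")]
print(build_prompt(examples, new_mr))
```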
... It is an English place. In contrast, earlier models of statistical natural language generation (SNLG) for dialogue were based around the NLG architecture in Figure 1 (Rambow et al., 2001; Stent, 2002). Here the dialogue manager sends one or more dialogue acts and their arguments to the NLG engine, which then makes decisions about how to render the utterance using separate modules for content planning and structuring, sentence planning and surface realization (Reiter and Dale, 2000). ...
Preprint
Responses in task-oriented dialogue systems often realize multiple propositions whose ultimate form depends on the use of sentence planning and discourse structuring operations. For example, a recommendation may consist of an explicitly evaluative utterance, e.g. Chanpen Thai is the best option, along with content related by the justification discourse relation, e.g. It has great food and service, that combines multiple propositions into a single phrase. While neural generation methods integrate sentence planning and surface realization in one end-to-end learning framework, previous work has not shown that neural generators can: (1) perform common sentence planning and discourse structuring operations; (2) make decisions as to whether to realize content in a single sentence or over multiple sentences; (3) generalize sentence planning and discourse relation operations beyond what was seen in training. We systematically create large training corpora that exhibit particular sentence planning operations and then test neural models to see what they learn. We compare models without explicit latent variables for sentence planning with ones that provide explicit supervision during training. We show that only the models with additional supervision can reproduce sentence planning and discourse operations and generalize to situations unseen in training.
... It is an English place. In contrast, earlier models of statistical natural language generation (SNLG) for dialogue were based around the NLG architecture in Figure 1 (Rambow et al., 2001; Stent, 2002; Stent and Molina, 2009). Here the dialogue manager sends one or more dialogue acts and their arguments to the NLG engine, which then makes decisions about how to render the utterance using separate modules for content planning and structuring, sentence planning and surface realization (Reiter and Dale, 2000). The sentence planner's job includes: ...
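The passage above describes the classic modular SNLG pipeline: the dialogue manager hands dialogue acts to the NLG engine, which applies content planning and structuring, sentence planning, and surface realization in turn. A schematic Python sketch of that division of labour; each module body is a toy stand-in rather than a real implementation:

```python
# Schematic sketch of the modular NLG pipeline (content planning, sentence
# planning, surface realization). Module internals are toy stand-ins; real
# systems use much richer representations such as sp-trees and discourse relations.

def content_plan(dialogue_acts):
    # Content planning/structuring: decide which propositions to express and in what order.
    return dialogue_acts

def sentence_plan(propositions):
    # Sentence planning: decide how propositions are grouped into sentences (aggregation).
    # Toy rule: realize each proposition as its own sentence (no aggregation).
    return [[p] for p in propositions]

def surface_realize(sentence_groups):
    # Surface realization: render each planned sentence as a string.
    return " ".join(" and ".join(group).capitalize() + "." for group in sentence_groups)

dialogue_acts = ["chanpen thai is the best option", "it has great food and service"]
print(surface_realize(sentence_plan(content_plan(dialogue_acts))))
```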
... We see Steps 1 to 3 as the overgeneration phase, aimed at vastly expanding the types of stylistic variation possible, while Step 4 is the ranking phase, in a classic overgenerate-and-rank NLG architecture (Langkilde and Knight, 1998; Rambow et al., 2001). We focus in this paper on Steps 1 to 3, expecting to improve these steps before we move on to Step 4. Thus, in this paper, we conducted an evaluation experiment to compare three different types of NLG templates: pre-defined BASIC templates similar to those used in current NLG engines for the restaurant domain (Wen et al., 2015), and the basic templates stylized with hyperbolic language, e.g. Emilio's decor and service are both decent, but its food quality is nothing short of excellent. ...
Article
Many of the creative and figurative elements that make language exciting are lost in translation in current natural language generation engines. In this paper, we explore a method to harvest templates from positive and negative reviews in the restaurant domain, with the goal of vastly expanding the types of stylistic variation available to the natural language generator. We learn hyperbolic adjective patterns that are representative of the strongly-valenced expressive language commonly used in either positive or negative reviews. We then identify and delexicalize entities, and use heuristics to extract generation templates from review sentences. We evaluate the learned templates against more traditional review templates, using subjective measures of "convincingness", "interestingness", and "naturalness". Our results show that the learned templates score highly on these measures. Finally, we analyze the linguistic categories that characterize the learned positive and negative templates. We plan to use the learned templates to improve the conversational style of dialogue systems in the restaurant domain.
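A central step in the template-harvesting approach above is delexicalization: entity mentions in a review sentence are replaced with slots so the sentence can be reused as a generation template. A rough sketch, with an invented slot inventory and example sentence:

```python
# Rough sketch of delexicalizing a review sentence into a template and filling it
# again for a new entity. The slot names and entity list are illustrative.

ENTITY_SLOTS = {
    "Emilio": "[RESTAURANT]",
}

def delexicalize(sentence, entity_slots=ENTITY_SLOTS):
    """Replace known entity mentions with slot names to obtain a reusable template."""
    for entity, slot in entity_slots.items():
        sentence = sentence.replace(entity, slot)
    return sentence

def relexicalize(template, values):
    """Fill slots with entities from a new meaning representation."""
    for slot, value in values.items():
        template = template.replace(slot, value)
    return template

template = delexicalize("Emilio's food quality is nothing short of excellent.")
print(template)
print(relexicalize(template, {"[RESTAURANT]": "Chanpen Thai"}))
```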
... A ranking function is employed to decide whether additional repetitions need to be carried out before the final text is output. The model is also evaluated with the use of a travel system [65], which showcases different aspects of the novel model. ...
Article
Full-text available
Natural Language Generation (NLG) is defined as the systematic approach for producing human-understandable natural language text based on non-textual data or from meaning representations. This is a significant area which empowers human-computer interaction. It has also given rise to a variety of theoretical as well as empirical approaches. This paper intends to provide a detailed overview and a classification of the state-of-the-art approaches in Natural Language Generation. The paper explores NLG architectures and tasks classed under document planning, micro-planning and surface realization modules. Additionally, this paper identifies the gaps in existing NLG research which require further work in order to make NLG a widely usable technology.
... A different direction has been followed by [34], [35] and [6], where an over-generate and rank approach to sentence generation has been suggested. In this approach, the overgeneration phase can follow user- and domain-independent rules to generate a set of possible sentences, and the ranking phase is responsible for ranking the ...
Conference Paper
Full-text available
We propose a novel approach for handling first-time users in the context of automatic report generation from time-series data in the health domain. Handling first-time users is a common problem for Natural Language Generation (NLG) and interactive systems in general: the system cannot adapt to users without prior interaction or user knowledge. In this paper, we propose a novel framework for generating medical reports for first-time users, using multi-objective optimisation (MOO) to account for the preferences of multiple possible user types, where the content preferences of potential users are modelled as objective functions. Our proposed approach outperforms two meaningful baselines in an evaluation with prospective users, yielding large (.79) and medium (.46) effect sizes, respectively.
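The paper above models the content preferences of different possible user types as objective functions and uses multi-objective optimisation to pick report content for first-time users. The sketch below illustrates the general idea with invented user types, content items, and preference scores, and a simple maximin compromise in place of the paper's actual optimiser:

```python
# Illustrative content selection against several user-type objective functions.
# User types, content items, weights, and the maximin rule are all assumptions
# made for illustration, not the paper's method or data.

from itertools import combinations

CONTENT_ITEMS = ["heart_rate_trend", "sleep_summary", "exercise_minutes", "medication_reminders"]

# Each user type's preference for each content item (toy values in [0, 1]).
USER_TYPE_PREFS = {
    "clinician": {"heart_rate_trend": 0.9, "sleep_summary": 0.4,
                  "exercise_minutes": 0.3, "medication_reminders": 0.8},
    "patient":   {"heart_rate_trend": 0.3, "sleep_summary": 0.8,
                  "exercise_minutes": 0.7, "medication_reminders": 0.6},
}

def worst_case_score(content_set):
    """Score a content set by its least-satisfied user type (a maximin compromise)."""
    return min(sum(prefs[item] for item in content_set)
               for prefs in USER_TYPE_PREFS.values())

# Pick the two-item report content that best satisfies the worst-off user type.
best = max(combinations(CONTENT_ITEMS, 2), key=worst_case_score)
print(best)
```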
... This challenge has also been noted by Sripada et al. (2004). We handle this challenge in three different ways: (1) by applying multi-label classification, which is able to handle mismatches in aligned corpora (Chapter 4); (2) by asking users to rate expert-constructed and random summaries in order to derive their preferences, similar to Rambow et al. (2001) (Chapters 3, 4 and 5); and (3) by clustering the experts' responses so that experts with the same preferences belong to the same cluster (Chapter 6). ...