Joyce Y. Chai's research while affiliated with Michigan State University and other places

Publications (86)

Preprint
Full-text available
In this paper, we study the problem of recognizing compositional attribute-object concepts within the zero-shot learning (ZSL) framework. We propose an episode-based cross-attention (EpiCA) network which combines the merits of the cross-attention mechanism and an episode-based training strategy to recognize novel compositional concepts. Firstly, EpiCA bases o...
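As context for the mechanism named above, the following is a minimal NumPy sketch of one cross-attention step between a concept embedding and image-region features. It is an illustrative assumption, not EpiCA's implementation; the names and shapes (concept, regions, d=64) are made up.

    import numpy as np

    def softmax(x, axis=-1):
        e = np.exp(x - x.max(axis=axis, keepdims=True))
        return e / e.sum(axis=axis, keepdims=True)

    def cross_attention(query, context, d_k):
        # query:   (n_q, d) concept (attribute-object) embeddings
        # context: (n_c, d) image region features
        scores = query @ context.T / np.sqrt(d_k)  # (n_q, n_c) similarities
        weights = softmax(scores, axis=-1)         # attend over regions
        return weights @ context                   # concept-conditioned summary

    # Toy usage: one "red apple" concept query against 5 region features.
    d = 64
    concept = np.random.randn(1, d)
    regions = np.random.randn(5, d)
    print(cross_attention(concept, regions, d_k=d).shape)  # (1, 64)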
Preprint
Full-text available
In the NLP community, recent years have seen a surge of research activities that address machines' ability to perform deep language understanding that goes beyond what is explicitly stated in text, relying instead on reasoning and knowledge of the world. Many benchmark tasks and datasets have been created to support the development and evaluation o...
Preprint
Full-text available
We present a new explainable AI (XAI) framework aimed at increasing justified human trust and reliance in the AI machine through explanations. We pose explanation as an iterative communication process, i.e., dialog, between the machine and human user. More concretely, the machine generates a sequence of explanations in a dialog which takes into accoun...
Preprint
This paper presents an explainable AI (XAI) system that provides explanations for its predictions. The system consists of two key components -- namely, the prediction And-Or graph (AOG) model for recognizing and localizing concepts of interest in input data, and the XAI model for providing explanations to the user about the AOG's predictions. In th...
Conference Paper
Full-text available
Language communication plays an important role in human learning and knowledge acquisition. With the emergence of a new generation of cognitive robots, empowering these robots to learn directly from human partners becomes increasingly important. This paper gives a brief introduction to interactive task learning where humans can teach physical agent...
Article
Full-text available
One significant simplification in most previous work on robot learning is the closed-world assumption where the robot is assumed to know ahead of time a complete set of predicates describing the state of the physical world. However, robots are not likely to have a complete model of the world especially when learning a new task. To address this prob...
Article
To enable situated human-robot dialogue, techniques to support grounded language communication are essential. One particular challenge is to ground human language to a robot's internal representation of the physical world. Although copresent in a shared environment, humans and robots have mismatched capabilities in reasoning, perception, and action...
Conference Paper
Robotic systems are traditionally programmed through off-line coding interfaces for manufacturing tasks. These programming methods are usually time-consuming and require substantial human effort. They cannot meet the emerging requirements of robotic systems in many areas such as intelligent manufacturing and customized production. To address this issue,...
Conference Paper
Full-text available
To enable effective collaborations between humans and cognitive robots, it is important for robots to continuously acquire task knowledge from human partners. To address this issue, we are currently developing a framework that supports task learning through visual demonstration and natural language dialogue. One core component of this framework is...
Conference Paper
Previous work has shown that, when the agent and the human have mismatched representations of the shared world, traditional approaches that generate a single long referring expression to refer to an object are inadequate for listeners to correctly identify the target object. To mediate the mismatched representations, collaborative models have b...
Conference Paper
Enabling natural language control of robots is challenging, since human users are often not familiar with the underlying robotic system, and its capabilities and limitations. Many exceptions may occur when natural language commands are translated into lower-level robot actions. This paper gives a brief introduction to three levels of exceptions and...
Conference Paper
Full-text available
In human-robot dialogue, although a robot and its human partner are co-present in a shared environment, they have significantly mismatched perceptual capabilities (e.g., recognizing objects in the surroundings). When a shared perceptual basis is missing, it becomes difficult for the robot to identify referents in the physical world that are referre...
Conference Paper
Robots often have limited knowledge and need to continuously acquire new knowledge and skills in order to collaborate with their human partners. To address this issue, this paper describes an approach which allows human partners to teach a robot (i.e., a robotic arm) new high-level actions through natural language instructions. In particular, built u...
Conference Paper
In situated dialogue with artificial agents (e.g., robots), although a human and an agent are co-present, the agent's representation and the human's representation of the shared environment are significantly mismatched. Because of this misalignment, our previous work has shown that when the agent applies traditional approaches to generate referring...
Conference Paper
A new planning and control scheme for natural language control of robotic operations using perceptive feedback is presented. Different from traditional open-loop natural language control, the scheme incorporates the high-level planning and low-level control of the robotic systems and makes high-level planning a closed-loop process...
Conference Paper
Full-text available
In situated human-robot dialogue, although humans and robots are co-present in a shared environment, they have significantly mismatched capabilities in perceiving the shared environment. Their representations of the shared world are misaligned. In order for humans and robots to communicate with each other successfully using language, it is importan...
Article
This editorial introduction first explains the origin of this special section. It then outlines how each of the two articles included sheds light on possibilities for conversational dialog systems to use eye gaze as a signal that reflects aspects of participation in the dialog: degree of engagement and turn taking behavior, respectively.
Chapter
In situated dialogue, although an artificial agent and its human partner are co-present in a shared environment, they have significantly mismatched capabilities in perceiving the environment. When a shared perceptual basis is broken, referential grounding between partners becomes more challenging. Our hypothesis is that in such a situation, non-ver...
Article
Nominal predicates often carry implicit arguments. Recent work on semantic role labeling has focused on identifying arguments within the local context of a predicate; implicit arguments, however, have not been systematically examined. To address this limitation, we have manually annotated a corpus of implicit arguments for ten predicates from NomBa...
Conference Paper
Full-text available
In language-based interaction between a human and an artificial agent (e.g., robot) in a physical world, because the human and the agent have different knowledge and capabilities in perceiving the shared environment, referential grounding is very difficult. To facilitate such interaction, it is important for the agent to continuously learn and acqu...
Conference Paper
Full-text available
To enable effective referential grounding in situated human robot dialogue, we have conducted an empirical study to investigate how conversation partners collaborate and mediate shared basis when they have mismatched visual perceptual capabilities. In particular, we have developed a graph-based representation to capture linguistic discourse and vis...
Conference Paper
Text input aids such as automatic correction systems play an increasingly important role in facilitating fast text entry and efficient communication between text message users. Although these tools are beneficial when they work correctly, they can cause significant communication problems when they fail. To improve its autocorrection performance, it...
Article
Software (soft) keyboards are becoming increasingly popular on mobile devices. To attempt to improve soft keyboard input accuracy, key-target resizing algorithms that dynamically change the size of each key's target area have been developed. Although methods that employ personalized touch models have been shown to outperform general models, previou...
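Key-target resizing of this kind is often framed as Bayesian key selection: a per-key touch likelihood combined with a language-model prior effectively enlarges the targets of likely keys. The sketch below illustrates that framing under assumed names, coordinates, and parameters; it is not this article's algorithm.

    import math

    def gaussian_2d(x, y, mu_x, mu_y, sigma):
        # Isotropic Gaussian likelihood of a touch landing near a key center.
        return math.exp(-((x - mu_x) ** 2 + (y - mu_y) ** 2) / (2 * sigma ** 2))

    def select_key(touch, keys, lm_prior, sigma=12.0):
        # touch: (x, y) pixel coordinates; keys: {char: (center_x, center_y)}
        # lm_prior: {char: P(char | typed prefix)} from a language model.
        # Likelier keys win ambiguous touches, i.e., their targets grow.
        def score(ch):
            cx, cy = keys[ch]
            return gaussian_2d(*touch, cx, cy, sigma) * lm_prior.get(ch, 1e-6)
        return max(keys, key=score)

    keys = {"q": (15, 30), "w": (45, 30), "e": (75, 30)}
    prior = {"q": 0.05, "w": 0.15, "e": 0.80}  # e.g., after typing "th"
    print(select_key((55, 31), keys, prior))   # 'e', though the touch lands nearer 'w'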
Article
Given the recent advances in eye tracking technology and the availability of nonintrusive and high-performance eye tracking devices, there has never been a better time to explore new opportunities to incorporate eye gaze in intelligent and natural human-machine communication. In this special issue, we present six articles that cover various aspects...
Conference Paper
Many prior studies have investigated the recovery of semantic arguments for nominal predicates. The models in many of these studies have assumed that arguments are independent of each other. This assumption simplifies the computational modeling of semantic arguments, but it ignores the joint nature of natural language. This paper presents a prelimi...
Article
To tackle the vocabulary problem in conversational systems, previous work has applied unsupervised learning approaches on co-occurring speech and eye gaze during interaction to automatically acquire new words. Although these approaches have shown promise, several issues related to human language behavior and human-machine conversation have not been...
Article
Most conversation systems tend to fail when unexpected words are encountered. To overcome this problem, conversational systems must be able to learn new words automatically during human machine conversation. Motivated by psycholinguistic findings on eye gaze and human language processing, we have developed several techniques to incorporate human ey...
Conference Paper
This workshop brought researchers from academia and industry together to share recent advances and discuss research directions and opportunities for the next generation of intelligent human-machine interaction that incorporates eye gaze.
Conference Paper
Full-text available
Despite its substantial coverage, NomBank does not account for all within-sentence arguments and ignores extra-sentential arguments altogether. These arguments, which we call implicit, are important to semantic processing, and their recovery could potentially benefit many NLP applications. We present a study of implicit arguments for a sele...
Conference Paper
In situated dialogue humans often utter linguistic expressions that refer to extralinguistic entities in the environment. Correctly resolving these references is critical yet challenging for artificial agents partly due to their limited speech recognition and language understanding capabilities. Motivated by psycholinguistic studies demonstrating a...
Article
Full-text available
In human robot dialogue, identifying intended referents from human partners' spatial language is challenging. This is partly due to the automated inference of a potentially ambiguous underlying reference system (i.e., frame of reference). To improve spatial language understanding, we conducted an empirical study to investigate the prevalence of ambigui...
Conference Paper
The second person pronoun you serves different functions in English. Each of these different types often corresponds to a different term when translated into another language. Correctly identifying different types of you can be beneficial to machine translation systems. To address this issue, we investigate disambiguation of different types of you...
Conference Paper
While a significant amount of research has been devoted to textual entailment, automated entailment from conversational scripts has received less attention. To address this limitation, this paper investigates the problem of conversation entailment: automated inference of hypotheses from conversation scripts. We examine two levels of semantic re...
Conference Paper
During multiparty meetings, participants can use non-verbal modalities such as hand gestures to make reference to the shared environment. Therefore, one hypothesis is that incorporating hand gestures can improve coreference identification, a task that automatically identifies what participants refer to with their linguistic expressions. To evaluate...
Conference Paper
Full-text available
In multimodal human machine conversation, successfully interpreting human attention is critical. While attention has been studied extensively in linguistic processing and visual processing, it is not clear how linguistic attention is aligned with visual attention in multimodal conversational interfaces. To address this issue, we conducted a prelimi...
Conference Paper
Full-text available
Nominals frequently surface without overtly expressed arguments. In order to measure the potential benefit of nominal SRL for downstream processes, such nominals must be accounted for. In this paper, we show that a state-of-the-art nominal SRL system with an overall argument F1 of 0.76 suffers a performance loss of more than 9% when nominals...
Conference Paper
Given the increasing amount of conversation data, techniques to automatically acquire information about conversation participants have become more important. Towards this goal, we investigate the problem of conversation entailment, a task that determines whether a given conversation discourse entails a hypothesis about the participants. T...
Conference Paper
Motivated by the psycholinguistic finding that human eye gaze is tightly linked to speech production, previous work has applied naturally occurring eye gaze for automatic vocabulary acquisition. However, unlike in the typical settings for psycholinguistic studies, eye gaze can serve different functions in human-machine conversation. Some...
Conference Paper
One major bottleneck in conversational systems is their inability to interpret unexpected user language inputs such as out-of-vocabulary words. To overcome this problem, conversational systems must be able to learn new words automatically during human machine conversation. Motivated by psycholinguistic findings on eye gaze and human...
Conference Paper
Multimodal conversational interfaces allow users to carry a dialog with a graphical display using speech to accomplish a particular task. Motivated by previous psycholinguistic findings, we examine how eye-gaze contributes to reference resolution in such a setting. Specifically, we present an integrated probabilistic framework that combines...
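One simple way to realize the kind of integration this abstract describes is a weighted mixture of a speech-based match distribution and a gaze-salience distribution over candidate objects. The sketch below is an illustration of that idea under stated assumptions, not the paper's actual framework; alpha and the object names are made up.

    def resolve_reference(expr_match, gaze_salience, alpha=0.6):
        # expr_match:    {object_id: P(object | spoken expression)}
        # gaze_salience: {object_id: P(object | recent fixations)}
        # alpha weights linguistic vs. gaze evidence (assumed; tuned on data in practice)
        scores = {o: alpha * p + (1 - alpha) * gaze_salience.get(o, 0.0)
                  for o, p in expr_match.items()}
        total = sum(scores.values()) or 1.0
        return {o: s / total for o, s in scores.items()}

    # "the red one" is ambiguous between two red objects; gaze breaks the tie.
    speech = {"red_cup": 0.5, "red_book": 0.5, "blue_cup": 0.0}
    gaze = {"red_cup": 0.7, "red_book": 0.2, "blue_cup": 0.1}
    dist = resolve_reference(speech, gaze)
    print(max(dist, key=dist.get))  # red_cup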
Conference Paper
In a multimodal conversational interface supporting speech and deictic gesture, deictic gestures on the graphical display have been traditionally used to identify user attention, for example, through reference resolution. Since the context of the identified attention can potentially constrain the associated intention, our hypothesis is that deict...
Article
In a conversational system, determining the user's focus of attention is crucial to the success of the system. Motivated by previous psycholinguistic findings, we are currently examining how eye gaze contributes to automated identification of user attention during conversation. In particular, we are developing techniques that can predict an obj...
Article
Full-text available
This paper reports our first participation in the ciQA task. Instead of exploring conversation strategies in question answering (3, 4), we decided to focus on simple interaction strategies using relevance feedback. In our view, the ciQA task is not designed to evaluate user initiative interaction strategies. Since NIST assessors act as users, the motivation to take an init...
Article
Motivated by the recent effort on scenario-based context question answering (QA), this paper investigates the role of discourse processing and its implication on query expansion for a sequence of questions. Our view is that a question sequence is not random, but rather unfolds in a coherent manner to serve some information goals. Therefore, this seque...
Article
Full-text available
Text queries are natural and intuitive for users to describe their information needs. However, text-based image retrieval faces many challenges. Traditional text retrieval techniques on image descriptions have not been very successful. This is mainly due to the inconsistent textual descriptions and the discrepancies between user queries and terms...
Conference Paper
Motivated by psycholinguistic findings, we are currently investigating the role of eye gaze in spoken language understanding for multimodal conversational systems. Our assumption is that, during human machine conversation, a user's eye gaze on the graphical display indicates salient entities on which the user's attention is focused. The spe...
Conference Paper
In a conversational system, determining a user's focus of attention is crucial to the success of the system. Motivated by previous psycholinguistic findings, we are currently examining how eye gaze contributes to automated identification of user attention during human-machine conversation. As part of this effort, we investigate the contribut...
Conference Paper
Full-text available
Motivated by psycholinguistic findings that eye gaze is tightly linked to human language production, we developed an unsupervised approach based on translation models to automatically learn the mappings between words and objects on a graphic display during human machine conversation. The experimental results indicate that user eye gaze can...
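Translation-model approaches of this kind are typically instantiated as IBM-Model-1-style EM over parallel (spoken words, gazed objects) pairs. The following is a compact sketch under that assumption; the toy data and function names are illustrative, not the paper's code.

    from collections import defaultdict

    def train_word_object_model(pairs, iters=10):
        # pairs: list of (words, objects), where the objects were fixated
        # while the words were spoken. Learns t(word | object) by EM,
        # in the style of IBM Model 1.
        vocab = {w for words, _ in pairs for w in words}
        t = defaultdict(lambda: 1.0 / len(vocab))  # uniform initialization
        for _ in range(iters):
            count = defaultdict(float)
            total = defaultdict(float)
            for words, objects in pairs:
                for w in words:
                    z = sum(t[(w, o)] for o in objects)  # E-step: soft alignments
                    for o in objects:
                        c = t[(w, o)] / z
                        count[(w, o)] += c
                        total[o] += c
            for (w, o), c in count.items():              # M-step: renormalize
                t[(w, o)] = c / total[o]
        return t

    pairs = [(["the", "red", "cup"], ["cup1", "table"]),
             (["a", "red", "book"], ["book1"]),
             (["the", "cup"], ["cup1"])]
    t = train_word_object_model(pairs)
    print(round(t[("cup", "cup1")], 3))  # "cup" aligns strongly with cup1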
Article
Full-text available
Resolving ambiguity in the process of query translation is crucial to cross-language information retrieval (CLIR), given the short length of queries. This problem is even more challenging when only a bilingual dictionary is available, which is the focus of our work described here. In this paper, we will present a statistical framework for dictionar...
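A common dictionary-based disambiguation baseline, plausibly related to the statistical framework described here, scores each combination of candidate translations by their target-language co-occurrence and keeps the most cohesive one. The sketch below is illustrative only; the dictionary entries and scores are toy assumptions.

    from itertools import product

    def best_translation(query_terms, dictionary, cooccur):
        # dictionary: {source_term: [candidate translations]}
        # cooccur:    {(t1, t2): co-occurrence score from a target-language corpus}
        # Keep the combination of candidates with the highest pairwise cohesion.
        def cohesion(combo):
            return sum(cooccur.get((a, b), 0.0) + cooccur.get((b, a), 0.0)
                       for i, a in enumerate(combo) for b in combo[i + 1:])
        return max(product(*(dictionary[t] for t in query_terms)), key=cohesion)

    dictionary = {"banque": ["bank", "bench"], "argent": ["money", "silver"]}
    cooccur = {("bank", "money"): 8.0, ("bench", "silver"): 0.5}
    print(best_translation(["banque", "argent"], dictionary, cooccur))
    # ('bank', 'money')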
Conference Paper
Previous studies have shown that, in multimodal conversational systems, fusing information from multiple modalities together can improve the overall input interpretation through mutual disambiguation. Inspired by these findings, this paper investigates non-verbal modalities, in particular deictic gesture, in spoken language processing. Our assu...
Article
Multimodal conversational interfaces provide a natural means for users to communicate with computer systems through multiple modalities such as speech and gesture. To build effective multimodal interfaces, automated interpretation of user multimodal inputs is important. Inspired by the previous investigation on cognitive status in multimodal human...
Conference Paper
In interactive question answering (QA), users and systems take turns to ask questions and provide answers. In such an interactive setting, user questions largely depend on the answers provided by the system. One question is whether user follow-up questions can provide feedback for the system to automatically assess its performance (e.g., assess whet...
Conference Paper
Question answering (QA) systems take users' natural language questions and retrieve relevant answers from large repositories of free texts. Despite recent progress in QA research, most work on question answering is still focused on isolated questions. In a real-world information seeking scenario, questions are not asked in isolation, but rather in...
Conference Paper
To enable conversational QA, it is important to examine key issues addressed in conversational systems in the context of question answering. In conversational systems, understanding user intent is critical to the success of interaction. Recent studies have also shown that the capability to automatically identify problematic situations during intera...
Conference Paper
To alleviate the vocabulary problem, this paper investigates the role of user term feedback in interactive text-based image retrieval. Term feedback refers to the feedback from a user on specific terms regarding their relevance to a target image. Previous studies have indicated the effectiveness of term feedback in interactive text retrieval [14]....
Conference Paper
Typical cross language retrieval requires special linguistic resources, such as bilingual dictionaries and parallel corpora. In this study, we focus on the cross lingual retrieval problem that only uses online translation systems. We compare two approaches: a translation-based approach that directly translates queries into the language of documents...
Conference Paper
Full-text available
One key to cross-language information retrieval is how to efficiently resolve the translation ambiguity of queries given their short length. This problem is even more challenging when only bilingual dictionaries are available, which is the focus of this paper. In previous research on cross-language information retrieval using bilingual dictiona...
Conference Paper
Multimodal conversational interfaces provide a natural means for users to communicate with computer systems through multiple modalities such as speech, gesture, and gaze. To build effective multimodal interfaces, understanding user multimodal inputs is important. Previous linguistic and cognitive studies indicate that user language behavior does no...
Conference Paper
To improve the robustness in multimodal input interpretation, this paper presents a new salience driven approach. This approach is based on the observation that, during multimodal conversation, information from deictic gestures (e.g., point or circle) on a graphical display can signal a part of the physical world (i.e., representation of the domain...
Chapter
In a multimodal human-machine conversation, user inputs are often abbreviated or imprecise. Simply fusing multimodal inputs together may not be sufficient to derive a complete understanding of the inputs. Aiming to handle a wide variety of multimodal inputs, we are building a context-based multimodal interpretation framework called MIND (Multimodal...
Conference Paper
How to assign appropriate weights to terms is one of the critical issues in information retrieval. Many term weighting schemes are unsupervised. They are either based on the empirical observation in information retrieval, or based on generative approaches for language modeling. As a result, the existing term weighting schemes are usually insufficie...
Conference Paper
The goal of automatic image annotation is to automatically generate annotations for images to describe their content. In the past, statistical machine translation models have been successfully applied to the automatic image annotation task [8]. This line of work views the process of annotating images as a process of translating the content from a 'visual language' to...
Conference Paper
Image annotations allow users to access a large image database with textual queries. There have been several studies on automatic image annotation utilizing machine learning techniques, which automatically learn statistical models from annotated images and apply them to generate annotations for unseen images. One common problem shared by most previ...
Conference Paper
In this report, we describe our studies with cross language and interactive image retrieval in ImageCLEF 2004. Typical cross language retrieval requires special linguistic resources, such as bilingual dictionaries. In this study, we focus on the issue of how to achieve good retrieval performance given only an online translation system. We compare t...
Conference Paper
Collaborative filtering identifies information interest of a particular user based on the information provided by other similar users. The memory-based approaches for collaborative filtering (e.g., Pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. In these approaches, d...
Article
Collaborative filtering identifies information interest of a particular user based on the information provided by other similar users. The memory-based approaches for collaborative filtering (e.g., Pearson correlation coefficient approach) identify the similarity between two users by comparing their ratings on a set of items. In these approaches, d...
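The Pearson correlation coefficient approach named in the two abstracts above computes user-user similarity over co-rated items. A minimal sketch of that memory-based similarity follows; the toy ratings are illustrative, not data from the paper.

    import math

    def pearson_sim(ratings_a, ratings_b):
        # ratings_*: {item: rating}; similarity is computed over co-rated items only.
        common = set(ratings_a) & set(ratings_b)
        if len(common) < 2:
            return 0.0
        mean_a = sum(ratings_a[i] for i in common) / len(common)
        mean_b = sum(ratings_b[i] for i in common) / len(common)
        num = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
        den = math.sqrt(sum((ratings_a[i] - mean_a) ** 2 for i in common)
                        * sum((ratings_b[i] - mean_b) ** 2 for i in common))
        return num / den if den else 0.0

    alice = {"m1": 5, "m2": 3, "m3": 4}
    bob = {"m1": 4, "m2": 2, "m3": 5}
    print(round(pearson_sim(alice, bob), 3))  # ~0.655: fairly similar tastes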
Conference Paper
Multimodal user interfaces allow users to interact with computers through multiple modalities, such as speech, gesture, and gaze. To be effective, multimodal user interfaces must correctly identify all objects which users refer to in their inputs. To systematically resolve different types of references, we have developed a probabilistic approach th...
Conference Paper
In a multimodal conversation, the way users communicate with a system depends on the available interaction channels and the situated context (e.g., conversation focus, visual feedback). These dependencies form a rich set of constraints from various perspectives such as temporal alignments between different modalities, coherence of conversation, an...
Article
Multimodal reference resolution is a process that automatically identifies what users refer to during multimodal human-machine conversation. Given the substantial work on multimodal reference resolution, it is important to evaluate the current state of the art, understand the limitations, and identify directions for future improvement. We conducted...
Article
In a real-world setting, questions are not asked in isolation, but rather in a cohesive manner that involves a sequence of related questions to meet a user's information needs. The capability to interpret and answer questions based on context is important. In this paper, we discuss the role of discourse modeling in context question answering. In part...
Article
In a multimodal conversation, user referring patterns could be complex, involving multiple referring expressions from speech utterances and multiple gestures. To resolve those references, multimodal integration based on semantic constraints is insufficient. In this paper, we describe a graph-based probabilistic approach that simultaneously combines...
Conference Paper
In a multimodal human-machine conversation, user inputs are often abbreviated or imprecise. Sometimes, merely fusing multimodal inputs together cannot derive a complete understanding. To address these inadequacies, we are building a semantics-based multimodal interpretation framework called MIND (Multimodal Interpretation for Natural Dialog). The u...
Article
Full-text available
In situated dialogue, although artificial agents and their human partners are co-present in a shared environment, their representations of the environment are significantly different. When a shared basis is missing, referential grounding between partners becomes more challenging. Our hypothesis is that in such a situation, non-verbal modalities...
Article
Non-standard spellings in text messages often convey extra pragmatic information not found in the standard word form. However, text message normalization systems that transform non-standard text message spellings to standard form tend to ignore this information. To address this problem, this paper examines the types of extra pragmatic information...
Article
This paper presents a preliminary investigation into the use of NomLex classes for NomBank semantic role labeling (SRL). We hypothesize that modeling each class individually will result in more homogeneous training data and better performance compared to a baseline approach that is not class-based. Our current experimental results, which are...

Citations

... That is, the symbols in the system should be linked to meanings in the real world. However, there are discrepancies in the definition of grounding under different contexts (Chai et al., 2018; Mollo and Millière, 2023). In addition to the high-level concept of interpreting symbols in the real world, grounding can be interpreted differently in various scenarios, including but not necessarily limited to the following NLP-centric ones, where some of them are indeed instantiations of the high-level definition by Harnad (1990): ...
... [flattened table fragment listing graph-embedding methods: GraphGAN [67], NetMF [68], GraphSAGE [69]; generative models that embed graphs into a latent space: VGAE [70], GraphRNN [71], Graph-GAE [70], Graph-VIN [72]; semantics-aware embedding: DGMG [73], Sem-GAN [74]] Method: The algorithm consists of two major parts: 1) a random walk generator and 2) an update procedure. Random walk generator: The algorithm has two nested loops; the outer loop represents the number of times (γ) the walk should start from each vertex v_i of the graph. ...
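Following the two nested loops described in this excerpt, a DeepWalk-style random walk generator can be sketched as below. This is a hedged illustration, not the cited paper's code; the adjacency structure is a toy.

    import random

    def random_walks(adj, gamma=10, walk_len=40, seed=0):
        # adj: {vertex: [neighbors]}. Generates gamma walks of length walk_len
        # rooted at every vertex, as in the generator described above.
        rng = random.Random(seed)
        walks = []
        for _ in range(gamma):            # outer loop: gamma passes over all vertices
            vertices = list(adj)
            rng.shuffle(vertices)
            for v in vertices:            # inner loop: one walk rooted at each vertex
                walk = [v]
                while len(walk) < walk_len and adj[walk[-1]]:
                    walk.append(rng.choice(adj[walk[-1]]))
                walks.append(walk)
        return walks

    adj = {"a": ["b", "c"], "b": ["a"], "c": ["a"]}
    print(len(random_walks(adj, gamma=2, walk_len=5)))  # 6 walks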
... The value of a system is the measure of the collection of functions or benefits it renders to its users. However, this value becomes obvious when those benefits or functions enhance the effectiveness and efficiency of users' operations [2]. Usability of a system that provides some services is the extent to which the system can be applied proficiently and satisfactorily to achieve specific goals for a given user. ...
... 1. Collaboration based on conversational grounding: according to Baker, Hansen, Joiner, and Traum (1999), a good collaboration is possible if collaborating entities can create a shared knowledge base, beliefs, or assumptions surrounding their goal. The main challenge for this approach is the difficulty of grounding human language to a suitable representation of the real world that machines can fully understand and process (Chai, Fang, Liu, & She, 2016). 2. Collaboration based on theory of mind: this collaboration relies on self-understanding and interpretation of others' understandings (Koch & Oulasvirta, 2018). ...
... Several methods have been proposed during the past years for the human teaching process, such as force-sensor-based teaching [21], vision-system-based teaching [22], and natural-language-based teaching [23]. On the other hand, different learning methods have been used for the robot learning process, which allow the robot to extract, learn, and figure out task strategies from human teachers [24]. ...
... Lesh and Etzioni (1995) introduced a goal recognizer that observes human actions to prune inconsistent actions or goals from an input graph state. She and Chai (2016) use linguistic and environment features to induce a hypothesis set of goal predicates that can be handed to a planner. These approaches have demonstrated the ability to infer goals in small-sized domains (grid-world-like domains) with limited scalability to complex domains. ...
... Initially, visual event recognition was limited to k-way classification tasks [20,30,108], but Gupta et al. [31] expanded the definition of a visual event through the introduction of "visual semantic role labeling", or the task of mapping images to a limited taxonomy of scenarios following a subject-object-action template. A "grounded" version of this task was proposed by Yang et al. [130]. These ideas were built on by Yatskar et al. [131] and Pratt et al. [85] who presented a more complex task of mapping images to template-based scenarios, but with a much wider range of templates following the FrameNet ontology [7]. ...
... There has also been early related work on generating sportscast commentaries from simulation (RoboCup) soccer videos represented as non-visual state information (Chen and Mooney, 2008). Also, Liu et al. (2016a) presented some initial ideas on robots learning grounded task representations by watching and interacting with humans performing the task (i.e., by converting human demonstration videos to Causal And-Or graphs). On the other hand, we propose a new video-chat dataset where the dialogue models need to generate the next response in the sequence of chats, conditioned both on the raw video features as well as the previous textual chat history. ...
... Objective measures are preferred to represent users' feedback because they are less affected by logical and analytical elements (Liu et al., 2013). In this perspective, previous studies used physiological responses such as electrocardiography (ECG), skin conductance and electromyography to obtain objective responses of users (Gouizi et al., 2011; Lackey et al., 2015; Xu et al., 2015). ...
... The most widely-used method is to use the GUI to select the primitive [37,38]. Additionally, language-based demonstrations can guide the robot to implement certain motion primitives in specific sequences with language instructions [39,40]. ...