Article

Spatial Concept Acquisition for a Mobile Robot That Integrates Self-Localization and Unsupervised Word Discovery From Spoken Sentences


Abstract

In this paper, we propose a novel unsupervised learning method for the lexical acquisition of words related to the places a robot visits, from continuous human speech signals. We address the problem of a robot learning novel words with no prior knowledge of these words other than a primitive acoustic model. Further, we propose a method that allows a robot to effectively use the learned words and their meanings for self-localization tasks. The proposed method, the nonparametric Bayesian spatial concept acquisition method (SpCoA), integrates a generative model for self-localization with unsupervised word segmentation of uttered sentences via latent variables related to the spatial concept. We implemented SpCoA on SIGVerse, a simulation environment, and on TurtleBot2, a mobile robot in a real environment, and conducted experiments to evaluate its performance. The experimental results showed that SpCoA enabled the robot to acquire the names of places from spoken sentences. They also revealed that the robot could effectively utilize the acquired spatial concepts to reduce the uncertainty in its self-localization.
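
To make the abstract's integration concrete, the following is a minimal, self-contained sketch (Python, with toy numbers and names of our own choosing, not the authors' implementation) of the core idea: each spatial concept couples a Gaussian position distribution with a word distribution, so a recognized place word can re-weight Monte Carlo localization particles.

```python
# Minimal illustrative sketch (not the authors' implementation) of the idea in SpCoA:
# each spatial concept couples a Gaussian position distribution with a word
# distribution, and hearing a place word can re-weight self-localization particles.
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy spatial concepts: mean/cov of the place region and P(word | concept).
concepts = [
    {"mu": np.array([1.0, 2.0]), "cov": 0.25 * np.eye(2), "words": {"kitchen": 0.9, "hall": 0.1}},
    {"mu": np.array([5.0, 0.5]), "cov": 0.25 * np.eye(2), "words": {"kitchen": 0.1, "hall": 0.9}},
]

def gauss_pdf(x, mu, cov):
    d = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ inv @ d)

def word_likelihood(x, word):
    # p(word | x) marginalized over concepts (uniform prior over concepts here).
    return sum(c["words"].get(word, 1e-6) * gauss_pdf(x, c["mu"], c["cov"]) for c in concepts)

# Monte Carlo localization particles (positions only, for brevity).
particles = rng.uniform(low=[0, 0], high=[6, 3], size=(500, 2))
weights = np.full(len(particles), 1.0 / len(particles))

# The user says "kitchen": re-weight and resample the particles.
weights *= np.array([word_likelihood(p, "kitchen") for p in particles])
weights /= weights.sum()
idx = rng.choice(len(particles), size=len(particles), p=weights)
particles = particles[idx]

print("posterior mean position after hearing 'kitchen':", particles.mean(axis=0))
```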


... Spatial concepts refer to the knowledge of the place categories autonomously formed based on the multimodal information of the spatial experience acquired by the robot in the environment according to [8][9][10][11][12][13]. Therefore, we believe that the spatial concept works well not only for this particular task but also for other tasks related to place and object placement. ...
... It is important to appropriately generalize and form place categories based on object positions while dealing with the uncertainty of the observations. To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word segmentation from speech signals as PGMs through the latent variables of spatial concepts. ...
... To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word segmentation from speech signals as PGMs through the latent variables of spatial concepts. Their methods improve the accuracy of self-localization and recognition of the place names in spoken sentences. ...
Article
Tidy-up tasks by service robots in home environments are challenging in robotics applications because they involve various interactions with the environment. In particular, robots are required not only to grasp, move, and release various home objects but also to plan the order and positions for placing the objects. In this paper, we propose a novel planning method that can efficiently estimate the order and positions of the objects to be tidied up by learning the parameters of a probabilistic generative model. The model allows a robot to learn the distributions of the co-occurrence probability of the objects and places to tidy up using the multimodal sensor information collected in a tidied environment. Additionally, we develop an autonomous robotic system to perform the tidy-up operation. We evaluate the effectiveness of the proposed method by an experimental simulation that reproduces the conditions of the Tidy Up Here task of the World Robot Summit 2018 international robotics competition. The simulation results show that the proposed method enables the robot to successively tidy up several objects and achieves the best task score among the considered baseline tidy-up methods.
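
As a rough illustration of the co-occurrence idea described above (the probabilities and object/place names are assumptions, not the paper's learned model), a placement plan can be read off by picking, for each object, the place with the highest learned co-occurrence probability:

```python
# Toy sketch (assumed data, not the paper's model): pick a tidy-up destination for
# each object by the learned object-place co-occurrence probabilities, and order
# objects by how confident that assignment is.
cooccurrence = {           # P(place | object), illustrative values only
    "cup":   {"shelf": 0.7, "sink": 0.25, "toybox": 0.05},
    "doll":  {"shelf": 0.2, "sink": 0.05, "toybox": 0.75},
    "plate": {"shelf": 0.3, "sink": 0.65, "toybox": 0.05},
}

def plan_tidy_up(objects):
    plan = []
    for obj in objects:
        place, prob = max(cooccurrence[obj].items(), key=lambda kv: kv[1])
        plan.append((obj, place, prob))
    # Tidy the most confident assignments first.
    return sorted(plan, key=lambda t: -t[2])

for obj, place, prob in plan_tidy_up(["plate", "doll", "cup"]):
    print(f"put {obj} on/in {place} (p={prob:.2f})")
```
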
... Spatial concepts refer to the knowledge of the place categories autonomously formed based on the multimodal information of the spatial experience acquired by the robot in the environment according to [8][9][10][11][12][13]. Therefore, we believe that the spatial concept works well not only for this particular task but also for other tasks related to place and object placement. ...
... It is important to appropriately generalize and form place categories based on object positions while dealing with the uncertainty of the observations. To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word-segmentation from speech signals as PGMs through the latent variables of spatial concepts. ...
... To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word-segmentation from speech signals as PGMs through the latent variables of spatial concepts. Their methods improve the accuracy of self-localization and recognition of the place names in spoken sentences. ...
Preprint
Full-text available
Tidy-up tasks by service robots in home environments are challenging robotics applications because they involve various interactions with the environment. In particular, robots are required not only to grasp, move, and release various home objects, but also to plan the order and positions for putting them away. In this paper, we propose a novel planning method that can efficiently estimate the order and positions of the objects to be tidied up by learning the parameters of a probabilistic generative model. The model allows the robot to learn the distributions of co-occurrence probability of objects and places to tidy up by using multimodal sensor information collected in a tidied environment. Additionally, we develop an autonomous robotic system to perform the tidy-up operation. We evaluate the effectiveness of the proposed method in an experimental simulation that reproduces the conditions of the Tidy Up Here task of the World Robot Summit international robotics competition. The simulation results showed that the proposed method enables the robot to successively tidy up several objects and achieves the best task score compared to baseline tidy-up methods.
... This simple visual recognition-based approach also has the same problems. Spatial concept formation methods have been developed to enable robots to acquire place-related words as well as estimate categories and regions (Ishibushi et al., 2015; Taniguchi et al., 2016b, 2017, 2018). These methods can estimate the number of categories using the Dirichlet process (Teh et al., 2005). ...
... These methods can estimate the number of categories using the Dirichlet process (Teh et al., 2005). Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA). However, although spatial concept formation methods can acquire unknown words and deal with multimodal information, including the image features typically extracted using CNNs, they cannot semantically segment a map appropriately because the position distributions corresponding to semantic categories are modeled by Gaussian distributions. ...
... SpCoMapping integrates probabilistic spatial concept acquisition (Ishibushi et al., 2015;Taniguchi et al., 2016b) and SLAM via an MRF to generate a map of semantic information. It solves the overwrite problem by assigning each cell of the semantic map a probabilistic variable. ...
Article
Full-text available
An autonomous robot performing tasks in a human environment needs to recognize semantic information about places. Semantic mapping is a task in which suitable semantic information is assigned to an environmental map so that a robot can communicate with people and appropriately perform tasks requested by its users. We propose a novel statistical semantic mapping method called SpCoMapping, which integrates probabilistic spatial concept acquisition based on multimodal sensor information and a Markov random field applied for learning the arbitrary shape of a place on a map. SpCoMapping can connect multiple words to a place in the semantic mapping process using user utterances, without pre-setting the list of place names. We also develop a nonparametric Bayesian extension of SpCoMapping that can automatically estimate an adequate number of categories. In experiments in simulation environments, we showed that the proposed method generated better semantic maps than previous semantic mapping methods; our semantic maps have categories and shapes similar to the ground truth provided by the user. In addition, we showed that SpCoMapping could generate appropriate semantic maps in a real-world environment.
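
A minimal sketch of the Markov-random-field ingredient mentioned in the abstract, assuming toy per-cell likelihoods: each grid cell carries a semantic label, and a few iterated-conditional-modes sweeps trade off the cell's own observation likelihood against agreement with its neighbors. This is a generic MRF smoothing pass, not the SpCoMapping inference itself.

```python
# Illustrative MRF-style smoothing of per-cell semantic labels (ICM), in the spirit
# of assigning each map cell a label variable; the likelihoods are toy values.
import numpy as np

rng = np.random.default_rng(1)
H, W, K = 20, 20, 3                 # grid size and number of place categories
loglik = rng.normal(size=(H, W, K)) # assumed per-cell log-likelihoods from observations
labels = loglik.argmax(axis=2)      # initialize with the per-cell best label
beta = 0.8                          # strength of the neighbor-agreement (smoothness) term

for _ in range(5):                  # a few ICM sweeps
    for i in range(H):
        for j in range(W):
            neigh = [labels[i2, j2] for i2, j2 in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= i2 < H and 0 <= j2 < W]
            scores = [loglik[i, j, k] + beta * sum(n == k for n in neigh) for k in range(K)]
            labels[i, j] = int(np.argmax(scores))

print("label counts after smoothing:", np.bincount(labels.ravel(), minlength=K))
```
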
... In addition, the proposed unsupervised Bayesian generative model integrates multimodal place categorization, lexical acquisition, and SLAM. Taniguchi et al. [7], [8] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) using an unsupervised word-segmentation method latticelm [9] and SpCoA++, which enables highly accurate lexical acquisition by updating the language model. Isobe et al. [10] proposed a learning method for the relationship between objects and places using image features obtained by the convolutional neural network (CNN) [11]. ...
... Isobe et al. [10] proposed a learning method for the relationship between objects and places using image features obtained by the convolutional neural network (CNN) [11]. However, these methods [7], [8], [10] cannot learn spatial concepts in unknown environments without a map because they rely on batch-learning algorithms. Therefore, in previous work we developed an online algorithm, SpCoSLAM [6], that can sequentially learn a map and spatial concepts integrating positions, speech signals, and scene images. ...
... In addition, we performed the evaluations of place categorization and lexical acquisition related to places. We compared the performances of the following methods:
(A) SpCoSLAM [6]
(B) SpCoSLAM with AW (Section II-C.1)
(C) SpCoSLAM with AW + WS (Section II-C.2)
(D) SpCoSLAM 2.0 (FLR-i_t, C_t)
(E) SpCoSLAM 2.0 (RS)
(F) SpCoSLAM 2.0 (FLR-i_t, C_t + RS)
(G) SpCoSLAM 2.0 (FLR-i_t, C_t, S_t + SBU)
(H) SpCoA (Batch learning) [7]
(I) SpCoA++ (Batch learning) [8]
Methods (A)-(C) performed the conventional and modified SpCoSLAM algorithms. Methods (D)-(F) performed the proposed improved algorithms in different conditions. ...
Preprint
We propose a novel online learning algorithm, called SpCoSLAM 2.0, for spatial concepts and lexical acquisition with high accuracy and scalability. Previously, we proposed SpCoSLAM as an online learning algorithm based on an unsupervised Bayesian probabilistic model that integrates multimodal place categorization, lexical acquisition, and SLAM. However, our previous algorithm had limited estimation accuracy owing to the influence of the early stages of learning, and its computational complexity increased with added training data. Therefore, we introduce techniques such as fixed-lag rejuvenation to reduce the calculation time while maintaining an accuracy higher than that of the previous algorithm. The results show that, in terms of estimation accuracy, the proposed algorithm exceeds the previous algorithm and is comparable to batch learning. In addition, the calculation time of the proposed algorithm does not depend on the amount of training data and remains constant for each step, making the algorithm scalable. Our approach will contribute to the realization of long-term spatial language interactions between humans and robots.
... Taniguchi et al. [4] proposed a method that integrated ambiguous speech-recognition results with the self-localization method for learning spatial concepts. In addition, Taniguchi et al. [5] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) based on an unsupervised word-segmentation method known as latticelm [6]. On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. ...
... On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. These methods [4], [5], [7] cannot cope with changes in the names of places and the environment because these methods use batch learning algorithms. In addition, these methods cannot learn spatial concepts from unknown environments without a map, i.e., the robot needs to have a map generated by SLAM beforehand. ...
... Araki et al. [14] performed a pseudo-online algorithm using the nested Pitman-Yor language model (NPYLM) [15]. However, these studies [5], [14] have reported that word segmentation of speech recognition results including errors causes over-segmentation [16]. In this paper, we will improve the accuracy of speech recognition by updating the language models sequentially. ...
Conference Paper
Full-text available
In this paper, we propose an online learning algorithm based on a Rao-Blackwellized particle filter for spatial concept acquisition and mapping. We previously proposed a nonparametric Bayesian spatial concept acquisition model (SpCoA). Here, we propose a novel method (SpCoSLAM) integrating SpCoA and FastSLAM in the theoretical framework of a Bayesian generative model. The proposed method can simultaneously learn place categories and lexicons while incrementally generating an environmental map. Furthermore, the proposed method adds scene image features and a language model to SpCoA. In the experiments, we tested online learning of spatial concepts and environmental maps in a novel environment for which the robot had no map. We then evaluated the results of online learning of spatial concepts and lexical acquisition. The experimental results demonstrated that, using the proposed method, the robot was able to more accurately and incrementally learn the relationships between words and places in the environmental map.
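
The following is a schematic, heavily simplified particle-filter step in the spirit of the online learning described above (the likelihood form and all values are assumptions, not the SpCoSLAM equations): each particle carries a pose together with its own running statistics of where each place word was heard, and the word observed at the current step multiplies into the particle weights.

```python
# Schematic online-learning step (an assumed simplification, not the SpCoSLAM update):
# each particle keeps its own running statistics of where each place word was heard,
# and its weight is multiplied by how well the new observation fits those statistics.
import numpy as np

rng = np.random.default_rng(2)

class Particle:
    def __init__(self, pose):
        self.pose = np.array(pose, dtype=float)
        self.word_stats = {}          # word -> (count, sum of positions)

    def word_likelihood(self, word, sigma=0.7):
        if word not in self.word_stats:
            return 0.1                # assumed base likelihood for a new word
        n, s = self.word_stats[word]
        mu = s / n                    # running mean position for this word
        return float(np.exp(-0.5 * np.sum((self.pose - mu) ** 2) / sigma ** 2)) + 1e-9

    def update_word(self, word):
        n, s = self.word_stats.get(word, (0, np.zeros(2)))
        self.word_stats[word] = (n + 1, s + self.pose)

particles = [Particle(rng.uniform(0.0, 4.0, size=2)) for _ in range(200)]
weights = np.ones(len(particles))

# A few teaching steps: the robot moves, hears "kitchen", weights and updates particles.
for _ in range(3):
    for p in particles:
        p.pose += rng.normal(scale=0.05, size=2)        # motion update (toy)
    weights *= np.array([p.word_likelihood("kitchen") for p in particles])
    weights /= weights.sum()
    for p in particles:
        p.update_word("kitchen")                        # per-particle statistics update

best = particles[int(np.argmax(weights))]
print("best particle pose:", best.pose, "kitchen count:", best.word_stats["kitchen"][0])
```
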
... Spatial concept models have been proposed to enable a robot to learn the knowledge of places based on the user's linguistic instructions and the robot's observations through on-site learning [14][15][16]. The spatial concept models learn spatial categories from the user's linguistic instructions and the robot's observations (i.e., position, language, and vision) through unsupervised learning based on a probabilistic generative model and Bayesian inference in each home environment. ...
... The spatial concept is the categorical knowledge of a place formed from multimodal information, e.g., location names, visual features, and spatial areas on a map. Taniguchi et al. proposed a nonparametric spatial concept acquisition (SpCoA) [14] that learns the spatial concept (C t ) and spatial region (R t ) from the self-position (x t ) obtained using Monte Carlo localization (MCL) [17] and a linguistic instruction from the user as a bag of words (w t ) (Fig. A1). Fig. A1 shows the graphical model of SpCoA. ...
Preprint
This paper proposes a hierarchical Bayesian model based on spatial concepts that enables a robot to transfer the knowledge of places from experienced environments to a new environment. The transfer of knowledge based on spatial concepts is modeled as the calculation process of the posterior distribution based on the observations obtained in each environment with the parameters of spatial concepts generalized to environments as prior knowledge. We conducted experiments to evaluate the generalization performance of spatial knowledge for general places such as kitchens and the adaptive performance of spatial knowledge for unique places such as `Emma's room' in a new environment. In the experiments, the accuracies of the proposed method and conventional methods were compared in the prediction task of location names from an image and a position, and the prediction task of positions from a location name. The experimental results demonstrated that the proposed method has a higher prediction accuracy of location names and positions than the conventional method owing to the transfer of knowledge.
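
A toy Dirichlet-multinomial version of the transfer idea sketched in the abstract (the counts and category names are invented for illustration): word-category co-occurrence counts accumulated in experienced environments act as a prior, which a handful of observations in the new environment then updates.

```python
# Toy Dirichlet-multinomial transfer (assumed form, for illustration only): word-category
# counts from experienced environments act as a prior that is updated with the handful
# of observations made in the new environment.
import numpy as np

categories = ["kitchen", "bedroom", "entrance"]
# Pseudo-counts accumulated in previous environments (generalized knowledge).
prior_counts = np.array([
    [40.0, 1.0, 1.0],    # word "kitchen"
    [1.0, 35.0, 2.0],    # word "bedroom"
    [2.0, 1.0, 30.0],    # word "entrance"
])
# A few observations in the new environment (rows: same words, columns: categories).
new_counts = np.array([
    [3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

posterior = prior_counts + new_counts
p_category_given_word = posterior / posterior.sum(axis=1, keepdims=True)
for w, row in zip(["kitchen", "bedroom", "entrance"], p_category_given_word):
    print(w, dict(zip(categories, row.round(3))))
```
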
... SpCoMapping is an extended method of spatial concept formation [9], [10], [11], [12], using a Markov random field (MRF) for semantic mapping [2]. SpCoMapping learns the vocabulary representing a place and a region simultaneously, taking into account the shapes of the environment and obstacles. ...
... (A) SpCoMapGAN (proposed) (B) SpCoMapping [2] (C) SpCoA [11] In the experiment, a human first moved the robot in the simulation environment while it collected images and positions. The words were sent directly to the robot using Robot Operating System (ROS) [26] topics, assuming a state in which speech recognition could be performed reliably. ...
Conference Paper
In semantic mapping, which connects semantic information to an environment map, it is a challenging task for robots to deal with both local and global information of environments. In addition, it is important to estimate semantic information of unobserved areas from already acquired partial observations in a newly visited environment. On the other hand, previous studies on spatial concept formation enabled a robot to relate multiple words to places from bottom-up observations even when the vocabulary was not provided beforehand. However, the robot could not transfer global information related to the room arrangement between semantic maps from other environments. In this paper, we propose SpCoMapGAN, which generates the semantic map in a newly visited environment by training an inference model using previously estimated semantic maps. SpCoMapGAN uses generative adversarial networks (GANs) to transfer semantic information based on room arrangements to a newly visited environment. Our proposed method assigns semantics to the map of an unknown environment using the prior distribution of the map trained in known environments and the multimodal observations made in the unknown environment. We experimentally show in simulation that SpCoMapGAN can use global information for estimating the semantic map and is superior to previous methods. Finally, we also demonstrate in a real environment that SpCoMapGAN can accurately 1) deal with local information, and 2) acquire the semantic information of real places.
... The learning procedure for each step is described in Appendix 2. The variables of the joint posterior distribution can be learned by Gibbs sampling, which is a Markov chain Monte-Carlo-based batch learning algorithm, in a manner similar to the nonparametric Bayesian spatial concept acquisition method (SpCoA) [29]. In addition, it is also possible to perform learning in a spatial concept formation model after the map is generated via any other SLAM. ...
... For each place, on average, 15 training datasets were provided. The latent variables C t and i t were assumed to be almost accurately estimated, and model parameters of the spatial concept were obtained via Gibbs sampling, similar to [29]. The visual features of a camera-acquired image were not treated. ...
Article
Full-text available
Robots are required not only to learn spatial concepts autonomously but also to utilize such knowledge for various tasks in a domestic environment. A spatial concept represents a multimodal place category acquired from the robot's spatial experience, including vision, speech-language, and self-position. The aim of this study is to enable a mobile robot to perform navigational tasks from human speech instructions, such as ‘Go to the kitchen’, via probabilistic inference on a Bayesian generative model using spatial concepts. Specifically, path planning was formalized as the maximization of a probabilistic distribution on the path-trajectory under the speech instruction, based on a control-as-inference framework. Furthermore, we described the relationship between probabilistic inference based on the Bayesian generative model and the control problem, including reinforcement learning. We demonstrated path planning based on human instruction using acquired spatial concepts to verify the usefulness of the proposed approach in the simulator and in real environments. Experimentally, places instructed by the user's speech commands showed high probability values, and the trajectory toward the target place was correctly estimated. Our approach, based on probabilistic inference concerning decision-making, can lead to further improvement in robot autonomy.
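
A drastically simplified sketch of using acquired spatial concepts for navigation (not the paper's control-as-inference derivation; the Gaussian for the instructed word and the map are toy assumptions): score grid cells by the position distribution of the instructed place word, take the best free cell as the goal, and plan with an ordinary shortest-path search.

```python
# Simplified sketch (not the paper's formulation): score grid cells by the position
# distribution of the instructed place word, take the best free cell as the goal,
# and plan a path with breadth-first search on the occupancy grid.
import numpy as np
from collections import deque

H, W = 15, 15
occupancy = np.zeros((H, W), dtype=bool)
occupancy[7, 2:12] = True                      # a wall with gaps at both ends

mu, sigma = np.array([12.0, 12.0]), 1.5        # assumed Gaussian for the word "kitchen"
ys, xs = np.mgrid[0:H, 0:W]
score = np.exp(-((ys - mu[0]) ** 2 + (xs - mu[1]) ** 2) / (2 * sigma ** 2))
score[occupancy] = 0.0
goal = tuple(int(v) for v in np.unravel_index(np.argmax(score), score.shape))

def bfs_path(start, goal):
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            break
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if 0 <= nxt[0] < H and 0 <= nxt[1] < W and not occupancy[nxt] and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev.get(node)
    return path[::-1]

path = bfs_path((0, 0), goal)
print("goal cell:", goal, "path length:", len(path))
```
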
... Other approaches take a probabilistic perspective on concept learning, similar to Lake et al. (2015), but focussing on the domain of robotics. Concepts are learned through unsupervised online learning algorithms, combining multi-modal data streams (most often perceptual data and raw speech data) through statistical approaches such as Bayesian generative models or latent semantic analysis (Nakamura et al., 2007; Aoki et al., 2016; Taniguchi et al., 2016, 2017). Through this integration of data streams, the acquired concepts constitute mappings between words and objects, as studied by Nakamura et al. (2007) and Aoki et al. (2016), or between words and spatial locations, as studied by Taniguchi et al. (2016, 2017). ...
... Concepts are learned through unsupervised online learning algorithms, combining multi-modal data streams (most often perceptual data and raw speech data) through statistical approaches such as Bayesian generative models or latent semantic analysis (Nakamura et al., 2007; Aoki et al., 2016; Taniguchi et al., 2016, 2017). Through this integration of data streams, the acquired concepts constitute mappings between words and objects, as studied by Nakamura et al. (2007) and Aoki et al. (2016), or between words and spatial locations, as studied by Taniguchi et al. (2016, 2017). The latter further used these concepts to aid a mobile robot in generating a map of the environment without any prior information. ...
Article
Full-text available
Autonomous agents perceive the world through streams of continuous sensori-motor data. Yet, in order to reason and communicate about their environment, agents need to be able to distill meaningful concepts from their raw observations. Most current approaches that bridge between the continuous and symbolic domain are using deep learning techniques. While these approaches often achieve high levels of accuracy, they rely on large amounts of training data, and the resulting models lack transparency, generality, and adaptivity. In this paper, we introduce a novel methodology for grounded concept learning. In a tutor-learner scenario, the method allows an agent to construct a conceptual system in which meaningful concepts are formed by discriminative combinations of prototypical values on human-interpretable feature channels. We evaluate our approach on the CLEVR dataset, using features that are either simulated or extracted using computer vision techniques. Through a range of experiments, we show that our method allows for incremental learning, needs few data points, and that the resulting concepts are general enough to be applied to previously unseen objects and can be combined compositionally. These properties make the approach well-suited to be used in robotic agents as the module that maps from continuous sensory input to grounded, symbolic concepts that can then be used for higher-level reasoning tasks.
... The learning procedure for each step is described in Appendix B. The variables of the joint posterior distribution can be learned by Gibbs sampling, which is a Markov chain Monte-Carlobased batch learning algorithm, in a manner similar to the nonparametric Bayesian spatial concept acquisition method (SpCoA) [27]. ...
... For each place, on average, 15 training datasets were provided. The latent variables C t and i t were assumed to be almost accurately estimated, and model parameters of the spatial concept were obtained via Gibbs sampling, similar to [27]. The word dictionary was provided in advance. ...
Preprint
Robots are required not only to learn spatial concepts autonomously but also to utilize such knowledge for various tasks in a domestic environment. A spatial concept represents a multimodal place category acquired from the robot's spatial experience, including vision, speech-language, and self-position. The aim of this study is to enable a mobile robot to perform navigational tasks from human speech instructions, such as `Go to the kitchen', via probabilistic inference on a Bayesian generative model using spatial concepts. Specifically, path planning was formalized as the maximization of a probabilistic distribution on the path-trajectory under the speech instruction, based on a control-as-inference framework. Furthermore, we described the relationship between probabilistic inference based on the Bayesian generative model and the control problem, including reinforcement learning. We demonstrated path planning based on human instruction using acquired spatial concepts to verify the usefulness of the proposed approach in the simulator and in real environments. Experimentally, places instructed by the user's speech commands showed high probability values, and the trajectory toward the target place was correctly estimated. Our approach, based on probabilistic inference concerning decision-making, can lead to further improvement in robot autonomy.
... A further advancement of such cognitive systems allows the robots to find meanings of words by treating a linguistic input as another modality [13][14][15]. Cognitive models have recently become more complex in realizing various cognitive capabilities: grammar acquisition [16], language model learning [17], hierarchical concept acquisition [18,19], spatial concept acquisition [20], motion skill acquisition [21], and task planning [7] (see Fig. 1). It results in an increase in the development cost of each cognitive system. ...
... Theoretical and empirical validations should be applied for further applications. So far, many researchers, including the authors, have proposed numerous cognitive models for robots: object concept formation based on appearance, usage, and function [41], formation of integrated concepts of objects and motions [42], grammar learning [16], language understanding [43], spatial concept formation and lexical acquisition [8,20,44], simultaneous phoneme and word discovery [45][46][47], and cross-situational learning [48,49]. These models are regarded as integrative models constructed by combining small-scale models. ...
Article
Full-text available
This paper describes a framework for the development of an integrative cognitive system based on probabilistic generative models (PGMs) called Neuro-SERKET. Neuro-SERKET is an extension of SERKET, which can compose elemental PGMs developed in a distributed manner and provide a scheme that allows the composed PGMs to learn throughout the system in an unsupervised way. In addition to the head-to-tail connection supported by SERKET, Neuro-SERKET supports tail-to-tail and head-to-head connections, as well as neural network-based modules, i.e., deep generative models. As an example of a Neuro-SERKET application, an integrative model was developed by composing a variational autoencoder (VAE), a Gaussian mixture model (GMM), latent Dirichlet allocation (LDA), and automatic speech recognition (ASR). The model is called VAE + GMM + LDA + ASR. The performance of VAE + GMM + LDA + ASR and the validity of Neuro-SERKET were demonstrated through a multimodal categorization task using image data and a speech signal of numerical digits.
... (1) A deployment-ready implementation of a hierarchical spatial concept formation method, Spatial Concepts (SpCo) [4][5][6][7], based on a Bayesian generative model that relies on multimodal observations from image features, self-location, and word information. This method allows SpCo-enabled robots to learn spatial concepts, such as places and objects, and associate them with natural language commands, without supervision and with the same level of abstraction that humans would use depending on the context and the environment. ...
... Our SpCo implementation for customer interaction is the result of several previous works. At the origin is a nonparametric Bayesian Spatial Concepts Acquisition (SpCoA) [4] that could deal with the learning of unknown words by using unsupervised stochastic word segmentation. It was followed by SpCoSLAM [5] that achieved online learning of spatial concepts, language model, and map, based on a particle filter. ...
Article
Human–robot interaction during general service tasks in home or retail environments has proven challenging, partly because (1) robots lack high-level context-based cognition and (2) humans cannot intuit the perception state of robots as they can for other humans. To solve these two problems, we present a complete robot system, which received the highest evaluation score at the Customer Interaction Task of the Future Convenience Store Challenge at the World Robot Summit 2018, and which implements several key technologies: (1) hierarchical spatial concept formation for general robot task planning and (2) a mixed reality interface that enables users to intuitively visualize the current state of the robot's perception and naturally interact with it. The results obtained during the competition indicate that the proposed system allows both non-expert operators and end users to achieve human–robot interactions in customer service environments. Furthermore, we describe a detailed scenario including employee operation and customer interaction, which serves as a set of requirements for service robots and a road map for development. The system integration and task scenario described in this paper should be helpful for groups facing customer interaction challenges and looking for a successfully deployed base to build on.
... Symbol emergence through a robot's own interactive exploration has been explored in many setups for decades [2], [4]- [14]. In previous studies, researchers had shown that robots can form object categories, obtain motor primitives, and learn vocabularies from sensorimotor interaction with their environment without any direct human intervention. ...
... There have been many related studies. For example, Taniguchi introduced the idea of spatial concept and built a machine learning system to enable a robot to form place categories and learn place names [14], [74]. Mangin proposed multimodal concept formation using a non-negative matrix factorization method [75]. ...
Article
Full-text available
Symbol emergence through a robot's own interactive exploration of the world without human intervention has been investigated now for several decades. However, methods that enable a machine to form symbol systems in a robust bottom-up manner are still missing. Clearly, this shows that we still do not have an appropriate computational understanding that explains symbol emergence in biological and artificial systems. Over the years it became more and more clear that symbol emergence has to be posed as a multi-faceted problem. Therefore, we will first review the history of the symbol emergence problem in different fields showing their mutual relations. Then we will describe recent work and approaches to solve this problem with the aim of providing an integrative and comprehensive overview of symbol emergence for future research.
... Cross-situational learning and visually grounded acoustic unit discovery have been explored in the fields of speech and computational linguistics [22]- [26]. In robotics, unsupervised word discovery methods have been developed to achieve online lexical acquisition and overcome the out-of-vocabulary problem [19]- [21], [27]- [30]. As a result, considering multiple cues is crucial for the development of an unsupervised phone and word discovery model. ...
Article
Full-text available
Word and phone discovery are important tasks in the language development of human infants. Infants acquire words and phones from unsegmented speech signals using segmentation cues such as distributional, prosodic, and co-occurrence information. Many pre-existing computational models designed to represent this process tend to focus on distributional or prosodic cues. In this study, we propose a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (prosodic HDP-HLM), designed to perform simultaneous phone and word discovery from continuous speech signals encoded as time-series data that may exhibit a double articulation structure. Prosodic HDP-HLM, as an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We further propose a prosodic double articulation analyzer (Prosodic DAA) based on an inference procedure derived for prosodic HDP-HLM. We conducted three experiments on different types of datasets, including Japanese vowel sequences, utterances for teaching object names and features, and utterances following Zipf's law, and the results demonstrated the validity of the proposed method. The results show that the Prosodic DAA successfully used prosodic cues and was able to discover words directly from continuous human speech using distributional and prosodic information in an unsupervised manner, outperforming a method that solely used distributional cues. In contrast, the phone discovery performance did not improve. We also show that prosodic cues contributed more to word discovery performance when the word frequency was distributed more naturally, i.e., following Zipf's law.
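
For intuition about the distributional cue alone, here is a toy Viterbi segmentation of an unsegmented phone string under a unigram word model with a length penalty for unknown substrings; the lexicon is assumed, and the actual models discussed above (HDP-HLM, NPYLM, latticelm) are far richer.

```python
# Toy Viterbi segmentation of an unsegmented phone string under a unigram word model
# (a drastic simplification of NPYLM / HDP-HLM style word discovery; lexicon is assumed).
import math

lexicon = {"ki": 0.15, "tchen": 0.1, "kitchen": 0.2, "go": 0.2, "to": 0.2, "the": 0.15}
UNK_PER_CHAR = 1e-4          # penalty for substrings not in the lexicon

def word_logprob(w):
    return math.log(lexicon.get(w, UNK_PER_CHAR ** len(w)))

def segment(s, max_len=8):
    best = [(-math.inf, -1)] * (len(s) + 1)   # best[j] = (score, split point) for s[:j]
    best[0] = (0.0, -1)
    for j in range(1, len(s) + 1):
        for i in range(max(0, j - max_len), j):
            cand = best[i][0] + word_logprob(s[i:j])
            if cand > best[j][0]:
                best[j] = (cand, i)
    words, j = [], len(s)
    while j > 0:
        i = best[j][1]
        words.append(s[i:j])
        j = i
    return words[::-1]

print(segment("gotothekitchen"))   # expected: ['go', 'to', 'the', 'kitchen']
```
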
... Specifically, it is important to appropriately generalize and form place categories while dealing with observation uncertainties. To address these issues, PGMs for spatial concept formation have been constructed (Hagiwara, Inoue, Kobayashi, & Taniguchi, 2018; Katsumata, Taniguchi, Hafi, Hagiwara, & Taniguchi, 2020; Taniguchi, Hagiwara, Taniguchi, & Inamura, 2017; Taniguchi, Taniguchi, & Inamura, 2016). Taniguchi et al. (2017) proposed the spatial concept formation using SLAM (SpCoSLAM), that is, place categorization and mapping through unsupervised online learning from multimodal observation. ...
Article
Building a human-like integrative artificial cognitive system, that is, an artificial general intelligence (AGI), is the holy grail of the artificial intelligence (AI) field. Furthermore, a computational model that enables an artificial system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes an approach to develop a cognitive architecture by integrating elemental cognitive modules to enable the training of the modules as a whole. This approach is based on two ideas: (1) brain-inspired AI, learning human brain architecture to build human-level intelligence, and (2) a probabilistic generative model (PGM)-based cognitive architecture to develop a cognitive system for developmental robots by integrating PGMs. The proposed development framework is called a whole brain PGM (WB-PGM), which differs fundamentally from existing cognitive architectures in that it can learn continuously through a system based on sensory-motor information. In this paper, we describe the rationale for WB-PGM, the current status of PGM-based elemental cognitive modules, their relationship with the human brain, the approach to the integration of the cognitive modules, and future challenges. Our findings can serve as a reference for brain studies. As PGMs describe explicit informational relationships between variables, WB-PGM provides interpretable guidance from computational sciences to brain science. By providing such information, researchers in neuroscience can provide feedback to researchers in AI and robotics on what the current models lack with reference to the brain. Further, it can facilitate collaboration among researchers in neuro-cognitive sciences as well as AI and robotics.
... Miyazawa et al. constructed a PGM that combines MLDA with a cognitive module that learns grammatical knowledge [25]. Taniguchi et al. proposed spatial concept formation methods by applying a similar idea to multimodal information obtained by a mobile robot, including positional information [26][27][28]. Hagiwara et al. also proposed a hierarchical spatial category formation model that applies hierarchical MLDA to the inference of a hierarchical structure in spatial categories [29]. ...
Article
Full-text available
This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
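
As a generic illustration of the MCMC-based naming game mentioned in the abstract (toy distributions, not the exact Inter-MDM sampler): the listener accepts the speaker's proposed sign with a Metropolis-Hastings-style probability computed from its own category beliefs.

```python
# Generic illustration (toy distributions, not the exact Inter-MDM sampler) of an
# MCMC naming game: the listener accepts the speaker's proposed sign with a
# Metropolis-Hastings-style probability based on its own category beliefs.
import numpy as np

rng = np.random.default_rng(3)
signs = ["wa", "mo", "ku"]

# P(sign | category) for the listener; assumed toy values for two categories.
listener_sign_given_cat = np.array([
    [0.7, 0.2, 0.1],    # category 0
    [0.1, 0.2, 0.7],    # category 1
])

def mh_accept(proposed, current, cat):
    p_new = listener_sign_given_cat[cat, proposed]
    p_old = listener_sign_given_cat[cat, current]
    return rng.random() < min(1.0, p_new / p_old)

current_sign = 1                     # listener's current sign index for category 0
for step in range(5):
    proposed = rng.integers(len(signs))          # speaker's proposal (toy: uniform)
    if mh_accept(proposed, current_sign, cat=0):
        current_sign = proposed
    print(f"step {step}: proposal {signs[proposed]}, kept {signs[current_sign]}")
```
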
... The purpose is to endow a robot with the ability to model its knowledge about the environment and to acquire new knowledge as it arises, in a kind of self-consciousness process. In [166], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from continuous human speech signals. ...
Thesis
The employment of personal or service robots has aroused much interest in recent years, with impressive growth of robotics in different domains. The design of companion robots able to assist, share with, and accompany individuals with limited autonomy in their daily life is the challenge of the coming decade. However, the performance of today's robotic bodies and prototypes remains far from meeting this challenge. Although sophisticated humanoid robots have been developed, much more effort is needed to improve their cognitive capabilities. Indeed, the above-mentioned commercially available robots and prototypes are still not able to naturally adapt themselves to the complex environment in which they are supposed to evolve with humans. In the same way, existing prototypes are not able to interact in a versatile way with their users. In fact, they are still far from interpreting the diversity and complexity of perceived information or constructing knowledge about the surrounding environment. The development of bio-inspired approaches based on artificial cognition for perception and the autonomous acquisition of knowledge in robotics is a feasible strategy to overcome these limitations. A number of advances have already led to the realization of an artificial-cognition-based system allowing a robot to learn and create knowledge from observation (the association of sensory information and natural semantics). Within this context, the present work takes advantage of an evolutionary process for the semantic interpretation of sensory information to make machine awareness of the surrounding environment emerge. The main purpose of the doctoral thesis is to extend the already accomplished research in order to allow a robot to extract, construct, and conceptualize knowledge about its surrounding environment. Indeed, the goal of the doctoral research is to generalize the aforementioned concepts for the autonomous, or semi-autonomous, construction of knowledge from perceived information (e.g., by a robot). In other words, the expected goal is to allow a robot to progressively conceptualize the environment in which it evolves and to share the constructed knowledge with its user. To this end, a semantic-multimedia knowledge base has been created based on an ontological model and implemented through a NoSQL graph database. This knowledge base is the founding element of the thesis work, on which multiple approaches have been investigated based on semantic, multimedia, and visual information. The developed approaches combine this information through classic machine learning techniques, both supervised and unsupervised, together with transfer learning techniques for the reuse of semantic features from deep neural network models. Other techniques based on ontologies and the Semantic Web have been explored for the acquisition and integration of further knowledge into the knowledge base. The different areas investigated have been united in a comprehensive logical framework. The experiments conducted have shown an effective correspondence between interpretations based on semantic and visual features, from which emerged the possibility for a robotic agent to expand its knowledge-generalization skills even in unknown or partially known environments, allowing the objectives set to be achieved.
... to other machine learning domains, e.g., computer vision [17], [18], [19] and natural language processing [20], [21]. Also, cognitive load inference does not provide a directly actionable feedback signal for real-world assistance systems, as humans may still perform well under stress [22]. ...
Article
Full-text available
In this paper, we address cognitive overload detection from unobtrusive physiological signals for users in dual-tasking scenarios. Anticipating cognitive overload is a pivotal challenge in interactive cognitive systems and could lead to safer shared-control between users and assistance systems. Our framework builds on the assumption that decision mistakes on the cognitive secondary task of dual-tasking users correspond to cognitive overload events, wherein the cognitive resources required to perform the task exceed the ones available to the users. We propose DecNet, an end-to-end sequence-to-sequence deep learning model that infers in real-time the likelihood of user mistakes on the secondary task, i.e., the practical impact of cognitive overload, from eye-gaze and head-pose data. We train and test DecNet on a dataset collected in a simulated driving setup from a cohort of 20 users on two dual-tasking decision-making scenarios, with either visual or auditory decision stimuli. DecNet anticipates cognitive overload events in both scenarios and can perform in time-constrained scenarios, anticipating cognitive overload events up to 2s before they occur. We show that DecNet’s performance gap between audio and visual scenarios is consistent with user perceived difficulty. This suggests that single modality stimulation induces higher cognitive load on users, hindering their decision-making abilities.
... Miyazawa et al. constructed a PGM that combines MLDA with a cognitive module that learns grammatical knowledge [25]. Taniguchi et al. proposed spatial concept formation methods by applying a similar idea to multimodal information obtained by a mobile robot, including positional information [26][27][28]. Hagiwara et al. also proposed a hierarchical spatial category formation model that applies hierarchical MLDA to the inference of a hierarchical structure in spatial categories [29]. ...
Preprint
This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication between agents. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
... Our previous system (SIGVerse ver.2) (Inamura, 2010) has been applied in studies such as the analysis of human behavior (Ramirez-Amaro et al., 2014), learning of spatial concepts (Taniguchi et al., 2016), and VR-based rehabilitation . These studies employed multimodal data (1)-(4) and functions (i)-(iii); however, the reusability of the conventional SIGVerse is restricted because it does not adopt ROS as its application programming interface (API). ...
Article
Full-text available
Research on Human-Robot Interaction (HRI) requires substantial consideration of experimental design, as well as a significant amount of time to conduct subject experiments. Recent technology in virtual reality (VR) can potentially address these time and effort challenges. The significant advantages of VR systems for HRI are: 1) cost reduction, as experimental facilities are not required in a real environment; 2) provision of the same environmental and embodied interaction conditions to test subjects; 3) visualization of arbitrary information and situations that cannot occur in reality, such as playback of past experiences; and 4) ease of access to an immersive and natural interface for robot/avatar teleoperation. Although VR tools with these features have been applied and developed in previous HRI research, all-encompassing tools or frameworks remain unavailable. In particular, the benefits of integration with cloud computing have not been comprehensively considered. Hence, the purpose of this study is to propose a research platform that can comprehensively provide the elements required for HRI research by integrating VR and cloud technologies. To realize a flexible and reusable system, we developed a real-time bridging mechanism between the robot operating system (ROS) and Unity. To confirm the feasibility of the system in a practical HRI scenario, we applied the proposed system to three case studies, including a robot competition named RoboCup@Home. Via these case studies, we validated the system's usefulness and its potential for the development and evaluation of social intelligence via multimodal HRI.
... For a robot to rapidly adapt to a new environment, acquiring local knowledge (e.g. spatial concept formation [38][39][40]) while exploring the environment is important. Here, it is necessary to address the problem wherein the prediction by the robot becomes ambiguous in a domain with no or few training data. ...
Article
Full-text available
The installation of remotely-operated service robots in the environments of our daily life (including offices, homes, and hospitals) can improve work-from-home policies and enhance the quality of the so-called new normal. However, it is evident that remotely-operated robots must have partial autonomy and the capability to learn and use local semiotic knowledge. In this paper, we argue that the development of semiotically adaptive cognitive systems is key to the installation of service robotics technologies in our service environments. To achieve this goal, we describe three challenges: the learning of local knowledge, the acceleration of onsite and online learning, and the augmentation of human–robot interactions.
... Specifically, it is important to appropriately generalize and form place categories while dealing with the observation uncertainties. To address these issues, PGMs for spatial concept formation have been constructed (Taniguchi et al., 2016a; Hagiwara et al., 2018; Katsumata et al., 2020). Taniguchi et al. (2017) have proposed the spatial concept formation with SLAM (SpCoSLAM), which is place categorization and mapping through unsupervised online learning from multimodal observation. ...
Preprint
Building a humanlike integrative artificial cognitive system, that is, an artificial general intelligence, is one of the goals in artificial intelligence and developmental robotics. Furthermore, a computational model that enables an artificial cognitive system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes the development of a cognitive architecture using probabilistic generative models (PGMs) to fully mirror the human cognitive system. The integrative model is called a whole-brain PGM (WB-PGM). It is both brain-inspired and PGM-based. In this paper, the process of building the WB-PGM and learning from the human brain to build cognitive architectures is described.
... Visual odometry is the problem of estimating the camera pose from consecutive images and is a fundamental capability required in many computer vision and robotics applications, such as autonomous/medical robots, augmented/mixed/virtual reality, and other complicated and emerging applications based on localization information, such as indoor and outdoor navigation, scene understanding, and space exploration [1]- [3]. ...
Article
Full-text available
Visual odometry (VO) is a prevalent way to deal with the relative localization problem, and it is becoming increasingly mature and accurate, but it tends to be fragile in challenging environments. Compared with classical geometry-based methods, deep learning-based methods can automatically learn effective and robust representations, such as depth, optical flow, features, and ego-motion, from data without explicit computation. Nevertheless, a thorough review of the recent advances in deep learning-based VO (Deep VO) is still lacking. Therefore, this paper aims to gain deep insight into how deep learning can profit and optimize VO systems. We first select a number of qualifications, including accuracy, efficiency, scalability, dynamicity, practicability, and extensibility, and employ them as criteria. Then, using the offered criteria as uniform measurements, we evaluate and discuss in detail how deep learning improves the performance of VO from the aspects of depth estimation, feature extraction and matching, and pose estimation. We also summarize the complicated and emerging areas of Deep VO, such as mobile robots, medical robots, augmented reality, and virtual reality. Through literature decomposition, analysis, and comparison, we finally put forward a number of open issues and raise some future research directions in this field.
... In [37], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from continuous human speech signals. ...
Article
Full-text available
The pervasive use of artificial intelligence and neural networks in several different research fields has noticeably improved multiple aspects of human life. The application of these techniques to machines has made them progressively more "intelligent" and able to solve tasks considered extremely complex for a human being. This technological evolution has deeply influenced the way we interact with machines. Purely symbolic artificial intelligence and techniques such as ontologies have also been successfully applied to robotics in the past, but they have shown limitations and failings in the knowledge-construction task. In fact, the exhibited "intelligence" is rarely the result of a real autonomous decision; rather, it is hard-coded into the machine. While a number of approaches concerning knowledge acquisition from the surrounding environment have already been proposed in the literature, they are either exclusively based on low-level features or involve solely high-level semantics-based attributes. Moreover, they often do not use a general high-level knowledge base for grounding the acquired knowledge. In this context, semantic technologies such as ontologies are mostly employed for action-oriented tasks. In this article, we propose an extension of a novel approach for knowledge acquisition based on a general semantic knowledge base and the fusion of semantic and visual information by means of neural networks and ontologies. The proposed approach has been implemented on a humanoid robotic platform, and the experimental results are shown and discussed.
... Our previous system (SIGVerse ver.2) [17] has been utilized for studies such as analysis of human behavior [43], learning of spatial concepts [48], and VR-based rehabilitation [16]. These studies employed multimodal data (1) to (4) and functions (i) to (iii); however, the re-usability of conventional SIGVerse is restricted due to its application programming interface (API). ...
Preprint
Common sense and social interaction related to daily-life environments are considerably important for autonomous robots that support human activities. One practical approach to acquiring such social interaction skills and common-sense semantic information about human activities is the application of recent machine learning techniques. Although recent machine learning techniques have been successful in realizing automatic manipulation and driving tasks, it is difficult to use these techniques in applications that require human-robot interaction experience: humans have to perform interactions many times over a long period to demonstrate embodied and social interaction behaviors to robots or learning systems. To address this problem, we propose a cloud-based immersive virtual reality (VR) platform which enables virtual human-robot interaction to collect the social and embodied knowledge of human activities in a variety of situations. To realize a flexible and reusable system, we develop a real-time bridging mechanism between ROS and Unity, which is one of the standard platforms for developing VR applications. We apply the proposed system to a robot competition field named RoboCup@Home to confirm the feasibility of the system in a realistic human-robot interaction scenario. Through demonstration experiments at the competition, we show the usefulness and potential of the system for the development and evaluation of social intelligence through human-robot interaction. The proposed VR platform enables robot systems to collect social experiences with several users in a short time. The platform also contributes to providing a dataset of social behaviors, which would be a key asset for intelligent service robots acquiring social interaction skills based on machine learning techniques.
... Taguchi et al. (2011) proposed an unsupervised method for simultaneously categorizing self-positions and phoneme sequences from user speech without any prior language model. Taniguchi et al. (2016, 2018a) proposed the nonparametric Bayesian Spatial Concept Acquisition method (SpCoA) using an unsupervised word segmentation method, latticelm (Neubig et al. 2012), and SpCoA++ for highly accurate lexical acquisition as a result of updating the language model. Gu et al. (2016) proposed a method to learn relative spatial concepts, i.e., the words related to distance and direction, from the positional relationship between an utterer and objects. ...
Article
Full-text available
We propose a novel online learning algorithm, called SpCoSLAM 2.0, for spatial concepts and lexical acquisition with high accuracy and scalability. Previously, we proposed SpCoSLAM as an online learning algorithm based on an unsupervised Bayesian probabilistic model that integrates multimodal place categorization, lexical acquisition, and SLAM. However, our original algorithm had limited estimation accuracy owing to the influence of the early stages of learning, and its computational complexity increased as training data were added. Therefore, we introduce techniques such as fixed-lag rejuvenation to reduce the calculation time while maintaining an accuracy higher than that of the original algorithm. The results show that, in terms of estimation accuracy, the proposed algorithm exceeds the original algorithm and is comparable to batch learning. In addition, the calculation time of the proposed scalable algorithm does not depend on the amount of training data and remains constant at each step. Our approach will contribute to the realization of long-term spatial language interactions between humans and robots.
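For readers unfamiliar with the fixed-lag idea, the following is a minimal sketch (not the authors' SpCoSLAM 2.0 implementation): at every time step only the latent assignments inside a fixed lag window are re-sampled, so the per-step cost does not grow with the amount of accumulated training data. The toy Dirichlet-multinomial mixture, the hyperparameters, and the lag length are assumptions made purely for illustration.

import random
from collections import defaultdict

K, V = 5, 20            # number of clusters and vocabulary size (assumed)
ALPHA, BETA = 1.0, 0.1  # symmetric Dirichlet hyperparameters (assumed)
LAG = 10                # rejuvenation window length (assumed)

assignments = []                  # z_1 .. z_t
counts_kv = defaultdict(float)    # (cluster, word) co-occurrence counts
counts_k = defaultdict(float)     # cluster sizes

def sample_assignment(obs):
    # Collapsed-Gibbs draw of a cluster index for a single observation.
    weights = []
    for k in range(K):
        prior = counts_k[k] + ALPHA
        like = (counts_kv[(k, obs)] + BETA) / (counts_k[k] + BETA * V)
        weights.append(prior * like)
    return random.choices(range(K), weights=weights)[0]

def online_step(stream, t):
    # Assign the newest observation, then rejuvenate only the last LAG assignments.
    z = sample_assignment(stream[t])
    assignments.append(z)
    counts_kv[(z, stream[t])] += 1
    counts_k[z] += 1
    for i in range(max(0, t - LAG + 1), t + 1):
        old = assignments[i]
        counts_kv[(old, stream[i])] -= 1   # remove, re-sample, and re-add
        counts_k[old] -= 1
        new = sample_assignment(stream[i])
        assignments[i] = new
        counts_kv[(new, stream[i])] += 1
        counts_k[new] += 1

stream = [random.randrange(V) for _ in range(200)]
for t in range(len(stream)):
    online_step(stream, t)
print("cluster sizes:", dict(counts_k))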
... Of course, there are other concepts as well. For example, spatial concepts (Taniguchi et al., 2016a) can be considered; however, to realize such a concept, the robot is required to have a mobile base. ...
Article
Full-text available
The manner in which humans learn, plan, and decide actions is a very compelling subject. Moreover, the mechanism behind high-level cognitive functions, such as action planning, language understanding, and logical thinking, has not yet been fully implemented in robotics. In this paper, we propose a framework for the simultaneous comprehension of concepts, actions, and language as a first step toward this goal. This can be achieved by integrating various cognitive modules, mainly by leveraging multimodal categorization using multilayered multimodal latent Dirichlet allocation (mMLDA). The integration of reinforcement learning and mMLDA enables actions based on understanding. Furthermore, mMLDA, in conjunction with grammar learning based on a Bayesian hidden Markov model (BHMM), allows the robot to verbalize its own actions and understand user utterances. We verify the potential of the proposed architecture through experiments using a real robot.
... Other nominal concept formation methods have also been developed. For example, locational, or spatial, concept formation methods have been proposed by Taniguchi et al. [120,121]. ...
Article
Full-text available
The understanding and acquisition of a language in a real-world environment is an important task for future robotics services. Natural language processing and cognitive robotics have both been focusing on the problem for decades using machine learning. However, many problems remain unsolved despite significant progress in machine learning (such as deep learning and probabilistic generative models) during the past decade. The remaining problems have not been systematically surveyed and organized, as most of them are highly interdisciplinary challenges for language and robotics. This study conducts a survey on the frontier of the intersection of the research fields of language and robotics, ranging from logic probabilistic programming to designing a competition to evaluate language understanding systems. We focus on cognitive developmental robots that can learn a language from interaction with their environment and unsupervised learning methods that enable robots to learn a language without hand-crafted training data.
... The purpose is to endow a robot with the ability to model its knowledge about the environment and to acquire new knowledge when it occurs in a kind of self-consciousness process. In [16], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from human continuous speech signals. ...
Chapter
Full-text available
The skills required of machines have grown exponentially in the last decade. Recent efforts made by the scientific community have shown amazing results in the field of research related to artificial intelligence and robotics. Recent studies show that machines may be superior to humans in carrying out certain tasks. However, in many approaches they still fail to achieve the high-level skills required to support humans and interact with them. Furthermore, the “intelligence” exhibited is hardly ever the result of a real autonomous decision. In this article we propose a novel approach, with the aim of providing a machine with the ability to evolve and build its own knowledge by combining both semantic and visual information. The proposed concept, its implementation, and experimental results are shown and discussed.
... • The extension to a mutual segmentation model of sound strings and situations based on multimodal information will be achieved by building on a multimodal LDA with a nested Pitman-Yor language model (Nakamura et al., 2014) and a spatial concept acquisition model that integrates self-localization and unsupervised word discovery from spoken sentences (Taniguchi et al., 2016a). ...
Preprint
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. Semiotic communication is defined, in this study, as the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchange of signs with other agents. From the viewpoint of language evolution and symbol emergence, organization of a symbol system in a multi-agent system is considered as a bottom-up and dynamic process, where individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence is also verified in an experiment with two agents observing daily objects in the real-world environment. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
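As a rough illustration of the probabilistic rejection described above, the sketch below implements a Metropolis-Hastings-style acceptance rule in a two-agent naming game: the speaker proposes a sign from its own belief, and the listener accepts it with a probability computed from its own belief about the same object. The categorical sign-category associations, the assumption that both agents attend to the same category, and all constants are toy assumptions for illustration, not the authors' model.

import numpy as np

rng = np.random.default_rng(0)
N_SIGNS, N_CATS = 5, 3

class Agent:
    def __init__(self):
        # p(sign | category): each agent's private association of signs with categories.
        self.phi = rng.dirichlet(np.ones(N_SIGNS), size=N_CATS)

    def propose_sign(self, category):
        # Speaker samples a sign from its own belief about the category.
        return rng.choice(N_SIGNS, p=self.phi[category])

    def accept(self, proposed_sign, current_sign, category):
        # Listener accepts with a Metropolis-Hastings-style ratio under its own belief.
        ratio = self.phi[category, proposed_sign] / self.phi[category, current_sign]
        return rng.random() < min(1.0, ratio)

speaker, listener = Agent(), Agent()
category = 1                               # both agents perceive the same object category (assumed)
current_sign = int(rng.integers(N_SIGNS))  # the listener's current sign for the object

for step in range(20):
    s = speaker.propose_sign(category)
    if listener.accept(s, current_sign, category):
        current_sign = s                   # the shared sign changes only when the listener accepts
print("sign after 20 exchanges:", current_sign)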
... Gu et al. (2016) proposed a method of learning relative space categories from ambiguous instructions. Taniguchi et al. (2014, 2016) proposed computational models for a mobile robot to acquire spatial concepts based on information from recognized speech and estimated self-location. Here, the spatial concept was defined as the distributions of names and positions at each place. ...
Article
Full-text available
In this paper, we propose a hierarchical spatial concept formation method based on a Bayesian generative model with multimodal information, e.g., vision, position, and word information. Since humans have the ability to select an appropriate level of abstraction according to the situation and describe their position linguistically, e.g., “I am in my home” and “I am in front of the table,” a hierarchical structure of spatial concepts is necessary in order for human support robots to communicate smoothly with users. The proposed method enables a robot to form hierarchical spatial concepts by categorizing multimodal information using hierarchical multimodal latent Dirichlet allocation (hMLDA). Object recognition results obtained using a convolutional neural network (CNN), the hierarchical k-means clustering result of self-positions estimated by Monte Carlo localization (MCL), and a set of location names are used as features of the vision, position, and word information, respectively. Experiments in forming hierarchical spatial concepts and evaluating how the proposed method can predict unobserved location names and position categories are performed using a robot in the real world. The results verify that, relative to comparable baseline methods, the proposed method enables a robot to predict location names and position categories closer to the predictions made by humans. As an application example of the proposed method in a home environment, a demonstration in which a human support robot moves to an instructed place based on human speech instructions is achieved using the formed hierarchical spatial concepts.
... However, the limitation of these two approaches is that the knowledge is pre-defined rather than acquired completely by the robot. Taniguchi et al. [25] proposed a nonparametric Bayesian spatial concept acquisition method, which allows the robot to obtain place names from uttered sentences and to use the acquired spatial concepts to effectively reduce the uncertainty of self-localization. Compared with the first two methods, it is more autonomous for robots. ...
... However, Taniguchi et al. (2016a) assumed that the name of a place would be learned from an uttered word. Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on place categorization and unsupervised word segmentation. SpCoA could acquire the names of places from spoken sentences including multiple words. ...
Article
Full-text available
In this paper, we propose a Bayesian generative model that can form multiple categories based on each sensory-channel and can associate words with any of four sensory-channels (action, position, object, and color). This paper focuses on cross-situational learning using the co-occurrence between words and information of sensory-channels in complex situations rather than conventional situations of cross-situational learning. We conducted a learning scenario using a simulator and a real humanoid iCub robot. In the scenario, a human tutor provided a sentence that describes an object of visual attention and an accompanying action to the robot. The scenario was set as follows: the number of words per sensory-channel was three or four, and the number of trials for learning was 20 and 40 for the simulator and 25 and 40 for the real robot. The experimental results showed that the proposed method was able to estimate the multiple categorizations and to learn the relationships between multiple sensory-channels and words accurately. In addition, we conducted an action generation task and an action description task based on word meanings learned in the cross-situational learning scenario. The experimental results showed the robot could successfully use the word meanings learned by using the proposed method.
... Regarding related work on place categorization, Taniguchi et al. proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) on the basis of unsupervised word segmentation and a nonparametric Bayesian generative model that integrates self-localization and clustering in both words and places [2]. Hagiwara et al. proposed a method that enables robots to autonomously form place concepts using hierarchical multimodal latent Dirichlet allocation (hMLDA) [3], based on position and visual information [4]. ...
Conference Paper
Full-text available
Human support robots need to learn the relationships between objects and places to provide services such as cleaning rooms and locating objects through linguistic communications. In this paper, we propose a Bayesian probabilistic model that can automatically model and estimate the probability of objects existing in each place using a multimodal spatial concept based on the co-occurrence of objects. In our experiments, we evaluated the estimation results for objects by using a word to express their places. Furthermore, we showed that the robot could perform tasks involving cleaning up objects, as an example of the usage of the method. We showed that the robot correctly learned the relationships between objects and places.
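As a small, hypothetical sketch of the co-occurrence idea (not the paper's multimodal Bayesian model), the following estimates p(place | object) from Dirichlet-smoothed object-place co-occurrence counts and uses it to guess where an object belongs; the object and place names and the pseudo-count are made up for illustration.

from collections import Counter

ALPHA = 1.0  # symmetric Dirichlet pseudo-count (assumed)

observations = [  # (object, place) pairs observed in a tidied environment (toy data)
    ("cup", "kitchen"), ("cup", "kitchen"), ("book", "shelf"),
    ("cup", "living_room"), ("book", "shelf"), ("toy", "living_room"),
]

places = sorted({p for _, p in observations})
counts = Counter(observations)
obj_totals = Counter(o for o, _ in observations)

def p_place_given_object(obj):
    # Posterior-mean estimate of p(place | object) under Dirichlet smoothing.
    denom = obj_totals[obj] + ALPHA * len(places)
    return {pl: (counts[(obj, pl)] + ALPHA) / denom for pl in places}

dist = p_place_given_object("cup")
print(dist)                          # the kitchen should get the highest probability
print(max(dist, key=dist.get))       # most probable place to put the cup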
... They showed that a robot could improve its localization performance by integrating object recognition results and Monte Carlo localization results. Taniguchi et al. (2016a) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates a generative model for self-localization and the unsupervised word segmentation in uttered sentences via latent variables related to the spatial concept. In these studies, the knowledge of places and their names is learned by a robot in a totally unsupervised and statistical manner. ...
Article
Humans can acquire language through physical interaction with their environment and semiotic interaction with other people. It is very important to understand how humans can form a symbol system and obtain semiotic skills through their autonomous mental development from a computational point of view. A machine learning system that enables a robot to obtain and modulate its symbol system is crucially important to develop robotic systems that achieve long-term human-robot communication and collaboration. In this paper, I introduce the basis of our research field and related topics. Specifically, I describe the concept of symbol emergence systems and the recent research topics, e.g., multimodal categorization, spatial concept formation, language acquisition, and double articulation analysis, that will contribute to future human-robot communication and collaboration.
Chapter
Music and language are structurally similar. Such structural similarity is often explained by generative processes. This paper describes the recent development of probabilistic generative models (PGMs) for language learning and symbol emergence in robotics. Symbol emergence in robotics aims to develop a robot that can adapt to real-world environments and human linguistic communications and acquire language from sensorimotor information alone (i.e., in an unsupervised manner). This is regarded as a constructive approach to symbol emergence systems. To this end, a series of PGMs have been developed, including those for simultaneous phoneme and word discovery, lexical acquisition, object and spatial concept formation, and the emergence of a symbol system. By extending the models, a symbol emergence system comprising a multi-agent system in which a symbol system emerges is revealed to be modeled using PGMs. In this model, symbol emergence can be regarded as collective predictive coding. This paper expands on this idea by combining the theory that “emotion is based on the predictive coding of interoceptive signals” and “symbol emergence systems”, and describes the possible hypothesis of the emergence of meaning in music. Keywords: symbol emergence systems, probabilistic generative model, symbol emergence in robotics, automatic music composition, language evolution
Conference Paper
Full-text available
This paper proposes SpCoMapGAN, a method to generate the semantic map in a newly visited environment by training an inference model using previously estimated semantic maps. SpCoMapGAN uses generative adversarial networks (GANs) to transfer semantic information based on room arrangements to the newly visited environment. We experimentally show in simulation that SpCoMapGAN can use global information for estimating the semantic map and is superior to previous related methods.
Article
Full-text available
This paper proposes a hierarchical Bayesian model based on spatial concepts that enables a robot to transfer the knowledge of places from experienced environments to a new environment. The transfer of knowledge based on spatial concepts is modeled as the calculation process of the posterior distribution based on the observations obtained in each environment with the parameters of spatial concepts generalized to environments as prior knowledge. We conducted experiments to evaluate the generalization performance of spatial knowledge for general places such as kitchens and the adaptive performance of spatial knowledge for unique places such as ‘Emma's room’ in a new environment. In the experiments, the accuracies of the proposed method and conventional methods were compared in the prediction task of location names from an image and a position, and the prediction task of positions from a location name. The experimental results demonstrated that the proposed method has a higher prediction accuracy of location names and positions than the conventional method owing to the transfer of knowledge.
Preprint
Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many pre-existing computational models that represent the process tend to focus on distributional or prosodic cues. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We conducted three experiments on different types of datasets, and demonstrate the validity of the proposed method. The results show that the Prosodic DAA successfully uses prosodic cues and outperforms a method that solely uses distributional cues. The main contributions of this study are as follows: 1) We develop a probabilistic generative model for time series data including prosody that potentially has a double articulation structure; 2) We propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can discover words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner; 3) We show that prosodic cues contribute more to word segmentation in the case of naturally distributed words, i.e., words that follow Zipf's law.
Article
Visual perception is an important component for human–robot interaction processes in robotic systems. Interaction between humans and robots depends on the reliability of the robotic vision systems. The variation of camera sensors and the capability of these sensors to detect many types of sensory inputs improve visual perception. The analysis of activities, motions, skills, and behaviors of humans and robots has been addressed by utilizing the heat signatures of the human body. The human motion behavior is analyzed by body movement kinematics, and the trajectory of the target is used to identify the objects and the human target in omnidirectional (O-D) thermal images. Human target identification and gesture recognition with traditional sensors are problematic in multitarget scenarios, since these sensors may not keep all targets in their narrow field of view (FOV) at the same time. The O-D thermal view increases the robots' line of sight and their ability to perceive in the absence of light. The human target is informed of its position, surrounding objects, and any other human targets in its proximity, so that humans with limited vision or a vision disability can be assisted in their environment. The proposed method helps to identify human targets in a wide FOV under light-independent conditions, to assist the human target and improve human–robot and robot–robot interactions. The experimental results show that the identification of the human targets is achieved with a high accuracy.
Article
Full-text available
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. In this study, the semiotic communication refers to exchanging signs composed of the signifier (i.e., words) and the signified (i.e., categories). We define the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents as basic functions of the semiotic communication. From the viewpoint of language evolution and symbol emergence, organization of a symbol system in a multi-agent system (i.e., agent society) is considered as a bottom-up and dynamic process, where individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, i.e., the complete system can be regarded as a single agent performing multimodal categorization using the sensors of all agents, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence, i.e., forming and sharing signs and categories, is also verified in an experiment with two agents observing daily objects in the real-world environment. In the experiment, we compared three communication algorithms: no communication, no rejection, and the proposed algorithm. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
Article
The concept of direction is critical for a robot to establish its abilities in spatial perception. This article addresses the issue of how a robot developmentally forms the concept of direction. We propose a novel framework, based on the related mechanisms of humans, in which motion cues are employed instead of other commonly used sensing means like vision or audition. Using motion cues is actually one of the most important ways for humans to acquire concepts. This leads to advantages in two ways: 1) models based on motion cues are usually more convenient for a robot’s control of motion and 2) multimodal perceptual cues can complement each other and result in improved robustness. The framework behind our methodology lies in developmentally forming the concept of direction in two successive phases: 1) the robot acquires the perceptual representation of its motion by motor babbling and 2) the corresponding higher level features are gradually captured, based on which the conceptual representation is obtained and then the concept of direction is formed. The proposed framework and the corresponding models are evaluated with a PKU-HR5.0II humanoid robot. The results show that the robot can successfully form the concept of direction in a manner similar to that of humans.
Article
This paper describes how to achieve highly accurate unsupervised spatial lexical acquisition from speech-recognition results including phoneme recognition errors. In most research into lexical acquisition, the robot has no pre-existing lexical knowledge. The robot acquires sequences of phonemes as words from continuous speech signals. In a previous study, we proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates the robot's position and words obtained by unsupervised word segmentation from uncertain syllable recognition results. However, SpCoA has a critical problem in lexical acquisition: the word segmentation boundaries are incorrect in many cases because of frequent phoneme recognition errors. Therefore, we propose an unsupervised machine learning method (SpCoA++) for the robust lexical acquisition of novel words relating to places visited by the robot. The proposed SpCoA++ method performs an iterative estimation of learning spatial concepts and updating a language model using place information. SpCoA++ can select a candidate that includes many words that better represent places from multiple word-segmentation results by maximizing the mutual information between segmented words and spatial concepts. The experimental results demonstrate that the proposed method significantly improves the phoneme accuracy rate of learned place-related words by exploiting word-segmentation results based on place information, in comparison with conventional methods. We indicate that the proposed method enables the robot to acquire words from speech signals more accurately, and improves the estimation accuracy of the spatial concepts.
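The selection criterion can be illustrated with a small sketch that scores each word-segmentation candidate by the mutual information between segmented words and spatial-concept indices and keeps the highest-scoring one. The toy candidates and all names below are assumptions for illustration, not the authors' implementation.

import math
from collections import Counter

def mutual_information(pairs):
    # I(W; C) estimated from (word, concept index) co-occurrence counts.
    n = len(pairs)
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    concepts = Counter(c for _, c in pairs)
    mi = 0.0
    for (w, c), nwc in joint.items():
        p_wc = nwc / n
        mi += p_wc * math.log(p_wc / ((words[w] / n) * (concepts[c] / n)))
    return mi

# Each candidate: (segmented word, index of the spatial concept that generated
# the utterance) pairs collected over the whole training set.
candidates = {
    "candidate_A": [("kitchen", 0), ("kitchen", 0), ("bedroom", 1), ("ki", 0), ("tchen", 0)],
    "candidate_B": [("kitchen", 0), ("kitchen", 0), ("bedroom", 1), ("kitchen", 0), ("bedroom", 1)],
}

scores = {name: mutual_information(pairs) for name, pairs in candidates.items()}
best = max(scores, key=scores.get)
print(best, {name: round(s, 3) for name, s in scores.items()})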
Conference Paper
A model was developed to allow a mobile robot to label the areas of a typical domestic room using raw sequential visual and motor data; no explicit information on location was provided, and no maps were constructed. The model comprised a deep autoencoder and a recurrent neural network. The model was demonstrated to (1) learn to correctly label areas of different shapes and sizes, (2) be capable of adapting to changes in room shape and rearrangement of items in the room, and (3) attribute different labels to the same area, when approached from different angles. Analysis of the internal representations of the model showed that a topological structure corresponding to the room structure was self-organized as the trajectory of the internal activations of the network.
Conference Paper
Full-text available
Human infants can acquire word meanings by estimating the relationships among multiple situations and words. In this paper, we propose a Bayesian probabilistic model that can learn multiple categorizations and words related to any of four modalities (action, object, position, and color). This paper focuses on a cross-situational learning using the co-occurrence of sentences and situations. We conducted a learning experiment using the humanoid iCub robot. In this experiment, the human tutor describes a sentence about an object of visual attention to the robot and an action of the robot. The experimental results show that the proposed method was able to estimate the multiple categorizations and to accurately learn the relationships between multiple modalities and words.
Conference Paper
Full-text available
Providing autonomous humanoid robots with the abilities to react in an adaptive and intelligent manner involves low level control and sensing as well as high level reasoning. However, the integration of both levels still remains challenging due to the representational gap between the continuous state space on the sensorimotor level and the discrete symbolic entities used in high level reasoning. In this work, we approach the problem of learning a representation of the space which is applicable on both levels. This representation is grounded on the sensorimotor level by means of exploration and on the language level by making use of common sense knowledge. We demonstrate how spatial knowledge can be extracted from these two sources of experience. Combining the resulting knowledge in a systematic way yields a solution to the grounding problem which has the potential to substantially decrease the learning effort.
Article
Full-text available
For mobile robots to communicate meaningfully about their spatial environment, they require personally constructed cognitive maps and social interactions to form languages with shared meanings. Geographic spatial concepts introduce particular problems for grounding—connecting a word to its referent in the world—because such concepts cannot be directly and solely based on sensory perceptions. In this article we investigate the grounding of geographic spatial concepts using mobile robots with cognitive maps, called Lingodroids. Languages were established through structured interactions between pairs of robots called where-are-we conversations. The robots used a novel method, termed the distributed lexicon table, to create flexible concepts. This method enabled words for locations, termed toponyms, to be grounded through experience. Their understanding of the meaning of words was demonstrated using go-to games in which the robots independently navigated to named locations. Studies in real and virtual reality worlds show that the system is effective at learning spatial language: robots learn words easily—in a single trial as children do—and the words and their meaning are sufficiently robust for use in real world tasks.
Article
Full-text available
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
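As a compact illustration of the lexical translation probabilities such alignment models estimate, the sketch below runs EM training of a Model-1-style table t(f | e) on a toy corpus and then reads off a word-by-word alignment; it is a simplified stand-in (no NULL word, fertility, or distortion modeling), not the authors' full series of five models, and the toy corpus is assumed.

from collections import defaultdict

corpus = [  # (English sentence, French sentence) toy pairs
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "house"], ["une", "maison"]),
]

e_vocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # t[(f, e)], uniform initialization

for _ in range(20):                            # EM iterations
    count = defaultdict(float)                 # expected counts c(f, e)
    total = defaultdict(float)                 # expected counts c(e)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)     # normalize over possible alignments of f
            for e in es:
                delta = t[(f, e)] / z          # posterior probability that e generated f
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():            # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

es, fs = corpus[0]                             # most probable alignment for one sentence pair
alignment = [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]
print(alignment)                               # expected: [('la', 'the'), ('maison', 'house')]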
Article
Full-text available
This paper presents a machine learning method that enables robots to learn to communicate linguistically from scratch through verbal and behavioral interaction with users. The method combines speech, visual, and tactile information obtained by interaction in the real world. It learns speech units, words, concepts of objects, motions, grammar, and pragmatic and communicative capabilities, which are integrated in a dynamic graphical model. Experimental results show that through a practical, small number of learning episodes with a user, the robot was eventually able to understand even fragmental and ambiguous utterances, respond to them with confirmation questions and/or acting, generate directive utterances appropriate for the given situation, and answer questions. This paper discusses the importance of a developmental approach to realize natural situated human-robot conversations.
Chapter
Full-text available
This work presents a developmental and ecological approach to language acquisition in robots, which has its roots in the interaction between infants and their caregivers. We show that the signal directed to infants by their caregivers includes several hints that can facilitate language acquisition and reduce the need for preprogrammed linguistic structure. Moreover, infants also produce sounds, which enables richer types of interaction, such as imitation games, and the use of motor learning. By using a humanoid robot with embodied models of the infant's ears, eyes, vocal tract, and memory functions, we can mimic the adult-infant interaction and take advantage of the inherent structure in the signal. Two experiments are shown, in which the robot learns a number of word-object associations and the articulatory target positions for a number of vowels.
Conference Paper
Full-text available
Understanding the mechanisms of intelligence of human beings and animals is one of the most important approaches to developing intelligent robot systems. Since the mechanisms of such real-life intelligent systems are so complex, physical interactions between agents and their environment and the social interactions between agents should be considered. Comprehension and knowledge in many peripheral fields such as cognitive science, developmental psychology, brain science, evolutionary biology, and robotics are also required. Discussions from an interdisciplinary perspective are very important for this approach, but such collaborative research is time-consuming and labor-intensive, and it is difficult to obtain fruitful results because the basis of experiments is very different in each research field. In the social science field, for example, several multi-agent simulation systems have been proposed for modeling factors such as social interactions and language evolution, whereas robotics researchers often use dynamics and sensor simulators. However, there is no integrated system that uses both physical simulations and social communication simulations. Therefore, we developed a simulator environment called SIGVerse that combines dynamics, perception, and communication simulations for synthetic approaches to research into the genesis of social intelligence. In this paper, we introduce SIGVerse, an example application, and future perspectives.
Chapter
Full-text available
8.1 Sharing the risk of being misunderstood
The experiments in learning a pragmatic capability illustrate the importance of sharing the risk of not being understood correctly between the user and the robot. In the learning period for utterance understanding by the robot, the values of the local confidence parameters changed significantly when the robot acted incorrectly in the first trial and correctly in the second trial. To facilitate learning, the user had to gradually increase the ambiguity of utterances according to the robot's developing ability to understand them and had to take the risk of not being understood correctly. In its learning period for utterance generation, the robot adjusted its utterances to the user while learning the global confidence function. When the target understanding rate was set to 0.95, the global confidence function became very unstable in cases where the robot's expectations of being understood correctly at a high probability were not met. This instability could be prevented by using a lower value of , which means that the robot would have to take a greater risk of not being understood correctly. Accordingly, in human-machine interaction, both users and robots must face the risk of not being understood correctly and thus adjust their actions to accommodate such risk in order to effectively couple their belief systems. Although the importance of controlling the risk of error in learning has generally been seen as an exploration-exploitation trade-off in the field of reinforcement learning by machines (e.g., Dayan & Sejnowski, 1996), we argue here that the mutual accommodation of the risk of error by those communicating is an important basis for the formation of mutual understanding.
8.2 Incomplete observed information and fast adaptation
In general, an utterance does not contain complete information about what a speaker wants to convey to a listener. The proposed learning method interpreted such utterances according to the situation by providing necessary but missing information by making use of the assumption of shared beliefs. The method also enabled the robot and the user to adapt such an assumption of shared beliefs to each other with little interaction. We can say that the method successfully
Conference Paper
Full-text available
Julius is a high-performance, two-pass LVCSR decoder for researchers and developers. Based on word 3-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in a 20k-word dictation task. Major search techniques are fully incorporated, such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkits. The main platform is Linux and other Unix workstations, and it partially works on Windows. Julius is distributed with an open license together with source codes, and has been used by many researchers and developers in Japan.
Conference Paper
Full-text available
This paper proposes a method for the unsupervised learning of lexicons from pairs of a spoken utterance and an object as its meaning, under the condition that no prior linguistic knowledge other than acoustic models of Japanese phonemes is used. The main problems are the word segmentation of spoken utterances and the learning of the phoneme sequences of the words. To obtain a lexicon, a statistical model, which represents the joint probability of an utterance and an object, is learned based on the minimum description length (MDL) principle. The model consists of three parts: a word list in which each word is represented by a phoneme sequence, a word-bigram model, and a word-meaning model. Through alternate learning processes of these parts, acoustically, grammatically, and semantically appropriate units of phoneme sequences that cover all utterances are acquired as words. Experimental results show that our model can acquire phoneme sequences of object words with about 83.6% accuracy.
Conference Paper
Full-text available
Situated, spontaneous speech may be ambiguous along acoustic, lexical, grammatical and semantic dimensions. To understand such a seemingly difficult signal, we propose to model the ambiguity inherent in acoustic signals and in lexical and grammatical choices using compact, probabilistic representations of multiple hypotheses. To resolve semantic ambiguities we propose a situation model that captures aspects of the physical context of an utterance as well as the speaker's intentions, in our case represented by recognized plans. In a single, coherent Framework for Understanding Situated Speech (FUSS) we show how these two influences, acting on an ambiguous representation of the speech signal, complement each other to disambiguate form and content of situated speech. This method produces promising results in a game playing environment and leaves room for other types of situation models.
Conference Paper
Full-text available
In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be considered as a way to construct an accurate word n-gram language model directly from characters of an arbitrary language, without any "word" indications.
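The blocked sampling step can be illustrated by forward filtering / backward sampling of a word segmentation under a fixed unigram word model, as in the sketch below. The actual model couples this step with learning the nested hierarchical Pitman-Yor language model; here the word probability is a simple stand-in (a geometric length prior times uniform characters), so the code only illustrates the inference machinery under those assumptions.

import random

MAX_WORD_LEN = 8
CHAR_PROB = 1.0 / 27.0   # assumed uniform probability per character
P_CONTINUE = 0.5         # assumed geometric word-length prior

def p_word(w):
    # Toy base probability of a word; the real model would use its learned lexicon here.
    return (P_CONTINUE ** (len(w) - 1)) * (1 - P_CONTINUE) * (CHAR_PROB ** len(w))

def sample_segmentation(s):
    n = len(s)
    alpha = [0.0] * (n + 1)   # alpha[t] = marginal probability of s[:t] over all segmentations
    alpha[0] = 1.0
    for t in range(1, n + 1):                               # forward filtering
        for k in range(1, min(t, MAX_WORD_LEN) + 1):
            alpha[t] += p_word(s[t - k:t]) * alpha[t - k]
    words, t = [], n
    while t > 0:                                            # backward sampling of boundaries
        ks = list(range(1, min(t, MAX_WORD_LEN) + 1))
        weights = [p_word(s[t - k:t]) * alpha[t - k] for k in ks]
        k = random.choices(ks, weights=weights)[0]
        words.append(s[t - k:t])
        t -= k
    return list(reversed(words))

print(sample_segmentation("thehouseisbig"))   # one sampled segmentation under the toy model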
Article
Full-text available
We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.
Article
Full-text available
An understanding of time and temporal concepts is critical for interacting with the world and with other agents in the world. What does a robot need to know to refer to the temporal aspects of events? Could a robot gain a grounded understanding of 'a long journey', or 'soon'? Cognitive maps constructed by individual agents from their own journey experiences have been used for grounding spatial concepts in robot languages. In this paper, we test whether a similar methodology can be applied to learning temporal concepts and an associated lexicon to answer the question 'how long did it take to complete a journey?'. Using evolutionary language games for specific and generic journeys, successful communication was established for concepts based on representations of time, distance, and amount of change. The studies demonstrate that a lexicon for journey duration can be grounded using a variety of concepts. Spatial and temporal terms are not identical, but the studies show that both can be learned using similar language evolution methods, and that time, distance, and change can serve as proxies for each other under noisy conditions. Effective concepts and names for duration provide a first step towards a grounded lexicon for temporal interval logic.
Article
Full-text available
We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566--1581]. Although the basic HDP-HMM tends to over-segment the audio data---creating redundant states and rapidly switching among them---we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
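The effective control over the switching rate can be illustrated, under a weak-limit (truncated) approximation, by giving each Dirichlet transition row an extra self-transition mass kappa, as in the sketch below; the truncation level, hyperparameter values, and the weak-limit form itself are illustrative assumptions, not the paper's full sampler.

import numpy as np

rng = np.random.default_rng(0)
L, alpha, kappa = 10, 1.0, 20.0            # truncation level and hyperparameters (assumed)
beta = rng.dirichlet(np.ones(L))           # shared top-level state weights

def sample_transition_matrix(sticky):
    # Each row j ~ Dirichlet(alpha * beta + kappa * e_j); kappa biases self-transitions.
    pi = np.empty((L, L))
    for j in range(L):
        concentration = alpha * beta + (kappa if sticky else 0.0) * np.eye(L)[j]
        pi[j] = rng.dirichlet(concentration)
    return pi

for sticky in (False, True):
    pi = sample_transition_matrix(sticky)
    label = "sticky" if sticky else "plain"
    print(label, "mean self-transition probability:", round(float(np.mean(np.diag(pi))), 3))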
Conference Paper
Full-text available
The work presents a new approach to the problem of simultaneous localization and mapping - SLAM - inspired by computational models of the hippocampus of rodents. The rodent hippocampus has been extensively studied with respect to navigation tasks, and displays many of the properties of a desirable SLAM solution. RatSLAM is an implementation of a hippocampal model that can perform SLAM in real time on a real robot. It uses a competitive attractor network to integrate odometric information with landmark sensing to form a consistent representation of the environment. Experimental results show that RatSLAM can operate with ambiguous landmark information and recover from both minor and major path integration errors.
Conference Paper
Full-text available
The ability to learn a consistent model of its environment is a prerequisite for autonomous mobile robots. A particularly challenging problem in acquiring environment maps is that of closing loops; loops in the environment create challenging data association problems [J.-S. Gutman et al., 1999]. This paper presents a novel algorithm that combines Rao-Blackwellized particle filtering and scan matching. In our approach, scan matching is used for minimizing odometric errors during mapping. A probabilistic model of the residual errors of the scan matching process is then used for the resampling steps. In this way, the number of samples required is greatly reduced. Simultaneously, we reduce the particle depletion problem that typically prevents the robot from closing large loops. We present extensive experiments that illustrate the superior performance of our approach compared to previous approaches.
Conference Paper
Full-text available
To navigate reliably in indoor environments, a mobile robot must know where it is. Thus, reliable position estimation is a key problem in mobile robotics. We believe that probabilistic approaches are among the most promising candidates for providing a comprehensive and real-time solution to the robot localization problem. However, current methods still face considerable hurdles. In particular, the problems encountered are closely related to the type of representation used to represent probability densities over the robot's state space. Earlier work on Bayesian filtering with particle-based density representations opened up a new approach for mobile robot localization based on these principles. We introduce the Monte Carlo localization method, where we represent the probability density involved by maintaining a set of samples that are randomly drawn from it. By using a sampling-based representation we obtain a localization method that can represent arbitrary distributions. We show experimentally that the resulting method is able to efficiently localize a mobile robot without knowledge of its starting location. It is faster, more accurate, and less memory-intensive than earlier grid-based methods.
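A minimal one-dimensional sketch of the sample-based representation described above: the belief over the robot's position is a set of samples that are propagated with a noisy motion model, weighted by a sensor likelihood, and resampled. The corridor world, landmark positions, and noise levels are illustrative assumptions only, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(1)
LANDMARKS = np.array([2.0, 6.0, 9.0])    # known landmark positions along a corridor (assumed)
N = 500                                   # number of particles

def sense(true_pos):
    # Noisy measured distance to the nearest landmark.
    return np.min(np.abs(LANDMARKS - true_pos)) + rng.normal(0.0, 0.2)

def likelihood(particles, z):
    expected = np.min(np.abs(LANDMARKS[None, :] - particles[:, None]), axis=1)
    return np.exp(-0.5 * ((z - expected) / 0.2) ** 2)

particles = rng.uniform(0.0, 10.0, size=N)   # global localization: start from a uniform prior
true_pos = 1.0

for step in range(15):
    u = 0.5                                            # commanded forward motion
    true_pos += u
    particles += u + rng.normal(0.0, 0.1, size=N)      # motion update with noise
    w = likelihood(particles, sense(true_pos))         # measurement update
    w /= np.sum(w)
    particles = particles[rng.choice(N, size=N, p=w)]  # importance resampling

print("true position:", round(true_pos, 2), "particle mean:", round(float(np.mean(particles)), 2))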
Article
Humans develop their concept of an object by classifying it into a category, and acquire language by interacting with others at the same time. Thus, the meaning of a word can be learnt by connecting the recognized word and concept. We consider such an ability to be important in allowing robots to flexibly develop their knowledge of language and concepts. Accordingly, we propose a method that enables robots to acquire such knowledge. The object concept is formed by classifying multimodal information acquired from objects, and the language model is acquired from human speech describing object features. We propose a stochastic model of language and concepts, and knowledge is learnt by estimating the model parameters. The important point is that language and concepts are interdependent. There is a high probability that the same words will be uttered to objects in the same category. Similarly, objects to which the same words are uttered are highly likely to have the same features. Using this relation, the accuracy of both speech recognition and object classification can be improved by the proposed method. However, it is difficult to directly estimate the parameters of the proposed model, because many parameters are required. Therefore, we approximate the proposed model, and estimate its parameters using a nested Pitman-Yor language model and multimodal latent Dirichlet allocation to acquire the language and concept, respectively. The experimental results show that the accuracy of speech recognition and object classification is improved by the proposed method.
Conference Paper
In this study, we propose a method for concept formation and word acquisition for robots. The proposed method is based on multimodal latent Dirichlet allocation (MLDA) and the nested Pitman-Yor language model (NPYLM). A robot obtains haptic, visual, and auditory information by grasping, observing, and shaking an object. At the same time, a user teaches object features to the robot through speech, which is recognized using only acoustic models and transformed into phoneme sequences. As the robot is supposed to have no language model in advance, the recognized phoneme sequences include many phoneme recognition errors. Moreover, the recognized phoneme sequences with errors are segmented into words in an unsupervised manner; however, not all words are necessarily segmented correctly. The words including these errors have a negative effect on the learning of word meanings. To overcome this problem, we propose a method to improve unsupervised word segmentation and to reduce phoneme recognition errors by using multimodal object concepts. In the proposed method, object concepts are used to enhance the accuracy of word segmentation, reduce phoneme recognition errors, and correct words so as to improve the categorization accuracy. We experimentally demonstrate that the proposed method can improve the accuracy of word segmentation and reduce the phoneme recognition error and that the obtained words enhance the categorization accuracy.
Conference Paper
In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJ-CAM0 dataset. We show that our approach delivers segmentation performance superior to an earlier approach found in the literature, in particular for higher-order language models.
Article
Progress in information technologies has enabled the application of computer-intensive methods to statistical analysis. In time series modeling, the sequential Monte Carlo method was developed for general nonlinear non-Gaussian state-space models, and it makes it possible to consider very complex nonlinear non-Gaussian models for real-world problems. In this paper, we consider several computational problems associated with the sequential Monte Carlo filter and smoother, such as the use of a huge number of particles, the two-filter formula for smoothing, and parallel computation. The posterior mean smoother and the Gaussian-sum smoother are also considered.
Conference Paper
Previous studies have shown how Lingodroids, language learning mobile robots, learn terms for space and time, connecting their personal maps of the world to a publicly shared language. One caveat of previous studies was that the robots shared the same cognitive architecture, identical in all respects from sensors to mapping systems. In this paper we investigate the question of how terms for space can be developed between robots that have fundamentally different sensors and spatial representations. In the real world, communication needs to occur between agents that have different embodiment and cognitive capabilities, including different sensors, different representations of the world, and different species (including humans). The novel aspect of these studies is that one robot uses a forward-facing camera to estimate appearance and uses a biologically inspired continuous attractor network to generate a topological map; the other robot uses a laser scanner to estimate range and uses a probabilistic filter approach to generate an occupancy grid. The robots hold conversations in different locations to establish a shared language. Despite their different ways of sensing and mapping the world, the robots are able to create coherent lexicons for the space around them.
Conference Paper
In this paper, we propose an online algorithm for multimodal categorization based on autonomously acquired multimodal information and partial words given by human users. For multimodal concept formation, multimodal latent Dirichlet allocation (MLDA) using Gibbs sampling is extended to an online version. We introduce a particle filter, which significantly improves the performance of the online MLDA, to keep track of good models among various models with different parameters. We also introduce an unsupervised word segmentation method based on the hierarchical Pitman-Yor language model (HPYLM). Since the HPYLM requires no predefined lexicon, we can build a robot system that learns concepts and words in a completely unsupervised manner. The proposed algorithms are implemented on a real robot and tested using real everyday objects to show the validity of the proposed system.
Conference Paper
The ability to simultaneously localize a robot and accurately map its surroundings is considered by many to be a key prerequisite of truly autonomous robots. However, few approaches to this problem scale up to handle the very large number of landmarks present in real environments. Kalman filter-based algorithms, for example, require time quadratic in the number of landmarks to incorporate each sensor observation. This paper presents FastSLAM, an algorithm that recursively estimates the full posterior distribution over robot pose and landmark locations, yet scales logarithmically with the number of landmarks in the map. The algorithm is based on an exact factorization of the posterior into a product of conditional landmark distributions and a distribution over robot paths. It has been run successfully on environments with as many as 50,000 landmarks, far beyond the reach of previous approaches. Experimental results demonstrate the advantages and limitations of the FastSLAM algorithm on both simulated and real-world data.
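The factorization FastSLAM exploits, namely that landmark estimates become independent once a robot path is fixed, can be sketched compactly: each particle holds a pose hypothesis plus one small Gaussian per landmark. The Python sketch below assumes a planar robot, known data association, and direct relative-offset measurements (so the observation Jacobian is the identity); these are simplifications for illustration and not the original algorithm's measurement model.

import copy
import numpy as np

class Particle:
    """One FastSLAM-style particle: a pose hypothesis plus one independent
    Gaussian (mean, covariance) per landmark, conditioned on that pose path."""
    def __init__(self, pose):
        self.pose = np.asarray(pose, dtype=float)    # planar (x, y) position
        self.landmarks = {}                           # landmark id -> (mean, cov)

def fastslam_step(particles, control, measurements, motion_noise=0.05, meas_noise=0.1, rng=None):
    """One predict/update/resample step with known data association.
    control is a 2-vector displacement; measurements are (landmark_id, offset)
    pairs, where offset is the landmark position relative to the robot."""
    rng = rng or np.random.default_rng()
    R = (meas_noise ** 2) * np.eye(2)
    weights = np.ones(len(particles))
    for i, p in enumerate(particles):
        # Sample a new pose from the motion model (prediction step).
        p.pose = p.pose + np.asarray(control, dtype=float) + rng.normal(0.0, motion_noise, 2)
        for lid, offset in measurements:
            observed = p.pose + np.asarray(offset, dtype=float)
            if lid not in p.landmarks:
                p.landmarks[lid] = (observed, R.copy())   # initialize a new landmark
                continue
            mean, cov = p.landmarks[lid]
            innovation = observed - mean
            S_inv = np.linalg.inv(cov + R)
            # Weight by the measurement likelihood, then update only this
            # landmark's Gaussian (landmarks are independent given the path).
            weights[i] *= float(np.exp(-0.5 * innovation @ S_inv @ innovation))
            K = cov @ S_inv                               # Kalman gain
            p.landmarks[lid] = (mean + K @ innovation, (np.eye(2) - K) @ cov)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [copy.deepcopy(particles[j]) for j in idx]     # resampling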
Article
This paper describes new language-processing methods suitable for human–robot interfaces. These methods enable a robot to learn linguistic knowledge from scratch in unsupervised ways. The learning is done through statistical optimization in the process of human–robot communication, combining speech, visual, and behavioral information in a probabilistic framework. The linguistic knowledge learned includes speech units like phonemes, lexicon, and grammar, and is represented by a graphical model that includes hidden Markov models. In experiments, a robot was eventually able to understand utterances according to given situations, and act appropriately.
Conference Paper
This paper proposes a method for the unsupervised learning of place names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we proposed a lexical learning method based on statistical model selection. This method can learn words that represent a single object, such as proper nouns, but cannot learn words that represent classes of objects, such as general nouns. This paper describes improvements to the method for learning both the phoneme sequence of each word and the distribution of objects that the word represents.
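As a rough illustration of learning a spatial distribution per word, the following Python sketch fits one Gaussian over robot positions for each place word, assuming the words have already been recognized; the variance floor and diagonal covariance are illustrative assumptions, and the phoneme-sequence learning described in the abstract is not covered here.

import numpy as np
from collections import defaultdict

def fit_place_word_distributions(word_position_pairs, var_floor=1e-3):
    """Fit one Gaussian over 2-D robot positions for each place word.
    word_position_pairs: (word, (x, y)) observations; the words are assumed
    to be already recognized, which skips the lexical learning step."""
    grouped = defaultdict(list)
    for word, position in word_position_pairs:
        grouped[word].append(position)
    distributions = {}
    for word, positions in grouped.items():
        X = np.asarray(positions, dtype=float)
        mean = X.mean(axis=0)
        cov = np.diag(X.var(axis=0) + var_floor)   # diagonal covariance with a floor
        distributions[word] = (mean, cov)
    return distributions

# A general noun such as "kitchen", heard at many nearby positions, acquires a
# broad distribution; a word heard at a single spot stays narrow.
dists = fit_place_word_distributions(
    [("kitchen", (1.0, 2.0)), ("kitchen", (1.4, 1.7)), ("door", (5.0, 0.3))])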
Conference Paper
In this paper, we propose a nonparametric Bayesian framework for categorizing multimodal sensory signals such as audio, visual, and haptic information by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints as well as to listen to the sound during the observation. The multimodal information enables the robot to form human-like object categories that are a basis of intelligence. The proposed method is an extension of the Hierarchical Dirichlet Process (HDP), a nonparametric Bayesian model, to a multimodal HDP (MHDP). MHDP can estimate the number of categories, whereas parametric models, e.g., LDA-based categorization, require the number to be specified in advance. As this is an unsupervised learning method, a human user does not need to give any correct labels to the robot, and the robot can classify objects autonomously. At the same time, the proposed method provides a probabilistic framework for inferring object properties from limited observations. The validity of the proposed method is shown through experimental results.
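The practical consequence of the nonparametric prior is that the number of categories is inferred rather than fixed. The Python sketch below samples assignments from a Chinese restaurant process, the building block underlying HDP-style models, to show how new categories open as data arrive; the concentration value is an illustrative assumption, and this is not the MHDP inference procedure itself.

import numpy as np

def sample_crp_assignments(n_items, alpha=1.0, rng=None):
    """Sample category assignments from a Chinese restaurant process prior.
    The number of distinct categories is not fixed in advance: each item joins
    an existing category with probability proportional to its current size, or
    opens a new category with probability proportional to alpha."""
    rng = rng or np.random.default_rng()
    counts = []            # counts[k] = number of items currently in category k
    assignments = []
    for _ in range(n_items):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)       # a new category is created
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# The number of inferred categories grows slowly with the data (roughly alpha * log n).
print(len(set(sample_crp_assignments(200, alpha=1.0))))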
Conference Paper
One major bottleneck in conversational systems is their inability to interpret unexpected user language inputs such as out-of-vocabulary words. To overcome this problem, conversational systems must be able to learn new words automatically during human-machine conversation. Motivated by psycholinguistic findings on eye gaze and human language processing, we are developing techniques to incorporate human eye gaze for automatic word acquisition in multimodal conversational systems. This paper investigates the use of temporal alignment between speech and eye gaze and the use of domain knowledge in word acquisition. Our experimental results indicate that eye gaze provides a potential channel for automatically acquiring new words. The use of extra temporal and domain knowledge can significantly improve acquisition performance.
Article
In this paper we propose a latent Dirichlet allocation (LDA)-based framework for multimodal categorization and word grounding by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints, as well as to listen to the sound during the observation period. This multimodal information is used for categorizing and forming multimodal concepts using multimodal LDA. At the same time, the words acquired during the observation period are connected to the related concepts, which are represented by the multimodal LDA. We also provide a relevance measure that encodes the degree of connection between words and modalities. The proposed algorithm is implemented on a robot platform, and experiments are carried out to evaluate it. We also demonstrate a simple conversation between a user and the robot based on the learned model.
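A rough sketch of the word-grounding step only (not the multimodal LDA itself): co-occurrence counts between heard words and formed categories, normalized with a small smoothing constant, give a simple p(word | category) table. The function name and smoothing value below are assumptions for illustration.

from collections import defaultdict

def ground_words_to_categories(word_category_pairs, smoothing=0.1):
    """Estimate p(word | category) from word/category co-occurrence counts
    collected while the robot hears utterances during object observation."""
    counts = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for word, category in word_category_pairs:
        counts[category][word] += 1.0
        vocabulary.add(word)
    tables = {}
    for category, word_counts in counts.items():
        total = sum(word_counts.values()) + smoothing * len(vocabulary)
        tables[category] = {w: (word_counts.get(w, 0.0) + smoothing) / total
                            for w in vocabulary}
    return tables

# Example: the word "cup" becomes strongly associated with category 0.
tables = ground_words_to_categories([("cup", 0), ("cup", 0), ("ball", 1)])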
Article
To tackle the vocabulary problem in conversational systems, previous work has applied unsupervised learning approaches to co-occurring speech and eye gaze during interaction to automatically acquire new words. Although these approaches have shown promise, several issues related to human language behavior and human-machine conversation have not been addressed. First, psycholinguistic studies have shown certain temporal regularities between human eye movement and language production. While these regularities can potentially guide the acquisition process, they have not been incorporated in the previous unsupervised approaches. Second, conversational systems generally have an existing knowledge base about the domain and vocabulary. While the existing knowledge can potentially help bootstrap and constrain the acquired new words, it has not been incorporated in the previous models. Third, eye gaze could serve different functions in human-machine conversation. Some gaze streams may not be closely coupled with the speech stream, and thus are potentially detrimental to word acquisition. Automated recognition of closely coupled speech-gaze streams based on conversation context is important. To address these issues, we developed new approaches that incorporate user language behavior, domain knowledge, and conversation context in word acquisition. We evaluated these approaches in the context of situated dialogue in a virtual world. Our experimental results have shown that incorporating the above three types of contextual information significantly improves word acquisition performance.
Article
RatSLAM is a biologically inspired visual SLAM and navigation system that has been shown to be effective indoors and outdoors on real robots. The spatial representation at the core of RatSLAM, the experience map, forms in a distributed fashion as the robot learns the environment. The activity in RatSLAM's experience map possesses some geometric properties, but still does not represent the world in a human-readable form. A new system, dubbed RatChat, has been introduced to enable meaningful communication with the robot. The intention is to use the "language games" paradigm to build spatial concepts that can be used as the basis for communication. This paper describes the first step in the language game experiments, showing the potential for meaningful categorization of the spatial representations in RatSLAM.
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. These matrices are generated from the corresponding partitions using various scoring rules. Derivable special cases include traditionally familiar statistics and ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between -1 and +1.
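The chance-corrected index described here can be computed directly from the contingency table of the two partitions. The Python sketch below implements the standard Hubert-Arabie adjusted Rand index from that description; it is a generic implementation, not the paper's own notation or the triple-based measure it additionally proposes.

from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions of the same
    items: 1 for identical partitions, approximately 0 in expectation under
    random partitions with the same cluster sizes."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Pair counts from the contingency table and from each partition's margins.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions up to label permutation score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))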
Fig. 12. States of particles: before the teaching utterance (top); after the teaching utterance (bottom). The uttered sentence is "kokowa ** dayo" (meaning "Here is **"), where "**" is the name of the place.
Computational aspects of sequential Monte Carlo filter and smoother
  • G Kitagawa
G. Kitagawa, "Computational aspects of sequential Monte Carlo filter and smoother," Annals of the Institute of Statistical Mathematics, vol. 66, no. 3, pp. 443–471, 2014.
Simulator platform that enables social interaction simulation –SIGVerse: SocioIntelliGenesis simulator–
  • T Inamura
  • T Shibata
  • H Sena
  • T Hashimoto
  • N Kawai
  • T Miyashita
  • Y Sakurai
  • M Shimizu
  • M Otake
  • K Hosoda
T. Inamura, T. Shibata, H. Sena, T. Hashimoto, N. Kawai, T. Miyashita, Y. Sakurai, M. Shimizu, M. Otake, K. Hosoda, et al., "Simulator platform that enables social interaction simulation –SIGVerse: SocioIntelliGenesis simulator–," in IEEE/SICE International Symposium on System Integration, 2010, pp. 212–217.
Robots that learn to converse: Developmental approach to situated language processing
  • N Iwahashi
  • R Taguchi
  • K Sugiura
  • K Funakoshi
  • M Nakano
N. Iwahashi, R. Taguchi, K. Sugiura, K. Funakoshi, and M. Nakano, "Robots that learn to converse: Developmental approach to situated language processing," in Proceedings of International Symposium on Speech and Language Processing, 2009, pp. 532-537.
Mutual learning of an object concept and language model based on MLDA and NPYLM
  • T Nakamura
  • T Nagai
  • K Funakoshi
  • S Nagasaka
  • T Taniguchi
  • N Iwahashi
T. Nakamura, T. Nagai, K. Funakoshi, S. Nagasaka, T. Taniguchi, and N. Iwahashi, "Mutual learning of an object concept and language model based on MLDA and NPYLM," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2014, pp. 600-607.
Mutual learning of an object concept and language model based on MLDA and NPYLM
  • T Nakamura