Article

Spatial Concept Acquisition for a Mobile Robot That Integrates Self-Localization and Unsupervised Word Discovery From Spoken Sentences


Abstract

In this paper, we propose a novel unsupervised learning method for the lexical acquisition of words related to the places a robot visits, from continuous human speech signals. We address the problem of a robot learning novel words with no prior knowledge of these words other than a primitive acoustic model. Further, we propose a method that allows a robot to effectively use the learned words and their meanings for self-localization tasks. The proposed method, the nonparametric Bayesian spatial concept acquisition method (SpCoA), integrates a generative model for self-localization with unsupervised word segmentation of uttered sentences via latent variables related to the spatial concept. We implemented SpCoA on SIGVerse, a simulation environment, and on TurtleBot2, a mobile robot in a real environment, and conducted experiments to evaluate its performance. The experimental results showed that SpCoA enabled the robot to acquire the names of places from spoken sentences. They also revealed that the robot could effectively utilize the acquired spatial concepts to reduce the uncertainty in its self-localization.
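
To make the abstract's integration concrete, the following is a minimal, self-contained sketch (Python, with toy numbers and names of our own choosing, not the authors' implementation) of the core idea: each spatial concept couples a Gaussian position distribution with a word distribution, so a recognized place word can re-weight Monte Carlo localization particles.

```python
# Minimal illustrative sketch (not the authors' implementation) of the idea in SpCoA:
# each spatial concept couples a Gaussian position distribution with a word
# distribution, and hearing a place word can re-weight self-localization particles.
import numpy as np

rng = np.random.default_rng(0)

# Assumed toy spatial concepts: mean/cov of the place region and P(word | concept).
concepts = [
    {"mu": np.array([1.0, 2.0]), "cov": 0.25 * np.eye(2), "words": {"kitchen": 0.9, "hall": 0.1}},
    {"mu": np.array([5.0, 0.5]), "cov": 0.25 * np.eye(2), "words": {"kitchen": 0.1, "hall": 0.9}},
]

def gauss_pdf(x, mu, cov):
    d = x - mu
    inv = np.linalg.inv(cov)
    norm = 1.0 / (2 * np.pi * np.sqrt(np.linalg.det(cov)))
    return norm * np.exp(-0.5 * d @ inv @ d)

def word_likelihood(x, word):
    # p(word | x) marginalized over concepts (uniform prior over concepts here).
    return sum(c["words"].get(word, 1e-6) * gauss_pdf(x, c["mu"], c["cov"]) for c in concepts)

# Monte Carlo localization particles (positions only, for brevity).
particles = rng.uniform(low=[0, 0], high=[6, 3], size=(500, 2))
weights = np.full(len(particles), 1.0 / len(particles))

# The user says "kitchen": re-weight and resample the particles.
weights *= np.array([word_likelihood(p, "kitchen") for p in particles])
weights /= weights.sum()
idx = rng.choice(len(particles), size=len(particles), p=weights)
particles = particles[idx]

print("posterior mean position after hearing 'kitchen':", particles.mean(axis=0))
```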


... Spatial concepts refer to the knowledge of the place categories autonomously formed based on the multimodal information of the spatial experience acquired by the robot in the environment according to [8][9][10][11][12][13]. Therefore, we believe that the spatial concept works well not only for this particular task but also for other tasks related to place and object placement. ...
... It is important to appropriately generalize and form place categories based on object positions while dealing with the uncertainty of the observations. To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word segmentation from speech signals as PGMs through the latent variables of spatial concepts. ...
... To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word segmentation from speech signals as PGMs through the latent variables of spatial concepts. Their methods improve the accuracy of self-localization and recognition of the place names in spoken sentences. ...
Article
Tidy-up tasks by service robots in home environments are challenging in robotics applications because they involve various interactions with the environment. In particular, robots are required not only to grasp, move, and release various home objects but also to plan the order and positions for placing the objects. In this paper, we propose a novel planning method that can efficiently estimate the order and positions of the objects to be tidied up by learning the parameters of a probabilistic generative model. The model allows a robot to learn the distributions of the co-occurrence probability of the objects and places to tidy up using the multimodal sensor information collected in a tidied environment. Additionally, we develop an autonomous robotic system to perform the tidy-up operation. We evaluate the effectiveness of the proposed method by an experimental simulation that reproduces the conditions of the Tidy Up Here task of the World Robot Summit 2018 international robotics competition. The simulation results show that the proposed method enables the robot to successively tidy up several objects and achieves the best task score among the considered baseline tidy-up methods.
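
As a rough illustration of the co-occurrence idea described above (the probabilities and object/place names are assumptions, not the paper's learned model), a placement plan can be read off by picking, for each object, the place with the highest learned co-occurrence probability:

```python
# Toy sketch (assumed data, not the paper's model): pick a tidy-up destination for
# each object by the learned object-place co-occurrence probabilities, and order
# objects by how confident that assignment is.
cooccurrence = {           # P(place | object), illustrative values only
    "cup":   {"shelf": 0.7, "sink": 0.25, "toybox": 0.05},
    "doll":  {"shelf": 0.2, "sink": 0.05, "toybox": 0.75},
    "plate": {"shelf": 0.3, "sink": 0.65, "toybox": 0.05},
}

def plan_tidy_up(objects):
    plan = []
    for obj in objects:
        place, prob = max(cooccurrence[obj].items(), key=lambda kv: kv[1])
        plan.append((obj, place, prob))
    # Tidy the most confident assignments first.
    return sorted(plan, key=lambda t: -t[2])

for obj, place, prob in plan_tidy_up(["plate", "doll", "cup"]):
    print(f"put {obj} on/in {place} (p={prob:.2f})")
```
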
... Spatial concepts refer to the knowledge of the place categories autonomously formed based on the multimodal information of the spatial experience acquired by the robot in the environment according to [8][9][10][11][12][13]. Therefore, we believe that the spatial concept works well not only for this particular task but also for other tasks related to place and object placement. ...
... It is important to appropriately generalize and form place categories based on object positions while dealing with the uncertainty of the observations. To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word-segmentation from speech signals as PGMs through the latent variables of spatial concepts. ...
... To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word-segmentation from speech signals as PGMs through the latent variables of spatial concepts. Their methods improve the accuracy of self-localization and recognition of the place names in spoken sentences. ...
Preprint
Full-text available
Tidy-up tasks by service robots in home environments are challenging robotics applications because they involve various interactions with the environment. In particular, robots are required not only to grasp, move, and release various home objects, but also to plan the order and positions for putting them away. In this paper, we propose a novel planning method that can efficiently estimate the order and positions of the objects to be tidied up by learning the parameters of a probabilistic generative model. The model allows the robot to learn the distributions of co-occurrence probability of objects and places to tidy up by using multimodal sensor information collected in a tidied environment. Additionally, we develop an autonomous robotic system to perform the tidy-up operation. We evaluate the effectiveness of the proposed method in an experimental simulation that reproduces the conditions of the Tidy Up Here task of the World Robot Summit international robotics competition. The simulation results showed that the proposed method enables the robot to successively tidy up several objects and achieves the best task score compared to baseline tidy-up methods.
... This simple visual recognition-based approach also has the same problems. Spatial concept formation methods have been developed to enable robots to acquire place-related words as well as estimate categories and regions (Ishibushi et al., 2015; Taniguchi et al., 2016b, 2017, 2018). These methods can estimate the number of categories using the Dirichlet process (Teh et al., 2005). ...
... These methods can estimate the number of categories using the Dirichlet process (Teh et al., 2005). Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA). However, although spatial concept formation methods can acquire unknown words and deal with multimodal information, including the image features typically extracted using CNNs, they cannot semantically segment a map appropriately because the position distributions corresponding to semantic categories are modeled by Gaussian distributions. ...
... SpCoMapping integrates probabilistic spatial concept acquisition (Ishibushi et al., 2015;Taniguchi et al., 2016b) and SLAM via an MRF to generate a map of semantic information. It solves the overwrite problem by assigning each cell of the semantic map a probabilistic variable. ...
Article
Full-text available
An autonomous robot performing tasks in a human environment needs to recognize semantic information about places. Semantic mapping is a task in which suitable semantic information is assigned to an environmental map so that a robot can communicate with people and appropriately perform tasks requested by its users. We propose a novel statistical semantic mapping method called SpCoMapping, which integrates probabilistic spatial concept acquisition based on multimodal sensor information and a Markov random field applied for learning the arbitrary shape of a place on a map. SpCoMapping can connect multiple words to a place in the semantic mapping process using user utterances, without pre-setting the list of place names. We also develop a nonparametric Bayesian extension of SpCoMapping that can automatically estimate an adequate number of categories. In experiments in simulation environments, we showed that the proposed method generated better semantic maps than previous semantic mapping methods; our semantic maps have categories and shapes similar to the ground truth provided by the user. In addition, we showed that SpCoMapping could generate appropriate semantic maps in a real-world environment.
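
A minimal sketch of the Markov-random-field ingredient mentioned in the abstract, assuming toy per-cell likelihoods: each grid cell carries a semantic label, and a few iterated-conditional-modes sweeps trade off the cell's own observation likelihood against agreement with its neighbors. This is a generic MRF smoothing pass, not the SpCoMapping inference itself.

```python
# Illustrative MRF-style smoothing of per-cell semantic labels (ICM), in the spirit
# of assigning each map cell a label variable; the likelihoods are toy values.
import numpy as np

rng = np.random.default_rng(1)
H, W, K = 20, 20, 3                 # grid size and number of place categories
loglik = rng.normal(size=(H, W, K)) # assumed per-cell log-likelihoods from observations
labels = loglik.argmax(axis=2)      # initialize with the per-cell best label
beta = 0.8                          # strength of the neighbor-agreement (smoothness) term

for _ in range(5):                  # a few ICM sweeps
    for i in range(H):
        for j in range(W):
            neigh = [labels[i2, j2] for i2, j2 in ((i-1, j), (i+1, j), (i, j-1), (i, j+1))
                     if 0 <= i2 < H and 0 <= j2 < W]
            scores = [loglik[i, j, k] + beta * sum(n == k for n in neigh) for k in range(K)]
            labels[i, j] = int(np.argmax(scores))

print("label counts after smoothing:", np.bincount(labels.ravel(), minlength=K))
```
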
... In addition, the proposed unsupervised Bayesian generative model integrates multimodal place categorization, lexical acquisition, and SLAM. Taniguchi et al. [7], [8] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) using an unsupervised word-segmentation method latticelm [9] and SpCoA++, which enables highly accurate lexical acquisition by updating the language model. Isobe et al. [10] proposed a learning method for the relationship between objects and places using image features obtained by the convolutional neural network (CNN) [11]. ...
... Isobe et al. [10] proposed a learning method for the relationship between objects and places using image features obtained by the convolutional neural network (CNN) [11]. However, these methods [7], [8], [10] cannot learn spatial concepts in unknown environments without a map because they rely on batch-learning algorithms. Therefore, in previous work we developed an online algorithm, SpCoSLAM [6], that can sequentially learn a map and spatial concepts integrating positions, speech signals, and scene images. ...
... In addition, we performed the evaluations of place categorization and lexical acquisition related to places. We compared the performances of the following methods:
(A) SpCoSLAM [6]
(B) SpCoSLAM with AW (Section II-C.1)
(C) SpCoSLAM with AW + WS (Section II-C.2)
(D) SpCoSLAM 2.0 (FLR-i_t, C_t)
(E) SpCoSLAM 2.0 (RS)
(F) SpCoSLAM 2.0 (FLR-i_t, C_t + RS)
(G) SpCoSLAM 2.0 (FLR-i_t, C_t, S_t + SBU)
(H) SpCoA (Batch learning) [7]
(I) SpCoA++ (Batch learning) [8]
Methods (A)-(C) performed the conventional and modified SpCoSLAM algorithms. Methods (D)-(F) performed the proposed improved algorithms in different conditions. ...
Preprint
We propose a novel online learning algorithm, called SpCoSLAM 2.0, for spatial concepts and lexical acquisition with high accuracy and scalability. Previously, we proposed SpCoSLAM as an online learning algorithm based on an unsupervised Bayesian probabilistic model that integrates multimodal place categorization, lexical acquisition, and SLAM. However, our previous algorithm had limited estimation accuracy owing to the influence of the early stages of learning, and its computational complexity increased with added training data. Therefore, we introduce techniques such as fixed-lag rejuvenation to reduce the calculation time while maintaining an accuracy higher than that of the previous algorithm. The results show that, in terms of estimation accuracy, the proposed algorithm exceeds the previous algorithm and is comparable to batch learning. In addition, the calculation time of the proposed algorithm does not depend on the amount of training data and remains constant for each step, making the algorithm scalable. Our approach will contribute to the realization of long-term spatial language interactions between humans and robots.
... Taniguchi et al. [4] proposed a method that integrated ambiguous speech-recognition results with the self-localization method for learning spatial concepts. In addition, Taniguchi et al. [5] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) based on an unsupervised word-segmentation method known as latticelm [6]. On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. ...
... On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. These methods [4], [5], [7] cannot cope with changes in the names of places and the environment because these methods use batch learning algorithms. In addition, these methods cannot learn spatial concepts from unknown environments without a map, i.e., the robot needs to have a map generated by SLAM beforehand. ...
... Araki et al. [14] performed a pseudo-online algorithm using the nested Pitman-Yor language model (NPYLM) [15]. However, these studies [5], [14] have reported that word segmentation of speech recognition results including errors causes over-segmentation [16]. In this paper, we will improve the accuracy of speech recognition by updating the language models sequentially. ...
Conference Paper
Full-text available
In this paper, we propose an online learning algorithm based on a Rao-Blackwellized particle filter for spatial concept acquisition and mapping. We previously proposed a nonparametric Bayesian spatial concept acquisition model (SpCoA). Here, we propose a novel method (SpCoSLAM) integrating SpCoA and FastSLAM in the theoretical framework of a Bayesian generative model. The proposed method can simultaneously learn place categories and lexicons while incrementally generating an environmental map. Furthermore, the proposed method adds scene image features and a language model to SpCoA. In the experiments, we tested online learning of spatial concepts and environmental maps in a novel environment for which the robot had no map. We then evaluated the results of online learning of spatial concepts and lexical acquisition. The experimental results demonstrated that, using the proposed method, the robot was able to more accurately and incrementally learn the relationships between words and places in the environmental map.
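
The following is a schematic, heavily simplified particle-filter step in the spirit of the online learning described above (the likelihood form and all values are assumptions, not the SpCoSLAM equations): each particle carries a pose together with its own running statistics of where each place word was heard, and the word observed at the current step multiplies into the particle weights.

```python
# Schematic online-learning step (an assumed simplification, not the SpCoSLAM update):
# each particle keeps its own running statistics of where each place word was heard,
# and its weight is multiplied by how well the new observation fits those statistics.
import numpy as np

rng = np.random.default_rng(2)

class Particle:
    def __init__(self, pose):
        self.pose = np.array(pose, dtype=float)
        self.word_stats = {}          # word -> (count, sum of positions)

    def word_likelihood(self, word, sigma=0.7):
        if word not in self.word_stats:
            return 0.1                # assumed base likelihood for a new word
        n, s = self.word_stats[word]
        mu = s / n                    # running mean position for this word
        return float(np.exp(-0.5 * np.sum((self.pose - mu) ** 2) / sigma ** 2)) + 1e-9

    def update_word(self, word):
        n, s = self.word_stats.get(word, (0, np.zeros(2)))
        self.word_stats[word] = (n + 1, s + self.pose)

particles = [Particle(rng.uniform(0.0, 4.0, size=2)) for _ in range(200)]
weights = np.ones(len(particles))

# A few teaching steps: the robot moves, hears "kitchen", weights and updates particles.
for _ in range(3):
    for p in particles:
        p.pose += rng.normal(scale=0.05, size=2)        # motion update (toy)
    weights *= np.array([p.word_likelihood("kitchen") for p in particles])
    weights /= weights.sum()
    for p in particles:
        p.update_word("kitchen")                        # per-particle statistics update

best = particles[int(np.argmax(weights))]
print("best particle pose:", best.pose, "kitchen count:", best.word_stats["kitchen"][0])
```
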
... Spatial concept models have been proposed to enable a robot to learn the knowledge of places based on the user's linguistic instructions and the robot's observations through on-site learning [14][15][16]. The spatial concept models learn spatial categories from the user's linguistic instructions and the robot's observations (i.e., position, language, and vision) through unsupervised learning based on a probabilistic generative model and Bayesian inference in each home environment. ...
... The spatial concept is the categorical knowledge of a place formed from multimodal information, e.g., location names, visual features, and spatial areas on a map. Taniguchi et al. proposed a nonparametric spatial concept acquisition (SpCoA) [14] that learns the spatial concept (C t ) and spatial region (R t ) from the self-position (x t ) obtained using Monte Carlo localization (MCL) [17] and a linguistic instruction from the user as a bag of words (w t ) (Fig. A1). Fig. A1 shows the graphical model of SpCoA. ...
Preprint
This paper proposes a hierarchical Bayesian model based on spatial concepts that enables a robot to transfer the knowledge of places from experienced environments to a new environment. The transfer of knowledge based on spatial concepts is modeled as the calculation process of the posterior distribution based on the observations obtained in each environment with the parameters of spatial concepts generalized to environments as prior knowledge. We conducted experiments to evaluate the generalization performance of spatial knowledge for general places such as kitchens and the adaptive performance of spatial knowledge for unique places such as `Emma's room' in a new environment. In the experiments, the accuracies of the proposed method and conventional methods were compared in the prediction task of location names from an image and a position, and the prediction task of positions from a location name. The experimental results demonstrated that the proposed method has a higher prediction accuracy of location names and positions than the conventional method owing to the transfer of knowledge.
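
A toy Dirichlet-multinomial version of the transfer idea sketched in the abstract (the counts and category names are invented for illustration): word-category co-occurrence counts accumulated in experienced environments act as a prior, which a handful of observations in the new environment then updates.

```python
# Toy Dirichlet-multinomial transfer (assumed form, for illustration only): word-category
# counts from experienced environments act as a prior that is updated with the handful
# of observations made in the new environment.
import numpy as np

categories = ["kitchen", "bedroom", "entrance"]
# Pseudo-counts accumulated in previous environments (generalized knowledge).
prior_counts = np.array([
    [40.0, 1.0, 1.0],    # word "kitchen"
    [1.0, 35.0, 2.0],    # word "bedroom"
    [2.0, 1.0, 30.0],    # word "entrance"
])
# A few observations in the new environment (rows: same words, columns: categories).
new_counts = np.array([
    [3.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 2.0],
])

posterior = prior_counts + new_counts
p_category_given_word = posterior / posterior.sum(axis=1, keepdims=True)
for w, row in zip(["kitchen", "bedroom", "entrance"], p_category_given_word):
    print(w, dict(zip(categories, row.round(3))))
```
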
... SpCoMapping is an extended method of spatial concept formation [9], [10], [11], [12], using a Markov random field (MRF) for semantic mapping [2]. SpCoMapping learns the vocabulary representing a place and a region simultaneously, taking into account the shapes of the environment and obstacles. ...
... (A) SpCoMapGAN (proposed) (B) SpCoMapping [2] (C) SpCoA [11] In the experiment, a human first moved the robot in the simulation environment while it collected images and positions. The words were sent directly to the robot using Robot Operating System (ROS) [26] topics, assuming a state in which speech recognition could be performed reliably. ...
Conference Paper
In semantic mapping, which connects semantic information to an environment map, it is a challenging task for robots to deal with both local and global information of environments. In addition, it is important to estimate semantic information of unobserved areas from already acquired partial observations in a newly visited environment. On the other hand, previous studies on spatial concept formation enabled a robot to relate multiple words to places from bottom-up observations even when the vocabulary was not provided beforehand. However, the robot could not transfer global information related to the room arrangement between semantic maps from other environments. In this paper, we propose SpCoMapGAN, which generates the semantic map in a newly visited environment by training an inference model using previously estimated semantic maps. SpCoMapGAN uses generative adversarial networks (GANs) to transfer semantic information based on room arrangements to a newly visited environment. Our proposed method assigns semantics to the map of an unknown environment using the prior distribution of the map trained in known environments and the multimodal observations made in the unknown environment. We experimentally show in simulation that SpCoMapGAN can use global information for estimating the semantic map and is superior to previous methods. Finally, we also demonstrate in a real environment that SpCoMapGAN can accurately 1) deal with local information, and 2) acquire the semantic information of real places.
... The learning procedure for each step is described in Appendix 2. The variables of the joint posterior distribution can be learned by Gibbs sampling, which is a Markov chain Monte-Carlo-based batch learning algorithm, in a manner similar to the nonparametric Bayesian spatial concept acquisition method (SpCoA) [29]. In addition, it is also possible to perform learning in a spatial concept formation model after the map is generated via any other SLAM. ...
... For each place, on average, 15 training datasets were provided. The latent variables C t and i t were assumed to be almost accurately estimated, and model parameters of the spatial concept were obtained via Gibbs sampling, similar to [29]. The visual features of a camera-acquired image were not treated. ...
Article
Full-text available
Robots are required not only to learn spatial concepts autonomously but also to utilize such knowledge for various tasks in a domestic environment. A spatial concept represents a multimodal place category acquired from the robot's spatial experience, including vision, speech-language, and self-position. The aim of this study is to enable a mobile robot to perform navigational tasks from human speech instructions, such as ‘Go to the kitchen’, via probabilistic inference on a Bayesian generative model using spatial concepts. Specifically, path planning was formalized as the maximization of a probabilistic distribution on the path-trajectory under the speech instruction, based on a control-as-inference framework. Furthermore, we described the relationship between probabilistic inference based on the Bayesian generative model and the control problem, including reinforcement learning. We demonstrated path planning based on human instruction using acquired spatial concepts to verify the usefulness of the proposed approach in the simulator and in real environments. Experimentally, places instructed by the user's speech commands showed high probability values, and the trajectory toward the target place was correctly estimated. Our approach, based on probabilistic inference concerning decision-making, can lead to further improvement in robot autonomy.
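
A drastically simplified sketch of using acquired spatial concepts for navigation (not the paper's control-as-inference derivation; the Gaussian for the instructed word and the map are toy assumptions): score grid cells by the position distribution of the instructed place word, take the best free cell as the goal, and plan with an ordinary shortest-path search.

```python
# Simplified sketch (not the paper's formulation): score grid cells by the position
# distribution of the instructed place word, take the best free cell as the goal,
# and plan a path with breadth-first search on the occupancy grid.
import numpy as np
from collections import deque

H, W = 15, 15
occupancy = np.zeros((H, W), dtype=bool)
occupancy[7, 2:12] = True                      # a wall with gaps at both ends

mu, sigma = np.array([12.0, 12.0]), 1.5        # assumed Gaussian for the word "kitchen"
ys, xs = np.mgrid[0:H, 0:W]
score = np.exp(-((ys - mu[0]) ** 2 + (xs - mu[1]) ** 2) / (2 * sigma ** 2))
score[occupancy] = 0.0
goal = tuple(int(v) for v in np.unravel_index(np.argmax(score), score.shape))

def bfs_path(start, goal):
    prev, queue = {start: None}, deque([start])
    while queue:
        cur = queue.popleft()
        if cur == goal:
            break
        for d in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + d[0], cur[1] + d[1])
            if 0 <= nxt[0] < H and 0 <= nxt[1] < W and not occupancy[nxt] and nxt not in prev:
                prev[nxt] = cur
                queue.append(nxt)
    path, node = [], goal
    while node is not None:
        path.append(node)
        node = prev.get(node)
    return path[::-1]

path = bfs_path((0, 0), goal)
print("goal cell:", goal, "path length:", len(path))
```
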
... Other approaches take a probabilistic perspective on concept learning, similar to Lake et al. (2015), but focussing on the domain of robotics. Concepts are learned through unsupervised online learning algorithms, combining multi-modal data streams (most often perceptual data and raw speech data) through statistical approaches such as Bayesian generative models or latent semantic analysis (Nakamura et al., 2007; Aoki et al., 2016; Taniguchi et al., 2016, 2017). Through this integration of data streams, the acquired concepts constitute mappings between words and objects, as studied by Nakamura et al. (2007) and Aoki et al. (2016), or between words and spatial locations, as studied by Taniguchi et al. (2016, 2017). ...
... Concepts are learned through unsupervised online learning algorithms, combining multi-modal data streams (most often perceptual data and raw speech data) through statistical approaches such as Bayesian generative models or latent semantic analysis (Nakamura et al., 2007; Aoki et al., 2016; Taniguchi et al., 2016, 2017). Through this integration of data streams, the acquired concepts constitute mappings between words and objects, as studied by Nakamura et al. (2007) and Aoki et al. (2016), or between words and spatial locations, as studied by Taniguchi et al. (2016, 2017). The latter further used these concepts to aid a mobile robot in generating a map of the environment without any prior information. ...
Article
Full-text available
Autonomous agents perceive the world through streams of continuous sensori-motor data. Yet, in order to reason and communicate about their environment, agents need to be able to distill meaningful concepts from their raw observations. Most current approaches that bridge between the continuous and symbolic domain are using deep learning techniques. While these approaches often achieve high levels of accuracy, they rely on large amounts of training data, and the resulting models lack transparency, generality, and adaptivity. In this paper, we introduce a novel methodology for grounded concept learning. In a tutor-learner scenario, the method allows an agent to construct a conceptual system in which meaningful concepts are formed by discriminative combinations of prototypical values on human-interpretable feature channels. We evaluate our approach on the CLEVR dataset, using features that are either simulated or extracted using computer vision techniques. Through a range of experiments, we show that our method allows for incremental learning, needs few data points, and that the resulting concepts are general enough to be applied to previously unseen objects and can be combined compositionally. These properties make the approach well-suited to be used in robotic agents as the module that maps from continuous sensory input to grounded, symbolic concepts that can then be used for higher-level reasoning tasks.
... The learning procedure for each step is described in Appendix B. The variables of the joint posterior distribution can be learned by Gibbs sampling, which is a Markov chain Monte-Carlobased batch learning algorithm, in a manner similar to the nonparametric Bayesian spatial concept acquisition method (SpCoA) [27]. ...
... For each place, on average, 15 training datasets were provided. The latent variables C t and i t were assumed to be almost accurately estimated, and model parameters of the spatial concept were obtained via Gibbs sampling, similar to [27]. The word dictionary was provided in advance. ...
Preprint
Robots are required not only to learn spatial concepts autonomously but also to utilize such knowledge for various tasks in a domestic environment. A spatial concept represents a multimodal place category acquired from the robot's spatial experience, including vision, speech-language, and self-position. The aim of this study is to enable a mobile robot to perform navigational tasks from human speech instructions, such as `Go to the kitchen', via probabilistic inference on a Bayesian generative model using spatial concepts. Specifically, path planning was formalized as the maximization of a probabilistic distribution on the path-trajectory under the speech instruction, based on a control-as-inference framework. Furthermore, we described the relationship between probabilistic inference based on the Bayesian generative model and the control problem, including reinforcement learning. We demonstrated path planning based on human instruction using acquired spatial concepts to verify the usefulness of the proposed approach in the simulator and in real environments. Experimentally, places instructed by the user's speech commands showed high probability values, and the trajectory toward the target place was correctly estimated. Our approach, based on probabilistic inference concerning decision-making, can lead to further improvement in robot autonomy.
... A further advancement of such cognitive systems allows the robots to find meanings of words by treating a linguistic input as another modality [13][14][15]. Cognitive models have recently become more complex in realizing various cognitive capabilities: grammar acquisition [16], language model learning [17], hierarchical concept acquisition [18,19], spatial concept acquisition [20], motion skill acquisition [21], and task planning [7] (see Fig. 1). It results in an increase in the development cost of each cognitive system. ...
... Theoretical and empirical validations should be applied for further applications. So far, many researchers, including the authors, have proposed numerous cognitive models for robots: object concept formation based on appearance, usage, and function [41], formation of integrated concepts of objects and motions [42], grammar learning [16], language understanding [43], spatial concept formation and lexical acquisition [8,20,44], simultaneous phoneme and word discovery [45][46][47], and cross-situational learning [48,49]. These models are regarded as integrative models constructed by combining small-scale models. ...
Article
Full-text available
This paper describes a framework for the development of an integrative cognitive system based on probabilistic generative models (PGMs) called Neuro-SERKET. Neuro-SERKET is an extension of SERKET, which can compose elemental PGMs developed in a distributed manner and provide a scheme that allows the composed PGMs to learn throughout the system in an unsupervised way. In addition to the head-to-tail connection supported by SERKET, Neuro-SERKET supports tail-to-tail and head-to-head connections, as well as neural network-based modules, i.e., deep generative models. As an example of a Neuro-SERKET application, an integrative model was developed by composing a variational autoencoder (VAE), a Gaussian mixture model (GMM), latent Dirichlet allocation (LDA), and automatic speech recognition (ASR). The model is called VAE + GMM + LDA + ASR. The performance of VAE + GMM + LDA + ASR and the validity of Neuro-SERKET were demonstrated through a multimodal categorization task using image data and a speech signal of numerical digits.
... (1) A deployment-ready implementation of a hierarchical spatial concept formation method, Spatial Concepts (SpCo) [4][5][6][7], based on a Bayesian generative model that relies on multimodal observations from image features, self-location, and word information. This method allows SpCo-enabled robots to learn spatial concepts, such as places and objects, and associate them with natural language commands, without supervision and with the same level of abstraction that humans would use depending on the context and the environment. ...
... Our SpCo implementation for customer interaction is the result of several previous works. At the origin is a nonparametric Bayesian Spatial Concepts Acquisition (SpCoA) [4] that could deal with the learning of unknown words by using unsupervised stochastic word segmentation. It was followed by SpCoSLAM [5] that achieved online learning of spatial concepts, language model, and map, based on a particle filter. ...
Article
Human–robot interaction during general service tasks in home or retail environments has proven challenging, partly because (1) robots lack high-level context-based cognition and (2) humans cannot intuit the perception state of robots as they can for other humans. To solve these two problems, we present a complete robot system, which received the highest evaluation score at the Customer Interaction Task of the Future Convenience Store Challenge at the World Robot Summit 2018, and which implements several key technologies: (1) hierarchical spatial concept formation for general robot task planning and (2) a mixed reality interface that enables users to intuitively visualize the current state of the robot's perception and naturally interact with it. The results obtained during the competition indicate that the proposed system allows both non-expert operators and end users to achieve human–robot interactions in customer service environments. Furthermore, we describe a detailed scenario including employee operation and customer interaction, which serves as a set of requirements for service robots and a road map for development. The system integration and task scenario described in this paper should be helpful for groups facing customer interaction challenges and looking for a successfully deployed base to build on.
... Symbol emergence through a robot's own interactive exploration has been explored in many setups for decades [2], [4]- [14]. In previous studies, researchers had shown that robots can form object categories, obtain motor primitives, and learn vocabularies from sensorimotor interaction with their environment without any direct human intervention. ...
... There have been many related studies. For example, Taniguchi introduced the idea of spatial concept and built a machine learning system to enable a robot to form place categories and learn place names [14], [74]. Mangin proposed multimodal concept formation using a non-negative matrix factorization method [75]. ...
Article
Full-text available
Symbol emergence through a robot's own interactive exploration of the world without human intervention has been investigated now for several decades. However, methods that enable a machine to form symbol systems in a robust bottom-up manner are still missing. Clearly, this shows that we still do not have an appropriate computational understanding that explains symbol emergence in biological and artificial systems. Over the years it became more and more clear that symbol emergence has to be posed as a multi-faceted problem. Therefore, we will first review the history of the symbol emergence problem in different fields showing their mutual relations. Then we will describe recent work and approaches to solve this problem with the aim of providing an integrative and comprehensive overview of symbol emergence for future research.
... Cross-situational learning and visually grounded acoustic unit discovery have been explored in the fields of speech and computational linguistics [22]- [26]. In robotics, unsupervised word discovery methods have been developed to achieve online lexical acquisition and overcome the out-of-vocabulary problem [19]- [21], [27]- [30]. As a result, considering multiple cues is crucial for the development of an unsupervised phone and word discovery model. ...
Article
Full-text available
Word and phone discovery are important tasks in the language development of human infants. Infants acquire words and phones from unsegmented speech signals using segmentation cues such as distributional, prosodic, and co-occurrence information. Many pre-existing computational models designed to represent this process tend to focus on distributional or prosodic cues. In this study, we propose a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (prosodic HDP-HLM), designed to perform simultaneous phone and word discovery from continuous speech signals encoded as time-series data that may exhibit a double articulation structure. Prosodic HDP-HLM, as an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We further propose a prosodic double articulation analyzer (Prosodic DAA) based on an inference procedure derived for prosodic HDP-HLM. We conducted three experiments on different types of datasets, including Japanese vowel sequences, utterances for teaching object names and features, and utterances following Zipf's law, and the results demonstrated the validity of the proposed method. The results show that the Prosodic DAA successfully used prosodic cues and was able to discover words directly from continuous human speech using distributional and prosodic information in an unsupervised manner, outperforming a method that solely used distributional cues. In contrast, the phone discovery performance did not improve. We also show that prosodic cues contributed more to word discovery performance when the word frequency was distributed more naturally, i.e., following Zipf's law.
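
For intuition about the distributional cue alone, here is a toy Viterbi segmentation of an unsegmented phone string under a unigram word model with a length penalty for unknown substrings; the lexicon is assumed, and the actual models discussed above (HDP-HLM, NPYLM, latticelm) are far richer.

```python
# Toy Viterbi segmentation of an unsegmented phone string under a unigram word model
# (a drastic simplification of NPYLM / HDP-HLM style word discovery; lexicon is assumed).
import math

lexicon = {"ki": 0.15, "tchen": 0.1, "kitchen": 0.2, "go": 0.2, "to": 0.2, "the": 0.15}
UNK_PER_CHAR = 1e-4          # penalty for substrings not in the lexicon

def word_logprob(w):
    return math.log(lexicon.get(w, UNK_PER_CHAR ** len(w)))

def segment(s, max_len=8):
    best = [(-math.inf, -1)] * (len(s) + 1)   # best[j] = (score, split point) for s[:j]
    best[0] = (0.0, -1)
    for j in range(1, len(s) + 1):
        for i in range(max(0, j - max_len), j):
            cand = best[i][0] + word_logprob(s[i:j])
            if cand > best[j][0]:
                best[j] = (cand, i)
    words, j = [], len(s)
    while j > 0:
        i = best[j][1]
        words.append(s[i:j])
        j = i
    return words[::-1]

print(segment("gotothekitchen"))   # expected: ['go', 'to', 'the', 'kitchen']
```
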
... Specifically, it is important to appropriately generalize and form place categories while dealing with observation uncertainties. To address these issues, PGMs for spatial concept formation have been constructed (Hagiwara, Inoue, Kobayashi, & Taniguchi, 2018; Katsumata, Taniguchi, Hafi, Hagiwara, & Taniguchi, 2020; Taniguchi, Hagiwara, Taniguchi, & Inamura, 2017; Taniguchi, Taniguchi, & Inamura, 2016). Taniguchi et al. (2017) proposed the spatial concept formation using SLAM (SpCoSLAM), that is, place categorization and mapping through unsupervised online learning from multimodal observation. ...
Article
Building a human-like integrative artificial cognitive system, that is, an artificial general intelligence (AGI), is the holy grail of the artificial intelligence (AI) field. Furthermore, a computational model that enables an artificial system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes an approach to develop a cognitive architecture by integrating elemental cognitive modules to enable the training of the modules as a whole. This approach is based on two ideas: (1) brain-inspired AI, learning human brain architecture to build human-level intelligence, and (2) a probabilistic generative model (PGM)-based cognitive architecture to develop a cognitive system for developmental robots by integrating PGMs. The proposed development framework is called a whole brain PGM (WB-PGM), which differs fundamentally from existing cognitive architectures in that it can learn continuously through a system based on sensory-motor information. In this paper, we describe the rationale for WB-PGM, the current status of PGM-based elemental cognitive modules, their relationship with the human brain, the approach to the integration of the cognitive modules, and future challenges. Our findings can serve as a reference for brain studies. As PGMs describe explicit informational relationships between variables, WB-PGM provides interpretable guidance from computational sciences to brain science. By providing such information, researchers in neuroscience can provide feedback to researchers in AI and robotics on what the current models lack with reference to the brain. Further, it can facilitate collaboration among researchers in neuro-cognitive sciences as well as AI and robotics.
... Miyazawa et al. constructed a PGM that combines MLDA with a cognitive module that learns grammatical knowledge [25]. Taniguchi et al. proposed spatial concept formation methods by applying a similar idea to multimodal information obtained by a mobile robot, including positional information [26][27][28]. Hagiwara et al. also proposed a hierarchical spatial category formation model that applies hierarchical MLDA to the inference of a hierarchical structure in spatial categories [29]. ...
Article
Full-text available
This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
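
As a generic illustration of the MCMC-based naming game mentioned in the abstract (toy distributions, not the exact Inter-MDM sampler): the listener accepts the speaker's proposed sign with a Metropolis-Hastings-style probability computed from its own category beliefs.

```python
# Generic illustration (toy distributions, not the exact Inter-MDM sampler) of an
# MCMC naming game: the listener accepts the speaker's proposed sign with a
# Metropolis-Hastings-style probability based on its own category beliefs.
import numpy as np

rng = np.random.default_rng(3)
signs = ["wa", "mo", "ku"]

# P(sign | category) for the listener; assumed toy values for two categories.
listener_sign_given_cat = np.array([
    [0.7, 0.2, 0.1],    # category 0
    [0.1, 0.2, 0.7],    # category 1
])

def mh_accept(proposed, current, cat):
    p_new = listener_sign_given_cat[cat, proposed]
    p_old = listener_sign_given_cat[cat, current]
    return rng.random() < min(1.0, p_new / p_old)

current_sign = 1                     # listener's current sign index for category 0
for step in range(5):
    proposed = rng.integers(len(signs))          # speaker's proposal (toy: uniform)
    if mh_accept(proposed, current_sign, cat=0):
        current_sign = proposed
    print(f"step {step}: proposal {signs[proposed]}, kept {signs[current_sign]}")
```
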
... The purpose is to endow a robot with the ability to model its knowledge about the environment and to acquire new knowledge as it arises, in a kind of self-consciousness process. In [166], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from continuous human speech signals. ...
Thesis
The employment of personal or service robots has aroused much interest in recent years, with impressive growth of robotics in different domains. The design of companion robots able to assist, share with, and accompany individuals with limited autonomy in their daily life is the challenge of the coming decade. However, the performance of today's robotic bodies and prototypes remains far from meeting this challenge. Although sophisticated humanoid robots have been developed, much more effort is needed to improve their cognitive capabilities. Indeed, the above-mentioned commercially available robots and prototypes are still not able to naturally adapt themselves to the complex environment in which they are supposed to evolve with humans. In the same way, existing prototypes are not able to interact in a versatile way with their users. In fact, they are still far from interpreting the diversity and complexity of perceived information or constructing knowledge about the surrounding environment. The development of bio-inspired approaches based on artificial cognition for perception and the autonomous acquisition of knowledge in robotics is a feasible strategy to overcome these limitations. A number of advances have already led to the realization of an artificial-cognition-based system allowing a robot to learn and create knowledge from observation (the association of sensory information and natural semantics). Within this context, the present work takes advantage of an evolutionary process for the semantic interpretation of sensory information to make machine awareness of the surrounding environment emerge. The main purpose of the doctoral thesis is to extend the already accomplished research in order to allow a robot to extract, construct, and conceptualize knowledge about its surrounding environment. Indeed, the goal of the doctoral research is to generalize the aforementioned concepts for the autonomous, or semi-autonomous, construction of knowledge from perceived information (e.g., by a robot). In other words, the expected goal is to allow a robot to progressively conceptualize the environment in which it evolves and to share the constructed knowledge with its user. To this end, a semantic-multimedia knowledge base has been created based on an ontological model and implemented through a NoSQL graph database. This knowledge base is the founding element of the thesis work, on which multiple approaches have been investigated based on semantic, multimedia, and visual information. The developed approaches combine this information through classic machine learning techniques, both supervised and unsupervised, together with transfer learning techniques for the reuse of semantic features from deep neural network models. Other techniques based on ontologies and the Semantic Web have been explored for the acquisition and integration of further knowledge into the knowledge base. The different areas investigated have been united in a comprehensive logical framework. The experiments conducted have shown an effective correspondence between interpretations based on semantic and visual features, from which emerged the possibility for a robotic agent to expand its knowledge-generalization skills even in unknown or partially known environments, allowing the objectives set to be achieved.
... to other machine learning domains, e.g., computer vision [17], [18], [19] and natural language processing [20], [21]. Also, cognitive load inference does not provide a directly actionable feedback signal for real-world assistance systems, as humans may still perform well under stress [22]. ...
Article
Full-text available
In this paper, we address cognitive overload detection from unobtrusive physiological signals for users in dual-tasking scenarios. Anticipating cognitive overload is a pivotal challenge in interactive cognitive systems and could lead to safer shared-control between users and assistance systems. Our framework builds on the assumption that decision mistakes on the cognitive secondary task of dual-tasking users correspond to cognitive overload events, wherein the cognitive resources required to perform the task exceed the ones available to the users. We propose DecNet, an end-to-end sequence-to-sequence deep learning model that infers in real-time the likelihood of user mistakes on the secondary task, i.e., the practical impact of cognitive overload, from eye-gaze and head-pose data. We train and test DecNet on a dataset collected in a simulated driving setup from a cohort of 20 users on two dual-tasking decision-making scenarios, with either visual or auditory decision stimuli. DecNet anticipates cognitive overload events in both scenarios and can perform in time-constrained scenarios, anticipating cognitive overload events up to 2s before they occur. We show that DecNet’s performance gap between audio and visual scenarios is consistent with user perceived difficulty. This suggests that single modality stimulation induces higher cognitive load on users, hindering their decision-making abilities.
... Miyazawa et al. constructed a PGM that combines MLDA with a cognitive module that learns grammatical knowledge [25]. Taniguchi et al. proposed spatial concept formation methods by applying a similar idea to multimodal information obtained by a mobile robot, including positional information [26][27][28]. Hagiwara et al. also proposed a hierarchical spatial category formation model that applies hierarchical MLDA to the inference of a hierarchical structure in spatial categories [29]. ...
Preprint
This paper describes a computational model of multiagent multimodal categorization that realizes emergent communication. We clarify whether the computational model can reproduce the following functions in a symbol emergence system, comprising two agents with different sensory modalities playing a naming game. (1) Function for forming a shared lexical system that comprises perceptual categories and corresponding signs, formed by agents through individual learning and semiotic communication between agents. (2) Function to improve the categorization accuracy in an agent via semiotic communication with another agent, even when some sensory modalities of each agent are missing. (3) Function that an agent infers unobserved sensory information based on a sign sampled from another agent in the same manner as cross-modal inference. We propose an interpersonal multimodal Dirichlet mixture (Inter-MDM), which is derived by dividing an integrative probabilistic generative model, which is obtained by integrating two Dirichlet mixtures (DMs). The Markov chain Monte Carlo algorithm realizes emergent communication. The experimental results demonstrated that Inter-MDM enables agents to form multimodal categories and appropriately share signs between agents. It is shown that emergent communication improves categorization accuracy, even when some sensory modalities are missing. Inter-MDM enables an agent to predict unobserved information based on a shared sign.
... Our previous system (SIGVerse ver.2) (Inamura, 2010) has been applied in studies such as the analysis of human behavior (Ramirez-Amaro et al., 2014), learning of spatial concepts (Taniguchi et al., 2016), and VR-based rehabilitation . These studies employed multimodal data (1)-(4) and functions (i)-(iii); however, the reusability of the conventional SIGVerse is restricted because it does not adopt ROS as its application programming interface (API). ...
Article
Full-text available
Research on Human-Robot Interaction (HRI) requires substantial consideration of experimental design, as well as a significant amount of time to conduct subject experiments. Recent technology in virtual reality (VR) can potentially address these time and effort challenges. The significant advantages of VR systems for HRI are: 1) cost reduction, as experimental facilities are not required in a real environment; 2) provision of the same environmental and embodied interaction conditions to test subjects; 3) visualization of arbitrary information and situations that cannot occur in reality, such as playback of past experiences; and 4) ease of access to an immersive and natural interface for robot/avatar teleoperation. Although VR tools with these features have been applied and developed in previous HRI research, all-encompassing tools or frameworks remain unavailable. In particular, the benefits of integration with cloud computing have not been comprehensively considered. Hence, the purpose of this study is to propose a research platform that can comprehensively provide the elements required for HRI research by integrating VR and cloud technologies. To realize a flexible and reusable system, we developed a real-time bridging mechanism between the robot operating system (ROS) and Unity. To confirm the feasibility of the system in a practical HRI scenario, we applied the proposed system to three case studies, including a robot competition named RoboCup@Home. Via these case studies, we validated the system's usefulness and its potential for the development and evaluation of social intelligence via multimodal HRI.
... For a robot to rapidly adapt to a new environment, acquiring local knowledge (e.g. spatial concept formation [38][39][40]) while exploring the environment is important. Here, it is necessary to address the problem wherein the prediction by the robot becomes ambiguous in a domain with no or few training data. ...
Article
Full-text available
The installation of remotely-operated service robots in the environments of our daily life (including offices, homes, and hospitals) can improve work-from-home policies and enhance the quality of the so-called new normal. However, it is evident that remotely-operated robots must have partial autonomy and the capability to learn and use local semiotic knowledge. In this paper, we argue that the development of semiotically adaptive cognitive systems is key to the installation of service robotics technologies in our service environments. To achieve this goal, we describe three challenges: the learning of local knowledge, the acceleration of onsite and online learning, and the augmentation of human–robot interactions.
... Specifically, it is important to appropriately generalize and form place categories while dealing with the observation uncertainties. To address these issues, PGMs for spatial concept formation have been constructed (Taniguchi et al., 2016a; Hagiwara et al., 2018; Katsumata et al., 2020). Taniguchi et al. (2017) have proposed the spatial concept formation with SLAM (SpCoSLAM), which is place categorization and mapping through unsupervised online learning from multimodal observation. ...
Preprint
Building a humanlike integrative artificial cognitive system, that is, an artificial general intelligence, is one of the goals in artificial intelligence and developmental robotics. Furthermore, a computational model that enables an artificial cognitive system to achieve cognitive development will be an excellent reference for brain and cognitive science. This paper describes the development of a cognitive architecture using probabilistic generative models (PGMs) to fully mirror the human cognitive system. The integrative model is called a whole-brain PGM (WB-PGM). It is both brain-inspired and PGM-based. In this paper, the process of building the WB-PGM and learning from the human brain to build cognitive architectures is described.
... Visual odometry is the problem of estimating the camera pose from consecutive images and is a fundamental capability required in many computer vision and robotics applications, such as autonomous/medical robots, augmented/mixed/virtual reality, and other complicated and emerging applications based on localization information, such as indoor and outdoor navigation, scene understanding, and space exploration [1]- [3]. ...
Article
Full-text available
Visual odometry (VO) is a prevalent way to deal with the relative localization problem, and it is becoming increasingly mature and accurate, but it tends to be fragile in challenging environments. Compared with classical geometry-based methods, deep learning-based methods can automatically learn effective and robust representations, such as depth, optical flow, features, and ego-motion, from data without explicit computation. Nevertheless, a thorough review of the recent advances in deep learning-based VO (Deep VO) is still lacking. Therefore, this paper aims to gain deep insight into how deep learning can profit and optimize VO systems. We first select a number of qualifications, including accuracy, efficiency, scalability, dynamicity, practicability, and extensibility, and employ them as criteria. Then, using the offered criteria as uniform measurements, we evaluate and discuss in detail how deep learning improves the performance of VO from the aspects of depth estimation, feature extraction and matching, and pose estimation. We also summarize the complicated and emerging areas of Deep VO, such as mobile robots, medical robots, augmented reality, and virtual reality. Through literature decomposition, analysis, and comparison, we finally put forward a number of open issues and raise some future research directions in this field.
... In [37], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from continuous human speech signals. ...
Article
Full-text available
The pervasive use of artificial intelligence and neural networks in several different research fields has noticeably improved multiple aspects of human life. The application of these techniques to machines has made them progressively more "intelligent" and able to solve tasks considered extremely complex for a human being. This technological evolution has deeply influenced the way we interact with machines. Purely symbolic artificial intelligence and techniques such as ontologies have also been successfully applied to robotics in the past, but they have shown limitations and failings in the knowledge-construction task. In fact, the exhibited "intelligence" is rarely the result of a real autonomous decision; rather, it is hard-coded into the machine. While a number of approaches concerning knowledge acquisition from the surrounding environment have already been proposed in the literature, they are either exclusively based on low-level features or involve solely high-level semantics-based attributes. Moreover, they often do not use a general high-level knowledge base for grounding the acquired knowledge. In this context, semantic technologies such as ontologies are mostly employed for action-oriented tasks. In this article, we propose an extension of a novel approach for knowledge acquisition based on a general semantic knowledge base and the fusion of semantic and visual information by means of neural networks and ontologies. The proposed approach has been implemented on a humanoid robotic platform, and the experimental results are shown and discussed.
... Our previous system (SIGVerse ver.2) [17] has been utilized for studies such as analysis of human behavior [43], learning of spatial concepts [48], and VR-based rehabilitation [16]. These studies employed multimodal data (1) to (4) and functions (i) to (iii); however, the re-usability of conventional SIGVerse is restricted due to its application programming interface (API). ...
Preprint
Common sense and social interaction related to daily-life environments are considerably important for autonomous robots that support human activities. One practical approach to acquiring such social interaction skills and common-sense semantic information about human activities is the application of recent machine learning techniques. Although recent machine learning techniques have been successful in realizing automatic manipulation and driving tasks, it is difficult to use these techniques in applications that require human-robot interaction experience: humans have to perform interactions many times over a long period to demonstrate embodied and social interaction behaviors to robots or learning systems. To address this problem, we propose a cloud-based immersive virtual reality (VR) platform which enables virtual human-robot interaction to collect the social and embodied knowledge of human activities in a variety of situations. To realize a flexible and reusable system, we develop a real-time bridging mechanism between ROS and Unity, which is one of the standard platforms for developing VR applications. We apply the proposed system to a robot competition field named RoboCup@Home to confirm the feasibility of the system in a realistic human-robot interaction scenario. Through demonstration experiments at the competition, we show the usefulness and potential of the system for the development and evaluation of social intelligence through human-robot interaction. The proposed VR platform enables robot systems to collect social experiences with several users in a short time. The platform also contributes to providing a dataset of social behaviors, which would be a key asset for intelligent service robots acquiring social interaction skills based on machine learning techniques.
... Taguchi et al. (2011) proposed an unsupervised method for simultaneously categorizing self-positions and phoneme sequences from user speech without any prior language model. Taniguchi et al. (2016, 2018a) proposed the nonparametric Bayesian Spatial Concept Acquisition method (SpCoA) using an unsupervised word segmentation method, latticelm (Neubig et al. 2012), and SpCoA++ for highly accurate lexical acquisition as a result of updating the language model. Gu et al. (2016) proposed a method to learn relative spatial concepts, i.e., the words related to distance and direction, from the positional relationship between an utterer and objects. ...
Article
Full-text available
We propose a novel online learning algorithm, called SpCoSLAM 2.0, for spatial concepts and lexical acquisition with high accuracy and scalability. Previously, we proposed SpCoSLAM as an online learning algorithm based on an unsupervised Bayesian probabilistic model that integrates multimodal place categorization, lexical acquisition, and SLAM. However, our original algorithm had limited estimation accuracy owing to the influence of the early stages of learning, and its computational complexity increased as training data were added. Therefore, we introduce techniques such as fixed-lag rejuvenation to reduce the calculation time while maintaining an accuracy higher than that of the original algorithm. The results show that, in terms of estimation accuracy, the proposed algorithm exceeds the original algorithm and is comparable to batch learning. In addition, the calculation time of the proposed scalable algorithm does not depend on the amount of training data and remains constant at each step. Our approach will contribute to the realization of long-term spatial language interactions between humans and robots.
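For readers unfamiliar with the fixed-lag idea, the following is a minimal sketch (not the authors' SpCoSLAM 2.0 implementation): at every time step only the latent assignments inside a fixed lag window are re-sampled, so the per-step cost does not grow with the amount of accumulated training data. The toy Dirichlet-multinomial mixture, the hyperparameters, and the lag length are assumptions made purely for illustration.

import random
from collections import defaultdict

K, V = 5, 20            # number of clusters and vocabulary size (assumed)
ALPHA, BETA = 1.0, 0.1  # symmetric Dirichlet hyperparameters (assumed)
LAG = 10                # rejuvenation window length (assumed)

assignments = []                  # z_1 .. z_t
counts_kv = defaultdict(float)    # (cluster, word) co-occurrence counts
counts_k = defaultdict(float)     # cluster sizes

def sample_assignment(obs):
    # Collapsed-Gibbs draw of a cluster index for a single observation.
    weights = []
    for k in range(K):
        prior = counts_k[k] + ALPHA
        like = (counts_kv[(k, obs)] + BETA) / (counts_k[k] + BETA * V)
        weights.append(prior * like)
    return random.choices(range(K), weights=weights)[0]

def online_step(stream, t):
    # Assign the newest observation, then rejuvenate only the last LAG assignments.
    z = sample_assignment(stream[t])
    assignments.append(z)
    counts_kv[(z, stream[t])] += 1
    counts_k[z] += 1
    for i in range(max(0, t - LAG + 1), t + 1):
        old = assignments[i]
        counts_kv[(old, stream[i])] -= 1   # remove, re-sample, and re-add
        counts_k[old] -= 1
        new = sample_assignment(stream[i])
        assignments[i] = new
        counts_kv[(new, stream[i])] += 1
        counts_k[new] += 1

stream = [random.randrange(V) for _ in range(200)]
for t in range(len(stream)):
    online_step(stream, t)
print("cluster sizes:", dict(counts_k))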
... Of course, there are other concepts as well. For example, spatial concepts (Taniguchi et al., 2016a) can be considered; however, to realize such a concept, the robot is required to have a mobile base. ...
Article
Full-text available
The manner in which humans learn, plan, and decide actions is a very compelling subject. Moreover, the mechanism behind high-level cognitive functions, such as action planning, language understanding, and logical thinking, has not yet been fully implemented in robotics. In this paper, we propose a framework for the simultaneous comprehension of concepts, actions, and language as a first step toward this goal. This can be achieved by integrating various cognitive modules, mainly by leveraging multimodal categorization using multilayered multimodal latent Dirichlet allocation (mMLDA). The integration of reinforcement learning and mMLDA enables actions based on understanding. Furthermore, mMLDA, in conjunction with grammar learning based on a Bayesian hidden Markov model (BHMM), allows the robot to verbalize its own actions and understand user utterances. We verify the potential of the proposed architecture through experiments using a real robot.
... Other nominal concept formation methods have also been developed. For example, locational, or spatial, concept formation methods have been proposed by Taniguchi et al. [120,121]. ...
Article
Full-text available
The understanding and acquisition of a language in a real-world environment is an important task for future robotics services. Natural language processing and cognitive robotics have both been focusing on the problem for decades using machine learning. However, many problems remain unsolved despite significant progress in machine learning (such as deep learning and probabilistic generative models) during the past decade. The remaining problems have not been systematically surveyed and organized, as most of them are highly interdisciplinary challenges for language and robotics. This study conducts a survey on the frontier of the intersection of the research fields of language and robotics, ranging from logic probabilistic programming to designing a competition to evaluate language understanding systems. We focus on cognitive developmental robots that can learn a language from interaction with their environment and unsupervised learning methods that enable robots to learn a language without hand-crafted training data.
... The purpose is to endow a robot with the ability to model its knowledge about the environment and to acquire new knowledge when it occurs in a kind of self-consciousness process. In [16], a method for spatial concept acquisition is presented. The authors use an unsupervised learning method for the lexical acquisition of words related to places visited by robots, from human continuous speech signals. ...
Chapter
Full-text available
The skills required of machines have grown exponentially in the last decade. Recent efforts made by the scientific community have shown amazing results in the field of research related to artificial intelligence and robotics. Recent studies show that machines may be superior to humans in carrying out certain tasks. However, in many approaches they still fail to achieve the high-level skills required to support humans and interact with them. Furthermore, the “intelligence” exhibited is hardly ever the result of a real autonomous decision. In this article we propose a novel approach, with the aim of providing a machine with the ability to evolve and build its own knowledge by combining both semantic and visual information. The proposed concept, its implementation, and experimental results are shown and discussed.
... • The extension to a mutual segmentation model of sound strings and situations based on multimodal information will be achieved by building on a multimodal LDA with a nested Pitman-Yor language model (Nakamura et al., 2014) and a spatial concept acquisition model that integrates self-localization and unsupervised word discovery from spoken sentences (Taniguchi et al., 2016a). ...
Preprint
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. Semiotic communication is defined, in this study, as the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchange of signs with other agents. From the viewpoint of language evolution and symbol emergence, organization of a symbol system in a multi-agent system is considered as a bottom-up and dynamic process, where individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence is also verified in an experiment with two agents observing daily objects in the real-world environment. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
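As a rough illustration of the probabilistic rejection described above, the sketch below implements a Metropolis-Hastings-style acceptance rule in a two-agent naming game: the speaker proposes a sign from its own belief, and the listener accepts it with a probability computed from its own belief about the same object. The categorical sign-category associations, the assumption that both agents attend to the same category, and all constants are toy assumptions for illustration, not the authors' model.

import numpy as np

rng = np.random.default_rng(0)
N_SIGNS, N_CATS = 5, 3

class Agent:
    def __init__(self):
        # p(sign | category): each agent's private association of signs with categories.
        self.phi = rng.dirichlet(np.ones(N_SIGNS), size=N_CATS)

    def propose_sign(self, category):
        # Speaker samples a sign from its own belief about the category.
        return rng.choice(N_SIGNS, p=self.phi[category])

    def accept(self, proposed_sign, current_sign, category):
        # Listener accepts with a Metropolis-Hastings-style ratio under its own belief.
        ratio = self.phi[category, proposed_sign] / self.phi[category, current_sign]
        return rng.random() < min(1.0, ratio)

speaker, listener = Agent(), Agent()
category = 1                               # both agents perceive the same object category (assumed)
current_sign = int(rng.integers(N_SIGNS))  # the listener's current sign for the object

for step in range(20):
    s = speaker.propose_sign(category)
    if listener.accept(s, current_sign, category):
        current_sign = s                   # the shared sign changes only when the listener accepts
print("sign after 20 exchanges:", current_sign)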
... Gu et al. (2016) proposed a method of learning relative space categories from ambiguous instructions. Taniguchi et al. (2014, 2016) proposed computational models for a mobile robot to acquire spatial concepts based on information from recognized speech and estimated self-location. Here, the spatial concept was defined as the distributions of names and positions at each place. ...
Article
Full-text available
In this paper, we propose a hierarchical spatial concept formation method based on a Bayesian generative model with multimodal information, e.g., vision, position, and word information. Since humans have the ability to select an appropriate level of abstraction according to the situation and describe their position linguistically, e.g., “I am in my home” and “I am in front of the table,” a hierarchical structure of spatial concepts is necessary in order for human support robots to communicate smoothly with users. The proposed method enables a robot to form hierarchical spatial concepts by categorizing multimodal information using hierarchical multimodal latent Dirichlet allocation (hMLDA). Object recognition results obtained using a convolutional neural network (CNN), the hierarchical k-means clustering result of self-positions estimated by Monte Carlo localization (MCL), and a set of location names are used as features of the vision, position, and word information, respectively. Experiments in forming hierarchical spatial concepts and evaluating how the proposed method can predict unobserved location names and position categories are performed using a robot in the real world. The results verify that, relative to comparable baseline methods, the proposed method enables a robot to predict location names and position categories closer to the predictions made by humans. As an application example of the proposed method in a home environment, a demonstration in which a human support robot moves to an instructed place based on human speech instructions is achieved using the formed hierarchical spatial concepts.
... However, the limitation of these two approaches is that the knowledge is pre-defined rather than acquired completely by the robot. Taniguchi et al. [25] proposed a nonparametric Bayesian spatial concept acquisition method, which allows the robot to obtain place names from uttered sentences and to use the acquired spatial concepts to effectively reduce the uncertainty of self-localization. Compared with the first two methods, it is more autonomous for robots. ...
... However, Taniguchi et al. (2016a) assumed that the name of a place would be learned from an uttered word. Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on place categorization and unsupervised word segmentation. SpCoA could acquire the names of places from spoken sentences including multiple words. ...
Article
Full-text available
In this paper, we propose a Bayesian generative model that can form multiple categories based on each sensory-channel and can associate words with any of four sensory-channels (action, position, object, and color). This paper focuses on cross-situational learning using the co-occurrence between words and information of sensory-channels in complex situations rather than conventional situations of cross-situational learning. We conducted a learning scenario using a simulator and a real humanoid iCub robot. In the scenario, a human tutor provided a sentence that describes an object of visual attention and an accompanying action to the robot. The scenario was set as follows: the number of words per sensory-channel was three or four, and the number of trials for learning was 20 and 40 for the simulator and 25 and 40 for the real robot. The experimental results showed that the proposed method was able to estimate the multiple categorizations and to learn the relationships between multiple sensory-channels and words accurately. In addition, we conducted an action generation task and an action description task based on word meanings learned in the cross-situational learning scenario. The experimental results showed the robot could successfully use the word meanings learned by using the proposed method.
... Regarding related work on place categorization, Taniguchi et al. proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) on the basis of unsupervised word segmentation and a nonparametric Bayesian generative model that integrates self-localization and clustering in both words and places [2]. Hagiwara et al. proposed a method that enables robots to autonomously form place concepts using hierarchical multimodal latent Dirichlet allocation (hMLDA) [3], based on position and visual information [4]. ...
Conference Paper
Full-text available
Human support robots need to learn the relationships between objects and places to provide services such as cleaning rooms and locating objects through linguistic communications. In this paper, we propose a Bayesian probabilistic model that can automatically model and estimate the probability of objects existing in each place using a multimodal spatial concept based on the co-occurrence of objects. In our experiments, we evaluated the estimation results for objects by using a word to express their places. Furthermore, we showed that the robot could perform tasks involving cleaning up objects, as an example of the usage of the method. We showed that the robot correctly learned the relationships between objects and places.
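As a small, hypothetical sketch of the co-occurrence idea (not the paper's multimodal Bayesian model), the following estimates p(place | object) from Dirichlet-smoothed object-place co-occurrence counts and uses it to guess where an object belongs; the object and place names and the pseudo-count are made up for illustration.

from collections import Counter

ALPHA = 1.0  # symmetric Dirichlet pseudo-count (assumed)

observations = [  # (object, place) pairs observed in a tidied environment (toy data)
    ("cup", "kitchen"), ("cup", "kitchen"), ("book", "shelf"),
    ("cup", "living_room"), ("book", "shelf"), ("toy", "living_room"),
]

places = sorted({p for _, p in observations})
counts = Counter(observations)
obj_totals = Counter(o for o, _ in observations)

def p_place_given_object(obj):
    # Posterior-mean estimate of p(place | object) under Dirichlet smoothing.
    denom = obj_totals[obj] + ALPHA * len(places)
    return {pl: (counts[(obj, pl)] + ALPHA) / denom for pl in places}

dist = p_place_given_object("cup")
print(dist)                          # the kitchen should get the highest probability
print(max(dist, key=dist.get))       # most probable place to put the cup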
... They showed that a robot could improve its localization performance by integrating object recognition results and Monte Carlo localization results. Taniguchi et al. (2016a) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates a generative model for self-localization and the unsupervised word segmentation in uttered sentences via latent variables related to the spatial concept. In these studies, the knowledge of places and their names is learned by a robot in a totally unsupervised and statistical manner. ...
Article
Humans can acquire language through physical interaction with their environment and semiotic interaction with other people. It is very important to understand how humans can form a symbol system and obtain semiotic skills through their autonomous mental development from a computational point of view. A machine learning system that enables a robot to obtain and modulate its symbol system is crucially important to develop robotic systems that achieve long-term human-robot communication and collaboration. In this paper, I introduce the basis of our research field and related topics. Specifically, I describe the concept of symbol emergence systems and the recent research topics, e.g., multimodal categorization, spatial concept formation, language acquisition, and double articulation analysis, that will contribute to future human-robot communication and collaboration.
Chapter
Music and language are structurally similar. Such structural similarity is often explained by generative processes. This paper describes the recent development of probabilistic generative models (PGMs) for language learning and symbol emergence in robotics. Symbol emergence in robotics aims to develop a robot that can adapt to real-world environments and human linguistic communications and acquire language from sensorimotor information alone (i.e., in an unsupervised manner). This is regarded as a constructive approach to symbol emergence systems. To this end, a series of PGMs have been developed, including those for simultaneous phoneme and word discovery, lexical acquisition, object and spatial concept formation, and the emergence of a symbol system. By extending the models, a symbol emergence system comprising a multi-agent system in which a symbol system emerges is revealed to be modeled using PGMs. In this model, symbol emergence can be regarded as collective predictive coding. This paper expands on this idea by combining the theory that “emotion is based on the predictive coding of interoceptive signals” and “symbol emergence systems”, and describes the possible hypothesis of the emergence of meaning in music. Keywords: symbol emergence systems, probabilistic generative model, symbol emergence in robotics, automatic music composition, language evolution
Conference Paper
Full-text available
This paper proposes SpCoMapGAN, a method to generate the semantic map in a newly visited environment by training an inference model using previously estimated semantic maps. SpCoMapGAN uses generative adversarial networks (GANs) to transfer semantic information based on room arrangements to the newly visited environment. We experimentally show in simulation that SpCoMapGAN can use global information for estimating the semantic map and is superior to previous related methods.
Article
Full-text available
This paper proposes a hierarchical Bayesian model based on spatial concepts that enables a robot to transfer the knowledge of places from experienced environments to a new environment. The transfer of knowledge based on spatial concepts is modeled as the calculation process of the posterior distribution based on the observations obtained in each environment with the parameters of spatial concepts generalized to environments as prior knowledge. We conducted experiments to evaluate the generalization performance of spatial knowledge for general places such as kitchens and the adaptive performance of spatial knowledge for unique places such as ‘Emma's room’ in a new environment. In the experiments, the accuracies of the proposed method and conventional methods were compared in the prediction task of location names from an image and a position, and the prediction task of positions from a location name. The experimental results demonstrated that the proposed method has a higher prediction accuracy of location names and positions than the conventional method owing to the transfer of knowledge.
Preprint
Infants acquire words and phonemes from unsegmented speech signals using segmentation cues, such as distributional, prosodic, and co-occurrence cues. Many pre-existing computational models that represent the process tend to focus on distributional or prosodic cues. This paper proposes a nonparametric Bayesian probabilistic generative model called the prosodic hierarchical Dirichlet process-hidden language model (Prosodic HDP-HLM). Prosodic HDP-HLM, an extension of HDP-HLM, considers both prosodic and distributional cues within a single integrative generative model. We conducted three experiments on different types of datasets, and demonstrate the validity of the proposed method. The results show that the Prosodic DAA successfully uses prosodic cues and outperforms a method that solely uses distributional cues. The main contributions of this study are as follows: 1) We develop a probabilistic generative model for time series data including prosody that potentially has a double articulation structure; 2) We propose the Prosodic DAA by deriving the inference procedure for Prosodic HDP-HLM and show that Prosodic DAA can discover words directly from continuous human speech signals using statistical information and prosodic information in an unsupervised manner; 3) We show that prosodic cues contribute more to word segmentation in the case of naturally distributed words, i.e., words that follow Zipf's law.
Article
Visual perception is an important component for human–robot interaction processes in robotic systems. Interaction between humans and robots depends on the reliability of the robotic vision systems. The variation of camera sensors and the capability of these sensors to detect many types of sensory inputs improve visual perception. The analysis of activities, motions, skills, and behaviors of humans and robots has been addressed by utilizing the heat signatures of the human body. The human motion behavior is analyzed by body movement kinematics, and the trajectory of the target is used to identify the objects and the human target in omnidirectional (O-D) thermal images. Human target identification and gesture recognition with traditional sensors are problematic in multitarget scenarios, since these sensors may not keep all targets in their narrow field of view (FOV) at the same time. The O-D thermal view increases the robots' line of sight and their ability to perceive in the absence of light. The human target is informed of its position, surrounding objects, and any other human targets in its proximity, so that humans with limited vision or a vision disability can be assisted in their environment. The proposed method helps to identify human targets in a wide FOV under light-independent conditions, to assist the human target and improve human–robot and robot–robot interactions. The experimental results show that the identification of the human targets is achieved with a high accuracy.
Article
Full-text available
This study focuses on category formation for individual agents and the dynamics of symbol emergence in a multi-agent system through semiotic communication. In this study, the semiotic communication refers to exchanging signs composed of the signifier (i.e., words) and the signified (i.e., categories). We define the generation and interpretation of signs associated with the categories formed through the agent's own sensory experience or by exchanging signs with other agents as basic functions of the semiotic communication. From the viewpoint of language evolution and symbol emergence, organization of a symbol system in a multi-agent system (i.e., agent society) is considered as a bottom-up and dynamic process, where individual agents share the meaning of signs and categorize sensory experience. A constructive computational model can explain the mutual dependency of the two processes and has mathematical support that guarantees a symbol system's emergence and sharing within the multi-agent system. In this paper, we describe a new computational model that represents symbol emergence in a two-agent system based on a probabilistic generative model for multimodal categorization. It models semiotic communication via a probabilistic rejection based on the receiver's own belief. We have found that the dynamics by which cognitively independent agents create a symbol system through their semiotic communication can be regarded as the inference process of a hidden variable in an interpersonal multimodal categorizer, i.e., the complete system can be regarded as a single agent performing multimodal categorization using the sensors of all agents, if we define the rejection probability based on the Metropolis-Hastings algorithm. The validity of the proposed model and algorithm for symbol emergence, i.e., forming and sharing signs and categories, is also verified in an experiment with two agents observing daily objects in the real-world environment. In the experiment, we compared three communication algorithms: no communication, no rejection, and the proposed algorithm. The experimental results demonstrate that our model reproduces the phenomena of symbol emergence, which does not require a teacher who would know a pre-existing symbol system. Instead, the multi-agent system can form and use a symbol system without having pre-existing categories.
Article
The concept of direction is critical for a robot to establish its abilities in spatial perception. This article addresses the issue of how a robot developmentally forms the concept of direction. We propose a novel framework, based on the related mechanisms of humans, in which motion cues are employed instead of other commonly used sensing means like vision or audition. Using motion cues is actually one of the most important ways for humans to acquire concepts. This leads to advantages in two ways: 1) models based on motion cues are usually more convenient for a robot’s control of motion and 2) multimodal perceptual cues can complement each other and result in improved robustness. The framework behind our methodology lies in developmentally forming the concept of direction in two successive phases: 1) the robot acquires the perceptual representation of its motion by motor babbling and 2) the corresponding higher level features are gradually captured, based on which the conceptual representation is obtained and then the concept of direction is formed. The proposed framework and the corresponding models are evaluated with a PKU-HR5.0II humanoid robot. The results show that the robot can successfully form the concept of direction in a manner similar to that of humans.
Article
This paper describes how to achieve highly accurate unsupervised spatial lexical acquisition from speech-recognition results including phoneme recognition errors. In most research into lexical acquisition, the robot has no pre-existing lexical knowledge. The robot acquires sequences of phonemes as words from continuous speech signals. In a previous study, we proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates the robot's position and words obtained by unsupervised word segmentation from uncertain syllable recognition results. However, SpCoA has a critical problem in lexical acquisition: the word segmentation boundaries are incorrect in many cases because of frequent phoneme recognition errors. Therefore, we propose an unsupervised machine learning method (SpCoA++) for the robust lexical acquisition of novel words relating to places visited by the robot. The proposed SpCoA++ method performs an iterative estimation of learning spatial concepts and updating a language model using place information. SpCoA++ can select a candidate that includes many words that better represent places from multiple word-segmentation results by maximizing the mutual information between segmented words and spatial concepts. The experimental results demonstrate that the proposed method significantly improves the phoneme accuracy rate of learned place-related words by exploiting word-segmentation results based on place information, in comparison with conventional methods. We indicate that the proposed method enables the robot to acquire words from speech signals more accurately, and improves the estimation accuracy of the spatial concepts.
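The selection criterion can be illustrated with a small sketch that scores each word-segmentation candidate by the mutual information between segmented words and spatial-concept indices and keeps the highest-scoring one. The toy candidates and all names below are assumptions for illustration, not the authors' implementation.

import math
from collections import Counter

def mutual_information(pairs):
    # I(W; C) estimated from (word, concept index) co-occurrence counts.
    n = len(pairs)
    joint = Counter(pairs)
    words = Counter(w for w, _ in pairs)
    concepts = Counter(c for _, c in pairs)
    mi = 0.0
    for (w, c), nwc in joint.items():
        p_wc = nwc / n
        mi += p_wc * math.log(p_wc / ((words[w] / n) * (concepts[c] / n)))
    return mi

# Each candidate: (segmented word, index of the spatial concept that generated
# the utterance) pairs collected over the whole training set.
candidates = {
    "candidate_A": [("kitchen", 0), ("kitchen", 0), ("bedroom", 1), ("ki", 0), ("tchen", 0)],
    "candidate_B": [("kitchen", 0), ("kitchen", 0), ("bedroom", 1), ("kitchen", 0), ("bedroom", 1)],
}

scores = {name: mutual_information(pairs) for name, pairs in candidates.items()}
best = max(scores, key=scores.get)
print(best, {name: round(s, 3) for name, s in scores.items()})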
Conference Paper
A model was developed to allow a mobile robot to label the areas of a typical domestic room using raw sequential visual and motor data; no explicit information on location was provided, and no maps were constructed. The model comprised a deep autoencoder and a recurrent neural network. The model was demonstrated to (1) learn to correctly label areas of different shapes and sizes, (2) be capable of adapting to changes in room shape and rearrangement of items in the room, and (3) attribute different labels to the same area, when approached from different angles. Analysis of the internal representations of the model showed that a topological structure corresponding to the room structure was self-organized as the trajectory of the internal activations of the network.
Conference Paper
Full-text available
Human infants can acquire word meanings by estimating the relationships among multiple situations and words. In this paper, we propose a Bayesian probabilistic model that can learn multiple categorizations and words related to any of four modalities (action, object, position, and color). This paper focuses on a cross-situational learning using the co-occurrence of sentences and situations. We conducted a learning experiment using the humanoid iCub robot. In this experiment, the human tutor describes a sentence about an object of visual attention to the robot and an action of the robot. The experimental results show that the proposed method was able to estimate the multiple categorizations and to accurately learn the relationships between multiple modalities and words.
Conference Paper
Full-text available
Providing autonomous humanoid robots with the abilities to react in an adaptive and intelligent manner involves low level control and sensing as well as high level reasoning. However, the integration of both levels still remains challenging due to the representational gap between the continuous state space on the sensorimotor level and the discrete symbolic entities used in high level reasoning. In this work, we approach the problem of learning a representation of the space which is applicable on both levels. This representation is grounded on the sensorimotor level by means of exploration and on the language level by making use of common sense knowledge. We demonstrate how spatial knowledge can be extracted from these two sources of experience. Combining the resulting knowledge in a systematic way yields a solution to the grounding problem which has the potential to substantially decrease the learning effort.
Article
Full-text available
For mobile robots to communicate meaningfully about their spatial environment, they require personally constructed cognitive maps and social interactions to form languages with shared meanings. Geographic spatial concepts introduce particular problems for grounding—connecting a word to its referent in the world—because such concepts cannot be directly and solely based on sensory perceptions. In this article we investigate the grounding of geographic spatial concepts using mobile robots with cognitive maps, called Lingodroids. Languages were established through structured interactions between pairs of robots called where-are-we conversations. The robots used a novel method, termed the distributed lexicon table, to create flexible concepts. This method enabled words for locations, termed toponyms, to be grounded through experience. Their understanding of the meaning of words was demonstrated using go-to games in which the robots independently navigated to named locations. Studies in real and virtual reality worlds show that the system is effective at learning spatial language: robots learn words easily—in a single trial as children do—and the words and their meaning are sufficiently robust for use in real world tasks.
Article
Full-text available
We describe a series of five statistical models of the translation process and give algorithms for estimating the parameters of these models given a set of pairs of sentences that are translations of one another. We define a concept of word-by-word alignment between such pairs of sentences. For any given pair of such sentences each of our models assigns a probability to each of the possible word-by-word alignments. We give an algorithm for seeking the most probable of these alignments. Although the algorithm is suboptimal, the alignment thus obtained accounts well for the word-by-word relationships in the pair of sentences. We have a great deal of data in French and English from the proceedings of the Canadian Parliament. Accordingly, we have restricted our work to these two languages; but we feel that because our algorithms have minimal linguistic content they would work well on other pairs of languages. We also feel, again because of the minimal linguistic content of our algorithms, that it is reasonable to argue that word-by-word alignments are inherent in any sufficiently large bilingual corpus.
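As a compact illustration of the lexical translation probabilities such alignment models estimate, the sketch below runs EM training of a Model-1-style table t(f | e) on a toy corpus and then reads off a word-by-word alignment; it is a simplified stand-in (no NULL word, fertility, or distortion modeling), not the authors' full series of five models, and the toy corpus is assumed.

from collections import defaultdict

corpus = [  # (English sentence, French sentence) toy pairs
    (["the", "house"], ["la", "maison"]),
    (["the", "book"], ["le", "livre"]),
    (["a", "house"], ["une", "maison"]),
]

e_vocab = {e for es, _ in corpus for e in es}
t = defaultdict(lambda: 1.0 / len(e_vocab))   # t[(f, e)], uniform initialization

for _ in range(20):                            # EM iterations
    count = defaultdict(float)                 # expected counts c(f, e)
    total = defaultdict(float)                 # expected counts c(e)
    for es, fs in corpus:
        for f in fs:
            z = sum(t[(f, e)] for e in es)     # normalize over possible alignments of f
            for e in es:
                delta = t[(f, e)] / z          # posterior probability that e generated f
                count[(f, e)] += delta
                total[e] += delta
    for (f, e), c in count.items():            # M-step: re-estimate t(f | e)
        t[(f, e)] = c / total[e]

es, fs = corpus[0]                             # most probable alignment for one sentence pair
alignment = [(f, max(es, key=lambda e: t[(f, e)])) for f in fs]
print(alignment)                               # expected: [('la', 'the'), ('maison', 'house')]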
Article
Full-text available
This paper presents a machine learning method that enables robots to learn to communicate linguistically from scratch through verbal and behavioral interaction with users. The method combines speech, visual, and tactile information obtained by interaction in the real world. It learns speech units, words, concepts of objects, motions, grammar, and pragmatic and communicative capabilities, which are integrated in a dynamic graphical model. Experimental results show that through a practical, small number of learning episodes with a user, the robot was eventually able to understand even fragmental and ambiguous utterances, respond to them with confirmation questions and/or acting, generate directive utterances appropriate for the given situation, and answer questions. This paper discusses the importance of a developmental approach to realize natural situated human-robot conversations.
Chapter
Full-text available
This work presents a developmental and ecological approach to language acquisition in robots, which has its roots in the interaction between infants and their caregivers. We show that the signal directed to infants by their caregivers includes several hints that can facilitate language acquisition and reduce the need for preprogrammed linguistic structure. Moreover, infants also produce sounds, which enables richer types of interaction, such as imitation games, and the use of motor learning. By using a humanoid robot with embodied models of the infant's ears, eyes, vocal tract, and memory functions, we can mimic the adult-infant interaction and take advantage of the inherent structure in the signal. Two experiments are shown, in which the robot learns a number of word-object associations and the articulatory target positions for a number of vowels.
Conference Paper
Full-text available
Understanding the mechanisms of intelligence of human beings and animals is one of the most important approaches to developing intelligent robot systems. Since the mechanisms of such real-life intelligent systems are so complex, physical interactions between agents and their environment and the social interactions between agents should be considered. Comprehension and knowledge in many peripheral fields such as cognitive science, developmental psychology, brain science, evolutionary biology, and robotics are also required. Discussions from an interdisciplinary perspective are very important for this approach, but such collaborative research is time-consuming and labor-intensive, and it is difficult to obtain fruitful results because the basis of experiments is very different in each research field. In the social science field, for example, several multi-agent simulation systems have been proposed for modeling factors such as social interactions and language evolution, whereas robotics researchers often use dynamics and sensor simulators. However, there is no integrated system that uses both physical simulations and social communication simulations. Therefore, we developed a simulator environment called SIGVerse that combines dynamics, perception, and communication simulations for synthetic approaches to research into the genesis of social intelligence. In this paper, we introduce SIGVerse, an example application, and future perspectives.
Chapter
Full-text available
8.1 Sharing the risk of being misunderstood
The experiments in learning a pragmatic capability illustrate the importance of sharing the risk of not being understood correctly between the user and the robot. In the learning period for utterance understanding by the robot, the values of the local confidence parameters changed significantly when the robot acted incorrectly in the first trial and correctly in the second trial. To facilitate learning, the user had to gradually increase the ambiguity of utterances according to the robot's developing ability to understand them and had to take the risk of not being understood correctly. In its learning period for utterance generation, the robot adjusted its utterances to the user while learning the global confidence function. When the target understanding rate was set to 0.95, the global confidence function became very unstable in cases where the robot's expectations of being understood correctly at a high probability were not met. This instability could be prevented by using a lower value of , which means that the robot would have to take a greater risk of not being understood correctly. Accordingly, in human-machine interaction, both users and robots must face the risk of not being understood correctly and thus adjust their actions to accommodate such risk in order to effectively couple their belief systems. Although the importance of controlling the risk of error in learning has generally been seen as an exploration-exploitation trade-off in the field of reinforcement learning by machines (e.g., Dayan & Sejnowski, 1996), we argue here that the mutual accommodation of the risk of error by those communicating is an important basis for the formation of mutual understanding.
8.2 Incomplete observed information and fast adaptation
In general, an utterance does not contain complete information about what a speaker wants to convey to a listener. The proposed learning method interpreted such utterances according to the situation by providing necessary but missing information by making use of the assumption of shared beliefs. The method also enabled the robot and the user to adapt such an assumption of shared beliefs to each other with little interaction. We can say that the method successfully
Conference Paper
Full-text available
Julius is a high-performance, two-pass LVCSR decoder for researchers and developers. Based on word 3-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in a 20k-word dictation task. Major search techniques are fully incorporated, such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, Gaussian selection, etc. Besides search efficiency, it is also carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to cope with other free modeling toolkits. The main platform is Linux and other Unix workstations, and it partially works on Windows. Julius is distributed with an open license together with source codes, and has been used by many researchers and developers in Japan.
Conference Paper
Full-text available
This paper proposes a method for the unsupervised learning of lexicons from pairs of a spoken utterance and an object as its meaning, under the condition that no prior linguistic knowledge other than acoustic models of Japanese phonemes is used. The main problems are the word segmentation of spoken utterances and the learning of the phoneme sequences of the words. To obtain a lexicon, a statistical model, which represents the joint probability of an utterance and an object, is learned based on the minimum description length (MDL) principle. The model consists of three parts: a word list in which each word is represented by a phoneme sequence, a word-bigram model, and a word-meaning model. Through alternate learning processes of these parts, acoustically, grammatically, and semantically appropriate units of phoneme sequences that cover all utterances are acquired as words. Experimental results show that our model can acquire phoneme sequences of object words with about 83.6% accuracy.
Conference Paper
Full-text available
Situated, spontaneous speech may be ambiguous along acoustic, lexical, grammatical and semantic dimensions. To understand such a seemingly difficult signal, we propose to model the ambiguity inherent in acoustic signals and in lexical and grammatical choices using compact, probabilistic representations of multiple hypotheses. To resolve semantic ambiguities we propose a situation model that captures aspects of the physical context of an utterance as well as the speaker's intentions, in our case represented by recognized plans. In a single, coherent Framework for Understanding Situated Speech (FUSS) we show how these two influences, acting on an ambiguous representation of the speech signal, complement each other to disambiguate form and content of situated speech. This method produces promising results in a game playing environment and leaves room for other types of situation models.
Conference Paper
Full-text available
In this paper, we propose a new Bayesian model for fully unsupervised word segmentation and an efficient blocked Gibbs sampler combined with dynamic programming for inference. Our model is a nested hierarchical Pitman-Yor language model, in which a Pitman-Yor spelling model is embedded in the word model. We confirmed that it significantly outperforms previously reported results on both phonetic transcripts and standard datasets for Chinese and Japanese word segmentation. Our model can also be considered as a way to construct an accurate word n-gram language model directly from characters of an arbitrary language, without any "word" indications.
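The blocked sampling step can be illustrated by forward filtering / backward sampling of a word segmentation under a fixed unigram word model, as in the sketch below. The actual model couples this step with learning the nested hierarchical Pitman-Yor language model; here the word probability is a simple stand-in (a geometric length prior times uniform characters), so the code only illustrates the inference machinery under those assumptions.

import random

MAX_WORD_LEN = 8
CHAR_PROB = 1.0 / 27.0   # assumed uniform probability per character
P_CONTINUE = 0.5         # assumed geometric word-length prior

def p_word(w):
    # Toy base probability of a word; the real model would use its learned lexicon here.
    return (P_CONTINUE ** (len(w) - 1)) * (1 - P_CONTINUE) * (CHAR_PROB ** len(w))

def sample_segmentation(s):
    n = len(s)
    alpha = [0.0] * (n + 1)   # alpha[t] = marginal probability of s[:t] over all segmentations
    alpha[0] = 1.0
    for t in range(1, n + 1):                               # forward filtering
        for k in range(1, min(t, MAX_WORD_LEN) + 1):
            alpha[t] += p_word(s[t - k:t]) * alpha[t - k]
    words, t = [], n
    while t > 0:                                            # backward sampling of boundaries
        ks = list(range(1, min(t, MAX_WORD_LEN) + 1))
        weights = [p_word(s[t - k:t]) * alpha[t - k] for k in ks]
        k = random.choices(ks, weights=weights)[0]
        words.append(s[t - k:t])
        t -= k
    return list(reversed(words))

print(sample_segmentation("thehouseisbig"))   # one sampled segmentation under the toy model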
Article
Full-text available
We propose a novel scheme to learn a language model (LM) for automatic speech recognition (ASR) directly from continuous speech. In the proposed method, we first generate phoneme lattices using an acoustic model with no linguistic constraints, then perform training over these phoneme lattices, simultaneously learning both lexical units and an LM. As a statistical framework for this learning problem, we use non-parametric Bayesian statistics, which make it possible to balance the learned model's complexity (such as the size of the learned vocabulary) and expressive power, and provide a principled learning algorithm through the use of Gibbs sampling. Implementation is performed using weighted finite state transducers (WFSTs), which allow for the simple handling of lattice input. Experimental results on natural, adult-directed speech demonstrate that LMs built using only continuous speech are able to significantly reduce ASR phoneme error rates. The proposed technique of joint Bayesian learning of lexical units and an LM over lattices is shown to significantly contribute to this improvement.
Article
Full-text available
An understanding of time and temporal concepts is critical for interacting with the world and with other agents in the world. What does a robot need to know to refer to the temporal aspects of events? Could a robot gain a grounded understanding of 'a long journey', or 'soon'? Cognitive maps constructed by individual agents from their own journey experiences have been used for grounding spatial concepts in robot languages. In this paper, we test whether a similar methodology can be applied to learning temporal concepts and an associated lexicon to answer the question 'how long did it take to complete a journey?'. Using evolutionary language games for specific and generic journeys, successful communication was established for concepts based on representations of time, distance, and amount of change. The studies demonstrate that a lexicon for journey duration can be grounded using a variety of concepts. Spatial and temporal terms are not identical, but the studies show that both can be learned using similar language evolution methods, and that time, distance, and change can serve as proxies for each other under noisy conditions. Effective concepts and names for duration provide a first step towards a grounded lexicon for temporal interval logic.
Article
Full-text available
We consider the problem of speaker diarization, the problem of segmenting an audio recording of a meeting into temporal segments corresponding to individual speakers. The problem is rendered particularly difficult by the fact that we are not allowed to assume knowledge of the number of people participating in the meeting. To address this problem, we take a Bayesian nonparametric approach to speaker diarization that builds on the hierarchical Dirichlet process hidden Markov model (HDP-HMM) of Teh et al. [J. Amer. Statist. Assoc. 101 (2006) 1566--1581]. Although the basic HDP-HMM tends to over-segment the audio data---creating redundant states and rapidly switching among them---we describe an augmented HDP-HMM that provides effective control over the switching rate. We also show that this augmentation makes it possible to treat emission distributions nonparametrically. To scale the resulting architecture to realistic diarization problems, we develop a sampling algorithm that employs a truncated approximation of the Dirichlet process to jointly resample the full state sequence, greatly improving mixing rates. Working with a benchmark NIST data set, we show that our Bayesian nonparametric architecture yields state-of-the-art speaker diarization results.
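The effective control over the switching rate can be illustrated, under a weak-limit (truncated) approximation, by giving each Dirichlet transition row an extra self-transition mass kappa, as in the sketch below; the truncation level, hyperparameter values, and the weak-limit form itself are illustrative assumptions, not the paper's full sampler.

import numpy as np

rng = np.random.default_rng(0)
L, alpha, kappa = 10, 1.0, 20.0            # truncation level and hyperparameters (assumed)
beta = rng.dirichlet(np.ones(L))           # shared top-level state weights

def sample_transition_matrix(sticky):
    # Each row j ~ Dirichlet(alpha * beta + kappa * e_j); kappa biases self-transitions.
    pi = np.empty((L, L))
    for j in range(L):
        concentration = alpha * beta + (kappa if sticky else 0.0) * np.eye(L)[j]
        pi[j] = rng.dirichlet(concentration)
    return pi

for sticky in (False, True):
    pi = sample_transition_matrix(sticky)
    label = "sticky" if sticky else "plain"
    print(label, "mean self-transition probability:", round(float(np.mean(np.diag(pi))), 3))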
Conference Paper
Full-text available
The work presents a new approach to the problem of simultaneous localization and mapping - SLAM - inspired by computational models of the hippocampus of rodents. The rodent hippocampus has been extensively studied with respect to navigation tasks, and displays many of the properties of a desirable SLAM solution. RatSLAM is an implementation of a hippocampal model that can perform SLAM in real time on a real robot. It uses a competitive attractor network to integrate odometric information with landmark sensing to form a consistent representation of the environment. Experimental results show that RatSLAM can operate with ambiguous landmark information and recover from both minor and major path integration errors.
Conference Paper
Full-text available
The ability to learn a consistent model of its environment is a prerequisite for autonomous mobile robots. A particularly challenging problem in acquiring environment maps is that of closing loops; loops in the environment create challenging data association problems [J.-S. Gutman et al., 1999]. This paper presents a novel algorithm that combines Rao-Blackwellized particle filtering and scan matching. In our approach, scan matching is used for minimizing odometric errors during mapping. A probabilistic model of the residual errors of the scan matching process is then used for the resampling steps. In this way, the number of samples required is greatly reduced. Simultaneously, we reduce the particle depletion problem that typically prevents the robot from closing large loops. We present extensive experiments that illustrate the superior performance of our approach compared to previous approaches.
Conference Paper
Full-text available
To navigate reliably in indoor environments, a mobile robot must know where it is. Thus, reliable position estimation is a key problem in mobile robotics. We believe that probabilistic approaches are among the most promising candidates for providing a comprehensive and real-time solution to the robot localization problem. However, current methods still face considerable hurdles. In particular, the problems encountered are closely related to the type of representation used to represent probability densities over the robot's state space. Earlier work on Bayesian filtering with particle-based density representations opened up a new approach for mobile robot localization based on these principles. We introduce the Monte Carlo localization method, where we represent the probability density involved by maintaining a set of samples that are randomly drawn from it. By using a sampling-based representation we obtain a localization method that can represent arbitrary distributions. We show experimentally that the resulting method is able to efficiently localize a mobile robot without knowledge of its starting location. It is faster, more accurate, and less memory-intensive than earlier grid-based methods.
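A minimal one-dimensional sketch of the sample-based representation described above: the belief over the robot's position is a set of samples that are propagated with a noisy motion model, weighted by a sensor likelihood, and resampled. The corridor world, landmark positions, and noise levels are illustrative assumptions only, not the paper's experimental setup.

import numpy as np

rng = np.random.default_rng(1)
LANDMARKS = np.array([2.0, 6.0, 9.0])    # known landmark positions along a corridor (assumed)
N = 500                                   # number of particles

def sense(true_pos):
    # Noisy measured distance to the nearest landmark.
    return np.min(np.abs(LANDMARKS - true_pos)) + rng.normal(0.0, 0.2)

def likelihood(particles, z):
    expected = np.min(np.abs(LANDMARKS[None, :] - particles[:, None]), axis=1)
    return np.exp(-0.5 * ((z - expected) / 0.2) ** 2)

particles = rng.uniform(0.0, 10.0, size=N)   # global localization: start from a uniform prior
true_pos = 1.0

for step in range(15):
    u = 0.5                                            # commanded forward motion
    true_pos += u
    particles += u + rng.normal(0.0, 0.1, size=N)      # motion update with noise
    w = likelihood(particles, sense(true_pos))         # measurement update
    w /= np.sum(w)
    particles = particles[rng.choice(N, size=N, p=w)]  # importance resampling

print("true position:", round(true_pos, 2), "particle mean:", round(float(np.mean(particles)), 2))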
Article
Humans develop their concept of an object by classifying it into a category, and acquire language by interacting with others at the same time. Thus, the meaning of a word can be learnt by connecting the recognized word and concept. We consider such an ability to be important in allowing robots to flexibly develop their knowledge of language and concepts. Accordingly, we propose a method that enables robots to acquire such knowledge. The object concept is formed by classifying multimodal information acquired from objects, and the language model is acquired from human speech describing object features. We propose a stochastic model of language and concepts, and knowledge is learnt by estimating the model parameters. The important point is that language and concepts are interdependent. There is a high probability that the same words will be uttered to objects in the same category. Similarly, objects to which the same words are uttered are highly likely to have the same features. Using this relation, the accuracy of both speech recognition and object classification can be improved by the proposed method. However, it is difficult to directly estimate the parameters of the proposed model, because many parameters are required. Therefore, we approximate the proposed model, and estimate its parameters using a nested Pitman-Yor language model and multimodal latent Dirichlet allocation to acquire the language and concept, respectively. The experimental results show that the accuracy of speech recognition and object classification is improved by the proposed method.
Conference Paper
In this study, we propose a method for concept formation and word acquisition for robots. The proposed method is based on multimodal latent Dirichlet allocation (MLDA) and the nested Pitman-Yor language model (NPYLM). A robot obtains haptic, visual, and auditory information by grasping, observing, and shaking an object. At the same time, a user teaches object features to the robot through speech, which is recognized using only acoustic models and transformed into phoneme sequences. As the robot is supposed to have no language model in advance, the recognized phoneme sequences include many phoneme recognition errors. Moreover, the recognized phoneme sequences with errors are segmented into words in an unsupervised manner; however, not all words are necessarily segmented correctly. The words including these errors have a negative effect on the learning of word meanings. To overcome this problem, we propose a method to improve unsupervised word segmentation and to reduce phoneme recognition errors by using multimodal object concepts. In the proposed method, object concepts are used to enhance the accuracy of word segmentation, reduce phoneme recognition errors, and correct words so as to improve the categorization accuracy. We experimentally demonstrate that the proposed method can improve the accuracy of word segmentation and reduce the phoneme recognition error and that the obtained words enhance the categorization accuracy.
Conference Paper
In this paper we present an algorithm for the unsupervised segmentation of a lattice produced by a phoneme recognizer into words. Using a lattice rather than a single phoneme string accounts for the uncertainty of the recognizer about the true label sequence. An example application is the discovery of lexical units from the output of an error-prone phoneme recognizer in a zero-resource setting, where neither the lexicon nor the language model (LM) is known. We propose a computationally efficient iterative approach, which alternates between the following two steps: First, the most probable string is extracted from the lattice using a phoneme LM learned on the segmentation result of the previous iteration. Second, word segmentation is performed on the extracted string using a word and phoneme LM which is learned alongside the new segmentation. We present results on lattices produced by a phoneme recognizer on the WSJ-CAM0 dataset. We show that our approach delivers segmentation performance superior to an earlier approach found in the literature, in particular for higher-order language models.
Article
Progress in information technologies has enabled the application of computer-intensive methods to statistical analysis. In time series modeling, the sequential Monte Carlo method was developed for general nonlinear non-Gaussian state-space models, and it makes it possible to consider very complex nonlinear non-Gaussian models for real-world problems. In this paper, we consider several computational problems associated with the sequential Monte Carlo filter and smoother, such as the use of a huge number of particles, the two-filter formula for smoothing, and parallel computation. The posterior mean smoother and the Gaussian-sum smoother are also considered.
Conference Paper
Previous studies have shown how Lingodroids, language learning mobile robots, learn terms for space and time, connecting their personal maps of the world to a publicly shared language. One caveat of previous studies was that the robots shared the same cognitive architecture, identical in all respects from sensors to mapping systems. In this paper we investigate the question of how terms for space can be developed between robots that have fundamentally different sensors and spatial representations. In the real world, communication needs to occur between agents that have different embodiment and cognitive capabilities, including different sensors, different representations of the world, and different species (including humans). The novel aspect of these studies is that one robot uses a forward-facing camera to estimate appearance and uses a biologically inspired continuous attractor network to generate a topological map; the other robot uses a laser scanner to estimate range and uses a probabilistic filter approach to generate an occupancy grid. The robots hold conversations in different locations to establish a shared language. Despite their different ways of sensing and mapping the world, the robots are able to create coherent lexicons for the space around them.
Conference Paper
In this paper, we propose an online algorithm for multimodal categorization based on autonomously acquired multimodal information and partial words given by human users. For multimodal concept formation, multimodal latent Dirichlet allocation (MLDA) using Gibbs sampling is extended to an online version. We introduce a particle filter, which significantly improves the performance of the online MLDA, to keep track of good models among various models with different parameters. We also introduce an unsupervised word segmentation method based on the hierarchical Pitman-Yor language model (HPYLM). Since the HPYLM requires no predefined lexicon, we can build a robot system that learns concepts and words in a completely unsupervised manner. The proposed algorithms are implemented on a real robot and tested using real everyday objects to show the validity of the proposed system.
Conference Paper
The ability to simultaneously localize a robot and accurately map its surroundings is considered by many to be a key prerequisite of truly autonomous robots. However, few approaches to this problem scale up to handle the very large number of landmarks present in real environments. Kalman filter-based algorithms, for example, require time quadratic in the number of landmarks to incorporate each sensor observation. This paper presents FastSLAM, an algorithm that recursively estimates the full posterior distribution over robot pose and landmark locations, yet scales logarithmically with the number of landmarks in the map. The algorithm is based on an exact factorization of the posterior into a product of conditional landmark distributions and a distribution over robot paths. It has been run successfully on environments with as many as 50,000 landmarks, far beyond the reach of previous approaches. Experimental results demonstrate the advantages and limitations of the FastSLAM algorithm on both simulated and real-world data.
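The factorization FastSLAM exploits, namely that landmark estimates become independent once a robot path is fixed, can be sketched compactly: each particle holds a pose hypothesis plus one small Gaussian per landmark. The Python sketch below assumes a planar robot, known data association, and direct relative-offset measurements (so the observation Jacobian is the identity); these are simplifications for illustration and not the original algorithm's measurement model.

import copy
import numpy as np

class Particle:
    """One FastSLAM-style particle: a pose hypothesis plus one independent
    Gaussian (mean, covariance) per landmark, conditioned on that pose path."""
    def __init__(self, pose):
        self.pose = np.asarray(pose, dtype=float)    # planar (x, y) position
        self.landmarks = {}                           # landmark id -> (mean, cov)

def fastslam_step(particles, control, measurements, motion_noise=0.05, meas_noise=0.1, rng=None):
    """One predict/update/resample step with known data association.
    control is a 2-vector displacement; measurements are (landmark_id, offset)
    pairs, where offset is the landmark position relative to the robot."""
    rng = rng or np.random.default_rng()
    R = (meas_noise ** 2) * np.eye(2)
    weights = np.ones(len(particles))
    for i, p in enumerate(particles):
        # Sample a new pose from the motion model (prediction step).
        p.pose = p.pose + np.asarray(control, dtype=float) + rng.normal(0.0, motion_noise, 2)
        for lid, offset in measurements:
            observed = p.pose + np.asarray(offset, dtype=float)
            if lid not in p.landmarks:
                p.landmarks[lid] = (observed, R.copy())   # initialize a new landmark
                continue
            mean, cov = p.landmarks[lid]
            innovation = observed - mean
            S_inv = np.linalg.inv(cov + R)
            # Weight by the measurement likelihood, then update only this
            # landmark's Gaussian (landmarks are independent given the path).
            weights[i] *= float(np.exp(-0.5 * innovation @ S_inv @ innovation))
            K = cov @ S_inv                               # Kalman gain
            p.landmarks[lid] = (mean + K @ innovation, (np.eye(2) - K) @ cov)
    weights /= weights.sum()
    idx = rng.choice(len(particles), size=len(particles), p=weights)
    return [copy.deepcopy(particles[j]) for j in idx]     # resampling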
Article
This paper describes new language-processing methods suitable for human–robot interfaces. These methods enable a robot to learn linguistic knowledge from scratch in unsupervised ways. The learning is done through statistical optimization in the process of human–robot communication, combining speech, visual, and behavioral information in a probabilistic framework. The linguistic knowledge learned includes speech units like phonemes, lexicon, and grammar, and is represented by a graphical model that includes hidden Markov models. In experiments, a robot was eventually able to understand utterances according to given situations, and act appropriately.
Conference Paper
This paper proposes a method for the unsupervised learning of place names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we proposed a lexical learning method based on statistical model selection. This method can learn words that represent a single object, such as proper nouns, but cannot learn words that represent classes of objects, such as general nouns. This paper describes improvements to the method for learning both the phoneme sequence of each word and the distribution of objects that the word represents.
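As a rough illustration of learning a spatial distribution per word, the following Python sketch fits one Gaussian over robot positions for each place word, assuming the words have already been recognized; the variance floor and diagonal covariance are illustrative assumptions, and the phoneme-sequence learning described in the abstract is not covered here.

import numpy as np
from collections import defaultdict

def fit_place_word_distributions(word_position_pairs, var_floor=1e-3):
    """Fit one Gaussian over 2-D robot positions for each place word.
    word_position_pairs: (word, (x, y)) observations; the words are assumed
    to be already recognized, which skips the lexical learning step."""
    grouped = defaultdict(list)
    for word, position in word_position_pairs:
        grouped[word].append(position)
    distributions = {}
    for word, positions in grouped.items():
        X = np.asarray(positions, dtype=float)
        mean = X.mean(axis=0)
        cov = np.diag(X.var(axis=0) + var_floor)   # diagonal covariance with a floor
        distributions[word] = (mean, cov)
    return distributions

# A general noun such as "kitchen", heard at many nearby positions, acquires a
# broad distribution; a word heard at a single spot stays narrow.
dists = fit_place_word_distributions(
    [("kitchen", (1.0, 2.0)), ("kitchen", (1.4, 1.7)), ("door", (5.0, 0.3))])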
Conference Paper
In this paper, we propose a nonparametric Bayesian framework for categorizing multimodal sensory signals such as audio, visual, and haptic information by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints as well as to listen to the sound during the observation. The multimodal information enables the robot to form human-like object categories that are a basis of intelligence. The proposed method is an extension of the Hierarchical Dirichlet Process (HDP), a nonparametric Bayesian model, to a multimodal HDP (MHDP). MHDP can estimate the number of categories, whereas parametric models, e.g., LDA-based categorization, require the number to be specified in advance. As this is an unsupervised learning method, a human user does not need to give any correct labels to the robot, and the robot can classify objects autonomously. At the same time, the proposed method provides a probabilistic framework for inferring object properties from limited observations. The validity of the proposed method is shown through experimental results.
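The practical consequence of the nonparametric prior is that the number of categories is inferred rather than fixed. The Python sketch below samples assignments from a Chinese restaurant process, the building block underlying HDP-style models, to show how new categories open as data arrive; the concentration value is an illustrative assumption, and this is not the MHDP inference procedure itself.

import numpy as np

def sample_crp_assignments(n_items, alpha=1.0, rng=None):
    """Sample category assignments from a Chinese restaurant process prior.
    The number of distinct categories is not fixed in advance: each item joins
    an existing category with probability proportional to its current size, or
    opens a new category with probability proportional to alpha."""
    rng = rng or np.random.default_rng()
    counts = []            # counts[k] = number of items currently in category k
    assignments = []
    for _ in range(n_items):
        probs = np.array(counts + [alpha], dtype=float)
        probs /= probs.sum()
        k = int(rng.choice(len(probs), p=probs))
        if k == len(counts):
            counts.append(1)       # a new category is created
        else:
            counts[k] += 1
        assignments.append(k)
    return assignments

# The number of inferred categories grows slowly with the data (roughly alpha * log n).
print(len(set(sample_crp_assignments(200, alpha=1.0))))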
Conference Paper
One major bottleneck in conversational systems is their inability to interpret unexpected user language inputs such as out-of-vocabulary words. To overcome this problem, conversational systems must be able to learn new words automatically during human-machine conversation. Motivated by psycholinguistic findings on eye gaze and human language processing, we are developing techniques to incorporate human eye gaze for automatic word acquisition in multimodal conversational systems. This paper investigates the use of temporal alignment between speech and eye gaze and the use of domain knowledge in word acquisition. Our experimental results indicate that eye gaze provides a potential channel for automatically acquiring new words. The use of extra temporal and domain knowledge can significantly improve acquisition performance.
Article
In this paper we propose a latent Dirichlet allocation (LDA)-based framework for multimodal categorization and word grounding by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints, as well as to listen to the sound during the observation period. This multimodal information is used for categorizing and forming multimodal concepts using multimodal LDA. At the same time, the words acquired during the observation period are connected to the related concepts, which are represented by the multimodal LDA. We also provide a relevance measure that encodes the degree of connection between words and modalities. The proposed algorithm is implemented on a robot platform, and experiments are carried out to evaluate it. We also demonstrate a simple conversation between a user and the robot based on the learned model.
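A rough sketch of the word-grounding step only (not the multimodal LDA itself): co-occurrence counts between heard words and formed categories, normalized with a small smoothing constant, give a simple p(word | category) table. The function name and smoothing value below are assumptions for illustration.

from collections import defaultdict

def ground_words_to_categories(word_category_pairs, smoothing=0.1):
    """Estimate p(word | category) from word/category co-occurrence counts
    collected while the robot hears utterances during object observation."""
    counts = defaultdict(lambda: defaultdict(float))
    vocabulary = set()
    for word, category in word_category_pairs:
        counts[category][word] += 1.0
        vocabulary.add(word)
    tables = {}
    for category, word_counts in counts.items():
        total = sum(word_counts.values()) + smoothing * len(vocabulary)
        tables[category] = {w: (word_counts.get(w, 0.0) + smoothing) / total
                            for w in vocabulary}
    return tables

# Example: the word "cup" becomes strongly associated with category 0.
tables = ground_words_to_categories([("cup", 0), ("cup", 0), ("ball", 1)])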
Article
To tackle the vocabulary problem in conversational systems, previous work has applied unsupervised learning approaches to co-occurring speech and eye gaze during interaction to automatically acquire new words. Although these approaches have shown promise, several issues related to human language behavior and human-machine conversation have not been addressed. First, psycholinguistic studies have shown certain temporal regularities between human eye movement and language production. While these regularities can potentially guide the acquisition process, they have not been incorporated in the previous unsupervised approaches. Second, conversational systems generally have an existing knowledge base about the domain and vocabulary. While the existing knowledge can potentially help bootstrap and constrain the acquired new words, it has not been incorporated in the previous models. Third, eye gaze could serve different functions in human-machine conversation. Some gaze streams may not be closely coupled with the speech stream, and thus are potentially detrimental to word acquisition. Automated recognition of closely coupled speech-gaze streams based on conversation context is important. To address these issues, we developed new approaches that incorporate user language behavior, domain knowledge, and conversation context in word acquisition. We evaluated these approaches in the context of situated dialogue in a virtual world. Our experimental results have shown that incorporating the above three types of contextual information significantly improves word acquisition performance.
Article
RatSLAM is a biologically inspired visual SLAM and navigation system that has been shown to be effective indoors and outdoors on real robots. The spatial representation at the core of RatSLAM, the experience map, forms in a distributed fashion as the robot learns the environment. The activity in RatSLAM's experience map possesses some geometric properties, but still does not represent the world in a human-readable form. A new system, dubbed RatChat, has been introduced to enable meaningful communication with the robot. The intention is to use the "language games" paradigm to build spatial concepts that can be used as the basis for communication. This paper describes the first step in the language game experiments, showing the potential for meaningful categorization of the spatial representations in RatSLAM.
Article
The problem of comparing two different partitions of a finite set of objects reappears continually in the clustering literature. We begin by reviewing a well-known measure of partition correspondence often attributed to Rand (1971), discuss the issue of correcting this index for chance, and note that a recent normalization strategy developed by Morey and Agresti (1984) and adopted by others (e.g., Milligan and Cooper 1985) is based on an incorrect assumption. Then, the general problem of comparing partitions is approached indirectly by assessing the congruence of two proximity matrices using a simple cross-product measure. These matrices are generated from the corresponding partitions using various scoring rules. Derivable special cases include traditionally familiar statistics and ones tailored to weight certain object pairs differentially. Finally, we propose a measure based on the comparison of object triples having the advantage of a probabilistic interpretation in addition to being corrected for chance (i.e., assuming a constant value under a reasonable null hypothesis) and bounded between -1 and +1.
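The chance-corrected index described here can be computed directly from the contingency table of the two partitions. The Python sketch below implements the standard Hubert-Arabie adjusted Rand index from that description; it is a generic implementation, not the paper's own notation or the triple-based measure it additionally proposes.

from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Hubert-Arabie adjusted Rand index between two partitions of the same
    items: 1 for identical partitions, approximately 0 in expectation under
    random partitions with the same cluster sizes."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Pair counts from the contingency table and from each partition's margins.
    sum_ij = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)
    max_index = 0.5 * (sum_a + sum_b)
    return (sum_ij - expected) / (max_index - expected)

# Identical partitions up to label permutation score 1.0.
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))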
Fig. 12. States of particles: before the teaching utterance (top); after the teaching utterance (bottom). The uttered sentence is "kokowa ** dayo" (meaning "Here is **"), where "**" is the name of the place.
Computational aspects of sequential Monte Carlo filter and smoother
  • G Kitagawa
G. Kitagawa, "Computational aspects of sequential Monte Carlo filter and smoother," Annals of the Institute of Statistical Mathematics, vol. 66, no. 3, pp. 443–471, 2014.
Simulator platform that enables social interaction simulation –SIGVerse: SocioIntelliGenesis simulator–
  • T Inamura
  • T Shibata
  • H Sena
  • T Hashimoto
  • N Kawai
  • T Miyashita
  • Y Sakurai
  • M Shimizu
  • M Otake
  • K Hosoda
T. Inamura, T. Shibata, H. Sena, T. Hashimoto, N. Kawai, T. Miyashita, Y. Sakurai, M. Shimizu, M. Otake, K. Hosoda, et al., "Simulator platform that enables social interaction simulation –SIGVerse: SocioIntelliGenesis simulator–," in IEEE/SICE International Symposium on System Integration, 2010, pp. 212–217.
Robots that learn to converse: Developmental approach to situated language processing
  • N Iwahashi
  • R Taguchi
  • K Sugiura
  • K Funakoshi
  • M Nakano
N. Iwahashi, R. Taguchi, K. Sugiura, K. Funakoshi, and M. Nakano, "Robots that learn to converse: Developmental approach to situated language processing," in Proceedings of International Symposium on Speech and Language Processing, 2009, pp. 532-537.
Mutual learning of an object concept and language model based on MLDA and NPYLM
  • T Nakamura
  • T Nagai
  • K Funakoshi
  • S Nagasaka
  • T Taniguchi
  • N Iwahashi
T. Nakamura, T. Nagai, K. Funakoshi, S. Nagasaka, T. Taniguchi, and N. Iwahashi, "Mutual learning of an object concept and language model based on MLDA and NPYLM," in IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2014, pp. 600-607.
Mutual learning of an object concept and language model based on MLDA and NPYLM
  • T Nakamura