Conference Paper

Simultaneous Estimation of Self-position and Word from Noisy Utterances and Sensory Information


Abstract

In this paper, we propose a novel learning method that can simultaneously estimate the self-position of a robot and the names of places. The robot moves in a room environment and performs probabilistic self-localization based on noisy sensory information. Speech recognition results contain uncertainty at the phoneme or syllable level because the robot has no lexical knowledge in advance. The purpose of this study is to reduce the uncertainty of both self-position and speech recognition using knowledge about place names obtained from human speech. The proposed method integrates ambiguous speech recognition results with the self-localization method, i.e., Monte Carlo localization, in a Bayesian framework. Probability distributions over places and the speech recognition error are modeled by the proposed method. We implemented the proposed method in SIGVerse, a simulation environment. Experimental results showed that the robot can acquire the names of several places and use this knowledge to reduce the uncertainty in the estimate of its position in a self-localization task. In addition, we evaluated the performance of the lexical acquisition task for the names of places and showed its effectiveness. The results showed that the robot could acquire spatial concepts by integrating noisy information from sensors and speech.
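To make the integration concrete, here is a minimal sketch (not the authors' implementation) of how an ambiguous word observation can reweight the particles of Monte Carlo localization. It assumes each candidate place name has a Gaussian position distribution learned from earlier (utterance, position) pairs; all names and shapes are illustrative.

```python
# Minimal sketch: fusing a noisy word observation into MCL particle weights.
# Assumes place_models holds a Gaussian N(mu, cov) per place name, learned
# from past (utterance, position) pairs. Illustrative only.
import numpy as np

def gaussian_pdf(x, mu, cov):
    d = x - mu
    norm = np.sqrt((2 * np.pi) ** len(x) * np.linalg.det(cov))
    return np.exp(-0.5 * d @ np.linalg.inv(cov) @ d) / norm

def word_measurement_update(particles, weights, word_probs, place_models):
    """Reweight MCL particles with an ambiguous speech observation.

    particles    : (N, 2) array of candidate robot positions.
    weights      : (N,) current importance weights.
    word_probs   : dict {place_name: P(name | utterance)} taken from the
                   speech recognizer's (noisy) hypothesis list.
    place_models : dict {place_name: (mu, cov)} position distributions.
    """
    likelihood = np.zeros(len(particles))
    for name, p_word in word_probs.items():
        mu, cov = place_models[name]
        likelihood += p_word * np.array(
            [gaussian_pdf(x, mu, cov) for x in particles])
    weights = weights * likelihood
    return weights / weights.sum()
```

After this update, resampling proceeds exactly as in standard MCL, so the heard word acts as one additional measurement model alongside the range sensors.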


... Spatial concepts refer to the knowledge of the place categories autonomously formed based on the multimodal information of the spatial experience acquired by the robot in the environment according to [8][9][10][11][12][13]. Therefore, we believe that the spatial concept works well not only for this particular task but also for other tasks related to place and object placement. ...
... It is important to appropriately generalize and form place categories based on object positions while dealing with the uncertainty of the observations. To solve these issues, unsupervised learning approaches for spatial concepts were utilized in studies related to autonomous place categorization by a robot [8][9][10][11][12][13]. Taniguchi et al. proposed nonparametric Bayesian spatial concept acquisition methods, SpCoA [10] and SpCoA++ [11], which integrate self-localization and unsupervised word segmentation from speech signals as PGMs through the latent variables of spatial concepts. ...
Article
Tidy-up tasks by service robots in home environments are challenging in robotics applications because they involve various interactions with the environment. In particular, robots are required not only to grasp, move, and release various home objects but also to plan the order and positions for placing the objects. In this paper, we propose a novel planning method that can efficiently estimate the order and positions of the objects to be tidied up by learning the parameters of a probabilistic generative model. The model allows a robot to learn the distributions of the co-occurrence probability of the objects and places to tidy up using the multimodal sensor information collected in a tidied environment. Additionally, we develop an autonomous robotic system to perform the tidy-up operation. We evaluate the effectiveness of the proposed method by an experimental simulation that reproduces the conditions of the Tidy Up Here task of the World Robot Summit 2018 international robotics competition. The simulation results show that the proposed method enables the robot to successively tidy up several objects and achieves the best task score among the considered baseline tidy-up methods.
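As a rough illustration of the planning idea (an assumption on our part: the learned model is reduced here to a simple object-place co-occurrence table, whereas the paper uses a full probabilistic generative model over multimodal observations), the following sketch assigns each object its most probable place and tidies the least ambiguous objects first.

```python
# Illustrative sketch of co-occurrence-based tidy-up planning; names and
# counts are made up, and the real model is a richer generative model.
import numpy as np

def plan_tidy_up(cooccurrence, objects, places):
    """cooccurrence[i, j] ~ counts of object i observed at place j in a
    tidied environment; returns (object, place, prob) sorted by confidence."""
    probs = cooccurrence / cooccurrence.sum(axis=1, keepdims=True)
    plan = []
    for i, obj in enumerate(objects):
        j = int(np.argmax(probs[i]))
        plan.append((obj, places[j], float(probs[i, j])))
    return sorted(plan, key=lambda t: -t[2])  # most confident first

plan = plan_tidy_up(np.array([[9, 1], [2, 8], [5, 5]]),
                    ["cup", "book", "toy"], ["kitchen", "shelf"])
# -> cup goes to the kitchen first; the ambiguous toy is handled last.
```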
... This is because 2D semantic mapping is challenging, and a 2D semantic map can be applied to autonomous vacuum cleaner robots. Taniguchi et al. (2016a) proposed a method that estimated words related to places and performed self-localization by Monte Carlo localization (MCL) simultaneously. Ishibushi et al. (2015) proposed a self-localization method that integrated semantic information obtained from image recognition performed by a CNN, following an idea proposed by Taniguchi et al. (2016a). ...
... Taniguchi et al. (2016a) proposed a method that estimated words related to places and performed self-localization by Monte Carlo localization (MCL) simultaneously. Ishibushi et al. (2015) proposed a self-localization method that integrated semantic information obtained from image recognition performed by a CNN, following an idea proposed by Taniguchi et al. (2016a). Taniguchi et al. proposed SpCoA and an extension (Taniguchi et al., 2016b) that integrated a generative model for self-localization and unsupervised word segmentation in uttered sentences via the latent variables related to the spatial concept. ...
Article
Full-text available
An autonomous robot performing tasks in a human environment needs to recognize semantic information about places. Semantic mapping is a task in which suitable semantic information is assigned to an environmental map so that a robot can communicate with people and appropriately perform tasks requested by its users. We propose a novel statistical semantic mapping method called SpCoMapping, which integrates probabilistic spatial concept acquisition based on multimodal sensor information with a Markov random field applied for learning the arbitrary shape of a place on a map. SpCoMapping can connect multiple words to a place in the semantic mapping process using user utterances, without presetting a list of place names. We also develop a nonparametric Bayesian extension of SpCoMapping that can automatically estimate an adequate number of categories. In experiments in simulation environments, we showed that the proposed method generated better semantic maps than previous semantic mapping methods; our semantic maps have categories and shapes similar to the ground truth provided by the user. In addition, we showed that SpCoMapping could generate appropriate semantic maps in a real-world environment.
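The role of the Markov random field can be illustrated with a minimal sketch: iterated conditional modes (ICM) over a grid of place labels, trading each cell's observation likelihood against agreement with its 4-neighborhood under a Potts-style prior. This illustrates the MRF smoothing idea only, not SpCoMapping's actual inference.

```python
# Minimal ICM sketch of MRF label smoothing on a grid map (illustrative).
import numpy as np

def icm_semantic_map(log_lik, beta=1.0, iters=10):
    """log_lik: (H, W, K) per-cell log-likelihood of each of K place labels.
    Returns an (H, W) label map smoothed by a Potts-style MRF prior."""
    H, W, K = log_lik.shape
    labels = log_lik.argmax(axis=2)
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                nbrs = [labels[v, u]
                        for v, u in ((y-1, x), (y+1, x), (y, x-1), (y, x+1))
                        if 0 <= v < H and 0 <= u < W]
                score = log_lik[y, x].copy()
                for k in range(K):
                    # reward agreement with neighboring labels
                    score[k] += beta * sum(n == k for n in nbrs)
                labels[y, x] = int(score.argmax())
    return labels
```

Raising beta favors larger contiguous regions, which is how the MRF lets a place take an arbitrary shape that follows walls and obstacles rather than a fixed parametric footprint.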
... Their methods enabled more accurate object categorization by using multimodal information. Taniguchi et al. (2016a) proposed a method for simultaneous estimation of self-positions and words from noisy sensory information and an uttered word. Their method integrated ambiguous speech recognition results with the self-localization method for learning spatial concepts. ...
... Their method integrated ambiguous speech recognition results with the self-localization method for learning spatial concepts. However, Taniguchi et al. (2016a) assumed that the name of a place would be learned from an uttered word. Taniguchi et al. (2016b) proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) based on place categorization and unsupervised word segmentation. ...
Article
Full-text available
In this paper, we propose a Bayesian generative model that can form multiple categories based on each sensory-channel and can associate words with any of four sensory-channels (action, position, object, and color). This paper focuses on cross-situational learning using the co-occurrence between words and information of sensory-channels in complex situations rather than conventional situations of cross-situational learning. We conducted a learning scenario using a simulator and a real humanoid iCub robot. In the scenario, a human tutor provided a sentence that describes an object of visual attention and an accompanying action to the robot. The scenario was set as follows: the number of words per sensory-channel was three or four, and the number of trials for learning was 20 and 40 for the simulator and 25 and 40 for the real robot. The experimental results showed that the proposed method was able to estimate the multiple categorizations and to learn the relationships between multiple sensory-channels and words accurately. In addition, we conducted an action generation task and an action description task based on word meanings learned in the cross-situational learning scenario. The experimental results showed the robot could successfully use the word meanings learned by using the proposed method.
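The core cross-situational mechanism can be illustrated with a far simpler counting model than the paper's Bayesian one: accumulate word/channel-value co-occurrence counts over trials and read word meanings off the normalized counts. Everything below is an illustrative stand-in, not the proposed model.

```python
# Illustrative cross-situational learning by co-occurrence counting.
from collections import defaultdict

counts = defaultdict(lambda: defaultdict(int))

def observe(sentence_words, channel_values):
    """channel_values: e.g. {'action': 'push', 'color': 'red', ...}"""
    for w in sentence_words:
        for ch, val in channel_values.items():
            counts[w][(ch, val)] += 1

def meaning(word):
    """Normalized distribution over (channel, value) pairs for a word."""
    total = sum(counts[word].values())
    return {k: c / total for k, c in counts[word].items()}

observe(["push", "red", "ball"],
        {"action": "push", "color": "red", "object": "ball"})
observe(["pull", "red", "box"],
        {"action": "pull", "color": "red", "object": "box"})
# meaning("red") increasingly peaks at ("color", "red") as trials accumulate,
# because only that pairing co-occurs across situations.
```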
... In this paper, we aim to develop a method that enables mobile robots to learn spatial concepts and an environmental map sequentially from interaction with the environment and humans, even in an unknown environment without prior knowledge. Taniguchi et al. [4] proposed a method that integrated ambiguous speech-recognition results with the self-localization method for learning spatial concepts. In addition, Taniguchi et al. [5] proposed the nonparametric Bayesian spatial concept acquisition method (SpCoA) based on an unsupervised word-segmentation method known as latticelm [6]. ...
... On the other hand, Ishibushi et al. [7] proposed a self-localization method that exploits image features using a convolutional neural network (CNN) [8]. These methods [4], [5], [7] cannot cope with changes in the names of places and the environment because these methods use batch learning algorithms. In addition, these methods cannot learn spatial concepts from unknown environments without a map, i.e., the robot needs to have a map generated by SLAM beforehand. ...
Conference Paper
Full-text available
In this paper, we propose an online learning algorithm based on a Rao-Blackwellized particle filter for spatial concept acquisition and mapping. We previously proposed a nonparametric Bayesian spatial concept acquisition model (SpCoA). Here, we propose a novel method (SpCoSLAM) that integrates SpCoA and FastSLAM in the theoretical framework of the Bayesian generative model. The proposed method can simultaneously learn place categories and lexicons while incrementally generating an environmental map. Furthermore, the proposed method adds scene image features and a language model to SpCoA. In the experiments, we tested online learning of spatial concepts and environmental maps in a novel environment for which the robot did not have a map, and evaluated the results of online learning of spatial concepts and lexical acquisition. The experimental results demonstrated that, using the proposed method, the robot was able to incrementally learn more accurate relationships between words and places in the environmental map.
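A plausible reading of the weight update (hedged; the paper's exact derivation may differ) is that each particle i carries a pose x_t^{(i)}, a map m^{(i)}, and spatial-concept statistics \Theta^{(i)}, and its importance weight multiplies the FastSLAM scan likelihood by a word likelihood:

w_t^{(i)} \propto p(z_t \mid x_t^{(i)}, m_{t-1}^{(i)}) \; p(y_t \mid x_t^{(i)}, \Theta_{t-1}^{(i)}),

where z_t is the range scan and y_t the speech observation at time t. Resampling on these weights then couples mapping, localization, and lexical acquisition in a single online filter.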
... SpCoMapping is an extended method of spatial concept formation [9], [10], [11], [12], using a Markov random field (MRF) for semantic mapping [2]. SpCoMapping learns the vocabulary representing a place and a region simultaneously, taking into account the shapes of the environment and obstacles. ...
Conference Paper
In semantic mapping, which connects semantic information to an environment map, it is a challenging task for robots to deal with both local and global information of environments. In addition, it is important to estimate semantic information of unobserved areas from already acquired partial observations in a newly visited environment. On the other hand, previous studies on spatial concept formation enabled a robot to relate multiple words to places from bottom-up observations even when the vocabulary was not provided beforehand. However, the robot could not transfer global information related to the room arrangement between semantic maps from other environments. In this paper, we propose SpCoMapGAN, which generates the semantic map in a newly visited environment by training an inference model using previously estimated semantic maps. SpCoMapGAN uses generative adversarial networks (GANs) to transfer semantic information based on room arrangements to a newly visited environment. Our proposed method assigns semantics to the map of an unknown environment using the prior distribution of the map trained in known environments and the multimodal observations made in the unknown environment. We experimentally show in simulation that SpCoMapGAN can use global information for estimating the semantic map and is superior to previous methods. Finally, we also demonstrate in a real environment that SpCoMapGAN can accurately 1) deal with local information, and 2) acquire the semantic information of real places.
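SpCoMapGAN's architecture is not specified in this abstract, so the following is only a minimal conditional-GAN sketch of the transfer idea: a generator maps an occupancy grid to per-cell place-category scores, and a discriminator judges (occupancy, semantic map) pairs; training on maps from known environments lets room-arrangement regularities carry over to a new one. Map resolution, category count, and the MLP architecture are all assumptions.

```python
# Minimal conditional-GAN sketch for semantic-map transfer (illustrative).
import torch
import torch.nn as nn

H = W = 32          # toy map resolution (assumption)
K = 4               # number of place categories (assumption)

G = nn.Sequential(  # occupancy grid -> per-cell category scores
    nn.Flatten(), nn.Linear(H * W, 256), nn.ReLU(),
    nn.Linear(256, H * W * K), nn.Unflatten(1, (K, H, W)))
D = nn.Sequential(  # (occupancy, semantic map) pair -> real/fake logit
    nn.Flatten(), nn.Linear(H * W * (K + 1), 256), nn.ReLU(),
    nn.Linear(256, 1))

opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(occ, sem):            # occ: (B,1,H,W), sem: (B,K,H,W) one-hot
    fake = torch.softmax(G(occ), dim=1)
    # Discriminator: real (occ, sem) pairs vs. generated pairs.
    d_real = D(torch.cat([occ, sem], dim=1))
    d_fake = D(torch.cat([occ, fake.detach()], dim=1))
    loss_d = (bce(d_real, torch.ones_like(d_real)) +
              bce(d_fake, torch.zeros_like(d_fake)))
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
    # Generator: produce semantic maps the discriminator accepts as real.
    d_fake = D(torch.cat([occ, fake], dim=1))
    loss_g = bce(d_fake, torch.ones_like(d_fake))
    opt_g.zero_grad(); loss_g.backward(); opt_g.step()
```

At deployment the trained generator would supply the prior over the new environment's semantic map, to be combined with the multimodal observations made there.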
Article
This paper describes how to achieve highly accurate unsupervised spatial lexical acquisition from speech-recognition results that include phoneme recognition errors. In most research on lexical acquisition, the robot has no pre-existing lexical knowledge and acquires sequences of phonemes as words from continuous speech signals. In a previous study, we proposed a nonparametric Bayesian spatial concept acquisition method (SpCoA) that integrates the robot's position and words obtained by unsupervised word segmentation from uncertain syllable recognition results. However, SpCoA has a critical problem in lexical acquisition: the boundaries of word segmentation are often incorrect because of frequent phoneme recognition errors. Therefore, we propose an unsupervised machine learning method (SpCoA++) for the robust lexical acquisition of novel words relating to places visited by the robot. The proposed SpCoA++ method performs an iterative estimation that learns spatial concepts and updates a language model using place information. SpCoA++ can select, from multiple word-segmentation results, a candidate containing words that better represent places by maximizing the mutual information between segmented words and spatial concepts. The experimental results demonstrate that, by basing word segmentation on place information, the proposed method significantly improves the phoneme accuracy rate of learned place-related words in comparison with conventional methods. We show that the proposed method enables the robot to acquire words from speech signals more accurately and improves the estimation accuracy of the spatial concepts.
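The selection criterion lends itself to a short sketch: given several candidate segmentations, compute the mutual information between segmented words and spatial concepts from their co-occurrence counts and keep the maximizer. This illustrates the criterion only, not the full SpCoA++ estimator.

```python
# Illustrative mutual-information selection among segmentation hypotheses.
import numpy as np

def mutual_information(pair_counts):
    """pair_counts: (n_words, n_concepts) word/concept co-occurrence counts."""
    p = pair_counts / pair_counts.sum()
    pw = p.sum(axis=1, keepdims=True)   # marginal over words
    pc = p.sum(axis=0, keepdims=True)   # marginal over concepts
    with np.errstate(divide="ignore", invalid="ignore"):
        terms = np.where(p > 0, p * np.log(p / (pw * pc)), 0.0)
    return terms.sum()

def select_segmentation(candidates):
    """candidates: list of (word, concept) count matrices, one per
    word-segmentation hypothesis; returns the index of the best one."""
    return int(np.argmax([mutual_information(c) for c in candidates]))
```

Intuitively, a segmentation whose words co-occur tightly with particular places carries more information about the spatial concepts, so phoneme-level errors that split or merge place names are penalized.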
Chapter
Full-text available
This work presents a developmental and ecological approach to language acquisition in robots, which has its roots in the interaction between infants and their caregivers. We show that the signals directed to infants by their caregivers include several cues that can facilitate language acquisition and reduce the need for preprogrammed linguistic structure. Moreover, infants also produce sounds, which enables richer types of interaction, such as imitation games, and the use of motor learning. By using a humanoid robot with embodied models of the infant's ears, eyes, vocal tract, and memory functions, we can mimic the adult-infant interaction and take advantage of the inherent structure in the signal. Two experiments are shown, in which the robot learns a number of word-object associations and the articulatory target positions for a number of vowels.
Conference Paper
Full-text available
Understanding the mechanisms of intelligence in human beings and animals is one of the most important approaches to developing intelligent robot systems. Because the mechanisms of such real-life intelligent systems are so complex, physical interactions between agents and their environment, as well as social interactions between agents, should be considered. Comprehension and knowledge in many peripheral fields, such as cognitive science, developmental psychology, brain science, evolutionary biology, and robotics, are also required. Discussion from an interdisciplinary perspective is very important for this approach, but such collaborative research is time-consuming and labor-intensive, and it is difficult to obtain fruitful results because the experimental basis differs greatly in each research field. In the social sciences, for example, several multi-agent simulation systems have been proposed for modeling factors such as social interactions and language evolution, whereas robotics researchers often use dynamics and sensor simulators. However, there has been no integrated system that combines physical simulation with social communication simulation. Therefore, we developed a simulation environment called SIGVerse that combines dynamics, perception, and communication simulations for synthetic approaches to research into the genesis of social intelligence. In this paper, we introduce SIGVerse, an example application, and our perspectives.
Conference Paper
Full-text available
Julius is a high-performance, two-pass LVCSR decoder for researchers and developers. Based on word 3-gram and context-dependent HMM, it can perform almost real-time decoding on most current PCs in a 20k-word dictation task. Major search techniques are fully incorporated, such as tree lexicon, N-gram factoring, cross-word context dependency handling, enveloped beam search, Gaussian pruning, and Gaussian selection. Besides search efficiency, it is also carefully modularized to be independent of model structures, and various HMM types are supported, such as shared-state triphones and tied-mixture models, with any number of mixtures, states, or phones. Standard formats are adopted to interoperate with other free modeling toolkits. The main platform is Linux and other Unix workstations, and it partially works on Windows. Julius is distributed under an open license together with source code and has been used by many researchers and developers in Japan.
Conference Paper
In this paper, we propose a self-localization method that exploits object recognition results obtained using convolutional neural networks (CNNs) for autonomous vehicles. Monte Carlo localization (MCL) is one of the most popular localization methods; it uses odometry and distance sensor data to determine vehicle position. Errors are often observed in localization tasks, and MCL in particular suffers from global positional errors, in which the particles representing a vehicle's position are distributed in the form of a multimodal distribution, i.e., a distribution with several peaks. To overcome this problem, we propose a method in which an autonomous vehicle employs object recognition results, obtained using CNNs, as measurement data with a Bag-of-Features representation in an integrative manner. The semantic information in the CNN recognition results reduces global errors in localization. The experimental results show that the proposed method can converge the distribution of vehicle positions and particle orientations and reduce global positional errors.
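A minimal sketch of the fusion described above (illustrative, not the authors' code): each particle's weight is multiplied by the similarity between the CNN's Bag-of-Features histogram for the current camera image and a histogram previously stored for the map cell the particle falls in. The helper names here are assumptions.

```python
# Illustrative fusion of CNN Bag-of-Features similarity into MCL weights.
import numpy as np

def histogram_likelihood(query, reference, eps=1e-9):
    """Bhattacharyya-style similarity between two normalized BoF histograms."""
    return float(np.sum(np.sqrt(query * reference)) + eps)

def cnn_measurement_update(particles, weights, query_bof, cell_bofs, cell_of):
    """cell_bofs: {cell_id: normalized BoF histogram} built during mapping;
    cell_of(x) -> cell_id maps a particle position to its map cell."""
    for i, x in enumerate(particles):
        ref = cell_bofs.get(cell_of(x))
        if ref is not None:
            weights[i] *= histogram_likelihood(query_bof, ref)
    return weights / weights.sum()
```

Because the semantic likelihood is high only in cells whose appearance matches the current view, particles in the wrong peaks of a multimodal distribution are down-weighted, which is exactly the global-error reduction the abstract describes.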
Article
Progress in information technology has made it possible to apply computer-intensive methods to statistical analysis. In time series modeling, the sequential Monte Carlo method was developed for general nonlinear non-Gaussian state-space models, and it makes it feasible to consider very complex nonlinear non-Gaussian models for real-world problems. In this paper, we consider several computational problems associated with the sequential Monte Carlo filter and smoother, such as the use of a huge number of particles, the two-filter formula for smoothing, and parallel computation. The posterior mean smoother and the Gaussian-sum smoother are also considered.
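For reference, the two-filter formula mentioned above combines a forward prediction with a backward information filter; in standard state-space notation it reads

p(x_t \mid y_{1:T}) \propto p(x_t \mid y_{1:t-1}) \, p(y_{t:T} \mid x_t),

which holds because, given the state x_t, the past observations y_{1:t-1} and the present-and-future observations y_{t:T} are conditionally independent.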
Article
This paper presents an implemented computational model of word acquisition which learns directly from raw multimodal sensory input. Set in an information theoretic framework, the model acquires a lexicon by finding and statistically modeling consistent cross-modal structure. The model has been implemented in a system using novel speech processing, computer vision, and machine learning algorithms. In evaluations the model successfully performed speech segmentation, word discovery and visual categorization from spontaneous infant-directed speech paired with video images of single objects. These results demonstrate the possibility of using state-of-the-art techniques from sensory pattern recognition and machine learning to implement cognitive models which can process raw sensor data without the need for human transcription or labeling.
Article
This paper describes new language-processing methods suitable for human–robot interfaces. These methods enable a robot to learn linguistic knowledge from scratch in unsupervised ways. The learning is done through statistical optimization in the process of human–robot communication, combining speech, visual, and behavioral information in a probabilistic framework. The linguistic knowledge learned includes speech units like phonemes, lexicon, and grammar, and is represented by a graphical model that includes hidden Markov models. In experiments, a robot was eventually able to understand utterances according to given situations, and act appropriately.
Conference Paper
This paper proposes a method for the unsupervised learning of place-names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we proposed a lexical learning method based on statistical model selection. This method can learn words that represent a single object, such as proper nouns, but cannot learn words that represent classes of objects, such as general nouns. This paper describes improvements to the method for learning both the phoneme sequence of each word and the distribution of objects that the word represents.
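The abstract does not spell out the model-selection criterion, so the following is only a toy illustration of the idea, assuming a BIC comparison between a "single location" model with a fixed small variance and a broader "class" model with a fitted variance over the positions where a word was heard.

```python
# Toy BIC-based model selection: does a word denote one spot or a class?
import numpy as np

def bic(log_lik, n_params, n_obs):
    return -2.0 * log_lik + n_params * np.log(n_obs)

def gaussian_loglik(xs, mu, var):
    return float(np.sum(-0.5 * (np.log(2 * np.pi * var)
                                + (xs - mu) ** 2 / var)))

def prefers_class_model(positions):
    """positions: 1-D array of locations where the word was heard."""
    mu, var = positions.mean(), positions.var() + 1e-6
    single = bic(gaussian_loglik(positions, mu, 1e-2), 1, len(positions))
    broad = bic(gaussian_loglik(positions, mu, var), 2, len(positions))
    return broad < single   # lower BIC wins
```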
Article
In this paper we propose a latent Dirichlet allocation (LDA)-based framework for multimodal categorization and word grounding by robots. The robot uses its physical embodiment to grasp and observe an object from various viewpoints, as well as to listen to the sound it makes during the observation period. This multimodal information is used for categorizing and forming multimodal concepts using multimodal LDA. At the same time, the words acquired during the observation period are connected to the related concepts, which are represented by the multimodal LDA. We also provide a relevance measure that encodes the degree of connection between words and modalities. The proposed algorithm is implemented on a robot platform, and experiments are carried out to evaluate it. We also demonstrate a simple conversation between a user and the robot based on the learned model.
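As an informal illustration of the multimodal-LDA idea (using gensim's standard LDA as a stand-in; the original work uses its own multimodal extension), each object can be treated as a document whose tokens mix modalities, with modality prefixes letting a topic span vision, sound, and words.

```python
# Illustrative stand-in for multimodal LDA using gensim's standard LDA.
from gensim import corpora, models

docs = [
    ["vis:red", "vis:round", "snd:rattle", "word:ball"],
    ["vis:red", "vis:round", "snd:rattle", "word:ball", "word:red"],
    ["vis:white", "vis:soft", "snd:squeak", "word:plush"],
]
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(d) for d in docs]
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=50)

# A simple stand-in for the paper's relevance measure: read off how strongly
# each word-token loads on each learned topic (concept).
for topic_id in range(2):
    print(lda.show_topic(topic_id, topn=4))
```

Tokens from different modalities that repeatedly co-occur on the same objects end up in the same topic, which is the sense in which words become grounded in the multimodal concept.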