Source publication
Conference Paper
Full-text available
Future human-robot collaboration will employ language to instruct a robot about specific tasks to perform in its surroundings. This requires the robot to associate spatial knowledge with language in order to understand the details of an assigned task and behave appropriately in the context of interaction. In this paper, we propose a probabili...

Similar publications

Preprint
Full-text available
In this paper we propose a conceptual framework for higher-order artificial neural networks. The idea of higher-order networks arises naturally when a model is required to learn some group of transformations, every element of which is well-approximated by a traditional feedforward network. Thus the group as a whole can be represented as a hyper net...

Citations

... The advantage is that they work without the support of a tutor; however, they are less sample-efficient and often also less accurate. Examples are cross-situational learning (Siskind, 1996; Smith et al., 2011) based approaches that have been used to ground objects, actions, and spatial concepts (Dawson et al., 2013; Aly et al., 2017). Only limited work, i.e., (Belpaeme and Morse, 2012; Nevens and Spranger, 2017; Roesler, 2020a), has been done to compare or combine both approaches (see Section 2.3). ...
... The meaning of the term concept is still an area of active philosophical debate (see, e.g., Margolis and Laurence, 2007; Zalta, 2021), and in most grounding studies it is either used synonymously with words or symbols, e.g., Stramandinoli et al. (2011); Aly et al. (2017), or it is avoided entirely by directly stating that words are grounded through concrete representations, e.g., Nakamura et al. (2009); Marocco et al. (2010). Since the scenario employed in this study contains synonyms and homonyms, concepts can be represented neither through words nor through concrete representations; instead, they are implicitly represented by the connections between words and concrete representations. ...
... Another limitation of the models is that they are not able to handle synonyms, i.e., multiple words referring to the same concept, which is a substantial limitation because many words are synonymous in specific contexts. Aly et al. (2017), Roesler et al. (2018), and related work also employed probabilistic models for grounding; however, they used different experimental setups, grounded different modalities (i.e., spatial relations, actions, and shapes), and investigated different research questions. For example, one of these studies investigated the utility of different word representations for grounding of unknown synonyms, i.e., words for which at least one of their synonyms has been encountered during training while the word itself was not encountered. ...
Article
Full-text available
Natural and efficient communication with humans requires artificial agents that are able to understand the meaning of natural language. However, understanding natural language is non-trivial and requires proper grounding mechanisms to create links between words and corresponding perceptual information. Since the introduction of the “Symbol Grounding Problem” in 1990, many different grounding approaches have been proposed that employed either supervised or unsupervised learning mechanisms. The latter have the advantage that no other agent is required to learn the correct groundings, while the former are often more sample-efficient and accurate but require the support of another agent, like a human or another artificial agent. Although combining both paradigms seems natural, it has not received much attention. Therefore, this paper proposes a hybrid grounding framework that combines both learning paradigms so that it is able to utilize support from a tutor, if available, while it can still learn when no support is provided. Additionally, the framework has been designed to learn in a continuous and open-ended manner so that no explicit training phase is required. The proposed framework is evaluated through two different grounding scenarios, its unsupervised grounding component is compared to a state-of-the-art unsupervised Bayesian grounding framework, and the benefit of combining both paradigms is evaluated through the analysis of different feedback rates. The obtained results show that the employed unsupervised grounding mechanism outperforms the baseline in terms of accuracy, transparency, and deployability, and that combining both paradigms increases both the sample efficiency and the accuracy of purely unsupervised grounding, while ensuring that the framework is still able to learn the correct mappings when no supervision is available.
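To make the hybrid idea above concrete, here is a minimal Python sketch (hypothetical class, names, and data; not the paper's implementation) that combines unsupervised cross-situational co-occurrence counting with occasional supervised tutor feedback that overrides the learned mapping for a word.

```python
# Minimal sketch of hybrid grounding: unsupervised co-occurrence evidence
# plus optional tutor feedback. All names and data are illustrative.
from collections import defaultdict

class HybridGrounder:
    def __init__(self):
        self.counts = defaultdict(lambda: defaultdict(float))  # word -> percept -> evidence
        self.supervised = {}                                    # word -> percept confirmed by a tutor

    def observe(self, words, percepts):
        # Unsupervised update: every co-occurring word/percept pair gains evidence.
        for w in words:
            for p in percepts:
                self.counts[w][p] += 1.0

    def feedback(self, word, percept):
        # Supervised update: a tutor confirms the correct grounding for a word.
        self.supervised[word] = percept

    def ground(self, word):
        # Prefer tutor-confirmed mappings; otherwise fall back to accumulated evidence.
        if word in self.supervised:
            return self.supervised[word]
        if self.counts[word]:
            return max(self.counts[word], key=self.counts[word].get)
        return None

grounder = HybridGrounder()
grounder.observe(["push", "the", "ball"], ["percept:ball", "percept:push_motion"])
grounder.observe(["lift", "the", "ball"], ["percept:ball", "percept:lift_motion"])
grounder.feedback("push", "percept:push_motion")  # occasional supervision at some feedback rate
print(grounder.ground("ball"), grounder.ground("push"))
```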
... However, in these studies, an utterance used for learning consisted of only one word. Aly et al. suggested a probabilistic framework for learning words representing spatial concepts (spatial prepositions) and object categories based on visual cues representing spatial layouts and geometric characteristics of objects in a tabletop scene [24]. However, segmented word sequences were used for teaching. ...
Preprint
Full-text available
This paper proposes methods for unsupervised lexical acquisition for relative spatial concepts using spoken user utterances. A robot with a flexible spoken dialog system must be able to acquire linguistic representations and their meanings specific to an environment through interactions with humans, as children do. Specifically, relative spatial concepts (e.g., front and right) are widely used in our daily lives; however, it is not obvious which object is the reference object when a robot learns relative spatial concepts. Therefore, we propose methods by which a robot without prior knowledge of words can learn relative spatial concepts. The methods are formulated using a probabilistic model to estimate the proper reference objects and the distributions representing concepts simultaneously. The experimental results show that relative spatial concepts and a phoneme sequence representing each concept can be learned under the condition that the robot does not know which located object is the reference object. Additionally, we show that two processes in the proposed method improve the estimation accuracy of the concepts: generating candidate word sequences by a class n-gram and selecting word sequences using location information. Furthermore, we show that clues to reference objects improve accuracy even though the number of candidate reference objects increases.
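As a rough illustration of the reference-object estimation described above (hypothetical data and a simplified Gaussian likelihood, not the proposed model), one can fit a Gaussian over the located object's offsets relative to each candidate reference and keep the reference under which the utterances of a spatial word are most likely:

```python
# Pick the candidate reference object that best explains a relative spatial word.
import numpy as np
from scipy.stats import multivariate_normal

def best_reference(target_positions, candidate_refs):
    """target_positions: (N, 2) locations of the located object when the word was heard.
    candidate_refs: dict name -> (N, 2) positions of each candidate reference object."""
    best_name, best_ll, best_params = None, -np.inf, None
    for name, refs in candidate_refs.items():
        offsets = target_positions - refs            # relative offsets w.r.t. this reference
        mu = offsets.mean(axis=0)
        cov = np.cov(offsets.T) + 1e-6 * np.eye(2)   # regularize for numerical stability
        ll = multivariate_normal.logpdf(offsets, mu, cov).sum()
        if ll > best_ll:
            best_name, best_ll, best_params = name, ll, (mu, cov)
    return best_name, best_params

rng = np.random.default_rng(0)
table = np.zeros((20, 2))
robot = rng.normal([1.0, -1.0], 0.3, size=(20, 2))
# Utterances of "front" were produced roughly 0.5 m in front of the table.
targets = table + np.array([0.0, 0.5]) + rng.normal(0, 0.05, size=(20, 2))
print(best_reference(targets, {"table": table, "robot": robot})[0])  # -> "table"
```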
... Theoretical and empirical validations should be applied for further applications. So far, many researchers, including the authors, have proposed numerous cognitive models for robots: object concept formation based on appearance, usage, and functions [41], formation of integrated concepts of objects and motions [42], grammar learning [16], language understanding [43], spatial concept formation and lexical acquisition [8,20,44], simultaneous phoneme and word discovery [45][46][47], and cross-situational learning [48,49]. These models are regarded as integrative models constructed by combining smaller-scale models. ...
Article
Full-text available
This paper describes a framework for the development of an integrative cognitive system based on probabilistic generative models (PGMs) called Neuro-SERKET. Neuro-SERKET is an extension of SERKET, which can compose elemental PGMs developed in a distributed manner and provide a scheme that allows the composed PGMs to learn throughout the system in an unsupervised way. In addition to the head-to-tail connection supported by SERKET, Neuro-SERKET supports tail-to-tail and head-to-head connections, as well as neural network-based modules, i.e., deep generative models. As an example of a Neuro-SERKET application, an integrative model was developed by composing a variational autoencoder (VAE), a Gaussian mixture model (GMM), latent Dirichlet allocation (LDA), and automatic speech recognition (ASR). The model is called VAE + GMM + LDA + ASR. The performance of VAE + GMM + LDA + ASR and the validity of Neuro-SERKET were demonstrated through a multimodal categorization task using image data and a speech signal of numerical digits.
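The composition idea can be illustrated with a very small sketch (toy data and off-the-shelf scikit-learn components, not the Neuro-SERKET code): continuous image-like features are discretized by a GMM, and the resulting cluster tokens are merged with recognized words into one bag-of-tokens document that LDA categorizes multimodally.

```python
# Toy head-to-tail composition in the spirit of GMM + LDA over two modalities.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

rng = np.random.default_rng(0)
# Toy "visual" features for two object categories and the words heard with them.
feats = np.vstack([rng.normal(0, 0.5, (10, 4)), rng.normal(3, 0.5, (10, 4))])
words = ["red ball round"] * 10 + ["blue box square"] * 10

gmm = GaussianMixture(n_components=2, random_state=0).fit(feats)
visual_tokens = [f"vis{c}" for c in gmm.predict(feats)]       # discretized visual channel

docs = [f"{w} {v}" for w, v in zip(words, visual_tokens)]     # fuse modalities per observation
counts = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)
print(lda.transform(counts).argmax(axis=1))                   # per-observation multimodal category
```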
... Grounding indicates the assignment of meaning to an abstract symbol, e.g., a word, through perceptual information [18]. Previous studies that investigated the use of cross-situational learning for grounding of objects [13,40] as well as spatial concepts [2,10,41] ensured that one word appears several times together with the same perceptual feature vector so that a corresponding mapping can be created [14]. However, natural language is ambiguous due to homonymy, i.e., one word refers to several objects or actions, and synonymy, i.e., one object or action can be referred to by several different words. ...
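The cross-situational assumption mentioned in this excerpt can be made concrete with a short sketch (hypothetical words and feature labels): a word that repeatedly co-occurs with the same perceptual feature accumulates a high conditional probability for that feature.

```python
# Cross-situational co-occurrence counting; all words and features are illustrative.
from collections import Counter, defaultdict

pairs = [  # (utterance words, perceptual features present in the situation)
    (["grasp", "the", "cup"], ["feat:cup", "feat:grasp_traj"]),
    (["push", "the", "cup"],  ["feat:cup", "feat:push_traj"]),
    (["grasp", "the", "box"], ["feat:box", "feat:grasp_traj"]),
]

cooc = defaultdict(Counter)
for words, feats in pairs:
    for w in words:
        cooc[w].update(feats)

def p_feat_given_word(word):
    total = sum(cooc[word].values())
    return {f: c / total for f, c in cooc[word].items()}

print(p_feat_given_word("grasp"))  # 'feat:grasp_traj' dominates as more situations accumulate
```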
... object-object, is slightly smaller than the mean distances between word vectors of the Object and Action modalities (the corpus can be downloaded at http://mattmahoney.net/dc/text8.zip). ...
Conference Paper
Full-text available
Abstract—In order to interact with people in a natural way, a robot must be able to link words to objects and actions. Although previous studies in the literature have investigated grounding, they did not consider grounding of unknown synonyms. In this paper, we introduce a probabilistic model for grounding unknown synonymous object and action names using cross-situational learning. The proposed Bayesian learning model uses four different word representations to determine synonymous words. Afterwards, they are grounded through geometric characteristics of objects and kinematic features of the robot joints during action execution. The proposed model is evaluated through an interaction experiment between a human tutor and HSR robot. The results show that semantic and syntactic information both enable grounding of unknown synonyms and that the combination of both achieves the best grounding.
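The role of a semantic word representation in detecting synonyms can be illustrated with a toy sketch (hand-made vectors and a hypothetical threshold; the paper's four representations and Bayesian model are more involved): words whose vectors are close by cosine similarity are treated as candidate synonyms and grounded together.

```python
# Toy synonym detection via cosine similarity over hand-made "semantic" vectors.
import numpy as np

vectors = {                    # hypothetical 3-d vectors, illustrative only
    "cup":   np.array([0.9, 0.1, 0.0]),
    "mug":   np.array([0.85, 0.15, 0.05]),
    "box":   np.array([0.1, 0.9, 0.1]),
    "lift":  np.array([0.0, 0.1, 0.95]),
    "raise": np.array([0.05, 0.05, 0.9]),
}

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

threshold = 0.95               # assumed cut-off for treating two words as synonyms
words = list(vectors)
for i, w1 in enumerate(words):
    for w2 in words[i + 1:]:
        if cos(vectors[w1], vectors[w2]) > threshold:
            print(f"treat '{w1}' and '{w2}' as synonyms")
```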
... Part-Of-Speech (POS) tagging assigns parts of speech to words from a text body, allowing words that have the same written form but different meanings in different contexts to be interpreted accordingly by the software. Developed by Reference [16] and later advanced by other contributors [17][18][19], the method has recently been applied to the analysis of written information on social media websites [20][21][22], grammar-related aspects [23,24], plagiarism identification [25], and even human-machine interfaces [26]. Named Entity Recognition enables the localization and categorization of "important and proper nouns in a text" [27], helping the reviewer to focus on the important concepts that characterize the text. ...
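For readers unfamiliar with POS tagging, a minimal example using NLTK's off-the-shelf tagger (an illustrative tool choice; the cited works may use different taggers) shows how context-dependent tags are assigned:

```python
# Tag a short instruction with NLTK's default English POS tagger.
import nltk

# Resource names vary slightly across NLTK versions; downloading both variants is harmless.
for res in ("punkt", "punkt_tab", "averaged_perceptron_tagger", "averaged_perceptron_tagger_eng"):
    nltk.download(res, quiet=True)

tokens = nltk.word_tokenize("the robot lifts the red ball and puts it on the box")
print(nltk.pos_tag(tokens))
# e.g. [('the', 'DT'), ('robot', 'NN'), ('lifts', 'VBZ'), ('the', 'DT'), ('red', 'JJ'), ('ball', 'NN'), ...]
```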
Article
Full-text available
When knowledge develops fast, as is so often the case nowadays, one of the main difficulties in initiating new research in any field is to identify the domain’s specific state of the art and trends. In this context, to evaluate the potential of a research niche by assisting the literature review process and to add a new, modern, large-scale, and automated dimension to it, the paper proposes a methodology that uses “Latent Semantic Analysis” (LSA) for identifying trends, focused within the knowledge space created at the intersection of three sustainability-related methodologies/concepts: “virtual Quality Management” (vQM), “Industry 4.0”, and “Product Life-Cycle” (PLC). The LSA was applied to a significant number of scientific papers published around these concepts to generate ontology charts that describe the knowledge structure of each by the frequency, position, and causal relation of associated notions. These notions are combined to define the common high-density knowledge zone from where new technological solutions are expected to emerge throughout the PLC. The authors propose the concept of the knowledge space, which is characterized through specific descriptors with their own evaluation scales, obtained by processing the emerging information as identified by a combination of classic and innovative techniques. The results are validated through an investigation that surveys a relevant number of general managers, specialists, and consultants in the field of quality in the automotive sector from Romania. This practical demonstration follows each step of the theoretical approach and yields results that prove the capability of the method to contribute to the understanding and elucidation of the scientific area to which it is applied. Once validated, the method could be transferred to fields with similar characteristics.
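A compact illustration of the LSA step (toy corpus and scikit-learn's truncated SVD; the paper's actual pipeline and corpus differ) builds a term-document matrix and projects the documents into a low-dimensional latent semantic space:

```python
# Latent Semantic Analysis on a toy corpus: TF-IDF matrix + truncated SVD.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD

docs = [
    "virtual quality management in the product life cycle",
    "industry 4.0 sensors enable virtual quality management",
    "product life cycle assessment and sustainability",
    "industry 4.0 data platforms for the product life cycle",
]
X = TfidfVectorizer(stop_words="english").fit_transform(docs)
lsa = TruncatedSVD(n_components=2, random_state=0).fit(X)
print(lsa.transform(X))                  # document coordinates in the latent semantic space
print(lsa.explained_variance_ratio_)     # how much structure each latent dimension captures
```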
... Understanding the syntactic structure of language has been intensively investigated in the cognitive robotics and computational linguistics literature. In cognitive robotics, different research studies have proposed computational models for grounding nouns, verbs, adjectives, and prepositions encoding spatial relationships between objects [1,2,24,29,42]. However, they have not investigated grammar understanding at the phrase level, which constitutes a higher level than grounding words through perception. ...
Conference Paper
Full-text available
Abstract—Robots are progressively moving into spaces that have been primarily shaped by human agency; they collaborate with human users in different tasks that require them to understand human language so as to behave appropriately in space. To this end, a stubborn challenge that we address in this paper is inferring the syntactic structure of language, which embraces grounding parts of speech (e.g., nouns, verbs, and prepositions) through visual perception, and induction of Combinatory Categorial Grammar (CCG) in situated human-robot interaction. This could pave the way towards making a robot able to understand the syntactic relationships between words (i.e., understand phrases), and consequently the meaning of human instructions during interaction, which is a future scope of this study.
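To show what CCG induction ultimately targets, here is a toy hand-written CCG lexicon parsed with NLTK's CCG module (purely illustrative; the cited work induces its lexicon from situated interaction rather than writing it by hand):

```python
# Parse a short instruction with a toy CCG lexicon using NLTK's CCG chart parser.
from nltk.ccg import chart, lexicon

lex = lexicon.fromstring('''
    :- S, NP, N
    the => NP/N
    red => N/N
    ball => N
    lift => S/NP
''')

parser = chart.CCGChartParser(lex, chart.DefaultRuleSet)
for parse in parser.parse("lift the red ball".split()):
    chart.printCCGDerivation(parse)   # prints how the categories combine into a sentence (S)
    break
```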
... To achieve this goal, robots have to perform "Symbol Grounding", first described by Harnad [15], to relate words and sensory data that refer to the same object or action to each other. Although previous studies investigated the use of cross-situational learning for grounding of objects [11,27] as well as spatial concepts [2,9,28], they ensured that one word appears several times together with the same perceptual feature vector to allow the creation of a corresponding mapping [12]. However, natural language is ambiguous due to homonymy, i.e., one word refers to several objects or actions, and synonymy, i.e., one object or action can be referred to by several different words. ...
... For example, in the first sentence structure, the object was preceded by an action and a determiner, while in the second sentence structure, it was directly preceded by an action. This is unlike our previous studies [1,2], where the POS tagging model achieved better results using more informative sentences containing more syntactic word categories. When the Word Vectors representation was used, the model did not learn the Others modality because it only contains one word (the article "the"), which is not sufficient to create an independent cluster for the learning model. ...
Conference Paper
Full-text available
Natural human-robot interaction requires robots to link words to objects and actions through grounding. Although grounding has been investigated in previous studies, none of them considered grounding of synonyms. In this paper, we try to fill this gap by introducing a Bayesian learning model for grounding synonymous object and action names using cross-situational learning. Three different word representations are employed with the probabilistic model and evaluated according to their grounding performance. Words are grounded through geometric characteristics of objects and kinematic features of the robot joints during action execution. An interaction experiment between a human tutor and HSR robot is used to evaluate the proposed model. The results show that representing words by syntactic and/or semantic information achieves worse grounding results than representing them by unique numbers.
... Salvi et al. [35] developed a probabilistic affordance model to learn word meanings in terms of actions and object features. Matuszek et al. [26] proposed a probabilistic joint-learning framework that employs categorial grammar to develop compositional meaning representations of language and physical objects in the environment. Grounding action verbs (e.g., push, raise, and touch) in sensorimotor behavior has been fueled by the ambition of making robots able to learn the motor behavior associated with the verbs [13]. ...
... In this study, we propose an extended computational model of Aly et al. [1] for grounding action verbs, spatial concepts, and object characteristics (color and geometry) of language, together with syntactic information, through visual perception; this has not been sufficiently addressed in the literature through a similar global approach that could allow a robot to understand human instructions autonomously. The rest of the paper is organized as follows: Section (II) presents a general description of the system architecture, Sections (III, IV, V) illustrate, in detail, the different subsystems of the framework, Section (VI) illustrates the experimental design, Section (VII) provides a description of the experimental results, and finally, Section (VIII) concludes the paper. ...
... The rest of the paper is organized as follows: Section (II) presents a general description of the system architecture, Sections (III, IV, V) illustrate, in detail, the different subsystems of the framework, Section (VI) illustrates the experimental design, Section (VII) provides a description of the experimental results, and finally, Section (VIII) concludes the paper. (1) Speech recognition system (Google HTML5 API speech recognition toolkit; an off-the-shelf module that will not be highlighted in this study) that recognizes the human tutor's instructions during interaction with the Toyota HSR robot, (2) Skeleton tracking and modeling system, which tracks the locations of arm joints while manipulating objects, and models them with a Hidden Markov Model (HMM), (3) 3D object segmentation system, which segments major planes and objects of the scene into point clouds in order to determine their spatial relationships, in addition to their color and geometric characteristics, (4) Part-of-Speech (POS) tagging system, which marks words of the recognized sentences with syntactic attributes through an unsupervised approach (i.e., it marks a word sequence with numerical tags without employing any pre-tagged corpus), and finally (5) Probabilistic learning model for grounding action verbs, spatial concepts, and object characteristics of language through perception. ...
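As a rough sketch of component (2), the following example (toy joint trajectories and the third-party hmmlearn package, assumed here for illustration; not the authors' implementation) fits a Gaussian HMM to arm-joint demonstrations and scores a new trajectory under the learned model:

```python
# Model arm-joint trajectories with a Gaussian HMM (hmmlearn assumed installed).
import numpy as np
from hmmlearn import hmm

rng = np.random.default_rng(0)
# Two toy demonstrations of a reaching motion: T x D arrays of joint positions.
demo1 = np.cumsum(rng.normal(0.05, 0.01, size=(50, 3)), axis=0)
demo2 = np.cumsum(rng.normal(0.05, 0.01, size=(60, 3)), axis=0)
X = np.vstack([demo1, demo2])
lengths = [len(demo1), len(demo2)]

model = hmm.GaussianHMM(n_components=4, covariance_type="diag", n_iter=50)
model.fit(X, lengths)                      # learn the hidden-state structure of the motion
print(model.score(demo1, [len(demo1)]))    # log-likelihood of a trajectory under the model
```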
Conference Paper
Full-text available
Creating successful human-robot collaboration requires robots to have high-level cognitive functions that could allow them to understand human language and actions in space. To meet this target, an elusive challenge that we address in this paper is to understand object-directed actions through grounding language based on visual cues representing the dynamics of human actions on objects, object characteristics (color and geometry), and spatial relationships between objects in a tabletop scene. The proposed probabilistic framework investigates unsupervised Part-of-Speech (POS) tagging to determine the syntactic categories of words so as to infer the grammatical structure of language. The dynamics of object-directed actions are characterized through the locations of the human arm joints – modeled on a Hidden Markov Model (HMM) – while manipulating objects, in addition to those of objects represented in 3D point clouds. These point clouds, corresponding to the segmented objects, encode geometric features and the spatial semantics of referents and landmarks in the environment. The proposed Bayesian learning model is successfully evaluated through interaction experiments between a human user and a Toyota HSR robot in space.
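One simple way the spatial relationships mentioned above can be read off segmented point clouds is by comparing object centroids along the table axes; the helper below is a hypothetical simplification (axis conventions and names are assumptions), not the paper's model:

```python
# Coarse spatial relation between two segmented objects from their point-cloud centroids.
import numpy as np

def spatial_relation(referent_cloud, landmark_cloud, margin=0.02):
    """Return a coarse relation of the referent w.r.t. the landmark, assuming
    x = right/left, y = front/behind, z = above/below, with offsets in metres."""
    d = referent_cloud.mean(axis=0) - landmark_cloud.mean(axis=0)
    axis = int(np.argmax(np.abs(d)))
    if abs(d[axis]) < margin:
        return "near"
    names = [("left of", "right of"), ("behind", "in front of"), ("below", "above")]
    return names[axis][int(d[axis] > 0)]

rng = np.random.default_rng(1)
ball = rng.normal([0.10, 0.00, 0.05], 0.01, size=(500, 3))   # toy segmented clouds
box  = rng.normal([0.00, 0.00, 0.05], 0.01, size=(500, 3))
print(spatial_relation(ball, box))   # -> "right of" (positive x offset under the assumed convention)
```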
... In addition, as a further extension of the proposed method, we intend to increase the types of sensory-channels, add a positional relationship between objects, and identify words that are not related to sensory-channels. For example, Aly et al. (2017) learned object categories and spatial prepositions by using a model similar to the proposed model. It would be possible to merge the proposed method with this model within the theoretical framework of the Bayesian generative model. ...
Article
Full-text available
In this paper, we propose a Bayesian generative model that can form multiple categories based on each sensory-channel and can associate words with any of four sensory-channels (action, position, object, and color). This paper focuses on cross-situational learning using the co-occurrence between words and information of sensory-channels in complex situations rather than conventional situations of cross-situational learning. We conducted a learning scenario using a simulator and a real humanoid iCub robot. In the scenario, a human tutor provided a sentence that describes an object of visual attention and an accompanying action to the robot. The scenario was set as follows: the number of words per sensory-channel was three or four, and the number of trials for learning was 20 and 40 for the simulator and 25 and 40 for the real robot. The experimental results showed that the proposed method was able to estimate the multiple categorizations and to learn the relationships between multiple sensory-channels and words accurately. In addition, we conducted an action generation task and an action description task based on word meanings learned in the cross-situational learning scenario. The experimental results showed the robot could successfully use the word meanings learned by using the proposed method.
Article
Full-text available
This paper proposes methods for unsupervised lexical acquisition for relative spatial concepts using spoken user utterances. A robot with a flexible spoken dialog system must be able to acquire linguistic representations and their meanings specific to an environment through interactions with humans, as children do. Specifically, relative spatial concepts (e.g., front and right) are widely used in our daily lives; however, it is not obvious which object is the reference object when a robot learns relative spatial concepts. Therefore, we propose methods by which a robot without prior knowledge of words can learn relative spatial concepts. The methods are formulated using a probabilistic model to estimate the proper reference objects and the distributions representing concepts simultaneously. The experimental results show that relative spatial concepts and a phoneme sequence representing each concept can be learned under the condition that the robot does not know which located object is the reference object. Additionally, we show that two processes in the proposed method improve the estimation accuracy of the concepts: generating candidate word sequences by a class n-gram and selecting word sequences using location information. Furthermore, we show that clues to reference objects improve accuracy even though the number of candidate reference objects increases.