Table 1 - uploaded by Bruno Dumas
Differences between GUIs and MUIs (columns: GUI, MUI).

Source publication
Chapter
Full-text available
The grand challenge of multimodal interface creation is to build reliable processing systems able to analyze and understand multiple communication means in real-time. This opens a number of associated issues covered by this chapter, such as heterogeneous data types fusion, architectures for real-time processing, dialog management, machine learning...
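
As a rough illustration of the decision-level fusion of heterogeneous inputs discussed in this chapter, the sketch below combines a speech hypothesis and a pointing gesture when they occur within a shared time window. All names (Event, fuse_events) and the scoring rule are hypothetical and only indicate the general technique, not the chapter's actual architecture.

```python
# Hypothetical illustration of decision-level multimodal fusion with a time window.
# Names (Event, fuse_events) are invented for this sketch, not taken from the chapter.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    modality: str      # e.g. "speech" or "gesture"
    label: str         # recognizer hypothesis, e.g. "put_that" or "point_at:vase"
    confidence: float  # recognizer score in [0, 1]
    timestamp: float   # seconds

def fuse_events(speech: Event, gesture: Event,
                window_s: float = 1.5) -> Optional[dict]:
    """Combine two unimodal hypotheses if they fall inside the same time window."""
    if abs(speech.timestamp - gesture.timestamp) > window_s:
        return None  # too far apart in time: do not fuse
    return {
        "command": f"{speech.label} -> {gesture.label}",
        # simple combined score; real systems use richer scoring or machine learning
        "confidence": speech.confidence * gesture.confidence,
    }

if __name__ == "__main__":
    s = Event("speech", "put_that", 0.9, 10.2)
    g = Event("gesture", "point_at:vase", 0.8, 10.6)
    print(fuse_events(s, g))
```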

Similar publications

Chapter
Full-text available
In the last decades, much progress has been made regarding the measurement of emotions in the laboratory. However, this research has often been criticized as dealing with artificial situations that have little ecological validity. The Internet affords an interesting challenge. There is no doubt that millions of people experience and express emotion...

Citations

... Since the first graphic systems appeared, together with the use of multimedia (mainly audio) in computers, the first multimodal interfaces emerged. With the appearance of mobile devices, first handheld computers and then smartphones, it became necessary to improve the interaction between users and devices, since their limited capabilities, especially screen size and processing power, make it very difficult to work with traditional graphical user interfaces [5]. ...
... Previous research has investigated the use of individual modalities in HRI scenarios, such as displaying facial expressions for giving feedback (Mirnig et al., 2014), comparing unimodal displays of emotion (Tsiourti et al., 2017), or investigating preferences regarding a person-following robot's auditory feedback behavior (Olatunji et al., 2020). In a Human-Computer Interaction context, multimodal input has been argued to offer advantages such as supporting user preferences and user learning, and reducing cognitive load and user errors when fusing information from user input modes (Goodrich and Schultz, 2007; Dumas et al., 2009). Advantages of multiple output modalities for system-to-user communication may include providing information in complementary forms (Park et al., 2018) or improving the inference of past causal information (Han and Yanco, 2023). ...
Article
Full-text available
In human-robot collaboration, failures are bound to occur. A thorough understanding of potential errors is necessary so that robotic system designers can develop systems that remedy failure cases. In this work, we study failures that occur when participants interact with a working system and focus especially on errors in a robotic system's knowledge base of which the system is not aware. A human interaction partner can be part of the error detection process if they are given insight into the robot's knowledge and decision-making process. We investigate different communication modalities and the design of shared task representations in a joint human-robot object organization task. We conducted a user study (N = 31) in which the participants showed a Pepper robot how to organize objects, and the robot communicated the learned object configuration to the participants by means of speech, visualization, or a combination of speech and visualization. The multimodal, combined condition was preferred by 23 participants, followed by seven participants preferring the visualization. Based on the interviews, the errors that occurred, and the object configurations generated by the participants, we conclude that participants tend to test the system's limitations by making the task more complex, which provokes errors. This trial-and-error behavior has a productive purpose and demonstrates that failures arise from the combination of robot capabilities, the user's understanding and actions, and interaction in the environment. Moreover, it demonstrates that failure can have a productive purpose in establishing better user mental models of the technology.
... Multimodality is an inter-disciplinary approach that recognizes that humans produce meaning in a variety of ways and that each mode offers distinct constraints and possibilities [12]. From a Human-Computer Interaction (HCI) perspective, multimodal interaction [23] provides the user with multiple modes of interacting with a system, and a multimodal interface [3,18] provides several distinct tools for input and output of data. According to a Human-Centered AI vision [19], multimodality is one of the interaction paradigms, together with conversational user interfaces and natural gestures, that are making Intelligent User Interfaces (IUIs) a reality; these paradigms augment the user's interaction and cognition possibilities, enhancing human potential by combining the strengths of both human intelligence and AI. ...
Conference Paper
This paper describes the opportunities of multimodality in the field of robot-to-human communication. In the proposed approach, the coordinated and integrated use of multimedia elements, i.e., text, images, and animations, together with the robot's speech plays a very important role in the overall effectiveness of the communicative act. The reference robot used in the research was Pepper, a humanoid robot equipped with a tablet on its front. During the research, various multimodal communication strategies were formalised, implemented and preliminarily evaluated by means of a questionnaire. The results show some statistically significant preferences for specific strategies, marking new avenues of investigation with regard to robot-to-human multimodal communication and its adaptation to user features.
... It lies at the crossroads of several research areas, including computer vision, psychology, artificial intelligence, and many others [1]. Multimodal interaction focuses on interacting with computers in a more "human" way using speech, gestures, and other modalities [2]. Many studies indicate that multimodal interaction can offer better flexibility and reliability [2,3]. ...
... Multimodal interaction focuses on interacting with computers in a more "human" way using speech, gestures, and other modalities [2]. Many studies indicate that multimodal interaction can offer better flexibility and reliability [2,3]. It can meet the needs of diverse users with a range of usage patterns and preferences [3]. ...
... As mentioned in Section 2.2, the modality weights w_1, ..., w_5 are calculated using Equation (2). Subsequently, we manually add strength coefficients α for each modality based on the strength of interaction intent observed in the real world. ...
Article
Full-text available
Multimodal interaction systems can provide users with natural and compelling interactive experiences. Despite the availability of various sensing devices, only a few commercial multimodal applications exist. One reason may be the need for a more efficient framework for fusing heterogeneous data and addressing resource pressure. This paper presents a parallel multimodal integration framework that ensures that the errors and external damages of integrated devices remain uncorrelated. The proposed relative weighted fusion method and modality delay strategy process the heterogeneous data at the decision level. The parallel modality operation flow allows each device to operate across multiple terminals, reducing resource demands on a single computer. The universal fusion methods and independent devices further remove constraints on the number of integrated modalities, making the framework extensible. Based on the framework, we develop a multimodal virtual shopping system integrating five input modalities and three output modalities. The objective experiments show that the system can accurately fuse heterogeneous data and understand interaction intent. User studies indicate that multimodal shopping is immersive and entertaining. Our framework offers a development paradigm for multimodal systems, fostering multimodal applications across various domains.
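
To make the idea of relative weighted fusion with manually chosen strength coefficients more concrete, the following sketch fuses per-modality intent decisions at the decision level. It is an assumed normalized weighted vote written only for illustration; it is not the paper's Equation (2), and all names and values are hypothetical.

```python
# Illustrative (assumed) decision-level fusion with per-modality weights w_i and
# manually chosen strength coefficients alpha_i. This is NOT the cited paper's
# Equation (2); it is only a plausible normalized weighted vote over candidate intents.
from collections import defaultdict

def weighted_fusion(decisions, weights, alphas):
    """decisions: {modality: (intent, score)}; weights/alphas: {modality: float}."""
    votes = defaultdict(float)
    total = 0.0
    for modality, (intent, score) in decisions.items():
        w = weights[modality] * alphas[modality]
        votes[intent] += w * score
        total += w
    # normalize so fused scores stay comparable across different modality subsets
    return {intent: v / total for intent, v in votes.items()}

if __name__ == "__main__":
    decisions = {"speech": ("buy_item", 0.8),
                 "gesture": ("buy_item", 0.6),
                 "gaze": ("inspect_item", 0.7)}
    weights = {"speech": 0.5, "gesture": 0.3, "gaze": 0.2}
    alphas = {"speech": 1.0, "gesture": 0.8, "gaze": 0.5}
    print(weighted_fusion(decisions, weights, alphas))
```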
... In fact, some previous studies explored the factors influencing the choice among different modalities such as speech, text, and gestures. Factors such as efficiency, accuracy, privacy and security (Himmelsbach et al., 2015), user characteristics (Dumas et al., 2009;Ghosh and Joshi, 2013;Schüssel et al., 2013;Schaffer et al., 2015), attitudes toward technology, and quality perceptions and personality factors (Wechsung, 2014) have been shown to impact modality selection. Jian et al. (2013) demonstrated the potential of multimodal interaction (interacting with multiple modalities such as text and speech) for older adults. ...
... Multimodal interaction frameworks, such as CASE (Dumas et al., 2009), CARE (Nigay and Coutaz, 1997), or Vernier and Nigay's (Vernier and Nigay, 2000), which define combinations along different aspects such as the semantic aspect (combination based on meaning), can help interaction designers investigate different ways of combining modalities. For instance, one can combine these different interfaces so that the same output message is sent to all interfaces (redundancy), the output is assigned to the most suitable interface (assignment), or the two interfaces complement each other's information in order to convey one meaning to the user (complementarity). ...
Article
Full-text available
Introduction: The use of multiple interfaces may improve the perception of a stronger relationship between a conversational virtual coach and older adults. The purpose of this paper is to show the effect of output combinations [single-interface (chatbot, tangible coach), multi-interface (assignment, redundant-complementary)] of two distinct conversational agent interfaces (chatbot and tangible coach) on the eCoach-user relationship (closeness, commitment, complementarity) and the older adults' feeling of social presence of the eCoach. Methods: Our study was conducted in two different settings: an online web survey and a face-to-face experiment. Results: Our online study with 59 seniors shows that presenting outputs in a multi-interface redundant-complementary manner significantly improves the eCoach-user relationship and the social presence of the eCoach compared to using single-interface outputs only. In our face-to-face experiment with 15 seniors, significant results were found only in terms of higher social presence of the multi-interface redundant-complementary condition compared to the chatbot-only condition. Discussion: We also investigated the effect of each study design on our results, using both quantitative and qualitative methods.
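
The redundancy, assignment, and complementarity output combinations named in the citation above can be sketched as three small dispatch functions. The interface names and suitability scores below are hypothetical and serve only to illustrate the three patterns, not any of the cited implementations.

```python
# A minimal sketch of the three output-combination patterns named in the citation
# (redundancy, assignment, complementarity). Interface names and the suitability
# scores are hypothetical, not taken from the CASE/CARE frameworks or the study.

def redundant(message, interfaces):
    """Redundancy: the same message goes to every interface."""
    return {iface: message for iface in interfaces}

def assigned(message, interfaces, suitability):
    """Assignment: the message goes only to the most suitable interface."""
    best = max(interfaces, key=lambda i: suitability.get(i, 0.0))
    return {best: message}

def complementary(parts, interfaces):
    """Complementarity: each interface carries one part of the overall meaning."""
    return dict(zip(interfaces, parts))

if __name__ == "__main__":
    ifaces = ["chatbot", "tangible_coach"]
    print(redundant("Time for your walk!", ifaces))
    print(assigned("Time for your walk!", ifaces,
                   {"chatbot": 0.4, "tangible_coach": 0.9}))
    print(complementary(["Time for your walk!", "<nods towards the door>"], ifaces))
```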
... The motivation behind focusing on specific aspects of the architecture stems from the objective of creating a robust and versatile framework for SEGs in rehabilitation. The incorporation of Decision, Action, Perception, and Interpretation states [18] ensures a comprehensive understanding of patient interactions, enabling personalized and adaptive therapy exercises. Additionally, the inclusion of a data structuring and storage module [19] is crucial for effectively processing and analyzing the diverse data generated via numerous modalities in rehabilitation settings. ...
... By incorporating fusion, fission, flexibility/equivalence, and complementarity functionalities [18], the architecture seamlessly integrates and coordinates various modalities, enhancing the overall effectiveness of rehabilitation interventions. Dynamic modality management at runtime [22] facilitates the adaptability and customization of gameplay to meet changing patient needs. ...
... The Decision, Action, Perception, and Interpretation states were used [18]. States represent groups of modules. ...
Article
Full-text available
Serious Exergames (SEGs) have paid little attention to flexibility/equivalence, complementarity, and monitoring (functionalities of systems that deal with a wide variety of inputs). These functionalities are necessary for health SEGs due to the variety of treatment and measurement requirements. No known SEG architecture includes all three of these functionalities. In this paper, we present the 123-SGR software architecture for the creation of an SEG that is appropriate to the needs of professionals and patients in the area of rehabilitation. An existing SEG was adapted, and therapy-related sensor devices (Pneumotachograph, Manovacuometer, Pressure Belt, and Oximeter) were built to help the patient interact with the SEG. The architecture allows the most varied input combinations, with and without fusion, and these combinations are possible for both conscious and unconscious signals. Health and Technology professionals assessed the SEG and found that it had the functionalities of flexibility/equivalence, complementarity, and monitoring, and that these are important and necessary functionalities. The 123-SGR architecture can be used as a blueprint for future SEG development.
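
The Decision, Action, Perception, and Interpretation states mentioned in the citations above group modules into a simple processing loop. The sketch below is only an assumed illustration of that organization; the class names, example sensor values, and game actions are hypothetical and do not come from the 123-SGR implementation.

```python
# Illustrative sketch of organizing modules into the Perception, Interpretation,
# Decision and Action states referenced as [18] in the citations above. All class
# and method names are hypothetical, not taken from the 123-SGR code.

class Perception:
    def read(self):
        # e.g. poll the therapy sensor devices (pressure belt, oximeter, ...)
        return {"pressure": 42.0, "spo2": 97}

class Interpretation:
    def interpret(self, raw):
        # e.g. turn raw signals into game-level events (fusion could happen here)
        return {"breath_effort": "ok" if raw["pressure"] < 50 else "high"}

class Decision:
    def decide(self, events):
        # e.g. adapt exercise difficulty from the interpreted events
        return "increase_difficulty" if events["breath_effort"] == "ok" else "rest"

class Action:
    def act(self, command):
        print(f"game action: {command}")

def game_loop_step(p, i, d, a):
    """One pass through the Perception -> Interpretation -> Decision -> Action loop."""
    a.act(d.decide(i.interpret(p.read())))

if __name__ == "__main__":
    game_loop_step(Perception(), Interpretation(), Decision(), Action())
```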
... Humans are able to describe visual scenes linguistically and, conversely, to generate visual representations, e.g., in their mind's eye or on a sheet of paper, on the basis of linguistic descriptions [75,76]. These modality changes require mental capabilities in the area of multimodal fusion and fission [21]. From a computational point of view, the second of these capabilities is modeled in terms of text-to-scene systems (e.g. ...
Chapter
Full-text available
We introduce Semantic Scene Builder (SeSB), a VR-based text-to-3D scene framework using SemAF (Semantic Annotation Framework) as a scheme for annotating discourse structures. SeSB integrates a variety of tools and resources by using SemAF and UIMA as a unified data structure to generate 3D scenes from textual descriptions. Based on VR, SeSB allows its users to change annotations through body movements instead of symbolic manipulations: from annotations in texts to corrections in editing steps to adjustments in generated scenes, all this is done by grabbing and moving objects. We evaluate SeSB in comparison with a state-of-the-art open-source text-to-scene method (the only one which is publicly available) and find that our approach not only performs better, but also allows for modeling a greater variety of scenes. Keywords: Text-to-3D Scene Generation, Semantic Annotation Framework, Virtual Reality
... The research field of multimodal machine learning (ML) brings unique challenges for both computational and theoretical research given the heterogeneity of various data sources (Baltrušaitis et al., 2018; Liang et al., 2022). At its core lies the learning of multimodal representations that capture correspondences between modalities for prediction; this has emerged as a vibrant interdisciplinary field of immense importance and extraordinary potential in multimedia (Naphade et al., 2006; Liang et al., 2023b), affective computing, robotics (Kirchner et al., 2019; Lee et al., 2019), finance (Hollerer et al., 2018), dialogue (Pittermann et al., 2010), human-computer interaction (Dumas et al., 2009; Obrenovic and Starcevic, 2004), and healthcare (Frantzidis et al., 2010; Xu et al., 2019). In order to accelerate research in building general-purpose multimodal models across diverse research areas, modalities, and tasks, we contribute MULTIBENCH (Figure 1), a systematic and unified large-scale benchmark that brings us closer to the requirements of real-world multimodal applications. ...
Preprint
Learning multimodal representations involves integrating information from multiple heterogeneous sources of data. In order to accelerate progress towards understudied modalities and tasks while ensuring real-world robustness, we release MultiZoo, a public toolkit consisting of standardized implementations of > 20 core multimodal algorithms and MultiBench, a large-scale benchmark spanning 15 datasets, 10 modalities, 20 prediction tasks, and 6 research areas. Together, these provide an automated end-to-end machine learning pipeline that simplifies and standardizes data loading, experimental setup, and model evaluation. To enable holistic evaluation, we offer a comprehensive methodology to assess (1) generalization, (2) time and space complexity, and (3) modality robustness. MultiBench paves the way towards a better understanding of the capabilities and limitations of multimodal models, while ensuring ease of use, accessibility, and reproducibility. Our toolkits are publicly available, will be regularly updated, and welcome inputs from the community.
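
The abstract describes a standardized end-to-end pipeline covering data loading, training, and evaluation of accuracy and modality robustness. The outline below only illustrates the shape of such a pipeline under assumed, simplified stand-ins; none of these function names belong to the actual MultiZoo or MultiBench APIs.

```python
# Hypothetical outline of a standardized multimodal evaluation loop, written only to
# illustrate the kind of pipeline (load data, train a fusion model, evaluate accuracy
# and modality robustness) described in the abstract. Not the MultiZoo/MultiBench API.
import random

def load_dataset(name):
    """Stand-in loader: returns (train, test) lists of ({modality: features}, label)."""
    make = lambda: ({"text": [random.random()] * 4, "audio": [random.random()] * 4},
                    random.randint(0, 1))
    return [make() for _ in range(64)], [make() for _ in range(16)]

def train_fusion_model(train):
    """Stand-in 'training': memorize the majority label as a trivial baseline."""
    majority = round(sum(y for _, y in train) / len(train))
    return lambda x: majority

def evaluate(model, test, drop_modality=None):
    """Accuracy, optionally with one modality dropped to probe robustness."""
    correct = 0
    for x, y in test:
        x = {m: v for m, v in x.items() if m != drop_modality}
        correct += int(model(x) == y)
    return correct / len(test)

if __name__ == "__main__":
    train, test = load_dataset("toy_sentiment")
    model = train_fusion_model(train)
    print("clean accuracy:", evaluate(model, test))
    print("without audio:", evaluate(model, test, drop_modality="audio"))
```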
... Multimodal interfaces pose specific challenges, such as data fusion and real-time processing [5,6], but reward the effort with better stability [7] and reliability [8]. In addition, compared to unimodal interfaces [9], higher user acceptance is achieved without the process being compromised by a higher perceived mental workload [4,10], an aspect that is beneficial for ergonomic and health reasons. ...
Article
Full-text available
Multimodal user interfaces promise natural and intuitive human–machine interactions. However, is the extra effort for the development of a complex multisensor system justified, or can users also be satisfied with only one input modality? This study investigates interactions in an industrial weld inspection workstation. Three unimodal interfaces, including spatial interaction with buttons augmented on a workpiece or a worktable, and speech commands, were tested individually and in a multimodal combination. Within the unimodal conditions, users preferred the augmented worktable, but overall, the interindividual usage of all input technologies in the multimodal condition was ranked best. Our findings indicate that the implementation and the use of multiple input modalities is valuable and that it is difficult to predict the usability of individual input modalities for complex systems.
... Multimodal learning has a long-standing history that predates and continues through the deep learning era [15]. The advancements in deep learning have been instrumental in the considerable growth of multimodal learning [16], leading to a surge of interest in multimodal interfaces, with a focus on improving the flexibility, transparency, and expressiveness of HCIs [3,17,18]. Gesture signals have been used in various multimodal applications, such as speech disambiguation [19] and object manipulation in augmented reality [20]. Recently, there has been an increasing trend of using gestures to enhance HCIs [21,22,23,24]. ...
Preprint
Full-text available
The adoption of multimodal interactions by Voice Assistants (VAs) is growing rapidly to enhance human-computer interactions. Smartwatches have now incorporated trigger-less methods of invoking VAs, such as Raise To Speak (RTS), where the user raises their watch and speaks to VAs without an explicit trigger. Current state-of-the-art RTS systems rely on heuristics and engineered Finite State Machines to fuse gesture and audio data for multimodal decision-making. However, these methods have limitations, including limited adaptability, scalability, and induced human biases. In this work, we propose a neural network based audio-gesture multimodal fusion system that (1) better understands the temporal correlation between audio and gesture data, leading to precise invocations; (2) generalizes to a wide range of environments and scenarios; (3) is lightweight and deployable on low-power devices, such as smartwatches, with quick launch times; and (4) improves productivity in asset development processes.
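
As a toy illustration of learning a joint invocation decision from audio and gesture streams, the sketch below concatenates pooled audio and motion features and scores them with a tiny two-layer network in NumPy. The feature dimensions, parameter shapes, and scoring function are assumptions for illustration only; the actual RTS model, features, and training setup are not described here.

```python
# Toy late-fusion scorer for audio + gesture features, sketched with NumPy.
# It only illustrates the general idea of a learned audio-gesture fusion decision;
# it is not the paper's model, features, or training procedure.
import numpy as np

rng = np.random.default_rng(0)

def fuse_and_score(audio_feat, gesture_feat, params):
    """Concatenate the two feature vectors and apply a tiny 2-layer network."""
    x = np.concatenate([audio_feat, gesture_feat])
    h = np.tanh(params["W1"] @ x + params["b1"])
    logit = params["w2"] @ h + params["b2"]
    return 1.0 / (1.0 + np.exp(-logit))  # probability of "invoke the assistant"

if __name__ == "__main__":
    d_audio, d_gesture, d_hidden = 16, 8, 12
    params = {
        "W1": rng.normal(size=(d_hidden, d_audio + d_gesture)) * 0.1,
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(size=d_hidden) * 0.1,
        "b2": 0.0,
    }
    audio = rng.normal(size=d_audio)      # e.g. pooled audio embedding
    gesture = rng.normal(size=d_gesture)  # e.g. accelerometer/gyro summary
    print(f"invocation probability: {fuse_and_score(audio, gesture, params):.3f}")
```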