Dinei Florencio's research while affiliated with Microsoft and other places

Publications (25)

Article
Detecting camouflaged moving foreground objects has been known to be difficult due to the similarity between the foreground objects and the background. Conventional methods cannot distinguish the foreground from background due to the small differences between them and thus suffer from under-detection of the camouflaged foreground objects. In this p...
Article
We present a context-driven method to encode nodes of an octree, which is typically used to encode point cloud geometry. Instead of using one bit per node of the tree, the context allows for deriving probabilities for that node based on distances of the actual voxel to voxels in a reference point cloud. Accurate probabilities of the node state allo...
Article
Multi-channel speech enhancement with ad-hoc sensors has been a challenging task. Speech model guided beamforming algorithms are able to recover natural sounding speech, but the speech models tend to be oversimplified or the inference would otherwise be too complicated. On the other hand, deep learning based enhancement approaches are able to learn...
Article
Auditory spatial localization in humans is performed using a combination of interaural time differences, interaural level differences, as well as spectral cues provided by the geometry of the ear. To render spatialized sounds within a virtual reality (VR) headset, either individualized or generic Head Related Transfer Functions (HRTFs) are usually...
Data
The experimental procedure, setup, and conditions can be seen here: https://youtu.be/i97RvpXO0s4.
Conference Paper
Foreground detection has been widely studied for decades due to its importance in many practical applications. Most of the existing methods assume foreground and background show visually distinct characteristics and thus the foreground can be detected once a good background model is obtained. However, there are many situations where this is not the...
Article
Humans are good at selectively listening to specific target conversations, even in the presence of multiple concurrent speakers. In our research, we study how auditory-visual cues modulate this selective listening. We do so by using immersive Virtual Reality technologies with spatialized audio. Exposing 32 participants to an Information Masking Tas...
Article
We propose a new Head-Related Transfer Function (HRTF) interpolation method using Isomap, a nonlinear dimensionality reduction technique. First, we construct a single manifold for all subjects across both azimuth and elevation angles through the construction of an Intersubject Graph (ISG) that includes important prior knowledge of the HRTFs such as...
Article
In this paper, we introduce a real-time face recognition (and announcement) system targeted at aiding blind and low-vision people. The system uses a Microsoft Kinect sensor as a wearable device, performs face detection, and uses temporal coherence along with a simple biometric procedure to generate a sound associated with the identified person,...
Conference Paper
In this paper we consider the problem of speech enhancement in real-world-like conditions where multiple noises can simultaneously corrupt speech. Most of the current literature on speech enhancement focuses primarily on the presence of a single noise in corrupted speech, which is far from real-world environments. Specifically, we deal with improving speech...
Article
We present a new anthropometry-based method to personalize head-related transfer functions (HRTFs) using manifold learning in both azimuth and elevation angles with a single nonlinear regression model. The core element of our approach is a domain-specific nonlinear dimensionality reduction technique, denominated Isomap, over the intraconic componen...
Article
Compressing attributes on 3D point clouds such as colors or normal directions has been a challenging problem, since these attribute signals are unstructured. In this paper, we propose to compress such attributes with a graph transform. We construct graphs on small neighborhoods of the point cloud by connecting nearby points, and treat the attributes...
Article
Depth image compression is important for compact representation of 3D visual data in "texture-plus-depth" format, where texture and depth maps from one or more viewpoints are encoded and transmitted. A decoder can then synthesize a freely chosen virtual view via depth-image-based rendering (DIBR) using nearby coded texture and depth maps as referenc...
Conference Paper
In this paper, we introduce a new anthropometric-based method for customizing Head-Related Transfer Functions (HRTFs) in the horizontal plane. The method uses Isomap, artificial neural networks (ANN), and a neighborhood-based reconstruction procedure. We first modify Isomap's graph construction step to emphasize the individuality of HRTFs and per...
Article
A number of applications in acoustics, such as echo cancellation, require learning the acoustic impulse response from each deployed loudspeaker to each microphone, that is, the room transfer function. This has conventionally been done separately at each microphone for each loudspeaker. However, the signals arriving at the array share a common structure, whi...
Conference Paper
The increasing use of mobile robots in social contexts makes it important to provide them with the ability to behave in the most socially acceptable way possible. In this paper we investigate the problem of making a robot learn how to approach a person in order to increase the chance of a successful engagement. We propose the use of Gaussian Proces...
Conference Paper
We present a method for a mobile robot to follow a person autonomously where there is an interaction between the robot and human during following. The planner takes into account the predicted trajectory of the human and searches future trajectories of the robot for the path with the highest utility. Contrary to traditional motion planning, instead...

Citations

... In recent years, machine learning techniques have also gained popularity in the selection of near-field beamforming sensors. These methods take advantage of the power of data-driven algorithms to learn patterns and relationships from training data, enabling the identification of sensors that provide the most beamforming information [25]. ...
... Due to high similarities between object and background, COD is usually more challenging than traditional object segmentation tasks. Early approaches to COD used different handcrafted features like texture and contrast [21], 3D convexity [22], and optical flow [23], or tried to solve the problem in the frequency domain [24]. As the complexity of the scene rises, these approaches often struggle to segment the camouflaged objects. ...
... This technique possesses the coding of an interlayer residual feature. In their study, De et al. [23] established context by measuring the distances between the actual voxels and those of a reference PC. They then used this context to determine the probabilities of the node states, which helped lower the bit rate. ...
... Neural networks can learn any function given a sufficient number of parameters and training data, and have been successfully used in fields like image classification, inverse problems, image segmentation [14][15][16][17], and many fields in acoustics [18]. They have also already been used in beamforming, mostly in areas related to speech recognition for cleaning speech recordings [19], and in recent years, intensively in various approaches for sound source localization [20]. Related to beamforming, some researchers have successfully applied neural networks to estimate Capon beamforming weights [21], where the neural network indirectly learned the optimal weightings. ...
... Visual "capture" effects, where visual stimuli associated with a sound source affect its localization in space, have been extensively studied in experimental psychology [35]. Similar effects were recently observed in VR using generic HRTFs [9]. Thus, in future works, we plan to investigate the effect of different approaches and degrees of HRTF personalization, as well as to compare our acoustic simulation approach to other auralisation engines and simulators, such as Project Triton [58], in relation to the level of HRTF personalization. ...
... Other tools that implement supervised conditional GANs include [104] [169] [170] [93]. In [171], a Bayesian network is exploited to generate estimated clean speech from a noisy one. ...
... For the computation of the vector of beamformer coefficients f_k, several criteria exist. Here f_k is estimated using an MSE-based criterion similar to [22]: f_k = argmin ...
... In initial investigations, the majority of approaches employ basic features such as texture, edges, luminosity, and color to differentiate the camouflaged object from its surroundings [36][37][38][39][40][41]. Nevertheless, camouflage often disrupts the inherent features to deceive the observer, rendering these approaches comparatively less efficacious. ...
... Given the above discussion and the current experimental limitations, we plan to apply our methodology along four main directions to evaluate the influence on co-immersion of: (a) ERs simplification following previous related works [15,31]. To this end, one possible artificial reverberator is the Scattering Delay Network (SDN) [18], which models ERs according to physical properties; (b) different listening environments, considering a wider set than the three VAEs used in the present study; (c) different conversational scenarios, considering concurrent talking [30], turn-taking dialogues, or partially overlapping speakers; (d) the effect of visual elements in modulating auditory perception and cognition, e.g. visual rendering matching the acoustic features (size, material, etc.). ...
... It showed a better performance in comparison with interpolation methods without a priori knowledge. In recent years, more novel interpolation methods using prior data to model HRTFs have been proposed, such as a Gaussian process regression method (Luo et al., 2013), a manifold learning method (Grijalva et al., 2017), a spherical-harmonics-based spatial aliasing cancellation method (Alon et al., 2018), etc. However, the above methods are relatively complicated, and no subjective experiments have been conducted to evaluate the perceptual errors. ...