Figure 6. General scheme of the semantic feature extraction methodology.


Source publication
Article
Full-text available
While wearable cameras are becoming increasingly popular, locating relevant information in large unstructured collections of egocentric images is still a tedious and time-consuming process. This paper addresses the problem of organizing egocentric photo streams acquired by a wearable camera into semantically meaningful segments. First, contextual...

Citations

... The Egocentric Dataset of the University of Barcelona - Segmentation (EDUB-Seg) [161,45] is a dataset acquired with a Narrative Clip camera taking a picture every 30 seconds, and it contains 18,735 frames from 7 users. For the sake of variety, each user recorded in different scenarios: attending a conference, on holiday, during the weekend, and during the week. ...
Article
The egocentric action recognition (EAR) field has recently increased in popularity due to the affordable and lightweight wearable cameras available nowadays, such as GoPro and similar devices. Therefore, the amount of egocentric data generated has increased, triggering interest in the understanding of egocentric videos. More specifically, the recognition of actions in egocentric videos has gained popularity due to the challenge that it poses: the wild movement of the camera and the lack of context make it hard to recognise actions with a performance similar to that of third-person vision solutions. This has ignited research interest in the field and, nowadays, many public datasets and competitions can be found in both the machine learning and the computer vision communities. In this survey, we aim to analyse the literature on egocentric vision methods and algorithms. For that, we propose a taxonomy to divide the literature into various categories with subcategories, contributing a more fine-grained classification of the available methods. We also provide a review of the zero-shot approaches used by the EAR community, a methodology that could help to transfer EAR algorithms to real-world applications. Finally, we summarise the datasets used by researchers in the literature.
... This summarization aims to support people affected by neuronal degeneration. Other similar studies have been proposed based on the same methodology of clustering-based event segmentation [37] and summarization using contextual and semantic information [38]. ...
... They performed a summary of autobiographical episodes and a semantic key-frame selection and, finally, implemented text-based inverted index retrieval techniques. The episode temporal segmentation was based on semantic regularized clustering [38]. This model was applied to a data set, and the results suggest that the system stimulates the memory of patients with mild cognitive impairment, for example, patients with dementia. ...
... However, most of these data sets focus on smaller amounts of data for specific use-case applications rather than on capturing all the daily activities and behaviors of a lifelogger. An example of these data sets is the Egocentric Dataset of the University of Barcelona (EDUB) [78], which is divided into different sub-data sets depending on the data annotations, such as the EDUB-Obj data set for object localization or segmentation [89], the EDUB-Seg data set for egocentric event segmentation [37,38], and the EDUB-SegDesc data set, which can be used either for egocentric event segmentation or for egocentric sequence description [90]. ...
Article
Full-text available
Background: Over the past decade, the wide availability and small size of different types of sensors, together with the decrease in pricing, have allowed the acquisition of a substantial amount of data about a person's life in real time. These sensors can be incorporated into personal electronic devices available at a reasonable cost, such as smartphones and small wearable devices. They allow the acquisition of images, audio, location, physical activity, and physiological signals among other data. With these data, usually denoted as lifelog data, we can then analyze and understand personal experiences and behaviors. This process is called lifelogging. Objective: The objective of this paper was to present a narrative review of the existing literature about lifelogging over the past decade. To achieve this goal, we analyzed lifelogging applications used to retrieve relevant information from daily digital data, some of them with the purpose of monitoring and assisting people with memory issues and others designed for memory augmentation. We aimed for this review to be used by researchers to obtain a broad idea of the type of data used, methodologies, and applications available in this research field. Methods: We followed a narrative review methodology to conduct a comprehensive search for relevant publications in Google Scholar and Scopus databases using lifelog topic-related keywords. A total of 411 publications were retrieved and screened. Of these 411 publications, 114 (27.7%) publications were fully reviewed. In addition, 30 publications were manually included based on our bibliographical knowledge of this research field. Results: From the 144 reviewed publications, a total of 113 (78.5%) were selected and included in this narrative review based on content analysis. The findings of this narrative review suggest that lifelogs are prone to become powerful tools to retrieve memories or increase knowledge about an individual's experiences or behaviors. Several computational tools are already available for a considerable range of applications. These tools use multimodal data of different natures, with visual lifelogs being one of the most used and rich sources of information. Different approaches and algorithms to process these data are currently in use, as this review will unravel. Moreover, we identified several open questions and possible lines of investigation in lifelogging. Conclusions: The use of personal lifelogs can be beneficial to improve the quality of our life, as they can serve as tools for memory augmentation or for providing support to people with memory issues. Through the acquisition and analysis of lifelog data, lifelogging systems can create digital memories that can be potentially used as surrogate memory. Through this narrative review, we understand that contextual information can be extracted from lifelogs, which provides an understanding of the daily life of a person based on events, experiences, and behaviors.
... To this end, we asked healthy participants to retrieve their own AMs cued by pictures taken automatically (i.e., every 30 sec) by a wearable camera carried during one week of daily life routine, after 1–2 weeks, and 6–14 months from the encoding period. To ensure pictures presented during the test cued most of the episodic events that unfolded during the encoding week, we implemented a convolutional network-based algorithm (Dimiccoli et al., 2015) on the entire recorded picture set that automatically grouped together temporally adjacent images sharing contextual and semantic attributes, akin to how we conceive what underlies an event episode from a perception and memory perspective (Zacks & Swallow, 2007; see also D'Argembeau, 2018). In doing so, the large picture set collected reflecting an entire day's life activity (e.g., ~400 pictures) is grouped into a workable number of picture subsets (e.g., ~20) depicting sequences of temporally adjacent episodic events (e.g., breakfast at home, commuting to work, buying oranges in the corner shop, eating a sandwich at the park). ...
... We implemented a deep neural network-based algorithm, SR-Clustering, to automatically organize the stream of each participant's pictures into a set of temporally evolving meaningful events (Dimiccoli et al., 2015). The algorithm segments picture sequences into discrete events (e.g., having breakfast in a kitchen, commuting to work, being in a meeting) based on its ability to identify similar contextual and semantic features from the picture stream. ...
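The two excerpts above describe SR-Clustering only at a high level. As an illustration of the general idea they refer to (grouping temporally adjacent images whose contextual and semantic descriptors are similar), the following minimal Python sketch greedily merges consecutive frames into an event while their feature vectors stay close to the running event centroid. The feature source, the cosine-similarity test, and the threshold value are illustrative assumptions, not the published algorithm.

import numpy as np

def segment_events(features: np.ndarray, sim_threshold: float = 0.8) -> list:
    """Greedily group temporally adjacent frames into events.

    features: (n_frames, dim) array of per-image descriptors, e.g. CNN
              activations concatenated with semantic concept scores.
    Returns a list of events, each a list of frame indices.
    """
    # L2-normalise so that dot products equal cosine similarities.
    feats = features / (np.linalg.norm(features, axis=1, keepdims=True) + 1e-8)

    events = [[0]]
    for i in range(1, len(feats)):
        # Compare the new frame with the running centroid of the current event.
        centroid = feats[events[-1]].mean(axis=0)
        centroid /= np.linalg.norm(centroid) + 1e-8
        if float(feats[i] @ centroid) >= sim_threshold:
            events[-1].append(i)   # similar context: same event
        else:
            events.append([i])     # dissimilar: a new event starts here
    return events

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two synthetic "environments": frames 0-9 and 10-19 lie around different centres.
    a = rng.normal(0, 1, 64) + rng.normal(0, 0.05, (10, 64))
    b = rng.normal(0, 1, 64) + rng.normal(0, 0.05, (10, 64))
    print(segment_events(np.vstack([a, b])))   # expected: two events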
Article
Full-text available
Autobiographical memory (AM) has been largely investigated as the ability to recollect specific events that belong to an individual’s past. However, how we retrieve real-life routine episodes and how the retrieval of these episodes changes with the passage of time remain unclear. Here, we asked participants to use a wearable camera that automatically captured pictures to record instances during a week of their routine life and implemented a deep neural network-based algorithm to identify picture sequences that represented episodic events. We then asked each participant to return to the lab to retrieve AMs for single episodes cued by the selected pictures 1 week, 2 weeks and 6 to 14 months after encoding while scalp electroencephalographic (EEG) activity was recorded. We found that participants were more accurate in recognizing pictured scenes depicting their own past than pictured scenes encoded in the lab, and that memory recollection of personally experienced events rapidly decreased with the passing of time. We also found that the retrieval of real-life picture cues elicited a strong and positive ‘ERP old/new effect’ over frontal regions and that the magnitude of this ERP effect was similar throughout memory tests over time. However, we observed that recognition memory induced a frontal theta power decrease and that this effect was mostly seen when memories were tested after 1 and 2 weeks but not after 6 to 14 months from encoding. Altogether, we discuss the implications for neuroscientific accounts of episodic retrieval and the potential benefits of developing individual-based AM exploration strategies at the clinical level.
... In [15], events are intended as groups of images highlighting the presence of personal locations of interest specified by the end-user. In the domain of egocentric photo-streams, events are defined as temporal semantic segments sharing semantic and contextual information [16], [17]. ...
... Events in egocentric photo-streams correspond to temporally adjacent images that share contextual and semantic features, as defined in [16]. This method relates sequential images represented as a combination of semantic and visual features extracted with a CNN. ...
... Image-sequence level: The image-sequence-level models took temporal information into account and used the previously trained models as a backbone. Our event boundaries were obtained using SR-Clustering [16]. In order to measure the importance of the temporal information in the models, we used boundaries from three other settings. ...
Chapter
The recognition of human activities captured by a wearable photo-camera is especially suited for understanding the behavior of a person. However, it has received comparatively little attention with respect to activity recognition from fixed cameras. In this work, we propose to use segmented events from photo-streams as temporal boundaries to improve the performance of activity recognition. Furthermore, we robustly measure its effectiveness both when images of the evaluated person have been seen during training and when the person is completely unknown during testing. Experimental results show that leveraging temporal boundary information on pictures of seen people improves all classification metrics; in particular, classification accuracy improves to up to 85.73%.
... In [21], the semantic regularized clustering method was used in a related study to represent photos as semantic visual concepts instead of CNN feature vectors. They first obtain a set of objects/tags/concepts detected in the photos, with their associated confidence values. ...
... To compare the proposed SAPS method with the baseline methods in [16,17,21] and [22], six participants were asked to share photos taken with smartphones during the past year. Although all photos included timestamps, some photos had missing GPS data. ...
...
• PhotoToc [16]: Time-based clustering that uses an adaptive threshold method to detect noticeable time gaps (a minimal sketch of this idea follows the list).
• Multi-modal [17]: A generative probabilistic model that combines visual features such as time, GPS, color and CNN features to determine the optimal event boundaries.
• Semantic regularized clustering (SR-clustering): SR-clustering, as described in [21], which uses visual and semantic features.
• CES [22]: A segmentation framework that uses an LSTM-based generative network to decide whether a photo is an event boundary by comparing the visual context generated from the photos in the past to that predicted for the future.
...
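For readers unfamiliar with the first baseline, the sketch below shows one plausible reading of an adaptive time-gap detector of the PhotoToc type: scan timestamps in order and open a new event whenever a gap is much larger than the recent average gap. The window length and gap factor are illustrative choices, not the parameters used in [16].

from datetime import datetime, timedelta

def time_gap_boundaries(timestamps, window: int = 10, gap_factor: float = 3.0):
    """Return the indices of photos that open a new event.

    A boundary is declared when the gap to the previous photo exceeds
    `gap_factor` times the average of the previous `window` gaps."""
    gaps = [(t2 - t1).total_seconds() for t1, t2 in zip(timestamps, timestamps[1:])]
    boundaries = []
    for i, gap in enumerate(gaps):
        recent = gaps[max(0, i - window):i] or gaps[:1]   # local context of gaps
        if gap > gap_factor * (sum(recent) / len(recent)):
            boundaries.append(i + 1)   # photo i+1 starts a new event
    return boundaries

if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    # Photos every 30 s, with one two-hour break after the 5th photo.
    stamps = [start + timedelta(seconds=30 * k) for k in range(5)]
    stamps += [stamps[-1] + timedelta(hours=2) + timedelta(seconds=30 * k) for k in range(5)]
    print(time_gap_boundaries(stamps))   # expected: [5]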
Article
The number of people collecting photos has surged owing to social media and cloud services in recent years. A typical approach to summarize a photo collection is dividing it into events and selecting key photos from each event. Despite the fact that a certain event comprises several sub-events, few studies have proposed sub-event segmentation. We propose the sentiment analysis-based photo summarization (SAPS) method, which automatically summarizes personal photo collections by utilizing metadata and visual sentiment features. For this purpose, we first cluster events using metadata of photos and then calculate the novelty scores to determine the sub-event boundaries. Next, we summarize the photo collections using a ranking algorithm that measures sentiment, emotion, and aesthetics. We evaluate the proposed method by applying it to the photo collections of six participants consisting of 5,480 photos in total. We observe that our sub-event segmentation based on sentiment features outperforms the existing baseline methods. Furthermore, the proposed method is also more effective in finding sub-event boundaries and key photos, because it focuses on detailed sentiment features instead of general content features.
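The abstract describes the sub-event step only in terms of "novelty scores". One hedged reading, sketched below, is to score each photo by its distance to the photos immediately before it and to place sub-event boundaries at local novelty peaks; the distance measure, context size, and peak rule are assumptions for illustration and not necessarily the SAPS procedure.

import numpy as np

def novelty_boundaries(features: np.ndarray, context: int = 5,
                       min_score: float = 0.5):
    """Score each photo by its average cosine distance to the `context`
    previous photos and mark a sub-event boundary at local novelty peaks
    that exceed `min_score`. `features` are assumed to be L2-normalised."""
    n = len(features)
    scores = np.zeros(n)
    for i in range(1, n):
        prev = features[max(0, i - context):i]
        scores[i] = float(np.mean(1.0 - prev @ features[i]))   # mean cosine distance
    return [i for i in range(1, n - 1)
            if scores[i] > min_score
            and scores[i] >= scores[i - 1] and scores[i] >= scores[i + 1]]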
... We visually depict the pipelines in Fig. 1. 1. Temporal photo-sequence segmentation: We introduce the Semantic Regularized Clustering automatic model (SR-Clustering) [1] to define the temporal boundaries that divide egocentric photo-sequences into moments, i.e., sequences of images describing the same environment. This model takes into account semantic concepts in the image together with the global image context for event representation. ...
... Moreover, we introduce a hierarchical model composed of different layers of deep neural networks for classification. [Figure 1: Given a collection of egocentric photo-streams describing the lifestyle of the camera wearer, we have developed automatic tools for the temporal segmentation into events [1], sentiment classification [6,7], routine discovery [5,8], social pattern analysis [4], and food-related scene recognition [2].] We adapt this model to the introduced taxonomy for the recognition of visually highly similar food-related images into 15 different classes. Finally, we introduce the EgoFoodScenes dataset, composed of 33,000 images and 15 food-related environments. ...
Article
Full-text available
Describing people's lifestyle has become a hot topic in the field of artificial intelligence. Lifelogging is described as the process of collecting personal activity data describing the daily behaviour of a person. Nowadays, the development of new technologies and the increasing use of wearable sensors make it possible to automatically record data from our daily living. In this paper, we describe our developed automatic tools for the analysis of collected visual data that describe the daily behaviour of a person. For this analysis, we rely on sequences of images collected by wearable cameras, called egocentric photo-streams. These images are a rich source of information about the behaviour of the camera wearer, since they show an objective, first-person view of his or her lifestyle.
... Indeed, wearable cameras make it possible to capture, from a first-person (egocentric) perspective and "in the wild", long unconstrained videos (≈35 fps) and image sequences (a.k.a. photostreams, ≈2 fpm). Due to their low temporal resolution, the segmentation of first-person image sequences is particularly challenging and has received special attention from the community [4][5][6][7][8][9][10][11][12][13][14]. Indeed, abrupt changes in appearance may arise even between temporally adjacent frames within an event due to sudden camera movements and the low frame rate, making it difficult to distinguish them from event transitions. ...
... Our main contributions are: (i) we re-frame the event learning problem as the problem of learning a graph embedding, (ii) we introduce an original graph initialization approach based on the concept of temporal self-similarity, (iii) we propose a novel technical approach to solve the graph embedding problem when the underlying graph structure is unknown, (iv) we demonstrate that the learnt graph embedding is suitable for the task of temporal segmentation, achieving state-of-the-art results on two challenging reference benchmark datasets [11,20], without relying on any training set for learning the representation, (v) we show that the proposed DGE generalizes to other problems, yielding state-of-the-art results also on two reference benchmark datasets for the Human Motion Segmentation problem. ...
... Talavera et al. [7] proposed to combine agglomerative clustering with a change detection method within a graph-cut energy minimization framework. Later on, [11] extended this framework and proposed an improved feature representation by building a vocabulary of concepts. Paci et al. [8] proposed a Siamese ConvNet-based approach that aims at learning a similarity function between low temporal resolution egocentric images. ...
Article
Full-text available
Recently, self-supervised learning has proved to be effective to learn representations of events suitable for temporal segmentation in image sequences, where events are understood as sets of temporally adjacent images that are semantically perceived as a whole. However, although this approach does not require expensive manual annotations, it is data hungry and suffers from domain adaptation problems. As an alternative, in this work, we propose a novel approach for learning event representations named Dynamic Graph Embedding (DGE). The assumption underlying our model is that a sequence of images can be represented by a graph that encodes both semantic and temporal similarity. The key novelty of DGE is to learn jointly the graph and its graph embedding. At its core, DGE works by iterating over two steps: 1) updating the graph representing the semantic and temporal similarity of the data based on the current data representation, and 2) updating the data representation to take into account the current data graph structure. The main advantage of DGE over state-of-the-art self-supervised approaches is that it does not require any training set, but instead learns iteratively from the data itself a low-dimensional embedding that reflects their temporal and semantic similarity. Experimental results on two benchmark datasets of real image sequences captured at regular time intervals demonstrate that the proposed DGE leads to event representations effective for temporal segmentation. In particular, it achieves robust temporal segmentation on the EDUBSeg and EDUBSeg-Desc benchmark datasets, outperforming the state of the art. Additional experiments on two Human Motion Segmentation benchmark datasets demonstrate the generalization capabilities of the proposed DGE.
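The abstract describes DGE as alternating between (1) rebuilding a similarity graph from the current data representation and (2) updating the representation to respect that graph. The toy loop below illustrates that alternation with a k-nearest-neighbour graph augmented by temporal edges and one step of Laplacian smoothing per iteration; the actual DGE objective, its initialization from temporal self-similarity, and its stopping criterion are not reproduced here.

import numpy as np

def dge_like_embedding(features: np.ndarray, k: int = 5,
                       iters: int = 10, alpha: float = 0.5) -> np.ndarray:
    """Alternate between graph construction and embedding smoothing."""
    emb = features.astype(float).copy()
    n = len(emb)
    for _ in range(iters):
        # Step 1: update the graph from the current embedding
        # (k nearest neighbours by Euclidean distance, plus temporal neighbours).
        d = np.linalg.norm(emb[:, None] - emb[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        adj = np.zeros((n, n))
        for i in range(n):
            adj[i, np.argsort(d[i])[:k]] = 1.0
            if i > 0:
                adj[i, i - 1] = 1.0       # temporal edge to the previous frame
            if i < n - 1:
                adj[i, i + 1] = 1.0       # temporal edge to the next frame
        adj = np.maximum(adj, adj.T)      # symmetrise the graph
        # Step 2: update the embedding from the graph (Laplacian smoothing:
        # pull each node towards the mean of its graph neighbours).
        neigh_mean = adj @ emb / np.maximum(adj.sum(axis=1, keepdims=True), 1.0)
        emb = (1 - alpha) * emb + alpha * neigh_mean
    return emb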
... In common two-stage detection frameworks (such as Fast R-CNN and Faster R-CNN), RoI Pooling is used to pool the region corresponding to a proposal box into a fixed-size feature map according to the box's location coordinates, to facilitate the subsequent classification and bounding-box regression operations. The position of the proposal box is usually obtained by model regression and is generally a floating-point number, but the pooled feature map requires a fixed size, so RoI Pooling involves two quantisation steps [56], as shown in Fig. 10 for the Faster R-CNN detection framework. These two quantisation steps of RoI Pooling introduce pixel deviations: taking a feature scaling stride of 32 as an example, a deviation of 0.1 pixels on the feature map corresponds to a deviation of 3.2 pixels on the original image, so the 0.8-pixel deviation in Fig. 9 corresponds to a difference approaching 30 pixels in the original picture, which has a huge impact on the detection of small target objects. ...
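The pixel-deviation argument in the excerpt above can be made concrete with a few lines of arithmetic: rounding a floating-point proposal coordinate to an integer feature-map cell shifts the region by a fraction of a cell, which the feature stride multiplies back into many image pixels. The snippet below reproduces that calculation for an assumed stride of 32; it only illustrates the quantisation effect and is not code from the cited detectors.

def roi_quantisation_error(x1_img: float, x2_img: float, stride: int = 32):
    """Quantise one RoI edge pair as RoI Pooling does (floor to integer
    feature-map coordinates) and report the error back in image pixels."""
    x1_feat, x2_feat = x1_img / stride, x2_img / stride   # float coords on the feature map
    q1, q2 = int(x1_feat), int(x2_feat)                   # first quantisation step
    err_feat = (x1_feat - q1, x2_feat - q2)               # deviation in feature-map cells
    err_img = tuple(e * stride for e in err_feat)         # deviation in image pixels
    return err_feat, err_img

# A 0.8-cell deviation on the feature map corresponds to 0.8 * 32 = 25.6 pixels
# in the original image, which is large relative to the extent of small objects.
print(roi_quantisation_error(x1_img=121.6, x2_img=345.6))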
Article
Full-text available
Image semantic segmentation has always been a research hotspot in the field of robots. Its purpose is to assign different semantic category labels to objects by segmenting different objects. However, in practical applications, in addition to knowing the semantic category information of objects, robots also need to know the position information of objects to complete more complex visual tasks. Aiming at a complex indoor environment, this study designs an image semantic segmentation network framework of joint target detection. Using the parallel operation of adding semantic segmentation branches to the target detection network, it innovatively implements multi-vision task combining object classification, detection and semantic segmentation. By designing a new loss function, adjusting the training using the idea of transfer learning, and finally verifying it on the self-built indoor scene data set, the experiment proves that the method in this study is feasible and effective, and has good robustness.
... The focus of this paper is on temporal video segmentation of daylong egocentric video streams. Due to the task's utility as a pre-processing step for many higher-level inference problems, like indexing and summarization, the problem is a well-researched area. [Table 1: Comparison of the state of the art ([7], SR-Clustering [14], CES [12]) with our method on various criteria important for applicability to egocentric videos.] ...
... Adaptive Windowing: For variable-length events one can use adaptive windowing [7], which maintains the size of a window dynamically by growing the window if the current event is long, and by dropping a sub-window from the tail if an event boundary is detected. [14] combines low-level features with high-level semantic labels and suggests a graph-cut technique to seek a trade-off between adaptive windowing [7] and agglomerative clustering. ...
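As a rough illustration of the adaptive-windowing idea summarised above (grow the window while the event continues, drop frames from the tail once a boundary is detected), here is a minimal sketch; the boundary test, the distance threshold, and the decision to drop the entire old window are placeholders rather than the procedure of [7].

import numpy as np

def adaptive_window_boundaries(features: np.ndarray, dist_threshold: float = 1.0):
    """Grow a window over incoming frames; when a new frame deviates from the
    window mean, record an event boundary and drop the old frames (the tail)
    so the window restarts from the new event. For simplicity the whole old
    window is dropped here, rather than a fixed-size sub-window."""
    boundaries = []
    window = []                      # indices of frames in the current event
    for i in range(len(features)):
        if window and np.linalg.norm(
                features[i] - features[window].mean(axis=0)) > dist_threshold:
            boundaries.append(i)     # event boundary detected before frame i
            window = []              # drop the tail (the previous event's frames)
        window.append(i)             # the window keeps growing within an event
    return boundaries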
... Feature Vector: For all the video datasets, we use the input at 5 fps and use frame-wise AlexNet [29] features, as used by SR-Clustering [14]. However, for a fair comparison on the photo-stream datasets, we use LSTM features similar to those used by [12]. ...
... Picture selection. In experiment 1, we implemented a deep neural network-based algorithm, SR-Clustering (Dimiccoli et al., 2015), to automatically organize the stream of each participant's pictures into a set of temporally evolving meaningful events. ...
Preprint
Full-text available
How does one retrieve real-life episodic memories? Here, we tested the hypothesis, derived from computational models, that successful retrieval relies on neural dynamics patterns that rapidly shift towards stable states. We implemented cross-temporal correlation analysis of electroencephalographic (EEG) recordings while participants retrieved episodic memories cued by pictures collected with a wearable camera depicting real-life episodes taking place at home and at the office. We found that the retrieval of real-life episodic memories is supported by rapid shift towards brain states of stable activity, that the degree of neural stability is associated with the ability of the participants to recollect the episodic content cued by the picture, and that each individual elicits stable EEG patterns that were not shared with other participants. These results indicate that the retrieval of autobiographical memory episodes is supported by rapid shifts of neural activity towards stable states.