Conference Paper

A multi-task learning approach for meal assessment

Abstract

Balanced nutrition and a proper diet play a key role in the prevention of diet-related chronic diseases. Conventional dietary assessment methods are time-consuming, expensive and prone to errors. New technology-based methods that provide reliable and convenient dietary assessment have emerged during the last decade. Advances in the field of computer vision have permitted the use of meal images to assess the nutrient content, usually through three steps: food segmentation, recognition and volume estimation. In this paper, we propose the use of a single RGB meal image as input to a multi-task learning based Convolutional Neural Network (CNN). The proposed approach achieved outstanding performance, while a comparison with state-of-the-art methods indicated that it exhibits a clear advantage in accuracy, along with a massive reduction of processing time.
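To make the approach concrete, here is a minimal sketch of a multi-task CNN of the kind the abstract describes: a shared encoder feeding three heads for segmentation, recognition and volume regression, trained with a weighted sum of per-task losses. The backbone, head shapes, class count and loss weights below are illustrative assumptions, not the authors' actual configuration.

```python
# A hedged sketch of a multi-task meal-assessment CNN: one shared encoder,
# three heads for segmentation, food recognition, and volume regression.
# Architecture details (backbone, head shapes, loss weights) are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskMealNet(nn.Module):
    def __init__(self, num_food_classes=6):
        super().__init__()
        # Shared convolutional encoder (a toy stand-in for ResNet50 + FPN).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
        )
        # Segmentation head: per-pixel class logits, upsampled to input size.
        self.seg_head = nn.Conv2d(128, num_food_classes + 1, 1)  # +1 background
        # Recognition head: image-level food-class logits.
        self.cls_head = nn.Linear(128, num_food_classes)
        # Volume head: a single non-negative scalar per image.
        self.vol_head = nn.Linear(128, 1)

    def forward(self, x):
        feats = self.encoder(x)
        seg = F.interpolate(self.seg_head(feats), size=x.shape[2:],
                            mode="bilinear", align_corners=False)
        pooled = feats.mean(dim=(2, 3))          # global average pooling
        cls = self.cls_head(pooled)
        vol = F.softplus(self.vol_head(pooled))  # volume must be positive
        return seg, cls, vol

# Joint loss: weighted sum of the three task losses (weights are assumptions).
def multitask_loss(seg, cls, vol, seg_gt, cls_gt, vol_gt, w=(1.0, 1.0, 1.0)):
    return (w[0] * F.cross_entropy(seg, seg_gt)
            + w[1] * F.cross_entropy(cls, cls_gt)
            + w[2] * F.l1_loss(vol.squeeze(1), vol_gt))

if __name__ == "__main__":
    net = MultiTaskMealNet()
    img = torch.randn(2, 3, 64, 64)
    seg, cls, vol = net(img)
    print(seg.shape, cls.shape, vol.shape)  # (2,7,64,64) (2,6) (2,1)
```

A real implementation would swap the toy encoder for a pretrained backbone (a citing snippet below indicates the paper's feature extractor is ResNet50 with a feature pyramid network) and tune the loss weights per task.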

... In the field of computer vision, many deep learning methods have been successfully applied to food recognition [16][17][18][19][20][21][22][23][24][25]. Liang et al. ...
... Liang et al. Ref. [22] introduces a multi-task structure to share information between food and ingredients. Ref. [24] aggregates different classes and different scales to obtain a better feature representation. ...
... MSMVFA [22]: MSMVFA fused three different levels of features and different granularities of features to obtain a better feature representation. ...
Article
Full-text available
Image-based food pattern classification poses challenges of non-fixed spatial distribution and ingredient occlusion for mainstream computer vision algorithms. However, most current approaches classify food and ingredients by directly extracting abstract features of the entire image through a convolutional neural network (CNN), ignoring the relationship between food and ingredients and the ingredient occlusion problem. To address these issues, we propose FoodNet for both food and ingredient recognition, which uses a multi-task structure with a multi-scale relationship learning module (MSRL) and a label dependency learning module (LDL). As ingredients normally co-occur in an image, we present the LDL to use ingredient dependencies to alleviate the ingredient occlusion problem. MSRL aggregates multi-scale information of food and ingredients, then uses two relational matrices to model the food-ingredient matching relationship to obtain a richer feature representation. The experimental results show that FoodNet achieves good performance on the Vireo Food-172 and UEC Food-100 datasets. It is worth noting that it reaches state-of-the-art performance in ingredient recognition on both Vireo Food-172 and UEC Food-100. The source code will be made available at https://github.com/visipaper/FoodNet.
... Although such methods have achieved high accuracy in most scenarios, they require an additional depth image as input and may not perform well for black and reflective objects due to the intrinsic limitation of the depth sensors [20]. On the other hand, the studies reported in [6], [20], [28] predict the depth map from a single food image using supervised CNNs for 3D food model building [6], [20] or the direct food volume regression [28]. The performance of such methods greatly depends on the available densely annotated training databases, which are however costly and inconvenient for real applications. ...
... This way the annotation effort is reduced by 95%. The annotated classes include the plate area, the table area and 6 food categories as defined in [28]. ...
Preprint
Full-text available
Food volume estimation is an essential step in the pipeline of dietary assessment and demands the precise depth estimation of the food surface and table plane. Existing methods based on computer vision require either multi-image input or additional depth maps, reducing convenience of implementation and practical significance. Despite the recent advances in unsupervised depth estimation from a single image, the achieved performance in the case of large texture-less areas needs to be improved. In this paper, we propose a network architecture that jointly performs geometric understanding (i.e., depth prediction and 3D plane estimation) and semantic prediction on a single food image, enabling a robust and accurate food volume estimation regardless of the texture characteristics of the target plane. For the training of the network, only monocular videos with semantic ground truth are required, while the depth map and 3D plane ground truth are no longer needed. Experimental results on two separate food image databases demonstrate that our method performs robustly on texture-less scenarios and is superior to unsupervised networks and structure from motion based approaches, while it achieves comparable performance to fully-supervised methods.
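As an illustration of the volume step this preprint targets, the sketch below integrates the height of the food surface above the estimated table plane over the segmented food pixels. Treating the table as a constant-depth plane and assuming a pinhole camera with a known focal length are simplifications made for this example, not the paper's method, which estimates a full 3D plane.

```python
# Hedged sketch: food volume from a predicted depth map, a table-plane
# estimate, and a semantic food mask. A fronto-parallel table (constant
# depth) and a pinhole camera with focal length f are simplifying assumptions.
import numpy as np

def food_volume(depth, food_mask, table_depth, f=500.0):
    """depth: HxW metric depth map (m); food_mask: HxW bool;
    table_depth: scalar depth of the table plane (m); f: focal length (px)."""
    # Height of the food surface above the table at each food pixel.
    heights = np.clip(table_depth - depth, 0.0, None) * food_mask
    # Each pixel's footprint on the surface scales with (depth / f)^2.
    pixel_area = (depth / f) ** 2
    return float(np.sum(heights * pixel_area))  # volume in cubic metres

# Toy example: a 10 cm tall, flat-topped food patch on a table 1 m away.
depth = np.full((100, 100), 1.0)
mask = np.zeros((100, 100), dtype=bool)
mask[30:70, 30:70] = True
depth[mask] = 0.9
print(f"{food_volume(depth, mask, table_depth=1.0) * 1e6:.0f} cm^3")
```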
... The studies mentioned above were mostly related to intake detection and classification rather than intake amount estimation. However, food volume estimation is another problem to be considered, which is the 'how much' problem [73,115,128,[153][154][155]. Meal estimation could be realised based on the respective number of intake gestures for consuming liquid, soup, and meals [41], but the accuracy was not evaluated. ...
... Furthermore, 3D reconstruction algorithms were used in this design, reaching a (5.75 ± 3.75)% error in the volume [154]. A CNN was proposed for depth prediction and volume estimation and significantly improved performance with less than 0.2 s runtime, which was 25 times shorter than conventional 3D reconstruction methods [128]. A geometric model for food amount estimation from single-view images was proposed and achieved less than 6% error for energy estimation, but only on the assumption of accurate segmentation and food classification [153]. ...
Article
Full-text available
Food and fluid intake monitoring are essential for reducing the risk of dehydration, malnutrition, and obesity. The existing research has been preponderantly focused on dietary monitoring, while fluid intake monitoring, on the other hand, is often neglected. Food and fluid intake monitoring can be based on wearable sensors, environmental sensors, smart containers, and the collaborative use of multiple sensors. Vision-based intake monitoring methods have been widely exploited with the development of visual devices and computer vision algorithms. Vision-based methods provide non-intrusive solutions for monitoring. They have shown promising performance in food/beverage recognition and segmentation, human intake action detection and classification, and food volume/fluid amount estimation. However, occlusion, privacy, computational efficiency, and practicality pose significant challenges. This paper reviews the existing work (253 articles) on vision-based intake (food and fluid) monitoring methods to assess the size and scope of the available literature and identify the current challenges and research gaps. This paper uses tables and graphs to depict the patterns of device selection, viewing angle, tasks, algorithms, experimental settings, and performance of the existing monitoring systems.
... The idea of multi-task learning had also been used to design some complex end-to-end VBDA networks. Lu et al. (2018) presented an MTL-based VBDA network, which simultaneously implemented food recognition, segmentation, and volume estimation. The feature extraction module is composed of ResNet50 and a feature pyramid network. ...
... Considering that MADiMa only contains meal images taken by a monocular camera, they created a fast food dataset from McDonald's to support two-view images and stereo image pairs as input, in order to comprehensively evaluate the performance of goFOOD™. Lu et al. (2018) also used the MADiMa dataset as the training and evaluation dataset. Lu et al. (2021a) used it to evaluate the performance of the proposed method. ...
Preprint
Background: Maintaining a healthy diet is vital to avoid health-related issues, e.g., undernutrition, obesity and many non-communicable diseases. An indispensable part of a healthy diet is dietary assessment. Traditional manual recording methods are burdensome and contain substantial biases and errors. Recent advances in Artificial Intelligence, especially computer vision technologies, have made it possible to develop automatic dietary assessment solutions, which are more convenient, less time-consuming and even more accurate for monitoring daily food intake. Scope and approach: This review presents one unified Vision-Based Dietary Assessment (VBDA) framework, which generally consists of three stages: food image analysis, volume estimation and nutrient derivation. Vision-based food analysis methods, including food recognition, detection and segmentation, are systematically summarized, and methods of volume estimation and nutrient derivation are also given. The prosperity of deep learning makes VBDA gradually move to an end-to-end implementation, which applies food images to a single network to directly estimate the nutrition. The recently proposed end-to-end methods are also discussed. We further analyze existing dietary assessment datasets, indicating that one large-scale benchmark is urgently needed, and finally highlight key challenges and future trends for VBDA. Key findings and conclusions: After thorough exploration, we find that multi-task end-to-end deep learning approaches are one important trend of VBDA. Despite considerable research progress, many challenges remain for VBDA due to meal complexity. We also provide the latest ideas for the future development of VBDA, e.g., fine-grained food analysis and accurate volume estimation. This survey aims to encourage researchers to propose more practical solutions for VBDA.
... As a simpler alternative, supervised CNNs permit the use of a single RGB-image as input [15,43,45] for the prediction of the corresponding depth map needed for 3D food model building [15,43] or the direct food volume regression [45]. ...
Article
Full-text available
The Mediterranean diet (MD) is regarded as a healthy eating pattern with beneficial effects both for the decrease of the risk for non-communicable diseases and also for body weight reduction. In the current manuscript, we propose an automated smartphone application which monitors and evaluates the user's adherence to MD using images of the food and drinks that they consume. We define a set of rules for automatic adherence estimation, which focuses on the main MD food groups. We use a combination of a convolutional neural network (CNN) and a graph convolutional network to detect the types of foods and quantities from the users' food images, and the defined set of rules to evaluate the adherence to MD. Our experiments show that our system outperforms a basic CNN in terms of recognizing food items and estimating quantity, and yields results comparable to those of experienced dietitians when it comes to overall MD adherence estimation. As the system is novel, these results are promising; however, there is room for improving the accuracy by gathering and training with more data, and certain refinements can be performed, such as re-defining the set of rules so that they can also be used for sub-groups of the MD (e.g., a vegetarian type of MD).
... The GoCARB system is an initial attempt to achieve practical estimation of the food volume in real scenarios and has been validated both technically [22] and in a framework of pre-clinical and clinical trials [15,23]. Following the development and great progress of CNNs, a number of recent studies have tried to address the estimation of food volume using single-view colour images [14,[24][25][26]. Refs. [14,26] use CNNs to predict the depth image from a single-view colour image, and the predicted depth map is then used for the food volume calculation. ...
... Following the development and great progress of CNNs, a number of recent studies have tried to address the estimation of food volume using single-view colour images [14,[24][25][26]. Refs. [14,26] use CNNs to predict the depth image from a single-view colour image, and the predicted depth map is then used for the food volume calculation. Refs. [24,25] treat the food volume as a latent variable and predict the food nutrient content directly from the colour image using CNNs. ...
Article
Full-text available
Accurate estimation of nutritional information may lead to healthier diets and better clinical outcomes. We propose a dietary assessment system based on artificial intelligence (AI), named goFOOD™. The system can estimate the calorie and macronutrient content of a meal, on the sole basis of food images captured by a smartphone. goFOOD™ requires an input of two meal images or a short video. For conventional single-camera smartphones, the images must be captured from two different viewing angles; smartphones equipped with two rear cameras require only a single press of the shutter button. Deep neural networks are used to process the two images and implement food detection, segmentation and recognition, while a 3D reconstruction algorithm estimates the food's volume. Each meal's calorie and macronutrient content is calculated from the food category, the volume and a nutrient database. goFOOD™ supports 319 fine-grained food categories, and has been validated on two multimedia databases that contain non-standardized and fast food meals. The experimental results demonstrate that goFOOD™ performed better than experienced dietitians on the non-standardized meal database, and was comparable to them on the fast food database. goFOOD™ provides a simple and efficient solution to the end-user for dietary assessment.
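The last stage of goFOOD™-style pipelines, converting a recognized category and an estimated volume into calories and macronutrients via a nutrient database, reduces to a density lookup. The sketch below illustrates this; the category names, densities and per-100 g values are invented placeholders, not entries from the system's actual database.

```python
# Hedged sketch: calories/macronutrients from (food category, volume).
# Densities (g/ml) and per-100g nutrient values are illustrative placeholders.
NUTRIENT_DB = {
    # category: (density g/ml, kcal/100g, carbs g/100g, protein g/100g, fat g/100g)
    "rice":    (0.90, 130, 28.0,  2.7, 0.3),
    "chicken": (1.05, 165,  0.0, 31.0, 3.6),
    "salad":   (0.35,  20,  3.6,  1.2, 0.2),
}

def nutrients_from_volume(category, volume_ml):
    density, kcal, carbs, protein, fat = NUTRIENT_DB[category]
    grams = density * volume_ml          # volume -> weight via food density
    scale = grams / 100.0                # nutrient values are per 100 g
    return {"weight_g": round(grams, 1),
            "kcal": round(kcal * scale, 1),
            "carbs_g": round(carbs * scale, 1),
            "protein_g": round(protein * scale, 1),
            "fat_g": round(fat * scale, 1)}

print(nutrients_from_volume("rice", volume_ml=180))
```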
... Besides food and no-food segmentation, semantic segmentation, where both pixel-wise classification and segmentation are obtained simultaneously, has been a much less explored path in food-related tasks, mainly due to the lack of large-scale datasets with pixel-wise class annotations. To the best of our knowledge, [7,39,51,65] are the only works accomplishing semantic segmentation of food images by employing DCNNs. In [65], the first version of DeepLab [15] has been used for semantic segmentation of the Food201-Segmented dataset. ...
... In [7] and [39], DeepLab-v2 and SegNet are used for semantic segmentation of the UNIMIB2016 food dataset [23], respectively. A new DCNN architecture, namely Depth Net, is proposed in [51], which accomplishes instance segmentation of food images with Mask R-CNN and extends it to volume estimation. ...
Article
Full-text available
The problem of food segmentation is quite challenging since food is characterized by intrinsic high intra-class variability. Also, segmentation of food images taken in the wild may be affected by acquisition artifacts, which can be problematic for segmentation algorithms. A proper evaluation of segmentation algorithms is of paramount importance for the design and improvement of food analysis systems that can work in less-than-ideal real scenarios. In this paper, we evaluate the performance of different deep learning-based segmentation algorithms in the context of food. Due to the lack of large-scale food segmentation datasets, we initially create a new dataset composed of 5,000 images of 50 diverse food categories. The images are accurately annotated with pixel-wise annotations. In order to test the algorithms under different conditions, the dataset is augmented with the same images rendered under different acquisition distortions that comprise illuminant change, JPEG compression, Gaussian noise, and Gaussian blur. The final dataset is composed of 120,000 images. Using standard benchmark measures, we conducted extensive experiments to evaluate ten state-of-the-art segmentation algorithms on two tasks: food localization and semantic food segmentation.
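The four acquisition distortions used to augment the dataset can be reproduced with standard image tooling. The sketch below shows one plausible implementation of each using Pillow and NumPy; the parameter values are illustrative assumptions, not the distortion levels used in the paper.

```python
# Hedged sketch of the four acquisition distortions the benchmark applies
# (illuminant change, JPEG compression, Gaussian noise, Gaussian blur),
# using Pillow/NumPy; the parameter values are illustrative, not the paper's.
import io
import numpy as np
from PIL import Image, ImageFilter

def illuminant_shift(img, gains=(1.15, 1.0, 0.85)):
    arr = np.asarray(img).astype(np.float32)
    arr *= np.array(gains)  # per-channel gain mimics a colour-cast change
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def jpeg_compress(img, quality=20):
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)  # lossy round-trip
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_noise(img, sigma=10.0):
    arr = np.asarray(img).astype(np.float32)
    arr += np.random.normal(0.0, sigma, arr.shape)
    return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))

def gaussian_blur(img, radius=2.0):
    return img.filter(ImageFilter.GaussianBlur(radius))

img = Image.new("RGB", (128, 128), (180, 120, 60))  # stand-in food image
for fn in (illuminant_shift, jpeg_compress, gaussian_noise, gaussian_blur):
    out = fn(img)
    print(fn.__name__, out.size)
```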
... On the contrary, quantity estimation studies have suffered from the lack of proper datasets. Since 3D information has to be inferred for correct volume estimation, in a totally unsupervised environment with no spatial references this is an extremely challenging problem [23,8,6,25,7,2,18]. ...
... The authors performed semantic segmentation using U-Net and then used a modified version of the CNN in [9] for depth inference from a single RGB input. In 2018, Lu et al. [18] presented a Multi-Task Learning approach to estimate the volume of food items from a single RGB image. The proposed CNN architecture is composed of multiple modules. ...
Conference Paper
Full-text available
In the last decade food understanding has become a very attractive topic. This has led to a growing demand for Computer Vision algorithms for automatic diet assessment to treat or prevent food-related diseases. However, the intrinsic variability of food makes research in this field incredibly challenging. Although many papers about classification or recognition of food images have been published in recent years, the literature lacks works which address the volume and calorie estimation problem. Since an ideal food understanding engine should be able to provide information about nutritional values, knowledge of the volume is essential. In contrast to state-of-the-art works, in this paper we address the problem of volume estimation through Learning to Rank algorithms. Our idea is to work with a predefined set of possible portion sizes and exploit a ranking approach based on Support Vector Machines (SVM) to sort food images according to volume. To the best of our knowledge, this is the first work where food volume analysis is treated as a ranking problem. To validate the proposed methodology we introduce a new dataset of 99 food images related to 11 food plates. Each food image belongs to one of three possible portion sizes (i.e., small, medium, large). Then, we provide a baseline experiment to assess the problem of learning to rank food images by using three different image descriptors based on Bag of Visual Words, GoogleNet and MobileNet. Experimental results confirm that the exploited paradigm obtains good performance and that a ranking function for food volume analysis can be successfully learnt.
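The core of the paper's idea, reducing portion-size ranking to binary classification over pairwise feature differences, is the classic RankSVM construction. A hedged sketch follows, with random vectors standing in for the Bag of Visual Words/GoogleNet/MobileNet descriptors and synthetic portion labels; it mirrors the construction, not the authors' exact setup.

```python
# Hedged sketch of learning-to-rank portion sizes with an SVM: the standard
# pairwise reduction (RankSVM-style) on top of precomputed image descriptors.
# Random features stand in for the BoVW/GoogleNet/MobileNet descriptors.
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
X = rng.normal(size=(99, 64))        # one descriptor per food image
latent = X @ rng.normal(size=64)     # hidden "true volume" signal (synthetic)
y = np.digitize(latent, np.quantile(latent, [1/3, 2/3]))  # 0=small,1=medium,2=large

# Build pairwise differences; the label says which item of the pair is larger.
pairs, labels = [], []
for i in range(len(X)):
    for j in range(len(X)):
        if y[i] != y[j]:
            pairs.append(X[i] - X[j])
            labels.append(1 if y[i] > y[j] else -1)

ranker = LinearSVC(C=1.0).fit(np.array(pairs), np.array(labels))
# The learned weight vector scores images: higher score => larger portion.
scores = X @ ranker.coef_.ravel()
print("correlation between scores and portion labels:",
      round(float(np.corrcoef(scores, y)[0, 1]), 2))
```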
... Geometry-based estimation is performed from single images if a depth sensor is available [14,20,28], or from multiple images otherwise [8,17]. Learning-based methods have also been proposed [16,18]. However, generalization issues usually make them less effective than geometry-based methods. ...
Chapter
Full-text available
People living with type 1 diabetes (PwT1D) face multiple challenges in self-managing their blood glucose levels, including the need for accurate carbohydrate counting, and the requirements of adjusting insulin dosage. Our paper aims to alleviate the demands of diabetes self-management by developing a complete system that employs computer vision to estimate the carbohydrate content of meals and utilizes reinforcement learning to personalize insulin dosing. Our findings demonstrate that this system results in a significantly greater percentage of time spent in the target glucose range compared to the combined standard bolus calculator treatment and carbohydrate counting. This approach could potentially improve glycaemic control for PwT1D and reduce the burden of carbohydrate and insulin dosage estimations.
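For context, the standard bolus calculator that serves as the comparison baseline typically combines a carbohydrate dose with a glucose correction term. The sketch below shows this widely used formula; the ICR/ISF/target values are illustrative placeholders only, and this is neither medical advice nor the chapter's reinforcement learning policy.

```python
# Hedged sketch of a standard insulin bolus calculator (the baseline the
# chapter's RL policy is compared against). ICR/ISF/target values are
# illustrative placeholders only -- not medical advice.
def bolus_units(carbs_g, glucose_mgdl, icr=10.0, isf=50.0,
                target_mgdl=110.0, iob_units=0.0):
    """carbs_g: estimated meal carbohydrates (g);
    icr: insulin-to-carb ratio (g covered per unit);
    isf: insulin sensitivity factor (mg/dL drop per unit);
    iob_units: insulin still active from earlier boluses."""
    meal_dose = carbs_g / icr
    correction = (glucose_mgdl - target_mgdl) / isf
    return max(0.0, meal_dose + correction - iob_units)

# E.g. a 60 g-carb meal at 180 mg/dL with 1 U still on board:
print(f"{bolus_units(60, 180, iob_units=1.0):.1f} U")
```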
... Classification was implemented with CNNs in 35 (78%) studies. CNNs were also used for the final volume estimation phase in 1 study (2%) (103). ...
Article
Full-text available
Dietary assessment can be crucial for the overall well-being of humans and at least in some instances for the prevention and management of chronic, life-threatening diseases. Recall and manual record keeping methods for food intake monitoring are available, but often inaccurate when applied for a long period of time. On the other hand, automatic record keeping approaches that adopt mobile cameras and computer vision methods seem to simplify the process and can improve current human-centric diet monitoring methods. Here we present an extended critical literature overview of image-based food recognition systems (IBFRS) combining a camera of the user's mobile device with computer vision methods and publicly available food datasets (PAFD). In brief, such systems consist of several phases, such as the segmentation of the food items on the plate, the classification of the food items in a specific food category, and the estimation phase of volume, calories or nutrients of each food item. 159 studies were screened in this systematic review of IBFRS. A detailed overview of the methods adopted in each of the 78 included studies of this systematic review of IBFRS is provided along with their performance on PAFD. Studies that included IBFRS without presenting their performance in at least one of the abovementioned phases were excluded. Among the included studies, 45 (58%) studies adopted deep learning methods and especially Convolutional Neural Networks (CNNs) in at least one phase of the IBFRS with input PAFD. Among the implemented techniques, CNNs outperform all other approaches on the PAFD with a large volume of data, since the richness of these datasets provides adequate training resources for such algorithms. We also present evidence for the benefits of application of IBFRS in professional dietetic practice. Furthermore, challenges related to the IBFRS presented here are also thoroughly discussed along with future directions.
... The food reconstruction is implemented by performing the Iterative Closest Point (ICP) algorithm over the front-view depth map and an inferred back-view depth map. Ferdinan et al. [27] and Lu et al. [28] took a different approach. They formulated the volume estimation as a volume regression problem from implicit 3D features transformed from the depth information. ...
Article
Full-text available
It is well known that many chronic diseases are associated with unhealthy diet. Although improving diet is critical, adopting a healthy diet is difficult despite its benefits being well understood. Technology is needed to allow an assessment of dietary intake accurately and easily in real-world settings, so that effective interventions to manage overweight, obesity, and related chronic diseases can be developed. In recent years, new wearable imaging and computational technologies have emerged. These technologies are capable of performing objective and passive dietary assessments with a much simplified procedure compared to traditional questionnaires. However, a critical task is to estimate the portion size (in this case, the food volume) from a digital image. Currently, this task is very challenging because the volumetric information in two-dimensional images is incomplete, and the estimation involves a great deal of imagination, beyond the capacity of traditional image processing algorithms. In this work, we present a novel Artificial Intelligence (AI) system to mimic the thinking of dietitians, who use a set of common objects as gauges (e.g., a teaspoon, a golf ball, a cup, and so on) to estimate the portion size. Specifically, our human-mimetic system "mentally" gauges the volume of food using a set of internal reference volumes that have been learned previously. At the output, our system produces a vector of probabilities of the food with respect to the internal reference volumes. The estimation is then completed by an "intelligent guess", implemented as an inner product between the probability vector and the reference volume vector. Our experiments using both virtual and real food datasets have shown accurate volume estimation results.
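The "intelligent guess" described above is effectively an expectation: an inner product between the predicted probability vector and the learned reference volumes. A tiny numeric sketch, with invented gauge volumes and probabilities:

```python
# Hedged sketch of the "intelligent guess": the network outputs a probability
# vector over learned reference volumes, and the estimate is the inner product
# of that vector with the reference volumes. All values here are invented.
import numpy as np

# Internal reference volumes (ml), e.g. teaspoon, golf ball, cup, ...
ref_volumes = np.array([5.0, 40.0, 240.0, 500.0])
# Softmax-style network output: how much the food "resembles" each gauge.
probs = np.array([0.05, 0.15, 0.70, 0.10])
assert np.isclose(probs.sum(), 1.0)

estimate_ml = probs @ ref_volumes  # expected volume under the model
print(f"estimated portion: {estimate_ml:.0f} ml")
```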
... Such audio signals have also been used to extract information such as the food type [3,23]. There are also alternative types of approaches for identifying food types; for example, leveraging photos that people take with their mobile phones can be used to detect food-relevant photos and then subsequently to perform image segmentation and identify the food type [18,13,10]. Alternatively, user input can be requested when an automatic eating detection system detects eating activity [5]. ...
Preprint
Full-text available
Food texture is a complex property; various sensory attributes such as perceived crispiness and wetness have been identified as ways to quantify it. Objective and automatic recognition of these attributes has applications in multiple fields, including health sciences and food engineering. In this work we use an in-ear microphone, commonly used for chewing detection, and propose algorithms for recognizing three food-texture attributes, specifically crispiness, wetness (moisture), and chewiness. We use binary SVMs, one for each attribute, and propose two algorithms: one that recognizes each texture attribute at the chew level and one at the chewing-bout level. We evaluate the proposed algorithms using leave-one-subject-out cross-validation on a dataset with 9 subjects. We also evaluate them using leave-one-food-type-out cross-validation, in order to examine the generalization of our approach to new, unknown food types. Our approach performs very well in recognizing crispiness (0.95 weighted accuracy on new subjects and 0.93 on new food types) and demonstrates promising results for objective and automatic recognition of wetness and chewiness.
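The evaluation protocol described here, one binary SVM per texture attribute scored with leave-one-subject-out cross-validation, maps directly onto scikit-learn's grouped cross-validation. In the sketch below, random vectors stand in for the chew-level audio features and the labels are synthetic; only the protocol mirrors the paper.

```python
# Hedged sketch of the evaluation protocol: one binary SVM per texture
# attribute, scored with leave-one-subject-out cross-validation. Random
# features stand in for the paper's chew-level audio features.
import numpy as np
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(1)
n_chews = 900
X = rng.normal(size=(n_chews, 20))           # per-chew audio features
subjects = rng.integers(0, 9, size=n_chews)  # 9 subjects, as in the paper
y_crispy = (X[:, 0] + 0.5 * rng.normal(size=n_chews)) > 0  # toy labels

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
scores = cross_val_score(clf, X, y_crispy, groups=subjects,
                         cv=LeaveOneGroupOut())
print(f"LOSO accuracy per left-out subject: {scores.round(2)}")
print(f"mean: {scores.mean():.2f}")
```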
... Most of the approaches for estimating food volume are based on geometry [18], [34], and require multiple RGB images and a reference object as input for 3D food model construction, and these are not robust to low texture food and large view changes. Food volume can also be estimated using the CNN-based regression method [35], while the performance depends heavily on the quantity of the training data. With the development of high-quality depth sensors or stereo cameras on smartphones, depth maps can be utilised for estimating food volume with no need for reference object or extensive training data [27], [36], [37]; this provides more stable and more accurate results than with geometry-based approaches [18], [27], [34]. ...
Article
Regular monitoring of nutrient intake in hospitalised patients plays a critical role in reducing the risk of disease-related malnutrition. Although several methods to estimate nutrient intake have been developed, there is still a clear demand for a more reliable and fully automated technique, as this could improve data accuracy and reduce both the burden on participants and health costs. In this paper, we propose a novel system based on artificial intelligence (AI) to accurately estimate nutrient intake, by simply processing RGB Depth (RGB-D) image pairs captured before and after meal consumption. The system includes a novel multi-task contextual network for food segmentation, a few-shot learning-based classifier built by limited training samples for food recognition, and an algorithm for 3D surface construction. This allows sequential food segmentation, recognition, and estimation of the consumed food volume, permitting fully automatic estimation of the nutrient intake for each meal. For the development and evaluation of the system, a dedicated new database containing images and nutrient recipes of 322 meals is assembled, coupled to data annotation using innovative strategies. Experimental results demonstrate that the estimated nutrient intake is highly correlated (> 0.91) to the ground truth and shows very small mean relative errors (< 20%), outperforming existing techniques proposed for nutrient intake assessment.
... The authors use convolutional neural networks for 2D human pose estimation by detecting joint points in the colour image; then they map the results to the corresponding depth channel to obtain 3D joint points. RGB-D cameras are also used to learn a 3D-2D correspondence in food volume estimation for diet (Allegra et al., 2017; Lu et al., 2018), to support assistive technologies (Tian, 2014; Milotta et al., 2015) and even in the industrial manufacturing field (Munaro et al., 2016; Liu and Wang, 2019). ...
Article
The aim of this study was to confirm the identities of numerous portraits attributed to the composer Vincenzo Bellini by using 3D-to-2D projection. This study also followed on from earlier research on three death masks of Bellini, the results of which had shown that, of the three, the wax mask in Catania's Bellini museum best represented Bellini's face. This study used the aforementioned 3D wax death mask, obtained through Reverse Engineering, as a reference for a morphometric comparison with 14 other portraits. For each portrait, the linear 3D-to-2D transformation M was found which minimized the distance between the 2D landmarks in the picture and the projected landmarks on the 3D mask. The distances were normalized with respect to the scale of the portrait, as was the final dissimilarity score with the mask. In particular, the analytical results identified two portraits which particularly resembled the 3D death mask, providing future researchers with the chance to carry out historical-artistic evaluations. We also developed a new tool, Image Mark Pro, to easily annotate 2D images by introducing landmark locations. Since it proved so reliable for manually annotating landmarks, we decided to make it publicly available for future research.
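The linear 3D-to-2D transformation M described above can be estimated in closed form by least squares under an affine camera model. The sketch below does this on synthetic landmarks; the affine model and the RMS-error score are assumptions for illustration, not necessarily the study's exact formulation.

```python
# Hedged sketch: fitting a linear 3D-to-2D transformation M that minimises
# the distance between portrait landmarks and projected mask landmarks,
# using an affine camera model solved by least squares (synthetic data).
import numpy as np

rng = np.random.default_rng(2)
X3d = rng.normal(size=(12, 3))                     # landmarks on the 3D mask
M_true = rng.normal(size=(2, 4))                   # unknown affine projection
X3d_h = np.hstack([X3d, np.ones((12, 1))])         # homogeneous coordinates
x2d = X3d_h @ M_true.T + 0.01 * rng.normal(size=(12, 2))  # noisy 2D landmarks

# Least-squares estimate of M: solve X3d_h @ M.T ~= x2d.
M_est, *_ = np.linalg.lstsq(X3d_h, x2d, rcond=None)
M_est = M_est.T

# Dissimilarity score: RMS reprojection error after the optimal fit.
resid = X3d_h @ M_est.T - x2d
print(f"RMS landmark error: {np.sqrt((resid ** 2).mean()):.4f}")
```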
... An MTL-based CNN was presented by Lu et al. [33] to address food segmentation, recognition, and volume estimation, which successfully outperforms the baseline methods. Another MTL architecture with heavy sharing of weights and features was introduced in [34] to perform four tasks: 2D pose estimation, 3D pose estimation, 2D action recognition, and 3D action recognition. ...
[Fig. 2 caption: Our general approach of multi-task learning for attribute-aware semantic segmentation; unlike other existing works, an attribute-aware loss function is proposed to deal with the attributes (either present or not), and is applicable to any arbitrary base model.]
Article
Numerous applications such as autonomous driving, satellite imagery sensing, and biomedical imaging use computer vision as an important tool for perception tasks. For Intelligent Transportation Systems (ITS), it is required to precisely recognize and locate scenes in sensor data. Semantic segmentation is one of the computer vision methods intended to perform such tasks. However, existing semantic segmentation tasks label each pixel with a single object's class. Recognizing object attributes, e.g., pedestrian orientation, is more informative and helps towards a better scene understanding. Thus, we propose a method to perform semantic segmentation and pedestrian attribute recognition simultaneously. We introduce an attribute-aware loss function that can be applied to an arbitrary base model. Furthermore, a re-annotation of the existing Cityscapes dataset enriches the ground-truth labels by annotating the attributes of pedestrian orientation. We implement the proposed method and compare the experimental results with others. The attribute-aware semantic segmentation shows the ability to outperform baseline methods both in the traditional object segmentation task and the expanded attribute detection task.
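One plausible reading of the attribute-aware loss is a standard segmentation cross-entropy plus an attribute term evaluated only on pixels where the attribute exists. The PyTorch sketch below implements that masked formulation; the shapes, the attribute weight and the "class 3 = pedestrian" convention are assumptions, not the paper's definition.

```python
# Hedged sketch of an attribute-aware loss: standard per-pixel cross-entropy
# for segmentation plus a BCE attribute term evaluated only on pixels where
# the attribute is defined (e.g. pedestrian pixels). Shapes/weights assumed.
import torch
import torch.nn.functional as F

def attribute_aware_loss(seg_logits, attr_logits, seg_gt, attr_gt, attr_mask,
                         attr_weight=1.0):
    """seg_logits: (N,C,H,W); attr_logits/attr_gt/attr_mask: (N,A,H,W);
    attr_mask is 1 where the attribute exists (labelled), 0 elsewhere."""
    seg_loss = F.cross_entropy(seg_logits, seg_gt)
    bce = F.binary_cross_entropy_with_logits(attr_logits, attr_gt,
                                             reduction="none")
    # Average the attribute loss over labelled pixels only.
    attr_loss = (bce * attr_mask).sum() / attr_mask.sum().clamp(min=1.0)
    return seg_loss + attr_weight * attr_loss

# Toy shapes: 2 images, 5 classes, 4 orientation attributes, 32x32 pixels.
seg_logits = torch.randn(2, 5, 32, 32)
attr_logits = torch.randn(2, 4, 32, 32)
seg_gt = torch.randint(0, 5, (2, 32, 32))
attr_gt = torch.randint(0, 2, (2, 4, 32, 32)).float()
# Assume class 3 stands for "pedestrian": attributes only exist there.
attr_mask = (seg_gt == 3).unsqueeze(1).expand(-1, 4, -1, -1).float()
print(attribute_aware_loss(seg_logits, attr_logits, seg_gt, attr_gt, attr_mask))
```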
... To this end, a number of computer vision approaches have been developed, in order to extract nutrient information from meal images by using machine learning. Typically, such systems detect the different food items in a picture [1], [27], [21], estimate their volumes [16], [6], [10] and calculate the nutrient content using a food composition database [20]. In some cases however, inferring the nutrient content of a meal from an image can be really challenging - due to unseen ingredients (e.g. ...
Preprint
Full-text available
Direct computer vision based-nutrient content estimation is a demanding task, due to deformation and occlusions of ingredients, as well as high intra-class and low inter-class variability between meal classes. In order to tackle these issues, we propose a system for recipe retrieval from images. The recipe information can subsequently be used to estimate the nutrient content of the meal. In this study, we utilize the multi-modal Recipe1M dataset, which contains over 1 million recipes accompanied by over 13 million images. The proposed model can operate as a first step in an automatic pipeline for the estimation of nutrition content by supporting hints related to ingredient and instruction. Through self-attention, our model can directly process raw recipe text, making the upstream instruction sentence embedding process redundant and thus reducing training time, while providing desirable retrieval results. Furthermore, we propose the use of an ingredient attention mechanism, in order to gain insight into which instructions, parts of instructions or single instruction words are of importance for processing a single ingredient within a certain recipe. Attention-based recipe text encoding contributes to solving the issue of high intra-class/low inter-class variability by focusing on preparation steps specific to the meal. The experimental results demonstrate the potential of such a system for recipe retrieval from images. A comparison with respect to two baseline methods is also presented.
... They created a new food RGB-D image dataset, Madima17, for the training of a CNN. Lu et al. [10] proposed a multi-task CNN architecture to perform food segmentation and food volume estimation simultaneously. They extended Mask R-CNN [9] by adding depth and volume estimation networks. ...
Conference Paper
Some recent smartphones, such as the iPhone Xs, have a pair of cameras on their backside which can be used as a stereo camera. For iPhones with iOS 11 or later, the official API provides a function to estimate depth information from the two back cameras in real time. Taking advantage of this function, we have developed an iOS app, "DepthCalorieCam", which estimates the amount of food calories based on food volumes. The proposed app takes an RGB-D image of a dish, estimates the categories and volumes of the foods on the dish, and calculates the amount of their calories using the pre-registered calorie density of each food category. We have achieved very accurate calorie estimation by using depth information. The error of the estimated calories was greatly reduced compared with existing size-based systems.
... Quantity estimation can be addressed with multi-task learning by training CNNs that learn both the food classification and the corresponding calories/volume. However, this technique requires a dataset with annotated calories [11] or depth information in the images [15]. ...
Chapter
Full-text available
The self-management of nutritional diseases requires a system that combines food tracking with the potential risks of food categories on people’s health based on their personal health records (PHRs). The challenges range from the design of an effective food image classification strategy to the development of a full-fledged knowledge-based system. This maps the results of the classification strategy into semantic information that can be exploited for reasoning. However, current works mainly address the single challenges separately without their integration into a whole pipeline. In this paper, we propose a new end-to-end semantic platform where: (i) the classification strategy aims to extract food categories from food pictures; (ii) an ontology is used for detecting the risk factors of food categories for specific diseases; (iii) the Linked Open Data (LOD) Cloud is queried for extracting information concerning related diseases and comorbidities; and, (iv) information from the users’ PHRs are exploited for generating proper personal feedback. Experiments are conducted on a new publicly released dataset. Quantitative and qualitative evaluations, from two living labs, demonstrate the effectiveness and the suitability of the proposed approach.
... Most of the approaches for estimating food volume are based on geometry [18], [34], and require multiple RGB images and a reference object as input for 3D food model construction, and these are not robust to low texture food and large view changes. Food volume can also be estimated using the CNN-based regression method [35], while the performance depends heavily on the quantity of the training data. With the development of high-quality depth sensors or stereo cameras on smartphones, depth maps can be utilised for estimating food volume with no need for reference object or extensive training data [27], [36], [37]; this provides more stable and more accurate results than with geometry-based approaches [18], [27], [34]. ...
Conference Paper
Regular nutrient intake monitoring in hospitalised patients plays a critical role in reducing the risk of disease-related malnutrition (DRM). Although several methods to estimate nutrient intake have been developed, there is still a clear demand for a more reliable and fully automated technique, as this could improve the data accuracy and reduce both the participant burden and the health costs. In this paper, we propose a novel system based on artificial intelligence to accurately estimate nutrient intake, by simply processing RGB depth image pairs captured before and after a meal consumption. For the development and evaluation of the system, a dedicated and new database of images and recipes of 322 meals was assembled, coupled to data annotation using innovative strategies. With this database, a system was developed that employed a novel multi-task neural network and an algorithm for 3D surface construction. This allowed sequential semantic food segmentation and estimation of the volume of the consumed food, and permitted fully automatic estimation of nutrient intake for each food type with a 15% estimation error.
... In particular, it requires solving various problems, such as: fine-grained recognition to distinguish subtly different forms of food, instance segmentation and counting, mask generation, depth/volume estimation from a single image. Most of the existing state-of-the-art work focuses specifically on one of the sub-problems of food detection with computer vision techniques [22] [23] [24]. They all focus on a single task with strict environmental conditions or external assistance so that they are still far away from the holy grail of the automated food journaling systems. ...
Preprint
Full-text available
We present a mobile application made to recognize the food items of a multi-object meal from a single image in real time, and then return the nutrition facts with components and approximate amounts. Our work is organized in two parts. First, we build a deep convolutional neural network merged with YOLO, a state-of-the-art detection strategy, to achieve simultaneous multi-object recognition and localization with nearly 80% mean average precision. Second, we adapt our model into a mobile application with an extended function for nutrition analysis. After inferring and decoding the model output on the app side, we present detection results that include bounding box position and class label in either real-time or local mode. Our model is well-suited for mobile devices, with negligible inference time and small memory requirements for a deep learning algorithm.
... Quantity estimation can be addressed with a multi-task learning approach by defining a tailored CNN that learns both the classification of the food in the dish and the related calories or volume. However, this interesting direction requires a dataset with annotated calories [8] or depth information in the images [20]. In [3] the authors use CNNs to perform semantic segmentation to estimate the leftovers on canteen trays. ...
Chapter
Full-text available
The self-management of chronic diseases related to dietary habits includes the necessity of tracking what people eat. Most of the approaches proposed in the literature classify food pictures by labels describing the whole recipe. The main drawback of this kind of strategy is that a wrong prediction of the recipe leads to a wrong prediction of any ingredient of such a recipe. In this paper we present a multi-label food classification approach, exploiting deep neural networks, where each food picture is classified with labels describing the food categories of the ingredients in each recipe. The aim of our approach is to support the detection of food categories in order to detect which one might be dangerous for a user affected by chronic disease. Our approach relies on background knowledge where recipes, food categories, and their relatedness with chronic diseases are modeled within a state-of-the-art ontology. Experiments conducted on a new publicly released dataset demonstrated the effectiveness of the proposed approach with respect to state-of-the-art classification strategies.
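The multi-label setup described above, one independent sigmoid per food category so that several categories can be active for a single picture, is sketched below with binary cross-entropy; the tiny backbone and the category list are placeholders, not the paper's model or ontology.

```python
# Hedged sketch of multi-label food-category classification: independent
# sigmoid outputs with binary cross-entropy, so several categories can be
# active per picture. Backbone and category names are placeholders.
import torch
import torch.nn as nn

CATEGORIES = ["cereal", "dairy", "meat", "vegetable", "fruit", "sweets"]

class MultiLabelFoodClassifier(nn.Module):
    def __init__(self, num_labels=len(CATEGORIES)):
        super().__init__()
        self.features = nn.Sequential(  # stand-in for a pretrained CNN
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.head = nn.Linear(16, num_labels)

    def forward(self, x):
        return self.head(self.features(x))  # raw logits, one per category

model = MultiLabelFoodClassifier()
criterion = nn.BCEWithLogitsLoss()          # one binary decision per label
imgs = torch.randn(4, 3, 64, 64)
targets = torch.randint(0, 2, (4, len(CATEGORIES))).float()
loss = criterion(model(imgs), targets)
# At inference, threshold each sigmoid independently:
preds = torch.sigmoid(model(imgs)) > 0.5
print(loss.item(), preds.shape)
```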
... Object detection is now used in many applications. There are several state-of-the-art approaches for different types of object detection, such as flower detection [8], fruit detection [9,10], food segmentation and detection [11], and cat and dog detection [12]. The main goal of all these detection algorithms is to achieve higher efficiency and cover different complex use cases by overcoming various limitations. ...
Article
Full-text available
In this paper, an efficient approach has been proposed to localize every clearly visible object or region of an object in an image, using less memory and computing power. For object detection we have processed every input image to overcome several complexities, which are the main limitations to achieving better results, such as overlap between multiple objects, noise in the image background, and poor resolution. We have also implemented an improved Convolutional Neural Network based classification or recognition algorithm which has proved to provide better performance than baseline works. Combining these two detection and recognition approaches, we have developed a competent multi-class Fruit Detection and Recognition (FDR) model that is very proficient regardless of different limitations such as high or poor image quality, complex background or lighting conditions, different fruits of the same shape and color, multiple overlapping fruits, the existence of non-fruit objects in the image, and variety in the size, shape, angle and features of fruit. The proposed FDR model is also capable of detecting every single fruit separately from a set of overlapping fruits. Another major contribution of our FDR model is that it is not a dataset-oriented model which works better on only a particular dataset, as it has been proved to provide better performance when applied to both real-world images (e.g., our own dataset) and several state-of-the-art datasets. Nevertheless, taking a number of challenges into consideration, our proposed model is capable of detecting and recognizing fruits from images with better accuracy and an average precision rate of about 0.9875.
Book
Originated from a Bachelor's project in 2021 with three young Bangladeshi friends. The present third author was suggested by the second author. A more refined and more precise version can be written.
Chapter
Background: EEG provides researchers with an opportunity to study neural correlates in terms of temporal connectivity. This connectivity can shed light on the possible differences in network topology between a healthy person and a patient, or help differentiate between two different groups (experts and non-experts). Purpose: With the help of machine learning models, the difference in network topology can be used to understand the neural correlations between healthy controls and patients with greater ease than traditional EEG analysis. Further, a comparative analysis between the different spectral connectivity measures identifies the most suitable measure for the study. Methods: EEG data from a meditation study (n = 31) and a Parkinson's study (n = 24) containing resting-state EEG recordings are utilized here. The EEG data is converted to a spectral connectivity measure, coherence, which becomes the input for the machine learning models: support vector machine, k-means clustering, deep convolutional neural networks, recurrent neural networks, and graph neural networks. Results: The classification accuracies of the SVM and RNN are 56.585% and 56%, whereas the D-CNN provides an accuracy of 59.5%. Both k-means and GNN (~7%) failed in the off-the-shelf approach. Conclusion: The comparative study shows the application capabilities of neural network machine learning compared with commonly used machine learning models, and the impact the various connectivity measures have on model accuracy.
Chapter
According to the WHO, an unhealthy diet is responsible for nearly 20% of all deaths worldwide. Most of the population lives in countries where obesity and overweight cause more fatalities than underweight. The issue here is not a lack of food; rather, people are unaware of what is in their diet. Knowing how many calories are in the foods they eat can help individuals maintain their health by meeting the body's fundamental calorie needs. It would have a variety of beneficial impacts, such as living a healthy lifestyle and providing a suitable amount of energy for regular exercise. Those who do not attend to their caloric demands, on the other hand, will suffer a variety of health issues, such as obesity and increasing ailments such as hypertension and prediabetes. People could easily decide how many calories they want to consume if they could estimate their calorie intake using images of their food. Determining the actual calorie content of a meal technologically involves the food item's region, size, and weight. Deep learning algorithms can identify the object, and calories are estimated based on the object detection method and a volume estimation method. If people knew how many calories were in their food, this problem could be mitigated slightly. This is accomplished in three stages: (1) image segmentation to determine each food's contour, (2) image recognition using Faster R-CNN, and (3) estimation of the food's weight and calories. In this study, the proposed system detects the contour of each food using Otsu's method and estimates the calories of each food, with data trained using Faster R-CNN.
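Stage (1) of the described pipeline relies on Otsu's method, which picks the grey-level threshold that maximizes the between-class variance of the image histogram. A pure-NumPy sketch on a synthetic two-region image follows; it illustrates the thresholding step only, not the chapter's full Faster R-CNN pipeline.

```python
# Hedged sketch of Otsu's method (stage 1 of the pipeline): choose the
# grey-level threshold that maximises between-class variance. Pure NumPy,
# demonstrated on a synthetic two-region image.
import numpy as np

def otsu_threshold(gray):
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                  # class-0 probability up to t
    mu = np.cumsum(prob * np.arange(256))    # class-0 cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b[~np.isfinite(sigma_b)] = 0.0
    return int(np.argmax(sigma_b))           # threshold maximising variance

rng = np.random.default_rng(3)
img = rng.normal(60, 10, (100, 100))               # dark background
img[30:70, 30:70] = rng.normal(170, 10, (40, 40))  # bright "food" region
img = np.clip(img, 0, 255).astype(np.uint8)

t = otsu_threshold(img)
mask = img > t
print(f"threshold={t}, food pixels={int(mask.sum())}")  # ~1600 expected
```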
Article
Background Maintaining a healthy diet is vital to avoid health-related issues, e.g., undernutrition, obesity and many non-communicable diseases. An indispensable part of a healthy diet is dietary assessment. Traditional manual recording methods are not only burdensome and time-consuming, but also contain substantial biases and errors. Recent advances in Artificial Intelligence (AI), especially computer vision technologies, have made it possible to develop automatic dietary assessment solutions, which are more convenient, less time-consuming and even more accurate for monitoring daily food intake. Scope and approach This review presents Vision-Based Dietary Assessment (VBDA) architectures, including the multi-stage architecture and the end-to-end one. Multi-stage dietary assessment generally consists of three stages: food image analysis, volume estimation and nutrient derivation. The prosperity of deep learning makes VBDA gradually move to an end-to-end implementation, which applies food images to a single network to directly estimate the nutrition. The recently proposed end-to-end methods are also discussed. We further analyze existing dietary assessment datasets, indicating that one large-scale benchmark is urgently needed, and finally highlight critical challenges and future trends for VBDA. Key findings and conclusions After thorough exploration, we find that multi-task end-to-end deep learning approaches are one important trend of VBDA. Despite considerable research progress, many challenges remain for VBDA due to meal complexity. We also provide the latest ideas for the future development of VBDA, e.g., fine-grained food analysis and accurate volume estimation. This review aims to encourage researchers to propose more practical solutions for VBDA.
Conference Paper
While automatic tracking and measuring of our physical activity is a well-established domain, not only in research but also in commercial products and everyday lifestyle, automatic measurement of eating behavior is significantly more limited. Despite the abundance of methods and algorithms available in the bibliography, commercial solutions are mostly limited to digital logging applications for smartphones. One factor that limits the adoption of such solutions is that they usually require specialized hardware or sensors. Based on this, we evaluate the potential for estimating the weight of consumed food (per bite) based only on the audio signal that is captured by commercial earbuds (Samsung Galaxy Buds). Specifically, we examine a combination of features (both audio and non-audio features) and trainable estimators (linear regression, support vector regression, and neural-network based estimators) and evaluate on an in-house dataset of 8 participants and 4 food types. Results indicate good potential for this approach: our best results yield a mean absolute error of less than 1 g for 3 out of 4 food types when training food-specific models, and 2.1 g when training on all food types together, both of which improve over an existing literature approach.
Chapter
Currently, many segmentation image datasets are open to the public. However, only a few open segmentation datasets of food images exist. Among them, UEC-FoodPix is a large-scale food image segmentation dataset which consists of 10,000 food images with segmentation masks. However, it contains some incomplete mask images, because most of the segmentation masks were generated automatically based on bounding boxes. To enable accurate food segmentation, complete segmentation masks are required for training. Therefore, in this work, we created "UEC-FoodPix Complete" by hand-refining the 9,000 segmentation masks that were automatically generated in the previous UEC-FoodPix. As a result, the segmentation performance was much improved compared to the segmentation model trained with the original UEC-FoodPix. In addition, as applications of the new food segmentation dataset, we performed food calorie estimation using food segmentation models trained with "UEC-FoodPix Complete", and food image synthesis from segmentation masks.
Article
Full-text available
Food portion size estimation (FPSE) is critical in dietary assessment and energy intake estimation. Traditional methods such as visual estimation are now replaced by faster, more accurate sensor-based methods. This paper presents a comprehensive review of the use of sensor methodologies for portion size estimation. The review was conducted using the PRISMA guidelines and full texts of 67 scientific articles were reviewed. The contributions of this paper are three-fold: i) A taxonomy for sensor-based (SB) FPSE methods was identified, classifying the sensors (as wearable, portable and stationary) and the methodology (as direct and indirect). ii) A novel comprehensive review of the state-of-the-art SB-FPSE methods was conducted and 5 sensor modalities (Acoustic, Strain, Imaging, Weighing, and Motion sensors) were identified. iii) The accuracy of portion size estimation and the applicability to free-living conditions of these SB-FPSE methods were assessed. This article concludes with a discussion of challenges and future trends of SB-FPSE.
Article
Food recognition plays a critical role in various health-care applications. However, it poses many challenges to current approaches due to the diverse appearances of food dishes and the non-uniform composition of ingredients for foods in the same category. First, current methods primarily focus on the appearance of food dishes without considering their semantic information, and thus easily attend to the wrong areas of food images. Second, these methods lack dynamic weighting of multiple semantic features in the modeling process. This paper therefore proposes a novel end-to-end multi-task network, called MVANet, that incorporates multiple semantic features into the food recognition task from both ingredient recognition and recipe modeling. It also utilizes a multi-view attention (MVA) mechanism to automatically adjust the weights of different semantic features in the modeling process and enables different tasks to interact with each other so as to obtain a more comprehensive feature representation. Experiments conducted on the ChineseFoodNet and VIREO Food-172 benchmark databases validate the proposed method, showing a clear performance improvement with a smaller parameter size.
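The dynamic weighting of semantic features can be pictured as a small softmax attention over per-view feature vectors. The sketch below illustrates that general idea only; it is not MVANet's implementation, and the dimensions and single-layer scorer are assumptions.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Linear(dim, 1)        # one scalar score per view

    def forward(self, views):                 # views: [batch, n_views, dim]
        w = torch.softmax(self.score(views), dim=1)   # dynamic view weights
        return (w * views).sum(dim=1)                 # weighted fusion

fused = AttentionFusion()(torch.randn(4, 3, 256))     # 3 semantic views
print(fused.shape)                                    # torch.Size([4, 256])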
Article
Full-text available
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network [1]. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the widely adopted FCN [2] and also with the well-known DeepLab-LargeFOV [3] and DeconvNet [4] architectures. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. SegNet was primarily motivated by scene understanding applications. Hence, it is designed to be efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than other competing architectures and can be trained end-to-end using stochastic gradient descent. We also performed a controlled benchmark of SegNet and other architectures on both road scenes and SUN RGB-D indoor scene segmentation tasks. These quantitative assessments show that SegNet provides good performance with competitive inference time and the most efficient inference memory-wise as compared to other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
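The decoder trick maps directly onto standard max-unpooling: the encoder's pooling indices are reused, so the upsampling itself has no learned weights. A minimal PyTorch sketch of one encoder/decoder pair (channel counts and sizes are illustrative, not SegNet's full architecture):

import torch
import torch.nn as nn

pool = nn.MaxPool2d(2, stride=2, return_indices=True)  # encoder pooling
unpool = nn.MaxUnpool2d(2, stride=2)                   # decoder unpooling
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)     # densify sparse maps

x = torch.randn(1, 64, 32, 32)
pooled, indices = pool(x)          # encoder keeps the argmax locations
sparse = unpool(pooled, indices)   # values placed back at those locations
dense = conv(sparse)               # trainable filters produce dense features
print(sparse.shape, dense.shape)   # both torch.Size([1, 64, 32, 32])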
Article
Full-text available
The increase in awareness of people towards their nutritional habits has drawn considerable attention to the field of automatic food analysis. Focusing on the self-service restaurant environment, automatic food analysis is not only useful for extracting nutritional information from the foods selected by customers; it is also of high interest for speeding up service by relieving the bottleneck produced at the cashiers in times of high demand. In this paper, we address the problem of automatic food tray analysis in canteens and restaurants, which consists of predicting the multiple foods placed on a tray image. We propose a new approach for food analysis based on convolutional neural networks, which we name Semantic Food Detection, that integrates food localization, recognition and segmentation in the same framework. We demonstrate that our method improves state-of-the-art food detection by a considerable margin on the public dataset UNIMIB2016, achieving about 90% in terms of F-measure, and thus provides a significant technological advance towards automatic billing in restaurant environments.
Article
Full-text available
The increasing prevalence of diet-related chronic diseases coupled with the ineffectiveness of traditional diet management methods have resulted in a need for novel tools to accurately and automatically assess meals. Recently, computer vision based systems that use meal images to assess their content have been proposed. Food portion estimation is the most difficult task for individuals assessing their meals and it is also the least studied area. The present paper proposes a three-stage system to calculate portion sizes using two images of a dish acquired by mobile devices. The first stage consists of understanding the configuration of the different views, after which a dense 3D model is built from the two images; finally, this 3D model serves to extract the volume of the different items. The system was extensively tested on 77 real dishes of known volume, and achieved an average error of less than 10% in 5.5 seconds per dish. The proposed pipeline is computationally tractable and requires no user input, making it a viable option for fully automated dietary assessment.
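Once the dense 3D model is referenced to the plate plane, the last stage amounts to integrating food height over the segmented area. A toy numpy sketch of that final step only, with synthetic heights and an assumed grid spacing (not the paper's reconstruction code):

import numpy as np

cell_area_cm2 = 0.04   # assume a 2 mm x 2 mm grid on the plate plane
rng = np.random.default_rng(1)
height_cm = np.clip(rng.normal(1.0, 0.5, (100, 100)), 0.0, None)
food_mask = height_cm > 0.2          # cells assigned to one segmented item

volume_ml = float(height_cm[food_mask].sum() * cell_area_cm2)  # cm^3 == ml
print(f"Estimated volume: {volume_ml:.0f} ml")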
Article
Full-text available
Food diary applications represent a tantalizing market. Such applications, based on food image recognition, have opened new challenges for computer vision and pattern recognition algorithms. Recent works in the field focus either on hand-crafted representations or on learning them by exploiting deep neural networks. Despite the success of the latter family of works, they generally exploit off-the-shelf deep architectures to classify food dishes; the architectures are thus not tailored to the specific problem. We believe that better results can be obtained if the deep architecture is defined with respect to an analysis of the food composition. Following this intuition, this work introduces a new deep scheme designed to handle the food structure. Specifically, inspired by the recent success of residual deep networks, we exploit such a learning scheme and introduce a slice convolution block to capture the vertical food layers. Outputs of the deep residual blocks are combined with the slice convolution to produce the classification score for specific food categories. To evaluate our proposed architecture we conducted experiments on three benchmark datasets. Results demonstrate that our solution outperforms existing approaches (e.g., a top-1 accuracy of 90.27% on the challenging Food-101 dataset).
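A slice convolution of this kind can be approximated by kernels that span the full width of the feature map with a small height, so each response summarizes one horizontal layer of the dish (think lasagna). The sketch below only illustrates that shape; the paper's actual kernel sizes and placement may differ.

import torch
import torch.nn as nn

x = torch.randn(1, 64, 28, 28)                        # backbone feature map
slice_conv = nn.Conv2d(64, 32, kernel_size=(3, 28))   # 3-high, full-width
print(slice_conv(x).shape)                            # torch.Size([1, 32, 26, 1])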
Article
Full-text available
In this paper we formulate structure from motion as a learning problem. We train a convolutional network end-to-end to compute depth and camera motion from successive, unconstrained image pairs. The architecture is composed of multiple stacked encoder-decoder networks, the core part being an iterative network that is able to improve its own predictions. The network estimates not only depth and motion, but additionally surface normals, optical flow between the images and confidence of the matching. A crucial component of the approach is a training loss based on spatial relative differences. Compared to traditional two-frame structure from motion methods, results are more accurate and more robust. In contrast to the popular depth-from-single-image networks, DeMoN learns the concept of matching and, thus, better generalizes to structures not seen during training.
Article
Full-text available
We present a novel and practical deep fully convolutional neural network architecture for semantic pixel-wise segmentation termed SegNet. This core trainable segmentation engine consists of an encoder network, a corresponding decoder network followed by a pixel-wise classification layer. The architecture of the encoder network is topologically identical to the 13 convolutional layers in the VGG16 network. The role of the decoder network is to map the low resolution encoder feature maps to full input resolution feature maps for pixel-wise classification. The novelty of SegNet lies in the manner in which the decoder upsamples its lower resolution input feature map(s). Specifically, the decoder uses pooling indices computed in the max-pooling step of the corresponding encoder to perform non-linear upsampling. This eliminates the need for learning to upsample. The upsampled maps are sparse and are then convolved with trainable filters to produce dense feature maps. We compare our proposed architecture with the fully convolutional network (FCN) architecture and its variants. This comparison reveals the memory versus accuracy trade-off involved in achieving good segmentation performance. The design of SegNet was primarily motivated by road scene understanding applications. Hence, it is efficient both in terms of memory and computational time during inference. It is also significantly smaller in the number of trainable parameters than competing architectures and can be trained end-to-end using stochastic gradient descent. We also benchmark the performance of SegNet on Pascal VOC12 salient object segmentation and the recent SUN RGB-D indoor scene understanding challenge. We show that SegNet provides competitive performance although it is significantly smaller than other architectures. We also provide a Caffe implementation of SegNet and a web demo at http://mi.eng.cam.ac.uk/projects/segnet/.
Article
Full-text available
We consider the problem of depth estimation from a single monocular image in this work. It is a challenging task, as no reliable depth cues are available, e.g., stereo correspondences, motions, etc. Previous efforts have focused on exploiting geometric priors or additional sources of information, all using hand-crafted features. Recently, there is mounting evidence that features from deep convolutional neural networks (CNN) are setting new records for various vision applications. On the other hand, considering the continuous characteristic of the depth values, depth estimation can be naturally formulated as a continuous conditional random field (CRF) learning problem. Therefore, in this paper we present a deep convolutional neural field model for estimating depths from a single image, aiming to jointly explore the capacity of deep CNN and continuous CRF. Specifically, we propose a deep structured learning scheme which learns the unary and pairwise potentials of continuous CRF in a unified deep CNN framework. The proposed method can be used for depth estimation of general scenes with no geometric priors nor any extra information injected. In our case, the integral of the partition function can be analytically calculated, thus we can exactly solve the log-likelihood optimization. Moreover, solving the MAP problem for predicting depths of a new image is highly efficient, as closed-form solutions exist. We experimentally demonstrate that the proposed method outperforms state-of-the-art depth estimation methods on both indoor and outdoor scene datasets.
Article
Full-text available
Computer vision-based food recognition could be used to estimate a meal's carbohydrate content for diabetic patients. This study proposes a methodology for automatic food recognition, based on the bag-of-features (BoF) model. An extensive technical investigation was conducted for the identification and optimization of the best performing components involved in the BoF architecture, as well as the estimation of the corresponding parameters. For the design and evaluation of the prototype system, a visual dataset with nearly 5,000 food images was created and organized into 11 classes. The optimized system computes dense local features, using the scale-invariant feature transform on the HSV color space, builds a visual dictionary of 10,000 visual words by using hierarchical k-means clustering and finally classifies the food images with a linear support vector machine classifier. The system achieved classification accuracy of the order of 78%, thus proving the feasibility of the proposed approach in a very challenging image dataset.
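The BoF pipeline (local descriptors, a k-means vocabulary, word histograms, then a linear SVM) is easy to sketch end to end. The example below substitutes synthetic descriptors for dense SIFT and shrinks the vocabulary, so it shows the structure rather than the tuned system described above.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
# Synthetic stand-ins for 128-D dense local descriptors, 20 "images"
descs = [rng.normal(loc=c, size=(300, 128)) for c in (0.0, 1.0) for _ in range(10)]
labels = np.repeat([0, 1], 10)

# 1) Build a visual dictionary by clustering all training descriptors
kmeans = KMeans(n_clusters=50, n_init=10, random_state=0).fit(np.vstack(descs))

# 2) Encode each image as a normalized histogram of visual words
def encode(d):
    hist = np.bincount(kmeans.predict(d), minlength=50).astype(float)
    return hist / hist.sum()

X = np.array([encode(d) for d in descs])

# 3) Classify the histograms with a linear SVM
clf = LinearSVC().fit(X, labels)
print("training accuracy:", clf.score(X, labels))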
Article
Full-text available
Dietary assessment is important in health maintenance and intervention in many chronic conditions, such as obesity, diabetes, and cardiovascular disease. However, there is currently a lack of convenient methods for measuring the volume of food (portion size) in real-life settings. We present a computational method to estimate food volume from a single photographic image of food contained in a typical dining plate. First, we calculate the food location with respect to a 3D camera coordinate system using the plate as a scale reference. Then, the food is segmented automatically from the background in the image. Adaptive thresholding and snake modeling are implemented based on several image features, such as color contrast, regional color homogeneity and curve bending degree. Next, a 3D model representing the general shape of the food (e.g., a cylinder, a sphere, etc.) is selected from a pre-constructed shape model library. The position, orientation and scale of the selected shape model are determined by registering the projected 3D model and the food contour in the image, where the properties of the reference are used as constraints. Experimental results using various realistically shaped foods with known volumes demonstrated satisfactory performance of our image-based food volume measurement method even if the 3D geometric surface of the food is not completely represented in the input image.
Conference Paper
Full-text available
This paper addresses the problem of detecting and segmenting partially occluded objects of a known category. We first define a part labelling which densely covers the object. Our Layout Consistent Random Field (LayoutCRF) model then imposes asymmetric local spatial constraints on these labels to ensure the consistent layout of parts whilst allowing for object deformation. Arbitrary occlusions of the object are handled by avoiding the assumption that the whole object is visible. The resulting system is both efficient to train and to apply to novel images, due to a novel annealed layout-consistent expansion move algorithm paired with a randomised decision tree classifier. We apply our technique to images of cars and faces and demonstrate state-of-the-art detection and segmentation performance even in the presence of partial occlusion.
Conference Paper
Full-text available
We introduce the first visual dataset of fast foods with a total of 4,545 still images, 606 stereo pairs, 303 360° videos for structure from motion, and 27 privacy-preserving videos of eating events of volunteers. This work was motivated by research on fast food recognition for dietary assessment. The data was collected by obtaining three instances of 101 foods from 11 popular fast food chains, and capturing images and videos in both restaurant conditions and a controlled lab setting. We benchmark the dataset using two standard approaches, color histogram and bag of SIFT features, in conjunction with a discriminative classifier. Our dataset and the benchmarks are designed to stimulate research in this area and will be released freely to the research community.
Article
Full-text available
Multitask Learning is an approach to inductive transfer that improves learning for one task by using the information contained in the training signals of other related tasks. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. In this thesis we demonstrate multitask learning for a dozen problems. We explain how multitask learning works and show that there are many opportunities for multitask learning in real domains. We show that in some cases features that would normally be used as inputs work better if used as multitask outputs instead. We present suggestions for how to get the most out of multitask learning in artificial neural nets, present an algorithm for multitask learning with case-based methods like k-nearest neighbor and kernel regression, and sketch an algorithm for multitask learning in decision trees. Multitask learning improves generalization performance, can be applied in many different kinds of domains, and can be used with different learning algorithms. We conjecture there will be many opportunities for its use on real world problems.
Article
Multitask Learning is an approach to inductive transfer that improves generalization by using the domain information contained in the training signals of related tasks as an inductive bias. It does this by learning tasks in parallel while using a shared representation; what is learned for each task can help other tasks be learned better. This paper reviews prior work on MTL, presents new evidence that MTL in backprop nets discovers task relatedness without the need of supervisory signals, and presents new results for MTL with k-nearest neighbor and kernel regression. In this paper we demonstrate multitask learning in three domains. We explain how multitask learning works, and show that there are many opportunities for multitask learning in real domains. We present an algorithm and results for multitask learning with case-based methods like k-nearest neighbor and kernel regression, and sketch an algorithm for multitask learning in decision trees. Because multitask learning works, can be applied to many different kinds of domains, and can be used with different learning algorithms, we conjecture there will be many opportunities for its use on real-world problems.
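In neural networks this shared representation is typically a common trunk with one head per task and a summed loss, so each task's training signal shapes the shared features. A minimal PyTorch sketch; the two tasks, layer sizes and unweighted loss sum are illustrative assumptions, not a specific published model.

import torch
import torch.nn as nn

class MultiTaskNet(nn.Module):
    def __init__(self, in_dim=64, hidden=128, n_classes=10):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.cls_head = nn.Linear(hidden, n_classes)  # e.g. food category
        self.reg_head = nn.Linear(hidden, 1)          # e.g. portion size

    def forward(self, x):
        h = self.trunk(x)                 # shared representation
        return self.cls_head(h), self.reg_head(h)

net = MultiTaskNet()
x = torch.randn(8, 64)
logits, portion = net(x)
loss = (nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
        + nn.MSELoss()(portion.squeeze(1), torch.rand(8)))
loss.backward()   # gradients from both tasks update the shared trunk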
Conference Paper
With the arrival of Convolutional Neural Networks, the complex problem of food recognition has recently experienced an important improvement. The best results have been obtained using methods based on very deep Convolutional Neural Networks, which show that the deeper the model, the better the classification accuracy is. However, very deep neural networks may suffer from the overfitting problem. In this paper, we propose a combination of multiple classifiers based on Convolutional models that complement each other and thus achieve an improvement in performance. The evaluation of our approach is done on 2 public datasets: Food-101 as a dataset with a wide variety of fine-grained dishes, and Food-11 as a dataset of high-level food categories, where our approach outperforms the independent Convolutional Neural Networks models.
Conference Paper
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
With the arrival of convolutional neural networks, the complex problem of food recognition has experienced an important improvement in recent years. The best results have been obtained using methods based on very deep convolutional neural networks, which show that the deeper the model, the better the classification accuracy will be. However, very deep neural networks may suffer from the overfitting problem. In this paper, we propose a combination of multiple classifiers based on different convolutional models that complement each other and thus achieve an improvement in performance. The evaluation of our approach is done on two public datasets: Food-101 as a dataset with a wide variety of fine-grained dishes, and Food-11 as a dataset of high-level food categories, where our approach outperforms the independent CNN models.
Article
We present an unsupervised learning framework for the task of monocular depth and camera motion estimation from unstructured video sequences. We achieve this by simultaneously training depth and camera pose estimation networks using the task of view synthesis as the supervisory signal. The networks are thus coupled via the view synthesis objective during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performing comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performing favorably with established SLAM systems under comparable input settings.
Article
Feature pyramids are a basic component in recognition systems for detecting objects at different scales. But recent deep learning object detectors have avoided pyramid representations, in part because they are compute and memory intensive. In this paper, we exploit the inherent multi-scale, pyramidal hierarchy of deep convolutional networks to construct feature pyramids with marginal extra cost. A top-down architecture with lateral connections is developed for building high-level semantic feature maps at all scales. This architecture, called a Feature Pyramid Network (FPN), shows significant improvement as a generic feature extractor in several applications. Using FPN in a basic Faster R-CNN system, our method achieves state-of-the-art single-model results on the COCO detection benchmark without bells and whistles, surpassing all existing single-model entries including those from the COCO 2016 challenge winners. In addition, our method can run at 5 FPS on a GPU and thus is a practical and accurate solution to multi-scale object detection. Code will be made publicly available.
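One top-down merge step of the pyramid is compact enough to sketch: a 1x1 lateral convolution on the bottom-up feature is added to a 2x upsampled top-down map, and a 3x3 convolution then smooths the merged result. Channel counts follow the paper's fixed 256-d pyramid; the input tensors are dummies.

import torch
import torch.nn as nn
import torch.nn.functional as F

lateral = nn.Conv2d(512, 256, kernel_size=1)             # reduce bottom-up channels
smooth = nn.Conv2d(256, 256, kernel_size=3, padding=1)   # reduce upsampling aliasing

c4 = torch.randn(1, 512, 28, 28)   # bottom-up feature (e.g. a ResNet stage)
p5 = torch.randn(1, 256, 14, 14)   # top-down feature from the coarser level

p4 = smooth(lateral(c4) + F.interpolate(p5, scale_factor=2, mode="nearest"))
print(p4.shape)                    # torch.Size([1, 256, 28, 28])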
Conference Paper
The prevalence of diet-related chronic diseases strongly impacts global health and health services. Currently, it takes training and strong personal involvement to manage or treat these diseases. One way to assist with dietary assessment is through computer vision systems that can recognize foods and their portion sizes from images and output the corresponding nutritional information. When multiple food items may exist, a food segmentation stage should also be applied before recognition. In this study, we propose a method to detect and segment the food of already detected dishes in an image. The method combines region growing/merging techniques with a deep CNN-based food border detection. A semi-automatic version of the method is also presented that improves the result with minimal user input. The proposed methods are trained and tested on non-overlapping subsets of a food image database including 821 images, taken under challenging conditions and annotated manually. The automatic and semi-automatic dish segmentation methods reached average accuracies of 88% and 92%, respectively, in roughly 0.5 seconds per image.
Article
Automatic food understanding from images is an interesting challenge with applications in different domains. In particular, food intake monitoring is becoming more and more important because of the key role it plays in health and market economies. In this paper, we address the study of food image processing from the perspective of Computer Vision. As a first contribution we present a survey of the studies in the context of food image processing, from the early attempts to the current state-of-the-art methods. Since retrieval and classification engines able to work on food images are required to build automatic systems for diet monitoring (e.g., to be embedded in wearable cameras), we focus our attention on the representation of food images because it plays a fundamental role in the understanding engines. Food retrieval and classification is a challenging task since food is intrinsically deformable and presents high variability in appearance. To properly study the peculiarities of different image representations we propose the UNICT-FD1200 dataset. It is composed of 4,754 food images of 1,200 distinct dishes acquired during real meals. Each food plate is acquired multiple times and the overall dataset presents both geometric and photometric variability. The images of the dataset have been manually labeled considering 8 categories: Appetizer, Main Course, Second Course, Single Course, Side Dish, Dessert, Breakfast, Fruit. We have performed tests employing different state-of-the-art representations to assess their performance on the UNICT-FD1200 dataset. Finally, we propose a new representation based on the perceptual concept of Anti-Textons, which is able to encode spatial information between Textons and outperforms other representations in the context of food retrieval and classification.
Conference Paper
In this paper, we propose a novel and effective framework to expand an existing image dataset automatically, leveraging existing categories and crowdsourcing. In particular, we focus on the expansion of food image datasets. The number of food categories is uncountable, since foods differ from place to place. If we have a Japanese food dataset, it does not directly help build a French food recognition system. That is why food datasets for different food cultures have so far been built independently. In this paper, we therefore propose to leverage existing knowledge on foods of other cultures through a generic “foodness” classifier and domain adaptation. This enables us not only to build food datasets for other cultures automatically, based on an original food image dataset, but also to save as much crowdsourcing cost as possible. In the experiments, we show the effectiveness of the proposed method over the baselines.
Conference Paper
In this paper we address the problem of automatically recognizing pictured dishes. To this end, we introduce a novel method to mine discriminative parts using Random Forests (RF), which allows us to mine parts simultaneously for all classes and to share knowledge among them. To improve the efficiency of mining and classification, we only consider patches that are aligned with image superpixels, which we call components. To measure the performance of our RF component mining for food recognition, we introduce a novel and challenging dataset of 101 food categories with 101,000 images. With an average accuracy of 50.76%, our model outperforms alternative classification methods except for CNNs, including SVM classification on Improved Fisher Vectors and existing discriminative part-mining algorithms, by 11.88% and 8.13%, respectively. On the challenging MIT-Indoor dataset, our method compares favorably to other state-of-the-art component-based classification methods.
Article
Deeper neural networks are more difficult to train. We present a residual learning framework to ease the training of networks that are substantially deeper than those used previously. We explicitly reformulate the layers as learning residual functions with reference to the layer inputs, instead of learning unreferenced functions. We provide comprehensive empirical evidence showing that these residual networks are easier to optimize, and can gain accuracy from considerably increased depth. On the ImageNet dataset we evaluate residual nets with a depth of up to 152 layers---8x deeper than VGG nets but still having lower complexity. An ensemble of these residual nets achieves 3.57% error on the ImageNet test set. This result won the 1st place on the ILSVRC 2015 classification task. We also present analysis on CIFAR-10 with 100 and 1000 layers. The depth of representations is of central importance for many visual recognition tasks. Solely due to our extremely deep representations, we obtain a 28% relative improvement on the COCO object detection dataset. Deep residual nets are foundations of our submissions to ILSVRC & COCO 2015 competitions, where we also won the 1st places on the tasks of ImageNet detection, ImageNet localization, COCO detection, and COCO segmentation.
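The residual reformulation is captured by a basic block whose stacked layers compute F(x) and whose output is F(x) + x through an identity shortcut. A minimal PyTorch sketch of one such block (fixed channel count, no downsampling, not the full 152-layer network):

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    def __init__(self, channels=64):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        fx = self.bn2(self.conv2(self.relu(self.bn1(self.conv1(x)))))
        return self.relu(fx + x)   # identity shortcut: output is F(x) + x

print(BasicBlock()(torch.randn(1, 64, 32, 32)).shape)  # [1, 64, 32, 32]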
Conference Paper
Diet-related chronic diseases severely affect personal and global health. However, managing or treating these diseases currently requires long training and high personal involvement to succeed. Computer vision systems could assist with the assessment of diet by detecting and recognizing different foods and their portions in images. We propose novel methods for detecting a dish in an image and segmenting its contents with and without user interaction. All methods were evaluated on a database of over 1600 manually annotated images. The dish detection scored an average of 99% accuracy with a 0.2 s/image run time, while the automatic and semi-automatic dish segmentation methods reached average accuracies of 88% and 91%, respectively, with an average run time of 0.5 s/image, outperforming competing solutions.
Article
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully-convolutional network that simultaneously predicts object bounds and objectness scores at each position. RPNs are trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. With a simple alternating optimization, RPN and Fast R-CNN can be trained to share convolutional features. For the very deep VGG-16 model, our detection system has a frame rate of 5fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on PASCAL VOC 2007 (73.2% mAP) and 2012 (70.4% mAP) using 300 proposals per image. The code will be released.
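The RPN head itself is only three convolutions: a 3x3 layer slides over the shared features, and two sibling 1x1 layers emit per-anchor objectness scores and box offsets. A minimal sketch with k = 9 anchors per position (3 scales x 3 aspect ratios, as in the paper); the feature-map size is a dummy.

import torch
import torch.nn as nn

k = 9                                      # anchors per spatial position
conv = nn.Conv2d(512, 512, 3, padding=1)   # sliding 3x3 window
cls = nn.Conv2d(512, 2 * k, 1)             # object vs. background per anchor
reg = nn.Conv2d(512, 4 * k, 1)             # box regression deltas per anchor

feat = torch.randn(1, 512, 38, 50)         # shared full-image conv features
h = torch.relu(conv(feat))
print(cls(h).shape, reg(h).shape)          # [1, 18, 38, 50] and [1, 36, 38, 50]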
Conference Paper
We present a new dataset with the goal of advancing the state-of-the-art in object recognition by placing the question of object recognition in the context of the broader question of scene understanding. This is achieved by gathering images of complex everyday scenes containing common objects in their natural context. Objects are labeled using per-instance segmentations to aid in understanding an object's precise 2D location. Our dataset contains photos of 91 object types that would be easily recognizable by a 4-year-old along with per-instance segmentation masks. With a total of 2.5 million labeled instances in 328k images, the creation of our dataset drew upon extensive crowd worker involvement via novel user interfaces for category detection, instance spotting and instance segmentation. We present a detailed statistical analysis of the dataset in comparison to PASCAL, ImageNet, and SUN. Finally, we provide baseline performance analysis for bounding box and segmentation detection results using a Deformable Parts Model.
Article
Predicting depth is an essential component in understanding the 3D geometry of a scene. While for stereo images local correspondence suffices for estimation, finding depth relations from a single image is less straightforward, requiring integration of both global and local information from various cues. Moreover, the task is inherently ambiguous, with a large source of uncertainty coming from the overall scale. In this paper, we present a new method that addresses this task by employing two deep network stacks: one that makes a coarse global prediction based on the entire image, and another that refines this prediction locally. We also apply a scale-invariant error to help measure depth relations rather than scale. By leveraging the raw datasets as large sources of training data, our method achieves state-of-the-art results on both NYU Depth and KITTI, and matches detailed depth boundaries without the need for superpixelation.
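The scale-invariant error is simple to state: with d_i = log y_i - log y*_i, the loss is mean(d^2) - lambda * mean(d)^2, which fully discounts a global scale offset when lambda = 1 (the paper trains with lambda = 0.5). A small PyTorch version:

import torch

def scale_invariant_loss(pred, target, lam=0.5):
    d = torch.log(pred) - torch.log(target)
    return (d ** 2).mean() - lam * d.mean() ** 2

pred = torch.rand(4, 1, 60, 80) + 0.1
# A uniformly rescaled prediction incurs zero loss when lam = 1:
print(scale_invariant_loss(pred, 2.0 * pred, lam=1.0))   # ~0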
Conference Paper
In this paper, we propose novel methodologies for the automatic segmentation and recognition of multi-food images. The proposed methods implement the first modules of a carbohydrate counting and insulin advisory system for type 1 diabetic patients. Initially the plate is segmented using pyramidal mean-shift filtering and a region growing algorithm. Then each of the resulting segments is described by both color and texture features and classified by a support vector machine into one of six different major food classes. Finally, a modified version of the Huang and Dom evaluation index was proposed, addressing the particular needs of the food segmentation problem. The experimental results prove the effectiveness of the proposed method, achieving a segmentation accuracy of 88.5% and a recognition rate of 87%.
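The first step, pyramidal mean-shift filtering, is available directly in OpenCV and flattens color regions so that region growing has clean, homogeneous areas to expand over. A toy sketch on a synthetic image; the spatial and color radii are illustrative, not the paper's settings.

import numpy as np
import cv2

img = np.zeros((120, 120, 3), np.uint8)
cv2.circle(img, (60, 60), 40, (80, 160, 200), -1)    # a "food" blob
noise = np.random.default_rng(0).integers(0, 20, img.shape, dtype=np.uint8)
img = cv2.add(img, noise)                            # mild sensor noise

flat = cv2.pyrMeanShiftFiltering(img, sp=15, sr=30)  # spatial / color radius
print(len(np.unique(img.reshape(-1, 3), axis=0)),
      "->", len(np.unique(flat.reshape(-1, 3), axis=0)))  # far fewer colors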
Conference Paper
In this paper, we propose a two-step method to recognize multiple-food images by detecting candidate regions with several methods and classifying them with various kinds of features. In the first step, we detect several candidate regions by fusing the outputs of several region detectors, including Felzenszwalb's deformable part model (DPM) [1], a circle detector and JSEG region segmentation. In the second step, we apply a feature-fusion-based food recognition method to the bounding boxes of the candidate regions, using various kinds of visual features including bag-of-features of SIFT and CSIFT with spatial pyramid (SP-BoF), histogram of oriented gradient (HoG), and Gabor texture features. In the experiments, we estimated ten food candidates for multiple-food images in descending order of confidence scores. As a result, we achieved a 55.8% classification rate on a multiple-food image dataset, which improved the baseline result (using only DPM) by 14.3 points. This demonstrates that the proposed two-step method is effective for the recognition of multiple-food images.
Article
Can a large convolutional neural network trained for whole-image classification on ImageNet be coaxed into detecting objects in PASCAL? We show that the answer is yes, and that the resulting system is simple, scalable, and boosts mean average precision, relative to the venerable deformable part model, by more than 40% (achieving a final mAP of 48% on VOC 2007). Our framework combines powerful computer vision techniques for generating bottom-up region proposals with recent advances in learning high-capacity convolutional neural networks. We call the resulting system R-CNN: Regions with CNN features. The same framework is also competitive with state-of-the-art semantic segmentation methods, demonstrating its flexibility. Beyond these results, we execute a battery of experiments that provide insight into what the network learns to represent, revealing a rich hierarchy of discriminative and often semantically meaningful features.
Article
We formulate a layered model for object detection and image segmentation. We describe a generative probabilistic model that composites the output of a bank of object detectors in order to define shape masks and explain the appearance, depth ordering, and labels of all pixels in an image. Notably, our system estimates both class labels and object instance labels. Building on previous benchmark criteria for object detection and image segmentation, we define a novel score that evaluates both class and instance segmentation. We evaluate our system on the PASCAL 2009 and 2010 segmentation challenge data sets and show good test results with state-of-the-art performance in several categories, including segmenting humans.
Conference Paper
We propose DiaWear, a novel assistive mobile phone-based calorie monitoring system to improve the quality of life of diabetes patients and individuals with unique nutrition management needs. Our goal is to achieve improved daily semi-automatic food recognition using a mobile wearable cell phone. DiaWear currently uses a neural network classification scheme to identify food items from a captured image. It is difficult to account for the varying and implicit nature of certain foods using traditional image recognition techniques. To overcome these limitations, we introduce the role of the mobile phone as a platform to gather contextual information from the user and system in obtaining better food recognition.
Conference Paper
We present a system that improves the accuracy of food intake assessment using computer vision techniques. Traditional dietetic methods suffer from the drawback of either inaccurate assessment or complex lab measurement. Our solution is to use a mobile phone to capture images of foods, recognize food types, estimate their respective volumes and finally return quantitative nutrition information. Automated and accurate food recognition presents the following challenges. First, there exists a large variety of food types that people consume in everyday life. Second, a single category of food may contain large variations due to different ways of preparation. Also, diverse lighting conditions may lead to varying visual appearance of foods. All of these pose a challenge to state-of-the-art recognition approaches. Moreover, the low-quality images captured using cell phones make the task of 3D reconstruction difficult. In this paper, we combine several vision techniques (visual recognition and 3D reconstruction) to achieve quantitative food intake estimation. Evaluation of both recognition and reconstruction is provided in the experimental results.
Article
Studies of food habits and dietary intakes face a number of unique respondent and observer considerations at different stages from early childhood to late adolescence. Despite this, intakes have often been reported as if valid, and the interpretation of links between intake and health has been based, often erroneously, on the assumption of validity. However, validation studies of energy intake data have led to the widespread recognition that much of the dietary data on children and adolescents is prone to reporting error, mostly through under-reporting. Reporting error is influenced by body weight status and does not occur systematically across different age groups or different dietary survey techniques. It appears that the available methods for assessing the dietary intakes of children are, at best, able to provide unbiased estimates of energy intake only at the group level, while the food intake data of most adolescents are particularly prone to reporting error at both the group and the individual level. Moreover, evidence for the existence of subject-specific responding in dietary assessments challenges the assumption that repeated measurements of dietary intake will eventually obtain valid data. Only limited progress has been made in understanding the variables associated with misreporting in these age groups, the associated biases in estimating nutrient intakes and the most appropriate way to interpret unrepresentative dietary data. Until these issues are better understood, researchers should exercise considerable caution when evaluating all such data.
Carbohydrate Estimation Supported by the GoCARB System in Individuals With Type 1 Diabetes: A Randomized Prospective Pilot Study
  • L Bally
  • J Dehais
  • C T Nakas
  • M Anthimopoulos
  • M Laimer
  • D Rhyner
  • G Rosenberg
  • T Zueger
  • P Diem
  • S Mougiakakou
  • C Stettler
Multitask Learning, Machine Learning
  • R Caruana
Model-based measurement of food portion size for image-based dietary assessment using 3D/2D registration
  • H C Chen
  • W Jia
  • Z Li
  • Y N Sun
  • J D Fernstrom
  • M Sun
Diabetes60 - Inferring Bread Units From Food Images Using Fully Convolutional Neural Networks
  • P F Christ
  • S Schlecht
  • F Ettlinger
Feature pyramid networks for object detection
  • T.-Y Lin
  • P Dollár
  • R Girshick
  • K He
  • B Hariharan
  • S Belongie