Real-time Food Recognition System on a Smartphone
Yoshiyuki Kawano ·Keiji Yanai
Abstract We propose a mobile food recognition system, the purposes of which are estimating the calorie and nutrition content of foods and recording a user's eating habits. Since we adopt image recognition methods that are suitable for mobile devices and all image recognition processing is performed on the smartphone itself, the system does not need to send images to a server and runs on an ordinary smartphone in real time.
To recognize food items, a user first draws bounding boxes by touching the screen, and then the system starts food item recognition within the indicated bounding boxes. To recognize them more accurately, we segment each food item region by GrabCut, extract image features and finally classify it into one of one hundred food categories with a linear SVM. As image features, we adopt two kinds of features: one is the combination of the standard bag-of-features and color histograms with χ² kernel feature maps, and the other is a HOG patch descriptor and a color patch descriptor with the state-of-the-art Fisher Vector representation. In addition, the system estimates the direction of food regions where a higher SVM output score is expected to be obtained, and shows the estimated direction as an arrow on the screen in order to ask the user to move the smartphone camera.
This recognition process is performed repeatedly and continuously. We implemented this system as an Android smartphone application so as to use multiple CPU cores effectively for real-time recognition.
In the experiments, we achieved a 79.2% classification rate for the top 5 category candidates on a 100-category food dataset with ground-truth bounding boxes when we used HOG and color patches with Fisher Vector coding as image features. In addition, we obtained a positive evaluation in a user study compared with a food recording system without object recognition.
Keywords Food Recognition ·Dietary Recording ·Smartphone ·Fisher Vector ·Mobile
Image Recognition
The University of Electro-Communications, Tokyo
1-5-1 Chofugaoka, Chofu-shi, Tokyo, 182-8585 Japan
Tel.: +81-42-440-7365
Fax: +81-42-443-5334
E-mail: {kawano-y,yanai}@mm.inf.uec.ac.jp
1 Introduction
In recent years, food habit recording services for smartphones such as iPhone and Android phones have become popular. They can make users aware of food habit problems such as poor food balance and unhealthy eating trends, which is useful for disease prevention and dieting. However, most such services require users to select eaten food items from hierarchical menus by hand, which is too time-consuming and troublesome for most people to keep using for a long period.
Due to the recent rapid progress of smartphones, they have obtained enough computational power for real-time image recognition. Currently, a quad-core CPU is common in smartphones, which is almost equivalent in performance to a PC CPU released several years ago. Old-style mobile systems with image recognition need to send images to high-performance servers, which inevitably causes communication delay, incurs communication costs, and makes availability depend on network conditions. In addition, as the number of users increases, more server-side computational resources are required, which makes it difficult to recognize objects in real time.
On the other hand, image recognition on the client side, that is, on a smartphone, is much more promising in terms of availability, communication cost, delay, and server costs, since it needs no wireless connection and no communication delay. Then, by taking advantage of the rich computational power of recent smartphones as well as recent advanced object recognition techniques, in this paper we propose a real-time food recognition system which runs on a common smartphone.
To do that, we adopt two kinds of image recognition methods: one is the combination of the standard bag-of-features (BoF) and color histograms with χ² kernel feature maps, and the other is a HOG patch descriptor and a color patch descriptor with the state-of-the-art Fisher Vector representation.
In the first method, we adopt a bag-of-features representation with SURF local features and a standard color histogram as image features, and we use a linear SVM with a fast χ² kernel based on kernel feature maps [Vedaldi and Zisserman(2012)] as the image classifier.
In the second method, we adopt a more sophisticated approach than the first one. Some effective image representations have been proposed recently which can boost recognition performance when using a linear SVM. In particular, the Fisher Vector [Perronnin and Dance(2007), Perronnin et al.(2010)] is known as a high-performance method among these recent image representations. It is suitable for a linear classifier, and it has been shown to improve recognition accuracy over the popular combination of bag-of-features (BoF) and a non-linear SVM. Moreover, while BoF needs a larger dictionary to improve recognition accuracy, a larger dictionary increases the computational cost of searching for the nearest visual words. On the other hand, the Fisher Vector can achieve high recognition accuracy even with a small dictionary and with low computational complexity, which is an advantage for mobile devices.
Thus, adopting the Fisher Vector as the encoding method is better in terms of recognition accuracy and processing time for mobile object recognition. However, no system on a smartphone has so far carried out rapid and high-precision image recognition with the Fisher Vector. We therefore propose a recognition method for a smartphone which is both rapid and accurate by making good use of the smartphone's computational resources with the Fisher Vector.
In the experiments, we achieved a 79.2% classification rate within the top five candidates for a 100-category food dataset with ground-truth bounding boxes using the second method. This result outperformed the existing food recognition method running on the server side in terms of classification accuracy on the same 100 food categories.
Fig. 1 A screenshot of the main screen of the proposed system, showing the selected food, the recognition results with confidence values, the food image, the volume input slider, and the suggested direction.
The processing time of the second method for 100-category food image recognition is only 0.065 seconds.
Here, we explain the implemented system briefly. Figure 1 shows a screenshot of the proposed system, which runs as an Android smartphone application. First, a user points the smartphone camera at foods and draws a bounding box (the yellow rectangle in the figure) by dragging on the screen; food image recognition is then activated for the given bounding box. The top five candidates for the yellow bounding box are shown on the left side of the screen. If the user touches one of the candidate items, the food category name and the photo are recorded as a daily food record in the system. In addition, the proposed system has functions for automatic adjustment of bounding boxes based on the segmentation result of GrabCut [Rother et al.(2004)], and for estimation of the direction of the expected food regions based on Efficient Sub-window Search (ESS) [Lampert et al.(2008)]. Since this recognition process is performed repeatedly, a user can search for a good position of the smartphone camera to recognize foods accurately by moving it continuously, without pushing a camera shutter button.
To summarize, our contribution in this paper is five-fold: (1) implementing an interactive and real-time food recognition and recording system running on a consumer smartphone, (2) using a linear SVM with a fast χ² kernel for fast and accurate food recognition as the first method, (3) using Fisher Vector coding with HOG and color patches as the second method, (4) adjusting the given bounding box, and (5) estimating the direction of the expected food region automatically.
The rest of this paper is organized as follows: Section 2 describes related work. In Section 3, we give an overview of the proposed system. In Section 4, we explain the food recognition method in detail. In Section 5, we describe the implementation of the proposed system as a smartphone application. Section 6 describes the experimental results and user study, and in Section 7 we conclude this paper.
2 Related Work
2.1 Food Recognition
Food recognition is a difficult task, since the appearance of food items varies greatly even within the same category; it is a kind of fine-grained image categorization problem. For food recognition, Yang et al. [Yang et al.(2010)] proposed pairwise local features which exploit the positional relations of eight basic food ingredients. Matsuda et al. [Matsuda et al.(2012)] proposed a method for recognizing multiple-food images which first detects food regions with several detectors, including a circle detector, JSEG [Deng and Manjunath(2001)] and the Deformable Part Model (DPM) [Felzenszwalb et al.(2010)], and then recognizes food categories with the extracted color, texture, gradient, and SIFT features using multiple kernel learning (MKL).
As a food recording system with food recognition, the Web application FoodLog [Kitamura et al.(2008), Kitamura et al.(2009)] estimates food balance. It divides a food image into 300 blocks, extracts color and DCT coefficients from each block, and then classifies each block into one of five groups: staple, main dish, side dish, fruit, and non-food.
The TADA dietary assessment system [Mariappan et al.(2009)] performs food identification and quantity estimation, although with some restrictions: food must be put on white dishes and food photos must be taken with a checkerboard for food quantity estimation. Recently, the same research group proposed a method to estimate the volume of foods based on 3D template matching [Chae et al.(2011), He et al.(2013)].
In all the above-mentioned systems, image recognition is performed on servers, which prevents the systems from being interactive due to communication delays. On the other hand, our system can recognize food items on the client side in real time, which requires no communication with outside computational resources and enables users to use it interactively. Note that our current system recognizes only 100 food categories, and it requires the user's assistance to estimate food volumes by touching a slider on the system screen. However, this assistance is very easy to give, since our system is an interactive system on a smartphone with a touch screen.
2.2 Mobile Device and Computer Vision
With the spread of smartphones, several mobile applications exploiting computer vision techniques have been proposed. Google Goggles¹ is one of the most well-known mobile image recognition applications; it recognizes specific objects such as famous landmarks, product logos and famous art in photos taken by users, and returns the names and some information on the recognized objects. However, its recognition targets are limited to specific objects whose appearance stays unchanged, which differs from this paper, whose targets are generic objects.
Kumar et al. [Kumar et al.(2012)] proposed "Leafsnap", an application which recognizes 184 leaf species. For a leaf image on a solid light-colored background, they segment the leaf and extract curvature-based shape features. Maruyama et al. [Maruyama et al.(2012)] proposed a mobile recipe recommendation system which extracts color features, recognizes 30 kinds of food ingredients and recommends food recipes based on the recognized ingredients on a mobile device.
¹ http://www.google.com/mobile/goggles/
In the above-mentioned mobile vision systems, generic object recognition of leaves and food ingredients was performed, while we tackle category-level object recognition of foods on a mobile device.
As an interactive mobile vision system, Yu et al. [Yu et al.(2011)] proposed Active Query Sensing (AQS), whose objective is to localize the current position by matching street-view photos. When the system fails in a location search, it suggests to the user the best viewing direction for the next reference photo based on pre-computed saliency at each location. Our system is also built as an interactive system which detects object regions based on the bounding boxes a user draws and suggests the direction of food regions where a higher evaluation output score is expected.
3 System Overview
The final objective of the proposed system is to support users in recording their daily foods and checking their eating habits. To make that easy, we built food image recognition into the proposed system. In this paper, we mainly describe the food image recognition part of the proposed system.
The processing flow of typical usage of the proposed system is as follows (see also Figure 2):
1. A user points a smartphone camera toward food items before eating them. The system
is continuously acquiring frame images from the camera device in the background.
2. A user draws bounding boxes over food items on the screen. The bounding boxes are automatically adjusted to the food regions. More than two bounding boxes can be drawn at the same time.
3. Food recognition is carried out for each of the regions within the bounding boxes. At the
same time, the direction of the region having the higher evaluation score is estimated for
each bounding box.
4. As results of food recognition and direction estimation, the top five food item candidates
and the direction arrows are shown on the screen.
5. A user selects food items from the food candidate list, if found, by touching the screen. Before selecting a food item, the user can indicate its rough relative volume with the slider at the bottom right of the screen. If the item is not found, the user moves the smartphone slightly and the flow goes back to step 3.
6. The calorie and nutrition of each of the recognized food items are shown on the screen.
In addition, a user can see his/her own meal record and its details, including the calories and protein of each food item, on the screen as shown in Figure 3(a). Meal records can be sent to the server, and a user can view them on the Web (Figure 3(b)).
4 Methods
In this section, we explain the following three processing steps:
1. Adjustment of bounding boxes.
2. Recognition of food items within the given bounding boxes.
3. Estimation of the direction of the possible food region.
Regarding recognition of food items, we propose two kinds of methods; in this section, we explain both of them.
Fig. 2 System process flow: point a smartphone at the dishes, start recognition, input a bounding box, select food items, save the food image, and register the food record.
Fig. 3 Meal records: (a) meal records on the screen; (b) meal records on the Web.
4.1 Bounding Box Adjustment
Before starting to recognize food items in the frame images taken by the smartphone camera, the system first requires the user to draw bounding boxes which bound food items on the screen by dragging along the diagonals of the boxes.
The bounding boxes a user draws roughly in this way are not always accurate, and sometimes they are too large for the actual food regions. Therefore, we apply the well-known graph-cut-based segmentation algorithm GrabCut [Rother et al.(2004)], and then modify the bounding boxes so as to fit them to the food regions segmented by GrabCut. GrabCut needs initial foreground and background regions as seeds. Here, we give GrabCut the regions within the bounding boxes as foreground and the areas outside the doubly-extended boxes of the original bounding boxes as background. Since the computational cost of GrabCut is relatively high for a real-time recognition system, the bounding box adjustment is performed only once after the original bounding box is drawn.
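For illustration, the following is a minimal Python sketch of this GrabCut-based adjustment using OpenCV. It is a sketch under simplifying assumptions, not the actual Android implementation: it uses OpenCV's rectangle-initialized mode, whereas the system described above seeds the background from the area outside the doubly-extended box, and the function name adjust_bounding_box is ours.

```python
import cv2
import numpy as np

def adjust_bounding_box(image, box, iterations=3):
    """Shrink a user-drawn box (x, y, w, h) to the food region found by GrabCut.
    Returns the bounding rectangle of the segmented foreground, or the
    original box if no foreground pixels are found."""
    mask = np.zeros(image.shape[:2], np.uint8)
    bgd_model = np.zeros((1, 65), np.float64)    # GrabCut's internal GMM models
    fgd_model = np.zeros((1, 65), np.float64)

    # Pixels inside the user-drawn rectangle are treated as probable foreground,
    # pixels outside it as background (rectangle-initialized mode).
    cv2.grabCut(image, mask, box, bgd_model, fgd_model,
                iterations, cv2.GC_INIT_WITH_RECT)

    # Keep definite and probable foreground, then take its bounding rectangle.
    fg = ((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD)).astype(np.uint8)
    ys, xs = np.nonzero(fg)
    if xs.size == 0:
        return box
    return (int(xs.min()), int(ys.min()),
            int(xs.max() - xs.min() + 1), int(ys.max() - ys.min() + 1))
```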
4.2 Food Item Recognition
Food recognition is performed for each of the window images within the given bounding
boxes. Firstly, image features are extracted from each window, secondly feature vectors are
built based on the pre-computed codebook, and finally we evaluate the feature vectors with
the trained linear SVMs on 100 food categories. The top five candidates which have the top
five SVM scores over 100 categories are shown on the screen as food item candidates for
the given bounding boxes.
As image features, we adopt two kinds of features: one is the combination of the standard bag-of-features and color histograms with χ² kernel feature maps, and the other is a HOG patch descriptor and a color patch descriptor with the state-of-the-art Fisher Vector representation. We explain both of the methods and compare them in Section 6.
4.2.1 Image Feature: Bag-of-features and Color Histogram
As the first set of image features, we adopt bag-of-features [Csurka et al.(2004)] and color histograms, both of which are standard image features for generic object recognition. It has been shown that fusing many kinds of image features is effective for food recognition [Matsuda et al.(2012)]. However, since we aim to implement a mobile recognition system which can run in real time, the image features to be extracted should be kept to a minimum. We therefore evaluated and compared the performance and computational cost of various common image features, both global and local, including the color histogram, color moments, the color auto-correlogram, Gabor texture features, HOG (Histogram of Oriented Gradients) [Dalal and Triggs(2005)], Pyramid HOG, and bag-of-features with SURF features [Bay et al.(2008)]. Finally, we chose the combination of the color histogram and bag-of-features with SURF. To save computational cost, if the longer side of a bounding box is more than 200 pixels, the image extracted from the bounding box is resized so that the longer side becomes 200 pixels, preserving its aspect ratio.
Color Histogram: We divide a window image into 3×3 blocks and extract a 64-bin RGB color histogram from each block, giving a 576-dim color histogram in total. Note that we examined HSV and La*b* color histograms as well, and the RGB color histogram achieved the best results among them.
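A minimal NumPy sketch of this block-wise color histogram follows; the 4×4×4 quantization of the RGB channels into 64 bins and the function name block_color_histogram are our assumptions, since the paper only specifies a 64-bin RGB histogram per block.

```python
import numpy as np

def block_color_histogram(window, grid=3, levels=4):
    """576-dim color feature: a 64-bin RGB histogram (here 4x4x4 quantization)
    from each cell of a 3x3 grid, concatenated.  L1 normalization is applied
    later, before the chi-square kernel feature map."""
    h, w, _ = window.shape
    feats = []
    for gy in range(grid):
        for gx in range(grid):
            cell = window[gy * h // grid:(gy + 1) * h // grid,
                          gx * w // grid:(gx + 1) * w // grid]
            # Quantize each RGB channel into `levels` levels -> levels**3 bins.
            q = (cell.reshape(-1, 3) // (256 // levels)).astype(int)
            idx = (q[:, 0] * levels + q[:, 1]) * levels + q[:, 2]
            feats.append(np.bincount(idx, minlength=levels ** 3).astype(float))
    return np.concatenate(feats)  # 9 cells x 64 bins = 576 dims
```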
Bag-of-features with SURF: As local features, we use densely sampled SURF features. SURF [Bay et al.(2008)] is a 64-dim local descriptor that is invariant to scale, orientation and illumination changes. We sample points densely at scales 12 and 16 with an 8-pixel step within the window. To build bag-of-features vectors, we prepared a codebook of 500 codewords by k-means clustering in advance. We apply soft assignment [Philbin et al.(2008)] to the 3 nearest codewords, weighted by the reciprocal of the Euclidean distance, and we use fast approximate nearest-neighbor search based on a kd-tree to find the corresponding codewords for each sampled point. Finally, we create a 500-dim bag-of-SURF vector.
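The soft-assignment encoding can be sketched as follows in Python; the sketch uses SciPy's exact kd-tree query rather than the approximate search used on the device, and the per-descriptor weight normalization and final L1 normalization are assumptions on our part.

```python
import numpy as np
from scipy.spatial import cKDTree

def bof_soft_assignment(descriptors, codebook, knn=3):
    """Encode local descriptors (T x 64 dense SURF) into a 500-dim
    bag-of-features vector: each descriptor votes for its 3 nearest codewords,
    weighted by the reciprocal of the Euclidean distance."""
    tree = cKDTree(codebook)                    # kd-tree over the codebook
    dist, idx = tree.query(descriptors, k=knn)  # exact NN here; the device uses
                                                # an approximate search instead
    weights = 1.0 / np.maximum(dist, 1e-6)
    weights /= weights.sum(axis=1, keepdims=True)
    bof = np.zeros(len(codebook))
    for w_row, i_row in zip(weights, idx):
        bof[i_row] += w_row
    return bof / max(bof.sum(), 1e-12)          # L1-normalize for the kernel map
```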
Feature Embedding: As a classifier, we use a linear SVM because of its efficiency in computation and memory. Although a linear SVM is fast and memory-saving, its classification accuracy is not as good as that of a non-linear kernel SVM such as an SVM with a χ²-RBF kernel. To compensate for this weakness of the linear SVM, we use an explicit embedding technique; in this work, we adopt kernel feature maps. Vedaldi et al. [Vedaldi and Zisserman(2012)] proposed homogeneous kernel maps for the χ², intersection, Jensen-Shannon and Hellinger kernels, which are represented in closed form. We choose the map for the χ² kernel and set its parameter so that the dimension of the mapped feature vectors is 3 times the dimension of the original feature vectors, as shown in the following equation:
$$\phi(x) = \sqrt{x}\begin{pmatrix} 0.8 \\ 0.6\cos(0.6\log x) \\ 0.6\sin(0.6\log x) \end{pmatrix} \qquad (1)$$
This mapping can be applied to L1-normalized histograms [Vedaldi and Zisserman(2012)], so we apply it to the L1-normalized color histograms and bag-of-SURF vectors. Finally, we obtain a 1728-dim vector as the color feature and a 1500-dim vector as the BoF feature.
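For illustration, the following is a minimal NumPy sketch of the explicit χ² feature map of Eq. (1), applied element-wise to an L1-normalized histogram; the handling of zero-valued bins with a small epsilon is our assumption.

```python
import numpy as np

def chi2_feature_map(x, eps=1e-12):
    """Explicit feature map of Eq. (1) for the chi-square kernel, applied
    element-wise to an L1-normalized, non-negative histogram x.
    Each input dimension is expanded to 3 dimensions (576 -> 1728, 500 -> 1500)."""
    x = np.asarray(x, dtype=np.float64)
    sqrt_x = np.sqrt(x)
    log_x = np.log(np.maximum(x, eps))      # guard against log(0)
    mapped = np.stack([0.8 * sqrt_x,
                       0.6 * sqrt_x * np.cos(0.6 * log_x),
                       0.6 * sqrt_x * np.sin(0.6 * log_x)], axis=1)
    return mapped.ravel()
```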
4.2.2 Image Feature: HOG and Color Patch with Fisher Vector
As the second set of image features, we adopt a HOG patch and a color patch as local descriptors, and the Fisher Vector as the representation of the local descriptors.
HOG Patch: The Histogram of Oriented Gradients (HOG) was proposed by Dalal et al. [Dalal and Triggs(2005)]. It is similar to SIFT in terms of how it describes local patterns, which is based on gradient histograms. A characteristic of HOG is that it is not invariant to scale and rotation, unlike standard local descriptors such as SIFT [Lowe(2004)] and SURF [Bay et al.(2008)]. Since the HOG description is very simple, it can be computed much faster than common local descriptors such as SIFT and SURF, which is an important characteristic for real-time recognition on a smartphone. In addition, local features can be extracted more densely to compensate for the lack of invariance to scale change and rotation, which improves recognition accuracy.
We extract HOG features as local features. We divide a local patch into 2×2 blocks (four blocks in total) and extract a gradient histogram over eight orientations from each block, giving a 32-dim HOG patch feature in total. The 32-dim HOG patch is then normalized to unit L2 length. Note that we did not adopt the HOG-specific normalization over overlapping (sliding) blocks; the HOG used here is a simpler version of the original HOG. After extracting the 32-dim HOG patch, PCA is applied to each HOG patch to reduce its dimension from 32 to 24.
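A minimal NumPy sketch of this simplified HOG patch descriptor follows; the use of unsigned gradient orientations and magnitude-weighted voting are our assumptions, since the paper does not specify them.

```python
import numpy as np

def hog_patch(gray_patch, cells=2, orientations=8):
    """32-dim simplified HOG patch: the patch is split into 2x2 cells, an
    8-bin gradient-orientation histogram is accumulated per cell (no
    overlapping-block normalization), and the vector is L2-normalized."""
    gy, gx = np.gradient(gray_patch.astype(np.float64))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)                  # unsigned orientation
    bins = np.minimum((ang / np.pi * orientations).astype(int), orientations - 1)

    h, w = gray_patch.shape
    feat = []
    for cy in range(cells):
        for cx in range(cells):
            sl = (slice(cy * h // cells, (cy + 1) * h // cells),
                  slice(cx * w // cells, (cx + 1) * w // cells))
            feat.append(np.bincount(bins[sl].ravel(), weights=mag[sl].ravel(),
                                    minlength=orientations))
    feat = np.concatenate(feat)                              # 2 * 2 * 8 = 32 dims
    return feat / max(np.linalg.norm(feat), 1e-12)
```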
Color Patch: We use the mean and variance of the RGB values of the pixels within a local patch as the color patch feature. We divide a local patch into 2×2 blocks, and extract the mean and variance of each RGB channel within each block, giving a 24-dim color patch feature in total. PCA is applied without dimension reduction, so the dimension of a color patch feature is kept at 24.
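The corresponding color patch descriptor can be sketched as follows; scaling the RGB values to [0, 1] is our assumption.

```python
import numpy as np

def color_patch(rgb_patch, cells=2):
    """24-dim color patch: for each of the 2x2 cells, the mean and variance
    of each RGB channel (2 x 2 cells x 3 channels x 2 statistics = 24 dims)."""
    h, w, _ = rgb_patch.shape
    feat = []
    for cy in range(cells):
        for cx in range(cells):
            cell = rgb_patch[cy * h // cells:(cy + 1) * h // cells,
                             cx * w // cells:(cx + 1) * w // cells]
            cell = cell.reshape(-1, 3).astype(np.float64) / 255.0
            feat.extend(cell.mean(axis=0))
            feat.extend(cell.var(axis=0))
    return np.asarray(feat)
```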
Fisher Vector: Recently, the Fisher Vector [Perronnin and Dance(2007), Perronnin et al.(2010)] has become known as a high-performance method for representing a set of local features. It can reduce quantization error compared with bag-of-features [Csurka et al.(2004)] by using higher-order statistics. It has been shown to improve recognition accuracy over bag-of-features (BoF) and other recent representations such as Locality-constrained Linear Coding (LLC) [Wang et al.(2010)][Chatfield et al.(2011)], and most of the higher-ranked teams in the large-scale visual recognition challenge (LSVRC) used the Fisher Vector [Jia et al.(2012)].
Moreover, while BoF needs a larger dictionary to improve recognition accuracy, a larger dictionary increases the computational cost of searching for the nearest visual words. On the other hand, the Fisher Vector can achieve high recognition accuracy even with a small dictionary and with low computational complexity, which is an advantage for mobile devices.
Following [Perronnin et al.(2010)], we encode local descriptors into a Fisher Vector. We choose a Gaussian mixture model (GMM) as the probability density function (pdf), which is given as follows:
$$p(x\,|\,\theta) = \sum_{i=1}^{K} \pi_i\, \mathcal{N}(x\,|\,\mu_i, \Sigma_i) \qquad (2)$$
where x is a local descriptor, K is the number of Gaussian components, and θ = {π_i, µ_i, Σ_i, i = 1, ..., K} are the GMM parameters: π_i is the mixing coefficient, µ_i is a mean vector and Σ_i is a covariance matrix. We assume that the covariance matrices are diagonal, with the diagonal elements represented by the variance vector σ_i².
The posterior probability that x_t belongs to component i is given as follows:
$$\gamma_t(i) = \frac{\pi_i\, \mathcal{N}(x_t\,|\,\mu_i, \Sigma_i)}{\sum_{j=1}^{K} \pi_j\, \mathcal{N}(x_t\,|\,\mu_j, \Sigma_j)} \qquad (3)$$
The gradients with respect to the means and variances are then defined as follows:
$$G^X_{\mu,i} = \frac{1}{T\sqrt{\pi_i}} \sum_{t=1}^{T} \gamma_t(i)\, \frac{x_t - \mu_i}{\sigma_i} \qquad (4)$$
$$G^X_{\sigma,i} = \frac{1}{T\sqrt{2\pi_i}} \sum_{t=1}^{T} \gamma_t(i) \left[ \frac{(x_t - \mu_i)^2}{\sigma_i^2} - 1 \right] \qquad (5)$$
Finally, the gradients G^X_{µ,i} and G^X_{σ,i} are calculated for all the Gaussian components, and the Fisher Vector G^X_θ is their concatenation. Therefore, the Fisher Vector is 2KD-dimensional.
In this paper, the number of Gaussian components is 32 and the local descriptors are reduced to 24 dimensions by PCA, so each Fisher Vector is 1536-dimensional. To improve recognition accuracy, we apply power normalization (α = 0.5) and L2 normalization [Perronnin et al.(2010)].
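For illustration, a compact NumPy sketch of the Fisher Vector encoding described by Eqs. (2)-(5), including the power and L2 normalization, is shown below; the function name fisher_vector and the parameter layout are our own, and the GMM is assumed to have been trained offline.

```python
import numpy as np

def fisher_vector(X, pi_k, mu, sigma2, alpha=0.5):
    """Encode local descriptors X (T x D, already PCA-reduced) into a
    2*K*D-dim Fisher Vector with a diagonal-covariance GMM, following
    Eqs. (2)-(5), then apply power and L2 normalization.
    pi_k: (K,) mixture weights, mu: (K, D) means, sigma2: (K, D) variances."""
    T, D = X.shape
    sigma = np.sqrt(sigma2)

    # Posterior probabilities gamma_t(i), Eq. (3), computed in the log domain.
    diff = (X[:, None, :] - mu[None, :, :]) / sigma[None, :, :]        # T x K x D
    log_p = (np.log(pi_k)[None, :]
             - 0.5 * np.sum(np.log(2.0 * np.pi * sigma2), axis=1)[None, :]
             - 0.5 * np.sum(diff ** 2, axis=2))                        # T x K
    log_p -= log_p.max(axis=1, keepdims=True)
    gamma = np.exp(log_p)
    gamma /= gamma.sum(axis=1, keepdims=True)

    # Gradients with respect to the means and variances, Eqs. (4) and (5).
    G_mu = np.einsum('tk,tkd->kd', gamma, diff) / (T * np.sqrt(pi_k)[:, None])
    G_sigma = np.einsum('tk,tkd->kd', gamma, diff ** 2 - 1.0) \
        / (T * np.sqrt(2.0 * pi_k)[:, None])

    fv = np.concatenate([G_mu.ravel(), G_sigma.ravel()])               # 2*K*D dims
    fv = np.sign(fv) * np.abs(fv) ** alpha                             # power normalization
    return fv / max(np.linalg.norm(fv), 1e-12)                         # L2 normalization
```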
4.2.3 Classification
As a classifier, we use a linear kernel SVM, and we adopt the one-vs-rest strategy for multi-
class classification. In the experiment, since we prepared 100 food categories, we trained
100 linear SVM classifiers.
The linear kernel is defined as the inner product of two vectors. Since the weighted sum of the support vectors can be computed in advance, the linear SVM can be written as follows:
$$f(x) = \sum_{i=1}^{M} y_i \alpha_i K(x, x_i) + b = \sum_{i=1}^{M} y_i \alpha_i \langle x, x_i \rangle + b = \left\langle \sum_{i=1}^{M} y_i \alpha_i x_i,\ x \right\rangle + b = \langle w, x \rangle + b \qquad (6)$$
where x is an input vector, f(x) is the output SVM score, x_i is a support vector, y_i ∈ {+1, −1} is a class label, α_i is the weight of the corresponding support vector, b is a bias term, and M is the number of support vectors. By this transformation, we save the memory needed to store support vectors as well as the kernel computations. Therefore, when N is the dimension of the feature vector, calculation of an SVM score requires O(N) operations and O(N) memory space. We train the SVMs offline with LIBLINEAR [Fan et al.(2008)].
For both the first and the second image features, we combine the output values of the two linear SVMs, one for the gradient-based feature and one for the color-based feature, in a late fusion manner. The weight used to combine the two SVM outputs is estimated by cross-validation in the experiments.
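As an illustration of Eq. (6) and the late fusion step, the following Python sketch scores all 100 categories with pre-computed weight vectors and fuses the two classifiers; the stacked weight-matrix layout, the default fusion weight and the function names are our assumptions.

```python
import numpy as np

def svm_scores(x, W, b):
    """Scores of all food categories for one feature vector x.
    W holds the pre-computed weight vectors w = sum_i y_i alpha_i x_i (Eq. 6)
    stacked row-wise (num_classes x N), b the biases, so each score is
    <w, x> + b and costs O(N)."""
    return W @ x + b

def top5_late_fusion(x_grad, x_color, W_grad, b_grad, W_color, b_color,
                     fusion_weight=0.5):
    """Late fusion of the gradient-based and color-based classifiers: a
    weighted sum of their SVM outputs (weight chosen by cross-validation),
    returning the indices of the top five food categories."""
    fused = fusion_weight * svm_scores(x_grad, W_grad, b_grad) \
        + (1.0 - fusion_weight) * svm_scores(x_color, W_color, b_color)
    return np.argsort(-fused)[:5]
```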
4.3 Estimation of the more reliable direction
When no category with a high SVM score is obtained, the camera position and viewing direction should be changed slightly to obtain more reliable SVM evaluation scores. To encourage the user to move the smartphone camera, the proposed system has a function that estimates the direction in which a more reliable SVM score can be obtained and shows it as an arrow on the screen, as shown in Figure 1.
To estimate the direction of the food region with a more reliable SVM score, we adopt an effective window search method for object detection, Efficient Sub-window Search (ESS) [Lampert et al.(2008)], which can be directly applied to the combination of a linear SVM and bag-of-features. Note that estimation of the more reliable direction can be carried out only for the first type of image feature, because ESS assumes that the image feature is represented as bag-of-features.
The weight vector w of the SVM classifier in Equation 6 can be decomposed into a vector w⁺ containing its positive elements and a vector w⁻ containing its negative elements:
$$w = w^+ + w^- \qquad (7)$$
Therefore, an SVM output score for one rectangular region can be calculated in O(1) operations by making use of integral images of the w⁺ and w⁻ contributions, following the ESS method [Lampert et al.(2008)]. In the case of the above-mentioned soft assignment to codewords, each point's contribution to the score is the product of w and the values assigned to the codewords. We search for the window which achieves the maximum SVM score by an efficient sub-window search, constrained so that more than 50% of its area overlaps with the original window. Finally, the direction from the current window to the window with the maximum score is shown as an arrow on the screen (see Figure 1).
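The per-window scoring can be sketched as follows in Python; this is a simplified version that only shows the O(1) window score via integral images of the positive and negative per-location contributions (the branch-and-bound part of ESS and the 50%-overlap constraint are omitted), and the array layout is an assumption.

```python
import numpy as np

def contribution_integral_images(assign_maps, w):
    """assign_maps: (H x W x V) soft-assignment weights of each location's
    local feature to the V codewords; w: linear-SVM weight vector (V,).
    Returns integral images of the positive and negative per-location
    contributions, as used by the ESS bound."""
    pos = assign_maps @ np.maximum(w, 0.0)
    neg = assign_maps @ np.minimum(w, 0.0)
    return [c.cumsum(axis=0).cumsum(axis=1) for c in (pos, neg)]

def window_sum(integral, x0, y0, x1, y1):
    """Sum of contributions inside the window [x0, x1) x [y0, y1) in O(1)
    with the standard four-corner integral-image lookup."""
    s = integral[y1 - 1, x1 - 1]
    if x0 > 0:
        s -= integral[y1 - 1, x0 - 1]
    if y0 > 0:
        s -= integral[y0 - 1, x1 - 1]
    if x0 > 0 and y0 > 0:
        s += integral[y0 - 1, x0 - 1]
    return s

# The SVM score of a window is window_sum over the positive integral image plus
# window_sum over the negative one (plus the bias); the arrow points from the
# current window toward the best-scoring overlapping window.
```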
5 Implementation on a Smartphone
We implemented two versions of the system as Android applications for an Android smartphone with a quad-core CPU. The first version uses bag-of-features and the color histogram, while the second version uses HOG and color patches with the Fisher Vector. We implemented both systems as multi-threaded systems to use the multiple CPU cores effectively. In both systems, we first apply GrabCut to adjust the size of the given bounding box after it is drawn. Because the computational cost of GrabCut is relatively high, bounding box adjustment is carried out only once per bounding box. After the adjustment has finished, feature extraction begins; the way features are extracted differs between the two systems.
For the first version, the high-cost SURF descriptor extraction, assignment to codewords, evaluation of 50 fast χ² linear SVMs and direction estimation are carried out over four cores in parallel, while the low-cost color histogram extraction is performed on a single core. Figure 4 shows the flow of the processing steps and the usage of the four CPU cores in the first-version system.
For the second system, we parallelized the extraction of the HOG patch and color patch features; both extractions are carried out over two cores each. Descriptor extraction, dimension reduction by PCA, Fisher Vector encoding, power and L2 normalization, and SVM classification are carried out over two cores in parallel for each feature, that is, over four cores in total. Figure 5 shows the flow of the processing steps and the usage of the four CPU cores in the second-version system. Note that estimation of the direction of foods is not implemented in the second-version system, since the ESS method on which the direction estimation is based assumes that the feature representation is bag-of-features.
For the second system, we also sped up the computation by pre-computation. The gradient of the Fisher Vector with respect to the mean, Eq. (4), is rewritten as follows to decrease the number of operations; since the number of local descriptors T is much larger than the number of GMM components K and the local descriptor dimension D, and Eq. (4) is calculated for each component, this rewriting is effective.
$$G^X_{\mu,i} = \frac{1}{\sqrt{\pi_i}\,\sigma_i} \cdot \frac{1}{T} \sum_{t=1}^{T} \gamma_t(i)\,(x_t - \mu_i) \qquad (8)$$
Moreover, we pre-compute offline the terms needed to calculate the posterior probability of the GMM and the gradients with respect to the mean and sigma, and create lookup tables for acceleration: log π_i − 0.5 log|Σ_i| used for the posterior probability, the factors 1/√(2π_i) and 1/σ_i² of Eq. (5), and 1/(√π_i σ_i) of Eq. (8).
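A minimal sketch of these pre-computed lookup tables in NumPy follows; the dictionary layout and function name are our own, and only the constant per-component terms named above are tabulated.

```python
import numpy as np

def precompute_gmm_tables(pi_k, sigma2):
    """Per-component constants loaded at start-up and reused for every frame:
    the log-weight term of the Gaussian posterior and the constant factors of
    Eqs. (5) and (8).  pi_k: (K,) weights, sigma2: (K, D) diagonal variances."""
    return {
        # log(pi_i) - 0.5 * log|Sigma_i| for the diagonal-covariance posterior
        "log_posterior_const": np.log(pi_k) - 0.5 * np.sum(np.log(sigma2), axis=1),
        "inv_sqrt_2pi": 1.0 / np.sqrt(2.0 * pi_k),                    # factor of Eq. (5)
        "inv_sigma2": 1.0 / sigma2,                                   # factor of Eq. (5)
        "inv_sqrt_pi_sigma": 1.0 / (np.sqrt(pi_k)[:, None] * np.sqrt(sigma2)),  # Eq. (8)
    }
```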
We train the SVMs offline, and all the parameter values used in the recognition steps are loaded into main memory (the eigenvalues and eigenvectors for PCA, the created lookup tables, the GMM means, and the SVM weight vectors). The Fisher Vector can produce better recognition results with a much smaller dictionary than BoF, and we also keep the dimension of the local descriptors small by PCA. As a result, the memory space required for the Fisher Vector is smaller than the space needed for the codebook of conventional BoF, so the Fisher Vector is also superior to BoF in terms of memory.
Regarding memory space, the first-version system adopts the 1500-dim SURF-BoF and the 1728-dim color histogram, while the second-version system adopts the 1536-dim HOG Patch-FV and the 1536-dim Color Patch-FV. The feature vectors of the second system are more compact but denser. Even including the many values loaded into memory to encode the Fisher Vector faster, the total requires less than one fifth of the memory space of the BoF codebook in the case of the HOG patch and less than one fourth in the case of the color patch.
Table 1 Classification rate (%) within the top 1 and top 5 candidates by the proposed methods, the server-side method proposed by [Matsuda et al.(2012)], and the extended version of the proposed method.

method  top 1  top 5
SURF-BoF + Color Histogram  42.0  68.2
HOG Patch-FV + Color Patch-FV  49.7  77.6
HOG Patch-FV + Color Patch-FV (flip)  51.9  79.2
MKL [Matsuda et al.(2012)]  51.6  76.8
Extended HOG Patch-FV + Color Patch-FV (flip)  59.6  82.9
The Java heap (mainly everything except image processing) and the native heap (mainly image processing) required by the implemented application were about 16 MB and 3 MB, respectively. We thus realize the mobile system with a small memory footprint, which leaves room to increase the number of recognition targets and the feature dimension.
6 Experiments
In this section, we describe experimental results regarding recognition accuracy and processing time. In addition, we explain the results of an evaluation by a user study.
For the experiments, we prepared a food image dataset of one hundred categories, with more than 100 images per category, in which every food item is marked with a bounding box. The total number of food images in the dataset is 12,905. Figure 6 shows all the category names and sample photos.
We set aside 20 images per category as the validation and test data, used the rest as training data, and evaluated the classification rate over 5 trials, randomly changing the images in a five-fold cross-validation manner. We used a Samsung Galaxy Note II (1.6 GHz, 4 cores, 4 threads, Android 4.1) for measuring the processing time of image recognition.
6.1 Evaluation on classification accuracy
In this experiment, we compare the two types of proposed systems with the server-side recognition system by Matsuda et al. [Matsuda et al.(2012)].
Figure 7 shows the classification rate of each recognition method, and Table 1 shows the classification rates within the top 1 and top 5 candidates. SURF-BoF + Color Histogram achieved 42.0% and 68.2% within the top 1 and top 5 candidates, while HOG Patch-FV + Color Patch-FV achieved 49.7% and 77.6%, respectively. When flipped images were added as training data, the results improved slightly to 51.9% and 79.2% within the top 1 and top 5. Adding flipped training images increases the variability of the training data, which is expected to be effective for the HOG patch, since HOG is not invariant to rotation.
Our second approach is better than [Matsuda et al.(2012)], which is a high-cost server-side recognition system. This shows that food recognition on a smartphone is equivalent to the conventional server system in terms of recognition accuracy.
The bottom row of Table 1 shows the classification rate of an extended version of the second approach. In this extended version, the number of GMM components is doubled (64 Gaussians), the dimension of the HOG patches is 32 (the original was 24), and a spatial pyramid [Lazebnik et al.(2006)] is applied. Although the dimension of the image features, the processing time and the memory requirements increase greatly, the classification rate is improved considerably.
Fig. 4 Processing flow and assignment to the four CPU cores for the system using bag-of-features and the color histogram (SURF extraction, bag-of-features encoding, SVM evaluation and reliable-direction estimation run in parallel with color histogram extraction, producing a score for the food region).
Fig. 5 Processing flow and assignment to the four CPU cores for the system using HOG patches and color patches with the Fisher Vector (HOG patch and color patch extraction, Fisher Vector encoding and SVM evaluation run on two cores each, producing a score for the food region).
Fig. 6 Sample photos of the 100 food categories targeted in this paper, ranging from rice dishes, breads and noodles to soups, grilled, fried and simmered dishes, salads and side dishes.
This extended version can be run not on a smartphone but on a server, since the dimensions of the Color-FV and HOG-FV are 15,360 and 20,480, respectively; we show these results for reference. Figure 8 shows the results of the original second method and the extended method.
As the next evaluation, we compare results obtained with single features. Firstly, we compare the gradient-based local features: not only SURF-BoF and the HOG patch Fisher Vector but also the HOG patch BoF. Secondly, we compare the color features: the color histogram, the color patch Fisher Vector and the color patch BoF.
Figure 9 shows the comparison of SURF-BoF, HOG Patch-BoF and HOG Patch-FV. First, we set the step of the dense grid sampling to 8 pixels. The difference in top-1 and top-5 classification rate between SURF-BoF and HOG Patch-BoF is only 0.62% and 2.1%, respectively, with SURF slightly better. However, HOG Patch-FV achieved higher performance than SURF-BoF, by 2.7% and 3.22%, and HOG patch extraction is much faster than SURF extraction. In this paper, we therefore set the step of the dense grid sampling to 6 pixels, which improved the classification rate further. When horizontally flipped images were added to the training data, we achieved 36.3% and 63.2% classification rates within the top 1 and top 5 candidates, which is 7.52% and 6.9% higher than SURF-BoF.
Next, we evaluated the color patch feature. Figure 10 compares the color histogram, Color Patch-BoF and Color Patch-FV. The color histogram divides a given image into 3×3 blocks and extracts a 64-bin RGB color histogram from each block, for 576 dimensions in total; we then apply the χ² kernel feature map, finally building a 1728-dim color histogram.
Fig. 7 Classification rate versus the number of candidates for the two proposed methods (SURF-BoF + Color Histogram, and HOG Patch-FV + Color Patch-FV with and without flipped training images) and the server-side MKL method [Matsuda et al.(2012)].
Fig. 8 Classification rate versus the number of candidates for the mobile version and the extended server version of the second proposed method (HOG and color patches with the Fisher Vector).
The difference in top-1 and top-5 classification rate between Color Patch-BoF and the color histogram is only 0.76% and 2.28%, with Color Patch-BoF slightly better. However, with Color Patch-FV the classification rate improves considerably: the top-1 and top-5 classification rates are 13.0% and 18.4% higher than those of the color histogram. When horizontally flipped images were added to the training data, we achieved 43.0% and 70.6% classification rates. Even using only the color patch feature, we achieved a much better classification rate than the first-version system, which means that the Fisher Vector representation of the color patch feature is very effective for food recognition.
These recognition accuracy experiments successfully show the effectiveness of our proposed method. Moreover, we achieved better results than the server-side result, which was obtained with a very high-cost recognition method [Matsuda et al.(2012)].
Fig. 9 Classification rate versus the number of candidates for SURF-BoF, HOG Patch-BoF and HOG Patch-FV with a 6- or 8-pixel grid step (HOG Patch-FV also with flipped training images).
Fig. 10 Classification rate versus the number of candidates for the Color Histogram, Color Patch-BoF and Color Patch-FV (also with flipped training images).
Therefore, we have shown that rapid and highly accurate image recognition is possible on a smartphone.
6.2 Evaluation on bounding box adjustment
Next, we performed an experiment to examine the effectiveness of the bounding box adjustment. We magnified the ground-truth bounding boxes of the test images by 25% in size. In fact, we used only 1,912 food images as test images in the 5-fold cross-validation classification experiment, since for some food photos the image size is almost the same as the size of the attached bounding box, so a 25% larger box would not include additional background.
Fig. 11 Classification rates versus the number of candidates for the ground-truth bounding boxes (0%), 25%-magnified bounding boxes, and adjusted bounding boxes after 25% magnification (shown as "25%, GrabCut").
We compared the ground-truth bounding boxes, the 25%-magnified bounding boxes, and the adjusted bounding boxes after 25% magnification in terms of classification rate under the same conditions as the previous experiment. Figure 11 shows the results, which indicate that 25% magnification of the ground-truth bounding boxes degraded the classification rate within the top five by 6.3%, while 25% magnification with bounding box adjustment degraded the rate by only 4.5%. From these results, the GrabCut-based bounding box adjustment can be regarded as effective.
6.3 Evaluation of food direction estimation
Finally, we performed an experiment on the estimation of the direction of a food window. We evaluated the error of the direction estimation when shifting the ground-truth bounding boxes by 10, 15, 20, and 25% in each of the eight directions around the original boxes. Figure 12 shows the cumulative classification rates of the estimated direction for the different shifts. The rates within ±20° and ±40° error were 31.81% and 50.34% for a 15% shift, and 34.54% and 54.16% for a 25% shift. These results show that when the difference between the ground-truth and the given bounding box is small, estimating the direction of the ground-truth bounding box is more difficult. This is because the difference in SVM scores between the two windows is small when the difference in their locations is small.
6.4 Evaluation of processing time
We measured processing times on the latest smartphone, the Samsung Galaxy Note II (1.6 GHz quad-core CPU, Android 4.1). We measured the recognition time by repeating recognition 20 times and averaging the results. The results are shown in Table 2. The processing times for recognition by the first method and by the second method are 0.26 seconds and 0.065 seconds, respectively.
Fig. 12 Cumulative rate of the estimated direction versus the allowable angular error (±20° to ±180°) for bounding boxes shifted by 10, 15, 20, and 25%.
Table 2 Average processing time.
average time [sec]
Bounding Box Adjustment 0.70
[Recognition 1] SURF-BoF+Color Histogram 0.26
[Recognition 2] HOG Patch-FV+Color Patch-FV 0.065
Estimation of Food Direction 0.091
Direction estimation takes 0.09 seconds, while bounding box adjustment is relatively costly, taking 0.70 seconds. This is why the bounding box adjustment is carried out only once, right after a bounding box is drawn.
Of the 0.26 seconds for recognition by the first method and the 0.065 seconds by the second method, linear SVM classification takes only 0.003 seconds; most of the time is spent on image feature extraction. These results show that extraction of the HOG-FV and Color-FV features is much faster than extraction of SURF-BoF and the color histogram. In fact, extracting SURF features and voting each feature to one of the 500 codewords to create the BoF vector is the most time-consuming processing.
6.5 User Study
We asked five student subjects to evaluate the quality of the proposed system on a five-point scale regarding food recognition quality, ease of use, quality of the direction estimation, and comparison of the proposed system with a baseline which has no food recognition and requires selecting food names from hierarchical menus by touch. Evaluation scores of 5, 3 and 1 mean good, so-so, and bad, respectively. At the same time, we measured the time needed to select food items with food recognition, and compared it with the time needed to select food items from the hierarchical menu by hand.
Figure 13 shows the time spent selecting each food item. The median times were 5.1 seconds with food recognition and 5.7 seconds by hand.
Fig. 13 Histogram of the time spent selecting each food item: the proposed system (food recognition) vs. the hierarchical menu by hand, with the medians marked.
Table 3 User study results, given as the average of the five-point evaluation scores.

Outcome measure  average score
Recognition quality  3.4
Ease of use  4.2
Quality of direction suggestion  2.4
Proposed system quality (compared with hand selection)  3.8
This means the proposed system can help a user select food names faster than selecting from a hierarchical menu by hand. However, for some food items that could not be recognized, it took a long time to find the food names using food recognition.
Table 3 shows the system evaluation on the five-point scale. Except for the direction suggestion, more than three points were obtained for every measure. In particular, the usability of the system was rated well, since recognition is carried out in real time. On the other hand, the estimation of the expected food region was not evaluated as effective, since its accuracy is not yet good enough for practical use; we will improve it in future work.
7 Conclusions
We proposed a mobile food image recognition system and two types of food recognition methods. One is the combination of the standard bag-of-features and color histograms with χ² kernel feature maps, and the other is a HOG patch descriptor and a color patch descriptor with the state-of-the-art Fisher Vector representation. In both cases, we used a linear SVM as the classifier, which is fast and memory-efficient.
In the experiments, we achieved a 79.2% classification rate for the top 5 category candidates on a 100-category food dataset with ground-truth bounding boxes when we used HOG and color patches with Fisher Vector coding as image features. This is 11.0 points higher than the result with the color histogram and SURF-BoF with χ² kernel feature maps, and it is also superior to a very high-cost server-side recognition method. Regarding processing time, recognition takes only 0.065 seconds for 100 target categories, which is about four times faster than the processing time with the color histogram and SURF-BoF.
As future work, we plan to extend the system regarding the following issues:
– Touch just a point instead of drawing bounding boxes to specify food regions.
– Use multiple images to improve the accuracy of food item recognition.
– Improve the accuracy of the estimation of the expected food regions, or move the bounding boxes automatically instead of only showing the direction.
– Take into account additional information such as the user's food history, GPS location data and time information.
– Increase the number of food categories to make the system more practical.
Note that the Android application of the proposed mobile food recognition system can be downloaded from http://foodcam.mobi/.
References
Bay et al.(2008). Bay H, Ess A, Tuytelaars T, Van Gool L (2008) Speeded-up robust features (SURF). Com-
puter Vision and Image Understanding 110(3):346–359
Chae et al.(2011). Chae J, Woo I, Kim S, Maciejewski R, Zhu F, Delp E, Boushey C, Ebert D (2011) Volume
estimation using food specific shape templates in mobile image-based dietary assessment. In: Proc. of
the IS&T/SPIE Conference on Computational Imaging IX, vol 7873, p 78730K
Chatfield et al.(2011). Chatfield K, Lempitsky V, Vedaldi A, Zisserman A (2011) The devil is in the details:
an evaluation of recent feature encoding methods. In: Proc. of British Machine Vision Conference
Csurka et al.(2004). Csurka G, Bray C, Dance C, Fan L (2004) Visual categorization with bags of keypoints.
In: Proc.of ECCV Workshop on Statistical Learning in Computer Vision (SLCV), pp 59–74
Dalal and Triggs(2005). Dalal N, Triggs B (2005) Histograms of oriented gradients for human detection. In:
Proc. of IEEE Computer Vision and Pattern Recognition
Deng and Manjunath(2001). Deng Y, Manjunath BS (2001) Unsupervised segmentation of color-texture re-
gions in images and video. IEEE Transactions on Pattern Analysis and Machine Intelligence 23(8):800–
810
Fan et al.(2008). Fan RE, Chang KW, Hsieh CJ, Wang XR, Lin CJ (2008) LIBLINEAR: A library for large
linear classification. The Journal of Machine Learning Research 9:1871–1874
Felzenszwalb et al.(2010). Felzenszwalb P, Girshick R, McAllester D, Ramanan D (2010) Object detection
with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine
Intelligence 32(9):1627–1645
He et al.(2013). He Y, Xu C, Khanna N, Boushey C, Delp E (2013) Food image analysis: Segmentation,
identification and weight estimation. In: Proc. of IEEE International Conference on Multimedia and
Expo
Jia et al.(2012). Jia D, Alex B, Sanjeev S, Hao S, Aditya K, Fei-Fei L (2012) Imagenet large scale vi-
sual recognition challenge 2012 (ILSVRC2012). http://www.image-net.org/challenges/
LSVRC/2012/index
Kitamura et al.(2008). Kitamura K, Yamasaki T, Aizawa K (2008) Food log by analyzing food images. In:
Proc. of ACM International Conference Multimedia, pp 999–1000
Kitamura et al.(2009). Kitamura K, Yamasaki T, Aizawa K (2009) Foodlog: Capture, analysis and retrieval
of personal food images via web. In: Proc. of ACM Multimedia Workshop on Multimedia for Cooking
and Eating Activities, pp 23–30
Kumar et al.(2012). Kumar N, Belhumeur P, Biswas A, Jacobs D, Kress W, Lopez I, Soares J (2012) Leafs-
nap: A computer vision system for automatic plant species identification. In: Proc. of European Confer-
ence on Computer Vision
Lampert et al.(2008). Lampert CH, Blaschko MB, Hofmann T (2008) Beyond sliding windows: Object lo-
calization by efficient subwindow search. In: Proc. of IEEE Computer Vision and Pattern Recognition
Lazebnik et al.(2006). Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid
matching for recognizing natural scene categories. In: Proc. of IEEE Computer Vision and Pattern Recog-
nition, pp 2169–2178
Lowe(2004). Lowe DG (2004) Distinctive image features from scale-invariant keypoints. International Jour-
nal of Computer Vision 60(2):91–110
Mariappan et al.(2009). Mariappan A, Bosch M, Zhu F, Boushey C, Kerr D, Ebert D, Delp E (2009) Personal
dietary assessment using mobile devices. In: Proc. of the IS&T/SPIE Conference on Computational
Imaging VII, vol 7246, pp 72,460Z–1–72,460Z–12
Maruyama et al.(2012). Maruyama T, Kawano Y, Yanai K (2012) Real-time mobile recipe recommendation
system using food ingredient recognition. In: Proc. of ACM MM Workshop on Interactive Multimedia
on Mobile and Portable Devices(IMMPD)
Matsuda et al.(2012). Matsuda Y, Hoashi H, Yanai K (2012) Recognition of multiple-food images by detect-
ing candidate regions. In: Proc. of IEEE International Conference on Multimedia and Expo
Perronnin and Dance(2007). Perronnin F, Dance C (2007) Fisher kernels on visual vocabularies for image
categorization. In: Proc. of IEEE Computer Vision and Pattern Recognition
Perronnin et al.(2010). Perronnin F, Sánchez J, Mensink T (2010) Improving the Fisher kernel for large-scale image classification. In: Proc. of European Conference on Computer Vision
Philbin et al.(2008). Philbin J, Chum O, Isard M, Sivic J, Zisserman A (2008) Lost in quantization: Improv-
ing particular object retrieval in large scale image databases. In: Proc. of IEEE Computer Vision and
Pattern Recognition
Rother et al.(2004). Rother C, Kolmogorov V, Blake A (2004) Grabcut: Interactive foreground extraction
using iterated graph cuts. In: ACM SIGGRAPH, pp 309–314
Vedaldi and Zisserman(2012). Vedaldi A, Zisserman A (2012) Efficient additive kernels via explicit feature
maps. IEEE Transactions on Pattern Analysis and Machine Intelligence
Wang et al.(2010). Wang J, Yang J, Yu K, Lv F, Huang T, Gong Y (2010) Locality-constrained linear coding
for image classification. In: Proc. of IEEE Computer Vision and Pattern Recognition, pp 3360–3367
Yang et al.(2010). Yang S, Chen M, Pomerleau D, Sukthankar R (2010) Food recognition using statistics of
pairwise local features. In: Proc. of IEEE Computer Vision and Pattern Recognition
Yu et al.(2011). Yu F, Ji R, Chang S (2011) Active query sensing for mobile location search. In: Proc. of
ACM International Conference Multimedia
... At present, scant research has been undertaken on lightweight food-image recognition. Lightweight Convolutional Neural Network (CNN) was initially employed for food-image identification [10,[14][15][16]. The primary challenge is that the traditional convolutional model is not capable of extracting long-range information from food images due to the dispersed arrangement of components. ...
... This not only ensures the person's health but also aids in illness prevention [6]. Food-image recognition is of utmost importance to these application scenarios [7][8][9] Since the ultimate objective of the food computing system is to aid individuals in the management of their diet and health, as well as to enhance their daily activities, it becomes imperative to establish an efficient system specifically designed for the identification of food images on end devices, such as mobile phones [10]. In addition, the wide range of foods and cooking techniques has led to a rapid expansion of images the wide range of foods and cooking techniques has led to a rapid expansion of images of food, which has raised the expectation of a long-term expansion of image recognition on the server side. ...
... At present, scant research has been undertaken on lightweight food-image recognition. Lightweight Convolutional Neural Network (CNN) was initially employed for foodimage identification [10,[14][15][16]. The primary challenge is that the traditional convolutional model is not capable of extracting long-range information from food images due to the dispersed arrangement of components. ...
Article
Food-image recognition plays a pivotal role in intelligent nutrition management, and lightweight recognition methods based on deep learning are crucial for enabling mobile deployment. This capability empowers individuals to effectively manage their daily diet and nutrition using devices such as smartphones. In this study, we propose an Efficient Hybrid Food Recognition Net (EHFR–Net), a novel neural network that integrates Convolutional Neural Networks (CNN) and Vision Transformer (ViT). We find that in the context of food-image recognition tasks, while ViT demonstrates superiority in extracting global information, its approach of disregarding the initial spatial information hampers its efficacy. Therefore, we designed a ViT method termed Location-Preserving Vision Transformer (LP–ViT), which retains positional information during the global information extraction process. To ensure the lightweight nature of the model, we employ an inverted residual block on the CNN side to extract local features. Global and local features are seamlessly integrated by directly summing and concatenating the outputs from the convolutional and ViT structures, resulting in the creation of a unified Hybrid Block (HBlock) in a coherent manner. Moreover, we optimize the hierarchical layout of EHFR–Net to accommodate the unique characteristics of HBlock, effectively reducing the model size. Our extensive experiments on three well-known food image-recognition datasets demonstrate the superiority of our approach. For instance, on the ETHZ Food–101 dataset, our method achieves an outstanding recognition accuracy of 90.7%, which is 3.5% higher than the state-of-the-art ViT-based lightweight network MobileViTv2 (87.2%), which has an equivalent number of parameters and calculations.
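The hybrid design sketched in this abstract — a convolutional branch for local features and a position-preserving attention branch for global context, fused by combining the two outputs — can be illustrated with a minimal PyTorch block. This is only a toy sketch under stated assumptions: the class name HybridBlock and all hyperparameters are illustrative, and it is not the authors' EHFR–Net or LP–ViT implementation (which also concatenates features and uses an optimized layer layout).

```python
import torch
import torch.nn as nn

class HybridBlock(nn.Module):
    """Toy CNN+attention hybrid: local inverted-residual branch plus a global
    self-attention branch whose output is reshaped back to the spatial grid."""
    def __init__(self, channels: int, expansion: int = 2, heads: int = 4):
        super().__init__()
        hidden = channels * expansion
        # Local branch: pointwise expand -> depthwise 3x3 -> pointwise project.
        self.local = nn.Sequential(
            nn.Conv2d(channels, hidden, 1, bias=False), nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden, bias=False),
            nn.BatchNorm2d(hidden), nn.SiLU(),
            nn.Conv2d(hidden, channels, 1, bias=False), nn.BatchNorm2d(channels),
        )
        # Global branch: self-attention over spatial positions; positions are
        # preserved by reshaping the token sequence back to (H, W).
        self.norm = nn.LayerNorm(channels)
        self.attn = nn.MultiheadAttention(channels, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (B, C, H, W)
        local = self.local(x) + x                          # residual local features
        b, c, h, w = x.shape
        seq = self.norm(x.flatten(2).transpose(1, 2))      # (B, H*W, C)
        g, _ = self.attn(seq, seq, seq)
        global_feat = g.transpose(1, 2).reshape(b, c, h, w)
        return local + global_feat                         # fuse by summation (simplified)

x = torch.randn(1, 32, 28, 28)
print(HybridBlock(32)(x).shape)   # torch.Size([1, 32, 28, 28])
```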
... Pouladzadeh et al. presented a nutrient recognition system which recognizes food types and predicts the calories of the food by leveraging food image processing [31]. FoodCam categorizes real-time food images by utilizing a linear SVM [32]. Yunus et al. proposed a two-stage system, using a CNN in the first stage to classify the given image, and then using the output as input for the second stage to estimate nutrition, ingredients, and attributes [33]. ...
Article
Fairness and bias stemming from the narrow coverage of AI training data have become another challenge for AI researchers. If a commercial AI is trained on a biased dataset, severe gender or racial fairness and bias issues can arise [43,44]. Since researchers train AI on datasets in a few primary languages, a broad audience cannot be served if a novel LLM (Large Language Model) shows knowledge or creativity limitations in their specific spoken language. Narrow coverage of LLMs can lead the audience to misinterpretation and confusion if the service involves STT (Speech-To-Text). In this paper, to overcome this issue of data diversity, we propose the idea that embedded, extracted features capture semantic proximity information that can be used to mitigate diversity issues. This project focused on a Korean-language food dataset for STT services, where a narrowly trained AI is prone to show its limitations on lifestyle-related elements. To present our proof of concept, we trained a baseline model, GPT2, on the Korean Wikipedia dataset in 2022. Then, we employed DistilBERT and KoBERT for comparison. The extracted hidden-state output features from each model were utilized to build feature-extraction-based text search engines. We used the same idea as Locality-Sensitive Hashing (LSH) but effectively located similar hashes by applying transposed weights [38]. We also present conventional classification benchmarks for performance comparison using top-k measurements, training times, and memory and disk consumption. In the discussion, we propose that our idea can mitigate the diversity problem without re-training the model and tokenizer.
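Retrieving semantically close items by hashing extracted feature vectors, as described above, can be sketched with a plain random-hyperplane LSH index. This is a generic illustration, not the authors' transposed-weight variant; the random feature matrix stands in for hidden-state outputs from a model such as DistilBERT or KoBERT.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in features: in the paper's setting these would be hidden-state
# outputs of a language model, one vector per food-related text entry.
num_items, dim = 1000, 768
features = rng.normal(size=(num_items, dim)).astype(np.float32)

# Random-hyperplane LSH: each feature maps to a bit string whose bits are
# the signs of projections onto random hyperplanes.
num_bits = 16
hyperplanes = rng.normal(size=(dim, num_bits))

def hash_codes(x: np.ndarray) -> np.ndarray:
    """Return an integer bucket id for each row of x."""
    bits = (x @ hyperplanes) > 0
    return (bits @ (1 << np.arange(num_bits))).astype(np.int64)

buckets: dict[int, list[int]] = {}
for idx, code in enumerate(hash_codes(features)):
    buckets.setdefault(int(code), []).append(idx)

def query(q: np.ndarray, top_k: int = 5) -> list[int]:
    """Look up the query's bucket and rank its members by cosine similarity."""
    candidates = buckets.get(int(hash_codes(q[None])[0]), [])
    if not candidates:
        return []
    sims = features[candidates] @ q / (
        np.linalg.norm(features[candidates], axis=1) * np.linalg.norm(q) + 1e-9)
    return [candidates[i] for i in np.argsort(-sims)[:top_k]]

print(query(features[42]))   # the queried item itself should appear in the results
```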
... In existing research works, the focus is mainly on improving the accuracy of food image recognition [3,16,28,38,45]. These works provide only limited food details to users and offer no recommendations. ...
Article
Food consumption has a direct effect on the health of an individual. Eating food without awareness of its ingredients may result in eating-style-based diseases such as hypertension, diabetes, and several others. As per a recent WHO survey, the number of persons with hypertension is very large. There is a need for a novel technique that can provide food recommendations to hypertensive persons for the multiple food items in their meals. In this research work, Indian multi-food items in a meal are recognized using a fine-tuned deep convolutional neural network model. In existing research works, only a single food item per image is recognized, which is not representative of real-life food consumption. In our proposed approach, a contour-based image segmentation technique is used for multi-food meals (a minimal sketch of this kind of segmentation follows below). In existing research works, no dataset is available on Indian food items for hypertensive persons. The key contribution of this research work is the preparation of an Indian food dataset of 30 classes for hypertensive patients: 15 recommended food classes for hypertensive persons and 15 not-recommended food classes to maintain class balance, as calibrated by a professional dietitian (Dr. Shuchi Upadhyay, Dietitian and Nutrition Expert, UPES, Dehradun). The novel contribution is to present the 'IndianFood30' dataset for hypertensive patients for research purposes. Further, a novel IndianFoodNet model is presented, trained on these 30 Indian food classes. Several pre-trained models are available for research purposes, but there is no pre-trained model on Indian food for hypertensive persons. Food ingredients exhibit high intra-class variance, and these complex features are extracted using our proposed approach. The accuracy of the proposed approach is compared with state-of-the-art models such as VGGNet, Inception V3, GoogleNet, and ResNet. Our proposed approach is also compared with some recent techniques on existing datasets such as UEC Food-100, UEC Food-256, and Food-101 to show the performance and effectiveness of the proposed model. Experimental analysis validates that our proposed approach significantly outperforms existing approaches.
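Splitting a meal photo into per-item regions with contours, each of which could then be passed to a classifier, might look like the following OpenCV sketch. The thresholding choice, area cutoff, and the file name meal.jpg are illustrative assumptions; the actual IndianFoodNet pipeline is not specified here.

```python
import cv2

# Load a meal photo (path is illustrative).
image = cv2.imread("meal.jpg")
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

# Separate food from a roughly uniform background with Otsu thresholding.
_, mask = cv2.threshold(gray, 0, 255, cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)

# Each sufficiently large contour is treated as one candidate food item.
contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
crops = []
for contour in contours:
    if cv2.contourArea(contour) < 2000:        # drop small specks
        continue
    x, y, w, h = cv2.boundingRect(contour)
    crops.append(image[y:y + h, x:x + w])      # region to send to the food classifier

print(f"found {len(crops)} candidate food regions")
```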
... It adds many additional actions for the user, unlike automatic segmentation approaches, where the user only needs to capture the food image. Automatic food image segmentation methods with handcrafted feature (e.g., colour, texture, and shape) extraction rely on traditional image processing techniques [11,27,28], such as region growing and merging [29], Normalized Cuts [30], Simple Linear Iterative Clustering (SLIC), the Deformable Part Model (DPM), the JSEG segmentation algorithm, K-means [31], and GrabCut [32]. One of the most popular works using these techniques was presented by Matsuda and Yanai [13] and takes into consideration the problem of images with multiple foods. ...
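Of the segmentation techniques listed above, SLIC superpixels are easy to reproduce with scikit-image. The snippet below is a generic illustration of over-segmenting an image into regions that later stages could merge or classify; the sample image and parameter values are arbitrary, not values from any of the cited works.

```python
import numpy as np
from skimage import data, segmentation

image = data.coffee()   # sample RGB image shipped with scikit-image
segments = segmentation.slic(image, n_segments=150, compactness=10, start_label=1)

print("superpixels:", len(np.unique(segments)))   # label map, one integer per pixel
```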
Article
Recent decades have witnessed the development of vision-based dietary assessment (VBDA) systems. These systems generally consist of three main stages: food image analysis, portion estimation, and nutrient derivation. The effectiveness of the initial step is highly dependent on the use of accurate segmentation and image recognition models and the availability of high-quality training datasets. Food image segmentation still faces various challenges, and most existing research focuses mainly on Asian and Western food images. For this reason, this study is based on food images from sub-Saharan Africa, which pose their own problems, such as inter-class similarity and dishes with mixed-class food. This work focuses on the first stage of VBDAs, where we introduce two notable contributions. Firstly, we propose mid-DeepLabv3+, an enhanced food image segmentation model based on DeepLabv3+ with a ResNet50 backbone. Our approach involves adding a middle layer in the decoder path and SimAM after each extracted backbone feature layer. Secondly, we present CamerFood10, the first food image dataset specifically designed for sub-Saharan African food segmentation. It includes 10 classes of the most consumed food items in Cameroon. On our dataset, mid-DeepLabv3+ outperforms benchmark convolutional neural network models for semantic image segmentation, with an mIoU (mean Intersection over Union) of 65.20%, representing a +10.74% improvement over DeepLabv3+ with the same backbone.
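For reference, the mIoU figure quoted above is the per-class intersection over union averaged over classes. A minimal computation from a confusion matrix looks like this (generic metric code, not the authors' evaluation script; the 3-class matrix is a toy example).

```python
import numpy as np

def mean_iou(confusion: np.ndarray) -> float:
    """confusion[i, j] = number of pixels of true class i predicted as class j."""
    tp = np.diag(confusion).astype(float)
    fp = confusion.sum(axis=0) - tp
    fn = confusion.sum(axis=1) - tp
    iou = tp / np.maximum(tp + fp + fn, 1e-9)
    return float(iou.mean())

conf = np.array([[50, 2, 3],
                 [4, 40, 6],
                 [1, 5, 60]])
print(f"mIoU = {mean_iou(conf):.4f}")
```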
Article
The first step in any dietary monitoring system is the automatic detection of eating episodes. To detect eating episodes, either sensor data or images can be used, and either method can result in false-positive detection. This study aims to reduce the number of false positives in the detection of eating episodes by a wearable sensor, Automatic Ingestion Monitor v2 (AIM-2). Thirty participants wore the AIM-2 for two days each (pseudo-free-living and free-living). The eating episodes were detected by three methods: (1) recognition of solid foods and beverages in images captured by AIM-2; (2) recognition of chewing from the AIM-2 accelerometer sensor; and (3) hierarchical classification to combine confidence scores from image and accelerometer classifiers. The integration of image- and sensor-based methods achieved 94.59% sensitivity, 70.47% precision, and 80.77% F1-score in the free-living environment, which is significantly better than either of the original methods (8% higher sensitivity). The proposed method successfully reduces the number of false positives in the detection of eating episodes.
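As a sanity check on the reported numbers, the F1-score quoted above is consistent with the standard harmonic-mean definition applied to the stated precision and recall:

```latex
F_1 = \frac{2 \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{precision} + \mathrm{recall}}
    = \frac{2 \times 0.7047 \times 0.9459}{0.7047 + 0.9459} \approx 0.8077
```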
Conference Paper
We study the question of feature sets for robust visual object recognition, adopting linear SVM based human detection as a test case. After reviewing existing edge and gradient based descriptors, we show experimentally that grids of Histograms of Oriented Gradient (HOG) descriptors significantly outperform existing feature sets for human detection. We study the influence of each stage of the computation on performance, concluding that fine-scale gradients, fine orientation binning, relatively coarse spatial binning, and high-quality local contrast normalization in overlapping descriptor blocks are all important for good results. The new approach gives near-perfect separation on the original MIT pedestrian database, so we introduce a more challenging dataset containing over 1800 annotated human images with a large range of pose variations and backgrounds.
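A gradient-histogram descriptor of this kind can be extracted with off-the-shelf tooling. The snippet below uses scikit-image's hog with commonly cited parameters (9 orientation bins, 8×8-pixel cells, block normalization) on a sample image; it is only a generic illustration, not the detector pipeline described in the paper.

```python
from skimage import color, data
from skimage.feature import hog

# Grayscale test image shipped with scikit-image.
image = color.rgb2gray(data.astronaut())

# 9 orientation bins, 8x8-pixel cells, 2x2-cell blocks with L2-Hys normalization.
descriptor = hog(image,
                 orientations=9,
                 pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2),
                 block_norm="L2-Hys",
                 feature_vector=True)

print("HOG feature length:", descriptor.shape[0])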
Article
Most successful object recognition systems rely on binary classification, deciding only if an object is present or not, but not providing information on the actual object location. To perform localization, one can take a sliding window approach, but this strongly increases the computational cost, because the classifier function has to be evaluated over a large set of candidate subwindows. In this paper, we propose a simple yet powerful branch-and-bound scheme that allows efficient maximization of a large class of classifier functions over all possible subimages. It converges to a globally optimal solution typically in sublinear time. We show how our method is applicable to different object detection and retrieval scenarios. The achieved speedup allows the use of classifiers for localization that formerly were considered too slow for this task, such as SVMs with a spatial pyramid kernel or nearest neighbor classifiers based on the chi2-distance. We demonstrate state-of-the-art performance of the resulting systems on the UIUC Cars dataset, the PASCAL VOC 2006 dataset and in the PASCAL VOC 2007 competition.
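The problem this work addresses — maximizing a classifier score over all subwindows — can be stated concretely with the naive baseline that branch-and-bound avoids: score every rectangle via an integral image of per-pixel contributions and keep the best. The sketch below is that exhaustive baseline with a stride to keep it fast; it is not the branch-and-bound algorithm itself, and the random per-pixel scores are placeholders for real classifier evidence.

```python
import numpy as np

rng = np.random.default_rng(0)
# Per-pixel contributions to the window score (e.g., weighted visual-word votes).
scores = rng.normal(loc=-0.05, scale=1.0, size=(120, 160))
ii = np.pad(scores.cumsum(0).cumsum(1), ((1, 0), (1, 0)))   # zero-padded integral image

def box_score(t, l, b, r):
    """Sum of scores inside rows t..b-1 and cols l..r-1 via the integral image."""
    return ii[b, r] - ii[t, r] - ii[b, l] + ii[t, l]

best, best_box = -np.inf, None
stride = 8
h, w = scores.shape
for t in range(0, h, stride):
    for b in range(t + stride, h + 1, stride):
        for l in range(0, w, stride):
            for r in range(l + stride, w + 1, stride):
                s = box_score(t, l, b, r)
                if s > best:
                    best, best_box = s, (t, l, b, r)

print("best window (top, left, bottom, right):", best_box, "score:", round(best, 2))
```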
Article
The problem of efficient, interactive foreground/background segmentation in still images is of great practical importance in image editing. Classical image segmentation tools use either texture (colour) information, e.g. Magic Wand, or edge (contrast) information, e.g. Intelligent Scissors. Recently, an approach based on optimization by graph-cut has been developed which successfully combines both types of information. In this paper we extend the graph-cut approach in three respects. First, we have developed a more powerful, iterative version of the optimisation. Secondly, the power of the iterative algorithm is used to simplify substantially the user interaction needed for a given quality of result. Thirdly, a robust algorithm for "border matting" has been developed to estimate simultaneously the alpha-matte around an object boundary and the colours of foreground pixels. We show that for moderately difficult examples the proposed method outperforms competitive tools.
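In practice, the iterated graph-cut procedure described here is exposed directly by OpenCV. A minimal call, assuming the user has drawn a bounding box around the object (the rectangle coordinates and the file name are illustrative), looks like this:

```python
import cv2
import numpy as np

image = cv2.imread("dish.jpg")                     # path is illustrative
mask = np.zeros(image.shape[:2], np.uint8)
bgd_model = np.zeros((1, 65), np.float64)          # internal GMM state for background
fgd_model = np.zeros((1, 65), np.float64)          # internal GMM state for foreground

rect = (50, 50, 300, 200)                          # user-drawn box: x, y, width, height
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Pixels labelled definite or probable foreground form the extracted object region.
foreground = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 255, 0).astype(np.uint8)
segmented = cv2.bitwise_and(image, image, mask=foreground)
cv2.imwrite("dish_segmented.jpg", segmented)
```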
Article
This article presents a novel scale- and rotation-invariant detector and descriptor, coined SURF (Speeded-Up Robust Features). SURF approximates or even outperforms previously proposed schemes with respect to repeatability, distinctiveness, and robustness, yet can be computed and compared much faster. This is achieved by relying on integral images for image convolutions; by building on the strengths of the leading existing detectors and descriptors (specifically, using a Hessian matrix-based measure for the detector, and a distribution-based descriptor); and by simplifying these methods to the essential. This leads to a combination of novel detection, description, and matching steps. The paper encompasses a detailed description of the detector and descriptor and then explores the effects of the most important parameters. We conclude the article with SURF's application to two challenging, yet converse goals: camera calibration as a special case of image registration, and object recognition. Our experiments underline SURF's usefulness in a broad range of topics in computer vision.
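Keypoint detection and description with SURF is available in OpenCV's contrib modules. The sketch below assumes an opencv-contrib build with the (patent-encumbered) nonfree module enabled; the Hessian threshold and the input file name are arbitrary choices.

```python
import cv2

image = cv2.imread("dish.jpg", cv2.IMREAD_GRAYSCALE)   # path is illustrative

# Requires opencv-contrib-python built with the nonfree modules enabled.
surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
keypoints, descriptors = surf.detectAndCompute(image, None)

print(f"{len(keypoints)} keypoints, descriptor shape: {descriptors.shape}")
```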
Conference Paper
In this paper, we propose a mobile cooking recipe recommendation system employing object recognition for food ingredients such as vegetables and meats. The proposed system carries out object recognition on food ingredients in a real-time way on an Android-based smartphone, and recommends cooking recipes related to the recognized food ingredients. By only pointing a built-in camera on a mobile device at food ingredients, the user can obtain a recipe list instantly. As an object recognition method, we adopt bag-of-features with SURF and a color histogram extracted from multiple images as image features and a linear SVM with the one-vs-rest strategy as a classifier. We built a short-video database of 30 kinds of food ingredients for experiments. With this database, we achieved an 83.93% recognition rate within the top six candidates. In the experiment, we conducted a user study comparing mobile recipe recommendation systems with and without ingredient recognition.
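The classification stage described here — a bag-of-features histogram plus a color histogram fed to a one-vs-rest linear SVM — can be approximated with scikit-learn. The features below are random stand-ins for real BoF and color histograms, so this only illustrates the plumbing, not the reported accuracy; the helper fake_histogram and all sizes are invented for the example.

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
num_classes, samples_per_class = 30, 20
codebook_size, color_bins = 500, 64

def fake_histogram(size: int, cls: int) -> np.ndarray:
    """L1-normalized stand-in histogram; the class-dependent offset makes classes differ."""
    h = rng.random(size) + 0.1 * cls
    return h / h.sum()

# Each training vector is a BoF histogram concatenated with a color histogram.
X = np.array([np.concatenate([fake_histogram(codebook_size, c), fake_histogram(color_bins, c)])
              for c in range(num_classes) for _ in range(samples_per_class)])
y = np.repeat(np.arange(num_classes), samples_per_class)

# LinearSVC trains one binary SVM per class (one-vs-rest) by default.
clf = LinearSVC(C=1.0).fit(X, y)

# Ranking classes by decision score gives top-k candidates for the UI to display.
query = X[0]
top5 = np.argsort(-clf.decision_function(query[None])[0])[:5]
print("top-5 candidate classes:", top5)
```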
Conference Paper
We are developing a dietary assessment system that records daily food intake through the use of food images taken at a meal. The food images are then analyzed to extract the nutrient content in the food. In this paper, we describe the image analysis tools to determine the regions where a particular food is located (image segmentation), identify the food type (feature classification) and estimate the weight of the food item (weight estimation). An image segmentation and classification system is proposed to improve the food segmentation and identification accuracy. We then estimate the weight of food to extract the nutrient content from a single image using a shape template for foods with regular shapes and area-based weight estimation for foods with irregular shapes.
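The area-based weight estimation mentioned for irregularly shaped foods is, at its simplest, a calibration from segmented pixel area to grams. The sketch below is one plausible reading only; the pixel-to-centimetre scale and the grams-per-square-centimetre factor are purely hypothetical placeholders, not values or formulas from the paper.

```python
def estimate_weight_g(pixel_area: int,
                      cm_per_pixel: float = 0.05,       # hypothetical camera calibration
                      grams_per_cm2: float = 0.8) -> float:  # hypothetical per-food density factor
    """Convert a segmented food region's pixel area into an approximate weight in grams."""
    area_cm2 = pixel_area * cm_per_pixel ** 2
    return area_cm2 * grams_per_cm2

print(round(estimate_weight_g(24000), 1))   # 24,000 px -> 48.0 g with these placeholder constants
```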
Conference Paper
In this paper, we propose a two-step method to recognize multiple-food images by detecting candidate regions with several methods and classifying them with various kinds of features. In the first step, we detect several candidate regions by fusing outputs of several region detectors including Felzenszwalb's deformable part model (DPM) [1], a circle detector and the JSEG region segmentation. In the second step, we apply a feature-fusion-based food recognition method to bounding boxes of the candidate regions with various kinds of visual features including bag-of-features of SIFT and CSIFT with spatial pyramid (SP-BoF), histogram of oriented gradient (HoG), and Gabor texture features. In the experiments, we estimated ten food candidates for multiple-food images in descending order of the confidence scores. As a result, we achieved a 55.8% classification rate for a multiple-food image data set, which improved the baseline result using only DPM by 14.3 points. This demonstrates that the proposed two-step method is effective for recognition of multiple-food images.
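One way to picture the first step — fusing boxes proposed by several detectors into a shorter candidate list — is simple overlap-based suppression: keep the highest-scored box among heavily overlapping proposals. The sketch below is a generic illustration, not the fusion rule used in the paper, and the proposal list is invented.

```python
def iou(a, b):
    """Intersection over union of boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def fuse_candidates(boxes_with_scores, iou_threshold=0.5):
    """Greedy suppression: keep a box only if it does not overlap a higher-scored kept box."""
    kept = []
    for box, score in sorted(boxes_with_scores, key=lambda t: -t[1]):
        if all(iou(box, k) < iou_threshold for k, _ in kept):
            kept.append((box, score))
    return kept

# Proposals from, hypothetically, a DPM detector, a circle detector, and region segmentation.
proposals = [((10, 10, 110, 110), 0.9), ((15, 12, 112, 108), 0.7), ((200, 50, 300, 150), 0.6)]
print(fuse_candidates(proposals))   # the two overlapping boxes collapse into one candidate
```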