FAST FOOD RECOGNITION FROM VIDEOS OF EATING FOR CALORIE ESTIMATION
Wen Wu, Jie Yang
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA, USA
ABSTRACT
Accurate and passive acquisition of dietary data from patients is essential for a better understanding of the etiology of obesity and for the development of effective weight management programs. Self-reporting is currently the main method for such data acquisition. However, studies have shown that self-reported data seriously underestimate food intake and thus do not accurately reflect the real habitual behavior of individuals, and computer programs for food recognition have not yet been developed. In this paper, we present a study on recognizing foods from videos of eating recorded directly in restaurants by a web camera. From the recognition results, our method then estimates the calories of the food intake. We have evaluated our method on a database of 101 foods from 9 fast food restaurants in the USA and obtained promising results.
Index Terms— fast food recognition, calorie estimation
1. INTRODUCTION
Obesity is a condition in which the natural energy reserve, stored in the fat of humans and other mammals, is increased to such a point that it promotes serious pathologic conditions. Obesity research aims to advance the study, diagnosis and treatment of obesity. Accurate acquisition of dietary data from free-living individuals is essential for a better understanding of the etiology of obesity and the development of effective weight management programs. Despite the wide use of self-reporting via questionnaires and structured interviews, studies have shown that self-reported data underestimate food intake and do not accurately reflect actual eating behavior1. Dietary assessment has thus become one of the bottlenecks for doctors evaluating the medical situations of obese and other patients.
Obesity: having a very high amount of body fat in relation to lean body mass, or a Body Mass Index (BMI) of 30 or higher. BMI: a measure of an adult's weight in relation to his or her height, specifically weight in kilograms divided by the square of height in meters. Obesity in the USA:2
Copyright 2009 IEEE. Published in the 2009 Intl. Conf. on Multimedia and Expo (ICME 2009). Only personal use of this material is permitted. Food chains' materials shown here are only for research purposes.
1 National Institutes of Health, GEI Exposure Biology Program
2 Centers for Disease Control and Prevention (CDC), http://www.cdc.gov/nccdphp/dnpa/obesity/trend/maps/
CDC data show that in 2007, 49 states in the USA (all except Colorado) had an obesity prevalence greater than 20%, and 30 states had a prevalence of 25% or greater. Obesity has attracted attention from both society [1] and researchers [2].
In earlier days, the FERET program3 on facial recognition technology showed that creating object recognition programs involves both learning to build visual object models and evaluating them under different conditions. Similarly, in order to develop food recognition programs, we first need labeled food images and videos for training computer programs. The United States Department of Agriculture (USDA) has published the National Nutrient Database (NND)4 and the Food and Nutrient Database for Dietary Studies (FNDDS) [3]. But both databases are mainly intended for human use and are not ideal for computer analysis due to their data quality and collection protocols. This fact inspired us to collect a database of fast food images, videos and data.
Building a food database is a starting point for developing and testing food recognition programs for obesity study. Fast foods, which are standardized and have exact calorie information published online, were chosen as the food objects for the database collection. By collecting and studying fast food data, however, we do not claim or conclude any relation between fast foods and obesity in this paper. We expect that this available database and our preliminary results will benefit research communities studying obesity in terms of fast foods and image processing techniques. Table 1 shows selected nutrient facts of plain whole-milk yogurt, sourced from FNDDS. We focus only on energy, i.e., the calorie content, among food nutrients.
The application we focus on in this paper is a computer program that automatically recognizes foods from videos of eating and then estimates the energy of the intake in terms of calorie amount. Development and deployment of such a program would facilitate medical treatment of obese and other patients by (semi-)automatically logging food intake and giving a summary of consumed food items and estimated calories per meal, day and so on. The program consists of one part that models certain foods of interest by analyzing a number of input images and/or videos, and another part that analyzes new images or videos and outputs recognition results and calorie estimates.
3 FERET: http://www.frvt.org/FERET/
4 NND: http://www.nal.usda.gov/fnic/foodcomp/search/
Protein (g)         3.47     Water (g)       87.9
Total Fat (g)       3.25     Sugars (g)      4.66
Carbohydrate (g)    4.66     Calcium (mg)    121
Energy (kcal)       61       Iron (mg)       0.05

Table 1. Selected nutrient values of yogurt (plain, whole milk, 227 g); data from FNDDS 3.0 [3].
The key idea of our approach is to formulate the recognition of foods from videos as an image matching and retrieval task. We describe the problem formulation and our approach in detail in Section 3; before that, we briefly introduce the fast food database in Section 2. Vision sensors and technologies are only one of many ways to record food consumption, but they are our primary means for constructing and analyzing the food database in this study. One possible vision sensor for patients to carry would be a necklace camera.
2. A FAST FOOD IMAGE DATABASE
In summer 2008, a database containing 101 foods from 9 fast food restaurants (listed below) was built5. It contains foods such as burgers, sandwiches, salads, chicken and drinks.
1. McDonald’s
2. Wendy’s
3. Arby’s
4. KFC
5. Taco Bell
6. Pizza Hut
7. Subway
8. Quiznos
9. Panera
For each restaurant, a list of foods to collect is first decided, usually around 10 foods covering popular items sold by the restaurant. Each restaurant is visited 3 times to collect 3 instances of each food, and data for each instance are recorded both in the restaurant and in the lab, where we have a controlled environment. In the restaurant, before ordering foods, we obtain the store manager's permission to take pictures and videos inside the store. In addition, we order foods from the list for volunteers to eat so that we can record videos of eating. A Unibrain Fire-i (a low-cost web camera capturing 640 × 480 video at 30 fps) is used to record the eating, avoiding conversations and interruptions and keeping volunteers' faces in view. The recorded videos normally last about 10 minutes. While the volunteer is eating, 4 pictures of each food item are taken on the restaurant table at 4 different angles, 90 degrees apart, without special filming arrangements. After finishing in the restaurant, we bring the foods back to the lab for in-lab data collection. We first take pictures of each food at 6 angles using a turntable and then record a structure-from-motion video of the food by rotating the turntable through 360 degrees. A Canon SD1100 is used for photo-taking and in-lab video capture.
5 A joint effort between Carnegie Mellon and Intel Research - Pittsburgh.
Fig. 1. Example images from our fast food database.
A Point Grey Bumblebee I is used to collect stereo data (rectified images and disparity maps); stereo data are not used in this study.
3. RECOGNIZING FOODS IN VIDEOS OF EATING
For a restaurant, given some training images of its K foods and a video of a single person eating (Fig. 2), which of the K foods appear in the video? Our first application goal is to answer this question with a computer program trained on images. The program focuses on recognizing foods, not determining food portions; in other words, it does not answer questions such as "Did she eat the whole hamburger?" Nor does it discover the temporal ordering of foods in the video. Other data such as audio and motion information are not available for training.
Four restaurant images of each food instance are used for training. These images are first resized from 2592 × 1944 to 640 × 480. Images recorded in the lab, which have controlled lighting and no occlusion, are not used here; we choose restaurant images instead of lab ones because we want the problem setting to be as close to the real-world scenario as possible. The videos of eating, which are VGA quality and normally last about 10 minutes, are subsampled to a few hundred frames. This reduces the redundant motion and temporal information in the original videos.
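The paper does not specify the exact subsampling scheme; as an illustration only, a minimal uniform-subsampling sketch in Python with OpenCV (the function name and the target of roughly 300 frames are our own assumptions) could look like:

```python
import cv2

def subsample_video(video_path, target_frames=300):
    """Uniformly subsample a video to roughly target_frames frames."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, total // target_frames)  # keep every step-th frame
    frames, index = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if index % step == 0:
            frames.append(frame)
        index += 1
    cap.release()
    return frames
```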
The task is formulated as a problem of matching training images to video frames. The goal is to output a relevance-ranked list of foods and predict the ones most likely to appear in the video. To match a video frame against each training image, we first extract scale-invariant feature transform (SIFT) [4] keypoints on both. We choose SIFT over other image descriptors such as color and edges because SIFT is less sensitive to the scale changes, rotations and occlusions that often occur in videos of eating.
We use a UCLA SIFT package6. We set the number of scales per octave to 3, the keypoint selection threshold to thresh = 0.007 or 0.003 (picking the better result of the two), and the keypoint matching criterion to ρd = 1.563. This method is called S in the paper. Matching is performed on the keypoints' SIFT descriptors [4]. The numbers of pairs of matched keypoints (match counts, for short) are ranked across food items.
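For illustration, a hedged OpenCV stand-in for the keypoint extraction step (the UCLA package is what the paper actually uses; mapping its thresh onto OpenCV's contrastThreshold is our own assumption):

```python
import cv2

# nOctaveLayers=3 mirrors the paper's 3 scales per octave; contrastThreshold
# is assumed to play the role of the keypoint selection threshold.
sift = cv2.SIFT_create(nOctaveLayers=3, contrastThreshold=0.007)

def extract_sift(image_path):
    """Return SIFT keypoints and their 128-D descriptors for an image."""
    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    keypoints, descriptors = sift.detectAndCompute(gray, None)
    return keypoints, descriptors
```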
6 http://vision.ucla.edu/~vedaldi/code/sift/
Fig. 2. A sample sequence of a video of eating in Wendy’s.
The frame-based food rankings and match counts are both stored. S is effective but not in all cases, so we enhance it by adding a cosine matching criterion with ρc = 0.97 [5]; we call this method SC. Another method, Sg, enhances S with a RANSAC-based geometric criterion [6], with error threshold e = 10 for removing mismatches. The following equation shows the two matching criteria of S and SC:
\[ \frac{d^2(f, f_2)}{d^2(f, f_1)} > \rho_d, \qquad \cos(f, g) = \frac{f^\top g}{\|f\|_2 \, \|g\|_2} > \rho_c, \]
where f, g are descriptors and f1, f2 are f's nearest and second-nearest descriptors from the model database. The ground-truth number T of foods in the video is used to keep the top T ranked items. The recognition rate (RR) for each video is computed as the ratio of the number of correctly recognized foods to T. Although T is sometimes unknown, it can be estimated by coarse segmentation of video frames and counting the potential food segments identified by a pre-trained food model. Cutting the ranking list at T can be too strict when measuring improvements from another recognition method, so we also threshold at T + 3 (suffix 3) in Table 2.
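A minimal NumPy sketch of the two matching criteria, plus a plausible RANSAC-based stand-in for the geometric criterion of Sg (the function names and the brute-force nearest-neighbor search are our own; the paper follows [6], and a homography fit is only one possible realization):

```python
import cv2
import numpy as np

def match_count(frame_desc, model_desc, rho_d=1.563, rho_c=0.97):
    """Count keypoint matches passing the ratio (S) and cosine (SC) tests."""
    matches = 0
    for f in frame_desc:                           # f: (128,) SIFT descriptor
        d2 = np.sum((model_desc - f) ** 2, axis=1) # squared distances
        i1, i2 = np.argsort(d2)[:2]                # nearest, second nearest
        if d2[i2] / max(d2[i1], 1e-12) <= rho_d:
            continue                               # fails d2(f,f2)/d2(f,f1) > rho_d
        g = model_desc[i1]
        cos = f @ g / (np.linalg.norm(f) * np.linalg.norm(g) + 1e-12)
        if cos > rho_c:                            # cosine criterion of SC
            matches += 1
    return matches

def geometric_inliers(pts_frame, pts_model, err_thresh=10.0):
    """Sg-style filter: keep matches consistent with a RANSAC homography."""
    if len(pts_frame) < 4:                         # homography needs 4 points
        return 0
    _, mask = cv2.findHomography(np.float32(pts_frame),
                                 np.float32(pts_model),
                                 cv2.RANSAC, err_thresh)
    return int(mask.sum()) if mask is not None else 0
```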
We use two methods to obtain the final ranking (f-rank, 1 × n) of foods given an input video: 1) a ranking-based matching (RBM) based on the top T items of each frame-based ranking; 2) a count-based matching (CBM) based on the sum of keypoint match counts over all video frames. Mathematically, for a frame sequence (a subsampled video), each matching algorithm described above generates two matrix outputs, R and Rc, both of size m × n, where m is the number of frames (rows) and n is the number of foods (columns). R(i, j) stores the food ID, ∈ 1...n, that is ranked j-th for the i-th frame (match counts sorted in descending order). Rc(i, j) stores the match counts between the model image of food R(i, j) and the i-th frame.
RBM works as follows. It first sorts R by row to obtain an index matrix Rx (the j-th column now shows the j-th food's ranks across input frames), replaces all ranks greater than T in Rx with a large number, then sums Rx by column to obtain the f-rank of the n foods, and finally outputs the top T foods by their summed ranks. CBM follows the same steps as RBM but generates the f-rank of the n foods using Rc, with only the top T counts per frame kept (the rest set to 0).
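As an illustration only (function names, 1-based food IDs and the tie-handling of argsort are our own assumptions), RBM, CBM and the RR computation might be sketched as:

```python
import numpy as np

def final_ranking(R, Rc, T, use_counts=False):
    """Aggregate per-frame rankings R (and counts Rc), both (m, n),
    into the top-T food IDs. use_counts selects CBM over RBM."""
    m, n = R.shape
    if use_counts:                         # CBM: sum the kept match counts
        score = np.zeros(n)
        for i in range(m):
            for j in range(min(T, n)):     # only top-T counts per frame
                score[R[i, j] - 1] += Rc[i, j]
        order = np.argsort(-score)         # higher total count is better
    else:                                  # RBM: sum the truncated ranks
        rank_sum = np.zeros(n)
        BIG = n + 1                        # penalty for ranks beyond T
        for i in range(m):
            for j in range(n):
                rank = j + 1
                rank_sum[R[i, j] - 1] += rank if rank <= T else BIG
        order = np.argsort(rank_sum)       # lower rank sum is better
    return (order + 1)[:T]                 # 1-based food IDs

def recognition_rate(predicted, actual):
    """RR = number of correctly recognized foods / T."""
    return len(set(predicted) & set(actual)) / len(actual)
```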
Table 2 lists the results of 8 methods (cSg, cS and cSC are CBM methods; the rest are RBM) on the 9 restaurants' data.
        Sg     S      SC     cSg    cS     cSC    S3     SC3
ARB     0.00   0.50   0.50   0.00   0.50   0.50   1.00   1.00
KFC     0.20   0.80   0.40   0.20   0.80   0.40   1.00   1.00
MCD     0.67   0.33   0.33   0.67   0.33   0.33   0.67   0.67
PAN     0.20   0.60   0.60   0.20   0.60   0.60   0.60   0.60
PIZ     0.67   0.00   0.33   0.33   0.00   0.33   0.00   0.33
QUI     0.33   0.33   0.33   0.33   0.67   0.33   0.67   0.33
SUB     1.00   0.67   0.33   1.00   0.67   0.33   0.33   1.00
TAC     0.50   0.75   0.50   0.50   0.75   0.50   1.00   1.00
WEN     0.33   0.67   0.33   0.33   0.67   0.33   0.67   0.67
AVG     0.43   0.52   0.41   0.40   0.55   0.41   0.65   0.73
Table 2. Recognition rates of 8 methods on the 9 restaurants' data. S: SIFT, C: cosine, g: geometric, prefix c: count-based matching (CBM), suffix 3: threshold at T + 3.
The table shows that CBM-SIFT (cS) performs best among the top-T methods, achieving an average RR = 0.55 over the 9 restaurants. SIFT with geometry (Sg) obtains RR = 1.0 on the Subway data. CBM methods perform similarly to their RBM counterparts. By predicting 3 more results, SIFT with cosine (SC3) achieves an average RR = 0.73, which can serve as a baseline for further improvement. Our approach's speed is bounded by keypoint computation and SIFT matching (a Matlab implementation was used in our experiments); it takes about 1.5 hours to analyze a 10-minute subsampled video against around 50 training images (10-13 foods, 4 training images per food).
The problem of recognizing foods from videos based on training instances may seem easy, but it is actually challenging even for humans; see the video frames and training instances in Fig. 3 and Fig. 4. Changes in camera focus, distortion, lighting and viewpoint often make the video quality poor for recognition.
To further analyze recognition performance, we select the top performers S and SC and study which foods they miss and which foods they falsely recognize in a Panera video. Fig. 3 shows the analysis, including a confusion matrix. It is a 12-minute video in which 5 foods appear; the food numbers are in the first row of the confusion matrix. By analyzing the results on this video and other cases, we observe that false and missed recognitions normally happen for three reasons: 1) our methods lack a constraint to remove many-to-one matches; 2) no food masks are provided, so other objects such as trays and napkins distract the matching process; 3) other foods are sometimes more similar to the actual foods in the video than their own training instances are. For example, different burgers from the same brand can look very similar.
4. HOW MANY CALORIES IN YOUR LUNCH?
The second part of our application is to estimate calorie amounts based on food recognition. The calorie is a pre-SI unit of energy (heat): the amount of heat required to raise the temperature of one gram of water by one degree Celsius.
Fig. 4. Calorie estimation on a Subway video, on which Sg achieves RR = 1.0. (c1-3): food instances with calorie amounts.
Fig. 3. Recognition results on a Panera video. In the confusion matrix, S and SC: same as in Table 2; true: actual foods; miss: missed recognitions; fpos: false recognitions.
One motivation for focusing on fast foods in this study is that the calorie information of most fast foods is standardized and available to the public. For all foods collected in the database, we extract the calorie information from each brand's website.
An intuitive way to estimate calories is to sum the calories of the recognized foods, as sketched below. Calorie estimation combined with food recognition can help doctors evaluate obese patients and also help patients self-evaluate their diets.
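A minimal sketch of this summation step, with hypothetical food names and placeholder calorie values (the database stores the actual values extracted from brand websites):

```python
# Placeholder calorie table; real values come from each brand's website.
CALORIES = {
    "6-inch sandwich A": 280,   # hypothetical item and value
    "6-inch sandwich B": 290,   # hypothetical item and value
    "medium soft drink": 180,   # hypothetical item and value
}

def estimate_meal_calories(recognized_foods):
    """Sum the published calorie values of the recognized foods."""
    return sum(CALORIES[food] for food in recognized_foods)
```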
Fig. 4 shows recognition results on a video of eating in Subway, in which three foods appear. The first row shows a video frame in (a), in which one sandwich is apparent while the soft drink and the other sandwich are partially or completely occluded; the actual and estimated calorie amounts are shown in (b). The second row shows a training instance of each of the three foods (two 6-inch sandwiches and a medium soft drink) with their calorie amounts in parentheses. The food recognition and calorie estimation in Fig. 4 both achieve RR = 1.0, obtained by Sg (SIFT with geometry). But the estimated 672 calories for the meal are not the actual amount of energy consumed, because our method estimates calories based only on food appearance, not food portions. In addition, biased calorie estimates can also result from poor recognition performance. Another issue, particular to recognizing drinks, is that our methods can only recognize drink cups, not the drinks inside, to say nothing of drink amounts and refills.
5. DISCUSSION
This paper proposes and studies methods that recognize fast foods from videos of eating and estimate meal calories based on the recognized foods. Our work shows that an off-the-shelf vision algorithm with appropriate engineering can lead to promising results on this interesting camera-based task. However, our work does not provide any observation or conclusion about the relation between obesity and fast foods. During the work, we have received comments such as "Why can't we provide voice recorders to patients to record their daily eating logs?" and "RFID can be a highly reliable label for food and its nutrients, so there seems to be no advantage to using a camera for the task." Our work does not conflict with the observations behind these comments: the study and the proposed method in this paper serve as a complement to other techniques rather than as a stand-alone system.
6. ACKNOWLEDGEMENT
This project is partially supported by NIH under the Genes, Environment and Health Initiative (GEI) (Grant Number: 1U01 HL091736 01) and the Carnegie Mellon University V-Unit fund. We want to thank Mei Chen and Rahul Sukthankar from Intel Research - Pittsburgh; they and Kapil Dev Dhingra are our project partners in the collection of the fast food database used in this study. We also thank Manuela Veloso and M. Bernardine Dias for their advice on this V-Unit project, Daniel Huber and Srinivasa Narasimhan for their help, and Franziska Kraus and Lei Yang for verifying the database quality.
7. REFERENCES
[1] E. Schlosser. Fast Food Nation: The Dark Side of the All-American Meal. Houghton Mifflin Harcourt, 2001.
[2] E. Arredondo, D. Castaneda, J. P. Elder, D. Slymen, and
D. Dozier. Brand name logo recognition of fast food and healthy
food among children. Journal of Community Health, 2008.
[3] Food Surveys Research Group, Beltsville, MD: Agricultural Research Service. USDA Food and Nutrient Database for Dietary Studies, 3.0, 2008.
[4] D. G. Lowe. Distinctive image features from scale-invariant
keypoints. Intl. Journal of Computer Vision, 2004.
[5] W. Zhang and J. Kosecka. Image based localization in urban environments. International Symposium on 3D Data Processing,
Visualization and Transmission, 2006.
[6] F. Schaffalitzky and A. Zisserman. Multi-view matching for unordered image sets, or "How do I organize my holiday snaps?". ECCV, 2002.