FAST FOOD RECOGNITION FROM VIDEOS OF EATING FOR CALORIE ESTIMATION

Wen Wu, Jie Yang
Language Technologies Institute, School of Computer Science, Carnegie Mellon University
5000 Forbes Ave, Pittsburgh, PA, USA

Copyright 2009 IEEE. Published in the 2009 Intl. Conf. on Multimedia and Expo (ICME 2009). Only personal use of this material is permitted. Food chains' materials are shown here for research purposes only.

ABSTRACT

Accurate and passive acquisition of dietary data from patients is essential for a better understanding of the etiology of obesity and for the development of effective weight management programs. Self-reporting is currently the main method for such data acquisition. However, studies have shown that self-reported data seriously underestimate food intake and thus do not accurately reflect the real habitual behavior of individuals. Computer programs for food recognition have not yet been developed. In this paper, we present a study on recognizing foods from videos of eating, recorded directly in restaurants by a web camera. From the recognition results, our method then estimates the calories of food intake. We have evaluated our method on a database of 101 foods from 9 fast food restaurants in the USA and obtained promising results.

Index Terms— fast food recognition, calorie estimation

1. INTRODUCTION

Obesity is a condition in which the natural energy reserve, stored in the fat of humans and other mammals, is increased to a point where it promotes serious pathologic conditions. Obesity research aims to enhance the study, diagnosis and treatment of obesity. Obesity here means having a very high amount of body fat in relation to lean body mass, or a Body Mass Index (BMI) of 30 or higher, where BMI measures an adult's weight in relation to height: weight in kilograms divided by the square of height in meters. Data from the Centers for Disease Control and Prevention (CDC, http://www.cdc.gov/nccdphp/dnpa/obesity/trend/maps/) show that in 2007, 49 US states (all except Colorado) had an obesity prevalence greater than 20%, and 30 states had a prevalence of 25% or more. Obesity has attracted attention from both society [1] and researchers [2].

Accurate acquisition of dietary data from free-living individuals is essential for a better understanding of the etiology of obesity and for the development of effective weight management programs. Despite the wide use of self-reporting through questionnaires and structured interviews, studies have shown that self-reported data underestimate food intake and do not accurately reflect real eating behavior (National Institutes of Health, GEI Exposure Biology Program). Dietary assessment has thus become a bottleneck for doctors evaluating the medical situations of obese and other patients.

In earlier days, the FERET program (http://www.frvt.org/FERET/) on facial recognition technology showed that creating object recognition programs involves both learning visual object models and evaluating them under different conditions. Similarly, in order to develop food recognition programs, we first need labeled food images and videos for training computer programs. The United States Department of Agriculture (USDA) has published the National Nutrient Database (NND, http://www.nal.usda.gov/fnic/foodcomp/search/) and the Food and Nutrient Database for Dietary Studies (FNDDS) [3]. But both databases are designed mainly for human use and are not ideal for computer analysis, due to their data quality and collection protocols.
This fact has inspired us to collect a database of fast food images, videos and data. Building such a food database is a starting point for developing and testing food recognition programs for obesity research. Fast foods are chosen as the food objects because they are standardized and have exact calorie information published online. By collecting and studying fast food data, however, we do not claim or conclude any relation between fast foods and obesity in this paper. We expect that this available database and our preliminary results will benefit research communities studying obesity in terms of fast foods and image processing techniques. Table 1 shows selected nutrient facts of plain whole-milk yogurt, sourced from FNDDS. Among food nutrients, we focus only on energy, i.e., the calorie content.

    Nutrient           Value
    Protein (g)        3.47
    Water (g)          87.9
    Total Fat (g)      3.25
    Sugars (g)         4.66
    Carbohydrate (g)   4.66
    Calcium (mg)       121
    Energy (kcal)      61
    Iron (mg)          0.05

Table 1. Selected nutrient values of yogurt (plain, whole milk, 227 g); data from FNDDS 3.0 [3].

The application we focus on in this paper is a computer program that automatically recognizes foods from videos of eating and then estimates the energy of intake as a calorie amount. Developing and deploying such a program would facilitate the medical treatment of obese and other patients by (semi-)automatically logging food intake and summarizing consumed food items and estimated calories per meal, per day and so on. The program consists of one part that models foods of interest by analyzing input images and/or videos, and another part that analyzes new images or videos and outputs recognition results and calorie estimates.

The key idea of our approach is to formulate the recognition of foods from videos as an image matching and retrieval task. We describe the problem formulation and our approach in detail in Section 3. Before that, we briefly introduce the fast food database in Section 2. Vision sensors and technologies are one of many ways to record food consumption; they are our primary means of constructing and analyzing the food database in this study. A possible vision sensor for patients to carry would be a necklace camera.

2. A FAST FOOD IMAGE DATABASE

In summer 2008, we built a database of 101 foods from the 9 fast food restaurants listed below, in a joint effort between Carnegie Mellon and Intel Research - Pittsburgh. The database contains foods such as burgers, sandwiches, salads, chicken and drinks.

1. McDonald's
2. Wendy's
3. Arby's
4. KFC
5. Taco Bell
6. Pizza Hut
7. Subway
8. Quiznos
9. Panera

For each restaurant, a list of foods to collect is first decided, usually around 10 foods covering popular items sold by the restaurant. Each restaurant is visited 3 times to collect 3 instances of each food and to record data for each instance both in the restaurant and in the lab, where we have a controlled environment. In the restaurant, before ordering, the store manager's permission to take pictures and videos inside the store is obtained. In addition, we order foods from the collection list for volunteers to eat so that we can record videos of eating. A Unibrain Fire-i (a low-cost web camera capturing 640 × 480 video at 30 fps) is used to record the videos; conversations and interruptions are avoided and volunteers' faces remain visible. The recorded videos normally last about 10 minutes.
While a volunteer is eating, 4 pictures of each food item are taken on the restaurant table at 4 angles 90 degrees apart, without special filming arrangements. After finishing in the restaurant, we bring the foods back to the lab for in-lab data collection. There we first take pictures of each food at 6 angles using a turntable and then record a structure-from-motion video of the food by turning the turntable through 360 degrees. A Canon SD1100 is used for photo taking and in-lab video capture. A Point Grey Bumblebee I is used to collect stereo data (rectified images and a disparity map); the stereo data are not used in this study.

Fig. 1. Example images from our fast food database.

3. RECOGNIZING FOODS IN VIDEOS OF EATING

For a restaurant, given some training images of its K foods and a video of a single person eating (Fig. 2), predict which of the K foods appear in the video. Our first application goal is to answer this question with a computer program trained on images. The program focuses on recognizing foods, not determining food portions; in other words, it does not answer questions such as "Did she eat the whole hamburger?" Nor does it recover the temporal ordering of foods in the video. Other data such as audio and motion information are not available for training.

Fig. 2. A sample sequence of a video of eating in Wendy's.

Four restaurant images of each food instance are used for training. These images are first resized from 2592 × 1944 to 640 × 480. Images recorded in the lab, which have controlled lighting and no occlusion, are not used here: we choose restaurant images over lab images because we want the problem setting to be as close to the real-world scenario as possible. The videos of eating, which are VGA quality and normally last about 10 minutes, are subsampled to a few hundred frames, which reduces the redundant motion and temporal information of the original videos.

The task is formulated as a problem of matching training images to video frames. The goal is to output a relevance ranking of foods and predict the ones most likely to appear in the video. To match a video frame to each training image, we first extract scale-invariant feature transform (SIFT) [4] keypoints on both. We choose SIFT over other image descriptors such as color and edges because SIFT is less sensitive to the scale changes, rotation and occlusion that often occur in videos of eating. We use the UCLA SIFT package (http://vision.ucla.edu/~vedaldi/code/sift/), setting the number of scales per octave to 3, the keypoint selection threshold to 0.007 or 0.003 (picking the better result of the two), and the keypoint matching criterion to ρd = 1.563. This method is called S in the paper. Matching is performed on the keypoints' SIFT descriptors [4]. The numbers of matched keypoint pairs (match counts, for short) are ranked across food items, and both the frame-based food rankings and the counts are stored.

S is effective but not in all cases, so we enhance it by adding a cosine matching criterion with ρc = 0.97 [5]; we call this method SC. A third method, Sg, enhances S with a RANSAC-based geometric criterion [6], using an error threshold of e = 10 for removing mismatches. The matching criteria of S and SC are:

    d^2(f, f2) / d^2(f, f1) > ρd,        cos(f, g) = f^T g / (||f||_2 ||g||_2) > ρc,

where f and g are descriptors, and f1 and f2 are f's nearest and second-nearest descriptors in the model database.
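To make the S and SC criteria concrete, the following is a minimal NumPy sketch operating on precomputed SIFT descriptors (one 128-D row vector per keypoint). The function name count_matches and its array-based interface are our own illustration, not code from the paper; the thresholds mirror ρd and ρc above.

    import numpy as np

    RHO_D = 1.563   # distance-ratio criterion: d^2(f, f2) / d^2(f, f1) must exceed this
    RHO_C = 0.97    # cosine criterion added by SC

    def count_matches(frame_desc, model_desc, use_cosine=False):
        """Count frame keypoints that match a model image under S (or SC)."""
        if len(model_desc) < 2:
            return 0                                       # ratio test needs two neighbors
        matches = 0
        for f in frame_desc:                               # f: one 128-D SIFT descriptor
            d2 = np.sum((model_desc - f) ** 2, axis=1)     # squared distances to all model descriptors
            order = np.argsort(d2)
            d_near, d_second = d2[order[0]], d2[order[1]]
            if d_near > 0 and d_second / d_near <= RHO_D:
                continue                                   # fails the distance-ratio test
            if use_cosine:
                g = model_desc[order[0]]                   # the matched nearest descriptor
                cos_fg = f @ g / (np.linalg.norm(f) * np.linalg.norm(g))
                if cos_fg <= RHO_C:
                    continue                               # fails the cosine test
            matches += 1
        return matches

In our pipeline, such per-frame, per-food match counts would populate the matrices R and Rc defined next.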
The ground-truth number T of foods in a video is used to keep the top T ranked items. The recognition rate (RR) for each video is computed as the ratio of the number of correctly recognized foods to T. Though T is sometimes unknown, it can be estimated by coarsely segmenting video frames and counting potential food segments evaluated by a pre-trained food model. Cutting the ranking list at T can be too strict when measuring improvements from one recognition method to another, so we also threshold at T+3 (suffix 3 in Table 2).

We use two methods to obtain the final ranking (f-rank, 1 × n) of foods for an input video: 1) ranking-based matching (RBM), based on the top T items of each frame-based ranking; and 2) count-based matching (CBM), based on the sum of keypoint match counts over all video frames. Both are described below; a code sketch follows.

For a frame sequence (a subsampled video), each matching algorithm described above generates two m × n matrices, R and Rc, where m is the number of frames (rows) and n is the number of foods (columns). R(i, j) stores the food ID (in 1...n) ranked j-th for the i-th frame (match counts sorted in descending order). Rc(i, j) stores the match count between the model image of food R(i, j) and the i-th frame.

RBM works as follows. It first sorts R by row to obtain an index matrix Rx (so that the j-th column holds the j-th food's rank in each frame), replaces all ranks greater than T in Rx with a big number, then sums Rx by column to obtain the f-rank of the n foods, and finally outputs the top T foods by their rank sums (a smaller sum means a better final rank). CBM follows the same steps but generates the f-rank using Rc, with only the top T counts per frame retained (the rest set to 0).
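As referenced above, here is a minimal sketch of the two fusion schemes under the matrix definitions just given, assuming 0-based food IDs and ranks (the paper's description uses 1...n). The names rbm, cbm and BIG are ours.

    import numpy as np

    BIG = 10**6   # stand-in for the "big number" that penalizes ranks beyond the top T

    def rbm(R, T):
        """Ranking-based matching: fuse per-frame rankings; return top-T food IDs."""
        m, n = R.shape
        Rx = np.empty((m, n), dtype=int)
        Rx[np.arange(m)[:, None], R] = np.arange(n)[None, :]  # Rx[i, food] = rank of food in frame i
        Rx[Rx >= T] = BIG                                     # keep only each frame's top-T ranks
        rank_sums = Rx.sum(axis=0)                            # smaller sum = better final rank
        return np.argsort(rank_sums)[:T]

    def cbm(R, Rc, T):
        """Count-based matching: sum each frame's top-T match counts; return top-T foods."""
        m, n = R.shape
        totals = np.zeros(n)
        for i in range(m):
            for j in range(T):                                # counts beyond rank T treated as 0
                totals[R[i, j]] += Rc[i, j]
        return np.argsort(-totals)[:T]                        # larger total count = better

Note that np.argsort breaks ties by index; the paper does not specify how ties are handled.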
Table 2 lists the results of the 8 methods (those with prefix c use CBM; the rest use RBM) on the 9 restaurants' data.

           ARB   KFC   MCD   PAN   PIZ   QUI   SUB   TAC   WEN   AVG
    Sg     0.00  0.20  0.67  0.20  0.67  0.33  1.00  0.50  0.33  0.43
    S      0.50  0.80  0.33  0.60  0.00  0.33  0.67  0.75  0.67  0.52
    SC     0.50  0.40  0.33  0.60  0.33  0.33  0.33  0.50  0.33  0.41
    cSg    0.00  0.20  0.67  0.20  0.33  0.33  1.00  0.50  0.33  0.40
    cS     0.50  0.80  0.33  0.60  0.00  0.67  0.67  0.75  0.67  0.55
    cSC    0.50  0.40  0.33  0.60  0.33  0.33  0.33  0.50  0.33  0.41
    S3     1.00  1.00  0.67  0.60  0.00  0.67  0.33  1.00  0.67  0.65
    SC3    1.00  1.00  0.67  0.60  0.33  0.33  1.00  1.00  0.67  0.73

Table 2. Recognition rates of 8 methods on 9 restaurants' data. S: SIFT, C: cosine, g: geometric, prefix c: count-based matching (CBM), suffix 3: threshold at T+3.

The table shows that, among the methods thresholded at T, CBM-SIFT (cS) performs best, achieving an average RR of 0.55 over the 9 restaurants. SIFT with geometry (Sg) obtains RR = 1.0 on the Subway data. The CBM methods perform similarly to their RBM counterparts. By predicting 3 more items, SIFT with cosine (SC3) achieves an average RR of 0.73, which can serve as a baseline for further improvement. Our approach's speed is bounded by keypoint computation and SIFT matching (a Matlab implementation was used in our experiments): it takes about 1.5 hours to analyze a 10-minute subsampled video against around 50 training images (10-13 foods, 4 training images per food).

The problem of recognizing foods in videos from training instances may seem easy, but it is actually challenging even for humans; see the video frames and training instances in Fig. 3 and Fig. 4. Changes in camera focus, distortion, lighting and viewpoint often make the video quality poor for recognition.

To further analyze recognition performance, we select the top performers S and SC and study which foods they miss and which foods they falsely recognize in a Panera video. Fig. 3 shows the analysis, including a confusion matrix. The video is 12 minutes long and 5 foods appear in it; the food numbers are given in the first row of the confusion matrix. From the results on this video and other cases, we observe that false and missed recognitions normally happen for three reasons: 1) our methods lack a constraint to remove many-to-one matchings; 2) no food masks are provided, so other objects such as trays and napkins distract the matching process; 3) other foods are sometimes more similar to the actual foods in the video than those foods' own training instances; for example, different burgers from the same brand can look very similar.

Fig. 3. Recognition results on a Panera video. In the confusion matrix, S and SC are as in Table 2; true: actual foods, miss: missed recognitions, fpos: false recognitions.

4. HOW MANY CALORIES IN YOUR LUNCH?

The second part of our application estimates calorie amounts based on food recognition. The calorie is a pre-SI unit of energy (heat): the amount of heat required to raise the temperature of one gram of water by 1 degree Celsius. One motivation for focusing on fast foods in this study is that calorie information for most fast foods is standardized and available to the public; for all foods in the database, we extract calorie information from each brand's website. An intuitive way to estimate calories is to sum the calories of the recognized foods. Calorie estimation combined with food recognition can help doctors evaluate obese patients and also help patients self-evaluate their diets.

Fig. 4 shows recognition results on a video of eating in Subway, in which three foods appear. The first row shows a video frame in (a), where one sandwich is clearly visible while the soft drink and the other sandwich are partially or completely occluded; the actual and estimated calorie amounts appear in (b). The second row shows a training instance of each of the three foods (two 6-inch sandwiches and a medium soft drink) with their calorie amounts in parentheses.

Fig. 4. Calorie estimation on a Subway video, on which Sg achieves RR = 1.0. (c1-3): food instances with calorie amounts.

The food recognition and calorie estimation in Fig. 4 both achieve RR = 1.0, obtained by Sg (SIFT with geometry). However, the estimated 672 calories for the meal are not the actual amount of energy consumed, because our method estimates calories based only on food appearance, not food portions. Biased calorie estimates can also result from poor recognition performance. A further issue particular to drinks is that our methods recognize only the drink cups, not the drinks inside, to say nothing of drink amounts and refills.
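The calorie estimation step itself is simple; the sketch below sums the published per-item calorie amounts of the recognized foods. The lookup table and its values are purely illustrative, not the paper's data; real values come from each brand's website, and portions, drink amounts and refills are not modeled.

    # Hypothetical calorie lookup (kcal per item); values are illustrative only.
    CALORIES = {"sandwich A": 300, "sandwich B": 290, "medium soft drink": 150}

    def estimate_meal_calories(recognized_foods):
        """Sum the published per-item calories of the recognized foods."""
        return sum(CALORIES[name] for name in recognized_foods)

    print(estimate_meal_calories(["sandwich A", "medium soft drink"]))  # -> 450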
5. DISCUSSION

This paper proposes and studies methods that recognize fast foods from videos of eating and estimate meal calories from the recognized foods. Our work shows that an off-the-shelf vision algorithm with appropriate engineering can lead to promising results on this interesting camera-based task. It does not, however, provide any observation or conclusion about the relation between obesity and fast foods. During this work, we have received comments such as: "Why not give patients voice recorders to record their daily eating logs?" and "RFID can be a highly reliable label for a food and its nutrients, so there seems to be no advantage in using a camera for the task." Our work does not conflict with the observations behind these comments; the study and the method proposed in this paper serve as a complement to other techniques rather than as a stand-alone system.

6. ACKNOWLEDGEMENT

This project is partially supported by NIH under the Genes, Environment and Health Initiative (GEI) (Grant Number: 1U01 HL091736 01) and by the Carnegie Mellon University V-Unit fund. We thank Mei Chen and Rahul Sukthankar from Intel Research - Pittsburgh; they and Kapil Dev Dhingra are our project partners in the collection of the fast food database used in this study. We thank Manuela Veloso and M. Bernardine Dias for their advice on this V-Unit project, Daniel Huber and Srinivasa Narasimhan for their help, and Franziska Kraus and Lei Yang for verifying the database quality.

7. REFERENCES

[1] E. Schlosser, Fast Food Nation: The Dark Side of the All-American Meal. Houghton Mifflin Harcourt, 2001.

[2] E. Arredondo, D. Castaneda, J. P. Elder, D. Slymen, and D. Dozier, "Brand name logo recognition of fast food and healthy food among children," Journal of Community Health, 2008.

[3] Food Surveys Research Group, Agricultural Research Service, Beltsville, MD, USDA Food and Nutrient Database for Dietary Studies, 3.0, 2008.

[4] D. G. Lowe, "Distinctive image features from scale-invariant keypoints," Intl. Journal of Computer Vision, 2004.

[5] W. Zhang and J. Kosecka, "Image based localization in urban environments," International Symposium on 3D Data Processing, Visualization and Transmission, 2006.

[6] F. Schaffalitzky and A. Zisserman, "Multi-view matching for unordered image sets, or 'How do I organize my holiday snaps?'," ECCV, 2002.