SEEING THROUGH THE APPEARANCE:
BODY SHAPE ESTIMATION USING MULTI-VIEW CLOTHING IMAGES
Wei-Yi Chang and Yu-Chiang Frank Wang
Research Center for Information Technology Innovation, Academia Sinica, Taipei, Taiwan
{webillchang, ycwang}@citi.sinica.edu.tw
ABSTRACT
We propose a learning-based algorithm for body shape estimation, which only requires 2D clothing images taken from multiple views as input. Compared with the use of 3D scanners or depth cameras, our setting is more user-friendly, but it also makes the learning and estimation problems more challenging. In addition to utilizing ground truth body images for constructing human body models at each view of interest, our work uniquely associates the anthropometric measurements (e.g., body height or leg length) across different views. To perform body shape estimation using multi-view clothing images, the proposed algorithm solves an optimization task which recovers the body shape with image and measurement reconstruction guarantees. In the experiments, we show that our proposed method achieves satisfactory estimation results and performs favorably against single-view and other baseline approaches for both body shape and measurement estimation.
Index Terms— Body shape estimation, multi-view image
reconstruction, regression models
1. INTRODUCTION
With the rapid growth of the Internet and e-commerce, the number of users purchasing clothing items online is increasing significantly. Generally, there are two major challenges for the online shopping of clothing items. First, how to suggest clothing items which might be of interest to the users, or how to recommend relevant clothing (e.g., a jacket) or accessory items (e.g., a scarf or shoes), so that the presented items fit specific occasions or fashion trends. Second, given particular clothing items like shirts or pants, how to suggest the proper size for a particular user, so that unnecessary returns can be avoided. Once such a detailed estimation is obtained, virtual try-on systems like [1] can be applied to visualize the appearance of the user with the selected clothing items on.
In this paper, we focus on the latter task, i.e., body shape estimation. Our work aims at estimating the body shape of a user from his/her visual appearance in clothing images. As depicted in Figure 1, our goal is to predict the associated measurements (e.g., body height, leg length, or waist width, as defined later in Section 3). This is of practical interest, since the size charts and actual sizes of the clothes provided by different sellers are generally not consistent. Thus, given one or more images of a user with his/her clothes on, we need to estimate the corresponding body shape and measurements.

Fig. 1. Body shape and measurement estimation for the applications of online clothes shopping.
It is worth noting that, while there exists equipment such as backscatter X-ray, laser, and radio-wave scanners for precisely capturing the 3D human body shape, their costs are typically too high for general consumer use. Even if such costs can be reduced, it might not be practical to have users wear skin-tight clothes for scanning their body shapes. This is why we particularly focus on body shape estimation using 2D images with clothes on. For addressing this task, silhouette-based approaches are typically applied due to their low cost and ease of image collection. After subtracting the background regions from the captured image, the silhouette of the human body (with clothes on) can be extracted and utilized for training and predicting the body shape accordingly.
In this paper, we address the problem of body shape estimation using clothing images. To be more specific, we consider 2D images of the same person and gesture captured from multiple views for estimation. Compared to methods focusing on a single clothing image at the frontal view, we will show that our approach improves the estimation accuracy of the resulting body shape. In addition
Fig. 2. Our proposed framework for body shape estimation using multi-view clothing images.

Fig. 3. Examples of silhouette extraction and image parsing for learning body shape models: (a) input image, (b) skeleton extraction by [15], and (c) the final parsing result.
to the estimation of body shape, we further predict several
physical measurements such as body height and leg length,
which would be of particular practical interest for online shopping applications. Based on the above motivation
and observation, we propose a novel learning-based algorithm
with the goal of recovering/estimating the body shape of the
input clothing image with cross-camera image and measurement correspondences. The flowchart of our proposed framework is shown in Figure 2. In Section 3, we will detail our
proposed algorithm, including the optimization details. In our
experiments, we will verify that our method is able to achieve
improved body shape estimation with consistency observed
for cross-camera measurements. We will also confirm that
our approach would perform favorably against baseline or recent methods, which considered only single-view images or
did not take multi-view information into consideration.
2. RELATED WORK
Body shape estimation has been an active research topic in
multimedia and computer vision communities. This is due
to its practical uses for applications such as computer games,
animation, virtual try-on, and visualization. There exist approaches which utilize 3D scanners [2] or depth cameras [3] for estimating the 3D body shape. For example, Anguelov et al. [2] proposed the parametric body model SCAPE (shape completion and animation of people), which describes the body shape and posture of humans. Tong et al. [3] presented a system to scan full 3D human body shapes using multiple depth cameras. In addition to the use of 3D scanners and depth cameras, some researchers considered the images captured by single [4] or multiple cameras [5] for 3D
body shape estimation. For example, Guan et al. [4] utilized
SCAPE [2] and additional monocular cues (e.g., shading) to
reconstruct the 3D model from a single 2D image. Boisvert et
al. [5] proposed a silhouette-based reconstruction method by
integrating silhouettes observed at frontal and side views.
However, the above approaches require the user to wear
skin-tight clothes in order to derive his/her body shape model.
While precise estimation can be achieved, it might not be of practical use, especially for shopping for clothes online. To address this issue, Bălan et
method to estimate the human body shape using 2D clothing
images, aiming at recovering body shape images with minimum reconstruction errors. Guan et al. [7] presented a generative model for achieving the same goal. More specifically,
they combined an underlying naked body shape model (Contour Person [8]) with a low-dimensional clothing one which
is learned from training data. Hasler et al. [9] proposed to estimate the body shape of a dressed person using 3D scanners,
with additional robustness to contortion of the surface caused
by noise or loose clothes. Wuhrer et al. [10] solved the above
task by using 3D scanned video frames. To be more precise,
they utilized a posture-invariant shape space to model the human body with a skeleton-based deformation for modeling
posture variation.
In addition, anthropometric measurements like body
height or waist width also provide useful information cues
for constructing body shapes in 2D. In [11, 12], an extensive amount of such measurements were further utilized for
constructing the 3D body models. On the other hand, Tsoli
et al. [13] proposed a model-based approach to extract such
measurements from the 3D scans. Lin et al. [14] proposed an
automated system to extract the measurements by detecting
the turning points from the silhouette images.
For practicality and simplicity, we address the problem of
body shape estimation using 2D clothing images in this paper.
Using multi-view 2D clothing images, our proposed method
is able to predict the body shape and the associated physical measurements. Compared to existing approaches utilizing single-view clothing images, improved estimation performance can be achieved (as discussed later in Section 4).
3. OUR PROPOSED METHOD
3.1. Modeling of Human Body Shapes
3.1.1. Single-View Body Shape Modeling
For training purposes, we extract silhouette images at a fixed
view from training image data, in which each person wears
Fig. 4. Example of the learned body shape model. Note that the
mean and first three dominant principal components (PCs) are shown
from left to right.
skin-tight clothes for simulating the naked body. As suggested in [2, 6], principal component analysis (PCA) can
be directly applied to such training images for modeling the
shape variability of different subjects.
In practice, since the numbers of sample points along different silhouettes differ, we adopt a part-based sampling technique for feature extraction. For each training (naked) image, we apply the pose estimation algorithm of [15] to generate a skeleton, which is utilized as a reference to parse the corresponding silhouette into several semantic parts (e.g., head, arms, and legs) [16]. After this parsing process is complete, we sample a fixed number of points from the silhouette of each part. Finally, we concatenate the locations of the sampled points (in terms of their x and y coordinates) of each part into a d-dimensional feature vector describing the silhouette of each training image. Figure 3 shows an example of this parsing stage for feature extraction. As depicted in Figure 3(c), different colors represent different semantic parts of the human body.
Once the features for describing human body images are
obtained, we perform PCA to identify the mean µ and the
principal components (PC) of the training images. Figure 4
shows the example images of the mean and the first three
dominant PCs using training silhouette images. As noted
in [2, 6], these PCs not only can be applied for reconstructing the body shape images, they also can be used to model
shape variation, which will be useful for body shape estimation. Once the mean (µ) and the PCs (U ) are derived from the
silhouettes, the body shape image (in terms of the silhouettes)
can be recovered by:
S(β) = Uβ + µ,   (1)

where U contains the first L dominant PCs (of size d × L), and β is the corresponding L-dimensional coefficient vector.
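The PCA shape model of Eq. (1) can be sketched as follows (a minimal NumPy illustration; the function and variable names are ours, and the rows of X are assumed to be the concatenated part-wise point coordinates described above):

```python
import numpy as np

def fit_shape_model(X, L=3):
    """Fit the PCA body shape model of Eq. (1).

    X: (n_samples, d) matrix; each row concatenates the (x, y)
       coordinates of the points sampled from each body part.
    Returns the mean mu (d,) and the first L principal components U (d, L).
    """
    mu = X.mean(axis=0)
    # SVD of the centered data yields the principal components.
    _, _, Vt = np.linalg.svd(X - mu, full_matrices=False)
    U = Vt[:L].T                      # d x L dominant PCs
    return mu, U

def reconstruct(U, mu, beta):
    """S(beta) = U beta + mu  -- Eq. (1)."""
    return U @ beta + mu

def project(U, mu, s):
    """Recover the coefficients beta for an observed silhouette vector s."""
    return U.T @ (s - mu)
```

Projecting a silhouette vector and reconstructing it recovers the original exactly whenever the (centered) vector lies in the span of the retained PCs.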
3.1.2. Multi-View Body Shape Modeling with Measurement
Constraints
The technique presented in Section 3.1.1 is to describe and
model shape variations at a particular camera view. When it
comes to body shape estimation, the use of multi-view images
Fig. 5. Five measurements considered for human body estimation.
is expected to achieve improved accuracy. This is not only because more input images are available for reconstructing the body shape image of interest, but also because the relationships between the measurements across these multi-view images can be further utilized for refining the estimated result.
For each user, since anthropometric measurements like body height, leg length, or waist width can be viewed as constants across different camera views, we exploit this property as a constraint for body shape estimation. In our work, we consider five different measurements which are widely used on online shopping websites. Figure 5 shows
the five measurements of interest. Note that two vertical measurements (i.e., overall height and inside leg length) and three
horizontal measurements (chest width, waist width, and hip
width) are considered. With the parsing image output (as discussed in Section 3.1.1), we extract the landmarks along the
detected silhouettes for calculating the above measurements
(in terms of pixels).
To verify that such anthropometric measurements are constant (or the observed measurements are highly correlated)
across views, we calculate the Pearson correlation coefficients
for the five measurements across two selected views. The resulting Pearson correlation coefficients are listed in Table 1,
which supports the above motivation/statement. Based on this
observation, we propose to apply linear regression to associate the F measurements across (any) two camera views:

m^(t) = A^(s,t) m^(s) + B^(s,t),   (2)

where m^(s) and m^(t) are the measurement vectors (of size F × 1) from the source (s) and target (t) views, respectively; we have F = 5 different measurements in our work. A^(s,t) (of size F × F) and B^(s,t) (of size F × 1) are the regression matrix and bias vector to be learned. We note that, for A^(s,t), we only consider the highly correlated measurements (as shown in Table 1) to learn the regression coefficients. For example, we consider only measurements (D) and (E) to predict measurement (D), since their correlations are larger than 0.8. This implies the learning of sparse regression models and accelerates the learning process. Once A^(s,t) and B^(s,t) are obtained, we apply (2) as a constraint in our proposed objective function for body shape estimation (as discussed in the following subsection).
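A minimal sketch of how the sparse cross-view regression of Eq. (2) could be learned (the function names, the per-row least-squares solver, and the fallback when no measurement passes the correlation threshold are our own assumptions; the paper does not specify these details):

```python
import numpy as np

def fit_cross_view_regression(Ms, Mt, corr_thresh=0.8):
    """Learn A^(s,t) and B^(s,t) of Eq. (2) from training measurements.

    Ms, Mt: (n_subjects, F) measurement matrices for the source and
    target views. For each target measurement, only source measurements
    whose absolute Pearson correlation exceeds `corr_thresh` are used,
    which yields a sparse regression matrix A.
    """
    n, F = Ms.shape
    A = np.zeros((F, F))
    B = np.zeros(F)
    for j in range(F):
        # Pearson correlation of target measurement j with each source one.
        corr = np.array([np.corrcoef(Ms[:, i], Mt[:, j])[0, 1]
                         for i in range(F)])
        idx = np.flatnonzero(np.abs(corr) > corr_thresh)
        if idx.size == 0:              # fallback: use the same index
            idx = np.array([j])
        X = np.hstack([Ms[:, idx], np.ones((n, 1))])
        w, *_ = np.linalg.lstsq(X, Mt[:, j], rcond=None)
        A[j, idx] = w[:-1]
        B[j] = w[-1]
    return A, B

def predict_target(A, B, m_s):
    """m^(t) = A m^(s) + B  -- Eq. (2)."""
    return A @ m_s + B
```

Each row of A is fit independently, so the zero pattern directly encodes the correlation-based sparsity described above.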
Table 1. The Pearson correlation coefficients observed between the
measurements using the SHREC’14 dataset [17].
             view1-(A)   view1-(B)   view1-(C)   view1-(D)   view1-(E)
view2-(A)     0.9969      0.9666      0.5176      0.1885      0.0458
view2-(B)     0.8946      0.9583      0.1774     −0.1455     −0.2789
view2-(C)     0.6739      0.5130      0.9820      0.7138      0.6524
view2-(D)     0.0169     −0.2044      0.5536      0.9001      0.9637
view2-(E)     0.0113     −0.2093      0.5826      0.9327      0.9877
3.2. Body Shape Estimation

3.2.1. Objective Function

After collecting the PCs and the regression models A^(s,t) and B^(s,t) from k = 2 views, our goal is to determine the optimal coefficients β^k for each view, so that the naked human body can be reconstructed. Unlike [6], we do not assume that body shape images in different views share the same reconstruction coefficients β^k. In other words, we allow β^k to be distinct across camera views, while the estimated body shape is constrained and refined by the observed measurements.

Based on the above properties, we propose to perform body shape estimation by solving the following optimization problem:

min_{β^k} Σ_k [ d̃(S^e_{k,β^k}, S^o_k) + E_std(β^k) + η E_me(k) ],   (3)

where S^e_{k,β^k} and S^o_k are the estimated (reconstructed by (1)) and observed silhouettes (from the clothing image input), respectively. The function d̃(·,·) measures the difference between S^o_k and S^e_{k,β^k}, E_std regularizes the reconstruction coefficients, and E_me associates the resulting measurements across the k = 2 camera views. The parameter η penalizes the measurement association errors. As suggested by [6], we determine d̃(·,·) as:

d̃(S, T) = Σ_{i,j} (S_ij · C_ij(T)) / Σ_{i,j} S_ij,   (4)

where S_ij is the pixel (i, j) of silhouette S, and C_ij(T) is a distance function defined as:

C_ij(T) = 0 if S_ij is inside T, and dist(S_ij, p) otherwise,   (5)

where p ∈ T is the closest point to S_ij, and dist(·,·) calculates the Euclidean distance between the points.

Since the silhouette of the estimated body shape should lie within that of the input clothing image, we penalize the recovered pixel points which fall outside the observed silhouette. In addition, we penalize any l-th coefficient β^k_l that would result in wildly unnatural shapes (as noted in [6]). As a result, E_std is defined as:

E_std(β^k) = Σ_l max(0, |β^k_l| / σ^k_l − σ_T)^2,   (6)

where σ^k_l is the standard deviation of all β^k_l calculated during the training phase, and σ_T is a threshold constraining the value of β^k_l. Here, we only penalize β^k_l if its value is larger than 3 times the standard deviation (σ_T = 3).

Finally, in order to preserve the consistency of the measurements across camera views, we consider the predicted measurements from the other views (v ≠ k) as constraints:

E_me(k) = Σ_{v≠k} || m^(k) − (A^(v,k) m^(v) + B^(v,k)) ||_F,   (7)

where m^(v) and m^(k) represent the measurement vectors from views v and k, respectively. Recall that A^(v,k) and B^(v,k) are the regression models learned from training data with highly correlated measurements.
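The silhouette penalty of Eqs. (4)-(5) can be computed efficiently with a Euclidean distance transform, since C_ij(T) is zero inside T and equals the distance to the closest pixel of T outside. A small sketch (assuming binary silhouette masks; the function name is ours):

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def silhouette_distance(S, T):
    """d~(S, T) of Eq. (4): distance of the pixels of S that fall
    outside silhouette T, normalized by the area of S.

    S, T: binary masks (1 inside the silhouette, 0 outside).
    C_ij(T) of Eq. (5) is 0 inside T and the Euclidean distance to the
    closest pixel of T otherwise; this is exactly the Euclidean distance
    transform of the complement of T.
    """
    C = distance_transform_edt(T == 0)   # 0 inside T, dist(., T) outside
    return (S * C).sum() / S.sum()
```

An estimated silhouette fully contained in the observed one thus incurs zero penalty, matching the containment argument above.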
3.2.2. Estimation via Optimization

To solve the minimization problem of (3) for estimating the body shape S^e and measurements m of the input cross-view clothing images S^o_1 and S^o_2, we apply the fminsearch function of Matlab to the images of each view, and update the reconstruction coefficients of the two camera views (i.e., β^1 and β^2) alternately. For initialization, we use the reconstruction coefficients β of the training data closest to S^o_1 and S^o_2 to start the optimization of (3) with η = 0, which improves both the precision and the efficiency of the estimation process. During the subsequent optimization, we fix η = 0.005 in (3) until convergence.
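The alternating scheme above might be sketched as follows, with SciPy's Nelder-Mead simplex method standing in for Matlab's fminsearch. The per-view objectives and the measurement-coupling term are passed in as callables; this is an illustration of the alternation only, not the authors' implementation:

```python
import numpy as np
from scipy.optimize import minimize

def estimate_shapes(objectives, couple, beta0, eta=0.005, n_rounds=5):
    """Alternating estimation in the spirit of Sec. 3.2.2.

    objectives: per-view functions f_k(beta_k) (data term + regularizer).
    couple: function couple(beta_1, beta_2) giving the measurement
            association error E_me between the two views.
    beta0: list with the initial coefficients of each view (e.g. from the
           nearest training silhouettes, optimized first with eta = 0).
    Each round fixes one view and minimizes over the other with
    Nelder-Mead (the analogue of Matlab's fminsearch).
    """
    beta = [b.copy() for b in beta0]
    for _ in range(n_rounds):
        for k in (0, 1):
            other = beta[1 - k]
            def total(b, k=k, other=other):
                pair = (b, other) if k == 0 else (other, b)
                return objectives[k](b) + eta * couple(*pair)
            beta[k] = minimize(total, beta[k], method="Nelder-Mead").x
    return beta
```

With a small η, the coupling term gently pulls the two views' solutions toward measurement consistency without overriding the per-view data terms.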
4. EXPERIMENTS
4.1. Dataset and Settings
We now evaluate the performance of our proposed approach.
We note that, although the dataset considered by Bălan et al. [6] contains both clothing and body images, only 6 subjects are available. To collect training and test images at multiple camera views, we apply the DC Suite1, a virtual try-on software package for modeling the 3D human body, to the dataset of SHREC'14 [17]. This allows us to synthesize both naked body (training) and clothing (test) images at different views. Although the SHREC'14 dataset contains body models for 20 males and 20 females, we remove two outlier female models (one is extremely tall, while the other is clearly overweight compared to the others) before conducting the experiments.
Since the clothing templates of the DC Suite are only
available for females, we eventually synthesize and collect
18 females with 4 different kinds of clothing appearances at
2 different camera views. Each image is of size 500 × 694
pixels. Figure 6 shows examples for both body and clothing
images at the two camera views considered. When performing the evaluation, we randomly select half of the subjects and use their body images as training data. The clothing images of the remaining half are used for testing. Finally, we report the average results over 3 random trials.
1 http://www.physan.net/eng/main/main.asp
Fig. 6. Examples of (a) naked body and (b) clothing image data. Note that the top row shows the images generated by the 3D human body model [17] (using DC Suite), while the next two rows show the corresponding silhouette images at the two different views.

Fig. 7. Comparisons of average estimation pixel errors.

Fig. 8. Comparisons of average measurement errors.

4.2. Discussions

4.2.1. Evaluation at the pixel level
Given the clothing images as inputs, we apply two different metrics for evaluation. The pixel error (PixErr) calculates the mean squared error (MSE) between the estimated and ground truth silhouette images:

PixErr = (1/N) Σ_{i=1}^{N} (p_i^e − p_i^gt)^2,   (8)

where p_i^e and p_i^gt indicate the locations of the i-th pixel from the estimated and ground truth silhouette images, respectively, and N is the total number of non-zero pixels in a silhouette image. In addition, we calculate the estimation errors from the recovered measurements, since such values are directly associated with the sizes of clothing items. We define the measurement error (MeaErr) as:

MeaErr = Σ_{i=1}^{F} (me_i^e − me_i^gt)^2,   (9)

where me^e and me^gt are the measurements derived from the estimated and ground truth silhouettes, respectively (in terms of pixels), and F = 5 is the total number of measurements considered in this paper.
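The two metrics can be implemented directly (a sketch under our reading of Eqs. (8)-(9), where the squared difference of pixel locations is taken as the squared Euclidean distance between corresponding points; function names are ours):

```python
import numpy as np

def pix_err(p_est, p_gt):
    """PixErr of Eq. (8): mean squared distance between corresponding
    silhouette points. p_est, p_gt: (N, 2) arrays of pixel locations."""
    return float(np.mean(np.sum((p_est - p_gt) ** 2, axis=1)))

def mea_err(m_est, m_gt):
    """MeaErr of Eq. (9): sum of squared differences over the F
    measurements (in pixels)."""
    return float(np.sum((np.asarray(m_est) - np.asarray(m_gt)) ** 2))
```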
For comparison, a single-view (SV) approach is first applied as a baseline, which adopts the objective function of (3) at only one view. Next, we consider the approach of [6], which deals with multi-view inputs but assumes that the reconstruction coefficients are the same across views (denoted as MV-shared). We further extend [6] by allowing distinct multi-view coefficients but without utilizing the measurement constraints (i.e., without the last term in (3), denoted as MV-noM). Finally, the comparisons of PixErr and MeaErr are presented in Figures 7 and 8, respectively (note that our method is denoted as MV).

From Figures 7 and 8, it can be seen that our approach outperformed the baseline approaches. That is, by using cross-view clothing images while associating the derived measurements, our method achieved satisfactory results. Compared to approaches which consider only single-view inputs or require shared reconstruction coefficients across views, our method provides improved precision in estimating body shapes.
4.2.2. Evaluation in terms of physical sizes and beyond

Since the above two metrics are evaluated at the pixel level, we further provide an evaluation in terms of physical sizes. Taking measurements (A) and (B) in Figure 5 as examples, we list their measurement errors and the corresponding size differences in Table 2 (in terms of pixel errors and millimeters, respectively). We note that, in Table 2, the physical sizes are calculated using Body Visualizer2.

Finally, Figure 9 shows example clothing image inputs and the estimated body shapes. From these visual comparisons, we see that our method performed favorably against the other three baseline approaches. For body parts like the arms or the waist, the body shapes recovered by our algorithm better fit the ground truth images. From the above quantitative and qualitative comparisons, the effectiveness of our proposed method is verified.
2 http://bodyvisualizer.com/
(a) Visualization of body shape estimation at the frontal view.
(b) Visualization of body shape estimation at a 45-degree view.
Fig. 9. Example estimation results of four different methods (from left to right: SV, MV-shared, MV-noM, and MV (ours)). Note that the
ground truth, observed, and estimated images/silhouettes are shown in red, yellow, and green, respectively.
Table 2. Average estimation errors for two selected measurements
in terms of pixels and physical lengths (in millimeter).
pixel error / mm     SV           MV-shared    MV-noM       MV
Measurement-(A)      7.60/28.84   6.13/21.35   4.28/14.90   3.67/12.76
Measurement-(B)      6.54/20.28   5.81/18.01   5.44/16.87   4.80/14.89
5. CONCLUSION

We proposed a novel framework for body shape and measurement estimation, which only requires 2D clothing images as input data. Our proposed algorithm can be viewed as constructing a parametric model which recovers body shape images using information observed across different camera views. More specifically, our method focuses on reconstructing the body shape with both multi-view image and measurement guarantees. Different from prior approaches, our method does not require the coefficients for image reconstruction to be the same across camera views, while we introduce additional constraints on the observed measurements for improved estimation. Quantitative and qualitative experiments on a 2D clothing image dataset supported the use of our approach, which was shown to perform favorably against single-view and baseline approaches.

Acknowledgement This work is supported in part by the Ministry of Science and Technology of Taiwan via MOST103-2221-E-001-021-MY2 and NSC102-2221-E-001-005-MY2.

6. REFERENCES

[1] S. Hauswiesner, M. Straka, and G. Reitmayr, "Virtual try-on through image-based rendering," IEEE TVCG, 2013.

[2] D. Anguelov, P. Srinivasan, D. Koller, S. Thrun, J. Rodgers, and J. Davis, "SCAPE: shape completion and animation of people," ACM Trans. Graph. (Proc. of SIGGRAPH), 2005.

[3] J. Tong, J. Zhou, L. Liu, Z. Pan, and H. Yan, "Scanning 3D full human bodies using Kinects," IEEE TVCG, 2012.

[4] P. Guan, A. Weiss, A. O. Bălan, and M. J. Black, "Estimating human shape and pose from a single image," ICCV, 2009.

[5] J. Boisvert, C. Shu, S. Wuhrer, and P. Xi, "Three-dimensional human shape inference from silhouettes: reconstruction and validation," Mach. Vision Appl., 2013.

[6] A. O. Bălan and M. J. Black, "The naked truth: Estimating body shape under clothing," ECCV, 2008.
[7] P. Guan, O. Freifeld, and M. J. Black, “A 2D human body
model dressed in eigen clothing,” ECCV, 2010.
[8] O. Freifeld, A. Weiss, S. Zuffi, and M. J. Black, “Contour people: A parameterized model of 2D articulated human shape,”
IEEE CVPR, 2010.
[9] N. Hasler, C. Stoll, B. Rosenhahn, T. Thormählen, and H.-P.
Seidel, “Estimating body shape of dressed humans,” Computers & Graphics, 2009.
[10] S. Wuhrer, L. Pishchulin, A. Brunton, C. Shu, and J. Lang,
“Estimation of human body shape and posture under clothing,”
Comput. Vis. Image Underst., 2014.
[11] S. Wuhrer and C. Shu, “Estimating 3D human shapes from
measurements,” Mach. Vision Appl., 2013.
[12] Y. Chen, D. Robertson, and R. Cipolla, “A practical system
for modelling body shapes from single view measurements,”
BMVC, 2011.
[13] A. Tsoli, M. Loper, and M. J. Black, “Model-based anthropometry: Predicting measurements from 3D human scans in
multiple poses,” WACV, 2014.
[14] Y.-L. Lin and M.-J. J. Wang, “Automated body feature extraction from 2D images,” Expert Syst. Appl., 2011.
[15] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari,
“2D articulated human pose estimation and retrieval in (almost)
unconstrained still images,” IJCV, 2012.
[16] J. Dong, Q. Chen, X. Shen, J. Yang, and S. Yan, “Towards unified human parsing and pose estimation,” IEEE CVPR, 2014.
[17] D. Pickup et al., "SHREC'14 track: Shape retrieval of non-rigid
3D human models,” Proc. of the 7th Eurographics workshop
on 3D Object Retrieval, 2014.