
LEARNING-BASED PERCEPTUAL IMAGE QUALITY IMPROVEMENT FOR VIDEO
CONFERENCING
Zicheng Liu, Cha Zhang and Zhengyou Zhang
Microsoft Research
One Microsoft Way
Redmond, WA 98052, USA
ABSTRACT
It is well known that in professional TV show filming, stage lighting has to be carefully designed in order to make the host and the
scene look visually appealing. The lighting affects not only the
brightness but also the color tone which plays a critical role in the
perceived look of the host and the mood of the stage. In contrast,
during video conferencing, the lighting is usually far from ideal, and thus the perceived image quality is low. There has been a lot of research on improving the brightness of the captured images, but as
far as we know, there has not been any work addressing the color
tone issue. In this paper, we propose a learning-based technique to
improve the perceptual image quality during video conferencing.
The basic idea is to learn the color statistics from a training set of
images which look visually appealing, and adjust the color of an
input image so that its color statistics match those in the training set. To validate our approach, we conducted a user study, and the results show that our technique significantly improves the perceived image quality.
Fig. 1. Images on the left are input frames captured by two
different webcams. Images on the right are the enhanced
frames by the proposed method.
1. INTRODUCTION
Online video conferencing has attracted more and more interest recently due to improvements in network bandwidth, computer performance and compression technologies. Webcams are becoming popular among home users, allowing them to chat with family and friends over the Internet for free. However, the quality of webcams varies significantly from one to another. One common issue is that
webcams tend to perform poorly under challenging lighting conditions such as low light or back light. In the past
several years, we have seen a lot of progress on brightness and contrast improvement in both hardware design and
image processing. Nowadays most commercially available
webcams are equipped with automatic gain control. Some
webcams even perform real time face tracking and use the
face area as the region of interest to determine the gain control parameters. Many image processing techniques have been developed to improve the brightness and contrast of the captured images [8, 1, 4, 7].
Exposure, however, is not the only factor that affects the
perceptual quality of a webcam video. As shown in Fig. 1,
the images on the left are frames directly obtained from an
off-the-shelf webcam. While the exposure of both frames is good, they look pale and unpleasant. In contrast, the images on the right are much more appealing to most
users. In TV show filming, it is no secret that stage lighting
has to be carefully designed to make the images look visually appealing [3]. The lighting design involves not only the
brightness but also the color scheme, because the color tone
is essential in the perception of the look of the host and the
mood of the stage.
As far as we know, there has not been any work addressing the color tone issue in video conferencing. Since color
tone is more subjective than brightness and contrast, it is
difficult to come up with a quantitative formula to define
what constitutes a visually appealing color tone. Therefore, existing approaches to example-based image enhancement often require a user to select an example image. For instance,
Reinhard et al. [6] proposed a color transfer technique to impose one image’s color characteristics on another. The user
needs to select an appropriate target image for a given input image to conduct the color transfer. Qiu [5] proposed to
apply content-based image retrieval technology to obtain a
set of image clusters. Given an input image, its color can be
modified by selecting an appropriate cluster as target. They
showed that this technique works well for color enhancement of outdoor scenes when the user is available to select
an appropriate cluster. However, they did not address the perceptual issues of face images, and it is not clear how their technique could be applied to the video conferencing scenario.
In this paper, we propose a data driven approach for
video enhancement in video conferencing applications. The
basic idea is to use a set of professionally taken face images
as training examples. These images are often fine-tuned to
make sure the skin-tone and the contrast of the face regions
are visually pleasing. Given a new image, we adjust its
color so that the color statistics in the face region are similar to those of the training examples. This procedure automates the
enhancement process and is extremely efficient to compute.
We report our user study experiments which show that the
proposed method can significantly improve the perceived
video quality.
2. LEARNING-BASED COLOR TONE MAPPING
At the core of our algorithm is what we call learning-based
color tone mapping. The basic idea is to select a set of
training images which look good perceptually, and build a
Gaussian mixture model for the color distribution in the face
region. For any given input image, we perform color tone
mapping so that its color statistics in the face region matches
the training examples.
Let n denote the number of training images. For each
training image Ii , we perform automatic face detection [9]
to identify the face region. For each color channel, the mean
and standard deviation are computed for all the pixels in the
face region. Let vi = (m1^i, m2^i, m3^i, σ1^i, σ2^i, σ3^i)^T denote the vector that consists of the means and standard deviations of the three color channels in the face region. The distribution of the vectors {vi}, 1 ≤ i ≤ n, is modeled as a mixture of Gaussians. Let m denote the number of mixture components. Let
(µj , Σj ) denote the mean vector and covariance matrix of
the j’th Gaussian mixture component, j = 1, ..., m.
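As a concrete illustration, the training stage could be sketched as follows. This is a minimal sketch rather than the authors' implementation: the face detector is abstracted behind a hypothetical detect_face helper, and scikit-learn's EM-based GaussianMixture stands in for the paper's Gaussian mixture estimation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def face_color_stats(image, face_box):
    """Mean and standard deviation of each color channel inside the face region.

    image: H x W x 3 uint8 array; face_box: (x, y, w, h) from a face detector.
    Returns v = (m1, m2, m3, sigma1, sigma2, sigma3).
    """
    x, y, w, h = face_box
    face = image[y:y + h, x:x + w].reshape(-1, 3).astype(np.float64)
    return np.concatenate([face.mean(axis=0), face.std(axis=0)])

def train_color_model(training_images, detect_face, n_components=5):
    """Fit a Gaussian mixture to the face-region color statistics.

    detect_face is assumed to return an (x, y, w, h) face box, e.g. from a
    Viola-Jones style detector [9]; n_components follows the paper's choice of 5.
    """
    V = np.stack([face_color_stats(img, detect_face(img)) for img in training_images])
    gmm = GaussianMixture(n_components=n_components, covariance_type='full')
    gmm.fit(V)
    return gmm.means_, gmm.covariances_   # (mu_j, Sigma_j), j = 1..m
```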
Given any input image, let v = (m1, m2, m3, σ1, σ2, σ3)^T denote the means and standard deviations of the three color channels in the face region. Let Dj(v) denote the Mahalanobis distance from v to the j'th component, that is,

Dj(v) = sqrt( (v − µj)^T Σj^{-1} (v − µj) ).    (1)
The target mean and standard deviation vector for v is
defined as a weighted sum of the Gaussian mixture component centers µj , j = 1, ..., m, where the weights are inversely proportional to the Mahalanobis distances. More
specifically, denoting v̄ = (m̄1, m̄2, m̄3, σ̄1, σ̄2, σ̄3)^T as the target mean and standard deviation vector, we have
v̄ = Σ_{j=1}^{m} wj µj,    (2)

where

wj = (1/Dj(v)) / ( Σ_{l=1}^{m} 1/Dl(v) ).    (3)
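In code, the target vector v̄ of Equations (1)-(3) could be computed as below; this is only a sketch assuming the mu and Sigma arrays returned by the training step, and it does not handle singular covariances.

```python
import numpy as np

def target_stats(v, mu, Sigma, eps=1e-8):
    """Weighted target mean/std vector v_bar (Eqs. 1-3).

    v: length-6 stats of the input image's face region.
    mu: (m, 6) component means; Sigma: (m, 6, 6) component covariances.
    """
    diffs = mu - v                                        # (m, 6)
    d = np.sqrt(np.einsum('mi,mij,mj->m', diffs,
                          np.linalg.inv(Sigma), diffs))   # Mahalanobis distances D_j(v)
    w = 1.0 / (d + eps)                                   # inverse-distance weights
    w /= w.sum()                                          # normalize (Eq. 3)
    return w @ mu                                         # v_bar (Eq. 2)
```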
After we obtain the target mean and deviation vector, we
perform color tone mapping for each color channel to match
the target distribution. For color channel c, c = 1, 2, 3, let
y = fc (x) denote the desired tone mapping function. In
order to map the average intensity from mc to m̄c , fc (x)
needs to satisfy
fc(mc) = m̄c.    (4)

In order to modify the standard deviation σc to match the target σ̄c, we would like the derivative at mc to be equal to σ̄c/σc, that is,

fc'(mc) = σ̄c / σc.    (5)

In addition, we need to make sure fc(x) stays in the range [0, 255], that is,

fc(0) = 0,  fc(255) = 255.    (6)
A simple function that satisfies these constraints is

fc(x) = 0                           if x < 0,
fc(x) = (σ̄c/σc)(x − mc) + m̄c       if 0 ≤ x ≤ 255,
fc(x) = 255                         if x > 255.    (7)
The drawback of this function is that it saturates quickly at the low and high intensities. To overcome this problem, we instead fit a piecewise cubic spline that satisfies these constraints. To prevent quick saturation at the two ends, we add constraints on the derivatives fc'(0) and fc'(255) as follows:

fc'(0) = 0.5 · m̄c / mc,
fc'(255) = 0.5 · (255 − m̄c) / (255 − mc).    (8)

Given the constraints of Equations (4), (5), (6), and (8), we can fit a piecewise cubic spline that satisfies them [2]. The details are omitted here.
Figure 2(a) shows a tone mapping curve generated using Equation (7). Figure 2(b) shows the curve generated using piecewise cubic splines.
Note that the color tone mapping function is created
based on the pixels in the face region, but it is applied to
all the pixels in a given input image.
Fig. 2. Left: Tone mapping curve by using linear function. Right: Tone mapping curve by using piecewise cubic
splines.
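To make the construction concrete, the sketch below builds each per-channel mapping as a cubic Hermite spline through the anchor points (0, 0), (mc, m̄c), (255, 255) with the derivatives of Equations (5) and (8), then applies it to the whole image through a 256-entry lookup table. This is one plausible realization of the constraints (assuming 0 < mc < 255), not necessarily the paper's exact spline fit [2].

```python
import numpy as np
from scipy.interpolate import CubicHermiteSpline

def tone_curve(m_c, m_bar, sigma_c, sigma_bar):
    """Per-channel tone mapping curve satisfying Eqs. (4)-(6) and (8).

    Anchors: f(0)=0, f(m_c)=m_bar, f(255)=255, with slope sigma_bar/sigma_c
    at m_c and damped slopes at the two ends to avoid early saturation.
    """
    x = np.array([0.0, m_c, 255.0])
    y = np.array([0.0, m_bar, 255.0])
    dydx = np.array([0.5 * m_bar / m_c,                       # Eq. (8), left end
                     sigma_bar / sigma_c,                     # Eq. (5)
                     0.5 * (255.0 - m_bar) / (255.0 - m_c)])  # Eq. (8), right end
    spline = CubicHermiteSpline(x, y, dydx)
    return np.clip(spline(np.arange(256)), 0, 255).astype(np.uint8)

def apply_tone_mapping(image, v, v_bar):
    """Map every pixel of a uint8 image, channel by channel, using face-based stats."""
    out = np.empty_like(image)
    for c in range(3):
        lut = tone_curve(v[c], v_bar[c], v[c + 3], v_bar[c + 3])
        out[..., c] = lut[image[..., c]]
    return out
```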
3. APPLICATION TO VIDEO SEQUENCES
During video conferencing, the overall image intensity changes
due to automatic gain control, lighting change, and camera
movement. We need to update the color tone mapping function when such a change occurs. For this purpose, we have
developed an overall image intensity change detector. It
works by keeping track of the mean of the average intensity
of the non-face region in the history frames. Given a new
frame, if its average intensity in the non-face region differs
from the accumulated mean by an amount larger than a predefined threshold, it is decided that there is an overall image
intensity change.
Figure 3 shows a flow chart of the algorithm where Y is
average intensity of the non-face region of the new frame,
Ȳ is the accumulated mean of the frames in the history, ∆
is a user defined threshold for determining whether there is
an overall image intensity change, and α is a parameter that
controls the update speed of Ȳ. If the current frame does not have an overall intensity change, Ȳ is updated using the current frame's Y; otherwise, we reset Ȳ to be equal to Y.
Fig. 3. Flow diagram of the learning-based color tone mapping algorithm applied to video sequences.
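A minimal sketch of this detector is given below; delta and alpha correspond to the ∆ and α of Figure 3, and the specific default values are illustrative, not the paper's.

```python
class IntensityChangeDetector:
    """Tracks the running mean of the non-face-region intensity (Y-bar of Fig. 3)."""

    def __init__(self, delta=10.0, alpha=0.05):
        self.delta = delta      # threshold for declaring an overall intensity change
        self.alpha = alpha      # update speed of the running mean
        self.y_bar = None       # accumulated mean intensity of the history frames

    def update(self, y):
        """y: average intensity of the non-face region of the new frame.

        Returns True if an overall image intensity change is detected, in which
        case the tone mapping function should be recomputed and Y-bar reset to y.
        """
        if self.y_bar is None or abs(y - self.y_bar) > self.delta:
            self.y_bar = y
            return True
        self.y_bar = (1 - self.alpha) * self.y_bar + self.alpha * y
        return False
```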
4. EXPERIMENT RESULTS

We collected approximately 400 celebrity images from the web as our training data set. It covers different types of skin colors. We use an EM algorithm to construct a Gaussian mixture model with 5 mixture components. Figure 4 shows a sample image for each class. Please note that these images are not necessarily the most representative images in their classes. Figure 5 shows the mean colors of the five classes.

Fig. 4. A sample image from each class.

Fig. 5. Mean colors of the five classes.

To reduce computation, we perform overall intensity change detection every two seconds. The color tone mapping is performed on the RGB channels. When an overall image intensity change is detected, the color tone mapping function is recomputed for each color channel and stored in a lookup table. In the subsequent frames, the color tone mapping operation becomes a simple table lookup, which is extremely fast. Our system uses only 5% CPU time for 320 × 240 video at a frame rate of 30 frames per second on a 3.2 GHz machine.

We have tested our system on a variety of video cameras and different people; we report our user study results in the next section. Figure 1 shows an example of learning-based color tone mapping. The images on the left are input frames from two different video sequences. The images on the right are the results of learning-based color tone mapping. The resulting images are perceived as having colored lights added to the scene, which makes the scene look slightly brighter and, more importantly, gives it a warmer tone.

5. USER STUDY

We collected 16 teleconferencing videos with 8 different cameras, including those made by Logitech, Creative, Veo, Dynex, Microsoft, etc. Each video sequence lasts about 10 seconds. A field study was conducted, where the users were asked to view the original video and the enhanced video side by side (as shown in Figure 1). The orders of the two videos were randomly shuffled and unknown to the users in order to avoid bias. After viewing the videos, the users gave each video a score in the range of 1–5, where 1 indicates very poor quality, 5 indicates very good quality, and a score of 3 is considered acceptable. A total of 18 users responded in our test. All the users said they used LCD displays to watch the videos.

Figure 6 shows the average scores of the 18 users for the 16 video sequences. It can be seen that the enhanced video outperforms the original video in almost all the sequences. The average score of the original video is merely 2.55, which is below the acceptable level, while the proposed method has a score of 3.30, which is above the acceptable level. The t-test value between the two algorithms over the 16 sequences is 0.001%, which shows that the difference is statistically significant.

Fig. 6. User study results.
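The significance test above amounts to a paired comparison over the 16 per-sequence scores. A minimal sketch of how such a paired t-test could be computed is given below; the argument names are hypothetical and this is not the authors' analysis script.

```python
import numpy as np
from scipy import stats

def compare_scores(orig_scores, enhanced_scores):
    """Paired t-test over per-sequence mean user scores.

    orig_scores, enhanced_scores: length-16 arrays holding the average user
    score of each sequence for the original and enhanced videos
    (hypothetical names; the real data come from the 18-user study).
    """
    t_stat, p_value = stats.ttest_rel(np.asarray(enhanced_scores),
                                      np.asarray(orig_scores))
    return t_stat, p_value
```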
6. CONCLUSION
We have presented a novel technique, called learning-based
color tone mapping, to improve the perceptual image quality for video conferencing. Compared to existing image enhancement techniques, the novelty of our technique is that
we adjust not just the brightness, but more importantly, the
color tone. We have tested our system on a variety of webcam devices and conducted a user study. The results show that our technique significantly improves people's perception of the video quality.
7. REFERENCES
[1] S. A. Bhukhanwala and T. V. Ramabadram. Automated global enhancement of digitized photographs. IEEE Transactions on Consumer Electronics, 40(1), 1994.

[2] G. Farin. Curves and Surfaces for CAGD: A Practical Guide. Academic Press, 1993.

[3] N. Fraser. Stage Lighting Design: A Practical Guide. Crowood Press, 2000.
[4] G. Messina, A. Castorina, S. Battiato, and A. Bosco. Image
quality improvement by adaptive exposure correction techniques. In IEEE International Conference on Multimedia and
Expo (ICME), pages 549–552, Amsterdam, The Netherlands,
July 2003.
[5] G. Qiu. From content-based image retrieval to example-based
image processing. University of Nottingham Technical Report: Report-cvip-05-2004, May 2004.
[6] E. Reinhard, M. Ashikhmin, B. Gooch, and P. Shirley. Color
transfer between images. IEEE Computer Graphics and Applications, 34-41, September/October 2001.
[7] F. Saitoh. Image contrast enhancement using genetic algorithm. In IEEE International Conference on SMC, pages 899–
904, Amsterdam, The Netherlands, October 1999.
[8] C. Shi, K. Yu, J. Li, and S. Li. Automatic image quality improvement for videoconferencing. In IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2004.
[9] P. Viola and M. Jones. Robust real-time object detection. In
Second International Workshop on Statistical and Computational Theories of Vision, Vancouver, July 2001.