A SPATIOTEMPORAL MOST-APPARENT-DISTORTION MODEL
FOR VIDEO QUALITY ASSESSMENT
Phong V. Vu, Cuong T. Vu, and Damon M. Chandler
Laboratory of Computational Perception and Image Quality,
School of Electrical and Computer Engineering,
Oklahoma State University, Stillwater OK 74078 USA
ABSTRACT
This paper presents an algorithm for video quality assessment, spatiotemporal MAD (ST-MAD), which extends our
previous image-based algorithm (MAD [1]) to take into account visual perception of motion artifacts. ST-MAD employs spatiotemporal “images” (STS images [2]) created by
taking time-based slices of the original and distorted videos.
Motion artifacts manifest in the STS images as spatial artifacts, which allows one to quantify motion-based distortion
by using classical image-quality assessment techniques. ST-MAD estimates motion-based distortion by applying MAD's
appearance-based model to compare the distorted video’s
STS images to the original video’s STS images. This comparison is further adjusted by using optical-flow-derived weights
designed to give greater precedence to fast-moving regions
located toward the center of the video. Testing on the LIVE
video database demonstrates that ST-MAD performs well in
predicting video quality.
Index Terms— video quality, image quality, distortion,
quality assessment
1. INTRODUCTION
The ability to quantify the quality of an image or video is a
crucial step for any system that processes digital media. Yet,
determining quality in a manner that agrees with human perception remains one of the greatest ongoing challenges in image processing.
Standard approaches to video quality assessment generally employ some form of frame-by-frame comparison. These
comparisons can be made by using quality assessment algorithms designed for still images (e.g., MS-SSIM [3], VSNR
[4]). Most often, however, substantial improvement can be
made by analyzing temporal information to take into account
motion artifacts. For example, Wang et al. [5] presented a
video quality assessment technique which combines the spatial SSIM algorithm with an additional stage that employs
motion analysis. The VQM algorithm of Pinson and Wolf
[6] employs “quality features” that capture spatial, temporal,
and color-based differences between the original and distorted
videos. More recently, Seshadrinathan and Bovik proposed an
algorithm called MOVIE [7] which uses a 3D Gabor analysis
to measure spatial quality, temporal quality, and spatiotemporal quality (along motion trajectories). A more thorough
review of these and other techniques can be found in [8].
In this paper, we present an algorithm for video quality assessment which recasts the temporal quality assessment task
into a spatial quality assessment task by using “images” created by taking spatiotemporal slices of the original and distorted videos. As argued in [2], the patterns created by using
these spatiotemporal slice images (STS images) can provide
a visual summary of the type of motion present in the video
along a large temporal scale. Our new algorithm, spatiotemporal MAD (ST-MAD), applies the spatial appearance-based
model of our previous image-quality assessment algorithm
(MAD [1]) to the STS images. Our main assumptions are:
(1) motion artifacts will manifest as spatial artifacts in these
STS patterns for the original vs. distorted videos; and (2) the
appearance-based model of MAD can quantify these changes
in a manner that agrees with human perception. We demonstrate the efficacy of this approach on videos from the LIVE
video database [8].
This paper is organized as follows: In Section 2, we
provide details of the ST-MAD algorithm. In Section 3, we
present and discuss the results of this algorithm on videos
from the LIVE video database. General conclusions are
presented in Section 4.
2. ALGORITHM
2.1. Spatial MAD
The original (spatial-only) MAD algorithm consists of two
stages: (1) a detection-based stage, which computes the
perceived degradation due to visual detection of distortions
(ddetect); and (2) an appearance-based stage, which computes
the perceived degradation due to visual appearance changes
(dappear ). The detection-based stage of MAD computes
ddetect by using a masking-weighted block-based MSE computed in the lightness domain. The appearance-based stage of
MAD computes dappear by computing the average difference
between the block-based log-Gabor statistics of the original
image and those of the distorted image.¹
In the original MAD algorithm, the two scalar values
ddetect and dappear were combined into an overall distortion value via a weighted geometric mean. The weight was
selected to give greater contribution from ddetect for mildly
distorted images and greater contribution from dappear for
more heavily distorted images.
In ST-MAD, we apply MAD to each frame to compute the
sets {ddetect,t } and {dappear,t }, where t = 1, . . . , T denotes
the frame index. These values are then averaged across all
frames as follows:
$$ d_{\mathrm{detect,avg}} = \frac{1}{T}\sum_{t=1}^{T} d_{\mathrm{detect},t} \qquad (1) $$

$$ d_{\mathrm{appear,avg}} = \frac{1}{T}\sum_{t=1}^{T} d_{\mathrm{appear},t} \qquad (2) $$
Finally, we employ the following adaptive combination rule
to arrive at a single scalar distortion value, dspatial , for the
entire video:
$$ d_{\mathrm{spatial}} = d_{\mathrm{detect,avg}}^{\alpha} + d_{\mathrm{appear,avg}}^{1-\alpha} \qquad (3) $$
where the weight α is selected as specified in [1] using
ddetect,avg to estimate the prevailing amount of distortion.
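As a concrete illustration of this pooling, the following minimal Python/NumPy sketch (not the authors' code) implements Eqs. (1)-(3); `mad_frame` is a hypothetical helper returning the per-frame (ddetect, dappear) pair from the original MAD algorithm [1], and the adaptive choice of α is simplified to a plain argument.

```python
# Minimal sketch of the spatial pooling in Eqs. (1)-(3).
# "mad_frame" is a hypothetical stand-in for the per-frame MAD computation of [1].
import numpy as np

def spatial_mad(ref_frames, dst_frames, mad_frame, alpha=0.5):
    d_detect, d_appear = [], []
    for ref, dst in zip(ref_frames, dst_frames):
        dd, da = mad_frame(ref, dst)      # per-frame detection / appearance values
        d_detect.append(dd)
        d_appear.append(da)

    d_detect_avg = np.mean(d_detect)      # Eq. (1)
    d_appear_avg = np.mean(d_appear)      # Eq. (2)

    # Eq. (3); in ST-MAD, alpha is chosen adaptively from d_detect_avg as in [1],
    # whereas here it is simply passed in as an argument.
    return d_detect_avg ** alpha + d_appear_avg ** (1.0 - alpha)
```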
2.2. Temporal MAD
To estimate the perception of motion-based distortion, ST-MAD performs three steps: (1) it applies MAD's appearance-based model to spatiotemporal images created from the original and distorted videos; (2) it weights these values by using
optical-flow-derived weights designed to give greater precedence to fast-moving regions located toward the center of the
video; and then (3) it combines the values from Step 1 with
the weights from Step 2 using a combination rule that varies
according to the prevailing amount of motion.
2.2.1. Step 1: Compute STS Images
A spatiotemporal slice (STS) image is constructed by extracting from each video frame a single row or column of the
frame. The extracted rows/columns are then assembled into
the STS image by stacking the rows (or columns) from top to
bottom (or from left to right), where the tth row/column in
the STS image corresponds to frame t.
Specifically, let I = [I1 , I2 , . . . , IT ] denote a video where
each frame It is of size C × R pixels. Let rowr (It ) denote
the rth row of It ; and let colc (It ) denote the cth column of It .
Let S_r^rows(I) denote the STS image constructed from the rth row of each frame of I, and let S_c^cols(I) denote the STS image constructed from the cth column of each frame of I. The STS images are constructed by performing the following assignment for all T frames, t = 1, ..., T:

$$ S_r^{\mathrm{rows}}(I_t) = \mathrm{row}_r(I_t) \qquad (4) $$

$$ S_c^{\mathrm{cols}}(I_t) = \mathrm{col}_c(I_t) \qquad (5) $$
¹ Due to space limitations, we can provide only a cursory overview of the
original MAD algorithm. We refer readers to [1] for a complete description.
Fig. 1. STS images from the 205th row of each frame of the original video [S_205^rows(I)] and of the distorted video [S_205^rows(Î)].
Here, Srrows (It ) denotes the tth row of Srrows (I) constructed
from the rth row of frame It . Similarly, Sccols (It ) denotes the
tth column of Sccols (I) constructed from the cth column of
frame It .
By performing the above assignments for all T frames,
the STS images are constructed. Thus, Srrows (I) is a C ×
T image which contains spatial information in the horizontal
direction and temporal information in the vertical direction.
Similarly, Sccols (I) is a T × R image which contains temporal
information in the horizontal direction and spatial information
in the vertical direction.
Note that because each row r of It gives rise to a unique
STS image, one could compute a total of R versions of
Srrows (I), one for each of the R rows of each frame; and, one
could compute a total of C versions of Sccols (I), one for each
of the C columns of each frame. Here, to reduce the amount
of computation, we compute STS images for only one out of
every 8 rows or columns, resulting in a total of R/8 versions
of Srrows (I) and C/8 versions of Sccols (I). These STS images
are computed both for the original video frames [Srrows(I)
and Sccols (I)] and for the distorted video frames [Srrows(Î)
and Sccols (Î)], where Î denotes the distorted video.
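As a concrete illustration of this construction (a NumPy sketch under our own conventions, not the authors' implementation), the STS images of Eqs. (4)-(5) with the one-in-eight subsampling can be built as follows:

```python
import numpy as np

def sts_images(video, step=8):
    """Build subsampled STS images from a grayscale video of shape (T, R, C).

    Returns two dicts:
      rows_sts[r] : array of shape (T, C); each of the T rows holds row r of
                    the corresponding frame (Eq. (4), temporal axis vertical)
      cols_sts[c] : array of shape (R, T); each of the T columns holds column c
                    of the corresponding frame (Eq. (5), temporal axis horizontal)
    Only one of every `step` rows/columns is used, as described above.
    """
    T, R, C = video.shape
    rows_sts = {r: video[:, r, :].copy() for r in range(0, R, step)}
    cols_sts = {c: video[:, :, c].T.copy() for c in range(0, C, step)}
    return rows_sts, cols_sts
```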
Figure 1 shows the STS images computed by taking the 205th row of each frame from the original video [S_205^rows(I)] and from the distorted video [S_205^rows(Î)]. Motion artifacts manifest as differences between the STS images for the original and distorted videos (see the close-ups).
2.2.2. Step 2: Compute Motion Weights
To compute the motion weights, we apply the optical flow algorithm of Lucas and Kanade [9] to the original video. This
algorithm yields a motion-vector matrix for each pair of consecutive frames. We reduce the size of each matrix by averaging its entries in each 8 × 8 block to yield the smaller motion-
vector matrix Mt (r, c) of size R/8 rows and C/8 columns.
Each entry in Mt (r, c) specifies local motion between frame
t − 1 and t.
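As a rough sketch of how the per-pair block-motion matrices Mt might be obtained: the paper uses the Lucas-Kanade method [9], but the illustration below substitutes OpenCV's dense Farneback flow (an assumption made purely so the example is self-contained), followed by the 8 × 8 block averaging described above.

```python
import cv2
import numpy as np

def block_motion_matrices(frames, block=8):
    """frames: list of grayscale uint8 frames of size R x C.
    Returns one (R//block) x (C//block) motion-magnitude matrix per frame pair."""
    mats = []
    for prev, curr in zip(frames[:-1], frames[1:]):
        # Dense optical flow (Farneback here; the paper uses Lucas-Kanade [9]).
        flow = cv2.calcOpticalFlowFarneback(prev, curr, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mag = np.linalg.norm(flow, axis=2)      # per-pixel motion magnitude
        R, C = mag.shape
        Rb, Cb = R // block, C // block
        mag = mag[:Rb * block, :Cb * block]     # crop to a multiple of the block size
        # Average the magnitudes within each 8 x 8 block.
        mats.append(mag.reshape(Rb, block, Cb, block).mean(axis=(1, 3)))
    return mats
```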
The set of matrices {Mt (r, c)} , t = 2, . . . , T are averaged across time to yield a single matrix M (r, c). This latter
matrix is then collapsed into vectors u and v whose entries uc
and vr are given by
$$ u_c = g_{C/8}(c)\left[\frac{1}{R/8}\sum_{r} M(r,c)\right]^{2} \qquad (6) $$

$$ v_r = g_{R/8}(r)\left[\frac{1}{C/8}\sum_{c} M(r,c)\right]^{2} \qquad (7) $$

where $g_N(n) = \exp\left(-0.5\left[\frac{1.2\,(n - \frac{N-1}{2} - 1)}{N/2}\right]^{2}\right)$ is a Gaussian weighting function designed to effect greater weighting toward the center of each frame.
Finally, the weight vectors are normalized to span the range [0, 1] via $\tilde{u} = u/\eta_u$ and $\tilde{v} = v/\eta_v$, where $\tilde{u}$ and $\tilde{v}$ denote the normalized versions, and where $\eta_u = \sum_c u_c$ and $\eta_v = \sum_r v_r$.
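Given the time-averaged matrix M (for example, np.mean over the per-pair matrices from the sketch above), Eqs. (6)-(7) and the normalization reduce to a few lines; the 1-based indexing assumed for the Gaussian centre is our reading of the reconstructed formula, not something stated explicitly in the paper.

```python
import numpy as np

def gaussian_weight(N):
    # g_N(n) for n = 1, ..., N, peaking near the centre of the frame.
    n = np.arange(1, N + 1)
    return np.exp(-0.5 * (1.2 * (n - (N - 1) / 2.0 - 1) / (N / 2.0)) ** 2)

def motion_weights(M):
    """M: time-averaged block-motion matrix of shape (R/8, C/8)."""
    Rb, Cb = M.shape
    u = gaussian_weight(Cb) * M.mean(axis=0) ** 2   # Eq. (6), one entry per column
    v = gaussian_weight(Rb) * M.mean(axis=1) ** 2   # Eq. (7), one entry per row
    eta_u, eta_v = u.sum(), v.sum()
    u_tilde = u / eta_u if eta_u > 0 else u         # normalized weight vectors
    v_tilde = v / eta_v if eta_v > 0 else v
    return u_tilde, v_tilde, eta_u, eta_v
```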
2.2.3. Step 3: Estimate Motion-Based Distortion
To quantify the motion-based distortion, we first compute a set of MAD's appearance-based distortion values, {d_r^rows}, where each entry d_r^rows denotes the distortion value computed for S_r^rows(I) vs. S_r^rows(Î). Similarly, we compute a set of MAD's appearance-based distortion values, {d_c^cols}, where each entry d_c^cols denotes the distortion value computed for S_c^cols(I) vs. S_c^cols(Î).
Next, we take a motion-weighted average of these data to
compute two scalar values denoting average row-based motion distortion and average column-based motion distortion:
$$ d_{\mathrm{avg}}^{\mathrm{rows}} = \log_{10}(1000\alpha + \eta_v) \times \frac{1}{R/8}\sum_{r} d_r^{\mathrm{rows}}\,\tilde{v}_r \qquad (8) $$

$$ d_{\mathrm{avg}}^{\mathrm{cols}} = \log_{10}(1000(1-\alpha) + \eta_u) \times \frac{1}{C/8}\sum_{c} d_c^{\mathrm{cols}}\,\tilde{u}_c \qquad (9) $$
where the free parameter α = 3/7 was chosen empirically.
Finally, the overall motion-based distortion value, dmotion ,
is given by
$$ d_{\mathrm{motion}} = \left(d_{\mathrm{avg}}^{\mathrm{rows}}\right)^{\alpha} + \left(d_{\mathrm{avg}}^{\mathrm{cols}}\right)^{1-\alpha} \qquad (10) $$
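Continuing the same sketch, Eqs. (8)-(10) can be written compactly once the per-slice appearance values and the Step-2 weights are available; d_rows and d_cols below are assumed to hold one MAD appearance value per sampled row/column slice pair.

```python
import numpy as np

def motion_distortion(d_rows, d_cols, u_tilde, v_tilde, eta_u, eta_v, alpha=3.0 / 7.0):
    d_rows = np.asarray(d_rows)   # appearance values for the R/8 row-slice pairs
    d_cols = np.asarray(d_cols)   # appearance values for the C/8 column-slice pairs
    d_rows_avg = np.log10(1000 * alpha + eta_v) * np.mean(d_rows * v_tilde)        # Eq. (8)
    d_cols_avg = np.log10(1000 * (1 - alpha) + eta_u) * np.mean(d_cols * u_tilde)  # Eq. (9)
    return d_rows_avg ** alpha + d_cols_avg ** (1 - alpha)                         # Eq. (10)
```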
2.3. Combining Spatial and Temporal Distortion
The previous sections described how to compute the spatial-based distortion dspatial and the motion-based distortion
dmotion . As a final step, these values are combined as follows:
$$ d = d_{\mathrm{motion}} + 2.5\,\log_{10}(\beta \times d_{\mathrm{spatial}}) \qquad (11) $$

where $\beta = \log_{10}\left(1 + \frac{\eta_v}{\eta_v + \eta_u}\right)$. Here, d is the final output of the ST-MAD algorithm; it is a scalar value that denotes the overall quality of the distorted video relative to the original video.
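A minimal sketch of this final combination, Eq. (11), using the normalization constants from Step 2:

```python
import numpy as np

def st_mad_index(d_spatial, d_motion, eta_u, eta_v):
    beta = np.log10(1.0 + eta_v / (eta_v + eta_u))
    return d_motion + 2.5 * np.log10(beta * d_spatial)   # Eq. (11)
```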
3. RESULTS AND DISCUSSION
To assess the performance of our ST-MAD algorithm, we use
the LIVE video database [8], which contains 10 high-quality reference videos (originals) with a variety of content and 150 distorted videos (15 distorted videos per original). The distortion types are MPEG-2 compression
(MPEG-2), H.264 compression (H.264), simulated transmission of H.264 compressed bitstreams through error-prone IP
networks (IP), and through error-prone wireless networks
(wireless).
3.1. Demonstrative results
The first row of Figure 2 shows one frame of an original video
and the corresponding frame of the wireless distorted video.
The detection and appearance distortion maps of the distorted
frame are shown in the second row. Notice the strong blurring around the top left and the center of the distorted frame. The two distortion maps accurately
capture the location and amount of that distortion.
The last two rows of Figure 2 show one of the STS images
for columns (using the 205th column) and an STS image for
rows (using the 205th row), respectively, along with the computed temporal appearance maps. These STS images effectively reflect the predominant distortion throughout the video
(e.g., the blurred horizontal stripe across the boat, the stripe at the bottom of the picture in the calendar, and other distortions around the numbers). The locations and amounts of these types
of distortion are also well captured in the appearance maps.
3.2. Overall performance
We compare ST-MAD with five well-known quality assessment methods: PSNR, VSNR, MS-SSIM, VQM, and
MOVIE, on the LIVE video database which includes DMOS
values. Note that PSNR, VSNR, and MS-SSIM are methods
for quality assessment of still images. Here, they were extended to video by applying them on a frame-by-frame basis
and averaging the scores across all frames. Two criteria were
used to evaluate the performances of these algorithms: the
Pearson Linear Correlation Coefficient (CC) and the Spearman Rank Order Correlation Coefficient (SROCC). Before
computing each CC value, we applied a logistic transform
recommended in [10] to the predicted scores.
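For reference, a sketch of this evaluation protocol is shown below; the particular four-parameter logistic is an assumption on our part and may differ from the exact form recommended in [10].

```python
import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def logistic(x, b1, b2, b3, b4):
    # A common monotonic 4-parameter logistic mapping objective scores to DMOS.
    return b2 + (b1 - b2) / (1.0 + np.exp(-(x - b3) / abs(b4)))

def evaluate(scores, dmos):
    scores, dmos = np.asarray(scores, float), np.asarray(dmos, float)
    p0 = [dmos.max(), dmos.min(), scores.mean(), scores.std() + 1e-6]
    params, _ = curve_fit(logistic, scores, dmos, p0=p0, maxfev=20000)
    cc = pearsonr(logistic(scores, *params), dmos)[0]   # Pearson CC after the fit
    srocc = spearmanr(scores, dmos)[0]                  # Spearman rank-order CC
    return cc, srocc
```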
The performances of our algorithm and the five methods
on each type of distortion and overall are shown in Table 1. In
this table, we also include the individual performance of the
spatial-only MAD (S-MAD; i.e., dspatial ) and temporal-only
MAD (T-MAD; i.e., dmotion). The individual performances of T-MAD and S-MAD are noteworthy; in fact, only the combination of the two indices, which constitutes ST-MAD, performs better than T-MAD on average across the entire database.
As can be seen from this table, ST-MAD outperforms the other algorithms on the H.264 and MPEG-2 distorted videos in the LIVE video database. Only MOVIE is better than ST-MAD on the wireless distorted videos. ST-MAD is still a work in progress; however, these results demonstrate its potential for video quality assessment.
Table 1. Performance of ST-MAD and other quality assessment algorithms on the LIVE video database.

CC          Wireless   IP       H.264    MPEG-2   All data
PSNR        0.4675     0.4108   0.4385   0.3856   0.4035
VSNR        0.6992     0.7341   0.6216   0.598    0.6896
MS-SSIM     0.717      0.7219   0.6919   0.6604   0.7441
VQM         0.7325     0.648    0.6459   0.786    0.7236
MOVIE       0.8386     0.7622   0.7902   0.7595   0.8116
S-MAD       0.7887     0.7616   0.7014   0.6563   0.7366
T-MAD       0.7798     0.7554   0.9069   0.829    0.8184
ST-MAD      0.8123     0.79     0.9097   0.8422   0.8299

SROCC       Wireless   IP       H.264    MPEG-2   All data
PSNR        0.4334     0.3206   0.4296   0.3588   0.3684
VSNR        0.7019     0.6894   0.646    0.5915   0.6755
MS-SSIM     0.7285     0.6534   0.7051   0.6617   0.7361
VQM         0.7214     0.6383   0.652    0.781    0.7026
MOVIE       0.8109     0.7157   0.7664   0.7733   0.789
S-MAD       0.7754     0.7628   0.6638   0.6793   0.7211
T-MAD       0.7812     0.7459   0.9071   0.8292   0.8149
ST-MAD      0.806      0.7686   0.9043   0.8478   0.8242
Fig. 2. One frame from an original video and the corresponding frame from a distorted video. The spatial detection and spatial appearance maps are produced by MAD [1]. The lower six images show STS images and the corresponding appearance maps generated by applying MAD to the STS images. The bottommost STS images have been cropped for display purposes.
4. CONCLUSIONS
In this paper, we have presented an algorithm for video quality assessment, ST-MAD, which represents an extension of
our previous algorithm (MAD [1]), modified to take into account the perception of motion artifacts. ST-MAD builds upon its predecessor by comparing spatiotemporal slices taken from the original and distorted videos. This comparison is made by: (1) applying MAD's appearance-based model to images created from the slices, and then (2)
weighting these values by using optical-flow-derived weights
designed to give greater precedence to fast-moving regions
located toward the center of the video. Testing on the LIVE
video database has demonstrated that ST-MAD performs well
in predicting video quality and is currently one of the bestperforming algorithms.
5. REFERENCES
[1] Eric C. Larson and Damon M. Chandler, "Most apparent distortion: full-reference image quality assessment and the role of strategy," Journal of Electronic Imaging, vol. 19, no. 1, pp. 011006, 2010.

[2] Chong-Wah Ngo, Ting-Chuen Pong, and Hong-Jiang Zhang, "Motion analysis and segmentation through spatio-temporal slices processing," IEEE Transactions on Image Processing, vol. 12, no. 3, pp. 341–355, 2003.

[3] Z. Wang, E. P. Simoncelli, and A. C. Bovik, "Multiscale structural similarity for image quality assessment," in Conference Record of the Thirty-Seventh Asilomar Conference on Signals, Systems and Computers, 2003, vol. 2, pp. 1398–1402.

[4] D. M. Chandler and S. S. Hemami, "VSNR: A wavelet-based visual signal-to-noise ratio for natural images," IEEE Transactions on Image Processing, vol. 16, 2007.

[5] Zhou Wang, Ligang Lu, and A. C. Bovik, "Video quality assessment using structural distortion measurement," in Proceedings of the 2002 IEEE International Conference on Image Processing, 2002.

[6] M. H. Pinson and S. Wolf, "A new standardized method for objectively measuring video quality," IEEE Transactions on Broadcasting, vol. 50, no. 3, pp. 312–322, 2004.

[7] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335–350, 2010.

[8] K. Seshadrinathan, R. Soundararajan, A. C. Bovik, and L. K. Cormack, "Study of subjective and objective quality assessment of video," IEEE Transactions on Image Processing, vol. 19, no. 6, pp. 1427–1441, 2010.

[9] B. D. Lucas and T. Kanade, "An iterative image registration technique with an application to stereo vision," in Proc. 7th International Joint Conference on Artificial Intelligence (IJCAI), 1981, pp. 674–679.

[10] VQEG, "Final report from the video quality experts group on the validation of objective models of video quality assessment, phase II," August 2003, http://www.vqeg.org.