Evaluation of video quality metrics on
transmission distortions in H.264 coded video
Iñigo Sedano, Maria Kihl, Kjell Brunnström and Andreas Aurelius
Abstract— The development of high-speed access networks has enabled a variety of video delivery alternatives over the Internet, for example IPTV and peer-to-peer based video services such as Voddler. Consequently, the development of real-time video QoE monitoring methods is receiving considerable attention in the research community. We believe that well-performing objective metrics that use reference information could be used to speed up the development of real-time video QoE monitoring methods. In this paper we therefore study the accuracy of full-reference objective methods for assessing the quality degradation caused by transmission distortions. We evaluated several well-known, publicly available full-reference objective metrics on the freely available EPFL-PoliMI (Ecole Polytechnique Fédérale de Lausanne and Politecnico di Milano) video quality assessment database, which was specifically designed for the evaluation of transmission distortions. Full-reference metrics are usually evaluated against an uncompressed reference. Instead, we study the performance of the metrics when the reference videos are lightly compressed to ensure high quality.
Index Terms—Objective evaluation techniques, Performance
evaluation, IPTV & Internet TV
I. INTRODUCTION
The development of high-speed access networks has enabled a variety of video delivery alternatives over the Internet, for example IPTV and peer-to-peer based video services such as Voddler [1]. A major problem is that the Quality of Experience (QoE) of the video can be severely affected even by a low packet loss rate. Consequently, it is important to allocate the network resources necessary to minimize the loss of video information. Doing so requires monitoring and estimating the QoE delivered to the user. Accordingly, the development of video quality metrics is receiving considerable attention in the research community.
Manuscript received December 3, 2010. This work was developed within the Future Internet project supported by the Basque Government within the ETORTEK Programme and the FuSeN project supported by the Spanish Ministerio de Ciencia e Innovación, under a grant from Fundación Centros Tecnológicos – Iñaki Goenaga and Tecnalia Research & Innovation. The work was also partly financed by the CELTIC project IPNQSIS, with the Swedish Governmental Agency for Innovation Systems (VINNOVA) supporting the Swedish contribution.
I. Sedano is with Tecnalia Research & Innovation, Zamudio, E-48170 Spain (e-mail: [email protected]).
M. Kihl is with the Dept. of Electrical and Information Technology, Lund University, SE-221 00 Sweden.
K. Brunnström and A. Aurelius are with Acreo AB, Networking and Transmission Laboratory (Netlab), Kista, SE-164 40 Sweden.

Currently, methods that use reference information and are suited for off-line, non-monitoring purposes have shown much better performance in predicting perceived video quality than no-reference methods, which could be used for monitoring.
Subjective assessments involving a panel of observers constitute, up to now, the most accurate method to measure video QoE and are also necessary for the development of video quality metrics. However, subjective tests are expensive and time consuming. Instead, well-performing objective metrics that use reference information could be used to speed up the development process of real-time video QoE monitoring methods. In order to understand their performance and how data compression as well as packet loss affect the QoE of video, we believe that the evaluation of already developed full-reference metrics is valuable.
The accuracy of full-reference objective methods for assessing the quality degradation due to the compression process is an already well-studied subject, but independent evaluations are scarce. Independent evaluations of the impact of packet loss are even rarer. A notable exception is the work by the Video Quality Experts Group (VQEG) [2]. Nevertheless, until recently there has been a lack of freely available video quality databases containing transmission distortions, thus limiting research progress in the area.
Moorthy et al. [3] evaluated publicly available full-reference video quality assessment algorithms on the LIVE Wireless Video Quality Assessment database. The database was designed specifically for H.264/AVC compressed videos transmitted over a wireless channel and consisted of 10 reference videos and 160 distorted videos. A large-scale subjective study involving over 30 subjects was conducted to assess the quality of the distorted videos in the dataset. The LIVE Wireless Video Quality Assessment database is no longer publicly available.
In this paper, we evaluate several full-reference objective
metrics on the freely available EPFL-PoliMI (Ecole
Polytechnique Fédérale de Lausanne and Politecnico di
Milano) video quality assessment database [4] [5] [6], which
has been specifically designed for the evaluation of
transmission distortions. The database contains 78 video sequences at 4CIF spatial resolution (704×576 pixels). The distorted videos are created from five 10-second and one 8-second uncompressed video sequences in planar I420 raw progressive format.
The reference videos are lightly compressed to ensure high video quality in the absence of packet losses. The references are thus similar in quality to the uncompressed originals. In real network deployments the uncompressed original sequence is usually not available. Therefore, we believe there is an interest in evaluating the performance of the metrics when a compressed reference of slightly lower quality is used instead.
The transmission distortions are simulated at different
packet loss rates (PLR) (0.1%, 0.4%, 1%, 3%, 5%, 10%) and
two different channel realizations are selected for each PLR.
The H.264/AVC reference software, configured with the High Profile, is used both to encode and decode the videos. The compressed videos in the absence of packet losses are used as the reference for the computation of the DMOS (Differential Mean Opinion Score) values. Forty naive subjects took part in the subjective tests.
We have evaluated the performance of the following well-known, publicly available video quality algorithms: Peak
Signal to Noise Ratio (PSNR), Structural SIMilarity (SSIM)
index [11], Multi-scale SSIM (MS-SSIM) [13], Video Quality
Metric (VQM) [15], Visual Signal to Noise Ratio (VSNR)
[16], MOtion-based Video Integrity Evaluation (MOVIE)
[18], Spatial MOVIE [18] and Temporal MOVIE [18].
The performance of the objective models is evaluated using
the Spearman Rank Order Correlation Coefficient, the Pearson
Linear Correlation Coefficient, the Root Mean Square Error
(RMSE) and the Outlier Ratio. A non-linear regression is done
using a monotonic cubic polynomial function with four
parameters as recommended by the VQEG [7]. The
performance of the different metrics is compared by means of
a statistical significance analysis based on the Pearson, RMSE
and Outlier Ratio coefficients.
II. METHODOLOGY
A. EPFL-PoliMI video database
Three of the reference videos have a frame rate of 25 fps (obtained by cropping HD resolution video sequences down to 4CIF (704×576 pixels) resolution and downsampling the original content from 50 fps to 25 fps), while the other three have a frame rate of 30 fps. As already stated, the reference videos are lightly compressed to ensure high quality in the absence of packet losses. To this end, a fixed Quantization Parameter between 28 and 32 was selected for each sequence.
The sequences were encoded in H.264/AVC [8] High
Profile. B-pictures and Context-adaptive binary arithmetic
coding (CABAC) were enabled for coding efficiency. Each frame was divided into a fixed number of slices, where each slice consists of a full row of macroblocks (for 4CIF, with 16×16-pixel macroblocks, this corresponds to 36 slices of 44 macroblocks each per frame).
The error patterns were generated at six different PLRs (0.1%, 0.4%, 1%, 3%, 5%, 10%) with a two-state Gilbert model with an average burst length of 3 packets. For each PLR and content, two realizations were selected, producing a total of 72 distorted sequences. A minimal sketch of such a loss simulator is shown below.
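As an illustration of this loss process, the following Python sketch simulates a two-state Gilbert model with a given PLR and mean burst length. The function name and interface are our own; this is a minimal sketch, not the pattern generator used by the database authors.

import random

def gilbert_loss_pattern(n_packets, plr, avg_burst_len=3.0, seed=None):
    """Two-state Gilbert model: every packet sent in the Bad state is lost.

    p_bg (Bad -> Good) fixes the mean burst length; p_gb (Good -> Bad)
    is chosen so that the stationary loss probability equals plr.
    """
    rng = random.Random(seed)
    p_bg = 1.0 / avg_burst_len              # mean Bad sojourn = avg_burst_len
    p_gb = plr * p_bg / (1.0 - plr)         # stationary P(Bad) == plr
    lost, bad = [], False
    for _ in range(n_packets):
        if bad:
            bad = rng.random() >= p_bg      # stay Bad with prob. 1 - p_bg
        else:
            bad = rng.random() < p_gb       # enter Bad with prob. p_gb
        lost.append(bad)
    return lost

# Example: one channel realization at 1% PLR with mean burst length 3.
pattern = gilbert_loss_pattern(10000, plr=0.01, seed=42)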
The subjective evaluation was done using the five-point ITU continuous quality scale in the range [0,5] [9]. 21 subjects
participated in the evaluation at the PoliMI lab and 19 at the
EPFL lab. More details about the subjective evaluation can be
found in [4] [5] [6].
B. Processing of the subjective scores
In [5], the Shapiro-Wilk test was used to verify the normality of the distributions of scores across subjects, and the results showed that the score distributions are normal or close to normal.
Although the raw subjective scores were already processed in the EPFL-PoliMI database, we processed them in a different way.
A Student's t-test considering the overall mean and standard deviation of the raw MOS individual scores of each lab showed that, at the 95% confidence level, the data from the two labs can be merged. As an additional verification, the DMOS and confidence interval (CI) values (in this case after normalization, screening and re-scaling) were calculated for each content and distortion type and compared between the two labs, confirming that the data from the two labs can be merged. A sketch of such a check is given below.
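A minimal sketch of the merge check, using SciPy's two-sample t-test as a stand-in for the authors' exact procedure (which the paper describes only at a high level):

import numpy as np
from scipy.stats import ttest_ind

def labs_can_merge(scores_lab_a, scores_lab_b, alpha=0.05):
    """Welch two-sample t-test on the raw individual scores of two labs.

    Returns True when equal means cannot be rejected at level alpha,
    supporting a merge of the two labs' data. This is an illustrative
    stand-in, not the authors' exact test.
    """
    t_stat, p_value = ttest_ind(np.asarray(scores_lab_a),
                                np.asarray(scores_lab_b),
                                equal_var=False)
    return p_value >= alpha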
First of all, we calculated the difference scores by subtracting the scores of the degraded videos from the scores of the corresponding reference videos. The difference scores for the reference videos themselves are 0 and were therefore removed. Accordingly, a lower difference score indicates a higher quality.
Afterwards, the Z-scores were computed for each subject separately by means of the MATLAB zscore function. The Z-scores transform the original distribution into one with zero mean and unit standard deviation. Since each subject may have used the rating scale differently and with a different offset, this normalization procedure reduces the gain and offset differences between subjects.
Subsequently, the outliers were detected according to the
guidelines described in section 2.3.1 of Annex 2 of [9] and
removed.
Next, the Z-scores were re-scaled to the range [0,5]. The Z-scores are assumed to follow a standard Gaussian distribution, so more than 99% of them lie in the range [-3,3]; in our study all of the scores lay in that range. The re-scaling was done by linearly mapping the range [-3,3] to the range [0,5] using the following formula:

Z' = 5 (Z + 3) / 6
Finally, the Difference Mean Opinion Score (DMOS) of each video was computed as the mean of the re-scaled Z-scores from the 36 subjects who remained after outlier rejection. The corresponding confidence intervals were also computed.
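To summarize the processing chain of this subsection, the following Python/NumPy sketch computes DMOS values from raw scores. The function name and array layout are our own, and the subject screening step is omitted for brevity; this is an illustration, not the authors' code.

import numpy as np

def compute_dmos(deg, ref):
    """Difference scores -> per-subject Z-scores -> re-scale -> DMOS.

    deg, ref : (n_subjects, n_videos) raw opinion scores for each
               degraded video and its corresponding reference,
               aligned column-wise. Subject screening is omitted.
    """
    diff = ref - deg                             # lower = better quality
    # Per-subject normalization (equivalent to MATLAB's zscore with the
    # sample standard deviation): removes subject gain and offset.
    z = (diff - diff.mean(axis=1, keepdims=True)) \
        / diff.std(axis=1, ddof=1, keepdims=True)
    z = 5.0 * (z + 3.0) / 6.0                    # map [-3, 3] -> [0, 5]
    dmos = z.mean(axis=0)                        # average over subjects
    # Normal-approximation 95% confidence interval of each DMOS value.
    ci95 = 1.96 * z.std(axis=0, ddof=1) / np.sqrt(z.shape[0])
    return dmos, ci95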
C. Objective assessment algorithms
The performance of the following video quality algorithms
was evaluated on the EPFL-PoliMI video quality assessment
database:
• Peak Signal to Noise Ratio (PSNR): PSNR is computed using the mean of the MSE vector (which contains the Mean Square Error of each frame); a sketch of this aggregation is given after this list. The implementation used is based on the "PSNR of YUV videos" program (yuvpsnr.m) by Dima Pröfrock, available in the MATLAB Central file repository [10]. Only the luminance values were considered.
• Structural SIMilarity (SSIM): SSIM is computed for each frame, and an average value is then produced (see the sketch after this list). The implementation used is an improved version of the original [11], in which the appropriate scale is estimated. The implementation, named ssim.m, can be downloaded from the author's home page [12]. Only the luminance values were considered.
• Multi-scale SSIM (MS-SSIM): MS-SSIM [13] is computed for each frame, and an average value is then produced. The implementation used was downloaded from the Laboratory for Image & Video Engineering (LIVE) at the University of Texas at Austin [14]. Only the luminance values were considered.
• Video Quality Metric (VQM): The software version 2.2 for Linux was downloaded from the authors' home page [15]. The following parameters were used: parsing type none; spatial, valid, gain and temporal calibration automatic; temporal algorithm sequence; temporal valid uncertainty false; alignment uncertainty 15; calibration frequency 15; and video model general. The files were converted to the format required by VQM (Big-YUV file format, 4:2:2) using ffmpeg.
• Visual Signal to Noise Ratio (VSNR): VSNR [16] is computed using the total signal and noise values of the sequence. We modified the authors' implementation, available at [17], to extract the signal and noise values and sum them separately across the sequence. Only the luminance values were considered.
• MOtion-based Video Integrity Evaluation (MOVIE): MOVIE [18] includes three different versions: the Spatial MOVIE index, the Temporal MOVIE index and the MOVIE index. The MOVIE Index version 1.0 for Linux was used, which can be downloaded from [14]. The optional parameters framestart, frameend and frameint were not used. The default parameter values were used for all the metrics. No registration problems occur in the dataset.
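For concreteness, the following Python sketch illustrates the per-frame aggregation described for PSNR and SSIM above, together with a helper that extracts the luminance plane from the planar I420 files used in the database. All names, paths and interfaces are our own illustration; the SSIM call uses scikit-image with default settings and does not replicate the scale estimation of the authors' ssim.m.

import numpy as np
from skimage.metrics import structural_similarity

def read_i420_luma(path, width, height):
    """Yield the luminance (Y) plane of each frame in a planar I420 file.

    I420 stores, per frame, a full-resolution Y plane followed by
    quarter-resolution U and V planes (1.5 bytes per pixel in total).
    """
    y_size = width * height
    frame_size = y_size * 3 // 2
    with open(path, "rb") as f:
        while True:
            frame = f.read(frame_size)
            if len(frame) < frame_size:
                break
            yield np.frombuffer(frame[:y_size], np.uint8).reshape(height, width)

def psnr_sequence(ref_frames, deg_frames, peak=255.0):
    """Sequence PSNR from the mean of the per-frame MSE vector (luma only)."""
    mse = [np.mean((r.astype(np.float64) - d.astype(np.float64)) ** 2)
           for r, d in zip(ref_frames, deg_frames)]
    return 10.0 * np.log10(peak ** 2 / np.mean(mse))

def ssim_sequence(ref_frames, deg_frames):
    """Mean per-frame SSIM over the luminance planes of a sequence."""
    vals = [structural_similarity(r, d, data_range=255)
            for r, d in zip(ref_frames, deg_frames)]
    return float(np.mean(vals))

# Example usage on a 4CIF sequence pair (file names illustrative):
ref = list(read_i420_luma("reference.yuv", 704, 576))
deg = list(read_i420_luma("degraded.yuv", 704, 576))
print(psnr_sequence(ref, deg), ssim_sequence(ref, deg))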
D. Statistical analysis
In order to test the performance of the objective algorithms
we compute the Spearman Rank Order Correlation Coefficient
(SROCC), the Pearson correlation coefficient, the Root Mean
Square Error (RMSE) and the Outlier Ratio (OR).
The Spearman coefficient assesses how well the
relationship between two variables can be described using a
monotonic function. The Pearson coefficient measures the
linear relationship between a model’s performance and the
subjective data. The RMSE provides a measure of the
prediction accuracy. Lastly, the consistency attribute of the
objective metric is evaluated by the Outlier Ratio.
The Pearson, RMSE and Outlier Ratio are computed after a
non-linear regression. The regression is done using a
monotonic cubic polynomial function with four parameters.
The function is constrained to be monotonic over the range of the data:

DMOSp = a·x^3 + b·x^2 + c·x + d

where x is the objective metric score and DMOSp is the predicted DMOS value. The four parameters a, b, c and d are obtained using the MATLAB function "nlinfit".
The performance of the metrics is compared by means of a
statistical significance analysis based on the Pearson, RMSE
and Outlier Ratio coefficients [7].
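As an illustration of this evaluation procedure, the following Python sketch fits the cubic mapping and computes the four performance measures. SciPy's curve_fit stands in for MATLAB's nlinfit, the monotonicity constraint is not enforced (it should be checked over the range of the data), and the outlier definition (prediction error exceeding the 95% confidence interval of the DMOS) is our assumption.

import numpy as np
from scipy.optimize import curve_fit
from scipy.stats import pearsonr, spearmanr

def cubic(x, a, b, c, d):
    # Cubic polynomial mapping objective score to predicted DMOS.
    return a * x**3 + b * x**2 + c * x + d

def evaluate_metric(scores, dmos, ci95):
    """Return (SROCC, Pearson, RMSE, Outlier Ratio) for one metric.

    scores : objective metric value per video
    dmos   : subjective DMOS per video
    ci95   : 95% confidence interval of each DMOS value
    """
    params, _ = curve_fit(cubic, scores, dmos, maxfev=10000)
    dmos_p = cubic(scores, *params)
    srocc = spearmanr(scores, dmos).correlation   # rank-based, no fit needed
    plcc, _ = pearsonr(dmos_p, dmos)
    rmse = float(np.sqrt(np.mean((dmos_p - dmos) ** 2)))
    outlier_ratio = float(np.mean(np.abs(dmos_p - dmos) > ci95))
    return srocc, plcc, rmse, outlier_ratio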
III. RESULTS
We include the scatter plots of the objective metrics scores
vs. DMOS for all the videos in the EPFL-PoliMI video quality
database. The fitting function is also plotted.
Fig. 1. Scatter plot: PSNR vs. DMOS.
Fig. 2. Scatter plot: SSIM vs. DMOS.
Fig. 3. Scatter plot: MS-SSIM vs. DMOS.
Fig. 4. Scatter plot: VQM vs. DMOS.
Fig. 5. Scatter plot: VSNR vs. DMOS.
Fig. 6. Scatter plot: MOVIE vs. DMOS.
Fig. 7. Scatter plot: Spatial MOVIE vs. DMOS.
Fig. 8. Scatter plot: Temporal MOVIE vs. DMOS.
Table I shows the values of the performance coefficients for all the metrics.
TABLE I
OBJECTIVE QUALITY ASSESSMENT ALGORITHMS

Metric            Pearson   Spearman   RMSE     Outlier Ratio
PSNR              0.9586    0.9614     0.2195   0.6250
SSIM              0.9595    0.9696     0.2173   0.5972
MS-SSIM           0.9642    0.9781     0.2046   0.5972
VSNR              0.9744    0.9735     0.1733   0.4722
VQM               0.9619    0.9603     0.2109   0.5417
MOVIE             0.9650    0.9622     0.2023   0.6250
Spatial MOVIE     0.9814    0.9787     0.1480   0.4583
Temporal MOVIE    0.9243    0.9142     0.2944   0.6111
IV. DISCUSSION
The statistical significance analysis based on the RMSE shows that at the 95% confidence level all the metrics are statistically better than Temporal MOVIE. On the other hand, Spatial MOVIE is statistically better than all the other metrics except VSNR, and VSNR is statistically better than PSNR and SSIM. The statistical significance analysis based on the Pearson coefficient confirms the lower performance of Temporal MOVIE, while the analysis based on the Outlier Ratio does not provide an indication of differences between the performances of the metrics. The packet loss did not induce registration problems, which partly explains the high correlation values obtained.
V. CONCLUSIONS
The results show that when the lightly compressed sequences without packet losses are taken as the reference instead of the uncompressed videos, the correlation of the selected objective algorithms is high; the lowest Pearson correlation coefficient after non-linear regression in our study is 0.9243. However, we believe the overall correlation would be lower if the models were evaluated on databases containing more sequences, registration problems, different coding parameters (e.g. flexible macroblock ordering) and error concealment strategies. The statistical analysis based on RMSE shows that, at the 95% confidence level, the Spatial MOVIE index has the highest performance and Temporal MOVIE the lowest among the studied metrics.
REFERENCES
[1] Voddler home page [Online]. Available: http://www.voddler.com
[2] Video Quality Experts Group home page [Online]. Available: http://www.its.bldrdoc.gov/vqeg
[3] A. K. Moorthy, K. Seshadrinathan, R. Soundararajan and A. C. Bovik, "Wireless video quality assessment: A study of subjective scores and objective algorithms," IEEE Transactions on Circuits and Systems for Video Technology, vol. 20, no. 4, pp. 587-599, April 2010.
[4] F. De Simone, M. Naccari, M. Tagliasacchi, F. Dufaux, S. Tubaro and T. Ebrahimi, "Subjective assessment of H.264/AVC video sequences transmitted over a noisy channel," in Proc. International Workshop on Quality of Multimedia Experience (QoMEX), San Diego, California, U.S.A., July 2009.
[5] F. De Simone, M. Tagliasacchi, M. Naccari, S. Tubaro and T. Ebrahimi, "A H.264/AVC video database for the evaluation of quality metrics," in Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Dallas, Texas, U.S.A., March 2010.
[6] EPFL-PoliMI video quality assessment database [Online]. Available: http://vqa.como.polimi.it
[7] Final report from the Video Quality Experts Group on the validation of objective models of multimedia quality assessment, phase I [Online]. Available: ftp://vqeg.its.bldrdoc.gov/Documents/VQEG_Approved_Final_Reports/VQEG_MM_Report_Final_v2.6.pdf
[8] H.264/AVC reference software version JM14.2, Tech. Rep., Joint Video Team (JVT) [Online]. Available: http://iphome.hhi.de/suehring/tml/download/old_jm/
[9] ITU-T Recommendation P.910, "Subjective video quality assessment methods for multimedia applications," September 1999.
[10] MATLAB Central File Exchange [Online]. Available: http://www.mathworks.com/matlabcentral/fileexchange/
[11] Z. Wang, A. C. Bovik, H. R. Sheikh and E. P. Simoncelli, "Image quality assessment: From error visibility to structural similarity," IEEE Transactions on Image Processing, vol. 13, no. 4, pp. 600-612, April 2004.
[12] The Structural SIMilarity (SSIM) index author's home page [Online]. Available: http://www.ece.uwaterloo.ca/~z70wang/research/ssim/
[13] Z. Wang, E. P. Simoncelli and A. C. Bovik, "Multi-scale structural similarity for image quality assessment," in Proc. IEEE Asilomar Conference on Signals, Systems and Computers, November 2003.
[14] Laboratory for Image & Video Engineering (LIVE) [Online]. Available: http://live.ece.utexas.edu/research/Quality/index.htm
[15] Video Quality Metric (VQM) software [Online]. Available: http://www.its.bldrdoc.gov/vqm/
[16] D. M. Chandler and S. S. Hemami, "VSNR: A wavelet-based visual signal-to-noise ratio for natural images," IEEE Transactions on Image Processing, vol. 16, no. 9, pp. 2284-2298, September 2007.
[17] VSNR implementation from the authors [Online]. Available: http://foulard.ece.cornell.edu/dmc27/vsnr/vsnr.html
[18] K. Seshadrinathan and A. C. Bovik, "Motion tuned spatio-temporal quality assessment of natural videos," IEEE Transactions on Image Processing, vol. 19, no. 2, pp. 335-350, February 2010.