PREDICTING INTELLIGIBILITY OF COMPRESSED AMERICAN SIGN LANGUAGE
VIDEO WITH OBJECTIVE QUALITY METRICS
Frank Ciaramello, Sheila Hemami
Anna Cavender, Richard Ladner, Eve Riskin
Cornell University
School of Electrical and Computer Engineering
Ithaca, NY 14853
University of Washington
Department of Computer Science and Engineering
Department of Electrical Engineering
Seattle, WA 98195
ABSTRACT

Transmission of compressed American Sign Language (ASL) video over the cellular telephone network provides the opportunity for the more than 1 million deaf people in the United States to enjoy the personalized, instant communication in their native language that is available to the hearing and speaking community. Video compression for this application introduces specific challenges, the foremost of which is the development of an appropriate quality metric for use in both algorithm development and rate-distortion optimizations. Efforts in recent years to develop quality metrics for video that accurately reflect the subjective opinions of observers have focused on traditional quality as described in terms of aesthetics. However, for coded ASL video to be maximally useful, it must be judged in terms of intelligibility by ASL speakers, rather than aesthetics. In this paper, the results of a study measuring the ASL intelligibility of sequences encoded using H.264/AVC are presented. Fifteen different sequences of ASL passages were encoded at four different bitrates, and fluent ASL users observed a random subset of these coded sequences and rated intelligibility and annoyance on a subjective scale. The results are compared with predictions of subjective quality based on both the continuous video quality evaluation (CVQE) metric and an MSE-based metric. There is a strong correlation between intelligibility and the quality metrics.

1. INTRODUCTION

In the past 10 years, the availability of cell phones has increased dramatically, making many communication problems of the past obsolete. However, sign language users have been left out of this communication boom. With the increasing availability of video-enabled cell phones, users of American Sign Language (ASL) can begin to take advantage of the freedom cellular technology offers. To accomplish two-way video communication of ASL, an appropriate algorithm must be developed to encode ASL video at the low bitrates available on cellular networks while maintaining the intelligibility of the conversation.

Current GPRS technology provides bandwidths of approximately 9 kbps per slot for CS-1 channel coding and 13 kbps per slot for CS-2 channel coding. Furthermore, typical phones operate in Class 8, which provides four download slots and only one upload slot. Two-way conversation would require symmetric channels, thus reducing the maximum available data rate in any given direction. The results presented in this paper show that the available bandwidth is insufficient for intelligibility of traditionally coded ASL videos. To optimize encoding for intelligibility, a method of objectively predicting intelligibility must be developed.

Since ASL is a visual language, intuitively there should be some correlation between intelligibility and visual quality. This paper investigates two video quality metrics, an MSE-based metric and a perceptual metric, and their correlation with intelligibility. The MSE-based metric is computationally more efficient than perceptual metrics, but does not always accurately predict subjective quality scores for natural video. The perceptual quality metric used was the continuous video quality evaluation (CVQE). The metric implements a model of the human visual system (HVS) and produces continuous quality estimates that have been shown in [1] to match the subjective quality scores of human observers. This paper investigates whether these results extend to the intelligibility of ASL sequences rather than the aesthetics of traditional video.

A study was performed to characterize how fluent sign language users understood and tolerated coded ASL video. Various ASL sequences were shown to fluent ASL users, both deaf and hearing. Each participant rated not only how well they understood the videos, but also how annoying they were to watch and whether they would be willing to use a cell phone with videos at the quality they watched.

The quality scores generated using the CVQE and the MSE-based metric were compared with the results from the ASL intelligibility study. Section 2 describes the ASL study and presents the results. Section 3 provides details for the CVQE metric and the MSE-based metric. Finally, Section 4 presents the quality analysis of the ASL sequences and demonstrates a correlation between the objective quality metrics and the subjective intelligibility and annoyance ratings.

This research is supported by the National Science Foundation under grants CCF-0514357, CCF-0514353, and a Graduate Research Fellowship to Anna Cavender.
2. INTELLIGIBILITY EXPERIMENT
A small study was conducted with 11 members of the deaf community. Participants were asked to watch videos encoded at four different bitrates (16 kbps, 24 kbps, 48 kbps, and 96 kbps) and at three different screen sizes, all encoded at 96 kbps: 4.5”x3.1”, 3.1”x2.1”, and 2.2”x1.6”. They were then asked a series of questions regarding their subjective opinions about the videos.
[Fig. 1. PSNR plot of ASL video “Game Show Host” at bitrates of 16 kbps, 24 kbps, 48 kbps, and 96 kbps.]

[Fig. 2. Qualitative results for each of the video compression rates and each survey question (“How easy was it to understand?”, “How annoying was it?”, “Would you use it?”), averaged over observers.]
2.1. ASL Video Sequences
Fifteen video segments were extracted from [2], an instructional
ASL DVD. The segments varied in length from 7.2 sec to 150.9
sec, with a median length of 59.6 sec and a mean length of 53.2
sec. All videos were compressed at 29.97 fps with a GOP size of 250 frames using x264, an open-source implementation of the H.264/AVC standard. In each GOP, there is a single I-frame followed only by P-frames. Figure 1 is a PSNR plot of the ASL video sequence “Game Show Host” at the compression rates of 16 kbps, 24 kbps, 48 kbps, and 96 kbps.
2.2. Subjective Questionnaire
Each participant watched six videos that varied over the three chosen sizes, and six videos that varied over the three lower compression rates. After watching each video, participants answered a
four-question, multiple-choice survey that was given on the computer at the end of each video. The first question asked about
video content was “What was the name of the main character in
the story?” This question was asked simply to encourage the participants to pay close attention to the video and was not used in
any statistical tabulation. The remaining three questions were repeated for each video. The first two remaining questions appear
below along with the possible multiple-choice answers and the rating given for those answers (the ratings were for tabulation purposes only and were not visible to participants). The third remaining question asked whether the participant would use a video cell
phone at each video quality. This question was not used in any of
the correlation analysis.
• How difficult would you say it was to comprehend the video?

– very easy (1.00)
– easy (0.75)
– neither easy nor difficult (0.50)
– difficult (0.25)
– very difficult (0.00)

• How would you rate the annoyance level of the video?

– not at all annoying (1.00)
– a little annoying (0.66)
– somewhat annoying (0.33)
– extremely annoying (0.00)

2.3. Intelligibility Results

Subjective intelligibility and annoyance ratings for each video were calculated by averaging the participants’ answers to these two questions. Over the different screen sizes, intelligibility was rated very highly and had little variation. However, most participants preferred the medium size, most likely because the smallest screen size was simply too small and the largest screen size had reduced spatial quality compared to the preferred one.

Unlike the video-size responses, the responses when the compression rate varied had a large range of values. Figure 2 shows the responses averaged over participants. For the three questions, the y-axis represents the perceived ease of viewing, the video annoyance, and the likelihood of using a mobile phone at that video quality, respectively. While the rates of 16 kbps and 24 kbps can both be transmitted on current GPRS technology, the results show that videos at these rates were not at all easy to understand. Not surprisingly, the highest rate, 96 kbps, was best received by the participants. It is interesting to note that even at this rate, with the fewest coding distortions, not all the observers rated the sequences as very easy to understand. This is likely due to the small screen size.

3. QUALITY METRICS

3.1. CVQE

The CVQE metric takes as inputs the original and distorted sequences and produces an output that predicts perceived (aesthetic) quality. Prior to comparison, the reference and distorted sequences are first registered spatially and temporally using small subsets of information taken from each sequence. The metric then follows the structure of a multichannel perceptual decomposition. The luminance channel of both the reference sequence and the distorted sequence is extracted and then filtered temporally and spatially. Each set of coefficients is then converted to units of contrast and passed through a nonlinearity that models spatial masking. The distances between the elements in the reference and distorted subsets are then collapsed into frame-level distortion scores. The individual components are described below.
Temporal & Spatial Filtering. Prior to spatial filtering, the luminance channels are filtered with a lowpass finite impulse response (FIR) filter with an effective cutoff at approximately 10 Hz. The spatial frequency decomposition is performed with a four-level separable discrete wavelet transform (DWT). Critically sampled wavelet transforms downsample the coefficients in the HH, LH, and HL bands.
Masking and Summation The metric implements a masking
model based on the generalized gain control formulation in [3],
given by
\[
r_{k,\theta}(x, y, n) = \frac{w_k^p \, \big(a_{k,\theta}(x, y, n)\big)^p}{b + w_k^q \sum_{\theta'} \big(a_{k,\theta'}(x, y, n)\big)^q} \qquad (1)
\]

where r_{k,θ}(x, y, n) is the channel response at frame n, spatial location (x, y), scale k, and orientation θ; a_{k,θ}(x, y, n) are the contrast values; and the weights w_k are given by the scaled contrast sensitivity at scale k. The inhibitory exponent q is fixed at 2 and the excitatory exponent p is fixed at 2.1. Inhibitory summation only includes subbands within the same scale, but is summed across all orientations.

A distortion map d(x, y, n) is formed by pooling |r_{k,θ}^{ref}(x, y, n) − r_{k,θ}(x, y, n)| across scale and orientation with a (frequency) summation exponent β_f, set equal to 3, and then a single distortion value d(n) is calculated for each frame by pooling the map over location (x, y) with a (spatial) summation exponent β_s, set equal to 5.

Temporal Smoothing. To convert to units of quality, the distortions d(n) are smoothed using the formulation given in [1], which estimates the perceived distortion \tilde{d}(n) at time n from a time series of frame-level distortion scores. These scores are passed through a logistic function to convert each distortion score to a quality scale from 0 to 100. This function has the form

\[
pq(n) = \frac{100}{1 + \big(\tilde{d}(n)/\zeta\big)^{\rho}} \qquad (2)
\]

where ζ and ρ are functional parameters set to 19.5834 and 4.306, respectively.

[Fig. 3. Typical ASL frame at 96 kbps, with the “sign box” highlighted.]
3.2. MSE-Based Quality Metric
In order to properly compare CVQE with MSE, the MSE distortion
must be converted into a quality measure. The per-frame MSE for
each sequence was calculated and then temporally smoothed as in
[1]. The smoothed MSE trace was converted into a quality score
using the same logistic function (2) that was used for the perceived
distortion.
4. INTELLIGIBILITY ANALYSIS

The ASL sequences used in the intelligibility study were all structured in the same way. Each sequence had a single signer in front
of a solid background. One of the features of ASL is that most of
the signs occur in a designated area, the “sign box.” This area
extends approximately from shoulder to shoulder and from the
navel to the top of the head. Because of this, each sequence was
segmented to separate the signer and the sign box from the background. Figure 3 is a sample frame from the ASL sequence “Ten
Commandments” with the “sign box” highlighted. The quality ratings were calculated using the segmented videos. This segmentation gives a more accurate representation of the quality of the
signer.
Using the appropriately segmented videos, per-frame distortion was generated from both MSE and from the perceptual metric
used in CVQE (1). The distortion analysis was done only for the
largest screen size over the four different compression rates. The
largest screen size corresponds to the lowest spatial quality. The
ASL study showed that even at this low spatial quality, the videos
still received high intelligibility scores. This offers a baseline for
sign language intelligibility over spatial distortions and led to the
decision to analyze only the sequences distorted due to compression.
Once calculated, the per-frame distortions were smoothed to
account for temporal masking and transformed into quality ratings
using the logistic function (2). Figures 4 and 5 show a representative sample of the continuous quality ratings; the remaining sequences produced time traces with similar behavior. The CVQE traces and the MSE-based traces had very similar shapes, while the CVQE values were always higher than those generated by the MSE-based metric. The peaks that occur every 8.3 sec (250 frames) are a result of the GOP structure: each I-frame has very low distortion, and as more predicted frames are coded, the distortion continues to increase, especially for the higher-motion sequences. The continuous quality ratings were averaged to
get a single video quality score for each video at each compression
rate. These scores were correlated with the results from the first
rated question of the intelligibility study.

[Fig. 4. MSE-based quality time traces for the ASL sequence “A Fishy Story” at 16, 24, 48, and 96 kbps.]

[Fig. 5. CVQE time traces for the ASL sequence “A Fishy Story” at 16, 24, 48, and 96 kbps.]

Both the CVQE and the
MSE were found to be highly correlated with intelligibility, having
average correlation coefficients of 0.71 and 0.73, respectively.
In 10 of the sequences, the correlation coefficients were each
greater than 0.9. The sequence “Keys” had the lowest non-negative
correlation. It received similar intelligibility scores for each of the
bitrates. All observers were able to understand the video at each
of the rates. Compared to the rest of the sequences, “Keys” had
more time with little or no motion, and despite the fact that there
were annoying visual distortions, it was rated as intelligible even at
low rates. From a non-ASL user’s perspective, this video sequence appears to contain fewer total signs, which accounts for the lower motion.
There was also a single sequence that had a negative correlation value. This is because its intelligibility scores were higher for
the lowest rate than for the higher rates. There were two participants in particular who tended to give high scores for all of the
sequences. Both of these users happened to watch this sequence, causing its average to be higher than normal. Without including this
sequence, the average correlation coefficient goes up to 0.84 for
CVQE and 0.86 for the MSE-based metric.
The results from the second question of the ASL study were
also used to correlate with the quality metrics. Because this question asked specifically about annoyance, the quality metrics yielded
very high correlation coefficients. The average coefficients are
0.90 for CVQE and 0.91 for the MSE-based metric. These values
are high because the second question is related to traditional video
quality, which can be accurately predicted with objective quality
metrics.
5. SUMMARY

The ASL intelligibility study characterized fluent ASL users’ subjective ratings of H.264/AVC coded video. The study demonstrated that conventional coding methods are inadequate for transmission of intelligible ASL video over today’s cellular networks. In addition, the continuous video quality evaluation metric and an MSE-based quality metric were explored as methods for predicting subjective intelligibility. It was shown that both the CVQE and the MSE-based metric are highly correlated (correlation coefficients of 0.71 and 0.73, respectively) with the results from the intelligibility study. However, there are some sequences, particularly those with fewer signs, for which neither quality metric gives a good indication of intelligibility.

The correlation scores demonstrate that objective quality metrics are, in most cases, very representative of an observer’s ability to understand a compressed video. An area of future research is to obtain continuous intelligibility scores to compare with the continuous quality scores generated using the CVQE. Furthermore, the participants in the ASL study mentioned that the sequences used are very familiar in the ASL community, because they are often used for instruction. Also, since the sequences are educational, the signers sign more slowly and deliberately than they would in a normal conversation. One goal for future work is to perform another ASL intelligibility study using more natural, everyday ASL conversations.
6. REFERENCES
[1] M. Masry and S. S. Hemami, “CVQE: A metric for continuous video quality evaluation at low rates,” in SPIE Human Vision and Electronic Imaging, Santa Clara, CA, Jan. 2003.
[2] K. Mikos, C. Smith, and E.M. Lentz, “Signing Naturally
Workbook and Videotext Expanded Edition: Level 1,” Dawnsign Press, 1993.
[3] A. B. Watson and J. A. Solomon, “A model of visual contrast gain control and pattern masking,” JOSA A, vol. 14, no. 9, pp. 2379–2391, Sept. 1997.