PREDICTING INTELLIGIBILITY OF COMPRESSED AMERICAN SIGN LANGUAGE VIDEO WITH OBJECTIVE QUALITY METRICS

Frank Ciaramello, Sheila Hemami
Cornell University, School of Electrical and Computer Engineering, Ithaca, NY 14853

Anna Cavender, Richard Ladner, Eve Riskin
University of Washington, Department of Computer Science and Engineering and Department of Electrical Engineering, Seattle, WA 98195

ABSTRACT

Transmission of compressed American Sign Language (ASL) video over the cellular telephone network provides the opportunity for the more than 1 million deaf people in the United States to enjoy the personalized, instant communication in their native language that is available to the hearing and speaking community. Video compression for this application introduces specific challenges, the foremost of which is the development of an appropriate quality metric for use in both algorithm development and rate-distortion optimizations. Efforts in recent years to develop quality metrics for video that accurately reflect the subjective opinions of observers have focused on traditional quality as described in terms of aesthetics. However, for coded ASL video to be maximally useful, it must be judged by ASL speakers in terms of intelligibility rather than aesthetics. In this paper, the results of a study measuring ASL intelligibility of sequences encoded using H.264/AVC are presented. Fifteen different sequences of ASL passages were encoded at four different bitrates, and fluent ASL users observed a random subset of these coded sequences and rated intelligibility and annoyance on a subjective scale. The results are compared with predictions of subjective quality based on both the continuous video quality evaluation (CVQE) metric and an MSE-based metric. There is a strong correlation between intelligibility and the quality metrics.

1. INTRODUCTION

In the past 10 years, the availability of cell phones has increased dramatically, making many communication problems of the past obsolete. However, sign language users have been left out of this communication boom. With the increasing availability of video-enabled cell phones, users of American Sign Language (ASL) can begin to take advantage of the freedom cellular technology offers. To accomplish two-way video communication of ASL, an appropriate algorithm must be developed to encode ASL video at the low bitrates available on cellular networks while maintaining the intelligibility of the conversation. Current GPRS technology provides bandwidths of approximately 9 kbps per slot for CS-1 channel coding and 13 kbps per slot for CS-2 channel coding. Furthermore, typical phones operate in Class 8, which provides four download slots and only one upload slot. Two-way conversation would require symmetric channels, thus reducing the maximum available data rate in any given direction. The results presented in this paper show that the available bandwidth is insufficient for intelligibility of traditionally coded ASL videos.

To optimize encoding for intelligibility, a method of objectively predicting intelligibility must be developed. Since ASL is a visual language, intuitively there should be some correlation between intelligibility and visual quality. This paper investigates two video quality metrics, an MSE-based metric and a perceptual metric, and their correlation with intelligibility. The MSE-based metric is computationally more efficient than perceptual metrics, but does not always accurately predict subjective quality scores for natural video. The perceptual quality metric used was the continuous video quality evaluation (CVQE). This metric implements a model of the human visual system (HVS) and produces continuous quality estimates that have been shown in [1] to match the subjective quality scores of human observers. This paper investigates whether these results extend to the intelligibility of ASL sequences, as opposed to the aesthetics of traditional video.

A study was performed to characterize how fluent sign language users understood and tolerated coded ASL video. Various ASL sequences were shown to fluent ASL users, both deaf and hearing. Each participant rated not only how well they understood the videos, but also how annoying they were to watch and whether the participant would be willing to use a cell phone with video at the quality shown. The quality scores generated using the CVQE and the MSE-based metric were compared with the results from the ASL intelligibility study. Section 2 describes the ASL study and presents the results. Section 3 provides details for the CVQE metric and the MSE-based metric. Finally, Section 4 presents the quality analysis of the ASL sequences and demonstrates a correlation between the objective quality metrics and the subjective intelligibility and annoyance ratings.

This research is supported by the National Science Foundation under grants CCF-0514357, CCF-0514353, and a Graduate Research Fellowship to Anna Cavender.

2. INTELLIGIBILITY EXPERIMENT

A small study was conducted with 11 members of the deaf community. Participants were asked to watch videos encoded at four different bitrates (16 kbps, 24 kbps, 48 kbps, and 96 kbps) and at three different screen sizes, all encoded at 96 kbps: 4.5”x3.1”, 3.1”x2.1”, and 2.2”x1.6”. They were then asked a series of questions regarding their subjective opinions about the videos.

Fig. 1.
PSNR plot of ASL video “Game Show Host” at bitrates of 16 kbps, 24 kbps, 48 kbps, and 96 kbps.

2.1. ASL Video Sequences

Fifteen video segments were extracted from [2], an instructional ASL DVD. The segments varied in length from 7.2 sec to 150.9 sec, with a median length of 59.6 sec and a mean length of 53.2 sec. All videos were compressed at 29.97 fps with GOP sizes of 250 frames using x264, an open-source implementation of the H.264/AVC standard. In each GOP, there is a single I-frame followed only by P-frames. Figure 1 is a PSNR plot of the ASL video sequence “Game Show Host” at the compression rates of 16 kbps, 24 kbps, 48 kbps, and 96 kbps.

2.2. Subjective Questionnaire

Each participant watched six videos that varied over the three chosen sizes, and six videos that varied over the three lower compression rates. After watching each video, participants answered a four-question, multiple-choice survey given on the computer. The first question asked about video content, e.g., “What was the name of the main character in the story?” This question was asked simply to encourage the participants to pay close attention to the video and was not used in any statistical tabulation. The remaining three questions were repeated for each video. The first two appear below along with the possible multiple-choice answers and the rating assigned to each answer (the ratings were for tabulation purposes only and were not visible to participants). The third question asked whether the participant would use a video cell phone at each video quality; it was not used in any of the correlation analysis.

• How difficult would you say it was to comprehend the video?
– very easy (1.00)
– easy (0.75)
– neither easy nor difficult (0.50)
– difficult (0.25)
– very difficult (0.00)

• How would you rate the annoyance level of the video?
– not at all annoying (1.00)
– a little annoying (0.66)
– somewhat annoying (0.33)
– extremely annoying (0.00)

2.3. Intelligibility Results

Subjective intelligibility and annoyance ratings for each video were calculated from the participants’ answers to the questionnaire, by averaging each participant’s answers to the two questions above. Over the different screen sizes, intelligibility was rated very highly and showed little variation. However, most participants preferred the medium size, most likely because the smallest screen was simply too small and the largest screen had reduced spatial quality compared to the preferred one. Unlike the responses over video size, the responses when the compression rate varied had a large range of values. Figure 2 shows responses averaged over participants; for the three questions, the y-axis represents the perceived ease of viewing, the annoyance, and the likelihood of mobile phone use at that video quality, respectively. While the rates of 16 kbps and 24 kbps both can be transmitted on current GPRS technology, the results show that videos at these rates were not at all easy to understand. Not surprisingly, the highest bitrate (96 kbps) was best received by the participants. It is interesting to note that even at this least-compressed rate, not all observers rated the sequences as very easy to understand; this is likely due to the small screen size.

Fig. 2. Qualitative results for each of the video compression rates and each survey question (“How easy was it to understand?”, “How annoying was it?”, “Would you use it?”).

3. QUALITY METRICS

3.1. CVQE

The CVQE metric takes as inputs the original and distorted sequences and produces an output which predicts perceived (aesthetic) quality. Prior to comparison, the reference and distorted sequences are first registered spatially and temporally using small subsets of information taken from each sequence. The metric then follows the structure of a multichannel perceptual decomposition.
The luminance channel of both the reference sequence and the distorted sequence is extracted and then filtered temporally and spatially. Each set of coefficients is then converted to units of contrast and passed through a nonlinearity that models spatial masking. The distances between the elements in the reference and distorted subsets are then collapsed into frame-level distortion scores. The individual components are described below.

Temporal & Spatial Filtering. Prior to spatial filtering, the luminance channels are filtered with a lowpass finite impulse response (FIR) filter with an effective cutoff at approximately 10 Hz. The spatial frequency decomposition is performed with a four-level separable discrete wavelet transform (DWT). Critically sampled wavelet transforms downsample the coefficients in the HH, LH, and HL bands.

Masking and Summation. The metric implements a masking model based on the generalized gain control formulation in [3], given by

$$r_{k,\theta}(x,y,n) = \frac{w_k^p \, \left(a_{k,\theta}(x,y,n)\right)^p}{b + w_k^q \sum_{\theta} \left(a_{k,\theta}(x,y,n)\right)^q} \qquad (1)$$

where $r_{k,\theta}(x,y,n)$ is the channel response at frame $n$, spatial location $(x,y)$, scale $k$, and orientation $\theta$; $a_{k,\theta}(x,y,n)$ are the contrast values; and the weights $w_k$ are given by the scaled contrast sensitivity at scale $k$. The inhibitory exponent $q$ is fixed at 2 and the excitatory exponent $p$ is fixed at 2.1. Inhibitory summation includes only subbands within the same scale, but is summed across all orientations. A distortion map $d(x,y,n)$ is formed by pooling $|r^{\mathrm{ref}}_{k,\theta}(x,y,n) - r^{\mathrm{dist}}_{k,\theta}(x,y,n)|$ across scale and orientation with a (frequency) summation exponent $\beta_f$, set equal to 3, and then a single distortion value $d(n)$ is calculated for each frame by pooling the map over location $(x,y)$ with a (spatial) summation exponent $\beta_s$, set equal to 5.
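The masking and pooling stages above can be sketched in NumPy as follows. This is an illustrative implementation, not CVQE's actual code: the saturation constant `b` and the weight `w` are arbitrary example values, and all scales are assumed to share one spatial size (in the real critically sampled DWT the subbands differ in size). The exponents follow the paper ($p = 2.1$, $q = 2$, $\beta_f = 3$, $\beta_s = 5$).

```python
import numpy as np

def gain_control(a, w, b=0.1, p=2.1, q=2.0):
    """Gain-control response of Eq. (1) for one scale k.

    a: contrast values, shape (orientations, H, W); w: scalar weight w_k.
    Inhibition sums the q-th power of contrast across orientations
    within the same scale, as the paper specifies.
    """
    excite = (w ** p) * np.abs(a) ** p
    inhibit = b + (w ** q) * np.sum(np.abs(a) ** q, axis=0, keepdims=True)
    return excite / inhibit

def frame_distortion(resp_ref, resp_dst, beta_f=3.0, beta_s=5.0):
    """Pool |r_ref - r_dst| across scale/orientation (exponent beta_f),
    then across spatial location (exponent beta_s), yielding d(n).

    resp_ref/resp_dst: lists of per-scale response arrays, each of
    shape (orientations, H, W), assumed spatially aligned here.
    """
    pooled = sum(np.sum(np.abs(r - d) ** beta_f, axis=0)
                 for r, d in zip(resp_ref, resp_dst))
    d_map = pooled ** (1.0 / beta_f)                   # distortion map d(x, y)
    return float(np.sum(d_map ** beta_s) ** (1.0 / beta_s))
```

Identical reference and distorted responses give zero frame distortion; any contrast difference yields a positive value that grows with the Minkowski exponents.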
Temporal Smoothing. To convert to units of quality, the distortions $d(n)$ are smoothed using the formulation given in [1], which estimates the perceived distortion $\bar{d}(n)$ at time $n$ from the time series of frame-level distortion scores. These scores are passed through a logistic function to convert each distortion score to a quality scale from 0 to 100. This function has the form

$$pq(n) = \frac{100}{1 + \left(\bar{d}(n)/\zeta\right)^{\rho}} \qquad (2)$$

where $\zeta$ and $\rho$ are functional parameters set to 19.5834 and 4.306, respectively.

3.2. MSE-Based Quality Metric

In order to properly compare CVQE with MSE, the MSE distortion must be converted into a quality measure. The per-frame MSE for each sequence was calculated and then temporally smoothed as in [1]. The smoothed MSE trace was converted into a quality score using the same logistic function (2) that was used for the perceived distortion.

4. INTELLIGIBILITY ANALYSIS

The ASL sequences used in the intelligibility study were all structured in the same way: each sequence had a single signer in front of a solid background. One of the features of ASL is that most of the signs occur in a designated area, the “sign box.” This area extends approximately from shoulder to shoulder and from the navel to the top of the head. Because of this, each sequence was segmented to separate the signer and the sign box from the background. Figure 3 is a sample frame from the ASL sequence “Ten Commandments” with the sign box highlighted.

Fig. 3. Typical ASL frame at 96 kbps, with the “sign box” highlighted.

The quality ratings were calculated using the segmented videos; this segmentation gives a more accurate representation of the quality of the signer. Using the appropriately segmented videos, per-frame distortion was generated both from MSE and from the perceptual metric (1) used in CVQE. The distortion analysis was done only for the largest screen size over the four different compression rates. The largest screen size corresponds to the lowest spatial quality.
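The MSE-based pipeline just described can be sketched as below: crop a sign-box region, compute per-frame MSE, smooth the trace, and apply the logistic function (2). The sign-box coordinates are hypothetical (the study segmented each sequence individually), and the exponential smoother is a stand-in assumption for the recursive temporal smoothing of [1]; only $\zeta$ and $\rho$ are the paper's values.

```python
import numpy as np

def crop_sign_box(frame, top=0, bottom=200, left=40, right=200):
    # Hypothetical sign-box bounds; the real segmentation spans roughly
    # shoulder to shoulder and navel to the top of the head.
    return frame[top:bottom, left:right]

def frame_mse(ref, dst):
    """Per-frame mean squared error over the segmented luminance."""
    ref, dst = ref.astype(float), dst.astype(float)
    return float(np.mean((ref - dst) ** 2))

def smooth(d, alpha=0.1):
    """Simple exponential smoother; a stand-in for the recursive
    temporal smoothing formulation of [1]."""
    out, acc = [], d[0]
    for x in d:
        acc = alpha * x + (1.0 - alpha) * acc
        out.append(acc)
    return np.array(out)

def quality(d_bar, zeta=19.5834, rho=4.306):
    """Logistic mapping of Eq. (2): smoothed distortion -> quality in (0, 100]."""
    return 100.0 / (1.0 + (np.asarray(d_bar, dtype=float) / zeta) ** rho)
```

By construction, zero distortion maps to quality 100, a distortion equal to $\zeta$ maps to 50, and increasing distortion drives the score toward 0.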
The ASL study showed that even at this low spatial quality, the videos still received high intelligibility scores. This offers a baseline for sign language intelligibility over spatial distortions and led to the decision to analyze only the sequences distorted due to compression. Once calculated, the per-frame distortions were smoothed to account for temporal masking and transformed into quality ratings using the logistic function (2). Figures 4 and 5 show a representative sample of the continuous quality ratings; the time traces of the other sequences behave similarly. The CVQE traces and the MSE-based traces had very similar shapes, while the CVQE values were always higher than those generated from the MSE-based metric. The peaks that occur every 8.3 sec (250 frames) are a result of the GOP intervals: each I-frame has very low distortion, and as more predicted frames are coded, the distortion continues to increase, especially for the higher-motion signing sequences. The continuous quality ratings were averaged to get a single video quality score for each video at each compression rate. These scores were correlated with the results from the first rated question of the intelligibility study.

Fig. 4. MSE-based time series traces for the ASL sequence “A Fishy Story” at 16 kbps, 24 kbps, 48 kbps, and 96 kbps.

Fig. 5. CVQE time series traces for the ASL sequence “A Fishy Story” at 16 kbps, 24 kbps, 48 kbps, and 96 kbps.

Both the CVQE and the MSE-based quality scores were found to be highly correlated with intelligibility, having average correlation coefficients of 0.71 and 0.73, respectively. In 10 of the sequences, the correlation coefficients each were greater than 0.9.
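The per-sequence analysis above reduces each continuous quality trace to its time average and correlates it, across the four bitrates, with the subjective intelligibility scores. A minimal sketch follows; the numeric values are made up for illustration and are not the study's data.

```python
import numpy as np

def sequence_score(quality_trace):
    # Collapse a continuous quality trace into one score per encoding.
    return float(np.mean(quality_trace))

# Illustrative per-bitrate quality scores (16/24/48/96 kbps) for one
# sequence, and matching averaged intelligibility ratings; both arrays
# are invented example values.
metric_scores = np.array([38.0, 55.0, 74.0, 88.0])
intelligibility = np.array([0.20, 0.45, 0.80, 0.95])

# Pearson correlation coefficient between metric and intelligibility
r = float(np.corrcoef(metric_scores, intelligibility)[0, 1])
```

With four points per sequence, a single outlying subjective rating can swing the coefficient substantially, which is consistent with the per-sequence variability reported below.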
The sequence “Keys” had the lowest non-negative correlation. It received similar intelligibility scores at each of the bitrates: all observers were able to understand the video at each of the rates. Compared to the rest of the sequences, “Keys” contained more time with little or no motion, and despite the presence of annoying visual distortions, it was rated as intelligible even at low rates. From a non-ASL user’s perspective, this sequence appears to contain fewer total signs, which causes the lower motion. There was also a single sequence with a negative correlation value, because its intelligibility scores were higher for the lowest rate than for the higher rates. Two participants in particular tended to give high scores for all of the sequences, and both happened to watch this sequence, raising its average above normal. Excluding this sequence, the average correlation coefficient rises to 0.84 for CVQE and 0.86 for the MSE-based metric.

The results from the second question of the ASL study were also correlated with the quality metrics. Because this question asked specifically about annoyance, the quality metrics yielded very high correlation coefficients: the averages are 0.90 for CVQE and 0.91 for the MSE-based metric. These values are high because the second question relates to traditional video quality, which can be accurately predicted with objective quality metrics.

5. SUMMARY

The ASL intelligibility study characterized fluent ASL users’ subjective ratings of H.264/AVC coded video. The study demonstrated that conventional coding methods are inadequate for transmission of intelligible ASL video over today’s cellular networks. In addition, the continuous video quality evaluation metric and an MSE-based quality metric were explored as methods for predicting subjective intelligibility. It was shown that both the CVQE and the MSE-based metric are highly correlated (correlation coefficients of 0.71 and 0.73) with the results from the intelligibility study. However, there are some sequences, particularly those with fewer signs, for which neither quality metric gives a good indication of intelligibility. The correlation scores demonstrate that objective quality metrics are, in most cases, very representative of an observer’s ability to understand a compressed video. An area of future research is to obtain continuous intelligibility scores to compare with the continuous quality scores generated using the CVQE. Furthermore, the participants in the ASL study mentioned that the sequences used are very familiar in the ASL community because they are often used for instruction; since the sequences are educational, the people sign more slowly and deliberately than they would in a normal conversation. One goal for future work is to perform another ASL intelligibility study using more natural, everyday ASL conversations.

6. REFERENCES

[1] M. Masry and S. S. Hemami, “CVQE: A metric for continuous video quality evaluation at low rates,” in SPIE Human Vision and Electronic Imaging, Santa Clara, CA, Jan. 2003.

[2] K. Mikos, C. Smith, and E. M. Lentz, Signing Naturally Workbook and Videotext Expanded Edition: Level 1, Dawnsign Press, 1993.

[3] A. B. Watson and J. A. Solomon, “A model of visual contrast gain control and pattern masking,” JOSA A, vol. 14, no. 9, pp. 2379–2391, Sept. 1997.