Video Compression…For Dummies?

George Blood

Abstract

Video formats and carriers become obsolete quickly, while file-based digital video formats are evolving just as rapidly. The physical carriers of these files are unreliable; manufacturers assume they will be used for acquisition and then transferred to hard disc drives for production. For these reasons our starting assumption is that all historic video formats must be migrated to the latest digital technology. In this paper we discuss, in detail, standards and recommendations regarding video compression.
Author

George Blood graduated from the University of Chicago (1983) with a Bachelor of Arts in Music Theory. Active in recording live concerts (from student recitals to opera and major symphony orchestras) since 1982, he has documented over 4,000 live events. From 1984 through 1989 he was a producer at WFMT-FM, and he has recorded and edited some 600 nationally syndicated radio programs, mostly of The Philadelphia Orchestra. He has recorded or produced over 200 CDs, 3 of which were nominated for Grammy Awards. Each month George Blood Audio and Video digitizes approximately 2,000 hours of audio and video collections. He is the only student of Canadian pianist Marc-André Hamelin.
1. Background

Two years ago we received a five-year contract from the Library of Congress to digitize audio and video [1]. The Library’s in-house standard for video preservation masters is lossless JPEG2000 wrapped in MXF. However, within the Library, only the Culpeper facility has the technology to work directly with this format. As part of this contract, I was asked to prepare a white paper entitled “Determining Suitable Digital Video Formats for Medium-term Storage”. In the paper we make recommendations on target formats for video preservation when J2K/MXF is not yet a viable option. Originally, the white paper was intended for use by other departments within the Library of Congress. However, we quickly realized the information and recommendations would be useful to other institutions as well.
Our four starting premises are listed as follows:
• Tape is Not an Option
• 10-bits Required
• Compression is Not an Option
• One Size Does Not Fit All

[1] This paper was originally presented at The Memory of the World in the Digital Age: Digitization and Preservation, 26 to 28 September 2012, Vancouver, British Columbia, Canada. We would like to thank the session chairs, Luciana Duranti and Jonas Palm, for the opportunity to publish our presentation.
2. The Problem with Tape

Tape was rejected due to obsolescence. Standard definition machines are no longer manufactured, either for analogue or digital formats. Current workflows are rapidly evolving around non-tape, file-based systems, from acquisition to production to distribution.
In this paper, I discuss in detail the two middle issues: 10-bit resolution, and why compression is not acceptable for preservation. The observation that one size does not fit all will then form the basis for the structure of the recommendations made in the white paper for the Library of Congress.
3. Requirement of 10-bit Resolution

The requirement for 10-bit resolution is the subject of considerable discussion. I begin, therefore, by reviewing how bit-depth works in video. The choice between 8 and 10 bits can be thought of as a sort of compression, as it excludes low-level detail and softens the image. Let us explore the argument to “use a lower bit rate for lower quality formats”. To appreciate why it is necessary to use 10 bits, let’s explore how this works for audio.
In audio, each bit is equal to 6 decibels (dB). There is a maximum signal level, full scale. As you increase the quantity of bits, you achieve greater dynamic range, or signal-to-noise ratio. Dynamic range and signal-to-noise ratio are simply different ways of looking at the same phenomenon: the range from minimum to maximum information captured.

As you add bits, the dynamic range of information that you can capture also increases. In an 8-bit system you can capture 48dB of dynamic range. In a 16-bit system you can capture 96dB of dynamic range and, finally, in a 24-bit system you have the ability to capture 144dB.
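To check these figures, here is a minimal sketch in Python (not from the original paper) using the rule of thumb of roughly 6 dB of dynamic range per bit:

# Approximate dynamic range of linear PCM: about 6.02 dB per bit.
# (The exact figure for a full-scale sine is 6.02*n + 1.76 dB, but the
# 6 dB-per-bit rule of thumb is what the text above uses.)
for bits in (8, 16, 24):
    print(f"{bits}-bit PCM: ~{6.02 * bits:.0f} dB of dynamic range")
# 8-bit -> ~48 dB, 16-bit -> ~96 dB, 24-bit -> ~144 dB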
Figure 1. Dynamic range in digital audio is a function of the number of bits in the system.
If your source is, for example, an audiocassette with approximately 58dB of dynamic range, an 8-bit system with 48dB of dynamic range will not be enough. A 16-bit system, such as that used on a compact disc or DAT tape, however, will be more than enough, and 24 bits will be wasted storage. However, if your source is an extremely high quality studio master recording, then the additional resolution and additional dynamic range of a 24-bit system makes sense.
Figure 2. Match dynamic range of system with dynamic range of source media.
Figure 3. Sources with higher dynamic range (higher quality) need more bits.
In practice, audio is now nearly always digitized at 24-bits for the sake of standardization. In audio,
an increased number of bits allows for a wider range of information to be captured. You match the range
of the source to the number of bits necessary to capture that range. Storage for audio has also become so
inexpensive that it is not cost prohibitive to store that much data. A 1-Terabyte hard-drive can hold up to
500 hours of preservation quality audio and only costs approximately $100.00. Such high-quality storage
for so little expense could certainly not have been achieved with ¼” tape.
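As a rough check on the 500-hour figure, the arithmetic below assumes a common preservation target of 96 kHz, 24-bit, two-channel PCM; these exact parameters are not stated above, so treat this as an illustration only:

# Storage estimate for preservation-quality audio on a 1 TB drive.
# Assumed parameters (not specified in the text): 96 kHz, 24-bit, stereo PCM.
sample_rate = 96_000         # samples per second, per channel
bytes_per_sample = 3         # 24 bits
channels = 2
bytes_per_hour = sample_rate * bytes_per_sample * channels * 3600
terabyte = 1_000_000_000_000
print(f"{bytes_per_hour / 1e9:.2f} GB per hour")          # ~2.07 GB/hour
print(f"{terabyte / bytes_per_hour:.0f} hours per 1 TB")  # ~482 hours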
Unfortunately, video does not work in the same way as audio.
The waveform monitor is a tool used to adjust video to technical standards.
A properly recorded analogue video signal will have voltages in the range between 7.5 and 110 IRE [2]. At the lower end of the range it is black, and at the other end it is white. If we were to use 1-bit encoding, we would only have these two choices: black and white. As we add bits, we gain more gradations. The goal is a continuous, smooth transition from black to white. In order to achieve that, a high number of bits is required. Fundamentally, the difference between audio and video is that in audio the step size is fixed and the range changes, whereas in video the range is fixed and the step size changes.

[2] IRE stands for Institute of Radio Engineers. The actual electronic values don’t matter for this discussion. The point is there’s a defined range, and that range doesn’t change with the bit depth.
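The contrast between the audio and video models described above can be made concrete with a small sketch; the IRE range follows the text, while the step-size arithmetic is an illustration rather than anything from the original paper:

# Audio: step size fixed (~6 dB per bit), range grows with bit depth.
# Video: range fixed (here 7.5 to 110 IRE), step size shrinks with bit depth.
IRE_RANGE = 110 - 7.5
for bits in (1, 8, 10):
    levels = 2 ** bits
    step = IRE_RANGE / (levels - 1)   # IRE between adjacent code values
    print(f"{bits:>2}-bit video: {levels:>4} levels, ~{step:.3f} IRE per step")
# 1 bit gives only black and white; 10 bits gives 1024 far finer gradations.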
Figure 4. Waveform monitor showing full range of values from 7.5 to 110 IRE. The signal shown is colour bars.
All properly recorded video will contain video levels between 7.5 and 110 IRE, and all of the values in between. This includes VHS and U-matic formats. If they are able to capture black and able to capture white, and the source is analogue (analogue being a continuous signal), the video will contain everything in between. In formats that use colour-under, such as VHS and U-matic, the reduced colour resolution in the vertical does not affect the fact that, at each point in time, the entire range of luminance and colour values is always available. Indeed, the analogue compression that gives us 240 lines of colour resolution makes it significantly more important to be sure to capture the full range of detail in each of those lines [3].
In a one-bit system, half the values would be black and half would be white, just like bi-tonal text scanning. In a two-bit system you get black and white plus two shades of grey. As you add more bits, you get finer and finer gradations until you get a very smooth transition from full black to full white. Using fewer bits, even 8, leads to banding, or visible steps. This example is in the luminance channel. The same goes for the two chrominance channels that carry the colour information.
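One way to see the banding is to quantize a smooth black-to-white ramp at several bit depths and count the distinct output values that survive; this is an illustrative sketch, not part of the paper:

# Quantize a smooth 0.0-1.0 ramp at several bit depths and count the
# distinct output values: few values means visible bands.
ramp = [i / 9999 for i in range(10000)]   # "continuous" black-to-white ramp
for bits in (1, 2, 8, 10):
    levels = 2 ** bits
    quantized = {round(v * (levels - 1)) for v in ramp}
    print(f"{bits:>2} bits -> {len(quantized):>4} distinct grey levels")
# 1 bit: black and white only; 2 bits adds two greys; 8 and 10 bits give
# 256 and 1024 levels, which is what removes the visible steps.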
[3] Analogue video captures less colour information than luminance information. The use of chroma subsampling (4:2:2) matches the digital encoding strategy to the analogue encoding strategy, and is acceptable. Lower resolution chroma subsampling, such as 4:2:0 and 4:1:1, is not acceptable for preservation.
4. Lossy Compression [4]

In an ideal world, lossy compression of video would not be a topic of discussion. We do not accept compression in the preservation of any other format; however, some feel that it is acceptable for video.

Using a lower bit rate for lower quality sources sounds like a good idea, but like bit depth in audio, video encoding does not work this way. Once again, let’s detour to audio digitization to understand this concept.
DATA RATE AND DIGITIZING AUDIO
A sound wave, in its simplest form, is a sine wave.
Figure 5. The simplest sound: sine wave.
Pulse Code Modulation is used in digitizing audio for preservation. This method captures the level
of signal at a regular interval in time. The interval is determined by the Nyquist formula, which states that
the highest frequency available for capture is one half the sample rate. A telephone call, for example, has
less information than a 1/2" stereo album master.
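A small sketch of the Nyquist limit: a tone below half the sample rate is captured faithfully, while one above it folds back (aliases) to a lower frequency and the original is lost. The frequencies here are arbitrary examples, not figures from the paper.

def apparent_frequency(signal_hz, sample_rate_hz):
    # Frequency a sampled sine appears to have after reconstruction.
    # Below Nyquist (sample_rate / 2) the tone is preserved; above it,
    # the tone folds back to a lower alias frequency.
    nyquist = sample_rate_hz / 2
    folded = signal_hz % sample_rate_hz
    return folded if folded <= nyquist else sample_rate_hz - folded

sample_rate = 48_000
for tone in (1_000, 20_000, 30_000):   # Hz
    print(f"{tone:>6} Hz sampled at {sample_rate} Hz "
          f"-> appears as {apparent_frequency(tone, sample_rate):.0f} Hz")
# 1 kHz and 20 kHz are below the 24 kHz Nyquist limit and survive;
# 30 kHz folds back to 18 kHz, and the original information is lost.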
[4] It’s important to remember the distinction being drawn between lossy compression, such as MPEG2, MPEG4, VC-1, Silverlight, IMX, etc., which are deemed unacceptable; and mathematically lossless compression, such as JPEG2000, FFv1, and others. Lossless compression makes smaller files, but 100% of the data is recovered when decoded. The issue of added complexity, or loss of transparency, of these technologies is beyond the scope of this paper.
Figure 6. Sampling frequency determined by Nyquist formula.
If the sample rate is too low for the source signal, information is lost. In the following image there are two signals. The first is a repeat of the one above. That signal is being sampled sufficiently often, according to Nyquist, to capture all the frequency information. In the lower signal there is information between the samples that is not encoded and will be lost. In the lower waveform, the sample rate (the data rate) is not high enough to capture the information in the signal.
Figure 7. The sample rate at the top is adequate. On the bottom, information between samples is not captured and is lost forever.
To capture all the information in the lower signal, more samples must be taken; that is, the data rate must be increased to completely and accurately capture the information.
Figure 8. Higher rate (more bits) necessary to capture additional information in bottom example.
If we use the new higher sampling rate on the upper signal, no additional information is captured.
Figure 9. The higher rate needed for the bottom (higher quality) example doesn’t capture additional information when used on the top (lower quality) example.
Therefore, we can reduce the data rate, using fewer bits, by reducing the sampling for our upper example [5].
Figure 10. A lower data rate can be used on the lower quality (top) example.
DATA RATE AND DIGITIZING VIDEO
Unlike analogue sound, which is a continuous signal, analogue video is partially organized into discrete elements. Seconds are divided into frames, and the frames contain discrete horizontal lines. The lines, however, are a continuous signal. When video is digitized, each frame is first sampled into what amounts to a TIFF image of that frame. There are 720 pixels across each of 486 lines [6].
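For a sense of scale, the data rate of this raster can be sketched as follows, assuming 10-bit 4:2:2 sampling at the NTSC frame rate; the arithmetic is an illustration, not a figure from the paper:

# Rough data rate for uncompressed 10-bit 4:2:2 NTSC video (720 x 486).
width, height = 720, 486
bits_per_pixel = 20          # 10-bit luma plus, on average, one 10-bit
                             # chroma sample per pixel (4:2:2)
frame_rate = 30000 / 1001    # NTSC, ~29.97 frames per second
bits_per_second = width * height * bits_per_pixel * frame_rate
print(f"~{bits_per_second / 1e6:.0f} Mbit/s")                  # ~210 Mbit/s
print(f"~{bits_per_second * 3600 / 8 / 1e9:.0f} GB per hour")  # ~94 GB/hour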
Let’s say this is one line of video (Figure 11).
[5] This discussion is greatly simplified for clarity. The “real world” is more complicated. In practice, a sample rate higher than Nyquist is necessary. How much higher is not the point. The fundamental argument remains: a lower sampling rate is adequate when there is less high frequency information.

[6] The raster is 720 x 486 pixels in NTSC, and 720 x 576 in PAL.
Figure 11. One line of video.
In essence, in uncompressed video, you do the same thing as in audio. You sample each of the 486 lines 720 times (Figure 12).
Figure 12. Line of video sampled at regular intervals.
This is the same process used to digitize still images, audio and video. At regular intervals, in space and/or time, you capture a value:
• 600 pixels per inch (images)
• 96,000 samples per second (audio)
• 720 pixels per line (video)
There are 720 pixels across 486 lines of video, creating a grid of 720 x 486 pixels. When this is compressed, the first thing that happens is that pixels are grouped into squares of 8x8 or 16x16 pixels [7]. These are referred to as “tiles” or “macroblocks” [8]. It is at this point, the very first stage of video compression, where compression becomes a bad thing: compression fundamentally alters the organization of the original video (Figure 13).

[7] Again, the technical discussion here is simplified for clarity. The 8x8 pixel block is a classic strategy that has been superseded by more complex algorithms. The fundamental argument remains: the picture is subdivided.
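The tiling step, and the awkward arithmetic it creates for the NTSC raster, can be sketched with a hypothetical helper (not part of any codec):

# First stage of block-based compression: carve the frame into 8x8 tiles.
# 720 divides evenly by 8; 486 does not.
def tile_grid(width, height, block=8):
    # Whole blocks across and down, plus leftover lines at the bottom.
    return width // block, height // block, height % block

across, down, leftover = tile_grid(720, 486)
print(f"{across} blocks across, {down} whole blocks down, "
      f"{leftover} lines left over")   # 90 across, 60 down, 6 left over
# In practice those 6 lines are simply discarded: 480 / 8 = 60 exactly.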
Figure 13. Lines are continuous (top). When compressed, the organization is changed (bottom).
In an analogue video image there are horizontal divisions (the lines), but there are no vertical divisions. This is not a theoretical problem equivalent to the issue inherent in all digitization of analogue signals (digitization entails taking a continuous signal and sampling it into discrete elements that do not exist in the original). In video it is as though you had taken a page from a book, which also has clear horizontal organization [9], cut between the lines of text, and also cut vertically up the page. No amount of wheat starch paste and long-fibre Japanese paper is going to reconstruct the fibres that were cut in creating the little squares.
If you sew, and work with fabric that has a pattern, you know the importance of matching the
pieces of fabric at the seams. However well you match the fabric, you will still have a seam (Figure 14).
Figure 14. “Can ya see it?” Seam line of very carefully matched patterned fabric.
When you work with the fabric, iron it, make a pleat, assemble a piece, you will be aware of the
seam; likewise with video. If you create these artificial divisions, you will encounter them in editing,
dissolves, cross fades, colour correcting, etc.
[8] Video captures three values at each pixel: luminance (B&W) and two chrominance values. In this discussion we’ll describe how an 8x8 block of only one value (say, only the luminance layer) is encoded. Technically that is a block. The combination of the 8x8 luminance layer and the two 8x8 chrominance layers is a macroblock.
[9] In western scripts. The same argument applies for vertically oriented languages.
I know what you've been thinking: 'Houston, we have a problem.' Everyone in the room who made it through 3rd grade math has done some simple division. You don't get an even number of blocks. Eight divides into 720 evenly. But it does not divide evenly into 486 [10].

720 / 8 = 90
486 / 8 = 60.75

But don't you worry. This problem has been solved. We will just throw away 6 lines of video!

480 / 8 = 60
Wouldn't it make all preservation simpler if we were allowed to do things like this? Just guillotine the brittle edges of a book! Just brush away the pesky dust from pastels! Just low-pass filter scratchy records! If preservation is about capturing as much detail as possible, and any bit of information not captured during digitization is forever lost, why do we allow this argument, which deliberately and with malice aforethought discards over 1% of the information? Would you accept deleting 1 page out of every 100 in a book?
Figure 15. Catastrophic macroblock decoding errors.
This image shows catastrophic macroblock decoding errors, and it demonstrates the size of the macroblock units (Figure 15). During compression, the continuous image is broken into squares that have no relationship whatsoever to the original image [11]. These unrelated squares are encoded separately. Then they are reassembled, or concatenated, on playback to reconstruct the image.
[10] This is a problem with NTSC. PAL divides evenly: 576 / 8 = 72.

[11] As mentioned in footnote 6, this discussion is simplified for clarity. More recent codecs, such as MPEG4, contain strategies to address this problem using variable macroblock sizes. If an 8x8 block contains very different information, such as in this example the frame and the wall, or the matte and the picture, a different block size, such as 4x4, enables the subparts of the otherwise 8x8 block to be encoded separately, using a strategy better suited to the different parts. The image remains, however, artificially divided.
Consider the line of video again. Each line is sampled at 720 points, and at each of these points each pixel is assigned a 10-bit value (Figure 16).
Figure 16. Line of video sampled at regular intervals. At each sample a 10-bit value is assigned.
Using compression, the signal is divided into segments (Figure 17).
Figure 17. Effect on one line of re-organizing video for compression.
For each segment we write a formula that mathematically describes the waveform. You’ll recall from high school algebra how you would take a formula, plug in a value for x, then solve for y. You then take your sharpened #2 pencil and place a dot at the (x,y) pair on some graph paper. Then you’d plug in another value for x, get another y, and repeat a few times. Finally you’d connect the dots, drawing a curve through the dots. Somehow it never looked quite right; your pencil was never quite sharp enough.
In encoding video we reverse the process. You start with the curve and derive a formula. This is a very simplified demonstration of how the discrete cosine transform (DCT) works. If you have a large amount of complex information, and video qualifies as a large amount of complex information, this is a more efficient way to represent the signal. However, there will always be some error. And the formulae here are complete nonsense; this example is just for illustration (Figure 18).
Figure 18. A formula is derived that describes a wave segment.
We do that for each segment of the video (Figure 19).
Figure 19. Formulae derived for each wave segment (formulae in example are nonsense; just intended for
demonstration).
Just as your #2 pencil could never quite draw the perfect curve, we can never write a formula that
matches the curve 100%. This creates two errors. The first is the difference between the analogue curve
and the mathematical representation, called quantization error. There’s an error in the quantity
represented. The second error comes when the segments are stitched, or concatenated back together. The
quantization error, however small, creates a discontinuity, an offset, when the wave segments are lined up
next to each other.
The more resolution we have in our formula, the more closely it will approximate the waveform. In these examples, more places to the right of the decimal equals more resolution and requires more bits.

2.341y = .327sin(x) has more information and uses more bits than 2.3y = .3sin(x)
When the data rate in video is reduced, the accuracy of the representation of the wave is reduced as
well and the error at the seams is increased during concatenation. These are referred to as “concatenation
errors at the macroblock boundaries”.
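A minimal sketch of the idea, using the standard one-dimensional DCT-II in place of the nonsense formulae above: an 8-sample segment is transformed, its coefficients are quantized more or less coarsely, and the reconstruction error, including the mismatch at the boundary where the segment must meet the next one, grows as the coefficient resolution drops. This is an illustration only, not any particular codec.

import math

def dct(block):
    # Unnormalized 1-D DCT-II of a short segment.
    n = len(block)
    return [sum(x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                for i, x in enumerate(block))
            for k in range(n)]

def idct(coeffs):
    # Inverse (DCT-III), scaled to undo dct() above.
    n = len(coeffs)
    return [(coeffs[0] / 2
             + sum(c * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                   for k, c in enumerate(coeffs[1:], start=1))) * 2 / n
            for i in range(n)]

segment = [math.sin(x / 3.0) for x in range(8)]   # one smooth 8-sample segment

for step in (0.01, 0.5, 2.0):   # coarser step = fewer bits per coefficient
    quantized = [round(c / step) * step for c in dct(segment)]   # quantization error
    rebuilt = idct(quantized)
    max_err = max(abs(a - b) for a, b in zip(segment, rebuilt))
    seam_err = abs(segment[-1] - rebuilt[-1])   # mismatch at the segment boundary
    print(f"step {step:4}: max error {max_err:.4f}, boundary error {seam_err:.4f}")
# The coarser the quantization, the larger the error, including at the seam
# where this segment is concatenated with the next one.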
Of course, it is more complicated than this. In the real world we are not trying to find a formula to
describe a segment of just one line of video, but eight lines of video all at once (Figure 20).
Figure 20. This is the discrete cosine transform formula; much more complicated than the nonsense
examples.
This is happening both across and up and down the image. You have concatenation errors on all four sides of the macroblocks. It is not a matter of whether you should use fewer bits for lower quality sources; once you convert to macroblocks, it is not possible to turn the bit rate up high enough to overcome the change made by creating the macroblocks. “Modern” or “intelligent” encoder designs are “smart enough” to adjust or adapt or interpolate for these concatenation errors. However, they cannot eliminate the fact that they have fundamentally changed the organization of the original signal by “smoothing” or “hiding” these errors. Like a finish carpenter installing trim in a living room, a tube of
caulk hides a multitude of sins. All this leads to a softening of the image, the softness being an alteration of the original. Lower the data rate enough and you can see the artefacts very clearly. “Visually lossless” is a myth in preservation. Information has been discarded and forever lost. What may appear OK will impact all future uses of the video, from use in high quality productions to encoding the latest, most-used streaming format for the web.
EXAMPLE: http://www.youtube.com/watch?v=NXF0ovsRcQg
The fuzziness or softness of this image is the encoder working extremely hard to hide the
macroblock tiles. Clearly the data rate is not high enough, but even when it is “high enough”, the structure
of the original has been changed at the macroblock boundaries.
Transcoding from one compression to another can make this situation much worse. The cumulative error is extremely high and the image quality will suffer through multiple decode/encode cycles. MPEG2 uses 8x8 macroblocks, while MPEG4 uses variable block sizes that can be 4, 8, 12 or 16 pixels square. In transcoding, it is a certainty that the macroblocks will be subdivided differently for every frame, and while this is good for encoder efficiency, it is very bad for transcoding between different compression schemata [12].
This is an extreme example that uses the same JPEG encoding over and over, but it makes the case
for cumulative encoding errors. Since it uses the same encoding algorithm, the macroblocking does not
deteriorate.
EXAMPLE: http://vimeo.com/3750507
In an ideal world there would be commandments dictating the treatment of video for preservation. Perhaps something along the lines of:
• “Thou shalt not compress video”
• “If video is already compressed, you may leave it this way”
• “If you choose not to support this form of video compression, your only choice is to decompress and store in uncompressed” (with process history metadata recording that it was previously compressed)
5. Conclusion

The history of the 20th century is unique in the amount of the human experience captured in time-based media, audio and video. The cultural record captured in video faces format obsolescence and rapid change. Playback equipment for legacy carriers and formats is rapidly disappearing and long out of production. This paper has argued that the race against the time when playback equipment will no longer be available should not lead to compromising fundamental principles of quality and accuracy of digitization. The technologies and formats exist to capture endangered media without significant loss. However, many people advocate the use of lower resolution and compression in the name of smaller file sizes. The information lost in these decisions is forever lost and will negatively impact future use and access to this valuable material.

[12] Indeed, this is a large part of the problem in the YouTube example: it’s been transcoded a few times at low bit rates.
6. Recommended Target Formats for Digitizing Video

This is a summary of the recommendations for digital file formats for preserving video made in the white paper for the Library of Congress. The complete paper is available at:
http://dl.dropbox.com/u/11583358/IntrmMastVidFormatRecs_20111114.pdf

1. All analogue sources: 10-bit uncompressed, 720x486
2. Digital sources on tape, non-transcoded transfer possible: keep native; may decode to uncompressed
3. Digital sources on tape, transcode necessary: 10-bit uncompressed, 720x486
4. Digital sources on other media: evaluate; keep native or uncompressed
5. Optical discs: ISO disc image