
Video Compression…For Dummies?

George Blood

Abstract

Video formats and carriers become obsolete rapidly, while file-based digital video formats are still evolving rapidly. The physical carriers of these files are unreliable; manufacturers assume they will be used for acquisition and then transferred to hard disc drives for production. For these reasons our starting assumption is that all historic video formats must be migrated to the latest digital technology. In this paper we discuss, in detail, standards and recommendations regarding video compression.

Author

George Blood graduated from the University of Chicago (1983) with a Bachelor of Arts in Music Theory. Active recording live concerts (from student recitals to opera and major symphony orchestras) since 1982, he has documented over 4,000 live events. From 1984 through 1989 he was a producer at WFMT-FM, and he has recorded and edited some 600 nationally syndicated radio programs, mostly of The Philadelphia Orchestra. He has recorded or produced over 200 CDs, 3 of which were nominated for Grammy Awards. Each month George Blood Audio and Video digitizes approximately 2,000 hours of audio and video collections. He is the only student of Canadian pianist Marc-André Hamelin.

1.  Background  

Two years ago we received a five-year contract from the Library of Congress to digitize audio and video¹. The Library’s in-house standard for video preservation masters is lossless JPEG2000 wrapped in MXF. However, within the Library, only the Culpeper facility has the technology to work directly with this format. As part of this contract, I was asked to prepare a white paper entitled “Determining Suitable Digital Video Formats for Medium-term Storage”. In the paper we make recommendations on target formats for video preservation when J2K/MXF is not yet a viable option. Originally, the white paper was intended for use by other departments within the Library of Congress. However, we quickly realized the information and recommendations would be useful to other institutions as well.

Our four starting premises are listed as follows:

• Tape is Not an Option
• 10-bits Required
• Compression is Not an Option
• One Size Does Not Fit All

¹ This paper was originally presented at The Memory of the World in the Digital Age: Digitization and Preservation, 26 to 28 September 2012, Vancouver, British Columbia, Canada. We would like to thank the session chairs, Luciana Duranti and Jonas Palm, for the opportunity to publish our presentation.

2.  The  Problem  with  Tape  

Tape was rejected due to obsolescence. Standard definition machines are no longer manufactured, either for analogue or digital formats. Current workflows are rapidly evolving around non-tape, file-based systems, from acquisition to production to distribution.

In this paper, I discuss the middle two premises in detail: the requirement for 10-bit resolution, and why compression is not acceptable for preservation. The observation that one size does not fit all then forms the basis for the structure of the recommendations made in the white paper for the Library of Congress.

3.  Requirement of 10-bit Resolution

The requirement for 10-bit resolution is the subject of considerable discussion. I begin, therefore, by reviewing how bit depth works in video. The choice between 8 and 10 bits can be thought of as a sort of compression, in that fewer bits exclude low-level detail and soften the image. Let us examine the argument to “use a lower bit rate for lower quality formats”. To appreciate why 10 bits are necessary, let’s first look at how this works for audio.

In audio, each bit is equal to approximately 6 decibels (dB). There is a maximum signal level: full scale. As you increase the number of bits, you achieve more dynamic range, or signal-to-noise ratio. Dynamic range and signal-to-noise ratio are simply different ways of looking at the same phenomenon: the range from the minimum to the maximum information captured.

As you add bits, the dynamic range of information that you can capture also increases. In an 8-bit system you can capture 48dB of dynamic range. In a 16-bit system you can capture 96dB of dynamic range and, finally, in a 24-bit system you have the ability to capture 144dB.
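The arithmetic above can be sketched in a few lines of code. A hedge: the paper uses the round rule of thumb of 6 dB per bit; the commonly quoted figure is closer to 6.02 dB per bit.

```python
# Rule of thumb from the text: each bit of audio resolution adds
# about 6 dB of dynamic range (more precisely, 6.02n + 1.76 dB).
def dynamic_range_db(bits, db_per_bit=6.0):
    """Approximate dynamic range in dB for an n-bit PCM system."""
    return bits * db_per_bit

for bits in (8, 16, 24):
    print(f"{bits}-bit: {dynamic_range_db(bits):.0f} dB")
# 8-bit: 48 dB, 16-bit: 96 dB, 24-bit: 144 dB, matching the figures above
```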


Figure 1. Dynamic range in digital audio is a function of the number of bits in the system.

If your source is, for example, an audiocassette with approximately 58dB of dynamic range, an 8-bit system with 48dB of dynamic range will not be enough. A 16-bit system, such as that used on a compact disc or DAT tape, however, will be more than enough, and 24 bits will be wasted storage. However, if your source is an extremely high quality studio master recording, then the additional resolution and additional dynamic range of a 24-bit system makes sense.

Figure 2. Match dynamic range of system with dynamic range of source media.


Figure 3. Sources with higher dynamic range (higher quality) need more bits.

In practice, audio is now nearly always digitized at 24 bits for the sake of standardization. In audio, an increased number of bits allows for a wider range of information to be captured: you match the range of the source to the number of bits necessary to capture that range. Storage for audio has also become so inexpensive that it is not cost prohibitive to store that much data. A 1-Terabyte hard drive can hold up to 500 hours of preservation quality audio and only costs approximately $100.00. Such high-quality storage for so little expense could certainly not have been achieved with ¼” tape.
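The “500 hours per terabyte” figure can be sanity-checked. This sketch assumes 96 kHz / 24-bit stereo PCM, a common preservation specification; the text does not state which specification underlies its figure.

```python
# Sanity check of "a 1-Terabyte hard drive can hold up to 500 hours of
# preservation quality audio", assuming 96 kHz / 24-bit / stereo PCM.
SAMPLE_RATE = 96_000        # samples per second, per channel
BYTES_PER_SAMPLE = 3        # 24 bits
CHANNELS = 2

bytes_per_second = SAMPLE_RATE * BYTES_PER_SAMPLE * CHANNELS
terabyte = 10 ** 12         # decimal terabyte, as drive makers count it
hours = terabyte / bytes_per_second / 3600
print(f"{hours:.0f} hours per terabyte")   # roughly 482 hours
```

At lower sample rates or in mono the figure rises well past 500 hours, so the paper’s round number is a reasonable order-of-magnitude claim.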

Unfortunately, video does not work in the same way as audio. The waveform monitor is a tool used to adjust video to technical standards. A properly recorded analogue video will have voltages in the range between 7.5 and 110 IRE². At the lower end of the range it is black, and at the other end it is white. If we were to use 1-bit encoding, we would only have these two choices: black and white. As we add bits, we gain more gradations. The goal is a smooth, continuous transition from black to white; to achieve that, a high number of bits is required. Fundamentally, the difference between audio and video is that in audio the step size is fixed and the range changes, whereas in video the range is fixed and the step size changes.

² IRE stands for Institute of Radio Engineers. The actual electronic values don’t matter for this discussion. The point is there’s a defined range, and that range doesn’t change with the bit depth.


Figure 4. Waveform monitor showing full range of values from 7.5 to 110 IRE. This is colour bars.

All properly recorded video will contain video levels between 7.5 and 110 IRE, and all of the values in between. This includes VHS and U-matic formats. If a format is able to capture black and able to capture white, and the source is analogue – analogue being a continuous signal – the video will contain everything in between. Formats that use colour-under, such as VHS and U-matic, have fewer lines of colour resolution in the vertical, but this does not change the fact that at each point in time the entire range of luminance and colour is available. Indeed, the analogue compression that gives us 240 lines of colour resolution makes it significantly more important to capture the full range of detail in each of those lines³.

In a one-bit system half the values would be black and half would be white, just like bi-tonal text scanning. In a two-bit system you get black and white plus two shades of grey. As you add more bits, you get finer and finer gradations until you get a very smooth transition from full black to full white. Using fewer bits, even 8, leads to banding, or visible steps. This example is in the luminance channel. The same goes for the two chrominance channels that carry the colour information.
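A minimal sketch of the fixed-range, shrinking-step behaviour described above: quantizing a smooth black-to-white ramp at various bit depths, and counting the grey levels that survive.

```python
# Video-style quantization: the range (black..white) is fixed and the
# step size shrinks as bits are added -- the opposite of audio, where
# the step is fixed and adding bits extends the range.
def quantize(value, bits):
    """Snap a value in [0.0, 1.0] to one of 2**bits grey levels."""
    levels = 2 ** bits
    return round(value * (levels - 1)) / (levels - 1)

ramp = [i / 9999 for i in range(10000)]   # a smooth black-to-white ramp
for bits in (1, 2, 8, 10):
    distinct = len({quantize(v, bits) for v in ramp})
    print(f"{bits}-bit: {distinct} distinct grey levels")
# 1-bit: 2, 2-bit: 4, 8-bit: 256, 10-bit: 1024
```

The coarser the quantizer, the wider each visible “band” in the decoded ramp: this is the banding the text describes.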

³ Analogue video captures less colour information than luminance information. The use of chroma subsampling (4:2:2) matches the digital encoding strategy to the analogue encoding strategy, and is acceptable. Lower resolution chroma subsampling, such as 4:2:0 and 4:1:1, is not acceptable for preservation.
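The sample-count arithmetic behind this footnote can be sketched as follows. A 720x486 NTSC raster is assumed; the counts are per frame, before any bit-depth considerations.

```python
# Per-frame sample counts for a 720x486 raster under different chroma
# subsampling schemes. 4:2:2 halves the chroma horizontally, matching
# the analogue colour-under economy; 4:2:0 and 4:1:1 discard more.
def samples_per_frame(width, height, scheme):
    """Return (luma, chroma) sample counts for one frame."""
    factors = {          # (horizontal, vertical) chroma subsampling
        "4:4:4": (1, 1),
        "4:2:2": (2, 1),
        "4:2:0": (2, 2),
        "4:1:1": (4, 1),
    }
    h, v = factors[scheme]
    luma = width * height
    chroma = 2 * (width // h) * (height // v)   # two chroma channels
    return luma, chroma

for scheme in ("4:4:4", "4:2:2", "4:2:0", "4:1:1"):
    luma, chroma = samples_per_frame(720, 486, scheme)
    print(f"{scheme}: {luma + chroma:,} samples per frame")
```

4:2:0 and 4:1:1 carry only half the chroma samples of 4:2:2, which is why the footnote rules them out for preservation.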


4.  Lossy Compression⁴

In an ideal world, lossy compression of video would not be a topic of discussion. We do not accept compression in the preservation of any other format; some, however, feel that it is acceptable for video.

Using a lower bit rate for lower quality sources sounds like a good idea but, like bit depth in audio, video encoding does not work this way. Once again, let’s detour to audio digitization to understand this concept.

DATA RATE AND DIGITIZING AUDIO

A sound wave, in its simplest form, is a sine wave.

Figure 5. The simplest sound: sine wave.

Pulse Code Modulation is used in digitizing audio for preservation. This method captures the level of the signal at a regular interval in time. The interval is determined by the Nyquist formula, which states that the highest frequency available for capture is one half the sample rate. A telephone call, for example, has less information than a 1/2" stereo album master.

⁴ It’s important to remember the distinction being drawn between lossy compression, such as MPEG2, MPEG4, VC-1, Silverlight, IMX, etc., which is deemed unacceptable; and mathematically lossless compression such as JPEG2000, FFV1, and others. Lossless compression makes smaller files, but 100% of the data is recovered when decoded. The issue of added complexity, or loss of transparency, of these technologies is beyond the scope of this paper.


Figure 6. Sampling frequency determined by Nyquist formula.

If the sample rate is too low for the source signal, information is lost. In the following image there are two signals. The first is a repeat of the one above; it is being sampled sufficiently often, according to Nyquist, to capture all the frequency information. In the lower signal there is information between the samples that is not encoded and will be lost: the sample rate, the data rate, is not high enough to capture the information in the signal.

Figure 7. The sample rate at the top is adequate. On the bottom, information between samples is not captured and is lost forever.
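The loss described above is aliasing, and a minimal sketch makes it concrete: a frequency above Nyquist produces exactly the same samples as a lower frequency, so the difference between them is unrecoverable. The frequencies here are illustrative, not from the paper.

```python
import math

# Aliasing sketch: a 3 Hz cosine sampled at only 4 Hz produces exactly
# the same samples as a 1 Hz cosine -- the information between the
# samples is not captured and cannot be recovered afterwards.
def sample(freq_hz, rate_hz, n_samples):
    return [math.cos(2 * math.pi * freq_hz * n / rate_hz)
            for n in range(n_samples)]

fast = sample(3.0, 4.0, 8)   # 3 Hz is above Nyquist (4/2 = 2 Hz)
slow = sample(1.0, 4.0, 8)   # 1 Hz is safely below Nyquist
print(all(abs(a - b) < 1e-9 for a, b in zip(fast, slow)))  # True
```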


To capture all the information in the lower signal, more samples must be taken: the data rate must be increased to completely and accurately capture the information.

Figure 8. Higher rate (more bits) necessary to capture additional information in bottom example.

If we use the new higher sampling rate on the upper signal, no additional information is captured.

Figure 9. The higher rate needed for the bottom (higher quality) example doesn’t capture additional information when used on the top (lower quality) example.


Therefore, we can reduce the data rate, using fewer bits, by reducing the sampling for our upper example⁵.

Figure 10. A lower data rate (fewer bits) can be used on the lower quality (top) example.

DATA RATE AND DIGITIZING VIDEO

Unlike analogue sound, which is a continuous signal, analogue video is partially organized into discrete elements. Seconds are divided into frames, and the frames contain discrete horizontal lines. The lines, however, are a continuous signal. When video is digitized, each frame is first sampled in what amounts to a TIFF image of that frame. There are 720 pixels across each of 486 lines⁶.

Let’s say this is one line of video (Figure 11).

⁵ This discussion is greatly simplified for clarity. The “real world” is more complicated. In practice, a sample rate higher than Nyquist is necessary; how much higher is not the point. The fundamental argument remains: a lower sampling rate is adequate when there is less high frequency information.
⁶ The raster is 720 x 486 pixels in NTSC, and 720 x 576 in PAL.


Figure 11. One line of video.

In essence, in uncompressed video, you do the same thing as in audio: you sample each of the 486 lines 720 times (Figure 12).

Figure 12. Line of video sampled at regular intervals.

This is the same process used to digitize still images, audio and video. At regular intervals, in space and/or time, you capture a value:

• 600 pixels per inch (images)
• 96,000 samples per second (audio)
• 720 pixels per line (video)

There are 720 pixels across 486 lines of video, creating a grid of 720 x 486 pixels. When this is compressed, the first thing that happens is that pixels are grouped into squares of 8x8 or 16x16 pixels⁷. These are referred to as “tiles” or “macroblocks”⁸. It is at this point, the very first stage of video compression, that compression becomes a bad thing: compression fundamentally alters the organization of the original video (Figure 13).

⁷ Again the technical discussion here is simplified for clarity. The 8x8 pixel block is a classic strategy that has been superseded by more complex algorithms. The fundamental argument remains: the picture is subdivided.

Figure 13. Lines are continuous (top). When compressed, the organization is changed (bottom).

In an analogue video image there are horizontal divisions (the lines), but there are no vertical divisions. This is not the problem inherent in all digitization of analogue signals, namely that digitizing takes a continuous signal and samples it at discrete points that do not exist in the original. In video it is as though you had taken a page from a book, which also has a clear horizontal organization⁹, cut between the lines of text, and then also cut vertically up the page. No amount of wheat starch paste and long-fibre Japanese paper is going to reconstruct the fibres that were cut in creating the little squares.

If you sew, and work with fabric that has a pattern, you know the importance of matching the pieces of fabric at the seams. However well you match the fabric, you will still have a seam (Figure 14).

Figure 14. “Can ya see it?” Seam line of very carefully matched patterned fabric.

When you work with the fabric, iron it, make a pleat, assemble a piece, you will be aware of the seam; likewise with video. If you create these artificial divisions, you will encounter them in editing, dissolves, cross fades, colour correcting, etc.

⁸ Video captures three values at each pixel: luminance (B&W) and two chrominance values. In this discussion we’ll describe how an 8x8 block of only one value (say, the luminance layer) is encoded. Technically that is a block; the combination of the 8x8 luminance layer and the two 8x8 chrominance layers is a macroblock.
⁹ In western scripts. The same argument applies for vertically oriented languages.


I know what you've been thinking: 'Houston, we have a problem.' Everyone in the room who made it through 3rd grade math has done some simple division. You don't get an even number of blocks. Eight divides into 720 evenly, but it does not divide evenly into 486¹⁰:

720 / 8 = 90
486 / 8 = 60.75

But don't you worry. This problem has been solved. We will just throw away 6 lines of video!

480 / 8 = 60

Wouldn't it make all preservation simpler if we were allowed to do things like this? Just guillotine the brittle edges of a book! Just brush away the pesky dust from pastels! Just low-pass filter scratchy records! If preservation is about capturing as much detail as possible, if any bit of information not captured during digitization is forever lost, why do we allow this argument, which deliberately and with malice aforethought discards over 1% of the information? Would you accept deleting 1 page out of every 100 in a book?
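The division above can be checked directly; this sketch also quantifies the "over 1%" claim.

```python
# The NTSC raster does not divide evenly into 8x8 blocks, so encoders
# crop 486 active lines down to 480 -- discarding over 1% of the picture.
WIDTH, HEIGHT, BLOCK = 720, 486, 8

print(WIDTH / BLOCK)             # 90.0  -> divides evenly
print(HEIGHT / BLOCK)            # 60.75 -> does not
cropped = (HEIGHT // BLOCK) * BLOCK
lost = HEIGHT - cropped
print(f"{lost} lines discarded = {100 * lost / HEIGHT:.2f}% of the image")
# 6 lines discarded = 1.23% of the image
```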

Figure 15. Catastrophic macroblock decoding errors.

This image shows catastrophic macroblock decoding errors, and it demonstrates the size of the macroblock units (Figure 15). During compression, the continuous image is broken into squares that have no relationship whatsoever to the original image¹¹. These unrelated squares are encoded separately. Then they are reassembled, or concatenated, on playback to reconstruct the image.

¹⁰ This is a problem with NTSC. PAL divides evenly: 576/8 = 72.
¹¹ As mentioned in footnote 6, this discussion is simplified for clarity. More recent codecs, such as MPEG4, contain strategies to address this problem using variable macroblock sizes. If an 8x8 block contains very different information, such as in this example the frame and the wall, or the matte and the picture, a different block size, such as 4x4, enables the subparts of the otherwise 8x8 block to be encoded separately, using a strategy better suited to the different parts. The image remains, however, artificially divided.


Consider the line of video again. Each line is sampled at 720 points, and at each of these points each pixel is assigned a 10-bit value (Figure 16).

Figure 16. Line of video sampled at regular intervals. At each sample a 10-bit value is assigned.

Using compression, the signal is divided into segments (Figure 17).

Figure 17. Effect on one line of re-organizing video for compression.

For each segment we write a formula that mathematically describes the waveform. You’ll recall from high school algebra how you would take a formula, plug in a value for x, then solve for y. You then take your sharpened #2 pencil and place a dot at the (x,y) pair on some graph paper. Then you’d plug in another value for x, get another y, and repeat a few times. Finally you’d connect the dots, drawing a curve through them. Somehow it never looked quite right; your pencil was never quite sharp enough.


In encoding video we reverse the process: you start with the curve and derive a formula. This is a very simplified demonstration of how the discrete cosine transform (DCT) works. If you have a large amount of complex information, and video qualifies as a large amount of complex information, this is a more efficient way to represent the signal. However, there will always be some error. Note that the formulae here are complete nonsense; this example is just for illustration (Figure 18).

Figure 18. A formula is derived that describes a wave segment.

We do that for each segment of the video (Figure 19).

Figure 19. Formulae derived for each wave segment (the formulae in the example are nonsense, intended only for demonstration).


Just as your #2 pencil could never quite draw the perfect curve, we can never write a formula that matches the curve 100%. This creates two errors. The first is the difference between the analogue curve and the mathematical representation, called quantization error: there is an error in the quantity represented. The second error comes when the segments are stitched, or concatenated, back together. The quantization error, however small, creates a discontinuity, an offset, when the wave segments are lined up next to each other.

The more resolution we have in our formula, the more closely it will approximate the waveform. In these examples, more places to the right of the decimal equals more resolution and requires more bits: 2.341y = .327sin(x) has more information, and uses more bits, than 2.3y = .3sin(x).

When the data rate in video is reduced, the accuracy of the representation of the wave is reduced as well and the error at the seams is increased during concatenation. These are referred to as “concatenation errors at the macroblock boundaries”.
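Both errors can be reproduced in a toy one-dimensional model. This is a sketch, not a real codec: an 8-sample segment is transformed with a DCT, most coefficients are discarded to stand in for a low bit rate, and the segments are decoded and concatenated. The signal and the number of kept coefficients are illustrative choices, not from the paper.

```python
import math

# Toy transform coding in one dimension: each 8-sample segment of a
# waveform is converted to DCT coefficients, the high-frequency
# coefficients are discarded to save bits, and the segment is decoded.
# The decoded segments no longer meet cleanly at their joins.
N = 8  # segment (block) length

def dct(block):
    """DCT-II of one segment, scaled so idct() inverts it exactly."""
    return [(2 / N) * sum(block[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                          for n in range(N)) for k in range(N)]

def idct(coeffs):
    """Inverse (DCT-III) matching dct() above."""
    return [coeffs[0] / 2 + sum(coeffs[k] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
                                for k in range(1, N)) for n in range(N)]

signal = [math.sin(2 * math.pi * n / 37) for n in range(64)]  # a smooth wave
KEEP = 3                                  # coefficients kept per segment
decoded = []
for start in range(0, len(signal), N):
    coeffs = dct(signal[start:start + N])
    coeffs[KEEP:] = [0.0] * (N - KEEP)    # throw away the fine detail
    decoded.extend(idct(coeffs))

worst = max(abs(a - b) for a, b in zip(signal, decoded))
# The step between samples 7 and 8 spans a segment boundary; comparing
# it with the original step exposes the concatenation offset.
jump = abs((decoded[8] - decoded[7]) - (signal[8] - signal[7]))
print(f"worst quantization error: {worst:.4f}")
print(f"extra step at the segment boundary: {jump:.4f}")
```

Keeping all 8 coefficients reconstructs the segment exactly; the errors appear only once coefficients are discarded, which is exactly the trade the text describes.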

Of course, it is more complicated than this. In the real world we are not trying to find a formula to describe a segment of just one line of video, but eight lines of video all at once (Figure 20).

Figure 20. This is the discrete cosine transform formula; much more complicated than the nonsense examples.

This is happening both across and up and down the image, so you have concatenation errors on all four sides of the macroblocks. It is not a matter of whether you may use fewer bits for lower quality sources: once the image has been converted to macroblocks, it is not possible to turn the bit rate up high enough to overcome the change made by creating them. “Modern” or “intelligent” encoder designs are “smart enough” to adjust, adapt or interpolate for these concatenation errors. However, by “smoothing” or “hiding” these errors they cannot eliminate the fact that they have fundamentally changed the organization of the original signal. Like a finish carpenter installing trim in a living room, a tube of caulk hides a multitude of sins. All this leads to a softening of the image, the softness being an alteration of the original. Lower the data rate enough and you can see the artefacts very clearly. “Visually lossless” is a myth in preservation: information has been discarded and forever lost. What may appear OK will impact all future uses of the video, from use in high quality productions to encoding the latest most-used streaming format for the web.

EXAMPLE: http://www.youtube.com/watch?v=NXF0ovsRcQg

The fuzziness or softness of this image is the encoder working extremely hard to hide the macroblock tiles. Clearly the data rate is not high enough, but even when it is “high enough”, the structure of the original has been changed at the macroblock boundaries.

Transcoding from one compression scheme to another can make this situation much worse. The cumulative error is extremely high, and the image quality will suffer through multiple decode/encode cycles. MPEG2 uses 8x8 macroblocks, while MPEG4 uses variable block sizes that can be 4, 8, 12 or 16 pixels square. In transcoding, it is a certainty that the macroblocks will be subdivided differently for every frame; while this is good for encoder efficiency, it is very bad for transcoding between different compression schemata¹².

This is an extreme example that uses the same JPEG encoding over and over, but it makes the case for cumulative encoding errors. Since it uses the same encoding algorithm, the macroblocking does not deteriorate. EXAMPLE: http://vimeo.com/3750507
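The generation-loss argument can be modelled very crudely. In this sketch, each lossy encode is reduced to rounding onto a grid of levels, with exact rational arithmetic so the result is deterministic; the step sizes and offsets are arbitrary illustrative choices, not properties of any real codec.

```python
from fractions import Fraction as F

# Toy model of generation loss (not a real codec): a lossy encode is
# modelled as rounding to a grid of levels. Re-encoding with the SAME
# grid is idempotent after the first pass, but transcoding between two
# misaligned grids keeps shifting the value, so error accumulates.
def quantize(x, step, offset=F(0)):
    """Round x to the nearest multiple of `step`, shifted by `offset`."""
    return F(round((x - offset) / step)) * step + offset

value = F(123456, 1_000_000)          # the "original signal"
same = alternating = value
for generation in range(10):
    same = quantize(same, F(1, 100))  # same scheme every generation
    step = F(1, 100) if generation % 2 else F(7, 1000)
    alternating = quantize(alternating, step, offset=F(generation, 2000))

print(float(abs(same - value)))         # settles after the first pass
print(float(abs(alternating - value)))  # several times larger
```

This mirrors the two examples above: the repeated-JPEG clip degrades slowly because the same scheme is reused, while transcoding between differently aligned schemes compounds the damage.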

In an ideal world there would be commandments dictating the treatment of video for preservation. Perhaps something along the lines of:

• “Thou shalt not compress video”
• “If video is already compressed, you may leave it this way”
• “If you choose not to support this form of video compression, your only choice is to decompress and store uncompressed” (with process history metadata recording that it was previously compressed)

5.  Conclusion  

The history of the 20th century is unique in the amount of the human experience captured in time-based media, audio and video. The cultural record captured in video faces format obsolescence and rapid change. Playback equipment for legacy carriers and formats is long out of production and rapidly disappearing. This paper has argued that the race against the time when playback equipment will no longer be available should not lead to compromising fundamental principles of quality and accuracy of digitization. The technologies and formats exist to capture endangered media without significant loss. Yet many people advocate the use of lower resolution and compression in the name of smaller file sizes. The information lost in these decisions is forever lost and will negatively impact future use of, and access to, this valuable material.

¹² Indeed this is a large part of the problem in the YouTube example: it’s been transcoded a few times at low bit rates.

6.  Recommended  Target  Formats  for  Digitizing  Video  

This is a summary of the recommendations for digital file formats for preserving video made in the white paper for the Library of Congress. The complete paper is available at: http://dl.dropbox.com/u/11583358/IntrmMastVidFormatRecs_20111114.pdf

1. All analogue sources: 10-bit uncompressed, 720x486
2. Digital sources on tape, non-transcoded transfer possible: keep native; may decode to uncompressed
3. Digital sources on tape, transcode necessary: 10-bit uncompressed, 720x486
4. Digital sources on other media: evaluate; keep native or uncompressed
5. Optical discs: ISO disc image