Full Resolution Image Compression with Recurrent Neural Networks
G. Toderici, D. Vincent, N. Johnston, et al.
Presented by Zhiyi Su, NDEL group presentation, 09/30/2016

Motivation/Objectives
• Further reduce the size of media files, enabling denser storage and shorter transmission times.
• Provide a neural network that is competitive across compression rates on images of arbitrary size.
• Image compression is an area where neural networks were expected to do well.
• A previous study showed that better compression rates are achievable, but it was limited to 32×32 images.
• Standard image compression focuses on large images and ignores (or even harms) low-resolution images, yet low-resolution images are increasingly in demand, e.g. thumbnail preview images on the internet.

General image compression (JPEG: Joint Photographic Experts Group)
Source image → Source encoder → Quantizer → Entropy encoder → Compressed signal (01010010001001010…)
• Source encoder: usually breaks the image into small blocks (e.g. JPEG uses 8×8 pixel blocks) and transforms the signal into the frequency domain, e.g. the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), or Discrete Wavelet Transform (DWT)*. This step is lossless.
• Quantizer: drops the high-frequency components. This step is lossy.
• Entropy encoder: represents the signal with the smallest number of bits possible, based on the probability of each symbol occurring**. The most common scheme is Huffman coding. This step is lossless.
* The transforms mentioned here are all orthogonal transforms, at least on the given interval, but this is not strictly required. Compressive sensing theory states that if a signal is sparse in some basis, it can be recovered from a small number of random linear measurements via tractable convex optimization techniques. For reference: http://airccse.org/journal/jma/7115ijma01.pdf
** For a reference on entropy coding: http://www.math.tau.ac.il/~dcor/Graphics/adv-slides/entropy.pdf

Image compression with neural networks
[Figure: Source image (32×32×3) → Encoder → Binarizer → Decoder → Reconstructed image; the residue between source and reconstruction is to be minimized; the output compressed signal is 128 bits per iteration.]
A few things to notice:
• RNN: Recurrent Neural Network.
• Progressive method: the network generates better and better results over iterations (but also produces a larger file).
Notation: Et: encoder at step t; B: binarizer; Dt: decoder at step t; rt: residue at step t.

Recurrent Neural Network
Recurrent neural networks have loops! [Figure: an unrolled recurrent neural network]
• xt is the input at time step t.
• ht is the hidden state at time step t: ht = f(U·xt + W·ht-1).
• ot is the output at time step t (not depicted in the figure); it is a function of xt and ht: ot = g(xt, ht).
A few things to notice:
• ht can be thought of as the memory of the network.
• The kernels/parameters f, g, U, W of the network are unchanged across all steps/iterations, which greatly reduces the number of parameters to learn.
• RNNs are currently under active study in language modeling, machine translation, speech recognition, etc.
For a more thorough introduction to RNNs: http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
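To make the recurrence above concrete, here is a minimal NumPy sketch of a single vanilla RNN cell. The dimensions, the tanh nonlinearity, the extra output matrix V, and all variable names are illustrative assumptions; the paper itself uses convolutional LSTM/GRU cells, not this toy dense cell.

```python
import numpy as np

# Minimal vanilla RNN step: h_t = tanh(U x_t + W h_{t-1}), o_t = g(x_t, h_t).
# Sizes, initialization, and the output map are illustrative only.
rng = np.random.default_rng(0)
input_dim, hidden_dim = 8, 16
U = rng.normal(scale=0.1, size=(hidden_dim, input_dim))   # input-to-hidden weights
W = rng.normal(scale=0.1, size=(hidden_dim, hidden_dim))  # hidden-to-hidden weights
V = rng.normal(scale=0.1, size=(input_dim, hidden_dim))   # hidden-to-output weights (assumed)

def rnn_step(x_t, h_prev):
    """One recurrence step; the same U, W, V are reused at every time step."""
    h_t = np.tanh(U @ x_t + W @ h_prev)  # new hidden state, the network's "memory"
    o_t = V @ h_t                        # output read from the hidden state
    return h_t, o_t

# Unroll over a short input sequence; note the parameters never change.
h = np.zeros(hidden_dim)
for x in rng.normal(size=(5, input_dim)):
    h, o = rnn_step(x, h)
```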
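Looking back at the classical transform-coding pipeline (source encoder → quantizer → entropy encoder) described two slides earlier, here is a rough JPEG-like sketch for a single 8×8 block using SciPy's DCT. The uniform quantization step of 16 is a simplifying assumption standing in for JPEG's frequency-dependent quantization tables, and no entropy coder is actually run.

```python
import numpy as np
from scipy.fft import dctn, idctn

# One 8x8 grayscale block with values in 0..255; random data stands in for pixels.
rng = np.random.default_rng(0)
block = rng.integers(0, 256, size=(8, 8)).astype(float)

# Source encoder: 2-D DCT of the level-shifted block -- a lossless change of basis.
coeffs = dctn(block - 128.0, norm="ortho")

# Quantizer: divide by a step size and round. A uniform step of 16 is an
# illustrative assumption; JPEG uses an 8x8 table that quantizes high
# frequencies more coarsely. This is the only lossy step.
step = 16.0
quantized = np.round(coeffs / step)

# An entropy encoder (e.g. Huffman coding) would then losslessly pack
# `quantized`, which contains many zeros and small integers after quantization.

# Decoder: dequantize and invert the DCT to reconstruct the block.
reconstructed = idctn(quantized * step, norm="ortho") + 128.0
print("max abs reconstruction error:", np.abs(block - reconstructed).max())
```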
Unrolled structure of image compression model
[Figure: the source image (32×32×3) passes through Encoder → Binarizer → Decoder to give a predicted residue; the remaining residue feeds the next iteration's encoder, and so on, with each stage carrying its own hidden state (h0, h1, h2, …) and reusing the same weights.]
• Output compressed signal: 128×M bits, M being the number of steps, so compressed files grow in increments of 128 bits.
• This forms a progressive method in terms of bit rate (unit: bpp, bits per pixel); the final reconstruction is assembled from the outputs of the iterations. (A code sketch of this loop follows the Conclusions slide.)

Types of recurrent units
• LSTM (Long Short-Term Memory)
• Associative LSTM
• GRU (Gated Recurrent Units)
Bit rate per 32×32 tile: 1 iteration → 128 bits / (32×32 pixels) = 0.125 bpp; 4 iterations → 512 bits / (32×32 pixels) = 0.5 bpp; …; 16 iterations → 2048 bits = 2 bpp.
For details of these recurrent units, please see the original paper: http://arxiv.org/abs/1608.05148
A very good explanation of LSTMs: http://colah.github.io/posts/2015-08-UnderstandingLSTMs/

Reconstruction framework
• "One shot" (γ = 0): the output of each iteration is a complete reconstruction on its own.
• Additive (γ = 1): the final image reconstruction is the sum of the outputs of all iterations.
• Residual scaling: similar to additive, but the residue is scaled by a gain factor before being passed to the next iteration.
Notation: Et: encoder at step t; B: binarizer; Dt: decoder at step t; rt: residue at step t; x: the input image; x̂t: the estimate of x at step t; bt: the compressed bit stream output at step t; gt: the scaling/gain factor at step t.

Results
Training: a random sample of 6 million 1280×720 images from the web; each image is decomposed into non-overlapping 32×32 tiles, and the 100 tiles with the worst compression ratio under the PNG algorithm are sampled. By selecting the patches that compress the least under PNG, the authors intend to create a dataset of "hard-to-compress" data, on the hypothesis that training on such patches should yield a better compression model. All models are trained for approximately 1,000,000 training steps.
Evaluation dataset: Kodak Photo CD dataset (24 images).
Evaluation metrics:
1. Multi-Scale Structural Similarity (MS-SSIM)
2. Peak Signal to Noise Ratio - Human Visual System (PSNR-HVS)
For both metrics, higher values imply a closer match between the reconstruction and the reference image.

Results
Abbreviations: LSTM: Long Short-Term Memory network; GRU: Gated Recurrent Units; MS-SSIM: Multi-Scale Structural Similarity; PSNR-HVS: Peak Signal to Noise Ratio - Human Visual System; AUC: Area Under the rate-distortion Curve; bpp: bits per pixel (smaller values imply a higher compression ratio).

Examples
[Figures: original images compared with JPEG (4:2:0) and LSTM residual-scaling reconstructions at 0.5 bpp, 1 bpp, and 2 bpp.]

Conclusions:
• Presented an RNN-based image compression network.
• On average, this network achieves better-than-JPEG performance across image sizes and compression rates.
• Exceptionally good performance at low bit rates.
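To make the progressive scheme above concrete, here is a minimal sketch of the iterative encode/binarize/decode loop with additive reconstruction and the 128-bits-per-iteration accounting. The encoder, binarizer, and decoder are stand-in stubs (fixed random projections and a sign binarizer), and every name here is illustrative; only the control flow mirrors the scheme in the slides, not the paper's recurrent convolutional networks.

```python
import numpy as np

rng = np.random.default_rng(0)
TILE = 32 * 32 * 3        # flattened 32x32x3 tile
BITS_PER_ITER = 128       # each iteration emits 128 bits

# Stand-in "networks": fixed random projections instead of the paper's
# recurrent convolutional encoder/decoder, and sign() as the binarizer.
W_enc = rng.normal(scale=0.01, size=(BITS_PER_ITER, TILE))
W_dec = rng.normal(scale=0.01, size=(TILE, BITS_PER_ITER))

def encode(residual):      # stands in for E_t
    return W_enc @ residual

def binarize(code):        # stands in for B: real-valued code -> +/-1 bits
    return np.sign(code)

def decode(bits):          # stands in for D_t: bits -> predicted residue
    return W_dec @ bits

def progressive_compress(x, iterations=4):
    """Additive reconstruction (gamma = 1): sum the per-iteration outputs."""
    residual = x.copy()
    reconstruction = np.zeros_like(x)
    bitstream = []
    for t in range(iterations):
        bits = binarize(encode(residual))   # b_t: 128 bits for this iteration
        bitstream.append(bits)
        reconstruction += decode(bits)      # additive: running sum of outputs
        residual = x - reconstruction       # r_t, fed to the next iteration
        bpp = (t + 1) * BITS_PER_ITER / (32 * 32)
        print(f"iteration {t + 1}: {bpp:.3f} bpp")
    return bitstream, reconstruction

tile = rng.random(TILE)                     # a fake image tile in [0, 1)
progressive_compress(tile, iterations=4)    # prints 0.125, 0.25, 0.375, 0.5 bpp
```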
Future works:
• Test this method on video compression.
• The domain of perceptual difference metrics is still very much under active development. If a metric capable of correlating with human raters for all types of distortions becomes available, it could be incorporated directly into the loss function and optimized for directly.

Thanks for your attention!
• Any questions or comments?

Huffman coding
• Consider the string AAAABBBCCDAAAAAAAAAAAA (16 A's, 3 B's, 2 C's, 1 D; 22 symbols).
• Regular fixed-length coding: A = 00, B = 01, C = 10, D = 11, i.e. 2 bits per symbol → 44 bits.
• Compared with a Huffman code built from the symbol frequencies (e.g. A = 0, B = 10, C = 110, D = 111) → 31 bits. (A worked code sketch follows the next slide.)

Convolutional neural network
• Most commonly used for pattern recognition and classification.
• For reference, see: http://cs231n.github.io/convolutional-networks/
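For the Huffman coding slide above, here is a small self-contained sketch that builds a Huffman code with a heap and checks the bit counts for the example string. The exact 0/1 assignment it produces may differ from the one quoted on the slide (Huffman codes are unique only up to relabeling), but the code lengths and the 31-vs-44-bit total are the same.

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code for `text` and return {symbol: bitstring}."""
    freq = Counter(text)
    # Heap entries: (weight, unique tie-breaker, {symbol: code-so-far}).
    heap = [(w, i, {sym: ""}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    next_id = len(heap)
    while len(heap) > 1:
        w1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        w2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}
        merged.update({s: "1" + c for s, c in c2.items()})
        heapq.heappush(heap, (w1 + w2, next_id, merged))
        next_id += 1
    return heap[0][2]

text = "AAAABBBCCDAAAAAAAAAAAA"
codes = huffman_codes(text)
huffman_bits = sum(len(codes[s]) for s in text)
fixed_bits = 2 * len(text)                # A=00, B=01, C=10, D=11
print(codes)                              # code lengths 1, 2, 3, 3 for A, B, C, D
print(huffman_bits, "bits vs", fixed_bits, "bits")  # 31 bits vs 44 bits
```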