Full Resolution Image Compression with Recurrent Neural Networks
G. Toderici, D. Vincent, N. Johnston, et al.
Presented by Zhiyi Su at the NDEL group presentation, 09/30/2016
Motivation/Objectives
• Further reduce the size of media files, enabling denser storage and shorter transmission times.
• Provide a neural network that is competitive across compression rates on images of arbitrary size.
• Image compression is an area where neural networks were suspected to perform well.
• A previous study showed that better compression rates are achievable, but it was limited to 32×32 images.
• Standard image codecs focus on large images and neglect (or even degrade) low-resolution images. Yet low-resolution images are increasingly in demand, e.g. thumbnail preview images on the internet.
General image compression
A typical codec pipeline: Source image → Source encoder → Quantizer → Entropy encoder → Compressed signal (a bitstream such as 0101001000100101…).
• Source encoder: usually breaks the image into small blocks (e.g. JPEG uses 8×8 pixel blocks) and transforms the signal into the frequency domain, e.g. the Discrete Fourier Transform (DFT), Discrete Cosine Transform (DCT), or Discrete Wavelet Transform (DWT)*. This step is lossless.
• Quantizer: drops the high-frequency components. This step is lossy.
• Entropy encoder: represents the signal with the smallest number of bits possible, based on knowledge of the probability of each symbol occurring**. The most common model is Huffman coding. This step is lossless.
JPEG: Joint Photographic Experts Group
* The transforms mentioned here are all orthogonal transforms, at least on the given interval, but orthogonality is not strictly required. Compressive sensing theory states that if signals are sparse in some basis, they can be recovered from a small number of random linear measurements via tractable convex optimization techniques. For reference: http://airccse.org/journal/jma/7115ijma01.pdf
** For a reference on entropy coding: http://www.math.tau.ac.il/~dcor/Graphics/adv-slides/entropy.pdf
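The source-encoder and quantizer stages above can be sketched in a few lines. This is a toy illustration, not JPEG itself: it applies an orthonormal 2-D DCT to one 8×8 block and keeps only the largest-magnitude coefficients, a crude stand-in for JPEG's real quantization tables.

```python
import numpy as np

def dct_matrix(n=8):
    """Orthonormal DCT-II basis matrix (the transform JPEG applies per block)."""
    k = np.arange(n)
    m = np.cos(np.pi * (2 * k[None, :] + 1) * k[:, None] / (2 * n))
    m[0] *= 1 / np.sqrt(2)
    return m * np.sqrt(2 / n)

def compress_block(block, keep=10):
    """Transform an 8x8 block, zero all but the `keep` largest coefficients."""
    D = dct_matrix(block.shape[0])
    coeffs = D @ block @ D.T                 # 2-D DCT: lossless, invertible
    thresh = np.sort(np.abs(coeffs).ravel())[-keep]
    q = np.where(np.abs(coeffs) >= thresh, coeffs, 0.0)  # the lossy step
    return D.T @ q @ D                       # inverse transform -> reconstruction

block = np.outer(np.linspace(0, 1, 8), np.linspace(0, 1, 8)) * 255
rec = compress_block(block, keep=10)
print(np.max(np.abs(rec - block)))  # small error from the dropped coefficients
```

With `keep=64` (all coefficients kept) the round trip is exact, which is what "this step is lossless" means for the transform itself.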
Image compression with neural networks
Pipeline: Source image (32×32×3) → Encoder → Binarizer → Decoder → Reconstructed image; the residue (to be minimized) is fed back for the next iteration. The output compressed signal is 128 bits per iteration.
• A few things to be noticed:
• RNN: Recurrent Neural Network
• Progressive method: the network generates better and better results over iterations (but also produces a larger file size).
Et: encoder at step t; B: binarizer; Dt: decoder at step t; rt: residue at step t
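A minimal sketch of the iterative residual loop just described. The random linear map, sign binarizer, and step size below are illustrative stand-ins for the paper's learned RNN encoder/decoder, not the actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def binarize(z):
    """Stand-in binarizer: the sign of each latent entry, one bit per entry."""
    return np.sign(z) + (z == 0)  # map zeros to +1 so every entry is +/-1

# Toy linear encoder/decoder (stand-ins for the learned networks E_t, D_t).
E = rng.standard_normal((128, 3072)) / np.sqrt(3072)  # 32*32*3 = 3072 -> 128
D = E.T                                               # crude decoder: transpose

x = rng.standard_normal(3072)            # flattened 32x32x3 "image"
residue = x.copy()
reconstruction = np.zeros_like(x)
for t in range(4):                       # 4 iterations -> 4 * 128 = 512 bits
    bits = binarize(E @ residue)         # 128 bits emitted this step
    reconstruction += 0.5 * (D @ bits)   # additive reconstruction, small step
    residue = x - reconstruction         # next step encodes what is still missing
    print(t, round(float(np.mean(residue ** 2)), 4))  # residual error shrinks
```

Each pass encodes only what the previous passes failed to reconstruct, which is why stopping early yields a smaller file and a coarser image.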
Recurrent Neural Network
Recurrent neural networks have loops! (Figure: an unrolled recurrent neural network.)
xt is the input at time step t.
ht is the hidden state at time step t: ht = f(U xt + W ht−1).
ot is the output at time step t (not depicted in the figure above); ot is also a function of xt and ht: ot = g(xt, ht).
A few things to be noticed:
• ht can be thought of as the memory of the network.
• The kernels/parameters f, g, U, W are unchanged across all steps/iterations. This greatly reduces the number of parameters we have to learn.
• RNNs are currently under active study in language modeling, machine translation, speech recognition, etc. For a more thorough introduction to RNNs:
http://www.wildml.com/2015/09/recurrent-neural-networks-tutorial-part-1-introduction-to-rnns/
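The recurrence ht = f(U xt + W ht−1) can be written out directly. Here f = tanh, and the dimensions are arbitrary, chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

input_dim, hidden_dim = 4, 3
U = rng.standard_normal((hidden_dim, input_dim)) * 0.1   # input-to-hidden weights
W = rng.standard_normal((hidden_dim, hidden_dim)) * 0.1  # hidden-to-hidden weights

def rnn_step(x_t, h_prev):
    """One recurrence: h_t = f(U x_t + W h_{t-1}), here with f = tanh."""
    return np.tanh(U @ x_t + W @ h_prev)

# Unroll over a short input sequence; note U and W are reused at every step --
# this parameter sharing is what keeps the parameter count small.
h = np.zeros(hidden_dim)
for x_t in rng.standard_normal((5, input_dim)):
    h = rnn_step(x_t, h)
print(h.shape)  # (3,)
```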
Unrolled structure of image compression model
(Figure: the model unrolled over steps. Source image (32×32×3) → Encoder (h0) → Binarizer → Decoder (h0) → predicted residue; the residue then passes through Encoder (h1) → Binarizer → Decoder (h1) → predicted residue′, then residue′′, and so on, until the final reconstruction.)
Output compressed signal: 128×M bits, M being the number of steps. This method can therefore produce compressed files in increments of 128 bits, forming a progressive method in terms of bit rate (unit: bpp, bits per pixel).
Types of recurrent units
• LSTM (Long Short-Term Memory)
• Associative LSTM
• GRU (Gated Recurrent Units)
Bit rate per iteration:
• 1 iteration: 128 bits → 128 bits / (32×32 pixels) = 0.125 bpp
• 4 iterations: 512 bits → 512 bits / (32×32 pixels) = 0.5 bpp
• …
• 16 iterations: 2,048 bits → 2 bpp
For reference on these methods, please see the original paper: http://arxiv.org/abs/1608.05148
as well as a very good explanation: http://colah.github.io/posts/2015-08-UnderstandingLSTMs/
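The bit-rate arithmetic above is simply iterations × bits-per-iteration divided by the pixel count; a one-line helper makes it explicit (the function name and defaults are ours, not from the paper):

```python
def bits_per_pixel(iterations, bits_per_iteration=128, width=32, height=32):
    """Bit rate of the progressive codec after a given number of iterations."""
    return iterations * bits_per_iteration / (width * height)

print(bits_per_pixel(1))   # 0.125
print(bits_per_pixel(4))   # 0.5
print(bits_per_pixel(16))  # 2.0
```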
Reconstruction framework
• "One-shot": γ = 0. The output of each iteration represents a complete reconstruction.
• Additive: γ = 1. The final image reconstruction is the sum of the outputs of all iterations.
• Residual scaling: similar to additive, but the residue is scaled before being passed to the next iteration.
Et: encoder at step t; B: binarizer; Dt: decoder at step t; rt: residue at step t; x: the input image; x̂t: the estimate of x at step t; bt: the output compressed stream at step t; gt: the scaling/gain factor at step t
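The first two variants can be expressed through a single update, x̂t = Dt(bt) + γ·x̂t−1: with γ = 0 only the last decoder output survives (one-shot), while γ = 1 accumulates all outputs (additive). A sketch applying that update to precomputed decoder outputs (the toy numbers are ours, not real network outputs):

```python
import numpy as np

def reconstruct(decoder_outputs, gamma):
    """Apply x_hat_t = D_t(b_t) + gamma * x_hat_{t-1} over all steps."""
    x_hat = np.zeros_like(decoder_outputs[0])
    for out in decoder_outputs:
        x_hat = out + gamma * x_hat
    return x_hat

outs = [np.array([1.0, 2.0]), np.array([0.5, -0.5]), np.array([0.1, 0.1])]
print(reconstruct(outs, gamma=0))  # one-shot: last output only -> [0.1 0.1]
print(reconstruct(outs, gamma=1))  # additive: sum of all outputs -> [1.6 1.6]
```

Residual scaling keeps the same additive structure but rescales each residue by a learned gain gt before re-encoding it.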
Results
Training: a random sample of 6 million 1280×720 images from the web. Each image is decomposed into non-overlapping 32×32 tiles, and the 100 tiles with the worst compression ratio under the PNG algorithm are sampled. By selecting the patches that compress the least under PNG, the authors intend to create a dataset of "hard-to-compress" data; the hypothesis is that training on such patches should yield a better compression model. All models were trained for approximately 1,000,000 training steps.
Evaluation dataset: Kodak Photo CD dataset (24 images).
Evaluation metrics:
1. Multi-Scale Structural Similarity (MS-SSIM)
2. Peak Signal to Noise Ratio - Human Visual System (PSNR-HVS)
For both metrics, higher values imply a closer match between the reconstructed and reference images.
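MS-SSIM and PSNR-HVS are specialized perceptual metrics, but plain PSNR shows the shared idea: higher means closer to the reference. A minimal version (PSNR-HVS additionally weights errors by human visual sensitivity, which is omitted here):

```python
import numpy as np

def psnr(reference, reconstruction, max_val=255.0):
    """Peak signal-to-noise ratio in dB for images on a 0..max_val scale."""
    mse = np.mean((reference.astype(float) - reconstruction.astype(float)) ** 2)
    return 10 * np.log10(max_val ** 2 / mse)

ref = np.full((8, 8), 128.0)
rec = ref + 5.0                   # uniform error of 5 gray levels -> MSE = 25
print(round(psnr(ref, rec), 2))   # 10*log10(255^2 / 25) = 34.15
```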
Results
LSTM: Long Short-Term Memory network
GRU: Gated Recurrent Units
MS-SSIM: Multi-Scale Structural Similarity
PSNR-HVS: Peak Signal to Noise Ratio - Human Visual System
AUC: Area Under the rate-distortion Curve
Example
bpp: bits per pixel. Smaller values imply a higher compression ratio.
(Figures: side-by-side comparisons of Original vs. JPEG 4:2:0 at 0.5 bpp; Original vs. LSTM residual scaling at 0.5 bpp; Original vs. JPEG 4:2:0 at 1 bpp; Original vs. LSTM residual scaling at 1 bpp; Original vs. JPEG 4:2:0 at 2 bpp; Original vs. LSTM residual scaling at 2 bpp.)
Conclusions:
• Presented an RNN-based image compression network.
• On average, this network achieves better-than-JPEG performance across all image sizes and compression rates.
• Exceptionally good performance at low bit rates.
Future work:
• Test this method on video compression.
• The domain of perceptual difference metrics is still very much in active development. Given a metric capable of correlating with human raters for all types of distortions, it could be incorporated directly into the loss function and optimized for directly.
Thanks for your attention!
• Any questions or comments?
Huffman coding
• Consider a string: AAAABBBCCDAAAAAAAAAAAA
• Regular (fixed-length) coding: A = 00, B = 01, C = 10, D = 11, giving 22 symbols × 2 bits = 44 bits.
• Compared with Huffman coding, which assigns shorter codes to more frequent symbols, e.g. A = 0, B = 10, C = 110, D = 111, giving 16×1 + 3×2 + 2×3 + 1×3 = 31 bits.
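The comparison above can be reproduced with the standard greedy Huffman construction (this implementation is ours; the exact bit patterns may differ from the slide's, but any valid Huffman code gives the same total length):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a Huffman code by repeatedly merging the two rarest subtrees."""
    heap = [(freq, i, {sym: ""}) for i, (sym, freq) in
            enumerate(Counter(text).items())]
    heapq.heapify(heap)
    tiebreak = len(heap)  # unique counter so tuples never compare the dicts
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)   # two least frequent subtrees
        f2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + code for s, code in c1.items()}
        merged.update({s: "1" + code for s, code in c2.items()})
        heapq.heappush(heap, (f1 + f2, tiebreak, merged))
        tiebreak += 1
    return heap[0][2]

text = "AAAABBBCCDAAAAAAAAAAAA"
codes = huffman_codes(text)
encoded_len = sum(len(codes[s]) for s in text)
print(encoded_len, "bits vs", 2 * len(text), "bits fixed-length")  # 31 vs 44
```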
Convolutional neural network
• Most commonly used for pattern recognition and classification.
• For reference, see:
http://cs231n.github.io/convolutional-networks/