Reference Constrained Cartoon Colorization

Mit Shah, Nayan Singhal, Tushar Nagarajan
University of Texas at Austin
Austin, TX 78712
[email protected], [email protected], [email protected]
Abstract

Recent work on colorization using deep networks has achieved impressive results on natural images. The colorization task in these works is underconstrained - a grayscale image of a t-shirt could be any of several colors, which results in ambiguous colorization. In this work, we look at the cartoon domain, where the color ambiguity is much higher than in natural images. We explore this colorization task with an additional constraint of being consistent with a reference image. We train on a set of cartoons and investigate if our model generalizes to unseen cartoons and natural images.

1. Introduction

Previous works in colorization have highlighted the difficulty of colorizing grayscale images [17, 9]. For natural images, humans easily form semantic correlations between common scene elements and their colors (water being blue, ground being brown). These semantic associations form the basic building blocks for a colorization algorithm to correctly impart color to a grayscale image.

One natural issue that arises from the way the problem is defined is color ambiguity. Colorization is an underconstrained problem, which means that there can be many plausible colorizations for the same object. For example, fruits like grapes, apples and bananas can take multiple colors, but grayscale images of each of them could be identical. A fully automatic system can generate any of the plausible colorizations, making learning parameters for this kind of model a difficult task.

To address this issue, interactive models have been proposed where a user selects colors for some input regions (Scribble based [10, 7, 16, 11, 14]) or a user gives a reference image (Transfer based [6]) from which color should be imparted. This eliminates the ambiguity in colorization but introduces a certain level of human intervention which is absent in fully automatic models.

In this work, we address the problem of colorization in the cartoon domain. The cartoon domain comes with its own unique challenges:

• The color ambiguity is much higher in general - cartoons exaggerate color distributions.

• The pixel gradients are much lower across pixels - most regions are uniformly colored except at the edges, where there is an abrupt change.

We aim to address these issues with a combination of fine-tuning for our cartoon domain and including extra supervision from a reference image. Our task, therefore, is to colorize a cartoon image to be consistent with a reference image.

For the reference image, we choose natural images from live-action adaptations of cartoons rather than other cartoons. The motivation behind this decision is two-fold. (1) Using cartoon references reduces this problem to a version of the colorization by super-pixel matching task in [6]; we would like to incorporate learning into our system to allow for better generalization. (2) We would like to test the generalizability of our system for natural images as well. Given a cartoon of a cat, and a natural cat image with different colors and textures, does the system impart those colors/textures onto the cartoon image?

We choose the cartoon domain for one other reason - accumulating a large amount of supervised data with images colored consistently with a reference is difficult, as discussed in detail in Section 3. Frames from cartoon clips provide a large amount of training data. For reference images, live-action adaptations of popular cartoons provide a decent source of this information. We build a dataset from 15 cartoons and their live-action adaptations for this purpose.

Building upon the model in [17], we train several variations of this deep net to predict the A,B channels of an image in LAB color-space given the L channel and a reference image. We perform an array of experiments highlighting the applicability of each model for our domain.
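To make the prediction target concrete, the decomposition into an L-channel input and an a,b-channel target can be sketched as follows. This is a minimal illustration using skimage; the helper names are ours and not part of [17] or our exact implementation.

```python
import numpy as np
from skimage import color

def lab_split(rgb_image):
    """Split an RGB image (H x W x 3, floats in [0, 1]) into the L channel
    (network input) and the a,b channels (prediction target)."""
    lab = color.rgb2lab(rgb_image)   # L in [0, 100], a/b roughly in [-110, 110]
    L = lab[:, :, :1]                # lightness only: what the model sees
    ab = lab[:, :, 1:]               # chrominance: what the model must predict
    return L, ab

def lab_join(L, ab):
    """Recombine a predicted a,b map with the original L channel."""
    lab = np.concatenate([L, ab], axis=2)
    return np.clip(color.lab2rgb(lab), 0.0, 1.0)
```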
2. Related Work
Colorization methods can mainly be classified into two categories: user-interaction-driven methods and fully automatic methods. The former can be further classified into Scribble based methods and Transfer based methods.
Levin et al. [10] proposed Scribble based methods, which require manually specifying desired output colors for certain spatial locations in the image. These colors are then propagated to adjacent pixels by a least-squares optimization method.
Interactive refinement is allowed via specifying additional
colors. Further improvements to scribble based methods
were proposed in [7, 16, 14, 11].
Automatic methods pose the problem as either regression onto continuous color space [2, 4] or classification of
quantized color values [1, 17].
Zhang et al. [17] cast colorization as a classification task
and use class re-balancing at training time to increase the
diversity of colors in the result. Larsson et al. [9] make use
of low-level and semantic representations to train a model
to predict per-pixel color histograms.
2.1. Colorization using Reference Images

Transfer-based methods use semantically related reference images, from which color is imparted to the target image. Mappings between reference and target images are generated automatically, using correspondences between local descriptors [1, 12, 15], or in combination with manual inputs [3, 8].

Gupta et al. [6] proposed an example-based method to colorize a grayscale image, in which the user provides a reference image that is semantically similar to the target image. Features are then extracted at the superpixel resolution and used in the colorization process.

Our work is similar to these works in that we aim to transfer color from semantically relevant regions of a reference onto a target image. We avoid using local descriptor features, as our reference image is from a different domain (natural images) than our cartoon image.

Gatys et al. [5] introduced a deep neural network that uses neural representations to separate and recombine the content and style of arbitrary images, and then imposes the style of one image onto the content of another - allowing the creation of artistic images.

We aim to do something similar for color - but with the additional constraint of imposing the color at the semantically relevant regions. Using the methods of [5], there is no supervision for where the stylistic elements of one image get transferred in the content image, which is why we base our models on the traditional colorization literature.

Our work borrows ideas from [6] with regard to using reference images, and from [17] for the end-to-end neural model, to adapt the colorization-with-a-reference problem to the domain where the reference is a natural image and the target image is a cartoon image.

3. Dataset

Generating reference images that are different from the original images is not a straightforward task - category labels serve as a very coarse proxy for usable references (for example, for a cartoon image of a dog, using the ImageNet category 'dog' as reference images). During training, however, the consistently colorized output is also required, which is unavailable when selecting references in this manner.

Therefore, to create our dataset, we use a series of cartoons and their live-action movie counterparts to harvest a set of consistent references for our cartoon frames. Our dataset primarily consists of cartoons involving characters in distinct costumes, because the live-action counterparts will share similar color schemes, making them usable references.

For each cartoon clip, frames are extracted every 4 seconds. From these frames, several key-frames (with main characters) are manually selected and serve as seeds for extracting other similar frames. Similar frames are selected by ranking using cosine similarity over color-histogram features. From each cartoon, several sets of frames corresponding to different characters are extracted. Frames from the live-action clips are extracted in a similar manner.

In an attempt to train on sketches as our primary domain, instead of grayscale images, we also generate a corresponding sketch for each cartoon frame. These are based on cartoon outlines, and do not directly represent human-drawn sketches, but serve as a proxy for them. For each cartoon frame, the edges are enhanced using the open-source image manipulation tool GIMP, and then thresholding is performed to obtain a binary image representing the sketch.

The resulting dataset contains 1133 cartoon images (involving 15 unique characters) and 1115 reference images that span the same characters. The dataset generation process, along with examples from the dataset, can be seen in Figure 1.
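As an illustration of the key-frame expansion step described above, the ranking by cosine similarity over color-histogram features can be sketched as follows. The histogram settings and helper names are illustrative, not our exact configuration.

```python
import numpy as np

def color_histogram(image, bins=8):
    """Concatenated per-channel color histogram of an RGB image (H x W x 3, uint8)."""
    feats = [np.histogram(image[:, :, c], bins=bins, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(feats).astype(np.float64)
    return h / (h.sum() + 1e-8)

def rank_by_similarity(seed_frame, candidate_frames):
    """Rank candidate frames by cosine similarity to a manually chosen key-frame."""
    seed = color_histogram(seed_frame)
    sims = []
    for frame in candidate_frames:
        h = color_histogram(frame)
        sims.append(np.dot(seed, h) / (np.linalg.norm(seed) * np.linalg.norm(h) + 1e-8))
    return np.argsort(sims)[::-1]   # indices of the most similar frames first
```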
3.1. Data Augmentation
For each cartoon image, while there may be several candidate reference images, there is only one dominant color
scheme (Spider-man reference and cartoon images all have
the character in a red costume). This bias allows models
trained on this data to ignore the reference image entirely
and predict the ’default’ color for the input image. Therefore, we randomly vary the hue of both the reference and the
cartoon images during training by the same amount, forcing
the network to learn a dependence on the reference image.
We also employ standard data augmentation techniques like
random cropping and mirroring to learn a more generalizable model.
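The paired hue augmentation can be sketched as below, using skimage's HSV conversion. This is a minimal version; our actual augmentation pipeline may differ in details such as the range of the hue shift.

```python
import numpy as np
from skimage import color

def shift_hue_pair(cartoon_rgb, reference_rgb, rng=np.random):
    """Apply the same random hue rotation to a (cartoon, reference) pair so the
    network cannot ignore the reference and fall back on a 'default' color."""
    offset = rng.uniform(0.0, 1.0)      # hue is cyclic in [0, 1] in HSV space
    shifted = []
    for rgb in (cartoon_rgb, reference_rgb):
        hsv = color.rgb2hsv(rgb)
        hsv[:, :, 0] = (hsv[:, :, 0] + offset) % 1.0
        shifted.append(color.hsv2rgb(hsv))
    return shifted[0], shifted[1]
```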
Figure 1. Dataset generation pipeline. (1) Frames are extracted from the cartoon clip and the corresponding live-action clip. (2) Key-frames
are manually selected for dataset expansion. (3) Other frames are ranked using the key-frames to construct similar sets of frames. (4) Color
augmentation is applied to randomly selected pairs of frames (from cartoon and live-action) and included in the training set.
4. Model Architecture

We build our models upon the modified VGG network used in [17]. We explore several model architectures and motivate their use in the different experiments performed. The basic architecture consists of 8 blocks, each with the structure (conv, ReLU, conv, ReLU, BatchNorm) and no pooling layers.

The general idea explored in our models is how to incorporate color information from a reference image into the predictions on the grayscale image. The models differ in how they incorporate this information.
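For illustration, one block of such a backbone might be written as follows. The channel widths, dilations, and number of output bins are placeholders rather than the exact configuration of [17].

```python
import torch.nn as nn

def conv_block(in_ch, out_ch, dilation=1):
    """One backbone block: (conv, ReLU, conv, ReLU, BatchNorm), no pooling.
    Widths and dilations are illustrative, not the exact values of [17]."""
    pad = dilation
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=pad, dilation=dilation),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=pad, dilation=dilation),
        nn.ReLU(inplace=True),
        nn.BatchNorm2d(out_ch),
    )

def make_backbone(in_channels=1, width=64, num_ab_bins=313):
    """A single-branch backbone: 8 such blocks followed by a 1x1 head that scores
    the quantized ab bins (classification over color classes, as in [17])."""
    blocks = [conv_block(in_channels, width)]
    blocks += [conv_block(width, width) for _ in range(7)]
    return nn.Sequential(*blocks, nn.Conv2d(width, num_ab_bins, kernel_size=1))
```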
4.1. Zhang et al. VGG
As a baseline, we fine-tune the model from [17] to cater to our cartoon domain. In this model, no reference image is used to constrain the output. Due to the construction of our dataset, this model suffers from the ambiguity
of color assignment - the same grayscale image can have
multiple different colorizations without a reference. We use
the parameters learned from this model as initializations for
other model variants.
4.2. Stacked Input
A naive way to incorporate the reference image into the prediction is to stack the 1-channel input image with the 3-channel reference image (LAB) and train the network on the resulting 4-channel input. This serves as our second baseline.
4.3. Additive Connections

We incorporate the idea of skip connections similar to those used in [13]. In their work, two CNN branches are used to perform object segmentation: one branch for classification of objects, and one branch for segmentation. Activations from the classification branch are used by the segmentation branch to fine-tune the segmentation output.

In a similar manner, we create a clone of the VGG network used in [17] and use it as our reference branch, while the original CNN serves as our input branch. Connections from each of the blocks after conv3 are made from the reference branch to the input branch in order to incorporate color information in the final predictions. These connections serve as cross-domain links between the two branches.

The connections perform a convolution operation on the activations of one branch (to account for the domain change) and add the result to the corresponding activations of the other branch. Connections are made in both directions (input → reference and reference → input) to allow both branches to utilize information from the reference and input images during training. The model architecture can be seen in Figure 2. We refer to this model as full additive.

The input to the reference branch is a three channel (LAB) reference image, while the input branch receives only a single channel as before. The weights between these two branches are not shared in any way, as each branch is meant to capture different information. While the input branch is meant to learn information to preserve edge boundaries, learn common object-color correlations, etc., the reference branch is meant to retrieve color information and has more of a region-matching function. The two branches also operate on different kinds of inputs, and while they are not truly different modalities, distinct operations need to be performed on each - parameter sharing does not seem intuitive.

A simpler variant of this model, where information from only the reference branch is allowed to propagate to the input branch via additive connections, is also created. This simpler version is quicker to train, and performs quite differently from the more complex version. We refer to this model as simple additive.

Figure 2. Model Architecture: The input and reference branch have the same architecture as [17]. Information from each branch can propagate to the other via a bridging connection at each conv block. The information is combined using element-wise addition after a transformation to account for the input-to-reference or reference-to-input domain change. Color information from the reference image is used to influence the final color prediction for the input image, while boundary information from the input branch is used to influence which regions of color information are important.

4.4. Concatenating Connections

This model is similar to the model in Section 4.3, but instead of adding the activations from the reference branch directly to the activations of the input branch, the activations are concatenated. This is a minor variation on the previous model, and it performs slightly worse on the colorization task. It is explored here in the interest of completeness.
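A minimal sketch of the two kinds of bridging connections described in Sections 4.3 and 4.4 is given below. The 1x1 kernel size of the domain-change convolution and the extra merge convolution in the concatenating variant are assumptions for illustration, not details taken from our exact implementation.

```python
import torch
import torch.nn as nn

class AdditiveBridge(nn.Module):
    """Bidirectional bridging connection between the input and reference branches.
    A 1x1 convolution transforms activations across the domain change before they
    are added element-wise to the other branch (sketch of the 'full additive' link)."""

    def __init__(self, channels):
        super().__init__()
        self.ref_to_input = nn.Conv2d(channels, channels, kernel_size=1)
        self.input_to_ref = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, input_feats, ref_feats):
        new_input = input_feats + self.ref_to_input(ref_feats)
        new_ref = ref_feats + self.input_to_ref(input_feats)
        return new_input, new_ref


class ConcatBridge(nn.Module):
    """One-directional variant that concatenates (rather than adds) the transformed
    reference activations onto the input branch; a 1x1 conv restores the width."""

    def __init__(self, channels):
        super().__init__()
        self.transform = nn.Conv2d(channels, channels, kernel_size=1)
        self.merge = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, input_feats, ref_feats):
        fused = torch.cat([input_feats, self.transform(ref_feats)], dim=1)
        return self.merge(fused)
```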
5. Experiments
5.1. Fine-Tuning for Cartoon Domain
We fine-tune the model from [17] on our cartoon domain.
From Figure 3, we can see that fine-tuning on the cartoons
allows the model to learn domain specific characteristics
like strict edges and high color contrast. The images from
the fine-tuned model (base1) look visually more appealing.
This is expected as it is just a small domain shift.
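For reference, the fine-tuning procedure amounts to continuing training of the pretrained network on cartoon frames with a small learning rate. A simplified sketch follows; the optimizer, learning rate, and loss details (such as the class re-balancing used in [17]) are illustrative or omitted.

```python
import torch
import torch.nn.functional as F

def finetune(model, loader, epochs=10, lr=1e-5, device="cuda"):
    """Fine-tune a colorization network initialized from pretrained weights.
    `loader` is assumed to yield (L_channel, ab_class_targets) batches, where the
    target is the index of the quantized ab bin at each pixel."""
    model.to(device).train()
    opt = torch.optim.Adam(model.parameters(), lr=lr)  # small lr: small domain shift
    for _ in range(epochs):
        for L, ab_classes in loader:
            L, ab_classes = L.to(device), ab_classes.to(device)
            logits = model(L)                           # (N, Q, H, W) class scores
            loss = F.cross_entropy(logits, ab_classes)  # per-pixel classification loss
            opt.zero_grad()
            loss.backward()
            opt.step()
```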
Figure 3. Colorization of the model from [17] before and after fine-tuning on the cartoon domain.

5.2. Reference Consistent Colorization

We show the results of two models, full additive and simple additive (Section 4.3), on a variety of images from our test set. These images are from a cartoon from which no frames were used during training. This highlights the model's ability to generalize, to some extent, to different cartoons (still within the cartoon domain).

Overall, none of the models satisfactorily transferred the colors from the required regions of the reference image to the relevant regions of the target image. We present some observations about our models and speculate on their poor performance. From Figure 4 we see that a dependence on the reference branch has clearly been learned. However, instead of transferring color to the semantically correct regions of the image, the color is transferred to seemingly random parts of the image (primarily the background).

Ref    Basic      Simple Add   Concat     Stack      Full Add
1      7752.53    7228.94      6202.76    6303.77    7069.09
2      7753.73    7676.18      6767.40    7136.79    6916.49

Table 1. Average validation cross-entropy loss of each model for the two reference images used in Figure 4.

The difference between the simple additive model and the full additive model is also highlighted in the results. The simple additive model is more conservative in colorizing elements of the image, with several patches retaining their original 'default' color. This may be because of the limited influence that the reference branch has on the input (only simple element-wise additions, compared to the bidirectional connections in the full additive model).
Figure 4. Results on colorization of a series of test set images using two different reference images. While both models fail to place the
color of the reference image in the correct location, notice how the simple additive model is more conservative in distributing reference
image colors, while the full additive model is more liberal in doing so.
Figure 5. Results on colorization of a target cartoon image using four different reference images. Left: The full additive model colors the cat with some success according to the reference image. Right: The simple additive model correctly colors the pink cat, but both models have difficulty with the other reference images.
The reference images used in Figure 4 were full scenes from live-action movies containing multiple objects and complex backgrounds. As a simpler test, we use reference images with a single object to colorize cartoons. Using these references produces much better results. In Figure 5, we can see that the colors of the reference image find the correct object in the cartoon image to impose upon, while ignoring the background. The segmentation is also respected to some degree. While it is clear that the color distribution would not transfer correctly if different parts of the cats in the two images were colored differently, the fact that the color gets transferred to the foreground instead of the background is a small success.
5.3. Qualitative Results

We evaluate which target images and which reference images perform well on the task. To measure this, we run our models over all the cartoon images while keeping the reference image constant and calculate the average loss for each reference image. This way we can measure how effective a particular reference image is at colorizing the target images (and, similarly, how the structure of a target image helps it to be colorized well). We sort both the reference and target images in increasing order of loss; the images at the top are good (reference or target) images for the given task. The top 4 and bottom 4 target and reference images are shown in Figure 6.

Looking at the best performing target images, we can see that none of them contain Superman clearly and at a large scale, unlike the worst performing target images. The reason for this could be that, due to poor colorization, the images containing Superman clearly in the center (and at a large scale) suffer the highest loss.

On the other hand, most of the best performing reference images show the costume colors clearly, while in the worst performing reference images Superman is not large enough and/or the colors in his costume are not clearly visible.
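The ranking itself reduces to sorting references and targets by the mean loss they induce; a minimal sketch is given below (function and variable names are ours).

```python
import numpy as np

def rank_references_and_targets(loss_matrix, reference_ids, target_ids):
    """Given a matrix of losses with one row per reference image and one column per
    target cartoon frame, rank references and targets from best to worst by the
    average loss they induce."""
    ref_order = np.argsort(loss_matrix.mean(axis=1))     # low average loss = good reference
    target_order = np.argsort(loss_matrix.mean(axis=0))  # low average loss = easy target
    return ([reference_ids[i] for i in ref_order],
            [target_ids[j] for j in target_order])
```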
Figure 6. References and Cartoon Ranking
5.4. Quantitative Evaluation

To measure our models' performance, we compute the cross-entropy loss on the held-out test set. This test set consists of frames and references from a single, unseen
cartoon, and the loss is measured as an average across all
(image, reference) pairs in the dataset.
Model                        Avg. cross-entropy loss
Simple Additive Model        7637.30
Simple Concatenated Model    6565.17
Full Additive Model          6878.95

Table 2. Average cross-entropy loss on the held-out test set for all tested models.

Interestingly, the best model according to the test loss was the concatenated model, yet most of the colorizations produced by this method are washed out and qualitatively unappealing. This raises an issue about the method of evaluation (and even training!). A similar trend in performance is observed even for L2 loss on the AB channels.
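The reported numbers are averages of a per-pair loss over the test set; a sketch of the evaluation loop is given below. The two-argument model signature and the reduction over pixels are bookkeeping assumptions for illustration, not exact implementation details.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def average_test_loss(model, pairs, device="cuda"):
    """Average cross-entropy loss over all (input L channel, reference image,
    quantized ab target) triples in the held-out test set."""
    model.to(device).eval()
    total, count = 0.0, 0
    for L, reference, ab_classes in pairs:
        logits = model(L.to(device), reference.to(device))        # (N, Q, H, W)
        loss = F.cross_entropy(logits, ab_classes.to(device),
                               reduction="sum")                   # summed over pixels
        total += loss.item()
        count += 1
    return total / count
```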
5.5. Supplementary Experiments

5.5.1 Neural Style Transfer

The model of [5] was used to impose the style of the reference images onto the cartoon images. While the stylistic elements get transferred quite well, the model does not focus specifically on colors, and so while the output has its own merit, it is not suitable for our task.

Figure 7. Top row: reference images. Bottom row: style-imposed images. [5] applied to the reference cartoon image of the cat used in Figure 5, to highlight the difference between the proposed approach and the style-transfer approach.

5.5.2 Sketch Colorization

We attempt to extend our models to the sketch domain (constructed from the cartoon frames themselves). Sketches (the way we generate them) are a curious domain because large sections of the (now binarized) image are completely white or completely black. In the absence of gradient information, it is very hard to learn color mappings. The only changes in the image are edge gradients, and they occur at scales much larger than in the original grayscale image. The convolution kernel size used in our models was 3x3, which may have been too small to capture this information, resulting in a model that could not be trained. Figure 8 contains the output of our model on a set of sketches, demonstrating how poorly the model performed.

Figure 8. Output of the full additive model trained specifically on sketches.

6. Conclusion

In this work we explored the task of colorizing cartoon images with a reference image. We fine-tune a network to our cartoon domain and use it as a starting point for several experiments. The results of our experiments show that we are indeed able to bring some color information from the reference image into the colorization of our input image, but only to a certain degree.

In hindsight, this task is not at all straightforward. Implicit in the ability to successfully colorize a cartoon image with a natural-image reference is the ability to identify and draw a semantic mapping between regions from each of those images - for example, first identifying that the tail of a cat in a cartoon corresponds to the tail of a cat in a reference image. Previous work by [6] focused expressly on generating those mappings on images that were of the same domain, which itself proved to be difficult.

The second challenge, that of transferring colors, should have been the sole focus of this work. We used element-wise addition as a means for information transfer between the two branches in our network. This may have been an oversimplification, considering the ability that we wished the network to possess. A strategy that we did not explore would be to use the kernels from the reference branch to perform a cross-convolution onto the input branch stream, with the idea that the filters contain color information that needs to be imposed onto the input image.

References
[1] G. Charpiat, M. Hofmann, and B. Schölkopf. Automatic image colorization via multimodal predictions. In European Conference on Computer Vision, pages 126–139. Springer, 2008.
[2] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 415–423, 2015.
[3] A. Y.-S. Chia, S. Zhuo, R. K. Gupta, Y.-W. Tai, S.-Y. Cho, P. Tan, and S. Lin. Semantic colorization with internet images. In ACM Transactions on Graphics (TOG), volume 30, page 156. ACM, 2011.
[4] A. Deshpande, J. Rock, and D. Forsyth. Learning large-scale automatic image colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 567–575, 2015.
[5] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[6] R. K. Gupta, A. Y.-S. Chia, D. Rajan, E. S. Ng, and H. Zhiyong. Image colorization using similar images. In Proceedings of the 20th ACM International Conference on Multimedia, pages 369–378. ACM, 2012.
[7] Y.-C. Huang, Y.-S. Tung, J.-C. Chen, S.-W. Wang, and J.-L. Wu. An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 351–354. ACM, 2005.
[8] R. Irony, D. Cohen-Or, and D. Lischinski. Colorization by example. In Eurographics Symposium on Rendering, volume 2. Citeseer, 2005.
[9] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. arXiv preprint arXiv:1603.06668, 2016.
[10] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), volume 23, pages 689–694. ACM, 2004.
[11] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum. Natural image colorization. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, pages 309–320. Eurographics Association, 2007.
[12] Y. Morimoto, Y. Taguchi, and T. Naemura. Automatic colorization of grayscale images using multiple images on the web. In SIGGRAPH '09: Posters, page 32. ACM, 2009.
[13] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. arXiv preprint arXiv:1603.08695, 2016.
[14] Y. Qu, T.-T. Wong, and P.-A. Heng. Manga colorization. In ACM Transactions on Graphics (TOG), volume 25, pages 1214–1220. ACM, 2006.
[15] T. Welsh, M. Ashikhmin, and K. Mueller. Transferring color to greyscale images. In ACM Transactions on Graphics (TOG), volume 21, pages 277–280. ACM, 2002.
[16] L. Yatziv and G. Sapiro. Fast image and video colorization using chrominance blending. IEEE Transactions on Image Processing, 15(5):1120–1129, 2006.
[17] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.