Reference Constrained Cartoon Colorization

Mit Shah          Nayan Singhal          Tushar Nagarajan
University of Texas at Austin, Austin, TX 78712
[email protected]   [email protected]   [email protected]

Abstract

Recent work on colorization using deep networks has achieved impressive results on natural images. The colorization task in these works is underconstrained - a grayscale image of a t-shirt could be any of several colors, which results in ambiguous colorization. In this work, we look at the cartoon domain, where the color ambiguity is much higher than in natural images. We explore this colorization task with an additional constraint of being consistent with a reference image. We train on a set of cartoons and investigate whether our model generalizes to unseen cartoons and natural images.

1. Introduction

Previous works in colorization have highlighted the difficulty of colorizing grayscale images [17, 9]. For natural images, humans easily form semantic correlations between common scene elements and their colors (water being blue, ground being brown). These semantic associations form the basic building blocks for a colorization algorithm to correctly impart color to a grayscale image.

One natural issue that arises from the way the problem is defined is color ambiguity. Colorization is an underconstrained problem, which means that there can be many plausible colorizations for the same object. For example, fruits like grapes, apples and bananas can take multiple colors, but grayscale images of each of them could be identical. A fully automatic system can generate any of the plausible colorizations, making learning parameters for this kind of model a difficult task. To address this issue, interactive models have been proposed where a user selects colors for some input regions (scribble based [10, 7, 16, 11, 14]) or provides a reference image (transfer based [6]) from which color should be imparted. This eliminates the ambiguity in colorization but introduces a level of human intervention that is absent in fully automatic models.

In this work, we address the problem of colorization in the cartoon domain. The cartoon domain comes with its own unique challenges:

• The color ambiguity is much higher in general - cartoons exaggerate color distributions.
• Pixel gradients are much lower: most regions are uniformly colored, except at the edges, where there is an abrupt change.

We aim to address these issues with a combination of fine-tuning for our cartoon domain and including extra supervision from a reference image. Our task, therefore, is to colorize a cartoon image to be consistent with a reference image. For the reference image, we choose natural images from live-action adaptations of cartoons rather than other cartoons. The motivation behind this decision is two-fold. (1) Using cartoon references reduces this problem to a version of the colorization-by-superpixel-matching task in [6]; we would like to incorporate learning into our system to allow for better generalization. (2) We would like to test the generalizability of our system on natural images as well. Given a cartoon of a cat and a natural cat image with different colors and textures, does the system impart those colors and textures onto the cartoon image?

We choose the cartoon domain for one other reason: accumulating a large amount of supervised data with images colored consistently with a reference is difficult, as discussed in detail in Section 3. Frames from cartoon clips provide a large amount of training data, and live-action adaptations of popular cartoons provide a decent source of reference images. We build a dataset from 15 cartoons and their live-action adaptations for this purpose. Building upon the model in [17], we train several variations of this deep net to predict the A,B channels of an image in LAB color-space given the L channel and a reference image. We perform an array of experiments highlighting the applicability of each model to our domain.

2. Related Work

Colorization methods can mainly be classified into two categories: user-interaction-driven methods and fully automatic methods. The former can be further classified into scribble-based and transfer-based methods. Levin et al. [10] proposed a scribble-based method, which requires manually specifying desired output colors for certain spatial locations in the image. These are then propagated to adjacent pixels by a least-squares optimization method, and interactive refinement is allowed via specifying additional colors. Further improvements to scribble-based methods were proposed in [7, 16, 14, 11]. Automatic methods pose the problem as either regression onto a continuous color space [2, 4] or classification of quantized color values [1, 17]. Zhang et al. [17] cast colorization as a classification task and use class re-balancing at training time to increase the diversity of colors in the result. Larsson et al. [9] make use of low-level and semantic representations to train a model to predict per-pixel color histograms.

2.1. Colorization using Reference Images

Transfer-based methods use semantically related reference images, from which color is imparted to the target image. Mappings between reference and target images are generated automatically, using correspondences between local descriptors [1, 12, 15], or in combination with manual inputs [3, 8]. Gupta et al. [6] proposed an example-based method to colorize a grayscale image, in which the user provides a reference image that is semantically similar to the target image; features are then extracted at the superpixel resolution and used in the colorization process. Our work is similar to these works in that we aim to transfer color from semantically relevant regions of a reference onto a target image. We avoid using local descriptor features because our reference image is from a different domain (natural images) than our cartoon image.

Gatys et al. [5] introduced a deep neural network that uses neural representations to separate and recombine the content and style of arbitrary images, and then imposes the style of one image onto the content of another, allowing the creation of artistic images. We aim to do something similar for color, but with the additional constraint of imposing the color at the semantically relevant regions. Using the methods of [5], there is no supervision over where the stylistic elements of one image get transferred in the content image, which is why we base our models on traditional colorization literature. Our work borrows ideas from [6] with regard to using reference images, and from [17] for the end-to-end neural model, to adapt the colorization-with-a-reference problem to the domain where the reference is a natural image and the target image is a cartoon.

3. Dataset

Generating reference images that are different from the original images is not a straightforward task - category labels serve as a very coarse proxy for usable references (for example, for a cartoon image of a dog, use the ImageNet category 'dog' as reference images). During training, however, the consistently colorized output is also required, which is unavailable when selecting references in this manner. Therefore, to create our dataset, we use a series of cartoons and their live-action movie counterparts to harvest a set of consistent references for our cartoon frames. Our dataset primarily consists of cartoons involving characters in distinct costumes, because the live-action counterparts will share similar color schemes, making them usable references.

For each cartoon clip, frames are extracted every 4s. From these frames, several key-frames (containing main characters) are manually selected and serve as seeds for extracting other similar frames. Similar frames are selected by ranking with cosine similarity over color-histogram features. From each cartoon, several sets of frames corresponding to different characters are extracted. Frames from the live-action clips are extracted in a similar manner.

In an attempt to train on sketches as our primary domain, instead of grayscale images, we also generate corresponding sketches for each cartoon frame. These are based on cartoon outlines and do not directly represent human-drawn sketches, but serve as a proxy for them. For each cartoon frame, the edges are enhanced using the open-source image manipulation tool GIMP, and then thresholding is performed to obtain a binary image representing the sketch.

The resulting dataset contains 1133 cartoon images (involving 15 unique characters) and 1115 reference images that span the same characters. The dataset generation process, along with examples from the dataset, can be seen in Figure 1.

Figure 1. Dataset generation pipeline. (1) Frames are extracted from the cartoon clip and the corresponding live-action clip. (2) Key-frames are manually selected for dataset expansion. (3) Other frames are ranked using the key-frames to construct similar sets of frames. (4) Color augmentation is applied to randomly selected pairs of frames (from cartoon and live-action) and included in the training set.
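For concreteness, the frame-ranking step above can be summarized by the minimal sketch below. The histogram binning and helper names are illustrative choices for exposition, not the exact settings used in our pipeline.

    import numpy as np

    def color_histogram(frame, bins=32):
        # Concatenated per-channel histograms of an RGB frame (H x W x 3, uint8).
        hists = [np.histogram(frame[..., c], bins=bins, range=(0, 256))[0]
                 for c in range(3)]
        h = np.concatenate(hists).astype(np.float64)
        return h / (h.sum() + 1e-8)      # normalize to a distribution

    def cosine_similarity(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def rank_by_similarity(key_frame, candidate_frames):
        # Return candidate indices sorted from most to least similar to the key-frame.
        key_hist = color_histogram(key_frame)
        scores = [cosine_similarity(key_hist, color_histogram(f))
                  for f in candidate_frames]
        return sorted(range(len(candidate_frames)), key=lambda i: -scores[i])

The top-ranked candidates for each key-frame form one character-specific set of frames.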
3.1. Data Augmentation

For each cartoon image, while there may be several candidate reference images, there is only one dominant color scheme (Spider-Man reference and cartoon images all show the character in a red costume). This bias allows models trained on this data to ignore the reference image entirely and predict the 'default' color for the input image. Therefore, we randomly vary the hue of both the reference and the cartoon images during training by the same amount, forcing the network to learn a dependence on the reference image. We also employ standard data augmentation techniques such as random cropping and mirroring to learn a more generalizable model.
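A minimal sketch of this paired hue augmentation using torchvision is shown below. The paper does not specify the exact hue range or flip probability, so the values here are assumptions; mirroring is applied independently to each image since no spatial alignment between cartoon and reference is assumed.

    import random
    from torchvision.transforms import functional as TF

    def paired_hue_jitter(cartoon_img, reference_img, max_shift=0.5):
        # Apply the SAME random hue shift to both images so the network
        # cannot ignore the reference and fall back to the 'default' colors.
        shift = random.uniform(-max_shift, max_shift)   # assumed range
        return TF.adjust_hue(cartoon_img, shift), TF.adjust_hue(reference_img, shift)

    def paired_mirror(cartoon_img, reference_img, p=0.5):
        # Standard mirroring, applied independently to each image.
        if random.random() < p:
            cartoon_img = TF.hflip(cartoon_img)
        if random.random() < p:
            reference_img = TF.hflip(reference_img)
        return cartoon_img, reference_img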
4. Model Architecture

We build our models upon the modified VGG network used in [17]. We explore several model architectures and motivate their use in the different experiments performed. The basic architecture consists of 8 blocks, each with the structure (conv, ReLU, conv, ReLU, BatchNorm), and no pooling layers. The general idea explored in our models is how to incorporate color information from a reference image into the predictions on the grayscale image; the models differ in how they incorporate this information.
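A simplified PyTorch sketch of this block structure follows. The channel widths and the per-pixel classification head over quantized AB bins (313 classes, as in [17]) are illustrative assumptions, not the exact configuration of the modified VGG network we build on.

    import torch.nn as nn

    def conv_block(in_ch, out_ch):
        # One block: (conv, ReLU, conv, ReLU, BatchNorm), with no pooling.
        return nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.BatchNorm2d(out_ch),
        )

    class ColorizationBranch(nn.Module):
        # Eight such blocks; the widths below are placeholders.
        def __init__(self, in_ch=1,
                     widths=(64, 128, 256, 512, 512, 512, 512, 256),
                     num_bins=313):
            super().__init__()
            chans = [in_ch] + list(widths)
            self.blocks = nn.ModuleList(
                conv_block(chans[i], chans[i + 1]) for i in range(8))
            # Per-pixel classification over quantized AB bins, as in [17].
            self.head = nn.Conv2d(widths[-1], num_bins, kernel_size=1)

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return self.head(x)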
4.1. Zhang et al. VGG

As a baseline, we fine-tune the model from [17] to cater to our cartoon domain. In this model, no reference image is used to establish consistency. Due to the construction of our dataset, this model suffers from the ambiguity of color assignment - the same grayscale image can have multiple different colorizations without a reference. We use the parameters learned from this model as initializations for the other model variants.

4.2. Stacked Input

A naive model to incorporate the reference image into the prediction is to stack the 1-channel input image with the 3-channel reference image (LAB) and train the network on the resulting 4-channel input. This serves as our second baseline.

4.3. Additive Connections

We incorporate the idea of skip connections similar to those used in [13]. In their work, two CNN branches are used to perform object segmentation: one branch for classification of objects and one branch for segmentation. Activations from the classification branch are used by the segmentation branch to fine-tune the segmentation output. In a similar manner, we create a clone of the VGG network used in [17] and use it as our reference branch, while the original CNN serves as our input branch. Connections from each of the blocks after conv3 are made from the reference branch to the input branch in order to incorporate color information into the final predictions. These connections serve as cross-domain links between the two branches. Each connection performs a convolution on the activations of one branch (to account for the domain change) and adds the result to the corresponding activations of the other branch. Connections are made in both directions (input → reference and reference → input) to allow both branches to utilize information from the reference and input images during training. The model architecture can be seen in Figure 2. We refer to this model as full additive.

The input to the reference branch is a three-channel (LAB) reference image, while the input branch receives only a single channel as before. The weights of the two branches are not shared in any way, as each branch is meant to capture different information. While the input branch is meant to learn to preserve edge boundaries, learn common object-color correlations, etc., the reference branch is meant to retrieve color information and has more of a region-matching function. The two branches also operate on different kinds of inputs, and while they are not truly different modalities, distinct operations need to be performed on each - parameter sharing does not seem intuitive.

A simpler variant of this model, in which information from only the reference branch is allowed to propagate to the input branch via additive connections, is also created. This simpler version is quicker to train and performs quite differently from the more complex version. We refer to this model as simple additive.

Figure 2. Model architecture: The input and reference branch have the same architecture as [17]. Information from each branch can propagate to the other via a bridging connection at each conv block. The information is combined using element-wise addition after a transformation to account for the input-to-reference or reference-to-input domain change. Color information from the reference image is used to influence the final color prediction for the input image, while boundary information from the input branch is used to influence which regions of color information are important.
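The sketch below illustrates one such bidirectional bridge at a single block. The use of a 1x1 convolution as the domain-change transformation, and the assumption that corresponding blocks have matching channel counts, are our illustrative choices rather than details stated above.

    import torch.nn as nn

    class AdditiveBridge(nn.Module):
        # Bidirectional bridge: each branch's activations are transformed by a
        # convolution and added element-wise to the other branch's activations.
        def __init__(self, channels):
            super().__init__()
            self.ref_to_input = nn.Conv2d(channels, channels, kernel_size=1)
            self.input_to_ref = nn.Conv2d(channels, channels, kernel_size=1)

        def forward(self, input_feat, ref_feat):
            new_input = input_feat + self.ref_to_input(ref_feat)
            new_ref = ref_feat + self.input_to_ref(input_feat)
            return new_input, new_ref

In the simple additive variant, only the ref_to_input path would be kept; the concatenating variant (Section 4.4) would replace the addition with channel-wise concatenation followed by a convolution back to the original width.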
4.4. Concatenating Connections

Similar to the model in Section 4.3, but instead of adding the activations from the reference branch directly to the activations of the input branch, the activations are concatenated. This is a minor variation on the previous model and performs slightly worse on the colorization task; it is explored here in the interest of completeness.

5. Experiments

5.1. Fine-Tuning for Cartoon Domain

We fine-tune the model from [17] on our cartoon domain. From Figure 3, we can see that fine-tuning on the cartoons allows the model to learn domain-specific characteristics like strict edges and high color contrast. The images from the fine-tuned model (base1) look visually more appealing. This is expected, as it is only a small domain shift.

Figure 3. Colorization of the model from [17] before and after fine-tuning on the cartoon domain.

5.2. Reference Consistent Colorization

We show the results of two models, full additive and simple additive (Section 4.3), on a variety of images from our test set. These images come from a cartoon from which no frames were used during training, which highlights the model's ability to generalize to different cartoons (still within the cartoon domain) to some extent. Overall, none of the models satisfactorily transferred colors from the required regions of the reference image to the relevant regions in the target image. We present some observations about our models and speculate on their poor performance.

From Figure 4 we see that a dependence on the reference branch has clearly been learned. However, instead of transferring color to the semantically correct regions of the image, the color is transferred to seemingly random parts of the image (primarily the background). The difference between the simple additive model and the full additive model is also highlighted in the results. The simple additive model is less liberal in colorizing elements of the image, with several patches retaining their original 'default' color. This may be because of the limited influence that the reference branch has on the input (only simple element-wise additions, compared to the bidirectional connections in the full additive model).

Ref   Basic     Simple Add   Concat    Stack     Full Add
1     7752.53   7228.94      6202.76   6303.77   7069.09
2     7753.73   7676.18      6767.40   7136.79   6916.49
Table 1. Average validation cross-entropy loss of each model for the two reference images used in Figure 4.

Figure 4. Results on colorization of a series of test set images using two different reference images. While both models fail to place the color of the reference image in the correct location, notice how the simple additive model is more conservative in distributing reference image colors, while the full additive model is more liberal in doing so.

The reference images used in Figure 4 were full scenes from live-action movies containing multiple objects and complex backgrounds. As a simpler test, we use reference images with a single object to colorize cartoons. Using these references produces much better results. In Figure 5, we can see that the colors of the reference image find the correct object in the cartoon image to impose themselves upon, while ignoring the background. The segmentation is also respected to some degree. While it is clear that if different parts of the cats in the images were colored differently, the color distribution would not transfer, the fact that the color gets transferred to the foreground instead of the background is a small success.

Figure 5. Results on colorization of a target cartoon image using 4 different reference images. Left: The full additive model colors the cat with some success according to the reference image. Right: The simple additive model correctly colors the pink cat, but both models have difficulty with other reference images.

5.3. Qualitative Results

We evaluate which target images and reference images perform well on the task. To measure this, we run our models over all the cartoon images while keeping the reference image constant and calculate the average loss for each reference image. In this way we can measure how effective a particular reference image is at colorizing the target images (and, similarly, how the structure of a target image helps it be colorized well). We sort both the reference and target images in increasing order of loss; the images at the top are good (reference or target) images for the given task. The top 4 and bottom 4 target and reference images are shown in Figure 6.

Looking at the best-performing target images, we can see that none of them contain Superman clearly and at a large scale, unlike the worst-performing target images. The reason could be that, due to poor colorization, the images containing Superman clearly in the center (and at a large scale) suffer the highest loss. On the other hand, in the best-performing reference images the costume colors are clear and easily visible, while in the worst-performing reference images Superman is not large enough and/or the colors in his costume are not clearly visible.

Figure 6. References and cartoon ranking.
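The ranking procedure of Section 5.3 amounts to the following sketch; the model, criterion, and data lists are placeholders, and the loss is the same quantized-AB cross-entropy used elsewhere.

    import torch

    def rank_references(model, criterion, references, cartoon_pairs, device="cpu"):
        # For each reference, average the loss over all cartoon (input, target) pairs,
        # then sort references from best (lowest loss) to worst.
        model.eval()
        avg_losses = []
        with torch.no_grad():
            for ref in references:
                total = 0.0
                for gray_input, target_bins in cartoon_pairs:
                    logits = model(gray_input.to(device), ref.to(device))
                    total += criterion(logits, target_bins.to(device)).item()
                avg_losses.append(total / len(cartoon_pairs))
        order = sorted(range(len(references)), key=lambda i: avg_losses[i])
        return order, avg_losses

The symmetric ranking over target images is obtained by holding each target fixed and averaging over references instead.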
5.4. Quantitative Evaluation

To measure our models' performance, we measure cross-entropy loss on a held-out test set. This test set consists of frames and references from a single, unseen cartoon, and the loss is measured as an average across all (image, reference) pairs in the dataset.

Simple Additive   Simple Concatenated   Full Additive
7637.30           6565.17               6878.95
Table 2. Average cross-entropy loss on the held-out test set for all tested models.

Interestingly, the best model according to the test loss was the concatenated model, but most of the colorizations produced by this method are washed out and qualitatively unappealing. This raises an issue about the method of evaluation (and even training!). A similar trend in performance is observed even for L2 loss on the AB channels.

5.5. Supplementary Experiments

5.5.1 Neural Style Transfer

The model of [5] was used to impose the style of the reference images onto the cartoon images. While the stylistic elements transfer quite well, the model does not focus specifically on colors, so while the output has its own merit, it is not suitable for our task.

Figure 7. Top row: reference images. Bottom row: imposed-style images. [5] applied to the reference cartoon image of the cat used in Figure 5, to highlight the difference between the proposed approach and the style-transfer approach.

5.5.2 Sketch Colorization

We attempt to extend our models to the sketch domain (constructed from the cartoon frames themselves). Sketches (the way we generate them) are a curious domain because large sections of the (now binarized) image are completely white or completely black. With the absence of gradient information, it is very hard to learn color mappings. The changes in the image are only edge gradients, and they occur at scales much larger than in the original grayscale image. The convolution kernel size used in our models was 3x3, which may have been too small to capture this information, resulting in a model that could not be trained. Figure 8 contains the output of our model on a set of sketches, to demonstrate how poorly the model performed.

Figure 8. Output of the full additive model trained specifically on sketches.
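Our sketches were produced with GIMP edge enhancement followed by thresholding (Section 3). A rough programmatic approximation with Pillow is sketched below; the edge filter and threshold value are stand-ins for the manual GIMP settings, not a reproduction of them.

    from PIL import Image, ImageFilter

    def cartoon_to_sketch(path, threshold=40):
        # Approximate sketch: grayscale -> edge filter -> binary threshold.
        gray = Image.open(path).convert("L")
        edges = gray.filter(ImageFilter.FIND_EDGES)
        # White lines on black; invert if black-on-white sketches are preferred.
        return edges.point(lambda p: 255 if p > threshold else 0, mode="1")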
6. Conclusion

In this work we explored the task of colorizing cartoon images with a reference image. We fine-tune a network for our cartoon domain and use it as a starting point for several experiments. The results of our experiments show that we are indeed able to bring some color information from the reference image into the colorization of our input image, but only to a certain degree.

In hindsight, this task is not at all straightforward. Implicit in the ability to successfully colorize a cartoon image with a natural-image reference is the ability to identify and draw a semantic mapping between regions from each of those images - for example, first identifying that the tail of a cat in a cartoon corresponds to the tail of a cat in a reference image. Previous work by [6] focused expressly on generating those mappings for images from the same domain, which itself proved to be difficult. The second challenge, transferring colors, should have been the sole focus of this work.

We used element-wise addition as the means of information transfer between the two branches of our network. This may have been an oversimplification, considering the ability we wished the network to possess. A strategy that we did not explore would be to use kernels derived from the reference branch to perform a cross-convolution on the input branch's stream, with the idea that the filters contain color information that needs to be imposed onto the input image.
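One possible realization of that unexplored cross-convolution idea is sketched below: filters are predicted from pooled reference features and applied, per sample, to the input-branch feature map. This is purely illustrative - we did not implement it, and the filter size and prediction head are our own assumptions.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossConvolution(nn.Module):
        # Predict a small bank of filters from the reference features and
        # convolve the input-branch features with them, one sample at a time.
        def __init__(self, channels, kernel_size=3):
            super().__init__()
            self.channels = channels
            self.kernel_size = kernel_size
            self.filter_head = nn.Linear(channels, channels * channels * kernel_size ** 2)

        def forward(self, input_feat, ref_feat):
            b, c, _, _ = input_feat.shape
            pooled = F.adaptive_avg_pool2d(ref_feat, 1).flatten(1)          # (B, C)
            weights = self.filter_head(pooled).view(
                b, c, c, self.kernel_size, self.kernel_size)
            outputs = [F.conv2d(input_feat[i:i + 1], weights[i],
                                padding=self.kernel_size // 2)
                       for i in range(b)]
            return torch.cat(outputs, dim=0)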
References

[1] G. Charpiat, M. Hofmann, and B. Schölkopf. Automatic image colorization via multimodal predictions. In European Conference on Computer Vision, pages 126–139. Springer, 2008.
[2] Z. Cheng, Q. Yang, and B. Sheng. Deep colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 415–423, 2015.
[3] A. Y.-S. Chia, S. Zhuo, R. K. Gupta, Y.-W. Tai, S.-Y. Cho, P. Tan, and S. Lin. Semantic colorization with internet images. In ACM Transactions on Graphics (TOG), volume 30, page 156. ACM, 2011.
[4] A. Deshpande, J. Rock, and D. Forsyth. Learning large-scale automatic image colorization. In Proceedings of the IEEE International Conference on Computer Vision, pages 567–575, 2015.
[5] L. A. Gatys, A. S. Ecker, and M. Bethge. A neural algorithm of artistic style. arXiv preprint arXiv:1508.06576, 2015.
[6] R. K. Gupta, A. Y.-S. Chia, D. Rajan, E. S. Ng, and H. Zhiyong. Image colorization using similar images. In Proceedings of the 20th ACM International Conference on Multimedia, pages 369–378. ACM, 2012.
[7] Y.-C. Huang, Y.-S. Tung, J.-C. Chen, S.-W. Wang, and J.-L. Wu. An adaptive edge detection based colorization algorithm and its applications. In Proceedings of the 13th Annual ACM International Conference on Multimedia, pages 351–354. ACM, 2005.
[8] R. Irony, D. Cohen-Or, and D. Lischinski. Colorization by example. In Eurographics Symposium on Rendering, volume 2. Citeseer, 2005.
[9] G. Larsson, M. Maire, and G. Shakhnarovich. Learning representations for automatic colorization. arXiv preprint arXiv:1603.06668, 2016.
[10] A. Levin, D. Lischinski, and Y. Weiss. Colorization using optimization. In ACM Transactions on Graphics (TOG), volume 23, pages 689–694. ACM, 2004.
[11] Q. Luan, F. Wen, D. Cohen-Or, L. Liang, Y.-Q. Xu, and H.-Y. Shum. Natural image colorization. In Proceedings of the 18th Eurographics Conference on Rendering Techniques, pages 309–320. Eurographics Association, 2007.
[12] Y. Morimoto, Y. Taguchi, and T. Naemura. Automatic colorization of grayscale images using multiple images on the web. In SIGGRAPH '09: Posters, page 32. ACM, 2009.
[13] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. arXiv preprint arXiv:1603.08695, 2016.
[14] Y. Qu, T.-T. Wong, and P.-A. Heng. Manga colorization. In ACM Transactions on Graphics (TOG), volume 25, pages 1214–1220. ACM, 2006.
[15] T. Welsh, M. Ashikhmin, and K. Mueller. Transferring color to greyscale images. In ACM Transactions on Graphics (TOG), volume 21, pages 277–280. ACM, 2002.
[16] L. Yatziv and G. Sapiro. Fast image and video colorization using chrominance blending. IEEE Transactions on Image Processing, 15(5):1120–1129, 2006.
[17] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. arXiv preprint arXiv:1603.08511, 2016.