PolyRNN: Polygon-based Instance Segmentation with Recurrent Neural Networks

Kamyar Ghasemipour (Student #1000873548), Department of Computer Science, University of Toronto, [email protected]
Lluis Castrejon (Student #1002201057), Department of Computer Science, University of Toronto, [email protected]

Abstract

Instance segmentation is the task of annotating each region in an image as belonging to a specific instance of an object. In this work we present a novel method for addressing this challenge that, instead of producing dense pixel predictions, produces sequences of vertices defining polygons that enclose individual object instances. Our model uses a convolutional neural network to extract features from an input image that are subsequently fed into a recurrent neural network which predicts polygon vertices one at a time. We experiment with artificial datasets and show promising initial results that suggest our proposed architecture is a suitable approach to the instance segmentation problem.

1 Introduction

Semantic segmentation is the task of classifying regions in an image as belonging to specific object categories. This is a challenging problem in image understanding, with many potential applications. However, the outputs of a semantic segmentation pipeline are not sufficient for certain advanced tasks in which information about the individual instances of different object categories is required. For example, one might want to differentiate between individual cells to determine which ones are healthy in chemical treatments, or a grasping robot might need to differentiate between two neighbouring objects of the same class. Hence, in this project we investigate the task of instance segmentation. Instance segmentation is the problem of not only producing category annotations for pixels in an image, but also marking them as belonging to a specific instance of a particular object class.

In this project we propose a novel method for addressing this challenge: instead of producing pixel-level annotations, our model produces the vertices of a polygon enclosing each object instance. The accuracy of polygon segmentations can be adjusted through the number of vertices the model predicts. Polygons with an appropriate number of vertices produce accurate segmentation results, as evidenced by the fact that a number of segmentation datasets are generated through the use of polygonal annotations. This brings about a second very interesting motivation for our work: the pipeline we present in this project could be used to generate new segmentation datasets more efficiently. One could envision a dataset generation framework in which annotations are initialized using our approach, requiring human annotators to only refine the polygons. This could significantly speed up the annotation process while reducing poor annotations, since less effort is demanded from the annotators.

In the following section we present an overview of instance segmentation techniques, and section 3 describes the details of our model. After discussing the challenges facing our approach, we compare the different datasets available to train our models. We report our experiments and results in section 6 and conclude with a summary of our findings, while proposing directions to extend this work in the future.

2 Related Work

Instance segmentation is a relatively new task in the field of computer vision, and proposed methods have usually been inspired by semantic segmentation and object detection. A recent example is the work of Hariharan et al.
[9], inspired by the R-CNN framework [8, 7, 15]. Their method first generates a large number of region proposals using the technique described in [1]. Then, features are extracted from both the bounding box of each proposal and the region foreground using separate CNNs. Finally, these features are used to classify the region proposals using linear SVMs and non-maximum suppression. Chen et al. [3] use a similar framework while making use of object exemplars to reason about occlusions.

Recently a significant amount of progress has been made in semantic segmentation due to convolutional neural networks. In [13], Long et al. presented Fully Convolutional Networks (FCNs) for addressing this task. By using only convolution, pooling, and upsampling layers, they were able to improve on the previous state of the art by a margin of 20%. Subsequent work from different research groups significantly improved this method by adding CRF-like models on top of FCNs to achieve segmentations with finer detail. Examples of such works include [2] and [23].

Naturally, FCNs have also been employed for instance segmentation. In [11], Liang et al. present a framework that avoids the proposal generation step of [8, 9]. The FCN network from [13] is used to produce category-level segmentations for given images. The authors also fine-tune a modified version of this network that produces a vector of instance counts for each object category while producing, for each pixel, the coordinates of the top-left, center, and bottom-right locations of the bounding box of the instance it belongs to. In a post-processing step, a clustering algorithm is used to assign pixels to instances of different categories.

A notable paper proposing an end-to-end method for instance segmentation is [16]. Romera-Paredes et al. employ a recurrent neural network to sequentially segment instances from images. Specifically, they use convolutional LSTMs built on top of the output of the FCN network of [13]. At each time-step, the output of the convolutional LSTM passes through two separate branches: one which produces a segmentation mask, and one that produces a score indicating how confident the model is that the mask belongs to a new object instance. The unrolling of the RNN is stopped once the confidence falls below a certain threshold. There are two notes worth mentioning about this approach. First, the model does not take into account any information regarding the classes of objects. Second, the authors note that the results obtained from their model are somewhat coarse and, as a result, they employ a CRF as a post-processing step in some experiments.

Some important work has also been done in the context of autonomous driving, where instance segmentation is required to differentiate between the different vehicles, pedestrians and other objects of interest. In particular, [22] and [21] propose methods for instance segmentation from monocular images in this context. We now proceed to describe our method in greater detail.

3 PolyRNN

Our goal is to generate a sequence of vertices $v_i$ for an image $x_i$ that defines a polygon enclosing the region of the image where an object of a specific type is located. For each example, we have access to a sequence of vertices $y_i$ that defines the ground-truth enclosing polygon. Each image $x_i$ contains at least one object to segment and corresponds to the object bounding box plus a 10% expansion to capture more context.
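For illustration, a minimal sketch of how such an input patch could be prepared is given below (the helper name, the use of PIL, and the bilinear resize are our own illustrative choices; only the 10% expansion and the 224x224 patch size come from the text):

```python
import numpy as np
from PIL import Image

def crop_with_context(image, box, expand=0.10, out_size=224):
    """Crop an object bounding box plus a margin and resize to a square patch.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates. The 10% expansion
    and 224x224 output follow the setup described in the text; the exact cropping
    code used in our pipeline may differ.
    """
    w, h = image.size
    bw, bh = box[2] - box[0], box[3] - box[1]
    dx, dy = expand * bw, expand * bh
    # Clamp the expanded box to the image bounds.
    x0, y0 = max(0, box[0] - dx), max(0, box[1] - dy)
    x1, y1 = min(w, box[2] + dx), min(h, box[3] + dy)
    patch = image.crop((int(x0), int(y0), int(x1), int(y1)))
    patch = patch.resize((out_size, out_size), Image.BILINEAR)
    return np.asarray(patch)
```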
We define the probability of predicting a polygon as $p(v \mid x) = \prod_{i=0}^{N-1} p(v_i \mid v_0, \ldots, v_{i-1}, x)$. A way to model this probability distribution is to use an RNN decoder that predicts one polygon vertex at a time and that is conditioned on an image $x$. While one could feed the RNN with the full image patch, this would be very computationally expensive as our images are of size 224x224. Therefore, a more reasonable option is to first extract image features using a CNN, and feed these features to the RNN. One can then train the system end to end so that the error in predicting the polygon with the RNN is backpropagated through the CNN.

Figure 1: Diagram of our baseline model architecture.

To evaluate a prediction we compute the Intersection over Union (IoU) of the regions defined by the predicted polygon and the ground-truth polygon. We choose this metric because the exact polygon we predict is not important as long as it encloses the same region. Therefore, a predicted polygon should not be penalized for having a different number of vertices from the ground truth or for predicting the vertices from a different starting point. Figure 1 shows a diagram of our model.

4 Challenges

This section outlines a series of challenges that our framework needs to overcome, some of which are addressed in this report and some of which are left to future work.

4.1 Loss function

Directly optimizing the IoU metric used for evaluation is hard. From a predicted polygon, we need to compute the enclosed area to be able to obtain the intersection and the union. Finding an analytical gradient for this operation is not trivial. Furthermore, it can only be computed once we have the full vertex sequence for a polygon. Using only such a loss would pose difficulties in training, as propagating the gradients from the last timesteps back in time through the RNN and the CNN might not be enough to train the model because of the vanishing gradient problem.

Instead, a first, simpler approach is to compute the squared Euclidean distance between each predicted vertex and the corresponding ground-truth vertex, and to directly optimize this metric. This has the advantage that there is a gradient at each timestep. However, it penalizes predictions that do not have the same number of vertices as the ground truth, as well as predictions with a different vertex ordering. In the next section we propose a modification that makes this loss agnostic to the starting point of our prediction.

A more advanced loss could be defined that does not penalize predicting a different number of vertices. The idea is to define a spatial loss that penalizes i) predicting a point off the edge defined between the last predicted vertex and the next vertex in the sequence, and ii) predicting a point that is on the edge but far from the next vertex in the sequence. Predicting a point off the edge should incur a stronger penalty than predicting a point on the edge but far away from the next vertex. This loss would allow the model to predict a polygon with more vertices than the ground truth as long as, at each step, it advances towards the next vertex in the ground-truth sequence. However, this loss is hard to implement as it is conditional on the previously predicted vertices.
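For reference, the polygon IoU used for evaluation can be computed with an off-the-shelf geometry library; the following is a minimal sketch assuming the shapely package is available, not the exact evaluation code used in our experiments:

```python
from shapely.geometry import Polygon

def polygon_iou(pred_vertices, gt_vertices):
    """Intersection over Union between two polygons given as lists of (x, y) vertices."""
    pred = Polygon(pred_vertices).buffer(0)  # buffer(0) repairs minor self-intersections
    gt = Polygon(gt_vertices).buffer(0)
    inter = pred.intersection(gt).area
    union = pred.union(gt).area
    return inter / union if union > 0 else 0.0

# Example: a square predicted with a one-unit horizontal offset from the ground truth.
print(polygon_iou([(0, 0), (10, 0), (10, 10), (0, 10)],
                  [(1, 0), (11, 0), (11, 10), (1, 10)]))  # ~0.818
```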
4.1.1 Agnostic Loss

One of the problems with using a loss function based on the distance of predicted vertices to ground-truth vertices is that, even if our model predicts the ground-truth vertices, it will still be penalized if the ordering of the predictions is a rotated version of the original. To amend this situation, for a given loss function $f$, we compute the order-agnostic version $f_{ag}$ as follows. Let $P = \{p_i\}_{i=0}^{n-1}$ denote the sequence of vertices predicted by our model and let $Q = \{q_i\}_{i=0}^{n-1}$ denote the sequence of ground-truth vertices. Furthermore, for each $k \in \{0, \ldots, n-1\}$, let $P_k = \{p_{(i-k) \bmod n}\}_{i=0}^{n-1}$ denote the cyclic rotation of $P$ by $k$ positions. Then, $f_{ag}(P, Q) = \min_k f(P_k, Q)$.

A problem with this loss function (as with the Euclidean distance measure) is that the number of vertices of the prediction must match that of the ground truth. If this condition is not satisfied, we need to pad or slice the output predictions before computing the loss. So far, however, we have only tested the agnostic loss in scenarios in which we unroll the LSTM for as many steps as the number of ground-truth vertices.

Figure 2: Progression of training (row-major ordering). Left: without order agnosia. Right: with order agnosia.

Figure 2 demonstrates the differences in the progress of the model when trained on Shapes8 (described in the next section) with and without an order-agnostic loss. As can be seen, without this loss the model initially focuses solely on getting the ordering correct. Furthermore, the Shapes8 data has the nice property that the i-th vertex lies in roughly the same area in all images. For natural images, it would be almost impossible for the model to learn anything if this modified loss were not used.

4.2 Fine-grained Segmentation

An important challenge plaguing many frameworks for semantic segmentation is being able to produce segmentations that take into account the fine structure of the objects present in images. This is because most successful methods incorporate convolutional neural networks in order to extract high-level semantic information from images. However, there is a trade-off: deeper layers of CNNs provide more semantic information, but due to pooling layers as well as different strides of convolution kernels, information about exact locations and fine structures may be lost.

In recent segmentation literature, two main techniques are used to deal with this scenario. The first is to refine segmentations produced by networks such as the fully convolutional networks of [13] using CRFs or other models that behave in a similar fashion. We could potentially use a similar approach, but it would need to be implemented as a post-processing step in which polygons are converted to pixel segmentations and the pixels are subsequently refined. This, however, deviates from our desire to have good polygonal boundaries to begin with. The second approach is to fuse together activations from different levels of CNNs. In this way, the models are able to make predictions while considering the local as well as the global structure of the images. Motivated by this, we also implemented a model that employs a variation of the glimpse network of Mnih et al. [14] as a key component.

Figure 3: Inputs to the LSTM when glimpses are employed.

Given the vertex prediction $p_t$ of the model at step $t$, the glimpse sensor extracts a glimpse by taking $k$ square crops of the input image centered at point $p_t$. The first square has side length $s$ and each subsequent crop has a side length twice that of the previous one. All crops are resized to have the same dimensions, concatenated, and passed through a CNN that ends with a fully connected layer, which we denote by crop fc.
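To make the glimpse sensor concrete, the following is a minimal sketch of the multi-scale crop extraction. The values of k, s and the output resolution are illustrative defaults of our own, not values fixed by this report, and the channel-wise concatenation simply mirrors the design described here:

```python
import numpy as np
from PIL import Image

def extract_glimpse(image, p_t, k=3, s=32, out_size=32):
    """Extract k square crops centered at p_t, each with twice the side length of
    the previous one, and resize them to a common resolution.

    `image` is an HxWx3 uint8 array and `p_t` is an (x, y) vertex prediction
    assumed to lie inside the image.
    """
    h, w = image.shape[:2]
    x, y = p_t
    crops = []
    for level in range(k):
        half = (s * 2 ** level) // 2
        # Clamp the crop window to the image bounds.
        x0, x1 = max(0, int(x) - half), min(w, int(x) + half)
        y0, y1 = max(0, int(y) - half), min(h, int(y) + half)
        crop = Image.fromarray(image[y0:y1, x0:x1])
        crops.append(np.asarray(crop.resize((out_size, out_size), Image.BILINEAR)))
    # Concatenate the resized crops along the channel axis, as described in the text.
    return np.concatenate(crops, axis=-1)  # shape: out_size x out_size x (3 * k)
```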
$p_t$ itself is also passed through a fully connected layer, the output of which is subsequently concatenated with crop fc. The concatenated activations are then passed through another linear layer to produce the output of this glimpse network. This output is combined with representations of the original image and the object class and fed to the LSTM as before. Figure 3 presents a diagram of the inputs to the LSTM when using glimpses.

In our initial experiments we were not able to train this version of our model to produce interesting outputs. We therefore refrain from discussing experiments with the glimpse network in section 6. Furthermore, we noticed that we made one poor decision in the architecture design: we resized and concatenated the different crops as different channels of input to a single CNN. This is incorrect because the crops have differing receptive fields. Instead, in future work, we will use separate CNNs for each crop and combine their outputs using fully connected layers or gating.

4.3 Output representation

So far we have assumed the predictions of our model would be the absolute Cartesian coordinates of the polygon vertices. However, we could also represent a polygon with polar coordinates, or with relative coordinates. Furthermore, we could quantize the input image using a grid and predict the cell of each vertex. This would convert the problem into a classification task, as opposed to regression, and would require further changes to the loss function. While different coordinate systems and the use of absolute or relative coordinates did not make a difference in our initial experiments, the use of a quantized output grid is explored in detail in section 6.

5 Datasets

5.1 Instance segmentation datasets

To train our models we need datasets that provide instance segmentations, preferably in the form of polygons. For datasets that provide segmentation masks instead of polygon annotations, we could still use them by generating an approximated polygon from each mask; we could employ a convex hull algorithm, for example. However, these approximated examples might negatively impact the performance of our model, and therefore we restrict ourselves to the following two datasets with polygon annotations.

MS COCO [12] includes instance segmentations for 80 different object classes in indoor and outdoor scenes. It contains more than 300,000 images and provides polygon annotations. MS COCO will be the main dataset in our experiments.

SUN Database [20] contains thousands of fully segmented images. In contrast to MS COCO, SUN provides not only object annotations but also part annotations. Annotations are in the form of polygons. While SUN is adequate for our needs, it requires extra preprocessing steps and has a smaller number of images compared to MS COCO.

Other datasets include KITTI [6] and CityScapes [4], which are two of the main datasets used in autonomous driving, as well as PASCAL VOC [5] and the NYU Depth dataset [17, 18].
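As an aside, the mask-to-polygon conversion mentioned above could be done with standard contour tools; a minimal sketch assuming OpenCV 4.x is available (this is not part of our current pipeline, and the simplification tolerance is an arbitrary choice):

```python
import cv2
import numpy as np

def mask_to_polygon(mask, epsilon_frac=0.01):
    """Approximate a binary instance mask (HxW array of 0/1) with a polygon.

    Uses contour extraction plus Douglas-Peucker simplification; a convex hull
    (cv2.convexHull) would be an even simpler, coarser alternative.
    """
    mask_u8 = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros((0, 2), dtype=np.float32)
    contour = max(contours, key=cv2.contourArea)            # keep the largest component
    epsilon = epsilon_frac * cv2.arcLength(contour, True)   # tolerance relative to perimeter
    approx = cv2.approxPolyDP(contour, epsilon, True)       # simplify to a few vertices
    return approx.reshape(-1, 2).astype(np.float32)         # (num_vertices, 2) array of (x, y)
```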
5.2 Artificial dataset

Since the task we are approaching is hard, we designed a synthetic dataset to help evaluate our proposed models in a simpler setting and gain some insights about their performance. In this artificial dataset, we generate random polygons in an image of size 224x224 pixels with a black background. To generate a polygon, we first decide its number of vertices $N$. We assign an angle $a_i$ to each vertex $\{v_i\}_{i=0}^{N-1}$ as $a_i = \frac{2\pi}{N} i + \alpha$, where $\alpha$ is a random angle jitter sampled from a uniform distribution over $[-\frac{2\pi}{N}, \frac{2\pi}{N}]$. We then assign a radius $r_i$ to each vertex $v_i$ by sampling from a uniform distribution over $[0, b_i]$, where $b_i$ is the distance to the image border and depends on the selected angle. Given a radius and an angle, we compute the coordinates of vertex $v_i$ using trigonometry: $x_i = r_i \cos(a_i)$, $y_i = r_i \sin(a_i)$.

We generate four versions of the dataset of increasing difficulty:

• Shapes8: we generate polygons with a fixed number of eight vertices. Hence, the model does not have to predict when to stop predicting vertices. The polygons are filled with a random color so that segmenting the polygon from the background is easy.
• ShapesN: instead of generating polygons with a fixed number of vertices, we generate polygons with a random number of sides in the interval [4, 12]. The polygons are filled with a random color.
• ShapesTexture: we generate polygons with a random number of sides in [4, 12], filled with a random texture out of 100 possibilities. The model has to predict when to stop, and segmenting the polygon from the background is harder.
• ShapesTexBground: we generate polygons as in the ShapesTexture dataset, and fill the background with a texture different from that of the polygon.

Figure 4: Examples from the artificial datasets: (a) Shapes8, (b) ShapesN, (c) ShapesTexture. We created artificial datasets with polygons of a fixed number of sides (Shapes8), polygons with a variable number of sides (ShapesN), and polygons filled with textures (ShapesTexture).

Figure 4 shows examples from the artificial datasets. All the datasets contain 20,000 training images and 2,000 validation images.

5.3 Semi-Artificial dataset

The disadvantage of using artificial datasets is that they may represent tasks that are too simple compared to performing instance segmentation on a large-scale dataset of natural images. For that reason, we also created a semi-artificial dataset we call SemiSynth. This dataset uses polygons with textures, as in the ShapesTexture dataset, but instead of generating a random polygon shape with four to twelve vertices, we employ polygons from COCO. Results obtained on this dataset are still premature and will be reported in the future.

6 Experiments

6.1 Regression experiments

In all our regression experiments we use the PolyRNN model with a CNN made of: a first convolutional layer with 32 kernels of 7x7 pixels, strides of 1x1 pixels, padding of 3x3 pixels and a ReLU activation; a max pooling layer with a 2x2 kernel and 2x2 strides; a second convolutional layer with 256 kernels of 5x5 pixels, strides of 1x1 pixels, padding of 2x2 pixels and a ReLU activation; and another max pooling layer identical to the first one. The RNN uses two layers of 256 GRU units. Finally, we have a fully connected layer that maps the output of the second RNN layer to three numbers: the x coordinate of the predicted vertex, the y coordinate of the predicted vertex, and a boolean value that indicates whether to continue predicting vertices or to stop (end-of-sequence token). We train the model using the Adam optimizer with parameters β1 = 0.01, β2 = 0.005 and a learning rate of 0.0001. We found that different settings of the learning rate resulted in non-significant differences in the final performance of the model. We use mini-batches of 32 examples and train our models from scratch for 100,000 batches. We provide the first ground-truth point to our models so that they know where to start from.
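To make the baseline concrete, the following is a minimal PyTorch-style sketch of this architecture. Only the layer hyperparameters above come from this report; the way the CNN features are pooled, projected and fed to the GRU at each step (feature_dim, the per-step input construction) is a simplifying assumption of the sketch:

```python
import torch
import torch.nn as nn

class PolyRNNRegression(nn.Module):
    """Sketch of the regression baseline: CNN encoder + 2-layer GRU + (x, y, stop) head."""

    def __init__(self, feature_dim=256, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=1, padding=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Pool and project the CNN features to a fixed-size vector (an assumption:
        # the report does not specify how image features are fed to the RNN).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(256, feature_dim)
        # Per-step input: image feature concatenated with the previous vertex (x, y).
        self.rnn = nn.GRU(feature_dim + 2, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)  # x, y, and a stop (EOS) logit

    def forward(self, images, prev_vertices):
        # images: (B, 3, 224, 224); prev_vertices: (B, T, 2)
        feats = self.proj(self.pool(self.cnn(images)).flatten(1))          # (B, feature_dim)
        feats = feats.unsqueeze(1).expand(-1, prev_vertices.size(1), -1)   # repeat per step
        out, _ = self.rnn(torch.cat([feats, prev_vertices], dim=-1))
        return self.head(out)  # (B, T, 3): predicted x, y and stop logit per step
```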
6.2 Shapes8

Figure 5: Shapes8 results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5770, std 0.1270).

Figure 5 shows the results we obtained using PolyRNN on the Shapes8 dataset. We can observe how the training loss quickly decreases. The validation loss follows the training loss, indicating that the model is not overfitting. In the sampled predictions we can see that the model predicts points near the vertices of the ground-truth polygon. We argue that we are limited by the loss function and that better results could be obtained with the loss function options discussed in section 4.1.

6.3 ShapesN

Figure 6: ShapesN results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5261, std 0.1148).

The results obtained for the ShapesN dataset are shown in Figure 6. The training loss reaches a level similar to that achieved on the Shapes8 dataset. However, in this case the validation loss does not decrease in the same way as the training loss, and the model starts to overfit early in training. Upon closer inspection, we noticed that the model predicts polygons with a performance similar to the previous experiment, but it is not able to produce the right end-of-sequence tokens and overfits to the training set. We argue that the model is memorizing the EOS prediction for the training set examples, but that it is not able to generalize to the validation set because it cannot accurately count the number of vertices in a polygon from the image features it receives. This is probably due to the reduced resolution of the input features. Given this, we decided to use the ground-truth sequence length in our predictions. The samples show results close to those obtained on the Shapes8 dataset. The obtained IoU confirms this, with a mean IoU of 52.61%.

6.4 ShapesTexture

For the ShapesTexture dataset we observe results very similar to those of the ShapesN dataset, with a slightly smaller IoU of 52.14%. We conclude that the model is able to distinguish the boundaries of the polygon regardless of whether it is filled with a solid color or a texture.

Figure 7: ShapesTexture results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5214, std 0.1204).

6.5 Classification experiments

In this section we explore the use of a quantized output. We divide our input image into a grid of 28x28 cells and transform each vertex coordinate into the index of the closest cell. This way, we convert our problem into a classification task. Instead of a squared Euclidean distance loss at each timestep, we now use a categorical cross-entropy loss at the output of a softmax. Furthermore, we include an extra option in the softmax to indicate the end of the sequence (EOS token).
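As an illustration, vertex coordinates can be mapped to grid-cell class labels as sketched below. The 28x28 grid over 224x224 inputs and the extra EOS class follow the text; the row-major indexing and the cell-center decoding are our own choices:

```python
import numpy as np

GRID = 28                 # number of cells per side (28x28 grid over a 224x224 image)
IMG_SIZE = 224
CELL = IMG_SIZE // GRID   # each cell covers an 8x8 pixel region
EOS_CLASS = GRID * GRID   # extra class index reserved for the end-of-sequence token

def vertex_to_class(x, y):
    """Map a vertex (x, y) in pixel coordinates to one of 784 grid-cell classes."""
    col = min(int(x) // CELL, GRID - 1)
    row = min(int(y) // CELL, GRID - 1)
    return row * GRID + col

def class_to_vertex(c):
    """Map a predicted class back to the center of its grid cell (approximate inverse)."""
    row, col = divmod(c, GRID)
    return (col * CELL + CELL / 2, row * CELL + CELL / 2)

# Example: quantize a polygon and append the EOS class at the end of the target sequence.
polygon = [(10.0, 12.0), (200.0, 30.0), (120.0, 210.0)]
targets = [vertex_to_class(x, y) for (x, y) in polygon] + [EOS_CLASS]
```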
We modify the CNN from the regression PolyRNN model and instead employ the convolutional layers of AlexNet [10] or VGG-16 [19]. Table 1 shows the IoUs obtained on the artificial datasets, while Figure 8 shows some predictions of the VGG-16 model for the ShapesTexBground dataset.

Dataset            AlexNet     VGG-16
Shapes8            99.3352%    Not tested
ShapesN            96.3143%    Not tested
ShapesTexture      91.7722%    95.7335%
ShapesTexBground   80.7962%    91.0576%

Table 1: Classification results (IoU). We can observe how framing the problem as a classification task allows us to obtain very good performance on the artificial datasets. Note that AlexNet has reduced performance on ShapesTexture and ShapesTexBground compared to VGG-16.

Figure 8: VGG-16 predictions for ShapesTexBground. We can observe how the network is able to distinguish the vertices of the polygons with great accuracy and rarely becomes confused.

We can observe how using a quantized output allows the model to learn very accurate polygon segmentations, achieving IoUs over 90% on the artificial datasets. We can also observe how using a more complex CNN increases performance substantially, as the VGG-16 model outperforms AlexNet on ShapesTexBground by more than 10%.

7 Future Directions

There still remain many interesting directions that we will be exploring in future work.

7.1 Detection

At the moment we are working with images that contain one instance of an object of interest, and our goal is to draw a polygon enclosing this object. However, to create a full instance segmentation pipeline, we need a method for differentiating between the instances to begin with. To this effect, we can employ an object detection framework to extract detections from images of interest and run our algorithm on the detections. Another interesting alternative could be to employ the fully convolutional networks of [13] to get category-level segmentations. We could then modify our pipeline so that, given an input image and a segmentation mask for a specific category, our model produces a long sequence of outputs with potentially many EOS tokens. In this scenario, each EOS token would denote the end of the polygon for one instance of the object category. An interesting property of this approach is that the entire pipeline would be trainable end-to-end.

7.2 Occlusion & Holes

Two other aspects of segmentation that we have not yet attempted to deal with are occlusion and the existence of holes. It is not unusual to find instances of objects where the pixel-level segmentations do not form a single connected component. For example, when a person is standing in front of their motorcycle, the motorcycle is divided into two segments which should not be classified as separate instances. A simple modification to our model, such as an end-of-component token, could potentially be sufficient for dealing with occlusions. Another difficult situation is one in which the object contains holes. An example of this is the empty triangular area in the frame of a bicycle. Holes could also potentially be handled by introducing new tokens that the model could output at different timesteps.

7.3 Fine-Grained Segmentation and attention mechanisms

As mentioned in section 4.2, an important challenge when working with convolutional neural networks as feature extractors is capturing fine details in segmentations.
We will experiment with modifications to our glimpse network to better capture information at small scales, as well as with other attention mechanisms.

8 Conclusions

In this work we addressed the problem of instance segmentation by proposing a model that regresses enclosing polygons for the different object instances in an image. Our model has a number of advantages over traditional methods using dense predictions, and is especially useful as an annotation tool to collect segmentation datasets. We created artificial datasets to test our model and experiment with its different components, on which our method shows promising results. We finally discussed possible improvements and future directions that we will be exploring to further improve the proposed method.

References

[1] Pablo Arbeláez et al. “Multiscale combinatorial grouping”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 328–335.
[2] Liang-Chieh Chen et al. “Semantic image segmentation with deep convolutional nets and fully connected CRFs”. In: arXiv preprint arXiv:1412.7062 (2014).
[3] Yi-Ting Chen, Xiaokai Liu, and Ming-Hsuan Yang. “Multi-Instance Object Segmentation With Occlusion Handling”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3470–3478.
[4] Marius Cordts et al. “The Cityscapes dataset”. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops. Vol. 3. 2015.
[5] Mark Everingham et al. “The PASCAL visual object classes (VOC) challenge”. In: International Journal of Computer Vision 88.2 (2010), pp. 303–338.
[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. “Are we ready for autonomous driving? The KITTI vision benchmark suite”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
[7] Ross Girshick. “Fast R-CNN”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[8] Ross Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[9] Bharath Hariharan et al. “Simultaneous detection and segmentation”. In: Computer Vision–ECCV 2014. Springer, 2014, pp. 297–312.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[11] Xiaodan Liang et al. “Proposal-free network for instance-level object segmentation”. In: arXiv preprint arXiv:1509.02636 (2015).
[12] Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”. In: Computer Vision–ECCV 2014. Springer, 2014, pp. 740–755.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.
[14] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
[15] Shaoqing Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[16] Bernardino Romera-Paredes and Philip H. S. Torr. “Recurrent Instance Segmentation”. In: arXiv preprint arXiv:1511.08250 (2015).
[17] Nathan Silberman and Rob Fergus. “Indoor scene segmentation using a structured light sensor”. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 601–608.
[18] Nathan Silberman et al. “Indoor segmentation and support inference from RGBD images”. In: Computer Vision–ECCV 2012. Springer, 2012, pp. 746–760.
[19] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[20] Jianxiong Xiao et al. “SUN database: Large-scale scene recognition from abbey to zoo”. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3485–3492.
[21] Ziyu Zhang, Sanja Fidler, and Raquel Urtasun. “Instance-Level Segmentation with Deep Densely Connected MRFs”. In: arXiv preprint arXiv:1512.06735 (2015).
[22] Ziyu Zhang et al. “Monocular object instance segmentation and depth ordering with CNNs”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2614–2622.
[23] Shuai Zheng et al. “Conditional random fields as recurrent neural networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1529–1537.