PolyRNN: Polygon-based Instance Segmentation with Recurrent Neural Networks

Kamyar Ghasemipour (Student #1000873548), Department of Computer Science, University of Toronto, [email protected]
Lluis Castrejon (Student #1002201057), Department of Computer Science, University of Toronto, [email protected]

Abstract

Instance segmentation is the task of annotating each region in an image as belonging to a specific instance of an object. In this work we present a novel method for addressing this challenge that, instead of producing dense pixel predictions, produces sequences of vertices defining polygons that enclose individual object instances. Our model uses a convolutional neural network to extract features from an input image that are subsequently fed into a recurrent neural network which predicts polygon vertices one at a time. We experiment with artificial datasets and show promising initial results that suggest our proposed architecture is a suitable approach to the instance segmentation problem.

1 Introduction

Semantic segmentation is the task of classifying regions in an image as belonging to specific object categories. This is a challenging problem in image understanding, with many potential applications. However, the outputs of a semantic segmentation pipeline are not sufficient for certain advanced tasks in which information about the individual instances of different object categories is required. For example, one might want to differentiate between individual cells to determine which ones are healthy in chemical treatments, or a grasping robot might need to differentiate between two neighbouring objects of the same class. Hence, in this project we investigate the task of instance segmentation. Instance segmentation is the problem of not only producing category annotations for pixels in an image, but also marking them as belonging to a specific instance of a particular object class.

In this project we propose a novel method for addressing this challenge: instead of producing pixel-level annotations, our model produces the vertices of a polygon enclosing each object instance. The accuracy of polygon segmentations can be adjusted through the number of vertices the model predicts. Polygons with an appropriate number of vertices produce accurate segmentation results, as evidenced by the fact that a number of segmentation datasets are generated through the use of polygonal annotations. This brings about a second very interesting motivation for our work: the pipeline we present in this project could be used to generate new segmentation datasets more efficiently. One could envision a dataset generation framework in which annotations are initialized using our approach, requiring human annotators to only refine the polygons. This could significantly speed up the annotation process while reducing poor annotations, since less effort is demanded from the annotators.

In the following section we present an overview of instance segmentation techniques, and section 3 describes the details of our model. After discussing the challenges facing our approach, we compare the different datasets available to train our models. We report our experiments and results in section 6 and conclude with a summary of our findings, while proposing directions to extend this work in the future.

2 Related Work

Instance segmentation is a relatively new task in the field of computer vision, and proposed methods have usually been inspired by semantic segmentation and object detection. A recent example is the work of Hariharan et al.
[9], inspired by the R-CNN framework [8, 7, 15]. Their method first generates a large number of region proposals using the technique described in [1]. Then, features are extracted from both the bounding box of each proposal and the region foreground using separate CNNs. Finally, these features are used to classify the region proposals using linear SVMs and non-maximum suppression. Chen et al. [3] use a similar framework while making use of object exemplars to reason about occlusions.

Recently a significant amount of progress has been made in semantic segmentation due to convolutional neural networks. In [13], Long et al. presented Fully Convolutional Networks (FCNs) for addressing this task. By using only convolution, pooling, and upsampling layers, they were able to improve on the previous state of the art by a margin of 20%. Subsequent work from different research groups significantly improved this method by adding CRF-like models on top of FCNs to achieve segmentations with finer detail. Examples of such works include [2] and [23].

Naturally, FCNs have also been employed for instance segmentation. In [11], Liang et al. present a framework that avoids the proposal generation step of [8, 9]. The FCN network from [13] is used to produce category-level segmentations for given images. The authors also fine-tune a modified version of this network that produces a vector of instance counts for each object category while producing, for each pixel, the coordinates of the top-left, center, and bottom-right locations of the bounding box of the instance it belongs to. In a post-processing step, a clustering algorithm is used to assign pixels to instances of different categories.

A notable paper proposing an end-to-end method for instance segmentation is [16]. Romera-Paredes et al. employ a recurrent neural network to sequentially segment instances from images. Specifically, they use convolutional LSTMs built on top of the output of the FCN network of [13]. At each time-step, the output of the convolutional LSTM passes through two separate branches: one which produces a segmentation mask, and one that produces a score indicating how confident the model is that the mask belongs to a new object instance. The unrolling of the RNN is stopped once the confidence falls below a certain threshold. There are two notes worth mentioning about this approach. First, the model does not take into account any information regarding the classes of objects. Second, the authors note that the results obtained from their model are somewhat coarse and, as a result, they employ a CRF as a post-processing step in some experiments.

Some important work has also been done in the context of autonomous driving, where instance segmentation is required to differentiate between the different vehicles, pedestrians and other objects of interest. In particular, [22] and [21] propose methods for instance segmentation from monocular images in this context. We now proceed to describe our method in greater detail.

3 PolyRNN

Our goal is to generate a sequence of vertices $v_i$ for an image $x_i$ that defines a polygon enclosing the region of the image where an object of a specific type is located. For each example, we have access to a sequence of vertices $y_i$ that defines the ground-truth enclosing polygon. Each image $x_i$ contains at least one object to segment and corresponds to the object bounding box plus a 10% expansion to capture more context.
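For illustration, a minimal sketch of how such an input patch could be prepared is given below (the helper name, the use of PIL, and the bilinear resize are our own illustrative choices; only the 10% expansion and the 224x224 patch size come from the text):

```python
import numpy as np
from PIL import Image

def crop_with_context(image, box, expand=0.10, out_size=224):
    """Crop an object bounding box plus a margin and resize to a square patch.

    `box` is (x_min, y_min, x_max, y_max) in pixel coordinates. The 10% expansion
    and 224x224 output follow the setup described in the text; the exact cropping
    code used in our pipeline may differ.
    """
    w, h = image.size
    bw, bh = box[2] - box[0], box[3] - box[1]
    dx, dy = expand * bw, expand * bh
    # Clamp the expanded box to the image bounds.
    x0, y0 = max(0, box[0] - dx), max(0, box[1] - dy)
    x1, y1 = min(w, box[2] + dx), min(h, box[3] + dy)
    patch = image.crop((int(x0), int(y0), int(x1), int(y1)))
    patch = patch.resize((out_size, out_size), Image.BILINEAR)
    return np.asarray(patch)
```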
We define the probability of predicting a polygon as $p(v \mid x) = \prod_{i=0}^{N-1} p(v_i \mid v_0, \ldots, v_{i-1}, x)$. A way to model this probability distribution is to use an RNN decoder that predicts one polygon vertex at a time and that is conditioned on an image $x$. While one could feed the RNN with the full image patch, this would be very computationally expensive as our images are of size 224x224. Therefore, a more reasonable option is to first extract image features using a CNN, and feed these features to the RNN. One can then train the system end to end so that the error in predicting the polygon with the RNN is backpropagated through the CNN.

Figure 1: Diagram of our baseline model architecture.

To evaluate a prediction we compute the Intersection over Union (IoU) of the regions defined by the predicted polygon and the ground-truth polygon. We choose this metric because the exact polygon we predict is not important as long as it encloses the same region. Therefore, a predicted polygon should not be penalized for having a different number of vertices from the ground truth or for predicting the vertices from a different starting point. Figure 1 shows a diagram of our model.

4 Challenges

This section outlines a series of challenges that our framework needs to overcome, some of which are addressed in this report and some of which are left to future work.

4.1 Loss function

Directly optimizing the IoU metric used for evaluation is hard. From a predicted polygon, we need to compute the enclosed area to be able to obtain the intersection and the union. Finding an analytical gradient for this operation is not trivial. Furthermore, it can only be computed once we have the full vertex sequence for a polygon. Using only such a loss would pose difficulties in training, as propagating the gradients from the last timesteps back in time through the RNN and the CNN might not be enough to train the model because of the vanishing gradient problem.

Instead, a first, simpler approach is to compute the squared Euclidean distance between each predicted vertex and the corresponding ground-truth vertex, and to directly optimize this metric. This has the advantage that there is a gradient at each timestep. However, it penalizes predictions that do not have the same number of vertices as the ground truth, as well as predictions with a different vertex ordering. In the next section we propose a modification that makes this loss agnostic to the starting point of our prediction.

A more advanced loss could be defined that does not penalize predicting a different number of vertices. The idea is to define a spatial loss that penalizes i) predicting a point off the edge defined between the last predicted vertex and the next vertex in the sequence, and ii) predicting a point that is on the edge but far from the next vertex in the sequence. Predicting a point off the edge should incur a stronger penalty than predicting a point on the edge but far away from the next vertex. This loss would allow the model to predict a polygon with more vertices than the ground truth as long as, at each step, it advances towards the next vertex in the ground-truth sequence. However, this loss is hard to implement as it is conditional on the previously predicted vertices.
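For reference, the polygon IoU used for evaluation can be computed with an off-the-shelf geometry library; the following is a minimal sketch assuming the shapely package is available, not the exact evaluation code used in our experiments:

```python
from shapely.geometry import Polygon

def polygon_iou(pred_vertices, gt_vertices):
    """Intersection over Union between two polygons given as lists of (x, y) vertices."""
    pred = Polygon(pred_vertices).buffer(0)  # buffer(0) repairs minor self-intersections
    gt = Polygon(gt_vertices).buffer(0)
    inter = pred.intersection(gt).area
    union = pred.union(gt).area
    return inter / union if union > 0 else 0.0

# Example: a square predicted with a one-unit horizontal offset from the ground truth.
print(polygon_iou([(0, 0), (10, 0), (10, 10), (0, 10)],
                  [(1, 0), (11, 0), (11, 10), (1, 10)]))  # ~0.818
```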
4.1.1 Agnostic Loss

One of the problems with using a loss function based on the distance of predicted vertices to ground-truth vertices is that, even if our model predicts the ground-truth vertices, it will still be penalized if the ordering of the predictions is a rotated version of the original. To amend this situation, for a given loss function $f$, we compute the order-agnostic version $f_{ag}$ as follows. Let $P = \{p_i\}_{i=0}^{n-1}$ denote the sequence of vertices predicted by our model and let $Q = \{q_i\}_{i=0}^{n-1}$ denote the sequence of ground-truth vertices. Furthermore, for each $k \in \{0, \ldots, n-1\}$, let $P_k = \{p_{(i-k) \bmod n}\}_{i=0}^{n-1}$ denote the cyclic rotation of $P$ by $k$ positions. Then, $f_{ag}(P, Q) = \min_k f(P_k, Q)$.

A problem with this loss function (as with the Euclidean distance measure) is that the number of vertices of the prediction must match that of the ground truth. If this condition is not satisfied, we need to pad or slice the output predictions before computing the loss. So far, however, we have only tested the agnostic loss in scenarios in which we unroll the LSTM for as many steps as the number of ground-truth vertices.

Figure 2: Progression of training (row-major ordering). Left: without order agnosia. Right: with order agnosia.

Figure 2 demonstrates the differences in the progress of the model when trained on Shapes8 (described in the next section) with and without an order-agnostic loss. As can be seen, without this loss the model initially focuses solely on getting the ordering correct. Furthermore, the Shapes8 data has the nice property that the i-th vertex lies in roughly the same area in all images. For natural images, it would be almost impossible for the model to learn anything if this modified loss were not used.

4.2 Fine-grained Segmentation

An important challenge plaguing many frameworks for semantic segmentation is being able to produce segmentations that take into account the fine structure of the objects present in images. This is because most successful methods incorporate convolutional neural networks in order to extract high-level semantic information from images. However, there is a trade-off: deeper layers of CNNs provide more semantic information, but due to pooling layers as well as different strides of convolution kernels, information about exact locations and fine structures may be lost.

In recent segmentation literature, two main techniques are used to deal with this scenario. The first is to refine segmentations produced by networks such as the fully convolutional networks of [13] using CRFs or other models that behave in a similar fashion. We could potentially use a similar approach, but it would need to be implemented as a post-processing step in which polygons are converted to pixel segmentations and the pixels are subsequently refined. This, however, deviates from our desire to have good polygonal boundaries to begin with. The second approach is to fuse together activations from different levels of CNNs. In this way, the models are able to make predictions while considering the local as well as the global structure of the images. Motivated by this, we also implemented a model that employs a variation of the glimpse network of Mnih et al. [14] as a key component.

Figure 3: Inputs to the LSTM when glimpses are employed.

Given the vertex prediction $p_t$ of the model at step $t$, the glimpse sensor extracts a glimpse by taking $k$ square crops of the input image centered at point $p_t$. The first square has side length $s$ and each subsequent crop has a side length twice that of the previous one. All crops are resized to have the same dimensions, concatenated, and passed through a CNN that ends with a fully connected layer, which we denote by crop fc.
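To make the glimpse sensor concrete, the following is a minimal sketch of the multi-scale crop extraction. The values of k, s and the output resolution are illustrative defaults of our own, not values fixed by this report, and the channel-wise concatenation simply mirrors the design described here:

```python
import numpy as np
from PIL import Image

def extract_glimpse(image, p_t, k=3, s=32, out_size=32):
    """Extract k square crops centered at p_t, each with twice the side length of
    the previous one, and resize them to a common resolution.

    `image` is an HxWx3 uint8 array and `p_t` is an (x, y) vertex prediction
    assumed to lie inside the image.
    """
    h, w = image.shape[:2]
    x, y = p_t
    crops = []
    for level in range(k):
        half = (s * 2 ** level) // 2
        # Clamp the crop window to the image bounds.
        x0, x1 = max(0, int(x) - half), min(w, int(x) + half)
        y0, y1 = max(0, int(y) - half), min(h, int(y) + half)
        crop = Image.fromarray(image[y0:y1, x0:x1])
        crops.append(np.asarray(crop.resize((out_size, out_size), Image.BILINEAR)))
    # Concatenate the resized crops along the channel axis, as described in the text.
    return np.concatenate(crops, axis=-1)  # shape: out_size x out_size x (3 * k)
```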
$p_t$ itself is also passed through a fully connected layer, the output of which is subsequently concatenated with crop fc. The concatenated activations are then passed through another linear layer to produce the output of this glimpse network. This output is combined with representations of the original image and the object class and fed to the LSTM as before. Figure 3 presents a diagram of the inputs to the LSTM when using glimpses.

In our initial experiments we were not able to train this version of our model to produce interesting outputs. We therefore refrain from discussing experiments with the glimpse network in section 6. Furthermore, we noticed that we made one poor decision in the architecture design: we resized and concatenated the different crops as different channels of input to a single CNN. This is incorrect because the crops have differing receptive fields. Instead, in future work, we will use separate CNNs for each crop and combine their outputs using fully connected layers or gating.

4.3 Output representation

So far we have assumed the predictions of our model would be the absolute Cartesian coordinates of the polygon vertices. However, we could also represent a polygon with polar coordinates, or with relative coordinates. Furthermore, we could quantize the input image using a grid and predict the cell of each vertex. This would convert the problem into a classification task, as opposed to regression, and would require further changes to the loss function. While different coordinate systems and the use of absolute or relative coordinates did not make a difference in our initial experiments, the use of a quantized output grid is explored in detail in section 6.

5 Datasets

5.1 Instance segmentation datasets

To train our models we need datasets that provide instance segmentations, preferably in the form of polygons. For datasets that provide segmentation masks instead of polygon annotations, we could still use them by generating an approximated polygon from each mask; we could employ a convex hull algorithm, for example. However, these approximated examples might negatively impact the performance of our model, and therefore we restrict ourselves to the following two datasets with polygon annotations.

MS COCO [12] includes instance segmentations for 80 different object classes in indoor and outdoor scenes. It contains more than 300,000 images and provides polygon annotations. MS COCO will be the main dataset in our experiments.

SUN Database [20] contains thousands of fully segmented images. In contrast to MS COCO, SUN provides not only object annotations but also part annotations. Annotations are in the form of polygons. While SUN is adequate for our needs, it requires extra preprocessing steps and has a smaller number of images compared to MS COCO.

Other datasets include KITTI [6] and CityScapes [4], which are two of the main datasets used in autonomous driving, as well as PASCAL VOC [5] and the NYU Depth dataset [17, 18].
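As an aside, the mask-to-polygon conversion mentioned above could be done with standard contour tools; a minimal sketch assuming OpenCV 4.x is available (this is not part of our current pipeline, and the simplification tolerance is an arbitrary choice):

```python
import cv2
import numpy as np

def mask_to_polygon(mask, epsilon_frac=0.01):
    """Approximate a binary instance mask (HxW array of 0/1) with a polygon.

    Uses contour extraction plus Douglas-Peucker simplification; a convex hull
    (cv2.convexHull) would be an even simpler, coarser alternative.
    """
    mask_u8 = mask.astype(np.uint8) * 255
    contours, _ = cv2.findContours(mask_u8, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    if not contours:
        return np.zeros((0, 2), dtype=np.float32)
    contour = max(contours, key=cv2.contourArea)            # keep the largest component
    epsilon = epsilon_frac * cv2.arcLength(contour, True)   # tolerance relative to perimeter
    approx = cv2.approxPolyDP(contour, epsilon, True)       # simplify to a few vertices
    return approx.reshape(-1, 2).astype(np.float32)         # (num_vertices, 2) array of (x, y)
```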
5.2 Artificial dataset

Since the task we are approaching is hard, we designed a synthetic dataset to help evaluate our proposed models in a simpler setting and gain some insights about their performance. In this artificial dataset, we generate random polygons in an image of size 224x224 pixels with a black background. To generate a polygon, we first decide its number of vertices $N$. We assign an angle $a_i$ to each vertex $\{v_i\}_{i=0}^{N-1}$ as $a_i = \frac{2\pi}{N} i + \alpha$, where $\alpha$ is a random angle jitter sampled from a uniform distribution over $[-\frac{2\pi}{N}, \frac{2\pi}{N}]$. We then assign a radius $r_i$ to each vertex $v_i$ by sampling from a uniform distribution over $[0, b_i]$, where $b_i$ is the distance to the image border and depends on the selected angle. Given a radius and an angle, we compute the coordinates of vertex $v_i$ using trigonometry: $x_i = r_i \cos(a_i)$, $y_i = r_i \sin(a_i)$.

We generate four versions of the dataset of increasing difficulty:

• Shapes8: we generate polygons with a fixed number of eight vertices. Hence, the model does not have to predict when to stop predicting vertices. The polygons are filled with a random color so that segmenting the polygon from the background is easy.
• ShapesN: instead of generating polygons with a fixed number of vertices, we generate polygons with a random number of sides in the interval [4, 12]. The polygons are filled with a random color.
• ShapesTexture: we generate polygons with a random number of sides in [4, 12], filled with a random texture out of 100 possibilities. The model has to predict when to stop, and segmenting the polygon from the background is harder.
• ShapesTexBground: we generate polygons as in the ShapesTexture dataset, and fill the background with a texture different from that of the polygon.

Figure 4: Examples from the artificial datasets: (a) Shapes8, (b) ShapesN, (c) ShapesTexture. We created artificial datasets with polygons of a fixed number of sides (Shapes8), polygons with a variable number of sides (ShapesN), and polygons filled with textures (ShapesTexture).

Figure 4 shows examples from the artificial datasets. All the datasets contain 20,000 training images and 2,000 validation images.

5.3 Semi-Artificial dataset

The disadvantage of using artificial datasets is that they may represent tasks that are too simple compared to performing instance segmentation on a large-scale dataset of natural images. For that reason, we also created a semi-artificial dataset we call SemiSynth. This dataset uses polygons with textures, as in the ShapesTexture dataset, but instead of generating a random polygon shape with four to twelve vertices, we employ polygons from COCO. Results obtained on this dataset are still premature and will be reported in the future.

6 Experiments

6.1 Regression experiments

In all our regression experiments we use the PolyRNN model with a CNN made of: a first convolutional layer with 32 kernels of 7x7 pixels, strides of 1x1 pixels, padding of 3x3 pixels and a ReLU activation; a max pooling layer with a 2x2 kernel and 2x2 strides; a second convolutional layer with 256 kernels of 5x5 pixels, strides of 1x1 pixels, padding of 2x2 pixels and a ReLU activation; and another max pooling layer identical to the first one. The RNN uses two layers of 256 GRU units. Finally, we have a fully connected layer that maps the output of the second RNN layer to three numbers: the x coordinate of the predicted vertex, the y coordinate of the predicted vertex, and a boolean value that indicates whether to continue predicting vertices or to stop (end-of-sequence token). We train the model using the Adam optimizer with parameters β1 = 0.01, β2 = 0.005 and a learning rate of 0.0001. We found that different settings of the learning rate resulted in non-significant differences in the final performance of the model. We use mini-batches of 32 examples and train our models from scratch for 100,000 batches. We provide the first ground-truth point to our models so that they know where to start from.
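To make the baseline concrete, the following is a minimal PyTorch-style sketch of this architecture. Only the layer hyperparameters above come from this report; the way the CNN features are pooled, projected and fed to the GRU at each step (feature_dim, the per-step input construction) is a simplifying assumption of the sketch:

```python
import torch
import torch.nn as nn

class PolyRNNRegression(nn.Module):
    """Sketch of the regression baseline: CNN encoder + 2-layer GRU + (x, y, stop) head."""

    def __init__(self, feature_dim=256, hidden_dim=256):
        super().__init__()
        self.cnn = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=7, stride=1, padding=3), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
            nn.Conv2d(32, 256, kernel_size=5, stride=1, padding=2), nn.ReLU(),
            nn.MaxPool2d(kernel_size=2, stride=2),
        )
        # Pool and project the CNN features to a fixed-size vector (an assumption:
        # the report does not specify how image features are fed to the RNN).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.proj = nn.Linear(256, feature_dim)
        # Per-step input: image feature concatenated with the previous vertex (x, y).
        self.rnn = nn.GRU(feature_dim + 2, hidden_dim, num_layers=2, batch_first=True)
        self.head = nn.Linear(hidden_dim, 3)  # x, y, and a stop (EOS) logit

    def forward(self, images, prev_vertices):
        # images: (B, 3, 224, 224); prev_vertices: (B, T, 2)
        feats = self.proj(self.pool(self.cnn(images)).flatten(1))          # (B, feature_dim)
        feats = feats.unsqueeze(1).expand(-1, prev_vertices.size(1), -1)   # repeat per step
        out, _ = self.rnn(torch.cat([feats, prev_vertices], dim=-1))
        return self.head(out)  # (B, T, 3): predicted x, y and stop logit per step
```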
6.2 Shapes8

Figure 5: Shapes8 results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5770, std 0.1270).

Figure 5 shows the results we obtained using PolyRNN on the Shapes8 dataset. We can observe how the training loss quickly decreases. The validation loss follows the training loss, indicating that the model is not overfitting. In the sampled predictions we can see that the model predicts points near the vertices of the ground-truth polygon. We argue that we are limited by the loss function and that better results could be obtained with the loss function options discussed in section 4.1.

6.3 ShapesN

Figure 6: ShapesN results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5261, std 0.1148).

The results obtained for the ShapesN dataset are shown in Figure 6. The training loss reaches a level similar to that achieved on the Shapes8 dataset. However, in this case the validation loss does not decrease in the same way as the training loss, and the model starts to overfit early in training. Upon closer inspection, we noticed that the model predicts polygons with a performance similar to the previous experiment, but it is not able to produce the right end-of-sequence tokens and overfits to the training set. We argue that the model is memorizing the EOS prediction for the training set examples, but that it is not able to generalize to the validation set because it cannot accurately count the number of vertices in a polygon from the image features it receives. This is probably due to the reduced resolution of the input features. Given this, we decided to use the ground-truth sequence length in our predictions. The samples show results close to those obtained on the Shapes8 dataset. The obtained IoU confirms this, with a mean IoU of 52.61%.

6.4 ShapesTexture

For the ShapesTexture dataset we observe results very similar to those of the ShapesN dataset, with a slightly smaller IoU of 52.14%. We conclude that the model is able to distinguish the boundaries of the polygon regardless of whether it is filled with a solid color or a texture.

Figure 7: ShapesTexture results. Left: evolution of the training and validation loss. Right: examples of predictions on the validation set. Bottom: intersection over union on the validation set (mean 0.5214, std 0.1204).

6.5 Classification experiments

In this section we explore the use of a quantized output. We divide our input image into a grid of 28x28 cells and transform each vertex coordinate into the index of the closest cell. This way, we convert our problem into a classification task. Instead of a squared Euclidean distance loss at each timestep, we now use a categorical cross-entropy loss at the output of a softmax. Furthermore, we include an extra option in the softmax to indicate the end of the sequence (EOS token).
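As an illustration, vertex coordinates can be mapped to grid-cell class labels as sketched below. The 28x28 grid over 224x224 inputs and the extra EOS class follow the text; the row-major indexing and the cell-center decoding are our own choices:

```python
import numpy as np

GRID = 28                 # number of cells per side (28x28 grid over a 224x224 image)
IMG_SIZE = 224
CELL = IMG_SIZE // GRID   # each cell covers an 8x8 pixel region
EOS_CLASS = GRID * GRID   # extra class index reserved for the end-of-sequence token

def vertex_to_class(x, y):
    """Map a vertex (x, y) in pixel coordinates to one of 784 grid-cell classes."""
    col = min(int(x) // CELL, GRID - 1)
    row = min(int(y) // CELL, GRID - 1)
    return row * GRID + col

def class_to_vertex(c):
    """Map a predicted class back to the center of its grid cell (approximate inverse)."""
    row, col = divmod(c, GRID)
    return (col * CELL + CELL / 2, row * CELL + CELL / 2)

# Example: quantize a polygon and append the EOS class at the end of the target sequence.
polygon = [(10.0, 12.0), (200.0, 30.0), (120.0, 210.0)]
targets = [vertex_to_class(x, y) for (x, y) in polygon] + [EOS_CLASS]
```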
We modify the CNN from the regression PolyRNN model and instead employ the convolutional layers of AlexNet [10] or VGG-16 [19]. Table 1 shows the IoUs obtained on the artificial datasets, while Figure 8 shows some predictions of the VGG-16 model for the ShapesTexBground dataset.

Dataset            AlexNet     VGG-16
Shapes8            99.3352%    Not tested
ShapesN            96.3143%    Not tested
ShapesTexture      91.7722%    95.7335%
ShapesTexBground   80.7962%    91.0576%

Table 1: Classification results (IoU). We can observe how framing the problem as a classification task allows us to obtain very good performance on the artificial datasets. Note that AlexNet has reduced performance on ShapesTexture and ShapesTexBground compared to VGG-16.

Figure 8: VGG-16 predictions for ShapesTexBground. We can observe how the network is able to distinguish the vertices of the polygons with great accuracy and rarely becomes confused.

We can observe how using a quantized output allows the model to learn very accurate polygon segmentations, achieving IoUs over 90% on the artificial datasets. We can also observe how using a more complex CNN increases performance substantially, as the VGG-16 model outperforms AlexNet on ShapesTexBground by more than 10%.

7 Future Directions

There still remain many interesting directions that we will be exploring in future work.

7.1 Detection

At the moment we are working with images that contain one instance of an object of interest, and our goal is to draw a polygon enclosing this object. However, to create a full instance segmentation pipeline, we need a method for differentiating between the instances to begin with. To this effect, we can employ an object detection framework to extract detections from images of interest and run our algorithm on the detections. Another interesting alternative could be to employ the fully convolutional networks of [13] to get category-level segmentations. We could then modify our pipeline so that, given an input image and a segmentation mask for a specific category, our model produces a long sequence of outputs with potentially many EOS tokens. In this scenario, each EOS token would denote the end of the polygon for one instance of the object category. An interesting property of this approach is that the entire pipeline would be trainable end-to-end.

7.2 Occlusion & Holes

Two other aspects of segmentation that we have not yet attempted to deal with are occlusion and the existence of holes. It is not unusual to find instances of objects where the pixel-level segmentations do not form a single connected component. For example, when a person is standing in front of their motorcycle, the motorcycle is divided into two segments which should not be classified as separate instances. A simple modification to our model, such as an end-of-component token, could potentially be sufficient for dealing with occlusions. Another difficult situation is one in which the object contains holes. An example of this is the empty triangular area in the frame of a bicycle. Holes could also potentially be handled by introducing new tokens that the model could output at different timesteps.

7.3 Fine-Grained Segmentation and attention mechanisms

As mentioned in section 4.2, an important challenge when working with convolutional neural networks as feature extractors is capturing fine details in segmentations.
We will experiment with modifications to our glimpse network to better capture information at small scales, as well as with other attention mechanisms.

8 Conclusions

In this work we addressed the problem of instance segmentation by proposing a model that regresses enclosing polygons for the different object instances in an image. Our model has a number of advantages over traditional methods using dense predictions, and is especially useful as an annotation tool to collect segmentation datasets. We created artificial datasets to test our model and experiment with its different components, on which our method shows promising results. We finally discussed possible improvements and future directions that we will be exploring to further improve the proposed method.

References

[1] Pablo Arbeláez et al. “Multiscale combinatorial grouping”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 328–335.
[2] Liang-Chieh Chen et al. “Semantic image segmentation with deep convolutional nets and fully connected CRFs”. In: arXiv preprint arXiv:1412.7062 (2014).
[3] Yi-Ting Chen, Xiaokai Liu, and Ming-Hsuan Yang. “Multi-Instance Object Segmentation With Occlusion Handling”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3470–3478.
[4] Marius Cordts et al. “The Cityscapes dataset”. In: Proc. IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) Workshops. Vol. 3. 2015.
[5] Mark Everingham et al. “The PASCAL visual object classes (VOC) challenge”. In: International Journal of Computer Vision 88.2 (2010), pp. 303–338.
[6] Andreas Geiger, Philip Lenz, and Raquel Urtasun. “Are we ready for autonomous driving? The KITTI vision benchmark suite”. In: Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on. IEEE, 2012, pp. 3354–3361.
[7] Ross Girshick. “Fast R-CNN”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1440–1448.
[8] Ross Girshick et al. “Rich feature hierarchies for accurate object detection and semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2014, pp. 580–587.
[9] Bharath Hariharan et al. “Simultaneous detection and segmentation”. In: Computer Vision–ECCV 2014. Springer, 2014, pp. 297–312.
[10] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. “ImageNet classification with deep convolutional neural networks”. In: Advances in Neural Information Processing Systems. 2012, pp. 1097–1105.
[11] Xiaodan Liang et al. “Proposal-free network for instance-level object segmentation”. In: arXiv preprint arXiv:1509.02636 (2015).
[12] Tsung-Yi Lin et al. “Microsoft COCO: Common objects in context”. In: Computer Vision–ECCV 2014. Springer, 2014, pp. 740–755.
[13] Jonathan Long, Evan Shelhamer, and Trevor Darrell. “Fully convolutional networks for semantic segmentation”. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015, pp. 3431–3440.
[14] Volodymyr Mnih, Nicolas Heess, Alex Graves, et al. “Recurrent models of visual attention”. In: Advances in Neural Information Processing Systems. 2014, pp. 2204–2212.
[15] Shaoqing Ren et al. “Faster R-CNN: Towards real-time object detection with region proposal networks”. In: Advances in Neural Information Processing Systems. 2015, pp. 91–99.
[16] Bernardino Romera-Paredes and Philip H. S. Torr. “Recurrent Instance Segmentation”. In: arXiv preprint arXiv:1511.08250 (2015).
[17] Nathan Silberman and Rob Fergus. “Indoor scene segmentation using a structured light sensor”. In: Computer Vision Workshops (ICCV Workshops), 2011 IEEE International Conference on. IEEE, 2011, pp. 601–608.
[18] Nathan Silberman et al. “Indoor segmentation and support inference from RGBD images”. In: Computer Vision–ECCV 2012. Springer, 2012, pp. 746–760.
[19] Karen Simonyan and Andrew Zisserman. “Very deep convolutional networks for large-scale image recognition”. In: arXiv preprint arXiv:1409.1556 (2014).
[20] Jianxiong Xiao et al. “SUN database: Large-scale scene recognition from abbey to zoo”. In: Computer Vision and Pattern Recognition (CVPR), 2010 IEEE Conference on. IEEE, 2010, pp. 3485–3492.
[21] Ziyu Zhang, Sanja Fidler, and Raquel Urtasun. “Instance-Level Segmentation with Deep Densely Connected MRFs”. In: arXiv preprint arXiv:1512.06735 (2015).
[22] Ziyu Zhang et al. “Monocular object instance segmentation and depth ordering with CNNs”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 2614–2622.
[23] Shuai Zheng et al. “Conditional random fields as recurrent neural networks”. In: Proceedings of the IEEE International Conference on Computer Vision. 2015, pp. 1529–1537.