We all know about the image classification problem: given an image, can you find the class it belongs to? Most new image classification problems can be solved with ConvNets and transfer learning using pre-trained nets.
As a side note: if you want to know more about ConvNets and transfer learning, I would like to recommend this awesome course on Deep Learning in Computer Vision in the Advanced Machine Learning specialization. The course covers various CNN architectures and a wide variety of problems in the image domain, including detection and segmentation.
But there are many more interesting problems in the image domain. The ones we are going to focus on today are the segmentation, localization and detection problems. So what are these problems?
They fall into 4 major buckets. In the next few lines I will try to explain each of them concisely before we take a deeper dive:
- Semantic Segmentation: Given an image, can we classify each pixel as belonging to a particular class?
- Classification+Localization: We were able to classify an image as a cat. Great. Can we also get the location of the said cat in that image by drawing a bounding box around it? Here we assume that there is a fixed number of objects (commonly 1) in the image.
- Object Detection: A more general case of the Classification+Localization problem. In a real-world setting, we don't know beforehand how many objects are in the image. So can we detect all the objects in the image and draw bounding boxes around them?
- Instance Segmentation: Can we create a mask for each individual object in the image? This is different from semantic segmentation. How? If you look at the 4th image at the top, semantic segmentation would not be able to distinguish between the two dogs, as it would merge both of them into one region.
In this post, we will focus mainly on Object Detection.
So let's first try to understand how we can solve the problem when we have a single object in the image: the Classification+Localization case. It is put pretty neatly in the CS231n notes:
Input Data: Let's first talk about what sort of data such a model expects. In a plain image classification setting we have data in the form (X,y), where X is the image and y is the class label. In the Classification+Localization setting we normally still have data in the form (X,y), where X is the image, but y is now an array containing (class_label, x,y,w,h) where,
x = bounding box top left corner x-coordinate
y = bounding box top left corner y-coordinate
w = width of bounding box in pixels
h = height of bounding box in pixels
Model: In this setting we create a multi-output model which takes an image as the input and has (n_labels + 4) output nodes: n_labels nodes, one for each output class, and 4 nodes that give the predictions for (x,y,w,h).
Loss: Getting the loss right is pretty important in this setting. Normally the loss is a weighted sum of the softmax loss (from the classification problem) and the L2 regression loss (from the bounding box coordinates), with a hyper-parameter alpha weighting one term against the other.
Since these two losses are on different scales, the alpha hyper-parameter needs to be tuned.
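To make the weighted sum concrete, here is a minimal NumPy sketch. It is only an illustration under my own assumptions: the function name `localization_loss`, the choice to put alpha on the regression term, and the layout of the output vector (class scores first, then the 4 box values) are not taken from the CS231n notes.

```python
import numpy as np

def localization_loss(outputs, class_label, box, n_labels, alpha=1.0):
    """Softmax loss + alpha * L2 box loss for one image.

    outputs: vector of n_labels + 4 raw network outputs
             (assumed layout: class scores first, then x, y, w, h).
    """
    outputs = np.asarray(outputs, dtype=float)
    class_scores, box_pred = outputs[:n_labels], outputs[n_labels:]

    # Softmax cross-entropy, computed in a numerically stable way.
    shifted = class_scores - class_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[class_label]

    # Squared L2 distance between predicted and true (x, y, w, h).
    reg_loss = np.sum((box_pred - np.asarray(box, dtype=float)) ** 2)

    return cls_loss + alpha * reg_loss

# Toy example: 3 classes, box predicted 1 pixel off in x.
loss = localization_loss([2.0, 0.0, 0.0, 11, 20, 30, 40],
                         class_label=0, box=[10, 20, 30, 40],
                         n_labels=3, alpha=0.5)
print(loss)
```

With raw pixel coordinates the regression term can easily dwarf the classification term, which is one reason alpha needs tuning.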
There is one thing I would like to note here. We are trying to do an object localization task, but we still have our convnets in place. We are just adding one more output layer to also predict the coordinates of the bounding box, and tweaking our loss function. And herein lies the essence of the whole Deep Learning framework - stack layers on top of each other, reuse components to create better models, and create architectures to solve your own problem. And that is what we are going to see a lot of going forward.
So how does this idea of localization using regression get mapped to Object Detection? It doesn't. We don't have a fixed number of objects, so we can't have just 4 outputs denoting the bounding box coordinates.
One naive idea could be to apply a CNN to many different crops of the image, where the CNN classifies each crop as an object class or as background. This is intractable, because the number of possible crops is enormous.
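To get a feel for how many crops there are, here is a small back-of-the-envelope sketch. It counts every axis-aligned box with pixel-aligned edges; the function name `count_crops` is mine and is only for illustration.

```python
def count_crops(height, width):
    """Number of distinct axis-aligned, pixel-aligned boxes in an image.

    A box is defined by picking 2 of the (height + 1) horizontal grid
    lines and 2 of the (width + 1) vertical grid lines.
    """
    vertical_choices = height * (height + 1) // 2
    horizontal_choices = width * (width + 1) // 2
    return vertical_choices * horizontal_choices

print(count_crops(224, 224))  # 635040000 possible crops for one 224x224 image
```

Even with a coarse stride and a handful of sizes, running a full convnet per crop quickly becomes too slow.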
If only there were a method (normally called a region proposal method) which could find some cropped regions for us automatically, we could just run our convnet on those regions and be done with object detection. And that is what selective search (Uijlings et al, "Selective Search for Object Recognition", IJCV 2013) provided for RCNN.
So what are Region Proposals:
- Find "blobby" image regions that are likely to contain objects
- Relatively fast to run; e.g. Selective Search gives 2000 region proposals in a few seconds on CPU
How are the region proposals made?
Selective Search for Object Recognition:
This paper starts with a set of initial regions obtained using (P. F. Felzenszwalb and D. P. Huttenlocher. Efficient Graph-Based Image Segmentation. IJCV, 59:167–181, 2004).
In this paper they take an approach:
As you can see, if we create bounding boxes around these masks we will be losing a lot of regions. We want to have the whole baseball player in a single bounding box/frame, so we need to somehow group these initial regions. For that, the authors of Selective Search for Object Recognition apply the Hierarchical Grouping algorithm to these initial regions. In this algorithm they repeatedly merge the most similar regions together, using several notions of similarity based on colour, texture, size and fill.
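The grouping loop can be sketched roughly as below. This is a toy version under loud assumptions: regions are dicts I made up (`box` as (x1, y1, x2, y2), `size` in pixels, `hist` as a normalized colour histogram), texture similarity is left out, and every pair of regions is compared, whereas the paper only compares neighbouring regions.

```python
import itertools
import numpy as np

def union_box(a, b):
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))

def box_area(b):
    return (b[2] - b[0]) * (b[3] - b[1])

def similarity(r1, r2, image_area):
    # Colour: histogram intersection. Size: prefer merging small regions.
    # Fill: prefer pairs whose joint bounding box has little empty space.
    colour = np.minimum(r1["hist"], r2["hist"]).sum()
    size = 1.0 - (r1["size"] + r2["size"]) / image_area
    joint = box_area(union_box(r1["box"], r2["box"]))
    fill = 1.0 - (joint - r1["size"] - r2["size"]) / image_area
    return colour + size + fill

def merge(r1, r2):
    total = r1["size"] + r2["size"]
    return {"box": union_box(r1["box"], r2["box"]),
            "size": total,
            "hist": (r1["size"] * r1["hist"] + r2["size"] * r2["hist"]) / total}

def hierarchical_grouping(regions, image_area):
    """Greedily merge the most similar pair until one region is left.

    Every region that ever exists contributes its box as a proposal.
    """
    active = list(regions)
    proposals = [r["box"] for r in active]
    while len(active) > 1:
        i, j = max(itertools.combinations(range(len(active)), 2),
                   key=lambda p: similarity(active[p[0]], active[p[1]], image_area))
        merged = merge(active[i], active[j])
        active = [r for k, r in enumerate(active) if k not in (i, j)] + [merged]
        proposals.append(merged["box"])
    return proposals

# Toy input: two similar-coloured pieces of the player and one unrelated blob.
regions = [
    {"box": (10, 10, 30, 50), "size": 800, "hist": np.array([0.9, 0.1])},
    {"box": (10, 50, 30, 90), "size": 800, "hist": np.array([0.8, 0.2])},
    {"box": (60, 10, 90, 40), "size": 900, "hist": np.array([0.1, 0.9])},
]
proposals = hierarchical_grouping(regions, image_area=100 * 100)
print(proposals)
```

In the toy input the two similar-coloured, adjacent pieces get merged first, which is the behaviour we want for the pieces of the baseball player.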
The selective search described above is the region proposal method used in the RCNN paper. But what is RCNN and how does it use region proposals?
Along with this, the authors have also used a class-specific bounding box regressor that takes: Input: (Px,Py,Ph,Pw) - the location of the proposed region. Target: (Gx,Gy,Gh,Gw) - the ground-truth box for the region. The goal is to learn a transformation that maps the proposed region (P) to the ground-truth box (G).
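As far as I understand the RCNN paper, the regressor does not predict G directly but a scale-invariant shift of the box centre and a log-space scaling of its width and height. A minimal sketch of that parameterization follows; the function names `encode_box` and `decode_box` are mine, and boxes are assumed to be (centre x, centre y, width, height).

```python
import numpy as np

def encode_box(proposal, ground_truth):
    """Regression targets (tx, ty, tw, th) that map proposal P to box G."""
    px, py, pw, ph = proposal
    gx, gy, gw, gh = ground_truth
    return np.array([(gx - px) / pw,      # centre shift, relative to P's width
                     (gy - py) / ph,      # centre shift, relative to P's height
                     np.log(gw / pw),     # log-scale change in width
                     np.log(gh / ph)])    # log-scale change in height

def decode_box(proposal, t):
    """Apply predicted offsets t to proposal P to get the refined box."""
    px, py, pw, ph = proposal
    tx, ty, tw, th = t
    return np.array([pw * tx + px, ph * ty + py,
                     pw * np.exp(tw), ph * np.exp(th)])

P = (50.0, 60.0, 20.0, 40.0)
G = (55.0, 58.0, 30.0, 36.0)
targets = encode_box(P, G)
print(targets, decode_box(P, targets))
```

Decoding the encoded targets gives back G, so a regressor that predicts these targets well moves the proposal onto the ground-truth box.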
What is the input to an RCNN? We have an image, region proposals from the region proposal step (selective search) and the ground truths (labels, ground-truth boxes). Next, we treat all region proposals with ≥ 0.5 IoU (Intersection over Union) overlap with a ground-truth box as positive training examples for that box's class, and the rest as negatives. We then train class-specific SVMs.
So every region proposal becomes a training example, and the convnet gives a feature vector for that region proposal. We can then train our n SVMs using the class-specific data.
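The IoU-based labelling step can be sketched as follows. The helper names `iou` and `label_proposals` are mine, boxes use the (x, y, w, h) top-left convention from earlier in the post, and using label 0 for the negatives is my own convention for the sketch.

```python
def iou(box_a, box_b):
    """Intersection over Union of two (x, y, w, h) boxes."""
    ax, ay, aw, ah = box_a
    bx, by, bw, bh = box_b
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0

def label_proposals(proposals, gt_boxes, gt_labels, threshold=0.5):
    """Give each proposal the class of its best-overlapping ground-truth box,
    or 0 (negative / background) if that overlap is below the threshold."""
    labels = []
    for proposal in proposals:
        overlaps = [iou(proposal, gt) for gt in gt_boxes]
        best = max(range(len(gt_boxes)), key=lambda i: overlaps[i], default=None)
        if best is not None and overlaps[best] >= threshold:
            labels.append(gt_labels[best])
        else:
            labels.append(0)
    return labels

proposals = [(0, 0, 10, 10), (1, 0, 10, 10), (40, 40, 10, 10)]
print(label_proposals(proposals, gt_boxes=[(0, 0, 10, 10)], gt_labels=[3]))
```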
Test Time RCNN
At test time we predict detection boxes using the class-specific SVMs. We will get a lot of overlapping detection boxes at test time, which is why non-maximum suppression is an integral part of the object detection pipeline. First, it sorts all detection boxes by their scores. The detection box M with the maximum score is selected, and all other detection boxes that overlap significantly with M (above a pre-defined IoU threshold) are suppressed. This process is applied recursively to the remaining boxes.
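A minimal NumPy sketch of this greedy procedure is below. The function name `nms`, the (x, y, w, h) box format and the default threshold of 0.5 are assumptions for illustration; in practice it would be run separately for each class.

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression. Returns indices of the kept boxes."""
    boxes = np.asarray(boxes, dtype=float)
    scores = np.asarray(scores, dtype=float)
    x1, y1 = boxes[:, 0], boxes[:, 1]
    x2, y2 = x1 + boxes[:, 2], y1 + boxes[:, 3]
    areas = boxes[:, 2] * boxes[:, 3]

    order = np.argsort(-scores)          # highest score first
    keep = []
    while order.size > 0:
        best, rest = order[0], order[1:]
        keep.append(int(best))
        # IoU of the selected box M with every remaining box.
        inter_w = np.clip(np.minimum(x2[best], x2[rest]) - np.maximum(x1[best], x1[rest]), 0, None)
        inter_h = np.clip(np.minimum(y2[best], y2[rest]) - np.maximum(y1[best], y1[rest]), 0, None)
        inter = inter_w * inter_h
        overlap = inter / (areas[best] + areas[rest] - inter)
        # Drop everything that overlaps M too much, then repeat on the rest.
        order = rest[overlap <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 10, 10), (50, 50, 10, 10)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2]
```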
Problems with RCNN:
Training is slow. Inference (detection) is slow too: 47s / image with VGG16, since the convnet needs to be run once for every region proposal.
Need for speed. Hence Fast RCNN comes into the picture, from the first author of RCNN:
This idea depends a little upon the architecture of the model that gets used too. Do we take the 4096-dimensional bottleneck layer from VGG16? The architecture that the authors have proposed is:
This obviously is a little confusing and "hairy", let us break this down. But for that, we need to see the VGG16 architecture.
The last pooling layer is 7x7x512. This is the layer the network authors intend to replace with the ROI pooling layer. The ROI pooling layer takes as input the location of the region proposal (xmin_roi,ymin_roi,h_roi,w_roi) and the previous feature map (14x14x512).
Now, the ROI coordinates are in the units of the input image, i.e. 224x224 pixels, but the layer on which we have to apply the ROI pooling operation is 14x14x512. As we are using VGG, the image (224 x 224 x 3) is transformed into (14 x 14 x 512), so height and width are divided by 16. We can therefore map the ROI coordinates onto the feature map just by dividing them by 16.
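As a small illustration of this mapping, here is a sketch that divides an ROI by the stride of 16. The function name `roi_to_feature_map` is mine, and the rounding rule (floor for the top-left corner, ceiling for the bottom-right so the ROI never shrinks to nothing) is an assumption; implementations differ on how they round.

```python
import math

def roi_to_feature_map(xmin_roi, ymin_roi, h_roi, w_roi, stride=16):
    """Map an ROI from image pixels to feature-map cells."""
    x0 = xmin_roi // stride
    y0 = ymin_roi // stride
    x1 = math.ceil((xmin_roi + w_roi) / stride)
    y1 = math.ceil((ymin_roi + h_roi) / stride)
    # Same (xmin, ymin, h, w) order as the input.
    return x0, y0, y1 - y0, x1 - x0

print(roi_to_feature_map(32, 48, 96, 64))   # (2, 3, 6, 4)
```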
How is the ROI pooling done?
In the above image our region proposal is (0,3,5,7) and we divide that area into 4 regions since we want to have a ROI pooling layer of 2x2.
How do you do ROI pooling on areas smaller than the target size, for example when the region proposal is 5x5 on the feature map and the ROI pooling layer has size 7x7? In that case we resize to 35x35 just by copying each cell 7 times in each direction, and then max-pool back down to 7x7 using 5x5 windows.
After replacing the pooling layer, the authors also replaced the 1000-way ImageNet classification layer with a fully connected layer and softmax over K + 1 categories (+1 for background), plus category-specific bounding-box regressors.
What is the input to a Fast RCNN?
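The copy-then-pool trick can be written in a few lines of NumPy. This is only a sketch for a single channel of an already cropped ROI; the function name `roi_max_pool` is mine. When the ROI size is a multiple of the output size it reduces to ordinary max-pooling, and for other sizes neighbouring bins may share a row or column of cells.

```python
import numpy as np

def roi_max_pool(roi_features, out_size):
    """Max-pool an (h, w) ROI crop to (out_size, out_size).

    Each cell is first copied out_size times along both axes, giving an
    (h * out_size, w * out_size) grid that always splits evenly into
    out_size x out_size bins of h x w cells each.
    """
    h, w = roi_features.shape
    upsampled = np.repeat(np.repeat(roi_features, out_size, axis=0), out_size, axis=1)
    bins = upsampled.reshape(out_size, h, out_size, w)
    return bins.max(axis=(1, 3))

small_roi = np.arange(25).reshape(5, 5)      # 5x5 ROI, smaller than the 7x7 target
pooled = roi_max_pool(small_roi, out_size=7)
print(pooled.shape)  # (7, 7)
```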
Pretty much the same as before: we have an image, region proposals from the region proposal step (selective search) and the ground truths (labels, ground-truth boxes).
Next, we treat all region proposals with ≥ 0.5 IoU (Intersection over Union) overlap with a ground-truth box as positive training examples for that box's class, and the rest as negatives. This time we have a dense layer on top, and we use a multi-task loss.
So every ROI becomes a training example. The main difference is the concept of a multi-task loss:
A Fast R-CNN network has two sibling output layers. The first outputs a discrete probability distribution (per RoI), p = (p0, . . . , pK), over K + 1 categories. As usual, p is computed by a softmax over the K+1 outputs of a fully connected layer. The second sibling layer outputs bounding-box regression offsets, t= (tx , ty , tw, th), for each of the K object classes. Each training RoI is labeled with a ground-truth class u and a ground-truth bounding-box regression target v. We use a multi-task loss L on each labeled RoI to jointly train for classification and bounding-box regression
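For reference, the multi-task loss from the Fast R-CNN paper, written out in LaTeX as I understand it. Here λ balances the two terms and [u ≥ 1] is 1 when u ≥ 1 and 0 otherwise:

```latex
L(p, u, t^{u}, v) = L_{cls}(p, u) + \lambda \, [u \geq 1] \, L_{loc}(t^{u}, v)
```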
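Where Lcls is the softmax classification loss and Lloc is the regression loss. u=0 is the background class, so the regression term is added to the loss only when the RoI has a ground-truth bounding box for one of the other classes. Further, Lloc is a smooth L1 loss over the four box offsets, which the paper describes as less sensitive to outliers than an L2 loss.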
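Here is a minimal NumPy sketch of this loss for a single RoI, using the smooth L1 form from the Fast R-CNN paper for Lloc. The function names, the array shapes and the default `lam=1.0` are my own assumptions for illustration.

```python
import numpy as np

def smooth_l1(x):
    """Quadratic near zero, linear for |x| >= 1."""
    x = np.abs(x)
    return np.where(x < 1.0, 0.5 * x ** 2, x - 0.5)

def fast_rcnn_loss(class_scores, box_offsets, u, v, lam=1.0):
    """Multi-task loss for one RoI.

    class_scores: (K + 1,) raw scores, index 0 is background.
    box_offsets:  (K, 4) predicted (tx, ty, tw, th), one row per object class.
    u: ground-truth class (0 = background), v: ground-truth offsets (4,).
    """
    class_scores = np.asarray(class_scores, dtype=float)
    shifted = class_scores - class_scores.max()
    log_probs = shifted - np.log(np.exp(shifted).sum())
    cls_loss = -log_probs[u]

    if u == 0:
        # Background RoIs have no box, so only the classification term counts.
        return cls_loss

    t_u = np.asarray(box_offsets, dtype=float)[u - 1]
    loc_loss = smooth_l1(t_u - np.asarray(v, dtype=float)).sum()
    return cls_loss + lam * loc_loss

scores = [0.0, 0.0, 0.0]                      # K = 2 object classes + background
offsets = [[0.5, 2.0, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
print(fast_rcnn_loss(scores, offsets, u=1, v=[0.0, 0.0, 0.0, 0.0]))
```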
The next question that got asked was: can the network itself do region proposals?
How does the Region Proposal Network work?
One of the main ideas in the paper is the idea of anchors. Anchors are fixed bounding boxes of different sizes and aspect ratios that are placed throughout the image and used as references when first predicting object locations.
So first of all we define anchor centers on the image.
The anchor centers are separated by 16 px in the case of the VGG16 network, as the final convolution layer (14x14x512) subsamples the image by a factor of 16 (224/14). This is what the anchors look like:
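A small sketch of how such a grid of anchors could be generated is below. The specific scales, the three aspect ratios, the half-stride offset of the centres and the function name `generate_anchors` are illustrative assumptions, not values taken from this post.

```python
import numpy as np

def generate_anchors(image_size=224, stride=16,
                     scales=(64, 128, 256), ratios=(0.5, 1.0, 2.0)):
    """Return an (N, 4) array of anchors as (center_x, center_y, w, h)."""
    # One anchor centre per feature-map cell: 14 x 14 centres for 224 / 16.
    centers = np.arange(stride // 2, image_size, stride)
    anchors = []
    for cy in centers:
        for cx in centers:
            for scale in scales:
                for ratio in ratios:          # ratio = h / w
                    w = scale / np.sqrt(ratio)
                    h = scale * np.sqrt(ratio)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors, dtype=float)

anchors = generate_anchors()
print(anchors.shape)  # (1764, 4): 14 * 14 centres x 9 anchors each
```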
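- We start with some predefined regions where we think our objects could be: the anchors.
- Our RPN classifies which anchors contain an object and predicts the offset of the object's bounding box. The label is 1 if the IoU of the anchor with a ground-truth bounding box is > 0.5 and 0 otherwise (as far as I know, the paper itself uses stricter thresholds: positive above 0.7 and negative below 0.3).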
- Non-Maximum suppression to reduce region proposals
- Fast RCNN detection network on top of proposals
The whole network is then jointly trained with 4 losses:
- RPN classify object / not object
- RPN regress box coordinates offset
- Final classification score (object classes)
- Final box coordinates offset
Disclaimer: This is my own understanding of these papers with inputs from many blogs and slides on the internet. Let me know if you find something wrong with my understanding. I will be sure to correct myself and post.
- Transfer Learning
- CS231n Object Detection Lecture Slides
- Efficient Graph-Based Image Segmentation
- Rich feature hierarchies for accurate object detection and semantic segmentation (RCNN Paper)
- Selective Search for Object Recognition
- ROI Pooling Explanation
- Faster RCNN Blog
- Faster RCNN Blog
- Faster RCNN Blog
- Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks