A brief overview of R-CNN, Fast R-CNN and Faster R-CNN

Region Based CNN (R-CNN)

May 6, 2021 · 6 min read

The R-CNN architecture detects both the classes of the objects in an image and the bounding boxes of these objects. It was developed because a plain CNN classifier assigns a single label to the whole image, so it cannot classify each object separately when the image contains more than one.

R-CNN works in two steps. First, selective search determines the regions of the image where an object may be found. Then each region is given as input to a CNN model, and classes and bounding boxes are predicted for it.

Selective Search:
Selective search determines which regions of the image should be examined. Small areas are determined first. Then similar regions are combined to create larger new regions. This process is repeated, producing larger regions at each step, so that the parts belonging to each object in the image end up clustered together.
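The merging idea can be illustrated with a toy sketch. It is not the real selective search algorithm, which uses richer similarity measures (colour, texture, size, fill) and only merges neighbouring regions. Here each region is assumed to be described by a box, a mean intensity and a pixel count, and the function name toy_selective_search is hypothetical.

```python
def merge_boxes(a, b):
    """Smallest box (x1, y1, x2, y2) that contains both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def toy_selective_search(regions):
    """regions: list of (box, mean_intensity, pixel_count) for the small areas.

    Repeatedly merges the two regions with the closest mean intensity and
    returns every box seen along the way as a region proposal.
    """
    regions = list(regions)
    proposals = [box for box, _, _ in regions]
    while len(regions) > 1:
        # Find the most similar pair (assumption: similarity = intensity gap).
        best = None
        for i in range(len(regions)):
            for j in range(i + 1, len(regions)):
                diff = abs(regions[i][1] - regions[j][1])
                if best is None or diff < best[0]:
                    best = (diff, i, j)
        _, i, j = best
        box_i, mean_i, size_i = regions[i]
        box_j, mean_j, size_j = regions[j]
        size = size_i + size_j
        merged = (
            merge_boxes(box_i, box_j),
            (mean_i * size_i + mean_j * size_j) / size,  # weighted mean
            size,
        )
        regions = [r for k, r in enumerate(regions) if k not in (i, j)]
        regions.append(merged)
        proposals.append(merged[0])
    return proposals


# Two dark neighbouring areas and one bright area (made-up values).
small_areas = [
    ((0, 0, 5, 5), 10, 25),
    ((5, 0, 10, 5), 12, 25),
    ((0, 5, 10, 10), 200, 50),
]
proposals = toy_selective_search(small_areas)
print(proposals)
```

The two dark areas are merged first, and the final merge covers the whole image, so the proposals contain boxes at several scales.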

In R-CNN, we use selective search to identify region candidates. Selective search yields approximately 2000 regions per image, and each region candidate is passed through the CNN separately, so the CNN runs about 2000 times for a single image. The classes of the objects are determined by SVMs that use the features from the CNN, and the bounding boxes of the objects are determined by regression.

The Intersection over Union (IoU) score measures how accurate a predicted bounding box is. It is obtained by comparing the predicted bounding box with the real (ground-truth) bounding box.

→ IoU = area of the intersection of the predicted and real bounding boxes / area of the union of the predicted and real bounding boxes
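The ratio above can be computed in a few lines. This is a minimal sketch that assumes boxes are given as (x1, y1, x2, y2) corner coordinates; the function name iou and the example boxes are illustrative.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    # Corners of the overlapping rectangle (if there is one).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    intersection = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 10, 10)
real = (5, 0, 15, 10)
print(iou(predicted, real))  # intersection 50, union 150, so about 0.33
```

Identical boxes give a score of 1, and boxes that do not overlap give 0.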

Non-Max Suppression:
Not all regions obtained from the image are used. The Non-Max Suppression technique is used to keep the correct boxes. The bounding boxes are ranked by their confidence scores, and the box with the highest score is kept. Every other box whose Intersection over Union (IoU) with the kept box is greater than a threshold, 0.5 here, is assumed to cover the same object and is suppressed. The process then repeats with the boxes that remain, so each object ends up with a single bounding box.
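A minimal sketch of the common greedy form of non-max suppression is given here. It assumes each box comes with a confidence score and uses the 0.5 IoU threshold mentioned above; the function names and example values are illustrative.

```python
def iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return the indices of the boxes that survive suppression."""
    # Visit boxes from the most to the least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Drop every remaining box that overlaps the kept box too much.
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]: the second box overlaps the first and is suppressed
```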

Fast R-CNN

The cost of R-CNN is quite high because nearly 2000 candidate regions are extracted for each image and the CNN is run separately on every one of them. These steps cause both a high computational cost and a long training time.

For this reason, the Fast R-CNN architecture drops the per-region CNN passes of R-CNN and runs a single CNN once over the whole image. Unlike R-CNN, where the CNN, the SVM and the regressor are separate stages, Fast R-CNN brings feature extraction, classification and bounding box regression together in one network, with a softmax layer taking the place of the SVM. This combined architecture performed very well.

The whole image is processed by the CNN once and a feature map is obtained. Each region proposal is then projected onto this feature map to collect the features for that region (the region proposal feature map). Max pooling is applied to each of these regional feature maps to reduce them to the same fixed size. The layer that performs this max pooling is called the RoI (Region of Interest) pooling layer. The pooled feature maps are flattened into a one-dimensional vector and given as input to fully connected layers. A softmax layer determines the class of the object in the region, while a bounding box regressor determines its bounding box.
Fast R-CNN works about 10 times faster than R-CNN.
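To make the RoI pooling step concrete, here is a small sketch in plain Python. It assumes a single-channel feature map stored as a list of rows and a region given in feature-map coordinates as (x1, y1, x2, y2) with exclusive end indices. The function name roi_max_pool and the 2x2 output size are chosen for illustration, and real implementations differ in how they round the bin boundaries.

```python
def roi_max_pool(feature_map, roi, out_h=2, out_w=2):
    """Max-pool the region roi of feature_map down to out_h x out_w values."""
    x1, y1, x2, y2 = roi
    roi_h, roi_w = y2 - y1, x2 - x1
    pooled = []
    for i in range(out_h):
        # Row range of bin i: floor for the start, ceiling for the end,
        # so every bin contains at least one cell.
        r0 = y1 + (i * roi_h) // out_h
        r1 = y1 + -(-((i + 1) * roi_h) // out_h)
        row = []
        for j in range(out_w):
            c0 = x1 + (j * roi_w) // out_w
            c1 = x1 + -(-((j + 1) * roi_w) // out_w)
            row.append(max(feature_map[r][c]
                           for r in range(r0, r1)
                           for c in range(c0, c1)))
        pooled.append(row)
    return pooled


feature_map = [
    [1, 2, 3, 4, 5, 6],
    [7, 8, 9, 10, 11, 12],
    [13, 14, 15, 16, 17, 18],
    [19, 20, 21, 22, 23, 24],
]
wide_region = roi_max_pool(feature_map, (1, 0, 6, 3))   # 5 wide, 3 high
small_region = roi_max_pool(feature_map, (0, 0, 2, 2))  # 2 wide, 2 high
print(wide_region)   # [[10, 12], [16, 18]]
print(small_region)  # [[1, 2], [7, 8]]
```

Both regions have different sizes, yet both come out as 2x2 grids, which is what allows the following fully connected layers to accept them.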

Faster R-CNN

Because the selective search used in R-CNN and Fast R-CNN is computationally costly, Faster R-CNN replaces it with a Region Proposal Network (RPN). The efficiency of the RPN has been demonstrated, for example in the original Faster R-CNN paper (resource 1 below).

An image of any size is taken as input and given to a CNN model. If you use a model such as VGG16 or AlexNet, rebuild it without its fully connected layers, because the RPN takes a feature map as input and feature maps are produced by the convolutional layers.

An important concept for understanding RPN is the anchor box. Anchors are reference boxes with different scales and aspect ratios. While a small network slides over the feature map, it searches for an object at each position with respect to each anchor.
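As an illustration, the anchors for one sliding-window position can be generated as follows. The scales (128, 256, 512) and aspect ratios (1:2, 1:1, 2:1) are the defaults reported in the Faster R-CNN paper and are an assumption here, since this article does not list them; the function name make_anchors is hypothetical.

```python
import math


def make_anchors(cx, cy, scales=(128, 256, 512), ratios=(0.5, 1.0, 2.0)):
    """Return anchors (x1, y1, x2, y2) centred on the point (cx, cy).

    Each anchor has an area of about scale * scale and a
    height / width ratio taken from ratios.
    """
    anchors = []
    for scale in scales:
        for ratio in ratios:
            width = scale / math.sqrt(ratio)
            height = scale * math.sqrt(ratio)
            anchors.append((cx - width / 2, cy - height / 2,
                            cx + width / 2, cy + height / 2))
    return anchors


anchors = make_anchors(300, 300)
print(len(anchors))  # 9 anchors: 3 scales x 3 aspect ratios
print(anchors[1])    # the square 128 x 128 anchor
```

Repeating this for every position of the feature map gives the full set of anchors that the RPN scores.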

The first convolutional layer in RPN applies a 3x3 filter by default, with 512 output channels. This conv layer takes the feature map as input. Its output is given as input to two different convolutional layers, both with a filter size of 1x1.

The first of these is cls, i.e. the classification layer. The cls layer tells us whether or not there is an object at the location of the sliding window. This is a binary classification: 1 if there is an object, 0 if there is not. The number of output channels of the cls layer is 2 * k, where k is the number of anchors per location. By default k is 9, which comes from 3 scales x 3 aspect ratios and is unrelated to the 3x3 filter size of the first layer, so the cls layer has 2 * 9 output channels. The paper applies a two-class softmax to these outputs; some implementations instead use the sigmoid activation function with this conv layer, which needs only k output channels.

The reg layer works in parallel with the cls layer. It is responsible for predicting the bounding boxes of the objects detected by cls. Since each box is described by 4 coordinate values, the number of output channels of this layer is 4 * 9 (4 * k). A linear activation function is preferred with the conv layer for reg when coding the model.

In summary, after these filtering steps the output of RPN is the set of anchor boxes marked as containing an object. Boxes with a value of 0 in the cls layer are treated as background.
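The channel arithmetic above can be checked with a short sketch. It assumes the 3x3 convolution is padded so that the spatial size of the feature map is preserved, and the 38 x 50 feature map size is only an example; rpn_output_shapes is a hypothetical helper, not part of any library.

```python
def rpn_output_shapes(feat_h, feat_w, k=9, mid_channels=512):
    """Shapes (height, width, channels) of the RPN layers for k anchors."""
    return {
        "shared_3x3_conv": (feat_h, feat_w, mid_channels),
        "cls": (feat_h, feat_w, 2 * k),  # object / not object per anchor
        "reg": (feat_h, feat_w, 4 * k),  # 4 coordinate values per anchor
        "num_anchors": feat_h * feat_w * k,
    }


shapes = rpn_output_shapes(38, 50)
print(shapes["cls"])          # (38, 50, 18)
print(shapes["reg"])          # (38, 50, 36)
print(shapes["num_anchors"])  # 17100
```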

The RoI layer that comes after RPN receives the outputs of RPN’s cls and reg layers as input, along with the feature map that RPN takes as input. The RoI layer is responsible for bringing the feature map of every region to the same size before the fully connected layers at the end of Faster R-CNN. Since the region proposals from RPN have different sizes (because anchor boxes have different scales and aspect ratios), feature maps of a fixed size have to be produced. The RoI layer uses max pooling to a 7x7 output for this. The number of output channels is 512.
After the RoI layer, the values are flattened and given as input to the fully connected layers for classification. Alongside the classification, a regressor produces the bounding box of the object whose class is predicted.
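As a rough sketch of this last step, the snippet below computes the length of the flattened RoI output and turns a set of class scores into probabilities with softmax. It assumes the classifier ends in a softmax layer, as in Fast R-CNN, and the three class scores are made-up values standing in for the output of the fully connected layers.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    highest = max(scores)  # subtracted for numerical stability
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# 7x7 RoI output with 512 channels, flattened for the fully connected layers.
flattened_length = 7 * 7 * 512
print(flattened_length)  # 25088

# Hypothetical scores for three classes (e.g. background, cat, dog).
class_scores = [2.0, 0.5, 0.1]
probabilities = softmax(class_scores)
predicted_class = probabilities.index(max(probabilities))
print(predicted_class)  # 0, the class with the highest score
```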

I hope I was able to convey the basics of R-CNN, Fast R-CNN and Faster R-CNN correctly. Thank you for reading!

Resources:

1- https://arxiv.org/abs/1506.01497

2- https://www.youtube.com/watch?v=iHf2xHQ2VYo

3- https://towardsdatascience.com/r-cnn-fast-r-cnn-faster-r-cnn-yolo-object-detection-algorithms-36d53571365e

4- https://www.hackerearth.com/blog/developers/object-detection-for-self-driving-cars/

Written by Sema Zeynep Bulut