This article explains the architecture of the RetinaNet neural network. I wrote this review in the course of my thesis work, and since it required relying exclusively on English-language sources and assembling the material found into one piece, I decided that the result might help someone reduce the time spent searching for the necessary information and simplify the understanding of how neural networks for object detection tasks are constructed.


The architecture of the RetinaNet convolutional neural network (CNN) consists of 4 main parts, each with its own purpose:

a) Backbone - the main (basic) network, which is used to extract features from the input image. This part of the network is variable and it may include classification neural networks, such as ResNet, VGG, EfficientNet, and others;

b) Feature Pyramid Net (FPN) - a convolutional network built in the form of a pyramid, which combines the advantages of feature maps from the lower and upper levels of the network: the former have high resolution but weak semantic, generalizing power; the latter, vice versa;

c) Classification Subnet - a subnet that extracts information about classes of objects from FPN, solving the classification problem;

d) Regression Subnet - a subnet that extracts from FPN information about the coordinates of objects in the image, solving the regression problem.
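The four parts above can be sketched as a single forward pass. This is a hypothetical, simplified composition: all names below are illustrative, and each callable stands in for a full network.

```python
# A minimal sketch of how RetinaNet's four parts compose into one forward pass.
# All names here are illustrative; each callable stands in for a full network.
def retinanet_forward(image, backbone, fpn, cls_subnet, reg_subnet):
    """Run one image through the four stages and collect per-level outputs."""
    features = backbone(image)          # a) extract multi-scale features
    pyramid = fpn(features)             # b) fuse them into a feature pyramid
    class_scores = [cls_subnet(level) for level in pyramid]  # c) classification
    box_offsets = [reg_subnet(level) for level in pyramid]   # d) regression
    return class_scores, box_offsets
```

Both subnets run on every level of the pyramid, which is why the outputs are collected per level.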

Figure 1 depicts a RetinaNet architecture with a ResNet neural network as the backbone.

Figure 1 - RetinaNet architecture with ResNet backbone network

Let us examine in detail each of the parts of RetinaNet presented in Fig. 1.

Backbone part of RetinaNet

Since the part of the RetinaNet architecture that receives the image and extracts important features is variable, and the information extracted here is processed at the subsequent stages, choosing the right backbone network is important for the best results.

Recent research on CNN optimization has produced classification models that surpass all previously developed architectures in accuracy on the ImageNet dataset while being up to 10 times more efficient. This family of networks is called EfficientNet-B (0-7). The indicators of the new network family are presented in Fig. 2.

Figure 2 - Plot of top accuracy against the number of network weights for various architectures

Feature Pyramid

Feature Pyramid Network consists of three main parts: bottom-up pathway, top-down pathway and lateral connections.
The bottom-up pathway is a kind of hierarchical “pyramid” - a sequence of convolutional layers of decreasing dimension; in our case, the backbone network. The upper layers of the convolutional network carry more semantic meaning but have lower resolution, and the lower layers the opposite (Fig. 3). The bottom-up pathway has a weak spot in feature extraction: important information about an object can be lost, for example a small but significant object drowned out by background noise, since by the end of the network the information is highly compressed and generalized.

Figure 3 - Properties of feature maps at different levels of the neural network

The top-down pathway is also a “pyramid”. The feature maps of its top layer have the same size as the feature maps of the top layer of the bottom-up pyramid, and moving downward each map is doubled in resolution by the nearest neighbor method (Fig. 4).

Figure 4 - Increasing the resolution of the image using the nearest neighbor method
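Nearest-neighbor upsampling simply repeats every value of the feature map along both spatial axes. A minimal NumPy sketch (the function name is mine, for illustration):

```python
import numpy as np

def upsample_nearest(fmap, factor=2):
    """Upsample a 2D feature map by repeating each element `factor` times
    along both spatial axes (nearest neighbor interpolation)."""
    return fmap.repeat(factor, axis=0).repeat(factor, axis=1)
```

Each value becomes a `factor` x `factor` block, so a 2x2 map turns into a 4x4 map without computing any new values.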

Thus, in the top-down network, each feature map of a higher layer is enlarged to the size of the map below it. In addition, FPN has lateral connections: the feature maps of the corresponding layers of the bottom-up and top-down pyramids are added element-wise, with the bottom-up maps first passing through a 1×1 convolution. This process is shown schematically in Fig. 5.

Figure 5 - Design of the feature pyramid

Lateral connections solve the problem of important signals fading as they pass through the layers, combining the semantically rich information obtained at the end of the first pyramid with the more detailed information obtained in it earlier.
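The merge at one pyramid level can be sketched with NumPy. This is a simplified illustration under assumed channel-last (H, W, C) shapes; a 1×1 convolution then reduces to a matrix multiplication over the channel axis:

```python
import numpy as np

def merge_level(top_down, bottom_up, lateral_w):
    """Fuse one FPN level: upsample the coarser top-down map 2x by nearest
    neighbor, project the bottom-up map with a 1x1 convolution (a per-pixel
    channel mix), and add element-wise."""
    upsampled = top_down.repeat(2, axis=0).repeat(2, axis=1)  # (2H, 2W, C_out)
    lateral = bottom_up @ lateral_w                           # (2H, 2W, C_out)
    return upsampled + lateral
```

The 1×1 projection brings the bottom-up map to the same channel count as the top-down map, which is what makes the element-wise addition possible.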

Next, each of the resulting layers of the top-down pyramid is processed by two subnets.

Classification and Regression Subnets

The third part of the RetinaNet architecture is a pair of subnets: classification and regression (Fig. 6). Each of these subnets outputs an answer about the object's class and its location in the image. Let us consider how each of them works.

Figure 6 - RetinaNet Subnets

The two subnets operate identically up to their last layer. Each consists of 4 convolutional layers, each forming 256 feature maps. On the fifth layer, the number of feature maps changes: the regression subnet has 4 * A feature maps, the classification subnet K * A, where A is the number of anchor boxes (described in detail in the next subsection) and K is the number of object classes.

In the last, sixth, layer, each feature map is converted into a set of vectors. For each anchor box, the regression model outputs a vector of 4 values indicating the offset of the ground-truth box relative to the anchor. For each anchor box, the classification model outputs a one-hot vector of length K, in which the index holding the value 1 corresponds to the class number that the neural network assigned to the object.
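This last-layer reshaping can be illustrated as follows. The sketch assumes a channel-last (H, W, A·V) layout, where V is the number of values per anchor: V = 4 for the regression head and V = K for the classification head.

```python
import numpy as np

def head_to_vectors(fmap, num_anchors, values_per_anchor):
    """Reshape a head output map of shape (H, W, A * V) into a flat set of
    per-anchor vectors of shape (H * W * A, V)."""
    h, w, c = fmap.shape
    assert c == num_anchors * values_per_anchor
    return fmap.reshape(h * w * num_anchors, values_per_anchor)
```

For example, a 3×3 map with A = 9 anchors yields 81 per-anchor vectors per level.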

Anchor boxes

The previous section used the term anchor boxes. An anchor box is a predefined rectangle of fixed size and aspect ratio, tied to a position on a feature map, relative to which the network predicts object locations.

Suppose a network has a 3 * 3 feature map at the output. In RetinaNet, each cell has 9 anchor boxes, each with a different size and aspect ratio (Fig. 7). During training, each anchor box is matched against the ground-truth boxes. If their IoU is 0.5 or higher, the anchor is assigned to that target; if it is below 0.4, the anchor is treated as background; in the remaining cases the anchor box is ignored during training. The classification network is trained on the resulting assignment (object class or background); the regression network is trained on the coordinates relative to the anchor box (it is important to note that the error is calculated relative to the anchor, not the ground-truth box).

Figure 7 - Anchor boxes for one cell of a 3 * 3 feature map
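The matching rule above can be sketched in plain Python. The 0.5/0.4 thresholds follow the description in the text; the (x1, y1, x2, y2) box format is my assumption for illustration.

```python
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))  # overlap width
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))  # overlap height
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def assign_anchor(best_iou):
    """RetinaNet matching: >= 0.5 foreground, < 0.4 background, else ignored."""
    if best_iou >= 0.5:
        return "foreground"
    if best_iou < 0.4:
        return "background"
    return "ignore"
```

In practice `best_iou` is the maximum IoU of an anchor over all ground-truth boxes in the image.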

Loss Functions

The RetinaNet loss is composite: it consists of two terms, the regression or localization error (hereinafter Lloc) and the classification error (hereinafter Lcls). The overall loss function can be written as:

$L = \lambda L_{loc} + L_{cls}$

where λ is a hyperparameter that controls the balance between the two losses.

Let us consider in more detail the calculation of each of the losses.
As described earlier, each anchor box is assigned a ground-truth box. Denote these pairs as $(A_i, G_i),\ i = 1, \ldots, N$, where A represents the anchor, G the ground-truth box, and N the number of matched pairs.

For each anchor, the regression network predicts 4 numbers, denoted $P_i = (P_{ix}, P_{iy}, P_{iw}, P_{ih})$. The first two values indicate the predicted difference between the coordinates of the centers of the anchor $A_i$ and the ground-truth box $G_i$; the last two, the predicted difference between their widths and heights. Correspondingly, for each ground-truth box a target $T_i$ is computed as the difference between the ground-truth and anchor box, and the localization loss is:

$L_{loc} = \sum_{j \in \{x, y, w, h\}} \mathrm{smooth}_{L1}\left(P_{ij} - T_{ij}\right)$

where $\mathrm{smooth}_{L1}(x)$ is defined by the formula below:

$\mathrm{smooth}_{L1}(x) = \begin{cases} 0.5x^2, & |x| < 1 \\ |x| - 0.5, & \text{otherwise} \end{cases}$
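A direct Python transcription of the localization loss (a sketch; `preds` and `targets` are the 4-tuples of offsets for one matched pair):

```python
def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear in the tails, so large
    regression errors are penalized less harshly than by squared error."""
    ax = abs(x)
    return 0.5 * ax * ax if ax < 1 else ax - 0.5

def localization_loss(preds, targets):
    """L_loc for one matched pair: the sum of smooth L1 over the four
    offsets (x, y, w, h), i.e. over P_ij - T_ij."""
    return sum(smooth_l1(p - t) for p, t in zip(preds, targets))
```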

RetinaNet classification task losses are calculated using the Focal loss function.

$L_{cls} = -\sum_{i=1}^{K} \alpha_i y_i \log(p_i)(1 - p_i)^{\gamma}$

where K is the number of classes, yi is the target value of the class, pi is the predicted probability of the i-th class, γ is the focusing parameter, and αi is a weighting coefficient. This function is an extension of the cross-entropy function. The difference lies in the added parameter γ ∈ (0, +∞), which addresses the problem of class imbalance. During training, most of the objects processed by the classifier are background, which is a separate class; a problem can therefore arise where the neural network learns to recognize background better than other objects. The new parameter solves this by reducing the loss value for easily classified objects. Graphs of the focal and cross-entropy functions are shown in Fig. 8.

Figure 8 - Graphs of the focal and cross-entropy loss functions
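The focal loss formula above can be transcribed directly into Python (a sketch; `probs`, `targets`, and `alphas` are length-K sequences):

```python
import math

def focal_loss(probs, targets, alphas, gamma=2.0):
    """Focal loss over K classes:
    -sum_i alpha_i * y_i * log(p_i) * (1 - p_i)**gamma.
    With gamma = 0 this reduces to weighted cross-entropy; larger gamma
    shrinks the loss of well-classified (high p_i) examples."""
    return -sum(a * y * math.log(p) * (1 - p) ** gamma
                for a, y, p in zip(alphas, targets, probs))
```

The `(1 - p_i)**gamma` factor is what suppresses the contribution of easy background examples, which is exactly the imbalance fix described above.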

Thanks for reading this article!
