PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection

A paper summary

A summary of the paper
PVANET: Deep but Lightweight Neural Networks for Real-time Object Detection
by Kye-Hyeon Kim, Sanghoon Hong, Byungseok Roh, Yeongjae Cheon, and Minje Park


The paper presents a lightweight feature extraction network architecture for object detection, named PVANET, which achieves real-time object detection performance without losing accuracy. The reported figures are:

  1. Computational Cost: 7.9 GMAC for feature extraction with 1065x640 input
  2. Runtime Performance: 750 ms/image (1.3 FPS) on Intel i7 and 46 ms/image (21.7 FPS) on NVIDIA Titan X GPU
  3. Accuracy: 83.8% mAP on VOC-2007; 82.5% mAP on VOC-2012

The key design principle is “less channels with more layers”.

Additionally, the network adopts several other building blocks:

  1. Concatenated Rectified Linear Unit (C.ReLU) is applied to the early stage of the CNN to halve the number of computations without losing accuracy.
  2. Inception is applied to the remainder of the feature generation sub-network.
  3. A multi-scale representation combines several intermediate outputs so that multiple levels of detail and non-linearity can be considered simultaneously.


Fig 2. Model Architecture

Concatenated Rectified Linear Unit

Fig 3. Concatenated Rectified Linear Unit (C.ReLU)

C.ReLU is motivated by the observation that in the early stage, output nodes tend to be paired such that one node's activation is the opposite of another's. C.ReLU therefore computes only half of the output channels with convolution and restores the full count by concatenating those outputs with their negation, which gives a 2x speed-up of the early stage.
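
The concatenation step can be sketched in a few lines of plain Python. This is a minimal illustration, not the authors' implementation: the function name `crelu` and the sample values are made up for the example, and any learned scale or shift applied after the concatenation is left out.

```python
def crelu(pre_activations):
    """Concatenate the inputs with their negation, then apply ReLU.

    `pre_activations` stands for the outputs of one convolution at a single
    spatial position, one value per channel (a hypothetical layout chosen
    only to keep the sketch small).
    """
    doubled = list(pre_activations) + [-v for v in pre_activations]
    return [max(0.0, v) for v in doubled]


conv_out = [1.5, -2.0, 0.0]      # 3 channels computed by the convolution
activated = crelu(conv_out)      # 6 channels after C.ReLU
print(len(conv_out), "->", len(activated), activated)
```

The convolution only has to produce half of the channels that the next layer receives, which is where the saving in the early stage comes from.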


Fig 4. The Inception Module

Inception can be one of the most cost-effective building blocks for capturing both small and large objects. The authors replace the 5×5 convolution in a common Inception block with two 3×3 convolutions.
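
The effect of swapping one 5×5 convolution for two 3×3 convolutions can be checked with a quick count of multiply-accumulate operations (MACs). The sketch below assumes stride 1, the same number of input and output channels in every convolution, and an arbitrary 40×40 feature map with 64 channels; none of these numbers come from the paper.

```python
def conv_macs(kernel, c_in, c_out, out_h, out_w):
    """MACs of one kernel x kernel convolution producing an out_h x out_w map."""
    return kernel * kernel * c_in * c_out * out_h * out_w


def receptive_field(kernels):
    """Receptive field of a stack of stride-1 convolutions."""
    field = 1
    for k in kernels:
        field += k - 1
    return field


# Hypothetical sizes, chosen only for illustration.
channels, height, width = 64, 40, 40

one_5x5 = conv_macs(5, channels, channels, height, width)
two_3x3 = 2 * conv_macs(3, channels, channels, height, width)

print("receptive field:", receptive_field([5]), "vs", receptive_field([3, 3]))
print("MAC ratio (two 3x3 / one 5x5):", two_3x3 / one_5x5)
```

Under these assumptions the two stacked 3×3 convolutions see the same 5×5 window while needing 18 multiplications per channel pair instead of 25, and they add one more non-linearity in between.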


Multi-scale representation and its combination have proven effective in many deep learning tasks. Combining fine-grained details with highly abstracted information in the feature extraction layer helps the subsequent region proposal network and classification network detect objects at different scales.

The authors combine:
1) The last layer
2) Two intermediate layers whose scales are 2x and 4x that of the last layer, respectively.
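
One way to picture this combination is to bring all three outputs to a common resolution and stack their channels. The sketch below assumes the middle (2x) scale is used as that common resolution, which the summary above does not state, and it uses 2×2 max pooling and nearest-neighbor upsampling as simple stand-ins for whatever down-scaling and up-scaling the network actually uses. All function names are hypothetical.

```python
def max_pool_2x(fmap):
    """2x2 max pooling with stride 2 on a 2D list with even height and width."""
    return [
        [max(fmap[r][c], fmap[r][c + 1], fmap[r + 1][c], fmap[r + 1][c + 1])
         for c in range(0, len(fmap[0]), 2)]
        for r in range(0, len(fmap), 2)
    ]


def upsample_2x(fmap):
    """Nearest-neighbor upsampling: every value becomes a 2x2 block."""
    out = []
    for row in fmap:
        wide = [v for v in row for _ in range(2)]
        out.append(wide)
        out.append(list(wide))
    return out


def combine_scales(fine, middle, coarse):
    """Concatenate channels after resizing everything to the middle scale.

    Each argument is a list of channels; each channel is a 2D list.
    `fine` is the 4x layer, `middle` the 2x layer, `coarse` the last layer.
    """
    return ([max_pool_2x(ch) for ch in fine]
            + [ch for ch in middle]
            + [upsample_2x(ch) for ch in coarse])


fine = [[[r * 4 + c for c in range(4)] for r in range(4)]]   # 1 channel, 4x4
middle = [[[1, 2], [3, 4]]]                                  # 1 channel, 2x2
coarse = [[[9]]]                                             # 1 channel, 1x1

combined = combine_scales(fine, middle, coarse)
print(len(combined), "channels of size",
      len(combined[0]), "x", len(combined[0][0]))
```

The fine layer contributes detail, the coarse layer contributes abstract context, and downstream layers see both at every spatial position.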

Deep Network Training

The authors adopt residual structures for better training. They add residual connections to the Inception layers as well, to stabilize the later part of the deep network.

Batch normalization layers are added before all ReLU activation layers.

The learning rate policy is based on plateau detection: a plateau is detected from the moving average of the loss, and if the improvement of that average stays below a certain threshold, the learning rate is decreased by a certain factor.
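
A plateau-based schedule of this kind could look roughly like the sketch below. It is a simplification under stated assumptions: instead of a true moving average it compares the mean loss of consecutive, non-overlapping windows, and the class name `PlateauSchedule`, the window size, the threshold, and the factor are all hypothetical values picked for the example.

```python
class PlateauSchedule:
    """Decay the learning rate when the averaged loss stops improving."""

    def __init__(self, lr, factor, window, min_improvement):
        self.lr = lr
        self.factor = factor                  # multiplier applied on a plateau
        self.window = window                  # number of losses per average
        self.min_improvement = min_improvement
        self._losses = []
        self._prev_avg = None

    def step(self, loss):
        """Record one training loss and return the learning rate to use next."""
        self._losses.append(loss)
        if len(self._losses) == self.window:
            avg = sum(self._losses) / self.window
            self._losses = []
            improved_by = None if self._prev_avg is None else self._prev_avg - avg
            if improved_by is not None and improved_by < self.min_improvement:
                self.lr *= self.factor        # plateau detected
            self._prev_avg = avg
        return self.lr


schedule = PlateauSchedule(lr=0.1, factor=0.5, window=2, min_improvement=0.01)
history = [schedule.step(loss) for loss in [1.0, 0.8, 0.6, 0.4, 0.5, 0.5]]
print(history)
```

In this toy run the average drops from 0.9 to 0.5 between the first two windows, so the rate is kept; the third window shows no improvement, so the rate is halved.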

Faster R-CNN with PVANET

Three intermediate outputs from conv3_4, conv4_4, and conv5_4 are combined into 512-channel multi-scale output features, which are fed into the Faster R-CNN modules.


  1. PVANET was pretrained with ILSVRC2012 training images for 1000-class image classification.
  2. All images were resized to 256×256, and 192×192 patches were randomly cropped and used as the network input.
  3. The learning rate was initially set to 0.1 and then decreased by a factor of 1/sqrt(10) ~ 0.3162 whenever a plateau was detected.
  4. Pre-training terminated when the learning rate dropped below 1e-4 (which usually required about 2M iterations).
  5. PVANET was then trained with the union set of MS-COCO trainval, VOC2007 trainval, and VOC2012 trainval. Fine-tuning with VOC2007 trainval and VOC2012 trainval was also required afterwards, since the class definitions of MS-COCO and VOC are slightly different.
  6. Training images were resized randomly such that the shorter edge of each image was between 416 and 864.
  7. For PASCAL VOC evaluations, each input image was resized such that its shorter edge was 640.
  8. All parameters related to Faster R-CNN were set as in the original work, except for the number of proposal boxes before non-maximum suppression (NMS) (=12000) and the NMS threshold (=0.4).
  9. All evaluations were done on an Intel i7 with a single core and an NVIDIA Titan X GPU.
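
Items 6 to 8 above can be illustrated with a small sketch. The numbers 416, 864, 640, 12000, and 0.4 come from the list; everything else is an assumption made for the example: the helper names, uniform integer sampling of the training scale, boxes given as (x1, y1, x2, y2) tuples, and a plain greedy NMS standing in for the Faster R-CNN implementation.

```python
import random


def resize_to_shorter_edge(width, height, target):
    """New (width, height) so that the shorter edge equals `target`."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)


def random_training_size(width, height, rng, low=416, high=864):
    """Pick a random shorter-edge length in [low, high] (assumed uniform)."""
    return resize_to_shorter_edge(width, height, rng.randint(low, high))


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def nms(boxes, scores, iou_threshold=0.4, pre_nms_top_n=12000):
    """Greedy NMS: keep the best-scoring boxes, drop heavy overlaps."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    order = order[:pre_nms_top_n]             # proposals considered before NMS
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


# Evaluation-time resize of a hypothetical 500x375 image.
print(resize_to_shorter_edge(500, 375, 640))

# Training-time resize with a seeded generator for repeatability.
print(random_training_size(500, 375, random.Random(0)))

# Two heavily overlapping proposals and one separate proposal.
boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))
```

The first two boxes overlap with an IoU of about 0.68, above the 0.4 threshold, so only the higher-scoring one survives along with the separate third box.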

Fig 5. Performance with VOC2007

Fig 6. Performance with VOC2012

PVANET+ achieved 2nd place in the PASCAL VOC 2012 Challenge. First place went to Faster R-CNN + ResNet-101, which is much heavier than PVANET.
