Solving Small Data Object Detection by Metric Learning using Attention-RPN

Shu-Yu Huang
6 min read · Apr 20, 2022

--

Attention! There’s a shark!

Prerequisite: deep learning CV, object detection, small data problem, few shot learning, meta learning, metric learning

A lack of data is always a pain in the neck for deep learning scientists. As we know, metric-based meta-learning can alleviate the performance drop that comes with training on less data.
Metric learning transforms the classification problem from "choose the most likely answer" to "which answer is the closest in distance/similarity?"
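As a minimal sketch of this idea (all shapes and values here are made up for illustration), classification by similarity can look like this in PyTorch:

```python
import torch
import torch.nn.functional as F

# Hypothetical embeddings: one "prototype" vector per class and one query.
prototypes = torch.randn(5, 128)   # 5 classes, 128-dim embeddings
query = torch.randn(128)

# Classify by picking the class whose prototype has the highest cosine
# similarity (smallest angle) to the query embedding, instead of passing
# the query through a fixed softmax classifier.
scores = F.cosine_similarity(prototypes, query.unsqueeze(0), dim=1)
predicted_class = scores.argmax().item()
```

The angles θ_1 and θ_2 in Fig. 1 correspond to these cosine similarities: a smaller angle means a higher score and a closer relation.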
It works for classification problems, but what about object detection?

Fig. 1 Metric learning trains a cosine-similarity space between classes; θ_1 represents a closer relation than θ_2

Assume an NGO is trying to count the endangered marine creatures in a restricted area of the Atlantic Ocean. They only have a handful of pictures of each animal (support pictures), and the job is to predict the location and type of each creature the next time it appears on the screen.

Fig. 2 Detect objects with only a few pictures in the training set

It is straightforward to combine an object detection model with a small classifier network to adapt to this small-data task. With metric comparison in the last layer, the small classifier can be fine-tuned well on just a few cases of each class. That is how metric learning works. However, even with this clumsy two-model structure, the object detector may still underperform in locating objects.

Fig. 3 Naive way to solve small data object detection
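The naive two-model approach could be sketched as follows. This is an assumption of how such a pipeline might be wired, not the paper's method: a detector produces ROI features, and a separate metric classifier compares them to class prototypes built from the few support images.

```python
import torch
import torch.nn.functional as F

def metric_classify(roi_features, support_prototypes):
    """Classify each detected ROI by cosine similarity to class prototypes.

    roi_features:       (N, D) embeddings of the detector's cropped boxes
    support_prototypes: (C, D) mean embedding of the few support images per class
    """
    # Normalize, then compare every ROI with every class prototype.
    roi = F.normalize(roi_features, dim=1)
    proto = F.normalize(support_prototypes, dim=1)
    sims = roi @ proto.t()                  # (N, C) cosine similarities
    return sims.argmax(dim=1), sims

# Toy usage: 3 detected boxes, 4 classes, 64-dim features (all random here).
labels, sims = metric_classify(torch.randn(3, 64), torch.randn(4, 64))
```

Note that the detector itself is untouched in this scheme, which is exactly why localization can still underperform: the region proposals never benefit from the support images.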

A group of smart researchers figured out a way to fully utilize these few examples per class in a two-stage object detector, which comprises a region proposal network (RPN) and a classification network.

Classically, the RPN proposes bounding boxes based on the feature map extracted from the image. The ROI pooling function then compresses the contents of each box for classification in the next stage.

Fig. 4 Original region proposal framework

Attention-Region Proposal

The Attention-RPN method sharpens the focus on each class by attending the query feature map to a "prototype" of the corresponding class. Feature maps are first extracted from the support instances of a class, then pooled to summarize the information in each map; the pooled vectors serve as the prototypes of that class. Moreover, a prototype becomes more robust when it is formed from more instances. Using the prototype as a kernel to convolve with the query feature map, the filtered query feature map performs better in region proposal because it has a more focused view of that class of objects.

Fig.5 Mechanism of Attention-RPN in proposing regions
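The prototype-as-kernel attention can be sketched with a depthwise convolution. Shapes below are assumptions for illustration; the idea is that each channel of the query map is reweighted by the matching channel of the support prototype:

```python
import torch
import torch.nn.functional as F

# Hypothetical shapes: C channels, query map HxW, support map SxS.
C, H, W, S = 256, 38, 38, 20
query_fmap = torch.randn(1, C, H, W)
support_fmap = torch.randn(1, C, S, S)

# 1) Pool the support feature map into a 1x1 "prototype" per channel.
prototype = support_fmap.mean(dim=(2, 3), keepdim=True)   # (1, C, 1, 1)

# 2) Use the prototype as a depthwise convolution kernel: each channel of
#    the query map is scaled by the corresponding prototype value,
#    highlighting regions that resemble the support class.
kernel = prototype.view(C, 1, 1, 1)                        # (C, 1, 1, 1)
attended = F.conv2d(query_fmap, kernel, groups=C)          # (1, C, H, W)

# The attended map is then fed to the usual RPN head for region proposals.
```

Because the kernel is 1×1, the convolution preserves the spatial resolution of the query feature map while injecting class-specific information from the support images.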

Multi-Relation Head

In metric learning methods, the relation scores between the query and each support are the output logits of the model. The authors propose a Multi-Relation Head that uses a different way of calculating the metric in the second stage of the two-stage object detection framework.

Objects in the bounding boxes are cropped from the feature maps for further compression. ROI pooling compresses the information in the objects, and it is applied to both support and query objects.
(You can find ROI pooling explained in many Medium articles; it compresses an object not to a single point but to a very small feature map.)

The Multi-Relation Head scores the relation between the query and each support object. There are three kinds of relation scores:

1. Global relation: calculate the relation between the support and query feature maps with a Relation Network. The query and support features are concatenated and fed into the Relation Network to obtain their relation value.
Fig. 6 Global relation value measured by a trainable network

2. Local relation: calculate the similarity between the support and query maps to obtain a similarity value for each support-query channel pair, then feed the similarity vector into a dense network to calculate the relation value.

Fig.7 Local relation value measured by paired similarity vector and a trainable network

3. Patch relation: calculate the relation between support and query with a CNN applied to their concatenated feature maps.

Fig.8 Patch relation value measured by a CNN

Each of the methods above includes a trainable network in the model, rather than only a fixed metric function.
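The three relation branches can be sketched together in one module. The layer widths, depths, and the final summation below are my own guesses for illustration, not the exact architecture from the paper:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiRelationHead(nn.Module):
    """Sketch of global, local, and patch relation scores (guessed sizes)."""

    def __init__(self, channels=256):
        super().__init__()
        # Global relation: MLP on concatenated, globally pooled features.
        self.global_fc = nn.Sequential(
            nn.Linear(2 * channels, channels), nn.ReLU(),
            nn.Linear(channels, 1))
        # Local relation: dense layer on per-channel similarities.
        self.local_fc = nn.Linear(channels, 1)
        # Patch relation: small CNN on the concatenated feature maps.
        self.patch_cnn = nn.Sequential(
            nn.Conv2d(2 * channels, channels, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(channels, 1))

    def forward(self, query, support):          # both (N, C, 7, 7)
        q_vec = query.mean(dim=(2, 3))           # (N, C)
        s_vec = support.mean(dim=(2, 3))
        g = self.global_fc(torch.cat([q_vec, s_vec], dim=1))
        # Per-channel similarity between flattened query/support maps.
        local_sim = F.cosine_similarity(
            query.flatten(2), support.flatten(2), dim=2)   # (N, C)
        l = self.local_fc(local_sim)
        p = self.patch_cnn(torch.cat([query, support], dim=1))
        return g + l + p                         # combined relation logit

# Toy usage: 2 query-support pairs of ROI-pooled 256x7x7 features.
scores = MultiRelationHead()(torch.randn(2, 256, 7, 7),
                             torch.randn(2, 256, 7, 7))
```

The key point the sketch captures is that all three scores pass through trainable layers, so the metric itself is learned rather than fixed.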

Two-way Contrastive Training

Training a meta-learning model consists of training on randomly sampled source tasks. In a typical contrastive training method for metric learning in classification, such as a Siamese network, pairs of images are used to shape the metric space, where representations of the data sit apart from each other. The goal is for representations from the same class to stay close while those from different classes stay far apart. The projection from images into the metric space is trained by contracting consistent data pairs and distancing inconsistent ones.

The case is the same when training the Attention-RPN. The method defines four kinds of data pairs for training:

  1. Positive support object with foreground proposal: the distance should be minimized.
  2. Positive support object with background proposal: the distance should be pulled apart.
  3. Negative support object with foreground proposal: the distance should be pulled apart.
  4. Negative support object with background proposal: neglected.

Pairs whose distance should be pulled apart would otherwise dominate the pool of possible object pairs. In response, the authors sample the four kinds of pairs at a ratio of 1:2:1:0. The system then continuously feeds pairs of support and query to the second stage of the model to train a proper projection into the metric space.
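The 1:2:1:0 sampling scheme can be sketched as below. The pool contents and names here are placeholders I invented for illustration:

```python
import random

# Hypothetical pools of candidate training pairs, one per used pair kind.
# Each entry stands for a (support_object, proposal) pair.
pair_pools = {
    "pos_support_fg": [("shark_support", f"fg_{i}") for i in range(100)],
    "pos_support_bg": [("shark_support", f"bg_{i}") for i in range(100)],
    "neg_support_fg": [("turtle_support", f"fg_{i}") for i in range(100)],
}

def sample_training_pairs(n_pos_fg=1, ratio=(1, 2, 1)):
    """Sample the three used pair kinds at the 1:2:1 ratio; the fourth kind
    (negative support with background proposal) is simply never drawn."""
    counts = dict(zip(pair_pools, (r * n_pos_fg for r in ratio)))
    return {k: random.sample(pair_pools[k], counts[k]) for k in pair_pools}

batch = sample_training_pairs(n_pos_fg=2)   # 2 + 4 + 2 = 8 pairs per batch
```

Capping the "pull apart" pairs this way keeps the contrastive signal balanced instead of letting negative pairs overwhelm training.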

Performance

With attention region proposal and the multi-relation head, Attention-RPN can obtain about 40% AP50 on the COCO dataset.

Please refer to the MS-COCO 10-shot Few-Shot Object Detection leaderboard for a more up-to-date ranking.

Next time, I will try to train a customized Attention-RPN on the marine creature dataset using the MMFewShot framework with a PyTorch backend.

The framework is complicated in itself, so an introduction to it will be given in a separate article.

References

  1. Fan, Q., Zhuo, W., Tang, C. K., & Tai, Y. W. (2020). Few-shot object detection with attention-RPN and multi-relation detector. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (pp. 4013–4022).
  2. Sung, F., Yang, Y., Zhang, L., Xiang, T., Torr, P. H., & Hospedales, T. M. (2018). Learning to compare: Relation network for few-shot learning. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1199–1208).
  3. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., & Shah, R. (1993). Signature verification using a "Siamese" time delay neural network. Advances in Neural Information Processing Systems, 6.


Shu-Yu Huang

AI engineer at Taiwan AI Academy | Former process engineer at TSMC | Former research assistant at NYMU | Studying few-shot learning and GANs