
Object detection is a separate, well-studied area of computer vision. It has drawn particular interest in the community for its value in video analysis and image understanding. Its goal is to locate a bounding box around each object and label it with a class and a confidence score [1]. Prominent early applications included face detection [2, 3] and pedestrian tracking [4, 5]. Traditionally, the object detection pipeline was a three-stage process: region selection, feature extraction, and classification [6]. In region selection, exhaustive methods such as sliding windows proposed candidate object positions. Histograms of oriented gradients (HOG) were then extracted as a feature representation [7]. Finally, machine learning methods such as SVMs [8] and boosting [9] classified the features extracted in the previous step.
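The classical region-selection step can be sketched as an exhaustive sliding window over the image at several scales. The window sizes and stride below are illustrative choices, not values from any of the cited papers:

```python
def sliding_windows(img_w, img_h, sizes=((64, 64), (128, 128)), stride=32):
    """Yield candidate regions as (x1, y1, x2, y2) boxes.

    Slides each window size over the image on a regular grid; every
    resulting box is a candidate object location.
    """
    boxes = []
    for w, h in sizes:
        for y in range(0, img_h - h + 1, stride):
            for x in range(0, img_w - w + 1, stride):
                boxes.append((x, y, x + w, y + h))
    return boxes

# For a 256x256 image this already yields dozens of candidates; each would
# then be described by HOG features and scored by a classifier such as an SVM.
proposals = sliding_windows(256, 256)
```

Even this toy configuration produces 74 candidate boxes for a single small image, which illustrates why exhaustive region selection was considered computationally wasteful.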

Convolutional neural networks (CNNs), the most representative deep learning models [10], have caught the attention of the research community. Benefiting from advantages such as hierarchical feature representation, large capacity, and the ability to learn joint tasks, generic object detection models were proposed that locate and classify objects in a single, end-to-end CNN. Compared to previous approaches, this adds robustness by discarding handcrafted features and replacing shallow architectures. The approach first succeeded in 2014 [11], outperforming the previously reported state-of-the-art methods in mAP (mean average precision).
The typical cost function for training a detection CNN can be described as the minimization of a joint classification and bounding-box regression loss:

\[L\left(p_i, t_i\right)=\frac{1}{N_{cls}} \sum_i L_{cls}\left(p_i, p_i^*\right)+\lambda \frac{1}{N_{reg}} \sum_i p_i^* L_{reg}\left(t_i, t_i^*\right)\]

Here \(p_i\) is the predicted probability of the \(i\)-th anchor being an object, and \(t_i\) the parametrized coordinates of its bounding box. \(\lambda\) balances the importance of the two terms, \(N_{cls}\) is the mini-batch size, and \(N_{reg}\) is the number of anchor locations. \(p_i^*\) and \(t_i^*\) are the ground-truth label and coordinates.
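A toy scalar version of this loss can be written out directly, using binary cross-entropy for the classification term and the smooth L1 loss of [12] for the regression term (the choice of these two concrete losses follows Faster R-CNN; other detectors substitute their own):

```python
import math

def smooth_l1(x):
    # Robust regression loss: quadratic near zero, linear otherwise.
    return 0.5 * x * x if abs(x) < 1 else abs(x) - 0.5

def detection_loss(p, t, p_star, t_star, lam=1.0):
    """Joint classification + bounding-box regression loss (toy sketch).

    p      -- predicted objectness probability per anchor
    t      -- predicted parametrized box coordinates (4 per anchor)
    p_star -- ground-truth labels (1 = object, 0 = background)
    t_star -- ground-truth parametrized coordinates
    """
    n_cls = len(p)    # normalizer for the classification term (mini-batch size)
    n_reg = len(t)    # normalizer for the regression term (anchor locations)
    eps = 1e-12       # numerical guard for log()
    l_cls = sum(-(ps * math.log(pi + eps) + (1 - ps) * math.log(1 - pi + eps))
                for pi, ps in zip(p, p_star))
    # The p_star factor switches the regression term on only for positive anchors.
    l_reg = sum(ps * sum(smooth_l1(a - b) for a, b in zip(ti, ts))
                for ti, ts, ps in zip(t, t_star, p_star))
    return l_cls / n_cls + lam * l_reg / n_reg
```

With perfect predictions both terms vanish, and the \(p_i^*\) factor ensures background anchors contribute nothing to the regression term, exactly as in the equation above.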

Further evolution of deep neural networks in object detection led to two main frameworks. In the region proposal-based framework, training is a two-step process slightly resembling how the human brain operates: the network first generates candidate regions, then localizes and classifies each region based on extracted deep CNN features. Pioneering works in this branch (Faster R-CNN [12], FPN [13], Mask R-CNN [14]) addressed computational cost, scale invariance, and multi-task learning. The other branch, the regression/classification-based framework, focuses on reducing the architecture for real-time applications by mapping a neural network directly from image pixels to bounding-box coordinates and class probabilities. Foundational works here are YOLO [15] and SSD [16], the latter producing feature maps at multiple resolutions. The two frameworks are also commonly referred to as two-stage and one-stage networks, respectively.
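Both families emit many overlapping candidate boxes per object, which are conventionally pruned with greedy non-maximum suppression as a post-processing step. A minimal sketch, using a standard IoU overlap test and an illustrative threshold of 0.5:

```python
def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh=0.5):
    """Greedy non-maximum suppression.

    Visit boxes in descending confidence order; keep a box only if it does
    not overlap an already-kept box by more than `thresh`.
    """
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) <= thresh for j in keep):
            keep.append(i)
    return keep
```

For example, two near-duplicate detections of the same object collapse to the single higher-confidence box, while a distant detection survives untouched.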

There are several directions in which object detection is advancing today: improving accuracy (multimodal information fusion [17, 18], scale adaptation [19, 20]), easing the burden of laborious data annotation (unsupervised [21] and weakly supervised learning [22]), extending methods (three-dimensional representation [23] and video information [24]), and efficient solutions [25, 26].

How to cite

Askaruly, S. (2021). Object detection. Tuttelikz blog: tuttelikz.github.io/blog/2021/12/detection

References

[1] Szeliski, R., 2010. Computer vision: algorithms and applications. Springer Science & Business Media.
[2] Moghaddam, B., Jebara, T. and Pentland, A., 2000. Bayesian face recognition. Pattern recognition, 33(11), pp.1771-1782.
[3] Yang, M.H., Kriegman, D.J. and Ahuja, N., 2002. Detecting faces in images: A survey. IEEE Transactions on pattern analysis and machine intelligence, 24(1), pp.34-58.
[4] Gavrila, D.M., 1999. The visual analysis of human movement: A survey. Computer vision and image understanding, 73(1), pp.82-98.
[5] Gavrila, D.M. and Philomin, V., 1999, September. Real-time object detection for "smart" vehicles. In Proceedings of the Seventh IEEE International Conference on Computer Vision (Vol. 1, pp. 87-93). IEEE.
[6] Zhao, Z.Q., Zheng, P., Xu, S.T. and Wu, X., 2019. Object detection with deep learning: A review. IEEE transactions on neural networks and learning systems, 30(11), pp.3212-3232.
[7] Dalal, N. and Triggs, B., 2005, June. Histograms of oriented gradients for human detection. In 2005 IEEE computer society conference on computer vision and pattern recognition (CVPR'05) (Vol. 1, pp. 886-893). IEEE.
[8] Cortes, C. and Vapnik, V., 1995. Support vector machine. Machine learning, 20(3), pp.273-297.
[9] Schneiderman, H. and Kanade, T., 2004. Object detection using the statistics of parts. International Journal of Computer Vision, 56(3), pp.151-177.
[10] LeCun, Y., Bengio, Y. and Hinton, G., 2015. Deep learning. nature, 521(7553), pp.436-444.
[11] Girshick, R., Donahue, J., Darrell, T. and Malik, J., 2014. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 580-587).
[12] Ren, S., He, K., Girshick, R. and Sun, J., 2015. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in neural information processing systems, 28.
[13] Lin, T.Y., Dollár, P., Girshick, R., He, K., Hariharan, B. and Belongie, S., 2017. Feature pyramid networks for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 2117-2125).
[14] He, K., Gkioxari, G., Dollár, P. and Girshick, R., 2017. Mask r-cnn. In Proceedings of the IEEE international conference on computer vision (pp. 2961-2969).
[15] Redmon, J., Divvala, S., Girshick, R. and Farhadi, A., 2016. You only look once: Unified, real-time object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 779-788).
[16] Liu, W., Anguelov, D., Erhan, D., Szegedy, C., Reed, S., Fu, C.Y. and Berg, A.C., 2016, October. Ssd: Single shot multibox detector. In European conference on computer vision (pp. 21-37). Springer, Cham.
[17] Tang, S., Andriluka, M. and Schiele, B., 2014. Detection and tracking of occluded people. International Journal of Computer Vision, 110(1), pp.58-69.
[18] Gao, Y., Wang, M., Zha, Z.J., Shen, J., Li, X. and Wu, X., 2012. Visual-textual joint relevance learning for tag-based social image search. IEEE Transactions on Image Processing, 22(1), pp.363-376.
[19] Li, H., Lin, Z., Shen, X., Brandt, J. and Hua, G., 2015. A convolutional neural network cascade for face detection. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 5325-5334).
[20] Xie, S., Girshick, R., Dollár, P., Tu, Z. and He, K., 2017. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 1492-1500).
[21] Croitoru, I., Bogolin, S.V. and Leordeanu, M., 2017. Unsupervised learning from video to detect foreground objects in single images. In Proceedings of the IEEE International Conference on Computer Vision (pp. 4335-4343).
[22] Papadopoulos, D.P., Uijlings, J.R., Keller, F. and Ferrari, V., 2017. Training object class detectors with click supervision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6374-6383).
[23] Chen, X., Kundu, K., Zhu, Y., Berneshawi, A.G., Ma, H., Fidler, S. and Urtasun, R., 2015. 3d object proposals for accurate object class detection. Advances in neural information processing systems, 28.
[24] Byeon, W., Breuel, T.M., Raue, F. and Liwicki, M., 2015. Scene labeling with lstm recurrent neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3547-3555).
[25] Duan, K., Bai, S., Xie, L., Qi, H., Huang, Q. and Tian, Q., 2019. Centernet: Keypoint triplets for object detection. In Proceedings of the IEEE/CVF international conference on computer vision (pp. 6569-6578).
[26] Law, H. and Deng, J., 2018. Cornernet: Detecting objects as paired keypoints. In Proceedings of the European conference on computer vision (ECCV) (pp. 734-750).