Active underwater object recognition from multibeam sonar imagery

Ivor Rendulić
Laboratory for Underwater Systems and Technologies
Faculty of Electrical Engineering and Computing, University of Zagreb
Email: [email protected]

Abstract—Automatic object recognition is very hard to achieve underwater. Water turbidity and low lighting often give optical cameras a very limited range and result in poor quality images. Multibeam sonars, sometimes referred to as "acoustic cameras", are not affected by these optical visibility problems. However, the image they produce can be quite noisy and lacks detail. To cope with such low-detail images of objects, multiple views from different perspectives can be very helpful and provide the additional information needed to successfully recognize an object. The area of active object recognition deals with how to manipulate the sensor, and from which perspective to approach the object, in order to reduce the uncertainty of the recognition estimate. Having the sonar mounted on an Autonomous Underwater Vehicle (AUV) makes this scenario a perfect candidate for an active object recognition task. Another issue when building a recognition system for a multibeam sonar is the unavailability of training data. Using synthetic 3D models and a sonar simulator to create images from different views will be considered as an alternative to recording large amounts of real data.

Index Terms—sonar, multibeam, active object recognition, underwater

I. INTRODUCTION

With the advancements in algorithms and processing power, object recognition from images or video has become very accurate. Old benchmarks (such as MNIST [1], a database for handwritten digit recognition) have become too easy, as even fairly simple models today can achieve almost perfect precision [2]. The focus has shifted to the far more challenging problem of general object recognition with a large number of object classes. The most popular modern benchmark is performance on the ImageNet test set [3], which has 1000 different classes. These recognition tasks rely on having a large amount of training data available from different viewpoints of the object, and have to classify an unknown object based solely on a single image of it. The notion of active object recognition [4], which will be explained in more detail in later sections, implies that the recognition system also has some control over the sensor input. In the case observed here, an underwater vehicle with a multibeam sonar can re-position itself to obtain a better angle on the target and improve the probability of correct classification.

The rest of the paper is organized as follows. In Section II a general overview of related work dealing with next-best view planning and active object recognition will be given. Next-best view planning is one of the key steps in active object recognition, where the best possible move for the system is calculated. An overview of the envisioned scenario and the motivation for the use of an active object recognition system will be given in Section III. After that, the system will be broken down into its main components in Section IV. In Section V object representation and recognition from sonar imagery will be discussed, as well as the need for defining similarity between images. Using synthetic data for training the classifier will be presented in Section VI. Finally, in Section VII algorithms for matching the observed images with known training data, and for path planning with the goal of maximizing the object recognition rate, will be explored.
A next-best view approach will be used to find a path that best discriminates among the different candidates for classification. Hidden Markov Models and Conditional Random Fields, along with well-known algorithms for their optimal solution, will be presented for the matching problem. Finally, in Section VIII a conclusion on all the presented methods will be given, together with plans for future work and an initial prototype of the system.

II. GENERAL OVERVIEW OF RELATED WORK

In this section a general overview of published work related to the subject of interest will be given. Later in the paper some of these works will be referenced again in more detail, in the specific sections related to their topic.

A. Active 3D model acquisition

Active planning with visual sensors is commonly used in two major categories of problems. The first is 3D object synthesis, where a virtual 3D model of an object is built from scans of that object. The sensor (e.g. a camera) is usually mounted on a robot arm or a similar device which allows accurate positioning at the desired pose. This problem is closely related to the aforementioned next-best view planning, as it is desired that the whole process be performed as quickly as possible; a calculation of the scanning positions that best capture the whole object is then required. Examples of next-best view planning for object scanning can be found in [5], where the author develops a method that uses a range camera for object scanning to build CAD models. The work is further improved in [6]. The algorithm is based on splitting the viewing volume into seen and unseen regions, and a novel representation for it, called positional space, is introduced. Compared to other algorithms at the time, it showed very good performance. A more recent work on next-best view planning for building 3D models can be found in [7], where stereo camera images and videos are used for improving 3D models. The authors focus on selecting viewpoints that minimize the uncertainty of the incrementally developed model.

B. Active object recognition with synthetic model training data

The other category of active sensor positioning problems with visual sensors is the one of interest for the problem described in this paper: active 3D object recognition. The research performed in that area will therefore be presented in more detail. The difference compared to active 3D model building is in the criteria governing the process: instead of looking for a quick way of mapping the entire object, the goal is to recognize the object as quickly as possible. In the following subsections a quick overview of the relevant literature will be given.

1) Representation: The Scale-Invariant Feature Transform (SIFT) [8] and similar related features have been extremely popular for representing objects in images. In recent years, with advancements in deep artificial neural networks, automatically learned representations of data [9] have shown much better results in object recognition compared to traditional, hand-crafted features such as SIFT. Automatically learned features will also be considered for the representation of sonar data. Another set of features, specifically used to describe shapes, are shape contexts [10]. Their use in [11] on silhouettes, which look similar to sonar images, showed good potential for the task.
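To make the shape-context idea concrete, the following is a minimal sketch of the descriptor computation, assuming the object contour has already been extracted and subsampled to a fixed number of points. The function name, bin counts and radius range are illustrative choices, a simplified version of the descriptor in [10] rather than a verbatim reimplementation.

```python
import numpy as np

def shape_context(points, n_r=5, n_theta=12):
    # points: (n, 2) array of contour points sampled around the object.
    points = np.asarray(points, dtype=float)
    n = len(points)
    # Pairwise offsets: diff[i, j] = points[j] - points[i].
    diff = points[None, :, :] - points[:, None, :]
    dist = np.linalg.norm(diff, axis=2)
    angle = np.arctan2(diff[..., 1], diff[..., 0])          # in [-pi, pi]
    # Normalize distances by their mean for scale invariance.
    mean_d = dist[dist > 0].mean()
    r_edges = np.logspace(np.log10(0.125), np.log10(2.0), n_r + 1)
    desc = np.zeros((n, n_r, n_theta))
    for i in range(n):
        others = np.arange(n) != i                          # skip the point itself
        r_bin = np.digitize(dist[i, others] / mean_d, r_edges) - 1
        t_bin = ((angle[i, others] + np.pi) / (2 * np.pi) * n_theta).astype(int) % n_theta
        ok = (r_bin >= 0) & (r_bin < n_r)                   # drop points outside the radii
        np.add.at(desc[i], (r_bin[ok], t_bin[ok]), 1)
    # One flattened log-polar histogram per contour point.
    return desc.reshape(n, -1)
```

Matching two shapes then reduces to comparing these per-point histograms, e.g. with the chi-squared distance used in [10].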
2) Recognition: For the image recognition step, the dominant algorithms for years were Support Vector Machines [12] and boosted Haar cascades [13]. They too have been replaced in recent years by various deep neural network architectures, which have achieved much lower error rates.

3) Using synthetic model data: In 3D object recognition, and especially in cases where it is also important to detect the relative orientation of the object, artificial data can be very helpful. Although such data will never be completely true to real data, it can be collected quickly and in large quantities using 3D models of objects. An additional advantage is the perfectly accurate information on orientation and distance from the camera, which can be obtained in simulators or 3D modeling tools. There are many papers dealing with the use of artificial 3D model data to train object recognition systems. While both [14] and [15] provide a rather dated overview of the field, some of the concepts described are still valid. In [16] the authors tested the use of synthetic data obtained from 3D models, and concluded that much more training data is needed compared to the use of real images. A more complex system is built in [17], where 3D models are used to train a system capable of detection, recognition and pose estimation of objects in cluttered scenes; using a Kinect camera, their system works in real time. In [18] the authors focus on pose estimation and attempt to obtain a precise estimate of the pose of IKEA furniture based on 3D models. Finally, a complete active object recognition system with synthetic 3D object models was developed in [11]. This paper will be referenced in many subsequent sections, as it covers a great deal of the problem. All these model-based approaches share a view clustering step: since it is possible to sample the 3D model from any desired view, it is important to limit the number of views, which is usually done by clustering visually similar views.

4) Planning the next step for active object recognition: In [19] the authors deal with the closely related problem of improving object recognition when multiple model views are available. Although this technically does not fall under active object recognition, as it focuses solely on combining multiple views and does not influence their acquisition, it provides an alternative perspective and ideas that can be used in an active object recognition system. Similarly, but with a focus on active recognition, the authors of [20] discuss fusion of multiple views specifically for active object recognition. Some of the ground-breaking research in planning for active object recognition was done in [21], where a Bayesian approach to the problem was introduced with the goal of minimizing uncertainty in every step of the process. Similar approaches were used in [22].

III. ENVISIONED SCENARIO

In this section the scenario for the proposed active recognition system, sketched in Figure 1, is presented. A multibeam sonar is mounted on an autonomous underwater vehicle (AUV). The vehicle is BUDDY, an AUV developed at the Laboratory for Underwater Systems and Technologies (LABUST) for the FP7 project "CADDY - Cognitive Autonomous Diving Buddy" [23]. The sonar used is a Soundmetrics ARIS 3000. While scanning the seabed, if the AUV encounters a possible object of interest (e.g. a sea mine), it will take additional looks at it to give a better estimate of whether that object indeed is a mine or not.
At first, the operator will manually mark the object of interest while the AUV is scanning, but later this step should also be automated.

Fig. 1. Visualization of the envisioned scenario

IV. OVERVIEW OF THE PROPOSED SYSTEM

In this section an overview of both the training system and the real-time active recognition system is given. The training phase, shown in Figure 2, consists of several stages. First, synthetic images are generated with a sonar simulator from different views of 3D models. Then, a transformation of each image is performed to obtain the chosen object representation. Views are then clustered based on similarity in order to reduce the dimensionality of the view space and make the recognition stage easier. Finally, a classifier is trained to enable recognition of objects and of views of each object.

Fig. 2. Block diagram of the training phase with 3D models

In the active recognition phase, shown in Figure 3, the stages are the following. First, the sonar image is transformed into the chosen representation, the same as in the training phase. Then, the sequence of recorded sonar images is matched to the training data from 3D models, simultaneously improving the position and classification estimates with every new image. Finally, based on the calculated estimates, the next action is planned. This can be the next view to which the system should position itself, or outputting the recognized object if the confidence is high and enough data has already been collected.

Fig. 3. Block diagram of the active recognition process

V. SONAR IMAGE

An image can be obtained from the sonar in two different modes. The first is in sonar geometry, where each vertical line matches one of the beams. This type of image has a rectangular shape, but objects appear distorted in it, as the beams are not actually parallel. A natural image is obtained by mapping to Cartesian space and has the shape of a circular sector. This produces some issues during processing at the edges, as all the algorithms work on rectangular images, but it offers a natural look at objects. Images of a straight wall can be seen in Figure 4 (native sonar polar geometry) and Figure 5 (mapped to Cartesian geometry). The straight line appears distorted in the polar geometry.

Fig. 4. Native sonar image
Fig. 5. Cartesian sonar image

A. Feature representation

Due to the appearance and quality of sonar images, some standard feature representations normally used on camera images might not be as appropriate. The sonar image is grayscale and extremely noisy. This is partly due to processing inaccuracies and partly due to actual small particles in the water, which can be highly visible in the sonar image but would be barely noticeable in an optical camera image.

1) Classical image features: The first set of features tested were the ones described in [24]. The features were tested in terms of optical flow calculation, to see how consistently they behave on two sonar images taken one right after another, with very little movement in between. In low-frequency sonar mode, with objects at a distance of around 5 meters, this worked poorly, as the noise in the sonar image influenced the features heavily. SIFT features [8] have also been tested, producing similarly inconsistent results. The features were additionally tested on images of a human hand recorded at a 1 meter distance with the sonar operating in high-frequency mode. This brought an improvement, as the image quality becomes much better, but the results were still far from those obtainable with normal video cameras.
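As an illustration of the consistency test described above, the following sketch tracks Shi-Tomasi corners [24] between two consecutive sonar frames with pyramidal Lucas-Kanade optical flow. The file names and parameter values are placeholders; on noisy low-frequency sonar frames most tracks either fail or jump, which is how the inconsistency manifests.

```python
import cv2
import numpy as np

# Two consecutive sonar frames with very little movement in between
# (file names are placeholders for illustration).
prev_img = cv2.imread("sonar_frame_000.png", cv2.IMREAD_GRAYSCALE)
next_img = cv2.imread("sonar_frame_001.png", cv2.IMREAD_GRAYSCALE)

# Shi-Tomasi "good features to track" [24] on the first frame.
corners = cv2.goodFeaturesToTrack(prev_img, maxCorners=200,
                                  qualityLevel=0.01, minDistance=7)

# Pyramidal Lucas-Kanade optical flow to the second frame.
tracked, status, err = cv2.calcOpticalFlowPyrLK(prev_img, next_img,
                                                corners, None)

# With almost no real motion, consistent features should barely move;
# large displacements indicate that noise, not structure, was detected.
good = status.ravel() == 1
displacement = np.linalg.norm(tracked[good] - corners[good], axis=2).ravel()
print("tracked: %d/%d, median displacement: %.2f px"
      % (good.sum(), len(corners), np.median(displacement)))
```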
The plan for the next step is to test the previously mentioned shape descriptors [10]. Based on the idea behind them, they could work well on a smoothed sonar image, after the contour of the object of interest has been calculated.

2) Contours: Contours are often used with multibeam sonar images. In particular, if there are not many objects surrounding the object of interest, it can appear highly prominent in the image. Figure 6 displays the original sonar image (left), the image after applying a Gaussian blur to filter out the noise (middle), and the image after an adaptive threshold algorithm used as a contour detection step (right). This approach was tested at LABUST in object tracking algorithms, where the sonar image was fused in an Extended Kalman Filter with Ultra-Short Baseline (USBL) acoustic tracking measurements [25]. In [26] the authors use contour-based object detection, combined with background removal, to perform object detection with a multibeam sonar. Contour-based algorithms can also be used to build masks for approximate isolation of the object of interest.

Fig. 6. Contour detection in sonar images

B. Recognition algorithms

The recognition or classification step can work either directly on the raw image (or a part of it), or on computed features, as mentioned in the previous subsection. Some of the popular classification algorithms include Support Vector Machines (SVM) [12], boosted Haar cascades [13], Logistic Regression, Random Forests and Artificial Neural Networks.

Support Vector Machines, in their simplest version, calculate an optimal separating hyperplane for linearly separable data. The hyperplane is optimal in the sense of maximizing the margin between the two classes, which it achieves by lying right in the middle between them. This simplest case of the SVM has also been extended to nonlinear problems, by using non-linear kernels instead of the dot product.

The boosted Haar cascades algorithm was probably the most widely used algorithm for object detection and recognition in images and video during the last decade. It is based on two important concepts: the integral image representation, which can be calculated very efficiently, and the boosting of many simpler classifiers that work on subsets of features. Both the SVM (with a set of features calculated from the contours) and the boosted Haar cascades algorithm were used in [27] for detecting a human hand and recognizing its gesture. They performed well, but only at short distances and with the sonar in high-frequency mode.

Artificial neural networks with many hidden layers and different structures (so-called "deep learning") have been the most successful and widely used approach in the last several years, not only for object recognition in images but in all kinds of machine learning problems. A type of network that could be very appropriate for the task is described in [28]. In that paper the authors present a way of efficiently using a convolutional neural network to not only recognize an object, but also perform the detection step by learning to predict object boundaries. That way, detection could be automated within the same neural network that performs the recognition. Neural networks are used with sonar data in [29]; however, the authors use a simple feed-forward network as a classifier and feed it with features obtained through image processing techniques. To the best of the author's knowledge, testing deep learning techniques such as deep convolutional neural networks on sonar data has not yet been reported in the literature.
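Since deep convolutional networks are singled out above as untested on sonar data, the following is a minimal sketch of the kind of small CNN classifier that could be trained on (synthetic) sonar images. The input size, layer widths and class count are arbitrary assumptions for illustration, not results from any cited work.

```python
import torch
import torch.nn as nn

class SonarCNN(nn.Module):
    """A small convolutional classifier for single-channel (grayscale)
    sonar images; all sizes here are illustrative assumptions."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # 1 input channel
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 64x64 -> 32x32
            nn.Conv2d(16, 32, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.MaxPool2d(2),                              # 32x32 -> 16x16
        )
        self.classifier = nn.Linear(32 * 16 * 16, n_classes)

    def forward(self, x):
        x = self.features(x)
        return self.classifier(x.flatten(start_dim=1))

# A forward pass on a batch of four 64x64 grayscale sonar images.
model = SonarCNN(n_classes=10)
logits = model(torch.randn(4, 1, 64, 64))   # -> shape (4, 10)
```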
C. Similarity between images

A useful concept often used in active object recognition is a measure of similarity between two images. With this information the system can plan for the next view to be at a location that discriminates best among the current candidates. An example of why this is important can be seen in the objects displayed in Figure 7: if the vehicle approaches the object from the left, it is impossible to distinguish the two. The best way for the active object recognition system to act is to move to the side where the two objects differ the most.

Fig. 7. Two objects appearing the same if approached from the left side

D. Clustering views

The question arises of how densely to sample the 3D view space. For now, sampling points on a sphere of constant radius will be considered. Some objects have finer detail than others and do not require as fine a sampling of the view space. For example, a ball appears the same from any view the sonar might have of it, while views of an underwater shipwreck will differ greatly. Because of that, and by using the similarity of the resulting images of objects, multiple views can be iteratively clustered until a desired difference between clusters is achieved. Ideally, this leads to a ball being represented by a single cluster, while a more complex object can be represented as accurately as desired.

VI. USING 3D MODELS AS TRAINING DATA

In order to achieve high precision in classification a large amount of training data is necessary, and this need grows rapidly with the number of classes. With massive amounts of labeled image data becoming available, object recognition from camera images can be done with virtually as much data as the training system can handle. Unfortunately, in the case of multibeam sonar imagery the situation is far worse. Sonars are expensive, both to purchase and to use, and it is almost impossible to find images useful for training a desired classifier.

Because of the challenges in acquiring real data, and the relative simplicity of sonar images, an alternative is to use 3D models of objects and a sonar simulator to scan them. The tool that will be used to build a sonar simulator is UWSim [30], a popular simulator for underwater applications. It offers a virtual range sensor primitive, which can be used to build an array that resembles a multibeam sonar.

An alternative (or complementary) approach is to use real images to generate a much bigger set of synthetic images for training. This can be done in a simple way, by introducing noise or transformations to an image. Another approach is developed in [31], where the authors use real images to find the rendering parameters for creating synthetic images.
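As a concrete example of the simple noise-and-transformation route, the sketch below expands one real sonar image into several training variants using multiplicative speckle-like noise and small random rotations. The noise model, parameter ranges and file name are illustrative assumptions, not a calibrated sonar model.

```python
import cv2
import numpy as np

rng = np.random.default_rng(0)

def augment(image, n_variants=10):
    """Generate noisy, slightly rotated variants of one sonar image.
    Multiplicative noise crudely mimics speckle; parameters are guesses."""
    h, w = image.shape
    variants = []
    for _ in range(n_variants):
        # Multiplicative speckle-like noise.
        noisy = image.astype(np.float32) * rng.normal(1.0, 0.15, size=(h, w))
        noisy = np.clip(noisy, 0, 255).astype(np.uint8)
        # Small random rotation around the image center.
        angle = rng.uniform(-10.0, 10.0)
        M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
        variants.append(cv2.warpAffine(noisy, M, (w, h)))
    return variants

# Usage (placeholder file name):
# samples = augment(cv2.imread("real_sonar_object.png", cv2.IMREAD_GRAYSCALE))
```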
VII. PATH PLANNING FOR ACTIVE RECOGNITION

A. Matching images to objects and views

Given an image obtained from the sonar (or a sequence of images) and information about the vehicle's path, the system needs to find the likely candidates and match the sonar and positioning data with the object models. This is the crucial step in the system. In [11] the authors use Conditional Random Fields (CRF) to perform the matching. More specifically, given a sequence of images F = {f_1, ..., f_T}, they seek the matching sequence of object views from the training data, V = {v_1, ..., v_T} (each view is primarily parametrized by the object itself and the viewpoint in the view space, along with a few additional parameters they define). They introduce the CRF to calculate the conditional probability of V given F:

P(V|F) = \frac{1}{Z(F)} \prod_{i=1}^{T} P(v_i|F) \, P(v_i, v_{i-1}|F)    (1)

In the equation above, Z(F) is the partition function and T is the number of images in the sequence. That approach seems very intuitive for the problem, and leads to an efficient CRF solution with a forward-backward algorithm. The next step is to extend it with the position probability P(S), S = {s_1, ..., s_T}, which can presumably be obtained from the navigational filter of the vehicle. The probability distribution over model views then depends on both the recorded sonar images F and the position of the vehicle, so the new target probability is P(V|F, S).

An alternative way of modeling the problem is to use a Hidden Markov Model (HMM), which can also describe the process quite intuitively. The observations are the obtained sonar images, while the underlying process has the object views as states. The state transition probabilities are influenced by the position estimates.

B. Planning the next action

The final step in the system is deciding which action to take next. In the trivial case, when the certainty in some object is high enough, no further views of the object are needed. In the non-trivial case, there is a subset of object candidates M = {m_1, ..., m_C} which are all still above some likelihood threshold and cannot be discarded. The goal is then to move in the direction where the candidates are least similar. In a simple case, there are two candidate objects m_1 and m_2, and a similarity map between each pair of views is available. Given the current estimates of the object orientations, there is a view v_i which will give the smallest similarity measure D(m_1, m_2 | v_i). Any number of criteria could be used to calculate the next step; in the simplest one, the goal is to take the shortest path towards v_i.
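A minimal sketch of this simplest planning criterion: given a precomputed similarity map between the views of the two remaining candidates, pick the view index with the smallest similarity D(m_1, m_2 | v_i) and head there by the shortest path. The array shapes, function name and greedy choice are assumptions for illustration.

```python
import numpy as np

def next_best_view(similarity, reachable=None):
    """Pick the view index that best discriminates two candidates.

    similarity -- 1D array with similarity[i] = D(m1, m2 | v_i) for each
                  view v_i on the sampling sphere (precomputed offline).
    reachable  -- optional boolean mask of views the vehicle can reach.
    """
    scores = np.asarray(similarity, dtype=float).copy()
    if reachable is not None:
        scores[~np.asarray(reachable)] = np.inf  # exclude unreachable views
    # The least similar pair of appearances is the most discriminating view.
    return int(np.argmin(scores))

# Toy usage: eight views on a circle; the two candidate objects look alike
# from views 0-3 and differ the most at view 6.
D = np.array([0.95, 0.94, 0.90, 0.85, 0.50, 0.30, 0.10, 0.40])
print(next_best_view(D))  # -> 6
```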
VIII. CONCLUSION

In this paper an overview of the literature on active object recognition and on using synthetic 3D models to train recognition systems was given. The use case and motivation for such a system, with an Autonomous Underwater Vehicle and a multibeam sonar, were presented. Given the many specifics of sonar imagery, an analysis of representation and recognition algorithms was given. Testing and analysis of popular approaches often used with optical camera images showed that some of them do not work as well on sonar images, so alternative approaches have to be taken. Novel approaches, such as deep convolutional neural networks and similar deep learning structures, have yet to be tested on sonar imagery and are not yet reported in the literature. Active object recognition with synthetic 3D model training data has been researched before and provides a good base for the envisioned system. Using an AUV introduces additional positional uncertainty, which will have to be included in the model in the critical step where the sensor data is matched with the known training data. The use of probabilistic approaches, such as Conditional Random Fields or Hidden Markov Models, was considered and is promising for the task, giving a reasonable description of the model and efficient algorithms for optimal solving.

In future work, a sonar simulator has to be built to create a database of synthetic training images from 3D models. More recognition algorithms will be tested, together with the clustering step, to see how well the recognition process scales with an increasing number of objects and views. These two steps will form the foundation for building the entire active recognition system.

REFERENCES

[1] Y. LeCun, C. Cortes, and C. J. Burges, "The MNIST database of handwritten digits," 1998.
[2] L. Bottou, C. Cortes, J. S. Denker, H. Drucker, I. Guyon, L. D. Jackel, Y. LeCun, U. A. Muller, E. Sackinger, P. Simard et al., "Comparison of classifier methods: a case study in handwritten digit recognition," in International Conference on Pattern Recognition. IEEE Computer Society Press, 1994, pp. 77-77.
[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "ImageNet: A large-scale hierarchical image database," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 248-255.
[4] D. Wilkes and J. K. Tsotsos, "Active object recognition," in Computer Vision and Pattern Recognition, 1992. Proceedings CVPR '92., 1992 IEEE Computer Society Conference on. IEEE, 1992, pp. 136-141.
[5] R. Pito and R. K. Bajcsy, "Solution to the next best view problem for automated CAD model acquisition of free-form objects using range cameras," in Photonics East '95. International Society for Optics and Photonics, 1995, pp. 78-89.
[6] R. Pito, "A solution to the next best view problem for automated surface acquisition," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 21, no. 10, pp. 1016-1030, 1999.
[7] E. Dunn and J.-M. Frahm, "Next best view planning for active model improvement," in BMVC, 2009, pp. 1-11.
[8] D. G. Lowe, "Object recognition from local scale-invariant features," in Computer Vision, 1999. The Proceedings of the Seventh IEEE International Conference on, vol. 2. IEEE, 1999, pp. 1150-1157.
[9] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798-1828, 2013.
[10] S. Belongie, J. Malik, and J. Puzicha, "Shape matching and object recognition using shape contexts," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 4, pp. 509-522, 2002.
[11] A. Toshev, A. Makadia, and K. Daniilidis, "Shape-based object recognition in videos using 3D synthetic object models," in Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on. IEEE, 2009, pp. 288-295.
[12] C. Cortes and V. Vapnik, "Support-vector networks," Machine Learning, vol. 20, no. 3, pp. 273-297, 1995.
[13] P. Viola and M. Jones, "Rapid object detection using a boosted cascade of simple features," in Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference on, vol. 1. IEEE, 2001, pp. I-511.
[14] A. R. Pope, "Model-based object recognition: A survey of recent techniques," Technical Report, 1994.
[15] V. Blanz, B. Schölkopf, H. Bülthoff, C. Burges, V. Vapnik, and T. Vetter, "Comparison of view-based object recognition algorithms using realistic 3D models," in International Conference on Artificial Neural Networks. Springer, 1996, pp. 251-256.
[16] B. Heisele, G. Kim, and A. Meyer, "Object recognition with 3D models," in BMVC, 2009, pp. 1-11.
[17] S. Hinterstoisser, V. Lepetit, S. Ilic, S. Holzer, G. Bradski, K. Konolige, and N. Navab, "Model based training, detection and pose estimation of texture-less 3D objects in heavily cluttered scenes," in Asian Conference on Computer Vision. Springer, 2012, pp. 548-562.
[18] J. J. Lim, H. Pirsiavash, and A. Torralba, "Parsing IKEA objects: Fine pose estimation," in 2013 IEEE International Conference on Computer Vision. IEEE, 2013, pp. 2992-2999.
[19] V. Ferrari, T. Tuytelaars, and L. Van Gool, "Integrating multiple model views for object recognition," in Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on, vol. 2. IEEE, 2004, pp. II-105.
[20] F. Deinzer, J. Denzler, and H. Niemann, "On fusion of multiple views for active object recognition," in Joint Pattern Recognition Symposium. Springer, 2001, pp. 239-245.
[21] J. Denzler and C. M. Brown, "Information theoretic sensor data selection for active object recognition and state estimation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 24, no. 2, pp. 145-157, 2002.
[22] N. Govender, J. Warrell, P. Torr, and F. Nicolls, "Probabilistic object and viewpoint models for active object recognition," in AFRICON, 2013. IEEE, 2013, pp. 1-7.
[23] CADDY FP7 project. [Online]. Available: http://caddy-fp7.eu/
[24] J. Shi and C. Tomasi, "Good features to track," in Computer Vision and Pattern Recognition, 1994. Proceedings CVPR '94., 1994 IEEE Computer Society Conference on. IEEE, 1994, pp. 593-600.
[25] F. Mandić, I. Rendulić, N. Mišković, and D. Nad, "Underwater object tracking using sonar and USBL measurements," Journal of Sensors, vol. 2016, 2016.
[26] E. Galceran, V. Djapic, M. Carreras, and D. P. Williams, "A real-time underwater object detection algorithm for multi-beam forward looking sonar," IFAC Proceedings Volumes, vol. 45, no. 5, pp. 306-311, 2012.
[27] F. Gustin, I. Rendulic, and N. Miskovic, "Hand gesture recognition from multibeam sonar imagery," in Conference on Control Applications in Marine Systems. IFAC, 2016, in press.
[28] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "OverFeat: Integrated recognition, localization and detection using convolutional networks," arXiv preprint arXiv:1312.6229, 2013.
[29] J. Han, P. Yang, and L. Zhang, "Object recognition system of sonar image based on multiple invariant moments and BP neural network," International Journal of Signal Processing, Image Processing and Pattern Recognition, vol. 7, no. 5, pp. 287-298, 2014.
[30] M. Prats, J. Pérez, J. J. Fernández, and P. J. Sanz, "An open source tool for simulation and supervision of underwater intervention missions," in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 2577-2582.
[31] A. Rozantsev, V. Lepetit, and P. Fua, "On rendering synthetic images for training an object detector," Computer Vision and Image Understanding, vol. 137, pp. 24-37, 2015.