Article

MAGI: Multistream Aerial Segmentation of Ground Images with Small-Scale Drones

Department of Computer Science, Sapienza University of Rome, 00198 Roma, Italy
* Authors to whom correspondence should be addressed.
Drones 2021, 5(4), 111; https://doi.org/10.3390/drones5040111
Submission received: 20 August 2021 / Revised: 1 October 2021 / Accepted: 3 October 2021 / Published: 4 October 2021
(This article belongs to the Special Issue Advances in Deep Learning for Drones and Its Applications)

Abstract

In recent years, small-scale drones have been used in heterogeneous tasks, such as border control, precision agriculture, and search and rescue. This is mainly due to their small size, which allows for easy deployment, their low cost, and their increasing computing capability. The latter aspect allows researchers and industries to develop complex machine- and deep-learning algorithms for several challenging tasks, such as object classification, object detection, and segmentation. Focusing on segmentation, this paper proposes a novel deep-learning model for semantic segmentation. The model follows a fully convolutional multistream approach to perform segmentation on different image scales. Several streams perform convolutions by exploiting kernels of different sizes, making segmentation tasks robust to flight altitude changes. Extensive experiments were performed on the UAV Mosaicking and Change Detection (UMCD) dataset, highlighting the effectiveness of the proposed method.

1. Introduction

In recent years, computer vision has been widely used in several heterogeneous tasks, including deception detection [1,2,3], background modeling [4,5,6], and person reidentification [7,8,9]. In addition, thanks to technological advancement, it is possible to execute computer-vision algorithms on different devices, including robots [10,11] and drones [12,13]. The usage of the latter has increased dramatically in both the civilian and the research world because of several factors, including lower production costs, small sizes that allow for easy transportation and deployment, and the possibility of executing complex software during flights. The latter aspect is made possible by equipping the small-scale drone (hereinafter referred to as “drone”) with single-board computers (SBCs), usually separated from the main drone controller, that are in charge of executing the software. In addition, by exploiting SBCs with specific hardware for parallel computing (e.g., CUDA), it is possible to achieve real-time or near-real-time performance. This is crucial for critical tasks such as fire detection [14,15,16], search and rescue [17,18,19], and environmental monitoring [20,21,22]. To perform such tasks, different algorithms for simultaneous localization and mapping (SLAM) [23,24], mosaicking [25,26], object classification and detection [27,28], and object tracking [29,30] have been proposed in the literature. These algorithms usually provide, in the worst case, only the class of the object detected within the scene or, in the best case, a bounding box surrounding the object as well. In some situations, e.g., fire detection and precision agriculture, there is the need to isolate a specific target with the highest possible precision. To achieve this goal, segmentation algorithms [31] are used, which make it possible to obtain the shape of a specific element (e.g., a person, an object, a vehicle) at pixel precision, allowing for the perfect isolation of such an element from the rest of the scene. Nevertheless, certain conditions, such as the nadir view and altitude changes during the flight, make both object classification or detection and segmentation difficult.
To overcome these problems, in this paper, a novel multistream aerial segmentation of ground images (MAGI) deep-learning model is proposed. The model uses a fully convolutional approach on three different streams to perform semantic segmentation. The streams use different sizes for the convolutional kernels, thus allowing for better handling of images acquired with a nadir view and at different altitudes. The model was tested on the UAV Mosaicking and Change Detection (UMCD) dataset [32], which contains aerial images acquired at several low flight altitudes. The results obtained in our experiments are very promising and can be used as a baseline for future works on drone segmentation algorithms exploiting the UMCD dataset.
The rest of the paper is structured as follows. In Section 2, the current state of the art regarding segmentation is provided. In Section 3, the MAGI model is described. Section 4 presents the experiments and obtained results. Section 5 provides a discussion on the obtained results. Lastly, Section 6 concludes the paper.

2. Related Works

In the current literature, there is a plethora of works regarding segmentation. In this section, only the most recent works that use a base approach similar to ours are covered.
Segmentation tasks can be divided into two approaches: semantic and instance segmentation. In the former, the model identifies the classes of the elements present within the scene, e.g., chair, table, and person. In the latter, once the classes are identified, the model can also distinguish the different instances of each class, e.g., chair 1 from chair 2 and person 1 from person 2. One of the first works exploiting deep learning for segmentation was presented in [33]. The authors proposed a model comprising only convolutional layers, replacing the dense layers with 1 × 1 convolutions. This model, the fully convolutional network (FCN), was used as the base for the majority of the segmentation models in the state of the art. For example, in [34], the authors showed that the U-Net [35] architecture outperformed other deep-learning methods in power-line detection and segmentation in aerial images. In [36], U-Net was used as the main structure of the architecture; to capture objects of different scales, a group of cascaded dilated convolutions was inserted at the bottom of U-Net itself. The authors in [37] used an FCN with infrared green, normalized digital surface model, and normalized difference vegetation index as the model input, achieving an overall segmentation accuracy of 87% and showing that this type of architecture also works well with data different from standard images. The authors in [38] proposed an improved FCN to detect transmission lines. The first improvement regards the use of overlaying dilated convolutional layers and spatial convolutional layers, while the second regards the use of two output branches to generate multidimensional feature maps. Another improvement to the standard FCN was proposed in [39]. The model had an encoder–decoder form, and while there was only one encoder for the entire architecture, each class to be segmented had its own decoder, improving the segmentation of each class. In [40], an adaptive multiscale module (AMSM) and an adaptive fuse model (AFM) were proposed for fusing features extracted from remote-sensing images. The AMSM is composed of three branches, each performing convolutions with different dilation rates, while the AFM fuses the information of the shallow and deep feature maps generated by the several branches. Lastly, in [41], an FCN with skip connections was designed for segmentation in panchromatic aerial images, obtaining an overall accuracy of more than 90% in the best case.
With respect to these works, the proposed MAGI model uses a multistream approach with a different kernel size on each stream. This pyramidal approach allows for better segmentation of objects acquired with a nadir view or at different flight altitudes.

3. Materials and Methods

This section describes the proposed MAGI model. In detail, the model takes inspiration from our previous work [30], which exploited the multistream approach to perform object tracking from drones.

3.1. MAGI Model

The detailed architecture of the model is presented in Figure 1. As shown, MAGI is composed of three parallel streams, each comprising an encoder and a decoder. The encoders generate feature maps using a different kernel size on each stream, so that each stream segments the objects within an image with a different precision. Larger kernels provide coarser segmentation, while smaller kernels provide more fine-grained segmentation. This approach resembles the pyramidal technique used in other feature extractors [42], allowing for the extraction of features at different scales. As proof of this concept, we analyze in detail how the convolutional operation works. Let us consider an input image I. The pixels belonging to I are identified as I(x, y), where x and y correspond to the pixel coordinates. A filtered pixel f(x, y) is obtained by applying a kernel to I as follows:
f(x, y) = \omega * I(x, y) = \sum_{\delta_x = -i}^{i} \sum_{\delta_y = -j}^{j} \omega(\delta_x, \delta_y) \, I(x + \delta_x, y + \delta_y),
where ω is the kernel weight, δx and δy are coordinates within the kernel and the neighborhood of the starting pixel, and i = j = (3, 5, 7) are the kernel sizes of each stream. Figure 2 shows an example of the application of a kernel with different sizes. As expected, different kernel sizes allow for extracting features of different granularity with respect to the size of the kernel.
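As an illustration of the above, the following minimal sketch (not the authors' code; the averaging kernel and input image are placeholders) applies the same filter with kernel sizes 3, 5, and 7, as in Figure 2, showing how larger kernels aggregate a wider neighborhood and therefore produce coarser responses.

```python
# Hedged sketch: a uniform averaging kernel omega applied at the three stream sizes.
import torch
import torch.nn.functional as F

image = torch.rand(1, 1, 64, 64)  # dummy single-channel aerial frame (N, C, H, W)

for k in (3, 5, 7):
    omega = torch.full((1, 1, k, k), 1.0 / (k * k))    # uniform k x k kernel
    filtered = F.conv2d(image, omega, padding=k // 2)  # padding preserves spatial size
    print(f"kernel {k}x{k} -> output shape {tuple(filtered.shape)}")
```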
Once all the convolutions have been performed by the encoders, each stream performs transposed convolutional operations through its decoder to obtain a segmentation map. To achieve more precise localization, and to allow the transposed convolutions to learn to assemble a more precise output, skip connections are used. In detail, each upscaled feature map generated by a transposed convolutional layer is summed with the output of the corresponding convolutional layer. Formally,
\overline{TConv}_k = TConv_{k-1} + Conv_{k-1},
where \overline{TConv}_k is the input of the k-th transposed convolutional layer, and TConv_{k−1} and Conv_{k−1} are the output of the (k−1)-th transposed convolutional layer and the output of the corresponding convolutional layer, respectively.
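A minimal sketch of Equation (2) follows (assumed layer widths, not the paper's exact configuration): the output of a transposed convolutional layer is summed with the encoder feature map of matching resolution before being passed to the next decoder layer.

```python
# Hedged sketch of a single-stream encoder-decoder with one skip connection.
import torch
import torch.nn as nn

k = 3                                                    # stream kernel size (3, 5, or 7)
conv1 = nn.Conv2d(3, 16, k, stride=2, padding=k // 2)    # encoder layer 1
conv2 = nn.Conv2d(16, 32, k, stride=2, padding=k // 2)   # encoder layer 2
tconv1 = nn.ConvTranspose2d(32, 16, 2, stride=2)         # decoder layer 1
tconv2 = nn.ConvTranspose2d(16, 8, 2, stride=2)          # decoder layer 2

x = torch.rand(1, 3, 128, 128)
e1 = conv1(x)        # 1 x 16 x 64 x 64
e2 = conv2(e1)       # 1 x 32 x 32 x 32
d1 = tconv1(e2)      # upscaled back to 1 x 16 x 64 x 64
d1 = d1 + e1         # skip connection: sum with the corresponding encoder output
out = tconv2(d1)     # 1 x 8 x 128 x 128
```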
After obtaining the segmentation masks from each stream, these are merged at the pixel level to create a single mask containing all of the information extracted by each stream:
M = \mathrm{merge}(M_3, M_5, M_7),
where M is the resulting mask, and M_i with i = (3, 5, 7) is the mask generated by the stream having a kernel size equal to i × i. In the last layer, instead of a dense layer, convolutions with a size of 1 × 1 are used to map the output of the last decoder layer to the corresponding number of classes.
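The merge operator in Equation (3) is a pixel-level fusion of the three stream outputs; the sketch below assumes an element-wise maximum purely for illustration, followed by the 1 × 1 classification convolution described above.

```python
# Hedged sketch: pixel-level merge of the three stream outputs and 1x1 classification.
import torch
import torch.nn as nn

num_classes = 5                                        # hypothetical number of classes
classifier = nn.Conv2d(8, num_classes, kernel_size=1)  # 1x1 conv in place of a dense layer

# Per-stream decoder outputs (same spatial size and channel count in this sketch).
m3, m5, m7 = (torch.rand(1, 8, 128, 128) for _ in range(3))

merged = torch.maximum(torch.maximum(m3, m5), m7)      # merge(M3, M5, M7), assumed as max
logits = classifier(merged)                            # N x num_classes x H x W
pred = logits.argmax(dim=1)                            # per-pixel class map
```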

3.2. Loss Function

Concerning the model loss function, two approaches were possible. The first was to compute one loss function for each stream, and the second was to compute a single loss function for the entire model.
By following the first approach, we could have had a situation in which one or more streams did not converge by the end of the training process, leading to poor model performance. In addition, with three different losses, it would not have been possible to know the general performance of the model, but only the performance of the single streams. To ensure the rapid convergence of the model and to have a general overview of its performance, we used the second approach. In detail, the loss function was computed after merging the masks of the three streams into the final mask.
The loss function chosen for our model is the cross-entropy loss [43], which consists of the softmax and the negative log likelihood loss functions. The softmax is needed to normalize the predictions given by the model in the range [0, 1], while the negative log likelihood is used to compute the loss value. Let us consider two vectors, p and t, which are the vector containing the normalized predictions of our model and the ground-truth labels, respectively, i.e., the correct classes for the analyzed data. The cross-entropy loss function can be defined as
L = -t \cdot \log(p),
where L is the loss value, and · denotes the inner product. The smaller the loss value, the closer the prediction is to the ground-truth label, and hence the better the model.
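The composition described above can be checked directly; the following sketch (using random tensors) shows that the cross-entropy loss on the merged output equals the negative log likelihood applied to the log-softmax of the same predictions.

```python
# Hedged sketch: cross-entropy = log-softmax + negative log likelihood.
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5, 128, 128)             # per-pixel class scores of the merged mask
target = torch.randint(0, 5, (1, 128, 128))      # ground-truth class indices

loss_ce = F.cross_entropy(logits, target)
loss_nll = F.nll_loss(F.log_softmax(logits, dim=1), target)
assert torch.allclose(loss_ce, loss_nll)         # identical values, two formulations
```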

3.3. Training Strategy

The model is trained by following a fine-tuning approach because the UMCD was designed for change detection tasks; although heterogeneous objects appear within the several videos, they are sparsely distributed along the flying path. By using only such objects for model training, there would be a high probability of overfitting during the process.
Hence, we first trained the model on another semantic segmentation dataset, namely, the Cityscape dataset [44]. In detail, we used the fine annotations, i.e., high-quality dense pixel annotations, contained in the dataset to perform the first training. Then, to correctly segment the objects of the UMCD, we replaced the classification layers of the MAGI trained on the Cityscape dataset with new layers having the number of classes of the UMCD dataset. In this way, we only had to retrain the classification layers instead of the whole model, avoiding the overfitting problem.
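A minimal sketch of this fine-tuning step is given below; the attribute name classifier is an assumption standing in for the final 1 × 1 classification convolution, and the rest of the pretrained network is frozen so that only the new head is retrained on the UMCD classes.

```python
# Hedged sketch: freeze pretrained weights and replace the classification head.
import torch.nn as nn

def adapt_to_umcd(model: nn.Module, umcd_classes: int) -> nn.Module:
    for p in model.parameters():
        p.requires_grad = False                 # freeze all Cityscape-pretrained layers
    # 'classifier' is a hypothetical name for the final 1x1 convolutional layer.
    in_channels = model.classifier.in_channels
    model.classifier = nn.Conv2d(in_channels, umcd_classes, kernel_size=1)
    return model                                # only model.classifier remains trainable
```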

4. Experimental Results

This section presents the results obtained with the proposed MAGI model for the semantic segmentation task. The used dataset is first described; then, implementation details and the obtained results are discussed.

4.1. Dataset

To test the goodness of the proposed MAGI model, we used the recently released UMCD (http://www.umcd-dataset.net/ (accessed on 4 October 2021)) dataset. The dataset was chosen for the following reasons. First, to the best of our knowledge, this is the only drone dataset that contains videos acquired at very low altitudes. For this work, this is crucial, since with such images it is possible to better evaluate the performed segmentation. Second, all persons and objects present within the videos were acquired with a nadir view, allowing for a better test of the pyramidal approach.
In detail, the UMCD consists of 50 aerial video sequences acquired at different flight altitudes, ranging from 6 to 15 m. The videos were collected in different environments, namely, dirt, countryside, and urban, and at different daytime periods, i.e., morning and afternoon. The drone flight telemetry is provided with the videos. Two drones were used to perform the acquisitions, namely, a custom homemade hexacopter and a DJI Phantom 3 Advanced. These drones allowed for the acquisition of videos with a spatial resolution ranging from 720 × 540 to 1920 × 1080. Lastly, several elements are present within the videos, such as persons, suitcases, cars, and tires, which were used as the classes to detect in our context.

4.2. Evaluation Metric

Regarding the evaluation metric, the intersection over union (IoU) in pixels was used in our experiments. We preferred the IoU over pixel accuracy since the latter, despite being easy to compute, suffers from the class imbalance problem. The UMCD dataset is very imbalanced because the several elements are very sparse along the drone flight path. Figure 3 shows an example of the failure of the pixel accuracy metric on a UMCD frame. The IoU is more reliable, and it is defined as the overlapping area between the predicted segmentation and the ground truth divided by the area of their union. Formally,
IoU = \frac{\text{Area of Overlap}}{\text{Area of Union}}.
Since there were several classes to segment within the image, the mean IoU was given by computing the IoU of each class and then averaging them.
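The metric can be summarized by the following sketch, which computes the IoU of each class present in a frame (intersection of predicted and ground-truth pixels over their union) and averages the per-class values; the random masks and class count are placeholders.

```python
# Hedged sketch of the per-class IoU and its mean over the classes present in the frame.
import torch

def mean_iou(pred: torch.Tensor, target: torch.Tensor, num_classes: int) -> float:
    ious = []
    for c in range(num_classes):
        p, t = (pred == c), (target == c)
        union = (p | t).sum().item()               # Area of Union for class c
        if union == 0:                             # class absent from both masks: skip
            continue
        ious.append((p & t).sum().item() / union)  # Area of Overlap / Area of Union
    return sum(ious) / len(ious)

# Example with random masks over 5 hypothetical classes.
pred = torch.randint(0, 5, (128, 128))
target = torch.randint(0, 5, (128, 128))
print(mean_iou(pred, target, 5))
```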

4.3. Implementation Details

The model was implemented in Python with the PyTorch framework, and it was trained with a learning rate of 0.001 and the AdamW optimizer. The machine used consisted of an AMD Ryzen 1700 CPU, 16 GB of DDR4 RAM, a 500 GB solid-state drive, and an NVidia RTX 2080 GPU.
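For completeness, the stated training configuration corresponds to a setup such as the following sketch; the tiny stand-in network and random tensors are placeholders for MAGI and the UMCD data, used only to keep the snippet self-contained.

```python
# Hedged sketch of the stated setup: PyTorch, AdamW optimizer, learning rate 0.001.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Conv2d(3, 5, kernel_size=1)                      # stand-in for the MAGI network
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001)

images = torch.rand(2, 3, 64, 64)                           # placeholder image batch
masks = torch.randint(0, 5, (2, 64, 64))                    # placeholder ground-truth masks

optimizer.zero_grad()
loss = F.cross_entropy(model(images), masks)                # single illustrative step
loss.backward()
optimizer.step()
```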

4.4. Segmentation Performance Evaluation

To the best of our knowledge, no other works have performed aerial semantic segmentation on the UMCD. Hence, the obtained results provide a baseline for future works on aerial semantic segmentation on this dataset.
The proposed method generally performed very well. Concerning the tests on the Cityscape dataset, we obtained an IoU of 84.5%, which is in line with the current literature. Regarding the tests on the UMCD, the proposed model obtained an IoU of 90.3%. Thanks to the three streams working in parallel, higher segmentation accuracy was obtained. Let us consider the example shown in Figure 4, and let us consider Figure 4a as the input image to MAGI. For the sake of a clearer explanation, we refer to the 3 × 3, 5 × 5, and 7 × 7 streams as Streams 1–3, respectively. The output of Stream 1, shown in Figure 4b, provided the most refined mask. Although this stream alone could be used to perform satisfactory segmentation (see the results reported in Section 4.5), in certain cases, the obtained segmentation mask was too refined, meaning that the outer shape of the object may not be included in the mask. At this point, the less-refined segmentation masks generated by Streams 2 and 3, shown in Figure 4c,d, were used to include the missing parts of the object in the final segmentation mask. The result of this operation is shown in Figure 4e.
Table 1 shows segmentation examples in which the objects were correctly segmented most of the time. However, there were some false positives, as shown in the last row of Table 1. In this specific example, the problem was related to the shadow of the person, which was segmented in the same way as the person itself because the Cityscape dataset contains persons acquired with a horizontal view. The projected shadow of the person contained in the UMCD strongly resembles the persons in the Cityscape dataset, leading to wrong segmentation.

4.5. Ablation Study

This section presents an ablation study that highlights the significance and the effectiveness of the three streams composing the proposed model.
As the baseline model, we consider the one composed of only the 3 × 3 stream (Stream 1). Such a model is a simple fully convolutional network with skip connections to improve the output of the transposed convolutions in the decoder part. In this form, the model scored 85% on the UMCD and 62.3% on Cityscape. While performance was still good on the UMCD, the same cannot be said for Cityscape. When using only the streams with the larger kernels, i.e., 5 × 5 (Stream 2) and 7 × 7 (Stream 3), performance drastically decreased. As explained in Section 3, larger kernels perform coarser segmentation, and skip connections alone are not sufficient for refining the segmentation. When using only the 5 × 5 kernel, the model achieved an IoU of 79.2% on the UMCD and 55% on Cityscape; with the 7 × 7 kernel, we obtained 77.4% on the UMCD and 51.7% on Cityscape.
Then, we tested MAGI with different pairs of streams, namely, Streams 1 and 2, Streams 1 and 3, and Streams 2 and 3. By using the (1,2) pair, we achieved an IoU of 88.6% on the UMCD and 64.5% on Cityscape, while with the (1,3) pair, the model scored 86.5% on the UMCD and 63.1% on Cityscape. Lastly, with the (2,3) pair, we achieved an IoU of 80.1% on the UMCD and 57.2% on Cityscape. As expected, the best results were obtained with the pair having the smallest kernel sizes, i.e., pair (1,2), which allowed for extracting both the inner part and the border of the several classes. The worst results were obtained with the pair having the two largest kernel sizes, i.e., pair (2,3). Despite obtaining the worst result among the pairs, this combination still improved over the single 5 × 5 and 7 × 7 streams. Table 2 summarizes the results obtained in the ablation study.

5. Discussions

5.1. Considerations on Execution Time and Energy Consumption

In our study, the model was executed on a ground station receiving the video stream sent from the drone; the station's specifications are reported in Section 4.3. The advantages of this approach are twofold. The first advantage concerns the speed of execution: with the used hardware, we could achieve real-time performance, making the model suitable for time-critical operations such as fire detection and search and rescue. The second advantage regards drone energy consumption. Modern drones use optimized protocols for sending video data with very low latency and a very high bit rate (e.g., 100 Mb/s). With this available bandwidth, it is possible to completely move the computation to the ground station. In order to execute deep-learning algorithms on drones, additional hardware such as embedded computers may be required, which can be powered in two ways: by using a dedicated battery or by using a single battery for both the drone and the embedded computer. In the first case, the added battery for powering the embedded computer increases the payload weight. This means that, for example, if the drone must fly at a specific altitude, the rotors must spin at a higher speed than they would without a payload, thus reducing the duration of the drone battery. In the second case, a battery with a greater capacity, and often a larger size, is needed due to the power demand of both the drone and the embedded computer. In this case as well, the payload weight increases, thus affecting power consumption and thereby flight duration.

5.2. Considerations on Model Robustness

The robustness of the model was tested on the UMCD dataset, which is, to the best of our knowledge, the only public dataset providing acquisitions performed at very low altitudes (i.e., below 50 m). In addition, the dataset provides videos in which the flight altitude is not constant, allowing for testing segmentation at different scales. In fact, we obtained an IoU (in pixels) of 90.3% on the UMCD, highlighting that the pyramidal approach of the MAGI is capable of properly handling the different scales resulting from changes in flight altitude.
Despite the very promising IoU obtained, the model can be improved to correctly deal with some specific cases. As an example, consider the third row of Table 1, in which shadows were segmented as objects belonging to the scene. The first training of the model was performed on the Cityscape dataset, which contains segmentation masks of persons acquired with a frontal view. Since the shadows of the subjects in the UMCD dataset resemble the shape of the subjects in the Cityscape dataset, they were segmented as persons.
In general, depending on how shadows are projected onto the ground, the model may wrongly segment them. To overcome this limitation of the MAGI, further investigation on how to properly handle shadows will be conducted.

6. Conclusions

Due to improvements in hardware manufacturing and their low cost, drones are used in several heterogeneous tasks such as search and rescue, border control, and environmental monitoring. This has led to the design and implementation of novel algorithms for object classification, object detection, and segmentation to best perform such tasks. In addition, in the context of aerial images, parameters such as the angle of view and different flight altitudes must be considered. In this paper, a novel deep-learning multistream aerial segmentation of ground images (MAGI) model was proposed. The model, based on a fully convolutional approach, exploits three parallel branches, each with a different convolutional kernel size, allowing for improved segmentation of images at different scales. The experiments performed on the UMCD show the effectiveness of our method, and since no other works have used this dataset for segmentation tasks from drones, the results can be used as a baseline for future works.

Author Contributions

Conceptualization, D.A. and D.P.; Methodology, D.P.; Software, D.P.; Supervision, D.A.; Validation, D.A.; Writing—original draft, D.A. and D.P.; Writing—review and editing, D.A. and D.P. The authors contributed equally to this work. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Not applicable.

Acknowledgments

This work was partially supported by MIUR under the Departments of Excellence 2018–2022 grant of the Department of Computer Science of Sapienza University.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Bhamare, A.R.; Katharguppe, S.; Silviya Nancy, J. Deep Neural Networks for Lie Detection with Attention on Bio-signals. In Proceedings of the 2020 7th International Conference on Soft Computing Machine Intelligence (ISCMI), Stockholm, Sweden, 14–15 November 2020; pp. 143–147.
  2. Avola, D.; Cinque, L.; Foresti, G.L.; Pannone, D. Automatic Deception Detection in RGB Videos Using Facial Action Units. In Proceedings of the 13th International Conference on Distributed Smart Cameras, Trento, Italy, 9–11 September 2019; pp. 1–6.
  3. Gogate, M.; Adeel, A.; Hussain, A. Deep learning driven multimodal fusion for automated deception detection. In Proceedings of the 2017 IEEE Symposium Series on Computational Intelligence (SSCI), Honolulu, HI, USA, 27 November–1 December 2017; pp. 1–6.
  4. Avola, D.; Bernardi, M.; Cinque, L.; Foresti, G.L.; Massaroni, C. Adaptive bootstrapping management by keypoint clustering for background initialization. Pattern Recognit. Lett. 2017, 100, 110–116.
  5. He, W.; Kim, Y.K.W.; Ko, H.L.; Wu, J.; Li, W.; Tu, B. Local Compact Binary Count Based Nonparametric Background Modeling for Foreground Detection in Dynamic Scenes. IEEE Access 2019, 7, 92329–92340.
  6. Avola, D.; Cinque, L.; Foresti, G.L.; Massaroni, C.; Pannone, D. A keypoint-based method for background modeling and foreground detection using a PTZ camera. Pattern Recognit. Lett. 2017, 96, 96–105.
  7. Ye, M.; Shen, J.; Lin, G.; Xiang, T.; Shao, L.; Hoi, S.H. Deep Learning for Person Re-identification: A Survey and Outlook. IEEE Trans. Pattern Anal. Mach. Intell. 2021, 1–20.
  8. Avola, D.; Cinque, L.; Fagioli, A.; Foresti, G.L.; Pannone, D.; Piciarelli, C. Bodyprint—A Meta-Feature Based LSTM Hashing Model for Person Re-Identification. Sensors 2020, 20, 5365.
  9. Chen, Y.; Zhu, X.; Gong, S. Person Re-identification by Deep Learning Multi-scale Representations. In Proceedings of the 2017 IEEE International Conference on Computer Vision Workshops (ICCVW), Venice, Italy, 22–29 October 2017; pp. 2590–2600.
  10. Robin, C.; Lacroix, S. Multi-robot target detection and tracking: Taxonomy and survey. Auton. Robot. 2016, 40, 729–760.
  11. Avola, D.; Cinque, L.; Foresti, G.L.; Pannone, D. Visual Cryptography for Detecting Hidden Targets by Small-Scale Robots. In Pattern Recognition Applications and Methods; Springer: Cham, Switzerland, 2019; pp. 186–201.
  12. Akbari, Y.; Almaadeed, N.; Al-maadeed, S.; Elharrouss, O. Applications, databases and open computer vision research from drone videos and images: A survey. Artif. Intell. Rev. 2021, 54, 3887–3938.
  13. Avola, D.; Cinque, L.; Foresti, G.L.; Pannone, D. Homography vs similarity transformation in aerial mosaicking: Which is the best at different altitudes? Multimed. Tools Appl. 2020, 79, 18387–18404.
  14. Yuan, C.; Liu, Z.; Zhang, Y. Fire detection using infrared images for UAV-based forest fire surveillance. In Proceedings of the 2017 International Conference on Unmanned Aircraft Systems (ICUAS), Miami, FL, USA, 13–16 June 2017; pp. 567–572.
  15. Akhloufi, M.A.; Couturier, A.; Castro, N.A. Unmanned Aerial Vehicles for Wildland Fires: Sensing, Perception, Cooperation and Assistance. Drones 2021, 5, 15.
  16. Sudhakar, S.; Vijayakumar, V.; Sathiya Kumar, C.; Priya, V.; Ravi, L.; Subramaniyaswamy, V. Unmanned Aerial Vehicle (UAV) based Forest Fire Detection and monitoring for reducing false alarms in forest-fires. Comput. Commun. 2020, 149, 1–16.
  17. Półka, M.; Ptak, S.; Kuziora, Ł. The Use of UAV’s for Search and Rescue Operations. Procedia Eng. 2017, 192, 748–752.
  18. Weldon, W.T.; Hupy, J. Investigating Methods for Integrating Unmanned Aerial Systems in Search and Rescue Operations. Drones 2020, 4, 38.
  19. de Alcantara Andrade, F.A.; Reinier Hovenburg, A.; Netto de Lima, L.; Dahlin Rodin, C.; Johansen, T.A.; Storvold, R.; Moraes Correia, C.A.; Barreto Haddad, D. Autonomous Unmanned Aerial Vehicles in Search and Rescue Missions Using Real-Time Cooperative Model Predictive Control. Sensors 2019, 19, 4067.
  20. Avola, D.; Foresti, G.L.; Martinel, N.; Micheloni, C.; Pannone, D.; Piciarelli, C. Aerial video surveillance system for small-scale UAV environment monitoring. In Proceedings of the 2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), Lecce, Italy, 29 August–1 September 2017; pp. 1–6.
  21. Gonzalez, L.F.; Montes, G.A.; Puig, E.; Johnson, S.; Mengersen, K.; Gaston, K.J. Unmanned Aerial Vehicles (UAVs) and Artificial Intelligence Revolutionizing Wildlife Monitoring and Conservation. Sensors 2016, 16, 97.
  22. Avola, D.; Cinque, L.; Fagioli, A.; Foresti, G.L.; Pannone, D.; Piciarelli, C. Automatic estimation of optimal UAV flight parameters for real-time wide areas monitoring. Multimed. Tools Appl. 2021, 80, 25009–25031.
  23. Schmuck, P.; Chli, M. Multi-UAV collaborative monocular SLAM. In Proceedings of the 2017 IEEE International Conference on Robotics and Automation (ICRA), Singapore, 29 May–3 June 2017; pp. 3863–3870.
  24. Avola, D.; Cinque, L.; Fagioli, A.; Foresti, G.L.; Massaroni, C.; Pannone, D. Feature-Based SLAM Algorithm for Small Scale UAV with Nadir View. In Image Analysis and Processing—ICIAP 2019; Springer: Cham, Switzerland, 2019; pp. 457–467.
  25. Zhao, J.; Zhang, X.; Gao, C.; Qiu, X.; Tian, Y.; Zhu, Y.; Cao, W. Rapid Mosaicking of Unmanned Aerial Vehicle (UAV) Images for Crop Growth Monitoring Using the SIFT Algorithm. Remote Sens. 2019, 11, 1226.
  26. Avola, D.; Foresti, G.L.; Martinel, N.; Micheloni, C.; Pannone, D.; Piciarelli, C. Real-Time Incremental and Geo-Referenced Mosaicking by Small-Scale UAVs. In Image Analysis and Processing—ICIAP 2017; Springer: Cham, Switzerland, 2017; pp. 694–705.
  27. Mittal, P.; Singh, R.; Sharma, A. Deep learning-based object detection in low-altitude UAV datasets: A survey. Image Vis. Comput. 2020, 104, 104046.
  28. Walambe, R.; Marathe, A.; Kotecha, K. Multiscale Object Detection from Drone Imagery Using Ensemble Transfer Learning. Drones 2021, 5, 66.
  29. Yeom, S. Moving People Tracking and False Track Removing with Infrared Thermal Imaging by a Multirotor. Drones 2021, 5, 65.
  30. Avola, D.; Cinque, L.; Diko, A.; Fagioli, A.; Foresti, G.L.; Mecca, A.; Pannone, D.; Piciarelli, C. MS-Faster R-CNN: Multi-Stream Backbone for Improved Faster R-CNN Object Detection and Aerial Tracking from UAV Images. Remote Sens. 2021, 13, 1670.
  31. Cao, F.; Bao, Q. A Survey On Image Semantic Segmentation Methods With Convolutional Neural Network. In Proceedings of the 2020 International Conference on Communications, Information System and Computer Engineering (CISCE), Kuala Lumpur, Malaysia, 3–5 July 2020; pp. 458–462.
  32. Avola, D.; Cinque, L.; Foresti, G.L.; Martinel, N.; Pannone, D.; Piciarelli, C. A UAV Video Dataset for Mosaicking and Change Detection From Low-Altitude Flights. IEEE Trans. Syst. Man Cybern. Syst. 2020, 50, 2139–2149.
  33. Long, J.; Shelhamer, E.; Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Boston, MA, USA, 7–12 June 2015; pp. 3431–3440.
  34. Hota, M.; Rao, B.S.; Kumar, U. Power Lines Detection and Segmentation In Multi-Spectral Uav Images Using Convolutional Neural Network. In Proceedings of the 2020 IEEE India Geoscience and Remote Sensing Symposium (InGARSS), Ahmedabad, India, 1–4 December 2020; pp. 154–157.
  35. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. In Medical Image Computing and Computer-Assisted Intervention—MICCAI 2015; Springer: Cham, Switzerland, 2015; pp. 234–241.
  36. Li, X.; Jiang, Y.; Peng, H.; Yin, S. An aerial image segmentation approach based on enhanced multi-scale convolutional neural network. In Proceedings of the 2019 IEEE International Conference on Industrial Cyber Physical Systems (ICPS), Taipei, Taiwan, 6–9 May 2019; pp. 47–52.
  37. Farhangfar, S.; Rezaeian, M. Semantic Segmentation of Aerial Images using FCN-based Network. In Proceedings of the 2019 27th Iranian Conference on Electrical Engineering (ICEE), Yazd, Iran, 30 April–2 May 2019; pp. 1864–1868.
  38. Li, B.; Chen, C.; Dong, S.; Qiao, J. Transmission line detection in aerial images: An instance segmentation approach based on multitask neural networks. Signal Process. Image Commun. 2021, 96, 116278.
  39. Tian, T.; Chu, Z.; Hu, Q.; Ma, L. Class-Wise Fully Convolutional Network for Semantic Segmentation of Remote Sensing Images. Remote Sens. 2021, 13, 3211.
  40. Liu, Y.; Zhu, Q.; Cao, F.; Chen, J.; Lu, G. High-Resolution Remote Sensing Image Segmentation Framework Based on Attention Mechanism and Adaptive Weighting. ISPRS Int. J. Geo-Inf. 2021, 10, 241.
  41. Mboga, N.; Grippa, T.; Georganos, S.; Vanhuysse, S.; Smets, B.; Dewitte, O.; Wolff, E.; Lennert, M. Fully convolutional networks for land cover classification from historical panchromatic aerial photographs. ISPRS J. Photogramm. Remote Sens. 2020, 167, 385–395.
  42. Bay, H.; Ess, A.; Tuytelaars, T.; Van Gool, L. Speeded-Up Robust Features (SURF). Comput. Vis. Image Underst. 2008, 110, 346–359.
  43. Goodfellow, I.; Bengio, Y.; Courville, A. Deep Learning; MIT Press: Cambridge, MA, USA, 2016.
  44. Cordts, M.; Omran, M.; Ramos, S.; Rehfeld, T.; Enzweiler, M.; Benenson, R.; Franke, U.; Roth, S.; Schiele, B. The Cityscapes Dataset for Semantic Urban Scene Understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Las Vegas, NV, USA, 26 June–1 July 2016; pp. 1–11.
Figure 1. Detailed architecture of the MAGI model, designed as a three-stream fully convolutional model. Each stream follows the encoder–decoder approach, and the kernel sizes are different for each stream, i.e., k = 3, 5, 7. Hence, the output of each stream is a segmentation mask with a different precision level, namely, M_3, M_5, and M_7. To obtain the final segmentation mask, all M_i are merged together.
Figure 2. Example of the same convolutional filter applied with different kernel sizes. (a) Input image; filters with kernel sizes (b) 3 × 3, (c) 5 × 5, and (d) 7 × 7.
Figure 3. Example of pixel accuracy metric failure. (a) Input frame; (b) correct segmentation mask; (c) wrong segmentation mask. Despite the pixel accuracy score being 93%, most elements to be segmented were missing. The high accuracy score was due to the background pixels, which constitute the majority of the pixels within the image.
Figure 4. Example of segmentation performed on the UMCD. (a) Input image; (b) segmentation performed with Stream 1 (kernel size 3 × 3); (c) segmentation performed with Stream 2 (kernel size 5 × 5); (d) segmentation performed with Stream 3 (kernel size 7 × 7); (e) segmentation mask obtained by merging the masks of the several streams.
Table 1. Obtained results on the UMCD. First column, input images for MAGI; second column, segmentation masks obtained with the proposed method; third column, ground-truth masks.
Table 2. Summary of the IoUs obtained during the tests performed in the ablation study.

Streams | IoU on UMCD | IoU on Cityscapes
Stream 1 | 85% | 62.3%
Stream 2 | 79.2% | 55%
Stream 3 | 77.4% | 51.7%
Streams (1,2) | 88.6% | 64.5%
Streams (1,3) | 86.5% | 63.1%
Streams (2,3) | 80.1% | 57.2%
Full model | 90.3% | 84.5%
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
