Article

A Deep Learning-Enhanced Multi-Modal Sensing Platform for Robust Human Object Detection and Tracking in Challenging Environments

1 Intelligent Fusion Technology, Inc., Germantown, MD 20874, USA
2 MOVEJ Analytics, Dayton, OH 45324, USA
* Author to whom correspondence should be addressed.
Electronics 2023, 12(16), 3423; https://doi.org/10.3390/electronics12163423
Submission received: 7 July 2023 / Revised: 6 August 2023 / Accepted: 9 August 2023 / Published: 12 August 2023

Abstract

In modern security situations, tracking multiple human objects in real-time within challenging urban environments is a critical capability for enhancing situational awareness, minimizing response time, and increasing overall operational effectiveness. Tracking multiple entities enables informed decision-making, risk mitigation, and the safeguarding of civil-military operations to ensure safety and mission success. This paper presents a multi-modal electro-optical/infrared (EO/IR) and radio frequency (RF) fused sensing (MEIRFS) platform for real-time human object detection, recognition, classification, and tracking in challenging environments. By utilizing different sensors in a complementary manner, the robustness of the sensing system is enhanced, enabling reliable detection and recognition results across various situations. Specifically designed radar tags and thermal tags can be used to discriminate between friendly and non-friendly objects. The system incorporates deep learning-based image fusion and human object recognition and tracking (HORT) algorithms to ensure accurate situation assessment. After integrating into an all-terrain robot, multiple ground tests were conducted to verify the consistency of the HORT in various environments. The MEIRFS sensor system has been designed to meet the Size, Weight, Power, and Cost (SWaP-C) requirements for installation on autonomous ground and aerial vehicles.

1. Introduction

Autonomous vehicles, including unmanned aerial vehicles (UAVs) [1,2,3] and unmanned ground vehicles (UGVs) [4], have found extensive applications in agriculture [5], data acquisition [6], and search and rescue due to their mobility and operational simplicity. One significant capability desired in these search and surveillance scenarios is the ability of autonomous vehicles to recognize human subjects’ actions and respond accordingly. Electro-optical (EO) cameras have become essential tools on UAV and UGV platforms to enhance situational awareness, perform object detection, and enable efficient tracking capabilities. Cameras provide valuable visual information that aids in various applications, including search and rescue operations, surveillance missions, and security monitoring.
However, recognizing human objects from videos captured by a mobile platform presents several challenges. The articulated structure and range of possible poses of the human body make human object recognition and tracking (HORT) a complex task. Humans exhibit diverse movements and postures, making it difficult for an autonomous system to accurately recognize and track them in video footage. Additionally, the quality of the captured videos further complicates the recognition and classification process. Videos may suffer from perspective distortion, occlusion (when parts of the human body are obstructed), motion blur, or poor visibility in adverse weather conditions like fog or rain.
Addressing these video exploitation challenges requires advanced computer vision and image processing techniques. Researchers and engineers employ various approaches, including deep learning (DL)-based methods [7], to develop algorithms capable of robust HORT. These HORT algorithms leverage DL models trained on extensive datasets to recognize and track human subjects despite the sensor, environment, and object challenges. They take into account pose variations, occlusion handling, motion estimation, and visibility enhancement techniques to improve accuracy and reliability. To quickly respond to variations, deployed systems seek sensor fusion on edge platforms [8] or with communication support from fog services [9].
Although DL-based human object detection methods have shown great potential for improving detection accuracy with high-quality images, they may encounter difficulties in challenging environments where image quality is significantly degraded. These environments include scenarios such as completely dark tunnels or situations with limited visibility due to heavy fog. In such challenging conditions, the performance of DL algorithms can be hindered. The lack of sufficient illumination in dark environments can lead to poor image quality, making it challenging for the algorithms to extract relevant features and accurately detect human objects. Similarly, heavy fog or other atmospheric conditions can cause image distortion, reduced contrast, and blurred edges, further impeding the performance of the detection algorithms.
To address these environmental limitations of video exploitation, researchers are exploring alternative sensing technologies and approaches that can complement or enhance DL-based methods [10]. For example, thermal imaging sensors, such as infrared (IR) cameras, can operate effectively in low-light or no-light environments by detecting the heat signatures emitted by objects, including humans [11]. IR sensing allows for improved object detection even in complete darkness or challenging weather conditions. Radar [12,13,14] and LiDAR technologies [15,16,17] also play a significant role in various tasks related to object detection. Particularly, the emergence of radar sensors holds great promise for real-time human object detection applications in moving vehicles [18,19], multimodal systems [20,21], and passive sensing [22,23]. In comparison to visible cameras, radar and LiDAR sensors may have limitations in providing detailed texture information about the viewing scene. Consequently, the lack of texture information makes it challenging to utilize radar and LiDAR effectively for tasks such as human object detection and recognition.
Beyond the challenges of detection alone, HORT faces an even greater challenge when it comes to identifying (e.g., friendly/non-friendly) and continuously tracking human subjects in videos captured from platforms such as UAVs. While human detection focuses on locating individuals in a single frame, tracking requires maintaining their classification and trajectory across multiple frames. Tracking algorithms must address the inherent difficulties of accurate and reliable human detection, as errors or inaccuracies in the initial detection phase can propagate and adversely affect the tracking process. Tracking algorithms also need to handle situations where detection may fail or produce false positives, leading to incorrect associations or track drift. To improve object detection and tracking performance at the edges of the sensing range, Meng et al. [24] propose HYDRO-3D, an approach that combines object detection features from V2X-ViT with historical tracking information.
This paper presents the development of a multi-modal EO/IR and RF fused sensing (MEIRFS) platform for real-time human object detection, recognition, and tracking in challenging environments. By utilizing different sensors in a complementary manner, the robustness of the sensing system is enhanced, enabling reliable detection results across various situations. The system incorporates DL-based image fusion and HORT algorithms to ensure accurate detection and tracking of human objects. The sensor system consists of two subsystems: (1) the EO/IR subsystem, which detects and locates human objects in line-of-sight (LOS) scenarios and continuously tracks the selected objects; and (2) the RF subsystem, featuring a linear frequency modulated continuous wave (LFMCW) ranging radar and smart antenna, designed to detect and locate both LOS and non-line-of-sight (NLOS) friendly objects. These two sensor subsystems have been successfully integrated, establishing communication between the sensor system and the host computer and thereby enabling real-time HORT of friendly human objects.
The major contributions of this paper are: (1) identifying the appropriate sensors that can provide required information in different situations; (2) building the hardware prototype of the proposed sensor system with both hardware integration and software implementation; and (3) verifying the effectiveness of the sensor system with both indoor and outdoor experiments. By employing state-of-the-art sensors and well-tested DL-enhanced algorithms, a robust and reliable sensor system for real-time human target detection, identification, and tracking was successfully demonstrated.

2. System Architecture

Figure 1 illustrates the complete structure of the MEIRFS sensor system designed for human object detection, recognition, and tracking. The edge platform (UAV or UGV) is equipped with all the required sensors for detecting and continuously tracking human objects. These sensors include the ranging radar, EO/IR camera, laser range finder, differential barometer, and a pan/tilt platform. Additionally, the known friendly object (designated as blue in Figure 1) is equipped with an IR emitter and an RF transponder, enabling easy recognition by the MEIRFS system amidst all the detected human objects. Ideally, it would be desirable to have a solution that can accurately detect and identify friendly human objects without the need for additional tags or markers. However, in practical scenarios, it is challenging, if not impossible, to find a single method that can work effectively in all situations. For instance, visible light imaging can provide valuable color and feature patterns that can be used to differentiate between unknown friend and foe objects in well-illuminated environments. However, this approach may not be effective in low-light or dark environments. In such cases, additional measures are necessary to promptly and correctly identify friendly human objects.
To address this challenge, the use of tags or markers becomes essential. By incorporating such tags, the detection and recognition of friendly forces can be enhanced, facilitating effective communication and decision-making in challenging scenarios. Employing tags or markers for friendly human object detection can strengthen security operations by enabling efficient coordination among personnel, ultimately enhancing the overall effectiveness and safety of challenging missions.

2.1. Radio Frequency (RF) Subsystem

The RF subsystem comprises an LFMCW ranging radar, with the transceiver located on the platform and the transponder situated on the friendly object. Additionally, a smart antenna is positioned on the platform side. The LFMCW transceiver, illustrated in Figure 2a, consists of an LFMCW transmitter, an LFMCW receiver with frequency/range scanning capability, and a signal processor. The RF system incorporates a smart antenna capable of estimating the angle between the platform and the friendly object. The smart antenna achieves a measurement accuracy of 0.8° and effectively suppresses multipath signals reflected from the ground, walls, and ceilings. Figure 2b displays the radar transponder situated on the friendly object side. The entire radar subsystem underwent testing in an indoor environment, as depicted in Figure 2c, which showcases the measured distance between the platform and the friendly object. The results demonstrate the consistent detection and accurate distance measurement capabilities of the MEIRFS self-developed radar subsystem.
To enhance the signal-to-noise ratio (SNR) and range detection, several techniques were implemented (illustrated by the code sketch after this list):
  • The RF signals were sampled multiple times, typically eight samples, and Fast Fourier Transform (FFT) calculations were performed on each sample. The results were then averaged, improving the SNR and extending the detection range;
  • Because hardware gain responses vary across the baseband spectrum, it was necessary to determine the local signal noise floor as a reference. Comparing the real signal with the local noise floor, instead of the entire baseband noise floor, yields accurate detection;
  • Local averaging windows were utilized to establish the appropriate reference level, contributing to improved detection accuracy.
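As a concrete illustration of the averaging and local noise-floor referencing described above, the following minimal Python sketch processes a batch of baseband sweeps. The array shapes, window length, and threshold margin are illustrative assumptions, not the deployed radar firmware.

```python
import numpy as np

def detect_range_bins(baseband, win=16, margin_db=12.0):
    """Average FFTs over repeated sweeps and flag bins above a local noise floor.

    baseband: (n_sweeps, n_samples) array of LFMCW beat-signal samples
              (e.g., eight sweeps, as in the text). Window length and margin
              are illustrative assumptions.
    """
    # 1) FFT each sweep, then average magnitudes to improve the SNR.
    spectra = np.abs(np.fft.rfft(baseband, axis=1))
    avg_spectrum = spectra.mean(axis=0)

    # 2) Estimate a *local* noise floor with a sliding averaging window,
    #    rather than one floor for the whole baseband.
    kernel = np.ones(win) / win
    local_floor = np.convolve(avg_spectrum, kernel, mode="same")

    # 3) Declare detections where the averaged spectrum exceeds the local
    #    floor by a fixed margin (in dB).
    snr_db = 20.0 * np.log10((avg_spectrum + 1e-12) / (local_floor + 1e-12))
    return np.flatnonzero(snr_db > margin_db)

# Usage: bins = detect_range_bins(samples); each bin maps to a range cell.
```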
The current radar range cutoff stands at just over 27 m. If required, parameters can be adjusted to enable a longer detection range. The distance measurement update rate is set at 7 times per second. At this refresh rate, the average current draw is 700 mA at 6 V. The refresh rate can be increased if certain radar functions are not turned off between each update to conserve power.
The capabilities of the MEIRFS RF subsystem were tested and verified in both outdoor open environments and wooded areas. Furthermore, it was confirmed that the RF subsystem consistently detects the distance of human objects equipped with radar transponders, even through multiple drywalls.

2.2. EO/IR Subsystem

The EO/IR subsystem comprises an EO camera, an IR camera, a laser rangefinder situated on the platform side, a controllable IR emitter on the friendly object side, and a pan/tilt platform. Within the subsystem, the EO camera utilizes a 3D stereo camera for visible image acquisition and depth sensing, while the long-wavelength IR camera is employed for thermal detection. Two options for IR cameras are available, allowing for interchangeability to accommodate different detection ranges. Both options have undergone comprehensive testing and successful implementation.
Aligned with the viewing direction of the IR camera, the laser rangefinder is capable of measuring distances up to 100 m. The IR subsystem consistently distinguishes between LOS friendly and non-friendly objects by analyzing the IR signal emitted from the IR emitter equipped by the friendly object.
The hardware arrangement of the IR subsystem is depicted in Figure 3a. Both the IR camera and the laser rangefinder are aligned to point in the same direction and are mounted on the pan/tilt platform, allowing for rotation in various directions. The laser rangefinder is utilized to measure the distance of the object located at the center of the IR image’s field of view. As shown in Figure 3b, the process begins with the capture of the first image at time t1 from the IR camera, which detects the human object. The object’s position within the IR image’s field of view is then calculated. Subsequently, the lateral angle position α and the vertical angle position φ of the object relative to the IR camera’s pointing direction can be determined. These calculated angle positions are then sent to the pan/tilt platform, which adjusts the IR subsystem’s orientation to center the object within the IR camera’s field of view. Thus, at time instant t2, the distance of the object can be measured using the laser rangefinder. Figure 3c presents the flowchart illustrating the working principle of the EO/IR subsystem, highlighting its functionality in detecting, tracking, and measuring the distance of the object of interest.
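The angle computation described above can be summarized in a few lines. The linear pixel-to-angle mapping and the field-of-view values below are assumptions for illustration (e.g., the Boson 320's 34° horizontal FOV), not the exact pan/tilt controller firmware.

```python
def pixel_to_pan_tilt(cx, cy, width, height, hfov_deg, vfov_deg):
    """Convert an object's pixel position in the IR image to pan/tilt offsets.

    (cx, cy): object center in pixels; width/height: image size in pixels;
    hfov/vfov: camera field of view in degrees. A simple linear small-angle
    model is assumed here for illustration.
    """
    alpha = (cx - width / 2.0) / width * hfov_deg    # lateral angle offset
    phi = (cy - height / 2.0) / height * vfov_deg    # vertical angle offset
    return alpha, phi

# Example: object detected at pixel (260, 100) in a 320 x 256 IR frame
# (vertical FOV of ~27 deg is an assumption derived from the aspect ratio).
alpha, phi = pixel_to_pan_tilt(260, 100, 320, 256, 34.0, 27.0)
# The pan/tilt stage then rotates by these offsets to center the object.
```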

2.2.1. Electro-Optical (EO) Camera

The 3D stereo camera from Stereolabs is used as the EO camera for both visible image acquisition and depth sensing. The camera offers advanced depth sensing capabilities and is widely used for applications such as robotics, virtual reality, autonomous navigation, and 3D mapping. Some key features of the Stereolabs 3D camera include a high-resolution (1920 × 1080 pixels) visible image, depth sensing, real-time 3D mapping, and a comprehensive software development kit (SDK).
In our specific application, we utilize the image captured by the left camera of the 3D stereo camera as the EO image. The left image serves as the basis for human object detection and tracking using visible light. By leveraging the visible light spectrum, one can benefit from the detailed texture information and visual cues present in the EO image, enabling accurate detection and tracking of human subjects.
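A minimal acquisition sketch is shown below, assuming the Stereolabs ZED Python SDK (pyzed); the resolution and depth settings are illustrative and may differ from the deployed configuration.

```python
import pyzed.sl as sl

zed = sl.Camera()
init_params = sl.InitParameters()
init_params.camera_resolution = sl.RESOLUTION.HD1080  # 1920 x 1080, as used here
init_params.depth_mode = sl.DEPTH_MODE.ULTRA          # illustrative depth setting

if zed.open(init_params) == sl.ERROR_CODE.SUCCESS:
    runtime = sl.RuntimeParameters()
    left = sl.Mat()
    if zed.grab(runtime) == sl.ERROR_CODE.SUCCESS:
        # Only the left view is used as the EO image for detection and tracking.
        zed.retrieve_image(left, sl.VIEW.LEFT)
        frame = left.get_data()  # numpy BGRA image for downstream processing
    zed.close()
```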

2.2.2. Infrared (IR) Camera

The IR subsystem incorporates two different IR cameras for varying human object detection ranges: the 9640P IR camera from ICI and the Boson 320 IR camera from Teledyne. The selection and testing of these cameras were performed to adapt to different detection requirements.
The short-range Boson 320 IR camera boasts a compact size of 21 × 21 × 11 mm and weighs only 7.5 g. It is equipped with a 6.3 mm lens and offers a horizontal field of view (FOV) of 34°. This camera is capable of detecting human objects up to a range of 25 m. It features exceptional thermal sensitivity, equal to or less than (≤) 20 mK, and an upgraded automatic gain control (AGC) filter that enhances scene contrast and sharpness in all environments. With a fast frame rate of up to 60 Hz, it enables real-time human object detection. The image resolution of this camera is 320 × 256 pixels, and the image stream is transferred in real-time from the camera to the host PC via a universal serial bus (USB) cable.
On the other hand, the long-range ICI 9640p is a high-quality thermal-grade IR camera with an image resolution of 640 × 512 pixels. It utilizes a 50 mm athermalized lens, providing a FOV of 12.4° × 9.3°, and has a total weight of 230 g. This ICI IR camera achieves a detection range exceeding 100 m. The maximum frame rate supported by this camera is 30 Hz.
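As a rough sanity check of these detection ranges (an illustrative back-of-the-envelope estimate, not a vendor specification), the field of view and pixel count can be converted into the number of pixels a person subtends at a given distance:

```python
import math

def pixels_on_target(target_width_m, range_m, hfov_deg, h_pixels):
    """Approximate pixels subtended by a target of given width at a given range."""
    scene_width = 2.0 * range_m * math.tan(math.radians(hfov_deg / 2.0))
    return target_width_m / scene_width * h_pixels

# Boson 320 (34 deg HFOV, 320 px): a ~0.5 m-wide torso at 25 m
print(pixels_on_target(0.5, 25.0, 34.0, 320))   # ~10 pixels
# ICI 9640p (12.4 deg HFOV, 640 px): the same torso at 100 m
print(pixels_on_target(0.5, 100.0, 12.4, 640))  # ~15 pixels
```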
By incorporating both the Boson 320 and the ICI 9640p cameras into the IR subsystem, the MEIRFS system can adjust to different detection ranges, ensuring flexibility and adaptability in various scenarios.

2.2.3. Laser Rangefinder

To overcome the limitation of the IR camera in measuring the distance of detected objects, we integrated a laser rangefinder, the SF30/C, from Lightware into our system. The laser rangefinder is specifically designed to provide accurate distance measurements. It is aligned with the viewing direction of the IR camera, and both devices are mounted on a rotary stage. The collocated configuration ensures that the laser rangefinder is always directed towards the center of the IR camera’s field of view (FOV).
When a human object of interest is detected in the FOV, the rotary stage automatically adjusts the orientation of the IR subsystem to the center of the object, affording the precise position of the object relative to the platform of the sensor system. By combining the information from the IR camera, which provides the location of the object, and the laser rangefinder, which provides the distance measurement, MEIRFS can accurately determine the spatial coordinates of the human object in real-time.
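Combining the pan/tilt angles with the rangefinder distance yields the object's position relative to the sensor platform. The sketch below uses a simple spherical-to-Cartesian conversion; the axis and sign conventions are assumptions for illustration.

```python
import math

def to_platform_coords(range_m, pan_deg, tilt_deg):
    """Convert (range, pan, tilt) to Cartesian coordinates in the platform frame.

    Convention assumed here: x forward, y left, z up; pan positive to the left,
    tilt positive upward.
    """
    pan, tilt = math.radians(pan_deg), math.radians(tilt_deg)
    x = range_m * math.cos(tilt) * math.cos(pan)
    y = range_m * math.cos(tilt) * math.sin(pan)
    z = range_m * math.sin(tilt)
    return x, y, z

# Example: a person detected 18 m away, 5 deg to the left and 2 deg above level
print(to_platform_coords(18.0, 5.0, 2.0))
```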

2.3. Sensor System Integration

The proposed MEIRFS system is designed to be versatile and applicable to both UAVs and UGVs for various tasks. In this paper, we demonstrate the successful integration and mounting of the MEIRFS system onto an all-terrain robot platform to conduct ground tests.
By deploying the MEIRFS system on a UGV, the performance and capabilities are evaluated in real-world scenarios encountered by ground-based robotic platforms. The all-terrain robot platform provides a suitable environment for testing the MEIRFS system’s functionalities, such as human object detection, recognition, and tracking. These tests help validate the effectiveness and robustness of the MEIRFS system in different sensor, environment, and object operational conditions.
The MEIRFS integration onto the all-terrain robot platform enables us to assess the MEIRFS system’s performance in practical ground-based applications, paving the way for potential deployment on both UAVs and UGVs for diverse tasks such as surveillance, search and rescue, and security operations.

2.3.1. Hardware System Assembly

As shown in Figure 4, the MEIRFS system includes the following components:
(1) The radar sensor developed by Intelligent Fusion Technology, Inc., Germantown, MD, USA;
(2) The 3D EO camera from Stereolabs, San Francisco, CA, USA;
(3) The FLIR Boson 320 IR camera from Teledyne, Wilsonville, OR, USA;
(4) The SF30/C laser rangefinder from Lightware, Boulder, CO, USA;
(5) The HWT905 inertial measurement unit (IMU) sensor from Wit-Motion, Shenzhen, Guangdong, China;
(6) The X-RSW series motorized rotary stage from Zaber, Vancouver, BC, Canada;
(7) The IG42 all-terrain robot from SuperDroid Robots, Fuquay-Varina, NC, USA.
To ensure an organized and compact design, all the cables of the MEIRFS system are carefully managed and extended to the interior of the robot. Inside the robot, two 12 V batteries are utilized to generate a 24 V DC power supply, which is required for operating both the rotary stage and the robot’s wheels.
In terms of connectivity, a single USB cable is all that is necessary to establish communication between the MEIRFS system and the host computer. The USB cable connects to a USB hub integrated into the robot, facilitating seamless communication between the host computer and all the sensors as well as the rotary stage. By consolidating the cables and employing a simplified connection scheme, the MEIRFS system ensures efficient and streamlined communication, minimizing clutter, and simplifying the setup process. The organized arrangement enhances the overall functionality and practicality of the system during operation.

2.3.2. Software Package

To facilitate user control and provide a comprehensive display of the detection results, a graphical user interface (GUI) software package was developed. The MEIRFS GUI software serves as a centralized platform for communication and control between the host computer and all the hardware devices in the sensor system.
The GUI software, illustrated in Figure 5, enables seamless communication and data exchange with the various components of the sensor system. The GUI acts as a user-friendly interface for controlling and configuring the system, as well as displaying key data and detection results in a clear and organized manner. Through the GUI software, users can conveniently interact with the sensor system, adjusting settings, initiating detection processes, and monitoring real-time data. The software provides an intuitive and efficient means of accessing and managing the functionalities of the MEIRFS system. Specifically, the GUI software has been developed with the following capabilities:
(1) Display the image acquired from the EO/IR cameras;
(2) Configure the machine learning model for human object detection;
(3) Receive and display the measurement results from the IMU sensor;
(4) Receive and display the measurement results from the laser rangefinder;
(5) Send control commands to the rotary stage;
(6) Receive and display the measurement results from the radar; and
(7) Perform real-time object tracking to follow the object of interest.
The measurement results from the various sensors in the MEIRFS system are transmitted to the host computer at different data update rates. To ensure accurate tracking of the object, these measurements are synchronized within the GUI software to calculate the object’s position. In the MEIRFS system, the IR camera plays a crucial role in human object detection, recognition, and tracking. Therefore, the measurements from other sensors are synchronized with the update rate of the IR camera. During our testing, the real-time human object detection process achieved a continuous frame rate of approximately 35 frames per second (fps) when the laptop computer (equipped with an Intel Core i9-11900H CPU and Nvidia RTX-3060 laptop GPU) was connected to a power source. When the laptop computer operated solely on battery, the frame rate reduced to about 28 fps. Each time a new frame of the IR image is received in the image acquisition thread, the software updates the measured data from all the sensors. This synchronization ensures that the measurement results from different sensors are aligned with the latest IR image frame, providing accurate and up-to-date information for human object detection and tracking.
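The latest-value synchronization described above can be sketched as follows. The thread structure and sensor interfaces (get_frame, process_frame, and the sensor names) are hypothetical placeholders, not the actual device drivers.

```python
import threading

class SensorState:
    """Holds the most recent reading from each slower sensor."""
    def __init__(self):
        self.lock = threading.Lock()
        self.latest = {"imu": None, "radar": None, "laser": None}

    def update(self, name, value):
        # Called by each sensor's own reader thread at its native rate.
        with self.lock:
            self.latest[name] = value

    def snapshot(self):
        with self.lock:
            return dict(self.latest)

def ir_acquisition_loop(ir_camera, state, process_frame):
    """Pair each new IR frame with the latest values from the other sensors."""
    while True:
        frame = ir_camera.get_frame()      # blocks until a new IR frame arrives
        measurements = state.snapshot()    # IMU/radar/laser readings, latest values
        process_frame(frame, measurements) # detection, tracking, geolocation
```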

3. Enhancing the MEIRFS System with Deep Learning Methods

3.1. Deep Learning-Based Algorithm for Human Object Detection

After evaluating various DL-based object detection algorithms suitable for real-time applications [25,26], we selected the open-source YOLOv4 (You Only Look Once) detector [7] as the tool for EO/IR image analysis in human object detection. The YOLOv4 detector is recognized as one of the most advanced DL algorithms for real-time object detection. It employs a single neural network to process the entire image, dividing it into regions and predicting bounding boxes and probabilities for each region. These bounding boxes are weighted based on the predicted probabilities.
The YOLOv4 model offers several advantages over classifier-based systems. It considers the entire image during testing, leveraging global context to enhance its predictions. Unlike systems such as the region-based convolutional neural network (R-CNN), which require thousands of network evaluations for a single image, YOLOv4 makes predictions in a single evaluation, making it remarkably fast. In fact, it is over 1000 times faster than R-CNN and 100 times faster than Fast R-CNN [7].
To ensure the YOLOv4 detector’s effectiveness in different scenarios, we gathered more than 1000 IR images, encompassing various cases, as depicted in Figure 6. Additionally, we considered scenarios where only a portion of the human body was within the IR camera’s field of view, such as the lower body, upper body, right body, and left body. Once the raw IR image data was annotated, both the annotated IR images and their corresponding annotation files were used as input for training the YOLOv4 model. The pre-trained YOLOv4 model, initially trained with the Microsoft Common Objects in Context (COCO) dataset, served as the starting point for training with the annotated IR images.
Once the training of the YOLOv4 model was finalized, we proceeded to evaluate its performance using IR images that were not included in the training process. Figure 7 showcases the effectiveness of the trained YOLOv4 model in accurately detecting human objects across various scenarios, including:
(1) Human object detection in indoor environments;
(2) Human object detection in outdoor environments;
(3) Detection of multiple human objects within the same IR image;
(4) Human object detection at different distances; and
(5) Human object detection regardless of different human body gestures.
The trained YOLOv4 model exhibited satisfactory performance in all these scenarios, demonstrating its ability to robustly detect human objects in diverse environments and under various conditions.
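For reference, a minimal inference sketch using OpenCV's DNN module is given below. The file names, input size, and thresholds are illustrative placeholders; the deployed system may load the trained weights through a different runtime.

```python
import cv2

# Load the trained Darknet model (file names are placeholders).
net = cv2.dnn.readNetFromDarknet("yolov4-ir.cfg", "yolov4-ir.weights")
model = cv2.dnn_DetectionModel(net)
model.setInputParams(size=(416, 416), scale=1 / 255.0, swapRB=True)

ir_frame = cv2.imread("ir_frame.png")  # one IR image frame (placeholder path)
class_ids, scores, boxes = model.detect(ir_frame, confThreshold=0.5, nmsThreshold=0.4)

for cid, score, box in zip(class_ids, scores, boxes):
    x, y, w, h = box
    # Mark each detected person with a bounding box.
    cv2.rectangle(ir_frame, (x, y), (x + w, y + h), (0, 0, 255), 2)
```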

3.2. Sensor Fusion and Multi-Target Tracking

Although the IR image alone is effective for human object detection, it may not provide optimal performance in multiple human object tracking tasks due to its limited color and texture information compared to visible light images. To address this limitation and achieve accurate human object tracking in complex scenarios, images from both the IR camera and the EO camera were utilized. To enhance the features in these images, a DL-based image fusion algorithm was developed. Image fusion combines the information from the IR and EO images to create fused images that offer improved detection and tracking capabilities and enhance the tracking results in challenging situations.
This section presents the algorithms that are compatible with the MEIRFS hardware design for sensor fusion and multi-target tracking. In particular, U2Fusion [27], a unified unsupervised image fusion network, is adapted to fuse visible and infrared images and provide high-quality inputs for the downstream multi-target tracking (MTT) task, even in adversarial environments.

3.2.1. Sensor Fusion

Infrared cameras capture thermal radiation emitted by objects, while visible cameras capture the reflected or emitted light in the visible spectrum. Therefore, infrared cameras are useful for applications involving temperature detection, night vision, and identifying heat signatures [28,29]. Visible cameras, on the other hand, are commonly used for photography, computer vision, and surveillance in well-lit conditions. Both types of cameras serve distinct purposes and have their own specific applications based on the type of light they capture. Fusing these two modalities allows us to see the thermal characteristics of objects alongside their visual appearance, providing enhanced scene perception and improved object detection.
Image fusion has been an active field [30,31], and many algorithms have been developed. DL-based image fusion techniques are of particular interest to MEIRFS due to their superior performance and reduced effort for feature engineering and fusion rules. Zhang et al. [32] provide a comprehensive review of DL methods in different image fusion scenarios. In particular, DL for infrared and visible image fusion can be categorized into autoencoder (AE), convolutional neural network (CNN), and generative adversarial network (GAN)-based methods according to the deep neural network architecture. Since AE is mostly used for feature extraction and image reconstruction while GAN is often unstable and difficult to train, we consider CNN-based methods to facilitate the multi-object tracking task. To overcome the problem of lacking a universal ground truth and no-reference metric, CNN-based fusion constrains the similarity between the fused image and the source images by designing loss functions. Specifically, we adapt U2Fusion [27] for the MEIRFS system, which provides a unified framework for multi-modal, multi-exposure, and multi-focal fusion. However, U2Fusion [27] did not consider image registration, which is the first step towards image fusion. Due to differences in camera parameters such as the focal length and field of view, the images may not share the same coordinate system, and thus image registration is necessary to align and fuse the images. We calibrate the IR and visible cameras and compute the transformation matrix offline to reduce the online effort for image registration. The image registration in our work is performed by cropping the RGB image to align the FOV with the IR image based on the camera calibration for our hardware design, and it achieves effective performance. It is noted that integrating image registration into the U2Fusion model and training the integrated model in an end-to-end manner can simplify the image registration process and improve image fusion performance [33]; this will be investigated in future work. After image registration, the training pipeline of U2Fusion with aligned images is shown in Figure 8.
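The offline-calibrated registration step can be sketched as follows. The crop box (or, more generally, a homography) is assumed to come from the one-time camera calibration described above; the values shown are placeholders.

```python
import cv2

def register_visible_to_ir(rgb, ir_shape, crop_box, homography=None):
    """Align the visible (RGB) image to the IR image's field of view.

    crop_box: (x, y, w, h) region of the RGB image covering the IR FOV,
              obtained from offline calibration (placeholder values below).
    homography: optional 3x3 matrix for a full projective alignment.
    """
    if homography is not None:
        return cv2.warpPerspective(rgb, homography, (ir_shape[1], ir_shape[0]))
    x, y, w, h = crop_box
    cropped = rgb[y:y + h, x:x + w]
    # Resize so that fused images share the IR resolution.
    return cv2.resize(cropped, (ir_shape[1], ir_shape[0]))

# Example with placeholder calibration values for a 320 x 256 IR frame:
# aligned_rgb = register_visible_to_ir(rgb, (256, 320), crop_box=(480, 270, 960, 768))
```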
To preserve the critical information of a pair of source images denoted as $I_1$ and $I_2$, U2Fusion [27] minimizes the loss function defined as follows:
$$L(\theta, D) = L_{sim}(\theta, D) + \lambda L_{ewc}(\theta, D),$$
where $\theta$ denotes the parameters in DenseNet for generating the fused image $I_f$, and $D$ is the training dataset; $L_{sim}(\theta, D)$ is the similarity loss between the fused and source images; $L_{ewc}(\theta, D)$ is the elastic weight consolidation [34] term that prevents catastrophic forgetting in continual learning; and $\lambda$ is the trade-off parameter that controls the relative importance of the two parts. Additionally,
$$L_{sim}(\theta, D) = \mathbb{E}\left[\omega_1 \left(1 - S_{I_f, I_1}\right) + \omega_2 \left(1 - S_{I_f, I_2}\right)\right] + \alpha\, \mathbb{E}\left[\omega_1\, \mathrm{MSE}_{I_f, I_1} + \omega_2\, \mathrm{MSE}_{I_f, I_2}\right],$$
where $\alpha$ controls the trade-off; $S_{I_f, I_i}$ ($i = 1, 2$) denotes the structural similarity index measure (SSIM) for constraining the structural similarity between the source images $I_i$ and $I_f$; $\mathrm{MSE}_{I_f, I_i}$ ($i = 1, 2$) denotes the mean square error (MSE) for constraining the difference of the intensity distribution; and $\omega_1$ and $\omega_2$ are adaptive weights estimated based on the information measurement of the feature maps of the source images. In particular, the information measurement $g_I$ is defined as:
$$g_I = \frac{1}{5} \sum_{j=1}^{5} \frac{1}{H_j W_j D_j} \sum_{k=1}^{D_j} \left\| \phi_{C_j}^{k}(I) \right\|_F^2,$$
where $\phi_{C_j}^{k}(I)$ is the $k$-th channel of the feature map extracted by the convolutional layer of VGG16 before the $j$-th max-pooling layer, and $H_j$, $W_j$, and $D_j$ denote the height, width, and channel depth of the feature map, respectively. Moreover, the elastic weight consolidation term $L_{ewc}$ is defined as
$$L_{ewc}(\theta, D) = \frac{1}{2} \sum_i \mu_i \left(\theta_i - \theta_i^*\right)^2,$$
which penalizes the weighted squared distance between the parameter values $\theta_i$ of the current task and those of the previous task $\theta_i^*$, preventing the network from forgetting what has been learned from old tasks.
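To make the loss concrete, the following PyTorch sketch implements the similarity and EWC terms under simplifying assumptions: a uniform-window SSIM approximation and illustrative values for $\alpha$ and $\lambda$. It is a minimal sketch, not the authors' released U2Fusion code.

```python
import torch.nn.functional as F

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2, win=11):
    """Uniform-window SSIM approximation for 4D tensors (N, C, H, W)."""
    pad = win // 2
    mu_x = F.avg_pool2d(x, win, 1, pad)
    mu_y = F.avg_pool2d(y, win, 1, pad)
    var_x = F.avg_pool2d(x * x, win, 1, pad) - mu_x ** 2
    var_y = F.avg_pool2d(y * y, win, 1, pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(x * y, win, 1, pad) - mu_x * mu_y
    s = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
        ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return s.mean()

def u2fusion_style_loss(i_f, i_1, i_2, w1, w2, model, prev_params, importance,
                        alpha=20.0, lam=8e-4):
    """Weighted SSIM + MSE similarity term plus an EWC penalty.

    i_f, i_1, i_2: fused and source image tensors; w1, w2: adaptive weights;
    prev_params/importance: dicts of previous-task parameters and their weights
    (the mu_i above). alpha and lam are illustrative, not the paper's settings.
    """
    l_sim = (w1 * (1 - ssim(i_f, i_1)) + w2 * (1 - ssim(i_f, i_2))
             + alpha * (w1 * F.mse_loss(i_f, i_1) + w2 * F.mse_loss(i_f, i_2)))
    l_ewc = sum(0.5 * (importance[n] * (p - prev_params[n]) ** 2).sum()
                for n, p in model.named_parameters())
    return l_sim + lam * l_ewc
```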
To train a customized model for our system, we can fine-tune the model learned in U2Fusion using transfer learning approaches [35] with data collected by the cameras to enhance learning efficiency. Furthermore, since IR or visible images alone can be sufficient for the object tracking task under certain environmental conditions, we designed a selector switch to skip image fusion when it is not necessary for detecting the object. The mode selector is controlled manually, i.e., the operator selects the proper mode based on an assessment of the image quality of the infrared and visible images and the necessity of image fusion. In future work, we will incorporate mode selection into the U2Fusion model to select the mode automatically. Figure 9 shows the complete pipeline of image fusion processing for object tracking.

3.2.2. DL-Based Algorithm for Human Object Tracking

In certain scenarios, the human object may become lost due to inherent limitations in object detection algorithms as well as various challenging circumstances such as occlusions and fluctuations in lighting conditions. To effectively address these situations, the utilization of a human object tracking algorithm becomes necessary [36].
To optimize the tracking results, our system employs the “ByteTrack” object tracking model as the primary algorithm [37]. For effective performance, ByteTrack utilizes YOLOX as the underlying backbone for object detection [38]. Unlike traditional methods that discard detection results below a predetermined threshold, ByteTrack takes a different approach. It associates nearly all the detected boxes by initially separating them into two categories: high-score boxes, containing detections above the threshold, and low-score boxes, encompassing detections below the threshold. The high-score boxes are first linked to existing tracklets. Subsequently, ByteTrack computes the similarity between the low-score boxes and the established tracklets, facilitating the recovery of objects that may be occluded or blurred. Consequently, the remaining tracklets, which mostly correspond to background noise, are removed. The ByteTrack methodology effectively restores precise object representations while eliminating spurious background detections.
In the MEIRFS system, the fusion of IR and visible image pairs is followed by the application of the YOLOX algorithm to the fused image. This algorithm performs human object detection and generates confidence scores for the detected objects. In the presence of occlusion, priority is given to high-confidence detections, which are initially matched with the tracklets generated by the Kalman filter. Subsequently, an intersection over union (IoU) similarity calculation is utilized to evaluate the remaining tracklets and low-confidence detections. This process facilitates the matching of low-confidence detections with tracklets, enabling the system to effectively handle occlusion scenarios.
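A simplified sketch of this two-stage association (high-score detections first, then low-score detections against the remaining tracklets) is given below. The greedy IoU matcher and the thresholds are illustrative; the actual ByteTrack implementation uses Hungarian assignment and adds Kalman prediction and track management details omitted here.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1]) +
             (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / (union + 1e-9)

def byte_style_associate(track_boxes, det_boxes, det_scores,
                         high=0.6, low=0.1, min_iou=0.3):
    """Two-stage association: high-score detections first, then low-score ones."""
    matches = []
    free_tracks = set(range(len(track_boxes)))
    high_ids = [i for i, s in enumerate(det_scores) if s >= high]
    low_ids = [i for i, s in enumerate(det_scores) if low <= s < high]
    for det_group in (high_ids, low_ids):
        for d in det_group:
            if not free_tracks:
                break
            # Match each detection to the best remaining (predicted) tracklet box.
            best_t = max(free_tracks, key=lambda t: iou(track_boxes[t], det_boxes[d]))
            if iou(track_boxes[best_t], det_boxes[d]) >= min_iou:
                matches.append((best_t, d))
                free_tracks.remove(best_t)
    return matches, sorted(free_tracks)
```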

4. Experiments and Results

With the integrated sensors in the MEIRFS system, multiple ground tests have been performed in different environments to validate the performance of each individual component in the sensor system as well as the whole system’s performance for human object detection, geolocation, and LOS-friendly human object recognition.

4.1. Indoor Experiments

In Figure 10a, we tested the MEIRFS sensor system’s capability to detect and continuously track a single human object. When the human object appears in the IR camera’s field of view, it is immediately identified (marked with a red bounding box) and tracked by the sensor system.
Compared with the traditional EO camera, one advantage of the IR camera is that it can detect human objects when there is no illumination. The long-wavelength infrared (LWIR) camera detects the direct thermal energy emitted from the human body. Figure 10b shows that the MEIRFS system can function correctly even in a dark environment.
Figure 10c demonstrates the measurement accuracy of the radar subsystem. When the friendly human object is detected by the MEIRFS system, the distance to the platform is measured by both the radar subsystem and the laser rangefinder. The measurement results verified that the radar subsystem can provide accurate distance information for the friendly object, with an error of less than 0.3 m when compared with the laser rangefinder.
In the last test, as shown in Figure 10d, there are two human objects. The one holding the IR emitter (a heat source) is the friendly object. The other is the non-friendly object. The system was configured to track only non-friendly objects. When both objects came into the IR camera’s FOV, the sensor system immediately identified them and marked the friendly object with a green bounding box and the non-friendly object with a red box. Moreover, the sensor system immediately started to continuously track and follow the non-friendly object.

4.2. Outdoor Experiments

Extensive experiments were conducted to thoroughly validate the effectiveness of the MEIRFS system for multiple human object tracking in outdoor environments. These experiments were designed to assess the system’s performance and capabilities across various scenarios and conditions encountered in real-world outdoor settings.
The tracking model employed has undergone pre-training on two datasets, namely CrowdHuman [39] and MOT20 [40]. The CrowdHuman dataset is characterized by its extensive size, rich annotations, and substantial diversity, encompassing a total of 470,000 human instances across the training and validation subsets. Notably, each image within the dataset contains an average of 22.6 people, thereby exhibiting a wide range of occlusions. On the other hand, the MOT20 dataset comprises eight sequences extracted from three densely populated scenes, where the number of individuals per frame can reach 246. The pre-trained model’s exposure to such varied and challenging conditions enables it to effectively handle a wide array of real-world scenarios, leading to enhanced object tracking capabilities and more reliable results. The original model used in our research was trained on a separate system consisting of eight NVIDIA Tesla V100 GPUs with a batch size of 48. An 80-epoch training schedule was used for the MOT17 benchmark, combining the MOT17, CrowdHuman, Cityperson, and ETHZ datasets. The image size is set to 1440 × 800, with the shortest side ranging from 576 to 1024 during multi-scale training. Data augmentation includes Mosaic and MixUp. The optimizer is SGD with a weight decay of 5 × 10⁻⁴ and a momentum of 0.9. The initial learning rate is 10⁻³ with a 1-epoch warm-up and a cosine annealing schedule. For the inference stage, we performed the evaluations on an NVIDIA 2080 Ti GPU. With this configuration, we achieved 27.98 frames per second (FPS), which demonstrates the real-time capabilities of our hardware system.
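A hedged sketch of the optimizer and learning-rate schedule described above (1-epoch linear warm-up followed by cosine annealing) using standard PyTorch schedulers is shown below; the model stand-in, step counts, and warm-up factor are illustrative assumptions, not the original training script.

```python
import torch

model = torch.nn.Conv2d(3, 16, 3)   # stand-in for the detector backbone (illustrative)
steps_per_epoch, epochs = 1000, 80  # illustrative step counts

optimizer = torch.optim.SGD(model.parameters(), lr=1e-3,
                            momentum=0.9, weight_decay=5e-4)

# ~1-epoch linear warm-up, then cosine annealing for the remaining epochs.
warmup = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=0.01, total_iters=steps_per_epoch)
cosine = torch.optim.lr_scheduler.CosineAnnealingLR(
    optimizer, T_max=(epochs - 1) * steps_per_epoch)
scheduler = torch.optim.lr_scheduler.SequentialLR(
    optimizer, schedulers=[warmup, cosine], milestones=[steps_per_epoch])

# Inside the training loop: optimizer.step(); scheduler.step() per iteration.
```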
Figure 11 presents the evaluation of MEIRFS’ tracking ability, revealing noteworthy insights from the top and bottom rows of the displayed results. In these scenarios, which involve the movement of multiple individuals amidst occlusion, the MEIRFS multimodal U2Fusion tracking algorithm exhibits exceptional performance. Each individual is identified by a unique ID number and tracked using a distinct color, showcasing the algorithm’s ability to accurately track different people without experiencing any instances of object loss. As shown in Figure 11, the continuous tracking results are represented by six key image frames, which are labeled with the key frame number in time sequence at the lower left corner of each image frame. The outcome underscores the robustness and reliability of the MEIRFS tracking algorithm, particularly in challenging conditions where occlusion and the simultaneous presence of multiple objects present significant tracking difficulties.
Figure 12 illustrates the performance of the MEIRFS tracking algorithm on images captured by an IR camera, images captured by a visible camera, and the fused images obtained by sensor fusion. Analysis of the top and middle rows reveals that both scenarios encounter challenges in tracking person #1, and person #2 is incorrectly assigned as person #1, while person #1 is mistakenly considered a new individual, person #3. However, in the bottom row, following the fusion of IR and visible images, our tracking algorithm successfully tracks both person #1 and person #2, even in the presence of occlusions. The performance highlights the effectiveness of the introduced sensor fusion, which combines information from both IR and visible images. As a result, the fusion process enriches the image features available for utilization by the tracking algorithm, leading to improved tracking performance in challenging scenarios.

4.3. Discussion

To demonstrate the effectiveness of our system in tracking human subjects, we conducted an evaluation using the videos that we collected from outdoor experiments.
The results of this experiment, as presented in Table 1, showcased a mean average precision (mAP) score of 0.98, calculated at an intersection over union (IOU) threshold of 0.50. With a high mAP of 0.98, the detection algorithm demonstrates its proficiency and precision in identifying objects accurately and reliably. This achievement provides strong evidence that the algorithm is well-suited and perfectly capable of handling the unique characteristics and complexities presented by our data. Consequently, this success in accuracy lays a solid foundation for the subsequent tracking evaluation, affirming the algorithm’s competence in reliably detecting and localizing human subjects for the tracking phase.
To assess the tracking algorithm’s performance, we employed multiple object tracking accuracy (MOTA) as our evaluation metric. The MOTA metric considers three crucial aspects: the number of misses ($m_t$), the number of false positives ($fp_t$), and the number of mismatches ($mme_t$), with the total number of objects ($g_t$) included in the denominator. This comprehensive evaluation provides valuable insights into the system’s ability to accurately track human subjects over time.
$$\mathrm{MOTA} = 1 - \frac{\sum_t \left(m_t + fp_t + mme_t\right)}{\sum_t g_t}$$
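A direct implementation of this metric from per-frame counts is straightforward; the counts themselves would come from the evaluation tooling, and the frame records below are illustrative, not the reported experiment data.

```python
def mota(frames):
    """Compute MOTA from per-frame counts of misses, false positives,
    mismatches (ID switches), and ground-truth objects."""
    misses = sum(f["misses"] for f in frames)
    false_pos = sum(f["false_positives"] for f in frames)
    mismatches = sum(f["mismatches"] for f in frames)
    gt = sum(f["gt_objects"] for f in frames)
    return 1.0 - (misses + false_pos + mismatches) / gt

# Illustrative per-frame records:
frames = [
    {"misses": 0, "false_positives": 0, "mismatches": 0, "gt_objects": 3},
    {"misses": 1, "false_positives": 0, "mismatches": 0, "gt_objects": 3},
]
print(mota(frames))  # 1 - 1/6 ~= 0.833
```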
The evaluation results of the tracking algorithm are presented in Table 2. Notably, the achieved MOTA score is an impressive 0.984, indicating a remarkably high level of accuracy and performance. This outstanding MOTA score serves as compelling evidence that the tracking algorithm is exceptionally effective. With such encouraging results, we can confidently assert that the tracking algorithm is well-suited for this specific application and has the potential to significantly enhance the overall capabilities of our system. Its outstanding performance in human tracking brings us closer to achieving our system’s objectives with a high degree of precision and reliability.

5. Conclusions

This paper proposes and develops a multimodal EO/IR and RF-based sensor (MEIRFS) system for real-time human object detection, recognition, and tracking on autonomous vehicles. The integration of hardware and software components of the MEIRFS system was successfully accomplished and demonstrated in indoor and outdoor scenes with collected and common datasets. Prior to integration, thorough device functionality testing established communication between each device and the host computer. To enhance human object recognition and tracking (HORT), multimodal deep learning techniques were designed. Specifically, the “U2Fusion” sensor fusion algorithm and the “ByteTrack” object tracking model were utilized. These approaches significantly improved the performance of human object tracking, particularly in complex scenarios. Multiple ground tests were conducted to verify the consistent detection and recognition of human objects in various environments. The compact size and light weight of the MEIRFS system make it suitable for deployment on UGVs and UAVs, enabling real-time HORT tasks.
Future work includes deploying and testing the MEIRFS system on UAV platforms. Additionally, we aim to leverage the experience gained from ground tests to retrain the deep learning models using new images acquired from the EO/IR camera and a radar on the UAV. We anticipate that the MEIRFS system will be capable of performing the same tasks of human object detection, recognition, and tracking that have been validated during the ground tests.

Author Contributions

Conceptualization, P.C., Z.X. and Y.B.; methodology, P.C., Z.X. and Y.B.; software, P.C. and Z.X.; validation, P.Z., Y.Z. and Z.X.; investigation, E.B.; data curation, Z.X. and Y.B.; writing—original draft preparation, P.C.; writing—review and editing, Z.X. and E.B.; supervision, G.C.; project administration, G.C.; funding acquisition, G.C. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Data Availability Statement

Data sharing is not applicable to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Perera, F.; Al-Naji, A.; Law, Y.; Chahl, J. Human Detection and Motion Analysis from a Quadrotor UAV. IOP Conf. Ser. Mater. Sci. Eng. 2018, 405, 012003. [Google Scholar] [CrossRef]
  2. Rudol, P.; Doherty, P. Human body detection and geolocalization for uav search and rescue missions using color and thermal imagery. In Proceedings of the Aerospace Conference, Big Sky, MT, USA, 1–8 March 2008; pp. 1–8. [Google Scholar]
  3. Andriluka, M.; Schnitzspan, P.; Meyer, J.; Kohlbrecher, S.; Petersen, K.; von Stryk, O.; Roth, S.; Schiele, B. Vision based victim detection from unmanned aerial vehicles. In Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), Taipei, Taiwan, 18–22 October 2010; pp. 1740–1747. [Google Scholar]
  4. Gay, C.; Horowitz, B.; Elshaw, J.J.; Bobko, P.; Kim, I. Operator suspicion and human-machine team performance under mission scenarios of unmanned ground vehicle operation. IEEE Access 2019, 7, 36371–36379. [Google Scholar] [CrossRef]
  5. Xia, X.; Meng, Z.; Han, X.; Li, H.; Tsukiji, T.; Xu, R.; Zheng, Z.; Ma, J. An automated driving systems data acquisition and analytics platform. Transp. Res. Part C: Emerg. Technol. 2023, 151, 104120. [Google Scholar] [CrossRef]
  6. Liu, W.; Quijano, K.; Crawford, M.M. YOLOv5-Tassel: Detecting tassels in RGB UAV imagery with improved YOLOv5 based on transfer learning. IEEE J. Sel. Top. Appl. Earth Obs. Remote Sens. 2022, 15, 8085–8094. [Google Scholar] [CrossRef]
  7. Bochkovskiy, A.; Wang, C.Y.; Liao, H.Y.M. Yolov4: Optimal speed and accuracy of object detection. arXiv 2020, arXiv:2004.10934. [Google Scholar]
  8. Chen, N.; Chen, Y.; Blasch, E.; Ling, H.; You, Y.; Ye, X. Enabling Smart Urban Surveillance at The Edge. In Proceedings of the IEEE International Conference on Smart Cloud (SmartCloud), New York, NY, USA, 3–5 November 2017; pp. 109–111. [Google Scholar]
  9. Munir, A.; Kwon, J.; Lee, J.H.; Kong, J.; Blasch, E.; Aved, A.J.; Muhammad, K. FogSurv: A Fog-Assisted Architecture for Urban Surveillance Using Artificial Intelligence and Data Fusion. IEEE Access 2021, 9, 111938–111959. [Google Scholar] [CrossRef]
  10. Blasch, E.; Pham, T.; Chong, C.-Y.; Koch, W.; Leung, H.; Braines, D.; Abdelzaher, T. Machine Learning/Artificial Intelligence for Sensor Data Fusion–Opportunities and Challenges. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 80–93. [Google Scholar] [CrossRef]
  11. He, Y.; Deng, B.; Wang, H.; Cheng, L.; Zhou, K.; Cai, S.; Ciampa, F. Infrared machine vision and infrared thermography with deep learning: A review. Infrared Phys. Technol. 2021, 116, 103754. [Google Scholar] [CrossRef]
  12. Huang, W.; Zhang, Z.; Li, W.; Tian, J. Moving object tracking based on millimeter-wave radar and vision sensor. J. Appl. Sci. Eng. 2018, 21, 609–614. [Google Scholar]
  13. Van Eeden, W.D.; de Villiers, J.P.; Berndt, R.J.; Nel, W.A.J. Micro-Doppler radar classification of humans and animals in an operational environment. Expert Syst. Appl. 2018, 102, 1–11. [Google Scholar] [CrossRef]
  14. Majumder, U.; Blasch, E.; Garren, D. Deep Learning for Radar and Communications Automatic Target Recognition; Artech House: London, UK, 2020. [Google Scholar]
  15. Premebida, C.; Ludwig, O.; Nunes, U. LIDAR and vision-based pedestrian detection system. J. Field Robot. 2009, 26, 696–711. [Google Scholar] [CrossRef]
  16. Duan, Y.; Irvine, J.M.; Chen, H.-M.; Chen, G.; Blasch, E.; Nagy, J. Feasibility of an Interpretability Metric for LIDAR Data. In Proceedings of the SPIE 10645, Geospatial Informatics, Motion Imagery, and Network Analytics VIII, Orlando, FL, USA, 22 May 2018; p. 1064506. [Google Scholar]
  17. Salehi, B.; Reus-Muns, G.; Roy, D.; Wang, Z.; Jian, T.; Dy, J.; Ioannidis, S.; Chowdhury, K. Deep Learning on Multimodal Sensor Data at the Wireless Edge for Vehicular Network. IEEE Trans. Veh. Technol. 2022, 71, 7639–7655. [Google Scholar] [CrossRef]
  18. Sun, S.; Zhang, Y.D. 4D automotive radar sensing for autonomous vehicles: A sparsity-oriented approach. IEEE J. Sel. Top. Signal Process. 2021, 15, 879–891. [Google Scholar] [CrossRef]
  19. Roy, D.; Li, Y.; Jian, T.; Tian, P.; Chowdhury, K.; Ioannidis, S. Multi-Modality Sensing and Data Fusion for Multi-Vehicle Detection. IEEE Trans. Multimed. 2023, 25, 2280–2295. [Google Scholar] [CrossRef]
  20. Vakil, A.; Liu, J.; Zulch, P.; Blasch, E.; Ewing, R.; Li, J. A Survey of Multimodal Sensor Fusion for Passive RF and EO information Integration. IEEE Aerosp. Electron. Syst. Mag. 2021, 36, 44–61. [Google Scholar] [CrossRef]
  21. Vakil, A.; Blasch, E.; Ewing, R.; Li, J. Finding Explanations in AI Fusion of Electro-Optical/Passive Radio-Frequency Data. Sensors 2023, 23, 1489. [Google Scholar] [CrossRef]
  22. Liu, J.; Ewing, R.; Blasch, E.; Li, J. Synthesis of Passive Human Radio Frequency Signatures via Generative Adversarial Network. In Proceedings of the IEEE Aerospace Conference, Big Sky, MT, USA, 6–13 March 2021. [Google Scholar]
  23. Liu, J.; Mu, H.; Vakil, A.; Ewing, R.; Shen, X.; Blasch, E.; Li, J. Human Occupancy Detection via Passive Cognitive Radio. Sensors 2020, 20, 4248. [Google Scholar] [CrossRef]
  24. Meng, Z.; Xia, X.; Xu, R.; Liu, W.; Ma, J. HYDRO-3D: Hybrid Object Detection and Tracking for Cooperative Perception Using 3D LiDAR. IEEE Trans. Intell. Veh. 2023, 1–13. [Google Scholar] [CrossRef]
  25. Jiao, L.; Zhang, F.; Liu, F.; Yang, S.; Li, L.; Feng, Z.; Qu, R. A survey of deep learning-based object detection. IEEE Access 2019, 7, 128837–128868. [Google Scholar] [CrossRef]
  26. Zaidi, S.S.A.; Ansari, M.S.; Aslam, A.; Kanwal, N.; Asghar, M.; Lee, B. A survey of modern deep learning based object detection models. Digit. Signal Process. 2022, 126, 103514. [Google Scholar] [CrossRef]
  27. Xu, H.; Ma, J.; Jiang, J.; Guo, X.; Ling, H. U2Fusion: A unified unsupervised image fusion network. IEEE Trans. Pattern Anal. Mach. Intell. 2020, 44, 502–518. [Google Scholar] [CrossRef]
  28. Liu, S.; Gao, M.; John, V.; Liu, Z.; Blasch, E. Deep Learning Thermal Image Translation for Night Vision Perception. ACM Trans. Intell. Syst. Technol. 2020, 12, 1–18. [Google Scholar] [CrossRef]
  29. Liu, S.; Liu, H.; John, V.; Liu, Z.; Blasch, E. Enhanced Situation Awareness through CNN-based Deep MultiModal Image Fusion. Opt. Eng. 2020, 59, 053103. [Google Scholar] [CrossRef]
  30. Zheng, Y.; Blasch, E.; Liu, Z. Multispectral Image Fusion and Colorization; SPIE Press: Bellingham, WA, USA, 2018. [Google Scholar]
  31. Kaur, H.; Koundal, D.; Kadyan, V. Image fusion techniques: A survey. Arch. Comput. Methods Eng. 2021, 28, 4425–4447. [Google Scholar] [CrossRef]
  32. Zhang, H.; Xu, H.; Tian, X.; Jiang, J.; Ma, J. Image fusion meets deep learning: A survey and perspective. Inf. Fusion 2021, 76, 323–336. [Google Scholar] [CrossRef]
  33. Wu, Z.; Wang, J.; Zhou, Z.; An, Z.; Jiang, Q.; Demonceaux, C.; Sun, G.; Timofte, R. Object Segmentation by Mining Cross-Modal Semantics. arXiv 2023, arXiv:2305.10469. [Google Scholar]
  34. Kirkpatrick, J.; Pascanu, R.; Rabinowitz, N.; Veness, J.; Desjardins, G.; Rusu, A.A.; Milan, K.; Quan, J.; Ramalho, T.; Grabska-Barwinska, A.; et al. Overcoming catastrophic forgetting in neural networks. Proc. Natl. Acad. Sci. USA 2017, 114, 3521–3526. [Google Scholar] [CrossRef] [PubMed]
  35. Bao, Y.; Li, Y.; Huang, S.L.; Zhang, L.; Zheng, L.; Zamir, A.; Guibas, L. An information-theoretic approach to transferability in task transfer learning. In Proceedings of the 2019 IEEE International Conference on Image Processing (ICIP), Taipei, Taiwan, 22–25 September 2019; pp. 2309–2313. [Google Scholar]
  36. Xiong, Z.; Wang, C.; Li, Y.; Luo, Y.; Cao, Y. Swin-pose: Swin transformer based human pose estimation. In Proceedings of the 2022 IEEE 5th International Conference on Multimedia Information Processing and Retrieval (MIPR), Virtual, 2–4 August 2022; pp. 228–233. [Google Scholar]
  37. Zhang, Y.; Sun, P.; Jiang, Y.; Yu, D.; Weng, F.; Yuan, Z.; Luo, P.; Liu, W.; Wang, X. Bytetrack: Multi-object tracking by associating every detection box. In Proceedings of the Computer Vision–ECCV 2022: 17th European Conference, Tel Aviv, Israel, 23–27 October 2022; Proceedings, Part XXII. pp. 1–21. [Google Scholar]
  38. Ge, Z.; Liu, S.; Wang, F.; Li, Z.; Sun, J. Yolox: Exceeding yolo series in 2021. arXiv 2021, arXiv:2107.08430. [Google Scholar]
  39. Shao, S.; Zhao, Z.; Li, B.; Xiao, T.; Yu, G.; Zhang, X.; Sun, J. Crowdhuman: A benchmark for detecting human in a crowd. arXiv 2018, arXiv:1805.00123. [Google Scholar]
  40. Dendorfer, P.; Rezatofighi, H.; Milan, A.; Shi, J.; Cremers, D.; Reid, I.; Roth, S.; Schindler, K.; Leal-Taixé, L. Mot20: A benchmark for multi object tracking in crowded scenes. arXiv 2020, arXiv:2003.09003. [Google Scholar]
Figure 1. Structure of a multimodal EO/IR and RF-based sensor system.
Figure 2. LFMCW radar with a smart antenna. (a) Radar transceiver on the platform side, (b) radar transponder on the friendly object side, and (c) distance measurement in the experiment.
Figure 3. IR subsystem for human object detection and tracking. (a) IR subsystem hardware setup; (b) centering on an object of interest for distance measurement; and (c) flowchart of the working principle of the IR subsystem.
Figure 4. Assembled multimodal EO/IR and RF fused sensing system.
Figure 5. Graphical user interface (GUI) software of the MEIRFS sensor system.
Figure 6. Examples of training image data obtained in different environments.
Figure 7. Human object detection results with a trained YOLOv4 model.
Figure 8. The training pipeline of U2Fusion [27] with aligned infrared and visible images.
Figure 9. The pipeline of image processing for object tracking.
Figure 10. Experiments to demonstrate the capability of the MEIRFS sensor system. (a) Single human object tracking; (b) single human object tracking in a dark environment; (c) friendly human object tracking with the radar sensor; (d) identification of multiple human targets with the capability to discern friendly from non-friendly objects.
Figure 11. Experiments to demonstrate the capability of the MEIRFS sensor system for multiple human object tracking. Each identified human object is labeled with a unique ID number.
Figure 12. Results of the MEIRFS sensor system after applying sensor fusion. The top row shows the results on pictures captured by an IR camera, the middle row shows the results on pictures captured by a visible camera, and the bottom row shows the results on fused pictures.
Table 1. Average precision (AP) score and average recall (AR) from experiments.
Average Precision (AP) @[ IoU = 0.50:0.95 | area = all | maxDets = 100 ] = 0.794
Average Precision (AP) @[ IoU = 0.50 | area = all | maxDets = 100 ] = 0.980
Average Precision (AP) @[ IoU = 0.75 | area = all | maxDets = 100 ] = 0.905
Average Precision (AP) @[ IoU = 0.50:0.95 | area = small | maxDets = 100 ] = −1.000
Average Precision (AP) @[ IoU = 0.50:0.95 | area = medium | maxDets = 100 ] = −1.000
Average Precision (AP) @[ IoU = 0.50:0.95 | area = large | maxDets = 100 ] = 0.794
Average Recall (AR) @[ IoU = 0.50:0.95 | area = all | maxDets = 1 ] = 0.422
Average Recall (AR) @[ IoU = 0.50:0.95 | area = all | maxDets = 10 ] = 0.812
Average Recall (AR) @[ IoU = 0.50:0.95 | area = all | maxDets = 100 ] = 0.812
Average Recall (AR) @[ IoU = 0.50:0.95 | area = small | maxDets = 100 ] = −1.000
Average Recall (AR) @[ IoU = 0.50:0.95 | area = medium | maxDets = 100 ] = −1.000
Average Recall (AR) @[ IoU = 0.50:0.95 | area = large | maxDets = 100 ] = 0.812
Table 2. Evaluation results of the tracking algorithm.
                      Recall    Precision    MOTA
Evaluation Result     0.992     0.992        0.984
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
