Article

OD-XAI: Explainable AI-Based Semantic Object Detection for Autonomous Vehicles

1 Department of Computer Science and Engineering, Institute of Technology, Nirma University, Ahmedabad 382481, Gujarat, India
2 Department of Information Management, Asia Eastern University of Science and Technology, New Taipei 22064, Taiwan
3 Centre for Inter-Disciplinary Research and Innovation, University of Petroleum and Energy Studies, Dehradun 248007, Uttarakhand, India
* Authors to whom correspondence should be addressed.
Appl. Sci. 2022, 12(11), 5310; https://doi.org/10.3390/app12115310
Submission received: 3 March 2022 / Revised: 7 May 2022 / Accepted: 20 May 2022 / Published: 24 May 2022

Abstract:
In recent years, artificial intelligence (AI) has become one of the most prominent fields in autonomous vehicles (AVs). With the help of AI, the stress levels of drivers have been reduced, as most of the work is executed by the AV itself. With the increasing complexity of models, explainable artificial intelligence (XAI) techniques work as handy tools that allow naive users and developers to understand the intricate workings of deep learning models. These techniques can be applied alongside AI models to increase their interpretability. One essential task of AVs is to be able to follow the road. This paper attempts to justify how AVs can detect and segment the road on which they are moving using deep learning (DL) models. We trained and compared three semantic segmentation architectures for the task of pixel-wise road detection. Maximum IoU scores of 0.9459 and 0.9621 were obtained on the train and test sets, respectively. Such DL algorithms are called “black box models” as they are hard to interpret due to their highly complex structures. Integrating XAI enables us to interpret and comprehend the predictions of these abstract models. We applied various XAI methods and generated explanations for the proposed segmentation model for road detection in AVs.

1. Introduction

In the early days of the automobile, driving was an arduous task, as cars were essentially bigger and heavier motor-driven cycles. With advancements in technology, driving has become highly efficient and pleasant. However, with time, the frequency of accidents increased as the number of car buyers increased. Figure 1 shows a world map highlighting the road traffic death rate, per country, per 100,000 people. Countries in Africa have the highest death rates; Zimbabwe, for example, has a death rate of 61.90 per 100,000. Countries such as India, Brazil, Mongolia, etc., highlighted in green, have death rates between 29.04 and 18.30 per 100,000. The death rates in the United States, Mexico, and China, highlighted in purple, range from 18.01 to 8.17 per 100,000. Countries highlighted in grey have the lowest death rates [1]. It can be inferred from these statistics that countries on the African continent have very substantial death rates due to the lack of technological advancements in those countries. The majority of these accidents occur due to the lack of attention by drivers and their highly unpredictable mindsets [2,3]. Driving under the influence, reckless driving, drowsy driving, etc., can increase accidents.
AVs, connected automobiles, AI, personalization, and data analytics play significant roles in delivering great driving experiences. AVs are cars that can largely drive themselves with little or no human interaction [4]. The latest technology in AVs allows drivers to relax while the car conducts all the tough work. This is considerably better than manual driving because AVs are machines with specified functions. They make decisions based on the information supplied to them. The data are given to the AVs by sensors installed around the vehicles [5]. These sensors provide critical statistics, such as the distance of the AV to other vehicles in the local region, the depth field, etc. Cameras installed on AVs are sensors that function as eyes and provide vision to the vehicles. Processing the sensor data allows the AV to drive autonomously without human interaction [6].
Figure 2 displays the growing demand for AVs. The figure shows the growth of AV sales (in thousands) from 2018 to 2025. A prediction analysis was conducted (and is displayed) to foresee future AV sales trends; using these prediction results, the anticipated sales of AVs from 2026 through 2030 were obtained. These future insights into AV sales provide foresight into the technology’s projected demand. They signify that the technology caters to many people, making it increasingly necessary to efficiently control, monitor, and upgrade it. It is estimated that, by 2035, AVs will account for 25% of the worldwide commercial vehicle sector.
Although algorithms from the field of DL generate more accurate results, they are difficult for the average user to comprehend. It is vital to make DL more explainable for wider and better usage of DL as a whole. XAI is a technology that allows naive consumers to grasp the complexity of AI. One use case of XAI in AVs involves explaining the semantic segmentation predictions by evaluating the input frames that the AV observed [7]. There is not much work on semantic segmentation with XAI integrated into it, notably in AVs. Ref. [8] suggests a toolbox to build a segmented image for a given input image and a specific model. Although very accurate results are expected from such methods, not much model interpretability is gained. Motivated by this, we used XAI techniques, such as Grad-CAM and saliency maps, to explain the segmentation maps generated from DL models in AVs. Grad-CAM and saliency maps are backpropagation-based XAI techniques that generate heat maps and localize important regions in the image. We also analyzed intermediate model layers and provide explanations for them.
Figure 2. Prediction line statistics for autonomous car sales from 2018–2030 [9].

1.1. Related Works

This section explores various prior approaches to XAI-based autonomous driving. In Ref. [10], the researchers performed semantic segmentation on road images using fully convolutional networks (FCNs). Ref. [11] proposed an imitation learning-based agent equipped with attention maps. Attention maps in the proposed system provide insight into the important sections of the image. Ref. [12] proposed an approach for intelligent driving systems that uses human drivers to deliver scene forecasts to the vehicle. Ref. [13] introduced AUTO-DISCERN, which uses common sense reasoning and answer set programming to provide automated explainable decision-making for AVs. Ref. [7] proposed a system that detects malicious vehicles in vehicular ad hoc networks (VANETs). Ref. [14] used DL models to create intelligent semantic segmentation for AVs. Table 1 shows the objectives and findings of the above-mentioned studies.

1.2. Research Novelty

This section briefly describes and justifies the novelty of the work conducted; to establish the novelty of our paper, we compared it with a few related works from the literature. Kaymak et al. [10] did not offer information about the explainability of the model’s judgements. However, our approach provides detailed explanations of the outputs of semantic segmentation models by using various XAI techniques. Cultrera et al. [11] provided no intermediate model explanations to show how the model works internally. Our paper, on the other hand, provides detailed descriptions of the model’s decision-making by visualizing and explaining the outputs of XAI methods and intermediate layers. Wang et al. [12] did not explain how the decision-making in the AI system is conducted internally. In our paper, we used XAI techniques to provide insight into the system’s decision-making and its intricate workings. Kothawade et al. [13] did not provide a visual depiction of the model’s decision-making; only output text was generated. Our paper, on the other hand, provides visual explanations in the form of heat maps of real-world images, which makes it easier for a general audience to interpret. Mankodiya et al. [7] worked with numeric data in a VANET only and did not provide insight into the explainability of the model’s decision-making. Our paper works with visual data from AVs, which are independent and do not belong to VANETs, and in-depth explanations of the model’s outputs are provided. Sellat et al. [14] used CNNs, autoencoders, and other deep neural networks, which are ‘black box’ in nature; the paper lacks information about the model’s intermediate explanations. Our paper, on the other hand, provides explanations for the outputs of semantic segmentation models using different XAI techniques.

1.3. Research Contributions

The significant contributions of the paper are as follows:
  • We propose an XAI integrated AV system that improves the explainability of semantic segmentation models, which are considered black box models and are difficult to analyze and comprehend. To conduct semantic segmentation on road images in AVs, the proposed system was custom-trained on the KITTI road dataset.
  • To comprehend the model’s internal workings and which areas it concentrates on when input images are processed through the trained model, XAI methods, such as Grad-CAM and saliency maps, were used. Logical connections are derived and mapped to the application, thereby producing easy-to-understand explanations for real-world scenarios.
  • We present a novel approach to explain the intermediate model workings. It analyzes the intermediate layers of the DL model by visualizing their outputs and deriving accurate interpretations. The analysis is extensive; all the conclusions are expanded upon by offering logical links and reasons.

1.4. Contribution Mapping

This section maps the previously stated research contributions to the different sections of the paper for easy access and understanding. Table 2 maps each stated contribution to the corresponding section of the manuscript.

1.5. Paper Organization

The rest of the paper is organized as follows. Section 2 explores the paper’s preliminary topics, such as XAI and integration of XAI with semantic segmentation in AVs. The system model and problem formulation are presented in Section 3. Section 4 illustrates the proposed method. Section 5 discusses XAI and its numerous tools, as well as its integration with trained segmentation modules. Section 6 discusses the trained model’s results. Section 7 concludes the paper.

2. Preliminaries

This section provides brief insights into XAI and its integration with segmentation with AVs.

2.1. Explainable Artificial Intelligence (XAI)

The popularity of AI has soared in the past several decades. The number of articles (globally) on AI has increased more than three-fold since 2000 [15]. AVs have considerably improved in the last few years. With ML and DL approaches, AVs have become more precise and have expanded tremendously. This increasing popularity of AI has encouraged developers to design larger and more sophisticated models, which produce better results but are becoming increasingly harder to analyze and interpret. The human mind will sometimes not accept concepts that it cannot grasp. For example, with AI, if people cannot understand the “why?” and “why not?” of the AI decision-making, trust management will become a major concern. There are six types of explanations in XAI that assist in increasing the understanding of the model.
  • Global explanation: This type of explanation attempts to elaborate how models arrive at their predictions via visual charts, graphs, or images. Holistic explanations are generally obtained in techniques that perform global explanations.
  • Contrastive explanation: This type of explanation pinpoints the “why” and the “why not” of the model prediction and behaviour. Contrastive explanations are generally used to determine the minimal changes in input that can affect the final predictions.
  • Local explanation: This approach is the opposite of the global explanation. While global explanations are used to obtain a holistic representation, local explanations focus more on specific features and attributes of the model.
  • What-if explanation: This type of XAI technique is used to understand model attributes and predictions by examining how the output changes under hypothetical changes to the inputs.
  • Counterfactual explanation: This type of explanation answers “How to” arrive at the desired outcome.
  • Example-based explanation: Such a technique uses model prediction on a sample dataset to explain the general output or the underlying data distribution of the whole dataset. This explanation technique is based on the hypothesis that “similar inputs” will generate “similar outputs”.
To overcome the aforementioned issues, the notion of XAI was established. The sole purpose of XAI is to make it easy for everybody, including engineers and general users, to grasp the model and its predictions. XAI attempts to decode the exact model working and decision-making. As the parameters and layers of a model expand, it becomes more complex. These types of models are black box models [16,17]. A black box model can be viewed as a system in which an input is processed to obtain an output while the internal process itself remains abstract and unknown. To understand the backends of such models, we need to open the black box and delve inside its workings. This is exactly what XAI seeks to do.
Some AI methods that are considered black box models include the artificial neural network (ANN), tree ensemble (TE), support vector machine (SVM), and deep neural network (DNN). XAI is increasing in popularity because it solves the following three questions [18]: What drives the model predictions? Why did the model take a specific decision? How can we trust the model? The first question verifies the fairness of the model, which implies that the model is driven by features that are relevant in decision-making [18]. The second question assures accountability and dependability of the model. The model should be able to justify and validate why some of the characteristics are more influential in the model’s decision-making [18]. The third and last question assures the transparency of the model. The model should be able to defend its choice for any given data and it should be simple to grasp for users.

2.2. Integration of XAI with Semantic Segmentation in AV

Semantic segmentation allows AVs to categorize items on the road [14], whereas XAI can help justify the segmentation predictions of the AV. Because of safety issues, the usage of XAI in AVs is critical. When a car is driven autonomously, the passengers should have faith in the vehicle. Traveling will become perilous if the user does not trust the decisions made by the AV [7]. This research employed semantic segmentation to classify the image pixels into two classes: road and surroundings. The role of XAI, in this case, is to improve the interpretability of the segmentation module by adequately analyzing the segmentation output. The benefits extend to the common user and provide developers with detailed insights into the intricate workings of black box modules. A few scenarios mentioned in Table 3, which may correspond to actual events, explicitly describe the advantages that XAI integration brings to stakeholders.
There are essentially two stakeholders when it comes to AVs. They include naive users and highly-skilled software engineers as well as data scientists who participate in model building, deployment, and testing procedures. XAI integration can provide high-level insights into complicated models and performances that are intelligible to laymen, allowing AI algorithms to be explained.

3. System Model and Problem Formulation

Figure 3 displays the system model of the semantic segmentation of an input frame along with its integration with XAI for model explanation. Firstly, the AV records the frame with the help of a camera positioned on it. The input frame is then scaled, preprocessed, and made suitable for predictions. During data collection, frames are stored in a database that can subsequently be used to train the model. After the model is trained, its performance is evaluated using multiple measures. To evaluate the model, preprocessed frames are sent through it to acquire the predicted outputs. These predictions can then be explained with the help of XAI to comprehend the model’s inner workings and assess its credibility.

Problem Formulation

For AVs, it is crucial to correctly recognize and detect the objects around them to drive securely without causing havoc. It is a massive task to perform, especially for computer vision-led AVs, since all that such an AV interprets is visual information from the camera; light detection and ranging (LiDAR) [22] and other sensors are absent. Detection and correct road classification are also tough since extremely complex algorithms have to be built; a sufficient dataset is also necessary to achieve the objective. Pixel-wise object detection and categorization can be performed utilizing various DL-based semantic segmentation architectures.
These semantic segmentation models are complex to develop and train. Training such models requires datasets covering all conceivable scenarios in order to attain acceptable prediction performance. A plethora of successful DL architectures were built to handle pixel-wise object classification [23,24,25,26]. However, concerns still arise. Are these models reliable? How do they work internally? How does the outcome correlate to the real world? What is the model’s interpretation? In this research, we strived to overcome the lack of explainability in semantic segmentation models in AVs by employing a few XAI algorithms on three different existing image segmentation architectures. We trained these models on a road dataset and merged them with XAI approaches to boost model transparency. A detailed discussion of the proposed approach for tackling the problem is given in the subsequent sections.

4. The Proposed Approach

In this research, we applied semantic segmentation to the KITTI road dataset and explain the model predictions using a few XAI approaches. The dataset was preprocessed to make it fit for training the semantic segmentation models. Three models, namely, ResNet-18, ResNet-50, and SegNet, were trained on these preprocessed images to obtain the segmented outputs. The best performing model was chosen from the lot and further used for generating model explanations by passing its predictions through XAI algorithms. XAI techniques, namely Grad-CAM and saliency maps, were used to generate visual heat maps showing the important regions as localized by the model. We established a novel approach to analyze the outputs of various intermediate layers and provide explanations for them. The upcoming subsections walk through the details of the dataset, the preprocessing step, the architecture of the DL models used for training, and the XAI integration.

4.1. Dataset Description

For AVs, the road or surface on which the AV is driving is the most crucial thing that needs to be detected to drive safely. For our purpose, we used the KITTI [27] road dataset for training and testing the DL models. The KITTI Vision Benchmark Suite [28] was introduced in 2012 and upgraded with semantic segmentation ground truth [29] in 2018. The dataset consists of 289 training instances and 289 annotated photos [27]. Each road image has a resolution of 375 × 1242 × 3 and each annotated image has a resolution of 375 × 1242 × 2 (each channel in these images represents the annotation/ground truth segmentation of a single class); they were stored as PNG files.

4.2. Preprocessing

The original KITTI dataset image size was 375 × 1242 × 3 for the road images and 375 × 1242 × 2 for the annotated images. However, for our purposes, we changed the dimensions to 192 × 640 × 3 for the road images and 192 × 640 × 2 for the annotated images. This was done to save computational time and resources; an image with relatively lower resolution can significantly reduce the computational time, thereby speeding up the whole process. Along with the resizing operation, the image brightness, saturation, and hue values were also altered, as shown in Algorithm 1. The original image input stream acquired from the AV is specified by the variable road_images in Algorithm 1. These input images were conventional BGR images captured by the camera installed on the AV. The BGR color space images were converted to HSV images for easier manipulation of brightness, hue, and saturation values. The KITTI dataset with 289 images was further split into train and test sets with a 20% test size ratio. Road image shape: (289, 192, 640, 3), annotated images shape: (289, 192, 640, 2), train set shape: (231, 192, 640, 3), and test set shape: (58, 192, 640, 2).
Algorithm 1 Preprocessing the images to make them trainable.
Input: road_images
Output: processed image
  1: rand ← random()
  2: image ← road_image
  3: image ← resize(image,(192,640))
  4: hsv ← cvtColor(image,(BGR2HSV))
▹ Changing image brightness values
  5: hsv[:,:,2] += (rand-0.2) * 255.
  6: hsv[:,:,2] ← clip(hsv[:,:,2],min = 0, max = 255)
▹ Changing image saturation values
  7: hsv[:,:,1] += (rand-0.2) * 255.
  8: hsv[:,:,1] ← clip(hsv[:,:,1],min = 0, max = 255)
▹ Changing image hue values
  9: hsv[:,:,0] += (rand-0.2) * 255.
  10: hsv[:,:,0] ← clip(hsv[:,:,0],min = 0, max = 255)
  11: processed_image ← cvtColor(hsv,(HSV2BGR))
    return processed_image
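A minimal Python sketch of this preprocessing step is given below, assuming OpenCV and NumPy are available; the function name and the exact offset logic are illustrative rather than the authors' original code.

```python
import cv2
import numpy as np

def preprocess_road_image(road_image, target_size=(640, 192)):
    """Resize a BGR road image and jitter brightness, saturation, and hue,
    following Algorithm 1 (illustrative sketch, not the original code)."""
    rand = np.random.random()                       # random scalar in [0, 1)
    image = cv2.resize(road_image, target_size)     # OpenCV expects (width, height)
    hsv = cv2.cvtColor(image, cv2.COLOR_BGR2HSV).astype(np.float32)

    # Shift the V (brightness), S (saturation), and H (hue) channels by the same
    # offset, then clip back to [0, 255] as written in Algorithm 1.
    offset = (rand - 0.2) * 255.0
    for channel in (2, 1, 0):                       # V, S, H
        hsv[:, :, channel] = np.clip(hsv[:, :, channel] + offset, 0, 255)

    processed_image = cv2.cvtColor(hsv.astype(np.uint8), cv2.COLOR_HSV2BGR)
    return processed_image
```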

4.3. Image Segmentation Architectures

This section discusses the architectures of the DL models trained for image segmentation, specifically on the KITTI road dataset.

4.3.1. Residual Networks (ResNets)

ResNets are a form of DNN, introduced in 2015 [30]. Traditionally, it was believed that adding more layers to a NN would increase the accuracy; however, adding numerous layers without residual blocks either diminishes the accuracy or makes it stagnant. In residual blocks, each layer’s output is sent as input to the following layer and to a layer that is two or more hops distant [31]. The issue of accuracy becoming stagnant is attributed to the vanishing gradient. The vanishing gradient problem occurs because gradients become progressively smaller as they are propagated backward through many layers, so the early layers receive negligible weight updates. Due to this issue, the model either learns at a protracted rate or does not learn at all. Shallow networks can thus produce considerably better outcomes than deep plain networks to which more layers are added. Skip connections, which connect early layers with deeper layers in a neural network, are therefore introduced so that a deep network can behave like a shallower one. These are seen in ResNets [32]. ResNets are named residual networks because, whereas in standard NNs each layer learns only from the output of the immediately preceding layer, in ResNets layers that do not immediately precede a specific layer also influence its activation.
$R(x) = P(x) - x$
Rearranging the above equation, we have:
$P(x) = R(x) + x$
Here, x represents the activations of the immediately preceding layer, R(x) symbolizes the residual mapping, and P(x) specifies the current or present layer. The equations primarily show that the activation of the current layer is a function of the residual R(x) and x. The architectures for ResNet-18 and ResNet-50 were adopted from the [33] repository. The network employs a UNet structure with ResNet-18 or ResNet-50 encoders. Encoder–decoder model structures serve several functions in computer vision: object detection, pose estimation, and semantic segmentation, to mention a few. The functionality of encoders and decoders is relatively straightforward. Encoders gradually decrease the size of feature maps to capture higher-level semantic information, whereas decoders do exactly the opposite; they gradually increase the size of feature maps to recover the spatial information [34]. Figure 4 shows the architecture of UNet with a ResNet-18 encoder.
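For concreteness, the sketch below shows one way to assemble a UNet-style segmentation network with a ResNet encoder in TensorFlow/Keras (the framework used later for training). It is an illustrative approximation of the encoder–decoder idea, not the authors' KITTI-RoadSeg code: keras.applications ships no ResNet-18, so a ResNet-50 backbone is used here, and the decoder widths are arbitrary choices.

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_unet_resnet50(input_shape=(192, 640, 3), num_classes=2):
    """UNet-style segmentation model with a ResNet-50 encoder (illustrative sketch)."""
    encoder = tf.keras.applications.ResNet50(include_top=False,
                                             weights=None,
                                             input_shape=input_shape)
    # Encoder feature maps at decreasing spatial resolution, used as skip connections.
    skip_names = ['conv1_relu', 'conv2_block3_out',
                  'conv3_block4_out', 'conv4_block6_out']
    skips = [encoder.get_layer(name).output for name in skip_names]
    x = encoder.get_layer('conv5_block3_out').output   # deepest feature map

    # Decoder: upsample, concatenate with the matching encoder feature map, convolve.
    for skip, filters in zip(reversed(skips), [256, 128, 64, 32]):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, skip])
        x = layers.Conv2D(filters, 3, padding='same', activation='relu')(x)

    x = layers.UpSampling2D(2)(x)                       # back to the full input resolution
    outputs = layers.Conv2D(num_classes, 1, activation='softmax')(x)
    return Model(encoder.input, outputs)

model = build_unet_resnet50()
model.compile(optimizer='adam', loss='categorical_crossentropy')
```

With the 192 × 640 input used in this paper, each upsampling step lines up with the corresponding encoder feature map, and the output has shape 192 × 640 × 2, matching the two-channel annotations.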

4.3.2. SegNet

SegNet is another DNN that is employed in this paper. The architecture of the model in this study is inspired by the [34] repository. SegNet uses the capabilities of encoders and decoders. There are 13 encoder layers that correspond to the first 13 convolution layers of VGG16 [35,36]. The encoder discards the fully connected layers of VGG16, which helps to lower the overall number of parameters. For each encoder layer, a corresponding decoder uses upsampling to expand the spatial size. The decoder produces sparse, high-resolution feature maps. The last layer of the decoder features a softmax activation function that allows the categorization of each pixel [35]. Figure 5 demonstrates the SegNet architecture utilized in this paper.
Table 4 shows the total number of layers and the total number of model parameters in all three models.

4.4. Training

The ResNet-18, ResNet-50, and SegNet architectures proposed above were then fitted on the training set. The proposed architectures have distinct complexities and structures. Hence, we trained them in such a way that we acquired appropriate train and test accuracies while guaranteeing that the models did not under-fit or over-fit. The models were constructed and trained using Python and TensorFlow 2.6.0. The final training parameters of all the designs are listed in Table 5. Moreover, the specifications of the system on which the models were trained may be viewed in Table 6.

5. XAI Integration

In this section, we explore some existing XAI algorithms and integrate them with our trained segmentation models to explain the predictions and ensure that they point in the intended directions (and are not just flukes). Various XAI techniques can be implemented on the model, but choosing the most appropriate ones is important for better explanations. We also observe the outputs of various individual hidden layers of the model and attempt to generate explanations by visualizing them. Applying XAI over the DL architectures for vision in AVs can help to expose the inner details of the model.

5.1. Taxonomy of XAI

Some prior published survey papers classified XAI techniques based on scope, usage, and methodology [37,38]. This section provides a brief overview about XAI terminologies and the technique categories.
  • Scope: Scope refers to the extent of the explanation of the XAI technique. According to scope, the XAI technique can be categorized into locally or globally explainable models [38]. Locally explainable techniques generally focus on explaining individual features and attributions to a single datum. In contrast, global explanation methods focus more on obtaining the generalized view of the model.
  • Methodology: Methodology of a technique refers to the algorithmic strategy employed to obtain the explanations. It can be further categorized into two types: backpropagation-based and perturbation-based [37].
  • Usage: Use of a specific XAI algorithm can be classified into two types: model-specific and model-agnostic [38].

5.2. XAI Techniques

In this section, we present XAI approaches fitting under the above-discussed categories. We first examine those techniques briefly and then execute and analyze their outcomes.

5.2.1. Gradient Class Activation Mapping (Grad-CAM)

Grad-CAM is a backpropagation-based explainability approach that is mainly utilized for image data predictions [39]. Grad-CAM generated outputs involve localized class-specific areas in an image created within a single pass [37]. We may utilize these images and compare them with the model predictions. The heat map created as the output of the Grad-CAM unit involves the gradients that are most stressed by the model. The Grad-CAM result on the semantic segmentation model appears remarkably similar to the segmentation predictions. The reason for their similarities is the pixel-wise class localization of images. Though the Grad-CAM approach has a global scope, it may also be utilized to explain activations of every sublayer in a DNN [39]. A mathematical description of Grad-CAM is presented below [40]. Grad-CAM computes on the final convolution layer because it catches high-level information and retains critical image areas.
First, the gradient of the class score t^c with respect to the feature map F^k of a convolution layer is computed. The value of this gradient depends on the input image because the feature map F^k itself depends on the input:
$\frac{\partial t^c}{\partial F^k}$
Here, t^c is the score of class c just before the softmax layer and F indicates the last convolution layer. k indicates the feature map number, which has height h and width u. Therefore, the gradient will also have the shape (k, h, u).
The gradients are global average-pooled over the width and the height. This will lead to the neuron importance for a feature map k of class c.
$\beta_k^c = \frac{1}{S} \sum_i \sum_j \frac{\partial t^c}{\partial F_{i,j}^k}, \qquad S = u \times h$
The gradient is pooled over height h and width u, which provides us with the final shape of the β values as (k, 1, 1) or (k).
Finally, the β values are taken as weights for each corresponding feature map, and a weighted sum of these feature maps gives the final Grad-CAM value. The ReLU operation is applied over the sum to keep only the positive values.
$G^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left(\sum_k \beta_k^c A^k\right)$
The final Grad-CAM heat map has shape u × h, the shape of the last convolution layer. The resulting map is significantly smaller than the source image; hence, the heat map is subsequently upsampled to the size of the original image. The pipeline for obtaining the Grad-CAM heat map of the final convolution layer is depicted in Algorithm 2. The resulting heat map illustrates where the model looks while making a judgement.
Figure 6 illustrates the Grad-CAM backpropagation results generated on the SegNet, ResNet-18, and ResNet-50 architectures given above. The area covered by all of the Grad-CAM heat maps is virtually identical and close to the original ground truth. The difference that we can see, though, is in the distribution of gradients. Figure 6a shows the original ground-truth image from the test set. There is a notable contrast in the brightness of the pixels in Figure 6b,c. Pixel brightness is uniform for ResNet-18, suggesting equal relevance for all road segment areas in the image. In contrast, the non-uniform pixel brightness for SegNet implies differential relevance of image pixels.
On the other hand, the Grad-CAM results of ResNet-50, as shown in Figure 6d, involve feature pixels with uniform brightness. Still, as we can notice, there are tiny misclassifications in the surrounding regions. This approach can be applied to a model before its AV deployment to check its performance. Similar approaches may also be applied to models currently deployed on AVs to track their performance as fresh training instances are generated.
We partially trained a new ResNet-18 model for five epochs with randomly-initialized weights on the same KITTI dataset and obtained its Grad-CAM output. Figure 7a shows the Grad-CAM results on this partially trained model and Figure 7b shows its predicted semantic map. We can detect the erroneous predictions of the partially trained model by looking at the Grad-CAM output heat map. Comparing the Grad-CAM outputs of different segmentation models can visually help evaluate the model and strengthen the model explainability. As far as AVs are concerned, segmentation maps give spatial awareness by identifying objects. Any irregularity in predictions can be easily recognized utilizing the Grad-CAM backpropagation approach.
Algorithm 2 Algorithm for the generation of Grad-CAM predictions [39].
   Input: image and class of interest.
   Output: Grad-CAM prediction.
1: Forward propagate the image across the CNN.
2: Gradients are set to zero for all classes except the desired class, which is set to 1.
3: $\beta_k^c = \frac{1}{S} \sum_i \sum_j \frac{\partial t^c}{\partial F_{i,j}^k}$ ▹ Calculating weights for each corresponding feature map.
4: $S = u \times h$.
5: These weights are then backpropagated through the CNN.
6: The Grad-CAM prediction is produced by a weighted mixture of feature maps followed by ReLU:
7: $G^{c}_{\mathrm{Grad\text{-}CAM}} = \mathrm{ReLU}\left(\sum_k \beta_k^c A^k\right)$
8: The Grad-CAM prediction creates a heat map, which depicts where the model has to look to make the particular decision.
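The following sketch illustrates how Algorithm 2 could be implemented for a Keras segmentation model using tf.GradientTape. The class score t^c is taken here as the summed softmax output of the class of interest, and conv_layer_name must name the last convolution layer of the given model; both choices are assumptions for illustration, not the authors' exact code.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name):
    """Grad-CAM heat map for one class of a segmentation model (illustrative sketch)."""
    grad_model = tf.keras.Model(model.input,
                                [model.get_layer(conv_layer_name).output, model.output])
    with tf.GradientTape() as tape:
        feature_maps, predictions = grad_model(image[np.newaxis, ...])
        # For segmentation, sum the score of the class of interest over all pixels (t^c).
        class_score = tf.reduce_sum(predictions[..., class_index])

    grads = tape.gradient(class_score, feature_maps)           # d t^c / d F^k
    weights = tf.reduce_mean(grads, axis=(1, 2))               # beta_k^c, global average pooling
    cam = tf.nn.relu(tf.reduce_sum(weights[:, None, None, :] * feature_maps, axis=-1))

    cam = cam[0].numpy()
    cam = cam / (cam.max() + 1e-8)                             # normalize to [0, 1]
    heatmap = tf.image.resize(cam[..., None], image.shape[:2]).numpy()  # upsample to input size
    return heatmap
```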

5.2.2. Saliency Maps

A saliency map is a model output visualization tool that displays how the model localizes sections of the image. The term saliency refers to the distinctive aspects of the image; these are the features to which the model gives the most significance. The output generated by the saliency map indicates the area on which the model localizes the most [41]. The brightness of each pixel in the output is directly related to its saliency, i.e., the regions with brighter gradients are stressed more. Figure 8 demonstrates the comparison of the three different models on the same image. Figure 8a is the ground truth obtained directly from the KITTI dataset. Figure 8b is generated on the ResNet-18 model. In this image, the region encompassed by bright pixels covers quite a lot of the road in the ground truth. This illustrates that the model localizes on the road. The model concentrates on the white road markings rather than the remainder of the road. The saliency map in Figure 8c is generated on the SegNet model. SegNet can capture the road; however, it is not as precise as ResNet-18. Figure 8d is generated on ResNet-50. Even though the model focuses on the road, the region it encompasses is substantially poorer than that of the other two models. It depicts a highly generic section of the road. This can happen due to one of two causes: the model was not trained fully, or certain difficulties occurred while training the model. Algorithm 3 presents an overview of how saliency maps are constructed.
Algorithm 3 Algorithm for generating the saliency map [42].
1: Colours (C), intensity (I), and orientations (O) are extracted from the input images.
2: Color space of C → red–green–blue–yellow color space.
3: I = image converted to grayscale.
4: O is obtained using Gabor filters.
5: To create the feature maps, Gaussian pyramids are built from all images.
6: A feature map is created with respect to each of C, I, and O.
7: The saliency map is created by taking the mean of all the feature maps.
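Algorithm 3 summarizes the classical feature-based saliency pipeline of [42]. The saliency maps in Figure 8, however, are computed from the trained models, for which a gradient-based formulation in the spirit of [41] is typical. The sketch below shows such a vanilla gradient saliency map for a Keras model; the class index and the max-over-channels reduction are illustrative assumptions, not the authors' exact procedure.

```python
import numpy as np
import tensorflow as tf

def gradient_saliency(model, image, class_index=1):
    """Vanilla gradient saliency map: magnitude of the class-score gradient
    with respect to the input pixels (illustrative sketch)."""
    image_tensor = tf.convert_to_tensor(image[np.newaxis, ...], dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(image_tensor)
        predictions = model(image_tensor)
        class_score = tf.reduce_sum(predictions[..., class_index])

    grads = tape.gradient(class_score, image_tensor)[0]   # same shape as the input image
    saliency = tf.reduce_max(tf.abs(grads), axis=-1)      # max over color channels
    saliency = saliency / (tf.reduce_max(saliency) + 1e-8)
    return saliency.numpy()                               # brighter = more salient
```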

5.2.3. Observational Explanations

This method focuses on visualizing the hidden layers and their generated predictions to understand the trained models. This section investigates various layers of the trained ResNet-18 model, displays their outputs, observes the kernel weights of various hidden convolutional layers, and infers appropriate reasons. It is difficult to examine every layer and generalize the predictions. Therefore, we divided the work into two sub-processes, i.e., observing a handful of (1) initial layers and (2) deep hidden layers. We also attempt to explain how the output varies as we move further into the network and how it influences the final predictions.
  • Initial layers: ResNet-18 is a deep neural network comprising a total of 111 layers. The initial hidden layers in this design are those closer to the primary input layer. All these layers lie in the encoder component of the segmentation model. In the first few layers, hidden convolutional units tend to collect local patterns from images [43]. We attempt to assess the outputs of a few early hidden layers and develop suitable interpretations for them.
    The output of some channels of the max pooling layer, which is the seventh hidden layer of the ResNet-18 network, can be seen in Figure 9. Figure 9a is the visualization of channel 0, taking the test image in Figure 10 as the input. It can be noticed that the model stresses items other than the road itself. All vehicles in both lanes have greater gradient values, i.e., they are red. Although we are not expected to identify items other than the road, the attention to cars or other vehicles is critical for better filtering of trivial classes and, consequently, leads to better segmentation outputs for the road. While channel 0 stresses all of the objects, the gradients in channel 5, as shown in Figure 9b, are stronger for the nearby vehicles in the view. As it is more necessary to identify close vehicles and prioritize them first, the filter of channel 5 comes into action for the same.
    In Figure 9c,d, both outputs stress vertical objects, such as electric poles, trees, and other straight elevations in the image. Figure 9d does a better job; the gradients in this image are more distinguishable from the surroundings. Pole-like items in the surroundings play vital parts in the segmentation map predictions. As the vehicle approaches a pole from a distance, its relative size grows. Hence, such channels can give exact output road segmentations for AVs by precisely recognizing pole-like items. Though some channels offer a higher say for certain items in the AV vision, it cannot be totally generalized that the gradients of a single channel solely have specific filtering roles. Moreover, other channels may have filter weights that can generate comparable outputs. In Figure 11, we applied the filter weights of channel 63, layer 7, to a different test picture to view and compare the results with those shown above. Figure 11a is the image to which the prediction model is applied and Figure 11b is the intermediate filter output of layer 7, channel 63, of the ResNet-18 model.
    We also visualized layer 34 to check the extent of the feature selection in every forward pass.
    The outputs in Figure 12 suggest the generalized prediction gradients of the road and surroundings. All other objects in the original image were filtered away; only the road and surroundings were kept. Figure 12a,b are the outputs generated on channels 39 and 114 of layer 34 of the ResNet-18 model. Although the segmentation is not accurate, most of the relevant traits were already retrieved.
  • Deep hidden layers: Deep hidden layers are the ones closer to the output layers of the model. Here, we studied the deep layers of ResNet-18 residing in the decoder area of the semantic segmentation architecture. We investigated the output of a few of these layers of the model.
    Comparing the outputs of layer 34 in Figure 12 to those of layer 94 in Figure 13, we can identify substantial alterations in the projections. Layer 94 is a deep network layer close to the output layer; it does a better job at creating road semantics than layer 34. The gradients in Figure 13 are not as uniform as those of the output layer; it should be noted that they are still not smooth.
    In Figure 14b, specific stress is placed on the wheels of the vehicle. Moreover, the figure’s null gradients on the road portion also depict that the surroundings in the image are appropriately segregated from the road part. It can be interpreted that the gradients in the output of Figure 14a,c are brighter near the boundary of the road region.
In Figure 15, the ResNet-18 model outputs for deeper layers are obtained. These images are almost similar to the predictions obtained at the output layer. Figure 15a shows the predicted output map, which stresses on the background of the image and Figure 15b shows the predicted output map, which stresses on the road part of the image.

5.2.4. Comparison of XAI Techniques

Table 7 contrasts the aforementioned XAI techniques. These strategies provide an understanding of the decision-making modules in an AV to its stakeholders. The expansion of total AVs worldwide will require people to comprehend the AV working mechanisms (to trust its judgments). The XAI techniques discussed above provide simple (yet detailed) depictions of AV judgments. As a result, these XAI techniques are appropriate for autonomous driving applications.

6. Results and Discussions

Results of the proposed DL models are discussed in this section. Evaluation metrics provide data on train and test set performances and play key parts in primary model testing and diagnoses. These parameters are critical to examine before deploying the DL models into AVs. Poor performance on measures suggests a poorly trained model that cannot be deployed in the actual world. A thorough description of evaluation methods is provided in the forthcoming section.

6.1. Evaluation Metric

Evaluation metrics provide us with a clear picture about the model performance. They are calculated on train and test sets to measure the goodness of a model. Evaluation of a trained semantic segmentation model is typically conducted using two performance criteria—intersection over union (IOU) value and accuracy score.

6.1.1. Intersection over Union

IOU, also known as the Jaccard index, is the most commonly used metric for measuring the similarity between two arbitrary shapes [44]. IOU can also be used as a loss function so it can be backpropagated [45]; however, in our case, we utilize it as a metric to measure the merit of the anticipated segmentation map.
For semantic segmentation, pixel-wise mapping was done between a ground truth segmentation map (G) and a model-predicted segmentation map (P). Note: the semantic maps ‘G’ and ‘P’ are image matrices of the same shape.
$\mathrm{IoU} = \frac{|G \cap P|}{|G \cup P|}$
$\mathrm{Intersection} = |G \cap P|, \qquad \mathrm{Union} = |G \cup P|$
The intersection in Equation (7) can be taken as the number of correctly predicted positive pixel values, or true positives (TP); the union in Equation (7) refers to the collection of all the predicted as well as ground truth pixels. Thus, the IOU value can be expressed as follows:
$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$
IOU can be computed class-wise as well, since for each class label we have separate ground truth and anticipated segmentation maps. IOU values are between 0 and 1. An IOU value closer to 1 mathematically denotes an equal or nearly equal intersection and union values, suggesting that the ground truth and forecasts are comparable and closely resemble each other. In this paper, we computed IOU for the predicted segmentation maps according to Algorithm 4.
Algorithm 4 Functions used to calculate IOU in this paper.
   Input: ground_truth_images, model_predicted_images
   Output: overall IOU of the model predictions.
1: class_num ← 2
2: G ← ground_truth_images
3: P ← model_predicted_images
4: if G.shape == P.shape then
5:     IOU ← keras.MeanIoU(num_classes = class_num)
6:     IOU.update_state(G, P)
7:     IOU_overall ← IOU.result().numpy()
8: end if
9: return IOU_overall
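Algorithm 4 can be realized almost directly with tf.keras.metrics.MeanIoU, as in the sketch below. Note that MeanIoU expects integer label maps, so the one-hot ground truth and prediction maps are reduced with argmax first; this conversion is an assumed detail for illustration, not taken from the original code.

```python
import numpy as np
import tensorflow as tf

def mean_iou(ground_truth_maps, predicted_maps, num_classes=2):
    """Mean IoU over all classes, following Algorithm 4 (illustrative sketch).
    Both inputs are one-hot segmentation maps of shape (N, H, W, num_classes)."""
    assert ground_truth_maps.shape == predicted_maps.shape
    metric = tf.keras.metrics.MeanIoU(num_classes=num_classes)
    # MeanIoU expects integer class labels, so convert the one-hot maps with argmax.
    metric.update_state(np.argmax(ground_truth_maps, axis=-1),
                        np.argmax(predicted_maps, axis=-1))
    return metric.result().numpy()
```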
Another metric that can be used to evaluate the performance of a semantic segmentation model is the accuracy score.

6.1.2. Accuracy Score

Similar to IOU, the accuracy score is determined by a pixel-wise comparison of the ground truth and anticipated segmentation maps. A single accuracy score is computed for the output predictions instead of separate class-wise scores. Because only a single score is reported, a dominant class could bias the final result. Moreover, we cannot determine the model’s reliability in providing segmentation maps for each class label individually. The accuracy score is mathematically represented below.
$\text{Accuracy Score} = \frac{\sum_{i=1}^{d} TP_i}{\sum_{i=1}^{d} \left(TP_i + FP_i + FN_i\right)}$
where d is the number of classes.
IOU may be determined for the overall prediction maps as well as for each class label. Hence, it is a more dependable and superior evaluation metric than the accuracy score for assessing the model in this scenario. The accuracy score brings in the skewness of all the predicted classes, whereas the IOU for a given class is determined using the true positives of that class only. Some classes could have a higher say in determining the total accuracy, making it less reliable. A better accuracy score does not necessarily imply correct forecasts for all classes; however, a good individual class-wise IOU score does reflect accurate predictions.
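A tiny worked example with made-up numbers illustrates this point: when one class dominates, a model that never predicts the minority class can still score a high accuracy, while the class-wise IoU exposes the failure.

```python
import numpy as np
import tensorflow as tf

# Toy 10x10 segmentation: 95 "surroundings" pixels (class 0) and 5 "road" pixels (class 1).
ground_truth = np.zeros((10, 10), dtype=int)
ground_truth[0, :5] = 1
prediction = np.zeros((10, 10), dtype=int)        # model predicts "surroundings" everywhere

accuracy = np.mean(prediction == ground_truth)    # 0.95, despite missing the road entirely
iou = tf.keras.metrics.MeanIoU(num_classes=2)
iou.update_state(ground_truth, prediction)
print(accuracy, iou.result().numpy())             # road-class IoU is 0, so the mean IoU is only 0.475
```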
Now that the evaluation metrics have been discussed, the results of the trained semantic segmentation models can be seen as follows.
Table 8 shows the mean IOU scores calculated on the train and test set. ResNet-18 performed better than the other mentioned architectures. We will also see the class-wise IOU scores of the above models.
Table 9 shows the class-wise IOU values for all of the proposed models. It can be observed that the individual class IOU scores within the same model are almost equal and have very little differences. This implies that there is no partiality in the prediction of labels in any of the above-trained models.
Table 10 displays the comparisons of the model accuracy. A side-by-side comparison of the IOU values in Table 8 and the accuracy scores in Table 10 indicates considerable differences for each model. The IOU values were observed to be smaller than the accuracy scores, which implies that accuracy does bring in the bias of all classes. Hence, the IOU measure is the superior assessment metric for this task.
Figure 16 shows the results of the segmentation detection of each model. These outputs can be compared with the ground truth images to visualize the merits of each model. ResNet-18 seems to work best here, producing sharper boundaries on the road’s edges compared to scattered road boundaries in ResNet-50 and SegNet. It is important to note that the above segmentation maps were produced considering only the road masks (road class) obtained from predictions, i.e., overlapping original images with the class-road predictions. The inference time analyses of all segmentation models are shown in Figure 17. ResNet-18, depicted in the graph by the blue line, has an average inference time of roughly 0.3 s. In comparison, ResNet-50 has a mean inference time of 0.6 s. SegNet, on the other hand, requires a longer time to build segmentation maps for a single image, roughly 4.0 s. This result also demonstrates the ResNet-18 model’s superiority in terms of the prediction time when compared to the other models. One of the most significant parameters for AVs is the forecast time as a slight delay in decision-making by the AV system could have catastrophic consequences.
In this section, we presented the output and results of the trained models on the KITTI road dataset. We evaluated the model using the mean IOU metric and observed its value for all individual classes.

7. Conclusions

In this paper, we trained and assessed three current DL semantic segmentation models for road detection in AVs. It is necessary for the AI in AVs to effectively recognize roads and other surrounding objects to drive a car securely. It is vital to implement these architectures correctly, but it is also necessary to generate explanations for the decisions and the complicated inner workings of the model. Hence, to address this challenge, we integrated XAI with the trained models in this paper. We applied Grad-CAM and saliency maps to the models and described the relevance of their outcomes. We observed that the areas of an image that had more significance had brighter pixel values in the heat map. We also studied a few model sublayers to understand which layer focused on which part of the image. Although the explanations offered in this research reveal specific insights into the complex models and predictions, they do not entirely solve the black box issues of ML and DL algorithms. In the future, natural language processing will be integrated into the XAI system to provide explanations to users in textual form.

Author Contributions

Conceptualization: W.-C.H. and S.T.; writing—original draft preparation: H.M., D.J. and R.G.; methodology: R.G. and S.T.; writing—review and editing: R.S. and S.T. All authors have read and agreed to the published version of the manuscript.

Funding

This research is supported by the Ministry of Science and Technology, Taiwan (MOST 110-2410-H-161-001).

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Not applicable.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Road Traffic Accidents. Available online: https://www.worldlifeexpectancy.com/cause-of-death/road-traffic-accidents/by-country/ (accessed on 14 December 2021).
  2. Talbot, R.; Fagerlind, H.; Morris, A. Exploring inattention and distraction in the SafetyNet Accident Causation Database. Accid. Anal. Prev. 2013, 60, 445–455. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  3. Sundfør, H.B.; Sagberg, F.; Høye, A. Inattention and distraction in fatal road crashes – Results from in-depth crash investigations in Norway. Accid. Anal. Prev. 2019, 125, 152–157. [Google Scholar] [CrossRef] [PubMed]
  4. Rajasekhar, M.V.; Jaswal, A.K. Autonomous vehicles: The future of automobiles. In Proceedings of the 2015 IEEE International Transportation Electrification Conference (ITEC), Chennai, India, 27–29 August 2015; pp. 1–6. [Google Scholar] [CrossRef]
  5. Gupta, R.; Kumari, A.; Tanwar, S. Fusion of Blockchain and AI for Secure Drone Networking underlying 5G Communications. Trans. Emerg. Telecommun. Technol. 2020, 32, e4176. [Google Scholar] [CrossRef]
  6. Patel, M.M.; Tanwar, S.; Gupta, R.; Kumar, N. A Deep Learning-based Cryptocurrency Price Prediction Scheme for Financial Institutions. J. Inf. Secur. Appl. 2020, 55, 102583. [Google Scholar] [CrossRef]
  7. Mankodiya, H.; Obaidat, M.S.; Gupta, R.; Tanwar, S. XAI-AV: Explainable Artificial Intelligence for Trust Management in Autonomous Vehicles. In Proceedings of the 2021 International Conference on Communications, Computing, Cybersecurity, and Informatics (CCCI), Beijing, China, 15–17 October 2021; pp. 1–5. [Google Scholar] [CrossRef]
  8. Schorr, C.; Goodarzi, P.; Chen, F.; Dahmen, T. Neuroscope: An Explainable AI Toolbox for Semantic Segmentation and Image Classification of Convolutional Neural Nets. Appl. Sci. 2021, 11, 2199. [Google Scholar] [CrossRef]
  9. The Self-Driving Car Market Presents a Strong Opportunity for Samsung. Available online: https://www.businessinsider.com/self-driving-car-market-presents-a-strong-opportunity-for-samsung-2017-5?IR=T (accessed on 3 May 2017).
  10. Kaymak, C.; Ucar, A. Semantic Image Segmentation for Autonomous Driving Using Fully Convolutional Networks. In Proceedings of the 2019 International Artificial Intelligence and Data Processing Symposium (IDAP), Malatya, Turkey, 21–22 September 2019; pp. 1–8. [Google Scholar] [CrossRef]
  11. Cultrera, L.; Seidenari, L.; Becattini, F.; Pala, P.; Del Bimbo, A. Explaining Autonomous Driving by Learning End-to-End Visual Attention. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, Seattle, WA, USA, 14–19 June 2020. [Google Scholar] [CrossRef]
  12. Wang, C.; Weisswange, T.H.; Kruger, M.; Wiebel-Herboth, C.B. Human-Vehicle Cooperation on Prediction-Level: Enhancing Automated Driving with Human Foresight. In Proceedings of the 2021 IEEE Intelligent Vehicles Symposium Workshops (IV Workshops), Nagoya, Japan, 11–15 July 2021. [Google Scholar] [CrossRef]
  13. Kothawade, S.; Khandelwal, V.; Basu, K.; Wang, H.; Gupta, G. AUTO-DISCERN: Autonomous Driving Using Common Sense Reasoning. arXiv 2021, arXiv:2110.13606. [Google Scholar] [CrossRef]
  14. Sellat, Q.; Bisoy, S.; Priyadarshini, R.; Vidyarthi, A.; Kautish, S.; Barik, D.R. Intelligent Semantic Segmentation for Self-Driving Vehicles Using Deep Learning. Comput. Intell. Neurosci. 2022, 2022, 6390260. [Google Scholar] [CrossRef]
  15. 10 Charts That Will Change Your Perspective On Artificial Intelligence’s Growth. Available online: https://www.forbes.com/sites/louiscolumbus/2018/01/12/10-charts-that-will-change-your-perspective-on-artificial-intelligences-growth/?sh=570e3c8a4758 (accessed on 12 January 2018).
  16. Mahbooba, B.; Timilsina, M.; Sahal, R.; Serrano, M. Explainable Artificial Intelligence (XAI) to Enhance Trust Management in Intrusion Detection Systems Using Decision Tree Model. Complexity 2021, 2021, 6634811. [Google Scholar] [CrossRef]
  17. Castelvecchi, D. Can we open the black box of AI. Nature 2016, 538, 20. [Google Scholar] [CrossRef] [Green Version]
  18. The Importance of Human Interpretable Machine Learning. Available online: https://towardsdatascience.com/human-interpretable-machine-learning-part-1-the-need-and-importance-of-model-interpretation-2ed758f5f476 (accessed on 25 May 2018).
  19. Mitsuhara, M.; Fukui, H.; Sakashita, Y.; Ogata, T.; Hirakawa, T.; Yamashita, T.; Fujiyoshi, H. Embedding Human Knowledge into Deep Neural Network via Attention Map. arXiv 2019, arXiv:1905.03540. [Google Scholar] [CrossRef]
  20. Wang, H.; Abraham, Z. Concept drift detection for streaming data. In Proceedings of the 2015 international joint conference on neural networks (IJCNN), Killarney, Ireland, 12–17 July 2015; pp. 1–9. [Google Scholar]
  21. Coffin, D.; Oliver, S.; VerWey, J. Building Vehicle Autonomy: Sensors, Semiconductors, Software and US Competitiveness. Office of Industries Working Paper ID-063. November 2019. Available online: https://ssrn.com/abstract=3492778 (accessed on 2 March 2022).
  22. Li, Y.; Ibanez-Guzman, J. Lidar for Autonomous Driving: The Principles, Challenges, and Trends for Automotive Lidar and Perception Systems. IEEE Signal Process. Mag. 2020, 37, 50–61. [Google Scholar] [CrossRef]
  23. Long, J.; Shelhamer, E.; Darrell, T. Fully Convolutional Networks for Semantic Segmentation. arXiv 2015, arXiv:1411.4038. [Google Scholar]
  24. Mohan, R. Deep Deconvolutional Networks for Scene Parsing. arXiv 2014, arXiv:1411.4101. [Google Scholar]
  25. Girshick, R.; Donahue, J.; Darrell, T.; Malik, J. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv 2014, arXiv:1311.2524. [Google Scholar]
  26. Ronneberger, O.; Fischer, P.; Brox, T. U-Net: Convolutional Networks for Biomedical Image Segmentation. arXiv 2015, arXiv:1505.04597. [Google Scholar]
  27. Fritsch, J.; Kuehnl, T.; Geiger, A. A New Performance Measure and Evaluation Benchmark for Road Detection Algorithms. In Proceedings of the International Conference on Intelligent Transportation Systems (ITSC), Hague, The Netherlands, 6–9 October 2013. [Google Scholar]
  28. Geiger, A.; Lenz, P.; Urtasun, R. Are we ready for autonomous driving? The KITTI vision benchmark suite. In Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern Recognition, Providence, RI, USA, 16–21 June 2012; pp. 3354–3361. [Google Scholar] [CrossRef]
  29. Alhaija, H.A.; Mustikovela, S.K.; Mescheder, L.M.; Geiger, A.; Rother, C. Augmented Reality Meets Computer Vision: Efficient Data Generation for Urban Driving Scenes. Int. J. Comput. Vis. 2018, 126, 961–972. [Google Scholar] [CrossRef] [Green Version]
  30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. arXiv 2015, arXiv:1512.03385. [Google Scholar]
  31. Residual Blocks—Building Blocks of ResNet. Available online: https://towardsdatascience.com/residual-blocks-building-blocks-of-resnet-fd90ca15d6ec (accessed on 27 November 2018).
  32. Veit, A.; Wilber, M.; Belongie, S. Residual Networks Behave Like Ensembles of Relatively Shallow Networks. arXiv 2016, arXiv:1605.06431. [Google Scholar]
  33. KITTI RoadSeg. Available online: https://github.com/robertklee/KITTI-RoadSeg (accessed on 22 March 2021).
  34. Badrinarayanan, V.; Kendall, A.; Cipolla, R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 2017, 39, 2481–2495. [Google Scholar] [CrossRef]
  35. Review: SegNet (Semantic Segmentation). Available online: https://towardsdatascience.com/review-segnet-semantic-segmentation-e66f2e30fb96 (accessed on 10 February 2019).
  36. Simonyan, K.; Zisserman, A. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv 2015, arXiv:1409.1556. [Google Scholar]
  37. Das, A.; Rad, P. Opportunities and Challenges in Explainable Artificial Intelligence (XAI): A Survey. arXiv 2020, arXiv:2006.11371. [Google Scholar]
  38. Adadi, A.; Berrada, M. Peeking Inside the Black-Box: A Survey on Explainable Artificial Intelligence (XAI). IEEE Access 2018, 6, 52138–52160. [Google Scholar] [CrossRef]
  39. Selvaraju, R.R.; Cogswell, M.; Das, A.; Vedantam, R.; Parikh, D.; Batra, D. Grad-CAM: Visual Explanations from Deep Networks via Gradient-Based Localization. Int. J. Comput. Vis. 2019, 128, 336–359. [Google Scholar] [CrossRef] [Green Version]
  40. Grad-CAM: Visual Explanations from Deep Networks. Available online: https://glassboxmedicine.com/2020/05/29/grad-cam-visual-explanations-from-deep-networks/ (accessed on 29 May 2020).
  41. Simonyan, K.; Vedaldi, A.; Zisserman, A. Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps. arXiv 2014, arXiv:1312.6034. [Google Scholar]
  42. What Are Saliency Maps in Deep Learning? Available online: https://analyticsindiamag.com/what-are-saliency-maps-in-deep-learning/ (accessed on 11 July 2018).
  43. Wang, R. Edge Detection Using Convolutional Neural Network. In Advances in Neural Networks–ISNN 2016; Cheng, L., Liu, Q., Ronzhin, A., Eds.; Springer International Publishing: Cham, Switzerland, 2016; pp. 12–20. [Google Scholar]
  44. Rezatofighi, S.H.; Tsoi, N.; Gwak, J.; Sadeghian, A.; Reid, I.D.; Savarese, S. Generalized Intersection over Union: A Metric and A Loss for Bounding Box Regression. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, Long Beach, CA, USA, 15–20 June 2019. [Google Scholar]
  45. Yu, J.; Jiang, Y.; Wang, Z.; Cao, Z.; Huang, T.S. UnitBox: An Advanced Object Detection Network. In Proceedings of the 24th ACM International Conference on Multimedia, Amsterdam, The Netherlands, 15–19 October 2016. [Google Scholar]
Figure 1. Worldwide road traffic death rate per 100,000 people [1].
Figure 3. OD-XAI: system model.
Figure 4. ResNet architecture for semantic segmentation.
Figure 5. SegNet architecture for semantic segmentation.
Figure 6. (a) Original road image from the KITTI dataset with an overlapped semantic map. (b) Grad-CAM results of the ResNet-18 model. (c) Grad-CAM results of the SegNet model. (d) Grad-CAM results of the ResNet-50 model.
Figure 7. (a) Grad-CAM results of the partially trained ResNet-18. (b) Semantic predictions of the partially trained ResNet-18.
Figure 8. (a) Original road image from the KITTI dataset. (b) Saliency Map results of the ResNet-18 model. (c) Saliency Map results of the SegNet model. (d) Saliency Map results of the ResNet-50 model.
Figure 9. Model—ResNet-18, layer 7, total channels—64, layer type—max pooling. (a) Output of channel 0. (b) Output of channel 5. (c) Output of channel 33. (d) Output of channel 63.
Figure 10. Test image on which the layer-wise explanations are presented.
Figure 11. Model—ResNet-18, layer 7, channel 63 weights applied to the test image. (a) Test image. (b) Output of channel 63 on the test image in (a). Vertical poles and tree trunks obtain greater heat-map values, denoting higher gradient scores.
Figure 12. Model—ResNet-18, layer 34, total channels—64, type—convolutional layer. (a) Channel 39. (b) Channel 114.
Figure 13. Model—ResNet-18, layer 94, total channels—256, type—transpose convolutional layer. (a) Output of channel 16. (b) Output of channel 17.
Figure 14. Model—ResNet-18, layer 102, total channels—64, type—convolutional layer. (a) Output of channel 4. (b) Output of channel 5. (c) Output of channel 7. Smooth semantic outputs are generated at this layer.
Figure 15. Model—ResNet-18, layer 107, total channels—32, type—convolutional layer. (a) Output of channel 1. (b) Output of channel 2. The gradients are uniform and accurate; the predictions at this layer are nearly identical to those of the output layer.
Figure 16. Comparison of various semantic segmentation outputs of trained deep learning architectures with the original and ground truth images. The above outputs are obtained for four different original test images belonging to the KITTI dataset. (ad) Test images, (eh) ground truth, (il) ResNet-18, (mp) ResNet-50, and (qt) SegNet.
Figure 17. Inference time analysis of the trained models.
Table 1. Comparative analysis of various approaches for XAI-based autonomous driving.
Author | Year | Objective | Research Findings
Kaymak et al. [10] | 2019 | A fully connected network (FCN) is used to perform semantic segmentation on road images. | Does not offer information on the explainability of the model's judgments.
Cultrera et al. [11] | 2020 | An imitation-learning-based agent equipped with attention maps is proposed; the attention model is used to determine which section of the picture is most significant. | There are no intermediate model explanations describing how the model works internally.
Wang et al. [12] | 2021 | Proposed a method for a human driver to use intentional gazing to deliver scene forecasts to an intelligent driving system. | If the driver is not paying attention while driving, the system may not work correctly, because it relies significantly on the driver's gaze when making decisions. The application, being sensitive, requires the use of XAI; however, no explainability principles are employed.
Kothawade et al. [13] | 2021 | Introduced AUTO-DISCERN, an automated explainable decision-making system for AVs that combines common-sense reasoning with answer set programming. | There is no visual depiction of the model's decision-making.
Mankodiya et al. [7] | 2021 | Proposed a system to detect malicious vehicles in vehicular ad hoc networks (VANETs) that incorporates XAI to provide interpretability of the ML models. | The technique only works in VANETs, requiring the presence of two or more AVs. The proposed system is based solely on numerical data and does not consider road images, and it does not provide insight into the explainability of the model's decisions.
Sellat et al. [14] | 2022 | CNNs, autoencoders, pyramid networks, and bottleneck residual networks are used to create intelligent semantic segmentation for self-driving vehicles. | The proposed approach does not provide information on the model's interpretability.
The proposed approach | 2022 | Integration of XAI with semantic segmentation for AI systems in AVs. | The proposed approach produces visual explanations for the segmentation maps generated by the semantic segmentation models using various XAI techniques.
Table 2. Contribution mapping with respective sections in the paper.
Research Contribution No. | Mapped Sections | Section Name
1 | 3, 4, 6 | System Model and Problem Formulation; Proposed Approach; Results and Discussion
2 | 5.2.1, 5.2.2 | GradCAM; Saliency Maps
3 | 5.2.3 | Observational Explanations
Table 3. A few scenarios with and without XAI-integrated modules in AVs are compared.
Scenario | Naive User (Without XAI Integration) | Developer (Without XAI Integration) | Naive User (With XAI Integration) | Developer (With XAI Integration)
When the AV is stationary [19] | Naive users are unable to envision and comprehend the nuances of what an AV senses while stationary. | Developers cannot visualize what the model sees when the AV is motionless. A few decisions, such as turning the engine on and setting the AV in motion, moving uphill from a static state, taking a U-turn from a static state, etc., depend significantly on the input footage in AVs. This analysis is not possible without XAI integration. | Naive users can easily take control of the AV in case of any visual anomaly found in the XAI output. | When an AV is stationary, developers may readily inspect the attention maps created by XAI modules based on the input stream. Attention maps for different scenarios can be viewed, which permits better delivery of results.
In the event of a traffic accident | The user cannot obtain the reasoning behind explicit decision-making based on what the AV perceives. | Developers cannot pre-mitigate possible threats because the DL algorithm remains a black box. | The user can be presented with real-time output generated by XAI modules and can readily take control of the vehicle if necessary. | Using XAI, developers may identify the cause of the AI model's decision-making in AVs and rectify any discovered flaws.
Software maintenance | Comparative findings for a fresh software release cannot be acquired, remaining a black box to the naive user. | The deterioration of software and models caused by data drift [20] is a tough problem to identify without XAI integration. | New software patches can be distinctly analyzed and reviewed by naive users, making the process more transparent for them. | Because XAI modules are independent of the training data, they enable effective model monitoring over time to avoid and address software degradation. This permits efficient and speedier updates, sustaining the flow of CI/CD pipelines.
When the AV is in motion [21] | Although real-time semantic map predictions are easily available, the motion of an AV is heavily influenced by several physical characteristics that may go unnoticed during conventional output scrutiny and analysis. | Although real-time semantic map predictions are easily available, the motion of an AV is heavily influenced by several physical characteristics that may go unnoticed during conventional output scrutiny and analysis. | Real-time semantic map predictions and analyses can be provided with XAI integration. Semantic heat maps can be generated at various instances corresponding to various physical phenomena, such as the sudden emergence of speed breakers, signboards, the existence of potholes on the road, and so on. | Real-time semantic map predictions and analyses can be provided with XAI integration. Semantic heat maps can be generated at various instances corresponding to various physical phenomena, such as the sudden emergence of speed breakers, signboards, the existence of potholes on the road, and so on.
Table 4. Layers and model parameters of all three models.
Model | Layers | Model Parameters
ResNet-18 | 111 | 17,271,149
ResNet-50 | 215 | 39,030,637
SegNet | 65 | 73,613,762
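The layer and parameter counts reported in Table 4 can be read directly from a Keras model. The sketch below is illustrative only: `build_segmentation_model` is a hypothetical builder (a ResNet-50 encoder with a single upsampling head), not the architecture used in this work, so the printed counts will not reproduce Table 4 exactly.

```python
# Hedged sketch: reading layer and parameter counts from a Keras segmentation model.
# `build_segmentation_model` is a hypothetical stand-in, not code from the original work.
import tensorflow as tf

def build_segmentation_model(input_shape=(192, 640, 3)):
    """Hypothetical builder: ResNet-50 encoder followed by a single upsampling head."""
    encoder = tf.keras.applications.ResNet50(include_top=False, weights=None,
                                             input_shape=input_shape)
    x = tf.keras.layers.Conv2D(2, 1, activation="sigmoid")(encoder.output)
    x = tf.keras.layers.UpSampling2D(size=32, interpolation="bilinear")(x)
    return tf.keras.Model(encoder.input, x)

model = build_segmentation_model()
print("Layers:", len(model.layers))          # corresponds to the "Layers" column
print("Parameters:", model.count_params())   # corresponds to the "Model Parameters" column
```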
Table 5. Training configuration.
Training Configuration
Model | Batch Size | Optimizer | Loss Function | Training Metrics | Epochs
ResNet-18 | 3 | Adam (learning rate = 0.001) | Binary Cross Entropy | Accuracy | 20
ResNet-50 | 3 | Adam (learning rate = 0.001) | Binary Cross Entropy | Accuracy | 10
SegNet | 3 | Adam (learning rate = 0.001) | Binary Cross Entropy | Accuracy | 25
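For reference, the configuration in Table 5 corresponds to a standard Keras compile/fit call. The following is a minimal sketch only; `model`, `train_images`, and `train_masks` are assumed names rather than identifiers from the original code.

```python
# Minimal training sketch matching Table 5 (shown for the ResNet-18 setting).
import tensorflow as tf

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
    loss="binary_crossentropy",   # binary cross entropy over the two-channel mask
    metrics=["accuracy"],         # pixel-wise accuracy tracked during training
)
history = model.fit(
    train_images,                 # (N, 192, 640, 3) road images
    train_masks,                  # (N, 192, 640, 2) road / surroundings masks
    batch_size=3,
    epochs=20,                    # 10 for ResNet-50 and 25 for SegNet
)
```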
Table 6. Dataset and system configurations.
Dataset Configuration
Dataset Size | 289
Road Image Shape | 192 × 640 × 3
Annotated Image Shape | 192 × 640 × 2
System Configuration (Google Colab)
Programming Language | Python 3.7.12
System RAM | 12.69 GB
System GPU | Tesla K80
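The shapes in Table 6 suggest a straightforward preprocessing pipeline. The sketch below assumes the standard KITTI road layout (`image_2` and `gt_image_2` folders) and the magenta road colour of the KITTI ground truth; the exact paths, file matching, and colour encoding are assumptions, not details taken from the original code.

```python
# Hedged preprocessing sketch: KITTI road images resized to 192 x 640 x 3 with
# two-channel (road / surroundings) masks of shape 192 x 640 x 2, as in Table 6.
import glob
import cv2
import numpy as np

IMG_H, IMG_W = 192, 640

def load_pair(image_path, gt_path):
    image = cv2.resize(cv2.imread(image_path), (IMG_W, IMG_H)) / 255.0
    gt = cv2.resize(cv2.imread(gt_path), (IMG_W, IMG_H), interpolation=cv2.INTER_NEAREST)
    road = np.all(gt == [255, 0, 255], axis=-1).astype(np.float32)  # assumed magenta road label
    mask = np.stack([road, 1.0 - road], axis=-1)  # channel 0: road, channel 1: surroundings
    return image, mask

image_paths = sorted(glob.glob("data_road/training/image_2/*.png"))
gt_paths = sorted(glob.glob("data_road/training/gt_image_2/*road*.png"))
pairs = [load_pair(i, g) for i, g in zip(image_paths, gt_paths)]
images = np.array([p[0] for p in pairs])  # expected shape: (289, 192, 640, 3)
masks = np.array([p[1] for p in pairs])   # expected shape: (289, 192, 640, 2)
```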
Table 7. Comparison of XAI techniques.
Criterion | Grad-CAM | Saliency Map | Observational Explanations
Methodology | Grad-CAM employs gradients associated with a certain output class that flow into the penultimate convolutional layer to provide coarse localizations, emphasizing crucial portions of the picture. | Saliency maps compute the gradients of the loss function with respect to the input pixels for a specific output class, returning a map of the size of the input layer that contains negative and positive values. | This technique leverages the model API to construct the output of a given hidden layer of the model.
Advantages | Faster to compute than model-agnostic approaches such as LIME and SHAP. | Faster to compute than model-agnostic approaches such as LIME and SHAP. | The output of any hidden convolutional layer can be readily constructed.
Disadvantages | The Grad-CAM heat map can become distorted when multiple instances of a class occur in an image, i.e., it may fail to localize the target entity. | When ReLU is used as the activation function, values below zero are clamped to zero, so the activation saturates and the corresponding gradients vanish. | Only the feature maps produced by a few layers at the beginning and end of the network are genuinely useful.
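To make the comparison in Table 7 concrete, the sketch below shows, under stated assumptions, how each of the three techniques can be realized in TensorFlow/Keras for a trained segmentation model `model` and a single preprocessed image `x` of shape (1, 192, 640, 3). The layer and channel indices are illustrative only and do not necessarily match the models in this paper.

```python
# Hedged sketch of the three explanation techniques compared in Table 7.
import tensorflow as tf

x_t = tf.convert_to_tensor(x, dtype=tf.float32)

# 1. Saliency map: gradient of the road-channel score with respect to the input pixels.
with tf.GradientTape() as tape:
    tape.watch(x_t)
    pred = model(x_t)
    road_score = tf.reduce_sum(pred[..., 0])             # channel 0 assumed to be "road"
saliency = tf.reduce_max(tf.abs(tape.gradient(road_score, x_t)), axis=-1)[0]

# 2. Grad-CAM-style map: gradients of the road score flowing into a late conv layer,
#    global-average-pooled and used to weight that layer's feature maps.
conv_layer = model.layers[-4]                             # an assumed late convolutional layer
grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
with tf.GradientTape() as tape:
    conv_out, pred = grad_model(x_t)
    road_score = tf.reduce_sum(pred[..., 0])
grads = tape.gradient(road_score, conv_out)
weights = tf.reduce_mean(grads, axis=(1, 2))              # one weight per channel
cam = tf.nn.relu(tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1))[0]

# 3. Observational explanation: output of a chosen hidden layer via the model API.
hidden_model = tf.keras.Model(model.inputs, model.layers[7].output)
channel_map = hidden_model(x_t)[0, :, :, 63]              # e.g., layer 7, channel 63 (cf. Figure 11)
```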
Table 8. Comparison of IOU results of the trained models.
Metric | ResNet-18 | ResNet-50 | SegNet
Train set Mean IOU | 0.9459 | 0.8665 | 0.9038
Test set Mean IOU | 0.9621 | 0.9190 | 0.9343
Table 9. Class-wise IOU score for the proposed model architectures.
Metric | ResNet-18 (Road) | ResNet-18 (Surroundings) | ResNet-50 (Road) | ResNet-50 (Surroundings) | SegNet (Road) | SegNet (Surroundings)
Train IOU | 0.9459 | 0.9459 | 0.8665 | 0.8665 | 0.9038 | 0.9038
Test IOU | 0.9622 | 0.9621 | 0.9195 | 0.9184 | 0.9344 | 0.9341
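The IoU values in Tables 8 and 9 can be reproduced from the predicted and ground-truth masks with a few lines of NumPy. This is a sketch under the assumption that predictions are thresholded at 0.5 into binary (0/1) road masks; `pred_masks` and `true_masks` are assumed arrays of shape (N, 192, 640).

```python
# Hedged sketch: per-class and mean IoU from binary road masks.
import numpy as np

def binary_iou(pred, true, eps=1e-7):
    """Intersection over union of two binary masks."""
    intersection = np.logical_and(pred, true).sum()
    union = np.logical_or(pred, true).sum()
    return (intersection + eps) / (union + eps)

def class_wise_mean_iou(pred_masks, true_masks):
    """Mean IoU over all images for the road class and its complement (surroundings)."""
    road = np.mean([binary_iou(p, t) for p, t in zip(pred_masks, true_masks)])
    surroundings = np.mean([binary_iou(1 - p, 1 - t) for p, t in zip(pred_masks, true_masks)])
    return road, surroundings
```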
Table 10. Train and test set accuracies of the proposed models.
Metric | ResNet-18 | ResNet-50 | SegNet
Train Accuracy | 0.9761 | 0.9281 | 0.9655
Test Accuracy | 0.9786 | 0.9318 | 0.9630
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
