1. Introduction
The rapid urbanization and industrialization over the past seven decades have led to significant air pollution in large cities. Consequently, the air quality in urban centers has severely declined, posing risks to both human health and the environment [
1,
2]. Unfortunately, there is a lack of spatiotemporal air quality data for populated areas, hindering data-driven interventions to address environmental deterioration [
1]. Regular air quality monitoring is essential to devise suitable strategies to prevent its negative effects on human health and the ecosystem of the area of interest [
3,
4]. Moreover, these monitoring methods can help track the immediate effects associated with the shift toward sustainable energy transportation systems. The detection and monitoring of trace gases using remote sensing data from satellites offer numerous advantages, such as global coverage for extended periods, enabling researchers to examine the concentration of trace gases over a wider area and map their distribution [
5]. Additionally, precise measurements of trace gases at multiple locations help identify sources and sinks, allowing for reasonable budgets to be generated. However, ongoing urbanization and industrialization have complicated the monitoring and control of air quality, particularly in rapidly developing nations, such as China and India. Despite suffering from poor air quality, these nations continue to produce synthetic gases to meet industrial growth without fully understanding the adverse environmental effects [
1,
6]. The high levels of air pollution in certain regions of Asia, such as South Asia and East Asia, have been associated with higher incidences of respiratory, mental, and other health issues [
7,
8,
9,
10]. It is estimated that Asia alone accounts for nearly 6.7 million premature deaths annually due to poor air quality [
11].
Besides India and China, Pakistan is also suffering from high air pollution levels owing to significant population and economic growth. The largest and fastest-growing sources of air pollution in Pakistan over the past decade have been the automotive and industrial sectors. During the period 2001–2013, the number of vehicles in Pakistan increased by 130% [
12,
13]. The city of Lahore alone accounts for 23–26% of extra carbon monoxide (CO) emissions due to an inadequate and inefficient mass-transit system [
14].
In Lahore and Islamabad, vehicle emissions contribute significantly to deteriorating air quality, highlighting the urgent need for interventions. The air pollution crisis in Lahore is worsened by the 40% of the city’s 7 million registered vehicles that emit hazardous air pollutants above permissible levels and contribute to smog. The situation is exacerbated by traffic congestion and the operation of heavy transport vehicles without road-worthiness certification [
15]. This underscores the critical need for a transition to green energy in the transportation sector. Green transportation, which includes electric vehicles, hybrid cars, biofuels, and effective public transit systems, could substantially reduce air pollution and improve public health. It also helps combat climate change by reducing emissions, conserving energy, and promoting efficient resource use [
16,
17,
18]. Pakistan was among the 10 nations most affected by extreme weather events from 1991 to 2010 [
19]. Since 2010, Pakistan has experienced numerous instances of intense heatwaves, torrential rains, and widespread floods. Hence, it is important to explore the chemical composition of the atmosphere over Pakistan by monitoring chemically active trace gases for understanding their impact on surface air temperature, heat waves, and climate change.
Atmospheric pollution is mainly caused by higher concentrations of various trace gas species including CO and oxides of nitrogen (NOx) and sulfur (SOx). The primary emissions from anthropogenic sources are the trace gases such as CO, nitrogen dioxide (NO2), and sulfur dioxide (SO2). CO is a hazardous air pollutant that negatively impacts air quality and poses risks to all forms of life. While present in trace amounts, it can severely impair oxygen supply in the body, leading to severe health problems which include drowsiness and irritation in the eyes [
20]. The main sources of CO include vehicular emissions, fossil fuel combustion, industry, home heating, and vegetation burning, as well as natural sources like forest fires and volcanoes [
2]. NO2, generated from burning fossil fuels in transportation, industry, and power generation, is another hazardous gas contributing to air pollution. Exposure to NO2 can cause respiratory symptoms, reduced lung function, and increased cardiovascular risks, and has led to millions of premature deaths globally [
21,
22]. Similarly, SO2, generated by natural and human activities such as volcanic eruptions, fossil fuel burning, and industrial operations, directly affects air quality and poses risks to human health, ecosystems, and the environment. Pakistan’s heavy reliance on coal and industrial activities has resulted in high SO2 emissions, exceeding WHO standards [
23]. During the COVID-19 pandemic, many cities imposed lockdown measures that reduced travel, cutting pollution, fuel consumption, and emissions. Post-pandemic, promoting sustainable options like cycling, electric vehicles, and public transport is crucial for climate mitigation. Cities are adopting low-emission zones, shared mobility, and innovative transport to build efficient, eco-friendly systems [
24,
25]. The lockdown period closely resembled a green energy transportation scenario, providing valuable insights into the potential improvements in air quality. Analyzing data from different time periods, including before, during, and after the pandemic, is essential to assess the potential impact of adopting green transportation systems [
26]. The Pakistan Environmental Protection Agency (Pak-EPA) is attempting to analyze the concentration of NO2 in a few Pakistani cities, along with other air quality examinations, but frequent updates are needed to investigate its influence on climate change. By leveraging data from the Sentinel-5P satellite, which measures air pollutants such as NO2, CO, and SO2, researchers can obtain frequent, accurate, and comprehensive information on the levels and distribution of these pollutants in urban areas [
27]. This research enables the evaluation of the effectiveness of green transportation systems in reducing air pollution and its subsequent positive impact on public health. By accurately predicting the impact of transitioning to green transportation systems, policymakers can make informed decisions to prioritize sustainable transportation solutions and create healthier and more livable cities for everyone. The main contributions of this study are summarized as follows:
Exploratory data analysis (EDA) is conducted on Sentinel-5P data to analyze the effects of green energy transportation on AQI trends in Lahore and Islamabad.
Machine learning (ML) and deep learning (DL) models are created to forecast future air pollution levels to provide actionable insights and trends for policymakers to mitigate the harmful effects of air pollution.
A comparative analysis of the traditional LSTM and bidirectional LSTM models is performed to predict concentrations of CO, NO2, and SO2. The bidirectional LSTM model provided an improvement of 10% over the traditional LSTM.
2. Related Work
To effectively address the issue of air pollution and assess the usage of green energy transportation in evaluating air quality, a comprehensive review of the relevant literature was conducted. Monitoring air quality is vital for a sustainable environment, achieved through various methods such as active/passive gas sampling, automatic point monitoring, photochemical/optical sensors, remote optical sensing, and imagery data. These approaches provide a holistic understanding of pollution, enabling precise assessments and targeted interventions. Combined with deep learning, these techniques offer a detailed air quality view, helping policymakers in developing effective pollution control strategies.
In traditional approaches, the active and passive sampling methods involve collecting samples of gases and vapors using pumps, sorbent tubes, or diffusion techniques [
28,
29]. The other approach that was utilized by the US Environmental Protection Agency was automatic point monitoring to detect and calculate the concentration of selected gases [
30]. It provides continuous measurements and real-time data availability, which helps to identify pollution hotspots and develop mitigation strategies.
Traditional air quality monitoring methods have limitations. Active sampling is accurate but expensive, slow, and limited. Passive sampling is less sensitive, delayed, and prone to interference. Automatic point monitoring is costly, fixed, and has technical problems. Despite their usefulness, these methods should be combined with others for a complete understanding of air quality.
Apart from traditional methods, sensor-based systems, like photochemical and optical sensor systems, use light-sensitive sensors to detect pollutants in the air, offering mobility and simultaneous measurement of multiple pollutants [
30]. This is especially useful for urban areas with diverse pollution sources. Another sensor-based approach is remote optical monitoring, which employs electromagnetic spectrum measurements to determine pollutant concentrations in real-time [
31]. Space-based sensors also utilize image-based monitoring with aerosol optical thickness for assessing air pollutants, using various methods based on the application and available resources [
32].
Air quality monitoring using Internet of Things (IoT) sensors allows real-time monitoring of air quality parameters [
33]. The Atmospheric Air Surveil System (AASS) is a transportable prototype that uses IoT sensors to monitor parameters like CO and CO2 in outdoor environments. The AASS system utilizes microcontrollers, gas sensors, and GPS to measure gas concentrations and transmit the processed data to a Data Acquisition unit via MQTT and cloud services. The data are then stored on a remote server for later access. This cost-effective AASS system offers real-time air quality data for analysis and decision-making.
The aforementioned techniques provide precise air quality measurements at a specific site, but they are restricted by spatial and temporal constraints. To address this, remote sensing techniques have emerged for broader regional and global air quality monitoring. These methods encompass satellite-based sensing, airborne measurements, and mobile ground-based monitoring [
34]. Optical, radar, and LiDAR satellites offer high spatial and temporal resolutions, and advanced satellite-based technologies have the potential to provide more accurate and comprehensive data than traditional ground-based monitoring methods [
35,
36,
37].
Recent improvements in satellite and aerial remote sensing technology have made it possible to collect precise data on air pollution across vast areas [
38,
39,
40]. This aids in precise air quality mapping and trend tracking. Deep learning and machine learning analyze these data for real-time monitoring and prediction; this is crucial for public health in urban areas [
38]. These techniques excel due to their capacity to efficiently manage diverse data [
39].
In recent years, there has been a growing interest in using machine learning and deep learning techniques for air quality prediction and estimation. Lin et al. used a random forest regression model to forecast PM2.5 and nitrate levels based on road site data [
41]. The model showed strong predictive accuracy, gauged by the R-squared value. However, precision depends on data quality and site conditions, potentially limiting applicability to diverse locations.
Shafi et al. [
42] utilized K-means clustering to detect abrupt changes in air quality. The method successfully grouped data into clusters based on similarity, detecting notable changes linked to weather and human activities. This highlights the promise of K-means clustering for building early warning systems that anticipate air quality shifts. Such techniques enable prompt action to counter the adverse effects of pollution on health and the environment.
Choi et al. [
43] employed affordable sensors and machine learning to monitor Seoul’s air quality for urban planning. Their model effectively predicted pollutants, like PM2.5 and NO2, using sensor data. The study underscores the value of budget-friendly sensor-based monitoring and machine learning for the swift identification of pollution areas, providing proactive solutions in air quality management and urban planning.
Li et al. [
44] used a machine learning model to assess the impact of clean air actions on air quality in Beijing, based on data from 2008 to 2017. The findings revealed substantial decreases in pollutants including PM2.5, SO2, and NO2 due to these actions. The study underscores the actions’ efficacy while stressing the need for ongoing efforts to sustain and enhance air quality. Moreover, it showcases machine learning’s utility in gauging the impact of environmental policies on air pollution.
Huang et al. [
45] developed an accurate PM2.5 concentration prediction model using remote sensing data and machine learning algorithms. The random forest algorithm performed the best with an R-squared value of 0.80, RMSE of 6.62, and MAE of 4.58. In another study, Banerjee et al. [
46] investigated the potential relationship between air pollution, economic growth, and COVID-19 mortality rates in India using machine learning techniques. The study concluded that air pollution levels and economic growth were significant predictors of COVID-19 mortality rates in India. Specifically, a 10 μg/m3 increase in PM2.5 concentrations was associated with a 9.4% rise in COVID-19 deaths, while a 1% increase in gross domestic product (GDP) was linked to a 5.5% decrease in COVID-19 deaths.
Cosemans et al. [
47] compared the performance of three machine learning algorithms in predicting air pollutant concentrations at different locations across Europe. Random forest and support vector regression outperformed both linear regression and regularization. Researchers have also proposed a deep learning-based model based on air quality and meteorological data to accurately identify the major sources of air pollution [
45,
48], which can help policymakers take targeted actions to reduce emissions. Zhang et al. and Zhou et al. [
49,
50] have developed deep learning-based approaches that utilize satellite remote sensing data to identify the sources of particulate matter pollution with high accuracy.
Besides monitoring air quality, researchers have also attempted to estimate the concentration of pollutants and predict air quality based on measured data. Kow et al. [
51] proposed a new approach for air quality estimation using image data and deep learning neural networks, achieving high accuracy in predicting AQI values in real time. Similarly, Sharma et al. [
52] reported a novel technique for forecasting PM10 concentrations in the most polluted hotspots in Australia using satellite data and deep learning methods, achieving high accuracy with a mean absolute error of less than 10. Another study by Kurnaz et al. [
53] predicted the concentrations of two air pollutants, SO2 and PM10, in the city of Sakarya in Turkey, with high accuracy. Similarly, Mao et al. [
54] have reported a deep learning method for predicting air quality. In another study, the researchers proposed an effective convolutional neural network (CNN) for visual understanding of transboundary air pollution based on Himawari-8 satellite images [
55]. The CNN-based model was shown to accurately identify and classify different types of pollutants.
The study in [56] presents a novel deep predictive model for accurately predicting spatiotemporal PM2.5 in Los Angeles County using meteorological data, wildfire data, remote-sensing satellite imagery, and ground-based sensor data. The model employs a graph convolutional network (GCN) and a convolutional long short-term memory (ConvLSTM) network to learn and predict spatiotemporal correlations in air pollution data. The model achieves state-of-the-art accuracy in predicting hourly PM2.5 at seven sensor locations in Los Angeles County. The root mean square error (RMSE) and normalized root mean square error (NRMSE) decrease over time with later frames, but this is expected as the nature of PM2.5 results in concentrations 24 h in the future being more correlated with 24 h in the past as compared to concentrations 48 h in the future.
Das et al. [
57] compared the performance of MLP, RNN, and LSTM models in predicting air pollutants such as PM10 and SO2. The evaluation metrics used were MSE, RMSE, MAE, and R2. The LSTM model outperformed the MLP and RNN models in terms of accuracy. The study also compared the performance of the proposed model with existing studies in the literature and found that the LSTM model predicted PM10 and SO2 pollutants with high accuracy. The study provides valuable insights into the use of deep learning models for air pollutant prediction.
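For readers less familiar with the four evaluation metrics named above, the following is a minimal sketch of how MSE, RMSE, MAE, and R2 are commonly computed. The function name `regression_metrics` and the sample values are illustrative assumptions, not taken from any of the cited studies.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, MAE, and R2 for two equal-length sequences."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [t - p for t, p in zip(y_true, y_pred)]
    mse = sum(e * e for e in errors) / n
    mae = sum(abs(e) for e in errors) / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    # R2 is undefined when the observations have zero variance.
    r2 = 1.0 - (mse * n) / ss_tot if ss_tot > 0 else float("nan")
    return {"mse": mse, "rmse": math.sqrt(mse), "mae": mae, "r2": r2}


# Made-up observed and predicted concentrations, for illustration only.
observed = [10.0, 12.0, 14.0, 16.0]
predicted = [11.0, 12.0, 13.0, 17.0]
metrics = regression_metrics(observed, predicted)
print(metrics)
```

Lower MSE, RMSE, and MAE indicate smaller errors, whereas R2 closer to 1 indicates that the model explains more of the variance in the observations.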
In [58], multiple statistical and deep learning techniques were used to forecast air pollution levels. The data came from government-built air pollution monitoring stations in Kolkata, and the performance of the different models was evaluated using two indicators, RMSE and MAE. Holt–Winters-based forecasting models performed best for the PM2.5, PM10, and SO2 time series, while deep learning-based models, such as ConvLSTM and Bi-LSTM, dominated for the NO2 time series.
Shin et al. [
59] present a study on the use of an FCN-based deep learning regression model for real-time indoor air quality monitoring. The dataset is preprocessed to reduce skewness and convert the raw 1D dataset into 2D image input/output datasets, after which the model is trained with various hyperparameters. The results show a decrease in the average prediction error for the MAE and RMSE compared with a deep neural network model.
LSTM and BiLSTM networks excel in air quality forecasting by capturing sequential dependencies, handling missing data, and modeling complex temporal relationships. They retain crucial information from past observations, considering weather and pollution factors, and enhance prediction by incorporating future insights. Optimizing these models requires experimentation, considering data quality, features, and architecture [
60].
Machine learning and deep learning offer advantages over traditional methods. They handle large, irregular data, learn intricate patterns, and leverage remote sensing for precise pollution source detection. These models inform policies, aid urban planning, and offer cost-effective data-driven solutions for air quality management [
61].
Table 1 and Table 2 summarize the performances of various statistical machine learning and deep learning models used for predicting air quality.
3. Methodology
The study was conducted in two major cities of Pakistan: Lahore and Islamabad. The dataset for the study was based on atmospheric monitoring data collected by the Sentinel-5P satellite from 2018 to 2021. The dataset was preprocessed, including the conversion of L2 to L3 products, filtering for the study areas, interpolation, and outlier removal. The data were converted from mol/m2 to the AQI standard unit. An exploratory data analysis (EDA) was performed to analyze the AQI trends before, during, and after COVID-19 in both cities. Forecasting models were trained to predict future trends to support data-driven policy interventions for improving AQI.
Figure 1 illustrates the methodology followed in this study.
3.1. Study Area
Air pollution is a serious problem for major population centers of Pakistan as it has been ranked third among the countries with the highest levels of air pollution [
62]. Lahore and Islamabad, shown in Figure 2, are two major cities, and neither is immune from environmental pollution. Both cities are renowned for their cultural and historical significance but also suffer from air quality issues. Lahore is the second largest city and the provincial capital of Punjab, with a population of over 11 million people that has grown at an annual rate of 3% since 1998, resulting in substantial urbanization and a growing reliance on transportation [
63]. This trend has led to significant problems with road congestion and increased emissions in the area. According to the annual global survey conducted by IQAir, a Swiss manufacturer of air purifiers, the city of Lahore experienced a significant rise in its air pollution levels in 2022. The city has jumped more than 10 places to become the world’s most polluted city. IQAir measures air quality by assessing the concentration of harmful PM2.5 particles, which can damage the lungs. Lahore’s air quality deteriorated from 86.5 micrograms of PM2.5 particles per cubic meter in 2021 to an alarming level of 97.4 micrograms per cubic meter in 2022.
The primary sources of pollution in Lahore comprise transportation, industries, agriculture (through crop residue burning), open waste burning, and inefficient fuel consumption in the commercial and domestic sectors. Air pollution in Lahore is predominantly caused by the transportation sector, accounting for a staggering 83% of total pollution. This sector alone is responsible for 127 Gg of emissions. The majority of these emissions, amounting to 104.76 Gg, are produced by two-stroke vehicles like motorcycles, scooters, and auto-rickshaws. Motorcars, jeeps, and wagons contribute a further 16.34 Gg to the total emissions. The primary pollutant emitted in Lahore is carbon monoxide, resulting from the incomplete combustion of fuels in mobile engines and other processes, as illustrated in Figure 3.
Non-methane volatile organic compounds (NMVOCs) and nitrogen oxides (NOx) are secondary major pollutants, largely emitted from the transport and industrial sectors. Particulate matter, including total suspended particulates, PM2.5, and PM10, are emitted in lower concentrations. Apart from transportation, emissions from the industrial (9%), domestic (0.11%), and commercial (0.14%) sectors also contribute to the overall pollution levels in Lahore. These sectors primarily use inefficient fuels, such as coal and diesel oil, leading to emissions of pollutants. Additionally, the common practice of burning crop residues (3.9%) and waste (3.6%) in the outskirts of Lahore also contributes significantly to the city’s pollution. The resulting pollution levels in Lahore far exceed the recommended limits, leading to a surge in respiratory ailments among the population. It has been estimated that if air quality guidelines were adhered to, residents could potentially increase their life expectancy by an average of 6.8 years [
63].
Islamabad, the capital of Pakistan, is home to over 1.7 million people, with an average growth rate of 3.7%. This has resulted in rapid urbanization, causing an increase in transportation [
64]. While its air quality is generally better than in Lahore, it still faces pollution challenges. In 2022, it was reported as unhealthy, with the average level of hazardous air pollutant PM2.5 measured at 49.33 micrograms per cubic meter, exceeding the permissible limit of 35 micrograms per cubic meter [
65]. Vehicular emissions are identified as the primary cause of particle pollution in Islamabad, leading to levels as high as 41.63 micrograms per cubic meter [
66]. Astonishingly, these emissions contribute to a substantial 43% of the country’s overall air pollution. The usage of non-compliant diesel fuel, containing hazardous sulfur dioxide, exacerbates the problem. It is crucial to address this issue promptly by implementing stricter regulations on vehicle emissions, promoting cleaner fuels, and ensuring compliance with environmental standards. Taking these measures will help improve air quality and safeguard public health in Islamabad [
67]. Emissions in both cities are primarily attributable to transportation activities, which led to the selection of these urban areas for an analysis of the trends in the AQI during the COVID-19 pandemic. Concurrently, a prediction model was also developed. The COVID-19 period, marked by frequent lockdowns, saw a significant reduction in intracity transportation. Our study was devised to evaluate the impact of this reduction on air quality. The change in AQI trends captured by the prediction model provides policymakers with a measure of the effectiveness of their transition toward green energy policies.
3.2. Data Acquisition and Preprocessing
Various datasets provide air quality information related to CO, NO2, and SO2. One of the most commonly used is provided by the World AQI project, which collects and aggregates air quality data from different sources worldwide. The project provides hourly data on various air pollutants and aggregates them into an overall air quality index that can be used to compare air quality between cities or regions. While the World AQI project is a valuable source of information, certain limitations must be considered when using these data. The AQI data are compiled from various sources and are susceptible to gaps in geographic coverage, variable data quality, lack of detail, and time lag. The AQI provides a broad overview of air quality but typically lacks the level of detail needed for more localized or in-depth analysis. Additionally, there can be a delay between the measurement of pollutants and their inclusion in the AQI data. Therefore, it is important to be aware of these limitations and to use additional sources of data to supplement and verify the findings. Other datasets on these pollutants include those provided by national or regional air quality monitoring networks. For example, in the United States, the Environmental Protection Agency (EPA) provides air quality data through its Air Quality System (AQS), covering thousands of monitoring sites across the country. The measured data provide information about various pollutants, including CO, NO2, and SO2. In addition, satellite-based datasets contain information on atmospheric pollutants such as NO2 and SO2 on a global scale. For example, the Sentinel-5P mission, which is part of the European Space Agency’s Copernicus program, provides high-resolution data on common atmospheric pollutants, including CO, NO2, SO2, and O3. This dataset can be used to monitor air quality on a global scale. Overall, these datasets play an important role in helping scientists, policymakers, and the public to understand and address air quality issues arising from hazardous pollutants like CO, NO2, and SO2.
To collect the data for our study, the Python API was used to query the Sentinel-5P database, which contains atmospheric data collected by the satellite. Lahore and Islamabad confront significant air pollution levels, partly due to industrial operations, traffic exhaust, and agricultural burning in surrounding areas, which makes them suitable targets for Sentinel-5P’s environmental and atmospheric monitoring capabilities. Sentinel-5P’s data are collected through the Tropospheric Monitoring Instrument (TROPOMI), whose measurements of gases such as CO, NO2, and SO2 have proven consistently accurate. The global scientific community, environmental agencies, and government bodies rely extensively on these data, which can support local and national policymakers in making well-informed decisions about environmental policies and mitigation strategies. The API was programmed to retrieve data for the region of interest, which was defined by a GeoJSON file of Pakistan. GeoJSON is a file format used to represent geographical data and is commonly employed in mapping applications. By providing the GeoJSON file, the Python API was able to extract the data for the cities of Lahore and Islamabad. The satellite collected data on a daily basis from 2017 to 2021 for the three pollutants of interest, NO2, CO, and SO2. However, only monitoring data from April 2018 onward are publicly available, and these are used in this study. The data were downloaded in NetCDF format, a standard format for storing, manipulating, and analyzing scientific data. The size of the dataset for SO2, NO2, and CO was 1651.15 GB, 771.26 GB, and 274.23 GB, respectively. The following data preprocessing steps were taken:
Conversion from Level-2 (L2) to Level-3 (L3) Products: L2 products are the minimally processed or unprocessed data that a satellite sensor has collected. They typically include measurements of particular variables, such as atmospheric composition, at high spatial and temporal resolution. However, the accuracy and interpretation of the measurements may be affected by noise, artifacts, or anomalies in the L2 data. To address these limitations, the L2 data are transformed into L3 products using the HARP Python package. Aggregating the L2 data over larger spatial and temporal scales reduces noise and improves measurement accuracy, enabling a more thorough understanding of the atmospheric composition in the studied area. L3 products offer a broader viewpoint and capture averaged or aggregated values of the variables of interest. These products support national or international assessments, trend detection, and the exploration of long-term patterns.
Filtering and conversion to CSV: The L3 data were filtered separately for Lahore and Islamabad for each pollutant to isolate data relevant to the study area and eliminate extraneous data points. The data were then converted from netcdf files to CSV files to simplify data manipulation and facilitate further analysis.
Checking for null values: The CSV files were examined for null values to identify missing data points. It was observed that these missing values were clustered in specific areas, suggesting that interpolation could be employed to estimate the missing values.
Interpolation: Linear interpolation was performed to estimate the missing values based on neighboring data points. This technique is commonly used to impute missing data in scientific datasets and results in the creation of a complete dataset.
Outlier removal: To ensure data quality, any outliers were removed by utilizing GeoJSON files containing the geographical boundaries of Lahore and Islamabad. This step filtered out data points located outside the city boundaries, improving the accuracy and reliability of the dataset.
Duplicate values: To address the issue of duplicate values, the pandas library provides two key functions: duplicated() and drop_duplicates(). The duplicated() function was employed to identify duplicate values in a DataFrame, while the drop_duplicates() function was used to eliminate those duplicates.
Conversion to AQI standard unit: The initial gas concentrations, measured in units of moles per square meter (mol/m2), were converted to air quality index (AQI) standard units. The mass concentrations of each gas in micrograms per cubic meter (μg/m3) were calculated using the molecular weight and molar volume of air. The AQI values for each gas concentration were determined by comparing these to the relevant AQI standards.
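The null check, linear interpolation, and duplicate handling described in the steps above can be sketched with pandas as follows. The toy DataFrame, its column names (`timestamp`, `latitude`, `longitude`, `no2`), and its values are hypothetical stand-ins for one city's CSV export; the sketch also drops duplicates before interpolating, which is an assumption about ordering rather than the study's documented pipeline.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for one city's L3 CSV export.
df = pd.DataFrame({
    "timestamp": pd.to_datetime([
        "2020-03-01 09:00", "2020-03-02 09:00", "2020-03-02 09:00",
        "2020-03-03 09:00", "2020-03-04 09:00",
    ]),
    "latitude": [31.52, 31.52, 31.52, 31.52, 31.52],
    "longitude": [74.36, 74.36, 74.36, 74.36, 74.36],
    "no2": [1.0e-4, 1.2e-4, 1.2e-4, np.nan, 1.6e-4],  # mol/m2, made up
})

# Step 1: count missing values per column.
missing_before = int(df["no2"].isna().sum())

# Step 2: flag fully duplicated rows, then drop them.
duplicate_rows = int(df.duplicated().sum())
clean = df.drop_duplicates().sort_values("timestamp").reset_index(drop=True)

# Step 3: fill the remaining gaps by linear interpolation between neighbors.
clean["no2"] = clean["no2"].interpolate(method="linear")

print(missing_before, duplicate_rows)
print(clean)
```

Interpolating after de-duplication avoids giving a repeated observation extra weight when a gap is filled from its neighbors.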
After performing the pre-processing steps on the Sentinel-5P data, the statistics of the cleaned data are shown in Table 3.
It is important to note that the specific AQI standards may vary depending on the location and regulations governing air quality monitoring.
3.3. Training of Machine/Deep Learning Models
Two machine learning models (random forest and decision tree) and two deep learning models (LSTM and bidirectional LSTM) are used to predict air quality. The LSTM model, a recurrent neural network (RNN) variant, is well suited to modeling sequential data, which makes it a natural choice for time-series analysis. Random forest and decision tree models, on the other hand, are well known for their ability to handle structured data and make accurate predictions in various domains. Each model was trained using six input features (“latitude”, “longitude”, “year”, “month”, “day”, and “hour”) and one output label (the concentration of the respective gas). Initially, the input features consisted of longitude, latitude, and timestamp. To enrich the feature set and improve model accuracy, feature engineering was applied to the timestamp, generating four new features: year, month, day, and hour. These additional features support both model accuracy and exploratory data analysis. The dataset is split into 80% for training and 20% for testing with random shuffling. This approach ensures that the model learns from a variety of data points during training, improving its ability to generalize to unseen data. Random shuffling also helps in assessing the model’s performance by providing a representative sample of the dataset for evaluation.
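The timestamp feature engineering and the shuffled 80/20 split can be sketched with pandas and scikit-learn as follows. The data, the target column name `no2`, and the `random_state` value are assumptions made for this illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

FEATURES = ["latitude", "longitude", "year", "month", "day", "hour"]

def add_time_features(df, ts_col="timestamp"):
    """Replace the timestamp column with year, month, day, and hour columns."""
    out = df.copy()
    ts = pd.to_datetime(out[ts_col])
    out["year"] = ts.dt.year
    out["month"] = ts.dt.month
    out["day"] = ts.dt.day
    out["hour"] = ts.dt.hour
    return out.drop(columns=[ts_col])

# Hypothetical records: one location, ten consecutive days at 09:00.
raw = pd.DataFrame({
    "latitude": [31.5] * 10,
    "longitude": [74.3] * 10,
    "timestamp": [f"2021-03-{d:02d} 09:00:00" for d in range(1, 11)],
    "no2": [float(d) for d in range(1, 11)],
})

data = add_time_features(raw)
X, y = data[FEATURES], data["no2"]
# 80% training, 20% testing, with random shuffling.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, shuffle=True, random_state=42
)
```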
3.3.1. Decision Tree
The decision tree model is a machine learning approach that partitions the feature space into distinct, non-overlapping regions, which allows it to represent non-linear relationships between the predictor variables and the target. Decision tree models are, however, prone to overfitting: they can become overly specialized to the training dataset and generalize poorly to new data. To guard against overfitting, k-fold cross-validation is used to assess how well the model generalizes. Additionally, the criterion parameter is set to mean squared error (MSE), which guides the decision tree’s construction by minimizing the squared differences between predicted and actual values. The splitter parameter is set to “best”, which selects the best split at each node according to the chosen criterion rather than a random one.
3.3.2. Random Forest
The random forest model is an ensemble model that combines multiple decision trees to reduce overfitting, as shown in Figure 4. It can improve on the performance of a single decision tree by reducing its variance. We also performed k-fold cross-validation for each model to obtain a more reliable estimate of the model’s performance. We used five folds, shuffled the data, and used MSE as the criterion. The mean of the MSE scores across all folds was taken as the cross-validation MSE.
3.3.3. Long Short-Term Memory (LSTM) Model
The LSTM regression model consists of one LSTM layer with 50 units and a dense output layer with one unit (Figure 5). The LSTM layer uses the ReLU activation function and has an input shape of (1, number of features). The output layer has no activation function. The model is compiled using the Adam optimizer and the mean squared error loss function, and the mean absolute error is used as a metric to monitor the model’s performance during training. The LSTM layer has several parameters that can be adjusted to optimize the model’s performance. The dropout and recurrent dropout parameters can be used to prevent overfitting by randomly dropping a fraction of the layer’s input and recurrent connections during training. The return sequences and return state parameters can be used to return the LSTM layer’s full output sequence and final state, respectively. The LSTM model is trained for 100 epochs with a batch size of 32. During training, the model’s performance is evaluated on a validation set, and the MSE and mean absolute error (MAE) are calculated for the testing set after training.
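The architecture described above could be expressed in Keras roughly as follows. This is a sketch rather than the authors' code: the data are random placeholders, the 20% validation split is an assumption, and the sketch runs only 2 epochs to stay quick, whereas the text states 100.

```python
import numpy as np
from tensorflow import keras

def build_lstm(n_features=6, units=50):
    """One ReLU LSTM layer followed by a linear single-unit output."""
    model = keras.Sequential([
        keras.Input(shape=(1, n_features)),
        keras.layers.LSTM(units, activation="relu"),
        keras.layers.Dense(1),  # no activation: plain regression output
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

model = build_lstm()

# Placeholder data reshaped to (samples, time steps = 1, features).
rng = np.random.default_rng(0)
X = rng.normal(size=(64, 6)).astype("float32")
X_seq = X.reshape(-1, 1, 6)
y = X[:, 0]  # made-up target for illustration

history = model.fit(
    X_seq, y, epochs=2, batch_size=32, validation_split=0.2, verbose=0
)
preds = model.predict(X_seq[:4], verbose=0)
```

Dropout could be added through the `dropout` and `recurrent_dropout` arguments of the LSTM layer if overfitting is observed.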
3.3.4. Bidirectional LSTM
The bidirectional LSTM (Figure 6 and Figure 7) regression model consists of one bidirectional LSTM layer with 50 units and a dense output layer with one unit. Like the LSTM model, the bidirectional LSTM layer uses the ReLU activation function and has an input shape of (1, number of features), and the output layer has no activation function. The model is compiled using the Adam optimizer and the MSE loss function, with the mean absolute error used as a metric to monitor the model’s performance during training. The bidirectional LSTM layer processes the input sequence in both forward and backward directions, allowing the model to take into account both past and future information. This can improve performance compared to the LSTM model, especially for time-series data with long-term dependencies. The bidirectional LSTM layer has the same parameters as the LSTM layer, including dropout, recurrent dropout, return sequences, and return state. The model is trained for 100 epochs with a batch size of 32, and the MSE and MAE are calculated for the testing set after training.
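In Keras, the bidirectional variant can be sketched by wrapping the LSTM layer in `Bidirectional`. The function name and defaults below are assumptions for illustration; the wrapper's default behavior concatenates the forward and backward outputs, so the dense layer receives 100 values.

```python
from tensorflow import keras

def build_bilstm(n_features=6, units=50):
    """Bidirectional ReLU LSTM layer followed by a linear single-unit output."""
    model = keras.Sequential([
        keras.Input(shape=(1, n_features)),
        keras.layers.Bidirectional(
            keras.layers.LSTM(units, activation="relu")
        ),
        keras.layers.Dense(1),  # no activation: plain regression output
    ])
    model.compile(optimizer="adam", loss="mse", metrics=["mae"])
    return model

bilstm = build_bilstm()
n_params = bilstm.count_params()
```

The parameter count is roughly double that of the unidirectional model because the wrapper holds two independent LSTM layers. With an input sequence length of 1, both directions see the same single time step.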
To summarize, this section has discussed the study area, dataset, and analytical techniques used to predict air quality in Lahore and Islamabad based on remote sensing data from the Sentinel-5P satellite. The prediction models include machine learning and deep learning techniques: random forest, decision tree, LSTM, and bidirectional LSTM. The models are trained on preprocessed data and evaluated on their predictions of air quality parameters for the cities of interest.
When forecasting air quality with LSTM models, time-consuming calculations and stability concerns must be addressed to achieve efficient and accurate predictions. Several strategies can be implemented. First, reducing the number of LSTM layers or units can significantly improve computational efficiency, often without a substantial loss in forecasting performance; finding the right balance between complexity and accuracy reduces training and inference times. Second, regularization techniques play a vital role in stabilizing LSTM models. Applying dropout or recurrent dropout to the LSTM layers helps prevent overfitting and enhances the generalization capability of the model, so that the network learns meaningful patterns from the air quality data and produces reliable forecasts. Third, incorporating batch normalization can be beneficial: by normalizing the activations within each layer, it helps stabilize the training process, leading to faster convergence. Fourth, exploding and vanishing gradients must be addressed. Gradient clipping prevents the gradients from becoming too large during backpropagation, which keeps the updates to the LSTM parameters stable and enables more reliable predictions.
Because air quality forecasting often involves long sequences, truncated backpropagation through time (BPTT) can also be employed. Breaking the input sequences into smaller subsequences reduces memory requirements and computation times. Although some long-term dependencies may be sacrificed, the trade-off allows for stable and efficient training of LSTM models. Optimizing hardware and software resources is also crucial. Hardware accelerators, such as GPUs or TPUs, can significantly speed up the calculations involved in LSTM training and inference, and optimized software frameworks like TensorFlow or PyTorch make efficient use of parallel processing. By applying these strategies in the context of air quality forecasting, researchers and practitioners can address the time-consuming calculations and stability concerns associated with LSTM models. This leads to more efficient training and inference, improved stability, and reliable forecasts, ultimately aiding better decision-making and management of air quality.
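As an illustration of the gradient clipping mentioned above, the following NumPy sketch rescales a set of gradient arrays so that their joint L2 norm does not exceed a threshold. The function name and threshold are assumptions for this example; deep learning frameworks provide comparable built-in options (for instance, Keras optimizers accept a `clipnorm` argument).

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale gradient arrays so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    global_norm = float(np.sqrt(sum(np.sum(g ** 2) for g in grads)))
    if global_norm <= max_norm:
        return [g.copy() for g in grads], global_norm
    scale = max_norm / global_norm
    return [g * scale for g in grads], global_norm

# Made-up gradients for two parameter tensors; joint norm is 5.
grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = float(np.sqrt(sum(np.sum(g ** 2) for g in clipped)))
```

Clipping bounds only the size of the update and leaves its direction unchanged, which is why it addresses exploding gradients but not vanishing ones.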
6. Conclusions and Future Work
This study investigated the potential of green energy transportation systems to significantly enhance air quality in the urban areas of Islamabad and Lahore. To this end, a thorough exploratory data analysis was conducted to assess the feasibility of implementing such systems, and predictive models were trained and validated to forecast trends in AQI. Remote sensing data from Sentinel-5P were used, and machine learning and deep learning models (decision trees, random forests, LSTM, and bidirectional LSTM) were deployed to predict pollutant levels. The models exhibited high efficacy, with the trained LSTM model achieving an MSE of 0.50, 0.44, and 0.47 for NO2, SO2, and CO, respectively, in Islamabad. The MSE results improved with the trained Bi-LSTM model to 0.41, 0.38, and 0.34 for the same pollutants. In Lahore, the LSTM model produced an MSE of 0.55, 0.66, and 0.34, while the Bi-LSTM model achieved 0.44, 0.61, and 0.26. The findings present substantial evidence that transitioning to green transportation could significantly lessen urban air pollution. Consequently, this underlines the urgent need for a policy shift toward sustainable transportation. The developed predictive models can help policymakers understand the potential impacts of green energy transition efforts on air quality. Nonetheless, it is essential to combine the trained models with other metrics, such as renewable energy usage and specific pollutant reductions, given the multi-factorial nature of AQI and the varying reliability of predictive models. In the future, the integration of data from various sources will be explored, such as the moderate resolution imaging spectroradiometer (MODIS) or the cloud–aerosol lidar and infrared pathfinder satellite observation (CALIPSO) satellite, along with existing on-ground monitoring devices.
This could generate a more diverse dataset, potentially leading to improved air quality forecasting and a broader understanding of air quality trends. The inclusion of other air pollutants, such as ground-level ozone and particulate matter, in the predictive models will further widen the scope of air quality analysis. This comprehensive approach is vital for improving data quality and achieving a holistic understanding of atmospheric conditions. To facilitate this, machine learning models will need to be fine-tuned with a diverse array of parameters that influence atmospheric processes. These models could incorporate features representing influential factors, like El Niño or the Schwabe cycle. Furthermore, upscaling or downscaling techniques will play a crucial role in mitigating disparities in spatial resolution among different datasets. Striking a balance between preserving fine-grained details and adjusting resolution will be key to enabling localized predictions. Additionally, reporting and monitoring solutions built on the trained models for relevant government bodies and environmental agencies could inform decisions around green energy resource management. A geographical expansion of the analysis to other major cities of Pakistan may provide a more holistic view of the country’s air quality dynamics and regional variations. This comprehensive approach will better illustrate the immediate and long-term benefits of transitioning to green energy transportation systems.