Article

Big-Data-Driven Machine Learning for Enhancing Spatiotemporal Air Pollution Pattern Analysis

Department of Geoinformatics and Applied Computer Science, Faculty of Geology, Geophysics and Environmental Protection, AGH University of Science and Technology, 30-059 Krakow, Poland
*
Author to whom correspondence should be addressed.
Atmosphere 2023, 14(4), 760; https://doi.org/10.3390/atmos14040760
Submission received: 25 March 2023 / Revised: 9 April 2023 / Accepted: 19 April 2023 / Published: 21 April 2023
(This article belongs to the Special Issue Machine Learning in Air Pollution)

Abstract

Air pollution is an important problem for public health. The spatiotemporal analysis is a crucial step for understanding the complex characteristics of air pollution. Using many sensors and high-resolution time-step observations makes this task a big data challenge. In this study, unsupervised machine learning algorithms were applied to analyze spatiotemporal patterns of air pollution. The analysis was conducted using PM10 big data collected from almost 100 sensors located in Krakow, over a period of one year, with data being recorded at 1-h intervals. The analysis results using K-means and SKATER clustering revealed distinct differences between average and maximum values of pollutant concentrations. The study found that the K-means algorithm with Dynamic Time Warping (DTW) was more accurate in identifying yearly patterns and clustering in rapidly and spatially varying data, compared to the SKATER algorithm. Moreover, the clustering analysis of data after kriging greatly facilitated the interpretation of the results. These findings highlight the potential of machine learning techniques and big data analysis for identifying hot-spots, cold-spots, and patterns of air pollution and informing policy decisions related to urban planning, traffic management, and public health interventions.

1. Introduction

Air pollution is a major public health concern with far-reaching consequences. Overexposure to particulate matter (PM) may result in neurodegenerative diseases such as Alzheimer’s and Parkinson’s [1]. Polluted air can cause respiratory problems, increase the risk of heart disease, and lead to lung cancer [2] and chronic respiratory issues [3]. It can also worsen existing health conditions, such as diabetes and mental health disorders. Moreover, the impact of air pollution is not limited to human health: it harms crops, forests, and water, and affects wildlife by altering natural habitats [4]. The decline in biodiversity as a result of human activities is a matter of significant concern for the health of our planet, and it can have a detrimental effect on ecosystem functions. One of the major contributors to this decline is air pollution, which is not only harmful to the environment but also poses a threat to human health [5]. Air pollution is also a significant contributor to climate change, one of the most pressing environmental issues of our time. Air quality and climate are intertwined [6], with many air pollutants being the source of greenhouse gases that affect the climate. The pollutants can alter solar and terrestrial radiation and contribute to the changing climate. Air pollution represents a hazard on both a global and a local scale, affecting urban and rural communities alike. Local PM concentrations exceeding acceptable limits can significantly impact daily operations and lead to numerous health-related and administrative challenges. They may even have a negative impact on tourism [7].
The issue of smog is particularly severe in developed European countries where the energy mix is based on the burning of coal and lignite. Countries such as Poland, whose cities have been at the top of the list of the most polluted places in the world for many years, fall under this category [8]. This specific type of smog, which is prevalent during the cold season, is generated most intensely in the temperature range of −5 °C to 0 °C [9] and has even been given its own name, Polish Smog (https://polishsmog.com/ accessed on 1 March 2023). The term covers smog associated with low emissions, including the use of coal for heating, and it is most commonly associated with low temperatures, sunny days, and relatively high pressure [10,11]. Despite many efforts, this problem continues to dominate public debates. Krakow, located in the southern part of the country, near the border with the Czech Republic and Slovakia, is a pioneer in the fight against smog. Although there is a total ban on the burning of solid fuels in the city (applying to stoves and fireplaces as well as coal grills), the city still periodically struggles with PM10 and PM2.5 levels that exceed permissible levels by hundreds of percent. Studies have shown that the main causes of this situation are the city’s topography (it is located in a depression), generally weak winds, which ventilate the city but on certain days also favor the influx of pollutants, and the absence of a similar ban in surrounding towns [8,9,12]. Krakow is a city with many universities and companies specializing in high-tech industries and is also one of the Polish cities most frequently visited by foreign tourists. The city was one of the first twelve sites inscribed on the UNESCO World Heritage List. The problem of Polish Smog significantly reduces the potential of the city as a cultural and scientific center.
In February 2023, an official warning was issued for several days in the city to refrain from outdoor activities, including walks. During this time, the weather was sunny with high pressure, which favored the retention of pollutants in the city [13]. The issue of smog should be addressed systematically, and decisions on further action should be based on reliable studies that use traditional data analysis techniques as well as artificial intelligence (AI) and machine learning (ML), applied to data with high temporal and spatial resolution.
The types of air particulate matter measurements and permissible concentrations in Poland comply with the European Union Ambient Air Quality and Cleaner Air for Europe Directive No. 2008/50/EC. They are described at the national level in the PN-EN 12341 and PN-EN 16450 standards [14]. The annual average concentration of PM2.5 particulate matter should not exceed 25 µg/m³, while the 24-h average concentration of PM10 particulate matter should not exceed 50 µg/m³. Reference measurements are performed manually using the gravimetric method, in which particulate matter collected on a filter is weighed. Automatic measurement is also allowed, using various devices whose measurements have been found to comply with the gravimetric method described in the standards. The results of the measurements are publicly available online on the Chief Inspectorate for Environmental Protection website [15]. A disadvantage of these measurements is their high cost, which results in a small number of such devices and makes it impossible to carry out reliable spatial analyses on a local city scale. According to the World Meteorological Organization (WMO), low-cost sensors (LCS) can be used for air quality analysis [16]. These devices have lower accuracy and higher uncertainty than reference measurements and are most commonly based on processes such as light scattering, but they are inexpensive and widely available. Properly calibrated devices and well-prepared measurements (according to WMO standards) allow for their use in social communication and scientific analysis. This work uses LCS sensors from the Airly company, whose effectiveness and high compatibility with reference measurements have been repeatedly confirmed in scientific studies.
The use of sensor data is cost-effective and provides real-time data; however, it involves the need to process large amounts of data that exceed the capabilities of traditional data analysis methods, which are insufficient to handle the complexity and volume of the data. This is characteristic of the term big data, which refers to large, complex, and diverse data sets. Big data plays an increasingly important role in science and business due to the growing computing, storage, and data transfer capabilities [17]. Access to such data, with appropriate processing techniques, allows for valuable insights into the subject matter that was not possible in previous, traditional approaches. Historical data from LCS sensors, which are used for this analysis, allow finding trends and spatial patterns of smog formation, but these data are collected by the sensors continuously, so when combined with machine learning, access to big data makes it possible, for example, to quickly identify hot-spots or air pollution real-time trends.
Machine learning is a data science field that develops algorithms capable of learning from data and identifying patterns that may not be apparent to human analysts [18]. It has gained popularity due to the explosive growth of big data and the need for faster and more effective data processing. The effectiveness of machine learning algorithms depends on the quality and quantity of data used for training, making big data and machine learning closely related. In this study, unsupervised learning is used to perform cluster analysis and identify patterns and relationships within the Krakow air pollution data.
The current research presents a novel approach for applying big data analysis of LCS measurements to determine local PM concentration trends. The aim of the study was to gain insights into the spatial patterns of the concentrations and to identify yearly trends. A complete workflow was developed to carry out the analysis and several algorithms were tested, including the commonly used K-means algorithm with the Dynamic Time Warping (DTW) metric, which takes into account the temporal variability, as well as the relatively new SKATER algorithm. The results were compared with those obtained from the algorithms available in ArcGIS, a widely used software package for spatial data analysis. This comparison helped to evaluate the performance of the algorithms and to determine which was best suited for the task at hand. Furthermore, based on domain knowledge, an analysis was conducted to determine the most appropriate metric for selecting the optimal number of clusters in this domain-specific problem. The heuristic elbow method and the Bayesian Information Criterion were considered. The results of this research provide valuable insights into the use of big data analysis for determining local concentration trends. The complete workflow and the algorithms tested can be used as a foundation for future research in this field. The comparison of the algorithms and the selection of the optimal metric will assist researchers in making informed decisions when conducting similar studies in the future.

2. Materials and Methods

2.1. Localization

The study investigated air quality in the city of Krakow, Poland, located in the central part of the European continent. Poland also lies on the eastern border of the European Union. Krakow has a temperate climate, characterized by four astronomical seasons and six thermal seasons. The cool period extends from November to April. During this period, the main source of air pollution is the burning of coal, as confirmed by radiometric studies of the carbon fraction. These studies showed that the carbon fraction in the air increases significantly during the cool period, primarily due to the burning of coal for heating purposes.
The warm months, on the other hand, are characterized by a different pattern of air pollution. The main source of the carbon fraction of PM10 dust during this period is transportation. The natural background in the city is constant throughout the year and is estimated to be around 20–30%. The city has introduced a law prohibiting the use of solid fuels for heating apartments in an attempt to reduce air pollution. However, despite these efforts and legal changes, the smog problem remains significant [12]. In order to investigate the air quality in Krakow, data from 100 LCS sensors were collected in the city from March 2021 to February 2022. The sensors were chosen using an algorithm written in R, with the goal of ensuring a nearly regular grid. The sensor data include the levels of various air pollutants, such as PM1, PM2.5, and PM10, and meteorological factors (pressure, temperature, and humidity).
Figure 1 shows the location of the study area against the European continent, including the main rivers and city division boundaries. Krakow is located in an unfavorable depression, bounded by elevations to the north and south. The depression in which the city center is located is associated with the Wisła River valley. This setting can create unfavorable atmospheric conditions that trap pollutants and contribute to poor air quality. In the city, weak west winds are predominant, with the Wisła River valley serving as the main ventilation corridor, stretching from east to west. This meteorological pattern, combined with the city’s location in a depression, can exacerbate the effects of local emissions, leading to higher concentrations of pollutants in the air, as air masses can become stagnant in the valley.

2.2. Big Data Machine-Learning Workflow

Machine learning is a rapidly growing field of artificial intelligence that focuses on the development of algorithms capable of learning from data. These algorithms use statistical techniques to find relationships within the data and identify patterns that may not be apparent to human analysts [18]. In recent years, machine learning has gained popularity due to the explosive growth of big data and the need for faster and more effective data processing. Big data and machine learning are closely related because the effectiveness of machine learning algorithms depends on the quality and quantity of data used for training [19]. The more data available, the more accurate and reliable the predictions made by the machine learning model. In addition, big data provides the infrastructure required to store, process, and analyze the massive amounts of data required for machine learning [17]. Machine learning is divided into two main areas: supervised and unsupervised learning. Supervised learning involves the use of labeled data to train the algorithm, while unsupervised learning involves the discovery of patterns and relationships within the data without any pre-existing labels or classifications. In this analysis, unsupervised learning was used, which involved the identification of patterns and relationships within the data to perform cluster analysis, which is a technique used to group objects or data points based on their similarities or dissimilarities. Unsupervised machine learning algorithms were used for cluster analysis to identify patterns and group similar areas together, to study air pollution in and around Krakow.
The proper preparation of a large-scale data pipeline is a complex process that requires careful attention to detail. In this project, we collected data using Airly’s API. Our data pre-processing pipeline, as illustrated in Figure 2, involved several key steps. After an initial data quality check to identify any non-numeric or out-of-range observations (using exploratory data analysis techniques such as box-plots and swarm-plots), the dataset was securely stored in a cloud-based database (the Azure Data-Science Machine-Learning platform) for further processing. We then restricted the analysis to sensors with at least 90% valid observations during the timeframe of interest, which allowed us to focus on the most reliable data. To address missing data in records where less than 10% of values were missing, a K-Nearest Neighbour imputation method was used. This approach allowed us to fill in the gaps in the dataset while minimizing the risk of introducing bias. Once the dataset had been cleaned and imputed, the data were normalized using StandardScaler from the scikit-learn Python library [20]. Finally, we performed clustering both on the data after kriging and on the data at the original sensor positions (per sensor) to investigate whether smoothing and filtering through kriging has an impact on time- and space-variant clustering. Nearly 100 sensors were analyzed in and around Krakow. Each sensor measured pollution parameters on an hourly basis. The data for a full year were analyzed, resulting in nearly one million samples for the sensor data and almost 100 million samples for the grid data.
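The filtering, imputation, and normalization steps described above can be sketched roughly as follows. This is a minimal illustration, not the study's actual pipeline: the function name `preprocess`, the array layout (hours in rows, sensors in columns), and the choice of `n_neighbors=5` are assumptions made for the example.

```python
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler


def preprocess(readings, min_valid=0.90, n_neighbors=5):
    """Filter, impute and standardize an (n_hours, n_sensors) array.

    NaN marks a missing or rejected observation. The layout and the
    default parameters are illustrative assumptions.
    """
    readings = np.asarray(readings, dtype=float)
    # Keep only sensors (columns) with at least `min_valid` valid observations.
    valid_share = 1.0 - np.isnan(readings).mean(axis=0)
    keep = valid_share >= min_valid
    kept = readings[:, keep]
    # Fill the remaining gaps from the most similar rows (hours).
    imputed = KNNImputer(n_neighbors=n_neighbors).fit_transform(kept)
    # Standardize each sensor to zero mean and unit variance.
    scaled = StandardScaler().fit_transform(imputed)
    return scaled, keep
```

The function returns the standardized array together with a boolean mask showing which sensors passed the 90% validity threshold, so the mask can be reused to select the matching sensor coordinates.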

2.3. Clustering Methods

This study compared two clustering algorithms, K-means and SKATER. These algorithms differ in the way they create clusters from the dataset. K-means divides the data into a predetermined number of clusters by minimizing the sum of squared distances between each data point and the centroid of the corresponding cluster. SKATER, on the other hand, uses a minimum spanning tree to construct a dendrogram in a hierarchical fashion. The objective is to find the tree with the minimum total edge weight that spans all data points. One advantage of SKATER over K-means clustering is its robustness to outliers and noise, as it can be used for both divisive and agglomerative clustering. In this study, SKATER utilized the location of individual data points, while K-means relied only on the PM10 values.

2.3.1. Unsupervised Machine Learning Technique Using K-Means

There are many different clustering algorithms that are used to analyze multidimensional data sets [18]. Due to the characteristics of the data from the LCS sensors, which are time series containing information about the location, the K-means algorithm was selected for the analysis. It is one of the most popular unsupervised machine learning algorithms and is widely used in cluster analysis when the number of clusters is known a priori [19]. It is a non-deterministic, iterative algorithm whose goal is to divide the data set into ‘k’ predefined clusters in such a way that the points inside a cluster are as similar as possible while individual clusters remain different from each other. After the number of clusters is defined, the first step is to ‘randomly’ select the centroids and assign each point to the nearest centroid. In the next step, the location of each centroid is corrected: its new position is calculated as the average of the points that belong to its cluster. The steps of assigning points to clusters and updating the centroid positions are then repeated, usually until the centroid positions remain unchanged between iterations (so that the assignment of points to clusters no longer changes) or until a different stop criterion is reached [21].
The process of assigning points to the clusters is given as Equation (1) (for Euclidean distance)
$$ c_i = \arg\min_{j = 1, \dots, k} \lVert x_i - \mu_j \rVert_2, \tag{1} $$
where $c_i$ is the point-cluster assignment for the point $x_i$, $k$ is the predefined number of clusters, $\mu_j$ is the location of the $j$-th centroid, and $\lVert x_i - \mu_j \rVert_2$ is the Euclidean distance between the point and the centroid.
The process of correcting the position of the centroids can be expressed by Equation (2)
$$ \mu_j = \frac{1}{n_j} \sum_{i=1}^{n} x_i \, [c_i = j], \tag{2} $$
where $\mu_j$ is the new centroid position for cluster $j$, $n_j$ is the number of points in cluster $j$, and $\sum_{i=1}^{n} x_i \, [c_i = j]$ is the sum of all points that belong to cluster $j$. Thus, it can be seen that this is the average value for all points belonging to this cluster.
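The alternation between the assignment step of Equation (1) and the centroid update of Equation (2) can be illustrated with a short NumPy sketch. It is a plain Euclidean K-means written for illustration only; the function name `kmeans`, the initialization from randomly chosen data points, and the default of 50 iterations are assumptions for the example rather than the implementation used in the study.

```python
import numpy as np


def kmeans(points, k, n_iter=50, seed=0):
    """Plain Euclidean K-means (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Random initialization: pick k distinct data points as starting centroids.
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    labels = np.full(len(points), -1)
    for _ in range(n_iter):
        # Equation (1): assign each point to its nearest centroid.
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable, so the centroids will not move
        labels = new_labels
        # Equation (2): move each centroid to the mean of its assigned points.
        for j in range(k):
            members = points[labels == j]
            if len(members):  # an empty cluster keeps its previous centroid
                centroids[j] = members.mean(axis=0)
    return labels, centroids
```

Because the starting centroids are random, different seeds may lead to different local optima, which is why implementations usually repeat the procedure over several initialization runs.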
The above equations are a generalized form of the K-means algorithm for the Euclidean metric. For this analysis, it was decided to use the Dynamic Time Warping (DTW) metric, which is a metric used to compare time series in which the processes occurring differ in speed and where it is important to find a pattern, not a direct comparison of processes occurring at the moment. DTW warps the time series in such a way that the sum of the Euclidean distances between the corresponding points is as small as possible. In the case of air pollution time series analysis, similarities in the formation of pollution between individual locations will be recognized and clusters will be created on this basis.
The DTW distance between two time series X and Y can be expressed by Equation (3)
$$ d_{DTW}(X, Y) = \min_{\alpha, \beta} \sum_{i=1}^{n} \left( \lVert x_i - y_{\alpha(i)} \rVert + \lVert y_i - x_{\beta(i)} \rVert \right), \tag{3} $$
where α and β are warping functions that specify how the time indexes of the series should be transformed, and n is the length of the series.
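To make the idea of warping concrete, the sketch below computes a DTW distance with the classic dynamic-programming recurrence. This is the standard textbook formulation using absolute differences as the local cost, shown as an illustration of the technique; it is not necessarily identical to the formulation in Equation (3) or to the library routine used in the study, and the function name `dtw_distance` is chosen for the example.

```python
import numpy as np


def dtw_distance(x, y):
    """DTW distance between two 1-D series (classic O(n*m) recurrence)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n, m = len(x), len(y)
    # cost[i, j] = cheapest alignment of the first i points of x
    # with the first j points of y.
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(x[i - 1] - y[j - 1])
            # Extend the cheapest of: step in x only, step in y only, step in both.
            cost[i, j] = local + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```

For two series with the same shape but shifted in time, such as `[0, 0, 1, 2, 1, 0]` and `[0, 1, 2, 1, 0, 0]`, the point-by-point absolute differences sum to 4, while the DTW distance is 0 because the warping aligns the peaks.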
Using the formula above, K-means algorithm can be modified by replacing the Euclidean distance with the DTW distance. The process of assigning points to the clusters using DTW is given as Equation (4)
$$ c_i = \arg\min_{j = 1, \dots, k} d_{DTW}(x_i, \mu_j) \tag{4} $$
The following parameters for K-means clustering were used: metric: DTW, number of initialization runs: 4, the number of jobs to be executed in parallel for computing the cross-distance matrix: 5, number of iterations: 50.

2.3.2. Spatially Constrained Clustering Using SKATER

Another algorithm used is Spatial ’K’luster Analysis by Tree Edge Removal (SKATER). What makes it different from K-means is that it is a spatially constrained method and requires that the objects (in this case, the sensors) in one cluster are not only similar to each other but also geographically linked.
SKATER identifies clusters in a dataset by constructing a minimum spanning tree (MST), which is a tree that connects every point in a dataset with the shortest possible overall edge length and iteratively removes edges based on their weight. The algorithm partitions the graph into K clusters by calculating the between-cluster distance (BCD) for each component resulting from the edge removal and assigning components to clusters based on a user-defined threshold. This process continues until K clusters have been identified or there are no more edges in the MST [22]. The following parameters for SKATER clustering were used: distance metric: Euclidean, number of random weight simulations: 100, neighborhood: binary, graph: undirected, number of iterations: 50.
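The tree-partitioning idea can be sketched in a few lines of Python. The example below is a simplified illustration and not the configuration used in the study: it builds the minimum spanning tree of a contiguity graph whose edges are weighted by attribute dissimilarity, then repeatedly removes the tree edge whose removal yields the lowest total within-cluster sum of squared deviations. All function names are hypothetical, the contiguity graph is assumed to be connected, and the random weight simulations and thresholding mentioned above are omitted.

```python
import numpy as np


def minimum_spanning_tree(n, weighted_edges):
    """Kruskal's algorithm; returns the MST edges as a list of (i, j)."""
    parent = list(range(n))

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path compression
            a = parent[a]
        return a

    tree = []
    for _w, i, j in sorted(weighted_edges):
        ri, rj = find(i), find(j)
        if ri != rj:  # the edge joins two separate components
            parent[ri] = rj
            tree.append((i, j))
    return tree


def components(n, edges):
    """Label the connected components of an undirected graph."""
    adj = {i: [] for i in range(n)}
    for i, j in edges:
        adj[i].append(j)
        adj[j].append(i)
    labels = [-1] * n
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            node = stack.pop()
            for nb in adj[node]:
                if labels[nb] == -1:
                    labels[nb] = current
                    stack.append(nb)
        current += 1
    return np.array(labels)


def total_ssd(values, labels):
    """Sum of squared deviations from each cluster's mean."""
    return sum(
        ((values[labels == c] - values[labels == c].mean(axis=0)) ** 2).sum()
        for c in np.unique(labels)
    )


def skater(values, contiguity_edges, k):
    """Split a connected contiguity graph into k spatially linked clusters."""
    values = np.asarray(values, dtype=float).reshape(len(values), -1)
    n = len(values)
    weighted = [(float(np.linalg.norm(values[i] - values[j])), i, j)
                for i, j in contiguity_edges]
    tree = minimum_spanning_tree(n, weighted)
    # A tree with n - k edges has exactly k connected components.
    while len(tree) > n - k:
        best = min(
            range(len(tree)),
            key=lambda e: total_ssd(values, components(n, tree[:e] + tree[e + 1:])),
        )
        del tree[best]
    return components(n, tree)
```

Because clusters can only be formed by cutting tree edges, every cluster is guaranteed to be connected in the contiguity graph, which is the property that distinguishes this family of methods from K-means.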

2.3.3. Spatial Clustering in ArcGIS

Clustering based on spatial autocorrelation was performed using ArcGIS Pro [23]. Autocorrelation-based clusterings such as Moran and Getis-Ord are based on analysis of the mutual similarity of attribute values assigned to a particular location in space. Clustering was performed both for data from individual sensors and for a grid in which we have interpolated values of PM10 concentrations over the entire analyzed area. Local Moran’s I is a statistical method of local spatial autocorrelation proposed by Anselin [24], which can be used to detect local clusters and outliers. The equation for Local Moran’s I is as follows Equation (5):
$$ I_i = \frac{x_i - \bar{X}}{S_i^2} \sum_{j=1,\, j \neq i}^{n} w_{ij} (x_j - \bar{X}) \tag{5} $$
where $x_i$ is the attribute value of $x$ at location $i$, $\bar{X}$ is the mean of the attribute over all $n$ points, $x_j$ is the attribute value at all other locations ($j \neq i$), $S_i^2$ is the variance of the variable $x$, and $w_{ij}$ is the matrix of weights. The weight matrix encodes the spatial relationship between objects and was computed using the inverse distance approach. In ArcGIS Pro, the function Cluster and Outlier Analysis (Anselin Local Moran’s I) allows for the determination of high-high clusters (high values in a high-value neighborhood) and low-low clusters (low values in a low-value neighborhood).
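As a rough illustration of Equation (5), the following NumPy sketch computes Local Moran’s I with plain inverse-distance weights. It is not the ArcGIS Pro implementation: the function names are hypothetical, the weights are not row-standardized, the coordinates are assumed to be distinct, and $S_i^2$ is assumed here to be the sum of squared deviations of all other locations divided by $n - 1$.

```python
import numpy as np


def inverse_distance_weights(coords):
    """w_ij = 1 / d_ij for i != j, and w_ii = 0 (assumes distinct points)."""
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    w = np.zeros_like(d)
    off_diag = ~np.eye(len(coords), dtype=bool)
    w[off_diag] = 1.0 / d[off_diag]
    return w


def local_morans_i(x, w):
    """Local Moran's I for every location, following Equation (5)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    z = x - x.mean()
    # Assumed S_i^2: squared deviations of all *other* locations over n - 1.
    s2 = (np.sum(z ** 2) - z ** 2) / (n - 1)
    # w has a zero diagonal, so the matrix product skips the j = i term.
    return z / s2 * (w @ z)
```

Positive values indicate a location surrounded by similar values (a high-high or low-low cluster), while negative values flag a spatial outlier, such as a high reading among low neighbors.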
The second tool used to find clusters was the Getis-Ord local statistic $G_i^*$ [25]. This statistic enables the identification of nearby clusters with high or low values, and it assesses the statistical significance of this association. Using this statistic, it is possible to detect hot-spots (clusters of high attribute levels) and cold-spots (clusters of low attribute levels) with different degrees of significance. The implementation of this function is available in ArcGIS Pro software as Getis-Ord $G_i^*$ (High/Low Clustering). The equation for $G_i^*$ is given as Equation (6):
$$ G_i^* = \frac{\sum_{j=1}^{n} w_{ij} x_j - \bar{X} \sum_{j=1}^{n} w_{ij}}{S \sqrt{\dfrac{n \sum_{j=1}^{n} w_{ij}^2 - \left( \sum_{j=1}^{n} w_{ij} \right)^2}{n - 1}}} \tag{6} $$
The $G_i^*$ index displays high values for objects with high PM concentration levels and low values for objects with low levels. When the values are near the expected value, the spatial distribution of the attribute is random.
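Equation (6) can be evaluated directly with NumPy, as in the sketch below. The example assumes binary distance-band weights in which every site counts as its own neighbor, and it takes $S$ to be the population standard deviation of the attribute; both choices, like the function names, are assumptions made for illustration and may differ from the settings used in ArcGIS Pro for this study.

```python
import numpy as np


def distance_band_weights(coords, radius):
    """Binary weights: 1 if two sites lie within `radius`, else 0.

    The diagonal is 1, so each site is included in its own neighborhood.
    """
    coords = np.asarray(coords, dtype=float)
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return (d <= radius).astype(float)


def getis_ord_gi_star(x, w):
    """Gi* z-score for every location, following Equation (6)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    mean = x.mean()
    s = np.sqrt(np.mean(x ** 2) - mean ** 2)  # population standard deviation
    w_sum = w.sum(axis=1)
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * (w ** 2).sum(axis=1) - w_sum ** 2) / (n - 1))
    return numerator / denominator
```

Large positive scores mark hot-spots and large negative scores mark cold-spots. Note that the denominator becomes zero if a neighborhood covers every site, so the radius should be smaller than the extent of the study area.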

2.3.4. Evaluation Metrics for Optimal Clusters Number

Determination of the optimal number of clusters is a key step in the clustering process, for example, for algorithms for which it is required to determine the number of clusters a priori. There are many techniques to help determine this number. In this case, metrics such as the elbow method, silhouette coefficient, Davies–Bouldin index, Calinski–Harabasz index, and Bayesian Information Criterion were used.
The elbow method involves plotting the sum of squared errors (SSE) against the number of clusters and identifying the point of inflection on the curve that determines the optimal number of clusters. The formula for SSE is given by Equation (7):
$$ SSE = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - \mu_i \rVert^2, \tag{7} $$
where $k$ is the number of clusters, $C_i$ is the $i$th cluster, $\mu_i$ is the centroid (mean) of the $i$th cluster, and $\lVert x - \mu_i \rVert^2$ is the squared distance between point $x$ and the centroid $\mu_i$.
Silhouette coefficient (SC) allows you to visualize the similarity and dissimilarity between points in clusters. The SC range is from −1 to 1. Value −1 indicates the worst number of clusters, while 1 is for the most suitable number [26]. SC is calculated according to this formula [20]:
$$ SC = \frac{b - a}{\max(a, b)}, \tag{8} $$
where a is the mean distance between the point and other points inside the cluster and b is the distance between a point and the nearest cluster (to which this point does not belong).
The Calinski–Harabasz index (CH) is based on the ratio of the between-cluster dispersion to the within-cluster dispersion, where dispersion is a summed squared distance [27]. The higher the index values, the denser and further apart the clusters are. The CH can be calculated using Equation (9):
$$ CH = \frac{t(B_k)}{t(W_{Ik})} \times \frac{n_W - k}{k - 1}, \tag{9} $$
where $t(B_k)$ is the trace of the covariance matrix (between groups), $t(W_{Ik})$ is the trace of the covariance matrix (within the cluster), $n_W$ is the total number of data points, and $k$ is the number of clusters.
The Davies–Bouldin index (DB) is based on the similarity of clusters that are assumed to have data densities that are decreasing with distance from a vector characteristic of the specific cluster [28]. A better clustering result with more distinct and well-separated clusters is indicated by a lower value of the Davies-Bouldin index. The DB index can be calculated using Equation (10):
$$ DB = \frac{1}{k} \sum_{i=1}^{k} \max_{i \neq j} \frac{s_i + s_j}{d_{ij}}, \quad i = 1, \dots, k, \tag{10} $$
where s i is the average distance between the i-cluster centroid and points in this cluster, s j is the average distance between the j-cluster centroid and points in this cluster and d i j is the distance between i and j cluster centroids.
One information criterion that can be used to determine the optimal number of clusters is the Bayesian Information Criterion (BIC). The BIC calculation takes into account the likelihood function of the clustering solution, the total number of parameters in the model, and the total number of data points in the dataset. For this metric, the lowest BIC value represents the optimal number of clusters. The equation for calculating the BIC is as follows [29]:
$$ BIC = -2 \ln(L) + k \ln(n) \tag{11} $$
where L is the maximum value of the likelihood function for the clustering solution, k is the number of parameters in the model, and n is the number of data points.
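A sweep over candidate cluster counts that collects all of these metrics might look like the sketch below. It uses Euclidean K-means from scikit-learn rather than the DTW variant applied in the study, and the BIC is taken from a spherical Gaussian mixture fitted with the same number of components, which is a stand-in assumption because K-means itself does not define a likelihood. The function name `score_cluster_counts` is hypothetical.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (
    calinski_harabasz_score,
    davies_bouldin_score,
    silhouette_score,
)
from sklearn.mixture import GaussianMixture


def score_cluster_counts(data, k_values, seed=0):
    """Return one dict of evaluation metrics per candidate cluster count."""
    data = np.asarray(data, dtype=float)
    rows = []
    for k in k_values:
        km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
        # Stand-in likelihood model for the BIC (an assumption of this sketch).
        gmm = GaussianMixture(
            n_components=k, covariance_type="spherical", random_state=seed
        ).fit(data)
        rows.append({
            "k": k,
            "sse": km.inertia_,  # elbow method: look for the bend
            "silhouette": silhouette_score(data, km.labels_),  # higher is better
            "calinski_harabasz": calinski_harabasz_score(data, km.labels_),  # higher is better
            "davies_bouldin": davies_bouldin_score(data, km.labels_),  # lower is better
            "bic": gmm.bic(data),  # lower is better
        })
    return rows
```

Calling `score_cluster_counts(data, range(2, 11))` and plotting each column against `k` gives curves analogous to those used for choosing the number of clusters; when the metrics disagree, domain knowledge has to settle the choice.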

3. Results

3.1. Clusters Number Evaluation

Before clustering, the previously described metrics were calculated to determine the optimal number of clusters. The results are shown in Figure 3, Figure 4, Figure 5 and Figure 6 for the SSE, DB, CH, and SC metrics, while Figure 7 shows the results for the Bayesian Information Criterion. The graphs show the metric values for 10 or more clusters; however, given the number of sensors on which the clustering is based, attention was focused on smaller numbers, in the range of 2–6 clusters. The clearest and most consistent results were obtained for the average values on the grid, depicted in Figure 4. Each metric consistently indicated 4 as the optimal number of clusters: CH and SC reached local maxima at this point, DB reached a local minimum, and a ‘knee point’ is visible in the SSE curve. Figure 3, Figure 5 and Figure 6 are less consistent with each other. In Figure 3, DB indicates 3 clusters, SSE indicates 6, and Silhouette indicates 3 and 6, while for CH the first local maximum appears only at 8. Figure 5 favors 6 for DB, Silhouette, and CH, and 4 for SSE. Figure 6 shows 6 for DB and Silhouette, 2 and 6 for CH, and 4 for SSE. Because these results were broadly similar but ambiguous, the Bayesian Information Criterion was calculated for the per-sensor data. The results are presented in Figure 7. For both the average and maximum PM10 values, the minimum BIC, indicating the optimal number of clusters, was obtained for 4 clusters; therefore, this number was adopted.

3.2. K-Means with DTW and SKATER Clustering

A cluster analysis was performed using spatiotemporal big data from LCS sensors (monthly maximum and average PM indicators) to identify annual patterns of air pollution in the urban area. Two algorithms were used, K-means with DTW and SKATER. Additionally, the influence of prior kriging of the input data was examined for the K-means DTW algorithm. Because the SKATER algorithm already accounts for spatial dependencies, the clustering test on kriged data was omitted for it. Figure 8 presents a comparison of annual patterns for 4 clusters. For average PM10 indicators: Figure 8a, K-means DTW clustering per sensor; Figure 8c, K-means DTW clustering after kriging; Figure 8e, SKATER clustering. For maximum indicators: Figure 8b, K-means DTW clustering per sensor; Figure 8d, K-means DTW clustering after kriging; Figure 8f, SKATER clustering.
There is a clear difference in annual patterns for average and maximum indicators when using the K-means DTW algorithm. In the case of the SKATER algorithm, the differentiation is not as pronounced because the algorithm operates on the spatial dependencies of the sensors; its results are consistent for both average and maximum values. Interestingly, for average values the western part of the city is clearly divided into two distinct clusters separated by the Vistula River valley and major transportation routes, while for maximum values the entire western area was classified into one cluster. The north-western and eastern part of the area was classified into a separate cluster. This part of the study area is dominated by agricultural fields, with no forests, and frequent episodes of grass burning occur. It is notable that the division of the western region into two parts is consistent for each type of clustering, as is the clear distinction of the northeastern part as a separate cluster number 1. The city of Krakow and the southeastern part are essentially in the same cluster in every method studied. The dominant factor here is the topography: the terrain descends toward the southeast along the river valley, where the vast Niepolomice forest is also located. Also of interest are the isolated clusters of community no. 1 (Figure 8a,c), which appear in areas where major transportation routes intersect. It is evident that the introduction of kriging results in smoother clusters, significantly facilitating interpretation.
When analyzing the maximum annual PM10 concentration distributions using the K-means algorithm with DTW (Figure 8b,d), an interesting phenomenon, called the “Krakow bagel”, becomes visible. In the city of Krakow an isolated cluster number 4 is formed, exhibiting a slight asymmetry and a preference for the N-S direction, while clusters 2 and 3 are distributed concentrically around it, creating the contours of the “bagel”. This suggests that the concentration of PM10 in Krakow and its surroundings follows a discernible pattern and is not randomly distributed. The kriging maps make the distribution of PM10 concentrations much easier to interpret, yet the dominant positions of individual clusters remain clearly consistent between K-means per sensor and K-means after kriging. This consistency indicates that the results agree across approaches and further suggests that the observed clustering patterns are not artifacts of the algorithms used but reflect real underlying patterns in the data. The SKATER algorithm, by contrast, did not perform well in this case: it was unable to show the differences between the maximum and mean values of PM10 concentrations, which indicates that it may not be suitable for analyzing air pollution data in Krakow. It is therefore important to exercise caution when selecting an algorithm for analyzing air pollution data, as different algorithms may yield different results and some may not be suitable for certain areas. Combining approaches, such as the K-means algorithm and kriging maps, gives insight into the distribution of PM10 concentrations in the city.
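The role of DTW in the clustering described above can be illustrated with a minimal sketch. It assumes an absolute-difference local cost (one common DTW variant) and shows only the assignment step of a k-means-style procedure; the monthly profiles and the helper names `dtw_distance` and `assign_to_nearest` are hypothetical and are not taken from the study.

```python
import numpy as np


def dtw_distance(a, b):
    """Dynamic-programming DTW distance between two 1-D series (absolute-difference cost)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n, m = len(a), len(b)
    # cost[i, j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]


def assign_to_nearest(series, centroids):
    """Assignment step of k-means: label each series by its DTW-nearest centroid."""
    return np.array(
        [int(np.argmin([dtw_distance(s, c) for c in centroids])) for s in series]
    )


# Hypothetical 12-month PM10 profiles (illustrative values only)
winter_peak = np.array([60, 50, 35, 25, 20, 18, 17, 18, 22, 35, 50, 62], dtype=float)
shifted = np.roll(winter_peak, 1)  # same seasonal shape, one month later
flat = np.full(12, 30.0)

labels = assign_to_nearest([winter_peak, shifted, flat], [winter_peak, flat])
pointwise = np.abs(winter_peak - shifted).sum()  # month-by-month (L1) distance
warped = dtw_distance(winter_peak, shifted)      # never larger than the L1 distance
```

Because DTW may stretch or compress the time axis, the shifted profile stays close to the winter-peak centroid even though the two disagree month by month. A full K-means with DTW would also need a DTW-aware centroid update, which is omitted here.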

3.3. Spatial Clustering

To find clusters of PM10 with a distinction between high- and low-value clusters, a spatial autocorrelation analysis was performed. The analysis covered both maximum and average PM10 concentrations and was conducted separately for per-sensor data and gridded data. Because these statistics do not account for the time factor, each month was analyzed separately. A confidence level of 95% was adopted for statistical inference. The spatial relationship was conceptualized using the inverse distance method. For per-sensor data, a cluster and outlier analysis (Local Moran’s I statistic) was conducted in order to detect anomalous sensor readings. For gridded data, the local Getis-Ord statistic was calculated to identify hot-spots and cold-spots in a specific area.
Figure 9 and Figure 10 show Local Moran’s I cluster maps of average PM10 concentration for each month from March 2021 to February 2022. The Local Moran’s I analyses detected regions with positive autocorrelation (i.e., high-high and low-low clusters) and regions with negative autocorrelation (i.e., high-low and low-high outliers). The average values of monthly PM10 concentrations enable the identification of overall patterns and trends in concentrations. For most days of the year, the average concentration of particulate matter is higher outside of the Krakow area than within the city.
Figure 11 and Figure 12 show Local Moran’s I cluster maps of maximum PM10 concentration for each month from March 2021 to February 2022. The analysis of maximum values of PM10 enables the detection of places where the highest pollution occurred. By analyzing the maximum concentrations of PM10 for each month, it can be observed that both the anomalies of high values and the clusters of high values are mainly located outside the Krakow area. Throughout the observation period, only clusters of low values were observed within the city of Krakow.
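The cluster and outlier classes used in these maps can be reproduced in outline with the following sketch. It assumes row-standardized inverse-distance weights and the usual form of Local Moran’s I; the sensor coordinates, PM10 values, and function names are hypothetical, and the permutation-based significance test that a GIS package would apply is left out.

```python
import numpy as np


def inverse_distance_weights(coords):
    """Row-standardized inverse-distance weights with a zero diagonal."""
    coords = np.asarray(coords, dtype=float)
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    w = np.zeros_like(dist)
    off_diagonal = dist > 0
    w[off_diagonal] = 1.0 / dist[off_diagonal]
    return w / w.sum(axis=1, keepdims=True)


def local_morans_i(values, w):
    """Local Moran's I: I_i = (z_i / m2) * sum_j w_ij z_j, with z the centered values."""
    x = np.asarray(values, dtype=float)
    z = x - x.mean()
    m2 = (z ** 2).sum() / len(x)
    lag = w @ z  # spatially weighted average of neighbouring deviations
    return z / m2 * lag, z, lag


def quadrant_labels(z, lag):
    """HH/LL mark clusters (positive I); HL/LH mark outliers (negative I)."""
    labels = []
    for zi, li in zip(z, lag):
        labels.append(("H" if zi > 0 else "L") + ("H" if li > 0 else "L"))
    return labels


# Hypothetical sensors: three close together with high PM10, three with low PM10
coords = [(0, 0), (1, 0), (0, 1), (10, 10), (11, 10), (10, 11)]
pm10 = [80, 85, 90, 20, 25, 15]

w = inverse_distance_weights(coords)
local_i, z, lag = local_morans_i(pm10, w)
labels = quadrant_labels(z, lag)
```

In this toy layout every sensor sits among similar neighbours, so all local statistics are positive and no high-low or low-high outliers appear.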
Visual representations of monthly average PM10 concentration hot-spots and cold-spots for gridded data (Figure 13 and Figure 14) were generated using the Getis-Ord $G_i^*$ statistic at the adopted significance level. By analyzing the average concentration values, it can be observed that clusters of high values occur outside of the Krakow area during colder months. Hot-spots also occur during the summer months (June–August), and these are likely caused by traffic.
The analysis of cold-spots and hot-spots was also carried out for the maximum values of PM10 concentrations (Figure 15 and Figure 16). For most days of the year, the maximum concentrations of particulate matter are located outside of the Krakow area. As with the averaged data, clusters of maximum values appear during the summer months.
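A hot-spot statistic of this kind can be sketched as follows. The example uses the standard z-score form of Getis-Ord $G_i^*$ with a binary distance-band neighbourhood that includes the cell itself. This is an assumption made to keep the example short; the study conceptualized spatial relationships with inverse distance. The grid, the values, and the function names are hypothetical.

```python
import numpy as np


def distance_band_weights(coords, radius):
    """Binary weights: 1 for every cell within `radius`, including the cell itself."""
    coords = np.asarray(coords, dtype=float)
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=2)
    return (dist <= radius).astype(float)


def getis_ord_gi_star(values, w):
    """Getis-Ord Gi* z-scores for each location."""
    x = np.asarray(values, dtype=float)
    n = len(x)
    mean = x.mean()
    s = np.sqrt((x ** 2).mean() - mean ** 2)
    w_sum = w.sum(axis=1)
    w_sq_sum = (w ** 2).sum(axis=1)
    numerator = w @ x - mean * w_sum
    denominator = s * np.sqrt((n * w_sq_sum - w_sum ** 2) / (n - 1))
    return numerator / denominator


# Hypothetical 5 x 5 grid: background PM10 of 10 with a 2 x 2 block of 100 in one corner
grid = np.full((5, 5), 10.0)
grid[:2, :2] = 100.0
coords = np.array([(r, c) for r in range(5) for c in range(5)], dtype=float)

w = distance_band_weights(coords, radius=1.5)  # queen-style neighbours plus self
gi = getis_ord_gi_star(grid.ravel(), w)
hot = gi > 1.96    # hot-spot at the 95% confidence level
cold = gi < -1.96  # cold-spot at the 95% confidence level
```

The corner cell of the high-value block receives a large positive z-score and is flagged as a hot-spot, while the opposite corner has a negative score that does not reach the cold-spot threshold.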

4. Discussion

Air pollution is a major public health concern, particularly in densely populated urban areas [30]. PM is a significant component of air pollution, and its concentration levels can vary greatly based on a variety of factors, including weather conditions, traffic patterns, and industrial activity. Identifying patterns in the distribution of PM concentrations is critical to developing effective mitigation strategies and protecting public health. Unfortunately, the relation between PM and other factors is not trivial and rather complex [31]. Traditional methods of analyzing PM concentrations are often limited by their reliance on subjective interpretation and incomplete data. In recent years, machine learning and big data analysis techniques have emerged as promising approaches for objective and comprehensive data analysis. In the present study, machine learning techniques and big data analysis were employed to identify patterns in the distribution of PM concentrations in Krakow and its neighboring areas. The study utilized solely the available data, demonstrating the potential of these techniques to provide valuable insights into complex environmental phenomena without the need for costly and time-consuming field studies. To identify yearly spatiotemporal patterns from the monthly average and maximum concentrations of PM10, the K-means algorithm with DTW was employed. This approach was found to be effective in overcoming the complex temporal and spatial dependencies of PM10 concentrations, which would not have been possible using conventional data analysis techniques.
In environmental studies, spatial analysis plays a vital role in understanding complex phenomena [32] such as air pollution, weather patterns, and terrain. Clustering is a widely used spatial analysis technique that groups similar spatial objects or features based on their geographic proximity or attribute similarity. The effectiveness of clustering algorithms in identifying patterns is heavily dependent on the quality and accuracy of the spatial data used. In the present study, the effectiveness of two different approaches to clustering was explored. The first approach involved clustering based on the actual locations of LCS sensors, while the second approach involved clustering on data that had been subjected to kriging. The use of kriging in clustering is known to eliminate certain artifacts and outliers, which can result in more accurate cluster identification. The results of the study suggest that clustering based on grids, i.e., data subjected to kriging, does not fundamentally alter the pattern of spatial objects but introduces a smoothing factor that facilitates interpretation. This finding is significant as it indicates that kriging-based clustering can improve the accuracy and reliability of clustering results without introducing artifacts or compromising the spatial pattern of the data. Clustering based on actual sensor locations is less effective in identifying spatial patterns due to the presence of artifacts and outliers. This suggests that kriging-based clustering can be a valuable tool for identifying spatial patterns in environmental studies, particularly in cases where the spatial data are noisy or incomplete.
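The gridding step that precedes clustering in the second approach can be sketched with a small ordinary kriging routine. This is a minimal illustration under stated assumptions: the exponential variogram and its `sill` and `rng` parameters are hypothetical and are not the model fitted in the study, and the sensor coordinates and values are invented.

```python
import numpy as np


def exp_variogram(h, sill=1.0, rng=5.0):
    """Exponential variogram without nugget (hypothetical parameters)."""
    return sill * (1.0 - np.exp(-h / rng))


def ordinary_kriging(coords, values, targets, variogram=exp_variogram):
    """Predict values at `targets` from samples at `coords` by ordinary kriging."""
    coords = np.asarray(coords, dtype=float)
    values = np.asarray(values, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = len(coords)
    # Kriging system: variogram between samples, bordered by the unbiasedness constraint
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=2)
    a = np.ones((n + 1, n + 1))
    a[:n, :n] = variogram(d)
    a[n, n] = 0.0
    # Right-hand side: variogram between every sample and every target
    d0 = np.linalg.norm(coords[:, None] - targets[None, :], axis=2)
    b = np.ones((n + 1, len(targets)))
    b[:n] = variogram(d0)
    weights = np.linalg.solve(a, b)  # last row holds the Lagrange multipliers
    return weights[:n].T @ values


# Hypothetical sensors (x, y) and monthly PM10 values
sensors = np.array([(0, 0), (4, 1), (1, 5), (6, 6), (3, 3)], dtype=float)
pm10 = np.array([42.0, 35.0, 50.0, 28.0, 39.0])

# Regular 3 x 3 grid covering the sensor extent
gx, gy = np.meshgrid(np.linspace(0, 6, 3), np.linspace(0, 6, 3))
grid = np.column_stack([gx.ravel(), gy.ravel()])
gridded = ordinary_kriging(sensors, pm10, grid)
```

Each grid node then carries an interpolated value, and a 12-month series per node can be clustered in the same way as a per-sensor series. Because the kriging weights sum to one, a constant field is reproduced exactly, and without a nugget the predictor honours the sensor values at the sensor locations.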
The analysis revealed a distinct difference in patterns between clusters representing average and maximum values. The “Krakow bagel” pattern (for maximum concentrations), which was consistent with previous research findings, was clearly evident. The topography surrounding Krakow can be identified as the main factor responsible for the distribution of maximum concentrations. The city of Krakow was found to form a separate cluster for maximum concentrations, distinct from its neighboring clusters, which radiate outwards. In the case of average concentrations, the city of Krakow and its southeastern vicinity were located in one cluster, while the northwest part was in another, corresponding to the terrain’s shape. In addition, it was shown that the SKATER algorithm was not optimal for analyzing rapidly and spatially varying data due to its specific operating mode. Instead, the K-means algorithm with DTW was found to be more effective in analyzing such data. The findings of this study highlight the importance of using appropriate machine-learning techniques and algorithms for analyzing complex and high-resolution spatiotemporal data. The SKATER spatial clustering algorithm did not work well because the data turned out to be very specific. The spatial factor was of little significance in the data, yet due to the nature of the algorithm it dominated the result. However, this does not mean that SKATER is generally unsuitable for such studies. It simply means that the specific “nested” pattern of the studied feature is overly smoothed by the spatial component.
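Why the spatial component can dominate a SKATER-type result is easier to see from a simplified sketch of spanning-tree regionalization. The code below is not the SKATER implementation used in the study: it builds a minimum spanning tree over a neighbour graph and then simply cuts the most expensive edges, whereas SKATER chooses cuts that minimize the within-region sum of squared deviations. The neighbour list, the values, and the function name `mst_regions` are hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree


def mst_regions(values, edges, n_regions):
    """Split a neighbour graph into contiguous regions by pruning its spanning tree."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    rows = np.array([i for i, _ in edges])
    cols = np.array([j for _, j in edges])
    # Edge cost = attribute dissimilarity; the tiny offset keeps zero-cost edges
    # from being read as "no edge" in the sparse matrix
    costs = np.abs(values[rows] - values[cols]) + 1e-9
    graph = csr_matrix((costs, (rows, cols)), shape=(n, n))
    mst = minimum_spanning_tree(graph).tocoo()
    # Keep all but the (n_regions - 1) most expensive tree edges
    keep = np.argsort(mst.data)[: len(mst.data) - (n_regions - 1)]
    pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])), shape=(n, n))
    _, labels = connected_components(pruned, directed=False)
    return labels


# Hypothetical chain of six neighbouring sensors with a sharp change in the middle
pm10 = [1.0, 1.0, 1.0, 10.0, 10.0, 10.0]
neighbours = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]
labels = mst_regions(pm10, neighbours, n_regions=2)
```

Every region produced this way is spatially contiguous by construction, so two distant areas with the same temporal behaviour can never share a cluster. That constraint is one plausible reading of why a nested pattern is smoothed out.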
As mentioned before, in recent years, machine learning and artificial intelligence algorithms have become increasingly popular for analyzing spatiotemporal data [33]. However, their effectiveness in detecting patterns and trends in such data is not always consistent across different methods. Therefore, it is important to evaluate and compare the performance of various algorithms to determine which is best suited for a particular dataset. In this study, spatial statistical calculations were performed using autocorrelation clustering techniques, specifically Moran and Getis-Ord, to compare the results of unsupervised machine learning algorithms. A similar approach to evaluating machine learning results was applied in research on air pollution in China [34]. These techniques are widely used in spatial data analysis to identify spatial patterns and correlations. However, because they do not take into account the time factor, they were performed separately for each month. The findings from this part of the study suggest that performing hot-spot and cold-spot analyses using the Getis-Ord method allows for the observation of additional factors that were not captured in the K-means pattern. For example, the impact of traffic in summer months, which may be intense but shorter in duration, can be observed using the Getis-Ord method. Smoothing the data using grids allowed us to observe patterns that would not have been visible with sensor data alone. This analysis demonstrated the utility of using grids in spatial analysis. Moreover, the visualization of clusters greatly facilitates the interpretation of results, as demonstrated in the previous analysis of interpolated data. It is important to note that the K-means method requires a predetermined number of clusters. However, selecting the optimal number of clusters is not always straightforward. This study used several metrics, together with the Bayesian information criterion, to select the optimal number of clusters. The results indicate that four clusters provide a reasonable representation of the data and are consistent with current knowledge, allowing for direct interpretation of dependencies from the data. This can be useful for better diagnosing situations and planning urban environments [35]. Finally, supplementing the analyses with Moran and Getis-Ord statistics provides additional information on the spatial patterns and correlations that may be missed by other methods. This underscores the importance of a comprehensive and holistic approach to spatiotemporal data analysis, which can provide insights into complex phenomena such as air pollution, and it supports the use of data-driven approaches to the spatiotemporal analysis of such phenomena. Our conclusions regarding the significance of using clustering techniques in air pollution analysis are consistent with other studies, including those conducted in Korea [36] and in Chile [37].
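Choosing the number of clusters from several metrics can be sketched with scikit-learn, which the reference list cites. The example assumes the Calinski-Harabasz and Davies-Bouldin indices (both appear in the references) and the BIC of a Gaussian mixture. It uses plain Euclidean K-means on synthetic points rather than DTW on PM10 series, so it illustrates the selection logic only, not the study’s workflow.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import calinski_harabasz_score, davies_bouldin_score
from sklearn.mixture import GaussianMixture


def score_cluster_counts(x, candidates):
    """Return (k, Calinski-Harabasz, Davies-Bouldin, BIC) for each candidate k."""
    rows = []
    for k in candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(x)
        bic = GaussianMixture(n_components=k, random_state=0).fit(x).bic(x)
        rows.append(
            (k, calinski_harabasz_score(x, labels), davies_bouldin_score(x, labels), bic)
        )
    return rows


# Synthetic data: three well-separated groups (hypothetical, for illustration)
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
x = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

scores = score_cluster_counts(x, range(2, 7))
best_ch = max(scores, key=lambda row: row[1])[0]  # higher Calinski-Harabasz is better
best_db = min(scores, key=lambda row: row[2])[0]  # lower Davies-Bouldin is better
best_bic = min(scores, key=lambda row: row[3])[0]  # lower BIC is better
```

When the indices agree, the choice is straightforward. When they disagree, as can happen with real sensor data, the final number of clusters remains a judgment call informed by domain knowledge.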

Limitations of the Study

The method used in the analysis of gridded data can cause some oversimplification, which should be taken into consideration. It represents interpolated data values at regular intervals, which can lead to uncertain results in regions where the distance between measurement points is large. Additionally, using monthly means as a measure of similarity can facilitate the identification of general trends, but it can also lead to a loss of information about short-term yet impactful phenomena that could affect the results of clustering.
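The information lost by monthly averaging can be made concrete with a small arithmetic sketch. The numbers are hypothetical: a 30-day month of hourly readings at a constant background level with a single 6-hour pollution episode.

```python
import numpy as np

hours_in_month = 30 * 24       # 720 hourly readings
baseline, spike = 20.0, 300.0  # hypothetical PM10 levels in µg/m³

pm10 = np.full(hours_in_month, baseline)
pm10[100:106] = spike          # one 6-hour episode

monthly_mean = pm10.mean()     # (714 * 20 + 6 * 300) / 720, about 22.3
monthly_max = pm10.max()       # 300.0
```

The episode raises the monthly mean by only about 2.3 µg/m³, while the monthly maximum records it in full. This is why the two indicators can lead to different clusters.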
Another limitation of the study is the difficulty in accurately determining the quantitative amount of time saved when using a big-data approach compared to traditional analysis. Due to several factors, including the type of equipment, computational power, and the experience of the individuals performing spatial analysis, it is not possible to determine precisely how much this process shortens the analysis without additional research. From the perspective of clustering, it can be said that the time is shortened to hours (depending on the amount of data) when an appropriate workflow is developed, and the number of clusters is determined.

5. Conclusions

This study highlights the application of unsupervised machine learning algorithms for analyzing spatiotemporal patterns of air pollution. The main findings of the study are presented below:
  • The K-means and SKATER clustering algorithms revealed distinct differences between average and maximum values of pollutant concentrations.
  • The SKATER algorithm was found to be suboptimal for analyzing rapidly and spatially varying data, highlighting the importance of selecting appropriate clustering algorithms for specific data types. However, it does not mean that SKATER is generally unsuitable for such studies, but its effectiveness may depend on the specific nature of the data being analyzed.
  • The application of the K-means algorithm with DTW produced more accurate results in identifying yearly patterns, and it appears to be the superior method for identifying clusters in this particular case of fast-changing spatiotemporal data.
  • ML techniques together with Moran and Getis-Ord hot-spots and cold-spots analysis provided a holistic problem overview. Furthermore, the clustering analysis of data after kriging greatly facilitated the interpretation of the results, suggesting that this approach can—in some cases—be preferable to clustering on real sensor positions.
The study’s implications for urban planning and decision-making are as follows:
  • The use of machine learning and big data analysis can provide valuable insights into the spatial and temporal distribution of air pollution.
  • The identification of hot-spots and cold-spots can inform policy decisions regarding urban planning, traffic management, and public health interventions.
  • A holistic data approach is needed to fully understand the complex spatiotemporal nature of air pollution in urban environments.
This study highlights the potential of unsupervised ML algorithms for analyzing spatiotemporal patterns of air pollution. It also underscores the importance of selecting appropriate clustering algorithms for specific data types and the need for a holistic approach to fully understand the complex nature of air pollution in urban environments. The findings have important implications for public health in the context of smart cities.

Author Contributions

Conceptualization, M.Z. and H.D.; methodology, M.Z., H.D., E.W., T.D.; validation, H.D., M.Z., E.W., T.D.; formal analysis, H.D., M.Z., E.W., T.D.; investigation, H.D., M.Z., E.W., T.D.; data curation, T.D.; writing (original draft preparation), M.Z., H.D., E.W.; writing (review and editing), H.D., M.Z., E.W., T.D.; visualization, M.Z., H.D., E.W., T.D.; supervision, M.Z., T.D.; project administration, T.D., M.Z. All authors contributed equally. All authors have read and agreed to the published version of the manuscript.

Funding

This research was supported as a part of the statutory project by AGH University of Science and Technology, Faculty of Geology, Geophysics and Environmental Protection.

Institutional Review Board Statement

Not applicable.

Informed Consent Statement

Not applicable.

Data Availability Statement

Publicly available datasets from Airly sensors were analyzed in this study and can be found at https://map.airly.org/ (accessed on 20 February 2023). API documentation from Airly is available at https://developer.airly.org/en/docs (accessed on 19 January 2023). Publicly available E-OBS gridded datasets were also analyzed in this study and can be found at https://www.ecad.eu/download/ensembles/download.php (accessed on 20 February 2023).

Conflicts of Interest

The authors declare no conflict of interest.

Abbreviations

The following abbreviations are used in this manuscript:
LCS: Low-Cost Sensors
PM: Particulate Matter
WMO: World Meteorological Organization
EU: European Union
ML: Machine Learning
AI: Artificial Intelligence

References

  1. Thurston, G.; Kipen, H.; Annesi-Maesano, I.; Balmes, J.; Brook, R.; Cromar, K.; De Matteis, S.; Forastiere, F.; Forsberg, B.; Frampton, M.; et al. A joint ERS/ATS policy statement: What constitutes an adverse health effect of air pollution? An analytical framework. Eur. Respir. J. 2017, 49, 1600419.
  2. Raaschou-Nielsen, O.; Andersen, Z.; Beelen, R.; Samoli, E.; Stafoggia, M.; Weinmayr, G.; Hoffmann, B.; Fischer, P.; Nieuwenhuijsen, M.; Brunekreef, B.; et al. Air pollution and lung cancer incidence in 17 European cohorts: Prospective analyses from the European Study of Cohorts for Air Pollution Effects (ESCAPE). Lancet Oncol. 2013, 14, 813–822.
  3. Kuzma, L.; Roszkowska, S.; Swieczkowski, M.; Dabrowski, E.; Kurasz, A.; Wanha, W.; Bachorzewska-Gajewska, H.; Dobrzycki, H. Exposure to air pollution and its effect on ischemic strokes (EP-PARTICLES study). Sci. Rep. 2022, 12, 17150.
  4. Manisalidis, I.; Stavropoulou, E.; Stavropoulos, A.; Bezirtzoglou, E. Environmental and Health Impacts of Air Pollution: A Review. Front. Public Health 2020, 8, 14.
  5. Pedersen, M.; Giorgis-Allemand, L.; Bernard, C.; Aguilera, I.; Andersen, A.; Ballester, F.; Beelen, R.; Chatzi, L.; Cirach, M.; Danileviciute, A.; et al. Ambient air pollution and low birthweight: A European cohort study (ESCAPE). Lancet Respir. Med. 2013, 1, 695–704.
  6. Bokwa, A. Environmental Impacts of Long-Term Air Pollution Changes in Kraków, Poland. Polish J. Environ. Stud. 2008, 17, 673–686.
  7. Intergovernmental Panel on Climate Change (IPCC). Climate Change 2013: The Physical Science Basis, Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel on Climate Change; Cambridge University Press: New York, NY, USA, 2013.
  8. Danek, T.; Weglinska, E.; Zareba, M. The influence of meteorological factors and terrain on air pollution concentration and migration: A geostatistical case study from Krakow, Poland. Sci. Rep. 2022, 12, 11050.
  9. Danek, T.; Zareba, M. The Use of Public Data from Low-Cost Sensors for the Geospatial Analysis of Air Pollution from Solid Fuel Heating during the COVID-19 Pandemic Spring Period in Krakow, Poland. Sensors 2021, 21, 5208.
  10. Kuzma, L.; Kurasz, A.; Dabrowski, E.J.; Dobrzycki, S.; Bachorzewska-Gajewska, H. Short-Term Effects of “Polish Smog” on Cardiovascular Mortality in the Green Lungs of Poland: A Case-Crossover Study with 4,500,000 Person-Years (PL-PARTICLES Study). Atmosphere 2021, 12, 1270.
  11. Czerwinska, J.; Wielgosinski, G.; Szymanska, O. Is the Polish Smog a New Type of Smog? Ecol. Chem. Eng. S 2019, 26, 465–474.
  12. Zareba, M.; Danek, T. Analysis of Air Pollution Migration during COVID-19 Lockdown in Krakow, Poland. Aerosol Air Qual. Res. 2022, 22, 210275.
  13. Urząd Miasta Krakowa. I Stopień Zagrożenia Zanieczyszczeniem Powietrza. Available online: https://www.krakow.pl/aktualnosci/218420,29,komunikat,i_stopien_zagrozenia_zanieczyszczeniem_powietrza.html (accessed on 20 March 2023).
  14. European Parliament. Directive 2008/50/EC of the European Parliament and of the Council of 21 May 2008 on Ambient Air Quality and Cleaner Air for Europe. 2008. Available online: http://eur-lex.europa.eu/legal-content/en/ALL/?uri=CELEX:32008L0050 (accessed on 29 September 2021).
  15. Chief Inspectorate for Environmental Protection. PMs Measuring in the Air. 2021. Available online: http://www.gios.gov.pl/pl/aktualnosci/391-pomiary-pylu-zawieszonego-w-powietrzu (accessed on 29 September 2021).
  16. Peltier, R.E.; Castell, N.; Clements, A.L.; Dye, T.; Hüglin, C.; Kroll, J.H.; Lung, S.C.C.; Ning, Z.; Parsons, M.; Penza, M.; et al. An Update on Low-Cost Sensors for the Measurement of Atmospheric Composition; World Meteorological Organization: Geneva, Switzerland, 2020; p. 1215.
  17. Abdalla, H.B. A brief survey on big data: Technologies, terminologies and data-intensive applications. J. Big Data 2022, 9, 1–36.
  18. Hamerly, G. Learning Structure and Concepts in Data through Data Clustering. Ph.D. Thesis, University of California, San Diego, CA, USA, 2003.
  19. Zareba, M.; Danek, T.; Stefaniuk, M. Unsupervised Machine Learning Techniques for Improving Reservoir Interpretation Using Walkaway VSP and Sonic Log Data. Energies 2023, 16, 493.
  20. Pedregosa, F.; Varoquaux, G.; Gramfort, A.; Michel, V.; Thirion, B.; Grisel, O.; Blondel, M.; Prettenhofer, P.; Weiss, R.; Dubourg, V.; et al. Scikit-learn: Machine Learning in Python. J. Mach. Learn. Res. 2011, 12, 2825–2830.
  21. Bishop, C.M. Pattern Recognition and Machine Learning. In Information Science and Statistics; Jordan, M., Kleinberg, J., Scholkopf, B., Eds.; Springer Science+Business Media: Berlin/Heidelberg, Germany, 2006.
  22. Assunção, R.M.; Neves, M.C.; Câmara, G.; da Costa Freitas, C. Efficient regionalization techniques for socio-economic geographical units using minimum spanning trees. Int. J. Geogr. Inf. Sci. 2006, 20, 797–811.
  23. ESRI Learning Center, Redlands. ArcGIS Pro [Computer Software]: Release 2.8, 2021; ESRI: Redlands, CA, USA, 2021.
  24. Anselin, L. Local Indicators of Spatial Association—LISA. Geogr. Anal. 1995, 27, 93–115.
  25. Getis, A.; Ord, J. The Analysis of Spatial Association by Use of Distance Statistics. Geogr. Anal. 1992, 24, 189–206.
  26. Banthia, A.; Jayasumana, A.; Malaiya, Y. Data size reduction for clustering-based binning of ICs using principal component analysis (PCA). In Proceedings of the 2005 IEEE International Workshop on Current and Defect Based Testing, Palm Springs, CA, USA, 1 May 2005; pp. 24–30.
  27. Caliński, T.; Harabasz, J. A dendrite method for cluster analysis. Commun. Stat. 1974, 3, 1–27.
  28. Davies, D.L.; Bouldin, D.W. A Cluster Separation Measure. IEEE Trans. Pattern Anal. Mach. Intell. 1979, PAMI-1, 224–227.
  29. Celeux, G.; Fruhwirth-Schnatter, S.; Robert, C. Model Selection for Mixture Models-Perspectives and Strategies. In Handbook of Mixture Analysis; CRC Press: Boca Raton, FL, USA, 2018.
  30. Fischer, P.H.; Marra, M.; Ameling, C.B.; Hoek, G.; Beelen, R.; de Hoogh, K.; Breugelmans, O.; Kruize, H.; Janssen, N.A.; Houthuijs, D. Air pollution and mortality in seven million adults: The Dutch Environmental Longitudinal Study (DUELS). Environ. Health Perspect. 2015, 123, 697–704.
  31. Lu, D.; Mao, W.; Xiao, W.; Zhang, L. Non-Linear Response of PM2.5 Pollution to Land Use Change in China. Remote. Sens. 2021, 13, 1612.
  32. Jankowski, P. Integrating geographical information systems and multiple criteria decision-making methods. Int. J. Geogr. Inf. Syst. 1995, 9, 251–273.
  33. Iskandaryan, D.; Ramos, F.; Trilles, S. Air Quality Prediction in Smart Cities Using Machine Learning Technologies Based on Sensor Data: A Review. Appl. Sci. 2020, 10, 2401.
  34. Yin, L.; Wang, L.; Huang, W.; Liu, S.; Yang, B.; Zheng, W. Spatiotemporal Analysis of Haze in Beijing Based on the Multi-Convolution Model. Atmosphere 2021, 12, 1408.
  35. Marquez, L.O.; Smith, N.C. A framework for linking urban form and air quality. Environ. Model. Softw. 1999, 14, 541–548.
  36. Urban form and air pollution: Clustering patterns of urban form factors related to particulate matter in Seoul, Korea. Sustain. Cities Soc. 2022, 81, 103859.
  37. Jorquera, H.; Villalobos, A.M. Combining Cluster Analysis of Air Pollution and Meteorological Data with Receptor Model Results for Ambient PM2.5 and PM10. Int. J. Environ. Res. Public Health 2020, 17, 8455.
Figure 1. Krakow’s digital terrain model (black lines: city districts), sensor locations (purple triangles), and main rivers (blue lines). Data sources: digital terrain from European Union, Copernicus Land Monitoring Service 2022, European Environment Agency (EEA); European Union map by © OpenStreetMap (https://www.openstreetmap.org/, accessed on 2 March 2023).
Figure 2. Data preparation workflow.
Figure 3. Clustering evaluation metrics for average values per sensor.
Figure 4. Clustering evaluation metrics for average values for gridded data.
Figure 5. Clustering evaluation metrics for maximum values per sensor.
Figure 6. Clustering evaluation metrics for maximum values for gridded data.
Figure 7. Bayesian Information Criterion for maximum and average values per sensor.
Figure 8. Clustering results for K-means with DTW and SKATER algorithms for maximum and average concentration in 12-month period.
Figure 9. The spatial associations for monthly average PM10 concentrations showing high-high and low-low clusters and high-low and low-high outliers using Local Moran’s I. The maps illustrate clustering in March (A), April (B), May (C), June (D), July (E), and August 2021 (F).
Figure 10. The spatial associations for monthly average PM10 concentrations for high-high and low-low clusters and high-low and low-high outliers using Local Moran’s I. The maps illustrate clustering in September (A), October (B), November (C), December 2021 (D), January (E), and February 2022 (F).
Figure 11. The spatial associations for monthly maximum PM10 concentrations showing high-high and low-low clusters and high-low and low-high outliers using Local Moran’s I. The maps illustrate clustering in March (A), April (B), May (C), June (D), July (E), and August 2021 (F).
Figure 12. The spatial associations for monthly maximum PM10 concentrations showing high-high and low-low clusters and high-low and low-high outliers using Local Moran’s I. The maps illustrate clustering in September (A), October (B), November (C), December 2021 (D), January (E), and February 2022 (F).
Figure 13. Hot-spots and cold-spots maps for monthly average PM10 concentration using Getis-Ord $G_i^*$. The maps illustrate clustering in March (A), April (B), May (C), June (D), July (E), and August 2021 (F).
Figure 14. Hot-spots and cold-spots maps for monthly average PM10 concentration using Getis-Ord $G_i^*$. The maps illustrate clustering in September (A), October (B), November (C), December 2021 (D), January (E), and February 2022 (F).
Figure 15. Hot-spots and cold-spots maps for monthly maximum PM10 concentration using Getis-Ord $G_i^*$. The maps illustrate clustering in March (A), April (B), May (C), June (D), July (E), and August 2021 (F).
Figure 16. Hot-spots and cold-spots maps for monthly maximum PM10 concentration using Getis-Ord $G_i^*$. The maps illustrate clustering in September (A), October (B), November (C), December 2021 (D), January (E), and February 2022 (F).
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Zareba, M.; Dlugosz, H.; Danek, T.; Weglinska, E. Big-Data-Driven Machine Learning for Enhancing Spatiotemporal Air Pollution Pattern Analysis. Atmosphere 2023, 14, 760. https://doi.org/10.3390/atmos14040760