Article

Majority Voting Based Multi-Task Clustering of Air Quality Monitoring Network in Turkey

1
Graduate School of Natural and Applied Sciences, Dokuz Eylul University, 35390 Izmir, Turkey
2
Department of Computer Engineering, Dokuz Eylul University, 35390 Izmir, Turkey
3
Department of Environmental Engineering, Dokuz Eylul University, 35390 Izmir, Turkey
*
Author to whom correspondence should be addressed.
Appl. Sci. 2019, 9(8), 1610; https://doi.org/10.3390/app9081610
Submission received: 1 March 2019 / Revised: 7 April 2019 / Accepted: 14 April 2019 / Published: 18 April 2019
(This article belongs to the Special Issue Air Quality Prediction Based on Machine Learning Algorithms)

Abstract:
Air pollution, which is the result of the urbanization brought by modern life, has a dramatic impact on the global scale as well as local and regional scales. Since air pollution has important effects on human health and other living things, the issue of air quality is of great importance all over the world. Accordingly, many studies based on classification, clustering and association rule mining applications for air pollution have been proposed in the field of data mining and machine learning to extract hidden knowledge from environmental parameters. One approach is to model a region in a way that cities having similar characteristics are determined and placed into the same clusters. Instead of using traditional clustering algorithms, a novel algorithm, named Majority Voting based Multi-Task Clustering (MV-MTC), is proposed and utilized to consider multiple air pollutants jointly. Experimental studies showed that the proposed method is superior to five well-known clustering algorithms: K-Means, Expectation Maximization, Canopy, Farthest First and Hierarchical clustering methods.

1. Introduction

Air pollution is now recognized as an important problem all over the world. It can be described as a mixture of multiple pollutants that vary in size and composition. Air pollutants (also referred to as “criteria pollutants”) are commonly grouped as particulate matter, such as PM10 and PM2.5, and ground-level pollutants such as ozone (O3), carbon monoxide (CO), sulfur dioxide (SO2), nitric oxide (NO) and nitrogen dioxide (NO2).
It is known that air pollution has negative impacts on human health, visibility, materials, and plant and animal health. Air pollutants trigger or worsen chronic diseases such as asthma, pneumonia, heart attack, bronchitis and other respiratory problems. Since particulate matter is very small and light, it tends to stay in the air longer than heavier particles. This increases the likelihood that humans and animals inhale these particles through respiration. Due to their small size, these particles can easily pass through the nose and throat and penetrate the lungs, and some may even enter the circulatory system. Smoke, a mixture of gases and solid and liquid particles resulting from incompletely burned carbon materials such as solid fuels and fuel oil, is a form of air pollution and reduces visibility. Air pollution also has a destructive and disturbing effect on artistic and architectural structures. To plants, it can be lethal and can inhibit growth. Thus, high concentrations of air pollutants can harm human health, adversely influence the environment, and also cause property damage [1,2,3,4].
Due to the seriousness of the issue, air pollution control policies require systematic monitoring and evaluation of air quality. The causes of air pollution should be investigated and necessary precautions should be taken in accordance with the findings. Therefore, it is very important to develop an appropriate tool to understand the air quality in an area. For this purpose, effective methods are continuously developed with new studies.
In this context, this study aimed at examining the air quality monitoring stations in Turkey according to their similarities in terms of five air pollutants, PM10, SO2, NO2, NO and O3, and making appropriate inferences based on the analysis of the levels of air pollutants measured at these stations between 1 November 2017 and 1 November 2018. In this way, city areas with similar air pollution behavior can be identified, so that decision-making authorities can steer the siting of emission sources toward appropriate regions. To perform the experiments, a novel algorithm, named Majority Voting based Multi-Task Clustering (MV-MTC), is proposed instead of applying traditional clustering algorithms, to benefit from the common decision coming from different pollutant sources. The novelty of this study is the implementation of multi-task clustering (MTC) in the field of environmental science and the examination of air pollution in Turkey with this method for the first time.
The proposed algorithm (MV-MTC) was compared with popular clustering algorithms, namely K-Means, Expectation Maximization, Canopy, Farthest First and Hierarchical clustering methods, in terms of sum of squared error (SSE). The experimental results obtained in this study indicate that the proposed approach produces better clusters than standard clustering algorithms by considering relationships among multiple air pollutants jointly.
The remainder of this study is organized as follows. Section 2 gives a detailed literature survey of studies using data mining methods for the air quality control of Turkey, in addition to recent studies on multi-task clustering. Section 3 explains background information on the methods used in the experiments. The proposed MTC technique and the dataset description are presented in Section 4 and Section 5, respectively. The experimental studies are presented and the obtained results are discussed in Section 6. Lastly, concluding remarks, a brief summary and future directions are given.

2. Related Work

Monitoring stations located in nearby areas tend to share the same specific air pollution characteristics. Many studies in the literature exploit this information. Data mining and machine learning are intensively applied to environmental subjects to identify interesting structure in large amounts of environmental data, where the structure comprises patterns, rules, predictive models and relationships among the data. Ignaccolo, Ghigo and Giovenali [5] classified the air quality monitoring network in Piemonte (Northern Italy) using functional cluster analysis based on the Partitioning Around Medoids algorithm and considering three air pollutants, namely NO2, PM10, and O3, to classify sites into homogeneous clusters and identify representative ones. Barrero, Orza, Cabello and Cantón [6] analyzed the variations of PM10 concentrations at 43 stations in the air quality monitoring network of the Basque Country to group them according to their common characteristics. They implemented the autocorrelation function and K-Means clustering. Similarly, Lu, He and Dong [7] used principal component analysis and cluster analysis for the management of the air quality monitoring network of Hong Kong and for the reduction of associated expenses.
In Turkey, environmental issues have also gained much attention, and studies related to air quality continue to increase in this direction. Several environmental data mining studies on “air quality in Turkey” are compared in Table 1 by displaying the year of publication, the target pollutants, the dataset content used in the experiments, the aim of the study, which data mining task was applied and which algorithms/methods were implemented, as well as the performance metrics used to evaluate the results of the applied methodology. The bold notation in the Algorithms/Methods column shows the algorithm that performs best among the others. According to the findings, most of the experiments were done using measurements of PM10 concentrations [8,9,10,11,12,13], and prediction of air pollutant amounts is the main goal. In addition to pollution data, some of the studies also integrate meteorological data such as temperature, wind speed, wind direction, pressure and humidity into the problem domain [8,11,12,13].
Multi-task learning (MTL) is a learning technique in which useful information contained in a number of related tasks is leveraged to help improve the overall performance of all tasks. All of these tasks, or at least a subset of them, are assumed to be related to each other. In many applications, learning these tasks jointly leads to performance improvements compared with learning them individually. In many fields, including computer vision, bioinformatics, health informatics, speech and natural language processing, web applications and ubiquitous computing, MTL is used to enhance the overall performance of the applications involved. Learning paradigms including supervised learning (e.g., classification or regression problems) [14,15,16], unsupervised learning [17,18,19,20,21,22,23], semi-supervised learning [24,25,26,27], active learning [28,29,30,31], reinforcement learning [32,33,34,35], multi-view learning [21,36,37,38], and graphical models [39,40,41] are generally combined with MTL [42,43].
Multi-task classification and multi-task clustering are two well-known types of multi-task learning recently presented in the literature. Wang, Yan, Lu, Zhang and Li [44] use multi-task classification in the prediction of air pollution particles by implementing a deep multi-task learning framework. On the other hand, multi-task clustering has not been studied until now for air quality management, nor in environmental science more broadly.
There is an issue to be addressed while learning multiple tasks: “what to share”. The sharing form determines how knowledge can be shared among the tasks. Usually, there are three forms of sharing: feature-based, instance-based and parameter-based. Feature-based MTL aims to learn common features among different tasks. Instance-based MTL identifies useful data objects in one task for other tasks and then shares knowledge via the identified instances. Parameter-based MTL uses the model parameters of one task to help learn the model parameters of other tasks [42]. The method proposed in this study (MV-MTC) is an unsupervised, instance-based MTL application.
MTC has been applied in many different areas, including bioinformatics, text mining, web mining, image mining and daily activity recognition [18,19,20,21,22,23,45]. The resulting clustering of MTC has generally outperformed the outputs of any single clustering algorithm. Table 2 presents a brief list of studies in which different MTC algorithms are proposed and applied in various subject areas. It has been experimentally shown that MTC algorithms provide remarkable performance compared to single-task learners.
The proposed MV-MTC algorithm has many advantages over existing multi-task clustering methods. First, some methods have a complicated theoretical foundation, which leads to implementation difficulties. For instance, graph-based methods and nonnegative matrix factorization are commonly applied (e.g., [21] implements a semi-nonnegative matrix tri-factorization method to co-cluster the data in each view of each task). Likewise, the algorithm introduced in [49] has several sophisticated steps: feature extraction, clustering-based regularization, convex relaxation, and optimization. Spectral clustering, which uses the eigenvectors of the Laplacian of a graph for clustering, is another way to implement multi-task clustering [20]. In addition to graph-based methods (e.g., [19]), multi-task clustering can be performed by reweighting the distance between data points in different tasks by learning a shared subspace. In this way, the clustering for each individual task is generated by selecting the nearest neighbors of each sample from the other tasks in the learned shared subspace.
Second, some proposed MTC methods (e.g., scVDMC [18] and Arboretum-Hi-C [20]) were designed as field-specific methods and are valid only for bioinformatics data, to analyze the genome architecture or to simultaneously capture differentially expressed genes. These methods are not suitable for the analysis of geographical data (or for the identification of the air pollution levels of a region).
Third, our algorithm is particularly advantageous since it does not need any a priori information about the data. In contrast, the Convex Multi-task Clustering (CMTC) algorithm proposed by Yan et al. [22] requires some a priori knowledge about the data relationships.
Fourth, some multi-task clustering algorithms (e.g., [23]) require additional parameters, and their results change significantly with different parameter values. This makes such an algorithm difficult to use, since the user must determine the optimal parameters for each problem. Our algorithm does not require any additional parameter tuning.
Fifth, the execution time of some multi-task clustering algorithms (e.g., [45]) increases exponentially as the input data grow. In contrast, our algorithm (MV-MTC) requires computation time that grows linearly with the number of instances, clusters and tasks.
Sixth, our proposed method can effectively avoid the imbalance of cluster distribution by merging multiple models according to majority voting. In addition, the MV-MTC framework can effectively reduce clustering errors by selecting the best clustering algorithm for the problem under consideration.
Our goal is to propose an easily implemented, generally applicable, fast, prior knowledge- and parameter-independent multi-task clustering method. Unlike existing methods, the algorithm in this paper is a new kind of multi-task clustering method that is much easier to understand and implement, since it takes a joint decision from different tasks using cluster labels. It was developed as a method that can appeal to every area rather than being specific to one (e.g., [18,20]).
Different types of MTC algorithms have been proposed. For instance, multi-task multi-view clustering [21] handles the learning problem of multiple related tasks with one or more common views. Each view is associated with one task or multiple related tasks, inter-task knowledge is transferred among tasks, and the multi-task and multi-view relationships are exploited to improve clustering performance. In [21], it is applied to webpage and image mining operations under a clustering framework.

3. Materials and Methods

In this section, the applied methodologies, the datasets used in the experiments, and the platforms employed are presented. The overall goal of the used techniques was to create clusters with a consistent set of similarly behaving points by ensuring maximum similarity among intra-cluster objects while keeping inter-cluster differences high. The clustering algorithms used in this study were K-Means, Expectation Maximization, Canopy, Farthest First and Hierarchical clustering, in addition to the proposed multi-task clustering technique.

3.1. K-Means Clustering

Consider a dataset D = {o1, o2, …, on} where each oi represents an object as a p-dimensional explanatory variable and n is the number of objects (instances) in the dataset. Assume that the problem domain is to be divided into k clusters combination of which is represented as a vector CKM = {C1, C2, …, Ck} and the centroids of k clusters are denoted by µ = {m1, m2, …, mk}.
The first step is to assign k points as cluster centers at random. The distance between each data point o_i and each cluster centroid m_j, where i ∈ {1, …, n} and j ∈ {1, …, k}, is calculated using one of the distance metrics such as Euclidean, Manhattan, Chebyshev or Minkowski distance, and the instance is assigned to the nearest cluster, argmin_j dist(o_i, m_j). New cluster centroids are then calculated as m_j = (1/n_j) Σ_{o_i ∈ C_j} o_i, where n_j denotes the number of objects in cluster C_j. This process continues iteratively until no data point changes cluster membership. Instead of random initialization, different techniques such as K-Means++, Farthest First or Canopy can be used.
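As a concrete illustration, the assignment and update steps described above can be sketched in a few lines of Python. This is a minimal, stdlib-only sketch with random initialization and Euclidean distance; the function name and the tuple representation of points are our illustrative choices, not the Weka implementation used in the experiments.

```python
import math
import random

def kmeans(points, k, iters=100, seed=0):
    # Pick k distinct data points as initial centroids at random.
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[j].append(p)
        # Update step: m_j = (1/n_j) * sum of the points in cluster j.
        new_centroids = [
            tuple(sum(coord) / len(cl) for coord in zip(*cl)) if cl else centroids[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no membership changes remain
            break
        centroids = new_centroids
    return centroids, clusters
```

Swapping the initial `rng.sample` call for a K-Means++, Farthest First or Canopy seeding step changes only the initialization, as noted above.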

3.2. Expectation Maximization Clustering

Expectation Maximization (EM) extends the K-Means paradigm. While the K-Means algorithm assigns each data point to exactly one cluster, the EM model assigns each object to each cluster with a weight representing the probability of membership. In other words, there is no hard boundary between clusters, and new centers are calculated in terms of weighted measures [50]. EM clusters data points using a finite mixture density model, e.g., the normal distribution, of k probability distributions, where each distribution represents a cluster.
As in K-Means clustering, the process starts by selecting cluster centroids randomly. The procedure then iteratively refines the parameters (i.e., clusters) with two steps based on statistical modeling: the Expectation (E) step and the Maximization (M) step [51]. In the E step, the probability of cluster membership of an instance is computed from the present parameter estimates using Equation (1), where p(o_i | C_j) follows the normal distribution and i ∈ {1, …, n}, j ∈ {1, …, k}.
P(o_i ∈ C_j) = p(C_j | o_i) = p(C_j) p(o_i | C_j) / p(o_i), (1)
The M step, given in Equation (2), re-estimates the model parameters by finding the values that maximize the expected log-likelihood computed in the E step. The iterative process continues until the optimal value is obtained.
m_j = (1/n) Σ_{i=1}^{n} [o_i P(o_i ∈ C_j) / Σ_t P(o_i ∈ C_t)], (2)
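The E and M steps can be illustrated with a one-dimensional sketch. The following Python code assumes, for simplicity, two unit-variance normal components and a crude min/max initialization; the M step uses the standard responsibility-weighted mean, a common simplification of the textbook update in Equation (2). This is an illustrative sketch, not the Weka EM implementation used in this study.

```python
import math

def em_1d(xs, iters=50):
    # Two unit-variance normal components; crude min/max initialization.
    means = [min(xs), max(xs)]
    priors = [0.5, 0.5]
    for _ in range(iters):
        # E-step (Eq. 1): posterior p(C_j | o_i) proportional to
        # p(C_j) * p(o_i | C_j), normalized over the clusters.
        resp = []
        for x in xs:
            lik = [priors[j] * math.exp(-0.5 * (x - means[j]) ** 2) for j in range(2)]
            s = sum(lik)
            resp.append([l / s for l in lik])
        # M-step: responsibility-weighted mean and updated prior per cluster.
        for j in range(2):
            w = sum(r[j] for r in resp)
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / w
            priors[j] = w / len(xs)
    return means
```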

3.3. Canopy Clustering

Canopy is generally applied as a preprocessing step for other clustering algorithms, such as K-Means or Hierarchical clustering, to speed up the process in the case of large datasets [52]. The procedure uses two distance thresholds T1 > T2 and a list of data points to cluster. An initial canopy center is chosen at random from the data points, and the distances of all other instances to this canopy center are approximated. Instances whose distance falls within the threshold T1 are placed into the canopy, while data points whose distance falls within the threshold T2 are removed from the list. These removed points are excluded from being selected as a new canopy center or from creating new canopies. The process continues iteratively until the list is empty.
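The thresholding procedure above can be sketched as follows (a stdlib-only Python sketch; the function name and tuple representation of points are our illustrative choices):

```python
import math
import random

def canopy(points, t1, t2, seed=0):
    # Loose threshold t1 builds canopies; tight threshold t2 removes points.
    assert t1 > t2
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Pick a random canopy center from the remaining points.
        center = remaining.pop(rng.randrange(len(remaining)))
        # Every remaining point within t1 joins this canopy.
        members = [center] + [p for p in remaining if math.dist(p, center) < t1]
        canopies.append((center, members))
        # Points within t2 can no longer seed or join new canopies.
        remaining = [p for p in remaining if math.dist(p, center) >= t2]
    return canopies
```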

3.4. Farthest First Clustering

Farthest First is a variant of K-Means clustering in which each cluster centroid is selected, in turn, as the point farthest from the existing cluster centers. This point must lie within the data area. This generally boosts the speed of clustering significantly, since fewer reassignments and adjustments are needed [53].
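The farthest-point selection can be sketched as below (illustrative stdlib-only Python; only the center-selection step of Farthest First is shown, not the subsequent assignment of points to the chosen centers):

```python
import math
import random

def farthest_first_centers(points, k, seed=0):
    # Start from one random point, then greedily add the point whose
    # distance to its nearest already-chosen center is largest.
    rng = random.Random(seed)
    centers = [rng.choice(points)]
    while len(centers) < k:
        nxt = max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        centers.append(nxt)
    return centers
```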

3.5. Hierarchical Clustering

Hierarchical clustering groups data objects into a tree of clusters in either a bottom-up (agglomerative) or a top-down (divisive) fashion [49]. In the agglomerative version, each instance of the dataset is initially put into its own cluster, and these atomic clusters are merged continuously until a single cluster holds all data points or a termination condition is met. The divisive version is the opposite of agglomerative clustering: it begins with a single cluster containing all data points, which is then subdivided into smaller distinct clusters until a termination criterion is satisfied, such as reaching a predetermined number of clusters.
Depending on how the distance between clusters is calculated, there are several link types used in Hierarchical clustering, such as Single (the minimum link, i.e., the closest distance between any items of two different clusters), Complete (the maximum link, i.e., the largest distance between any items of two different clusters), Average (the average distance between the elements of two clusters), Mean (the mean distance of the merged cluster) and Centroid (the distance from one centroid to another).
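The link types above follow directly from their definitions, as the following illustrative Python sketch shows (clusters are represented as lists of coordinate tuples; the Mean link, which needs the merged cluster, is omitted for brevity):

```python
import math

def linkage_distance(a, b, link="single"):
    # a and b are clusters given as lists of coordinate tuples.
    pair = [math.dist(p, q) for p in a for q in b]
    if link == "single":      # closest pair between the two clusters
        return min(pair)
    if link == "complete":    # farthest pair between the two clusters
        return max(pair)
    if link == "average":     # mean over all cross-cluster pairs
        return sum(pair) / len(pair)
    if link == "centroid":    # distance between the two cluster centroids
        ca = tuple(sum(x) / len(a) for x in zip(*a))
        cb = tuple(sum(x) / len(b) for x in zip(*b))
        return math.dist(ca, cb)
    raise ValueError(link)
```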

4. Multi-task Clustering

A task generally refers to the construction of a model using a specific dataset for a single target or sub-goal. In this sense, “multiple tasks” means modeling multiple output targets simultaneously by using task-related datasets and by considering task relations. Based on this definition, multi-task clustering (MTC) is the process of generating global clusters that are shared by multiple related tasks. MTC aims to merge information among tasks to improve the clustering performance of the individual tasks. The most important aspect of MTC is discovering the information shared among tasks. In this paper, a novel algorithm, named Majority Voting based Multi-Task Clustering (MV-MTC), is proposed to address this aspect.
Consider the unlabeled dataset D = {o1, o2, …, on} where each oi represents an object as a p-dimensional explanatory variable and n is the number of objects (instances) in the dataset. Assume that the problem domain consists of r different tasks T = {t1, t2, …, tr}, each of which is represented as ti.
In the first step of the algorithm, the instance set allotted to each task is clustered using one of the traditional clustering algorithms. For r different tasks, let us denote the resulting clustering assignments as C = {C_t1, C_t2, …, C_tr}, where C_ti = {c_1, c_2, …, c_k} for a predetermined number of clusters k, and each c_i consists of different o_i's from the dataset D. To take a joint decision from all C_ti's, a common factor must be determined, because the same cluster names do not necessarily represent the same clustering structure across the task groups. We need to determine common cluster labels that carry the same meaning across all tasks.
In this context, after clustering the instances of each task with one of the single clustering algorithms, all clusters are labeled from the common label set L = {L1, L2, …, Lk}, as in Table 3, in terms of the mean weights of their intra-cluster objects; k cluster labels for k clusters are produced according to the cluster weights. To illustrate, if we have three clusters, the heaviest, the medium and the lightest one can be labeled as “L3”, “L2” and “L1”, respectively. The same procedure is applied for all r tasks. As shown in the following example, all instances in the dataset are labeled with a suitable cluster label L_i for each task t_i. In the final stage, as in the majority voting approach, the most common cluster label among all tasks for a given instance o_i is selected as the final cluster assignment. Hence, the novel MTC algorithm is called Majority Voting based Multi-Task Clustering (MV-MTC).
This study proposes two novel concepts: single-task clusters and multi-task clusters. In the first phase, the proposed algorithm discovers local clusters (single-task clusters) from each task data separately, and, in the second phase, these local clusters are combined to produce the global result (multi-task clusters).
Definition 1.
(Single-Task Clusters) Single-task clusters are groups of instances discovered from the data partition D_t of a particular task t, i.e., D = ∪_{t=1}^{r} D_t, and denoted by C_ti = {c_1, c_2, …, c_k}, where k is the number of clusters.
Definition 2.
(Multi-Task Clusters) Given r tasks T = {t_i}_{i=1}^{r}, where all the tasks are related but not identical, multi-task clusters, which are denoted by C = {C_t1, C_t2, …, C_tr}, are groups of instances that mostly appear in the same level of the clusters of the tasks.
Based on these definitions, it is possible to say that there are two elementary factors in multi-task clustering. The first factor is the definition of a task. Many real-world problems consist of a number of related subtasks. For instance, the PM10, SO2, NO2, NO and O3 air pollutants can be considered the tasks of the air quality monitoring problem. The second factor is the definition of the ensemble method used to combine multiple tasks. In our study, we used the majority voting mechanism, which selects the cluster with the most votes.
To illustrate the rationale behind the algorithm, the example scenario in Table 4 and Table 5 explains the process step by step. In the first stage, the dataset D, containing instances with only one feature, is given. There are three tasks (t1, t2 and t3), and the aim is to group the dataset into three clusters by taking a joint decision from each task. The attribute value of an instance can change according to the task. The next step clusters the instances with one of the clustering algorithms, simultaneously for each task. Instances are assigned to one of three clusters (C1, C2 or C3). However, a common decision point on the cluster groups of the different tasks is needed to obtain the final cluster assignments. Therefore, three labels (L1, L2 and L3) are used to generalize the clusters so that they denote the same groupings under different tasks, according to the average intra-cluster weights. In the final part, after the instances are labeled with the new label set for every task (C_t1, C_t2 and C_t3), the majority voting scheme is applied to obtain the final cluster labels of the MV-MTC algorithm.
Figure 1 displays the general framework of the multi-task clustering algorithm, where each t_i denotes a single task of the task space and D is the unlabeled data. The main purpose is to ensure that the instances in the clusters created before the MV-MTC result remain in the same set in the final step. The number of instances remaining in the same cluster is maximized according to Equation (3), where C_ij(o_r) means that the instance o_r is a member of cluster c_j of task t_i, and MV-MTC_j indicates the resulting cluster c_j of the MV-MTC algorithm. The pseudocode of the proposed algorithm is given in Algorithm 1.
max { Σ_{i=1}^{r} Σ_{j=1}^{k} Σ_{r=1}^{n} 1 : C_ij(o_r) ∈ MV-MTC_j, o_r ∈ C_ij }, (3)
As shown in Algorithm 1, the methodology consists of four steps. In the first step, single-task clusters are generated by considering each individual task. Step 2 calculates the intra-cluster weights under the different tasks. In Step 3, cluster labels are assigned to the clusters according to their weight values: L1 is assigned to the cluster with the lowest mean value, and the label values increase until Lk is given to the highest one. In the last step, the joint decision from the different tasks is taken by applying the majority voting mechanism. As a result, all data points are placed into the most suitable clusters and the final cluster labels are assigned from the joint decision.
Algorithm 1: Majority Voting based Multi-task Clustering (MV-MTC)
Inputs: Dataset D = {o1, o2, …, on}
Task space T = {t1, t2, …, tr}
CA: a clustering algorithm
Cluster label set L = {L1, L2, …, Lk}
n: the number of instances
k: the number of clusters
r: the number of tasks
Process:
//   Step 1: Cluster the dataset according to each task t_i
1.   for i = 1 to r
2.      C_ti = CA(D, t_i)
3.      C.add(C_ti)
Output:
C = {C_t1, C_t2, …, C_tr}   // cluster assignments in terms of different tasks
C_ti = {c_1, c_2, …, c_k}   // k different clusters under the task t_i
//   Step 2: Determine average intra-cluster weights
4.   for each C_ti in C
5.      for j = 1 to k
6.         sum = 0
7.         for each o in c_j
8.            sum = sum + o      // value of the instance
9.         m_j = sum / |c_j|
Output:
µ_i = {m_1, m_2, …, m_k}   // k average intra-cluster weights under the task t_i
//   Step 3: Label each cluster c_j in C_ti for all tasks according to the µ_i values
10.  for each C_ti in C
11.     for j = 1 to k
12.        L_cj = L_index(m_j)   // the cluster label from L given by the rank of m_j
//   Step 4: Obtain the joint decision
13.  for i = 1 to n
14.     L(o_i) = argmax_{L_l ∈ L} Σ_{j=1}^{r} 1[L_tj(o_i) = L_l]
// final cluster assignment of o_i, where L_tj(o_i) is the cluster label of o_i in task t_j and j ∈ {1, 2, …, r}
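Algorithm 1 can be condensed into an executable Python sketch. Here the base clustering step (Step 1) is assumed to have been run already, so the function receives the per-task instance values and per-task cluster assignments; the function name and input representation are our illustrative choices.

```python
from collections import Counter

def mv_mtc(task_values, task_assignments):
    # task_values[t][i]: value of instance i under task t.
    # task_assignments[t][i]: cluster id of instance i produced by the
    # base clustering algorithm CA on task t (Step 1 of Algorithm 1).
    n = len(task_values[0])
    labels_per_task = []
    for vals, assign in zip(task_values, task_assignments):
        # Step 2: average intra-cluster weight of each cluster.
        means = {}
        for c in set(assign):
            members = [vals[i] for i in range(n) if assign[i] == c]
            means[c] = sum(members) / len(members)
        # Step 3: rank clusters by mean weight -> common labels L1..Lk
        # (L1 = lightest cluster, Lk = heaviest, as in the text).
        rank = {c: r + 1 for r, c in enumerate(sorted(means, key=means.get))}
        labels_per_task.append(["L%d" % rank[assign[i]] for i in range(n)])
    # Step 4: majority vote across tasks gives the final assignment.
    return [Counter(t[i] for t in labels_per_task).most_common(1)[0][0]
            for i in range(n)]
```

Any of the single clustering algorithms in Section 3 can supply the assignments; the merging step only needs their cluster ids and the instance values.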

5. Dataset Description and Used Platforms

There are seven geographical regions in Turkey, namely Eastern Anatolia, Central Anatolia, Southeastern Anatolia, Black Sea, Mediterranean, Aegean, and Marmara, with numerous air quality monitoring stations (AQMS) in each region. This study was conducted on 49 AQMSs from 32 provinces in different regions. The features of each station are listed in Table 6 and Table 7, showing the name of the AQMS, the city in which it is located, the corresponding county, longitude and latitude information, network type (urban/rural/industrial), and which air pollutants are regularly measured there.
The National Air Quality Monitoring Network of Turkey includes 330 Air Quality Monitoring Stations, and the air quality of all provinces in the country is monitored. To facilitate public access to information on air quality, the monitoring results are published online at http://laboratuvar.cevre.gov.tr [54]. SO2 and PM10 are measured at all of the air pollution measurement stations; in addition, NO, NO2, NOx, CO and O3 are measured automatically at many of them. In this study, all of the AQMSs were investigated, and 49 out of the 330 stations were selected because all of the aforementioned air pollutants (PM10, SO2, NO2, NO and O3) are regularly measured at these stations.
Since the data become roughly periodic over a one-year period, only one year of data (November 2017 to November 2018) was used in the experiments. The pollutant concentrations are mean values of daily (24 h) measurements. The application was developed using the Weka open source data mining library [55] in Visual Studio.

6. Experimental Results

In this study, the proposed MTC method, MV-MTC, was compared with the traditional clustering algorithms K-Means (KM), Expectation Maximization (EM), Hierarchical clustering (HIER), Canopy and Farthest First (FFIRST). Each task was clustered by the selected algorithm and then the consensus decision was obtained in the MV-MTC framework. Performance evaluation was done via sum of squared error (SSE) calculation. Before constructing the models, the data were normalized and missing values were imputed with the mean values.
The number of clusters, k, was selected as 10% of the number of instances in the dataset; therefore, it was 5. Euclidean distance was chosen as the distance metric. To take the joint decision from each single clustering algorithm, each cluster was labeled according to a weight calculated as the average value of its intra-cluster instances. According to this scheme, five cluster labels were determined: “L1”, “L2”, “L3”, “L4” and “L5”. Table 8 displays the average normalized weight of each cluster in terms of the different air pollutants and the corresponding cluster labels. As a result of the joint decision of the different tasks, where the evaluation of each of the PM10, SO2, NO2, NO and O3 pollutants was treated as a task, the final cluster assignments were made.
To evaluate the performance of the applied methodology, sum of squared error values were calculated. SSE is the sum of the squared differences between each observation and its group's mean. In Equation (4), o_i represents an instance of the dataset D, C_j represents the jth cluster, m_j is the centroid of the cluster C_j to which o_i is assigned, and k is the number of clusters. The total SSE of a method is the sum of the separate SSE values of the distinct clusters.
SSE = \sum_{j=1}^{k} \sum_{o_i \in C_j} (o_i - m_j)^2
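Equation (4) can be computed directly. The following sketch (our illustration, assuming instances are numeric vectors compared under Euclidean distance) sums the squared distances of all instances to their assigned centroids:

```python
def sse(data, assignments, centroids):
    """Sum of squared Euclidean distances between each instance
    and the centroid of its assigned cluster (Equation (4))."""
    total = 0.0
    for o, c in zip(data, assignments):
        m = centroids[c]
        # Squared Euclidean distance of instance o to its centroid m.
        total += sum((oi - mi) ** 2 for oi, mi in zip(o, m))
    return total


# Tiny example: two clusters on a line.
data = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0)]
assignments = [0, 0, 1]
centroids = [(1.0, 0.0), (10.0, 0.0)]
print(sse(data, assignments, centroids))  # 2.0
```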
Final assignments were obtained both by MV-MTC and by the single clustering algorithms. Each clustering algorithm was applied to each single task, i.e., a model was formed by taking only one pollutant into consideration. The SSE results of the different pollutants under the different algorithms are shown in Table 9 as C_pollutantName, where “pollutantName” is one of PM10, SO2, NO2, NO or O3, and C_ALL is the average SSE value over all pollutants. KM was applied in two versions that differ in their initialization method: KM with random initialization is denoted as KM, and KM initialized with K-Means++ is denoted as KM++. Hierarchical clustering was implemented with different link types between clusters: HIER_Sing, HIER_Comp, HIER_Avg, HIER_Mean and HIER_Centro represent hierarchical clustering with single, complete, average, mean and centroid linkage, respectively. The bold entries in Table 9 show the best result in each row.
We can conclude that the proposed MV-MTC method outperforms all single clustering algorithms: similar AQMSs are assigned to the same cluster more accurately when multi-task clustering is applied. The best MV-MTC result is obtained with KM++. Among the single clustering algorithms, EM performs best.
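The merging step of MV-MTC, majority voting over the per-task labels of each instance, can be sketched as follows (an illustrative sketch; the tie-breaking rule shown here, i.e., the first label encountered among the most frequent, is our assumption rather than the paper's stated rule):

```python
from collections import Counter


def majority_vote(label_table):
    """Merge per-task cluster labels into one final label per instance.

    label_table: one tuple of labels per instance, holding that
    instance's cluster label under each task.
    Ties are broken by order of appearance, mirroring Counter's
    most_common behaviour (an assumption, not the paper's stated rule).
    """
    return [Counter(labels).most_common(1)[0][0] for labels in label_table]


# Instances 1, 4 and 12 of the Table 5 example.
votes = [("L1", "L2", "L2"), ("L2", "L3", "L3"), ("L1", "L1", "L3")]
print(majority_vote(votes))  # ['L2', 'L3', 'L1']
```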
Final cluster assignments after performing MV-MTC with KM++ are shown in Figure 2, which marks the geographical locations of the AQMSs on the map of Turkey with colored markers, where each color represents a cluster.
In the MV-MTC approach, a clustering algorithm is performed for t tasks and a merging operation (“majority voting”) is done in the final step. In this study, the best results were obtained using the K-Means++ algorithm. The time complexity of K-Means++ is O(n × k + n × k × I), where n is the number of instances, k is the number of clusters, and I is the number of iterations needed for convergence [56]. After the single-task clustering step, the merging process takes O(t × n) time, where t is the number of tasks. The total time complexity of the MV-MTC algorithm is therefore O(n × k + n × k × I + t × n), which grows linearly with the number of instances, clusters and tasks. Thus, the execution time of the algorithm remains reasonable even when processing large volumes of data.
Table 10 shows the execution time (in seconds) of the MV-MTC algorithm for the different clustering methods. Single-task clustering results are also shown as C_pollutantName, and C_ALL represents the sum of the running times of all single-task clusterings. Experiments were performed on a desktop computer with an Intel Core i7-6700 3.40 GHz processor and 8 GB of memory. In each experiment, the algorithms were executed 10 times and the average values were reported. The empirical results show that the running time of K-Means++ under the MV-MTC framework is better than those of EM and the hierarchical clustering algorithms. Moreover, comparing the C_ALL and MV-MTC results, the MV-MTC algorithm has speed comparable to the traditional clustering algorithms.
The MV-MTC algorithm was also compared, in terms of time complexity, with a recently proposed MTC method, MTCMRL [45]. In [45], multi-task clustering is combined with model relation learning to automatically learn the model parameter relatedness between each pair of tasks by solving a non-convex optimization problem. Even though MTCMRL has better clustering performance than other multi-task clustering methods, its time complexity of O(n² × m), where m is the number of features and n is the number of instances per task, grows quadratically with n and becomes prohibitive for larger data volumes. MV-MTC, on the other hand, remains preferable because of its linearly growing time complexity.
This study aimed to identify regions that are similar in terms of air quality, which enables flexible decision-making at the cluster level: decision makers responsible for air quality control can take the same actions for all members of a cluster. Since the data of many air quality monitoring stations are summarized in a few clusters, the approach provides richer but more compact information for control and modeling. It finds structure in air quality data and is therefore exploratory in nature. Representing the whole environmental dataset by a few clusters greatly simplifies the analysis, and identifying groups of monitoring stations can be used to understand why the stations in the same cluster are similar. Clustering monitoring stations thus minimizes information overload; grouping similar information and summarizing common characteristics help environmental scientists understand the current situation more clearly. In addition, a new station can be classified by assigning it to the cluster with the closest center.
The potential contributions of this study to the prediction of air quality can be listed as follows:
  • Multi-task clustering can be used to label all the observed elements before air quality prediction, by calculating the distance between each centroid and each element in the data and then selecting the cluster label (or level) with the minimum distance.
  • Multi-task clustering can also be used as a preprocessing step to improve the speed and performance of the classification algorithm that is used to predict air quality index.
  • In applications predicting the air quality index, temporal data clustering results can give information about air quality variations, so that a set of forecasting systems dedicated to reflecting temporal changes can be formed.
  • The identification of the air pollution levels of different regions by clustering can be useful for designing the structure of an air quality monitoring network. Such networks must consider the monitoring locations, sampling frequencies and the pollutants of concern. For instance, clustering results can guide the design of an optimal network, i.e., a network providing maximum information with minimum measurement devices. Spatial relationship analysis is used to compare the information given by the potential sites that may form the network.
  • When forecasting the level of air pollution, it is possible to find the closest cluster to a new instance and then use the values in this cluster for prediction.
  • Multi-task clustering can also be useful for detecting extreme air pollution events and can help predict future exceedances. In this sense, an air pollutant value of a region may be considered an outlier if it exceeds the minimum or maximum value of the cluster it belongs to.
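Several of the points above (labeling new observations, finding the closest cluster to a new instance) reduce to a nearest-centroid query, which can be sketched as follows (an illustrative sketch; the function name is ours):

```python
def nearest_cluster(instance, centroids):
    """Return the index of the centroid closest to the instance
    under squared Euclidean distance."""
    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(centroids)), key=lambda j: dist2(instance, centroids[j]))


# A new station's (normalized) pollutant vector is assigned to the
# cluster whose centroid is nearest.
centroids = [(0.0, 0.0), (5.0, 5.0)]
print(nearest_cluster((1.0, 1.0), centroids))  # 0
print(nearest_cluster((4.0, 5.0), centroids))  # 1
```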

7. Conclusions

The main purpose of this study was to present a new multi-task clustering algorithm to determine which provinces of Turkey have the same air pollution characteristics, so that similar precautions for the reduction of pollution can be taken by the decision-making authority for the cities in the same group. The main air pollutants selected for the experiments were PM10, SO2, NO2, NO and O3, and their mean daily concentrations were taken into consideration. All of the data were taken from 49 air quality monitoring stations in different regions of Turkey. Two phases were performed under the MV-MTC scheme: single-task clustering and multi-task clustering. In single-task clustering, each air pollutant was handled individually and the air quality monitoring stations were assigned to the respective clusters (local clustering). In the multi-task clustering phase, clusters were labeled according to their intra-cluster weights, so that a common decision from the different tasks could be reached by applying majority voting on these cluster labels per instance. Final cluster labels were obtained in this phase by combining the results of the single-task clusters (global clustering). According to the sum of squared error results, the proposed multi-task clustering method MV-MTC performed well compared to the classical single clustering algorithms K-Means, Expectation Maximization, Canopy, Farthest First and Hierarchical clustering. MV-MTC with K-Means initialized by K-Means++ provided promising results in the detection of similar AQMSs.
With this study, the following benefits can be obtained:
  • Similar regions can be detected easily so that similar air quality management strategies can be applied for them by the decision-making authority.
  • Collecting similar information together and summarizing common features help environmental scientists figure out the present situation more clearly.
  • Data analysis becomes easier because only a few clusters are dealt with instead of the whole environmental dataset.
  • Data summarization yields compact and useful information, so huge amounts of redundant data need not be handled.
  • It can be used as a pre-processing step before performing the essential environmental study.
  • Inherent hidden patterns of air quality data can be discovered.
  • A new station can be classified by placing it into the cluster with the nearest cluster center.
In the future, other unsupervised learning methods, such as association rule mining, outlier detection or time series analysis, can be applied to Turkey’s air pollution data. Instead of using only pollutant levels, meteorological factors such as temperature, humidity, wind speed and direction, and pressure could be added to the problem domain, because they can significantly influence the air quality level of a region. Seasonal changes could also be studied instead of yearly data, and the severity of air quality could be clustered based on its impact on health or its potential damage to the environment. Furthermore, a new study could investigate the main causes of pollution by utilizing data on, for example, fuel, exhaust and industrial waste.
PM2.5 is one of the most dangerous particulate pollutants. However, in Turkey, PM2.5 measurements suffer from a missing data problem; the same is true for CO, so neither is dealt with in this study. If the study is extended to other countries, additional air pollutants can also be handled.

Author Contributions

G.T., D.B. and A.P. were the main investigators; G.T. and D.B. contributed to writing paper and critically reviewed the paper; D.B. contributed to the design of the paper; G.T. performed the review of the literature; G.T. implemented the methodology; D.B. supervised the work and provided experimental insights; and A.P. critically reviewed the paper and contributed to its final edition.

Funding

This research received no external funding.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Rich, D.Q. Accountability Studies of Air Pollution and Health Effects: Lessons Learned and Recommendations for Future Natural Experiment Opportunities. Environ. Int. 2017, 100, 62–78.
  2. Xing, Y.F.; Xu, Y.H.; Shi, M.H.; Lian, Y.X. The Impact of PM2.5 on the Human Respiratory System. J. Thorac. Dis. 2016, 8, E69–E74.
  3. Mannucci, P.M.; Franchini, M. Health Effects of Ambient Air Pollution in Developing Countries. Int. J. Environ. Res. Public Health 2017, 14, 1048.
  4. Hava Kirliliğinin Çevre ve İnsan Sağlığına Etkileri [Effects of Air Pollution on the Environment and Human Health]. Available online: http://cevreonline.com/hava-kirliliginin-cevre-ve-insan-sagligina-etkileri/ (accessed on 20 January 2019).
  5. Ignaccolo, R.; Ghigo, S.; Giovenali, E. Analysis of Air Quality Monitoring Networks by Functional Clustering. Environmetrics 2008, 19, 672–686.
  6. Barrero, M.A.; Orza, J.A.G.; Cabello, M.; Cantón, L. Categorisation of Air Quality Monitoring Stations by Evaluation of PM10 Variability. Sci. Total Environ. 2015, 524, 225–236.
  7. Lu, W.Z.; He, H.D.; Dong, L.Y. Performance Assessment of Air Quality Monitoring Networks Using Principal Component Analysis and Cluster Analysis. Build. Environ. 2011, 46, 577–583.
  8. Kaya, K.; Öğüdücü, Ş.G. A binary classification model for PM10 levels. In Proceedings of the 3rd International Conference on Computer Science and Engineering (UBMK 2018), Sarajevo, Bosnia-Herzegovina, 20–23 September 2018; pp. 361–366.
  9. Onal, A.E.; Bayramlar, O.F.; Ezirmik, E.; Gulle, B.T.; Canatar, F.; Calik, D.; Nacar, D.D.; Aydin, L.E.; Baran, A.; Harbawi, Z.K. Evaluation of Air Quality in the City of Istanbul during the Years 2013 and 2015. J. Environ. Sci. Eng. 2017, 6, 465–470.
  10. Güler, N.; İşçi, Ö.G. The Regional Prediction Model of PM10 Concentrations for Turkey. Atmos. Res. 2016, 180, 64–77.
  11. Taşpınar, F.; Bozkurt, Z. Application of artificial neural networks and regression models in the prediction of daily maximum PM10 concentration in Düzce, Turkey. Fresenius Environ. Bull. 2014, 23, 2450–2459.
  12. Şahin, Ü.A.; Ucan, O.N.; Bayat, C.; Tolluoglu, O. A New Approach to Prediction of SO2 and PM10 Concentrations in Istanbul, Turkey: Cellular Neural Network (CNN). Environ. Forensics 2011, 12, 253–269.
  13. Kurt, A.; Oktay, A.B. Forecasting Air Pollutant Indicator Levels with Geographic Models 3 Days in Advance Using Neural Networks. Expert Syst. Appl. 2010, 37, 7986–7992.
  14. Xue, Y.; Liao, X.; Carin, L.; Krishnapuram, B. Multi-task Learning for Classification with Dirichlet Process Priors. J. Mach. Learn. Res. 2007, 8, 35–63.
  15. Liu, P.; Qiu, X.; Huang, X. Recurrent neural network for text classification with multi-task learning. In Proceedings of the Twenty-Fifth International Joint Conference on Artificial Intelligence (IJCAI 2016), New York, NY, USA, 9–15 July 2016; pp. 2873–2879.
  16. Liu, X.; Gao, J.; He, X.; Deng, L.; Duh, K.; Wang, Y.Y. Representation Learning Using Multi-task Deep Neural Networks for Semantic Classification and Information Retrieval. Available online: https://www.microsoft.com/en-us/research/publication (accessed on 2 January 2019).
  17. Zhang, X.L. Convex Discriminative Multitask Clustering. IEEE Trans. Pattern Anal. Mach. Intell. 2015, 37, 28–40.
  18. Zhang, H.; Lee, C.A.A.; Li, Z.; Garbe, J.R.; Eide, C.R.; Petegrosso, R.; Kuang, R.; Tolar, J. A multitask clustering approach for single-cell RNA-seq analysis in Recessive Dystrophic Epidermolysis Bullosa. PLoS Comput. Biol. 2018, 14, e1006053.
  19. Zhang, X.; Zhang, X.; Liu, H.; Liu, X. Multi-task Clustering through Instances Transfer. Neurocomputing 2017, 251, 145–155.
  20. Siahpirani, A.F.; Ay, F.; Roy, S. A Multi-task Graph-Clustering Approach for Chromosome Conformation Capture Data Sets Identifies Conserved Modules of Chromosomal Interactions. Genome Biol. 2016, 17, 114.
  21. Zhang, X.; Zhang, X.; Liu, H.; Liu, X. Multi-Task Multi-View Clustering. IEEE Trans. Knowl. Data Eng. 2016, 28, 3324–3338.
  22. Yan, Y.; Ricci, E.; Liu, G.; Sebe, N. Egocentric Daily Activity Recognition via Multitask Clustering. IEEE Trans. Image Process. 2015, 24, 2984–2995.
  23. Zhang, X.; Zhang, X.; Liu, H. Smart Multitask Bregman Clustering and Multitask Kernel Clustering. ACM Trans. Knowl. Discov. Data 2015, 10, 8.
  24. Liu, Q.; Liao, X.; Carin, L. Semi-supervised multitask learning. In Proceedings of the 20th International Conference on Neural Information Processing Systems (NIPS 2007), Vancouver, BC, Canada, 3–6 December 2007; pp. 937–944.
  25. Qi, Y.; Tastan, O.; Carbonell, J.G.; Klein-Seetharaman, J.; Weston, J. Semi-supervised Multi-task Learning for Predicting Interactions between HIV-1 and Human Proteins. Bioinformatics 2010, 26, i645–i652.
  26. Lu, X.; Li, X.; Mou, L. Semi-supervised Multitask Learning for Scene Recognition. IEEE Trans. Cybern. 2015, 45, 1967–1976.
  27. Zhang, Y.; Yeung, D.Y. Semi-supervised multi-task regression. In Machine Learning and Knowledge Discovery in Databases, Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Bled, Slovenia, 6–10 September 2009; Springer: Berlin/Heidelberg, Germany, 2009; Volume 5782, pp. 617–631.
  28. Reichart, R.; Tomanek, K.; Hahn, U.; Rappoport, A. Multi-task active learning for linguistic annotations. In Proceedings of the Association for Computational Linguistics: Human Language Technology Conference (ACL: HLT 2008), Columbus, OH, USA, 15–20 June 2008; pp. 861–869.
  29. Zhang, Y. Multi-task active learning with output constraints. In Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI 2010), Atlanta, GA, USA, 11–15 July 2010; pp. 667–672.
  30. Acharya, A.; Mooney, R.J.; Ghosh, J. Active Multitask Learning Using Supervised and Shared Latent Topics. In Pattern Recognition and Big Data; World Scientific: Singapore, 2017; pp. 75–112.
  31. Fang, M.; Tao, D. Active multi-task learning via bandits. In Proceedings of the SIAM International Conference on Data Mining (SDM 2015), Vancouver, BC, Canada, 30 April–2 May 2015; pp. 505–513.
  32. Wilson, A.; Fern, A.; Ray, S.; Tadepalli, P. Multi-task reinforcement learning: A hierarchical Bayesian approach. In Proceedings of the 24th International Conference on Machine Learning (ICML 2007), Corvalis, OR, USA, 20–24 June 2007; pp. 1015–1022.
  33. Parisotto, E.; Ba, J.L.; Salakhutdinov, R. Actor-Mimic: Deep Multitask and Transfer Reinforcement Learning. Available online: https://arxiv.org/abs/1511.06342 (accessed on 12 January 2019).
  34. Li, H.; Liao, X.; Carin, L. Multi-task Reinforcement Learning in Partially Observable Stochastic Environments. J. Mach. Learn. Res. 2009, 10, 1131–1186.
  35. Lazaric, A.; Ghavamzadeh, M. Bayesian multi-task reinforcement learning. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 599–606.
  36. Zhao, J.; Xie, X.; Xu, X.; Sun, S. Multi-view Learning Overview: Recent Progress and New Challenges. J. Adv. Inf. Fusion 2017, 38, 43–54.
  37. He, J.; Lawrence, R. A graph-based framework for multi-task multi-view learning. In Proceedings of the 28th International Conference on Machine Learning (ICML 2011), Bellevue, WA, USA, 28 June–2 July 2011; pp. 25–32.
  38. Gao, Z.; Li, S.H.; Zhang, G.T.; Zhu, Y.J.; Wang, C.; Zhang, H. Evaluation of Regularized Multi-task Leaning Algorithms for Single/Multi-view Human Action Recognition. Multimed. Tools Appl. 2017, 76, 20125–20148.
  39. Honorio, J.; Samaras, D. Multi-task learning of gaussian graphical models. In Proceedings of the 27th International Conference on Machine Learning (ICML 2010), Haifa, Israel, 21–24 June 2010; pp. 447–454.
  40. Oyen, D.; Lane, T. Leveraging domain knowledge in multitask Bayesian network structure learning. In Proceedings of the 26th AAAI Conference on Artificial Intelligence (AAAI 2012), Toronto, ON, Canada, 22–26 July 2012; pp. 1091–1097.
  41. Yan, Y.; Ricci, E.; Subramanian, R.; Lanz, O.; Sebe, N. No matter where you are: Flexible graph-guided multi-task learning for multi-view head pose classification under target motion. In Proceedings of the IEEE International Conference on Computer Vision (ICCV 2013), Sydney, Australia, 3–6 December 2013; pp. 1177–1184.
  42. Zhang, Y.; Yang, Q. A Survey on Multi-Task Learning. Available online: https://arxiv.org/abs/1707.08114 (accessed on 10 January 2019).
  43. Zhang, Y.; Yang, Q. An Overview of Multi-task Learning. Natl. Sci. Rev. 2017, 5, 30–43.
  44. Wang, B.; Yan, Z.; Lu, J.; Zhang, G.; Li, T. Deep Multi-task Learning for Air Quality Prediction. In Lecture Notes in Computer Science; Cheng, L., Leung, A., Ozawa, S., Eds.; Springer: Cham, Switzerland, 2018; Volume 11305, ISBN 978-3-030-04221-9.
  45. Zhang, X.; Zhang, X.; Liu, H.; Luo, J. Multi-task clustering with model relation learning. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI 2018), Stockholm, Sweden, 13–19 July 2018; pp. 3132–3140.
  46. Dincer, N.G.; Akkuş, Ö. A New Fuzzy Time Series Model based on Robust Clustering for Forecasting of Air Pollution. Ecol. Inform. 2018, 43, 157–164.
  47. Atacak, I.; Arici, N.; Guner, D. Modelling and Evaluating Air Quality with Fuzzy Logic Algorithm-Ankara-Cebeci Sample. Int. J. Intell. Syst. Appl. Eng. 2017, 5, 263–268.
  48. Cagcag, O.; Yolcu, U.; Egrioglu, E.; Aladag, C.H. A Novel Seasonal Fuzzy Time Series Method to the Forecasting of Air Pollution Data in Ankara. Am. J. Intell. Syst. 2013, 3, 13–19.
  49. Liu, A.; Lu, Y.; Nie, W.; Su, Y.; Yang, Z. HEp-2 Cells Classification via Clustered Multi-task Learning. Neurocomputing 2016, 195, 195–201.
  50. Han, J.; Pei, J.; Kamber, M. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann Publishers: Waltham, MA, USA, 2011; ISBN 9780123814807.
  51. Dempster, A.P.; Laird, N.M.; Rubin, D.B. Maximum Likelihood from Incomplete Data via the EM Algorithm. J. R. Stat. Soc. Ser. B 1977, 39, 1–22.
  52. McCallum, A.; Nigam, K.; Ungar, L.H. Efficient clustering of high-dimensional data sets with application to reference matching. In Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (ACM:KDD 2000), Boston, MA, USA, 20–23 August 2000; pp. 169–178.
  53. Sharma, N.; Bajpai, A.; Litoriya, M.R. Comparison the Various Clustering Algorithms of Weka Tools. Int. J. Emerg. Technol. Adv. Eng. 2012, 4, 78–80.
  54. Hava Kalitesi İzleme İstasyonları Web Sitesi [Air Quality Monitoring Stations Website]. Available online: http://laboratuvar.cevre.gov.tr (accessed on 20 November 2018).
  55. Frank, E.; Hall, M.A.; Witten, I.H. The WEKA Workbench. Online Appendix for “Data Mining: Practical Machine Learning Tools and Techniques”, 4th ed.; Morgan Kaufmann: Cambridge, MA, USA, 2016; ISBN 9780128043578.
  56. Wu, H.; Li, H.; Jiang, M.; Chen, C.; Lv, Q.; Wu, C. Identify High-quality Protein Structural Models by Enhanced K-means. BioMed Res. Int. 2017, 2017, 7294519.
Figure 1. The general MV-MTC clustering scheme.
Figure 2. Results from the implementation of multi-task clustering obtained by KM++.
Table 1. Summary of the data mining studies with “air pollution in Turkey” as the main subject.
Ref. | Year | The Target Pollutants | The Dataset Content | Aim | Task | Algorithms/Methods | Performance Metrics
[8] | 2018 | PM10 | PD: hourly densities of CO, NO, NO2, NOx, O3 and SO2; M: T, WD, WS, P, RH, min. T, max. T, max. WD and max. WS | Estimation of the density of PM10 at Istanbul using datasets with imbalanced class distribution | Prediction, Classification | LRC, RFC, ETC, and GBC | Accuracy, AUROC
[46] | 2018 | SO2 | The time series of weekly SO2 concentrations | Forecast air pollution in 65 monitoring stations | Clustering, Prediction | FTS based on FKM, FCMF and GKF | RMSE and PB
[9] | 2017 | PM10, SO2 | Four months of SO2 and PM10 concentrations in Istanbul | Evaluation of the results of regular measurements of PM10 and SO2 concentrations in the city of Istanbul | Prediction | ANOVA | SSE and MSE, Kolmogorov–Smirnov test, t-test
[47] | 2017 | AQI (monthly basis) | SO2, NO2, CO, O3 and PM10 (hourly and daily basis) | Determination of the AQI in Ankara | Classification | FLA | -
[10] | 2016 | PM10 | Weekly PM10 concentrations in numerous stations in Turkey | Predicting a model to estimate PM10 concentrations for 130 monitoring stations | Clustering, Prediction | FCARM, AR | MAPE, Dickey–Fuller test, p-value
[11] | 2014 | PM10 | Hourly observations: M: max. T, avg. T, std. T, max. WS, avg. WS, std. WS, max. WD, avg. WD, std. WD, degree, max. RH, avg. RH, std. RH; PD: max. PM10, avg. PM10, std. PM10 | Forecast maximum daily PM10 concentrations one day ahead in Duzce | Prediction | ANN (MLP), SWR, MLR | IA, FMB, RMSE, R2
[48] | 2013 | SO2 | The amount of SO2 in Ankara | Seasonal fuzzy time series forecasting in Ankara | Clustering, Prediction | Fuzzy C-means combined with ANN and SARIMA | RMSE, MAPE
[12] | 2011 | PM10, SO2 | M: T, WS and WD, RH, P, S, C, R | Prediction of the daily and hourly mean concentrations of PM10 and SO2 pollutants in the regions of Istanbul | Prediction | CNN, PER | r, d, Mean Bias Error, MAE and RMSE, t-test (p value)
[13] | 2010 | PM10, SO2, CO | M: P, Day T, Night T, H, WS, WD; PD: SO2, CO, PM10; GC, Day of Week, Date | Forecasting SO2, CO and PM10 levels 3 days in advance for the Besiktas district of Istanbul | Prediction | GFM_NN | Band Error
LRC, Logistic Regression Classifier; T, temperature; FKM, Fuzzy K-Medoid; MLP, Multi-Layer Perceptron; RFC, Random Forest Classifier; RH, Relative Humidity (H); P, Pressure; R2, coefficient of determination; ETC, Extra Trees Classifier; AQI, Air Quality Index; FTS, Fuzzy Time Series; FMB, Fractional Mean Bias; GBC, Gradient Boosting Classifier; RMSE, Root Mean Squared Error; IA, Index-of-Agreement; MLR, Multiple Linear Regression; PD, Pollution Data; FCMF, FTS Models based on Fuzzy C-means; FCARM, Fuzzy C-Auto Regressive Model; SWR, Stepwise Regression; M, Meteorological Data; GKF, Gustafson–Kessel; AR, Autoregressive model; WS, Wind Speed; PB, Percent Bias; MAPE, Mean Absolute Percentage Error; GC, General Condition; WD, Wind Direction; ANOVA, Analysis of Variance; FLA, Fuzzy Logic Algorithm; ANN, Artificial Neural Network; r, Correlation Coefficient; MAE, Mean Absolute Error; S, Sunshine; R, Rainfall; C, Cloudiness; d, Index of Agreement; PER, Statistical Persistence Method; CNN, Cellular Neural Network; AUROC, The Area under the Receiver Operating Characteristic; SARIMA, Seasonal Autoregressive Integrated Moving Average; GFM_NN, Geographic Forecasting Models using Neural Networks.
Table 2. Summary of the data mining studies taking “Multi-Task Clustering” as the main subject.
Ref. | Year | Subject Area | Aim | Algorithms/Methods | Performance Metrics
[18] | 2018 | Bioinformatics | Generate a model that utilizes multiple single-cell populations from biological replicates or different samples to address the cross-population clustering problem of scRNA-seq data | scVDMC, KM, Pooled KM, SNN-Cliq, CellTree, Seurat, SC3 | ARI, Cluster Error (measured on the best one-to-one matching between the detected clusters and the true clusters)
[19] | 2017 | Text Mining and Image Mining | Propose a general multi-task clustering algorithm by transferring knowledge of instances through reweighting the distance between samples in different tasks by learning a shared subspace and selecting the nearest neighbors for each sample from the other tasks | MTCTKI, KM, KKM, Ncut–GK, Ncut–SNN, LSSMTC, DMTFC, SMT–NMF, MTCTKI0 | Clustering Accuracy, NMI
[20] | 2016 | Bioinformatics | Identify common and context-specific aspects of genome architecture | Arboretum-Hi-C, KM, HC, SC | DBI, SI, D, Number of enriched clusters, log P value of ANOVA
[21] | 2016 | Web Page Mining and Image Mining | Develop a co-clustering based multi-task multi-view clustering framework which integrates within-view-task clustering, multi-view relationship learning and multi-task relationship learning | KM, KKM, NSC, BiCo, SNMTF, CoRe, CoTr, LSSMTC, DMTFC, BMTMVC, SMTMVC | Clustering Accuracy, NMI
[49] | 2016 | Bioinformatics | Automated HEp-2 cells classification | CMTL and the other 28 methods presented in the HEp-2 Cells Classification contest held at the 2012 International Conference on Pattern Recognition | Accuracy
[22] | 2015 | Activity Recognition | Daily living analysis from visual data gathered from wearable cameras | EMD-MTC with linear and rbf kernel (denoted CEMD-MTC and KEMD-MTC, respectively); KM, KKM, CNMF and SemiNMF, SemiEMD-MTC, KSemiEMD-MTC and the LSMTC method | Clustering Accuracy, NMI
[23] | 2013 | Document Clustering | Identifying and avoiding negative effects of the boosting process of MBC and also dealing with nonlinear separable data in the clustering of documents | SMBC and S-MKC; KM and KKM; MBC | Clustering Accuracy, NMI
NMF, Nonnegative Matrix Factorization; Ncut–GK, Normalized Cut with Gaussian Kernel Similarity; BiCo, Bipartite Graph Co-clustering; DMTFC, Convex Discriminative Multi-task Feature Clustering; MTCTKI0, MTCTKI without using Shared Subspace; NSC, Normalized Spectral Clustering; SNMTF, Semi-nonnegative Matrix Tri-factorization; CoRe, Co-regularized Multi-view Spectral Clustering; CoTr, Co-trained Multi-view Spectral Clustering; SC3, Single-cell Consensus Clustering; SNN-Cliq, Shared Nearest Neighbor Cliq; LSSMTC, Learning the Shared Subspace for Multi-task Clustering; DBI, Davies–Bouldin Index; SI, Silhouette Index; D, Delta Contact Count; SMT–NMF, Symmetric Multi-task Non-negative Matrix Factorization; ARI, Adjusted Rand Index; HC, Hierarchical Clustering; SC, Spectral Clustering; EMD-MTC, Earth Mover’s Distance Regularized Multi-task Clustering; NMI, Normalized Mutual Information; KM, K-means; S-MKC, Smart Multi-task Kernel Clustering; KKM, Kernel K-means; CMTL, Clustered Multi-task Learning; MBC, Multitask Bregman Clustering; SMBC, Smart Multi-task Bregman Clustering; BMTMVC, Bipartite Graph based Multi-task Multi-view Clustering; Ncut–SNN, Normalized Cut with Shared Nearest Neighbor Similarity; MTCTKI, Multi-task Clustering by Transferring Knowledge of Instances; scVDMC, Variance-Driven Multitask Clustering of Singlecell RNA-seq Data; SMTMVC, Semi-nonnegative Matrix Tri-factorization based Multi-task Multi-view Clustering; Arboretum-Hi-C, Multi-task Spectral Clustering Algorithm for Comparative Analysis of Hi-C Data.
Table 3. Assignments of cluster labels under different tasks.
Instance | Cluster Label for t1 | Cluster Label for t2 | … | Cluster Label for tr−1 | Cluster Label for tr
o1 | L1 | L2 | … | Lk−1 | Lk
o2 | L2 | Lk | … | Lk−1 | L3
on−1 | Lk | L2 | … | L1 | L4
on | L1 | Lk−2 | … | Lk−1 | Lk
Table 4. An example showing MV-MTC algorithm step by step.
(1) Dataset D and its instance values in terms of three tasks.
Ins. ID | Task t1 | Task t2 | Task t3
1 | 3 | 23 | 43
2 | 10 | 15 | 30
3 | 21 | 40 | 54
4 | 9 | 32 | 89
5 | 18 | 14 | 72
6 | 12 | 27 | 28
7 | 6 | 24 | 26
8 | 4 | 41 | 22
9 | 17 | 33 | 79
10 | 19 | 28 | 58
11 | 10 | 15 | 47
12 | 3 | 18 | 73
(2) Instance assignments to clusters under different tasks.
Cluster ID | Instance Assignments (Task t1) | Instance Assignments (Task t2) | Instance Assignments (Task t3)
C1 | {2, 4, 6, 11} | {3, 4, 8, 9} | {1, 3, 10, 11}
C2 | {3, 5, 9, 10} | {2, 5, 11, 12} | {4, 5, 9, 12}
C3 | {1, 7, 8, 12} | {1, 6, 7, 10} | {2, 6, 7, 8}
(3) Determination of the common cluster labels according to average intra-cluster weights.
Cluster ID | Task t1 Label | Avg. Weight | Task t2 Label | Avg. Weight | Task t3 Label | Avg. Weight
C1 | L2 | 10.25 | L3 | 36.50 | L2 | 50.50
C2 | L3 | 18.75 | L1 | 15.50 | L3 | 78.25
C3 | L1 | 4.00 | L2 | 25.50 | L1 | 26.50
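As a concrete illustration, the common-label assignment of step (3) in Table 4 can be reproduced with a short script. The sketch below is illustrative only (the helper name `common_labels` is ours, not the authors'); it assumes, consistently with the values in the table, that within each task the labels L1, L2, … are assigned to clusters in ascending order of their average intra-cluster weight:

```python
from statistics import mean

# Dataset D from step (1): instance ID -> (t1, t2, t3) values.
data = {
    1: (3, 23, 43),  2: (10, 15, 30), 3: (21, 40, 54),  4: (9, 32, 89),
    5: (18, 14, 72), 6: (12, 27, 28), 7: (6, 24, 26),   8: (4, 41, 22),
    9: (17, 33, 79), 10: (19, 28, 58), 11: (10, 15, 47), 12: (3, 18, 73),
}

# Cluster memberships from step (2): task index -> cluster ID -> instance IDs.
clusters = {
    0: {"C1": [2, 4, 6, 11], "C2": [3, 5, 9, 10], "C3": [1, 7, 8, 12]},
    1: {"C1": [3, 4, 8, 9], "C2": [2, 5, 11, 12], "C3": [1, 6, 7, 10]},
    2: {"C1": [1, 3, 10, 11], "C2": [4, 5, 9, 12], "C3": [2, 6, 7, 8]},
}

def common_labels(task):
    """Average intra-cluster weight per cluster, then label the clusters
    L1, L2, ... in ascending order of that average (assumed convention)."""
    avg = {cid: mean(data[i][task] for i in members)
           for cid, members in clusters[task].items()}
    ranked = sorted(avg, key=avg.get)
    return {cid: (f"L{rank + 1}", avg[cid]) for rank, cid in enumerate(ranked)}

print(common_labels(0))  # task t1: C1 -> L2 (10.25), C2 -> L3 (18.75), C3 -> L1 (4.00)
```

Running `common_labels` for each of the three tasks reproduces all nine label/weight pairs of step (3).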
Table 5. Assignments of final cluster labels to the instances under the different tasks, and the output of MV-MTC obtained by the majority voting rule.
Ins. ID | Ct1 | Ct2 | Ct3 | MV-MTC
1 | L1 | L2 | L2 | L2
2 | L2 | L1 | L1 | L1
3 | L3 | L3 | L2 | L3
4 | L2 | L3 | L3 | L3
5 | L3 | L1 | L3 | L3
6 | L2 | L2 | L1 | L2
7 | L1 | L2 | L1 | L1
8 | L1 | L3 | L1 | L1
9 | L3 | L3 | L3 | L3
10 | L3 | L2 | L2 | L2
11 | L2 | L1 | L2 | L2
12 | L1 | L1 | L3 | L1
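The majority-voting rule that produces the final MV-MTC column of Table 5 can be sketched as follows (illustrative code, not the authors' implementation; ties, which do not occur in this example, would require an explicit tie-breaking rule in practice):

```python
from collections import Counter

# Per-task common labels of each instance (columns Ct1, Ct2, Ct3 of Table 5).
votes = {
    1: ["L1", "L2", "L2"], 2: ["L2", "L1", "L1"], 3: ["L3", "L3", "L2"],
    4: ["L2", "L3", "L3"], 5: ["L3", "L1", "L3"], 6: ["L2", "L2", "L1"],
    7: ["L1", "L2", "L1"], 8: ["L1", "L3", "L1"], 9: ["L3", "L3", "L3"],
    10: ["L3", "L2", "L2"], 11: ["L2", "L1", "L2"], 12: ["L1", "L1", "L3"],
}

def majority_vote(labels):
    """Return the most frequent label. Counter.most_common keeps
    first-seen order on ties, so a production version should add
    an explicit tie-breaking rule."""
    return Counter(labels).most_common(1)[0][0]

final = {i: majority_vote(ls) for i, ls in votes.items()}
print(final)  # e.g. instance 1 -> 'L2', instance 2 -> 'L1', ...
```

The resulting dictionary matches the MV-MTC column of Table 5 for all twelve instances.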
Table 6. The selected air quality monitoring stations (AQMS) and their features.
ID | AQMS Name | City | County | Longitude | Latitude | Type | Measured Pollutants
1 | Adana-Catalan | Adana | Yüreğir | 35.2619 | 37.1864 | Rural | PM10, SO2, NO2, NO, O3, NOX
2 | Adana-Dogankent | Adana | Yüreğir | 35.3491 | 36.8545 | Rural | PM10, SO2, NO2, NO, O3, NOX
3 | Adana-Meteoroloji | Adana | Karaisali | 35.3440 | 37.0041 | Urban | PM10, SO2, NO2, NO, O3, NOX
4 | Adana-Valilik | Adana | Seyhan | 35.3124 | 36.9991 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
5 | Agri | Agri | Merkez | 43.0396 | 39.7213 | Urban | PM10, SO2, NO2, NO, O3, NOX
6 | Agri-Dogubeyazit | Agri | Dogubeyazit | 44.0835 | 39.5476 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
7 | Agri-Patnos | Agri | Patnos | 42.8530 | 39.2365 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
8 | Ankara-Kecioren | Ankara | Kecioren | 32.8628 | 39.9672 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
9 | Ankara-Sihhiye | Ankara | Cankaya | 32.8594 | 39.9272 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
10 | Ardahan | Ardahan | Merkez | 42.7055 | 41.0000 | Urban | PM10, SO2, NO2, NO, O3, NOX
11 | Artvin | Artvin | Merkez | 41.8182 | 41.1752 | Urban | PM10, SO2, NO2, NO, O3, NOX
12 | Bartin | Bartin | Merkez | 32.3564 | 41.6248 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
13 | Bayburt | Bayburt | Merkez | 40.2255 | 40.2558 | Urban | PM10, SO2, NO2, NO, O3, NOX
14 | Canakkale-Biga Icdas | Canakkale | Biga | 27.1072 | 40.4173 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO
15 | Canakkale-Can-MTHM | Canakkale | Can | 27.0498 | 40.0293 | Urban | PM10, SO2, NO2, NO, O3, NOX
16 | Edirne-Kesan-MTHM | Edirne | Kesan | 26.6352 | 40.8511 | Urban | PM10, SO2, NO2, NO, O3, NOX, PM2.5
17 | Erzincan | Erzincan | Merkez | 39.4950 | 39.7430 | Urban | PM10, SO2, NO2, NO, O3, NOX
18 | Erzurum | Erzurum | Yakutiye | 41.2728 | 39.8982 | Urban | PM10, SO2, NO2, NO, O3, NOX
19 | Erzurum-Palandoken | Erzurum | Palandoken | 41.2752 | 39.8676 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
20 | Erzurum-Pasinler | Erzurum | Pasinler | 41.5721 | 40.0335 | Rural | PM10, SO2, NO2, NO, O3, NOX
21 | Giresun-Gemilercekegi | Giresun | Merkez | 38.3985 | 40.9144 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5, PM10 Flow Rate, PM2.5 Flow Rate
22 | Gumushane | Gumushane | Merkez | 39.4808 | 40.4608 | Urban | PM10, SO2, NO2, NO, O3, NOX
23 | Hatay-Iskenderun | Hatay | Iskenderun | 36.2239 | 36.7141 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO
24 | Igdir | Igdir | Merkez | 44.0536 | 39.9261 | Urban | PM10, SO2, NO2, NO, O3, NOX
25 | Igdir-Aralik | Igdir | Aralik | 44.6209 | 39.7868 | Rural | PM10, SO2, NO2, NO, O3, NOX, PM2.5
26 | Istanbul-Basaksehir-MTHM | Istanbul | Basaksehir | 28.7898 | 41.0954 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO
27 | Istanbul-Esenyurt-MTHM | Istanbul | Esenyurt | 28.6688 | 41.0192 | Urban | PM10, SO2, NO2, NO, O3, NOX
Table 7. The selected air quality monitoring stations (AQMS) and their features (continued).
ID | AQMS Name | City | County | Longitude | Latitude | Type | Measured Pollutants
28 | Karabuk-Kardemir 1 | Karabuk | Merkez | 32.6274 | 41.1920 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO
29 | Kars-Istasyon Mah. | Kars | Merkez | 43.1044 | 40.6050 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
30 | Kirklareli-Limankoy-MTHM | Kirklareli | Limankoy | 28.0559 | 41.8852 | Rural | PM10, SO2, NO2, NO, O3, NOX
31 | Kirsehir | Kirsehir | Merkez | 34.1686 | 39.1381 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
32 | Kocaeli-Gebze-MTHM | Kocaeli | Gebze | 29.4365 | 40.8108 | Urban | PM10, SO2, NO2, NO, O3, NOX
33 | Kocaeli-Korfez-MTHM | Kocaeli | Korfez | 29.7888 | 40.7461 | Industrial | PM10, SO2, NO2, NO, O3, NOX, PM2.5, PM2.5 Flow Rate
34 | Kocaeli-Yenikoy-MTHM | Kocaeli | Basiskele | 29.8844 | 40.7042 | Urban | PM10, SO2, NO2, NO, O3, NOX
35 | Manisa-Soma | Manisa | Soma | 27.6129 | 39.1814 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
36 | Ordu-Unye | Ordu | Unye | 37.2802 | 41.1214 | Urban | PM10, SO2, NO2, NO, O3, NOX, PM10 Flow Rate
37 | Rize | Rize | Merkez | 40.5328 | 41.0217 | Urban | PM10, SO2, NO2, NO, O3, NOX
38 | Rize-Ardesen | Rize | Ardesen | 41.0475 | 41.1273 | Rural | PM10, SO2, NO2, NO, O3, NOX, PM2.5
39 | Samsun-Atakum | Samsun | Atakum | 36.2965 | 41.3253 | Urban | PM10, SO2, NO2, NO, O3, NOX, PM2.5, PM10 Flow Rate, PM2.5 Flow Rate
40 | Seyyar-1 (06 THL 77)-Malatya Arapgir | Malatya | Arapgir | 38.4878 | 39.0457 | Urban | PM10, SO2, NO2, NO, O3
41 | Seyyar-2 (06 THL 79)-Sincan OSB | Ankara | Mamak | 33.0364 | 39.9008 | Urban | PM10, SO2, NO2, NO, O3
42 | Seyyar-4 (06 DV 9975)-Isparta Kizildag | Isparta | Sarkikaraagac | 31.3549 | 38.0442 | Urban | PM10, SO2, NO2, NO, O3
43 | Tekirdag-Corlu-MTHM | Tekirdag | Corlu | 27.8154 | 41.1806 | Industrial | PM10, SO2, NO2, NO, O3
44 | Trabzon-Akcaabat | Trabzon | Akcaabat | 39.5923 | 41.0143 | Urban | PM10, SO2, NO2, NO, O3, NOX, CO
45 | Trabzon-Uzungol | Trabzon | Uzungol | 40.2980 | 40.6173 | Rural | PM10, SO2, NO2, NO, O3
46 | Trabzon-Valilik | Trabzon | Merkez | 39.7123 | 41.0059 | Urban | PM10, SO2, NO2, NO, O3, NOX
47 | Yalova-Armutlu-MTHM | Yalova | Armutlu | 28.7845 | 40.5292 | Rural | PM10, SO2, NO2, NO, O3, NOX, PM2.5
48 | Zonguldak-Eren Enerji Lise | Zonguldak | Catalagzi | 31.8801 | 41.4964 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
49 | Zonguldak-Eren Enerji Tepekoy | Zonguldak | Catalagzi | 31.9374 | 41.5269 | Industrial | PM10, SO2, NO2, NO, O3, NOX, CO, PM2.5
Table 8. The mean weight values of each cluster group and their intra-cluster labels obtained by KM++.
Pollutant | Cluster 1 (Weight, Label) | Cluster 2 (Weight, Label) | Cluster 3 (Weight, Label) | Cluster 4 (Weight, Label) | Cluster 5 (Weight, Label)
PM10 | 0.2002, L2 | 0.3142, L3 | 0.1309, L1 | 0.3982, L4 | 0.5977, L5
SO2 | 0.0713, L1 | 0.2328, L3 | 0.1497, L2 | 0.5799, L5 | 0.4996, L4
NO2 | 0.0902, L1 | 0.2177, L2 | 0.3322, L3 | 0.4396, L4 | 0.6593, L5
NO | 0.0726, L1 | 0.1382, L2 | 0.5186, L5 | 0.1781, L3 | 0.1783, L4
O3 | 0.4208, L3 | 0.3237, L2 | 0.7482, L5 | 0.4873, L4 | 0.2082, L1
Table 9. The results of the performance evaluation in terms of SSE values.
Task | KM | KM++ | EM | Canopy | FFirst | HIERSing | HIERComp | HIERAvg | HIERMean | HIERCentro
CPM10 | 396.40 | 406.53 | 392.42 | 476.39 | 415.03 | 532.51 | 411.91 | 516.16 | 456.02 | 526.03
CSO2 | 260.31 | 194.96 | 226.03 | 415.04 | 214.27 | 216.13 | 204.56 | 214.27 | 214.27 | 214.27
CNO2 | 367.95 | 315.79 | 319.25 | 406.52 | 356.39 | 437.90 | 317.32 | 437.90 | 373.54 | 437.90
CNO | 335.17 | 351.29 | 294.79 | 426.31 | 305.73 | 319.82 | 303.40 | 303.40 | 319.82 | 319.82
CO3 | 393.83 | 394.42 | 371.31 | 478.08 | 418.25 | 645.29 | 402.39 | 410.27 | 514.38 | 446.52
CALL | 350.73 | 332.60 | 320.76 | 440.47 | 341.93 | 430.33 | 327.92 | 376.40 | 375.61 | 388.91
MV-MTC | 113.80 | 108.12 | 116.16 | 122.84 | 134.41 | 161.96 | 124.92 | 161.96 | 144.16 | 161.96
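Assuming SSE in Table 9 is the standard sum of squared errors, i.e., the total squared Euclidean distance of each instance to the centroid of its assigned cluster, it can be computed as in the following sketch (the `sse` helper is illustrative, not taken from the paper):

```python
from collections import defaultdict

def sse(points, assignments):
    """Sum of squared Euclidean distances from each point to its cluster mean.

    points: list of feature vectors (tuples); assignments: cluster index per point.
    """
    groups = defaultdict(list)
    for p, c in zip(points, assignments):
        groups[c].append(p)
    total = 0.0
    for members in groups.values():
        dim = len(members[0])
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        total += sum(sum((p[d] - centroid[d]) ** 2 for d in range(dim))
                     for p in members)
    return total

# Toy 1-D example: clusters {1, 3} (centroid 2) and {10} (centroid 10).
print(sse([(1,), (3,), (10,)], [0, 0, 1]))  # -> 2.0
```

Lower SSE indicates tighter clusters, which is the sense in which MV-MTC outperforms the single-task baselines in Table 9.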
Table 10. Comparisons of different clustering methods in terms of execution time (in seconds).
Task | KM | KM++ | EM | Canopy | FFirst | HIERSing | HIERComp | HIERAvg | HIERMean | HIERCentro
CPM10 | 0.04 | 0.05 | 0.32 | 0.03 | 0.03 | 0.11 | 0.11 | 0.11 | 0.11 | 0.16
CSO2 | 0.04 | 0.03 | 0.36 | 0.04 | 0.03 | 0.11 | 0.11 | 0.11 | 0.11 | 0.17
CNO2 | 0.03 | 0.04 | 0.29 | 0.03 | 0.03 | 0.11 | 0.11 | 0.11 | 0.11 | 0.16
CNO | 0.04 | 0.04 | 0.26 | 0.03 | 0.03 | 0.11 | 0.11 | 0.11 | 0.11 | 0.17
CO3 | 0.05 | 0.04 | 0.30 | 0.03 | 0.03 | 0.11 | 0.11 | 0.11 | 0.12 | 0.16
CALL | 0.20 | 0.20 | 1.53 | 0.16 | 0.15 | 0.55 | 0.55 | 0.55 | 0.56 | 0.82
MV-MTC | 0.31 | 0.60 | 1.55 | 0.33 | 0.29 | 0.73 | 0.73 | 0.73 | 0.76 | 1.05

Share and Cite

MDPI and ACS Style

Tuysuzoglu, G.; Birant, D.; Pala, A. Majority Voting Based Multi-Task Clustering of Air Quality Monitoring Network in Turkey. Appl. Sci. 2019, 9, 1610. https://doi.org/10.3390/app9081610
