1. Introduction
In machine learning and data mining, the construction of a classifier can be considered one of the most significant and challenging tasks [1]. Traditional classification algorithms belong to the class of supervised algorithms, which use only labeled data to train the classifier. However, in many real-world classification problems, labeled instances are often difficult, expensive, or time-consuming to obtain, since they require the effort of empirical research. In contrast, unlabeled data are fairly easy to obtain and require less effort from experienced human annotators.
Semi-supervised learning (SSL) algorithms constitute an appropriate and effective machine learning methodology for extracting knowledge from both labeled and unlabeled data so as to build efficient classifiers [2]. More analytically, they efficiently combine the explicit classification information of labeled data with the information hidden in the unlabeled data. The general assumption of this class of algorithms is that data points in a high-density region are likely to belong to the same class and that the decision boundary lies in low-density regions [3]. Hence, these methods have the advantage of reducing the effort of supervision to a minimum, while still preserving competitive recognition performance. Nowadays, these algorithms are of great interest both in theory and in practice and have become a topic of significant research as an alternative to traditional methods of machine learning, since they require less human effort and frequently present higher accuracy [4,5,6,7,8,9,10]. The main issue of semi-supervised learning is how to efficiently exploit the hidden information in the unlabeled data. In the literature, several approaches have been proposed, each with a different philosophy regarding the link between the distribution of labeled and unlabeled data [2,11,12,13,14].
Self-training constitutes perhaps the most popular and frequently used SSL algorithm due to its simplicity and classification accuracy [4,5,9]. This algorithm wraps around a base learner and uses its own predictions to assign labels to unlabeled data. More specifically, in the self-training process, a classifier is trained with a small number of labeled examples and iteratively enlarges its training set using newly labeled data with its own most confident predictions. However, this methodology can lead to erroneous predictions when noisy examples are classified as the most confident ones and subsequently incorporated into the labeled training set. Li and Zhou [15] tried to address this difficulty and presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Along this line, Tanha et al. [16] studied the classification behavior of self-training and, based on their numerical experiments, stated that the most important aspect for the success of the self-training procedure is to correctly estimate the confidence of its predictions.
Therefore, the success of the self-training algorithm depends on the newly labeled data [2] but, most significantly, on the selection of the base learner. Nevertheless, the selection of the base learner remains an open issue, since deciding which particular learning algorithm to choose for a specific problem is still complicated and challenging. Given a pattern recognition problem, the traditional approach is to evaluate a set of different learners against a representative validation set and select the best one. It is generally recognized that the key to pattern recognition problems does not wholly lie in any particular solution, since no single model exists for all problems [17].
In this work, we propose a new semi-supervised learning algorithm which is based on a self-training philosophy. The proposed algorithm initially uses several independent base learners and, during the training process, dynamically selects the most promising base learner according to a strategy based on the number of the most confident predictions on unlabeled data. Our numerical experiments on several benchmark datasets confirm the efficacy of the proposed methodology. Additionally, we performed several statistical tests in order to illustrate the efficiency of our proposed algorithm.
The remainder of this paper is organized as follows: Section 2 defines the semi-supervised classification problem and the self-training approach. Section 3 presents a detailed description of the proposed algorithm and Section 4 presents the numerical experiments and discusses the obtained results. Finally, Section 5 discusses the conclusions and some research topics for future work.
2. A Review of Semi-Supervised Classification via Self-Labeled Approaches
This section provides a definition of the semi-supervised classification problem and a short description of the most popular and frequently used semi-supervised self-labeled algorithms.
2.1. Semi-Supervised Classification
In the sequel, we present the definitions and the necessary notation for the semi-supervised classification problem. Let ${x}_{p}=({x}_{p1},{x}_{p2},\dots ,{x}_{pD},y)$ be an example, where ${x}_{p}$ belongs to a class y and to a D-dimensional space in which ${x}_{pi}$ is the i-th attribute of the p-th sample. Suppose L is a labeled set of ${N}_{l}$ instances ${x}_{p}$ with y known and U is an unlabeled set of ${N}_{u}$ instances ${x}_{q}$ with y unknown, where ${N}_{l}\ll {N}_{u}$. Notice that the set $L\cup U$ constitutes the training set. Moreover, there is a test set T composed of ${N}_{t}$ unseen instances ${x}_{t}$ which have not been used in the training stage. The aim of semi-supervised classification is to obtain an accurate and robust learned hypothesis using the training set $L\cup U$ and subsequently evaluate its performance using the test set T.
In the literature, a variety of self-labeled methods has been proposed, each following a different methodology for exploiting the information hidden in the unlabeled data. Next, we present a brief description of the most popular and frequently used semi-supervised self-labeled methods.
2.2. Semi-Supervised Self-Labeled Methods
Self-training is a wrapper-based semi-supervised approach which constitutes an iterative procedure of self-labeling unlabeled data and is generally considered to be a simple and effective SSL algorithm. According to Ng and Cardie [18], "self-training is a single-view weakly supervised algorithm" which relies on its own predictions on unlabeled data to teach itself. In the self-training framework, an arbitrary classifier is initially trained with a small amount of labeled data, which constitutes its training set, aiming to classify unlabeled points. Subsequently, it iteratively enlarges its labeled training set with its own most confident predictions and is retrained. More specifically, at each iteration, the classifier's training set is gradually augmented with classified unlabeled instances that have achieved a probability value over a defined threshold c; these instances are considered sufficiently reliable to be added to the training set. Notice that the way in which the confidence of predictions is measured depends on the type of base learner used (see [19]).
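The self-training loop described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the implementation used in this work: the one-dimensional nearest-centroid learner, its softmax-style confidence and the function names `fit_centroids`, `predict_with_confidence` and `self_train` are all assumptions introduced for the example; only the threshold c and the iterative enlargement of the labeled set come from the text.

```python
import math


def fit_centroids(labeled):
    """Toy base learner (an assumption): one centroid per class on 1-D data."""
    sums, counts = {}, {}
    for x, y in labeled:
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def predict_with_confidence(model, x):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {y: math.exp(-abs(x - mu)) for y, mu in model.items()}
    label = max(weights, key=weights.get)
    return label, weights[label] / sum(weights.values())


def self_train(labeled, unlabeled, c=0.9, max_iter=40):
    """Enlarge the labeled set with predictions whose confidence exceeds c."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    model = fit_centroids(labeled)
    for _ in range(max_iter):
        newly_labeled, remaining = [], []
        for x in unlabeled:
            label, confidence = predict_with_confidence(model, x)
            if confidence > c:
                newly_labeled.append((x, label))
            else:
                remaining.append(x)
        if not newly_labeled:  # nothing confident enough: stop early
            break
        labeled.extend(newly_labeled)
        unlabeled = remaining
        model = fit_centroids(labeled)  # retrain on the enlarged set
    return model, labeled, unlabeled
```

In this toy setting, an instance lying halfway between the two class centroids never exceeds the threshold and therefore stays unlabeled, which mirrors the role of c as a reliability filter.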
Clearly, this model does not make any specific assumptions about the input data, but rather accepts that its own predictions tend to be correct. Therefore, since the success of the self-training algorithm is heavily dependent on the newly labeled data based on its own predictions, its weakness is that erroneous initial predictions will probably lead the classifier to generate incorrectly labeled data [2].
Li and Zhou [15] tried to address this difficulty and, as a result, presented the SETRED method, which incorporates data editing in the self-training framework in order to actively learn from the self-labeled examples. Their principal improvement over the classical self-training scheme is the establishment of a restriction related to the acceptance or rejection of the unlabeled examples which are evaluated as trustworthy by the algorithm. More analytically, a neighboring graph is built in the D-dimensional feature space and all the unlabeled examples that are candidates for being appended to the initial training set are filtered through a hypothesis test. Thus, only the examples that have successfully passed that test are finally added to the training set before the end of each iteration.
Co-training is a semi-supervised algorithm which can be regarded as a different variant of the self-training technique [12]. It is based on the strong assumption that the feature space can be divided into two conditionally independent views, with each view being sufficient to train an efficient classifier. In this framework, two learning algorithms are trained separately, one for each view, using the initial labeled dataset, and the most confident predictions of each algorithm on unlabeled data are used to augment the training set of the other through an iterative learning process. Following the same concept, Nigam and Ghani [14] performed an experimental analysis in which they concluded that Co-training outperforms other SSL algorithms when two distinct and independent views exist naturally. Nevertheless, the assumption about the existence of sufficient and redundant views is a luxury hardly met in most real-case scenarios.
Zhou and Goldman [20] have also adopted the idea of ensemble learning and majority voting in the semi-supervised framework. Along this line, Li and Zhou [21] proposed another algorithm, named Co-Forest, in which several Random Trees are trained on bootstrap data from the dataset. The main idea of this algorithm is the assignment of a few unlabeled examples to each Random Tree during the training process. Eventually, the final decision is made by simple majority voting. Notice that the use of Random Tree classifiers on random samples of the collected labeled data is the main reason why the behavior of Co-Forest is efficient and robust even when the number of available labeled examples is small.
A rather representative approach which is based on the ensemble philosophy is the Tri-training algorithm. This algorithm constitutes an improved single-view extension of the Co-training algorithm, exploiting unlabeled data without relying on the existence of two views of instances [22]. The Tri-training algorithm can be considered a bagging ensemble of three classifiers which are trained on data subsets generated through bootstrap sampling from the original labeled training set [23]. Subsequently, in each Tri-training round, if two classifiers agree on the labeling of an unlabeled instance while the third one disagrees, then these two classifiers will label this instance for the third classifier. It is worth noticing that the "majority teach minority" strategy serves as an implicit confidence measurement which avoids the use of complicated, time-consuming approaches for explicitly measuring the predictive confidence, and hence the training process is efficient [4].
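The "majority teach minority" rule can be illustrated with a short Python sketch. The function name `majority_teach_minority` and its return convention are hypothetical and serve only to make the rule as stated above concrete; it is not taken from any Tri-training implementation.

```python
def majority_teach_minority(predictions):
    """Apply the rule to the labels that three classifiers give one instance.

    Returns (index_of_classifier_to_teach, agreed_label) when exactly two
    classifiers agree and the third disagrees, otherwise None.
    """
    if len(predictions) != 3:
        raise ValueError("Tri-training uses exactly three classifiers")
    for third in range(3):
        first, second = (i for i in range(3) if i != third)
        if predictions[first] == predictions[second] != predictions[third]:
            return third, predictions[first]
    return None  # all three agree, or all three disagree
```

No probability estimate is needed: agreement between two classifiers plays the role of the confidence measure.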
Kostopoulos et al. [24] and Livieris et al. [25,26], motivated by the previous works, studied the fusion of ensemble and semi-supervised learning. More specifically, they presented self-labeled methods by adopting majority voting in the semi-supervised framework.
3. Auto-Adjustable Self-Training Semi-Supervised Algorithm
In this section, we present the proposed SSL algorithm, which is based on the self-training framework. We recall that the two main difficulties in self-training are deciding which base learner to choose for a specific problem and finding a set of high-confidence predictions on unlabeled data. Therefore, in order to address these difficulties, we start with an initial pool of classifiers and, during the training process, dynamically select the most promising classifier relative to the most confident predictions. A high-level description of the proposed semi-supervised algorithm, entitled Auto-Adjustable Self-Training (AAST), is presented in Algorithm 1, which consists of two phases: in the first phase, the most promising classifier is selected from a pool of classifiers based on the number of confident predictions on unlabeled data, whereas in the second phase, the most promising classifier is trained within the self-training framework.
Suppose that $C=({C}_{1},{C}_{2},\dots ,{C}_{N})$ constitutes a set of N classifiers which can be used as base learners in the self-training framework. Initially, all base learners ${C}_{i}\in C$ are trained using the same small amount of labeled data L and then applied to the same unlabeled data U. Subsequently, the labeled set ${L}_{i}$ of each classifier ${C}_{i}$ is iteratively and gradually augmented using its own most confident predictions. More specifically, each classified unlabeled instance that has achieved a probability value over a defined threshold c is considered sufficiently reliable to be added to the classifier's labeled set ${L}_{i}$ for the following training phases. It is worth mentioning that the way the confidence of predictions is measured depends on the type of base learner used (see [19,27,28] and the references therein). Finally, each classifier is retrained using its own enlarged training set.
Algorithm 1: Auto-Adjustable Self-Training (AAST).
Input:
    L: Set of labeled training instances.
    U: Set of unlabeled training instances.
    c: Confidence level.
    k: Number of iterations per cycle.
    $C=({C}_{1},{C}_{2},\dots ,{C}_{N})$: Set of N base learners.
Output:
    ${C}_{P}$: Trained classifier.
/* Phase I: Classifier Selection */
repeat
    for $i=1$ to N do
        Set ${L}_{i}=L$ and ${U}_{i}=U$.
    end for
    for $j=1$ to k do
        for each (classifier ${C}_{i}\in C$) do
            Train ${C}_{i}$ on ${L}_{i}$ and apply it to ${U}_{i}$.
            Select the instances with a predicted probability greater than threshold c in this iteration (${x}_{MCP}^{\left(i\right)}$).
            Remove ${x}_{MCP}^{\left(i\right)}$ from ${U}_{i}$ and add them to ${L}_{i}$.
        end for
    end for
    Select classifier ${C}_{m}\in C$ with the fewest labeled instances.
    Remove the classifier ${C}_{m}$ from the set C.
    Select classifier ${C}_{M}\in C$ with the most labeled instances.
    Set $L={L}_{M}$ and $U={U}_{M}$.
    Set $N=N-1$.
until one classifier remains in set C.
Set ${C}_{P}$ to the only classifier in set C.
/* Phase II: Training of ${C}_{P}$ classifier */
repeat
    Train ${C}_{P}$ on L and apply it to U.
    Select the instances with a predicted probability greater than threshold c in this iteration (${x}_{MCP}^{\left(P\right)}$).
    Remove ${x}_{MCP}^{\left(P\right)}$ from U and add them to L.
until some stopping criterion is met or U is empty.

To select a base learner from set C, the proposed algorithm is grounded on the following simple idea: the most promising base learner is probably the one with the most confident predictions. In other words, the base learner that is able to confidently label as many unlabeled instances as possible, and thereby exploit them, is the most promising classifier.
Every k iterations (which we call a cycle), AAST evaluates the base learners in set C and selects the classifier ${C}_{m}$ with the minimum number of most confident predictions as well as the classifier ${C}_{M}$ with the maximum number of most confident predictions. Subsequently, the classifier ${C}_{m}$ is removed from the set C and the classifier ${C}_{M}$ provides its labeled set ${L}_{M}$ and its unlabeled set ${U}_{M}$ to all the remaining classifiers for the next cycle. More to the point, in every cycle (i.e., every k iterations) the algorithm removes the least promising classifier from the set C in order to reduce the computational cost, and restarts the self-training process using the labeled and unlabeled sets of the most promising classifier ${C}_{M}$, relative to the number of most confident predictions of each classifier.
Notice that it follows immediately from the above discussion that after ${N}_{C}-1$ cycles (i.e., $k\cdot ({N}_{C}-1)$ iterations), where ${N}_{C}$ is the initial number of base learners, only one classifier, denoted as ${C}_{P}$, remains in set C. This classifier constitutes the most promising classifier, relative to the proposed selection strategy. Subsequently, the only remaining classifier ${C}_{P}$ continues its training within the semi-supervised framework.
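The two phases can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the Java/WEKA implementation used in the experiments: a base learner is modeled as a hypothetical callable `fit(labeled)` returning a predictor `x -> (label, confidence)`, ties between classifiers are broken by dictionary order, and the names `self_label` and `aast` as well as the default values are introduced only for the example.

```python
def self_label(fit, labeled, unlabeled, c, iterations):
    """Run up to `iterations` self-training steps; return (enlarged L, reduced U)."""
    labeled, unlabeled = list(labeled), list(unlabeled)
    for _ in range(iterations):
        if not unlabeled:
            break
        predict = fit(labeled)  # retrain on the current labeled set
        scored = [(x, *predict(x)) for x in unlabeled]  # (x, label, confidence)
        labeled += [(x, y) for x, y, p in scored if p > c]
        unlabeled = [x for x, _, p in scored if p <= c]
    return labeled, unlabeled


def aast(learners, labeled, unlabeled, c=0.9, k=8, max_iter=40):
    """learners: dict mapping a name to a fit function (hypothetical interface)."""
    pool = dict(learners)
    # Phase I: one cycle of k iterations per removed classifier.
    while len(pool) > 1:
        results = {name: self_label(fit, labeled, unlabeled, c, k)
                   for name, fit in pool.items()}
        # C_m: fewest labeled instances after the cycle; drop it from the pool.
        worst = min(results, key=lambda name: len(results[name][0]))
        del pool[worst], results[worst]
        # C_M: most labeled instances among the rest; share its L_M and U_M.
        best = max(results, key=lambda name: len(results[name][0]))
        labeled, unlabeled = results[best]
    (name, fit), = pool.items()
    # Phase II: the remaining classifier C_P continues plain self-training.
    labeled, unlabeled = self_label(fit, labeled, unlabeled, c, max_iter)
    return name, fit(labeled), labeled
```

With three base learners the `while` loop runs twice, so Phase I costs at most 2k self-training iterations per surviving learner, in line with the ${N}_{C}-1$ cycles noted above.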
An obvious advantage of the proposed technique is that it exploits the diversity of the errors of the learned models by using different learning algorithms, and the classifier with the most confident predictions is dynamically selected as the most promising one. Nevertheless, the efficacy and computational cost of the proposed algorithm depend on the value of parameter k. As the value of parameter k increases, the base learners exploit the hidden information in the unlabeled data for more iterations before being evaluated; however, the computational cost and time increase significantly.
4. Experimental Results
The experiments were based on 40 datasets from the UCI Machine Learning Repository [29] and the KEEL repository [30]. Table 1 presents a brief description of the datasets' structure, i.e., the number of instances (#Instances), number of attributes (#Features) and number of output classes (#Classes). The considered datasets contain between 101 and 19,020 instances, while the number of attributes ranges from 2 to 60 and the number of classes varies between 2 and 11.
Our experimental results were obtained by conducting a three-phase procedure: in the first phase, we evaluate the performance of the proposed algorithm AAST using various values of parameter k in order to study its sensitivity; in the second phase, we compare the performance of AAST with that of the most popular and commonly used self-labeled algorithms; in the third phase, we perform a statistical comparison between all compared semi-supervised self-labeled algorithms. The detailed numerical results can be found at www.math.upatras.gr/~livieris/Results/AAST.zip.
The implementation code was written in Java, using the WEKA Machine Learning Toolkit [28], and the classification accuracy was evaluated using stratified 10-fold cross-validation, i.e., the data were separated into folds so that each fold had the same distribution of classes as the entire dataset. For each generated fold, a given algorithm is trained with the examples contained in the remaining folds (training partition) and then tested with the current fold. Moreover, the training partition was divided into labeled and unlabeled subsets.
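As an illustration of the stratification idea, the following Python sketch deals the instances of each class round-robin into the folds. It is a simplified stand-in and not WEKA's stratification routine; the function name `stratified_folds` and the `seed` parameter are assumptions made for the example.

```python
import random
from collections import defaultdict


def stratified_folds(labels, n_folds=10, seed=0):
    """Return n_folds lists of instance indices with similar class proportions."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(n_folds)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal the instances of this class round-robin across the folds.
        for position, index in enumerate(indices):
            folds[position % n_folds].append(index)
    return folds
```

Each fold then serves once as the test set, while the union of the other nine forms the training partition.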
Similar to [13,31], in the division process we do not maintain the class proportion in the labeled and unlabeled sets, since the main aim of semi-supervised classification is to exploit unlabeled data for better classification results. Hence, we randomly select the examples that will be marked as labeled instances, and the class labels of the remaining instances are removed. Furthermore, we ensure that every class has at least one representative instance in the labeled set. To study the influence of the amount of labeled data, three different ratios R were used: $10\%$, $20\%$ and $30\%$. In summary, this experimental study involves a total of 120 datasets (40 datasets × 3 labeled ratios).
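A division of this kind can be sketched in Python as shown below. How the per-class guarantee is enforced is not specified in the text, so picking one random representative per class before filling the remainder is an assumption, as are the function name `split_labeled_unlabeled`, the rounding of the labeled-set size and the `seed` parameter.

```python
import random


def split_labeled_unlabeled(examples, ratio, seed=0):
    """examples: list of (features, label). Returns (labeled, unlabeled_features)."""
    rng = random.Random(seed)
    n_labeled = max(1, round(ratio * len(examples)))
    by_class = {}
    for index, (_, label) in enumerate(examples):
        by_class.setdefault(label, []).append(index)
    # Assumption: first pick one random representative of every class.
    chosen = {rng.choice(indices) for indices in by_class.values()}
    # Fill the rest at random, ignoring class proportions.
    rest = [i for i in range(len(examples)) if i not in chosen]
    rng.shuffle(rest)
    chosen.update(rest[:max(0, n_labeled - len(chosen))])
    labeled = [examples[i] for i in sorted(chosen)]
    unlabeled = [examples[i][0] for i in range(len(examples)) if i not in chosen]
    return labeled, unlabeled
```

For a training partition of 20 instances and R = 10%, for example, the labeled set holds two instances and the labels of the other 18 are discarded.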
Furthermore, the proposed algorithm uses three well-known supervised classifiers as base learners, namely C4.5, JRip and kNN. These base learners constitute some of the most effective and widely used data mining algorithms for classification [24,32]. A brief description of these classifiers is given below:
C4.5 [
33] constitutes one of the most effective and efficient classification algorithms for building decision trees. This algorithm induces classification rules in the form of decision trees for a given training set. More analytically, it categorizes instances to a predefined set of classes according to their attribute values from the root of a tree down to a leaf. The accuracy of a leaf corresponds to the percentage of correctly classified instances of the training set.
JRip [34] is generally considered to be a very effective and fast rule-based algorithm, especially on large samples with noisy data. The algorithm examines the classes in order of increasing size, and an initial set of rules for a class is generated using incremental reduced-error pruning. It proceeds by treating all the examples of a particular judgement in the training data as a class and determines a set of rules that covers all the members of that class. Subsequently, it proceeds to the next class and iteratively applies the same procedure until all classes have been covered. What is more, JRip produces error rates competitive with C4.5 with less computational effort.
kNN [35] constitutes a representative instance-based learning algorithm based on dissimilarities among a set of instances. It belongs to the lazy learning family of methods [35], which do not build a model during the learning process. To classify a new instance, the kNN algorithm computes its distance to all training instances and identifies the k nearest neighbors. The new instance is then assigned to the class that has the most members among these k neighbors. The main advantages of the kNN classification algorithm are its ease and simplicity of implementation and the fact that it provides good generalization results in classification problems with multiple categories.
The configuration parameters of the proposed algorithm AAST and of the base learners used in the experiments are presented in Table 2. Moreover, similar to Blum and Mitchell [12], we established a limit on the number of iterations (MaxIter = 40) in algorithm AAST.
All classification algorithms were evaluated using the performance profiles based on accuracy proposed by Dolan and Moré [36]. This metric provides a wealth of information, such as solver efficiency, robustness and probability of success, in compact form. More specifically, the authors presented a new tool for analyzing the efficiency of algorithms by introducing the notion of a performance profile as a means to evaluate and compare the performance of a set of solvers S on a test set P.
Assuming that there exist ${n}_{s}$ solvers and ${n}_{p}$ problems, for each solver s and problem p they defined ${\alpha}_{p,s}$ as the percentage of misclassified instances by solver s for problem p. Requiring a baseline for comparisons, they compared the performance on problem p by solver s with the best performance by any solver on this problem; that is, using the performance ratio
$${r}_{p,s}=\frac{{\alpha}_{p,s}}{\min \{{\alpha}_{p,s}:s\in S\}}.$$
The performance of solver s on any given problem might be of interest, but we would like to obtain an overall assessment of the performance of the solver. Next, they defined
$${\rho}_{s}(\alpha )=\frac{1}{{n}_{p}}\,\mathrm{size}\{p\in P:{r}_{p,s}\le \alpha \}.$$
The function ${\rho}_{s}$ is the (cumulative) distribution function for the performance ratio. The performance profile ${\rho}_{s}:\mathbb{R}\to [0,1]$ for a solver is a nondecreasing, piecewise constant function, continuous from the right at each breakpoint [36]. In other words, the performance profile plots the fraction P of problems for which any given method is within a factor $\alpha $ of the best solver. According to the above rules and discussion, we conclude that a solver whose performance profile plot lies on the top right outperforms the rest of the solvers.
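The computation behind a performance profile can be sketched in Python as follows, using the standard Dolan and Moré definitions of the performance ratio and its cumulative distribution. The function names and the dictionary layout are hypothetical, and the treatment of problems on which the best solver makes no errors (ratio 1 for solvers that also make none, infinity otherwise) is an assumption, since the ratio is undefined in that case.

```python
def performance_ratios(errors):
    """errors[solver] is a list with the error rate of that solver on each problem."""
    solvers = list(errors)
    n_problems = len(errors[solvers[0]])
    ratios = {solver: [] for solver in solvers}
    for p in range(n_problems):
        best = min(errors[solver][p] for solver in solvers)
        for solver in solvers:
            value = errors[solver][p]
            if best > 0:
                ratios[solver].append(value / best)
            else:
                # Assumption: with a zero-error best solver, only ties count as ratio 1.
                ratios[solver].append(1.0 if value == 0 else float("inf"))
    return ratios


def performance_profile(errors, alpha):
    """Fraction of problems on which each solver is within a factor alpha of the best."""
    ratios = performance_ratios(errors)
    return {solver: sum(r <= alpha for r in values) / len(values)
            for solver, values in ratios.items()}
```

Evaluating `performance_profile` at alpha = 1 gives, for each solver, the fraction of problems on which it attains the best accuracy, which is the quantity quoted for each algorithm in the experimental discussion.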
Ultimately, the use of performance profiles eliminates the influence of a small number of problems on the benchmarking process and the sensitivity of results associated with the ranking of solvers [36,37,38]. It is worth mentioning that the vertical axis of a performance profile gives the percentage of the problems that were successfully solved by each method (robustness).
4.1. Sensitivity of AAST to the Value of Parameter k
In the sequel, we focus on the experimental analysis of the best value of parameter k; hence, we have tested values of k ranging from 3 to 8 in steps of 1. Figure 1 presents the performance profiles for various values of parameter k, relative to the ratio of labeled data used. Clearly, AAST exhibits better classification performance as the value of parameter k increases, revealing its sensitivity. More specifically, using $10\%$ as labeled ratio, AAST with $k=3,4,5,6,7$ and 8 classifies $22.5\%$, $25\%$, $25\%$, $35\%$, $47.5\%$ and $80\%$ of the test problems with the highest accuracy, respectively. Furthermore, AAST with $k=3,4,5,6,7$ and 8 classifies $20\%$, $27.5\%$, $30\%$, $40\%$, $52.5\%$ and $80\%$ of the test problems with the highest accuracy, respectively, for $20\%$ labeled ratio, as well as $17.5\%$, $17.5\%$, $22.5\%$, $35\%$, $57.5\%$ and $80\%$, respectively, for $30\%$ labeled ratio.
4.2. Performance Evaluation of AAST
Subsequently, we evaluate the performance of the proposed algorithm AAST against Self-training using C4.5, JRip and kNN as base learners. In the rest of this section, the value of parameter k in algorithm AAST is set to 8, which exhibited the highest classification accuracy.
Figure 2 presents the performance profiles for Self-training and AAST. AAST exhibits the highest probability of being the optimal classifier, since it corresponds to the top curve for all labeled ratios used. More analytically, AAST reports the best performance, classifying $72.5\%$, $87.5\%$ and $60\%$ of the test problems with the highest accuracy using $10\%$, $20\%$ and $30\%$ as labeled ratio, respectively, followed by Self-training (kNN) reporting $22.5\%$, $10\%$ and $25\%$ in the same situations.
Finally, in order to demonstrate the classification performance of the proposed algorithm, we compare it with other state-of-the-art self-labeled algorithms, namely Co-training [12] and Tri-training [22] using C4.5, JRip and kNN as base learners, Co-Forest [21] and SETRED [15]. Notice that all algorithms were used with the parameters presented in [30].
Figure 3 presents the performance profiles of some state-of-the-art self-labeled algorithms and AAST, for each labeled ratio used. Regardless of the ratio of labeled instances, the AAST algorithm managed to achieve the best overall performance, outperforming all self-labeled algorithms. More specifically, AAST classifies $45\%$, $52.5\%$ and $35\%$ of the test problems with the highest accuracy, using $10\%$, $20\%$ and $30\%$ as labeled ratio, respectively. In conclusion, it is worth mentioning that the reported performance profiles illustrate that AAST exhibits better performance on average, outperforming classical SSL methods, but this is not necessarily the case for every single dataset.
4.3. Statistical and Post-Hoc Analysis
The statistical comparison of multiple algorithms over multiple datasets is fundamental in machine learning and is usually carried out by means of a nonparametric statistical test. Therefore, we use the Friedman Aligned-Ranks (FAR) test [39] in order to conduct a complete performance comparison between all algorithms for all the different labeled ratios. Its application allows us to highlight the existence of significant differences between our proposed algorithm and the classical SSL algorithms and, subsequently, to evaluate the rejection of the hypothesis that all the classifiers perform equally well for a given level [25,40].
Let ${r}_{i}^{j}$ be the rank of the j-th of k learning algorithms on the i-th of M problems. Under the null hypothesis ${H}_{0}$, which states that all the algorithms are equivalent, the Friedman aligned ranks test statistic is defined by:
$${F}_{AR}=\frac{(k-1)\left[\sum_{j=1}^{k}{\widehat{R}}_{j}^{2}-(k{M}^{2}/4){(kM+1)}^{2}\right]}{\frac{kM(kM+1)(2kM+1)}{6}-\frac{1}{k}\sum_{i=1}^{M}{\widehat{R}}_{i}^{2}}$$
where ${\widehat{R}}_{i}$ is equal to the rank total of the i-th dataset and ${\widehat{R}}_{j}$ is the rank total of the j-th algorithm. The test statistic ${F}_{AR}$ is compared with the ${\chi}^{2}$ distribution with $(k-1)$ degrees of freedom. Please note that since the test is nonparametric, it does not require the commensurability of the measures across different datasets. In addition, this test does not assume the normality of the sample means, and thus it is robust to outliers.
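The statistic can be computed with a few lines of Python. The sketch below follows the usual form of the Friedman aligned ranks procedure (align each observation by subtracting the mean performance on its problem, rank all kM aligned values jointly with average ranks for ties, then combine the rank totals); the function name `friedman_aligned_ranks` and the input layout are assumptions, and no check is made for the degenerate case in which the denominator vanishes.

```python
def friedman_aligned_ranks(scores):
    """scores[i][j]: performance of algorithm j on problem i. Returns F_AR."""
    M, k = len(scores), len(scores[0])
    # Align: subtract the mean performance on each problem from its observations.
    aligned = [(value - sum(row) / k, i, j)
               for i, row in enumerate(scores) for j, value in enumerate(row)]
    aligned.sort(key=lambda item: item[0])
    ranks = [[0.0] * k for _ in range(M)]
    start = 0
    while start < len(aligned):
        end = start
        while end + 1 < len(aligned) and aligned[end + 1][0] == aligned[start][0]:
            end += 1
        average_rank = (start + end) / 2 + 1  # 1-based average rank for ties
        for _, i, j in aligned[start:end + 1]:
            ranks[i][j] = average_rank
        start = end + 1
    rank_total_algorithm = [sum(ranks[i][j] for i in range(M)) for j in range(k)]
    rank_total_problem = [sum(row) for row in ranks]
    n = k * M
    numerator = (k - 1) * (sum(r * r for r in rank_total_algorithm)
                           - (k * M * M / 4) * (n + 1) ** 2)
    denominator = (n * (n + 1) * (2 * n + 1) / 6
                   - sum(r * r for r in rank_total_problem) / k)
    return numerator / denominator
```

The returned value is then compared with the ${\chi}^{2}$ distribution with k − 1 degrees of freedom to obtain the p-value of the test.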
In statistical hypothesis testing, the p-value is the probability of obtaining a result at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. In other words, the p-value provides information about whether a statistical hypothesis test is significant or not, thus indicating "how significant" the result is, and it does this without committing to a particular level of significance. When a p-value is considered in a multiple comparison, it reflects the probability error of a certain comparison; however, it does not take into account the remaining comparisons belonging to the family. One way to address this problem is to report adjusted p-values, which take into account that multiple tests are conducted and can be compared directly with any significance level [40].
To this end, the Finner post-hoc test [39] with a significance level $\alpha =0.05$ was applied so as to detect the specific differences between the algorithms. In addition, the Finner test is easy to comprehend and usually offers better results than other post-hoc tests, especially when the number of compared algorithms is low [40]. The Finner procedure adjusts the value of $\alpha $ in a step-down manner. Let ${p}_{1},{p}_{2},\dots ,{p}_{k-1}$ be the ordered p-values with ${p}_{1}\le {p}_{2}\le \cdots \le {p}_{k-1}$ and ${H}_{1},{H}_{2},\dots ,{H}_{k-1}$ be the corresponding hypotheses. The Finner procedure rejects ${H}_{1}$ through ${H}_{i-1}$ if i is the smallest integer such that ${p}_{i}>1-{(1-\alpha )}^{(k-1)/i}$, while the adjusted Finner p-value is defined by:
$${p}_{F}=\min \left\{1,\max \left\{1-{(1-{p}_{j})}^{(k-1)/j}\right\}\right\}$$
where ${p}_{j}$ is the p-value obtained for the j-th hypothesis and $1\le j\le i$. It is worth mentioning that the test rejects the hypothesis of equality when the adjusted Finner p-value ${p}_{F}$ is less than $\alpha $.
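The adjusted p-values can be obtained with the short Python sketch below, which uses the usual form of the Finner adjustment (a running maximum over the ordered p-values, capped at 1). The function name `finner_adjusted` is hypothetical; its input is assumed to be the list of the k − 1 unadjusted p-values of the comparisons against the control algorithm, in any order.

```python
def finner_adjusted(p_values):
    """Return the Finner adjusted p-values, in the same order as the input."""
    m = len(p_values)  # m = k - 1 comparisons against the control algorithm
    order = sorted(range(m), key=lambda index: p_values[index])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, index in enumerate(order, start=1):
        # Step-down adjustment of the rank-th smallest p-value.
        candidate = 1 - (1 - p_values[index]) ** (m / rank)
        running_max = max(running_max, candidate)
        adjusted[index] = min(1.0, running_max)
    return adjusted
```

A hypothesis of equality is rejected when its adjusted value falls below the chosen significance level, for example 0.05.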
Table 3, Table 4 and Table 5 present the information of the statistical analysis performed by nonparametric multiple comparison procedures over $10\%$, $20\%$ and $30\%$ of labeled data, respectively. The best (lowest) ranking obtained in each FAR test determines the control algorithm for the post-hoc test. Moreover, the adjusted p-value with Finner's test (Finner APV) is presented based on the control algorithm, at the $\alpha =0.05$ level of significance. Clearly, the proposed algorithm exhibits the best overall performance, outperforming the rest of the self-labeled algorithms, since it reports the highest probability-based ranking and presents statistically better results for all labeled ratios.
5. Conclusions and Future Research
In this work, we presented a new SSL algorithm which is based on a self-training philosophy. More specifically, our proposed algorithm automatically selects the best base learner, relative to the number of the most confident predictions on unlabeled data.
The efficiency of the proposed semi-supervised algorithm was evaluated on several benchmark datasets in terms of classification accuracy, utilizing the most frequently used base learners (C4.5, kNN and JRip) and different ratios of labeled data. Our numerical results as well as the presented statistical analysis demonstrate that the AAST algorithm outperforms its component SSL algorithms, confirming the effectiveness and robustness of the proposed method. Therefore, the presented methodology seems to lead to more efficient, stable and robust predictive models.
In our future work, we intend to pursue extensive empirical experiments in order to compare the proposed self-labeled method AAST with various methods belonging to other SSL classes, such as generative mixture models [14,41], transductive SVMs [42,43,44], graph-based methods [45,46,47,48,49], extreme learning methods [50,51,52] and expectation maximization with generative mixture models [14,53]. Furthermore, since our experimental results are quite encouraging, our next step is the use of other supervised classifiers as base learners, such as neural networks [54], support vector machines [55] or ensemble-based learners [26], aiming to enhance our proposed framework with more sophisticated and theoretically motivated selection criteria for the most promising classifier, in order to study the behavior of AAST at each cycle. Finally, an interesting direction is the evaluation of the proposed algorithm on real-world datasets from specific scientific fields, such as education and health care, and the exploration of its performance on imbalanced datasets [56,57] using more sophisticated performance metrics such as Sensitivity, Specificity, F-measure, AUC and ROC curve [58,59].