1. Introduction
In information retrieval (IR), ranking retrieved documents according to their relevance to a user query is an important task. Once a user's query is received, a ranking system is used to order the retrieved documents by relevance, as shown in
Figure 1. An optimization model is used to order the collection of available documents using such a ranking system [
1,
2]. A number of unsupervised term vector models (TVMs), including the vector space model (VSM), TF-IDF and Okapi BM25, were used in early IR research [
2,
3]. Based on these models, retrieved documents were scored for relevance to the user's search terms using a single term-weighting scheme (TWS) in IR systems. These methods were found to be insufficient for the development of effective IR systems. There are several reasons for this, including the fact that scoring approaches such as Okapi BM25 and various language models are limited in their ability to return appropriate search results based on relevance judgments [
3,
4]. Consequently, multiple scoring methods should be used to rank retrieved documents based on the user's query. Furthermore, other aspects, such as the importance of business documents on the web, should also be considered; among other desirable features, the host server can be taken into account when ranking documents. A statistical machine learning approach traditionally focuses on solving a single-objective optimization problem [
4,
5]; that is, the average loss over a training set is minimized. Additional quantities, such as model complexity, are either addressed implicitly by the choice of model class or folded into the main objective as weighted regularization terms.
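This traditional single-objective formulation can be written compactly as a regularized empirical risk. The sketch below is generic: the per-example loss ℓ, the regularizer Ω and the trade-off weight λ are placeholders, not quantities specified by this paper:

```latex
\min_{w} \; \frac{1}{n}\sum_{i=1}^{n} \ell\big(f_{w}(x_i),\, y_i\big) \;+\; \lambda\,\Omega(w)
```

Multiobjective learning, discussed next, replaces this single weighted sum with a vector of objectives whose trade-offs are handled explicitly.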
Recently, the machine learning community has focused on additional quantities of interest, such as the fairness, robustness, efficiency or interpretability of learned models. Optimizing these can conflict with the goal of reducing training loss, so task-specific trade-offs need to be considered. Hard-coding such trade-offs may have undesirable consequences, and selecting them becomes cumbersome when multiple objectives are at stake. Interest in multiobjective learning has therefore increased in recent years as a way to avoid a priori trade-offs. By performing multiobjective optimization while training the actual model, the optimization either finds promising trade-off parameters simultaneously or computes multiple solutions that reflect different trade-offs, ideally along the Pareto frontier. Despite a rich body of algorithms, the theory of multiobjective optimization and learning has been little studied; in particular, learning theory results such as generalization bounds are almost completely absent. To overcome these limitations, this study proposes a new approach that combines multiobjective evaluation metrics in a (1 + 1) evolutionary strategy using three different methods and examines their effectiveness against a single-objective evolutionary strategy. The contributions can be summarized as follows:
A hybrid multiobjective algorithm is proposed for a more accurate exploration of the IR problem search space. This objective is achieved by devising the multiobjective evolutionary strategy with three different methods.
The performance of the multiobjective evolutionary strategy is enhanced by automatically choosing and optimizing search results using three novel multiobjective functions, which determine the set of solutions that are nondominated with respect to one another and superior to the rest of the search space.
A comprehensive experiment was conducted to validate the effectiveness of the proposed strategy and to compare its performance against that of state-of-the-art single-objective evolutionary algorithms.
2. Related Work
In this section, we discuss related studies that have applied multiobjective methods to learning-to-rank (LTR) problems.
A learning-to-rank process aims to produce a ranking model capable of accurately predicting the relevance of a set of queries and items, improving user satisfaction and engagement. Obtaining a ranking function requires a structured process involving several steps. First, a dataset is gathered that includes queries, items and relevance labels, ensuring a variety of scenarios for robustness. Next, relevant features are extracted from both queries and items, capturing the critical aspects that affect their relative relevance. Once the training data have been obtained, they are used to develop a ranking function. Finally, a ranked list of documents associated with a new query is created using the ranking function [
7,
8].
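As a minimal sketch of the final step above, a learned linear ranking function scores each query–document feature vector against a weight vector and sorts the documents by score. The feature values and weights below are hypothetical, purely for illustration:

```python
def rank_documents(weights, docs):
    """Score each document as the dot product of its feature vector with
    the learned weight vector, then return indices sorted by descending score."""
    scores = [sum(w * x for w, x in zip(weights, feats)) for feats in docs]
    return sorted(range(len(docs)), key=lambda i: -scores[i])

# Toy example: 3 documents for one query, 4 features each (hypothetical values).
X = [[0.2, 0.1, 0.0, 0.5],
     [0.9, 0.3, 0.4, 0.1],
     [0.4, 0.4, 0.2, 0.2]]
w = [1.0, 0.5, 0.5, 0.2]
order = rank_documents(w, X)  # most relevant document first
```

Learning-to-rank methods differ mainly in how the weight vector is obtained, not in this scoring step.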
The study in [
8] presented a multiobjective LTR approach for commercial search engines using LambdaMART, a state-of-the-art ranking algorithm. The authors modified the λ functions to solve two problems associated with the current LambdaMART λ-gradient: the ranking model trying to further separate documents that were already correctly ordered and separated, and ranking mistakes that persisted long into training. Their approach achieved significant improvements in accuracy over the baseline state-of-the-art LambdaMART ranker. The experiments were performed on a large real-world dataset in which each query–URL pair had 860 features. However, neither the dataset itself nor the authors' code package is available to researchers for reproducibility.
The incorporation of relevant and well-engineered features into the dataset will enhance the model’s ability to generalize and provide informed ranking results. Several evolutionary multiobjective feature-selection ranking algorithms have been proposed in recent years [
7,
8]. Li et al. [
9] proposed a new decomposition-based multiobjective immune algorithm, MOIA/DFSRank, for feature selection in L2R. To ensure greater convergence and diversity, the initial population is generated from representative features selected according to their importance and redundancy scores. The proposed algorithm utilizes two effective operators, clonal selection and mutation: the clonal selection operator generates clones to guide the search direction during evolution, while the mutation operator retains excellent features with high probability. Kundu et al. [
10] employed the NSGA-II algorithm framework to introduce a method for feature selection utilizing an SNN-based distance metric. This method aims to concurrently maximize both the count of selected features and the classification accuracy. Zhang et al. [
11] utilized an enhanced MOPSO algorithm to effectively diminish the Hamming loss value, even when utilizing a reduced number of features. In a related context, Das [
12] presented a multiobjective evolutionary algorithm centered on relevance and redundancy considerations; this approach demonstrated superior classification outcomes while utilizing a reduced set of selected features. Mahapatra et al. [
13] addressed the multiobjective optimization (MOO) problem associated with multilabel LTR (training a model using a different relevance criterion). Essentially, this framework is capable of consuming any first-order gradient-based MOO algorithm to train a ranking model. Cheng et al. [
14], on the other hand, addressed the learning-to-rank problem by devising an algorithm grounded in the NSGA-II framework, yielding commendable results. Nevertheless, there remains a need for further enhancement of classification accuracy within this framework.
For commercial search engine preferences, the query–item relevance can be judged based on different criteria. For instance, in a search for products, the search engine may rank products based on their quality or on the user’s price preferences. The research study in [
4] applied several multiobjective optimization methods with preference directions, such as the traditional Pareto optimal search, to LTR problems. Their approach was applied to three LTR datasets and worked effectively for all three datasets. The datasets included the Microsoft Learning-to-Rank web search dataset (MSLR-WEB30K) [
15], which is represented by a 136-dimensional feature vector, and E-commerce datasets. They presented the maximum weighted loss as a novel model evaluation metric. The gradient-boosted regression tree (GBRT or MART) [
16] algorithm was used in the study. They found that the single-objective MART outperformed the multiobjective MART. Thus, they proposed a smooth remedy procedure to improve the performance of multiobjective MART compared to using the traditional Pareto optimal method in this algorithm.
Multiobjective optimization methods have been developed and used for multitask learning, especially for combinatorial optimization; however, their applications to LTR problems are still a novel research topic.
A different line of research presented multiobjective learning frameworks in which the authors used relevance labels and adjusted the ranking function with remedy procedures to satisfy multiple objectives, producing results that meet specific criteria such as scale calibration [
17] and fairness [
18,
19]. They used the Rank Neural Network (RankNET), LambdaMART and Listwise Neural Network (ListNET) approaches [
16]. Remedy procedures were used to overcome the gaps in performance between the single-objective approaches and multiobjective ones [
15]. On the other hand, evolutionary strategy LTR (ES-Rank) outperformed MART, RankNET, LambdaMART and ListNET in previous research [
16]. Furthermore, the single-objective ES-Rank outperformed 14 well-known evolutionary and machine learning approaches.
Hence, the principal objective of this research is to introduce an innovative multiobjective algorithm based on search-space exploration procedures and the Pareto optimal approach. These procedures serve as a remedy for the performance gap between the single- and multiobjective versions of the same algorithm, demonstrating that the multiobjective version of the LTR algorithm can outperform the single-objective version in some exploration circumstances. Empirical findings attest to the heightened performance of the introduced algorithm in tackling the challenges posed by the learning-to-rank problem.
3. Proposed Approach
In the field of optimization, metaheuristic algorithms are computational techniques for solving complex optimization problems. Traditional optimization methods may struggle with such problems because of their size, nonlinearity or the presence of multiple conflicting objectives. A metaheuristic differs from an exact optimization algorithm: while exact algorithms promise the best solution given enough time and resources, metaheuristics offer approximate solutions that are often of excellent quality. A single-objective heuristic addresses optimization problems with a single criterion or goal to maximize or minimize; by utilizing heuristics, an optimal solution to the given objective function can be sought. Multiobjective heuristics, on the other hand, are designed to solve optimization problems with multiple conflicting objectives. This paper uses the (1 + 1) evolutionary strategy algorithm for learning to rank (ES-Rank) in two variations, with single-objective and multiobjective evaluation metrics (as shown in
Figure 2).
The single-objective ES-Rank was used in a previous study [
16] in comparison with 14 evolutionary and machine learning methods, and it outperformed them. In such problems, multiple criteria often need to be optimized simultaneously, and these objectives often conflict. In general, no solution optimizes all objectives simultaneously because of inherent trade-offs. Several studies have shown that multiobjective optimization is usually less accurate than optimizing each fitness function individually. However, our method can be a strong rival to the single-objective ES-Rank.
This study aims to identify the most effective method for multiobjective learning to rank. It introduces three methods, two of which are novel in the field of multiobjective optimization.
Our proposed optimization algorithm, the multiobjective (1 + 1) evolutionary strategy, is a novel approach for tackling complex multi-ES-Rank problems. The problem involves multiple objectives to be optimized, and no single solution may be the best across all objectives. In this algorithm, the decision variables of a population of “individuals”, each representing a potential solution, are assigned random values. To rank these individuals, the algorithm employs the Pareto principle. In a multiobjective optimization problem, the Pareto optimal set, also known as the Pareto frontier, is the set of solutions that are not dominated by any other solution. As a result, no solution in the set is superior to another in all objectives: improving one objective would compromise at least one of the others. The framework of the proposed multiobjective (1 + 1) evolutionary strategy is as follows.
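The Pareto dominance test described above can be sketched as follows. This is a generic illustration assuming all objectives are maximized; the toy objective vectors are hypothetical:

```python
def dominates(a, b):
    """a dominates b if a is at least as good on every objective
    (maximization assumed) and strictly better on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(solutions):
    """Return the nondominated subset of a list of objective vectors."""
    return [s for s in solutions
            if not any(dominates(o, s) for o in solutions if o is not s)]

# Hypothetical objective vectors, e.g. (MAP, NDCG@10); higher is better.
population = [(0.4, 0.7), (0.5, 0.6), (0.3, 0.8), (0.2, 0.5)]
front = pareto_front(population)  # (0.2, 0.5) is dominated and drops out
```

Every member of the returned front represents a different trade-off between the objectives; none is uniformly better than another.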
3.1. Step 1: Initialization
Set initial values for the maximum algorithm iterations and population size, and then generate an initial population of candidate solutions (individuals), denoted as P(0), and assign random values to the decision variables for each individual.
3.2. Step 2: Termination
If the maximum number of iterations has not been reached, continue; otherwise, output the Pareto optimal set from P.
3.3. Step 3: Mutation
Using the objective function values for each individual, calculate the fitness value for that individual. A ranking-based approach, such as a nondominated sorting rank, can be used to calculate the fitness value. We derived three different methods from the single fitness objective function of ES-Rank. These three multiobjective ES-Rank methods use the Pareto frontier approach for the cumulative objective function MFitness. This cumulative fitness function can be calculated by Equation (1):
MFitness = ∑ Ci · Fi, for i = 1, …, 5, (1)
where Ci is the Pareto frontier coefficient i, which corresponds to the fitness evaluation metric Fi, and i is an integer between 1 and 5. The fitness evaluation metrics used in this study are the mean average precision (MAP), normalized discounted cumulative gain (NDCG@10), reciprocal rank (RR@10), expected reciprocal rank (ERR@10) and precision (P@10) at the top 10 documents retrieved [
20]. The three multiobjective ES-Rank methods use three different representations for Ci:
The first multiobjective ES-Rank approach uses fixed, equal coefficients, Ci = 1/5 for every fitness function, so that the coefficients sum to 1.
The second multiobjective ES-Rank approach uses a traditional real random number generator to assign a real number value to the coefficient of every fitness function in every evolving iteration, subject to a constraint: the five coefficients must sum to 1 (∑ Ci = 1) in every evolving iteration.
The third multiobjective ES-Rank approach uses a ziggurat Gaussian random number generator to assign a real number value to the Ci coefficient of every fitness function in every evolving iteration, subject to the same constraint that ∑ Ci = 1 in every evolving iteration. The ziggurat Gaussian random number generator [
21] generates a normalized Gaussian random number between 0 and 1 rather than between −50 and 50 as in the traditional Gaussian random number generator.
3.4. Step 4: Population Evolution
To guarantee that the constraints on the Pareto frontier coefficients in the second and third multiobjective ES-Rank methods are met, let the five coefficients generated by the random number generators in each evolving iteration be Ci = {C1, C2, C3, C4, C5}. There is no guarantee that these coefficients sum to 1 without a normalization factor. The normalization factor NF can be calculated by Equation (2):
NF = C1 + C2 + C3 + C4 + C5. (2)
Then, the normalized Pareto coefficients are calculated by Ci = Ci/NF, where i ∈ {1, 2, 3, 4, 5}.
During each iteration, methods 2 and 3 use multiobjective randomization functions based on traditional and ziggurat Gaussian distribution Pareto coefficients, respectively. In this manner, more exploration of the multiobjective search space can be achieved, while exploitation is constrained by the unit sum of the Pareto coefficients. A rank is assigned to each individual, with a lower rank indicating a higher level of fitness. Ranks and fitness values are then used to select parents for reproduction; the probability of becoming a parent increases for individuals with a lower rank and a higher fitness value.
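The three coefficient-assignment methods, together with the normalization and cumulative fitness of Equations (1) and (2), can be sketched as below. This is an illustrative reconstruction: Python's standard `random.gauss` stands in for the ziggurat Gaussian generator, and the distribution parameters are assumptions rather than the paper's settings:

```python
import random

def normalize(cs):
    """Divide each raw coefficient by NF = sum(cs) (Equation (2)) so that
    the five Pareto frontier coefficients sum to 1."""
    nf = sum(cs)
    return [c / nf for c in cs]

def coefficients(method, rng=random):
    """Generate the five Pareto frontier coefficients C1..C5 for one iteration."""
    if method == 1:   # fixed equal weights
        return [1.0 / 5] * 5
    if method == 2:   # traditional real random numbers, then normalized
        return normalize([rng.random() for _ in range(5)])
    if method == 3:   # Gaussian draws (stand-in for the ziggurat generator), normalized
        return normalize([abs(rng.gauss(0.5, 0.15)) for _ in range(5)])
    raise ValueError("method must be 1, 2 or 3")

def mfitness(metric_values, cs):
    """Cumulative fitness MFitness = sum_i Ci * Fi (Equation (1));
    metric_values holds (MAP, NDCG@10, RR@10, ERR@10, P@10) in a fixed order."""
    return sum(c * f for c, f in zip(cs, metric_values))
```

Because the coefficients are regenerated (and renormalized) at every evolving iteration in methods 2 and 3, the weighted objective explores different trade-off directions over the run.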
3.5. Step 5: Population Update
The (1 + 1) evolutionary strategy maintains two solutions: the current solution (parent) and a candidate solution (offspring) obtained by perturbing the parent. If the offspring is not at least as fit as its parent, it is discarded from consideration for the following generation. The chromosome, a vector of weights, represents the evolving ranking model.
Algorithm 1 outlines the multi-ES-Rank algorithm. The training and validation sets of query–document pairs provide the means of assessing evolving solutions in each iteration, and the output of the algorithm is a ranking model for the dataset used in the evolving phase. In the parent chromosome PCh, each gene is a real number representing the significance of the corresponding feature for ranking the training and validation data instances, where the data instances are queries and documents. In steps 1 through 4, each gene of the parent chromosome vector is initialized to a value of 0.5. The Boolean parameter Good indicates whether the mutation steps of the previous generation should be repeated; it is initialized to FALSE in step 5.
A copy of PCh is assigned to OffCh in step 6. The evolving process is repeated until the maximum generation MaxGenerations is reached; the number of iterations is 1300 in this paper. The evolving procedure begins in step 7 and ends in step 24. The procedure for managing mutations is demonstrated in steps 8–16 by choosing the number of genes to mutate (RM). Four probability distributions are used to determine the mutation step (steps 11 to 15): Gaussian, Cauchy, Levy and uniform. A successful evolution process (one that produced good offspring) in evolving iteration G − 1 is repeated in evolving iteration G, as illustrated in step 9. Otherwise, the mutation procedure's settings are reset, as demonstrated in steps 11 to 15. Using the fitness metrics, steps 17 to 23 determine whether to keep PCh or OffCh. Finally, in step 25, the relationship between dynamic feature weights and query–document pairs is represented by the mathematical transposition of the feature weights vector (i.e., the multi-ES-Rank procedure).
Algorithm 1: MultiES-Rank: Multiobjective Evolutionary Strategy Ranking Approach
|
| Input: A training set α(q, d) and a validation set ɳ(q, d) of query–document pairs of feature vectors. |
| Output: A linear ranking function F(q, d) that assigns a weight to every query–document pair indicating its relevancy degree. |
1 | Initialization: |
2 | For (Gen_i Є PCh) do |
3 | | Gen_i = 0.5; |
4 | end |
5 | Good = FALSE; |
6 | OffCh = PCh; |
7 | For (G = 1 to MaxGenerations) do |
8 | | If (Good==TRUE) Then |
9 | | | Use the same mutation process of generation (G-1) on OffCh to mutate next OffCh, that is, mutate the same RM genes using the same Mutation Step; |
10 | | Else
11 | | | Choose number of genes to mutate RM at random from 1 to M |
12 | | For (j = 1 to RM) |
13 | | | Choose random Gen_i in OffCh for mutation; |
14 | | | Mutate Gen_i using Mutation Step according to the Probability Distribution used;
15 | | end |
16 | end |
17 | If ((Fitness(PCh, α(q,d)) < Fitness(OffCh, α(q,d))) && (Fitness(PCh, ɳ(q,d)) ≤ Fitness(OffCh, ɳ(q,d)))) Then |
18 | | PCh = OffCh; |
19 | | Good=TRUE; |
20 | Else |
21 | | OffCh = PCh; |
22 | | Good = FALSE; |
23 | end |
24 | end
25 | Return: The linear ranking function F(q, d) = PCh; that is, at the end of MaxGenerations, PCh contains the evolved vector W of M feature weights.
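A compact sketch of the (1 + 1) evolutionary loop in Algorithm 1 follows. It is simplified: a single fitness callable stands in for the separate training- and validation-set checks of step 17, only a Gaussian mutation step is shown, and the initial gene value and step size are assumptions:

```python
import random

def multi_es_rank(fitness, m_features, max_generations=300, rng=random):
    """Simplified (1 + 1) evolutionary strategy: keep a parent weight vector,
    mutate a random subset of genes to create an offspring, and accept the
    offspring only if it is at least as fit as the parent."""
    parent = [0.5] * m_features          # PCh; initial gene value is an assumption
    best = fitness(parent)
    good, last_moves = False, []
    for _ in range(max_generations):
        if good and last_moves:
            moves = last_moves           # repeat the previous successful mutation
        else:
            rm = rng.randrange(1, m_features + 1)           # genes to mutate (RM)
            moves = [(rng.randrange(m_features), rng.gauss(0.0, 0.3))
                     for _ in range(rm)]                    # Gaussian mutation steps
        offspring = list(parent)         # OffCh
        for gene, step in moves:
            offspring[gene] += step
        score = fitness(offspring)
        if score >= best:                # keep OffCh only if at least as fit
            parent, best, good, last_moves = offspring, score, True, moves
        else:
            good = False
    return parent, best

# Toy usage: a hypothetical surrogate fitness standing in for MFitness.
rng = random.Random(42)
toy_fitness = lambda w: -sum((x - 1.0) ** 2 for x in w)
model, best_score = multi_es_rank(toy_fitness, m_features=4, rng=rng)
```

The `good` flag mirrors steps 8–9 of Algorithm 1: a mutation that improved fitness is reapplied in the next generation, and the mutation settings are re-randomized only after a failure.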
4. Experimental Results
This section presents a thorough experimental investigation comparing the three proposed multiobjective learning-to-rank methods with an existing single-objective approach in terms of five accuracy fitness metrics: MAP (mean average precision), RR (reciprocal rank), ERR (expected reciprocal rank), NDCG (normalized discounted cumulative gain) and P (precision) at the top 10 documents retrieved, as described in Section 4.1. To evaluate the performance of an LTR approach, the LTR technique is first applied to the training set. Afterwards, the ranking model's performance is evaluated on the test set to determine how well the LTR algorithm makes predictions.
4.1. Benchmark Datasets and Evaluation Fitness Metrics
Three benchmarking datasets are considered in this paper, as follows:
The MSLR-WEB30K dataset [
22]: This dataset provides a comprehensive and realistic set of query–document pairs with relevance labels. Additionally, there is a set of features associated with each query–document pair that capture various aspects of the query and the document. Among these features are textual features, numerical features and other metadata that can be used to determine the degree of relevance of a document with respect to a specific query.
LETOR 4.0 [
23,
24]: This is part of the LETOR (Learning to Rank for Information Retrieval) benchmark collection and includes the MQ2007 and MQ2008 datasets. A significant number of query–document pairs are included, each associated with a relevance label. The datasets also provide a variety of features capturing the characteristics of both queries and documents, including textual attributes, numerical attributes and other metadata. These features are designed to aid ranking algorithms in determining the relevance of documents to a query.
As can be seen in
Table 1, these datasets have a number of different characteristics. Compared to the LETOR 4 datasets (MQ2007 and MQ2008), the Microsoft Bing Search dataset (MSLR-WEB30K) has a much higher number of query–document pairs and features. Each query–document pair is associated with several low-level features, such as term frequency and inverse document frequency, determined for all document parts (title, anchor, body and whole document). In addition, there are high-level features that indicate how well the queries and documents correspond. Hybrid features employed in previous SIGIR conference papers are also included, such as the language model with absolute discounting smoothing (LMIR.ABS), the language model with Jelinek–Mercer smoothing (LMIR.JM) and the language model with Dirichlet smoothing (LMIR.DIR) [
22,
23,
24,
25]. There are 30,000 queries in the MSLR-WEB30K dataset. MQ2008 contains fewer than 1000 queries, whereas MQ2007 contains 1692 queries. Each query is associated with a variety of query–document pairs, based on a set of relevant and irrelevant documents. A relevance label indicates the level of relevance of a document to its query (the query–document relationship). As a general rule, relevance labels are classified as 0 (totally irrelevant), 1 (moderately relevant) and 2 (very relevant). The exception is the MSLR-WEB30K dataset, where labels range from 0 (irrelevant) to 4 (perfectly relevant).
In this research, MAP, NDCG@10, P@10, RR@10 and ERR@10 were used as five distinct fitness functions on the training sets [
1]. They were also used as assessment measures for the ranking algorithms on the test sets. These fitness functions are described in detail in [
20].
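For reference, two of these fitness metrics can be sketched as below, using one common formulation (the paper's exact definitions follow [20]); the relevance labels in the example are hypothetical:

```python
import math

def precision_at_k(rels, k=10):
    """Fraction of the top-k retrieved documents that are relevant (label > 0)."""
    return sum(1 for r in rels[:k] if r > 0) / k

def ndcg_at_k(rels, k=10):
    """NDCG@k with the common 2^rel - 1 gain and log2 position discount."""
    def dcg(labels):
        return sum((2 ** r - 1) / math.log2(i + 2) for i, r in enumerate(labels[:k]))
    ideal = dcg(sorted(rels, reverse=True))
    return dcg(rels) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels of documents in ranked order.
ranked = [2, 0, 1, 0, 0, 1, 0, 0, 0, 0]
p10 = precision_at_k(ranked)
n10 = ndcg_at_k(ranked)
```

NDCG normalizes the discounted gain by that of the ideal ordering, so an already perfectly ordered list scores exactly 1.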
4.2. Result Analysis and Discussion
This section gives an overview of the progress achieved using multiobjective LTR. From the results obtained, using the Cauchy probability distribution as the random number generator for mutation step sizes in multiobjective ES-Rank outperformed the Gaussian, Levy and uniform distributions. It also outperformed the single-objective ES-Rank; however, whether the Cauchy-based multiobjective method dominates in performance depends on the particular dataset used.
Figure 3,
Figure 4 and
Figure 5 illustrate the superiority of the proposed methodologies for LTR for the three datasets used.
From
Figure 3 and
Figure 4, for both the MSLR-WEB30K and MQ2008 datasets, the performance of the single-objective ES-Rank is higher than that of the multiobjective ES-Rank. This degradation in performance is accepted in order to gain the benefits of multiobjective ranking. From
Figure 5, it is found that for the dataset MQ2007, the multiobjective ES-Rank with method 1 using uniform and multiobjective ES-Rank with method 2 using Cauchy as a random number generator for mutation step sizes both achieve high performance, with 6 and 7 winning rates, respectively. This is better than the overall performance of the single-objective ES-Rank. These results ensure the effectiveness of our proposed methods for both single-objective and multiobjective optimization. Moreover, the dataset affects the performance of ES-Rank for all the methods used.
To evaluate the random number generator distributions,
Figure 6 illustrates the NDCG@10 for the test set of the MSLR dataset. From
Figure 6, we can conclude that Levy is the best one for single-objective ES-Rank, while for multiobjective,
Figure 6 shows grouping results based on the method of optimization, where Levy is the best for method 1, method 2 and method 3. Thus, Levy probability distribution as a random number generator for mutation step sizes is recommended for single-objective and multiobjective ES-Rank using all three methods. Moreover, method 2 with Levy achieves the highest NDCG@10 for the MSLR dataset.
For analyzing and evaluating different random number generators,
Figure 6,
Figure 7 and
Figure 8 illustrate the NDCG@10 for testing data for MSLR, MQ2007 and MQ2008. For the MQ2008 dataset,
Figure 7 illustrates the NDCG@10 for the test set. From
Figure 7, it is found that the Gaussian probability distribution as a random number generator for mutation step sizes is recommended for single-objective and multiobjective ES-Rank using all three methods. Moreover, method 3 with Gaussian achieves the highest NDCG@10 for the MQ2008 dataset.
For the MQ2007 dataset,
Figure 8 illustrates the NDCG@10 for the test set. From
Figure 8, it is found that the Levy probability distribution as a random number generator for mutation step sizes is recommended for multiobjective ES-Rank using all three methods; however, Gaussian is recommended for single-objective ES-Rank. Moreover, method 3 with Levy achieves the highest NDCG@10 for the MQ2007 dataset. Thus, random number generators clearly affect the performance of ES-Rank depending on the dataset used.
Multi-ES-Rank is an evolutionary strategy that uses a cumulative fitness function to determine the quality of each evolving ranking model in each iteration. Because the Pareto frontier contains no dominated solutions, no other solution performs better on all objectives at the same time. The developed strategy explores the search space and, through the cumulative fitness function, produces diverse solutions reflecting different trade-offs between the objectives. As a result, the developed algorithm provides decision-makers with a variety of options from which to select, so that they can make informed decisions based on their individual preferences.
In summary, this paper introduces a multiobjective evolutionary strategy (multi-ES-Rank) approach for learning-to-rank problems. In addition, we propose three novel Pareto optimal methods in continuous optimization research. Furthermore, we provide the Java archive package of the proposed approach for research reproducibility. From the experimental results, multi-ES-Rank can outperform single-objective ES-Rank in some circumstances of mutation step sizes and Pareto optimal methods for LTR data, as given in
Appendix A. The best performance can be gained with the method using Cauchy as a random number generator for mutation step sizes in terms of winning rate. This causes the multi-ES-Rank to outperform the single-objective ES-Rank in certain conditions. Moreover, the different random number generators are evaluated and analyzed versus the three datasets in terms of NDCG@10 for testing data. It was found that the Levy generator is the best for both the MSLR and MQ2007 datasets while the Gaussian generator is the best for the MQ2008 dataset. Thus, random number generators clearly affect the performance of ES-Rank based on the dataset used. Furthermore, method 3 achieved the highest NDCG@10 for MQ2008 and MQ2007, while for the MSLR dataset, the highest NDCG@10 was achieved by method 2.
An important limitation of this study is the sensitivity of the evolutionary fitness function to configuration parameters. The results highlight the importance of careful parameter tuning, but they also demonstrate that identifying a universally optimal configuration is difficult because it often depends on the specific dataset and problem domain. Since no single configuration may suit different LTR tasks and datasets, developing automated hyperparameter optimization techniques may mitigate this limitation in the future. This study is also limited by the lack of dedicated multiobjective optimization packages for comparison. Most research focuses on learning-to-rank models with single objectives, such as mean squared error or pairwise ranking losses, whereas real-world applications often require optimizing conflicting objectives simultaneously. Future research can evaluate the proposed techniques in more complex optimization scenarios and on a broader scale.