Interpretability, Accountability and Robustness in Machine Learning

A special issue of Algorithms (ISSN 1999-4893). This special issue belongs to the section "Evolutionary Algorithms and Machine Learning".

Deadline for manuscript submissions: closed (15 December 2022) | Viewed by 51040

Special Issue Editor


Dr. Laurent Risser
Guest Editor
CNRS - Toulouse Mathematics Institute - Artificial and Natural Intelligence Toulouse Institute
Interests: explainable machine learning; fair machine learning; regularization in high-dimensional optimization; medical image analysis

Special Issue Information

Dear Colleagues, 

Applications based on machine-learning algorithms have become predominant in supporting decision-making in fields such as online advertising, credit, risk assessment, and insurance. They are also of high interest for autonomous vehicles and healthcare, among others. These algorithms make it possible to take automatic decisions quickly and efficiently, with unprecedented success. Their decisions, however, rely heavily on training data, which are potentially biased. In addition, the decision rules generally cannot be directly explained to humans; this is particularly true for neural networks as well as forest- and kernel-based models. Finally, the training phase relies on high-dimensional optimization algorithms that do not generally converge to global minima. As a consequence, a critical question has recently arisen among the public: do algorithmic decisions convey any type of discrimination against specific population sub-groups? The same question can be asked in industrial applications, where machine-learning algorithms may not be robust in critical situations. This has opened a new field of research in machine learning dealing with the interpretability, accountability, and robustness of machine-learning algorithms, which is at the heart of this Special Issue.

We invite you to submit high-quality papers to the Special Issue on “Interpretability, Accountability and Robustness in Machine Learning”, with subjects covering the whole range from theory to applications. The following is a non-exhaustive list of topics of interest:

  • Interpretable and explainable Machine-Learning
  • Fair Machine-Learning
  • Bias Measurement in complex data
  • Applications

Dr. Laurent Risser
Guest Editor

Manuscript Submission Information

Manuscripts should be submitted online at www.mdpi.com by registering and logging in to this website. Once you are registered, go to the submission form. Manuscripts can be submitted until the deadline. All submissions that pass pre-check are peer-reviewed. Accepted papers will be published continuously in the journal (as soon as accepted) and will be listed together on the special issue website. Research articles, review articles, and short communications are invited. For planned papers, a title and short abstract (about 100 words) can be sent to the Editorial Office for announcement on this website.

Submitted manuscripts should not have been published previously, nor be under consideration for publication elsewhere (except conference proceedings papers). All manuscripts are thoroughly refereed through a single-blind peer-review process. A guide for authors and other relevant information for submission of manuscripts is available on the Instructions for Authors page. Algorithms is an international peer-reviewed open access monthly journal published by MDPI.

Please visit the Instructions for Authors page before submitting a manuscript. The Article Processing Charge (APC) for publication in this open access journal is 1600 CHF (Swiss Francs). Submitted papers should be well formatted and use good English. Authors may use MDPI's English editing service prior to publication or during author revisions.

Keywords

  • Machine-Learning
  • Interpretability
  • Explainability
  • Fairness
  • Algorithmic Bias

Published Papers (12 papers)


Research

19 pages, 515 KiB  
Article
How Optimal Transport Can Tackle Gender Biases in Multi-Class Neural Network Classifiers for Job Recommendations
by Fanny Jourdan, Titon Tshiongo Kaninku, Nicholas Asher, Jean-Michel Loubes and Laurent Risser
Algorithms 2023, 16(3), 174; https://doi.org/10.3390/a16030174 - 22 Mar 2023
Cited by 3 | Viewed by 1885
Abstract
Automatic recommendation systems based on deep neural networks have become extremely popular during the last decade. Some of these systems can, however, be used in applications that are ranked as High Risk by the European Commission in the AI act—for instance, online job candidate recommendations. When used in the European Union, commercial AI systems in such applications will be required to have proper statistical properties with regard to the potential discrimination they could engender. This motivated our contribution. We present a novel optimal transport strategy to mitigate undesirable algorithmic biases in multi-class neural network classification. Our strategy is model agnostic and can be used on any multi-class classification neural network model. To anticipate the certification of recommendation systems using textual data, we used it on the Bios dataset, for which the learning task consists of predicting the occupation of female and male individuals, based on their LinkedIn biography. The results showed that our approach can reduce undesired algorithmic biases in this context to lower levels than a standard strategy. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
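
As a rough illustration of the kind of repair an optimal transport approach performs (not the authors' multi-class procedure), the sketch below aligns the predicted scores of two groups through one-dimensional quantile matching, which is the optimal transport map between 1D distributions; the data and helper names are invented for the example.

```python
import numpy as np

def repair_scores(scores, group, ref_quantiles=1000):
    """Push each group's score distribution toward a common target distribution
    via 1D quantile matching (the optimal transport map in one dimension)."""
    scores = np.asarray(scores, dtype=float)
    repaired = scores.copy()
    # Target distribution: quantiles of the pooled scores (a simple barycenter proxy).
    qs = np.linspace(0, 1, ref_quantiles)
    target = np.quantile(scores, qs)
    for g in np.unique(group):
        mask = group == g
        # Rank of each score within its group, mapped onto the target quantiles.
        ranks = np.argsort(np.argsort(scores[mask])) / max(mask.sum() - 1, 1)
        repaired[mask] = np.interp(ranks, qs, target)
    return repaired

# Toy usage: scores for one output class, with a group-dependent offset.
rng = np.random.default_rng(0)
group = rng.integers(0, 2, size=2000)
scores = rng.normal(loc=0.4 + 0.2 * group, scale=0.1, size=2000)
fair_scores = repair_scores(scores, group)
print(np.round([scores[group == 0].mean(), scores[group == 1].mean()], 3))
print(np.round([fair_scores[group == 0].mean(), fair_scores[group == 1].mean()], 3))
```

After the repair, both groups share the same score distribution, which is the effect a transport-based mitigation aims for while preserving within-group rankings.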

21 pages, 6631 KiB  
Article
Interpretation for Variational Autoencoder Used to Generate Financial Synthetic Tabular Data
by Jinhong Wu, Konstantinos Plataniotis, Lucy Liu, Ehsan Amjadian and Yuri Lawryshyn
Algorithms 2023, 16(2), 121; https://doi.org/10.3390/a16020121 - 16 Feb 2023
Cited by 2 | Viewed by 3173
Abstract
Synthetic data, artificially generated by computer programs, has become more widely used in the financial domain to mitigate privacy concerns. Variational Autoencoder (VAE) is one of the most popular deep-learning models for generating synthetic data. However, VAE is often considered a “black box” due to its opaqueness. Although some studies have been conducted to provide explanatory insights into VAE, research focusing on explaining how the input data could influence VAE to create synthetic data, especially for tabular data, is still lacking. However, in the financial industry, most data are stored in a tabular format. This paper proposes a sensitivity-based method to assess the impact of inputted tabular data on how VAE synthesizes data. This sensitivity-based method can provide both global and local interpretations efficiently and intuitively. To test this method, a simulated dataset and three Kaggle banking tabular datasets were employed. The results confirmed the applicability of this proposed method. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
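
A minimal sketch of the sensitivity idea, assuming a generic trained synthesizer that maps tabular inputs to generated outputs (replaced here by a toy linear map): perturb one feature at a time and record how much the output changes. This only illustrates the global-importance notion, not the paper's exact method.

```python
import numpy as np

def feature_sensitivity(synthesize, X, eps=1e-2):
    """Estimate how strongly each input column influences the synthesizer's output
    by finite differences: perturb one column, measure the mean output change."""
    base = synthesize(X)
    sens = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        Xp = X.copy()
        Xp[:, j] += eps * X[:, j].std()
        sens[j] = np.abs(synthesize(Xp) - base).mean()
    return sens / sens.sum()    # normalize to a global importance profile

# Toy stand-in for a trained VAE's encode->decode pass (a linear map here).
rng = np.random.default_rng(1)
W = rng.normal(size=(5, 5))
toy_vae = lambda X: X @ W @ W.T / 5.0

X = rng.normal(size=(200, 5))
print(np.round(feature_sensitivity(toy_vae, X), 3))
```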

15 pages, 14871 KiB  
Article
Deep Learning Classification of Colorectal Lesions Based on Whole Slide Images
by Sergey A. Soldatov, Danil M. Pashkov, Sergey A. Guda, Nikolay S. Karnaukhov, Alexander A. Guda and Alexander V. Soldatov
Algorithms 2022, 15(11), 398; https://doi.org/10.3390/a15110398 - 27 Oct 2022
Cited by 2 | Viewed by 2137
Abstract
Microscopic tissue analysis is the key diagnostic method needed for disease identification and choosing the best treatment regimen. According to the Global Cancer Observatory, approximately two million people are diagnosed with colorectal cancer each year, and an accurate diagnosis requires a significant amount of time and a highly qualified pathologist to decrease the high mortality rate. Recent developments in artificial intelligence technologies and scanning microscopy have introduced digital pathology into the field of cancer diagnosis by means of the whole-slide image (WSI). In this work, we applied deep learning methods to diagnose six types of colon mucosal lesions using convolutional neural networks (CNNs). As a result, an algorithm for the automatic segmentation of WSIs of colon biopsies was developed, implementing pre-trained, deep convolutional neural networks of the ResNet and EfficientNet architectures. We compared the classical method and one-cycle policy for CNN training and applied both multi-class and multi-label approaches to solve the classification problem. The multi-label approach was superior because some WSI patches may belong to several classes at once or to none of them. Using the standard one-vs-rest approach, we trained multiple binary classifiers. They achieved areas under the receiver operating characteristic curve (AUC) in the range of 0.80–0.96. Other metrics were also calculated, such as accuracy, precision, sensitivity, specificity, negative predictive value, and F1-score. The obtained CNNs can support human pathologists in the diagnostic process and can be extended to other cancers after adding a sufficient amount of labeled data. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
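
The one-vs-rest multi-label setup can be illustrated with a short sketch: one binary classifier per lesion class on top of stand-in patch embeddings, each evaluated with its own ROC AUC. The features, labels, and classifier below are placeholders, not the paper's pipeline.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)
n_patches, n_features, n_classes = 1000, 64, 6
X = rng.normal(size=(n_patches, n_features))       # stand-in for CNN patch embeddings
Y = (X[:, :n_classes] + rng.normal(size=(n_patches, n_classes)) > 0).astype(int)

X_tr, X_te, Y_tr, Y_te = train_test_split(X, Y, test_size=0.3, random_state=0)

# One-vs-rest: an independent binary classifier per lesion type, so a patch
# may receive several labels (or none), unlike a softmax multi-class head.
aucs = []
for c in range(n_classes):
    clf = LogisticRegression(max_iter=1000).fit(X_tr, Y_tr[:, c])
    aucs.append(roc_auc_score(Y_te[:, c], clf.predict_proba(X_te)[:, 1]))
print(np.round(aucs, 2))
```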

16 pages, 427 KiB  
Article
Dealing with Gender Bias Issues in Data-Algorithmic Processes: A Social-Statistical Perspective
by Juliana Castaneda, Assumpta Jover, Laura Calvet, Sergi Yanes, Angel A. Juan and Milagros Sainz
Algorithms 2022, 15(9), 303; https://doi.org/10.3390/a15090303 - 27 Aug 2022
Cited by 5 | Viewed by 6530
Abstract
Are algorithms sexist? This question has frequently appeared in the mass media, and the debate has typically been far from a scientific analysis. This paper aims at answering the question using a hybrid social and technical perspective. First, a technically oriented definition of the algorithm concept is provided, together with a more socially oriented interpretation. Second, several related works are reviewed in order to clarify the state of the art in this matter, as well as to highlight the different perspectives under which the topic has been analyzed. Third, we describe an illustrative numerical example of possible discrimination in the banking sector due to data bias and propose a simple but effective methodology to address it. Finally, a series of recommendations are provided with the goal of minimizing gender bias while designing and using data-algorithmic processes to support decision making in different environments. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
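
A tiny, hedged illustration of one common way to quantify such bias numerically (not necessarily the metric used in the paper): the disparate impact ratio of approval rates between two groups on synthetic loan decisions.

```python
import numpy as np

def disparate_impact(y_pred, sensitive):
    """Ratio of positive-decision rates between the two groups
    (the '80% rule' flags values below 0.8 as potentially discriminatory)."""
    rate_a = y_pred[sensitive == 0].mean()
    rate_b = y_pred[sensitive == 1].mean()
    return min(rate_a, rate_b) / max(rate_a, rate_b)

rng = np.random.default_rng(3)
gender = rng.integers(0, 2, size=5000)
# Toy loan decisions where group 1 is approved less often.
approved = (rng.random(5000) < 0.55 - 0.15 * gender).astype(int)
print(round(disparate_impact(approved, gender), 3))
```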

12 pages, 575 KiB  
Article
A Federated Generalized Linear Model for Privacy-Preserving Analysis
by Matteo Cellamare, Anna J. van Gestel, Hasan Alradhi, Frank Martin and Arturo Moncada-Torres
Algorithms 2022, 15(7), 243; https://doi.org/10.3390/a15070243 - 13 Jul 2022
Cited by 9 | Viewed by 2512
Abstract
In the last few years, federated learning (FL) has emerged as a novel alternative for analyzing data spread across different parties without needing to centralize them. In order to increase the adoption of FL, there is a need to develop more algorithms that can be deployed under this novel privacy-preserving paradigm. In this paper, we present our federated generalized linear model (GLM) for horizontally partitioned data. It allows generating models of different families (linear, Poisson, logistic) without disclosing privacy-sensitive individual records. We describe its algorithm (which can be implemented in the user’s platform of choice) and compare the obtained federated models against their centralized counterparts, which were mathematically equivalent. We also validated their execution time with increasing numbers of records and involved parties. We show that our federated GLM is accurate enough to be used for the privacy-preserving analysis of horizontally partitioned data in real-life scenarios. Further development of this type of algorithm has the potential to make FL a much more common practice among researchers. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
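
A minimal sketch of the federated idea for one GLM family (logistic regression), assuming each party shares only aggregated gradients with a coordinator; the data, learning rate, and stopping rule below are illustrative, not the authors' implementation.

```python
import numpy as np

def local_gradient(beta, X, y):
    """Gradient of the logistic log-likelihood on one party's data (shared as an aggregate)."""
    p = 1.0 / (1.0 + np.exp(-X @ beta))
    return X.T @ (y - p), len(y)

def federated_logistic(parties, n_features, lr=0.5, iters=500):
    """Coordinator loop: parties share only summed gradients, never individual records,
    so each update equals the one a centralized model would compute on the pooled data."""
    beta = np.zeros(n_features)
    for _ in range(iters):
        grads, counts = zip(*(local_gradient(beta, X, y) for X, y in parties))
        beta += lr * sum(grads) / sum(counts)
    return beta

rng = np.random.default_rng(4)
true_beta = np.array([1.0, -2.0, 0.5])

def make_party(n):
    X = rng.normal(size=(n, 3))
    y = (rng.random(n) < 1.0 / (1.0 + np.exp(-X @ true_beta))).astype(float)
    return X, y

parties = [make_party(n) for n in (400, 700, 300)]     # horizontally partitioned data
print(np.round(federated_logistic(parties, 3), 2))     # should land near true_beta
```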

22 pages, 878 KiB  
Article
Efficient Machine Learning Models for Early Stage Detection of Autism Spectrum Disorder
by Mousumi Bala, Mohammad Hanif Ali, Md. Shahriare Satu, Khondokar Fida Hasan and Mohammad Ali Moni
Algorithms 2022, 15(5), 166; https://doi.org/10.3390/a15050166 - 16 May 2022
Cited by 24 | Viewed by 4686
Abstract
Autism spectrum disorder (ASD) is a neurodevelopmental disorder that severely impairs an individual’s cognitive, linguistic, object recognition, communication, and social abilities. The condition is not treatable, although early detection of ASD can assist in diagnosis and in taking proper steps to mitigate its effects. Using various artificial intelligence (AI) techniques, ASD can be detected at an earlier stage than with traditional methods. The aim of this study was to propose a machine learning model that investigates ASD data of different age levels and identifies ASD more accurately. In this work, we gathered ASD datasets of toddlers, children, adolescents, and adults and used several feature selection techniques. Then, different classifiers were applied to these datasets, and we assessed their performance with evaluation metrics including predictive accuracy, kappa statistics, the F1-measure, and AUROC. In addition, we analyzed the performance of individual classifiers using a non-parametric statistical significance test. For the toddler, child, adolescent, and adult datasets, we found that the Support Vector Machine (SVM) performed better than the other classifiers, achieving 97.82% accuracy for the RIPPER-based toddler subset; 99.61% accuracy for the Correlation-based feature selection (CFS) and Boruta CFS intersect (BIC) method-based child subset; 95.87% accuracy for the Boruta-based adolescent subset; and 96.82% accuracy for the CFS-based adult subset. Then, we applied the Shapley Additive Explanations (SHAP) method to the feature subsets that gained the highest accuracy and ranked their features based on this analysis. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
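
A brief, hedged sketch of the general workflow (feature selection followed by an SVM, evaluated by cross-validation) on synthetic data; the dataset, selector, and hyperparameters below are placeholders rather than the study's actual configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a screening dataset (binary ASD / non-ASD label).
X, y = make_classification(n_samples=700, n_features=20, n_informative=8, random_state=0)

# Feature selection sits inside the pipeline so it is refit within each CV fold.
model = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectKBest(f_classif, k=10)),
    ("svm", SVC(kernel="rbf", probability=True)),
])
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print(f"5-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```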

14 pages, 1182 KiB  
Article
Using Explainable Machine Learning to Explore the Impact of Synoptic Reporting on Prostate Cancer
by Femke M. Janssen, Katja K. H. Aben, Berdine L. Heesterman, Quirinus J. M. Voorham, Paul A. Seegers and Arturo Moncada-Torres
Algorithms 2022, 15(2), 49; https://doi.org/10.3390/a15020049 - 29 Jan 2022
Cited by 5 | Viewed by 3839
Abstract
Machine learning (ML) models have proven to be an attractive alternative to traditional statistical methods in oncology. However, they are often regarded as black boxes, hindering their adoption for answering real-life clinical questions. In this paper, we show a practical application of explainable machine learning (XML). Specifically, we explored the effect that synoptic reporting (SR; i.e., reports where data elements are presented as discrete data items) in Pathology has on the survival of a population of 14,878 Dutch prostate cancer patients. We compared the performance of a Cox Proportional Hazards model (CPH) against that of an eXtreme Gradient Boosting model (XGB) in predicting patient ranked survival. We found that the XGB model (c-index = 0.67) performed significantly better than the CPH (c-index = 0.58). Moreover, we used Shapley Additive Explanations (SHAP) values to generate a quantitative mathematical representation of how features—including usage of SR—contributed to the models’ output. The XGB model in combination with SHAP visualizations revealed interesting interaction effects between SR and the rest of the most important features. These results hint that SR has a moderate positive impact on predicted patient survival. Moreover, adding an explainability layer to predictive ML models can open their black box, making them more accessible and easier to understand by the user. This can make XML-based techniques appealing alternatives to the classical methods used in oncological research and in health care in general. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
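
The concordance index used to compare the two models can be computed directly; the sketch below implements a plain c-index (pairs ordered consistently with observed survival, with censored patients used only as longer-lived comparators) on toy data, and may differ in detail from the exact estimator used in the paper.

```python
import numpy as np

def concordance_index(time, score, event):
    """Fraction of comparable patient pairs whose risk scores are ordered
    consistently with their survival times (ties counted as 0.5). A higher
    score is taken to mean higher risk, i.e., shorter expected survival."""
    n_conc, n_comp = 0.0, 0.0
    for i in range(len(time)):
        if not event[i]:
            continue                      # patient i must have an observed event
        for j in range(len(time)):
            if time[j] > time[i]:         # pair is comparable: j outlived i
                n_comp += 1
                if score[i] > score[j]:
                    n_conc += 1
                elif score[i] == score[j]:
                    n_conc += 0.5
    return n_conc / n_comp

rng = np.random.default_rng(5)
risk = rng.normal(size=300)
time = rng.exponential(scale=np.exp(-risk))     # higher risk -> shorter survival
event = rng.random(300) < 0.7                   # roughly 30% right-censored
print(round(concordance_index(time, risk, event), 3))
```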

34 pages, 3117 KiB  
Article
An Interaction-Based Convolutional Neural Network (ICNN) Toward a Better Understanding of COVID-19 X-ray Images
by Shaw-Hwa Lo and Yiqiao Yin
Algorithms 2021, 14(11), 337; https://doi.org/10.3390/a14110337 - 19 Nov 2021
Cited by 4 | Viewed by 2859
Abstract
The field of explainable artificial intelligence (XAI) aims to build explainable and interpretable machine learning (or deep learning) methods without sacrificing prediction performance. Convolutional neural networks (CNNs) have been successful in making predictions, especially in image classification. These popular and well-documented successes use extremely deep CNNs such as VGG16, DenseNet121, and Xception. However, these well-known deep learning models use tens of millions of parameters based on a large number of pretrained filters that have been repurposed from previous data sets. Among these identified filters, a large portion contains no information yet remains as input features. Thus far, there is no effective method to omit these noisy features from a data set, and their existence negatively impacts prediction performance. In this paper, a novel interaction-based convolutional neural network (ICNN) is introduced that does not make assumptions about the relevance of local information. Instead, a model-free influence score (I-score) is proposed to directly extract the influential information from images to form important variable modules. This innovative technique replaces all pretrained filters found by trial-and-error with explainable, influential, and predictive variable sets (modules) determined by the I-score. In other words, future researchers need not rely on pretrained filters; the suggested algorithm identifies only the variables or pixels with high I-score values that are extremely predictive and important. The proposed method and algorithm were tested on a real-world data set, and a state-of-the-art prediction performance of 99.8% was achieved without sacrificing the explanatory power of the model. This proposed design can efficiently screen patients infected by COVID-19 before human diagnosis and can be a benchmark for addressing future XAI problems in large-scale data sets. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
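
For orientation, one commonly cited form of the influence score partitions the samples by the joint values of a candidate variable set and rewards cells whose mean response deviates from the overall mean; the sketch below uses that unnormalized form on toy binary features and may not match the authors' exact definition.

```python
import numpy as np

def influence_score(X_subset, y):
    """One common form of the influence (I-) score: partition samples by the joint
    values of the selected (discretized) variables and sum n_j^2 times the squared
    deviation of each cell's mean response from the overall mean. Illustrative only;
    the paper's normalization may differ."""
    y = np.asarray(y, dtype=float)
    y_bar = y.mean()
    # Group samples by their joint pattern of variable values.
    _, cell_ids = np.unique(X_subset, axis=0, return_inverse=True)
    score = 0.0
    for c in np.unique(cell_ids):
        y_c = y[cell_ids == c]
        score += len(y_c) ** 2 * (y_c.mean() - y_bar) ** 2
    return score

rng = np.random.default_rng(6)
X = rng.integers(0, 2, size=(500, 4))            # binarized pixel-like features
y = (X[:, 0] ^ X[:, 1]).astype(float)            # an interaction the I-score can detect
print(influence_score(X[:, :2], y), influence_score(X[:, 2:], y))
```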

20 pages, 1968 KiB  
Article
A Context-Aware Neural Embedding for Function-Level Vulnerability Detection
by Hongwei Wei, Guanjun Lin, Lin Li and Heming Jia
Algorithms 2021, 14(11), 335; https://doi.org/10.3390/a14110335 - 17 Nov 2021
Cited by 8 | Viewed by 3826
Abstract
Exploitable vulnerabilities in software systems are major security concerns. To date, machine learning (ML) based solutions have been proposed to automate and accelerate the detection of vulnerabilities. Most ML techniques aim to isolate a unit of source code, be it a line or a function, as being vulnerable. We argue that a code segment is vulnerable if it exists in certain semantic contexts, such as the control flow and data flow; therefore, it is important for the detection to be context aware. In this paper, we evaluate the performance of mainstream word embedding techniques in the scenario of software vulnerability detection. Based on the evaluation, we propose a supervised framework leveraging pre-trained context-aware embeddings from language models (ELMo) to capture deep contextual representations, further summarized by a bidirectional long short-term memory (Bi-LSTM) layer for learning long-range code dependency. The framework takes directly a source code function as an input and produces corresponding function embeddings, which can be treated as feature sets for conventional ML classifiers. Experimental results showed that the proposed framework yielded the best performance in its downstream detection tasks. Using the feature representations generated by our framework, random forest and support vector machine outperformed four baseline systems on our data sets, demonstrating that the framework incorporated with ELMo can effectively capture the vulnerable data flow patterns and facilitate the vulnerability detection task. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
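
A compact sketch of the summarization step, assuming token embeddings have already been produced by a context-aware language model: a bidirectional LSTM pools them into a single function-level vector that can be fed to a conventional classifier. The dimensions and pooling choice are illustrative, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FunctionEncoder(nn.Module):
    """Summarize a sequence of pre-computed token embeddings into one
    function-level vector with a bidirectional LSTM."""
    def __init__(self, emb_dim=256, hidden=128):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.head = nn.Linear(2 * hidden, 1)     # vulnerable / not vulnerable

    def forward(self, token_embeddings):
        out, _ = self.lstm(token_embeddings)     # (batch, seq_len, 2*hidden)
        pooled = out.mean(dim=1)                 # average over token positions
        return self.head(pooled).squeeze(-1), pooled

# Toy usage: a batch of 8 "functions", 120 tokens each, 256-dim embeddings.
model = FunctionEncoder()
emb = torch.randn(8, 120, 256)
logits, function_vectors = model(emb)
print(logits.shape, function_vectors.shape)      # torch.Size([8]) torch.Size([8, 256])
```

The pooled function vectors can then serve as feature sets for classifiers such as random forests or support vector machines, as described in the abstract.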

36 pages, 2499 KiB  
Article
Local Data Debiasing for Fairness Based on Generative Adversarial Training
by Ulrich Aïvodji, François Bidet, Sébastien Gambs, Rosin Claude Ngueveu and Alain Tapp
Algorithms 2021, 14(3), 87; https://doi.org/10.3390/a14030087 - 14 Mar 2021
Cited by 4 | Viewed by 6126
Abstract
The widespread use of automated decision processes in many areas of our society raises serious ethical issues with respect to the fairness of the process and the possible resulting discrimination. To solve this issue, we propose a novel adversarial training approach called GANSan for learning a sanitizer whose objective is to prevent the possibility of any discrimination (i.e., direct and indirect) based on a sensitive attribute by removing the attribute itself as well as the existing correlations with the remaining attributes. Our method GANSan is partially inspired by the powerful framework of generative adversarial networks (in particular Cycle-GANs), which offers a flexible way to learn a distribution empirically or to translate between two different distributions. In contrast to prior work, one of the strengths of our approach is that the sanitization is performed in the same space as the original data by only modifying the other attributes as little as possible, thus preserving the interpretability of the sanitized data. Consequently, once the sanitizer is trained, it can be applied to new data locally by an individual on their profile before releasing it. Finally, experiments on real datasets demonstrate the effectiveness of the approach as well as the achievable trade-off between fairness and utility. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
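
A heavily simplified sketch of the adversarial sanitization idea (not GANSan itself, which is Cycle-GAN-inspired and more elaborate): a sanitizer is trained to keep the data close to the original while fooling an adversary that tries to recover the sensitive attribute. All sizes and data below are invented.

```python
import torch
import torch.nn as nn

d = 10                                            # number of non-sensitive attributes
sanitizer = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, d))
adversary = nn.Sequential(nn.Linear(d, 32), nn.ReLU(), nn.Linear(32, 1))

opt_s = torch.optim.Adam(sanitizer.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce, mse, lam = nn.BCEWithLogitsLoss(), nn.MSELoss(), 1.0

X = torch.randn(512, d)
s = (X[:, 0] > 0).float()                         # toy sensitive attribute leaking into X

for step in range(200):
    # 1) Train the adversary to recover the sensitive attribute from sanitized data.
    opt_a.zero_grad()
    bce(adversary(sanitizer(X).detach()).squeeze(-1), s).backward()
    opt_a.step()
    # 2) Train the sanitizer to stay close to the data while fooling the adversary.
    opt_s.zero_grad()
    X_clean = sanitizer(X)
    loss_s = mse(X_clean, X) - lam * bce(adversary(X_clean).squeeze(-1), s)
    loss_s.backward()
    opt_s.step()
```

Because the sanitized output lives in the same space as the original attributes, it stays interpretable and can be applied locally to a profile before release, which is the property the paper emphasizes.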

18 pages, 2509 KiB  
Article
Detection of Representative Variables in Complex Systems with Interpretable Rules Using Core-Clusters
by Camille Champion, Anne-Claire Brunet, Rémy Burcelin, Jean-Michel Loubes and Laurent Risser
Algorithms 2021, 14(2), 66; https://doi.org/10.3390/a14020066 - 22 Feb 2021
Cited by 1 | Viewed by 2167
Abstract
In this paper, we present a new framework dedicated to the robust detection of representative variables in high-dimensional spaces with a potentially limited number of observations. Representative variables are selected by using an original regularization strategy: they are the centers of specific variable clusters, denoted CORE-clusters, which respect fully interpretable constraints. Each CORE-cluster indeed contains more than a predefined number of variables, and each pair of its variables has a coherent behavior in the observed data. The key advantage of our regularization strategy is therefore that it only requires tuning two intuitive parameters: the minimal dimension of the CORE-clusters and the minimum level of similarity that gathers their variables. Interpreting the role played by a selected representative variable is additionally straightforward, as it has a similar observed behaviour to a controlled number of other variables. After introducing and justifying this variable selection formalism, we propose two algorithmic strategies to detect the CORE-clusters, one of them scaling particularly well to high-dimensional data. Results obtained on synthetic as well as real data are finally presented. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
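
An illustrative greedy sketch of the CORE-cluster constraints, assuming variables are grouped when all pairwise absolute correlations exceed a threshold and only groups of at least a minimal size are kept; the paper's two algorithms are more sophisticated, so treat this only as a reading aid.

```python
import numpy as np

def core_clusters(X, tau=0.7, min_size=5):
    """Greedy illustration: grow groups of variables whose pairwise absolute
    correlations all exceed tau, keep groups with at least min_size variables,
    and return the most central variable of each group as its representative."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    unassigned = set(range(X.shape[1]))
    clusters = []
    while unassigned:
        seed = max(unassigned, key=lambda v: corr[v, list(unassigned)].sum())
        members = [seed]
        for v in sorted(unassigned - {seed}):
            if all(corr[v, m] >= tau for m in members):
                members.append(v)
        unassigned -= set(members)
        if len(members) >= min_size:
            rep = max(members, key=lambda v: sum(corr[v, m] for m in members))
            clusters.append((rep, members))
    return clusters

# Toy data: 12 variables driven by two latent factors plus noise.
rng = np.random.default_rng(7)
latent = rng.normal(size=(300, 2))
X = np.column_stack([latent[:, i // 6] + 0.3 * rng.normal(size=300) for i in range(12)])
for rep, members in core_clusters(X, tau=0.7, min_size=5):
    print("representative:", rep, "cluster:", members)
```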

16 pages, 929 KiB  
Article
Groundwater Prediction Using Machine-Learning Tools
by Eslam A. Hussein, Christopher Thron, Mehrdad Ghaziasgar, Antoine Bagula and Mattia Vaccari
Algorithms 2020, 13(11), 300; https://doi.org/10.3390/a13110300 - 17 Nov 2020
Cited by 52 | Viewed by 8785
Abstract
Predicting groundwater availability is important to water sustainability and drought mitigation. Machine-learning tools have the potential to improve groundwater prediction, thus enabling resource planners to: (1) anticipate water quality in unsampled areas or depth zones; (2) design targeted monitoring programs; (3) inform groundwater protection strategies; and (4) evaluate the sustainability of groundwater sources of drinking water. This paper proposes a machine-learning approach to groundwater prediction with the following characteristics: (i) the use of a regression-based approach to predict full groundwater images based on sequences of monthly groundwater maps; (ii) strategic automatic feature selection (both local and global features) using extreme gradient boosting; and (iii) the use of a multiplicity of machine-learning techniques (extreme gradient boosting, multivariate linear regression, random forests, multilayer perceptron and support vector regression). Of these techniques, support vector regression consistently performed best in terms of minimizing root mean square error and mean absolute error. Furthermore, including a global feature obtained from a Gaussian Mixture Model produced models with lower error than the best which could be obtained with local geographical features. Full article
(This article belongs to the Special Issue Interpretability, Accountability and Robustness in Machine Learning)
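
A short sketch of the described pipeline on synthetic data: local features are augmented with a global feature taken from a Gaussian Mixture Model and fed to support vector regression; the variables and hyperparameters are invented for illustration.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error
from sklearn.mixture import GaussianMixture
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR

rng = np.random.default_rng(8)
n = 1500
local = rng.normal(size=(n, 4))                        # e.g., lagged local groundwater levels
regime = rng.integers(0, 3, size=n)                    # hidden hydrological regime
coords = rng.normal(size=(n, 2)) + 3.0 * regime[:, None]   # locations clustered by regime
y = local @ np.array([0.5, -0.3, 0.2, 0.1]) + regime + 0.1 * rng.normal(size=n)

# Global feature: posterior probabilities from a Gaussian Mixture Model on the coordinates.
gmm = GaussianMixture(n_components=3, random_state=0).fit(coords)
X = np.hstack([local, coords, gmm.predict_proba(coords)])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)
svr = SVR(kernel="rbf", C=10.0).fit(X_tr, y_tr)
print("MAE:", round(mean_absolute_error(y_te, svr.predict(X_te)), 3))
```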
