# Statistical Analysis of Chemical Element Compositions in Food Science: Problems and Possibilities


## Abstract


## 1. Introduction

## 2. Results

## 3. Discussion

## 4. Materials and Methods

#### 4.1. Mineral Element Data of Honey Samples

#### 4.2. Stable Isotope Ratio and Trace Element Concentration Data of Saffron Samples

## 5. Data Analysis

#### 5.1. Non-Compositional Standardization of Variables

#### 5.2. Non-Compositional Standardization of Observations

#### 5.3. Non-Compositional Transformation

#### 5.4. Compositional Analysis

#### 5.5. Standardization and Transformation by Means of Log-Ratios

#### 5.6. Replacement of Missing Values and Non-Detects

- **const:** Any rounded zero is replaced by a **constant** value of $0.1$. Note that imputing rounded zeros in this way is not a good strategy. However, this method serves as a benchmark, among other things.
- **dl23:** This comparably simple method also replaces all zeros with a constant value smaller than the detection limit, namely **two-thirds of the detection limit**. Martín-Fernández et al. [58] found that a replacement value of this order minimizes the distortion in the covariance structure.
- **unif:** A zero in a variable ${\mathbf{x}}_{j}$ is replaced by a random **uniform** draw from the interval $[0.1 \cdot \min({\mathbf{x}}_{j}^{(+)});\ 0.9 \cdot \min({\mathbf{x}}_{j}^{(+)})]$, where ${\mathbf{x}}_{j}^{(+)}$ denotes the positive values of variable $j$, so that $\min({\mathbf{x}}_{j}^{(+)})$ is its smallest positive value. This prevents a zero from being imputed too close to 0 and ensures imputation below an unknown detection limit.
- **bdls_pls:** (**b**elow-**d**etection-**l**imit using (censored) **p**artial **l**east **s**quares regression) A zero is replaced by means of an iterative EM algorithm based on censored partial least squares estimation on sequential log-ratio coordinate representations. For details, see [40].
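The three simple replacement rules (const, dl23, unif) can be sketched in a few lines. This is a minimal illustration only, not the implementation used for the results: the function names, the example values, and the detection limit of 0.9 are hypothetical, and bdls_pls is omitted because it requires the model-based procedure described in [40].

```python
import random


def replace_const(column, value=0.1):
    """const: replace every zero with a fixed constant (0.1 by default)."""
    return [value if x == 0 else x for x in column]


def replace_dl23(column, detection_limit):
    """dl23: replace every zero with two-thirds of the detection limit."""
    fill = 2.0 / 3.0 * detection_limit
    return [fill if x == 0 else x for x in column]


def replace_unif(column, rng=None):
    """unif: replace each zero with a uniform draw from [0.1 * m, 0.9 * m],
    where m is the smallest positive value observed in the column."""
    rng = rng or random.Random()
    m = min(x for x in column if x > 0)
    return [rng.uniform(0.1 * m, 0.9 * m) if x == 0 else x for x in column]


# Hypothetical concentrations of one element; zeros mark non-detects.
x = [0.0, 2.5, 0.0, 1.2, 4.8]
print(replace_const(x))
print(replace_dl23(x, detection_limit=0.9))  # assumed detection limit
print(replace_unif(x, rng=random.Random(1)))  # seeded for reproducibility
```

Each function works on a single variable (column), which matches the description of unif, where the interval depends on the smallest positive value of that variable.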

#### 5.7. Principal Component Analysis

#### 5.8. Classification

- zeros replaced with const, dl23, unif, and bdls_pls (see Section 5.6).
- no transformation, standardization, log-transformation, log-transformation followed by standardization, rescaling by closure, or representation in pivot coordinates or centered log-ratio coordinates.
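The compositional options in this list can be illustrated for a single observation. The sketch below uses the standard textbook definitions of closure, centered log-ratio (clr) coordinates, and pivot (isometric log-ratio) coordinates. The function names and the example composition are hypothetical, and the input is assumed to be strictly positive, that is, zeros have already been replaced.

```python
import math


def closure(x, total=1.0):
    """Rescale a composition so that its parts sum to `total`."""
    s = sum(x)
    return [total * v / s for v in x]


def clr(x):
    """Centered log-ratio coordinates: log of each part minus the mean log."""
    logs = [math.log(v) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def pivot_coordinates(x):
    """Pivot coordinates: each part against the geometric mean of the
    parts that follow it, scaled to give an orthonormal representation."""
    logs = [math.log(v) for v in x]
    coords = []
    for j in range(len(logs) - 1):
        rest = logs[j + 1:]
        r = len(rest)  # number of remaining parts
        coords.append(math.sqrt(r / (r + 1)) * (logs[j] - sum(rest) / r))
    return coords


# Hypothetical element concentrations of one sample (all positive).
sample = [12.0, 3.0, 0.6, 45.0]
print(closure(sample))
print(clr(sample))
print(pivot_coordinates(sample))
```

Both log-ratio representations are unchanged when the sample is multiplied by a constant, so closing the data beforehand does not affect them. A composition with $D$ parts yields $D$ clr coordinates that sum to zero and $D-1$ pivot coordinates.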

- 20% validation/80% training data,
- 3 layers, with 300 neurons in the first layer, followed by 128 and 64 neurons in the next two layers,
- 10% dropout in the first 2 layers,
- mean squared error as a loss function and mean absolute error as an evaluation metric, and
- 500 epochs, with training stopped early whenever 50 consecutive epochs do not improve the result.
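The stopping rule in the last item can be sketched independently of any neural network library. In the illustration below, the sequence of validation losses is a hypothetical stand-in for the values produced by a real training loop, and the function name is an assumption. Only the limits of 500 epochs and 50 epochs without improvement come from the settings above.

```python
import itertools


def epochs_until_stop(val_losses, max_epochs=500, patience=50):
    """Return (epochs run, best loss) under early stopping.

    Training ends after `max_epochs`, or earlier once `patience`
    consecutive epochs have failed to improve on the best loss so far.
    """
    best_loss = float("inf")
    epochs_since_best = 0
    epochs_run = 0
    for loss in itertools.islice(val_losses, max_epochs):
        epochs_run += 1
        if loss < best_loss:
            best_loss = loss
            epochs_since_best = 0
        else:
            epochs_since_best += 1
            if epochs_since_best >= patience:
                break
    return epochs_run, best_loss


# Hypothetical loss curve: improves for 10 epochs, then stagnates.
losses = [1.0 / (e + 1) for e in range(10)] + [0.2] * 490
print(epochs_until_stop(losses))
```

In practice, the weights from the best epoch would usually be kept rather than those from the final epoch. Whether that was done here is not stated in the settings above.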

## 6. Conclusions

## Author Contributions

## Funding

## Data Availability Statement

## Conflicts of Interest

## Sample Availability

**Figure 1.** Biplots obtained from honey samples (pure and adulterated). First two principal components represented by biplots of the PCA applied to (**A**) standardized data, (**B**) standardized and log-transformed data, (**C**) closed and standardized data, and (**D**) centered log-ratio coordinates. Abbreviations of the various types of honey: AC: Acacia, CA: Chaste, JU: Jujube, LD: Linden, SS: T. cochinchinensis, RP: Rape; sugar syrups: Sy; and adulterated honey categories: AAC (adulterated Acacia), ACA (adulterated Chaste), AJU (adulterated Jujube), ALD (adulterated Linden), ARP (adulterated Rape), ASS (adulterated T. cochinchinensis).

**Figure 2.** Explained variance (in %, cumulative) for different numbers of components and different pre-processing of the compositional honey samples. Abbreviations: clr: centered log-ratio coordinates, ilr: isometric log-ratio transformed data (i.e., pivot coordinates).

**Figure 3.** Biplots obtained from saffron samples originating from Iran and Spain. First two principal components represented by biplots of the PCA applied to (**A**) standardized data, (**B**) standardized and log-transformed data, (**C**) closed and standardized data, and (**D**) centered log-ratio coordinates.

**Figure 4.** Explained variance (in %, cumulative) for different numbers of components and different pre-processing of the compositional saffron samples. Abbreviations as for Figure 2.

**Figure 5.** Misclassification rates of various classification methods based on different pre-processing and replacement strategies applied to the honey samples. Abbreviations (for details, see Section 3): lda: linear discriminant analysis, KNN: k-nearest neighbor, ANN: artificial neural network; bdls: below detection limit using (censored) partial least squares regression, const: constant, dl23: two-thirds of the detection limit, unif: uniform; closed + stand: closed and standardized data, raw: raw, i.e., non-transformed, log: log-transformed, scale: scaled, ilr: isometric log-ratio transformed (i.e., pivot coordinates), clr: centered log-ratio coordinates.

