Article

Prediction of Blueberry (Vaccinium corymbosum L.) Yield Based on Artificial Intelligence Methods

by Gniewko Niedbała 1, Jarosław Kurek 2,*, Bartosz Świderski 2, Tomasz Wojciechowski 1, Izabella Antoniuk 2 and Krzysztof Bobran 3
1 Department of Biosystems Engineering, Faculty of Environmental and Mechanical Engineering, Poznań University of Life Sciences, Wojska Polskiego 50, 60-627 Poznań, Poland
2 Department of Artificial Intelligence, Institute of Information Technology, Warsaw University of Life Sciences, Nowoursynowska 159, 02-776 Warsaw, Poland
3 Seth Software sp. z o.o., Strefowa 1, 36-060 Głogów Małopolski, Poland
* Author to whom correspondence should be addressed.
Agriculture 2022, 12(12), 2089; https://doi.org/10.3390/agriculture12122089
Submission received: 27 October 2022 / Revised: 27 November 2022 / Accepted: 2 December 2022 / Published: 6 December 2022
(This article belongs to the Special Issue Digital Innovations in Agriculture)

Abstract

In this paper, we present a high-accuracy model for blueberry yield prediction, trained on structurally innovative data sets. Blueberries are flowering plants, valued for their antioxidant and anti-inflammatory properties. Yield on plantations depends on several factors, both internal and external. Accurately predicting the harvest volume is an important aspect of work planning and storage space selection. Machine learning algorithms are commonly used in such prediction tasks, since they are capable of finding correlations between the various factors at play. Data were collected from the years 2016–2021 and included agronomic, climatic and soil data as well as satellite-imaging vegetation data. Additionally, growing periods according to the BBCH scale and aggregates were taken into account. After extensive data preprocessing and the derivation of cumulative features, a total of 11 models were trained and evaluated. The classifiers were selected from state-of-the-art methods used in similar applications. To evaluate the results, the Mean Absolute Percentage Error (MAPE) was chosen. It is superior to alternatives, since it operates on absolute values, removing the risk that errors of opposite sign cancel out, while the final result expresses the percentage difference between the actual value and the prediction. In the research presented, the best-performing solution proved to be the Extreme Gradient Boosting (XGBoost) algorithm, with a MAPE of 12.48%. This result meets the requirements of practical applications, with sufficient accuracy to improve the overall yield management process. Due to the nature of the machine learning methodology, the presented solution can be further improved with annually collected data.

1. Introduction

The blueberry (Vaccinium corymbosum L.) is a flowering plant species in the genus Vaccinium within the heather family, most commonly found in Eurasia and North America [1]. Blueberries are valued for their antioxidant and anti-inflammatory properties and have neurocognitive benefits [2]. According to FAOSTAT data [3], in 2020, blueberry crop fields in the European Union covered an area of 27,630 hectares; Poland ranked first with a cultivated area of 9700 hectares and had an average yield of 57,010 hg/ha (8th place in the EU).
Increased yield, fruit quality and economic stability in blueberry production require paying close attention to the plantation and identifying the key features that affect plant growth and condition throughout the growing season. The yield of blueberry plants depends on several factors, both internal (genetic) and external (growing practices, stimulants, climate) [4]. These factors are most often correlated, but their interaction has not yet been studied in depth.
To predict yields, models are used for estimation during the growing season, immediately before harvest [5,6,7,8]. Knowledge of the predicted yield for a given year can support decision-making in work planning and the selection of storage space. It can also improve farm profitability and balance the amount of inputs used, such as fertilizers, crop protection products or water. Balanced consumption of these products reduces both the energy input on the farm and the input of human labour. Finally, a plantation can increase profitability due to lower production costs [9,10]. Yield prediction is also used as a tool when theoretical yields need to be estimated in agricultural damage assessments [11]. It should also be added that, in addition to predicting the yield quantity, predicting yield loss has potential practical applications, as shown in research conducted by Khan’s team [12], as does predicting yield quality, for example, in the form of fruit freshness prediction [13].
In the literature on fruit yield forecasting of orchard crops, some of the most important criteria for dividing yield determination methods include data nature, data type and data source. Thus, the first classification separates them into direct and indirect methods [10]. This classification describes direct methods as yield estimation methods and indirect methods as yield prediction methods, but this is not common nomenclature.
Direct methods are based on data from direct measurements of yield-forming generative organs. These may include measurements of the number of flowers, buds or fruits, their geometric dimensions and/or weight. They are carried out manually or automatically using various types of stationary ground platforms [14,15,16], mobile ground platforms [17,18,19] or aerial platforms [20,21,22,23].
In the case of indirect methods, yield prediction is performed by developing a predictive model with features indirectly related to yield as inputs. These traits can be divided into several categories, the most important of which are traits attributed to plants, climate, soil conditions and agrotechnical processes [10]. In the case of data attributed to climate, the most common include meteorological data on historical and current air and soil parameters, such as the amount of natural precipitation and solar activity, i.e., insolation or solar radiation [24]. The data used in yield forecasting that describe the soil environment refer both to slowly varying soil parameters, such as the texture of the surface or subsurface layer, and to medium- and short-term variables such as soil pH, organic matter (OM) content, salinity (EC), or the content of individual plant nutrients, both macro- and microelements [25,26]. Plant data that are input into indirect prediction models most often detail information about the growth status of plants or their organs in successive vegetation phases, expressed in the form of vegetation indices, the degree of plant compactness (canopy/biomass), and the time and rate of reaching characteristic developmental phases, e.g., flowering and fruit setting. For the most part, these data come from remote sensing (RS), satellites or UAVs [27,28,29,30].
Regardless of their type and nature, the data sources for prediction models can be information from measurements made locally at the site of plant growth and information interpolated from network measurements conducted over larger areas. Local data are increasingly being collected not only from individual measurement stations but also from a grid of sensors mounted on plantations using IoT devices [31]. From the perspective of access to databases for predictive models, it can be noted that the data are private, public and commercial [32]. The abundance of types, natures and sources of data used in predictive models in orchard crops makes it increasingly necessary to employ methods for managing large data sets, Big Data [33,34], in the phase of storing, processing, sharing and analyzing them.
Scientific papers on blueberry yield prediction have been published for two decades. However, regarding searches performed on the most common literature databases (WoS, SCOPUS) for the phrases “blueberry” and “yield prediction”, “blueberry” and “yield forecast”, or “blueberry” and “yield estimation”, only 12 and 10 publications from 2002 to 2022, respectively, can be obtained (Figure 1). Thus, it can be concluded that blueberry yield prediction is not a frequently addressed topic in science; in addition, most of the publications are from the United States, Canada, and China, which shows a large knowledge gap for the European region.
The available literature on blueberry yield prediction points to several basic topics pursued in this domain. First of all, it should be said that they concern both indirect and direct methods of yield forecasting, although indirect methods are less numerous. In the case of indirect methods, the main data sources for the models are satellite imaging data, ground-based imaging, fruit and leaf spectrometry, and meteorological data [4,11,35,36,37,38,39,40,41,42,43,44,45,46,47]. Interestingly, publications using agronomic data from the production process are not found.
Machine learning is the most widely used method for yield modeling due to the high accuracy of its predictions [10,48]. Obsie et al. [11] conducted a study evaluating the importance of bee species composition and weather factors in regulating wild blueberry agroecosystems. This study clarified how bee species composition and weather affect blueberry yield, and predicted the optimal bee species composition and weather conditions to achieve the best yield. For this purpose, computer simulation and machine learning algorithms were used as predictive tools, including multiple linear regression (MLR), boosted decision trees (BDT), random forest (RF) and extreme gradient boosting (XGBoost). Seireg et al. [47] employed a set of machine learning techniques to predict wild blueberry yields. For this purpose, they used stacking regression (SR) and cascading regression (CR) with a novel combination of machine learning algorithms. The authors used traits that indicated the best regulation of wild blueberry agroecosystems. A total of four feature engineering selection techniques were used, namely the variance inflation factor (VIF), sequential forward feature selection (SFFS), sequential backward feature selection (SBEFS) and XGBoost-based feature importance (XFI). In that study, Bayesian optimization was applied to popular MLAs to obtain the best hyperparameters for accurate wild blueberry yield prediction. A two-layer structure was used in SR: level-0 containing the light gradient boosting machine (LGBM), gradient boosting regression (GBR) and extreme gradient boosting (XGBoost), and level-1 providing the output prediction using Ridge. The CR topology uses the same MLAs as SR, but in a serial form that takes a new prediction as input into each MLA and removes the previous prediction at each stage. Finally, CR and SR were evaluated according to the root mean square error (RMSE) and the coefficient of determination (R²). MacEachern et al. [49] conducted research using deep convolutional neural networks for fruit maturity stage detection and yield prediction of wild blueberry. In that study, six artificial neural network models based on the YOLOv3, YOLOv3-SPP, YOLOv3-Tiny, YOLOv4, YOLOv4-Small and YOLOv4-Tiny architectures were analyzed. Both 3-class (green berries, red berries, blueberries) and 2-class (immature berries, mature berries) models were developed. The results proved that YOLOv4 performed best in terms of R² accuracy and received the highest F1 score. On the other hand, YOLOv4-Tiny performed best from the perspective of computational load. Only slight differences were detected in the accuracy of the yield prediction models using nonlinear regression, with YOLOv4-Small performing best with respect to RMSE.
The method presented in this paper uses new data sets that are innovative in their structure. To the authors’ best knowledge, the types of collected data, the overall diversity of parameters and the aggregation of BBCH phases have not been used in such a combination before. Taking all of this into account, the data preparation process and the use of cumulative features can provide additional insights, resulting in increased model accuracy.
The aim of our research was to develop 11 blueberry yield prediction models using state-of-the-art machine learning algorithms. In addition to selecting the best model for yield prediction, our research had three scientific objectives. First, we analyzed missing and outlier data, and performed normalization of empirical data also used to generate aggregates of predictive data. Second, we performed feature selection, discarding those features that might introduce noise and lead to a deterioration in yield prediction. Third, we determined MAPE prediction errors for each model (algorithm) and identified the best model for yield prediction. Our goal was to determine the most robust model for yield prediction by comparing traditional and state-of-the-art machine learning algorithms while using the minimum number of features to conduct analysis.

2. Materials and Methods

2.1. Dataset Description

The data used in this work came from a commercial blueberry plantation located in southeastern Poland; the cultivars grown were Chandler, Liberty and Nelson. The data covered six growing seasons, spanning the years 2016 to 2021, and contained several data types, i.e., agronomic data, climatic data, satellite vegetation data and soil data. Data were obtained from public databases as open data, as well as from private farmer databases and ERP vendor databases. Regardless of the type and source, data were obtained for the individual cultivation plots and subplots.
The final database structures were prepared and supplemented using the obtained dataset. During the research, data from the cultivation of highbush blueberry (Vaccinium corymbosum) were used. The full structure of the highbush blueberry dataset is shown in Table 1.

2.1.1. Agronomical Data

Agronomic data were obtained from the Plantator System by Seth Software, operating during production, with regard to the crop register, harvest registration and registration of hourly work results. The acquired data came in different formats, so in some cases it was necessary to pre-process the dataset, e.g., soil test results (PDF) or locations of crops (JPG). Agronomic data also come from private grower databases.

2.1.2. BBCH-Scale

The BBCH-scale is used to identify the phenological development stages of plants. BBCH-scales have been developed for a range of crop species where similar growth stages of each plant are given the same code.
Phenological development stages of plants are used in a number of scientific disciplines (crop physiology, phytopathology, entomology and plant breeding) and in the agriculture industry (risk assessment of pesticides, timing of pesticide application, fertilization, agricultural insurance). The BBCH-scale uses a decimal code system, which is divided into principal and secondary growth stages, and is based on the cereal code system (Zadoks scale) developed by Jan Zadoks.
The abbreviation BBCH derives from the names of the originally participating stakeholders: “Biologische Bundesanstalt, Bundessortenamt und CHemische Industrie”. Unofficially, the abbreviation is also said to represent the four companies that initially sponsored its development: Bayer, BASF, Ciba-Geigy, and Hoechst. The phenological development stages obtained from the producer were in the following ranges:
(1)
BBCH 0–60 (from dormancy to the beginning of flowering)
(2)
BBCH 61–70 (from the beginning of flowering to the end of flowering)
(3)
BBCH > 70 (from the beginning of fruit growth to harvest)
Based on the imported data, the BBCH phase limits in the ranges 0–60, 61–70 and >70 were assigned. These were later used to calculate aggregated data.

2.1.3. Numerical Features

A total of 89 potential explanatory variables were derived for the target. The target variable was defined as the total yield of the harvested crop (harvest) for the blueberries and was measured in kilograms [kg]. All numerical data (both explanatory and dependent variables) were aggregated to full years (2016, 2017, 2018, 2019, 2020 or 2021).
Since, after initial tests, the climatic data delivered by the producer proved insufficient, additional information was obtained from the public data available from the Institute of Meteorology and Water Management—National Research Institute (IMGW) weather stations. This additional data was stored under the following labels:
(1)
Air temperature (avg.) [°C]
(2)
Air temperature (max.) [°C]
(3)
Air temperature (min.) [°C]
(4)
Rainfall [mm]
(5)
Air relative humidity (avg.) [%]
(6)
Air relative humidity (max.) [%]
(7)
Air relative humidity (min.) [%]
(8)
Dew point temperature (avg.) [°C]
(9)
Dew point temperature (max.) [°C]
(10)
Dew point temperature (min.) [°C]
In relation to the total harvest obtained, two numerical features regarding the treatment of the blueberries were stored (treatment features):
(1)
Irrigation [kg]
(2)
Fertigation [l]
Depending on the season, data regarding the nutrient content of fertigation media and soil content were obtained from different certified soil testing laboratories, resulting in variability of the analyzed components and different units of measurement. A total of 14 numerical features regarding the soil parameters of blueberry were stored/extracted. The final list for fertigation media is given as follows:
(1)
pH
(2)
S-SO₄—sulfur [mg/L]
(3)
P—phosphorus [mg/L]
(4)
K—potassium [mg/L]
(5)
Ca—calcium [mg/L]
(6)
Mg—magnesium [mg/L]
(7)
Fe—iron [mg/L]
(8)
Zn—zinc [mg/L]
(9)
Mn—manganese [mg/L]
(10)
Cu—copper [mg/L]
(11)
B—boron [mg/L]
(12)
Cl—chlorine [mg/L]
(13)
Na—sodium [mg/L]
(14)
N-NO₃—nitrogen [mg/L]

2.1.4. BBCH Soil and Climate Features

Using the BBCH scale, additional potential predictive features in the form of aggregates were determined. After calculating these sets, a total of 30 numerical features regarding the soil and climate parameters of blueberry, based on the BBCH scale, were extracted.
(1)
Insolation (BBCH 0–60) [W/m²]
(2)
Insolation (BBCH 61–70) [W/m²]
(3)
Insolation (BBCH > 70) [W/m²]
(4)
Rainfall (BBCH 0–60) [mm]
(5)
Rainfall (BBCH 61–70) [mm]
(6)
Rainfall (BBCH > 70) [mm]
(7)
Irrigation (BBCH 0–60) [mm]
(8)
Irrigation (BBCH 61–70) [mm]
(9)
Irrigation (BBCH > 70) [mm]
(10)
Daily air temperature (avg.) (BBCH 0–60) [°C]
(11)
Daily air temperature (avg.) (BBCH 61–70) [°C]
(12)
Daily air temperature (avg.) (BBCH > 70) [°C]
(13)
Daily soil temperature (avg.) (BBCH 0–60) [°C]
(14)
Daily soil temperature (avg.) (BBCH 61–70) [°C]
(15)
Daily soil temperature (avg.) (BBCH > 70) [°C]
(16)
Soil pH (avg.) (BBCH 0–60)
(17)
Soil pH (avg.) (BBCH 61–70)
(18)
Soil pH (avg.) (BBCH > 70)
(19)
Soil humidity (avg.) (BBCH 0–60) [%]
(20)
Soil humidity (avg.) (BBCH 61–70) [%]
(21)
Soil humidity (avg.) (BBCH > 70) [%]
(22)
Soil P—phosphorus (avg.) (BBCH 0–60) [mg/L]
(23)
Soil P—phosphorus (avg.) (BBCH 61–70) [mg/L]
(24)
Soil P—phosphorus (avg.) (BBCH > 70) [mg/L]
(25)
Soil Mg—magnesium (avg.) (BBCH 0–60) [mg/L]
(26)
Soil Mg—magnesium (avg.) (BBCH 61–70) [mg/L]
(27)
Soil Mg—magnesium (avg.) (BBCH > 70) [mg/L]
(28)
Soil K—potassium (avg.) (BBCH 0–60) [mg/L]
(29)
Soil K—potassium (avg.) (BBCH 61–70) [mg/L]
(30)
Soil K—potassium (avg.) (BBCH > 70) [mg/L]

2.1.5. Vegetation Features

Crop vegetation status data were derived from satellite remote observations. The primary image source was the image database of the European Copernicus Sentinel-2 mission, and the Google Earth Engine (GEE) platform was used as the image processing and vegetation index (VI) calculation tool. The VIs were:
(1)
EVI—Enhanced Vegetation Index
(2)
NDVI—Normalized Difference Vegetation Index
(3)
RDVI—Renormalized Difference Vegetation Index
(4)
SAVI—Soil-Adjusted Vegetation Index
The above-mentioned VIs are among the most widely used indices in the literature in yield predictions for orchard crops, and were calculated according to Index DataBase [50].
Finally, 16 vegetation features were calculated for the 4 VIs, with the final division containing minimum, mean, maximum and standard deviation groups. The full list contains the following parameters:
(1)
EVI 40 days before harvest (max.)
(2)
EVI 40 days before harvest (avg.)
(3)
EVI 40 days before harvest (min.)
(4)
EVI 40 days before harvest (stddev.)
(5)
NDVI 40 days before harvest (max.)
(6)
NDVI 40 days before harvest (avg.)
(7)
NDVI 40 days before harvest (min.)
(8)
NDVI 40 days before harvest (stddev.)
(9)
RDVI 40 days before harvest (max.)
(10)
RDVI 40 days before harvest (avg.)
(11)
RDVI 40 days before harvest (min.)
(12)
RDVI 40 days before harvest (stddev.)
(13)
SAVI 40 days before harvest (max.)
(14)
SAVI 40 days before harvest (avg.)
(15)
SAVI 40 days before harvest (min.)
(16)
SAVI 40 days before harvest (stddev.)

2.1.6. Selyaninov Hydrothermal Coefficient

The study of climate variability is a topic of interest for many scientists across diverse fields, including hydrology, meteorology, farming, and forestry. All would like to determine as precisely as possible what climatic conditions will prevail in a given area in the future [51,52,53,54,55,56,57,58,59]. Greater computing power allows increasingly complex models to be analysed, which continue to reveal more factors affecting the environment. Therefore, the problem remains unresolved.
In accordance with various scenarios of climate change for Central Europe, the temperature increase will be accompanied by a very small increase in annual rainfall which will be redistributed over the year; an increase in the winter will be accompanied by a decline in the summer [60,61,62,63]. In this situation with poor retention possibilities and a simultaneous increase in evaporation, it should be expected that the amount of water that is useful for plants will be reduced during the growing season, with a possible depletion of post-winter stocks. As the authors of the works [64,65] indicate, one must not neglect the effect of the growing variance of precipitation and temperatures, which means that extreme situations that are unfavourable for plant production will occur more frequently.
One element to observe is the assessment of the amount of water present in a given area, especially at extreme values, i.e., floods and droughts. Different types of indicators are used to assess the severity of water shortages. One of these is the Selyaninov hydrothermal coefficient (HTC) [66], which assesses drought according to the following formula:
\( \mathrm{HTC} = \frac{10 \sum_{i=1}^{n} P_i}{\sum_{i=1}^{n} t_i} \)
where:
n—length of the period considered in days,
P_i—rainfall on the i-th day [mm],
t_i—average daily temperature on the i-th day [°C].
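A minimal Python sketch of this aggregate, assuming daily rainfall and temperature arrays already restricted to one BBCH window (the masking convention is an assumption about the pipeline):

import numpy as np

def htc(rainfall: np.ndarray, temperature: np.ndarray) -> float:
    """Selyaninov hydrothermal coefficient over one period (e.g., one BBCH phase).

    rainfall    -- daily precipitation totals [mm]
    temperature -- average daily temperatures [deg C]
    """
    return 10.0 * rainfall.sum() / temperature.sum()

# e.g., htc(rain[bbch_0_60], temp[bbch_0_60]) yields the HTC (BBCH 0-60) feature below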
Based on the above properties, three aggregated parameters were generated to be used as additional prediction features:
(1)
HTC (BBCH 0–60)
(2)
HTC (BBCH 61–70)
(3)
HTC (BBCH > 70)

2.1.7. GDD Features

In the absence of extreme conditions such as nonseasonal drought or disease, plants grow in a cumulative, step-wise manner that is strongly influenced by the ambient temperature. Growing degree days (GDD) take aspects of local weather into account and allow gardeners to predict (or, in greenhouses, even to control) the plants’ pace toward maturity.
Unless stressed by other environmental factors such as soil moisture, the development rate from emergence to maturity for many plants depends upon the daily air temperature. Because many developmental events of plants and insects depend on the accumulation of specific quantities of heat, it is possible to predict when these events should occur during a growing season regardless of differences in temperatures from year to year. Growing degree days (GDD) are defined as the number of temperature degrees above a certain threshold base temperature, which varies among crop species [67]. The base temperature is the temperature below which plant growth is zero. GDD are calculated each day as the maximum temperature plus the minimum temperature, divided by 2, minus the base temperature. GDD are accumulated by adding each day’s GDD contribution as the season progresses.
GDD can be used to: assess the suitability of a region for production of a particular crop; estimate the growth-stages of crops, weeds or even life stages of insects; predict maturity and cutting dates of forage crops; predict best timing of fertilizer or pesticide application; estimate the heat stress on crops; plan spacing of planting dates to produce separate harvest dates.
\( \mathrm{GDD} = \sum_{i=1}^{n} T_{\mathrm{avg}} \)
where:
GDD—Growing Degree Days [°C],
n—length of the period considered in days,
T_avg—average daily air temperature ≥ 0 [°C].
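A short Python sketch of the daily computation described above, with base temperature 0 °C as in the formula (the clipping reflects the ≥ 0 condition):

import numpy as np

def gdd(t_max: np.ndarray, t_min: np.ndarray, t_base: float = 0.0) -> float:
    """Accumulate growing degree days from daily max/min air temperatures."""
    t_avg = (t_max + t_min) / 2.0 - t_base          # each day's GDD contribution
    return float(np.clip(t_avg, 0.0, None).sum())   # days below base contribute 0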
Similarly as with HTC, three aggregated prediction features were generated based on this parameter:
(1)
GDD (BBCH 0–60)
(2)
GDD (BBCH 61–70)
(3)
GDD (BBCH > 70)

2.1.8. Aggregates Based on Mineral Fertilization and Fertigation

Depending on the amount of fertilization and the structure of the fertilizers and fertigation media used, the results can vary. A total of nine parameters based on these features were generated:
(1)
Fertilization (BBCH 0–60) [kg]
(2)
Fertilization (BBCH 61–70) [kg]
(3)
Fertilization (BBCH > 70) [kg]
(4)
Fertigation (BBCH 0–60) [l]
(5)
Fertigation (BBCH 61–70) [l]
(6)
Fertigation (BBCH > 70) [l]
(7)
K—potassium-Fertilization (annually) [kg]
(8)
N—nitrogen-Fertilization (annually) [kg]
(9)
P—phosphorus-Fertilization (annually) [kg]

2.1.9. Harmful Features

Apart from the features describing factors positively influencing the final harvest, two additional ones were derived. These features concerned randomly occurring harmful conditions that had a significant impact on the obtained crop:
(1)
hailstorm percentage of damage [%]
(2)
hailstorm cut fruit [%]

2.1.10. Features Summary

A total of 89 potential predictive features were obtained during initial data processing. A summary of all raw data groups is presented in Table 2 and Table 3.

2.2. Data Preprocessing Methods

Since the obtained data were not represented in an easy-to-use and coherent form, they could not be fed into the prepared algorithms as-is. In order to prepare the dataset for further use, a series of operations was performed to adjust it to the requirements of the methodologies used. These operations included data normalization and establishing methods for dealing with missing values.

2.2.1. Data Normalization

Since the original dataset was heterogeneous in nature, the data needed to be normalized before the learning process.
For the prognostic features, in relation to the declared area (areadeclared) variable, the following features needed to be rescaled to 1 ha values:
(1)
Crop/Harvest
(2)
Irrigation
(3)
Fertigation
Furthermore, since most AI algorithms work best when values lie in the <0, 1> range, all prognostic features were normalized to fit into this range using the following equation:
\( z_i = \frac{x_i - \min(x)}{\max(x) - \min(x)} \)
where:
z_i—i-th normalized value in the feature vector,
x_i—i-th value in the feature vector,
min(x)—minimum value in the feature vector,
max(x)—maximum value in the feature vector.
Additionally, the explained variable (crop/harvest) was rescaled using the harmful conditions, lowering the values of the crop/harvest variable by the percentage of occurrence of harmful conditions (if any), such as:
(1)
hailstorm percentage of damage [%]
(2)
hailstorm cut fruit [%]
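A minimal pandas sketch of this preprocessing, assuming a DataFrame whose column names (areadeclared, harvest, the hailstorm columns) follow the feature list above; the direction of the damage adjustment is one possible reading of the description:

import pandas as pd

def preprocess(df: pd.DataFrame) -> pd.DataFrame:
    # rescale to per-hectare values using the declared area
    for col in ["harvest", "irrigation", "fertigation"]:
        df[col] = df[col] / df["areadeclared"]
    # min-max normalization of all prognostic features to <0, 1>
    features = [c for c in df.columns if c != "harvest"]
    df[features] = (df[features] - df[features].min()) / (
        df[features].max() - df[features].min())
    # lower the explained variable by the % of harmful conditions, if any
    damage = df[["hailstorm_damage", "hailstorm_cut_fruit"]].fillna(0).sum(axis=1)
    df["harvest"] = df["harvest"] * (1.0 - damage / 100.0)
    return df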

2.2.2. Finding and Replacing Missing Values

One of the mandatory analysis procedures for large datasets is finding and handling missing data. It is necessary to search for missing values of each variable. When they occur, such a data record is either discarded or the missing value is replaced by a different one. The main problem is determining the best way to assign missing values in the second case. There are several well-known statistical methods widely used to fill in missing data:
(1)
Imputation Using (Mean/Median) Values
(2)
Imputation Using (Most Frequent) or (Zero/Constant) Values
(3)
Stochastic Regression Imputation
(4)
Extrapolation and Interpolation
(5)
Imputation Using k-NN
(6)
Imputation Using XGBoost
(7)
Others
Currently, the most effective method of replacing missing data is predicting all missing values with the XGBoost (Extreme Gradient Boosting) method. The full list of features with missing values in the current dataset is given as follows:
(1)
S-SO₄—sulfur
(2)
Cl—chlorine
(3)
Fertigation (BBCH 0–60)
(4)
Fertilization (BBCH 0–60)
(5)
Hailstorm percentage of damage
(6)
Hailstorm cut fruit
A full summary of missing data percentages for each feature is presented in Table 4. In the presented approach, the XGBoost method was used for the first four features [68]. In the case of the remaining two features, due to their different nature, it was determined that the best approach would be to replace the missing values with zero.

Extreme Gradient Boosting-XGBoost

XGBoost (Extreme Gradient Boosting) is a model first proposed by Tianqi Chen and Carlos Guestrin and continuously optimized and improved in follow-up studies performed by different scientists (Chen and Guestrin, 2016). The model is a learning framework based on boosting tree models.
The traditional boosting tree model uses only first-derivative information. When training the n-th tree, it is difficult to implement distributed training because the residual of the former n−1 trees is used. XGBoost performs a second-order Taylor expansion of the loss function and can automatically use CPU multithreading for parallel computing. Additionally, XGBoost uses a variety of methods to avoid overfitting. This approach is much more precise than simple statistical methods such as using the mean, median or modal value to replace missing values. The XGBoost algorithm used in this paper is outlined in Algorithm 1 (Chen and Guestrin, 2016).
Algorithm 1: Algorithm of missing value imputation using XGBoost
procedure FillMissingByXGBoost(AllFeatures, feature1)
    fields ← AllFeatures \ feature1
    X ← fields
    Y ← feature1
    indicator ← isnotnan(Y)
    X_train ← X[indicator]
    y_train ← Y[indicator]
    xgbr ← xgb.XGBRegressor()
    xgbr.fit(X_train, y_train)
    y_est ← xgbr.predict(X)
    y_est[indicator] ← Y[indicator]    ▹ do not touch real values of the feature
    AllFeatures[feature1] ← y_est
    return AllFeatures
end procedure
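A runnable Python equivalent of Algorithm 1, assuming a pandas DataFrame and the xgboost package (XGBoost itself tolerates remaining NaNs in the predictor columns):

import pandas as pd
import xgboost as xgb

def fill_missing_by_xgboost(all_features: pd.DataFrame, feature: str) -> pd.DataFrame:
    X = all_features.drop(columns=[feature])   # all remaining features as predictors
    y = all_features[feature]
    indicator = y.notna()                      # rows where the true value exists
    model = xgb.XGBRegressor()
    model.fit(X[indicator], y[indicator])      # train only on observed rows
    y_est = pd.Series(model.predict(X), index=y.index)
    y_est[indicator] = y[indicator]            # do not touch real values
    all_features[feature] = y_est
    return all_features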

2.3. Feature Generation Using PCA (Principal Component Analysis) Method

As with the Pearson and Chi-square selection, for the set of all features and for the features distinguished by the stepwise fit method, it was checked whether generating artificial variables (reducing multidimensionality with the PCA method) would increase the effectiveness of the model. In that case, PCA was generated for 3, 4, 5, 6, 7, 8, 9 and 10 principal components. As a result, the models have 3 to 10 (artificial) variables, each of which is an artificial trait formed as a linear combination of the already existing traits.
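A scikit-learn sketch of this step; the feature matrix X here is a random placeholder standing in for the selected prognostic features:

import numpy as np
from sklearn.decomposition import PCA

X = np.random.default_rng(0).random((243, 40))   # placeholder feature matrix

for n_components in range(3, 11):                # 3 to 10 principal components
    pca = PCA(n_components=n_components)
    X_artificial = pca.fit_transform(X)          # artificial variables
    explained = pca.explained_variance_ratio_.sum()
    print(f"{n_components} components explain {explained:.2%} of the variance")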

2.4. Outlier Detection

An outlier is an observation whose values are relatively distant from the other elements of the sample [51]. In other words, it has an atypical value of the independent (explanatory) variable, or atypical values of both the dependent (explained) and explanatory variables (in multiple regression analysis). This means that the relationship between Xi and Yi for such an observation may be different than for the rest of the observations in the dataset.
Outliers may reflect the actual distribution or be the result of chance, but may also indicate a measurement error or a mistake in entering information into the database, etc. A large number of outliers may also be a signal informing that the wrong model was chosen.
Outliers resulting from data errors make the analysis difficult and, in extreme cases, impossible. Methods and coefficients based on the assumption of normal distribution and linear dependencies, such as Pearson’s correlation, linear regression, classical correspondence analysis, etc., are particularly sensitive to them. A single outlier can completely change the value and sign of the correlation, even from 0.99 to −0.99.
It is therefore necessary to either remove outliers or use robust statistical methods, e.g., rank methods. In our paper, we have decided to eliminate records that are stated as outliers. A comparison of two different methods applied to this problem is shown in Figure 2.

2.4.1. Local Outlier Factor (LOF)

Local Outlier Factor (LOF) is an algorithm that computes a score reflecting the degree of abnormality of each observation. In order to find samples that have a significantly lower density than their neighbours, the local density deviation is measured for a given data point.
The k-nearest neighbours are used to obtain the local density, where the LOF score of a single point is equal to the ratio of the average local density of its k neighbouring points to the local density of the point itself. For a normal data point, these densities should be similar, while abnormal data will have a smaller local density.
One difficulty here is the appropriate choice of the parameter k—the number of neighbours to consider. This parameter would usually be either higher than the minimum number of objects a cluster has to contain (so that other objects can be local outliers relative to that cluster), or smaller than the maximum number of close objects that can potentially be local outliers. Such information is usually not readily available and might require some testing. In general, considering 20 points works in most cases, while sets with a higher share of outliers (i.e., more than 10%) might require higher values.
The main advantage of this method is that it takes both local and global dataset properties into consideration. It can perform well in datasets where abnormal samples have different underlying densities. Instead of simply checking how isolated a sample is, the method also considers how the sample relates to the surrounding samples.
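A scikit-learn sketch of LOF-based record removal with the 20-neighbour rule of thumb mentioned above (X is again a placeholder matrix):

import numpy as np
from sklearn.neighbors import LocalOutlierFactor

X = np.random.default_rng(0).random((243, 40))   # placeholder feature matrix
lof = LocalOutlierFactor(n_neighbors=20)
labels = lof.fit_predict(X)       # 1 = inlier, -1 = outlier
X_clean = X[labels == 1]          # keep only records not flagged as outliers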

2.4.2. Unsupervised Outlier Detection Based on OneClassSVM

The One-Class SVM was introduced by Schölkopf et al. for this purpose and is implemented in the Support Vector Machines module as the svm.OneClassSVM object. It requires the choice of a kernel and a scalar parameter to define a frontier. The RBF kernel is usually chosen, although no exact formula or algorithm exists for setting its bandwidth parameter; this kernel is the default in the scikit-learn implementation. The nu parameter, also known as the margin of the One-Class SVM, corresponds to the probability of finding a new, but regular, observation outside the frontier.
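The corresponding scikit-learn call (the nu value here is an illustrative assumption):

import numpy as np
from sklearn.svm import OneClassSVM

X = np.random.default_rng(0).random((243, 40))   # placeholder feature matrix
ocsvm = OneClassSVM(kernel="rbf", nu=0.05)       # nu bounds the expected outlier share
labels = ocsvm.fit_predict(X)                    # 1 = inlier, -1 = outlier
X_clean = X[labels == 1]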
All the above methods were used in order to generate the best possible feature set. During the research process, the influence of each method was checked, and the final solution uses the best combination achieved.

2.5. Feature Selection

After the initial dataset preparation, the next crucial step from the machine learning point of view was feature selection. Finding and using the best possible features, i.e., those that carry the most information, can significantly improve overall algorithm performance.

2.5.1. Stepwise Regression

Stepwise regression is a method of adding features to and removing them from a multilinear model based on their statistical significance. The method begins with an initial model and then modifies it step by step by adding or removing features. At each step, the p-value of an F-statistic is computed to test models with and without the potential feature. If the feature is not currently in the model, the null hypothesis is that the feature would have a coefficient of zero if it were added to the model. If there is sufficient evidence to reject the null hypothesis, the feature is added to the model. Conversely, if the feature is currently in the model, the null hypothesis is that the feature has a coefficient of zero. If there is insufficient evidence to reject the null hypothesis, the feature is removed from the model. The method works as follows:
(1)
Fit the initial model.
(2)
If any features not in the model have a p-value less than the input tolerance (e.g., 0.05), add the one with the smallest p-value and repeat this step. For example, suppose the initial model is the default model and the input tolerance is 0.05. The algorithm first fits all models consisting of the constant plus one feature and looks for the feature with the smallest p-value, for example, feature 4. If feature 4’s p-value is less than 0.05, feature 4 is added to the model. The algorithm then searches among all models consisting of the constant + feature 4 plus one further feature. If a feature not in the model has a p-value less than 0.05, the feature with the smallest p-value is added to the model and the process is repeated. When there are no further features to add to the model, the algorithm moves to step 3.
(3)
If any features in the model have a p-value greater than the output tolerance premove (e.g., 0.06), remove the one with the largest p-value and go to step 2; otherwise, the algorithm finishes and returns the resulting feature list.
In each step, the stepwisefit algorithm uses the least squares method to estimate the model coefficients. After a feature has been added to the model at an earlier stage, the algorithm may later remove it if it is no longer useful in conjunction with elements added afterwards. The method ends when no single step improves the model. However, there is no guarantee that the final model is optimal (best fit to the data); a different starting model or a different sequence of steps may lead to a better fit. In this sense, stepwise models are locally optimal, but not always globally best.
A total of 90 different combinations of input and output tolerance (denoted as penter and premove, respectively) were tested. Only unique features were taken into account in the final set.
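A simplified forward/backward sketch of such p-value-driven selection, assuming statsmodels; this is not the full MATLAB stepwisefit, and the tolerances are the example values above:

import pandas as pd
import statsmodels.api as sm

def stepwise_fit(X: pd.DataFrame, y: pd.Series, penter=0.05, premove=0.06):
    selected = []
    changed = True
    while changed:
        changed = False
        # forward step: add the candidate with the smallest p-value below penter
        pvals = {}
        for c in (c for c in X.columns if c not in selected):
            fit = sm.OLS(y, sm.add_constant(X[selected + [c]])).fit()
            pvals[c] = fit.pvalues[c]
        if pvals:
            best = min(pvals, key=pvals.get)
            if pvals[best] < penter:
                selected.append(best)
                changed = True
        # backward step: remove the feature with the largest p-value above premove
        if selected:
            fit = sm.OLS(y, sm.add_constant(X[selected])).fit()
            worst = fit.pvalues.drop("const").idxmax()
            if fit.pvalues[worst] > premove:
                selected.remove(worst)
                changed = True
    return selected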

2.5.2. Pearson’s Feature Selection Method

In the case of large feature sets, there is a high possibility that some features will be correlated. In order to avoid using such elements, the Pearson correlation method was applied both to the complete set of all features and to those distinguished by the stepwisefit method. In the presented approach, if the Pearson correlation was greater than 0.95, the variable was removed from the set of prognostic features.
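A pandas sketch of the 0.95 correlation cut-off:

import numpy as np
import pandas as pd

def drop_correlated(X: pd.DataFrame, threshold: float = 0.95) -> pd.DataFrame:
    corr = X.corr().abs()
    # inspect only the upper triangle so each pair is checked once
    upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
    to_drop = [c for c in upper.columns if (upper[c] > threshold).any()]
    return X.drop(columns=to_drop)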

2.5.3. Chi-square Feature Selection Method

As with the Pearson correlation approach, the Chi-square statistic was used via its p-value. For the initial set of features, a variable was added to the prognostic features if its Chi-square p-value was greater than 0.05; otherwise, the variable was blocked and not included in the final set.

2.6. Prediction Methods Applied

In machine learning, different prediction methods are available. Depending on the specific case, the type and amount of data used, as well as various other factors, each method can fit a given problem to a different degree. Therefore, during the research presented in this paper, various approaches were tested in order to extract the best methodology. A summary of the applied models is presented in Table 5.

2.6.1. Linear Regression

In the case of statistical approaches, linear regression is used for modelling the relationship between the result and one or more explanatory variables. In this case, ordinary least-squares linear regression was used. In general, this model fits a linear model with coefficients w = (w1, …, wp), minimizing the residual sum of squares between the predicted target and the actual target as observed in the original dataset [70].
The method returns the coefficient of determination of the final prediction, defined as \( R^2 = 1 - \frac{u}{v} \), where u is the residual sum of squares and v is the total sum of squares. A constant model that disregards the input features and always predicts the mean of the target would obtain an R² score of 0.0. The best possible score is 1.0, and the score can also be negative, because the model can be arbitrarily worse.
In general, linear regression obtains good results for problems with goals such as forecasting, prediction, error reduction or explaining variation in target variables. Because of this, and due to its general simplicity, it was chosen as one of the methods tested in the presented approach.

2.6.2. Ridge

Ridge regression can be used to estimate the coefficients of multiple-regression models in situations where the independent variables are highly correlated. Since the mean squared error of ridge estimators tends to be smaller than that of the least squares estimators, they provide more precise estimates of the parameters. Ridge regression addresses some of the problems of Ordinary Least Squares by imposing a penalty on the size of the coefficients with l2 regularization. If the l2-norm is used as the regularization and the linear least-squares method is used as the loss function, this approach solves the ridge regression model [71].

2.6.3. Lasso

The Lasso (least absolute shrinkage and selection operator) is a linear model that estimates sparse coefficients with l1 regularization. In general, this method amounts to a linear regression analysis that performs both variable selection and regularization to enhance the results in terms of both prediction accuracy and interpretability [72].

2.6.4. ElasticNet

Elastic-Net is a linear regression model trained with both l1- and l2-norm regularization of the coefficients. In general, this model linearly combines the penalties of the lasso and ridge methods, both of which are treated as special cases of this approach.
Such an approach is used to increase overall accuracy. For example, in the case of high-dimensional datasets (p features) with few examples (n), the Lasso method will select at most n variables before saturation. Furthermore, if groups of highly correlated variables exist, it tends to select only one variable from each group. In order to overcome these limitations, the elastic net method adds the quadratic part to the penalty (which, used alone, amounts to ridge regression) [73].

2.6.5. Random Forest Regressor

In general, a random forest (random decision forest) is a type of ensemble learning method that constructs multiple decision trees during training. It is commonly applied in classification and regression problems. In classification, the class selected by most of the trees is used as the output; in regression, the average prediction of the trees is returned. This approach corrects decision trees’ tendency to overfit the training set, and in general it outperforms individual decision trees, although its performance depends on the data characteristics.
A random forest regressor is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control overfitting. The sub-sample size is controlled with the max_samples parameter if bootstrap = True (the default); otherwise, the whole dataset is used to build each tree [74].

2.6.6. MLP Regressor

The MLP (Multi-layer Perceptron) regressor is a model which optimizes the squared error with either the LBFGS or the stochastic gradient descent method. It is a fully connected feed-forward artificial neural network. In general, it has at least three layers of nodes: an input layer, at least one hidden layer and an output layer. All nodes are neurons using a nonlinear activation function, except for the input nodes providing the initial data. Back-propagation is used for training. The main advantage is that this model can distinguish data that are not linearly separable.
Two versions of this model were used in the current approach, with hidden layer sizes of 10 and 100, respectively [75].

2.6.7. SGD Regressor

SGD (Stochastic Gradient Descent) is an iterative method that can be used for stochastic approximation of gradient descent optimization. In this case the actual gradient calculated from the dataset is replaced with its estimate obtained from randomly selected data subset. It is mainly used to reduce computation, especially in high-dimensional optimization problems.
In the method used in this approach, a linear model was fitted by minimizing a regularized empirical loss with SGD. The gradient of the loss is estimated one sample at a time, and the model is updated with a decreasing learning rate. The penalty added to the loss function acts as the regularizer, shrinking model parameters towards the zero vector, using the squared Euclidean norm (L2), the absolute norm (L1) or a combination of both (Elastic Net). If, because of the regularizer, a parameter update crosses 0.0, it is truncated to this value, which allows learning sparse models and achieving online feature selection [76].

2.6.8. SVR and NuSVR

SVR (Epsilon-Support Vector Regression) is an application of the Support Vector Machine to the regression problem. In general, this approach gives the user high flexibility in defining how much error is acceptable in a given model. The SVM then finds an appropriate line or hyperplane fitting the given set of data.
The objective function of SVR minimizes the L2 norm of the coefficient vector. The second parameter used to define the SVR is epsilon, the maximum acceptable error. The absolute error needs to be kept less than or equal to this specified margin, and the epsilon parameter can be tuned to obtain the desired model accuracy.
Sometimes, due to the data characteristics, not all points will fall inside the specified margin. In such cases, additional parameters are needed to account for errors larger than epsilon. This can be done with slack variables, which denote the deviation from the margin for any value that falls outside the specified epsilon. The C hyperparameter can additionally be tuned, with lower values indicating lower tolerance for values outside epsilon. The ideal situation would result in a simplified version, with no variables falling into that category and the C parameter equal to 0.0. One important aspect of this approach is finding the optimal value of the C hyperparameter (as low as possible, while at the same time ensuring that all points are addressed) [77].
Additionally, a modification of this approach was used: NuSVR (Nu Support Vector Regression) is an algorithm for solving regression problems that uses a nu parameter in place of the epsilon parameter of SVR [78].

3. Results and Discussion

The dataset was divided into three parts. Sets containing data from the years 2016, 2017 and 2018 were used as training data. Sets containing information from the years 2019 and 2020 were used as the validation set. Finally, data from the last year (2021) were used as the test set for all methods. The general outline of the numerical experiments performed in this paper is presented in Algorithm 2.
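A minimal sketch of the year-based split and the error measure, assuming a preprocessed pandas DataFrame with a year column and the target named harvest (both names hypothetical):

import numpy as np
import pandas as pd
import xgboost as xgb

def mape(y_true, y_pred):
    return 100.0 * np.mean(np.abs((y_true - y_pred) / y_true))

def evaluate_by_year(df: pd.DataFrame) -> float:
    train = df[df["year"].isin([2016, 2017, 2018])]
    valid = df[df["year"].isin([2019, 2020])]
    test = df[df["year"] == 2021]
    features = [c for c in df.columns if c not in ("year", "harvest")]
    model = xgb.XGBRegressor()
    model.fit(train[features], train["harvest"],
              eval_set=[(valid[features], valid["harvest"])], verbose=False)
    return mape(test["harvest"].to_numpy(), model.predict(test[features]))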
Table 6 presents the best results for each of the tested models. It turns out that each of the best results used stepwise regression. The table shows the penter and premove values that produced the best values in the set. As can be seen, in some cases the Pearson and Chi-squared correlation reduction was applied. In many cases, including the best one, multidimensionality reduction with the PCA method was used. For the best result, nine artificial variables obtained from the PCA algorithm were found. The overall best result was achieved for the XGBoost algorithm, with a MAPE equal to 12.48%.
Based on the research presented in this paper, models for blueberry yield prediction have been produced, and the best of these models will be implemented in a dedicated information system. The described research is funded within the framework of an R&D project [79], whose final product will be a marketable blueberry yield prediction service.
In this paper, we describe the results of the developed models for data that a farm producer or entrepreneur can obtain simply from their own databases or from public data. A perspective for the development of the prediction models is the use of further datasets extended by new types of data (e.g., terrestrial phenological imaging, soil data from mobile proximal sensors), as well as the same types of data at better quality (e.g., high- or very-high-resolution satellite imaging data). In addition, the good results of the blueberry yield forecasting models at the species level obtained so far are a good prospect for future research on the development of yield prediction models for selected blueberry varieties.
Figure 3 shows the contribution of each PCA component for the XGBoost model (artificial variable) in explaining the variance of the original variables. This makes it possible to determine how many components should be selected in order to achieve a satisfactory percentage of the information from the original variables (40 variables after the application of feature selection and before the application of the PCA transformation). For example, the first component F0 already contains 68.69% of the information from the original variables, the second component F1 over 10.69%, the third component F2 6.91%, the fourth F3 3.21%, the fifth F4 2.46%, etc. The overall feature importance, in descending order, is presented in Figure 4.
On the other hand, a better approach to selecting the number of PCA components, which best estimates how much information the process will extract from the original variables, is the so-called cumulative percentage of explained variance. The graph presented in Figure 5 shows that the first two components explain the original variables to the degree of 79.39%. The first three components achieve 86.30%, and the first 4 and 5 components reach 89.52% and 91.98%, respectively. The first 9 PCA components explain as much as 97.49% of the information from the set of 40 original features.
Figure 6 shows the error results for the 11 tested algorithms. The outcomes are presented for both the validation part and the test dataset. The first part of the experiment is used in the training process. After each trained epoch, the results obtained on the validation set are checked to avoid overfitting problems. For this reason, the validation part of the experiment is involved only during the training process and cannot be used as a criterion for selecting the model. On the other hand, the testing part of the experiment is independent of the training stage, and the most effective model should be selected based on the results obtained on it. For this reason, the models were sorted in increasing order of the MAPE error calculated on the test part.
The results of the research presented in this article indicate that modelling the yield of blueberries during the growing season is reasonable and brings promising application possibilities.
Predictive models are usually created on the basis of results collected during long-term field and orchard experiments. The amount of empirical data included in the modelling is a very important element, because too little data can lead to excessive forecast errors. A large amount of data collected over many years is highly desirable regardless of the chosen modelling method. It is widely accepted that when using machine learning methods, the amount of data should be as large as possible [34]. This increases the real chances of obtaining a low forecast error. In addition, the number of features considered in modelling with machine learning methods should also be large. Most often, from a few or a dozen up to 30 features are used in crop yield forecasting [5,48,80], which do not always fully capture the nature of plant growth and vegetation. However, datasets that are too large (in terms of the amount of data and the number of traits) can lead to inappropriate model operation due to cross-correlations existing between the analyzed features. In such cases, the result can be an increased prediction error and a long computation time.
Algorithm 2: Algorithm of numerical experiments
In this work, we analyzed three varieties of blueberry (Chandler, Liberty, Nelson), for which data were collected from six growing seasons spanning 2016 to 2021. Blueberries were grown on an area of 393.34 hectares on 243 subplots. A total of 89 explanatory characteristics were analyzed in the form of meteorological data, irrigation, fertigation and plant fertilization information, soil data, vegetation indices from satellite imagery and time intervals. Some of the data were aggregated into indices in the form of the Selyaninov hydrothermal coefficient and the sum of effective temperatures (GDD). In addition, some of the collected data were aggregated to growing seasons according to the BBCH scale.
Obsie et al. [11] investigated the effects of the spatial distribution of blueberry plants, the species composition of bees in the field, and weather conditions on yield. In their study, they considered three groups of parameters: (1) the average size of blueberry clones within the field; (2) the foraging density of each group of bee taxa; and (3) weather information from 121 to 181 days from the beginning of the calendar year (temperature, precipitation and wind speed). The dataset consisted of 13 features, and 777 records were analyzed, for which 77,700 simulations were performed. MLR, boosted decision tree (BDT), random forest and XGBoost algorithms were used to predict blueberry yields. The XGBoost algorithm achieved the best results for yield prediction, with an R² of 0.938 and an RRMSE of 5.444%.
In our study, the best algorithm was XGBoost, for which we obtained a MAPE error of 12.48%. Relating our yield forecast to the results of Obsie et al. [11], it is notable that in both cases the XGBoost algorithm was found to be the best.
Seireg et al. [47] predicted blueberry yield using the mixed machine learning techniques LGBM, GBR, XGBoost and Ridge. The data used to create the blueberry yield models came from 30 years and included meteorological information (temperature and precipitation). In total, five datasets (M1–M5) were used for the calculations, covering a different number of features (7–10) depending on the feature selection method adopted. The best yield prediction results were obtained for seven traits using the stacking technique with a combination of the LGBM, GBR, XGBoost and Ridge algorithms. The model achieved an R² of 0.984 and an RMSE of 179.898.
MacEachern et al. [49] collected two years of imagery data from 54 points located in 4 blueberry production fields. After the images were taken, all of the blueberry fruit was harvested with a hand rake, which made it possible to determine the yield at each study point. A total of 17,280 images were accepted for analysis, from which 6766 images were randomly selected for labelling. Overall, six deep neural networks (DNNs), YOLOv3 (3) and YOLOv4 (3), in different configurations were used for the analysis. The best model was YOLOv4-Small, which achieved an R² of 89.67 and an absolute average error of 24.1%.
The parameter that determines the quality of the forecasts is the mean absolute percentage error (MAPE). It is most often interpreted as the average percentage deviation between the forecast value and the actual realization. Peng et al. [81] provide threshold values for the correct assessment of the MAPE index: if the error is less than 10%, the goodness of the model is ideal; a range of 10 to 20% indicates a good fit; and 20 to 30% indicates an acceptable level. A MAPE above 30% indicates low model accuracy and disqualifies the model from practical use. In individual studies, MAPE reached low values.
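For reference, MAPE follows the standard definition:
\( \mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right| \)
where y_i is the actual yield, \( \hat{y}_i \) is the predicted yield and n is the number of observations.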
In our study, we obtained a MAPE error of 12.48% for XGBoost. Most of the other algorithms did not exceed 20% MAPE, with the exception of ElasticNet, NuSVR, MLP(100) and MLP(10). This indicates a good model fit. The variables chosen for this model using the stepwise fit method were as follows:
(1) Fertigation
(2) Hailstorm percentage of damage
(3) EVI 40 days before harvest (avg.)
(4) RDVI 40 days before harvest (max.)
(5) Dew point temperature (max.)
(6) NDVI 40 days before harvest (avg.)
(7) SAVI 40 days before harvest (min.)
(8) SAVI 40 days before harvest (stddev.)
(9) Irrigation (BBCH > 70)
(10) Dew point temperature (avg.)
(11) P—phosphorus
(12) Mn—manganese
(13) NDVI 40 days before harvest (avg.)
(14) Fe—iron
(15) RDVI 40 days before harvest (avg.)
(16) Fertigation (BBCH 0–60)
(17) SAVI 40 days before harvest (avg.)
(18) NDVI 40 days before harvest (max.)
(19) pH
(20) B—boron
(21) Ca—calcium
(22) NDVI 40 days before harvest (min.)
(23) EVI 40 days before harvest (min.)
(24) N-NO₃—nitrogen
(25) Fertilization (BBCH 61–70)
(26) EVI 40 days before harvest (max.)
(27) Na—sodium
(28) K—potassium
(29) Soil P—phosphorus (avg.) (BBCH 61–70)
(30) SAVI 40 days before harvest (max.)
(31) HTC (BBCH 61–70)
(32) Fertigation
(33) RDVI 40 days before harvest (min.)
(34) HTC (BBCH > 70)
(35) Soil P—phosphorus (avg.) (BBCH > 70)
(36) Irrigation (BBCH 0–60)
(37) K—potassium fertilization (annually)
(38) Cu—copper
(39) S-SO₄—sulfur
(40) Rainfall (BBCH 0–60)
On the basis of the real features listed above, PCA was applied, generating nine artificial features. The importance of these nine features is depicted in Figure 4.
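To make the modelling chain concrete, a minimal sketch of the final step is given below: the stepwise-selected columns are compressed by PCA and fed to an XGBoost regressor with the hyperparameters from Table 5. Variable names and the use of scikit-learn's PCA are illustrative assumptions; the stepwise-fit selection itself is assumed to have been run beforehand.
```python
import numpy as np
from sklearn.decomposition import PCA
from xgboost import XGBRegressor

def fit_yield_model(X: np.ndarray, y: np.ndarray, selected_idx: list[int]):
    """X: (n_samples, 89) matrix of prognostic features; y: subplot yields;
    selected_idx: the 40 columns chosen by the stepwise-fit procedure."""
    pca = PCA(n_components=8)  # PCA components used for the XGB model (Table 6)
    z = pca.fit_transform(X[:, selected_idx])
    model = XGBRegressor(learning_rate=0.1, n_estimators=1000, max_depth=6)
    model.fit(z, y)
    return pca, model

def predict_yield(pca: PCA, model: XGBRegressor, X_new: np.ndarray,
                  selected_idx: list[int]) -> np.ndarray:
    """Apply the fitted PCA projection, then predict yields."""
    return model.predict(pca.transform(X_new[:, selected_idx]))
```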

4. Conclusions

With the growing importance of blueberry production comes the need for better monitoring of crop yields, as well as conscious production management and proper pre-harvest decision-making. The results presented here indicate that machine learning methods are a very useful tool for predicting the yield of the blueberry varieties Chandler, Liberty and Nelson. The yield prediction models are intended to estimate yields during the growing season, immediately before harvest.
The presented models and their quality depend strictly on the availability of data, on the climatic and vegetation parameters of the growing areas, and on the cultivated varieties; these factors limit their transferability. Applying the developed models to crops grown in regions with climatic and vegetation conditions significantly different from those of southeastern Poland should therefore be preceded by additional research and any necessary corrections to the models.
Overall, predictions made prior to fruit harvest are a valuable source of knowledge for harvesting, selling and storing crops. The presented models achieved sufficient accuracy, and the approach shows great promise for further applications and development.

Author Contributions

K.B., G.N. and T.W. conceived the study design, managed data collection, built the database, and performed the first data analysis. J.K. and B.Ś. carried out the in-depth data analyses and built the models through to the final results. I.A., G.N., T.W., J.K. and B.Ś. wrote the manuscript with substantial input from K.B. All authors have read and agreed to the published version of the manuscript.

Funding

The project is co-financed by the European Union from the European Regional Development Fund under the Smart Growth Operational Programme. It is conducted under a competition of the National Centre for Research and Development, within the 1.1.1 programme for R&D projects of enterprises “Fast track–Agrotech”, grant number POIR.01.01.01-00-2298/20.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Not applicable.

Acknowledgments

We gratefully acknowledge Łukasz Cypcar and Szymon Margański for facilitating sample collection and organizing data storage, and all members of the Seth Software teams for their assistance and helpful discussions. We thank Joanna Kogut for administrative support, without which we could not have achieved such fruitful results.

Conflicts of Interest

The authors declare no conflicts of interest.

References

  1. Qu, H.; Xiang, R.; Obsie, E.Y.; Wei, D.; Drummond, F. Parameterization and Calibration of Wild Blueberry Machine Learning Models to Predict Fruit-Set in the Northeast China Bog Blueberry Agroecosystem. Agronomy 2021, 11, 1736. [Google Scholar] [CrossRef]
  2. Golovinskaia, O.; Wang, C.K. Review of Functional and Pharmacological Activities of Berries. Molecules 2021, 26, 3904. [Google Scholar] [CrossRef] [PubMed]
  3. FAOSTAT. Available online: https://www.fao.org/faostat/en/#data/QCL (accessed on 14 September 2022).
  4. Salvo, S.; Muñoz, C.; Ávila, J.; Bustos, J.; Ramírez-Valdivia, M.; Silva, C.; Vivallo, G. An estimate of potential blueberry yield using regression models that relate the number of fruits to the number of flower buds and to climatic variables. Sci. Hortic. 2012, 133, 56–63. [Google Scholar] [CrossRef]
  5. Piekutowska, M.; Niedbała, G.; Piskier, T.; Lenartowicz, T.; Pilarski, K.; Wojciechowski, T.; Pilarska, A.A.; Czechowska-Kosacka, A. The Application of Multiple Linear Regression and Artificial Neural Network Models for Yield Prediction of Very Early Potato Cultivars before Harvest. Agronomy 2021, 11, 885. [Google Scholar] [CrossRef]
  6. Gorzelany, J.; Belcar, J.; Kuźniar, P.; Niedbała, G.; Pentoś, K. Modelling of Mechanical Properties of Fresh and Stored Fruit of Large Cranberry Using Multiple Linear Regression and Machine Learning. Agriculture 2022, 12, 200. [Google Scholar] [CrossRef]
  7. Niazian, M.; Sadat-Noori, S.A.; Abdipour, M. Modeling the seed yield of Ajowan (Trachyspermumammi L.) using artificial neural network and multiple linear regression models. Ind. Crops Prod. 2018, 117, 224–234. [Google Scholar] [CrossRef]
  8. Sabzi-Nojadeh, M.; Niedbała, G.; Younessi-Hamzekhanlu, M.; Aharizad, S.; Esmaeilpour, M.; Abdipour, M.; Kujawa, S.; Niazian, M. Modeling the Essential Oil and Trans-Anethole Yield of Fennel (Foeniculumvulgare Mill. var. vulgare) by Application Artificial Neural Network and Multiple Linear Regression Methods. Agriculture 2021, 11, 1191. [Google Scholar] [CrossRef]
  9. Hara, P.; Piekutowska, M.; Niedbała, G. Selection of Independent Variables for Crop Yield Prediction Using Artificial Neural Network Models with Remote Sensing Data. Land 2021, 10, 609. [Google Scholar] [CrossRef]
  10. He, L.; Fang, W.; Zhao, G.; Wu, Z.; Fu, L.; Li, R.; Majeed, Y.; Dhupia, J. Fruit yield prediction and estimation in orchards: A state-of-the-art comprehensive review for both direct and indirect methods. Comput. Electron. Agric. 2022, 195, 106812. [Google Scholar] [CrossRef]
  11. Obsie, E.Y.; Qu, H.; Drummond, F. Wild blueberry yield prediction using a combination of computer simulation and machine learning algorithms. Comput. Electron. Agric. 2020, 178, 105778. [Google Scholar] [CrossRef]
  12. Khan, H.; Esau, T.J.; Farooque, A.A.; Abbas, F. Wild blueberry harvesting losses predicted with selective machine learning algorithms. Agriculture 2022, 12, 1657. [Google Scholar] [CrossRef]
  13. Huang, W.; Wang, X.; Zhang, J.; Xia, J.; Zhang, X. Improvement of blueberry freshness prediction based on machine learning and multi-source sensing in the cold chain logistics. Food Control 2022, 145, 109496. [Google Scholar] [CrossRef]
  14. Vasconez, J.P.; Delpiano, J.; Vougioukas, S.; Cheein, F.A. Comparison of convolutional neural networks in fruit detection and counting: A comprehensive evaluation. Comput. Electron. Agric. 2020, 173, 105348. [Google Scholar] [CrossRef]
  15. Häni, N.; Roy, P.; Isler, V. A comparative study of fruit detection and counting methods for yield mapping in apple orchards. J. Field Robot. 2020, 37, 263–282. [Google Scholar] [CrossRef] [Green Version]
  16. Coviello, L.; Cristoforetti, M.; Jurman, G.; Furlanello, C. GBCNet: In-field grape berries counting for yield estimation by dilated CNNs. Appl. Sci. 2020, 10, 4870. [Google Scholar] [CrossRef]
  17. Koirala, A.; Walsh, K.; Wang, Z.; McCarthy, C. Deep learning for real-time fruit detection and orchard fruit load estimation: Benchmarking of ‘MangoYOLO’. Precis. Agric. 2019, 20, 1107–1135. [Google Scholar] [CrossRef]
  18. Mekhalfi, M.L.; Nicolò, C.; Ianniello, I.; Calamita, F.; Goller, R.; Barazzuol, M.; Melgani, F. Vision system for automatic on-tree kiwifruit counting and yield estimation. Sensors 2020, 20, 4214. [Google Scholar] [CrossRef]
  19. Gutiérrez, S.; Wendel, A.; Underwood, J. Ground based hyperspectral imaging for extensive mango yield estimation. Comput. Electron. Agric. 2019, 157, 126–135. [Google Scholar] [CrossRef]
  20. Kalantar, A.; Edan, Y.; Gur, A.; Klapp, I. A deep learning system for single and overall weight estimation of melons using unmanned aerial vehicle images. Comput. Electron. Agric. 2020, 178, 105748. [Google Scholar] [CrossRef]
  21. Torres-Sánchez, J.; Mesas-Carrascosa, F.J.; Santesteban, L.G.; Jiménez-Brenes, F.M.; Oneka, O.; Villa-Llop, A.; Loidi, M.; López-Granados, F. Grape cluster detection using UAV photogrammetric point clouds as a low-cost tool for yield forecasting in vineyards. Sensors 2021, 21, 3083. [Google Scholar] [CrossRef]
  22. Di Gennaro, S.F.; Toscano, P.; Cinat, P.; Berton, A.; Matese, A. A low-cost and unsupervised image recognition methodology for yield estimation in a vineyard. Front. Plant Sci. 2019, 10, 559. [Google Scholar] [CrossRef] [PubMed] [Green Version]
  23. Apolo-Apolo, O.E.; Pérez-Ruiz, M.; Martínez-Guanter, J.; Valente, J. A cloud-based environment for generating yield estimation maps from apple orchards using UAV imagery and a deep learning technique. Front. Plant Sci. 2020, 11, 1086. [Google Scholar] [CrossRef] [PubMed]
  24. Khoshnevisan, B.; Rafiee, S.; Mousazadeh, H. Application of multi-layer adaptive neuro-fuzzy inference system for estimation of greenhouse strawberry yield. Measurement 2014, 47, 903–910. [Google Scholar] [CrossRef]
  25. Papageorgiou, E.; Aggelopoulou, K.; Gemtos, T.; Nanos, G. Yield prediction in apples using Fuzzy Cognitive Map learning approach. Comput. Electron. Agric. 2013, 91, 19–29. [Google Scholar] [CrossRef]
  26. Wojciechowski, T.; Mazur, A.; Przybylak, A.; Piechowiak, J. Effect of Unitary Soil Tillage Energy on Soil Aggregate Structure and Erosion Vulnerability. J. Ecol. Eng. 2020, 21, 180–185. [Google Scholar] [CrossRef]
  27. Bai, X.; Li, Z.; Li, W.; Zhao, Y.; Li, M.; Chen, H.; Wei, S.; Jiang, Y.; Yang, G.; Zhu, X. Comparison of machine-learning and casa models for predicting apple fruit yields from time-series planet imageries. Remote Sens. 2021, 13, 3073. [Google Scholar] [CrossRef]
  28. Van Beek, J.; Tits, L.; Somers, B.; Deckers, T.; Verjans, W.; Bylemans, D.; Janssens, P.; Coppin, P. Temporal dependency of yield and quality estimation through spectral vegetation indices in pear orchards. Remote Sens. 2015, 7, 9886–9903. [Google Scholar] [CrossRef] [Green Version]
  29. Li, G.; Suo, R.; Zhao, G.; Gao, C.; Fu, L.; Shi, F.; Dhupia, J.; Li, R.; Cui, Y. Real-time detection of kiwifruit flower and bud simultaneously in orchard using YOLOv4 for robotic pollination. Comput. Electron. Agric. 2022, 193, 106641. [Google Scholar] [CrossRef]
  30. Matese, A.; Di Gennaro, S.F. Beyond the traditional NDVI index as a key factor to mainstream the use of UAV in precision viticulture. Sci. Rep. 2021, 11, 1–13. [Google Scholar] [CrossRef]
  31. Sinwar, D.; Dhaka, V.S.; Sharma, M.K.; Rani, G. AI-based yield prediction and smart irrigation. In Internet of Things and Analytics for Agriculture, Volume 2; Springer: Berlin/Heidelberg, Germany, 2020; pp. 155–180. [Google Scholar]
  32. Engen, M.; Sandø, E.; Sjølander, B.L.O.; Arenberg, S.; Gupta, R.; Goodwin, M. Farm-scale crop yield prediction from multi-temporal data using deep hybrid neural networks. Agronomy 2021, 11, 2576. [Google Scholar] [CrossRef]
  33. Fukuda, M.; Okuno, T.; Yuki, S. Central Object Segmentation by Deep Learning to Continuously Monitor Fruit Growth through RGB Images. Sensors 2021, 21, 6999. [Google Scholar] [CrossRef] [PubMed]
  34. Cravero, A.; Pardo, S.; Sepúlveda, S.; Muñoz, L. Challenges to Use Machine Learning in Agricultural Big Data: A Systematic Literature Review. Agronomy 2022, 12, 748. [Google Scholar] [CrossRef]
  35. Angulo-Meza, L.; González-Araya, M.; Iriarte, A.; Rebolledo-Leiva, R.; de Mello, J.C.S. A multiobjective DEA model to assess the eco-efficiency of agricultural practices within the CF+ DEA method. Comput. Electron. Agric. 2019, 161, 151–161. [Google Scholar] [CrossRef]
  36. Yarborough, D. Development of a crop estimation technique for wild blueberries. In Proceedings of the VII International Symposium on Vaccinium Culture 574, Chillan, Chile, 4–9 December 2000; pp. 409–413. [Google Scholar]
  37. Zaman, Q.; Schumann, A.; Percival, D.; Gordon, R. Estimation of wild blueberry fruit yield using digital color photography. Trans. ASABE 2008, 51, 1539–1544. [Google Scholar] [CrossRef]
  38. Swain, K.C.; Zaman, Q.U.; Schumann, A.W.; Percival, D.C.; Bochtis, D.D. Computer vision system for wild blueberry fruit yield mapping. Biosyst. Eng. 2010, 106, 389–394. [Google Scholar] [CrossRef]
  39. Panda, S.S.; Hoogenboom, G.; Paz, J.O. Remote sensing and geospatial technological applications for site-specific management of fruit and nut crops: A review. Remote Sens. 2010, 2, 1973–1997. [Google Scholar] [CrossRef] [Green Version]
  40. Yang, C.; Lee, W.S.; Williamson, J.G. Classification of blueberry fruit and leaves based on spectral signatures. Biosyst. Eng. 2012, 113, 351–362. [Google Scholar] [CrossRef]
  41. Tan, K.; Lee, W.S.; Gan, H.; Wang, S. Recognising blueberry fruit of different maturity using histogram oriented gradients and colour features in outdoor scenes. Biosyst. Eng. 2018, 176, 59–72. [Google Scholar] [CrossRef]
  42. Jafari, F.; Nassar, L.; Karray, F. Time series similarity analysis framework in fresh produce yield forecast domain. In Proceedings of the 2021 IEEE International Conference on Systems, Man, and Cybernetics (SMC), Melbourne, Australia, 17–21 October 2021; pp. 2368–2374. [Google Scholar]
  43. Nagaraju, Y.; Hegde, S.U.; Stalin, S. Fine-tuned mobilenet classifier for classification of strawberry and cherry fruit types. In Proceedings of the 2021 International Conference on Computer Communication and Informatics (ICCCI), Coimbatore, India, 23–25 June 2021; pp. 1–8. [Google Scholar]
  44. Ni, X.; Li, C.; Jiang, H.; Takeda, F. Three-dimensional photogrammetry with deep learning instance segmentation to extract berry fruit harvestability traits. ISPRS J. Photogramm. Remote Sens. 2021, 171, 297–309. [Google Scholar] [CrossRef]
  45. Wojciechowski, T.; Niedbala, G.; Czechlowski, M.; Nawrocka, J.R.; Piechnik, L.; Niemann, J. Rapeseed seeds quality classification with usage of VIS-NIR fiber optic probe and artificial neural networks. In Proceedings of the 2016 International Conference on Optoelectronics and Image Processing, ICOIP 2016, Warsaw, Poland, 10–12 June 2016. [Google Scholar] [CrossRef]
  46. Kujawa, S.; Dach, J.; Kozłowski, R.J.; Przybył, K.; Niedbała, G.; Mueller, W.; Tomczak, R.J.; Zaborowicz, M.; Koszela, K. Maturity classification for sewage sludge composted with rapeseed straw using neural image analysis. In Proceedings of the SPIE—The International Society for Optical Engineering, ICOIP 2016, Warsaw, Poland, 10–12 June 2016; Volume 10033, p. 100332H. [Google Scholar] [CrossRef]
  47. Seireg, H.R.; Omar, Y.M.; Abd El-Samie, F.E.; El-Fishawy, A.S.; Elmahalawy, A. Ensemble machine learning techniques using computer simulation data for wild blueberry yield prediction. IEEE Access 2022, 10. [Google Scholar] [CrossRef]
  48. Niedbała, G. Application of artificial neural networks for multi-criteria yield prediction of winter rapeseed. Sustainability 2019, 11, 533. [Google Scholar] [CrossRef] [Green Version]
  49. MacEachern, C.B.; Esau, T.J.; Schumann, A.W.; Hennessy, P.J.; Zaman, Q.U. Detection of fruit maturity stage and yield estimation in wild blueberry using deep learning convolutional neural networks. Smart Agric. Technol. 2023, 3, 100099. [Google Scholar] [CrossRef]
  50. Index DataBase. Available online: https://www.indexdatabase.de/ (accessed on 20 November 2022).
  51. Anderberg, M.R. Cluster Analysis for Applications; Academic Press: New York, NY, USA, 1983. [Google Scholar]
  52. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, Austria, 2012; ISBN 3-900051-07-0. Available online: http://www.R-project.org (accessed on 20 November 2022).
  53. Arabie, P.; Carroll, J.D. MAPCLUS: A mathematical programming approach to fitting the ADCLUS models. Psychometrika 1980, 45, 211–235. [Google Scholar] [CrossRef]
  54. Tufte, E.R. Envisioning Information; Graphics Press: Cheshire, CT, USA, 1990. [Google Scholar]
  55. Tufte, E.R. The Visual Display of Quantitative Information; Graphics Press: Cheshire, CT, USA, 1983. [Google Scholar]
  56. Cleveland, W.S. The Elements of Graphing Data; revised ed.; Hobart Press: Thousand Oaks, CA, USA, 1994. [Google Scholar]
  57. Cleveland, W.S. Visualizing Data; Hobart Press: Thousand Oaks, CA, USA, 1993. [Google Scholar]
  58. Ball, G.H.; Hall, D.J. A Novel Method of Data Analysis and Pattern Classification; Technical Report; Stanford Research Institute: Stanford, CA, USA, 1965. [Google Scholar]
  59. Banfield, J.D.; Raftery, A.E. Model-Based Gaussian and Non–Gaussian Clustering. Biometrics 1993, 49, 803–821. [Google Scholar] [CrossRef]
  60. Beale, E.M.L. Euclidean cluster analysis. Bull. Int. Stat. Inst. 1969, 43, 92–94. [Google Scholar]
  61. Bensmail, H.; Meulman, J.J. Model-based clustering with noise: Bayesian inference and estimation. J. Classif. 2003, 20, 049–076. [Google Scholar] [CrossRef]
  62. Bezdek, J.C. Numerical taxonomy with fuzzy sets. J. Math. Biol. 1974, 1, 57–71. [Google Scholar] [CrossRef]
  63. Cox, D.R. Regression models and life tables (with Discussion). J. R. Stat. Soc. B 1972, 34, 187–220. [Google Scholar]
  64. Heard, N.A.; Holmes, C.C.; Stephens, D.A. A Quantitative Study of Gene Regulation Involved in the Immune Response of Anopheline Mosquitoes: An Application of Bayesian Hierarchical Clustering of Curves. J. Am. Stat. Assoc. 2006, 101, 18–29. [Google Scholar] [CrossRef]
  65. Fan, J.; Peng, H. Nonconcave penalized likelihood with a diverging number of parameters. Ann. Stat. 2004, 32, 928–961. [Google Scholar] [CrossRef]
  66. Selyaninov, G. Methods of agricultural climatology. Agric. Meteorol. 1930, 22, 4–20. [Google Scholar]
  67. Prentice, I.C.; Cramer, W.; Harrison, S.P.; Leemans, R.; Monserud, R.A.; Solomon, A.M. Special Paper: A Global Biome Model Based on Plant Physiology and Dominance, Soil Properties and Climate. J. Biogeogr. 1992, 19, 117–134. [Google Scholar] [CrossRef]
  68. XGBoost Package. Available online: https://xgboost.readthedocs.io/en/stable/python/python_intro.html (accessed on 20 November 2022).
  69. Comparing Anomaly Detection Algorithms for Outlier Detection on Toy Datasets. Available online: https://scikit-learn.org/stable/auto_examples/miscellaneous/plot_anomaly_comparison.html (accessed on 20 November 2022).
  70. Least squares Linear Regression. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html (accessed on 20 November 2022).
  71. Ridge Model. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Ridge.html (accessed on 20 November 2022).
  72. Lasso Model. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html (accessed on 20 November 2022).
  73. ElasticNet. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.ElasticNet.html (accessed on 20 November 2022).
  74. Random Forest Regressor. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestRegressor.html (accessed on 20 November 2022).
  75. Multi-Layer Perceptron Regressor. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPRegressor.html (accessed on 20 November 2022).
  76. SGD Regressor. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html (accessed on 20 November 2022).
  77. Epsilon-Support Vector Regression. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.svm.SVR.html (accessed on 20 November 2022).
  78. Nu Support Vector Regression. Available online: https://scikit-learn.org/stable/modules/generated/sklearn.svm.NuSVR.html (accessed on 20 November 2022).
  79. “Pragmatic” Project Webpage. Available online: https://seth.software/zt_portfolio/pragramatic/ (accessed on 20 November 2022).
  80. Niedbała, G.; Piekutowska, M.; Weres, J.; Korzeniewicz, R.; Witaszek, K.; Adamski, M.; Pilarski, K.; Czechowska-Kosacka, A.; Krysztofiak-Kaniewska, A. Application of artificial neural networks for yield modeling of winter rapeseed based on combined quantitative and qualitative data. Agronomy 2019, 9, 781. [Google Scholar] [CrossRef] [Green Version]
  81. Peng, J.; Kim, M.; Kim, Y.; Jo, M.; Kim, B.; Sung, K.; Lv, S. Constructing Italian ryegrass yield prediction model based on climatic data by locations in South Korea. Grassl. Sci. 2017, 63, 184–195. [Google Scholar] [CrossRef]
Figure 1. Number of responses as a function of time and region for WOS database search phrases: “blueberry” and “yield prediction” or “blueberry” and “yield forecast” or “blueberry” and “yield estimation” (September 2022).
Figure 2. Comparison between two methods of outlier detection. The images used were first presented in [69]. (a) SVM one class-case #1; (b) LOF-case #1; (c) SVM one class-case #2; (d) LOF-case #2; (e) SVM one class-case #3; (f) LOF-case #3; (g) SVM one class-case #4; (h) LOF-case #4.
Figure 3. Feature explained variance for XGBoost.
Figure 4. Feature importance of artificial features generated by PCA (Principal Component Analysis).
Figure 5. Cumulative feature explained variance for XGBoost.
Figure 6. Result of blueberry crop prediction for the test and validation sets.
Table 1. Structure of highbush blueberry data.

Subplot Code | Variety | No. of Subplots | Total Area [ha]
102 | Nelson | 18 | 26.16
103 | Nelson | 12 | 26.4
104 | Nelson | 12 | 16.2
105 | Nelson | 18 | 26.22
106 | Nelson | 6 | 12.48
107 | Chandler | 18 | 32.34
108 | Chandler | 18 | 29.82
109 | Chandler | 12 | 15.9
110 | Chandler | 12 | 17.22
111 | Chandler | 6 | 13.98
112 | Chandler | 12 | 23.46
113 | Chandler | 12 | 22.2
114 | Liberty | 12 | 17.1
115 | Liberty | 18 | 26.82
116 | Liberty | 12 | 22.2
117 | Liberty | 12 | 23.22
118 | Liberty | 11 | 10.07
120 | Chandler | 5 | 8.65
121 | Chandler | 5 | 8.8
122 | Nelson | 6 | 5.82
123 | Nelson | 6 | 8.28
Total | | 243 | 393.34
Table 2. Summary of raw data.

Features Group | No. of Raw Data
Treatment features | 135,113
Weather features | 831,562
Soil features | 6929
BBCH soil features | 7380
Vegetation features | 3936
Selyaninov hydrothermal coefficient | 738
GDD features | 738
Aggregates based on fertilization and fertigation | 9045
Harmful features | 110
Total | 995,551
Table 3. Summary of all prognostic features.

Features Group | No. of Features
Treatment features | 2
Weather features | 10
Soil features | 14
BBCH soil features | 30
Vegetation features | 16
Selyaninov features | 3
GDD features | 3
Aggregates based on fertilization and fertigation | 9
Harmful features | 2
Total | 89
Table 4. List of features with missing values.

No. | Feature with Missing Values | % of Missing Values
1 | S-SO₄—sulfur | 17%
2 | Cl—chlorine | 33%
3 | Irrigation (BBCH 0–60) | 46%
4 | Fertilization (BBCH 0–60) | 47%
5 | Hailstorm percentage of damage | 67%
6 | Hailstorm cut fruit | 87%
Table 5. Summary of applied classifiers.

No. | Classifier
1 | Linear regression
2 | Ridge
3 | Lasso
4 | ElasticNet
5 | XGB (learning_rate = 0.1, n_estimators = 1000, max_depth = 6)
6 | Random Forest (max_depth = 3, n_estimators = 300)
7 | MLP (hidden_layer_sizes = 10)
8 | MLP (hidden_layer_sizes = 100)
9 | SGD
10 | NuSVR (nu = 0.2, C = 0.2, kernel = 'rbf', gamma = 0.001)
11 | SVR (C = 30,000.0, epsilon = 0.2)
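Expressed as scikit-learn and xgboost objects, the suite in Table 5 corresponds roughly to the instantiations below (a sketch; wherever Table 5 does not specify a parameter, the library defaults are assumed):
```python
from sklearn.linear_model import (ElasticNet, Lasso, LinearRegression, Ridge,
                                  SGDRegressor)
from sklearn.ensemble import RandomForestRegressor
from sklearn.neural_network import MLPRegressor
from sklearn.svm import SVR, NuSVR
from xgboost import XGBRegressor

# One entry per row of Table 5; names follow the table.
models = {
    "Linear regression": LinearRegression(),
    "Ridge": Ridge(),
    "Lasso": Lasso(),
    "ElasticNet": ElasticNet(),
    "XGB": XGBRegressor(learning_rate=0.1, n_estimators=1000, max_depth=6),
    "Random Forest": RandomForestRegressor(max_depth=3, n_estimators=300),
    "MLP(10)": MLPRegressor(hidden_layer_sizes=(10,)),
    "MLP(100)": MLPRegressor(hidden_layer_sizes=(100,)),
    "SGD": SGDRegressor(),
    "NuSVR": NuSVR(nu=0.2, C=0.2, kernel="rbf", gamma=0.001),
    "SVR": SVR(C=30000.0, epsilon=0.2),
}
```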
Table 6. Top results of blueberry crop prediction.

Classifier | No. of Cols | Stepwise | P-Enter | P-Remove | Pearson | Chi2 | PCA | PCA Comp. | MAPE Val. [%] | MAPE Test [%]
XGB | 40 | Yes | 0.33 | 0.38 | No | No | Yes | 8 | 10.33 | 12.48
Random Forest | 39 | Yes | 0.24 | 0.29 | No | No | Yes | 9 | 10.20 | 14.30
Linear Regression | 48 | Yes | 0.44 | 0.49 | No | No | Yes | 9 | 5.70 | 15.90
SVR | 37 | Yes | 0.20 | 0.25 | No | No | Yes | 6 | 15.09 | 15.96
Lasso | 48 | Yes | 0.44 | 0.49 | No | No | Yes | 8 | 1.49 | 17.68
SGD | 48 | Yes | 0.44 | 0.49 | No | No | Yes | 9 | 1.22 | 17.94
Ridge | 48 | Yes | 0.44 | 0.49 | No | No | Yes | 9 | 0.49 | 18.64
ElasticNet | 37 | Yes | 0.20 | 0.25 | No | No | Yes | 5 | 13.73 | 21.84
NuSVR | 37 | Yes | 0.20 | 0.25 | No | No | Yes | 4 | 31.81 | 34.91
MLP(100) | 89 | No | 0 | 0 | No | No | No | 0 | 97.48 | 98.25
MLP(10) | 46 | Yes | 0.43 | 0.48 | No | No | No | 0 | 99.70 | 99.76