Article

Bridging Offline Functional Model Carrying Aging-Specific Growth Rate Information and Recombinant Protein Expression: Entropic Extension of Akaike Information Criterion

Department of Automation, Kaunas University of Technology, LT-51367 Kaunas, Lithuania
*
Author to whom correspondence should be addressed.
Entropy 2021, 23(8), 1057; https://doi.org/10.3390/e23081057
Submission received: 9 July 2021 / Revised: 13 August 2021 / Accepted: 14 August 2021 / Published: 16 August 2021
(This article belongs to the Collection Do Entropic Approaches Improve Understanding of Biology?)

Abstract

This study presents a mathematical model of recombinant protein expression, including its development, selection, and fitting results based on seventy fed-batch cultivation experiments from two independent biopharmaceutical sites. To counter the tendency of the Akaike information criterion to overfit, we propose an entropic extension, which behaves asymptotically like the classical criteria. Estimation of recombinant protein concentration was performed with pseudo-global optimization processes while processing offline recombinant protein concentration samples. We show that functional models including the average age of the cells and the specific growth rate at induction, or at the start of product biosynthesis, are the best descriptors for the datasets. We also propose a tuning coefficient that forces the modified Akaike information criterion to avoid overfitting when the designer requires fewer model parameters. We expect that a lower number of coefficients would allow the efficient maximization of target microbial products in the upstream section of contract development and manufacturing organization services in the future. Experimental model fitting was accomplished simultaneously for 46 experiments at the first site and 24 fed-batch experiments at the second site. The two sites contributed 196 and 131 protein samples, respectively, giving a total of 327 target product concentration samples derived from the bioreactor medium.

1. Introduction

Controlling and observing industrial biotechnology processes is a challenging task for bioengineers. The main problems are collecting accurate information regarding the state of the process and its quality. The industry demands the process be as productive as possible, which also contributes to the task’s difficulty. Overcoming these challenges requires high-quality and reliable process data. With concrete and quality data, easier process controllability and higher result repeatability are attainable. Unfortunately, the industry still lacks accurate and real-time measurements, especially for the main focus of almost all industrial cell cultivation processes—synthesized target product concentration. Sampled, time-delayed measurements with additional instruments and time-consuming analyses remain the most common way to determine the product concentration throughout cultivations. In large-scale processes, this problem becomes more acute, with additional hardware costs and the increased possibility of errors. Therefore, the realization and implementation of software sensors that can measure and predict indirect quantities using information collected throughout the process has become more prominent [1,2,3,4,5].
Target product concentration estimation in specific cultivations uses soft sensors that consist of various mathematical models [6]. These range from traditional mechanistic and empirical models to hybrid models, which have become increasingly prevalent for solving the estimation task. The conventional model’s classical shape requires elaboration and the tuning of its parameters to achieve satisfactory results [7]. Nevertheless, traditional mathematical models remain the fundamental basis of the software sensor, and in some instances, they are the most appropriate way to estimate process variables [8].
The use of traditional models for product estimation is seen in cultivations of P. chrysogenum for penicillin concentration [9], recombinant E. coli for protein concentration [10,11,12], and yeast fermentations for ethanol concentration [13]. Among the mechanistic unstructured models, the most popular approach is the extended Kalman filter [14,15]. However, the accuracy of the EKF and its results are closely related to the accuracy of the mathematical model, and may also suffer from convergence problems [16]. Nonetheless, EKF has considerable robustness to changes of initial process conditions, and has proven successful when applied in S. cerevisiae cultivations [6,17].
Applying traditional mathematical models to nonlinear and multidimensional systems may result in numerous errors due to the low flexibility of simple-structure differential equations. Therefore, researchers frequently choose an empirical model as an alternative approach that does not require detailed description of the process, but rather quantitative and qualitative data of the bioprocess. Among these data-driven models, the most successful and commonly applied are ANN (artificial neural networks), PLS (partial least squares), and PCA (principal component analysis)-based soft sensors. The latter, combined with spectroscopy, has been proven to provide satisfactory results in product estimation [18,19]. Meanwhile, ANNs have become crucial to hybrid models for product and state estimation [10,20]. The use of ANN is prominent not only as an alternative to describing complex parts of the processes, but also as a combination with additional off-gas analysis or spectroscopy data [21,22]. However, using such supplementary equipment for data gathering increases the process cost while also requiring added algorithms to compensate for the possible drifts in the gas sensors or data filtering from spectroscopy. Additionally, the estimation becomes time-delayed when taking samples periodically. Generally speaking, ANN-based software sensors, compared with traditional mathematical models, achieve more satisfactory results and require less development time [10,23].
A quick overview of the different techniques employed for specific product estimation can be seen in Table 1.
Our study aims to employ and expand the Luedeking–Piret model [25], and present an extension of the protein product estimation model based on gathered offline data. This paper improves the previous functional model by adding cell age and extensive model fitting analysis. The purpose of the proposed mathematical model is not to descriptively define the bioprocess, but instead to identify the correct state variables and their interrelationships that maximize synthesized product content.
Section 2: Materials and Methods describes the test object, processes, and operating conditions. Section 3: Proposed Extension of Akaike Information Criterion presents the modified Akaike criterion for model fitting with the addition of a tuning coefficient. Section 4: Combined Model Representing Hypothesis with Multiple Elements reviews previous, similar maximal production rate expressions and proposes an improved model for target protein fitting. Section 5: System Identification and Parameter Estimation presents the model's parameter identification methods and the use of cell age. Section 6: Model Selection Based on Experimental Model Calibration compares the different models presented. Section 7: Discussion and Conclusions presents final remarks about the results and model fitting.

2. Materials and Methods

2.1. Cell Strains

The experimental object of this work was recombinant E. coli cells tested at two independent biopharmaceutical sites. The experimental data originate from cultivations of two different cell strains. The first cell strain was E. coli BL21(DE3) pLysS (Site 1), and the second was E. coli BL21(DE3) pET21-IFN-alfa-5 (Site 2). The synthesized product appeared in soluble and insoluble forms at both sites. The E. coli BL21(DE3) target product was insoluble protein and inclusion bodies. The product's expression was dependent on the T7 promoter, with one millimole of isopropyl-D-1-thiogalactopyranoside (IPTG).

2.2. Medium

For Site 1, the cultivation medium throughout the experiments consisted of Na2SO4, 2.0 g/L; (NH4)2SO4, 2.46 g/L; NH4Cl, 0.5 g/L; K2HPO4, 14.6 g/L; NaH2PO4 × H2O, 3.6 g/L; (NH4)2-H–citrate, 1.0 g/L; MgSO4 × 7H2O, 1.2 g/L; trace element solution, 2 mL/L [26].
For Site 2, the cultivations were based on a minimal mineral medium, consisting of 46.55 g KH2PO4, 14 g (NH4)2HPO4, 5.6 g C6H8O7.H2O, 3 mL of concentrated antifoam, 35 g H14MgO11S, and 105 g D (+) glucose monohydrate.

2.3. Cultivation Conditions

Table 2 presents the different cell cultivation conditions for both of the cell strains at both sites.

2.4. Target Protein Analysis

The analytical method of determining the amount of target protein was SDS-PAGE (sodium dodecyl sulfate–polyacrylamide gel) electrophoresis. The final measurement of the target protein consists of a sequence of the following actions. Firstly, 200 g of wet biomass was dissolved in 1 mL of solution and mixed for 30 min. Then, to measure the total protein concentration, SDS-PAGE electrophoresis was performed on 200 μL of the suspension sample. The remainder of the suspension was mixed with SDS (sodium dodecyl sulfate) buffer to dissolve all proteins and centrifuged for 15 min at 4 °C with 20,000 G force. Determining the soluble protein concentration required another SDS-PAGE electrophoresis with a sample of 200 µL. The leftover supernatant was discarded and replaced with 1 mL of water, then mixed and centrifuged. Finally, decanting the supernatant and mixing it for approximately 12 h with the addition of 1 mL of solubilization buffer (8 M urea; 50 mM, pH 8.0 Tris base) allowed for measurements of insoluble protein (inclusion bodies) concentration via SDS-PAGE electrophoresis.

3. Proposed Extension of Akaike Information Criterion

The classical form of the Akaike information criterion allows for selecting an informative set of parameters with an inevitable trade-off concerning the model’s fitting uncertainty [27]. Let n be the number of observation samples, k the number of model parameters, and MSE the mean squared error of the residuals. Then, the Akaike measure is
$$AIC(k, n) = n \ln(MSE) + 2 \cdot k. \tag{1}$$
An alternative is the Bayesian information criterion, or BIC, which contains the variance $\sigma^2$ of the errors instead
$$BIC(k, n) = n \ln(\sigma^2) + 2 \cdot k. \tag{2}$$
One of the drawbacks of both BIC and AIC is that these criteria are designed without a tuning coefficient for minimizing the number of parameters to be used without changing the shape of the likelihood distributions. Another consideration is a tuning coefficient that would involve some theoretical asymptotic maximum number of parameters. In reality, the log-likelihood part of the criterion need not be related to average characteristics; it may also be based on cumulative characteristics, such as the sum of squared residuals, $RSS$. This amount divided by the degree of freedom $n$ recovers $MSE$ and presents the average discrepancy between the readings $y(t_i)$ observed at time $t_i$ and the value estimated by the model $f(t_i, k)$. Such cumulative discrepancy depends on the number of observations $n_i$, and has the form of
$$RSS(k, n_i) = \sum_{i=1}^{n_i} \left(y(t_i) - f(t_i, k)\right)^2 = \sum_{i=1}^{n_i} \left(y_i - f_i(k)\right)^2. \tag{3}$$
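As a minimal sketch of how Equations (1) and (3) fit together, the following Python snippet computes the residual sum of squares, divides it by the number of observations to obtain the MSE, and evaluates the classical AIC. The function names and the sample readings are hypothetical and serve only as an illustration.

```python
import math


def rss(observed, predicted):
    """Sum of squared residuals between readings y_i and model values f_i(k)."""
    return sum((y - f) ** 2 for y, f in zip(observed, predicted))


def aic(observed, predicted, k):
    """Classical AIC in the form n * ln(MSE) + 2 * k, with MSE = RSS / n."""
    n = len(observed)
    mse = rss(observed, predicted) / n
    return n * math.log(mse) + 2 * k


# Hypothetical readings and model estimates, for illustration only.
y = [1.0, 2.0, 3.0, 4.0]
f = [1.5, 2.5, 2.5, 3.5]
print(rss(y, f))
print(aic(y, f, k=2))
```

For the same residuals, each additional parameter raises this measure by exactly 2, which is the only penalty the classical form applies to model size.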
Therefore, we suggest two entropic criteria for prospective model selection, which have a tuning coefficient $k_{\max}$, a likelihood $RSS \equiv RSS(k, n_i)$, and a maximum likelihood $RSS_{\max} \equiv RSS_{\max}(n_i) = \lim_{k \to 0} RSS(k, n_i)$, yielding
$$S_A \equiv S_A(k_{\max}, k, n_i) = (k_{\max} - k) \cdot RSS \ln RSS + k \cdot (RSS_{\max} - RSS) \ln(RSS_{\max} - RSS). \tag{4}$$
The other information measure, $S$, in the entropic representation, which can serve equally well, is
$$S_B \equiv S_B(k_{\max}, k, n_i) = (k_{\max} - k) \cdot RSS \cdot \ln RSS + k \cdot RSS_{\max} \cdot \ln \frac{RSS_{\max}}{RSS}. \tag{5}$$
Then, one can determine $k_{AIC}$ and $k_{BIC}$, with which
$$RSS \equiv RSS(k, n_i) = \sum_{i=1}^{n_i} \left(y(t_i) - f(t_i, k)\right)^2 = \sum_{i=1}^{n_i} \left(y_i - f_i(k)\right)^2. \tag{6}$$
This links to Equations (1) and (2). In other words,
$$AIC(k, n_i) \sim \lim_{k_{\max} \to k_{AIC}} \ln\left(S(k, n_i)\right), \tag{7}$$
and
$$BIC(k, n_i) \sim \lim_{k_{\max} \to k_{BIC}} \ln\left(S(k, n_i)\right). \tag{8}$$
The motivation for tuning $k_{\max}$ to a certain $k_{optimal}$ is the need to avoid overfitting with experimental data when a user applies raw AIC or BIC criteria with a likelihood in any probabilistic form. Furthermore, the practical expectation is that the criterion be as generic as possible, and the likelihood's shape should not require modification. Consequently, an investigator must pick a set of parameters such that minimal effort is required to perform a trial when seeking rational bioprocess optimization. For example, only one or two cultivation protocol changes should be made to noticeably increase the overall total product, i.e., by more than 10 percent or so. It is expected that a biopharmaceutical manufacturer performs as few changes as possible. Simultaneously, the manufacturer must strive for maximal repeatability and standardization according to EU CE labeling, EU medical device (MDR), and US Food and Drug Administration (FDA) regulations at good manufacturing practice (GMP) or GMP-compliant (cGMP) facilities. This is particularly true when service providers provision a CDMO (contract development and manufacturing organization) technology transfer. Therefore, the upstream developers have one or two protocol adaptations or parameters at their disposal for a single experimental iteration consisting of unique experimental development trials or minor online checks.
In this study, we propose generic forms of Equations (4) and (5) that can be used to select such a minimal set of parameters that both reach (the principle of parsimony [28]) and match (the principle of convex optimization [29]) the extremum state of the measure.

4. Combined Model Representing Hypothesis with Multiple Elements

The previous study [11] introduced an additional protein $P(t)$ production yield parameter $\gamma$ to extend the Luedeking–Piret model for fed-batch cultivations [25,30,31]. The model relied on the oxygen uptake rate (OUR) for biomass $X$ estimation
$$OUR(t) = \alpha \cdot \dot{X}(t) + \beta \cdot X(t) + \gamma(t) \cdot \dot{P}(t). \tag{9}$$
The addition of the production yield $\gamma$, which represents the oxygen consumption yield for the protein synthesis rate, supplements the cell's previously used oxygen consumption parameters for biomass growth $\alpha$ and maintenance $\beta$. The expanded model achieved a pseudo-global estimation of synthesized protein and biomass concentration [29,32,33]. Such a procedure corresponds to pseudo-global offline model calibration. It was assumed that protein yield was a function of biomass concentration in a gray box model [34].
As shown in a previous work, protein productivity depends on IPTG (isopropyl-D-1-thiogalactopyranoside) and biomass concentrations at time of induction [29,35]. The latter had a significant impact on the model, such that the product formation parameter γ became a function of biomass concentration at time of induction. Then, the final estimator form became
$$OUR(t) = \alpha \cdot X(t) + k_{\gamma} \cdot \left(X(t) - X_{ind}\right) \cdot \frac{dP(t)}{dt}. \tag{10}$$
The expression of the product model is based on the assumption of the linear dependency of product synthesis on the specific growth rate (SGR) of biomass [36]
$$\frac{dP_X}{dt} = q_{px}(\mu, P_X) = P_{\max}(\mu, X) - k_t \cdot P_X, \tag{11}$$
where $q_{px}$ is the specific protein accumulation rate (U/g/h), $\mu$ the specific biomass growth rate (1/h), and $P_X \equiv P(t)/X(t)$ the specific protein activity (U/g), where the protein concentration is normalized by biomass concentration. Even though the previous study assumed that the maximum target protein formation rate was linked to the specific substrate consumption rate, the underlying idea is still the same in this study. Finally, the time constant $k_t$ was assumed to have a self-inhibiting effect [37].
Over the years, multiple researchers have studied how different process variables and parameters affect the model of P max . Table 3 presents significant historic parametric developments.
D. Levisauskas and others expressed the maximal production rate ($P_{\max}$) via the concept of active biomass [38,39]. The latter is assumed to be the part of the biomass that is responsible for specific product production. The average cell age $\overline{Age}_i \equiv \overline{Age}(t_i)$ identifies the active biomass at any time $t_i$ throughout the bioprocess. The expression of average cell age, including the initial biomass boundary condition, is
$$\overline{Age}_i = \frac{X_0 \cdot t_i + \int_0^{t_i} (t_i - t_j) \cdot \dot{X}(t_j) \, dt_j}{X_i}, \tag{12}$$
where $X_0$ is the initial biomass at the time of inoculation into the bioreactor. If the latter is assumed to be negligible, $\overline{Age}_i$ takes the following form
$$\overline{Age}_i = \frac{\int_0^{t_i} (t_i - t_j) \cdot \dot{X}(t_j) \, dt_j}{X_i} \approx \frac{\sum_{j=0}^{i} (t_i - t_j) \frac{\Delta X(t_j)}{\Delta t_j} \cdot \Delta t_j}{X_i} = \frac{\sum_{j=0}^{i} (t_i - t_j) \, \Delta X(t_j)}{X_i}. \tag{13}$$
Equation (13) recovers a particular case of Equation (12), which is the form taken from the research of D. Levisauskas and others [38,39]. Assuming that $t_j \sim j \Delta t$, the maximal production rate $P_{\max}$ at time $t_i$ is
$$P_{\max,1999}(t_i) = \frac{1}{X(t_i)} \sum_{j=1}^{i} \Delta X_j \cdot m(t_i - j \Delta t), \tag{14}$$
where $\Delta X_j$ is the growth of biomass throughout the $j$-th time interval, and $m$ ($0 < m < 1$) is the relative activity ratio that introduces the linearly increasing and decreasing transient effect of the age. The parameter $m$ is described by a trapezoid time function, which consists of four model parameters presumably related to each culture.
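The structure of Equation (14) can be illustrated with a short Python sketch. The four trapezoid breakpoints (`rise_start`, `rise_end`, `fall_start`, `fall_end`) are hypothetical names for the four model parameters mentioned above, and the plateau height of 1 is an assumption made only for this illustration; the source does not state the breakpoint values.

```python
def trapezoid_activity(age, rise_start, rise_end, fall_start, fall_end):
    """Relative activity m(age): linear rise, plateau, then linear decay.

    The four breakpoints are hypothetical names for the four trapezoid
    parameters; the plateau height of 1 is an illustrative assumption.
    """
    if age <= rise_start or age >= fall_end:
        return 0.0
    if age < rise_end:
        return (age - rise_start) / (rise_end - rise_start)
    if age <= fall_start:
        return 1.0
    return (fall_end - age) / (fall_end - fall_start)


def p_max_1999(delta_x, x_i, dt, activity):
    """Equation (14): age-weighted biomass increments divided by X(t_i).

    delta_x[j - 1] is the biomass growth in the j-th interval of length dt,
    so the current time is t_i = len(delta_x) * dt.
    """
    i = len(delta_x)
    t_i = i * dt
    total = sum(delta_x[j - 1] * activity(t_i - j * dt) for j in range(1, i + 1))
    return total / x_i


# Hypothetical increments and breakpoints, for illustration only.
m = lambda age: trapezoid_activity(age, 0.0, 1.0, 2.0, 3.0)
print(p_max_1999([1.0, 1.0, 2.0], x_i=4.0, dt=1.0, activity=m))
```

In this toy case the most recent increment has age zero and therefore contributes nothing, which is the transient effect of age that the trapezoid is meant to capture.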
The most recent functional protein model [11] relies on the assumption that the maximal specific product concentration value is asymptotically dependent on SGR. However, the authors identified an apparent effect of IPTG injection on product synthesis through data analysis. Therefore, the functional model was expanded with the addition of biomass at induction time $X_{ind}$
$$P_{\max,2019}(\mu, X) = \mu(t) \cdot \left(k_{m0} + k_{m1} \cdot \left(X(t) - X_{ind}\right)\right), \tag{15}$$
where $k_{m0}$ and $k_{m1}$ are tuning parameters.
Other researchers [12] tried one more variation of the maximal product formation model
$$P_{\max,2003}(\mu) = \frac{\mu(t) \cdot k_m}{k_{\mu} + \mu(t) + \frac{\mu^2(t)}{k_{i\mu}}}. \tag{16}$$
Such an approach was based on a rational assumption of what inhibits the maximal product formation rate. As far as we know, no efforts were made to test the different hypotheses of various methods with the same datasets originating from different sources. We propose a method of model selection using the principles of parsimony and convex optimization in this study. This is based on Equations (7) and (8).
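To make the inhibition structure of Equation (16) concrete, the sketch below evaluates the expression in Python and locates its maximum. Setting the derivative with respect to $\mu$ to zero gives $k_{\mu} - \mu^2 / k_{i\mu} = 0$, so the rate peaks at $\mu = \sqrt{k_{\mu} \cdot k_{i\mu}}$ and declines for larger growth rates. The function names and coefficient values are hypothetical.

```python
import math


def p_max_2003(mu, k_m, k_mu, k_imu):
    """Equation (16): production rate with saturation and inhibition in mu."""
    return mu * k_m / (k_mu + mu + mu ** 2 / k_imu)


def mu_at_peak(k_mu, k_imu):
    """Growth rate at which Equation (16) is maximal: sqrt(k_mu * k_imu)."""
    return math.sqrt(k_mu * k_imu)


# Hypothetical coefficients, for illustration only.
k_m, k_mu, k_imu = 3.0, 0.25, 1.0
best_mu = mu_at_peak(k_mu, k_imu)
for mu in (0.4, best_mu, 0.6):
    print(mu, p_max_2003(mu, k_m, k_mu, k_imu))
```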
With the combined approach of both product synthesis models, we include an expanded protein function model, where $P_{\max} \equiv P_{\max}(t)$ is the hypothesis of a mixture of linearly dependent competing models
$$P_{\max,2013} = \sum_{l=1}^{n_l = 24} P_{\max,l}, \tag{17}$$
where 24 model coefficients represent the parametric set of $k_t, k_0 \ldots k_{22}$, as defined in
$$\begin{aligned}
P_{\max,2021} ={} & k_0 \cdot \mu_{ind} + \mu \left(k_1 \left(X(t) - X_{ind}\right) + k_3\right) + k_2 \cdot \mu \cdot \overline{Age}_{ind} + k_4 \cdot \mu \left(k_{13} + \mu_{ind}\right) \\
& + X_{ind} \left(k_6 + k_7 \cdot \overline{Age}\right) + k_8 \cdot \overline{Age} + \overline{Age}_{ind} \left(k_{10} + k_{11} \cdot \mu_{ind}\right) + k_{12} \cdot \mu_{ind}^2 \\
& + \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + \frac{k_{17} \cdot \mu}{k_{19} + \mu} + \frac{k_{18} \cdot \mu \cdot \overline{Age}}{k_{21} + \overline{Age}} + \frac{k_{22} \cdot \mu \cdot \overline{Age}_{ind}}{k_5 + \overline{Age}_{ind}} + \frac{k_9 \cdot \mu}{k_{14} + \mu + \frac{\mu^2}{k_{15}}}.
\end{aligned} \tag{18}$$
Here, $k_t, k_0 \ldots k_{22}$ are the optimization parameters of the model to be established. All of them are set to zero at the start of the convex search. The subset of linear terms represents the linear term of Equation (18), and some of them are the basis of Monod's formulation theories [40,41]. The matches are depicted in Table 4.
The novelty of this study is the proposed average cell age at induction time $\overline{Age}_{ind}$. As the researchers [38,39] did not study the recombinant bioprocess in their work, the effect of IPTG injection has so far not been assessed. Based on the experimental data, we deduced that the average cell age and specific growth rate during the induction time are the most significant parameters to consider when creating a protein formation model.

5. System Identification and Parameter Estimation

5.1. Average Cell Age at the Induction

Historically, mathematical bioprocess models have considered only external state variables that affect product biosynthesis. For this reason, traditional models show frequent inconsistency when validating theoretical knowledge with empirical data. To improve the accuracy and applicability of the model, we considered variations in the physiological state of the microorganisms, including, but not limited to, their physical age, similarly to the developments made in the 1970s [42]. Consequently, we express the average cell age at induction time ($t_{ind}$) as
$$\overline{Age}_{ind} \equiv \overline{Age}(t_{ind}) = \frac{X_0 \cdot t_{ind} + \int_0^{t_{ind}} (t_{ind} - t_j) \cdot \dot{X}(t_j) \, dt_j}{X_{ind}}. \tag{19}$$
The use of cell age relies on two main assumptions. The first is that the total biomass does not produce the specific product, only its physiologically active part. The second is that the activity of the biomass depends on its age. Therefore, through our modeling, we can predict that the cells produce the specific product throughout a particular period, during which there is an average cell age that would lead to maximal production. This also relates to induction, at which point the cells have already reached a certain age.
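The average cell age at induction can be approximated from sampled biomass values, following the discretization in Equation (13) with the initial biomass term of Equation (19). The sketch below is a minimal Python illustration under two assumptions that are not stated in the source: the growth of each sampling interval is assigned to the end time of that interval, and the first sample is taken at inoculation. The function name and the sample data are hypothetical.

```python
def average_cell_age(t, x, x0=0.0):
    """Discretized average cell age at the last sample time t[-1].

    t  : sample times, with t[0] taken as the inoculation time (assumption)
    x  : biomass at each sample time
    x0 : initial biomass that is already present at inoculation
    """
    t_i = t[-1]
    # Cells present at inoculation have aged over the whole elapsed time.
    weighted_age = x0 * (t_i - t[0])
    for j in range(1, len(t)):
        delta_x = x[j] - x[j - 1]  # growth during the j-th interval
        # Assumption: the increment is assigned to the interval end t[j].
        weighted_age += (t_i - t[j]) * delta_x
    return weighted_age / x[-1]


# Hypothetical biomass samples up to the induction time t_ind = 2 h.
times = [0.0, 1.0, 2.0]
biomass = [1.0, 2.0, 4.0]
print(average_cell_age(times, biomass, x0=1.0))  # 0.75
```

In this toy case one unit of biomass is 2 h old, one unit is 1 h old, and two units are new, so the average over four units is 0.75 h.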

5.2. Model of Product Model Fitting

Following the presented changes, the previously described relative protein synthesis Equation (11) has a more general presentation
$$\frac{dP_X}{dt} \equiv q_{px}(\mu, P_X) = P_{\max}(\mu, X, t) - k_t \cdot P_X. \tag{20}$$
Furthermore, its integral form at time $t$ becomes
$$P_X(t) = \int_{t_0}^{t} P_{\max}(t^*) \, dt^* - k_t \cdot \int_{t_0}^{t} P_X(t^*) \, dt^*, \tag{21}$$
where the integrals are evaluated as left-hand Riemann sums [11,43]. Finally, the protein model for pseudo-global offline fitting takes the form
$$P_i = \left(\sum_{j=1}^{i} P_{\max,j} \cdot \Delta t_{j,j-1} - k_t \cdot \sum_{j=1}^{i-1} P_{X,j} \cdot \Delta t_{j,j-1}\right) \cdot \frac{X_i}{1 + \Delta t_{i,i-1} \cdot k_t}. \tag{22}$$
In Equation (22), the discrete protein values define the variable $P_{X,i} \equiv P_X(t_i)$, where the sample observed at time $t$ is indexed by $i$, and $i \in [1, n_i]$.
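Equation (22) follows from the discretized Equation (21) by moving the $i$-th term of the second sum to the left-hand side, which produces the factor $1 + \Delta t_{i,i-1} \cdot k_t$ in the denominator; multiplying the resulting $P_{X,i}$ by $X_i$ returns the protein concentration. The Python sketch below evaluates this recursion for every sample. The function name and the input values are hypothetical, and element 0 of each list is assumed to mark the start of product synthesis, where $P_X = 0$.

```python
def protein_profile(p_max, x, t, k_t):
    """Evaluate Equation (22) for samples i = 1 .. n.

    p_max : maximal specific production rate at each sample time
    x     : biomass concentration at each sample time
    t     : sample times; index 0 is the start of synthesis (assumption)
    k_t   : time constant of the self-inhibiting term
    """
    sum_pmax = 0.0  # running sum of P_max,j * dt_j for j = 1 .. i
    sum_px = 0.0    # running sum of P_X,j * dt_j for j = 1 .. i - 1
    profile = []
    for i in range(1, len(t)):
        dt = t[i] - t[i - 1]
        sum_pmax += p_max[i] * dt
        p_x = (sum_pmax - k_t * sum_px) / (1.0 + dt * k_t)
        sum_px += p_x * dt  # becomes available to the next sample
        profile.append(p_x * x[i])
    return profile


# Hypothetical inputs, for illustration only.
times = [0.0, 1.0, 2.0, 3.0]
print(protein_profile([0.0, 2.0, 2.0, 2.0], [1.0, 1.0, 2.0, 4.0], times, k_t=0.0))
print(protein_profile([0.0, 2.0, 2.0, 2.0], [1.0, 1.0, 1.0, 1.0], times, k_t=1.0))
```

With $k_t = 0$ the specific protein activity simply accumulates; a positive $k_t$ damps every step, which is the self-inhibiting effect described for Equation (11).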

5.3. Pseudo-Global Offline Identification of Model Parameters

Before selection, each model requires pseudo-global parameter identification. The identification process of protein model fitting coefficients consists of the convex optimization method and the maximization of entropy [28,44,45]. Based on Bayesian analysis, the posterior distribution for the i-th offline sample is expressed as
$$P_{posterior}(P_i) \sim N(P_i, \sigma_P^2), \tag{23}$$
where $\sigma_P^2$ is the constant variance for every sampled prediction $i$. Similarly, the prior distribution has the following form
$$P_{likelihood}(P_i) \sim N(P_i^y, \sigma_{P,i}^2), \tag{24}$$
where $P_i^y$ is the $i$-th observed value of product concentration with an individual variance $\sigma_{P,i}^2$. Having both distributions leads to a simplified form of relative entropy, which serves as a likelihood function for the posterior,
$$L_i \equiv S_i(P_{posterior}, P_{likelihood}) = -\frac{(P_i - P_i^y)^2}{2 \cdot \sigma_{P,i}^2} + c. \tag{25}$$
In a previous study, we neglected coefficient $c$ in favor of a separate tuning coefficient $K_{exp}$ ($0 \le K_{exp} \le 2$) [11,29]. The coefficient is implemented to adjust for trade-offs between the least squares and mean absolute percentage error approaches. Such a combination takes advantage of both criteria. With the addition of $K_{exp}$, the expression of relative entropy becomes
$$L_i = -\frac{(P_i - P_i^y)^2 \cdot (1 - K_{exp})}{2 \cdot P_i^{y,2}} - \frac{(P_i - P_i^y)^2 \cdot K_{exp}}{2}. \tag{26}$$
The process of model fitting uses the former equation to identify the product model's parameters. The use of convex optimization with parsimony assumptions allows the entropy measure to indicate local extremums and derive a sufficient computational processing time [28]. For simplicity, and given that the protein content did reach high concentrations, $K_{exp}$ was set to 2 in this study. Therefore, the residual sum of squares denotes the squared sum, which thus represents the likelihood in the ensuing text.

6. Model Selection Based on Experimental Model Calibration

We analyzed two datasets in this study, derived from different samples from two independent sites. The first repository consisted of 46 independent experiments and, in total, $n_{i,I} = 196$ readings. The other dataset, from the second site, contained 24 unique biosyntheses and, in total, $n_{i,II} = 131$ protein observations. To use a single $RSS$ with $n_i = n_{i,I} + n_{i,II}$ in the same model selection routine, we picked a normalized form by reusing two sums of squared residuals ($RSS_I$ and $RSS_{II}$) for each site
$$RSS = \frac{n_{i,II} \cdot RSS_I + n_{i,I} \cdot RSS_{II}}{n_i}. \tag{27}$$
This allowed for distributing the average variances of the estimates evenly over both sites' repositories. After the maximization of Equation (26), a convex search of the data from previous studies gave the results shown in Table 5. To check for errors at the beginning of product synthesis, we added the mean absolute error (MAE) criterion to the evaluation:
$$MAE = \frac{\sum_{i=1}^{n} \left| P_i - P_i^y \right|}{n}. \tag{28}$$
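The two evaluation quantities in Equations (27) and (28) reduce to a few lines of Python. In the sketch below, the sample counts 196 and 131 are the ones reported for the two sites, whereas the per-site RSS values and the protein readings are hypothetical placeholders; the function names are likewise illustrative.

```python
def combined_rss(rss_1, n_1, rss_2, n_2):
    """Equation (27): each site's RSS is weighted by the other site's count."""
    return (n_2 * rss_1 + n_1 * rss_2) / (n_1 + n_2)


def mean_absolute_error(predicted, observed):
    """Equation (28): average absolute deviation between P_i and P_i^y."""
    return sum(abs(p - y) for p, y in zip(predicted, observed)) / len(observed)


# Site sample counts come from the text; the RSS values are hypothetical.
print(combined_rss(rss_1=10.0, n_1=196, rss_2=4.0, n_2=131))

# Hypothetical predictions and observations, for illustration only.
print(mean_absolute_error([1.0, 2.0, 4.0], [1.0, 3.0, 2.0]))  # 1.0
```

The cross-weighting means that the larger dataset does not dominate the combined measure simply because it contributes more squared residuals.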
At first glance, according to the AIC in Table 5, the investigation from 2019 [11] improved on the studies from 1999 [38,39] and 2003 [12]. In turn, the study of 2003 [12] improved upon the AIC of 1999 [38,39]. However, according to the MAE criterion, which is more relevant to product formation, the oldest assumption in the literature [38,39] is more powerful than the newer findings derived over 20 years later. Moreover, if the AIC were to be followed literally, the overfitting of the overall model would have been favored, as the last row of Table 5 demonstrates. This observation led us to further study the product formation model, and to search for better ways of selecting a model that has fewer parameters and avoids overfitting by design.
First of all, there is a possible value for the maximum number of coefficients ($k_{\max}$) that asymptotically makes the entropic criteria work the same way as the original AIC and BIC measures. The maximization of correlation between AIC and $S_A$ (Equation (4)), and then $S_B$ (Equation (5)), generates corresponding $k_{\max}$ values $k_{AIC,A}$ and $k_{AIC,B}$, which are shown in Table 6.
Similarly, maximizing the linear relationship between BIC and $S_A$, and then $S_B$, provides the data for Table 7. We asymptotically tuned both AIC and BIC on the sum of correlations of 33 models, which together comprised a specific subset of Equation (18). We tried more reproductions with different assumptions in this study. However, those 33 representations comprising Equation (18) are the best set, according to our investigation experience. The maximal parametric complexity we tried was $k \le 6$ in this study.
Table 6 and Table 7 both show that each entropic measure of $S$ is a more generic quantity that can help restrict the number of expected state variables, thus helping with upstream CDMO development in the biopharmaceutical industry. Typically, two to four coefficients are preferred in optimal control routines, because the degree of freedom in Hamiltonians intensifies computational requirements. The main reason for this is that, frequently, Hamiltonians are solved numerically or using hybrid approaches, of which arithmetic processing still represents an extensive part. As such, we present experimental findings for a maximal number of model parameters of $k_{\max} = k_{AIC} = k_{BIC} = 450$, unless specifically stated otherwise.
Before proceeding with model selection, we must check the significance of the tuned model parameters individually. We select $k_t$ and two other coefficients with state variables and a significant history [11,12,38,39], which we found to be the best descriptors.
The specific growth rate at time of induction is the most significant parameter from a singleton analysis perspective, as Table 8 shows. This table offers two insights:
(a)
There is significant doubt that $k_t$ belongs to the descriptor set;
(b)
Even if the specific growth rate surpasses the average cell age, the significance of either is still relatively similar. Therefore, there is a high chance that both of them combine in a single nonlinear relationship that is proportional to the maximum product formation rate.
Such thinking led us to construct the maximum product expression, as in Equation (18). We will use the maximum number of models assessed during our criterion asymptotic analysis, and set $k_{\max} = 33$. The four best model equations that derive from Equation (18) are
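$$P_{\max} = k_0 \cdot \left(\mu_{ind} - \mu_{ind}^2\right) + \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + k_{16} - k_{16} \cdot \mu \quad \text{and} \quad k_t = 0, \tag{29}$$
$$P_{\max} = k_0 \cdot \mu_{ind} + \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + k_{16} - k_{16} \cdot \mu \quad \text{and} \quad k_t = 0, \tag{30}$$
$$P_{\max} = k_0 \cdot \mu_{ind} + \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + k_{16} \quad \text{and} \quad k_t = 0, \tag{31}$$
$$P_{\max} = \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + k_{16} \quad \text{and} \quad k_t = 0. \tag{32}$$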
Table 9 depicts the parameter values of the models in Equations (29)–(32).
The second additive term in Equations (29)–(31), which is also the first additive term in Equation (32), is the Monod term, whose coefficients $k_{16}$ and $k_{20}$ carry a specific physiological meaning: the maximum specific target protein formation rate is the product $k_{16} \cdot \mu_{ind}$, and the additive coefficient in the denominator, $k_{20}$, defines the average age at which the product formation rate (represented by the term $k_{16} \cdot \mu_{ind}$) is halved. The perfect average age for inoculation is somewhere between 1.066 h and 1.3 h, at which point product formation has the highest theoretical rate of acceleration. It remains to be determined whether it is a coincidence that the minimum induction time was 1.14 h for the first site and 1.237 h for the second site.
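The half-saturation reading of the Monod term can be checked numerically. The Python sketch below evaluates $k_{16} \cdot \mu_{ind} \cdot \overline{Age} / (k_{20} + \overline{Age})$ and shows that the term equals half of its plateau $k_{16} \cdot \mu_{ind}$ when the average age equals $k_{20}$. The coefficient values are hypothetical and are not the fitted values of Table 9.

```python
def monod_age_term(mu_ind, age, k16, k20):
    """Monod term: saturates at k16 * mu_ind as the average cell age grows."""
    return k16 * mu_ind * age / (k20 + age)


# Hypothetical coefficients, for illustration only.
mu_ind, k16, k20 = 0.4, 2.0, 1.2
plateau = k16 * mu_ind

for age in (0.3, k20, 5.0, 50.0):
    value = monod_age_term(mu_ind, age, k16, k20)
    print(f"age = {age:5.1f} h  term = {value:.4f}  fraction of plateau = {value / plateau:.3f}")
```

Because the term rises steeply only while the average age is of the same order as $k_{20}$, the fitted value of $k_{20}$ indicates the age range in which the timing of induction matters most.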
As the mean absolute error is the smallest for the model with more variables in Equation (29), other maximal counts of model parameters remain to be verified. The asymptotic analysis using $k_{\max} = 6$, which is the maximum number of tested parameters per experiment in this study, suggests the following five alternatives:
$$P_{\max} = k_0 \cdot \mu_{ind} \quad \text{and} \quad k_t = 0, \tag{33}$$
$$P_{\max} = k_8 \cdot \overline{Age} \quad \text{and} \quad k_t = 0, \tag{34}$$
$$P_{\max} = \frac{k_{16} \cdot \mu_{ind} \cdot \overline{Age}}{k_{20} + \overline{Age}} + k_{16} - k_{16} \cdot \mu \quad \text{and} \quad k_t = 0, \tag{35}$$
$$P_{\max} = k_0 \cdot \mu_{ind} \quad \text{and} \quad k_t = 0.447, \tag{36}$$
$$P_{\max} = k_8 \cdot \overline{Age} \quad \text{and} \quad k_t = 2.059. \tag{37}$$
Table 10 shows another alternative set of coefficients, which verify that the average age has a more substantial effect at the start of product formation. Thus far, Equation (29) gives the best estimate of the total product.
There is still one model to consider, which can improve MAE to 0.424
$$P_{\max,2021} = \mu \left(k_1 \left(X(t) - X_{ind}\right) + k_3\right) \quad \text{and} \quad k_t = 0.112, \; k_1 = 0.00243, \; k_3 = 0.074. \tag{38}$$
However, this model’s RSS is poor, at 14.826. Further increasing the number of parameters starts to reduce the MAE due to overfitting.

7. Discussion and Conclusions

The results of the model selection and the application of enhanced AIC show two things:
(a)
As regards rational, practical benefits, the proposed entropic measures can help with tuning the maximum count of the model parameters, thus helping devise standardized CDMO procedures for attaining higher product yields from biopharmaceutical efforts;
(b)
Secondly, both average age and biomass growth values at time of induction, or in other words, at the very start of product synthesis, are crucial. Therefore, the combined model employing Monod structures is the best recommendation for maximizing the total product yield.
Similar to the Akaike information criterion, the Bayesian information criterion can also be viewed as a particular asymptotic enhancement of the entropic expansion of AIC. Such an approach avoids altering the likelihood or reorganizing the experiments. Instead, it brings the benefit of adjustability in the maximum number of expected coefficients. Moreover, two entropic values are available for scientists to exploit: relative entropy and Shannon entropy. The experimental model fitting was performed simultaneously on 46 experiments at the first site and 24 fed-batch experiments at the second site. The two sites contributed 196 and 131 protein samples, respectively, giving a total of 327 target product concentration samples taken from the bioreactor medium.
Regarding the physiological characteristics of any aerobic microbial system, we observed that the average cell age and the inhibition coefficient are both more relevant, and describe the model better, at the very beginning of product biosynthesis. The specific growth rate, in turn, improves upon the latter overall when considering the total (recombinant target protein) expression at the end of the experiments.

Author Contributions

Conceptualization, R.U.; Methodology, R.U.; Software, R.U.; Validation, R.U., B.K. and R.S.; Formal analysis, R.U.; Investigation, R.U. and R.S.; Resources, R.U.; Data curation, R.S.; Writing—original draft preparation, R.U. and B.K.; Writing—review and editing, R.U. and R.S.; Visualization, B.K.; Supervision, R.S.; Project administration, R.U.; Funding acquisition, R.U. All authors have read and agreed to the published version of the manuscript.

Funding

This project received funding from the European Regional Development Fund (project no. 01.2.2-LMT-K-718-03-0039) under a grant agreement with the Research Council of Lithuania (LMTLT).

Institutional Review Board Statement

Not applicable.

Data Availability Statement

Data sharing does not apply to this article.

Conflicts of Interest

The authors declare no conflict of interest.

References

  1. Goodwin, G. Predicting the Performance of Soft Sensors as a Route to Low Cost Automation. Annu. Rev. Control 2000, 24, 55–66.
  2. Randek, J.; Mandenius, C.-F. On-Line Soft Sensing in Upstream Bioprocessing. Crit. Rev. Biotechnol. 2018, 38, 106–121.
  3. Sagmeister, P.; Wechselberger, P.; Jazini, M.; Meitz, A.; Langemann, T.; Herwig, C. Soft Sensor Assisted Dynamic Bioprocess Control: Efficient Tools for Bioprocess Development. Chem. Eng. Sci. 2013, 96, 190–198.
  4. Luttmann, R.; Bracewell, D.G.; Cornelissen, G.; Gernaey, K.V.; Glassey, J.; Hass, V.C.; Kaiser, C.; Preusse, C.; Striedner, G.; Mandenius, C.-F. Soft Sensors in Bioprocessing: A Status Report and Recommendations. Biotechnol. J. 2012, 7, 1040–1048.
  5. Simutis, R.; Galvanauskas, V.; Levisauskas, D.; Repsyte, J.; Vaitkus, V. Comparative Study of Intelligent Soft-Sensors for Bioprocess State Estimation. J. Life Sci. Technol. 2013, 1, 163–167.
  6. Zhang, H. Software Sensors and Their Applications in Bioprocess. In Computational Intelligence Techniques for Bioprocess Modelling, Supervision and Control; de Nicoletti, M.C., Jain, L.C., Eds.; Springer: Berlin/Heidelberg, Germany, 2009; Volume 218, pp. 25–56.
  7. de Azevedo, S.F.; Dahm, B.; Oliveira, F.R. Hybrid modelling of biochemical processes: A comparison with the conventional approach. Comput. Chem. Eng. 1997, 21, S751–S756.
  8. Wiechert, W.; Noack, S. Mechanistic pathway modeling for industrial biotechnology: Challenging but worthwhile. Curr. Opin. Biotechnol. 2011, 22, 604–610.
  9. Kager, J.; Herwig, C.; Stelzer, I.V. State estimation for a penicillin fed-batch process combining particle filtering methods with online and time delayed offline measurements. Chem. Eng. Sci. 2018, 177, 234–244.
  10. Gnoth, S.; Simutis, R.; Lübbert, A. Selective expression of the soluble product fraction in Escherichia coli cultures employed in recombinant protein production processes. Appl. Microbiol. Biotechnol. 2010, 87, 2047–2058.
  11. Urniezius, R.; Survyla, A. Identification of Functional Bioprocess Model for Recombinant E. Coli Cultivation Process. Entropy 2019, 21, 1221.
  12. Levisauskas, D.; Galvanauskas, V.; Henrich, S.; Wilhelm, K.; Volk, N.; Lübbert, A. Model-based optimization of viral capsid protein production in fed-batch culture of recombinant Escherichia coli. Bioprocess Biosyst. Eng. 2003, 25, 255–262.
  13. San, K.-Y.; Stephanopoulos, G. Studies on on-line bioreactor identification. IV. Utilization of pH measurements for product estimation. Biotechnol. Bioeng. 1984, 26, 1209–1218.
  14. Julier, S.J.; Uhlmann, J.K. Unscented Filtering and Nonlinear Estimation. Proc. IEEE 2004, 92, 401–422.
  15. Giffin, A.; Urniezius, R. The Kalman Filter Revisited Using Maximum Relative Entropy. Entropy 2014, 16, 1047–1069.
  16. de Assis, A.J.; Filho, R.M. Soft sensors development for on-line bioreactor state estimation. Comput. Chem. Eng. 2000, 24, 1099–1103.
  17. Krämer, D.; King, R. On-line monitoring of substrates and biomass using near-infrared spectroscopy and model-based state estimation for enzyme production by S. cerevisiae. IFAC-PapersOnLine 2016, 49, 609–614.
  18. Koch, C.; Posch, A.E.; Goicoechea, H.C.; Herwig, C.; Lendl, B. Multi-analyte quantification in bioprocesses by Fourier-transform-infrared spectroscopy by partial least squares regression and multivariate curve resolution. Anal. Chim. Acta 2014, 807, 103–110.
  19. Sellick, C.A.; Hansen, R.; Jarvis, R.M.; Maqsood, A.R.; Stephens, G.M.; Dickson, A.J.; Goodacre, R. Rapid monitoring of recombinant antibody production by mammalian cell cultures using fourier transform infrared spectroscopy and chemometrics. Biotechnol. Bioeng. 2010, 106, 432–442.
  20. Montague, G.A.; Glassey, J.; Ignova, M.; Paul, G.C.; Kent, C.A.; Thomas, C.R.; Ward, A.C. Hybrid Modelling for On-Line Penicillin Fermentation Optimisation. IFAC Proc. 2002, 35, 395–400.
  21. Bachinger, T.; Riese, U.; Eriksson, R.K.; Mandenius, C.F. Electronic nose for estimation of product concentration in mammalian cell cultivation. Bioprocess Eng. 2000, 23, 637–642.
  22. Golabgir, A.; Herwig, C. Combining Mechanistic Modeling and Raman Spectroscopy for Real-Time Monitoring of Fed-Batch Penicillin Production. Chem. Ing. Tech. 2016, 88, 764–776.
  23. Thibault, J.; van Breusegem, V.; Chéruy, A. On-line prediction of fermentation variables using neural networks: Prediction of Fermentation Variables. Biotechnol. Bioeng. 1990, 36, 1041–1048.
  24. Simutis, R.; Lübbert, A. Hybrid Approach to State Estimation for Bioprocess Control. Bioengineering 2017, 4, 21.
  25. Luedeking, R.; Piret, E.L. A kinetic study of the lactic acid fermentation. Batch process at controlled pH. Biotechnol. Bioeng. 1959, 1, 393–412.
  26. Schaepe, S.; Kuprijanov, A.; Simutis, R.; Lübbert, A. Avoiding overfeeding in high cell density fed-batch cultures of E. coli during the production of heterologous proteins. J. Biotechnol. 2014, 192, 146–153.
  27. Murari, A.; Peluso, E.; Cianfrani, F.; Gaudio, P.; Lungaroni, M. On the Use of Entropy to Improve Model Selection Criteria. Entropy 2019, 21, 394.
  28. Urniezius, R.; Galvanauskas, V.; Survyla, A.; Simutis, R.; Levisauskas, D. From Physics to Bioengineering: Microbial Cultivation Process Design and Feeding Rate Control Based on Relative Entropy Using Nuisance Time. Entropy 2018, 20, 779.
  29. Urniezius, R.; Survyla, A.; Paulauskas, D.; Bumelis, V.A.; Galvanauskas, V. Generic estimator of biomass concentration for Escherichia coli and Saccharomyces cerevisiae fed-batch cultures based on cumulative oxygen consumption rate. Microb. Cell Fact. 2019, 18, 190.
  30. Garcia-Ochoa, F.; Gomez, E.; Santos, V.E.; Merchuk, J.C. Oxygen uptake rate in microbial processes: An overview. Biochem. Eng. J. 2010, 49, 289–307.
  31. Sivashanmugam, A.; Murray, V.; Cui, C.; Zhang, Y.; Wang, J.; Li, Q. Practical protocols for production of very high yields of recombinant proteins using Escherichia coli. Protein Sci. 2009, 18, 936–948.
  32. Çalik, P.; Yilgör, P.; Demir, A.S. Influence of controlled-pH and uncontrolled-pH operations on recombinant benzaldehyde lyase production by Escherichia coli. Enzym. Microb. Technol. 2006, 38, 617–627.
  33. Kocabaş, P.; Çalık, P.; Özdamar, T.H. Fermentation characteristics of l-tryptophan production by thermoacidophilic Bacillus acidocaldarius in a defined medium. Enzym. Microb. Technol. 2006, 39, 1077–1088.
  34. Bohlin, T. Practical Grey-Box Process Identification; Springer: London, UK, 2006.
  35. Babaeipour, V.; Shojaosadati, S.A.; Maghsoudi, N. Maximizing Production of Human Interferon-γ in HCDC of Recombinant E. coli. Iran. J. Pharm. Res. 2013, 12, 563–572.
  36. Galvanauskas, V.; Volk, N.; Simutis, R.; Lübbert, A. Design of Recombinant Protein Production Processes. Chem. Eng. Commun. 2004, 191, 732–748.
  37. Miao, F.; Kompala, D.S. Overexpression of cloned genes using recombinant Escherichia coli regulated by a T7 promoter: I. Batch cultures and kinetic modeling. Biotechnol. Bioeng. 1992, 40, 787–796.
  38. Levisauskas, D.; Plaskute, V. Modeling and Optimization of Secondary Metabolites Production in Fed-Batch Biotechnological Processes Based on Physiologically Active Biomass Concept; Information Technology and Control: Kaunas, Lithuania, 1999; pp. 33–36. ISSN 1392-124X.
  39. Plaskute, V.; Levisauskas, D. Application of hybrid models for prediction and optimization of enzyme fermentation process. Comparative study. Syst. Sci. 2001, 27, 115–123.
  40. Zhao, F.; Heidrich, E.S.; Curtis, T.P.; Dolfing, J. The Effect of Anode Potential on Current Production from Complex Substrates in Bioelectrochemical Systems: A Case Study with Glucose. Appl. Microbiol. Biotechnol. 2020, 104, 5133–5143.
  41. Monod, J. The Growth of Bacterial Cultures. Annu. Rev. Microbiol. 1949, 3, 371–394.
  42. Bell, G.I.; Anderson, E.C. Cell Growth and Division. Biophys. J. 1967, 7, 329–351.
  43. Swokowski, E.W. Calculus with Analytic Geometry, 2nd ed.; Prindle, Weber & Schmidt: Boston, MA, USA, 1979; ISBN 978-0-87150-268-1.
  44. Urniezius, R. Convex programming for semi-globally optimal resource allocation. In AIP Conference Proceedings; AIP Publishing: Beirut, Lebanon, 2016; p. 040002.
  45. Giffin, A.; Urniezius, R. Simultaneous State and Parameter Estimation Using Maximum Relative Entropy with Nonhomogenous Differential Equation Constraints. Entropy 2014, 16, 4974–4991.
Table 1. Examples of different modeling techniques for product estimation.

| Model Type | Model Structure | Comment | Product (Soluble) | Product (Insoluble) | Reference |
|---|---|---|---|---|---|
| Conventional (based on balance equations) | Balance of production rate | Assessment of dilution and product concentration, hard to distinguish between estimation and prognostication | Penicillin V | - | [9] |
| Conventional (based on balance equations) | Balances of specific substrate uptake and growth rate | A hybrid model provides better results than a traditional one | Recombinant protein | - | [10] |
| Conventional (based on balance equations) | Balances of biomass, specific growth rate, production rates | - | - | Recombinant protein | [11] |
| Conventional (based on balance equations) | Balance of biomass, specific growth rate, and protein activity | Optimization for maximal protein using induction time and feed profiles | Recombinant protein | | [12] |
| Conventional (based on balance equations) | Balance of biomass, pH, added ammonia | - | Ethanol | - | [13] |
| Conventional (based on balance equations) | Spectroscopy data analysis with EKF | - | Ethanol | - | [17] |
| Empirical (data driven) | Spectroscopy data analysis with PLS | - | Penicillin V | | [18] |
| Empirical (data driven) | Spectroscopy data analysis with PCA | - | - | Recombinant antibodies from mammalian cells | [19] |
| Empirical (data driven) | Off-gas analysis with ANN | Gas sensors suffer from signal drift which requires additional compensation | - | Recombinant human blood coagulation factor VIII | [21] |
| Hybrid | ANNs for product formation rate and specific growth rate | - | Recombinant protein | | [10] |
| Hybrid | ANN for dissolved oxygen assessment | The assumption is valid only when the PID parameters for controlling the DO circuit are unchanged | Penicillin | | [20] |
| Hybrid | ANN with inputs of biomass, dilution rate, etc. | - | Ethanol | | [23] |
| Hybrid | Support vector regression for observations of oxygen uptake, carbon production, and base consumption rates | The presented model is for prediction, not for pseudo-global estimation | - | Recombinant protein | [24] |
Table 2. The cultivation conditions of Site 1 and Site 2 cell strains.

| Condition | Site 1 | Site 2 | Note |
|---|---|---|---|
| Bioreactor Volume | 15 L | 7 L | - |
| Cultivation Type | Fed-batch | Fed-batch | - |
| Temperature Setpoint | 30 °C | 37 °C | Both measured with a PT100 temperature sensor |
| DO Setpoint | 30% | 20% | Both measured with an Ingold DO probe (Mettler Toledo) |
| pH Setpoint | 7 | 6.8 | Both kept constant using a PID controller with the addition of NaOH |
| Stirrer Setpoint Range | 100–1400 RPM | 800–1200 RPM | - |
| Airflow | 0.3–15 L/min | 1.75–3.75 L/min | Pure oxygen flow was provided to bioreactors at a range from 0 to 7.5 L/min to increase the oxygen transfer rate |
| Maximum average cell age at induction, hours | 3.105 | 2.985 | - |
| Minimum average cell age at induction, hours | 1.14 | 1.237 | - |
| Off-gas Tracking | Concentrations of O2 and CO2 | Concentration of O2 | Measured with a paramagnetic oxygen sensor (Maihak Oxor 610) during Site 1 cultivations and with BlueSens gas analyzer (BCpreFerm, BlueSens, Herten, Germany) during Site 2 cultivations. |
Table 3. Hypothetical dependencies of the maximum specific product formation rate.

| $P_{max}$ Arguments | State Variables | Reference(s) | Equation |
|---|---|---|---|
| $a_1, a_2, a_3, a_4$ | $\mu(t)$, $X(t)$ or $\overline{Age}(t)$ | 1999, [38,39] | (14) |
| $X_{ind}, k_{m0}, k_{m1}$ | $\mu(t)$, $X(t)$ | 2019, [11] | (15) |
| $k_{m0}, k_{\mu}, k_{i\mu}$ | $\mu(t)$ | 2003, [12] | (16) |
Table 4. Product formation rate dependencies that are part of Equation (18).

| $P_{max}$ Arguments | State Variables | Model Selection Arguments in This Study | Reference(s) |
|---|---|---|---|
| $a_1, a_2, a_3, a_4$ | $\overline{Age}(t)$ | $k_8$ | 1999, [38,39] |
| $X_{ind}, k_{m0}, k_{m1}$ | $\mu(t)$, $X(t)$ | $k_1, k_3$ | 2019, [11] |
| $k_{m0}, k_{\mu}, k_{i\mu}$ | $\mu(t)$ | $k_9, k_{14}, k_{15}$ | 2003, [12] |
| $Age_{ind}$, $\mu_{ind}$, etc. | $\mu(t)$, $X(t)$, $\overline{Age}(t)$ | $k_t$, $k_0 \ldots k_{22}$ | 2021/this study |
Table 5. Product’s AIC, RSS, and MAE statistics in each historical study.

| AIC | RSS | MAE | k | Model Selection Arguments | Reference(s) |
|---|---|---|---|---|---|
| −967.01 | 16.79 | 0.393 | 2 | $k_t \approx 2.06$, $k_8 \approx 0.01176$ | 1999 [38,39] |
| −1005.6 | 14.83 | 0.424 | 3 | $k_t \approx 0.112$, $k_1 \approx 0.00243$, $k_3 \approx 0.074$ | 2019 [11] |
| −977.17 | 16.07 | 0.442 | 4 | $k_t \approx 0.321$, $k_9 \approx 0.01193$, $k_{14} \approx 0.000473$, $k_{15} \approx 0.1677$ | 2003 [12] |
| −1488.16 | 3.15 | 0.249 | 24 | $k_t \approx 0.209$, $k_0 \ldots k_{22}$ | Full overfit with Equation (18) |
Table 6. Product’s AIC as an asymptotic assessment of entropic measures $S_A$ and $S_B$.

| AIC | $\ln S_A$ ($k_{AIC,A} \approx 830$) | $\ln S_B$ ($k_{AIC,B} \approx 450$) | Reference(s) |
|---|---|---|---|
| −967.01 | 10.583 | 9.968 | 1999 [38,39] |
| −1005.6 | 10.419 | 9.802 | 2019 [11] |
| −977.17 | 10.53 | 9.9117 | 2003 [12] |
Table 7. Product’s BIC as an asymptotic assessment of entropic measures $S_A$ and $S_B$.

| BIC | $\ln S_A$ ($k_{BIC,A} \approx 300$) | $\ln S_B$ ($k_{BIC,B} \approx 172$) | Reference(s) |
|---|---|---|---|
| −959.430 | 9.573 | 9.008 | 1999 [38,39] |
| −994.228 | 9.417 | 8.848 | 2019 [11] |
| −962.013 | 9.529 | 8.956 | 2003 [12] |
Table 8. Significance test for single parameters.

| Parameter and Its Value | State Variable or Argument | AIC | BIC | $\ln S_{AIC,A}$ | $\ln S_{AIC,B}$ |
|---|---|---|---|---|---|
| $k_t \approx 53.9$ | P X(t) | −591.28 | −587.49 | - | 10.5 |
| $k_0 \approx 0.0159$ | $\mu_{ind}$ | −936.78 | −932.99 | 9.145 | 9.138 |
| $k_8 \approx 0.001384$ | $\overline{Age}(t)$ | −905.04 | −901.25 | 9.273 | 9.267 |
Table 9. Parameter values for the significance test at $k_{max} = 33$.

| Equation | $k_0$ | $k_{16}$ | $k_{20}$ | $\ln S_{AIC,A}$ | RSS | MAE | k |
|---|---|---|---|---|---|---|---|
| (29) | −0.13 | 0.0232 | −1.066 | 6.869 | 7.279 | 0.399 | 3 |
| (30) | −0.0375 | 0.01148 | −1.244 | 6.979 | 8.723 | 0.432 | 3 |
| (31) | −0.0337 | 0.0098 | −1.261 | 6.998 | 8.970 | 0.462 | 3 |
| (32) | 0 | 0.00298 | −1.302 | 7.099 | 11.782 | 0.579 | 2 |
Table 10. Parameter values for the significance test with $k_{max} = 6$.

| Equation | $k_0$ | $k_8$ | $k_{16}$ | $k_{20}$ | $\ln S_{AIC,A}$ | RSS | MAE | k |
|---|---|---|---|---|---|---|---|---|
| (33) | 0.0159 | 0 | 0 | 0 | 5.976 | 18.524 | 0.639 | 1 |
| (34) | 0 | 0.00138 | 0 | 0 | 6.047 | 20.412 | 0.497 | 1 |
| (35) | 0 | 0 | 0.00321 | −1.298 | 6.054 | 11.857 | 0.577 | 2 |
| (36) | 0.0453 | 0 | 0 | 0 | 6.109 | 16.475 | 0.603 | 2 |
| (37) | 0 | 0.01176 | 0 | 0 | 6.114 | 16.785 | 0.393 | 2 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Share and Cite

MDPI and ACS Style

Urniezius, R.; Kemesis, B.; Simutis, R. Bridging Offline Functional Model Carrying Aging-Specific Growth Rate Information and Recombinant Protein Expression: Entropic Extension of Akaike Information Criterion. Entropy 2021, 23, 1057. https://doi.org/10.3390/e23081057
