1. Introduction
Among patients infected with SARS-CoV-2, the detection of symptoms, clinical signs, and laboratory findings associated with poor outcome is crucial to identify those at high risk of clinical deterioration or death. Although many studies investigating risk factors and prediction models for coronavirus disease 2019 (COVID-19) have been published, recent literature indicates that the proposed models are at high risk of bias [
1]. Since their performance is probably overestimated, only a few of these prediction models are recommended for use in current practice [
1,
2]. In 2020, two promising risk prediction models for deterioration and mortality in patients hospitalized with COVID-19 were published, based on data from 260 hospitals in England, Scotland, and Wales [
3,
4]. The authors thoroughly developed and validated these models but suggested their usability for clinical decision making only upon successful validation and potential updating in other settings [
5]. The 4C (Coronavirus Clinical Characterisation Consortium) Deterioration Model and Mortality Score have been shown to be valid prediction tools for clinical deterioration and in-hospital mortality and outperformed other risk stratification tools [
3,
4].
The 4C Deterioration Model is a multivariable logistic regression model developed to predict in-hospital clinical deterioration among hospitalized adults with highly suspected or confirmed COVID-19 using 11 variables [
3].
The 4C Mortality Score was developed and validated to predict in-hospital mortality in patients with COVID-19 and includes eight parameters routinely available at hospital admission. The score ranges from 0 to 21 points (
Table S1 shows how the score is calculated). Patients with a score of at least 15 had a 62% mortality compared with 1% mortality for those with a score of three or less [
4].
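For illustration only, the reported score bands can be expressed as a minimal Python sketch. The function name and the labelling of intermediate scores are our own; computing the score itself requires the point assignments in Table S1.

```python
def mortality_band(score: int) -> str:
    """Map a 4C Mortality Score (0-21 points) to the in-hospital mortality
    observed for the extreme score bands in the derivation cohort.
    Computing the score itself requires the point table (Table S1)."""
    if not 0 <= score <= 21:
        raise ValueError("4C Mortality Score must lie between 0 and 21")
    if score <= 3:
        return "low risk (~1% observed mortality)"
    if score >= 15:
        return "very high risk (~62% observed mortality)"
    return "intermediate risk (see Table S1 and risk-group definitions)"
```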
Both models showed good discrimination and calibration, but their generalisability remains to be tested. As patients’ characteristics and healthcare systems differ significantly among countries, which might affect the accuracy of prediction, it remains to be determined whether the prediction tools also work in populations outside the UK. In this study, we first aimed to externally validate the modified 4C Deterioration Model and 4C Mortality Score in a cohort of Swiss patients. Second, we evaluated whether the inclusion of the neutrophil-to-lymphocyte ratio (NLR) improves the predictive performance of the models since, as shown by our recent work and that of others, an elevated NLR identifies COVID-19 patients at risk of clinical deterioration and mortality [
6,
7].
2. Methods
2.1. Design and Setting
This retrospective single-centre cohort study was performed at the City Hospital Zurich Triemli in Switzerland. For reporting, we adhered to the transparent reporting of a multivariable prediction model for individual prediction or diagnosis (TRIPOD) guidelines [
8].
2.2. Study Population and Data Collection
Informed consent was obtained from the majority of patients with COVID-19. A surrogate permission (according to Art. 34 HFV) was granted by the cantonal ethics committee Zurich, Switzerland (BASEC-Nr. 2020-01852), for patients from whom no consent could be obtained. The study was conducted in accordance with the Declaration of Helsinki.
Adult patients with a positive SARS-CoV-2 swab result obtained by real-time reverse-transcription polymerase chain reaction (RT-PCR; CDC 2019-nCoV RT-PCR assay) who had been admitted to our hospital between 27 February and 31 December 2020 were included in our study. Patients transferred from another hospital were excluded from both models if they had been treated for COVID-19 for more than two days prior to transfer. Furthermore, patients were excluded from the 4C Deterioration Model if they met the definition of in-hospital deterioration upon arrival at our hospital. Patients with nosocomial infection were excluded from the validation of the 4C Mortality Score [
4].
The vital signs were collected separately for the two prediction models: For the Deterioration Model, the first values upon admission were taken. If the SARS-CoV-2 infection was diagnosed later during the hospital stay, we used the values on the day of the positive PCR, as suggested by the authors of the original publication [
3]. For the Mortality Score, we used values from hospital admission, regardless of whether the patients had been previously hospitalized. For the Deterioration Model, the peripheral oxygen saturation (SpO2) was measured with and without oxygen supplementation. For the Mortality Score, we only used the SpO2 measured while breathing room air. If only the SpO2 under supplemental oxygen therapy was available, the SpO2 was considered to be below 92%, as suggested by internal guidelines. Chest X-rays and CT scans were interpreted by board-certified radiologists.
2.3. Predictors
Predictor definitions were identical to the definitions used for the original model development, except for urea, which was not assessed. For this reason, we refer to both models as ‘modified’. The parameters used for the Deterioration Model are age, sex, nosocomial infection, Glasgow coma scale score (GCS), SpO2 at admission, breathing room air or oxygen therapy (contemporaneous with SpO2 measurement), respiratory rate, C-reactive protein, lymphocyte count, and presence of radiographic chest infiltrates. The parameters used for the Mortality Score are age, sex, respiratory rate, SpO2, GCS, C-reactive protein, and number of comorbidities. Comorbidities (according to the modified Charlson Comorbidity Index [
9]) collected were chronic cardiac disease, chronic respiratory disease (excluding asthma), chronic renal disease (estimated glomerular filtration rate ≤30 mL/min/1.73 m²), mild to severe liver disease, dementia, chronic neurological disease, connective tissue disease, diabetes, HIV or AIDS, malignancy, and clinician-defined obesity. If the GCS score on admission was missing, we used descriptions from the medical records to deduce whether the score was 15 or below.
In accordance with the methods used in the original publication [
3], restricted cubic splines were used to model continuous predictors included in the 4C Deterioration Model. Knot positions were chosen according to Gupta et al.
For the 4C Mortality Score, the same cut-off values as in Knight et al. [
4] were used to categorize the continuous variables.
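To make the spline construction concrete, the following sketch (our own illustration in Python, not the authors' code) builds a restricted cubic spline basis in Harrell's parameterization; the resulting spline is constrained to be linear beyond the outer knots, which is what distinguishes it from an ordinary cubic spline.

```python
import numpy as np

def rcs_basis(x, knots):
    """Restricted cubic spline basis (Harrell's parameterization).

    Returns a matrix with len(knots) - 1 columns: the linear term x plus
    len(knots) - 2 nonlinear terms. The implied spline is linear beyond
    the outer knots."""
    x = np.asarray(x, dtype=float)
    t = np.asarray(knots, dtype=float)
    k = len(t)
    d = (t[k - 1] - t[0]) ** 2  # scaling factor for numerical stability
    cols = [x]
    for j in range(k - 2):
        # truncated-cubic term with the linear-tail restriction built in
        term = (np.clip(x - t[j], 0, None) ** 3
                - np.clip(x - t[k - 2], 0, None) ** 3
                * (t[k - 1] - t[j]) / (t[k - 1] - t[k - 2])
                + np.clip(x - t[k - 1], 0, None) ** 3
                * (t[k - 2] - t[j]) / (t[k - 1] - t[k - 2]))
        cols.append(term / d)
    return np.column_stack(cols)
```

With four knots, as commonly used for 4C-style models, this yields three basis columns per continuous predictor, which then enter the logistic model as ordinary covariates.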
2.4. Outcomes
For the 4C Deterioration Model, a composite outcome of in-hospital clinical deterioration was defined, comprising any of the following: the need for ventilator support (high flow therapy, non-invasive ventilation, invasive mechanical ventilation, or extracorporeal membrane oxygenation); admission to a high dependency or intensive care unit; or death [
3]. The primary outcome for the 4C Mortality Score was defined as in-hospital death from any cause [
4].
2.5. Statistical Methods
Descriptive statistics included median and interquartile range for the continuous variables and numbers and percentages of the total for the categorical variables.
2.6. Model Validation
For external validation of the modified 4C prediction models, individual risk outcome predictions were made based on the coefficients of the originally published models. We assessed model performance by three different parameters: (i) model discrimination by calculating the area under the receiver operating characteristic curve (AUC) together with a 95% confidence interval (CI) computed as defined by DeLong et al. [
10]; (ii) model calibration, which was inspected visually. The calibration curve was smoothed by fitting restricted cubic splines with 4 knots. In addition, the calibration intercept (target value of 0), also termed calibration-in-the-large, and the slope (target value of 1) were estimated together with 95% CIs; (iii) overall goodness of fit was assessed using the Brier score, a quadratic scoring rule [
11], with lower values indicating better performance. The Brier score depends on the outcome event rate. Therefore, as recommended by Steyerberg et al. [
12], the scaled Brier score was additionally calculated, i.e., the Brier score scaled by its maximum under a non-informative model (a model that predicts a risk equal to the prevalence for all patients); larger values indicate better performance.
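The three performance measures can be sketched as follows (our own Python illustration, not the authors' code). Note one simplification: strictly, calibration-in-the-large is estimated with the slope fixed at 1, whereas this sketch estimates intercept and slope jointly.

```python
import numpy as np

def auc(y, p):
    """Discrimination: area under the ROC curve via pairwise comparison
    (equivalent to the Mann-Whitney U statistic; O(n^2), fine for a sketch)."""
    y = np.asarray(y, float); p = np.asarray(p, float)
    pos, neg = p[y == 1], p[y == 0]
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (len(pos) * len(neg))

def brier_scores(y, p):
    """Overall fit: Brier score and scaled Brier score, i.e. the improvement
    over a non-informative model that predicts the prevalence for everyone."""
    y = np.asarray(y, float); p = np.asarray(p, float)
    brier = np.mean((p - y) ** 2)
    brier_null = np.mean((y.mean() - y) ** 2)
    return brier, 1.0 - brier / brier_null

def calibration(y, p, iters=25):
    """Calibration: logistic regression of the outcome on logit(p), fitted
    by Newton-Raphson (targets: intercept 0, slope 1)."""
    y = np.asarray(y, float); p = np.asarray(p, float)
    lp = np.log(p / (1 - p))
    X = np.column_stack([np.ones_like(lp), lp])
    beta = np.zeros(2)
    for _ in range(iters):
        mu = 1 / (1 + np.exp(-X @ beta))
        w = mu * (1 - mu)
        beta += np.linalg.solve(X.T @ (X * w[:, None]), X.T @ (y - mu))
    return beta  # [intercept, slope]
```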
2.7. Missing Values
Missing values in one or more of the predictor variables were addressed with multiple imputation based on chained equations [
13], under the missingness at random (MAR) assumption. Predictive mean matching using five donors (for each imputation, 1 donor’s observed value is randomly drawn from the 5 candidates to replace the missing value) was used to produce 100 imputed sets (m = 100). The predictors were categorized and transformed after multiple imputation was performed. The distribution of the imputed values for each variable was inspected visually and compared with the distributions of complete cases. For the imputation model, all predictors included in the 4C Deterioration Model and Mortality Score, both outcomes, and additionally dyspnea (y/n), pulmonary disease including asthma (y/n), chronic cardiac disease (y/n), additional infectious disease (y/n), and the disease severity according to the WHO (World Health Organization) classification system [
14] were used. The estimates and uncertainty measures from the imputed data sets were combined using Rubin’s rule [
15].
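Rubin's rule combines the per-imputation estimates by averaging them and adding the between-imputation variance, inflated by a factor of 1 + 1/m, to the average within-imputation variance. A minimal sketch (our own illustration):

```python
import numpy as np

def pool_rubin(estimates, variances):
    """Pool m per-imputation point estimates and their squared standard
    errors with Rubin's rule; returns (pooled estimate, total variance)."""
    q = np.asarray(estimates, float)
    u = np.asarray(variances, float)
    m = len(q)
    qbar = q.mean()            # pooled point estimate
    w = u.mean()               # within-imputation variance
    b = q.var(ddof=1)          # between-imputation variance
    t = w + (1 + 1 / m) * b    # total variance
    return qbar, t
```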
2.8. Model Recalibration
We compared outcome incidences by contrasting the observed prevalence of the outcomes in the Swiss setting with the prevalence in the original setting in which the models were derived. To adjust the two models to the Swiss setting, we updated the baseline risk by re-estimating the models’ intercepts while holding all other coefficients constant. Among others, Morise et al. [
16] showed that errors introduced by factors outside the model could potentially be mitigated by this mathematically simple recalibration approach.
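Intercept-only recalibration amounts to a one-parameter logistic regression in which the original linear predictor enters as a fixed offset; the re-estimated intercept makes the mean predicted risk equal the observed event rate. A hypothetical sketch (not the authors' code):

```python
import numpy as np

def recalibrate_intercept(y, lp, iters=50):
    """Re-estimate the intercept while keeping all other coefficients fixed:
    one-parameter logistic MLE with the original linear predictor `lp`
    (including its published intercept) entering as an offset.
    Returns the additive correction to the published intercept."""
    y = np.asarray(y, float)
    lp = np.asarray(lp, float)
    a = 0.0
    for _ in range(iters):  # one-dimensional Newton-Raphson
        mu = 1 / (1 + np.exp(-(a + lp)))
        a += np.sum(y - mu) / np.sum(mu * (1 - mu))
    return a
```

By the logistic score equation, the fitted correction forces the average predicted probability to match the observed prevalence, which is exactly the "calibration-in-the-large" adjustment.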
2.9. Model Updating
Both the 4C Deterioration Model and the 4C Mortality Score were updated by including NLR as an additional risk factor. To retain maximal information, we refrained from categorizing the variable as had been done for the continuous predictors in the 4C Mortality Score. Instead, we transformed the variable by taking its natural logarithm to make its distribution less skewed. The other coefficients were held constant while the coefficient for NLR was estimated.
The three performance measures mentioned above were also calculated for the updated and recalibrated models. After model recalibration, we evaluated whether the predictor NLR should be incorporated in the models by performing likelihood ratio tests as suggested by Vickers et al. [
17].
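The likelihood ratio test compares the log-likelihoods of the models with and without NLR on one degree of freedom. In our results, the pooled test statistics turn out negative (likely an artifact of combining likelihoods across multiply imputed data sets), in which case the convention p = 1 applies. A stdlib-only sketch (our own illustration; for 1 df the chi-square survival function reduces to erfc(√(x/2))):

```python
import math

def lr_test_1df(loglik_without, loglik_with):
    """Likelihood ratio test for a single added predictor (1 df).

    Returns (statistic, p_value). A non-positive statistic is reported
    with p = 1, as can occur when likelihoods are pooled across multiply
    imputed data sets."""
    stat = 2.0 * (loglik_with - loglik_without)
    if stat <= 0:
        return stat, 1.0
    # chi-square(1 df) survival function: P(X > x) = erfc(sqrt(x / 2))
    return stat, math.erfc(math.sqrt(stat / 2.0))
```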
2.10. Sensitivity Analysis
As a sensitivity analysis, we performed a complete case analysis (patients with one or more missing predictor values were excluded).
3. Results
Between 27 February and 31 December 2020, 605 patients with a positive PCR for SARS-CoV-2 were admitted to City Hospital Zurich Triemli. In total, 31 patients did not consent to the further use of their data. One patient was excluded because the diagnosis was incorrect, and sixteen were excluded because they had been hospitalized for more than two days due to COVID-19 in another hospital prior to transfer to City Hospital Zurich. For the 4C Deterioration Model validation cohort, 11 patients were excluded because they met the definition of deterioration before transfer to our hospital, and for the 4C Mortality Score validation cohort, 30 patients were excluded because they tested positive during the hospital stay. Details are provided in
Figure 1.
3.1. Modified 4C Deterioration Model
In total, 546 patients (201 females, 345 males) were included in the validation cohort for the modified 4C Deterioration Model. The median age (IQR) was 69 (23) years, compared to 75 (24) years in the original study cohort. Demographic and clinical baseline information on patients included in the main analysis and information assessed throughout the hospital stay are shown in
Table 1, stratified by the outcome event (in-hospital deterioration). Baseline characteristics were assessed either at hospital admission or, for patients who tested positive for SARS-CoV-2 after the admission date, at the time of diagnosis. Predictors were handled during modelling exactly as suggested by Gupta et al. (see
Table S2).
The outcome was assessed for all patients. In total, 133 (24.4%) patients met the definition of the composite outcome of in-hospital deterioration. The event rate in the original derivation cohort was higher with 43.2%. Moreover, 89 (16.3%) patients were transferred to an ICU, of whom 62 (11.4%) patients needed ventilator support. In total, 59 (10.8%) patients died. Of the 465 patients with complete predictor information, 105 (22.6%) met the definition of in-hospital deterioration.
Discrimination of the 4C Deterioration Model was 0.78 (95% CI from 0.73 to 0.83) (
Table 2), which is slightly better than the discriminatory performance of the model when applied to the original derivation cohort (AUC = 0.76 (95% CI from 0.75 to 0.77)). The 4C Deterioration Model overestimated the risk of in-hospital deterioration on average, and the risk predictions varied too little (calibration intercept = −0.39 (−0.60 to −0.18), slope = 1.30 (1.02 to 1.58)) (
Figure S1). A slope larger than the reference value 1 suggests that the risk of the outcome event is not extreme enough: patients at high risk of the event tend to receive underestimated risk predictions, whereas patients at low risk of the event tend to receive overestimated risk predictions [
18]. As a measure of overall performance, a Brier score of 0.15 was obtained (lower values indicating better performance). The relative performance improvement compared to an uninformative model (scaled Brier score) was 0.19 (larger values indicating better performance). Performance measures did not change considerably after excluding patients with at least one missing predictor value (
Table 2).
In order to adjust the model to the Swiss setting, it was recalibrated by re-estimating the model’s intercept while holding all other coefficients constant. As a result, a pooled model intercept of 3.64 (95% CI from 3.43 to 3.85) was obtained. All performance measurements for the recalibrated model are listed in
Table S3.
Updating the model with the additional risk factor NLR did not lead to a considerable change in performance (see
Table 2).
Figure 2 shows the ROC curves of the 4C Deterioration Model before and after model updating. Calibration was improved after model updating (calibration intercept = −0.12 (−0.33 to 0.10), slope = 1.35 (1.06 to 1.65)).
Table 3 shows the pooled log odds ratios of the log-transformed variable NLR and the model intercept estimates.
Comparison of the recalibrated and updated model to the recalibrated model without the additional risk factor NLR did not reveal any evidence that the updated 4C Deterioration Model fits the data better than the model without NLR (test statistic = −1.24, p-value = 1).
3.2. Modified 4C Mortality Score
A total of 527 patients (196 females, 331 males) were included in the validation cohort for the 4C Mortality Score. The median age (IQR) was 68 (24) years, which is lower than in the original derivation cohort with 76 (25) years. Moreover, 33% of the patients had 2 or more comorbidities compared to 48% in the original derivation cohort. Demographic and clinical baseline information on patients included in the main analysis and information assessed throughout hospital stay are displayed in
Table 4 stratified by the outcome event (in-hospital mortality). Predictor cut-offs were chosen according to Knight et al. (see
Table S4).
The outcome was assessed for all patients: 55 (10.4%) patients died. The event rate in our cohort was lower than in the original derivation cohort, in which 32.2% of patients died in the hospital.
According to Knight et al., four risk groups with corresponding mortality rates were defined: low risk (score 0–3, mortality rate 1.2%), intermediate risk (score 4–8, 9.9%), high risk (score 9–14, 31.4%), and very high risk (score ≥15, 61.5%). None of the 65 patients belonging to the low-risk group died. We observed an event rate of 2.0% for the 199 patients classified as being at intermediate risk, 17.7% for the 254 patients at high risk, and 66.7% for the 9 patients in the very high-risk group. Except for the highest category, all mortality rates were lower than the mortality rates used to define the risk groups.
Discrimination of the 4C Mortality Score was 0.85 (95% CI from 0.79 to 0.89). Surprisingly, the model showed a considerably better discriminatory performance when applied to our cohort than when applied to the original derivation cohort (AUC = 0.79 (95% CI from 0.78 to 0.79)). The obtained estimate for calibration-in-the-large of −1.00 (95% CI from −1.29 to −0.70) implies that, on average, the risk of in-hospital mortality was slightly overestimated. The calibration slope was estimated to be 1.52 (95% CI from 1.10 to 1.94) (
Figure S2). As a measure of overall performance, a Brier score of 0.09 was obtained (lower values indicating better performance). The relative performance improvement compared to an uninformative model predicting a constant risk (scaled Brier score) was 0.02 (larger values indicating better performance).
When considering only complete cases, the scaled Brier score became negative, indicating a decline in overall performance compared to an uninformative model (see
Table 2).
After adjusting the model to the Swiss setting, a pooled model intercept of −5.20 (95% CI from −5.49 to −4.90) was obtained. The overall performance improved marginally after updating the baseline risk (Brier score before and after recalibration: 0.09 vs. 0.08). All performance measurements for the recalibrated model are provided in
Table S3.
Discrimination declined by updating the model with the additional risk factor NLR (see
Table 2).
Figure 3 shows the ROC curves of the 4C Mortality Score before and after model updating. After updating the model, the 95% CIs of both calibration measurements included their reference values (calibration intercept = −0.27 (−0.57 to 0.02), slope = 1.35 (0.96 to 1.75)).
Table 3 shows the pooled log odds ratios of the log-transformed variable NLR and the model intercept estimates. We compared the recalibrated and updated model to the recalibrated model without the additional risk factor NLR. Similar to the 4C Deterioration Model, there is no evidence for an incremental value of the biomarker NLR (test statistic = −1.47, p-value = 1).
4. Discussion
Since the beginning of the pandemic in 2019, more than 107 prognostic models for predicting deterioration and mortality risk for COVID-19 patients have been published [
1,
19]. However, most of the models are at high risk of bias, model overfitting, and unclear reporting. As such, only a few of them are recommended for use in practice [
1].
In this retrospective single-centre analysis, we externally validated the modified 4C Deterioration Model and Mortality Score in patients with laboratory-confirmed SARS-CoV-2 infection. Although the parameter urea was excluded, and despite lower event rates of in-hospital deterioration and mortality in the Swiss setting, both models performed very well and showed good discrimination ability. Before recalibration, the models overestimated the risk of the outcome events on average. To adjust the models to the Swiss setting, we recalibrated both models by updating the baseline risk. We refrained from re-estimating all model coefficients because of the risk of overfitting. Both models were updated by including the predictor NLR. However, there was no evidence that the inclusion of NLR improved the model fit compared to the original models.
In line with other validation studies [
20,
21,
22,
23,
24,
25], the respective event rate in our cohort was lower than in the original studies. One reason might be that the patients in our cohort were younger than those in the original cohort. Moreover, the lower median baseline levels of C-reactive protein in our cohorts and the lower percentage of patients with two or more comorbidities indicate that our patient population was less ill. Due to a lack of capacity during the first pandemic wave, some patients with severe illness were transferred to other ICUs. In addition, many elderly patients had a healthcare directive documenting that hospitalization or escalation of treatment, including intensive care treatment, should be avoided. The fact that we excluded patients who were transferred from another hospital or who met the criterion of deterioration on hospital admission probably also resulted in a lower event rate in our cohort. Of note, SARS-CoV-2 infection was not the main reason for hospitalization in all our patients. The inclusion of patients in whom SARS-CoV-2 was detected incidentally might also have affected both outcome scores in our cohort. Finally, whereas only patients with a minimal follow-up time of 28 days were included in the original derivation of both models, the median (IQR) follow-up time in our study was only 8 (7) days in both cohorts. However, since deterioration after recovery or even following discharge is uncommon in patients with COVID-19, the shorter follow-up period in our patient cohort is unlikely to affect the outcome data.
The neutrophil-to-lymphocyte ratio is considered a surrogate marker for outcome and systemic hyperinflammation in patients hospitalized due to COVID-19 [
26,
27]. As we confirmed recently, baseline NLR not only identifies patients at high risk for deterioration but also accurately differentiates between high and low mortality risk in patients with COVID-19 [
6,
7]. Surprisingly, we found no evidence of better model performance when NLR was included as a predictor in either the Deterioration Model or the Mortality Score. These results indicate that the two calculators already perform excellently and that the prognostic information carried by NLR is captured by other variables included in the models.
Our study has several limitations. The main limitations are the small sample size, the single-centre design, and the retrospective data collection. As such, information on follow-up after hospital discharge and on urea levels is lacking. Urea was considered to be of importance in the derivation cohort. Conversely, our data emphasize that the performance of both models is not hampered when urea is excluded from the scoring system. This notion is in line with other prediction tools that used a similar approach: for predicting the severity of pneumonia, for example, urea measurement was not found to substantially improve the predictive value of the CURB-65 score compared to CRB-65 (without urea) [
28,
29]. Another limitation is the use of the entire dataset for model recalibration, updating, and subsequent validation. In combination with the limited sample size, such an approach could potentially have led to overfitting.
Our study also has considerable strengths: First, this study represents a comprehensive validation of the 4C Deterioration Model and 4C Mortality Score in a Swiss tertiary hospital setting. Second, we employed state-of-the-art methodology for the validation of prediction models by assessing discrimination (AUC), calibration (intercept and slope), and overall goodness of fit (Brier score). Third, while both scores have already been externally validated in other countries, our data provide the first validation in the Swiss population. Of note, compared to previous studies, the 4C Mortality Score showed the highest discrimination in the Swiss population (
Table 5 and
Table 6). In this context, we showed that the two models reliably predict the risk of deterioration and mortality in hospitalized patients with COVID-19 in a representative sample of the Swiss patient population.