In our previous work [20], we analyzed the vibrational signatures of DNA thin films under different hydration conditions, obtained by controlling the time spent in a vacuum chamber from 3 to 40 min. The main absorption bands of DNA thin films in different hydration states were presented in the range from 1350 to 750 cm−1 and assigned according to the literature. A rough classification of the spectra into two groups was obtained on the basis of a visual categorization of the spectra, a detailed band shape analysis in the phosphate (1150–1000 cm−1) and sugar-phosphate (900–750 cm−1) regions, and an estimate of %B-form calculated from the integrated intensities of the 860, 836, and 805 cm−1 bands: the 3- and 5-min films form the first group, and the 10–40 min films the second. In the first group, the 3- and 5-min films showed distinct spectral signatures in the sugar-phosphate region, 900–750 cm−1, corresponding roughly to 40% and 60% B-form, respectively. The spectral signatures of the 10–40 min films, on the other hand, all showed a higher level of B-form, about 75%, indicating that the conformational transitions end after 10 min. Details of the formation kinetics of the B-form, as well as the relevant calculations, are presented in our previous work [20]. It was further shown that the changes in the phosphate and sugar-phosphate vibrations, namely the 1232, 1089, 1055, 1030, and 765 cm−1 bands, originate from changes in the hydration of the thin films and are mostly unrelated to the conformational changes, which seem to saturate after 10 min (changes in the ≈890, 860, 837, and 805 cm−1 bands). In this work, we expand the spectral range and present, for the first time, vibrational signatures in the base region, 1800–1350 cm−1. We first try to detect the signatures related to conformational transitions and untangle them from the hydration-related signatures in the base region, and then develop automated procedures for DNA conformation quantification.
3.1. Vibrational Signatures of DNA Thin Films in Base Region
In Figure 1, the average spectra of the 3-, 5-, 10-, 15-, and 40-min films in the 1800–970 cm−1 range are presented (for details see Materials and Methods). The vibrational bands in the spectral region from 1800 to 1350 cm−1 are mostly due to the C=O, C=N, and C=C vibrations of the purine and pyrimidine rings of DNA (base vibrations), while the vibrations in the range from 1350 to 970 cm−1 belong to the DNA backbone (for a more detailed assignment of backbone vibrations see our previous work [20]). Specifically, the absorption maxima observed near 1710 and 1661 cm−1 are dominantly due to the C=O vibrations of guanine and thymine, the maximum near 1609 cm−1 is mostly due to the C=N vibrations of adenine, while the band at ≈1490 cm−1 originates from the C=N vibrations in cytosine and guanine [26,27,28,29,30]. Several weaker bands also appear in the spectra; however, we leave them unassigned, as the details of their changes are beyond the scope of this work.
The spectral changes observed in the base region between the different films presented in the main panel of Figure 1 can be divided into changes in the shape of the absorption curve and changes in the absorption intensity of the bands. Namely, the region between 1715 and 1685 cm−1, which contains the C=O band at 1710 cm−1, experiences spectral reshaping, while in the rest of the base region the bands retain their shape and show only an increase in absorption intensity. Spectral reshaping in the 1715–1685 cm−1 region is most evident in the inset of Figure 1: the 3 min film has a distinct shape compared to the 10–40 min films, mostly due to the lower absorption intensity near 1695 cm−1. In the 10–40 min films, the shape of the absorption curve in the 1730–1610 cm−1 region is mostly unaffected by an increase in desiccation time; only a steady increase in absorption intensity is evident. Namely, the bands at 1661, 1609, and 1490 cm−1, as well as other weak base bands, show a continuous increase in absorption intensity as the desiccation time increases (see Figure 1). The intensity increase of the base bands is accompanied by an intensity increase of the asymmetric PO2− vibration (Figure 1) and by an increase in the integrated intensity of those bands. The increases in the integrated intensity of the large absorption band in the 1800–1550 cm−1 region and of the asymmetric and symmetric PO2− vibrations in the 1320–1155 and 1155–990 cm−1 regions, respectively, are presented in Figure 2 and are more evident for the 20–40 min films. Note that the 10–40 min films all have roughly the same %B-form (≈75%), so the increases in intensity and integrated intensity observed for the 20–40 min films are more likely the result of slight changes in hydration than of changes in conformation.
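The integrated intensities compared above can be illustrated with a minimal sketch that integrates an absorbance spectrum over the three windows named in the text. The function name, the dictionary keys, and the choice of plain trapezoidal integration without baseline correction are assumptions made for illustration; the source does not state how the integration or any baseline subtraction was actually performed.

```python
import numpy as np

# Integration windows (cm^-1) as quoted in the text; the keys are illustrative names.
BANDS = {
    "base": (1800.0, 1550.0),
    "asym_PO2": (1320.0, 1155.0),
    "sym_PO2": (1155.0, 990.0),
}

def integrated_intensity(wavenumber, absorbance, high, low):
    """Trapezoidal area under the absorbance curve between low and high (cm^-1).

    Assumption: no baseline is subtracted before integration.
    """
    wn = np.asarray(wavenumber, dtype=float)
    ab = np.asarray(absorbance, dtype=float)
    order = np.argsort(wn)            # FTIR axes are often stored in descending order
    wn, ab = wn[order], ab[order]
    mask = (wn >= low) & (wn <= high)
    x, y = wn[mask], ab[mask]
    # Explicit trapezoid rule: mean of neighbouring points times the step width.
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

# Synthetic flat spectrum, used only to show the call pattern.
wn_demo = np.arange(1800.0, 649.0, -1.0)
ab_demo = np.ones_like(wn_demo)
areas = {name: integrated_intensity(wn_demo, ab_demo, hi, lo)
         for name, (hi, lo) in BANDS.items()}
```

For the flat synthetic spectrum each area simply equals the window width, which makes the helper easy to check before applying it to measured spectra.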
Note that no frequency shifts of the bands are observable in the entire base region. The changes in the shape of the absorption curve in the 1720–1620 cm−1 region are most likely caused by dissimilar intensity changes of the bands, i.e., variations in the intensity of closely overlapping bands whose individual frequencies remain unchanged. Similar behavior, spectral reshaping without frequency shifts of the constituent modes, was found for the phosphate vibrations in the 1350–990 cm−1 region [20]. We give a more detailed interpretation of the changes in the base and phosphate regions in the Discussion.
3.2. Principal Component Analysis
In order to develop multivariate models capable of determining %B-form in DNA thin films, we first utilized PCA to decompose all spectra in the principal component space. Figure 3 shows the score plot of all 369 spectra in the 1800–650 cm−1 range built with the first four principal components: in the PC1 versus PC2 score plot, a strong clustering of the spectra with respect to the time spent in a vacuum chamber is evident. PC1 accounts for ≈15% of the total variance, with the 3 and 5 min spectra obtaining only negative PC1 scores, while the 10–40 min spectra obtain only positive PC1 scores. This strong separation along the PC1 axis correlates well with the previously established grouping with respect to conformation, with the 3 and 5 min films reflecting a more A-like conformation, while the 10–40 min films all show mostly B-form. However, as the 10 and 15 min spectra score slightly differently on the PC1 axis than the 20–40 min films, it can be reasoned that the PC1 variance accounts for both conformational and other differences in the spectra. Hydration-related features are strong contenders, as the 20–40 min films obtain similar PC1 scores that differ from those of the 10 and 15 min films. Note that in Figure 2 the integrated intensities of the base and phosphate bands of the 20–40 min films showed a similar grouping, which was attributed to hydration effects, as there are no further conformational changes after 10 min.
To conclude, the PC analysis revealed that DNA films associated with a lower level of B-form, 40–60%, have negative PC1 scores (3 and 5 min films), while the films associated with a high level of B-form, ≈75%, all have positive PC1 scores (10–40 min films). Additionally, further (sub)grouping of the 10–40 min spectra based on the PC1 and PC2 scores was observed: the 10- and 15-min films in the first cluster have similar PC1 and PC2 scores, as opposed to the 20–40 min films in the second cluster. This decomposition of the data in the PC score plot served as a good indication of how two different types of SVM and PCR models can be trained and validated. The first type of model is based on the classification of spectra with respect to desiccation time, i.e., the entire dataset is divided into seven classes (3–40 min films) and the model is validated on how well it predicts the spectral signatures related to each desiccation time. The second type of model is based on the classification of spectra with respect to %B-form: the entire dataset is divided into three classes, the 3, 5, and 10–40 min spectra, which represent different levels of B-form, as determined in our previous work.
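The decomposition into principal component scores described above can be sketched as follows. This is a minimal illustration based on a singular value decomposition of mean-centered spectra; the function name, the synthetic data, and the absence of any scaling or other preprocessing are assumptions, since the source does not describe the software or preprocessing used for PCA.

```python
import numpy as np

def pca_scores(spectra, n_components=4):
    """Return (scores, explained_variance_ratio) for spectra stored row-wise."""
    X = np.asarray(spectra, dtype=float)
    Xc = X - X.mean(axis=0)                    # mean-center each wavenumber channel
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    scores = U[:, :n_components] * s[:n_components]   # projections onto the PCs
    ratio = (s ** 2) / np.sum(s ** 2)          # fraction of total variance per PC
    return scores, ratio[:n_components]

# Synthetic stand-in data (not measured spectra): two groups of 20 "spectra"
# that differ in the intensity of one spectral feature.
rng = np.random.default_rng(0)
group_a = rng.normal(0.0, 0.01, size=(20, 50))
group_b = rng.normal(0.0, 0.01, size=(20, 50))
group_b[:, 10:15] += 1.0
scores, ratio = pca_scores(np.vstack([group_a, group_b]))
```

In this toy case the two groups obtain PC1 scores of opposite sign, which mirrors the kind of separation along PC1 reported for the 3 and 5 min films versus the 10–40 min films.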
3.3. Classification by SVM
In order to build the best automated model capable of resolving conformational signatures in DNA thin films, the SVM models were trained with different datasets and in different spectral ranges and then validated against unused spectra. For easier identification, the following nomenclature was chosen: the models with different calibration sets are denoted by numbers (SVM1, SVM2, etc.), while the models calculated for different spectral regions are denoted by letters (a), (b), etc. The results of all models, as well as their respective calibration and validation datasets, are presented in Table 1.
The initial classification model, SVM0, was built in order to test the ability of the algorithm to sort all 369 spectra with respect to the time spent in a vacuum chamber (seven classes representing 3–40 min films) in the entire spectral region, 1800–650 cm−1. Out of each class of 3–40 min films, approximately 80% of samples were randomly chosen to form a calibration dataset (294 spectra), while the rest was used as a validation dataset (75 spectra). In this way, the model is trained on the calibration dataset that includes all respective desiccation times (all seven classes) and, even though the validation dataset introduces unused spectra, a great amount of variability is already introduced into the calibration dataset. Out of 75 spectra from the validation set, only two spectra were misclassified, yielding over 97% classification success rate: two 40 min films were misclassified as 25 min films. Such a high classification rate indicated that the spectral profiles of each desiccation time are indeed unique and that the SVM algorithm is able to correctly classify a large number of data with respect to their spectral fingerprint.
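The class-wise random split described above (approximately 80% of each class for calibration, the rest for validation) can be sketched with the standard library alone. The function name, the seed, and the per-class counts in the demonstration are hypothetical; the source reports only the overall totals of 294 calibration and 75 validation spectra.

```python
import random
from collections import defaultdict

def stratified_split(labels, calib_fraction=0.8, seed=0):
    """Split sample indices so that each class contributes ~calib_fraction to calibration."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    rng = random.Random(seed)                  # fixed seed for a reproducible split
    calib, valid = [], []
    for label in sorted(by_class):
        idxs = by_class[label][:]
        rng.shuffle(idxs)
        n_calib = round(calib_fraction * len(idxs))
        calib.extend(idxs[:n_calib])
        valid.extend(idxs[n_calib:])
    return sorted(calib), sorted(valid)

# Hypothetical labels: 10 spectra for each of the seven desiccation times (min).
labels = [t for t in (3, 5, 10, 15, 20, 25, 40) for _ in range(10)]
calib_idx, valid_idx = stratified_split(labels)
```

Because the split is drawn within each class, every desiccation time is represented in both sets, which is why the text notes that much of the spectral variability is already present in the calibration data.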
As our previous work showed that the 3- and 5-min films have distinct conformation signatures compared to the 10–40 min films, the ability to classify samples with respect to three different levels of B-form was tested. The seven classes representing the respective vacuum times were replaced by three classes representing different levels of %B-form: class 1 (40%, 3 min films), class 2 (60%, 5 min films), and class 3 (75%, 10–40 min films). We note that %B-form was calculated using the integrated intensity ratios of conformation bands, as shown in our previous work [20,31].
The first classification of samples with respect to %B-form (SVM1) was performed on the previously presented dataset with 294 and 75 randomly chosen spectra in the calibration and validation sets, respectively. The SVM1 models were calculated with respect to the three classes of B-form (see the previous paragraph) for three different spectral regions: (a) the base and phosphate region from 1800 to 935 cm−1, (b) the base region from 1800 to 1550 cm−1, and (c) the asymmetric PO2 region from 1320 to 1155 cm−1. The (a) and (b) models had a 100% success rate (75 correct), whereas the (c) model misclassified three spectra, one from each class, yielding a 96% success rate. Such high success rates for all three SVM1 models, (a)–(c), confirm that the spectral characteristics in the base and phosphate regions, whether identified by visual inspection or by machine, are tightly related to the conformational changes observed in the 900–750 cm−1 region. However, note that the SVM1 model was trained on a calibration dataset that included a portion of spectra from all desiccation times (3–40 min films), the same as the validation dataset, which means that the calibration model already accounts for most of the variability in the spectra.
In order to simulate a real-life application, it is important to evaluate models for predicting samples with an unexpected variability. This was achieved by building calibration datasets that include three different classes of B-form; however, only certain desiccation times are selected for calibration, representing a more realistic model, with an entire unknown class of samples added to the validation dataset. The SVM2–SVM5 models were all built in such a way: different combinations of desiccation times were chosen as the calibration datasets and tested for selected spectral regions.
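A minimal sketch of this validation scheme, in which whole desiccation times are withheld from calibration, is given below. It assumes scikit-learn's `SVC` with a linear kernel and default settings; the source does not state which SVM implementation, kernel, or parameters were used. The helper names and the synthetic "spectra" are likewise illustrative only.

```python
import numpy as np
from sklearn.svm import SVC

def b_form_class(minutes):
    """Class labels as defined in the text: 3 min -> 1, 5 min -> 2, 10-40 min -> 3."""
    return 1 if minutes == 3 else 2 if minutes == 5 else 3

def holdout_by_time(X, minutes, calib_times):
    """Train on the chosen desiccation times, validate on all remaining ones."""
    minutes = np.asarray(minutes)
    y = np.array([b_form_class(m) for m in minutes])
    calib = np.isin(minutes, calib_times)
    model = SVC(kernel="linear")               # kernel choice is an assumption
    model.fit(X[calib], y[calib])
    predicted = model.predict(X[~calib])
    return float(np.mean(predicted == y[~calib]))   # fraction classified correctly

# Synthetic stand-in data: 12 "spectra" per desiccation time, 30 channels each,
# with a class-dependent offset plus small noise.
rng = np.random.default_rng(0)
minutes = np.repeat([3, 5, 10, 15, 20, 25, 40], 12)
offsets = {1: 0.0, 2: 1.0, 3: 2.0}
X = np.array([offsets[b_form_class(m)] + rng.normal(0.0, 0.05, 30) for m in minutes])

# Calibration on 3-, 5-, 10-, and 40-min films; validation on 15-, 20-, and 25-min films.
rate = holdout_by_time(X, minutes, calib_times=[3, 5, 10, 40])
```

Changing `calib_times` to `[3, 5, 40]` or `[3, 5, 10]` reproduces the structure of the other hold-out combinations discussed in this section, although the toy data here carry no hydration-related variability and therefore cannot reproduce the reported drop in success rate.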
The first classification of samples with respect to B-form (classes 1–3) based on new samples added to the validation set was SVM2. The calibration set included 333 spectra of 3-, 5-, 10-, and 40-min films (all three classes), while the validation set included 36 spectra of 15, 20, and 25 min films (only class 3). The model was calculated for four spectral regions: (a) the base and phosphate region from 1800 to 935 cm−1, (b) base region from 1800 to 1550 cm−1, (c) asymmetric PO2 region from 1320 to 1155 cm−1, and (d) symmetric PO2 region from 1155 to 990 cm−1. The SVM2 models (a)–(c) had 100% success rate (36 correct), while the (d) model for the symmetric PO2 region misclassified 1 spectrum, yielding 97% success rate. Such high success rates, regardless of the spectral region, for the models in which validation was performed on newly introduced spectra, provide an excellent basis not only for the classification of spectra but also for the estimation of %B-form. This is particularly interesting for the model (a) where, utilizing the SVM classification, an unknown %B-form for the 15–25 min spectra was determined as ≈75% with 100% success rate based on the spectral signatures in the 1800–935 cm−1 region.
In order to challenge the classification ability, the SVM2 calibration dataset was deliberately reduced until the classification failed and then expanded with the fewest spectra possible to regain correct classification (SVM3 and SVM4). In the SVM3 model, the 10 min films were removed from the training set and included in the validation set. The calibration dataset included 279 spectra of the 3-, 5-, and 40-min films (all three classes), while the validation set included 90 spectra of the 10-, 15-, 20-, and 25-min films (only class 3); the model was calculated in the 1800–935 cm−1 region. Without the 10-, 15-, 20-, and 25-min films in the calibration dataset, the success rate fell to less than 25% (22 out of 90 spectra were classified correctly), indicating that the 3-, 5-, and 40-min films are not sufficient to train a good model. In other words, the spectral differences between the 5- and 40-min films are large enough for the model to fall apart. In order to improve the SVM3 model, we expanded the calibration set with 18 spectra of the 15 min film, so that the calibration set of SVM4 included 297 spectra of the 3-, 5-, 15-, and 40-min films (all three classes), while the validation set included 72 spectra of the 10-, 20-, and 25-min films (only class 3); the model was again calculated in the 1800–935 cm−1 region. This significantly improved the success rate to over 80% (only 13 misclassified spectra out of 72), which suggests that the 40 min films in the 1800–935 cm−1 region (compared to the 10- and 15-min films) also contain certain spectral features not entirely related to conformation, as discussed in our previous work.
The two final SVM classification models, SVM5 and SVM6, were based on the idea of minimizing the calibration dataset while still resolving different levels of B-form in the spectra. In the SVM5 model, the calibration dataset included 288 spectra of the 3-, 5-, and 10-min films (all three classes), while the validation dataset included 81 spectra of the 15-, 20-, 25-, and 40-min films (only class 3); the model was calculated in the 1800–935 cm−1 region. The model yielded a 100% success rate with no misclassified spectra among the 15–40 min films, suggesting two things: there is no significant conformational difference between the 10 and 15–40 min films, as argued in our previous work, and the sample group of 10 min films is diverse enough to provide a strong model for classification based on %B-form. On the other hand, the SVM3 and SVM4 models, which based the prediction of %B-form on the 40 min spectra, proved to be inferior. This can be easily understood by looking at the sample group of 40 min films, which contains fewer spectra, has less spectral variability, and has certain features unrelated to conformation present in the spectra (mostly from phosphate groups).
The final step of the SVM model evaluation (by introducing unseen spectra to the validation dataset) was to take previously obtained DNA thin film spectra from our spectral library and test them against the calibration dataset of the SVM5 model. Three spectra of DNA thin films obtained several years ago in our lab were processed in the same manner and used as the validation dataset. Note that the library spectra chosen for validation were obtained from DNA solutions of the same concentration, deposited on the same substrate (Si windows), and recorded on the same instrument, but in a setup with a different vacuum pump (a rotary vacuum pump of the same class but from a different manufacturer). For the validation spectra, the previously calculated integrated intensity ratio of conformation markers estimated B-form at the level of ≈60%, which corresponds to the class 2 spectra from this work. This final model, SVM6, was trained on the same calibration dataset as SVM5 (3-, 5-, and 10-min films) in the range from 1800 to 935 cm−1 and resulted in a 100% success rate. Thus, the SVM model trained on the 3-, 5-, and 10-min spectra (DNA films prepared for the purpose of this work) proved to account for enough data variability to determine %B-form even for DNA thin films with unexpected variability, such as data obtained for the purpose of other works.
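The success rates quoted in this section follow directly from the reported counts of correctly classified validation spectra, as the short check below illustrates. The list layout and function name are illustrative; the counts are those stated in the text.

```python
# (model, correctly classified spectra, validation spectra) as reported in the text.
REPORTED = [
    ("SVM0", 73, 75),       # two 40 min films misclassified
    ("SVM1 (c)", 72, 75),   # three spectra misclassified
    ("SVM2 (d)", 35, 36),   # one spectrum misclassified
    ("SVM3", 22, 90),
    ("SVM4", 59, 72),       # 13 spectra misclassified
    ("SVM5", 81, 81),
]

def success_rate(correct, total):
    """Classification success rate in percent."""
    return 100.0 * correct / total

rates = {name: success_rate(c, n) for name, c, n in REPORTED}
for name, pct in rates.items():
    print(f"{name}: {pct:.1f}%")
```

The computed values (97.3%, 96.0%, 97.2%, 24.4%, 81.9%, and 100.0%) agree with the rounded figures given above.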
3.4. Principal Component Regression
In this section, we present the final multivariate method, PCR, utilized to validate the correlation between the spectral signatures in the 1800–935 cm−1 region and the DNA conformation as indicated by calculations. The two models built, PCR1 and PCR2, were trained on the 3-, 5-, 10-, and 40-min spectra and validated by introducing unseen spectra of the 15-, 20-, and 25-min films. Both were calculated in four distinct regions: (a) the base and phosphate region from 1800 to 935 cm−1, (b) the base region from 1800 to 1550 cm−1, (c) the asymmetric PO2 region from 1320 to 1155 cm−1, and (d) the symmetric PO2 region from 1155 to 990 cm−1.
The first model, PCR1, was intended mostly to see whether %B-form can be correctly predicted for DNA thin films of different desiccation times from the vibrational signatures of the base and phosphate regions, (a)–(d). The PCR1 model was trained against an estimate of %B-form calculated for each desiccation time, as described previously. The results of the PCR1 regression model calculated for region (a), 1800–935 cm−1, are presented in Figure 4, while the results of the PCR1 model for regions (b)–(d) are presented in Table 2. As indicated in Figure 4, %B-form in the 15-, 20-, and 25-min films is correctly determined by the PCR1 (a) model, and a high R² value of 0.922 indicates a very good correlation between the vibrational signatures in the base and phosphate region and %B-form. The PCR1 (b)–(d) models proved inferior to the PCR1 (a) model, with R² values of 0.875, 0.838, and 0.847, respectively.
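The regression step can be illustrated with a minimal principal component regression sketch: the centered spectra are projected onto the leading principal components and %B-form is then fitted to the scores by ordinary least squares. The function names, the number of components, and the synthetic data are assumptions; the source does not report how many components were retained or which software was used, and the R² obtained here says nothing about the reported values.

```python
import numpy as np

def pcr_fit(X, y, n_components):
    """Fit PCR: project centered spectra on leading PCs, then least squares on the scores."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    _, _, Vt = np.linalg.svd(X - x_mean, full_matrices=False)
    loadings = Vt[:n_components].T             # channels x components
    scores = (X - x_mean) @ loadings
    coef, *_ = np.linalg.lstsq(scores, y - y_mean, rcond=None)
    return x_mean, y_mean, loadings, coef

def pcr_predict(model, X):
    """Predict the response for spectra X with a model returned by pcr_fit."""
    x_mean, y_mean, loadings, coef = model
    return (X - x_mean) @ loadings @ coef + y_mean

def r_squared(y_true, y_pred):
    """Coefficient of determination."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic stand-in data: "spectra" whose intensity scales with %B-form plus noise.
rng = np.random.default_rng(0)
y = np.repeat([40.0, 60.0, 75.0], 20)          # %B-form levels quoted in the text
profile = rng.normal(size=40)                  # arbitrary spectral profile
X = np.outer(y, profile) + rng.normal(0.0, 0.5, size=(60, 40))

model = pcr_fit(X, y, n_components=2)          # number of components is an assumption
r2 = r_squared(y, pcr_predict(model, X))
```

In practice the model would be fitted on the calibration spectra only and `pcr_predict` applied to the withheld 15-, 20-, and 25-min spectra, following the split described for PCR1 and PCR2.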
The second model, PCR2, was intended to determine the correlation of the spectra with the three classes representing different levels of %B-form: class 1 (40%, 3 min film), class 2 (60%, 5 min film), and class 3 (75%, 10–40 min films). The results of the PCR2 model for region (a), 1800–935 cm−1, are presented in Figure 5, while the details of the PCR2 (b)–(d) models can be found in Table 2. In the PCR2 (a) model, the class of the 15-, 20-, and 25-min films was correctly determined as class 3 with a high R² value of 0.929, while the PCR2 (b)–(d) models proved inferior, with R² values of 0.882, 0.848, and 0.856, respectively.
When the PCR1 and PCR2 models are compared, the PCR2 model seems superior, with greater R² values for all four regions, (a)–(d). Since the 10–40 min films all have similar values of %B-form (≈75%), the accuracy of PCR1 is impaired: the differences in %B-form calculated between those groups of spectra are in fact negligible, i.e., resolving 75% from 76% B-form with the relation used for the estimation of %B-form is not viable. Consequently, a slightly better model, PCR2, is obtained when the 10–40 min spectra are all classified as a single group, class 3.
Finally, we note that the datasets prepared for the PCR1 (a) and PCR2 (a) models were also used for PLS regression. The PLS models gave similar results and, as they did not provide additional information, they are not presented in this work.