1. Introduction
Next-generation sequencing (NGS) is commonly used to unveil the genetic causes of diseases, and whole-exome sequencing (WES) has become one of the most commonly used diagnostic tools both in the clinic and in several programs investigating rare genetic diseases. Rare diseases collectively affect a significant fraction of the population (estimated to be about 4–5%) [1,2], with a resulting high impact on health-care costs and mortality rates. Currently, the standard protocol to investigate rare diseases includes multiple clinical diagnostic assays. Nonetheless, half of the cases still remain without a diagnosis [3,4,5]. One of the reasons for this is the limited knowledge of how to detect copy number variation (CNV) from sequencing data. It is estimated that about 12% of the human genome is subject to copy number changes [6,7]. To detect CNVs, diagnostic laboratories often use multiplex ligation-dependent probe amplification (MLPA) and array comparative genomic hybridization (array CGH) prior to executing NGS-based analysis [8]. However, the resolution of both methods varies widely (from kilobases to megabases), and they add complexity to the overall patient screening process. Whole-genome sequencing (WGS) data have more even coverage than WES data, whose coverage is affected by the enrichment protocols used, which makes WGS more reliable for CNV calling. However, due to the extensive use of WES in diagnostics, there is a need for reliable methods to infer CNVs from exome data as well [9,10,11]. Indeed, leveraging the sequencing outcome to detect CNVs offers potential advantages, leading to increased diagnostic yield without increasing laboratory costs [10,12].
Several CNV-detection algorithms for WES data have been developed, all of which rely on the depth of coverage (DoC) from multiple samples to infer copy numbers [13,14,15,16]. Unfortunately, the CNV search is hampered by biases that influence DoC, including differences in capture protocol efficiency, the presence of GC-rich regions, and different coverage resolutions [17,18,19]. Such heterogeneity complicates the downstream analysis of the detected events, leading to false positives [18,20,21,22] while compromising the ability to reliably detect CNVs that span fewer than three exons [10,19,20,23,24]. Even though CNV detection could represent a valuable complementary way to analyze NGS data, the low concordance of detected events suggests that the algorithms designed so far are yet to be optimized [19,22,24,25]. Moreover, comparative works have demonstrated that these results are often difficult to replicate despite the high specificity and sensitivity declared [26]. One way to overcome these issues could be to generate a consensus of variants called by different algorithms [24]. However, to use any of these approaches, the user needs to prepare BAM files for unrelated samples sequenced with the same target and to write ad hoc scripts, making such analyses difficult for laboratories that do not have bioinformatics expertise. Therefore, a fully automated CNV workflow, along with methods to investigate CNVs in WES data beyond the DoC strategies, is of high importance for the scientific community.
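As an illustration of the consensus idea mentioned above, the following minimal Python sketch keeps only the calls that are supported by at least two callers. It is not taken from any of the cited tools: the tuple layout `(chrom, start, end, type)`, the function names, the caller labels, and the simple "any overlap of the same type" rule are all assumptions made for the example.

```python
def overlaps(a, b):
    """True if two calls (chrom, start, end, type) share chromosome, type, and at least one base."""
    return a[0] == b[0] and a[3] == b[3] and a[1] < b[2] and b[1] < a[2]


def consensus_calls(calls_by_caller, min_callers=2):
    """Return (call, supporting_callers) for calls backed by >= min_callers callers.

    calls_by_caller maps a caller name to its list of calls. Each caller's own
    version of a shared event is reported, so one event can appear more than once.
    """
    consensus = []
    for caller, calls in calls_by_caller.items():
        for call in calls:
            support = {caller}
            for other, other_calls in calls_by_caller.items():
                if other != caller and any(overlaps(call, o) for o in other_calls):
                    support.add(other)
            if len(support) >= min_callers:
                consensus.append((call, sorted(support)))
    return consensus


# Hypothetical calls from three hypothetical callers.
example_calls = {
    "callerA": [("chr1", 100, 500, "DEL"), ("chr2", 1000, 2000, "DUP")],
    "callerB": [("chr1", 300, 700, "DEL")],
    "callerC": [("chr2", 1500, 1800, "DEL")],
}
example_consensus = consensus_calls(example_calls)
```

In this toy input only the chr1 deletion is retained, because the two chr2 calls overlap but disagree on the event type. A stricter rule (for instance, a minimum reciprocal overlap) could replace `overlaps` without changing the rest of the sketch.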
Single-exon homozygous/hemizygous deletion (HD) detection methods, which compare normalized coverage values among samples produced with the same kits, already exist (e.g., Atlas-CNV, CoNVaDING, DECoN, and HMZDelFinder) [27,28,29,30]. While Atlas-CNV and CoNVaDING, as suggested by the authors, can only be used with high-coverage sequencing data (e.g., small targeted gene panels), HMZDelFinder and DECoN are ad hoc tools for exonic CNV detection. However, these tools are based on the assumption that the data follow a defined distribution and hence require intra- and inter-sample homogeneity [26].
To overcome these challenges, we developed a new algorithm for the detection of rare single-exon HDs that exploits breadth of coverage (BoC), and we named it VarGenius-HZD (where HZD stands for homozygous/hemizygous deletion detection). Additionally, we automated its execution, along with that of ExomeDepth and XHMM, within VarGenius, our recently developed software for variant detection analysis and sample management [31]. This software is now able to automatically pick samples generated with the same target and to perform CNV calling separately on autosomes and sex chromosomes and in parallel across different cores of a High-Performance Computing (HPC) system managed with a Portable Batch System (PBS) scheduler. The VarGenius-HZD algorithm is available both integrated within the VarGenius software, where it scales across HPC nodes, and as a stand-alone version that takes as input a list of manually selected BAM files and scales across CPU cores.
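To make the notion of breadth of coverage concrete, the following Python sketch computes BoC for an interval and flags exons that are essentially uncovered in one sample while well covered in the rest of the cohort. It is only an illustration of the general idea and should not be read as the VarGenius-HZD implementation: the function names, the input layout (per-base depths and a per-sample dictionary of BoC values), and the thresholds `max_boc=0.2` and `min_cohort_boc=0.8` are all assumptions.

```python
from statistics import mean


def breadth_of_coverage(depths, min_depth=1):
    """Fraction of positions in an interval whose depth is >= min_depth."""
    if not depths:
        return 0.0
    covered = sum(1 for d in depths if d >= min_depth)
    return covered / len(depths)


def candidate_hds(boc_by_sample, sample, max_boc=0.2, min_cohort_boc=0.8):
    """Exons with low BoC in `sample` but high average BoC in the other samples.

    boc_by_sample maps sample -> {exon: BoC}. Both thresholds are illustrative.
    """
    others = [s for s in boc_by_sample if s != sample]
    if not others:
        return []
    calls = []
    for exon, boc in boc_by_sample[sample].items():
        cohort_boc = mean([boc_by_sample[s][exon] for s in others])
        # A poorly covered exon is only suspicious if the cohort covers it well;
        # otherwise it is more likely a region that is hard to capture or sequence.
        if boc <= max_boc and cohort_boc >= min_cohort_boc:
            calls.append(exon)
    return calls


# Hypothetical BoC values for three samples sequenced with the same target.
example_boc = {
    "S1": {"exon1": 0.0, "exon2": 1.0, "exon3": 0.1},
    "S2": {"exon1": 1.0, "exon2": 1.0, "exon3": 0.05},
    "S3": {"exon1": 0.95, "exon2": 1.0, "exon3": 0.0},
}
example_calls_s1 = candidate_hds(example_boc, "S1")
```

In this toy cohort, `exon1` is reported for sample `S1`, whereas `exon3` is not, because it is poorly covered in every sample. The cohort comparison is what distinguishes a sample-specific deletion from a systematically uncovered target.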
We validated our algorithm using 50 samples from the 1000 Genomes Project (1KGP) (https://www.internationalgenome.org/, accessed on 1 February 2021) for which both WGS and WES data were available and in which we detected both existing and artificially inserted HDs. For these test cases, we compared VarGenius-HZD results with those of HMZDelFinder, DECoN, and ExomeDepth, and our algorithm obtained the highest sensitivity. Furthermore, we applied VarGenius-HZD to targeted sequencing data from a cohort of 188 individuals with Inherited Retinal Dystrophies (IRDs), resolving 5 out of 64 undiagnosed cases by identifying pathogenic HDs, which were then experimentally validated.
4. Discussion
HDs often lead to loss of function, with pathogenic roles in both Mendelian diseases and cancer [40,41,42,43]. Indeed, a significant percentage of human Mendelian diseases is reported to be caused by molecular disruption within exons [6,7]. NGS-based approaches have become affordable over the last decade, allowing diagnostic laboratories to use targeted sequencing [44]. Nonetheless, the investigation of CNVs in WES is still challenging, mostly because of uneven coverage, enrichment kits, and regions of the genome that are difficult to sequence [18,45,46]. State-of-the-art tools require as input several samples for comparison, which should be unrelated and sequenced with the same target [13,16,20]. Yet, comparative works have demonstrated a high number of false positives; hence, alternative CNV detection strategies and filtering methods are needed [18,20,22].
The goal of this work was to explore different solutions for HD discovery in targeted sequencing and to automate the overall workflow. We developed VarGenius-HZD, which searches for HDs within a single sample and leverages multi-sample information to corroborate such calls, and we integrated it within our recently developed VarGenius. CNV detection is still a challenging task, and we think that currently only highly trained bioinformaticians can disentangle the intrinsic difficulty of detecting this type of variation and understand the underlying complexities and caveats, especially for clinical practice. However, automating CNV analysis and reducing false positives in HD detection, and consequently the number of events to inspect manually, could increase the availability of human-readable results and, hopefully, of genetic diagnoses for laboratories lacking bioinformatics expertise. To make VarGenius-HZD useful for researchers who use other software for variant calling, we also developed a stand-alone VarGenius-HZD; in this version, the user provides the list of full paths to the BAM files and the target file. One limitation of the stand-alone tool (compared to the complete VarGenius software) is that it cannot annotate calls with the parents’ coverage but only with the average coverage across all samples used.
To compare our algorithm with state-of-the-art methods, we applied VarGenius-HZD, ExomeDepth, HMZDelFinder, and DECoN to 50 samples from 1KGP. Our algorithm achieved the highest number of true positives (TPs) and was therefore more sensitive than the other tools, demonstrating that BoC can be effectively used to detect such variants. Furthermore, our tool correctly detected all the synthetic HDs that we inserted within randomly chosen samples in the same dataset, achieving a sensitivity of 100%, while the only comparable result was obtained with HMZDelFinder, with a sensitivity of 80%. ExomeDepth and DECoN were not able to detect any of the simulated HDs. Our results are in agreement with other comparative studies, which describe ExomeDepth’s ability to discover long CNVs covering large chromosomal regions while missing events that affect fewer than three exons. DECoN, which is based on ExomeDepth, provided similar results. We speculate that, for clinical diagnosis, a higher number of TPs, and thus higher sensitivity rather than precision, would be preferable at the cost of filtering a few additional CNVs during downstream prioritization.
We then assessed the performance of VarGenius-HZD in a clinical context using targeted sequencing data from a cohort of unsolved IRD patients. Analysis of CNVs using ExomeDepth and XHMM with such data turned out to be challenging: these tools detected hundreds of events, and filtering false positives (FPs) was a difficult task. We observed several FPs detected by ExomeDepth and XHMM, in agreement with current studies showing that state-of-the-art CNV-calling algorithms are influenced by different instrument outcomes and low-coverage samples, possibly due to the high number of off-target bases, duplicates, and low base quality. We speculate that CNV callers should deal with such issues and that, to reduce the false discovery rate, it could be useful as a pre-processing step to remove outlier samples that have a high number of calls (e.g., >2 standard deviations).
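The outlier-removal step suggested above can be sketched in a few lines of Python. This is a minimal example under stated assumptions, not part of VarGenius: the function name and input layout are hypothetical, and the threshold is interpreted here as "more than 2 sample standard deviations above the mean number of calls per sample", which is one plausible reading of the criterion.

```python
from statistics import mean, stdev


def outlier_samples(calls_per_sample, n_sd=2.0):
    """Samples whose CNV call count exceeds mean + n_sd * standard deviation.

    calls_per_sample maps sample name -> number of CNV calls. The cutoff
    definition is an assumption made for this example.
    """
    counts = list(calls_per_sample.values())
    if len(counts) < 2:
        return set()  # the standard deviation is undefined for fewer than two samples
    cutoff = mean(counts) + n_sd * stdev(counts)
    return {s for s, n in calls_per_sample.items() if n > cutoff}


# Hypothetical call counts: sample J has far more calls than the others.
example_counts = {
    "A": 10, "B": 12, "C": 11, "D": 9, "E": 13,
    "F": 10, "G": 12, "H": 11, "I": 10, "J": 200,
}
example_outliers = outlier_samples(example_counts)
```

Here only sample `J` exceeds the cutoff and would be set aside before downstream filtering. Because a single extreme sample inflates both the mean and the standard deviation, a median-based cutoff might be more robust in small cohorts; the mean-based rule is used here only because it mirrors the criterion mentioned in the text.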
After filtering, we could confirm, through experimental assays, five pathogenic HDs. Only VarGenius-HZD was able to detect all of them. In summary, XHMM missed all of them; ExomeDepth detected all except one but assigned them very low BF scores, and hence they were initially excluded; HMZDelFinder detected all except one. One of the called HDs was instrumental in defining a new association of biallelic variants in the RAX2 gene with autosomal recessive retinitis pigmentosa [35].