# A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder

## Abstract

## 1. Introduction

## 2. Results and Discussion

#### 2.1. PROTRAIT Predicts Single-Cell Chromatin Accessibility on Held-Out DNA Sequences

#### 2.2. PROTRAIT Annotates Cell Types by Clustering on Cell Embedding

#### 2.3. PROTRAIT Denoises Single-Cell Chromatin Accessibility Profiles

#### 2.4. PROTRAIT Infers TF Activity at Single-Cell and Single-Nucleotide Resolution

#### 2.5. PROTRAIT Is Scalable to Large Datasets

## 3. Materials and Methods

#### 3.1. Datasets

#### 3.2. Chromatin Accessibility Modeler Based on ProbDep Transformer Encoder

#### 3.2.1. Uniform Input Representation

**Motif Embedding.** Given an L-bp DNA sequence, we first mapped it into a latent space using one-hot embedding. If L is below a manually set threshold, we transformed the one-hot embedding into the motif embedding with a single convolutional layer, which represents the occupancy of regulatory motifs. If L is above the threshold, we instead used a sequence of alternating convolution and pooling layers to generate the motif embedding, which also reduces the dimension of the embedding.
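The short-sequence path can be illustrated with a minimal NumPy sketch. It assumes a plain "valid" convolution followed by a ReLU; the function names `one_hot` and `motif_embedding`, the filter width, and the hand-built "GATA" filter are illustrative choices and are not taken from the paper.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def one_hot(seq):
    """Encode a DNA string as an (L, 4) one-hot matrix; unknown bases (e.g. N) stay all-zero."""
    x = np.zeros((len(seq), 4), dtype=np.float32)
    for pos, base in enumerate(seq.upper()):
        idx = NUCLEOTIDES.find(base)
        if idx >= 0:
            x[pos, idx] = 1.0
    return x

def motif_embedding(x, filters):
    """Slide each (width, 4) filter along the sequence ('valid' convolution), then apply ReLU.

    x: (L, 4) one-hot matrix; filters: (n_filters, width, 4).
    Returns an (L - width + 1, n_filters) matrix of motif-match scores.
    """
    n_filters, width, _ = filters.shape
    windows = np.stack([x[i:i + width] for i in range(x.shape[0] - width + 1)])
    scores = np.einsum("lwc,fwc->lf", windows, filters)
    return np.maximum(scores, 0.0)

# Hypothetical filter that matches the 4-mer "GATA" exactly (a trained model learns its filters).
gata_filter = one_hot("GATA")[None, :, :]
emb = motif_embedding(one_hot("CCGATACC"), gata_filter)
```

In this toy case the score peaks at the window that starts at index 2, where the sequence contains the full "GATA" match.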

**Position Embedding.** To make use of the position information of each regulatory motif, we used fixed absolute position embedding:
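One widely used fixed absolute scheme is the sinusoidal encoding of the original Transformer. The sketch below assumes that form; whether PROTRAIT uses exactly these frequencies is an assumption, and the function name and the base of 10000 are illustrative.

```python
import numpy as np

def sinusoidal_position_embedding(length, d_model):
    """Return a (length, d_model) matrix of fixed sinusoidal position embeddings (d_model must be even)."""
    positions = np.arange(length)[:, None]            # (length, 1)
    dims = np.arange(0, d_model, 2)[None, :]          # even feature indices, (1, d_model / 2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((length, d_model))
    pe[:, 0::2] = np.sin(angles)                      # even features use sine
    pe[:, 1::2] = np.cos(angles)                      # odd features use cosine
    return pe

pe = sinusoidal_position_embedding(length=6, d_model=8)
```

Because the embedding is fixed rather than learned, it adds no trainable parameters and can be added element-wise to a motif embedding of the same shape.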

#### 3.2.2. ProbDep Transformer Encoder

**Long-range dependency measurement.** The attention of the i-th query on all the keys is defined as a probability $p(\mathbf{k}_m \mid \mathbf{q}_i)$. The output is its composition with $\mathbf{v}_i$, where a query–key pair $\mathbf{q}_i \mathbf{k}_m^{T}$ indicates a possible long-range dependency between two regulatory motifs. The long-range dependency measurement aims to discover bona fide dependencies by finding dominant query–key pairs, that is, pairs that push the attention probability distribution of the corresponding query away from the uniform distribution. If $p(\mathbf{k}_m \mid \mathbf{q}_i)$ is close to the uniform distribution $q(\mathbf{k}_m \mid \mathbf{q}_i) = 1/L_k$, the self-attention output reduces to an uninformative average of $\mathbf{V}$. We used the Kullback–Leibler divergence between $p(\mathbf{k}_m \mid \mathbf{q}_i)$ and $q(\mathbf{k}_m \mid \mathbf{q}_i)$ to identify the dominant queries, and hence the dominant query–key pairs. Dropping constants, the long-range dependency measurement of the i-th query is defined as:
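As a rough illustration, assume the attention probability is a softmax over scaled dot products. The KL divergence from the uniform distribution to the attention distribution then equals, up to the constant $\ln L_k$, the log-sum-exp of a query's scores minus their mean. The sketch below computes that quantity; the function name and the toy inputs are illustrative, and the exact form used in PROTRAIT may differ.

```python
import numpy as np

def dependency_measurement(Q, K):
    """Per-query score: log-sum-exp of scaled dot products minus their arithmetic mean.

    Q: (L_q, d) queries; K: (L_k, d) keys. Returns an (L_q,) vector.
    A query with uniform attention scores exactly ln(L_k); peaked attention scores higher.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                     # (L_q, L_k)
    row_max = scores.max(axis=1, keepdims=True)       # subtract the max for numerical stability
    lse = row_max[:, 0] + np.log(np.exp(scores - row_max).sum(axis=1))
    return lse - scores.mean(axis=1)

K = np.eye(4)
Q = np.array([[0.0, 0.0, 0.0, 0.0],    # attends uniformly to all keys
              [8.0, 0.0, 0.0, 0.0]])   # strongly prefers the first key
M = dependency_measurement(Q, K)
```

Here the first query scores $\ln 4$, the minimum possible for four keys, while the second scores higher and would be ranked as dominant.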

**ProbDep self-attention.** Based on the long-range dependency measurement, we have the ProbDep self-attention by allowing each key to only attend to the u dominant queries:
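A minimal sketch of this selection step follows, assuming softmax attention for the u dominant queries. The fallback for the remaining queries (the mean of the values, as in ProbSparse attention) is an assumption borrowed from that convention and is not stated in the text; all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def probdep_attention(Q, K, V, u):
    """Full softmax attention for the u highest-scoring queries only.

    Returns the (L_q, d_v) output and the indices of the dominant queries.
    """
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)
    row_max = scores.max(axis=1)
    # Dependency measurement: log-sum-exp of the scores minus their mean.
    m = row_max + np.log(np.exp(scores - row_max[:, None]).sum(axis=1)) - scores.mean(axis=1)
    top = np.argsort(-m)[:u]
    # Assumption: non-dominant queries fall back to the mean of the values.
    out = np.tile(V.mean(axis=0), (Q.shape[0], 1))
    out[top] = softmax(scores[top]) @ V
    return out, top

K = np.eye(4)
V = np.arange(8.0).reshape(4, 2)
Q = np.array([[0.0, 0.0, 0.0, 0.0],
              [8.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 0.0]])
out, top = probdep_attention(Q, K, V, u=1)
```

Only the second query is selected in this toy input, so the softmax is applied to a single row of the score matrix rather than to all three.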

**Self-attention pooling.** To reduce the redundant combinations of $\mathbf{v}_i$ and make a focused self-attention feature map in the next layer, we developed self-attention pooling. The procedure of self-attention pooling from the k-th layer into the (k+1)-th layer is defined as:

#### 3.2.3. Chromatin Accessibility Analyzer

**Sequence Embedder.** The ProbDep Transformer Encoder’s output $\mathcal{X}^{(k+1)}$ can be regarded as a high-order embedding of the DNA sequence that still contains redundant regulatory grammar. To generate a low-redundancy sequence embedding $\mathcal{Z}$, the Sequence Embedder maps $\mathcal{X}^{(k+1)}$ into a low-dimensional space with a convolutional layer and a feed-forward layer:

**Accessibility Predictor.** Based on the sequence embedding $\mathcal{Z}$, the Accessibility Predictor determines the accessibility probability in each cell by a linear layer:
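The prediction step can be sketched as a single linear layer with one output per cell. The sigmoid used to turn logits into probabilities is an assumption, and the names `W`, `b`, and `predict_accessibility` are illustrative.

```python
import numpy as np

def predict_accessibility(z, W, b):
    """Map one sequence embedding to a per-cell accessibility probability.

    z: (d,) sequence embedding; W: (n_cells, d) weights; b: (n_cells,) biases.
    """
    logits = W @ z + b
    return 1.0 / (1.0 + np.exp(-logits))   # assumed sigmoid link

rng = np.random.default_rng(0)
W = rng.normal(size=(5, 3))                # row m acts as a d-dimensional embedding of cell m
probs = predict_accessibility(np.zeros(3), W, np.zeros(5))
```

Each row of `W` belongs to one cell, which is why the final-layer weights can be read as a cell embedding, as described in the Figure 1 caption.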

**Loss function.** Given the model prediction $\mathbf{y}=[y_1,\dots,y_N]$ and the binary label $\widehat{\mathbf{y}}=[\widehat{y}_1,\dots,\widehat{y}_N]$, we chose binary cross-entropy as the loss function. The loss is back-propagated from the Predictor’s output through the entire model:
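A minimal sketch of binary cross-entropy in NumPy, following the notation above (predictions $\mathbf{y}$, labels $\widehat{\mathbf{y}}$). Averaging over the N entries and the clipping constant `eps` are assumptions made for this example.

```python
import numpy as np

def binary_cross_entropy(y_pred, y_true, eps=1e-7):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    y_pred = np.clip(y_pred, eps, 1.0 - eps)   # avoid log(0)
    per_entry = y_true * np.log(y_pred) + (1.0 - y_true) * np.log(1.0 - y_pred)
    return float(-np.mean(per_entry))

loss = binary_cross_entropy(np.array([0.9, 0.1, 0.5]), np.array([1.0, 0.0, 1.0]))
```

Confident correct predictions contribute little to the loss, while the uncertain third prediction contributes most of it.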

#### 3.2.4. Training and Implementation

#### 3.3. Cell Type Annotator Based on Cell Embedding

#### 3.4. scATAC-Seq Data Denoiser Based on Predicted Chromatin Accessibility

**Search of dropout values.** Given a peak-by-cell matrix, we employ a probabilistic mixture model to identify which peaks are affected by dropouts in which cells. This model comprises a Gamma distribution and a normal distribution. The Gamma distribution accounts for the dropouts, and the normal distribution represents the bona fide counts of the scATAC-seq peaks. We build separate mixture models for different cell clusters because the proportions of the Gamma and normal components for each peak differ across cell types.
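Once such a mixture has been fitted, the dropout probability of an entry can be read off as the posterior weight of the Gamma component. The sketch below shows only that step; the parameter names (`pi`, `shape`, `rate`, `mean`, `std`) and their values are hypothetical, the fitting procedure (for example EM per peak and cluster) is omitted, and inputs are assumed to be strictly positive, such as transformed counts with a pseudocount.

```python
import math
import numpy as np

def gamma_pdf(x, shape, rate):
    """Gamma density for strictly positive x, evaluated in log space for stability."""
    x = np.asarray(x, dtype=float)
    log_pdf = shape * math.log(rate) - math.lgamma(shape) + (shape - 1.0) * np.log(x) - rate * x
    return np.exp(log_pdf)

def normal_pdf(x, mean, std):
    x = np.asarray(x, dtype=float)
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * math.sqrt(2.0 * math.pi))

def dropout_probability(x, pi, shape, rate, mean, std):
    """Posterior probability that each value was generated by the Gamma (dropout) component."""
    g = pi * gamma_pdf(x, shape, rate)
    n = (1.0 - pi) * normal_pdf(x, mean, std)
    return g / (g + n)

# Hypothetical fitted parameters for one peak in one cell cluster.
d = dropout_probability(np.array([0.1, 3.0]), pi=0.5, shape=1.0, rate=2.0, mean=3.0, std=1.0)
```

With these illustrative parameters, the value near zero is assigned almost entirely to the dropout component, and the value near the normal mean is not.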

**Recovery of dropout peaks.** For each cell m, we first chose peak sets $A_m=\{i: d_{im}\ge T\}$ and $B_m=\{i: d_{im}<T\}$ based on the dropout probabilities of peaks in cell m, where T is a threshold on the dropout probabilities. $A_m$ is the set of peaks that need recovery, and $B_m$ is the set of peaks whose counts are considered accurate and do not require recovery. Then, for each peak i, we obtained candidate denoised counts in all cells from the predicted chromatin accessibility. Finally, letting $r_{i,m}$ denote the candidate denoised count of peak i in cell m, we recovered only the counts of peaks in set $A_m$:
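The selective recovery rule can be sketched in one line of NumPy: entries whose dropout probability reaches the threshold T take the candidate count $r_{i,m}$, and all others keep their observed count. The function name and the toy matrices are illustrative.

```python
import numpy as np

def recover_dropouts(counts, dropout_prob, candidates, threshold):
    """Replace entries in A_m (dropout_prob >= threshold) with candidates; keep B_m unchanged.

    All three matrices are peak-by-cell and share the same shape.
    """
    return np.where(dropout_prob >= threshold, candidates, counts)

counts = np.array([[0.0, 3.0], [0.0, 0.0]])        # observed peak-by-cell counts
dropout_prob = np.array([[0.9, 0.1], [0.2, 0.8]])  # d_im from the mixture model
candidates = np.array([[2.0, 5.0], [4.0, 1.0]])    # r_im from predicted accessibility
denoised = recover_dropouts(counts, dropout_prob, candidates, threshold=0.5)
```

Note that a zero with a low dropout probability, such as the lower-left entry here, is left as a zero.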

#### 3.5. TF Activity Analyzer Based on Differential Accessibility Analysis

**Single-cell TF activity inference based on motif insertion.** To compute an activity score for each TF in each cell, we performed motif insertion with PROTRAIT. Specifically, we first performed dinucleotide shuffling of randomly sampled scATAC-seq peaks to generate genomic background sequences. For each TF, we downloaded the motif sequences from the JASPAR database and inserted them into the center of each background sequence. Then, we used PROTRAIT to predict normalized accessibility across all cells for both the motif-inserted sequences and the background sequences. Finally, we took the difference in predicted accessibility between each motif-inserted sequence and its background sequence as the motif influence score for that sequence. For each cell, the influence score averaged across all sequences can be regarded as a cell-level prediction of TF activity.
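The scoring loop can be sketched as follows. The `predict` callable stands in for the trained model and `toy_predict` is a hypothetical placeholder. Overwriting the central bases so that sequence length is preserved is an assumption about how insertion is done, and dinucleotide shuffling of the backgrounds is not shown.

```python
import numpy as np

def insert_motif(background, motif):
    """Overwrite the central bases of the background with the motif (length is preserved)."""
    start = (len(background) - len(motif)) // 2
    return background[:start] + motif + background[start + len(motif):]

def tf_activity(backgrounds, motif, predict):
    """Average, over background sequences, of predicted accessibility with and without the motif.

    predict: callable mapping a DNA string to an (n_cells,) array of accessibility values.
    """
    diffs = [predict(insert_motif(bg, motif)) - predict(bg) for bg in backgrounds]
    return np.mean(diffs, axis=0)

def toy_predict(seq):
    # Hypothetical two-cell model: only cell 0 responds to GATA sites.
    return np.array([float(seq.count("GATA")), 0.0])

activity = tf_activity(["CCCCCCCC", "TTTTTTTT"], "GATA", toy_predict)
```

In this toy setup the motif raises predicted accessibility only in cell 0, so only that cell receives a positive activity score.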

**Single-nucleotide TF activity inference based on ISM.** To further compute a TF activity score at per-cell, per-nucleotide resolution, we performed in silico mutagenesis (ISM) on every nucleotide of a DNA sequence. We first calculated the change in accessibility in every cell after mutating each position to its three alternative nucleotides. Then, we normalized the accessibility changes for the four nucleotides at each position such that they sum to zero. For a motif sequence, the normalized score at the reference nucleotide can be regarded as a nucleotide-level prediction of TF activity.
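A minimal sketch of the ISM normalization, assuming the reference nucleotide has a change of zero before centering and that the sequence contains only A, C, G, and T. The `predict` callable again stands in for the trained model, and `toy_predict` is a hypothetical placeholder.

```python
import numpy as np

NUCLEOTIDES = "ACGT"

def ism_scores(seq, predict):
    """Return an (L, 4, n_cells) array of accessibility changes, centered per position."""
    ref = predict(seq)
    delta = np.zeros((len(seq), 4) + ref.shape)
    for pos in range(len(seq)):
        for j, base in enumerate(NUCLEOTIDES):
            if base != seq[pos]:                      # the reference base keeps a change of 0
                mutated = seq[:pos] + base + seq[pos + 1:]
                delta[pos, j] = predict(mutated) - ref
    # Subtract the per-position mean so the four nucleotide scores sum to zero.
    return delta - delta.mean(axis=1, keepdims=True)

def reference_scores(seq, scores):
    """Pick the normalized score of the reference nucleotide at each position: (L, n_cells)."""
    idx = [NUCLEOTIDES.index(base) for base in seq]
    return scores[np.arange(len(seq)), idx]

def toy_predict(seq):
    # Hypothetical one-cell model: accessible only if a GATA site is present.
    return np.array([float("GATA" in seq)])

scores = ism_scores("CGATAC", toy_predict)
ref_scores = reference_scores("CGATAC", scores)
```

Positions inside the GATA site receive positive reference scores because every substitution there lowers the predicted accessibility, while the flanking positions score zero.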

## 4. Conclusions

## Author Contributions

## Funding

## Institutional Review Board Statement

## Informed Consent Statement

## Data Availability Statement

## Conflicts of Interest

**Figure 1.** The overview of PROTRAIT. (**a**) Chromatin accessibility modeler. This module comprises a Uniform Input Representation, a ProbDep Transformer Encoder, and a Chromatin Accessibility Analyzer. Specifically, the Uniform Input Representation encodes each regulatory motif’s occupancy and position information. The ProbDep Transformer Encoder further captures bona fide long-range dependencies between different regulatory motifs. The Chromatin Accessibility Analyzer integrates the learned regulatory grammar to predict the chromatin accessibility of each cell. The final-layer weights of the Chromatin Accessibility Analyzer can be regarded as the cell embedding. (**b**) Cell type annotator. After generating the cell embedding, this module uses the Louvain algorithm to perform single-cell clustering for cell type annotation. (**c**) scATAC-seq data denoiser. This module uses a statistical model to identify which zero counts result from dropout events, and it recovers only those counts using the predicted chromatin accessibility. (**d**) TF activity analyzer. This module analyzes the activity of a TF in each cell or at each nucleotide by measuring changes in predicted accessibility.

**Figure 2.** Evaluation of chromatin accessibility prediction approaches for scATAC-seq data. (**a**) auROC comparison of PROTRAIT and different chromatin accessibility prediction approaches, including Basset, DeepSEA, scBasset, and Basenji. (**b**) auROC comparison of PROTRAIT on down-sampled datasets (non-zero entries of labels are 95%, 90%, 80%, 70%, 60%, 50%, 40%, 30%, 20% and 10% of original). (**c**) auROC comparison of the ProbDep Transformer Encoder and its variants (**left**) and the fixed absolute position embedding and its variants (**right**).

**Figure 3.** Performance of diverse cell representation approaches compared by cell type annotation. (**a**) UMAP representation of the cell embeddings derived from PROTRAIT, cisTopic, SAILER, SCALE, MAGIC, and SAVER. All approaches are trained on the full dataset. (**b**) The average v-score, ARI, and AMI for each cell type are measured after clustering based on the cell embeddings. All approaches are trained on the datasets where the total number of peaks per cell remains at 20%, 40%, 60%, 80%, and 100% of the total peaks, respectively. (**c**) Runtimes for each of the cell-embedding approaches profiled. All approaches are trained on the full dataset.

**Figure 4.** Performance evaluation of denoising approaches for scATAC-seq data. (**a**) The raw and denoised peak-by-cell matrix of 800 peaks and 200 cells from the Buenrostro2018 dataset, hierarchically clustered by cells. (**b**) The label score (k = 50, 75, 100) comparison of PROTRAIT, MAGIC, SAILER, SCALE, and scOpen. (**c**) The label score (k = 50, 75, 100) comparison of PROTRAIT with search of dropout values (SoDV) and without search of dropout values (WoSoDV).

**Figure 5.** Performance of PROTRAIT in inferring single-cell TF activity. (**a**) HSC differentiation lineage diagram. (**b**) The normalized average activity scores of 30 TFs in eight cell types. (**c**) UMAP represents the single-cell activity scores of the most active TFs in eight cell types.

**Figure 6.** Performance of PROTRAIT in inferring single-cell and single-nucleotide TF activity. (**a**) ISM scores for sequences that match GATA1 and ZNF354C motifs in eight cell types. (**b**) Distributions of single-cell PWM-ISM scores for GATA1 and ZNF354C in eight cell types. The PWM-ISM score is the dot product of the PWM and ISM measurements of motif matches.

**Figure 7.** Performance of PROTRAIT in large-scale scATAC-seq analysis. (**a**–**d**) The runtime, peak CPU memory usage, peak GPU memory usage, and number of parameters of PROTRAIT on the sciATAC human atlas. To remove fluctuations from random sampling, we employ a cross-validation strategy and repeat the down-sampling 10 times to obtain averages.

© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

## Share and Cite

**MDPI and ACS Style**

Wang, Z.; Zhang, Y.; Yu, Y.; Zhang, J.; Liu, Y.; Zou, Q.
A Unified Deep Learning Framework for Single-Cell ATAC-Seq Analysis Based on ProdDep Transformer Encoder. *Int. J. Mol. Sci.* **2023**, *24*, 4784.
https://doi.org/10.3390/ijms24054784
