
A Survey of Archaeal Restriction–Modification Systems

New England Biolabs, Ipswich, MA 01938, USA
Author to whom correspondence should be addressed.
Microorganisms 2023, 11(10), 2424
Submission received: 1 September 2023 / Revised: 24 September 2023 / Accepted: 25 September 2023 / Published: 28 September 2023
(This article belongs to the Special Issue Genomics of Extremophiles and Archaea)


When compared with bacteria, relatively little is known about the restriction–modification (RM) systems of archaea, particularly those in taxa outside of the haloarchaea. To improve our understanding of archaeal RM systems, we surveyed REBASE, the restriction enzyme database, to catalog what is known about the genes and activities present in the 519 completely sequenced archaeal genomes currently deposited there. For 49 (9.4%) of these genomes, we also have methylome data from Single-Molecule Real-Time (SMRT) sequencing that reveal the target recognition sites of the active m6A and m4C DNA methyltransferases (MTases). The gene-finding pipeline employed by REBASE is trained primarily on bacterial examples and is therefore biased toward finding archaeal genes that resemble those of bacteria. Nonetheless, the organizational structure and protein sequence of RM systems from archaea are highly similar to those of bacteria, with both groups acquiring systems from a shared genetic pool through horizontal gene transfer. As in bacteria, we observe numerous examples of “persistent” DNA MTases conserved within archaeal taxa at different levels. We experimentally validated two homologous members of one of the largest “persistent” MTase groups, revealing that methylation of C(m5C)WGG sites may play a key epigenetic role in Crenarchaea. Throughout the archaea, genes encoding m6A, m4C, and m5C DNA MTases, respectively, occur in approximately the ratio 4:2:1.

1. Introduction

Restriction–modification (RM) systems are one of the best-known defense systems used by prokaryotes to prevent phage infection [1,2]. They comprise a restriction enzyme (REase) that cleaves unmodified DNA and a DNA methyltransferase (MTase) that modifies DNA to block cleavage by the cognate REase. There are four main types of such systems. Type I systems employ three subunits acting in complex, where the R subunit is responsible for restriction, the M subunit is responsible for methylation, and the S subunit is responsible for recognizing the specific DNA sequence that is to be modified or cleaved. Type II systems usually contain two independent enzymes, an REase and an MTase, both of which must recognize and target the same DNA sequence. However, in some Type II systems (designated Type IIG), the MTase and REase activities are encoded in the same polypeptide. Type III systems, like Type I systems, consist of subunits that must act as complexes: an MTase (Mod) subunit that is also solely responsible for sequence recognition and an REase (Res) that must complex with the Mod subunit to cleave unmodified sequences. Type IV systems, also found in many prokaryotes, comprise only an REase that cleaves methylated DNA. Examples of all four types of systems are found in both bacteria and archaea.
REBASE is a comprehensive database of sequence and experimental information about RM systems, drawing information from all fully sequenced microbial genomes deposited in GenBank [3,4]. Methylome data derived from Single-Molecule Real-Time (SMRT) sequencing are also included, enabling the assignment of target sites to MTases and their companion REases. Such assignments are then propagated to homologous enzymes in other organisms for which no experimental data are available. While nanopore sequencing is also capable of detecting DNA methylation, the accuracy of de novo motif calling, particularly for motifs with m6A and m4C, is currently lower than for SMRT sequencing [5,6]. As a result, relatively little nanopore-based microbial methylation data have been deposited in REBASE to date. We expect this to change as methods continue to improve.
RM systems of bacteria have been far more extensively studied than those of archaea. Numerous surveys of different types of RM systems focus largely or exclusively on bacteria, and many of them use REBASE as source data. These studies have addressed topics such as Type I systems with recombining S subunits [7], phase-variable Type I systems [8], phase-variable Type III systems [9], solitary REase genes [10], conserved (“persistent”) MTases [11], and the association of RM systems with mobile elements and genome rearrangements [12]. Surveys of RM systems found specifically in archaea are fewer, with the largest being a study in Halobacteria [13]. This review examines more broadly the RM systems of archaea, for which there is relatively little experimental data about restriction and REases. Owing to methylation-sensitive sequencing techniques such as SMRT sequencing, however, our knowledge of DNA methylation and MTases in archaea is improving. There are currently 519 complete DNA sequences for archaeal genomes, and SMRT methylation data are available for 49 of these.

2. Materials and Methods

2.1. Identification of Genomes

We retrieved a list of accession numbers of all genome sequence files stored in the REBASE database [3] and grouped together different accession numbers associated with the same strain (n = 59,327 strains). From this list, we first retrieved all genome sequences that had been taxonomically curated by NCBI and were stated to belong to the domain Archaea (n = 697 strains). We next retrieved all sequences, regardless of taxonomic assignment, that had not originated with NCBI (n = 1417 strains). The latter set was manually curated to identify the archaea (n = 15 strains), and these were combined with the NCBI set for a total of 712 strains.
This set was further parsed to remove genomes whose sequence was not complete at the time of accession. Of the 712 strains, 487 were flagged as “complete genome” in the GenBank definition line and retained. From the other 225 strains, we removed those flagged as whole-genome shotgun data, those where the longest sequence was less than 500 kb, and those where the status of the NCBI genome sequence project was anything less than complete. The remaining strains in the latter set (n = 32 strains) were combined with the earlier set for a total of 519 archaeal strains with complete genome data in REBASE. Of these, 49 also had associated methylome data from SMRT sequencing.
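The two-pass completeness filter described above can be sketched as follows. This is only an illustration of the stated criteria: the `StrainRecord` class and its field names are hypothetical and do not reflect how REBASE or GenBank actually expose this information.

```python
from dataclasses import dataclass

MIN_LONGEST_SEQ_BP = 500_000  # 500 kb threshold stated in the text


@dataclass
class StrainRecord:
    """Hypothetical per-strain summary; field names are illustrative only."""
    name: str
    definition_line: str   # GenBank definition line
    is_wgs: bool           # flagged as whole-genome shotgun data
    longest_seq_bp: int    # length of the longest sequence for the strain
    project_status: str    # status of the NCBI genome sequence project


def is_complete(rec: StrainRecord) -> bool:
    """Return True if the strain passes the completeness filter."""
    # First pass: explicit "complete genome" flag in the definition line.
    if "complete genome" in rec.definition_line.lower():
        return True
    # Second pass: drop WGS data, short assemblies, and unfinished projects.
    if rec.is_wgs:
        return False
    if rec.longest_seq_bp < MIN_LONGEST_SEQ_BP:
        return False
    return rec.project_status.lower() == "complete"
```

Applying `is_complete` to each of the 712 archaeal strains and keeping those that return `True` corresponds to the selection that yielded the 519 strains used here.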

2.2. Identification and Clustering of Genes

Genomes processed for entry into REBASE were analyzed to identify genes associated with RM systems using the SEQWARE v. 4 software pipeline [14]. We obtained all such genes encoded by the 519 archaeal strains identified above (n = 4135 protein sequences). These sequences were clustered to 30% sequence identity using Usearch v11 cluster_fast (n = 1034 sequence clusters) [15].
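As a minimal sketch of how cluster sizes could be tallied from such a run, the following assumes the clustering results were written in the USEARCH tabular UC format, in which `S` records denote centroids, `H` records denote members assigned to a centroid, and the second tab-separated field is the cluster number. The function names are illustrative and are not part of the published pipeline.

```python
from collections import Counter
from typing import Dict, Iterable


def cluster_sizes(uc_lines: Iterable[str]) -> Counter:
    """Count members per cluster from UC-format records."""
    sizes: Counter = Counter()
    for line in uc_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        # 'S' = centroid (seed), 'H' = hit assigned to an existing centroid.
        # 'C' summary records are skipped to avoid double counting.
        if fields[0] in ("S", "H"):
            sizes[int(fields[1])] += 1
    return sizes


def summarize(sizes: Counter) -> Dict[str, int]:
    """Summarize the size distribution of the clusters."""
    values = list(sizes.values())
    return {
        "clusters": len(values),
        "largest": max(values, default=0),
        "size_ge_10": sum(1 for v in values if v >= 10),
        "singletons": sum(1 for v in values if v == 1),
    }
```

A summary of this kind gives the number of clusters, the largest cluster size, and the counts of large and singleton clusters.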

2.3. Construction of HMM Library

To predict the function of the uncharacterized archaeal proteins, we built a library of 62 HMMs spanning many different RM system-related functions and protein types (Supplementary Materials, Table S1). Protein sequences from which these HMMs were constructed were obtained from REBASE, focusing on experimentally characterized examples, where available, and their close homologs. Of these protein sequences, 462 were DNA MTases (including Type IIG RM proteins) and 202 were of all other functions (REases, S proteins, etc.). Sequences comprising two fused MTase domains were separated into component domains, but other multidomain proteins (Type IIG RM proteins, for example) were left intact. These two groups were separately clustered and visualized in two dimensions using CLANS [16] run under the MPI Bioinformatics Toolkit [17]. The resulting clusters were used to verify and refine the protein sets used for each HMM. Most of the final sets formed visually well-defined clusters in the CLANS analysis.
The protein sets were presumed to comprise functionally similar and/or evolutionarily related groups of proteins. The MTase sets were generally homogeneous in terms of methylation type (m6A, m4C, or m5C) based on experimentally characterized examples. However, protein sequences of MTases conferring m6A and those conferring m4C can be very similar [18], and four HMMs (b1a, lmoa118-like, nru-like, and b3) were built from sequence sets that included characterized MTases of both types (Supplementary Materials, Table S1). For the purpose of classifying based on methylation type (used in the tables in this work), the HMMs b1a, lmoa118-like, and nru-like were all considered to be m6A, and b3 was considered to be m4C.
Each set of protein sequences was aligned using Muscle v. 5.1 [19] run under Geneious Prime 2023.0.4 using default parameters. An HMM was built from each alignment using Hmmer v. 3.3.2 hmmbuild. A list of the HMMs can be found in the Supplementary Materials, Table S1. Of the 62 HMMs, 41 were built from the MTases and 21 from the other functions.
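For readers who wish to script the profile-building step, a minimal sketch follows. It only constructs the `hmmbuild` command line (output HMM file first, alignment file second, with `-n` naming the profile). The directory layout and the `.afa` file extension are assumptions made for illustration, and the alignments themselves were produced in Geneious rather than by this code.

```python
from pathlib import Path
from typing import List


def hmmbuild_command(alignment: Path, out_dir: Path) -> List[str]:
    """Build the argument list for one hmmbuild run.

    The profile is named after the alignment file (hypothetical convention).
    """
    name = alignment.stem
    hmm_path = out_dir / f"{name}.hmm"
    # hmmbuild [options] <hmmfile_out> <msafile>
    return ["hmmbuild", "-n", name, str(hmm_path), str(alignment)]
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` on a system where HMMER is installed.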

2.4. Bacterial Genes and Genomes

For comparison with archaea, we also retrieved the set of RM genes encoded in all completely sequenced bacterial genomes deposited in REBASE that had associated methylome data, resulting in a total of 36,718 RM-related genes from 3369 genomes. These RM-related genes were individually classified using the same HMM library and methodology used for the archaeal genomes described above.

2.5. Characterization of MTase Activity

Plasmid clones were synthesized (GenScript Biotech, Piscataway, NJ, USA) with codon-optimized genes encoding suaIIM and asp7IM in pRRS10, a lower-copy number derivative of the constitutive expression plasmid pRRS (GenBank acc. no. JN569339) with a pBR322 origin of replication. Clones were used to transform the DNA methylation-deficient E. coli strain ER2796, which is notably Dcm−. Genomic DNA from overnight cultures grown at 37 °C in LB with 100 µg/mL ampicillin was purified using the Monarch HMW DNA Extraction Kit (New England Biolabs, Ipswich, MA, USA). DNA was sheared in a Covaris ML230 (Covaris, Woburn, MA, USA) using the 175 bp AFA-TPX protocol.
Sequencing libraries were constructed from 100 ng of sheared DNA using the NEBNext Ultra II DNA Library Prep Kit for Illumina (New England Biolabs, Ipswich, MA, USA) and partially deaminated using the RIMS-seq2 protocol [20]. Five µL of USER-treated library DNA was used for the PCR amplification step (6 cycles, with barcoded primers from the NEBNext 96 Unique Dual Index Primers) (New England Biolabs, Ipswich, MA, USA).
Libraries were sequenced on a NextSeq (Illumina, San Diego, CA, USA) using the 2 × 76 + 8 + 8 protocol. A total of 1.4 × 10^7 reads from the asp7IM clone and 1.5 × 10^7 reads from the suaIIM clone were obtained. Methylation at m5C sites was determined by comparing the C>T deamination rates of read1 and read2 [21]. Motifs were determined by searching for over-represented sequences around these sites using pipelines based on both MoSDi [22] and DiNAMO [23], with similar results. The presence of dcm-6, the nonsense mutation inactivating the dcm gene in the ER2796 host, was verified in the sequence assembly.

3. Results and Discussion

3.1. Archaeal Genomes and RM Genes in REBASE

Genome sequences from archaea, and the RM system-related genes encoded by them, were obtained from the REBASE database [3]. To minimize our chances of making assumptions based on missing data, we restricted our analysis to those genomes that appeared to be completely sequenced, closed, and finished—a total of 519. The genomes in this set are not evenly distributed across the phylogenetic tree, with 480 (92.5%) coming from just six archaeal classes (phylum in parentheses): Thermoprotei (Crenarchaeota); Methanomada, Halobacteria, Methanomicrobia, and Thermococci (all Euryarchaeota); and Nitrososphaerota (TACK group). This uneven distribution likely reflects a combination of sampling bias, academic or industrial interest, and ease of culturing. In the 519 archaeal genomes, we identified 4135 RM-related genes, which were grouped into 1034 sequence clusters based on 30% protein sequence identity. The sizes of these clusters ranged from 167 to 1, with 88 clusters of size ≥ 10 and 494 of size = 1. A complete list of genomes and cluster members can be found in the Supplementary Materials, Table S2.

3.2. Functional Categorization of Gene Clusters

For functional prediction, we constructed a library of 62 HMMs, each built from an RM-related evolutionary or functional group of protein sequences, using experimentally characterized examples where available (see Section 2). Each HMM was assigned to one of 13 general functional categories based on RM system type and biochemical activity (Supplementary Materials, Table S1). The protein sequence of the centroid of each gene cluster was used as a query to search the HMM library, and the predicted function of the cluster was determined as the functional category of the top HMM hit. For Type II DNA MTase clusters that included experimentally characterized members (largely based on SMRT sequencing data), the target site of the characterized examples was taken as representative of the entire cluster.
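The assignment rule described above (the category of the top HMM hit to the cluster centroid) can be sketched as below. Ranking hits by bit score is an assumption, since the text does not state the ranking criterion, and the HMM names used in the usage example are hypothetical placeholders rather than entries from Table S1.

```python
from typing import Dict, List, Optional, Tuple

Hit = Tuple[str, float]  # (HMM name, bit score); scoring scheme is assumed


def predict_category(
    hits: List[Hit], hmm_category: Dict[str, str]
) -> Optional[str]:
    """Return the functional category of the best-scoring HMM hit.

    Returns None when the centroid has no hit; how such clusters were
    handled in the original analysis is not specified here.
    """
    if not hits:
        return None
    top_hmm, _score = max(hits, key=lambda hit: hit[1])
    return hmm_category[top_hmm]
```

For example, with a hypothetical mapping `{"hmm_x": "IIR", "hmm_y": "IIM"}`, a centroid whose best hit is `hmm_y` would be assigned to category IIM, and that category would then be applied to every member of the cluster.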
For each high-level taxonomic group (phylum, class, and order) represented in our set of 519 archaeal genomes, we determined the mean number of genes per genome from each of these 13 functional categories (Table 1). Looking at the set of genomes in its entirety, the most common category is Type II MTases (IIM), with about 2.7 per genome. The mean number of known Type II REase genes (IIR) is more than 30-fold lower; this partially reflects the prevalence of orphan MTases, which are similar to those of Type II RM systems but lack an REase partner. However, it is worth noting that Type IIR genes are difficult to identify based on sequence similarity [24], and our HMM library captured only three specific homologous groups of these enzymes, typified by BsiHKI, DpnII, and (presumably) DUF3883. As a result, this category is expected to be significantly under-represented in our data, with most IIR genes instead captured in the “Other” category. Type I and Type IIG RM systems are the next most common types, at just under one per genome. Type III and IV systems are the least common, at less than 0.2 per genome on average. However, it is also possible that Type IV systems are under-represented for the same reason as Type II REases.
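The per-taxon means reported in Table 1 amount to dividing the number of genes in each functional category by the number of genomes in the taxon, including genomes that encode no genes of that category. A minimal sketch, with hypothetical input structures, is:

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def mean_genes_per_genome(
    genome_taxon: Dict[str, str],
    gene_records: Iterable[Tuple[str, str]],
) -> Dict[str, Dict[str, float]]:
    """Mean gene count per genome for each (taxon, category) pair.

    genome_taxon maps genome ID -> taxon name.
    gene_records yields (genome ID, functional category) per gene.
    """
    genomes_per_taxon = Counter(genome_taxon.values())
    gene_counts: Dict[str, Counter] = defaultdict(Counter)
    for genome, category in gene_records:
        gene_counts[genome_taxon[genome]][category] += 1
    # Divide by all genomes in the taxon, not only those carrying the gene.
    return {
        taxon: {cat: n / n_genomes for cat, n in gene_counts[taxon].items()}
        for taxon, n_genomes in genomes_per_taxon.items()
    }
```

The same function can be run once per taxonomic rank (phylum, class, order) by supplying a different `genome_taxon` mapping.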
Among the phyla, the Crenarchaeota are generally depleted in RM systems of all types, although the single representative from the order Cenarchaeales, Cenarchaeum symbiosum A, harbors 22 MTase genes of Type IIM, so this is not universally true. Figure 1 illustrates two extremes in RM system content in the Crenarchaeota, and archaea in general. The 25 RM system loci in C. symbiosum A, which are spread throughout the chromosome, include 17 orphan Type II MTases, one Type II MTase paired with a second MTase, one Type II MTase paired with a vsr gene, two complete Type II RM systems, two Type IIG genes, and two complete Type III RM systems (Figure 1A). All or nearly all recognize different sites based on characterized homologous examples. The genome of Fervidicoccus fontis Kam940 is more typical of Crenarchaeota, with only two RM loci, both Type II orphan MTases (Figure 1B).
Type I systems are particularly prevalent among the Methanomicrobia, at more than three per genome, and Type III systems are prevalent among both the Methanomicrobia and Thermoplasmata. The Halobacteria and Methanomicrobiales are relatively rich in Type IIG RM systems, at more than one per genome. Factors affecting the differences in RM system content and type between taxonomic groups may include the frequency of exposure to phage, the relative efficiency of horizontal exchange, and the microbiomes in which their members typically reside.

3.3. DNA Methylation Phenotypes

Of the 519 complete archaeal genomes under consideration here, 49 have associated methylome data from SMRT sequencing (Pacific Biosciences). From these data, one can readily identify DNA motifs around m6A and m4C methyl marks; m5C-associated motifs can also sometimes be identified, but with less efficiency and accuracy [25,26]. Alternative methods such as bisulfite sequencing, EM-seq [27], TAPS-seq [28], and RIMS-seq [21] are better suited to identifying m5C motifs, but they have not yet been applied to archaeal genomes at a large scale. Table 2 shows the number of genomes in each taxonomic group that have associated methylome data derived from SMRT sequencing, as well as the mean number of genes and observed motifs of each methylation type.
It is expected that the number of MTase genes should equal or exceed the number of motifs since not every gene is active, and many m5C motifs are not detected via SMRT sequencing. Indeed, we observed that in general, the numbers of genes and motifs are comparable, indicating that most of the MTase genes are active. We observed two cases where the number of motifs exceeds the number of genes: m6A in Desulfurococcales and m4C in Methanosarcinales (Table 2). This can be due to erroneous prediction of protein activities (typically misclassifying m6A vs. m4C) or identification of motifs (typically misclassifying m4C vs. m5C), or it may indicate that the genome sequence is incomplete, likely missing one or more plasmids that could encode additional MTases. In the archaea as a whole, the ratio of MTase genes predicted to encode m6A, m4C, and m5C enzymes is approximately 4:2:1 (Table 2). Certain phyla show significantly different ratios, however. In Crenarchaeota, the most prevalent class is m5C due to the universal presence of a single persistent m5C MTase (see below) and the general depletion of RM systems in this taxon. In the TACK group, the most prevalent class is m4C due to the presence of several persistent m4C MTases in the Nitrososphaerota (see below).

3.4. Comparison with Bacteria

For comparison with archaea, we retrieved a large set of completely sequenced bacterial genomes from REBASE and performed a similar analysis (see Section 2). Overall, archaea encoded fewer RM-related genes than bacteria (7.9 vs. 10.9), and this was true of every class of genes except IIR, IIG, M (BREX), and V (Figure 2A). Interestingly, the overall ratio of m6A, m4C, and m5C MTase genes in the bacterial genome set is approximately 5:1:1.5, with m5C outnumbering m4C (Figure 2B). The relative difference in the ratio of m4C and m5C between bacteria and archaea may reflect a greater proportion of hyperthermophiles in archaea.

3.5. Persistent MTases and RM Systems

Many RM systems and orphan MTases show a “patchy” distribution of homologs across a phylogenetic tree and significant differences between closely related strains, a pattern most parsimoniously explained by frequent horizontal gene transfer (HGT) and gene loss [13]. The resulting diversity of defense systems can be advantageous in protecting a population from infection by phage and other deleterious genetic elements. However, the ability of DNA methylation to affect gene transcription and other DNA–protein interactions can result in orphan DNA MTases (and sometimes full RM systems) acquiring functional roles outside of cellular defense. When this happens, the selective pressure on the genes encoding them can favor conservation and vertical transmission; such genes are sometimes termed “persistent” because they are less likely to be lost over time than most RM systems [11]. Classical examples of these include dam in the Gammaproteobacteria and ccrM in the Alphaproteobacteria. Large-scale comparative genomic studies have identified additional examples in bacteria and in the archaeal class Halobacteria [11,13,14,29].
We define a persistent MTase or RM system as one that is present in at least 75% of members of a given taxonomic group represented by at least five genomes in our set. We mapped the 88 clusters with ≥10 members to the taxonomic tree of the 519 archaeal genomes to identify such cases. For those clusters that met our definition, or nearly so, we combined them with closely related clusters, built phylogenetic trees on the combined sets, and reassorted the members based on monophyletic groups where necessary. We refer to these manually adjusted clusters as homologous groups (HGs). Table 3 shows each taxonomic group encoding at least one HG that met the criteria for persistence, and Supplementary Materials Table S3 shows the original cluster number to which each HG member belongs.
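The persistence criterion defined above reduces to a simple threshold test, sketched here. The function name and the choice to return `None` for under-sampled taxa are illustrative conventions, not part of the original analysis.

```python
from typing import Optional

MIN_GENOMES = 5       # taxon must be represented by at least five genomes
MIN_FRACTION = 0.75   # system must be present in at least 75% of them


def is_persistent(n_with_system: int, n_genomes: int) -> Optional[bool]:
    """Apply the persistence criterion to one system in one taxon.

    Returns None when the taxon has too few genomes to be assessed.
    """
    if n_genomes < MIN_GENOMES:
        return None
    return n_with_system / n_genomes >= MIN_FRACTION
```

Using counts reported later in this section, HG1 in the Halobacteria (163/181) satisfies the criterion, whereas its presence in Halorubrum (1/10) does not, and a taxon with only two genomes cannot be assessed.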
We identified 1 persistent group at the phylum level, 3 at the class level, 7 at the order level, and 18 between the levels of family and species. Of these 29 persistent systems, 20 are Type II orphan MTases (all with 4–5 base recognition sites, and all but one palindromic), 1 is a complete Type II RM system, 2 are BREX-like MTases, 5 are Type I systems (comprising 2 or 3 genes), and 1 is a Type IV REase (Table 3). Four persistent systems (HG2, HG3, HG1, and HG11) are shared between multiple taxonomic groups, which may be due either to independent acquisition or to gene loss in sister taxa.
The largest group, HG1, is found throughout the Halobacteria (163/181), except for Halorubrum (1/10) and Haloquadratum (0/2); its members are orphan m4C MTases that modify CTAG (with the underline here and elsewhere indicating the methylated base), and it corresponds to cHG U observed previously by Fullmer and coworkers [13]. Although the general function of this epigenetic signal remains unknown, the CTAG sequence is generally under-represented in Halobacterial genomes [13] but locally clustered upstream of orc1/cdc6 gene orthologs [14], which encode the origin of replication binding complex in most archaea, a role analogous to that of DnaA in bacteria. This suggests a role for HG1 in chromosome replication or the regulation thereof in Halobacteria, but its precise function remains to be elucidated.
The second largest group, HG2, is found almost universally throughout the Crenarchaeota phylum (99/100) as well as in most Methanococci (where in Methanocaldococcus it is present in two copies) and Pyrococcus; its members are orphan m5C MTases. Prior to this work, two examples from this clade had predicted recognition sites, although neither had been tested directly: M.SuaII had been predicted to modify RGATCY based on SMRT sequencing of Sulfolobus acidocaldarius DSM639 [14] and M.Asp7I was predicted to modify GGCAC in Acidilobus species 7A. To address the conflicting predictions, we cloned and expressed both genes in a methyl-deficient strain of E. coli and, using RIMS-seq [21], found both to modify the heterologous host chromosome in vivo at CCWGG sites, the same site modified in wild-type E. coli strains by the product of dcm. In other words, both predictions were incorrect. The presence of a persistent m5C MTase in hyperthermophiles is intriguing since the rate of deamination of m5C is expected to be high at elevated temperatures, leading to a mutator phenotype [30]. The answer to this conundrum may be that HG2 is silenced under most conditions: although M.SuaII is active as a constitutively expressed clone, negligible levels of m5C methylation were observed in its native host, S. acidocaldarius, under the conditions of one published experiment [31]. This suggests that HG2 may be under tight regulatory control, in contrast to Dcm, which provides nearly complete methylation of CCWGG sites in E. coli.
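Degenerate recognition sites such as CCWGG are written with IUPAC nucleotide codes (W = A or T). As a small illustration of how such a motif can be located in a sequence, the sketch below converts a motif to a regular expression and reports 0-based start positions. It is not the motif-discovery method used in this work, which relied on MoSDi and DiNAMO.

```python
import re
from typing import List

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif: str) -> "re.Pattern[str]":
    """Compile an IUPAC motif into a regex that finds overlapping matches."""
    body = "".join(IUPAC[base] for base in motif.upper())
    # A lookahead lets finditer report sites that overlap one another.
    return re.compile(f"(?=({body}))")


def find_sites(sequence: str, motif: str) -> List[int]:
    """Return 0-based start positions of the motif on the given strand."""
    pattern = motif_to_regex(motif)
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

Because CCWGG is palindromic, scanning one strand finds every site. For a nonpalindromic motif, the reverse complement of the sequence would need to be scanned as well.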
The third largest group, HG3 (which corresponds to cHG W described previously [13]), encodes a Dam-like orphan m6A MTase that, based on characterized examples, modifies GATC sites. This MTase appears to have been independently established in several taxa: genus Methanobacterium (10/12), species Methanococcus maripaludis (9/9), family Halorubraceae (15/20, often accompanied by a second, plasmid-encoded copy), order Methanomicrobiales (15/19), and class Nitrososphaerota (27/29). HG3 members also sporadically appear in other strains, sometimes as an orphan and sometimes with an associated REase gene.
Group HG4, nearly ubiquitous in Natrialbales (41/42), encodes an orphan m6A MTase and is the only example of a Type II MTase group found here that modifies a nonpalindromic sequence, CATTC. All of the remaining persistent Type II MTases are m4C MTases: HG10 and HG18 (GTAC); HG11, HG16, and HG19 (AGCT); HG12 (CGCG); HG15 (GGCC); HG20 (CTNAG); and HG21 (unknown recognition site). All are orphans except for HG15, which is always accompanied by a companion REase, an arrangement atypical of persistent systems [11]. Two taxa, the Methanomicrobiales and the Nitrososphaerota, are particularly rich in these persistent m4C orphan MTases. Interestingly, in both taxa, GATC (conferred by HG3) and AGCT (conferred by HG11, HG16, or HG19) are present throughout the group or nearly so, with one or more additional persistent m4C groups present in the subclades. This may indicate a common epigenetic function for GATC and AGCT methylation in these distantly related taxa.
Several Type I RM systems met the criteria for persistence. However, given that the target sites of these systems are dictated by the specificity subunit, which tends to be the least conserved of the three Type I components, it is not clear that members of all of these systems recognize and methylate the same sequence. It may be that these systems are not vertically inherited, but rather are frequently horizontally exchanged between strains of the same species or taxon. HG6, for example, is also found frequently in Thermococcus and Methanothermobacter, and HG14 in other Methanomicrobia. Interestingly, four of the five Type I RM systems that meet the criteria for persistence are found in Methanosarcina, and they are the primary reason that this genus has the highest density of Type I RM systems in the archaea generally, at more than four per genome (Table 1).
HG5 resembles PglX, the MTase associated with BREX systems, and is persistent in the genus Haloterrigena but sporadically found throughout the rest of the Halobacteria. HG13, which weakly resembles Eco57I-like Type IIG systems, is persistent in Methanosarcina mazei (where it is largely coincident with the four Type I systems) but sporadic throughout other Methanosarcinales. The lone persistent Type IV system, HG9, strongly resembles (39% identity) Mrr from E. coli K-12 and is persistent in the genus Methanosarcina (22/29).
The determination of persistence is highly dependent on the availability of completely sequenced genomes. Many taxa in our set are not represented by a sufficient number of genomes to be able to determine persistence based on our criteria. In general, higher-order taxa are represented by more examples than lower-order taxa. However, even among higher-order taxa, two of six phyla and 9 of 16 classes are represented in our set by fewer than five examples, too few to make a persistence determination. Also, in general, lower-order taxa tend to be less diverse groups and therefore would be expected to have more persistent systems than higher-order taxa. However, more specific taxa are also less likely to have enough examples to make the assessment. For example, only seven named archaeal species have more than five examples in our set, but three of these seven have at least one persistent system by our criteria (Table 3). The sequencing to closure of additional archaeal genomes from a broad diversity of taxa will no doubt reveal many additional examples.

Supplementary Materials

The following supporting information is available for download. Table S1: Profiles and functions used for gene classification. Functions refer to the biochemical function(s) of the proteins in the set: M = MTase; R = REase; C = control protein (transcriptional regulator); S = target specificity; V = Vsr repair endonuclease. Categories refer to the 13 general functional categories, a combination of biochemical activity and RM system type, used for functional classification. The methylation type applies only to the MTases. Table S2: Number of protein cluster members in each genome. In the header are shown the cluster number, number of members, and top HMM hit to the cluster centroid. Columns are in descending order by cluster size. Entries show the number of cluster members (n) in each genome, colored pink when n = 1 and yellow when n > 1. Those genomes with associated methylome data from SMRT sequencing are shaded in green. Orgnum = REBASE organism number; taxonomy = taxonomy string as determined by NCBI. Table S3: Number of homologous group (HG) members in each genome, showing only those homologous groups that are persistent in at least one taxon. In the header are shown the HG number, number of members, and top HMM hit to the cluster centroid. Columns are in descending order by HG size, except for Type I RM systems, for which the persistent MTase member is shown adjacent to its companion REase and specificity subunits, which may or may not also be persistent. Members of the same RM system are shaded in color in the top row. Entries show the original cluster number of each HG member(s), colored in orange for those taxa where it meets the criteria for persistence. Those genomes with associated methylome data from SMRT sequencing are shaded in green. Orgnum = REBASE organism number; taxonomy = taxonomy string as determined by NCBI.

Author Contributions

Conceptualization, R.J.R. and B.P.A.; methodology, R.J.R. and B.P.A.; software, R.J.R. and B.P.A.; investigation, B.P.A.; resources, R.J.R.; data curation, R.J.R. and B.P.A.; writing—original draft preparation, B.P.A.; writing—review and editing, R.J.R. All authors have read and agreed to the published version of the manuscript.


Funding

R.J.R. and B.P.A. are employed by New England Biolabs. This research received no additional external funding.

Data Availability Statement

Publicly available datasets were analyzed in this study. These data can be found in REBASE [3].


Acknowledgments

The authors would like to thank Dana Macelis and Tamas Vincze for their excellent technical assistance and the late Donald G. Comb for inspiration and support.

Conflicts of Interest

New England Biolabs had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.


References

  1. Loenen, W.A.; Dryden, D.T.; Raleigh, E.A.; Wilson, G.G.; Murray, N.E. Highlights of the DNA cutters: A short history of the restriction enzymes. Nucleic Acids Res. 2014, 42, 3–19.
  2. Oliveira, P.H.; Touchon, M.; Rocha, E.P. The interplay of restriction-modification systems with mobile genetic elements and their prokaryotic hosts. Nucleic Acids Res. 2014, 42, 10618–10631.
  3. Roberts, R.J.; Vincze, T.; Posfai, J.; Macelis, D. REBASE: A database for DNA restriction and modification: Enzymes, genes and genomes. Nucleic Acids Res. 2023, 51, D629–D630.
  4. Sayers, E.W.; Cavanaugh, M.; Clark, K.; Pruitt, K.D.; Schoch, C.L.; Sherry, S.T.; Karsch-Mizrachi, I. GenBank. Nucleic Acids Res. 2022, 50, D161–D164.
  5. McIntyre, A.B.R.; Alexander, N.; Grigorev, K.; Bezdan, D.; Sichtig, H.; Chiu, C.Y.; Mason, C.E. Single-molecule sequencing detection of N6-methyladenine in microbial reference materials. Nat. Commun. 2019, 10, 579.
  6. Rand, A.C.; Jain, M.; Eizenga, J.M.; Musselman-Brown, A.; Olsen, H.E.; Akeson, M.; Paten, B. Mapping DNA methylation with high-throughput nanopore sequencing. Nat. Methods 2017, 14, 411–413.
  7. Atack, J.M.; Guo, C.; Litfin, T.; Yang, L.; Blackall, P.J.; Zhou, Y.; Jennings, M.P. Systematic Analysis of REBASE Identifies Numerous Type I Restriction-Modification Systems with Duplicated, Distinct hsdS Specificity Genes That Can Switch System Specificity by Recombination. mSystems 2020, 5, e00497-20.
  8. Atack, J.M.; Guo, C.; Yang, L.; Zhou, Y.; Jennings, M.P. DNA sequence repeats identify numerous Type I restriction-modification systems that are potential epigenetic regulators controlling phase-variable regulons; phasevarions. FASEB J. 2020, 34, 1038–1051.
  9. Atack, J.M.; Yang, Y.; Seib, K.L.; Zhou, Y.; Jennings, M.P. A survey of Type III restriction-modification systems reveals numerous, novel epigenetic regulators controlling phase-variable regulons; phasevarions. Nucleic Acids Res. 2018, 46, 3532–3542.
  10. Ershova, A.S.; Karyagina, A.S.; Vasiliev, M.O.; Lyashchuk, A.M.; Lunin, V.G.; Spirin, S.A.; Alexeevski, A.V. Solitary restriction endonucleases in prokaryotic genomes. Nucleic Acids Res. 2012, 40, 10107–10115.
  11. Oliveira, P.H.; Fang, G. Conserved DNA Methyltransferases: A Window into Fundamental Mechanisms of Epigenetic Regulation in Bacteria. Trends Microbiol. 2021, 29, 28–40.
  12. Furuta, Y.; Abe, K.; Kobayashi, I. Genome comparison and context analysis reveals putative mobile forms of restriction-modification systems and related rearrangements. Nucleic Acids Res. 2010, 38, 2428–2443.
  13. Fullmer, M.S.; Ouellette, M.; Louyakis, A.S.; Papke, R.T.; Gogarten, J.P. The Patchy Distribution of Restriction–Modification System Genes and the Conservation of Orphan Methyltransferases in Halobacteria. Genes 2019, 10, 233.
  14. Blow, M.J.; Clark, T.A.; Daum, C.G.; Deutschbauer, A.M.; Fomenkov, A.; Fries, R.; Froula, J.; Kang, D.D.; Malmstrom, R.R.; Morgan, R.D.; et al. The Epigenomic Landscape of Prokaryotes. PLoS Genet. 2016, 12, e1005854.
  15. Edgar, R.C. Search and clustering orders of magnitude faster than BLAST. Bioinformatics 2010, 26, 2460–2461.
  16. Frickey, T.; Lupas, A. CLANS: A Java application for visualizing protein families based on pairwise similarity. Bioinformatics 2004, 20, 3702–3704.
  17. Gabler, F.; Nam, S.Z.; Till, S.; Mirdita, M.; Steinegger, M.; Soding, J.; Lupas, A.N.; Alva, V. Protein Sequence Analysis Using the MPI Bioinformatics Toolkit. Curr. Protoc. Bioinform. 2020, 72, e108.
  18. Malone, T.; Blumenthal, R.M.; Cheng, X. Structure-guided analysis reveals nine sequence motifs conserved among DNA amino-methyltransferases, and suggests a catalytic mechanism for these enzymes. J. Mol. Biol. 1995, 253, 618–632.
  19. Edgar, R.C. MUSCLE v5 enables improved estimates of phylogenetic tree confidence by ensemble bootstrapping. bioRxiv 2021.
  20. Yan, B.; Wang, D.; Ettwiller, L. Simultaneous assessment of human genome and methylome data in a single experiment using limited deamination of methylated cytosine. bioRxiv 2023.
  21. Baum, C.; Lin, Y.C.; Fomenkov, A.; Anton, B.P.; Chen, L.; Yan, B.; Evans, T.C.; Roberts, R.J.; Tolonen, A.C.; Ettwiller, L. Rapid identification of methylase specificity (RIMS-seq) jointly identifies methylated motifs and generates shotgun sequencing of bacterial genomes. Nucleic Acids Res. 2021, 49, e113.
  22. Marschall, T.; Rahmann, S. Efficient exact motif discovery. Bioinformatics 2009, 25, i356–i364.
  23. Saad, C.; Noe, L.; Richard, H.; Leclerc, J.; Buisine, M.P.; Touzet, H.; Figeac, M. DiNAMO: Highly sensitive DNA motif discovery in high-throughput sequencing data. BMC Bioinform. 2018, 19, 223.
  24. Kinch, L.N.; Ginalski, K.; Rychlewski, L.; Grishin, N.V. Identification of novel restriction endonuclease-like fold families among hypothetical proteins. Nucleic Acids Res. 2005, 33, 3598–3605.
  25. Clark, T.A.; Murray, I.A.; Morgan, R.D.; Kislyuk, A.O.; Spittle, K.E.; Boitano, M.; Fomenkov, A.; Roberts, R.J.; Korlach, J. Characterization of DNA methyltransferase specificities using single-molecule, real-time DNA sequencing. Nucleic Acids Res. 2012, 40, e29.
  26. Flusberg, B.A.; Webster, D.R.; Lee, J.H.; Travers, K.J.; Olivares, E.C.; Clark, T.A.; Korlach, J.; Turner, S.W. Direct detection of DNA methylation during single-molecule, real-time sequencing. Nat. Methods 2010, 7, 461–465.
  27. Vaisvila, R.; Ponnaluri, V.K.C.; Sun, Z.; Langhorst, B.W.; Saleh, L.; Guan, S.; Dai, N.; Campbell, M.A.; Sexton, B.S.; Marks, K.; et al. Enzymatic methyl sequencing detects DNA methylation at single-base resolution from picograms of DNA. Genome Res. 2021, 31, 1280–1289.
  28. Liu, Y.; Siejka-Zielinska, P.; Velikova, G.; Bi, Y.; Yuan, F.; Tomkova, M.; Bai, C.; Chen, L.; Schuster-Bockler, B.; Song, C.X. Bisulfite-free direct detection of 5-methylcytosine and 5-hydroxymethylcytosine at base resolution. Nat. Biotechnol. 2019, 37, 424–429.
  29. Seshasayee, A.S.; Singh, P.; Krishna, S. Context-dependent conservation of DNA methyltransferases in bacteria. Nucleic Acids Res. 2012, 40, 7066–7073.
  30. Grogan, D.W. Cytosine methylation by the SuaI restriction-modification system: Implications for genetic fidelity in a hyperthermophilic archaeon. J. Bacteriol. 2003, 185, 4657–4661.
  31. Couturier, M.; Lindas, A.C. The DNA Methylome of the Hyperthermoacidophilic Crenarchaeon Sulfolobus acidocaldarius. Front. Microbiol. 2018, 9, 137.
Figure 1. Locations of RM systems in two Crenarchaeota. For each locus, the arrows show the gene arrangement (green = Type II MTase; blue = Type IIG RM system; red = Type III MTase; black = REase; and gray = Vsr nuclease). Numerals show the ORF number from the REBASE nomenclature (where, for example, 1514 = CysAORF1514P) and the motif is the predicted recognition site based on characterized homologs. (A) Cenarchaeum symbiosum A (2.05 Mbp). (B) Fervidicoccus fontis Kam940 (1.32 Mbp).
Figure 2. (A) Mean numbers of RM system genes, by function, in 3369 bacterial genomes (blue) and 519 archaeal genomes (orange). Values for archaea and definitions of the functional groups are from Table 1. (B) Mean numbers of MTase genes conferring each of the three methylated bases in the same sets of bacterial and archaeal genomes. Values for archaea are from Table 2.
Table 1. Mean number of RM-related genes per taxonomic group.
Asgard group (Lokiarchaeota)100011042000002
Crenarchaeota (Thermoprotei)1000.
DPANN group60.3330.1670.510.1670.1670.50.33300000.167
   Nanohaloarchaeota (Nanohalobia)11111010000001
   Nanoarchaeota 30.33300.3330.667000000000
Environmental sample11119011100000
   Archaeoglobi (Archaeoglobales)810.750.8750.87500.250.250.2500001.375
   Thermococci (Thermococcales)440.5680.5680.6141.3410.0910.7950.0230.0230000.0451.705
TACK group310.4190.4190.5813.7100.290.12900000.0970.516
      Nitrososphaerota inc. sed.
a For the purposes of this work, the taxa in bold will be considered phyla, those in Roman type classes, and those in italics orders. If every member of a particular taxon represented here belongs to the same lower-order taxon, that lower-order taxon is shown in parentheses next to the higher-order taxon.
Table 2. Mean numbers of MTase genes and motifs based on methylated base and position.
All Complete GenomesGenomes with Methylation Data
Taxonomic Group aGenomesGenes m6AGenes m4CGenes m5CGenomesGenes m6AGenes m4CGenes m5CMotifs m6AMotifs m4CMotifs m5C
Asgard group (Lokiarchaeota)11070
Crenarchaeota (Thermoprotei)1000.830.561.130.6670.667110.6670.667
DPANN group61.3330.50.1671100100
   Nanohaloarchaeota (Nanohalobia)1201
   Nanoarchaeota 30.6670.33301100100
Environmental sample1660
   Archaeoglobi (Archaeoglobales)81.6250.3750.375
   Thermococci (Thermococcales)441.750.50.477420.250.520.250
TACK group312.0652.1290.35541.
      Nitrososphaerota inc. sed.521.80.4
a For the purposes of this work, the taxa in bold will be considered phyla, those in Roman type classes, and those in italics orders. If every member of a particular taxon represented here belongs to the same lower-order taxon, that lower-order taxon is shown in parentheses next to the higher-order taxon.
Table 3. Persistent RM systems in each taxonomic group.
Taxonomic Group a | Total | Cluster (Members) | Class | Motif b
Crenarchaeota (Thermoprotei) | 100 | HG2 (99) | II M | CCWGG (m5C)
            Sulfolobus acidocaldarius | 9 | HG15 M/R (8) | II M/R | GGCC (m4C)
DPANN group | 6 | None
   Archaeoglobi (Archaeoglobales) | 8 | HG6 M/R (6) | I M/R | n/d
         Methanobacterium | 12 | HG3 (10) | II M | GATC (m6A)
      Methanococci | 24 | HG2 (19) | II M | CCWGG (m5C)
         Methanococcus maripaludis | 9 | HG3 (9) | II M | GATC (m6A)
   Halobacteria | 182 | HG1 (155) | II M | CTAG (m4C)
         Halorubraceae | 20 | HG3 (15) | II M | GATC (m6A)
      Natrialbales | 42 | HG4 (41) | II M | CATTC (m6A)
         Haloterrigena | 9 | HG5 (7) | BREX | CTGGAG (m6A)
      Methanomicrobiales | 19 | HG3 (15) | II M | GATC (m6A)
            Methanoculleus | 6 | HG16 (6) | II M | AGCT (m4C)
            M-regula/M-spirilla group c | 6 | HG16 (6) | II M | AGCT (m4C)
            M-regula/M-spirilla group c | 6 | HG18 (6) | II M | GTAC (m4C)
            M-regula/M-spirilla group c | 6 | HG20 (6) | II M | CTNAG (m4C)
      Methanosarcinales | 42 | HG8 M/R/S (28) | I M/R/S | n/d
         Methanosarcina | 29 | HG9 (22) | IV | n/d
            Methanosarcina mazei | 9 | HG17 M/R (7) | I M/R | n/d
            Methanosarcina mazei | 9 | HG13 (7) | BREX | n/d
            Methanosarcina mazei | 9 | HG14 M/R (9) | I M/R | n/d
            Methanosarcina mazei | 9 | HG7 M/R/S (9) | I M/R/S | n/d
   Thermococci (Thermococcales) | 44 | None
         Pyrococcus | 9 | HG2 (9) | II M | CCWGG (m5C)
TACK group | 31 | None
   Nitrososphaerota | 29 | HG3 (27) | II M | GATC (m6A)
      Nitrosopumilales | 14 | HG11 (14) | II M | AGCT (m4C)
         Nitrososphaerales | 8 | HG10 (8) | II M | GTAC (m4C)
         Nitrososphaerales | 8 | HG19 (8) | II M | AGCT (m4C)
            Nitrososphaera | 5 | HG12 (5) | II M | CGCG (m4C)
            Nitrososphaera | 5 | HG21 (5) | II M | Unknown
      Nitrososphaerota inc. sed. | 5 | HG11 (4) | II M | AGCT (m4C)
a For the purposes of this work, the taxa in bold will be considered phyla, those in Roman type classes, and those in italics orders. If every member of a particular taxon represented here belongs to the same lower-order taxon, that lower-order taxon is shown in parentheses next to the higher-order taxon. b Methylated base on the top strand is underlined. c Methanospirillaceae and Methanoregulaceae consistently form a subclade under Methanomicrobiales and are treated as a single group for the purpose of this table.
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Share and Cite

MDPI and ACS Style

Anton, B.P.; Roberts, R.J. A Survey of Archaeal Restriction–Modification Systems. Microorganisms 2023, 11, 2424.

AMA Style

Anton BP, Roberts RJ. A Survey of Archaeal Restriction–Modification Systems. Microorganisms. 2023; 11(10):2424.

Chicago/Turabian Style

Anton, Brian P., and Richard J. Roberts. 2023. "A Survey of Archaeal Restriction–Modification Systems" Microorganisms 11, no. 10: 2424.
