Abstract

Background. Genome data have been used to find novel allergen from house dust mites. Here, we aim to construct a chromosome-level genome assembly of Dermatophagoides farinae, a common allergenic mite species. Methods. We achieved a chromosome-level assembly of D. farinae’s genome by integrating PacBio single-molecule real-time sequencing, Illumina paired-end sequencing, and Hi-C technology, followed by annotating allergens and mapping them to specific chromosomes. Results. A 62.43 Mb genome was assembled with a 0.52% heterozygosity rate and a 36.11 Merqury-estimated quality value. The assembled genome represents 92.1% completeness benchmarking universal single-copy orthologs with a scaffold N50 value of 7.11 Mb. Hi-C scaffolding of the genome resulted in construction of 10 pseudochromosomes. The genome comprises 13.01% (7.66 Mb) repetitive sequences and predicts 10,709 protein-coding genes, 96.57% of which are functionally annotated. Moreover, we identified and located 36 allergen groups on specific chromosomes, including allergens Der f 1, Der f 2, Der f 23, Der f 4, Der f 5, Der f 7, and Der f 21 located on chromosomes 2, 1, 7, 3, 4, 6, and 4, respectively. Conclusion. This comprehensive genomic data provides valuable insights into mite biology and evolutionary adaptations, potentially advancing D. farinae allergy research and treatment strategies.

1. Introduction

Dermatophagoides farinae (D. farinae) and Dermatophagoides pteronyssinus (D. pteronyssinus) are domestic mite species commonly inducing atopic sensitization and allergy including affecting the eyes, upper and lower airways, skin, and, sometimes, systemic circulation [1]. More than 20% of people are affected by house dust mites (HDM) worldwide, and 30% of them carry asymptomatic HDM sensitization [2]. A cross-sectional survey of 6304 patients with asthma, rhinitis, or both in 17 cities from 4 regions of China revealed that the overall prevalence of positive skin prick responses was 59.0% for D. farinae and 57.6% for D. pteronyssinus [3].

An organism that induces allergy is defined as an allergen source, whereas an allergenic molecule (i.e., allergen component) is considered a molecule (i.e., protein or glycoprotein) derived from an allergen source identified using specific IgE antibodies. Therefore, allergen molecules are isolated from a natural allergen source (i.e., native, purified allergens) or can be produced with recombinant DNA technology [4]. Since HDM are allergen sources, their allergen components have been investigated since the 1980s to explore methods for personalized diagnosis and treatment. In total, 39 allergen groups of IgE-binding components with similar sequences among different mites have been identified from these Dermatophagoides species and recorded by the World Health Organization (WHO) and International Union of Immunological Societies (IUIS) Allergen Nomenclature Sub-Committee, including 36 groups in D. farinae and 30 in D. pteronyssinus [5].

In recent years, genome assemblies of allergenic mites have been an essential resource for finding novel allergen groups through homology search of various nonmite sources. The first mite genome of D. farinae was reported in 2015, which contained the full gene structures of 20 canonical allergens and 7 noncanonical allergen homologs; a major allergen, ubiquinol-cytochrome c reductase-binding protein-like protein, was identified and designated as Der f 24 [6]. In 2017, de novo draft genome assembly method was used for D. pteronyssinus; Der p 16, Der p 22, and Der p 25 till Der p 35 were identified as novel allergen groups [7]. In 2018, D. pteronyssinus genome was generated from a population of randomly breeding diploid dust mites that had been maintained as a colony for several years; allergen isoforms and variations were examined at the genome level in this unique case [8]. These genome data obtained through high-throughput DNA sequencing expedite novel allergen identification. However, the above-mentioned mite genomes [68] were assembled from short-read DNA sequences, resulting in gaps in the genomes and no information regarding the linkage groups, which has impeded the understanding of the house dust mite genome structure.

In 2021, Chen et al. extracted genetic material from D. farinae bodies and eggs and sequenced with short reads from next-generation sequencing (NGS) and long reads from PacBio/nanopore sequencing; compared with the D. farinae draft genome, genome size was corrected (from 53.55 Mb to 58.77 Mb), and the contig N50 was increased (from 8.54 kb to 9365.49 kb). The assembled genome has 10 contigs, 33 canonical allergens, and 2 novel allergens (Der f 37 and Der f 39) [9].

High-throughput/resolution chromosome conformation capture (Hi-C) is frequently used to provide links across various length scales, even spanning entire chromosomes. In contrast to paired-end reads from cloned libraries, any given Hi-C contact spans an unknown length and may connect loci from different chromosomes [10]. In recent years, Hi-C has been used to improve draft genome assemblies and create chromosome-length scaffolds for large genomes [1114]. In the present study, we generated the chromosome-level genome of D. farinae through long-read Pacific Bioscience sequencing, Illumina paired-end sequencing, and Hi-C-based chromatin contact maps; subsequently, the allergens of D. farinae were annotated and assigned to relevant chromosomes. Accordingly, we created a chromosome-level assembly of D. farinae to serve as the most comprehensive genomic data available for an allergenic mite species.

2. Methods

2.1. Mite Culture and Purity Evaluation

D. farinae mites were cultured in our laboratory for many generations in cell culture flasks (75 cm; Corning, USA) in an artificial climate incubator (RXZ-280D; Jiangnan Instrument Factory, Ningbo, China) at and relative humidity [15]. D. farinae was identified during cultivation through the observation of its morphological characteristics. To protect the integrity of the DNA, live mites (~3,000) were collected by hand, washed with 75% ethanol, and immediately frozen and stored in liquid nitrogen.

2.2. Mitochondrial Genome Sequencing

A purity check of the D. farinae mites was performed through complete mitochondrial genome sequencing. Total genomic DNA was extracted using a TIANamp Micro DNA Kit (Tiangen Biotech, Beijing, China). After random shearing using a Covairs ultrasonic breaker, we fragmented DNA for genomic DNA library construction by using the whole-genome shotgun strategy. The DNA fragments were repaired at their protruding ends by using a combination of 3-5 exonuclease and polymerase, and a single “A” base was introduced at the 3 end of the segment for fragment ligation; this promoted selective enrichment of DNA fragments with joints at both ends while amplifying the DNA library. DNA quality was determined through 1% agarose gel electrophoresis and analyzed on an Agilent 2100 Bioanalyzer (Agilent Technologies, Santa Clara, CA, USA). DNA quantity was detected on a NanoDrop 2000 spectrophotometer (Thermo Fisher Scientific, Waltham, MA, USA). Subsequent sequencing was performed on an Illumina NovaSeq platform in the paired-end 150 bp sequencing mode. De novo construction of mitochondrial sequences was performed using A5-miseq v2015052239 [16] and SPAdesv3.9.040 [17], followed by the alignment of the high-depth sequencing results with the mt library of the NCBI database by using BLAST (version 2.2.31) [18]. We then used MUMmer (version 3.1) [19] for collinearity analysis to determine the positional relationships between different contig sequences. The complete mitogenome sequence was revised and corrected using Pilon (version 1.1842) [20] and then uploaded to the MITOS web server [21] for functional annotation under default settings and the genetic code 02-Invertebrate.

2.3. D. farinae Genome Sequencing

The Illumina HiSeq X Ten platform (Illumina, San Diego, CA, USA) and the PacBio Sequel platform (Pacific Bioscience, San Diego, CA, USA) were used to perform genome sequencing to generate short and long reads, respectively. A paired-end Illumina genomic library was generated and sequenced according to the Illumina protocol. For long-read sequencing on the PacBio Sequel sequencer, a 20 kb single-molecule real-time sequencing bell library was constructed using a PacBio DNA Template Prep Kit 1.0. The quality and quantity of each library were assessed using the Agilent 2100 Bioanalyzer (Agilent Technologies). Templates were selected according to size, and long fragments (>10 kb) were enriched for sequencing using BluePippin (Sage Science, Inc., Beverly, MA, USA).

2.4. Genome Survey, Initial Genome Assembly, and Quality Evaluation
2.4.1. K-mer Analysis and Genome Feature Estimation with Illumina Sequencing Data

Raw data from Illumina paired-end reads were processed using fastp (version 0.20.1) [22]. Adaptors reads with low-quality bases and reads containing >10% unknown bases (“”) were removed from the analysis. Subsequently, k-mer frequency distribution analysis was performed, as described previously [12]. The k-mer was set to 17 for further analysis. “kmer_freq_stat” [23] was used to estimate genome size, the presence of repetitive sequences, and genome heterozygosity. The formula for genome size estimation was as follows: , where is the total number of k-mers, is the peak depth of the k-mer, and is the size of the genome. The histogram of the k-mers was generated using Jellyfish (version 2.3.0) [24], and GenomeScope (version 1.0) [25] was used to evaluate genomic characteristics.

2.4.2. Genome Assembly and Chromosome Construction through Long-Read Sequencing

For genome assembly, we first used the open-source FALCON and FALCON-Unzip algorithms Falcon [26], WTDBG2 [27], and Flye [28] to assemble high-quality PacBio subreads independently. The homologous contigs were further optimized and corrected based on self-alignment and sequencing depth. The reads assembled using Falcon, WTDBG2, and Flye were merged using Quickmerge [29]. The merged genome was corrected with the Illumina data using NextPolish and Pilon [30].

The assembled genome quality was evaluated by mapping the short reads from Illumina to the assembly by using BWA-MEM (version 0.7.10-r789) [31], BUSCO (version 3.0.2) in addition to the arachnida_odb10 database, and Merqury (version 1.1) with special parameters ().

2.5. High-Throughput Chromosome Conformation Capture Sequencing and Chromosome Assembly

DNA fixation, chromatin isolation, and library construction were performed to construct a Hi-C chromatin contact map for chromosome-level assembly, according to the manufacturer’s instructions [32]. First, the purified mite bodies were washed twice with phosphate-buffered saline (pH 7.4) and homogenized into a powder by using a mortar and pestle with liquid nitrogen. The homogenized mite samples were then incubated at 50°C for proteolysis, digested with the restriction enzyme HindIII, labeled with biotinylated nucleotides, and end repaired. After the reversal of the crosslinks, the ligated DNA was purified and sheared to a length of 300–700 bp. Biotinylated DNA fragments were captured with streptavidin beads and used for Hi-C fragment library construction. Finally, the quality and quantity of the purified library were verified using a Qubit 3.0 fluorometer (Invitrogen), Agilent Bioanalyzer 12-kb DNA Chip (Agilent Technologies), and quantitative polymerase chain reaction (PCR). A high-quality Hi-C fragment library was prepared for the Illumina HiSeq 2500 platform.

Sequencing data from the Hi-C library were used for chromosome-level assembly [33]. To obtain uniquely mapped read pairs, the raw data were aligned with the initial genome assembly by using BWA-MEM (version 0.7.10-r789) [34]. HiC-Pro [35] was used to evaluate valid Hi-C data based on uniquely mapped read pairs. Only valid read pairs were used for draft genome recorrection and chromosome-level genome assembly. The contigs of the draft genome were split into simulated 500 kb contigs, and LACHESIS [33] was used to sort these contigs into groups with the following parameters: CLUSTER_MIN_RE_SITE = 29, CLUSTER_MAX_LINK_ DENSITY = 2, CLUSTER_MONINFORMATIVE_RATIO = 2, ORDER_MIN_N_ RES_ TRUN = 15, and ORD-ER_MIN_N_RES_I-N-SHREDS = 15. Potential contamination by other organisms was removed during mounting; the subreads were compared with the deep sequencing results by using Bowtie2 and SAMtools. A Hi-C visualization heatmap was created using HiCPlotter, and a chromosome diagram was drawn using Circos. Heatmap colors ranging from light yellow to dark red were used to indicate the frequency (from low to high) of Hi-C interactions [35].

2.6. RNA Extraction, Library Construction, Transcriptome Sequencing, and Read Processing

Total RNA was extracted using a Qiagen 74104 Rneasy Mini Kit (Qiagen), according to the manufacturer’s protocol. Total RNA quantity and purity were analyzed on a Bioanalyzer 2100 and RNA 6000 Nano LabChip Kit (Agilent, CA, USA). Next-generation sequencing libraries of transcriptomes were constructed using the SMART cDNA Library Construction Kit (CLONTECH Company, code no. 634901). mRNA was enriched and purified with magnetic beads (oligo (dT)) and then fragmented into small pieces by using divalent cations at high temperatures; these cleaved mRNA fragments were randomly sheared and used as templates to produce cDNA. This cDNA was purified, its sticky ends were restored to flat ends, A-bases were added to the 3 ends, and joints were added as well. Finally, PCR amplification was performed. The library was submitted for quality control on the Agilent 2100 Bioanalyzer and ABI StepOnePlus Real-Time PCR system and then sequenced on the Illumina HiSeq 2000 instrument to produce paired-end  bp reads, according to the vendor’s recommended protocol. Sequencing data were pretreated, assembled, and annotated as described in our previous publications [36, 37].

2.7. Repetitive Sequence Analysis and Genome Annotation

Repetitive elements were annotated based on homology and ab initio. LTR_FINDER (RRID: SCR_015247) (Benson), RepeatScout (RRID: SCR_014653) [38], and RepeatModeler (RRID: SCR_015027) were used to construct an ab initio repetitive element database, and RepeatMasker (RRID: SCR_012954) [38] was used to annotate repetitive elements on the basis of the database. Subsequently, RepeatMasker and RepeatProteinMask [39] were used to search the genome sequence for known repetitive elements, with the genome sequences used as queries against the Repbase database [40]. Tandem repeats were predicted using the Tandem Repeats Finder (Benson). We masked the repetitive regions of the assembled genome sequences by using RepeatMasker (version 4.0.5) [39].

For protein-coding gene prediction, we used both de novo and homology-based strategies in accordance with the MAKER pipeline [38]. Ab initio gene prediction was performed on the repeat-masked genome assembly by using Genscan, GlimmerHMM (version 3.0.4), GeneID (version 1.4), SNAP (version 2006-07-28) [41], and AUGUSTUS (version 2.4) [42]. For homology-based predictions, Hisat (version 2.0.4), StringTie (version 1.2.3), GeneMarkS-T (version 5.1), and PASA (version 2.0.2) were used, and EVM (version 1.1.1) was used to integrate the four sets of results. Finally, we mapped the allergen sequences onto the mite chromosomes. The homology alignment of protein-coding genes was performed using the public protein databases Swiss-Prot, TreEMBL, and Pfam by using BLASTX, with an -value of 10-5.

2.8. Allergen Identification

We downloaded the sequences of the known allergen groups of D. farinae from the WHO/IUIS Allergen Nomenclature Sub-Committee website and searched for them in our genome. The protein database was created using BlastP with an -value cutoff of 1-50. Most match sequences in our genome met two standards: (1) having the highest score during BlastP and (2) being a top hit in hmmsearch with an HMM model constructed from known allergens downloaded from the WHO/IUIS website. When the HMM model was not used, the most matched sequences were the target during BlastP. When D. farinae allergen groups were unavailable (i.e., no reported allergen groups of D. farinae were present), we searched the entire Arthropod data to identify potential novel allergen groups. To confirm our genome annotation further, each identified candidate allergen was aligned with sequences deposited on the WHO/IUIS website.

3. Results and Discussion

3.1. Mitochondrial Genome Structure Confirmed High Purity of D. farinae in Our Study

The final assembled mitochondrial genome of D. farinae had a length of 14,320 bp and included 13 coding genes, 22 transfer RNA genes, 2 ribosomal RNA genes, and 1 control region (Figure 1), which was submitted to the National Center for Biotechnology Information (NCBI) database under the accession number OR197578. The length and arrangement of the mitochondrial genome from our study were similar to the previously reported mitochondrial genome of D. farinae, which has a length of 14,266 bp (GenBank: GQ465336.1), suggesting the high purity of D. farinae used in our research.

3.2. Initial Genome Assembly and Evaluation

Illumina platform generated 6.53 Gb data from a 400 bp insert library, representing a 105-fold coverage of the D. farinae genome. K-mers () were extracted from a paired-end library with an insert size of 400 bp. We calculated and plotted the 17-mer depth distribution (Figure 2). The 17-kmer distribution showed a major peak at 89× (Figure 2). Based on the number of k-mers and relative k-mer depth, we estimated the genome size of D. farinae to be 59.35 Mb, according to the following formula: .

PacBio platform generated 8.21 Gb high-quality sequences from the long-read library with a mean subread length of 7442 bp and read N50 value of 9679 bp, representing a 131-fold coverage of genome assembly. These PacBio reads resulted in an assembly with a contig N50 value of 273,396 bp.

3.3. Genome Quality Evaluation and Final Genome Assembly

The assembled genome was analyzed using benchmarking universal single-copy orthologs (BUSCOs; version 5.3.2) with the arachnida_odb10 database to assess its completeness. Of the 2934 total BUSCO groups searched, 2667 complete and single-copy ortholog groups and 35 complete and duplicated ortholog groups were identified. The assessment results indicated a relatively complete genome.

Considering the presence of symbiotic microorganisms in the digestive tract of mites, it is impossible to obtain pure mite bodies as sequencing samples [6, 7, 4345]. Thus, we manually removed the microbial DNA contamination during genome assembly. Briefly, by conducting BLAST searches against the microbial RefSeq database with a stringent cutoff (), the mite sequences were distinguished and separated from microbiota sequences. After filtering, we obtained a final genome size of 62.43 Mb from the Hi-C library. A previous minireview reported that the genome sizes of D. farinae, D. pteronyssinus, Euroglyphus maynei, Blomia tropicalis, and Tyrophagus putrescentiae were 63.7 Mb, 66.6 Mb, 43.4 Mb, 62.7 Mb, and 97.4 Mb, respectively [46], ranging from 43 to 100 Mb, which is inclusive of our genome size.

The National Center for Biotechnology Information (NCBI) database contains three D. farinae genomes: (1) GCA_000767015.2 [47], with a total length of 60.39 Mb and a scaffold N50 value of 8.98 Mb; (2) GCA_002085665.2 [48], with a total ungapped length of 91.9 Mb and a scaffold N50 value of 0.19 Mb; and (3) GCA_020809275.1 [9], with a total length of 58.8 Mb and a Contig N50 value of 9.3 Mb. Due to PacBio technology, this scaffold-level genome assembly includes 10 contigs scaffolds [9], but no assembled chromosomes. These D. farinae genome data were downloaded from NCBI, reassembled and annotated, and compared with our current findings (Table 1). Our genome has a total length of 62.43 Mb and an N50 value of 7.12 Mb. In genome assemblies, N50 is the measure of assembled genome quality. The higher the N50 value is, the higher the quality of the corresponding genome assembly becomes [49]. By using Merqury (version 1.1) [50], we found the assembly consensus quality value of our genome to be 36.1083, representing an accuracy of 99.975%. Therefore, the D. farinae genome presented in our study is of high quality.

Finally, 62.43 Mb of data from 21 scaffolds were anchored onto and oriented toward 10 pseudochromosomes through agglomerative hierarchical clustering. A heatmap of the interactions among the pseudochromosomes illustrated that genome assembly was complete and robust (Figure 3(a)). The sizes of the 10 pseudochromosomes ranged from 9,172,953 to 3,348,946 bp (Figure 3(b)).

3.4. Gene Annotation

After integrating several programs and results and eliminating redundancy, we estimated that 13.01% (7.66 Mb) of the D. farinae genome comprised repeat sequences (Table 2). Of the classified repetitive elements, simple repetitive elements were determined to be the most abundant (11.41%) in the genome.

In total, 10,709 protein-coding genes were identified in the D. farinae genome. The total coding sequence length was 17,338,048 bp, with an average transcript length of 1619.02 bp. Moreover, the total number of mRNA sequences was 100,538, with an average length of 793.03 bp.

In total, 10,709 genes, representing 96.57% of the predicted genes, were efficiently annotated with their putative functions (Table 3). With our functional annotation analysis, 9524 (88.93%), 6487 (60.58%), and 9853 (92.01%) genes demonstrated significant hits with proteins cataloged in the InterPro, Swissprot, and NR databases, respectively (Table 3). Moreover, 7612 (71.08%), 5000 (46.69%), 5930 (55.37%), and 7853 (73.33%) genes were annotated in the Pfam, KEGG, GO, and eggNOG databases, respectively (Table 3).

3.5. Allergen Identification and Location

We identified 36 allergen groups, and our sequences were determined to exhibit an excellent match with known allergen groups deposited in IUIS, with percent identities ranging from 83.99% to 100% for nucleotide sequences (Table 4). However, we did not find Der f 12, Der f 17, and Der f 19 and assigned these allergen groups to each chromosome of D. farinae on the basis of our chromosome assembly results (Figure 4). The Dermatophagoides allergens were classified into four categories according to their IgE-binding frequency: serodominant (groups 1, 2, and 23), midtier (groups 4, 5, 7, and 21), minor (groups 3, 6, 8, 9, 10, 11, 13, 15, 16, 17, 18, and 20), and allergen of unknown importance [51]. The major allergens Der f 1, Der f 2, and Der f 23 are located on chromosomes 2, 1, and 7, respectively. The midtier allergens Der f 4, Der f 5, Der f 7, and Der f 21 are located on chromosomes 3, 4, 6, and 4, respectively. Der f 24, an allergen group identified from the first D. farinae genome [6], is located on chromosome 3. Allergens of groups 3, 6, and 9 allergens are serine protease trypsin, chymotrypsin, and a collagenolytic enzyme, respectively [52, 53]. The present study located Der f 3 and Der f 6 on chromosomes 4 and 6, respectively. Although Der f 9 was not deposited in IUIS, we obtained a Der p 9-like serine protease sequence located on chromosome 8.

4. Conclusions

The genomic data will provide novel insights for the future investigation of the functionally important proteins of disease-related organisms. Our study successfully generated D. farinae genome on the chromosome level but did not perform chromosomal structure comparison due to limited mite chromosome information.

Abbreviations

PacBio:Pacific Bioscience
Hi-C:High-throughput/resolution chromosome conformation capture
BUSCOs:Benchmarking universal single-copy orthologs
WHO:World Health Organization
IUIS:International Union of Immunological Societies
NCBI:National Center for Biotechnology Information
NR:Nonredundant
Der f:Dermatophagoides farinae.

Data Availability

Genomic sequences and Hi-C data are deposited in NCBI Sequence Read Archive under the BioProject PRJNA756521 and accession number JAMLFS000000000.1. The mitochondrial genome of D. farinae is deposited in NCBI under the accession number OR197578.

Ethical Approval

This research was carried out in compliance with the relevant laws of China.

Conflicts of Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Authors’ Contributions

Xiaohong Gu and Yubao Cui conceived and designed the experiments. Rongxuan Hu, Haifeng Huang, Ying Zhou, Yanshan Liu, Yaning Ren, Yuanfen Liao, and Cunyin Yuan performed the experiments and analyzed and interpreted the data. Rongxuan Hu, Haifeng Huang, and Yubao Cui wrote the paper. Rongxuan Hu and Haifeng Huang contributed equally to this work.

Acknowledgments

This study was supported by grants from the National Natural Science Foundation of China (NSFC81971511 and NSFC31572319), the Top Talents Project of the Wuxi Taihu Lake Talent Plan (2020THRC-GD-7), the Wuxi Science and Technology Bureau (R & D of medical and health technology, Y20212006), and the Wuxi Health and Family Planning Commission (Z202214).