Review Article

Human Genomic Loci Important in Common Infectious Diseases: Role of High-Throughput Sequencing and Genome-Wide Association Studies

Figure 1

(a) Pipeline for interrogation of pathogen genomes using high-throughput sequencing (HTS) and computational approaches. DNA for HTS can be extracted either directly from clinical specimens of individuals suspected of being infected or from enriched or isolated cultures. Quality control and read preprocessing are critical steps in the analysis of HTS datasets. FastQC is an example of a tool for general quality assessment of HTS data from all technologies. Genomes can be reconstructed without prior knowledge by de novo sequence assembly, or with prior knowledge by alignment (mapping) of reads to a reference genome. The former is necessary for novel genomes and where the sequenced genome differs from the reference. Sequence data analysis is important in infectious disease outbreak investigations, molecular typing, antimicrobial drug resistance, transmission, surveillance, and microbial evolution. (b) Pipeline for interrogation of host genomes using high-throughput sequencing and computational approaches. For a given infectious disease in a population, an appropriate study design is determined and host DNA is collected from cases (exposed to the pathogen and infected) and controls (exposed to the pathogen and uninfected). HTS of DNA from both cases and controls is performed. Quality control (QC) procedures vary between pipelines. They include QC on individuals (missingness, gender checks, duplicates and cryptic relatedness, population outliers, heterozygosity and inbreeding) and QC on SNPs (missingness, minor allele frequency, and Hardy–Weinberg equilibrium). Many of these procedures are computationally intensive, operationally challenging, and constantly evolving.
Genome-wide association studies (GWASs) with a case-control design compare the frequencies of common genetic variants between cases and controls, assume an appropriate statistical model, and apply a multiple-testing correction threshold to identify susceptibility and protective polymorphisms in the population.
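
The SNP-level QC and association steps of panel (b) can be illustrated with a minimal Python sketch. It is not the pipeline of any particular study. The genotype counts are invented, the function names are hypothetical, and the thresholds (minor allele frequency of at least 0.01, Hardy–Weinberg p-value of at least 1e-6 in controls, Bonferroni correction at an overall alpha of 0.05) are commonly used conventions assumed here for illustration. A simple allelic chi-square test stands in for whichever statistical model a given study adopts.

```python
import math


def chi2_sf_1df(stat):
    """Upper-tail p-value of a chi-square statistic with 1 degree of freedom."""
    return math.erfc(math.sqrt(stat / 2.0))


def minor_allele_frequency(n_aa, n_ab, n_bb):
    """Minor allele frequency from genotype counts (AA, AB, BB)."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    return min(p, 1 - p)


def hwe_pvalue(n_aa, n_ab, n_bb):
    """Chi-square goodness-of-fit test for Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic SNP: the test is not informative
    observed = (n_aa, n_ab, n_bb)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return chi2_sf_1df(stat)


def allelic_test(cases, controls):
    """Allelic 2x2 chi-square test; returns (odds ratio for allele A, p-value)."""
    a = 2 * cases[0] + cases[1]        # A alleles in cases
    b = 2 * cases[2] + cases[1]        # B alleles in cases
    c = 2 * controls[0] + controls[1]  # A alleles in controls
    d = 2 * controls[2] + controls[1]  # B alleles in controls
    margins = (a + b) * (c + d) * (a + c) * (b + d)
    if margins == 0:
        return float("nan"), 1.0
    stat = (a + b + c + d) * (a * d - b * c) ** 2 / margins
    odds_ratio = (a * d) / (b * c) if b * c else float("inf")
    return odds_ratio, chi2_sf_1df(stat)


# Hypothetical genotype counts (AA, AB, BB) for cases and controls.
snps = {
    "snp1": ((30, 50, 20), (10, 40, 50)),
    "snp2": ((25, 50, 25), (24, 52, 24)),
    "snp3": ((0, 1, 99), (0, 1, 99)),
}

MAF_MIN, HWE_MIN, ALPHA = 0.01, 1e-6, 0.05  # assumed thresholds

# SNP-level QC: filter on MAF (all samples) and on HWE (controls only).
passed = {}
for name, (cases, controls) in snps.items():
    pooled = tuple(x + y for x, y in zip(cases, controls))
    if minor_allele_frequency(*pooled) < MAF_MIN:
        continue
    if hwe_pvalue(*controls) < HWE_MIN:
        continue
    passed[name] = (cases, controls)

# Association testing with a Bonferroni-corrected significance threshold.
threshold = ALPHA / len(passed)
results = {}
for name, (cases, controls) in passed.items():
    odds_ratio, p_value = allelic_test(cases, controls)
    results[name] = {
        "odds_ratio": odds_ratio,
        "p_value": p_value,
        "significant": p_value < threshold,
    }

for name, row in results.items():
    print(name, row)
```

In this toy data set the rare variant is removed by the frequency filter before testing, so the Bonferroni threshold is divided by the number of SNPs that survive QC rather than by the number genotyped. In a real GWAS the same idea applies across hundreds of thousands to millions of variants, which is why a much stricter genome-wide threshold is used in practice.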