Complex genome regions such as copy number variations (CNVs), variable number tandem repeats (VNTRs), short tandem repeats (STRs), and centromeric and telomeric regions can hide medically highly relevant mutations that shed new light on human phenotypes and diseases [1,2,3,4,5,6]. The development of long-read sequencing technologies and new bioinformatic tools has greatly improved the resolution of these difficult regions [7,8,9,10,11,12,13,14]. However, some overly complex repeat structures remain challenging. The LPA KIV-2 VNTR is a medically highly relevant protein-coding example of such a difficult region [15].
The LPA gene encodes apolipoprotein(a) [apo(a)] and controls most (> 90%) of the lipoprotein(a) [Lp(a)] plasma variation [16]. High Lp(a) plasma concentrations are considered a nearly monogenically determined, very frequent, causal, independent, and heritable risk factor for atherosclerotic cardiovascular diseases [17,18,19] that increases cardiovascular risk up to threefold [20, 21]. Elevated Lp(a) concentrations are found in ≈20% of individuals of European ancestry and in as many as ≈50% of individuals of African ancestry [17]. Median Lp(a) levels vary tenfold between ancestries [22], and individual plasma concentrations vary as much as 1000-fold [16]. The causes of this huge phenotypic variance are not fully understood but likely result from intricate, haplotype-dependent, non-linear interactions between multiple functional LPA variants and the KIV-2 VNTR size [15].
The complex structure of the LPA gene severely complicates genetic analysis [15]. It comprises ten highly homologous kringle-IV domains (KIV-1 to -10) [23, 24]. Each KIV domain consists of two short exons (≈160 and 182 bp) separated by an intra-kringle intron of mostly ≈4 kb, and a 1.2 kb intron links it to the next domain [15]. The KIV-2 domain is encoded by a polymorphic VNTR, which introduces 1 to ≈40 KIV-2 units per gene allele (5.6 kb per repeat unit) [23]. This creates a VNTR region that can exceed 200 kb and consists of highly homologous coding repeat units that encompass up to 70% of the protein [25]. The VNTR copy number explains 30-70% of Lp(a) variance in a non-linear, ancestry-dependent manner [16]. Individuals carrying at least one low molecular weight (LMW) apo(a) isoform (defined as 10-22 KIV units [16], i.e., 1 to 13 KIV-2 units [15]) present 5 to 10 times higher median Lp(a) levels than carriers of high molecular weight isoforms (> 22 KIV; HMW) due to higher protein production rates [17]. However, individual Lp(a) levels within the same isoform can still vary 10- to 200-fold because multiple, partially unknown genetic variants modify the effect of the VNTR [15].
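The size relationships in this paragraph can be made explicit with a few lines of arithmetic. The following is a minimal sketch, not part of the described pipeline: it assumes that every allele carries exactly one copy of each of the nine non-repeated kringle types (KIV-1 and KIV-3 to -10), so that the total KIV number equals nine plus the KIV-2 copy number, and it uses the 5.6 kb unit length and the 22-KIV LMW cutoff quoted above. All function and constant names are hypothetical.

```python
# Illustrative arithmetic only; names are hypothetical, values are taken from the text.
KIV2_UNIT_KB = 5.6          # length of one KIV-2 repeat unit (kb)
NON_KIV2_DOMAINS = 9        # assumption: one copy each of KIV-1 and KIV-3 to KIV-10
LMW_MAX_TOTAL_KIV = 22      # LMW isoforms are defined as 10-22 KIV units


def total_kiv(kiv2_copies: int) -> int:
    """Total number of KIV domains on an allele with the given KIV-2 copy number."""
    if kiv2_copies < 1:
        raise ValueError("an allele carries at least one KIV-2 unit")
    return NON_KIV2_DOMAINS + kiv2_copies


def isoform_class(kiv2_copies: int) -> str:
    """Classify an allele as low (LMW) or high (HMW) molecular weight isoform."""
    return "LMW" if total_kiv(kiv2_copies) <= LMW_MAX_TOTAL_KIV else "HMW"


def vntr_size_kb(kiv2_copies: int) -> float:
    """Approximate length of the KIV-2 VNTR region in kb."""
    return kiv2_copies * KIV2_UNIT_KB


if __name__ == "__main__":
    for copies in (1, 13, 14, 40):
        print(copies, total_kiv(copies), isoform_class(copies),
              round(vntr_size_kb(copies), 1))
```

Under these assumptions, 13 KIV-2 units correspond to 22 KIV domains, the upper bound of the LMW range, and an allele with 40 KIV-2 units spans about 224 kb, consistent with a VNTR region exceeding 200 kb.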
The interactions between the KIV-2 VNTR size and modifier SNPs are complex and multilayered (reviewed in detail in [15]). They are haplotype-dependent and only partially captured by linkage disequilibrium (LD) [15]. Several functional SNPs were hidden in the KIV-2 VNTR until recently [25,26,27,28], including the two SNPs (4925G > A [25] and 4733G > A [26]) that explain most of the Lp(a) variance [25, 26]. These two variants alone explain a remarkable 11.9% of the Lp(a) variance in the general population, are ancestry-specific, and are associated with considerably reduced cardiovascular risk [25,26,27,28]. The background apo(a) isoform size and other variants on the same haplotype can both limit and amplify the effects of any functional variant [15]. Although the KIV-2 VNTR encompasses most of the LPA gene region, the full genetic and haplotypic diversity of KIV-2 units and the LD of KIV-2 haplotypes with the haplotypes in the non-repetitive kringles remain largely unknown.
Current short-read deep sequencing approaches confidently identify KIV-2 SNPs [24], but mostly lose the long-range SNP haplotype data. Early cloning studies identified three synonymous KIV-2 haplotypes named KIV-2A, KIV-2B, and KIV-2C [29, 30]. These KIV-2 subtypes are defined by the haplotype of three SNPs in KIV-2 exon 1 and differ by about 120 bases [23, 24]. The KIV-2 subtypes have no functional relevance, but their frequencies differ widely between ancestries and correlate with known differences of the Lp(a) phenotype across ancestries [24, 30]. They may thus reflect distinct evolutionary histories of the KIV-2 region and tag novel ancestry-specific functional variants. Further haplotypic effects in the KIV-2 VNTR are unknown and could not be studied so far.
Nanopore sequencing (ONT-Seq; Oxford Nanopore Technologies, ONT) provides novel means to address this knowledge gap. DNA is sequenced by monitoring the sequence-specific current fluctuations generated by single-stranded DNA molecules translocating through protein pores [31, 32]. This generates reads that are about a hundred times longer than those of short-read next-generation sequencing (NGS) [14, 32] and provides single-molecule resolution. It thus allows retrieving the complete haplotype of each DNA molecule sequenced, even in DNA mixtures [32]. However, at the single-molecule level the benefits of nanopore sequencing are limited by its relatively high raw-read error rate (0.7-1% median error rate per read). Especially in highly similar repeat sequences like the KIV-2, errors cannot be polished by sequencing deeper (because the parental repeat of each read is unknown [33]) or by using double-stranded (“duplex”) basecalling (because of erroneous matching of strands originating from different parental molecules).
Coupling ONT-Seq with unique molecular identifiers (UMI-ONT-Seq) lowers the ONT-Seq error rate considerably [33, 34]. UMIs are oligonucleotide libraries that randomly tag each template molecule with a unique identifier (Fig. 1). The tagged library is amplified by PCR to generate multiple copies of each UMI-tagged template molecule and is then sequenced at full length [34]. The reads are clustered according to their terminal UMI combination, which reflects their original template, and a consensus sequence is generated for each UMI cluster. This removes PCR and sequencing errors [34] (Fig. 1) while preserving the complete SNP haplotype of each input molecule. In highly repetitive and homologous regions such as the LPA KIV-2 repeats, this provides highly accurate consensus sequences of each repeat unit.
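The clustering and consensus principle described above can be illustrated with a toy example. The sketch below is not the implementation used in the pipeline. It assumes that reads have already been trimmed to the same length and are in register, that UMIs match exactly, and that a simple per-position majority vote stands in for proper alignment-based consensus calling and polishing. The input format (tuples of 5' UMI, 3' UMI, and sequence), the function names, and the minimum cluster size of three are hypothetical.

```python
from collections import Counter, defaultdict


def cluster_by_umi(reads):
    """Group read sequences by their terminal UMI pair (5' UMI, 3' UMI)."""
    clusters = defaultdict(list)
    for umi5, umi3, seq in reads:
        clusters[(umi5, umi3)].append(seq)
    return dict(clusters)


def majority_consensus(seqs):
    """Per-position majority vote; assumes equal-length, in-register reads."""
    if len({len(s) for s in seqs}) != 1:
        raise ValueError("toy consensus requires reads of equal length")
    return "".join(Counter(column).most_common(1)[0][0] for column in zip(*seqs))


def umi_consensus(reads, min_cluster_size=3):
    """One consensus per UMI cluster; small clusters are discarded."""
    return {
        umi: majority_consensus(seqs)
        for umi, seqs in cluster_by_umi(reads).items()
        if len(seqs) >= min_cluster_size
    }


if __name__ == "__main__":
    # Two template molecules differing at one SNP, each read three times,
    # with one sequencing error per cluster, plus one singleton read.
    toy_reads = [
        ("AAA", "TTT", "ACGT"), ("AAA", "TTT", "ACGT"), ("AAA", "TTT", "ACCT"),
        ("CCC", "GGG", "ACTT"), ("CCC", "GGG", "ACTT"), ("CCC", "GGG", "AGTT"),
        ("GGG", "AAA", "TTTT"),
    ]
    for umi, consensus in umi_consensus(toy_reads).items():
        print(umi, consensus)
```

In this toy data set, the isolated errors are outvoted within each cluster, whereas the true difference between the two template molecules is retained because the molecules carry different UMI pairs and are never merged.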
We describe a comprehensive assessment of UMI-ONT-Seq for SNP detection, SNP haplotyping, generation of consensus sequences for each VNTR unit, and copy number determination by coverage-corrected quantification of the unique haplotypes, using the LPA KIV-2 VNTR region as an example of a medically highly relevant complex VNTR region. We created a scalable, freely available UMI-ONT-Seq Nextflow analysis pipeline that can be generalized to any other UMI-ONT-Seq experiment (https://github.com/genepi/umi-pipeline-nf) and demonstrate LPA KIV-2 haplotyping by UMI-ONT-Seq in 48 multi-ancestry samples from the 1000 Genomes (1000G) Project [35].