Annotating and prioritizing genetic variants for single cell RNA sequencing

This vignette demonstrates how Yano can annotate genetic variants using public databases and assist users in interpreting the functional impact of these variants.

1. Predict the consequence of genetic variants

Load the object which generated from allele specific gene expression from scRNA-seq. In that case, we used annoVAR to annotate the gene and approximate gene region type for the variants. In this section, we will go further, to predict the consequence of genetic variants. Note that, different from genetic annotation at DNA level, strand information is also consdiered here, so antisense terms will also be generated.

The predicted consequences were derived from transcript annotations. However, gencode annotations, generated as part of the ENCODE project, contain numerous truncated transcript records that are not fully consistent with the reference genome, which can introduce bias when predicting amino acid changes. For gene expression analysis, such as scRNA-seq, it’s more common to use gencode, consider the good coverage. But for clinical research, validated annotations like RefSeq are preferred for genetic variant annotation. In this analysis, RefSeq annotations were downloaded from the UCSC server to provide more accurate and reliable predictions. Furthermore, the human genome reference is also need for the prediction.

wget -c https://hgdownload.soe.ucsc.edu/goldenPath/archive/hg38/ncbiRefSeq/110/hg38.110.ncbiRefSeq.gtf.gz
wget -c https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh38.p13.genome.fa.gz
unzip GRCh38.p13.genome.fa.gz

# FASTA need to be indexed
samtools faidx GRCh38.p13.genome.fa

require(Yano)

Loading required package: Yano

── Attaching packages ────────────────────────────────────────────── Yano 1.2 ──
✔ dplyr   1.1.4     ✔ Seurat  5.3.0
✔ ggplot2 3.5.2

obj <- readRDS("glioblastoma5k.rds")

gtf <- gtf2db("hg38.110.ncbiRefSeq.gtf.gz")

[2025-06-19 18:10:13] GTF loading..
[2025-06-19 18:10:25] Load 66991 genes.

DefaultAssay(obj) <- "var"
obj <- annoVAR(obj, gtf = gtf, fasta = "./GRCh38.p13.genome.fa")

Parse names ..

CDS region of transcript NM_001261418.2 cannot translate to codons properly.
CDS region of transcript XM_047433500.1 cannot translate to codons properly.
CDS region of transcript XM_005271283.4 cannot translate to codons properly.
CDS region of transcript XM_017002563.2 cannot translate to codons properly.
CDS region of transcript XM_047432264.1 cannot translate to codons properly.
CDS region of transcript XM_017002566.2 cannot translate to codons properly.
CDS region of transcript NM_001350238.2 cannot translate to codons properly.
CDS region of transcript XM_017002570.2 cannot translate to codons properly.
CDS region of transcript XM_017002567.2 cannot translate to codons properly.
CDS region of transcript XM_047432272.1 cannot translate to codons properly.
CDS region of transcript XM_017002562.2 cannot translate to codons properly.
CDS region of transcript NM_001350241.2 cannot translate to codons properly.
CDS region of transcript XM_047432255.1 cannot translate to codons properly.
CDS region of transcript XM_017002560.3 cannot translate to codons properly.
CDS region of transcript NM_001350239.2 cannot translate to codons properly.
CDS region of transcript NM_001032291.3 cannot translate to codons properly.
CDS region of transcript NM_001394005.1 cannot translate to codons properly.
CDS region of transcript NM_001350240.2 cannot translate to codons properly.
CDS region of transcript NM_032636.8 cannot translate to codons properly.
CDS region of transcript XM_047432251.1 cannot translate to codons properly.
CDS region of transcript XM_047432274.1 cannot translate to codons properly.
CDS region of transcript XM_017002569.2 cannot translate to codons properly.
CDS region of transcript XM_017002564.2 cannot translate to codons properly.
CDS region of transcript NM_001350242.2 cannot translate to codons properly.
CDS region of transcript NM_001394002.1 cannot translate to codons properly.
CDS region of transcript NM_001363309.2 cannot translate to codons properly.
CDS region of transcript XM_047432275.1 cannot translate to codons properly.
CDS region of transcript NM_001350237.2 cannot translate to codons properly.
CDS region of transcript NM_001385672.1 cannot translate to codons properly.
CDS region of transcript XM_047433003.1 cannot translate to codons properly.
CDS region of transcript NM_001301244.2 cannot translate to codons properly.
CDS region of transcript XM_006720667.5 cannot translate to codons properly.
CDS region of transcript NM_001018005.2 cannot translate to codons properly.
CDS region of transcript NM_001330346.2 cannot translate to codons properly.
CDS region of transcript XM_005254650.4 cannot translate to codons properly.
CDS region of transcript NM_004850.5 cannot translate to codons properly.
CDS region of transcript NM_001321643.2 cannot translate to codons properly.
CDS region of transcript NM_080794.4 cannot translate to codons properly.
CDS region of transcript XM_047447996.1 cannot translate to codons properly.
CDS region of transcript NM_001077654.3 cannot translate to codons properly.
CDS region of transcript NM_001385620.1 cannot translate to codons properly.
CDS region of transcript XM_006716181.5 cannot translate to codons properly.
CDS region of transcript XM_047421050.1 cannot translate to codons properly.
CDS region of transcript XM_047422695.1 cannot translate to codons properly.
CDS region of transcript rna-CYTB cannot translate to codons properly.
CDS region of transcript rna-COX3 cannot translate to codons properly.
CDS region of transcript rna-ND1 cannot translate to codons properly.
CDS region of transcript rna-ND2 cannot translate to codons properly.
CDS region of transcript rna-ND4 cannot translate to codons properly.

obj[['var']][[]] -> df
head(df)

                  chr  start ref alt strand         locus gene_name type
chr1:631862G>A/+ chr1 631862   G   A      + chr1:631862/+  MTCO1P12 utr5
chr1:633192G=/+  chr1 633192   G   G      + chr1:633192/+  MTCO2P12 utr5
chr1:634337T>C/+ chr1 634337   T   C      + chr1:634337/+  MTATP6P1 utr5
chr1:826352C>T/- chr1 826352   C   T      - chr1:826352/- LINC00115 utr5
chr1:853876T>C/+ chr1 853876   T   C      + chr1:853876/+ LINC01128 utr5
chr1:944296G>A/- chr1 944296   G   A      - chr1:944296/-     NOC2L utr3
                 moransi.pval       moransi autocorr.variable locus.D locus.t
chr1:631862G>A/+ 6.218333e-01 -0.0017809088             FALSE      NA      NA
chr1:633192G=/+  4.642293e-05  0.0162242246              TRUE      NA      NA
chr1:634337T>C/+ 4.882067e-01 -0.0003760064             FALSE      NA      NA
chr1:826352C>T/- 6.661973e-01 -0.0023088544             FALSE      NA      NA
chr1:853876T>C/+ 3.266314e-02  0.0076495861             FALSE      NA      NA
chr1:944296G>A/- 3.232372e-01  0.0015298954             FALSE      NA      NA
                 locus.pval locus.padj    consequence
chr1:631862G>A/+         NA         NA noncoding_exon
chr1:633192G=/+          NA         NA            ref
chr1:634337T>C/+         NA         NA noncoding_exon
chr1:826352C>T/-         NA         NA noncoding_exon
chr1:853876T>C/+         NA         NA noncoding_exon
chr1:944296G>A/-         NA         NA           utr3

table(df$consequence)


antisense_downstream_1K          antisense_exon        antisense_intron 
                     71                     543                    4112 
  antisense_upstream_1K           downstream_1K   frameshift_truncation 
                    188                     445                       3 
             intergenic                  intron                missense 
                   2459                   11399                     643 
         noncoding_exon        noncoding_intron noncoding_splice_region 
                    900                     836                       3 
                    ref         splice_acceptor            splice_donor 
                  11017                       7                       2 
          splice_region            splice_sites              start_loss 
                     22                      10                       1 
         start_retained             stop_gained              synonymous 
                      1                       6                     810 
            upstream_1K                    utr3                    utr5 
                    171                    4946                     633

subset(df, consequence %in% "stop_gained")

                      chr     start ref alt strand            locus gene_name
chr11:62602409G>A/- chr11  62602409   G   A      - chr11:62602409/-      EML3
chr21:21480040G>A/+ chr21  21480040   G   A      + chr21:21480040/+     NCAM2
chr3:53865249C>T/+   chr3  53865249   C   T      +  chr3:53865249/+    IL17RB
chr5:140557175G>T/-  chr5 140557175   G   T      - chr5:140557175/-      SRA1
chr7:55473014G>A/-   chr7  55473014   G   A      -  chr7:55473014/-     VOPP1
chr7:74732529C>T/+   chr7  74732529   C   T      +  chr7:74732529/+     GTF2I
                      type moransi.pval      moransi autocorr.variable locus.D
chr11:62602409G>A/-   exon  0.032763688 0.0076500533             FALSE      NA
chr21:21480040G>A/+ intron  0.222451372 0.0027535436             FALSE      NA
chr3:53865249C>T/+    exon  0.237271085 0.0025496157             FALSE      NA
chr5:140557175G>T/-   exon  0.388504949 0.0007173673             FALSE      NA
chr7:55473014G>A/-    exon  0.001314907 0.0127720172              TRUE      NA
chr7:74732529C>T/+    exon  0.013125243 0.0089092490             FALSE      NA
                    locus.t locus.pval locus.padj consequence
chr11:62602409G>A/-      NA         NA         NA stop_gained
chr21:21480040G>A/+      NA         NA         NA stop_gained
chr3:53865249C>T/+       NA         NA         NA stop_gained
chr5:140557175G>T/-      NA         NA         NA stop_gained
chr7:55473014G>A/-       NA         NA         NA stop_gained
chr7:74732529C>T/+       NA         NA         NA stop_gained

sel.chrs <- c(1:21, "X")
FbtPlot(obj, val = "locus.padj", remove.chr = TRUE, sel.chrs = sel.chrs, col.by = "consequence", pt.size = 2)

2. Annotate genetic variants with publish databases

Another key advancement in genetic variant studies is the ability to perform annotation using genotype-phenotype and allele frequency databases. For demonstration purposes, we use the ClinVar database for annotation in this vignette.

# download the database in VCF format
wget -c  https://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/clinvar.vcf.gz

# remeber to index this database, otherwise annoVAR will report error.
bcftools index clinvar.vcf.gz

Let’s take a look at the format of the ClinVar database. As noted, the chromosome names are in the NCBI style, meaning they do not include the “chr” prefix. Therefore, we need to generate a new column with chromosome names formatted for annotation.

bcftools view clinvar.vcf.gz| grep -v "^#"| head

1   66926   3385321 AG  A   .   .   ALLELEID=3544463;CLNDISDB=Human_Phenotype_Ontology:HP:0000547,MONDO:MONDO:0019200,MeSH:D012174,MedGen:C0035334,OMIM:268000,OMIM:PS268000,Orphanet:791;CLNDN=Retinitis_pigmentosa;CLNHGVS=NC_000001.11:g.66927del;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV005419006;CLNVC=Deletion;CLNVCSO=SO:0000159;GENEINFO=OR4F5:79501;MC=SO:0001627|intron_variant;ORIGIN=0
1   69134   2205837 A   G   .   .   ALLELEID=2193183;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69134A>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNSIGSCV=SCV003526545;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA502008;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69314   3205580 T   G   .   .   ALLELEID=3374047;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69314T>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV004995495;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA338197388;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69423   3205581 G   A   .   .   ALLELEID=3374048;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69423G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV004995496;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA338197763;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69581   2252161 C   G   .   .   ALLELEID=2238986;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69581C>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV003584698;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA338198277;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69682   2396347 G   A   .   .   ALLELEID=2386655;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69682G>A;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV003739621;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA502063;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69731   3205582 T   C   .   .   ALLELEID=3374049;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69731T>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV004995497;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA338198606;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69769   2288999 T   C   .   .   ALLELEID=2278803;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69769T>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV003620560;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA338198691;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   69995   2351346 G   C   .   .   ALLELEID=2333177;CLNDISDB=MedGen:CN169374;CLNDN=not_specified;CLNHGVS=NC_000001.11:g.69995G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Uncertain_significance;CLNSIGSCV=SCV003688488;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA502093;GENEINFO=OR4F5:79501;MC=SO:0001583|missense_variant;ORIGIN=1
1   924518  3388928 G   C   .   .   ALLELEID=3548054;CLNDISDB=MedGen:C3661900;CLNDN=not_provided;CLNHGVS=NC_000001.11:g.924518G>C;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Likely_benign;CLNSIGSCV=SCV005435063;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;GENEINFO=SAMD11:148398|LOC107985728:107985728;MC=SO:0001819|synonymous_variant;ORIGIN=1

We need to create new chromosome names that are compatible with the NCBI database format, because chromosome names in Clinvar database are look like ‘1’, ‘2’.

obj[['var']][['chr0']] <- gsub("chr(.*)","\\1", obj[['var']][[]]$chr)
obj[['var']][[]] %>% head

                  chr  start ref alt strand         locus gene_name type
chr1:631862G>A/+ chr1 631862   G   A      + chr1:631862/+  MTCO1P12 utr5
chr1:633192G=/+  chr1 633192   G   G      + chr1:633192/+  MTCO2P12 utr5
chr1:634337T>C/+ chr1 634337   T   C      + chr1:634337/+  MTATP6P1 utr5
chr1:826352C>T/- chr1 826352   C   T      - chr1:826352/- LINC00115 utr5
chr1:853876T>C/+ chr1 853876   T   C      + chr1:853876/+ LINC01128 utr5
chr1:944296G>A/- chr1 944296   G   A      - chr1:944296/-     NOC2L utr3
                 moransi.pval       moransi autocorr.variable locus.D locus.t
chr1:631862G>A/+ 6.218333e-01 -0.0017809088             FALSE      NA      NA
chr1:633192G=/+  4.642293e-05  0.0162242246              TRUE      NA      NA
chr1:634337T>C/+ 4.882067e-01 -0.0003760064             FALSE      NA      NA
chr1:826352C>T/- 6.661973e-01 -0.0023088544             FALSE      NA      NA
chr1:853876T>C/+ 3.266314e-02  0.0076495861             FALSE      NA      NA
chr1:944296G>A/- 3.232372e-01  0.0015298954             FALSE      NA      NA
                 locus.pval locus.padj    consequence chr0
chr1:631862G>A/+         NA         NA noncoding_exon    1
chr1:633192G=/+          NA         NA            ref    1
chr1:634337T>C/+         NA         NA noncoding_exon    1
chr1:826352C>T/-         NA         NA noncoding_exon    1
chr1:853876T>C/+         NA         NA noncoding_exon    1
chr1:944296G>A/-         NA         NA           utr3    1

# here we annotate RS id and clinical signification for variants, to check the VCF header for all possible tags
obj <- annoVAR(obj, vcf = "clinvar.vcf.gz", chr = "chr0", tags = c("RS", "CLNSIG"))

Parse names ..

# The RS id and CLNSIG tag now created
obj[['var']][[]] -> df
head(df)

                  chr  start ref alt strand         locus gene_name type
chr1:631862G>A/+ chr1 631862   G   A      + chr1:631862/+  MTCO1P12 utr5
chr1:633192G=/+  chr1 633192   G   G      + chr1:633192/+  MTCO2P12 utr5
chr1:634337T>C/+ chr1 634337   T   C      + chr1:634337/+  MTATP6P1 utr5
chr1:826352C>T/- chr1 826352   C   T      - chr1:826352/- LINC00115 utr5
chr1:853876T>C/+ chr1 853876   T   C      + chr1:853876/+ LINC01128 utr5
chr1:944296G>A/- chr1 944296   G   A      - chr1:944296/-     NOC2L utr3
                 moransi.pval       moransi autocorr.variable locus.D locus.t
chr1:631862G>A/+ 6.218333e-01 -0.0017809088             FALSE      NA      NA
chr1:633192G=/+  4.642293e-05  0.0162242246              TRUE      NA      NA
chr1:634337T>C/+ 4.882067e-01 -0.0003760064             FALSE      NA      NA
chr1:826352C>T/- 6.661973e-01 -0.0023088544             FALSE      NA      NA
chr1:853876T>C/+ 3.266314e-02  0.0076495861             FALSE      NA      NA
chr1:944296G>A/- 3.232372e-01  0.0015298954             FALSE      NA      NA
                 locus.pval locus.padj    consequence chr0   RS CLNSIG
chr1:631862G>A/+         NA         NA noncoding_exon    1 <NA>   <NA>
chr1:633192G=/+          NA         NA            ref    1 <NA>   <NA>
chr1:634337T>C/+         NA         NA noncoding_exon    1 <NA>   <NA>
chr1:826352C>T/-         NA         NA noncoding_exon    1 <NA>   <NA>
chr1:853876T>C/+         NA         NA noncoding_exon    1 <NA>   <NA>
chr1:944296G>A/-         NA         NA           utr3    1 <NA>   <NA>

# We note that there are several pathogenic variants detected
table(df[['CLNSIG']])


                                             association 
                                                       8 
                                                  Benign 
                                                    1695 
                                    Benign/Likely_benign 
                                                      48 
            Conflicting_classifications_of_pathogenicity 
                                                      10 
Conflicting_classifications_of_pathogenicity|risk_factor 
                                                       3 
                                           drug_response 
                                                       1 
                                           Likely_benign 
                                                      50 
                no_classification_for_the_single_variant 
                                                      12 
                                            not_provided 
                                                       1 
                                                   other 
                                                       9 
                                              Pathogenic 
                                                       1 
                            Pathogenic/Likely_pathogenic 
                                                       2 
                                  Pathogenic|risk_factor 
                                                       1 
                                             risk_factor 
                                                       6 
                                  Uncertain_significance 
                                                      32

df %>% filter(CLNSIG %in% c('Pathogenic', 'Pathogenic/Likely_pathogenic', 'Pathogenic|risk_factor'))

                      chr     start ref alt strand            locus gene_name
chr1:196690107C=/+   chr1 196690107   C   C      + chr1:196690107/+       CFH
chr13:50945445G=/+  chr13  50945445   G   G      + chr13:50945445/+  RNASEH2B
chr13:50945445G>A/+ chr13  50945445   G   A      + chr13:50945445/+  RNASEH2B
chr15:92992967C=/+  chr15  92992967   C   C      + chr15:92992967/+      CHD2
                          type moransi.pval      moransi autocorr.variable
chr1:196690107C=/+  multigenes 3.455725e-76  0.076938465              TRUE
chr13:50945445G=/+        exon 6.488097e-04  0.013692536              TRUE
chr13:50945445G>A/+       exon 5.375273e-07  0.021007235              TRUE
chr15:92992967C=/+        exon 8.592924e-01 -0.005173039             FALSE
                      locus.D   locus.t locus.pval locus.padj consequence chr0
chr1:196690107C=/+  0.0286025 -4.661517  0.9999951          1         ref    1
chr13:50945445G=/+  0.0969400 -2.009264  0.9763843          1         ref   13
chr13:50945445G>A/+ 0.1023983 -1.766020  0.9597624          1    missense   13
chr15:92992967C=/+         NA        NA         NA         NA         ref   15
                          RS                       CLNSIG
chr1:196690107C=/+   1061170       Pathogenic|risk_factor
chr13:50945445G=/+  75184679 Pathogenic/Likely_pathogenic
chr13:50945445G>A/+ 75184679 Pathogenic/Likely_pathogenic
chr15:92992967C=/+   2272457                   Pathogenic

In the selected SNVs, we found some reference alleles are highlighted, such as chr15:92992967C=/+. These reference alelles are more like to be benign allele. Let’s query the orignal record in the clinvar database to check what exactly stituation it should be.

bcftools view clinvar.vcf.gz 15:92992967-92992967

##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=2025-02-17
##source=ClinVar
##reference=GRCh38
##ID=<Description="ClinVar Variation ID">
##INFO=<ID=AF_ESP,Number=1,Type=Float,Description="allele frequencies from GO-ESP">
##INFO=<ID=AF_EXAC,Number=1,Type=Float,Description="allele frequencies from ExAC">
##INFO=<ID=AF_TGP,Number=1,Type=Float,Description="allele frequencies from TGP">
##INFO=<ID=ALLELEID,Number=1,Type=Integer,Description="the ClinVar Allele ID">
##INFO=<ID=CLNDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDNINCL,Number=.,Type=String,Description="For included Variant : ClinVar's preferred disease name for the concept specified by disease identifiers in CLNDISDB">
##INFO=<ID=CLNDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNDISDBINCL,Number=.,Type=String,Description="For included Variant: Tag-value pairs of disease database name and identifier for germline classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=CLNHGVS,Number=.,Type=String,Description="Top-level (primary assembly, alt, or patch) HGVS expression.">
##INFO=<ID=CLNREVSTAT,Number=.,Type=String,Description="ClinVar review status of germline classification for the Variation ID">
##INFO=<ID=CLNSIG,Number=.,Type=String,Description="Aggregate germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGCONF,Number=.,Type=String,Description="Conflicting germline classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGINCL,Number=.,Type=String,Description="Germline classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=CLNSIGSCV,Number=.,Type=String,Description="SCV accession numbers for the submissions that contribute to the aggregate germline classification in ClinVar; multiple values are separated by a vertical bar">
##INFO=<ID=CLNVC,Number=1,Type=String,Description="Variant type">
##INFO=<ID=CLNVCSO,Number=1,Type=String,Description="Sequence Ontology id for variant type">
##INFO=<ID=CLNVI,Number=.,Type=String,Description="the variant's clinical sources reported as tag-value pairs of database and variant identifier">
##INFO=<ID=DBVARID,Number=.,Type=String,Description="nsv accessions from dbVar for the variant">
##INFO=<ID=GENEINFO,Number=1,Type=String,Description="Gene(s) for the variant reported as gene symbol:gene id. The gene symbol and id are delimited by a colon (:) and each pair is delimited by a vertical bar (|)">
##INFO=<ID=MC,Number=.,Type=String,Description="comma separated list of molecular consequence in the form of Sequence Ontology ID|molecular_consequence">
##INFO=<ID=ONCDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDB">
##INFO=<ID=ONCDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in ONCDISDBINCL">
##INFO=<ID=ONCDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for oncogenicity classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=ONCDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for oncogenicity classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=ONC,Number=.,Type=String,Description="Aggregate oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ONCINCL,Number=.,Type=String,Description="Oncogenicity classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=ONCREVSTAT,Number=.,Type=String,Description="ClinVar review status of oncogenicity classification for the Variation ID">
##INFO=<ID=ONCSCV,Number=.,Type=String,Description="SCV accession numbers for the submissions that contribute to the aggregate oncogenicity classification in ClinVar; multiple values are separated by a vertical bar">
##INFO=<ID=ONCCONF,Number=.,Type=String,Description="Conflicting oncogenicity classification for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=ORIGIN,Number=.,Type=String,Description="Allele origin. One or more of the following values may be added: 0 - unknown; 1 - germline; 2 - somatic; 4 - inherited; 8 - paternal; 16 - maternal; 32 - de-novo; 64 - biparental; 128 - uniparental; 256 - not-tested; 512 - tested-inconclusive; 1073741824 - other">
##INFO=<ID=RS,Number=.,Type=String,Description="dbSNP ID (i.e. rs number)">
##INFO=<ID=SCIDN,Number=.,Type=String,Description="ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDB">
##INFO=<ID=SCIDNINCL,Number=.,Type=String,Description="For included variant: ClinVar's preferred disease name for the concept specified by disease identifiers in SCIDISDBINCL">
##INFO=<ID=SCIDISDB,Number=.,Type=String,Description="Tag-value pairs of disease database name and identifier submitted for somatic clinial impact classifications, e.g. MedGen:NNNNNN">
##INFO=<ID=SCIDISDBINCL,Number=.,Type=String,Description="For included variant: Tag-value pairs of disease database name and identifier for somatic clinical impact classifications, e.g. OMIM:NNNNNN">
##INFO=<ID=SCIREVSTAT,Number=.,Type=String,Description="ClinVar review status of somatic clinical impact for the Variation ID">
##INFO=<ID=SCI,Number=.,Type=String,Description="Aggregate somatic clinical impact for this single variant; multiple values are separated by a vertical bar">
##INFO=<ID=SCIINCL,Number=.,Type=String,Description="Somatic clinical impact classification for a haplotype or genotype that includes this variant. Reported as pairs of VariationID:classification; multiple values are separated by a vertical bar">
##INFO=<ID=SCISCV,Number=.,Type=String,Description="SCV accession numbers for the submissions that contribute to the aggregate somatic clinical impact in ClinVar; multiple values are separated by a vertical bar">
##contig=<ID=1>
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##contig=<ID=Y>
##contig=<ID=MT>
##contig=<ID=NT_113889.1>
##contig=<ID=NT_187633.1>
##contig=<ID=NT_187661.1>
##contig=<ID=NT_187693.1>
##contig=<ID=NW_009646201.1>
##bcftools_viewVersion=1.21+htslib-1.21
##bcftools_viewCommand=view clinvar.vcf.gz 15:92992967-92992967; Date=Thu Jun 19 18:10:29 2025
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO
15  92992967    633515  C   G   .   .   ALLELEID=621980;CLNDISDB=MedGen:CN517202;CLNDN=not_provided;CLNHGVS=NC_000015.10:g.92992967C>G;CLNREVSTAT=criteria_provided,_single_submitter;CLNSIG=Pathogenic;CLNSIGSCV=SCV000920484;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA393896759;GENEINFO=CHD2:1106;MC=SO:0001587|nonsense;ORIGIN=1;RS=2272457
15  92992967    257709  C   T   .   .   AF_ESP=0.2485;AF_EXAC=0.22658;AF_TGP=0.19349;ALLELEID=255446;CLNDISDB=MeSH:D030342,MedGen:C0950123|MONDO:MONDO:0014150,MedGen:C3809278,OMIM:615369,Orphanet:1942,Orphanet:2382|MedGen:C3661900|MedGen:CN169374;CLNDN=Inborn_genetic_diseases|Developmental_and_epileptic_encephalopathy_94|not_provided|not_specified;CLNHGVS=NC_000015.10:g.92992967C>T;CLNREVSTAT=criteria_provided,_multiple_submitters,_no_conflicts;CLNSIG=Benign;CLNSIGSCV=SCV000307175|SCV000519223|SCV000841518|SCV000846021|SCV001730597|SCV001806395|SCV005087698|SCV005297510;CLNVC=single_nucleotide_variant;CLNVCSO=SO:0001483;CLNVI=ClinGen:CA7748248;GENEINFO=CHD2:1106;MC=SO:0001819|synonymous_variant;ORIGIN=1;RS=2272457

You can easily observe that the pathogenic allele is G, where the change from C to G leads to a nonsense mutation. However, if we examine the header, we see that the ‘Number’ field of CLNSIG is labeled as ‘.’, which indicates that this term is not allele-specific—an inappropriate type for this record. In fact, this issue extends beyond just CLNSIG; all terms in the ClinVar database are labeled as ‘.’.

To correct this, we need to manually change the type from ‘.’ to ‘A’ (allele-specific) for allele-specific tags in the VCF file. For more information, please refer to the VCF specification.

Unfortunately, many databases distributed in VCF format may contain non-uniformly formatted tags. Apart from manually updating these databases or filter reference allele afterward, there is no straightforward solution to this issue.

In addition, the RS tag is annotated, and we can refer to the NCBI SNP database for more information. For example, the mutation chr13:50945445G>A/+ at RS 75184679 can be found at the following link: https://www.ncbi.nlm.nih.gov/snp/rs75184679. This mutation is associated with conditions such as “Abnormality of the nervous system” and other diseases.

FeaturePlot(obj, features = c("chr13:50945445G>A/+", "chr13:50945445G=/+", "RNASEH2B"), order = TRUE, ncol=3)

Warning: The `slot` argument of `FetchData()` is deprecated as of SeuratObject 5.0.0.
ℹ Please use the `layer` argument instead.
ℹ The deprecated feature was likely used in the Seurat package.
  Please report the issue at <https://github.com/satijalab/seurat/issues>.

Warning: Could not find RNASEH2B in the default search locations, found in
'RNA' assay instead

Note that not all reads from this gene cover the SNV, which results in the SNV expression being sparser compared to the overall gene expression. Moreover, it also possible to map the clinical significant labels or other tags to the Manhatten plot.

sel.chrs <- c(1:21, "X")
FbtPlot(obj, val = "locus.padj", remove.chr = TRUE, sel.chrs = sel.chrs, col.by = "CLNSIG", pt.size = 2, cols = c("#131313","blue","#A6CEE3","#1F78B4","#B2DF8A","#33A02C","#FB9A99","#E31A1C","#FDBF6F","#FF7F00","#CAB2D6","#6A3D9A","#FFFF99","#B15928"))

Command(obj)

 [1] "NormalizeData.RNA"        "FindVariableFeatures.RNA"
 [3] "ScaleData.RNA"            "RunPCA.RNA"              
 [5] "FindNeighbors.RNA.pca"    "FindClusters"            
 [7] "RunUMAP.RNA.pca"          "NormalizeData.var"       
 [9] "RunAutoCorr.var.pca"      "SetAutoCorrFeatures.var" 
[11] "RunSDT.var"               "ParseVarName.var"        
[13] "annoVAR.var"

sessionInfo()

R version 4.5.0 (2025-04-11)
Platform: aarch64-apple-darwin24.4.0
Running under: macOS Sequoia 15.5

Matrix products: default
BLAS:   /opt/homebrew/Cellar/openblas/0.3.29/lib/libopenblasp-r0.3.29.dylib 
LAPACK: /opt/homebrew/Cellar/r/4.5.0/lib/R/lib/libRlapack.dylib;  LAPACK version 3.12.1

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: Asia/Shanghai
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.1.4        Seurat_5.3.0       SeuratObject_5.1.0 sp_2.2-0          
[5] ggplot2_3.5.2      Yano_1.2          

loaded via a namespace (and not attached):
  [1] deldir_2.0-4           pbapply_1.7-2          gridExtra_2.3         
  [4] rlang_1.1.6            magrittr_2.0.3         RcppAnnoy_0.0.22      
  [7] spatstat.geom_3.3-6    matrixStats_1.5.0      ggridges_0.5.6        
 [10] compiler_4.5.0         png_0.1-8              vctrs_0.6.5           
 [13] reshape2_1.4.4         stringr_1.5.1          pkgconfig_2.0.3       
 [16] fastmap_1.2.0          labeling_0.4.3         promises_1.3.2        
 [19] rmarkdown_2.29         purrr_1.0.4            xfun_0.52             
 [22] jsonlite_2.0.0         goftest_1.2-3          later_1.4.2           
 [25] spatstat.utils_3.1-4   irlba_2.3.5.1          parallel_4.5.0        
 [28] cluster_2.1.8.1        R6_2.6.1               ica_1.0-3             
 [31] stringi_1.8.7          RColorBrewer_1.1-3     spatstat.data_3.1-6   
 [34] reticulate_1.42.0      spatstat.univar_3.1-3  parallelly_1.44.0     
 [37] lmtest_0.9-40          scattermore_1.2        Rcpp_1.0.14           
 [40] knitr_1.50             tensor_1.5             future.apply_1.11.3   
 [43] zoo_1.8-14             R.utils_2.13.0         sctransform_0.4.2     
 [46] httpuv_1.6.16          Matrix_1.7-3           splines_4.5.0         
 [49] igraph_2.1.4           tidyselect_1.2.1       viridis_0.6.5         
 [52] abind_1.4-8            yaml_2.3.10            spatstat.random_3.3-3 
 [55] codetools_0.2-20       miniUI_0.1.2           spatstat.explore_3.4-2
 [58] listenv_0.9.1          lattice_0.22-6         tibble_3.2.1          
 [61] plyr_1.8.9             withr_3.0.2            shiny_1.10.0          
 [64] ROCR_1.0-11            evaluate_1.0.3         Rtsne_0.17            
 [67] future_1.49.0          fastDummies_1.7.5      survival_3.8-3        
 [70] polyclip_1.10-7        fitdistrplus_1.2-2     pillar_1.10.2         
 [73] KernSmooth_2.23-26     plotly_4.10.4          generics_0.1.4        
 [76] RcppHNSW_0.6.0         scales_1.4.0           gtools_3.9.5          
 [79] globals_0.18.0         xtable_1.8-4           glue_1.8.0            
 [82] lazyeval_0.2.2         tools_4.5.0            data.table_1.17.2     
 [85] RSpectra_0.16-2        RANN_2.6.2             dotCall64_1.2         
 [88] cowplot_1.1.3          grid_4.5.0             tidyr_1.3.1           
 [91] nlme_3.1-168           patchwork_1.3.0        cli_3.6.5             
 [94] spatstat.sparse_3.1-0  spam_2.11-1            viridisLite_0.4.2     
 [97] uwot_0.2.3             gtable_0.3.6           R.methodsS3_1.8.2     
[100] digest_0.6.37          progressr_0.15.1       ggrepel_0.9.6         
[103] htmlwidgets_1.6.4      farver_2.1.2           htmltools_0.5.8.1     
[106] R.oo_1.27.1            lifecycle_1.0.4        httr_1.4.7            
[109] mime_0.13              MASS_7.3-65