# Unpack the FASTA filegzip-d GRCh38.p13.genome.fa.gz# Build the FASTA indexsamtools faidx GRCh38.p13.genome.fa# Perform variant callingbcftools mpileup -Ou-f GRCh38.p13.genome.fa ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam |bcftools call -vmO z -o var.vcf.gz
Notes:
While several new methods for genetic variant calling have been developed, the basic principle remains the same: identifying differences between sequence reads and the reference genome. For simplicity, we are using the ‘traditional’ bcftools approach.
There is no need to filter raw genetic variants by sequencing depth or strand bias tests. Since this is strand-specific RNA sequencing data, strand bias is expected in many cases. Instead, the feature expression matrix can be filtered by cells during downstream analysis.
2. Annotate various features
The following command generates gene, transcript, exon, exon skipped reads, junction and genetic variant for each alignment:
PISA anno -gtf gencode.v44.annotation.gtf.gz -exon-psi-vcf var.vcf.gz -ref-alt-o anno.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam
The above command will annotate genes, exons, junctions, and genetic variants all at once. However, it is also possible to annotate different features sequentially. For example, the command below produces the same results as the previous one, but allows for stepwise annotation.
PISA anno -gtf gencode.v44.annotation.gtf.gz -exon-psi-o anno1.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam PISA anno -vcf var.vcf.gz -ref-alt-o anno2.bam anno1.bam
Notes:
-exon: Generates exon and junction names.
-psi: Generate labels for exon skipped reads, named after the skipped exons.
-ref-alt: For genetic variants, both reference allele and alternative allele will be annotated. If not set, reference allele will be ignored. This option is important for alelle-specific gene expression.
Comparing raw and annotated records.
# A original recordsamtools view Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam |head-n 3000|grep'A00836:286:HMTMVDMXX:2:2354:8422:1391'
PISA anno generates GN, TX, RE, and GX tags when the -gtf option is set. Original tag values are replaced, and PISA uses a slightly different annotation strategy than CellRanger, see details here. In the example above, the RE tag has been changed from E (exon) to S (spliced).
When the -exon option is used, EX and JC tags are annotated only if the read has a GN tag.
When the -psi option is used, ER tag is annotated, the label for ER is the exon name which indicate the read skip this exon.
Notes: These commands will generate raw counts for all droplets. Although the -list parameter can be used to export only selected cells with PISA anno, cell selection may sometimes lack robustness. Therefore, it is safer to store the raw matrix file for long-term storage.
The format for exon and junction names is “chr:pos reference_allele>alternative_allele/strand” for alternative alleles, or “chr:pos=/strand” for reference allele.
# the format of expressed genetic variant is zless var/features.tsv.gz|head
Notes: By default, the parameters for PISA anno are strand-sensitive. For non-strand-sensitive libraries, such as Smartseq2, the strand should be ignored by using the -is option.
Questions?
If you have any questions regarding this workflows, please feel free to report them through the github issues.
If you have any ideas or suggestions regarding PISA annotation, please refer to the Yano discussion forum. Since PISA is commonly used alongside Yano, it may not be necessary to host separate forums for each.