Count gene, exon, junction and SNV expression from a BAM file

The following vignette demonstrates how to generate various feature counts from an alignment BAM file. Here we use a 10X demo data of human glioblastoma cells. The description of data can be found at https://www.10xgenomics.com/datasets/human-glioblastoma-multiforme-3-v-3-whole-transcriptome-analysis-3-standard-4-0-0.

Prepare the raw data and reference

Download alignment data (BAM). (~14G).

The following file is used for variant calling and allele specific gene expression. It can be skipped for alternative splicing analysis.

BAM Index File (for variant calling): Download BAM index
Human Reference Databases and Gene Annotation Files:
- GRCh38.p13 genome
- Gencode v44 annotation

Programs we will use

PISA (>=v1.3) (https://github.com/shiquan/PISA)
bcftools (https://github.com/samtools/bcftools)
samtools (https://github.com/samtools/samtools)

Step by step tutorial

1. Genetic variants calling

# Unpack the FASTA file
gzip -d GRCh38.p13.genome.fa.gz

# Build the FASTA index
samtools faidx GRCh38.p13.genome.fa

# Perform variant calling
bcftools mpileup -Ou -f GRCh38.p13.genome.fa ./Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam  | bcftools call -vmO z -o var.vcf.gz

Notes:

While several new methods for genetic variant calling have been developed, the basic principle remains the same: identifying differences between sequence reads and the reference genome. For simplicity, we are using the ‘traditional’ bcftools approach.
There is no need to filter raw genetic variants by sequencing depth or strand bias tests. Since this is strand-specific RNA sequencing data, strand bias is expected in many cases. Instead, the feature expression matrix can be filtered by cells during downstream analysis.

2. Annotate various features

The following command generates gene, transcript, exon, exon skipped reads, junction and genetic variant for each alignment:

PISA anno -gtf gencode.v44.annotation.gtf.gz -exon -psi -vcf var.vcf.gz -ref-alt -o anno.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam

The above command will annotate genes, exons, junctions, and genetic variants all at once. However, it is also possible to annotate different features sequentially. For example, the command below produces the same results as the previous one, but allows for stepwise annotation.

PISA anno -gtf gencode.v44.annotation.gtf.gz -exon -psi -o anno1.bam Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam 
PISA anno -vcf var.vcf.gz -ref-alt -o anno2.bam anno1.bam

Notes:

-exon: Generates exon and junction names.
-psi: Generate labels for exon skipped reads, named after the skipped exons.
-ref-alt: For genetic variants, both reference allele and alternative allele will be annotated. If not set, reference allele will be ignored. This option is important for alelle-specific gene expression.

Comparing raw and annotated records.

# A original record
samtools view Parent_SC3v3_Human_Glioblastoma_possorted_genome_bam.bam  |head -n 3000|grep 'A00836:286:HMTMVDMXX:2:2354:8422:1391'

A00836:286:HMTMVDMXX:2:2354:8422:1391   16  chr1    89995   255 56M236N35M  *   0   0   TTGGTCTCTCTTGCTGCCCTGGAGACCAGCTGCCCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGACAGTCCCTCTGTCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:6  HI:i:3  AS:i:89 nM:i:1  RG:Z:Parent_SC3v3_Human_Glioblastoma:0:1:HMTMVDMXX:2    TX:Z:ENST00000495576,+784,91M   GX:Z:ENSG00000239945    GN:Z:AL627309.3 fx:Z:ENSG00000239945    RE:A:E  MM:i:1  xf:i:25 CR:Z:AGATGAATCTACGCGG   CY:Z:FFFFFFFFFF,FFFFF   CB:Z:AGATGAATCTACGCGG-1 UR:Z:TTTCATGCCTCA   UY:Z:FFFFFFFFFFFF   UB:Z:TTTCATGCCTCA

# the same record after annotation
samtools view anno.bam | head -n 3000|grep 'A00836:286:HMTMVDMXX:2:2354:8422:1391'

[W::bam_hdr_read] EOF marker is absent. The input is probably truncated
A00836:286:HMTMVDMXX:2:2354:8422:1391   16  chr1    89995   255 56M236N35M  *   0   0   TTGGTCTCTCTTGCTGCCCTGGAGACCAGCTGCCCCACGAAGGAACCAGAGCCAACCTGCTGCTTCCTGGAGGAAGACAGTCCCTCTGTCC FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF,FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF:FFFF NH:i:6  HI:i:3  AS:i:89 nM:i:1  RG:Z:Parent_SC3v3_Human_Glioblastoma:0:1:HMTMVDMXX:2    fx:Z:ENSG00000239945    MM:i:1  xf:i:25 CR:Z:AGATGAATCTACGCGG   CY:Z:FFFFFFFFFF,FFFFF   CB:Z:AGATGAATCTACGCGG-1 UR:Z:TTTCATGCCTCA   UY:Z:FFFFFFFFFFFF   UB:Z:TTTCATGCCTCA   RE:A:S  GX:Z:ENSG00000239945.1  GN:Z:ENSG00000239945    TX:Z:ENST00000495576.1  EX:Z:chr1:89551-90050/-/ENSG00000239945,chr1:90287-91105/-/ENSG00000239945  JC:Z:chr1:90050-90287/-/ENSG00000239945

Notes:

PISA anno generates GN, TX, RE, and GX tags when the -gtf option is set. Original tag values are replaced, and PISA uses a slightly different annotation strategy than CellRanger, see details here. In the example above, the RE tag has been changed from E (exon) to S (spliced).
When the -exon option is used, EX and JC tags are annotated only if the read has a GN tag.
When the -psi option is used, ER tag is annotated, the label for ER is the exon name which indicate the read skip this exon.

3. Generate Cell X feature expression matrix

mkdir exp 
mkdir exon
mkdir exclude
mkdir junction
mkdir var

PISA count -tags CB -anno-tag GN -umi UB -outdir exp anno.bam
PISA count -tags CB -anno-tag EX -umi UB -outdir exon anno.bam
PISA count -tags CB -anno-tag ER -umi UB -outdir exclude anno.bam
PISA count -tags CB -anno-tag JC -umi UB -outdir junction anno.bam
PISA count -tags CB -anno-tag VR -umi UB -outdir var anno.bam

Notes: These commands will generate raw counts for all droplets. Although the -list parameter can be used to export only selected cells with PISA anno, cell selection may sometimes lack robustness. Therefore, it is safer to store the raw matrix file for long-term storage.

ls exp/

barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz

zless exp/barcodes.tsv.gz|head

TTCCACGGTTCGGCTG-1
GTGCTTCGTAATGTGA-1
CACAGGCAGCTGGTGA-1
TGGCGTGGTGCAACGA-1
ATCTCTAGTCCATAGT-1
AGCGCCACAAGCGATG-1
CGAGTTAGTGTCGATT-1
GTCAAGTCAGGCTCTG-1
GGGTAGATCCCGTTGT-1
GGGACAAAGAGGTATT-1

zless exp/features.tsv.gz|head

MIR1302-2HG
ENSG00000238009
ENSG00000239945
ENSG00000268903
ENSG00000241860
ENSG00000236601
ENSG00000230021
MTND1P23
WASH9P
ENSG00000228463

zless exp/matrix.mtx.gz|head

%%MatrixMarket matrix coordinate integer general
% Generated by PISA v1.3
37716   971962  23795359
1   1   1
1   2   1
2   3   1
2   4   1
2   5   1
2   6   1
2   7   1

zless exon/features.tsv.gz|head

chr1:29554-30039/+/MIR1302-2HG
chr1:30564-30667/+/MIR1302-2HG
chr1:89295-91629/-/ENSG00000238009
chr1:89551-90050/-/ENSG00000239945
chr1:90287-91105/-/ENSG00000239945
chr1:135141-135895/-/ENSG00000268903
chr1:141474-143011/-/ENSG00000241860
chr1:146386-149707/-/ENSG00000241860
chr1:142808-143011/-/ENSG00000241860
chr1:146386-146509/-/ENSG00000241860

zless junction/features.tsv.gz|head

chr1:30039-30564/+/MIR1302-2HG
chr1:90050-90287/-/ENSG00000239945
chr1:143011-146386/-/ENSG00000241860
chr1:165942-168100/-/ENSG00000241860
chr1:168165-168610/-/ENSG00000241860
chr1:805891-808574/-/ENSG00000290784
chr1:805891-808574/-/ENSG00000230092
chr1:827775-829003/+/LINC01128
chr1:829104-847654/+/LINC01128
chr1:868530-868627/-/FAM41C

The format for exon and junction names is “chr:pos reference_allele>alternative_allele/strand” for alternative alleles, or “chr:pos=/strand” for reference allele.

# the format of expressed genetic variant is 
zless var/features.tsv.gz|head

chr1:631862G>A/+
chr1:632403T=/+
chr1:632403T>C/+
chr1:632427T>C/+
chr1:633192G>A/+
chr1:633192G=/+
chr1:633236A>G/+
chr1:633300A=/+
chr1:633372T>C/+
chr1:634244T>C/+

Notes: By default, the parameters for PISA anno are strand-sensitive. For non-strand-sensitive libraries, such as Smartseq2, the strand should be ignored by using the -is option.

Questions?

If you have any questions regarding this workflows, please feel free to report them through the github issues.

If you have any ideas or suggestions regarding PISA annotation, please refer to the Yano discussion forum. Since PISA is commonly used alongside Yano, it may not be necessary to host separate forums for each.