From fastq to feature counts with PISA

Published

July 8, 2024

The following vignette demonstrates how to parse raw FASTQ files to FASTQ+ format and perform alignment, feature annotation, and counting using PISA. For simplification, we use minimap2 to align reads in this example, but other tools such as STAR can also be used.

Prepare the raw data and reference

Download raw data: http://cf.10xgenomics.com/samples/cell-exp/3.0.0/pbmc_1k_v3/pbmc_1k_v3_fastqs.tar (5.2G)
Download the human reference databases and gene annotation files:
- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/GRCh38.p13.genome.fa.gz
- https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_32/gencode.v32.annotation.gtf.gz

The third-party programs we will use

minimap2 (https://github.com/lh3/minimap2)
samtools (https://github.com/samtools/samtools)

Step by step tutorial

1. Parse cell barcode and UMI

# Unpack the raw fastq
tar xvf pbmc_1k_v3_fastqs.tar 
gzip -d GRCh38.p13.genome.fa.gz

# Convert raw fastq to fastq+
PISA parse -rule 'CB,R1:1-16;UR,R1:17-28;R1,R2' pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R1_001.fastq.gz,pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R1_001.fastq.gz pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L001_R2_001.fastq.gz,pbmc_1k_v3_fastqs/pbmc_1k_v3_S1_L002_R2_001.fastq.gz -nw -report fastq_qc.csv -1 reads.fq

# Check the fastq+ records
head reads.fq

@A00228:279:HFWFVDMXX:1:1101:8486:1000|||CB:Z:NGTGATTAGCTGTACT|||UR:Z:CGTATGTAAGGT
NACAAAGTCCCCCCCATAATACAGGGGGAGCCACTTGGGCAGGAGGCAGGGAGGGGTCCATTCCCCCTGGTGGGGCTGGTGGGGAGCTGTA
+
#FFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:10782:1000|||CB:Z:NTCATGAAGTTTGGCT|||UR:Z:AGTTATGTTCAT
NTTGCAGCTGAACTGGTAAACTTGTCCCTAAAGAGACATAAGAATGGTCAACTGGAATGTGGATTCATCTGTAACATTACTCAGTGGGCCT
+
#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF
@A00228:279:HFWFVDMXX:1:1101:12626:1000|||CB:Z:NACCAAAAGGACATCG|||UR:Z:CGGTAGAGGCTT
GCTCCGTAGCCCTGACTGCCTTGGTGACTGTATCCTCAGAAGAATTTGAAGACAAGTGGTTCAGAAAGATCAAAGACCATTTCTGTCCATT

# Check the report
cat fastq_qc.csv

Number of Fragments,66601887
Fragments pass QC,66601887
Fragments with Exactly Matched Barcodes,0
Fragments with Failed Barcodes,0

2. Align reads with minimap2

To get identical results with 10X’s Cell Ranger, you need use 10X’s gtf to index genome and annotate BAM as they modified it from original ENSEMBL gtf.

minimap2 -t 8 -ax splice GRCh38.p13.genome.fa reads.fq > aln.sam

Notice : Please do NOT use samtools view to convert the SAM to BAM. Because the barcode information still store in the read name. Using PISA sam2bam instead.

FASTQ+ aligned SAM records can be look like this.

samtools view aln.sam|head -n 1

A00228:279:HFWFVDMXX:1:1101:8486:1000|||CB:Z:NGTGATTAGCTGTACT|||UR:Z:CGTATGTAAGGT   0   chr19   12748180    60  1S90M   *   0   0   NACAAAGTCCCCCCCATAATACAGGGGGAGCCACTTGGGCAGGAGGCAGGGAGGGGTCCATTCCCCCTGGTGGGGCTGGTGGGGAGCTGTA #FFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF NM:i:0  ms:i:90 AS:i:90 nn:i:0  tp:A:P  cm:i:25 s1:i:85 s2:i:0  de:f:0  rl:i:0

3. Convert SAM to BAM

This step will convert the SAM to BAM, while generating the summary of alignments and parse the tags from the read id to SAM optional field. If -adjust-mapq and -gtf parameters set, mapping quality of multi-reads (reads mapped to one gene region and more than one integenic region) will be adjusted.

PISA sam2bam -report alignment_report.csv -@ 5 -adjust-mapq -gtf gencode.v32.annotation.gtf.gz -o aln.bam aln.sam

Converted records can be look like this.

samtools view aln.bam|head -n 1

A00228:279:HFWFVDMXX:1:1101:8486:1000|||CB:Z:NGTGATTAGCTGTACT|||UR:Z:CGTATGTAAGGT   0   chr19   12748180    60  1S90M   *   0   0   NACAAAGTCCCCCCCATAATACAGGGGGAGCCACTTGGGCAGGAGGCAGGGAGGGGTCCATTCCCCCTGGTGGGGCTGGTGGGGAGCTGTA #FFFFFFFFFFFFFFF:FFFFFFF:FFFFFFFFFFFFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFF:FFFFFFFFFFFFFFFFFFFFFF NM:i:0  ms:i:90 AS:i:90 nn:i:0  tp:A:P  cm:i:25 s1:i:85 s2:i:0  de:f:0  rl:i:0

And the summary.

cat alignment_report.csv

Raw reads,66601887
Mapped reads,52693779 (79.12%)
Plus strand,29523622
Minus strand,23170157
Mitochondria ratio,7.66%
Mapping quality corrected reads,889117

4. Gene annotations

PISA anno -gtf gencode.v32.annotation.gtf.gz -o anno.bam aln.bam -t 5 -report anno.csv

cat anno.csv

Reads Mapped to Genome (Map Quality >= 0),79.1%
Reads Mapped to Exonic Regions,61.9%
Reads Mapped to Intronic Regions,24.2%
Reads Mapped to both Exonic and Intronic Regions,0.0%
Reads Mapped Antisense to Exon,1.0%
Reads Mapped Antisense to Intron,6.3%
Reads Mapped to Intergenic Regions,5.6%
Reads Mapped to Gene but Failed to Interpret Type,0.0%
Reads Mapped to Overlapped genes,1.7%

5. Sort BAM by genomics coordinate (Optional)

A sorted and indexed BAM file can be useful for many secondary analyses, but it is not mandatory if you only need the gene expression count matrix. The annotation and expression count steps do not require a sorted BAM. However, if two similar UMIs have the same supporting reads, the first one will always be selected, and the second will be corrected during the corr step. If the records are not sorted, this may lead to non-robust results.

## sambamba can also be used to sort bam for better performance. For simplication, we use samtools here.
samtools sort -@ 5 -o sorted.bam anno.bam

6. UMIs correction for one cell in one gene

PISA corr -tag UR -new-tag UB -tags-block CB,GN -cr -o final.bam -@ 5 sorted.bam

7. Generate Cell X Gene expression matrix

mkdir raw_gene_expression
PISA count -tags CB -anno-tag GN -umi UB -outdir raw_gene_expression -@ 5 final.bam

PISA count will write the feature counts in MEX format.

ls raw_gene_expression

barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz

8. Cell selection with DropletUtils

In the last step, we generate a raw gene expression matrix. By using R package DropletUtils, we select “valid’ cell barcodes from this raw matrix and generate a filtered gene expression matrix in the end.

# load required packages
require(DropletUtils)
require(Yano)

# Read raw MEX file into a sparse matrix
counts <- ReadPISA("raw_gene_expression/")

# Read sparse matrix as a SingleCellExperiment object
sce <- SingleCellExperiment(list('counts' = counts))

# Run barcodeRanks function from DropletUtils, see ref for details
br.out <- barcodeRanks(sce)

# Get all barcodes above the inflection point
filter_bcs <- rownames(br.out)[br.out$total >metadata(br.out)$inflection]

# the filer matrix
filter_counts <- counts[,filter_bcs]

# write file to new folder in 10X format for further use
write10xCounts("filtered_gene_expression", filter_counts)