The following vignette demonstrates how to parse raw FASTQ files to FASTQ+ format and perform alignment, feature annotation, and counting using PISA. For simplification, we use minimap2 to align reads in this example, but other tools such as STAR can also be used.
Number of Fragments,66601887
Fragments pass QC,66601887
Fragments with Exactly Matched Barcodes,0
Fragments with Failed Barcodes,0
2. Align reads with minimap2
To get identical results with 10X’s Cell Ranger, you need use 10X’s gtf to index genome and annotate BAM as they modified it from original ENSEMBL gtf.
Notice : Please do NOT use samtools view to convert the SAM to BAM. Because the barcode information still store in the read name. Using PISA sam2bam instead.
This step will convert the SAM to BAM, while generating the summary of alignments and parse the tags from the read id to SAM optional field. If -adjust-mapq and -gtf parameters set, mapping quality of multi-reads (reads mapped to one gene region and more than one integenic region) will be adjusted.
Raw reads,66601887
Mapped reads,52582556 (78.95%)
Plus strand,29549905
Minus strand,23032651
Mitochondria ratio,7.66%
Mapping quality corrected reads,427081
Reads Mapped to Genome (Map Quality >= 0),79.0%
Reads Mapped to Exonic Regions,62.4%
Reads Mapped to Intronic Regions,24.5%
Reads Mapped to both Exonic and Intronic Regions,1.8%
Reads Mapped Antisense to Exon,0.8%
Reads Mapped Antisense to Intron,6.7%
Reads Mapped to Intergenic Regions,3.7%
Reads Mapped to Gene but Failed to Interpret Type,0.1%
Reads Mapped to Overlapped genes,1.5%
5. Sort BAM by genomics coordinate (Optional)
A sorted and indexed BAM file can be useful for many secondary analyses, but it is not mandatory if you only need the gene expression count matrix. The annotation and expression count steps do not require a sorted BAM. However, if two similar UMIs have the same supporting reads, the first one will always be selected, and the second will be corrected during the corr step. If the records are not sorted, this may lead to non-robust results.
## sambamba can also be used to sort bam for better performance. For simplication, we use samtools here.samtools sort -@ 5 -o sorted.bam anno.bam
PISA count will write the feature counts in MEX format.
ls raw_gene_expression
barcodes.tsv.gz
features.tsv.gz
matrix.mtx.gz
8. Cell selection with DropletUtils
In the last step, we generate a raw gene expression matrix. By using R package DropletUtils, we select “valid’ cell barcodes from this raw matrix and generate a filtered gene expression matrix in the end.
# load required packagesrequire(DropletUtils)require(Yano)# Read raw MEX file into a sparse matrixcounts <-ReadPISA("raw_gene_expression/")# Read sparse matrix as a SingleCellExperiment objectsce <-SingleCellExperiment(list('counts'= counts))# Run barcodeRanks function from DropletUtils, see ref for detailsbr.out <-barcodeRanks(sce)# Get all barcodes above the inflection pointfilter_bcs <-rownames(br.out)[br.out$total >metadata(br.out)$inflection]# the filer matrixfilter_counts <- counts[,filter_bcs]# write file to new folder in 10X format for further usewrite10xCounts("filtered_gene_expression", filter_counts)