Preprocessing sequencing reads (FASTQs) with Cassiopeia#

Cassiopeia provides an end-to-end pipeline to preprocess raw sequencing reads in FASTQ format into “Character Matrices” ready to pass into a phylogeny-inference algorithm. Given a set of FASTQs that contain RNA-seq reads from a target site construct (and NOT any endogenously expressed RNA), the preprocessing pipeline consists of the following steps.

  1. convert: Convert the FASTQs into an unmapped BAM, while parsing any barcode and/or UMI sequences into BAM tags.

  2. filter_bam: Filter reads with low-quality barcode and/or UMI sequences from the unmapped bam.

  3. error_correct_cellbcs_to_whitelist: For sequencing chemistries that have a predefined (cell) barcode whitelist, this steps perform correction of sequencing errors using this whitelist.

  4. collapse: Collapse reads into UMIs by constructing one or more consensus sequences for each UMI using the set of reads with that UMI.

  5. resolve: Resolve a single sequence for each UMI by choosing the most likely sequencing read to represent each UMI in a cell.

  6. align: Align sequences to the reference target site using the Smith-Waterman local alignment algorithm.

  7. call_alleles: Call alleles with respect to the reference target site and the alignment of a sequence, thereby reporting the set of mutations that a target site sequence contains.

  8. error_correct_intbcs_to_whitelist: For experimental designs for which each target site vector molecule has a unique barcode (“intBC”), and the set of intBCs present in the sequenced sample are known beforehand, this step performs sequencing error correction of these intBCs to the provided whitelist.

  9. error_correct_umis: Error-correct UMIs whose mutation data is identical and whose UMI barcode sequences are similar enough.

  10. filter_molecule_table: Filter UMIs that have conflicting allele information, too few reads, or do not meet other quality control criteria.

  11. call_lineages: Split up cells into clonal populations, based on their shared set of integration barcodes (intBCs).

The final output of this pipeline is an “AlleleTable” which stores the mutation data and clonal population identity for each cell. This data structure can then be broken up into character matrices for phylogenetic inference.

Pipeline API#

All of the key modules of the preprocessing pipeline can be invoked by a call from cassiopeia.pp. Assuming the user would like to begin at the beginning of the pipeline, we’ll start with the convert stage. You can find all documentation on our main site.

An alternative to running the pipeline interactively is to take advantage of the command line tool cassiopeia-preprocess, which takes in a configuration file (for example in Cassiopeia/data/preprocess.cfg) and runs the pipeline end-to-end. For example, if you have a config called example_config.cfg, this can be invoked from the command line with:

cassiopeia-preprocess example_config.cfg

In this brief tutorial, we will preprocess a sample prepared with the 10X Genomics 3’ version 3 chemistry and an intBC whitelist (that define the target site intBCs we know are present in the sample, obtained via DNA sequencing).

[5]:
import pandas as pd

import cassiopeia as cas
[6]:
# The raw FASTQs
input_files = [
    "R1.fastq.gz", "R2.fastq.gz"
]
# The sample name, used for naming output files
name = 'test_sample'
# Directory to output results
output_directory = "test_preprocess_pipeline"
# Path to the target site reference sequence in FASTA format
reference_filepath = "../data/PCT48.ref.fasta"
# Number of threads to use, whenever parallelization is possible
n_threads = 8
# Whether to allow a single intBC to have multiple allele states
# For chemistries for which barcode == cell, this should be `False`.
allow_allele_conflicts = False
# Verbosity of logging
verbose = True

cassiopeia.pp.setup(output_dir, verbose=verbose)

convert#

Since we used the 10X Genomics 3’ version 3 chemistry to prepare our samples, we provide chemistry=10xv3. Other supported chemistries are the following.

  • dropseq: Droplet-based scRNA-seq chemistry described in Macosco et al. 2015

  • 10xv2: 10x Genomics 3’ version 2

  • 10xv3: 10x Genomics 3’ version 3

  • indropsv3: inDrops version 3 by Zilionis et al. 2017

  • slideseq2: Slide-seq version 2

[ ]:
bam_fp = cas.pp.convert_fastqs_to_unmapped_bam(
    input_files,
    chemistry='10xv3',
    output_directory=output_directory,
    name=name,
    n_threads=n_threads
)

filter_bam#

The quality_threshold parameter controls the minimum PHRED sequencing quality the barcode and UMI sequence must have for a read to pass filtering.

[ ]:
bam_fp = cas.pp.filter_bam(
    bam_fp,
    output_directory=output_directory,
    quality_threshold=10,
    n_threads=n_threads,
)

error_correct_cellbcs_to_whitelist#

The 10X Genomics 3’ version 3 chemistry has a predefined barcode whitelist, to which we will correct our barcode sequences. For chemistries that do not have such a whitelist (such as Drop-seq), this step should be skipped.

The whitelist argument may be a path to the whitelist plaintext file (with one barcode per line) or a Python list containing the whitelisted barcode sequences. Here, we downloaded the appropriate whitelist file from here and uncompressed it.

[ ]:
bam_fp = cas.pp.error_correct_cellbcs_to_whitelist(
    bam_fp,
    whitelist='3M-february-2018.txt',
    output_directory=output_directory,
    n_threads=n_threads,
)

collapse#

The max_hq_mismatches and max_indels arguments control the threshold with which to decide whether to assign a read to a proposed consensus sequence. The defaults (3 and 2 respectively) should work best in most cases.

The method argument may take one of two values: cutoff and likelihood. The former uses a hard PHRED qualtiy cutoff of 30 (and any mismatches below this quality are ignored). Consensus sequences are proposed by selecting the msot common base at each position (with quality at least 30). The latter is a likelihood-based approach that selects the consensus sequences based on what is the most probable base at each position.

[ ]:
umi_table = cas.pp.collapse_umis(
    bam_fp,
    output_directory=output_directory,
    max_hq_mismatches=3,
    max_indels=2,
    method='likelihood',
    n_threads=n_threads,
)

resolve#

The min_umi_per_cell and min_avg_reads_per_umi specify filtering thresholds to filter cells. The former is the minimum number of UMIs a cell must have to pass filtering, and the latter is the minimum average number of reads per UMI a cell must have to pass filtering.

[ ]:
umi_table = cas.pp.resolve_umi_sequence(
    umi_table,
    output_directory=output_directory,
    min_umi_per_cell=10,
    min_avg_reads_per_umi=2.0,
    plot=True,
)

align#

The reference target site sequence must be provided as a FASTA file to the ref_filepath argument or as a string to the ref argument. The gap_open_penalty and gap_extend_penalty specify the gap open and extend penalties to use when aligning sequences. The provided defaults should work well for most cases.

[ ]:
umi_table = cas.pp.align_sequences(
    umi_table,
    ref_filepath=reference_filepath,
    gap_open_penalty=20,
    gap_extend_penalty=1,
    n_threads=n_threads,
)

call_alleles#

Same as with the “align” step, the reference target site sequence must be provided with either the ref_filepath or ref arguments. The following additional arguments must be provided.

  • barcode_interval: The start and end positions for the intBC, which is the barcode that uniquely identifies each target site molecule. The interval is represented as a tuple of the form (start, end), using 0-indexing and start-inclusive/end-exclusive.

  • cutsite_locations: The (center) location of each cutsite, represented as a list of indices, one element for each cutsite.

  • cutsite_width: The number of nucleotides to the left and right of the cutsite location that indels can appear in.

  • context: Whether or not to use the nucleotides surrounding the indels to identify the indels.

  • context_size: The number of bases to the left and right to include as the context.

For the target sites we used for this experiment, we have the following locations.

  • intBC located in the interval (20, 34)

  • cutsites at [112, 166, 120]

[ ]:
umi_table = cas.pp.call_alleles(
    umi_table,
    ref_filepath=reference_filepath,
    barcode_interval=(20, 34),
    cutsite_locations=[112, 166, 220],
    cutsite_width=12,
    context=True,
    context_size=5,
)

error_correct_intbcs_to_whitelist#

For experiments in which the intBC sequences that are present in the sample are not known beforehand, this step should be skipped.

In our case, we do have an intBC whitelist, obtained from DNA sequencing. The intbc_dist_thresh specifies the maximum Levenshtein (edit) distance between the intBC sequence and whitelist to be correct.

[ ]:
intbc_whitelist = [
    'ATGATTTAACTACT', 'CGATTGGTCACTTA', 'CGTGAGTCTCTGAT', 'GAACCCACAATTCC',
    'GAGTATATACCCTT', 'GCGTTTAGAATATT', 'GCCTTCAATTCCAA', 'TAACCAAGCCTACA',
    'TTTCGTCGCTCTTC', 'CGCTATGGGGGGAA', 'CGATATCTTCAAGC', 'TCAGTGGGGTATTG',
    'ACAATGCGTGTGGC',
]
umi_table = cas.pp.error_correct_intbcs_to_whitelist(
    umi_table,
    whitelist=intbc_whitelist,
    intbc_dist_thresh=1
)

error_correct_umis#

The max_umi_distance specifies the maximum Hamming distance between two UMIs for one to be corrected to another.

[ ]:
umi_table = cas.pp.error_correct_umis(
    umi_table,
    max_umi_distance=2,
    allow_allele_conflicts=allow_allele_conflicts,
    n_threads=n_threads,
)

filter_molecule_table#

The min_umi_per_cell and min_avg_reads_per_umi behave the same as the “resolve” step.

See the documentation for more details.

[ ]:
umi_table = cas.pp.filter_molecule_table(
    umi_table,
    output_directory=output_directory,
    min_umi_per_cell=10,
    min_avg_reads_per_umi=2.0,
    min_reads_per_umi=-1,
    intbc_prop_thresh=0.5,
    intbc_umi_thresh=10,
    intbc_dist_thresh=1,
    doublet_threshold=0.35,
    allow_allele_conflicts=allow_allele_conflicts,
    plot=True,
)

call_lineage_groups#

The min_umi_per_cell and min_avg_reads_per_umi behave the same as the “resolve” step.

See the documentation for more details.

[ ]:
allele_table = cas.pp.call_lineage_groups(
    umi_table,
    output_directory=output_directory,
    min_umi_per_cell=10,
    min_avg_reads_per_umi=2.0,
    min_cluster_prop=0.005,
    min_intbc_thresh=0.05,
    inter_doublet_threshold=0.35,
    kinship_thresh=0.25,
    plot=True,
)
[12]:
allele_table.head(5)
[12]:
cellBC intBC allele r1 r2 r3 lineageGrp UMI readCount Sample
0 AAACCTGAGGCTAGAC-1 CCCCGTGCCTTCCT 4 15 186.0 AAACCTGAGGCTAGAC-1
1 AAACCTGAGGCTAGAC-1 CGACAATGTAGTTG CTTTG[104:29D]TACGGGATAT[167:54D]CGGAGGATAT[16... CTTTG[104:29D]TACGG GATAT[167:54D]CGGAG GATAT[167:54D]CGGAG 4 26 326.0 AAACCTGAGGCTAGAC-1
2 AAACCTGAGGCTAGAC-1 CGCGCGTCCGGGTC CGCCG[111:1I]AAAAAACATAA[161:18D]CGTGAATTCG[No... CGCCG[111:1I]AAAAAA CATAA[161:18D]CGTGA ATTCG[None]CGGAG 4 13 104.0 AAACCTGAGGCTAGAC-1
3 AAACCTGAGGCTAGAC-1 GACTTTAATGTACA CCGAA[113:54D]CTCTGCCGAA[113:54D]CTCTGTAATT[21... CCGAA[113:54D]CTCTG CCGAA[113:54D]CTCTG TAATT[219:2D]CGGAG 4 18 187.0 AAACCTGAGGCTAGAC-1
4 AAACCTGAGGCTAGAC-1 GATGGACATTGGGG CCGAA[113:50D]ATATCCCGAA[113:50D]ATATCATTCG[No... CCGAA[113:50D]ATATC CCGAA[113:50D]ATATC ATTCG[None]CGGAG 4 19 178.0 AAACCTGAGGCTAGAC-1
[ ]: