Preprocess

Data Preprocessing

The following functions make up our pipeline for processing sequencing data from single-cell lineage tracing technologies:

pp.align_sequences(queries[, ref_filepath, …])

Align reads to the TargetSite reference.

pp.call_alleles(alignments[, ref_filepath, …])

Call indels from CIGAR strings.

pp.call_lineage_groups(input_df, …[, …])

Assigns cells to their clonal populations.

pp.collapse_umis(bam_fp, output_directory[, …])

Collapses close UMIs together from a BAM file.

pp.convert_fastqs_to_unmapped_bam(fastq_fps, …)

Converts FASTQs into an unmapped BAM based on a chemistry.

pp.error_correct_cellbcs_to_whitelist(…[, …])

Error-correct cell barcodes in the input BAM.

pp.error_correct_intbcs_to_whitelist(…[, …])

Corrects all intBCs to the provided whitelist.

pp.error_correct_umis(input_df[, …])

Within cellBC-intBC pairs, collapses UMIs that have close sequences.

pp.filter_bam(bam_fp, output_directory[, …])

Filter reads in a BAM that have low-quality barcodes or UMIs.

pp.filter_molecule_table(input_df, …[, …])

Filters and corrects a molecule table of cellBC-UMI pairs.

pp.filter_cells(molecule_table[, …])

Filter out cell barcodes that have too few UMIs or too few reads per UMI.

pp.filter_umis(moleculetable[, readCountThresh])

Filters out UMIs with too few reads.

pp.resolve_umi_sequence(molecule_table, …)

Resolve a consensus sequence for each UMI.
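
To give a feel for what calling indels from CIGAR strings involves, here is a minimal standalone sketch. It is not the implementation behind pp.call_alleles; the function name `indels_from_cigar` and its return format are hypothetical, chosen only for illustration. It relies solely on the standard CIGAR convention that M, D, N, = and X consume reference positions while I, S, H and P do not.

```python
import re

# Matches one CIGAR operation: a length followed by an operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, ref_start=0):
    """Return (kind, reference_position, length) for each I/D operation.

    Hypothetical helper for illustration; positions are 0-based offsets
    into the reference, starting from `ref_start`.
    """
    indels = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            indels.append(("insertion", ref_pos, length))
        elif op == "D":
            indels.append(("deletion", ref_pos, length))
        # Only these operations advance the position on the reference.
        if op in "MDN=X":
            ref_pos += length
    return indels


example = indels_from_cigar("10M2I5M3D20M")
print(example)  # [('insertion', 10, 2), ('deletion', 15, 3)]
```

An insertion is reported at the reference position it precedes, and a deletion at the first reference base it removes.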

Data Utilities

We also have several functions that are useful for converting between data formats for downstream analyses:

pp.compute_empirical_indel_priors(allele_table)

Computes indel prior probabilities.

pp.convert_alleletable_to_character_matrix(…)

Converts an AlleleTable into a character matrix.

pp.convert_alleletable_to_lineage_profile(…)

Converts an AlleleTable to a lineage profile.

pp.convert_lineage_profile_to_character_matrix(…)

Converts a lineage profile to a character matrix.
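
The idea behind turning an allele table into a character matrix can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's implementation: the record keys (`cell`, `site`, `allele`), the `"None"` marker for an uncut site, the use of 0 for uncut and -1 for missing, and the function name `to_character_matrix` are all hypothetical choices made for this example.

```python
from collections import defaultdict


def to_character_matrix(allele_rows, uncut="None", missing=-1):
    """Encode (cell, site, allele) records as integer character states.

    Assumed encoding: 0 means the site is uncut, positive integers index
    the distinct indels seen at a site, and `missing` marks sites that
    were not observed in a cell.
    """
    observed = defaultdict(dict)  # cell -> {site: allele}
    for row in allele_rows:
        observed[row["cell"]][row["site"]] = row["allele"]

    sites = sorted({site for alleles in observed.values() for site in alleles})
    state_ids = {site: {} for site in sites}  # site -> {allele: state}

    matrix = {}
    for cell in sorted(observed):
        states = []
        for site in sites:
            allele = observed[cell].get(site)
            if allele is None:
                states.append(missing)
            elif allele == uncut:
                states.append(0)
            else:
                ids = state_ids[site]
                # Reuse the state if this indel was seen before at this site,
                # otherwise assign the next unused positive integer.
                states.append(ids.setdefault(allele, len(ids) + 1))
        matrix[cell] = states
    return sites, matrix


rows = [
    {"cell": "A", "site": "s1", "allele": "2D"},
    {"cell": "A", "site": "s2", "allele": "None"},
    {"cell": "B", "site": "s1", "allele": "5I"},
    {"cell": "C", "site": "s1", "allele": "2D"},
    {"cell": "C", "site": "s2", "allele": "1I"},
]
sites, matrix = to_character_matrix(rows)
print(sites)   # ['s1', 's2']
print(matrix)  # {'A': [1, 0], 'B': [2, -1], 'C': [1, 1]}
```

Cells A and C share state 1 at site s1 because they carry the same indel there, which is the kind of shared-character signal that downstream tree reconstruction uses.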