Preprocess

Data Preprocessing

The following functions make up our pipeline for processing sequencing data from single-cell lineage tracing technologies:

`pp.align_sequences`(queries[, ref_filepath, ...])	Align reads to the TargetSite reference.
`pp.call_alleles`(alignments[, ref_filepath, ...])	Call indels from CIGAR strings.
`pp.call_lineage_groups`(input_df, ...[, ...])	Assigns cells to their clonal populations.
`pp.collapse_umis`(bam_fp, output_directory[, ...])	Collapses close UMIs together from a BAM file.
`pp.convert_fastqs_to_unmapped_bam`(fastq_fps, ...)	Converts FASTQs into an unmapped BAM based on a chemistry.
`pp.error_correct_cellbcs_to_whitelist`(...[, ...])	Error-correct cell barcodes in the input BAM.
`pp.error_correct_intbcs_to_whitelist`(...[, ...])	Corrects all intBCs to the provided whitelist.
`pp.error_correct_umis`(input_df[, ...])	Within cellBC-intBC pairs, collapses UMIs that have close sequences.
`pp.filter_bam`(bam_fp, output_directory[, ...])	Filter reads in a BAM that have low-quality barcodes or UMIs.
`pp.filter_molecule_table`(input_df, ...[, ...])	Filters and corrects a molecule table of cellBC-UMI pairs.
`pp.filter_cells`(molecule_table[, ...])	Filter out cell barcodes that have too few UMIs or too few reads/UMI.
`pp.filter_umis`(molecule_table[, ...])	Filters out UMIs with too few reads.
`pp.resolve_umi_sequence`(molecule_table, ...)	Resolve a consensus sequence for each UMI.
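
To make the idea behind UMI error correction concrete ("within cellBC-intBC pairs, collapses UMIs that have close sequences"), here is a minimal, self-contained sketch in plain Python. It is an illustration only and not the library's implementation: the function names `hamming` and `collapse_close_umis`, the `(cellBC, intBC, UMI, read_count)` row layout, the use of Hamming distance, and the `max_distance` threshold are all assumptions made for this example.

```python
from collections import defaultdict


def hamming(a, b):
    """Count mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_close_umis(rows, max_distance=1):
    """Merge close UMIs within each (cellBC, intBC) pair.

    Hypothetical helper, not part of any library. `rows` is an iterable of
    (cellBC, intBC, UMI, read_count) tuples. Returns a dict mapping
    (cellBC, intBC) to {UMI: read_count} after merging.
    """
    # Group read counts by (cellBC, intBC), summing duplicate UMIs.
    groups = defaultdict(dict)
    for cell_bc, int_bc, umi, reads in rows:
        counts = groups[(cell_bc, int_bc)]
        counts[umi] = counts.get(umi, 0) + reads

    collapsed = {}
    for key, counts in groups.items():
        # Visit UMIs from most to least supported (ties broken alphabetically)
        # so that low-count UMIs are absorbed into high-count neighbours.
        ordered = sorted(counts, key=lambda u: (-counts[u], u))
        kept = {}
        for umi in ordered:
            target = next(
                (k for k in kept if hamming(k, umi) <= max_distance), None
            )
            if target is None:
                kept[umi] = counts[umi]
            else:
                kept[target] += counts[umi]
        collapsed[key] = kept
    return collapsed


example_rows = [
    ("cellA", "int1", "AAAA", 10),
    ("cellA", "int1", "AAAT", 2),   # one mismatch from AAAA, so it is merged
    ("cellA", "int1", "GGGG", 5),
    ("cellB", "int1", "AAAT", 3),   # different cell, so it is left alone
]
print(collapse_close_umis(example_rows))
```

The greedy, highest-count-first ordering is a design choice for this sketch: it treats the best-supported UMI as the likely true sequence and folds its near neighbours into it, and it never compares UMIs across different cellBC-intBC pairs.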

Data Utilities

We also provide several functions for converting between data formats for downstream analyses:

`pp.compute_empirical_indel_priors`(allele_table)	Computes indel prior probabilities.
`pp.convert_alleletable_to_character_matrix`(...)	Converts an AlleleTable into a character matrix.
`pp.convert_alleletable_to_lineage_profile`(...)	Converts an AlleleTable to a lineage profile.
`pp.convert_lineage_profile_to_character_matrix`(...)	Converts a lineage profile to a character matrix.
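
As a rough picture of what converting an allele table into a character matrix involves, the sketch below encodes per-cell indel observations as integer states in plain Python. It is a simplified illustration under stated assumptions, not the library's implementation: the function name `allele_records_to_character_matrix`, the record layout (dicts with `cellBC`, `intBC`, and a `sites` tuple of indel strings), the `"NONE"` marker for an uncut site, and the conventions of `0` for uncut and `-1` for missing are all hypothetical choices made for this example.

```python
def allele_records_to_character_matrix(records, uncut="NONE", missing=-1):
    """Encode indel observations as an integer character matrix.

    Hypothetical helper, not part of any library. Each record is a dict with
    keys 'cellBC', 'intBC', and 'sites' (a tuple of indel strings, one per
    cut site). Each (intBC, site index) pair becomes one character (column).
    """
    cells = sorted({r["cellBC"] for r in records})
    characters = sorted(
        {(r["intBC"], i) for r in records for i in range(len(r["sites"]))}
    )
    column = {c: j for j, c in enumerate(characters)}
    row = {cell: i for i, cell in enumerate(cells)}

    # Start with every entry missing; observed entries are filled in below.
    matrix = [[missing] * len(characters) for _ in cells]
    # Per-character mapping from indel string to a positive integer state.
    state_maps = {c: {} for c in characters}

    for r in records:
        for i, indel in enumerate(r["sites"]):
            character = (r["intBC"], i)
            if indel == uncut:
                state = 0
            else:
                mapping = state_maps[character]
                # Assign the next unused positive integer to a new indel.
                state = mapping.setdefault(indel, len(mapping) + 1)
            matrix[row[r["cellBC"]]][column[character]] = state
    return cells, characters, matrix, state_maps


example_records = [
    {"cellBC": "c1", "intBC": "A", "sites": ("NONE", "2D")},
    {"cellBC": "c2", "intBC": "A", "sites": ("1I", "2D")},
    {"cellBC": "c2", "intBC": "B", "sites": ("NONE", "NONE")},
]
cells, characters, matrix, state_maps = allele_records_to_character_matrix(
    example_records
)
for cell, states in zip(cells, matrix):
    print(cell, states)
```

States are numbered independently per character, so the same integer in two different columns does not imply the same indel; the returned `state_maps` records which indel each state stands for.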