Preprocess

Data Preprocessing

Several functions make up our pipeline for processing sequencing data from single-cell lineage tracing technologies:

pp.align_sequences(queries[, ref_filepath, ...])

Align reads to the TargetSite reference.

pp.call_alleles(alignments[, ref_filepath, ...])

Call indels from CIGAR strings.

pp.call_lineage_groups(input_df, ...[, ...])

Assigns cells to their clonal populations.

pp.collapse_umis(bam_fp, output_directory[, ...])

Collapses close UMIs together from a bam file.

pp.convert_fastqs_to_unmapped_bam(fastq_fps, ...)

Converts FASTQs into an unmapped BAM based on a chemistry.

pp.error_correct_cellbcs_to_whitelist(...[, ...])

Error-correct cell barcodes in the input BAM.

pp.error_correct_intbcs_to_whitelist(...[, ...])

Corrects all intBCs to the provided whitelist.

pp.error_correct_umis(input_df[, ...])

Within cellBC-intBC pairs, collapses UMIs that have close sequences.

pp.filter_bam(bam_fp, output_directory[, ...])

Filter reads in a BAM that have low-quality barcodes or UMIs.

pp.filter_molecule_table(input_df, ...[, ...])

Filters and corrects a molecule table of cellBC-UMI pairs.

pp.filter_cells(molecule_table[, ...])

Filter out cell barcodes that have too few UMIs or too few reads/UMI.

pp.filter_umis(molecule_table[, ...])

Filters out UMIs with too few reads.

pp.resolve_umi_sequence(molecule_table, ...)

Resolve a consensus sequence for each UMI.
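
To make the "call indels from CIGAR strings" step more concrete, here is a minimal sketch of how insertions and deletions can be read off a CIGAR string. It is not the implementation behind pp.call_alleles; the helper name indels_from_cigar, its ref_start parameter, and its return format are hypothetical and chosen only for illustration. It relies solely on the standard SAM meaning of CIGAR operations.

```python
import re

# Hypothetical helper for illustration; not part of the pp module.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, ref_start=0):
    """Return (kind, reference_position, length) for each I and D operation."""
    indels = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            # Insertions add bases to the read but consume no reference bases.
            indels.append(("insertion", ref_pos, length))
        elif op == "D":
            # Deletions consume reference bases that are absent from the read.
            indels.append(("deletion", ref_pos, length))
            ref_pos += length
        elif op in "MN=X":
            # Matches, mismatches, and skipped regions advance the reference.
            ref_pos += length
        # S, H, and P operations do not consume reference bases.
    return indels


calls = indels_from_cigar("10M2I5M3D4M")
```

In this example, calls holds one 2-base insertion at reference position 10 and one 3-base deletion at position 15, both 0-based relative to ref_start.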

Data Utilities

We also have several functions that are useful for converting between data formats for downstream analyses:

pp.compute_empirical_indel_priors(allele_table)

Computes indel prior probabilities.

pp.convert_alleletable_to_character_matrix(...)

Converts an AlleleTable into a character matrix.

pp.convert_alleletable_to_lineage_profile(...)

Converts an AlleleTable to a lineage profile.

pp.convert_lineage_profile_to_character_matrix(...)

Converts a lineage profile to a character matrix.
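
As a rough illustration of what converting an allele table into a character matrix involves, the sketch below encodes observed indels as integer states per character. It is written in plain Python and does not call pp.convert_alleletable_to_character_matrix. The input layout (cell, character, indel triples), the function name to_character_matrix, and the encoding (0 for uncut, -1 for missing, positive integers for indels in order of first appearance) are assumptions made for this example.

```python
# Hypothetical helper for illustration; not part of the pp module.
def to_character_matrix(rows, uncut="None", missing=-1):
    """Encode a list of (cell, character, indel) triples as a cells x characters matrix."""
    cells = sorted({cell for cell, _, _ in rows})
    characters = sorted({character for _, character, _ in rows})

    # Each character keeps its own indel -> integer state mapping.
    state_codes = {character: {} for character in characters}

    # Start with every entry missing, then fill in what was observed.
    matrix = {cell: {ch: missing for ch in characters} for cell in cells}
    for cell, character, indel in rows:
        if indel == uncut:
            matrix[cell][character] = 0
        else:
            codes = state_codes[character]
            # New indels get the next unused positive integer for this character.
            matrix[cell][character] = codes.setdefault(indel, len(codes) + 1)

    return cells, characters, [[matrix[c][ch] for ch in characters] for c in cells]


# Toy allele observations; the identifiers and indel labels are made up.
rows = [
    ("cellA", "intBC1_r1", "12:2D"),
    ("cellA", "intBC1_r2", "None"),
    ("cellB", "intBC1_r1", "12:2D"),
    ("cellB", "intBC2_r1", "40:1I"),
]
cells, characters, matrix = to_character_matrix(rows)
```

Here cellA and cellB share state 1 at intBC1_r1, and any character not observed in a cell stays at the missing value.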