cassiopeia.pp.filter_molecule_table

cassiopeia.pp.filter_molecule_table(input_df, output_directory, min_umi_per_cell=10, min_avg_reads_per_umi=2.0, min_reads_per_umi=-1, intbc_prop_thresh=0.5, intbc_umi_thresh=10, intbc_dist_thresh=1, doublet_threshold=0.35, allow_allele_conflicts=False, plot=False)

Filters and corrects a molecule table of cellBC-UMI pairs.

Performs the following steps on the alignments in a DataFrame:
  1. Filters out UMIs with read count < min_reads_per_umi. If min_reads_per_umi is less than 0, a dynamic threshold is calculated as (99th percentile of read counts) // 10.
  2. Filters out cellBCs with unique UMIs < min_umi_per_cell or average read count per UMI < min_avg_reads_per_umi.
  3. Error corrects intBCs by changing intBCs with low UMI counts to intBCs with the same allele and a close sequence.
  4. Filters out cellBCs that contain too much conflicting allele information as intra-lineage doublets.
  5. Chooses one allele for each cellBC-intBC pair by selecting the most common. This is not performed when allow_allele_conflicts is True.
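The dynamic threshold in step 1 can be worked through by hand. The following is a minimal sketch in plain Python, not the library's code: the helper names `percentile` and `resolve_min_reads_per_umi` are hypothetical, and linear interpolation for the percentile is an assumption, since the interpolation method is not stated here.

```python
import math


def percentile(values, q):
    """Return the q-th percentile (0 <= q <= 100) using linear interpolation.

    Assumption: the interpolation method used by the library is not documented
    on this page; linear interpolation is chosen for illustration only.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("percentile() needs at least one value")
    position = (len(ordered) - 1) * q / 100
    lower = math.floor(position)
    upper = math.ceil(position)
    fraction = position - lower
    return ordered[lower] * (1 - fraction) + ordered[upper] * fraction


def resolve_min_reads_per_umi(read_counts, min_reads_per_umi=-1):
    """Hypothetical helper: return the read-count threshold for step 1."""
    if min_reads_per_umi >= 0:
        # A non-negative value is used as given.
        return min_reads_per_umi
    # Negative value: (99th percentile of read counts) // 10.
    return int(percentile(read_counts, 99) // 10)


# Toy read counts, one per UMI: 1, 2, ..., 101.
read_counts = list(range(1, 102))
threshold = resolve_min_reads_per_umi(read_counts)

# UMIs with read count < threshold are filtered out.
kept = [count for count in read_counts if count >= threshold]
```

For these toy counts the 99th percentile is 100, so the threshold is 100 // 10 = 10, and only UMIs with at least 10 reads remain.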

Parameters:
input_df DataFrame

A molecule table, i.e. cellBC-UMI pairs. Note that each cellBC should contain only one instance of each UMI.

output_directory str

The output directory path to store plots

min_umi_per_cell int (default: 10)

The threshold specifying the minimum number of UMIs in a cell needed to be retained during filtering

min_avg_reads_per_umi float (default: 2.0)

The threshold specifying the minimum coverage (i.e. average) reads per UMI in a cell needed in order for that cell to be retained during filtering
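As an illustration of how min_umi_per_cell and min_avg_reads_per_umi interact, here is a minimal sketch in plain Python rather than the library's implementation. It assumes a cell must meet both minimums to be retained, as the two parameter descriptions state. The function name `filter_cells_sketch` and the row keys `cellBC` and `readCount` are assumptions made for this example.

```python
from collections import defaultdict


def filter_cells_sketch(rows, min_umi_per_cell=10, min_avg_reads_per_umi=2.0):
    """Hypothetical helper: keep rows of cells that meet both minimums.

    Each row stands for one cellBC-UMI pair. Because each cellBC holds only one
    instance of each UMI, counting rows per cell counts its unique UMIs.
    """
    umi_counts = defaultdict(int)
    read_totals = defaultdict(int)
    for row in rows:
        umi_counts[row["cellBC"]] += 1
        read_totals[row["cellBC"]] += row["readCount"]

    passing = {
        cell
        for cell, n_umis in umi_counts.items()
        if n_umis >= min_umi_per_cell
        and read_totals[cell] / n_umis >= min_avg_reads_per_umi
    }
    return [row for row in rows if row["cellBC"] in passing]


rows = (
    [{"cellBC": "A", "readCount": 4} for _ in range(3)]  # 3 UMIs, 4 reads per UMI
    + [{"cellBC": "B", "readCount": 10}]                 # only 1 UMI
    + [{"cellBC": "C", "readCount": 1} for _ in range(3)]  # 3 UMIs, 1 read per UMI
)
kept_rows = filter_cells_sketch(rows, min_umi_per_cell=2, min_avg_reads_per_umi=2.0)
kept_cells = {row["cellBC"] for row in kept_rows}
```

With these toy thresholds, cell B fails the UMI minimum and cell C fails the coverage minimum, so only cell A is retained.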

min_reads_per_umi int (default: -1)

The threshold specifying the minimum read count needed for a UMI to be retained during filtering. Set dynamically if value is < 0.

intbc_prop_thresh float (default: 0.5)

The threshold specifying the maximum proportion of the total UMI counts for an intBC to be corrected to another

intbc_umi_thresh int (default: 10)

The threshold specifying the maximum UMI count an intBC may have to be corrected to another

intbc_dist_thresh int (default: 1)

The threshold specifying the maximum Levenshtein Distance between sequences for an intBC to be corrected to another
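The three intbc_* thresholds together decide whether a low-count intBC is merged into a similar one. The sketch below is one plausible reading, not the library's code. Several details are assumptions: the proportion is taken as the minor intBC's share of the UMIs in the pair, all comparisons are inclusive, and the names `levenshtein` and `should_correct` are hypothetical. It also presumes the two intBCs already share the same allele in the same cell.

```python
def levenshtein(a, b):
    """Levenshtein (edit) distance between two strings, by dynamic programming."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def should_correct(
    minor_seq,
    minor_umis,
    major_seq,
    major_umis,
    intbc_prop_thresh=0.5,
    intbc_umi_thresh=10,
    intbc_dist_thresh=1,
):
    """Hypothetical predicate: should minor_seq be rewritten to major_seq?"""
    # Assumed definition: share of the pair's UMIs carried by the minor intBC.
    proportion = minor_umis / (minor_umis + major_umis)
    return (
        levenshtein(minor_seq, major_seq) <= intbc_dist_thresh
        and minor_umis <= intbc_umi_thresh
        and proportion <= intbc_prop_thresh
    )


# One substitution away, few UMIs: a likely sequencing error.
likely_error = should_correct("ACGTACGTACGTAC", 2, "ACGTACGTACGTAA", 50)
# Same sequences, but the minor intBC is well supported.
well_supported = should_correct("ACGTACGTACGTAC", 40, "ACGTACGTACGTAA", 50)
# Two substitutions away: too distant under the default distance threshold.
too_distant = should_correct("ACGTACGTACGTAC", 2, "ACGTACGTACGTTT", 50)
```

Raising intbc_dist_thresh or intbc_umi_thresh makes correction more aggressive, so the defaults err on the side of leaving distinct intBCs alone.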

doublet_threshold float (default: 0.35)

The threshold specifying the maximum proportion of conflicting allele information allowed for a cellBC to be retained in doublet filtering. Set to None to skip doublet filtering
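To make "proportion of conflicting allele information" concrete, here is a minimal sketch of one way such a proportion could be computed for a single cell. This is an assumption for illustration, not the library's definition: a UMI counts as conflicting when its allele differs from the most common allele of its intBC. The function names and the row keys `intBC` and `allele` are also hypothetical.

```python
from collections import Counter, defaultdict


def conflicting_proportion(cell_rows):
    """Hypothetical helper: share of a cell's UMIs that disagree with the
    most common allele of their intBC. Each row is one UMI."""
    alleles_per_intbc = defaultdict(Counter)
    for row in cell_rows:
        alleles_per_intbc[row["intBC"]][row["allele"]] += 1

    total = sum(sum(counts.values()) for counts in alleles_per_intbc.values())
    if total == 0:
        return 0.0
    conflicting = sum(
        sum(counts.values()) - max(counts.values())
        for counts in alleles_per_intbc.values()
    )
    return conflicting / total


def is_probable_doublet(cell_rows, doublet_threshold=0.35):
    """Flag a cell whose conflicting proportion exceeds the threshold."""
    if doublet_threshold is None:
        # None skips doublet filtering, so nothing is flagged.
        return False
    return conflicting_proportion(cell_rows) > doublet_threshold


# Mostly consistent cell: 1 of 6 UMIs conflicts.
singlet = (
    [{"intBC": "X", "allele": "a"}] * 3
    + [{"intBC": "X", "allele": "b"}]
    + [{"intBC": "Y", "allele": "c"}] * 2
)
# Evenly split alleles on one intBC: 2 of 4 UMIs conflict.
mixed = [{"intBC": "X", "allele": "a"}] * 2 + [{"intBC": "X", "allele": "b"}] * 2
```

Under the default threshold of 0.35, the first toy cell (proportion 1/6) would be kept and the second (proportion 0.5) would be flagged.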

allow_allele_conflicts bool (default: False)

Whether or not to allow multiple alleles to be assigned to each cellBC-intBC pair. For fully single-cell data, this option should be set to False, since each cell is expected to have a single allele state for each intBC. However, this option should be set to True for chemistries that may result in multiple physical cells being captured for each barcode.
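When allow_allele_conflicts is False, each cellBC-intBC pair ends up with a single allele, chosen as the most common one. A minimal sketch of that selection follows. It assumes "most common" means supported by the most UMIs (one row per UMI), says nothing about how the library breaks ties, and uses the hypothetical name `keep_most_common_allele` and the assumed row keys `cellBC`, `intBC`, and `allele`.

```python
from collections import Counter, defaultdict


def keep_most_common_allele(rows):
    """Hypothetical helper: keep only rows carrying the most common allele
    of their cellBC-intBC pair. Each row is one UMI."""
    allele_counts = defaultdict(Counter)
    for row in rows:
        allele_counts[(row["cellBC"], row["intBC"])][row["allele"]] += 1

    # most_common(1) returns a list holding the top (allele, count) pair.
    winners = {
        pair: counts.most_common(1)[0][0] for pair, counts in allele_counts.items()
    }
    return [
        row for row in rows if row["allele"] == winners[(row["cellBC"], row["intBC"])]
    ]


rows = [
    {"cellBC": "A", "intBC": "X", "allele": "a"},
    {"cellBC": "A", "intBC": "X", "allele": "a"},
    {"cellBC": "A", "intBC": "X", "allele": "b"},  # minority allele, dropped
    {"cellBC": "A", "intBC": "Y", "allele": "c"},
    {"cellBC": "B", "intBC": "X", "allele": "b"},  # different cell, kept
]
resolved = keep_most_common_allele(rows)
```

With allow_allele_conflicts set to True this step is not performed, so the minority-allele row for cell A would remain in the table.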

plot bool (default: False)

Indicates whether to plot the change in intBC and cellBC counts across filtering stages

Return type:

DataFrame

Returns:

A filtered and corrected allele table of cellBC-UMI-allele groups
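A call with every keyword argument spelled out might look like the sketch below. The argument names and default values are taken from the signature above. The wrapper `filter_with_defaults` is hypothetical, and the molecule table and output directory passed to it are assumed to already exist.

```python
# Keyword arguments mirroring the defaults in the documented signature.
FILTER_KWARGS = {
    "min_umi_per_cell": 10,
    "min_avg_reads_per_umi": 2.0,
    "min_reads_per_umi": -1,          # negative: threshold is set dynamically
    "intbc_prop_thresh": 0.5,
    "intbc_umi_thresh": 10,
    "intbc_dist_thresh": 1,
    "doublet_threshold": 0.35,        # None would skip doublet filtering
    "allow_allele_conflicts": False,  # True for multi-cell-per-barcode chemistries
    "plot": False,
}


def filter_with_defaults(molecule_table, output_directory):
    """Hypothetical wrapper around cassiopeia.pp.filter_molecule_table.

    molecule_table is a DataFrame of cellBC-UMI pairs; output_directory is the
    path where plots are stored.
    """
    # Imported here so this module can be loaded without cassiopeia installed.
    import cassiopeia as cas

    return cas.pp.filter_molecule_table(
        molecule_table, output_directory, **FILTER_KWARGS
    )
```

Keeping the thresholds in one dictionary makes it easy to record the exact filtering settings alongside the resulting allele table.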