cassiopeia.pp.filter_molecule_table

cassiopeia.pp.filter_molecule_table(input_df, output_directory, min_umi_per_cell=10, min_avg_reads_per_umi=2.0, umi_read_thresh=-1, intbc_prop_thresh=0.5, intbc_umi_thresh=10, intbc_dist_thresh=1, doublet_threshold=0.35, allow_allele_conflicts=False, plot=False)

Filters and corrects a molecule table of cellBC-UMI pairs.

Performs the following steps on the alignments in a DataFrame:
  1. Filters out cellBCs with fewer than min_umi_per_cell unique UMIs

  2. Filters out UMIs with a read count below umi_read_thresh

  3. Error corrects intBCs by changing intBCs with low UMI counts to intBCs with the same allele and a close sequence

  4. Filters out cellBCs that contain too much conflicting allele information as intra-lineage doublets

  5. Chooses one allele for each cellBC-intBC pair, by selecting the most common
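The first two steps can be pictured on a toy table. The following is a minimal pandas sketch, not Cassiopeia's implementation: the column names (cellBC, UMI, readCount), the helper names, and the inclusive comparisons are assumptions made for illustration, and the min_avg_reads_per_umi check is left out.

```python
import pandas as pd

# Toy molecule table; column names are assumptions made for this sketch.
toy = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "B", "B", "C"],
        "UMI": ["u1", "u2", "u3", "u1", "u2", "u1"],
        "readCount": [10, 1, 7, 5, 6, 9],
    }
)


def filter_cells_by_umi_count(df, min_umi_per_cell):
    """Hypothetical helper: keep cellBCs with at least min_umi_per_cell unique UMIs."""
    umis_per_cell = df.groupby("cellBC")["UMI"].transform("nunique")
    return df[umis_per_cell >= min_umi_per_cell]


def filter_umis_by_reads(df, umi_read_thresh):
    """Hypothetical helper: keep UMIs whose read count reaches umi_read_thresh."""
    return df[df["readCount"] >= umi_read_thresh]


# Step 1 drops cell C (one UMI); step 2 drops UMI u2 of cell A (one read).
step1 = filter_cells_by_umi_count(toy, min_umi_per_cell=2)
step2 = filter_umis_by_reads(step1, umi_read_thresh=5)
```

Cells are filtered before UMIs here to mirror the order of the steps listed above.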

Parameters
input_df : DataFrame

A molecule table, i.e. cellBC-UMI pairs. Note that each cellBC should only contain one instance of each UMI

output_directory : str

The output directory path to store plots

min_umi_per_cell : int (default: 10)

The threshold specifying the minimum number of UMIs in a cell needed to be retained during filtering

min_avg_reads_per_umi : float (default: 2.0)

The threshold specifying the minimum coverage (i.e. average reads per UMI) in a cell needed for that cell to be retained during filtering

umi_read_thresh : int (default: -1)

The threshold specifying the minimum read count needed for a UMI to be retained during filtering. Set dynamically if value is < 0

intbc_prop_thresh : float (default: 0.5)

The threshold specifying the maximum proportion of the total UMI counts for an intBC to be corrected to another

intbc_umi_thresh : int (default: 10)

The threshold specifying the maximum UMI count an intBC can have to be corrected to another

intbc_dist_thresh : int (default: 1)

The threshold specifying the maximum Levenshtein Distance between sequences for an intBC to be corrected to another
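To make the three intBC correction thresholds concrete, here is a minimal pure-Python sketch. It is not Cassiopeia's code: the helper names, the way the total UMI count is supplied by the caller, and the inclusive comparisons are all assumptions for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insertions, deletions, substitutions) between two strings."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution or match
                )
            )
        previous = current
    return previous[-1]


def is_correctable(
    minor_intbc,
    major_intbc,
    minor_umis,
    total_umis,
    same_allele,
    intbc_prop_thresh=0.5,
    intbc_umi_thresh=10,
    intbc_dist_thresh=1,
):
    """Hypothetical check: may the low-count intBC be merged into the other one?"""
    return (
        same_allele
        and minor_umis <= intbc_umi_thresh
        and minor_umis / total_umis <= intbc_prop_thresh
        and levenshtein(minor_intbc, major_intbc) <= intbc_dist_thresh
    )
```

Under these assumptions, an intBC with 2 of 20 UMIs that differs from a same-allele intBC by one base would be corrected with the default thresholds, while one differing by two bases would not.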

doublet_threshold : float (default: 0.35)

The threshold specifying the maximum proportion of conflicting allele information allowed for an intBC to be retained in doublet filtering. Set to None to skip doublet filtering
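One way to picture a "proportion of conflicting allele information" is the fraction of a cell's UMIs that disagree with the most common allele of their intBC. The sketch below computes that quantity per cellBC with pandas. The definition, the column names and the helper name are assumptions for illustration; Cassiopeia's actual computation may differ.

```python
import pandas as pd

# Toy table: one row per UMI; column names are assumptions for this sketch.
toy = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "A", "B", "B"],
        "intBC": ["x", "x", "x", "y", "x", "x"],
        "allele": ["a1", "a1", "a2", "a3", "a2", "a2"],
        "UMI": ["u1", "u2", "u3", "u4", "u1", "u2"],
    }
)


def conflict_proportion(df):
    """Hypothetical helper: per-cell fraction of UMIs not backing the top allele."""
    per_allele = df.groupby(["cellBC", "intBC", "allele"]).size()
    # UMIs supporting the most common allele of each cellBC-intBC pair.
    top = per_allele.groupby(level=["cellBC", "intBC"]).max()
    supporting = top.groupby(level="cellBC").sum()
    total = df.groupby("cellBC").size()
    return 1 - supporting / total


proportions = conflict_proportion(toy)
# Cells whose proportion exceeds the threshold would be treated as doublets.
flagged = proportions[proportions > 0.35].index.tolist()
```

In this toy table, cell A has one conflicting UMI out of four, which stays under the default threshold of 0.35, so no cell is flagged.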

allow_allele_conflicts : bool (default: False)

Whether or not to allow multiple alleles to be assigned to each cellBC-intBC pair. For fully single-cell data, this option should be set to False, since each cell is expected to have a single allele state for each intBC. However, this option should be set to True for chemistries that may result in multiple physical cells being captured for each barcode.
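When allele conflicts are not allowed, each cellBC-intBC pair ends up with a single allele, chosen as the most common one. The pandas sketch below illustrates that selection, assuming "most common" means supported by the most UMIs. The column names and the helper name are hypothetical, and tie-breaking is not addressed.

```python
import pandas as pd

# Toy table: one row per UMI; column names are assumptions for this sketch.
toy = pd.DataFrame(
    {
        "cellBC": ["A", "A", "A", "A", "B", "B"],
        "intBC": ["x", "x", "x", "y", "x", "x"],
        "allele": ["a1", "a1", "a2", "a3", "a2", "a2"],
        "UMI": ["u1", "u2", "u3", "u4", "u1", "u2"],
    }
)


def pick_most_common_allele(df):
    """Hypothetical helper: keep the allele with the most UMIs per cellBC-intBC."""
    counts = (
        df.groupby(["cellBC", "intBC", "allele"])
        .size()
        .reset_index(name="n_umis")
    )
    # Sort so the best-supported allele comes first within each pair.
    counts = counts.sort_values(
        ["cellBC", "intBC", "n_umis"], ascending=[True, True, False]
    )
    return counts.drop_duplicates(["cellBC", "intBC"]).reset_index(drop=True)


chosen = pick_most_common_allele(toy)
```

For cell A and intBC x, allele a1 (two UMIs) is kept over a2 (one UMI).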

plot : bool (default: False)

Indicates whether to plot the change in intBC and cellBC counts across filtering stages

Return type

DataFrame

Returns

A filtered and corrected allele table of cellBC-UMI-allele groups
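A call might look like the sketch below, which passes every keyword from the signature above with its documented default. The wrapper name run_default_filter is hypothetical, and the import is done lazily only so that the snippet can be loaded without Cassiopeia installed.

```python
def run_default_filter(molecule_table, output_directory):
    """Hypothetical wrapper: call filter_molecule_table with the documented defaults."""
    # Imported lazily so this sketch can be loaded without Cassiopeia installed.
    import cassiopeia as cas

    return cas.pp.filter_molecule_table(
        molecule_table,
        output_directory,
        min_umi_per_cell=10,
        min_avg_reads_per_umi=2.0,
        umi_read_thresh=-1,  # negative value: threshold is set dynamically
        intbc_prop_thresh=0.5,
        intbc_umi_thresh=10,
        intbc_dist_thresh=1,
        doublet_threshold=0.35,  # None would skip doublet filtering
        allow_allele_conflicts=False,
        plot=False,
    )
```

The returned DataFrame is the filtered and corrected table described under Returns.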