cassiopeia.pp.filter_molecule_table
- cassiopeia.pp.filter_molecule_table(input_df, output_directory, min_umi_per_cell=10, min_avg_reads_per_umi=2.0, min_reads_per_umi=-1, intbc_prop_thresh=0.5, intbc_umi_thresh=10, intbc_dist_thresh=1, doublet_threshold=0.35, allow_allele_conflicts=False, plot=False)
Filters and corrects a molecule table of cellBC-UMI pairs.
- Performs the following steps on the alignments in a DataFrame:
- Filters out UMIs with read count < min_reads_per_umi. If
min_reads_per_umi is less than 0, a dynamic threshold is calculated as (99th percentile of read counts) // 10.
- Filters out cellBCs with unique UMIs < min_umi_per_cell or
average read count per UMI < min_avg_reads_per_umi.
- Error corrects intBCs by changing intBCs with low UMI counts to
intBCs with the same allele and a close sequence.
- Filters out cellBCs that contain too much conflicting allele
information as intra-lineage doublets.
- Chooses one allele for each cellBC-intBC pair, by selecting the most
common. This is not performed when allow_allele_conflicts is True.
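The first two steps can be sketched in plain Python. This is only an illustration of the thresholds described above, not Cassiopeia's implementation: the `readCount` key, the helper names, and the linear-interpolation percentile are assumptions made for the example.

```python
import math
from collections import defaultdict


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty list."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo, hi = math.floor(pos), math.ceil(pos)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def filter_umis(rows, min_reads_per_umi=-1):
    """Keep UMIs whose read count reaches the threshold."""
    if min_reads_per_umi < 0:
        # Dynamic threshold: (99th percentile of read counts) // 10
        read_counts = [r["readCount"] for r in rows]
        min_reads_per_umi = percentile(read_counts, 99) // 10
    return [r for r in rows if r["readCount"] >= min_reads_per_umi]


def filter_cells(rows, min_umi_per_cell=10, min_avg_reads_per_umi=2.0):
    """Keep cells that meet both the UMI-count and the coverage minimum."""
    reads_by_cell = defaultdict(list)
    for r in rows:
        reads_by_cell[r["cellBC"]].append(r["readCount"])
    passing = {
        cell
        for cell, reads in reads_by_cell.items()
        if len(reads) >= min_umi_per_cell
        and sum(reads) / len(reads) >= min_avg_reads_per_umi
    }
    return [r for r in rows if r["cellBC"] in passing]


# Toy molecule table: one row per cellBC-UMI pair (hypothetical values).
rows = [
    {"cellBC": "A", "UMI": "u1", "readCount": 40},
    {"cellBC": "A", "UMI": "u2", "readCount": 50},
    {"cellBC": "A", "UMI": "u3", "readCount": 1},
    {"cellBC": "B", "UMI": "u4", "readCount": 60},
]
kept_umis = filter_umis(rows)  # negative default triggers the dynamic threshold
kept_cells = filter_cells(kept_umis, min_umi_per_cell=2)
```

In this toy table the low-coverage UMI `u3` is removed first, after which cell `B` no longer has enough UMIs and is dropped as well.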
- Parameters:
- input_df
DataFrame
A molecule table, i.e. cellBC-UMI pairs. Note that each cellBC should contain only one instance of each UMI.
- output_directory
str
The path of the output directory in which to store plots
- min_umi_per_cell
int
(default: 10) The threshold specifying the minimum number of UMIs in a cell needed to be retained during filtering
- min_avg_reads_per_umi
float
(default: 2.0) The threshold specifying the minimum coverage (i.e. average reads per UMI) that a cell needs in order to be retained during filtering
- min_reads_per_umi
int
(default: -1) The threshold specifying the minimum read count needed for a UMI to be retained during filtering. Set dynamically if value is < 0.
- intbc_prop_thresh
float
(default: 0.5) The threshold specifying the maximum proportion of the total UMI counts for an intBC to be corrected to another
- intbc_umi_thresh
int
(default: 10) The threshold specifying the maximum UMI count an intBC may have in order to be corrected to another
- intbc_dist_thresh
int
(default: 1) The threshold specifying the maximum Levenshtein distance between sequences for an intBC to be corrected to another
- doublet_threshold
float
(default: 0.35) The threshold specifying the maximum proportion of conflicting allele information allowed for an intBC to be retained in doublet filtering. Set to None to skip doublet filtering.
- allow_allele_conflicts
bool
(default: False) Whether or not to allow multiple alleles to be assigned to each cellBC-intBC pair. For fully single-cell data, this option should be set to False, since each cell is expected to have a single allele state for each intBC. However, this option should be set to True for chemistries that may result in multiple physical cells being captured for each barcode.
- plot
bool
(default: False) Indicates whether to plot the change in intBC and cellBC counts across filtering stages
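The three intBC thresholds work together to decide whether a low-count intBC is folded into another one. The predicate below is a rough sketch of that decision, not the library's code: the function names, the tuple layout, the definition of the proportion as `minor / (minor + major)`, and the inclusive comparisons are all assumptions.

```python
def levenshtein(a, b):
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,  # deletion
                    curr[j - 1] + 1,  # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free on a match)
                )
            )
        prev = curr
    return prev[-1]


def should_correct(
    minor,
    major,
    intbc_prop_thresh=0.5,
    intbc_umi_thresh=10,
    intbc_dist_thresh=1,
):
    """Decide whether `minor` should be corrected to `major`.

    Each argument is a hypothetical (intBC sequence, allele, UMI count) tuple.
    """
    minor_seq, minor_allele, minor_umis = minor
    major_seq, major_allele, major_umis = major
    # Assumed definition: the minor intBC's share of the combined UMI count.
    proportion = minor_umis / (minor_umis + major_umis)
    return (
        minor_allele == major_allele
        and minor_umis <= intbc_umi_thresh
        and proportion <= intbc_prop_thresh
        and levenshtein(minor_seq, major_seq) <= intbc_dist_thresh
    )


# Hypothetical intBCs within one cell.
major = ("ACGTACGTACGTAC", "alleleX", 40)
close_minor = ("ACGTACGTACGTAA", "alleleX", 3)  # one substitution away
far_minor = ("ACGTACGTACGTTT", "alleleX", 3)  # two substitutions away
other_allele = ("ACGTACGTACGTAA", "alleleY", 3)

corrected = should_correct(close_minor, major)
too_far = should_correct(far_minor, major)
allele_mismatch = should_correct(other_allele, major)
```

Only `close_minor` qualifies here: it shares the allele, has few UMIs, and differs by a single base.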
- Return type:
DataFrame
- Returns:
A filtered and corrected allele table of cellBC-UMI-allele groups
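
A call using the documented defaults might look like the following sketch. The `filter_kwargs` dictionary and the `run_filter` wrapper are hypothetical names introduced for illustration; the import is deferred so the snippet loads even where Cassiopeia is not installed.

```python
# Keyword arguments mirroring the defaults in the signature above.
filter_kwargs = {
    "min_umi_per_cell": 10,
    "min_avg_reads_per_umi": 2.0,
    "min_reads_per_umi": -1,  # negative: threshold is set dynamically
    "intbc_prop_thresh": 0.5,
    "intbc_umi_thresh": 10,
    "intbc_dist_thresh": 1,
    "doublet_threshold": 0.35,  # None skips doublet filtering
    "allow_allele_conflicts": False,
    "plot": False,
}


def run_filter(molecule_table, output_directory):
    """Filter a molecule table (a DataFrame) with the settings above."""
    import cassiopeia  # deferred import; requires Cassiopeia to be installed

    return cassiopeia.pp.filter_molecule_table(
        molecule_table, output_directory, **filter_kwargs
    )
```

Changing a single entry, for example setting `allow_allele_conflicts` to True for chemistries that capture multiple cells per barcode, leaves the rest of the call untouched.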