cassiopeia.pp.filter_molecule_table
- cassiopeia.pp.filter_molecule_table(input_df, output_directory, min_umi_per_cell=10, min_avg_reads_per_umi=2.0, umi_read_thresh=-1, intbc_prop_thresh=0.5, intbc_umi_thresh=10, intbc_dist_thresh=1, doublet_threshold=0.35, allow_allele_conflicts=False, plot=False)
Filters and corrects a molecule table of cellBC-UMI pairs.
- Performs the following steps on the alignments in a DataFrame:
Filters out cellBCs with fewer than min_umi_per_cell unique UMIs
Filters out UMIs with a read count below umi_read_thresh
Error-corrects intBCs by reassigning intBCs with low UMI counts to intBCs that have the same allele and a close sequence
Filters out cellBCs that contain too much conflicting allele information as intra-lineage doublets
Chooses one allele for each cellBC-intBC pair by selecting the most common one
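The cell-level steps above can be sketched in plain Python. This is an illustrative sketch only, not Cassiopeia's implementation: the row fields (`cellBC`, `UMI`, `intBC`, `allele`, `readCount`), the helper names, the inclusive threshold comparisons, and the way the conflict fraction is computed are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Toy molecule table: one row per cellBC-UMI pair (field names are illustrative).
rows = [
    {"cellBC": "A", "UMI": "u1", "intBC": "X", "allele": "a1", "readCount": 5},
    {"cellBC": "A", "UMI": "u2", "intBC": "X", "allele": "a1", "readCount": 4},
    {"cellBC": "A", "UMI": "u3", "intBC": "X", "allele": "a2", "readCount": 3},
    {"cellBC": "B", "UMI": "u4", "intBC": "X", "allele": "a1", "readCount": 9},
]


def filter_cells(rows, min_umi_per_cell, min_avg_reads_per_umi):
    """Keep cells with enough unique UMIs and enough mean reads per UMI.

    Assumption: a cell is kept when it meets or exceeds both thresholds.
    """
    umis = defaultdict(set)
    reads = Counter()
    for r in rows:
        umis[r["cellBC"]].add(r["UMI"])
        reads[r["cellBC"]] += r["readCount"]
    keep = {
        cell
        for cell in umis
        if len(umis[cell]) >= min_umi_per_cell
        and reads[cell] / len(umis[cell]) >= min_avg_reads_per_umi
    }
    return [r for r in rows if r["cellBC"] in keep]


def conflict_fraction(rows, cell):
    """Fraction of a cell's UMIs that disagree with the top allele of their intBC.

    Assumption: this is one plausible way to quantify conflicting allele
    information; a cell would be flagged when the value exceeds the threshold.
    """
    counts = defaultdict(Counter)
    for r in rows:
        if r["cellBC"] == cell:
            counts[r["intBC"]][r["allele"]] += 1
    total = sum(sum(c.values()) for c in counts.values())
    conflicting = sum(sum(c.values()) - max(c.values()) for c in counts.values())
    return conflicting / total if total else 0.0


def most_common_alleles(rows):
    """Map each (cellBC, intBC) pair to the allele supported by the most UMIs."""
    counts = defaultdict(Counter)
    for r in rows:
        counts[(r["cellBC"], r["intBC"])][r["allele"]] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}


kept = filter_cells(rows, min_umi_per_cell=2, min_avg_reads_per_umi=2.0)
fraction_a = conflict_fraction(kept, "A")
chosen = most_common_alleles(kept)
```

In this toy table, cell B has a single UMI and is dropped by the UMI-count filter, while cell A passes and its one cellBC-intBC pair resolves to the allele seen in two of its three UMIs.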
- Parameters
- input_df :
DataFrame A molecule table, i.e. cellBC-UMI pairs. Note that each cellBC should only contain one instance of each UMI
- output_directory :
str The output directory path to store plots
- min_umi_per_cell :
int (default: 10) The threshold specifying the minimum number of UMIs a cell needs in order to be retained during filtering
- min_avg_reads_per_umi :
float (default: 2.0) The threshold specifying the minimum coverage (i.e. average reads per UMI) a cell needs in order to be retained during filtering
- umi_read_thresh :
int (default: -1) The threshold specifying the minimum read count needed for a UMI to be retained during filtering. Set dynamically if the value is < 0
- intbc_prop_thresh :
float (default: 0.5) The threshold specifying the maximum proportion of the total UMI counts for an intBC to be corrected to another
- intbc_umi_thresh :
int (default: 10) The threshold specifying the maximum UMI count an intBC may have in order to be corrected to another
- intbc_dist_thresh :
int (default: 1) The threshold specifying the maximum Levenshtein distance between sequences for an intBC to be corrected to another
- doublet_threshold :
float (default: 0.35) The threshold specifying the maximum proportion of conflicting allele information allowed for an intBC to be retained in doublet filtering. Set to None to skip doublet filtering
- allow_allele_conflicts :
bool (default: False) Whether or not to allow multiple alleles to be assigned to each cellBC-intBC pair. For fully single-cell data, this option should be set to False, since each cell is expected to have a single allele state for each intBC. However, this option should be set to True for chemistries that may result in multiple physical cells being captured for each barcode.
- plot :
bool (default: False) Indicates whether to plot the change in intBC and cellBC counts across filtering stages
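To make the three intBC correction thresholds concrete, the following sketch shows how they might combine into a single decision. It is a hypothetical rule written for illustration, not Cassiopeia's code: the helper names `levenshtein` and `should_correct` are invented here, both intBCs are assumed to come from the same cell and carry the same allele, and the proportion is assumed to be taken over the combined UMI count of the pair.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,               # deletion
                    curr[j - 1] + 1,           # insertion
                    prev[j - 1] + (ca != cb),  # substitution (free if equal)
                )
            )
        prev = curr
    return prev[-1]


def should_correct(minor, major, umi_counts,
                   prop_thresh=0.5, umi_thresh=10, dist_thresh=1):
    """Hypothetical rule: may the intBC `minor` be merged into `major`?

    Assumption: all three thresholds are inclusive upper bounds and the
    proportion is n_minor / (n_minor + n_major).
    """
    n_minor, n_major = umi_counts[minor], umi_counts[major]
    proportion = n_minor / (n_minor + n_major)
    return (
        levenshtein(minor, major) <= dist_thresh
        and n_minor <= umi_thresh
        and proportion <= prop_thresh
    )


# Toy UMI counts per intBC within one cell (sequences are made up).
umi_counts = {"ACGTACGT": 40, "ACGTACGA": 3, "TTTTACGT": 2}

near = should_correct("ACGTACGA", "ACGTACGT", umi_counts)
far = should_correct("TTTTACGT", "ACGTACGT", umi_counts)
```

Here the first candidate differs from the abundant intBC by one substitution and has few UMIs, so it qualifies for correction under these assumptions; the second differs at several positions and does not.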
- Return type
DataFrame
- Returns
A filtered and corrected allele table of cellBC-UMI-allele groups