cassiopeia.pp.convert_alleletable_to_character_matrix

cassiopeia.pp.convert_alleletable_to_character_matrix(alleletable, ignore_intbcs=[], allele_rep_thresh=1.0, missing_data_allele=None, missing_data_state=-1, mutation_priors=None, cut_sites=None, collapse_duplicates=True)

Converts an AlleleTable into a character matrix.

Given an AlleleTable storing the observed mutations for each intBC / cellBC combination, create a character matrix for input into a CassiopeiaSolver object. By default, uncut sites are encoded as ‘0’ and missing data items as ‘-1’. The function can also ignore specified intBCs, as well as cut sites with too little diversity.
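The encoding described above can be pictured with a small standard-library sketch. It is not the Cassiopeia implementation: the toy rows, the `encode` helper, the rule that an allele containing "None" counts as uncut, and the numbering of states by first appearance are all assumptions made for illustration.

```python
from collections import defaultdict

# Hypothetical toy allele table: one row per cellBC / intBC pair, with one
# column per cut site ("r1", "r2"). All values are made up for illustration.
rows = [
    {"cellBC": "cellA", "intBC": "X", "r1": "None", "r2": "5:2D"},
    {"cellBC": "cellB", "intBC": "X", "r1": "3:1I", "r2": "5:2D"},
    {"cellBC": "cellC", "intBC": "X", "r1": "3:1I", "r2": "None"},
    {"cellBC": "cellA", "intBC": "Y", "r1": "7:4D", "r2": "None"},
]


def encode(rows, cut_sites=("r1", "r2"), missing_state=-1):
    """Toy encoding: uncut -> 0, each distinct indel -> 1, 2, ..., absent -> missing."""
    cells = sorted({r["cellBC"] for r in rows})

    # One character per (intBC, cut site), in order of first appearance.
    characters = []
    for r in rows:
        for site in cut_sites:
            if (r["intBC"], site) not in characters:
                characters.append((r["intBC"], site))

    observed = {(r["cellBC"], r["intBC"], s): r[s] for r in rows for s in cut_sites}

    indel_to_state = defaultdict(dict)  # character index -> {indel: state}
    matrix = {}
    for cell in cells:
        matrix[cell] = []
        for idx, (intbc, site) in enumerate(characters):
            allele = observed.get((cell, intbc, site))
            if allele is None:
                # The cell has no observation for this intBC: missing data.
                matrix[cell].append(missing_state)
            elif "none" in allele.lower():
                # Assumption: an allele containing "None" means the site is uncut.
                matrix[cell].append(0)
            else:
                states = indel_to_state[idx]
                if allele not in states:
                    states[allele] = len(states) + 1
                matrix[cell].append(states[allele])

    # Invert the per-character lookup so states map back to the original indel.
    state_to_indel = {
        idx: {state: indel for indel, state in states.items()}
        for idx, states in indel_to_state.items()
    }
    return matrix, state_to_indel


matrix, state_to_indel = encode(rows)
for cell, states in matrix.items():
    print(cell, states)
```

Each row of the toy matrix is a cell and each column a character; cells that never observed intBC "Y" receive the missing state for both of its cut sites.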

Parameters:
alleletable DataFrame

Allele Table to be converted into a character matrix

ignore_intbcs List[str] (default: [])

A set of intBCs to ignore

allele_rep_thresh float (default: 1.0)

Proportion threshold for removing low-diversity target sites: a site is removed when a single allele is represented at this proportion

missing_data_allele Optional[str] (default: None)

Value in the allele table indicating that the cut site is missing. This value is converted into missing_data_state.

missing_data_state int (default: -1)

A state to use for missing data.

mutation_priors Optional[DataFrame] (default: None)

A table storing the prior probability of a mutation occurring. This table is used to create a character matrix-specific probability dictionary for reconstruction.

cut_sites Optional[List[str]] (default: None)

Columns in the AlleleTable to treat as cut sites. If None, cut sites are assumed to be the columns named in the form “r{int}” (e.g. “r1”).

collapse_duplicates bool (default: True)

Whether or not to collapse duplicate character states present for a single cellBC-intBC pair. This option has no effect if there are no allele conflicts.
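To make the role of allele_rep_thresh more concrete, here is a minimal sketch of a diversity filter. It is an illustration only: the `low_diversity_sites` helper and its input layout are hypothetical, and the strict "greater than" comparison is an assumption rather than documented behaviour.

```python
from collections import Counter


def low_diversity_sites(alleles_by_site, allele_rep_thresh=1.0):
    """Return sites where the most common allele's share exceeds the threshold."""
    flagged = []
    for site, alleles in alleles_by_site.items():
        if not alleles:
            continue  # nothing observed at this site, nothing to measure
        top_count = Counter(alleles).most_common(1)[0][1]
        # Assumption: strict comparison, so a threshold of 1.0 removes nothing.
        if top_count / len(alleles) > allele_rep_thresh:
            flagged.append(site)
    return flagged


# Hypothetical observations per target site.
observations = {
    "r1": ["3:1I", "3:1I", "3:1I", "None"],  # one allele covers 75% of cells
    "r2": ["5:2D", "6:1I", "None", "7:4D"],  # four distinct alleles
}

flagged_strict = low_diversity_sites(observations, allele_rep_thresh=0.7)
flagged_default = low_diversity_sites(observations)
print(flagged_strict, flagged_default)
```

Under these assumptions the default of 1.0 keeps every site, and lowering the threshold removes sites dominated by one allele, which carry little information for reconstruction.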

Return type:

Tuple[DataFrame, Dict[int, Dict[int, float]], Dict[int, Dict[int, str]]]

Returns:

A character matrix, a probability dictionary, and a dictionary mapping states to the original mutation.
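The relationship between the last two return values can be sketched as follows. This is an assumption-laden illustration of the nested dictionary shapes given in the return type, not the library's code: `priors_by_state` is a hypothetical helper, the indel strings and probabilities are invented, and skipping indels that have no prior is a choice made for this sketch.

```python
def priors_by_state(state_to_indel, indel_priors):
    """Re-key indel-level priors as {character index: {state: probability}}."""
    priors = {}
    for character, states in state_to_indel.items():
        priors[character] = {
            state: indel_priors[indel]
            for state, indel in states.items()
            if indel in indel_priors  # assumption: indels without a prior are skipped
        }
    return priors


# Hypothetical mapping of states to original mutations, per character.
state_to_indel = {0: {1: "3:1I", 2: "4:2D"}, 1: {1: "5:2D"}}
# Hypothetical prior probability of each mutation occurring.
indel_priors = {"3:1I": 0.2, "5:2D": 0.05}

priors = priors_by_state(state_to_indel, indel_priors)
print(priors)
```

Because states are integers local to each character, the same indel can carry a different state number in different columns, which is why both dictionaries are keyed first by character index.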