cassiopeia.solver.PercolationSolver#

class cassiopeia.solver.PercolationSolver(joining_solver, prior_transformation='negative_log', similarity_function=<function hamming_similarity_without_missing>, threshold=0)[source]#

A top-down percolatin-based CassiopeiaSolver.

The PercolationSolver implements a top-down algorithm that recursively partitions the sample set based on similarity in the observed mutations. It is an implicit version of Aho’s algorithm for tree discovery (1981). At each recursive step, the similarities of each sample pair are embedded as edges in a graph with weight equal to the similarity between the nodes. The graph is then percolated by removing the minimum edges until multiple connected components are produced. Ultimately, this groups clusters of samples that share strong similarity amongst themselves. If more than two connected components are produced by the percolation procedure, then the components are clustered by applying the specified solver to the LCAs of the clusters, obeying parsimony TODO(richardyz98): Experiment to find the best default similarity function :type joining_solver: CassiopeiaSolver :param joining_solver: The CassiopeiaSolver that is used to cluster groups of

samples in the case that the percolation procedure generates more than two groups of samples in the partition

Parameters:

prior_transformation str (default: 'negative_log'): A function defining a transformation on the priors in forming weights to scale the contribution of each mutation in the similarity graph
similarity_function (array, array, int, {int: {int: float}} | None) → float | NoneOptional[Callable[[array, array, int, Optional[Dict[int, Dict[int, float]]]], float]] (default: <function hamming_similarity_without_missing at 0x7ff0910b21f0>): A function that calculates a similarity score between two given samples and their observed mutations. The default is “hamming_distance_without_missing”
threshold int | NoneOptional[int] (default: 0): The minimum similarity threshold. A similarity threshold of 1 for example means that only samples with similarities above 1 will have an edge between them in the graph. Acts as a hyperparameter that controls the sparsity of the graph by filtering low similarities.

joining_solver#: The CassiopeiaSolver that is used to cluster groups of samples after percolation steps that produce more than two groups

prior_transformation#: Function to use when transforming priors into weights.

similarity_function#: A function that calculates a similarity score between two given samples and their observed mutations

threshold#: A minimum similarity threshold

Methods

solve(cassiopeia_tree, layer=None, collapse_mutationless_edges=False, logfile='stdout.log')[source]#

Implements a solving procedure for the Percolation Algorithm. The procedure recursively splits a set of samples to build a tree. At each partition of the samples produced by the percolation procedure, an ancestral node is created and each side of the partition is placed as a daughter clade of that node. This continues until each side of the partition is comprised only of single samples. If an algorithm cannot produce a split on a set of samples, then those samples are placed as sister nodes and the procedure terminates, generating a polytomy in the tree. This function will populate a tree inside the input CassiopeiaTree.

Parameters:

cassiopeia_tree CassiopeiaTree: CassiopeiaTree storing a character matrix and priors.
layer str | NoneOptional[str] (default: None): Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree.
collapse_mutationless_edges bool (default: False): Indicates if the final reconstructed tree should collapse mutationless edges based on internal states inferred by Camin-Sokal parsimony. In scoring accuracy, this removes artifacts caused by arbitrarily resolving polytomies.
logfile str (default: 'stdout.log'): Location to write standard out. Not currently used.

percolate(character_matrix, samples, priors=None, weights=None, missing_state_indicator=-1)[source]#

The function used by the percolation algorithm to partition the set of samples in two. First, a pairwise similarity graph is generated with samples as nodes such that edges between a pair of nodes is some provided function on the number of character/state mutations shared. Then, the algorithm removes the minimum edge (in the case of ties all are removed) until the graph is split into multiple connected components. If there are more than two connected components, the procedure joins them until two remain. This is done by inferring the mutations of the LCA of each sample set obeying Camin-Sokal Parsimony, and then clustering the groups of samples based on their LCAs. The provided solver is used to cluster the groups into two clusters. :type character_matrix: DataFrame :param character_matrix: Character matrix :type samples: List[str] :param samples: A list of samples to partition :type priors: {int: {int: float}} | NoneOptional[Dict[int, Dict[int, float]]] (default: None) :param priors: A dictionary storing the probability of each character

mutating to a particular state.

Parameters:

weights {int: {int: float}} | NoneOptional[Dict[int, Dict[int, float]]] (default: None): Weighting of each (character, state) pair. Typically a transformation of the priors.
missing_state_indicator int (default: -1): Character representing missing data.

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of lists, representing the left and right partition groups