cassiopeia.solver.SharedMutationJoiningSolver

class cassiopeia.solver.SharedMutationJoiningSolver(similarity_function=<function hamming_similarity_without_missing>, prior_transformation='negative_log')

Shared-Mutation-Joining class for Cassiopeia.

Implements an iterative, bottom-up agglomerative clustering procedure. The algorithm iteratively clusters the samples in the sample pool by the number of shared mutations that they have in their character information. The algorithm has theoretical guarantees on correctness given a sufficiently large number of characters and bounds on edge lengths in the tree generative process.

TODO(mgjones, rzhang): Make the solver work with similarity maps as flattened arrays

Parameters:
similarity_function Optional[Callable[[array, array, int, Optional[Dict[int, Dict[int, float]]]], float]] (default: <function hamming_similarity_without_missing>)

Function that can be used to compute the similarity between samples.

prior_transformation str (default: 'negative_log')

Function to use when transforming priors into weights. Supports the following transformations:

"negative_log": Transforms each probability p by taking the negative log of p (default)

"inverse": Transforms each probability p by taking 1/p

"square_root_inverse": Transforms each probability p by taking the square root of 1/p
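As an illustration of the three transformations listed above, the following is a minimal sketch and not Cassiopeia's implementation. The helper name `transform_prior` is hypothetical, and it assumes each prior p lies in (0, 1].

```python
import math


def transform_prior(p, prior_transformation="negative_log"):
    """Hypothetical helper: map a prior probability p in (0, 1] to a weight."""
    if prior_transformation == "negative_log":
        return -math.log(p)
    if prior_transformation == "inverse":
        return 1.0 / p
    if prior_transformation == "square_root_inverse":
        return math.sqrt(1.0 / p)
    raise ValueError(f"Unsupported transformation: {prior_transformation}")


# Under all three transformations, rarer states (smaller p) get larger weights.
weights = {name: transform_prior(0.25, name)
           for name in ("negative_log", "inverse", "square_root_inverse")}
```

All three options are decreasing in p, so a shared mutation with a low prior probability contributes more to the similarity than a common one.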

similarity_function

Function used to compute similarity between samples.

prior_transformation

Function to use when transforming priors into weights.

Methods

solve(cassiopeia_tree, layer=None, collapse_mutationless_edges=False, logfile='stdout.log')

Solves a tree for the SharedMutationJoiningSolver.

The solver routine calculates an n x n similarity matrix of all pairwise sample similarities based on a provided similarity function on the character vectors. The general solver routine proceeds by iteratively finding pairs of samples to join together into a “cherry” until all samples are joined. At each iterative step, the two samples with the most shared character/state mutations are joined. Then, an LCA node with a character vector containing only the mutations shared by the joined samples is added to the sample pool, and the similarity matrix is updated with respect to the new LCA node. The function will update the tree attribute of the input CassiopeiaTree.
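The joining loop described above can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's code: samples are a dict of character vectors, state 0 means unedited, -1 means missing, similarity is an unweighted count of shared mutations, and the names `shared_mutations`, `lca_vector`, and `shared_mutation_joining` are hypothetical. For brevity it recomputes similarities each round instead of maintaining a similarity matrix.

```python
from itertools import combinations


def shared_mutations(a, b, missing=-1):
    """Count characters where both vectors carry the same mutated state."""
    return sum(1 for x, y in zip(a, b) if x == y and x != 0 and x != missing)


def lca_vector(a, b, missing=-1):
    """Keep only states shared by both vectors; reset all others to 0."""
    return [x if (x == y and x != missing) else 0 for x, y in zip(a, b)]


def shared_mutation_joining(samples):
    """Join the most similar pair repeatedly; return (parent, child) edges."""
    pool = dict(samples)
    edges = []
    next_id = 0
    while len(pool) > 1:
        # Pick the pair in the pool with the most shared mutations.
        a, b = max(
            combinations(pool, 2),
            key=lambda pair: shared_mutations(pool[pair[0]], pool[pair[1]]),
        )
        parent = f"internal_{next_id}"
        next_id += 1
        edges += [(parent, a), (parent, b)]
        # Replace the joined pair with their LCA in the sample pool.
        pool[parent] = lca_vector(pool.pop(a), pool.pop(b))
    return edges


samples = {"c1": [1, 2, 0], "c2": [1, 2, 3], "c3": [1, 0, 0]}
edges = shared_mutation_joining(samples)
```

In this toy input, c1 and c2 share two mutations and are joined first; their LCA then joins c3 at the root.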

Parameters:
cassiopeia_tree CassiopeiaTree

CassiopeiaTree object to be populated

layer Optional[str] (default: None)

Layer storing the character matrix for solving. If None, the default character matrix is used in the CassiopeiaTree.

collapse_mutationless_edges bool (default: False)

Indicates if the final reconstructed tree should collapse mutationless edges based on internal states inferred by Camin-Sokal parsimony. In scoring accuracy, this removes artifacts caused by arbitrarily resolving polytomies.

logfile str (default: 'stdout.log')

Location to write standard out. Not currently used.

Return type:

None

find_cherry(similarity_matrix)

Finds a pair of samples to join into a cherry.

Finds the pair of samples with the highest pairwise similarity to join.

Parameters:
similarity_matrix array

A sample x sample similarity matrix

Return type:

Tuple[int, int]

Returns:

A tuple of integers representing rows in the similarity matrix to join.
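A minimal sketch of this selection step follows, assuming the similarity matrix is symmetric and is given here as a nested list standing in for the array. The name `find_cherry_sketch` is hypothetical, and tie-breaking in the actual method may differ.

```python
def find_cherry_sketch(similarity_matrix):
    """Return the (row, column) pair with the highest off-diagonal similarity."""
    n = len(similarity_matrix)
    best, best_pair = float("-inf"), None
    for i in range(n):
        # Only scan the upper triangle: the diagonal is ignored and the
        # matrix is assumed symmetric.
        for j in range(i + 1, n):
            if similarity_matrix[i][j] > best:
                best, best_pair = similarity_matrix[i][j], (i, j)
    return best_pair


matrix = [
    [0, 1, 3],
    [1, 0, 2],
    [3, 2, 0],
]
cherry = find_cherry_sketch(matrix)
```

Skipping the diagonal matters because a sample's similarity to itself would otherwise dominate the maximum. With fewer than two samples the sketch returns None.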

update_similarity_map_and_character_matrix(character_matrix, similarity_function, similarity_map, cherry, new_node, missing_state_indicator=-1, weights=None)

Update similarity map after finding a cherry.

Adds the new LCA node into the character matrix with the mutations shared by the joined nodes as its character vector. Then, updates the similarity matrix by calculating the pairwise similarity between the new LCA node and all existing nodes.

Parameters:
character_matrix DataFrame

Contains the character information for all nodes, updated as nodes are joined and new internal LCA nodes are added

similarity_function Callable[[array, array, int, Dict[int, Dict[int, float]]], float]

A similarity function

similarity_map DataFrame

A similarity map to update

cherry Tuple[str, str]

A tuple of indices in the similarity map that are being joined

new_node str

New node name, to be added to the updated similarity map

missing_state_indicator int (default: -1)

Character representing missing data

weights (default: None)

Weighting of each (character, state) pair. Typically a transformation of the priors.

Return type:

DataFrame

Returns:

A new similarity map, updated with the new node
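The update step can be illustrated with plain dictionaries in place of DataFrames. This is a sketch under assumptions, not the method's implementation: `update_sketch` and `count_shared` are hypothetical names, the similarity function takes only two vectors, weights are omitted, and the sketch returns new objects for both structures instead of modifying the character matrix in place.

```python
def count_shared(a, b, missing=-1):
    """Unweighted count of mutated states shared by two character vectors."""
    return sum(1 for x, y in zip(a, b) if x == y and x != 0 and x != missing)


def update_sketch(characters, similarity, cherry, new_node,
                  similarity_function=count_shared, missing=-1):
    """Replace the joined pair with their LCA and refresh the similarity map."""
    a, b = cherry
    # The LCA keeps only the states shared by both joined nodes.
    lca = [x if (x == y and x != missing) else 0
           for x, y in zip(characters[a], characters[b])]
    # Drop the joined nodes from both structures.
    remaining = {k: v for k, v in characters.items() if k not in cherry}
    updated = {
        u: {v: s for v, s in row.items() if v not in cherry}
        for u, row in similarity.items() if u not in cherry
    }
    # Score the new LCA node against every node still in the pool.
    updated[new_node] = {new_node: 0}
    for name, vec in remaining.items():
        score = similarity_function(lca, vec)
        updated[name][new_node] = score
        updated[new_node][name] = score
    remaining[new_node] = lca
    return remaining, updated


characters = {"c1": [1, 2, 0], "c2": [1, 2, 3], "c3": [1, 0, 0]}
similarity = {u: {v: count_shared(characters[u], characters[v])
                  for v in characters} for u in characters}
new_characters, new_similarity = update_sketch(
    characters, similarity, ("c1", "c2"), "n0")
```

Only the row and column for the new node are computed; similarities between nodes that were not joined are carried over unchanged.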