cassiopeia.solver.SpectralSolver#

class cassiopeia.solver.SpectralSolver(similarity_function=<function hamming_similarity_without_missing>, threshold=0, prior_transformation='negative_log')[source]#

A spectral-based CassiopeiaSolver.

The SpectralSolver implements a top-down algorithm that recursively partitions the sample set based on similarity. At each recursive step, a similarity graph is generated for the sample set, where edges represent the number of shared mutations between nodes. Then a partition is generated by finding the minimum weight cut over the graph, normalized over the sum of edges within each side of the partition. The cut is minimized in order to preserve similarities. The final partition is then improved upon by a greedy hill-climbing procedure that further optimizes the cut.

TODO: Experiment to find the best default similarity function
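The default similarity function counts mutations shared by two samples while ignoring unmutated and missing sites. The sketch below illustrates that idea; it is not the library’s exact implementation, and the `missing_state` default of -1 mirrors the `missing_state_indicator` convention used elsewhere in this class:

```python
import numpy as np

def hamming_similarity_without_missing(s1, s2, missing_state=-1):
    """Count sites where both samples carry the same mutation
    (same non-zero state), ignoring unmutated (0) and missing sites."""
    s1, s2 = np.asarray(s1), np.asarray(s2)
    shared = (s1 == s2) & (s1 != 0) & (s1 != missing_state)
    return int(shared.sum())

# Two samples over five characters; state 0 = unmutated, -1 = missing
a = [1, 2, 0, -1, 3]
b = [1, 2, 2, 4, -1]
print(hamming_similarity_without_missing(a, b))  # shared mutations at sites 0 and 1 -> 2
```

With `threshold=0`, any pair with a positive similarity under this function would receive an edge in the similarity graph.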

Parameters:
similarity_function Optional[Callable[[int, int, DataFrame, int, Optional[Dict[int, Dict[str, float]]]], float]] (default: hamming_similarity_without_missing)

A function that calculates a similarity score between two given samples and their observed mutations. The default is “hamming_similarity_without_missing”

threshold Optional[int] (default: 0)

A minimum similarity threshold to include an edge in the similarity graph

prior_transformation str (default: 'negative_log')

A function defining a transformation on the priors in forming weights to scale the contribution of each mutation in the similarity graph. One of the following:

“negative_log”: Transforms each probability p by taking the negative log (default)

“inverse”: Transforms each probability p by taking 1/p

“square_root_inverse”: Transforms each probability p by taking the square root of 1/p
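The three transformations can be sketched as plain functions over a priors dictionary. `transform_priors` below is a hypothetical helper written for illustration, not a function exported by the library:

```python
import math

def transform_priors(priors, prior_transformation="negative_log"):
    """Map each prior probability p to a weight, mirroring the three
    documented options. Illustrative sketch, not the library function."""
    transforms = {
        "negative_log": lambda p: -math.log(p),
        "inverse": lambda p: 1.0 / p,
        "square_root_inverse": lambda p: math.sqrt(1.0 / p),
    }
    fn = transforms[prior_transformation]
    return {char: {state: fn(p) for state, p in states.items()}
            for char, states in priors.items()}

# Character 0 has two possible states with unequal priors
priors = {0: {"A": 0.25, "B": 0.75}}
print(transform_priors(priors))                 # rarer states get larger weights
print(transform_priors(priors, "inverse"))
```

All three options are decreasing in p, so rarer (more informative) mutations contribute more weight to the similarity graph.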

similarity_function#

A function that calculates a similarity score between two given samples and their observed mutations

threshold#

A minimum similarity threshold

prior_transformation#

Function to use when transforming priors into weights.

Methods

perform_split(character_matrix, samples, weights=None, missing_state_indicator=-1)[source]#

Partitions the samples using the spectral algorithm.

First, a similarity graph is generated with samples as nodes, where the edge between a pair of nodes is weighted by a provided function of the number of character/state mutations the pair shares. Then, Fiedler’s algorithm is used to generate a partition on this graph that minimizes a modified normalized cut: weight of edges across the cut / min(weight of edges within each side of the cut). It does this efficiently by first calculating the 2nd eigenvector of the normalized Laplacian of the similarity matrix. It then orders the nodes by their eigenvector values and finds an index such that partitioning the ordered nodes at that index minimizes the normalized cut ratio. As the optimal partition can be determined using the 2nd eigenvector, this greatly reduces the space of cuts that must be explored.
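The eigenvector-ordering procedure described above can be sketched with NumPy. `spectral_partition` is a hypothetical illustration of the Fiedler-vector cut search, omitting the greedy hill-climbing refinement the solver applies afterward:

```python
import numpy as np

def spectral_partition(similarity, eps=1e-12):
    """Partition node indices by the 2nd eigenvector of the normalized
    Laplacian, then scan cut points along the eigenvector ordering to
    minimize: cut weight / min(weight of edges within each side)."""
    W = np.asarray(similarity, dtype=float)
    d = W.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d + eps))
    L_sym = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    _, eigvecs = np.linalg.eigh(L_sym)      # eigenvalues in ascending order
    fiedler = eigvecs[:, 1]                 # 2nd-smallest eigenvector
    order = np.argsort(fiedler)
    best_ratio, best_k = np.inf, 1
    for k in range(1, len(order)):          # scan cut points along the ordering
        left, right = order[:k], order[k:]
        cut = W[np.ix_(left, right)].sum()
        within_left = W[np.ix_(left, left)].sum()
        within_right = W[np.ix_(right, right)].sum()
        ratio = cut / (min(within_left, within_right) + eps)
        if ratio < best_ratio:
            best_ratio, best_k = ratio, k
    return sorted(order[:best_k].tolist()), sorted(order[best_k:].tolist())

# Two tight blocks of similar samples joined by one weak edge
S = np.array([[0, 5, 5, 0, 0, 0],
              [5, 0, 5, 0, 0, 0],
              [5, 5, 0, 0, 0, 0],
              [0, 0, 0, 0, 5, 5],
              [0, 0, 0, 5, 0, 5],
              [0, 0, 0, 5, 5, 0]], dtype=float)
S[2, 3] = S[3, 2] = 1.0                     # weak bridge between blocks
left, right = spectral_partition(S)
print(left, right)
```

Because the optimal cut lies along the Fiedler-vector ordering, only n - 1 candidate cuts are examined rather than all 2^n possible bipartitions.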

Parameters:
character_matrix DataFrame

Character matrix

samples List[int]

A list of samples to partition

weights Optional[Dict[int, Dict[int, float]]] (default: None)

Weighting of each (character, state) pair. Typically a transformation of the priors.

missing_state_indicator int (default: -1)

Character representing missing data.

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of lists, representing the left and right partition groups