cassiopeia.solver.MaxCutSolver#

class cassiopeia.solver.MaxCutSolver(sdimension=3, iterations=50, prior_transformation='negative_log')[source]#

A MaxCut graph-based CassiopeiaSolver.

The MaxCutSolver implements a top-down algorithm that recursively partitions the sample set based on connectivity. At each recursive step, a connectivity graph is generated for the sample set, where edge weights between samples represent the number of triplets that separate those samples minus the number of triplets that have those samples as an ingroup. Shared mutations are coded as strong negative connections and differing mutations are coded as positive connections. Then a partition is generated by finding a maximum weight cut over the graph. The final partition is also improved upon by a greedy hill-climbing procedure that further optimizes the cut.

Parameters:

prior_transformation str (default: 'negative_log')

A function defining a transformation on the priors in forming weights to scale frequencies and the contribution of each mutation in the connectivity graph. One of the following:

”negative_log”: Transforms each probability by the negative
log (default)

”inverse”: Transforms each probability p by taking 1/p “square_root_inverse”: Transforms each probability by the

the square root of 1/p

sdimension Optional[int] (default: 3)

The number of dimensions to use for the embedding space. Acts as a hyperparameter

iterations Optional[int] (default: 50)

The number of iterations in updating the embeddings. Acts as a hyperparameter

prior_transformation#: Function to use when transforming priors into weights.

sdimension#: The number of dimensions to use for the embedding space

iterations#: The number of iterations in updating the embeddings

Methods

perform_split(character_matrix, samples, weights=None, missing_state_indicator=-1)[source]#

Generate a partition of the samples by finding the max-cut.

First, a connectivity graph is generated with samples as nodes such that samples with shared mutations have strong negative edge weight and samples with distant mutations have positive edge weight. Then, the algorithm finds a partition by using a heuristic method to find the max-cut on the connectivity graph. The samples are randomly embedded in a d-dimensional sphere and the embeddings for each node are iteratively updated based on neighboring edge weights in the connectivity graph such that nodes with stronger connectivity cluster together. The final partition is generated by choosing random hyperplanes to bisect the d-sphere and taking the one that maximizes the cut.

Parameters:

character_matrix DataFrame: Character matrix
samples List[int]: A list of samples to partition
weights Optional[Dict[int, Dict[int, float]]] (default: None): Weighting of each (character, state) pair. Typically a transformation of the priors.
missing_state_indicator int (default: -1): Character representing missing data.

Return type:

Tuple[List[str], List[str]]

Returns:

A tuple of lists, representing the left and right partition groups

evaluate_cut(cut, G)[source]#

A simple function to evaluate the weight of a cut.

For each edge in the graph, checks if it is in the cut, and then adds its edge weight to the cut if it is.

Parameters:

cut List[str]: A list of nodes that represents one of the sides of a cut on the graph
G DiGraph: The graph the cut is over

Return type:

float

Returns:

The weight of the cut