cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity#

cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity(dissimilarity_function, s1, s2, missing_state_indicator, weights=None, linkage_function=<function mean>, normalize=True)[source]#

Compute the dissimilarity between (possibly) ambiguous character strings.

An ambiguous character string is a character string in which each character contains an tuple of possible states, and such a character string is represented as a list of tuples of integers.

A naive implementation is to first disambiguate each of the two ambiguous character strings by generating all possible strings, then computing the dissimilarity between all pairwise combinations, and finally applying the linkage function on the calculated dissimilarities. However, doing so has complexity O(prod_{i=1}^N |a_i| x |b_i|) where N is the number of target sites, |a_i| is the number of ambiguous characters at target site i of string a, and |b_i| is the number of amiguous characters at target site i of string b. As an example, if we have N=10 and all a_i=b_i=2, then we have to construct 1,038,576 * 2 strings and compute over 4 trillion dissimilarities.

By assuming each target site is independent, simply calculating the sum of the linkages of each target site separately is equivalent to the naive implementation (can be proven by induction). This procedure is implemented in this function. One caveat is that we usually normalize the distance by the number of shared non-missing positions. We approximate this by dividing the absolute distance by the sum of the probability of each site not being a missing site for both strings.

The idea of linkage is analogous to that in hierarchical clustering, where np.min can be used for single linkage, np.max for complete linkage, and np.mean for average linkage (the default).

The reason the dissimilarity_function argument is defined as the first argument is so that this function may be used as input to cassiopeia.data.CassiopeiaTree.compute_dissimilarity_map(). This can be done by partial application of this function with the desired dissimilarity function.

Note

If neither character string is ambiguous, then calling this function is equivalent to calling dissimilarity_function separately.

Parameters:

s1 Union[List[int], List[Tuple[int, ...]]]: The first (possibly) ambiguous sample
s2 Union[List[int], List[Tuple[int, ...]]]: The second (possibly) ambiguous sample
missing_state_indicator int: The character representing missing values
weights Optional[Dict[int, Dict[int, float]]] (default: None): A set of optional weights to weight the similarity of a mutation
dissimilarity_function Callable[[List[int], List[int], int, Dict[int, Dict[int, float]]], float]: The dissimilarity function to use to calculate pairwise dissimilarities.
linkage_function Callable[[Union[array, List[float]]], float] (default: <function mean at 0x7ff8a19a80b0>): The linkage function to use to aggregate
for dissimilarities into a single number. Defaults to np.mean: average linkage.
normalize bool (default: True): Whether to normalize to the proportion of sites present in both strings.

Return type:

float

Returns:

The dissimilarity between the two ambiguous samples