cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity#
- cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity(dissimilarity_function, s1, s2, missing_state_indicator, weights=None, linkage_function=<function mean>, normalize=True)[source]#
Compute the dissimilarity between (possibly) ambiguous character strings.
An ambiguous character string is a character string in which each character contains an tuple of possible states, and such a character string is represented as a list of tuples of integers.
A naive implementation is to first disambiguate each of the two ambiguous character strings by generating all possible strings, then computing the dissimilarity between all pairwise combinations, and finally applying the linkage function on the calculated dissimilarities. However, doing so has complexity O(prod_{i=1}^N |a_i| x |b_i|) where N is the number of target sites, |a_i| is the number of ambiguous characters at target site i of string a, and |b_i| is the number of amiguous characters at target site i of string b. As an example, if we have N=10 and all a_i=b_i=2, then we have to construct 1,038,576 * 2 strings and compute over 4 trillion dissimilarities.
By assuming each target site is independent, simply calculating the sum of the linkages of each target site separately is equivalent to the naive implementation (can be proven by induction). This procedure is implemented in this function. One caveat is that we usually normalize the distance by the number of shared non-missing positions. We approximate this by dividing the absolute distance by the sum of the probability of each site not being a missing site for both strings.
The idea of linkage is analogous to that in hierarchical clustering, where
np.min
can be used for single linkage,np.max
for complete linkage, andnp.mean
for average linkage (the default).The reason the
dissimilarity_function
argument is defined as the first argument is so that this function may be used as input tocassiopeia.data.CassiopeiaTree.compute_dissimilarity_map()
. This can be done by partial application of this function with the desired dissimilarity function.Note
If neither character string is ambiguous, then calling this function is equivalent to calling
dissimilarity_function
separately.- Parameters:
- s1
Union
[List
[int
],List
[Tuple
[int
,...
]]] The first (possibly) ambiguous sample
- s2
Union
[List
[int
],List
[Tuple
[int
,...
]]] The second (possibly) ambiguous sample
- missing_state_indicator
int
The character representing missing values
- weights
Optional
[Dict
[int
,Dict
[int
,float
]]] (default:None
) A set of optional weights to weight the similarity of a mutation
- dissimilarity_function
Callable
[[List
[int
],List
[int
],int
,Dict
[int
,Dict
[int
,float
]]],float
] The dissimilarity function to use to calculate pairwise dissimilarities.
- linkage_function
Callable
[[Union
[array
,List
[float
]]],float
] (default:<function mean at 0x766628debbf0>
) The linkage function to use to aggregate
- for dissimilarities into a single number. Defaults to np.mean
average linkage.
- normalize
bool
(default:True
) Whether to normalize to the proportion of sites present in both strings.
- s1
- Return type:
- Returns:
The dissimilarity between the two ambiguous samples