cassiopeia.tl.estimate_missing_data_rates

cassiopeia.tl.estimate_missing_data_rates(tree, continuous=True, assume_root_implicit_branch=True, stochastic_missing_probability=None, heritable_missing_rate=None, layer=None)

Estimates both missing data parameters given one of the two from a tree.

The stochastic missing probability is the probability that any given cell/character pair acquires stochastic missing data in the character matrix due to low-capture in single-cell RNA sequencing. The heritable missing rate is either a continuous or per-generation rate according to which lineages accumulate heritable missing data events, such as transcriptional silencing or resection.

In most instances, the two types of missing data are convolved and we cannot determine whether any single occurrence of missing data is due to stochastic or heritable missing data. We assume both contribute to the total amount of missing data as:

total missing proportion = heritable proportion + stochastic proportion - heritable proportion * stochastic proportion
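
The relationship above can be checked numerically. The following is a minimal Python sketch, not Cassiopeia's implementation; the function names are hypothetical and chosen only for illustration. It combines the two proportions and then solves the same expression for one proportion given the other and the total.

```python
def total_missing_proportion(heritable: float, stochastic: float) -> float:
    """Combine two independent sources of missing data.

    The product term removes entries that would be missing for both reasons,
    so they are not counted twice.
    """
    return heritable + stochastic - heritable * stochastic


def solve_other_proportion(total: float, known: float) -> float:
    """Solve total = known + other - known * other for `other`.

    The expression is symmetric, so `known` may be either proportion.
    """
    if known >= 1.0:
        raise ValueError("the known proportion must be below 1")
    return (total - known) / (1.0 - known)


# Illustrative values only: 20% heritable and 10% stochastic missing data.
total = total_missing_proportion(heritable=0.2, stochastic=0.1)
recovered_heritable = solve_other_proportion(total, known=0.1)
```

Because the expression is symmetric in the two proportions, the same helper recovers the stochastic proportion when the heritable proportion is the known quantity.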

This function attempts to consume the amount of missing data (the total missing proportion) as missing_proportion in tree.parameters, inferring it using get_proportion_of_missing_data if it is not populated.

Additionally, as the two types of data are convolved, we need to know the contribution of one of the types of missing data in order to estimate the other. This function attempts to consume the heritable missing rate as heritable_missing_rate in tree.parameters and the stochastic missing probability as stochastic_missing_probability in tree.parameters. If they are not provided on the tree, then they may be provided as function arguments. If neither or both parameters are provided by either of these methods, the function errors.
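
The lookup order described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the library's code: `resolve_missing_parameters` is a hypothetical helper, a plain dictionary stands in for tree.parameters, and `ValueError` stands in for the library's own error type.

```python
from typing import Mapping, Optional, Tuple


def resolve_missing_parameters(
    tree_parameters: Mapping[str, float],
    stochastic_missing_probability: Optional[float] = None,
    heritable_missing_rate: Optional[float] = None,
) -> Tuple[Optional[float], Optional[float]]:
    """Pick each parameter from the argument first, then from the tree."""
    stochastic = stochastic_missing_probability
    if stochastic is None:
        stochastic = tree_parameters.get("stochastic_missing_probability")

    heritable = heritable_missing_rate
    if heritable is None:
        heritable = tree_parameters.get("heritable_missing_rate")

    # Exactly one of the two must be known: with neither there is nothing to
    # anchor the estimate, and with both there is nothing left to estimate.
    if (stochastic is None) == (heritable is None):
        raise ValueError(
            "exactly one of stochastic_missing_probability and "
            "heritable_missing_rate must be provided"
        )
    return stochastic, heritable


# A value stored on the tree is used when no argument is given.
resolved = resolve_missing_parameters({"heritable_missing_rate": 0.05})
```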

In estimating the heritable missing rate from the stochastic missing data probability, we take the proportion of stochastic missing data in the character matrix as equal to the stochastic probability. Then using the total observed proportion of missing data as well as the estimated proportion of stochastic missing data we can estimate the proportion of heritable missing data using the expression above. Finally, we use the heritable proportion as an estimate of the probability a lineage acquires a missing data event by the end of the phylogeny, and using this probability we can estimate the rate.

In the case where the rate is per-generation (probability a heritable missing data event occurs on an edge), it is estimated using:

heritable missing proportion = 1 - (1 - heritable missing rate) ^ (average depth of tree)
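
Solving the per-generation expression for the rate gives rate = 1 - (1 - proportion) ^ (1 / depth). A minimal Python sketch of both directions follows; the function names are hypothetical, and the sketch assumes the average depth is positive and already includes any implicit root branch.

```python
def heritable_proportion_from_rate(rate: float, mean_depth: float) -> float:
    """Probability that a lineage has acquired a heritable missing data event
    after `mean_depth` generations, each with per-edge probability `rate`."""
    return 1.0 - (1.0 - rate) ** mean_depth


def per_generation_rate(heritable_proportion: float, mean_depth: float) -> float:
    """Invert the expression above to recover the per-generation rate."""
    if mean_depth <= 0:
        raise ValueError("mean_depth must be positive")
    return 1.0 - (1.0 - heritable_proportion) ** (1.0 / mean_depth)


# Illustrative values only: 19% heritable missing data on a tree of depth 2.
rate = per_generation_rate(0.19, mean_depth=2)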

In the case where the rate is continuous, it is estimated using:

heritable missing proportion = ExponentialCDF(average time of tree, heritable missing rate)
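
For the continuous case, the exponential CDF is 1 - exp(-rate * time), so the rate can be recovered as -ln(1 - proportion) / time. The sketch below illustrates this inversion in Python. The function names are hypothetical, and it assumes the average time is positive and the heritable proportion is below 1.

```python
import math


def exponential_cdf(time: float, rate: float) -> float:
    """Probability that an exponential waiting time with the given rate
    has elapsed by `time`."""
    return 1.0 - math.exp(-rate * time)


def continuous_rate(heritable_proportion: float, mean_time: float) -> float:
    """Invert the exponential CDF to recover the continuous rate."""
    if mean_time <= 0:
        raise ValueError("mean_time must be positive")
    if not 0.0 <= heritable_proportion < 1.0:
        raise ValueError("heritable_proportion must be in [0, 1)")
    return -math.log(1.0 - heritable_proportion) / mean_time


# Illustrative values only: half of the lineages carry a heritable event
# by time 2.0.
rate = continuous_rate(0.5, mean_time=2.0)
round_trip = exponential_cdf(2.0, rate)
```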

Note that these naive estimates perform better when the tree is ultrametric in depth or time. The average depth/lineage time of the tree is used as a proxy for the depth/total time when the tree is not ultrametric.

In calculating the heritable proportion from the heritable missing rate, we need to consider whether to assume an implicit root. This is specified by assume_root_implicit_branch. In the case where the tree does not have a single leading edge from the root representing the progenitor cell before cell division begins, this additional edge is added to the total time in calculating the estimate if assume_root_implicit_branch is True.

In estimating the stochastic missing probability from the heritable missing rate, we calculate the expected proportion of heritable missing data using the heritable rate in the same way, and then as above use the total proportion of missing data to estimate the stochastic proportion, which we assume is equal to the probability.
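
The reverse direction can be sketched end to end as follows. This is a simplified illustration rather than the library's implementation: `estimate_stochastic_probability` is a hypothetical name, `mean_length` stands for the average depth (discrete) or average time (continuous) of the tree, and a generic Python warning stands in for the library's warning type.

```python
import math
import warnings


def estimate_stochastic_probability(
    total_missing: float,
    heritable_rate: float,
    mean_length: float,
    continuous: bool = True,
) -> float:
    """Estimate the stochastic missing probability from a heritable rate."""
    # Step 1: expected heritable proportion implied by the rate.
    if continuous:
        heritable = 1.0 - math.exp(-heritable_rate * mean_length)
    else:
        heritable = 1.0 - (1.0 - heritable_rate) ** mean_length

    # Step 2: solve total = heritable + stochastic - heritable * stochastic.
    stochastic = (total_missing - heritable) / (1.0 - heritable)

    # A negative value means the heritable rate alone predicts more missing
    # data than was observed.
    if stochastic < 0:
        warnings.warn("estimated stochastic missing probability is negative")
    return stochastic


# Illustrative values only: per-generation rate 0.1, depth 2, 28% missing.
estimate = estimate_stochastic_probability(
    0.28, heritable_rate=0.1, mean_length=2, continuous=False
)
```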

Parameters:
tree CassiopeiaTree

The CassiopeiaTree specifying the tree and the character matrix

continuous bool (default: True)

Whether to calculate a continuous missing rate, accounting for branch lengths. Otherwise, calculates a discrete missing rate based on the number of generations

assume_root_implicit_branch bool (default: True)

Whether to assume that there is an implicit branch leading from the root, if it doesn’t exist

stochastic_missing_probability Optional[float] (default: None)

The stochastic missing probability. Will override the value on the tree. Observed probabilities of stochastic missing data range from 10% to 20%.

heritable_missing_rate Optional[float] (default: None)

The heritable missing rate. Will override the value on the tree

layer Optional[str] (default: None)

Layer to use for character matrix. If this is None, then the current character_matrix variable will be used.

Return type:

Tuple[float, float]

Returns:

The stochastic missing probability and heritable missing rate. One of these will be the parameter as provided, the other will be an estimate

Raises:

ParameterEstimateError if the total_missing_proportion, stochastic_missing_probability, or heritable_missing_rate provided has an invalid value, or if both or neither of stochastic_missing_probability and heritable_missing_rate are provided. ParameterEstimateWarning if the estimated parameter is negative.