cassiopeia.data.CassiopeiaTree#
- class cassiopeia.data.CassiopeiaTree(character_matrix=None, missing_state_indicator=-1, cell_meta=None, character_meta=None, priors=None, tree=None, dissimilarity_map=None, parameters=None, root_sample_name=None)[source]#
Basic tree object for Cassiopeia.
This object stores the key attributes and functionalities a user might want for working with lineage tracing experiments. At its core, it stores three main items - a tree, a character matrix, and meta data associated with the data.
The tree can be fed into the object via Ete3, Networkx, or can be inferred using one of the CassiopeiaSolver algorithms in the solver module. The tree here is only used for obtaining the _topology_ of the tree.
A character matrix can be stored in the object, containing the states observed for each cell. In typical lineage tracing experiments, these are integer representations of the indels observed at each unique cut site. We track both an unmodified version of the character matrix (obtainable via the get_original_character_matrix method) that does not maintain consistency with the character states of the leaves, and a working character matrix (obtainable via the get_current_character_matrix method) that is updated when the character states of leaves are changed.
Some reconstruction algorithms will make use of dissimilarities between cells. To this end, we store these dissimilarity maps in the CassiopeiaTree class itself. For users trying to diagnose the reconstruction accuracy with a known groundtruth, they can compare this dissimilarity map to the phylogenetic distance on the tree.
Meta data for cells or characters can also be stored in this object. These items can be categorical or numerical in nature. Common examples of cell meta data are the cluster identity, tissue identity, or number of target-site UMIs per cell. These items can be used in downstream analyses, for example the FitchCount algorithm which infers the number of transitions between categorical variables (e.g., tissues). Common examples of character meta data are the proportion of missing data for each character or the entropy of states. These are good statistics to have for feature selection.
TODO(rzhang): Add check upon initialization that input tree is valid tree. TODO(mattjones315): Add experimental meta data as arguments. TODO(mattjones315): Add utility methods to compute the colless index
and the cophenetic correlation wrt to some cell meta item
TODO(mattjones315): Add bulk set_states method. TODO(mattjones): Add boolean to get_tree_topology which will include
all attributes (e.g., node times)
- Parameters:
- character_matrix
Optional
[DataFrame
] (default:None
) The character matrix for the lineage.
- missing_state_indicator
int
(default:-1
) An indicator for missing states in the character matrix.
- cell_meta
Optional
[DataFrame
] (default:None
) Per-cell meta data
- character_meta
Optional
[DataFrame
] (default:None
) Per-character meta data
- priors
Optional
[Dict
[int
,Dict
[int
,float
]]] (default:None
) A dictionary storing the probability of each character mutating to a particular state.
- tree
Union
[str
,TreeNode
,DiGraph
,None
] (default:None
) A tree for the lineage.
- dissimilarity_map
Optional
[DataFrame
] (default:None
) An NxN dataframe storing the pairwise dissimilarities between samples.
- root_sample_name
Optional
[str
] (default:None
) The name of the sample to treat as the root. This is not always used, but will be added if needed during tree reconstruction. If the user already has a sample in the character matrix or dissimilarity map that they would like to use as the phylogenetic root, they can specify it here.
- character_matrix
Methods
- populate_tree(tree, layer=None)[source]#
Populates a tree object in CassiopeiaTree.
Accepts a tree topology and introduces the topology into the object. In doing so, this function will also append character states to the leaves of the tree topology, if possible, and instantiate edge lengths by default to be length 1. Node times will be instantiated as well, corresponding to their tree depth.
- property layers: Layers#
Dictionary object storing versions of character matrices.
Layers in this CassiopeiaTree object are inspired by AnnData’s layers functionality. All layers have the same number of samples as the tree and
character_matrix
.- Return the layer named “missing”::
cas_tree.layers[“missing”]
- Create or replace the “merged” layer::
cas_tree.layers[“merged”] = …
- Delete the “missing” layer::
del cas_tree.layers[“missing”]
- set_character_states_at_leaves(character_matrix=None, layer=None)[source]#
Populates character states at leaves.
Assigns character states to the leaves of the tree. This function must have a character state assignment to all leaves of the tree. If no argument is passed, the default
character_matrix
is used to set character states at the leaves.- Parameters:
- character_matrix
Union
[DataFrame
,Dict
,None
] (default:None
) A separate character matrix to use for setting character states at the leaves. This character matrix is not stored in the object afterwards. If this is None, uses the default character matrix stored in
character_matrix
- layer
Optional
[str
] (default:None
) Layer to use for resetting character information at leaves. If this is None, uses the default character matrix stored in
character_matrix
.
- character_matrix
- Raises:
CassiopeiaTreeError if not all leaves are accounted for or if the – tree has not been initialized.
- Return type:
- set_all_character_states(character_state_mapping, add_layer=None)[source]#
Populates character states across the tree.
Assigns character states to all of the nodes in the tree. The mapping must have an entry for every node in the tree.
- Parameters:
- Raises:
CassiopeiaTreeError if the tree is not initialized or if the – character_state_mapping does not contain assignments for every node.
- Return type:
- freeze_character_matrix(add_layer)[source]#
Freezes character matrix in specified layer.
Adds a new layer the CassiopeiaTree object corresponding to the current version of the character matrix.
- Parameters:
- add_layer
str
Layer to create.
- add_layer
- Raises:
CassiopeiaTreeError if no character matrix exists to freeze. –
CassiopeiaTreeWarning if layer already exists. –
- property n_cell: int#
Returns number of cells in tree.
- Raises:
CassiopeiaTreeError if the object is empty (i.e. no tree or –
character matrix). –
- property n_character: int#
Returns number of characters in character matrix.
- Raises:
CassiopeiaTreeError if the object is empty (i.e. no tree or –
character matrix) or if the character states have not been –
initialized. –
- property root: str#
Returns root of tree.
- Returns:
The root.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- property leaves: List[str]#
Returns leaves of tree.
- Returns:
The leaves of the tree.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- property internal_nodes: List[str]#
Returns internal nodes in tree (including the root).
- Returns:
The internal nodes of the tree (i.e. all nodes not at the leaves)
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- property nodes: List[str]#
Returns all nodes in tree.
- Returns:
All nodes of the tree (internal + leaves)
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- property edges: List[Tuple[str, str]]#
Returns all edges in the tree.
- Returns:
All edges of the tree.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- is_leaf(node)[source]#
Returns whether or not the node is a leaf.
- Return type:
- Returns:
Whether or not the node is a leaf.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- is_root(node)[source]#
Returns whether or not the node is the root.
- Return type:
- Returns:
Whether or not the node is the root.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- is_internal_node(node)[source]#
Returns whether or not the node is an internal node.
- Return type:
- Returns:
Whether or not the node is an internal node (i.e. out degree is greater than 0). In this case, the root is considered an internal node.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- is_ambiguous(node)[source]#
Returns whether the node is ambiguous.
This is detected by checking if any of the characters are tuples.
- collapse_ambiguous_characters()[source]#
Only retain unique characters for ambiguous nodes.
Ambiguous nodes have character strings represented as a list of tuples of integers. The inner list may contain multiple of the same states, encoding the relative abundance of a certain state in the ambiguous state distribution. Calling this function removes such duplicates and only retains unique characters. This function is idempotent and does nothing for trees that have no ambiguous characters.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- resolve_ambiguous_characters(resolve_function=<function resolve_most_abundant>)[source]#
Resolve all nodes with ambiguous characters.
A custom
resolve_function
may be provided to perform the resolution. By default, the most abundant state is selected. One is randomly selected on ties. Modifies the tree in-place.- Parameters:
- resolve_function
Callable
[[Tuple
[int
,...
]],int
] (default:<function resolve_most_abundant at 0x76660cb05af0>
) Function that performs character resolution. This function is called once per ambiguous character state, and thus takes a single integer tuple as its argument and returns the resolved character state.
- resolve_function
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- reconstruct_ancestral_characters()[source]#
Reconstruct ancestral character states.
Reconstructs ancestral states (i.e., those character states in the internal nodes) using the Camin-Sokal parsimony criterion (i.e., irreversibility). Operates on the tree in place.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- reorder_children(node, child_order)[source]#
Reorders the children of a particular node.
- Parameters:
- Raises:
CassiopeiaTreeError if the node of interest is a leaf that has not – been instantiated, or if the new order of children is not a permutation of the original children.
- Return type:
- set_time(node, new_time)[source]#
Sets the time of a node.
Importantly, this maintains consistency with the rest of the tree. In other words, setting the time of a particular node will change the length of the edge leading into the node and the edges leading out. This function requires monotonicity of times are maintained (i.e. no negative branch lengths).
- set_times(time_dict)[source]#
Sets the time of all nodes in the tree.
Importantly, this maintains consistency with the rest of the tree. In other words, setting the time of all nodes will change the length of the edges too. This function requires monotonicity of times are maintained (i.e. no negative branch lengths).
- get_time(node)[source]#
Gets the time of a node.
Returns the time of a node, defined as the sum of edge lengths from the root to the node.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- get_times()[source]#
Gets the times of all nodes.
Returns the times of all nodes, defined as the sum of edge lengths from the root to that node.
- set_branch_length(parent, child, length)[source]#
Sets the length of a branch.
Adjusts the branch length of the specified parent-child relationship. This procedure maintains the consistency with the rest of the times in the tree. Namely, by changing the branch length here, it will change the times of all the nodes below the parent of interest, relative to the difference between the old and new branch length.
- set_branch_lengths(branch_length_dict)[source]#
Sets the length of multiple branches on a tree.
Adjusts the branch length of specified parent-child relationships. This procedure maintains the consistency with the rest of the times in the tree. Namely, by changing branch lengths here, it will change the times of all the nodes in the tree such that the times are representative of the new branch lengths.
- Parameters:
- branch_dict
A dictionary of edges to updated branch lengths
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- get_branch_length(parent, child)[source]#
Gets the length of a branch.
- Raises:
CassiopeiaTreeError if the tree has not been initialized or if the – branch does not exist in the tree.
- Return type:
- set_character_states(node, states, layer=None)[source]#
Sets the character states for a particular node.
- Parameters:
- Raises:
CassiopeiaTreeError if the character vector is the incorrect length, – or if the node of interest is a leaf that has not been instantiated.
- Return type:
- get_all_ancestors(node, include_node=False)[source]#
Gets all the ancestors of a particular node.
Nodes that are closest to the given node appear first in the list. The ancestors, including those of all intermediate nodes, are cached.
- depth_first_traverse_nodes(source=None, postorder=True)[source]#
Nodes from depth first traversal of the tree.
Returns the nodes from a DFS on the tree.
- Parameters:
- Return type:
- Returns:
A list of nodes from the depth first traversal.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- depth_first_traverse_edges(source=None)[source]#
Edges from depth first traversal of the tree.
Returns the edges from a DFS on the tree.
- breadth_first_traverse_edges(source=None)[source]#
Edges from breadth first traversal of the tree.
Returns the edges from a BFS on the tree.
- subset_clade(node, copy=False)[source]#
Subset CassiopeiaTree object at node.
Given a node in the CassiopeiaTree, subset the entire tree object to only the nodes that fall below that node.
- Parameters:
- Return type:
- Returns:
A subset CassiopeiaTree object if copy is set to true, else None.
- get_newick(record_branch_lengths=False, record_node_names=False)[source]#
Returns newick format of tree.
- Parameters:
- record_branch_lengths default:
False
Whether to record branch lengths on the tree
- string in the newick
- record_node_names default:
False
Whether to record internal node names on the tree
- string
- record_branch_lengths default:
- Return type:
- Returns:
The tree in the form of a newick string
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- get_tree_topology()[source]#
Returns the tree in Networkx format.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
DiGraph
- get_mean_depth_of_tree()[source]#
Computes mean depth of tree.
Returns the mean depth of the tree. If branch lengths have not been estimated, depth is by default the number of edges in the tree.
- Return type:
- Returns:
Mean depth of the tree.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- get_max_depth_of_tree()[source]#
Computes the maximum depth of the tree (in terms of time).
The maximum depth of the tree (in terms of time) is defined as the greatest time distance of any leaf of the tree from the root. Because branch lengths are by default equal to 1, if branch lengths have not yet been estimated (say with the cassiopeia.tools.branch_length_estimator module), then the max depth will be the number of edges in the tree from root to the furthest leaf.
- Return type:
- Returns:
Maximum depth of the tree.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- get_mutations_along_edge(parent, child, treat_missing_as_mutations=False)[source]#
Gets the mutations along an edge of interest.
Returns a list of tuples (character, state) of mutations that occur along an edge. Characters are 0-indexed.
Note that parent states can be ambiguous if all child states have the same ambiguous state. If this is the case, there will not be a mutation detected along an edge, but we handle this case so as to not throw an error handling ambiguous states.
- Parameters:
- Return type:
- Returns:
- A list of (character, state) tuples indicating which character
mutated and to which state.
- Raises:
CassiopeiaTreeError if the edge does not exist or if the tree is – not initialized.
- relabel_nodes(relabel_map)[source]#
Relabels the nodes in the tree.
Renames the nodes in the tree according to the relabeling map. Modifies the tree in-place.
- add_leaf(parent, node, states=None, time=None, branch_length=None, dissimilarity=None)[source]#
Add a leaf to the given parent node.
The parent node may NOT also be a leaf, as this makes bookkeeping the character, dissimilarity and cell meta quite complicated. Also, adding a leaf to an existing leaf is probably not a common use case.
Optional arguments
states
,time
,branch_length
,dissimilarity
may be provided to initialize the character string, time, branch length, dissimilarity, and cell meta for this leaf respectively. See__register_data_with_tree()
for default values that are used when any of these are not provided.Note that only one of
time
andbranch_length
may be provided, and when neither are provided, the new leaf is added at the same time as theparent
.- Parameters:
- parent
str
Parent node, to which to connect the leaf
- node
str
Name of the leaf to add
- states
Union
[List
[int
],List
[Tuple
[int
,...
]],None
] (default:None
) Character states to initialize the new leaf with
- time
Optional
[float
] (default:None
) Time to place the new leaf
- dissimilarity
Optional
[Dict
[str
,float
]] (default:None
) Indel dissimilarity between the new leaf and all other leaves
- meta
Cell metadata
- parent
- Raises:
CassiopeiaTreeError if parent does not exist or node – already exists, if
parent
is a leaf, if the tree has not been initialized, or if bothtime
andbranch_length
are provided.- Return type:
- remove_leaves_and_prune_lineages(nodes)[source]#
Removes a leaf (leaves) from the tree and prunes the lineage(s).
Removes the specified leaves and all ancestors of those leaves that are no longer the ancestor of any of the remaining leaves. In the context of a phylogeny, this prunes the lineage of all nodes no longer relevant to observed samples. Additionally, maintains consistency with the updated tree by removing the node from all leaf data.
- collapse_unifurcations(source=None)[source]#
Collapses unifurcations on the tree.
Removes all internal nodes that have in degree and out degree of 1, connecting their parent and children nodes by branchs with lengths equal to the total time elapsed from parent to each child. Therefore preserves the times of nodes that are not removed.
- collapse_mutationless_edges(infer_ancestral_characters)[source]#
Collapses mutationless edges in the tree in-place.
Uses the internal node annotations of a tree to collapse edges with no mutations. The introduction of a missing data event is considered a mutation in this context. Either takes the existing character states on the tree or infers the annotations bottom-up from the samples obeying Camin-Sokal Parsimony. Preserves the times of nodes that are not removed by connecting the parent and children of removed nodes by branches with lengths equal to the total time elapsed from parent to each child.
- set_dissimilarity_map(dissimilarity_map)[source]#
Sets the dissimilarity map variable in this object.
- set_dissimilarity(node, dissimilarity)[source]#
Set the dissimilarity of a single leaf.
This function may be used only when a dissimilarity map already exists.
- Parameters:
- Raises:
CassiopeiaTreeError if the dissimilarity map does not exist, if – the
dissimilarity
dictionary does not contain all other leaves- Return type:
- compute_dissimilarity_map(dissimilarity_function=None, prior_transformation='negative_log', layer=None, threads=1)[source]#
Computes a dissimilarity map.
Given the dissimilarity function passed in, the pairwise dissimilarities will be computed over the samples in the character matrix. Populates the dissimilarity_map attribute in the object.
If any of the leaves have ambiguous character states (detected by checking if any of the states are lists), then
cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity()
is called instead, with the provideddissimilarity_function
as the first argument.- Parameters:
- dissimilarity_function
Optional
[Callable
[[array
,array
,int
,Dict
[int
,Dict
[int
,float
]]],float
]] (default:None
) A function that will take in two character vectors and priors and produce a dissimilarity.
- prior_transformation
str
(default:'negative_log'
) A function defining a transformation on the priors in forming weights. Supports the following transformations:
- ”negative_log”: Transforms each probability by the negative
log
”inverse”: Transforms each probability p by taking 1/p “square_root_inverse”: Transforms each probability by the
the square root of 1/p
- layer
Optional
[str
] (default:None
) Character matrix layer to use. If not specified, use the default
character_matrix
.- threads
int
(default:1
) Number of threads to use for dissimilarity map computation.
- dissimilarity_function
- Return type:
- find_lcas_of_pairs(pairs=None)[source]#
Finds LCAs of all (provided) pairs.
- Parameters:
- Return type:
- Returns:
A generator of ((u, v), LCA) tuples.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- find_lca(*nodes)[source]#
Finds the LCA of all provided nodes.
Internally, this function calls
get_all_ancestors()
for each provided node and finds the deepest node that is shared.
- get_distance(node1, node2)[source]#
Compute the branch distance between two nodes.
Internally, this function converts the tree to an undirected graph and finds the shortest path by calling
nx.shortest_path_length()
.
- get_distances(node, leaves_only=False)[source]#
Compute the distances between the given node and every other node.
Internally, this function converts the tree to an undirected graph and finds the shortest path by calling
nx.shortest_path_length()
.- Parameters:
- Return type:
- Returns:
Dictionary of distances
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- scale_to_unit_length()[source]#
Scales the tree to have unit length.
The longest path from root to leaf will have length 1 after the scaling.
- Raises:
CassiopeiaTreeError if the tree has not been initialized. –
- Return type:
- get_unmutated_characters_along_edge(parent, child)[source]#
List of unmutated characters along edge.
A character is considered to not mutate if it goes from the zero state to the zero state.
- impute_deducible_missing_states()[source]#
Impute deducible missing states.
If a state is missing in a node but present in its parent, then it can be imputed as the parent’s state. We perform all these imputations.
- Raises:
CassiopeiaTreeError if the character vectors do not all have –
the same length, or if the tree has not been initialized. –