cassiopeia.data.CassiopeiaTree¶
- class cassiopeia.data.CassiopeiaTree(character_matrix=None, missing_state_indicator=- 1, cell_meta=None, character_meta=None, priors=None, tree=None, dissimilarity_map=None, root_sample_name=None)[source]¶
Basic tree object for Cassiopeia.
This object stores the key attributes and functionalities a user might want for working with lineage tracing experiments. At its core, it stores three main items - a tree, a character matrix, and meta data associated with the data.
The tree can be fed into the object via Ete3, Networkx, or can be inferred using one of the CassiopeiaSolver algorithms in the solver module. The tree here is only used for obtaining the _topology_ of the tree.
A character matrix can be stored in the object, containing the states observed for each cell. In typical lineage tracing experiments, these are integer representations of the indels observed at each unique cut site. We track both an unmodified version of the character matrix (obtainable via the get_original_character_matrix method) that does not maintain consistency with the character states of the leaves, and a working character matrix (obtainable via the get_current_character_matrix method) that is updated when the character states of leaves are changed.
Some reconstruction algorithms will make use of dissimilarities between cells. To this end, we store these dissimilarity maps in the CassiopeiaTree class itself. For users trying to diagnose the reconstruction accuracy with a known groundtruth, they can compare this dissimilarity map to the phylogenetic distance on the tree.
Meta data for cells or characters can also be stored in this object. These items can be categorical or numerical in nature. Common examples of cell meta data are the cluster identity, tissue identity, or number of target-site UMIs per cell. These items can be used in downstream analyses, for example the FitchCount algorithm which infers the number of transitions between categorical variables (e.g., tissues). Common examples of character meta data are the proportion of missing data for each character or the entropy of states. These are good statistics to have for feature selection.
TODO(rzhang): Add check upon initialization that input tree is valid tree. TODO(mattjones315): Add experimental meta data as arguments. TODO(mattjones315): Add utility methods to compute the colless index
and the cophenetic correlation wrt to some cell meta item
TODO(mattjones315): Add bulk set_states method. TODO(mattjones): Add boolean to get_tree_topology which will include
all attributes (e.g., node times)
- Parameters
- character_matrix :
DataFrame,NoneOptional[DataFrame] (default:None) The character matrix for the lineage.
- missing_state_indicator :
intint(default:-1) An indicator for missing states in the character matrix.
- cell_meta :
DataFrame,NoneOptional[DataFrame] (default:None) Per-cell meta data
- character_meta :
DataFrame,NoneOptional[DataFrame] (default:None) Per-character meta data
- priors : {
int: {int:float}},NoneOptional[Dict[int,Dict[int,float]]] (default:None) A dictionary storing the probability of each character mutating to a particular state.
- tree :
str,TreeNode,DiGraph,NoneUnion[str,TreeNode,DiGraph,None] (default:None) A tree for the lineage.
- dissimilarity_map :
DataFrame,NoneOptional[DataFrame] (default:None) An NxN dataframe storing the pairwise dissimilarities between samples.
- root_sample_name :
str,NoneOptional[str] (default:None) The name of the sample to treat as the root. This is not always used, but will be added if needed during tree reconstruction. If the user already has a sample in the character matrix or dissimilarity map that they would like to use as the phylogenetic root, they can specify it here.
- character_matrix :
Methods
- populate_tree(tree, layer=None)[source]¶
Populates a tree object in CassiopeiaTree.
Accepts a tree topology and introduces the topology into the object. In doing so, this function will also append character states to the leaves of the tree topology, if possible, and instantiate edge lengths by default to be length 1. Node times will be instantiated as well, corresponding to their tree depth.
- Parameters
- Return type
- property layers: cassiopeia.data.Layers.Layers¶
Dictionary object storing versions of character matrices.
Layers in this CassiopeiaTree object are inspired by AnnData’s layers functionality. All layers have the same number of samples as the tree and
character_matrix.- Return the layer named “missing”::
cas_tree.layers[“missing”]
- Create or replace the “merged” layer::
cas_tree.layers[“merged”] = …
- Delete the “missing” layer::
del cas_tree.layers[“missing”]
- Return type
LayersLayers
- set_character_states_at_leaves(character_matrix=None, layer=None)[source]¶
Populates character states at leaves.
Assigns character states to the leaves of the tree. This function must have a character state assignment to all leaves of the tree. If no argument is passed, the default
character_matrixis used to set character states at the leaves.- Parameters
- character_matrix :
DataFrame,Dict,NoneUnion[DataFrame,Dict,None] (default:None) A separate character matrix to use for setting character states at the leaves. This character matrix is not stored in the object afterwards. If this is None, uses the default character matrix stored in
character_matrix- layer :
str,NoneOptional[str] (default:None) Layer to use for resetting character information at leaves. If this is None, uses the default character matrix stored in
character_matrix.
- character_matrix :
- Raises
CassiopeiaTreeError if not all leaves are accounted for or if the – tree has not been initialized.
- Return type
- set_all_character_states(character_state_mapping, add_layer=None)[source]¶
Populates character states across the tree.
Assigns character states to all of the nodes in the tree. The mapping must have an entry for every node in the tree.
- Parameters
- character_state_mapping : {
str:List[int]}Dict[str,List[int]] A mapping containing character state assignments for every node
- add_layer :
str,NoneOptional[str] (default:None) Layer to which to add the new character matrix. If this is None, then the specified character matrix will be set to the default matrix in this object.
- character_state_mapping : {
- Raises
CassiopeiaTreeError if the tree is not initialized or if the – character_state_mapping does not contain assignments for every node.
- Return type
- freeze_character_matrix(add_layer)[source]¶
Freezes character matrix in specified layer.
Adds a new layer the CassiopeiaTree object corresponding to the current version of the character matrix.
- property n_cell: int¶
Returns number of cells in tree.
- property n_character: int¶
Returns number of characters in character matrix.
- property root: str¶
Returns root of tree.
- property leaves: List[str]¶
Returns leaves of tree.
- property internal_nodes: List[str]¶
Returns internal nodes in tree (including the root).
- property nodes: List[str]¶
Returns all nodes in tree.
- property edges: List[Tuple[str, str]]¶
Returns all edges in the tree.
- is_ambiguous(node)[source]¶
Returns whether the node is ambiguous.
This is detected by checking if any of the characters are tuples.
- collapse_ambiguous_characters()[source]¶
Only retain unique characters for ambiguous nodes.
Ambiguous nodes have character strings represented as a list of tuples of integers. The inner list may contain multiple of the same states, encoding the relative abundance of a certain state in the ambiguous state distribution. Calling this function removes such duplicates and only retains unique characters. This function is idempotent and does nothing for trees that have no ambiguous characters.
- resolve_ambiguous_characters(resolve_function=<function resolve_most_abundant>)[source]¶
Resolve all nodes with ambiguous characters.
A custom
resolve_functionmay be provided to perform the resolution. By default, the most abundant state is selected. One is randomly selected on ties. Modifies the tree in-place.- Parameters
- resolve_function :
Callable[[Tuple[int, …]],int]Callable[[Tuple[int, …]],int] (default:<function resolve_most_abundant at 0x7f6b485e3048>) Function that performs character resolution. This function is called once per ambiguous character state, and thus takes a single integer tuple as its argument and returns the resolved character state.
- resolve_function :
- Raises
CassiopeiaTreeError if the tree has not been initialized. –
- Return type
- reconstruct_ancestral_characters()[source]¶
Reconstruct ancestral character states.
Reconstructs ancestral states (i.e., those character states in the internal nodes) using the Camin-Sokal parsimony criterion (i.e., irreversibility). Operates on the tree in place.
- set_time(node, new_time)[source]¶
Sets the time of a node.
Importantly, this maintains consistency with the rest of the tree. In other words, setting the time of a particular node will change the length of the edge leading into the node and the edges leading out. This function requires monotonicity of times are maintained (i.e. no negative branch lengths).
- set_times(time_dict)[source]¶
Sets the time of all nodes in the tree.
Importantly, this maintains consistency with the rest of the tree. In other words, setting the time of all nodes will change the length of the edges too. This function requires monotonicity of times are maintained (i.e. no negative branch lengths).
- get_time(node)[source]¶
Gets the time of a node.
Returns the time of a node, defined as the sum of edge lengths from the root to the node.
- get_times()[source]¶
Gets the times of all nodes.
Returns the times of all nodes, defined as the sum of edge lengths from the root to that node.
- set_branch_length(parent, child, length)[source]¶
Sets the length of a branch.
Adjusts the branch length of the specified parent-child relationship. This procedure maintains the consistency with the rest of the times in the tree. Namely, by changing the branch length here, it will change the times of all the nodes below the parent of interest, relative to the difference between the old and new branch length.
- set_branch_lengths(branch_length_dict)[source]¶
Sets the length of multiple branches on a tree.
Adjusts the branch length of specified parent-child relationships. This procedure maintains the consistency with the rest of the times in the tree. Namely, by changing branch lengths here, it will change the times of all the nodes in the tree such that the times are representative of the new branch lengths.
- set_character_states(node, states, layer=None)[source]¶
Sets the character states for a particular node.
- Parameters
- Raises
CassiopeiaTreeError if the character vector is the incorrect length, – or if the node of interest is a leaf that has not been instantiated.
- Return type
- get_all_ancestors(node, include_node=False)[source]¶
Gets all the ancestors of a particular node.
Nodes that are closest to the given node appear first in the list. The ancestors, including those of all intermediate nodes, are cached.
- Parameters
- Return type
- Returns
The list of nodes along the path from the root to the node.
- depth_first_traverse_nodes(source=None, postorder=True)[source]¶
Nodes from depth first traversal of the tree.
Returns the nodes from a DFS on the tree.
- Parameters
- Return type
- Returns
A list of nodes from the depth first traversal.
- Raises
CassiopeiaTreeError if the tree has not been initialized. –
- depth_first_traverse_edges(source=None)[source]¶
Edges from depth first traversal of the tree.
Returns the edges from a DFS on the tree.
- get_tree_topology()[source]¶
Returns the tree in Networkx format.
- Raises
CassiopeiaTreeError if the tree has not been initialized. –
- Return type
DiGraphDiGraph
- get_mean_depth_of_tree()[source]¶
Computes mean depth of tree.
Returns the mean depth of the tree. If branch lengths have not been estimated, depth is by default the number of edges in the tree.
- get_max_depth_of_tree()[source]¶
Computes the max depth of the tree.
Returns the maximum depth of the tree. If branch lengths have not been estimated, depth is by default the number of edges in the tree.
- get_mutations_along_edge(parent, child)[source]¶
Gets the mutations along an edge of interest.
Returns a list of tuples (character, state) of mutations that occur along an edge. Characters are 0-indexed.
- Parameters
- Return type
- Returns
- A list of (character, state) tuples indicating which character
mutated and to which state.
- Raises
CassiopeiaTreeError if the edge does not exist or if the tree is – not initialized.
- relabel_nodes(relabel_map)[source]¶
Relabels the nodes in the tree.
Renames the nodes in the tree according to the relabeling map. Modifies the tree in-place.
- add_leaf(parent, node, states=None, time=None, branch_length=None, dissimilarity=None)[source]¶
Add a leaf to the given parent node.
The parent node may NOT also be a leaf, as this makes bookkeeping the character, dissimilarity and cell meta quite complicated. Also, adding a leaf to an existing leaf is probably not a common use case.
Optional arguments
states,time,branch_length,dissimilaritymay be provided to initialize the character string, time, branch length, dissimilarity, and cell meta for this leaf respectively. See__register_data_with_tree()for default values that are used when any of these are not provided.Note that only one of
timeandbranch_lengthmay be provided, and when neither are provided, the new leaf is added at the same time as theparent.- Parameters
- parent :
strstr Parent node, to which to connect the leaf
- node :
strstr Name of the leaf to add
- states :
List[int],List[Tuple[int, …]],NoneUnion[List[int],List[Tuple[int, …]],None] (default:None) Character states to initialize the new leaf with
- time :
float,NoneOptional[float] (default:None) Time to place the new leaf
- dissimilarity : {
str:float},NoneOptional[Dict[str,float]] (default:None) Indel dissimilarity between the new leaf and all other leaves
- meta
Cell metadata
- parent :
- Raises
CassiopeiaTreeError if parent does not exist or node – already exists, if
parentis a leaf, if the tree has not been initialized, or if bothtimeandbranch_lengthare provided.- Return type
- remove_leaf_and_prune_lineage(node)[source]¶
Removes a leaf from the tree and prunes the lineage.
Removes a leaf and all ancestors of that leaf that are no longer the ancestor of any leaves. In the context of a phylogeny, this prunes the lineage of all nodes no longer relevant to observed samples. Additionally, maintains consistency with the updated tree by removing the node from all leaf data.
- collapse_unifurcations(source=None)[source]¶
Collapses unifurcations on the tree.
Removes all internal nodes that have in degree and out degree of 1, connecting their parent and children nodes by branchs with lengths equal to the total time elapsed from parent to each child. Therefore preserves the times of nodes that are not removed.
- collapse_mutationless_edges(infer_ancestral_characters)[source]¶
Collapses mutationless edges in the tree in-place.
Uses the internal node annotations of a tree to collapse edges with no mutations. The introduction of a missing data event is considered a mutation in this context. Either takes the existing character states on the tree or infers the annotations bottom-up from the samples obeying Camin-Sokal Parsimony. Preserves the times of nodes that are not removed by connecting the parent and children of removed nodes by branchs with lengths equal to the total time elapsed from parent to each child.
- set_dissimilarity_map(dissimilarity_map)[source]¶
Sets the dissimilarity map variable in this object.
- set_dissimilarity(node, dissimilarity)[source]¶
Set the dissimilarity of a single leaf.
This function may be used only when a dissimilarity map already exists.
- Parameters
- Raises
CassiopeiaTreeError if the dissimilarity map does not exist, if – the
dissimilaritydictionary does not contain all other leaves,- Return type
- compute_dissimilarity_map(dissimilarity_function=None, prior_transformation='negative_log', layer=None)[source]¶
Computes a dissimilarity map.
Given the dissimilarity function passed in, the pairwise dissimilarities will be computed over the samples in the character matrix. Populates the dissimilarity_map attribute in the object.
If any of the leaves have ambiguous character states (detected by checking if any of the states are lists), then
cassiopeia.solver.dissimilarity_functions.cluster_dissimilarity()is called instead, with the provideddissimilarity_functionas the first argument.- Parameters
- dissimilarity_function :
Callable[[array,array,int,Dict[int,Dict[int,float]]],float],NoneOptional[Callable[[array,array,int,Dict[int,Dict[int,float]]],float]] (default:None) A function that will take in two character vectors and priors and produce a dissimilarity.
- prior_transformation :
strstr(default:'negative_log') A function defining a transformation on the priors in forming weights. Supports the following transformations:
- ”negative_log”: Transforms each probability by the negative
log
”inverse”: Transforms each probability p by taking 1/p “square_root_inverse”: Transforms each probability by the
the square root of 1/p
- layer :
str,NoneOptional[str] (default:None) Character matrix layer to use. If not specified, use the default
character_matrix.
- dissimilarity_function :
- Return type
- find_lcas_of_pairs(pairs=None)[source]¶
Finds LCAs of all (provided) pairs.
- find_lca(*nodes)[source]¶
Finds the LCA of all provided nodes.
Internally, this function calls
get_all_ancestors()for each provided node and finds the deepest node that is shared.
- get_distance(node1, node2)[source]¶
Compute the branch distance between two nodes.
Internally, this function converts the tree to an undirected graph and finds the shortest path by calling
nx.shortest_path_length().
- get_distances(node, leaves_only=False)[source]¶
Compute the distances between the given node and every other node.
Internally, this function converts the tree to an undirected graph and finds the shortest path by calling
nx.shortest_path_length().- Parameters
- Return type
- Returns
Dictionary of distances
- Raises
CassiopeiaTreeError if the tree has not been initialized. –