cassiopeia.sim.Cas9LineageTracingDataSimulator

class cassiopeia.sim.Cas9LineageTracingDataSimulator(number_of_cassettes=10, size_of_cassette=3, mutation_rate=0.01, state_generating_distribution=<function Cas9LineageTracingDataSimulator.<lambda>>, number_of_states=100, state_priors=None, heritable_silencing_rate=0.0001, stochastic_silencing_rate=0.01, heritable_missing_data_state=-1, stochastic_missing_data_state=-1, random_seed=None, collapse_sites_on_cassette=True)

Simulates Cas9-based lineage tracing data.

This subclass of LineageTracingDataSimulator implements the overlay_data function to simulate Cas9-based mutations onto a defined set of “cassettes”. These cassettes emulate the TargetSites described in Chan et al., Nature 2019 and the GESTALT arrays described in McKenna et al., Science 2016, in which several Cas9 cut sites are arrayed together. In the Chan et al. technology, these cassettes are of length 3; in McKenna et al., they are of length 10.

The class accepts several parameters that govern the Cas9-based tracer. First and foremost is the cutting rate, describing how fast Cas9 is able to cut. We model Cas9 cutting as an exponential process, parameterized by the specified mutation rate. Specifically, for a lifetime t and rate parameter lambda, the probability that a given site is mutated by Cas9 is 1 - exp(-lambda * t).
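The cutting model can be illustrated with a few lines of Python. This is a minimal sketch, assuming the standard exponential waiting-time interpretation in which the chance that a site has been cut at least once by time t is 1 - exp(-lambda * t); the helper name cut_probability is hypothetical and is not part of the cassiopeia API.

```python
import math


def cut_probability(mutation_rate: float, lifetime: float) -> float:
    """Probability that one site is cut at least once during `lifetime`.

    Hypothetical helper: assumes an exponential waiting time with rate
    `mutation_rate`, so the complement exp(-rate * t) is the chance of
    the site staying uncut.
    """
    return 1.0 - math.exp(-mutation_rate * lifetime)


# With the default rate of 0.01, a short branch rarely mutates a site,
# while a long branch mutates it with substantial probability.
p_short = cut_probability(0.01, 1.0)
p_long = cut_probability(0.01, 100.0)
```

Under this model the probability grows with branch length and saturates at 1, so longer-lived cells accumulate more edits.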

Second, the class accepts the architecture of the recorder, described by the size of each cassette (by default 3) and the number of cassettes. The resulting lineage will have (size_of_cassette * number_of_cassettes) characters. The architecture matters because it encodes the correlation between characters, in two ways. The first is silencing: entire cassettes are lost together, whether through heritable transcriptional silencing or stochastic dropout. The second is Cas9 resection, in which more than one Cas9 molecule cuts the same cassette. In this event, all sites between the cuts are observed as missing. In this simulation, a resection can occur whenever Cas9 molecules cut the same cassette at any point in a cell’s lifetime. Importantly, setting the cassette length to 1 removes all resection events due to Cas9 cutting and reduces the amount of transcriptional silencing observed. Resection can be turned off manually by setting collapse_sites_on_cassette=False, which keeps cuts that occur simultaneously on the same cassette as separate events instead of causing a resection.

Third, the class accepts a state distribution describing the relative likelihoods of various indels. This matters because a handful of indels are typically far likelier than the bulk of the possible mutations.
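One way to picture how a state distribution turns into per-state likelihoods is to draw one weight per state and normalize. The sketch below is illustrative only: the function name simulate_state_priors, the exponential draw, and the choice to number states from 1 are all assumptions rather than the library's confirmed behavior.

```python
import random
from typing import Callable, Dict


def simulate_state_priors(
    number_of_states: int, draw: Callable[[], float]
) -> Dict[int, float]:
    """Draw one weight per state and normalize so the priors sum to 1.

    Hypothetical helper; states are numbered from 1 here by assumption.
    """
    weights = [draw() for _ in range(number_of_states)]
    total = sum(weights)
    return {state: w / total for state, w in enumerate(weights, start=1)}


# A seeded generator keeps the example reproducible; the exponential
# distribution is only an illustrative choice of generating distribution.
rng = random.Random(0)
priors = simulate_state_priors(100, lambda: rng.expovariate(1e5))
```

Skewed generating distributions produce priors in which a few states carry most of the probability mass, which mirrors the situation described above.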

Finally, the class accepts two types of silencing rates. The first is the heritable silencing rate, which governs rare events in which an entire cassette is transcriptionally silenced and therefore not observed. The second is the stochastic dropout rate, which simulates the loss of cassettes due to the low sensitivity of the RNA-sequencing assay.

The function overlay_data will operate on the tree in place and will specifically modify the data stored in the character attributes.

Parameters:
number_of_cassettes int (default: 10)

Number of cassettes (i.e., arrays of target sites)

size_of_cassette int (default: 3)

Number of editable target sites per cassette

mutation_rate Union[float, List[float]] (default: 0.01)

Exponential parameter for the Cas9 cutting rate. Can be a float, or a list of floats of length size_of_cassette or number_of_cassettes * size_of_cassette:

float - all sites mutate at the specified rate.

list of length size_of_cassette - each site will mutate at the specified rate across all cassettes.

list of length number_of_cassettes * size_of_cassette - each site in each cassette will mutate at its own specified rate.
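The three accepted forms of mutation_rate can all be reduced to one rate per character. The following is a minimal sketch of that expansion, assuming characters are laid out cassette by cassette (all sites of cassette 0, then cassette 1, and so on); expand_rates is a hypothetical helper, not a cassiopeia function.

```python
from typing import List, Sequence, Union


def expand_rates(
    mutation_rate: Union[float, Sequence[float]],
    number_of_cassettes: int,
    size_of_cassette: int,
) -> List[float]:
    """Return one mutation rate per character (hypothetical helper)."""
    n_characters = number_of_cassettes * size_of_cassette
    if isinstance(mutation_rate, (int, float)):
        # A single float applies to every site.
        return [float(mutation_rate)] * n_characters
    rates = list(mutation_rate)
    if len(rates) == size_of_cassette:
        # One rate per position, repeated for every cassette.
        return rates * number_of_cassettes
    if len(rates) == n_characters:
        # One rate per site and cassette, used as given.
        return rates
    raise ValueError(
        "mutation_rate must be a float or a list of length "
        "size_of_cassette or number_of_cassettes * size_of_cassette"
    )


per_site = expand_rates([0.1, 0.2, 0.3], number_of_cassettes=2, size_of_cassette=3)
```

A list of any other length is rejected here with a ValueError; the real class documents a DataSimulatorError for broken assumptions.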

state_generating_distribution Callable[[], float] (default: <function Cas9LineageTracingDataSimulator.<lambda>>)

Distribution from which to simulate state likelihoods. This is only used if state priors (state_priors) are not specified to the simulator.

number_of_states int (default: 100)

Number of states to simulate

state_priors Optional[Dict[int, float]] (default: None)

An optional dictionary mapping states to their prior probabilities. Can also be a list of dictionaries of length size_of_cassette or number_of_cassettes * size_of_cassette:

dict - all sites will have the same prior probabilities.

list of length size_of_cassette - each site will have the specified prior probabilities across all cassettes.

list of length number_of_cassettes * size_of_cassette - each site in each cassette will have its own specified prior probabilities.

If this argument is None, state priors are instead simulated from state_generating_distribution.

heritable_silencing_rate float (default: 0.0001)

Silencing rate for the cassettes, per node, simulating heritable missing data events.

stochastic_silencing_rate float (default: 0.01)

Rate at which to randomly drop out cassettes, to simulate dropout due to low sensitivity of assays.

heritable_missing_data_state int (default: -1)

Integer representing data that has gone missing due to a heritable event (i.e. Cas9 resection or heritable silencing).

stochastic_missing_data_state int (default: -1)

Integer representing data that has gone missing due to the stochastic dropout from single-cell assay sensitivity.

random_seed Optional[int] (default: None)

Numpy random seed to use for deterministic simulations. Note that the numpy random seed gets set during every call to overlay_data, thereby producing deterministic simulations every time this function is called.
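The practical consequence of re-seeding on every call is that repeated calls produce identical output. The sketch below demonstrates the idea with Python's standard-library random module and a local generator instead of the NumPy global seed; the class SeededOverlaySketch and its overlay method are hypothetical stand-ins, not the simulator itself.

```python
import random
from typing import List, Optional


class SeededOverlaySketch:
    """Toy stand-in that re-seeds its generator on every call."""

    def __init__(self, random_seed: Optional[int] = None):
        self.random_seed = random_seed

    def overlay(self, n_sites: int, cut_probability: float = 0.5) -> List[bool]:
        # Re-seeding at the start of each call makes every call start
        # from the same generator state when a seed is provided.
        rng = random.Random(self.random_seed)
        return [rng.random() < cut_probability for _ in range(n_sites)]


sim = SeededOverlaySketch(random_seed=42)
first = sim.overlay(10)
second = sim.overlay(10)  # same draws as `first`
```

To obtain independent replicates under this scheme, one would construct simulators with different seeds rather than call the same one repeatedly.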

collapse_sites_on_cassette bool (default: True)

Whether or not to collapse cuts that occur in the same cassette in a single iteration. This option only takes effect when size_of_cassette is greater than 1. Defaults to True.

Raises:

DataSimulatorError if assumptions about the system are broken.

Methods

overlay_data(tree)

Overlays Cas9-based lineage tracing data onto the CassiopeiaTree.

Parameters:
tree CassiopeiaTree

Input CassiopeiaTree

collapse_sites(character_array, cuts)

Collapses cassettes.

Given a character array and a new set of cuts that Cas9 is inducing, this function will infer which cuts occur within a given cassette and collapse the sites between the two cuts.

Parameters:
character_array List[int]

Character array in progress

cuts List[int]

Sites in the character array that are being cut.

Return type:

Tuple[List[int], List[int]]

Returns:

The updated character array and sites that are not part of a cassette collapse.
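The collapse logic can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the library's implementation: cassettes are taken to be contiguous blocks of size_of_cassette characters, the collapsed span includes both cut sites, and the function name collapse_sites_sketch and its extra parameters are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def collapse_sites_sketch(
    character_array: List[int],
    cuts: List[int],
    size_of_cassette: int,
    missing_state: int = -1,
) -> Tuple[List[int], List[int]]:
    """Collapse spans between multiple cuts on the same cassette."""
    updated = list(character_array)

    # Group cut sites by the cassette they fall in.
    cuts_by_cassette: Dict[int, List[int]] = defaultdict(list)
    for site in cuts:
        cuts_by_cassette[site // size_of_cassette].append(site)

    remaining: List[int] = []
    for cassette in sorted(cuts_by_cassette):
        sites = cuts_by_cassette[cassette]
        if len(sites) > 1:
            # Two or more cuts: mark everything from the leftmost to the
            # rightmost cut (inclusive, by assumption) as missing.
            for site in range(min(sites), max(sites) + 1):
                updated[site] = missing_state
        else:
            # A lone cut is left for a later step to assign a state.
            remaining.append(sites[0])
    return updated, remaining


# Two cassettes of three sites: cuts at 0 and 2 resect cassette 0,
# while the single cut at site 4 survives as an ordinary cut.
collapsed, uncollapsed_cuts = collapse_sites_sketch(
    [0, 0, 0, 0, 0, 0], [0, 2, 4], size_of_cassette=3
)
```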

introduce_states(character_array, cuts)

Adds states to character array.

New states are added to the character array at the predefined cut locations.

Parameters:
character_array List[int]

Character array

cuts List[int]

Loci being cut

Return type:

List[int]

Returns:

An updated character array.
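As a rough picture of this step, each cut site can be assigned a state sampled in proportion to the state priors. The sketch below assumes a single shared prior dictionary and uses the standard-library random.Random.choices; introduce_states_sketch and its priors and rng parameters are hypothetical and do not appear in the documented signature.

```python
import random
from typing import Dict, List


def introduce_states_sketch(
    character_array: List[int],
    cuts: List[int],
    priors: Dict[int, float],
    rng: random.Random,
) -> List[int]:
    """Assign a sampled state to every cut site (hypothetical helper)."""
    updated = list(character_array)
    states = list(priors)
    weights = [priors[state] for state in states]
    for site in cuts:
        # Sample one state with probability proportional to its prior.
        updated[site] = rng.choices(states, weights=weights, k=1)[0]
    return updated


rng = random.Random(1)
example_priors = {1: 0.7, 2: 0.2, 3: 0.1}
edited = introduce_states_sketch([0, 0, 0, 0], [1, 3], example_priors, rng)
```

Sites that are not in cuts keep their previous value, so only the newly cut loci change.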

silence_cassettes(character_array, silencing_rate, missing_state=-1)

Silences cassettes.

Using the specified silencing rate, this function will randomly select cassettes to silence.

Parameters:
character_array List[int]

Character array

silencing_rate float

Silencing rate.

missing_state int (default: -1)

State to use for encoding missing data.

Return type:

List[int]

Returns:

An updated character array.
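A minimal sketch of cassette-level silencing is shown below. It assumes each cassette is silenced independently with probability silencing_rate and that cassettes are contiguous blocks of size_of_cassette characters; silence_cassettes_sketch and its size_of_cassette and rng parameters are hypothetical additions for illustration.

```python
import random
from typing import List


def silence_cassettes_sketch(
    character_array: List[int],
    silencing_rate: float,
    size_of_cassette: int,
    rng: random.Random,
    missing_state: int = -1,
) -> List[int]:
    """Silence whole cassettes at random (hypothetical helper)."""
    updated = list(character_array)
    for start in range(0, len(updated), size_of_cassette):
        # One draw per cassette: all of its sites are lost together.
        if rng.random() < silencing_rate:
            end = min(start + size_of_cassette, len(updated))
            for site in range(start, end):
                updated[site] = missing_state
    return updated


rng = random.Random(0)
untouched = silence_cassettes_sketch([1, 0, 2, 0, 3, 0], 0.0, 3, rng)
all_silenced = silence_cassettes_sketch([1, 0, 2, 0, 3, 0], 1.0, 3, rng)
```

Because the draw is made per cassette rather than per site, missing data appears in blocks, which is the correlation between characters that the cassette architecture introduces.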

get_cassettes()

Obtain indices of individual cassettes.

A helper function that returns the indices that correspond to the independent cassettes in the experiment.

Return type:

List[int]

Returns:

An array of indices corresponding to the start positions of the cassettes.
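For intuition, the start positions can be computed directly from the recorder architecture. This sketch assumes cassettes occupy contiguous, equally sized blocks of the character array; cassette_starts is a hypothetical standalone function, whereas the documented method takes no arguments and reads these values from the simulator.

```python
from typing import List


def cassette_starts(number_of_cassettes: int, size_of_cassette: int) -> List[int]:
    """Index of the first character of each cassette (hypothetical helper)."""
    return [i * size_of_cassette for i in range(number_of_cassettes)]


# Default architecture: 10 cassettes of 3 sites each, 30 characters total.
starts = cassette_starts(10, 3)
```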