cassiopeia.pp.align_sequences

cassiopeia.pp.align_sequences(queries, ref_filepath=None, ref=None, gap_open_penalty=20, gap_extend_penalty=1, method='local', n_threads=1)

Align reads to the TargetSite reference.

Takes a DataFrame of queries, each mapping a cellBC-UMI to a sequence of interest, and aligns each sequence to a reference sequence. Either local or global alignment may be performed, depending on the method argument. The defaults for the gap open and gap extend penalties were selected via in-silico simulation (and are functionally equivalent to the values used in the GESTALT technology described in McKenna et al., 2016). The output consists of the best alignment score and the CIGAR string storing the indel locations in the query sequence.
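To illustrate what a CIGAR string encodes, the sketch below walks one and reports each insertion and deletion with its offset on the reference. It is not part of cassiopeia: `indels_from_cigar` and `ref_start` are hypothetical names, the operation codes are the standard SAM ones, and the alignment is assumed to start at reference position 0 unless `ref_start` says otherwise.

```python
import re

# One CIGAR operation: a run length followed by a SAM operation code.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def indels_from_cigar(cigar, ref_start=0):
    """Return (kind, reference_position, length) for each indel in a CIGAR string.

    Hypothetical helper for illustration only.
    """
    indels = []
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op == "I":
            indels.append(("insertion", ref_pos, length))
        elif op == "D":
            indels.append(("deletion", ref_pos, length))
        # M, D, N, = and X consume reference bases; I, S, H and P do not.
        if op in "MDN=X":
            ref_pos += length
    return indels


events = indels_from_cigar("10M2D5M3I8M")
print(events)  # [('deletion', 10, 2), ('insertion', 17, 3)]
```

Here the 2-base deletion begins after the first 10 matched reference bases, and the 3-base insertion sits at reference offset 17 because insertions do not advance the reference coordinate.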

Parameters:
queries DataFrame

DataFrame storing a list of sequences to align.

ref_filepath Optional[str] (default: None)

Filepath to the reference FASTA.

ref Optional[str] (default: None)

Reference sequence.

gap_open_penalty float (default: 20)

Gap open penalty.

gap_extend_penalty float (default: 1)

Gap extension penalty.

method Literal['local', 'global'] (default: 'local')

Alignment algorithm to use. Can be either “local” to perform local alignment using Smith-Waterman or “global” to perform global alignment using Needleman-Wunsch.

n_threads int (default: 1)

Number of threads to use.
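The difference between the two method options can be seen in a small score-only dynamic program. This is a simplified sketch, not the implementation behind this function: it uses a single linear gap penalty rather than separate gap open and gap extend penalties, and `alignment_score`, `match`, `mismatch`, and `gap` are hypothetical names chosen for the example.

```python
def alignment_score(query, ref, method="local", match=1, mismatch=-1, gap=-2):
    """Score-only alignment with a linear gap penalty (illustrative sketch).

    method="local" follows Smith-Waterman: cell scores are floored at 0 and
    the best cell anywhere in the matrix is returned.
    method="global" follows Needleman-Wunsch: both sequences must be aligned
    end to end, so the bottom-right cell is returned.
    """
    if method not in ("local", "global"):
        raise ValueError("method must be 'local' or 'global'")
    local = method == "local"

    cols = len(ref) + 1
    # First row: zeros for local, accumulated gap penalties for global.
    prev = [0 if local else j * gap for j in range(cols)]
    best = 0

    for i in range(1, len(query) + 1):
        curr = [0 if local else i * gap]
        for j in range(1, cols):
            sub = match if query[i - 1] == ref[j - 1] else mismatch
            score = max(prev[j - 1] + sub, prev[j] + gap, curr[j - 1] + gap)
            if local:
                score = max(score, 0)
                best = max(best, score)
            curr.append(score)
        prev = curr

    return best if local else prev[-1]


local_score = alignment_score("ACGT", "TTACGTTT", method="local")
global_score = alignment_score("ACGT", "TTACGTTT", method="global")
print(local_score, global_score)  # 4 -4
```

The local score counts only the matching ACGT core, while the global score also pays for the four unaligned reference bases, which is why local alignment tends to suit a query that covers only part of the reference.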

Return type:

DataFrame

Returns:

A DataFrame mapping each sequence name to the CIGAR string, quality, and original query sequence.

Raises:

PreprocessError – if both or neither of ref_filepath and ref are provided, or if the method is not “local” or “global”.
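The argument rules described under Raises can be sketched as a standalone check. This only mirrors the documented conditions and is not the library's code: `SketchPreprocessError` is a local stand-in for the library's exception so that the example runs without cassiopeia installed, and `check_alignment_args` is a hypothetical helper.

```python
class SketchPreprocessError(Exception):
    """Local stand-in for the library's PreprocessError (assumption)."""


def check_alignment_args(ref_filepath=None, ref=None, method="local"):
    """Raise if the reference or method arguments break the documented rules."""
    # Exactly one of ref_filepath and ref must be given: both-None and
    # both-set are each rejected.
    if (ref_filepath is None) == (ref is None):
        raise SketchPreprocessError(
            "Provide exactly one of ref_filepath or ref."
        )
    if method not in ("local", "global"):
        raise SketchPreprocessError(
            "method must be either 'local' or 'global'."
        )


def is_rejected(**kwargs):
    """Return True if the given argument combination raises."""
    try:
        check_alignment_args(**kwargs)
    except SketchPreprocessError:
        return True
    return False


print(is_rejected(ref="ACGT"))                      # False
print(is_rejected())                                # True
print(is_rejected(ref="ACGT", ref_filepath="r.fa")) # True
print(is_rejected(ref="ACGT", method="semi"))       # True
```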