cassiopeia.pp.collapse_umis

cassiopeia.pp.collapse_umis(bam_fp, output_directory, max_hq_mismatches=3, max_indels=2, method='cutoff', n_threads=1)

Collapses close UMIs together from a BAM file.

At a basic level, this function aggregates identical or near-identical reads to count how many times each UMI was read. It performs basic error correction: UMIs are collapsed together when the sequence reads themselves differ by at most max_hq_mismatches high-quality mismatches and max_indels indels. The collapsed UMI table is written to the output directory and returned as a DataFrame.
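The mismatch criterion can be illustrated with a minimal sketch. This is not Cassiopeia's implementation: the helper names `count_hq_mismatches` and `can_collapse` are hypothetical, the sequences are assumed to be already aligned and of equal length (so indels are not modeled), and a mismatch is assumed to count as high quality only when both bases have quality at least 30.

```python
from typing import Sequence

HQ_CUTOFF = 30  # quality threshold named in the `cutoff` method description


def count_hq_mismatches(
    seq_a: str,
    qual_a: Sequence[int],
    seq_b: str,
    qual_b: Sequence[int],
    cutoff: int = HQ_CUTOFF,
) -> int:
    """Count positions where the bases differ and both qualities pass the cutoff.

    Assumption: requiring BOTH qualities to pass is an illustrative choice,
    not a statement about the library's internals.
    """
    if not (len(seq_a) == len(qual_a) == len(seq_b) == len(qual_b)):
        raise ValueError("sequences and quality arrays must share one length")
    return sum(
        1
        for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b)
        if a != b and qa >= cutoff and qb >= cutoff
    )


def can_collapse(
    seq_a: str,
    qual_a: Sequence[int],
    seq_b: str,
    qual_b: Sequence[int],
    max_hq_mismatches: int = 3,
) -> bool:
    """Hypothetical check: two reads are close enough to collapse together."""
    return count_hq_mismatches(seq_a, qual_a, seq_b, qual_b) <= max_hq_mismatches
```

In this sketch, a low-quality mismatch never counts against the threshold, which is why two reads that disagree only at poorly sequenced positions would still be merged.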

Parameters:
bam_fp str

File path of the BAM file. The file name alone can be given if the BAM file already exists in the output directory

output_directory str

The directory to which the sorted BAM directory, the collapsed BAM directory, and the final collapsed table are written

max_hq_mismatches int (default: 3)

A threshold specifying the maximum number of high-quality mismatches allowed between the sequences of 2 aligned segments to be collapsed

max_indels int (default: 2)

A threshold specifying the maximum number of differing indels allowed between the sequences of 2 aligned segments to be collapsed

method Literal['cutoff', 'likelihood'] (default: 'cutoff')

Which method to use to form initial sequence clusters. Must be one of the following:

  • cutoff: Uses a hard quality-score cutoff of 30; any mismatches below this quality are ignored. Initial sequence clusters are formed by selecting the most common base at each position (with quality at least 30).

  • likelihood: Uses the error probability encoded in the quality score. Initial sequence clusters are formed by selecting the most probable base at each position.

n_threads int (default: 1)

Number of threads to use.
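The difference between the two `method` options can be sketched as two ways of building a consensus sequence. This is an illustration under stated assumptions, not the library's code: the function names are hypothetical, reads are assumed to be equal-length and already aligned, positions with no base passing the cutoff are reported as `N`, and the likelihood variant assumes the standard Phred relation (error probability 10^(-Q/10)) with errors spread evenly over the three other bases.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (sequence, per-base Phred qualities)


def consensus_cutoff(reads: List[Read], cutoff: int = 30) -> str:
    """Most common base per position, counting only bases with quality >= cutoff."""
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        counts = Counter(seq[i] for seq, qual in reads if qual[i] >= cutoff)
        # Assumption: emit "N" when no base at this position passes the cutoff.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


def consensus_likelihood(reads: List[Read]) -> str:
    """Most probable base per position under a simple independent-error model."""
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        scores = {}
        for base in "ACGT":
            total = 0.0
            for seq, qual in reads:
                # Phred score to error probability, capped at 0.75 so that a
                # quality of 0 is uninformative rather than a log(0) error.
                p_err = min(10 ** (-qual[i] / 10), 0.75)
                if seq[i] == base:
                    total += math.log(1 - p_err)
                else:
                    total += math.log(p_err / 3)
            scores[base] = total
        consensus.append(max(scores, key=scores.get))
    return "".join(consensus)


reads = [
    ("ACGT", [40, 40, 40, 30]),
    ("ACGA", [40, 40, 40, 25]),
    ("ACGA", [40, 40, 40, 25]),
    ("ACGA", [40, 40, 40, 25]),
]
cutoff_result = consensus_cutoff(reads)
likelihood_result = consensus_likelihood(reads)
```

With these example reads the two sketches disagree at the last position: only the single `T` passes the cutoff of 30, while three moderately confident `A` calls outweigh it under the likelihood model.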

Return type:

DataFrame

Returns:

A DataFrame of collapsed reads.
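A call following the signature above might look like the sketch below. The wrapper name `collapse_example` is hypothetical, the keyword values simply repeat the documented defaults, and the import is placed inside the function so the snippet can be loaded without Cassiopeia installed.

```python
def collapse_example(bam_fp: str, output_directory: str):
    """Hypothetical wrapper showing one way to call collapse_umis."""
    # Imported lazily so this sketch loads even where Cassiopeia is absent.
    import cassiopeia as cas

    collapsed = cas.pp.collapse_umis(
        bam_fp,
        output_directory,
        max_hq_mismatches=3,  # documented default
        max_indels=2,         # documented default
        method="cutoff",      # or "likelihood"
        n_threads=1,
    )
    # Per the documentation, the return value is a DataFrame of collapsed reads.
    return collapsed
```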