cassiopeia.pp.collapse_umis
- cassiopeia.pp.collapse_umis(bam_fp, output_directory, max_hq_mismatches=3, max_indels=2, method='cutoff', n_threads=1)
Collapses close UMIs together from a BAM file.
At a basic level, this function aggregates identical or near-identical reads to count how many times each UMI was read. It performs basic error correction: UMIs are collapsed together when their reads differ by at most a given number of high-quality mismatches and indels in the sequence read itself. The table of collapsed UMIs is written out as a DataFrame.
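The mismatch criterion can be illustrated with a minimal, self-contained sketch. This is not Cassiopeia's implementation: the helper names `count_hq_mismatches` and `can_collapse` are hypothetical, the reads are assumed to be already aligned and of equal length, indels are ignored, and a mismatch is assumed to count as high quality only when both base calls have a Phred score of at least 30.

```python
from typing import Sequence

HQ_THRESHOLD = 30  # assumed Phred cutoff for a "high quality" base call


def count_hq_mismatches(
    seq_a: str,
    qual_a: Sequence[int],
    seq_b: str,
    qual_b: Sequence[int],
    threshold: int = HQ_THRESHOLD,
) -> int:
    """Count positions where the reads differ and both base calls are high quality."""
    if not (len(seq_a) == len(qual_a) == len(seq_b) == len(qual_b)):
        raise ValueError("sequences and quality arrays must have equal length")
    return sum(
        1
        for a, qa, b, qb in zip(seq_a, qual_a, seq_b, qual_b)
        if a != b and qa >= threshold and qb >= threshold
    )


def can_collapse(seq_a, qual_a, seq_b, qual_b, max_hq_mismatches: int = 3) -> bool:
    """Hypothetical helper: True if the two reads are close enough to merge."""
    return count_hq_mismatches(seq_a, qual_a, seq_b, qual_b) <= max_hq_mismatches


# Two toy reads: they differ at index 2 (low quality in read_b) and index 7.
read_a = ("ACGTACGT", [35] * 8)
read_b = ("ACCTACGA", [35, 35, 10, 35, 35, 35, 35, 35])
n_mismatches = count_hq_mismatches(*read_a, *read_b)
```

In this toy case only the mismatch at the last position counts, because the other differing base was called with low quality, so the two reads would be merged under the default threshold of 3.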
- Parameters:
- bam_fp
File path of the BAM file. The file name alone is sufficient if the BAM file already exists in the output directory.
- output_directory
str
The output directory to which the sorted BAM directory, the collapsed BAM directory, and the final collapsed table are written.
- max_hq_mismatches
int
(default: 3) Maximum number of high-quality mismatches allowed between the sequences of two aligned segments for them to be collapsed.
- max_indels
int
(default: 2) Maximum number of differing indels allowed between the sequences of two aligned segments for them to be collapsed.
- method
Literal['cutoff', 'likelihood']
(default: 'cutoff') Method used to form initial sequence clusters. Must be one of the following:
* cutoff: Uses a hard quality score cutoff of 30; any mismatches below this quality are ignored. Initial sequence clusters are formed by selecting the most common base at each position (with quality at least 30).
* likelihood: Uses the error probability encoded in the quality score. Initial sequence clusters are formed by selecting the most probable base at each position.
- n_threads
int
(default: 1) Number of threads to use.
- Return type:
DataFrame
- Returns:
A DataFrame of collapsed reads.
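To illustrate how the two `method` options can differ, the sketch below builds a per-position consensus in both ways. It is an illustration under stated assumptions, not the library's code: the function names are hypothetical, reads are assumed to be pre-aligned and of equal length, positions with no base passing the cutoff are filled with `N`, and the likelihood variant assumes independent errors with probability 10^(-Q/10) spread uniformly over the three other bases.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Read = Tuple[str, Sequence[int]]  # (sequence, per-base Phred qualities)


def consensus_cutoff(reads: List[Read], min_quality: int = 30) -> str:
    """Most common base per position, counting only calls with quality >= min_quality."""
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        counts = Counter(seq[i] for seq, quals in reads if quals[i] >= min_quality)
        # Assumption: use 'N' when no base call passes the cutoff.
        consensus.append(counts.most_common(1)[0][0] if counts else "N")
    return "".join(consensus)


def consensus_likelihood(reads: List[Read]) -> str:
    """Most probable base per position under a simple independent-error model."""
    length = len(reads[0][0])
    consensus = []
    for i in range(length):
        best_base, best_score = "N", -math.inf
        for candidate in "ACGT":
            score = 0.0
            for seq, quals in reads:
                # Phred score -> error probability, capped at 0.75 (uninformative call).
                p_err = min(10 ** (-quals[i] / 10), 0.75)
                if seq[i] == candidate:
                    score += math.log(1 - p_err)
                else:
                    score += math.log(p_err / 3)
            if score > best_score:
                best_base, best_score = candidate, score
        consensus.append(best_base)
    return "".join(consensus)


# One read calls C at index 1 with quality 30; four reads call T with quality 20.
toy_reads: List[Read] = [("ACGT", [30, 30, 30, 30])] + [("ATGT", [30, 20, 30, 30])] * 4
cutoff_consensus = consensus_cutoff(toy_reads)
likelihood_consensus = consensus_likelihood(toy_reads)
```

With these toy reads, the cutoff variant ignores the four quality-20 calls and keeps `C`, while the likelihood variant lets the combined weight of those calls favor `T`.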