{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### Preprocessing sequencing reads (FASTQs) with Cassiopeia\n",
    "\n",
    "Cassiopeia provides an end-to-end pipeline to preprocess raw sequencing reads in FASTQ format into \"Character Matrices\" ready to pass into a phylogeny-inference algorithm. Given a set of FASTQs that contain RNA-seq reads from a target site construct (and NOT any endogenously expressed RNA), the preprocessing pipeline consists of the following steps.\n",
    "\n",
    "1. **convert**: Convert the FASTQs into an unmapped BAM, while parsing any barcode and/or UMI sequences into BAM tags.\n",
    "2. **filter_bam**: Filter reads with low-quality barcode and/or UMI sequences from the unmapped bam.\n",
    "3. **error_correct_cellbcs_to_whitelist**: For sequencing chemistries that have a predefined (cell) barcode whitelist, this steps perform correction of sequencing errors using this whitelist.\n",
    "4. **collapse**: Collapse reads into UMIs by constructing one or more consensus sequences for each UMI using the set of reads with that UMI.\n",
    "5. **resolve**: Resolve a single sequence for each UMI by choosing the most likely sequencing read to represent each UMI in a cell.\n",
    "6. **align**: Align sequences to the reference target site using the Smith-Waterman local alignment algorithm.\n",
    "7. **call_alleles**: Call alleles with respect to the reference target site and the alignment of a sequence, thereby reporting the set of mutations that a target site sequence contains.\n",
    "8. **error_correct_intbcs_to_whitelist**: For experimental designs for which each target site vector molecule has a unique barcode (\"intBC\"), and the set of intBCs present in the sequenced sample are known beforehand, this step performs sequencing error correction of these intBCs to the provided whitelist.\n",
    "9. **error_correct_umis**: Error-correct UMIs whose mutation data is identical and whose UMI barcode sequences are similar enough.\n",
    "10. **filter_molecule_table**: Filter UMIs that have conflicting allele information, too few reads, or do not meet other quality control criteria.\n",
    "11. **call_lineages**: Split up cells into clonal populations, based on their shared set of integration barcodes (intBCs).\n",
    "\n",
    "The final output of this pipeline is an \"AlleleTable\" which stores the mutation data and clonal population identity for each cell. This data structure can then be broken up into character matrices for phylogenetic inference.\n",
    "\n",
    "\n",
    "## Pipeline API\n",
    "All of the key modules of the preprocessing pipeline can be invoked by a call from `cassiopeia.pp`. Assuming the user would like to begin at the beginning of the pipeline, we'll start with the `convert` stage. You can find all documentation on our [main site](https://cassiopeia-lineage.readthedocs.io/en/latest/).\n",
    "\n",
    "An alternative to running the pipeline interactively is to take advantage of the command line tool `cassiopeia-preprocess`, which takes in a configuration file (for example in Cassiopeia/data/preprocess.cfg) and runs the pipeline end-to-end. For example, if you have a config called `example_config.cfg`, this can be invoked from the command line with:\n",
    "\n",
    "```bash\n",
    "cassiopeia-preprocess example_config.cfg\n",
    "```\n",
    "\n",
    "In this brief tutorial, we will preprocess a sample prepared with the 10X Genomics 3' version 3 chemistry and an intBC whitelist (that define the target site intBCs we know are present in the sample, obtained via DNA sequencing)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "\n",
    "import cassiopeia as cas"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "# The raw FASTQs\n",
    "input_files = [\n",
    "    \"R1.fastq.gz\", \"R2.fastq.gz\"\n",
    "]\n",
    "# The sample name, used for naming output files\n",
    "name = 'test_sample'\n",
    "# Directory to output results\n",
    "output_directory = \"test_preprocess_pipeline\"\n",
    "# Path to the target site reference sequence in FASTA format\n",
    "reference_filepath = \"../data/PCT48.ref.fasta\"\n",
    "# Number of threads to use, whenever parallelization is possible\n",
    "n_threads = 8\n",
    "# Whether to allow a single intBC to have multiple allele states\n",
    "# For chemistries for which barcode == cell, this should be `False`.\n",
    "allow_allele_conflicts = False\n",
    "# Verbosity of logging\n",
    "verbose = True\n",
    "\n",
    "cassiopeia.pp.setup(output_dir, verbose=verbose)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### convert\n",
    "\n",
    "Since we used the 10X Genomics 3' version 3 chemistry to prepare our samples, we provide `chemistry=10xv3`. Other supported chemistries are the following.\n",
    "\n",
    "* `dropseq`: Droplet-based scRNA-seq chemistry described in Macosco et al. 2015\n",
    "\n",
    "* `10xv2`: 10x Genomics 3' version 2\n",
    "\n",
    "* `10xv3`: 10x Genomics 3' version 3\n",
    "\n",
    "* `indropsv3`: inDrops version 3 by Zilionis et al. 2017\n",
    "\n",
    "* `slideseq2`: Slide-seq version 2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bam_fp = cas.pp.convert_fastqs_to_unmapped_bam(\n",
    "    input_files,\n",
    "    chemistry='10xv3',\n",
    "    output_directory=output_directory,\n",
    "    name=name,\n",
    "    n_threads=n_threads\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### filter_bam\n",
    "\n",
    "The `quality_threshold` parameter controls the minimum PHRED sequencing quality the barcode and UMI sequence must have for a read to pass filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bam_fp = cas.pp.filter_bam(\n",
    "    bam_fp,\n",
    "    output_directory=output_directory,\n",
    "    quality_threshold=10,\n",
    "    n_threads=n_threads,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### error_correct_cellbcs_to_whitelist\n",
    "\n",
    "The 10X Genomics 3' version 3 chemistry has a predefined barcode whitelist, to which we will correct our barcode sequences. For chemistries that do not have such a whitelist (such as Drop-seq), this step should be skipped.\n",
    "\n",
    "The `whitelist` argument may be a path to the whitelist plaintext file (with one barcode per line) or a Python list containing the whitelisted barcode sequences. Here, we downloaded the appropriate whitelist file from [here](https://github.com/10XGenomics/cellranger/raw/master/lib/python/cellranger/barcodes/3M-february-2018.txt.gz) and uncompressed it."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "bam_fp = cas.pp.error_correct_cellbcs_to_whitelist(\n",
    "    bam_fp,\n",
    "    whitelist='3M-february-2018.txt',\n",
    "    output_directory=output_directory,\n",
    "    n_threads=n_threads,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### collapse\n",
    "\n",
    "The `max_hq_mismatches` and `max_indels` arguments control the threshold with which to decide whether to assign a read to a proposed consensus sequence. The defaults (`3` and `2` respectively) should work best in most cases.\n",
    "\n",
    "The `method` argument may take one of two values: `cutoff` and `likelihood`. The former uses a hard PHRED qualtiy cutoff of 30 (and any mismatches below this quality are ignored). Consensus sequences are proposed by selecting the msot common base at each position (with quality at least 30). The latter is a likelihood-based approach that selects the consensus sequences based on what is the most probable base at each position."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.collapse_umis(\n",
    "    bam_fp,\n",
    "    output_directory=output_directory,\n",
    "    max_hq_mismatches=3,\n",
    "    max_indels=2,\n",
    "    method='likelihood',\n",
    "    n_threads=n_threads,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### resolve\n",
    "\n",
    "The `min_umi_per_cell` and `min_avg_reads_per_umi` specify filtering thresholds to filter cells. The former is the minimum number of UMIs a cell must have to pass filtering, and the latter is the minimum average number of reads per UMI a cell must have to pass filtering."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.resolve_umi_sequence(\n",
    "    umi_table,\n",
    "    output_directory=output_directory,\n",
    "    min_umi_per_cell=10,\n",
    "    min_avg_reads_per_umi=2.0,\n",
    "    plot=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### align\n",
    "\n",
    "The reference target site sequence must be provided as a FASTA file to the `ref_filepath` argument or as a string to the `ref` argument. The `gap_open_penalty` and `gap_extend_penalty` specify the gap open and extend penalties to use when aligning sequences. The provided defaults should work well for most cases."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.align_sequences(\n",
    "    umi_table,\n",
    "    ref_filepath=reference_filepath,\n",
    "    gap_open_penalty=20,\n",
    "    gap_extend_penalty=1,\n",
    "    n_threads=n_threads,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### call_alleles\n",
    "\n",
    "Same as with the \"align\" step, the reference target site sequence must be provided with either the `ref_filepath` or `ref` arguments. The following additional arguments must be provided.\n",
    "\n",
    "* `barcode_interval`: The start and end positions for the intBC, which is the barcode that uniquely identifies each target site molecule. The interval is represented as a tuple of the form `(start, end)`, using 0-indexing and start-inclusive/end-exclusive.\n",
    "\n",
    "* `cutsite_locations`: The (center) location of each cutsite, represented as a list of indices, one element for each cutsite.\n",
    "\n",
    "* `cutsite_width`: The number of nucleotides to the left and right of the cutsite location that indels can appear in.\n",
    "\n",
    "* `context`: Whether or not to use the nucleotides surrounding the indels to identify the indels.\n",
    "\n",
    "* `context_size`: The number of bases to the left and right to include as the context.\n",
    "\n",
    "For the target sites we used for this experiment, we have the following locations.\n",
    "\n",
    "* intBC located in the interval `(20, 34)`\n",
    "\n",
    "* cutsites at `[112, 166, 120]`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.call_alleles(\n",
    "    umi_table,\n",
    "    ref_filepath=reference_filepath,\n",
    "    barcode_interval=(20, 34),\n",
    "    cutsite_locations=[112, 166, 220],\n",
    "    cutsite_width=12,\n",
    "    context=True,\n",
    "    context_size=5,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### error_correct_intbcs_to_whitelist\n",
    "\n",
    "For experiments in which the intBC sequences that are present in the sample are not known beforehand, this step should be skipped.\n",
    "\n",
    "In our case, we do have an intBC whitelist, obtained from DNA sequencing. The `intbc_dist_thresh` specifies the maximum Levenshtein (edit) distance between the intBC sequence and whitelist to be correct."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "intbc_whitelist = [\n",
    "    'ATGATTTAACTACT', 'CGATTGGTCACTTA', 'CGTGAGTCTCTGAT', 'GAACCCACAATTCC',\n",
    "    'GAGTATATACCCTT', 'GCGTTTAGAATATT', 'GCCTTCAATTCCAA', 'TAACCAAGCCTACA',\n",
    "    'TTTCGTCGCTCTTC', 'CGCTATGGGGGGAA', 'CGATATCTTCAAGC', 'TCAGTGGGGTATTG', \n",
    "    'ACAATGCGTGTGGC',\n",
    "]\n",
    "umi_table = cas.pp.error_correct_intbcs_to_whitelist(\n",
    "    umi_table,\n",
    "    whitelist=intbc_whitelist,\n",
    "    intbc_dist_thresh=1\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### error_correct_umis\n",
    "\n",
    "The `max_umi_distance` specifies the maximum Hamming distance between two UMIs for one to be corrected to another."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.error_correct_umis(\n",
    "    umi_table,\n",
    "    max_umi_distance=2,\n",
    "    allow_allele_conflicts=allow_allele_conflicts,\n",
    "    n_threads=n_threads,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### filter_molecule_table\n",
    "\n",
    "The `min_umi_per_cell` and `min_avg_reads_per_umi` behave the same as the \"resolve\" step.\n",
    "\n",
    "See the [documentation](https://cassiopeia-lineage.readthedocs.io/en/latest/api/reference/cassiopeia.pp.filter_molecule_table.html#cassiopeia.pp.filter_molecule_table) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "umi_table = cas.pp.filter_molecule_table(\n",
    "    umi_table,\n",
    "    output_directory=output_directory,\n",
    "    min_umi_per_cell=10,\n",
    "    min_avg_reads_per_umi=2.0,\n",
    "    min_reads_per_umi=-1,\n",
    "    intbc_prop_thresh=0.5,\n",
    "    intbc_umi_thresh=10,\n",
    "    intbc_dist_thresh=1,\n",
    "    doublet_threshold=0.35,\n",
    "    allow_allele_conflicts=allow_allele_conflicts,\n",
    "    plot=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### call_lineage_groups\n",
    "\n",
    "The `min_umi_per_cell` and `min_avg_reads_per_umi` behave the same as the \"resolve\" step.\n",
    "\n",
    "See the [documentation](https://cassiopeia-lineage.readthedocs.io/en/latest/api/reference/cassiopeia.pp.call_lineage_groups.html#cassiopeia.pp.call_lineage_groups) for more details."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "allele_table = cas.pp.call_lineage_groups(\n",
    "    umi_table,\n",
    "    output_directory=output_directory,\n",
    "    min_umi_per_cell=10,\n",
    "    min_avg_reads_per_umi=2.0,\n",
    "    min_cluster_prop=0.005,\n",
    "    min_intbc_thresh=0.05,\n",
    "    inter_doublet_threshold=0.35,\n",
    "    kinship_thresh=0.25,\n",
    "    plot=True,\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>cellBC</th>\n",
       "      <th>intBC</th>\n",
       "      <th>allele</th>\n",
       "      <th>r1</th>\n",
       "      <th>r2</th>\n",
       "      <th>r3</th>\n",
       "      <th>lineageGrp</th>\n",
       "      <th>UMI</th>\n",
       "      <th>readCount</th>\n",
       "      <th>Sample</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "      <td>CCCCGTGCCTTCCT</td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td></td>\n",
       "      <td>4</td>\n",
       "      <td>15</td>\n",
       "      <td>186.0</td>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "      <td>CGACAATGTAGTTG</td>\n",
       "      <td>CTTTG[104:29D]TACGGGATAT[167:54D]CGGAGGATAT[16...</td>\n",
       "      <td>CTTTG[104:29D]TACGG</td>\n",
       "      <td>GATAT[167:54D]CGGAG</td>\n",
       "      <td>GATAT[167:54D]CGGAG</td>\n",
       "      <td>4</td>\n",
       "      <td>26</td>\n",
       "      <td>326.0</td>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "      <td>CGCGCGTCCGGGTC</td>\n",
       "      <td>CGCCG[111:1I]AAAAAACATAA[161:18D]CGTGAATTCG[No...</td>\n",
       "      <td>CGCCG[111:1I]AAAAAA</td>\n",
       "      <td>CATAA[161:18D]CGTGA</td>\n",
       "      <td>ATTCG[None]CGGAG</td>\n",
       "      <td>4</td>\n",
       "      <td>13</td>\n",
       "      <td>104.0</td>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "      <td>GACTTTAATGTACA</td>\n",
       "      <td>CCGAA[113:54D]CTCTGCCGAA[113:54D]CTCTGTAATT[21...</td>\n",
       "      <td>CCGAA[113:54D]CTCTG</td>\n",
       "      <td>CCGAA[113:54D]CTCTG</td>\n",
       "      <td>TAATT[219:2D]CGGAG</td>\n",
       "      <td>4</td>\n",
       "      <td>18</td>\n",
       "      <td>187.0</td>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "      <td>GATGGACATTGGGG</td>\n",
       "      <td>CCGAA[113:50D]ATATCCCGAA[113:50D]ATATCATTCG[No...</td>\n",
       "      <td>CCGAA[113:50D]ATATC</td>\n",
       "      <td>CCGAA[113:50D]ATATC</td>\n",
       "      <td>ATTCG[None]CGGAG</td>\n",
       "      <td>4</td>\n",
       "      <td>19</td>\n",
       "      <td>178.0</td>\n",
       "      <td>AAACCTGAGGCTAGAC-1</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "               cellBC           intBC  \\\n",
       "0  AAACCTGAGGCTAGAC-1  CCCCGTGCCTTCCT   \n",
       "1  AAACCTGAGGCTAGAC-1  CGACAATGTAGTTG   \n",
       "2  AAACCTGAGGCTAGAC-1  CGCGCGTCCGGGTC   \n",
       "3  AAACCTGAGGCTAGAC-1  GACTTTAATGTACA   \n",
       "4  AAACCTGAGGCTAGAC-1  GATGGACATTGGGG   \n",
       "\n",
       "                                              allele                   r1  \\\n",
       "0                                                                           \n",
       "1  CTTTG[104:29D]TACGGGATAT[167:54D]CGGAGGATAT[16...  CTTTG[104:29D]TACGG   \n",
       "2  CGCCG[111:1I]AAAAAACATAA[161:18D]CGTGAATTCG[No...  CGCCG[111:1I]AAAAAA   \n",
       "3  CCGAA[113:54D]CTCTGCCGAA[113:54D]CTCTGTAATT[21...  CCGAA[113:54D]CTCTG   \n",
       "4  CCGAA[113:50D]ATATCCCGAA[113:50D]ATATCATTCG[No...  CCGAA[113:50D]ATATC   \n",
       "\n",
       "                    r2                   r3  lineageGrp  UMI  readCount  \\\n",
       "0                                                     4   15      186.0   \n",
       "1  GATAT[167:54D]CGGAG  GATAT[167:54D]CGGAG           4   26      326.0   \n",
       "2  CATAA[161:18D]CGTGA     ATTCG[None]CGGAG           4   13      104.0   \n",
       "3  CCGAA[113:54D]CTCTG   TAATT[219:2D]CGGAG           4   18      187.0   \n",
       "4  CCGAA[113:50D]ATATC     ATTCG[None]CGGAG           4   19      178.0   \n",
       "\n",
       "               Sample  \n",
       "0  AAACCTGAGGCTAGAC-1  \n",
       "1  AAACCTGAGGCTAGAC-1  \n",
       "2  AAACCTGAGGCTAGAC-1  \n",
       "3  AAACCTGAGGCTAGAC-1  \n",
       "4  AAACCTGAGGCTAGAC-1  "
      ]
     },
     "execution_count": 12,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "allele_table.head(5)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.6.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}