Package overview¶
The tribal package can be imported into a Python script or Jupyter notebook, or it can be used as a command line tool.
The following functions and classes are accessible via the tribal package:
Name | Description | Type |
---|---|---|
preprocess | preprocess the input data into the correct format for tribal by finding a multiple sequence alignment and parsimony forest for each clonotype | function |
BaseTree | a class to model the lineage tree topology | class |
Clonotype | a dataclass to structure the input data for tribal | class |
Tribal | the main class to run the tribal algorithm and fit the input data | class |
LineageTree | a class to model an inferred B cell lineage tree | class |
LineageTreeList | an extension of the list class to contain a list of B cell lineage trees | class |
The API provides additional details on each of these items.
Example data¶
In addition to the above functions and classes, the following example data can be imported to help users better understand the data formatting and package use.
Name | Description | Type |
---|---|---|
df | input sequencing data | pandas.DataFrame |
roots | input germline roots for sequencing data | pandas.DataFrame |
probabilities | example isotype transition probability matrix | numpy.ndarray |
clonotypes | dictionary of Clonotype objects | dict |
lineage_tree | an example inferred B cell lineage tree | LineageTree |
lineage_tree_list | an example inferred B cell lineage tree list | LineageTreeList |
See Data for more details on the input format for the data.
Load and view the example input data:
from tribal import df, roots
print(df.head())
print(roots.head())
Load and view the example output data from preprocess:
from tribal import clonotypes
for key, clonotype in clonotypes.items():
    print(key)
    print(clonotype)
Load and view the example output data from the tribal algorithm:
from tribal import probabilities, lineage_tree, lineage_tree_list
print(probabilities)
print(lineage_tree)
print(lineage_tree_list)
Using the package¶
Here is a brief walkthrough of how to use the functionality of the tribal package.
First, load the package:
import tribal
or, alternatively, load specific functions, classes, or example data:
from tribal import preprocess, df, roots
Preprocessing¶
The preprocess function will:
- filter out clonotypes that are below the minimum number of cells
- filter out cells whose V alleles differ from the majority of the clonotype
- perform a multiple sequence alignment (MSA) for each valid clonotype using mafft
- infer a parsimony forest for each clonotype given the MSA using dnapars
See preprocess for more details.
from tribal import preprocess, df, roots
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
clonotypes, df_filt = preprocess(df, roots, isotypes=isotypes,
                                 min_size=4, use_light_chain=True,
                                 cores=3, verbose=True)
The output dictionary clonotypes is the formatted input to tribal. To view the formatted example data without running the preprocessing step, run the following.
from tribal import clonotypes
for key, clonotype in clonotypes.items():
    print(clonotype)
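The two filtering steps listed above can be sketched in plain Python. This is only an illustration of the idea, not the package's implementation: the record layout, the helper name filter_cells, and the order in which the two filters are applied are all assumptions made for the example.

```python
from collections import Counter, defaultdict

# Hypothetical toy records: (cell id, clonotype id, heavy chain V allele).
cells = [
    ("c1", "clono_A", "IGHV1-2*01"),
    ("c2", "clono_A", "IGHV1-2*01"),
    ("c3", "clono_A", "IGHV3-7*02"),
    ("c4", "clono_A", "IGHV1-2*01"),
    ("c5", "clono_A", "IGHV1-2*01"),
    ("c6", "clono_B", "IGHV4-34*01"),
    ("c7", "clono_B", "IGHV4-34*01"),
]


def filter_cells(cells, min_size):
    """Drop small clonotypes, then drop cells whose V allele is not the majority."""
    by_clonotype = defaultdict(list)
    for cell_id, clonotype_id, v_allele in cells:
        by_clonotype[clonotype_id].append((cell_id, v_allele))

    kept = {}
    for clonotype_id, members in by_clonotype.items():
        # Assumed order: the size filter is applied before the allele filter.
        if len(members) < min_size:
            continue
        majority_allele, _ = Counter(v for _, v in members).most_common(1)[0]
        kept[clonotype_id] = [cid for cid, v in members if v == majority_allele]
    return kept


filtered = filter_cells(cells, min_size=4)
print(filtered)
```

With min_size=4, the two-cell clonotype is dropped and the one cell with a minority V allele is removed from the remaining clonotype.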
Running TRIBAL¶
Tribal takes the dictionary of Clonotype objects as input and can be run in two modes.
1. refinement (recommended): the full algorithm, where the CSR likelihood is optimized by solving the most parsimonious tree refinement problem.
2. score: the input parsimony lineage trees are not refined, and the isotypes of the internal nodes are inferred using weighted parsimony via the Sankoff algorithm, with the isotype transition probabilities as weights.
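To illustrate the idea behind score mode, here is a minimal, self-contained sketch of the Sankoff algorithm on a toy tree. It is not the code used by tribal. The three-isotype matrix P, the tree, the leaf labels, the use of negative log probabilities as edge weights, and the choice to fix the root at isotype 0 are all assumptions made for the example.

```python
import math

# Hypothetical transition matrix: P[s][t] is the probability of switching
# from isotype s to isotype t. Zeros below the diagonal encode the assumption
# that class switching cannot be reversed.
P = [
    [0.6, 0.3, 0.1],
    [0.0, 0.7, 0.3],
    [0.0, 0.0, 1.0],
]
# Toy rooted tree as a child map; leaves carry observed isotypes.
children = {"root": ["u", "c"], "u": ["a", "b"]}
leaf_isotype = {"a": 1, "b": 2, "c": 0}


def weight(s, t):
    """Edge weight for a switch from s to t: the negative log probability."""
    return -math.log(P[s][t]) if P[s][t] > 0 else math.inf


def bottom_up(v, cost):
    """Fill cost[v][s], the cheapest labeling of v's subtree when v has state s."""
    n = len(P)
    if v in leaf_isotype:
        cost[v] = [0.0 if s == leaf_isotype[v] else math.inf for s in range(n)]
        return
    for c in children[v]:
        bottom_up(c, cost)
    cost[v] = [
        sum(min(weight(s, t) + cost[c][t] for t in range(n)) for c in children[v])
        for s in range(n)
    ]


def top_down(v, labels, cost):
    """Backtrace: give each child the state that achieves the minimum."""
    for c in children.get(v, []):
        s = labels[v]
        labels[c] = min(range(len(P)), key=lambda t: weight(s, t) + cost[c][t])
        top_down(c, labels, cost)


cost = {}
bottom_up("root", cost)
labels = {"root": 0}  # assumption: the root is fixed at isotype 0
top_down("root", labels, cost)
score = cost["root"][0]
print(score, labels)
```

Because the weights are negative log probabilities, minimizing the total weight is the same as maximizing the product of the transition probabilities along the edges of the tree.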
from tribal import Tribal, clonotypes
#the clonotype data contains the following isotypes encoded from 0 to 7
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
tr = Tribal(n_isotypes=len(isotypes), restarts=2, niter=15, verbose=True)
#run in refinement mode (recommended)
shm_score, csr_likelihood, best_trees, probabilities = tr.fit(clonotypes=clonotypes,
                                                              mode="refinement", cores=3)
#run in score mode to infer isotypes using weighted parsimony (Sankoff algorithm) without tree refinement
shm_score, csr_likelihood, best_trees, probabilities = tr.fit(clonotypes=clonotypes,
                                                              mode="score", cores=3)
shm_score and csr_likelihood are floats representing the corresponding SHM or CSR objective values.
probabilities is a numpy array of shape (n_isotypes, n_isotypes) containing the inferred isotype transition probabilities.
best_trees is a dictionary with the clonotype id as key and, as value, a LineageTreeList containing all inferred optimal B cell lineage trees for that clonotype.
Additionally, Tribal can be fit with a user-provided isotype transition probability matrix:
from tribal import Tribal, clonotypes, probabilities
#the clonotype data contains the following isotypes encoded from 0 to 7
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
tr = Tribal(n_isotypes=len(isotypes), restarts=2, niter=15, verbose=True)
#specifying the transmat argument will skip the step of inferring isotype transition probabilities
shm_score, csr_likelihood, best_trees, probabilities = tr.fit(clonotypes=clonotypes,
                                                              mode="refinement", transmat=probabilities,
                                                              cores=3)
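Before passing your own matrix, it can be worth checking that it looks like a valid transition matrix. The helper below is a hypothetical sketch and not part of tribal. It assumes a row-stochastic convention, where row s holds the probabilities of switching from isotype s and therefore sums to one; consult the API documentation for the convention tribal actually expects.

```python
def check_transmat(transmat, n_isotypes, tol=1e-8):
    """Return a list of problems found in a candidate transition matrix."""
    if len(transmat) != n_isotypes or any(len(row) != n_isotypes for row in transmat):
        return [f"expected shape ({n_isotypes}, {n_isotypes})"]
    problems = []
    for s, row in enumerate(transmat):
        if any(p < 0 or p > 1 for p in row):
            problems.append(f"row {s} has entries outside [0, 1]")
        # Assumed convention: each row is a probability distribution.
        if abs(sum(row) - 1.0) > tol:
            problems.append(f"row {s} sums to {sum(row):.4f}, not 1")
    return problems


print(check_transmat([[0.5, 0.5], [0.0, 1.0]], n_isotypes=2))
print(check_transmat([[0.5, 0.4], [0.0, 1.0]], n_isotypes=2))
```

The function iterates over rows, so it accepts nested lists as well as a two-dimensional numpy array.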
Exploring and visualizing the inferred B cell lineage trees¶
tribal fit returns a list of all optimal B cell lineage trees for each clonotype.
Specifically, in the above examples best_trees is a dictionary, with clonotype as key, of LineageTreeLists.
A B cell lineage tree for tribal is a rooted tree with nodes labeled by BCR sequences (concatenated heavy and optional light chains) and by isotypes. The LineageTree class also holds the current SHM parsimony score (shm_obj) and CSR likelihood (csr_obj).
A LineageTree can be visualized as a png or pdf via the draw function. Nodes are colored by isotype via the default color_encoding.
color_encoding = {
    -1: "#FFFFFF",
    0: "#808080",
    1: "#FFEDA0",
    2: "#FD8D3C",
    3: "#E31A1C",
    4: "#800026",
    5: "#6A51A3",
    6: "#74C476",
    7: "mediumseagreen",
    8: "darkgoldenrod",
    9: "thistle1"
}
show_legend=True provides a legend on the visualization depicting the color encoding. If isotype_encoding is provided in the form of a list, e.g., ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2'], then the legend will use the isotype labels. Otherwise, the numerical encoding is used.
The show_labels argument can be used to toggle the labeling of the sequences on and off.
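The relationship between the numerical encoding, the optional isotype labels, and the colors can be sketched as follows. The helper legend_entries is hypothetical and only mimics the behavior described above; it is not how draw builds its legend.

```python
# A subset of the default color encoding shown above.
color_encoding = {-1: "#FFFFFF", 0: "#808080", 1: "#FFEDA0", 2: "#FD8D3C"}


def legend_entries(states, color_encoding, isotype_encoding=None):
    """Pair each isotype state with a legend label and its color."""
    entries = []
    for s in sorted(states):
        if isotype_encoding is not None and 0 <= s < len(isotype_encoding):
            label = isotype_encoding[s]  # use the isotype name when one is given
        else:
            label = str(s)  # otherwise fall back to the numerical encoding
        entries.append((label, color_encoding[s]))
    return entries


print(legend_entries({0, 2}, color_encoding, ["IGHM", "IGHG3", "IGHG1"]))
print(legend_entries({-1, 1}, color_encoding))
```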
from tribal import lineage_tree
print(lineage_tree)
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
#output visualization as a png with sequence labels included
lineage_tree.draw(fname="example_tree.png",
                  isotype_encoding=isotypes,
                  show_legend=True,
                  show_labels=True,
                  color_encoding=None,
                  )
#output visualization as a pdf with sequence labels excluded
lineage_tree.draw(fname="example_tree.pdf",
                  isotype_encoding=isotypes,
                  show_legend=True,
                  show_labels=False,
                  color_encoding=None,
                  )
The output can also be saved as a dot file instead of a png or pdf. Use the dot argument to indicate that the file should be written as a dot file.
from tribal import lineage_tree
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
#output visualization as a dot file with sequence labels included
lineage_tree.draw(fname="example_tree.dot",
                  isotype_encoding=isotypes,
                  show_legend=True,
                  show_labels=True,
                  color_encoding=None,
                  dot=True
                  )
Use the write function to write the lineage tree data to files, including:
1. The sequences as a FASTA file
2. The isotypes as a csv file
3. The tree as a png
4. The edge list of the lineage tree
from tribal import lineage_tree
lineage_tree.write("lineage_tree_files")
You can also pass the corresponding Clonotype object to use the stored isotype encoding for the clonotype.
from tribal import lineage_tree, clonotypes
clonotype = clonotypes[lineage_tree.clonotype]
lineage_tree.write("lineage_tree_files", clonotype=clonotype)
#an additional label to append to the file names can optionally be provided
lineage_tree.write("lineage_tree_files", clonotype=clonotype, tree_label="best")
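For readers unfamiliar with the output types, the sketch below shows what a FASTA file and an edge list look like for a toy tree. The helper names, the node names, and the comma separator in the edge list are assumptions for illustration; the exact file layout produced by write may differ.

```python
def to_fasta(sequences):
    """Format a {name: sequence} mapping as FASTA text."""
    return "".join(f">{name}\n{seq}\n" for name, seq in sequences.items())


def to_edge_list(edges, sep=","):
    """Format (parent, child) pairs as one edge per line."""
    return "".join(f"{u}{sep}{v}\n" for u, v in edges)


# Hypothetical toy data.
sequences = {"seq_1": "ACGT", "seq_2": "ACGA"}
edges = [("root", "seq_1"), ("root", "seq_2")]

print(to_fasta(sequences))
print(to_edge_list(edges))
```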
The LineageTree class also provides the ability to perform preorder and postorder traversals of the nodes, or to iterate through all the nodes or edges in the lineage tree.
from tribal import lineage_tree
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
#preorder traversal
for n in lineage_tree.preorder_traversal():
    print(f"Node {n}\nBCR Sequence:{lineage_tree.sequences[n]}\nIsotype:{isotypes[lineage_tree.isotypes[n]]}")
#postorder traversal
for n in lineage_tree.postorder_traversal():
    print(f"Node {n}\nBCR Sequence:{lineage_tree.sequences[n]}\nIsotype:{isotypes[lineage_tree.isotypes[n]]}")
#iterate over nodes
for n in lineage_tree.nodes():
    print(f"Node {n}\nBCR Sequence:{lineage_tree.sequences[n]}\nIsotype:{isotypes[lineage_tree.isotypes[n]]}")
#iterate over edges
for u, v in lineage_tree.edges():
    print(f"{u}->{v}")
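If the two traversal orders are unfamiliar, the following standalone sketch shows the difference on a toy tree stored as a child map. The tree and the function names are made up for the example and do not use the LineageTree class.

```python
# Hypothetical toy tree: each internal node maps to its list of children.
children = {"root": ["u", "c"], "u": ["a", "b"]}


def preorder(children, root):
    """Visit a node before any of its descendants."""
    stack = [root]
    while stack:
        v = stack.pop()
        yield v
        # Push children in reverse so the leftmost child is visited first.
        stack.extend(reversed(children.get(v, [])))


def postorder(children, root):
    """Visit a node only after all of its descendants."""
    for c in children.get(root, []):
        yield from postorder(children, c)
    yield root


print(list(preorder(children, "root")))
print(list(postorder(children, "root")))
```

A preorder traversal suits computations that pass information from the root down to the leaves, while a postorder traversal suits computations that combine results from the children before handling the parent.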
Lastly, you can query the parent, children, leaf status, and root status of a specified node:
from tribal import lineage_tree
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
#pick a node to query
nodes = lineage_tree.nodes()
n = nodes[0]
#get parent of node n
print(lineage_tree.parent(n))
#get children of node n
print(lineage_tree.children(n))
#check if node n is a leaf
print(lineage_tree.is_leaf(n))
#check if node n is the root
print(lineage_tree.is_root(n))
#get the leafset
print(lineage_tree.get_leafs())
#get a dictionary containing the parent of every node
print(lineage_tree.get_parents())
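All of these queries can be derived from the edge list of a rooted tree. The standalone sketch below shows one way to do so; the toy edges and variable names are hypothetical, and it does not reflect how LineageTree stores its topology internally.

```python
from collections import defaultdict

# Hypothetical toy tree given as (parent, child) edges.
edges = [("root", "u"), ("root", "c"), ("u", "a"), ("u", "b")]

# Parent of every non-root node.
parents = {child: parent for parent, child in edges}

# Children of every internal node.
children = defaultdict(list)
for parent, child in edges:
    children[parent].append(child)

nodes = {n for edge in edges for n in edge}
root = next(n for n in nodes if n not in parents)      # the only node without a parent
leaves = sorted(n for n in nodes if n not in children)  # nodes without children

print(parents)
print(dict(children))
print(root, leaves)
```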
Functionality of a LineageTreeList¶
The LineageTreeList is an extension of list, which provides additional functionality for organizing a list of LineageTree objects.
from tribal import lineage_tree, LineageTreeList
lt_list = LineageTreeList()
lt_list.append(lineage_tree)
print(lt_list)
The LineageTreeList class provides functionality to find an optimal LineageTree, to find all optimal LineageTree objects in the list, or to randomly sample one.
from tribal import lineage_tree_list
print(len(lineage_tree_list))
#if there are multiple optimal solutions, the first in the list is returned
best_score, best_tree = lineage_tree_list.find_best_tree()
print(best_score)
print(best_tree)
best_score, all_best = lineage_tree_list.find_all_best_trees()
print(best_score)
random_score, random_tree = lineage_tree_list.sample_best_tree(seed=10)
print(random_tree)
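The pattern of extending list with selection helpers can be sketched as follows. ScoredTreeList is a hypothetical stand-in, not the LineageTreeList implementation: it stores (score, tree) pairs, treats lower scores as better, and samples uniformly among the tied best trees, all of which are assumptions made for the example.

```python
import random


class ScoredTreeList(list):
    """Hypothetical list of (score, tree) pairs with selection helpers."""

    def find_best(self):
        # min returns the first minimal element, so ties go to the earliest entry.
        return min(self, key=lambda pair: pair[0])

    def find_all_best(self):
        best = min(score for score, _ in self)
        return best, [tree for score, tree in self if score == best]

    def sample_best(self, seed=None):
        best, trees = self.find_all_best()
        # A local Random instance keeps the sampling reproducible for a given seed.
        return best, random.Random(seed).choice(trees)


trees = ScoredTreeList([(2.0, "t0"), (1.0, "t1"), (1.0, "t2")])
print(trees.find_best())
print(trees.find_all_best())
print(trees.sample_best(seed=10))
```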
In addition, the class provides a wrapper to the write function in LineageTree to write all the lineage tree files to disk.
from tribal import lineage_tree_list
lineage_tree_list.write_all(outpath="all_trees")
#or to utilize isotype labels for the isotype files instead of numerical encoding
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
lineage_tree_list.write_all(outpath="all_trees", isotype_encoding=isotypes)
Lastly, a CSV file with the objective values of each LineageTree in the list can be written.
from tribal import lineage_tree_list
isotypes = ['IGHM', 'IGHG3', 'IGHG1', 'IGHA1','IGHG2','IGHG4','IGHE','IGHA2']
lineage_tree_list.write("objectives.csv", isotype_encoding=isotypes)