Preprocessing

A module to preprocess data for TRIBAL.

`preprocess(df, roots, isotypes, min_size=4, use_light_chain=True, cores=1, verbose=False)`

Preprocess the input data to prepare data for TRIBAL.

Clonotypes are first filtered so that each one has at least `min_size` cells. Each retained clonotype is then aligned to its root sequence, and a maximum parsimony forest is enumerated for the B cells in that clonotype.
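The size filter described above can be sketched with plain pandas. This is a minimal illustration on a toy dataframe, not a substitute for calling `preprocess`; the cell ids and clonotype names are made up, and only the `groupby(...).filter(...)` pattern mirrors the source code shown later on this page.

```python
import pandas as pd

# Toy input (hypothetical data): clonotype "c1" has 4 cells, "c2" has only 2.
df = pd.DataFrame({
    "cellid": ["a", "b", "c", "d", "e", "f"],
    "clonotype": ["c1", "c1", "c1", "c1", "c2", "c2"],
})

min_size = 4

# Keep only the rows whose clonotype group has at least min_size cells.
kept = df.groupby("clonotype").filter(lambda g: len(g) >= min_size)

retained_clonotypes = sorted(kept["clonotype"].unique())
n_cells = kept.shape[0]
print(retained_clonotypes, n_cells)
```

With the default `min_size=4`, the two-cell clonotype is dropped and only the four cells of `c1` remain.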

Parameters:

Name	Type	Description	Default
`df`	`DataFrame`	A dataframe with columns including 'clonotype', 'heavy_chain_seq', 'light_chain_seq', 'heavy_chain_v_allele', 'light_chain_v_allele', and 'heavy_chain_isotype'. The light chain columns are optional.	required
`roots`	`DataFrame`	A dataframe containing the root sequences.	required
`isotypes`	`list`	An ordered list of isotype labels, e.g., ['IghM', ...,'IghA']. These labels must match the isotype labels in the input dataframe. See the notes below.	required
`min_size`	`int`	The minimum number of B cells needed to form a clonotype.	`4`
`use_light_chain`	`bool`	Whether to include the light chain in the BCR sequences.	`True`
`cores`	`int`	The number of CPU cores to use.	`1`
`verbose`	`bool`	Whether to print verbose output.	`False`

Returns:

Type	Description
`dict`	A dictionary mapping each clonotype id to a clonotype object formatted for input to TRIBAL.
`DataFrame`	The input dataframe restricted to the B cells that were retained after filtering.

Notes

Ensure that the isotype labels in `isotypes` match the labels in the input dataframe. Cells whose 'heavy_chain_isotype' value is not in `isotypes` are removed during filtering.
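A quick check before calling `preprocess` can catch label mismatches. The sketch below uses hypothetical labels (`IghG1` and the mis-cased `IGHG1` are illustrative assumptions, not taken from TRIBAL); the dictionary comprehension is the same one the function uses to map labels to their position in the ordering.

```python
# Hypothetical ordered isotype labels; replace with the labels used in your data.
isotypes = ["IghM", "IghG1", "IghA"]

# Label -> position in the ordering, built the same way as inside preprocess.
isotype_index = {iso: i for i, iso in enumerate(isotypes)}

# Hypothetical labels as they might appear in the 'heavy_chain_isotype' column.
observed = ["IghM", "IghA", "IGHG1", "IghM"]

# Labels present in the data but missing from the ordering.
unmatched = sorted(set(observed) - set(isotype_index))
print(unmatched)
```

Any label reported in `unmatched` should be corrected in the dataframe or added to `isotypes` if those cells are meant to be kept.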

Source code in tribal/preprocess.py

def preprocess( df: pd.DataFrame,
            roots: pd.DataFrame,
            isotypes:list,
            min_size=4,
            use_light_chain=True,
            cores:int =1,
            verbose=False
            ):
    """
    Preprocess the input data to prepare data for TRIBAL.

    The clonotypes will first be filtered to ensure each clonotype has at least `min_size` cells.
    Each retained clonotype is then aligned to its root sequence, and a maximum parsimony
    forest is enumerated for the B cells in that clonotype.

    Parameters
    ----------
    df : pd.DataFrame
        A dataframe with columns including 'clonotype', 'heavy_chain_seq', 'light_chain_seq',
        'heavy_chain_v_allele', 'light_chain_v_allele', and 'heavy_chain_isotype'.
        The 'Light Chain' columns are optional.
    roots : pd.DataFrame
        A dataframe containing the root sequences.
    isotypes : list
        A list of the ordered isotype labels, i.e., ['IghM', ...,'IghA']. These labels should
        match the isotype labels in the input dataframe. See notes below.  
    min_size : int, optional
        The minimum number of B cells needed to form a clonotype (default is 4).  
    use_light_chain : bool, optional
        Should the light chain be included in the BCR sequences (default is True).  
    cores : int, optional
        The number of CPU cores to use (default is 1).  
    verbose : bool, optional
        Should verbose output be printed (default is False).  

    Returns
    -------
    dict 
        A dictionary of clonotype objects formatted for input to tribal with clonotype id as key
    pd.DataFrame
        The input dataframe restricted to the B cells that were retained after filtering

    Notes
    -----
    Ensure that the isotype labels in `isotypes` match the labels in the input dataframe.
    """
    isotype = {iso: i for i, iso in enumerate(isotypes)}
    if verbose:
        print("\nPreprocessing input data for tribal...")
        print("\nIsotype ordering:")
        for iso in isotypes:
            print(iso)
        print("\nParameter settings:")
        print(f"minimum clonotype size: {min_size}")
        print(f"include light chain: {use_light_chain}")
        print(f"cores: {cores}")
        print(f"verbose: {verbose}")


    #first filter out clonotypes smaller than min size
    if verbose:

        print("\nPrior to filtering...")
        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")

    df = df.groupby("clonotype").filter(lambda x: len(x) >= min_size)

    df = df[df["heavy_chain_isotype"].isin(isotype.keys())]

    if verbose:
        print(f"\nFiltering clonotypes with fewer than {min_size} cells...")
        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")

    df = _filter_alleles(df, "heavy_chain_v_allele")
    if use_light_chain:
        df = _filter_alleles(df, "light_chain_v_allele")
    df = df[ df['heavy_chain_isotype'].notna()]

    df = df.groupby("clonotype").filter(lambda x: len(x) >= min_size)

    if verbose:
        print("\nFiltering cells based on v_alleles...")

        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")

    if verbose:
        print(f"\nAfter all filtering, the number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.\n")

    #prep dnapars sequence ids
    df['seq'] = df.groupby('clonotype').cumcount() + 1
    df['seq'] = 'seq' + df['seq'].astype(str)

    df.columns.values[0] = "cellid"
    roots.columns.values[0] = "clonotype"
    df["isotype"] = df['heavy_chain_isotype'].map(isotype)

    clonodict = {}    
    instances  = [ (j, df.copy(), roots.copy(), use_light_chain, isotypes, verbose)
                   for j in df["clonotype"].unique()]
    if cores > 1:
        with mp.Pool(cores) as pool:
            results = pool.starmap(_process_clonotype, instances)
    else:
        results = []
        for inst in instances:
            results.append(_process_clonotype(*inst))

    for j, linforest in results:
        clonodict[j] = linforest

    if verbose:
        print("\nPreprocessing complete!")
    return clonodict, df
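
The per-clonotype sequence ids built in the source above (`seq1`, `seq2`, ...) can be reproduced on a toy dataframe. This is a small sketch with made-up clonotype names; it combines the two assignment lines from the source into one expression but uses the same `cumcount` approach.

```python
import pandas as pd

# Toy input (hypothetical data) with interleaved clonotypes.
df = pd.DataFrame({"clonotype": ["c1", "c2", "c1", "c2", "c1"]})

# Number the cells within each clonotype starting at 1, then prefix with "seq".
df["seq"] = "seq" + (df.groupby("clonotype").cumcount() + 1).astype(str)

seq_ids = df["seq"].tolist()
print(seq_ids)
```

Numbering restarts within each clonotype, so the ids are unique only in combination with the clonotype column.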
