
Preprocessing

A module to preprocess data for TRIBAL.

preprocess(df, roots, isotypes, min_size=4, use_light_chain=True, cores=1, verbose=False)

Preprocess the input data to prepare data for TRIBAL.

Clonotypes are first filtered so that each one contains at least min_size cells. Each retained clonotype is then aligned to the root sequence, and a maximum parsimony forest is enumerated for the B cells with that clonotype.
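The size filter described above can be tried in isolation. The following is a minimal sketch using the same pandas groupby/filter idiom that appears in the source listing; the toy table and its values are made up for illustration and are not TRIBAL data.

```python
import pandas as pd

# Toy data (hypothetical): clonotype "c1" has 4 cells, "c2" has only 2.
cells = pd.DataFrame({
    "cellid": ["a", "b", "c", "d", "e", "f"],
    "clonotype": ["c1", "c1", "c1", "c1", "c2", "c2"],
})

min_size = 4

# Keep only the rows whose clonotype group has at least min_size cells.
kept = cells.groupby("clonotype").filter(lambda g: len(g) >= min_size)

print(sorted(kept["clonotype"].unique()))
print(len(kept))
```

With min_size = 4, only the rows of "c1" remain, because "c2" has fewer than four cells.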

Parameters:

| Name | Type | Description | Default |
| --- | --- | --- | --- |
| df | DataFrame | A dataframe with columns including 'clonotype', 'heavy_chain_seq', 'light_chain_seq', 'heavy_chain_v_allele', 'light_chain_v_allele', and 'heavy_chain_isotype'. The light chain columns are optional. | required |
| roots | DataFrame | A dataframe containing the root sequences. | required |
| isotypes | list | A list of the ordered isotype labels, e.g., ['IghM', ..., 'IghA']. These labels should match the isotype labels in the input dataframe. See notes below. | required |
| min_size | int | The minimum number of B cells needed to form a clonotype. | 4 |
| use_light_chain | bool | Whether to include the light chain in the BCR sequences. | True |
| cores | int | The number of CPU cores to use. | 1 |
| verbose | bool | Whether to print verbose output. | False |

Returns:

| Type | Description |
| --- | --- |
| dict | A dictionary of clonotype objects formatted for input to TRIBAL, keyed by clonotype id. |
| DataFrame | A dataframe containing only the B cells that passed filtering. |

Notes

Ensure that the isotype labels in isotypes match the labels in the input dataframe.
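One way to check this before calling preprocess is to compare the labels observed in the data against the ordered list. The sketch below is illustrative only: unmatched_isotypes is a hypothetical helper, not part of TRIBAL, and the three-label ordering is a shortened example rather than a full isotype list. The dictionary comprehension mirrors the label-to-index mapping built at the start of preprocess.

```python
def unmatched_isotypes(observed_labels, isotypes):
    """Return observed labels that do not appear in the ordered isotype list.

    Hypothetical helper for illustration; not part of the TRIBAL API.
    """
    known = set(isotypes)
    unmatched = []
    for label in observed_labels:
        # Skip missing values and report each unknown label only once.
        if label is not None and label not in known and label not in unmatched:
            unmatched.append(label)
    return unmatched


# Shortened, illustrative ordering (assumption: real lists contain more labels).
isotypes = ["IghM", "IghG1", "IghA"]

# Label -> position in the ordering, as built inside preprocess.
isotype_index = {iso: i for i, iso in enumerate(isotypes)}

# "IGHG1" differs in case from "IghG1", so it would not match.
observed = ["IghM", "IGHG1", "IghA", "IghM"]
print(unmatched_isotypes(observed, isotypes))
```

Cells whose label is not in the list are removed by the isotype filter in preprocess, so a mismatch in spelling or case can silently shrink the dataset.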

Source code in tribal/preprocess.py
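# Module-level imports this function relies on.
import multiprocessing as mp

import pandas as pd


def preprocess(
    df: pd.DataFrame,
    roots: pd.DataFrame,
    isotypes: list,
    min_size: int = 4,
    use_light_chain: bool = True,
    cores: int = 1,
    verbose: bool = False,
):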
    """
    Preprocess the input data to prepare data for TRIBAL.

    Clonotypes are first filtered so that each one contains at least `min_size` cells.
    Each retained clonotype is then aligned to the root sequence, and a maximum parsimony
    forest is enumerated for the B cells with that clonotype.

    Parameters
    ----------
    df : pd.DataFrame
        A dataframe with columns including 'clonotype', 'heavy_chain_seq', 'light_chain_seq',
        'heavy_chain_v_allele', 'light_chain_v_allele', and 'heavy_chain_isotype'.
        The light chain columns are optional.
    roots : pd.DataFrame
        A dataframe containing the root sequences.
    isotypes : list
        A list of the ordered isotype labels, e.g., ['IghM', ..., 'IghA']. These labels should
        match the isotype labels in the input dataframe. See notes below.
    min_size : int, optional
        The minimum number of B cells needed to form a clonotype (default is 4).
    use_light_chain : bool, optional
        Whether to include the light chain in the BCR sequences (default is True).
    cores : int, optional
        The number of CPU cores to use (default is 1).
    verbose : bool, optional
        Whether to print verbose output (default is False).

    Returns
    -------
    dict
        A dictionary of clonotype objects formatted for input to TRIBAL, keyed by clonotype id.
    pd.DataFrame
        A dataframe containing only the B cells that passed filtering.

    Notes
    -----
    Ensure that the isotype labels in `isotypes` match the labels in the input dataframe.
    """
    isotype = {iso: i for i, iso in enumerate(isotypes)}
    if verbose:
        print("\nPreprocessing input data for tribal...")
        print("\nIsotype ordering:")
        for iso in isotypes:
            print(iso)
        print("\nParameter settings:")
        print(f"minimum clonotype size: {min_size}")
        print(f"include light chain: {use_light_chain}")
        print(f"cores: {cores}")
        print(f"verbose: {verbose}")


    # Report the size of the input before any filtering.
    if verbose:
        print("\nPrior to filtering...")
        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")

    # Drop clonotypes with fewer than min_size cells.
    df = df.groupby("clonotype").filter(lambda x: len(x) >= min_size)

    # Keep only cells whose isotype label appears in `isotypes`.
    df = df[df["heavy_chain_isotype"].isin(isotype.keys())]

    if verbose:
        print(f"\nFiltering clonotypes with fewer than {min_size} cells and cells with unlisted isotypes...")
        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")

    # Filter cells on their V alleles; the light chain is checked only when it is in use.
    df = _filter_alleles(df, "heavy_chain_v_allele")
    if use_light_chain:
        df = _filter_alleles(df, "light_chain_v_allele")

    # Guard against missing isotype labels.
    df = df[df["heavy_chain_isotype"].notna()]

    # Removing cells can shrink a clonotype below min_size, so re-apply the size filter.
    df = df.groupby("clonotype").filter(lambda x: len(x) >= min_size)

    if verbose:
        print(f"\nFiltering cells based on V alleles and re-applying the minimum size of {min_size}...")
        print(f"The number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.")
        print(f"\nAfter all filtering, the number of cells is {df.shape[0]} and the number of clonotypes is {df['clonotype'].nunique()}.\n")

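    # Work on copies so the assignments below neither write to a filtered slice
    # nor modify the caller's tables in place.
    df = df.copy()
    roots = roots.copy()

    # Prepare dnapars sequence ids: seq1, seq2, ... within each clonotype.
    df["seq"] = "seq" + (df.groupby("clonotype").cumcount() + 1).astype(str)

    # Rename the first column of each table by position.
    df.columns = ["cellid", *df.columns[1:]]
    roots.columns = ["clonotype", *roots.columns[1:]]

    # Translate each isotype label into its position in the ordered list.
    df["isotype"] = df["heavy_chain_isotype"].map(isotype)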

    # One task per clonotype; each task receives its own copy of both tables.
    instances = [
        (j, df.copy(), roots.copy(), use_light_chain, isotypes, verbose)
        for j in df["clonotype"].unique()
    ]
    if cores > 1:
        # Process clonotypes in parallel across worker processes.
        with mp.Pool(cores) as pool:
            results = pool.starmap(_process_clonotype, instances)
    else:
        results = [_process_clonotype(*inst) for inst in instances]

    # Index the results by clonotype id.
    clonodict = {j: linforest for j, linforest in results}

    if verbose:
        print("\nPreprocessing complete!")
    return clonodict, df
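
The sequence-id step in preprocess numbers the cells within each clonotype using groupby and cumcount. A minimal sketch of that idiom on a made-up table (the clonotype labels are illustrative) may help when checking what the 'seq' column will contain:

```python
import pandas as pd

# Toy table (hypothetical): rows of two clonotypes, interleaved.
cells = pd.DataFrame({"clonotype": ["c1", "c2", "c1", "c1", "c2"]})

# cumcount() numbers the rows within each group starting at 0, so add 1
# to obtain ids that start at seq1 in every clonotype.
cells["seq"] = "seq" + (cells.groupby("clonotype").cumcount() + 1).astype(str)

print(cells["seq"].tolist())
```

Ids restart at seq1 for every clonotype, so they are unique only within a clonotype, not across the whole table.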