Tutorial

To use scallop you need a basic knowledge of how scanpy and AnnData objects work. scallop follows scanpy's naming convention of pp, tl, and pl (preprocessing, tools, and plotting). If you are comfortable with that convention, scallop will be as easy to use as scanpy; if you are not, this tutorial will help you use both scallop and scanpy.

Creating a Scallop object from an AnnData object

To work with scallop, first use scanpy to load a dataset into an AnnData object.

import scanpy as sc

adata = sc.read(filename)

Next, create the Scallop object, which will store all the information from the bootstrapping experiments.

import scallop as sl

scal = sl.Scallop(adata)

Within this object you can run several bootstrap experiments, each of which is saved as a Bootstrap object. Each Bootstrap object corresponds to a unique combination of the chosen settings (clustering, number of trials, percentage of cells, etc.). All Bootstrap objects are stored within the Scallop object and can be accessed from it.

Running a bootstrap experiment

In order to run a bootstrap experiment, use the command sl.tl.getScore():

sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95)

In this case, scallop runs a bootstrap experiment with a leiden resolution parameter of 1.2 and 30 bootstrap repetitions, selecting 95% of the cells for clustering in each repetition.
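To make the roles of n_trials and frac_cells more concrete, here is a minimal, standard-library sketch of the subsampling idea. It is not scallop's implementation: the function name subsample_trials, the seed argument, and the choice of sampling without replacement within each trial are assumptions made for illustration, and the clustering step itself is left out.

```python
import random


def subsample_trials(cell_ids, n_trials=30, frac_cells=0.95, seed=0):
    """Return one random subset of cell ids per trial.

    Hypothetical helper for illustration only; within a trial, cells are
    drawn without replacement (an assumption, not scallop's documented behavior).
    """
    rng = random.Random(seed)
    n_keep = round(len(cell_ids) * frac_cells)
    return [rng.sample(cell_ids, n_keep) for _ in range(n_trials)]


# Toy cell identifiers standing in for the cells of an AnnData object.
cells = [f"cell_{i}" for i in range(200)]
subsets = subsample_trials(cells, n_trials=30, frac_cells=0.95)

# Each trial would then be clustered independently (not shown here).
print(len(subsets), len(subsets[0]))  # 30 190
```

With 200 cells and frac_cells=0.95, each of the 30 trials works on 190 cells, so any given cell is left out of some trials.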

By default scallop stores all results in the Bootstrap object. If you need the score, running

score = sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95, do_return=True)

will return a pd.Series containing the score for each cell in adata.

If you run sl.tl.getScore() again with exactly the same parameters, scallop will not rerun the bootstrap and will return the stored bootstrap result instead.

You can list all the bootstraps that have been run with
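The reuse of results for identical parameters can be pictured as a lookup keyed by the parameter combination. The sketch below only illustrates that idea: BootstrapCache, get_or_run, and fake_run are hypothetical names, and how scallop stores and matches its Bootstrap objects internally may differ.

```python
class BootstrapCache:
    """Hypothetical cache: one stored result per unique parameter combination."""

    def __init__(self):
        self._results = {}

    def get_or_run(self, run, **params):
        # Sort the items so the key does not depend on keyword order.
        key = tuple(sorted(params.items()))
        if key not in self._results:
            self._results[key] = run(**params)
        return self._results[key]


calls = []


def fake_run(res, n_trials, frac_cells):
    """Stand-in for an expensive bootstrap run; records each real execution."""
    calls.append((res, n_trials, frac_cells))
    return f"bootstrap(res={res}, n_trials={n_trials}, frac_cells={frac_cells})"


cache = BootstrapCache()
first = cache.get_or_run(fake_run, res=1.2, n_trials=30, frac_cells=0.95)
second = cache.get_or_run(fake_run, res=1.2, n_trials=30, frac_cells=0.95)
third = cache.get_or_run(fake_run, res=1.0, n_trials=30, frac_cells=0.95)

print(len(calls))  # 2: the repeated call reused the stored result
```

Changing any single parameter (here res) produces a new key, so a new run is executed and stored alongside the earlier one.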

scal.getAllBootstraps()

Example output:

------------------------------------------------------------
Bootstrap ID:  0 | res: 1.2 | frac_cells: 0.95 | n_trials: 10
------------------------------------------------------------
Bootstrap ID:  1 | res: 1.0 | frac_cells: 0.95 | n_trials: 20
------------------------------------------------------------
Bootstrap ID:  2 | res: 1.0 | frac_cells: 0.95 | n_trials: 100

Running sl.tl.getScore() in parallel

Running leiden on datasets with many cells, or running an experiment with many trials, can be slow. You can use the n_procs argument of sl.tl.getScore() to run several trials at the same time:

sl.tl.getScore(scal, res=1.2, n_trials=100, n_procs=8)

Note

We recommend using between 4 and 12 processors. Using more processors does not always substantially reduce the computation time in this case.

Warning

If your dataset has few cells, or n_trials is small, we recommend trying n_procs=1. In those cases, the overhead of setting up and coordinating parallel processes can exceed the time saved.
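Since the best value of n_procs depends on the dataset and the number of trials, one practical approach is to time a few settings on your own data. The helper below is a small sketch using only the standard library; time_call and stand_in_workload are hypothetical names, and the scallop call appears only as a comment that reuses the signature shown above. Because scallop reuses results when the parameters match, it may be safest to create a fresh Scallop object for each timed run.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def stand_in_workload(n):
    """Placeholder computation so the sketch runs without scallop installed."""
    return sum(i * i for i in range(n))


result, seconds = time_call(stand_in_workload, 100_000)
print(f"stand-in workload took {seconds:.4f} s")

# With scallop, the same helper could wrap the call from this tutorial:
# for n_procs in (1, 4, 8):
#     scal = sl.Scallop(adata)  # fresh object, to avoid reusing stored results
#     _, seconds = time_call(sl.tl.getScore, scal, res=1.2,
#                            n_trials=100, n_procs=n_procs)
#     print(n_procs, seconds)
```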