Tutorial
To use scallop you need a basic knowledge of how scanpy and
AnnData objects work. scallop follows scanpy's naming convention of
pp, tl, and pl (preprocessing, tools, and plotting). If you are comfortable
with that convention, scallop will be as easy to use as scanpy; if
you don’t know it, this tutorial will help you use both scallop and scanpy.
Creating a Scallop object from an AnnData object
To work with scallop, first use scanpy to load
a dataset into an AnnData object.
import scanpy as sc
adata = sc.read(filename)
Next, create the Scallop object, in which all the information from
bootstrapping experiments will be saved.
import scallop as sl
scal = sl.Scallop(adata)
In this object you can run several bootstrap experiments, each of which is saved in a
Bootstrap object. Each Bootstrap object holds a unique combination of the chosen
parameters (clustering, number of trials, percentage of cells, etc.). All Bootstrap
objects are saved within the Scallop object and can be accessed later.
Running a bootstrap experiment
In order to run a bootstrap experiment, use the command sl.tl.getScore():
sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95)
In this case, scallop runs a bootstrap experiment with a Leiden
resolution parameter of 1.2 and 30 bootstrap repetitions, selecting
95% of the cells to cluster in each repetition.
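To make the roles of n_trials and frac_cells concrete, here is a minimal standard-library sketch of the subsampling step only. It is not scallop's implementation: the helper name subsample_trials, the seed argument, and the choice of sampling without replacement are assumptions made for illustration, and the clustering and scoring steps are left out.

```python
import random


def subsample_trials(cell_ids, n_trials=30, frac_cells=0.95, seed=0):
    """Return one subset of cell_ids per trial (hypothetical helper, not scallop API)."""
    rng = random.Random(seed)
    # Number of cells kept in every trial, e.g. 95 out of 100 for frac_cells=0.95.
    n_keep = round(len(cell_ids) * frac_cells)
    # Assumption: cells are drawn without replacement within a trial.
    return [rng.sample(cell_ids, n_keep) for _ in range(n_trials)]


cells = [f"cell_{i}" for i in range(100)]
trials = subsample_trials(cells, n_trials=30, frac_cells=0.95)
print(len(trials), len(trials[0]))
```

Each trial sees a slightly different subset of cells, so a clustering run on each subset can be compared across trials.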
By default scallop stores all results in the Bootstrap object. If you need
the score, running
score = sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95, do_return=True)
will return a pd.Series containing the score for each cell in adata.
If you run sl.tl.getScore() again with exactly the same parameters, scallop
won’t rerun the bootstrap; it returns the stored result instead.
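The reuse behaviour described above can be pictured as a store keyed by the parameter combination. The sketch below is a toy illustration of that idea, not scallop's internals: the class BootstrapCache, its method get_score, and the stand-in function fake_bootstrap are all hypothetical names.

```python
class BootstrapCache:
    """Toy store keyed by parameter combination (hypothetical, not scallop's class)."""

    def __init__(self):
        self._results = {}
        self.n_computed = 0  # how many times the expensive step actually ran

    def get_score(self, compute, res, n_trials, frac_cells):
        key = (res, n_trials, frac_cells)
        if key not in self._results:
            # Only compute when this exact combination has not been seen before.
            self._results[key] = compute(res, n_trials, frac_cells)
            self.n_computed += 1
        return self._results[key]


def fake_bootstrap(res, n_trials, frac_cells):
    # Stand-in for an expensive bootstrap run; returns a placeholder result.
    return {"res": res, "n_trials": n_trials, "frac_cells": frac_cells}


cache = BootstrapCache()
first = cache.get_score(fake_bootstrap, 1.2, 30, 0.95)
second = cache.get_score(fake_bootstrap, 1.2, 30, 0.95)  # reused, not recomputed
third = cache.get_score(fake_bootstrap, 1.0, 30, 0.95)   # new combination, computed
print(cache.n_computed)
```

Changing any one of the parameters produces a new key, which matches the way each parameter combination gets its own entry in the bootstrap listing.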
You can list all the bootstraps that have been run with
scal.getAllBootstraps()
Example output:
------------------------------------------------------------
Bootstrap ID: 0 | res: 1.2 | frac_cells: 0.95 | n_trials: 10
------------------------------------------------------------
Bootstrap ID: 1 | res: 1.0 | frac_cells: 0.95 | n_trials: 20
------------------------------------------------------------
Bootstrap ID: 2 | res: 1.0 | frac_cells: 0.95 | n_trials: 100
Running sl.tl.getScore() in parallel
Running Leiden on datasets with many cells, or running an experiment with many trials,
can slow down the computation. You can set the n_procs argument of sl.tl.getScore()
to run several trials at the same time:
sl.tl.getScore(scal, res=1.2, n_trials=100, n_procs=8)
Note
We recommend using between 4 and 12 processors. A higher number of processors does not always substantially improve the computation time in this case.
Warning
If your dataset has a low number of cells, or n_trials is small, we recommend trying n_procs = 1.
In those cases, the setup and coordination overhead of parallelization can exceed the time
saved.
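The two recommendations above can be folded into a small helper that picks a value to pass as n_procs. This is only a sketch: suggest_n_procs is a hypothetical function, not part of scallop. The tutorial gives no numeric threshold for a "small" job, so that decision is left to the caller.

```python
import os


def suggest_n_procs(small_job, available=None, cap=12):
    """Suggest a process count following the note and warning (hypothetical helper)."""
    if small_job:
        # Few cells or few trials: parallel overhead may outweigh the gain.
        return 1
    if available is None:
        # os.cpu_count() can return None, so fall back to a single process.
        available = os.cpu_count() or 1
    # Never exceed the upper end of the recommended 4 to 12 range.
    return max(1, min(available, cap))


print(suggest_n_procs(small_job=False, available=32))
print(suggest_n_procs(small_job=True, available=32))
```

The result can then be passed as the n_procs argument of sl.tl.getScore(). On machines with fewer than 4 cores the helper simply returns the number available.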