Tutorial
To use scallop you need a basic knowledge of how scanpy and AnnData objects work. scallop follows the scanpy naming convention of pp, tl, and pl (preprocessing, tools, and plotting). If you are comfortable with that convention, scallop will be as easy to use as scanpy; if you are not, this tutorial will help you use both scallop and scanpy!
Creating a Scallop object from an AnnData object
To work with scallop, first use scanpy to load a dataset into an AnnData object.
import scanpy as sc
adata = sc.read(filename)
Next, create the Scallop object, which will store all the information from the bootstrapping experiments.
import scallop as sl
scal = sl.Scallop(adata)
Within this object you can run several bootstrap experiments, each of which is saved as a Bootstrap object. Each Bootstrap object corresponds to a unique combination of the chosen parameters (clustering settings, number of trials, fraction of cells, etc.). All Bootstrap objects are stored within the Scallop object and can be accessed from it.
Running a bootstrap experiment
To run a bootstrap experiment, use sl.tl.getScore():
sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95)
In this case, scallop runs a bootstrap experiment with a leiden resolution parameter of 1.2 and 30 bootstrap repetitions; in each repetition it selects 95% of the cells and clusters only those cells.
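To make the roles of n_trials and frac_cells concrete, here is a minimal sketch of the subsampling step only. It is not scallop's implementation: the function name subsample_indices, the seed argument, and the use of sampling without replacement are assumptions made for illustration, and no clustering is performed.

```python
import random


def subsample_indices(n_cells, frac_cells, n_trials, seed=0):
    """Return one sorted list of selected cell indices per trial.

    Hypothetical helper, not part of scallop: it only illustrates
    drawing a fraction of the cells for each bootstrap repetition.
    """
    rng = random.Random(seed)
    n_keep = round(n_cells * frac_cells)  # number of cells kept per trial
    # Each trial draws its own subset, without replacement (an assumption).
    return [sorted(rng.sample(range(n_cells), n_keep)) for _ in range(n_trials)]


# 30 trials, each keeping 95% of 1000 cells.
trials = subsample_indices(n_cells=1000, frac_cells=0.95, n_trials=30)
print(len(trials), len(trials[0]))
```

In a real run, each subset would then be clustered and the per-cell results compared across trials; that part is left to scallop.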
By default, scallop stores all results in the Bootstrap object. If you need the score itself, running
score = sl.tl.getScore(scal, res=1.2, n_trials=30, frac_cells=0.95, do_return=True)
will return a pd.Series containing the score for each cell in adata.
If you run sl.tl.getScore() again with exactly the same parameters, scallop won’t rerun the bootstrap and will return the stored bootstrap result instead.
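The reuse of results for identical parameters can be pictured as a lookup keyed by the parameter combination. The sketch below is an assumption about the general pattern, not a description of scallop's internals: BootstrapStore, get_or_run, and fake_run are hypothetical names.

```python
class BootstrapStore:
    """Hypothetical store keeping one result per parameter combination."""

    def __init__(self):
        self._results = {}   # maps (res, n_trials, frac_cells) -> result
        self.n_computed = 0  # how many times the expensive step really ran

    def get_or_run(self, run, res, n_trials, frac_cells):
        key = (res, n_trials, frac_cells)
        if key not in self._results:
            # Only compute when this exact combination is new.
            self._results[key] = run(res, n_trials, frac_cells)
            self.n_computed += 1
        return self._results[key]


def fake_run(res, n_trials, frac_cells):
    # Stand-in for an expensive bootstrap; it just echoes its inputs.
    return {"res": res, "n_trials": n_trials, "frac_cells": frac_cells}


store = BootstrapStore()
first = store.get_or_run(fake_run, res=1.2, n_trials=30, frac_cells=0.95)
again = store.get_or_run(fake_run, res=1.2, n_trials=30, frac_cells=0.95)
other = store.get_or_run(fake_run, res=1.0, n_trials=30, frac_cells=0.95)
print(first is again, store.n_computed)
```

The second call returns the stored object without recomputing, while changing any parameter (here res) triggers a new run.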
You can list all the bootstraps that have been run with
scal.getAllBootstraps()
An example output:
------------------------------------------------------------
Bootstrap ID: 0 | res: 1.2 | frac_cells: 0.95 | n_trials: 10
------------------------------------------------------------
Bootstrap ID: 1 | res: 1.0 | frac_cells: 0.95 | n_trials: 20
------------------------------------------------------------
Bootstrap ID: 2 | res: 1.0 | frac_cells: 0.95 | n_trials: 100
Running sl.tl.getScore() in parallel
Running leiden on datasets with many cells, or running the experiment with many trials, can slow down the computation. You can set the n_procs argument of sl.tl.getScore() to run several trials at the same time:
sl.tl.getScore(scal, res=1.2, n_trials=100, n_procs=8)
Note
We recommend using between 4 and 12 processors. A higher number of processors does not always substantially improve the computation time in this case.
Warning
If your dataset has a low number of cells, or n_trials is small, we recommend trying n_procs = 1. In those cases, the setup and coordination overhead of parallelization can exceed the time saved.
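One way to check whether parallelization pays off on your data is to time the same call with different n_procs values. The sketch below is a generic timing helper, not part of scallop: time_call and dummy_workload are hypothetical names, and dummy_workload is a stand-in that ignores n_procs, so that the example runs without a dataset.

```python
import time


def time_call(func, *args, **kwargs):
    """Run func once and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed


def dummy_workload(n_trials, n_procs=1):
    # Stand-in for the real call; n_procs is accepted but unused here.
    return sum(i * i for i in range(n_trials * 1000))


timings = {}
for n_procs in (1, 4, 8):
    _, timings[n_procs] = time_call(dummy_workload, n_trials=100, n_procs=n_procs)

# The n_procs value with the shortest measured time.
best = min(timings, key=timings.get)
print(timings, best)
```

To measure the real thing, replace dummy_workload with a function that calls sl.tl.getScore(). Because repeated calls with identical parameters return the stored result, it is probably safest to create a fresh Scallop object for each measurement.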