ondisc
is an R package that facilitates analysis of large-scale single-cell
data out-of-core on a laptop or distributed across tens to hundreds of
processors on a cluster or cloud. In both of these settings,
ondisc requires only a few gigabytes of memory, even if the
input data are tens of gigabytes in size. ondisc is oriented
mainly toward single-cell CRISPR screen analysis, but
ondisc can also be used for single-cell differential
expression and single-cell co-expression analyses. ondisc
is powered by several new, efficient algorithms for manipulating and
querying large, sparse expression matrices.
Users can install ondisc using the code below.
ondisc depends on the Bioconductor package
Rhdf5lib, which should be installed from source before
installing ondisc.
See the frequently
asked questions page for tips on installing ondisc such
that it runs as fast as possible. We can load ondisc by
calling library().
The interface to ondisc is simple and minimal. The
package contains only one class: odm (short for
“ondisc matrix”). An odm object represents a
single-cell expression matrix stored on disk (as opposed to
in memory). odm objects can be used to store
expression matrices that are too large to fit in memory. Users can
create an odm object via one of two functions:
create_odm_from_cellranger() or
create_odm_from_r_matrix(). The former takes the output of
one or more calls to Cell Ranger count as input, while the latter takes
an R matrix (stored in standard format or sparse format) as input. Users
can interface with an odm object using several functions,
including the bracket ([,]) operator, which loads a
specified subset of the expression matrix into memory.
odm object via create_odm_from_cellranger()
ondisc provides two functions for initializing an
odm object: create_odm_from_cellranger() and
create_odm_from_r_matrix(). The former is considerably more
scalable and memory-efficient than the latter; thus, we recommend that
users employ create_odm_from_cellranger() when possible. We
illustrate use of create_odm_from_cellranger() on an
example single-cell CRISPR screen dataset stored in ondisc.
The example data contain two modalities, namely a gene modality and a
CRISPR gRNA modality. There are 100 genes, 60 gRNAs, and 500 cells in
the data. create_odm_from_cellranger() takes several
arguments: directories_to_load,
directory_to_write, write_cellwise_covariates,
chunk_size, compression_level, and
grna_target_data_frame. Only the first two of these
arguments are required; the rest are set to reasonable defaults. We
describe the directories_to_load and
directory_to_write arguments below.
directories_to_load is a character vector specifying the
locations of one or more directories outputted by Cell Ranger count.
Below, we set directories_to_load to the (machine-specific)
location of the example data on disk.
directories_to_load <- paste0(
system.file("extdata", "highmoi_example", package = "ondisc"),
"/gem_group_", 1:2
)
directories_to_load # file paths to the example data on your computer
## [1] "/private/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T/RtmpS4gNAS/Rinstaafa27f1ac2/ondisc/extdata/highmoi_example/gem_group_1"
## [2] "/private/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T/RtmpS4gNAS/Rinstaafa27f1ac2/ondisc/extdata/highmoi_example/gem_group_2"
directories_to_load contains the file paths to two
directories, which correspond to cells sequenced across two batches. The
data are stored in feature
barcode format; each directory contains the files
barcodes.tsv.gz, features.tsv.gz, and
matrix.mtx.gz.
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
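The two listings above are the kind of output that base R's list.files() returns for a Cell Ranger output directory. As a minimal sketch that does not depend on the example data, the code below builds a mock directory and checks that the three expected files are present. The directory name mock_gem_group and the empty placeholder files are assumptions made purely for illustration.

```r
# Build a mock Cell Ranger output directory (hypothetical; the files are empty).
mock_dir <- file.path(tempdir(), "mock_gem_group")
dir.create(mock_dir, showWarnings = FALSE)
expected_files <- c("barcodes.tsv.gz", "features.tsv.gz", "matrix.mtx.gz")
invisible(file.create(file.path(mock_dir, expected_files)))

# List the directory and check whether any expected file is missing.
found_files <- list.files(mock_dir)
missing_files <- setdiff(expected_files, found_files)
found_files
length(missing_files) == 0
```

A check of this kind can be run on each entry of directories_to_load before importing the data.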
Next, directory_to_write is a file path to the directory
in which to write the backing .odm file, which is the file
that will store the expression data on disk. .odm files
contain the same information as .mtx files but stored in a
more efficient format for CRISPR screen analysis, differential
expression analysis, and gene co-expression analysis. .odm
files simply are HDF5 files with special structure. We set
directory_to_write to temp_dir (i.e., the
temporary directory) in this example. The remaining arguments are
optional, and most users will not need to specify them; see
?create_odm_from_cellranger() for more information. Below,
we call create_odm_from_cellranger() on the example data,
saving the output of the function to the variable
out_list.
temp_dir <- tempdir()
out_list <- create_odm_from_cellranger(
directories_to_load = directories_to_load,
directory_to_write = temp_dir
)
## Round 1/2 processing of the input files.
## Processing file 1 of 2.
## Processing file 2 of 2.
## Round 2/2 processing of the input files.
## Processing file 1 of 2. Computing cellwise covariates. Writing to disk.
## Processing file 2 of 2. Computing cellwise covariates. Writing to disk.
out_list contains three entries: gene,
grna, and cellwise_covariates.
gene and grna are the odm objects
corresponding to the gene and gRNA modalities, respectively. Meanwhile,
cellwise_covariates is a data frame that contains the
cell-wise covariates. (More on the cell-wise covariates later.) An
inspection of temp_dir reveals that the files
gene.odm and grna.odm have been written to
this directory.
## [1] "gene.odm" "grna.odm"
odm object
We extract the odm object corresponding to the gene
modality as follows.
Evaluating an odm object in the console prints
information about the matrix, including the number of features and cells
contained within the matrix, as well as the file path to the
(machine-specific) backing .odm file.
## An object of class odm with the following attributes:
## • 100 features
## • 500 cells
## • Backing file: /var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/gene.odm
odm objects support several key matrix operations,
including nrow(), ncol(),
rownames(), and [,]. nrow() and
ncol() return the number of rows (i.e., features) and
columns (i.e., cells) contained within the matrix, respectively.
## [1] 100
## [1] 500
Next, rownames() returns the feature IDs.
## [1] "ENSG00000100218" "ENSG00000253546" "ENSG00000185340" "ENSG00000223726"
## [5] "ENSG00000099958" "ENSG00000234208"
Finally, the bracket operator ([,]) loads a specified
row of the expression matrix into memory. One can index into the rows by
integer index or feature ID, as follows.
## [1] 0 2 4 1 1 2
## [1] 0 2 4 1 1 2
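The same generics behave analogously on an ordinary in-memory matrix. The sketch below uses a small base R matrix as a stand-in (toy_mat is a hypothetical object, not an odm object) to illustrate the row and column conventions and the equivalence of indexing by integer position and by feature ID.

```r
set.seed(4)
# In-memory stand-in for a features-by-cells matrix (NOT an odm object).
toy_mat <- matrix(
  rpois(6 * 4, lambda = 1),
  nrow = 6L,
  dimnames = list(paste0("gene_", 1:6), paste0("cell_", 1:4))
)
nrow(toy_mat)            # number of features (rows)
ncol(toy_mat)            # number of cells (columns)
head(rownames(toy_mat))  # feature IDs

# Index a row by integer position or by feature ID; both give the same vector.
by_index <- toy_mat[2L, ]
by_id <- toy_mat["gene_2", ]
identical(by_index, by_id)
```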
Indexing into an odm object by column is not supported.
odm objects also take up very little space, as the data
are stored on disk rather than in memory. For example,
gene_odm takes up only 8.7 kilobytes of memory.
## [1] "8.7 Kb"
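For comparison, the sketch below measures the memory occupied by a dense in-memory matrix with the same dimensions as the example gene modality (100 features by 500 cells). Here dense_mat is a hypothetical all-zero stand-in, and object.size() is the base R function for measuring an object's memory footprint.

```r
# Dense integer matrix with the same dimensions as the example gene modality.
dense_mat <- matrix(0L, nrow = 100L, ncol = 500L)
dense_bytes <- as.numeric(object.size(dense_mat))

# Each integer entry takes 4 bytes, so the payload alone is 100 * 500 * 4 bytes.
payload_bytes <- 100 * 500 * 4
format(object.size(dense_mat), units = "Kb")
dense_bytes >= payload_bytes
```

The gap widens as the number of cells grows, because the size of an in-memory matrix scales with the data while the data behind an odm object stay on disk.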
ondisc supports the following Cell Ranger modalities:
Gene Expression, CRISPR Guide Capture (i.e.,
gRNA expression), and Antibody Capture (i.e., protein
expression). (The modality of a given feature is listed within the third
column of the unzipped features.tsv file; see the Cell
Ranger documentation for more information.) The table below maps the
modality name used by Cell Ranger to that used by
ondisc.
| Cell Ranger modality name | ondisc modality name |
|---|---|
| Gene Expression | gene |
| CRISPR Guide Capture | grna |
| Antibody Capture | protein |
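The mapping in the table can be expressed as a named character vector. The sketch below applies it to a small data frame that mimics the three columns of features.tsv. The data frame, its column names, and the feature IDs are hypothetical and serve only as an illustration.

```r
# Lookup table from Cell Ranger modality names to ondisc modality names.
modality_map <- c(
  "Gene Expression" = "gene",
  "CRISPR Guide Capture" = "grna",
  "Antibody Capture" = "protein"
)

# Hypothetical stand-in for the three columns of an unzipped features.tsv file.
features <- data.frame(
  id = c("gene_1", "grna_1", "protein_1"),
  name = c("gene_1", "grna_1", "protein_1"),
  modality = c("Gene Expression", "CRISPR Guide Capture", "Antibody Capture")
)

# Translate the third column into ondisc modality names.
ondisc_modality <- unname(modality_map[features$modality])
ondisc_modality
```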
We provide an example of using
create_odm_from_cellranger() to import a dataset containing
three modalities: gene expression, gRNA expression, and protein
expression. We use a synthetic dataset for this purpose. To this end we
call the function write_example_cellranger_dataset(), which
creates a synthetic single-cell dataset, writing the dataset to disk in
Cell Ranger feature barcode format. (See
?write_example_cellranger_dataset() for more information
about this function.) We create a synthetic single-cell dataset
consisting of 100 genes, 20 gRNAs, 10 proteins, and 500 cells.
Furthermore, we specify that the cells are sequenced across three
batches. We write the synthetic dataset to the directory
temp_dir.
set.seed(4)
example_data <- write_example_cellranger_dataset(
n_features = c(100, 20, 10),
n_cells = 500,
n_batch = 3,
modalities = c("gene", "grna", "protein"),
directory_to_write = temp_dir,
p_set_col_zero = 0
)
The synthetic data are contained in the directories
batch_1, batch_2, and batch_3
within temp_dir:
directories_to_load <- list.files(
temp_dir,
pattern = "batch_",
full.names = TRUE
)
directories_to_load
## [1] "/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/batch_1"
## [2] "/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/batch_2"
## [3] "/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/batch_3"
Each of these directories contains the files
matrix.mtx.gz, features.tsv.gz, and
barcodes.tsv.gz. For example, the contents of the
batch_1 directory are as follows.
## [1] "barcodes.tsv.gz" "features.tsv.gz" "matrix.mtx.gz"
We call create_odm_from_cellranger() to import these
data, saving the output of the function in the variable
out_list.
out_list <- create_odm_from_cellranger(
directories_to_load = directories_to_load,
directory_to_write = temp_dir
)
## Round 1/2 processing of the input files.
## Processing file 1 of 3.
## Processing file 2 of 3.
## Processing file 3 of 3.
## Round 2/2 processing of the input files.
## Processing file 1 of 3. Computing cellwise covariates. Writing to disk.
## Processing file 2 of 3. Computing cellwise covariates. Writing to disk.
## Processing file 3 of 3. Computing cellwise covariates. Writing to disk.
out_list contains the cell-wise covariate data frame
alongside odm objects corresponding to the gene, gRNA, and
protein modalities.
## [1] "gene" "grna" "protein"
## [4] "cellwise_covariates"
Moreover, the files gene.odm, grna.odm, and
protein.odm have been written to disk. (The previous
gene.odm and grna.odm files are
overwritten.)
## [1] "gene.odm" "grna.odm" "protein.odm"
As part of importing the data,
create_odm_from_cellranger() computes the cell-wise
covariates. We print the first few rows of the cell-wise covariate data
frame corresponding to the synthetic data below.
## gene_n_umis gene_n_nonzero gene_p_mito grna_n_umis grna_n_nonzero
## <int> <int> <num> <int> <int>
## 1: 233 46 0.4420601 45 11
## 2: 159 37 0.4716981 37 6
## 3: 210 41 0.2047619 49 8
## 4: 215 36 0.3395349 60 13
## 5: 264 39 0.3106061 57 12
## 6: 207 35 0.2415459 58 10
## grna_feature_w_max_expression grna_frac_umis_max_feature protein_n_umis
## <char> <num> <int>
## 1: grna_1 0.2222222 14
## 2: grna_15 0.2702703 37
## 3: grna_1 0.1836735 24
## 4: grna_13 0.1333333 29
## 5: grna_14 0.1578947 21
## 6: grna_19 0.1724138 48
## protein_n_nonzero batch
## <int> <fctr>
## 1: 4 batch_1
## 2: 8 batch_1
## 3: 5 batch_1
## 4: 5 batch_1
## 5: 5 batch_1
## 6: 6 batch_1
The modality to which a given covariate corresponds (“gene”, “grna”, or “protein”) is prepended to the name of the covariate. We describe each covariate below.
- gene_n_umis: the number of gene UMIs sequenced in a given cell.
- gene_n_nonzero: the number of genes that exhibit nonzero expression
  in a given cell.
- gene_p_mito: the fraction of gene transcripts that map to
  mitochondrial genes in a given cell. (Mitochondrial genes are
  identified as genes whose name starts with "MT-" or "mt-".)
- grna_n_umis: similar to gene_n_umis but for the gRNA modality.
- grna_n_nonzero: similar to gene_n_nonzero but for the gRNA modality.
- grna_feature_w_max_expression: the ID of the gRNA that exhibits the
  maximum UMI count in a given cell.
- grna_frac_umis_max_feature: the fraction of UMIs that the maximally
  expressed gRNA in a given cell constitutes.
- protein_n_umis: similar to gene_n_umis but for the protein modality.
- protein_n_nonzero: similar to gene_n_nonzero but for the protein
  modality.
- batch: the batch in which a given cell was sequenced. Cells loaded
  from different directories are assumed to belong to different
  batches.
sceptre uses the covariates
grna_feature_w_max_expression and
grna_frac_umis_max_feature to assign gRNAs to cells.
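To make the covariate definitions concrete, the sketch below computes several of them from two small in-memory count matrices using base R. This illustrates the definitions only and is not the routine that ondisc uses internally; the matrices toy_gene_mat and toy_grna_mat and all of their values are hypothetical.

```r
# Toy feature-by-cell UMI count matrices (hypothetical values).
toy_gene_mat <- matrix(
  c(3L, 0L, 1L,
    0L, 2L, 0L,
    1L, 1L, 4L,
    0L, 0L, 5L),
  nrow = 4L, byrow = TRUE,
  dimnames = list(c("MT-ND1", "GENE_A", "GENE_B", "mt-Co1"), paste0("cell_", 1:3))
)
toy_grna_mat <- matrix(
  c(5L, 0L, 1L,
    1L, 2L, 1L,
    0L, 0L, 2L),
  nrow = 3L, byrow = TRUE,
  dimnames = list(paste0("grna_", 1:3), paste0("cell_", 1:3))
)

# Mitochondrial genes: names starting with "MT-" or "mt-".
is_mito <- grepl("^(MT-|mt-)", rownames(toy_gene_mat))

covariates <- data.frame(
  gene_n_umis = colSums(toy_gene_mat),
  gene_n_nonzero = colSums(toy_gene_mat > 0),
  gene_p_mito = colSums(toy_gene_mat[is_mito, , drop = FALSE]) /
    colSums(toy_gene_mat),
  grna_n_umis = colSums(toy_grna_mat),
  grna_n_nonzero = colSums(toy_grna_mat > 0),
  grna_feature_w_max_expression =
    rownames(toy_grna_mat)[apply(toy_grna_mat, 2, which.max)],
  grna_frac_umis_max_feature =
    apply(toy_grna_mat, 2, max) / colSums(toy_grna_mat)
)
covariates
```

In this toy example, a cell in which one gRNA accounts for most of the gRNA UMIs (such as cell_1) has a grna_frac_umis_max_feature close to 1.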
.odm file into R
Users can read an .odm file into R by calling the
function initialize_odm_from_backing_file(). Below, we call
initialize_odm_from_backing_file() on the file
gene.odm stored within temp_dir, which loads
the gene expression matrix that we created in the previous step.
temp_dir <- tempdir()
gene_odm <- initialize_odm_from_backing_file(
paste0(temp_dir, "/gene.odm")
)
gene_odm
## An object of class odm with the following attributes:
## • 100 features
## • 500 cells
## • Backing file: /var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/gene.odm
.odm files are portable. Thus, a user can create an
.odm file on one computer, move the .odm file
to another computer, and then open the .odm file on the
second computer. Note that odm objects themselves are not
portable; thus, to move an odm object from one computer to
another, the user should transfer the underlying .odm file
to the second computer and then open the .odm file on the
second computer via initialize_odm_from_backing_file().
odm object via create_odm_from_r_matrix()
We recommend that users create an odm object via
create_odm_from_cellranger(), as this function is highly
scalable and typically requires only a couple of gigabytes of memory.
However, users also can convert an R matrix into an odm
object via the function create_odm_from_r_matrix().
create_odm_from_r_matrix() takes two main arguments:
mat and file_to_write. mat is a
standard R matrix (of type "matrix") or a sparse R matrix
(of type "dgCMatrix", "dgRMatrix", or
"dgTMatrix"). mat should contain row names
giving the ID of each feature. Next, file_to_write is a
fully-qualified file path specifying the location in which to write the
backing .odm file. We provide an example of calling
create_odm_from_r_matrix() on a small gene-by-cell
expression matrix.
set.seed(4)
x <- rpois(100, lambda = 1)
gene_mat <- matrix(
x,
nrow = 5L,
dimnames = list(paste0("gene_", seq_len(5L)), paste0("cell_", seq_len(20L)))
)
gene_mat is a gene expression matrix containing 5 genes
and 20 cells. We pass this matrix to
create_odm_from_r_matrix(), setting
file_to_write to
paste0(temp_dir, "/gene.odm").
file_to_write <- paste0(temp_dir, "/gene.odm")
gene_odm <- create_odm_from_r_matrix(
mat = gene_mat,
file_to_write = file_to_write,
chunk_size = 5L
)
gene_odm is a standard odm object.
## An object of class odm with the following attributes:
## • 5 features
## • 20 cells
## • Backing file: /var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T//RtmpLF9Jz1/gene.odm
Moreover, the file gene.odm has been written to
temp_dir. (The previous gene.odm file is
overwritten.)
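As noted above, mat may also be a sparse matrix. The sketch below shows one way to obtain a "dgCMatrix" with feature IDs as row names, using the Matrix package. The object names dense_counts and sparse_mat are hypothetical, and the final call to create_odm_from_r_matrix() is left commented out so that the block runs without ondisc installed.

```r
library(Matrix)  # recommended package that ships with standard R installations

set.seed(4)
dense_counts <- matrix(
  rpois(100, lambda = 1),
  nrow = 5L,
  dimnames = list(paste0("gene_", seq_len(5L)), paste0("cell_", seq_len(20L)))
)

# Convert to compressed sparse column format; the row names are carried over.
sparse_mat <- Matrix(dense_counts, sparse = TRUE)
class(sparse_mat)
rownames(sparse_mat)

# The sparse matrix could then be passed as `mat`, e.g.:
# gene_odm <- create_odm_from_r_matrix(
#   mat = sparse_mat,
#   file_to_write = paste0(tempdir(), "/gene.odm")
# )
```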
create_odm_from_cellranger() and
create_odm_from_r_matrix() take optional arguments
chunk_size and compression_level (which are
set to reasonable defaults). chunk_size and
compression_level control the extent to which the backing
.odm file is compressed. chunk_size should be
a positive integer, and compression_level should be an
integer in the range of 0 to 9. Increasing the value of these arguments
increases the level of compression, thereby leading to a
smaller file size for the backing .odm file (but possibly
longer read and write times).
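As a loose analogy only, the sketch below writes the same count vector through base R gzip connections at two compression levels and compares the resulting file sizes. gzfile() accepts a compression argument on the same 0 to 9 scale. Whether compression_level in ondisc maps onto the same underlying codec is an assumption here, and the helper write_compressed() is hypothetical.

```r
set.seed(4)
counts <- rpois(1e5, lambda = 1)  # integer counts with many repeated values

# Hypothetical helper: write `counts` to a gzip file at a given level.
write_compressed <- function(level) {
  path <- tempfile(fileext = ".gz")
  con <- gzfile(path, open = "wb", compression = level)
  writeBin(counts, con)
  close(con)
  path
}

path_fast <- write_compressed(1L)   # low compression level
path_small <- write_compressed(9L)  # high compression level
sizes <- file.size(c(path_fast, path_small))
names(sizes) <- c("level_1", "level_9")
sizes

# The data are recovered exactly regardless of the level used.
con <- gzfile(path_small, open = "rb")
restored <- readBin(con, what = "integer", n = length(counts))
close(con)
identical(restored, counts)
```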
## ─ Session info ───────────────────────────────────────────────────────────────
## setting value
## version R version 4.5.1 (2025-06-13)
## os macOS Tahoe 26.5
## system aarch64, darwin20
## ui X11
## language (EN)
## collate C.UTF-8
## ctype C.UTF-8
## tz America/New_York
## date 2026-06-09
## pandoc 3.1.1 @ /Applications/quarto/bin/tools/ (via rmarkdown)
## quarto quarto script failed: unrecognized architecture @ /usr/local/bin/quarto
##
## ─ Packages ───────────────────────────────────────────────────────────────────
## package * version date (UTC) lib source
## bit 4.6.0 2025-03-06 [2] CRAN (R 4.5.0)
## bit64 4.6.0-1 2025-01-16 [2] CRAN (R 4.5.0)
## bslib 0.9.0 2025-01-30 [2] CRAN (R 4.5.0)
## cachem 1.1.0 2024-05-16 [2] CRAN (R 4.5.0)
## cli 3.6.5 2025-04-23 [2] CRAN (R 4.5.0)
## crayon 1.5.3 2024-06-20 [2] CRAN (R 4.5.0)
## data.table 1.18.2.1 2026-01-27 [2] CRAN (R 4.5.2)
## digest 0.6.37 2024-08-19 [2] CRAN (R 4.5.0)
## evaluate 1.0.5 2025-08-27 [2] CRAN (R 4.5.0)
## fastmap 1.2.0 2024-05-15 [2] CRAN (R 4.5.0)
## glue 1.8.0 2024-09-30 [2] CRAN (R 4.5.0)
## hms 1.1.3 2023-03-21 [2] CRAN (R 4.5.0)
## htmltools 0.5.8.1 2024-04-04 [2] CRAN (R 4.5.0)
## jquerylib 0.1.4 2021-04-26 [2] CRAN (R 4.5.0)
## jsonlite 2.0.0 2025-03-27 [2] CRAN (R 4.5.0)
## knitr 1.50 2025-03-16 [2] CRAN (R 4.5.0)
## lattice 0.22-7 2025-04-02 [2] CRAN (R 4.5.1)
## lifecycle 1.0.5 2026-01-08 [2] CRAN (R 4.5.2)
## magrittr 2.0.5 2026-04-04 [2] CRAN (R 4.5.2)
## Matrix 1.7-4 2025-08-28 [2] CRAN (R 4.5.0)
## ondisc * 1.3.5 2026-06-09 [1] Bioconductor
## pillar 1.11.1 2025-09-17 [2] CRAN (R 4.5.0)
## pkgconfig 2.0.3 2019-09-22 [2] CRAN (R 4.5.0)
## R.methodsS3 1.8.2 2022-06-13 [2] CRAN (R 4.5.0)
## R.oo 1.27.1 2025-05-02 [2] CRAN (R 4.5.0)
## R.utils 2.13.0 2025-02-24 [2] CRAN (R 4.5.0)
## R6 2.6.1 2025-02-15 [2] CRAN (R 4.5.0)
## Rcpp 1.1.0 2025-07-02 [2] CRAN (R 4.5.0)
## readr 2.1.5 2024-01-10 [2] CRAN (R 4.5.0)
## Rhdf5lib 1.31.0 2025-04-27 [2] Bioconductor 3.22 (R 4.5.0)
## rlang 1.2.0 2026-04-06 [2] CRAN (R 4.5.2)
## rmarkdown 2.29 2024-11-04 [2] CRAN (R 4.5.0)
## sass 0.4.10 2025-04-11 [2] CRAN (R 4.5.0)
## sessioninfo * 1.2.3 2025-02-05 [2] CRAN (R 4.5.0)
## tibble 3.3.1 2026-01-11 [2] CRAN (R 4.5.2)
## tidyselect 1.2.1 2024-03-11 [2] CRAN (R 4.5.0)
## tzdb 0.5.0 2025-03-15 [2] CRAN (R 4.5.0)
## vctrs 0.7.2 2026-03-21 [2] CRAN (R 4.5.2)
## vroom 1.6.5 2023-12-05 [2] CRAN (R 4.5.0)
## xfun 0.57 2026-03-20 [2] CRAN (R 4.5.2)
## yaml 2.3.10 2024-07-26 [2] CRAN (R 4.5.0)
##
## [1] /private/var/folders/mp/qgnyl4ss2cl9p399f3vpwqpc0000gn/T/RtmpS4gNAS/Rinstaafa27f1ac2
## [2] /Library/Frameworks/R.framework/Versions/4.5-arm64/Resources/library
## * ── Packages attached to the search path.
##
## ──────────────────────────────────────────────────────────────────────────────