Single-Cell Omics
The manylatents.singlecell module provides data loading for single-cell omics datasets stored in the AnnData .h5ad format, covering scRNA-seq, scATAC-seq, and CITE-seq assays.
Overview
Single-cell experiments measure gene expression (or chromatin accessibility, surface proteins, etc.) in individual cells. The resulting count matrices are stored as AnnData objects in .h5ad files, which manylatents-omics loads directly into PyTorch for dimensionality reduction and geometric analysis.
Install: uv add "manylatents-omics[singlecell]"
Shipped Datasets
Preconfigured Hydra configs are provided for common benchmark datasets:
| Dataset | Cells | Features | Config |
|---|---|---|---|
| PBMC 3k | ~2,700 | ~1,800 genes | data=pbmc_3k |
| PBMC 10k | ~10,000 | varies | data=pbmc_10k |
| PBMC 68k | ~68,000 | varies | data=pbmc_68k |
| Embryoid Body | varies | varies | data=embryoid_body |
Key Classes
AnnDataset
manylatents.singlecell.data.AnnDataset
A PyTorch Dataset for any AnnData .h5ad file. Supports:
- Loading from
adata.X,adata.raw.X, or a named layer (adata.layers[layer]) - Automatic sparse-to-dense conversion
- Cell-type label extraction from
adata.obswith integer encoding - Access to observation annotations via
get_obs(key)
AnnDataModule
manylatents.singlecell.data.AnnDataModule
A PyTorch Lightning DataModule wrapping AnnDataset. Supports two modes:
- full: Entire dataset used for both training and testing
- split: Random train/test split with configurable ratio and seed
Usage
# UMAP on PBMC 3k
python -m manylatents.main data=pbmc_3k algorithms/latent=umap
# Sweep datasets and algorithms
python -m manylatents.main -m \
data=pbmc_3k,pbmc_10k \
algorithms/latent=umap,phate
Loading a Custom Dataset
To use your own .h5ad file, create a Hydra config or instantiate the datamodule directly:
from manylatents.singlecell.data import AnnDataModule
dm = AnnDataModule(
adata_path="path/to/your_data.h5ad",
label_key="cell_type",
batch_size=128,
mode="full",
)
dm.setup()