manylatents-omics

Biological extensions for manylatents adding population genetics, single-cell omics, and foundation model encoders for DNA, RNA, and protein sequences.

Installation

Install the base package:

uv add manylatents-omics

Enable domain-specific extras depending on your use case:

# Population genetics (manifold-genetics CSV pipeline)
uv add "manylatents-omics[popgen]"

# Single-cell omics (AnnData / scanpy)
uv add "manylatents-omics[singlecell]"

# Foundation model encoders (ESM3, Orthrus, Evo2 -- requires GPU)
uv add "manylatents-omics[dogma]"

For foundation model encoders on CUDA, use wheelnext uv to get prebuilt GPU wheels:

curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh
uv sync --extra dogma --index-strategy unsafe-best-match

Quick Start

Omics configs are auto-discovered when the package is installed, but the datasets are not bundled (they're gitignored and excluded from the wheel). Download them first into the location ${omics_data:} resolves to:

# Fetches PBMC 3k to <repo>/data from a checkout, else ~/.cache/manylatents/data.
# Override the destination with: export MANYLATENTS_DATA=/path/to/data
python scripts/download_pbmc.py --dataset 3k

# Single-cell: UMAP on PBMC 3k
python -m manylatents.main data=pbmc_3k algorithms/latent=umap

# Population genetics: HGDP dataset
python -m manylatents.main data=hgdp algorithms/latent=phate

# Foundation model encoding: ClinVar DNA
python -m manylatents.main experiment=clinvar/encode_dna

Note

Omics data configs (data=pbmc_3k, data=hgdp, etc.) are discovered automatically once manylatents-omics is installed. The ${omics_data:} resolver points at a writable data root — a source checkout's data/, else a per-user cache dir — so data=pbmc_3k finds the file regardless of whether the package is editable or an installed wheel. Set MANYLATENTS_DATA to override it.

Modules

manylatents-omics is organized into three domain modules:

Module	Domain	Data Format	Extra
PopGen	Population genetics	manifold-genetics CSVs	`[popgen]`
Single-Cell	Single-cell omics	AnnData `.h5ad`	`[singlecell]`
Dogma	DNA / RNA / Protein	FASTA sequences	`[dogma]`

PopGen provides the ManifoldGeneticsDataModule for loading PCA, admixture, and geographic data from the manifold-genetics pipeline, along with domain-specific metrics like geographic and admixture preservation.

Single-Cell provides AnnDataModule for loading scRNA-seq, scATAC-seq, and CITE-seq datasets stored in the AnnData .h5ad format. Ships with PBMC 3k, 10k, 68k, and Embryoid Body configs.

Dogma provides pretrained foundation model encoders (ESM3, Evo2, Orthrus, AlphaGenome) that transform biological sequences into dense embeddings, plus the ClinVar pipeline for multi-modal geometric analysis.

Parent Project

manylatents-omics extends the core manylatents library for dimensionality reduction and geometric analysis. Refer to the parent documentation for details on algorithms, metrics, and the experiment framework.