Foundation Model Encoders

manylatents-omics provides pretrained foundation model encoders for biological sequences. All encoders inherit from FoundationEncoder, which provides lazy model loading, batched encoding with automatic OOM retry, and a standard fit/transform interface compatible with the manylatents experiment pipeline.

Encoder Summary

Encoder	Domain	Embedding Dim	VRAM	Architecture
ESM3	Protein	1536	16 GB+	Transformer (1.4B params)
Evo2	DNA	1920 / 4096 / 8192	24 GB+ (Ampere+)	StripedHyena 2 (1B/7B/40B)
Orthrus	RNA	256 / 512	8 GB+	Mamba SSM
AlphaGenome	DNA	1536 / 3072	40 GB+	JAX-based (DeepMind)

ESM3

Domain: Protein sequences (amino acids)

ESM3 is a frontier multimodal protein model from EvolutionaryScale that jointly reasons across sequence, structure, and function. The open model (esm3-sm-open-v1) has 1.4 billion parameters.

Embedding dimension: 1536
VRAM: 16 GB+
Pooling: Masked mean pooling over sequence length
Batched inference: True GPU batching with tokenizer padding

Key features:

True batched forward pass (one forward call per micro-batch rather than a per-sequence loop)
Automatic sequence truncation via max_length
Loads from HuggingFace or local weights

from manylatents.dogma.encoders.esm3 import ESM3Encoder

encoder = ESM3Encoder(max_length=2000)
embedding = encoder.encode("MKFGVRA")  # (1, 1536)
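
The masked mean pooling listed above can be illustrated in plain Python. This is a minimal sketch of the general technique, not the ESM3Encoder implementation: the helper name masked_mean_pool and the nested-list layout are hypothetical, and the real encoder presumably works on padded GPU tensors with an attention mask.

```python
def masked_mean_pool(hidden, mask):
    """Average per-token vectors over real (non-padding) positions.

    hidden: list of sequences, each a list of per-token vectors,
            all padded to the same length
    mask:   list of 0/1 flag lists, 1 = real token, 0 = padding
    Hypothetical helper, for illustration only.
    """
    pooled = []
    for tokens, flags in zip(hidden, mask):
        dim = len(tokens[0])
        total = [0.0] * dim
        count = 0
        for vec, keep in zip(tokens, flags):
            if keep:
                count += 1
                for i, value in enumerate(vec):
                    total[i] += value
        # Guard against an all-padding row to avoid division by zero.
        denom = max(count, 1)
        pooled.append([t / denom for t in total])
    return pooled


# Two sequences padded to length 3; the second has one padding position.
hidden = [
    [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]],
    [[2.0, 0.0], [4.0, 2.0], [9.0, 9.0]],  # last vector is padding
]
mask = [[1, 1, 1], [1, 1, 0]]
pooled = masked_mean_pool(hidden, mask)
print(pooled)  # [[3.0, 4.0], [3.0, 1.0]]
```

The padding vector in the second sequence does not affect its average, which is the point of masking: embeddings stay comparable across sequences of different lengths within the same padded batch.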

Reference: Hayes et al. (2024) "Simulating 500 million years of evolution with a language model"

Evo2

Domain: DNA sequences (nucleotides)

Evo2 is a DNA language model built on the StripedHyena 2 architecture. It models DNA at single-nucleotide resolution with context lengths of up to 1 million base pairs, and is available in 1B, 7B, and 40B parameter variants.

Embedding dimensions: 1920 (1B), 4096 (7B), 8192 (40B)
VRAM: 24 GB+ (requires Ampere or newer GPU)
Pooling: Masked mean pooling over sequence length
Multi-layer extraction: Extracts from multiple internal layers simultaneously

Key features:

Multi-layer embedding extraction (default for the 1B model: 3 layers at roughly 56%, 76%, and 92% depth)
Returns dict[str, Tensor] when multi-layer mode is active
True batched forward pass with tokenizer padding
OOM retry with automatic batch size halving

Model	Parameters	Hidden Dim	Default Layers
`evo2_1b_base`	1B	1920	blocks.14, blocks.19, blocks.23
`evo2_7b`	7B	4096	blocks.16
`evo2_40b`	40B	8192	blocks.32
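
The depth percentages quoted for the 1B model are consistent with the block indices in the table if the model has 25 blocks. That block count is an assumption, since it is not stated on this page. A short check under that assumption:

```python
# Assumption: evo2_1b_base has 25 blocks (not stated in this document).
NUM_BLOCKS_1B = 25
DEFAULT_LAYERS_1B = ["blocks.14", "blocks.19", "blocks.23"]


def relative_depth_percent(layer_name, num_blocks):
    """Return the block index as a whole-number percentage of model depth."""
    index = int(layer_name.split(".")[1])
    return round(100 * index / num_blocks)


depths = [relative_depth_percent(name, NUM_BLOCKS_1B) for name in DEFAULT_LAYERS_1B]
print(depths)  # [56, 76, 92]
```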

from manylatents.dogma.encoders.evo2 import Evo2Encoder

encoder = Evo2Encoder(model_name="evo2_1b_base")
result = encoder.encode("ATGAAGTTTGGCGTCCGTGCCTGA")
# Multi-layer default: result is dict with 3 layer keys
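
Because encode() can return either a single embedding or a dict keyed by layer, downstream code may want to normalize the two cases. The helper below is a hypothetical sketch: as_layer_dict is not part of the library, the key names are assumed to match the block names in the table above, and plain lists stand in for tensors.

```python
def as_layer_dict(result, default_key="embedding"):
    """Normalize an encoder result to a {layer_name: embedding} mapping.

    Hypothetical helper: wraps a single embedding under default_key and
    passes a multi-layer dict through as a shallow copy.
    """
    if isinstance(result, dict):
        return dict(result)
    return {default_key: result}


# Stand-in values; a real result would hold tensors, not Python lists.
multi = {"blocks.14": [0.1, 0.2], "blocks.19": [0.3, 0.4], "blocks.23": [0.5, 0.6]}
single = [0.7, 0.8]

for name, emb in as_layer_dict(multi).items():
    print(name, len(emb))
```

With this shape, the same loop can process single-layer and multi-layer results without branching at each call site.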

Reference: Nguyen et al. (2025) "Genome modeling and design across all domains of life with Evo 2"

Orthrus

Domain: RNA sequences (nucleotides)

Orthrus is a Mamba SSM-based RNA foundation model. The manylatents-omics implementation is a native re-implementation compatible with mamba-ssm 2.x, avoiding version conflicts with Evo2.

Embedding dimensions: 256 (4-track base), 512 (6-track large)
VRAM: 8 GB+
Input encoding: One-hot (A, C, G, U)
Pooling: Length-aware mean pooling (respects padding)

Key features:

Native mamba-ssm 2.x implementation (no dependency conflict with Evo2)
Supports multi-layer intermediate capture
Loads pretrained weights from HuggingFace

Model	Tracks	Hidden Dim	Layers
`orthrus-base-4-track`	4	256	8
`orthrus-large-6-track`	6	512	12

from manylatents.dogma.encoders.orthrus_native import OrthrusNativeEncoder

encoder = OrthrusNativeEncoder()
embedding = encoder.encode("AUGCAUGCAUGCAUGC")  # (1, 256)
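
For readers unfamiliar with one-hot input encoding, here is a minimal sketch of the 4-channel case. The channel order (A, C, G, U), the mapping of T to U, and the all-zero row for unknown symbols are assumptions made for illustration; the encoder's actual preprocessing may differ, and the 6-track variant carries additional tracks that are not shown.

```python
NUCLEOTIDES = "ACGU"  # channel order is an assumption for illustration


def one_hot_rna(sequence):
    """Encode an RNA string as a list of 4-element one-hot rows.

    DNA-style 'T' is mapped to 'U'; unknown symbols (e.g. 'N')
    become all-zero rows. Hypothetical helper.
    """
    rows = []
    for base in sequence.upper().replace("T", "U"):
        row = [0] * len(NUCLEOTIDES)
        idx = NUCLEOTIDES.find(base)
        if idx >= 0:
            row[idx] = 1
        rows.append(row)
    return rows


encoded = one_hot_rna("AUGN")
print(encoded)  # [[1, 0, 0, 0], [0, 0, 0, 1], [0, 0, 1, 0], [0, 0, 0, 0]]
```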

Reference: Fradkin et al. (2024) "Orthrus: Towards Evolutionary and Functional RNA Foundation Models"

AlphaGenome

Domain: DNA sequences with regulatory predictions

AlphaGenome is a JAX-based genomics foundation model from Google DeepMind that predicts regulatory features at single base-pair resolution across 1 Mb context windows.

Embedding dimensions: 1536 (1bp resolution), 3072 (128bp resolution)
VRAM: 40 GB+
Framework: JAX internally, PyTorch tensor output via torch-jax-interop
Context length: 1,000,000 bp

Key features:

Dual mode: embeddings (encode) and regulatory track predictions (predict)
Chunked encoding for sequences longer than context window
Regulatory track prediction (ATAC, CAGE, DNASE, RNA-seq, and more)
Automatic JAX compatibility patching for older JAX versions

Model	Resolution	Embedding Dim	Default Layer
`alphagenome`	1 bp	1536	`embeddings_1bp`
`alphagenome_128bp`	128 bp	3072	`embeddings_128bp`

from manylatents.dogma.encoders.alphagenome import AlphaGenomeEncoder

encoder = AlphaGenomeEncoder()
embedding = encoder.encode("ATGAAGTTTGGCGTCCGTGCCTGA")  # (1, 1536)
predictions = encoder.predict("ATGAAGTTTGGCGTCCGTGCCTGA")  # dict of track tensors
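
Chunked encoding for inputs longer than the context window can be pictured as a simple split step. The sketch below assumes consecutive, non-overlapping chunks. That is an illustrative choice, since this page does not say whether the encoder overlaps or pads chunks, or how per-chunk embeddings are combined afterwards. The name chunk_sequence is hypothetical.

```python
def chunk_sequence(sequence, context_length=1_000_000):
    """Split a sequence into consecutive, non-overlapping chunks.

    Non-overlapping chunks are an assumption; an implementation
    might use overlap or padding instead.
    """
    if context_length <= 0:
        raise ValueError("context_length must be positive")
    return [
        sequence[i:i + context_length]
        for i in range(0, len(sequence), context_length)
    ]


# A small context length makes the behaviour easy to see.
chunks = chunk_sequence("ATGAAGTTTGGC", context_length=5)
print(chunks)  # ['ATGAA', 'GTTTG', 'GC']
```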

Reference: "AlphaGenome: Foundation model for the human genome" (Google DeepMind)

FoundationEncoder Base Class

All encoders inherit from FoundationEncoder, which extends LatentModule with:

Lazy loading: Models are loaded on first encode() call, not at instantiation
Batched encoding: encode_batch() chunks inputs into micro-batches with automatic OOM retry (halves batch size on CUDA OOM, retries without resetting)
fit/transform interface: fit() is a no-op; transform() reads sequences from the datamodule and calls encode_batch()
True batched forward: Subclasses that implement _tokenize_batch() and _extract_embeddings() run one forward pass per micro-batch on the GPU instead of looping over single samples
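
The OOM retry behaviour can be sketched as a loop that halves the batch size and retries from the same position. This is not the FoundationEncoder code. It assumes that results from earlier micro-batches are kept and that the reduced batch size carries over to later batches. MemoryError stands in for a CUDA out-of-memory error, and encode_with_oom_retry is a hypothetical name.

```python
def encode_with_oom_retry(sequences, encode_fn, batch_size=8):
    """Encode sequences in micro-batches, halving the batch size on OOM.

    encode_fn takes a list of sequences and returns one embedding each.
    MemoryError stands in for a CUDA out-of-memory error in this sketch.
    """
    results = []
    start = 0
    while start < len(sequences):
        batch = sequences[start:start + batch_size]
        try:
            results.extend(encode_fn(batch))
        except MemoryError:
            if batch_size == 1:
                raise  # cannot shrink any further
            batch_size //= 2  # keep the smaller size for later batches
            continue  # retry from the same position; earlier results stay
        start += len(batch)
    return results


# Toy encoder that "runs out of memory" for batches larger than 2.
calls = []


def toy_encode(batch):
    calls.append(len(batch))
    if len(batch) > 2:
        raise MemoryError("simulated OOM")
    return [len(seq) for seq in batch]


embeddings = encode_with_oom_retry(
    ["A", "AC", "ACG", "ACGT", "ACGTA"], toy_encode, batch_size=8
)
print(embeddings)  # [1, 2, 3, 4, 5]
print(calls)       # [5, 4, 2, 2, 1]
```

The call log shows two failed attempts (sizes 5 and 4) before the loop settles on a batch size of 2 and finishes without re-encoding anything that already succeeded.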

# All encoders follow the same interface
encoder.fit(x)                          # no-op for pretrained models
embeddings = encoder.transform(x)       # encodes sequences from datamodule
embeddings = encoder.encode_batch(seqs) # direct batched encoding
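
To make the lazy-loading and fit/transform pattern concrete, here is a toy stand-in. The class LazyEncoderSketch is hypothetical and does not reproduce FoundationEncoder or LatentModule. Its "model" is just the built-in len, and transform() takes a list of sequences directly rather than reading them from a datamodule.

```python
class LazyEncoderSketch:
    """Toy stand-in for the lazy-loading, fit/transform pattern."""

    def __init__(self):
        self._model = None  # nothing is loaded at instantiation
        self.load_count = 0

    def _load_model(self):
        # Placeholder for an expensive weight load.
        self.load_count += 1
        return len  # the "model" maps a sequence to its length

    def encode(self, sequence):
        if self._model is None:  # load once, on first use
            self._model = self._load_model()
        return self._model(sequence)

    def encode_batch(self, sequences):
        return [self.encode(seq) for seq in sequences]

    def fit(self, x=None):
        return self  # no-op: the model is pretrained

    def transform(self, sequences):
        return self.encode_batch(sequences)


encoder = LazyEncoderSketch()
loaded_before = encoder.load_count  # 0: constructor did not load anything
outputs = encoder.fit().transform(["ATG", "ATGAAG"])
loaded_after = encoder.load_count   # 1: loaded on the first encode() call
print(loaded_before, outputs, loaded_after)  # 0 [3, 6] 1
```

Deferring the load keeps instantiation cheap, which matters when an experiment config constructs several encoders but only runs one of them.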