ClinVar Central Dogma Analysis Pipeline
Complete reproducible pipeline for geometric analysis of foundation model embeddings across the central dogma using ClinVar variants.
Pipeline Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 1: DATA PREPARATION │
│ Entry: ClinVar VCF + RefSeq │
│ Exit: data/clinvar/{variants.tsv, dna.fasta, rna.fasta, protein.fasta} │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 2: EMBEDDING GENERATION (All in same environment) │
│ Entry: FASTA files │
│ Exit: embeddings/clinvar/{evo2,orthrus,esm3}.pt │
└─────────────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────┐
│ STAGE 3: FUSION & GEOMETRIC ANALYSIS │
│ Entry: Per-modality embeddings │
│ Exit: Fused embeddings + geometric metrics │
└─────────────────────────────────────────────────────────────────────────────┘
Encoders
All three encoders work in the same environment (mamba-ssm 2.x compatible):
| Encoder | Modality | Embedding Dim | HuggingFace Model |
|---|---|---|---|
| Evo2Encoder | DNA | 1920 | arcinstitute/evo2_1b_base |
| OrthrusEncoder | RNA | 256 | quietflamingo/orthrus-base-4-track |
| ESM3Encoder | Protein | 1536 | esm3_sm_open_v1 |
Note: OrthrusEncoder was re-implemented to use mamba-ssm 2.x Block API, eliminating
the need for a separate environment. See manylatents/dogma/encoders/orthrus_native.py.
Quick Start
E2E Test (3-way fusion)
# Submit GPU job
sbatch scripts/run_e2e_test.sh
# Check results
cat logs/clinvar_e2e_*.out
Hydra Experiments
# Encode DNA
python -m manylatents.main --config-name=config experiment=clinvar/encode_dna
# Encode RNA
python -m manylatents.main --config-name=config experiment=clinvar/encode_rna
# Encode Protein
python -m manylatents.main --config-name=config experiment=clinvar/encode_protein
# Geometric analysis on fused embeddings
python -m manylatents.main --config-name=config experiment=clinvar/geometric_analysis
Stage 1: Data Preparation
Entry Point
- ClinVar VCF from NCBI:
ftp://ftp.ncbi.nlm.nih.gov/pub/clinvar/vcf_GRCh38/
Script
python scripts/download_clinvar.py \
--genes BRCA1,BRCA2 \
--output data/clinvar/
Exit Points
| File | Description |
|---|---|
variants.tsv |
Variant metadata |
dna.fasta |
DNA context sequences |
protein.fasta |
Translated protein sequences |
labels.csv |
Pathogenicity labels |
Stage 2: Embedding Generation
All encoders run in the same GPU job:
# Via Hydra (recommended)
python -m manylatents.main --config-name=config experiment=clinvar/encode_dna
python -m manylatents.main --config-name=config experiment=clinvar/encode_rna
python -m manylatents.main --config-name=config experiment=clinvar/encode_protein
# Or parallel SLURM
sbatch scripts/run_e2e_test.sh
Exit Points
| File | Encoder | Dimensions |
|---|---|---|
embeddings/clinvar/evo2.pt |
Evo2 | (N, 1920) |
embeddings/clinvar/orthrus.pt |
Orthrus | (N, 256) |
embeddings/clinvar/esm3.pt |
ESM3 | (N, 1536) |
Stage 3: Fusion & Geometric Analysis
MergingModule Strategies
| Strategy | Output Dim | Description |
|---|---|---|
concat |
3712 | Concatenation [DNA; RNA; Protein] |
mean |
requires same dim | Element-wise mean |
weighted_sum |
requires same dim | Weighted combination |
Hydra Config
python -m manylatents.main --config-name=config experiment=clinvar/geometric_analysis
# With weighted fusion
python -m manylatents.main --config-name=config experiment=clinvar/geometric_analysis \
algorithms.latent.strategy=weighted_sum \
'algorithms.latent.weights={evo2: 0.5, esm3: 0.5}'
Metrics
| Metric | Description |
|---|---|
ParticipationRatio |
Effective dimensionality |
LocalIntrinsicDimensionality |
KNN-based local dimension |
TangentSpaceApproximation |
PCA-based local dimension |
Directory Structure
omics/
├── manylatents/dogma/
│ ├── encoders/
│ │ ├── evo2.py # DNA encoder
│ │ ├── orthrus_native.py # RNA encoder (mamba-ssm 2.x)
│ │ └── esm3.py # Protein encoder
│ ├── configs/experiment/clinvar/
│ │ ├── encode_dna.yaml
│ │ ├── encode_rna.yaml
│ │ ├── encode_protein.yaml
│ │ └── geometric_analysis.yaml
│ └── data/
│ └── clinvar_dataset.py # ClinVarDataModule
├── scripts/
│ ├── download_clinvar.py # Data preparation
│ ├── test_clinvar_e2e.py # E2E test
│ └── run_e2e_test.sh # SLURM submission
└── tests/dogma/
└── test_config_e2e.py # Config validation
Environment Setup
Single environment for all encoders:
cd /network/scratch/c/cesar.valdez/lrw/omics
# Install wheelnext uv (for CUDA wheel variants)
curl -LsSf https://astral.sh/uv/install.sh | INSTALLER_DOWNLOAD_URL=https://wheelnext.astral.sh sh
# Sync with dogma extras
uv sync --extra dogma --index-strategy unsafe-best-match
# Verify imports
uv run python -c "from manylatents.dogma.encoders import Evo2Encoder, OrthrusEncoder, ESM3Encoder; print('OK')"
WandB Project
All experiments log to: merging-dogma
Example runs: - E2E 3-way fusion: https://wandb.ai/cesar-valdez-mcgill-university/merging-dogma/runs/imnb1zu7
Reproducibility Checklist
- [ ] ClinVar VCF version documented
- [ ] GPU type documented (L40S/H100 for Evo2)
- [ ] All embeddings saved with labels
- [ ] WandB run URL logged
- [ ] Config E2E tests pass (
uv run python -m pytest tests/dogma/test_config_e2e.py)