Skip to content

Population Genetics

The manylatents.popgen module provides data loading and evaluation metrics for population genetics, built around the manifold-genetics CSV pipeline.


Overview

Population genetics studies how genetic variation is distributed across human populations. manylatents-omics loads precomputed PCA coordinates, admixture proportions, and geographic metadata produced by manifold-genetics, then evaluates dimensionality reduction algorithms on how well they preserve population structure.

Install: uv add "manylatents-omics[popgen]"


Prerequisites

The popgen module does not perform genotype QC, PCA, or admixture estimation itself. These steps are handled upstream by manifold-genetics, a separate command-line tool (not a Python dependency).

Workflow:

  1. manifold-genetics — Takes raw genotype data (VCF/PLINK), runs QC, PCA, and ADMIXTURE, and outputs standardized CSVs
  2. manylatents-omics — Loads those CSVs via ManifoldGeneticsDataset, runs DR algorithms, and evaluates with popgen-specific metrics

This separation keeps manylatents-omics free of heavy bioinformatics dependencies (PLINK, ADMIXTURE) and makes it biobank-agnostic: any dataset preprocessed by manifold-genetics works out of the box.


Supported Datasets

Dataset Populations Samples Config
HGDP + 1KGP ~80 ~4,000 data=hgdp
UK Biobank varies varies data=ukbb
All of Us varies varies data=aou

Datasets are preprocessed upstream by manifold-genetics into a standardized directory structure:

output_dir/
  pca/
    fit_pca_*.csv
    transform_pca_*.csv
  admixture/
    fit.K5.csv
    transform.K5.csv
  labels.csv
  colormap.json

All CSVs are aligned by a sample_id column.


Key Classes

ManifoldGeneticsDataset

manylatents.popgen.data.ManifoldGeneticsDataset

A PyTorch Dataset that loads manifold-genetics outputs. It is completely biobank-agnostic -- all preprocessing, filtering, and subsetting is handled upstream by manifold-genetics.

Handles:

  • Loading PCA coordinates, admixture ratios, and optional custom embeddings from CSV
  • Joining data sources by sample_id
  • Integer label encoding for PyTorch compatibility (with get_label_names() for original strings)
  • Geographic coordinate access for preservation metrics

ManifoldGeneticsDataModule

manylatents.popgen.data.ManifoldGeneticsDataModule

A PyTorch Lightning DataModule wrapping ManifoldGeneticsDataset. Supports two modes:

  • split: Separate train and test CSVs (from manifold-genetics fit/transform outputs)
  • full: Same data for both train and test

Metrics

Population genetics metrics evaluate how well a low-dimensional embedding preserves genetic population structure.

Metric What it Measures
GeographicPreservation Spearman correlation between haversine distances (lat/lon) and embedding distances
AdmixturePreservation Spearman correlation between geodesic admixture distances and embedding distances

All metrics accept an embeddings array, the dataset, and optional parameters. They automatically handle subsetting to samples with meaningful geography (excluding recent migrants and admixed populations).


Usage

# Run UMAP on HGDP population genetics data
python -m manylatents.main data=hgdp algorithms/latent=umap

# Sweep over multiple algorithms
python -m manylatents.main -m \
  data=hgdp \
  algorithms/latent=umap,phate,tsne