Data

manyLatents provides synthetic manifold datasets for benchmarking and a precomputed loader for custom data.

| dataset | config | key params |
|---|---|---|
| DLATree | data=dla_tree | n_dim=50, n_branch=10, branch_lengths=[300, 300, 25, 300, 300, 300, 25, 300, 300, 300] |
| DLATreeFromGraph | data=dla_tree_from_graph | n_dim=100, sigma=0.5, mode=full |
| DLATreeFromGraph | data=dla_tree_from_graph_nogaps | n_dim=100, sigma=0.5, mode=full |
| GaussianBlob | data=gaussian_blobs | n_samples=1000, n_features=2, centers=5 |
| MHI | data=mhi_split | cache_dir=${cache_dir}, mmap_mode=None, mode=split |
| Precomputed | data=precomputed | path=None, label_col=None, mode=full |
| SaddleSurface | data=saddle_surface | n_distributions=100, n_points_per_distribution=50, noise=0.1 |
| SwissRoll | data=swissroll | n_distributions=10, n_points_per_distribution=100, noise=0.2 |
| Torus | data=torus | n_points=1000, noise=0.1, major_radius=5.0 |
| Text | data=wikitext | dataset_name=wikitext, dataset_config=wikitext-2-raw-v1, tokenizer_name=gpt2 |

Domain-specific datasets (genomics, single-cell) are available via the manylatents-omics extension.

Precomputed Data

Load your own data from .npy or .npz files:

uv run python -m manylatents.main data=precomputed data.path=/path/to/data.npy algorithms/latent=umap
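
If you do not yet have a file in one of these formats, the following is a minimal sketch of writing one with NumPy. The array shape (samples by features), the temporary output directory, and the `.npz` key names `data` and `labels` are assumptions made for illustration; this page does not state which layout or keys the precomputed loader expects, so check that before relying on it.

```python
import tempfile
from pathlib import Path

import numpy as np

# Hypothetical data: 200 samples with 50 features each.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 50)).astype(np.float32)
labels = rng.integers(0, 3, size=200)

out_dir = Path(tempfile.mkdtemp())

# A single array goes into a .npy file.
npy_path = out_dir / "data.npy"
np.save(npy_path, X)

# Several named arrays go into one .npz archive.
# The key names below are placeholders, not names required by manyLatents.
npz_path = out_dir / "data.npz"
np.savez(npz_path, data=X, labels=labels)

# Round-trip check: read both files back.
loaded = np.load(npy_path)
with np.load(npz_path) as archive:
    archive_keys = sorted(archive.files)
    loaded_labels = archive["labels"]

print(loaded.shape, archive_keys)
```

The resulting path can then be passed as `data.path` in the command above.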

Sampling

Large datasets are subsampled before metric evaluation. Configure the sampling strategy under metrics.sampling:

| strategy | config | defaults |
|---|---|---|
| FarthestPointSampling | sampling/farthest_point | seed=42, fraction=0.1 |
| RandomSampling | sampling/random | seed=42, fraction=0.1 |
| StratifiedSampling | sampling/stratified | stratify_by=population_label, seed=42, fraction=0.1 |
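
To give a sense of what a farthest-point strategy does with `seed` and `fraction`, here is a minimal sketch of generic greedy farthest-point sampling in NumPy. It is not the manyLatents implementation: the function name `farthest_point_sample`, the Euclidean distance, and the way `fraction` is turned into a sample count are all assumptions for illustration.

```python
import numpy as np


def farthest_point_sample(X, fraction=0.1, seed=42):
    """Return row indices of a greedy farthest-point subsample of X.

    Hypothetical helper: starts from a random point, then repeatedly adds
    the point farthest (Euclidean) from everything selected so far.
    """
    n = X.shape[0]
    k = max(1, int(round(fraction * n)))  # assumed mapping from fraction to count
    rng = np.random.default_rng(seed)

    selected = np.empty(k, dtype=int)
    selected[0] = rng.integers(n)

    # Distance from every point to its nearest already-selected point.
    min_dist = np.linalg.norm(X - X[selected[0]], axis=1)
    for i in range(1, k):
        selected[i] = int(np.argmax(min_dist))
        new_dist = np.linalg.norm(X - X[selected[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return selected


# Usage on synthetic data: keep 10% of 1000 points.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
idx = farthest_point_sample(X, fraction=0.1, seed=42)
print(len(idx))
```

Compared with uniform random sampling, this greedy scheme tends to spread the retained points across the data, at the cost of one distance pass over the dataset per selected point.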