Data
manyLatents provides synthetic manifold datasets for benchmarking and a precomputed loader for custom data.
| dataset | config | key params |
|---|---|---|
| DLATree | data=dla_tree | n_dim=50, n_branch=10, branch_lengths=[300, 300, 25, 300, 300, 300, 25, 300, 300, 300] |
| DLATreeFromGraph | data=dla_tree_from_graph | n_dim=100, sigma=0.5, mode=full |
| DLATreeFromGraph | data=dla_tree_from_graph_nogaps | n_dim=100, sigma=0.5, mode=full |
| GaussianBlob | data=gaussian_blobs | n_samples=1000, n_features=2, centers=5 |
| MHI | data=mhi_split | cache_dir=${cache_dir}, mmap_mode=None, mode=split |
| Precomputed | data=precomputed | path=None, label_col=None, mode=full |
| SaddleSurface | data=saddle_surface | n_distributions=100, n_points_per_distribution=50, noise=0.1 |
| SwissRoll | data=swissroll | n_distributions=10, n_points_per_distribution=100, noise=0.2 |
| Torus | data=torus | n_points=1000, noise=0.1, major_radius=5.0 |
| Text | data=wikitext | dataset_name=wikitext, dataset_config=wikitext-2-raw-v1, tokenizer_name=gpt2 |
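
To give a feel for what the Torus parameters control, here is a minimal NumPy sketch of a noisy torus sampler. It is not the manyLatents implementation: the function name `sample_noisy_torus`, the `minor_radius` and `seed` arguments, and the choice of isotropic Gaussian noise are assumptions made for illustration. Only `n_points`, `noise`, and `major_radius` come from the table above.

```python
import numpy as np


def sample_noisy_torus(n_points=1000, noise=0.1, major_radius=5.0,
                       minor_radius=1.0, seed=0):
    """Sample points on a torus in 3D and add Gaussian noise.

    minor_radius and seed are hypothetical parameters for this sketch;
    they are not listed in the manyLatents config table.
    """
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)  # angle around the main ring
    phi = rng.uniform(0.0, 2.0 * np.pi, n_points)    # angle around the tube

    # Standard torus parameterization.
    ring = major_radius + minor_radius * np.cos(phi)
    x = ring * np.cos(theta)
    y = ring * np.sin(theta)
    z = minor_radius * np.sin(phi)

    points = np.stack([x, y, z], axis=1)
    # Assumed noise model: independent Gaussian noise on every coordinate.
    points += rng.normal(scale=noise, size=points.shape)
    return points, theta


points, theta = sample_noisy_torus()
print(points.shape)  # (1000, 3)
```

The returned `theta` can serve as a ground-truth coordinate when checking whether an embedding preserves the ring structure.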
Domain-specific datasets (genomics, single-cell) are available via the manylatents-omics extension.
Precomputed Data
Load your own data from .npy or .npz files:
uv run python -m manylatents.main data=precomputed data.path=/path/to/data.npy algorithms/latent=umap
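
If your data is not yet on disk in one of these formats, a short NumPy script can produce it. The sketch below writes a random array to a `.npy` file and reads it back. The `(n_samples, n_features)` layout is an assumption about what the loader expects, and the array contents and temporary directory are placeholders.

```python
import os
import tempfile

import numpy as np

# Placeholder data: 500 samples with 32 features each.
# Assumption: the loader expects a 2D (n_samples, n_features) array.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(500, 32)).astype(np.float32)

# Write the array to a .npy file in a temporary directory.
out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "data.npy")
np.save(npy_path, embeddings)

# Read it back to confirm the file round-trips as expected.
loaded = np.load(npy_path)
print(loaded.shape, loaded.dtype)  # (500, 32) float32
print(npy_path)
```

The printed path is what you would pass as `data.path` in the command above.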
Sampling
Large datasets are subsampled before metric evaluation. Configure the sampling strategy under metrics.sampling:
| strategy | config | defaults |
|---|---|---|
| FarthestPointSampling | sampling/farthest_point | seed=42, fraction=0.1 |
| RandomSampling | sampling/random | seed=42, fraction=0.1 |
| StratifiedSampling | sampling/stratified | stratify_by=population_label, seed=42, fraction=0.1 |
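
The three strategies differ in how they choose which rows to keep. The NumPy sketch below illustrates the general idea behind each one, using the table's defaults (`seed=42`, `fraction=0.1`). These are textbook versions written for illustration, not the manyLatents code: the function names, the rounding rule for the subset size, and the greedy farthest-point variant are all assumptions.

```python
import numpy as np


def random_sample(n, fraction=0.1, seed=42):
    """Pick a uniform random subset of row indices without replacement."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(n * fraction)))
    return np.sort(rng.choice(n, size=k, replace=False))


def farthest_point_sample(x, fraction=0.1, seed=42):
    """Greedy farthest-point sampling: repeatedly add the point that is
    farthest from everything chosen so far, giving an even spatial spread."""
    rng = np.random.default_rng(seed)
    n = x.shape[0]
    k = max(1, int(round(n * fraction)))
    chosen = np.empty(k, dtype=int)
    chosen[0] = rng.integers(n)  # random starting point
    # Distance from every point to its nearest chosen point.
    min_dist = np.linalg.norm(x - x[chosen[0]], axis=1)
    for i in range(1, k):
        chosen[i] = int(np.argmax(min_dist))
        new_dist = np.linalg.norm(x - x[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, new_dist)
    return chosen


def stratified_sample(labels, fraction=0.1, seed=42):
    """Sample the same fraction from each label group, so that group
    proportions are preserved in the subset."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for value in np.unique(labels):
        members = np.flatnonzero(labels == value)
        k = max(1, int(round(len(members) * fraction)))
        picked.append(rng.choice(members, size=k, replace=False))
    return np.sort(np.concatenate(picked))


# Toy data: 1000 points in 3D with four equally sized label groups.
data_rng = np.random.default_rng(0)
x = data_rng.normal(size=(1000, 3))
population_label = np.repeat(["a", "b", "c", "d"], 250)

idx_random = random_sample(len(x))
idx_fps = farthest_point_sample(x)
idx_strat = stratified_sample(population_label)
print(len(idx_random), len(idx_fps), len(idx_strat))  # 100 100 100
```

Random sampling is the cheapest option, farthest-point sampling covers sparse regions that random sampling tends to miss, and stratified sampling keeps small label groups represented in the subset.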