Data

manyLatents provides synthetic manifold datasets for benchmarking and a precomputed loader for custom data.

| Dataset | Config | Key params |
| --- | --- | --- |
| Archetypal | data=archetypal | n_components=3, simplex_radius=1.0, n_obs=5000 |
| DLATree | data=dla_tree | n_dim=50, n_branch=10, branch_lengths=[300, 300, 25, 300, 300, 300, 25, 300, 300, 300] |
| DLATreeFromGraph | data=dla_tree_from_graph | n_dim=100, sigma=0.5, mode=full |
| DLATreeFromGraph | data=dla_tree_from_graph_nogaps | n_dim=100, sigma=0.5, mode=full |
| GaussianBlob | data=gaussian_blobs | n_samples=1000, n_features=2, centers=5 |
| MHI | data=mhi_split | cache_dir=${cache_dir}, mmap_mode=None, mode=split |
| HFText | data=pile | dataset_name=None, dataset_config=None, data_dir=/network/datasets/pile |
| Precomputed | data=precomputed | path=None, label_col=None, mode=full |
| ReasoningTrace | data=reasoning_trace | trace_store_path=None, tensor_key=pooled_steps, layer_index=-1 |
| SaddleSurface | data=saddle_surface | n_distributions=100, n_points_per_distribution=50, noise=0.1 |
| SwissRoll | data=swissroll | n_distributions=10, n_points_per_distribution=100, noise=0.2 |
| Torus | data=torus | n_points=1000, noise=0.1, major_radius=5.0 |
| HFText | data=wikitext | dataset_name=wikitext, dataset_config=wikitext-2-raw-v1, tokenizer_name=gpt2 |
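
To give a feel for what the key params of a synthetic dataset control, here is a minimal NumPy sketch of a noisy torus that uses the Torus defaults listed above (n_points, noise, major_radius). It is not the manyLatents implementation: the function name `make_torus`, the `minor_radius` and `seed` arguments, and the choice of isotropic Gaussian noise are all assumptions made for illustration.

```python
import numpy as np


def make_torus(n_points=1000, noise=0.1, major_radius=5.0,
               minor_radius=1.0, seed=42):
    """Sample noisy points on a torus embedded in 3D (illustrative only)."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi, n_points)  # angle around the main ring
    phi = rng.uniform(0.0, 2.0 * np.pi, n_points)    # angle around the tube
    ring = major_radius + minor_radius * np.cos(phi)
    points = np.column_stack([
        ring * np.cos(theta),
        ring * np.sin(theta),
        minor_radius * np.sin(phi),
    ])
    # Assumption: `noise` is the standard deviation of additive Gaussian noise.
    return points + rng.normal(scale=noise, size=points.shape)


X = make_torus()
print(X.shape)  # (1000, 3)
```

With noise=0.0 every sampled point lies exactly on the torus surface, which makes a convenient sanity check.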

Domain-specific datasets (genomics, single-cell) are available via the manylatents-omics extension.

Precomputed Data

Load your own data from .npy or .npz files:

uv run python -m manylatents.main data=precomputed data.path=/path/to/data.npy algorithms/latent=umap
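
A minimal sketch of preparing such a file with NumPy follows. The array shape, the random contents, and the file name are placeholders, and the layout (one row per sample, one column per feature) is an assumption rather than something this page specifies.

```python
import os
import tempfile

import numpy as np

# Hypothetical data: 500 samples with 20 features each
# (rows = samples is an assumption about the expected layout).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype(np.float32)

# Write the array to a .npy file; the directory and file name are arbitrary.
out_dir = tempfile.mkdtemp()
npy_path = os.path.join(out_dir, "data.npy")
np.save(npy_path, X)

# Round-trip check: the file loads back with the same shape and values.
loaded = np.load(npy_path)
print(npy_path, loaded.shape)
```

The printed path can then be passed as data.path in the command above.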

Sampling

Large datasets are subsampled before metric evaluation. Configure the sampling strategy under metrics.sampling:

| Strategy | Config | Defaults |
| --- | --- | --- |
| FarthestPointSampling | sampling/farthest_point | seed=42, fraction=0.1 |
| RandomSampling | sampling/random | seed=42, fraction=0.1 |
| StratifiedSampling | sampling/stratified | stratify_by=population_label, seed=42, fraction=0.1 |
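
The three strategies can be sketched in plain NumPy as follows. This is an illustration of the general techniques, not the manyLatents code: the function names are hypothetical, `fraction` is assumed to mean the share of points kept, and the farthest point variant shown is the standard greedy algorithm with a random starting point.

```python
import numpy as np


def random_sample(n, fraction=0.1, seed=42):
    """Uniform random subset of indices, drawn without replacement."""
    rng = np.random.default_rng(seed)
    k = max(1, int(round(n * fraction)))
    return np.sort(rng.choice(n, size=k, replace=False))


def farthest_point_sample(X, fraction=0.1, seed=42):
    """Greedy farthest point sampling: each new point is the one
    farthest from all points chosen so far."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    k = max(1, int(round(n * fraction)))
    chosen = np.empty(k, dtype=int)
    chosen[0] = rng.integers(n)  # random starting point
    # min_dist[i] = distance from point i to its nearest chosen point
    min_dist = np.linalg.norm(X - X[chosen[0]], axis=1)
    for j in range(1, k):
        chosen[j] = int(np.argmax(min_dist))
        dist_to_new = np.linalg.norm(X - X[chosen[j]], axis=1)
        min_dist = np.minimum(min_dist, dist_to_new)
    return chosen


def stratified_sample(labels, fraction=0.1, seed=42):
    """Keep the same fraction of every label group (at least one per group)."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    picked = []
    for value in np.unique(labels):
        group = np.flatnonzero(labels == value)
        k = max(1, int(round(len(group) * fraction)))
        picked.append(rng.choice(group, size=k, replace=False))
    return np.sort(np.concatenate(picked))


# Toy data: 1000 points in 3D, four equally sized label groups.
X = np.random.default_rng(0).normal(size=(1000, 3))
labels = np.repeat(np.arange(4), 250)

idx_random = random_sample(len(X))
idx_fps = farthest_point_sample(X)
idx_strat = stratified_sample(labels)
print(len(idx_random), len(idx_fps), len(idx_strat))  # 100 100 100
```

Farthest point sampling spreads the subset evenly over the data, random sampling preserves density, and stratified sampling preserves the proportion of each label, which is why the last strategy needs a label column such as the one named by stratify_by.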