Data Structure

This page describes the directory layout and file formats that Asparagus expects. Asparagus locates every directory through environment variables ($ASPARAGUS_DATA, $ASPARAGUS_MODELS, $ASPARAGUS_RAW_LABELS), so configure those first.

One-step preprocessing

To avoid storing intermediate dataset versions, all formatting, restructuring, and preprocessing should happen in a single preprocessing script. Only operations that stay fixed for a dataset (reorientation, resampling, label remapping) belong here. Operations that change frequently, such as normalization, are applied online during training.

Directory layout

$ASPARAGUS_DATA/                    # Root data directory
└── SEG005_MyDataset/               # One folder per task
    ├── dataset.json                # Task metadata (auto-generated by preprocessing)
    ├── paths.json                  # All sample paths (auto-generated by preprocessing)
    ├── split_75_15_10.json         # Train/val/test split (generated with asp_split)
    └── <subject dirs>/
        ├── file.pt                 # Preprocessed image tensor
        └── file.pkl                # Per-file metadata

$ASPARAGUS_MODELS/                  # Training outputs
└── <run_id>/
    ├── checkpoints/
    │   ├── best.ckpt
    │   └── last.ckpt
    ├── hydra/
    │   └── config.yaml             # Saved run configuration
    └── predictions/                # Inference outputs

$ASPARAGUS_RAW_LABELS/              # Original (unpreprocessed) labels for evaluation
└── SEG005_MyDataset/
    └── ...

Both dataset.json and paths.json are generated automatically during preprocessing. You should not need to edit them by hand.
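As a quick sanity check after preprocessing, the layout above can be verified with a few lines of Python. This is a minimal sketch, not part of Asparagus: `missing_task_files` and its `data_root` parameter are hypothetical names, and the only assumption is that dataset.json and paths.json sit directly in the task folder as shown in the tree.

```python
import os
from pathlib import Path
from typing import List, Optional

# Files the layout above places directly in each task folder.
EXPECTED_FILES = ("dataset.json", "paths.json")


def missing_task_files(task: str, data_root: Optional[str] = None) -> List[str]:
    """Return the expected files that are absent from the task folder.

    Hypothetical helper for illustration; not an Asparagus function.
    Falls back to $ASPARAGUS_DATA when no root is given.
    """
    root = Path(data_root if data_root is not None else os.environ["ASPARAGUS_DATA"])
    task_dir = root / task
    return [name for name in EXPECTED_FILES if not (task_dir / name).is_file()]
```

An empty list means both auto-generated files are present; anything else suggests preprocessing did not finish for that task.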

Split files

Split files live in the task folder and are referenced by name (without .json) in training commands.

Train/val split (`split_<train>_<val>_<test>.json`)

A list of fold dictionaries, each with train and val keys:

[
    {
        "train": ["/path/to/sub_A/file.pt", "/path/to/sub_B/file.pt"],
        "val":   ["/path/to/sub_C/file.pt"]
    }
]

Multiple folds enable cross-validation. For a single fixed split, the list has one entry.
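The fold format described above can be checked before training. The following is a minimal sketch under the assumption that a valid file is a non-empty JSON list of dictionaries, each with train and val keys; `load_folds` is a hypothetical helper, not an Asparagus API.

```python
import json
from pathlib import Path
from typing import Dict, List


def load_folds(split_file) -> List[Dict[str, List[str]]]:
    """Load a train/val split file and check its basic structure.

    Hypothetical helper for illustration; not an Asparagus function.
    """
    folds = json.loads(Path(split_file).read_text())
    if not isinstance(folds, list) or not folds:
        raise ValueError("expected a non-empty list of fold dictionaries")
    for i, fold in enumerate(folds):
        if not isinstance(fold, dict):
            raise ValueError(f"fold {i} is not a dictionary")
        missing = {"train", "val"} - fold.keys()
        if missing:
            raise ValueError(f"fold {i} is missing keys: {sorted(missing)}")
    return folds
```

A single fixed split yields a list of length one, so `len(load_folds(path))` gives the number of cross-validation folds.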

Test split / paths file

A flat list of file paths:

[
    "/path/to/sub_D/file.pt",
    "/path/to/sub_E/file.pt"
]

paths.json is a special case of this format that includes all samples in the dataset. Passing data.test_split=paths at training time runs inference on the entire dataset (see Cross-Dataset Evaluation).
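Because a test split is just a flat list of paths, it is easy to compare against the train/val folds. The sketch below assumes a held-out test set should share no paths with any fold, which this page does not state explicitly; `leaked_paths` is a hypothetical helper name.

```python
from typing import Dict, List, Set


def leaked_paths(test_paths: List[str], folds: List[Dict[str, List[str]]]) -> Set[str]:
    """Return test paths that also appear in any fold's train or val list.

    Hypothetical helper for illustration; not an Asparagus function.
    """
    seen: Set[str] = set()
    for fold in folds:
        seen.update(fold["train"])
        seen.update(fold["val"])
    return set(test_paths) & seen
```

Running this against paths.json will report overlap by design, since that file lists every sample in the dataset.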

Naming convention

Name pattern	Meaning
`split_75_15_10`	75% train, 15% val, 10% test split
`TEST_75_15_10`	Corresponding held-out test set
`paths`	All samples (no split)

Generate splits with asp_split — see Preprocessing.
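The naming convention can be decoded mechanically. This is a sketch that assumes names follow exactly the `split_<train>_<val>_<test>` and `TEST_<train>_<val>_<test>` patterns from the table; `SplitName` and `parse_split_name` are hypothetical names, and anything else (such as `paths`) is treated as carrying no percentages.

```python
import re
from typing import NamedTuple, Optional


class SplitName(NamedTuple):
    kind: str   # "split" for train/val files, "TEST" for held-out test sets
    train: int  # percentage of samples used for training
    val: int    # percentage used for validation
    test: int   # percentage held out for testing


_PATTERN = re.compile(r"^(split|TEST)_(\d+)_(\d+)_(\d+)$")


def parse_split_name(name: str) -> Optional[SplitName]:
    """Parse a split name (without .json); return None if it does not match."""
    match = _PATTERN.match(name)
    if match is None:
        return None  # e.g. "paths", which has no percentages
    kind, train, val, test = match.groups()
    return SplitName(kind, int(train), int(val), int(test))
```

For example, `parse_split_name("split_75_15_10")` yields the percentages 75, 15, and 10, which can then be checked to sum to 100.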

dataset.json

Auto-generated by postprocess_standard_dataset. The fields Asparagus uses at training time are:

Field	Description
`n_modalities`	Number of input channels
`n_classes`	Number of output classes (or `1` for regression)

The rest of the file records the preprocessing configuration and file counts for auditing purposes. Example:

Full dataset.json example (SEG003_ISLES22_ADCDWI)

{
    "dataset_config": {
        "df_columns": [],
        "in_extensions": [".nii.gz"],
        "n_classes": 2,
        "n_modalities": 2,
        "patterns_DWI": [],
        "patterns_PET": [],
        "patterns_bidsify": [],
        "patterns_exclusion": ["labels_derivatives", "_adc", "_dwi", ".json"],
        "patterns_m0": [],
        "patterns_perfusion": [],
        "split": "split_40_10_50",
        "task_name": "SEG003_ISLES22_ADCDWI"
    },
    "metadata": {
        "files_delta_after_processing": -750,
        "files_source_directory_standard": 250,
        "files_source_directory_excluded": 750,
        "files_source_directory_total": 1000,
        "files_target_directory_standard": 250,
        "files_target_directory_total": 250,
        "n_classes": 2,
        "n_modalities": 2
    },
    "name": "SEG003_ISLES22_ADCDWI",
    "preprocessing_config": {
        "background_pixel_value": 0,
        "crop_to_nonzero": false,
        "image_properties": {},
        "intensities": null,
        "keep_aspect_ratio_when_using_target_size": false,
        "min_slices": 15,
        "normalization_operation": ["no_norm", "no_norm"],
        "remove_nans": true,
        "target_orientation": "RAS",
        "target_size": null,
        "target_spacing": [1.0, 1.0, 1.0]
    },
    "saving_config": {
        "bidsify": false,
        "save_as_tensor": true,
        "save_dset_metadata": false,
        "save_file_metadata": true,
        "tensor_dtype": "float32"
    }
}
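To inspect the two training-time fields from a script, something like the following may help. It is a sketch only: the example file stores n_modalities and n_classes under both dataset_config and metadata, and which copy Asparagus itself reads is not stated on this page, so reading from dataset_config is an assumption. `read_model_dims` is a hypothetical helper name.

```python
import json
from pathlib import Path
from typing import Tuple


def read_model_dims(dataset_json) -> Tuple[int, int]:
    """Return (n_modalities, n_classes) from a dataset.json file.

    Hypothetical helper for illustration; not an Asparagus function.
    Assumption: the values are taken from the "dataset_config" section.
    """
    meta = json.loads(Path(dataset_json).read_text())
    config = meta["dataset_config"]
    return config["n_modalities"], config["n_classes"]
```

For the SEG003_ISLES22_ADCDWI example this would give two input channels and two output classes.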

Referencing a task in commands

Pass the task folder name via task=:

asp_train_seg task=SEG005_MyDataset \
    +model=unet_b \
    data.train_split=split_75_15_10

Asparagus resolves $ASPARAGUS_DATA/SEG005_MyDataset/ automatically.
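The same resolution can be mirrored in a script, for instance to confirm that a split file exists before launching a run. This is a sketch under the assumption that a split name maps to `<split>.json` inside the task folder, as described under Split files; `split_file` and its `data_root` parameter are hypothetical and not part of the Asparagus CLI.

```python
import os
from pathlib import Path
from typing import Optional


def split_file(task: str, split: str, data_root: Optional[str] = None) -> Path:
    """Build the path of a split file from a task name and a split name.

    Hypothetical helper for illustration; not an Asparagus function.
    Falls back to $ASPARAGUS_DATA when no root is given.
    """
    root = Path(data_root if data_root is not None else os.environ["ASPARAGUS_DATA"])
    # Split names are given without the .json extension.
    return root / task / f"{split}.json"
```

Calling `split_file("SEG005_MyDataset", "split_75_15_10").is_file()` before training gives an early, readable failure if the split has not been generated yet.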