Asparagus trains on preprocessed data. Raw datasets must be converted into the expected format using the asparagus_preprocessing companion repository before you can start training.
This page walks you through the full preprocessing workflow step by step.
Overview
Raw dataset → Write a preprocessing script → Run it → Create splits → Train
- Write a dataset-specific preprocessing script
- Run it to produce preprocessed files + metadata
- Generate train/val/test splits
- Point Asparagus at the task and start training
Step 1 — Write a preprocessing script
Each dataset gets its own Python script placed in the appropriate subfolder of asparagus_preprocessing/:
| Task type | Subfolder |
|---|---|
| Pretraining | datasets_pretraining/ |
| Segmentation | datasets_segmentation/ |
| Classification | datasets_classification/ |
| Regression | datasets_regression/ |
Name the file after your task (e.g. SEG005_MyDataset.py). A blank template is available at datasets_pretraining/TEMPLATE.py.
Naming convention
Task names must follow <PREFIX><XXX>_<Name>:
| Task type | Prefix | Example |
|---|---|---|
| Pretraining | PT | PT003_BrainMRI |
| Segmentation | SEG | SEG005_MyDataset |
| Classification | CLS | CLS003_SALD |
| Regression | REGR | REGR005_Age |
The three-digit number must be unique within each task type (e.g. PT005 and SEG005 can coexist).
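The naming rule is easy to check before a script is run. The following is a minimal sketch, not part of asparagus_preprocessing: the function name `parse_task_name`, the `SUBFOLDERS` mapping, and the assumption that `<Name>` consists only of letters, digits, or underscores are all illustrative.

```python
import re

# Prefix -> subfolder, copied from the tables above.
SUBFOLDERS = {
    "PT": "datasets_pretraining",
    "SEG": "datasets_segmentation",
    "CLS": "datasets_classification",
    "REGR": "datasets_regression",
}

# Assumption: <Name> is non-empty and made of letters, digits, or underscores.
TASK_NAME_RE = re.compile(r"^(PT|SEG|CLS|REGR)(\d{3})_(\w+)$")


def parse_task_name(task_name: str) -> tuple[str, int, str]:
    """Split a task name into (prefix, number, name) or raise ValueError."""
    match = TASK_NAME_RE.match(task_name)
    if match is None:
        raise ValueError(f"{task_name!r} does not match <PREFIX><XXX>_<Name>")
    prefix, number, name = match.groups()
    return prefix, int(number), name


prefix, number, name = parse_task_name("SEG005_MyDataset")
print(prefix, number, name, SUBFOLDERS[prefix])
# SEG 5 MyDataset datasets_segmentation
```

Uniqueness of the three-digit number within a task type still has to be checked against the scripts that already exist in the matching subfolder.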
Script structure
Every preprocessing script follows the same pattern: a main() entry point (Step 0) followed by five steps:
# Step 0: define main() with default arguments
def main(path=get_source_path(), subdir="MyDataset", processes=12, ...):
    # Step 1: define configs
    dataset_config = DatasetConfig(task_name="SEG005_MyDataset", ...)
    saving_config = get_FOMO300K_saving_config(save_as_tensor=True, ...)
    preprocessing_config = get_noresampling_preprocessing_config()

    # Step 2: set up source / target paths
    source_dir = os.path.join(path, subdir)
    target_dir = os.path.join(get_data_path(), dataset_config.task_name)
    os.makedirs(target_dir, exist_ok=True)

    # Step 3: find input files
    files_standard, files_DWI, files_PET, files_Perf, files_excluded = \
        recursive_find_and_group_files(source_dir, ...)
    files_standard_out = get_image_output_paths(files_standard, source_dir, target_dir, ...)

    # Step 4: process the dataset
    process_dataset_without_table(process_fn=process_sample, ...)  # segmentation
    # or
    process_dataset_with_table(process_fn=process_sample, ...)  # cls / regression

    # Step 5: postprocess (generates dataset.json and paths.json)
    postprocess_standard_dataset(dataset_config=dataset_config, ...)
See the asparagus_preprocessing repo for example scripts covering each task type.
Key configs at a glance
DatasetConfig — describes the dataset:
| Field | Description |
|---|---|
| task_name | Unique task name (e.g. SEG005_MyDataset) |
| n_modalities | Number of input channels (1 for single-modality MRI) |
| n_classes | Number of output classes (or 1 for regression) |
| in_extensions | File extensions to look for (e.g. [".nii.gz"]) |
| patterns_exclusion | Filename patterns to skip (e.g. labels, unwanted sequences) |
PreprocessingConfig presets — pick one and customise:
| Preset function | Spacing | Use case |
|---|---|---|
| get_noresampling_preprocessing_config() | Native | Keep original voxel spacing |
| get_iso_preprocessing_config() | 1 mm isotropic | Resample to 1×1×1 mm |
SavingConfig — get_FOMO300K_saving_config(save_as_tensor=True, ...) is the standard choice for most datasets.
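To get a feel for what the isotropic preset implies for image size, the arithmetic can be sketched as below. This is an illustration of the general relationship between voxel spacing and grid size, not the resampling code Asparagus uses; the function name `resampled_shape` and the round-to-nearest rule are assumptions, and real resamplers may round differently.

```python
def resampled_shape(shape, spacing, target_spacing=(1.0, 1.0, 1.0)):
    """Return the voxel grid size after resampling to target_spacing (mm)."""
    # Physical extent per axis is shape * spacing; dividing by the new spacing
    # gives the new voxel count, rounded to the nearest integer (at least 1).
    return tuple(
        max(1, round(n * s / t)) for n, s, t in zip(shape, spacing, target_spacing)
    )


# A 256 x 256 x 40 scan with 0.9 x 0.9 x 4.0 mm voxels, resampled to 1 mm isotropic.
print(resampled_shape((256, 256, 40), (0.9, 0.9, 4.0)))
# (230, 230, 160)
```

Thick-slice acquisitions grow considerably along the slice axis when resampled to 1 mm, which is one reason to prefer the no-resampling preset when native spacing is acceptable.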
Full config dataclass definitions
DatasetConfig
@dataclass
class DatasetConfig:
    df_columns: list
    task_name: str
    n_classes: int
    n_modalities: int
    in_extensions: str
    patterns_exclusion: list
    patterns_DWI: list
    patterns_PET: list
    patterns_perfusion: list
    patterns_m0: list
    patterns_bidsify: list
    split: str
SavingConfig
@dataclass
class SavingConfig:
    save_as_tensor: bool
    tensor_dtype: str
    bidsify: bool
    save_dset_metadata: bool
    save_file_metadata: bool  # must be True for segmentation
PreprocessingConfig
@dataclass
class PreprocessingConfig:
    normalization_operation: List  # one entry per modality
    target_spacing: Optional[List]
    background_pixel_value: int = 0
    crop_to_nonzero: bool = True
    keep_aspect_ratio_when_using_target_size: bool = False
    image_properties: Optional[dict] = field(default_factory=dict)
    intensities: Optional[List] = None
    target_orientation: Optional[str] = "RAS"
    target_size: Optional[List] = None
    min_slices: int = 0
    remove_nans: bool = True
Saving raw labels (segmentation)
For segmentation tasks, the process_sample function must also save the original label to $ASPARAGUS_RAW_LABELS so that final test metrics can be computed against native labels:
from asparagus_preprocessing.utils.saving import save_raw_label, save_modified_label
# If the label map is used as-is:
save_raw_label(file_out, label_path)
# If label values were remapped (e.g. collapsing classes), save the remapped-but-unprocessed version:
save_modified_label(file_out, label_arr)
When in doubt, use save_modified_label. Resampling or any spatial processing should not be applied to the raw label.
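A value-only remapping of the kind that calls for save_modified_label might look like the sketch below. The mapping, the helper name `remap_labels`, and the use of a flat Python list in place of a label array are all assumptions for illustration; the point is that only label values change, while the number and order of voxels stay untouched.

```python
# Hypothetical mapping: merge original classes 2 and 3 into class 2.
REMAP = {0: 0, 1: 1, 2: 2, 3: 2}


def remap_labels(values, mapping):
    """Return a new list with every label value replaced via `mapping`."""
    unknown = set(values) - set(mapping)
    if unknown:
        raise ValueError(f"unmapped label values: {sorted(unknown)}")
    return [mapping[v] for v in values]


flat_labels = [0, 1, 3, 2, 0, 3]  # stand-in for a flattened label volume
remapped = remap_labels(flat_labels, REMAP)
print(remapped)
# [0, 1, 2, 2, 0, 2]
```

Failing loudly on unmapped values is a deliberate choice here: a label value that silently survives remapping would make n_classes disagree with the data.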
Segmentation metadata (.pkl)
For each segmentation sample, a .pkl file must be saved alongside the .pt file (same path, different extension). It must contain at minimum:
{"foreground_locations": [...]} # indices of non-zero labels; may be empty
Set save_file_metadata=True in SavingConfig to have this handled automatically. The indices are used to oversample underrepresented classes during training.
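For readers who want to see what such a file contains, here is a rough standard-library sketch that collects non-zero label positions and writes them next to a .pt path. The (z, y, x) tuple layout, the nested-list label, and the helper name `foreground_locations` are assumptions; the exact structure Asparagus writes is not specified on this page, so prefer save_file_metadata=True over hand-written files.

```python
import pickle
import tempfile
from pathlib import Path


def foreground_locations(label_volume):
    """Return (z, y, x) index tuples of all non-zero voxels in a nested list."""
    return [
        (z, y, x)
        for z, plane in enumerate(label_volume)
        for y, row in enumerate(plane)
        for x, value in enumerate(row)
        if value != 0
    ]


# Tiny 2x2x2 stand-in for a label volume.
label = [[[0, 1], [0, 0]], [[0, 0], [2, 0]]]

with tempfile.TemporaryDirectory() as tmp:
    pt_path = Path(tmp) / "file.pt"         # where the tensor would live
    pkl_path = pt_path.with_suffix(".pkl")  # same path, different extension
    with open(pkl_path, "wb") as f:
        pickle.dump({"foreground_locations": foreground_locations(label)}, f)
    with open(pkl_path, "rb") as f:
        metadata = pickle.load(f)

print(metadata["foreground_locations"])
# [(0, 0, 1), (1, 1, 0)]
```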
Step 2 — Run the script
Use the asp_preprocess CLI entry point, which automatically finds the right module by task name:
asp_preprocess \
--dataset SEG005_MyDataset \
--save_as_tensor \
--num_workers 12
This produces under $ASPARAGUS_DATA/SEG005_MyDataset/:
SEG005_MyDataset/
├── dataset.json ← task metadata (n_classes, n_modalities, preprocessing config, …)
├── paths.json ← paths to all processed samples (used for splitting)
└── <subject dirs>/
    ├── file.pt ← preprocessed image tensor
    └── file.pkl ← per-file metadata (required for segmentation)
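After a segmentation run it can be worth confirming that every tensor has its metadata file. The check below is a small sketch based only on the layout shown above; the helper name `missing_metadata` is hypothetical and the demo runs on a throwaway directory rather than on $ASPARAGUS_DATA.

```python
import tempfile
from pathlib import Path


def missing_metadata(task_dir):
    """Return .pt files under task_dir that have no sibling .pkl file."""
    return sorted(
        pt for pt in Path(task_dir).rglob("*.pt")
        if not pt.with_suffix(".pkl").exists()
    )


# Demo on a temporary directory that mimics the layout above.
with tempfile.TemporaryDirectory() as tmp:
    subject = Path(tmp) / "sub-01"
    subject.mkdir()
    (subject / "a.pt").touch()
    (subject / "a.pkl").touch()
    (subject / "b.pt").touch()  # deliberately left without metadata
    missing = [p.name for p in missing_metadata(tmp)]

print(missing)
# ['b.pt']
```

Pointing `missing_metadata` at the real task directory should return an empty list for a correctly preprocessed segmentation dataset.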
Step 3 — Create train/val/test splits
Once paths.json exists, generate a split file with asp_split:
# Simple percentage split: 75% train, 15% val, 10% test
asp_split --dataset SEG005_MyDataset --vals 75 15 10
This saves split_75_15_10.json inside the task directory. The three numbers must sum to 100.
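Conceptually, a percentage split shuffles the sample paths and cuts the list at the requested proportions. The sketch below illustrates that idea only; it is not the asp_split implementation, and the function name, the fixed seed, the dictionary keys, and the choice to let the test partition absorb rounding are all assumptions.

```python
import random


def percentage_split(paths, percentages, seed=0):
    """Shuffle paths and cut them into train/val/test by percentage."""
    if len(percentages) != 3 or sum(percentages) != 100:
        raise ValueError("expected three percentages that sum to 100")
    shuffled = sorted(paths)  # sort first so the result depends only on the seed
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percentages[0] // 100
    n_val = len(shuffled) * percentages[1] // 100
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remainder absorbs rounding
    }


paths = [f"sub-{i:03d}/file.pt" for i in range(20)]
split = percentage_split(paths, (75, 15, 10))
print({name: len(items) for name, items in split.items()})
# {'train': 15, 'val': 3, 'test': 2}
```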
To use a predefined splitting strategy (e.g. subject-level stratification):
asp_split --dataset SEG005_MyDataset --fn BIDSsplit_40_10_50
Split on subject level
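The default --vals split operates at the file level, so scans from the same subject can land in different partitions and leak information from training into validation or test. If subjects have multiple scans, use --fn with a stratified function that groups files by subject ID before splitting.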
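The grouping idea can be sketched as follows: assign whole subjects to partitions, then expand each subject back into its files. This is a plain grouped split, without stratification on any label, and it is not the code behind BIDSsplit_40_10_50. The helper names and the assumption that the first path component is the subject ID are illustrative.

```python
import random
from collections import defaultdict


def subject_of(path):
    """Assumption: the first path component is the subject ID (e.g. sub-001)."""
    return path.split("/")[0]


def subject_level_split(paths, percentages, seed=0):
    """Split by subject so all scans of one subject share a partition."""
    if len(percentages) != 3 or sum(percentages) != 100:
        raise ValueError("expected three percentages that sum to 100")
    by_subject = defaultdict(list)
    for path in sorted(paths):
        by_subject[subject_of(path)].append(path)
    subjects = sorted(by_subject)
    random.Random(seed).shuffle(subjects)
    n_train = len(subjects) * percentages[0] // 100
    n_val = len(subjects) * percentages[1] // 100
    groups = {
        "train": subjects[:n_train],
        "val": subjects[n_train:n_train + n_val],
        "test": subjects[n_train + n_val:],
    }
    # Expand each subject back into its files.
    return {
        name: [p for s in members for p in by_subject[s]]
        for name, members in groups.items()
    }


# 10 subjects with two sessions each.
paths = [f"sub-{i:03d}/ses-{j}/file.pt" for i in range(10) for j in (1, 2)]
split = subject_level_split(paths, (40, 10, 50))
print(len(split["train"]), len(split["val"]), len(split["test"]))
# 8 2 10
```

Because percentages are applied to subjects rather than files, the file counts per partition only match the requested proportions when subjects have similar numbers of scans.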
Step 4 — Train
With preprocessing done and a split file created, you are ready to train:
asp_train_seg \
task=SEG005_MyDataset \
+model=unet_b \
data.train_split=split_75_15_10
See the task-specific training pages for full details: