datasets.yml Reference

datasets.yml decouples data sources from processing logic, so the same dataset definitions can be reused across multiple pipelines.

Structure

<dataset_key>:
  name: <string>
  file_pattern: <string or mount-point map>
  derivatives_path: <string or mount-point map>
  exclude_pattern: <glob string>   # optional
  skip: <bool>                     # optional, default False

Fields

name

Human-readable dataset identifier. Used in logging and dataframe output.

file_pattern

Glob pattern for finding source files. Supports mount-point maps for environment-specific paths:

# Single path
file_pattern: data/**/*.vhdr

# Environment-specific (mount points)
file_pattern:
  local: data/**/*.vhdr
  hpc: /cluster/BIDS/**/*.vhdr

The active mount point is set by mount_point in pipeline.yml.
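
As a rough illustration of how a value that is either a single path or a mount-point map could be resolved, here is a minimal Python sketch. The function name `resolve_mount_point` and the error raised for an undefined mount point are assumptions made for this example, not the pipeline's documented behavior.

```python
def resolve_mount_point(value, mount_point):
    """Return the path to use for the active mount point.

    `value` is either a plain string (used as-is) or a mapping from
    mount-point name to path, as in the YAML examples above.
    """
    if isinstance(value, str):
        # Single path: the same value applies in every environment.
        return value
    if mount_point not in value:
        # Assumption: an undefined mount point is treated as an error.
        raise KeyError(
            f"mount point {mount_point!r} not defined; available: {sorted(value)}"
        )
    return value[mount_point]


file_pattern = {"local": "data/**/*.vhdr", "hpc": "/cluster/BIDS/**/*.vhdr"}

print(resolve_mount_point(file_pattern, "hpc"))      # /cluster/BIDS/**/*.vhdr
print(resolve_mount_point("data/**/*.vhdr", "hpc"))  # data/**/*.vhdr
```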

derivatives_path

Root directory where derivative files are written. The source file’s subdirectory structure is mirrored under this root. Like file_pattern, it accepts a single path or a mount-point map:

derivatives_path:
  local: outputs/
  hpc: /cluster/scratch/out
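
One way to picture the mirroring is the sketch below. It assumes that the mirrored subdirectory is the source file's path relative to the wildcard-free prefix of file_pattern; the reference does not state how the pipeline actually computes it, and the helper names `static_prefix` and `derivative_dir` are hypothetical.

```python
from pathlib import PurePosixPath

GLOB_CHARS = set("*?[")


def static_prefix(pattern):
    """Return the leading path components of a glob that contain no wildcards."""
    parts = []
    for part in PurePosixPath(pattern).parts:
        if GLOB_CHARS & set(part):
            break
        parts.append(part)
    return PurePosixPath(*parts) if parts else PurePosixPath(".")


def derivative_dir(source_file, file_pattern, derivatives_root):
    """Mirror the source file's subdirectory under the derivatives root.

    Assumption: "subdirectory structure" means the path relative to the
    wildcard-free prefix of the glob pattern.
    """
    relative = PurePosixPath(source_file).relative_to(static_prefix(file_pattern))
    return PurePosixPath(derivatives_root) / relative.parent


print(derivative_dir("data/sub-01/ses-01/eeg/rec.vhdr", "data/**/*.vhdr", "outputs/"))
# outputs/sub-01/ses-01/eeg
```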

exclude_pattern

Optional glob pattern. Files that match it are excluded from processing.
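
The effect of an exclusion glob can be sketched in Python as follows. This assumes the pattern is matched against the full file path using `fnmatch` semantics; the pipeline's actual matching rules are not specified here, and the file names and the function `apply_exclude` are made up for illustration.

```python
from fnmatch import fnmatchcase


def apply_exclude(paths, exclude_pattern=None):
    """Drop every path that matches the exclusion glob.

    Assumption: the pattern is tested against the full path string.
    With no pattern, all paths are kept.
    """
    if not exclude_pattern:
        return list(paths)
    return [p for p in paths if not fnmatchcase(p, exclude_pattern)]


# Hypothetical files found by file_pattern
files = [
    "data/sub-01/task_rest.vhdr",
    "data/sub-01/task_practice.vhdr",
    "data/sub-02/task_rest.vhdr",
]

kept = apply_exclude(files, "*practice*")
print(kept)  # ['data/sub-01/task_rest.vhdr', 'data/sub-02/task_rest.vhdr']
```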

skip

Set skip: True to temporarily disable a dataset without removing it.
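
Conceptually, skipping amounts to filtering the parsed dataset definitions before processing. The sketch below uses a plain dictionary in place of the parsed YAML; the function name `active_datasets` is hypothetical and the pipeline's internal handling may differ.

```python
# Stand-in for the parsed contents of datasets.yml
datasets = {
    "dataset1": {"name": "StudyA", "skip": True},
    "dataset2": {"name": "StudyB"},  # no skip key: the default False applies
}


def active_datasets(datasets):
    """Return only the datasets that are not marked skip: True."""
    return {
        key: cfg for key, cfg in datasets.items() if not cfg.get("skip", False)
    }


print(list(active_datasets(datasets)))  # ['dataset2']
```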

Multiple Datasets

List any number of datasets, each under its own top-level key. Every dataset not marked skip: True is processed when run_pipeline runs:

dataset1:
  name: StudyA
  file_pattern:
    local: data/studyA/**/*.vhdr
  derivatives_path:
    local: outputs/studyA/

dataset2:
  name: StudyB
  file_pattern:
    local: data/studyB/**/*.vhdr
  derivatives_path:
    local: outputs/studyB/

Referencing in pipeline.yml

# pipeline.yml
datasets: datasets.yml    # path to datasets file
mount_point: local        # which mount point to activate

The datasets value can be an absolute or a relative path. Relative paths are resolved relative to the location of the pipeline YAML file, not the current working directory.
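
The resolution rule can be sketched as below, assuming that "location" means the directory containing the pipeline YAML file. The function name `resolve_datasets_path` and the example paths are illustrative only.

```python
from pathlib import PurePosixPath


def resolve_datasets_path(pipeline_yaml, datasets_value):
    """Resolve the datasets entry of a pipeline YAML file.

    Absolute paths are returned unchanged. Relative paths are joined to
    the directory that contains the pipeline YAML (an assumption about
    what "location" means).
    """
    datasets_path = PurePosixPath(datasets_value)
    if datasets_path.is_absolute():
        return datasets_path
    return PurePosixPath(pipeline_yaml).parent / datasets_path


print(resolve_datasets_path("/home/user/project/configs/pipeline.yml", "datasets.yml"))
# /home/user/project/configs/datasets.yml
```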