datasets.yml Reference

datasets.yml decouples data sources from processing logic, so the same dataset definitions can be reused across multiple pipelines.

Structure

<dataset_key>:
  name: <string>
  file_pattern: <string or mount-point map>
  derivatives_path: <string or mount-point map>
  exclude_pattern: <glob string>   # optional
  skip: <bool>                     # optional, default False

Fields

`name`

Human-readable dataset identifier. Used in logging and dataframe output.

`file_pattern`

Glob pattern for finding source files. Supports mount-point maps for environment-specific paths:

# Single path
file_pattern: data/**/*.vhdr

# Environment-specific (mount points)
file_pattern:
  local: data/**/*.vhdr
  hpc: /cluster/BIDS/**/*.vhdr

The active mount point is set by mount_point in pipeline.yml.
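The lookup described above can be pictured with a small Python sketch. It is not the tool's actual implementation: `resolve_mount_point` is a hypothetical helper, and raising `KeyError` for an undefined mount point is an assumption about how such a case might be handled.

```python
from typing import Mapping, Union

# A field value is either a plain string or a map of mount point -> path.
PathSpec = Union[str, Mapping[str, str]]


def resolve_mount_point(value: PathSpec, mount_point: str) -> str:
    """Return the concrete path or pattern for the active mount point.

    Hypothetical helper: the real loader may behave differently.
    """
    if isinstance(value, str):
        # A single path applies regardless of the active mount point.
        return value
    if mount_point not in value:
        # Assumed behaviour: fail loudly when the mount point is undefined.
        raise KeyError(
            f"mount point {mount_point!r} not defined; "
            f"available: {sorted(value)}"
        )
    return value[mount_point]


file_pattern = {
    "local": "data/**/*.vhdr",
    "hpc": "/cluster/BIDS/**/*.vhdr",
}

print(resolve_mount_point(file_pattern, "hpc"))      # /cluster/BIDS/**/*.vhdr
print(resolve_mount_point("data/**/*.vhdr", "hpc"))  # data/**/*.vhdr
```

Keeping the single-string form valid means small projects that only ever run in one environment do not need a map at all.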

`derivatives_path`

Root directory under which derivative files are written. Each derivative keeps its source file's subdirectory structure beneath this root. Like file_pattern, it accepts either a single path or a mount-point map:

derivatives_path:
  local: outputs/
  hpc: /cluster/scratch/out
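To make the mirroring concrete, here is a rough Python sketch. It rests on an assumption that this page does not state: that the "source root" is the part of file_pattern before its first wildcard. Both `static_prefix` and `derivative_dir` are hypothetical names, and the file paths are made up for illustration.

```python
from pathlib import PurePosixPath


def static_prefix(pattern: str) -> PurePosixPath:
    """Return the leading path components of a glob that hold no wildcards."""
    parts = []
    for part in PurePosixPath(pattern).parts:
        if any(ch in part for ch in "*?["):
            break
        parts.append(part)
    return PurePosixPath(*parts)


def derivative_dir(source_file: str, pattern: str, derivatives_root: str) -> PurePosixPath:
    """Mirror the source file's subdirectories under the derivatives root.

    Assumption: subdirectories are taken relative to the wildcard-free
    prefix of the pattern. The real tool may pick the root differently.
    """
    relative = PurePosixPath(source_file).parent.relative_to(static_prefix(pattern))
    return PurePosixPath(derivatives_root) / relative


print(derivative_dir(
    "data/sub-01/ses-1/eeg/rec.vhdr",  # illustrative source file
    "data/**/*.vhdr",
    "outputs/",
))  # outputs/sub-01/ses-1/eeg
```

Under this reading, two source files in different subject folders never collide in the output tree, because their relative directories differ.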

`exclude_pattern`

Optional glob pattern. Files matched by file_pattern that also match this pattern are excluded from processing.
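The filtering step can be sketched in Python with the standard library's `fnmatch.fnmatchcase`. This is an illustration only: `apply_exclude` is a hypothetical helper, the file names are invented, and matching the pattern against the whole path string is an assumption about how the tool applies exclude_pattern.

```python
from fnmatch import fnmatchcase
from typing import Iterable, List, Optional


def apply_exclude(paths: Iterable[str], exclude_pattern: Optional[str] = None) -> List[str]:
    """Drop every path that matches exclude_pattern; keep all if it is unset."""
    if not exclude_pattern:
        return list(paths)
    # Assumption: the pattern is tested against the full path string.
    # In fnmatch semantics, "*" also matches "/" characters.
    return [p for p in paths if not fnmatchcase(p, exclude_pattern)]


matched = [
    "data/sub-01/eeg/rest.vhdr",
    "data/sub-01/eeg/rest_bad.vhdr",
    "data/sub-02/eeg/rest.vhdr",
]

kept = apply_exclude(matched, "*_bad.vhdr")
print(kept)  # ['data/sub-01/eeg/rest.vhdr', 'data/sub-02/eeg/rest.vhdr']
```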

`skip`

Set skip: True to temporarily disable a dataset without removing it.
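As a minimal sketch of what the flag means, the Python snippet below filters a dictionary shaped like a parsed datasets.yml. `active_datasets` is a hypothetical helper, not part of the tool, and the dictionary stands in for whatever structure the real loader produces.

```python
from typing import Any, Dict

# Stand-in for a parsed datasets.yml; values are illustrative.
datasets: Dict[str, Dict[str, Any]] = {
    "dataset1": {
        "name": "StudyA",
        "file_pattern": "data/studyA/**/*.vhdr",
        "derivatives_path": "outputs/studyA/",
    },
    "dataset2": {
        "name": "StudyB",
        "file_pattern": "data/studyB/**/*.vhdr",
        "derivatives_path": "outputs/studyB/",
        "skip": True,  # temporarily disabled, definition stays in the file
    },
}


def active_datasets(all_datasets: Dict[str, Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Return only datasets whose skip flag is absent or false (default False)."""
    return {
        key: spec
        for key, spec in all_datasets.items()
        if not spec.get("skip", False)
    }


print(list(active_datasets(datasets)))  # ['dataset1']
```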

Multiple Datasets

Define any number of datasets, each under its own top-level key. Running run_pipeline processes all of them, except those marked skip: True:

dataset1:
  name: StudyA
  file_pattern:
    local: data/studyA/**/*.vhdr
  derivatives_path:
    local: outputs/studyA/

dataset2:
  name: StudyB
  file_pattern:
    local: data/studyB/**/*.vhdr
  derivatives_path:
    local: outputs/studyB/

Referencing in pipeline.yml

# pipeline.yml
datasets: datasets.yml    # path to datasets file
mount_point: local        # which mount point to activate

The datasets value can be an absolute or a relative path. Relative paths are resolved against the directory that contains the pipeline YAML file, not the current working directory.
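The resolution rule can be sketched in a few lines of Python. `resolve_datasets_path` is a hypothetical helper and the paths are made up. `PurePosixPath` is used so the example behaves the same on every platform.

```python
from pathlib import PurePosixPath


def resolve_datasets_path(pipeline_yaml: str, datasets_value: str) -> PurePosixPath:
    """Resolve the datasets entry relative to the pipeline YAML's directory."""
    datasets_path = PurePosixPath(datasets_value)
    if datasets_path.is_absolute():
        # Absolute paths are used unchanged.
        return datasets_path
    # Relative paths are anchored at the folder holding pipeline.yml.
    return PurePosixPath(pipeline_yaml).parent / datasets_path


print(resolve_datasets_path("/project/configs/pipeline.yml", "datasets.yml"))
# /project/configs/datasets.yml
print(resolve_datasets_path("/project/configs/pipeline.yml", "/shared/datasets.yml"))
# /shared/datasets.yml
```

Anchoring at the pipeline file rather than the working directory means the same pipeline.yml works no matter where the command is launched from.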
