Core Concepts¶
Directed Acyclic Graph (DAG)¶
A pipeline is a DAG of computation nodes. Each node takes inputs (from previous nodes or cached derivatives) and produces outputs (artifacts). Edges encode data dependencies; cycles are not allowed.
SourceFile → CleanedEEG → CrossSpectralDensity → PowerSpectrum → BandPower
↘ Coherence ↗
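The ordering constraint a DAG imposes can be illustrated with Python's standard-library graphlib. This is a minimal sketch, not NeuroDAGs' own scheduler. The dependency map is a hypothetical reading of the diagram above; in particular, where the Coherence branch attaches is an assumption made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: node -> set of nodes it depends on.
# The placement of the Coherence branch is an assumption, not taken from NeuroDAGs.
deps = {
    "CleanedEEG": {"SourceFile"},
    "CrossSpectralDensity": {"CleanedEEG"},
    "PowerSpectrum": {"CrossSpectralDensity"},
    "Coherence": {"CrossSpectralDensity"},
    "BandPower": {"PowerSpectrum", "Coherence"},
}

# static_order() yields each node only after all of its dependencies.
order = list(TopologicalSorter(deps).static_order())


def has_cycle(graph):
    """Return True if the dependency map contains a cycle."""
    try:
        TopologicalSorter(graph).prepare()
    except CycleError:
        return True
    return False
```

Any valid execution order starts at SourceFile and ends at BandPower. A map such as {"a": {"b"}, "b": {"a"}} is rejected because it contains a cycle.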
Derivative¶
A derivative is a named, reusable computation result associated with an input file. Derivatives are stored on disk with a @DerivativeName suffix next to their source file:
sub-1.fif → source
sub-1@CleanedEEG.fif → derivative
sub-1@PowerSpectrum.nc → derivative
Derivatives mirror the input directory structure — NeuroDAGs is agnostic to how your data is organized.
Chaining pipelines (NeuroDAGs output as source)¶
When a derivative produced by one pipeline run is used as the source file in a subsequent pipeline run, its filename already contains @ (e.g. sub-1.fif@CleanedEEG.fif). NeuroDAGs automatically replaces that @ with & when building the reference base, so the one-@-per-filename invariant is preserved:
sub-1.fif@CleanedEEG.fif → source (a previous derivative)
sub-1.fif&CleanedEEG.fif@PowerSpectrum.nc → new derivative
This means downstream tooling can always split on the single @ to separate the source identity from the derivative name.
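The naming rule above can be sketched in a few lines. The helper names derivative_filename and split_derivative are hypothetical and are not part of the NeuroDAGs API. The sketch assumes the derivative name and extension are appended to the full source filename, as in the chaining example.

```python
def derivative_filename(source_name: str, derivative: str, ext: str) -> str:
    """Build '<reference base>@<derivative><ext>' for a source filename."""
    # Replace any '@' left by an earlier run so the result has exactly one '@'.
    reference_base = source_name.replace("@", "&")
    return f"{reference_base}@{derivative}{ext}"


def split_derivative(filename: str) -> tuple[str, str]:
    """Split on the single '@' into (source identity, derivative part)."""
    base, _, derivative_part = filename.partition("@")
    return base, derivative_part


# A derivative from a previous run is used as the source of a new run.
chained = derivative_filename("sub-1.fif@CleanedEEG.fif", "PowerSpectrum", ".nc")
source_identity, derivative_part = split_derivative(chained)
```

Here chained reproduces the filename from the example above, and splitting it on @ recovers the source identity sub-1.fif&CleanedEEG.fif.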
Node¶
A node is a Python function decorated with @register_node. It receives data as keyword arguments and returns a NodeResult. Preprocessing nodes, spectral analysis nodes, entropy nodes, and custom nodes are all treated identically.
NodeResult and Artifact¶
from typing import Any, Callable, NamedTuple


class Artifact(NamedTuple):
    item: Any                      # the data object
    writer: Callable[[str], None]  # how to save it to disk


class NodeResult(NamedTuple):
    artifacts: dict[str, Artifact]  # extension → Artifact
A node returns one or more artifacts keyed by file extension, for example {".nc": Artifact(...), ".report.html": Artifact(...)}. The artifacts of the last node in a derivative’s chain are saved with @DerivativeName placed before each artifact’s extension (e.g. sub-1@PowerSpectrum.nc).
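As an illustration, a node body that returns a single JSON artifact might look like the sketch below. The function mean_amplitude and its writer are hypothetical. The two classes are redefined locally so the example is self-contained, and the @register_node decorator is left out because its import path is not shown here.

```python
import json
from typing import Any, Callable, NamedTuple


class Artifact(NamedTuple):
    item: Any
    writer: Callable[[str], None]


class NodeResult(NamedTuple):
    artifacts: dict[str, Artifact]


def mean_amplitude(data: list[float]) -> NodeResult:
    """Hypothetical node body: summarise a signal as its mean."""
    summary = {"mean": sum(data) / len(data), "n_samples": len(data)}

    def write_json(path: str) -> None:
        # The writer receives the target path and persists the item.
        with open(path, "w", encoding="utf-8") as fh:
            json.dump(summary, fh)

    return NodeResult(artifacts={".json": Artifact(item=summary, writer=write_json)})


result = mean_amplitude([1.0, 2.0, 3.0])
```

Pairing each item with its own writer keeps the node free of path handling: the caller decides where the file goes and simply calls writer(path).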
Pipeline Steps¶
Each derivative definition has a nodes list. A step is either:
node: execute a registered Python function
derivative: load a cached derivative from disk (enables reuse without recomputing)
Steps reference earlier results via id.<N> (e.g., id.0 = result of step 0).
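One way to picture id.<N> references is as a lookup into the results of earlier steps. The resolver below is a hypothetical sketch, not the NeuroDAGs implementation. The argument name l_freq is invented for the example.

```python
def resolve_args(args: dict, results: dict[int, object]) -> dict:
    """Replace 'id.<N>' strings with the stored result of step N."""
    resolved = {}
    for name, value in args.items():
        if isinstance(value, str) and value.startswith("id."):
            # 'id.0' -> results[0]
            resolved[name] = results[int(value[3:])]
        else:
            # Literal arguments pass through unchanged.
            resolved[name] = value
    return resolved


# Step 0 has already produced a result; step 1 refers to it as id.0.
results = {0: "raw-data"}
kwargs = resolve_args({"data": "id.0", "l_freq": 1.0}, results)
```

The resolved kwargs can then be passed to the node function as keyword arguments.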
File Independence¶
NeuroDAGs processes each input file in isolation. Every derivative is computed independently per file — there are no cross-file operations within the framework. This is a deliberate design constraint that enables trivial parallelism (each file is an independent job) and caching.
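Because no file depends on another, per-file jobs can be dispatched with any standard executor. The sketch below uses concurrent.futures from the standard library. process_file is a hypothetical stand-in for running all derivatives on one file, not a NeuroDAGs function.

```python
from concurrent.futures import ThreadPoolExecutor


def process_file(path: str) -> str:
    """Hypothetical stand-in for computing every derivative of one file."""
    return f"{path}: done"


files = ["sub-1.fif", "sub-2.fif", "sub-3.fif"]

# Each file is an independent job, so no coordination between workers is needed.
with ThreadPoolExecutor(max_workers=2) as pool:
    statuses = list(pool.map(process_file, files))
```

For CPU-bound work, a process pool or one cluster job per file follows the same pattern.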
The consequence: operations that require information from multiple files — group-level ICA, normalization to a group mean, atlas registration using a subject-average template — cannot be expressed as NeuroDAGs derivatives. These must be done outside the pipeline, either as a post-processing step or by dropping to plain Python/NumPy after running build_derivative_dataframe.
If your workflow needs cross-file operations mid-pipeline, consider Snakemake or Pydra instead.
Caching¶
If a derivative’s final artifact already exists on disk and overwrite: False (the default), the entire derivative computation is skipped. This allows you to resume interrupted pipelines and avoid redundant computation in large studies.
Cache invalidation is existence-based, not code-based. If you change a node’s implementation, NeuroDAGs has no way to know the cached output is stale — it only checks whether the output file exists. To force recomputation after modifying a node, either set overwrite: true on that derivative or delete the relevant @DerivativeName files manually.
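The existence-based rule amounts to a single check. The function below is a minimal sketch of that logic under the behavior described above. should_compute is a hypothetical name, not a NeuroDAGs function.

```python
from pathlib import Path


def should_compute(output_path: Path, overwrite: bool = False) -> bool:
    """Recompute only if the output is missing or overwrite is set."""
    # Nothing here inspects the node's code, so edits to a node go unnoticed.
    return overwrite or not output_path.exists()
```

Since only the file's existence is consulted, a stale output written by an older version of a node is still treated as a cache hit.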
SourceFile¶
SourceFile is a built-in pseudo-derivative that resolves to the raw input file. Use it as the first step in any derivative chain that starts from raw data.
nodes:
  - id: 0
    derivative: SourceFile
  - id: 1
    node: my_preprocessing_node
    args:
      data: id.0
Derivative Flags¶
| Flag | Type | Default | Description |
|---|---|---|---|
| | bool | | Write artifacts to disk. |
| overwrite | bool | False | Recompute even if output exists. |
| | bool | | Include in |
DerivativeList¶
DerivativeList in pipeline.yml controls which derivatives are executed and in what order. Comment out a derivative name to skip it without removing its definition:
DerivativeList:
  - CleanedEEG
  - PowerSpectrum
  # - SpectralEntropy  # skipped
  - BandPower
Mount Points¶
datasets.yml supports environment-specific path resolution. The mount_point key in pipeline.yml selects which path set to use:
# pipeline.yml
mount_point: local  # or: hpc

# datasets.yml
my_dataset:
  file_pattern:
    local: data/**/*.vhdr
    hpc: /cluster/BIDS/**/*.vhdr
  derivatives_path:
    local: outputs/
    hpc: /cluster/scratch/out
This makes pipelines portable across workstations and HPC clusters without editing the pipeline YAML.
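Conceptually, resolving a mount point means picking one entry from each per-environment mapping. The sketch below works on a plain dictionary that mirrors the datasets.yml example. resolve_paths is a hypothetical helper, not the NeuroDAGs loader.

```python
def resolve_paths(dataset: dict, mount_point: str) -> dict:
    """Pick the entry for `mount_point` from every per-environment mapping."""
    return {key: options[mount_point] for key, options in dataset.items()}


# Mirrors the my_dataset entry from datasets.yml above.
my_dataset = {
    "file_pattern": {"local": "data/**/*.vhdr", "hpc": "/cluster/BIDS/**/*.vhdr"},
    "derivatives_path": {"local": "outputs/", "hpc": "/cluster/scratch/out"},
}

local_paths = resolve_paths(my_dataset, "local")
hpc_paths = resolve_paths(my_dataset, "hpc")
```

Switching environments then only requires changing the mount_point value, while the dataset definition stays the same.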