# DVC Backend
The DVC Backend transforms your Wurzel pipeline into Data Version Control (DVC) configuration files, enabling reproducible machine learning workflows with built-in data versioning and experiment tracking.
## Overview
DVC (Data Version Control) is a powerful tool for ML experiment management that works seamlessly with Git. The DVC Backend generates dvc.yaml files that define your pipeline stages, dependencies, and outputs in a format that DVC can execute and track.
**Generate-Time vs Runtime Configuration**
The DVC backend uses a two-phase configuration model:
- **Generate-Time (YAML or Environment)**: A `values.yaml` file or environment variables configure the pipeline structure (data directories and environment encapsulation settings). This is used when running `wurzel generate`.
- **Runtime (Environment Variables)**: Step settings (e.g., `MANUALMARKDOWNSTEP__FOLDER_PATH`) are read from environment variables when `dvc repro` executes the pipeline locally.
This separation allows you to generate pipeline definitions once and run them in different environments by changing only the runtime environment variables.
## Key Features
- **Data Versioning**: Automatically track changes to datasets and model artifacts
- **Reproducible Pipelines**: Generate deterministic pipeline definitions
- **Experiment Tracking**: Compare different pipeline runs and their results
- **Git Integration**: Version control your pipeline configurations alongside your code
- **Caching**: Intelligent caching of intermediate results to speed up development
## Usage

### CLI Usage
Generate a DVC pipeline configuration:
```bash
# Install Wurzel
pip install wurzel

# Generate dvc.yaml (default backend)
wurzel generate examples.pipeline.pipelinedemo:pipeline

# Explicitly specify DVC backend
wurzel generate --backend DvcBackend --output dvc.yaml examples.pipeline.pipelinedemo:pipeline

# Generate using a values file (recommended)
wurzel generate --backend DvcBackend \
    --values values.yaml \
    --pipeline_name pipelinedemo \
    --output dvc.yaml \
    examples.pipeline.pipelinedemo:pipeline
```
### Values File Configuration (Generate-Time)

The `values.yaml` file configures the pipeline structure at generate-time:

```yaml
dvc:
  pipelinedemo:
    dataDir: "./data"       # Directory for step outputs
    encapsulateEnv: true    # Whether to encapsulate environment in CLI calls
```
### Environment Configuration (Generate-Time Alternative)

Alternatively, configure the DVC backend using environment variables at generate-time.
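For example, a minimal sketch using the variables listed in the Configuration Reference below, applied to the same demo pipeline used throughout this page:

```shell
# Generate-time backend settings (equivalent to the values.yaml above)
export DVCBACKEND__DATA_DIR="./data"
export DVCBACKEND__ENCAPSULATE_ENV="true"

# Generate the pipeline definition as usual
wurzel generate examples.pipeline.pipelinedemo:pipeline
```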
### Configuration Reference
| Field | Environment Variable | Default | Description |
|---|---|---|---|
| `dataDir` | `DVCBACKEND__DATA_DIR` | `./data` | Directory for step output artifacts |
| `encapsulateEnv` | `DVCBACKEND__ENCAPSULATE_ENV` | `true` | Whether to encapsulate environment in CLI calls |
### Runtime Environment Variables
Step settings are configured via environment variables at runtime (when `dvc repro` executes). Set these before running your pipeline:
```bash
# Step-specific settings (runtime)
export MANUALMARKDOWNSTEP__FOLDER_PATH="examples/pipeline/demo-data"
export SIMPLESPLITTERSTEP__BATCH_SIZE="100"
export SIMPLESPLITTERSTEP__NUM_THREADS="4"

# Run the pipeline
dvc repro
```
**Inspecting Required Environment Variables**
Use `wurzel inspect` to see all environment variables required by your pipeline steps.
### Programmatic Usage
Use the DVC backend directly in Python:
```python
from pathlib import Path

from wurzel.backend.backend_dvc import DvcBackend
from wurzel.steps.embedding import EmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.qdrant.step import QdrantConnectorStep
from wurzel.utils import WZ

# Define your pipeline
source = WZ(ManualMarkdownStep)
embedding = WZ(EmbeddingStep)
step = WZ(QdrantConnectorStep)
source >> embedding >> step
pipeline = step

# Option 1: Generate DVC configuration from a values file
backend = DvcBackend.from_values(
    files=[Path("values.yaml")],
    workflow_name="pipelinedemo",
)
dvc_yaml = backend.generate_artifact(pipeline)
print(dvc_yaml)

# Option 2: Generate with default settings
dvc_yaml = DvcBackend().generate_artifact(pipeline)
print(dvc_yaml)
```
## Running DVC Pipelines

Once you've generated your `dvc.yaml` file, you can execute the pipeline using DVC:
```bash
# Run the entire pipeline
dvc repro

# Run specific stages
dvc repro <stage_name>

# Show pipeline status
dvc status

# Render plots to compare runs
dvc plots show
```
## Benefits for ML Workflows

### Data Lineage

Track the complete history of your data transformations, making it easy to understand how your final model was created.

### Experiment Reproducibility

Every pipeline run is completely reproducible, with DVC tracking all inputs, parameters, and outputs.

### Collaborative Development

Share pipeline definitions through Git while DVC handles the heavy lifting of data and model versioning.

### Performance Optimization

DVC's intelligent caching means you only recompute what's changed, dramatically speeding up iterative development.