Argo Workflows Backend¶
The Argo Workflows Backend transforms your Wurzel pipeline into Kubernetes-native Workflow or CronWorkflow YAML configurations, enabling cloud-native, scalable pipeline orchestration with optional scheduling capabilities.
Overview¶
Argo Workflows is a powerful, Kubernetes-native workflow engine that excels at container orchestration and parallel execution. The Argo Backend generates either:
- **Workflow**: For one-time or manually triggered pipeline executions (when `schedule: null`)
- **CronWorkflow**: For scheduled, recurring pipeline executions (when a cron schedule is provided)
Both workflow types leverage Kubernetes' native scheduling and resource management capabilities.
Generate-Time vs Runtime Configuration
The Argo backend uses a two-phase configuration model:
- **Generate-Time (YAML)**: A `values.yaml` file configures the workflow structure — container images, namespaces, schedules, security contexts, resource limits, and artifact storage. This is required when running `wurzel generate`.
- **Runtime (Environment Variables)**: Step settings (e.g., `MANUALMARKDOWNSTEP__FOLDER_PATH`) are read from environment variables when the workflow executes in Kubernetes. These can be set via `container.env`, Secrets, or ConfigMaps in your `values.yaml`.
This separation allows you to generate workflow manifests once and deploy them to different environments by changing only the runtime environment variables.
Key Features¶
- Cloud-Native Orchestration: Run pipelines natively on Kubernetes clusters
- Flexible Execution: Support for both one-time Workflows and scheduled CronWorkflows
- Horizontal Scaling: Automatically scale pipeline steps based on resource requirements
- Advanced Scheduling: Optional cron-based scheduling with fine-grained control
- Resource Management: Leverage Kubernetes resource limits and requests
- Artifact Management: Integrated S3-compatible artifact storage
- Service Integration: Seamless integration with Kubernetes services and secrets
Usage¶
Installation¶
Install Wurzel with Argo support:
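If Argo support ships as an optional package extra (a sketch — the exact extra name may differ in your Wurzel version), the install typically looks like:

```bash
# Assumption: the Argo backend is packaged as an optional extra named "argo"
pip install "wurzel[argo]"
```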
CLI Usage¶
Generate an Argo Workflows configuration using a values.yaml file:
# Generate a CronWorkflow with scheduled execution
wurzel generate --backend ArgoBackend \
--values values.yaml \
--pipeline_name pipelinedemo \
--output cronworkflow.yaml \
examples.pipeline.pipelinedemo:pipeline
# Or generate a one-time Workflow (set schedule: null in values.yaml)
wurzel generate --backend ArgoBackend \
--values values-no-schedule.yaml \
--pipeline_name pipelinedemo \
--output workflow.yaml \
examples.pipeline.pipelinedemo:pipeline
Note
The --values flag is required for the Argo backend. It specifies the YAML configuration file that defines the workflow structure.
Values File Configuration (Generate-Time)¶
The values.yaml file configures the workflow structure at generate-time. Here's a complete example:
workflows:
pipelinedemo:
# Workflow metadata
name: wurzel-pipeline
namespace: argo-workflows
schedule: "0 4 * * *" # Cron schedule for CronWorkflow, or null for one-time Workflow
entrypoint: wurzel-pipeline
serviceAccountName: wurzel-service-account
dataDir: /data
# Workflow-level annotations
annotations:
sidecar.istio.io/inject: "false"
# Pod-level security context (applied to all pods)
podSecurityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
fsGroup: 2000
fsGroupChangePolicy: Always # or "OnRootMismatch"
supplementalGroups:
- 1000
seccompProfileType: RuntimeDefault
# Optional: Custom podSpecPatch for advanced use cases
# podSpecPatch: |
# initContainers:
# - name: custom-init
# securityContext:
# runAsNonRoot: true
# Container configuration
container:
image: ghcr.io/telekom/wurzel
# Container-level security context
securityContext:
runAsNonRoot: true
runAsUser: 1000
runAsGroup: 1000
allowPrivilegeEscalation: false
readOnlyRootFilesystem: true
dropCapabilities:
- ALL
seccompProfileType: RuntimeDefault
# Resource requests and limits
resources:
cpu_request: "100m"
cpu_limit: "500m"
memory_request: "128Mi"
memory_limit: "512Mi"
# Runtime environment variables (step settings)
env:
MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
SIMPLESPLITTERSTEP__BATCH_SIZE: "100"
# Environment from Kubernetes Secrets/ConfigMaps
envFrom:
- kind: secret
name: wurzel-env-secret
prefix: ""
optional: true
- kind: configMap
name: wurzel-env-config
prefix: APP_
optional: true
# Reference existing secrets as env vars
secretRef:
- "wurzel-secrets"
# Reference existing configmaps as env vars
configMapRef:
- "wurzel-config"
# Mount secrets as files
mountSecrets:
- from: "tls-secret"
to: "/etc/ssl"
mappings:
- key: "tls.crt"
value: "cert.pem"
- key: "tls.key"
value: "key.pem"
# Tokenizer cache volume (for HuggingFace models)
tokenizerCache:
enabled: true
claimName: tokenizer-cache-pvc # Used when createPvc: false
mountPath: /cache/huggingface
readOnly: true
# To auto-create a workflow-scoped PVC:
# createPvc: true
# storageSize: 10Gi
# storageClassName: standard
# accessModes: ["ReadWriteOnce"]
# S3 artifact storage configuration
artifacts:
bucket: wurzel-bucket
endpoint: s3.amazonaws.com
defaultMode: 509 # File permissions (decimal), e.g., 509 = 0o775
Workflow vs CronWorkflow¶
The Argo backend generates different workflow types based on the schedule configuration:
Normal Workflow (One-Time Execution)¶
Set schedule: null (or omit it) to create a Workflow for manual or one-time execution:
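A minimal values entry for a one-time Workflow, reusing the fields from the example above, could look like this:

```yaml
workflows:
  pipelinedemo:
    name: wurzel-pipeline
    namespace: argo-workflows
    schedule: null        # null (or omitted) produces a Workflow instead of a CronWorkflow
    entrypoint: wurzel-pipeline
    container:
      image: ghcr.io/telekom/wurzel
```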
Use cases:

- Manual pipeline execution triggered via Argo UI or CLI
- Event-driven pipelines triggered by other workflows
- One-time data processing tasks
- CI/CD integration where external systems trigger execution
Triggering:
# Submit the workflow manually
argo submit workflow.yaml
# Or trigger via kubectl
kubectl create -f workflow.yaml
CronWorkflow (Scheduled Execution)¶
Set a cron schedule string to create a CronWorkflow for recurring execution:
workflows:
my-cron-workflow:
name: my-scheduled-pipeline
schedule: "0 4 * * *" # Creates a CronWorkflow that runs daily at 4 AM
Use cases:

- Regularly scheduled data ingestion
- Periodic model training or evaluation
- Automated report generation
- Scheduled data synchronization
Common cron schedules:
- "0 4 * * *" - Daily at 4 AM
- "*/15 * * * *" - Every 15 minutes
- "0 0 * * 0" - Weekly on Sundays at midnight
- "0 0 1 * *" - Monthly on the 1st at midnight
Monitoring:
# List all CronWorkflows
argo cron list
# View CronWorkflow details
argo cron get my-scheduled-pipeline
# List workflow runs from CronWorkflow
argo list --label workflows.argoproj.io/cron-workflow=my-scheduled-pipeline
Choosing the Right Type
- Use Workflow (schedule: null) when you need explicit control over when pipelines run
- Use CronWorkflow (with schedule) for automated, time-based execution
- You can have both: a CronWorkflow for regular execution and a Workflow template for manual reruns
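One way to keep both side by side — assuming multiple entries under `workflows:` are allowed, as the map structure of the values file suggests — is to define two entries in the same file and generate each with its own `--pipeline_name`:

```yaml
workflows:
  pipelinedemo:                 # generated with --pipeline_name pipelinedemo
    name: wurzel-pipeline
    schedule: "0 4 * * *"       # CronWorkflow for the regular run
  pipelinedemo-manual:          # generated with --pipeline_name pipelinedemo-manual
    name: wurzel-pipeline-manual
    schedule: null              # one-time Workflow for manual reruns
```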
Configuration Reference¶
Workflow-Level Options¶
| Field | Type | Default | Description |
|---|---|---|---|
| `name` | string | `wurzel` | Name of the Workflow/CronWorkflow |
| `namespace` | string | `argo-workflows` | Kubernetes namespace |
| `schedule` | string | `0 4 * * *` | Cron schedule for the CronWorkflow. Set to `null` to create a normal Workflow instead |
| `entrypoint` | string | `wurzel-pipeline` | DAG entrypoint name |
| `serviceAccountName` | string | `wurzel-service-account` | Kubernetes service account |
| `dataDir` | path | `/usr/app` | Data directory inside containers |
| `annotations` | map | `{}` | Workflow-level annotations |
| `podSpecPatch` | string | `null` | Custom pod spec patch (YAML string) |
Pod Security Context Options¶
| Field | Type | Default | Description |
|---|---|---|---|
| `runAsNonRoot` | bool | `true` | Require non-root user |
| `runAsUser` | int | `null` | UID to run as |
| `runAsGroup` | int | `null` | GID to run as |
| `fsGroup` | int | `null` | Filesystem group |
| `fsGroupChangePolicy` | string | `null` | `Always` or `OnRootMismatch` |
| `supplementalGroups` | list[int] | `[]` | Additional group IDs |
| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
Container Security Context Options¶
| Field | Type | Default | Description |
|---|---|---|---|
| `runAsNonRoot` | bool | `true` | Require non-root user |
| `runAsUser` | int | `null` | UID to run as |
| `runAsGroup` | int | `null` | GID to run as |
| `allowPrivilegeEscalation` | bool | `false` | Allow privilege escalation |
| `readOnlyRootFilesystem` | bool | `null` | Read-only root filesystem |
| `dropCapabilities` | list[str] | `["ALL"]` | Linux capabilities to drop |
| `seccompProfileType` | string | `RuntimeDefault` | Seccomp profile type |
Container Resources Options¶
| Field | Type | Default | Description |
|---|---|---|---|
| `cpu_request` | string | `100m` | CPU request |
| `cpu_limit` | string | `500m` | CPU limit |
| `memory_request` | string | `128Mi` | Memory request |
| `memory_limit` | string | `512Mi` | Memory limit |
Tokenizer Cache Options¶
The tokenizer cache configuration allows you to mount a PersistentVolumeClaim (PVC) containing pre-downloaded HuggingFace tokenizer models. This is useful for:
- Avoiding repeated model downloads in air-gapped environments
- Reducing startup time by using cached models
- Sharing model cache across workflow runs
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable tokenizer cache volume mount |
| `claimName` | string | `tokenizer-cache-pvc` | Name of an existing PVC to mount (when `createPvc: false`) |
| `mountPath` | string | `/cache/huggingface` | Mount path inside container |
| `readOnly` | bool | `true` | Mount as read-only |
| `createPvc` | bool | `false` | Create PVC via `volumeClaimTemplates` (workflow-scoped) |
| `storageSize` | string | `10Gi` | Storage size (when `createPvc: true`) |
| `storageClassName` | string | `null` | Storage class name (when `createPvc: true`) |
| `accessModes` | list[str] | `["ReadWriteOnce"]` | Access modes (when `createPvc: true`) |
When enabled, the HF_HOME environment variable is automatically set to the mountPath, directing HuggingFace libraries to use the cached models.
createPvc vs claimName
- `createPvc: false` (default): Uses an existing PVC specified by `claimName`. You must create the PVC separately.
- `createPvc: true`: Creates a workflow-scoped PVC via Argo's `volumeClaimTemplates`. The PVC is created when the workflow starts and deleted when it completes. This is useful for temporary caches but not for persistent model storage across runs.
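For example, to let the workflow create its own temporary cache PVC instead of mounting an existing one, the `tokenizerCache` block from the values example above could be written as:

```yaml
tokenizerCache:
  enabled: true
  createPvc: true                # PVC is created via volumeClaimTemplates when the workflow starts
  storageSize: 10Gi
  storageClassName: standard     # adjust to a storage class available in your cluster
  accessModes: ["ReadWriteOnce"]
  mountPath: /cache/huggingface  # HF_HOME is set to this path automatically
```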
S3 Artifact Options¶
| Field | Type | Default | Description |
|---|---|---|---|
| `bucket` | string | `wurzel-bucket` | S3 bucket name |
| `endpoint` | string | `s3.amazonaws.com` | S3 endpoint URL |
| `defaultMode` | int | `null` | File permissions (decimal) |
Runtime Environment Variables¶
Step settings are configured via environment variables at runtime (when the workflow executes). These can be set in three ways:
- **Inline in `container.env`**: Directly in the values file
- **Via Kubernetes Secrets**: Using `secretRef` or `envFrom` with `kind: secret`
- **Via Kubernetes ConfigMaps**: Using `configMapRef` or `envFrom` with `kind: configMap`
container:
# Option 1: Inline environment variables
env:
MANUALMARKDOWNSTEP__FOLDER_PATH: "examples/pipeline/demo-data"
# Option 2: From Secrets/ConfigMaps with optional prefix
envFrom:
- kind: secret
name: wurzel-secrets
prefix: "" # No prefix
optional: true
# Option 3: Reference entire Secret/ConfigMap
secretRef:
- "wurzel-secrets"
configMapRef:
- "wurzel-config"
Inspecting Required Environment Variables
Use wurzel inspect to see all environment variables required by your pipeline steps:
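A typical invocation might look like the following sketch — it assumes `wurzel inspect` takes the same `module:attribute` pipeline reference as `wurzel generate`; check `wurzel inspect --help` for the exact arguments in your version:

```bash
# Assumption: same pipeline reference syntax as "wurzel generate"
wurzel inspect examples.pipeline.pipelinedemo:pipeline
```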
Programmatic Usage¶
Use the Argo backend directly in Python:
from pathlib import Path
from wurzel.backend.backend_argo import ArgoBackend
from wurzel.steps.embedding import EmbeddingStep
from wurzel.steps.manual_markdown import ManualMarkdownStep
from wurzel.steps.qdrant.step import QdrantConnectorStep
from wurzel.utils import WZ
# Define your pipeline
source = WZ(ManualMarkdownStep)
embedding = WZ(EmbeddingStep)
step = WZ(QdrantConnectorStep)
source >> embedding >> step
pipeline = step
# Generate Argo Workflows configuration from values file
backend = ArgoBackend.from_values(
files=[Path("values.yaml")],
workflow_name="pipelinedemo"
)
argo_yaml = backend.generate_artifact(pipeline)
print(argo_yaml)
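Continuing the example above (assuming `generate_artifact` returns the manifest as a YAML string, as the `print` call suggests), the output can be written to disk for use with `kubectl` or `argo`:

```python
# Write the generated manifest to a file so it can be applied to the cluster
Path("cronworkflow.yaml").write_text(argo_yaml)
```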
Deploying Argo Workflows¶
Once you've generated your Workflow or CronWorkflow YAML, deploy it to your Kubernetes cluster:
Deploying a Normal Workflow¶
# Apply the Workflow to your cluster
kubectl apply -f workflow.yaml
# Submit it for execution
argo submit workflow.yaml
# Or create and submit in one command
kubectl create -f workflow.yaml
Deploying a CronWorkflow¶
# Apply the CronWorkflow to your cluster (starts the cron schedule)
kubectl apply -f cronworkflow.yaml
# View CronWorkflow status
argo cron get wurzel-pipeline
# List CronWorkflows
argo cron list
Monitoring Workflow Executions¶
# List all workflow executions
argo list
# Get detailed workflow status
argo get <workflow-name>
# View workflow logs
argo logs <workflow-name>
# Follow logs in real-time
argo logs <workflow-name> -f
# View logs for specific step
argo logs <workflow-name> -c <container-name>
Benefits for Cloud-Native Pipelines¶
Kubernetes-Native Execution¶
Leverage the full power of Kubernetes for container orchestration, resource management, and fault tolerance.
Scalable Processing¶
Automatically scale pipeline steps based on workload requirements, with support for parallel execution across multiple nodes.
Enterprise Security¶
Integrate with Kubernetes RBAC, service accounts, and network policies for enterprise-grade security.
Cost Optimization¶
Take advantage of Kubernetes features like node auto-scaling and spot instances to optimize infrastructure costs.
Observability¶
Built-in integration with Kubernetes monitoring tools and Argo's web UI for comprehensive pipeline observability.
Multiple Values Files¶
You can use multiple values files for environment-specific overrides:
# Base configuration + environment-specific overrides
wurzel generate --backend ArgoBackend \
--values base-values.yaml \
--values production-values.yaml \
--pipeline_name pipelinedemo \
--output cronworkflow.yaml \
examples.pipeline.pipelinedemo:pipeline
Later files override earlier ones using deep merge semantics.
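As an illustration, a production override file (hypothetical values, deep-merged on top of the base file) only needs to restate the fields it changes:

```yaml
# production-values.yaml — hypothetical override, deep-merged over base-values.yaml
workflows:
  pipelinedemo:
    namespace: wurzel-prod          # assumption: production namespace
    schedule: "0 2 * * *"           # run earlier in production
    container:
      env:
        MANUALMARKDOWNSTEP__FOLDER_PATH: "/data/prod-markdown"   # hypothetical prod path
```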
Prerequisites¶
- Kubernetes cluster with Argo Workflows installed
- kubectl configured to access your cluster
- Appropriate RBAC permissions for workflow execution
- S3-compatible storage for artifacts (optional but recommended)
- A `values.yaml` file for generate-time configuration