Building Pipelines — User Guide

Audience: Dagen users who build and manage dbt, Spark, and Data Model pipelines, and orchestrate workflows.


Overview

Dagen provides three pipeline-focused views:

  • Data Pipelines (/pipelines): Manage dbt, Dataform, and Data Model pipelines
  • Spark Pipelines (/spark-pipelines): Develop and run Spark applications
  • Workflow Orchestrator (/workflow-orchestrator): Design, schedule, and automate multi-step workflows

All pipeline types share a common set of Git operations (branching, committing, pushing, pull requests) and integrate with the AI Chat for assistance.


Part 1 — Data Pipelines (dbt / Dataform / Data Model)

Pipeline List

The Data Pipelines page displays your transformation pipelines with stats showing counts by type (Total, DBT, Dataform).

  • Use Search pipelines... to filter by name.
  • Sort by Name or Type, toggle ascending/descending order.
  • Adjust items per page (5, 10, 20, 50).

Each pipeline card shows the pipeline name, a type chip (DBT, Dataform, Spark, Data Model), branch name, file count, and change count.

Empty state: "No Data Pipelines Found — Data pipelines are automatically created when you connect repositories containing Dataform, DBT, or other supported pipeline projects."

Click Connect Repository or Manage Repos in the header to link a repository.

Pipeline Detail View

Click a pipeline card to open its detail view, which has two tabs:

DAG View

  • Interactive directed acyclic graph visualizing model dependencies.
  • Controls: zoom in/out, fit to screen, reset, toggle minimap.
  • Navigation: scroll to pan, Ctrl/Cmd + scroll to zoom, drag to move.
  • Toggle Preview to open the data preview panel.

Files

  • File tree showing all pipeline files with status indicators.
  • Search files... to filter.
  • Expand All / Collapse All to navigate the tree.
  • Refresh to reload from disk.
  • Double-click a file (or click the edit button) to open it in the inline editor.
  • The editor supports viewing changes (diff), editing, and saving directly.

Running a Pipeline

Click the Run menu on a pipeline card or detail header:

  • Full Refresh: Rebuilds all models from scratch
  • Incremental: Processes only new or changed data
  • Test Only: Runs tests without building models
  • Custom Run: Configure timeout, additional arguments, environment variables, and run mode
  • Create New Schedule: Set up a recurring execution

After execution, the Pipeline Execution Logs dialog shows status, summary (pipeline name, run mode, start/finish times, duration), and scrollable log output. If the run failed, a Fix with Agent button opens the AI chat to help troubleshoot.


Part 2 — Spark Pipelines

Connecting a Spark Repository

  1. Navigate to Spark Pipelines from the menu.
  2. Click Connect Repository in the header.
  3. Fill in:
    • Repository URL (required).
    • Pipeline Name (optional — auto-derived from the URL if blank).
    • Branch (defaults to main).
    • Personal Access Token (for private repositories).
  4. Click Connect.

Supported languages: PySpark, Scala Spark, SQL, Python.

Pipeline Cards

Toggle between Cards and List view using the icons in the toolbar. Each card shows the pipeline name, type chip, branch, file count, and modified date.

Running a Spark Job

  1. Click Run Job from the pipeline menu.
  2. In the dialog, select a Platform:
    • Databricks: Main File; Cluster ID (optional; uses the default cluster)
    • Google Cloud Dataproc: Main File, Cluster Name, Project ID, Region
    • Kubernetes: Main File, Namespace, Driver Memory, Executor Memory, Executors (slider)
  3. Select the Main File from the dropdown (lists Python/Scala files in the repo).
  4. Click Submit Job.
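The per-platform required fields can be checked before submitting. This sketch mirrors the list above; the snake_case field names and the `missing_fields` helper are illustrative assumptions, not Dagen's API.

```python
# Required fields per platform, as listed in the Run Job dialog.
# Field names are assumed identifiers for illustration.
REQUIRED_FIELDS = {
    "Databricks": ["main_file"],  # cluster_id is optional (default cluster)
    "Google Cloud Dataproc": ["main_file", "cluster_name", "project_id", "region"],
    "Kubernetes": ["main_file", "namespace", "driver_memory",
                   "executor_memory", "executors"],
}

def missing_fields(platform: str, config: dict) -> list[str]:
    """Return the required fields absent from a job config for the platform."""
    return [f for f in REQUIRED_FIELDS[platform] if not config.get(f)]
```

For example, a Dataproc config containing only a main file would report `cluster_name`, `project_id`, and `region` as missing.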

Syncing and Deleting

  • Sync Pipeline pulls the latest changes from the remote repository.
  • Delete removes the pipeline (with confirmation: "This action cannot be undone.").

Part 3 — Git Operations (All Pipeline Types)

All pipeline types support a consistent set of Git operations accessible from the Git Operations menu.

Branch Management

Create a Branch

  1. Click Git Operations → Manage Branches (or Create Branch in the branch manager).
  2. Enter a New Branch Name (alphanumeric, hyphens, underscores, dots, slashes).
  3. Select Create from Branch (the base branch).
  4. Click Create Branch.
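The allowed character set for branch names can be expressed as a regex. The pattern below is an approximation of the dialog's stated rule (alphanumeric, hyphens, underscores, dots, slashes); Git itself enforces further restrictions (no "..", no trailing ".lock", and so on) that the UI may also apply.

```python
import re

# Approximate branch-name rule from the Create Branch dialog:
# alphanumeric characters, hyphens, underscores, dots, and slashes.
BRANCH_NAME = re.compile(r"^[A-Za-z0-9._/-]+$")

def is_valid_branch_name(name: str) -> bool:
    """True if the name uses only the characters the dialog accepts."""
    return bool(BRANCH_NAME.fullmatch(name))
```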

Switch Branches

In the Branch Manager dialog, click any branch in the Available Branches list to switch to it. Branches are grouped into Local Branches and Remote Only Branches.

Viewing Changes

Click Git Operations → View Changes to open the diff viewer. It shows:

  • "Your working directory is clean" if there are no uncommitted changes.
  • A file list with additions/deletions when changes exist.

Committing Changes

  1. Click Git Operations → Git Commit.
  2. Enter a Commit Message (required, max 500 characters).
  3. Click Commit Changes (commits all staged changes).

Pushing and Publishing

  • Git Push pushes committed changes to the remote repository.
  • Publish to GitHub creates a new GitHub repository:
    1. Enter a Repository Name.
    2. Add an optional Description.
    3. Toggle Private repository on or off.
    4. Click Create & Push.

Pulling Changes

Click Git Operations → Git Pull to fetch and merge the latest changes from the remote.

VS Code Integration

The VSCode menu provides multiple ways to edit pipeline code locally:

  • Open Local Folder: Opens the pipeline in your local VS Code installation
  • Clone Repository: Clones the repo to your machine
  • Open in Web VSCode: Opens the VS Code web editor in your browser
  • Copy Repository URL: Copies the git URL to the clipboard

Part 4 — Workflow Orchestrator

Route: /workflow-orchestrator.

Agentic automation combines classical DAG orchestration with LLM agents that plan steps, read logs, and suggest fixes. Self-healing typically flows through Execution Logs, the Fix with Agent button, AI Chat (with pipeline, database, or workflow context), Agent Intelligence rules and lessons, and Git Reviews that catch bad code before it merges.

Workflow UI capabilities

  • Visual DAG designer (drag-and-drop nodes and edges)
  • New Workflow, Run, View Runs, and scheduling
  • Import / Export workflow JSON (single or bulk)
  • Channels — notifications (e.g. Slack)
  • Dashboard — run metrics and status overview
  • AI Workflow Assistant (robot icon in header) — describe the flow in natural language
  • Clean up stuck workflow runs (broom icon) when runs stay “running” incorrectly

Creating a Workflow

  1. Navigate to Workflow Orchestrator from the menu.
  2. Click New Workflow in the header.
  3. Use the visual designer to add and connect nodes (drag-and-drop).
  4. Save the workflow.

You can also describe what you want to the AI Workflow Assistant (click the robot icon in the header) and let it build the workflow for you.

Workflow List

Workflows are displayed as cards in a paginated grid (12 per page).

  • Filter by Status: All Statuses, Draft, Active, Paused, Archived.
  • Click Refresh to reload.

Each card shows the workflow name, status, and description with actions to Edit, Run, Delete, View Runs, Update Status, Export, and Duplicate.

Running a Workflow

Click Run on a workflow card. The workflow executes its nodes in sequence. Monitor progress via the Run History dialog (click View Runs), which lists all runs with timestamps and statuses.

Scheduling

Workflows can be scheduled via the workflow editor or by configuring a schedule from the pipeline run menu.

Import and Export

  • Import Workflow: Click Import/Export → Import Workflow, upload a JSON file, optionally set a custom name, and click Import. The preview shows node and edge counts.
  • Export All Workflows: Click Import/Export → Export All Workflows to download all workflows as JSON.
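Assuming the exported JSON carries top-level "nodes" and "edges" arrays (the actual Dagen workflow schema is not documented here), counts like the import preview's could be computed as follows; the `preview_counts` helper and the sample payload are illustrative.

```python
import json

def preview_counts(payload: str) -> tuple[int, int]:
    """Return (node count, edge count) for a workflow JSON export."""
    wf = json.loads(payload)
    return len(wf.get("nodes", [])), len(wf.get("edges", []))

# Hypothetical two-node workflow export with a single edge.
example = '{"name": "nightly", "nodes": [{"id": "a"}, {"id": "b"}], "edges": [{"from": "a", "to": "b"}]}'
```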

Notification Channels

Click Channels in the header to configure where workflow notifications are sent (e.g., Slack). See the Slack Integration User Guide for setup details.

Workflow Dashboard

Click Dashboard in the header to open an overview of workflow execution metrics and status.

Cleaning Up Stuck Runs

If workflow runs are stuck in a running state, click the broom icon (Clean up stuck workflow runs) in the filter bar.


Part 5 — Where runs surface & how to debug

  • dbt / Dataform / Data Model: /pipelines (Run menu, schedules, execution log). Recovery: Fix with Agent; attach the pipeline in chat.
  • Spark: /spark-pipelines (Run Job). Recovery: cluster UI plus Dagen messages; check Runtime Environments.
  • Data Ingestion: /airbyte-ingestion. Recovery: card stats; Configure CDC with AI; runtime selector.
  • Workflows: /workflow-orchestrator (View Runs, Dashboard). Recovery: channel notifications; stuck-run cleanup.

Runtime first: most "mystery" failures come down to a wrong or unreachable runtime, expired credentials, blocked egress, or capacity limits. Re-run Test on the runtime; confirm the default ingestion runtime on the ingestion page; for Spark, validate the cluster ID, region, and namespace.

Agent-assisted: paste error excerpts into AI Chat with the right context attachment; use Guided or Semi until the fix is proven.

Job correlation: /job-history (under Administration) shows agent jobs with tool-level steps, which is useful when chat and tools fail together.


Troubleshooting

  • "No Data Pipelines Found". Cause: no repositories connected. Fix: click Connect Repository and link a repo containing dbt/Dataform/Spark code.
  • "No Spark Pipelines Found". Cause: no Spark repositories connected. Fix: click Connect Repository and provide a repo URL.
  • DAG view is empty. Cause: the pipeline has no model files, or parsing failed. Fix: check the Files tab for valid model definitions.
  • Git Commit fails. Cause: no changes to commit, or an invalid commit message. Fix: ensure you have uncommitted changes and the message is under 500 characters.
  • "No workflows found". Cause: no workflows created yet. Fix: click Create First Workflow or use the AI assistant.
  • Spark job fails on submit. Cause: incorrect platform credentials or an unreachable cluster. Fix: verify the runtime configuration in Runtime Environments.
  • Pipeline execution logs show "failed". Cause: model compilation or runtime error. Fix: click Fix with Agent for AI-assisted troubleshooting.
  • Publish to GitHub fails. Cause: repository name conflict or missing credentials. Fix: choose a unique name and ensure your GitHub PAT has the repo scope.