Building Pipelines — User Guide

Audience: Dagen users who build and manage dbt, Spark, and Data Model pipelines, and orchestrate workflows.


Overview

Dagen provides three pipeline-focused views:

  • Data Pipelines (/pipelines): Manage dbt, Dataform, and Data Model pipelines
  • Spark Pipelines (/spark-pipelines): Develop and run Spark applications
  • Workflow Orchestrator (/workflow-orchestrator): Design, schedule, and automate multi-step workflows

All pipeline types share a common set of Git operations (branching, committing, pushing, pull requests) and integrate with the AI Chat for assistance.


Part 1 — Data Pipelines (dbt / Dataform / Data Model)

Pipeline List

The Data Pipelines page displays your transformation pipelines with stats showing counts by type (Total, DBT, Dataform).

  • Use Search pipelines... to filter by name.
  • Sort by Name or Type, toggle ascending/descending order.
  • Adjust items per page (5, 10, 20, 50).

Each pipeline card shows the pipeline name, a type chip (DBT, Dataform, Spark, Data Model), branch name, file count, and change count.

Empty state: "No Data Pipelines Found — Data pipelines are automatically created when you connect repositories containing Dataform, DBT, or other supported pipeline projects."

Click Connect Repository or Manage Repos in the header to link a repository.

Pipeline Detail View

Click a pipeline card to open its detail view, which has two tabs:

DAG View

  • Interactive directed acyclic graph visualizing model dependencies.
  • Controls: zoom in/out, fit to screen, reset, toggle minimap.
  • Navigation: scroll to pan, Ctrl/Cmd + scroll to zoom, drag to move.
  • Toggle Preview to open the data preview panel.

Files

  • File tree showing all pipeline files with status indicators.
  • Search files... to filter.
  • Expand All / Collapse All to navigate the tree.
  • Refresh to reload from disk.
  • Double-click a file (or click the edit button) to open it in the inline editor.
  • The editor supports viewing changes (diff), editing, and saving directly.

Running a Pipeline

Click the Run menu on a pipeline card or detail header:

  • Full Refresh: Rebuilds all models from scratch
  • Incremental: Processes only new or changed data
  • Test Only: Runs tests without building models
  • Custom Run: Configure timeout, additional arguments, environment variables, and run mode
  • Create New Schedule: Set up a recurring execution

After execution, the Pipeline Execution Logs dialog shows status, summary (pipeline name, run mode, start/finish times, duration), and scrollable log output. If the run failed, a Fix with Agent button opens the AI chat to help troubleshoot.


Part 2 — Spark Pipelines

Connecting a Spark Repository

  1. Navigate to Spark Pipelines from the menu.
  2. Click Connect Repository in the header.
  3. Fill in:
    • Repository URL (required).
    • Pipeline Name (optional — auto-derived from the URL if blank).
    • Branch (defaults to main).
    • Personal Access Token (for private repositories).
  4. Click Connect.

Supported languages: PySpark, Scala Spark, SQL, Python.

Pipeline Cards

Toggle between Cards and List view using the icons in the toolbar. Each card shows the pipeline name, type chip, branch, file count, and modified date.

Running a Spark Job

  1. Click Run Job from the pipeline menu.
  2. In the dialog, select a Platform:
    • Databricks: Main File; Cluster ID (optional; uses the default cluster)
    • Google Cloud Dataproc: Main File, Cluster Name, Project ID, Region
    • Kubernetes: Main File, Namespace, Driver Memory, Executor Memory, Executors (slider)
  3. Select the Main File from the dropdown (lists Python/Scala files in the repo).
  4. Click Submit Job.
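The per-platform required fields can be checked before submitting. This sketch mirrors the list above; the snake_case field names and the `missing_fields` helper are illustrative assumptions, not Dagen's API.

```python
# Required fields per platform, as listed in the Run Job dialog.
# Field names are assumed identifiers for illustration.
REQUIRED_FIELDS = {
    "Databricks": ["main_file"],  # cluster_id is optional (default cluster)
    "Google Cloud Dataproc": ["main_file", "cluster_name", "project_id", "region"],
    "Kubernetes": ["main_file", "namespace", "driver_memory",
                   "executor_memory", "executors"],
}

def missing_fields(platform: str, config: dict) -> list[str]:
    """Return the required fields absent from a job config for the platform."""
    return [f for f in REQUIRED_FIELDS[platform] if not config.get(f)]
```

For example, a Dataproc config containing only a main file would report `cluster_name`, `project_id`, and `region` as missing.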

Syncing and Deleting

  • Sync Pipeline pulls the latest changes from the remote repository.
  • Delete removes the pipeline (with confirmation: "This action cannot be undone.").

Part 3 — Git Operations (All Pipeline Types)

All pipeline types support a consistent set of Git operations accessible from the Git Operations menu.

Branch Management

Create a Branch

  1. Click Git Operations → Manage Branches (or Create Branch in the branch manager).
  2. Enter a New Branch Name (alphanumeric, hyphens, underscores, dots, slashes).
  3. Select Create from Branch (the base branch).
  4. Click Create Branch.
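The allowed character set for branch names can be expressed as a regex. The pattern below is an approximation of the dialog's stated rule (alphanumeric, hyphens, underscores, dots, slashes); Git itself enforces further restrictions (no "..", no trailing ".lock", and so on) that the UI may also apply.

```python
import re

# Approximate branch-name rule from the Create Branch dialog:
# alphanumeric characters, hyphens, underscores, dots, and slashes.
BRANCH_NAME = re.compile(r"^[A-Za-z0-9._/-]+$")

def is_valid_branch_name(name: str) -> bool:
    """True if the name uses only the characters the dialog accepts."""
    return bool(BRANCH_NAME.fullmatch(name))
```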

Switch Branches

In the Branch Manager dialog, click any branch in the Available Branches list to switch to it. Branches are grouped into Local Branches and Remote Only Branches.

Viewing Changes

Click Git Operations → View Changes to open the diff viewer. It shows:

  • "Your working directory is clean" if there are no uncommitted changes.
  • A file list with additions/deletions when changes exist.

Committing Changes

  1. Click Git Operations → Git Commit.
  2. Enter a Commit Message (required, max 500 characters).
  3. Click Commit Changes (commits all staged changes).

Pushing and Publishing

  • Git Push pushes committed changes to the remote repository.
  • Publish to GitHub creates a new GitHub repository:
    1. Enter a Repository Name.
    2. Add an optional Description.
    3. Toggle Private repository on or off.
    4. Click Create & Push.

Pulling Changes

Click Git Operations → Git Pull to fetch and merge the latest changes from the remote.

VS Code Integration

The VSCode menu provides multiple ways to edit pipeline code locally:

  • Open Local Folder: Opens the pipeline in your local VS Code installation
  • Clone Repository: Clones the repo to your machine
  • Open in Web VSCode: Opens the VS Code web editor in your browser
  • Copy Repository URL: Copies the git URL to the clipboard

Part 4 — Workflow Orchestrator

Route: /workflow-orchestrator.

Agentic automation combines classical DAG orchestration with LLM agents that plan steps, read logs, and suggest fixes. Self-healing typically flows through Execution Logs, the Fix with Agent button, AI Chat (with pipeline, database, or workflow context), Agent Intelligence rules and lessons, and Git Reviews that catch bad code before it merges.

Workflow UI capabilities

  • Visual DAG designer (drag-and-drop nodes and edges)
  • New Workflow, Run, View Runs, and scheduling
  • Import / Export workflow JSON (single or bulk)
  • Channels — notifications (e.g. Slack)
  • Dashboard — run metrics and status overview
  • AI Workflow Assistant (robot icon in header) — describe the flow in natural language
  • Clean up stuck workflow runs (broom icon) when runs stay “running” incorrectly

Creating a Workflow

  1. Navigate to Workflow Orchestrator from the menu.
  2. Click New Workflow in the header.
  3. Use the visual designer to add and connect nodes (drag-and-drop).
  4. Save the workflow.

You can also describe what you want to the AI Workflow Assistant (click the robot icon in the header) and let it build the workflow for you.

Workflow List

Workflows are displayed as cards in a paginated grid (12 per page).

  • Filter by Status: All Statuses, Draft, Active, Paused, Archived.
  • Click Refresh to reload.

Each card shows the workflow name, status, and description with actions to Edit, Run, Delete, View Runs, Update Status, Export, and Duplicate.

Running a Workflow

Click Run on a workflow card. The workflow executes its nodes in sequence. Monitor progress via the Run History dialog (click View Runs), which lists all runs with timestamps and statuses.

Scheduling

Workflows can be scheduled via the workflow editor or by configuring a schedule from the pipeline run menu.

Import and Export

  • Import Workflow: Click Import/Export → Import Workflow, upload a JSON file, optionally set a custom name, and click Import. The preview shows node and edge counts.
  • Export All Workflows: Click Import/Export → Export All Workflows to download all workflows as JSON.
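Assuming the exported JSON carries top-level "nodes" and "edges" arrays (the actual Dagen workflow schema is not documented here), counts like the import preview's could be computed as follows; the `preview_counts` helper and the sample payload are illustrative.

```python
import json

def preview_counts(payload: str) -> tuple[int, int]:
    """Return (node count, edge count) for a workflow JSON export."""
    wf = json.loads(payload)
    return len(wf.get("nodes", [])), len(wf.get("edges", []))

# Hypothetical two-node workflow export with a single edge.
example = '{"name": "nightly", "nodes": [{"id": "a"}, {"id": "b"}], "edges": [{"from": "a", "to": "b"}]}'
```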

Notification Channels

Click Channels in the header to configure where workflow notifications are sent (e.g., Slack). See the Slack Integration User Guide for setup details.

Workflow Dashboard

Click Dashboard in the header to open an overview of workflow execution metrics and status.

Cleaning Up Stuck Runs

If workflow runs are stuck in a running state, click the broom icon (Clean up stuck workflow runs) in the filter bar.


Part 5 — Where runs surface & how to debug

  • dbt / Dataform / Data Model: /pipelines (Run menu, schedules, execution log). Recovery: Fix with Agent; attach the pipeline in chat.
  • Spark: /spark-pipelines (Run Job). Recovery: cluster UI plus Dagen messages; check Runtime Environments.
  • Data Ingestion: /airbyte-ingestion. Recovery: card stats; Configure CDC with AI; runtime selector.
  • Workflows: /workflow-orchestrator (View Runs, Dashboard). Recovery: channel notifications; stuck-run cleanup.

Runtime first: most "mystery" failures come down to a wrong or unreachable runtime, expired credentials, blocked egress, or capacity limits. Re-run Test on the runtime; confirm the default ingestion runtime on the ingestion page; for Spark, validate the cluster ID, region, and namespace.

Agent-assisted: paste error excerpts into AI Chat with the right context attachment; use Guided or Semi until the fix is proven.

Job correlation: /job-history (under Administration) shows agent jobs with tool-level steps, which is useful when chat and tools fail together.


Troubleshooting

  • "No Data Pipelines Found". Cause: no repositories connected. Fix: click Connect Repository and link a repo containing dbt/Dataform/Spark code.
  • "No Spark Pipelines Found". Cause: no Spark repositories connected. Fix: click Connect Repository and provide a repo URL.
  • DAG view is empty. Cause: the pipeline has no model files, or parsing failed. Fix: check the Files tab for valid model definitions.
  • Git Commit fails. Cause: no changes to commit, or an invalid commit message. Fix: ensure you have uncommitted changes and the message is under 500 characters.
  • "No workflows found". Cause: no workflows created yet. Fix: click Create First Workflow or use the AI assistant.
  • Spark job fails on submit. Cause: incorrect platform credentials or an unreachable cluster. Fix: verify the runtime configuration in Runtime Environments.
  • Pipeline execution logs show "failed". Cause: model compilation or runtime error. Fix: click Fix with Agent for AI-assisted troubleshooting.
  • Publish to GitHub fails. Cause: repository name conflict or missing credentials. Fix: choose a unique name and ensure your GitHub PAT has the repo scope.