Generating dbt Documentation with OpenAI and GitHub Actions
In this blog post, I'll walk through how to automatically generate documentation for your dbt project using OpenAI, fully integrated into your GitHub Actions pipeline. The process analyzes every dbt model and seed file, generates detailed dbt documentation in Markdown with OpenAI's GPT-4o model, and commits the result back to the pull request.
Technology Stack
- dbt: The analytics engineering tool for transforming data in your warehouse.
- OpenAI GPT-4o (via API): Used to infer and document dbt models with Markdown output.
- GitHub Actions: Automates the entire flow on every pull request.
- Typer CLI: Simplifies local development of the script used to collect and submit prompt data.
Advantages of This Approach
- Automated Metadata Generation: Removes the burden of writing and maintaining technical documentation.
- Contextual, Business-Friendly Summaries: The model extracts intent, joins, lineage, and structure.
- Seamless GitHub Integration: Keeps docs in sync by auto-committing the output to the PR branch.
Prerequisites
Before running the workflow, make sure the following are in place:
- A valid OpenAI API key (added as a GitHub Secret: `OPENAI_API_KEY`)
- A dbt project with models and seed files organized as usual
- The `generate_with_openai.py` script added to your repo (a minimal CLI sketch follows this list), which:
  - Traverses the `models/` and `seeds/` folders
  - Builds a prompt and sends it to OpenAI
  - Saves the response as `output.md`
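Since the script is built with Typer, running it locally is straightforward. Here is a minimal sketch of how the file collection and CLI wiring might look; the function names are illustrative, but the `--models`, `--seeds`, and `--output` options match what the workflow passes later:

```python
from pathlib import Path

import typer

app = typer.Typer()


def collect_files(models_dir: str, seeds_dir: str) -> list[str]:
    """Gather every model .sql and seed .csv file for the prompt."""
    files: list[str] = []
    for base, suffix in [(models_dir, ".sql"), (seeds_dir, ".csv")]:
        files.extend(str(p) for p in Path(base).rglob(f"*{suffix}"))
    return sorted(files)


@app.command()
def main(
    models: str = typer.Option(..., help="Path to the dbt models/ folder"),
    seeds: str = typer.Option(..., help="Path to the dbt seeds/ folder"),
    output: str = typer.Option("output.md", help="Where to write the summary"),
) -> None:
    all_files = collect_files(models, seeds)
    typer.echo(f"Collected {len(all_files)} files for the prompt")
    # ...build the prompt and call OpenAI as shown in the next section...


if __name__ == "__main__":
    app()
```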
`generate_with_openai.py` (Key Steps)
```python
import os

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# all_files holds the model .sql and seed .csv paths collected earlier
prompt = PROMPT_HEADER.strip() + "\n\n"
for file_path in sorted(all_files):
    rel_path = os.path.relpath(file_path)
    prompt += f"File: {rel_path}\n" + ("-" * 80) + "\n"
    with open(file_path, "r", encoding="utf-8") as f:
        prompt += f.read().strip() + "\n"
    prompt += ("-" * 80) + "\n\n"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
```
The script uses GPT-4o to interpret your dbt project structure and generate business-readable documentation.
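The final step writes the model's reply to disk. With the v1 OpenAI client shown above, saving the summary takes only a few lines (a sketch; `response` is the object from the snippet):

```python
# Extract the generated Markdown and write it to the output file.
summary = response.choices[0].message.content
with open("output.md", "w", encoding="utf-8") as f:
    f.write(summary)
```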
The Prompt
It's a simple prompt, but it outlines many of the key pieces of documentation needed for a dbt setup.
```text
You are an expert data engineer. Given a set of dbt model files and seed CSVs, generate a comprehensive summary document describing the DBT setup. Your summary should include:

1. A high-level **overview** of the structure (staging, marts, seeds).
2. A **description of each seed file** and what kind of data it contains.
3. A **table summarizing staging models**, their sources, and key fields extracted.
4. A **description of each mart model**, the joins it performs, and any calculated fields.
5. A **data lineage diagram** in text format (ASCII or pseudo-graph).
6. A brief **conclusion** highlighting good practices or architectural choices.

Use clear headings, structured tables, and readable prose. Make the summary business-friendly while still detailed enough for technical audiences.

---

Here are the files:
```
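In the script, this prompt lives in the `PROMPT_HEADER` constant referenced earlier. A sketch of how it might be stored (text abbreviated):

```python
# The documentation prompt, abbreviated here; the full text is shown above.
PROMPT_HEADER = """
You are an expert data engineer. Given a set of dbt model files and seed CSVs,
generate a comprehensive summary document describing the DBT setup.
...
Here are the files:
"""
```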
GitHub Action Workflow
The GitHub Action is triggered on PRs that touch `models/`, `seeds/`, or the script itself. Here's what it does:
- Sets up Python and installs dependencies
- Waits for Postgres to be available (via service container)
- Runs `dbt seed` and `dbt run`
- Executes `generate_with_openai.py`
- Commits `output.md` to the pull request branch
```yaml
# Ensure that you set read/write permissions, otherwise the output.md push will fail.
# Also make sure that the OPENAI_API_KEY secret is set and the attached API key has
# appropriate permission settings.
#
# Read/Write permissions:
# Settings -> Actions -> General -> Workflow permissions -> select 'Read and write permissions' and save.
name: Run dbt in Docker

on:
  pull_request:
    branches: [ main ]
    paths:
      - 'dbt/**'
      - 'docker-compose.yml'
      - '.github/workflows/dbt.yml'

jobs:
  dbt-run:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: dbt_user
          POSTGRES_PASSWORD: dbt_pass
          POSTGRES_DB: dbt_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd "pg_isready"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}  # check out the PR branch so output.md can be pushed back

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: 🔐 Set OpenAI API Key
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: echo "API key set"

      - name: Install dependencies
        run: |
          pip install dbt-postgres typer openai

      - name: Wait for Postgres
        run: |
          until pg_isready -h localhost -p 5432; do
            echo "Waiting for postgres..."
            sleep 2
          done

      - name: Run dbt debug
        run: |
          mkdir -p ~/.dbt
          cp dbt/profiles.yml ~/.dbt/profiles.yml
          cd dbt/project
          dbt debug

      - name: Seed database
        run: |
          cd dbt/project
          dbt seed

      - name: Run dbt models
        run: |
          cd dbt/project
          dbt run

      - name: Show list of dbt models
        run: |
          cd dbt/project
          dbt ls --resource-type model

      - name: Submit to OpenAI
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python dbt/project/generate_with_openai.py --models dbt/project/models --seeds dbt/project/seeds --output output.md

      - name: ✅ Commit and push summary
        run: |
          git config --global user.name "github-actions"
          git config --global user.email "actions@github.com"
          git add output.md
          git commit -m "📝 Add DBT summary (auto-generated by OpenAI)" || echo "No changes to commit"
          git push

      - name: 💬 Comment on PR (optional)
        if: success()
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          message: |
            ✅ DBT summary generated by OpenAI and added to `output.md`.
```
Resulting Summary Output
The generated `output.md` includes:
- Overview of the dbt project structure
- Descriptions of seed files
- Summary table of staging models and their sources
- Explanation of mart models, joins, and transformations
- An ASCII data lineage diagram
- Architectural best practices
This makes it useful for both technical and business stakeholders, including for onboarding new employees.
Conclusion
This workflow brings LLM-powered documentation into the CI/CD loop for your dbt project. It ensures your dbt DAGs remain well-documented without manual intervention, improving transparency, onboarding, and maintenance.
Future improvements:
- Posting `output.md` directly in PR comments (see the sketch below)
- Sending Slack/Teams notifications
- Validating docs exist as part of a quality gate
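For the first item, a small Python helper using GitHub's REST API could post the summary from within the workflow. This is a sketch: `PR_NUMBER` is not a built-in Actions variable and would need to be passed in (e.g. from `github.event.pull_request.number`), while `GITHUB_REPOSITORY` and `GITHUB_TOKEN` are standard:

```python
import os

import requests


def post_summary_comment() -> None:
    """Post output.md as a comment on the current pull request."""
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "owner/repo"
    pr_number = os.environ["PR_NUMBER"]     # supplied by the workflow
    token = os.environ["GITHUB_TOKEN"]

    with open("output.md", "r", encoding="utf-8") as f:
        body = f.read()

    # Pull request comments go through the issues endpoint of the REST API.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    post_summary_comment()
```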
Further Reading
Repo Integration Example
See the PR that adds this automation: