Generating dbt Documentation with OpenAI and GitHub Actions

In this blog post, I’ll walk through how to automatically generate documentation for your dbt project using OpenAI, fully integrated into your GitHub Actions pipeline. The process analyzes every dbt model and seed file, generates detailed dbt documentation in Markdown with OpenAI’s GPT-4o model, and commits the result back to the pull request.

Technology Stack

  - dbt (dbt-postgres adapter)
  - PostgreSQL 15 (run as a GitHub Actions service container)
  - Python 3.11 with Typer and the OpenAI SDK
  - OpenAI GPT-4o
  - GitHub Actions

Advantages of This Approach

Prerequisites

Before running the workflow, make sure the following are in place:

  - An OPENAI_API_KEY repository secret containing a valid OpenAI API key
  - Workflow permissions set to "Read and write" (see the comment in the workflow below)
  - A dbt project with models/ and seeds/ directories, plus a working profiles.yml

generate_with_openai.py (Key Steps)

import os

from openai import OpenAI  # requires openai>=1.0 (the legacy openai.ChatCompletion API was removed)

# Build a single prompt: the instruction header followed by every model/seed file.
prompt = PROMPT_HEADER.strip() + "\n\n"
for file_path in sorted(all_files):
    rel_path = os.path.relpath(file_path)
    prompt += f"File: {rel_path}\n" + ("-" * 80) + "\n"
    with open(file_path, "r", encoding="utf-8") as f:
        prompt += f.read().strip() + "\n"
    prompt += ("-" * 80) + "\n\n"

# The client reads OPENAI_API_KEY from the environment.
client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,  # deterministic output for reproducible docs
)
summary = response.choices[0].message.content

The script asks GPT-4o to interpret your dbt project structure and produce business-readable documentation from it.
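The key steps above can be assembled into a complete script along these lines. This is a sketch, not the repository’s actual generate_with_openai.py: the helper names (collect_files, build_prompt) and the argparse wiring are illustrative (the workflow installs Typer, so the real CLI is likely built with typer instead), but the --models/--seeds/--output flags match the invocation used in the workflow below.

```python
import argparse
import os
import sys
from pathlib import Path

# Abbreviated here; the full instructions are shown in "The Prompt" section.
PROMPT_HEADER = "You are an expert data engineer. ..."


def collect_files(models_dir: Path, seeds_dir: Path) -> list[Path]:
    """Gather every dbt model (.sql) and seed (.csv) file, sorted for a stable prompt."""
    return sorted(list(models_dir.rglob("*.sql")) + list(seeds_dir.rglob("*.csv")))


def build_prompt(header: str, files: list[Path]) -> str:
    """Concatenate the instructions and each file's contents, separated by rules."""
    prompt = header.strip() + "\n\n"
    for file_path in files:
        prompt += f"File: {os.path.relpath(file_path)}\n" + "-" * 80 + "\n"
        prompt += file_path.read_text(encoding="utf-8").strip() + "\n"
        prompt += "-" * 80 + "\n\n"
    return prompt


def main() -> None:
    parser = argparse.ArgumentParser(description="Generate dbt docs with GPT-4o.")
    parser.add_argument("--models", type=Path, required=True)
    parser.add_argument("--seeds", type=Path, required=True)
    parser.add_argument("--output", type=Path, required=True)
    args = parser.parse_args()

    from openai import OpenAI  # requires openai>=1.0 and OPENAI_API_KEY in the env

    prompt = build_prompt(PROMPT_HEADER, collect_files(args.models, args.seeds))
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.0,
    )
    args.output.write_text(response.choices[0].message.content, encoding="utf-8")


if __name__ == "__main__" and len(sys.argv) > 1:  # only run when CLI args are given
    main()
```

Sorting the file list keeps the prompt stable across runs, which makes diffs of output.md meaningful rather than noise from reordered input.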

The Prompt

As you can see, it’s a simple prompt, but it covers many of the pieces of documentation a dbt setup needs.

You are an expert data engineer. Given a set of dbt model files and seed CSVs, generate a comprehensive summary document describing the DBT setup. Your summary should include:

1. A high-level **overview** of the structure (staging, marts, seeds).
2. A **description of each seed file** and what kind of data it contains.
3. A **table summarizing staging models**, their sources, and key fields extracted.
4. A **description of each mart model**, the joins it performs, and any calculated fields.
5. A **data lineage diagram** in text format (ASCII or pseudo-graph).
6. A brief **conclusion** highlighting good practices or architectural choices.

Use clear headings, structured tables, and readable prose. Make the summary business-friendly while still detailed enough for technical audiences.

---

Here are the files:
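Inside the script, this prompt presumably lives in a module-level constant so the file contents can be appended after it; a minimal sketch, assuming the name PROMPT_HEADER used in the code above:

```python
# The instructions sent to GPT-4o before the file contents are appended.
# Stored as a module-level constant so the build code can do PROMPT_HEADER.strip().
PROMPT_HEADER = """
You are an expert data engineer. Given a set of dbt model files and seed CSVs,
generate a comprehensive summary document describing the DBT setup. Your summary
should include:

1. A high-level **overview** of the structure (staging, marts, seeds).
2. A **description of each seed file** and what kind of data it contains.
3. A **table summarizing staging models**, their sources, and key fields extracted.
4. A **description of each mart model**, the joins it performs, and any calculated fields.
5. A **data lineage diagram** in text format (ASCII or pseudo-graph).
6. A brief **conclusion** highlighting good practices or architectural choices.

Use clear headings, structured tables, and readable prose. Make the summary
business-friendly while still detailed enough for technical audiences.

---

Here are the files:
"""
```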

GitHub Action Workflow

The GitHub Action is triggered on PRs that touch models/, seeds/, or the script itself. Here’s what it does:

  1. Sets up Python and installs dependencies
  2. Waits for Postgres to be available (via service container)
  3. Runs dbt seed and dbt run
  4. Executes generate_with_openai.py
  5. Commits output.md to the pull request branch
# Ensure that you set Read/Write permissions, otherwise the output.md push will fail.
# Also make sure the OPENAI_API_KEY secret is set and that the attached API key has
# appropriate permissions.
#
# Read/Write permissions:
# Settings -> Actions -> General -> Workflow permissions -> select 'Read and write permissions' and save.

name: Run dbt in Docker

on:
  pull_request:
    branches: [ main ]
    paths:
      - 'dbt/**'
      - 'docker-compose.yml'
      - '.github/workflows/dbt.yml'

jobs:
  dbt-run:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: dbt_user
          POSTGRES_PASSWORD: dbt_pass
          POSTGRES_DB: dbt_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd "pg_isready"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      # Step-level env vars don't persist; the key is passed again to the
      # "Submit to OpenAI" step. This step only verifies the secret exists.
      - name: 🔐 Verify OpenAI API key
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          if [ -z "$OPENAI_API_KEY" ]; then
            echo "OPENAI_API_KEY secret is not set" && exit 1
          fi
          echo "API key is set"

      - name: Install dependencies
        run: |
          pip install dbt-postgres typer openai

      - name: Wait for Postgres
        run: |
          until pg_isready -h localhost -p 5432; do
            echo "Waiting for postgres..."
            sleep 2
          done

      - name: Run dbt debug
        run: |
          mkdir -p ~/.dbt
          cp dbt/profiles.yml ~/.dbt/profiles.yml
          cd dbt/project
          dbt debug

      - name: Seed database
        run: |
          cd dbt/project
          dbt seed

      - name: Run dbt models
        run: |
          cd dbt/project
          dbt run

      - name: Show list of dbt models
        run: |
          cd dbt/project
          dbt ls --resource-type model

      - name: Submit to OpenAI
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python dbt/project/generate_with_openai.py --models dbt/project/models --seeds dbt/project/seeds --output output.md
          
      - name: ✅ Commit and push summary
        run: |
          git config --global user.name "github-actions"
          git config --global user.email "actions@github.com"
          git add output.md
          git commit -m "📝 Add DBT summary (auto-generated by OpenAI)" || echo "No changes to commit"
          git push

      - name: 💬 Comment on PR (optional)
        if: success()
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          message: |
            ✅ DBT summary generated by OpenAI and added to `output.md`.

Resulting Summary Output

The generated output.md includes:

  1. A high-level overview of the project structure (staging, marts, seeds)
  2. A description of each seed file and its data
  3. A table summarizing the staging models, their sources, and key fields
  4. A description of each mart model, its joins, and calculated fields
  5. A text-based data lineage diagram
  6. A conclusion highlighting good practices and architectural choices

This makes the output useful for both technical and business stakeholders, including for onboarding new employees.

Conclusion

This workflow brings LLM-powered documentation into the CI/CD loop for your dbt project. It ensures your dbt DAGs remain well-documented without manual intervention—improving transparency, onboarding, and maintenance.

Future improvements:

Further Reading

Repo Integration Example

See the PR that adds this automation:

GitHub PR: Auto-Generate dbt Docs