Generating dbt Documentation with OpenAI and GitHub Actions
In this blog post, I'll walk through how to automatically generate documentation for your dbt project using OpenAI, fully integrated into your GitHub Actions pipeline. The process analyzes every dbt model and seed file, generates detailed dbt documentation in Markdown with OpenAI's GPT-4o model, and commits the result back to the pull request.
Technology Stack
- dbt: The analytics engineering tool for transforming data in your warehouse.
- OpenAI GPT-4o (via API): Used to infer and document dbt models with Markdown output.
- GitHub Actions: Automates the entire flow on every pull request.
- Typer CLI: Simplifies local development of the script used to collect and submit prompt data.
Advantages of This Approach
- Automated Metadata Generation: Removes the burden of writing and maintaining technical documentation.
- Contextual, Business-Friendly Summaries: The model extracts intent, joins, lineage, and structure.
- Seamless GitHub Integration: Keeps docs in sync by auto-committing the output to the PR branch.
Prerequisites
Before running the workflow, make sure the following are in place:
- A valid OpenAI API key (added as a GitHub Secret: `OPENAI_API_KEY`)
- A dbt project with models and seed files organized as usual
- The `generate_with_openai.py` script added to your repo (a minimal CLI sketch follows this list), which:
  - Traverses the `models/` and `seeds/` folders
  - Builds a prompt and sends it to OpenAI
  - Saves the response as `output.md`
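Since the script is built with Typer, running it locally is straightforward. Here is a minimal sketch of how the file collection and CLI wiring might look; the function names are illustrative, but the `--models`, `--seeds`, and `--output` options match what the workflow passes later:

```python
from pathlib import Path

import typer

app = typer.Typer()


def collect_files(models_dir: str, seeds_dir: str) -> list[str]:
    """Gather every model .sql and seed .csv file for the prompt."""
    files: list[str] = []
    for base, suffix in [(models_dir, ".sql"), (seeds_dir, ".csv")]:
        files.extend(str(p) for p in Path(base).rglob(f"*{suffix}"))
    return sorted(files)


@app.command()
def main(
    models: str = typer.Option(..., help="Path to the dbt models/ folder"),
    seeds: str = typer.Option(..., help="Path to the dbt seeds/ folder"),
    output: str = typer.Option("output.md", help="Where to write the summary"),
) -> None:
    all_files = collect_files(models, seeds)
    typer.echo(f"Collected {len(all_files)} files for the prompt")
    # ...build the prompt and call OpenAI as shown in the next section...


if __name__ == "__main__":
    app()
```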
`generate_with_openai.py` (Key Steps)
```python
import os

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# all_files holds the model .sql and seed .csv paths collected earlier
prompt = PROMPT_HEADER.strip() + "\n\n"
for file_path in sorted(all_files):
    rel_path = os.path.relpath(file_path)
    prompt += f"File: {rel_path}\n" + ("-" * 80) + "\n"
    with open(file_path, "r", encoding="utf-8") as f:
        prompt += f.read().strip() + "\n"
    prompt += ("-" * 80) + "\n\n"

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.0,
)
```
The script uses GPT-4o to interpret your dbt project structure and generate business-readable documentation.
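The final step writes the model's reply to disk. With the v1 OpenAI client shown above, saving the summary takes only a few lines (a sketch; `response` is the object from the snippet):

```python
# Extract the generated Markdown and write it to the output file.
summary = response.choices[0].message.content
with open("output.md", "w", encoding="utf-8") as f:
    f.write(summary)
```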
The Prompt
It's a simple prompt, but it outlines many of the key pieces of documentation needed for a dbt setup.
```text
You are an expert data engineer. Given a set of dbt model files and seed CSVs, generate a comprehensive summary document describing the DBT setup. Your summary should include:

1. A high-level **overview** of the structure (staging, marts, seeds).
2. A **description of each seed file** and what kind of data it contains.
3. A **table summarizing staging models**, their sources, and key fields extracted.
4. A **description of each mart model**, the joins it performs, and any calculated fields.
5. A **data lineage diagram** in text format (ASCII or pseudo-graph).
6. A brief **conclusion** highlighting good practices or architectural choices.

Use clear headings, structured tables, and readable prose. Make the summary business-friendly while still detailed enough for technical audiences.

---

Here are the files:
```
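In the script, this prompt lives in the `PROMPT_HEADER` constant referenced earlier. A sketch of how it might be stored (text abbreviated):

```python
# The documentation prompt, abbreviated here; the full text is shown above.
PROMPT_HEADER = """
You are an expert data engineer. Given a set of dbt model files and seed CSVs,
generate a comprehensive summary document describing the DBT setup.
...
Here are the files:
"""
```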
GitHub Action Workflow
The GitHub Action is triggered on PRs that touch `models/`, `seeds/`, or the script itself. Here's what it does:
- Sets up Python and installs dependencies
- Waits for Postgres to be available (via service container)
- Runs `dbt seed` and `dbt run`
- Executes `generate_with_openai.py`
- Commits `output.md` to the pull request branch
```yaml
# Ensure that you set read/write permissions, otherwise the output.md push will fail.
# Also make sure that the OPENAI_API_KEY secret is set and the attached API key has
# appropriate permission settings.
#
# Read/Write permissions:
# Settings -> Actions -> General -> Workflow permissions -> select 'Read and write permissions' and save.
name: Run dbt in Docker

on:
  pull_request:
    branches: [ main ]
    paths:
      - 'dbt/**'
      - 'docker-compose.yml'
      - '.github/workflows/dbt.yml'

jobs:
  dbt-run:
    runs-on: ubuntu-latest

    services:
      postgres:
        image: postgres:15
        env:
          POSTGRES_USER: dbt_user
          POSTGRES_PASSWORD: dbt_pass
          POSTGRES_DB: dbt_db
        ports:
          - 5432:5432
        options: >-
          --health-cmd "pg_isready"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5

    steps:
      - name: Checkout code
        uses: actions/checkout@v4
        with:
          ref: ${{ github.head_ref }}  # check out the PR branch so output.md can be pushed back

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: 🔐 Set OpenAI API Key
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: echo "API key set"

      - name: Install dependencies
        run: |
          pip install dbt-postgres typer openai

      - name: Wait for Postgres
        run: |
          until pg_isready -h localhost -p 5432; do
            echo "Waiting for postgres..."
            sleep 2
          done

      - name: Run dbt debug
        run: |
          mkdir -p ~/.dbt
          cp dbt/profiles.yml ~/.dbt/profiles.yml
          cd dbt/project
          dbt debug

      - name: Seed database
        run: |
          cd dbt/project
          dbt seed

      - name: Run dbt models
        run: |
          cd dbt/project
          dbt run

      - name: Show list of dbt models
        run: |
          cd dbt/project
          dbt ls --resource-type model

      - name: Submit to OpenAI
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python dbt/project/generate_with_openai.py --models dbt/project/models --seeds dbt/project/seeds --output output.md

      - name: ✅ Commit and push summary
        run: |
          git config --global user.name "github-actions"
          git config --global user.email "actions@github.com"
          git add output.md
          git commit -m "📝 Add DBT summary (auto-generated by OpenAI)" || echo "No changes to commit"
          git push

      - name: 💬 Comment on PR (optional)
        if: success()
        uses: marocchino/sticky-pull-request-comment@v2
        with:
          message: |
            ✅ DBT summary generated by OpenAI and added to `output.md`.
```
Resulting Summary Output
The generated `output.md` includes:
- Overview of the dbt project structure
- Descriptions of seed files
- Summary table of staging models and their sources
- Explanation of mart models, joins, and transformations
- An ASCII data lineage diagram
- Architectural best practices
This makes it useful for both technical and business stakeholders, including for onboarding new employees.
Conclusion
This workflow brings LLM-powered documentation into the CI/CD loop for your dbt project. It ensures your dbt DAGs remain well-documented without manual intervention, improving transparency, onboarding, and maintenance.
Future improvements:
- Posting `output.md` directly in PR comments (see the sketch below)
- Sending Slack/Teams notifications
- Validating docs exist as part of a quality gate
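For the first item, a small Python helper using GitHub's REST API could post the summary from within the workflow. This is a sketch: `PR_NUMBER` is not a built-in Actions variable and would need to be passed in (e.g. from `github.event.pull_request.number`), while `GITHUB_REPOSITORY` and `GITHUB_TOKEN` are standard:

```python
import os

import requests


def post_summary_comment() -> None:
    """Post output.md as a comment on the current pull request."""
    repo = os.environ["GITHUB_REPOSITORY"]  # e.g. "owner/repo"
    pr_number = os.environ["PR_NUMBER"]     # supplied by the workflow
    token = os.environ["GITHUB_TOKEN"]

    with open("output.md", "r", encoding="utf-8") as f:
        body = f.read()

    # Pull request comments go through the issues endpoint of the REST API.
    resp = requests.post(
        f"https://api.github.com/repos/{repo}/issues/{pr_number}/comments",
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        json={"body": body},
    )
    resp.raise_for_status()


if __name__ == "__main__":
    post_summary_comment()
```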
Further Reading
Repo Integration Example
See the PR that adds this automation: