Chapter 7B: CI/CD & Automation
Manual deployments don't scale. Every code change should be automatically tested, security-scanned, and deployed to production. In this chapter, I'll show you how to build a complete CI/CD pipeline for ResearcherAI using GitHub Actions, with automated testing, security scanning, and GitOps-based deployments.
Why CI/CD for ML/AI Applications?
Traditional web apps need CI/CD, but LLM applications have additional challenges:
Traditional CI/CD Concerns:
- Code quality (linting, formatting)
- Unit tests
- Integration tests
- Security vulnerabilities
- Build artifacts
- Deployment automation
LLM-Specific Concerns:
- Prompt regression testing
- Model version management
- Token usage regression
- Cost regression testing
- Response quality evaluation
- Embedding drift detection (see the sketch after this list)
- Vector database migrations
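Most of these concerns map onto familiar testing patterns, but embedding drift detection deserves a concrete example. Here is a minimal sketch: it assumes you store embeddings of a fixed probe set from a known-good run and compare fresh embeddings of the same probes against that baseline. The file paths and the 0.98 threshold are illustrative assumptions, not part of ResearcherAI.
A possible scripts/check_embedding_drift.py:
#!/usr/bin/env python3
import sys

import numpy as np


def check_embedding_drift(baseline_path="embeddings/baseline.npy",
                          current_path="embeddings/current.npy",
                          min_cosine=0.98):
    """Fail if current embeddings have drifted from the stored baseline."""
    baseline = np.load(baseline_path)  # shape: (n_probes, dim)
    current = np.load(current_path)    # same probe set, new model/version

    # Mean cosine similarity between corresponding probe embeddings
    dots = np.sum(baseline * current, axis=1)
    norms = np.linalg.norm(baseline, axis=1) * np.linalg.norm(current, axis=1)
    mean_cos = float(np.mean(dots / norms))

    if mean_cos < min_cosine:
        print(f"❌ Embedding drift detected: mean cosine {mean_cos:.4f} < {min_cosine}")
        sys.exit(1)
    print(f"✅ Embeddings stable (mean cosine {mean_cos:.4f})")


if __name__ == "__main__":
    check_embedding_drift()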
The Goal
git push
↓
Automated Pipeline
├─ Lint & format code
├─ Run unit tests (Python)
├─ Run LLM evaluation tests
├─ Check prompt regressions
├─ Build Docker images
├─ Scan for vulnerabilities
├─ Deploy to staging
├─ Run integration tests
├─ Deploy to production
└─ Monitor deployment
GitHub Actions Overview
GitHub Actions is like having a CI/CD server that runs automatically on every git push.
Web Developer Analogy
GitHub Actions = Automated deployment like Vercel/Netlify
- Triggers on git push
- Runs tests automatically
- Builds and deploys
- All configured in .github/workflows/

Workflow = Recipe for automation
Job = Group of related steps
Step = Individual command or action
Action = Reusable workflow component
Basic Structure
.github/workflows/ci.yml:
name: CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Run tests
        run: pytest
Complete CI/CD Pipeline for ResearcherAI
Let's build a production-grade pipeline step by step.
Step 1: Code Quality Checks
.github/workflows/lint.yml:
name: Code Quality

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install black flake8 mypy pylint isort
          pip install -r requirements.txt

      - name: Check code formatting (Black)
        run: black --check .

      - name: Check import sorting (isort)
        run: isort --check-only .

      - name: Lint with flake8
        run: |
          flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
          flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics

      - name: Type checking (mypy)
        run: mypy agents/ utils/ --ignore-missing-imports

      - name: Lint with pylint
        run: pylint agents/ utils/ --disable=C0111,C0103

  yaml-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Lint YAML files
        uses: ibiqlik/action-yamllint@v3
        with:
          file_or_dir: .github/workflows/*.yml k8s/ terraform/

  dockerfile-lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Lint Dockerfile
        uses: hadolint/hadolint-action@v3.1.0
        with:
          dockerfile: Dockerfile
Step 2: Automated Testing
.github/workflows/test.yml:
name: Test Suite

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]

jobs:
  unit-tests:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ['3.10', '3.11', '3.12']
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python ${{ matrix.python-version }}
        uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-cov pytest-asyncio pytest-mock

      - name: Run unit tests
        run: |
          pytest tests/unit/ \
            --cov=agents \
            --cov=utils \
            --cov-report=xml \
            --cov-report=html \
            --junitxml=test-results.xml

      - name: Upload coverage to Codecov
        uses: codecov/codecov-action@v3
        with:
          file: ./coverage.xml
          flags: unittests
          name: codecov-umbrella

      - name: Upload test results
        uses: actions/upload-artifact@v4  # v3 is deprecated
        if: always()
        with:
          name: test-results-${{ matrix.python-version }}
          path: test-results.xml

  integration-tests:
    runs-on: ubuntu-latest
    services:
      neo4j:
        image: neo4j:5.13-community
        env:
          NEO4J_AUTH: neo4j/testpassword
        ports:
          - 7687:7687
          - 7474:7474
      qdrant:
        image: qdrant/qdrant:latest
        ports:
          - 6333:6333
      kafka:
        image: bitnami/kafka:latest
        env:
          KAFKA_ENABLE_KRAFT: yes
          KAFKA_CFG_PROCESS_ROLES: broker,controller
          KAFKA_CFG_CONTROLLER_LISTENER_NAMES: CONTROLLER
          KAFKA_CFG_LISTENERS: PLAINTEXT://:9092,CONTROLLER://:9093
          KAFKA_CFG_LISTENER_SECURITY_PROTOCOL_MAP: CONTROLLER:PLAINTEXT,PLAINTEXT:PLAINTEXT
          KAFKA_CFG_ADVERTISED_LISTENERS: PLAINTEXT://localhost:9092
          KAFKA_BROKER_ID: 1
          KAFKA_CFG_CONTROLLER_QUORUM_VOTERS: 1@localhost:9093
          ALLOW_PLAINTEXT_LISTENER: yes
        ports:
          - 9092:9092
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install pytest pytest-asyncio

      - name: Wait for services
        run: |
          timeout 60 bash -c 'until nc -z localhost 7687; do sleep 1; done'
          timeout 60 bash -c 'until nc -z localhost 6333; do sleep 1; done'
          timeout 60 bash -c 'until nc -z localhost 9092; do sleep 1; done'

      - name: Run integration tests
        env:
          NEO4J_URI: bolt://localhost:7687
          NEO4J_PASSWORD: testpassword
          QDRANT_HOST: localhost:6333
          KAFKA_BOOTSTRAP_SERVERS: localhost:9092
        run: |
          pytest tests/integration/ -v

  llm-evaluation-tests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'
          cache: 'pip'

      - name: Install dependencies
        run: |
          pip install -r requirements.txt
          pip install langsmith

      - name: Run LLM evaluation tests
        env:
          GOOGLE_API_KEY: ${{ secrets.GOOGLE_API_KEY }}
          LANGSMITH_API_KEY: ${{ secrets.LANGSMITH_API_KEY }}
          LANGCHAIN_PROJECT: ResearcherAI-CI
        run: |
          pytest tests/llm_evaluation/ -v --tb=short

      - name: Check for prompt regressions
        run: |
          python scripts/check_prompt_regressions.py \
            --baseline=prompts/baseline.json \
            --current=prompts/current.json

      - name: Check token usage regression
        run: |
          python scripts/check_token_regression.py \
            --max-increase-percent=10
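The llm-evaluation-tests job runs pytest tests/llm_evaluation/, so it's worth showing what such a test can look like. This is a deliberately small sketch: the agents.researcher import and the golden cases are assumptions about the codebase, and it only applies a cheap keyword floor check. Richer LLM-as-judge evaluators can be layered on via LangSmith, as configured in the workflow above.
A possible tests/llm_evaluation/test_answer_quality.py:
import os

import pytest

# Skip the whole module when no live LLM key is available (e.g., on forks)
pytestmark = pytest.mark.skipif(
    not os.getenv("GOOGLE_API_KEY"), reason="requires a live LLM key"
)

# (query, terms the answer must mention): hypothetical golden cases
GOLDEN_CASES = [
    ("What is retrieval-augmented generation?", ["retriev", "generat"]),
    ("Summarize the transformer architecture.", ["attention"]),
]


@pytest.mark.parametrize("query,required_terms", GOLDEN_CASES)
def test_answer_contains_required_terms(query, required_terms):
    from agents.researcher import answer  # hypothetical entry point

    response = answer(query).lower()
    # Deterministic floor check; keeps CI cheap and stable
    for term in required_terms:
        assert term in response, f"expected {term!r} in the response"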
Step 3: Security Scanning
.github/workflows/security.yml:
name: Security Scanning

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 0 * * 1'  # Weekly on Monday

jobs:
  dependency-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Run Snyk to check for vulnerabilities
        uses: snyk/actions/python@master
        env:
          SNYK_TOKEN: ${{ secrets.SNYK_TOKEN }}
        with:
          command: test
          args: --severity-threshold=high

      - name: Safety check (Python dependencies)
        run: |
          pip install safety
          safety check -r requirements.txt --json

  secret-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0  # Full history for secret scanning

      - name: TruffleHog Secret Scanning
        uses: trufflesecurity/trufflehog@main
        with:
          path: ./
          base: main
          head: HEAD

      - name: GitLeaks Secret Scanning
        uses: gitleaks/gitleaks-action@v2
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}

  code-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Initialize CodeQL
        uses: github/codeql-action/init@v3
        with:
          languages: python

      - name: Perform CodeQL Analysis
        uses: github/codeql-action/analyze@v3

  container-scan:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Build Docker image
        run: docker build -t researcherai:${{ github.sha }} .

      - name: Run Trivy vulnerability scanner
        uses: aquasecurity/trivy-action@master
        with:
          image-ref: researcherai:${{ github.sha }}
          format: 'sarif'
          output: 'trivy-results.sarif'
          severity: 'CRITICAL,HIGH'

      - name: Upload Trivy results to GitHub Security
        uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: 'trivy-results.sarif'
Step 4: Build and Push Docker Images
.github/workflows/build.yml:
name: Build and Push

on:
  push:
    branches: [main, develop]
    tags:
      - 'v*'

jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Docker Buildx
        uses: docker/setup-buildx-action@v3

      - name: Log in to GitHub Container Registry
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Extract metadata
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=ref,event=branch
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}.{{minor}}
            type=sha

      - name: Build and push
        id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          push: true
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          cache-from: type=gha
          cache-to: type=gha,mode=max
          platforms: linux/amd64,linux/arm64

      - name: Image digest
        # The digest is an output of the build step, not metadata-action
        run: echo ${{ steps.build.outputs.digest }}
Step 5: Deploy to Staging
.github/workflows/deploy-staging.yml:
name: Deploy to Staging

on:
  push:
    branches: [develop]
  workflow_dispatch:

jobs:
  deploy-staging:
    runs-on: ubuntu-latest
    environment:
      name: staging
      url: https://staging.researcherai.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_STAGING }}

      - name: Install Helm
        uses: azure/setup-helm@v3
        with:
          version: '3.12.0'

      - name: Deploy with Helm
        run: |
          helm upgrade --install researcherai ./k8s/helm/researcherai \
            --namespace staging \
            --create-namespace \
            --set app.image.tag=${{ github.sha }} \
            --set app.secrets.googleApiKey=${{ secrets.GOOGLE_API_KEY_STAGING }} \
            --set app.environment=staging \
            --wait \
            --timeout 10m

      - name: Verify deployment
        run: |
          kubectl rollout status deployment/researcherai -n staging
          kubectl get pods -n staging

      - name: Run smoke tests
        run: |
          sleep 30  # Wait for services to be ready
          curl -f https://staging.researcherai.com/health || exit 1
          python scripts/smoke_tests.py --environment=staging

      - name: Notify Slack
        if: always()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: 'Staging deployment ${{ job.status }}'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}
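Both the staging and production workflows call scripts/smoke_tests.py, which we haven't written yet. Here is one plausible shape for it; the endpoint paths, the --target handling, and the meaning of --critical-only are assumptions to adapt to your own routes.
A possible scripts/smoke_tests.py:
#!/usr/bin/env python3
import argparse
import sys

import requests

ENDPOINTS = {
    "staging": "https://staging.researcherai.com",
    "production": "https://researcherai.com",
}

CRITICAL_PATHS = ["/health"]
EXTENDED_PATHS = ["/health", "/api/v1/status", "/metrics"]  # assumed routes


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--environment", required=True, choices=ENDPOINTS)
    # e.g. blue/green; routing to a color-specific host is left out here
    parser.add_argument("--target", default=None)
    parser.add_argument("--critical-only", action="store_true")
    args = parser.parse_args()

    base = ENDPOINTS[args.environment]
    paths = CRITICAL_PATHS if args.critical_only else EXTENDED_PATHS

    failures = 0
    for path in paths:
        try:
            resp = requests.get(base + path, timeout=10)
            ok = resp.status_code == 200
            detail = f"HTTP {resp.status_code}"
        except requests.RequestException as exc:
            ok, detail = False, str(exc)
        print(("✅" if ok else "❌"), path, detail)
        failures += 0 if ok else 1

    sys.exit(1 if failures else 0)


if __name__ == "__main__":
    main()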
Step 6: Deploy to Production
.github/workflows/deploy-production.yml:
name: Deploy to Production

on:
  push:
    tags:
      - 'v*'
  workflow_dispatch:

jobs:
  deploy-production:
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://researcherai.com
    steps:
      - uses: actions/checkout@v4

      - name: Configure kubectl
        uses: azure/k8s-set-context@v3
        with:
          method: kubeconfig
          kubeconfig: ${{ secrets.KUBE_CONFIG_PRODUCTION }}

      - name: Install Helm
        uses: azure/setup-helm@v3

      - name: Blue-Green Deployment
        run: |
          # Deploy to the blue environment
          helm upgrade --install researcherai-blue ./k8s/helm/researcherai \
            --namespace production \
            --create-namespace \
            --set app.image.tag=${{ github.ref_name }} \
            --set app.secrets.googleApiKey=${{ secrets.GOOGLE_API_KEY_PROD }} \
            --set app.environment=production \
            --set app.deployment.color=blue \
            --wait \
            --timeout 10m

      - name: Run production smoke tests
        run: |
          python scripts/smoke_tests.py \
            --environment=production \
            --target=blue \
            --critical-only

      - name: Switch traffic to blue
        run: |
          kubectl patch service researcherai -n production \
            -p '{"spec":{"selector":{"color":"blue"}}}'

      - name: Monitor for 5 minutes
        run: |
          python scripts/monitor_deployment.py \
            --duration=300 \
            --error-threshold=0.01

      - name: Remove green release
        if: success()
        run: |
          helm delete researcherai-green -n production || true

      - name: Rollback on failure
        if: failure()
        run: |
          kubectl patch service researcherai -n production \
            -p '{"spec":{"selector":{"color":"green"}}}'
          helm delete researcherai-blue -n production

      - name: Create GitHub Release
        if: success()
        uses: actions/create-release@v1
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
        with:
          tag_name: ${{ github.ref }}
          release_name: Release ${{ github.ref_name }}
          body: |
            ## Changes
            ${{ github.event.head_commit.message }}

            ## Deployment
            - Deployed to production
            - Image: ghcr.io/${{ github.repository }}:${{ github.ref_name }}

      - name: Notify Slack
        if: always()
        uses: 8398a7/action-slack@v3
        with:
          status: ${{ job.status }}
          text: 'Production deployment ${{ job.status }}: ${{ github.ref_name }}'
          webhook_url: ${{ secrets.SLACK_WEBHOOK }}
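The monitoring script referenced in the "Monitor for 5 minutes" step also deserves a sketch. One way to implement it is to poll Prometheus (part of our observability stack) for the 5xx error rate; the Prometheus address and metric names below are assumptions about your setup.
A possible scripts/monitor_deployment.py:
#!/usr/bin/env python3
import argparse
import sys
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # assumed address

# Assumed metric: a standard http_requests_total with status and app labels
ERROR_RATE_QUERY = (
    'sum(rate(http_requests_total{status=~"5..",app="researcherai"}[1m]))'
    ' / sum(rate(http_requests_total{app="researcherai"}[1m]))'
)


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=300)
    parser.add_argument("--error-threshold", type=float, default=0.01)
    args = parser.parse_args()

    deadline = time.time() + args.duration
    while time.time() < deadline:
        resp = requests.get(
            f"{PROMETHEUS_URL}/api/v1/query",
            params={"query": ERROR_RATE_QUERY},
            timeout=10,
        )
        result = resp.json()["data"]["result"]
        error_rate = float(result[0]["value"][1]) if result else 0.0
        print(f"error rate: {error_rate:.4f}")

        if error_rate > args.error_threshold:
            print("❌ Error rate above threshold, failing the deployment")
            sys.exit(1)
        time.sleep(15)

    print("✅ Deployment healthy for the full monitoring window")


if __name__ == "__main__":
    main()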
GitOps with ArgoCD
For declarative deployments, ArgoCD watches your Git repository and automatically syncs changes to Kubernetes.
Step 1: Install ArgoCD
# Create namespace
kubectl create namespace argocd

# Install ArgoCD
kubectl apply -n argocd -f https://raw.githubusercontent.com/argoproj/argo-cd/stable/manifests/install.yaml

# Get the initial admin password
kubectl -n argocd get secret argocd-initial-admin-secret \
  -o jsonpath="{.data.password}" | base64 -d

# Port-forward to access the UI
kubectl port-forward svc/argocd-server -n argocd 8080:443

# Log in
argocd login localhost:8080
Step 2: Create ArgoCD Application
argocd/application.yaml:
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: researcherai
  namespace: argocd
spec:
  project: default

  source:
    repoURL: https://github.com/Sebuliba-Adrian/ResearcherAI.git
    targetRevision: HEAD
    path: k8s/helm/researcherai
    helm:
      valueFiles:
        - values.yaml
        - values-production.yaml

  destination:
    server: https://kubernetes.default.svc
    namespace: production

  syncPolicy:
    automated:
      prune: true     # Delete resources that are no longer in Git
      selfHeal: true  # Revert manual cluster changes back to the Git state
    syncOptions:
      - CreateNamespace=true
    retry:
      limit: 5
      backoff:
        duration: 5s
        factor: 2
        maxDuration: 3m
Apply:
kubectl apply -f argocd/application.yaml -n argocd
Step 3: Update GitHub Actions for GitOps
.github/workflows/gitops-deploy.yml:
name: GitOps Deployment

on:
  push:
    branches: [main]
    paths:
      - 'k8s/**'
      - 'Dockerfile'

jobs:
  update-manifests:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Update image tag
        run: |
          sed -i "s|tag: .*|tag: ${GITHUB_SHA}|" \
            k8s/helm/researcherai/values-production.yaml

      - name: Commit changes
        run: |
          git config user.name "GitHub Actions"
          git config user.email "actions@github.com"
          git add k8s/helm/researcherai/values-production.yaml
          git commit -m "Update production image to ${GITHUB_SHA}"
          git push

# ArgoCD automatically detects the change and syncs!
LLM-Specific CI/CD Practices
1. Prompt Regression Testing
scripts/check_prompt_regressions.py:
#!/usr/bin/env python3
import argparse
import json
import sys

from langsmith import Client

# Project-specific pieces assumed to live elsewhere in the repo:
# generate_with_current_prompt runs the agent with the prompt under test;
# quality_evaluator and relevance_evaluator are custom LangSmith evaluators.
from evaluation import (
    generate_with_current_prompt,
    quality_evaluator,
    relevance_evaluator,
)


def check_prompt_regressions(baseline_path):
    """
    Compare current prompt scores against the stored baseline.
    Fail (exit 1) if any metric drops by more than 5%.
    """
    client = Client()

    # Load baseline scores
    with open(baseline_path) as f:
        baseline = json.load(f)

    # Run the current evaluation against the shared test dataset
    results = client.evaluate(
        lambda x: generate_with_current_prompt(x),
        data="research-queries-test-set",
        evaluators=[quality_evaluator, relevance_evaluator],
    )

    # Compare scores (assumes per-metric averages are exposed;
    # adapt the indexing to your evaluator output)
    for metric in ["quality", "relevance"]:
        current_score = results[metric]
        baseline_score = baseline[metric]

        if current_score < baseline_score * 0.95:  # more than 5% drop
            print(f"❌ Regression detected in {metric}!")
            print(f"   Baseline: {baseline_score:.2f}")
            print(f"   Current:  {current_score:.2f}")
            print(f"   Drop:     {baseline_score - current_score:.2f}")
            sys.exit(1)

    print("✅ No prompt regressions detected")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--baseline", default="prompts/baseline.json")
    # --current is accepted to match the workflow invocation; current
    # scores are computed live by the evaluation run above.
    parser.add_argument("--current", default="prompts/current.json")
    args = parser.parse_args()
    check_prompt_regressions(args.baseline)
2. Token Usage Monitoring
scripts/check_token_regression.py:
#!/usr/bin/env python3
import argparse
import sys

from langfuse import Langfuse


def check_token_regression(max_increase_percent=10):
    """
    Ensure average token usage per trace hasn't jumped unexpectedly.
    """
    langfuse = Langfuse()

    # NOTE: get_trace_stats is a project-specific helper around the
    # Langfuse analytics API; the stock SDK doesn't ship this method.
    last_week = langfuse.get_trace_stats(
        filter_time_start="7d ago",
        filter_time_end="now",
    )
    avg_tokens_last_week = last_week["avg_tokens_per_trace"]

    last_day = langfuse.get_trace_stats(
        filter_time_start="1d ago",
        filter_time_end="now",
    )
    avg_tokens_last_day = last_day["avg_tokens_per_trace"]

    # Calculate the relative increase
    increase_percent = (
        (avg_tokens_last_day - avg_tokens_last_week) / avg_tokens_last_week
    ) * 100

    if increase_percent > max_increase_percent:
        print(f"❌ Token usage increased by {increase_percent:.1f}%!")
        print(f"   Last week: {avg_tokens_last_week:.0f} tokens")
        print(f"   Last day:  {avg_tokens_last_day:.0f} tokens")
        sys.exit(1)

    print(f"✅ Token usage stable (change: {increase_percent:.1f}%)")


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-increase-percent", type=float, default=10)
    args = parser.parse_args()
    check_token_regression(args.max_increase_percent)
3. Cost Budget Checks
scripts/check_cost_budget.py:
#!/usr/bin/env python3
import sys

from langfuse import Langfuse


def check_cost_budget(daily_budget=100):
    """
    Fail the deployment if projected daily cost exceeds the budget.
    """
    langfuse = Langfuse()

    # NOTE: get_trace_stats is a project-specific helper around the
    # Langfuse analytics API; the stock SDK doesn't ship this method.
    stats = langfuse.get_trace_stats(
        filter_time_start="1h ago",
        filter_time_end="now",
    )

    hourly_cost = stats["total_cost"]
    projected_daily_cost = hourly_cost * 24

    if projected_daily_cost > daily_budget:
        print("❌ Projected daily cost exceeds budget!")
        print(f"   Budget:    ${daily_budget:.2f}")
        print(f"   Projected: ${projected_daily_cost:.2f}")
        print(f"   Hourly:    ${hourly_cost:.2f}")
        sys.exit(1)

    print(f"✅ Cost within budget (${projected_daily_cost:.2f}/day)")


if __name__ == "__main__":
    check_cost_budget()
Deployment Strategies
1. Rolling Update (Default)
# k8s/helm/researcherai/templates/deployment.yaml
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Add 1 new pod before removing an old one
      maxUnavailable: 0  # Never drop below the desired replica count
Pros:
- Zero downtime
- Automatic rollback on failure
- Resource efficient
Cons:
- Gradual rollout (not instant)
- Mixed versions running temporarily
2. Blue-Green Deployment
# Deploy blue
helm install researcherai-blue ./k8s/helm/researcherai \
  --set app.deployment.color=blue

# Test blue
curl https://blue.researcherai.com/health

# Switch traffic
kubectl patch service researcherai \
  -p '{"spec":{"selector":{"color":"blue"}}}'

# Remove green
helm delete researcherai-green
Pros:
- Instant rollback
- Full testing before switch
- Clean separation
Cons:
- Requires 2x resources
- Database migrations complex
3. Canary Deployment
# Deploy canary with 10% traffic
helm install researcherai-canary ./k8s/helm/researcherai \
  --set app.replicaCount=1 \
  --set ingress.weight=10

# Monitor
python scripts/monitor_canary.py --duration=600

# If successful, scale up
helm upgrade researcherai-canary ./k8s/helm/researcherai \
  --set app.replicaCount=10 \
  --set ingress.weight=100
Pros:
- Gradual rollout
- Early problem detection
- User-based routing possible
Cons:
- Complex traffic management
- Requires a service mesh (e.g., Istio) or weighted ingress routing
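scripts/monitor_canary.py can follow the same pattern as the deployment monitor, but instead of a fixed threshold it compares the canary's error rate against the stable deployment. The Prometheus address and the track label used to separate the two are assumptions about how your metrics are labelled.
A possible scripts/monitor_canary.py:
#!/usr/bin/env python3
import argparse
import sys
import time

import requests

PROMETHEUS_URL = "http://prometheus.monitoring:9090"  # assumed address


def error_rate(track):
    # Assumed labelling: track="canary" vs. track="stable"
    query = (
        f'sum(rate(http_requests_total{{status=~"5..",track="{track}"}}[1m]))'
        f' / sum(rate(http_requests_total{{track="{track}"}}[1m]))'
    )
    resp = requests.get(
        f"{PROMETHEUS_URL}/api/v1/query", params={"query": query}, timeout=10
    )
    result = resp.json()["data"]["result"]
    return float(result[0]["value"][1]) if result else 0.0


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--duration", type=int, default=600)
    parser.add_argument("--max-ratio", type=float, default=2.0)
    args = parser.parse_args()

    deadline = time.time() + args.duration
    while time.time() < deadline:
        canary, stable = error_rate("canary"), error_rate("stable")
        print(f"canary={canary:.4f} stable={stable:.4f}")

        # Fail if the canary errors at more than max-ratio times the stable rate
        if canary > max(stable, 0.001) * args.max_ratio:
            print("❌ Canary error rate too high relative to stable")
            sys.exit(1)
        time.sleep(30)

    print("✅ Canary healthy; safe to scale up")


if __name__ == "__main__":
    main()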
Monitoring CI/CD Pipeline
GitHub Actions Dashboard
View all workflow runs:
https://github.com/Sebuliba-Adrian/ResearcherAI/actions
Pipeline Metrics to Track
1. Build Time
   - Target: < 10 minutes
   - Alert if > 15 minutes
2. Test Pass Rate
   - Target: 100%
   - Alert if < 95%
3. Deployment Frequency
   - Target: multiple times per day
   - Track the trend
4. Mean Time to Recovery (MTTR)
   - Target: < 1 hour
   - Track incidents
5. Change Failure Rate
   - Target: < 5%
   - Track rollbacks
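None of these metrics require a paid tool; the GitHub API exposes enough data to compute the first two. Here is a hedged sketch that pulls recent completed runs of the CI workflow and reports mean duration and pass rate; the script name and workflow file name are assumptions.
A possible scripts/pipeline_metrics.py:
#!/usr/bin/env python3
import os
import statistics
from datetime import datetime

import requests

REPO = "Sebuliba-Adrian/ResearcherAI"
WORKFLOW = "ci.yml"  # assumed workflow file name
ISO = "%Y-%m-%dT%H:%M:%SZ"


def fetch_runs(token, per_page=50):
    # GET /repos/{owner}/{repo}/actions/workflows/{workflow}/runs
    resp = requests.get(
        f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW}/runs",
        headers={"Authorization": f"Bearer {token}"},
        params={"per_page": per_page, "status": "completed"},
        timeout=10,
    )
    resp.raise_for_status()
    return resp.json()["workflow_runs"]


def main():
    runs = fetch_runs(os.environ["GITHUB_TOKEN"])
    durations = []
    passed = 0
    for run in runs:
        start = datetime.strptime(run["run_started_at"], ISO)
        end = datetime.strptime(run["updated_at"], ISO)  # approximate finish
        durations.append((end - start).total_seconds() / 60)
        passed += run["conclusion"] == "success"

    print(f"Runs analysed: {len(runs)}")
    print(f"Mean duration: {statistics.mean(durations):.1f} min")
    print(f"Pass rate:     {passed / len(runs):.0%}")


if __name__ == "__main__":
    main()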
Pipeline Alerts
.github/workflows/alerts.yml:
name: Pipeline Alerts

on:
  workflow_run:
    workflows: ["CI Pipeline", "Deploy to Production"]
    types: [completed]

jobs:
  alert-on-failure:
    runs-on: ubuntu-latest
    if: ${{ github.event.workflow_run.conclusion == 'failure' }}
    steps:
      - name: Send PagerDuty alert
        uses: PagerDuty/pagerduty-github-action@v1
        with:
          pagerduty-token: ${{ secrets.PAGERDUTY_TOKEN }}
          service-id: ${{ secrets.PAGERDUTY_SERVICE_ID }}
          event-action: trigger
          summary: "CI/CD Pipeline failed: ${{ github.event.workflow_run.name }}"
          severity: high
Best Practices
1. Fast Feedback
# Run fast tests first
jobs:
  quick-tests:
    runs-on: ubuntu-latest
    steps:
      - run: pytest tests/unit/ -v  # < 2 minutes

  slow-tests:
    needs: quick-tests  # Only run if quick tests pass
    runs-on: ubuntu-latest
    steps:
      - run: pytest tests/integration/ -v  # 5-10 minutes
2. Fail Fast
jobs:
  test:
    strategy:
      fail-fast: true  # Cancel remaining matrix jobs on the first failure
      matrix:
        python-version: ['3.10', '3.11', '3.12']
3. Caching
- name: Cache pip packages
  uses: actions/cache@v4
  with:
    path: ~/.cache/pip
    key: ${{ runner.os }}-pip-${{ hashFiles('requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

- name: Cache Docker layers
  uses: actions/cache@v4
  with:
    path: /tmp/.buildx-cache
    key: ${{ runner.os }}-buildx-${{ github.sha }}
    restore-keys: |
      ${{ runner.os }}-buildx-
4. Secure Secrets
# Never log secrets. Values referenced via ${{ secrets }} are masked
# automatically; ::add-mask:: extends that to values derived from them.
- name: Deploy with secrets
  run: |
    echo "::add-mask::${{ secrets.API_KEY }}"
    deploy.sh --api-key="${{ secrets.API_KEY }}"
    # API_KEY stays masked in the logs
5. Environment Protection
# Require approval for production
environment:
name: production
url: https://researcherai.com
# Configure in GitHub Settings → Environments
# - Require reviewers: 2 approvals
# - Wait timer: 0 minutes
# - Deployment branches: main only
Troubleshooting CI/CD
Pipeline Fails on Dependencies
# Use exact versions
- name: Install dependencies
  run: |
    pip install -r requirements-lock.txt  # Pinned versions
# Not: pip install -r requirements.txt (unpinned versions can break)
Tests Pass Locally, Fail in CI
# Use the same environment
- name: Run tests in Docker
  run: |
    docker-compose -f docker-compose.test.yml up --abort-on-container-exit
# Same environment as local development
Slow Pipelines
# Run jobs in parallel
jobs:
  unit-tests:
    runs-on: ubuntu-latest
  integration-tests:
    runs-on: ubuntu-latest
  security-scan:
    runs-on: ubuntu-latest
# All three jobs run simultaneously
Deployment Failures
# Add a rollback step
- name: Rollback on failure
  if: failure()
  run: |
    helm rollback researcherai -n production
    kubectl rollout undo deployment/researcherai -n production
Complete CI/CD Checklist
✅ Code Quality
- Linting (black, flake8)
- Type checking (mypy)
- Import sorting (isort)
- Complexity checks (pylint)
✅ Testing
- Unit tests (>80% coverage)
- Integration tests
- LLM evaluation tests
- Prompt regression tests
- Token usage checks
- Cost budget checks
✅ Security
- Dependency scanning (Snyk)
- Secret scanning (TruffleHog)
- Code scanning (CodeQL)
- Container scanning (Trivy)
- License compliance
✅ Build & Deploy
- Docker image building
- Multi-arch builds (amd64, arm64)
- Image signing
- Helm chart validation
- Kubernetes deployment
- Smoke tests
- Health checks
✅ Monitoring
- Deployment tracking
- Rollback automation
- Slack/PagerDuty alerts
- Metrics collection
- Log aggregation
✅ GitOps (Optional)
- ArgoCD setup
- Declarative manifests
- Automated sync
- Self-healing
Conclusion
Congratulations! You now have a complete CI/CD pipeline that:
- ✅ Automatically tests every code change
- ✅ Checks for security vulnerabilities
- ✅ Validates prompt quality
- ✅ Monitors token usage and costs
- ✅ Builds and pushes Docker images
- ✅ Deploys to staging and production
- ✅ Runs smoke tests
- ✅ Monitors deployments
- ✅ Automatically rolls back on failure
- ✅ Sends alerts on issues
Your ResearcherAI system is now production-ready with:
- Kubernetes deployment - Auto-scaling, high availability
- Terraform infrastructure - Reproducible cloud resources
- Comprehensive observability - LangFuse, Prometheus, Grafana
- Automated CI/CD - GitHub Actions, security scanning
- Multiple environments - Dev, staging, production
In the final chapter, we'll wrap up with lessons learned and future enhancements!