Chapter 6B: Terraform & Cloud Infrastructure
While Kubernetes is excellent for orchestrating containers, we still need to provision the underlying infrastructure: servers, networks, databases, and load balancers. This is where Terraform comes in. In this chapter, I'll show you how to deploy ResearcherAI's complete infrastructure on DigitalOcean using Infrastructure as Code.
Why Infrastructure as Code?
Before Terraform, deploying infrastructure meant:
- Clicking through cloud provider dashboards
- Copy-pasting configurations
- Inconsistent environments (dev vs prod)
- No version control
- Manual documentation
- "Works on my machine" problems
With Terraform:
- Infrastructure defined in code files
- Version controlled with Git
- Reproducible deployments
- Automated provisioning
- Infrastructure reviews (pull requests)
- Disaster recovery (redeploy from code)
Web Developer Analogy
Manual Infrastructure = Directly editing production database
- Click buttons in admin panel
- Hope you remember what you changed
- No audit trail
Infrastructure as Code = Database migrations
- Changes defined in code
- Version controlled
- Reviewable
- Rollback-able
- Documented automatically
Terraform Basics
Terraform describes infrastructure declaratively in its own configuration language (HCL). Here's a Terraform resource next to a conceptual JavaScript equivalent:
# Terraform configuration
resource "digitalocean_droplet" "app_server" {
name = "app-server-1"
size = "s-2vcpu-4gb"
image = "ubuntu-22-04-x64"
region = "nyc3"
}
// JavaScript equivalent (conceptual)
const appServer = new DigitalOceanDroplet({
name: 'app-server-1',
size: 's-2vcpu-4gb',
image: 'ubuntu-22-04-x64',
region: 'nyc3'
});
Core Concepts
1. Providers (Like npm packages)
terraform {
required_providers {
digitalocean = {
source = "digitalocean/digitalocean"
version = "~> 2.0"
}
}
}
provider "digitalocean" {
token = var.do_token
}
2. Resources (Infrastructure components)
resource "type" "name" {
argument1 = "value1"
argument2 = "value2"
}
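Resources can reference each other's attributes, and Terraform works out the creation order from those references. As a small sketch (the domain here is a placeholder, not part of ResearcherAI), a DNS record could point at the droplet defined earlier:
# Sketch: a DNS A record referencing the droplet's IP (placeholder domain).
# Terraform creates the droplet first because the record depends on it.
resource "digitalocean_domain" "main" {
  name = "example.com"
}

resource "digitalocean_record" "app" {
  domain = digitalocean_domain.main.id
  type   = "A"
  name   = "app"
  value  = digitalocean_droplet.app_server.ipv4_address
  ttl    = 300
}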
3. Variables (Like function parameters)
variable "droplet_count" {
type = number
default = 2
}
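The default is only a fallback; real values come from terraform.tfvars files, -var flags, or TF_VAR_-prefixed environment variables. A tiny illustrative terraform.tfvars:
# terraform.tfvars (illustrative)
droplet_count = 3

# Equivalent ways to set the same variable:
#   terraform apply -var="droplet_count=3"
#   export TF_VAR_droplet_count=3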
4. Outputs (Return values)
output "server_ip" {
value = digitalocean_droplet.app_server.ipv4_address
}
5. Modules (Reusable components)
module "vpc" {
source = "./modules/vpc"
name = "production-vpc"
}
ResearcherAI Infrastructure Architecture
Here's what we're building with Terraform:
Internet
↓
Load Balancer (DigitalOcean LB)
↓
┌─────────────── VPC (10.10.0.0/16) ───────────────┐
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ App Server 1 │ │ App Server 2 │ │
│ │ Docker │ │ Docker │ │
│ │ Compose │ │ Compose │ │
│ └──────────────┘ └──────────────┘ │
│ │ │ │
│ └─────────┬───────────┘ │
│ ↓ │
│ ┌─────────────────────────────────┐ │
│ │ Managed PostgreSQL Cluster │ │
│ │ (Metadata & Sessions) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌─────────────────────────────────┐ │
│ │ Managed Kafka Cluster │ │
│ │ (3 nodes for HA) │ │
│ └─────────────────────────────────┘ │
│ │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ Neo4j │ │ Qdrant │ │
│ │ (on Docker) │ │ (on Docker) │ │
│ └──────────────┘ └──────────────┘ │
│ │
└────────────────────────────────────────────────────┘
Firewall Rules:
- Internet → Load Balancer: 80, 443
- Load Balancer → App Servers: 8000
- App Servers → Databases: Internal only
- App Servers → Internet: API calls
Project Structure
terraform/
├── main.tf # Main infrastructure
├── variables.tf # Input variables
├── outputs.tf # Output values
├── terraform.tfvars.example # Configuration template
├── modules/
│ ├── vpc/ # Virtual Private Cloud
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── droplet/ # Application servers
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── database/ # PostgreSQL
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── kafka/ # Kafka cluster
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ ├── loadbalancer/ # Load balancer
│ │ ├── main.tf
│ │ ├── variables.tf
│ │ └── outputs.tf
│ └── firewall/ # Security rules
│ ├── main.tf
│ ├── variables.tf
│ └── outputs.tf
├── environments/
│ ├── dev/
│ │ └── terraform.tfvars
│ ├── staging/
│ │ └── terraform.tfvars
│ └── production/
│ └── terraform.tfvars
└── scripts/
└── init_app_server.sh # Server setup script
Building the Infrastructure
Step 1: Main Configuration
main.tf - The entry point:
terraform {
required_version = ">= 1.0"
required_providers {
digitalocean = {
source = "digitalocean/digitalocean"
version = "~> 2.0"
}
}
# Optional: Remote state in S3/Spaces
# backend "s3" {
# endpoint = "nyc3.digitaloceanspaces.com"
# key = "terraform.tfstate"
# bucket = "researcherai-terraform"
# region = "us-east-1"
# skip_credentials_validation = true
# skip_metadata_api_check = true
# }
}
provider "digitalocean" {
token = var.do_token
}
# VPC Module
module "vpc" {
source = "./modules/vpc"
project_name = var.project_name
environment = var.environment
region = var.region
vpc_cidr = var.vpc_cidr
}
# Droplet Module (Application Servers)
module "droplet" {
source = "./modules/droplet"
project_name = var.project_name
environment = var.environment
region = var.region
droplet_count = var.droplet_count
droplet_size = var.droplet_size
ssh_key_ids = var.ssh_key_ids
vpc_uuid = module.vpc.vpc_id
# Environment variables for app
google_api_key = var.google_api_key
neo4j_uri = var.external_neo4j_uri != "" ? var.external_neo4j_uri : "bolt://localhost:7687"
neo4j_password = var.neo4j_password
qdrant_url = var.external_qdrant_url != "" ? var.external_qdrant_url : "http://localhost:6333"
use_kafka = var.use_managed_kafka
kafka_bootstrap = var.use_managed_kafka ? module.kafka[0].bootstrap_servers : "localhost:9092"
depends_on = [module.vpc]
}
# Database Module (PostgreSQL)
module "database" {
source = "./modules/database"
project_name = var.project_name
environment = var.environment
region = var.region
db_size = var.db_size
db_node_count = var.db_node_count
vpc_uuid = module.vpc.vpc_id
depends_on = [module.vpc]
}
# Kafka Module (if using managed Kafka)
module "kafka" {
count = var.use_managed_kafka ? 1 : 0
source = "./modules/kafka"
project_name = var.project_name
environment = var.environment
region = var.region
kafka_size = var.kafka_size
kafka_nodes = var.kafka_node_count
vpc_uuid = module.vpc.vpc_id
depends_on = [module.vpc]
}
# Load Balancer Module
module "loadbalancer" {
source = "./modules/loadbalancer"
project_name = var.project_name
environment = var.environment
region = var.region
vpc_uuid = module.vpc.vpc_id
droplet_ids = module.droplet.droplet_ids
depends_on = [module.droplet]
}
# Firewall Module
module "firewall" {
source = "./modules/firewall"
project_name = var.project_name
environment = var.environment
droplet_ids = module.droplet.droplet_ids
loadbalancer_id = module.loadbalancer.lb_id
vpc_cidr = var.vpc_cidr
allowed_ssh_ips = var.allowed_ssh_ips
depends_on = [module.droplet, module.loadbalancer]
}
# DigitalOcean Project (for organization)
resource "digitalocean_project" "researcherai" {
name = "${var.project_name}-${var.environment}"
description = "ResearcherAI Multi-Agent RAG System - ${var.environment}"
purpose = "Web Application"
environment = var.environment
resources = concat(
module.droplet.droplet_urns,
[module.loadbalancer.lb_urn],
[module.database.db_urn],
var.use_managed_kafka ? [module.kafka[0].kafka_urn] : []
)
}
Step 2: Variables
variables.tf - All configurable parameters:
# DigitalOcean API Token
variable "do_token" {
description = "DigitalOcean API token"
type = string
sensitive = true
}
# Project Configuration
variable "project_name" {
description = "Project name for resource naming"
type = string
default = "researcherai"
}
variable "environment" {
description = "Environment (dev, staging, production)"
type = string
default = "production"
validation {
condition = contains(["dev", "staging", "production"], var.environment)
error_message = "Environment must be dev, staging, or production."
}
}
variable "region" {
description = "DigitalOcean region"
type = string
default = "nyc3"
validation {
condition = contains([
"nyc1", "nyc3", "sfo1", "sfo2", "sfo3",
"ams3", "sgp1", "lon1", "fra1", "tor1", "blr1"
], var.region)
error_message = "Invalid DigitalOcean region."
}
}
# Network Configuration
variable "vpc_cidr" {
description = "VPC CIDR block"
type = string
default = "10.10.0.0/16"
}
# Droplet Configuration
variable "droplet_count" {
description = "Number of application servers"
type = number
default = 2
validation {
condition = var.droplet_count >= 1 && var.droplet_count <= 10
error_message = "Droplet count must be between 1 and 10."
}
}
variable "droplet_size" {
description = "Droplet size slug"
type = string
default = "s-2vcpu-4gb" # $36/month per server
validation {
condition = contains([
"s-1vcpu-2gb", # $18/mo - Dev/staging
"s-2vcpu-4gb", # $36/mo - Production
"s-4vcpu-8gb", # $72/mo - High traffic
"s-8vcpu-16gb" # $144/mo - Very high traffic
], var.droplet_size)
error_message = "Invalid droplet size."
}
}
variable "ssh_key_ids" {
description = "SSH key IDs for droplet access"
type = list(string)
}
# Database Configuration
variable "db_size" {
description = "Database cluster size"
type = string
default = "db-s-2vcpu-4gb" # $60/month
}
variable "db_node_count" {
description = "Number of database nodes (1 or 2)"
type = number
default = 1
validation {
condition = var.db_node_count >= 1 && var.db_node_count <= 2
error_message = "Database node count must be 1 or 2."
}
}
# Kafka Configuration
variable "use_managed_kafka" {
description = "Use DigitalOcean managed Kafka"
type = bool
default = true
}
variable "kafka_size" {
description = "Kafka node size"
type = string
default = "db-s-2vcpu-2gb" # $30/month per node
}
variable "kafka_node_count" {
description = "Number of Kafka nodes (minimum 3 for HA)"
type = number
default = 3
validation {
condition = var.kafka_node_count >= 3
error_message = "Kafka requires minimum 3 nodes for high availability."
}
}
# Application Secrets
variable "google_api_key" {
description = "Google Gemini API key"
type = string
sensitive = true
}
variable "neo4j_password" {
description = "Neo4j database password"
type = string
sensitive = true
default = "secure-neo4j-password"
}
# External Services (optional)
variable "external_neo4j_uri" {
description = "External Neo4j URI (if not using local)"
type = string
default = ""
}
variable "external_qdrant_url" {
description = "External Qdrant URL (if not using local)"
type = string
default = ""
}
# Security
variable "allowed_ssh_ips" {
description = "IP addresses allowed SSH access"
type = list(string)
default = ["0.0.0.0/0"] # WARNING: Restrict in production!
}
# Feature Flags
variable "enable_monitoring" {
description = "Enable DigitalOcean monitoring"
type = bool
default = true
}
variable "enable_backups" {
description = "Enable automated backups"
type = bool
default = true
}
# Tags
variable "custom_tags" {
description = "Custom tags for resources"
type = list(string)
default = []
}
Step 3: VPC Module
modules/vpc/main.tf - Private network:
resource "digitalocean_vpc" "main" {
name = "${var.project_name}-vpc-${var.environment}"
region = var.region
ip_range = var.vpc_cidr
description = "VPC for ${var.project_name} ${var.environment} environment"
}
# Outputs
output "vpc_id" {
value = digitalocean_vpc.main.id
}
output "vpc_urn" {
value = digitalocean_vpc.main.urn
}
modules/vpc/variables.tf:
variable "project_name" {
type = string
}
variable "environment" {
type = string
}
variable "region" {
type = string
}
variable "vpc_cidr" {
type = string
default = "10.10.0.0/16"
}
Step 4: Droplet Module
modules/droplet/main.tf - Application servers:
resource "digitalocean_droplet" "app_server" {
count = var.droplet_count
name = "${var.project_name}-app-${var.environment}-${count.index + 1}"
region = var.region
size = var.droplet_size
image = "ubuntu-22-04-x64"
# SSH keys
ssh_keys = var.ssh_key_ids
# VPC
vpc_uuid = var.vpc_uuid
# IPv6
ipv6 = true
# Monitoring
monitoring = true
# Backups
backups = var.enable_backups
# User data - initialization script
user_data = templatefile("${path.module}/../../scripts/init_app_server.sh", {
google_api_key = var.google_api_key
neo4j_uri = var.neo4j_uri
neo4j_password = var.neo4j_password
qdrant_url = var.qdrant_url
kafka_bootstrap = var.kafka_bootstrap
use_neo4j = "true"
use_qdrant = "true"
use_kafka = var.use_kafka ? "true" : "false"
})
# Optional volumes for persistent data
dynamic "volume_ids" {
for_each = var.attach_volumes ? [1] : []
content {
volume_ids = [digitalocean_volume.data[count.index].id]
}
}
# Tags
tags = concat(
[var.project_name, var.environment, "app-server"],
var.custom_tags
)
}
# Optional persistent volumes
resource "digitalocean_volume" "data" {
count = var.attach_volumes ? var.droplet_count : 0
region = var.region
name = "${var.project_name}-data-${var.environment}-${count.index + 1}"
size = var.volume_size
initial_filesystem_type = "ext4"
description = "Data volume for ${var.project_name} app server ${count.index + 1}"
tags = [var.project_name, var.environment, "data-volume"]
}
# Outputs
output "droplet_ids" {
value = digitalocean_droplet.app_server[*].id
}
output "droplet_urns" {
value = digitalocean_droplet.app_server[*].urn
}
output "droplet_ips" {
value = digitalocean_droplet.app_server[*].ipv4_address
}
output "droplet_private_ips" {
value = digitalocean_droplet.app_server[*].ipv4_address_private
}
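The module above uses several variables (enable_backups, attach_volumes, volume_size, custom_tags, plus the application settings) whose declarations aren't reproduced here. A minimal modules/droplet/variables.tf sketch, with assumed defaults, would look something like this:
# modules/droplet/variables.tf (sketch; defaults are assumptions)
variable "project_name"  { type = string }
variable "environment"   { type = string }
variable "region"        { type = string }
variable "droplet_count" { type = number }
variable "droplet_size"  { type = string }
variable "ssh_key_ids"   { type = list(string) }
variable "vpc_uuid"      { type = string }

variable "google_api_key" {
  type      = string
  sensitive = true
}

variable "neo4j_uri" { type = string }

variable "neo4j_password" {
  type      = string
  sensitive = true
}

variable "qdrant_url"      { type = string }
variable "kafka_bootstrap" { type = string }
variable "use_kafka"       { type = bool }

variable "enable_backups" {
  type    = bool
  default = true
}

variable "attach_volumes" {
  type    = bool
  default = false
}

variable "volume_size" {
  type    = number
  default = 50 # GiB
}

variable "custom_tags" {
  type    = list(string)
  default = []
}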
Step 5: Server Initialization Script
scripts/init_app_server.sh - Automated setup:
#!/bin/bash
set -e
echo "=== ResearcherAI Server Initialization ==="
# Update system
apt-get update
apt-get upgrade -y
# Install Docker
echo "Installing Docker..."
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
systemctl enable docker
systemctl start docker
# Install Docker Compose
echo "Installing Docker Compose..."
curl -L "https://github.com/docker/compose/releases/download/v2.20.0/docker-compose-$(uname -s)-$(uname -m)" \
-o /usr/local/bin/docker-compose
chmod +x /usr/local/bin/docker-compose
# Install dependencies
apt-get install -y git python3-pip nginx
# Create application directory
mkdir -p /opt/researcherai
cd /opt/researcherai
# Create environment file
cat > /opt/researcherai/.env <<'EOF'
# API Keys
GOOGLE_API_KEY=${google_api_key}
# Neo4j Configuration
USE_NEO4J=${use_neo4j}
NEO4J_URI=${neo4j_uri}
NEO4J_USER=neo4j
NEO4J_PASSWORD=${neo4j_password}
# Qdrant Configuration
USE_QDRANT=${use_qdrant}
QDRANT_HOST=${qdrant_url}
# Kafka Configuration
USE_KAFKA=${use_kafka}
KAFKA_BOOTSTRAP_SERVERS=${kafka_bootstrap}
# Application Settings
ENVIRONMENT=production
LOG_LEVEL=INFO
EOF
# Setup docker-compose as systemd service
cat > /etc/systemd/system/researcherai.service <<'EOF'
[Unit]
Description=ResearcherAI Application
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
RemainAfterExit=yes
WorkingDirectory=/opt/researcherai
ExecStart=/usr/local/bin/docker-compose up -d
ExecStop=/usr/local/bin/docker-compose down
TimeoutStartSec=0
[Install]
WantedBy=multi-user.target
EOF
# Setup Nginx reverse proxy
cat > /etc/nginx/sites-available/researcherai <<'EOF'
server {
listen 8000;
server_name _;
location / {
proxy_pass http://localhost:8001;
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
}
location /health {
proxy_pass http://localhost:8001/health;
access_log off;
}
}
EOF
ln -sf /etc/nginx/sites-available/researcherai /etc/nginx/sites-enabled/
rm -f /etc/nginx/sites-enabled/default
nginx -t && systemctl restart nginx
# Enable and start service
systemctl daemon-reload
systemctl enable researcherai.service
echo "=== Server initialization complete ==="
echo "Application directory: /opt/researcherai"
echo "Next steps:"
echo "1. Clone your repository to /opt/researcherai"
echo "2. Start services: systemctl start researcherai"
Step 6: Database Module
modules/database/main.tf - Managed PostgreSQL:
resource "digitalocean_database_cluster" "postgres" {
name = "${var.project_name}-db-${var.environment}"
engine = "pg"
version = "15"
size = var.db_size
region = var.region
node_count = var.db_node_count
# Private network only
private_network_uuid = var.vpc_uuid
maintenance_window {
day = "sunday"
hour = "02:00:00"
}
tags = [var.project_name, var.environment, "database"]
}
# Default database
resource "digitalocean_database_db" "app_metadata" {
cluster_id = digitalocean_database_cluster.postgres.id
name = "app_metadata"
}
# Default user
resource "digitalocean_database_user" "app_user" {
cluster_id = digitalocean_database_cluster.postgres.id
name = "app_user"
}
# Firewall - allow from VPC only
resource "digitalocean_database_firewall" "postgres_fw" {
cluster_id = digitalocean_database_cluster.postgres.id
rule {
type = "tag"
value = var.project_name
}
}
# Outputs
output "db_urn" {
value = digitalocean_database_cluster.postgres.urn
}
output "db_host" {
value = digitalocean_database_cluster.postgres.private_host
sensitive = true
}
output "db_port" {
value = digitalocean_database_cluster.postgres.port
}
output "db_name" {
value = digitalocean_database_db.app_metadata.name
}
output "db_user" {
value = digitalocean_database_user.app_user.name
sensitive = true
}
output "db_password" {
value = digitalocean_database_user.app_user.password
sensitive = true
}
output "connection_string" {
value = digitalocean_database_cluster.postgres.private_uri
sensitive = true
}
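One thing to note: main.tf as written never hands these PostgreSQL outputs to the application servers. If the app needs them in its .env, you could thread them through the droplet module as an extra input (database_url below is a hypothetical variable, not one the chapter defines):
# Sketch in main.tf: pass the connection string into the droplet module
module "droplet" {
  source = "./modules/droplet"
  # ...existing arguments...

  database_url = module.database.connection_string # hypothetical variable

  depends_on = [module.vpc, module.database]
}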
Step 7: Kafka Module
modules/kafka/main.tf - Managed Kafka cluster:
resource "digitalocean_database_cluster" "kafka" {
name = "${var.project_name}-kafka-${var.environment}"
engine = "kafka"
version = "3.5"
size = var.kafka_size
region = var.region
node_count = var.kafka_nodes # Minimum 3 for HA
# Private network only
private_network_uuid = var.vpc_uuid
maintenance_window {
day = "sunday"
hour = "03:00:00"
}
tags = [var.project_name, var.environment, "kafka"]
}
# Default Kafka user
resource "digitalocean_database_user" "kafka_app_user" {
cluster_id = digitalocean_database_cluster.kafka.id
name = "kafka_app_user"
}
# Kafka topics (managed by Terraform; more topics can be added the same way)
resource "digitalocean_database_kafka_topic" "events" {
cluster_id = digitalocean_database_cluster.kafka.id
name = "query.submitted"
partition_count = 3
replication_factor = 3
config {
cleanup_policy = "delete"
compression_type = "producer"
delete_retention_ms = 86400000 # 1 day
file_delete_delay_ms = 60000
flush_messages = 9223372036854775807
flush_ms = 9223372036854775807
index_interval_bytes = 4096
max_compaction_lag_ms = 9223372036854775807
max_message_bytes = 1048588
message_down_conversion_enable = true
message_format_version = "3.0-IV1"
message_timestamp_difference_max_ms = 9223372036854775807
message_timestamp_type = "create_time"
min_cleanable_dirty_ratio = 0.5
min_compaction_lag_ms = 0
min_insync_replicas = 2
preallocate = false
retention_bytes = -1
retention_ms = 604800000 # 7 days
segment_bytes = 1073741824
segment_index_bytes = 10485760
segment_jitter_ms = 0
segment_ms = 604800000
}
}
# Firewall - allow from VPC only
resource "digitalocean_database_firewall" "kafka_fw" {
cluster_id = digitalocean_database_cluster.kafka.id
rule {
type = "tag"
value = var.project_name
}
}
# Outputs
output "kafka_urn" {
value = digitalocean_database_cluster.kafka.urn
}
output "bootstrap_servers" {
value = digitalocean_database_cluster.kafka.private_host
sensitive = true
}
output "kafka_port" {
value = digitalocean_database_cluster.kafka.port
}
output "kafka_username" {
value = digitalocean_database_user.kafka_app_user.name
sensitive = true
}
output "kafka_password" {
value = digitalocean_database_user.kafka_app_user.password
sensitive = true
}
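The module creates a single topic, but ResearcherAI will likely need several. One convenient pattern is to drive digitalocean_database_kafka_topic with for_each; the topic names below are illustrative:
# Sketch: several topics from one resource block (names are illustrative)
variable "kafka_topics" {
  type    = set(string)
  default = ["query.submitted", "query.completed", "document.ingested"]
}

resource "digitalocean_database_kafka_topic" "topics" {
  for_each = var.kafka_topics

  cluster_id         = digitalocean_database_cluster.kafka.id
  name               = each.value
  partition_count    = 3
  replication_factor = 3
}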
Step 8: Load Balancer Module
modules/loadbalancer/main.tf - Traffic distribution:
resource "digitalocean_loadbalancer" "main" {
name = "${var.project_name}-lb-${var.environment}"
region = var.region
# VPC
vpc_uuid = var.vpc_uuid
# HTTP forwarding
forwarding_rule {
entry_protocol = "http"
entry_port = 80
target_protocol = "http"
target_port = 8000
}
# HTTPS forwarding (if you have SSL certificate)
forwarding_rule {
entry_protocol = "https"
entry_port = 443
target_protocol = "http"
target_port = 8000
# Optional: Add your SSL certificate ID
# certificate_id = var.ssl_certificate_id
}
# Health check
healthcheck {
protocol = "http"
port = 8000
path = "/health"
check_interval_seconds = 10
response_timeout_seconds = 5
unhealthy_threshold = 3
healthy_threshold = 3
}
# Sticky sessions
sticky_sessions {
type = "cookies"
cookie_name = "lb-session"
cookie_ttl_seconds = 3600
}
# Droplet IDs
droplet_ids = var.droplet_ids
# Alternatively, target droplets by tag instead of by ID
# (droplet_tag conflicts with droplet_ids; use one or the other)
# droplet_tag = "${var.project_name}-${var.environment}"
# Algorithm
algorithm = "round_robin"
}
# Outputs
output "lb_id" {
value = digitalocean_loadbalancer.main.id
}
output "lb_urn" {
value = digitalocean_loadbalancer.main.urn
}
output "lb_ip" {
value = digitalocean_loadbalancer.main.ip
}
output "lb_url" {
value = "http://${digitalocean_loadbalancer.main.ip}"
}
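The HTTPS forwarding rule only becomes useful once a certificate is attached. DigitalOcean can issue a managed Let's Encrypt certificate; here's a sketch with a placeholder domain (the domain's DNS must already be managed by DigitalOcean for issuance to succeed):
# Sketch: managed Let's Encrypt certificate (placeholder domain)
resource "digitalocean_certificate" "app" {
  name    = "${var.project_name}-cert-${var.environment}"
  type    = "lets_encrypt"
  domains = ["app.example.com"]
}

# Reference it from the HTTPS forwarding rule, e.g.:
#   certificate_name = digitalocean_certificate.app.name
# (older provider versions used certificate_id instead)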
Step 9: Firewall Module
modules/firewall/main.tf - Security rules:
resource "digitalocean_firewall" "app_firewall" {
name = "${var.project_name}-firewall-${var.environment}"
# Apply to all application servers
droplet_ids = var.droplet_ids
# Inbound rules
# Allow HTTP/HTTPS from load balancer
inbound_rule {
protocol = "tcp"
port_range = "80"
source_load_balancer_uids = [var.loadbalancer_id]
}
inbound_rule {
protocol = "tcp"
port_range = "443"
source_load_balancer_uids = [var.loadbalancer_id]
}
# Allow app port from load balancer
inbound_rule {
protocol = "tcp"
port_range = "8000"
source_load_balancer_uids = [var.loadbalancer_id]
}
# Allow SSH from specific IPs
inbound_rule {
protocol = "tcp"
port_range = "22"
source_addresses = var.allowed_ssh_ips
}
# Allow all from VPC (internal communication)
inbound_rule {
protocol = "tcp"
port_range = "1-65535"
source_addresses = [var.vpc_cidr]
}
inbound_rule {
protocol = "udp"
port_range = "1-65535"
source_addresses = [var.vpc_cidr]
}
# Outbound rules
# Allow all TCP (for internet access)
outbound_rule {
protocol = "tcp"
port_range = "1-65535"
destination_addresses = ["0.0.0.0/0", "::/0"]
}
# Allow all UDP
outbound_rule {
protocol = "udp"
port_range = "1-65535"
destination_addresses = ["0.0.0.0/0", "::/0"]
}
# Allow ICMP
outbound_rule {
protocol = "icmp"
destination_addresses = ["0.0.0.0/0", "::/0"]
}
}
Step 10: Outputs
outputs.tf - Information after deployment:
# VPC Information
output "vpc_id" {
description = "VPC ID"
value = module.vpc.vpc_id
}
# Server Information
output "app_server_ips" {
description = "Application server public IPs"
value = module.droplet.droplet_ips
}
output "app_server_private_ips" {
description = "Application server private IPs"
value = module.droplet.droplet_private_ips
}
# Load Balancer
output "load_balancer_ip" {
description = "Load balancer public IP"
value = module.loadbalancer.lb_ip
}
output "application_url" {
description = "Application URL"
value = module.loadbalancer.lb_url
}
# Database (marked sensitive)
output "database_host" {
description = "Database host"
value = module.database.db_host
sensitive = true
}
output "database_connection_string" {
description = "Database connection string"
value = module.database.connection_string
sensitive = true
}
# Kafka (if enabled)
output "kafka_bootstrap_servers" {
description = "Kafka bootstrap servers"
value = var.use_managed_kafka ? module.kafka[0].bootstrap_servers : "Using Docker Kafka"
sensitive = true
}
# SSH Commands
output "ssh_commands" {
description = "SSH commands for each server"
value = [
for ip in module.droplet.droplet_ips :
"ssh root@${ip}"
]
}
# Deployment Information
output "deployment_info" {
description = "Deployment summary"
value = {
environment = var.environment
region = var.region
server_count = var.droplet_count
server_size = var.droplet_size
database = "PostgreSQL ${var.db_size}"
kafka_enabled = var.use_managed_kafka
load_balancer = module.loadbalancer.lb_ip
}
}
# Estimated Monthly Cost
output "estimated_monthly_cost" {
description = "Estimated monthly cost in USD"
value = format("$%.2f",
var.droplet_count * (
var.droplet_size == "s-1vcpu-2gb" ? 18 :
var.droplet_size == "s-2vcpu-4gb" ? 36 :
var.droplet_size == "s-4vcpu-8gb" ? 72 : 144
) +
(var.db_size == "db-s-1vcpu-1gb" ? 15 :
var.db_size == "db-s-2vcpu-4gb" ? 60 : 120) +
(var.use_managed_kafka ? var.kafka_node_count * 30 : 0) +
10 # Load balancer
)
}
# Next Steps
output "next_steps" {
description = "Next steps after deployment"
value = <<-EOT
Deployment Complete! 🎉
1. Access your servers:
${join("\n ", [for ip in module.droplet.droplet_ips : "ssh root@${ip}"])}
2. Clone your repository:
cd /opt/researcherai
git clone https://github.com/Sebuliba-Adrian/ResearcherAI.git .
3. Start the application:
systemctl start researcherai
4. Check logs:
journalctl -u researcherai -f
5. Access the application:
${module.loadbalancer.lb_url}
6. Monitor resources:
doctl compute droplet list
doctl databases list
Estimated monthly cost: ${format("$%.2f",
var.droplet_count * (
var.droplet_size == "s-1vcpu-2gb" ? 18 :
var.droplet_size == "s-2vcpu-4gb" ? 36 :
var.droplet_size == "s-4vcpu-8gb" ? 72 : 144
) +
(var.db_size == "db-s-1vcpu-1gb" ? 15 :
var.db_size == "db-s-2vcpu-4gb" ? 60 : 120) +
(var.use_managed_kafka ? var.kafka_node_count * 30 : 0) +
10
)}
EOT
}
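Notice that the cost formula is written out twice, once in estimated_monthly_cost and again inside next_steps. A locals block can compute it once so both outputs stay in sync; a sketch:
# Sketch: compute the estimate once and reuse it in both outputs
locals {
  droplet_monthly = (
    var.droplet_size == "s-1vcpu-2gb" ? 18 :
    var.droplet_size == "s-2vcpu-4gb" ? 36 :
    var.droplet_size == "s-4vcpu-8gb" ? 72 : 144
  )
  db_monthly = (
    var.db_size == "db-s-1vcpu-1gb" ? 15 :
    var.db_size == "db-s-2vcpu-4gb" ? 60 : 120
  )
  monthly_cost = (
    var.droplet_count * local.droplet_monthly +
    local.db_monthly +
    (var.use_managed_kafka ? var.kafka_node_count * 30 : 0) +
    10 # load balancer
  )
}

# Both outputs can then use: format("$%.2f", local.monthly_cost)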
Deploying with Terraform
Step 1: Prerequisites
# Install Terraform
wget https://releases.hashicorp.com/terraform/1.6.0/terraform_1.6.0_linux_amd64.zip
unzip terraform_1.6.0_linux_amd64.zip
sudo mv terraform /usr/local/bin/
terraform version
# Install DigitalOcean CLI (optional)
curl -sL https://github.com/digitalocean/doctl/releases/download/v1.98.1/doctl-1.98.1-linux-amd64.tar.gz | tar -xzv
sudo mv doctl /usr/local/bin/
# Authenticate doctl
doctl auth init
Step 2: Get DigitalOcean API Token
# Create API token at:
# https://cloud.digitalocean.com/account/api/tokens
# Save to environment variable
export DO_TOKEN="your-digitalocean-api-token-here"
Step 3: Add SSH Key
# Generate SSH key if you don't have one
ssh-keygen -t rsa -b 4096 -C "sebuliba.adrian@gmail.com"
# Add to DigitalOcean
doctl compute ssh-key import my-key --public-key-file ~/.ssh/id_rsa.pub
# Get SSH key ID
doctl compute ssh-key list
Step 4: Create Configuration
cd terraform/
# Copy example configuration
cp terraform.tfvars.example terraform.tfvars
# Edit with your values
nano terraform.tfvars
terraform.tfvars:
# DigitalOcean Configuration
do_token = "dop_v1_your_token_here"
# Project Settings
project_name = "researcherai"
environment = "production"
region = "nyc3"
# Server Configuration
droplet_count = 2
droplet_size = "s-2vcpu-4gb" # $36/month per server
ssh_key_ids = ["12345678"] # Your SSH key ID
# Database Configuration
db_size = "db-s-2vcpu-4gb" # $60/month
db_node_count = 1
# Kafka Configuration
use_managed_kafka = true
kafka_size = "db-s-2vcpu-2gb" # $30/month per node
kafka_node_count = 3 # $90/month total
# Application Secrets
google_api_key = "AIzaSy..."
neo4j_password = "secure-password-here"
# Security
allowed_ssh_ips = ["your.ip.address.here/32"] # Your IP only
# Features
enable_monitoring = true
enable_backups = true
Step 5: Initialize Terraform
# Initialize (downloads providers)
terraform init
# Output:
# Initializing modules...
# Initializing the backend...
# Initializing provider plugins...
# - Finding digitalocean/digitalocean versions matching "~> 2.0"...
# - Installing digitalocean/digitalocean v2.32.0...
# Terraform has been successfully initialized!
Step 6: Plan Deployment
# See what will be created
terraform plan
# Output shows all resources:
# Terraform will perform the following actions:
#
# # module.droplet.digitalocean_droplet.app_server[0] will be created
# + resource "digitalocean_droplet" "app_server" {
# + name = "researcherai-app-production-1"
# + region = "nyc3"
# + size = "s-2vcpu-4gb"
# ...
# }
#
# Plan: 15 to add, 0 to change, 0 to destroy.
Step 7: Apply Configuration
# Deploy infrastructure
terraform apply
# Review plan and type 'yes' to confirm
# Deployment takes 5-10 minutes
# Watch progress in terminal
Step 8: Get Outputs
# View all outputs
terraform output
# Example output:
# app_server_ips = [
# "167.99.123.45",
# "167.99.123.46",
# ]
# application_url = "http://174.138.45.67"
# estimated_monthly_cost = "$232.00"
# load_balancer_ip = "174.138.45.67"
# Get specific output
terraform output application_url
# Get sensitive output
terraform output -json database_connection_string
Step 9: SSH to Servers
# Get SSH commands
terraform output -json ssh_commands
# SSH to first server
ssh root@$(terraform output -json app_server_ips | jq -r '.[0]')
# Check application status
cd /opt/researcherai
systemctl status researcherai
docker-compose ps
Step 10: Deploy Application
# On each server:
cd /opt/researcherai
# Clone your repository
git clone https://github.com/Sebuliba-Adrian/ResearcherAI.git .
# Start services
systemctl start researcherai
# Check logs
journalctl -u researcherai -f
# Or with docker-compose directly
docker-compose logs -f
Multi-Environment Setup
Directory Structure
terraform/
├── environments/
│ ├── dev/
│ │ └── terraform.tfvars
│ ├── staging/
│ │ └── terraform.tfvars
│ └── production/
│ └── terraform.tfvars
Development Environment
environments/dev/terraform.tfvars:
project_name = "researcherai"
environment = "dev"
region = "nyc3"
# Minimal resources for dev
droplet_count = 1
droplet_size = "s-1vcpu-2gb" # $18/month
db_size = "db-s-1vcpu-1gb" # $15/month
db_node_count = 1
# Use Docker Kafka instead of managed
use_managed_kafka = false
# Dev secrets
google_api_key = "AIzaSy...dev-key"
neo4j_password = "dev-password"
# Allow SSH from anywhere in dev
allowed_ssh_ips = ["0.0.0.0/0"]
# Disable backups in dev
enable_backups = false
Deploy Development
cd terraform/
# Use dev configuration
terraform workspace new dev       # creates and switches to the workspace
terraform workspace select dev    # only needed if the workspace already exists
# Plan with dev vars
terraform plan -var-file=environments/dev/terraform.tfvars
# Apply
terraform apply -var-file=environments/dev/terraform.tfvars
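Workspaces also expose their name inside the configuration as terraform.workspace, which can stand in for a hand-maintained environment value if you prefer (a sketch, not how this chapter's variables.tf is written):
# Sketch: derive the environment from the active workspace
locals {
  environment = terraform.workspace # "dev", "staging", or "production"
}

# e.g. name = "${var.project_name}-app-${local.environment}"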
Production Environment
environments/production/terraform.tfvars:
project_name = "researcherai"
environment = "production"
region = "nyc3"
# Production resources
droplet_count = 4
droplet_size = "s-4vcpu-8gb" # $72/month × 4 = $288/month
db_size = "db-s-4vcpu-8gb" # $120/month
db_node_count = 2 # Standby for HA
# Managed Kafka with HA
use_managed_kafka = true
kafka_size = "db-s-2vcpu-2gb"
kafka_node_count = 3 # $90/month
# Production secrets: do NOT put them here. A .tfvars file is not expanded
# by the shell, so "${GOOGLE_API_KEY}" would be passed through literally.
# Supply them as environment variables instead (see the deploy step below):
#   export TF_VAR_google_api_key=...
#   export TF_VAR_neo4j_password=...
# Restrict SSH to office/VPN
allowed_ssh_ips = ["203.0.113.0/24"]
# Enable all production features
enable_monitoring = true
enable_backups = true
Deploy Production
# Use production workspace
terraform workspace new production
terraform workspace select production
# Store secrets in environment
export TF_VAR_google_api_key="your-prod-key"
export TF_VAR_neo4j_password="secure-prod-password"
# Plan
terraform plan -var-file=environments/production/terraform.tfvars
# Apply with extra caution
terraform apply -var-file=environments/production/terraform.tfvars
Managing Infrastructure
Update Infrastructure
# Make changes to .tf files or terraform.tfvars
# Preview changes
terraform plan
# Apply changes
terraform apply
# Terraform will:
# - Create new resources (green +)
# - Update existing resources (yellow ~)
# - Destroy/recreate resources (red -/+)
Scale Up/Down
# Edit terraform.tfvars
droplet_count = 4 # Increase from 2 to 4
# Apply changes
terraform apply
# Terraform will create 2 new droplets
# Load balancer automatically includes them
Upgrade Server Size
# Edit terraform.tfvars
droplet_size = "s-4vcpu-8gb" # Upgrade from s-2vcpu-4gb
# Apply (WARNING: Recreates droplets!)
terraform apply
# For zero-downtime:
# 1. Increase droplet_count first
# 2. Change droplet_size
# 3. Wait for new servers
# 4. Decrease droplet_count
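An alternative to the four-step dance above is a create_before_destroy lifecycle on the droplet resource, so each replacement server comes up before its predecessor is destroyed (a sketch; it still loses anything stored only on the old droplet's local disk):
# Sketch: in modules/droplet/main.tf
resource "digitalocean_droplet" "app_server" {
  # ...existing arguments...

  lifecycle {
    create_before_destroy = true
  }
}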
View State
# List all resources
terraform state list
# Show specific resource
terraform state show module.droplet.digitalocean_droplet.app_server[0]
# Show all outputs
terraform output
Destroy Infrastructure
# Preview destruction
terraform plan -destroy
# Destroy everything (DANGEROUS!)
terraform destroy
# Destroy specific resource
terraform destroy -target=module.droplet.digitalocean_droplet.app_server[0]
Cost Management
Cost Breakdown
Default Production Configuration:
2× Application Servers (s-2vcpu-4gb) $72/month
1× PostgreSQL Database (db-s-2vcpu-4gb) $60/month
3× Kafka Nodes (db-s-2vcpu-2gb) $90/month
1× Load Balancer $10/month
────────────────────────────────────────────────────
Total $232/month
Budget Options:
Development ($43/month):
- 1× s-1vcpu-2gb droplet $18
- 1× db-s-1vcpu-1gb database $15
- Kafka on Docker $0
- Load balancer $10
Production ($232/month):
- 2× s-2vcpu-4gb droplets $72
- 1× db-s-2vcpu-4gb database $60
- 3× Kafka nodes $90
- Load balancer $10
High-Traffic ($628/month):
- 4× s-4vcpu-8gb droplets $288
- 2× db-s-4vcpu-8gb database $240
- 3× Kafka nodes $90
- Load balancer $10
Cost Optimization Tips
- Use development configs for dev/staging
- Enable auto-shutdown for non-production (manual via API)
- Use reserved or committed-use pricing where available (DigitalOcean doesn't offer it; AWS and GCP do)
- Monitor with alerts to catch runaway costs (see the sketch after this list)
- Right-size resources based on actual usage
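For the alerting tip, the DigitalOcean provider offers a digitalocean_monitor_alert resource. A sketch that flags sustained high CPU on the app servers (the email address is a placeholder), which is also a useful signal for right-sizing:
# Sketch: CPU alert on the app servers (placeholder email)
resource "digitalocean_monitor_alert" "cpu_high" {
  alerts {
    email = ["ops@example.com"]
  }
  window      = "10m"
  type        = "v1/insights/droplet/cpu"
  compare     = "GreaterThan"
  value       = 80
  enabled     = true
  entities    = module.droplet.droplet_ids
  description = "App server CPU above 80% for 10 minutes"
}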
Monitoring Costs
# DigitalOcean CLI
doctl balance get
# Monthly usage
doctl invoice list
# Resource inventory (cross-reference sizes against the pricing above)
doctl compute droplet list
doctl databases list
Backup and Disaster Recovery
Automated Backups
# Enable in terraform.tfvars
enable_backups = true
# Droplets: Weekly backups, 4 weeks retention
# Databases: Daily backups, 7 days retention
Manual Snapshots
# Create droplet snapshot
doctl compute droplet-action snapshot <droplet-id> --snapshot-name "before-upgrade"
# List managed-database backups (DigitalOcean takes them automatically each day)
doctl databases backups list <database-id>
Disaster Recovery Plan
# 1. Store Terraform state safely (S3/Spaces)
terraform {
backend "s3" {
endpoint = "nyc3.digitaloceanspaces.com"
bucket = "researcherai-terraform-state"
key = "production/terraform.tfstate"
region = "us-east-1"
skip_credentials_validation = true
}
}
# 2. Backup secrets separately (password manager)
# 3. Document recovery procedure
# If complete disaster:
# - Clone Git repository
# - Configure terraform.tfvars with secrets
# - Run: terraform init -backend-config=backend.tfvars
# - Run: terraform apply
# - Restore database from backup
# - Deploy application code
Troubleshooting
Terraform Errors
Error: Invalid provider configuration
# Solution: Check API token
doctl auth list
# Reinitialize
rm -rf .terraform/
terraform init
Error: Resource already exists
# Solution: Import existing resource
terraform import module.droplet.digitalocean_droplet.app_server[0] <droplet-id>
Error: State lock
# Solution: Force unlock (if stuck)
terraform force-unlock <lock-id>
Server Connection Issues
# Can't SSH to droplet
# 1. Check firewall rules
doctl compute firewall list
# 2. Check droplet status
doctl compute droplet get <droplet-id>
# 3. Check SSH key
doctl compute ssh-key list
# 4. Use recovery console
# Login to DigitalOcean web console
Application Not Starting
# SSH to server
ssh root@<server-ip>
# Check systemd service
systemctl status researcherai
journalctl -u researcherai -n 100
# Check Docker
docker ps
docker-compose logs
# Check environment file
cat /opt/researcherai/.env
# Manual start
cd /opt/researcherai
docker-compose up -d
Security Best Practices
1. Secrets Management
# Never commit secrets to Git!
echo "terraform.tfvars" >> .gitignore
echo "*.tfvars" >> .gitignore
# Use environment variables
export TF_VAR_google_api_key="..."
export TF_VAR_do_token="..."
# Or use secret management service
# - HashiCorp Vault
# - AWS Secrets Manager
# - DigitalOcean Spaces with encryption
2. SSH Key Rotation
# Generate new SSH key
ssh-keygen -t ed25519 -f ~/.ssh/researcherai-prod
# Add to DigitalOcean
doctl compute ssh-key import prod-key-2024 \
--public-key-file ~/.ssh/researcherai-prod.pub
# Update terraform.tfvars
ssh_key_ids = ["12345678", "87654321"]
# Apply
terraform apply
# Remove old key after verification
3. Firewall Rules
# Restrict SSH to specific IPs
allowed_ssh_ips = [
"203.0.113.10/32", # Office IP
"198.51.100.50/32", # VPN IP
]
# Monitor firewall logs
doctl compute firewall list
4. Database Security
# Always use private network
private_network_uuid = var.vpc_uuid
# Restrict access with firewall
resource "digitalocean_database_firewall" "postgres_fw" {
cluster_id = digitalocean_database_cluster.postgres.id
rule {
type = "tag"
value = var.project_name
}
}
# Rotate database passwords regularly
Next Steps
Congratulations! You now have:
- ✅ Complete infrastructure as code with Terraform
- ✅ Modular, reusable Terraform modules
- ✅ Multi-environment support (dev/staging/prod)
- ✅ Automated server initialization
- ✅ Load balancing and high availability
- ✅ Managed databases (PostgreSQL, Kafka)
- ✅ Network security with firewall rules
- ✅ Cost optimization strategies
In the next chapter, we'll cover Observability & Monitoring including:
- LangFuse and LangSmith for LLM observability
- Prometheus and Grafana for metrics
- Distributed tracing with Jaeger
- Centralized logging with Loki
- Custom dashboards and alerts
Let's ensure we can see what's happening in production!