Planning & Requirements
Every great project starts with a problem worth solving. Let me take you back to where this all began.
The Problem
It was late October 2024, and I was drowning in research papers. I was trying to stay current with advances in transformer architectures, RAG systems, and multi-agent frameworks - but new papers were being published faster than I could read them.
I'd spend hours:
- Searching across arXiv, Semantic Scholar, PubMed
- Copying paper titles and abstracts into notes
- Trying to remember which paper mentioned what concept
- Re-reading papers because I forgot their key insights
There had to be a better way.
The Vision
I imagined a research assistant that could:
- Automatically collect papers from multiple sources based on my interests
- Build a knowledge graph showing how papers, authors, and concepts relate
- Answer my questions by synthesizing information across papers
- Remember our conversations so I don't have to repeat context
- Scale from my laptop to production without rewriting code
But I didn't want to build just another prototype. I wanted something that demonstrated production-grade patterns I could use in real applications.
Core Requirements
I broke this down into functional and non-functional requirements.
Functional Requirements
Data Collection
- Support multiple data sources (academic databases, web search)
- Automatic deduplication of papers
- Scheduled/automated collection in background
- Rate limiting and error handling
Knowledge Organization
- Knowledge graph with entities (papers, authors, topics)
- Relationships (authored, cites, is_about)
- Vector embeddings for semantic search
- Graph visualization
Query & Reasoning
- Natural language question answering
- Multi-hop reasoning across papers
- Source citation with paper references
- Conversation memory (5-turn history)
Session Management
- Multiple research sessions
- Save/load session state
- Session statistics and metadata
User Interface
- Modern, responsive web interface
- Data collection controls
- Interactive graph visualization
- Chat interface for queries
Non-Functional Requirements
Performance
- Data collection: < 2 minutes for 10+ papers
- Query answering: < 5 seconds
- Graph queries: < 100ms
- Vector search: < 100ms
Reliability
- 99% uptime for production deployment
- Graceful degradation if services unavailable
- Automatic retries with exponential backoff
- Circuit breakers for external APIs
Scalability
- Support 1000+ papers in knowledge base
- Handle concurrent users in production
- Horizontal scaling with Kafka
- Efficient caching to reduce costs
Maintainability
- 90%+ test coverage
- Type safety with TypeScript/Pydantic
- Clear separation of concerns
- Comprehensive documentation
Cost Efficiency
- Intelligent model selection (use cheapest model for each task)
- Caching to avoid redundant API calls
- Token budgets to prevent runaway costs
- Target: < $10/month for moderate usage
Choosing the Tech Stack
This was one of the most important decisions. I needed technologies that were:
- Mature enough for production
- Well-documented so I could move fast
- Composable so I could swap components
- Cost-effective to run
Here's how I evaluated each component:
LLM Provider
Candidates: OpenAI GPT-4, Anthropic Claude, Google Gemini
Winner: Google Gemini 2.0 Flash
Why?
- Fast response times (< 2s average)
- Cost-effective ($0.35 per 1M tokens)
- Large context window (1M tokens)
- Good reasoning capabilities
- Free tier for development
I experimented with all three, and Gemini gave the best balance of speed, cost, and quality for research Q&A.
Orchestration Framework
Candidates: LangChain, LangGraph, Custom
Winner: LangGraph
Why?
- Built for multi-agent workflows
- Excellent state management
- Visual workflow debugging
- Works seamlessly with LlamaIndex
- Good documentation and examples
LangChain was too linear for my multi-agent pattern. LangGraph gave me the graph-based orchestration I needed.
RAG Framework
Candidates: LlamaIndex, Haystack, Custom
Winner: LlamaIndex
Why?
- Best-in-class retrieval strategies
- Flexible architecture
- Great integration with vector DBs
- Built-in evaluation tools
- Active community
LlamaIndex saved me weeks of work on chunking strategies, embedding management, and retrieval optimization.
Knowledge Graph
Candidates: Neo4j, NetworkX, TigerGraph
Winner: Both Neo4j AND NetworkX (dual backend)
Why?
- Neo4j for production (persistent, scalable, visual)
- NetworkX for development (fast startup, no infrastructure)
- Same API for both (abstraction layer)
- Easy switching via environment variables
This was a game-changer. I could develop and test on my laptop with NetworkX, then deploy to production with Neo4j without changing code.
Vector Database
Candidates: Pinecone, Weaviate, Qdrant, FAISS
Winner: Both Qdrant AND FAISS (dual backend)
Why?
- Qdrant for production (persistent, REST API, dashboard)
- FAISS for development (in-memory, no setup)
- Same abstraction layer
- Cost: $0 (self-hosted Qdrant)
Again, dual backends gave me the flexibility to move fast in development and scale in production.
Event Streaming
Candidates: Kafka, RabbitMQ, Redis Streams
Winner: Kafka (optional)
Why?
- Industry standard for event-driven systems
- Event persistence and replay
- Horizontal scaling with consumer groups
- Rich ecosystem (Kafka UI, connectors)
- Optional: falls back to sync mode if unavailable
I made Kafka optional because it's overkill for development but essential for production scalability.
ETL Orchestration
Candidates: Airflow, Prefect, Dagster
Winner: Apache Airflow
Why?
- Industry standard for data pipelines
- Visual DAG editor and monitoring
- Automatic retries and error handling
- Scalable with Celery workers
- Rich integrations
Airflow gave me 3-4x faster data collection through parallel execution and automatic retries.
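To give a sense of what parallel collection looks like in Airflow, here's a minimal sketch of such a DAG. The DAG id, task names, and the collect_from_source helper are illustrative placeholders, not the project's actual pipeline.
```python
# Hypothetical sketch of a parallel paper-collection DAG; names are illustrative.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def collect_from_source(source: str) -> None:
    """Placeholder for a per-source collection task (arXiv, PubMed, ...)."""
    print(f"Collecting papers from {source}")

with DAG(
    dag_id="paper_collection",
    start_date=datetime(2024, 10, 1),
    schedule="@daily",   # Airflow 2.4+ keyword
    catchup=False,
) as dag:
    # One task per source; Airflow runs them in parallel and retries on failure.
    tasks = [
        PythonOperator(
            task_id=f"collect_{source}",
            python_callable=collect_from_source,
            op_kwargs={"source": source},
            retries=3,
        )
        for source in ["arxiv", "semantic_scholar", "pubmed"]
    ]
```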
Frontend
Candidates: Next.js, Vite+React, SvelteKit
Winner: Vite + React + TypeScript
Why?
- Lightning fast dev server (< 1s HMR)
- React ecosystem and component libraries
- TypeScript for type safety
- Lightweight (no SSR overhead)
- Easy deployment
I didn't need SSR for this app, so Vite's simplicity and speed won.
Architecture Philosophy
I made some key architectural decisions early on:
1. Dual-Backend Strategy
Problem: Setting up Neo4j, Qdrant, and Kafka for development is slow and resource-heavy.
Solution: Abstract backends behind interfaces and provide in-memory alternatives (sketched after the benefits list below).
Benefits:
- Instant startup in development (0s vs 30s)
- Faster test suite (no Docker overhead)
- Lower cloud costs (single container vs 7)
- Easy switching via env vars
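To make the abstraction concrete, here's a minimal sketch of how a graph backend could be hidden behind a common interface and selected via an environment variable. The class names (GraphStore, NetworkXStore, Neo4jStore) and the GRAPH_BACKEND variable are illustrative, not the project's actual API.
```python
# Illustrative sketch of a dual-backend abstraction; names are hypothetical.
import os
from abc import ABC, abstractmethod

class GraphStore(ABC):
    """Common interface both backends implement."""

    @abstractmethod
    def add_edge(self, src: str, rel: str, dst: str) -> None: ...

    @abstractmethod
    def neighbors(self, node: str) -> list[str]: ...

class NetworkXStore(GraphStore):
    """In-memory backend for development and tests."""

    def __init__(self) -> None:
        import networkx as nx
        self.graph = nx.MultiDiGraph()

    def add_edge(self, src: str, rel: str, dst: str) -> None:
        self.graph.add_edge(src, dst, relation=rel)

    def neighbors(self, node: str) -> list[str]:
        return list(self.graph.successors(node))

class Neo4jStore(GraphStore):
    """Persistent backend for production."""

    def __init__(self, uri: str, user: str, password: str) -> None:
        from neo4j import GraphDatabase
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def add_edge(self, src: str, rel: str, dst: str) -> None:
        ...  # MERGE nodes and the relationship via a Cypher query

    def neighbors(self, node: str) -> list[str]:
        ...  # MATCH query returning connected node ids

def make_graph_store() -> GraphStore:
    """Pick the backend from an environment variable."""
    if os.getenv("GRAPH_BACKEND", "networkx") == "neo4j":
        return Neo4jStore(os.environ["NEO4J_URI"],
                          os.environ["NEO4J_USER"],
                          os.environ["NEO4J_PASSWORD"])
    return NetworkXStore()
```
The vector store follows the same idea: one interface, a FAISS implementation for development, a Qdrant implementation for production, and an env var to choose.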
2. Multi-Agent Pattern
Problem: A single monolithic agent becomes complex and hard to test.
Solution: Separate concerns into specialized agents coordinated by an orchestrator (see the sketch after this list).
Benefits:
- Clear separation of concerns
- Easier unit testing
- Can replace individual agents
- Can scale agents independently
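Here's a minimal sketch of the pattern. The agent names mirror the ones in this post, but the interfaces are illustrative, not the project's actual code.
```python
# Minimal sketch of the multi-agent pattern; interfaces are illustrative.
from typing import Protocol

class Agent(Protocol):
    def run(self, payload: dict) -> dict: ...

class DataCollectorAgent:
    def run(self, payload: dict) -> dict:
        # Fetch papers for the query from the configured sources.
        return {**payload, "papers": []}

class KnowledgeGraphAgent:
    def run(self, payload: dict) -> dict:
        # Extract entities/relations from the papers and update the graph.
        return payload

class OrchestratorAgent:
    """Coordinates specialized agents instead of doing everything itself."""

    def __init__(self, agents: list[Agent]) -> None:
        self.agents = agents

    def handle(self, query: str) -> dict:
        payload: dict = {"query": query}
        for agent in self.agents:  # each agent is testable in isolation
            payload = agent.run(payload)
        return payload

orchestrator = OrchestratorAgent([DataCollectorAgent(), KnowledgeGraphAgent()])
result = orchestrator.handle("multi-agent RAG systems")
```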
3. Event-Driven Communication
Problem: Synchronous agent calls create tight coupling and bottlenecks.
Solution: Agents publish events to Kafka; consumers process asynchronously (a sketch follows the benefits list).
Benefits:
- Parallel processing (3x faster)
- Loose coupling
- Event replay for debugging
- Horizontal scaling
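Here's a rough sketch of the publish path, including the sync fallback mentioned earlier for when Kafka isn't running. The topic name, event shape, and handle_event callback are illustrative.
```python
# Hypothetical sketch of event publishing with a sync fallback.
import json

def publish_papers_collected(papers: list[dict], producer=None) -> None:
    event = {"type": "papers.collected", "count": len(papers), "papers": papers}
    if producer is not None:
        # Async path: downstream consumers (graph, vector store) pick this up.
        producer.send("research.events", value=event)
    else:
        # Sync fallback when Kafka is unavailable (e.g. local development).
        handle_event(event)

def handle_event(event: dict) -> None:
    print(f"Processing {event['type']} with {event['count']} papers")

try:
    from kafka import KafkaProducer  # kafka-python
    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    )
except Exception:
    producer = None

publish_papers_collected([{"title": "Attention Is All You Need"}], producer)
```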
4. Production-Grade Patterns
From day one, I implemented patterns that would matter at scale:
Circuit Breakers: Prevent cascade failures when APIs go down
```python
@circuit_breaker(failure_threshold=5, recovery_timeout=60)
def call_external_api():
    ...
```
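For readers who haven't used the pattern, here's a minimal sketch of what a decorator like this could do under the hood; a real implementation (per-endpoint state, metrics, a proper half-open state) would be more involved.
```python
# Minimal circuit-breaker decorator sketch; parameter names match the example above.
import functools
import time

def circuit_breaker(failure_threshold: int = 5, recovery_timeout: int = 60):
    def decorator(func):
        failures = 0
        opened_at = 0.0

        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            nonlocal failures, opened_at
            # While the circuit is open, fail fast instead of hammering the API.
            if failures >= failure_threshold:
                if time.monotonic() - opened_at < recovery_timeout:
                    raise RuntimeError("circuit open: skipping call")
                failures = 0  # timeout elapsed: allow a trial call
            try:
                result = func(*args, **kwargs)
                failures = 0
                return result
            except Exception:
                failures += 1
                opened_at = time.monotonic()
                raise
        return wrapper
    return decorator
```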
Token Budgets: Prevent runaway LLM costs
```python
@token_budget(per_request=10000, per_user=100000)
def generate_answer():
    ...
```
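A per-request budget check can be as simple as the sketch below. It assumes the wrapped function takes the prompt as its first argument; the estimate_tokens heuristic is illustrative, and per-user tracking is omitted for brevity.
```python
# Sketch of a per-request token budget check; heuristics are illustrative.
import functools

def estimate_tokens(text: str) -> int:
    return len(text) // 4  # rough heuristic: ~4 characters per token

def token_budget(per_request: int, per_user: int | None = None):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(prompt: str, *args, **kwargs):
            estimated = estimate_tokens(prompt)
            if estimated > per_request:
                raise ValueError(
                    f"prompt needs ~{estimated} tokens, budget is {per_request}"
                )
            return func(prompt, *args, **kwargs)
        return wrapper
    return decorator
```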
Intelligent Caching: 40% cost reduction with dual-tier cache
```python
@cache(ttl=3600, strategy="dual-tier")
def expensive_operation():
    ...
```
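Here's a rough sketch of what a dual-tier cache could look like: an in-process dict in front of Redis. The key scheme and serialization are simplified and illustrative; the project's actual cache may differ.
```python
# Sketch of a dual-tier cache: an in-process dict backed by Redis.
import functools
import hashlib
import json

import redis

_local: dict[str, str] = {}                        # tier 1: process memory
_redis = redis.Redis(host="localhost", port=6379)  # tier 2: shared across workers

def cache(ttl: int = 3600):
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            key = hashlib.sha256(
                f"{func.__name__}:{args}:{kwargs}".encode()
            ).hexdigest()
            if key in _local:            # fastest: no network hop
                return json.loads(_local[key])
            hit = _redis.get(key)        # shared cache, survives restarts
            if hit is not None:
                _local[key] = hit.decode()
                return json.loads(hit)
            result = func(*args, **kwargs)
            payload = json.dumps(result)
            _local[key] = payload
            _redis.setex(key, ttl, payload)  # Redis handles expiry
            return result
        return wrapper
    return decorator
```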
Dynamic Model Selection: Use cheapest model that meets requirements
```python
model = select_model(task_type="summarization", max_latency=2.0)
```
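One way to implement this is a simple rule table, as in the sketch below. The model names, prices, latency numbers, and quality tiers are placeholders, not measured values or the project's actual configuration.
```python
# Illustrative sketch of rule-based model selection; all numbers are placeholders.
from dataclasses import dataclass

@dataclass
class ModelProfile:
    name: str
    cost_per_1m_tokens: float
    typical_latency_s: float
    quality_tier: int  # higher = better reasoning

MODELS = [
    ModelProfile("gemini-2.0-flash", 0.35, 1.5, 2),
    ModelProfile("gemini-2.0-pro", 5.0, 4.0, 3),
]

REQUIRED_QUALITY = {"summarization": 1, "qa": 2, "multi_hop_reasoning": 3}

def select_model(task_type: str, max_latency: float) -> str:
    """Pick the cheapest model that meets the task's quality and latency needs."""
    candidates = [
        m for m in MODELS
        if m.quality_tier >= REQUIRED_QUALITY.get(task_type, 2)
        and m.typical_latency_s <= max_latency
    ]
    if not candidates:
        raise ValueError(f"no model satisfies {task_type} within {max_latency}s")
    return min(candidates, key=lambda m: m.cost_per_1m_tokens).name
```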
The Plan
With requirements and architecture decided, I created a development plan:
Phase 1: Core Agents (Week 1)
- Set up project structure
- Implement DataCollectorAgent with 3 sources
- Implement KnowledgeGraphAgent with NetworkX
- Implement VectorAgent with FAISS
- Implement ReasoningAgent with Gemini
- Basic OrchestratorAgent
Phase 2: Production Features (Week 2)
- Add Neo4j backend for graphs
- Add Qdrant backend for vectors
- Implement Kafka event system
- Add 4 more data sources
- Implement SchedulerAgent
- Session management and persistence
- Apache Airflow integration
Phase 3: Frontend & Testing (Week 3)
- React frontend with glassmorphism design
- 7 pages (Home, Collect, Ask, Graph, Vector, Upload, Sessions)
- Comprehensive test suite (90%+ coverage)
- GitHub Actions CI/CD
- Docker containerization
- Documentation
Lessons from Planning
Looking back, here's what I learned:
What Worked
- Dual-backend strategy was brilliant - Saved hours of dev time
- Starting with requirements - Kept me focused
- Choosing mature tech - Less debugging, more building
- Production patterns from day 1 - No painful refactoring later
What I'd Change
- Should have added Airflow earlier - Parallel collection is much faster
- Could have started with fewer data sources - 3 would have been enough to validate
- Frontend design took longer than expected - Glassmorphism is tricky to get right
Key Insights
The best architecture is one that lets you move fast in development and scale in production without rewriting code.
Abstractions are worth the upfront cost when they give you optionality.
Production patterns implemented early save painful refactoring later.
Ready for Architecture?
Now that you understand the "why" behind ResearcherAI, let's dive into the "how". In the next section, I'll walk you through the system architecture and how all these pieces fit together.