Chapter 3: Data Foundations - Vector Databases, Knowledge Graphs, and GraphRAG
Introduction
Before building intelligent agents, we must understand how to store and retrieve information effectively. This chapter takes you from basic keyword search to advanced GraphRAG, explaining why each technology exists and when to use it.
This chapter is like learning about databases in web development:
- Keyword search = Simple WHERE name LIKE '%query%'
- Vector search = Semantic similarity (no exact matches needed)
- Knowledge graphs = Relational databases on steroids
- GraphRAG = Combining the best of all worlds
The Problem: Traditional Search Doesn't Work for Research
Keyword Search Limitations
Imagine searching for papers about "neural networks for language understanding":
Keyword Search:
from typing import List

def keyword_search(query: str, documents: List[str]) -> List[str]:
    """Traditional keyword matching: every query word must appear in the document"""
    results = []
    keywords = query.lower().split()
    for doc in documents:
        doc_lower = doc.lower()
        if all(keyword in doc_lower for keyword in keywords):
            results.append(doc)
    return results
# Search papers
query = "neural networks for language understanding"
results = keyword_search(query, papers)
Problems:
- Synonym Problem: Misses "deep learning" when searching "neural networks"
- Word Order: "language understanding with neural networks" won't match
- Context Ignored: Can't tell that "transformers" here refers to the attention-based architecture
- No Semantics: "bank" (financial) vs "bank" (river) treated identically
Real Example:
# User searches: "attention mechanisms in NLP"
# Misses papers that say:
# - "self-attention for natural language processing"
# - "transformer architecture for text understanding"
# - "query-key-value attention for language models"
# All mean the same thing but use different words!
What We Actually Need
For research, we need:
- Semantic understanding: "neural network" = "deep learning" = "artificial neural network"
- Context awareness: Understand concepts, not just words
- Relationship mapping: How papers, authors, and concepts connect
- Reasoning capabilities: "If A cites B, and B discusses C, then A likely relates to C"
This requires two complementary technologies:
- Vector Databases (semantic similarity)
- Knowledge Graphs (relationship reasoning)
Let's understand each from scratch.
Part 1: Vector Databases and Semantic Search
From Words to Vectors
Core Idea: Represent text as numbers that capture meaning.
The Intuition
# Imagine each word has a position in "meaning space"
# Similar meanings = close positions
king = [0.8, 0.3, 0.1] # Royalty, male, power
queen = [0.8, 0.9, 0.1] # Royalty, female, power
man = [0.2, 0.3, 0.0] # Common, male, neutral
woman = [0.2, 0.9, 0.0] # Common, female, neutral
# Amazing property:
# king - man + woman ≈ queen
# [0.8, 0.3, 0.1] - [0.2, 0.3, 0.0] + [0.2, 0.9, 0.0] = [0.8, 0.9, 0.1]
This is the idea behind word embeddings: representing words as dense vectors that capture semantic relationships.
How Embeddings Work
Creating Embeddings
from sentence_transformers import SentenceTransformer
# Load embedding model (runs locally!)
model = SentenceTransformer('all-MiniLM-L6-v2')
# Create embeddings
texts = [
"Neural networks for image classification",
"Deep learning in computer vision",
"Convolutional networks for image recognition"
]
embeddings = model.encode(texts)
print(embeddings.shape) # (3, 384) - 3 texts, 384 dimensions each
Measuring Similarity
import numpy as np
def cosine_similarity(vec1, vec2):
"""Measure how similar two vectors are (0=different, 1=identical)"""
return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))
# Compare texts
query = "neural nets for images"
query_embedding = model.encode(query)
for i, text in enumerate(texts):
similarity = cosine_similarity(query_embedding, embeddings[i])
print(f"Similarity to '{text}': {similarity:.3f}")
# Output:
# Similarity to 'Neural networks for image classification': 0.782
# Similarity to 'Deep learning in computer vision': 0.691
# Similarity to 'Convolutional networks for image recognition': 0.745
Key Insight: Even though exact words differ, semantic similarity is captured!
Vector Databases: Scaling Semantic Search
Comparing embeddings one-by-one doesn't scale. For 1 million papers:
- Comparing query to all papers: 1 million comparisons
- At 0.01ms per comparison: 10 seconds per query ❌
Solution: Vector databases with approximate nearest neighbor (ANN) search.
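Here is a minimal sketch of what "approximate" means in practice: a FAISS IVF index clusters the vectors and only scans a few clusters per query. The random vectors, cluster count, and probe count below are illustrative stand-ins; the flat index used for development in the next section is exact, while ANN indexes like this one trade a little recall for large speedups.
import faiss
import numpy as np

dimension = 384
nlist = 100                                       # number of coarse clusters (illustrative)
quantizer = faiss.IndexFlatL2(dimension)          # assigns vectors to clusters
ann_index = faiss.IndexIVFFlat(quantizer, dimension, nlist)

vectors = np.random.random((100_000, dimension)).astype('float32')
ann_index.train(vectors)                          # IVF indexes must be trained first
ann_index.add(vectors)

ann_index.nprobe = 10                             # scan only 10 of the 100 clusters per query
distances, indices = ann_index.search(vectors[:1], 5)
print(indices[0])                                 # approximate top-5 neighbours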
Development: FAISS (In-Memory)
FAISS (Facebook AI Similarity Search) - perfect for development and testing.
from typing import List

import faiss
import numpy as np
class FAISSVectorStore:
"""Development vector database using FAISS"""
def __init__(self, dimension: int = 384):
self.dimension = dimension
# Create FAISS index (L2 distance)
self.index = faiss.IndexFlatL2(dimension)
self.documents = [] # Store original documents
def add_documents(self, texts: List[str], embeddings: np.ndarray):
"""Add documents to index"""
# FAISS requires float32
embeddings_f32 = embeddings.astype('float32')
# Add to index
self.index.add(embeddings_f32)
# Store documents
self.documents.extend(texts)
print(f"✓ Indexed {len(texts)} documents (total: {self.index.ntotal})")
def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[tuple]:
"""Search for similar documents"""
# Ensure float32
query_f32 = query_embedding.astype('float32').reshape(1, -1)
# Search (returns distances and indices)
distances, indices = self.index.search(query_f32, top_k)
# Convert L2 distances to similarity scores (0-1)
similarities = 1 / (1 + distances[0])
# Return documents with scores
results = [
(self.documents[idx], float(sim))
for idx, sim in zip(indices[0], similarities)
if idx < len(self.documents)
]
return results
# Usage
vector_store = FAISSVectorStore(dimension=384)
# Index papers
papers = [
"Attention is all you need - introduces transformer architecture",
"BERT: Pre-training of deep bidirectional transformers",
"GPT-3: Language models are few-shot learners",
"ResNet: Deep residual learning for image recognition",
"YOLO: Real-time object detection"
]
embeddings = model.encode(papers)
vector_store.add_documents(papers, embeddings)
# Search
query = "transformer models for NLP"
query_emb = model.encode(query)
results = vector_store.search(query_emb, top_k=3)
for doc, score in results:
print(f"Score: {score:.3f} - {doc}")
Output:
Score: 0.856 - Attention is all you need - introduces transformer architecture
Score: 0.792 - BERT: Pre-training of deep bidirectional transformers
Score: 0.743 - GPT-3: Language models are few-shot learners
Notice: ResNet and YOLO (computer vision) are correctly excluded even though they're valid papers!
Production: Qdrant (Persistent, Scalable)
Qdrant - production-grade vector database with persistence and APIs.
from typing import Dict, List

import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct, Filter, FieldCondition, Range
class QdrantVectorStore:
"""Production vector database using Qdrant"""
def __init__(
self,
collection_name: str = "research_papers",
url: str = "http://localhost:6333"
):
self.client = QdrantClient(url=url)
self.collection_name = collection_name
self.dimension = 384
# Create collection if not exists
self._create_collection()
def _create_collection(self):
"""Create Qdrant collection"""
try:
self.client.get_collection(self.collection_name)
print(f"✓ Collection '{self.collection_name}' exists")
        except Exception:
self.client.create_collection(
collection_name=self.collection_name,
vectors_config=VectorParams(
size=self.dimension,
distance=Distance.COSINE # Cosine similarity
)
)
print(f"✓ Created collection '{self.collection_name}'")
def add_documents(
self,
texts: List[str],
embeddings: np.ndarray,
metadata: List[Dict] = None
):
"""Add documents with metadata"""
points = []
for idx, (text, embedding) in enumerate(zip(texts, embeddings)):
point = PointStruct(
id=idx,
vector=embedding.tolist(),
payload={
"text": text,
**(metadata[idx] if metadata else {})
}
)
points.append(point)
# Batch upload
self.client.upsert(
collection_name=self.collection_name,
points=points
)
print(f"✓ Indexed {len(points)} documents")
def search(
self,
query_embedding: np.ndarray,
top_k: int = 5,
        filters: Filter = None
) -> List[tuple]:
"""Search with optional metadata filters"""
results = self.client.search(
collection_name=self.collection_name,
query_vector=query_embedding.tolist(),
limit=top_k,
query_filter=filters # Can filter by metadata!
)
return [
(result.payload["text"], result.score)
for result in results
]
# Usage
qdrant_store = QdrantVectorStore(collection_name="papers")
# Index with metadata
metadata = [
{"year": 2017, "citations": 50000, "venue": "NeurIPS"},
{"year": 2018, "citations": 30000, "venue": "NAACL"},
{"year": 2020, "citations": 15000, "venue": "NeurIPS"},
{"year": 2015, "citations": 40000, "venue": "CVPR"},
{"year": 2016, "citations": 25000, "venue": "CVPR"}
]
qdrant_store.add_documents(papers, embeddings, metadata)
# Search with filters
query_emb = model.encode("transformer models for NLP")
results = qdrant_store.search(
query_emb,
top_k=5,
filters={"must": [{"key": "year", "range": {"gte": 2017}}]} # Papers after 2017
)
for doc, score in results:
print(f"Score: {score:.3f} - {doc}")
Vector Search: Dev vs Prod Comparison
| Feature | FAISS (Dev) | Qdrant (Prod) |
|---|---|---|
| Storage | In-memory only | Persistent to disk |
| Startup | Instant | ~2 seconds |
| Scalability | Single machine | Distributed cluster |
| Metadata | Manual tracking | Built-in filtering |
| APIs | Python only | REST + gRPC + Python |
| Persistence | Save/load manually | Automatic |
| Best for | Development, testing, prototyping | Production, millions of vectors |
- Development: Use FAISS - instant startup, no infrastructure
- Production: Use Qdrant - persistence, scalability, filtering
- ResearcherAI: Uses both with abstraction layer!
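The chapter does not reproduce ResearcherAI's actual abstraction layer, but a rough sketch of the idea looks like this: a small structural interface that both classes above already satisfy, plus a factory that picks the backend per environment (the VectorStore and get_vector_store names are illustrative).
from typing import List, Protocol, Tuple

import numpy as np

class VectorStore(Protocol):
    """Minimal interface shared by FAISSVectorStore and QdrantVectorStore (illustrative)."""
    def add_documents(self, texts: List[str], embeddings: np.ndarray) -> None: ...
    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[Tuple[str, float]]: ...

def get_vector_store(environment: str = "dev") -> VectorStore:
    """FAISS for development, Qdrant for production"""
    if environment == "prod":
        return QdrantVectorStore(collection_name="research_papers")
    return FAISSVectorStore(dimension=384)

store = get_vector_store("dev")
store.add_documents(papers, embeddings)   # same call regardless of backend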
Part 2: Knowledge Graphs and Structured Reasoning
Knowledge Graphs in the Real World
Before diving into the technical details, let's see knowledge graphs in action.
Try This: Open Google and search for "Tesla" (the electric vehicle company).
What do you see? Besides the typical list of matching websites, you'll notice a comprehensive panel on the right side showing:
- Description: "American electric vehicle and clean energy company..."
- Founded: 2003
- Headquarters: Austin, Texas
- CEO: Elon Musk
- Stock price and other properties
Now click on "Austin, Texas" (the headquarters location). You'll see another panel with:
- Description: "Capital city of Texas, United States"
- Population: ~1 million
- County: Travis County
This is a knowledge graph in action! Google isn't just returning text - it's showing you:
- Entities (Tesla, Austin, Elon Musk)
- Properties (founded date, population)
- Relationships (Tesla → headquarters → Austin → county → Travis County)
Web Developer Analogy:
// Traditional search result = List of text snippets
const results = ["Tesla is a company...", "Tesla founded in 2003...", "Tesla located..."]
// Knowledge graph = Structured interconnected data
const knowledgeGraph = {
"Tesla": {
type: "Company",
properties: {
founded: 2003,
name: "Tesla, Inc."
},
relationships: {
headquarters: "Austin",
CEO: "Elon_Musk"
}
},
"Austin": {
type: "City",
properties: {
population: 1000000,
state: "Texas"
},
relationships: {
county: "Travis_County",
companies: ["Tesla", "Oracle", "Dell"]
}
}
}
What Are Knowledge Graphs?
A knowledge graph represents structured information as a graph where:
- Nodes represent entities (Tesla, Austin, Elon Musk)
- Edges represent relationships between entities (headquarters, CEO, located_in)
Two Main Components:
1. Schema/Ontology: Defines the types of entities, their attributes, and allowed relationships
# Schema definition
Company has property: founded_year (integer)
Company has property: name (string)
Company has relationship: headquarters → City
Company has relationship: CEO → Person
2. Instance Data: The actual entities and relationships that follow the schema
# Instance data
Tesla founded_year 2003
Tesla name "Tesla, Inc."
Tesla headquarters Austin
Tesla CEO Elon_Musk
Why Vector Search Isn't Enough
Vector search excels at finding similar content, but fails at:
1. Relationship Questions
Query: "Which papers cite both attention mechanisms and BERT?"
Vector search: ❌ Can't traverse citations
Knowledge graph: ✅ MATCH (p)-[:CITES]->(a), (p)-[:CITES]->(b)
2. Multi-Hop Reasoning
Query: "Find papers by authors who collaborated with Yoshua Bengio"
Vector search: ❌ Can't follow author → author connections
Knowledge graph: ✅ Path traversal over collaboration edges
3. Structured Queries
Query: "Papers published in 2020 that cite papers from before 2015"
Vector search: ❌ No temporal reasoning
Knowledge graph: ✅ Filter by year property + traverse citations
From Tables to Graphs
Key Differences:
Tables (Relational):
- Fixed schema
- Join operations expensive
- Hard to add new relationship types
- Optimized for transactional queries
Graphs (Network):
- Flexible schema
- Traversals are natural
- Easy to add new edges
- Optimized for relationship queries
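To make the contrast concrete, here is a small sketch (toy data, made-up paper IDs) of the same two-hop citation question answered with a table self-join in pandas and with a graph traversal in NetworkX:
import networkx as nx
import pandas as pd

# Relational view: "who cites a paper that cites p1?" needs a self-join
citations = pd.DataFrame({"citing": ["p2", "p3"], "cited": ["p1", "p2"]})
two_hop = citations.merge(citations, left_on="cited", right_on="citing",
                          suffixes=("_outer", "_inner"))
print(two_hop.loc[two_hop["cited_inner"] == "p1", "citing_outer"].tolist())  # ['p3']

# Graph view: the same question is a two-edge traversal
g = nx.DiGraph([("p2", "p1"), ("p3", "p2")])
print([n for n, dist in nx.shortest_path_length(g, target="p1").items() if dist == 2])  # ['p3']
Every extra hop means another join on the table side, while on the graph side it is just a longer traversal.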
Knowledge Graph Basics
Components:
- Nodes (Entities): Papers, Authors, Concepts, Institutions
- Edges (Relationships): CITES, AUTHORED_BY, DISCUSSES, AFFILIATED_WITH
- Properties: title, year, citations_count, etc.
Example Graph:
# Nodes
Paper1 = {
"id": "paper_1",
"type": "Paper",
"title": "Attention is All You Need",
"year": 2017,
"citations": 50000
}
Author1 = {
"id": "author_1",
"type": "Author",
"name": "Ashish Vaswani"
}
Concept1 = {
"id": "concept_1",
"type": "Concept",
"name": "Transformer"
}
# Edges
edges = [
(Author1, "AUTHORED", Paper1),
(Paper1, "INTRODUCES", Concept1),
(Paper2, "CITES", Paper1),
(Paper2, "USES", Concept1)
]
Development: NetworkX (In-Memory)
NetworkX - Python library for graph operations, perfect for development.
import networkx as nx
from typing import Dict, List
class NetworkXKnowledgeGraph:
"""Development knowledge graph using NetworkX"""
def __init__(self):
self.graph = nx.MultiDiGraph() # Directed graph with multiple edges
def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
"""Add paper node"""
self.graph.add_node(
paper_id,
type="Paper",
title=title,
year=year,
abstract=abstract
)
def add_author(self, author_id: str, name: str):
"""Add author node"""
self.graph.add_node(
author_id,
type="Author",
name=name
)
def add_concept(self, concept_id: str, name: str):
"""Add concept node"""
self.graph.add_node(
concept_id,
type="Concept",
name=name
)
def add_authored(self, author_id: str, paper_id: str):
"""Add AUTHORED relationship"""
self.graph.add_edge(author_id, paper_id, type="AUTHORED")
def add_citation(self, citing_paper: str, cited_paper: str):
"""Add CITES relationship"""
self.graph.add_edge(citing_paper, cited_paper, type="CITES")
def add_discusses(self, paper_id: str, concept_id: str):
"""Add DISCUSSES relationship"""
self.graph.add_edge(paper_id, concept_id, type="DISCUSSES")
def find_papers_by_author(self, author_name: str) -> List[Dict]:
"""Find all papers by an author"""
papers = []
for node, data in self.graph.nodes(data=True):
if data.get("type") == "Author" and data.get("name") == author_name:
# Find papers this author authored
for neighbor in self.graph.successors(node):
if self.graph.nodes[neighbor].get("type") == "Paper":
papers.append({
"id": neighbor,
**self.graph.nodes[neighbor]
})
return papers
def find_citing_papers(self, paper_id: str) -> List[Dict]:
"""Find papers that cite a given paper"""
citing = []
for pred in self.graph.predecessors(paper_id):
edge_data = self.graph.get_edge_data(pred, paper_id)
if any(e.get("type") == "CITES" for e in edge_data.values()):
if self.graph.nodes[pred].get("type") == "Paper":
citing.append({
"id": pred,
**self.graph.nodes[pred]
})
return citing
def find_related_concepts(self, paper_id: str) -> List[str]:
"""Find concepts discussed in a paper"""
concepts = []
for neighbor in self.graph.successors(paper_id):
if self.graph.nodes[neighbor].get("type") == "Concept":
concepts.append(self.graph.nodes[neighbor].get("name"))
return concepts
def find_collaboration_network(self, author_name: str, depth: int = 2) -> List[str]:
"""Find authors who collaborated (shared papers)"""
collaborators = set()
# Find author node
author_node = None
for node, data in self.graph.nodes(data=True):
if data.get("type") == "Author" and data.get("name") == author_name:
author_node = node
break
if not author_node:
return []
# Find papers by this author
papers = [n for n in self.graph.successors(author_node)
if self.graph.nodes[n].get("type") == "Paper"]
# Find co-authors
for paper in papers:
for pred in self.graph.predecessors(paper):
if self.graph.nodes[pred].get("type") == "Author" and pred != author_node:
collaborators.add(self.graph.nodes[pred].get("name"))
return list(collaborators)
# Usage
kg = NetworkXKnowledgeGraph()
# Add nodes
kg.add_paper("paper_1", "Attention is All You Need", 2017)
kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
kg.add_paper("paper_3", "GPT-3: Language Models", 2020)
kg.add_author("author_1", "Ashish Vaswani")
kg.add_author("author_2", "Jacob Devlin")
kg.add_author("author_3", "Tom Brown")
kg.add_concept("concept_1", "Transformer")
kg.add_concept("concept_2", "Attention Mechanism")
kg.add_concept("concept_3", "Pre-training")
# Add relationships
kg.add_authored("author_1", "paper_1")
kg.add_authored("author_2", "paper_2")
kg.add_authored("author_3", "paper_3")
kg.add_discusses("paper_1", "concept_1")
kg.add_discusses("paper_1", "concept_2")
kg.add_discusses("paper_2", "concept_1")
kg.add_discusses("paper_2", "concept_3")
kg.add_citation("paper_2", "paper_1") # BERT cites Attention
kg.add_citation("paper_3", "paper_1") # GPT-3 cites Attention
kg.add_citation("paper_3", "paper_2") # GPT-3 cites BERT
# Query the graph
print("Papers by Ashish Vaswani:")
papers = kg.find_papers_by_author("Ashish Vaswani")
for paper in papers:
print(f" - {paper['title']}")
print("\nPapers citing 'Attention is All You Need':")
citing = kg.find_citing_papers("paper_1")
for paper in citing:
print(f" - {paper['title']}")
print("\nConcepts in paper_1:")
concepts = kg.find_related_concepts("paper_1")
print(f" {', '.join(concepts)}")
print("\nCollaborators of Jacob Devlin:")
collabs = kg.find_collaboration_network("Jacob Devlin")
print(f" {', '.join(collabs)}")
Output:
Papers by Ashish Vaswani:
- Attention is All You Need
Papers citing 'Attention is All You Need':
- BERT: Pre-training Transformers
- GPT-3: Language Models
Concepts in paper_1:
Transformer, Attention Mechanism
Collaborators of Jacob Devlin:
  Ming-Wei Chang
Production: Neo4j (Persistent, Cypher)
Neo4j - enterprise-grade graph database with powerful query language (Cypher).
from typing import Dict, List

from neo4j import GraphDatabase
class Neo4jKnowledgeGraph:
"""Production knowledge graph using Neo4j"""
def __init__(self, uri: str = "bolt://localhost:7687", user: str = "neo4j", password: str = "password"):
self.driver = GraphDatabase.driver(uri, auth=(user, password))
def close(self):
self.driver.close()
def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
"""Add paper node"""
with self.driver.session() as session:
session.run("""
MERGE (p:Paper {id: $paper_id})
SET p.title = $title, p.year = $year, p.abstract = $abstract
""", paper_id=paper_id, title=title, year=year, abstract=abstract)
def add_author(self, author_id: str, name: str):
"""Add author node"""
with self.driver.session() as session:
session.run("""
MERGE (a:Author {id: $author_id})
SET a.name = $name
""", author_id=author_id, name=name)
def add_concept(self, concept_id: str, name: str):
"""Add concept node"""
with self.driver.session() as session:
session.run("""
MERGE (c:Concept {id: $concept_id})
SET c.name = $name
""", concept_id=concept_id, name=name)
def add_authored(self, author_id: str, paper_id: str):
"""Add AUTHORED relationship"""
with self.driver.session() as session:
session.run("""
MATCH (a:Author {id: $author_id})
MATCH (p:Paper {id: $paper_id})
MERGE (a)-[:AUTHORED]->(p)
""", author_id=author_id, paper_id=paper_id)
def add_citation(self, citing_paper: str, cited_paper: str):
"""Add CITES relationship"""
with self.driver.session() as session:
session.run("""
MATCH (citing:Paper {id: $citing_paper})
MATCH (cited:Paper {id: $cited_paper})
MERGE (citing)-[:CITES]->(cited)
""", citing_paper=citing_paper, cited_paper=cited_paper)
def add_discusses(self, paper_id: str, concept_id: str):
"""Add DISCUSSES relationship"""
with self.driver.session() as session:
session.run("""
MATCH (p:Paper {id: $paper_id})
MATCH (c:Concept {id: $concept_id})
MERGE (p)-[:DISCUSSES]->(c)
""", paper_id=paper_id, concept_id=concept_id)
def find_papers_by_author(self, author_name: str) -> List[Dict]:
"""Find all papers by an author"""
with self.driver.session() as session:
result = session.run("""
MATCH (a:Author {name: $name})-[:AUTHORED]->(p:Paper)
RETURN p.id as id, p.title as title, p.year as year
""", name=author_name)
return [dict(record) for record in result]
def find_citing_papers(self, paper_id: str) -> List[Dict]:
"""Find papers that cite a given paper"""
with self.driver.session() as session:
result = session.run("""
MATCH (citing:Paper)-[:CITES]->(cited:Paper {id: $paper_id})
RETURN citing.id as id, citing.title as title, citing.year as year
""", paper_id=paper_id)
return [dict(record) for record in result]
def find_related_concepts(self, paper_id: str) -> List[str]:
"""Find concepts discussed in a paper"""
with self.driver.session() as session:
result = session.run("""
MATCH (p:Paper {id: $paper_id})-[:DISCUSSES]->(c:Concept)
RETURN c.name as concept
""", paper_id=paper_id)
return [record["concept"] for record in result]
def find_collaboration_network(self, author_name: str) -> List[str]:
"""Find authors who collaborated (shared papers)"""
with self.driver.session() as session:
result = session.run("""
MATCH (a1:Author {name: $name})-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1 <> a2
RETURN DISTINCT a2.name as collaborator
""", name=author_name)
return [record["collaborator"] for record in result]
def find_citation_chain(self, start_paper: str, end_paper: str, max_depth: int = 5):
"""Find citation path between two papers"""
with self.driver.session() as session:
result = session.run("""
MATCH path = shortestPath(
(start:Paper {id: $start})-[:CITES*1..{max_depth}]->(end:Paper {id: $end})
)
RETURN [node in nodes(path) | node.title] as path
""".replace("{max_depth}", str(max_depth)), start=start_paper, end=end_paper)
records = list(result)
return records[0]["path"] if records else []
# Usage
neo4j_kg = Neo4jKnowledgeGraph()
# Add the same data as in the NetworkX example (abridged)
neo4j_kg.add_paper("paper_1", "Attention is All You Need", 2017)
neo4j_kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
neo4j_kg.add_paper("paper_3", "GPT-3: Language Models", 2020)
neo4j_kg.add_author("author_1", "Ashish Vaswani")
neo4j_kg.add_authored("author_1", "paper_1")
neo4j_kg.add_citation("paper_2", "paper_1")
neo4j_kg.add_citation("paper_3", "paper_2")
# Advanced query: citation chain (paper_3 -> paper_2 -> paper_1)
chain = neo4j_kg.find_citation_chain("paper_3", "paper_1")
print(f"Citation path: {' -> '.join(chain)}")
neo4j_kg.close()
Cypher Query Language
Cypher is Neo4j's declarative query language - incredibly powerful:
// Find papers published after 2018 that cite papers with >10000 citations
MATCH (recent:Paper)-[:CITES]->(influential:Paper)
WHERE recent.year > 2018 AND influential.citations > 10000
RETURN recent.title, influential.title, influential.citations
ORDER BY influential.citations DESC
// Find "research communities" - groups of authors who frequently collaborate
MATCH (a1:Author)-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1.name < a2.name
WITH a1, a2, count(*) as collaborations
WHERE collaborations > 3
RETURN a1.name, a2.name, collaborations
ORDER BY collaborations DESC
// Find trending concepts (discussed in papers with growing citations)
MATCH (p:Paper)-[:DISCUSSES]->(c:Concept)
WHERE p.year >= 2020
WITH c, avg(p.citations) as avg_citations, count(p) as paper_count
WHERE paper_count > 5
RETURN c.name, avg_citations, paper_count
ORDER BY avg_citations DESC
LIMIT 10
Knowledge Graph: Dev vs Prod Comparison
| Feature | NetworkX (Dev) | Neo4j (Prod) |
|---|---|---|
| Storage | In-memory only | Persistent to disk |
| Query Language | Python code | Cypher (declarative) |
| Scalability | 1000s of nodes | Billions of nodes |
| Performance | Slow for large graphs | Optimized indexes |
| Transactions | No | ACID transactions |
| Clustering | No | Multi-node clusters |
| Best for | Development, algorithms | Production, complex queries |
- Development: NetworkX - no setup, great for prototyping
- Production: Neo4j - performance, Cypher queries, persistence
- ResearcherAI: Abstracts both behind unified interface!
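As with the vector stores, ResearcherAI's unified interface is not reproduced here; a minimal sketch of the idea, reusing the two classes above (the function name and environment variable are illustrative):
import os

def get_knowledge_graph():
    """NetworkX in development, Neo4j in production - both expose the same add_*/find_* methods"""
    if os.getenv("RESEARCHERAI_ENV") == "production":
        return Neo4jKnowledgeGraph(
            uri=os.getenv("NEO4J_URI", "bolt://localhost:7687"),
            user=os.getenv("NEO4J_USER", "neo4j"),
            password=os.getenv("NEO4J_PASSWORD", "password"),
        )
    return NetworkXKnowledgeGraph()

kg = get_knowledge_graph()
kg.add_paper("paper_1", "Attention is All You Need", 2017)
kg.add_author("author_1", "Ashish Vaswani")
kg.add_authored("author_1", "paper_1")
print(kg.find_papers_by_author("Ashish Vaswani"))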
Semantic Web: Ontologies, RDF, and SPARQL
So far we've discussed property graphs (Neo4j, NetworkX). There's another powerful approach: semantic web technologies using RDF and ontologies.
What Are Ontologies?
Think of an ontology as a formal schema for your knowledge:
Web Developer Analogy:
// TypeScript interface = Ontology
interface Person {
name: string;
worksFor: Organization;
knows: Person[];
}
interface Organization {
name: string;
foundedDate: Date;
}
An ontology defines:
- Classes (Person, Paper, Author, Concept)
- Properties (name, authored, cites, discusses)
- Relationships (Author → authored → Paper)
- Constraints (a Paper must have at least one Author)
RDF: Resource Description Framework
RDF represents knowledge as triples:
Subject Predicate Object
Every statement is a triple (like a sentence):
# Turtle syntax (RDF format)
:paper1 rdf:type :ResearchPaper .
:paper1 :hasTitle "Attention Is All You Need" .
:paper1 :publishedYear 2017 .
:paper1 :hasAuthor :vaswani .
:paper1 :cites :paper2 .
:vaswani rdf:type :Author .
:vaswani :hasName "Ashish Vaswani" .
Web Developer Analogy:
// JSON = Property Graph
{
"id": "paper1",
"title": "Attention Is All You Need",
"year": 2017,
"authors": ["vaswani"]
}
// RDF Triples = Semantic Web
["paper1", "type", "ResearchPaper"]
["paper1", "hasTitle", "Attention Is All You Need"]
["paper1", "publishedYear", 2017]
["paper1", "hasAuthor", "vaswani"]
RDF Triple Visualization: each triple is drawn as an arrow from subject to object, labeled with the predicate, e.g. :paper1 --:hasAuthor--> :vaswani
Development: RDFLib (Python)
For development, use RDFLib - a pure Python library:
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS
class RDFKnowledgeGraph:
"""Development RDF knowledge graph using RDFLib"""
def __init__(self):
self.graph = Graph()
# Define custom namespace for our ontology
self.ns = Namespace("http://researcherai.org/ontology#")
self.graph.bind("research", self.ns)
def add_paper(
self,
paper_id: str,
title: str,
year: int,
abstract: str = ""
):
"""Add a research paper to the graph"""
paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")
# Add triples
self.graph.add((paper_uri, RDF.type, self.ns.ResearchPaper))
self.graph.add((paper_uri, self.ns.hasTitle, Literal(title)))
self.graph.add((paper_uri, self.ns.publishedYear, Literal(year)))
if abstract:
self.graph.add((paper_uri, self.ns.hasAbstract, Literal(abstract)))
def add_author(self, author_id: str, name: str, affiliation: str = ""):
"""Add an author to the graph"""
author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")
self.graph.add((author_uri, RDF.type, self.ns.Author))
self.graph.add((author_uri, self.ns.hasName, Literal(name)))
if affiliation:
self.graph.add((author_uri, self.ns.affiliation, Literal(affiliation)))
def link_author_to_paper(self, author_id: str, paper_id: str):
"""Create authorship relationship"""
author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")
paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")
self.graph.add((paper_uri, self.ns.hasAuthor, author_uri))
self.graph.add((author_uri, self.ns.authored, paper_uri))
def add_citation(self, citing_paper_id: str, cited_paper_id: str):
"""Add citation relationship"""
citing_uri = URIRef(f"http://researcherai.org/papers/{citing_paper_id}")
cited_uri = URIRef(f"http://researcherai.org/papers/{cited_paper_id}")
self.graph.add((citing_uri, self.ns.cites, cited_uri))
self.graph.add((cited_uri, self.ns.citedBy, citing_uri))
def query_sparql(self, sparql_query: str):
"""Execute SPARQL query"""
return self.graph.query(sparql_query)
def export_turtle(self, filename: str):
"""Export graph to Turtle format"""
self.graph.serialize(destination=filename, format='turtle')
def load_turtle(self, filename: str):
"""Load graph from Turtle format"""
self.graph.parse(filename, format='turtle')
# Example usage
rdf_kg = RDFKnowledgeGraph()
# Add papers
rdf_kg.add_paper(
"paper1",
"Attention Is All You Need",
2017,
"The dominant sequence transduction models..."
)
rdf_kg.add_paper(
"paper2",
"BERT: Pre-training of Deep Bidirectional Transformers",
2019,
"We introduce BERT..."
)
# Add authors
rdf_kg.add_author("vaswani", "Ashish Vaswani", "Google Brain")
rdf_kg.add_author("devlin", "Jacob Devlin", "Google AI")
# Link relationships
rdf_kg.link_author_to_paper("vaswani", "paper1")
rdf_kg.link_author_to_paper("devlin", "paper2")
rdf_kg.add_citation("paper2", "paper1") # BERT cites Attention paper
# Export to file
rdf_kg.export_turtle("research_graph.ttl")
SPARQL: Query Language for RDF
SPARQL is to RDF what Cypher is to Neo4j (or SQL to relational databases):
# Find all papers by a specific author
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?paper ?title ?year
WHERE {
?author research:hasName "Ashish Vaswani" .
?author research:authored ?paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
}
ORDER BY ?year
Python Example:
sparql_query = """
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?citing_title ?cited_title
WHERE {
?citing_paper research:cites ?cited_paper .
?citing_paper research:hasTitle ?citing_title .
?cited_paper research:hasTitle ?cited_title .
}
"""
results = rdf_kg.query_sparql(sparql_query)
for row in results:
print(f"{row.citing_title} cites {row.cited_title}")
More SPARQL Examples
# Find authors who collaborated (co-authored papers)
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?author1_name ?author2_name ?paper_title
WHERE {
?paper research:hasAuthor ?author1 .
?paper research:hasAuthor ?author2 .
?paper research:hasTitle ?paper_title .
?author1 research:hasName ?author1_name .
?author2 research:hasName ?author2_name .
FILTER(?author1 != ?author2)
}
# Find highly cited papers (cited by many others)
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?title (COUNT(?citing) as ?citation_count)
WHERE {
  ?paper research:hasTitle ?title .
  ?citing research:cites ?paper .
}
GROUP BY ?title
HAVING (COUNT(?citing) > 100)
ORDER BY DESC(?citation_count)
# Find papers published after 2018 in a specific domain
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?title ?year
WHERE {
  ?paper research:hasTitle ?title .
  ?paper research:publishedYear ?year .
  ?paper research:discusses ?concept .
  ?concept research:hasName "transformers" .
  FILTER(?year > 2018)
}
Production: Apache Jena & SPARQL Endpoint
For production, use Apache Jena Fuseki - a SPARQL server:
from SPARQLWrapper import SPARQLWrapper, JSON
import requests
class JenaKnowledgeGraph:
"""Production RDF knowledge graph using Apache Jena Fuseki"""
def __init__(
self,
endpoint_url: str = "http://localhost:3030/research",
update_endpoint: str = "http://localhost:3030/research/update"
):
self.endpoint_url = endpoint_url
self.update_endpoint = update_endpoint
self.sparql = SPARQLWrapper(endpoint_url)
def add_triples(self, triples_turtle: str):
"""Add RDF triples to the graph"""
# Use SPARQL UPDATE to insert data
update_query = f"""
PREFIX research: <http://researcherai.org/ontology#>
INSERT DATA {{
{triples_turtle}
}}
"""
response = requests.post(
self.update_endpoint,
data={"update": update_query},
headers={"Content-Type": "application/x-www-form-urlencoded"}
)
return response.status_code == 200
def add_paper(self, paper_id: str, title: str, year: int):
"""Add a research paper"""
triples = f"""
@prefix research: <http://researcherai.org/ontology#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
<http://researcherai.org/papers/{paper_id}>
rdf:type research:ResearchPaper ;
research:hasTitle "{title}" ;
research:publishedYear {year} .
"""
return self.add_triples(triples)
def query(self, sparql_query: str) -> list:
"""Execute SPARQL SELECT query"""
self.sparql.setQuery(sparql_query)
self.sparql.setReturnFormat(JSON)
results = self.sparql.query().convert()
return results["results"]["bindings"]
def find_papers_by_author(self, author_name: str) -> list:
"""Find all papers by an author"""
query = f"""
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?paper ?title ?year
WHERE {{
?author research:hasName "{author_name}" .
?author research:authored ?paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
}}
ORDER BY DESC(?year)
"""
return self.query(query)
    def find_citation_chain(self, paper_id: str) -> list:
        """Find papers that cite this paper, directly or transitively"""
        # SPARQL 1.1 property path: '+' follows one or more cites edges
        query = f"""
        PREFIX research: <http://researcherai.org/ontology#>
        SELECT DISTINCT ?citing_paper ?title
        WHERE {{
            ?citing_paper research:cites+ <http://researcherai.org/papers/{paper_id}> .
            ?citing_paper research:hasTitle ?title .
        }}
        """
        return self.query(query)
# Example usage with Fuseki
jena_kg = JenaKnowledgeGraph(
endpoint_url="http://localhost:3030/research/sparql",
update_endpoint="http://localhost:3030/research/update"
)
# Add paper
jena_kg.add_paper(
"attention2017",
"Attention Is All You Need",
2017
)
# Query
results = jena_kg.find_papers_by_author("Ashish Vaswani")
for result in results:
print(f"{result['title']['value']} ({result['year']['value']})")
OWL: Web Ontology Language - The Power of Reasoning
OWL (Web Ontology Language) extends RDF with reasoning capabilities. It's the difference between storing facts and deriving new knowledge from those facts.
Web Developer Analogy:
// RDF = Data storage
const data = {
"Vaswani": { type: "Author", authored: ["paper1"] }
}
// OWL = Data storage + Logic rules
const data = { ... }
const rules = {
// If someone authored a paper, they are a researcher
"Author who authored something → Researcher"
}
// Reasoner can INFER: "Vaswani is a Researcher" (even if not explicitly stated)
Why OWL Matters: Automatic Inference
Without OWL (just RDF):
:vaswani :authored :paper1 .
# To know Vaswani is a researcher, you must explicitly state it
With OWL (RDF + reasoning):
# Define rule: anyone who authored something is a researcher
:authored rdfs:domain :Researcher .
# Just state the fact
:vaswani :authored :paper1 .
# OWL reasoner AUTOMATICALLY infers:
:vaswani rdf:type :Researcher . # Derived, not stated!
OWL Ontology Definition
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix research: <http://researcherai.org/ontology#> .
# Define classes
research:ResearchPaper rdf:type owl:Class .
research:Author rdf:type owl:Class .
research:Researcher rdf:type owl:Class .
research:Concept rdf:type owl:Class .
research:InfluentialPaper rdf:type owl:Class .
# Define class hierarchy
research:Author rdfs:subClassOf research:Researcher .
# All Authors are Researchers (but not all Researchers are Authors)
# Define properties
research:hasAuthor rdf:type owl:ObjectProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range research:Author .
research:cites rdf:type owl:ObjectProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range research:ResearchPaper .
research:citedBy rdf:type owl:ObjectProperty ;
owl:inverseOf research:cites . # Automatic inverse!
research:hasTitle rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:string .
research:publishedYear rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:integer .
research:citationCount rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:integer .
# Define constraints
research:ResearchPaper rdfs:subClassOf [
rdf:type owl:Restriction ;
owl:onProperty research:hasAuthor ;
owl:minCardinality "1"^^xsd:nonNegativeInteger
] . # A paper must have at least one author
# Define derived class (automatic classification!)
research:InfluentialPaper owl:equivalentClass [
rdf:type owl:Restriction ;
owl:onProperty research:citationCount ;
owl:someValuesFrom [
rdf:type rdfs:Datatype ;
owl:onDatatype xsd:integer ;
owl:withRestrictions ([ xsd:minInclusive 100 ])
]
] . # Papers with 100+ citations are automatically "InfluentialPaper"
# Property characteristics
research:collaboratesWith rdf:type owl:SymmetricProperty .
# If A collaborates with B, then B collaborates with A
research:cites rdf:type owl:TransitiveProperty .
# If A cites B, and B cites C, then A transitively cites C
Development: Owlready2 with Reasoning
For development, use owlready2 - a Python library with built-in reasoner:
from owlready2 import *
import tempfile
class OWLKnowledgeGraph:
"""Development OWL ontology with reasoning"""
def __init__(self, ontology_iri="http://researcherai.org/ontology"):
self.onto = get_ontology(ontology_iri)
with self.onto:
# Define classes
class ResearchPaper(Thing):
pass
class Author(Thing):
pass
class Researcher(Thing):
pass
class Concept(Thing):
pass
class InfluentialPaper(ResearchPaper):
pass
# Define properties
class hasAuthor(ObjectProperty):
domain = [ResearchPaper]
range = [Author]
class authored(ObjectProperty):
domain = [Author]
range = [ResearchPaper]
inverse_property = hasAuthor
class cites(ObjectProperty, TransitiveProperty):
domain = [ResearchPaper]
range = [ResearchPaper]
class citedBy(ObjectProperty):
inverse_property = cites
class collaboratesWith(ObjectProperty, SymmetricProperty):
domain = [Author]
range = [Author]
            class hasTitle(DataProperty):
                domain = [ResearchPaper]
                range = [str]
            class publishedYear(DataProperty):
                domain = [ResearchPaper]
                range = [int]
class citationCount(DataProperty):
domain = [ResearchPaper]
range = [int]
            # Define rules: all Authors are Researchers (class subsumption)
            Author.is_a.append(Researcher)
            # Automatic classification: papers with 100+ citations are influential
            InfluentialPaper.equivalent_to.append(
                ResearchPaper & citationCount.some(
                    ConstrainedDatatype(int, min_inclusive=100)
                )
            )
self.ResearchPaper = self.onto.ResearchPaper
self.Author = self.onto.Author
self.hasAuthor = self.onto.hasAuthor
self.cites = self.onto.cites
self.hasTitle = self.onto.hasTitle
self.publishedYear = self.onto.publishedYear
self.citationCount = self.onto.citationCount
def add_paper(self, paper_id: str, title: str, year: int, citations: int = 0):
"""Add a research paper"""
paper = self.ResearchPaper(paper_id)
paper.hasTitle = [title]
paper.publishedYear = [year]
paper.citationCount = [citations]
return paper
def add_author(self, author_id: str, name: str):
"""Add an author"""
author = self.Author(author_id)
author.label = [name]
return author
def link_author_to_paper(self, author, paper):
"""Create authorship relationship"""
author.authored.append(paper)
# Inverse relationship is automatic!
def add_citation(self, citing_paper, cited_paper):
"""Add citation relationship"""
citing_paper.cites.append(cited_paper)
# citedBy is automatic (inverse property)!
def run_reasoner(self):
"""Run OWL reasoner to infer new facts"""
print("Running reasoner...")
with self.onto:
sync_reasoner(debug=False)
print("Reasoning complete!")
def find_influential_papers(self):
"""Find papers automatically classified as influential"""
return list(self.onto.InfluentialPaper.instances())
def find_all_researchers(self):
"""Find all researchers (including inferred ones)"""
return list(self.onto.Researcher.instances())
def save(self, filename: str):
"""Save ontology to file"""
self.onto.save(file=filename, format="rdfxml")
def load(self, filename: str):
"""Load ontology from file"""
self.onto = get_ontology(filename).load()
# Example usage with reasoning
owl_kg = OWLKnowledgeGraph()
# Add papers
paper1 = owl_kg.add_paper("attention2017", "Attention Is All You Need", 2017, 15000)
paper2 = owl_kg.add_paper("bert2019", "BERT", 2019, 8000)
paper3 = owl_kg.add_paper("transformer_xl", "Transformer-XL", 2019, 500)
# Add authors
vaswani = owl_kg.add_author("vaswani", "Ashish Vaswani")
devlin = owl_kg.add_author("devlin", "Jacob Devlin")
# Link relationships
owl_kg.link_author_to_paper(vaswani, paper1)
owl_kg.link_author_to_paper(devlin, paper2)
# Add citations
owl_kg.add_citation(paper2, paper1) # BERT cites Attention
owl_kg.add_citation(paper3, paper1) # Transformer-XL cites Attention
print("Before reasoning:")
print(f"Influential papers: {len(owl_kg.find_influential_papers())}")
print(f"Researchers: {len(owl_kg.find_all_researchers())}")
# Run reasoner
owl_kg.run_reasoner()
print("\nAfter reasoning:")
# Papers with 100+ citations are automatically classified as InfluentialPaper
influential = owl_kg.find_influential_papers()
print(f"Influential papers: {[p.hasTitle[0] for p in influential]}")
# Authors are automatically inferred to be Researchers
researchers = owl_kg.find_all_researchers()
print(f"Researchers: {[r.label[0] for r in researchers]}")
# Check inverse properties
print(f"\nVaswani authored: {[p.hasTitle[0] for p in vaswani.authored]}")
print(f"Paper1 has authors: {[a.label[0] for a in paper1.hasAuthor]}")
# Both work! Inverse is automatic.
# Check citedBy (inverse of cites)
print(f"\nPaper1 is cited by: {[p.hasTitle[0] for p in paper1.citedBy]}")
# Automatic from cites relationship!
OWL Reasoning Examples
1. Class Hierarchy Inference:
# Define hierarchy
with owl_kg.onto:
    class Author(Thing):
        pass
    class Researcher(Thing):
        pass
    class PhDStudent(Author):
        pass
    class Professor(Author):
        pass
    # Axiom: all Authors are Researchers
    Author.is_a.append(Researcher)
# Create instance
phd_student = owl_kg.onto.PhDStudent("alice")
# Before reasoning: only the directly asserted class
print(phd_student.is_a)  # [PhDStudent]
# Run reasoner
sync_reasoner()
# Alice is now also recognised as an Author and a Researcher
print(phd_student in owl_kg.onto.Researcher.instances())  # True
# Automatically inferred via the class hierarchy!
2. Property Propagation:
# Define transitive property
with owl_kg.onto:
    class influences(ObjectProperty, TransitiveProperty):
        pass
# State facts
ResearchPaper = owl_kg.onto.ResearchPaper
paper_a = ResearchPaper("paper_a")
paper_b = ResearchPaper("paper_b")
paper_c = ResearchPaper("paper_c")
paper_a.influences = [paper_b]  # A influences B
paper_b.influences = [paper_c]  # B influences C
# Run reasoner and ask it to materialize inferred property values
sync_reasoner(infer_property_values=True)
# Reasoner infers: A influences C (transitively)
print(paper_c in paper_a.influences)  # True
3. Automatic Classification:
# Define rule: Papers with many authors are "Collaborative"
ResearchPaper = owl_kg.onto.ResearchPaper
Author = owl_kg.onto.Author
hasAuthor = owl_kg.onto.hasAuthor
with owl_kg.onto:
    class CollaborativePaper(ResearchPaper):
        equivalent_to = [ResearchPaper & hasAuthor.min(5, Author)]  # 5+ authors
# Add paper with 6 authors
paper = ResearchPaper("multi_author_paper")
for i in range(6):
    author = Author(f"author_{i}")
    paper.hasAuthor.append(author)
# Note: strict OWL semantics also needs the authors declared mutually distinct,
# otherwise the open-world assumption blocks the cardinality inference
# Run reasoner
sync_reasoner()
# Paper is automatically classified as CollaborativePaper
print(CollaborativePaper in paper.is_a)  # True
Production: Apache Jena with OWL Reasoner
For production, use Apache Jena with built-in OWL reasoners:
from rdflib import Graph, Namespace
from rdflib.plugins.sparql import prepareQuery
class JenaOWLKnowledgeGraph:
"""Production OWL knowledge graph with Jena reasoner"""
def __init__(self, fuseki_url: str = "http://localhost:3030/research"):
self.fuseki_url = fuseki_url
self.graph = Graph()
self.ns = Namespace("http://researcherai.org/ontology#")
# Load ontology schema
self.load_ontology()
def load_ontology(self):
"""Load OWL ontology definitions"""
ontology_ttl = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix research: <http://researcherai.org/ontology#> .
research:Author rdfs:subClassOf research:Researcher .
research:authored rdfs:domain research:Author .
research:cites rdf:type owl:TransitiveProperty .
research:citedBy owl:inverseOf research:cites .
"""
self.graph.parse(data=ontology_ttl, format="turtle")
def query_with_reasoning(self, sparql_query: str):
"""Execute SPARQL with reasoning enabled"""
# Jena Fuseki can enable reasoner via endpoint config
# Example: http://localhost:3030/research_reasoned/sparql
results = self.graph.query(sparql_query)
return list(results)
# Configure Jena Fuseki with OWL reasoner
fuseki_config = """
<#service> rdf:type fuseki:Service ;
fuseki:name "research" ;
fuseki:serviceQuery "sparql" ;
fuseki:dataset <#dataset> .
<#dataset> rdf:type ja:DatasetTxnMem ;
ja:defaultGraph <#model_inf> .
<#model_inf> rdf:type ja:InfModel ;
ja:reasoner [
ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>
] ;
ja:baseModel <#model_base> .
<#model_base> rdf:type ja:MemoryModel .
"""
OWL Profiles: Which to Use?
OWL has different profiles (subsets) for different use cases:
| Profile | Complexity | Reasoning | Use Case |
|---|---|---|---|
| OWL Full | Maximum expressivity | Undecidable | Research, experimental |
| OWL DL | Description Logic | Complete & decidable | General purpose |
| OWL Lite | Basic class hierarchy | Simple & fast | Simple taxonomies |
| OWL EL | Existential quantification | Polynomial time | Large ontologies (medical) |
| OWL QL | Query-oriented | Log-space | Database integration |
| OWL RL | Rule-based | Polynomial time | Business rules |
For ResearcherAI: Use OWL DL or OWL RL - balance of expressivity and performance.
When to Use OWL vs Just RDF
Use OWL when you need:
1. Automatic classification
   - Classify papers as "influential" based on citation count
   - Identify "interdisciplinary" papers based on concept diversity
2. Inference from rules
   - Infer co-authors from paper authorship
   - Derive expertise areas from publication history
3. Consistency checking
   - Ensure papers have at least one author
   - Validate that publication years are reasonable
4. Property inheritance
   - Symmetric properties (collaboration)
   - Transitive properties (influence, citation chains)
   - Inverse properties (cites ↔ citedBy)
Use just RDF when:
- Simple data storage - no complex reasoning needed
- Performance critical - reasoning is computationally expensive
- Schema is stable - don't need automatic classification
- Explicit is better - want to state all facts explicitly
OWL Reasoning: Dev vs Prod Comparison
| Feature | Owlready2 (Dev) | Apache Jena (Prod) |
|---|---|---|
| Language | Python | Java (Python client) |
| Reasoners | HermiT, Pellet | Jena, Pellet, HermiT |
| Performance | Slower (Python) | Faster (Java) |
| Scalability | Small ontologies | Large ontologies |
| Ease of Use | Very easy (Pythonic) | More complex setup |
| Integration | Great for scripts | Enterprise integration |
| Best for | Development, prototyping | Production, large scale |
Example Use Case for ResearcherAI:
# Use OWL to automatically identify "rising stars" (researchers)
# Rule: A rising star is someone who:
# - Authored papers in the last 3 years
# - Has papers cited > 50 times
# - Collaborates with established researchers
onto = owl_kg.onto
with onto:
    class EstablishedResearcher(onto.Researcher):
        equivalent_to = [
            onto.Researcher & onto.authored.some(
                onto.ResearchPaper & onto.citationCount.some(
                    ConstrainedDatatype(int, min_inclusive=500)
                )
            )
        ]
    class RisingStar(onto.Researcher):
        equivalent_to = [
            onto.Researcher
            & onto.authored.some(
                onto.ResearchPaper
                & onto.publishedYear.some(ConstrainedDatatype(int, min_inclusive=2021))
                & onto.citationCount.some(ConstrainedDatatype(int, min_inclusive=50))
            )
            & onto.collaboratesWith.some(EstablishedResearcher)
        ]
# Run reasoner
sync_reasoner()
# Automatically finds rising stars!
rising_stars = list(onto.RisingStar.instances())
print(f"Rising stars: {[r.label[0] for r in rising_stars]}")
- OWL = RDF + reasoning/inference capabilities
- Use for: Automatic classification, rule-based inference, consistency checking
- Dev: owlready2 (easy Python integration)
- Prod: Apache Jena (performance, scalability)
- ResearcherAI: Could use OWL for researcher classification, paper categorization
OWL reasoning can be computationally expensive. For large knowledge graphs (millions of triples), reasoning can take minutes to hours. Consider:
- Using simpler OWL profiles (EL, QL, RL)
- Pre-computing inferences offline
- Using incremental reasoning
- Caching reasoner results
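For the last point, here is a minimal sketch of caching reasoner output, keyed on a fingerprint of the serialized ontology and reusing the OWLKnowledgeGraph helper defined earlier (the function name, cache file, and fingerprint scheme are all illustrative):
import hashlib
import json
import os

def cached_influential_papers(owl_kg, cache_path="influential_cache.json"):
    """Re-run the expensive reasoner only when the ontology has actually changed"""
    # Fingerprint the current ontology by hashing its serialized form
    owl_kg.save("current_ontology.owl")
    with open("current_ontology.owl", "rb") as f:
        fingerprint = hashlib.sha256(f.read()).hexdigest()

    # Reuse cached results if nothing changed
    if os.path.exists(cache_path):
        with open(cache_path) as f:
            cache = json.load(f)
        if cache.get("fingerprint") == fingerprint:
            return cache["influential"]

    owl_kg.run_reasoner()                                    # the expensive step
    influential = [p.name for p in owl_kg.find_influential_papers()]
    with open(cache_path, "w") as f:
        json.dump({"fingerprint": fingerprint, "influential": influential}, f)
    return influential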
RDF vs Property Graphs: When to Use Each
| Feature | RDF (Jena/RDFLib) | Property Graphs (Neo4j) |
|---|---|---|
| Data Model | Triples (subject-predicate-object) | Nodes with properties + labeled edges |
| Schema | Ontology (OWL) | Schema optional |
| Standards | W3C standards (RDF, OWL, SPARQL) | No universal standard |
| Query Language | SPARQL | Cypher |
| Reasoning | Built-in inferencing (OWL reasoners) | No built-in reasoning |
| Flexibility | Extremely flexible, schema evolution | More rigid structure |
| Performance | Slower for graph traversal | Optimized for graph queries |
| Use Case | Scientific data, linked data, ontologies | Social networks, recommendations |
| Learning Curve | Steeper (ontologies, W3C specs) | Gentler (more intuitive) |
Web Developer Analogy:
- RDF = XML/JSON-LD with strict schemas (TypeScript with interfaces)
- Property Graphs = NoSQL document store with relationships (MongoDB + relationships)
When to Use RDF:
- Need formal ontologies - scientific domains, medical data
- Data integration - combining data from multiple sources
- Reasoning/inference - derive new facts from existing ones
- Linked open data - publish data others can link to
- Interoperability - strict W3C standards
Example: Medical knowledge graphs, DBpedia, Wikidata
When to Use Property Graphs:
- Graph algorithms - shortest path, community detection
- High-performance traversal - social networks, fraud detection
- Flexible schema - rapidly evolving data model
- Intuitive queries - easier to learn and use
- Real-time recommendations - e-commerce, content recommendations
Example: LinkedIn connections, recommendation engines, ResearcherAI
ResearcherAI's Approach
For ResearcherAI, we use property graphs (Neo4j) because:
- Better performance for citation traversal
- Simpler learning curve for developers
- Flexible schema - research data models evolve
- Cypher is intuitive - easier than SPARQL for most queries
- Neo4j has excellent tooling - Browser, Bloom, Graph Data Science
However, if you needed to:
- Integrate with external ontologies (e.g., medical ontologies)
- Publish linked open data
- Use formal reasoning/inference
- Comply with W3C standards
Then RDF with Apache Jena would be the better choice.
Hybrid Approach: RDF + Property Graphs
You can actually use both:
class HybridSemanticKnowledgeGraph:
"""Combine RDF (for ontology) with Neo4j (for performance)"""
def __init__(
self,
neo4j_kg: Neo4jKnowledgeGraph,
rdf_kg: RDFKnowledgeGraph
):
self.neo4j = neo4j_kg # For fast queries
self.rdf = rdf_kg # For ontology and reasoning
def add_paper(self, paper_data: dict):
"""Add to both stores"""
# Add to Neo4j for performance
self.neo4j.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)
# Add to RDF for formal semantics
self.rdf.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)
def query_with_reasoning(self, sparql_query: str):
"""Use RDF reasoner for inference"""
return self.rdf.query_sparql(sparql_query)
    def query_with_performance(self, cypher_query: str):
        """Use Neo4j for fast graph traversal"""
        # The Neo4j wrapper above has no generic Cypher helper, so use its driver directly
        with self.neo4j.driver.session() as session:
            return [dict(record) for record in session.run(cypher_query)]
RDF/SPARQL: Dev vs Prod Comparison
| Feature | RDFLib (Dev) | Apache Jena Fuseki (Prod) |
|---|---|---|
| Storage | In-memory or file | Persistent triple store |
| Query | Python SPARQL | HTTP SPARQL endpoint |
| Scalability | 100k triples | Billions of triples |
| Performance | Slow for large graphs | Optimized indices |
| Reasoning | Basic | Full OWL reasoning |
| Concurrent Access | No | Yes (multi-user) |
| Best for | Development, testing | Production, linked data |
- RDF: Formal ontologies, reasoning, standards compliance, data integration
- Property Graphs: Performance, graph algorithms, simpler queries, flexibility
- ResearcherAI: Uses property graphs for performance, but you can use RDF if needed!
Building Knowledge Graphs: Construction Methods
Now that you understand what knowledge graphs are, let's explore how to build them from various data sources.
Three Main Approaches
There are three primary methods to construct knowledge graphs, depending on your data source structure:
1. Structured Sources (Relational Databases, CSV)
Characteristics:
- Fixed schema (all entities of same type have same attributes)
- Examples: SQL databases, CSV files, Excel spreadsheets
- Easiest to convert to knowledge graphs
Method: Use mapping rules (like R2RML - RDB to RDF Mapping Language)
# Example: CSV to Knowledge Graph
import pandas as pd
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF
# Source: papers.csv
# paper_id,title,year,author_id
# p1,"Attention Is All You Need",2017,a1
# p2,"BERT",2018,a2
df = pd.read_csv("papers.csv")
g = Graph()
ns = Namespace("http://example.org/")
for _, row in df.iterrows():
paper_uri = ns[row['paper_id']]
# Add triples
g.add((paper_uri, RDF.type, ns.ResearchPaper))
g.add((paper_uri, ns.hasTitle, Literal(row['title'])))
g.add((paper_uri, ns.publishedYear, Literal(row['year'])))
g.add((paper_uri, ns.hasAuthor, ns[row['author_id']]))
# Result: Knowledge graph with papers, titles, years, authors
Mapping Rules (R2RML):
# R2RML mapping: SQL table → RDF
@prefix rr: <http://www.w3.org/ns/r2rml#> .
@prefix : <http://example.org/> .
<#PaperMapping> a rr:TriplesMap ;
rr:logicalTable [ rr:tableName "papers" ] ;
rr:subjectMap [
rr:template "http://example.org/paper/{paper_id}" ;
rr:class :ResearchPaper
] ;
rr:predicateObjectMap [
rr:predicate :hasTitle ;
rr:objectMap [ rr:column "title" ]
] .
2. Semi-Structured Sources (JSON, XML)
Characteristics:
- Flexible schema (entities of same type may have different attributes)
- Examples: JSON APIs, XML documents, HTML pages
- Moderate complexity to convert
Method: Use mapping rules adapted to the structure
import json
from rdflib import Graph, Namespace, Literal
# Source: papers.json
json_data = {
"paper1": {
"title": "Attention Is All You Need",
"year": 2017,
"authors": ["Vaswani", "Shazeer"], # Variable length!
"venue": "NIPS" # Optional field
},
"paper2": {
"title": "BERT",
"year": 2018,
"authors": ["Devlin"] # Different number of authors
# No venue field!
}
}
g = Graph()
ns = Namespace("http://example.org/")
for paper_id, paper_data in json_data.items():
paper_uri = ns[paper_id]
g.add((paper_uri, ns.hasTitle, Literal(paper_data['title'])))
g.add((paper_uri, ns.publishedYear, Literal(paper_data['year'])))
# Handle variable-length authors
for author in paper_data.get('authors', []):
g.add((paper_uri, ns.hasAuthor, Literal(author)))
# Handle optional venue
if 'venue' in paper_data:
g.add((paper_uri, ns.publishedAt, Literal(paper_data['venue'])))
3. Unstructured Sources (Text, Images, PDFs)
Characteristics:
- No fixed schema at all
- Examples: Natural language text, research papers (PDF), images
- Most complex to convert - requires AI/NLP
Method: Use NLP techniques to extract entities and relationships
# Example: Extract entities and relationships from text
from transformers import pipeline
# NLP model for Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")
text = """
Attention Is All You Need was published in 2017 by Vaswani and colleagues at Google.
The paper introduced the Transformer architecture for sequence-to-sequence tasks.
"""
# Extract entities
entities = ner(text)
# Schematic result (a CoNLL-03 NER model tags PER / ORG / LOC / MISC):
# [
#   {"entity": "Vaswani", "type": "PER"},
#   {"entity": "Google", "type": "ORG"},
#   {"entity": "Transformer", "type": "MISC"}
# ]
# Dates and work titles (e.g. "2017", the paper title) need a richer NER or LLM-based extractor
# Extract relationships (requires relation extraction model)
# "Attention Is All You Need" -[published_in]-> "2017"
# "Attention Is All You Need" -[authored_by]-> "Vaswani"
# "Vaswani" -[works_at]-> "Google"
# Convert to knowledge graph triples
g = Graph()
ns = Namespace("http://example.org/")
g.add((ns.attention_paper, ns.publishedYear, Literal(2017)))
g.add((ns.attention_paper, ns.hasAuthor, ns.vaswani))
g.add((ns.vaswani, ns.worksAt, ns.google))
Knowledge Graph Construction Process
Here's the end-to-end process for building production knowledge graphs:
Step 0: Define Use Case and Scope
Before building, answer:
- What questions do we need to answer? ("Which papers cite X?", "Who are experts in Y?")
- What metadata do we need? (papers, authors, citations, concepts)
- What relationships matter? (cites, authored_by, discusses)
Example for ResearcherAI:
use_case = {
"questions": [
"Find papers related to transformers",
"Who are the leading researchers in NLP?",
"What papers cite the Attention paper?"
],
"metadata": ["papers", "authors", "citations", "concepts"],
"relationships": ["cites", "authored_by", "discusses", "collaborates_with"]
}
Step 1: Data Collection
Gather data from various sources across your ecosystem:
# Example: Collect from multiple sources
sources = {
"papers_db": "SELECT * FROM papers", # SQL database
"arxiv_api": "https://api.arxiv.org/papers", # REST API
"paper_pdfs": "/data/pdfs/*.pdf", # Unstructured files
"citations_csv": "/data/citations.csv" # CSV file
}
# Each source "speaks" a different language:
# - SQL: Relational tables
# - API: JSON
# - PDFs: Unstructured text
# - CSV: Tabular data
Step 2: Data Cleaning and Standardization
Ensure data quality by:
- Standardizing: Convert all dates to ISO format (YYYY-MM-DD)
- Deduplicating: Merge duplicate author entries
- Validating: Check data types, required fields
- Resolving: Handle inconsistencies (same paper different IDs)
import pandas as pd
# Example: Clean author data
authors_raw = pd.read_csv("authors.csv")
# Standardize names
authors_raw['name'] = authors_raw['name'].str.strip().str.title()
# Remove duplicates (same email = same person)
authors_clean = authors_raw.drop_duplicates(subset=['email'])
# Validate required fields
authors_clean = authors_clean.dropna(subset=['name', 'affiliation'])
# Resolve inconsistencies (assign unique IDs)
authors_clean['author_id'] = range(len(authors_clean))
Step 3: Data Modeling (Convert to RDF Triples)
Transform cleaned data into standardized RDF triples:
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF
g = Graph()
ns = Namespace("http://researcherai.org/")
# For each paper in cleaned data
for _, paper in papers_clean.iterrows():
paper_uri = ns[f"paper/{paper['id']}"]
# Subject-Predicate-Object triples
g.add((paper_uri, RDF.type, ns.ResearchPaper))
g.add((paper_uri, ns.hasTitle, Literal(paper['title'])))
g.add((paper_uri, ns.publishedYear, Literal(paper['year'])))
# Relationships to other entities
for author_id in paper['authors']:
author_uri = ns[f"author/{author_id}"]
g.add((paper_uri, ns.hasAuthor, author_uri))
Step 4: Usage and Insights
Now query the knowledge graph to deliver value:
# Find papers by author
SELECT ?paper ?title WHERE {
?author :hasName "Ashish Vaswani" .
?paper :hasAuthor ?author .
?paper :hasTitle ?title .
}
# Find citation paths
SELECT ?citing ?cited WHERE {
?citing :cites+ ?cited . # Transitive: 1+ hops
}
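The queries above are shown standalone; here is a minimal sketch of running the first one with rdflib against the graph g from Step 3 (it assumes author nodes also carry :hasName triples, which Step 3 by itself does not add).
# Run the author query against the rdflib graph `g` built in Step 3
author_query = """
PREFIX : <http://researcherai.org/>
SELECT ?paper ?title WHERE {
    ?author :hasName "Ashish Vaswani" .
    ?paper :hasAuthor ?author .
    ?paper :hasTitle ?title .
}
"""
for row in g.query(author_query):
    print(row.paper, row.title)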
Comparison of Construction Methods
| Source Type | Complexity | Tools | Best For |
|---|---|---|---|
| Structured | ⭐ Low | R2RML, pandas | Databases, CSV, Excel |
| Semi-Structured | ⭐⭐ Medium | JSON/XML parsers | APIs, config files |
| Unstructured | ⭐⭐⭐ High | NLP, LLMs, OCR | PDFs, text, images |
ResearcherAI's Approach:
- Structured: arXiv metadata (JSON API) → Easy conversion
- Semi-Structured: Paper metadata from multiple APIs → JSON parsing
- Unstructured: Paper PDFs → NLP for concept extraction
General guidelines:
- Start with structured sources - easiest to convert and validate
- Clean data thoroughly - garbage in, garbage out
- Define schema first - know what entities and relationships you need
- Validate incrementally - check quality at each step
- Use existing ontologies - don't reinvent the wheel (e.g., Schema.org)
Hands-On: Building a Research Paper Knowledge Graph
Now let's walk through a complete example of building a knowledge graph from structured data using the declarative SPARQL CONSTRUCT approach.
What You'll Learn:
- How to transform CSV data into RDF triples
- Using SPARQL CONSTRUCT queries for mapping
- Incrementally building a knowledge graph
- Visualizing the resulting graph
Step 1: Input Data
Imagine you have research paper data in CSV files:
papers.csv:
domain,title,year,abstract
NLP,Attention Is All You Need,2017,Transformer architecture for sequence-to-sequence
NLP,BERT,2018,Bidirectional encoder representations
CV,ResNet,2015,Deep residual learning for image recognition
authors.csv:
name,affiliation,domain
Ashish Vaswani,Google Brain,NLP
Jacob Devlin,Google AI,NLP
Kaiming He,Facebook AI,CV
citations.csv:
citing_paper,cited_paper,citation_type
BERT,Attention Is All You Need,builds_on
ResNet,VGGNet,improves
concepts.csv:
paper,concept,importance
Attention Is All You Need,self-attention,high
Attention Is All You Need,transformers,high
BERT,bidirectional,high
ResNet,residual-connections,high
Let's load this data:
import pandas as pd
from rdflib import Graph, Literal, Namespace
from rdflib.plugins.sparql.processor import prepareQuery
# Load CSV files
papers_df = pd.read_csv("papers.csv").fillna('')
authors_df = pd.read_csv("authors.csv").fillna('')
citations_df = pd.read_csv("citations.csv").fillna('')
concepts_df = pd.read_csv("concepts.csv").fillna('')
# Show distribution
data = {
"Papers": len(papers_df),
"Authors": len(authors_df),
"Citations": len(citations_df),
"Concepts": len(concepts_df)
}
print(pd.DataFrame.from_dict(data, orient='index', columns=['Count']))
# Output:
# Count
# Papers 3
# Authors 3
# Citations 2
# Concepts 4
Step 2: Define the Knowledge Graph Schema
Based on our data, we define the schema:
# Schema for Research Papers
@prefix research: <http://example.org/research#> .
# Classes (Entity Types)
research:Paper a rdfs:Class .
research:Author a rdfs:Class .
research:Concept a rdfs:Class .
research:ResearchDomain a rdfs:Class .
# Properties
research:hasTitle a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:string .
research:publishedYear a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:integer .
research:hasAbstract a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:string .
# Relationships
research:authoredBy a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Author .
research:cites a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Paper .
research:discusses a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Concept .
research:belongsToDomain a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:ResearchDomain .
Step 3: SPARQL CONSTRUCT Queries for Mapping
Now we define SPARQL CONSTRUCT queries to transform CSV data into RDF triples:
Query 1: Create Paper Entities
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:hasAbstract ?abstract .
?paper research:belongsToDomain ?domainIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?title, " ", "_"))) AS ?paper)
BIND(IRI(CONCAT("http://data.example.org/domain/",
?domain)) AS ?domainIRI)
}
Web Developer Analogy:
// SPARQL CONSTRUCT is like a template for creating objects
const papers = csvData.map(row => ({
id: `http://data.example.org/paper/${row.title.replace(/ /g, '_')}`,
type: "Paper",
title: row.title,
year: row.year,
abstract: row.abstract,
domain: `http://data.example.org/domain/${row.domain}`
}))
Query 2: Create Author Entities
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?author a research:Author .
?author research:hasName ?name .
?author research:affiliation ?affiliation .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/author/",
REPLACE(?name, " ", "_"))) AS ?author)
}
Query 3: Create Citation Relationships
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?citingPaperIRI research:cites ?citedPaperIRI .
?citingPaperIRI research:citationType ?citation_type .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}
Query 4: Create Concept Relationships
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paperIRI research:discusses ?conceptIRI .
?conceptIRI research:importance ?importance .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?paper, " ", "_"))) AS ?paperIRI)
BIND(IRI(CONCAT("http://data.example.org/concept/",
?concept)) AS ?conceptIRI)
}
Step 4: The Transform Function
This function applies SPARQL CONSTRUCT queries to DataFrame rows:
import re
from rdflib import Graph, Literal
from rdflib.plugins.sparql.processor import prepareQuery
def transform(df: pd.DataFrame, construct_query: str,
first: bool = False) -> Graph:
"""Transform Pandas DataFrame to RDFLib Graph using SPARQL CONSTRUCT.
Args:
df: Input DataFrame with CSV data
construct_query: SPARQL CONSTRUCT query template
first: If True, only process first row (for testing)
Returns:
RDF Graph with constructed triples
"""
# Setup graphs
query_graph = Graph()
result_graph = Graph()
# Parse the SPARQL query
query = prepareQuery(construct_query)
# Clean column names (remove special characters)
invalid_pattern = re.compile(r"[^\w_]+")
headers = dict((k, invalid_pattern.sub("_", k)) for k in df.columns)
# Process each row
for _, row in df.iterrows():
# Create variable bindings: column name -> cell value
binding = dict((headers[k], Literal(row[k]))
for k in df.columns if len(str(row[k])) > 0)
# Execute query with bindings
results = query_graph.query(query, initBindings=binding)
# Add resulting triples to graph
for triple in results:
result_graph.add(triple)
# Stop after first row if testing
if first:
break
return result_graph
How It Works:
- Parse query: Prepare the SPARQL CONSTRUCT template
- For each row: Create variable bindings (CSV columns → SPARQL variables)
- Execute query: Replace variables with values, construct triples
- Add to graph: Accumulate all triples in result graph
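To make the binding step concrete, here is roughly what the bindings look like for the first row of papers.csv (the regex-based column cleaning is skipped here because these column names are already valid SPARQL variable names):
from rdflib import Literal
# Build the same kind of binding dict that transform() passes to initBindings
row = papers_df.iloc[0]
binding = {col: Literal(row[col]) for col in papers_df.columns if len(str(row[col])) > 0}
print(binding)
# e.g. {'domain': Literal('NLP'), 'title': Literal('Attention Is All You Need'),
#       'year': Literal(2017), 'abstract': Literal('Transformer architecture for sequence-to-sequence')}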
Step 5: Build the Knowledge Graph Incrementally
# Initialize empty knowledge graph
kg = Graph()
# Step 1: Add papers
construct_papers = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:hasAbstract ?abstract .
?paper research:belongsToDomain ?domainIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?title, " ", "_"))) AS ?paper)
BIND(IRI(CONCAT("http://data.example.org/domain/",
?domain)) AS ?domainIRI)
}
"""
# Test with first row
print("Testing with first paper:")
print(transform(papers_df, construct_papers, first=True).serialize(format='turtle'))
# Output:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
# a research:Paper ;
# research:hasTitle "Attention Is All You Need" ;
# research:publishedYear 2017 ;
# research:hasAbstract "Transformer architecture..." ;
# research:belongsToDomain <http://data.example.org/domain/NLP> .
# Add all papers to knowledge graph
kg += transform(papers_df, construct_papers)
print(f"After adding papers: {len(kg)} triples")
# Output: After adding papers: 15 triples (3 papers × 5 triples each, including rdf:type)
# Step 2: Add authors
construct_authors = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?author a research:Author .
?author research:hasName ?name .
?author research:affiliation ?affiliation .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/author/",
REPLACE(?name, " ", "_"))) AS ?author)
}
"""
kg += transform(authors_df, construct_authors)
print(f"After adding authors: {len(kg)} triples")
# Output: After adding authors: 24 triples (3 authors × 3 triples each)
# Step 3: Add citations
construct_citations = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?citingPaperIRI research:cites ?citedPaperIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}
"""
kg += transform(citations_df, construct_citations)
print(f"After adding citations: {len(kg)} triples")
# Output: After adding citations: 26 triples (2 citation rows)
# Step 4: Add concepts
construct_concepts = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paperIRI research:discusses ?conceptIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?paper, " ", "_"))) AS ?paperIRI)
BIND(IRI(CONCAT("http://data.example.org/concept/",
?concept)) AS ?conceptIRI)
}
"""
kg += transform(concepts_df, construct_concepts)
print(f"Final knowledge graph: {len(kg)} triples")
# Output: Final knowledge graph: 30 triples (4 concept links added)
Step 6: Query the Knowledge Graph
Now we can query the constructed graph:
# Query 1: Find all NLP papers
query_nlp_papers = """
PREFIX research: <http://example.org/research#>
SELECT ?title ?year
WHERE {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
}
ORDER BY ?year
"""
results = kg.query(query_nlp_papers)
for row in results:
print(f"{row.title} ({row.year})")
# Output:
# Attention Is All You Need (2017)
# BERT (2018)
# Query 2: Find papers citing "Attention Is All You Need"
query_citations = """
PREFIX research: <http://example.org/research#>
SELECT ?citing_title
WHERE {
?citing research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> .
?citing research:hasTitle ?citing_title .
}
"""
results = kg.query(query_citations)
for row in results:
print(f"Paper citing Attention: {row.citing_title}")
# Output: Paper citing Attention: BERT
# Query 3: Find all concepts discussed in NLP papers
query_concepts = """
PREFIX research: <http://example.org/research#>
SELECT ?concept
WHERE {
?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
?paper research:discusses ?conceptIRI .
BIND(REPLACE(STR(?conceptIRI), ".*/", "") AS ?concept)
}
"""
results = kg.query(query_concepts)
concepts = [row.concept for row in results]
print(f"NLP concepts: {', '.join(concepts)}")
# Output: NLP concepts: self-attention, transformers, bidirectional
Step 7: Visualize the Knowledge Graph
import networkx as nx
import matplotlib.pyplot as plt
from rdflib import Graph, URIRef
def rdf_to_nx(rdf_graph: Graph) -> nx.DiGraph:
"""Convert RDF graph to NetworkX directed graph."""
G = nx.DiGraph()
for s, p, o in rdf_graph:
# Extract local names (remove URI prefixes)
subject = str(s).split('/')[-1]
predicate = str(p).split('#')[-1]
obj = str(o).split('/')[-1] if isinstance(o, URIRef) else str(o)
# Add nodes and edges
G.add_edge(subject, obj, label=predicate)
return G
# Convert to NetworkX
G = rdf_to_nx(kg)
# Visualize
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(G, seed=42)
# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=1000, node_color='lightblue')
# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='gray', arrows=True,
arrowsize=20, connectionstyle='arc3,rad=0.1')
# Draw labels
nx.draw_networkx_labels(G, pos, font_size=8)
# Draw edge labels
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6)
plt.title("Research Paper Knowledge Graph")
plt.axis('off')
plt.tight_layout()
plt.show()
Step 8: Save the Knowledge Graph
# Save to Turtle file (human-readable RDF format)
kg.serialize(destination='research_papers.ttl', format='turtle')
print("Knowledge graph saved to research_papers.ttl")
# The file content looks like:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
# a research:Paper ;
# research:hasTitle "Attention Is All You Need" ;
# research:publishedYear 2017 ;
# research:belongsToDomain <http://data.example.org/domain/NLP> ;
# research:discusses <http://data.example.org/concept/self-attention>,
# <http://data.example.org/concept/transformers> .
#
# <http://data.example.org/paper/BERT>
# a research:Paper ;
# research:hasTitle "BERT" ;
# research:publishedYear 2018 ;
# research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> ;
# research:discusses <http://data.example.org/concept/bidirectional> .
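Loading the saved file back is symmetrical; a quick sketch:
from rdflib import Graph
# Reload the serialized knowledge graph from disk
kg_loaded = Graph()
kg_loaded.parse("research_papers.ttl", format="turtle")
print(f"Reloaded {len(kg_loaded)} triples")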
Key Takeaways
Declarative Approach Benefits:
- Separation of concerns: Data (CSV) separate from logic (SPARQL)
- Reusable queries: Same query works for any CSV with same schema
- Incremental building: Add entities and relationships step-by-step
- Easy to validate: Test queries on single rows first
- Standard-based: SPARQL is W3C standard
When to Use This Approach:
- ✅ Have structured data (CSV, databases)
- ✅ Need standard-compliant knowledge graphs
- ✅ Want to query with SPARQL
- ✅ Require formal schema/ontology
- ✅ Building production knowledge graphs
ResearcherAI Uses This For:
- arXiv paper metadata → Knowledge graph
- Citation networks from Semantic Scholar
- Author collaboration graphs
- Concept hierarchies from papers
Declarative (SPARQL CONSTRUCT): "What you want" - Define the shape of the output graph
Imperative (Python loops): "How to do it" - Step-by-step instructions
SPARQL CONSTRUCT is declarative - you describe the desired graph structure, and the engine figures out how to create it. This is more maintainable and less error-prone than imperative loops.
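For contrast, here is a rough sketch of the imperative equivalent of the paper mapping; the IRI-minting and property logic ends up interleaved with iteration code instead of being declared once in a CONSTRUCT template.
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF
research = Namespace("http://example.org/research#")
g = Graph()
for _, row in papers_df.iterrows():  # papers_df loaded in Step 1
    # Mint IRIs by hand instead of via BIND(IRI(CONCAT(...)))
    paper = URIRef("http://data.example.org/paper/" + row["title"].replace(" ", "_"))
    domain = URIRef("http://data.example.org/domain/" + row["domain"])
    g.add((paper, RDF.type, research.Paper))
    g.add((paper, research.hasTitle, Literal(row["title"])))
    g.add((paper, research.publishedYear, Literal(int(row["year"]))))
    g.add((paper, research.hasAbstract, Literal(row["abstract"])))
    g.add((paper, research.belongsToDomain, domain))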
Production Decision: Neo4j vs Apache Jena Fuseki
Critical Understanding: Neo4j and Apache Jena Fuseki are NOT interchangeable alternatives. They serve fundamentally different use cases!
The Key Question: Which data model fits your use case?
When to Use Neo4j in Production
Use Neo4j when you need:
1. High-Performance Graph Traversal
// Find shortest path between papers (fast!)
MATCH path = shortestPath(
  (a:Paper {id: "paper1"})-[:CITES*]-(b:Paper {id: "paper50"})
)
RETURN path
- Neo4j is optimized for this (index-free adjacency)
- Jena/RDF is much slower for deep graph traversal
2. Graph Algorithms
- PageRank, Louvain community detection
- Shortest paths, centrality measures
- Neo4j Graph Data Science library (see the PageRank sketch below)
- RDF/Jena: No built-in graph algorithms
3. Real-Time Recommendations
- Friend recommendations (social networks)
- Paper recommendations based on citations
- Collaborative filtering
- Performance critical - Neo4j is faster
4. Intuitive Queries
// Cypher is very readable
MATCH (p:Paper)-[:CITES]->(cited:Paper)
WHERE p.year > 2020
RETURN cited.title, count(*) as citations
ORDER BY citations DESC
- Easier for developers to learn than SPARQL
- Better tooling (Neo4j Browser, Bloom)
5. Flexible Schema
- Add new node labels and edge types dynamically
- Schema evolves with your application
- No formal ontology needed
Example: ResearcherAI uses Neo4j for:
- Citation network traversal
- Finding related papers (graph algorithms)
- Author collaboration networks
- Fast query performance
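To make the graph-algorithms point concrete, here is a minimal PageRank sketch using the Neo4j Python driver and the Graph Data Science plugin; the connection details and the Paper/CITES labels are assumptions about how the citation graph is modeled.
from neo4j import GraphDatabase
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    # Project the citation network into the GDS in-memory graph catalog
    session.run("CALL gds.graph.project('papers', 'Paper', 'CITES')")
    # Rank papers by citation-based PageRank
    result = session.run("""
        CALL gds.pageRank.stream('papers')
        YIELD nodeId, score
        RETURN gds.util.asNode(nodeId).title AS title, score
        ORDER BY score DESC LIMIT 10
    """)
    for record in result:
        print(record["title"], round(record["score"], 3))
driver.close()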
When to Use Apache Jena Fuseki in Production
Use Apache Jena Fuseki when you need:
1. Formal Ontologies
# Define strict schema
:ResearchPaper rdfs:subClassOf :Publication .
:ConferencePaper rdfs:subClassOf :ResearchPaper .
:JournalPaper rdfs:subClassOf :ResearchPaper .
- W3C standard ontologies (OWL)
- Strict domain modeling
- Neo4j: No formal ontology support
2. Reasoning and Inference
# Automatic inference: if A cites B and B cites C, find transitive citations
SELECT ?paper ?influenced
WHERE {
  ?paper :cites+ ?influenced . # Transitive closure via property path
}
- OWL reasoners infer new facts
- Automatic classification
- Neo4j: No built-in reasoning
3. W3C Standards Compliance
- Publishing Linked Open Data (LOD)
- Interoperability with DBpedia, Wikidata
- RDF, SPARQL, OWL standards
- Neo4j: Proprietary (Cypher is not a W3C standard)
4. Data Integration
- Merging data from multiple RDF sources
- Schema mapping and alignment
- Federated SPARQL queries across endpoints (see the endpoint query sketch below)
- Neo4j: Harder to integrate with external sources
5. Scientific/Medical Domains
- Existing domain ontologies (Gene Ontology, SNOMED CT)
- Formal knowledge representation
- Regulatory compliance requirements
- Neo4j: Not suitable for formal ontologies
Example: Use Jena Fuseki for:
- Medical knowledge graphs (SNOMED, ICD-10)
- Scientific literature with formal taxonomies
- Publishing linked open data
- Integration with Wikidata/DBpedia
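For comparison with the Neo4j sketch above, here is a small sketch of querying a Fuseki endpoint over HTTP with SPARQLWrapper; the endpoint URL and dataset name are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON
sparql = SPARQLWrapper("http://localhost:3030/research/sparql")  # Fuseki dataset endpoint
sparql.setQuery("""
    PREFIX research: <http://example.org/research#>
    SELECT ?title WHERE {
        ?paper a research:Paper ;
               research:hasTitle ?title .
    } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()
for binding in results["results"]["bindings"]:
    print(binding["title"]["value"])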
Production Performance Comparison
| Operation | Neo4j | Apache Jena Fuseki | Winner |
|---|---|---|---|
| Graph Traversal (5 hops) | ~10ms | ~500ms+ | Neo4j (50x faster) |
| Complex Cypher/SPARQL | Fast | Moderate | Neo4j |
| Write Throughput | 10k-100k/sec | 1k-10k/sec | Neo4j |
| OWL Reasoning | Not supported | Supported | Jena |
| Inference | Manual (application) | Automatic (reasoner) | Jena |
| Standards Compliance | Proprietary | W3C standards | Jena |
| Graph Algorithms | Built-in (GDS) | Not supported | Neo4j |
| Storage Efficiency | Good | Moderate (triples overhead) | Neo4j |
When to Use BOTH (Hybrid Architecture)
You can use both together for the best of both worlds:
class HybridProductionKnowledgeGraph:
"""Use Neo4j for performance, Jena for reasoning"""
def __init__(self):
# Neo4j for fast graph operations
self.neo4j = Neo4jKnowledgeGraph(
uri="bolt://neo4j-prod.example.com:7687"
)
# Jena for ontology and reasoning
self.jena = JenaKnowledgeGraph(
endpoint_url="http://fuseki-prod.example.com:3030/research/sparql"
)
def add_paper(self, paper_data: dict):
"""Add to both databases"""
# Add to Neo4j for fast queries
self.neo4j.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)
# Add to Jena for reasoning
self.jena.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)
def find_related_papers(self, paper_id: str):
"""Use Neo4j for fast graph traversal"""
return self.neo4j.find_related_papers(paper_id)
def classify_paper_topic(self, paper_id: str):
"""Use Jena reasoner for automatic classification"""
sparql = f"""
PREFIX research: <http://researcherai.org/ontology#>
SELECT ?topic
WHERE {{
<http://researcherai.org/papers/{paper_id}>
research:hasInferredTopic ?topic .
}}
"""
return self.jena.query(sparql)
def sync_databases(self):
"""Periodically sync data between Neo4j and Jena"""
# Export from Neo4j, import to Jena
# Or vice versa
pass
Use Hybrid When:
- Need both fast traversal AND formal reasoning
- Want graph algorithms + automatic classification
- Building enterprise knowledge graph with complex requirements
- Have resources to maintain two databases
Trade-offs:
- ❌ More complex architecture
- ❌ Data synchronization overhead
- ❌ Higher infrastructure costs
- ✅ Best of both worlds
ResearcherAI's Production Choice: Neo4j
Why ResearcherAI uses Neo4j (not Jena) in production:
# ResearcherAI's production config
PRODUCTION_CONFIG = {
"knowledge_graph": "Neo4j", # Not Apache Jena
# Why Neo4j? See reasons below
}
Reasons:
1. Primary Use Case: Citation Network Traversal
- Finding related papers by citation paths
- Author collaboration networks
- Paper recommendation based on graph structure
- Neo4j excels at this, Jena is slow
2. Performance Requirements
- Real-time query responses (under 100ms)
- High read throughput (1000s queries/sec)
- Neo4j is 10-50x faster for graph queries
3. No Formal Ontology Needed
- Research paper schema is relatively simple
- Don't need OWL reasoning
- Don't need automatic inference
- Jena's strengths aren't needed
4. Developer Experience
- Cypher is easier to learn than SPARQL
- Better visualization tools (Neo4j Browser)
- Larger community and resources
- Faster development
5. Graph Algorithms
- Use PageRank to find influential papers
- Community detection for research clusters
- Shortest paths for paper relationships
- Neo4j Graph Data Science library
When ResearcherAI WOULD use Jena instead:
If the requirements were:
- ❌ Need formal research ontologies (ACM Computing Classification)
- ❌ Must publish linked open data
- ❌ Need automatic paper classification via reasoning
- ❌ Integrate with existing RDF sources (DBpedia)
- ❌ W3C standards compliance required
Then use Apache Jena Fuseki in production.
Quick Decision Guide
Choose Neo4j if:
- ✅ Primary use case: graph traversal, recommendations
- ✅ Need high performance (real-time queries)
- ✅ Want graph algorithms (PageRank, community detection)
- ✅ Flexible schema, rapid development
- ✅ Team familiar with SQL-like queries (Cypher)
Choose Apache Jena Fuseki if:
- ✅ Need formal ontologies (OWL)
- ✅ Require reasoning and inference
- ✅ Publishing linked open data
- ✅ W3C standards compliance
- ✅ Integrating with existing RDF sources
Choose Both (Hybrid) if:
- ✅ Need graph performance AND reasoning
- ✅ Have resources for complex architecture
- ✅ Enterprise requirements
For Most Projects: Start with Neo4j. Only add Jena if you truly need formal ontologies or reasoning.
Real-World Production Examples
Companies Using Neo4j:
- LinkedIn - Professional network graph
- eBay - Product recommendations
- Airbnb - Location-based search
- Walmart - Supply chain optimization
- NASA - Lessons learned database
Companies/Projects Using Apache Jena and similar RDF/SPARQL stacks:
- BBC - Linked data for content
- Wikidata - Knowledge base
- Getty Vocabularies - Art metadata
- UK Government - Open data publishing
- PubMed - Biomedical ontologies
Notice the Pattern:
- Neo4j: Performance-critical, real-time applications
- Jena: Formal knowledge, standards, publishing
- Default choice for most projects: Neo4j (performance, ease of use)
- Choose Jena if: You need formal ontologies, reasoning, or standards compliance
- ResearcherAI uses Neo4j: Because citation networks need fast traversal, not formal reasoning
- Start simple: Begin with one database, add the other only if truly needed
Many projects think they need formal ontologies and reasoning, but actually just need a fast graph database. Start with Neo4j. Only add Jena if you have a clear use case for OWL reasoning or standards compliance.
Part 3: Hybrid RAG - Best of Both Worlds
Neither vector search nor knowledge graphs alone are sufficient:
Vector Search Alone:
- ✅ Finds semantically similar content
- ❌ Can't answer relationship questions
- ❌ No structured reasoning
- ❌ Can't follow citations, collaborations
Knowledge Graph Alone:
- ✅ Excellent at relationships
- ✅ Multi-hop reasoning
- ❌ Requires exact entity matches
- ❌ Can't do semantic similarity
Solution: Hybrid RAG - Combine both!
Hybrid RAG Architecture
Implementing Hybrid RAG
from typing import List, Dict, Tuple
from enum import Enum
class QueryType(Enum):
SEMANTIC = "semantic" # "Papers about attention mechanisms"
RELATIONAL = "relational" # "Papers citing X"
HYBRID = "hybrid" # "Papers about Y citing X"
class HybridRAG:
"""Hybrid RAG combining vector search and knowledge graphs"""
def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model
def classify_query(self, query: str) -> QueryType:
"""Determine query type"""
# Keywords indicating relational queries
relational_keywords = [
"cite", "cites", "citing", "cited",
"author", "authored", "written by",
"collaborate", "coauthor",
"published in", "appeared in"
]
# Keywords indicating semantic queries
semantic_keywords = [
"about", "discuss", "related to",
"similar to", "like", "regarding"
]
query_lower = query.lower()
has_relational = any(kw in query_lower for kw in relational_keywords)
has_semantic = any(kw in query_lower for kw in semantic_keywords)
if has_relational and has_semantic:
return QueryType.HYBRID
elif has_relational:
return QueryType.RELATIONAL
else:
return QueryType.SEMANTIC
def semantic_search(self, query: str, top_k: int = 5) -> List[Dict]:
"""Pure vector search"""
query_embedding = self.embedding_model.encode(query)
results = self.vector_store.search(query_embedding, top_k)
return [
{
"text": text,
"score": score,
"source": "vector"
}
for text, score in results
]
def relational_search(self, query: str) -> List[Dict]:
"""Pure graph search"""
# Parse query for entities and relationships
# Simplified - in production, use NER + intent detection
if "citing" in query.lower():
# Extract paper being cited
# Simplified extraction
cited_paper = self._extract_paper_mention(query)
citing_papers = self.knowledge_graph.find_citing_papers(cited_paper)
return [
{
"text": paper["title"],
"year": paper["year"],
"source": "graph"
}
for paper in citing_papers
]
elif "author" in query.lower():
author_name = self._extract_author_name(query)
papers = self.knowledge_graph.find_papers_by_author(author_name)
return [
{
"text": paper["title"],
"year": paper["year"],
"source": "graph"
}
for paper in papers
]
return []
def hybrid_search(
self,
query: str,
top_k: int = 10
) -> List[Dict]:
"""Combined vector + graph search"""
# Step 1: Vector search for semantic similarity
semantic_results = self.semantic_search(query, top_k=top_k)
# Step 2: Graph search for relationships
relational_results = self.relational_search(query)
# Step 3: Merge and deduplicate
all_results = semantic_results + relational_results
seen = set()
unique_results = []
for result in all_results:
key = result["text"]
if key not in seen:
seen.add(key)
unique_results.append(result)
# Step 4: Re-rank using both semantic and structural scores
reranked = self._rerank_results(unique_results, query)
return reranked[:top_k]
def _rerank_results(self, results: List[Dict], query: str) -> List[Dict]:
"""Re-rank results combining semantic + structural scores"""
for result in results:
# Semantic score from vector search
semantic_score = result.get("score", 0.5)
# Structural score from graph (e.g., citation count, centrality)
structural_score = 0.5 # Simplified
# Combined score (weighted average)
result["final_score"] = 0.6 * semantic_score + 0.4 * structural_score
# Sort by final score
results.sort(key=lambda x: x.get("final_score", 0), reverse=True)
return results
def search(self, query: str, top_k: int = 5) -> List[Dict]:
"""Main search interface - automatically routes to appropriate method"""
query_type = self.classify_query(query)
if query_type == QueryType.SEMANTIC:
return self.semantic_search(query, top_k)
elif query_type == QueryType.RELATIONAL:
return self.relational_search(query)
else: # HYBRID
return self.hybrid_search(query, top_k)
# Usage
hybrid_rag = HybridRAG(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)
# Different query types automatically routed
queries = [
"Papers about attention mechanisms", # SEMANTIC
"Papers citing 'Attention is All You Need'", # RELATIONAL
"Papers about transformers citing early NLP work" # HYBRID
]
for query in queries:
print(f"\nQuery: {query}")
results = hybrid_rag.search(query, top_k=3)
for i, result in enumerate(results, 1):
print(f"{i}. {result['text']} (source: {result.get('source', 'hybrid')})")
Part 4: GraphRAG - Knowledge Graph Enhanced RAG
What Exactly is GraphRAG?
GraphRAG is NOT just "using a knowledge graph with RAG". It's a specific approach where the knowledge graph actively enhances the retrieval process by:
- Expanding initial search results with graph-connected context
- Enriching retrieved documents with relationship information
- Providing multi-hop reasoning paths through the graph
Web Developer Analogy:
// Traditional RAG = Direct database query
const results = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
return results // Just the matching articles
// GraphRAG = Query + JOIN on relationships
const initial = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
const expanded = initial.map(article => ({
...article,
citations: db.query("SELECT * FROM articles WHERE id IN article.cited_papers"),
relatedConcepts: db.query("SELECT * FROM concepts WHERE article_id = article.id"),
authorExpertise: db.query("SELECT * FROM articles WHERE author_id = article.author_id")
}))
return expanded // Original + graph-enriched context
The Problem GraphRAG Solves
Scenario: User asks "How do transformers handle long-range dependencies?"
Traditional RAG (just vector search):
# Returns: Top 3 papers about transformers
results = [
"Attention is All You Need (Vaswani, 2017)",
"BERT (Devlin, 2018)",
"GPT-3 (Brown, 2020)"
]
# ❌ Missing: WHY transformers were invented (what came before)
# ❌ Missing: HOW they evolved (what improved them)
# ❌ Missing: WHAT problems remain (recent criticisms)
GraphRAG (vector search + graph expansion):
# Returns: Top 3 papers + graph context
results = {
"initial_papers": [
"Attention is All You Need (Vaswani, 2017)",
"BERT (Devlin, 2018)",
"GPT-3 (Brown, 2020)"
],
"cited_papers": [ # WHAT CAME BEFORE (context)
"Neural Machine Translation by Jointly Learning to Align (Bahdanau, 2014)",
"Sequence to Sequence Learning (Sutskever, 2014)",
"Long Short-Term Memory (Hochreiter, 1997)" # The problem transformers solved!
],
"citing_papers": [ # WHAT CAME AFTER (evolution)
"Reformer: Efficient Transformer (Kitaev, 2020)", # Addressed limitations
"Linformer (Wang, 2020)", # Improved efficiency
"Performer (Choromanski, 2020)" # Better for long sequences
],
"related_concepts": [
"Self-attention", "Positional encoding", "Multi-head attention"
]
}
# ✅ Has: Historical context (why transformers exist)
# ✅ Has: Evolution (how they improved)
# ✅ Has: Current solutions (what's happening now)
Result: LLM can now give a complete historical narrative, not just describe what transformers are!
Why Traditional RAG Isn't Enough
Problem 1: No Historical Context
# User question: "Why were transformers invented?"
# Traditional RAG retrieves:
papers = [
"Attention is All You Need (2017): Transformers use self-attention..."
]
# ❌ Doesn't explain what problem RNNs/LSTMs had
# ❌ Doesn't show what transformers improved over
# GraphRAG retrieves + expands via citations:
graph_context = {
"main_paper": "Attention is All You Need (2017): Transformers use self-attention...",
"cited_papers": [
"LSTM (1997): LSTMs struggle with sequences >100 tokens",
"RNN vanishing gradients (1994): RNNs can't learn long dependencies"
]
}
# ✅ Now LLM can explain: "RNNs had vanishing gradients, LSTMs helped but
# still struggled with long sequences, transformers solved this with self-attention"
Problem 2: Missing Evolution
# User question: "How have transformers been improved since 2017?"
# Traditional RAG:
papers = ["Attention is All You Need (2017)"] # The original paper
# ❌ Doesn't show what came after
# GraphRAG (with citing papers):
papers = {
"initial": "Attention is All You Need (2017)",
"citing": [
"BERT (2018): Bidirectional pretraining",
"GPT-2 (2019): Larger scale",
"T5 (2019): Text-to-text framework",
"Reformer (2020): Efficient attention",
"Switch Transformers (2021): Sparse models"
]
}
# ✅ Can now trace the evolution timeline
Problem 3: No Multi-Hop Reasoning
# User question: "What recent work improves on BERT's limitations?"
# Traditional RAG: Only finds papers mentioning "BERT limitations"
# ❌ Might miss papers that solve the problem without mentioning BERT
# GraphRAG:
# 1. Find BERT paper
# 2. Find papers citing BERT
# 3. Filter for papers discussing "limitations" or "improvements"
# 4. Find papers those papers cite (multi-hop)
# ✅ Discovers solutions even if they don't directly mention BERT
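A minimal sketch of that multi-hop expansion in Cypher, assuming the citation graph lives in Neo4j with Paper nodes, CITES edges pointing from the citing paper to the cited one, and a driver from the neo4j Python package:
# Papers within two citation hops "above" BERT (papers citing BERT, and papers citing those)
multi_hop_query = """
MATCH (bert:Paper {title: 'BERT'})<-[:CITES*1..2]-(follow_up:Paper)
WHERE follow_up.abstract CONTAINS 'limitation'
   OR follow_up.abstract CONTAINS 'improve'
RETURN DISTINCT follow_up.title AS title, follow_up.year AS year
ORDER BY year
"""
with driver.session() as session:
    for record in session.run(multi_hop_query):
        print(record["title"], record["year"])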
GraphRAG vs Traditional RAG vs Hybrid RAG
| Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Traditional RAG | Vector search → Top-K docs → LLM | Simple, fast | No context, no relationships |
| Hybrid RAG | Vector search + Graph search → Merge → LLM | Combines semantic + relational | Still retrieves documents independently |
| GraphRAG | Vector search → Graph expansion → Enhanced context → LLM | Rich context, multi-hop, relationships | More complex, slower |
Key Difference:
- Hybrid RAG: Uses graph for direct queries ("papers citing X")
- GraphRAG: Uses graph to expand vector search results with connected context
When to Use GraphRAG
Use GraphRAG when:
1. Historical Context Matters
- Research paper Q&A (evolution of ideas)
- Patent analysis (prior art, citations)
- Scientific literature review
2. Relationships Are Important
- "How did this idea evolve?"
- "What influenced this paper?"
- "What improved on this approach?"
3. Multi-Hop Reasoning Needed
- "What recent work addresses limitations of X?"
- "Find papers in the intellectual lineage of X"
4. You Have a Knowledge Graph
- Already built citation network
- Already have entity relationships
- Graph is well-structured
DON'T use GraphRAG when:
- Simple Keyword Matching - Traditional search is fine
- No Graph Available - Building a graph is expensive
- Real-Time Speed Critical - Graph traversal adds latency
- Documents Are Independent - No meaningful relationships
GraphRAG: Two Approaches
There are two main approaches to GraphRAG:
Approach 1: Graph-Enhanced Retrieval (Shown Here)
How it works:
- Use vector search to find initial relevant documents
- Use knowledge graph to expand those documents with connected context
- Feed enriched context to LLM
Pros:
- ✅ Simple to implement
- ✅ Works with existing knowledge graphs (Neo4j, etc.)
- ✅ Explainable (you can see the expansion)
Cons:
- ❌ Depends on quality of knowledge graph
- ❌ Expansion can be noisy
- ❌ Slower than pure vector search
Example: ResearcherAI uses this approach
Approach 2: Microsoft GraphRAG (Community Summaries)
How it works (different from approach 1!):
- Build knowledge graph from documents
- Detect communities in the graph (clusters of related entities)
- Generate LLM summaries of each community
- At query time, search community summaries
- Retrieve relevant communities + their documents
Pros:
- ✅ Handles "global" questions ("What are the main themes?")
- ✅ Summarizes large corpora
- ✅ Finds patterns across documents
Cons:
- ❌ More complex (requires community detection + summarization)
- ❌ Higher upfront cost (LLM summarization of all communities)
- ❌ Less direct than traditional retrieval
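A minimal sketch of the community-summary idea using NetworkX community detection and a hypothetical llm.generate call; Microsoft's actual GraphRAG pipeline is considerably more involved.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
def build_community_summaries(entity_graph: nx.Graph, docs_by_entity: dict, llm) -> list:
    """Detect entity communities and summarize the documents attached to each one."""
    summaries = []
    for community in greedy_modularity_communities(entity_graph):
        # Gather the documents mentioning entities in this community
        docs = {doc for entity in community for doc in docs_by_entity.get(entity, [])}
        prompt = "Summarize the main themes of these related documents:\n" + "\n".join(sorted(docs))
        summaries.append(llm.generate(prompt))  # hypothetical LLM client
    return summaries
# At query time, embed and search these summaries instead of raw document chunks.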
Key Difference:
- Approach 1 (Graph-Enhanced): Expands specific documents via graph
- Approach 2 (Microsoft GraphRAG): Summarizes graph communities
Which to use?
- Specific questions ("How do transformers work?") → Approach 1
- Global questions ("What are the main AI trends?") → Approach 2
- ResearcherAI: Uses Approach 1 (graph-enhanced retrieval)
GraphRAG Decision Tree
Quick Decision:
- Have graph + need context → GraphRAG (Approach 1)
- Need to summarize corpus → Microsoft GraphRAG (Approach 2)
- Simple Q&A → Traditional RAG
GraphRAG vs Traditional RAG
Key Idea: Use the graph to expand initial search results with related context.
GraphRAG Implementation
class GraphRAG:
"""GraphRAG: Use knowledge graph to enhance retrieval"""
def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model
def retrieve_and_expand(
self,
query: str,
initial_k: int = 3,
expansion_depth: int = 2
) -> Dict:
"""Retrieve documents and expand using graph"""
# Step 1: Initial vector search
query_embedding = self.embedding_model.encode(query)
initial_results = self.vector_store.search(query_embedding, top_k=initial_k)
# Step 2: Expand using knowledge graph
expanded_context = {
"initial_papers": [],
"cited_papers": [],
"citing_papers": [],
"related_concepts": set(),
"author_expertise": []
}
for text, score in initial_results:
paper_id = self._extract_paper_id(text)
expanded_context["initial_papers"].append({
"id": paper_id,
"text": text,
"score": score
})
# Expand: Find papers this paper cites
cited = self.knowledge_graph.find_papers_cited_by(paper_id)
expanded_context["cited_papers"].extend(cited)
# Expand: Find papers citing this paper
citing = self.knowledge_graph.find_citing_papers(paper_id)
expanded_context["citing_papers"].extend(citing)
# Expand: Find related concepts
concepts = self.knowledge_graph.find_related_concepts(paper_id)
expanded_context["related_concepts"].update(concepts)
# Expand: Find author expertise
authors = self.knowledge_graph.find_paper_authors(paper_id)
for author in authors:
other_papers = self.knowledge_graph.find_papers_by_author(author)
expanded_context["author_expertise"].append({
"author": author,
"other_work": other_papers[:3] # Top 3
})
return expanded_context
def generate_answer(self, query: str, context: Dict) -> str:
"""Generate answer using expanded context"""
# Build rich context from graph expansion
context_text = self._format_context(context)
prompt = f"""Based on the following research papers and related context, answer the question.
Question: {query}
Initial Papers:
{context_text['initial']}
Cited Papers (background):
{context_text['cited']}
Citing Papers (follow-up work):
{context_text['citing']}
Related Concepts:
{', '.join(context['related_concepts'])}
Provide a comprehensive answer with citations."""
# Use LLM to generate answer
response = llm.generate(prompt)
return response
def _format_context(self, context: Dict) -> Dict[str, str]:
"""Format expanded context for prompt"""
initial = "\n\n".join([
f"[{i+1}] {paper['text']}"
for i, paper in enumerate(context["initial_papers"])
])
cited = "\n".join([
f"- {paper['title']} ({paper['year']})"
for paper in context["cited_papers"][:5]
])
citing = "\n".join([
f"- {paper['title']} ({paper['year']})"
for paper in context["citing_papers"][:5]
])
return {
"initial": initial,
"cited": cited,
"citing": citing
}
# Usage
graph_rag = GraphRAG(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)
query = "How do transformers handle long-range dependencies?"
# Retrieve and expand
context = graph_rag.retrieve_and_expand(query, initial_k=3)
print("Initial papers:", len(context["initial_papers"]))
print("Cited papers:", len(context["cited_papers"]))
print("Citing papers:", len(context["citing_papers"]))
print("Concepts:", len(context["related_concepts"]))
# Generate answer with expanded context
answer = graph_rag.generate_answer(query, context)
print(f"\nAnswer: {answer}")
GraphRAG Benefits
1. Richer Context
# Traditional RAG: 3 papers
traditional_context = """
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)
"""
# GraphRAG: 3 papers + expansions
graphrag_context = """
Initial:
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)
Background (cited by these):
- Neural Machine Translation (Bahdanau, 2014)
- Sequence to Sequence Learning (Sutskever, 2014)
Follow-up (citing these):
- T5 (2019)
- BART (2020)
- Switch Transformers (2021)
Related Concepts:
- Self-attention, Multi-head attention, Positional encoding
"""
2. Better Citation Paths
// Find "intellectual lineage" of an idea
MATCH path = (old:Paper {year: 2014})-[:CITES*1..5]->(new:Paper {year: 2024})
WHERE old.title CONTAINS "attention"
RETURN path
3. Contextual Understanding
# Understand how transformer attention differs from earlier attention
# By traversing citation graph from Bahdanau (2014) to Vaswani (2017)
Enterprise Use Case: Knowledge Graphs for AI Agent API Discovery
Now let's see a practical enterprise application showing why knowledge graphs matter for AI agents in complex business environments.
The Problem: Complex Enterprise API Landscapes
Imagine you have an AI agent in an enterprise environment, and a user makes this request:
"Create a purchase order for 5 pencils in purchasing group 002 and purchasing organization 3000"
Seems simple, right? But here's the reality:
- Enterprise systems have thousands of different APIs
- Which API creates purchase orders? (Could be 10+ candidates)
- What parameters are required? (Purchasing group? Organization? Material codes?)
- What's the correct sequence? (Auth → Validate → Create → Submit?)
Without context, the AI agent faces:
- Trial and error API discovery (slow, expensive)
- Missing domain knowledge (which API for which business process?)
- No structure (can't understand API dependencies)
- No explainability (can't trace what it did or why)
Solution: Build a knowledge graph of your enterprise APIs enriched with business process information!
Knowledge Graph for Enterprise APIs
Structure:
# Nodes (entities)
:PurchaseOrderAPI rdf:type :API .
:PurchasingProcess rdf:type :BusinessProcess .
:PurchasingGroup rdf:type :Parameter .
# Relationships
:PurchaseOrderAPI :belongsTo :PurchasingProcess .
:PurchaseOrderAPI :requires :PurchasingGroup .
:PurchaseOrderAPI :requires :PurchasingOrganization .
:PurchaseOrderAPI :requires :MaterialNumber .
# Metadata
:PurchaseOrderAPI :endpoint "/api/v1/purchase-orders" .
:PurchaseOrderAPI :method "POST" .
:PurchasingGroup :allowedValues "001", "002", "003" .
How Knowledge Graphs Solve Enterprise AI Agent Challenges
Challenge 1: Slow API Discovery
Without Knowledge Graph:
# Agent tries APIs randomly
attempts = [
"Try: /api/orders/create → Wrong (this is for sales orders)",
"Try: /api/procurement/new → Wrong (deprecated API)",
"Try: /api/purchase-orders/post → Wrong (requires different params)",
"Try: /api/v2/purchasing/create-po → Success! (after 4 attempts)"
]
# Result: 4 failed attempts, wasted tokens, slow response
With Knowledge Graph:
# Agent queries knowledge graph
query = """
SELECT ?api WHERE {
?process :name "Purchasing" .
?api :belongsTo ?process .
?api :action "CreatePurchaseOrder" .
}
"""
result = ["POST /api/v2/purchasing/create-po"] # Direct match!
# Result: 1 attempt, instant, efficient
Benefit: 90% reduction in API discovery time
Challenge 2: Missing Domain Context
Without Knowledge Graph:
# Agent doesn't know business logic
user_request = "Create PO for pencils in group 002"
# Agent tries:
api_call = {
"endpoint": "/api/purchase-orders",
"params": {
"item": "pencils", # ❌ Wrong: needs material number
"group": "002" # ✅ Correct
}
}
# Result: API error "Missing material_number parameter"
With Knowledge Graph:
# Agent queries graph for required workflow
workflow = knowledge_graph.query("""
SELECT ?step ?api ?required_param WHERE {
?process :name "CreatePurchaseOrder" .
?process :hasStep ?step .
?step :callsAPI ?api .
?api :requiresParameter ?required_param .
}
ORDER BY ?step
""")
# Result:
# Step 1: Material Lookup API (param: material_name → returns: material_number)
# Step 2: Purchase Order API (params: material_number, purchasing_group, purchasing_org)
# Agent executes correctly:
material_number = call_api("MaterialLookup", {"name": "pencils"}) # M12345
po = call_api("PurchaseOrder", {
"material": material_number,
"group": "002",
"org": "3000"
})
# Result: ✅ Success in correct sequence
Benefit: Automatic workflow understanding from graph structure
Challenge 3: Complex API Dependencies
Without Knowledge Graph:
# Agent doesn't know allowed values
params = {
"purchasing_group": "999" # ❌ Invalid value
}
# Result: API rejects with cryptic error
With Knowledge Graph:
# Graph contains allowed values
allowed = knowledge_graph.query("""
SELECT ?value WHERE {
:PurchasingGroup :allowedValues ?value .
}
""")
# Result: ["001", "002", "003"]
# Agent validates BEFORE calling API
if "999" not in allowed:
# Ask user or pick valid value
pass
Benefit: Validation before execution, reducing errors
Challenge 4: No Explainability
Without Knowledge Graph:
User: "Why did you use that API?"
Agent: "Based on my training, I determined..." (vague)
With Knowledge Graph:
# Agent can trace reasoning through graph
explanation = {
"question": "Create purchase order for pencils",
"reasoning_path": [
"1. Identified business process: Purchasing",
"2. Found process step: MaterialLookup (required for material_number)",
"3. Called MaterialLookup API with 'pencils' → returned M12345",
"4. Found process step: CreatePurchaseOrder",
"5. Validated parameters against schema:",
" - material_number: M12345 (from step 3)",
" - purchasing_group: 002 (from user, validated against allowed values)",
" - purchasing_org: 3000 (from user)",
"6. Called PurchaseOrder API with validated params",
"7. Result: PO-2024-001 created successfully"
],
"source_apis": ["/api/materials/lookup", "/api/purchase-orders/create"],
"graph_nodes_traversed": ["PurchasingProcess", "MaterialLookupAPI", "PurchaseOrderAPI"]
}
Benefit: Complete transparency and auditability
Implementation Example
class EnterpriseAPIAgent:
"""AI Agent with Knowledge Graph for API discovery"""
def __init__(self, kg: Neo4jKnowledgeGraph, llm: LLM):
self.kg = kg
self.llm = llm
def execute_request(self, user_request: str):
"""Execute user request using knowledge graph"""
# Step 1: Understand intent
intent = self.llm.extract_intent(user_request)
# {"action": "CreatePurchaseOrder", "params": {"item": "pencils", "group": "002"}}
# Step 2: Query knowledge graph for workflow
workflow = self.kg.query_cypher(f"""
MATCH (process:BusinessProcess {{name: 'Purchasing'}})
-[:HAS_STEP]->(step:ProcessStep)
-[:CALLS_API]->(api:API)
WHERE step.action = '{intent["action"]}'
RETURN step.order as order, api.endpoint as endpoint,
api.required_params as params
ORDER BY order
""")
# Step 3: Execute workflow steps
context = {}
for step in workflow:
# Validate and enrich parameters
params = self._prepare_params(step["params"], intent["params"], context)
# Call API
response = self._call_api(step["endpoint"], params)
# Store result for next step
context[step["order"]] = response
return context
def _prepare_params(self, required, user_provided, context):
"""Prepare API parameters using knowledge graph validation"""
params = {}
for param_name in required:
# Check if user provided
if param_name in user_provided:
# Validate against knowledge graph
allowed = self.kg.get_allowed_values(param_name)
if allowed and user_provided[param_name] not in allowed:
raise ValueError(f"{param_name} must be one of {allowed}")
params[param_name] = user_provided[param_name]
# Check if available from previous steps
elif param_name in context:
params[param_name] = context[param_name]
else:
raise ValueError(f"Missing required parameter: {param_name}")
return params
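A hypothetical usage sketch; the neo4j_kg and llm objects stand in for the knowledge graph wrapper and LLM client used elsewhere in this chapter.
agent = EnterpriseAPIAgent(kg=neo4j_kg, llm=llm)
result = agent.execute_request(
    "Create a purchase order for 5 pencils in purchasing group 002 "
    "and purchasing organization 3000"
)
print(result)
# e.g. {1: {"material_number": "M12345"}, 2: {"po_number": "PO-2024-001"}}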
Benefits Summary
| Challenge | Without KG | With KG | Improvement |
|---|---|---|---|
| API Discovery | Trial & error (10+ attempts) | Direct lookup (1 attempt) | 90% faster |
| Context Understanding | Missing domain logic | Business process aware | Correct workflows |
| Parameter Validation | Runtime errors | Pre-validated | Fewer failures |
| Explainability | Black box | Full trace | Audit-ready |
| Maintenance | Update LLM training | Update graph | Easy updates |
When to Use Knowledge Graphs for AI Agents
Use KG-powered AI agents when:
- ✅ Complex API landscapes (100s-1000s of APIs)
- ✅ Domain-specific business logic required
- ✅ Auditability and explainability critical
- ✅ APIs change frequently (easier to update graph than retrain)
- ✅ Multi-step workflows common
Traditional RAG/LLM when:
- Simple, stable API sets (10-20 APIs)
- No strict compliance requirements
- Flexibility more important than precision
Knowledge graphs transform AI agents from "smart guessers" to "informed executors" by providing:
- Structure: API relationships and dependencies
- Context: Business process integration
- Validation: Allowed values and constraints
- Explainability: Traceable reasoning paths
Part 5: Structured vs Unstructured Data
Real research papers contain both:
Unstructured:
- Abstract (free text)
- Full paper text
- Author descriptions
Structured:
- Title, authors, year, venue
- Citation counts
- Keywords, categories
- Figures, tables (semi-structured)
Handling Both with GraphRAG
Complete Example: ResearcherAI Data Pipeline
class ResearchDataPipeline:
"""Complete pipeline for handling structured + unstructured data"""
def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model
def process_paper(self, paper: Dict):
"""Process single paper with both structured and unstructured data"""
# Step 1: Extract structured data
paper_id = paper["id"]
title = paper["title"]
authors = paper["authors"]
year = paper["year"]
citations = paper.get("citations", [])
keywords = paper.get("keywords", [])
# Step 2: Extract unstructured data
abstract = paper["abstract"]
full_text = paper.get("full_text", "")
# Step 3: Add to knowledge graph (structured)
self.knowledge_graph.add_paper(paper_id, title, year, abstract)
for author in authors:
author_id = self._get_author_id(author)
self.knowledge_graph.add_author(author_id, author)
self.knowledge_graph.add_authored(author_id, paper_id)
for cited_paper_id in citations:
self.knowledge_graph.add_citation(paper_id, cited_paper_id)
for keyword in keywords:
concept_id = self._get_concept_id(keyword)
self.knowledge_graph.add_concept(concept_id, keyword)
self.knowledge_graph.add_discusses(paper_id, concept_id)
# Step 4: Add to vector store (unstructured)
# Combine title + abstract for better semantic search
text_for_embedding = f"{title}. {abstract}"
embedding = self.embedding_model.encode(text_for_embedding)
self.vector_store.add_documents(
texts=[text_for_embedding],
embeddings=np.array([embedding]),
metadata=[{
"paper_id": paper_id,
"title": title,
"year": year,
"authors": authors,
"citation_count": len(citations)
}]
)
print(f"✓ Processed: {title}")
def query(self, question: str, mode: str = "hybrid") -> Dict:
"""Query with automatic routing"""
if mode == "semantic":
# Pure vector search
query_emb = self.embedding_model.encode(question)
results = self.vector_store.search(query_emb, top_k=5)
return {
"results": results,
"mode": "semantic"
}
elif mode == "structured":
# Pure graph query
# Extract query intent and route to appropriate graph query
results = self._graph_query(question)
return {
"results": results,
"mode": "structured"
}
else: # hybrid or graphrag
# GraphRAG: Combine both
query_emb = self.embedding_model.encode(question)
initial_results = self.vector_store.search(query_emb, top_k=3)
# Expand using graph
expanded = []
for text, score in initial_results:
paper_id = self._extract_paper_id_from_text(text)
# Get structured context from graph
graph_context = {
"citations": self.knowledge_graph.find_citing_papers(paper_id),
"concepts": self.knowledge_graph.find_related_concepts(paper_id),
"authors": self.knowledge_graph.find_paper_authors(paper_id)
}
expanded.append({
"paper": text,
"score": score,
"graph_context": graph_context
})
return {
"results": expanded,
"mode": "graphrag"
}
# Complete workflow
pipeline = ResearchDataPipeline(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)
# Process papers
papers = [
{
"id": "paper_1",
"title": "Attention is All You Need",
"authors": ["Ashish Vaswani", "Noam Shazeer"],
"year": 2017,
"abstract": "We propose the Transformer, a model architecture...",
"keywords": ["transformer", "attention", "sequence-to-sequence"],
"citations": []
},
{
"id": "paper_2",
"title": "BERT: Pre-training Transformers",
"authors": ["Jacob Devlin", "Ming-Wei Chang"],
"year": 2018,
"abstract": "We introduce BERT, a bidirectional transformer...",
"keywords": ["BERT", "pre-training", "transformers"],
"citations": ["paper_1"]
}
]
for paper in papers:
pipeline.process_paper(paper)
# Query with different modes
print("\n=== SEMANTIC MODE ===")
results = pipeline.query("attention mechanisms in neural networks", mode="semantic")
print(results)
print("\n=== STRUCTURED MODE ===")
results = pipeline.query("papers citing Attention is All You Need", mode="structured")
print(results)
print("\n=== GRAPHRAG MODE ===")
results = pipeline.query("how do transformers work?", mode="hybrid")
print(results)
Summary and Decision Guide
Technology Comparison
| Technology | Best For | Limitations |
|---|---|---|
| Vector DB | Semantic similarity, fuzzy matching | No relationships, no reasoning |
| Knowledge Graph | Relationships, structured queries | Requires exact entities, no fuzzy search |
| Hybrid RAG | Combining semantic + structured | More complex, two systems |
| GraphRAG | Rich context, citation analysis | Highest complexity, needs both systems |
When to Use Each
Use Vector Search Alone:
- Simple semantic search
- No relationship queries
- Quick prototypes
- Example: "Find similar papers"
Use Knowledge Graph Alone:
- Known entities
- Relationship-heavy queries
- Network analysis
- Example: "Find collaboration networks"
Use Hybrid RAG:
- Production RAG systems
- Mix of semantic + structured queries
- Need both similarity and relationships
- Example: ResearcherAI
Use GraphRAG:
- Research assistance (like ResearcherAI)
- Need citation context
- Complex multi-hop queries
- Example: "Trace the evolution of an idea"
ResearcherAI's Approach
ResearcherAI uses GraphRAG with dual backends:
# Development Mode
dev_config = {
"vector_store": "FAISS (in-memory)",
"knowledge_graph": "NetworkX (in-memory)",
"startup_time": "instant",
"data_persistence": "manual save/load"
}
# Production Mode
prod_config = {
"vector_store": "Qdrant (persistent)",
"knowledge_graph": "Neo4j (persistent)",
"startup_time": "~2 seconds",
"data_persistence": "automatic"
}
# Abstraction layer allows switching
system = ResearcherAI(mode="development") # or "production"
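A minimal sketch of what that abstraction layer might look like: a factory that picks in-memory or persistent backends based on the mode. The class names mirror those used earlier in the chapter, but the exact constructors are assumptions.
def build_backends(mode: str = "development"):
    """Return (vector_store, knowledge_graph) for the requested mode."""
    if mode == "development":
        vector_store = FAISSVectorStore(dimension=384)      # in-memory, instant startup
        knowledge_graph = NetworkXKnowledgeGraph()          # in-memory
    elif mode == "production":
        vector_store = QdrantVectorStore(url="http://localhost:6333")
        knowledge_graph = Neo4jKnowledgeGraph(uri="bolt://localhost:7687")
    else:
        raise ValueError(f"Unknown mode: {mode}")
    return vector_store, knowledge_graph
vector_store, knowledge_graph = build_backends(mode="development")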
Key Takeaways
- Vector databases enable semantic search - finding similar content without exact keyword matches
- Knowledge graphs enable relationship reasoning - following citations, collaborations, concept evolution
- Hybrid RAG combines both for richer retrieval
- GraphRAG uses graphs to expand and enhance vector search results
- Structured + Unstructured data both matter - use appropriate storage for each
- Dev/Prod duality enables fast iteration with production fidelity
Next Steps
Now you understand the data layer. Next chapters cover:
- Chapter 3.5 (Agent Foundations): How agents use these data stores
- Chapter 4 (Orchestration Frameworks): LangGraph for agent coordination
- Chapter 5 (Backend): Implementing the full stack
The data foundations you learned here power every query in ResearcherAI!
Build your own hybrid system:
- Start with FAISS + NetworkX (development)
- Index 20-30 papers with abstracts
- Create author and citation relationships
- Implement semantic search
- Implement citation traversal
- Combine both for hybrid queries
- Deploy with Qdrant + Neo4j (production)