
Chapter 3: Data Foundations - Vector Databases, Knowledge Graphs, and GraphRAG

Introduction

Before building intelligent agents, we must understand how to store and retrieve information effectively. This chapter takes you from basic keyword search to advanced GraphRAG, explaining why each technology exists and when to use it.

For Web Developers

This chapter is like learning about databases in web development:

  • Keyword search = Simple WHERE name LIKE '%query%'
  • Vector search = Semantic similarity (no exact matches needed)
  • Knowledge graphs = Relational databases on steroids
  • GraphRAG = Combining the best of all worlds

The Problem: Traditional Search Doesn't Work for Research

Keyword Search Limitations

Imagine searching for papers about "neural networks for language understanding":

Keyword Search:

from typing import List

def keyword_search(query: str, documents: List[str]) -> List[str]:
    """Traditional keyword matching"""
    results = []
    keywords = query.lower().split()

    for doc in documents:
        doc_lower = doc.lower()
        if all(keyword in doc_lower for keyword in keywords):
            results.append(doc)

    return results

# Search papers
query = "neural networks for language understanding"
results = keyword_search(query, papers)

Problems:

  1. Synonym Problem: Misses "deep learning" when searching "neural networks"
  2. Word Order: "language understanding with neural networks" won't match
  3. Context Ignored: Can't understand that "transformers" refers to attention mechanisms
  4. No Semantics: "bank" (financial) vs "bank" (river) treated identically

Real Example:

# User searches: "attention mechanisms in NLP"
# Misses papers that say:
# - "self-attention for natural language processing"
# - "transformer architecture for text understanding"
# - "query-key-value attention for language models"
# All mean the same thing but use different words!
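You can verify this with the keyword_search function from above - a quick, illustrative check (the paper titles are made up):

papers = [
    "Self-attention for natural language processing",
    "Transformer architecture for text understanding",
    "Query-key-value attention for language models"
]

print(keyword_search("attention mechanisms in NLP", papers))
# [] - every paper is relevant, yet none contains all four query keywords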

What We Actually Need

For research, we need:

  • Semantic understanding: "neural network" ≈ "deep learning" ≈ "artificial neural network"
  • Context awareness: Understand concepts, not just words
  • Relationship mapping: How papers, authors, and concepts connect
  • Reasoning capabilities: "If A cites B, and B discusses C, then A likely relates to C"

This requires two complementary technologies:

  1. Vector Databases (semantic similarity)
  2. Knowledge Graphs (relationship reasoning)

Let's understand each from scratch.


From Words to Vectors

Core Idea: Represent text as numbers that capture meaning.

The Intuition

# Imagine each word has a position in "meaning space"
# Similar meanings = close positions

king = [0.8, 0.3, 0.1] # Royalty, male, power
queen = [0.8, 0.9, 0.1] # Royalty, female, power
man = [0.2, 0.3, 0.0] # Common, male, neutral
woman = [0.2, 0.9, 0.0] # Common, female, neutral

# Amazing property:
# king - man + woman ≈ queen
# [0.8, 0.3, 0.1] - [0.2, 0.3, 0.0] + [0.2, 0.9, 0.0] = [0.8, 0.9, 0.1]

This is word embeddings - representing words as dense vectors that capture semantic relationships.
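You can check the toy arithmetic above with a few lines of NumPy (these are the made-up three-dimensional vectors, not real embeddings):

import numpy as np

king = np.array([0.8, 0.3, 0.1])
queen = np.array([0.8, 0.9, 0.1])
man = np.array([0.2, 0.3, 0.0])
woman = np.array([0.2, 0.9, 0.0])

result = king - man + woman
print(result)                      # [0.8 0.9 0.1]
print(np.allclose(result, queen))  # True - lands on "queen"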

How Embeddings Work

Creating Embeddings

from sentence_transformers import SentenceTransformer

# Load embedding model (runs locally!)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
texts = [
    "Neural networks for image classification",
    "Deep learning in computer vision",
    "Convolutional networks for image recognition"
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384) - 3 texts, 384 dimensions each

Measuring Similarity

import numpy as np

def cosine_similarity(vec1, vec2):
    """Measure how similar two vectors are (0=different, 1=identical)"""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compare texts
query = "neural nets for images"
query_embedding = model.encode(query)

for i, text in enumerate(texts):
    similarity = cosine_similarity(query_embedding, embeddings[i])
    print(f"Similarity to '{text}': {similarity:.3f}")

# Output:
# Similarity to 'Neural networks for image classification': 0.782
# Similarity to 'Deep learning in computer vision': 0.691
# Similarity to 'Convolutional networks for image recognition': 0.745

Key Insight: Even though exact words differ, semantic similarity is captured!

Comparing embeddings one-by-one doesn't scale. For 1 million papers:

  • Comparing query to all papers: 1 million comparisons
  • At 0.01ms per comparison: 10 seconds per query

Solution: Vector databases with approximate nearest neighbor (ANN) search.
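The flat FAISS index shown in the next section compares against every vector (exact search). For large collections, an approximate index trades a little recall for large speedups. Here is a minimal, illustrative sketch using FAISS's IndexIVFFlat; the sizes and parameters are placeholders:

import faiss
import numpy as np

d = 384        # embedding dimension
nlist = 100    # number of clusters the vectors are partitioned into

# IVF index: cluster the vectors, then search only a few clusters per query
quantizer = faiss.IndexFlatL2(d)
ann_index = faiss.IndexIVFFlat(quantizer, d, nlist)

vectors = np.random.rand(100_000, d).astype('float32')
ann_index.train(vectors)   # IVF indexes must be trained before adding vectors
ann_index.add(vectors)

ann_index.nprobe = 10      # clusters searched per query (speed vs. recall trade-off)
query = np.random.rand(1, d).astype('float32')
distances, indices = ann_index.search(query, 5)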

Development: FAISS (In-Memory)

FAISS (Facebook AI Similarity Search) - perfect for development and testing.

import faiss
import numpy as np
from typing import List

class FAISSVectorStore:
    """Development vector database using FAISS"""

    def __init__(self, dimension: int = 384):
        self.dimension = dimension
        # Create FAISS index (L2 distance)
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []  # Store original documents

    def add_documents(self, texts: List[str], embeddings: np.ndarray):
        """Add documents to index"""
        # FAISS requires float32
        embeddings_f32 = embeddings.astype('float32')

        # Add to index
        self.index.add(embeddings_f32)

        # Store documents
        self.documents.extend(texts)

        print(f"✓ Indexed {len(texts)} documents (total: {self.index.ntotal})")

    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[tuple]:
        """Search for similar documents"""
        # Ensure float32
        query_f32 = query_embedding.astype('float32').reshape(1, -1)

        # Search (returns distances and indices)
        distances, indices = self.index.search(query_f32, top_k)

        # Convert L2 distances to similarity scores (0-1)
        similarities = 1 / (1 + distances[0])

        # Return documents with scores (FAISS returns -1 for missing results)
        results = [
            (self.documents[idx], float(sim))
            for idx, sim in zip(indices[0], similarities)
            if 0 <= idx < len(self.documents)
        ]

        return results

# Usage
vector_store = FAISSVectorStore(dimension=384)

# Index papers
papers = [
    "Attention is all you need - introduces transformer architecture",
    "BERT: Pre-training of deep bidirectional transformers",
    "GPT-3: Language models are few-shot learners",
    "ResNet: Deep residual learning for image recognition",
    "YOLO: Real-time object detection"
]

embeddings = model.encode(papers)
vector_store.add_documents(papers, embeddings)

# Search
query = "transformer models for NLP"
query_emb = model.encode(query)
results = vector_store.search(query_emb, top_k=3)

for doc, score in results:
    print(f"Score: {score:.3f} - {doc}")

Output:

Score: 0.856 - Attention is all you need - introduces transformer architecture
Score: 0.792 - BERT: Pre-training of deep bidirectional transformers
Score: 0.743 - GPT-3: Language models are few-shot learners

Notice: ResNet and YOLO (computer vision) are correctly excluded even though they're valid papers!

Production: Qdrant (Persistent, Scalable)

Qdrant - production-grade vector database with persistence and APIs.

from typing import List, Dict
import numpy as np
from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

class QdrantVectorStore:
    """Production vector database using Qdrant"""

    def __init__(
        self,
        collection_name: str = "research_papers",
        url: str = "http://localhost:6333"
    ):
        self.client = QdrantClient(url=url)
        self.collection_name = collection_name
        self.dimension = 384

        # Create collection if not exists
        self._create_collection()

    def _create_collection(self):
        """Create Qdrant collection"""
        try:
            self.client.get_collection(self.collection_name)
            print(f"✓ Collection '{self.collection_name}' exists")
        except Exception:
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=self.dimension,
                    distance=Distance.COSINE  # Cosine similarity
                )
            )
            print(f"✓ Created collection '{self.collection_name}'")

    def add_documents(
        self,
        texts: List[str],
        embeddings: np.ndarray,
        metadata: List[Dict] = None
    ):
        """Add documents with metadata"""
        points = []

        for idx, (text, embedding) in enumerate(zip(texts, embeddings)):
            point = PointStruct(
                id=idx,
                vector=embedding.tolist(),
                payload={
                    "text": text,
                    **(metadata[idx] if metadata else {})
                }
            )
            points.append(point)

        # Batch upload
        self.client.upsert(
            collection_name=self.collection_name,
            points=points
        )

        print(f"✓ Indexed {len(points)} documents")

    def search(
        self,
        query_embedding: np.ndarray,
        top_k: int = 5,
        filters: Dict = None
    ) -> List[tuple]:
        """Search with optional metadata filters"""
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=top_k,
            query_filter=filters  # Can filter by metadata!
        )

        return [
            (result.payload["text"], result.score)
            for result in results
        ]

# Usage
qdrant_store = QdrantVectorStore(collection_name="papers")

# Index with metadata
metadata = [
    {"year": 2017, "citations": 50000, "venue": "NeurIPS"},
    {"year": 2018, "citations": 30000, "venue": "NAACL"},
    {"year": 2020, "citations": 15000, "venue": "NeurIPS"},
    {"year": 2015, "citations": 40000, "venue": "CVPR"},
    {"year": 2016, "citations": 25000, "venue": "CVPR"}
]

qdrant_store.add_documents(papers, embeddings, metadata)

# Search with filters
query_emb = model.encode("transformer models for NLP")
results = qdrant_store.search(
    query_emb,
    top_k=5,
    filters={"must": [{"key": "year", "range": {"gte": 2017}}]}  # Papers from 2017 onward
)

for doc, score in results:
    print(f"Score: {score:.3f} - {doc}")

Vector Search: Dev vs Prod Comparison

Feature      | FAISS (Dev)                        | Qdrant (Prod)
Storage      | In-memory only                     | Persistent to disk
Startup      | Instant                            | ~2 seconds
Scalability  | Single machine                     | Distributed cluster
Metadata     | Manual tracking                    | Built-in filtering
APIs         | Python only                        | REST + gRPC + Python
Persistence  | Save/load manually                 | Automatic
Best for     | Development, testing, prototyping  | Production, millions of vectors

When to Use Each

  • Development: Use FAISS - instant startup, no infrastructure
  • Production: Use Qdrant - persistence, scalability, filtering
  • ResearcherAI: Uses both with abstraction layer!
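The abstraction layer mentioned above is not shown in this chapter, but a minimal sketch could look like this: a shared interface that both stores already satisfy, with the backend chosen by environment (the function and protocol names here are hypothetical):

from typing import List, Protocol, Tuple

import numpy as np

class VectorStore(Protocol):
    """Operations both backends support."""
    def add_documents(self, texts: List[str], embeddings: np.ndarray) -> None: ...
    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[Tuple[str, float]]: ...

def build_vector_store(env: str) -> VectorStore:
    """FAISS for development, Qdrant for production."""
    if env == "production":
        return QdrantVectorStore(collection_name="research_papers")
    return FAISSVectorStore(dimension=384)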

Part 2: Knowledge Graphs and Structured Reasoning

Knowledge Graphs in the Real World

Before diving into the technical details, let's see knowledge graphs in action.

Try This: Open Google and search for "Tesla" (the electric vehicle company).

What do you see? Besides the typical list of matching websites, you'll notice a comprehensive panel on the right side showing:

  • Description: "American electric vehicle and clean energy company..."
  • Founded: 2003
  • Headquarters: Austin, Texas
  • CEO: Elon Musk
  • Stock price and other properties

Now click on "Austin, Texas" (the headquarters location). You'll see another panel with:

  • Description: "Capital city of Texas, United States"
  • Population: ~1 million
  • County: Travis County

This is a knowledge graph in action! Google isn't just returning text - it's showing you:

  • Entities (Tesla, Austin, Elon Musk)
  • Properties (founded date, population)
  • Relationships (Tesla → headquarters → Austin → county → Travis County)

Web Developer Analogy:

// Traditional search result = List of text snippets
const results = ["Tesla is a company...", "Tesla founded in 2003...", "Tesla located..."]

// Knowledge graph = Structured interconnected data
const knowledgeGraph = {
"Tesla": {
type: "Company",
properties: {
founded: 2003,
name: "Tesla, Inc."
},
relationships: {
headquarters: "Austin",
CEO: "Elon_Musk"
}
},
"Austin": {
type: "City",
properties: {
population: 1000000,
state: "Texas"
},
relationships: {
county: "Travis_County",
companies: ["Tesla", "Oracle", "Dell"]
}
}
}

What Are Knowledge Graphs?

A knowledge graph represents structured information as a graph where:

  • Nodes represent entities (Tesla, Austin, Elon Musk)
  • Edges represent relationships between entities (headquarters, CEO, located_in)

Two Main Components:

  1. Schema/Ontology: Defines the types of entities, their attributes, and allowed relationships

    # Schema definition
    Company has property: founded_year (integer)
    Company has property: name (string)
    Company has relationship: headquarters → City
    Company has relationship: CEO → Person
  2. Instance Data: The actual entities and relationships that follow the schema

    # Instance data
    Tesla founded_year 2003
    Tesla name "Tesla, Inc."
    Tesla headquarters Austin
    Tesla CEO Elon_Musk

Why Vector Search Isn't Enough

Vector search excels at finding similar content, but fails at:

1. Relationship Questions

Query: "Which papers cite both attention mechanisms and BERT?"
Vector search: ❌ Can't traverse citations
Knowledge graph: ✅ MATCH (p)-[:CITES]->(a), (p)-[:CITES]->(b)

2. Multi-Hop Reasoning

Query: "Find papers by authors who collaborated with Yoshua Bengio"
Vector search: ❌ Can't follow author → author connections
Knowledge graph: ✅ Path traversal over collaboration edges

3. Structured Queries

Query: "Papers published in 2020 that cite papers from before 2015"
Vector search: ❌ No temporal reasoning
Knowledge graph: ✅ Filter by year property + traverse citations

From Tables to Graphs

Key Differences:

Tables (Relational):

  • Fixed schema
  • Join operations expensive
  • Hard to add new relationship types
  • Optimized for transactional queries

Graphs (Network):

  • Flexible schema
  • Traversals are natural
  • Easy to add new edges
  • Optimized for relationship queries
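To make the difference concrete, here is a small, illustrative sketch: the same two-hop question ("which papers cite a paper that cites X?") as a chain of SQL joins versus a graph traversal (using NetworkX, which is introduced below). The table names and toy graph are hypothetical.

import networkx as nx

# Relational style: every extra hop is another join (illustrative SQL, hypothetical schema)
two_hop_sql = """
SELECT p3.title
FROM papers p1
JOIN citations c1 ON c1.cited_id  = p1.id
JOIN papers    p2 ON c1.citing_id = p2.id
JOIN citations c2 ON c2.cited_id  = p2.id
JOIN papers    p3 ON c2.citing_id = p3.id
WHERE p1.title = 'Attention is All You Need'
"""

# Graph style: hops are just edge traversals
g = nx.DiGraph()
g.add_edge("BERT", "Attention", relation="CITES")
g.add_edge("GPT-3", "BERT", relation="CITES")

direct = list(g.predecessors("Attention"))                 # papers citing Attention
two_hop = {p for d in direct for p in g.predecessors(d)}   # papers citing those papers
print(direct, two_hop)  # ['BERT'] {'GPT-3'}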

Knowledge Graph Basics

Components:

  1. Nodes (Entities): Papers, Authors, Concepts, Institutions
  2. Edges (Relationships): CITES, AUTHORED_BY, DISCUSSES, AFFILIATED_WITH
  3. Properties: title, year, citations_count, etc.

Example Graph:

# Nodes
Paper1 = {
"id": "paper_1",
"type": "Paper",
"title": "Attention is All You Need",
"year": 2017,
"citations": 50000
}

Author1 = {
"id": "author_1",
"type": "Author",
"name": "Ashish Vaswani"
}

Concept1 = {
"id": "concept_1",
"type": "Concept",
"name": "Transformer"
}

# Edges
edges = [
(Author1, "AUTHORED", Paper1),
(Paper1, "INTRODUCES", Concept1),
(Paper2, "CITES", Paper1),
(Paper2, "USES", Concept1)
]

Development: NetworkX (In-Memory)

NetworkX - Python library for graph operations, perfect for development.

import networkx as nx
from typing import Dict, List

class NetworkXKnowledgeGraph:
"""Development knowledge graph using NetworkX"""

def __init__(self):
self.graph = nx.MultiDiGraph() # Directed graph with multiple edges

def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
"""Add paper node"""
self.graph.add_node(
paper_id,
type="Paper",
title=title,
year=year,
abstract=abstract
)

def add_author(self, author_id: str, name: str):
"""Add author node"""
self.graph.add_node(
author_id,
type="Author",
name=name
)

def add_concept(self, concept_id: str, name: str):
"""Add concept node"""
self.graph.add_node(
concept_id,
type="Concept",
name=name
)

def add_authored(self, author_id: str, paper_id: str):
"""Add AUTHORED relationship"""
self.graph.add_edge(author_id, paper_id, type="AUTHORED")

def add_citation(self, citing_paper: str, cited_paper: str):
"""Add CITES relationship"""
self.graph.add_edge(citing_paper, cited_paper, type="CITES")

def add_discusses(self, paper_id: str, concept_id: str):
"""Add DISCUSSES relationship"""
self.graph.add_edge(paper_id, concept_id, type="DISCUSSES")

def find_papers_by_author(self, author_name: str) -> List[Dict]:
"""Find all papers by an author"""
papers = []

for node, data in self.graph.nodes(data=True):
if data.get("type") == "Author" and data.get("name") == author_name:
# Find papers this author authored
for neighbor in self.graph.successors(node):
if self.graph.nodes[neighbor].get("type") == "Paper":
papers.append({
"id": neighbor,
**self.graph.nodes[neighbor]
})

return papers

def find_citing_papers(self, paper_id: str) -> List[Dict]:
"""Find papers that cite a given paper"""
citing = []

for pred in self.graph.predecessors(paper_id):
edge_data = self.graph.get_edge_data(pred, paper_id)
if any(e.get("type") == "CITES" for e in edge_data.values()):
if self.graph.nodes[pred].get("type") == "Paper":
citing.append({
"id": pred,
**self.graph.nodes[pred]
})

return citing

def find_related_concepts(self, paper_id: str) -> List[str]:
"""Find concepts discussed in a paper"""
concepts = []

for neighbor in self.graph.successors(paper_id):
if self.graph.nodes[neighbor].get("type") == "Concept":
concepts.append(self.graph.nodes[neighbor].get("name"))

return concepts

def find_collaboration_network(self, author_name: str, depth: int = 2) -> List[str]:
"""Find authors who collaborated (shared papers)"""
collaborators = set()

# Find author node
author_node = None
for node, data in self.graph.nodes(data=True):
if data.get("type") == "Author" and data.get("name") == author_name:
author_node = node
break

if not author_node:
return []

# Find papers by this author
papers = [n for n in self.graph.successors(author_node)
if self.graph.nodes[n].get("type") == "Paper"]

# Find co-authors
for paper in papers:
for pred in self.graph.predecessors(paper):
if self.graph.nodes[pred].get("type") == "Author" and pred != author_node:
collaborators.add(self.graph.nodes[pred].get("name"))

return list(collaborators)

# Usage
kg = NetworkXKnowledgeGraph()

# Add nodes
kg.add_paper("paper_1", "Attention is All You Need", 2017)
kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
kg.add_paper("paper_3", "GPT-3: Language Models", 2020)

kg.add_author("author_1", "Ashish Vaswani")
kg.add_author("author_2", "Jacob Devlin")
kg.add_author("author_3", "Tom Brown")

kg.add_concept("concept_1", "Transformer")
kg.add_concept("concept_2", "Attention Mechanism")
kg.add_concept("concept_3", "Pre-training")

# Add relationships
kg.add_authored("author_1", "paper_1")
kg.add_authored("author_2", "paper_2")
kg.add_authored("author_3", "paper_3")

kg.add_discusses("paper_1", "concept_1")
kg.add_discusses("paper_1", "concept_2")
kg.add_discusses("paper_2", "concept_1")
kg.add_discusses("paper_2", "concept_3")

kg.add_citation("paper_2", "paper_1") # BERT cites Attention
kg.add_citation("paper_3", "paper_1") # GPT-3 cites Attention
kg.add_citation("paper_3", "paper_2") # GPT-3 cites BERT

# Query the graph
print("Papers by Ashish Vaswani:")
papers = kg.find_papers_by_author("Ashish Vaswani")
for paper in papers:
print(f" - {paper['title']}")

print("\nPapers citing 'Attention is All You Need':")
citing = kg.find_citing_papers("paper_1")
for paper in citing:
print(f" - {paper['title']}")

print("\nConcepts in paper_1:")
concepts = kg.find_related_concepts("paper_1")
print(f" {', '.join(concepts)}")

print("\nCollaborators of Jacob Devlin:")
collabs = kg.find_collaboration_network("Jacob Devlin")
print(f" {', '.join(collabs)}")

Output:

Papers by Ashish Vaswani:
- Attention is All You Need

Papers citing 'Attention is All You Need':
- BERT: Pre-training Transformers
- GPT-3: Language Models

Concepts in paper_1:
Transformer, Attention Mechanism

Collaborators of Jacob Devlin:
 (empty - in this toy graph each paper has a single author, so there are no co-authorships)

Production: Neo4j (Persistent, Cypher)

Neo4j - enterprise-grade graph database with powerful query language (Cypher).

from neo4j import GraphDatabase

class Neo4jKnowledgeGraph:
"""Production knowledge graph using Neo4j"""

def __init__(self, uri: str = "bolt://localhost:7687", user: str = "neo4j", password: str = "password"):
self.driver = GraphDatabase.driver(uri, auth=(user, password))

def close(self):
self.driver.close()

def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
"""Add paper node"""
with self.driver.session() as session:
session.run("""
MERGE (p:Paper {id: $paper_id})
SET p.title = $title, p.year = $year, p.abstract = $abstract
""", paper_id=paper_id, title=title, year=year, abstract=abstract)

def add_author(self, author_id: str, name: str):
"""Add author node"""
with self.driver.session() as session:
session.run("""
MERGE (a:Author {id: $author_id})
SET a.name = $name
""", author_id=author_id, name=name)

def add_concept(self, concept_id: str, name: str):
"""Add concept node"""
with self.driver.session() as session:
session.run("""
MERGE (c:Concept {id: $concept_id})
SET c.name = $name
""", concept_id=concept_id, name=name)

def add_authored(self, author_id: str, paper_id: str):
"""Add AUTHORED relationship"""
with self.driver.session() as session:
session.run("""
MATCH (a:Author {id: $author_id})
MATCH (p:Paper {id: $paper_id})
MERGE (a)-[:AUTHORED]->(p)
""", author_id=author_id, paper_id=paper_id)

def add_citation(self, citing_paper: str, cited_paper: str):
"""Add CITES relationship"""
with self.driver.session() as session:
session.run("""
MATCH (citing:Paper {id: $citing_paper})
MATCH (cited:Paper {id: $cited_paper})
MERGE (citing)-[:CITES]->(cited)
""", citing_paper=citing_paper, cited_paper=cited_paper)

def add_discusses(self, paper_id: str, concept_id: str):
"""Add DISCUSSES relationship"""
with self.driver.session() as session:
session.run("""
MATCH (p:Paper {id: $paper_id})
MATCH (c:Concept {id: $concept_id})
MERGE (p)-[:DISCUSSES]->(c)
""", paper_id=paper_id, concept_id=concept_id)

def find_papers_by_author(self, author_name: str) -> List[Dict]:
"""Find all papers by an author"""
with self.driver.session() as session:
result = session.run("""
MATCH (a:Author {name: $name})-[:AUTHORED]->(p:Paper)
RETURN p.id as id, p.title as title, p.year as year
""", name=author_name)
return [dict(record) for record in result]

def find_citing_papers(self, paper_id: str) -> List[Dict]:
"""Find papers that cite a given paper"""
with self.driver.session() as session:
result = session.run("""
MATCH (citing:Paper)-[:CITES]->(cited:Paper {id: $paper_id})
RETURN citing.id as id, citing.title as title, citing.year as year
""", paper_id=paper_id)
return [dict(record) for record in result]

def find_related_concepts(self, paper_id: str) -> List[str]:
"""Find concepts discussed in a paper"""
with self.driver.session() as session:
result = session.run("""
MATCH (p:Paper {id: $paper_id})-[:DISCUSSES]->(c:Concept)
RETURN c.name as concept
""", paper_id=paper_id)
return [record["concept"] for record in result]

def find_collaboration_network(self, author_name: str) -> List[str]:
"""Find authors who collaborated (shared papers)"""
with self.driver.session() as session:
result = session.run("""
MATCH (a1:Author {name: $name})-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1 <> a2
RETURN DISTINCT a2.name as collaborator
""", name=author_name)
return [record["collaborator"] for record in result]

def find_citation_chain(self, start_paper: str, end_paper: str, max_depth: int = 5):
"""Find citation path between two papers"""
with self.driver.session() as session:
result = session.run("""
MATCH path = shortestPath(
(start:Paper {id: $start})-[:CITES*1..{max_depth}]->(end:Paper {id: $end})
)
RETURN [node in nodes(path) | node.title] as path
""".replace("{max_depth}", str(max_depth)), start=start_paper, end=end_paper)

records = list(result)
return records[0]["path"] if records else []

# Usage
neo4j_kg = Neo4jKnowledgeGraph()

# Add same data as NetworkX example
neo4j_kg.add_paper("paper_1", "Attention is All You Need", 2017)
neo4j_kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
neo4j_kg.add_author("author_1", "Ashish Vaswani")
neo4j_kg.add_authored("author_1", "paper_1")
neo4j_kg.add_citation("paper_2", "paper_1")

# Advanced query: Citation chain
chain = neo4j_kg.find_citation_chain("paper_3", "paper_1")
print(f"Citation path: {' -> '.join(chain)}")

neo4j_kg.close()

Cypher Query Language

Cypher is Neo4j's declarative query language - incredibly powerful:

// Find papers published after 2018 that cite papers with >10000 citations
MATCH (recent:Paper)-[:CITES]->(influential:Paper)
WHERE recent.year > 2018 AND influential.citations > 10000
RETURN recent.title, influential.title, influential.citations
ORDER BY influential.citations DESC

// Find "research communities" - groups of authors who frequently collaborate
MATCH (a1:Author)-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1.name < a2.name
WITH a1, a2, count(*) as collaborations
WHERE collaborations > 3
RETURN a1.name, a2.name, collaborations
ORDER BY collaborations DESC

// Find trending concepts (discussed in papers with growing citations)
MATCH (p:Paper)-[:DISCUSSES]->(c:Concept)
WHERE p.year >= 2020
WITH c, avg(p.citations) as avg_citations, count(p) as paper_count
WHERE paper_count > 5
RETURN c.name, avg_citations, paper_count
ORDER BY avg_citations DESC
LIMIT 10

Knowledge Graph: Dev vs Prod Comparison

Feature         | NetworkX (Dev)          | Neo4j (Prod)
Storage         | In-memory only          | Persistent to disk
Query Language  | Python code             | Cypher (declarative)
Scalability     | 1000s of nodes          | Billions of nodes
Performance     | Slow for large graphs   | Optimized indexes
Transactions    | No                      | ACID transactions
Clustering      | No                      | Multi-node clusters
Best for        | Development, algorithms | Production, complex queries

When to Use Each

  • Development: NetworkX - no setup, great for prototyping
  • Production: Neo4j - performance, Cypher queries, persistence
  • ResearcherAI: Abstracts both behind unified interface!
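As with the vector stores, a thin wrapper is enough to swap backends. A minimal sketch (the interface and factory names are hypothetical; the two classes are the ones defined above):

from typing import Dict, List, Protocol

class KnowledgeGraphBackend(Protocol):
    """Operations both graph backends expose."""
    def add_paper(self, paper_id: str, title: str, year: int, abstract: str = "") -> None: ...
    def add_citation(self, citing_paper: str, cited_paper: str) -> None: ...
    def find_citing_papers(self, paper_id: str) -> List[Dict]: ...

def build_knowledge_graph(env: str) -> KnowledgeGraphBackend:
    """NetworkX for development, Neo4j for production."""
    if env == "production":
        return Neo4jKnowledgeGraph(uri="bolt://localhost:7687")
    return NetworkXKnowledgeGraph()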

Semantic Web: Ontologies, RDF, and SPARQL

So far we've discussed property graphs (Neo4j, NetworkX). There's another powerful approach: semantic web technologies using RDF and ontologies.

What Are Ontologies?

Think of an ontology as a formal schema for your knowledge:

Web Developer Analogy:

// TypeScript interface = Ontology
interface Person {
name: string;
worksFor: Organization;
knows: Person[];
}

interface Organization {
name: string;
foundedDate: Date;
}

An ontology defines:

  • Classes (Person, Paper, Author, Concept)
  • Properties (name, authored, cites, discusses)
  • Relationships (Author → authored → Paper)
  • Constraints (a Paper must have at least one Author)

RDF: Resource Description Framework

RDF represents knowledge as triples:

Subject  Predicate  Object

Every statement is a triple (like a sentence):

# Turtle syntax (RDF format)
:paper1 rdf:type :ResearchPaper .
:paper1 :hasTitle "Attention Is All You Need" .
:paper1 :publishedYear 2017 .
:paper1 :hasAuthor :vaswani .
:paper1 :cites :paper2 .

:vaswani rdf:type :Author .
:vaswani :hasName "Ashish Vaswani" .

Web Developer Analogy:

// JSON = Property Graph
{
"id": "paper1",
"title": "Attention Is All You Need",
"year": 2017,
"authors": ["vaswani"]
}

// RDF Triples = Semantic Web
["paper1", "type", "ResearchPaper"]
["paper1", "hasTitle", "Attention Is All You Need"]
["paper1", "publishedYear", 2017]
["paper1", "hasAuthor", "vaswani"]

(Figure: an RDF triple visualized as Subject → Predicate → Object.)

Development: RDFLib (Python)

For development, use RDFLib - a pure Python library:

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS

class RDFKnowledgeGraph:
"""Development RDF knowledge graph using RDFLib"""

def __init__(self):
self.graph = Graph()

# Define custom namespace for our ontology
self.ns = Namespace("http://researcherai.org/ontology#")
self.graph.bind("research", self.ns)

def add_paper(
self,
paper_id: str,
title: str,
year: int,
abstract: str = ""
):
"""Add a research paper to the graph"""
paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")

# Add triples
self.graph.add((paper_uri, RDF.type, self.ns.ResearchPaper))
self.graph.add((paper_uri, self.ns.hasTitle, Literal(title)))
self.graph.add((paper_uri, self.ns.publishedYear, Literal(year)))

if abstract:
self.graph.add((paper_uri, self.ns.hasAbstract, Literal(abstract)))

def add_author(self, author_id: str, name: str, affiliation: str = ""):
"""Add an author to the graph"""
author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")

self.graph.add((author_uri, RDF.type, self.ns.Author))
self.graph.add((author_uri, self.ns.hasName, Literal(name)))

if affiliation:
self.graph.add((author_uri, self.ns.affiliation, Literal(affiliation)))

def link_author_to_paper(self, author_id: str, paper_id: str):
"""Create authorship relationship"""
author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")
paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")

self.graph.add((paper_uri, self.ns.hasAuthor, author_uri))
self.graph.add((author_uri, self.ns.authored, paper_uri))

def add_citation(self, citing_paper_id: str, cited_paper_id: str):
"""Add citation relationship"""
citing_uri = URIRef(f"http://researcherai.org/papers/{citing_paper_id}")
cited_uri = URIRef(f"http://researcherai.org/papers/{cited_paper_id}")

self.graph.add((citing_uri, self.ns.cites, cited_uri))
self.graph.add((cited_uri, self.ns.citedBy, citing_uri))

def query_sparql(self, sparql_query: str):
"""Execute SPARQL query"""
return self.graph.query(sparql_query)

def export_turtle(self, filename: str):
"""Export graph to Turtle format"""
self.graph.serialize(destination=filename, format='turtle')

def load_turtle(self, filename: str):
"""Load graph from Turtle format"""
self.graph.parse(filename, format='turtle')


# Example usage
rdf_kg = RDFKnowledgeGraph()

# Add papers
rdf_kg.add_paper(
"paper1",
"Attention Is All You Need",
2017,
"The dominant sequence transduction models..."
)

rdf_kg.add_paper(
"paper2",
"BERT: Pre-training of Deep Bidirectional Transformers",
2019,
"We introduce BERT..."
)

# Add authors
rdf_kg.add_author("vaswani", "Ashish Vaswani", "Google Brain")
rdf_kg.add_author("devlin", "Jacob Devlin", "Google AI")

# Link relationships
rdf_kg.link_author_to_paper("vaswani", "paper1")
rdf_kg.link_author_to_paper("devlin", "paper2")
rdf_kg.add_citation("paper2", "paper1") # BERT cites Attention paper

# Export to file
rdf_kg.export_turtle("research_graph.ttl")

SPARQL: Query Language for RDF

SPARQL is to RDF what Cypher is to Neo4j (or SQL to relational databases):

# Find all papers by a specific author
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?paper ?title ?year
WHERE {
?author research:hasName "Ashish Vaswani" .
?author research:authored ?paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
}
ORDER BY ?year

Python Example:

sparql_query = """
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?citing_title ?cited_title
WHERE {
?citing_paper research:cites ?cited_paper .
?citing_paper research:hasTitle ?citing_title .
?cited_paper research:hasTitle ?cited_title .
}
"""

results = rdf_kg.query_sparql(sparql_query)
for row in results:
print(f"{row.citing_title} cites {row.cited_title}")

More SPARQL Examples

# Find authors who collaborated (co-authored papers)
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?author1_name ?author2_name ?paper_title
WHERE {
?paper research:hasAuthor ?author1 .
?paper research:hasAuthor ?author2 .
?paper research:hasTitle ?paper_title .
?author1 research:hasName ?author1_name .
?author2 research:hasName ?author2_name .
FILTER(?author1 != ?author2)
}

# Find highly cited papers (cited by many others)
SELECT ?title (COUNT(?citing) as ?citation_count)
WHERE {
?paper research:hasTitle ?title .
?citing research:cites ?paper .
}
GROUP BY ?title
HAVING (COUNT(?citing) > 100)
ORDER BY DESC(?citation_count)

# Find papers published after 2018 in a specific domain
SELECT ?title ?year
WHERE {
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:discusses ?concept .
?concept research:hasName "transformers" .
FILTER(?year > 2018)
}

Production: Apache Jena & SPARQL Endpoint

For production, use Apache Jena Fuseki - a SPARQL server:

from SPARQLWrapper import SPARQLWrapper, JSON
import requests

class JenaKnowledgeGraph:
    """Production RDF knowledge graph using Apache Jena Fuseki"""

    def __init__(
        self,
        endpoint_url: str = "http://localhost:3030/research",
        update_endpoint: str = "http://localhost:3030/research/update"
    ):
        self.endpoint_url = endpoint_url
        self.update_endpoint = update_endpoint
        self.sparql = SPARQLWrapper(endpoint_url)

    def add_triples(self, triples_turtle: str):
        """Add RDF triples to the graph"""
        # Use SPARQL UPDATE to insert data
        update_query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        INSERT DATA {{
        {triples_turtle}
        }}
        """

        response = requests.post(
            self.update_endpoint,
            data={"update": update_query},
            headers={"Content-Type": "application/x-www-form-urlencoded"}
        )
        return response.status_code == 200

    def add_paper(self, paper_id: str, title: str, year: int):
        """Add a research paper"""
        # The prefix is declared with PREFIX in add_triples' SPARQL update,
        # so this fragment must not contain Turtle @prefix declarations.
        triples = f"""
        <http://researcherai.org/papers/{paper_id}>
            a research:ResearchPaper ;
            research:hasTitle "{title}" ;
            research:publishedYear {year} .
        """
        return self.add_triples(triples)

    def query(self, sparql_query: str) -> list:
        """Execute SPARQL SELECT query"""
        self.sparql.setQuery(sparql_query)
        self.sparql.setReturnFormat(JSON)

        results = self.sparql.query().convert()
        return results["results"]["bindings"]

    def find_papers_by_author(self, author_name: str) -> list:
        """Find all papers by an author"""
        query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        SELECT ?paper ?title ?year
        WHERE {{
            ?author research:hasName "{author_name}" .
            ?author research:authored ?paper .
            ?paper research:hasTitle ?title .
            ?paper research:publishedYear ?year .
        }}
        ORDER BY DESC(?year)
        """
        return self.query(query)

    def find_citation_chain(self, paper_id: str) -> list:
        """Find papers that cite this paper, directly or transitively"""
        # SPARQL 1.1 property paths support + (one or more hops);
        # bounded repetition such as {1,n} is not part of the standard.
        query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        SELECT ?citing_paper ?title
        WHERE {{
            ?citing_paper research:cites+ <http://researcherai.org/papers/{paper_id}> .
            ?citing_paper research:hasTitle ?title .
        }}
        """
        return self.query(query)


# Example usage with Fuseki
jena_kg = JenaKnowledgeGraph(
    endpoint_url="http://localhost:3030/research/sparql",
    update_endpoint="http://localhost:3030/research/update"
)

# Add paper
jena_kg.add_paper(
    "attention2017",
    "Attention Is All You Need",
    2017
)

# Query
results = jena_kg.find_papers_by_author("Ashish Vaswani")
for result in results:
    print(f"{result['title']['value']} ({result['year']['value']})")

OWL: Web Ontology Language - The Power of Reasoning

OWL (Web Ontology Language) extends RDF with reasoning capabilities. It's the difference between storing facts and deriving new knowledge from those facts.

Web Developer Analogy:

// RDF = Data storage
const data = {
"Vaswani": { type: "Author", authored: ["paper1"] }
}

// OWL = Data storage + Logic rules
const data = { ... }
const rules = {
// If someone authored a paper, they are a researcher
"Author who authored something → Researcher"
}
// Reasoner can INFER: "Vaswani is a Researcher" (even if not explicitly stated)

Why OWL Matters: Automatic Inference

Without OWL (just RDF):

:vaswani :authored :paper1 .
# To know Vaswani is a researcher, you must explicitly state it

With OWL (RDF + reasoning):

# Define rule: anyone who authored something is a researcher
:authored rdfs:domain :Researcher .

# Just state the fact
:vaswani :authored :paper1 .

# OWL reasoner AUTOMATICALLY infers:
:vaswani rdf:type :Researcher . # Derived, not stated!

OWL Ontology Definition

@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix research: <http://researcherai.org/ontology#> .

# Define classes
research:ResearchPaper rdf:type owl:Class .
research:Author rdf:type owl:Class .
research:Researcher rdf:type owl:Class .
research:Concept rdf:type owl:Class .
research:InfluentialPaper rdf:type owl:Class .

# Define class hierarchy
research:Author rdfs:subClassOf research:Researcher .
# All Authors are Researchers (but not all Researchers are Authors)

# Define properties
research:hasAuthor rdf:type owl:ObjectProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range research:Author .

research:cites rdf:type owl:ObjectProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range research:ResearchPaper .

research:citedBy rdf:type owl:ObjectProperty ;
owl:inverseOf research:cites . # Automatic inverse!

research:hasTitle rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:string .

research:publishedYear rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:integer .

research:citationCount rdf:type owl:DatatypeProperty ;
rdfs:domain research:ResearchPaper ;
rdfs:range xsd:integer .

# Define constraints
research:ResearchPaper rdfs:subClassOf [
rdf:type owl:Restriction ;
owl:onProperty research:hasAuthor ;
owl:minCardinality "1"^^xsd:nonNegativeInteger
] . # A paper must have at least one author

# Define derived class (automatic classification!)
research:InfluentialPaper owl:equivalentClass [
rdf:type owl:Restriction ;
owl:onProperty research:citationCount ;
owl:someValuesFrom [
rdf:type rdfs:Datatype ;
owl:onDatatype xsd:integer ;
owl:withRestrictions ([ xsd:minInclusive 100 ])
]
] . # Papers with 100+ citations are automatically "InfluentialPaper"

# Property characteristics
research:collaboratesWith rdf:type owl:SymmetricProperty .
# If A collaborates with B, then B collaborates with A

research:cites rdf:type owl:TransitiveProperty .
# If A cites B, and B cites C, then A transitively cites C

Development: Owlready2 with Reasoning

For development, use owlready2 - a Python library with built-in reasoner:

from owlready2 import *
import tempfile

class OWLKnowledgeGraph:
"""Development OWL ontology with reasoning"""

def __init__(self, ontology_iri="http://researcherai.org/ontology"):
self.onto = get_ontology(ontology_iri)

with self.onto:
# Define classes
class ResearchPaper(Thing):
pass

class Author(Thing):
pass

class Researcher(Thing):
pass

class Concept(Thing):
pass

class InfluentialPaper(ResearchPaper):
pass

# Define properties
class hasAuthor(ObjectProperty):
domain = [ResearchPaper]
range = [Author]

class authored(ObjectProperty):
domain = [Author]
range = [ResearchPaper]
inverse_property = hasAuthor

class cites(ObjectProperty, TransitiveProperty):
domain = [ResearchPaper]
range = [ResearchPaper]

class citedBy(ObjectProperty):
inverse_property = cites

class collaboratesWith(ObjectProperty, SymmetricProperty):
domain = [Author]
range = [Author]

class hasTitle(DataProperty, FunctionalProperty):
domain = [ResearchPaper]
range = [str]

class publishedYear(DataProperty, FunctionalProperty):
domain = [ResearchPaper]
range = [int]

class citationCount(DataProperty):
domain = [ResearchPaper]
range = [int]

# Define rules
class AuthorRule(Author >> Researcher):
"""All authors are researchers"""
pass

# Define automatic classification
class InfluentialPaperRule(ResearchPaper):
equivalent_to = [
ResearchPaper & citationCount.some(int >= 100)
]

self.ResearchPaper = self.onto.ResearchPaper
self.Author = self.onto.Author
self.hasAuthor = self.onto.hasAuthor
self.cites = self.onto.cites
self.hasTitle = self.onto.hasTitle
self.publishedYear = self.onto.publishedYear
self.citationCount = self.onto.citationCount

def add_paper(self, paper_id: str, title: str, year: int, citations: int = 0):
"""Add a research paper"""
paper = self.ResearchPaper(paper_id)
paper.hasTitle = [title]
paper.publishedYear = [year]
paper.citationCount = [citations]
return paper

def add_author(self, author_id: str, name: str):
"""Add an author"""
author = self.Author(author_id)
author.label = [name]
return author

def link_author_to_paper(self, author, paper):
"""Create authorship relationship"""
author.authored.append(paper)
# Inverse relationship is automatic!

def add_citation(self, citing_paper, cited_paper):
"""Add citation relationship"""
citing_paper.cites.append(cited_paper)
# citedBy is automatic (inverse property)!

def run_reasoner(self):
"""Run OWL reasoner to infer new facts"""
print("Running reasoner...")
with self.onto:
sync_reasoner(debug=False)
print("Reasoning complete!")

def find_influential_papers(self):
"""Find papers automatically classified as influential"""
return list(self.onto.InfluentialPaper.instances())

def find_all_researchers(self):
"""Find all researchers (including inferred ones)"""
return list(self.onto.Researcher.instances())

def save(self, filename: str):
"""Save ontology to file"""
self.onto.save(file=filename, format="rdfxml")

def load(self, filename: str):
"""Load ontology from file"""
self.onto = get_ontology(filename).load()


# Example usage with reasoning
owl_kg = OWLKnowledgeGraph()

# Add papers
paper1 = owl_kg.add_paper("attention2017", "Attention Is All You Need", 2017, 15000)
paper2 = owl_kg.add_paper("bert2019", "BERT", 2019, 8000)
paper3 = owl_kg.add_paper("transformer_xl", "Transformer-XL", 2019, 500)

# Add authors
vaswani = owl_kg.add_author("vaswani", "Ashish Vaswani")
devlin = owl_kg.add_author("devlin", "Jacob Devlin")

# Link relationships
owl_kg.link_author_to_paper(vaswani, paper1)
owl_kg.link_author_to_paper(devlin, paper2)

# Add citations
owl_kg.add_citation(paper2, paper1) # BERT cites Attention
owl_kg.add_citation(paper3, paper1) # Transformer-XL cites Attention

print("Before reasoning:")
print(f"Influential papers: {len(owl_kg.find_influential_papers())}")
print(f"Researchers: {len(owl_kg.find_all_researchers())}")

# Run reasoner
owl_kg.run_reasoner()

print("\nAfter reasoning:")
# Papers with 100+ citations are automatically classified as InfluentialPaper
influential = owl_kg.find_influential_papers()
print(f"Influential papers: {[p.hasTitle[0] for p in influential]}")

# Authors are automatically inferred to be Researchers
researchers = owl_kg.find_all_researchers()
print(f"Researchers: {[r.label[0] for r in researchers]}")

# Check inverse properties
print(f"\nVaswani authored: {[p.hasTitle[0] for p in vaswani.authored]}")
print(f"Paper1 has authors: {[a.label[0] for a in paper1.hasAuthor]}")
# Both work! Inverse is automatic.

# Check citedBy (inverse of cites)
print(f"\nPaper1 is cited by: {[p.hasTitle[0] for p in paper1.citedBy]}")
# Automatic from cites relationship!

OWL Reasoning Examples

1. Class Hierarchy Inference:

# Define hierarchy
with owl_kg.onto:
class Author(Thing):
pass

class PhDStudent(Author):
pass

class Professor(Author):
pass

# All Authors are Researchers
class AuthorIsResearcher(Author >> Researcher):
pass

# Create instance
phd_student = owl_kg.onto.PhDStudent("alice")

# Before reasoning
print(phd_student.is_a) # [PhDStudent]

# Run reasoner
sync_reasoner()

# After reasoning
print(phd_student.is_a) # [PhDStudent, Author, Researcher]
# Automatically inferred Alice is an Author and Researcher!

2. Property Propagation:

# Define transitive property
class influences(ObjectProperty, TransitiveProperty):
pass

# State facts
paper_a = ResearchPaper("paper_a")
paper_b = ResearchPaper("paper_b")
paper_c = ResearchPaper("paper_c")

paper_a.influences = [paper_b] # A influences B
paper_b.influences = [paper_c] # B influences C

# Run reasoner
sync_reasoner()

# Reasoner infers: A influences C (transitively)
print(paper_c in paper_a.influences) # True!

3. Automatic Classification:

# Define rule: Papers with many authors are "Collaborative"
with owl_kg.onto:
class CollaborativePaper(ResearchPaper):
equivalent_to = [
ResearchPaper & hasAuthor.min(5) # 5+ authors
]

# Add paper with 6 authors
paper = ResearchPaper("multi_author_paper")
for i in range(6):
author = Author(f"author_{i}")
paper.hasAuthor.append(author)

# Run reasoner
sync_reasoner()

# Paper is automatically classified as CollaborativePaper!
print(CollaborativePaper in paper.is_a) # True!

Production: Apache Jena with OWL Reasoner

For production, use Apache Jena with built-in OWL reasoners:

from rdflib import Graph, Namespace
from rdflib.plugins.sparql import prepareQuery

class JenaOWLKnowledgeGraph:
"""Production OWL knowledge graph with Jena reasoner"""

def __init__(self, fuseki_url: str = "http://localhost:3030/research"):
self.fuseki_url = fuseki_url
self.graph = Graph()
self.ns = Namespace("http://researcherai.org/ontology#")

# Load ontology schema
self.load_ontology()

def load_ontology(self):
"""Load OWL ontology definitions"""
ontology_ttl = """
@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix research: <http://researcherai.org/ontology#> .

research:Author rdfs:subClassOf research:Researcher .
research:authored rdfs:domain research:Author .
research:cites rdf:type owl:TransitiveProperty .
research:citedBy owl:inverseOf research:cites .
"""
self.graph.parse(data=ontology_ttl, format="turtle")

def query_with_reasoning(self, sparql_query: str):
"""Execute SPARQL with reasoning enabled"""
# Jena Fuseki can enable reasoner via endpoint config
# Example: http://localhost:3030/research_reasoned/sparql
results = self.graph.query(sparql_query)
return list(results)


# Configure Jena Fuseki with OWL reasoner
fuseki_config = """
<#service> rdf:type fuseki:Service ;
fuseki:name "research" ;
fuseki:serviceQuery "sparql" ;
fuseki:dataset <#dataset> .

<#dataset> rdf:type ja:DatasetTxnMem ;
ja:defaultGraph <#model_inf> .

<#model_inf> rdf:type ja:InfModel ;
ja:reasoner [
ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>
] ;
ja:baseModel <#model_base> .

<#model_base> rdf:type ja:MemoryModel .
"""

OWL Profiles: Which to Use?

OWL has different profiles (subsets) for different use cases:

Profile   | Complexity                  | Reasoning             | Use Case
OWL Full  | Maximum expressivity        | Undecidable           | Research, experimental
OWL DL    | Description Logic           | Complete & decidable  | General purpose
OWL Lite  | Basic class hierarchy       | Simple & fast         | Simple taxonomies
OWL EL    | Existential quantification  | Polynomial time       | Large ontologies (medical)
OWL QL    | Query-oriented              | Log-space             | Database integration
OWL RL    | Rule-based                  | Polynomial time       | Business rules

For ResearcherAI: Use OWL DL or OWL RL - balance of expressivity and performance.

When to Use OWL vs Just RDF

Use OWL when you need:

  1. Automatic classification

    • Classify papers as "influential" based on citation count
    • Identify "interdisciplinary" papers based on concept diversity
  2. Inference from rules

    • Infer co-authors from paper authorship
    • Derive expertise areas from publication history
  3. Consistency checking

    • Ensure papers have at least one author
    • Validate that publication years are reasonable
  4. Property inheritance

    • Symmetric properties (collaboration)
    • Transitive properties (influence, citation chains)
    • Inverse properties (cites ↔ citedBy)

Use just RDF when:

  1. Simple data storage - no complex reasoning needed
  2. Performance critical - reasoning is computationally expensive
  3. Schema is stable - don't need automatic classification
  4. Explicit is better - want to state all facts explicitly

OWL Reasoning: Dev vs Prod Comparison

Feature      | Owlready2 (Dev)           | Apache Jena (Prod)
Language     | Python                    | Java (Python client)
Reasoners    | HermiT, Pellet            | Jena, Pellet, HermiT
Performance  | Slower (Python)           | Faster (Java)
Scalability  | Small ontologies          | Large ontologies
Ease of Use  | Very easy (Pythonic)      | More complex setup
Integration  | Great for scripts         | Enterprise integration
Best for     | Development, prototyping  | Production, large scale

Example Use Case for ResearcherAI:

# Use OWL to automatically identify "rising stars" (researchers)
# Rule: A rising star is someone who:
# - Authored papers in the last 3 years
# - Has papers cited > 50 times
# - Collaborates with established researchers

with owl_kg.onto:
class EstablishedResearcher(Researcher):
equivalent_to = [
Researcher & authored.some(
ResearchPaper & citationCount.some(int >= 500)
)
]

class RisingStar(Researcher):
equivalent_to = [
Researcher &
authored.some(
ResearchPaper &
publishedYear.some(int >= 2021) &
citationCount.some(int >= 50)
) &
collaboratesWith.some(EstablishedResearcher)
]

# Run reasoner
sync_reasoner()

# Automatically finds rising stars!
rising_stars = list(owl_kg.onto.RisingStar.instances())
print(f"Rising stars: {[r.label[0] for r in rising_stars]}")
OWL Summary
  • OWL = RDF + reasoning/inference capabilities
  • Use for: Automatic classification, rule-based inference, consistency checking
  • Dev: owlready2 (easy Python integration)
  • Prod: Apache Jena (performance, scalability)
  • ResearcherAI: Could use OWL for researcher classification, paper categorization
OWL Performance

OWL reasoning can be computationally expensive. For large knowledge graphs (millions of triples), reasoning can take minutes to hours. Consider:

  • Using simpler OWL profiles (EL, QL, RL)
  • Pre-computing inferences offline (see the sketch after this list)
  • Using incremental reasoning
  • Caching reasoner results
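As an example of pre-computing inferences offline, the closure can be materialized once and the expanded graph served without a live reasoner. A minimal sketch using the owlrl package (an OWL RL reasoner for rdflib, assumed installed separately) on the Turtle file exported earlier:

from rdflib import Graph
import owlrl  # OWL RL reasoner for rdflib (pip install owlrl)

g = Graph()
g.parse("research_graph.ttl", format="turtle")  # asserted facts + ontology

# Materialize the OWL RL closure once, offline
owlrl.DeductiveClosure(owlrl.OWLRL_Semantics).expand(g)

# Serialize asserted + inferred triples; queries no longer need a reasoner
g.serialize(destination="research_graph_inferred.ttl", format="turtle")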

RDF vs Property Graphs: When to Use Each

Feature         | RDF (Jena/RDFLib)                         | Property Graphs (Neo4j)
Data Model      | Triples (subject-predicate-object)        | Nodes with properties + labeled edges
Schema          | Ontology (OWL)                            | Schema optional
Standards       | W3C standards (RDF, OWL, SPARQL)          | No universal standard
Query Language  | SPARQL                                    | Cypher
Reasoning       | Built-in inferencing (OWL reasoners)      | No built-in reasoning
Flexibility     | Extremely flexible, schema evolution      | More rigid structure
Performance     | Slower for graph traversal                | Optimized for graph queries
Use Case        | Scientific data, linked data, ontologies  | Social networks, recommendations
Learning Curve  | Steeper (ontologies, W3C specs)           | Gentler (more intuitive)

Web Developer Analogy:

  • RDF = XML/JSON-LD with strict schemas (TypeScript with interfaces)
  • Property Graphs = NoSQL document store with relationships (MongoDB + relationships)

When to Use RDF:

  1. Need formal ontologies - scientific domains, medical data
  2. Data integration - combining data from multiple sources
  3. Reasoning/inference - derive new facts from existing ones
  4. Linked open data - publish data others can link to
  5. Interoperability - strict W3C standards

Example: Medical knowledge graphs, DBpedia, Wikidata

When to Use Property Graphs:

  1. Graph algorithms - shortest path, community detection
  2. High-performance traversal - social networks, fraud detection
  3. Flexible schema - rapidly evolving data model
  4. Intuitive queries - easier to learn and use
  5. Real-time recommendations - e-commerce, content recommendations

Example: LinkedIn connections, recommendation engines, ResearcherAI

ResearcherAI's Approach

For ResearcherAI, we use property graphs (Neo4j) because:

  1. Better performance for citation traversal
  2. Simpler learning curve for developers
  3. Flexible schema - research data models evolve
  4. Cypher is intuitive - easier than SPARQL for most queries
  5. Neo4j has excellent tooling - Browser, Bloom, Graph Data Science

However, if you needed to:

  • Integrate with external ontologies (e.g., medical ontologies)
  • Publish linked open data
  • Use formal reasoning/inference
  • Comply with W3C standards

Then RDF with Apache Jena would be the better choice.

Hybrid Approach: RDF + Property Graphs

You can actually use both:

class HybridSemanticKnowledgeGraph:
"""Combine RDF (for ontology) with Neo4j (for performance)"""

def __init__(
self,
neo4j_kg: Neo4jKnowledgeGraph,
rdf_kg: RDFKnowledgeGraph
):
self.neo4j = neo4j_kg # For fast queries
self.rdf = rdf_kg # For ontology and reasoning

def add_paper(self, paper_data: dict):
"""Add to both stores"""
# Add to Neo4j for performance
self.neo4j.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)

# Add to RDF for formal semantics
self.rdf.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)

def query_with_reasoning(self, sparql_query: str):
"""Use RDF reasoner for inference"""
return self.rdf.query_sparql(sparql_query)

def query_with_performance(self, cypher_query: str):
"""Use Neo4j for fast graph traversal"""
return self.neo4j.query_cypher(cypher_query)

RDF/SPARQL: Dev vs Prod Comparison

Feature            | RDFLib (Dev)            | Apache Jena Fuseki (Prod)
Storage            | In-memory or file       | Persistent triple store
Query              | Python SPARQL           | HTTP SPARQL endpoint
Scalability        | 100k triples            | Billions of triples
Performance        | Slow for large graphs   | Optimized indices
Reasoning          | Basic                   | Full OWL reasoning
Concurrent Access  | No                      | Yes (multi-user)
Best for           | Development, testing    | Production, linked data

RDF vs Property Graphs Summary

  • RDF: Formal ontologies, reasoning, standards compliance, data integration
  • Property Graphs: Performance, graph algorithms, simpler queries, flexibility
  • ResearcherAI: Uses property graphs for performance, but you can use RDF if needed!

Building Knowledge Graphs: Construction Methods

Now that you understand what knowledge graphs are, let's explore how to build them from various data sources.

Three Main Approaches

There are three primary methods to construct knowledge graphs, depending on your data source structure:

1. Structured Sources (Relational Databases, CSV)

Characteristics:

  • Fixed schema (all entities of same type have same attributes)
  • Examples: SQL databases, CSV files, Excel spreadsheets
  • Easiest to convert to knowledge graphs

Method: Use mapping rules (like R2RML - RDB to RDF Mapping Language)

# Example: CSV to Knowledge Graph
import pandas as pd
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

# Source: papers.csv
# paper_id,title,year,author_id
# p1,"Attention Is All You Need",2017,a1
# p2,"BERT",2018,a2

df = pd.read_csv("papers.csv")
g = Graph()
ns = Namespace("http://example.org/")

for _, row in df.iterrows():
paper_uri = ns[row['paper_id']]

# Add triples
g.add((paper_uri, RDF.type, ns.ResearchPaper))
g.add((paper_uri, ns.hasTitle, Literal(row['title'])))
g.add((paper_uri, ns.publishedYear, Literal(row['year'])))
g.add((paper_uri, ns.hasAuthor, ns[row['author_id']]))

# Result: Knowledge graph with papers, titles, years, authors

Mapping Rules (R2RML):

# R2RML mapping: SQL table → RDF
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#PaperMapping> a rr:TriplesMap ;
rr:logicalTable [ rr:tableName "papers" ] ;
rr:subjectMap [
rr:template "http://example.org/paper/{paper_id}" ;
rr:class :ResearchPaper
] ;
rr:predicateObjectMap [
rr:predicate :hasTitle ;
rr:objectMap [ rr:column "title" ]
] .

2. Semi-Structured Sources (JSON, XML)

Characteristics:

  • Flexible schema (entities of same type may have different attributes)
  • Examples: JSON APIs, XML documents, HTML pages
  • Moderate complexity to convert

Method: Use mapping rules adapted to the structure

import json
from rdflib import Graph, Namespace, Literal

# Source: papers.json
json_data = {
"paper1": {
"title": "Attention Is All You Need",
"year": 2017,
"authors": ["Vaswani", "Shazeer"], # Variable length!
"venue": "NIPS" # Optional field
},
"paper2": {
"title": "BERT",
"year": 2018,
"authors": ["Devlin"] # Different number of authors
# No venue field!
}
}

g = Graph()
ns = Namespace("http://example.org/")

for paper_id, paper_data in json_data.items():
paper_uri = ns[paper_id]
g.add((paper_uri, ns.hasTitle, Literal(paper_data['title'])))
g.add((paper_uri, ns.publishedYear, Literal(paper_data['year'])))

# Handle variable-length authors
for author in paper_data.get('authors', []):
g.add((paper_uri, ns.hasAuthor, Literal(author)))

# Handle optional venue
if 'venue' in paper_data:
g.add((paper_uri, ns.publishedAt, Literal(paper_data['venue'])))

3. Unstructured Sources (Text, Images, PDFs)

Characteristics:

  • No fixed schema at all
  • Examples: Natural language text, research papers (PDF), images
  • Most complex to convert - requires AI/NLP

Method: Use NLP techniques to extract entities and relationships

# Example: Extract entities and relationships from text
from transformers import pipeline

# NLP model for Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

text = """
Attention Is All You Need was published in 2017 by Vaswani and colleagues at Google.
The paper introduced the Transformer architecture for sequence-to-sequence tasks.
"""

# Extract entities
entities = ner(text)
# Result (illustrative): the model tags spans such as
#   {"word": "Vaswani", "entity": "PER"}
#   {"word": "Google", "entity": "ORG"}
# Note: an off-the-shelf CoNLL-03 NER model covers persons, organizations and
# locations; extracting paper titles, dates and domain concepts needs
# additional models or rules.

# Extract relationships (requires relation extraction model)
# "Attention Is All You Need" -[published_in]-> "2017"
# "Attention Is All You Need" -[authored_by]-> "Vaswani"
# "Vaswani" -[works_at]-> "Google"

# Convert to knowledge graph triples
g = Graph()
ns = Namespace("http://example.org/")

g.add((ns.attention_paper, ns.publishedYear, Literal(2017)))
g.add((ns.attention_paper, ns.hasAuthor, ns.vaswani))
g.add((ns.vaswani, ns.worksAt, ns.google))

Knowledge Graph Construction Process

Here's the end-to-end process for building production knowledge graphs:

Step 0: Define Use Case and Scope

Before building, answer:

  • What questions do we need to answer? ("Which papers cite X?", "Who are experts in Y?")
  • What metadata do we need? (papers, authors, citations, concepts)
  • What relationships matter? (cites, authored_by, discusses)

Example for ResearcherAI:

use_case = {
"questions": [
"Find papers related to transformers",
"Who are the leading researchers in NLP?",
"What papers cite the Attention paper?"
],
"metadata": ["papers", "authors", "citations", "concepts"],
"relationships": ["cites", "authored_by", "discusses", "collaborates_with"]
}

Step 1: Data Collection

Gather data from various sources across your ecosystem:

# Example: Collect from multiple sources
sources = {
"papers_db": "SELECT * FROM papers", # SQL database
"arxiv_api": "https://api.arxiv.org/papers", # REST API
"paper_pdfs": "/data/pdfs/*.pdf", # Unstructured files
"citations_csv": "/data/citations.csv" # CSV file
}

# Each source "speaks" a different language:
# - SQL: Relational tables
# - API: JSON
# - PDFs: Unstructured text
# - CSV: Tabular data

Step 2: Data Cleaning and Standardization

Ensure data quality by:

  • Standardizing: Convert all dates to ISO format (YYYY-MM-DD)
  • Deduplicating: Merge duplicate author entries
  • Validating: Check data types, required fields
  • Resolving: Handle inconsistencies (same paper, different IDs)

import pandas as pd

# Example: Clean author data
authors_raw = pd.read_csv("authors.csv")

# Standardize names
authors_raw['name'] = authors_raw['name'].str.strip().str.title()

# Remove duplicates (same email = same person)
authors_clean = authors_raw.drop_duplicates(subset=['email'])

# Validate required fields
authors_clean = authors_clean.dropna(subset=['name', 'affiliation'])

# Resolve inconsistencies (assign unique IDs)
authors_clean['author_id'] = range(len(authors_clean))

Step 3: Data Modeling (Convert to RDF Triples)

Transform cleaned data into standardized RDF triples:

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

g = Graph()
ns = Namespace("http://researcherai.org/")

# For each paper in cleaned data
for _, paper in papers_clean.iterrows():
paper_uri = ns[f"paper/{paper['id']}"]

# Subject-Predicate-Object triples
g.add((paper_uri, RDF.type, ns.ResearchPaper))
g.add((paper_uri, ns.hasTitle, Literal(paper['title'])))
g.add((paper_uri, ns.publishedYear, Literal(paper['year'])))

# Relationships to other entities
for author_id in paper['authors']:
author_uri = ns[f"author/{author_id}"]
g.add((paper_uri, ns.hasAuthor, author_uri))

Step 4: Usage and Insights

Now query the knowledge graph to deliver value:

# Find papers by author
SELECT ?paper ?title WHERE {
?author :hasName "Ashish Vaswani" .
?paper :hasAuthor ?author .
?paper :hasTitle ?title .
}

# Find citation paths
SELECT ?citing ?cited WHERE {
?citing :cites+ ?cited . # Transitive: 1+ hops
}
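
As written, these snippets need a PREFIX declaration before they will run. Here is a minimal sketch of executing the first one with rdflib, assuming the graph `g` built above and that author names were also stored under `:hasName`:

find_by_author = """
PREFIX : <http://researcherai.org/>
SELECT ?paper ?title WHERE {
    ?author :hasName "Ashish Vaswani" .
    ?paper :hasAuthor ?author .
    ?paper :hasTitle ?title .
}
"""

for row in g.query(find_by_author):
    print(row.paper, row.title)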

Comparison of Construction Methods

| Source Type | Complexity | Tools | Best For |
|---|---|---|---|
| Structured | ⭐ Low | R2RML, pandas | Databases, CSV, Excel |
| Semi-Structured | ⭐⭐ Medium | JSON/XML parsers | APIs, config files |
| Unstructured | ⭐⭐⭐ High | NLP, LLMs, OCR | PDFs, text, images |

ResearcherAI's Approach:

  1. Structured: arXiv metadata (JSON API) → Easy conversion
  2. Semi-Structured: Paper metadata from multiple APIs → JSON parsing
  3. Unstructured: Paper PDFs → NLP for concept extraction

Construction Best Practices
  • Start with structured sources - easiest to convert and validate
  • Clean data thoroughly - garbage in, garbage out
  • Define schema first - know what entities and relationships you need
  • Validate incrementally - check quality at each step
  • Use existing ontologies - don't reinvent the wheel (e.g., Schema.org)

Hands-On: Building a Research Paper Knowledge Graph

Now let's walk through a complete example of building a knowledge graph from structured data using the declarative SPARQL CONSTRUCT approach.

What You'll Learn:

  • How to transform CSV data into RDF triples
  • Using SPARQL CONSTRUCT queries for mapping
  • Incrementally building a knowledge graph
  • Visualizing the resulting graph

Step 1: Input Data

Imagine you have research paper data in CSV files:

papers.csv:

domain,title,year,abstract
NLP,Attention Is All You Need,2017,Transformer architecture for sequence-to-sequence
NLP,BERT,2018,Bidirectional encoder representations
CV,ResNet,2015,Deep residual learning for image recognition

authors.csv:

name,affiliation,domain
Ashish Vaswani,Google Brain,NLP
Jacob Devlin,Google AI,NLP
Kaiming He,Facebook AI,CV

citations.csv:

citing_paper,cited_paper,citation_type
BERT,Attention Is All You Need,builds_on
ResNet,VGGNet,improves

concepts.csv:

paper,concept,importance
Attention Is All You Need,self-attention,high
Attention Is All You Need,transformers,high
BERT,bidirectional,high
ResNet,residual-connections,high

Let's load this data:

import pandas as pd
from rdflib import Graph, Literal, Namespace
from rdflib.plugins.sparql.processor import prepareQuery

# Load CSV files
papers_df = pd.read_csv("papers.csv").fillna('')
authors_df = pd.read_csv("authors.csv").fillna('')
citations_df = pd.read_csv("citations.csv").fillna('')
concepts_df = pd.read_csv("concepts.csv").fillna('')

# Show distribution
data = {
"Papers": len(papers_df),
"Authors": len(authors_df),
"Citations": len(citations_df),
"Concepts": len(concepts_df)
}
print(pd.DataFrame.from_dict(data, orient='index', columns=['Count']))
# Output:
# Count
# Papers 3
# Authors 3
# Citations 3
# Concepts 4

Step 2: Define the Knowledge Graph Schema

Based on our data, we define the schema:

# Schema for Research Papers
@prefix research: <http://example.org/research#> .
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .

# Classes (Entity Types)
research:Paper a rdfs:Class .
research:Author a rdfs:Class .
research:Concept a rdfs:Class .
research:ResearchDomain a rdfs:Class .

# Properties
research:hasTitle a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:string .

research:publishedYear a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:integer .

research:hasAbstract a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range xsd:string .

# Relationships
research:authoredBy a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Author .

research:cites a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Paper .

research:discusses a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:Concept .

research:belongsToDomain a rdf:Property ;
rdfs:domain research:Paper ;
rdfs:range research:ResearchDomain .

Visualization of the schema: Paper connects to Author via authoredBy, to other Papers via cites, to Concept via discusses, and to ResearchDomain via belongsToDomain.

Step 3: SPARQL CONSTRUCT Queries for Mapping

Now we define SPARQL CONSTRUCT queries to transform CSV data into RDF triples:

Query 1: Create Paper Entities

PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:hasAbstract ?abstract .
?paper research:belongsToDomain ?domainIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?title, " ", "_"))) AS ?paper)
BIND(IRI(CONCAT("http://data.example.org/domain/",
?domain)) AS ?domainIRI)
}

Web Developer Analogy:

// SPARQL CONSTRUCT is like a template for creating objects
const papers = csvData.map(row => ({
id: `http://data.example.org/paper/${row.title.replace(/ /g, '_')}`,
type: "Paper",
title: row.title,
year: row.year,
abstract: row.abstract,
domain: `http://data.example.org/domain/${row.domain}`
}))

Query 2: Create Author Entities

PREFIX research: <http://example.org/research#>
CONSTRUCT {
?author a research:Author .
?author research:hasName ?name .
?author research:affiliation ?affiliation .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/author/",
REPLACE(?name, " ", "_"))) AS ?author)
}

Query 3: Create Citation Relationships

PREFIX research: <http://example.org/research#>
CONSTRUCT {
?citingPaperIRI research:cites ?citedPaperIRI .
?citingPaperIRI research:citationType ?citation_type .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}

Query 4: Create Concept Relationships

PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paperIRI research:discusses ?conceptIRI .
?conceptIRI research:importance ?importance .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?paper, " ", "_"))) AS ?paperIRI)
BIND(IRI(CONCAT("http://data.example.org/concept/",
?concept)) AS ?conceptIRI)
}

Step 4: The Transform Function

This function applies SPARQL CONSTRUCT queries to DataFrame rows:

import re
from rdflib import Graph, Literal
from rdflib.plugins.sparql.processor import prepareQuery

def transform(df: pd.DataFrame, construct_query: str,
first: bool = False) -> Graph:
"""Transform Pandas DataFrame to RDFLib Graph using SPARQL CONSTRUCT.

Args:
df: Input DataFrame with CSV data
construct_query: SPARQL CONSTRUCT query template
first: If True, only process first row (for testing)

Returns:
RDF Graph with constructed triples
"""
# Setup graphs
query_graph = Graph()
result_graph = Graph()

# Parse the SPARQL query
query = prepareQuery(construct_query)

# Clean column names (remove special characters)
invalid_pattern = re.compile(r"[^\w_]+")
headers = dict((k, invalid_pattern.sub("_", k)) for k in df.columns)

# Process each row
for _, row in df.iterrows():
# Create variable bindings: column name -> cell value
binding = dict((headers[k], Literal(row[k]))
for k in df.columns if len(str(row[k])) > 0)

# Execute query with bindings
results = query_graph.query(query, initBindings=binding)

# Add resulting triples to graph
for triple in results:
result_graph.add(triple)

# Stop after first row if testing
if first:
break

return result_graph

How It Works:

  1. Parse query: Prepare the SPARQL CONSTRUCT template
  2. For each row: Create variable bindings (CSV columns → SPARQL variables)
  3. Execute query: Replace variables with values, construct triples
  4. Add to graph: Accumulate all triples in result graph

Step 5: Build the Knowledge Graph Incrementally

# Initialize empty knowledge graph
kg = Graph()

# Step 1: Add papers
construct_papers = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:hasAbstract ?abstract .
?paper research:belongsToDomain ?domainIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?title, " ", "_"))) AS ?paper)
BIND(IRI(CONCAT("http://data.example.org/domain/",
?domain)) AS ?domainIRI)
}
"""

# Test with first row
print("Testing with first paper:")
print(transform(papers_df, construct_papers, first=True).serialize(format='turtle'))

# Output:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
# a research:Paper ;
# research:hasTitle "Attention Is All You Need" ;
# research:publishedYear 2017 ;
# research:hasAbstract "Transformer architecture..." ;
# research:belongsToDomain <http://data.example.org/domain/NLP> .

# Add all papers to knowledge graph
kg += transform(papers_df, construct_papers)
print(f"After adding papers: {len(kg)} triples")
# Output: After adding papers: 12 triples (3 papers × 4 properties each)

# Step 2: Add authors
construct_authors = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?author a research:Author .
?author research:hasName ?name .
?author research:affiliation ?affiliation .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/author/",
REPLACE(?name, " ", "_"))) AS ?author)
}
"""

kg += transform(authors_df, construct_authors)
print(f"After adding authors: {len(kg)} triples")
# Output: After adding authors: 21 triples

# Step 3: Add citations
construct_citations = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?citingPaperIRI research:cites ?citedPaperIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}
"""

kg += transform(citations_df, construct_citations)
print(f"After adding citations: {len(kg)} triples")
# Output: After adding citations: 24 triples

# Step 4: Add concepts
construct_concepts = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
?paperIRI research:discusses ?conceptIRI .
}
WHERE {
BIND(IRI(CONCAT("http://data.example.org/paper/",
REPLACE(?paper, " ", "_"))) AS ?paperIRI)
BIND(IRI(CONCAT("http://data.example.org/concept/",
?concept)) AS ?conceptIRI)
}
"""

kg += transform(concepts_df, construct_concepts)
print(f"Final knowledge graph: {len(kg)} triples")
# Output: Final knowledge graph: 28 triples

Step 6: Query the Knowledge Graph

Now we can query the constructed graph:

# Query 1: Find all NLP papers
query_nlp_papers = """
PREFIX research: <http://example.org/research#>
SELECT ?title ?year
WHERE {
?paper a research:Paper .
?paper research:hasTitle ?title .
?paper research:publishedYear ?year .
?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
}
ORDER BY ?year
"""

results = kg.query(query_nlp_papers)
for row in results:
print(f"{row.title} ({row.year})")
# Output:
# Attention Is All You Need (2017)
# BERT (2018)

# Query 2: Find papers citing "Attention Is All You Need"
query_citations = """
PREFIX research: <http://example.org/research#>
SELECT ?citing_title
WHERE {
?citing research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> .
?citing research:hasTitle ?citing_title .
}
"""

results = kg.query(query_citations)
for row in results:
print(f"Paper citing Attention: {row.citing_title}")
# Output: Paper citing Attention: BERT

# Query 3: Find all concepts discussed in NLP papers
query_concepts = """
PREFIX research: <http://example.org/research#>
SELECT ?concept
WHERE {
?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
?paper research:discusses ?conceptIRI .
BIND(REPLACE(STR(?conceptIRI), ".*/", "") AS ?concept)
}
"""

results = kg.query(query_concepts)
concepts = [row.concept for row in results]
print(f"NLP concepts: {', '.join(concepts)}")
# Output: NLP concepts: self-attention, transformers, bidirectional

Step 7: Visualize the Knowledge Graph

import networkx as nx
import matplotlib.pyplot as plt
from rdflib import Graph, URIRef  # URIRef is needed below to distinguish IRIs from literals

def rdf_to_nx(rdf_graph: Graph) -> nx.DiGraph:
"""Convert RDF graph to NetworkX directed graph."""
G = nx.DiGraph()

for s, p, o in rdf_graph:
# Extract local names (remove URI prefixes)
subject = str(s).split('/')[-1]
predicate = str(p).split('#')[-1]
obj = str(o).split('/')[-1] if isinstance(o, URIRef) else str(o)

# Add nodes and edges
G.add_edge(subject, obj, label=predicate)

return G

# Convert to NetworkX
G = rdf_to_nx(kg)

# Visualize
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(G, seed=42)

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=1000, node_color='lightblue')

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='gray', arrows=True,
arrowsize=20, connectionstyle='arc3,rad=0.1')

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=8)

# Draw edge labels
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6)

plt.title("Research Paper Knowledge Graph")
plt.axis('off')
plt.tight_layout()
plt.show()

Step 8: Save the Knowledge Graph

# Save to Turtle file (human-readable RDF format)
kg.serialize(destination='research_papers.ttl', format='turtle')
print("Knowledge graph saved to research_papers.ttl")

# The file content looks like:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
# a research:Paper ;
# research:hasTitle "Attention Is All You Need" ;
# research:publishedYear 2017 ;
# research:belongsToDomain <http://data.example.org/domain/NLP> ;
# research:discusses <http://data.example.org/concept/self-attention>,
# <http://data.example.org/concept/transformers> .
#
# <http://data.example.org/paper/BERT>
# a research:Paper ;
# research:hasTitle "BERT" ;
# research:publishedYear 2018 ;
# research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> ;
# research:discusses <http://data.example.org/concept/bidirectional> .
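
Loading the saved graph back later is equally simple:

from rdflib import Graph

kg_loaded = Graph()
kg_loaded.parse("research_papers.ttl", format="turtle")
print(f"Reloaded {len(kg_loaded)} triples")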

Key Takeaways

Declarative Approach Benefits:

  1. Separation of concerns: Data (CSV) separate from logic (SPARQL)
  2. Reusable queries: Same query works for any CSV with same schema
  3. Incremental building: Add entities and relationships step-by-step
  4. Easy to validate: Test queries on single rows first
  5. Standard-based: SPARQL is W3C standard

Process Summary: load the CSVs → define the schema → write SPARQL CONSTRUCT mappings → run the transform row by row → query, visualize, and save the resulting graph.

When to Use This Approach:

  • ✅ Have structured data (CSV, databases)
  • ✅ Need standard-compliant knowledge graphs
  • ✅ Want to query with SPARQL
  • ✅ Require formal schema/ontology
  • ✅ Building production knowledge graphs

ResearcherAI Uses This For:

  • arXiv paper metadata → Knowledge graph
  • Citation networks from Semantic Scholar
  • Author collaboration graphs
  • Concept hierarchies from papers

Declarative vs Imperative

Declarative (SPARQL CONSTRUCT): "What you want" - define the shape of the output graph.
Imperative (Python loops): "How to do it" - step-by-step instructions.

SPARQL CONSTRUCT is declarative - you describe the desired graph structure, and the engine figures out how to create it. This is more maintainable and less error-prone than imperative loops.
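
For contrast, here is roughly what the imperative version of the paper mapping looks like: the same triples the CONSTRUCT query produces, but built by hand (a sketch that assumes the `papers_df` DataFrame loaded earlier):

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import RDF

RESEARCH = Namespace("http://example.org/research#")
g_imperative = Graph()

for _, row in papers_df.iterrows():
    # "How to do it": construct every IRI and triple explicitly
    paper = URIRef("http://data.example.org/paper/" + row["title"].replace(" ", "_"))
    domain = URIRef("http://data.example.org/domain/" + row["domain"])
    g_imperative.add((paper, RDF.type, RESEARCH.Paper))
    g_imperative.add((paper, RESEARCH.hasTitle, Literal(row["title"])))
    g_imperative.add((paper, RESEARCH.publishedYear, Literal(row["year"])))
    g_imperative.add((paper, RESEARCH.hasAbstract, Literal(row["abstract"])))
    g_imperative.add((paper, RESEARCH.belongsToDomain, domain))

The output is the same set of triples; the difference is that the mapping logic now lives in Python control flow instead of a declarative query you can inspect, reuse, and validate on its own.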

Production Decision: Neo4j vs Apache Jena Fuseki

Critical Understanding: Neo4j and Apache Jena Fuseki are NOT interchangeable alternatives. They serve fundamentally different use cases!

The Key Question: Which data model fits your use case?

When to Use Neo4j in Production

Use Neo4j when you need:

  1. High-Performance Graph Traversal

    // Find shortest path between papers (fast!)
    MATCH path = shortestPath(
    (a:Paper {id: "paper1"})-[:CITES*]-(b:Paper {id: "paper50"})
    )
    RETURN path
    • Neo4j is optimized for this (index-free adjacency)
    • Jena/RDF is much slower for deep graph traversal
  2. Graph Algorithms

    • PageRank, Louvain community detection
    • Shortest paths, centrality measures
    • Neo4j Graph Data Science library
    • RDF/Jena: No built-in graph algorithms
  3. Real-Time Recommendations

    • Friend recommendations (social networks)
    • Paper recommendations based on citations
    • Collaborative filtering
    • Performance critical - Neo4j is faster
  4. Intuitive Queries

    // Cypher is very readable
    MATCH (p:Paper)-[:CITES]->(cited:Paper)
    WHERE p.year > 2020
    RETURN cited.title, count(*) as citations
    ORDER BY citations DESC
    • Easier for developers to learn than SPARQL
    • Better tooling (Neo4j Browser, Bloom)
  5. Flexible Schema

    • Add new node labels and edge types dynamically
    • Schema evolves with your application
    • No formal ontology needed

Example: ResearcherAI uses Neo4j for:

  • Citation network traversal
  • Finding related papers (graph algorithms)
  • Author collaboration networks
  • Fast query performance

When to Use Apache Jena Fuseki in Production

Use Apache Jena Fuseki when you need:

  1. Formal Ontologies

    # Define strict schema
    :ResearchPaper rdfs:subClassOf :Publication .
    :ConferencePaper rdfs:subClassOf :ResearchPaper .
    :JournalPaper rdfs:subClassOf :ResearchPaper .
    • W3C standard ontologies (OWL)
    • Strict domain modeling
    • Neo4j: No formal ontology support
  2. Reasoning and Inference

    # Automatic inference: If A cites B and B cites C, find transitive citations
    SELECT ?paper ?influenced
    WHERE {
    :paper1 :cites+ ?influenced . # Transitive closure
    }
    • OWL reasoners infer new facts
    • Automatic classification
    • Neo4j: No built-in reasoning
  3. W3C Standards Compliance

    • Publishing Linked Open Data (LOD)
    • Interoperability with DBpedia, Wikidata
    • RDF, SPARQL, OWL standards
    • Neo4j: Proprietary (Cypher is not W3C standard)
  4. Data Integration

    • Merging data from multiple RDF sources
    • Schema mapping and alignment
    • Federated SPARQL queries across endpoints
    • Neo4j: Harder to integrate with external sources
  5. Scientific/Medical Domains

    • Existing domain ontologies (Gene Ontology, SNOMED CT)
    • Formal knowledge representation
    • Regulatory compliance requirements
    • Neo4j: Not suitable for formal ontologies

Example: Use Jena Fuseki for:

  • Medical knowledge graphs (SNOMED, ICD-10)
  • Scientific literature with formal taxonomies
  • Publishing linked open data
  • Integration with Wikidata/DBpedia

Production Performance Comparison

| Operation | Neo4j | Apache Jena Fuseki | Winner |
|---|---|---|---|
| Graph Traversal (5 hops) | ~10ms | ~500ms+ | Neo4j (50x faster) |
| Complex Cypher/SPARQL | Fast | Moderate | Neo4j |
| Write Throughput | 10k-100k/sec | 1k-10k/sec | Neo4j |
| OWL Reasoning | Not supported | Supported | Jena |
| Inference | Manual (application) | Automatic (reasoner) | Jena |
| Standards Compliance | Proprietary | W3C standards | Jena |
| Graph Algorithms | Built-in (GDS) | Not supported | Neo4j |
| Storage Efficiency | Good | Moderate (triples overhead) | Neo4j |

When to Use BOTH (Hybrid Architecture)

You can use both together for the best of both worlds:

class HybridProductionKnowledgeGraph:
"""Use Neo4j for performance, Jena for reasoning"""

def __init__(self):
# Neo4j for fast graph operations
self.neo4j = Neo4jKnowledgeGraph(
uri="bolt://neo4j-prod.example.com:7687"
)

# Jena for ontology and reasoning
self.jena = JenaKnowledgeGraph(
endpoint_url="http://fuseki-prod.example.com:3030/research/sparql"
)

def add_paper(self, paper_data: dict):
"""Add to both databases"""
# Add to Neo4j for fast queries
self.neo4j.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)

# Add to Jena for reasoning
self.jena.add_paper(
paper_data["id"],
paper_data["title"],
paper_data["year"]
)

def find_related_papers(self, paper_id: str):
"""Use Neo4j for fast graph traversal"""
return self.neo4j.find_related_papers(paper_id)

def classify_paper_topic(self, paper_id: str):
"""Use Jena reasoner for automatic classification"""
sparql = f"""
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?topic
WHERE {{
<http://researcherai.org/papers/{paper_id}>
research:hasInferredTopic ?topic .
}}
"""
return self.jena.query(sparql)

def sync_databases(self):
"""Periodically sync data between Neo4j and Jena"""
# Export from Neo4j, import to Jena
# Or vice versa
pass

Use Hybrid When:

  • Need both fast traversal AND formal reasoning
  • Want graph algorithms + automatic classification
  • Building enterprise knowledge graph with complex requirements
  • Have resources to maintain two databases

Trade-offs:

  • ❌ More complex architecture
  • ❌ Data synchronization overhead
  • ❌ Higher infrastructure costs
  • ✅ Best of both worlds

ResearcherAI's Production Choice: Neo4j

Why ResearcherAI uses Neo4j (not Jena) in production:

# ResearcherAI's production config
PRODUCTION_CONFIG = {
"knowledge_graph": "Neo4j", # Not Apache Jena
# Why Neo4j? See reasons below
}

Reasons:

  1. Primary Use Case: Citation Network Traversal

    • Finding related papers by citation paths
    • Author collaboration networks
    • Paper recommendation based on graph structure
    • Neo4j excels at this, Jena is slow
  2. Performance Requirements

    • Real-time query responses (under 100ms)
    • High read throughput (1000s queries/sec)
    • Neo4j is 10-50x faster for graph queries
  3. No Formal Ontology Needed

    • Research paper schema is relatively simple
    • Don't need OWL reasoning
    • Don't need automatic inference
    • Jena's strengths aren't needed
  4. Developer Experience

    • Cypher is easier to learn than SPARQL
    • Better visualization tools (Neo4j Browser)
    • Larger community and resources
    • Faster development
  5. Graph Algorithms

    • Use PageRank to find influential papers
    • Community detection for research clusters
    • Shortest paths for paper relationships
    • Neo4j Graph Data Science library

When ResearcherAI WOULD use Jena instead:

If the requirements were:

  • ❌ Need formal research ontologies (ACM Computing Classification)
  • ❌ Must publish linked open data
  • ❌ Need automatic paper classification via reasoning
  • ❌ Integrate with existing RDF sources (DBpedia)
  • ❌ W3C standards compliance required

Then use Apache Jena Fuseki in production.

Quick Decision Guide

Choose Neo4j if:

  • ✅ Primary use case: graph traversal, recommendations
  • ✅ Need high performance (real-time queries)
  • ✅ Want graph algorithms (PageRank, community detection)
  • ✅ Flexible schema, rapid development
  • ✅ Team familiar with SQL-like queries (Cypher)

Choose Apache Jena Fuseki if:

  • ✅ Need formal ontologies (OWL)
  • ✅ Require reasoning and inference
  • ✅ Publishing linked open data
  • ✅ W3C standards compliance
  • ✅ Integrating with existing RDF sources

Choose Both (Hybrid) if:

  • ✅ Need graph performance AND reasoning
  • ✅ Have resources for complex architecture
  • ✅ Enterprise requirements

For Most Projects: Start with Neo4j. Only add Jena if you truly need formal ontologies or reasoning.

Real-World Production Examples

Companies Using Neo4j:

  • LinkedIn - Professional network graph
  • eBay - Product recommendations
  • Airbnb - Location-based search
  • Walmart - Supply chain optimization
  • NASA - Lessons learned database

Companies/Projects Using Apache Jena:

  • BBC - Linked data for content
  • Wikidata - Knowledge base
  • Getty Vocabularies - Art metadata
  • UK Government - Open data publishing
  • PubMed - Biomedical ontologies

Notice the Pattern:

  • Neo4j: Performance-critical, real-time applications
  • Jena: Formal knowledge, standards, publishing

Production Database Selection
  • Default choice for most projects: Neo4j (performance, ease of use)
  • Choose Jena if: You need formal ontologies, reasoning, or standards compliance
  • ResearcherAI uses Neo4j: Because citation networks need fast traversal, not formal reasoning
  • Start simple: Begin with one database, add the other only if truly needed

Don't Over-Engineer

Many projects think they need formal ontologies and reasoning, but actually just need a fast graph database. Start with Neo4j. Only add Jena if you have a clear use case for OWL reasoning or standards compliance.


Part 3: Hybrid RAG - Best of Both Worlds

Neither vector search nor knowledge graphs alone are sufficient:

Vector Search Alone:

  • ✅ Finds semantically similar content
  • ❌ Can't answer relationship questions
  • ❌ No structured reasoning
  • ❌ Can't follow citations, collaborations

Knowledge Graph Alone:

  • ✅ Excellent at relationships
  • ✅ Multi-hop reasoning
  • ❌ Requires exact entity matches
  • ❌ Can't do semantic similarity

Solution: Hybrid RAG - Combine both!

Hybrid RAG Architecture

Implementing Hybrid RAG

from typing import List, Dict, Tuple
from enum import Enum

class QueryType(Enum):
SEMANTIC = "semantic" # "Papers about attention mechanisms"
RELATIONAL = "relational" # "Papers citing X"
HYBRID = "hybrid" # "Papers about Y citing X"

class HybridRAG:
"""Hybrid RAG combining vector search and knowledge graphs"""

def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model

def classify_query(self, query: str) -> QueryType:
"""Determine query type"""
# Keywords indicating relational queries
relational_keywords = [
"cite", "cites", "citing", "cited",
"author", "authored", "written by",
"collaborate", "coauthor",
"published in", "appeared in"
]

# Keywords indicating semantic queries
semantic_keywords = [
"about", "discuss", "related to",
"similar to", "like", "regarding"
]

query_lower = query.lower()

has_relational = any(kw in query_lower for kw in relational_keywords)
has_semantic = any(kw in query_lower for kw in semantic_keywords)

if has_relational and has_semantic:
return QueryType.HYBRID
elif has_relational:
return QueryType.RELATIONAL
else:
return QueryType.SEMANTIC

def semantic_search(self, query: str, top_k: int = 5) -> List[Dict]:
"""Pure vector search"""
query_embedding = self.embedding_model.encode(query)
results = self.vector_store.search(query_embedding, top_k)

return [
{
"text": text,
"score": score,
"source": "vector"
}
for text, score in results
]

def relational_search(self, query: str) -> List[Dict]:
"""Pure graph search"""
# Parse query for entities and relationships
# Simplified - in production, use NER + intent detection

if "citing" in query.lower():
# Extract paper being cited
# Simplified extraction
cited_paper = self._extract_paper_mention(query)
citing_papers = self.knowledge_graph.find_citing_papers(cited_paper)

return [
{
"text": paper["title"],
"year": paper["year"],
"source": "graph"
}
for paper in citing_papers
]

elif "author" in query.lower():
author_name = self._extract_author_name(query)
papers = self.knowledge_graph.find_papers_by_author(author_name)

return [
{
"text": paper["title"],
"year": paper["year"],
"source": "graph"
}
for paper in papers
]

return []

def hybrid_search(
self,
query: str,
top_k: int = 10
) -> List[Dict]:
"""Combined vector + graph search"""

# Step 1: Vector search for semantic similarity
semantic_results = self.semantic_search(query, top_k=top_k)

# Step 2: Graph search for relationships
relational_results = self.relational_search(query)

# Step 3: Merge and deduplicate
all_results = semantic_results + relational_results
seen = set()
unique_results = []

for result in all_results:
key = result["text"]
if key not in seen:
seen.add(key)
unique_results.append(result)

# Step 4: Re-rank using both semantic and structural scores
reranked = self._rerank_results(unique_results, query)

return reranked[:top_k]

def _rerank_results(self, results: List[Dict], query: str) -> List[Dict]:
"""Re-rank results combining semantic + structural scores"""

for result in results:
# Semantic score from vector search
semantic_score = result.get("score", 0.5)

# Structural score from graph (e.g., citation count, centrality)
structural_score = 0.5 # Simplified

# Combined score (weighted average)
result["final_score"] = 0.6 * semantic_score + 0.4 * structural_score

# Sort by final score
results.sort(key=lambda x: x.get("final_score", 0), reverse=True)

return results

def search(self, query: str, top_k: int = 5) -> List[Dict]:
"""Main search interface - automatically routes to appropriate method"""

query_type = self.classify_query(query)

if query_type == QueryType.SEMANTIC:
return self.semantic_search(query, top_k)
elif query_type == QueryType.RELATIONAL:
return self.relational_search(query)
else: # HYBRID
return self.hybrid_search(query, top_k)

# Usage
hybrid_rag = HybridRAG(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)

# Different query types automatically routed
queries = [
"Papers about attention mechanisms", # SEMANTIC
"Papers citing 'Attention is All You Need'", # RELATIONAL
"Papers about transformers citing early NLP work" # HYBRID
]

for query in queries:
print(f"\nQuery: {query}")
results = hybrid_rag.search(query, top_k=3)

for i, result in enumerate(results, 1):
print(f"{i}. {result['text']} (source: {result.get('source', 'hybrid')})")

Part 4: GraphRAG - Knowledge Graph Enhanced RAG

What Exactly is GraphRAG?

GraphRAG is NOT just "using a knowledge graph with RAG". It's a specific approach where the knowledge graph actively enhances the retrieval process by:

  1. Expanding initial search results with graph-connected context
  2. Enriching retrieved documents with relationship information
  3. Providing multi-hop reasoning paths through the graph

Web Developer Analogy:

// Traditional RAG = Direct database query
const results = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
return results // Just the matching articles

// GraphRAG = Query + JOIN on relationships
const initial = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
const expanded = initial.map(article => ({
...article,
citations: db.query("SELECT * FROM articles WHERE id IN article.cited_papers"),
relatedConcepts: db.query("SELECT * FROM concepts WHERE article_id = article.id"),
authorExpertise: db.query("SELECT * FROM articles WHERE author_id = article.author_id")
}))
return expanded // Original + graph-enriched context

The Problem GraphRAG Solves

Scenario: User asks "How do transformers handle long-range dependencies?"

Traditional RAG (just vector search):

# Returns: Top 3 papers about transformers
results = [
"Attention is All You Need (Vaswani, 2017)",
"BERT (Devlin, 2018)",
"GPT-3 (Brown, 2020)"
]
# ❌ Missing: WHY transformers were invented (what came before)
# ❌ Missing: HOW they evolved (what improved them)
# ❌ Missing: WHAT problems remain (recent criticisms)

GraphRAG (vector search + graph expansion):

# Returns: Top 3 papers + graph context
results = {
"initial_papers": [
"Attention is All You Need (Vaswani, 2017)",
"BERT (Devlin, 2018)",
"GPT-3 (Brown, 2020)"
],
"cited_papers": [ # WHAT CAME BEFORE (context)
"Neural Machine Translation by Jointly Learning to Align (Bahdanau, 2014)",
"Sequence to Sequence Learning (Sutskever, 2014)",
"Long Short-Term Memory (Hochreiter, 1997)" # The problem transformers solved!
],
"citing_papers": [ # WHAT CAME AFTER (evolution)
"Reformer: Efficient Transformer (Kitaev, 2020)", # Addressed limitations
"Linformer (Wang, 2020)", # Improved efficiency
"Performer (Choromanski, 2020)" # Better for long sequences
],
"related_concepts": [
"Self-attention", "Positional encoding", "Multi-head attention"
]
}
# ✅ Has: Historical context (why transformers exist)
# ✅ Has: Evolution (how they improved)
# ✅ Has: Current solutions (what's happening now)

Result: the LLM can now give a complete historical narrative instead of merely describing what transformers are.

Why Traditional RAG Isn't Enough

Problem 1: No Historical Context

# User question: "Why were transformers invented?"

# Traditional RAG retrieves:
papers = [
"Attention is All You Need (2017): Transformers use self-attention..."
]
# ❌ Doesn't explain what problem RNNs/LSTMs had
# ❌ Doesn't show what transformers improved over

# GraphRAG retrieves + expands via citations:
graph_context = {
"main_paper": "Attention is All You Need (2017): Transformers use self-attention...",
"cited_papers": [
"LSTM (1997): LSTMs struggle with sequences >100 tokens",
"RNN vanishing gradients (1994): RNNs can't learn long dependencies"
]
}
# ✅ Now LLM can explain: "RNNs had vanishing gradients, LSTMs helped but
# still struggled with long sequences, transformers solved this with self-attention"

Problem 2: Missing Evolution

# User question: "How have transformers been improved since 2017?"

# Traditional RAG:
papers = ["Attention is All You Need (2017)"] # The original paper
# ❌ Doesn't show what came after

# GraphRAG (with citing papers):
papers = {
"initial": "Attention is All You Need (2017)",
"citing": [
"BERT (2018): Bidirectional pretraining",
"GPT-2 (2019): Larger scale",
"T5 (2019): Text-to-text framework",
"Reformer (2020): Efficient attention",
"Switch Transformers (2021): Sparse models"
]
}
# ✅ Can now trace the evolution timeline

Problem 3: No Multi-Hop Reasoning

# User question: "What recent work improves on BERT's limitations?"

# Traditional RAG: Only finds papers mentioning "BERT limitations"
# ❌ Might miss papers that solve the problem without mentioning BERT

# GraphRAG:
# 1. Find BERT paper
# 2. Find papers citing BERT
# 3. Filter for papers discussing "limitations" or "improvements"
# 4. Find papers those papers cite (multi-hop)
# ✅ Discovers solutions even if they don't directly mention BERT

GraphRAG vs Traditional RAG vs Hybrid RAG

| Approach | How It Works | Strengths | Weaknesses |
|---|---|---|---|
| Traditional RAG | Vector search → Top-K docs → LLM | Simple, fast | No context, no relationships |
| Hybrid RAG | Vector search + Graph search → Merge → LLM | Combines semantic + relational | Still retrieves documents independently |
| GraphRAG | Vector search → Graph expansion → Enhanced context → LLM | Rich context, multi-hop, relationships | More complex, slower |

Key Difference:

  • Hybrid RAG: Uses graph for direct queries ("papers citing X")
  • GraphRAG: Uses graph to expand vector search results with connected context

When to Use GraphRAG

Use GraphRAG when:

  1. Historical Context Matters

    • Research paper Q&A (evolution of ideas)
    • Patent analysis (prior art, citations)
    • Scientific literature review
  2. Relationships Are Important

    • "How did this idea evolve?"
    • "What influenced this paper?"
    • "What improved on this approach?"
  3. Multi-Hop Reasoning Needed

    • "What recent work addresses limitations of X?"
    • "Find papers in the intellectual lineage of X"
  4. You Have a Knowledge Graph

    • Already built citation network
    • Already have entity relationships
    • Graph is well-structured

DON'T use GraphRAG when:

  1. Simple Keyword Matching - Traditional search is fine
  2. No Graph Available - Building a graph is expensive
  3. Real-Time Speed Critical - Graph traversal adds latency
  4. Documents Are Independent - No meaningful relationships

GraphRAG: Two Approaches

There are two main approaches to GraphRAG:

Approach 1: Graph-Enhanced Retrieval (Shown Here)

How it works:

  1. Use vector search to find initial relevant documents
  2. Use knowledge graph to expand those documents with connected context
  3. Feed enriched context to LLM

Pros:

  • ✅ Simple to implement
  • ✅ Works with existing knowledge graphs (Neo4j, etc.)
  • ✅ Explainable (you can see the expansion)

Cons:

  • ❌ Depends on quality of knowledge graph
  • ❌ Expansion can be noisy
  • ❌ Slower than pure vector search

Example: ResearcherAI uses this approach

Approach 2: Microsoft GraphRAG (Community Summaries)

How it works (different from approach 1!):

  1. Build knowledge graph from documents
  2. Detect communities in the graph (clusters of related entities)
  3. Generate LLM summaries of each community
  4. At query time, search community summaries
  5. Retrieve relevant communities + their documents
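
A minimal sketch of that pipeline, assuming a NetworkX entity graph and a generic `summarize(text)` helper standing in for the LLM call (the names here are illustrative, not Microsoft's actual API):

from typing import Callable, Dict, List

import networkx as nx
from networkx.algorithms.community import louvain_communities

def build_community_summaries(entity_graph: nx.Graph, summarize: Callable[[str], str]) -> List[Dict]:
    """Detect graph communities and produce one summary per community (sketch)."""
    communities = louvain_communities(entity_graph, seed=42)
    summaries = []
    for i, nodes in enumerate(communities):
        # Gather whatever text is attached to each entity in this community
        passages = [str(entity_graph.nodes[n].get("text", n)) for n in nodes]
        summaries.append({
            "community_id": i,
            "entities": sorted(str(n) for n in nodes),
            "summary": summarize("\n".join(passages)),  # stand-in for an LLM call
        })
    return summaries

# At query time: embed the question, rank the community summaries by similarity,
# and pass the top communities (plus their source documents) to the LLM.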

Pros:

  • ✅ Handles "global" questions ("What are the main themes?")
  • ✅ Summarizes large corpora
  • ✅ Finds patterns across documents

Cons:

  • ❌ More complex (requires community detection + summarization)
  • ❌ Higher upfront cost (LLM summarization of all communities)
  • ❌ Less direct than traditional retrieval

Key Difference:

  • Approach 1 (Graph-Enhanced): Expands specific documents via graph
  • Approach 2 (Microsoft GraphRAG): Summarizes graph communities

Which to use?

  • Specific questions ("How do transformers work?") → Approach 1
  • Global questions ("What are the main AI trends?") → Approach 2
  • ResearcherAI: Uses Approach 1 (graph-enhanced retrieval)

GraphRAG Decision Tree

Quick Decision:

  • Have graph + need context → GraphRAG (Approach 1)
  • Need to summarize corpus → Microsoft GraphRAG (Approach 2)
  • Simple Q&A → Traditional RAG

GraphRAG vs Traditional RAG

Key Idea: Use the graph to expand initial search results with related context.

GraphRAG Implementation

class GraphRAG:
"""GraphRAG: Use knowledge graph to enhance retrieval"""

def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model

def retrieve_and_expand(
self,
query: str,
initial_k: int = 3,
expansion_depth: int = 2
) -> Dict:
"""Retrieve documents and expand using graph"""

# Step 1: Initial vector search
query_embedding = self.embedding_model.encode(query)
initial_results = self.vector_store.search(query_embedding, top_k=initial_k)

# Step 2: Expand using knowledge graph
expanded_context = {
"initial_papers": [],
"cited_papers": [],
"citing_papers": [],
"related_concepts": set(),
"author_expertise": []
}

for text, score in initial_results:
paper_id = self._extract_paper_id(text)

expanded_context["initial_papers"].append({
"id": paper_id,
"text": text,
"score": score
})

# Expand: Find papers this paper cites
cited = self.knowledge_graph.find_papers_cited_by(paper_id)
expanded_context["cited_papers"].extend(cited)

# Expand: Find papers citing this paper
citing = self.knowledge_graph.find_citing_papers(paper_id)
expanded_context["citing_papers"].extend(citing)

# Expand: Find related concepts
concepts = self.knowledge_graph.find_related_concepts(paper_id)
expanded_context["related_concepts"].update(concepts)

# Expand: Find author expertise
authors = self.knowledge_graph.find_paper_authors(paper_id)
for author in authors:
other_papers = self.knowledge_graph.find_papers_by_author(author)
expanded_context["author_expertise"].append({
"author": author,
"other_work": other_papers[:3] # Top 3
})

return expanded_context

def generate_answer(self, query: str, context: Dict) -> str:
"""Generate answer using expanded context"""

# Build rich context from graph expansion
context_text = self._format_context(context)

prompt = f"""Based on the following research papers and related context, answer the question.

Question: {query}

Initial Papers:
{context_text['initial']}

Cited Papers (background):
{context_text['cited']}

Citing Papers (follow-up work):
{context_text['citing']}

Related Concepts:
{', '.join(context['related_concepts'])}

Provide a comprehensive answer with citations."""

# Use an LLM client (assumed to be initialized elsewhere as `llm`) to generate the answer
response = llm.generate(prompt)

return response

def _format_context(self, context: Dict) -> Dict[str, str]:
"""Format expanded context for prompt"""

initial = "\n\n".join([
f"[{i+1}] {paper['text']}"
for i, paper in enumerate(context["initial_papers"])
])

cited = "\n".join([
f"- {paper['title']} ({paper['year']})"
for paper in context["cited_papers"][:5]
])

citing = "\n".join([
f"- {paper['title']} ({paper['year']})"
for paper in context["citing_papers"][:5]
])

return {
"initial": initial,
"cited": cited,
"citing": citing
}

# Usage
graph_rag = GraphRAG(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)

query = "How do transformers handle long-range dependencies?"

# Retrieve and expand
context = graph_rag.retrieve_and_expand(query, initial_k=3)

print("Initial papers:", len(context["initial_papers"]))
print("Cited papers:", len(context["cited_papers"]))
print("Citing papers:", len(context["citing_papers"]))
print("Concepts:", len(context["related_concepts"]))

# Generate answer with expanded context
answer = graph_rag.generate_answer(query, context)
print(f"\nAnswer: {answer}")

GraphRAG Benefits

1. Richer Context

# Traditional RAG: 3 papers
traditional_context = """
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)
"""

# GraphRAG: 3 papers + expansions
graphrag_context = """
Initial:
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)

Background (cited by these):
- Neural Machine Translation (Bahdanau, 2014)
- Sequence to Sequence Learning (Sutskever, 2014)

Follow-up (citing these):
- T5 (2019)
- BART (2020)
- Switch Transformers (2021)

Related Concepts:
- Self-attention, Multi-head attention, Positional encoding
"""

2. Better Citation Paths

// Find "intellectual lineage" of an idea
MATCH path = (new:Paper {year: 2024})-[:CITES*1..5]->(old:Paper {year: 2014})
WHERE old.title CONTAINS "attention"
RETURN path

3. Contextual Understanding

# Understand how transformer attention differs from earlier attention
# By traversing citation graph from Bahdanau (2014) to Vaswani (2017)

Enterprise Use Case: Knowledge Graphs for AI Agent API Discovery

Now let's see a practical enterprise application showing why knowledge graphs matter for AI agents in complex business environments.

The Problem: Complex Enterprise API Landscapes

Imagine you have an AI agent in an enterprise environment, and a user makes this request:

"Create a purchase order for 5 pencils in purchasing group 002 and purchasing organization 3000"

Seems simple, right? But here's the reality:

  • Enterprise systems have thousands of different APIs
  • Which API creates purchase orders? (Could be 10+ candidates)
  • What parameters are required? (Purchasing group? Organization? Material codes?)
  • What's the correct sequence? (Auth → Validate → Create → Submit?)

Without context, the AI agent faces:

  1. Trial and error API discovery (slow, expensive)
  2. Missing domain knowledge (which API for which business process?)
  3. No structure (can't understand API dependencies)
  4. No explainability (can't trace what it did or why)

Solution: Build a knowledge graph of your enterprise APIs enriched with business process information!

Knowledge Graph for Enterprise APIs

Structure:

# Nodes (entities)
:PurchaseOrderAPI rdf:type :API .
:PurchasingProcess rdf:type :BusinessProcess .
:PurchasingGroup rdf:type :Parameter .

# Relationships
:PurchaseOrderAPI :belongsTo :PurchasingProcess .
:PurchaseOrderAPI :requires :PurchasingGroup .
:PurchaseOrderAPI :requires :PurchasingOrganization .
:PurchaseOrderAPI :requires :MaterialNumber .

# Metadata
:PurchaseOrderAPI :endpoint "/api/v1/purchase-orders" .
:PurchaseOrderAPI :method "POST" .
:PurchasingGroup :allowedValues "001", "002", "003" .


How Knowledge Graphs Solve Enterprise AI Agent Challenges

Challenge 1: Slow API Discovery

Without Knowledge Graph:

# Agent tries APIs randomly
attempts = [
"Try: /api/orders/create → Wrong (this is for sales orders)",
"Try: /api/procurement/new → Wrong (deprecated API)",
"Try: /api/purchase-orders/post → Wrong (requires different params)",
"Try: /api/v2/purchasing/create-po → Success! (after 4 attempts)"
]
# Result: 4 failed attempts, wasted tokens, slow response

With Knowledge Graph:

# Agent queries knowledge graph
query = """
SELECT ?api WHERE {
?process :name "Purchasing" .
?api :belongsTo ?process .
?api :action "CreatePurchaseOrder" .
}
"""
result = ["POST /api/v2/purchasing/create-po"] # Direct match!
# Result: 1 attempt, instant, efficient

Benefit: 90% reduction in API discovery time

Challenge 2: Missing Domain Context

Without Knowledge Graph:

# Agent doesn't know business logic
user_request = "Create PO for pencils in group 002"

# Agent tries:
api_call = {
"endpoint": "/api/purchase-orders",
"params": {
"item": "pencils", # ❌ Wrong: needs material number
"group": "002" # ✅ Correct
}
}
# Result: API error "Missing material_number parameter"

With Knowledge Graph:

# Agent queries graph for required workflow
workflow = knowledge_graph.query("""
SELECT ?step ?api ?required_param WHERE {
?process :name "CreatePurchaseOrder" .
?process :hasStep ?step .
?step :callsAPI ?api .
?api :requiresParameter ?required_param .
}
ORDER BY ?step
""")

# Result:
# Step 1: Material Lookup API (param: material_name → returns: material_number)
# Step 2: Purchase Order API (params: material_number, purchasing_group, purchasing_org)

# Agent executes correctly:
material_number = call_api("MaterialLookup", {"name": "pencils"}) # M12345
po = call_api("PurchaseOrder", {
"material": material_number,
"group": "002",
"org": "3000"
})
# Result: ✅ Success in correct sequence

Benefit: Automatic workflow understanding from graph structure

Challenge 3: Complex API Dependencies

Without Knowledge Graph:

# Agent doesn't know allowed values
params = {
"purchasing_group": "999" # ❌ Invalid value
}
# Result: API rejects with cryptic error

With Knowledge Graph:

# Graph contains allowed values
allowed = knowledge_graph.query("""
SELECT ?value WHERE {
:PurchasingGroup :allowedValues ?value .
}
""")
# Result: ["001", "002", "003"]

# Agent validates BEFORE calling API
if "999" not in allowed:
# Ask user or pick valid value
pass

Benefit: Validation before execution, reducing errors

Challenge 4: No Explainability

Without Knowledge Graph:

User: "Why did you use that API?"
Agent: "Based on my training, I determined..." (vague)

With Knowledge Graph:

# Agent can trace reasoning through graph
explanation = {
"question": "Create purchase order for pencils",
"reasoning_path": [
"1. Identified business process: Purchasing",
"2. Found process step: MaterialLookup (required for material_number)",
"3. Called MaterialLookup API with 'pencils' → returned M12345",
"4. Found process step: CreatePurchaseOrder",
"5. Validated parameters against schema:",
" - material_number: M12345 (from step 3)",
" - purchasing_group: 002 (from user, validated against allowed values)",
" - purchasing_org: 3000 (from user)",
"6. Called PurchaseOrder API with validated params",
"7. Result: PO-2024-001 created successfully"
],
"source_apis": ["/api/materials/lookup", "/api/purchase-orders/create"],
"graph_nodes_traversed": ["PurchasingProcess", "MaterialLookupAPI", "PurchaseOrderAPI"]
}

Benefit: Complete transparency and auditability

Implementation Example

class EnterpriseAPIAgent:
"""AI Agent with Knowledge Graph for API discovery"""

def __init__(self, kg: Neo4jKnowledgeGraph, llm: LLM):
self.kg = kg
self.llm = llm

def execute_request(self, user_request: str):
"""Execute user request using knowledge graph"""

# Step 1: Understand intent
intent = self.llm.extract_intent(user_request)
# {"action": "CreatePurchaseOrder", "params": {"item": "pencils", "group": "002"}}

# Step 2: Query knowledge graph for workflow
workflow = self.kg.query_cypher(f"""
MATCH (process:BusinessProcess {{name: 'Purchasing'}})
-[:HAS_STEP]->(step:ProcessStep)
-[:CALLS_API]->(api:API)
WHERE step.action = '{intent["action"]}'
RETURN step.order as order, api.endpoint as endpoint,
api.required_params as params
ORDER BY order
""")

# Step 3: Execute workflow steps
context = {}
for step in workflow:
# Validate and enrich parameters
params = self._prepare_params(step["params"], intent["params"], context)

# Call API
response = self._call_api(step["endpoint"], params)

# Store result for next step
context[step["order"]] = response

return context

def _prepare_params(self, required, user_provided, context):
"""Prepare API parameters using knowledge graph validation"""

params = {}
for param_name in required:
# Check if user provided
if param_name in user_provided:
# Validate against knowledge graph
allowed = self.kg.get_allowed_values(param_name)
if allowed and user_provided[param_name] not in allowed:
raise ValueError(f"{param_name} must be one of {allowed}")
params[param_name] = user_provided[param_name]

# Check if available from previous steps
elif param_name in context:
params[param_name] = context[param_name]

else:
raise ValueError(f"Missing required parameter: {param_name}")

return params
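
A hedged usage sketch, assuming the `neo4j_kg` and `llm` instances from earlier sections are already initialized:

agent = EnterpriseAPIAgent(kg=neo4j_kg, llm=llm)

result = agent.execute_request(
    "Create a purchase order for 5 pencils in purchasing group 002 "
    "and purchasing organization 3000"
)
print(result)  # per-step API responses keyed by workflow step order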

Benefits Summary

| Challenge | Without KG | With KG | Improvement |
|---|---|---|---|
| API Discovery | Trial & error (10+ attempts) | Direct lookup (1 attempt) | 90% faster |
| Context Understanding | Missing domain logic | Business process aware | Correct workflows |
| Parameter Validation | Runtime errors | Pre-validated | Fewer failures |
| Explainability | Black box | Full trace | Audit-ready |
| Maintenance | Update LLM training | Update graph | Easy updates |

When to Use Knowledge Graphs for AI Agents

Use KG-powered AI agents when:

  • ✅ Complex API landscapes (100s-1000s of APIs)
  • ✅ Domain-specific business logic required
  • ✅ Auditability and explainability critical
  • ✅ APIs change frequently (easier to update graph than retrain)
  • ✅ Multi-step workflows common

Traditional RAG/LLM when:

  • Simple, stable API sets (10-20 APIs)
  • No strict compliance requirements
  • Flexibility more important than precision

Enterprise Knowledge Graphs

Knowledge graphs transform AI agents from "smart guessers" to "informed executors" by providing:

  • Structure: API relationships and dependencies
  • Context: Business process integration
  • Validation: Allowed values and constraints
  • Explainability: Traceable reasoning paths

Part 5: Structured vs Unstructured Data

Real research papers contain both:

Unstructured:

  • Abstract (free text)
  • Full paper text
  • Author descriptions

Structured:

  • Title, authors, year, venue
  • Citation counts
  • Keywords, categories
  • Figures, tables (semi-structured)

Handling Both with GraphRAG

Complete Example: ResearcherAI Data Pipeline

class ResearchDataPipeline:
"""Complete pipeline for handling structured + unstructured data"""

def __init__(
self,
vector_store: QdrantVectorStore,
knowledge_graph: Neo4jKnowledgeGraph,
embedding_model: SentenceTransformer
):
self.vector_store = vector_store
self.knowledge_graph = knowledge_graph
self.embedding_model = embedding_model

def process_paper(self, paper: Dict):
"""Process single paper with both structured and unstructured data"""

# Step 1: Extract structured data
paper_id = paper["id"]
title = paper["title"]
authors = paper["authors"]
year = paper["year"]
citations = paper.get("citations", [])
keywords = paper.get("keywords", [])

# Step 2: Extract unstructured data
abstract = paper["abstract"]
full_text = paper.get("full_text", "")

# Step 3: Add to knowledge graph (structured)
self.knowledge_graph.add_paper(paper_id, title, year, abstract)

for author in authors:
author_id = self._get_author_id(author)
self.knowledge_graph.add_author(author_id, author)
self.knowledge_graph.add_authored(author_id, paper_id)

for cited_paper_id in citations:
self.knowledge_graph.add_citation(paper_id, cited_paper_id)

for keyword in keywords:
concept_id = self._get_concept_id(keyword)
self.knowledge_graph.add_concept(concept_id, keyword)
self.knowledge_graph.add_discusses(paper_id, concept_id)

# Step 4: Add to vector store (unstructured)
# Combine title + abstract for better semantic search
text_for_embedding = f"{title}. {abstract}"
embedding = self.embedding_model.encode(text_for_embedding)

self.vector_store.add_documents(
texts=[text_for_embedding],
embeddings=np.array([embedding]),
metadata=[{
"paper_id": paper_id,
"title": title,
"year": year,
"authors": authors,
"citation_count": len(citations)
}]
)

print(f"✓ Processed: {title}")

def query(self, question: str, mode: str = "hybrid") -> Dict:
"""Query with automatic routing"""

if mode == "semantic":
# Pure vector search
query_emb = self.embedding_model.encode(question)
results = self.vector_store.search(query_emb, top_k=5)

return {
"results": results,
"mode": "semantic"
}

elif mode == "structured":
# Pure graph query
# Extract query intent and route to appropriate graph query
results = self._graph_query(question)

return {
"results": results,
"mode": "structured"
}

else: # hybrid or graphrag
# GraphRAG: Combine both
query_emb = self.embedding_model.encode(question)
initial_results = self.vector_store.search(query_emb, top_k=3)

# Expand using graph
expanded = []
for text, score in initial_results:
paper_id = self._extract_paper_id_from_text(text)

# Get structured context from graph
graph_context = {
"citations": self.knowledge_graph.find_citing_papers(paper_id),
"concepts": self.knowledge_graph.find_related_concepts(paper_id),
"authors": self.knowledge_graph.find_paper_authors(paper_id)
}

expanded.append({
"paper": text,
"score": score,
"graph_context": graph_context
})

return {
"results": expanded,
"mode": "graphrag"
}

# Complete workflow
pipeline = ResearchDataPipeline(
vector_store=qdrant_store,
knowledge_graph=neo4j_kg,
embedding_model=model
)

# Process papers
papers = [
{
"id": "paper_1",
"title": "Attention is All You Need",
"authors": ["Ashish Vaswani", "Noam Shazeer"],
"year": 2017,
"abstract": "We propose the Transformer, a model architecture...",
"keywords": ["transformer", "attention", "sequence-to-sequence"],
"citations": []
},
{
"id": "paper_2",
"title": "BERT: Pre-training Transformers",
"authors": ["Jacob Devlin", "Ming-Wei Chang"],
"year": 2018,
"abstract": "We introduce BERT, a bidirectional transformer...",
"keywords": ["BERT", "pre-training", "transformers"],
"citations": ["paper_1"]
}
]

for paper in papers:
pipeline.process_paper(paper)

# Query with different modes
print("\n=== SEMANTIC MODE ===")
results = pipeline.query("attention mechanisms in neural networks", mode="semantic")
print(results)

print("\n=== STRUCTURED MODE ===")
results = pipeline.query("papers citing Attention is All You Need", mode="structured")
print(results)

print("\n=== GRAPHRAG MODE ===")
results = pipeline.query("how do transformers work?", mode="hybrid")
print(results)

Summary and Decision Guide

Technology Comparison

| Technology | Best For | Limitations |
|---|---|---|
| Vector DB | Semantic similarity, fuzzy matching | No relationships, no reasoning |
| Knowledge Graph | Relationships, structured queries | Requires exact entities, no fuzzy search |
| Hybrid RAG | Combining semantic + structured | More complex, two systems |
| GraphRAG | Rich context, citation analysis | Highest complexity, needs both systems |

When to Use Each

Use Vector Search Alone:

  • Simple semantic search
  • No relationship queries
  • Quick prototypes
  • Example: "Find similar papers"

Use Knowledge Graph Alone:

  • Known entities
  • Relationship-heavy queries
  • Network analysis
  • Example: "Find collaboration networks"

Use Hybrid RAG:

  • Production RAG systems
  • Mix of semantic + structured queries
  • Need both similarity and relationships
  • Example: ResearcherAI

Use GraphRAG:

  • Research assistance (like ResearcherAI)
  • Need citation context
  • Complex multi-hop queries
  • Example: "Trace the evolution of an idea"

ResearcherAI's Approach

ResearcherAI uses GraphRAG with dual backends:

# Development Mode
dev_config = {
"vector_store": "FAISS (in-memory)",
"knowledge_graph": "NetworkX (in-memory)",
"startup_time": "instant",
"data_persistence": "manual save/load"
}

# Production Mode
prod_config = {
"vector_store": "Qdrant (persistent)",
"knowledge_graph": "Neo4j (persistent)",
"startup_time": "~2 seconds",
"data_persistence": "automatic"
}

# Abstraction layer allows switching
system = ResearcherAI(mode="development") # or "production"
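
A sketch of what that abstraction layer can look like, reusing the `qdrant_store` and `neo4j_kg` objects from earlier sections (the class shape is an assumption, not ResearcherAI's actual implementation):

import faiss
import networkx as nx
from sentence_transformers import SentenceTransformer

class ResearcherAI:
    """Switch between in-memory and persistent backends with one flag."""

    def __init__(self, mode: str = "development"):
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        if mode == "development":
            self.vector_store = faiss.IndexFlatIP(384)  # bare in-memory index (384-dim MiniLM vectors)
            self.knowledge_graph = nx.DiGraph()         # in-memory graph, manual save/load
        elif mode == "production":
            self.vector_store = qdrant_store            # persistent Qdrant store from earlier sections
            self.knowledge_graph = neo4j_kg             # persistent Neo4j wrapper from earlier sections
        else:
            raise ValueError(f"Unknown mode: {mode}")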

Key Takeaways

  1. Vector databases enable semantic search - finding similar content without exact keyword matches
  2. Knowledge graphs enable relationship reasoning - following citations, collaborations, concept evolution
  3. Hybrid RAG combines both for richer retrieval
  4. GraphRAG uses graphs to expand and enhance vector search results
  5. Structured + Unstructured data both matter - use appropriate storage for each
  6. Dev/Prod duality enables fast iteration with production fidelity

Next Steps

Now you understand the data layer. Next chapters cover:

  • Chapter 3.5 (Agent Foundations): How agents use these data stores
  • Chapter 4 (Orchestration Frameworks): LangGraph for agent coordination
  • Chapter 5 (Backend): Implementing the full stack

The data foundations you learned here power every query in ResearcherAI!


Practice Exercise

Build your own hybrid system:

  1. Start with FAISS + NetworkX (development)
  2. Index 20-30 papers with abstracts
  3. Create author and citation relationships
  4. Implement semantic search
  5. Implement citation traversal
  6. Combine both for hybrid queries
  7. Deploy with Qdrant + Neo4j (production)
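
A tiny starter sketch for steps 1-6, using FAISS for vectors and NetworkX for the citation graph (the paper data, IDs, and query are placeholders):

import faiss
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")

papers = [
    {"id": "p1", "title": "Attention Is All You Need", "abstract": "Transformer architecture..."},
    {"id": "p2", "title": "BERT", "abstract": "Bidirectional pre-training..."},
]

# 1-2. Index abstracts in FAISS (inner product over normalized vectors = cosine similarity)
embeddings = model.encode(
    [p["title"] + ". " + p["abstract"] for p in papers],
    normalize_embeddings=True,
)
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype="float32"))

# 3. Citation graph in NetworkX
graph = nx.DiGraph()
graph.add_edge("p2", "p1", relation="cites")  # BERT cites Attention

# 4. Semantic search
query_vec = model.encode(["papers about attention"], normalize_embeddings=True)
scores, ids = index.search(np.asarray(query_vec, dtype="float32"), 2)

# 5-6. Expand the top hit with its citation neighborhood for a hybrid result
top_paper = papers[ids[0][0]]
cites = list(graph.successors(top_paper["id"]))
cited_by = list(graph.predecessors(top_paper["id"]))
print(top_paper["title"], "| cites:", cites, "| cited by:", cited_by)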