Chapter 3: Data Foundations - Vector Databases, Knowledge Graphs, and GraphRAG

Introduction

Before building intelligent agents, we must understand how to store and retrieve information effectively. This chapter takes you from basic keyword search to advanced GraphRAG, explaining why each technology exists and when to use it.

For Web Developers

This chapter is like learning about databases in web development:

Keyword search = Simple WHERE name LIKE '%query%'
Vector search = Semantic similarity (no exact matches needed)
Knowledge graphs = Relational databases on steroids
GraphRAG = Combining the best of all worlds

The Problem: Traditional Search Doesn't Work for Research

Keyword Search Limitations

Imagine searching for papers about "neural networks for language understanding":

Keyword Search:

def keyword_search(query: str, documents: List[str]) -> List[str]:
    """Traditional keyword matching"""
    results = []
    keywords = query.lower().split()

    for doc in documents:
        doc_lower = doc.lower()
        if all(keyword in doc_lower for keyword in keywords):
            results.append(doc)

    return results

# Search papers
query = "neural networks for language understanding"
results = keyword_search(query, papers)

Problems:

Synonym Problem: Misses "deep learning" when searching "neural networks"
Word Order: "language understanding with neural networks" won't match
Context Ignored: Can't understand "transformers" means attention mechanism
No Semantics: "bank" (financial) vs "bank" (river) treated identically

Real Example:

# User searches: "attention mechanisms in NLP"
# Misses papers that say:
# - "self-attention for natural language processing"
# - "transformer architecture for text understanding"
# - "query-key-value attention for language models"
# All mean the same thing but use different words!

What We Actually Need

For research, we need:

Semantic understanding: "neural network" = "deep learning" = "artificial network"
Context awareness: Understand concepts, not just words
Relationship mapping: How papers, authors, and concepts connect
Reasoning capabilities: "If A cites B, and B discusses C, then A likely relates to C"

This requires two complementary technologies:

Vector Databases (semantic similarity)
Knowledge Graphs (relationship reasoning)

Let's understand each from scratch.

Part 1: Vector Databases and Semantic Search

From Words to Vectors

Core Idea: Represent text as numbers that capture meaning.

The Intuition

# Imagine each word has a position in "meaning space"
# Similar meanings = close positions

king    = [0.8, 0.3, 0.1]  # Royalty, male, power
queen   = [0.8, 0.9, 0.1]  # Royalty, female, power
man     = [0.2, 0.3, 0.0]  # Common, male, neutral
woman   = [0.2, 0.9, 0.0]  # Common, female, neutral

# Amazing property:
# king - man + woman ≈ queen
# [0.8, 0.3, 0.1] - [0.2, 0.3, 0.0] + [0.2, 0.9, 0.0] = [0.8, 0.9, 0.1]

This is word embeddings - representing words as dense vectors that capture semantic relationships.

How Embeddings Work

Creating Embeddings

from sentence_transformers import SentenceTransformer

# Load embedding model (runs locally!)
model = SentenceTransformer('all-MiniLM-L6-v2')

# Create embeddings
texts = [
    "Neural networks for image classification",
    "Deep learning in computer vision",
    "Convolutional networks for image recognition"
]

embeddings = model.encode(texts)
print(embeddings.shape)  # (3, 384) - 3 texts, 384 dimensions each

Measuring Similarity

import numpy as np

def cosine_similarity(vec1, vec2):
    """Measure how similar two vectors are (0=different, 1=identical)"""
    return np.dot(vec1, vec2) / (np.linalg.norm(vec1) * np.linalg.norm(vec2))

# Compare texts
query = "neural nets for images"
query_embedding = model.encode(query)

for i, text in enumerate(texts):
    similarity = cosine_similarity(query_embedding, embeddings[i])
    print(f"Similarity to '{text}': {similarity:.3f}")

# Output:
# Similarity to 'Neural networks for image classification': 0.782
# Similarity to 'Deep learning in computer vision': 0.691
# Similarity to 'Convolutional networks for image recognition': 0.745

Key Insight: Even though exact words differ, semantic similarity is captured!

Vector Databases: Scaling Semantic Search

Comparing embeddings one-by-one doesn't scale. For 1 million papers:

Comparing query to all papers: 1 million comparisons
At 0.01ms per comparison: 10 seconds per query ❌

Solution: Vector databases with approximate nearest neighbor (ANN) search.

Development: FAISS (In-Memory)

FAISS (Facebook AI Similarity Search) - perfect for development and testing.

import faiss
import numpy as np

class FAISSVectorStore:
    """Development vector database using FAISS"""

    def __init__(self, dimension: int = 384):
        self.dimension = dimension
        # Create FAISS index (L2 distance)
        self.index = faiss.IndexFlatL2(dimension)
        self.documents = []  # Store original documents

    def add_documents(self, texts: List[str], embeddings: np.ndarray):
        """Add documents to index"""
        # FAISS requires float32
        embeddings_f32 = embeddings.astype('float32')

        # Add to index
        self.index.add(embeddings_f32)

        # Store documents
        self.documents.extend(texts)

        print(f"✓ Indexed {len(texts)} documents (total: {self.index.ntotal})")

    def search(self, query_embedding: np.ndarray, top_k: int = 5) -> List[tuple]:
        """Search for similar documents"""
        # Ensure float32
        query_f32 = query_embedding.astype('float32').reshape(1, -1)

        # Search (returns distances and indices)
        distances, indices = self.index.search(query_f32, top_k)

        # Convert L2 distances to similarity scores (0-1)
        similarities = 1 / (1 + distances[0])

        # Return documents with scores
        results = [
            (self.documents[idx], float(sim))
            for idx, sim in zip(indices[0], similarities)
            if idx < len(self.documents)
        ]

        return results

# Usage
vector_store = FAISSVectorStore(dimension=384)

# Index papers
papers = [
    "Attention is all you need - introduces transformer architecture",
    "BERT: Pre-training of deep bidirectional transformers",
    "GPT-3: Language models are few-shot learners",
    "ResNet: Deep residual learning for image recognition",
    "YOLO: Real-time object detection"
]

embeddings = model.encode(papers)
vector_store.add_documents(papers, embeddings)

# Search
query = "transformer models for NLP"
query_emb = model.encode(query)
results = vector_store.search(query_emb, top_k=3)

for doc, score in results:
    print(f"Score: {score:.3f} - {doc}")

Output:

Score: 0.856 - Attention is all you need - introduces transformer architecture
Score: 0.792 - BERT: Pre-training of deep bidirectional transformers
Score: 0.743 - GPT-3: Language models are few-shot learners

Notice: ResNet and YOLO (computer vision) are correctly excluded even though they're valid papers!

Production: Qdrant (Persistent, Scalable)

Qdrant - production-grade vector database with persistence and APIs.

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

class QdrantVectorStore:
    """Production vector database using Qdrant"""

    def __init__(
        self,
        collection_name: str = "research_papers",
        url: str = "http://localhost:6333"
    ):
        self.client = QdrantClient(url=url)
        self.collection_name = collection_name
        self.dimension = 384

        # Create collection if not exists
        self._create_collection()

    def _create_collection(self):
        """Create Qdrant collection"""
        try:
            self.client.get_collection(self.collection_name)
            print(f"✓ Collection '{self.collection_name}' exists")
        except:
            self.client.create_collection(
                collection_name=self.collection_name,
                vectors_config=VectorParams(
                    size=self.dimension,
                    distance=Distance.COSINE  # Cosine similarity
                )
            )
            print(f"✓ Created collection '{self.collection_name}'")

    def add_documents(
        self,
        texts: List[str],
        embeddings: np.ndarray,
        metadata: List[Dict] = None
    ):
        """Add documents with metadata"""
        points = []

        for idx, (text, embedding) in enumerate(zip(texts, embeddings)):
            point = PointStruct(
                id=idx,
                vector=embedding.tolist(),
                payload={
                    "text": text,
                    **(metadata[idx] if metadata else {})
                }
            )
            points.append(point)

        # Batch upload
        self.client.upsert(
            collection_name=self.collection_name,
            points=points
        )

        print(f"✓ Indexed {len(points)} documents")

    def search(
        self,
        query_embedding: np.ndarray,
        top_k: int = 5,
        filters: Dict = None
    ) -> List[tuple]:
        """Search with optional metadata filters"""
        results = self.client.search(
            collection_name=self.collection_name,
            query_vector=query_embedding.tolist(),
            limit=top_k,
            query_filter=filters  # Can filter by metadata!
        )

        return [
            (result.payload["text"], result.score)
            for result in results
        ]

# Usage
qdrant_store = QdrantVectorStore(collection_name="papers")

# Index with metadata
metadata = [
    {"year": 2017, "citations": 50000, "venue": "NeurIPS"},
    {"year": 2018, "citations": 30000, "venue": "NAACL"},
    {"year": 2020, "citations": 15000, "venue": "NeurIPS"},
    {"year": 2015, "citations": 40000, "venue": "CVPR"},
    {"year": 2016, "citations": 25000, "venue": "CVPR"}
]

qdrant_store.add_documents(papers, embeddings, metadata)

# Search with filters
query_emb = model.encode("transformer models for NLP")
results = qdrant_store.search(
    query_emb,
    top_k=5,
    filters={"must": [{"key": "year", "range": {"gte": 2017}}]}  # Papers after 2017
)

for doc, score in results:
    print(f"Score: {score:.3f} - {doc}")

Vector Search: Dev vs Prod Comparison

Feature	FAISS (Dev)	Qdrant (Prod)
Storage	In-memory only	Persistent to disk
Startup	Instant	~2 seconds
Scalability	Single machine	Distributed cluster
Metadata	Manual tracking	Built-in filtering
APIs	Python only	REST + gRPC + Python
Persistence	Save/load manually	Automatic
Best for	Development, testing, prototyping	Production, millions of vectors

When to Use Each

Development: Use FAISS - instant startup, no infrastructure
Production: Use Qdrant - persistence, scalability, filtering
ResearcherAI: Uses both with abstraction layer!

Part 2: Knowledge Graphs and Structured Reasoning

Knowledge Graphs in the Real World

Before diving into the technical details, let's see knowledge graphs in action.

Try This: Open Google and search for "Tesla" (the electric vehicle company).

What do you see? Besides the typical list of matching websites, you'll notice a comprehensive panel on the right side showing:

Description: "American electric vehicle and clean energy company..."
Founded: 2003
Headquarters: Austin, Texas
CEO: Elon Musk
Stock price and other properties

Now click on "Austin, Texas" (the headquarters location). You'll see another panel with:

Description: "Capital city of Texas, United States"
Population: ~1 million
County: Travis County

This is a knowledge graph in action! Google isn't just returning text - it's showing you:

Entities (Tesla, Austin, Elon Musk)
Properties (founded date, population)
Relationships (Tesla → headquarters → Austin → county → Travis County)

Web Developer Analogy:

// Traditional search result = List of text snippets
const results = ["Tesla is a company...", "Tesla founded in 2003...", "Tesla located..."]

// Knowledge graph = Structured interconnected data
const knowledgeGraph = {
  "Tesla": {
    type: "Company",
    properties: {
      founded: 2003,
      name: "Tesla, Inc."
    },
    relationships: {
      headquarters: "Austin",
      CEO: "Elon_Musk"
    }
  },
  "Austin": {
    type: "City",
    properties: {
      population: 1000000,
      state: "Texas"
    },
    relationships: {
      county: "Travis_County",
      companies: ["Tesla", "Oracle", "Dell"]
    }
  }
}

What Are Knowledge Graphs?

A knowledge graph represents structured information as a graph where:

Nodes represent entities (Tesla, Austin, Elon Musk)
Edges represent relationships between entities (headquarters, CEO, located_in)

Two Main Components:

Schema/Ontology: Defines the types of entities, their attributes, and allowed relationships

# Schema definition
Company has property: founded_year (integer)
Company has property: name (string)
Company has relationship: headquarters → City
Company has relationship: CEO → Person

Instance Data: The actual entities and relationships that follow the schema

# Instance data
Tesla founded_year 2003
Tesla name "Tesla, Inc."
Tesla headquarters Austin
Tesla CEO Elon_Musk

Why Vector Search Isn't Enough

Vector search excels at finding similar content, but fails at:

1. Relationship Questions

Query: "Which papers cite both attention mechanisms and BERT?"
Vector search: ❌ Can't traverse citations
Knowledge graph: ✅ MATCH (p)-[:CITES]->(a), (p)-[:CITES]->(b)

2. Multi-Hop Reasoning

Query: "Find papers by authors who collaborated with Yoshua Bengio"
Vector search: ❌ Can't follow author → author connections
Knowledge graph: ✅ Path traversal over collaboration edges

3. Structured Queries

Query: "Papers published in 2020 that cite papers from before 2015"
Vector search: ❌ No temporal reasoning
Knowledge graph: ✅ Filter by year property + traverse citations

From Tables to Graphs

Key Differences:

Tables (Relational):

Fixed schema
Join operations expensive
Hard to add new relationship types
Optimized for transactional queries

Graphs (Network):

Flexible schema
Traversals are natural
Easy to add new edges
Optimized for relationship queries

Knowledge Graph Basics

Components:

Nodes (Entities): Papers, Authors, Concepts, Institutions
Edges (Relationships): CITES, AUTHORED_BY, DISCUSSES, AFFILIATED_WITH
Properties: title, year, citations_count, etc.

Example Graph:

# Nodes
Paper1 = {
    "id": "paper_1",
    "type": "Paper",
    "title": "Attention is All You Need",
    "year": 2017,
    "citations": 50000
}

Author1 = {
    "id": "author_1",
    "type": "Author",
    "name": "Ashish Vaswani"
}

Concept1 = {
    "id": "concept_1",
    "type": "Concept",
    "name": "Transformer"
}

# Edges
edges = [
    (Author1, "AUTHORED", Paper1),
    (Paper1, "INTRODUCES", Concept1),
    (Paper2, "CITES", Paper1),
    (Paper2, "USES", Concept1)
]

Development: NetworkX (In-Memory)

NetworkX - Python library for graph operations, perfect for development.

import networkx as nx
from typing import Dict, List

class NetworkXKnowledgeGraph:
    """Development knowledge graph using NetworkX"""

    def __init__(self):
        self.graph = nx.MultiDiGraph()  # Directed graph with multiple edges

    def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
        """Add paper node"""
        self.graph.add_node(
            paper_id,
            type="Paper",
            title=title,
            year=year,
            abstract=abstract
        )

    def add_author(self, author_id: str, name: str):
        """Add author node"""
        self.graph.add_node(
            author_id,
            type="Author",
            name=name
        )

    def add_concept(self, concept_id: str, name: str):
        """Add concept node"""
        self.graph.add_node(
            concept_id,
            type="Concept",
            name=name
        )

    def add_authored(self, author_id: str, paper_id: str):
        """Add AUTHORED relationship"""
        self.graph.add_edge(author_id, paper_id, type="AUTHORED")

    def add_citation(self, citing_paper: str, cited_paper: str):
        """Add CITES relationship"""
        self.graph.add_edge(citing_paper, cited_paper, type="CITES")

    def add_discusses(self, paper_id: str, concept_id: str):
        """Add DISCUSSES relationship"""
        self.graph.add_edge(paper_id, concept_id, type="DISCUSSES")

    def find_papers_by_author(self, author_name: str) -> List[Dict]:
        """Find all papers by an author"""
        papers = []

        for node, data in self.graph.nodes(data=True):
            if data.get("type") == "Author" and data.get("name") == author_name:
                # Find papers this author authored
                for neighbor in self.graph.successors(node):
                    if self.graph.nodes[neighbor].get("type") == "Paper":
                        papers.append({
                            "id": neighbor,
                            **self.graph.nodes[neighbor]
                        })

        return papers

    def find_citing_papers(self, paper_id: str) -> List[Dict]:
        """Find papers that cite a given paper"""
        citing = []

        for pred in self.graph.predecessors(paper_id):
            edge_data = self.graph.get_edge_data(pred, paper_id)
            if any(e.get("type") == "CITES" for e in edge_data.values()):
                if self.graph.nodes[pred].get("type") == "Paper":
                    citing.append({
                        "id": pred,
                        **self.graph.nodes[pred]
                    })

        return citing

    def find_related_concepts(self, paper_id: str) -> List[str]:
        """Find concepts discussed in a paper"""
        concepts = []

        for neighbor in self.graph.successors(paper_id):
            if self.graph.nodes[neighbor].get("type") == "Concept":
                concepts.append(self.graph.nodes[neighbor].get("name"))

        return concepts

    def find_collaboration_network(self, author_name: str, depth: int = 2) -> List[str]:
        """Find authors who collaborated (shared papers)"""
        collaborators = set()

        # Find author node
        author_node = None
        for node, data in self.graph.nodes(data=True):
            if data.get("type") == "Author" and data.get("name") == author_name:
                author_node = node
                break

        if not author_node:
            return []

        # Find papers by this author
        papers = [n for n in self.graph.successors(author_node)
                 if self.graph.nodes[n].get("type") == "Paper"]

        # Find co-authors
        for paper in papers:
            for pred in self.graph.predecessors(paper):
                if self.graph.nodes[pred].get("type") == "Author" and pred != author_node:
                    collaborators.add(self.graph.nodes[pred].get("name"))

        return list(collaborators)

# Usage
kg = NetworkXKnowledgeGraph()

# Add nodes
kg.add_paper("paper_1", "Attention is All You Need", 2017)
kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
kg.add_paper("paper_3", "GPT-3: Language Models", 2020)

kg.add_author("author_1", "Ashish Vaswani")
kg.add_author("author_2", "Jacob Devlin")
kg.add_author("author_3", "Tom Brown")

kg.add_concept("concept_1", "Transformer")
kg.add_concept("concept_2", "Attention Mechanism")
kg.add_concept("concept_3", "Pre-training")

# Add relationships
kg.add_authored("author_1", "paper_1")
kg.add_authored("author_2", "paper_2")
kg.add_authored("author_3", "paper_3")

kg.add_discusses("paper_1", "concept_1")
kg.add_discusses("paper_1", "concept_2")
kg.add_discusses("paper_2", "concept_1")
kg.add_discusses("paper_2", "concept_3")

kg.add_citation("paper_2", "paper_1")  # BERT cites Attention
kg.add_citation("paper_3", "paper_1")  # GPT-3 cites Attention
kg.add_citation("paper_3", "paper_2")  # GPT-3 cites BERT

# Query the graph
print("Papers by Ashish Vaswani:")
papers = kg.find_papers_by_author("Ashish Vaswani")
for paper in papers:
    print(f"  - {paper['title']}")

print("\nPapers citing 'Attention is All You Need':")
citing = kg.find_citing_papers("paper_1")
for paper in citing:
    print(f"  - {paper['title']}")

print("\nConcepts in paper_1:")
concepts = kg.find_related_concepts("paper_1")
print(f"  {', '.join(concepts)}")

print("\nCollaborators of Jacob Devlin:")
collabs = kg.find_collaboration_network("Jacob Devlin")
print(f"  {', '.join(collabs)}")

Output:

Papers by Ashish Vaswani:
  - Attention is All You Need

Papers citing 'Attention is All You Need':
  - BERT: Pre-training Transformers
  - GPT-3: Language Models

Concepts in paper_1:
  Transformer, Attention Mechanism

Collaborators of Jacob Devlin:
  Ashish Vaswani

Production: Neo4j (Persistent, Cypher)

Neo4j - enterprise-grade graph database with powerful query language (Cypher).

from neo4j import GraphDatabase

class Neo4jKnowledgeGraph:
    """Production knowledge graph using Neo4j"""

    def __init__(self, uri: str = "bolt://localhost:7687", user: str = "neo4j", password: str = "password"):
        self.driver = GraphDatabase.driver(uri, auth=(user, password))

    def close(self):
        self.driver.close()

    def add_paper(self, paper_id: str, title: str, year: int, abstract: str = ""):
        """Add paper node"""
        with self.driver.session() as session:
            session.run("""
                MERGE (p:Paper {id: $paper_id})
                SET p.title = $title, p.year = $year, p.abstract = $abstract
            """, paper_id=paper_id, title=title, year=year, abstract=abstract)

    def add_author(self, author_id: str, name: str):
        """Add author node"""
        with self.driver.session() as session:
            session.run("""
                MERGE (a:Author {id: $author_id})
                SET a.name = $name
            """, author_id=author_id, name=name)

    def add_concept(self, concept_id: str, name: str):
        """Add concept node"""
        with self.driver.session() as session:
            session.run("""
                MERGE (c:Concept {id: $concept_id})
                SET c.name = $name
            """, concept_id=concept_id, name=name)

    def add_authored(self, author_id: str, paper_id: str):
        """Add AUTHORED relationship"""
        with self.driver.session() as session:
            session.run("""
                MATCH (a:Author {id: $author_id})
                MATCH (p:Paper {id: $paper_id})
                MERGE (a)-[:AUTHORED]->(p)
            """, author_id=author_id, paper_id=paper_id)

    def add_citation(self, citing_paper: str, cited_paper: str):
        """Add CITES relationship"""
        with self.driver.session() as session:
            session.run("""
                MATCH (citing:Paper {id: $citing_paper})
                MATCH (cited:Paper {id: $cited_paper})
                MERGE (citing)-[:CITES]->(cited)
            """, citing_paper=citing_paper, cited_paper=cited_paper)

    def add_discusses(self, paper_id: str, concept_id: str):
        """Add DISCUSSES relationship"""
        with self.driver.session() as session:
            session.run("""
                MATCH (p:Paper {id: $paper_id})
                MATCH (c:Concept {id: $concept_id})
                MERGE (p)-[:DISCUSSES]->(c)
            """, paper_id=paper_id, concept_id=concept_id)

    def find_papers_by_author(self, author_name: str) -> List[Dict]:
        """Find all papers by an author"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (a:Author {name: $name})-[:AUTHORED]->(p:Paper)
                RETURN p.id as id, p.title as title, p.year as year
            """, name=author_name)
            return [dict(record) for record in result]

    def find_citing_papers(self, paper_id: str) -> List[Dict]:
        """Find papers that cite a given paper"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (citing:Paper)-[:CITES]->(cited:Paper {id: $paper_id})
                RETURN citing.id as id, citing.title as title, citing.year as year
            """, paper_id=paper_id)
            return [dict(record) for record in result]

    def find_related_concepts(self, paper_id: str) -> List[str]:
        """Find concepts discussed in a paper"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (p:Paper {id: $paper_id})-[:DISCUSSES]->(c:Concept)
                RETURN c.name as concept
            """, paper_id=paper_id)
            return [record["concept"] for record in result]

    def find_collaboration_network(self, author_name: str) -> List[str]:
        """Find authors who collaborated (shared papers)"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH (a1:Author {name: $name})-[:AUTHORED]->(p:Paper)<-[:AUTHORED]-(a2:Author)
                WHERE a1 <> a2
                RETURN DISTINCT a2.name as collaborator
            """, name=author_name)
            return [record["collaborator"] for record in result]

    def find_citation_chain(self, start_paper: str, end_paper: str, max_depth: int = 5):
        """Find citation path between two papers"""
        with self.driver.session() as session:
            result = session.run("""
                MATCH path = shortestPath(
                    (start:Paper {id: $start})-[:CITES*1..{max_depth}]->(end:Paper {id: $end})
                )
                RETURN [node in nodes(path) | node.title] as path
            """.replace("{max_depth}", str(max_depth)), start=start_paper, end=end_paper)

            records = list(result)
            return records[0]["path"] if records else []

# Usage
neo4j_kg = Neo4jKnowledgeGraph()

# Add same data as NetworkX example
neo4j_kg.add_paper("paper_1", "Attention is All You Need", 2017)
neo4j_kg.add_paper("paper_2", "BERT: Pre-training Transformers", 2018)
neo4j_kg.add_author("author_1", "Ashish Vaswani")
neo4j_kg.add_authored("author_1", "paper_1")
neo4j_kg.add_citation("paper_2", "paper_1")

# Advanced query: Citation chain
chain = neo4j_kg.find_citation_chain("paper_3", "paper_1")
print(f"Citation path: {' -> '.join(chain)}")

neo4j_kg.close()

Cypher Query Language

Cypher is Neo4j's declarative query language - incredibly powerful:

// Find papers published after 2018 that cite papers with >10000 citations
MATCH (recent:Paper)-[:CITES]->(influential:Paper)
WHERE recent.year > 2018 AND influential.citations > 10000
RETURN recent.title, influential.title, influential.citations
ORDER BY influential.citations DESC

// Find "research communities" - groups of authors who frequently collaborate
MATCH (a1:Author)-[:AUTHORED]->(:Paper)<-[:AUTHORED]-(a2:Author)
WHERE a1.name < a2.name
WITH a1, a2, count(*) as collaborations
WHERE collaborations > 3
RETURN a1.name, a2.name, collaborations
ORDER BY collaborations DESC

// Find trending concepts (discussed in papers with growing citations)
MATCH (p:Paper)-[:DISCUSSES]->(c:Concept)
WHERE p.year >= 2020
WITH c, avg(p.citations) as avg_citations, count(p) as paper_count
WHERE paper_count > 5
RETURN c.name, avg_citations, paper_count
ORDER BY avg_citations DESC
LIMIT 10

Knowledge Graph: Dev vs Prod Comparison

Feature	NetworkX (Dev)	Neo4j (Prod)
Storage	In-memory only	Persistent to disk
Query Language	Python code	Cypher (declarative)
Scalability	1000s of nodes	Billions of nodes
Performance	Slow for large graphs	Optimized indexes
Transactions	No	ACID transactions
Clustering	No	Multi-node clusters
Best for	Development, algorithms	Production, complex queries

When to Use Each

Development: NetworkX - no setup, great for prototyping
Production: Neo4j - performance, Cypher queries, persistence
ResearcherAI: Abstracts both behind unified interface!

Semantic Web: Ontologies, RDF, and SPARQL

So far we've discussed property graphs (Neo4j, NetworkX). There's another powerful approach: semantic web technologies using RDF and ontologies.

What Are Ontologies?

Think of an ontology as a formal schema for your knowledge:

Web Developer Analogy:

// TypeScript interface = Ontology
interface Person {
  name: string;
  worksFor: Organization;
  knows: Person[];
}

interface Organization {
  name: string;
  foundedDate: Date;
}

An ontology defines:

Classes (Person, Paper, Author, Concept)
Properties (name, authored, cites, discusses)
Relationships (Author → authored → Paper)
Constraints (a Paper must have at least one Author)

RDF: Resource Description Framework

RDF represents knowledge as triples:

Subject  Predicate  Object

Every statement is a triple (like a sentence):

# Turtle syntax (RDF format)
:paper1  rdf:type  :ResearchPaper .
:paper1  :hasTitle  "Attention Is All You Need" .
:paper1  :publishedYear  2017 .
:paper1  :hasAuthor  :vaswani .
:paper1  :cites  :paper2 .

:vaswani  rdf:type  :Author .
:vaswani  :hasName  "Ashish Vaswani" .

Web Developer Analogy:

// JSON = Property Graph
{
  "id": "paper1",
  "title": "Attention Is All You Need",
  "year": 2017,
  "authors": ["vaswani"]
}

// RDF Triples = Semantic Web
["paper1", "type", "ResearchPaper"]
["paper1", "hasTitle", "Attention Is All You Need"]
["paper1", "publishedYear", 2017]
["paper1", "hasAuthor", "vaswani"]

RDF Triple Visualization

Development: RDFLib (Python)

For development, use RDFLib - a pure Python library:

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF, RDFS

class RDFKnowledgeGraph:
    """Development RDF knowledge graph using RDFLib"""

    def __init__(self):
        self.graph = Graph()

        # Define custom namespace for our ontology
        self.ns = Namespace("http://researcherai.org/ontology#")
        self.graph.bind("research", self.ns)

    def add_paper(
        self,
        paper_id: str,
        title: str,
        year: int,
        abstract: str = ""
    ):
        """Add a research paper to the graph"""
        paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")

        # Add triples
        self.graph.add((paper_uri, RDF.type, self.ns.ResearchPaper))
        self.graph.add((paper_uri, self.ns.hasTitle, Literal(title)))
        self.graph.add((paper_uri, self.ns.publishedYear, Literal(year)))

        if abstract:
            self.graph.add((paper_uri, self.ns.hasAbstract, Literal(abstract)))

    def add_author(self, author_id: str, name: str, affiliation: str = ""):
        """Add an author to the graph"""
        author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")

        self.graph.add((author_uri, RDF.type, self.ns.Author))
        self.graph.add((author_uri, self.ns.hasName, Literal(name)))

        if affiliation:
            self.graph.add((author_uri, self.ns.affiliation, Literal(affiliation)))

    def link_author_to_paper(self, author_id: str, paper_id: str):
        """Create authorship relationship"""
        author_uri = URIRef(f"http://researcherai.org/authors/{author_id}")
        paper_uri = URIRef(f"http://researcherai.org/papers/{paper_id}")

        self.graph.add((paper_uri, self.ns.hasAuthor, author_uri))
        self.graph.add((author_uri, self.ns.authored, paper_uri))

    def add_citation(self, citing_paper_id: str, cited_paper_id: str):
        """Add citation relationship"""
        citing_uri = URIRef(f"http://researcherai.org/papers/{citing_paper_id}")
        cited_uri = URIRef(f"http://researcherai.org/papers/{cited_paper_id}")

        self.graph.add((citing_uri, self.ns.cites, cited_uri))
        self.graph.add((cited_uri, self.ns.citedBy, citing_uri))

    def query_sparql(self, sparql_query: str):
        """Execute SPARQL query"""
        return self.graph.query(sparql_query)

    def export_turtle(self, filename: str):
        """Export graph to Turtle format"""
        self.graph.serialize(destination=filename, format='turtle')

    def load_turtle(self, filename: str):
        """Load graph from Turtle format"""
        self.graph.parse(filename, format='turtle')


# Example usage
rdf_kg = RDFKnowledgeGraph()

# Add papers
rdf_kg.add_paper(
    "paper1",
    "Attention Is All You Need",
    2017,
    "The dominant sequence transduction models..."
)

rdf_kg.add_paper(
    "paper2",
    "BERT: Pre-training of Deep Bidirectional Transformers",
    2019,
    "We introduce BERT..."
)

# Add authors
rdf_kg.add_author("vaswani", "Ashish Vaswani", "Google Brain")
rdf_kg.add_author("devlin", "Jacob Devlin", "Google AI")

# Link relationships
rdf_kg.link_author_to_paper("vaswani", "paper1")
rdf_kg.link_author_to_paper("devlin", "paper2")
rdf_kg.add_citation("paper2", "paper1")  # BERT cites Attention paper

# Export to file
rdf_kg.export_turtle("research_graph.ttl")

SPARQL: Query Language for RDF

SPARQL is to RDF what Cypher is to Neo4j (or SQL to relational databases):

# Find all papers by a specific author
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?paper ?title ?year
WHERE {
    ?author research:hasName "Ashish Vaswani" .
    ?author research:authored ?paper .
    ?paper research:hasTitle ?title .
    ?paper research:publishedYear ?year .
}
ORDER BY ?year

Python Example:

sparql_query = """
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?citing_title ?cited_title
WHERE {
    ?citing_paper research:cites ?cited_paper .
    ?citing_paper research:hasTitle ?citing_title .
    ?cited_paper research:hasTitle ?cited_title .
}
"""

results = rdf_kg.query_sparql(sparql_query)
for row in results:
    print(f"{row.citing_title} cites {row.cited_title}")

More SPARQL Examples

# Find authors who collaborated (co-authored papers)
PREFIX research: <http://researcherai.org/ontology#>

SELECT ?author1_name ?author2_name ?paper_title
WHERE {
    ?paper research:hasAuthor ?author1 .
    ?paper research:hasAuthor ?author2 .
    ?paper research:hasTitle ?paper_title .
    ?author1 research:hasName ?author1_name .
    ?author2 research:hasName ?author2_name .
    FILTER(?author1 != ?author2)
}

# Find highly cited papers (cited by many others)
SELECT ?title (COUNT(?citing) as ?citation_count)
WHERE {
    ?paper research:hasTitle ?title .
    ?citing research:cites ?paper .
}
GROUP BY ?title
HAVING (COUNT(?citing) > 100)
ORDER BY DESC(?citation_count)

# Find papers published after 2018 in a specific domain
SELECT ?title ?year
WHERE {
    ?paper research:hasTitle ?title .
    ?paper research:publishedYear ?year .
    ?paper research:discusses ?concept .
    ?concept research:hasName "transformers" .
    FILTER(?year > 2018)
}

Production: Apache Jena & SPARQL Endpoint

For production, use Apache Jena Fuseki - a SPARQL server:

from SPARQLWrapper import SPARQLWrapper, JSON
import requests

class JenaKnowledgeGraph:
    """Production RDF knowledge graph using Apache Jena Fuseki"""

    def __init__(
        self,
        endpoint_url: str = "http://localhost:3030/research",
        update_endpoint: str = "http://localhost:3030/research/update"
    ):
        self.endpoint_url = endpoint_url
        self.update_endpoint = update_endpoint
        self.sparql = SPARQLWrapper(endpoint_url)

    def add_triples(self, triples_turtle: str):
        """Add RDF triples to the graph"""
        # Use SPARQL UPDATE to insert data
        update_query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        INSERT DATA {{
            {triples_turtle}
        }}
        """

        response = requests.post(
            self.update_endpoint,
            data={"update": update_query},
            headers={"Content-Type": "application/x-www-form-urlencoded"}
        )
        return response.status_code == 200

    def add_paper(self, paper_id: str, title: str, year: int):
        """Add a research paper"""
        triples = f"""
        @prefix research: <http://researcherai.org/ontology#> .
        @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .

        <http://researcherai.org/papers/{paper_id}>
            rdf:type research:ResearchPaper ;
            research:hasTitle "{title}" ;
            research:publishedYear {year} .
        """
        return self.add_triples(triples)

    def query(self, sparql_query: str) -> list:
        """Execute SPARQL SELECT query"""
        self.sparql.setQuery(sparql_query)
        self.sparql.setReturnFormat(JSON)

        results = self.sparql.query().convert()
        return results["results"]["bindings"]

    def find_papers_by_author(self, author_name: str) -> list:
        """Find all papers by an author"""
        query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        SELECT ?paper ?title ?year
        WHERE {{
            ?author research:hasName "{author_name}" .
            ?author research:authored ?paper .
            ?paper research:hasTitle ?title .
            ?paper research:publishedYear ?year .
        }}
        ORDER BY DESC(?year)
        """
        return self.query(query)

    def find_citation_chain(self, paper_id: str, depth: int = 2) -> list:
        """Find papers that cite this paper (transitive)"""
        query = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        SELECT ?citing_paper ?title ?distance
        WHERE {{
            <http://researcherai.org/papers/{paper_id}>
                ^research:cites{{1,{depth}}} ?citing_paper .
            ?citing_paper research:hasTitle ?title .

            BIND(
                COUNT(?intermediate) as ?distance
            )
        }}
        """
        return self.query(query)


# Example usage with Fuseki
jena_kg = JenaKnowledgeGraph(
    endpoint_url="http://localhost:3030/research/sparql",
    update_endpoint="http://localhost:3030/research/update"
)

# Add paper
jena_kg.add_paper(
    "attention2017",
    "Attention Is All You Need",
    2017
)

# Query
results = jena_kg.find_papers_by_author("Ashish Vaswani")
for result in results:
    print(f"{result['title']['value']} ({result['year']['value']})")

OWL: Web Ontology Language - The Power of Reasoning

OWL (Web Ontology Language) extends RDF with reasoning capabilities. It's the difference between storing facts and deriving new knowledge from those facts.

Web Developer Analogy:

// RDF = Data storage
const data = {
  "Vaswani": { type: "Author", authored: ["paper1"] }
}

// OWL = Data storage + Logic rules
const data = { ... }
const rules = {
  // If someone authored a paper, they are a researcher
  "Author who authored something → Researcher"
}
// Reasoner can INFER: "Vaswani is a Researcher" (even if not explicitly stated)

Why OWL Matters: Automatic Inference

Without OWL (just RDF):

:vaswani :authored :paper1 .
# To know Vaswani is a researcher, you must explicitly state it

With OWL (RDF + reasoning):

# Define rule: anyone who authored something is a researcher
:authored rdfs:domain :Researcher .

# Just state the fact
:vaswani :authored :paper1 .

# OWL reasoner AUTOMATICALLY infers:
:vaswani rdf:type :Researcher .  # Derived, not stated!

OWL Ontology Definition

@prefix owl: <http://www.w3.org/2002/07/owl#> .
@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
@prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
@prefix research: <http://researcherai.org/ontology#> .

# Define classes
research:ResearchPaper rdf:type owl:Class .
research:Author rdf:type owl:Class .
research:Researcher rdf:type owl:Class .
research:Concept rdf:type owl:Class .
research:InfluentialPaper rdf:type owl:Class .

# Define class hierarchy
research:Author rdfs:subClassOf research:Researcher .
# All Authors are Researchers (but not all Researchers are Authors)

# Define properties
research:hasAuthor rdf:type owl:ObjectProperty ;
    rdfs:domain research:ResearchPaper ;
    rdfs:range research:Author .

research:cites rdf:type owl:ObjectProperty ;
    rdfs:domain research:ResearchPaper ;
    rdfs:range research:ResearchPaper .

research:citedBy rdf:type owl:ObjectProperty ;
    owl:inverseOf research:cites .  # Automatic inverse!

research:hasTitle rdf:type owl:DatatypeProperty ;
    rdfs:domain research:ResearchPaper ;
    rdfs:range xsd:string .

research:publishedYear rdf:type owl:DatatypeProperty ;
    rdfs:domain research:ResearchPaper ;
    rdfs:range xsd:integer .

research:citationCount rdf:type owl:DatatypeProperty ;
    rdfs:domain research:ResearchPaper ;
    rdfs:range xsd:integer .

# Define constraints
research:ResearchPaper rdfs:subClassOf [
    rdf:type owl:Restriction ;
    owl:onProperty research:hasAuthor ;
    owl:minCardinality "1"^^xsd:nonNegativeInteger
] .  # A paper must have at least one author

# Define derived class (automatic classification!)
research:InfluentialPaper owl:equivalentClass [
    rdf:type owl:Restriction ;
    owl:onProperty research:citationCount ;
    owl:someValuesFrom [
        rdf:type rdfs:Datatype ;
        owl:onDatatype xsd:integer ;
        owl:withRestrictions ([ xsd:minInclusive 100 ])
    ]
] .  # Papers with 100+ citations are automatically "InfluentialPaper"

# Property characteristics
research:collaboratesWith rdf:type owl:SymmetricProperty .
# If A collaborates with B, then B collaborates with A

research:cites rdf:type owl:TransitiveProperty .
# If A cites B, and B cites C, then A transitively cites C

Development: Owlready2 with Reasoning

For development, use owlready2 - a Python library with built-in reasoner:

from owlready2 import *
import tempfile

class OWLKnowledgeGraph:
    """Development OWL ontology with reasoning"""

    def __init__(self, ontology_iri="http://researcherai.org/ontology"):
        self.onto = get_ontology(ontology_iri)

        with self.onto:
            # Define classes
            class ResearchPaper(Thing):
                pass

            class Author(Thing):
                pass

            class Researcher(Thing):
                pass

            class Concept(Thing):
                pass

            class InfluentialPaper(ResearchPaper):
                pass

            # Define properties
            class hasAuthor(ObjectProperty):
                domain = [ResearchPaper]
                range = [Author]

            class authored(ObjectProperty):
                domain = [Author]
                range = [ResearchPaper]
                inverse_property = hasAuthor

            class cites(ObjectProperty, TransitiveProperty):
                domain = [ResearchPaper]
                range = [ResearchPaper]

            class citedBy(ObjectProperty):
                inverse_property = cites

            class collaboratesWith(ObjectProperty, SymmetricProperty):
                domain = [Author]
                range = [Author]

            class hasTitle(DataProperty, FunctionalProperty):
                domain = [ResearchPaper]
                range = [str]

            class publishedYear(DataProperty, FunctionalProperty):
                domain = [ResearchPaper]
                range = [int]

            class citationCount(DataProperty):
                domain = [ResearchPaper]
                range = [int]

            # Define rules
            class AuthorRule(Author >> Researcher):
                """All authors are researchers"""
                pass

            # Define automatic classification
            class InfluentialPaperRule(ResearchPaper):
                equivalent_to = [
                    ResearchPaper & citationCount.some(int >= 100)
                ]

        self.ResearchPaper = self.onto.ResearchPaper
        self.Author = self.onto.Author
        self.hasAuthor = self.onto.hasAuthor
        self.cites = self.onto.cites
        self.hasTitle = self.onto.hasTitle
        self.publishedYear = self.onto.publishedYear
        self.citationCount = self.onto.citationCount

    def add_paper(self, paper_id: str, title: str, year: int, citations: int = 0):
        """Add a research paper"""
        paper = self.ResearchPaper(paper_id)
        paper.hasTitle = [title]
        paper.publishedYear = [year]
        paper.citationCount = [citations]
        return paper

    def add_author(self, author_id: str, name: str):
        """Add an author"""
        author = self.Author(author_id)
        author.label = [name]
        return author

    def link_author_to_paper(self, author, paper):
        """Create authorship relationship"""
        author.authored.append(paper)
        # Inverse relationship is automatic!

    def add_citation(self, citing_paper, cited_paper):
        """Add citation relationship"""
        citing_paper.cites.append(cited_paper)
        # citedBy is automatic (inverse property)!

    def run_reasoner(self):
        """Run OWL reasoner to infer new facts"""
        print("Running reasoner...")
        with self.onto:
            sync_reasoner(debug=False)
        print("Reasoning complete!")

    def find_influential_papers(self):
        """Find papers automatically classified as influential"""
        return list(self.onto.InfluentialPaper.instances())

    def find_all_researchers(self):
        """Find all researchers (including inferred ones)"""
        return list(self.onto.Researcher.instances())

    def save(self, filename: str):
        """Save ontology to file"""
        self.onto.save(file=filename, format="rdfxml")

    def load(self, filename: str):
        """Load ontology from file"""
        self.onto = get_ontology(filename).load()


# Example usage with reasoning
owl_kg = OWLKnowledgeGraph()

# Add papers
paper1 = owl_kg.add_paper("attention2017", "Attention Is All You Need", 2017, 15000)
paper2 = owl_kg.add_paper("bert2019", "BERT", 2019, 8000)
paper3 = owl_kg.add_paper("transformer_xl", "Transformer-XL", 2019, 500)

# Add authors
vaswani = owl_kg.add_author("vaswani", "Ashish Vaswani")
devlin = owl_kg.add_author("devlin", "Jacob Devlin")

# Link relationships
owl_kg.link_author_to_paper(vaswani, paper1)
owl_kg.link_author_to_paper(devlin, paper2)

# Add citations
owl_kg.add_citation(paper2, paper1)  # BERT cites Attention
owl_kg.add_citation(paper3, paper1)  # Transformer-XL cites Attention

print("Before reasoning:")
print(f"Influential papers: {len(owl_kg.find_influential_papers())}")
print(f"Researchers: {len(owl_kg.find_all_researchers())}")

# Run reasoner
owl_kg.run_reasoner()

print("\nAfter reasoning:")
# Papers with 100+ citations are automatically classified as InfluentialPaper
influential = owl_kg.find_influential_papers()
print(f"Influential papers: {[p.hasTitle[0] for p in influential]}")

# Authors are automatically inferred to be Researchers
researchers = owl_kg.find_all_researchers()
print(f"Researchers: {[r.label[0] for r in researchers]}")

# Check inverse properties
print(f"\nVaswani authored: {[p.hasTitle[0] for p in vaswani.authored]}")
print(f"Paper1 has authors: {[a.label[0] for a in paper1.hasAuthor]}")
# Both work! Inverse is automatic.

# Check citedBy (inverse of cites)
print(f"\nPaper1 is cited by: {[p.hasTitle[0] for p in paper1.citedBy]}")
# Automatic from cites relationship!

OWL Reasoning Examples

1. Class Hierarchy Inference:

# Define hierarchy
with owl_kg.onto:
    class Author(Thing):
        pass

    class PhDStudent(Author):
        pass

    class Professor(Author):
        pass

    # All Authors are Researchers
    class AuthorIsResearcher(Author >> Researcher):
        pass

# Create instance
phd_student = owl_kg.onto.PhDStudent("alice")

# Before reasoning
print(phd_student.is_a)  # [PhDStudent]

# Run reasoner
sync_reasoner()

# After reasoning
print(phd_student.is_a)  # [PhDStudent, Author, Researcher]
# Automatically inferred Alice is an Author and Researcher!

2. Property Propagation:

# Define transitive property
class influences(ObjectProperty, TransitiveProperty):
    pass

# State facts
paper_a = ResearchPaper("paper_a")
paper_b = ResearchPaper("paper_b")
paper_c = ResearchPaper("paper_c")

paper_a.influences = [paper_b]  # A influences B
paper_b.influences = [paper_c]  # B influences C

# Run reasoner
sync_reasoner()

# Reasoner infers: A influences C (transitively)
print(paper_c in paper_a.influences)  # True!

3. Automatic Classification:

# Define rule: Papers with many authors are "Collaborative"
with owl_kg.onto:
    class CollaborativePaper(ResearchPaper):
        equivalent_to = [
            ResearchPaper & hasAuthor.min(5)  # 5+ authors
        ]

# Add paper with 6 authors
paper = ResearchPaper("multi_author_paper")
for i in range(6):
    author = Author(f"author_{i}")
    paper.hasAuthor.append(author)

# Run reasoner
sync_reasoner()

# Paper is automatically classified as CollaborativePaper!
print(CollaborativePaper in paper.is_a)  # True!

Production: Apache Jena with OWL Reasoner

For production, use Apache Jena with built-in OWL reasoners:

from rdflib import Graph, Namespace
from rdflib.plugins.sparql import prepareQuery

class JenaOWLKnowledgeGraph:
    """Production OWL knowledge graph with Jena reasoner"""

    def __init__(self, fuseki_url: str = "http://localhost:3030/research"):
        self.fuseki_url = fuseki_url
        self.graph = Graph()
        self.ns = Namespace("http://researcherai.org/ontology#")

        # Load ontology schema
        self.load_ontology()

    def load_ontology(self):
        """Load OWL ontology definitions"""
        ontology_ttl = """
        @prefix owl: <http://www.w3.org/2002/07/owl#> .
        @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
        @prefix research: <http://researcherai.org/ontology#> .

        research:Author rdfs:subClassOf research:Researcher .
        research:authored rdfs:domain research:Author .
        research:cites rdf:type owl:TransitiveProperty .
        research:citedBy owl:inverseOf research:cites .
        """
        self.graph.parse(data=ontology_ttl, format="turtle")

    def query_with_reasoning(self, sparql_query: str):
        """Execute SPARQL with reasoning enabled"""
        # Jena Fuseki can enable reasoner via endpoint config
        # Example: http://localhost:3030/research_reasoned/sparql
        results = self.graph.query(sparql_query)
        return list(results)


# Configure Jena Fuseki with OWL reasoner
fuseki_config = """
<#service> rdf:type fuseki:Service ;
    fuseki:name "research" ;
    fuseki:serviceQuery "sparql" ;
    fuseki:dataset <#dataset> .

<#dataset> rdf:type ja:DatasetTxnMem ;
    ja:defaultGraph <#model_inf> .

<#model_inf> rdf:type ja:InfModel ;
    ja:reasoner [
        ja:reasonerURL <http://jena.hpl.hp.com/2003/OWLFBRuleReasoner>
    ] ;
    ja:baseModel <#model_base> .

<#model_base> rdf:type ja:MemoryModel .
"""

OWL Profiles: Which to Use?

OWL has different profiles (subsets) for different use cases:

Profile	Complexity	Reasoning	Use Case
OWL Full	Maximum expressivity	Undecidable	Research, experimental
OWL DL	Description Logic	Complete & decidable	General purpose
OWL Lite	Basic class hierarchy	Simple & fast	Simple taxonomies
OWL EL	Existential quantification	Polynomial time	Large ontologies (medical)
OWL QL	Query-oriented	Log-space	Database integration
OWL RL	Rule-based	Polynomial time	Business rules

For ResearcherAI: Use OWL DL or OWL RL - balance of expressivity and performance.

When to Use OWL vs Just RDF

Use OWL when you need:

Automatic classification
- Classify papers as "influential" based on citation count
- Identify "interdisciplinary" papers based on concept diversity
Inference from rules
- Infer co-authors from paper authorship
- Derive expertise areas from publication history
Consistency checking
- Ensure papers have at least one author
- Validate that publication years are reasonable
Property inheritance
- Symmetric properties (collaboration)
- Transitive properties (influence, citation chains)
- Inverse properties (cites ↔ citedBy)

Use just RDF when:

Simple data storage - no complex reasoning needed
Performance critical - reasoning is computationally expensive
Schema is stable - don't need automatic classification
Explicit is better - want to state all facts explicitly

OWL Reasoning: Dev vs Prod Comparison

Feature	Owlready2 (Dev)	Apache Jena (Prod)
Language	Python	Java (Python client)
Reasoners	HermiT, Pellet	Jena, Pellet, HermiT
Performance	Slower (Python)	Faster (Java)
Scalability	Small ontologies	Large ontologies
Ease of Use	Very easy (Pythonic)	More complex setup
Integration	Great for scripts	Enterprise integration
Best for	Development, prototyping	Production, large scale

Example Use Case for ResearcherAI:

# Use OWL to automatically identify "rising stars" (researchers)
# Rule: A rising star is someone who:
# - Authored papers in the last 3 years
# - Has papers cited > 50 times
# - Collaborates with established researchers

with owl_kg.onto:
    class EstablishedResearcher(Researcher):
        equivalent_to = [
            Researcher & authored.some(
                ResearchPaper & citationCount.some(int >= 500)
            )
        ]

    class RisingStar(Researcher):
        equivalent_to = [
            Researcher &
            authored.some(
                ResearchPaper &
                publishedYear.some(int >= 2021) &
                citationCount.some(int >= 50)
            ) &
            collaboratesWith.some(EstablishedResearcher)
        ]

# Run reasoner
sync_reasoner()

# Automatically finds rising stars!
rising_stars = list(owl_kg.onto.RisingStar.instances())
print(f"Rising stars: {[r.label[0] for r in rising_stars]}")

OWL Summary

OWL = RDF + reasoning/inference capabilities
Use for: Automatic classification, rule-based inference, consistency checking
Dev: owlready2 (easy Python integration)
Prod: Apache Jena (performance, scalability)
ResearcherAI: Could use OWL for researcher classification, paper categorization

OWL Performance

OWL reasoning can be computationally expensive. For large knowledge graphs (millions of triples), reasoning can take minutes to hours. Consider:

Using simpler OWL profiles (EL, QL, RL)
Pre-computing inferences offline
Using incremental reasoning
Caching reasoner results

RDF vs Property Graphs: When to Use Each

Feature	RDF (Jena/RDFLib)	Property Graphs (Neo4j)
Data Model	Triples (subject-predicate-object)	Nodes with properties + labeled edges
Schema	Ontology (OWL)	Schema optional
Standards	W3C standards (RDF, OWL, SPARQL)	No universal standard
Query Language	SPARQL	Cypher
Reasoning	Built-in inferencing (OWL reasoners)	No built-in reasoning
Flexibility	Extremely flexible, schema evolution	More rigid structure
Performance	Slower for graph traversal	Optimized for graph queries
Use Case	Scientific data, linked data, ontologies	Social networks, recommendations
Learning Curve	Steeper (ontologies, W3C specs)	Gentler (more intuitive)

Web Developer Analogy:

RDF = XML/JSON-LD with strict schemas (TypeScript with interfaces)
Property Graphs = NoSQL document store with relationships (MongoDB + relationships)

When to Use RDF:

Need formal ontologies - scientific domains, medical data
Data integration - combining data from multiple sources
Reasoning/inference - derive new facts from existing ones
Linked open data - publish data others can link to
Interoperability - strict W3C standards

Example: Medical knowledge graphs, DBpedia, Wikidata

When to Use Property Graphs:

Graph algorithms - shortest path, community detection
High-performance traversal - social networks, fraud detection
Flexible schema - rapidly evolving data model
Intuitive queries - easier to learn and use
Real-time recommendations - e-commerce, content recommendations

Example: LinkedIn connections, recommendation engines, ResearcherAI

ResearcherAI's Approach

For ResearcherAI, we use property graphs (Neo4j) because:

Better performance for citation traversal
Simpler learning curve for developers
Flexible schema - research data models evolve
Cypher is intuitive - easier than SPARQL for most queries
Neo4j has excellent tooling - Browser, Bloom, Graph Data Science

However, if you needed to:

Integrate with external ontologies (e.g., medical ontologies)
Publish linked open data
Use formal reasoning/inference
Comply with W3C standards

Then RDF with Apache Jena would be the better choice.

Hybrid Approach: RDF + Property Graphs

You can actually use both:

class HybridSemanticKnowledgeGraph:
    """Combine RDF (for ontology) with Neo4j (for performance)"""

    def __init__(
        self,
        neo4j_kg: Neo4jKnowledgeGraph,
        rdf_kg: RDFKnowledgeGraph
    ):
        self.neo4j = neo4j_kg  # For fast queries
        self.rdf = rdf_kg      # For ontology and reasoning

    def add_paper(self, paper_data: dict):
        """Add to both stores"""
        # Add to Neo4j for performance
        self.neo4j.add_paper(
            paper_data["id"],
            paper_data["title"],
            paper_data["year"]
        )

        # Add to RDF for formal semantics
        self.rdf.add_paper(
            paper_data["id"],
            paper_data["title"],
            paper_data["year"]
        )

    def query_with_reasoning(self, sparql_query: str):
        """Use RDF reasoner for inference"""
        return self.rdf.query_sparql(sparql_query)

    def query_with_performance(self, cypher_query: str):
        """Use Neo4j for fast graph traversal"""
        return self.neo4j.query_cypher(cypher_query)

RDF/SPARQL: Dev vs Prod Comparison

Feature	RDFLib (Dev)	Apache Jena Fuseki (Prod)
Storage	In-memory or file	Persistent triple store
Query	Python SPARQL	HTTP SPARQL endpoint
Scalability	100k triples	Billions of triples
Performance	Slow for large graphs	Optimized indices
Reasoning	Basic	Full OWL reasoning
Concurrent Access	No	Yes (multi-user)
Best for	Development, testing	Production, linked data

RDF vs Property Graphs Summary

RDF: Formal ontologies, reasoning, standards compliance, data integration
Property Graphs: Performance, graph algorithms, simpler queries, flexibility
ResearcherAI: Uses property graphs for performance, but you can use RDF if needed!

Building Knowledge Graphs: Construction Methods

Now that you understand what knowledge graphs are, let's explore how to build them from various data sources.

Three Main Approaches

There are three primary methods to construct knowledge graphs, depending on your data source structure:

1. Structured Sources (Relational Databases, CSV)

Characteristics:

Fixed schema (all entities of same type have same attributes)
Examples: SQL databases, CSV files, Excel spreadsheets
Easiest to convert to knowledge graphs

Method: Use mapping rules (like R2RML - RDB to RDF Mapping Language)

# Example: CSV to Knowledge Graph
import pandas as pd
from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

# Source: papers.csv
# paper_id,title,year,author_id
# p1,"Attention Is All You Need",2017,a1
# p2,"BERT",2018,a2

df = pd.read_csv("papers.csv")
g = Graph()
ns = Namespace("http://example.org/")

for _, row in df.iterrows():
    paper_uri = ns[row['paper_id']]

    # Add triples
    g.add((paper_uri, RDF.type, ns.ResearchPaper))
    g.add((paper_uri, ns.hasTitle, Literal(row['title'])))
    g.add((paper_uri, ns.publishedYear, Literal(row['year'])))
    g.add((paper_uri, ns.hasAuthor, ns[row['author_id']]))

# Result: Knowledge graph with papers, titles, years, authors

Mapping Rules (R2RML):

# R2RML mapping: SQL table → RDF
@prefix rr: <http://www.w3.org/ns/r2rml#> .

<#PaperMapping> a rr:TriplesMap ;
    rr:logicalTable [ rr:tableName "papers" ] ;
    rr:subjectMap [
        rr:template "http://example.org/paper/{paper_id}" ;
        rr:class :ResearchPaper
    ] ;
    rr:predicateObjectMap [
        rr:predicate :hasTitle ;
        rr:objectMap [ rr:column "title" ]
    ] .

2. Semi-Structured Sources (JSON, XML)

Characteristics:

Flexible schema (entities of same type may have different attributes)
Examples: JSON APIs, XML documents, HTML pages
Moderate complexity to convert

Method: Use mapping rules adapted to the structure

import json
from rdflib import Graph, Namespace, Literal

# Source: papers.json
json_data = {
    "paper1": {
        "title": "Attention Is All You Need",
        "year": 2017,
        "authors": ["Vaswani", "Shazeer"],  # Variable length!
        "venue": "NIPS"  # Optional field
    },
    "paper2": {
        "title": "BERT",
        "year": 2018,
        "authors": ["Devlin"]  # Different number of authors
        # No venue field!
    }
}

g = Graph()
ns = Namespace("http://example.org/")

for paper_id, paper_data in json_data.items():
    paper_uri = ns[paper_id]
    g.add((paper_uri, ns.hasTitle, Literal(paper_data['title'])))
    g.add((paper_uri, ns.publishedYear, Literal(paper_data['year'])))

    # Handle variable-length authors
    for author in paper_data.get('authors', []):
        g.add((paper_uri, ns.hasAuthor, Literal(author)))

    # Handle optional venue
    if 'venue' in paper_data:
        g.add((paper_uri, ns.publishedAt, Literal(paper_data['venue'])))

3. Unstructured Sources (Text, Images, PDFs)

Characteristics:

No fixed schema at all
Examples: Natural language text, research papers (PDF), images
Most complex to convert - requires AI/NLP

Method: Use NLP techniques to extract entities and relationships

# Example: Extract entities and relationships from text
from transformers import pipeline

# NLP model for Named Entity Recognition
ner = pipeline("ner", model="dbmdz/bert-large-cased-finetuned-conll03-english")

text = """
Attention Is All You Need was published in 2017 by Vaswani and colleagues at Google.
The paper introduced the Transformer architecture for sequence-to-sequence tasks.
"""

# Extract entities
entities = ner(text)
# Result: [
#   {"entity": "Attention Is All You Need", "type": "WORK"},
#   {"entity": "2017", "type": "DATE"},
#   {"entity": "Vaswani", "type": "PERSON"},
#   {"entity": "Google", "type": "ORG"}
# ]

# Extract relationships (requires relation extraction model)
# "Attention Is All You Need" -[published_in]-> "2017"
# "Attention Is All You Need" -[authored_by]-> "Vaswani"
# "Vaswani" -[works_at]-> "Google"

# Convert to knowledge graph triples
g = Graph()
ns = Namespace("http://example.org/")

g.add((ns.attention_paper, ns.publishedYear, Literal(2017)))
g.add((ns.attention_paper, ns.hasAuthor, ns.vaswani))
g.add((ns.vaswani, ns.worksAt, ns.google))

Knowledge Graph Construction Process

Here's the end-to-end process for building production knowledge graphs:

Step 0: Define Use Case and Scope

Before building, answer:

What questions do we need to answer? ("Which papers cite X?", "Who are experts in Y?")
What metadata do we need? (papers, authors, citations, concepts)
What relationships matter? (cites, authored_by, discusses)

Example for ResearcherAI:

use_case = {
    "questions": [
        "Find papers related to transformers",
        "Who are the leading researchers in NLP?",
        "What papers cite the Attention paper?"
    ],
    "metadata": ["papers", "authors", "citations", "concepts"],
    "relationships": ["cites", "authored_by", "discusses", "collaborates_with"]
}

Step 1: Data Collection

Gather data from various sources across your ecosystem:

# Example: Collect from multiple sources
sources = {
    "papers_db": "SELECT * FROM papers",  # SQL database
    "arxiv_api": "https://api.arxiv.org/papers",  # REST API
    "paper_pdfs": "/data/pdfs/*.pdf",  # Unstructured files
    "citations_csv": "/data/citations.csv"  # CSV file
}

# Each source "speaks" a different language:
# - SQL: Relational tables
# - API: JSON
# - PDFs: Unstructured text
# - CSV: Tabular data

Step 2: Data Cleaning and Standardization

Ensure data quality by:

Standardizing: Convert all dates to ISO format (YYYY-MM-DD)
Deduplicating: Merge duplicate author entries
Validating: Check data types, required fields
Resolving: Handle inconsistencies (same paper different IDs)

import pandas as pd

# Example: Clean author data
authors_raw = pd.read_csv("authors.csv")

# Standardize names
authors_raw['name'] = authors_raw['name'].str.strip().str.title()

# Remove duplicates (same email = same person)
authors_clean = authors_raw.drop_duplicates(subset=['email'])

# Validate required fields
authors_clean = authors_clean.dropna(subset=['name', 'affiliation'])

# Resolve inconsistencies (assign unique IDs)
authors_clean['author_id'] = range(len(authors_clean))

Step 3: Data Modeling (Convert to RDF Triples)

Transform cleaned data into standardized RDF triples:

from rdflib import Graph, Namespace, Literal, URIRef
from rdflib.namespace import RDF

g = Graph()
ns = Namespace("http://researcherai.org/")

# For each paper in cleaned data
for _, paper in papers_clean.iterrows():
    paper_uri = ns[f"paper/{paper['id']}"]

    # Subject-Predicate-Object triples
    g.add((paper_uri, RDF.type, ns.ResearchPaper))
    g.add((paper_uri, ns.hasTitle, Literal(paper['title'])))
    g.add((paper_uri, ns.publishedYear, Literal(paper['year'])))

    # Relationships to other entities
    for author_id in paper['authors']:
        author_uri = ns[f"author/{author_id}"]
        g.add((paper_uri, ns.hasAuthor, author_uri))

Step 4: Usage and Insights

Now query the knowledge graph to deliver value:

# Find papers by author
SELECT ?paper ?title WHERE {
    ?author :hasName "Ashish Vaswani" .
    ?paper :hasAuthor ?author .
    ?paper :hasTitle ?title .
}

# Find citation paths
SELECT ?citing ?cited WHERE {
    ?citing :cites+ ?cited .  # Transitive: 1+ hops
}

Comparison of Construction Methods

Source Type	Complexity	Tools	Best For
Structured	⭐ Low	R2RML, pandas	Databases, CSV, Excel
Semi-Structured	⭐⭐ Medium	JSON/XML parsers	APIs, config files
Unstructured	⭐⭐⭐ High	NLP, LLMs, OCR	PDFs, text, images

ResearcherAI's Approach:

Structured: arXiv metadata (JSON API) → Easy conversion
Semi-Structured: Paper metadata from multiple APIs → JSON parsing
Unstructured: Paper PDFs → NLP for concept extraction

Construction Best Practices

Start with structured sources - easiest to convert and validate
Clean data thoroughly - garbage in, garbage out
Define schema first - know what entities and relationships you need
Validate incrementally - check quality at each step
Use existing ontologies - don't reinvent the wheel (e.g., Schema.org)

Hands-On: Building a Research Paper Knowledge Graph

Now let's walk through a complete example of building a knowledge graph from structured data using the declarative SPARQL CONSTRUCT approach.

What You'll Learn:

How to transform CSV data into RDF triples
Using SPARQL CONSTRUCT queries for mapping
Incrementally building a knowledge graph
Visualizing the resulting graph

Step 1: Input Data

Imagine you have research paper data in CSV files:

papers.csv:

domain,title,year,abstract
NLP,Attention Is All You Need,2017,Transformer architecture for sequence-to-sequence
NLP,BERT,2018,Bidirectional encoder representations
CV,ResNet,2015,Deep residual learning for image recognition

authors.csv:

name,affiliation,domain
Ashish Vaswani,Google Brain,NLP
Jacob Devlin,Google AI,NLP
Kaiming He,Facebook AI,CV

citations.csv:

citing_paper,cited_paper,citation_type
BERT,Attention Is All You Need,builds_on
ResNet,VGGNet,improves

concepts.csv:

paper,concept,importance
Attention Is All You Need,self-attention,high
Attention Is All You Need,transformers,high
BERT,bidirectional,high
ResNet,residual-connections,high

Let's load this data:

import pandas as pd
from rdflib import Graph, Literal, Namespace
from rdflib.plugins.sparql.processor import prepareQuery

# Load CSV files
papers_df = pd.read_csv("papers.csv").fillna('')
authors_df = pd.read_csv("authors.csv").fillna('')
citations_df = pd.read_csv("citations.csv").fillna('')
concepts_df = pd.read_csv("concepts.csv").fillna('')

# Show distribution
data = {
    "Papers": len(papers_df),
    "Authors": len(authors_df),
    "Citations": len(citations_df),
    "Concepts": len(concepts_df)
}
print(pd.DataFrame.from_dict(data, orient='index', columns=['Count']))
# Output:
#           Count
# Papers        3
# Authors       3
# Citations     3
# Concepts      4

Step 2: Define the Knowledge Graph Schema

Based on our data, we define the schema:

# Schema for Research Papers
@prefix research: <http://example.org/research#> .

# Classes (Entity Types)
research:Paper a rdfs:Class .
research:Author a rdfs:Class .
research:Concept a rdfs:Class .
research:ResearchDomain a rdfs:Class .

# Properties
research:hasTitle a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range xsd:string .

research:publishedYear a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range xsd:integer .

research:hasAbstract a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range xsd:string .

# Relationships
research:authoredBy a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range research:Author .

research:cites a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range research:Paper .

research:discusses a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range research:Concept .

research:belongsToDomain a rdf:Property ;
    rdfs:domain research:Paper ;
    rdfs:range research:ResearchDomain .

Visualization of Schema:

Step 3: SPARQL CONSTRUCT Queries for Mapping

Now we define SPARQL CONSTRUCT queries to transform CSV data into RDF triples:

Query 1: Create Paper Entities

PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?paper a research:Paper .
    ?paper research:hasTitle ?title .
    ?paper research:publishedYear ?year .
    ?paper research:hasAbstract ?abstract .
    ?paper research:belongsToDomain ?domainIRI .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?title, " ", "_"))) AS ?paper)
    BIND(IRI(CONCAT("http://data.example.org/domain/",
                    ?domain)) AS ?domainIRI)
}

Web Developer Analogy:

// SPARQL CONSTRUCT is like a template for creating objects
const papers = csvData.map(row => ({
  id: `http://data.example.org/paper/${row.title.replace(/ /g, '_')}`,
  type: "Paper",
  title: row.title,
  year: row.year,
  abstract: row.abstract,
  domain: `http://data.example.org/domain/${row.domain}`
}))

Query 2: Create Author Entities

PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?author a research:Author .
    ?author research:hasName ?name .
    ?author research:affiliation ?affiliation .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/author/",
                    REPLACE(?name, " ", "_"))) AS ?author)
}

Query 3: Create Citation Relationships

PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?citingPaperIRI research:cites ?citedPaperIRI .
    ?citingPaperIRI research:citationType ?citation_type .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}

Query 4: Create Concept Relationships

PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?paperIRI research:discusses ?conceptIRI .
    ?conceptIRI research:importance ?importance .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?paper, " ", "_"))) AS ?paperIRI)
    BIND(IRI(CONCAT("http://data.example.org/concept/",
                    ?concept)) AS ?conceptIRI)
}

Step 4: The Transform Function

This function applies SPARQL CONSTRUCT queries to DataFrame rows:

import re
from rdflib import Graph, Literal
from rdflib.plugins.sparql.processor import prepareQuery

def transform(df: pd.DataFrame, construct_query: str,
              first: bool = False) -> Graph:
    """Transform Pandas DataFrame to RDFLib Graph using SPARQL CONSTRUCT.

    Args:
        df: Input DataFrame with CSV data
        construct_query: SPARQL CONSTRUCT query template
        first: If True, only process first row (for testing)

    Returns:
        RDF Graph with constructed triples
    """
    # Setup graphs
    query_graph = Graph()
    result_graph = Graph()

    # Parse the SPARQL query
    query = prepareQuery(construct_query)

    # Clean column names (remove special characters)
    invalid_pattern = re.compile(r"[^\w_]+")
    headers = dict((k, invalid_pattern.sub("_", k)) for k in df.columns)

    # Process each row
    for _, row in df.iterrows():
        # Create variable bindings: column name -> cell value
        binding = dict((headers[k], Literal(row[k]))
                      for k in df.columns if len(str(row[k])) > 0)

        # Execute query with bindings
        results = query_graph.query(query, initBindings=binding)

        # Add resulting triples to graph
        for triple in results:
            result_graph.add(triple)

        # Stop after first row if testing
        if first:
            break

    return result_graph

How It Works:

Parse query: Prepare the SPARQL CONSTRUCT template
For each row: Create variable bindings (CSV columns → SPARQL variables)
Execute query: Replace variables with values, construct triples
Add to graph: Accumulate all triples in result graph

Step 5: Build the Knowledge Graph Incrementally

# Initialize empty knowledge graph
kg = Graph()

# Step 1: Add papers
construct_papers = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?paper a research:Paper .
    ?paper research:hasTitle ?title .
    ?paper research:publishedYear ?year .
    ?paper research:hasAbstract ?abstract .
    ?paper research:belongsToDomain ?domainIRI .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?title, " ", "_"))) AS ?paper)
    BIND(IRI(CONCAT("http://data.example.org/domain/",
                    ?domain)) AS ?domainIRI)
}
"""

# Test with first row
print("Testing with first paper:")
print(transform(papers_df, construct_papers, first=True).serialize(format='turtle'))

# Output:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
#     a research:Paper ;
#     research:hasTitle "Attention Is All You Need" ;
#     research:publishedYear 2017 ;
#     research:hasAbstract "Transformer architecture..." ;
#     research:belongsToDomain <http://data.example.org/domain/NLP> .

# Add all papers to knowledge graph
kg += transform(papers_df, construct_papers)
print(f"After adding papers: {len(kg)} triples")
# Output: After adding papers: 12 triples (3 papers × 4 properties each)

# Step 2: Add authors
construct_authors = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?author a research:Author .
    ?author research:hasName ?name .
    ?author research:affiliation ?affiliation .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/author/",
                    REPLACE(?name, " ", "_"))) AS ?author)
}
"""

kg += transform(authors_df, construct_authors)
print(f"After adding authors: {len(kg)} triples")
# Output: After adding authors: 21 triples

# Step 3: Add citations
construct_citations = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?citingPaperIRI research:cites ?citedPaperIRI .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?citing_paper, " ", "_"))) AS ?citingPaperIRI)
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?cited_paper, " ", "_"))) AS ?citedPaperIRI)
}
"""

kg += transform(citations_df, construct_citations)
print(f"After adding citations: {len(kg)} triples")
# Output: After adding citations: 24 triples

# Step 4: Add concepts
construct_concepts = """
PREFIX research: <http://example.org/research#>
CONSTRUCT {
    ?paperIRI research:discusses ?conceptIRI .
}
WHERE {
    BIND(IRI(CONCAT("http://data.example.org/paper/",
                    REPLACE(?paper, " ", "_"))) AS ?paperIRI)
    BIND(IRI(CONCAT("http://data.example.org/concept/",
                    ?concept)) AS ?conceptIRI)
}
"""

kg += transform(concepts_df, construct_concepts)
print(f"Final knowledge graph: {len(kg)} triples")
# Output: Final knowledge graph: 28 triples

Step 6: Query the Knowledge Graph

Now we can query the constructed graph:

# Query 1: Find all NLP papers
query_nlp_papers = """
PREFIX research: <http://example.org/research#>
SELECT ?title ?year
WHERE {
    ?paper a research:Paper .
    ?paper research:hasTitle ?title .
    ?paper research:publishedYear ?year .
    ?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
}
ORDER BY ?year
"""

results = kg.query(query_nlp_papers)
for row in results:
    print(f"{row.title} ({row.year})")
# Output:
# Attention Is All You Need (2017)
# BERT (2018)

# Query 2: Find papers citing "Attention Is All You Need"
query_citations = """
PREFIX research: <http://example.org/research#>
SELECT ?citing_title
WHERE {
    ?citing research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> .
    ?citing research:hasTitle ?citing_title .
}
"""

results = kg.query(query_citations)
for row in results:
    print(f"Paper citing Attention: {row.citing_title}")
# Output: Paper citing Attention: BERT

# Query 3: Find all concepts discussed in NLP papers
query_concepts = """
PREFIX research: <http://example.org/research#>
SELECT ?concept
WHERE {
    ?paper research:belongsToDomain <http://data.example.org/domain/NLP> .
    ?paper research:discusses ?conceptIRI .
    BIND(REPLACE(STR(?conceptIRI), ".*/", "") AS ?concept)
}
"""

results = kg.query(query_concepts)
concepts = [row.concept for row in results]
print(f"NLP concepts: {', '.join(concepts)}")
# Output: NLP concepts: self-attention, transformers, bidirectional

Step 7: Visualize the Knowledge Graph

import networkx as nx
import matplotlib.pyplot as plt

def rdf_to_nx(rdf_graph: Graph) -> nx.DiGraph:
    """Convert RDF graph to NetworkX directed graph."""
    G = nx.DiGraph()

    for s, p, o in rdf_graph:
        # Extract local names (remove URI prefixes)
        subject = str(s).split('/')[-1]
        predicate = str(p).split('#')[-1]
        obj = str(o).split('/')[-1] if isinstance(o, URIRef) else str(o)

        # Add nodes and edges
        G.add_edge(subject, obj, label=predicate)

    return G

# Convert to NetworkX
G = rdf_to_nx(kg)

# Visualize
plt.figure(figsize=(15, 10))
pos = nx.spring_layout(G, seed=42)

# Draw nodes
nx.draw_networkx_nodes(G, pos, node_size=1000, node_color='lightblue')

# Draw edges
nx.draw_networkx_edges(G, pos, edge_color='gray', arrows=True,
                        arrowsize=20, connectionstyle='arc3,rad=0.1')

# Draw labels
nx.draw_networkx_labels(G, pos, font_size=8)

# Draw edge labels
edge_labels = nx.get_edge_attributes(G, 'label')
nx.draw_networkx_edge_labels(G, pos, edge_labels, font_size=6)

plt.title("Research Paper Knowledge Graph")
plt.axis('off')
plt.tight_layout()
plt.show()

Step 8: Save the Knowledge Graph

# Save to Turtle file (human-readable RDF format)
kg.serialize(destination='research_papers.ttl', format='turtle')
print("Knowledge graph saved to research_papers.ttl")

# The file content looks like:
# @prefix research: <http://example.org/research#> .
#
# <http://data.example.org/paper/Attention_Is_All_You_Need>
#     a research:Paper ;
#     research:hasTitle "Attention Is All You Need" ;
#     research:publishedYear 2017 ;
#     research:belongsToDomain <http://data.example.org/domain/NLP> ;
#     research:discusses <http://data.example.org/concept/self-attention>,
#                        <http://data.example.org/concept/transformers> .
#
# <http://data.example.org/paper/BERT>
#     a research:Paper ;
#     research:hasTitle "BERT" ;
#     research:publishedYear 2018 ;
#     research:cites <http://data.example.org/paper/Attention_Is_All_You_Need> ;
#     research:discusses <http://data.example.org/concept/bidirectional> .

Key Takeaways

Declarative Approach Benefits:

Separation of concerns: Data (CSV) separate from logic (SPARQL)
Reusable queries: Same query works for any CSV with same schema
Incremental building: Add entities and relationships step-by-step
Easy to validate: Test queries on single rows first
Standard-based: SPARQL is W3C standard

Process Summary:

When to Use This Approach:

✅ Have structured data (CSV, databases)
✅ Need standard-compliant knowledge graphs
✅ Want to query with SPARQL
✅ Require formal schema/ontology
✅ Building production knowledge graphs

ResearcherAI Uses This For:

arXiv paper metadata → Knowledge graph
Citation networks from Semantic Scholar
Author collaboration graphs
Concept hierarchies from papers

Declarative vs Imperative

Declarative (SPARQL CONSTRUCT): "What you want" - Define the shape of output graph Imperative (Python loops): "How to do it" - Step-by-step instructions

SPARQL CONSTRUCT is declarative - you describe the desired graph structure, and the engine figures out how to create it. This is more maintainable and less error-prone than imperative loops.

Production Decision: Neo4j vs Apache Jena Fuseki

Critical Understanding: Neo4j and Apache Jena Fuseki are NOT interchangeable alternatives. They serve fundamentally different use cases!

The Key Question: Which data model fits your use case?

When to Use Neo4j in Production

Use Neo4j when you need:

High-Performance Graph Traversal

// Find shortest path between papers (fast!)
MATCH path = shortestPath(
  (a:Paper {id: "paper1"})-[:CITES*]-(b:Paper {id: "paper50"})
)
RETURN path

Neo4j is optimized for this (index-free adjacency)
Jena/RDF is much slower for deep graph traversal

Graph Algorithms
- PageRank, Louvain community detection
- Shortest paths, centrality measures
- Neo4j Graph Data Science library
- RDF/Jena: No built-in graph algorithms
Real-Time Recommendations
- Friend recommendations (social networks)
- Paper recommendations based on citations
- Collaborative filtering
- Performance critical - Neo4j is faster

Intuitive Queries

// Cypher is very readable
MATCH (p:Paper)-[:CITES]->(cited:Paper)
WHERE p.year > 2020
RETURN cited.title, count(*) as citations
ORDER BY citations DESC

Easier for developers to learn than SPARQL
Better tooling (Neo4j Browser, Bloom)

Flexible Schema
- Add new node labels and edge types dynamically
- Schema evolves with your application
- No formal ontology needed

Example: ResearcherAI uses Neo4j for:

Citation network traversal
Finding related papers (graph algorithms)
Author collaboration networks
Fast query performance

When to Use Apache Jena Fuseki in Production

Use Apache Jena Fuseki when you need:

Formal Ontologies

# Define strict schema
:ResearchPaper rdfs:subClassOf :Publication .
:ConferencePaper rdfs:subClassOf :ResearchPaper .
:JournalPaper rdfs:subClassOf :ResearchPaper .

W3C standard ontologies (OWL)
Strict domain modeling
Neo4j: No formal ontology support

Reasoning and Inference

# Automatic inference: If A cites B and B cites C, find transitive citations
SELECT ?paper ?influenced
WHERE {
  :paper1 :cites+ ?influenced .  # Transitive closure
}

OWL reasoners infer new facts
Automatic classification
Neo4j: No built-in reasoning

W3C Standards Compliance
- Publishing Linked Open Data (LOD)
- Interoperability with DBpedia, Wikidata
- RDF, SPARQL, OWL standards
- Neo4j: Proprietary (Cypher is not W3C standard)
Data Integration
- Merging data from multiple RDF sources
- Schema mapping and alignment
- Federated SPARQL queries across endpoints
- Neo4j: Harder to integrate with external sources
Scientific/Medical Domains
- Existing domain ontologies (Gene Ontology, SNOMED CT)
- Formal knowledge representation
- Regulatory compliance requirements
- Neo4j: Not suitable for formal ontologies

Example: Use Jena Fuseki for:

Medical knowledge graphs (SNOMED, ICD-10)
Scientific literature with formal taxonomies
Publishing linked open data
Integration with Wikidata/DBpedia

Production Performance Comparison

Operation	Neo4j	Apache Jena Fuseki	Winner
Graph Traversal (5 hops)	~10ms	~500ms+	Neo4j (50x faster)
Complex Cypher/SPARQL	Fast	Moderate	Neo4j
Write Throughput	10k-100k/sec	1k-10k/sec	Neo4j
OWL Reasoning	Not supported	Supported	Jena
Inference	Manual (application)	Automatic (reasoner)	Jena
Standards Compliance	Proprietary	W3C standards	Jena
Graph Algorithms	Built-in (GDS)	Not supported	Neo4j
Storage Efficiency	Good	Moderate (triples overhead)	Neo4j

When to Use BOTH (Hybrid Architecture)

You can use both together for the best of both worlds:

class HybridProductionKnowledgeGraph:
    """Use Neo4j for performance, Jena for reasoning"""

    def __init__(self):
        # Neo4j for fast graph operations
        self.neo4j = Neo4jKnowledgeGraph(
            uri="bolt://neo4j-prod.example.com:7687"
        )

        # Jena for ontology and reasoning
        self.jena = JenaKnowledgeGraph(
            endpoint_url="http://fuseki-prod.example.com:3030/research/sparql"
        )

    def add_paper(self, paper_data: dict):
        """Add to both databases"""
        # Add to Neo4j for fast queries
        self.neo4j.add_paper(
            paper_data["id"],
            paper_data["title"],
            paper_data["year"]
        )

        # Add to Jena for reasoning
        self.jena.add_paper(
            paper_data["id"],
            paper_data["title"],
            paper_data["year"]
        )

    def find_related_papers(self, paper_id: str):
        """Use Neo4j for fast graph traversal"""
        return self.neo4j.find_related_papers(paper_id)

    def classify_paper_topic(self, paper_id: str):
        """Use Jena reasoner for automatic classification"""
        sparql = f"""
        PREFIX research: <http://researcherai.org/ontology#>

        SELECT ?topic
        WHERE {{
            <http://researcherai.org/papers/{paper_id}>
                research:hasInferredTopic ?topic .
        }}
        """
        return self.jena.query(sparql)

    def sync_databases(self):
        """Periodically sync data between Neo4j and Jena"""
        # Export from Neo4j, import to Jena
        # Or vice versa
        pass

Use Hybrid When:

Need both fast traversal AND formal reasoning
Want graph algorithms + automatic classification
Building enterprise knowledge graph with complex requirements
Have resources to maintain two databases

Trade-offs:

❌ More complex architecture
❌ Data synchronization overhead
❌ Higher infrastructure costs
✅ Best of both worlds

ResearcherAI's Production Choice: Neo4j

Why ResearcherAI uses Neo4j (not Jena) in production:

# ResearcherAI's production config
PRODUCTION_CONFIG = {
    "knowledge_graph": "Neo4j",  # Not Apache Jena
    # Why Neo4j? See reasons below
}

Reasons:

Primary Use Case: Citation Network Traversal
- Finding related papers by citation paths
- Author collaboration networks
- Paper recommendation based on graph structure
- Neo4j excels at this, Jena is slow
Performance Requirements
- Real-time query responses (under 100ms)
- High read throughput (1000s queries/sec)
- Neo4j is 10-50x faster for graph queries
No Formal Ontology Needed
- Research paper schema is relatively simple
- Don't need OWL reasoning
- Don't need automatic inference
- Jena's strengths aren't needed
Developer Experience
- Cypher is easier to learn than SPARQL
- Better visualization tools (Neo4j Browser)
- Larger community and resources
- Faster development
Graph Algorithms
- Use PageRank to find influential papers
- Community detection for research clusters
- Shortest paths for paper relationships
- Neo4j Graph Data Science library

When ResearcherAI WOULD use Jena instead:

If the requirements were:

❌ Need formal research ontologies (ACM Computing Classification)
❌ Must publish linked open data
❌ Need automatic paper classification via reasoning
❌ Integrate with existing RDF sources (DBpedia)
❌ W3C standards compliance required

Then use Apache Jena Fuseki in production.

Quick Decision Guide

Choose Neo4j if:

✅ Primary use case: graph traversal, recommendations
✅ Need high performance (real-time queries)
✅ Want graph algorithms (PageRank, community detection)
✅ Flexible schema, rapid development
✅ Team familiar with SQL-like queries (Cypher)

Choose Apache Jena Fuseki if:

✅ Need formal ontologies (OWL)
✅ Require reasoning and inference
✅ Publishing linked open data
✅ W3C standards compliance
✅ Integrating with existing RDF sources

Choose Both (Hybrid) if:

✅ Need graph performance AND reasoning
✅ Have resources for complex architecture
✅ Enterprise requirements

For Most Projects: Start with Neo4j. Only add Jena if you truly need formal ontologies or reasoning.

Real-World Production Examples

Companies Using Neo4j:

LinkedIn - Professional network graph
eBay - Product recommendations
Airbnb - Location-based search
Walmart - Supply chain optimization
NASA - Lessons learned database

Companies/Projects Using Apache Jena:

BBC - Linked data for content
Wikidata - Knowledge base
Getty Vocabularies - Art metadata
UK Government - Open data publishing
PubMed - Biomedical ontologies

Notice the Pattern:

Neo4j: Performance-critical, real-time applications
Jena: Formal knowledge, standards, publishing

Production Database Selection

Default choice for most projects: Neo4j (performance, ease of use)
Choose Jena if: You need formal ontologies, reasoning, or standards compliance
ResearcherAI uses Neo4j: Because citation networks need fast traversal, not formal reasoning
Start simple: Begin with one database, add the other only if truly needed

Don't Overkill

Many projects think they need formal ontologies and reasoning, but actually just need a fast graph database. Start with Neo4j. Only add Jena if you have a clear use case for OWL reasoning or standards compliance.

Part 3: Hybrid RAG - Best of Both Worlds

Neither vector search nor knowledge graphs alone are sufficient:

Vector Search Alone:

✅ Finds semantically similar content
❌ Can't answer relationship questions
❌ No structured reasoning
❌ Can't follow citations, collaborations

Knowledge Graph Alone:

✅ Excellent at relationships
✅ Multi-hop reasoning
❌ Requires exact entity matches
❌ Can't do semantic similarity

Solution: Hybrid RAG - Combine both!

Hybrid RAG Architecture

Implementing Hybrid RAG

from typing import List, Dict, Tuple
from enum import Enum

class QueryType(Enum):
    SEMANTIC = "semantic"      # "Papers about attention mechanisms"
    RELATIONAL = "relational"  # "Papers citing X"
    HYBRID = "hybrid"          # "Papers about Y citing X"

class HybridRAG:
    """Hybrid RAG combining vector search and knowledge graphs"""

    def __init__(
        self,
        vector_store: QdrantVectorStore,
        knowledge_graph: Neo4jKnowledgeGraph,
        embedding_model: SentenceTransformer
    ):
        self.vector_store = vector_store
        self.knowledge_graph = knowledge_graph
        self.embedding_model = embedding_model

    def classify_query(self, query: str) -> QueryType:
        """Determine query type"""
        # Keywords indicating relational queries
        relational_keywords = [
            "cite", "cites", "citing", "cited",
            "author", "authored", "written by",
            "collaborate", "coauthor",
            "published in", "appeared in"
        ]

        # Keywords indicating semantic queries
        semantic_keywords = [
            "about", "discuss", "related to",
            "similar to", "like", "regarding"
        ]

        query_lower = query.lower()

        has_relational = any(kw in query_lower for kw in relational_keywords)
        has_semantic = any(kw in query_lower for kw in semantic_keywords)

        if has_relational and has_semantic:
            return QueryType.HYBRID
        elif has_relational:
            return QueryType.RELATIONAL
        else:
            return QueryType.SEMANTIC

    def semantic_search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Pure vector search"""
        query_embedding = self.embedding_model.encode(query)
        results = self.vector_store.search(query_embedding, top_k)

        return [
            {
                "text": text,
                "score": score,
                "source": "vector"
            }
            for text, score in results
        ]

    def relational_search(self, query: str) -> List[Dict]:
        """Pure graph search"""
        # Parse query for entities and relationships
        # Simplified - in production, use NER + intent detection

        if "citing" in query.lower():
            # Extract paper being cited
            # Simplified extraction
            cited_paper = self._extract_paper_mention(query)
            citing_papers = self.knowledge_graph.find_citing_papers(cited_paper)

            return [
                {
                    "text": paper["title"],
                    "year": paper["year"],
                    "source": "graph"
                }
                for paper in citing_papers
            ]

        elif "author" in query.lower():
            author_name = self._extract_author_name(query)
            papers = self.knowledge_graph.find_papers_by_author(author_name)

            return [
                {
                    "text": paper["title"],
                    "year": paper["year"],
                    "source": "graph"
                }
                for paper in papers
            ]

        return []

    def hybrid_search(
        self,
        query: str,
        top_k: int = 10
    ) -> List[Dict]:
        """Combined vector + graph search"""

        # Step 1: Vector search for semantic similarity
        semantic_results = self.semantic_search(query, top_k=top_k)

        # Step 2: Graph search for relationships
        relational_results = self.relational_search(query)

        # Step 3: Merge and deduplicate
        all_results = semantic_results + relational_results
        seen = set()
        unique_results = []

        for result in all_results:
            key = result["text"]
            if key not in seen:
                seen.add(key)
                unique_results.append(result)

        # Step 4: Re-rank using both semantic and structural scores
        reranked = self._rerank_results(unique_results, query)

        return reranked[:top_k]

    def _rerank_results(self, results: List[Dict], query: str) -> List[Dict]:
        """Re-rank results combining semantic + structural scores"""

        for result in results:
            # Semantic score from vector search
            semantic_score = result.get("score", 0.5)

            # Structural score from graph (e.g., citation count, centrality)
            structural_score = 0.5  # Simplified

            # Combined score (weighted average)
            result["final_score"] = 0.6 * semantic_score + 0.4 * structural_score

        # Sort by final score
        results.sort(key=lambda x: x.get("final_score", 0), reverse=True)

        return results

    def search(self, query: str, top_k: int = 5) -> List[Dict]:
        """Main search interface - automatically routes to appropriate method"""

        query_type = self.classify_query(query)

        if query_type == QueryType.SEMANTIC:
            return self.semantic_search(query, top_k)
        elif query_type == QueryType.RELATIONAL:
            return self.relational_search(query)
        else:  # HYBRID
            return self.hybrid_search(query, top_k)

# Usage
hybrid_rag = HybridRAG(
    vector_store=qdrant_store,
    knowledge_graph=neo4j_kg,
    embedding_model=model
)

# Different query types automatically routed
queries = [
    "Papers about attention mechanisms",                    # SEMANTIC
    "Papers citing 'Attention is All You Need'",          # RELATIONAL
    "Papers about transformers citing early NLP work"      # HYBRID
]

for query in queries:
    print(f"\nQuery: {query}")
    results = hybrid_rag.search(query, top_k=3)

    for i, result in enumerate(results, 1):
        print(f"{i}. {result['text']} (source: {result.get('source', 'hybrid')})")

Part 4: GraphRAG - Knowledge Graph Enhanced RAG

What Exactly is GraphRAG?

GraphRAG is NOT just "using a knowledge graph with RAG". It's a specific approach where the knowledge graph actively enhances the retrieval process by:

Expanding initial search results with graph-connected context
Enriching retrieved documents with relationship information
Providing multi-hop reasoning paths through the graph

Web Developer Analogy:

// Traditional RAG = Direct database query
const results = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
return results  // Just the matching articles

// GraphRAG = Query + JOIN on relationships
const initial = db.query("SELECT * FROM articles WHERE text MATCHES 'transformers'")
const expanded = initial.map(article => ({
  ...article,
  citations: db.query("SELECT * FROM articles WHERE id IN article.cited_papers"),
  relatedConcepts: db.query("SELECT * FROM concepts WHERE article_id = article.id"),
  authorExpertise: db.query("SELECT * FROM articles WHERE author_id = article.author_id")
}))
return expanded  // Original + graph-enriched context

The Problem GraphRAG Solves

Scenario: User asks "How do transformers handle long-range dependencies?"

Traditional RAG (just vector search):

# Returns: Top 3 papers about transformers
results = [
    "Attention is All You Need (Vaswani, 2017)",
    "BERT (Devlin, 2018)",
    "GPT-3 (Brown, 2020)"
]
# ❌ Missing: WHY transformers were invented (what came before)
# ❌ Missing: HOW they evolved (what improved them)
# ❌ Missing: WHAT problems remain (recent criticisms)

GraphRAG (vector search + graph expansion):

# Returns: Top 3 papers + graph context
results = {
    "initial_papers": [
        "Attention is All You Need (Vaswani, 2017)",
        "BERT (Devlin, 2018)",
        "GPT-3 (Brown, 2020)"
    ],
    "cited_papers": [  # WHAT CAME BEFORE (context)
        "Neural Machine Translation by Jointly Learning to Align (Bahdanau, 2014)",
        "Sequence to Sequence Learning (Sutskever, 2014)",
        "Long Short-Term Memory (Hochreiter, 1997)"  # The problem transformers solved!
    ],
    "citing_papers": [  # WHAT CAME AFTER (evolution)
        "Reformer: Efficient Transformer (Kitaev, 2020)",  # Addressed limitations
        "Linformer (Wang, 2020)",  # Improved efficiency
        "Performer (Choromanski, 2020)"  # Better for long sequences
    ],
    "related_concepts": [
        "Self-attention", "Positional encoding", "Multi-head attention"
    ]
}
# ✅ Has: Historical context (why transformers exist)
# ✅ Has: Evolution (how they improved)
# ✅ Has: Current solutions (what's happening now)

Result: LLM can now give a complete historical narrative, not just describe what transformers are!

Why Traditional RAG Isn't Enough

Problem 1: No Historical Context

# User question: "Why were transformers invented?"

# Traditional RAG retrieves:
papers = [
    "Attention is All You Need (2017): Transformers use self-attention..."
]
# ❌ Doesn't explain what problem RNNs/LSTMs had
# ❌ Doesn't show what transformers improved over

# GraphRAG retrieves + expands via citations:
graph_context = {
    "main_paper": "Attention is All You Need (2017): Transformers use self-attention...",
    "cited_papers": [
        "LSTM (1997): LSTMs struggle with sequences >100 tokens",
        "RNN vanishing gradients (1994): RNNs can't learn long dependencies"
    ]
}
# ✅ Now LLM can explain: "RNNs had vanishing gradients, LSTMs helped but
#    still struggled with long sequences, transformers solved this with self-attention"

Problem 2: Missing Evolution

# User question: "How have transformers been improved since 2017?"

# Traditional RAG:
papers = ["Attention is All You Need (2017)"]  # The original paper
# ❌ Doesn't show what came after

# GraphRAG (with citing papers):
papers = {
    "initial": "Attention is All You Need (2017)",
    "citing": [
        "BERT (2018): Bidirectional pretraining",
        "GPT-2 (2019): Larger scale",
        "T5 (2019): Text-to-text framework",
        "Reformer (2020): Efficient attention",
        "Switch Transformers (2021): Sparse models"
    ]
}
# ✅ Can now trace the evolution timeline

Problem 3: No Multi-Hop Reasoning

# User question: "What recent work improves on BERT's limitations?"

# Traditional RAG: Only finds papers mentioning "BERT limitations"
# ❌ Might miss papers that solve the problem without mentioning BERT

# GraphRAG:
# 1. Find BERT paper
# 2. Find papers citing BERT
# 3. Filter for papers discussing "limitations" or "improvements"
# 4. Find papers those papers cite (multi-hop)
# ✅ Discovers solutions even if they don't directly mention BERT

GraphRAG vs Traditional RAG vs Hybrid RAG

Approach	How It Works	Strengths	Weaknesses
Traditional RAG	Vector search → Top-K docs → LLM	Simple, fast	No context, no relationships
Hybrid RAG	Vector search + Graph search → Merge → LLM	Combines semantic + relational	Still retrieves documents independently
GraphRAG	Vector search → Graph expansion → Enhanced context → LLM	Rich context, multi-hop, relationships	More complex, slower

Key Difference:

Hybrid RAG: Uses graph for direct queries ("papers citing X")
GraphRAG: Uses graph to expand vector search results with connected context

When to Use GraphRAG

Use GraphRAG when:

Historical Context Matters
- Research paper Q&A (evolution of ideas)
- Patent analysis (prior art, citations)
- Scientific literature review
Relationships Are Important
- "How did this idea evolve?"
- "What influenced this paper?"
- "What improved on this approach?"
Multi-Hop Reasoning Needed
- "What recent work addresses limitations of X?"
- "Find papers in the intellectual lineage of X"
You Have a Knowledge Graph
- Already built citation network
- Already have entity relationships
- Graph is well-structured

DON'T use GraphRAG when:

Simple Keyword Matching - Traditional search is fine
No Graph Available - Building a graph is expensive
Real-Time Speed Critical - Graph traversal adds latency
Documents Are Independent - No meaningful relationships

GraphRAG: Two Approaches

There are two main approaches to GraphRAG:

Approach 1: Graph-Enhanced Retrieval (Shown Here)

How it works:

Use vector search to find initial relevant documents
Use knowledge graph to expand those documents with connected context
Feed enriched context to LLM

Pros:

✅ Simple to implement
✅ Works with existing knowledge graphs (Neo4j, etc.)
✅ Explainable (you can see the expansion)

Cons:

❌ Depends on quality of knowledge graph
❌ Expansion can be noisy
❌ Slower than pure vector search

Example: ResearcherAI uses this approach

Approach 2: Microsoft GraphRAG (Community Summaries)

How it works (different from approach 1!):

Build knowledge graph from documents
Detect communities in the graph (clusters of related entities)
Generate LLM summaries of each community
At query time, search community summaries
Retrieve relevant communities + their documents

Pros:

✅ Handles "global" questions ("What are the main themes?")
✅ Summarizes large corpora
✅ Finds patterns across documents

Cons:

❌ More complex (requires community detection + summarization)
❌ Higher upfront cost (LLM summarization of all communities)
❌ Less direct than traditional retrieval

Key Difference:

Approach 1 (Graph-Enhanced): Expands specific documents via graph
Approach 2 (Microsoft GraphRAG): Summarizes graph communities

Which to use?

Specific questions ("How do transformers work?") → Approach 1
Global questions ("What are the main AI trends?") → Approach 2
ResearcherAI: Uses Approach 1 (graph-enhanced retrieval)

GraphRAG Decision Tree

Quick Decision:

Have graph + need context → GraphRAG (Approach 1)
Need to summarize corpus → Microsoft GraphRAG (Approach 2)
Simple Q&A → Traditional RAG

GraphRAG vs Traditional RAG

Key Idea: Use the graph to expand initial search results with related context.

GraphRAG Implementation

class GraphRAG:
    """GraphRAG: Use knowledge graph to enhance retrieval"""

    def __init__(
        self,
        vector_store: QdrantVectorStore,
        knowledge_graph: Neo4jKnowledgeGraph,
        embedding_model: SentenceTransformer
    ):
        self.vector_store = vector_store
        self.knowledge_graph = knowledge_graph
        self.embedding_model = embedding_model

    def retrieve_and_expand(
        self,
        query: str,
        initial_k: int = 3,
        expansion_depth: int = 2
    ) -> Dict:
        """Retrieve documents and expand using graph"""

        # Step 1: Initial vector search
        query_embedding = self.embedding_model.encode(query)
        initial_results = self.vector_store.search(query_embedding, top_k=initial_k)

        # Step 2: Expand using knowledge graph
        expanded_context = {
            "initial_papers": [],
            "cited_papers": [],
            "citing_papers": [],
            "related_concepts": set(),
            "author_expertise": []
        }

        for text, score in initial_results:
            paper_id = self._extract_paper_id(text)

            expanded_context["initial_papers"].append({
                "id": paper_id,
                "text": text,
                "score": score
            })

            # Expand: Find papers this paper cites
            cited = self.knowledge_graph.find_papers_cited_by(paper_id)
            expanded_context["cited_papers"].extend(cited)

            # Expand: Find papers citing this paper
            citing = self.knowledge_graph.find_citing_papers(paper_id)
            expanded_context["citing_papers"].extend(citing)

            # Expand: Find related concepts
            concepts = self.knowledge_graph.find_related_concepts(paper_id)
            expanded_context["related_concepts"].update(concepts)

            # Expand: Find author expertise
            authors = self.knowledge_graph.find_paper_authors(paper_id)
            for author in authors:
                other_papers = self.knowledge_graph.find_papers_by_author(author)
                expanded_context["author_expertise"].append({
                    "author": author,
                    "other_work": other_papers[:3]  # Top 3
                })

        return expanded_context

    def generate_answer(self, query: str, context: Dict) -> str:
        """Generate answer using expanded context"""

        # Build rich context from graph expansion
        context_text = self._format_context(context)

        prompt = f"""Based on the following research papers and related context, answer the question.

Question: {query}

Initial Papers:
{context_text['initial']}

Cited Papers (background):
{context_text['cited']}

Citing Papers (follow-up work):
{context_text['citing']}

Related Concepts:
{', '.join(context['related_concepts'])}

Provide a comprehensive answer with citations."""

        # Use LLM to generate answer
        response = llm.generate(prompt)

        return response

    def _format_context(self, context: Dict) -> Dict[str, str]:
        """Format expanded context for prompt"""

        initial = "\n\n".join([
            f"[{i+1}] {paper['text']}"
            for i, paper in enumerate(context["initial_papers"])
        ])

        cited = "\n".join([
            f"- {paper['title']} ({paper['year']})"
            for paper in context["cited_papers"][:5]
        ])

        citing = "\n".join([
            f"- {paper['title']} ({paper['year']})"
            for paper in context["citing_papers"][:5]
        ])

        return {
            "initial": initial,
            "cited": cited,
            "citing": citing
        }

# Usage
graph_rag = GraphRAG(
    vector_store=qdrant_store,
    knowledge_graph=neo4j_kg,
    embedding_model=model
)

query = "How do transformers handle long-range dependencies?"

# Retrieve and expand
context = graph_rag.retrieve_and_expand(query, initial_k=3)

print("Initial papers:", len(context["initial_papers"]))
print("Cited papers:", len(context["cited_papers"]))
print("Citing papers:", len(context["citing_papers"]))
print("Concepts:", len(context["related_concepts"]))

# Generate answer with expanded context
answer = graph_rag.generate_answer(query, context)
print(f"\nAnswer: {answer}")

GraphRAG Benefits

1. Richer Context

# Traditional RAG: 3 papers
traditional_context = """
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)
"""

# GraphRAG: 3 papers + expansions
graphrag_context = """
Initial:
1. Attention is All You Need (2017)
2. BERT (2018)
3. GPT-3 (2020)

Background (cited by these):
- Neural Machine Translation (Bahdanau, 2014)
- Sequence to Sequence Learning (Sutskever, 2014)

Follow-up (citing these):
- T5 (2019)
- BART (2020)
- Switch Transformers (2021)

Related Concepts:
- Self-attention, Multi-head attention, Positional encoding
"""

2. Better Citation Paths

// Find "intellectual lineage" of an idea
MATCH path = (old:Paper {year: 2014})-[:CITES*1..5]->(new:Paper {year: 2024})
WHERE old.title CONTAINS "attention"
RETURN path

3. Contextual Understanding

# Understand how transformer attention differs from earlier attention
# By traversing citation graph from Bahdanau (2014) to Vaswani (2017)

Enterprise Use Case: Knowledge Graphs for AI Agent API Discovery

Now let's see a practical enterprise application showing why knowledge graphs matter for AI agents in complex business environments.

The Problem: Complex Enterprise API Landscapes

Imagine you have an AI agent in an enterprise environment, and a user makes this request:

"Create a purchase order for 5 pencils in purchasing group 002 and purchasing organization 3000"

Seems simple, right? But here's the reality:

Enterprise systems have thousands of different APIs
Which API creates purchase orders? (Could be 10+ candidates)
What parameters are required? (Purchasing group? Organization? Material codes?)
What's the correct sequence? (Auth → Validate → Create → Submit?)

Without context, the AI agent faces:

Trial and error API discovery (slow, expensive)
Missing domain knowledge (which API for which business process?)
No structure (can't understand API dependencies)
No explainability (can't trace what it did or why)

Solution: Build a knowledge graph of your enterprise APIs enriched with business process information!

Knowledge Graph for Enterprise APIs

Structure:

# Nodes (entities)
:PurchaseOrderAPI rdf:type :API .
:PurchasingProcess rdf:type :BusinessProcess .
:PurchasingGroup rdf:type :Parameter .

# Relationships
:PurchaseOrderAPI :belongsTo :PurchasingProcess .
:PurchaseOrderAPI :requires :PurchasingGroup .
:PurchaseOrderAPI :requires :PurchasingOrganization .
:PurchaseOrderAPI :requires :MaterialNumber .

# Metadata
:PurchaseOrderAPI :endpoint "/api/v1/purchase-orders" .
:PurchaseOrderAPI :method "POST" .
:PurchasingGroup :allowedValues ["001", "002", "003"] .

Visualization:

How Knowledge Graphs Solve Enterprise AI Agent Challenges

Challenge 1: Slow API Discovery

Without Knowledge Graph:

# Agent tries APIs randomly
attempts = [
    "Try: /api/orders/create → Wrong (this is for sales orders)",
    "Try: /api/procurement/new → Wrong (deprecated API)",
    "Try: /api/purchase-orders/post → Wrong (requires different params)",
    "Try: /api/v2/purchasing/create-po → Success! (after 4 attempts)"
]
# Result: 4 failed attempts, wasted tokens, slow response

With Knowledge Graph:

# Agent queries knowledge graph
query = """
SELECT ?api WHERE {
    ?process :name "Purchasing" .
    ?api :belongsTo ?process .
    ?api :action "CreatePurchaseOrder" .
}
"""
result = ["POST /api/v2/purchasing/create-po"]  # Direct match!
# Result: 1 attempt, instant, efficient

Benefit: 90% reduction in API discovery time

Challenge 2: Missing Domain Context

Without Knowledge Graph:

# Agent doesn't know business logic
user_request = "Create PO for pencils in group 002"

# Agent tries:
api_call = {
    "endpoint": "/api/purchase-orders",
    "params": {
        "item": "pencils",  # ❌ Wrong: needs material number
        "group": "002"  # ✅ Correct
    }
}
# Result: API error "Missing material_number parameter"

With Knowledge Graph:

# Agent queries graph for required workflow
workflow = knowledge_graph.query("""
SELECT ?step ?api ?required_param WHERE {
    ?process :name "CreatePurchaseOrder" .
    ?process :hasStep ?step .
    ?step :callsAPI ?api .
    ?api :requiresParameter ?required_param .
}
ORDER BY ?step
""")

# Result:
# Step 1: Material Lookup API (param: material_name → returns: material_number)
# Step 2: Purchase Order API (params: material_number, purchasing_group, purchasing_org)

# Agent executes correctly:
material_number = call_api("MaterialLookup", {"name": "pencils"})  # M12345
po = call_api("PurchaseOrder", {
    "material": material_number,
    "group": "002",
    "org": "3000"
})
# Result: ✅ Success in correct sequence

Benefit: Automatic workflow understanding from graph structure

Challenge 3: Complex API Dependencies

Without Knowledge Graph:

# Agent doesn't know allowed values
params = {
    "purchasing_group": "999"  # ❌ Invalid value
}
# Result: API rejects with cryptic error

With Knowledge Graph:

# Graph contains allowed values
allowed = knowledge_graph.query("""
SELECT ?value WHERE {
    :PurchasingGroup :allowedValues ?value .
}
""")
# Result: ["001", "002", "003"]

# Agent validates BEFORE calling API
if "999" not in allowed:
    # Ask user or pick valid value
    pass

Benefit: Validation before execution, reducing errors

Challenge 4: No Explainability

Without Knowledge Graph:

User: "Why did you use that API?"
Agent: "Based on my training, I determined..." (vague)

With Knowledge Graph:

# Agent can trace reasoning through graph
explanation = {
    "question": "Create purchase order for pencils",
    "reasoning_path": [
        "1. Identified business process: Purchasing",
        "2. Found process step: MaterialLookup (required for material_number)",
        "3. Called MaterialLookup API with 'pencils' → returned M12345",
        "4. Found process step: CreatePurchaseOrder",
        "5. Validated parameters against schema:",
        "   - material_number: M12345 (from step 3)",
        "   - purchasing_group: 002 (from user, validated against allowed values)",
        "   - purchasing_org: 3000 (from user)",
        "6. Called PurchaseOrder API with validated params",
        "7. Result: PO-2024-001 created successfully"
    ],
    "source_apis": ["/api/materials/lookup", "/api/purchase-orders/create"],
    "graph_nodes_traversed": ["PurchasingProcess", "MaterialLookupAPI", "PurchaseOrderAPI"]
}

Benefit: Complete transparency and auditability

Implementation Example

class EnterpriseAPIAgent:
    """AI Agent with Knowledge Graph for API discovery"""

    def __init__(self, kg: Neo4jKnowledgeGraph, llm: LLM):
        self.kg = kg
        self.llm = llm

    def execute_request(self, user_request: str):
        """Execute user request using knowledge graph"""

        # Step 1: Understand intent
        intent = self.llm.extract_intent(user_request)
        # {"action": "CreatePurchaseOrder", "params": {"item": "pencils", "group": "002"}}

        # Step 2: Query knowledge graph for workflow
        workflow = self.kg.query_cypher(f"""
        MATCH (process:BusinessProcess {{name: 'Purchasing'}})
              -[:HAS_STEP]->(step:ProcessStep)
              -[:CALLS_API]->(api:API)
        WHERE step.action = '{intent["action"]}'
        RETURN step.order as order, api.endpoint as endpoint,
               api.required_params as params
        ORDER BY order
        """)

        # Step 3: Execute workflow steps
        context = {}
        for step in workflow:
            # Validate and enrich parameters
            params = self._prepare_params(step["params"], intent["params"], context)

            # Call API
            response = self._call_api(step["endpoint"], params)

            # Store result for next step
            context[step["order"]] = response

        return context

    def _prepare_params(self, required, user_provided, context):
        """Prepare API parameters using knowledge graph validation"""

        params = {}
        for param_name in required:
            # Check if user provided
            if param_name in user_provided:
                # Validate against knowledge graph
                allowed = self.kg.get_allowed_values(param_name)
                if allowed and user_provided[param_name] not in allowed:
                    raise ValueError(f"{param_name} must be one of {allowed}")
                params[param_name] = user_provided[param_name]

            # Check if available from previous steps
            elif param_name in context:
                params[param_name] = context[param_name]

            else:
                raise ValueError(f"Missing required parameter: {param_name}")

        return params

Benefits Summary

Challenge	Without KG	With KG	Improvement
API Discovery	Trial & error (10+ attempts)	Direct lookup (1 attempt)	90% faster
Context Understanding	Missing domain logic	Business process aware	Correct workflows
Parameter Validation	Runtime errors	Pre-validated	Fewer failures
Explainability	Black box	Full trace	Audit-ready
Maintenance	Update LLM training	Update graph	Easy updates

When to Use Knowledge Graphs for AI Agents

Use KG-powered AI agents when:

✅ Complex API landscapes (100s-1000s of APIs)
✅ Domain-specific business logic required
✅ Auditability and explainability critical
✅ APIs change frequently (easier to update graph than retrain)
✅ Multi-step workflows common

Traditional RAG/LLM when:

Simple, stable API sets (10-20 APIs)
No strict compliance requirements
Flexibility more important than precision

Enterprise Knowledge Graphs

Knowledge graphs transform AI agents from "smart guessers" to "informed executors" by providing:

Structure: API relationships and dependencies
Context: Business process integration
Validation: Allowed values and constraints
Explainability: Traceable reasoning paths

Part 5: Structured vs Unstructured Data

Real research papers contain both:

Unstructured:

Abstract (free text)
Full paper text
Author descriptions

Structured:

Title, authors, year, venue
Citation counts
Keywords, categories
Figures, tables (semi-structured)

Handling Both with GraphRAG

Complete Example: ResearcherAI Data Pipeline

class ResearchDataPipeline:
    """Complete pipeline for handling structured + unstructured data"""

    def __init__(
        self,
        vector_store: QdrantVectorStore,
        knowledge_graph: Neo4jKnowledgeGraph,
        embedding_model: SentenceTransformer
    ):
        self.vector_store = vector_store
        self.knowledge_graph = knowledge_graph
        self.embedding_model = embedding_model

    def process_paper(self, paper: Dict):
        """Process single paper with both structured and unstructured data"""

        # Step 1: Extract structured data
        paper_id = paper["id"]
        title = paper["title"]
        authors = paper["authors"]
        year = paper["year"]
        citations = paper.get("citations", [])
        keywords = paper.get("keywords", [])

        # Step 2: Extract unstructured data
        abstract = paper["abstract"]
        full_text = paper.get("full_text", "")

        # Step 3: Add to knowledge graph (structured)
        self.knowledge_graph.add_paper(paper_id, title, year, abstract)

        for author in authors:
            author_id = self._get_author_id(author)
            self.knowledge_graph.add_author(author_id, author)
            self.knowledge_graph.add_authored(author_id, paper_id)

        for cited_paper_id in citations:
            self.knowledge_graph.add_citation(paper_id, cited_paper_id)

        for keyword in keywords:
            concept_id = self._get_concept_id(keyword)
            self.knowledge_graph.add_concept(concept_id, keyword)
            self.knowledge_graph.add_discusses(paper_id, concept_id)

        # Step 4: Add to vector store (unstructured)
        # Combine title + abstract for better semantic search
        text_for_embedding = f"{title}. {abstract}"
        embedding = self.embedding_model.encode(text_for_embedding)

        self.vector_store.add_documents(
            texts=[text_for_embedding],
            embeddings=np.array([embedding]),
            metadata=[{
                "paper_id": paper_id,
                "title": title,
                "year": year,
                "authors": authors,
                "citation_count": len(citations)
            }]
        )

        print(f"✓ Processed: {title}")

    def query(self, question: str, mode: str = "hybrid") -> Dict:
        """Query with automatic routing"""

        if mode == "semantic":
            # Pure vector search
            query_emb = self.embedding_model.encode(question)
            results = self.vector_store.search(query_emb, top_k=5)

            return {
                "results": results,
                "mode": "semantic"
            }

        elif mode == "structured":
            # Pure graph query
            # Extract query intent and route to appropriate graph query
            results = self._graph_query(question)

            return {
                "results": results,
                "mode": "structured"
            }

        else:  # hybrid or graphrag
            # GraphRAG: Combine both
            query_emb = self.embedding_model.encode(question)
            initial_results = self.vector_store.search(query_emb, top_k=3)

            # Expand using graph
            expanded = []
            for text, score in initial_results:
                paper_id = self._extract_paper_id_from_text(text)

                # Get structured context from graph
                graph_context = {
                    "citations": self.knowledge_graph.find_citing_papers(paper_id),
                    "concepts": self.knowledge_graph.find_related_concepts(paper_id),
                    "authors": self.knowledge_graph.find_paper_authors(paper_id)
                }

                expanded.append({
                    "paper": text,
                    "score": score,
                    "graph_context": graph_context
                })

            return {
                "results": expanded,
                "mode": "graphrag"
            }

# Complete workflow
pipeline = ResearchDataPipeline(
    vector_store=qdrant_store,
    knowledge_graph=neo4j_kg,
    embedding_model=model
)

# Process papers
papers = [
    {
        "id": "paper_1",
        "title": "Attention is All You Need",
        "authors": ["Ashish Vaswani", "Noam Shazeer"],
        "year": 2017,
        "abstract": "We propose the Transformer, a model architecture...",
        "keywords": ["transformer", "attention", "sequence-to-sequence"],
        "citations": []
    },
    {
        "id": "paper_2",
        "title": "BERT: Pre-training Transformers",
        "authors": ["Jacob Devlin", "Ming-Wei Chang"],
        "year": 2018,
        "abstract": "We introduce BERT, a bidirectional transformer...",
        "keywords": ["BERT", "pre-training", "transformers"],
        "citations": ["paper_1"]
    }
]

for paper in papers:
    pipeline.process_paper(paper)

# Query with different modes
print("\n=== SEMANTIC MODE ===")
results = pipeline.query("attention mechanisms in neural networks", mode="semantic")
print(results)

print("\n=== STRUCTURED MODE ===")
results = pipeline.query("papers citing Attention is All You Need", mode="structured")
print(results)

print("\n=== GRAPHRAG MODE ===")
results = pipeline.query("how do transformers work?", mode="hybrid")
print(results)

Summary and Decision Guide

Technology Comparison

Technology	Best For	Limitations
Vector DB	Semantic similarity, fuzzy matching	No relationships, no reasoning
Knowledge Graph	Relationships, structured queries	Requires exact entities, no fuzzy search
Hybrid RAG	Combining semantic + structured	More complex, two systems
GraphRAG	Rich context, citation analysis	Highest complexity, needs both systems

When to Use Each

Use Vector Search Alone:

Simple semantic search
No relationship queries
Quick prototypes
Example: "Find similar papers"

Use Knowledge Graph Alone:

Known entities
Relationship-heavy queries
Network analysis
Example: "Find collaboration networks"

Use Hybrid RAG:

Production RAG systems
Mix of semantic + structured queries
Need both similarity and relationships
Example: ResearcherAI

Use GraphRAG:

Research assistance (like ResearcherAI)
Need citation context
Complex multi-hop queries
Example: "Trace the evolution of an idea"

ResearcherAI's Approach

ResearcherAI uses GraphRAG with dual backends:

# Development Mode
dev_config = {
    "vector_store": "FAISS (in-memory)",
    "knowledge_graph": "NetworkX (in-memory)",
    "startup_time": "instant",
    "data_persistence": "manual save/load"
}

# Production Mode
prod_config = {
    "vector_store": "Qdrant (persistent)",
    "knowledge_graph": "Neo4j (persistent)",
    "startup_time": "~2 seconds",
    "data_persistence": "automatic"
}

# Abstraction layer allows switching
system = ResearcherAI(mode="development")  # or "production"

Key Takeaways

Vector databases enable semantic search - finding similar content without exact keyword matches
Knowledge graphs enable relationship reasoning - following citations, collaborations, concept evolution
Hybrid RAG combines both for richer retrieval
GraphRAG uses graphs to expand and enhance vector search results
Structured + Unstructured data both matter - use appropriate storage for each
Dev/Prod duality enables fast iteration with production fidelity

Next Steps

Now you understand the data layer. Next chapters cover:

Chapter 3.5 (Agent Foundations): How agents use these data stores
Chapter 4 (Orchestration Frameworks): LangGraph for agent coordination
Chapter 5 (Backend): Implementing the full stack

The data foundations you learned here power every query in ResearcherAI!

Practice Exercise

Build your own hybrid system:

Start with FAISS + NetworkX (development)
Index 20-30 papers with abstracts
Create author and citation relationships
Implement semantic search
Implement citation traversal
Combine both for hybrid queries
Deploy with Qdrant + Neo4j (production)

Introduction​

The Problem: Traditional Search Doesn't Work for Research​

Keyword Search Limitations​

What We Actually Need​

Part 1: Vector Databases and Semantic Search​

From Words to Vectors​

The Intuition​

How Embeddings Work​

Creating Embeddings​

Measuring Similarity​

Vector Databases: Scaling Semantic Search​

Development: FAISS (In-Memory)​

Production: Qdrant (Persistent, Scalable)​

Vector Search: Dev vs Prod Comparison​

Part 2: Knowledge Graphs and Structured Reasoning​

Knowledge Graphs in the Real World​

What Are Knowledge Graphs?​

Why Vector Search Isn't Enough​

From Tables to Graphs​

Knowledge Graph Basics​

Development: NetworkX (In-Memory)​

Production: Neo4j (Persistent, Cypher)​

Cypher Query Language​

Knowledge Graph: Dev vs Prod Comparison​

Semantic Web: Ontologies, RDF, and SPARQL​

What Are Ontologies?​

RDF: Resource Description Framework​

RDF Triple Visualization​

Development: RDFLib (Python)​

SPARQL: Query Language for RDF​

More SPARQL Examples​

Production: Apache Jena & SPARQL Endpoint​

OWL: Web Ontology Language - The Power of Reasoning​

Why OWL Matters: Automatic Inference​

OWL Ontology Definition​

Development: Owlready2 with Reasoning​

OWL Reasoning Examples​

Production: Apache Jena with OWL Reasoner​

OWL Profiles: Which to Use?​

When to Use OWL vs Just RDF​

OWL Reasoning: Dev vs Prod Comparison​

RDF vs Property Graphs: When to Use Each​

When to Use RDF:​

When to Use Property Graphs:​

ResearcherAI's Approach​

Hybrid Approach: RDF + Property Graphs​

RDF/SPARQL: Dev vs Prod Comparison​

Building Knowledge Graphs: Construction Methods​

Three Main Approaches​

Knowledge Graph Construction Process​

Comparison of Construction Methods​

Hands-On: Building a Research Paper Knowledge Graph​

Step 1: Input Data​

Step 2: Define the Knowledge Graph Schema​

Step 3: SPARQL CONSTRUCT Queries for Mapping​

Step 4: The Transform Function​

Step 5: Build the Knowledge Graph Incrementally​

Step 6: Query the Knowledge Graph​

Step 7: Visualize the Knowledge Graph​

Step 8: Save the Knowledge Graph​

Key Takeaways​

Production Decision: Neo4j vs Apache Jena Fuseki​

When to Use Neo4j in Production​

When to Use Apache Jena Fuseki in Production​

Production Performance Comparison​

When to Use BOTH (Hybrid Architecture)​

ResearcherAI's Production Choice: Neo4j​

Quick Decision Guide​

Real-World Production Examples​

Part 3: Hybrid RAG - Best of Both Worlds​

Hybrid RAG Architecture​

Implementing Hybrid RAG​

Part 4: GraphRAG - Knowledge Graph Enhanced RAG​

What Exactly is GraphRAG?​

The Problem GraphRAG Solves​

Why Traditional RAG Isn't Enough​

GraphRAG vs Traditional RAG vs Hybrid RAG​

When to Use GraphRAG​

GraphRAG: Two Approaches​

Approach 1: Graph-Enhanced Retrieval (Shown Here)​

Introduction

The Problem: Traditional Search Doesn't Work for Research

Keyword Search Limitations

What We Actually Need

Part 1: Vector Databases and Semantic Search

From Words to Vectors

The Intuition

How Embeddings Work

Creating Embeddings

Measuring Similarity

Vector Databases: Scaling Semantic Search

Development: FAISS (In-Memory)

Production: Qdrant (Persistent, Scalable)

Vector Search: Dev vs Prod Comparison

Part 2: Knowledge Graphs and Structured Reasoning

Knowledge Graphs in the Real World

What Are Knowledge Graphs?

Why Vector Search Isn't Enough

From Tables to Graphs

Knowledge Graph Basics

Development: NetworkX (In-Memory)

Production: Neo4j (Persistent, Cypher)

Cypher Query Language

Knowledge Graph: Dev vs Prod Comparison

Semantic Web: Ontologies, RDF, and SPARQL

What Are Ontologies?

RDF: Resource Description Framework

RDF Triple Visualization

Development: RDFLib (Python)

SPARQL: Query Language for RDF

More SPARQL Examples

Production: Apache Jena & SPARQL Endpoint

OWL: Web Ontology Language - The Power of Reasoning

Why OWL Matters: Automatic Inference

OWL Ontology Definition

Development: Owlready2 with Reasoning

OWL Reasoning Examples

Production: Apache Jena with OWL Reasoner

OWL Profiles: Which to Use?

When to Use OWL vs Just RDF

OWL Reasoning: Dev vs Prod Comparison

RDF vs Property Graphs: When to Use Each

When to Use RDF:

When to Use Property Graphs:

ResearcherAI's Approach

Hybrid Approach: RDF + Property Graphs

RDF/SPARQL: Dev vs Prod Comparison

Building Knowledge Graphs: Construction Methods

Three Main Approaches

Knowledge Graph Construction Process

Comparison of Construction Methods

Hands-On: Building a Research Paper Knowledge Graph

Step 1: Input Data

Step 2: Define the Knowledge Graph Schema

Step 3: SPARQL CONSTRUCT Queries for Mapping

Step 4: The Transform Function

Step 5: Build the Knowledge Graph Incrementally

Step 6: Query the Knowledge Graph

Step 7: Visualize the Knowledge Graph

Step 8: Save the Knowledge Graph

Key Takeaways

Production Decision: Neo4j vs Apache Jena Fuseki

When to Use Neo4j in Production

When to Use Apache Jena Fuseki in Production

Production Performance Comparison

When to Use BOTH (Hybrid Architecture)

ResearcherAI's Production Choice: Neo4j

Quick Decision Guide

Real-World Production Examples

Part 3: Hybrid RAG - Best of Both Worlds

Hybrid RAG Architecture

Implementing Hybrid RAG

Part 4: GraphRAG - Knowledge Graph Enhanced RAG

What Exactly is GraphRAG?

The Problem GraphRAG Solves

Why Traditional RAG Isn't Enough

GraphRAG vs Traditional RAG vs Hybrid RAG

When to Use GraphRAG

GraphRAG: Two Approaches

Approach 1: Graph-Enhanced Retrieval (Shown Here)