Six Degrees of Kevin Bacon

You have probably come across the term “ontology” in philosophy. In knowledge graphs, an ontology is a formal, machine‑readable vocabulary that defines a domain’s classes (things that exist), properties (how they’re related) and constraints (logical rules). By using Uniform Resource Identifiers (URIs) and logic‑friendly standards like RDF Schema or OWL, ontologies ensure that systems agree on what “Customer,” “Order” or “PaymentTerm” actually mean, without relying on separate data dictionaries.

A graph database is a storage engine optimized for this kind of highly connected data. Triple‑stores (e.g. GraphDB, Amazon Neptune‑RDF) persist facts as Subject‑Predicate‑Object triples and run SPARQL queries, often with built‑in reasoning. Property‑graph systems like Neo4j model nodes and relationships with arbitrary key/value properties and offer Cypher for analytics‑friendly traversals. With relationships as first‑class citizens, questions like “find all suppliers two hops from a recalled component” become short traversals that can execute in milliseconds, with no tangled SQL JOINs required.
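To make the “two hops” idea concrete, here is a minimal sketch in plain Python rather than in a real graph database. The triples, node names, and the `neighbours_within` helper are all made up for illustration; a graph engine would answer the same question with a traversal query.

```python
from collections import deque
from typing import Dict, List, Tuple

# Hypothetical facts stored as Subject-Predicate-Object triples
triples: List[Tuple[str, str, str]] = [
    ("component:C1", "SUPPLIED_BY", "supplier:Acme"),
    ("supplier:Acme", "SOURCES_FROM", "supplier:Globex"),
    ("supplier:Globex", "SOURCES_FROM", "supplier:Initech"),
    ("component:C2", "SUPPLIED_BY", "supplier:Umbrella"),
]


def neighbours_within(start: str, max_hops: int) -> Dict[str, int]:
    """Return every node reachable from `start` in at most `max_hops` hops."""
    # Build an adjacency list that follows each edge in its stored direction
    adjacency: Dict[str, List[str]] = {}
    for subject, _predicate, obj in triples:
        adjacency.setdefault(subject, []).append(obj)

    # Breadth-first search, recording the hop count of each node
    distances: Dict[str, int] = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distances[node] == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in adjacency.get(node, []):
            if nxt not in distances:
                distances[nxt] = distances[node] + 1
                queue.append(nxt)

    del distances[start]  # the start node is not its own neighbour
    return distances


suppliers = neighbours_within("component:C1", max_hops=2)
print(suppliers)
```

The third supplier in the chain is three hops away, so it is left out of the result.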

Compared to relational databases, graph stores trade rigid, table‑bound schemas for a flexible, schema‑optional model built around edges. You may give up some of the cross‑table ACID simplicity (especially at massive scale or across clusters), but you gain the ability to evolve your data model on the fly and uncover hidden patterns, which makes graphs a natural fit whenever relationships matter most.

In this experiment, we will explore a simple use case for a graph database using the MovieLens dataset, which contains information about movies, users, ratings, and tags.

We will use Neo4j to create a graph representation of the data and perform some interesting queries.
An embedding model will compute semantic embeddings for movie titles and tags, and we will use those embeddings to enhance our graph queries.
We will implement a simple recommendation system using collaborative filtering.
Finally, we will use a large language model (LLM) to generate explanations for recommendations and build a conversational agent that can answer movie-related queries using the graph database.

Note

“Collaborative filtering” is a technique used in recommendation systems to suggest items (like movies) based on the preferences of similar users. It relies on the idea that if two users have similar tastes, they are likely to enjoy similar items.
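As a rough illustration of the idea, here is a generic user-based sketch in plain Python. The ratings, the `similarity` and `recommend` helpers, and the scoring rule (similarity times rating, summed over other users) are assumptions made for this example, not the method used later in this experiment.

```python
import math
from typing import Dict, List, Tuple

# Hypothetical ratings: user -> {movie: rating}
ratings: Dict[str, Dict[str, float]] = {
    "alice": {"Toy Story": 5.0, "Jumanji": 4.0, "Heat": 1.0},
    "bob": {"Toy Story": 4.5, "Jumanji": 4.0, "Babe": 5.0},
    "carol": {"Heat": 5.0, "Casino": 4.5, "Toy Story": 1.0},
}


def similarity(a: Dict[str, float], b: Dict[str, float]) -> float:
    """Cosine similarity computed over the movies both users rated."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    norm_a = math.sqrt(sum(a[m] ** 2 for m in shared))
    norm_b = math.sqrt(sum(b[m] ** 2 for m in shared))
    return dot / (norm_a * norm_b)


def recommend(user: str, top_n: int = 3) -> List[Tuple[str, float]]:
    """Score unseen movies by the similarity-weighted ratings of other users."""
    seen = ratings[user]
    scores: Dict[str, float] = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = similarity(seen, other_ratings)
        for movie, rating in other_ratings.items():
            if movie not in seen:
                scores[movie] = scores.get(movie, 0.0) + sim * rating
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)[:top_n]


suggestions = recommend("alice")
for movie, score in suggestions:
    print(f"{movie}: {score:.2f}")
```

In this toy data, the second user's tastes are much closer to the first user's than the third user's are, so the movie only the second user rated comes out on top.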

Downloading the MovieLens dataset

We will start by downloading the small MovieLens dataset (ml-latest-small), which contains about 100,000 ratings. It is small enough to fit within the free-tier limits of Neo4j Aura, and it contains enough data to demonstrate the capabilities of graph databases.

!mkdir -p .data

!if [ ! -f .data/ml-latest-small.zip ]; then \
    echo "Downloading MovieLens…"; \
    curl -L -o .data/ml-latest-small.zip https://files.grouplens.org/datasets/movielens/ml-latest-small.zip; \
  else \
    echo ".data/ml-latest-small.zip already exists; skipping download."; \
  fi

!echo "Unzipping…"
!unzip -o .data/ml-latest-small.zip -d .data
!echo "Done."

.data/ml-latest-small.zip already exists; skipping download.
Unzipping…
Archive:  .data/ml-latest-small.zip
  inflating: .data/ml-latest-small/links.csv  
  inflating: .data/ml-latest-small/tags.csv  
  inflating: .data/ml-latest-small/ratings.csv  
  inflating: .data/ml-latest-small/README.txt  
  inflating: .data/ml-latest-small/movies.csv  
Done.

With the dataset downloaded, we can now proceed to set up our Neo4j database and import the data.

Setting up Neo4j

We will use Neo4j Aura, the cloud version of Neo4j, to host our graph database. You can sign up for a free account at Neo4j Aura. Once you have created an account, you can create a new database instance and obtain the connection details (URI, username, and password).

Make sure to set the environment variables NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD with the connection details of your Neo4j Aura instance.
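A missing variable otherwise only shows up as a confusing connection error, so it can help to check up front. The sketch below is one way to do that; the `missing_variables` helper and the example values are hypothetical.

```python
import os
from typing import Iterable, List, Mapping

REQUIRED_VARS = ("NEO4J_URI", "NEO4J_USERNAME", "NEO4J_PASSWORD")


def missing_variables(
    names: Iterable[str], env: Mapping[str, str] = os.environ
) -> List[str]:
    """Return the names that are unset or empty in the given environment."""
    return [name for name in names if not env.get(name)]


# Demonstration against a made-up environment instead of the real one
example_env = {
    "NEO4J_URI": "neo4j+s://example.databases.neo4j.io",
    "NEO4J_USERNAME": "neo4j",
}
missing = missing_variables(REQUIRED_VARS, example_env)
print(missing)
```

Calling `missing_variables(REQUIRED_VARS)` without the second argument checks the real process environment.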

import os
from neo4j import GraphDatabase, basic_auth, Driver, Session, Transaction, Record

URI = os.getenv("NEO4J_URI")
USER = os.getenv("NEO4J_USERNAME")
PASSWORD = os.getenv("NEO4J_PASSWORD")
AUTH = (USER, PASSWORD)

print(f"Connecting to Neo4j at {URI} with user {USER}")

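# Keep the driver open for the rest of the notebook: a `with` block would
# close it before the later cells use it. Call driver.close() when finished.
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()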


def test_aura_connection() -> None:
    with driver.session() as session:
        result = session.run("RETURN 'Hello, Aura!' AS message")
        record = result.single()
        print(record["message"])  # should print "Hello, Aura!"


test_aura_connection()

Connecting to Neo4j at neo4j+s://8c1ab3e4.databases.neo4j.io with user neo4j

Hello, Aura!

Computing embeddings

With the connection established, we can now proceed to import the dataset. First, however, let us define a helper class to compute embeddings for movie titles and tags using a Transformer model. For simplicity, we will use the all-MiniLM-L6-v2 model from the sentence-transformers library, a lightweight model suited to semantic similarity tasks, although it lags behind larger models in accuracy.

# Method to compute embeddings using a given Transformer model
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from typing import List


class EmbeddingModel:
    def __init__(
        self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    ) -> None:
        # Use CUDA if available
        if torch.cuda.is_available():
            print("Using CUDA for embeddings.")
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            print("Using MPS for embeddings.")
            self.device = torch.device("mps")
        else:
            print("Using CPU for embeddings.")
            self.device = torch.device("cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device)

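    def embed_batch(self, texts: List[str]) -> np.ndarray:
        inputs = self.tokenizer(
            texts, return_tensors="pt", truncation=True, padding=True, max_length=512
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mean-pool over real tokens only: shorter texts in a batch are padded,
        # and averaging over the padding positions would skew their embeddings
        mask = inputs["attention_mask"].unsqueeze(-1).float()
        summed = (outputs.last_hidden_state * mask).sum(dim=1)
        counts = mask.sum(dim=1).clamp(min=1e-9)
        return (summed / counts).cpu().numpy()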


embedding_model = EmbeddingModel()

Using CUDA for embeddings.
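The pooling step in embed_batch turns one vector per token into one vector per text by averaging. The plain-Python sketch below, with made-up two-dimensional token vectors and a hypothetical `mean_pool` helper, compares an average that skips padding positions with one that includes them.

```python
from typing import List


def mean_pool(token_vectors: List[List[float]], attention_mask: List[int]) -> List[float]:
    """Average the token vectors whose mask entry is 1, ignoring the rest."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    if not kept:
        raise ValueError("attention_mask selects no tokens")
    dims = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dims)]


# Two real tokens followed by one padding position (all values are made up)
tokens = [[1.0, 2.0], [3.0, 4.0], [0.0, 0.0]]

pooled = mean_pool(tokens, attention_mask=[1, 1, 0])  # padding skipped
naive = mean_pool(tokens, attention_mask=[1, 1, 1])   # padding averaged in

print(pooled)
print(naive)
```

The two results differ, which is why the same text can receive a different embedding depending on how much padding its batch requires when padding is not masked out.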

In addition, we need a few methods to handle the import of the dataset into Neo4j. We will create nodes for movies, users, and genres, and establish relationships between them (ratings and tags are stored as relationships from users to movies). We will define five methods: drop_schema to drop existing constraints and indexes, create_constraints to create the necessary constraints in the database, and import_movies_batched, import_ratings_batched, and import_tags_batched to import the movies, ratings, and tags data in batches (batching matters once you load any significant volume of data, because it avoids running one transaction per row).
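The batching pattern itself is independent of Neo4j. As a minimal sketch, the hypothetical `batched` generator below slices any iterable into lists of at most `batch_size` items using itertools.islice; each list would then be sent to the database in a single write.

```python
import itertools
from typing import Iterable, Iterator, List, TypeVar

T = TypeVar("T")


def batched(rows: Iterable[T], batch_size: int) -> Iterator[List[T]]:
    """Yield consecutive lists of at most `batch_size` items from `rows`."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(itertools.islice(iterator, batch_size))
        if not batch:
            return  # the source is exhausted
        yield batch


sizes = [len(batch) for batch in batched(range(2500), batch_size=1000)]
print(sizes)
```

The last batch is simply shorter than the others, so no special handling is needed for leftover rows.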

Once all data is loaded, our graph will be structured as follows:

graph TD
    subgraph Nodes
        direction LR
        U[User]
        M[Movie]
        G[Genre]
    end

    subgraph Relationships
        direction LR
        U -- RATED --> M
        U -- TAGGED --> M
        M -- IN_GENRE --> G
    end

    style U fill:#FFDAB9,stroke:#333,stroke-width:2px
    style M fill:#ADD8E6,stroke:#333,stroke-width:2px
    style G fill:#90EE90,stroke:#333,stroke-width:2px

    linkStyle 0 stroke:red,stroke-width:2px;
    linkStyle 1 stroke:blue,stroke-width:2px;
    linkStyle 2 stroke:green,stroke-width:2px;

Importing the dataset

Our Cypher query to import movies (and similarly, other entities) looks like this:

UNWIND $batch AS row
MERGE (m:Movie {movieId: toInteger(row.movieId)})
SET m.title  = row.title,
    m.imdbId = row.imdbId,
    m.tmdbId = row.tmdbId,
    m.embedding = row.embedding
WITH m, row
UNWIND split(row.genres, '|') AS genreName
    MERGE (g:Genre {name: genreName})
    MERGE (m)-[:IN_GENRE]->(g)

It takes a list of records supplied in $batch and processes them one at a time (UNWIND). For each record it either finds or creates a Movie node keyed by movieId, then updates that node with its title, IMDb and TMDb identifiers, plus a pre‑computed vector stored in embedding. Because it uses MERGE, you’ll never get duplicate movie nodes with the same ID.

After updating the movie, the query splits the pipe‑separated genre string into individual genre names, again ensuring each unique name has exactly one Genre node. It then creates (or confirms) an IN_GENRE relationship from the movie to each of its genres. The whole snippet is an idempotent “upsert”: running it twice on the same batch leaves the graph unchanged, while movies and genres stay normalised and wired together.
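The idempotency claim can be illustrated without a database. The sketch below mimics the MERGE and SET semantics with dictionaries and sets; the `upsert_movies` helper and the rows are made up to resemble movies.csv, and it is an analogy rather than how Neo4j implements MERGE.

```python
from typing import Dict, List, Set, Tuple

movies: Dict[int, Dict[str, str]] = {}   # stands in for Movie nodes keyed by movieId
genres: Set[str] = set()                 # stands in for Genre nodes keyed by name
in_genre: Set[Tuple[int, str]] = set()   # stands in for IN_GENRE relationships


def upsert_movies(batch: List[Dict[str, str]]) -> None:
    """Find-or-create each movie, overwrite its title, and link its genres."""
    for row in batch:
        movie_id = int(row["movieId"])
        movies.setdefault(movie_id, {})["title"] = row["title"]
        for genre_name in row["genres"].split("|"):
            genres.add(genre_name)
            in_genre.add((movie_id, genre_name))


# Illustrative rows shaped like movies.csv
batch = [
    {"movieId": "1", "title": "Toy Story (1995)", "genres": "Adventure|Animation|Children"},
    {"movieId": "2", "title": "Jumanji (1995)", "genres": "Adventure|Children|Fantasy"},
]

upsert_movies(batch)
first_pass = (len(movies), len(genres), len(in_genre))
upsert_movies(batch)  # replaying the same batch must not create duplicates
second_pass = (len(movies), len(genres), len(in_genre))

print(first_pass, second_pass)
```

Both passes report the same counts, which is the property that makes it safe to re-run an import after a partial failure.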

import csv
import itertools
import re
from typing import Dict, Any, Tuple, Optional


def drop_schema(tx: Transaction) -> None:
    # Drop constraints
    for record in tx.run("SHOW CONSTRAINTS"):
        name = record["name"]
        tx.run(f"DROP CONSTRAINT `{name}`")
    # Drop indexes
    for record in tx.run("SHOW INDEXES"):
        name = record["name"]
        tx.run(f"DROP INDEX `{name}`")


def create_constraints(tx: Transaction) -> None:
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (m:Movie)  REQUIRE m.movieId IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User)   REQUIRE u.userId IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre)  REQUIRE g.name    IS UNIQUE")


def _load_movie_batch(tx: Transaction, batch: List[Dict[str, Any]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
        SET m.title  = row.title,
            m.imdbId = row.imdbId,
            m.tmdbId = row.tmdbId,
            m.embedding = row.embedding
        WITH m, row
        UNWIND split(row.genres, '|') AS genreName
          MERGE (g:Genre {name: genreName})
          MERGE (m)-[:IN_GENRE]->(g)
        """,
        batch=batch,
    )


def import_movies_batched(
    session: Session, movies_f: str, links_f: str, batch_size: int = 1000
) -> None:
    # preload links into memory once
    links = {}
    with open(links_f, newline="", encoding="utf-8") as f:
        for r in csv.DictReader(f):
            links[r["movieId"]] = {"imdbId": r["imdbId"], "tmdbId": r["tmdbId"]}

    batch = []
    with open(movies_f, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            # Read a batch of rows from the CSV
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break

            # Extract titles and compute embeddings in a batch
            titles = [row["title"] for row in rows]
            # Titles have a year ("... (1994)" for example), remove it with a regexp for better embeddings
            titles = [re.sub(r"\s*\(\d{4}\)$", "", title) for title in titles]
            embeddings = embedding_model.embed_batch(titles)

            # Prepare batch for Neo4j import
            batch_to_load = []
            for i, row in enumerate(rows):
                lm = links.get(row["movieId"], {})
                batch_to_load.append(
                    {
                        "movieId": row["movieId"],
                        "title": row["title"],
                        "genres": row["genres"],
                        "imdbId": lm.get("imdbId"),
                        "tmdbId": lm.get("tmdbId"),
                        "embedding": embeddings[i],
                    }
                )

            session.execute_write(_load_movie_batch, batch_to_load)


def _load_ratings_batch(tx: Transaction, batch: List[Dict[str, str]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User  {userId: toInteger(row.userId)})
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
        MERGE (u)-[r:RATED]->(m)
        SET r.rating    = toFloat(row.rating),
            r.timestamp = toInteger(row.timestamp)
        """,
        batch=batch,
    )


def import_ratings_batched(
    session: Session, ratings_f: str, batch_size: int = 1000
) -> None:
    batch = []
    with open(ratings_f, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                session.execute_write(_load_ratings_batch, batch)
                batch.clear()
        if batch:
            session.execute_write(_load_ratings_batch, batch)


def _load_tags_batch(tx: Transaction, batch: List[Dict[str, Any]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User  {userId: toInteger(row.userId)})
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
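        // Key the relationship on the tag text, so that several tags from the
        // same user on the same movie are kept instead of overwriting each other
        MERGE (u)-[t:TAGGED {tag: row.tag}]->(m)
        SET t.timestamp = toInteger(row.timestamp),
            t.embedding = row.embedding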
        """,
        batch=batch,
    )


def import_tags_batched(session: Session, tags_f: str, batch_size: int = 1000) -> None:
    with open(tags_f, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break

            tags = [row["tag"] for row in rows]
            embeddings = embedding_model.embed_batch(tags)

            for i, row in enumerate(rows):
                row["embedding"] = embeddings[i]

            session.execute_write(_load_tags_batch, rows)
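The year-stripping regular expression used in import_movies_batched can be checked in isolation. In this small sketch the `strip_year` helper is hypothetical and the last two titles are invented edge cases; the pattern is the one from the import code.

```python
import re

# Same pattern as in import_movies_batched: optional whitespace, then a
# four-digit year in parentheses at the very end of the string
YEAR_SUFFIX = re.compile(r"\s*\(\d{4}\)$")


def strip_year(title: str) -> str:
    """Remove a trailing "(YYYY)" suffix, leaving other titles untouched."""
    return YEAR_SUFFIX.sub("", title)


examples = [
    "Forrest Gump (1994)",
    "2001: A Space Odyssey (1968)",
    "No Year Here",            # invented: nothing to strip
    "Trailing Space (1999) ",  # invented: the trailing space defeats the `$` anchor
]
cleaned = [strip_year(title) for title in examples]
for before, after in zip(examples, cleaned):
    print(repr(before), "->", repr(after))
```

If the data could contain titles with trailing whitespace, calling `title.strip()` before the substitution would be one way to cover that case.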

We also need a couple of methods to create the vector indexes for the movie and tag embeddings. These indexes will allow us to perform efficient similarity searches against the backend.

def create_vector_index(tx: Transaction) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `movie_embeddings` IF NOT EXISTS
        FOR (m:Movie)
        ON m.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: 384,
            `vector.similarity_function`: 'cosine'
        }}
    """
    )


def create_tag_vector_index(tx: Transaction) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `tag_embeddings` IF NOT EXISTS
        FOR ()-[t:TAGGED]-()
        ON t.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: 384,
            `vector.similarity_function`: 'cosine'
        }}
    """
    )
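Conceptually, a cosine vector index answers the question "which stored vectors point in nearly the same direction as this query vector?". The brute-force sketch below shows that computation in plain Python with made-up three-dimensional vectors and hypothetical helper names; a real index typically avoids scanning every vector by using an approximate nearest-neighbour structure, and may rescale the raw cosine value before reporting it as a score.

```python
import math
from typing import Dict, List, Tuple


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query: List[float], vectors: Dict[str, List[float]], k: int) -> List[Tuple[str, float]]:
    """Return the k stored vectors most similar to the query, best first."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in vectors.items()]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:k]


# Made-up embeddings; the real ones in this experiment have 384 dimensions
toy_vectors = {
    "dinosaurs": [0.9, 0.1, 0.0],
    "space": [0.0, 0.2, 0.9],
    "romance": [0.1, 0.9, 0.1],
}

nearest = top_k([1.0, 0.0, 0.0], toy_vectors, k=2)
for name, score in nearest:
    print(f"{name}: {score:.3f}")
```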

With this defined, we can now proceed to drop any existing schema, create the necessary constraints and indexes, and import the data into Aura.

with driver.session() as sess:
    print("Dropping existing graph...")
    sess.execute_write(lambda tx: tx.run("MATCH (n) DETACH DELETE n"))
    print("Dropping existing schema...")
    sess.execute_write(drop_schema)

    print("Creating constraints...")
    sess.execute_write(create_constraints)
    print("Creating vector index...")
    sess.execute_write(create_vector_index)
    print("Creating tag vector index...")
    sess.execute_write(create_tag_vector_index)

    print("Importing movies...")
    import_movies_batched(
        sess, ".data/ml-latest-small/movies.csv", ".data/ml-latest-small/links.csv"
    )
    print("Importing ratings...")
    import_ratings_batched(sess, ".data/ml-latest-small/ratings.csv")
    print("Importing tags...")
    import_tags_batched(sess, ".data/ml-latest-small/tags.csv")

Dropping existing graph...
Dropping existing schema...
Creating constraints...
Creating vector index...
Creating tag vector index...
Importing movies...
Importing ratings...
Importing tags...

Exploring the graph

Let us run some exploratory queries to see what we have in our graph. We can start by picking a single movie, say “Forrest Gump (1994)”, and find out which users have tagged it, what tags they used, and which other movies those users have also tagged.

In Cypher, this is a multi-step query which first finds the movie:

MATCH (fg:Movie {title: "Forrest Gump (1994)"})

Then finds the users who have tagged it (CALL is used to define a subquery):

CALL(fg) {
    MATCH (u:User)-[t:TAGGED]->(fg)
    WITH u, count(t) AS fgTagCount
    ORDER BY fgTagCount DESC
    LIMIT 20
    RETURN collect(u) AS topUsers
}

We then unwind the top users and find the top 10 other movies they have tagged:

CALL(fg, topUsers) {
      UNWIND topUsers AS u
      MATCH (u)-[:TAGGED]->(m:Movie)
      WHERE m <> fg
      WITH m, count(DISTINCT u) AS userCount
      ORDER BY userCount DESC
      LIMIT 10
      RETURN collect(m) AS topMovies
    }

Next we unwind both lists (unwinding means “exploding” every element of a list into its own row), pairing each top user with Forrest Gump and with each of the other top 10 movies, so that the final step can pull every tag edge between those users and those movies:

WITH fg, topUsers, topMovies
    UNWIND topUsers AS u
    UNWIND [fg] + topMovies AS m

And finally we return the user ID, movie ID, movie title, tags, and tag count of the resulting tag edges:

MATCH (u)-[t:TAGGED]->(m)
    RETURN
      u.userId      AS uId,
      m.movieId     AS mId,
      m.title       AS title,
      collect(t.tag) AS tags,
      count(t)      AS tagCount
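For readers less familiar with Cypher, the same three steps can be mirrored in plain Python on a handful of invented tag edges. This is only an analogy for the query logic; the data and variable names are hypothetical.

```python
from collections import Counter
from typing import List, Tuple

# Invented (userId, movie title, tag) edges
tag_edges: List[Tuple[int, str, str]] = [
    (1, "Forrest Gump (1994)", "Vietnam"),
    (1, "Forrest Gump (1994)", "bittersweet"),
    (1, "Pulp Fiction (1994)", "cult"),
    (2, "Forrest Gump (1994)", "emotional"),
    (2, "Pulp Fiction (1994)", "dark comedy"),
    (2, "Fight Club (1999)", "twist ending"),
    (3, "Fight Club (1999)", "dark"),
]
target = "Forrest Gump (1994)"

# Step 1: users ranked by how many tags they put on the target movie
user_counts = Counter(user for user, movie, _tag in tag_edges if movie == target)
top_users = [user for user, _count in user_counts.most_common(20)]

# Step 2: other movies ranked by how many distinct top users tagged them
user_movie_pairs = {
    (user, movie)
    for user, movie, _tag in tag_edges
    if user in top_users and movie != target
}
movie_counts = Counter(movie for _user, movie in user_movie_pairs)
top_movies = [movie for movie, _count in movie_counts.most_common(10)]

# Step 3: keep every tag edge between the top users and the selected movies
selected_movies = [target] + top_movies
selected = [
    edge for edge in tag_edges if edge[0] in top_users and edge[1] in selected_movies
]

print(top_users, top_movies, len(selected))
```

The user who never tagged the target movie drops out in step 1, so that user's edge is absent from the final selection.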

film = "Forrest Gump (1994)"
max_users = 20
max_other_movies = 10

# Grab exactly the users and movies we care about, plus their tag‐counts & tag names:
with driver.session() as sess:
    result = sess.run(
        """
    MATCH (fg:Movie {title: $film})

    // top 20 users by # of tags on FG
    CALL(fg) {
      MATCH (u:User)-[t:TAGGED]->(fg)
      WITH u, count(t) AS fgTagCount
      ORDER BY fgTagCount DESC
      LIMIT $max_users
      RETURN collect(u) AS topUsers
    }

    // top 10 other movies tagged by those users
    CALL(fg, topUsers) {
      UNWIND topUsers AS u
      MATCH (u)-[:TAGGED]->(m:Movie)
      WHERE m <> fg
      WITH m, count(DISTINCT u) AS userCount
      ORDER BY userCount DESC
      LIMIT $max_other_movies
      RETURN collect(m) AS topMovies
    }

    WITH fg, topUsers, topMovies
    // unwind both FG + the other top 10
    UNWIND topUsers AS u
    UNWIND [fg] + topMovies AS m

    // pull every tag‐edge between those users and those movies
    MATCH (u)-[t:TAGGED]->(m)
    RETURN
      u.userId      AS uId,
      m.movieId     AS mId,
      m.title       AS title,
      collect(t.tag) AS tags,
      count(t)      AS tagCount
    """,
        {"film": film, "max_users": max_users, "max_other_movies": max_other_movies},
    )
    records = [r.data() for r in result]

The resulting records will contain the user ID, movie ID, movie title, tags used by the user, and the count of those tags. We can then display a sample record and build a network visualization of the tag edges.

from IPython.display import display, Markdown
import json

print("Tag‐edges for 'Forrest Gump'")
print(
    f"Found {len(records)} tag‐edges among {len({rec['uId'] for rec in records})} users and {len({rec['mId'] for rec in records})} movies.\n"
)

if records:
    print("Sample record:")
    pretty_record = json.dumps(records[0], indent=2)
    print(pretty_record)

Tag‐edges for 'Forrest Gump'
Found 23 tag‐edges among 3 users and 11 movies.

Sample record:
{
  "uId": 474,
  "mId": 356,
  "title": "Forrest Gump (1994)",
  "tags": [
    "Vietnam"
  ],
  "tagCount": 1
}

Visualizing the data

To get a good intuition of our graph, we can use the pyvis library to create an interactive visualization of the tag edges. The nodes will represent users and movies, while the edges will represent the tags applied by users to movies. We will also highlight “Forrest Gump (1994)” in orange for better visibility.

from pyvis.network import Network

# Build the network
net = Network(height="600px", width="100%", notebook=True, cdn_resources="in_line")

for rec in records:
    uid = f"U{rec['uId']}"
    mid = f"M{rec['mId']}"
    # nodes (idempotent)
    net.add_node(uid, label=f"User {rec['uId']}", color="grey")

    # Set Forrest Gump to orange, other movies to lightblue
    movie_color = "orange" if rec["title"] == "Forrest Gump (1994)" else "lightblue"
    net.add_node(mid, label=rec["title"], color=movie_color)

    # edge thickness = # of tags
    net.add_edge(uid, mid, value=rec["tagCount"], title=", ".join(rec["tags"]))

# configure a stabilization run of 1000 iterations
net.set_options(
    """
var options = {
  "physics": {
    "stabilization": {
      "enabled": true,
      "iterations": 1000,
      "updateInterval": 25
    },
    
    "barnesHut": {
      "gravitationalConstant": -8000,
      "centralGravity": 0.3,
      "springLength": 200,
      "springConstant": 0.04,
      "damping": 0.09,
      "avoidOverlap": 0.1
    }
  }
}
"""
)

net.save_graph("forrest_gump_graph.html")

Performing recommendations

We previously calculated embeddings for tags and film titles, which we can use to perform “fuzzy” semantic searches. This allows us to find films that are similar to a given title, even if the title is not an exact match. We can use these embeddings to build a recommendation system that suggests films based on their semantic similarity, collaborative filtering, and shared tags.

The algorithm is simple: we first find the closest movie match to the input title using the vector index, and then use that as the basis for recommendations. We will also incorporate collaborative filtering by finding users who liked both the source movie and the recommended movie, and content filtering by finding shared tags between the source and recommended movies.

The final recommendation score is a weighted sum of the title similarity, the user overlap (the number of users who rated both movies 4.0 or higher), and the number of shared tags:

$\text{final\_score} = 0.5 \times \text{title\_similarity} + 0.3 \times \text{user\_overlap} + 0.2 \times \text{shared\_tags}$
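A quick worked example makes the balance between the three terms visible. The function below restates the formula; the input values are invented for illustration.

```python
def final_score(title_similarity: float, user_overlap: int, shared_tags: int) -> float:
    """Weighted sum from the formula above."""
    return 0.5 * title_similarity + 0.3 * user_overlap + 0.2 * shared_tags


# A near-identical title with no shared users or tags (invented values)
title_only = final_score(title_similarity=0.95, user_overlap=0, shared_tags=0)

# A weaker title match backed by two shared users and one shared tag
with_evidence = final_score(title_similarity=0.80, user_overlap=2, shared_tags=1)

print(f"{title_only:.3f}")
print(f"{with_evidence:.3f}")
```

Because the overlap and tag terms are raw counts while the similarity term cannot exceed 1, a single shared user or tag can outweigh a large difference in title similarity. Normalising the counts would be one way to rebalance the terms if that is not the behaviour you want.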

Here is the recommendation method that performs these steps.

def recommend_movies(
    movie_title: str, num_recommendations: int = 5
) -> Tuple[Optional[str], List[Dict[str, Any]]]:
    """
    Recommends movies based on a hybrid of semantic similarity,
    collaborative filtering, and shared tags.

    It first finds the closest movie match to the input title and then
    uses that as the basis for recommendations.

    Args:
        movie_title (str): The title of the movie to get recommendations for.
        num_recommendations (int): The number of recommendations to return.

    Returns:
        tuple: A tuple containing (source_movie_title, list_of_recommendations).
               Each recommendation is a dictionary with title, score, and evidence.
    """
    # Compute the embedding for the input movie title
    title_embedding = embedding_model.embed_batch([movie_title])[0]

    # Query Neo4j for recommendations
    with driver.session() as sess:
        result = sess.run(
            """
            // Find the closest movie to the input title string
            CALL db.index.vector.queryNodes('movie_embeddings', 1, $embedding)
            YIELD node AS source_movie
            
            // Find recommendation candidates based on the source movie's embedding
            CALL db.index.vector.queryNodes('movie_embeddings', $k, source_movie.embedding)
            YIELD node AS similar_movie, score AS title_similarity
            WHERE similar_movie <> source_movie

            // Collaborative Filtering - find users who liked both movies
            WITH source_movie, similar_movie, title_similarity
            OPTIONAL MATCH (source_movie)<-[r1:RATED]-(u:User)-[r2:RATED]->(similar_movie)
            WHERE r1.rating >= 4.0 AND r2.rating >= 4.0
            WITH source_movie, similar_movie, title_similarity, count(DISTINCT u) AS user_overlap
            
            // Content Filtering - find shared tags more robustly
            WITH source_movie, similar_movie, title_similarity, user_overlap
            OPTIONAL MATCH (source_movie)<-[t1:TAGGED]-(:User)
            WITH source_movie, similar_movie, title_similarity, user_overlap, collect(DISTINCT t1.tag) AS source_tags
            OPTIONAL MATCH (similar_movie)<-[t2:TAGGED]-(:User)
            WITH source_movie, similar_movie, title_similarity, user_overlap, source_tags, collect(DISTINCT t2.tag) AS similar_tags
            
            // Calculate the intersection of tags
            WITH source_movie, similar_movie, title_similarity, user_overlap,
                 [tag IN source_tags WHERE tag IN similar_tags] AS shared_tags
            
            // Get shared genres
            WITH source_movie, similar_movie, title_similarity, user_overlap, shared_tags
            OPTIONAL MATCH (source_movie)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(similar_movie)
            WITH source_movie, similar_movie, title_similarity, user_overlap, shared_tags, collect(DISTINCT g.name) AS shared_genres

            // Calculate final score and return
            WITH source_movie,
                 similar_movie,
                 (title_similarity * 0.5) + (user_overlap * 0.3) + (size(shared_tags) * 0.2) AS final_score,
                 user_overlap,
                 shared_tags,
                 shared_genres
            
            RETURN source_movie.title AS source_title,
                   similar_movie.title AS title,
                   final_score,
                   user_overlap,
                   shared_tags,
                   shared_genres
            ORDER BY final_score DESC
            LIMIT $num_recommendations
            """,
            {
                "k": 20,  # Get more initial candidates to refine
                "embedding": title_embedding,
                "num_recommendations": int(
                    num_recommendations
                ),  # Ensure it's an integer
            },
        )

        records = list(result)
        if not records:
            return None, []

        source_title = records[0]["source_title"]
        recommendations = [
            {
                "title": r["title"],
                "score": r["final_score"],
                "evidence": {
                    "user_overlap": r["user_overlap"],
                    "shared_tags": r["shared_tags"],
                    "shared_genres": r["shared_genres"],
                },
            }
            for r in records
        ]

    return source_title, recommendations


def visualize_recommendations(
    source_title: str, recommendations: List[Dict[str, Any]]
) -> Network:
    """
    Generates a pyvis graph to visualize the relationships between the source
    movie and its recommendations based on shared genres.

    Args:
        source_title (str): The title of the source movie.
        recommendations (list): A list of recommendation dictionaries.

    Returns:
        pyvis.network.Network: A pyvis Network object representing the graph.
    """
    net = Network(height="600px", width="100%", notebook=True, cdn_resources="in_line")

    # Add the source movie node
    net.add_node(
        source_title, label=source_title, color="orange", size=25, font={"size": 16}
    )

    # Keep track of genres already added
    added_genres = set()

    for rec in recommendations:
        rec_title = rec["title"]
        evidence = rec["evidence"]

        # Add the recommended movie node
        net.add_node(rec_title, label=rec_title, color="lightblue", size=15)

        # Add a direct edge from source to recommendation
        net.add_edge(
            source_title,
            rec_title,
            value=rec["score"],
            title=f"Score: {rec['score']:.2f}\nUsers: {evidence['user_overlap']}\nTags: {', '.join(evidence['shared_tags'])}",
            color="#cccccc",
        )

        # Add genre nodes and edges
        for genre in evidence.get("shared_genres", []):
            if genre not in added_genres:
                net.add_node(genre, label=genre, color="lightgreen", size=10)
                added_genres.add(genre)

            # Connect movies to their shared genres
            net.add_edge(source_title, genre, color="lightgrey", width=2)
            net.add_edge(rec_title, genre, color="lightgrey", width=2)

    net.set_options(
        """
    var options = {
      "physics": {
        "stabilization": {
          "enabled": true,
          "iterations": 1000,
          "updateInterval": 25
        },
        "barnesHut": {
          "gravitationalConstant": -8000,
          "centralGravity": 0.3,
          "springLength": 250,
          "springConstant": 0.05,
          "damping": 0.09,
          "avoidOverlap": 0.1
        }
      }
    }
    """
    )

    return net

What does the recommendation algorithm return for a couple of example movie titles? The first example will be “Jurassic Park”, which should return recommendations based on that title, while the second example will be a more generic search for “world war” to see how well the algorithm can handle partial matches.

# Example usage
movie_title = "Jurassic Park"
source, recommendations = recommend_movies(movie_title)

if source:
    md = f"Recommendations based on '{source}':\n\n"
    for rec in recommendations:
        md += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
    print(md)

    # Visualize the recommendations
    net = visualize_recommendations(source, recommendations)
    net.save_graph("jurassic_park_recommendations.html")
else:
    print("No recommendations found.")

# Another example with a non-exact title
movie_title = "world war"
source, recommendations = recommend_movies(movie_title)

if source:
    md = f"Recommendations based on '{source}':\n"
    for rec in recommendations:
        md += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
    print(md)

    # Visualize the recommendations
    net = visualize_recommendations(source, recommendations)
    net.save_graph("world_war_recommendations.html")
else:
    print("No recommendations found.")

Recommendations based on 'Jurassic World: Fallen Kingdom (2018)':

- Lost World: Jurassic Park, The (1997) (Score: 0.4793)
- Jurassic World (2015) (Score: 0.4699)
- Jurassic Park III (2001) (Score: 0.4665)
- Jurassic Park (1993) (Score: 0.4661)
- Dinotopia (2002) (Score: 0.4473)

Recommendations based on 'War of the Worlds (2005)':
- War of the Worlds (2005) (Score: 0.4975)
- War of the Worlds, The (1953) (Score: 0.4931)
- Lord of War (2005) (Score: 0.4761)
- In Love and War (1996) (Score: 0.4715)
- Reign of Fire (2002) (Score: 0.4703)

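The recommendations returned by the algorithm are a reasonable match for the input titles. Note, however, that the closest title match for “Jurassic Park” was “Jurassic World: Fallen Kingdom (2018)” rather than “Jurassic Park (1993)”, so the source movie is not always the one you might expect. The final ranking combines semantic similarity with any shared tags and any users who liked both the source movie and the recommendation.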

Let us also visualize the recommendations for “Jurassic Park” to see what the relationships between the source movie and its recommendations look like in a graph.

And for the “world war” query, we can see how the algorithm handles a more abstract search: it returns movies whose titles are close to the query text, such as “War of the Worlds” and “Lord of War”, rather than movies that are necessarily about the world wars.

Finding movies by description

Later on we will also need a method to find movies based on a natural language description of their themes or content. This will allow us to search for movies using more abstract queries, such as “space travel and aliens” or “funny romantic movies”. We will use the same embedding model to compute the embedding for the input description and then query Neo4j for movies whose tag embeddings are close to it, using a cosine similarity search.

Note

Neo4j supports vector search using the db.index.vector.queryRelationships procedure, which allows us to find relationships that are similar to a given embedding.

def find_movies_by_description(
    description: str, num_results: int = 10
) -> List[Tuple[str, float]]:
    """
    Finds movies based on a natural language description of their themes or content.

    Args:
        description (str): The descriptive search query.
        num_results (int): The number of movies to return.

    Returns:
        list: A list of tuples, each containing a movie title and its relevance score.
    """
    # 1. Compute the embedding for the input description
    description_embedding = embedding_model.embed_batch([description])[0]

    # 2. Query Neo4j for movies based on tag similarity
    with driver.session() as sess:
        result = sess.run(
            """
            // Find top K similar tags via vector search
            CALL db.index.vector.queryRelationships('tag_embeddings', $k, $embedding)
            YIELD relationship AS t, score
            
            // Find the movies associated with those tags
            WITH t, score
            MATCH (m:Movie)<-[t]-()
            
            // Aggregate scores for each movie
            WITH m, sum(score) AS total_score, count(t) AS matching_tags
            
            // Return top N movies ranked by score
            RETURN m.title AS title, total_score, matching_tags
            ORDER BY total_score DESC
            LIMIT $num_results
            """,
            {
                "k": 20,  # Find top 20 tags to broaden the search space
                "embedding": description_embedding,
                "num_results": int(num_results),
            },
        )

        movies = [(r["title"], r["total_score"]) for r in result]

    return movies
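
The aggregation step in the Cypher query can be mimicked in plain Python, which helps when interpreting the scores. The sketch below uses made-up (movie, score) pairs standing in for the tag relationships returned by the vector search; `aggregate_tag_hits` is a hypothetical helper and not part of the notebook. Because scores are summed, a movie matched through several tags can end up with a total above 1, and movies matched through a single tag with the same text will tie.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def aggregate_tag_hits(
    hits: Iterable[Tuple[str, float]], num_results: int = 10
) -> List[Tuple[str, float, int]]:
    """Sum similarity scores per movie and rank by the total, highest first."""
    totals: Dict[str, float] = defaultdict(float)
    counts: Dict[str, int] = defaultdict(int)
    for title, score in hits:
        totals[title] += score
        counts[title] += 1
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [(title, total, counts[title]) for title, total in ranked[:num_results]]


# Illustrative hits only: each pair is one tag relationship and its score
toy_hits = [
    ("Movie A", 0.70),
    ("Movie B", 0.71),
    ("Movie A", 0.69),
    ("Movie C", 0.71),
    ("Movie A", 0.68),
]

for title, total, matching_tags in aggregate_tag_hits(toy_hits, num_results=3):
    print(f"{title}: total={total:.2f}, matching_tags={matching_tags}")
```

Summing rewards movies that match the description through many tags; averaging the scores instead would be a reasonable alternative if heavily tagged movies start to dominate the results.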

Let’s try this method with a couple of example queries to see how well it can find movies based on their tags. The first query will be “space travel and aliens”, which should return movies related to those themes, and the second query will be “funny romantic movies”, which should return light-hearted romantic comedies.

This will be heavily dependent on the tags that users have applied to movies in the dataset, so the results may vary based on tag availability.

# Example usage:
search_query = "space travel and aliens"
movies = find_movies_by_description(search_query)

md = f"Movies found for '{search_query}':\n\n"
if movies:
    for title, score in movies:
        md += f"- {title} (Relevance: {score:.4f})\n"
else:
    md += "No matching movies found."
print(md)


# Another example:
search_query = "funny romantic movies"
movies = find_movies_by_description(search_query)

md = f"Movies found for '{search_query}':\n\n"
if movies:
    for title, score in movies:
        md += f"- {title} (Relevance: {score:.4f})\n"
else:
    md += "No matching movies found."
print(md)

Movies found for 'space travel and aliens':

- Day the Earth Stood Still, The (1951) (Relevance: 0.7658)
- Thing from Another World, The (1951) (Relevance: 0.7658)
- Astronaut's Wife, The (1999) (Relevance: 0.7658)
- Independence Day (a.k.a. ID4) (1996) (Relevance: 0.7658)
- Men in Black (a.k.a. MIB) (1997) (Relevance: 0.7658)
- Arrival, The (1996) (Relevance: 0.7658)
- Alien (1979) (Relevance: 0.7658)
- My Stepmother Is an Alien (1988) (Relevance: 0.7658)
- E.T. the Extra-Terrestrial (1982) (Relevance: 0.7658)
- Signs (2002) (Relevance: 0.7433)

Movies found for 'funny romantic movies':

- Son of Rambow (2007) (Relevance: 0.7653)
- Titanic (1997) (Relevance: 0.7422)
- Harold and Maude (1971) (Relevance: 0.7266)
- Punchline (1988) (Relevance: 0.7195)
- Personal Velocity (2002) (Relevance: 0.7180)
- Corrina, Corrina (1994) (Relevance: 0.7165)
- Monty Python's The Meaning of Life (1983) (Relevance: 0.7077)
- Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) (Relevance: 0.7023)
- State and Main (2000) (Relevance: 0.7023)
- Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) (Relevance: 0.6999)

Integrating with a Large Language Model

To enhance our recommendation system, we can integrate a large language model (LLM) to generate natural language explanations for why certain movies are recommended. This will provide users with more context and reasoning behind the recommendations, making the system more user-friendly and informative.

To do this, we will use the Google Gemini API to generate explanations based on the recommendations. The LLM will take the source movie, the recommended movie, and the evidence gathered from the graph (shared genres, user overlap, and shared tags) to create a friendly and concise explanation.

We could use any other language model, including a local small language model, but we will leave that as an exercise for the reader.

Agent model and tool use

We will implement this as a function that takes the source movie title and a list of recommended movies, each carrying the evidence gathered from the graph, and uses the Gemini API to generate an explanation for each recommendation. The function returns a list of dictionaries containing the recommended movie title and its explanation.

These methods will then be used as part of a conversational agent that can answer movie-related queries and provide recommendations based on user input. The agent will have these methods as tools it can call to provide answers, making it capable of handling a wide range of movie-related questions.

import os
from typing import Any, Dict, List

import google.generativeai as genai

# Check for Google API key, preferring GEMINI_API_KEY
api_key = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
if not api_key:
    print(
        "API key not found. Please set the GEMINI_API_KEY or GOOGLE_API_KEY environment variable."
    )
    gemini_model = None
else:
    genai.configure(api_key=api_key)
    gemini_model = genai.GenerativeModel("gemini-2.5-flash")


def generate_recommendation_explanations(
    source_title: str, recommendations: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """
    Generates natural language explanations for recommendations using the Gemini API.

    Returns:
        list: A list of dictionaries, each containing a 'title' and 'explanation'.
    """
    if not gemini_model:
        print("Gemini model not initialized. Cannot generate explanations.")
        return []

    explained_recommendations = []
    for rec in recommendations:
        recommended_title = rec["title"]
        evidence = rec["evidence"]

        if not evidence:
            continue

        # Construct a detailed prompt for the LLM
        prompt = f"""
        You are a movie recommendation assistant. Your task is to provide a compelling, one-paragraph explanation for a movie recommendation.

        Here is the data you should use:
        - The user liked the movie: "{source_title}"
        - We are recommending the movie: "{recommended_title}"
        - These two movies share the following genres: {', '.join(evidence['shared_genres'])}
        - We found that {evidence['user_overlap']} users who gave a high rating to "{source_title}" also gave a high rating to "{recommended_title}".
        - They also share common themes, as shown by these shared tags: {', '.join(evidence['shared_tags'])}.

        Based on this evidence, please generate a friendly and concise paragraph explaining why someone who liked "{source_title}" would enjoy "{recommended_title}".
        Do not just list the data; weave it into a natural-sounding explanation.
        """

        try:
            response = gemini_model.generate_content(prompt)
            explanation = response.text

            explained_recommendations.append(
                {"title": recommended_title, "explanation": explanation}
            )

        except Exception as e:
            print(
                f"An error occurred while generating explanation for {recommended_title}: {e}"
            )
            explained_recommendations.append(
                {
                    "title": recommended_title,
                    "explanation": f"Could not generate explanation: {e}",
                }
            )

    return explained_recommendations
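
One possible refinement, sketched below, is to build the evidence part of the prompt only from fields that are actually populated, so the model is not handed an empty genre or tag list and left to fill the gap on its own. `build_evidence_lines` is a hypothetical helper; it assumes the evidence dictionary uses the same `shared_genres`, `user_overlap` and `shared_tags` keys as the function above.

```python
from typing import Any, Dict, List


def build_evidence_lines(
    source_title: str, recommended_title: str, evidence: Dict[str, Any]
) -> List[str]:
    """Return prompt bullet lines for the non-empty evidence fields only."""
    lines: List[str] = []
    genres = evidence.get("shared_genres") or []
    if genres:
        lines.append(
            f"- These two movies share the following genres: {', '.join(genres)}"
        )
    overlap = evidence.get("user_overlap") or 0
    if overlap > 0:
        lines.append(
            f'- {overlap} users who gave a high rating to "{source_title}" '
            f'also gave a high rating to "{recommended_title}".'
        )
    tags = evidence.get("shared_tags") or []
    if tags:
        lines.append(f"- They share these tags: {', '.join(tags)}.")
    return lines


# Illustrative evidence only (not taken from the dataset)
example = {"shared_genres": ["Comedy"], "user_overlap": 0, "shared_tags": []}
print("\n".join(build_evidence_lines("Source Movie", "Recommended Movie", example)))
```

If the returned list is empty, the caller could skip the LLM call for that recommendation instead of asking for an explanation with nothing to support it.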

Let us test the recommendation explanations with a specific movie title and see how the LLM generates explanations for the recommendations. We will use the movie “Life Is a Long Quiet River” as an example, which should yield interesting recommendations based on its themes and user ratings.

from IPython.display import Markdown, display
import pandas as pd

# Example usage with generative explanations
movie_title = "Life Is a Long Quiet River"
source, recommendations = recommend_movies(movie_title, num_recommendations=3)

if source:
    print(f"Recommendations based on '{source}':\n")
    explained_recs = generate_recommendation_explanations(source, recommendations)

    # Build and display a pandas DataFrame
    if explained_recs:
        df = pd.DataFrame(explained_recs)
        df.rename(
            columns={"title": "Recommended Movie", "explanation": "Explanation"},
            inplace=True,
        )
        styler = df.style.set_properties(
            **{"white-space": "normal", "text-align": "left"}
        )
        styler.set_table_styles([dict(selector="th", props=[("text-align", "left")])])
        display(styler)
    else:
        print("Could not generate explanations for the recommendations.")

else:
    print("No recommendations found.")

Recommendations based on 'Life Is a Long Quiet River (La vie est un long fleuve tranquille) (1988)':

	Recommended Movie	Explanation
0	Life Is Beautiful (La Vita è bella) (1997)	Given your appreciation for the charming film Life Is a Long Quiet River (La vie est un long fleuve tranquille), we believe you'll absolutely love Life Is Beautiful (La Vita è bella). Both movies beautifully blend heartfelt narratives with a strong comedic spirit, ensuring a truly engaging viewing experience. In fact, at least one user who highly rated Life Is a Long Quiet River also gave a top rating to Life Is Beautiful, indicating a shared taste for films that navigate life's complexities with humor and warmth.
1	Train of Life (Train de vie) (1998)	Given your enjoyment of the distinct charm and comedic sensibilities found in "Life Is a Long Quiet River (La vie est un long fleuve tranquille)," we believe you might also appreciate "Train of Life (Train de vie)." Both films share the Comedy genre, suggesting that if you connected with the unique brand of humor and delightful storytelling in your previous favorite, you'll likely find similar entertainment and lighthearted moments aboard the journey presented in "Train of Life."
2	La cravate (1957)	If you enjoyed the uniquely French blend of humor and insightful observation in "Life Is a Long Quiet River," you might appreciate "La cravate" for its distinct approach to storytelling. While it offers a very different, more experimental journey, its captivating visual narrative and unique sensibility could appeal to viewers who appreciate the rich, varied, and often unexpected artistic expressions found within French cinema, inviting you to explore another fascinating corner of its diverse landscape.

The agent

Now let us create the conversational agent that can use the tools we built previously to answer movie-related queries. The agent will be able to autonomously recommend movies and find movies by description.

Here is a diagram of our agent architecture:

graph TD
    subgraph "User Interaction"
        A[User Query]
    end

    subgraph "Conversational Agent"
        B(converse_with_llm)
        C{Tool Executor}
        D[recommend_movies]
        E[find_movies_by_description]
    end

    subgraph "Gemini API"
        F(Gemini LLM)
    end

    subgraph "Neo4j Database"
        G((Graph Database))
    end

    A --> B;
    B -- "Query + Tools" --> F;
    F -- "Function Call" --> C;
    F -- "Direct Answer" --> H[Formatted Response];
    C -- "Executes" --> D;
    C -- "Executes" --> E;
    D -- "Cypher Query" --> G;
    E -- "Cypher Query" --> G;
    G -- "Data" --> D;
    G -- "Data" --> E;
    D --> H;
    E --> H;
    H --> I[Display to User];

    style A fill:#FFDAB9,stroke:#333,stroke-width:2px
    style F fill:#ADD8E6,stroke:#333,stroke-width:2px
    style G fill:#90EE90,stroke:#333,stroke-width:2px

# Conversational Agent with Tool Use


def converse_with_llm(query: str) -> str:
    """
    A conversational agent that can use tools to answer movie-related queries.
    """
    if not gemini_model:
        return "Gemini model not initialized. Cannot process query."

    # The user's query
    print(f"User query: '{query}'")

    # Give the model the available tools
    response = gemini_model.generate_content(
        query,
        tools=[recommend_movies, find_movies_by_description],
        generation_config={"max_output_tokens": 250},
    )

    # Guard against an empty response (no text and no function call)
    if not response.candidates[0].content.parts:
        return "I'm sorry, I couldn't find a suitable tool to answer your question."

    # Check if the model decided to call a tool
    part = response.candidates[0].content.parts[0]
    if part.function_call:
        function_call = part.function_call
        function_name = function_call.name
        function_args = dict(function_call.args)

        print(
            f"LLM decided to call tool '{function_name}' with arguments: {function_args}\n"
        )

        # --- Call the chosen function ---
        if function_name == "recommend_movies":
            source_title, recommendations = recommend_movies(**function_args)
            if not recommendations:
                return f"Sorry, couldn't find recommendations for '{function_args.get('movie_title')}'"

            # Format the output
            output = f"Recommendations based on '{source_title}':\n\n"
            for rec in recommendations:
                output += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
            return output

        elif function_name == "find_movies_by_description":
            movies = find_movies_by_description(**function_args)
            if not movies:
                return f"Sorry, couldn't find movies for description: '{function_args.get('description')}'"

            # Format the output
            output = f"Movies found for '{function_args.get('description')}':\n\n"
            for title, score in movies:
                output += f"- {title} (Relevance: {score:.4f})\n"
            return output
        else:
            return f"Error: Unknown function call '{function_name}'"
    else:
        # The model responded directly
        return response.text
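
As the number of tools grows, the if/elif chain could be replaced by a lookup table. The following is a minimal sketch of that idea, independent of the Gemini SDK: the two tool functions are stand-in stubs with made-up return values, and `TOOL_REGISTRY` and `dispatch_tool_call` are hypothetical names. It also assumes that numeric arguments may arrive from the model as floats, which is why whole-number floats are converted to integers before the call.

```python
from typing import Any, Callable, Dict, List


def recommend_movies_stub(movie_title: str, num_recommendations: int = 5) -> List[str]:
    """Stand-in for the real recommend_movies tool."""
    return [f"{movie_title} sequel {i}" for i in range(1, num_recommendations + 1)]


def find_movies_by_description_stub(description: str, num_results: int = 10) -> List[str]:
    """Stand-in for the real find_movies_by_description tool."""
    return [f"Movie about {description} #{i}" for i in range(1, num_results + 1)]


# Maps the function name chosen by the model to the callable that implements it
TOOL_REGISTRY: Dict[str, Callable[..., Any]] = {
    "recommend_movies": recommend_movies_stub,
    "find_movies_by_description": find_movies_by_description_stub,
}


def dispatch_tool_call(name: str, args: Dict[str, Any]) -> Any:
    """Look up a tool by name and call it with cleaned keyword arguments."""
    tool = TOOL_REGISTRY.get(name)
    if tool is None:
        raise KeyError(f"Unknown function call '{name}'")
    # Assumption: whole-number floats (e.g. 2.0) are meant as integers
    cleaned = {
        key: int(value) if isinstance(value, float) and value.is_integer() else value
        for key, value in args.items()
    }
    return tool(**cleaned)


print(
    dispatch_tool_call(
        "recommend_movies", {"movie_title": "Toy Story", "num_recommendations": 2.0}
    )
)
```

Adding a new tool then only requires a new registry entry; formatting of the results can live in a second table keyed by the same names.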

And finally let us test it out with a couple of queries.

# Using the recommendation tool
query1 = "Can you recommend some movies similar to 'Toy Story'?"
result1 = converse_with_llm(query1)
print(result1)

# Using the description search tool
query2 = "I want to watch a movie about space exploration and robots."
result2 = converse_with_llm(query2)
print(result2)

User query: 'Can you recommend some movies similar to 'Toy Story'?'
LLM decided to call tool 'recommend_movies' with arguments: {'movie_title': 'Toy Story'}

Recommendations based on 'Toy Story (1995)':

- Toy Story 2 (1999) (Score: 14.2874)
- Toy Story 3 (2010) (Score: 8.8880)
- Psycho (1960) (Score: 8.2628)
- The Lego Movie (2014) (Score: 3.1701)
- Dangerous Minds (1995) (Score: 1.9641)

User query: 'I want to watch a movie about space exploration and robots.'

LLM decided to call tool 'find_movies_by_description' with arguments: {'description': 'space exploration and robots'}

Movies found for 'space exploration and robots':

- Star Wars: Episode IV - A New Hope (1977) (Relevance: 2.0728)
- Iron Giant, The (1999) (Relevance: 0.7099)
- Terminator 2: Judgment Day (1991) (Relevance: 0.7099)
- Terminator, The (1984) (Relevance: 0.7099)
- Blade Runner (1982) (Relevance: 0.7099)
- A.I. Artificial Intelligence (2001) (Relevance: 0.7099)
- Short Circuit (1986) (Relevance: 0.7099)
- Apollo 13 (1995) (Relevance: 0.7073)
- Forbidden Planet (1956) (Relevance: 0.7073)
- 2001: A Space Odyssey (1968) (Relevance: 0.7073)

Final remarks

This experiment has demonstrated how to build a simple movie recommendation system using Neo4j and a large language model. We have imported a dataset of movies, ratings, and tags into a Neo4j graph database, created a recommendation algorithm that combines semantic similarity, collaborative filtering, and shared tags, and integrated a large language model to generate natural language explanations for the recommendations.

We also created a conversational agent that can answer movie-related queries and provide recommendations based on user input. The agent can autonomously recommend movies and find movies by description, integrating graph capabilities and natural language understanding.

A good further exercise would be to extend the source dataset with further relationships, such as actors, directors, and genres, drawing on larger datasets such as the IMDb Non-Commercial Datasets, and to enhance the recommendation algorithm to take these relationships into account.

Reuse

This work is licensed under CC BY.