a Neo4j experiment to build a movie recommendation system
Experiments
Graph Analysis
Network Theory
Published
July 17, 2025
In philosophy you’ve probably heard the term “ontology”. In knowledge graphs, an ontology is a formal, machine‑readable vocabulary that defines a domain’s classes (things that exist), properties (how they’re related) and constraints (logical rules). By using universal identifiers (URIs) and logic‑friendly standards like RDF Schema or OWL, ontologies ensure that systems agree on what “Customer,” “Order” or “PaymentTerm” actually mean, without necessarily requiring separate data dictionaries.
A graph database is a storage engine optimized for this kind of highly connected data. Triple‑stores (e.g. GraphDB, Amazon Neptune‑RDF) persist facts as Subject‑Predicate‑Object triples and run SPARQL queries with built‑in reasoning. Property‑graph systems like Neo4j model nodes and relationships with arbitrary key/value properties and offer Cypher for analytics‑friendly traversals. With relationships as first‑class citizens, questions like “find all suppliers two hops from a recalled component” execute in milliseconds, with none of the tangled SQL JOINs a relational model would require.
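To make that concrete, here is what such a query could look like in Cypher (a hypothetical supply‑chain schema; the labels, relationship type, and recalled property are made up for this example):
// Find all suppliers within two hops of a recalled component
MATCH (c:Component {recalled: true})<-[:SUPPLIES*1..2]-(s:Supplier)
RETURN DISTINCT s.name
The variable‑length pattern *1..2 walks one or two SUPPLIES hops back from the recalled component, something a relational schema would typically express as a UNION of self‑JOINs.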
Compared to relational databases, graph stores trade rigid, table‑bound schemas for schema‑late flexibility. While you do give up some of the cross‑table ACID simplicity (especially at massive scale or across clusters), you gain the ability to evolve your data model on the fly and uncover hidden patterns, making graphs a natural fit whenever relationships matter most.
In this experiment, we will explore a simple use case for a graph database using the MovieLens dataset, which contains information about movies, users, ratings, and tags.
We will use Neo4j to create a graph representation of the data and perform some interesting queries.
An embedding model will compute semantic embeddings for movie titles and tags, and we will use those embeddings to enhance our graph queries.
We will implement a simple recommendation system using collaborative filtering.
Finally, we will use a large language model (LLM) to generate explanations for recommendations and build a conversational agent that can answer movie-related queries using the graph database.
Note
“Collaborative filtering” is a technique used in recommendation systems to suggest items (like movies) based on the preferences of similar users. It relies on the idea that if two users have similar tastes, they are likely to enjoy similar items.
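As a minimal sketch of the idea (with made‑up ratings), user similarity can be computed as the cosine similarity between rating vectors:
import numpy as np

# Toy example: two users' ratings for the same five movies (0 = unrated)
alice = np.array([5.0, 4.0, 0.0, 1.0, 5.0])
bob = np.array([4.0, 5.0, 1.0, 0.0, 4.0])

# Cosine similarity between the rating vectors
similarity = alice @ bob / (np.linalg.norm(alice) * np.linalg.norm(bob))
print(f"User similarity: {similarity:.2f}")  # high similarity -> Bob's other favourites are good picks for Alice
Real systems refine this with mean-centering and sparse storage, but the core signal is the same.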
Downloading the MovieLens dataset
We will start by downloading the small MovieLens dataset, which contains roughly 100,000 movie ratings. It is small enough to fit within the free-tier limits of Neo4j Aura, and it contains enough data to demonstrate the capabilities of graph databases.
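A minimal download sketch (assuming the standard GroupLens URL for the ml-latest-small archive):
import io
import urllib.request
import zipfile

# Download and extract the MovieLens "latest-small" archive (~100k ratings)
URL = "https://files.grouplens.org/datasets/movielens/ml-latest-small.zip"
with urllib.request.urlopen(URL) as resp:
    zipfile.ZipFile(io.BytesIO(resp.read())).extractall(".")
# Leaves movies.csv, ratings.csv, tags.csv and links.csv under ./ml-latest-small/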
With the dataset downloaded, we can now proceed to set up our Neo4j database and import the data.
Setting up Neo4j
We will use Neo4j Aura, the cloud version of Neo4j, to host our graph database. You can sign up for a free account at Neo4j Aura. Once you have created an account, you can create a new database instance and obtain the connection details (URI, username, and password).
Make sure to set the environment variables NEO4J_URI, NEO4J_USERNAME, and NEO4J_PASSWORD with the connection details of your Neo4j Aura instance.
Show the code
import os
from neo4j import GraphDatabase, basic_auth, Driver, Session, Transaction, Record

URI = os.getenv("NEO4J_URI")
USER = os.getenv("NEO4J_USERNAME")
PASSWORD = os.getenv("NEO4J_PASSWORD")
AUTH = (USER, PASSWORD)

print(f"Connecting to Neo4j at {URI} with user {USER}")

# Keep the driver open for the rest of the notebook; a `with` block would
# close it as soon as the block exits
driver: Driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()

def test_aura_connection() -> None:
    with driver.session() as session:
        result = session.run("RETURN 'Hello, Aura!' AS message")
        record = result.single()
        print(record["message"])  # should print "Hello, Aura!"

test_aura_connection()
Connecting to Neo4j at neo4j+s://8c1ab3e4.databases.neo4j.io with user neo4j
Hello, Aura!
Computing embeddings
With the connection established, we can now proceed to import the dataset. First, however, let us define a helper class to compute embeddings for movie titles and tags using a Transformer model. For simplicity, we will use the all-MiniLM-L6-v2 model from the sentence-transformers library, a lightweight model suited to semantic similarity tasks (though it lags behind larger models in accuracy).
Show the code
# Method to compute embeddings using a given Transformer model
from transformers import AutoTokenizer, AutoModel
import torch
import numpy as np
from typing import List

class EmbeddingModel:
    def __init__(
        self, model_name: str = "sentence-transformers/all-MiniLM-L6-v2"
    ) -> None:
        # Use CUDA if available
        if torch.cuda.is_available():
            print("Using CUDA for embeddings.")
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            print("Using MPS for embeddings.")
            self.device = torch.device("mps")
        else:
            print("Using CPU for embeddings.")
            self.device = torch.device("cpu")
        self.tokenizer = AutoTokenizer.from_pretrained(model_name)
        self.model = AutoModel.from_pretrained(model_name).to(self.device)

    def embed_batch(self, texts: List[str]) -> np.ndarray:
        inputs = self.tokenizer(
            texts, return_tensors="pt", truncation=True, padding=True, max_length=512
        ).to(self.device)
        with torch.no_grad():
            outputs = self.model(**inputs)
        # Mean-pool the last hidden state to get one vector per input text
        return outputs.last_hidden_state.mean(dim=1).cpu().numpy()

embedding_model = EmbeddingModel()
Using CUDA for embeddings.
In addition, we will need a few methods to handle the import of the dataset into Neo4j. We will create nodes for movies, users, and genres, and establish RATED, TAGGED, and IN_GENRE relationships between them. We will define five functions: drop_schema to drop existing constraints and indexes, create_constraints to create the necessary constraints in the database, and import_movies_batched, import_ratings_batched, and import_tags_batched to import the movies, ratings, and tags data in batches (batched loading is important if you are doing any significant volume of transactions).
Once all data is loaded, our graph will be structured as follows:
graph TD
subgraph Nodes
direction LR
U[User]
M[Movie]
G[Genre]
end
subgraph Relationships
direction LR
U -- RATED --> M
U -- TAGGED --> M
M -- IN_GENRE --> G
end
style U fill:#FFDAB9,stroke:#333,stroke-width:2px
style M fill:#ADD8E6,stroke:#333,stroke-width:2px
style G fill:#90EE90,stroke:#333,stroke-width:2px
linkStyle 0 stroke:red,stroke-width:2px;
linkStyle 1 stroke:blue,stroke-width:2px;
linkStyle 2 stroke:green,stroke-width:2px;
Importing the dataset
Our Cypher query to import movies (and similarly, other entities) looks like this:
UNWIND $batch AS row
MERGE (m:Movie {movieId: toInteger(row.movieId)})
SET m.title = row.title,
m.imdbId = row.imdbId,
m.tmdbId = row.tmdbId,
m.embedding = row.embedding
WITH m, row
UNWIND split(row.genres, '|') AS genreName
MERGE (g:Genre {name: genreName})
MERGE (m)-[:IN_GENRE]->(g)
It takes a list of records supplied in $batch and processes them one at a time (UNWIND). For each record it either finds or creates a Movie node keyed by movieId, then updates that node with its title, IMDb and TMDb identifiers, plus a pre‑computed vector stored in embedding. Because it uses MERGE, you’ll never get duplicate movie nodes with the same ID.
After updating the movie, the query splits the pipe‑separated genre string into individual genre names, again ensuring each unique name has exactly one Genre node. It then creates (or confirms) an IN_GENRE relationship from the movie to each of its genres. The whole snippet is essentially an idempotent “upsert” that normalizes movies and genres while wiring them together in a clean graph structure.
Show the code
import csv
import itertools
import re
from typing import Dict, Any, Tuple, Optional

def drop_schema(tx: Transaction) -> None:
    # Drop constraints
    for record in tx.run("SHOW CONSTRAINTS"):
        name = record["name"]
        tx.run(f"DROP CONSTRAINT `{name}`")
    # Drop indexes
    for record in tx.run("SHOW INDEXES"):
        name = record["name"]
        tx.run(f"DROP INDEX `{name}`")

def create_constraints(tx: Transaction) -> None:
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (m:Movie) REQUIRE m.movieId IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User) REQUIRE u.userId IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (g:Genre) REQUIRE g.name IS UNIQUE")

def _load_movie_batch(tx: Transaction, batch: List[Dict[str, Any]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
        SET m.title = row.title,
            m.imdbId = row.imdbId,
            m.tmdbId = row.tmdbId,
            m.embedding = row.embedding
        WITH m, row
        UNWIND split(row.genres, '|') AS genreName
        MERGE (g:Genre {name: genreName})
        MERGE (m)-[:IN_GENRE]->(g)
        """,
        batch=batch,
    )

def import_movies_batched(
    session: Session, movies_f: str, links_f: str, batch_size: int = 1000
) -> None:
    # preload links into memory once
    links = {}
    with open(links_f, newline="", encoding="utf-8") as f:
        for r in csv.DictReader(f):
            links[r["movieId"]] = {"imdbId": r["imdbId"], "tmdbId": r["tmdbId"]}
    with open(movies_f, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            # Read a batch of rows from the CSV
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break
            # Extract titles and compute embeddings in a batch
            titles = [row["title"] for row in rows]
            # Titles have a year ("... (1994)" for example), remove it with a regexp for better embeddings
            titles = [re.sub(r"\s*\(\d{4}\)$", "", title) for title in titles]
            embeddings = embedding_model.embed_batch(titles)
            # Prepare batch for Neo4j import
            batch_to_load = []
            for i, row in enumerate(rows):
                lm = links.get(row["movieId"], {})
                batch_to_load.append(
                    {
                        "movieId": row["movieId"],
                        "title": row["title"],
                        "genres": row["genres"],
                        "imdbId": lm.get("imdbId"),
                        "tmdbId": lm.get("tmdbId"),
                        # plain list: the Neo4j driver does not serialize numpy arrays
                        "embedding": embeddings[i].tolist(),
                    }
                )
            session.execute_write(_load_movie_batch, batch_to_load)

def _load_ratings_batch(tx: Transaction, batch: List[Dict[str, str]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User {userId: toInteger(row.userId)})
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
        MERGE (u)-[r:RATED]->(m)
        SET r.rating = toFloat(row.rating),
            r.timestamp = toInteger(row.timestamp)
        """,
        batch=batch,
    )

def import_ratings_batched(
    session: Session, ratings_f: str, batch_size: int = 1000
) -> None:
    batch = []
    with open(ratings_f, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            batch.append(row)
            if len(batch) >= batch_size:
                session.execute_write(_load_ratings_batch, batch)
                batch.clear()
    if batch:
        session.execute_write(_load_ratings_batch, batch)

def _load_tags_batch(tx: Transaction, batch: List[Dict[str, Any]]) -> None:
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User {userId: toInteger(row.userId)})
        MERGE (m:Movie {movieId: toInteger(row.movieId)})
        MERGE (u)-[t:TAGGED]->(m)
        SET t.tag = row.tag,
            t.timestamp = toInteger(row.timestamp),
            t.embedding = row.embedding
        """,
        batch=batch,
    )

def import_tags_batched(session: Session, tags_f: str, batch_size: int = 1000) -> None:
    with open(tags_f, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        while True:
            rows = list(itertools.islice(reader, batch_size))
            if not rows:
                break
            tags = [row["tag"] for row in rows]
            embeddings = embedding_model.embed_batch(tags)
            for i, row in enumerate(rows):
                row["embedding"] = embeddings[i].tolist()
            session.execute_write(_load_tags_batch, rows)
We also need a couple of methods to create the vector indexes for the movie and tag embeddings. These indexes will allow us to perform efficient similarity searches against the backend.
Show the code
def create_vector_index(tx: Transaction) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `movie_embeddings` IF NOT EXISTS
        FOR (m:Movie) ON m.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: 384,
            `vector.similarity_function`: 'cosine'
        }}
        """
    )

def create_tag_vector_index(tx: Transaction) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `tag_embeddings` IF NOT EXISTS
        FOR ()-[t:TAGGED]-() ON t.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: 384,
            `vector.similarity_function`: 'cosine'
        }}
        """
    )
With this defined, we can now proceed to drop any existing schema, create the necessary constraints and indexes, and import the data into Aura.
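A minimal driver for this step could look like the following sketch (the CSV paths assume the ml-latest-small layout; note that drop_schema is destructive, so only run it against a disposable instance):
with driver.session() as session:
    session.execute_write(drop_schema)              # start from a clean slate
    session.execute_write(create_constraints)       # uniqueness on movieId, userId, genre name
    session.execute_write(create_vector_index)      # movie title embeddings
    session.execute_write(create_tag_vector_index)  # tag embeddings
    import_movies_batched(session, "ml-latest-small/movies.csv", "ml-latest-small/links.csv")
    import_ratings_batched(session, "ml-latest-small/ratings.csv")
    import_tags_batched(session, "ml-latest-small/tags.csv")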
Let us do some exploratory queries to see what we have in our graph. We can start by picking a single movie, say “Forrest Gump (1994)”, and find out which users have tagged it, what tags they used, and which other movies those users have additionally tagged.
In Cypher, this is a multi-step query which first finds the movie:
MATCH (fg:Movie {title: "Forrest Gump (1994)"})
Then finds the users who have tagged it (CALL is used to define a subquery):
CALL(fg) {
MATCH (u:User)-[t:TAGGED]->(fg)
WITH u, count(t) AS fgTagCount
ORDER BY fgTagCount DESC
LIMIT 20
RETURN collect(u) AS topUsers
}
We then unwind the top users and find the top 10 other movies they have tagged:
CALL(fg, topUsers) {
UNWIND topUsers AS u
MATCH (u)-[:TAGGED]->(m:Movie)
WHERE m <> fg
WITH m, count(DISTINCT u) AS userCount
ORDER BY userCount DESC
LIMIT 10
RETURN collect(m) AS topMovies
}
Followed by unwinding (meaning “exploding” every element in the list into separate rows) both the Forrest Gump movie and the other top 10 movies, and pulling every tag edge between those users and those movies:
WITH fg, topUsers, topMovies
UNWIND topUsers AS u
UNWIND [fg] + topMovies AS m
And finally we return the user ID, movie ID, movie title, tags, and tag count of the resulting tag edges:
MATCH (u)-[t:TAGGED]->(m)
RETURN
u.userId AS uId,
m.movieId AS mId,
m.title AS title,
collect(t.tag) AS tags,
count(t) AS tagCount
Show the code
film ="Forrest Gump (1994)"max_users =20max_other_movies =10# Grab exactly the users and movies we care about, plus their tag‐counts & tag names:with driver.session() as sess: result = sess.run(""" MATCH (fg:Movie {title: $film}) // top 20 users by # of tags on FG CALL(fg) { MATCH (u:User)-[t:TAGGED]->(fg) WITH u, count(t) AS fgTagCount ORDER BY fgTagCount DESC LIMIT $max_users RETURN collect(u) AS topUsers } // top 10 other movies tagged by those users CALL(fg, topUsers) { UNWIND topUsers AS u MATCH (u)-[:TAGGED]->(m:Movie) WHERE m <> fg WITH m, count(DISTINCT u) AS userCount ORDER BY userCount DESC LIMIT $max_other_movies RETURN collect(m) AS topMovies } WITH fg, topUsers, topMovies // unwind both FG + the other top 10 UNWIND topUsers AS u UNWIND [fg] + topMovies AS m // pull every tag‐edge between those users and those movies MATCH (u)-[t:TAGGED]->(m) RETURN u.userId AS uId, m.movieId AS mId, m.title AS title, collect(t.tag) AS tags, count(t) AS tagCount """, {"film": film, "max_users": max_users, "max_other_movies": max_other_movies}, ) records = [r.data() for r in result]
The resulting records will contain the user ID, movie ID, movie title, tags used by the user, and the count of those tags. We can then display a sample record and build a network visualization of the tag edges.
Show the code
from IPython.display import display, Markdown
import json

print("Tag‐edges for 'Forrest Gump'")
print(
    f"Found {len(records)} tag‐edges among {len({rec['uId'] for rec in records})} users "
    f"and {len({rec['mId'] for rec in records})} movies.\n"
)
if records:
    print("Sample record:")
    pretty_record = json.dumps(records[0], indent=2)
    print(pretty_record)
Tag‐edges for 'Forrest Gump'
Found 23 tag‐edges among 3 users and 11 movies.
Sample record:
{
"uId": 474,
"mId": 356,
"title": "Forrest Gump (1994)",
"tags": [
"Vietnam"
],
"tagCount": 1
}
Visualizing the data
To get a good intuition of our graph, we can use the pyvis library to create an interactive visualization of the tag edges. The nodes will represent users and movies, while the edges will represent the tags applied by users to movies. We will also highlight “Forrest Gump (1994)” in orange for better visibility.
Show the code
from pyvis.network import Network

# Build the network
net = Network(height="600px", width="100%", notebook=True, cdn_resources="in_line")
for rec in records:
    uid = f"U{rec['uId']}"
    mid = f"M{rec['mId']}"
    # nodes (idempotent)
    net.add_node(uid, label=f"User {rec['uId']}", color="grey")
    # Set Forrest Gump to orange, other movies to lightblue
    movie_color = "orange" if rec["title"] == "Forrest Gump (1994)" else "lightblue"
    net.add_node(mid, label=rec["title"], color=movie_color)
    # edge thickness = # of tags
    net.add_edge(uid, mid, value=rec["tagCount"], title=", ".join(rec["tags"]))

# configure a stabilization run of 1000 iterations
net.set_options(
    """
    var options = {
      "physics": {
        "stabilization": {
          "enabled": true,
          "iterations": 1000,
          "updateInterval": 25
        },
        "barnesHut": {
          "gravitationalConstant": -8000,
          "centralGravity": 0.3,
          "springLength": 200,
          "springConstant": 0.04,
          "damping": 0.09,
          "avoidOverlap": 0.1
        }
      }
    }
    """
)
net.save_graph("forrest_gump_graph.html")
Performing recommendations
We previously calculated embeddings for tags and film titles, which we can use to perform “fuzzy” semantic searches. This allows us to find films that are similar to a given title, even if the title is not an exact match. We can use these embeddings to build a recommendation system that suggests films based on their semantic similarity, collaborative filtering, and shared tags.
The algorithm is simple: we first find the closest movie match to the input title using the vector index, and then use that as the basis for recommendations. We will also incorporate collaborative filtering by finding users who liked both the source movie and the recommended movie, and content filtering by finding shared tags between the source and recommended movies.
The final recommendation score is a weighted sum of the title similarity, user overlap, and shared tags:

$$\text{score} = 0.5 \cdot \text{title\_similarity} + 0.3 \cdot \text{user\_overlap} + 0.2 \cdot |\text{shared\_tags}|$$

For example, a candidate with title similarity 0.9, 2 overlapping users, and 3 shared tags scores $0.5 \cdot 0.9 + 0.3 \cdot 2 + 0.2 \cdot 3 = 1.65$. Note that the user overlap and tag counts are not normalized, so movies with many common fans can dominate the ranking.
Here is the recommendation method that performs these steps.
Show the code
def recommend_movies(
    movie_title: str, num_recommendations: int = 5
) -> Tuple[Optional[str], List[Dict[str, Any]]]:
    """
    Recommends movies based on a hybrid of semantic similarity, collaborative
    filtering, and shared tags. It first finds the closest movie match to the
    input title and then uses that as the basis for recommendations.

    Args:
        movie_title (str): The title of the movie to get recommendations for.
        num_recommendations (int): The number of recommendations to return.

    Returns:
        tuple: A tuple containing (source_movie_title, list_of_recommendations).
               Each recommendation is a dictionary with title, score, and evidence.
    """
    # Compute the embedding for the input movie title (plain list for the Neo4j driver)
    title_embedding = embedding_model.embed_batch([movie_title])[0].tolist()

    # Query Neo4j for recommendations
    with driver.session() as sess:
        result = sess.run(
            """
            // Find the closest movie to the input title string
            CALL db.index.vector.queryNodes('movie_embeddings', 1, $embedding)
            YIELD node AS source_movie

            // Find recommendation candidates based on the source movie's embedding
            CALL db.index.vector.queryNodes('movie_embeddings', $k, source_movie.embedding)
            YIELD node AS similar_movie, score AS title_similarity
            WHERE similar_movie <> source_movie

            // Collaborative Filtering - find users who liked both movies
            WITH source_movie, similar_movie, title_similarity
            OPTIONAL MATCH (source_movie)<-[r1:RATED]-(u:User)-[r2:RATED]->(similar_movie)
            WHERE r1.rating >= 4.0 AND r2.rating >= 4.0
            WITH source_movie, similar_movie, title_similarity,
                 count(DISTINCT u) AS user_overlap

            // Content Filtering - find shared tags more robustly
            WITH source_movie, similar_movie, title_similarity, user_overlap
            OPTIONAL MATCH (source_movie)<-[t1:TAGGED]-(:User)
            WITH source_movie, similar_movie, title_similarity, user_overlap,
                 collect(DISTINCT t1.tag) AS source_tags
            OPTIONAL MATCH (similar_movie)<-[t2:TAGGED]-(:User)
            WITH source_movie, similar_movie, title_similarity, user_overlap,
                 source_tags, collect(DISTINCT t2.tag) AS similar_tags

            // Calculate the intersection of tags
            WITH source_movie, similar_movie, title_similarity, user_overlap,
                 [tag IN source_tags WHERE tag IN similar_tags] AS shared_tags

            // Get shared genres
            WITH source_movie, similar_movie, title_similarity, user_overlap, shared_tags
            OPTIONAL MATCH (source_movie)-[:IN_GENRE]->(g:Genre)<-[:IN_GENRE]-(similar_movie)
            WITH source_movie, similar_movie, title_similarity, user_overlap, shared_tags,
                 collect(DISTINCT g.name) AS shared_genres

            // Calculate final score and return
            WITH source_movie, similar_movie,
                 (title_similarity * 0.5) + (user_overlap * 0.3) + (size(shared_tags) * 0.2) AS final_score,
                 user_overlap, shared_tags, shared_genres
            RETURN source_movie.title AS source_title,
                   similar_movie.title AS title,
                   final_score, user_overlap, shared_tags, shared_genres
            ORDER BY final_score DESC
            LIMIT $num_recommendations
            """,
            {
                "k": 20,  # Get more initial candidates to refine
                "embedding": title_embedding,
                "num_recommendations": int(num_recommendations),  # Ensure it's an integer
            },
        )
        records = list(result)

    if not records:
        return None, []

    source_title = records[0]["source_title"]
    recommendations = [
        {
            "title": r["title"],
            "score": r["final_score"],
            "evidence": {
                "user_overlap": r["user_overlap"],
                "shared_tags": r["shared_tags"],
                "shared_genres": r["shared_genres"],
            },
        }
        for r in records
    ]
    return source_title, recommendations


def visualize_recommendations(
    source_title: str, recommendations: List[Dict[str, Any]]
) -> Network:
    """
    Generates a pyvis graph to visualize the relationships between the source
    movie and its recommendations based on shared genres.

    Args:
        source_title (str): The title of the source movie.
        recommendations (list): A list of recommendation dictionaries.

    Returns:
        pyvis.network.Network: A pyvis Network object representing the graph.
    """
    net = Network(height="600px", width="100%", notebook=True, cdn_resources="in_line")

    # Add the source movie node
    net.add_node(
        source_title, label=source_title, color="orange", size=25, font={"size": 16}
    )

    # Keep track of genres already added
    added_genres = set()

    for rec in recommendations:
        rec_title = rec["title"]
        evidence = rec["evidence"]

        # Add the recommended movie node
        net.add_node(rec_title, label=rec_title, color="lightblue", size=15)

        # Add a direct edge from source to recommendation
        net.add_edge(
            source_title,
            rec_title,
            value=rec["score"],
            title=f"Score: {rec['score']:.2f}\nUsers: {evidence['user_overlap']}\nTags: {', '.join(evidence['shared_tags'])}",
            color="#cccccc",
        )

        # Add genre nodes and edges
        for genre in evidence.get("shared_genres", []):
            if genre not in added_genres:
                net.add_node(genre, label=genre, color="lightgreen", size=10)
                added_genres.add(genre)
            # Connect movies to their shared genres
            net.add_edge(source_title, genre, color="lightgrey", width=2)
            net.add_edge(rec_title, genre, color="lightgrey", width=2)

    net.set_options(
        """
        var options = {
          "physics": {
            "stabilization": {
              "enabled": true,
              "iterations": 1000,
              "updateInterval": 25
            },
            "barnesHut": {
              "gravitationalConstant": -8000,
              "centralGravity": 0.3,
              "springLength": 250,
              "springConstant": 0.05,
              "damping": 0.09,
              "avoidOverlap": 0.1
            }
          }
        }
        """
    )
    return net
What does the recommendation algorithm return for a couple of example movie titles? The first example will be “Jurassic Park”, which should return recommendations based on that title, while the second will be a more generic search for “world war”, to see how well the algorithm handles partial matches.
Show the code
# Example usage
movie_title = "Jurassic Park"
source, recommendations = recommend_movies(movie_title)
if source:
    md = f"Recommendations based on '{source}':\n\n"
    for rec in recommendations:
        md += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
    print(md)
    # Visualize the recommendations
    net = visualize_recommendations(source, recommendations)
    net.save_graph("jurassic_park_recommendations.html")
else:
    print("No recommendations found.")

# Another example with a non-exact title
movie_title = "world war"
source, recommendations = recommend_movies(movie_title)
if source:
    md = f"Recommendations based on '{source}':\n"
    for rec in recommendations:
        md += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
    print(md)
    # Visualize the recommendations
    net = visualize_recommendations(source, recommendations)
    net.save_graph("world_war_recommendations.html")
else:
    print("No recommendations found.")
Recommendations based on 'Jurassic World: Fallen Kingdom (2018)':
- Lost World: Jurassic Park, The (1997) (Score: 0.4793)
- Jurassic World (2015) (Score: 0.4699)
- Jurassic Park III (2001) (Score: 0.4665)
- Jurassic Park (1993) (Score: 0.4661)
- Dinotopia (2002) (Score: 0.4473)
Recommendations based on 'War of the Worlds (2005)':
- War of the Worlds (2005) (Score: 0.4975)
- War of the Worlds, The (1953) (Score: 0.4931)
- Lord of War (2005) (Score: 0.4761)
- In Love and War (1996) (Score: 0.4715)
- Reign of Fire (2002) (Score: 0.4703)
The recommendations returned by the algorithm are a pretty reasonable match to the input titles, and include a mix of movies that are semantically similar, have shared tags, and are liked by users who also liked the source movie.
Let us also visualize the recommendations for “Jurassic Park” to see what the relationships between the source movie and its recommendations look like as a graph.
And for the “world war” query, we can see how the algorithm handles a more abstract search, returning movies that are related to the themes of world wars.
Finding movies by description
Further down we will also need a method to find movies based on a natural language description of their themes or content. This will allow us to search for movies using more abstract queries, such as “space travel and aliens” or “funny romantic movies”. We will use the same embedding model to compute an embedding for the input description and then query Neo4j for matching movies using a cosine similarity search.
Note
Neo4j supports vector search using the db.index.vector.queryRelationships procedure, which allows us to find relationships that are similar to a given embedding.
Show the code
def find_movies_by_description(
    description: str, num_results: int = 10
) -> List[Tuple[str, float]]:
    """
    Finds movies based on a natural language description of their themes or content.

    Args:
        description (str): The descriptive search query.
        num_results (int): The number of movies to return.

    Returns:
        list: A list of tuples, each containing a movie title and its relevance score.
    """
    # 1. Compute the embedding for the input description (plain list for the Neo4j driver)
    description_embedding = embedding_model.embed_batch([description])[0].tolist()

    # 2. Query Neo4j for movies based on tag similarity
    with driver.session() as sess:
        result = sess.run(
            """
            // Find top K similar tags via vector search
            CALL db.index.vector.queryRelationships('tag_embeddings', $k, $embedding)
            YIELD relationship AS t, score

            // Find the movies associated with those tags
            WITH t, score
            MATCH (m:Movie)<-[t]-()

            // Aggregate scores for each movie
            WITH m, sum(score) AS total_score, count(t) AS matching_tags

            // Return top N movies ranked by score
            RETURN m.title AS title, total_score, matching_tags
            ORDER BY total_score DESC
            LIMIT $num_results
            """,
            {
                "k": 20,  # Find top 20 tags to broaden the search space
                "embedding": description_embedding,
                "num_results": int(num_results),
            },
        )
        movies = [(r["title"], r["total_score"]) for r in result]
    return movies
Let’s try this method with a couple of example queries to see how well it can find movies based on their tags. The first query will be “space travel and aliens”, which should return movies related to those themes, and the second query will be “funny romantic movies”, which should return light-hearted romantic comedies.
This will be heavily dependent on the tags that users have applied to movies in the dataset, so the results may vary based on tag availability.
Show the code
# Example usage:
search_query = "space travel and aliens"
movies = find_movies_by_description(search_query)
md = f"Movies found for '{search_query}':\n\n"
if movies:
    for title, score in movies:
        md += f"- {title} (Relevance: {score:.4f})\n"
else:
    md += "No matching movies found."
print(md)

# Another example:
search_query = "funny romantic movies"
movies = find_movies_by_description(search_query)
md = f"Movies found for '{search_query}':\n\n"
if movies:
    for title, score in movies:
        md += f"- {title} (Relevance: {score:.4f})\n"
else:
    md += "No matching movies found."
print(md)
Movies found for 'space travel and aliens':
- Day the Earth Stood Still, The (1951) (Relevance: 0.7658)
- Thing from Another World, The (1951) (Relevance: 0.7658)
- Astronaut's Wife, The (1999) (Relevance: 0.7658)
- Independence Day (a.k.a. ID4) (1996) (Relevance: 0.7658)
- Men in Black (a.k.a. MIB) (1997) (Relevance: 0.7658)
- Arrival, The (1996) (Relevance: 0.7658)
- Alien (1979) (Relevance: 0.7658)
- My Stepmother Is an Alien (1988) (Relevance: 0.7658)
- E.T. the Extra-Terrestrial (1982) (Relevance: 0.7658)
- Signs (2002) (Relevance: 0.7433)
Movies found for 'funny romantic movies':
- Son of Rambow (2007) (Relevance: 0.7653)
- Titanic (1997) (Relevance: 0.7422)
- Harold and Maude (1971) (Relevance: 0.7266)
- Punchline (1988) (Relevance: 0.7195)
- Personal Velocity (2002) (Relevance: 0.7180)
- Corrina, Corrina (1994) (Relevance: 0.7165)
- Monty Python's The Meaning of Life (1983) (Relevance: 0.7077)
- Sunset Blvd. (a.k.a. Sunset Boulevard) (1950) (Relevance: 0.7023)
- State and Main (2000) (Relevance: 0.7023)
- Dr. Strangelove or: How I Learned to Stop Worrying and Love the Bomb (1964) (Relevance: 0.6999)
Integrating with a Large Language Model
To enhance our recommendation system, we can integrate a large language model (LLM) to generate natural language explanations for why certain movies are recommended. This will provide users with more context and reasoning behind the recommendations, making the system more user-friendly and informative.
To do this, we will use the Google Gemini API to generate explanations based on the recommendations. The LLM will take the source movie, the recommended movie, and the evidence gathered from the graph (shared genres, user overlap, and shared tags) to create a friendly and concise explanation.
We could use any other language model, including a local small language model, but we will leave that as an exercise for the reader.
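For instance, a minimal sketch using a locally served model through the ollama Python package (assuming a pulled llama3 model; only the generation call changes relative to the Gemini version below):
import ollama

def generate_explanation_local(prompt: str) -> str:
    # Same prompt as the Gemini version, answered by a local model instead
    response = ollama.chat(
        model="llama3",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]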
Agent model and tool use
The way we will implement this is by defining a function that takes the source movie title and a list of recommended movies, gathers the necessary evidence from the graph, and then uses the Gemini API to generate explanations for each recommendation. The function will return a list of dictionaries containing the recommended movie title and its explanation.
These methods will then be used as part of a conversational agent that can answer movie-related queries and provide recommendations based on user input. The agent will have these methods as tools it can call to provide answers, making it capable of handling a wide range of movie-related questions.
Show the code
import google.generativeai as genai

# Check for Google API key, preferring GEMINI_API_KEY
api_key = os.getenv("GEMINI_API_KEY") or os.getenv("GOOGLE_API_KEY")
if not api_key:
    print(
        "API key not found. Please set the GEMINI_API_KEY or GOOGLE_API_KEY environment variable."
    )
    gemini_model = None
else:
    genai.configure(api_key=api_key)
    gemini_model = genai.GenerativeModel("gemini-2.5-flash")

def generate_recommendation_explanations(
    source_title: str, recommendations: List[Dict[str, Any]]
) -> List[Dict[str, str]]:
    """
    Generates natural language explanations for recommendations using the Gemini API.

    Returns:
        list: A list of dictionaries, each containing a 'title' and 'explanation'.
    """
    if not gemini_model:
        print("Gemini model not initialized. Cannot generate explanations.")
        return []

    explained_recommendations = []
    for rec in recommendations:
        recommended_title = rec["title"]
        evidence = rec["evidence"]
        if not evidence:
            continue

        # Construct a detailed prompt for the LLM
        prompt = f"""
        You are a movie recommendation assistant. Your task is to provide a compelling,
        one-paragraph explanation for a movie recommendation.

        Here is the data you should use:
        - The user liked the movie: "{source_title}"
        - We are recommending the movie: "{recommended_title}"
        - These two movies share the following genres: {', '.join(evidence['shared_genres'])}
        - We found that {evidence['user_overlap']} users who gave a high rating to "{source_title}"
          also gave a high rating to "{recommended_title}".
        - They also share common themes, as shown by these shared tags: {', '.join(evidence['shared_tags'])}.

        Based on this evidence, please generate a friendly and concise paragraph explaining
        why someone who liked "{source_title}" would enjoy "{recommended_title}".
        Do not just list the data; weave it into a natural-sounding explanation.
        """
        try:
            response = gemini_model.generate_content(prompt)
            explanation = response.text
            explained_recommendations.append(
                {"title": recommended_title, "explanation": explanation}
            )
        except Exception as e:
            print(
                f"An error occurred while generating explanation for {recommended_title}: {e}"
            )
            explained_recommendations.append(
                {
                    "title": recommended_title,
                    "explanation": f"Could not generate explanation: {e}",
                }
            )
    return explained_recommendations
Let us test the recommendation explanations with a specific movie title and see how the LLM generates explanations for the recommendations. We will use the movie “Life Is a Long Quiet River” as an example, which should yield interesting recommendations based on its themes and user ratings.
Show the code
from IPython.display import Markdown, display
import pandas as pd

# Example usage with generative explanations
movie_title = "Life Is a Long Quiet River"
source, recommendations = recommend_movies(movie_title, num_recommendations=3)
if source:
    print(f"Recommendations based on '{source}':\n")
    explained_recs = generate_recommendation_explanations(source, recommendations)
    # Build and display a pandas DataFrame
    if explained_recs:
        df = pd.DataFrame(explained_recs)
        df.rename(
            columns={"title": "Recommended Movie", "explanation": "Explanation"},
            inplace=True,
        )
        styler = df.style.set_properties(
            **{"white-space": "normal", "text-align": "left"}
        )
        styler.set_table_styles([dict(selector="th", props=[("text-align", "left")])])
        display(styler)
    else:
        print("Could not generate explanations for the recommendations.")
else:
    print("No recommendations found.")
Recommendations based on 'Life Is a Long Quiet River (La vie est un long fleuve tranquille) (1988)':
| | Recommended Movie | Explanation |
|---|---|---|
| 0 | Life Is Beautiful (La Vita è bella) (1997) | Given your appreciation for the charming film *Life Is a Long Quiet River (La vie est un long fleuve tranquille)*, we believe you'll absolutely love *Life Is Beautiful (La Vita è bella)*. Both movies beautifully blend heartfelt narratives with a strong comedic spirit, ensuring a truly engaging viewing experience. In fact, at least one user who highly rated *Life Is a Long Quiet River* also gave a top rating to *Life Is Beautiful*, indicating a shared taste for films that navigate life's complexities with humor and warmth. |
| 1 | Train of Life (Train de vie) (1998) | Given your enjoyment of the distinct charm and comedic sensibilities found in "Life Is a Long Quiet River (La vie est un long fleuve tranquille)," we believe you might also appreciate "Train of Life (Train de vie)." Both films share the Comedy genre, suggesting that if you connected with the unique brand of humor and delightful storytelling in your previous favorite, you'll likely find similar entertainment and lighthearted moments aboard the journey presented in "Train of Life." |
| 2 | La cravate (1957) | If you enjoyed the uniquely French blend of humor and insightful observation in "Life Is a Long Quiet River," you might appreciate "La cravate" for its distinct approach to storytelling. While it offers a very different, more experimental journey, its captivating visual narrative and unique sensibility could appeal to viewers who appreciate the rich, varied, and often unexpected artistic expressions found within French cinema, inviting you to explore another fascinating corner of its diverse landscape. |
The agent
Now let us create the conversational agent that can use the tools we built previously to answer movie-related queries. The agent will be able to autonomously recommend movies and find movies by description.
Here is a diagram of our agent architecture:
graph TD
subgraph "User Interaction"
A[User Query]
end
subgraph "Conversational Agent"
B(converse_with_llm)
C{Tool Executor}
D[recommend_movies]
E[find_movies_by_description]
end
subgraph "Gemini API"
F(Gemini LLM)
end
subgraph "Neo4j Database"
G((Graph Database))
end
A --> B;
B -- "Query + Tools" --> F;
F -- "Function Call" --> C;
F -- "Direct Answer" --> H[Formatted Response];
C -- "Executes" --> D;
C -- "Executes" --> E;
D -- "Cypher Query" --> G;
E -- "Cypher Query" --> G;
G -- "Data" --> D;
G -- "Data" --> E;
D --> H;
E --> H;
H --> I[Display to User];
style A fill:#FFDAB9,stroke:#333,stroke-width:2px
style F fill:#ADD8E6,stroke:#333,stroke-width:2px
style G fill:#90EE90,stroke:#333,stroke-width:2px
Show the code
# Conversational Agent with Tool Use
def converse_with_llm(query: str) -> str:
    """
    A conversational agent that can use tools to answer movie-related queries.
    """
    if not gemini_model:
        return "Gemini model not initialized. Cannot process query."

    # The user's query
    print(f"User query: '{query}'")

    # Give the model the available tools
    response = gemini_model.generate_content(
        query,
        tools=[recommend_movies, find_movies_by_description],
        generation_config={"max_output_tokens": 250},
    )

    # Check if the model decided to call a tool
    if not response.candidates[0].content.parts:
        return "I'm sorry, I couldn't find a suitable tool to answer your question."

    part = response.candidates[0].content.parts[0]
    if part.function_call:
        function_call = part.function_call
        function_name = function_call.name
        function_args = dict(function_call.args)
        print(
            f"LLM decided to call tool '{function_name}' with arguments: {function_args}\n"
        )

        # --- Call the chosen function ---
        if function_name == "recommend_movies":
            source_title, recommendations = recommend_movies(**function_args)
            if not recommendations:
                return f"Sorry, couldn't find recommendations for '{function_args.get('movie_title')}'"
            # Format the output
            output = f"Recommendations based on '{source_title}':\n\n"
            for rec in recommendations:
                output += f"- {rec['title']} (Score: {rec['score']:.4f})\n"
            return output
        elif function_name == "find_movies_by_description":
            movies = find_movies_by_description(**function_args)
            if not movies:
                return f"Sorry, couldn't find movies for description: '{function_args.get('description')}'"
            # Format the output
            output = f"Movies found for '{function_args.get('description')}':\n\n"
            for title, score in movies:
                output += f"- {title} (Relevance: {score:.4f})\n"
            return output
        else:
            return f"Error: Unknown function call '{function_name}'"
    else:
        # The model responded directly
        return response.text
And finally let us test it out with a couple of queries.
Show the code
# Using the recommendation tool
query1 = "Can you recommend some movies similar to 'Toy Story'?"
result1 = converse_with_llm(query1)
print(result1)

# Using the description search tool
query2 = "I want to watch a movie about space exploration and robots."
result2 = converse_with_llm(query2)
print(result2)
User query: 'Can you recommend some movies similar to 'Toy Story'?'
LLM decided to call tool 'recommend_movies' with arguments: {'movie_title': 'Toy Story'}
Recommendations based on 'Toy Story (1995)':
- Toy Story 2 (1999) (Score: 14.2874)
- Toy Story 3 (2010) (Score: 8.8880)
- Psycho (1960) (Score: 8.2628)
- The Lego Movie (2014) (Score: 3.1701)
- Dangerous Minds (1995) (Score: 1.9641)
User query: 'I want to watch a movie about space exploration and robots.'
LLM decided to call tool 'find_movies_by_description' with arguments: {'description': 'space exploration and robots'}
Movies found for 'space exploration and robots':
- Star Wars: Episode IV - A New Hope (1977) (Relevance: 2.0728)
- Iron Giant, The (1999) (Relevance: 0.7099)
- Terminator 2: Judgment Day (1991) (Relevance: 0.7099)
- Terminator, The (1984) (Relevance: 0.7099)
- Blade Runner (1982) (Relevance: 0.7099)
- A.I. Artificial Intelligence (2001) (Relevance: 0.7099)
- Short Circuit (1986) (Relevance: 0.7099)
- Apollo 13 (1995) (Relevance: 0.7073)
- Forbidden Planet (1956) (Relevance: 0.7073)
- 2001: A Space Odyssey (1968) (Relevance: 0.7073)
Final remarks
This experiment has demonstrated how to build a simple movie recommendation system using Neo4j and a large language model. We have imported a dataset of movies, ratings, and tags into a Neo4j graph database, created a recommendation algorithm that combines semantic similarity, collaborative filtering, and shared tags, and integrated a large language model to generate natural language explanations for the recommendations.
We also created a conversational agent that can answer movie-related queries and provide recommendations based on user input. The agent can autonomously recommend movies and find movies by description, integrating graph capabilities and natural language understanding.
A good further exercise would be to extend the source dataset with additional relationships such as actors and directors, and to enhance the recommendation algorithm to take these into account, for example by drawing on larger datasets such as the IMDb Non-Commercial dataset.