RAG, or Retrieval-Augmented Generation, is a technique that combines the strengths of large language models (LLMs) with external knowledge sources. In this experiment, we explore a practical use case for RAG using a graph database.
Combining RAG with a graph allows us to enhance the contextual understanding of the LLM by providing it with structured information drawn from the graph. When RAG is instead combined with a typical search index over unstructured data, the results can be less accurate or relevant because context and fine detail are lost.
The use case
In this experiment, we will build methods that can download issues from any GitHub repository, store them in a graph database, and then use RAG to query the issues and comments. The goal is to demonstrate how combining retrieval with a graph can improve the usefulness of information drawn from structured data.
Retrieving issues from GitHub
We will start by implementing the necessary methods to download issues from a GitHub repository, including their comments, users, labels, and events.
Show the code
from github import Github, UnknownObjectExceptionimport pandas as pdfrom tqdm.auto import tqdmimport requests_cachedef _get_user_data(user, users_data: dict):"""Safely retrieves user data and handles exceptions for non-existent users."""if user and user.login notin users_data:try: users_data[user.login] = {"id": user.id,"login": user.login,"name": user.name,"company": user.company,"location": user.location,"followers": user.followers,"created_at": user.created_at, }except UnknownObjectException:print(f"Could not retrieve full profile for user {user.login}. Storing basic info." )# Store basic info if the full profile is not available users_data[user.login] = {"id": user.id,"login": user.login,"name": None,"company": None,"location": None,"followers": -1,"created_at": None, }def download_issues( token: str, repo_name: str) ->tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:""" Download issues from a GitHub repository and return them as DataFrames. Args: token (str): GitHub personal access token. repo_name (str): Name of the repository in the format 'owner/repo'. Returns: tuple: DataFrames for issues, comments, users, labels, and events. """ os.makedirs(".data", exist_ok=True) requests_cache.install_cache(".data/github_cache", backend="sqlite", expire_after=4*3600 ) g = Github(token)ifnot g:raiseValueError("Invalid GitHub token or authentication failed.")# 2) Get repo and issues repo = g.get_repo(repo_name)ifnot repo:raiseValueError(f"Repository '{repo_name}' not found or access denied.") issues = repo.get_issues(state="all") # Paginated iteratorifnot issues:raiseValueError(f"No issues found in repository '{repo_name}'.") issue_data = [] issue_comments = [] issue_events = [] users_data = {} labels_data = {}for issue in tqdm( issues, total=issues.totalCount, desc=f"Downloading issues from {repo_name}" ):# Add all issue data to list issue_data.append( {"id": issue.id,"number": issue.number,"title": issue.title,"state": issue.state,"created_at": issue.created_at,"updated_at": issue.updated_at,"closed_at": issue.closed_at,"body": issue.body,"labels": [label.name for label in issue.labels],"assignees": [assignee.login for assignee in issue.assignees],"user": issue.user.login, } )# Add user data _get_user_data(issue.user, users_data)for assignee in issue.assignees: _get_user_data(assignee, users_data)# Add all comments to listfor comment in issue.get_comments(): issue_comments.append( {"issue_id": issue.id,"comment_id": comment.id,"user": comment.user.login,"created_at": comment.created_at,"updated_at": comment.updated_at,"body": comment.body, } )# Add comment user to users list _get_user_data(comment.user, users_data)# Add all labels to listfor label in issue.labels:if label.name notin labels_data: labels_data[label.name] = {"name": label.name,"color": label.color,"description": label.description, }# Add all events to listfor event in issue.get_events(): issue_events.append( {"issue_id": issue.id,"event_id": event.id,"actor": event.actor.login if event.actor elseNone,"event": event.event,"created_at": event.created_at, } )# Add event actor to users listif event.actor: _get_user_data(event.actor, users_data)return ( pd.DataFrame(issue_data), pd.DataFrame(issue_comments), pd.DataFrame(list(users_data.values())), pd.DataFrame(list(labels_data.values())), pd.DataFrame(issue_events), )
Retrieving hundreds of issues from a repository can take a while, so we will cache the results to avoid unnecessary API calls. In this case we will use the Farama-Foundation/Gymnasium repository as our data source - it is small enough to be manageable, but large enough to demonstrate the capabilities of our methods.
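The caching relies on requests_cache, which transparently stores GitHub API responses in a local SQLite file so that repeated runs are served from disk instead of hitting the API again. The sketch below shows that pattern in isolation, trimmed down from the full download code above; the cache path and expiry match the values used there, and the three-issue loop is only for illustration.

import os
from itertools import islice

import requests_cache
from github import Github

# Cache all HTTP responses for four hours in a local SQLite file;
# repeated calls within that window are answered from the cache.
os.makedirs(".data", exist_ok=True)
requests_cache.install_cache(".data/github_cache", backend="sqlite", expire_after=4 * 3600)

token = os.getenv("GITHUB_TOKEN")  # GitHub personal access token
g = Github(token)
repo = g.get_repo("Farama-Foundation/Gymnasium")

# get_issues returns a paginated iterator; pages are fetched lazily
for issue in islice(repo.get_issues(state="all"), 3):
    print(issue.number, issue.title)
    for comment in issue.get_comments():
        print("  comment by", comment.user.login)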
Show the code
import os

token = os.getenv("GITHUB_TOKEN")
repo_name = "Farama-Foundation/Gymnasium"

# Check if we already have the data
if os.path.exists(".data/issues.pkl"):
    print("Data already downloaded. Loading from pickle files.")
    issue_data = pd.read_pickle(".data/issues.pkl")
    issue_comments = pd.read_pickle(".data/comments.pkl")
    users_data = pd.read_pickle(".data/users.pkl")
    labels_data = pd.read_pickle(".data/labels.pkl")
    issue_events = pd.read_pickle(".data/events.pkl")
else:
    print("Downloading issues from GitHub...")
    issue_data, issue_comments, users_data, labels_data, issue_events = download_issues(
        token, repo_name
    )
    # Save all dataframes to pickle files under `.data`
    os.makedirs(".data", exist_ok=True)
    issue_data.to_pickle(".data/issues.pkl")
    issue_comments.to_pickle(".data/comments.pkl")
    users_data.to_pickle(".data/users.pkl")
    labels_data.to_pickle(".data/labels.pkl")
    issue_events.to_pickle(".data/events.pkl")
Data already downloaded. Loading from pickle files.
Computing embeddings
With the data at hand and loaded into pandas DataFrames, we can now compute embeddings for the content of issues and comments. We will use the Qwen/Qwen3-Embedding-0.6B pre-trained Transformer model to generate embeddings for the text data, as it provides a good balance between accuracy and performance for our use case. It can also handle long input texts, which means we can often use it without chunking the text.
Important
In a production setting, you would almost certainly want to chunk the text before embedding it. For the purpose of this experiment, we will keep things simple and embed content without chunking.
To keep our data size manageable, we will also truncate the embeddings to a fixed dimension of \(768\), which is a common size for many Transformer models. For larger datasets, you would want to consider using a larger embedding size.
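For reference, a production pipeline would typically split long issue bodies into overlapping chunks and embed each chunk separately. The sketch below shows one naive way to do that with a hypothetical chunk_text helper based on word counts; it is not used in this experiment, where whole texts are embedded and simply truncated to \(768\) dimensions via the model's truncate_dim option.

from typing import List

def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> List[str]:
    """Naive word-based chunking with overlap; real pipelines would use token counts."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_words]))
    return chunks

# Each chunk would then get its own embedding (and its own node or property in the graph):
# chunks = chunk_text(issue_body)
# chunk_embeddings = embedding_model.embed_batch(chunks)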
Show the code
# Method to compute embeddings using a given Transformer model
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
from typing import List


class EmbeddingModel:
    def __init__(
        self,
        model_name: str = "QWen/Qwen3-Embedding-0.6B",
        batch_size: int = 32,
        truncate_dim: int = None,
    ) -> None:
        # Use CUDA if available
        if torch.cuda.is_available():
            print("Using CUDA for embeddings.")
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            print("Using MPS for embeddings.")
            self.device = torch.device("mps")
        else:
            print("Using CPU for embeddings.")
            self.device = torch.device("cpu")
        self.model = SentenceTransformer(model_name, truncate_dim=truncate_dim).to(
            self.device
        )
        self.batch_size = batch_size

    def embed_batch(
        self, texts: List[str], desc: str = "Embedding batch"
    ) -> np.ndarray:
        """
        Embed a batch of texts using the SentenceTransformer model.

        Args:
            texts (List[str]): List of texts to embed.
            desc (str): Description for the tqdm progress bar.

        Returns:
            np.ndarray: Array of embeddings for the input texts.
        """
        all_embs = []
        self.model.to(self.device)
        with torch.no_grad():  # disable grads
            for i in tqdm(range(0, len(texts), self.batch_size), desc=desc):
                batch = texts[i : i + self.batch_size]
                # get a CPU numpy array directly
                embs = self.model.encode(
                    batch,
                    batch_size=len(batch),
                    show_progress_bar=False,
                    convert_to_tensor=False,  # returns numpy on CPU
                )
                all_embs.append(np.vstack(embs) if isinstance(embs, list) else embs)
        # free any CUDA scratch
        if self.device.type == "cuda":
            torch.cuda.empty_cache()
        return np.vstack(all_embs)


embedding_dim = 768  # Set the embedding dimension
embedding_model = EmbeddingModel(batch_size=2, truncate_dim=embedding_dim)
Using CUDA for embeddings.
Show the code
# Compute embeddings for issues and comments, including title and body
def compute_embeddings(
    df: pd.DataFrame, text_columns: List[str], desc: str = "Computing embeddings"
) -> np.ndarray:
    """
    Compute embeddings for specified text columns in a DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing the text data.
        text_columns (List[str]): List of column names to compute embeddings for.
        desc (str): Description for the tqdm progress bar.

    Returns:
        np.ndarray: Array of embeddings for the specified text columns.
    """
    texts = []
    for _, row in df.iterrows():
        text = " ".join(str(row[col]) for col in text_columns if pd.notna(row[col]))
        texts.append(text)
    return embedding_model.embed_batch(texts, desc=desc)
Just as with the issue data, we will cache the embeddings to avoid recomputing them every time we run the experiment. If the embeddings already exist in the DataFrame, we will load them from there to avoid unnecessary computation.
Show the code
recompute = False

# Check if embeddings already exist
if (
    "embeddings" in issue_data.columns
    and "embeddings" in issue_comments.columns
    and not recompute
):
    print("Embeddings already computed. Loading from DataFrame.")
else:
    print("Computing embeddings for issues and comments...")
    issue_text_columns = ["title", "body"]
    issue_embeddings = compute_embeddings(
        issue_data, issue_text_columns, "Computing issue embeddings"
    )
    comment_text_columns = ["body"]
    comment_embeddings = compute_embeddings(
        issue_comments, comment_text_columns, "Computing comment embeddings"
    )
    # Add embeddings to DataFrames
    issue_data["embeddings"] = list(issue_embeddings)
    issue_comments["embeddings"] = list(comment_embeddings)
    # Save dataframes back to pickle files
    issue_data.to_pickle(".data/issues.pkl")
    issue_comments.to_pickle(".data/comments.pkl")
Embeddings already computed. Loading from DataFrame.
The raw issue data
Let us take a look at the raw data we have collected and processed so far. We will display a sample of each DataFrame to get an overview of the data structure and content.
The issue data contains information about each issue, including its title, body, state, labels, assignees, and the user who raised it. Also note the computed embeddings for the issue text.
Show the code
# Show a sample for each dataframe
print("Sample issue data:")
issue_data.sample(5)
Sample issue data:
| | id | number | title | state | created_at | updated_at | closed_at | body | labels | assignees | user | embeddings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 853 | 1764400983 | 560 | [Proposal] Check render_mode for RecordVideo w... | closed | 2023-06-20 00:21:09+00:00 | 2023-07-03 09:46:40+00:00 | 2023-07-03 09:46:40+00:00 | ### Proposal\r\n\r\nRight now, when you wrap a... | [enhancement] | [] | robertoschiavone | [0.011846603, 0.0290194, -0.0031601943, -0.075... |
| 1262 | 1457055647 | 150 | Versioned Action wrappers which supports jumpy | closed | 2022-11-20 21:37:48+00:00 | 2022-12-01 20:36:12+00:00 | 2022-12-01 20:36:12+00:00 | # Description\r\n\r\nThis PR add support for j... | [] | [] | gianlucadecola | [0.030979421, 0.041525118, -0.011083876, -0.07... |
| 594 | 2030125938 | 820 | [Bug Report] [documentation] The LaTeX math is... | closed | 2023-12-07 08:02:45+00:00 | 2023-12-17 09:38:24+00:00 | 2023-12-17 09:38:23+00:00 | ### Describe the bug\n\nexample:\r\nhttps://gy... | [bug] | [] | Kallinteris-Andreas | [0.0021287652, 0.048698895, -0.005818796, -0.0... |
| 1106 | 1568043962 | 307 | Update test_vector_make.py | closed | 2023-02-02 13:16:09+00:00 | 2023-02-02 15:26:54+00:00 | 2023-02-02 15:26:53+00:00 | # Description\r\nThe code has been updated to ... | [] | [] | MiChaelinzo | [0.037689343, 0.008772637, -0.006663781, -0.03... |
| 127 | 2774346221 | 1288 | Add `wrappers.vector.TransformObs/Action` sing... | closed | 2025-01-08 05:39:52+00:00 | 2025-01-12 12:43:55+00:00 | 2025-01-12 12:43:55+00:00 | # Description\r\n\r\nFixes #1287\r\n\r\n## Typ... | [] | [] | howardh | [-0.0051537124, 0.0085140485, -0.0072919703, -... |
Comments on issues include the comment body, creation and update timestamps, and the user who made the comment. The embeddings for the comment text are also included.
User data contains information about the users who raised issues or made comments, including their login, name, company, location, followers count, and account creation date.
Show the code
print("\nSample users data:")users_data.sample(5)
Sample users data:
| | id | login | name | company | location | followers | created_at |
|---|---|---|---|---|---|---|---|
| 62 | 66969704 | mariovas3 | Mario Vasilev | None | London, United Kingdom | 2 | 2020-06-15 18:43:17+00:00 |
| 48 | 6186430 | abouelsaadat | Mohamed Abouelsaadat | None | None | 0 | 2013-12-14 18:34:14+00:00 |
| 30 | 64679842 | is-jang | 장인성 (Insung Jang) | None | Busan, Republic of Korea | 2 | 2020-05-02 06:53:53+00:00 |
| 118 | 36020639 | RuizhouLiu | None | None | None | 1 | 2018-02-01 02:07:30+00:00 |
| 450 | 18716355 | YangyangFu | yyf | Texas A$M University | None | 47 | 2016-04-28 08:40:11+00:00 |
Labels associated with issues include their name, color, and description.
Finally, issue events include information about events related to issues, such as when an issue was opened, closed, or commented on. The events also include the user who triggered the event and the timestamp of the event.
Once loaded into the graph, the data will follow the data model shown below.
%%{
init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px'
}
}
}%%
graph TD
subgraph "Graph Data Model"
U[User]
I[Issue]
C[Comment]
L[Label]
E[Event]
U -- "RAISED_BY" --> I
U -- "ASSIGNED_TO" --> I
U -- "COMMENT_BY" --> C
U -- "EVENT_BY" --> E
I -- "HAS_LABEL" --> L
I -- "MIGHT_RELATE_TO" --> I
C -- "COMMENT_ON" --> I
E -- "EVENT_ON" --> I
end
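To give a feel for how this model supports traversal-style questions, here is an illustrative query over it. This snippet is not part of the original notebook: it assumes the driver connection and data import set up in the following sections, and the login value is simply taken from the user sample above.

# Example traversal over the model: issues a given user has commented on, with their labels.
with driver.session() as session:
    result = session.run(
        """
        MATCH (u:User {login: $login})<-[:COMMENT_BY]-(c:Comment)-[:COMMENT_ON]->(i:Issue)
        OPTIONAL MATCH (i)-[:HAS_LABEL]->(l:Label)
        RETURN i.title AS title, collect(DISTINCT l.name) AS labels
        LIMIT 10
        """,
        login="Kallinteris-Andreas",  # login taken from the sample data above
    )
    for record in result:
        print(record["title"], record["labels"])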
Setting up the Neo4j graph database
We will use Neo4j Aura to store our issue data in a graph database. Neo4j Aura is a fully managed cloud service that provides a Neo4j database instance, which we can use to store and query our data. A number of helper methods are needed to set up the database schema, create constraints and indexes, and clear the database if needed.
First, we will connect to the Neo4j Aura instance using the neo4j Python driver. Make sure you have the necessary environment variables set for the connection.
Show the code
from neo4j import GraphDatabase, basic_auth, Driver, Session, Transaction, Record
from neo4j.graph import Graph

URI = os.getenv("NEO4J_URI")
USER = os.getenv("NEO4J_USERNAME")
PASSWORD = os.getenv("NEO4J_PASSWORD")
AUTH = (USER, PASSWORD)

print(f"Connecting to Neo4j at {URI} with user {USER}")
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()


def test_aura_connection() -> None:
    with driver.session() as session:
        result = session.run("RETURN 'Hello, Aura!' AS message")
        record = result.single()
        print(record["message"])  # should print "Hello, Aura!"


test_aura_connection()
Connecting to Neo4j at neo4j+s://8c1ab3e4.databases.neo4j.io with user neo4j
Hello, Aura!
We then need a few additional methods to manage the database schema, including dropping existing constraints and indexes, clearing the database, and creating new constraints and vector indexes for the issue and comment embeddings whenever we re-run the experiment. This is useful to ensure that we start with a clean slate and can easily modify the schema if needed.
Show the code
def drop_schema(tx: Transaction) -> None:
    # Drop constraints
    for record in tx.run("SHOW CONSTRAINTS"):
        name = record["name"]
        tx.run(f"DROP CONSTRAINT `{name}`")
    # Drop indexes
    for record in tx.run("SHOW INDEXES"):
        name = record["name"]
        tx.run(f"DROP INDEX `{name}`")


def clear_database(tx: Transaction) -> None:
    # Drop all nodes and relationships
    tx.run("MATCH (n) DETACH DELETE n")


def create_constraints(tx: Transaction) -> None:
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (i:Issue) REQUIRE i.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (c:Comment) REQUIRE c.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (l:Label) REQUIRE l.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (e:Event) REQUIRE e.id IS UNIQUE")
To store and query embeddings efficiently (the core of our RAG approach), we will create vector indexes for the issue and comment embeddings. The embedding dimensions will need to match the model’s output dimensions.
Show the code
def create_issue_vector_index(tx: Transaction, embedding_dim: int = 768) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `issue_embeddings` IF NOT EXISTS
        FOR (i:Issue) ON i.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: $embedding_dim,
            `vector.similarity_function`: 'cosine'}}
        """,
        embedding_dim=embedding_dim,
    )


def create_comment_vector_index(tx: Transaction, embedding_dim: int = 768) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `comment_embeddings` IF NOT EXISTS
        FOR (c:Comment) ON c.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: $embedding_dim,
            `vector.similarity_function`: 'cosine'}}
        """,
        embedding_dim=embedding_dim,
    )
Show the code
with driver.session(database="neo4j") as session:
    # Clear the database
    print("Clearing the database...")
    session.execute_write(clear_database)
    # Drop existing schema
    print("Dropping existing schema...")
    session.execute_write(drop_schema)
    # Create new constraints
    print("Creating new constraints...")
    session.execute_write(create_constraints)
    # Create vector indexes
    print("Creating vector indexes...")
    session.execute_write(create_issue_vector_index, embedding_dim=embedding_dim)
    session.execute_write(create_comment_vector_index, embedding_dim=embedding_dim)

print("Schema updated successfully.")
Clearing the database...
Dropping existing schema...
Creating new constraints...
Creating vector indexes...
Schema updated successfully.
Importing data into Neo4j
We still need a few methods to import issue data. These methods will handle the insertion of users, labels, issues, comments, and events into the Neo4j database in batches. This is important for performance, especially when dealing with large datasets.
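The import follows a standard Neo4j batching pattern: pass a list of row dictionaries as a query parameter and UNWIND it inside a single write transaction, using MERGE so the import stays idempotent. Below is a trimmed sketch of that pattern for the User nodes; the full import code that follows sets more properties and does the same for labels, issues, comments, and events.

def _load_users_batch_sketch(tx: Transaction, batch: list) -> None:
    # One round-trip per batch: UNWIND expands the parameter list into rows,
    # and MERGE keyed on the unique user id makes re-runs safe.
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User {id: row.id})
        SET u.login = row.login, u.followers = row.followers
        """,
        batch=batch,
    )

with driver.session() as session:
    for i in range(0, len(users_data), 128):
        batch = users_data.iloc[i : i + 128].to_dict("records")
        session.execute_write(_load_users_batch_sketch, batch)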
Show the code
from typing import List, Dict, Anyimport pandas as pddef _load_users_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (u:User {id: row.id}) SET u.login = row.login, u.name = row.name, u.company = row.company, u.location = row.location, u.followers = row.followers, u.created_at = CASE WHEN row.created_at IS NOT NULL THEN datetime(row.created_at) ELSE null END """, batch=batch, )def import_users_batched( session: Session, users_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(users_df), batch_size), desc="Importing users"): batch = users_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_users_batch, batch)def _load_labels_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (l:Label {name: row.name}) SET l.color = row.color, l.description = row.description """, batch=batch, )def import_labels_batched( session: Session, labels_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(labels_df), batch_size), desc="Importing labels"): batch = labels_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_labels_batch, batch)def _load_issues_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (i:Issue {id: row.id}) SET i.number = row.number, i.title = row.title, i.state = row.state, i.body = row.body, i.created_at = datetime(row.created_at), i.updated_at = datetime(row.updated_at), i.closed_at = CASE WHEN row.closed_at IS NOT NULL THEN datetime(row.closed_at) ELSE null END, i.embedding = row.embeddings WITH i, row MERGE (u:User {login: row.user}) MERGE (i)-[:RAISED_BY]->(u) WITH i, row UNWIND row.labels AS labelName MERGE (l:Label {name: labelName}) MERGE (i)-[:HAS_LABEL]->(l) WITH i, row UNWIND row.assignees AS assigneeLogin MERGE (a:User {login: assigneeLogin}) MERGE (i)-[:ASSIGNED_TO]->(a) """, batch=batch, )def import_issues_batched( session: Session, issues_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(issues_df), batch_size), desc="Importing issues"): batch = issues_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_issues_batch, batch)def _load_comments_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (c:Comment {id: row.comment_id}) SET c.body = row.body, c.created_at = datetime(row.created_at), c.updated_at = datetime(row.updated_at), c.embedding = row.embeddings WITH c, row MERGE (i:Issue {id: row.issue_id}) MERGE (c)-[:COMMENT_ON]->(i) WITH c, row MERGE (u:User {login: row.user}) MERGE (c)-[:COMMENT_BY]->(u) """, batch=batch, )def import_comments_batched( session: Session, comments_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(comments_df), batch_size), desc="Importing comments"): batch = comments_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_comments_batch, batch)def _load_events_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (e:Event {id: row.event_id}) SET e.event = row.event, e.created_at = datetime(row.created_at) WITH e, row MERGE (i:Issue {id: row.issue_id}) MERGE (e)-[:EVENT_ON]->(i) WITH e, row WHERE row.actor IS NOT NULL MERGE (u:User {login: row.actor}) MERGE (e)-[:EVENT_BY]->(u) """, batch=batch, )def import_events_batched( session: Session, events_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(events_df), batch_size), 
desc="Importing events"): batch = events_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_events_batch, batch)
We can now import the data, creating nodes and relationships for users, labels, issues, comments, and events in the graph database.
Show the code
with driver.session() as session:
    # Import data
    print("Importing data...")
    import_users_batched(session, users_data)
    import_labels_batched(session, labels_data)
    import_issues_batched(session, issue_data)
    import_comments_batched(session, issue_comments)
    import_events_batched(session, issue_events)

print("Data imported successfully.")
Importing data...
Data imported successfully.
Visualising our graph
A picture is worth a thousand words, so we will use the Pyvis library to visualize the graph we have created in Neo4j. It is a great way to get an immediate, intuitive understanding of the data and its relationships. Let us quickly create a method that converts the Neo4j graph object into a Pyvis Network object, which we can then visualize.
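At its core, the conversion only relies on two Pyvis calls, add_node and add_edge, keyed by the Neo4j element ids. The stripped-down sketch below shows just that skeleton, assuming a graph object as returned by result.graph() (as in the sampling query further down); the full method that follows adds titles, labels, sizing, and truncation.

from pyvis.network import Network

# Minimal conversion: one Pyvis node per Neo4j node, one edge per relationship.
net_sketch = Network(notebook=True, cdn_resources="in_line", height="750px", width="100%")
for node in graph.nodes:
    labels = list(node.labels)
    net_sketch.add_node(node.element_id, label=labels[0] if labels else "Node")
for rel in graph.relationships:
    net_sketch.add_edge(rel.start_node.element_id, rel.end_node.element_id, title=rel.type)
net_sketch.show("minimal_graph.html", notebook=True)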
Show the code
from pyvis.network import Networkimport pandas as pdfrom neo4j.graph import Node, Relationshipdef create_pyvis_network_from_neo4j(graph: Graph) -> Network:""" Creates a Pyvis Network object from a Neo4j graph object. """ net = Network( notebook=True, cdn_resources="in_line", height="750px", width="100%", bgcolor="#ffffff", font_color="black", )for node in graph.nodes: node_id = node.element_id labels =list(node.labels) group = labels[0] if labels else"Node" properties =dict(node)# Plain‑text title with newlines title_lines = [group]for k, v in properties.items():if k =="embedding":continueif k =="body"and v andlen(v) >512: v = v[:512] +"..." title_lines.append(f"{k}: {v}") title ="\n".join(title_lines)# Use a specific property for the label if available node_label =str( properties.get("title")or properties.get("name")or properties.get("login")or properties.get("id")or node_id )iflen(node_label) >30: node_label = node_label[:27] +"..." node_size =25if"Issue"in labels:# Make the node size relative to the number of related nodes related_nodes =len( [ relfor rel in graph.relationshipsif rel.start_node.element_id == node_idor rel.end_node.element_id == node_id ] ) node_size += related_nodes net.add_node( node_id, label=node_label, title=title, group=group, size=node_size )# Add edgesfor rel in graph.relationships: source_id = rel.start_node.element_id target_id = rel.end_node.element_id net.add_edge(source_id, target_id, title=rel.type, arrows="to", dashes=True)return net
To avoid overwhelming the visualization with too many nodes, we will sample a subset of the graph data, using Cypher's rand() function to select a limited number of issues and their related nodes. You can zoom in and out of the visualization and click on nodes to see their properties.
Show the code
# Query Neo4j to get a sample of the graph data
with driver.session() as session:
    result = session.run(
        """
        MATCH (i:Issue)
        WITH i, rand() AS r
        ORDER BY r
        LIMIT 50
        MATCH (i)-[rel]-(neighbor)
        RETURN i, rel, neighbor
        """
    )
    graph = result.graph()

net = create_pyvis_network_from_neo4j(graph)

# Configure physics and controls
net.toggle_physics(True)

# Save the visualization to HTML
net.show("graph_visualization.html", notebook=True)
graph_visualization.html
Computing similarity links
By comparing the embeddings of issues and their comments, we can create MIGHT_RELATE_TO relationships between issues that are semantically similar. This can help identify duplicate or related issues, and it helps us understand the context of a given issue by showing which other issues might contain important information about the problem at hand.
This will help us build a more connected graph, with more meaningful relationships between relevant problems.
Show the code
def create_similarity_links(tx: Transaction, min_score: float) -> int:
    result = tx.run(
        """
        // issue→issue and issue→comment similarities
        MATCH (i:Issue)
        CALL {
            WITH i
            CALL db.index.vector.queryNodes('issue_embeddings', 10, i.embedding)
            YIELD node AS similar_issue, score
            RETURN i AS issue, similar_issue, score
            UNION
            WITH i
            CALL db.index.vector.queryNodes('comment_embeddings', 10, i.embedding)
            YIELD node AS similar_comment, score
            MATCH (similar_issue:Issue)<-[:COMMENT_ON]-(similar_comment)
            RETURN i AS issue, similar_issue, score
        }
        WITH issue, similar_issue, score
        WHERE score >= $min_score AND elementId(issue) < elementId(similar_issue)
        WITH issue, similar_issue, max(score) AS max_score
        MERGE (issue)-[r:MIGHT_RELATE_TO]->(similar_issue)
        SET r.score = max_score

        // comment→issue similarities (no shadowing in import WITH)
        WITH issue
        MATCH (issue)<-[:COMMENT_ON]-(c:Comment)
        CALL {
            WITH c, issue
            CALL db.index.vector.queryNodes('issue_embeddings', 10, c.embedding)
            YIELD node AS similar_issue, score
            // alias here, not in the WITH
            RETURN issue AS parent_issue, similar_issue, score
        }
        // safely re-alias back to `issue`
        WITH parent_issue AS issue, similar_issue, score
        WHERE score >= $min_score AND elementId(issue) < elementId(similar_issue)
        WITH issue, similar_issue, max(score) AS max_score
        MERGE (issue)-[r:MIGHT_RELATE_TO]->(similar_issue)
        SET r.score = max_score
        """,
        min_score=min_score,
    )
    return result.consume().counters.relationships_created
We will set a minimum score threshold of 0.75 (keep in mind cosine similarity scores can range from -1 to 1) for the similarity links to avoid creating too many relationships that might not be meaningful. This threshold can be adjusted based on the specific use case and the quality of the embeddings.
Show the code
min_score_threshold = 0.75

with driver.session() as session:
    print(
        f"Creating MIGHT_RELATE_TO relationships between issues with score >= {min_score_threshold}..."
    )
    num_rels_created = session.execute_write(
        create_similarity_links, min_score=min_score_threshold
    )
    print(f"Created {num_rels_created} MIGHT_RELATE_TO relationships.")
Creating MIGHT_RELATE_TO relationships between issues with score >= 0.75...
Created 19282 MIGHT_RELATE_TO relationships.
Visualizing related issues
To visualize the relationships between issues, we can create a Pyvis network that includes the MIGHT_RELATE_TO relationships. This will help us see how issues are connected based on their semantic similarity. To further enhance the visualisation, we will also perform community detection on the graph, grouping similar issues together and colouring each community consistently.
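The community detection step a little further below relies on NetworkX's Louvain implementation. In isolation, the pattern looks like the toy example here: build a weighted graph, run louvain_communities, and map each node to a community id that is then used as its colour group. The node names and weights are made up for illustration.

import networkx as nx
from networkx.algorithms import community

# Toy example of the community detection step used later on.
G_toy = nx.Graph()
G_toy.add_edge("issue-1", "issue-2", weight=0.92)
G_toy.add_edge("issue-2", "issue-3", weight=0.85)
G_toy.add_edge("issue-4", "issue-5", weight=0.88)

communities = community.louvain_communities(G_toy, weight="weight", seed=42)
node_community = {n: i for i, comm in enumerate(communities) for n in comm}
print(node_community)  # e.g. {'issue-1': 0, 'issue-2': 0, 'issue-3': 0, 'issue-4': 1, 'issue-5': 1}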
Show the code
import networkx as nxfrom networkx.algorithms import communitydef create_pyvis_network_from_networkx( G: nx.Graph, node_community: dict, min_score_threshold: float) -> Network:""" Creates a Pyvis Network object from a NetworkX graph object, with community information. """ net = Network( notebook=True, cdn_resources="in_line", height="750px", width="100%", bgcolor="#ffffff", font_color="black", )# Add nodes to PyVis network with community informationfor node_id, properties in G.nodes(data=True): group = node_community.get(node_id, -1) # -1 for nodes not in any community# Plain‑text title with newlines title_lines = [f"Community: {group}"]for k, v in properties.items():if k =="embedding":continueif k =="body"and v andlen(v) >512: v = v[:512] +"..." title_lines.append(f"{k}: {v}") title ="\n".join(title_lines)# Use a specific property for the label if available node_label =str( properties.get("title")or properties.get("name")or properties.get("login")or properties.get("id")or node_id )iflen(node_label) >30: node_label = node_label[:27] +"..." net.add_node(node_id, label=node_label, title=title, group=group)# Add edgesfor source_id, target_id, properties in G.edges(data=True): rel_title = properties.get("type", "") edge_width =1if"score"in properties: score = properties["score"] rel_title =f"MIGHT_RELATE_TO (score: {score:.2f})"# Scale edge width based on score. edge_width =1+ (score - min_score_threshold) * (10/ (1- min_score_threshold) ) net.add_edge( source_id, target_id, title=rel_title, width=edge_width, arrows="to", dashes=True, )return net
Note how the new MIGHT_RELATE_TO relationships are established based on semantic similarity scores (represented by the thickness of the edges).
Show the code
# Create a NetworkX graph to perform community detectionG = nx.Graph()# Query Neo4j to get a sample of issues with MIGHT_RELATE_TO relationshipswith driver.session() as session: result = session.run(""" MATCH (i:Issue)-[rel:MIGHT_RELATE_TO]-(neighbor:Issue) WITH i, rel, neighbor, rand() as r ORDER BY r LIMIT 200 RETURN i, rel, neighbor """ )# Build the NetworkX graph from the query resultsfor record in result: node_i = record["i"] node_neighbor = record["neighbor"] rel = record["rel"] G.add_node(node_i.element_id, **dict(node_i)) G.add_node(node_neighbor.element_id, **dict(node_neighbor)) G.add_edge(node_i.element_id, node_neighbor.element_id, **dict(rel))# Detect communities using the Louvain methodcommunities = community.louvain_communities(G)# Create a mapping from node to community idnode_community = {}for i, comm inenumerate(communities):for node_id in comm: node_community[node_id] = inet_similar = create_pyvis_network_from_networkx(G, node_community, min_score_threshold)# Configure physics and controlsnet_similar.toggle_physics(True)# Save the visualization to HTMLnet_similar.show("might_relate_to_visualization.html", notebook=True)
might_relate_to_visualization.html
The resulting RAG graph
To find the most relevant issues to a query string, we can use the embeddings of both issues and comments. We will create a method that searches for the top-k matching issues based on a blended search of issues and comments, and returns a graph of their connections. This will allow us to build a RAG graph that can be used to answer questions about the issues and their related comments.
The Cypher query in the following method is a bit complex and requires some explanation. It starts by running two vector-search subqueries in parallel: one against the issue embeddings index and another against the comment embeddings index. It returns the top \(k\) matches from each search, pairing comment matches back to their parent issues so that you end up with a unified stream of issues scored by similarity to your input embedding.
Next, it orders every returned issue-score pair by descending score, wraps each issue node together with its score into a map, and deduplicates those maps so that each issue appears only once (preserving its highest score). It then slices that deduplicated list down to just the top \(k\) issues you want to focus on and re-materializes the actual Issue nodes by matching on their IDs.
Finally, for each of those top \(k\) issues the query pulls in any labels or “raised by” relationships, all comments on the issue, and the users who made those comments. It aggregates each issue’s related nodes and relationships, flattens everything into two big collections, and then unwinds and re‑collects them with DISTINCT to eliminate duplicates. The result is a clean subgraph containing exactly the top \(k\) semantically similar issues plus their immediate context.
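To make the structure easier to follow, here is just the first stage run on its own: the blended vector search over both indexes. This is a trimmed, illustrative extract; the full method below additionally deduplicates issues, keeps the best score per issue, and pulls in each issue's neighbourhood. The query text passed to the embedding model here is an arbitrary placeholder.

# Simplified first stage of the query: blend issue and comment vector search,
# mapping comment hits back to their parent issue. Duplicates are possible here;
# the full method deduplicates them.
with driver.session() as session:
    result = session.run(
        """
        CALL {
            CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding)
            YIELD node AS issue, score
            RETURN issue, score
            UNION
            CALL db.index.vector.queryNodes('comment_embeddings', $top_k, $embedding)
            YIELD node AS comment, score
            MATCH (comment)-[:COMMENT_ON]->(issue:Issue)
            RETURN issue, score
        }
        RETURN issue.title AS title, score
        ORDER BY score DESC
        LIMIT $top_k
        """,
        embedding=embedding_model.embed_batch(["example query"])[0].tolist(),
        top_k=5,
    )
    for record in result:
        print(f"{record['score']:.2f}  {record['title']}")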
Show the code
def get_rag_graph(tx: Transaction, query_string: str, top_k: int=5) -> Graph:""" Finds the most relevant issues to a query string by searching both issue and comment embeddings, and returns a graph of their connections. The graph contains: - The top-k matching issues based on a blended search of issues and comments. - For each of these issues: their comments, users who wrote them, and labels. """# Embed the query string query_embedding = embedding_model.embed_batch([query_string])[0].tolist()# Find the most relevant issues and build the graph result = tx.run(""" // Find top k issues from issue embeddings and from comment embeddings CALL { CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding) YIELD node AS issue, score RETURN issue, score UNION CALL db.index.vector.queryNodes('comment_embeddings', $top_k, $embedding) YIELD node AS comment, score MATCH (comment)-[:COMMENT_ON]->(issue:Issue) RETURN issue, score } // Combine, deduplicate, and select top k issues overall WITH issue, score ORDER BY score DESC WITH collect(issue {.*, score: score}) AS issues WITH [i in issues | i.id] AS issueIds, issues WITH [id IN issueIds | head([i IN issues WHERE i.id = id])] AS uniqueIssues WITH uniqueIssues[..$top_k] AS top_issues UNWIND top_issues as top_issue_data MATCH (top_issue:Issue {id: top_issue_data.id}) // Collect the top issues, their labels, and the users who raised them OPTIONAL MATCH (top_issue)-[r1:HAS_LABEL|RAISED_BY]->(n1) // Collect comments on the top issues and the users who made them OPTIONAL MATCH (top_issue)<-[r2:COMMENT_ON]-(c1:Comment)-[r3:COMMENT_BY]->(u1:User) // Aggregate all nodes and relationships per issue WITH top_issue, collect(DISTINCT n1) as nodes1, collect(DISTINCT r1) as rels1, collect(DISTINCT c1) + collect(DISTINCT u1) as nodes2, collect(DISTINCT r2) + collect(DISTINCT r3) as rels2 // Aggregate all nodes and relationships across all issues WITH collect(top_issue) + apoc.coll.flatten(collect(nodes1)) + apoc.coll.flatten(collect(nodes2)) as all_nodes, apoc.coll.flatten(collect(rels1)) + apoc.coll.flatten(collect(rels2)) as all_rels UNWIND all_nodes as n UNWIND all_rels as r RETURN collect(DISTINCT n) as nodes, collect(DISTINCT r) as relationships """, embedding=query_embedding, top_k=top_k, ) record = result.single()# Reconstruct the graph from nodes and relationships nodes = record["nodes"] relationships = record["relationships"]# Create a graph object to return# This is a bit of a hack, as we can't directly instantiate a Graph object easily# with nodes and relationships from the driver. We'll run a query that returns a graph.ifnot nodes:return Graph() node_ids = [n.element_id for n in nodes] graph_result = tx.run(""" MATCH (n) WHERE elementId(n) IN $node_ids OPTIONAL MATCH (n)-[r]-(m) WHERE elementId(n) IN $node_ids AND elementId(m) IN $node_ids RETURN n, r, m """, node_ids=node_ids, )return graph_result.graph()
Let’s see what the RAG graph looks like for a specific query. Note the default top_k value is set to 5, but you can adjust it to retrieve more or fewer issues based on your needs.
Show the code
query_string ="What are the dependencies necessary to run Atari environments ?"with driver.session() as session:print(f"Finding RAG graph for query: {query_string}") rag_graph = session.execute_read(get_rag_graph, query_string)print(f"Found {len(rag_graph.nodes)} nodes and {len(rag_graph.relationships)} relationships in the RAG graph." )
Finding RAG graph for query: What are the dependencies necessary to run Atari environments ?
Found 17 nodes and 26 relationships in the RAG graph.
Show the code
# Visualize the RAG graph using Pyvis
rag_net = create_pyvis_network_from_neo4j(rag_graph)
rag_net.toggle_physics(True)
rag_net.show("rag_graph_visualization.html", notebook=True)
rag_graph_visualization.html
The AI agent
Now that we understand how our graph is structured and how to retrieve relevant information from it, we can build an AI agent that can answer questions about issues in our graph. The agent will use the RAG graph to find relevant issues and comments, and then generate a textual summary of the information found.
We will use the Gemini API to interact with a large language model (LLM) that can process the textual summaries and generate answers to user queries. Google provides a convenient Python client library for the Gemini API (if you use Conda, you can install it with conda install google-genai).
First we need to set up the Gemini client with our API key. Make sure you have the GEMINI_API_KEY environment variable set with your key.
Show the code
from google import genai

# Configure the Gemini API key
gemini_api_key = os.getenv("GEMINI_API_KEY")
if not gemini_api_key:
    raise ValueError("GEMINI_API_KEY environment variable not set.")

genai_client = genai.Client(api_key=gemini_api_key)
We also need a method to convert the Neo4j graph into a textual summary that can be passed to the LLM. This summary will include information about the nodes and relationships in the graph, which will help the LLM understand the context of the issues and comments.
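The essence of this conversion is simply iterating over graph.nodes and graph.relationships and rendering each as a line of text. The minimal version below (the name graph_to_text_minimal is just for illustration) shows that core; the full method that follows produces nicer sentence-style descriptions and trims long bodies.

def graph_to_text_minimal(graph: Graph) -> str:
    # Bare-bones version: one line per node (labels plus non-embedding properties),
    # one line per relationship.
    lines = []
    for node in graph.nodes:
        props = {k: v for k, v in dict(node).items() if k != "embedding"}
        lines.append(f"{':'.join(node.labels)} {props}")
    for rel in graph.relationships:
        lines.append(
            f"({rel.start_node.element_id})-[:{rel.type}]->({rel.end_node.element_id})"
        )
    return "\n".join(lines)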
Show the code
def graph_to_textual_summary(graph: Graph) ->str:"""Converts a Neo4j graph into a descriptive narrative summary for an LLM."""# Use the driver-provided Graph object, which has .nodes and .relationshipsifnot graph.nodes:return"No information found for the query." descriptions = []# Describe each node in a natural-language sentencefor node in graph.nodes: labels =sorted(node.labels) props =dict(node)# Choose a human-friendly identifier identifier = ( props.get("title")or props.get("name")or props.get("login")or props.get("id")orstr(node.id) ) label_str =" and ".join(labels) if labels else"Node"# Build the sentence sentence =f"A {label_str} node identified as '{identifier}'"# Add other descriptive properties extras = []for key, value in props.items():if key in ("title", "name", "login", "id", "embedding"):continue text =str(value)if key =="body"and value andlen(value) >600: text = value[:600] +"..." extras.append(f"its {key} is '{text}'") sentence +=f" has {', '.join(extras)}"if extras else"" sentence = sentence.rstrip(", ") +"." descriptions.append(sentence)# Describe relationships in narrative formfor rel in graph.relationships: start = rel.start_node end = rel.end_node start_id = (dict(start).get("title")ordict(start).get("name")ordict(start).get("login")ordict(start).get("id")orstr(start.id) ) end_id = (dict(end).get("title")ordict(end).get("name")ordict(end).get("login")ordict(end).get("id")orstr(end.id) ) sentence =f"There is a relationship of type '{rel.type}' from node '{start_id}' to node '{end_id}'"if"score"in rel: score = rel["score"] sentence +=f" with a similarity score of {score:.2f}" descriptions.append(sentence +".")return"\n".join(descriptions)
We can use an example query to see how the information will be structured for the LLM.
Show the code
example_query ="What are the dependencies necessary to run Atari environments ?"with driver.session() as session: rag_graph = session.execute_read(get_rag_graph, example_query) summary = graph_to_textual_summary(rag_graph)print(summary)
A Issue node identified as 'Added Atari environments to tests, removed dead code' has its number is '78', its closed_at is '2022-10-26T20:41:39.000000000+00:00', its updated_at is '2022-10-26T20:41:40.000000000+00:00', its created_at is '2022-10-26T15:30:13.000000000+00:00', its state is 'closed', its body is '- Adds some atari environments to tested environments (if gym and ale are available)
- Removed definition of `minimum_testing_env_specs`, which was dead code, also didn't make sense (compared specs to strings, I think)
- Atari environments are currently not being tested for `render_mode` because `GymEnvironment` doesn't support that kwarg
Tests are currently not passing locally, which seems to be due to an unrelated problem'.
A User node identified as 'Markus Krimmel' has its followers is '21', its created_at is '2015-11-11T20:17:24.000000000+00:00'.
A Comment node identified as '1292409182' has its updated_at is '2022-10-26T18:02:09.000000000+00:00', its created_at is '2022-10-26T18:02:09.000000000+00:00', its body is 'Yeah, I will add that comment :)
I also considered fetching the ids from the (old) Gym registry, but I decided against it because that registry should not really change, given that Gym is no longer being maintained. Also, I would be somewhat worried that for some reason (e.g. ale not being installed) no Atari envs show up in the registry and the test is silently skipped.
Currently, neither this test, nor `test_gym_conversion` are in CI, because gym isn't being installed.'.
A Issue node identified as '[Question] Does the Pong game have speed in its actions?' has its number is '865', its closed_at is '2024-01-02T19:51:20.000000000+00:00', its updated_at is '2024-01-02T19:51:20.000000000+00:00', its created_at is '2024-01-02T15:56:36.000000000+00:00', its state is 'closed', its body is '### Question
The pong game has 6 basic actions. Noop, fire, right, rightfire, left, left fire. My question is do actions that have fire options (such as right fire) speed up the ball?
According to the AtariAge page, the red button in the actual controller adds some speed. Did you add this feature to the gymnasium?'.
A Issue node identified as 'Fix documentation ci' has its number is '417', its closed_at is '2023-03-30T13:49:19.000000000+00:00', its updated_at is '2023-03-30T13:49:19.000000000+00:00', its created_at is '2023-03-30T13:29:07.000000000+00:00', its state is 'closed', its body is 'https://github.com/Farama-Foundation/Gymnasium/pull/414 caused the documentation CI to fail due to the filter list not working as intended
Additionally, add the new atari environments to the list '.
A Issue node identified as 'Add all atari environments and remove pip install atari from documentation' has its number is '367', its closed_at is '2023-03-08T12:31:43.000000000+00:00', its updated_at is '2023-03-08T12:31:44.000000000+00:00', its created_at is '2023-03-08T12:29:06.000000000+00:00', its state is 'closed', its body is 'Previously, in generating the documentation, atari and autorom was installed which caused issues if the roms failed to install.
However, atari is not generate each time in the documentation so this was just causing issues for no reason
Furthermore, this PR add documentation for all of the atari environments (not including descriptions)'.
A Issue node identified as '[Bug Report] Cannot make an environment in env.registry' has its number is '152', its closed_at is '2022-11-23T14:03:14.000000000+00:00', its updated_at is '2022-11-23T14:53:50.000000000+00:00', its created_at is '2022-11-21T17:24:08.000000000+00:00', its state is 'closed', its body is '### Describe the bug
Hello,
When trying to make an environment in ``gym.registry`` I get a ``NameNotFound`` error, even though the environment should be found as I am picking the name from ``gym.registry``.
```python
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium as gym
>>> gym.make("YarsRevengeNoFrameskip-v4")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/marc/.venvs/bonsai-gym/lib/python3.10/site-packages/gymnasium/envs/regist...'.
A User node identified as 'Mark Towers' has its followers is '245', its created_at is '2015-10-03T12:42:08.000000000+00:00'.
A User node identified as 'Marco Zatta' has its followers is '0', its created_at is '2021-12-03T07:44:35.000000000+00:00', its company is '@microsoft'.
A Label node identified as 'bug' has its color is 'd73a4a', its description is 'Something isn't working'.
A Comment node identified as '1322786823' has its updated_at is '2022-11-23T14:53:50.000000000+00:00', its created_at is '2022-11-21T23:19:47.000000000+00:00', its body is 'Ale-py has not updated to gymnasium yet sadly.
I will update the release notes.
This will be fixed in the next release, in the meantime you can use `pip install shimmy[atari]` or using the gym compatibility Env (see the website info)'.
A Comment node identified as '1325117034' has its updated_at is '2022-11-23T14:03:14.000000000+00:00', its created_at is '2022-11-23T14:03:14.000000000+00:00', its body is 'Oh sorry, I got confused and did not consider that I was talking about a `gym` environment and not a `gymnasium` one. Sorry for that'.
A User node identified as 'gglsmm' has its followers is '0', its created_at is '2022-04-23T20:17:49.000000000+00:00', its location is 'USA'.
A Label node identified as 'question' has its color is 'd876e3', its description is 'Further information is requested'.
A Comment node identified as '1874411902' has its updated_at is '2024-01-02T18:48:57.000000000+00:00', its created_at is '2024-01-02T18:48:57.000000000+00:00', its body is 'https://gymnasium.farama.org/environments/atari/pong/#actions'.
A Comment node identified as '1874475704' has its updated_at is '2024-01-02T19:51:20.000000000+00:00', its created_at is '2024-01-02T19:51:20.000000000+00:00', its body is '@gglsmm I would guess so and from playing the environment, I believe so, as all the atari environment is a wrapper over the stella emulator which should run the actual pong ROM so it should play identically to the real thing '.
A User node identified as 'Kallinteris Andreas' has its followers is '37', its created_at is '2017-08-05T21:48:59.000000000+00:00'.
There is a relationship of type 'RAISED_BY' from node 'Added Atari environments to tests, removed dead code' to node 'Markus Krimmel'.
There is a relationship of type 'COMMENT_ON' from node '1292409182' to node 'Added Atari environments to tests, removed dead code'.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.80.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.91.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Add all atari environments and remove pip install atari from documentation' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.90.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Bug Report] Cannot make an environment in env.registry' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.87.
There is a relationship of type 'RAISED_BY' from node 'Add all atari environments and remove pip install atari from documentation' to node 'Mark Towers'.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Add all atari environments and remove pip install atari from documentation' to node '[Bug Report] Cannot make an environment in env.registry' with a similarity score of 0.86.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Add all atari environments and remove pip install atari from documentation' with a similarity score of 0.84.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node 'Add all atari environments and remove pip install atari from documentation' with a similarity score of 0.93.
There is a relationship of type 'RAISED_BY' from node 'Fix documentation ci' to node 'Mark Towers'.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node '[Bug Report] Cannot make an environment in env.registry' with a similarity score of 0.88.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Fix documentation ci' with a similarity score of 0.85.
There is a relationship of type 'RAISED_BY' from node '[Bug Report] Cannot make an environment in env.registry' to node 'Marco Zatta'.
There is a relationship of type 'HAS_LABEL' from node '[Bug Report] Cannot make an environment in env.registry' to node 'bug'.
There is a relationship of type 'COMMENT_ON' from node '1322786823' to node '[Bug Report] Cannot make an environment in env.registry'.
There is a relationship of type 'COMMENT_ON' from node '1325117034' to node '[Bug Report] Cannot make an environment in env.registry'.
There is a relationship of type 'RAISED_BY' from node '[Question] Does the Pong game have speed in its actions?' to node 'gglsmm'.
There is a relationship of type 'HAS_LABEL' from node '[Question] Does the Pong game have speed in its actions?' to node 'question'.
There is a relationship of type 'COMMENT_ON' from node '1874411902' to node '[Question] Does the Pong game have speed in its actions?'.
There is a relationship of type 'COMMENT_ON' from node '1874475704' to node '[Question] Does the Pong game have speed in its actions?'.
There is a relationship of type 'COMMENT_BY' from node '1292409182' to node 'Markus Krimmel'.
There is a relationship of type 'COMMENT_BY' from node '1874475704' to node 'Mark Towers'.
There is a relationship of type 'COMMENT_BY' from node '1322786823' to node 'Mark Towers'.
There is a relationship of type 'COMMENT_BY' from node '1325117034' to node 'Marco Zatta'.
There is a relationship of type 'COMMENT_BY' from node '1874411902' to node 'Kallinteris Andreas'.
This might seem like a lot of information, and it is not particularly easy to follow. Keep in mind, though, that it is not meant for human readers: it is structured so that the LLM can process it and generate coherent answers from it.
Tools for the agent
A key aspect of building an AI agent is defining the tools it can use to autonomously interact with, and potentially modify, its environment. In our case we will provide the agent with two tools: find_issues_from_prompt and find_experts. The first allows the agent to find issues in the graph based on a user prompt, while the second helps it identify potential experts based on their interactions with relevant issues.
In many cases tools just return further information to the agent, which it can then use to generate a response. However, in some cases the agent might need to take actions based on the information it retrieves, such as creating new issues or updating existing ones. In our case, we will focus on retrieving information and generating summaries.
Show the code
def find_issues_from_prompt(query_string: str) -> dict:
    """
    Finds potential issues from a user prompt, gets the graph matching the prompt,
    and returns a textual summary of the graph.

    Args:
        query_string: The user's query about issues.

    Returns:
        A dictionary containing a summary of the retrieved graph data.
    """
    print(f"Agent is calling find_issues_from_prompt with query: '{query_string}'")
    with driver.session() as session:
        rag_graph = session.execute_read(get_rag_graph, query_string)
        if rag_graph:
            print(
                f"Found {len(rag_graph.nodes)} nodes and {len(rag_graph.relationships)} relationships."
            )
            summary = graph_to_textual_summary(rag_graph)
            return {"summary": summary}
        else:
            return {"summary": "Could not find any relevant information in the graph."}
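Conceptually, find_experts only needs to count how often each user appears around the issues most relevant to the query. The stripped-down version below (find_experts_minimal is an illustrative name, not the article's tool) ranks users by how many of the top matching issues they raised or commented on; the full tool shown next also folds in the MIGHT_RELATE_TO neighbours of the best match and reports per-issue detail.

def find_experts_minimal(query_string: str, top_k: int = 5) -> list:
    # Rank users by how many of the top matching issues they raised or commented on.
    query_embedding = embedding_model.embed_batch([query_string])[0].tolist()
    with driver.session() as session:
        result = session.run(
            """
            CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding)
            YIELD node AS issue, score
            OPTIONAL MATCH (issue)-[:RAISED_BY]->(u1:User)
            OPTIONAL MATCH (issue)<-[:COMMENT_ON]-(:Comment)-[:COMMENT_BY]->(u2:User)
            WITH issue, collect(DISTINCT u1) + collect(DISTINCT u2) AS users
            UNWIND users AS user
            WITH user.login AS login, count(DISTINCT issue) AS interactions
            RETURN login, interactions
            ORDER BY interactions DESC
            LIMIT 5
            """,
            embedding=query_embedding,
            top_k=top_k,
        )
        return [dict(record) for record in result]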
Show the code
def find_experts(query_string: str) ->dict:""" Finds potential experts on a topic by analyzing who has contributed to the most relevant issues. Args: query_string: The user's query describing the topic of interest. Returns: A dictionary containing a summary of potential experts. """print(f"Agent is calling find_experts with query: '{query_string}'")# Embed the query string query_embedding = embedding_model.embed_batch([query_string])[0].tolist()# Find experts in the graphwith driver.session() as session: result = session.run(""" // Find the top matching issue for the query embedding CALL db.index.vector.queryNodes('issue_embeddings', 1, $embedding) YIELD node AS top_issue // Collect the top issue and up to 5 of its most similar issues WITH top_issue OPTIONAL MATCH (top_issue)-[r:MIGHT_RELATE_TO]-(related_issue:Issue) WITH top_issue, related_issue, r.score as score ORDER BY score DESC WITH top_issue, collect(related_issue)[..5] AS related_issues WITH [top_issue] + related_issues AS all_issues UNWIND all_issues as issue // Find all users who have interacted with these issues OPTIONAL MATCH (issue)<-[:RAISED_BY]-(u1:User) OPTIONAL MATCH (issue)<-[:ASSIGNED_TO]-(u2:User) OPTIONAL MATCH (issue)<-[:COMMENT_ON]-(:Comment)-[:COMMENT_BY]->(u3:User) // Aggregate and rank the users WITH issue, u1, u2, u3 WITH collect(u1) + collect(u2) + collect(u3) as users, issue UNWIND users as user WITH user, count(issue) as interactions, collect(DISTINCT {id: issue.id, title: issue.title}) as issues ORDER BY interactions DESC LIMIT 5 RETURN collect({user: user.login, interactions: interactions, issues: issues}) as experts """, embedding=query_embedding, ) experts = result.single()["experts"]if experts: summary ="Found the following potential experts based on their interactions with relevant issues:\n\n"for expert in experts: summary +=f"- User: {expert['user']} (Interactions: {expert['interactions']})\n"for issue in expert["issues"]: summary += (f" - Interacted with issue #{issue['id']}: {issue['title']}\n" )return {"summary": summary}else:return {"summary": "Could not find any potential experts for this topic."}
With these tools defined, we can now create an AI agent that can use them to answer user queries about issues in the graph. The agent will be able to find relevant issues based on user prompts and summarize the information found, as well as identify potential experts based on their interactions with relevant issues.
Note that the system_instruction in the GenerateContentConfig is crucial: it defines the agent's role and how it should use the tools provided. The agent is instructed to formulate its response strictly from the information returned by the tools, so that it does not make assumptions or generate information that is not present in the graph.
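Wiring the tools into the model uses the google-genai SDK's automatic function calling: plain Python functions are passed in tools, and the SDK inspects their signatures and docstrings and drives the tool-call loop. The reduced sketch below shows that configuration on its own; the system instruction is shortened here for illustration, while the full method that follows uses the detailed instruction described above.

from google.genai import types

# Reduced version of the agent call: pass the Python tool functions directly.
config = types.GenerateContentConfig(
    system_instruction="You answer questions about GitHub issues using the provided tools.",
    temperature=0.4,
    tools=[find_issues_from_prompt, find_experts],
)
response = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    config=config,
    contents="What are the dependencies necessary to run Atari environments ?",
)
print(response.text)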
Show the code
from google.genai import typesfrom IPython.display import display, Markdowndef converse_with_agent(user_prompt: str) ->str:""" Converse with the agent using a user prompt. Args: user_prompt: The user's query to the agent. Returns: The agent's response as a string. """ config = types.GenerateContentConfig( system_instruction=f"""You are an expert agent that can find issues in a Neo4j graph database based on user prompts for the {repo_name} repository. Use the find_issues_from_prompt tool to retrieve relevant issues, summarize them into two sections: ### Summary Provide a concise summary of the issues found in the graph based on the user prompt. ### Potential Issues List the potential issues that match the user prompt, including relevant details such as issue titles, labels, and any other pertinent information. ### Advice Provide any advice or recommendations based on the issues found in the graph. When using the find_experts tool, summarize the potential experts based on their interactions with relevant issues as a table, including their usernames and interaction counts, and a list of relevant issue titles and ID's. Strictly use only the information provided by the tool to formulate your response.""", temperature=0.4, tools=[find_issues_from_prompt, find_experts], ) response = genai_client.models.generate_content( model="gemini-2.5-flash", config=config, contents=user_prompt )ifnot response.candidates ornot response.candidates[0].content.parts:return"No response generated by the agent."return response.candidates[0].content.parts[0].text
Example interactions
Let’s test our agent with a few example queries. The agent will use the tools defined earlier to find relevant issues and provide a summary of the information found in the graph.
Show the code
user_prompt ="What are the dependencies necessary to run Atari environments ?"print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: What are the dependencies necessary to run Atari environments ?
Agent is calling find_issues_from_prompt with query: 'What are the dependencies necessary to run Atari environments ?'
Found 17 nodes and 26 relationships.
Agent Response:
Summary
The issues found indicate that running Atari environments in Gymnasium requires specific dependencies, primarily gym and ale-py. There have been discussions and fixes related to ensuring these environments are properly tested and documented, and issues have arisen when ale-py was not updated to support Gymnasium.
Potential Issues
Issue 78: Added Atari environments to tests, removed dead code
Details: This issue, raised by Markus Krimmel, mentions that Atari environments are added to tests “if gym and ale are available.” This directly points to gym and ale (likely referring to ale-py) as necessary dependencies.
Issue 152: [Bug Report] Cannot make an environment in env.registry
Labels: bug
Details: Raised by Marco Zatta, this bug report highlights that ale-py had not been updated to Gymnasium, preventing the creation of Atari environments. A comment on this issue suggests using pip install shimmy[atari] as a workaround or utilizing the Gym compatibility environment. This strongly indicates ale-py as a crucial dependency.
Issue 367: Add all atari environments and remove pip install atari from documentation
Details: Raised by Mark Towers, this issue discusses the installation of atari and autorom for documentation generation, which caused issues if ROMs failed to install. While the issue aimed to remove pip install atari from documentation generation, it implies that atari (and potentially autorom) were considered dependencies at some point for these environments.
Issue 417: Fix documentation ci
Details: Raised by Mark Towers, this issue is related to fixing documentation CI failures and adding new Atari environments to a list, suggesting that the proper setup for Atari environments impacts documentation and CI processes.
Issue 865: [Question] Does the Pong game have speed in its actions?
Labels: question
Details: Raised by gglsmm, this question about Pong game actions mentions that “all the atari environment is a wrapper over the stella emulator which should run the actual pong ROM.” This implies that the underlying Atari environments rely on an emulator like Stella and the corresponding ROMs.
Advice
To run Atari environments in Gymnasium, you will primarily need gym and ale-py. Ensure that ale-py is compatible with your Gymnasium version. If you encounter issues with environment creation, consider using pip install shimmy[atari] or exploring the Gym compatibility environment as mentioned in Issue 152. Additionally, be aware that the Atari environments are built upon emulators like Stella and require the relevant ROMs.
Show the code
user_prompt ="How do I make sure random number generation is seeded properly for experiment consistency ?"print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: How do I make sure random number generation is seeded properly for experiment consistency ?
Agent is calling find_issues_from_prompt with query: 'random number generation seeded properly for experiment consistency'
Found 69 nodes and 125 relationships.
Agent Response:
Summary
The issues found highlight the challenges and ongoing efforts in ensuring consistent and reproducible random number generation within the Gymnasium environment, particularly concerning the seeding of environments and the ability to retrieve the seed used. There’s a recognized need for better control over seeding in vectorized environments and for making the active seed easily accessible for debugging and reproducibility.
Potential Issues
Support list of options in VectorEnv.reset() (Issue #113): This closed issue indicates a past attempt to allow passing a list of seeds (and options) to VectorEnv.reset(). While it was deemed not feasible to implement cleanly at the time, it points to a desire for more fine-grained control over individual sub-environment seeding within a vectorized setup. This could be a potential area for future development if experiment consistency across multiple parallel environments is a high priority.
Made readout of seed possible in env (Issue #889): This issue addresses the difficulty of extracting the random seed used by the environment’s np_random object. It proposes adding a np_random_seed property to the environment, allowing users to read the seed that was set during reset(). This is crucial for verifying and recording the exact seed used for an experiment, which is fundamental for reproducibility. The discussion also touches upon potential inconsistencies if env.np_random is directly set by the user.
Check the determinism of env.reset(seed=42); env.reset() (Issue #1086): This issue focuses on ensuring that env.reset() behaves deterministically when a seed is provided. It highlights a scenario where an environment might generate random observations not based on the internal np_random object, leading to non-deterministic behavior even if a seed is passed. This is a direct threat to experiment consistency and emphasizes the importance of correctly implementing random number generation within custom environments.
Advice
Based on the issues, here’s some advice for ensuring random number generation is seeded properly for experiment consistency:
Always use the seed parameter in env.reset(): This is the primary and most reliable way to seed the environment’s internal random number generator. Ensure that you pass a consistent seed value for reproducible experiments.
Understand np_random_seed (Issue #889): If you need to verify the seed used by an environment, especially within complex frameworks or parallel processes, the np_random_seed property (introduced in Issue #889) can be very useful. Be aware of its behavior, particularly if you are directly manipulating env.np_random.
Ensure all random operations are tied to env.np_random (Issue #1086): When creating custom Gymnasium environments, it is critical that all random number generation within the environment (e.g., for initial states, observations, or internal dynamics) uses the self.np_random object provided by the Gymnasium API. If you use random or numpy.random directly without linking it to self.np_random, your environment will not be deterministic even if env.reset(seed=...) is called.
Consider the implications for vectorized environments (Issue #113): While direct support for lists of seeds in VectorEnv.reset() was deemed complex, be mindful of how individual sub-environments are seeded within your vectorized setups. If you require specific seeding for each sub-environment, you might need to manage this externally or explore wrappers that provide such functionality.
Test for determinism: Actively test your environments for determinism by resetting with the same seed multiple times and verifying that the initial states and subsequent transitions are identical. The example provided in Issue #1086 (check_env from gymnasium.utils.env_checker) can be a good starting point for such tests.
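To make the agent’s advice concrete, a determinism check along these lines might look like the sketch below. It uses a standard Gymnasium environment and is not code taken from the repository’s issues.

```python
# Minimal determinism check: resetting with the same seed should give the same initial state.
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

obs_a, _ = env.reset(seed=42)
obs_b, _ = env.reset(seed=42)
assert np.allclose(obs_a, obs_b), "reset(seed=...) should produce identical initial observations"

# In a custom environment, draw all randomness from self.np_random
# (e.g. self.np_random.uniform(-1.0, 1.0)) rather than from the global
# `random` or `numpy.random` modules, so that the seed actually takes effect.
```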
Show the code
user_prompt = ("Who could I reach out to for help with an issue with Atari environments ?")print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: Who could I reach out to for help with an issue with Atari environments ?
Agent is calling find_experts with query: 'Atari environments'
Agent Response:
Potential Experts

| Username | Interaction Count |
|---|---|
| dylwil3 | 5 |
| pseudo-rnd-thoughts | 3 |
| Markus28 | 1 |
Relevant Issues
Add missing descriptions to Atari docs (ID: #1715976674)
[Bug Report] The Atari doc is missing some information (ID: #1575650136)
Added Atari environments to tests, removed dead code (ID: #1424260214)
What can we improve ?
The above results are quite promising, but there are several areas where we can improve the agent’s performance. A key one is chunking issue and comment bodies before computing embeddings, which would let us handle longer texts and provide richer context to the agent. We can also improve the way we summarize the graph data, making it more concise and easier for the LLM to process.
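As an illustration of the chunking idea, a simple word-based splitter could be applied to each issue body before embedding. This is only a sketch: chunk_text and the chunk sizes are hypothetical, and it reuses the embedding_model.embed_batch call from earlier.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a long text into overlapping, word-based chunks (illustrative sizes)."""
    words = text.split()
    if len(words) <= max_words:
        return [text] if text else []
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start : start + max_words]))
        start += max_words - overlap
    return chunks

# Embed every chunk of an issue body instead of a single (possibly truncated) embedding.
# `issue_body` stands in for the text of one issue; `embedding_model` is the one used earlier.
chunks = chunk_text(issue_body)
chunk_embeddings = embedding_model.embed_batch(chunks)
```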
We can also enhance the agent’s ability to reason about the graph data by refining the system instruction and by giving the agent concrete examples of how and when to use each tool in different scenarios; a sketch of this follows below.
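One lightweight way to do this, shown here as a sketch rather than the configuration used above, is to append a short worked example of the expected tool usage and answer format to the system instruction. Here base_system_instruction is hypothetical shorthand for the system prompt defined earlier.

```python
few_shot_example = """
Example:
User: "Are there known issues with rendering on macOS ?"
Expected behaviour: call find_issues_from_prompt with the user's question, then answer using only
the returned issues, following the ### Summary / ### Potential Issues / ### Advice structure.
"""

config = types.GenerateContentConfig(
    # `base_system_instruction` stands in for the system prompt shown earlier.
    system_instruction=base_system_instruction + few_shot_example,
    temperature=0.4,
    tools=[find_issues_from_prompt, find_experts],
)
```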