RAG, or Retrieval-Augmented Generation, is a technique that combines the strengths of large language models (LLMs) with external knowledge sources. In this experiment, we explore a practical use case for RAG using a graph database.
Combining RAG with a graph allows us to enhance the contextual understanding of the LLM by providing it with structured information drawn from the graph. When RAG is instead combined with a typical search index over unstructured data, the results can be less accurate or relevant because context and fine detail are lost.
The use case
In this experiment, we will build methods that can download issues from any GitHub repository, store them in a graph database, and then use RAG to query the issues and comments. The goal is to demonstrate how combining retrieval with a graph can improve the usefulness of information drawn from structured data.
Retrieving issues from GitHub
We will start by implementing the necessary methods to download issues from a GitHub repository, including their comments, users, labels, and events.
Show the code
from github import Github, UnknownObjectExceptionimport pandas as pdfrom tqdm.auto import tqdmimport requests_cachedef _get_user_data(user, users_data: dict):"""Safely retrieves user data and handles exceptions for non-existent users."""if user and user.login notin users_data:try: users_data[user.login] = {"id": user.id,"login": user.login,"name": user.name,"company": user.company,"location": user.location,"followers": user.followers,"created_at": user.created_at, }except UnknownObjectException:print(f"Could not retrieve full profile for user {user.login}. Storing basic info." )# Store basic info if the full profile is not available users_data[user.login] = {"id": user.id,"login": user.login,"name": None,"company": None,"location": None,"followers": -1,"created_at": None, }def download_issues( token: str, repo_name: str) ->tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame, pd.DataFrame]:""" Download issues from a GitHub repository and return them as DataFrames. Args: token (str): GitHub personal access token. repo_name (str): Name of the repository in the format 'owner/repo'. Returns: tuple: DataFrames for issues, comments, users, labels, and events. """ os.makedirs(".data", exist_ok=True) requests_cache.install_cache(".data/github_cache", backend="sqlite", expire_after=4*3600 ) g = Github(token)ifnot g:raiseValueError("Invalid GitHub token or authentication failed.")# 2) Get repo and issues repo = g.get_repo(repo_name)ifnot repo:raiseValueError(f"Repository '{repo_name}' not found or access denied.") issues = repo.get_issues(state="all") # Paginated iteratorifnot issues:raiseValueError(f"No issues found in repository '{repo_name}'.") issue_data = [] issue_comments = [] issue_events = [] users_data = {} labels_data = {}for issue in tqdm( issues, total=issues.totalCount, desc=f"Downloading issues from {repo_name}" ):# Add all issue data to list issue_data.append( {"id": issue.id,"number": issue.number,"title": issue.title,"state": issue.state,"created_at": issue.created_at,"updated_at": issue.updated_at,"closed_at": issue.closed_at,"body": issue.body,"labels": [label.name for label in issue.labels],"assignees": [assignee.login for assignee in issue.assignees],"user": issue.user.login, } )# Add user data _get_user_data(issue.user, users_data)for assignee in issue.assignees: _get_user_data(assignee, users_data)# Add all comments to listfor comment in issue.get_comments(): issue_comments.append( {"issue_id": issue.id,"comment_id": comment.id,"user": comment.user.login,"created_at": comment.created_at,"updated_at": comment.updated_at,"body": comment.body, } )# Add comment user to users list _get_user_data(comment.user, users_data)# Add all labels to listfor label in issue.labels:if label.name notin labels_data: labels_data[label.name] = {"name": label.name,"color": label.color,"description": label.description, }# Add all events to listfor event in issue.get_events(): issue_events.append( {"issue_id": issue.id,"event_id": event.id,"actor": event.actor.login if event.actor elseNone,"event": event.event,"created_at": event.created_at, } )# Add event actor to users listif event.actor: _get_user_data(event.actor, users_data)return ( pd.DataFrame(issue_data), pd.DataFrame(issue_comments), pd.DataFrame(list(users_data.values())), pd.DataFrame(list(labels_data.values())), pd.DataFrame(issue_events), )
Retrieving hundreds of issues from a repository can take a while, so we will cache the results to avoid unnecessary API calls. In this case we will use the Farama-Foundation/Gymnasium repository as our data source - it is small enough to be manageable, but large enough to demonstrate the capabilities of our methods.
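The caching relies on requests_cache, which transparently stores GitHub API responses in a local SQLite file so that repeated runs are served from disk instead of hitting the API again. The sketch below shows that pattern in isolation, trimmed down from the full download code above; the cache path and expiry match the values used there, and the three-issue loop is only for illustration.

import os
from itertools import islice

import requests_cache
from github import Github

# Cache all HTTP responses for four hours in a local SQLite file;
# repeated calls within that window are answered from the cache.
os.makedirs(".data", exist_ok=True)
requests_cache.install_cache(".data/github_cache", backend="sqlite", expire_after=4 * 3600)

token = os.getenv("GITHUB_TOKEN")  # GitHub personal access token
g = Github(token)
repo = g.get_repo("Farama-Foundation/Gymnasium")

# get_issues returns a paginated iterator; pages are fetched lazily
for issue in islice(repo.get_issues(state="all"), 3):
    print(issue.number, issue.title)
    for comment in issue.get_comments():
        print("  comment by", comment.user.login)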
Show the code
import os

token = os.getenv("GITHUB_TOKEN")
repo_name = "Farama-Foundation/Gymnasium"

# Check if we already have the data
if os.path.exists(".data/issues.pkl"):
    print("Data already downloaded. Loading from pickle files.")
    issue_data = pd.read_pickle(".data/issues.pkl")
    issue_comments = pd.read_pickle(".data/comments.pkl")
    users_data = pd.read_pickle(".data/users.pkl")
    labels_data = pd.read_pickle(".data/labels.pkl")
    issue_events = pd.read_pickle(".data/events.pkl")
else:
    print("Downloading issues from GitHub...")
    issue_data, issue_comments, users_data, labels_data, issue_events = download_issues(
        token, repo_name
    )
    # Save all dataframes to pickle files under `.data`
    os.makedirs(".data", exist_ok=True)
    issue_data.to_pickle(".data/issues.pkl")
    issue_comments.to_pickle(".data/comments.pkl")
    users_data.to_pickle(".data/users.pkl")
    labels_data.to_pickle(".data/labels.pkl")
    issue_events.to_pickle(".data/events.pkl")
Data already downloaded. Loading from pickle files.
Computing embeddings
With the data at hand and loaded into pandas DataFrames, we can now compute embeddings for the content of issues and comments. We will use the Qwen/Qwen3-Embedding-0.6B pre-trained Transformer model to generate embeddings for the text data, as it provides a good balance between accuracy and performance for our use case. It can also handle long input texts, which means we can often use it without chunking the text.
Important
In a production setting, you would almost certainly want to chunk the text before embedding it. For the purpose of this experiment, we will keep things simple and embed content without chunking.
To keep our data size manageable, we will also truncate the embeddings to a fixed dimension of \(768\), which is a common size for many Transformer models. For larger datasets, you would want to consider using a larger embedding size.
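For reference, a production pipeline would typically split long issue bodies into overlapping chunks and embed each chunk separately. The sketch below shows one naive way to do that with a hypothetical chunk_text helper based on word counts; it is not used in this experiment, where whole texts are embedded and simply truncated to \(768\) dimensions via the model's truncate_dim option.

from typing import List

def chunk_text(text: str, max_words: int = 400, overlap: int = 50) -> List[str]:
    """Naive word-based chunking with overlap; real pipelines would use token counts."""
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start : start + max_words]))
    return chunks

# Each chunk would then get its own embedding (and its own node or property in the graph):
# chunks = chunk_text(issue_body)
# chunk_embeddings = embedding_model.embed_batch(chunks)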
Show the code
# Method to compute embeddings using a given Transformer model
from sentence_transformers import SentenceTransformer
import torch
import numpy as np
from typing import List


class EmbeddingModel:
    def __init__(
        self,
        model_name: str = "QWen/Qwen3-Embedding-0.6B",
        batch_size: int = 32,
        truncate_dim: int = None,
    ) -> None:
        # Use CUDA if available
        if torch.cuda.is_available():
            print("Using CUDA for embeddings.")
            self.device = torch.device("cuda")
        elif torch.backends.mps.is_available():
            print("Using MPS for embeddings.")
            self.device = torch.device("mps")
        else:
            print("Using CPU for embeddings.")
            self.device = torch.device("cpu")
        self.model = SentenceTransformer(model_name, truncate_dim=truncate_dim).to(
            self.device
        )
        self.batch_size = batch_size

    def embed_batch(
        self, texts: List[str], desc: str = "Embedding batch"
    ) -> np.ndarray:
        """
        Embed a batch of texts using the SentenceTransformer model.

        Args:
            texts (List[str]): List of texts to embed.
            desc (str): Description for the tqdm progress bar.

        Returns:
            np.ndarray: Array of embeddings for the input texts.
        """
        all_embs = []
        self.model.to(self.device)
        with torch.no_grad():  # disable grads
            for i in tqdm(range(0, len(texts), self.batch_size), desc=desc):
                batch = texts[i : i + self.batch_size]
                # get a CPU numpy array directly
                embs = self.model.encode(
                    batch,
                    batch_size=len(batch),
                    show_progress_bar=False,
                    convert_to_tensor=False,  # returns numpy on CPU
                )
                all_embs.append(np.vstack(embs) if isinstance(embs, list) else embs)
        # free any CUDA scratch
        if self.device.type == "cuda":
            torch.cuda.empty_cache()
        return np.vstack(all_embs)


embedding_dim = 768  # Set the embedding dimension
embedding_model = EmbeddingModel(batch_size=2, truncate_dim=embedding_dim)
Using CUDA for embeddings.
Show the code
# Compute embeddings for issues and comments, including title and body
def compute_embeddings(
    df: pd.DataFrame, text_columns: List[str], desc: str = "Computing embeddings"
) -> np.ndarray:
    """
    Compute embeddings for specified text columns in a DataFrame.

    Args:
        df (pd.DataFrame): DataFrame containing the text data.
        text_columns (List[str]): List of column names to compute embeddings for.
        desc (str): Description for the tqdm progress bar.

    Returns:
        np.ndarray: Array of embeddings for the specified text columns.
    """
    texts = []
    for _, row in df.iterrows():
        text = " ".join(str(row[col]) for col in text_columns if pd.notna(row[col]))
        texts.append(text)
    return embedding_model.embed_batch(texts, desc=desc)
Just as with the issue data, we will cache the embeddings to avoid recomputing them every time we run the experiment. If the embeddings already exist in the DataFrame, we will load them from there to avoid unnecessary computation.
Show the code
recompute = False

# Check if embeddings already exist
if (
    "embeddings" in issue_data.columns
    and "embeddings" in issue_comments.columns
    and not recompute
):
    print("Embeddings already computed. Loading from DataFrame.")
else:
    print("Computing embeddings for issues and comments...")
    issue_text_columns = ["title", "body"]
    issue_embeddings = compute_embeddings(
        issue_data, issue_text_columns, "Computing issue embeddings"
    )
    comment_text_columns = ["body"]
    comment_embeddings = compute_embeddings(
        issue_comments, comment_text_columns, "Computing comment embeddings"
    )
    # Add embeddings to DataFrames
    issue_data["embeddings"] = list(issue_embeddings)
    issue_comments["embeddings"] = list(comment_embeddings)
    # Save dataframes back to pickle files
    issue_data.to_pickle(".data/issues.pkl")
    issue_comments.to_pickle(".data/comments.pkl")
Embeddings already computed. Loading from DataFrame.
The raw issue data
Let us take a look at the raw data we have collected and processed so far. We will display a sample of each DataFrame to get an overview of the data structure and content.
The issue data contains information about each issue, including its title, body, state, labels, assignees, and the user who raised it. Also note the computed embeddings for the issue text.
Show the code
# Show a sample for each dataframe
print("Sample issue data:")
issue_data.sample(5)
Sample issue data:
| | id | number | title | state | created_at | updated_at | closed_at | body | labels | assignees | user | embeddings |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 853 | 1764400983 | 560 | [Proposal] Check render_mode for RecordVideo w... | closed | 2023-06-20 00:21:09+00:00 | 2023-07-03 09:46:40+00:00 | 2023-07-03 09:46:40+00:00 | ### Proposal\r\n\r\nRight now, when you wrap a... | [enhancement] | [] | robertoschiavone | [0.011846603, 0.0290194, -0.0031601943, -0.075... |
| 1262 | 1457055647 | 150 | Versioned Action wrappers which supports jumpy | closed | 2022-11-20 21:37:48+00:00 | 2022-12-01 20:36:12+00:00 | 2022-12-01 20:36:12+00:00 | # Description\r\n\r\nThis PR add support for j... | [] | [] | gianlucadecola | [0.030979421, 0.041525118, -0.011083876, -0.07... |
| 594 | 2030125938 | 820 | [Bug Report] [documentation] The LaTeX math is... | closed | 2023-12-07 08:02:45+00:00 | 2023-12-17 09:38:24+00:00 | 2023-12-17 09:38:23+00:00 | ### Describe the bug\n\nexample:\r\nhttps://gy... | [bug] | [] | Kallinteris-Andreas | [0.0021287652, 0.048698895, -0.005818796, -0.0... |
| 1106 | 1568043962 | 307 | Update test_vector_make.py | closed | 2023-02-02 13:16:09+00:00 | 2023-02-02 15:26:54+00:00 | 2023-02-02 15:26:53+00:00 | # Description\r\nThe code has been updated to ... | [] | [] | MiChaelinzo | [0.037689343, 0.008772637, -0.006663781, -0.03... |
| 127 | 2774346221 | 1288 | Add `wrappers.vector.TransformObs/Action` sing... | closed | 2025-01-08 05:39:52+00:00 | 2025-01-12 12:43:55+00:00 | 2025-01-12 12:43:55+00:00 | # Description\r\n\r\nFixes #1287\r\n\r\n## Typ... | [] | [] | howardh | [-0.0051537124, 0.0085140485, -0.0072919703, -... |
Comments on issues include the comment body, creation and update timestamps, and the user who made the comment. The embeddings for the comment text are also included.
User data contains information about the users who raised issues or made comments, including their login, name, company, location, followers count, and account creation date.
Show the code
print("\nSample users data:")users_data.sample(5)
Sample users data:
| | id | login | name | company | location | followers | created_at |
|---|---|---|---|---|---|---|---|
| 62 | 66969704 | mariovas3 | Mario Vasilev | None | London, United Kingdom | 2 | 2020-06-15 18:43:17+00:00 |
| 48 | 6186430 | abouelsaadat | Mohamed Abouelsaadat | None | None | 0 | 2013-12-14 18:34:14+00:00 |
| 30 | 64679842 | is-jang | 장인성 (Insung Jang) | None | Busan, Republic of Korea | 2 | 2020-05-02 06:53:53+00:00 |
| 118 | 36020639 | RuizhouLiu | None | None | None | 1 | 2018-02-01 02:07:30+00:00 |
| 450 | 18716355 | YangyangFu | yyf | Texas A$M University | None | 47 | 2016-04-28 08:40:11+00:00 |
Labels associated with issues include their name, color, and description.
Finally, issue events include information about events related to issues, such as when an issue was opened, closed, or commented on. The events also include the user who triggered the event and the timestamp of the event.
Once loaded into the graph, the data will follow the data model shown below.
%%{
init: {
'theme': 'base',
'themeVariables': {
'fontSize': '16px'
}
}
}%%
graph TD
subgraph "Graph Data Model"
U[User]
I[Issue]
C[Comment]
L[Label]
E[Event]
U -- "RAISED_BY" --> I
U -- "ASSIGNED_TO" --> I
U -- "COMMENT_BY" --> C
U -- "EVENT_BY" --> E
I -- "HAS_LABEL" --> L
I -- "MIGHT_RELATE_TO" --> I
C -- "COMMENT_ON" --> I
E -- "EVENT_ON" --> I
end
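To give a feel for how this model supports traversal-style questions, here is an illustrative query over it. This snippet is not part of the original notebook: it assumes the driver connection and data import set up in the following sections, and the login value is simply taken from the user sample above.

# Example traversal over the model: issues a given user has commented on, with their labels.
with driver.session() as session:
    result = session.run(
        """
        MATCH (u:User {login: $login})<-[:COMMENT_BY]-(c:Comment)-[:COMMENT_ON]->(i:Issue)
        OPTIONAL MATCH (i)-[:HAS_LABEL]->(l:Label)
        RETURN i.title AS title, collect(DISTINCT l.name) AS labels
        LIMIT 10
        """,
        login="Kallinteris-Andreas",  # login taken from the sample data above
    )
    for record in result:
        print(record["title"], record["labels"])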
Setting up the Neo4j graph database
We will use Neo4j Aura to store our issue data in a graph database. Neo4j Aura is a fully managed cloud service that provides a Neo4j database instance, which we can use to store and query our data. A number of helper methods are needed to set up the database schema, create constraints and indexes, and clear the database if needed.
First, we will connect to the Neo4j Aura instance using the neo4j Python driver. Make sure you have the necessary environment variables set for the connection.
Show the code
from neo4j import GraphDatabase, basic_auth, Driver, Session, Transaction, Record
from neo4j.graph import Graph

URI = os.getenv("NEO4J_URI")
USER = os.getenv("NEO4J_USERNAME")
PASSWORD = os.getenv("NEO4J_PASSWORD")
AUTH = (USER, PASSWORD)

print(f"Connecting to Neo4j at {URI} with user {USER}")
driver = GraphDatabase.driver(URI, auth=AUTH)
driver.verify_connectivity()


def test_aura_connection() -> None:
    with driver.session() as session:
        result = session.run("RETURN 'Hello, Aura!' AS message")
        record = result.single()
        print(record["message"])  # should print "Hello, Aura!"


test_aura_connection()
Connecting to Neo4j at neo4j+s://8c1ab3e4.databases.neo4j.io with user neo4j
Hello, Aura!
We then need a few additional methods to manage the database schema, including dropping existing constraints and indexes, clearing the database, and creating new constraints and vector indexes for the issue and comment embeddings whenever we re-run the experiment. This is useful to ensure that we start with a clean slate and can easily modify the schema if needed.
Show the code
def drop_schema(tx: Transaction) -> None:
    # Drop constraints
    for record in tx.run("SHOW CONSTRAINTS"):
        name = record["name"]
        tx.run(f"DROP CONSTRAINT `{name}`")
    # Drop indexes
    for record in tx.run("SHOW INDEXES"):
        name = record["name"]
        tx.run(f"DROP INDEX `{name}`")


def clear_database(tx: Transaction) -> None:
    # Drop all nodes and relationships
    tx.run("MATCH (n) DETACH DELETE n")


def create_constraints(tx: Transaction) -> None:
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (i:Issue) REQUIRE i.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (c:Comment) REQUIRE c.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (u:User) REQUIRE u.id IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (l:Label) REQUIRE l.name IS UNIQUE")
    tx.run("CREATE CONSTRAINT IF NOT EXISTS FOR (e:Event) REQUIRE e.id IS UNIQUE")
To store and query embeddings efficiently (the core of our RAG approach), we will create vector indexes for the issue and comment embeddings. The embedding dimensions will need to match the model’s output dimensions.
Show the code
def create_issue_vector_index(tx: Transaction, embedding_dim: int = 768) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `issue_embeddings` IF NOT EXISTS
        FOR (i:Issue) ON i.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: $embedding_dim,
            `vector.similarity_function`: 'cosine'}}
        """,
        embedding_dim=embedding_dim,
    )


def create_comment_vector_index(tx: Transaction, embedding_dim: int = 768) -> None:
    tx.run(
        """
        CREATE VECTOR INDEX `comment_embeddings` IF NOT EXISTS
        FOR (c:Comment) ON c.embedding
        OPTIONS {indexConfig: {
            `vector.dimensions`: $embedding_dim,
            `vector.similarity_function`: 'cosine'}}
        """,
        embedding_dim=embedding_dim,
    )
Show the code
with driver.session(database="neo4j") as session:
    # Clear the database
    print("Clearing the database...")
    session.execute_write(clear_database)
    # Drop existing schema
    print("Dropping existing schema...")
    session.execute_write(drop_schema)
    # Create new constraints
    print("Creating new constraints...")
    session.execute_write(create_constraints)
    # Create vector indexes
    print("Creating vector indexes...")
    session.execute_write(create_issue_vector_index, embedding_dim=embedding_dim)
    session.execute_write(create_comment_vector_index, embedding_dim=embedding_dim)

print("Schema updated successfully.")
Clearing the database...
Dropping existing schema...
Creating new constraints...
Creating vector indexes...
Schema updated successfully.
Importing data into Neo4j
We still need a few methods to import issue data. These methods will handle the insertion of users, labels, issues, comments, and events into the Neo4j database in batches. This is important for performance, especially when dealing with large datasets.
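The import follows a standard Neo4j batching pattern: pass a list of row dictionaries as a query parameter and UNWIND it inside a single write transaction, using MERGE so the import stays idempotent. Below is a trimmed sketch of that pattern for the User nodes; the full import code that follows sets more properties and does the same for labels, issues, comments, and events.

def _load_users_batch_sketch(tx: Transaction, batch: list) -> None:
    # One round-trip per batch: UNWIND expands the parameter list into rows,
    # and MERGE keyed on the unique user id makes re-runs safe.
    tx.run(
        """
        UNWIND $batch AS row
        MERGE (u:User {id: row.id})
        SET u.login = row.login, u.followers = row.followers
        """,
        batch=batch,
    )

with driver.session() as session:
    for i in range(0, len(users_data), 128):
        batch = users_data.iloc[i : i + 128].to_dict("records")
        session.execute_write(_load_users_batch_sketch, batch)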
Show the code
from typing import List, Dict, Anyimport pandas as pddef _load_users_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (u:User {id: row.id}) SET u.login = row.login, u.name = row.name, u.company = row.company, u.location = row.location, u.followers = row.followers, u.created_at = CASE WHEN row.created_at IS NOT NULL THEN datetime(row.created_at) ELSE null END """, batch=batch, )def import_users_batched( session: Session, users_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(users_df), batch_size), desc="Importing users"): batch = users_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_users_batch, batch)def _load_labels_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (l:Label {name: row.name}) SET l.color = row.color, l.description = row.description """, batch=batch, )def import_labels_batched( session: Session, labels_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(labels_df), batch_size), desc="Importing labels"): batch = labels_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_labels_batch, batch)def _load_issues_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (i:Issue {id: row.id}) SET i.number = row.number, i.title = row.title, i.state = row.state, i.body = row.body, i.created_at = datetime(row.created_at), i.updated_at = datetime(row.updated_at), i.closed_at = CASE WHEN row.closed_at IS NOT NULL THEN datetime(row.closed_at) ELSE null END, i.embedding = row.embeddings WITH i, row MERGE (u:User {login: row.user}) MERGE (i)-[:RAISED_BY]->(u) WITH i, row UNWIND row.labels AS labelName MERGE (l:Label {name: labelName}) MERGE (i)-[:HAS_LABEL]->(l) WITH i, row UNWIND row.assignees AS assigneeLogin MERGE (a:User {login: assigneeLogin}) MERGE (i)-[:ASSIGNED_TO]->(a) """, batch=batch, )def import_issues_batched( session: Session, issues_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(issues_df), batch_size), desc="Importing issues"): batch = issues_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_issues_batch, batch)def _load_comments_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (c:Comment {id: row.comment_id}) SET c.body = row.body, c.created_at = datetime(row.created_at), c.updated_at = datetime(row.updated_at), c.embedding = row.embeddings WITH c, row MERGE (i:Issue {id: row.issue_id}) MERGE (c)-[:COMMENT_ON]->(i) WITH c, row MERGE (u:User {login: row.user}) MERGE (c)-[:COMMENT_BY]->(u) """, batch=batch, )def import_comments_batched( session: Session, comments_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(comments_df), batch_size), desc="Importing comments"): batch = comments_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_comments_batch, batch)def _load_events_batch(tx: Transaction, batch: List[Dict[str, Any]]) ->None: tx.run(""" UNWIND $batch AS row MERGE (e:Event {id: row.event_id}) SET e.event = row.event, e.created_at = datetime(row.created_at) WITH e, row MERGE (i:Issue {id: row.issue_id}) MERGE (e)-[:EVENT_ON]->(i) WITH e, row WHERE row.actor IS NOT NULL MERGE (u:User {login: row.actor}) MERGE (e)-[:EVENT_BY]->(u) """, batch=batch, )def import_events_batched( session: Session, events_df: pd.DataFrame, batch_size: int=128) ->None:for i in tqdm(range(0, len(events_df), batch_size), 
desc="Importing events"): batch = events_df.iloc[i : i + batch_size].to_dict("records") session.execute_write(_load_events_batch, batch)
We can now import the data, creating nodes and relationships for users, labels, issues, comments, and events in the graph database.
Show the code
with driver.session() as session:
    # Import data
    print("Importing data...")
    import_users_batched(session, users_data)
    import_labels_batched(session, labels_data)
    import_issues_batched(session, issue_data)
    import_comments_batched(session, issue_comments)
    import_events_batched(session, issue_events)

print("Data imported successfully.")
Importing data...
Data imported successfully.
Visualising our graph
A picture is worth a thousand words, so we will use the Pyvis library to visualize the graph we have created in Neo4j. It is a great way to get an immediate, intuitive understanding of the data and its relationships. Let us quickly create a method that converts the Neo4j graph object into a Pyvis Network object, which we can then visualize.
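At its core, the conversion only relies on two Pyvis calls, add_node and add_edge, keyed by the Neo4j element ids. The stripped-down sketch below shows just that skeleton, assuming a graph object as returned by result.graph() (as in the sampling query further down); the full method that follows adds titles, labels, sizing, and truncation.

from pyvis.network import Network

# Minimal conversion: one Pyvis node per Neo4j node, one edge per relationship.
net_sketch = Network(notebook=True, cdn_resources="in_line", height="750px", width="100%")
for node in graph.nodes:
    labels = list(node.labels)
    net_sketch.add_node(node.element_id, label=labels[0] if labels else "Node")
for rel in graph.relationships:
    net_sketch.add_edge(rel.start_node.element_id, rel.end_node.element_id, title=rel.type)
net_sketch.show("minimal_graph.html", notebook=True)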
Show the code
from pyvis.network import Networkimport pandas as pdfrom neo4j.graph import Node, Relationshipdef create_pyvis_network_from_neo4j(graph: Graph) -> Network:""" Creates a Pyvis Network object from a Neo4j graph object. """ net = Network( notebook=True, cdn_resources="in_line", height="750px", width="100%", bgcolor="#ffffff", font_color="black", )for node in graph.nodes: node_id = node.element_id labels =list(node.labels) group = labels[0] if labels else"Node" properties =dict(node)# Plain‑text title with newlines title_lines = [group]for k, v in properties.items():if k =="embedding":continueif k =="body"and v andlen(v) >512: v = v[:512] +"..." title_lines.append(f"{k}: {v}") title ="\n".join(title_lines)# Use a specific property for the label if available node_label =str( properties.get("title")or properties.get("name")or properties.get("login")or properties.get("id")or node_id )iflen(node_label) >30: node_label = node_label[:27] +"..." node_size =25if"Issue"in labels:# Make the node size relative to the number of related nodes related_nodes =len( [ relfor rel in graph.relationshipsif rel.start_node.element_id == node_idor rel.end_node.element_id == node_id ] ) node_size += related_nodes net.add_node( node_id, label=node_label, title=title, group=group, size=node_size )# Add edgesfor rel in graph.relationships: source_id = rel.start_node.element_id target_id = rel.end_node.element_id net.add_edge(source_id, target_id, title=rel.type, arrows="to", dashes=True)return net
To avoid overwhelming the visualization with too many nodes, we will sample a subset of the graph data, using Cypher's rand() function to select a limited number of issues and their related nodes. You can zoom in and out of the visualization and click on nodes to see their properties.
Show the code
# Query Neo4j to get a sample of the graph data
with driver.session() as session:
    result = session.run(
        """
        MATCH (i:Issue)
        WITH i, rand() AS r
        ORDER BY r
        LIMIT 50
        MATCH (i)-[rel]-(neighbor)
        RETURN i, rel, neighbor
        """
    )
    graph = result.graph()

net = create_pyvis_network_from_neo4j(graph)

# Configure physics and controls
net.toggle_physics(True)

# Save the visualization to HTML
net.show("graph_visualization.html", notebook=True)
graph_visualization.html
Computing similarity links
By comparing the embeddings of issues and their comments, we can create MIGHT_RELATE_TO relationships between issues that are semantically similar. This can help identify duplicate or related issues, and it helps us understand the context of a given issue by showing which other issues might contain important information about the problem at hand.
This will help us build a more connected graph, with more meaningful relationships between relevant problems.
Show the code
def create_similarity_links(tx: Transaction, min_score: float) -> int:
    result = tx.run(
        """
        // issue→issue and issue→comment similarities
        MATCH (i:Issue)
        CALL {
            WITH i
            CALL db.index.vector.queryNodes('issue_embeddings', 10, i.embedding)
            YIELD node AS similar_issue, score
            RETURN i AS issue, similar_issue, score
            UNION
            WITH i
            CALL db.index.vector.queryNodes('comment_embeddings', 10, i.embedding)
            YIELD node AS similar_comment, score
            MATCH (similar_issue:Issue)<-[:COMMENT_ON]-(similar_comment)
            RETURN i AS issue, similar_issue, score
        }
        WITH issue, similar_issue, score
        WHERE score >= $min_score AND elementId(issue) < elementId(similar_issue)
        WITH issue, similar_issue, max(score) AS max_score
        MERGE (issue)-[r:MIGHT_RELATE_TO]->(similar_issue)
        SET r.score = max_score

        // comment→issue similarities (no shadowing in import WITH)
        WITH issue
        MATCH (issue)<-[:COMMENT_ON]-(c:Comment)
        CALL {
            WITH c, issue
            CALL db.index.vector.queryNodes('issue_embeddings', 10, c.embedding)
            YIELD node AS similar_issue, score
            // alias here, not in the WITH
            RETURN issue AS parent_issue, similar_issue, score
        }
        // safely re-alias back to `issue`
        WITH parent_issue AS issue, similar_issue, score
        WHERE score >= $min_score AND elementId(issue) < elementId(similar_issue)
        WITH issue, similar_issue, max(score) AS max_score
        MERGE (issue)-[r:MIGHT_RELATE_TO]->(similar_issue)
        SET r.score = max_score
        """,
        min_score=min_score,
    )
    return result.consume().counters.relationships_created
We will set a minimum score threshold of 0.75 (keep in mind cosine similarity scores can range from -1 to 1) for the similarity links to avoid creating too many relationships that might not be meaningful. This threshold can be adjusted based on the specific use case and the quality of the embeddings.
Show the code
min_score_threshold = 0.75

with driver.session() as session:
    print(
        f"Creating MIGHT_RELATE_TO relationships between issues with score >= {min_score_threshold}..."
    )
    num_rels_created = session.execute_write(
        create_similarity_links, min_score=min_score_threshold
    )
    print(f"Created {num_rels_created} MIGHT_RELATE_TO relationships.")
Creating MIGHT_RELATE_TO relationships between issues with score >= 0.75...
Created 19282 MIGHT_RELATE_TO relationships.
Visualizing related issues
To visualize the relationships between issues, we can create a Pyvis network that includes the MIGHT_RELATE_TO relationships. This will help us see how issues are connected based on their semantic similarity. To further enhance the visualisation, we will also perform community detection on the graph, grouping similar issues together and colouring each community consistently.
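The community detection step a little further below relies on NetworkX's Louvain implementation. In isolation, the pattern looks like the toy example here: build a weighted graph, run louvain_communities, and map each node to a community id that is then used as its colour group. The node names and weights are made up for illustration.

import networkx as nx
from networkx.algorithms import community

# Toy example of the community detection step used later on.
G_toy = nx.Graph()
G_toy.add_edge("issue-1", "issue-2", weight=0.92)
G_toy.add_edge("issue-2", "issue-3", weight=0.85)
G_toy.add_edge("issue-4", "issue-5", weight=0.88)

communities = community.louvain_communities(G_toy, weight="weight", seed=42)
node_community = {n: i for i, comm in enumerate(communities) for n in comm}
print(node_community)  # e.g. {'issue-1': 0, 'issue-2': 0, 'issue-3': 0, 'issue-4': 1, 'issue-5': 1}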
Show the code
import networkx as nxfrom networkx.algorithms import communitydef create_pyvis_network_from_networkx( G: nx.Graph, node_community: dict, min_score_threshold: float) -> Network:""" Creates a Pyvis Network object from a NetworkX graph object, with community information. """ net = Network( notebook=True, cdn_resources="in_line", height="750px", width="100%", bgcolor="#ffffff", font_color="black", )# Add nodes to PyVis network with community informationfor node_id, properties in G.nodes(data=True): group = node_community.get(node_id, -1) # -1 for nodes not in any community# Plain‑text title with newlines title_lines = [f"Community: {group}"]for k, v in properties.items():if k =="embedding":continueif k =="body"and v andlen(v) >512: v = v[:512] +"..." title_lines.append(f"{k}: {v}") title ="\n".join(title_lines)# Use a specific property for the label if available node_label =str( properties.get("title")or properties.get("name")or properties.get("login")or properties.get("id")or node_id )iflen(node_label) >30: node_label = node_label[:27] +"..." net.add_node(node_id, label=node_label, title=title, group=group)# Add edgesfor source_id, target_id, properties in G.edges(data=True): rel_title = properties.get("type", "") edge_width =1if"score"in properties: score = properties["score"] rel_title =f"MIGHT_RELATE_TO (score: {score:.2f})"# Scale edge width based on score. edge_width =1+ (score - min_score_threshold) * (10/ (1- min_score_threshold) ) net.add_edge( source_id, target_id, title=rel_title, width=edge_width, arrows="to", dashes=True, )return net
Note how the new MIGHT_RELATE_TO relationships are established based on semantic similarity scores (represented by the thickness of the edges).
Show the code
# Create a NetworkX graph to perform community detectionG = nx.Graph()# Query Neo4j to get a sample of issues with MIGHT_RELATE_TO relationshipswith driver.session() as session: result = session.run(""" MATCH (i:Issue)-[rel:MIGHT_RELATE_TO]-(neighbor:Issue) WITH i, rel, neighbor, rand() as r ORDER BY r LIMIT 200 RETURN i, rel, neighbor """ )# Build the NetworkX graph from the query resultsfor record in result: node_i = record["i"] node_neighbor = record["neighbor"] rel = record["rel"] G.add_node(node_i.element_id, **dict(node_i)) G.add_node(node_neighbor.element_id, **dict(node_neighbor)) G.add_edge(node_i.element_id, node_neighbor.element_id, **dict(rel))# Detect communities using the Louvain methodcommunities = community.louvain_communities(G)# Create a mapping from node to community idnode_community = {}for i, comm inenumerate(communities):for node_id in comm: node_community[node_id] = inet_similar = create_pyvis_network_from_networkx(G, node_community, min_score_threshold)# Configure physics and controlsnet_similar.toggle_physics(True)# Save the visualization to HTMLnet_similar.show("might_relate_to_visualization.html", notebook=True)
might_relate_to_visualization.html
The resulting RAG graph
To find the most relevant issues to a query string, we can use the embeddings of both issues and comments. We will create a method that searches for the top-k matching issues based on a blended search of issues and comments, and returns a graph of their connections. This will allow us to build a RAG graph that can be used to answer questions about the issues and their related comments.
The Cypher query in the following method is a bit complex and requires some explanation. It starts by running two vector-search subqueries in parallel: one against the issue embeddings index and another against the comment embeddings index. It returns the top \(k\) matches from each search, pairing comment matches back to their parent issues so that you end up with a unified stream of issues scored by similarity to your input embedding.
Next, it orders every returned issue-score pair by descending score, wraps each issue node together with its score into a map, and deduplicates those maps so that each issue appears only once (preserving its highest score). It then slices that deduplicated list down to just the top \(k\) issues you want to focus on and re-materializes the actual Issue nodes by matching on their IDs.
Finally, for each of those top \(k\) issues the query pulls in any labels or “raised by” relationships, all comments on the issue, and the users who made those comments. It aggregates each issue’s related nodes and relationships, flattens everything into two big collections, and then unwinds and re‑collects them with DISTINCT to eliminate duplicates. The result is a clean subgraph containing exactly the top \(k\) semantically similar issues plus their immediate context.
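To make the structure easier to follow, here is just the first stage run on its own: the blended vector search over both indexes. This is a trimmed, illustrative extract; the full method below additionally deduplicates issues, keeps the best score per issue, and pulls in each issue's neighbourhood. The query text passed to the embedding model here is an arbitrary placeholder.

# Simplified first stage of the query: blend issue and comment vector search,
# mapping comment hits back to their parent issue. Duplicates are possible here;
# the full method deduplicates them.
with driver.session() as session:
    result = session.run(
        """
        CALL {
            CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding)
            YIELD node AS issue, score
            RETURN issue, score
            UNION
            CALL db.index.vector.queryNodes('comment_embeddings', $top_k, $embedding)
            YIELD node AS comment, score
            MATCH (comment)-[:COMMENT_ON]->(issue:Issue)
            RETURN issue, score
        }
        RETURN issue.title AS title, score
        ORDER BY score DESC
        LIMIT $top_k
        """,
        embedding=embedding_model.embed_batch(["example query"])[0].tolist(),
        top_k=5,
    )
    for record in result:
        print(f"{record['score']:.2f}  {record['title']}")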
Show the code
def get_rag_graph(tx: Transaction, query_string: str, top_k: int=5) -> Graph:""" Finds the most relevant issues to a query string by searching both issue and comment embeddings, and returns a graph of their connections. The graph contains: - The top-k matching issues based on a blended search of issues and comments. - For each of these issues: their comments, users who wrote them, and labels. """# Embed the query string query_embedding = embedding_model.embed_batch([query_string])[0].tolist()# Find the most relevant issues and build the graph result = tx.run(""" // Find top k issues from issue embeddings and from comment embeddings CALL { CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding) YIELD node AS issue, score RETURN issue, score UNION CALL db.index.vector.queryNodes('comment_embeddings', $top_k, $embedding) YIELD node AS comment, score MATCH (comment)-[:COMMENT_ON]->(issue:Issue) RETURN issue, score } // Combine, deduplicate, and select top k issues overall WITH issue, score ORDER BY score DESC WITH collect(issue {.*, score: score}) AS issues WITH [i in issues | i.id] AS issueIds, issues WITH [id IN issueIds | head([i IN issues WHERE i.id = id])] AS uniqueIssues WITH uniqueIssues[..$top_k] AS top_issues UNWIND top_issues as top_issue_data MATCH (top_issue:Issue {id: top_issue_data.id}) // Collect the top issues, their labels, and the users who raised them OPTIONAL MATCH (top_issue)-[r1:HAS_LABEL|RAISED_BY]->(n1) // Collect comments on the top issues and the users who made them OPTIONAL MATCH (top_issue)<-[r2:COMMENT_ON]-(c1:Comment)-[r3:COMMENT_BY]->(u1:User) // Aggregate all nodes and relationships per issue WITH top_issue, collect(DISTINCT n1) as nodes1, collect(DISTINCT r1) as rels1, collect(DISTINCT c1) + collect(DISTINCT u1) as nodes2, collect(DISTINCT r2) + collect(DISTINCT r3) as rels2 // Aggregate all nodes and relationships across all issues WITH collect(top_issue) + apoc.coll.flatten(collect(nodes1)) + apoc.coll.flatten(collect(nodes2)) as all_nodes, apoc.coll.flatten(collect(rels1)) + apoc.coll.flatten(collect(rels2)) as all_rels UNWIND all_nodes as n UNWIND all_rels as r RETURN collect(DISTINCT n) as nodes, collect(DISTINCT r) as relationships """, embedding=query_embedding, top_k=top_k, ) record = result.single()# Reconstruct the graph from nodes and relationships nodes = record["nodes"] relationships = record["relationships"]# Create a graph object to return# This is a bit of a hack, as we can't directly instantiate a Graph object easily# with nodes and relationships from the driver. We'll run a query that returns a graph.ifnot nodes:return Graph() node_ids = [n.element_id for n in nodes] graph_result = tx.run(""" MATCH (n) WHERE elementId(n) IN $node_ids OPTIONAL MATCH (n)-[r]-(m) WHERE elementId(n) IN $node_ids AND elementId(m) IN $node_ids RETURN n, r, m """, node_ids=node_ids, )return graph_result.graph()
Let’s see what the RAG graph looks like for a specific query. Note the default top_k value is set to 5, but you can adjust it to retrieve more or fewer issues based on your needs.
Show the code
query_string ="What are the dependencies necessary to run Atari environments ?"with driver.session() as session:print(f"Finding RAG graph for query: {query_string}") rag_graph = session.execute_read(get_rag_graph, query_string)print(f"Found {len(rag_graph.nodes)} nodes and {len(rag_graph.relationships)} relationships in the RAG graph." )
Finding RAG graph for query: What are the dependencies necessary to run Atari environments ?
Found 17 nodes and 26 relationships in the RAG graph.
Show the code
# Visualize the RAG graph using Pyvis
rag_net = create_pyvis_network_from_neo4j(rag_graph)
rag_net.toggle_physics(True)
rag_net.show("rag_graph_visualization.html", notebook=True)
rag_graph_visualization.html
The AI agent
Now that we understand how our graph is structured and how to retrieve relevant information from it, we can build an AI agent that can answer questions about issues in our graph. The agent will use the RAG graph to find relevant issues and comments, and then generate a textual summary of the information found.
We will use the Gemini API to interact with a large language model (LLM) that can process the textual summaries and generate answers to user queries. Google provides a convenient Python client library for the Gemini API (if you use Conda, you can install it with conda install google-genai).
First we need to set up the Gemini client with our API key. Make sure you have the GEMINI_API_KEY environment variable set with your key.
Show the code
from google import genai

# Configure the Gemini API key
gemini_api_key = os.getenv("GEMINI_API_KEY")
if not gemini_api_key:
    raise ValueError("GEMINI_API_KEY environment variable not set.")

genai_client = genai.Client(api_key=gemini_api_key)
We also need a method to convert the Neo4j graph into a textual summary that can be passed to the LLM. This summary will include information about the nodes and relationships in the graph, which will help the LLM understand the context of the issues and comments.
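The essence of this conversion is simply iterating over graph.nodes and graph.relationships and rendering each as a line of text. The minimal version below (the name graph_to_text_minimal is just for illustration) shows that core; the full method that follows produces nicer sentence-style descriptions and trims long bodies.

def graph_to_text_minimal(graph: Graph) -> str:
    # Bare-bones version: one line per node (labels plus non-embedding properties),
    # one line per relationship.
    lines = []
    for node in graph.nodes:
        props = {k: v for k, v in dict(node).items() if k != "embedding"}
        lines.append(f"{':'.join(node.labels)} {props}")
    for rel in graph.relationships:
        lines.append(
            f"({rel.start_node.element_id})-[:{rel.type}]->({rel.end_node.element_id})"
        )
    return "\n".join(lines)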
Show the code
def graph_to_textual_summary(graph: Graph) ->str:"""Converts a Neo4j graph into a descriptive narrative summary for an LLM."""# Use the driver-provided Graph object, which has .nodes and .relationshipsifnot graph.nodes:return"No information found for the query." descriptions = []# Describe each node in a natural-language sentencefor node in graph.nodes: labels =sorted(node.labels) props =dict(node)# Choose a human-friendly identifier identifier = ( props.get("title")or props.get("name")or props.get("login")or props.get("id")orstr(node.id) ) label_str =" and ".join(labels) if labels else"Node"# Build the sentence sentence =f"A {label_str} node identified as '{identifier}'"# Add other descriptive properties extras = []for key, value in props.items():if key in ("title", "name", "login", "id", "embedding"):continue text =str(value)if key =="body"and value andlen(value) >600: text = value[:600] +"..." extras.append(f"its {key} is '{text}'") sentence +=f" has {', '.join(extras)}"if extras else"" sentence = sentence.rstrip(", ") +"." descriptions.append(sentence)# Describe relationships in narrative formfor rel in graph.relationships: start = rel.start_node end = rel.end_node start_id = (dict(start).get("title")ordict(start).get("name")ordict(start).get("login")ordict(start).get("id")orstr(start.id) ) end_id = (dict(end).get("title")ordict(end).get("name")ordict(end).get("login")ordict(end).get("id")orstr(end.id) ) sentence =f"There is a relationship of type '{rel.type}' from node '{start_id}' to node '{end_id}'"if"score"in rel: score = rel["score"] sentence +=f" with a similarity score of {score:.2f}" descriptions.append(sentence +".")return"\n".join(descriptions)
We can use an example query to see how the information will be structured for the LLM.
Show the code
example_query ="What are the dependencies necessary to run Atari environments ?"with driver.session() as session: rag_graph = session.execute_read(get_rag_graph, example_query) summary = graph_to_textual_summary(rag_graph)print(summary)
A Issue node identified as 'Added Atari environments to tests, removed dead code' has its number is '78', its closed_at is '2022-10-26T20:41:39.000000000+00:00', its updated_at is '2022-10-26T20:41:40.000000000+00:00', its created_at is '2022-10-26T15:30:13.000000000+00:00', its state is 'closed', its body is '- Adds some atari environments to tested environments (if gym and ale are available)
- Removed definition of `minimum_testing_env_specs`, which was dead code, also didn't make sense (compared specs to strings, I think)
- Atari environments are currently not being tested for `render_mode` because `GymEnvironment` doesn't support that kwarg
Tests are currently not passing locally, which seems to be due to an unrelated problem'.
A User node identified as 'Markus Krimmel' has its followers is '21', its created_at is '2015-11-11T20:17:24.000000000+00:00'.
A Comment node identified as '1292409182' has its updated_at is '2022-10-26T18:02:09.000000000+00:00', its created_at is '2022-10-26T18:02:09.000000000+00:00', its body is 'Yeah, I will add that comment :)
I also considered fetching the ids from the (old) Gym registry, but I decided against it because that registry should not really change, given that Gym is no longer being maintained. Also, I would be somewhat worried that for some reason (e.g. ale not being installed) no Atari envs show up in the registry and the test is silently skipped.
Currently, neither this test, nor `test_gym_conversion` are in CI, because gym isn't being installed.'.
A Issue node identified as '[Question] Does the Pong game have speed in its actions?' has its number is '865', its closed_at is '2024-01-02T19:51:20.000000000+00:00', its updated_at is '2024-01-02T19:51:20.000000000+00:00', its created_at is '2024-01-02T15:56:36.000000000+00:00', its state is 'closed', its body is '### Question
The pong game has 6 basic actions. Noop, fire, right, rightfire, left, left fire. My question is do actions that have fire options (such as right fire) speed up the ball?
According to the AtariAge page, the red button in the actual controller adds some speed. Did you add this feature to the gymnasium?'.
A Issue node identified as 'Fix documentation ci' has its number is '417', its closed_at is '2023-03-30T13:49:19.000000000+00:00', its updated_at is '2023-03-30T13:49:19.000000000+00:00', its created_at is '2023-03-30T13:29:07.000000000+00:00', its state is 'closed', its body is 'https://github.com/Farama-Foundation/Gymnasium/pull/414 caused the documentation CI to fail due to the filter list not working as intended
Additionally, add the new atari environments to the list '.
A Issue node identified as 'Add all atari environments and remove pip install atari from documentation' has its number is '367', its closed_at is '2023-03-08T12:31:43.000000000+00:00', its updated_at is '2023-03-08T12:31:44.000000000+00:00', its created_at is '2023-03-08T12:29:06.000000000+00:00', its state is 'closed', its body is 'Previously, in generating the documentation, atari and autorom was installed which caused issues if the roms failed to install.
However, atari is not generate each time in the documentation so this was just causing issues for no reason
Furthermore, this PR add documentation for all of the atari environments (not including descriptions)'.
A Issue node identified as '[Bug Report] Cannot make an environment in env.registry' has its number is '152', its closed_at is '2022-11-23T14:03:14.000000000+00:00', its updated_at is '2022-11-23T14:53:50.000000000+00:00', its created_at is '2022-11-21T17:24:08.000000000+00:00', its state is 'closed', its body is '### Describe the bug
Hello,
When trying to make an environment in ``gym.registry`` I get a ``NameNotFound`` error, even though the environment should be found as I am picking the name from ``gym.registry``.
```python
Python 3.10.6 (main, Nov 2 2022, 18:53:38) [GCC 11.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gymnasium as gym
>>> gym.make("YarsRevengeNoFrameskip-v4")
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/marc/.venvs/bonsai-gym/lib/python3.10/site-packages/gymnasium/envs/regist...'.
A User node identified as 'Mark Towers' has its followers is '245', its created_at is '2015-10-03T12:42:08.000000000+00:00'.
A User node identified as 'Marco Zatta' has its followers is '0', its created_at is '2021-12-03T07:44:35.000000000+00:00', its company is '@microsoft'.
A Label node identified as 'bug' has its color is 'd73a4a', its description is 'Something isn't working'.
A Comment node identified as '1322786823' has its updated_at is '2022-11-23T14:53:50.000000000+00:00', its created_at is '2022-11-21T23:19:47.000000000+00:00', its body is 'Ale-py has not updated to gymnasium yet sadly.
I will update the release notes.
This will be fixed in the next release, in the meantime you can use `pip install shimmy[atari]` or using the gym compatibility Env (see the website info)'.
A Comment node identified as '1325117034' has its updated_at is '2022-11-23T14:03:14.000000000+00:00', its created_at is '2022-11-23T14:03:14.000000000+00:00', its body is 'Oh sorry, I got confused and did not consider that I was talking about a `gym` environment and not a `gymnasium` one. Sorry for that'.
A User node identified as 'gglsmm' has its followers is '0', its created_at is '2022-04-23T20:17:49.000000000+00:00', its location is 'USA'.
A Label node identified as 'question' has its color is 'd876e3', its description is 'Further information is requested'.
A Comment node identified as '1874411902' has its updated_at is '2024-01-02T18:48:57.000000000+00:00', its created_at is '2024-01-02T18:48:57.000000000+00:00', its body is 'https://gymnasium.farama.org/environments/atari/pong/#actions'.
A Comment node identified as '1874475704' has its updated_at is '2024-01-02T19:51:20.000000000+00:00', its created_at is '2024-01-02T19:51:20.000000000+00:00', its body is '@gglsmm I would guess so and from playing the environment, I believe so, as all the atari environment is a wrapper over the stella emulator which should run the actual pong ROM so it should play identically to the real thing '.
A User node identified as 'Kallinteris Andreas' has its followers is '37', its created_at is '2017-08-05T21:48:59.000000000+00:00'.
There is a relationship of type 'RAISED_BY' from node 'Added Atari environments to tests, removed dead code' to node 'Markus Krimmel'.
There is a relationship of type 'COMMENT_ON' from node '1292409182' to node 'Added Atari environments to tests, removed dead code'.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.80.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.91.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Add all atari environments and remove pip install atari from documentation' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.90.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Bug Report] Cannot make an environment in env.registry' to node 'Added Atari environments to tests, removed dead code' with a similarity score of 0.87.
There is a relationship of type 'RAISED_BY' from node 'Add all atari environments and remove pip install atari from documentation' to node 'Mark Towers'.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Add all atari environments and remove pip install atari from documentation' to node '[Bug Report] Cannot make an environment in env.registry' with a similarity score of 0.86.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Add all atari environments and remove pip install atari from documentation' with a similarity score of 0.84.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node 'Add all atari environments and remove pip install atari from documentation' with a similarity score of 0.93.
There is a relationship of type 'RAISED_BY' from node 'Fix documentation ci' to node 'Mark Towers'.
There is a relationship of type 'MIGHT_RELATE_TO' from node 'Fix documentation ci' to node '[Bug Report] Cannot make an environment in env.registry' with a similarity score of 0.88.
There is a relationship of type 'MIGHT_RELATE_TO' from node '[Question] Does the Pong game have speed in its actions?' to node 'Fix documentation ci' with a similarity score of 0.85.
There is a relationship of type 'RAISED_BY' from node '[Bug Report] Cannot make an environment in env.registry' to node 'Marco Zatta'.
There is a relationship of type 'HAS_LABEL' from node '[Bug Report] Cannot make an environment in env.registry' to node 'bug'.
There is a relationship of type 'COMMENT_ON' from node '1322786823' to node '[Bug Report] Cannot make an environment in env.registry'.
There is a relationship of type 'COMMENT_ON' from node '1325117034' to node '[Bug Report] Cannot make an environment in env.registry'.
There is a relationship of type 'RAISED_BY' from node '[Question] Does the Pong game have speed in its actions?' to node 'gglsmm'.
There is a relationship of type 'HAS_LABEL' from node '[Question] Does the Pong game have speed in its actions?' to node 'question'.
There is a relationship of type 'COMMENT_ON' from node '1874411902' to node '[Question] Does the Pong game have speed in its actions?'.
There is a relationship of type 'COMMENT_ON' from node '1874475704' to node '[Question] Does the Pong game have speed in its actions?'.
There is a relationship of type 'COMMENT_BY' from node '1292409182' to node 'Markus Krimmel'.
There is a relationship of type 'COMMENT_BY' from node '1874475704' to node 'Mark Towers'.
There is a relationship of type 'COMMENT_BY' from node '1322786823' to node 'Mark Towers'.
There is a relationship of type 'COMMENT_BY' from node '1325117034' to node 'Marco Zatta'.
There is a relationship of type 'COMMENT_BY' from node '1874411902' to node 'Kallinteris Andreas'.
This might seem like a lot of information, and it is not particularly easy to follow. Keep in mind, though, that it is not meant for human readers: it is structured so that the LLM can process it and generate coherent answers from it.
Tools for the agent
A key aspect of building an AI agent is defining the tools it can use to autonomously interact with, and potentially modify, its environment. In our case we will provide the agent with two tools: find_issues_from_prompt and find_experts. The first allows the agent to find issues in the graph based on a user prompt, while the second helps it identify potential experts based on their interactions with relevant issues.
In many cases tools just return further information to the agent, which it can then use to generate a response. However, in some cases the agent might need to take actions based on the information it retrieves, such as creating new issues or updating existing ones. In our case, we will focus on retrieving information and generating summaries.
Show the code
def find_issues_from_prompt(query_string: str) -> dict:
    """
    Finds potential issues from a user prompt, gets the graph matching the prompt,
    and returns a textual summary of the graph.

    Args:
        query_string: The user's query about issues.

    Returns:
        A dictionary containing a summary of the retrieved graph data.
    """
    print(f"Agent is calling find_issues_from_prompt with query: '{query_string}'")
    with driver.session() as session:
        rag_graph = session.execute_read(get_rag_graph, query_string)
        if rag_graph:
            print(
                f"Found {len(rag_graph.nodes)} nodes and {len(rag_graph.relationships)} relationships."
            )
            summary = graph_to_textual_summary(rag_graph)
            return {"summary": summary}
        else:
            return {"summary": "Could not find any relevant information in the graph."}
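Conceptually, find_experts only needs to count how often each user appears around the issues most relevant to the query. The stripped-down version below (find_experts_minimal is an illustrative name, not the article's tool) ranks users by how many of the top matching issues they raised or commented on; the full tool shown next also folds in the MIGHT_RELATE_TO neighbours of the best match and reports per-issue detail.

def find_experts_minimal(query_string: str, top_k: int = 5) -> list:
    # Rank users by how many of the top matching issues they raised or commented on.
    query_embedding = embedding_model.embed_batch([query_string])[0].tolist()
    with driver.session() as session:
        result = session.run(
            """
            CALL db.index.vector.queryNodes('issue_embeddings', $top_k, $embedding)
            YIELD node AS issue, score
            OPTIONAL MATCH (issue)-[:RAISED_BY]->(u1:User)
            OPTIONAL MATCH (issue)<-[:COMMENT_ON]-(:Comment)-[:COMMENT_BY]->(u2:User)
            WITH issue, collect(DISTINCT u1) + collect(DISTINCT u2) AS users
            UNWIND users AS user
            WITH user.login AS login, count(DISTINCT issue) AS interactions
            RETURN login, interactions
            ORDER BY interactions DESC
            LIMIT 5
            """,
            embedding=query_embedding,
            top_k=top_k,
        )
        return [dict(record) for record in result]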
Show the code
def find_experts(query_string: str) ->dict:""" Finds potential experts on a topic by analyzing who has contributed to the most relevant issues. Args: query_string: The user's query describing the topic of interest. Returns: A dictionary containing a summary of potential experts. """print(f"Agent is calling find_experts with query: '{query_string}'")# Embed the query string query_embedding = embedding_model.embed_batch([query_string])[0].tolist()# Find experts in the graphwith driver.session() as session: result = session.run(""" // Find the top matching issue for the query embedding CALL db.index.vector.queryNodes('issue_embeddings', 1, $embedding) YIELD node AS top_issue // Collect the top issue and up to 5 of its most similar issues WITH top_issue OPTIONAL MATCH (top_issue)-[r:MIGHT_RELATE_TO]-(related_issue:Issue) WITH top_issue, related_issue, r.score as score ORDER BY score DESC WITH top_issue, collect(related_issue)[..5] AS related_issues WITH [top_issue] + related_issues AS all_issues UNWIND all_issues as issue // Find all users who have interacted with these issues OPTIONAL MATCH (issue)<-[:RAISED_BY]-(u1:User) OPTIONAL MATCH (issue)<-[:ASSIGNED_TO]-(u2:User) OPTIONAL MATCH (issue)<-[:COMMENT_ON]-(:Comment)-[:COMMENT_BY]->(u3:User) // Aggregate and rank the users WITH issue, u1, u2, u3 WITH collect(u1) + collect(u2) + collect(u3) as users, issue UNWIND users as user WITH user, count(issue) as interactions, collect(DISTINCT {id: issue.id, title: issue.title}) as issues ORDER BY interactions DESC LIMIT 5 RETURN collect({user: user.login, interactions: interactions, issues: issues}) as experts """, embedding=query_embedding, ) experts = result.single()["experts"]if experts: summary ="Found the following potential experts based on their interactions with relevant issues:\n\n"for expert in experts: summary +=f"- User: {expert['user']} (Interactions: {expert['interactions']})\n"for issue in expert["issues"]: summary += (f" - Interacted with issue #{issue['id']}: {issue['title']}\n" )return {"summary": summary}else:return {"summary": "Could not find any potential experts for this topic."}
With these tools defined, we can now create an AI agent that can use them to answer user queries about issues in the graph. The agent will be able to find relevant issues based on user prompts and summarize the information found, as well as identify potential experts based on their interactions with relevant issues.
Note that the system_instruction in the GenerateContentConfig is crucial: it defines the agent's role and how it should use the tools provided. The agent is instructed to formulate its response strictly from the information returned by the tools, so that it does not make assumptions or generate information that is not present in the graph.
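Wiring the tools into the model uses the google-genai SDK's automatic function calling: plain Python functions are passed in tools, and the SDK inspects their signatures and docstrings and drives the tool-call loop. The reduced sketch below shows that configuration on its own; the system instruction is shortened here for illustration, while the full method that follows uses the detailed instruction described above.

from google.genai import types

# Reduced version of the agent call: pass the Python tool functions directly.
config = types.GenerateContentConfig(
    system_instruction="You answer questions about GitHub issues using the provided tools.",
    temperature=0.4,
    tools=[find_issues_from_prompt, find_experts],
)
response = genai_client.models.generate_content(
    model="gemini-2.5-flash",
    config=config,
    contents="What are the dependencies necessary to run Atari environments ?",
)
print(response.text)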
Show the code
from google.genai import typesfrom IPython.display import display, Markdowndef converse_with_agent(user_prompt: str) ->str:""" Converse with the agent using a user prompt. Args: user_prompt: The user's query to the agent. Returns: The agent's response as a string. """ config = types.GenerateContentConfig( system_instruction=f"""You are an expert agent that can find issues in a Neo4j graph database based on user prompts for the {repo_name} repository. Use the find_issues_from_prompt tool to retrieve relevant issues, summarize them into two sections: ### Summary Provide a concise summary of the issues found in the graph based on the user prompt. ### Potential Issues List the potential issues that match the user prompt, including relevant details such as issue titles, labels, and any other pertinent information. ### Advice Provide any advice or recommendations based on the issues found in the graph. When using the find_experts tool, summarize the potential experts based on their interactions with relevant issues as a table, including their usernames and interaction counts, and a list of relevant issue titles and ID's. Strictly use only the information provided by the tool to formulate your response.""", temperature=0.4, tools=[find_issues_from_prompt, find_experts], ) response = genai_client.models.generate_content( model="gemini-2.5-flash", config=config, contents=user_prompt )ifnot response.candidates ornot response.candidates[0].content.parts:return"No response generated by the agent."return response.candidates[0].content.parts[0].text
Example interactions
Let’s test our agent with a few example queries. The agent will use the tools defined earlier to find relevant issues and provide a summary of the information found in the graph.
Show the code
user_prompt ="What are the dependencies necessary to run Atari environments ?"print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: What are the dependencies necessary to run Atari environments ?
Agent is calling find_issues_from_prompt with query: 'What are the dependencies necessary to run Atari environments ?'
Found 17 nodes and 26 relationships.
Agent Response:
Summary
The issues found indicate that running Atari environments in Gymnasium requires specific dependencies, primarily gym and ale-py. There have been discussions and fixes related to ensuring these environments are properly tested and documented, and issues have arisen when ale-py was not updated to support Gymnasium.
Potential Issues
Issue 78: Added Atari environments to tests, removed dead code
Details: This issue, raised by Markus Krimmel, mentions that Atari environments are added to tests “if gym and ale are available.” This directly points to gym and ale (likely referring to ale-py) as necessary dependencies.
Issue 152: [Bug Report] Cannot make an environment in env.registry
Labels: bug
Details: Raised by Marco Zatta, this bug report highlights that ale-py had not been updated to Gymnasium, preventing the creation of Atari environments. A comment on this issue suggests using pip install shimmy[atari] as a workaround or utilizing the Gym compatibility environment. This strongly indicates ale-py as a crucial dependency.
Issue 367: Add all atari environments and remove pip install atari from documentation
Details: Raised by Mark Towers, this issue discusses the installation of atari and autorom for documentation generation, which caused issues if ROMs failed to install. While the issue aimed to remove pip install atari from documentation generation, it implies that atari (and potentially autorom) were considered dependencies at some point for these environments.
Issue 417: Fix documentation ci
Details: Raised by Mark Towers, this issue is related to fixing documentation CI failures and adding new Atari environments to a list, suggesting that the proper setup for Atari environments impacts documentation and CI processes.
Issue 865: [Question] Does the Pong game have speed in its actions?
Labels: question
Details: Raised by gglsmm, this question about Pong game actions mentions that “all the atari environment is a wrapper over the stella emulator which should run the actual pong ROM.” This implies that the underlying Atari environments rely on an emulator like Stella and the corresponding ROMs.
Advice
To run Atari environments in Gymnasium, you will primarily need gym and ale-py. Ensure that ale-py is compatible with your Gymnasium version. If you encounter issues with environment creation, consider using pip install shimmy[atari] or exploring the Gym compatibility environment as mentioned in Issue 152. Additionally, be aware that the Atari environments are built upon emulators like Stella and require the relevant ROMs.
Show the code
user_prompt ="How do I make sure random number generation is seeded properly for experiment consistency ?"print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: How do I make sure random number generation is seeded properly for experiment consistency ?
Agent is calling find_issues_from_prompt with query: 'random number generation seeded properly for experiment consistency'
Found 69 nodes and 125 relationships.
Agent Response:
Summary
The issues found highlight the challenges and ongoing efforts in ensuring consistent and reproducible random number generation within the Gymnasium environment, particularly concerning the seeding of environments and the ability to retrieve the seed used. There’s a recognized need for better control over seeding in vectorized environments and for making the active seed easily accessible for debugging and reproducibility.
Potential Issues
Support list of options in VectorEnv.reset() (Issue #113): This closed issue indicates a past attempt to allow passing a list of seeds (and options) to VectorEnv.reset(). While it was deemed not feasible to implement cleanly at the time, it points to a desire for more fine-grained control over individual sub-environment seeding within a vectorized setup. This could be a potential area for future development if experiment consistency across multiple parallel environments is a high priority.
Made readout of seed possible in env (Issue #889): This issue addresses the difficulty of extracting the random seed used by the environment’s np_random object. It proposes adding a np_random_seed property to the environment, allowing users to read the seed that was set during reset(). This is crucial for verifying and recording the exact seed used for an experiment, which is fundamental for reproducibility. The discussion also touches upon potential inconsistencies if env.np_random is directly set by the user.
Check the determinism of env.reset(seed=42); env.reset() (Issue #1086): This issue focuses on ensuring that env.reset() behaves deterministically when a seed is provided. It highlights a scenario where an environment might generate random observations not based on the internal np_random object, leading to non-deterministic behavior even if a seed is passed. This is a direct threat to experiment consistency and emphasizes the importance of correctly implementing random number generation within custom environments.
Advice
Based on the issues, here’s some advice for ensuring random number generation is seeded properly for experiment consistency:
Always use the seed parameter in env.reset(): This is the primary and most reliable way to seed the environment’s internal random number generator. Ensure that you pass a consistent seed value for reproducible experiments.
Understand np_random_seed (Issue #889): If you need to verify the seed used by an environment, especially within complex frameworks or parallel processes, the np_random_seed property (introduced in Issue #889) can be very useful. Be aware of its behavior, particularly if you are directly manipulating env.np_random.
Ensure all random operations are tied to env.np_random (Issue #1086): When creating custom Gymnasium environments, it is critical that all random number generation within the environment (e.g., for initial states, observations, or internal dynamics) uses the self.np_random object provided by the Gymnasium API. If you use random or numpy.random directly without linking it to self.np_random, your environment will not be deterministic even if env.reset(seed=...) is called.
Consider the implications for vectorized environments (Issue #113): While direct support for lists of seeds in VectorEnv.reset() was deemed complex, be mindful of how individual sub-environments are seeded within your vectorized setups. If you require specific seeding for each sub-environment, you might need to manage this externally or explore wrappers that provide such functionality.
Test for determinism: Actively test your environments for determinism by resetting with the same seed multiple times and verifying that the initial states and subsequent transitions are identical. The example provided in Issue #1086 (check_env from gymnasium.utils.env_checker) can be a good starting point for such tests.
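To make the agent’s advice concrete, a determinism check along these lines might look like the sketch below. It uses a standard Gymnasium environment and is not code taken from the repository’s issues.

```python
# Minimal determinism check: resetting with the same seed should give the same initial state.
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

obs_a, _ = env.reset(seed=42)
obs_b, _ = env.reset(seed=42)
assert np.allclose(obs_a, obs_b), "reset(seed=...) should produce identical initial observations"

# In a custom environment, draw all randomness from self.np_random
# (e.g. self.np_random.uniform(-1.0, 1.0)) rather than from the global
# `random` or `numpy.random` modules, so that the seed actually takes effect.
```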
Show the code
user_prompt = ("Who could I reach out to for help with an issue with Atari environments ?")print(f"User prompt: {user_prompt}\n")response = converse_with_agent(user_prompt)print("\nAgent Response:")boxed_md =f"""::: callout-note{response}:::"""display(Markdown(boxed_md))
User prompt: Who could I reach out to for help with an issue with Atari environments ?
Agent is calling find_experts with query: 'Atari environments'
Agent Response:
Potential Experts

| Username | Interaction Count |
|---|---|
| dylwil3 | 5 |
| pseudo-rnd-thoughts | 3 |
| Markus28 | 1 |
Relevant Issues
Add missing descriptions to Atari docs (ID: #1715976674)
[Bug Report] The Atari doc is missing some information (ID: #1575650136)
Added Atari environments to tests, removed dead code (ID: #1424260214)
What can we improve ?
The above results are quite promising, but there are several areas where we can improve the agent’s performance. A key one is chunking issue and comment bodies before computing embeddings, which would let us handle longer texts and provide richer context to the agent. We can also improve the way we summarize the graph data, making it more concise and easier for the LLM to process.
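As an illustration of the chunking idea, a simple word-based splitter could be applied to each issue body before embedding. This is only a sketch: chunk_text and the chunk sizes are hypothetical, and it reuses the embedding_model.embed_batch call from earlier.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split a long text into overlapping, word-based chunks (illustrative sizes)."""
    words = text.split()
    if len(words) <= max_words:
        return [text] if text else []
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start : start + max_words]))
        start += max_words - overlap
    return chunks

# Embed every chunk of an issue body instead of a single (possibly truncated) embedding.
# `issue_body` stands in for the text of one issue; `embedding_model` is the one used earlier.
chunks = chunk_text(issue_body)
chunk_embeddings = embedding_model.embed_batch(chunks)
```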
We can also enhance the agent’s ability to reason about the graph data by refining the system instruction and by giving the agent concrete examples of how and when to use each tool in different scenarios; a sketch of this follows below.
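One lightweight way to do this, shown here as a sketch rather than the configuration used above, is to append a short worked example of the expected tool usage and answer format to the system instruction. Here base_system_instruction is hypothetical shorthand for the system prompt defined earlier.

```python
few_shot_example = """
Example:
User: "Are there known issues with rendering on macOS ?"
Expected behaviour: call find_issues_from_prompt with the user's question, then answer using only
the returned issues, following the ### Summary / ### Potential Issues / ### Advice structure.
"""

config = types.GenerateContentConfig(
    # `base_system_instruction` stands in for the system prompt shown earlier.
    system_instruction=base_system_instruction + few_shot_example,
    temperature=0.4,
    tools=[find_issues_from_prompt, find_experts],
)
```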