Basics of Word Vectors

Understanding what word vectors are, and what they mean in modern natural language processing

Experiments
NLP
Machine Learning
Published

May 22, 2023

Word vectors are a mainstay of NLP and are used in a variety of tasks, from sentiment analysis to machine translation. In this experiment, we will explore the very basics of word vectors and how they can be used to represent words in a way that captures their meaning. Word vector models represent words as vectors in a high-dimensional space, where the distance between vectors captures the similarity and relationships between words, as observed in the training corpus.

For simplicity, we will use the gensim library and a ready-made word vector model: glove-wiki-gigaword-50, a 50-dimensional GloVe model trained on the Wikipedia 2014 and Gigaword 5 corpora.
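
If you are curious about what other pretrained embeddings are available, gensim's downloader can list them. A minimal sketch (the exact set of models depends on your gensim version):

Show the code
import gensim.downloader as api

# List the pretrained models that gensim's downloader knows about
info = api.info()
print(sorted(info["models"].keys()))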

Let’s start by loading the model.

Show the code
import os
import sys
import contextlib
import gensim.downloader as api


# Define a context manager to suppress stdout and stderr
@contextlib.contextmanager
def suppress_stdout_stderr():
    """A context manager that redirects stdout and stderr to devnull"""
    with open(os.devnull, "w") as fnull:
        old_stdout = sys.stdout
        old_stderr = sys.stderr
        sys.stdout = fnull
        sys.stderr = fnull
        try:
            yield
        finally:
            sys.stdout = old_stdout
            sys.stderr = old_stderr


# Use the context manager to suppress output from the model download
with suppress_stdout_stderr():
    model = api.load("glove-wiki-gigaword-50")
print(model)
KeyedVectors<vector_size=50, 400000 keys>

Using word vectors for question answering

We will use this model to explore the relationships between words. Let us start with a simple analogy problem: “Brazil is to Portugal as X is to Spain”. We will use word vectors to estimate possible candidates for X.

Show the code
# Calculate the "br - pt + es" vector and find the closest word
result = model.most_similar(positive=["brazil", "spain"], negative=["portugal"], topn=1)
print(result)
result_word = result[0][0]
# Print the shape of the result vector
dimensions = model[result[0][0]].shape[0]
print("Number of vector dimensions: ", dimensions)
[('mexico', 0.8388405442237854)]
Number of vector dimensions:  50

Great! We now have a candidate word for X along with a cosine similarity score. Also notice that the word vector returned by the model has 50 dimensions.

Show the code
print(model[result[0][0]])
[ 4.1189e-01 -9.3082e-02 -1.8871e-01  1.0692e+00 -5.3433e-01 -9.9734e-01
 -5.5511e-01  3.2821e-01  1.3527e-01  5.6191e-01  4.1530e-01 -7.6649e-01
  6.4603e-01 -1.8546e-01  7.8123e-01 -8.9799e-01 -4.9891e-01 -2.2080e-01
 -1.3527e-01  1.8387e-01 -7.1566e-01 -7.6888e-01 -8.0441e-02  4.9022e-04
 -5.2532e-01 -1.8385e+00  1.7512e-01  3.3196e-01  2.4221e-01 -5.6305e-01
  2.6126e+00 -2.3889e-03 -5.4640e-01 -7.3415e-01 -4.6359e-01 -7.5214e-01
 -1.2924e+00  2.6360e-01  2.8462e-01  2.6416e-02 -1.0242e+00  5.7252e-01
  1.4757e+00 -1.2457e+00 -6.2902e-01  4.0549e-01 -4.1026e-01 -6.0271e-01
  1.3786e-01 -7.9502e-02]

These numbers encode a lot of meaning about the word ‘mexico’. In general, the more dimensions a word vector model has, the more semantic information it can represent!
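
To see that distances in this space really do track relatedness, we can compare cosine similarities between word pairs directly. A small sketch (the word choices are arbitrary, picked only to contrast a related and an unrelated pair):

Show the code
# Cosine similarity between a related and an unrelated pair of words
print(model.similarity("mexico", "spain"))     # two closely related countries
print(model.similarity("mexico", "keyboard"))  # unrelated concepts, should score lower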

Visualising word vectors

Now let us attempt to visualise the relationships between these vector representations - we will compare the result of the actual vector arithmetic with the estimate returned by gensim’s most_similar method. We first need the vector representations for all the words (“portugal”, “brazil”, “spain” and “mexico”) so we can plot their proximity.

Show the code
import numpy as np

# Calculate the "brazil + spain - portugal" vector
true_vector = model["brazil"] + model["spain"] - model["portugal"]

words = ["portugal", "spain", "brazil", result_word]

# Get vectors for each word
vectors = np.array([model[w] for w in words])
vectors = np.vstack(
    [vectors, true_vector]
)  # Add the true vector to the list of vectors
words += ["brazil + spain - portugal"]  # Add the label for the true vector

Now, how do we visualise 50 dimensions? We’ll need to reduce the dimensionality of our vector space to something manageable!

PCA or t-SNE?

In this case, we will use Principal Component Analysis (PCA), a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. A better approach would be t-SNE, but given we have a tiny number of samples, it makes little difference here.

Show the code
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D  # enables 3D plotting
from sklearn.decomposition import PCA

# Perform PCA to reduce to 3 dimensions
pca = PCA(n_components=3)
reduced_vectors = pca.fit_transform(vectors)

# Generate colors using matplotlib's tab10 colormap
colors = plt.cm.tab10(np.linspace(0, 1, len(words)))

# Create a 3D figure
fig = plt.figure(figsize=(8, 6))
ax = fig.add_subplot(111, projection="3d")

# Plot scatter points with text labels
for i, word in enumerate(words):
    # Scatter each point
    ax.scatter(
        reduced_vectors[i, 0],
        reduced_vectors[i, 1],
        reduced_vectors[i, 2],
        color=colors[i],
        s=50,
    )
    # Annotate the point with its corresponding word
    ax.text(
        reduced_vectors[i, 0],
        reduced_vectors[i, 1],
        reduced_vectors[i, 2],
        word,
        fontsize=9,
        color=colors[i],
    )

# Optionally add lines from the origin to each point
for i, word in enumerate(words):
    linestyle = "dashed" if word.lower() == "mexico" else "solid"
    ax.plot(
        [0, reduced_vectors[i, 0]],
        [0, reduced_vectors[i, 1]],
        [0, reduced_vectors[i, 2]],
        color=colors[i],
        linestyle=linestyle,
    )

# Set axis labels and title
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.set_zlabel("Z Axis")
ax.set_title("3D PCA Projection of Word Vectors")

plt.show()

Notice how the “true” vector (the ‘brazil + spain - portugal’ edge) doesn’t seem to align with, or even be near, “mexico”? This can be explained by the dimensionality reduction: the original space has far more than three dimensions, and the projection does not capture the full complexity of the data. Take the plot above as a mere illustration.
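
If you want to quantify how much information the projection throws away, the fitted PCA object exposes the fraction of variance captured by each component. A quick check, reusing the pca object from above (with only five vectors the numbers will be generous, but pairwise geometry can still be distorted):

Show the code
# Fraction of the original variance captured by each of the three principal components
print(pca.explained_variance_ratio_)
print("Total variance explained:", pca.explained_variance_ratio_.sum())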

Now, to offer a different visualisation, let us produce a 3D plot of a variety of countries. Additionally, we will cluster the countries into separate groups using KMeans. Can you discern how the algorithm decided to group the different countries?

Show the code
from sklearn.manifold import TSNE
from sklearn.cluster import KMeans
import matplotlib.patches as mpatches

# List of countries
countries = [
    "afghanistan",
    "algeria",
    "angola",
    "argentina",
    "australia",
    "austria",
    "azerbaijan",
    "bahrain",
    "bangladesh",
    "barbados",
    "belarus",
    "belgium",
    "bhutan",
    "bolivia",
    "botswana",
    "brazil",
    "brunei",
    "bulgaria",
    "canada",
    "chile",
    "china",
    "colombia",
    "cuba",
    "cyprus",
    "denmark",
    "djibouti",
    "ecuador",
    "egypt",
    "estonia",
    "ethiopia",
    "finland",
    "france",
    "gambia",
    "germany",
    "ghana",
    "greece",
    "guinea",
    "guinea-bissau",
    "guyana",
    "honduras",
    "hungary",
    "iceland",
    "india",
    "indonesia",
    "iran",
    "iraq",
    "ireland",
    "israel",
    "italy",
    "jamaica",
    "japan",
    "kenya",
    "kuwait",
    "kyrgyzstan",
    "lebanon",
    "lesotho",
    "libya",
    "lithuania",
    "luxembourg",
    "madagascar",
    "malawi",
    "malaysia",
    "maldives",
    "mali",
    "malta",
    "mauritania",
    "mauritius",
    "mexico",
    "micronesia",
    "moldova",
    "monaco",
    "mongolia",
    "montenegro",
    "morocco",
    "mozambique",
    "namibia",
    "netherlands",
    "nicaragua",
    "niger",
    "nigeria",
    "norway",
    "oman",
    "pakistan",
    "panama",
    "paraguay",
    "peru",
    "philippines",
    "poland",
    "portugal",
    "qatar",
    "romania",
    "russia",
    "rwanda",
    "samoa",
    "senegal",
    "serbia",
    "singapore",
    "slovakia",
    "slovenia",
    "somalia",
    "spain",
    "sweden",
    "switzerland",
    "tanzania",
    "thailand",
    "tunisia",
    "turkey",
    "turkmenistan",
    "uganda",
    "ukraine",
    "uruguay",
    "venezuela",
    "vietnam",
    "yemen",
    "zambia",
    "zimbabwe",
]

# Look up the vector for each country in the pre-trained model
vectors = np.array([model[country] for country in countries])

# Perform t-SNE to reduce to 3 dimensions
tsne = TSNE(n_components=3, random_state=42)
reduced_vectors = tsne.fit_transform(vectors)

# Cluster the reduced vectors into groups using KMeans
num_clusters = 8
kmeans = KMeans(n_clusters=num_clusters, random_state=42)
clusters = kmeans.fit_predict(reduced_vectors)

# Extract coordinates
xs = reduced_vectors[:, 0]
ys = reduced_vectors[:, 1]
zs = reduced_vectors[:, 2]

# Create a 3D figure
fig = plt.figure(figsize=(9, 6))
ax = fig.add_subplot(111, projection="3d")

# Plot scatter points, coloring by cluster using a discrete colormap (tab10)
sc = ax.scatter(xs, ys, zs, c=clusters, cmap="tab10", s=50, alpha=0.8)

# Create a legend with colored patches for each cluster
handles = []
cmap = plt.cm.tab10
for cluster in range(num_clusters):
    color = cmap(cluster)
    patch = mpatches.Patch(color=color, label=f"Cluster {cluster}")
    handles.append(patch)

ax.legend(handles=handles, title="Cluster")

# Annotate each point with the country name
for i, country in enumerate(countries):
    ax.text(xs[i], ys[i], zs[i], country, fontsize=8, ha="center", va="bottom")

# Set axis labels and title
ax.set_xlabel("X Axis")
ax.set_ylabel("Y Axis")
ax.set_zlabel("Z Axis")
ax.set_title("3D t-SNE Projection of Country Word Vectors (Colored by Cluster)")

plt.show()
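
The 3D plot can be hard to read with this many labels, so it helps to print the cluster membership as text. A short sketch, reusing the clusters array computed above:

Show the code
# List the countries assigned to each KMeans cluster
for cluster_id in range(num_clusters):
    members = [c for c, label in zip(countries, clusters) if label == cluster_id]
    print(f"Cluster {cluster_id}: {', '.join(members)}")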

Answering further questions

Finally, let us investigate a few more analogy questions to see what the model returns.

Show the code
# Codfish is to Portugal as ? is to Spain
result = model.most_similar(
    positive=["spain", "codfish"], negative=["portugal"], topn=1
)
print(result)

# Barcelona is to Spain as ? is to Portugal
result = model.most_similar(
    positive=["portugal", "barcelona"], negative=["spain"], topn=1
)
print(result)

# Lisbon is to Portugal as ? is to Britain
result = model.most_similar(
    positive=["britain", "lisbon"], negative=["portugal"], topn=1
)
print(result)

# Stalin is to Russia as ? is to China
result = model.most_similar(positive=["china", "stalin"], negative=["russia"], topn=1)
print(result)
[('fritters', 0.6981065273284912)]
[('porto', 0.8763006925582886)]
[('london', 0.7939333319664001)]
[('mao', 0.8245931267738342)]
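
These are only the single best matches. To get a feel for how decisive each answer is, you can ask for more candidates; here is a quick sketch for the last analogy (the scores are cosine similarities):

Show the code
# Inspect the top five candidates for "Stalin is to Russia as ? is to China"
candidates = model.most_similar(positive=["china", "stalin"], negative=["russia"], topn=5)
for word, score in candidates:
    print(f"{word}: {score:.3f}")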

Final remarks

Word vectors are a powerful tool in NLP: they capture the meaning of words in a high-dimensional space and can be used to estimate relationships between words, supporting tasks from sentiment analysis to machine translation. In this experiment, we used the gensim library and a pre-trained word vector model to estimate relationships between words, and explored their use in a simple question answering task.

Reuse

This work is licensed under CC BY.