Instance vs Model Learning

a comparison of two machine learning approaches

Experiments
Machine Learning
Published

March 2, 2024

Instance-based machine learning and model-based machine learning are two broad categories of machine learning algorithms that differ in their approach to learning and making predictions.

Instance-based learning algorithms, also known as lazy learning algorithms, do not build an explicit model from the training data. Instead, they store the entire training set and make predictions based on the similarity between new data points and the stored training data. Examples of instance-based learning algorithms include k-nearest neighbors, locally weighted learning, and instance-based learning algorithms.

Model-based learning algorithms, on the other hand, build an explicit model from the training data. This model can be used to make predictions on new data points. Examples of model-based learning algorithms include linear regression, logistic regression, and decision trees.

One of the key differences between instance-based and model-based learning algorithms is the way they handle unseen data. Instance-based learning algorithms make predictions based on the similarity between new data points and the stored training data. This means that they can make accurate predictions on unseen data, even if the data is not linearly separable. However, instance-based learning algorithms can be computationally expensive, especially when the training set is large.

Model-based learning algorithms, on the other hand, make predictions based on the model that has been built from the training data. This means that they can make accurate predictions on unseen data, even if the data is not linearly separable. However, model-based learning algorithms can be less accurate than instance-based learning algorithms on small training sets.

About Linearly Separable

Linearly separable refers to a scenario in data classification where two sets of points in a feature space can be completely separated by a straight line (in 2D), a plane (in 3D), or a hyperplane in higher dimensions. Essentially, if you can draw a line (or its higher-dimensional analog) such that all points of one class fall on one side of the line and all points of the other class fall on the other side, those points are considered linearly separable.

Another key difference between instance-based and model-based learning algorithms is the way they handle noise in the training data. Instance-based learning algorithms are more robust to noise in the training data than model-based learning algorithms. This is because instance-based learning algorithms do not build an explicit model from the training data. Instead, they store the entire training set and make predictions based on the similarity between new data points and the stored training data. This means that they are less likely to be affected by noise.

Model-based learning algorithms, on the other hand, are less robust to noise. This is because model-based learning algorithms build an explicit model from the training data. This model can be affected by noise, which can lead to inaccurate predictions on new data points.

An example instance approach predictor

To illustrate the above, let us build a simple instance based predictor - in this case, based on the California Housing dataset which can be found both on Keras and SKLearn. This predictor will attempt to “guess” the median house price for a given California district census block group.

Let us start by getting a sense of what the dataset is about.

Show the code
import numpy as np
from sklearn.datasets import fetch_california_housing
import matplotlib.pyplot as plt
import pandas as pd


# Load the California housing dataset
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)
print(housing.DESCR)
housing_df["MedianHouseValue"] = housing.target
housing_df.plot(
    kind="scatter",
    x="Longitude",
    y="Latitude",
    alpha=0.4,
    s=housing_df["Population"] / 100,
    label="Population",
    figsize=(10, 7),
    c="MedianHouseValue",
    cmap=plt.get_cmap("jet"),
    colorbar=True,
)
plt.legend()
.. _california_housing_dataset:

California Housing dataset
--------------------------

**Data Set Characteristics:**

:Number of Instances: 20640

:Number of Attributes: 8 numeric, predictive attributes and the target

:Attribute Information:
    - MedInc        median income in block group
    - HouseAge      median house age in block group
    - AveRooms      average number of rooms per household
    - AveBedrms     average number of bedrooms per household
    - Population    block group population
    - AveOccup      average number of household members
    - Latitude      block group latitude
    - Longitude     block group longitude

:Missing Attribute Values: None

This dataset was obtained from the StatLib repository.
https://www.dcc.fc.up.pt/~ltorgo/Regression/cal_housing.html

The target variable is the median house value for California districts,
expressed in hundreds of thousands of dollars ($100,000).

This dataset was derived from the 1990 U.S. census, using one row per census
block group. A block group is the smallest geographical unit for which the U.S.
Census Bureau publishes sample data (a block group typically has a population
of 600 to 3,000 people).

A household is a group of people residing within a home. Since the average
number of rooms and bedrooms in this dataset are provided per household, these
columns may take surprisingly large values for block groups with few households
and many empty houses, such as vacation resorts.

It can be downloaded/loaded using the
:func:`sklearn.datasets.fetch_california_housing` function.

.. rubric:: References

- Pace, R. Kelley and Ronald Barry, Sparse Spatial Autoregressions,
  Statistics and Probability Letters, 33 (1997) 291-297

And now that we have a reasonable idea of what data we are dealing with, let us define the predictor. In our case, we will be using a k-Nearest Neighbors regressor set for 10 neighbour groups.

About k-Nearest Neighbors

The k-nearest neighbors (k-NN) regressor is a straightforward method used in machine learning for predicting the value of an unknown point based on the values of its nearest neighbors. Imagine you’re at a park trying to guess the age of a tree you’re standing next to but have no idea how to do it. What you can do, however, is look at the nearby trees whose ages you do know. You decide to consider the ages of the 3 trees closest to the one you’re interested in. If those trees are 50, 55, and 60 years old, you might guess that the tree you’re looking at is around 55 years old—the average age of its “nearest neighbors.”

Show the code
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
import random
import torch

X, y = housing.data, housing.target

# Initialize and seed random number generators
seed = 42
np.random.seed(seed)
random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
    torch.cuda.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
if torch.backends.mps.is_available():
    torch.manual_seed(seed)


# Load and split the California housing dataset
X_train_full, X_test, y_train_full, y_test = train_test_split(
    X, y, test_size=0.2, random_state=seed
)
X_train, X_valid, y_train, y_valid = train_test_split(X_train_full, y_train_full)

# Scale the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_valid_scaled = scaler.transform(X_valid)
X_test_scaled = scaler.transform(X_test)

# Initialize the k-NN regressor
knn_reg = KNeighborsRegressor(n_neighbors=10)

# Train the k-NN model
knn_reg.fit(X_train_scaled, y_train)

# Evaluate the model
score = knn_reg.score(
    X_test_scaled, y_test
)  # This returns the R^2 score of the prediction

# Making predictions
predictions = knn_reg.predict(X_test_scaled[:5])

# Calculate relative differences as percentages
relative_differences = ((predictions - y_test[:5]) / y_test[:5]) * 100

print(f"Model R^2 score: {score}")
print(f"Predictions for first 5 instances: {predictions}")
print(f"Actual values for first 5 instances: {y_test[:5]}")
print(f"Relative differences (%): {relative_differences}")
Model R^2 score: 0.6863935783148378
Predictions for first 5 instances: [0.5714   0.6634   4.395005 2.6315   2.5254  ]
Actual values for first 5 instances: [0.477   0.458   5.00001 2.186   2.78   ]
Relative differences (%): [ 19.79035639  44.84716157 -12.1000758   20.37968893  -9.15827338]

We can see that based on the above regressor, our \(R^2\) is about 0.68. What this means is that 68% of the variance in the target variable can be explained by the features used in the model. In practical terms, this indicates a moderate to good fit, depending on the context and the complexity of the problem being modeled. However, it also means that 32% of the variance is not captured by the model, which could be due to various factors like missing important features, model underfitting, or the data inherently containing a significant amount of unexplainable variability.

Solving the same problem with a model approach

Let’s explore a model-based method for making predictions by utilizing a straightforward neural network structure. it is a simple feedforward model built for regression tasks. It starts with an input layer that directly connects to a hidden layer of 50 neurons. This hidden layer uses the ReLU activation function, which helps the model capture non-linear relationships in the data. After processing through this hidden layer, the data is passed to a single neuron in the output layer that produces the final prediction.

Additionally, the model is trained using the mean squared error (MSE) loss function, which is well-suited for regression because it penalizes larger errors more heavily. The use of stochastic gradient descent (SGD) helps in efficiently updating the model’s weights. We also add an early stopping mechanism to halt training when the validation loss stops improving, thereby preventing overfitting.

Show the code
import torch.nn as nn
import pytorch_lightning as pl
from torch.utils.data import DataLoader, TensorDataset
from pytorch_lightning.callbacks.early_stopping import EarlyStopping


# Define the LightningModule
class RegressionModel(pl.LightningModule):
    def __init__(self, input_dim):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(input_dim, 50), nn.ReLU(), nn.Linear(50, 1)
        )
        self.loss_fn = nn.MSELoss()

    def forward(self, x):
        return self.model(x)

    def training_step(self, batch, batch_idx):
        x, y = batch
        y = y.view(-1, 1)  # Ensure proper shape
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.log("train_loss", loss)
        return loss

    def validation_step(self, batch, batch_idx):
        x, y = batch
        y = y.view(-1, 1)
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.log("val_loss", loss, prog_bar=True)
        return loss

    def test_step(self, batch, batch_idx):
        x, y = batch
        y = y.view(-1, 1)
        y_hat = self(x)
        loss = self.loss_fn(y_hat, y)
        self.log("test_loss", loss)
        return loss

    def configure_optimizers(self):
        optimizer = torch.optim.SGD(self.parameters(), lr=0.01)
        return optimizer


# Create TensorDatasets and DataLoaders
train_dataset = TensorDataset(
    torch.tensor(X_train_scaled, dtype=torch.float32),
    torch.tensor(y_train, dtype=torch.float32),
)
valid_dataset = TensorDataset(
    torch.tensor(X_valid_scaled, dtype=torch.float32),
    torch.tensor(y_valid, dtype=torch.float32),
)
test_dataset = TensorDataset(
    torch.tensor(X_test_scaled, dtype=torch.float32),
    torch.tensor(y_test, dtype=torch.float32),
)

train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True)
valid_loader = DataLoader(valid_dataset, batch_size=32)
test_loader = DataLoader(test_dataset, batch_size=32)

# Initialize the model with the correct input dimension
input_dim = X_train_scaled.shape[1]
model = RegressionModel(input_dim=input_dim)

# Set up EarlyStopping callback
early_stop_callback = EarlyStopping(
    monitor="val_loss", patience=3, mode="min", verbose=False
)

# Train the model using PyTorch-Lightning's Trainer
trainer = pl.Trainer(
    max_epochs=50,
    callbacks=[early_stop_callback],
    logger=False,
    enable_progress_bar=False,
    enable_checkpointing=False,
)
trainer.fit(model, train_loader, valid_loader)

With the model trained, let us evaluate performance and run a few predictions.

Show the code
# Evaluate the model on the test set
test_results = trainer.test(model, test_loader)
print(f"Test MSE: {test_results}")
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
       Test metric             DataLoader 0
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
        test_loss           0.3777255415916443
────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Test MSE: [{'test_loss': 0.3777255415916443}]
Show the code
# Make predictions for the first 5 test instances
model.eval()
with torch.no_grad():
    test_samples = torch.tensor(X_test_scaled[:5], dtype=torch.float32)
    predictions = model(test_samples).view(-1).numpy()

# Calculate relative differences as percentages
relative_differences = ((predictions - y_test[:5]) / y_test[:5]) * 100

print(f"Predictions for first 5 instances: {predictions}")
print(f"Actual values for first 5 instances: {y_test[:5]}")
print(f"Relative differences (%): {relative_differences}")
Predictions for first 5 instances: [0.5031792 1.6951654 3.744416  2.57257   2.807933 ]
Actual values for first 5 instances: [0.477   0.458   5.00001 2.186   2.78   ]
Relative differences (%): [  5.48830032 270.12344885 -25.11182981  17.68390144   1.00478749]

Suggestions for improvement

While we’ve covered the basics, here are a few ideas to take this experiment further:

  • Dive deeper into the data: Understand which features most affect housing prices and why.
  • Tune the models: Experiment with different settings and configurations to improve accuracy.
  • Compare more metrics: Look beyond the \(R^2\) score to other metrics like MAE or MSE for a fuller picture of model performance.
  • Explore model limitations: Identify and address any shortcomings in the models used.

Final remarks

In this experiment, we’ve explored two different ways to predict housing prices in California: using instance-based learning with a k-Nearest Neighbors (k-NN) regressor and model-based learning with a neural network. Here’s a straightforward recap of what we learned:

  • Instance-Based Learning with k-NN: This method relies on comparing new data points to existing ones to make predictions. It’s pretty straightforward and works well for datasets where the relationship between data points is clear. Our k-NN model did a decent job, explaining about 68% of the variance in housing prices, showing it’s a viable option but also highlighting some limits, especially when dealing with very large datasets.

  • Model-Based Learning with Neural Networks: This approach creates a generalized model from the data it’s trained on. Our simple neural network, equipped with early stopping to prevent overfitting, showcased the ability to capture complex patterns in the data. It requires a bit more setup and tuning but has the potential to tackle more complicated relationships in data.

Each method has its place, depending on the specific needs of your project and the characteristics of your dataset. Instance-based learning is great for simplicity and direct interpretations of data, while model-based learning can handle more complex patterns at the expense of needing more computational resources and tuning.

Reuse

This work is licensed under CC BY (View License)