LESS is More: A Comprehensive Tutorial

LESS (Learning with Subset Stacking) is a scalable and versatile ensemble learning framework. It constructs an ensemble of local models trained on subsets of data and combines their predictions using a global meta-estimator.

In this tutorial, we will explore the two main variants of LESS:

  1. LESS-A (Averaging): Trains multiple iterations of local/global models and averages their predictions.

  2. LESS-B (Boosting): Trains models sequentially, where each stage learns to correct the residuals of the previous stage.

We will also dive deep into the critical parameters that control the behavior of LESS, such as n_subsets, n_estimators, min_neighbors, and the choice of estimators.

import numpy as np
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import fetch_openml
from sklearn.metrics import mean_squared_error
from sklearn.tree import DecisionTreeRegressor

from less import LESSARegressor, LESSBRegressor

Data Preparation

We will use the Abalone dataset for this tutorial. It contains physical measurements of abalones, and the goal is to predict the number of rings, which serves as a proxy for age.

abalone = fetch_openml(name="abalone", version=1, as_frame=True)

X = pd.get_dummies(abalone.data, drop_first=True, dtype=np.float32)
y = abalone.target.astype(np.float32)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

Output:

Training set size: (3341, 9)
Test set size: (836, 9)

1. LESS-A (Averaging)

LESS-A (LESSARegressor) is the averaging variant. It performs multiple independent iterations. In each iteration, it selects subsets of data, trains local models, and optionally trains a global model to combine them. The final prediction is the average of predictions from all iterations.

Key Parameter: n_estimators

For LESS-A, n_estimators controls the number of averaging iterations.

  • Default: 100

  • Effect: Generally, more estimators lead to more stable predictions (lower variance) but increase training time linearly.
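The averaging step and its effect on variance can be pictured in plain Python. This is only a sketch of the idea, not the library's code: `average_predictions` and `noisy_prediction` are hypothetical helpers, and the numbers are made up for illustration.

```python
import random
from statistics import mean, pvariance


def average_predictions(per_iteration_preds):
    """Average predictions element-wise across iterations.

    per_iteration_preds: list of equal-length lists, one list per iteration.
    """
    return [mean(column) for column in zip(*per_iteration_preds)]


# Three hypothetical iterations predicting the same two test points.
iteration_preds = [
    [9.0, 11.0],
    [10.0, 12.0],
    [11.0, 13.0],
]
print(average_predictions(iteration_preds))

# Toy simulation: each "iteration" predicts 10.0 plus independent noise.
rng = random.Random(0)


def noisy_prediction():
    return 10.0 + rng.gauss(0, 1)


single = [noisy_prediction() for _ in range(500)]
averaged = [mean(noisy_prediction() for _ in range(25)) for _ in range(500)]

# Averaging 25 independent predictions should spread far less than one.
print(pvariance(single), pvariance(averaged))
```

In this toy setting the iterations are fully independent, which is the most favorable case; correlated iterations would reduce variance less.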

# LESS-A with default parameters (n_estimators=100)
less_a = LESSARegressor(n_estimators=100, random_state=42)
less_a.fit(X_train, y_train)
y_pred_a = less_a.predict(X_test)
print(f'LESS-A Test MSE: {mean_squared_error(y_test, y_pred_a):.4f}')

Output:

LESS-A Test MSE: 4.3957

2. LESS-B (Boosting)

LESS-B (LESSBRegressor) applies a boosting strategy. Instead of independent iterations, it trains models sequentially. Each stage learns to correct the residuals (errors) of the previous stage.

Key Parameters: n_estimators and learning_rate

  • n_estimators: The number of boosting stages (default: 100).

  • learning_rate: Shrinks the contribution of each estimator (default: 0.1). There is a trade-off between learning_rate and n_estimators: a smaller learning_rate usually needs more boosting stages to reach the same fit.
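The residual-fitting idea behind these two parameters can be illustrated with a generic boosting loop. The sketch below is an assumption-laden toy, not how LESSBRegressor is implemented: it swaps the LESS local/global models for a one-split stump, and `fit_stump` and `boosted_train_mse` are hypothetical names.

```python
def fit_stump(x, residuals):
    """Fit a one-split regression stump (a deliberately weak learner)."""
    best = None
    for threshold in sorted(set(x))[:-1]:  # skip the max so both sides are non-empty
        left = [r for xi, r in zip(x, residuals) if xi <= threshold]
        right = [r for xi, r in zip(x, residuals) if xi > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda xi: left_mean if xi <= threshold else right_mean


def boosted_train_mse(x, y, n_estimators, learning_rate):
    """Run residual-fitting stages and return the final training MSE."""
    pred = [sum(y) / len(y)] * len(y)  # stage 0: predict the mean
    for _ in range(n_estimators):
        # Each stage is trained on what the previous stages got wrong.
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, residuals)
        # Shrink the stage's correction by learning_rate before adding it.
        pred = [pi + learning_rate * stump(xi) for xi, pi in zip(x, pred)]
    return sum((yi - pi) ** 2 for yi, pi in zip(y, pred)) / len(y)


# Toy 1-D data: y = x^2 on 20 evenly spaced points.
x = [i / 10 for i in range(20)]
y = [xi ** 2 for xi in x]
for stages in (0, 5, 50):
    print(stages, round(boosted_train_mse(x, y, stages, 0.1), 4))
```

The training error shrinks as stages are added. Note that this measures fit on the training data only; on held-out data, too many stages can overfit.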

less_b = LESSBRegressor(
    n_estimators=50,       # 50 boosting stages
    learning_rate=0.1,     # Shrinkage parameter
    random_state=42
)
less_b.fit(X_train, y_train)
y_pred_b = less_b.predict(X_test)
print(f'LESS-B Test MSE: {mean_squared_error(y_test, y_pred_b):.4f}')

Output:

LESS-B Test MSE: 4.3576

3. Critical Parameters Deep Dive

To get the most out of LESS, it’s essential to understand its key parameters.

Number of Subsets (n_subsets)

This is perhaps the most critical parameter. It determines how many local subsets are created in each iteration (or stage).

  • Default: 20

  • Effect: A higher n_subsets means more local models, which can capture more local details but increases computational cost. It also affects the number of neighbors used for training each local model (see min_neighbors below).

# Experimenting with n_subsets
for n in [5, 20, 50]:
    model = LESSARegressor(n_subsets=n, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'n_subsets={n}: MSE={mse:.4f}')

Output:

n_subsets=5: MSE=4.4578
n_subsets=20: MSE=4.3957
n_subsets=50: MSE=4.3848

Minimum Neighbors (min_neighbors)

This parameter ensures that each local model has enough data to train on.

  • Default: 10

  • Internal Logic: LESS automatically calculates the number of neighbors (n_neighbors) for each subset based on the dataset size (n_samples) and n_subsets.

\[
\begin{aligned}
\text{suggested\_neighbors} &= \max\left(\text{min\_neighbors},\ \left\lfloor \frac{\text{n\_samples}}{\text{n\_subsets}} \right\rfloor\right) \\
\text{n\_neighbors} &= \min\left(\text{suggested\_neighbors},\ \text{n\_samples}\right)
\end{aligned}
\]

This logic ensures that even with many subsets, each local model sees at least min_neighbors samples (subsets overlap if necessary). When n_subsets is small enough that ⌊n_samples / n_subsets⌋ exceeds min_neighbors, each local model is trained on that larger number of samples instead.
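The formula is easy to check by hand. The helper below simply transcribes it; `compute_n_neighbors` is a hypothetical name for illustration, and it assumes n_samples is the number of rows passed to fit (3341 in this tutorial).

```python
def compute_n_neighbors(n_samples, n_subsets, min_neighbors=10):
    """Transcription of the n_neighbors formula from the text."""
    suggested = max(min_neighbors, n_samples // n_subsets)
    return min(suggested, n_samples)


n_samples = 3341  # training rows in this tutorial

# Defaults: 3341 // 20 = 167, which is above min_neighbors=10.
print(compute_n_neighbors(n_samples, n_subsets=20))

# With n_subsets=10, 3341 // 10 = 334 wins over min_neighbors=50 ...
print(compute_n_neighbors(n_samples, n_subsets=10, min_neighbors=50))

# ... but min_neighbors=500 is larger than 334 and takes over.
print(compute_n_neighbors(n_samples, n_subsets=10, min_neighbors=500))

# The result is capped at n_samples for very small datasets.
print(compute_n_neighbors(100, n_subsets=5, min_neighbors=500))
```

In other words, min_neighbors only changes the subset size when it exceeds ⌊n_samples / n_subsets⌋.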

# Experimenting with min_neighbors
# With n_subsets=10, floor(3341 / 10) = 334 by the formula above, so only
# min_neighbors values larger than 334 actually change the subset size.
for n in [50, 500, 1000]:
    model = LESSARegressor(n_subsets=10, min_neighbors=n, random_state=42)
    model.fit(X_train, y_train)
    mse = mean_squared_error(y_test, model.predict(X_test))
    print(f'min_neighbors={n}: MSE={mse:.4f}')

Output:

min_neighbors=50: MSE=4.4130
min_neighbors=500: MSE=4.4186
min_neighbors=1000: MSE=4.4224

Local Estimator (local_estimator)

This defines the model used for each local subset.

  • Default: 'linear' (uses LinearRegression)

  • Options:

    • 'linear': Standard Linear Regression.

    • 'tree': A DecisionTreeRegressor with specific parameters (max_leaf_nodes=31, etc.).

    • Custom: You can pass any callable that returns a scikit-learn compatible regressor (e.g., lambda: SVR()).

# Using 'tree' as local estimator
less_tree_local = LESSARegressor(local_estimator='tree', random_state=42)
less_tree_local.fit(X_train, y_train)
print(f'Local=Tree MSE: {mean_squared_error(y_test, less_tree_local.predict(X_test)):.4f}')

# Using a custom local estimator
less_custom_local = LESSARegressor(
    local_estimator=lambda: DecisionTreeRegressor(max_depth=5),
    random_state=42,
)
less_custom_local.fit(X_train, y_train)
print(f'Local=CustomTree MSE: {mean_squared_error(y_test, less_custom_local.predict(X_test)):.4f}')

Output:

Local=Tree MSE: 5.0080
Local=CustomTree MSE: 4.9919

Global Estimator (global_estimator)

The global estimator combines the predictions of the local models.

  • Default: 'xgboost' (uses XGBRFRegressor)

  • Options:

    • 'xgboost': Random Forest regressor from XGBoost.

    • None: Removes the global estimator. The final prediction becomes a weighted average of local predictions.

    • Custom: Any callable returning a regressor (e.g., lambda: RandomForestRegressor()).
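To make the None option concrete, here is what a weighted average of local predictions looks like in isolation. This is a minimal sketch: `weighted_average` is a hypothetical helper, and the weights are made-up values, since this tutorial does not describe how LESS derives them.

```python
def weighted_average(local_preds, weights):
    """Combine local predictions for one sample using non-negative weights."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(w * p for w, p in zip(weights, local_preds)) / total


# Two hypothetical local models predict 8.0 and 12.0 for the same sample.
# The first model gets three times the weight of the second (assumed values).
print(weighted_average([8.0, 12.0], [3.0, 1.0]))  # (3*8 + 1*12) / 4 = 9.0
```

A learned global estimator replaces this fixed combination rule with a model fitted on the local predictions, which is consistent with the lower error reported for the default setting in the output further down.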

# Using default (XGBoost)
less_default = LESSARegressor(random_state=42)
less_default.fit(X_train, y_train)
print(f'Global=XGBoost MSE: {mean_squared_error(y_test, less_default.predict(X_test)):.4f}')

# Removing global estimator (Weighted Average)
less_no_global = LESSARegressor(global_estimator=None, random_state=42)
less_no_global.fit(X_train, y_train)
print(f'No Global MSE: {mean_squared_error(y_test, less_no_global.predict(X_test)):.4f}')

Output:

Global=XGBoost MSE: 4.3957
No Global MSE: 6.1603

Validation Split (val_size)

You can split the dataset into training and validation sets within LESS.

  • Purpose: The training set is used to train the local estimators, while the validation set is used to train the global estimator. This can help prevent overfitting, especially when the global estimator is powerful.

  • Usage: Set val_size to a float between 0 and 1 (e.g., 0.2 for 20% validation data).
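The split described above can be sketched as a partition of row indices. The helper name `split_indices`, the shuffling, and the rounding rule (truncation) are assumptions made for illustration; the library may partition the data differently.

```python
import random


def split_indices(n_samples, val_size, seed=42):
    """Shuffle row indices and split them into two disjoint parts.

    Returns (local_idx, global_idx): rows for the local estimators and
    rows held out to train the global estimator.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_val = int(n_samples * val_size)  # assumed rounding rule
    return indices[n_val:], indices[:n_val]


local_idx, global_idx = split_indices(3341, 0.2)
print(len(local_idx), len(global_idx))  # 2673 668
```

Because the two index sets are disjoint, the global estimator is fitted on local predictions for rows the local models have not seen, which is the overfitting safeguard the text refers to.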

less_val = LESSARegressor(val_size=0.2, random_state=42)
less_val.fit(X_train, y_train)
y_pred_val = less_val.predict(X_test)
print(f'Test error (val_size=0.2): {mean_squared_error(y_test, y_pred_val):.4f}')

Output:

Test error (val_size=0.2): 4.3987

Clustering Method (cluster_method)

This parameter controls how the centers of the subsets are selected.

  • Default: 'tree' (Random sampling). It selects n_subsets centers randomly from the data.

  • Options:

    • 'tree': Random sampling.

    • 'kmeans': Uses K-Means clustering. Crucially, the number of clusters is set equal to n_subsets. The cluster centers found by K-Means become the centers of the subsets.
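The random-sampling option can be pictured as follows: pick n_subsets samples as centers, then give each center its nearest n_neighbors samples. This is a rough sketch under assumptions, not the library's code. `build_subsets` is a hypothetical helper, and it uses brute-force squared Euclidean distances on toy 2-D points.

```python
import random


def build_subsets(points, n_subsets, n_neighbors, seed=42):
    """Pick random centers and collect the nearest points around each one."""
    rng = random.Random(seed)
    center_ids = rng.sample(range(len(points)), n_subsets)
    subsets = []
    for center_id in center_ids:
        center = points[center_id]

        def sq_dist(i):
            # Squared Euclidean distance from point i to the current center.
            return sum((a - b) ** 2 for a, b in zip(points[i], center))

        nearest = sorted(range(len(points)), key=sq_dist)[:n_neighbors]
        subsets.append(nearest)
    return center_ids, subsets


# 60 toy points in the unit square.
data_rng = random.Random(0)
points = [(data_rng.random(), data_rng.random()) for _ in range(60)]

center_ids, subsets = build_subsets(points, n_subsets=4, n_neighbors=15)
for center_id, subset in zip(center_ids, subsets):
    print(center_id, len(subset))
```

Subsets built this way may overlap, since a point can be among the nearest neighbors of several centers. With 'kmeans', the sampled centers would be replaced by the K-Means cluster centers, as described in the list above.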

# Using K-Means for clustering
# Here, n_subsets=20 means K-Means will find 20 cluster centers
less_kmeans = LESSARegressor(cluster_method='kmeans', n_subsets=20, random_state=42)
less_kmeans.fit(X_train, y_train)
print(f'Cluster=KMeans MSE: {mean_squared_error(y_test, less_kmeans.predict(X_test)):.4f}')

Output:

Cluster=KMeans MSE: 4.3866

Random State (random_state)

Controls the randomness of the algorithm (subset selection, local estimator initialization, global estimator initialization). Setting it to a fixed integer makes your results reproducible across runs.