Learning with Subset Stacking (LESS)
======================================

LESS is a new supervised learning algorithm that trains many local estimators
on subsets of a given dataset and then passes their predictions to a global
estimator. This is, of course, only a rough description. In the second part of
this tutorial, we will give more details about the inner workings of LESS and
discuss how to change its many parameters to obtain different models. For now,
let us carry on with the default LESS and show that it works well out of the
box.

Imports
-------

First, we need to import the necessary libraries. Apart from standard data
manipulation and plotting libraries like ``numpy``, ``pandas``,
``matplotlib``, and ``seaborn``, we import various regression models from
``scikit-learn``, ``xgboost``, and ``lightgbm`` for comparison. Most
importantly, we import ``LESSBRegressor`` from the ``less`` package.

.. code-block:: python

    import numpy as np
    import matplotlib.pyplot as plt
    import seaborn as sns
    import pandas as pd

    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.linear_model import LinearRegression
    from xgboost import XGBRegressor
    from lightgbm import LGBMRegressor
    from less import LESSBRegressor

    from sklearn.model_selection import train_test_split
    from sklearn.metrics import mean_squared_error
    from sklearn.preprocessing import StandardScaler
    from sklearn.datasets import fetch_openml

    import warnings

    warnings.filterwarnings("ignore")
    np.random.seed(42)

Synthetic Dataset
-----------------

Here is a simple one-dimensional regression problem. This synthetic dataset is
generated by randomly sampling a set of points from the real line (input) and
then adding perturbations to their function values obtained with a sine curve
(output). The blue dots in the figure below show the dataset with 300 samples,
and the red line is the noise-free sine curve.

.. code-block:: python

    def synthetic_sine_curve(n_samples=300):
        plt.figure(figsize=(10, 4))

        # Generate data
        X = np.random.uniform(-10, 10, (n_samples, 1))
        y = 10 * np.sin(X[:, 0]) + 2.5 * np.random.randn(n_samples)

        # Plot
        xvals = np.arange(-10, 10, 0.1)
        sns.lineplot(x=xvals, y=10 * np.sin(xvals), color="red")
        sns.scatterplot(x=X[:, 0], y=y, alpha=0.5)
        plt.ylim([-15, 15])
        plt.title("Synthetic Data")
        plt.tick_params(labelbottom=False, labelleft=False)
        plt.tight_layout()
        plt.show()

        return X, y


    X, y = synthetic_sine_curve()

.. image:: _static/tutorial/dataset.png
   :align: center
   :alt: Synthetic Dataset

Training LESS
-------------

You will notice that LESS uses exactly the same syntax (``fit`` & ``predict``)
that is used by all the learning algorithms in ``scikit-learn``. Currently,
LESS only supports regression. We are working on adding the LESS classifier.

**Data Preprocessing:** Before training, we split the data into training and
testing sets. We also scale the features using ``StandardScaler``. Scaling is
often good practice in machine learning, especially for algorithms that rely
on distance metrics (like k-NN, which might be used as a local estimator in
LESS) or gradient-based optimization. It puts all features on a comparable
scale, so that no feature dominates simply because of its units. Note that the
scaler is fitted on the training set only and then applied to the test set.

.. code-block:: python

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

**Fitting the Model:** We initialize ``LESSBRegressor`` with a random state
for reproducibility. Then we fit the model to the training data and evaluate
it on the test set.

.. code-block:: python

    LESS_model = LESSBRegressor(random_state=42)
    LESS_model.fit(X_train, y_train)
    y_pred = LESS_model.predict(X_test)
    print(f"Test error of LESS: {mean_squared_error(y_test, y_pred):0.2f}")

.. rst-class:: sphx-glr-script-out

.. code-block:: none

    Test error of LESS: 4.65

Comparison with Other Models
----------------------------

To see how LESS performs compared to other popular regression algorithms, we
define a list of models including Random Forest, LightGBM, k-NN, Decision
Tree, SVR, Linear Regression, and XGBoost.

.. code-block:: python

    models = [
        LESSBRegressor(random_state=42),
        RandomForestRegressor(random_state=42),
        LGBMRegressor(random_state=42, verbose=-1),
        KNeighborsRegressor(),
        DecisionTreeRegressor(random_state=42),
        SVR(),
        LinearRegression(),
        XGBRegressor(random_state=42),
    ]

We then iterate through these models, train them on the same data, and either
plot their predictions or compare their MSE scores. The visual comparison
helps in understanding how different models capture the underlying pattern of
the data.

.. code-block:: python

    def compare_models(X_train, X_test, y_train, y_test, models, plot="line"):
        """
        Compare multiple models by plotting their predictions or scores.

        plot="line": Creates a subplot grid showing X vs predicted values.
        plot="bar": Creates a bar chart showing MSE scores.
        """
        if plot == "bar":
            # Calculate MSE for each model
            model_names = []
            mse_scores = []
            for model in models:
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                mse = mean_squared_error(y_test, y_pred)
                model_names.append(model.__class__.__name__)
                mse_scores.append(mse)

            # Create bar plot
            plt.figure(figsize=(12, 6))
            bars = plt.bar(range(len(model_names)), mse_scores, alpha=0.7)
            plt.xticks(range(len(model_names)), model_names, rotation=45, ha="right")
            plt.ylabel("MSE")
            plt.title("Model Comparison - MSE Scores")
            plt.grid(True, alpha=0.3, axis="y")

            # Add MSE values on bars
            for bar, mse in zip(bars, mse_scores):
                plt.text(
                    bar.get_x() + bar.get_width() / 2,
                    bar.get_height(),
                    f"{mse:.2f}",
                    ha="center",
                    va="bottom",
                    fontsize=10,
                )

            plt.tight_layout()
            plt.show()

        else:  # plot == "line"
            # Calculate grid size
            n_models = len(models)
            n_cols = 2
            n_rows = (n_models + n_cols - 1) // n_cols

            # Create subplot grid
            fig, axes = plt.subplots(n_rows, n_cols, figsize=(12, 4 * n_rows))
            axes = axes.flatten()

            # Plot each model
            for idx, model in enumerate(models):
                # Train and predict
                model.fit(X_train, y_train)
                y_pred = model.predict(X_test)
                mse = mean_squared_error(y_test, y_pred)

                # Sort for line plot
                sort_idx = X_test[:, 0].argsort()
                X_sorted = X_test[sort_idx, 0]
                y_pred_sorted = y_pred[sort_idx]

                # Plot
                ax = axes[idx]
                ax.plot(X_sorted, y_pred_sorted, alpha=0.7)
                ax.set_title(f"{model.__class__.__name__} - MSE: {mse:.2f}")
                ax.tick_params(labelbottom=False, labelleft=False)
                ax.grid(True, alpha=0.3)

            # Hide extra subplots
            for i in range(len(models), len(axes)):
                axes[i].axis("off")

            plt.tight_layout()
            plt.show()

.. code-block:: python

    # Run comparison
    compare_models(X_train, X_test, y_train, y_test, models, plot="line")

.. image:: _static/tutorial/01_line_plot.png
   :align: center
   :alt: Model Comparison - Line Plot

Experiment with Abalone Dataset
-------------------------------

Let's try a larger dataset. We will use the Abalone dataset, which has 4177
rows and 8 columns. The categorical column is one-hot encoded with
``pd.get_dummies``, and the data is split and scaled in the same way as
before.

.. code-block:: python

    abalone = fetch_openml(name="abalone", version=1, as_frame=True)
    X = pd.get_dummies(abalone.data, drop_first=True, dtype=np.float32)
    y = abalone.target.astype(np.float32)

    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42
    )

    scaler = StandardScaler()
    X_train = scaler.fit_transform(X_train)
    X_test = scaler.transform(X_test)

This time we compare the models by their Mean Squared Error (MSE) using a bar
chart.

.. code-block:: python

    # Run comparison with bar plot
    compare_models(X_train, X_test, y_train, y_test, models, plot="bar")

.. image:: _static/tutorial/01_bar_plot.png
   :align: center
   :alt: Model Comparison - Bar Plot
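
The bar chart is easy to scan, but sometimes the raw numbers are handy, for
example to sort the models or to paste the scores into a report. Below is a
minimal sketch that is not part of the ``less`` package: the helper
``mse_table`` is a hypothetical name, and we assume every model follows the
scikit-learn estimator API so that ``sklearn.base.clone`` can copy it. The
small sine dataset is only a stand-in so that the snippet runs on its own.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def mse_table(models, X_train, X_test, y_train, y_test):
    """Fit a fresh copy of each model and return test MSEs, lowest first.

    Hypothetical helper for illustration; models are keyed by class name,
    so two models of the same class would overwrite each other.
    """
    scores = {}
    for model in models:
        # clone() gives an unfitted copy, so the caller's model is untouched
        fitted = clone(model).fit(X_train, y_train)
        y_pred = fitted.predict(X_test)
        scores[type(model).__name__] = mean_squared_error(y_test, y_pred)
    return pd.Series(scores, name="MSE").sort_values()


# Stand-in data (assumption): a small noisy sine curve
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + 0.1 * rng.standard_normal(200)
Xtr, Xte, ytr, yte = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

demo_models = [LinearRegression(), DecisionTreeRegressor(random_state=0)]
table = mse_table(demo_models, Xtr, Xte, ytr, yte)
print(table.round(3))
```

With the objects defined earlier in this tutorial, the call would be
``mse_table(models, X_train, X_test, y_train, y_test)``, assuming
``LESSBRegressor`` can be cloned like the other estimators. Keep in mind that
these scores come from a single train/test split, so small differences between
models should not be over-interpreted.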