Integrating with scikit-learn: Pipeline and Gridsearch

The ITEA implementation inherits from scikit-learn’s base classes. This means we can integrate ITEA with tools like Pipeline and GridSearchCV. In this notebook, we’ll show some examples of how to take advantage of that to tune a predictor.

[1]:
import time

import pandas  as pd
import numpy   as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy                   import stats
from sklearn.model_selection import train_test_split
from IPython.display         import display

from sklearn.pipeline import Pipeline

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression

from sklearn import datasets
from sklearn.model_selection import GridSearchCV

# Importing the halving gridsearch algorithm
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

from itea.regression import ITEA_regressor
from itea.inspection import *

import warnings
warnings.filterwarnings(action='ignore', module=r'itea')

Loading the data

First, let’s load the data and split it into training and test partitions. The training partition will be used for the training and validation process; only after settling on a final configuration will we train on this data and evaluate on the test partition.

[2]:
boston_data = datasets.load_boston()
X, y        = boston_data['data'], boston_data['target']
labels      = boston_data['feature_names']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Inspecting the data

Let’s look at some descriptive statistics for the variables.

Suppose that, to reduce the complexity of the final model, we are interested in obtaining a subset of these variables.

[3]:
pd.DataFrame(data=X, columns=labels).describe()
[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000

One way to reduce the number of attributes is to use feature extraction methods (such as PCA), but we can also select a subset of the original attributes.
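
As a quick illustration of the feature-extraction alternative mentioned above (a minimal sketch, not used in the rest of this notebook), we could project the 13 original attributes onto 4 principal components:

from sklearn.decomposition import PCA

# Project the original attributes onto 4 principal components
X_pca = PCA(n_components=4).fit_transform(X)
print(X_pca.shape)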

Let’s use a scikit-learn method that finds a subset of k attributes based on a given scoring function. Here, we’ll take the 4 best variables according to their mutual information with the target.

[4]:
feature_selector = SelectKBest(mutual_info_regression, k=4)

X_new      = feature_selector.fit_transform(X, y)
mask       = feature_selector.get_support()
labels_new = labels[mask]

print(labels_new)
['INDUS' 'NOX' 'RM' 'LSTAT']

Without going into further detail or applying any rigorous pre-processing, let’s just look at the correlation between the selected variables and the dependent (target) variable.

[5]:
def corrfunc(x,y, ax=None, **kws):

    r, _ = stats.pearsonr(x, y)
    ax = ax or plt.gca()

    ax.annotate(
        r'$\rho' + f'= {r:.2f}$',
        xy=(.1, .9),
        xycoords=ax.transAxes,
        bbox=dict(
            facecolor='white',
            edgecolor='black',
            boxstyle='round, pad=0.35'
        )
    )

g = sns.pairplot(
    pd.DataFrame(
        data=np.hstack( (X_new, y.reshape(-1, 1)) ),
        columns=labels_new.tolist()+['target variable']
    ),
    markers=".",
    corner=True,
)

g.map_lower(corrfunc)

plt.show()
_images/_integrating_with_scikits_classes_9_0.png

Creating a pipeline

scikit-learn provides a Pipeline class, which chains a sequence of transformations and a final estimator. With it, we can automate the data transformation steps that feed an estimator.

Let’s combine the variable selection step and an ITEA regressor into a pipeline.

[6]:
tfuncs = {
    'log'      : np.log,
    'sqrt.abs' : lambda x: np.sqrt(np.abs(x)),
    'id'       : lambda x: x,
    'exp'      : np.exp
}

tfuncs_dx = {
    'log'      : lambda x: 1/x,
    'sqrt.abs' : lambda x: x/( 2*(np.abs(x)**(3/2)) ),
    'id'       : lambda x: np.ones_like(x),
    'exp'      : np.exp,
}

# Creating our ITEA regressor instance
itea = ITEA_regressor(
    gens         = 75,
    popsize      = 75,
    expolim      = (-2, 2),
    tfuncs       = tfuncs,
    tfuncs_dx    = tfuncs_dx,
    verbose      = 10,
    labels       = labels_new
)

pipeline = Pipeline([
    ('selectKbest', feature_selector),
    ('itea', itea)
])

pipeline.fit(X_train, y_train)
gen | smallest fitness | mean fitness | highest fitness | remaining time
----------------------------------------------------------------------------
  0 |         4.768589 |     5.768432 |        9.387225 | 0min2sec
 10 |         4.301443 |     4.560781 |        4.914515 | 0min2sec
 20 |         4.218327 |     4.367407 |        4.679878 | 0min2sec
 30 |         4.058739 |     4.215670 |        6.612622 | 0min2sec
 40 |         3.962236 |     4.207883 |        6.405873 | 0min2sec
 50 |         3.962236 |     4.176876 |        7.135065 | 0min1sec
 60 |         3.962236 |     4.186288 |        4.764542 | 0min1sec
 70 |         3.916771 |     4.141421 |        4.779798 | 0min1sec
[6]:
Pipeline(steps=[('selectKbest',
                 SelectKBest(k=4,
                             score_func=<function mutual_info_regression at 0x7f89b6d52dd0>)),
                ('itea',
                 ITEA_regressor(gens=75,
                                labels=array(['INDUS', 'NOX', 'RM', 'LSTAT'], dtype='<U7'),
                                popsize=75,
                                tfuncs={'exp': <ufunc 'exp'>,
                                        'id': <function <lambda> at 0x7f89ab6f3dd0>,
                                        'log': <ufunc 'log'>,
                                        'sqrt.abs': <function <lambda> at 0x7f89ab6f3950>},
                                tfuncs_dx={'exp': <ufunc 'exp'>,
                                           'id': <function <lambda> at 0x7f89aafe2a70>,
                                           'log': <function <lambda> at 0x7f89aafe2950>,
                                           'sqrt.abs': <function <lambda> at 0x7f89aafe29e0>},
                                verbose=10))])

We can access the ITEA step inside the Pipeline with the index operator, and its final expression (an ITExpr_regressor) through the bestsol_ attribute. Let’s look at the final expression.

[7]:
print(pipeline['itea'].bestsol_)
0.086*id(INDUS * NOX^-2 * RM^2) + -18.892*log(NOX^-2 * RM * LSTAT^2) + -18.139*log(NOX^-1 * LSTAT^-2) + -0.002*sqrt.abs(INDUS^2 * NOX^2 * RM^2 * LSTAT) + 0.705*id(INDUS * NOX^-2 * RM^2 * LSTAT^-2) + -21.588
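
If we want to reuse these objects later (for example, with the inspection classes), we can keep references to both. A minimal sketch, with hypothetical variable names:

final_itea   = pipeline['itea']     # the fitted ITEA_regressor step
final_itexpr = final_itea.bestsol_  # the best ITExpr_regressor found
print(final_itexpr)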

Fine-tuning with gridsearch

ITEA has several hyperparameters, and although the method can be used with default values (which deliver fast execution with satisfactory results), it may be necessary to investigate which configuration best suits the domain of the problem the regressor is being applied to.

Imagine we want to limit the final expression to something that isn’t too complex. We can achieve this in several ways.

We can vary the expression size, the exponent limits, and the set of transformation functions. Let’s choose a few values for each of these settings to perform the gridsearch.

Here, we’ll try to find a combination of transformation functions and exponent limits that delivers good performance on the dataset we’re using.

The gridsearch can receive either an estimator or a pipeline as the object to tune.

A detail worth mentioning: in the case of a Pipeline, each hyperparameter name must be prefixed with the step name (for example, itea__gens) when building the gridsearch parameter grid.

[8]:
from itertools import permutations

two_tfuncs = permutations(['log', 'sqrt.abs', 'exp'], r=2)

parameters = {
    'itea__gens'      : [100],
    'itea__popsize'   : [100],
    'itea__tfuncs_dx' : [tfuncs_dx],
    'itea__expolim'   : [(-2, 2), (-1, 1), (0, 1), (0, 2)],
    'itea__max_terms' : [10],
    'itea__tfuncs'    : [
        {t1:tfuncs[t1], t2:tfuncs[t2], 'id':tfuncs['id']}
        for (t1, t2) in set(two_tfuncs)
    ],
    'itea__verbose': [False]
}

scikit-learn provides GridSearchCV, a method that performs an exhaustive search for the best configuration by cross-validating the given data.

Since ITEA is an evolutionary algorithm, an exhaustive search can be computationally expensive. Let’s use HalvingGridSearchCV (which was in an experimental stage at the time this notebook was created), imported at the beginning of the notebook.

This method runs the gridsearch over several iterations, allocating few resources to the first runs in order to get a rough idea of where it should apply more effort, and progressively concentrating resources on the most promising configurations.

To use the standard gridsearch, just change HalvingGridSearchCV to GridSearchCV.
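
As a minimal sketch (reusing the pipeline and parameters defined above), the exhaustive version would look like this:

gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='neg_root_mean_squared_error'
)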

[9]:
gridsearch = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    verbose=2,
    n_jobs=-1,
    refit=True, # If true, then 'gridsearch' will have a best_estimator_
    cv=3,
    factor=2,
    scoring='neg_root_mean_squared_error'
)

t_start = time.time()

gridsearch.fit(X_train, y_train)

t_end = time.time() - t_start

print('----------')
print(f'{round(t_end, 2)} seconds')
n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 21
max_resources_: 339
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 24
n_resources: 21
Fitting 3 folds for each of 24 candidates, totalling 72 fits
----------
iter: 1
n_candidates: 12
n_resources: 42
Fitting 3 folds for each of 12 candidates, totalling 36 fits
----------
iter: 2
n_candidates: 6
n_resources: 84
Fitting 3 folds for each of 6 candidates, totalling 18 fits
----------
iter: 3
n_candidates: 3
n_resources: 168
Fitting 3 folds for each of 3 candidates, totalling 9 fits
----------
iter: 4
n_candidates: 2
n_resources: 336
Fitting 3 folds for each of 2 candidates, totalling 6 fits
----------
222.4 seconds

Now that we have the best result, let’s visualize the grid of different settings for exponent limits and subsets of transformation functions, and then create a final model with the best configuration found.

The heatmap is based on this example from the scikit-learn documentation.

[10]:
fig, ax = plt.subplots(1, 1, figsize=(8, 5))

results_aux = gridsearch.cv_results_.copy()

# One of the parameters is a dictionary and another is a tuple. When we build a
# data frame from cv_results_, pandas mangles these values, so let's convert
# them to strings first.
results_aux['param_itea__tfuncs']  = [
    str(list(k.keys())[:-1]) # leave out the 'id' tfunc
    for k in results_aux['param_itea__tfuncs']]

results_aux['param_itea__expolim'] = [
    str(k)
    for k in results_aux['param_itea__expolim']]

results = pd.DataFrame.from_dict(results_aux)

scores_matrix = results.sort_values('iter').pivot_table(
    index   = 'param_itea__tfuncs',
    columns = 'param_itea__expolim',
    values  = 'mean_test_score',
    aggfunc = 'last'
)

print(scores_matrix.values)

im = ax.imshow(scores_matrix*-1, aspect='auto', cmap='viridis_r')

# Use the pivot table's own (sorted) columns and index for the tick labels,
# so they match the plotted matrix
expolims_gs = scores_matrix.columns
ax.set_xlabel('expolim', fontsize=15)
ax.set_xticks(np.arange(len(expolims_gs)))
ax.set_xticklabels(expolims_gs)

tfuncs_gs = scores_matrix.index
ax.set_ylabel('tfuncs', fontsize=15)
ax.set_yticks(np.arange(len(tfuncs_gs)))
ax.set_yticklabels(tfuncs_gs)

iterations = results.pivot_table(
    index='param_itea__tfuncs',
    columns='param_itea__expolim',
    values='iter',
    aggfunc='max'
).values

for i in range(len(tfuncs_gs)):
    for j in range(len(expolims_gs)):
        ax.text(j, i, iterations[i, j],
            ha="center", va="center", color="k", fontsize=15)

fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])

fig.colorbar(im, cax=cbar_ax)
cbar_ax.set_ylabel(
    'mean test score (RMSE)', rotation=-90,
    va="bottom", fontsize=15)

plt.show()
[[-1.97417141e+02 -2.46071938e+01 -8.44492769e+01 -1.03811080e+01]
 [-1.15262941e+01 -6.56115272e+15 -5.41812874e+02 -1.82725318e+01]
 [-3.63525617e+01 -2.17213496e+01 -9.62046613e+01 -3.41900820e+01]
 [-1.26314994e+01 -7.06547429e+00 -4.19566596e+00 -3.05124383e+01]
 [-2.26318495e+01 -7.44435538e+01 -5.82059580e+02 -5.72124969e+00]
 [-2.61751169e+01 -1.14023381e+01 -4.14897391e+00 -5.76394867e+00]]
_images/_integrating_with_scikits_classes_19_1.png

Creating and fitting a model with the best configuration

Finally, let’s take the pipeline that the gridsearch refit with the best configuration (we set refit=True), and use the test data to see how the final model performs as a predictor.

Additionally, let’s look at some interpretability graphs.

[11]:
# best_pipeline is a Pipeline!
best_pipeline = gridsearch.best_estimator_

# predict(), score(), and other estimator methods will
# apply the transformations and then call the corresponding
# method on the final estimator.

best_pipeline.score(X_test, y_test)
[11]:
0.7876885603830583
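
For instance, a call to predict() (a minimal sketch) will first apply the SelectKBest transformation and then evaluate the expression found by ITEA:

y_pred = best_pipeline.predict(X_test)
print(y_pred[:5])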

ITEA is an estimator, and the interpretability classes only work with instances of ITEA or ITExpr.

To use the whole pipeline we created with these classes, let’s write a helper function that receives a pipeline, applies every transformation step in sequence (all steps except the final estimator), and returns the transformed data.

We’ll use this helper to transform the data before calling fit on the explainers. This way, we can combine the pipeline with the classes from itea.inspection.

[12]:
def just_transformers(pipeline, X):

    Xt = X.copy()
    for name, transformer in pipeline.steps[:-1]:
        Xt = transformer.transform(Xt)

    return Xt
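
Depending on the scikit-learn version, slicing the pipeline offers an equivalent shortcut; a sketch, assuming slice indexing is supported:

# pipeline[:-1] is a sub-pipeline with every step except the final
# estimator, so its transform() applies all transformation steps
Xt = best_pipeline[:-1].transform(X_train)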

Let’s create the explainer instance. We’ll pass the final ITExpr to the explainer and fit it on the transformed training data, the same representation the ITEA saw inside the pipeline. Note how the transformation functions and their derivatives are passed along.

[13]:
explainer = ITExpr_explainer(
    itexpr    = best_pipeline['itea'].bestsol_,
    tfuncs    = tfuncs,
    tfuncs_dx = tfuncs_dx
).fit(just_transformers(best_pipeline, X_train), y_train)

Now we can create the interpretability plots we saw in the other notebooks.

[14]:
explainer.plot_feature_importances(
    X=just_transformers(best_pipeline, X_train),
    importance_method='pe',
    grouping_threshold=0.0,
    barh_kw={'color':'green'}
)
_images/_integrating_with_scikits_classes_27_0.png
[15]:
explainer.plot_normalized_partial_effects(
    grouping_threshold=0.1, show=False, num_points=10)

plt.tight_layout()
_images/_integrating_with_scikits_classes_28_0.png
[16]:
fig, axs = plt.subplots(1, 4, figsize=(10, 3))

explainer.plot_partial_effects_at_means(
    X=just_transformers(best_pipeline, X_test),
    features=range(4),
    ax=axs,
    num_points=100,
    share_y=True,
    show_err=True,
    show=False
)

plt.tight_layout()
plt.show()
_images/_integrating_with_scikits_classes_29_0.png