Integrating with scikit-learn: Pipeline and Gridsearch

The ITEA implementation inherits from scikit-learn’s base classes. This means we can integrate ITEA with tools like Pipeline and GridSearchCV. In this notebook, we’ll show some examples of how to take advantage of that to tune a predictor.

[1]:
import time

import pandas  as pd
import numpy   as np
import seaborn as sns
import matplotlib.pyplot as plt

from scipy                   import stats
from sklearn.model_selection import train_test_split
from IPython.display         import display

from sklearn.pipeline import Pipeline

from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import mutual_info_regression

from sklearn import datasets
from sklearn.model_selection import GridSearchCV

# Importing the halving gridsearch algorithm
from sklearn.experimental import enable_halving_search_cv
from sklearn.model_selection import HalvingGridSearchCV

from itea.regression import ITEA_regressor
from itea.inspection import *

import warnings
warnings.filterwarnings(action='ignore', module=r'itea')

Loading the data

First, let’s load the data and split it into training and test partitions. The training partition will be used for the training and validation process; only after settling on a final configuration will we train on this data and evaluate on the test partition.

[2]:
boston_data = datasets.load_boston()
X, y        = boston_data['data'], boston_data['target']
labels      = boston_data['feature_names']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

Inspecting the data

Let’s look at some descriptive statistics for the variables.

Suppose that, to reduce the complexity of the final model, we are interested in obtaining a subset of these variables.

[3]:
pd.DataFrame(data=X, columns=labels).describe()
[3]:
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
count 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 6.284634 68.574901 3.795043 9.549407 408.237154 18.455534 356.674032 12.653063
std 8.601545 23.322453 6.860353 0.253994 0.115878 0.702617 28.148861 2.105710 8.707259 168.537116 2.164946 91.294864 7.141062
min 0.006320 0.000000 0.460000 0.000000 0.385000 3.561000 2.900000 1.129600 1.000000 187.000000 12.600000 0.320000 1.730000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 5.885500 45.025000 2.100175 4.000000 279.000000 17.400000 375.377500 6.950000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 6.208500 77.500000 3.207450 5.000000 330.000000 19.050000 391.440000 11.360000
75% 3.677083 12.500000 18.100000 0.000000 0.624000 6.623500 94.075000 5.188425 24.000000 666.000000 20.200000 396.225000 16.955000
max 88.976200 100.000000 27.740000 1.000000 0.871000 8.780000 100.000000 12.126500 24.000000 711.000000 22.000000 396.900000 37.970000

One way to reduce the number of attributes is to use feature extraction methods (such as PCA), but we can also select a subset of the original attributes.
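
As a quick illustration of the feature-extraction alternative mentioned above (a minimal sketch, not used in the rest of this notebook), we could project the 13 original attributes onto 4 principal components:

from sklearn.decomposition import PCA

# Project the original attributes onto 4 principal components
X_pca = PCA(n_components=4).fit_transform(X)
print(X_pca.shape)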

Let’s use a scikit-learn method that finds a subset of k attributes based on a given scoring function. Here, we’ll take the 4 best variables according to their mutual information with the target.

[4]:
feature_selector = SelectKBest(mutual_info_regression, k=4)

X_new      = feature_selector.fit_transform(X, y)
mask       = feature_selector.get_support()
labels_new = labels[mask]

print(labels_new)
['INDUS' 'NOX' 'RM' 'LSTAT']

Without going into further detail or applying any rigorous pre-processing, let’s just look at the correlation between the selected variables and the dependent (target) variable.

[5]:
def corrfunc(x,y, ax=None, **kws):

    r, _ = stats.pearsonr(x, y)
    ax = ax or plt.gca()

    ax.annotate(
        r'$\rho' + f'= {r:.2f}$',
        xy=(.1, .9),
        xycoords=ax.transAxes,
        bbox=dict(
            facecolor='white',
            edgecolor='black',
            boxstyle='round, pad=0.35'
        )
    )

g = sns.pairplot(
    pd.DataFrame(
        data=np.hstack( (X_new, y.reshape(-1, 1)) ),
        columns=labels_new.tolist()+['target variable']
    ),
    markers=".",
    corner=True,
)

g.map_lower(corrfunc)

plt.show()
_images/_integrating_with_scikits_classes_9_0.png

Creating a pipeline

scikit-learn provides a Pipeline class, which chains a sequence of transformations and a final estimator. With it, we can automate the data transformation steps that feed an estimator.

Let’s combine the variable selection step and an ITEA regressor into a pipeline.

[6]:
tfuncs = {
    'log'      : np.log,
    'sqrt.abs' : lambda x: np.sqrt(np.abs(x)),
    'id'       : lambda x: x,
    'exp'      : np.exp
}

tfuncs_dx = {
    'log'      : lambda x: 1/x,
    'sqrt.abs' : lambda x: x/( 2*(np.abs(x)**(3/2)) ),
    'id'       : lambda x: np.ones_like(x),
    'exp'      : np.exp,
}

# Creating our ITEA regressor instance
itea = ITEA_regressor(
    gens         = 75,
    popsize      = 75,
    expolim      = (-2, 2),
    tfuncs       = tfuncs,
    tfuncs_dx    = tfuncs_dx,
    verbose      = 10,
    labels       = labels_new
)

pipeline = Pipeline([
    ('selectKbest', feature_selector),
    ('itea', itea)
])

pipeline.fit(X_train, y_train)
gen | smallest fitness | mean fitness | highest fitness | remaining time
----------------------------------------------------------------------------
  0 |         4.768589 |     5.768432 |        9.387225 | 0min2sec
 10 |         4.301443 |     4.560781 |        4.914515 | 0min2sec
 20 |         4.218327 |     4.367407 |        4.679878 | 0min2sec
 30 |         4.058739 |     4.215670 |        6.612622 | 0min2sec
 40 |         3.962236 |     4.207883 |        6.405873 | 0min2sec
 50 |         3.962236 |     4.176876 |        7.135065 | 0min1sec
 60 |         3.962236 |     4.186288 |        4.764542 | 0min1sec
 70 |         3.916771 |     4.141421 |        4.779798 | 0min1sec
[6]:
Pipeline(steps=[('selectKbest',
                 SelectKBest(k=4,
                             score_func=<function mutual_info_regression at 0x7f89b6d52dd0>)),
                ('itea',
                 ITEA_regressor(gens=75,
                                labels=array(['INDUS', 'NOX', 'RM', 'LSTAT'], dtype='<U7'),
                                popsize=75,
                                tfuncs={'exp': <ufunc 'exp'>,
                                        'id': <function <lambda> at 0x7f89ab6f3dd0>,
                                        'log': <ufunc 'log'>,
                                        'sqrt.abs': <function <lambda> at 0x7f89ab6f3950>},
                                tfuncs_dx={'exp': <ufunc 'exp'>,
                                           'id': <function <lambda> at 0x7f89aafe2a70>,
                                           'log': <function <lambda> at 0x7f89aafe2950>,
                                           'sqrt.abs': <function <lambda> at 0x7f89aafe29e0>},
                                verbose=10))])

We can access the ITEA step inside the Pipeline with the index operator, and its final expression (an ITExpr_regressor) through the bestsol_ attribute. Let’s look at the final expression.

[7]:
print(pipeline['itea'].bestsol_)
0.086*id(INDUS * NOX^-2 * RM^2) + -18.892*log(NOX^-2 * RM * LSTAT^2) + -18.139*log(NOX^-1 * LSTAT^-2) + -0.002*sqrt.abs(INDUS^2 * NOX^2 * RM^2 * LSTAT) + 0.705*id(INDUS * NOX^-2 * RM^2 * LSTAT^-2) + -21.588
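
If we want to reuse these objects later (for example, with the inspection classes), we can keep references to both. A minimal sketch, with hypothetical variable names:

final_itea   = pipeline['itea']     # the fitted ITEA_regressor step
final_itexpr = final_itea.bestsol_  # the best ITExpr_regressor found
print(final_itexpr)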

Fine-tuning with gridsearch

ITEA has several hyperparameters, and although the method can be used with default values (which deliver fast execution with satisfactory results), it may be necessary to investigate which configuration best suits the domain of the problem the regressor is being applied to.

Imagine we want to limit the final expression to something that isn’t too complex. We can achieve this in several ways.

We can vary the expression size, the exponent limits, and the set of transformation functions. Let’s choose a few values for each of these settings to perform the gridsearch.

Here, we’ll try to find a combination of transformation functions and exponent limits that delivers good performance on the dataset we’re using.

The gridsearch can receive either an estimator or a pipeline as the object to tune.

A detail worth mentioning: in the case of a Pipeline, each hyperparameter name must be prefixed with the step name (for example, itea__gens) when building the gridsearch parameter grid.

[8]:
from itertools import permutations

two_tfuncs = permutations(['log', 'sqrt.abs', 'exp'], r=2)

parameters = {
    'itea__gens'      : [100],
    'itea__popsize'   : [100],
    'itea__tfuncs_dx' : [tfuncs_dx],
    'itea__expolim'   : [(-2, 2), (-1, 1), (0, 1), (0, 2)],
    'itea__max_terms' : [10],
    'itea__tfuncs'    : [
        {t1:tfuncs[t1], t2:tfuncs[t2], 'id':tfuncs['id']}
        for (t1, t2) in set(two_tfuncs)
    ],
    'itea__verbose': [False]
}

scikit-learn provides GridSearchCV, a method that performs an exhaustive search for the best configuration by cross-validating the given data.

Since ITEA is an evolutionary algorithm, an exhaustive search can be computationally expensive. Let’s use HalvingGridSearchCV (which was in an experimental stage at the time this notebook was created), imported at the beginning of the notebook.

This method runs the gridsearch over several iterations, allocating few resources to the first runs in order to get a rough idea of where it should apply more effort, and progressively concentrating resources on the most promising configurations.

To use the standard gridsearch, just change HalvingGridSearchCV to GridSearchCV.
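
As a minimal sketch (reusing the pipeline and parameters defined above), the exhaustive version would look like this:

gridsearch = GridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    n_jobs=-1,
    refit=True,
    cv=3,
    scoring='neg_root_mean_squared_error'
)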

[9]:
gridsearch = HalvingGridSearchCV(
    estimator=pipeline,
    param_grid=parameters,
    verbose=2,
    n_jobs=-1,
    refit=True, # If true, then 'gridsearch' will have a best_estimator_
    cv=3,
    factor=2,
    scoring='neg_root_mean_squared_error'
)

t_start = time.time()

gridsearch.fit(X_train, y_train)

t_end = time.time() - t_start

print('----------')
print(f'{round(t_end, 2)} seconds')
n_iterations: 5
n_required_iterations: 5
n_possible_iterations: 5
min_resources_: 21
max_resources_: 339
aggressive_elimination: False
factor: 2
----------
iter: 0
n_candidates: 24
n_resources: 21
Fitting 3 folds for each of 24 candidates, totalling 72 fits
----------
iter: 1
n_candidates: 12
n_resources: 42
Fitting 3 folds for each of 12 candidates, totalling 36 fits
----------
iter: 2
n_candidates: 6
n_resources: 84
Fitting 3 folds for each of 6 candidates, totalling 18 fits
----------
iter: 3
n_candidates: 3
n_resources: 168
Fitting 3 folds for each of 3 candidates, totalling 9 fits
----------
iter: 4
n_candidates: 2
n_resources: 336
Fitting 3 folds for each of 2 candidates, totalling 6 fits
----------
222.4 seconds

Now that we have the best result, let’s visualize the grid of different settings for exponent limits and subsets of transformation functions, and then create a final model with the best configuration found.

The heatmap is based on this example from the scikit-learn documentation.

[10]:
fig, ax = plt.subplots(1, 1, figsize=(8, 5))

results_aux = gridsearch.cv_results_.copy()

# One of the parameters is a dictionary and another is a tuple. When we build a
# data frame from cv_results_, pandas mangles these values, so let's convert
# them to strings first.
results_aux['param_itea__tfuncs']  = [
    str(list(k.keys())[:-1]) # leave out the 'id' tfunc
    for k in results_aux['param_itea__tfuncs']]

results_aux['param_itea__expolim'] = [
    str(k)
    for k in results_aux['param_itea__expolim']]

results = pd.DataFrame.from_dict(results_aux)

scores_matrix = results.sort_values('iter').pivot_table(
    index   = 'param_itea__tfuncs',
    columns = 'param_itea__expolim',
    values  = 'mean_test_score',
    aggfunc = 'last'
)

print(scores_matrix.values)

im = ax.imshow(scores_matrix*-1, aspect='auto', cmap='viridis_r')

# Use the pivot table's own (sorted) columns and index for the tick labels,
# so they match the plotted matrix
expolims_gs = scores_matrix.columns
ax.set_xlabel('expolim', fontsize=15)
ax.set_xticks(np.arange(len(expolims_gs)))
ax.set_xticklabels(expolims_gs)

tfuncs_gs = scores_matrix.index
ax.set_ylabel('tfuncs', fontsize=15)
ax.set_yticks(np.arange(len(tfuncs_gs)))
ax.set_yticklabels(tfuncs_gs)

iterations = results.pivot_table(
    index='param_itea__tfuncs',
    columns='param_itea__expolim',
    values='iter',
    aggfunc='max'
).values

for i in range(len(tfuncs_gs)):
    for j in range(len(expolims_gs)):
        ax.text(j, i, iterations[i, j],
            ha="center", va="center", color="k", fontsize=15)

fig.subplots_adjust(right=0.8)
cbar_ax = fig.add_axes([0.85, 0.15, 0.05, 0.7])

fig.colorbar(im, cax=cbar_ax)
cbar_ax.set_ylabel(
    'mean test score (RMSE)', rotation=-90,
    va="bottom", fontsize=15)

plt.show()
[[-1.97417141e+02 -2.46071938e+01 -8.44492769e+01 -1.03811080e+01]
 [-1.15262941e+01 -6.56115272e+15 -5.41812874e+02 -1.82725318e+01]
 [-3.63525617e+01 -2.17213496e+01 -9.62046613e+01 -3.41900820e+01]
 [-1.26314994e+01 -7.06547429e+00 -4.19566596e+00 -3.05124383e+01]
 [-2.26318495e+01 -7.44435538e+01 -5.82059580e+02 -5.72124969e+00]
 [-2.61751169e+01 -1.14023381e+01 -4.14897391e+00 -5.76394867e+00]]
_images/_integrating_with_scikits_classes_19_1.png

Creating and fitting a model with the best configuration

Finally, let’s take the pipeline that the gridsearch refit with the best configuration (we set refit=True), and use the test data to see how the final model performs as a predictor.

Additionally, let’s look at some interpretability graphs.

[11]:
# best_pipeline is a Pipeline!
best_pipeline = gridsearch.best_estimator_

# predict(), score(), and other estimator methods will
# apply the transformations and then call the corresponding
# method on the final estimator.

best_pipeline.score(X_test, y_test)
[11]:
0.7876885603830583
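
For instance, a call to predict() (a minimal sketch) will first apply the SelectKBest transformation and then evaluate the expression found by ITEA:

y_pred = best_pipeline.predict(X_test)
print(y_pred[:5])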

ITEA is an estimator, and the interpretability classes only work with instances of ITEA or ITExpr.

To use the whole pipeline we created with these classes, let’s write a helper function that receives a pipeline, applies every transformation step in sequence (all steps except the final estimator), and returns the transformed data.

We’ll use this helper to transform the data before calling fit on the explainers. This way, we can combine the pipeline with the classes from itea.inspection.

[12]:
def just_transformers(pipeline, X):

    Xt = X.copy()
    for name, transformer in pipeline.steps[:-1]:
        Xt = transformer.transform(Xt)

    return Xt
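
Depending on the scikit-learn version, slicing the pipeline offers an equivalent shortcut; a sketch, assuming slice indexing is supported:

# pipeline[:-1] is a sub-pipeline with every step except the final
# estimator, so its transform() applies all transformation steps
Xt = best_pipeline[:-1].transform(X_train)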

Let’s create the explainer instance. We’ll pass the final ITExpr to the explainer and fit it on the transformed training data, the same representation the ITEA saw inside the pipeline. Note how the transformation functions and their derivatives are passed along.

[13]:
explainer = ITExpr_explainer(
    itexpr    = best_pipeline['itea'].bestsol_,
    tfuncs    = tfuncs,
    tfuncs_dx = tfuncs_dx
).fit(just_transformers(best_pipeline, X_train), y_train)

Now we can create the interpretability plots we saw in the other notebooks.

[14]:
explainer.plot_feature_importances(
    X=just_transformers(best_pipeline, X_train),
    importance_method='pe',
    grouping_threshold=0.0,
    barh_kw={'color':'green'}
)
_images/_integrating_with_scikits_classes_27_0.png
[15]:
explainer.plot_normalized_partial_effects(
    grouping_threshold=0.1, show=False, num_points=10)

plt.tight_layout()
_images/_integrating_with_scikits_classes_28_0.png
[16]:
fig, axs = plt.subplots(1, 4, figsize=(10, 3))

explainer.plot_partial_effects_at_means(
    X=just_transformers(best_pipeline, X_test),
    features=range(4),
    ax=axs,
    num_points=100,
    share_y=True,
    show_err=True,
    show=False
)

plt.tight_layout()
plt.show()
_images/_integrating_with_scikits_classes_29_0.png