California housing regression

In this notebook we’ll use the ITEA_regressor to search for a good expression. The best expression found is encapsulated in an ITExpr_regressor instance, which we’ll then use for the regression task of predicting California housing prices.

[1]:
import numpy  as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.model_selection import train_test_split
from IPython.display         import display

from itea.regression import ITEA_regressor
from itea.inspection import *

import warnings
warnings.filterwarnings(action='ignore', module=r'itea')

The California Housing data set contains 8 numeric features describing census block groups; the target is the median house value.

In this notebook, we’ll provide the transformation functions and their derivatives ourselves, instead of relying on itea’s feature of extracting the derivatives automatically with Jax.
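
If you’d rather not write the derivatives by hand, omitting the tfuncs_dx argument makes itea derive them automatically with Jax. A minimal sketch of the same idea (not itea’s internal code), assuming jax is installed:

import jax.numpy as jnp
from jax import grad, vmap

# grad differentiates a scalar function; vmap lifts it to work elementwise
tfuncs_jax    = {'log': jnp.log, 'sin': jnp.sin}
tfuncs_dx_jax = {name: vmap(grad(f)) for name, f in tfuncs_jax.items()}

print(tfuncs_dx_jax['log'](jnp.array([1.0, 2.0, 4.0])))  # [1.  0.5  0.25]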

Creating and fitting an ITEA_regressor

[2]:
housing_data = datasets.fetch_california_housing()
X, y         = housing_data['data'], housing_data['target']
labels       = housing_data['feature_names']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

tfuncs = {
    'log'      : np.log,
    'sqrt.abs' : lambda x: np.sqrt(np.abs(x)),
    'id'       : lambda x: x,
    'sin'      : np.sin,
    'cos'      : np.cos,
    'exp'      : np.exp
}

tfuncs_dx = {
    'log'      : lambda x: 1/x,
    'sqrt.abs' : lambda x: x/( 2*(np.abs(x)**(3/2)) ),
    'id'       : lambda x: np.ones_like(x),
    'sin'      : np.cos,
    'cos'      : lambda x: -np.sin(x),
    'exp'      : np.exp,
}

reg = ITEA_regressor(
    gens         = 50,
    popsize      = 50,
    max_terms    = 5,
    expolim      = (0, 2),
    verbose      = 10,
    tfuncs       = tfuncs,
    tfuncs_dx    = tfuncs_dx,
    labels       = labels,
    random_state = 42,
    simplify_method = 'simplify_by_coef'
).fit(X_train, y_train)
gen | smallest fitness | mean fitness | highest fitness | remaining time
----------------------------------------------------------------------------
  0 |         0.879653 |     1.075671 |        1.153701 | 1min17sec
 10 |         0.794826 |     0.828574 |        0.983679 | 2min7sec
 20 |         0.788833 |     0.799124 |        0.850923 | 1min39sec
 30 |         0.740517 |     0.783561 |        0.806966 | 1min9sec
 40 |         0.756137 |     0.777656 |        0.806420 | 0min57sec

Inspecting the results from ITEA_regressor and ITExpr_regressor

We can inspect the convergence of the fitness (by default the RMSE, so lower is better), the number of terms, and the expression complexity by using the ITEA_summarizer, an inspector class focused on the ITEA:

[3]:
fig, axs = plt.subplots(3, 1, figsize=(10, 8), sharex=True)

summarizer = ITEA_summarizer(itea=reg).fit(X_train, y_train)

summarizer.plot_convergence(
    data=['fitness', 'n_terms', 'complexity'],
    ax=axs,
    show=False
)

plt.tight_layout()
plt.show()
[Figure: convergence of the fitness, number of terms, and complexity (_images/_regression_example_7_0.png)]

Now that we have fitted the ITEA, our reg contains the bestsol_ attribute: a fitted ITExpr_regressor instance ready to be used. Let’s look at the final expression and the execution time.

[4]:
final_itexpr = reg.bestsol_

print('\nFinal expression:\n', final_itexpr.to_str(term_separator=' +\n'))
print(f'\nElapsed time: {reg.exectime_}')
print(f'\nSelected Features: {final_itexpr.selected_features_}')

Final expression:
 9.361*log(MedInc^2 * HouseAge * AveRooms * AveBedrms * Population^2 * Latitude * Longitude^2) +
-9.662*log(HouseAge * Population^2 * AveOccup * Longitude^2) +
-8.436*log(MedInc^2 * HouseAge^2 * AveRooms^2 * AveBedrms * Population^2 * Latitude^2) +
-1.954*log(AveRooms) +
8.739*log(HouseAge^2 * AveRooms * Population^2 * AveOccup * Longitude^2) +
-53.078

Elapsed time: 210.53795051574707

Selected Features: ['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Latitude' 'Longitude']
[5]:
# just a reminder that ITEA and ITExpr implement scikit-learn's
# base classes. We can check all parameters with:
print(final_itexpr.get_params)
<bound method BaseEstimator.get_params of ITExpr_regressor(expr=[('log', [2, 1, 1, 1, 2, 0, 1, 2]),
                       ('log', [0, 1, 0, 0, 2, 1, 0, 2]),
                       ('log', [2, 2, 2, 1, 2, 0, 2, 0]),
                       ('log', [0, 0, 1, 0, 0, 0, 0, 0]),
                       ('log', [0, 2, 1, 0, 2, 1, 0, 2])],
                 labels=array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup', 'Latitude', 'Longitude'], dtype='<U10'),
                 tfuncs={'cos': <ufunc 'cos'>, 'exp': <ufunc 'exp'>,
                         'id': <function <lambda> at 0x7f4ab0369680>,
                         'log': <ufunc 'log'>, 'sin': <ufunc 'sin'>,
                         'sqrt.abs': <function <lambda> at 0x7f4ab0369f80>})>
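
The expr attribute shown above stores each term as a (function name, exponent list) pair. A throwaway sketch (term_to_str is hypothetical, not part of the itea API) of how one pair maps to the printed expression:

def term_to_str(func, exponents, labels):
    # each integer is the exponent of the corresponding feature inside
    # the interaction; features with exponent 0 do not participate
    interaction = ' * '.join(
        f'{lab}^{e}' if e > 1 else lab
        for lab, e in zip(labels, exponents) if e != 0)
    return f'{func}({interaction})'

print(term_to_str('log', [2, 1, 1, 1, 2, 0, 1, 2], labels))
# log(MedInc^2 * HouseAge * AveRooms * AveBedrms * Population^2 * Latitude * Longitude^2)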
[6]:
fig, axs = plt.subplots()

axs.scatter(y_test, final_itexpr.predict(X_test))
plt.show()
[Figure: scatter plot of y_test versus the predicted values (_images/_regression_example_11_0.png)]
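
The scatter plot gives a visual check; since ITExpr_regressor follows the scikit-learn estimator API, we can also quantify the fit (a quick addition, not part of the original notebook):

from sklearn.metrics import mean_squared_error, r2_score

y_pred = final_itexpr.predict(X_test)
print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
print('R^2 :', r2_score(y_test, y_pred))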

We can use the ITExpr_inspector to see information for each term.

[7]:
display(pd.DataFrame(
    ITExpr_inspector(
        itexpr=final_itexpr, tfuncs=tfuncs
    ).fit(X_train, y_train).terms_analysis()
))
   coef     func       strengths                 coef stderr.  mean pairwise disentanglement  mean mutual information  prediction var.
0    9.361  log        [2, 1, 1, 1, 2, 0, 1, 2]  0.22          0.535                          0.603                    268.140
1   -9.662  log        [0, 1, 0, 0, 2, 1, 0, 2]  0.221         0.494                          0.527                    221.432
2   -8.436  log        [2, 2, 2, 1, 2, 0, 2, 0]  0.222         0.544                          0.609                    263.430
3   -1.954  log        [0, 0, 1, 0, 0, 0, 0, 0]  0.038         0.075                          0.187                    0.290
4    8.739  log        [0, 2, 1, 0, 2, 1, 0, 2]  0.222         0.520                          0.553                    214.239
5  -53.078  intercept  ---                       1.325         0.000                          0.000                    0.000

Explaining the ITExpr_regressor expression using Partial Effects

We can obtain feature importances from the Partial Effects (the partial derivatives of the expression with respect to each feature, evaluated over the data) using the ITExpr_explainer.

[8]:
explainer = ITExpr_explainer(
    itexpr=final_itexpr, tfuncs=tfuncs, tfuncs_dx=tfuncs_dx).fit(X, y)

explainer.plot_feature_importances(
    X=X_train,
    importance_method='pe',
    grouping_threshold=0.0,
    barh_kw={'color':'green'}
)
[Figure: bar plot of the Partial Effect feature importances (_images/_regression_example_15_0.png)]
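
As an informal cross-check of the plot above (finite differences are not part of the itea API), we can approximate the Partial Effect of a single feature directly on the fitted expression:

# central finite difference w.r.t. feature j, a rough stand-in for the
# analytical Partial Effect that ITExpr_explainer computes
j, h = 0, 1e-4                         # j = 0 is 'MedInc'; h is the step size
Xp, Xm = X_train.copy(), X_train.copy()
Xp[:, j] += h
Xm[:, j] -= h
pe_j = (final_itexpr.predict(Xp) - final_itexpr.predict(Xm)) / (2 * h)
print(f'mean |PE| for {labels[j]}: {np.abs(pe_j).mean():.3f}')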

The Partial Effects at the Means help us understand how the contribution of each variable changes over its range while all the other covariates are held fixed at their means.

[9]:
fig, axs = plt.subplots(2, 4, figsize=(10, 5))

explainer.plot_partial_effects_at_means(
    X=X_test,
    features=range(8),
    ax=axs,
    num_points=100,
    share_y=False,
    show_err=True,
    show=False
)

plt.tight_layout()
plt.show()
[Figure: Partial Effects at the Means, one panel per feature (_images/_regression_example_17_0.png)]

Finally, we can also plot the mean relative importance of each feature by discretizing the output into intervals and averaging the Partial Effects within each interval.

[10]:
fig, ax = plt.subplots(1, 1, figsize=(10, 4))

explainer.plot_normalized_partial_effects(
    grouping_threshold=0.1, show=False,
    num_points=100, ax=ax
)

plt.tight_layout()
[Figure: normalized Partial Effects over the discretized output intervals (_images/_regression_example_19_0.png)]