California housing regression

In this notebook we use the ITEA_regressor to search for a good expression, which is encapsulated in an instance of the ITExpr_regressor class, and apply it to the regression task of predicting California housing prices.

import numpy  as np
import pandas as pd

import matplotlib.pyplot as plt

from sklearn import datasets

from sklearn.model_selection import train_test_split
from IPython.display         import display

from itea.regression import ITEA_regressor
from itea.inspection import *

import warnings
warnings.filterwarnings(action='ignore', module=r'itea')

The California Housing data set contains 8 features.

In this notebook, we provide the transformation functions and their derivatives explicitly, instead of relying on itea's ability to derive them automatically with JAX.
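
Because the derivatives are written by hand, a quick numerical check can catch mistakes before they reach the explainer. The sketch below compares each derivative used in this notebook against a central finite difference. The helper max_derivative_error, the step size, and the test grid are assumptions made for illustration and are not part of itea.

```python
import numpy as np

# Same transformation functions and hand-written derivatives used in this notebook.
tfuncs = {
    'log'      : np.log,
    'sqrt.abs' : lambda x: np.sqrt(np.abs(x)),
    'id'       : lambda x: x,
    'sin'      : np.sin,
    'cos'      : np.cos,
    'exp'      : np.exp,
}

tfuncs_dx = {
    'log'      : lambda x: 1/x,
    'sqrt.abs' : lambda x: x/(2*(np.abs(x)**(3/2))),
    'id'       : lambda x: np.ones_like(x),
    'sin'      : np.cos,
    'cos'      : lambda x: -np.sin(x),
    'exp'      : np.exp,
}

def max_derivative_error(f, f_dx, xs, h=1e-6):
    """Largest gap between f_dx and a central finite difference of f on xs."""
    numeric = (f(xs + h) - f(xs - h)) / (2*h)
    return np.max(np.abs(numeric - f_dx(xs)))

# Strictly positive points keep log and sqrt.abs away from their singularities.
xs = np.linspace(0.5, 3.0, 50)

derivative_errors = {
    name: max_derivative_error(tfuncs[name], tfuncs_dx[name], xs)
    for name in tfuncs
}

for name, err in derivative_errors.items():
    print(f'{name:8s} max abs error: {err:.2e}')
```

Errors that stay many orders of magnitude below the derivative values themselves suggest the two dictionaries are consistent on the tested interval.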

Creating and fitting an ITEA_regressor

housing_data = datasets.fetch_california_housing()
X, y         = housing_data['data'], housing_data['target']
labels       = housing_data['feature_names']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

tfuncs = {
    'log'      : np.log,
    'sqrt.abs' : lambda x: np.sqrt(np.abs(x)),
    'id'       : lambda x: x,
    'sin'      : np.sin,
    'cos'      : np.cos,
    'exp'      : np.exp,
}

tfuncs_dx = {
    'log'      : lambda x: 1/x,
    'sqrt.abs' : lambda x: x/( 2*(np.abs(x)**(3/2)) ),
    'id'       : lambda x: np.ones_like(x),
    'sin'      : np.cos,
    'cos'      : lambda x: -np.sin(x),
    'exp'      : np.exp,
}

reg = ITEA_regressor(
    gens         = 50,
    popsize      = 50,
    max_terms    = 5,
    expolim      = (0, 2),
    verbose      = 10,
    tfuncs       = tfuncs,
    tfuncs_dx    = tfuncs_dx,
    labels       = labels,
    random_state = 42,
    simplify_method = 'simplify_by_coef'
).fit(X_train, y_train)
gen | smallest fitness | mean fitness | highest fitness | remaining time
  0 |         0.879653 |     1.075671 |        1.153701 | 1min17sec
 10 |         0.794826 |     0.828574 |        0.983679 | 2min7sec
 20 |         0.788833 |     0.799124 |        0.850923 | 1min39sec
 30 |         0.740517 |     0.783561 |        0.806966 | 1min9sec
 40 |         0.756137 |     0.777656 |        0.806420 | 0min57sec

Inspecting the results from ITEA_regressor and ITExpr_regressor

We can see the convergence of the fitness, the number of terms, or tree complexity by using the ITEA_summarizer, an inspector class focused on the ITEA:

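fig, axs = plt.subplots(3, 1, figsize=(10, 8), sharex=True)

# fit the summarizer on the training data, then draw one panel per metric
summarizer = ITEA_summarizer(itea=reg).fit(X_train, y_train)

summarizer.plot_convergence(
    data=['fitness', 'n_terms', 'complexity'],
    ax=axs,
    show=False
)

plt.tight_layout()
plt.show()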


Now that we have fitted the ITEA, our reg contains the bestsol_ attribute, which is a fitted instance of ITExpr_regressor ready to be used. Let us see the final expression and the execution time.

final_itexpr = reg.bestsol_

print('\nFinal expression:\n', final_itexpr.to_str(term_separator=' +\n'))
print(f'\nElapsed time: {reg.exectime_}')
print(f'\nSelected Features: {final_itexpr.selected_features_}')

Final expression:
 9.361*log(MedInc^2 * HouseAge * AveRooms * AveBedrms * Population^2 * Latitude * Longitude^2) +
-9.662*log(HouseAge * Population^2 * AveOccup * Longitude^2) +
-8.436*log(MedInc^2 * HouseAge^2 * AveRooms^2 * AveBedrms * Population^2 * Latitude^2) +
-1.954*log(AveRooms) +
8.739*log(HouseAge^2 * AveRooms * Population^2 * AveOccup * Longitude^2) +
-53.078

Elapsed time: 210.53795051574707

Selected Features: ['MedInc' 'HouseAge' 'AveRooms' 'AveBedrms' 'Population' 'AveOccup'
 'Latitude' 'Longitude']
# Remember that ITEA and ITExpr implement scikit-learn's
# base classes. We can check all parameters with:
print(final_itexpr.get_params)
<bound method BaseEstimator.get_params of ITExpr_regressor(expr=[('log', [2, 1, 1, 1, 2, 0, 1, 2]),
                       ('log', [0, 1, 0, 0, 2, 1, 0, 2]),
                       ('log', [2, 2, 2, 1, 2, 0, 2, 0]),
                       ('log', [0, 0, 1, 0, 0, 0, 0, 0]),
                       ('log', [0, 2, 1, 0, 2, 1, 0, 2])],
                 labels=array(['MedInc', 'HouseAge', 'AveRooms', 'AveBedrms', 'Population',
       'AveOccup', 'Latitude', 'Longitude'], dtype='<U10'),
                 tfuncs={'cos': <ufunc 'cos'>, 'exp': <ufunc 'exp'>,
                         'id': <function <lambda> at 0x7f4ab0369680>,
                         'log': <ufunc 'log'>, 'sin': <ufunc 'sin'>,
                         'sqrt.abs': <function <lambda> at 0x7f4ab0369f80>})>
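
The expr parameter shown above stores each term as a pair: the name of a transformation function and a vector with one exponent per feature. Combined with the coefficients and the intercept, this matches the printed expression, where each term is a coefficient times a transformation applied to a product of powered features. The following is a minimal NumPy sketch of that evaluation, not itea's internal code. The function evaluate_it_expression and the toy values are hypothetical and introduced only for illustration.

```python
import numpy as np

def evaluate_it_expression(X, expr, coefs, intercept, tfuncs):
    """Evaluate sum_i coefs[i] * f_i(prod_j X[:, j]**k_ij) + intercept."""
    X = np.asarray(X, dtype=float)
    y = np.full(X.shape[0], float(intercept))
    for (fname, exponents), coef in zip(expr, coefs):
        # Interaction: product of every feature raised to its exponent.
        interaction = np.prod(X ** np.asarray(exponents), axis=1)
        y += coef * tfuncs[fname](interaction)
    return y

# Toy two-feature example (hypothetical values, not the fitted model).
toy_tfuncs = {'log': np.log, 'id': lambda x: x}
toy_expr   = [('log', [2, 1]), ('id', [0, 1])]
toy_coefs  = [1.5, -2.0]
toy_X      = np.array([[1.0, 2.0], [3.0, 4.0]])

# Computes 1.5*log(x0^2 * x1) - 2.0*x1 + 0.5 for each row.
toy_pred = evaluate_it_expression(toy_X, toy_expr, toy_coefs, 0.5, toy_tfuncs)
print(toy_pred)
```

An exponent of 0 removes a feature from a term, which is how a single term such as log(AveRooms) can appear in the final expression.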
fig, axs = plt.subplots()

axs.scatter(y_test, final_itexpr.predict(X_test))

We can use the ITExpr_inspector to see information for each term.

display(pd.DataFrame(
    ITExpr_inspector(
        itexpr=final_itexpr, tfuncs=tfuncs
    ).fit(X_train, y_train).terms_analysis()
))
      coef  func       strengths                 coef stderr.  mean pairwise disentanglement  mean mutual information  prediction var.
0    9.361  log        [2, 1, 1, 1, 2, 0, 1, 2]  0.22          0.535                          0.603                    268.140
1   -9.662  log        [0, 1, 0, 0, 2, 1, 0, 2]  0.221         0.494                          0.527                    221.432
2   -8.436  log        [2, 2, 2, 1, 2, 0, 2, 0]  0.222         0.544                          0.609                    263.430
3   -1.954  log        [0, 0, 1, 0, 0, 0, 0, 0]  0.038         0.075                          0.187                    0.290
4    8.739  log        [0, 2, 1, 0, 2, 1, 0, 2]  0.222         0.520                          0.553                    214.239
5  -53.078  intercept  ---                       1.325         0.000                          0.000                    0.000

Explaining the ITExpr_regressor expression using Partial Effects

We can obtain feature importances using the Partial Effects and the ITExpr_explainer.

explainer = ITExpr_explainer(
    itexpr=final_itexpr, tfuncs=tfuncs, tfuncs_dx=tfuncs_dx).fit(X, y)
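
A partial effect is the partial derivative of the model output with respect to one feature, which is why the explainer needs tfuncs_dx. For a term w * f(p(x)), with p(x) the product of powered features, the chain rule gives w * f'(p(x)) * k_j * p(x) / x_j for feature j. The sketch below applies this to a toy expression and evaluates it at the feature means. It is an illustration under the assumption that all feature values are nonzero, not itea's implementation, and partial_effects together with the toy values are hypothetical names.

```python
import numpy as np

def partial_effects(X, expr, coefs, tfuncs_dx):
    """Gradient of sum_i coefs[i] * f_i(prod_j x_j**k_ij) w.r.t. each feature.

    Assumes every entry of X is nonzero, because the chain rule below
    divides the interaction by x_j.
    """
    X = np.asarray(X, dtype=float)
    grad = np.zeros_like(X)
    for (fname, exponents), coef in zip(expr, coefs):
        k = np.asarray(exponents, dtype=float)
        interaction = np.prod(X ** k, axis=1)          # p_i(x)
        outer = coef * tfuncs_dx[fname](interaction)   # w_i * f_i'(p_i(x))
        # d p_i / d x_j = k_ij * p_i(x) / x_j
        grad += outer[:, None] * k * interaction[:, None] / X
    return grad

# Toy expression (hypothetical): 1.5*log(x0^2 * x1) - 2.0*x1
toy_tfuncs_dx = {'log': lambda x: 1/x, 'id': lambda x: np.ones_like(x)}
toy_expr      = [('log', [2, 1]), ('id', [0, 1])]
toy_coefs     = [1.5, -2.0]
toy_X         = np.array([[1.0, 2.0], [3.0, 4.0]])

# Partial effects at the means: evaluate the gradient at the mean observation.
x_mean      = toy_X.mean(axis=0, keepdims=True)
pe_at_means = partial_effects(x_mean, toy_expr, toy_coefs, toy_tfuncs_dx)
print(pe_at_means)
```

For this toy expression the analytic gradient is (3/x0, 1.5/x1 - 2), so the values at the mean point (2, 3) can be checked by hand.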


The Partial Effects at the Means help us understand how the contribution of each variable changes across its range of values when the remaining variables are held fixed at their means.

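fig, axs = plt.subplots(2, 4, figsize=(10, 5))

# one panel per feature (the data set has 8 features)
explainer.plot_partial_effects_at_means(
    X=X_test,
    features=range(8),
    ax=axs,
    show=False
)

plt.tight_layout()
plt.show()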



Finally, we can also plot the mean relative importances of each feature by calculating the average Partial Effect for each interval when the output is discretized.

fig, ax = plt.subplots(1, 1, figsize=(10, 4))

explainer.plot_normalized_partial_effects(
    grouping_threshold=0.1, show=False,
    num_points=100, ax=ax
)

plt.tight_layout()
plt.show()
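
The idea of averaging partial effects within intervals of a discretized output can be sketched with plain NumPy. The version below is only one possible reading of that description and is not itea's implementation: it assumes equal-width output bins and absolute partial effects normalized to sum to 1 within each bin. The function normalized_effects_by_output_bin and the toy arrays are hypothetical.

```python
import numpy as np

def normalized_effects_by_output_bin(pe, y_pred, n_bins=4):
    """Mean |partial effect| per feature inside equal-width output bins,
    normalized so that the importances of each bin sum to 1."""
    pe = np.asarray(pe, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    edges = np.linspace(y_pred.min(), y_pred.max(), n_bins + 1)
    # Using only the interior edges yields bin ids in 0..n_bins-1.
    bin_ids = np.digitize(y_pred, edges[1:-1])
    shares = np.full((n_bins, pe.shape[1]), np.nan)  # NaN marks empty bins
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_abs = np.abs(pe[mask]).mean(axis=0)
            total = mean_abs.sum()
            if total > 0:
                shares[b] = mean_abs / total
    return edges, shares

# Toy partial effects for 4 observations and 2 features (hypothetical values).
toy_pe     = np.array([[1.0, 3.0], [1.0, 1.0], [2.0, 2.0], [3.0, 1.0]])
toy_y_pred = np.array([0.0, 1.0, 2.0, 3.0])

edges, shares = normalized_effects_by_output_bin(toy_pe, toy_y_pred, n_bins=2)
print(edges)
print(shares)
```

Each row of shares describes one output interval, so reading the rows in order shows how the relative importance of the features shifts as the predicted value grows.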
