henchman.learning.create_model(X, y, model=None, metric=None, n_splits=1, split_size=0.3, _return_df=False)

Make a model. Returns a list of scores and a fit model. A wrapper around a standard scoring workflow. Uses train_test_split when n_splits is 1 (the default); otherwise uses TimeSeriesSplit.

In this function we trade flexibility for ease of use. Unless you want this exact validation-fitting-scoring method, it’s recommended you just use the sklearn API.
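For comparison, the single-split workflow can be written directly against the sklearn API. The following is a minimal sketch, not henchman's implementation: the synthetic data from make_classification, the random_state values, and n_estimators=20 are assumptions chosen only to make the example self-contained.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a cleaned numeric feature matrix and labels
# (an assumption for illustration; use your own X and y in practice).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Hold out 30% of the rows for testing, mirroring split_size=0.3.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Fit on the training rows only.
model = RandomForestClassifier(n_estimators=20, random_state=0)
model.fit(X_train, y_train)

# The metric is called as metric(y_test, preds).
preds = model.predict(X_test)
score = roc_auc_score(y_test, preds)
print('Score of {:.2f}'.format(score))
```

Writing the steps out this way makes it easy to change any one of them, for example scoring on predict_proba output instead of hard predictions, which a fixed wrapper does not expose.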

Parameters:

  • X (pd.DataFrame) – A cleaned numeric feature matrix.
  • y (pd.Series) – A column of labels.
  • model – A sklearn model with fit and predict methods.
  • metric – A metric which takes y_test, preds and returns a score.
  • n_splits (int) – If 1, use train_test_split. Otherwise use TimeSeriesSplit with this many splits. Default value is 1.
  • split_size (float) – Size of the testing set. Default is 0.3.
  • _return_df (bool) – If true, also return (X_train, X_test, y_train, y_test) after the scores and model. Not generally useful, but sometimes necessary.
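To illustrate what the n_splits > 1 case means, here is a rough sketch of scoring a model over TimeSeriesSplit folds with plain sklearn. The helper name score_time_series_splits is hypothetical, and the sketch ignores split_size, since how create_model combines it with TimeSeriesSplit is not stated here. It should not be read as henchman's actual code.

```python
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import TimeSeriesSplit

# Synthetic data, assumed to be ordered in time for illustration.
X, y = make_classification(n_samples=300, n_features=5, random_state=0)


def score_time_series_splits(X, y, model, metric, n_splits):
    """Hypothetical helper: one score per TimeSeriesSplit fold."""
    scores = []
    fit_model = None
    splitter = TimeSeriesSplit(n_splits=n_splits)
    for train_idx, test_idx in splitter.split(X):
        # Each fold trains on earlier rows and tests on the rows after them.
        fit_model = clone(model).fit(X[train_idx], y[train_idx])
        preds = fit_model.predict(X[test_idx])
        scores.append(metric(y[test_idx], preds))
    # Return every fold's score and the model fit on the last fold.
    return scores, fit_model


scores, fit_model = score_time_series_splits(
    X, y,
    RandomForestClassifier(n_estimators=20, random_state=0),
    accuracy_score,
    n_splits=5)
print('Number of scores: {}'.format(len(scores)))
```

Because each fold tests only on rows that come after its training rows, this kind of split avoids training on the future, which is why it suits time-ordered data better than a shuffled split.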

Returns:

A list of scores and a fit model.

Return type:

(list[float], sklearn.ensemble)


>>> from henchman.learning import create_model
>>> import numpy as np
>>> from sklearn.ensemble import RandomForestClassifier
>>> from sklearn.metrics import roc_auc_score
>>> scores, fit_model = create_model(X, y,
...                                  RandomForestClassifier(),
...                                  roc_auc_score,
...                                  n_splits=5)
>>> print('Average score of {:.2f}'.format(np.mean(scores)))