The Henchman package has reusable functionality in four areas: dataframe diagnostics, feature selection, machine learning and Bokeh plotting. We'll demonstrate some of that functionality here in the documentation.

To get started, we'll use a premade feature matrix built from the Featuretools ft.demo.load_flight() function. We'll load the CSVs using pandas (imported as pd).

In [1]: sample_df = pd.read_csv('../tests/sample_data/sample_fm.csv', index_col='trip_log_id').iloc[:100,:15]

In [2]: fm_enc = pd.read_csv('../tests/sample_data/sample_fm_enc.csv', index_col='trip_log_id')

Diagnostics

It can sometimes be hard to find information about a dataframe by inspection. Frequent questions such as “how large is this dataframe” and “are there duplicates” usually require copying code from one notebook to another. In this module, we give easy function calls that do those basic diagnostics.

In [3]: from henchman.diagnostics import overview

In [4]: overview(fm_enc)

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 320
Number of rows: 100

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 0.26 MB
Average memory by column: 0.00 MB

+--------------+
|  Data Types  |
+--------------+
         index
0             
bool         1
int64      247
float64     72

With just the overview function call, we've answered many of our basic questions about this dataframe. We know that it has 320 columns, 100 rows and uses about 0.26 MB of memory. Almost all of the columns are integers or floats in pandas.
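
The quantities overview reports can be approximated with plain pandas. A minimal sketch on a toy frame (this is an illustration of the idea, not henchman's implementation):

```python
import pandas as pd

# A small frame standing in for a feature matrix (hypothetical data).
df = pd.DataFrame({
    'a': [1, 2, 3, 4],
    'b': [1.0, 2.0, None, 4.0],
    'c': ['x', 'y', 'x', 'z'],
})

n_rows, n_cols = df.shape
total_mb = df.memory_usage(deep=True).sum() / 1024 ** 2
missing_by_col = df.isnull().sum()
dtype_counts = df.dtypes.astype(str).value_counts()

print('Number of columns: {}'.format(n_cols))
print('Number of rows: {}'.format(n_rows))
print('Most values missing from a column: {}'.format(missing_by_col.max()))
print('Total memory used: {:.2f} MB'.format(total_mb))
print(dtype_counts)
```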

It’s also useful to see warnings about common data science pitfalls. The warnings function checks the pairwise correlation of all of your columns, whether you have any duplicates, whether many values are missing from a column and whether you have object columns with many distinct values. Let’s try the warnings function on a smaller dataframe.

In [5]: from henchman.diagnostics import warnings

In [6]: warnings(sample_df)

+------------+
|  Warnings  |
+------------+
scheduled_elapsed_time and distance are linearly correlated: 0.980
DAY(flight_date) and DAY(scheduled_arr_time) are linearly correlated: 1.000
DAY(scheduled_dep_time) and DAY(scheduled_arr_time) are linearly correlated: 1.000
flight_id has many unique values: 98

It’s not particularly surprising that the distance would be highly correlated with how long a flight will take! Nevertheless, it’s a good thing to know before feeding both of those columns into a machine learning algorithm. It also seems like we shouldn’t be encoding the flight id, since it would give too many unique values. A better approach (and the one that was taken in this feature matrix) is to aggregate according to that id.
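
Checks of this kind can be sketched with plain pandas as well: compare pairwise correlations of the numeric columns against a threshold, and count distinct values in object columns. The data and thresholds below are hypothetical, not the ones henchman uses:

```python
import pandas as pd

# Toy data: 'dist' is roughly proportional to 'time', 'id' is near-unique.
df = pd.DataFrame({
    'time': [60, 120, 180, 240, 300],
    'dist': [100, 205, 290, 410, 495],
    'id': ['f1', 'f2', 'f3', 'f4', 'f4'],
})

corr_thresh = 0.95
numeric = df.select_dtypes('number')
corr = numeric.corr()
# Walk the upper triangle of the correlation matrix.
for i, col_1 in enumerate(corr.columns):
    for col_2 in corr.columns[i + 1:]:
        if abs(corr.loc[col_1, col_2]) > corr_thresh:
            print('{} and {} are linearly correlated: {:.3f}'.format(
                col_1, col_2, corr.loc[col_1, col_2]))

unique_thresh = 3
for col in df.select_dtypes('object'):
    n_unique = df[col].nunique()
    if n_unique > unique_thresh:
        print('{} has many unique values: {}'.format(col, n_unique))
```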

While overview, warnings and column_report are technically three distinct API-stable functions, the most common use case is to call all three at once using the profile function. This does the same overview and warnings as above but also gives information about every column in the dataframe.

In [7]: from henchman.diagnostics import profile

In [8]: profile(sample_df)

+--------------+
|  Data Shape  |
+--------------+
Number of columns: 15
Number of rows: 100

+------------------+
|  Missing Values  |
+------------------+
Most values missing from column: 0
Average missing values by column: 0.00

+----------------+
|  Memory Usage  |
+----------------+
Total memory used: 0.03 MB
Average memory by column: 0.00 MB

+--------------+
|  Data Types  |
+--------------+
         index
0             
int64       12
float64      1
object       2

+------------+
|  Warnings  |
+------------+
scheduled_elapsed_time and distance are linearly correlated: 0.980
DAY(flight_date) and DAY(scheduled_arr_time) are linearly correlated: 1.000
DAY(scheduled_dep_time) and DAY(scheduled_arr_time) are linearly correlated: 1.000
flight_id has many unique values: 98

+-------------------------+
|  Object Column Summary  |
+-------------------------+

## time ##
Unique: 50
Mode: 2017-01-30, (matches 6.0% of rows)

## flight_id ##
Unique: 98
Mode: No Mode

+--------------------------+
|  Numeric Column Summary  |
+--------------------------+

## scheduled_elapsed_time ##
Maximum: 24780000000000, Minimum: 3900000000000, Mean: 10066800000000.00
Quartile 3: 12990000000000.00 | Median: 8400000000000.00| Quartile 1: 5490000000000.00

## distance ##
Maximum: 2615.0, Minimum: 184.0, Mean: 987.54
Quartile 3: 1284.25 | Median: 723.50| Quartile 1: 369.00

## DAY(flight_date) ##
Maximum: 31, Minimum: 1, Mean: 15.13
Quartile 3: 23.00 | Median: 16.00| Quartile 1: 7.00

## DAY(scheduled_dep_time) ##
Maximum: 31, Minimum: 1, Mean: 15.13
Quartile 3: 23.00 | Median: 16.00| Quartile 1: 7.00

## DAY(scheduled_arr_time) ##
Maximum: 31, Minimum: 1, Mean: 15.16
Quartile 3: 23.00 | Median: 16.00| Quartile 1: 7.00

## DAY(time_index) ##
Maximum: 31, Minimum: 1, Mean: 16.11
Quartile 3: 25.00 | Median: 15.50| Quartile 1: 8.00

## YEAR(flight_date) ##
Maximum: 2017, Minimum: 2017, Mean: 2017.00
Quartile 3: 2017.00 | Median: 2017.00| Quartile 1: 2017.00

## YEAR(scheduled_dep_time) ##
Maximum: 2017, Minimum: 2017, Mean: 2017.00
Quartile 3: 2017.00 | Median: 2017.00| Quartile 1: 2017.00

## YEAR(scheduled_arr_time) ##
Maximum: 2017, Minimum: 2017, Mean: 2017.00
Quartile 3: 2017.00 | Median: 2017.00| Quartile 1: 2017.00

## YEAR(time_index) ##
Maximum: 2016, Minimum: 2016, Mean: 2016.00
Quartile 3: 2016.00 | Median: 2016.00| Quartile 1: 2016.00

## MONTH(flight_date) ##
Maximum: 2, Minimum: 1, Mean: 1.48
Quartile 3: 2.00 | Median: 1.00| Quartile 1: 1.00

## MONTH(scheduled_dep_time) ##
Maximum: 2, Minimum: 1, Mean: 1.48
Quartile 3: 2.00 | Median: 1.00| Quartile 1: 1.00

## MONTH(scheduled_arr_time) ##
Maximum: 2, Minimum: 1, Mean: 1.48
Quartile 3: 2.00 | Median: 1.00| Quartile 1: 1.00

Module Contents

overview(data) Give a brief data overview.
warnings(data[, corr_thresh, …]) Warn about common dataset problems.
column_report(data) Give column summaries according to pandas dtype.
profile(data[, corr_thresh, missing_thresh, …]) Profile dataset.

Selection

Some lightweight feature selection tools are provided as well. There is RandomSelect, which randomly chooses n_feats features to select, and Dendrogram, which can use pairwise correlation to find a feature set.

In [9]: from henchman.selection import RandomSelect

In [10]: X = fm_enc.copy().fillna(0).iloc[:100, :30]

In [11]: y = fm_enc.copy().pop('label')[:100]

In [12]: sel = RandomSelect(n_feats=12)

In [13]: sel.fit(X)

In [14]: sel.transform(X).head()
Out[14]: 
             flight_id = VX-312:LAX->MCO         ...           flight_id = unknown
trip_log_id                                      ...                              
37012                                  0         ...                             1
11885                                  0         ...                             1
7141                                   0         ...                             1
26136                                  0         ...                             1
32295                                  0         ...                             1

[5 rows x 12 columns]
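
At its core, random selection is just sampling column names without replacement and subsetting the frame. A rough sketch of the idea on toy data (not henchman's code):

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)

# Hypothetical frame with more columns than we want to keep.
df = pd.DataFrame(rng.rand(5, 8), columns=['f{}'.format(i) for i in range(8)])

n_feats = 3
# Sample n_feats column names without replacement, then subset the frame.
chosen = rng.choice(df.columns, size=n_feats, replace=False)
reduced = df[chosen]
print(reduced.shape)
```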

Alternatively, you can use pairwise correlation:

In [15]: from henchman.selection import Dendrogram

In [16]: sel2 = Dendrogram(X)

In [17]: sel2.transform(X, n_feats=12).head()
There are 12 distinct connected components at thresh step 17 in the Dendrogram
You might also be interested in 13 components at step 16
Out[17]: 
             scheduled_elapsed_time          ...            DAY(flight_date) = 12
trip_log_id                                  ...                                 
37012                22860000000000          ...                                0
11885                 4440000000000          ...                                0
7141                  4800000000000          ...                                0
26136                 5700000000000          ...                                0
32295                12480000000000          ...                                0

[5 rows x 12 columns]

The Dendrogram cannot necessarily find a feature set of an arbitrary size. Since not all features are pairwise correlated, not all features will eventually be connected. The object stores full connectivity information according to the given metric, so if some other distance provides a better indicator of feature closeness in your dataset, it can be passed in as an argument.

As a last note, the features returned are actually representatives of the connected components of a particular graph. Those components can be inspected, and the representatives can be shuffled to return a similarly connected feature set.

In [18]: sel2.graphs[4]
Out[18]: 
defaultdict(set,
            {0: {0, 1},
             2: {2},
             3: {3},
             4: {4},
             5: {5},
             6: {6},
             7: {7},
             8: {8},
             9: {9},
             10: {10},
             11: {11},
             12: {12},
             13: {13, 24},
             14: {14, 25},
             15: {15, 26},
             16: {16, 27},
             17: {17, 28},
             18: {18, 29},
             19: {19},
             20: {20},
             21: {21},
             22: {22},
             23: {23}})

In [19]: sel2.shuffle_all_representatives()

In [20]: sel2.graphs[4]
Out[20]: 
defaultdict(set,
            {1: {0, 1},
             2: {2},
             3: {3},
             4: {4},
             5: {5},
             6: {6},
             7: {7},
             8: {8},
             9: {9},
             10: {10},
             11: {11},
             12: {12},
             24: {13, 24},
             25: {14, 25},
             15: {15, 26},
             27: {16, 27},
             17: {17, 28},
             29: {18, 29},
             19: {19},
             20: {20},
             21: {21},
             22: {22},
             23: {23}})
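
The grouping shown above can be pictured with a small union-find: features whose pairwise correlation clears a threshold land in the same component, and one member of each component stands in as its representative. A hedged sketch of that idea on toy data (not henchman's implementation; the threshold is illustrative):

```python
import pandas as pd

# Toy frame: 'b' is an exact multiple of 'a'; 'c' and 'd' are independent.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [2, 4, 6, 8, 10],
    'c': [5, 3, 4, 1, 2],
    'd': [3, 1, 4, 1, 5],
})

thresh = 0.9
corr = df.corr().abs()
cols = list(df.columns)
parent = {c: c for c in cols}

def find(c):
    # Follow parent pointers to the component representative.
    while parent[c] != c:
        c = parent[c]
    return c

# Union any pair of features whose |correlation| exceeds the threshold.
for i, c1 in enumerate(cols):
    for c2 in cols[i + 1:]:
        if corr.loc[c1, c2] > thresh:
            parent[find(c2)] = find(c1)

# Collect the connected components, keyed by representative.
components = {}
for c in cols:
    components.setdefault(find(c), []).append(c)
print(components)
```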

Module Contents

RandomSelect([names, n_feats]) Randomly choose a feature set.
Dendrogram([X, pairing_func, max_threshes]) Pair features by an arbitrary function.

Learning

The learning module exists to simplify some frequent machine learning calls. For instance, given a feature matrix X and a column of labels y, it’s nice to be able to quickly get back a score. We’ll use the create_model function.

In [21]: from henchman.learning import create_model

In [22]: from sklearn.ensemble import RandomForestClassifier

In [23]: from sklearn.metrics import roc_auc_score

In [24]: scores, fit_model = create_model(X, y, RandomForestClassifier(), roc_auc_score)

In [25]: scores
Out[25]: [0.6538461538461537]
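
Under the hood, a call like this is a standard split-fit-score loop. A sketch of that pattern with scikit-learn on synthetic data (a stand-in for the flight matrix, not henchman's actual implementation):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a feature matrix and label column.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = RandomForestClassifier(n_estimators=10, random_state=0)
model.fit(X_train, y_train)

# Score held-out predictions with the chosen metric.
score = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print('roc_auc: {:.3f}'.format(score))
```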

We can check feature importances with similar ease:

In [26]: from henchman.learning import feature_importances

In [27]: feats = feature_importances(X, fit_model, n_feats=3)
1: scheduled_elapsed_time [1.000]
2: distance [0.523]
3: DAY(flight_date) = 22 [0.352]
-----


In [28]: X[feats].head()
Out[28]: 
             scheduled_elapsed_time          ...            DAY(flight_date) = 22
trip_log_id                                  ...                                 
37012                22860000000000          ...                                0
11885                 4440000000000          ...                                0
7141                  4800000000000          ...                                0
26136                 5700000000000          ...                                0
32295                12480000000000          ...                                0

[5 rows x 3 columns]
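
Ranking features like this sits on top of the fitted model's feature_importances_ attribute: pair each column name with its importance and keep the top n_feats. A sketch on synthetic data (illustrative names, not the flight matrix):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic feature matrix with named columns.
X_arr, y = make_classification(n_samples=100, n_features=6,
                               n_informative=3, random_state=0)
X = pd.DataFrame(X_arr, columns=['f{}'.format(i) for i in range(6)])

model = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

n_feats = 3
# Pair column names with importances and keep the top n_feats.
ranked = sorted(zip(X.columns, model.feature_importances_),
                key=lambda pair: pair[1], reverse=True)
top_feats = [name for name, _ in ranked[:n_feats]]
print(top_feats)
```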

Module Contents

create_model(X, y[, model, metric, …]) Make a model.
inplace_encoder(X) Replace all columns with pd.dtype == ‘O’ with integers.
feature_importances(X, model[, n_feats]) Print a list of important features.
create_holdout(X, y[, split_size]) A wrapper around train_test_split.

Plotting

The plotting module provides a collection of useful dataset-agnostic plots. Plots can be either dynamic or static. We recommend importing the whole module at once using import henchman.plotting as hplot for easy access to all of the functions. The single exception might be henchman.plotting.show(), which is useful enough that you might consider importing it by itself.

The show function has many parameters which can be hard to remember. Because of that, there’s a templating function from which you can copy and paste the arguments you want.

In [29]: import henchman.plotting as hplot

In [30]: from henchman.plotting import show

In [31]: hplot.show_template()
show(plot,
     static=False,
     png=False,
     hover=False,
     colors=None,
     width=None,
     height=None,
     title='Temporary title',
     x_axis='my xaxis name',
     y_axis='my yaxis name',
     x_range=(0, 10) or None,
     y_range=(0, 10) or None)

See the plotting gallery page for some example Bokeh plots.

Module Contents

show(plot[, png, static, hover, width, …]) Format and show a bokeh plot.
show_template() Prints a template for show.
piechart(col[, sort, mergepast, drop_n, figargs]) Creates a piechart.
histogram(col[, y, n_bins, col_max, …]) Creates a histogram.
timeseries(col_1, col_2[, col_max, col_min, …]) Creates a time-based aggregation of a numeric variable.
scatter(col_1, col_2[, cat, label, …]) Creates a scatter plot of two variables.
dendrogram(D[, figargs]) Creates a dendrogram plot.
feature_importances(X, model[, n_feats, figargs]) Plot feature importances.
roc_auc(X, y, model[, pos_label, prob_col, …]) Plots the receiver operating characteristic curve.
f1(X, y, model[, n_precs, n_splits, figargs]) Plots the precision, recall and f1 at various thresholds.