Machine learning models learn from data. It is crucial, however, that the data you feed them is specifically preprocessed and refined for the problem you want to solve. This includes data cleaning, preprocessing, feature engineering, and so on.
Very often, when presented with a dataset, I would fire up a Jupyter notebook and start exploring it interactively. The notebook is great for that task, but after a while I ended up with code that is a total mess in the global namespace.
Then I read about scikit-learn’s Pipeline object, a utility that provides a way to automate a machine learning workflow. It works by allowing several transformers to be chained together. One can also add an estimator at the end of the pipeline. Data flows from the start of the pipeline to its end, and at each step it is transformed and fed to the next component. A Pipeline object has two main methods:
- fit_transform: this same method is called on each transformer in turn, and each time the result is fed into the next transformer;
- fit_predict: if your pipeline ends with an estimator, then as before the data is transformed until it arrives at the last step, where it is fed into the estimator and fit_predict is called on the estimator.
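The chaining described above can be sketched with two stock transformers. The toy data and the choice of StandardScaler and PCA are assumptions made purely for illustration; the point is only that the pipeline gives the same result as calling each step by hand.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Toy data, made up for illustration: 20 samples, 4 features.
rng = np.random.RandomState(0)
X = rng.normal(size=(20, 4))

# Two chained transformers, no final estimator.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
])
X_pipe = pipe.fit_transform(X)

# The same computation written out step by step.
X_scaled = StandardScaler().fit_transform(X)
X_manual = PCA(n_components=2).fit_transform(X_scaled)

print(np.allclose(X_pipe, X_manual))
```

Each step receives the output of the previous one, so the manual version needs an intermediate variable per step while the pipeline needs none.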
Sometimes data flow is not linear, and that’s where FeatureUnion comes in. A FeatureUnion is itself a transformer, which combines multiple transformers. During fitting, they are fitted independently, while for the transformation, each component of the union is applied in parallel. Once all the results have been collected, they are concatenated side by side into a single feature matrix.
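A minimal sketch of that behaviour, assuming made-up data and two arbitrary branches (PCA and SelectKBest): every branch sees the same input, and the outputs are stacked column-wise.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import FeatureUnion

# Toy regression data, made up for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(30, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=30)

# Two independent branches applied to the same input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_regression, k=1)),
])
X_union = union.fit_transform(X, y)

# Equivalent manual version: run each branch, then stack the columns.
X_pca = PCA(n_components=2).fit_transform(X)
X_best = SelectKBest(f_regression, k=1).fit_transform(X, y)
X_manual = np.hstack([X_pca, X_best])

print(X_union.shape)
```

The result has 2 + 1 = 3 columns, one block per branch, in the order the branches were listed.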
Example
The excellent scikit-learn documentation has loads of examples. Let’s take a look at the Anova SVM pipeline. The relevant part is the following:
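from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import make_pipeline

# X and y (features and targets) are assumed to be defined already
# ANOVA SVM-C
# 1) anova filter, take 3 best ranked features
anova_filter = SelectKBest(f_regression, k=3)
# 2) svm
clf = svm.SVC(kernel='linear')
anova_svm = make_pipeline(anova_filter, clf)
anova_svm.fit(X, y)
anova_svm.predict(X)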
The function make_pipeline is just a wrapper around the Pipeline class, and it lets you compose transformers and estimators without specifying a name for each one. The above code is equivalent to the following:
# ANOVA SVM-C
# 1) anova filter, take 3 best ranked features
anova_filter = SelectKBest(f_regression, k=3)
# 2) svm
clf = svm.SVC(kernel='linear')
anova_filter.fit(X, y)
X_ = anova_filter.transform(X)
clf.fit(X_, y)
clf.predict(X_)
In this little example, we only have one transformer and one estimator, but the difference in readability and clarity is significantly in favour of the first version. In what follows, I’ll explain how I got scikit-learn and pandas working together in a pipeline with many more transformers.
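As a side note on naming: make_pipeline derives each step name from the lowercased class name, whereas the Pipeline class takes explicit names. The sketch below contrasts the two; the step name "anova" is a hypothetical label chosen for illustration.

```python
from sklearn import svm
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, make_pipeline

# Names are generated automatically from the class names.
auto_named = make_pipeline(SelectKBest(f_regression, k=3),
                           svm.SVC(kernel="linear"))
step_names = [name for name, _ in auto_named.steps]
print(step_names)

# With the Pipeline class, the names are chosen by hand ("anova" is made up).
explicit = Pipeline([
    ("anova", SelectKBest(f_regression, k=3)),
    ("svc", svm.SVC(kernel="linear")),
])

# Step names are what you use to reach nested parameters.
explicit.set_params(anova__k=2)
```

Explicit names matter mostly when parameters of a step have to be addressed later, for instance in a grid search.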
Pipelines and Pandas dataframes
Unfortunately, scikit-learn’s API expects NumPy arrays. If you feed a dataframe into a pipeline, you will get a NumPy array out of it. Other times, as is the case with FeatureUnion, it will not work as expected. It would be much better if one could get a dataframe out of the pipeline. Right now various efforts are in place to allow a better sklearn/pandas integration, namely:
- the PR scikit-learn/3886, which at the time of writing is still a work in progress;
- the package sklearn-pandas.
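The loss of the dataframe is easy to reproduce. This is a small sketch with a made-up two-column frame, assuming scikit-learn's default output configuration (no output-format option is set):

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Made-up frame for illustration.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0]})

# A DataFrame goes in ...
out = make_pipeline(StandardScaler()).fit_transform(df)

# ... and a plain array comes out: column names and index are gone.
print(type(out))
```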
I tried sklearn-pandas but it doesn’t quite do what I wanted: it provides a way to map DataFrame columns to transformations. Most of the time, however, I construct a pipeline of transformers and I want to receive a DataFrame as input or output. For this reason I wrote a custom transformer that does precisely this:
from sklearn.base import TransformerMixin

class NoFitMixin:
    def fit(self, X, y=None):
        return self

class DFTransform(TransformerMixin, NoFitMixin):
    def __init__(self, func, copy=False):
        self.func = func
        self.copy = copy

    def transform(self, X):
        X_ = X if not self.copy else X.copy()
        return self.func(X_)
It accepts a function as argument and the transformed data is simply its return value. The copy keyword argument is there to prevent double copying: if the function itself returns a new DataFrame, then there’s no need to copy the input first, so copy=True is only needed when the function modifies its argument in place.
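A minimal usage sketch of the two copy modes follows. The class is repeated so the block runs on its own; unlike the version above it also inherits from BaseEstimator, which is an assumption made here for compatibility with recent scikit-learn releases. The helper add_ratio and the toy frame are hypothetical.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DFTransform(TransformerMixin, BaseEstimator):
    """Stateless transformer that applies `func` to a DataFrame."""

    def __init__(self, func, copy=False):
        self.func = func
        self.copy = copy

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_ = X.copy() if self.copy else X
        return self.func(X_)


def add_ratio(df):
    # Hypothetical helper: it mutates its argument, hence copy=True below.
    df["ratio"] = df["a"] / df["b"]
    return df


df = pd.DataFrame({"a": [1.0, None, 3.0], "b": [2.0, 4.0, 6.0]})

pipe = Pipeline([
    # fillna already returns a new frame, so no copy is needed.
    ("fill", DFTransform(lambda X: X.fillna(0.0))),
    # add_ratio writes into its input, so copy it first.
    ("ratio", DFTransform(add_ratio, copy=True)),
])
out = pipe.fit_transform(df)

print(out)
```

The output is still a DataFrame, and the original df is left untouched.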
The only problem arises when using FeatureUnion: it does not concatenate the results into a DataFrame. I wrote a custom class for this case as well:
import pandas as pd
# joblib is imported directly: sklearn.externals.joblib was removed
# in later scikit-learn releases
from joblib import Parallel, delayed
from sklearn.pipeline import Pipeline, FeatureUnion, _transform_one

class DFFeatureUnion(FeatureUnion):
    def fit_transform(self, X, y=None, **fit_params):
        # non-optimized default implementation; override when a better
        # method is possible
        if y is None:
            # fit method of arity 1 (unsupervised transformation)
            return self.fit(X, **fit_params).transform(X)
        else:
            # fit method of arity 2 (supervised transformation)
            return self.fit(X, y, **fit_params).transform(X)

    def transform(self, X):
        # _transform_one is a private scikit-learn helper; this call uses
        # the v0.21 signature and may need adjusting for other versions
        Xs = Parallel(n_jobs=self.n_jobs)(
            delayed(_transform_one)(trans, X, None, weight)
            for _, trans, weight in self._iter())
        # concatenate the DataFrames from each branch column-wise
        return pd.concat(Xs, axis=1, join='inner')
This is an example showing how they can be used:
pipeline = Pipeline([
    ('ordinal_to_nums', DFTransform(_ordinal_to_nums, copy=True)),
    ('union', DFFeatureUnion([
        ('categorical', Pipeline([
            ('select', DFTransform(lambda X: X.select_dtypes(include=['object']))),
            ('fill_na', DFTransform(lambda X: X.fillna('NA'))),
            ('one_hot', DFTransform(_one_hot_encode)),
        ])),
        ('numerical', Pipeline([
            ('select', DFTransform(lambda X: X.select_dtypes(exclude=['object']))),
            ('fill_median', DFTransform(lambda X: X.fillna(X.median()))),
            ('add_features', DFTransform(_add_features, copy=True)),
            ('remove_skew', DFTransform(_remove_skew, copy=True)),
            ('find_outliers', DFTransform(_find_outliers, copy=True)),
            ('normalize', DFTransform(lambda X: X.div(X.max())))
        ])),
    ])),
])
The above pipeline splits the DataFrame into categorical and numerical columns, applying different transformations to each. The columns are concatenated into a DataFrame at the end of the DFFeatureUnion.
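To see what the two branches do to the data, here is the same flow written in plain pandas on a made-up frame. Only the lambdas shown in the pipeline are reproduced. pd.get_dummies is an assumed stand-in for _one_hot_encode, whose body is not shown here, and the remaining helpers (_ordinal_to_nums, _add_features, _remove_skew, _find_outliers) are left out for the same reason.

```python
import pandas as pd

# Made-up frame: one categorical column (object dtype) and one numerical.
df = pd.DataFrame({
    "color": pd.Series(["red", None, "blue"], dtype=object),
    "size": [1.0, None, 4.0],
})

# Categorical branch: select, fill missing values, one-hot encode.
cat = df.select_dtypes(include=["object"]).fillna("NA")
cat = pd.get_dummies(cat)  # assumption: stand-in for _one_hot_encode

# Numerical branch: select, fill with the median, scale by the maximum.
num = df.select_dtypes(exclude=["object"])
num = num.fillna(num.median())
num = num.div(num.max())

# What the union does at the end: align on the index, join the columns.
result = pd.concat([cat, num], axis=1, join="inner")

print(result)
```

Because both branches keep the original index, the final concatenation lines the rows up again without any extra bookkeeping.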
The resulting code is well organized and very easy to understand. It’s also extremely easy to add or remove steps to/from the pipeline.
UPDATE (Oct 28, 2017): As of scikit-learn v0.19.0, the function signature of the undocumented function _transform_one changed, and the code of DFFeatureUnion was updated accordingly (thanks to Paulo Cheadi Haddad Filho for pointing it out).
UPDATE (Dec 02, 2019): As of scikit-learn v0.21.0, the function signature of the function _transform_one changed once again, and the code of DFFeatureUnion was updated accordingly (thanks to Григорий Гусаров for pointing it out).
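Since _transform_one is private and its signature keeps moving, one way to sidestep the problem is a union built only on public methods. The following is a hedged sketch rather than a drop-in replacement: SimpleDFUnion and ColumnPicker are hypothetical names, and the class runs its branches sequentially and ignores transformer weights, unlike FeatureUnion.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin


class SimpleDFUnion(TransformerMixin, BaseEstimator):
    """Hypothetical, sequential stand-in for DFFeatureUnion.

    Uses only public fit/transform calls: no parallelism, no weights.
    """

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        # Fit every branch on the same input.
        for _, trans in self.transformer_list:
            trans.fit(X, y)
        return self

    def transform(self, X):
        # Each branch must return a DataFrame; join them column-wise.
        frames = [trans.transform(X) for _, trans in self.transformer_list]
        return pd.concat(frames, axis=1, join="inner")


class ColumnPicker(TransformerMixin, BaseEstimator):
    """Hypothetical helper: keep only the given DataFrame columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6]})
union = SimpleDFUnion([
    ("left", ColumnPicker(["a"])),
    ("right", ColumnPicker(["c", "b"])),
])
out = union.fit_transform(df)

print(out)
```

The trade-off is losing n_jobs and transformer_weights in exchange for not depending on scikit-learn internals.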