Why you should use scikit-learn's Pipeline object
Machine learning models learn from data. It is crucial, however, that the data you feed them is specifically preprocessed and refined for the problem you want to solve. This includes data cleaning, preprocessing, feature engineering, and so on.
Very often, when presented with a dataset, I would fire up a Jupyter notebook and start exploring it interactively. The notebook is great for that task, but after a while I ended up with code that is a total mess in the global namespace.
Then I read about scikit-learn’s
Pipeline object, a utility
that provides a way to automate a machine learning workflow. It works by
allowing several transformers to be chained together. One can also add an
estimator at the end of the pipeline. Data flows from the start of the pipeline
to its end, and each time it is transformed and fed to the next component. A
Pipeline object has two main methods:
- fit_transform: fit_transform is called on each transformer in turn, and each result is fed into the next transformer;
- fit_predict: if your pipeline ends with an estimator, then as before the data is transformed until it arrives at the last step, where it is fed into the estimator and fit_predict is called on the estimator. This only works if that estimator implements fit_predict; otherwise, call fit followed by predict.
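Both behaviours can be sketched in a few lines. The data here is synthetic and the steps (StandardScaler, PCA, LogisticRegression) are arbitrary choices for illustration, not part of the original example. Because LogisticRegression has no fit_predict, the sketch calls fit and then predict on the second pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(60, 5))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# A pipeline made only of transformers: fit_transform runs each step in turn
# and feeds its output to the next one.
transformers = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2, random_state=0)),
])
X_piped = transformers.fit_transform(X)

# The same computation written out by hand.
X_manual = PCA(n_components=2, random_state=0).fit_transform(
    StandardScaler().fit_transform(X)
)
same_output = np.allclose(X_piped, X_manual)

# A pipeline ending with an estimator: fit trains every step,
# predict pushes the data through the fitted transformers first.
classifier = Pipeline([
    ("scale", StandardScaler()),
    ("clf", LogisticRegression()),
])
predictions = classifier.fit(X, y).predict(X)
```

The point of the comparison is that the pipeline does nothing magical: it is the same chain of calls, kept in one object.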
Sometimes data flow is not linear, and that’s where FeatureUnion comes in. A
FeatureUnion is itself a transformer, which combines multiple
transformers. During fitting, they are fitted independently, while for the
transformation, each component of the union is applied in parallel. Once all
the results have been collected, they are concatenated side by side into a single feature matrix.
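As a rough illustration of that concatenation, the sketch below combines two unrelated transformers on synthetic data and checks that the union's output is simply their outputs stacked column-wise. The choice of PCA and SelectKBest is an assumption made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.pipeline import FeatureUnion

# Synthetic data, used only for illustration.
rng = np.random.RandomState(0)
X = rng.normal(size=(40, 6))
y = (X[:, 0] > 0).astype(int)

# Two branches that each see the full input.
union = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("kbest", SelectKBest(f_classif, k=3)),
])
X_union = union.fit_transform(X, y)

# Running the branches by hand and stacking the columns gives the same result.
pca_part = PCA(n_components=2).fit_transform(X)
kbest_part = SelectKBest(f_classif, k=3).fit_transform(X, y)
matches_manual = np.allclose(X_union, np.hstack([pca_part, kbest_part]))
```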
The excellent scikit-learn documentation has loads of examples. Let’s take a look at the Anova SVM pipeline. The relevant part is the following:
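    from sklearn import svm
    from sklearn.feature_selection import SelectKBest, f_regression
    from sklearn.pipeline import make_pipeline

    # ANOVA SVM-C
    # 1) anova filter, take 3 best ranked features
    anova_filter = SelectKBest(f_regression, k=3)
    # 2) svm
    clf = svm.SVC(kernel='linear')

    anova_svm = make_pipeline(anova_filter, clf)
    anova_svm.fit(X, y)
    anova_svm.predict(X)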
make_pipeline is just a wrapper around the Pipeline class, and it lets you compose transformers and
estimators without specifying a name for each one. The above code is equivalent
to the following:
    # ANOVA SVM-C
    # 1) anova filter, take 3 best ranked features
    anova_filter = SelectKBest(f_regression, k=3)
    # 2) svm
    clf = svm.SVC(kernel='linear')

    anova_filter.fit(X, y)
    X_ = anova_filter.transform(X)
    clf.fit(X_, y)
    clf.predict(X_)
In this little example, we only have one transformer and one estimator, but the difference in readability and clarity is significantly in favour of the first version. In what follows, I’ll explain how I got scikit-learn and pandas working together in a pipeline with many more transformers.
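To see what the automatic naming looks like in practice, here is a small sketch that builds the same two steps with and without explicit names. The dataset comes from make_classification and the names "anova" and "svc" in the explicit version are my own choices, not from the scikit-learn example.

```python
from sklearn import svm
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.pipeline import Pipeline, make_pipeline

# Synthetic classification data, used only for illustration.
X, y = make_classification(
    n_samples=80, n_features=10, n_informative=3, random_state=0
)

# make_pipeline derives each step name from the lowercased class name.
auto_named = make_pipeline(SelectKBest(f_regression, k=3), svm.SVC(kernel="linear"))
step_names = [name for name, _ in auto_named.steps]

# The same pipeline with hand-picked step names.
explicit = Pipeline([
    ("anova", SelectKBest(f_regression, k=3)),
    ("svc", svm.SVC(kernel="linear")),
])

auto_pred = auto_named.fit(X, y).predict(X)
explicit_pred = explicit.fit(X, y).predict(X)
```

Explicit names are mostly useful when you want to refer to a step later, for example when setting parameters on it.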
Pipelines and Pandas dataframes
Unfortunately, scikit-learn’s API expects NumPy arrays. If you feed a DataFrame
into a pipeline, you will get a NumPy array out of it. Other times, as is
the case with
FeatureUnion, it will not work as expected. It would be much
better if one could get a DataFrame out of the pipeline. Right now various
efforts are in place to allow a better sklearn/pandas integration, namely:
- the PR scikit-learn/3886, which at the time of writing is still a work in progress;
- the package sklearn-pandas, which provides a way to map DataFrame columns to transformations, but it doesn’t quite do what I wanted.
Most of the time, however, I construct a pipeline of transformers and I want to receive a DataFrame as input or output. For this reason I wrote a custom transformer that does just that:
    from sklearn.base import TransformerMixin


    class NoFitMixin:
        def fit(self, X, y=None):
            return self


    class DFTransform(TransformerMixin, NoFitMixin):
        def __init__(self, func, copy=False):
            self.func = func
            self.copy = copy

        def transform(self, X):
            X_ = X if not self.copy else X.copy()
            return self.func(X_)
It accepts a function as argument and the transformed data is simply its return
value. The copy keyword argument is there to prevent a double copying: if the
function itself returns a new
DataFrame, then there’s no need to copy it.
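A short usage sketch may make the copy argument more concrete. It redefines the transformer in a slightly different form: BaseEstimator is added as a base class and fit sets a trivial flag, both of which are assumptions on my part to keep the sketch compatible with recent scikit-learn releases. The add_ratio helper and the toy DataFrame are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler


class DFTransform(TransformerMixin, BaseEstimator):
    """Apply func to a DataFrame; there is nothing to learn in fit."""

    def __init__(self, func, copy=False):
        self.func = func
        self.copy = copy

    def fit(self, X, y=None):
        # The flag only marks the step as fitted for scikit-learn's checks.
        self.fitted_ = True
        return self

    def transform(self, X):
        X_ = X if not self.copy else X.copy()
        return self.func(X_)


def add_ratio(df):
    # Hypothetical helper that mutates its argument in place.
    df["ratio"] = df["a"] / df["b"]
    return df


df = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [2.0, 4.0, 6.0]})

# A stock scikit-learn transformer returns a NumPy array by default.
scaled = StandardScaler().fit_transform(df)

pipe = Pipeline([
    # add_ratio mutates its input, so ask for a copy first.
    ("ratio", DFTransform(add_ratio, copy=True)),
    # fillna already returns a new DataFrame, so no copy is needed.
    ("fill", DFTransform(lambda X: X.fillna(0))),
])
out = pipe.fit_transform(df)
```

With copy=True on the first step, the original df is left without the ratio column, while the pipeline output is still a DataFrame.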
The only problem arises when using
FeatureUnion: it does not concatenate the
results into a
DataFrame. I wrote a custom class for this case as well:
    import pandas as pd
    from sklearn.pipeline import Pipeline, FeatureUnion, _transform_one
    # Note: sklearn.externals.joblib was removed in later scikit-learn
    # releases; there, import Parallel and delayed from joblib directly.
    from sklearn.externals.joblib import Parallel, delayed


    class DFFeatureUnion(FeatureUnion):
        def fit_transform(self, X, y=None, **fit_params):
            # non-optimized default implementation; override when a better
            # method is possible
            if y is None:
                # fit method of arity 1 (unsupervised transformation)
                return self.fit(X, **fit_params).transform(X)
            else:
                # fit method of arity 2 (supervised transformation)
                return self.fit(X, y, **fit_params).transform(X)

        def transform(self, X):
            # _transform_one is a private helper; this call matches its
            # signature as of scikit-learn v0.19.0.
            Xs = Parallel(n_jobs=self.n_jobs)(
                delayed(_transform_one)(trans, weight, X)
                for _, trans, weight in self._iter())
            # Concatenate the per-branch DataFrames column-wise.
            return pd.concat(Xs, axis=1, join='inner')
This is an example showing how they can be used:
    pipeline = Pipeline([
        ('ordinal_to_nums', DFTransform(_ordinal_to_nums, copy=True)),
        ('union', DFFeatureUnion([
            ('categorical', Pipeline([
                ('select', DFTransform(lambda X: X.select_dtypes(include=['object']))),
                ('fill_na', DFTransform(lambda X: X.fillna('NA'))),
                ('one_hot', DFTransform(_one_hot_encode)),
            ])),
            ('numerical', Pipeline([
                ('select', DFTransform(lambda X: X.select_dtypes(exclude=['object']))),
                ('fill_median', DFTransform(lambda X: X.fillna(X.median()))),
                ('add_features', DFTransform(_add_features, copy=True)),
                ('remove_skew', DFTransform(_remove_skew, copy=True)),
                ('find_outliers', DFTransform(_find_outliers, copy=True)),
                ('normalize', DFTransform(lambda X: X.div(X.max())))
            ])),
        ])),
    ])
The above pipeline splits the
DataFrame into categorical and numerical
columns, applying different transformations to each. The columns are
concatenated into a
DataFrame at the end of the pipeline.
The resulting code is well organized and very easy to understand. It’s also extremely easy to add or remove steps to/from the pipeline.
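The helper functions used in that pipeline (_ordinal_to_nums, _one_hot_encode, _add_features and so on) are not shown here, so the following is only a cut-down sketch of the same shape that runs on its own. It makes several assumptions: pd.get_dummies stands in for the one-hot step, the undefined steps are dropped, the toy DataFrame is made up, and SimpleDFUnion is a hypothetical sequential stand-in for DFFeatureUnion that avoids the private _transform_one helper.

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline


class DFTransform(TransformerMixin, BaseEstimator):
    """Apply func to a DataFrame; there is nothing to learn in fit."""

    def __init__(self, func, copy=False):
        self.func = func
        self.copy = copy

    def fit(self, X, y=None):
        # The flag only marks the step as fitted for scikit-learn's checks.
        self.fitted_ = True
        return self

    def transform(self, X):
        X_ = X if not self.copy else X.copy()
        return self.func(X_)


class SimpleDFUnion(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for DFFeatureUnion; branches run one by one."""

    def __init__(self, transformer_list):
        self.transformer_list = transformer_list

    def fit(self, X, y=None):
        # Each branch is fitted in place (no cloning) on the full input.
        self.fitted_transformers_ = [
            (name, transformer.fit(X, y))
            for name, transformer in self.transformer_list
        ]
        return self

    def transform(self, X):
        frames = [t.transform(X) for _, t in self.fitted_transformers_]
        # Join the per-branch DataFrames column-wise on their shared index.
        return pd.concat(frames, axis=1, join="inner")


# Toy data; the text column is forced to object dtype so that
# select_dtypes behaves the same across pandas versions.
df = pd.DataFrame({
    "colour": pd.Series(["red", None, "blue", "red"], dtype=object),
    "size": [1.0, 2.0, None, 4.0],
    "weight": [10.0, 20.0, 30.0, 40.0],
})

pipeline = Pipeline([
    ("union", SimpleDFUnion([
        ("categorical", Pipeline([
            ("select", DFTransform(lambda X: X.select_dtypes(include=["object"]))),
            ("fill_na", DFTransform(lambda X: X.fillna("NA"))),
            ("one_hot", DFTransform(lambda X: pd.get_dummies(X))),
        ])),
        ("numerical", Pipeline([
            ("select", DFTransform(lambda X: X.select_dtypes(exclude=["object"]))),
            ("fill_median", DFTransform(lambda X: X.fillna(X.median()))),
            ("normalize", DFTransform(lambda X: X.div(X.max()))),
        ])),
    ])),
])

result = pipeline.fit_transform(df)
```

Adding or removing a step is then a one-line change in the relevant branch, which is the main appeal of writing the preprocessing this way.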
UPDATE (Oct 28, 2017): As of scikit-learn v0.19.0, the function signature
of the undocumented function
_transform_one changed, and the code of
DFFeatureUnion was updated accordingly (thanks to Paulo Cheadi Haddad Filho
for pointing it out).