Data Science Plumbing: Peeking Into Scikit-Learn Pipelines

Written by Scott Adams and Aamodini Gupta, originally published on Towards Data Science.

Introduction

Scikit-learn pipelines are useful tools that provide extra efficiency and simplicity to data science projects (if you are unfamiliar with scikit-learn pipelines see Vickery, 2019 for a great overview). Pipelines can combine and structure multiple steps, from data transformation to modeling, all within a single object. Despite their overall usefulness, there can be a learning curve to using them. In particular, peeking into the individual pipeline steps and extracting important pieces of information from said steps is not always the most intuitive process. Accordingly, we wrote this article as a brief guide to creating a simple pipeline and obtaining several pieces of relevant information within it.

Obtaining the Data

We use the California Housing data, a commonly used practice dataset that provides block group level housing market information from California in 1990 (as seen in Géron, 2017). For this workflow, the objective is to use a scikit-learn pipeline to generate a linear regression model predicting the target of median_house_value using 3 features:

  • median_income — median income of households in block group (in tens of thousands of dollars),

  • popbin — population of the block group binned into three quantile-based groups (Low, Medium, High),

  • ocean_proximity — block group's proximity to the ocean (<1H Ocean, Inland, Near Ocean, Near Bay, Island).

Note that the code below introduces artificial missing data by randomly removing 10% of the dataset’s values in order to demonstrate the imputation implementation with scikit-learn pipelines. If you run this code be aware that the actual observations assigned to have missing data will likely be different in your data due to randomness in the missing data assignment process. We also split the data into training and testing data to better mimic an actual machine learning workflow. For more information on appropriately applying pipelines to training and testing data see the scikit-learn Getting Started page.
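The data-preparation code was embedded as a gist in the original post; a minimal sketch of that step is shown below. The file path, the quantile-based construction of popbin, and the exact way the 10% of missing values is introduced are our assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the California Housing data (placeholder path; the CSV from Géron, 2017
# includes the ocean_proximity column)
housing = pd.read_csv("housing.csv")

# Bin population into three quantile-based groups (assumed construction of popbin)
housing["popbin"] = pd.qcut(
    housing["population"], q=3, labels=["Low", "Medium", "High"]
).astype(str)

features = ["median_income", "popbin", "ocean_proximity"]
X = housing[features].copy()
y = housing["median_house_value"]

# Randomly blank out roughly 10% of each feature column to demonstrate imputation;
# no seed is set for this step, so the affected observations will differ from run to run
rng = np.random.default_rng()
for col in features:
    X.loc[rng.random(len(X)) < 0.10, col] = np.nan

# Hold out a test set to mimic an actual machine learning workflow
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
```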

Setting up the Pipeline

With the dataset ready, the next step is to apply appropriate data transformations so the dataset will be ready to use in a machine learning model. For the example used in this article, we implement data transformation processes like imputation, encoding, and scaling for multiple data types using the ColumnTransformer estimator.

To conduct transformations on the dataset appropriately, the ColumnTransformer estimator’s transformation steps need to be defined separately for different data types. In the examples below, different data transformation processes are established for numeric, ordinal, and categorical measures in this dataset (more information on numeric, ordinal, and categorical measurement schemes can be found here), and then combined into pipelines. Once established, the individual numeric, ordinal, and categorical data transformation pipelines serve as the steps for ColumnTransformer in the full pipeline, where the full pipeline refers to the pipeline that combines data transformation and modeling. Let’s now take a closer look at the data transformation steps.

Numeric Data

If the decision is made to impute missing values in the numeric data (in this example, median_income), the missing data can be imputed using SimpleImputer. In our example, we use the median as the imputation value for the numeric data. After the missing values have been dealt with, it is time to scale the numeric data. Here, StandardScaler rescales median_income to have a mean of 0 and unit variance. Note that StandardScaler is used in this example as an illustration; in a real-world setting you may want to try other transformation techniques, such as MinMaxScaler, depending on the distribution of your features.
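A minimal sketch of the numeric transformation pipeline, using the step names imputer and normalize that are referenced later in the article (the variable name numeric_transformer is our own):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Impute missing median_income values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("normalize", StandardScaler()),
])
```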

The following examples for establishing data transformation steps for ordinal and categorical columns follow the same overall structure as the numeric transformation pipeline, namely, the imputation of missing values followed by an appropriate transformation. Be aware that in all these steps we use SimpleImputer for illustrative purposes, but depending on your data you may find another technique for handling missing data more appropriate.

Ordinal Data

SimpleImputer can fill missing data with the most frequently occurring value from the respective column, which is useful for an ordinal column such as popbin. After imputing missing values, we use OrdinalEncoder to ordinal-encode the popbin column so the transformed column takes on integer values from 0 to k-1, where k is the number of unique values in the original popbin column.
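A sketch of the ordinal transformation pipeline (the step label encoder and the variable name ordinal_transformer are our assumptions; the imputer label mirrors the numeric pipeline):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

# Fill missing popbin values with the most frequent category, then integer-encode
ordinal_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OrdinalEncoder()),
])
```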

Categorical Data

One-hot encoding creates a new column for each unique value in a categorical column. A given observation is then coded with a value of 1 on the column corresponding to the observation’s value in the original categorical column and a value of 0 on all remaining columns generated from the original categorical column.

We start the categorical transformation process below with SimpleImputer to fill all missing values on ocean_proximity with a new string value of 'missing'. After imputing missing values, OneHotEncoder then one-hot encodes the ocean_proximity column, which also creates a separate column for the newly imputed 'missing' value.
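A sketch of the categorical transformation pipeline (the step label onehot and the variable name categorical_transformer are our assumptions):

```python
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Replace missing ocean_proximity values with the string 'missing', then one-hot encode
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder()),
])
```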

Fitting The Pipeline

Once the steps outlined above are defined for each data type, these three individual data transformation pipelines are aggregated into transformer_steps, which will then be declared as the transformers parameter in the ColumnTransformer.
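A sketch of this aggregation step, using the transformer labels num and ord that are referenced later in the article (the label cat for the categorical transformer is our assumption):

```python
from sklearn.compose import ColumnTransformer

# Each entry maps a label and a transformation pipeline to the column(s) it applies to:
# (label, transformer, columns)
transformer_steps = [
    ("num", numeric_transformer, ["median_income"]),
    ("ord", ordinal_transformer, ["popbin"]),
    ("cat", categorical_transformer, ["ocean_proximity"]),
]

transformation = ColumnTransformer(transformers=transformer_steps)
```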

We can now create the full pipeline, composed of the ColumnTransformer step, which is labeled as 'transformation', and a linear regression model step labeled as 'linreg'. The full pipeline then needs to be fit on the training set, and we can assign this fitted pipeline to a variable named lm (for “linear model”).
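A minimal sketch of the full pipeline and its fitting step:

```python
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Combine the ColumnTransformer with a linear regression model
full_pipeline = Pipeline(steps=[
    ("transformation", transformation),
    ("linreg", LinearRegression()),
])

# Fit on the training data; fit returns the fitted pipeline itself
lm = full_pipeline.fit(X_train, y_train)
```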

This pipeline is packed with information! Let’s look at some of the most important and useful information that can be extracted.

Breaking Into the Pipeline

See All The Pipeline’s Steps

We just built a nifty pipeline, but is there a way to actually see all the steps of the pipeline without referring to the code that established it? Fortunately, the answer is yes! All we need to do is apply named_steps to the name of the fitted pipeline (applying steps to the pipeline will also provide the same information, just formatted in a slightly different way).
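For example, with the fitted pipeline stored in lm:

```python
# Dictionary-like view of every step in the fitted pipeline
lm.named_steps

# The steps attribute returns the same information as a list of (name, estimator) tuples
lm.steps
```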

Unfortunately, that pipeline output is not exactly easy to read. As an alternative, we can print the pipeline to an HTML file and view it in a web browser with the following code.
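One way to do this (assuming scikit-learn 0.23 or later, where estimator_html_repr is available; the output file name is a placeholder):

```python
from sklearn.utils import estimator_html_repr

# Write an HTML representation of the fitted pipeline to a file
with open("pipeline.html", "w", encoding="utf-8") as f:
    f.write(estimator_html_repr(lm))
```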

And here is what the rendered HTML file looks like.

HTML rendered image of the Pipeline

Let’s now go deeper to see the actual labels and values used within various pipeline steps.

Obtain Imputation Values For Numeric and Ordinal Data

One reason to dig into a pipeline is to extract the imputed values for numeric and ordinal data. Let’s start with the numeric column, in which the median is used to replace missing data. The Pipeline attribute named_steps outputs a dictionary-like object of key-value pairs that help us parse the steps within a pipeline, while the ColumnTransformer attribute named_transformers_ outputs a set of key-value pairs that help us parse the steps within ColumnTransformer. For this example, the name of the step used to identify the numerical column transformations is called num. Thus, calling imputer from the num step outputs information on the imputation used for our numeric column.
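For example:

```python
# Access the fitted SimpleImputer inside the numeric transformer
lm.named_steps["transformation"].named_transformers_["num"].named_steps["imputer"]
```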

To see the actual value used to impute missing numeric data we then can call the statistics_ attribute from imputer.
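The value itself:

```python
# Median used to fill missing median_income values
lm.named_steps["transformation"].named_transformers_["num"].named_steps["imputer"].statistics_
```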

The code above can also be used to obtain the value used to impute missing data on the ordinal column, popbin, by replacing num with ord in named_transformers_['num'].

Be aware that your output from the imputation steps and some of the following steps will likely differ slightly due to the randomness in assigning missing values to the data.

Obtain Mean and Variance Used For Standardization

As the code below shows, extracting the mean and variance values used in the standardization step for the numeric column is very similar to the previous step of extracting the median from the numeric transformer’s imputer step. We just need to replace imputer with normalize and statistics_ with mean_ or var_.
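For example (the step names match the numeric pipeline sketched earlier):

```python
scaler = lm.named_steps["transformation"].named_transformers_["num"].named_steps["normalize"]

# Mean and variance used by StandardScaler for median_income
scaler.mean_
scaler.var_
```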

Extract Feature Names for Numeric and Ordinal Columns

ColumnTransformer has a transformers_ attribute, which returns a tuple for each fitted transformer composed of (1) the transformer’s label, (2) the transformer itself, and (3) the column(s) on which the transformer was applied.
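For example:

```python
# Each tuple: (label, fitted transformer, columns it was applied to)
lm.named_steps["transformation"].transformers_
```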

The goal of feature name extraction for the numeric and the ordinal columns is to slice that output to reach the numeric and ordinal column names.

  • Numeric Column: Because we specified num as the first element of transformer_steps, the num transformer pipeline is the first tuple returned in the transformers_ output. The actual column name of interest is the third item in the tuple for a given transformer so we need to output the third item ([2]) of the first transformers_ tuple ([0]), as shown below (remember that Python is zero-indexed).

  • Ordinal Column: The ordinal transformer is the second tuple output by transformers_ and the name of the ordinal column used here is the third item in this tuple.
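A sketch of both extractions:

```python
# Numeric column name: third item ([2]) of the first tuple ([0])
lm.named_steps["transformation"].transformers_[0][2]

# Ordinal column name: third item ([2]) of the second tuple ([1])
lm.named_steps["transformation"].transformers_[1][2]
```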

Extract Feature Names for One-Hot Encoded Columns

This one is a bit different because one-hot encoding replaces the original categorical column with a set of new columns for every category in the specified variables. Fortunately, OneHotEncoder has a method get_feature_names that can be used to get descriptive column names. So the goal here is to first access OneHotEncoder within the data transformation pipeline and then call get_feature_names.

To make the column name prefixes more descriptive, we can pass the name of the original categorical column as an argument to get_feature_names.
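A sketch of both calls, using the cat and onehot labels assumed in our ColumnTransformer sketch above (note that in scikit-learn 1.0 and later this method has been replaced by get_feature_names_out):

```python
onehot = lm.named_steps["transformation"].named_transformers_["cat"].named_steps["onehot"]

# Generic names such as 'x0_<1H OCEAN', 'x0_INLAND', ...
onehot.get_feature_names()

# Passing the original column name yields prefixes like 'ocean_proximity_<1H OCEAN'
onehot.get_feature_names(["ocean_proximity"])
```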

See Coefficients from a Regression Model

LinearRegression has an attribute coef_, which stores the regression coefficients (see here for an overview of linear regression and how to interpret regression coefficients). Thus, to see the coefficients we need to access the linreg step from the full pipeline and call the coef_ attribute.
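For example:

```python
# Regression coefficients for the transformed feature columns, in column order
lm.named_steps["linreg"].coef_
```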

Again, be aware that your output will likely differ slightly due to the randomness in assigning missing values to the data.

Conclusion

Scikit-learn pipelines are great tools for the organized and efficient execution of data transformation and modeling tasks. As our own experience has shown, there may be times where it is helpful to be able to extract specific pieces of information from various steps in a pipeline, which motivated us to write this article. For example, we may want to create a customized table/visualization of feature importance results, necessitating the extraction of column names (including the individual one-hot encoded column names) and model parameters. Or, maybe we are drafting a report on a model and need to include the means or medians used to replace missing data in the imputation step. Alternatively, maybe we are just curious and want to see the results at a specific step in the pipeline. In any of these scenarios, knowing how to pull relevant information from specific steps in the pipeline will prove useful.

Thank you for taking the time to read this article. You can run the full code in a Jupyter notebook on Binder. If you found the content in this article useful please leave some claps for it here on Medium. Also, feel free to reach out with any constructive comments.

References

Géron, A. (2017). Hands-on Machine Learning with Scikit-Learn and TensorFlow: Concepts, Tools, and Techniques to Build Intelligent Systems. O’Reilly Media.

Vickery, R. (2019, February 5). A Simple Guide to Scikit-learn Pipelines. Vickdata. https://medium.com/vickdata/a-simple-guide-to-scikit-learn-pipelines-4ac0d974bdcf
