What is the Dummy Variable Trap in Pandas?

Vivek Muraleedharan
4 min read · Jun 13, 2021

A dummy variable is a numerical variable that represents one category of a categorical variable and can take only the values 0 and 1. Most machine learning algorithms expect numerical input, so we usually convert categorical variables to a numerical representation before feeding them to the algorithm. In this article I'm going to talk about some problems that can be caused by creating dummy variables with pandas.get_dummies() or a one-hot encoding function, and some alternatives to overcome them.

pandas.get_dummies function/one-hot encoding

The image above shows how the get_dummies function works. The raw data had 3 categories in the color column and 2 categories in the class column. After applying get_dummies (see the pandas documentation for details of the function) we get 5 columns, one per category, filled with 0s and 1s. The shape of the whole dataset has changed, which is something to keep in mind: when a categorical variable has many categories, we should look at other options such as binning or bucketing in addition to dummy variables, to avoid sparsity in the data.
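As a rough sketch of what the image describes, the snippet below builds a small made-up DataFrame with a 3-category color column and a 2-category class column (the values are assumptions for illustration, not the article's original data) and applies pandas.get_dummies. The dtype=int argument is only there so the output shows 0s and 1s regardless of the pandas version.

```python
import pandas as pd

# Hypothetical data: 3 colors and 2 classes, chosen only for illustration
df = pd.DataFrame({
    "color": ["red", "green", "blue", "red"],
    "class": ["A", "B", "A", "B"],
})

# One new 0/1 column is created per category in each encoded column
dummies = pd.get_dummies(df, columns=["color", "class"], dtype=int)

print(dummies.columns.tolist())
print(df.shape, "->", dummies.shape)  # (4, 2) -> (4, 5)
```

Two original columns become five, one for each distinct category.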

The problem with dummy variables, or the dummy variable trap

We have already seen how creating dummy variables adds columns and converts the categories into 0s and 1s. Now we will look at the problems this method can cause.

1. Multicollinearity between variables

In regression we are trying to understand the relationship between each variable and the target variable. A coefficient tells us how much the mean of the target changes when its variable changes by one unit while the other variables are held constant. When multicollinearity occurs, one variable cannot change while the others stay constant, so the regression cannot capture the real relationship and its coefficients become hard to interpret.

In the data above, the gender_female and gender_male columns were created as dummy variables. If we look closely, we can see that the two columns are completely determined by each other: if the female column is 0 then the male column is 1, and vice versa. This creates multicollinearity in the dataset, which affects the regression.
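One way to see the dependence between the two gender columns is to check that they always sum to 1, and that a design matrix with an intercept column loses a rank as a result. This is a minimal sketch on made-up data; the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical gender column, for illustration only
df = pd.DataFrame({"gender": ["female", "male", "male", "female", "male"]})
dummies = pd.get_dummies(df, columns=["gender"], dtype=int)

# Every row has exactly one of the two dummies set, so they always sum to 1
row_sums = dummies["gender_female"] + dummies["gender_male"]
print(row_sums.tolist())  # [1, 1, 1, 1, 1]

# Design matrix with an intercept column of ones plus both dummy columns
X = np.column_stack([np.ones(len(dummies)), dummies.to_numpy()])

# The intercept equals the sum of the two dummies, so only 2 of the
# 3 columns are linearly independent
rank = np.linalg.matrix_rank(X)
print(X.shape[1], rank)  # 3 2
```

Because the rank is lower than the number of columns, a linear regression on this matrix has no unique set of coefficients.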

2. Train and test data shape mismatch

Imagine the training data has 5 categories in one column and we create dummy variables for all of them, which adds 5 columns to the training dataset. We have to apply the same preprocessing to the test (or new) dataset before feeding it to the algorithm. Now imagine the test dataset contains only 4 of those categories: converting them to dummy variables produces only 4 columns, so the train and test data have different shapes and we cannot proceed.
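The mismatch described above can be reproduced in a few lines. The city column and its values are hypothetical. The final reindex step is one common workaround and is my own addition, not something the article proposes: it forces the test frame to have the training columns and fills the missing ones with 0.

```python
import pandas as pd

# Hypothetical data: the training set has 5 categories, the test set only 4
train = pd.DataFrame({"city": ["A", "B", "C", "D", "E"]})
test = pd.DataFrame({"city": ["A", "B", "C", "D"]})

train_d = pd.get_dummies(train, columns=["city"], dtype=int)
test_d = pd.get_dummies(test, columns=["city"], dtype=int)

print(train_d.shape[1], test_d.shape[1])  # 5 4

# Columns the model was trained on but the test data never produced
missing = sorted(set(train_d.columns) - set(test_d.columns))
print(missing)  # ['city_E']

# Assumed workaround (not from the article): align the test columns to the
# training columns, filling categories unseen in the test data with 0
test_aligned = test_d.reindex(columns=train_d.columns, fill_value=0)
print(test_aligned.shape)  # (4, 5)
```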

Some solutions to the dummy variable trap are…

Drop the first column

When using get_dummies or a one-hot encoding function, drop the first column (in the gender example above, we can drop the female column). This prevents the multicollinearity in the final data.

See the sklearn documentation for more examples of one-hot encoding.
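In pandas, dropping the first column can be done with the drop_first parameter of get_dummies. The sketch below uses a made-up gender column; categories are ordered alphabetically, so the female column is the one dropped.

```python
import pandas as pd

# Hypothetical gender column, for illustration only
df = pd.DataFrame({"gender": ["female", "male", "male", "female"]})

# drop_first=True removes the dummy for the first category ("female"),
# leaving a single column that still carries all the information
encoded = pd.get_dummies(df, columns=["gender"], drop_first=True, dtype=int)

print(encoded.columns.tolist())          # ['gender_male']
print(encoded["gender_male"].tolist())   # [0, 1, 1, 0]
```

A value of 0 in gender_male now means female, so no information is lost by dropping the other column.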

Use label encoder

A label encoding function can help avoid both the data shape mismatch issue and the dummy variable trap.

Instead of creating new columns of 0s and 1s in the dataset, a label encoder converts the categories to integers starting from 0, as shown in the image above. See the label encoder documentation for more info.
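As an illustration, here is a small sketch using LabelEncoder from sklearn.preprocessing, which I am assuming is the label encoder meant here. The color values are made up. The encoder is fitted on the training values and then reused on the test values, so both end up as a single column with the same mapping.

```python
from sklearn.preprocessing import LabelEncoder

# Hypothetical training and test values for one categorical column
train_colors = ["red", "green", "blue", "green", "red"]
test_colors = ["blue", "red"]

encoder = LabelEncoder()

# Learn the category -> integer mapping from the training data
train_codes = encoder.fit_transform(train_colors)

# Reuse the same mapping on the test data; still one column, no new shape
test_codes = encoder.transform(test_colors)

print(list(encoder.classes_))  # ['blue', 'green', 'red']
print(train_codes.tolist())    # [2, 1, 0, 1, 2]
print(test_codes.tolist())     # [0, 2]
```

Keep in mind that the integer codes imply an order (blue < green < red here) that the categories may not really have, which can matter for models such as linear regression.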
