Pandas — EDA: Smart Way to replace NaN

In this article, I have shown one small trick to replace Null Values / Missing values from the dataset. This is part of the EDA.

Rutvij Bhutaiya
Analytics Vidhya

--

Many analysts use to either drop the NaN or replace all the NaN with variable mean or another statistical measurement. However, it’s not always the right method.

In this following example, I have chosen the dataset, where a few of the columns have NaN values, and I have shown how to replace NaN with mean. Wait for a second!! It’s not the whole mean of the column.

Photo by Towfiqu barbhuiya on Unsplash

Here, I have selected only the BasePay variable to replace the NaN.

Screen Shot 1

Now, what’s the total number of missing values in a particular feature is shown as follows. There are 609 NaN in the BasePay variable.

Screen Shot 2
Screen Shot 2

Now, one option could be to replace NaN with the mean of the variable, which is 66325.44. However, here we are showing the method to replace the NaN with mean, but with respect to a particular JobTitle.

Screen Shot 3

There is a total of 2154 unique JobTitle ranging from General Manager to Cashier. Now, for example, If we replace a missing value for the General Manager’s BasePay with mean [feature mean] or Cashier’s BasePay with mean [feature mean]. It would be injustice and false.

Screen Shot 4

Hence, in this method, we have shown the technique to replace missing values with mean, but with respect to a particular JobTitle. Meaning, General Manager’s mean will only apply to missing values where the JobTitle of the BasePay is General Manager.

And means for a particular JobTitle are as follows,

Screen Shot 5

Here, in the following method, we have used groupby() and transform() functions to replace NaN.

sal[‘BasePay’] = sal[‘BasePay’].fillna(sal.groupby(“JobTitle”)[‘BasePay’].transform(‘mean’))

And now, after the replacement of NaN for BasePay, you can see the clear difference in the mean value, which is changes from 66325.44 to 66287.40

Screen Shot 6

However, still there are 8 missing values! How?

These missing values are from the data, where the data is not provided in JobTitel (refer to the first screenshot)

Screen Shot 6

Now, in this case, we are not replacing the missing values, but we are removing the entire rows.

Screen Shot 7

--

--