Skewness — Definition, Problem and Reducing Methods
What is skewness?
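Skewness measures how asymmetric a distribution is around its mean. One standard way to quantify it, given here as a reference definition rather than something the original text spells out, is the moment coefficient of skewness for a sample x_1, ..., x_n with mean \bar{x}:

```latex
g_1 = \frac{m_3}{m_2^{3/2}},
\qquad
m_k = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^k
```

A symmetric sample gives g_1 = 0; negative values indicate a longer left tail and positive values a longer right tail.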
It is easier to see in a picture:
How does it happen?
Outliers pull the distribution toward them. For instance, if a few values are extremely low compared with the rest, the distribution is skewed to the left (a long left tail); if a few values are extremely high, it is skewed to the right.
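The effect of outliers on the direction of skew can be checked numerically. The sketch below is only an illustration: the sample values are made up, and the `skewness` helper is a hand-written version of the moment coefficient (libraries such as SciPy offer `scipy.stats.skew` for the same purpose).

```python
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical sample that is symmetric around 50 (made-up values).
base = [48, 49, 50, 50, 51, 52, 50, 49, 51, 50]

# The same sample with a few extreme values added on one side.
low_outliers = base + [5, 8]     # extremely low values
high_outliers = base + [95, 92]  # extremely high values

print(f"no outliers:   {skewness(base):+.2f}")
print(f"low outliers:  {skewness(low_outliers):+.2f}")   # negative: skewed left
print(f"high outliers: {skewness(high_outliers):+.2f}")  # positive: skewed right
```

The sign of the coefficient follows the side on which the extreme values sit, which is the sense in which the distribution is "skewed towards" its outliers.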
Why is skewness a problem?
Many common statistical methods assume at least approximately normal data, for example hypothesis tests such as the z-test and ANOVA; approximations that rely on the central limit theorem also need larger samples when the data are heavily skewed. Skewed data not only limits the tools we can use, but can also hurt the performance of our models, especially regression-based models.
For example, suppose most students are between 160 and 175 cm tall and a small minority are over 200 cm. The data are skewed to the right. If we fit a linear regression model to this data, this is what happens:
As we can see, with a single outlier (height = 250 cm), R-squared drops from 0.805 to 0.584. R-squared tells us the percentage of the variation in y (height) that is explained by x (weight). In this case, the regression line works better at predicting the majority of students.
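The drop in R-squared can be reproduced in spirit with a small experiment. The sketch below fits an ordinary least-squares line by hand on made-up weight and height values, then adds one outlier with height = 250 cm. The data and the helper names `fit_line` and `r_squared` are assumptions for illustration, so the numbers will not match the 0.805 and 0.584 quoted above.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def r_squared(xs, ys):
    """Share of the variation in ys explained by the fitted line."""
    slope, intercept = fit_line(xs, ys)
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


# Hypothetical students: weight in kg (x) and height in cm (y).
weights = [52, 55, 58, 60, 63, 65, 68, 70, 72, 75]
heights = [160, 163, 162, 166, 165, 169, 168, 172, 171, 175]

r2_without = r_squared(weights, heights)

# Add a single outlier: an average weight with a height of 250 cm.
r2_with = r_squared(weights + [66], heights + [250])

print(f"R-squared without outlier: {r2_without:.3f}")
print(f"R-squared with outlier:    {r2_with:.3f}")
```

One extreme height adds a large amount of variation in y that weight cannot explain, so the explained share falls even though the fit for the other students is unchanged in quality.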
What is an acceptable range of skewness of approximately normal distribution?
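The thresholds are not stated in the text here, so the sketch below uses one widely quoted rule of thumb as an assumption: an absolute skewness up to 0.5 is treated as approximately symmetric, up to 1 as moderately skewed, and anything larger as highly skewed. Other sources use looser cut-offs, so treat the `classify_skewness` helper and its limits as illustrative only.

```python
def classify_skewness(g1, mild=0.5, strong=1.0):
    """Label a skewness coefficient using rule-of-thumb cut-offs.

    The default cut-offs (0.5 and 1.0) are an assumed convention,
    not values taken from the article.
    """
    size = abs(g1)
    if size <= mild:
        return "approximately symmetric"
    if size <= strong:
        return "moderately skewed"
    return "highly skewed"


for value in (0.2, -0.8, 2.6):
    print(f"skewness {value:+.1f}: {classify_skewness(value)}")
```

The cut-offs are parameters so that a stricter or looser convention can be swapped in without changing the function.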
How to handle skewness?
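The specific methods are not listed in the text here. As a hedged illustration, two transformations commonly applied to right-skewed, positive data are the square root and the logarithm; both compress large values more than small ones. The sample below is made up, and the `skewness` helper is a hand-written moment coefficient. SciPy also provides `scipy.stats.boxcox` for a Box-Cox power transform, which is not shown here.

```python
import math
import statistics


def skewness(values):
    """Moment coefficient of skewness: m3 / m2**1.5 (population form)."""
    n = len(values)
    mean = statistics.fmean(values)
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed sample: mostly small values, a few large ones.
raw = [1, 2, 2, 3, 3, 3, 4, 4, 5, 6, 8, 12, 20, 45, 100]

# Square-root transform: a mild compression of large values.
sqrt_values = [math.sqrt(v) for v in raw]

# Log transform: a stronger compression; log1p(v) = log(1 + v) also handles zeros.
log_values = [math.log1p(v) for v in raw]

skew_raw = skewness(raw)
skew_sqrt = skewness(sqrt_values)
skew_log = skewness(log_values)

print(f"raw:  {skew_raw:.2f}")
print(f"sqrt: {skew_sqrt:.2f}")
print(f"log:  {skew_log:.2f}")
```

On this sample the logarithm reduces skewness more than the square root. Neither transform is defined for negative values, and any model fitted on transformed data produces predictions on the transformed scale that have to be converted back.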
References
Statistics for Data Science: What is Skewness and Why is it Important?
Skewed Data: A problem to your statistical model
Regression Analysis: How Do I Interpret R-squared and Assess the Goodness-of-Fit?
Normality Testing — Skewness and Kurtosis
study notes: Handling Skewed data for Machine Learning models