
Reason for data normalization in ML Models

November 14, 2015 by Niranjan Tallapalli

Standardization/normalization is a common requirement for the majority of algorithms (decision trees such as ID3 are an exception). It brings features that are measured on different scales onto a common scale. Many ML algorithms behave badly if the training data is not brought onto the same scale, because features with large ranges, noise/outliers or non-Gaussian distributions can dominate the result.

Types of normalization

Z-transform: This is also called Standardization.
- This rescales each feature so that it has mean=0 and standard_deviation=1, using z = (x - mean) / standard_deviation. The shape of the distribution does not change; only a feature that was already Gaussian ends up with a standard normal distribution.
- It is useful to standardize attributes for a model that relies on the distribution of attributes, such as Gaussian processes.
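The Z-transform can be sketched in a few lines of plain Python. This is only an illustration: the function name `standardize` and the sample Fico values are made up, and the population standard deviation is assumed (as far as I know, scikit-learn's StandardScaler from the references makes the same choice).

```python
from statistics import mean, pstdev

def standardize(values):
    """Return the z-score of every value: (x - mean) / standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)  # population standard deviation (an assumption)
    if sigma == 0:
        raise ValueError("cannot standardize a constant feature")
    return [(x - mu) / sigma for x in values]

fico = [755, 820, 690, 710]  # made-up sample values
z_scores = standardize(fico)
```

After the call, `z_scores` has mean 0 and standard deviation 1 (up to floating point error), while the ordering of the original values is preserved.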

Normalization: This is also called Min-Max Scaling (based on the min and max values of the variable).
- The data is scaled to the range [0,1], using x' = (x - min) / (max - min).
- Because of this bounded range between 0 and 1, the scaled features end up with smaller standard deviations. The min and max are themselves set by the most extreme values, though, so a single outlier can squeeze the remaining values into a narrow band.
- This method is commonly used with K-Nearest Neighbors and in the preparation of coefficients in regression.
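A minimal sketch of Min-Max Scaling follows. The function name `min_max_scale` and the balance values are hypothetical. The second call adds one made-up extreme value to show what happens to the other points when the max is an outlier.

```python
def min_max_scale(values):
    """Map values linearly onto [0, 1]: (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("cannot scale a constant feature")
    return [(x - lo) / (hi - lo) for x in values]

balances = [5000, 20000, 12000, 8000]  # made-up sample values
scaled = min_max_scale(balances)

# Same data plus one extreme value: the original four points are
# squeezed into a small interval near 0.
with_outlier = min_max_scale(balances + [500000])
```

In the first call the smallest balance maps to 0 and the largest to 1. In the second call the outlier takes the value 1 and the four original balances all land below 0.05.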

Apart from the normalization/standardization techniques, other pre-processing methods that transform the data non-linearly are logarithmic and square root scaling. These are usually performed when the data is characterized by “bursts”, i.e. most of the data is grouped at low values, but some portion of it has much larger values.
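As a rough illustration of logarithmic and square root scaling on "bursty" data, here is a small sketch. The helper names and the sample values are made up, and log(1 + x) is used instead of log(x) as an assumption so that zeros would not break the transform.

```python
import math

def log_scale(values):
    """Apply log(1 + x); assumes all values are non-negative."""
    return [math.log1p(x) for x in values]

def sqrt_scale(values):
    """Apply the square root; assumes all values are non-negative."""
    return [math.sqrt(x) for x in values]

bursty = [1, 2, 3, 2, 4, 1000]  # mostly small values, one burst (made up)
logged = log_scale(bursty)
rooted = sqrt_scale(bursty)
```

Both transforms pull the burst much closer to the rest of the data: the largest raw value is 1000 times the smallest, while after the log transform the largest value is below 7 and after the square root it is below 32.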

Feature normalization brings different features onto the same scale. Illustration:

		Features
	AcctID	Fico	Revolving_Balance	Num of Cards
Data point #1	10001	755	20000	5
Data point #2	10002	820	5000	2

The features are Fico, Revolving_Balance and Num of Cards (AcctID is only an identifier). Out of these features, ‘Revolving_Balance’ is in the thousands, ‘Fico’ is in the hundreds and ‘Num of Cards’ is in single digits.

Now if we calculate distances between data points, the dimension with very large values overrides the other dimensions. In the example above, if the data is not normalized, the distance contributed by Num of Cards is completely swamped by the distance contributed by Revolving_Balance.
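The effect can be checked with the two data points from the table. The sketch below computes how much each feature contributes to the squared Euclidean distance, first on the raw values and then after Min-Max Scaling. The feature ranges in `ranges` are assumptions chosen for illustration; they do not come from the data.

```python
def squared_shares(a, b):
    """Fraction of the squared Euclidean distance contributed by each feature."""
    squares = [(x - y) ** 2 for x, y in zip(a, b)]
    total = sum(squares)
    return [s / total for s in squares]

# (Fico, Revolving_Balance, Num of Cards) for data points #1 and #2
p1 = (755, 20000, 5)
p2 = (820, 5000, 2)

# Hypothetical (min, max) per feature, assumed for this example only
ranges = [(300, 850), (0, 50000), (0, 10)]

def scale(point):
    """Min-max scale one point using the assumed feature ranges."""
    return [(x - lo) / (hi - lo) for x, (lo, hi) in zip(point, ranges)]

raw_shares = squared_shares(p1, p2)
scaled_shares = squared_shares(scale(p1), scale(p2))
```

On the raw values, Revolving_Balance accounts for more than 99.99% of the squared distance and Num of Cards for practically nothing. After scaling with the assumed ranges, Num of Cards contributes roughly as much as Revolving_Balance.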

Decision trees (like ID3) are the main exception: they do not care about rescaling of the data.
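One way to see why trees are unaffected: a split of the form "x <= t" only depends on the order of the values, and rescaling does not change that order. The sketch below assumes a tree that splits numeric features by thresholds (classic ID3 works on categorical attributes, where scale plays no role at all). The function names and sample values are hypothetical.

```python
def threshold_partitions(values):
    """Every left-hand set of row indices reachable by a split x <= t."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    return [frozenset(order[:k]) for k in range(1, len(order))]

def min_max_scale(values):
    """Map values linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    return [(x - lo) / (hi - lo) for x in values]

balances = [5000, 20000, 12000, 8000]  # made-up sample values
raw_splits = threshold_partitions(balances)
scaled_splits = threshold_partitions(min_max_scale(balances))
same_splits = raw_splits == scaled_splits
```

Here `same_splits` is True: the raw and the scaled feature offer the tree exactly the same ways of dividing the rows, so the chosen split separates the same data points either way.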

When to use which method? It is hard to say in general; we have to choose based on some experimentation.

References
https://msdn.microsoft.com/en-us/library/azure/dn905838.aspx
http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
http://machinelearningmastery.com/rescaling-data-for-machine-learning-in-python-with-scikit-learn/
http://sebastianraschka.com/Articles/2014_about_feature_scaling.html
