Reason for data normalization in ML Models

Standardization/Normalization is a common requirement for majority of algorithms (except like ID3 impl of Trees) which transforms asymmetric training data into symmetric. ML Algorithms behave badly if the training data is not brought on to the same scale because of the noise/outliers or the non-guassian properties of features.

Types of normalization

  • Z-transform: This is also called as Standardization.
    • This rescales the features so that they will have the properties of a standard normal distribution with mean=0 and standard_deviation=1.
    • It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.
    • z-transform

  • Normalization: This is also called Min-Max Scaling(based on min max values of the variable).
    • In this data is scaled to a range [0,1].
    • The advantage of this bounded range between 0 and 1 is that it ends up with smaller standard deviations and suppresses the affect of the outliers.
    • We use this method in K-Nearest Neighbors and preparation of coefficients in regression
    • min_max_norm

Apart from normalization/standardization techniques, other pre-processing methods to transform data from non-linear to linear can be logarithmic and square root scaling. These are usually performed when the data is characterized by “bursts”, i.e. the data is well grouped in low values, but some portion of it has relatively larger values.

Feature normalization is to make different features to the same. Illustration

  AcctID Fico Revolving_Balance Num of Cards
Data point #1 10001 755 20000 5
Data point #2 10002 820 5000 2

Features are Fico_score, Revolving_balance and Num_of_cards. Out of these features, one feature ‘Revolving_Balance’ is in 1000s scale, ‘Fico’ in 100s scale and ‘Num of Cards’ in 10s scale.

Now if we calculate distances between data points, since one of the dimensions have very large values, it overrides other dimensions (you can see above example, the distance contributed by number of cards would completely be nullified by the distance contributed by Revolving_Balance, if the data is not normalized)

The only models that does not care about rescaling of data is when we build the decision trees (like ID3, see the implementation here)

When to use which method? It is hard to say, we have to choose based on some experimentation.


Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s


Mostly technology with occasional sprinkling of other random thoughts


Amir Amintabar's personal page

101 Books

Reading my way through Time Magazine's 100 Greatest Novels since 1923 (plus Ulysses)

Seek, Plunnge and more...

My words, my world...

ARRM Foundation

Do not wait for leaders; do it alone, person to person - Mother Teresa

Executive Management

An unexamined life is not worth living – Socrates

Diabolical or Smart

Nitwit, Blubber, Oddment, Tweak !!


A topnotch site


Just another site

coding algorithms

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey