role of a software architect

As an architect we should always focus on these areas

  • Envision: Software Architecture
  • Model: Design Pattern
  • Blueprint: Construction and Design Principles
  • Inspect: Construction Patterns
  • Nomenclature: Architect Jargon to communicate intent.

Software Architecture

  • Web Server? Pipelined Architectures
  • Cloud? N-Tier Network Architecture
  • SOA? Component Architecture
  • Client/Server? Layered Architecture
  • Single Tier? Monolithic Architecture

Design Pattern

  • Creational, Structural and Behavioral Patterns
    • Creational DP:provide us a way to decouple the client from the objects that it needs to instantiate
      Singleton, Factory, Builder
    • Structural DP: help us to compose classes or objects into larger structures
      Decorator, Facade, Adapter, Proxy
    • Behavioral DP: dictates how classes and objects interact with each other and distribute the responsibilities
      Strategy, Template, Iterator
  • Plugins: Plugins enable you to execute code in response to certain events

Design Principles

  • How does Encapsulation affects code reuse? Because it hides complex code which can easily be replaced with another code.
  • How cohesion affects layers? High cohesion leads to better layered design
  • How coupling affects scalability? Low coupling leads to more scalability

Construction Principles

  • Favor composition(component design) over inheritance as it reduces number of dependencies between the modules which increases flexibility.
    Layered and Tier Design?
  • Whats your developement methodology? 3-5 week sprint based on type of project.
  • How to avoid architecture erosion? Do periodic code reviews, build unit/integration tests and do a nightly builds

Architect Jargon

  • CAP Theorem: Consistency Avaliability Partition-Tolerance
  • ACID Properties: Atomicity Consistency Isolation Durability
  • Difference between a Component and a Module? A component is fine grained selft contained entity that can interact with remaining parts of system (like Data Access Layer), whereas module is coarse grained deployable source code bundle which shares a common purpose (like JSON Parser, Log4j, etc), replacing it would not impact overall system architecture
  • Difference between Tier and Layer? Tier is a physical unit, where the code / process runs. E.g.: client, application server, database server; Layer is a logical unit, how to organize the code. E.g.: presentation (view), controller, models, repository, data access.
  • Cohesion and Coupling: Cohesion is the degree to which the elements of a certain module belong together. Coupling is the degree of interdependence between software modules


Important aggregations in spark

Three main aggregations

  • reduceByKey(): It has internal combiner, used when aggregation in the data is high. Its used only when INTERMEDIATE/COMBINER aggregation logic is same as that of FINAL/REDUCER AGGREGATION logic
  • aggregateByKey(): Its similar to reduceByKey(). It has internal custom combiner. This is used to initialize some default value
  • combineByKey(): Its similar to reduceByKey(). It also has internal custom combiner. This is used to initialize dynamic value (by reading the input record and have some logic in place to initialize)


  • aggregateByKey() and reduceByKey() are sub types of combineByKey()
  • In aggregateByKey() and combineByKey(), TYPE of INPUT value need not be same as that of the OUTPUT
  • If we want to use custom logic in combiner than we go for aggregateByKey() or combineByKey() and in reduceByKey(), the combiner logic will be same as that of reducer.

Other important aggregations:
– groupByKey(): Used when combiner is not required, and hence its used when there are not many aggregations on the dataset. It provides much more flexibility for complex operations than other aggregations.
– countByKey(): Unlike all the above methods which are transformations, this is an action

Sqoop cheat sheet

Here we will discuss all possible sqoop command line options to import and export data between HDFS and RDBMS, import/export delimiters, incremental load and sqoop job/merge operations.

For practice, I downloaded the cloudera VM from

Anytime during this exercise, if you need help on sqoop queries, use sqoop help option
$sqoop --help
$sqoop import --help

Import  into HDFS – Database level operations

— list databases
$ sqoop list-databases --connect "jdbc:mysql://quickstart.cloudera:3306" --username retail_dba --password cloudera

— import all tables from db to HDFS
sqoop import-all-tables -m 12 --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --as-textfile --warehouse-dir=/user/cloudera/sqoop_import/
Formats: supported are avro, text and binary
–as-textfile, –as-avrodatafile, –as-sequencefile
-m or –num-mappers: Used to define number of threads per table
BoundingValsQuery: Used to figure out number of buckets based on number of mappers.

— Import all tables from rdbms with compression and hive table creation
$sqoop import-all-tables \
> --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" \
> --username retail_dba \
> --password cloudera \
> --hive-import \
> --hive-overwrite \
> --create-hive-table \
> --hive-database sqoop_import \
> --compress \
> --compression-codec \
> --outdir java_files

compress and comression-codec: is used to compress ingested files
out-dir: is used to store some sqoop internal java files
–hive-import and create-hive-table: used to import into hive warehouse and create hive tables on ingeated tables
–hive-overwrite – overwrites the data in existing table, if not mentioned then it will append to the existing data in the table

Import into HDFS – Table level operations

— Import a single table from sqoop
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments
–table: mention table name
–target-dir: location where table data is copied
Note: If ‘-m’ option is not given then default number of mappers=4
Note: For every table import sqoop will use min and max of primary key (in boundingvalquery) and divide the records into number of buckets as specified
* Disadv: with above query is that if there are some outliers in the data then data will be unevently spread across mappers with some mappers taking heavy load and some less load

— overwrite boundary query to redefine the distribution
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments --boundary-query "select min(department_id), max(department_id) from departments where department_id <> 8000"

— import specific columns from a table
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments --boundary-query "select min(department_id), max(department_id) from departments where department_id <> 8000" --columns department_id,department_name

— import a table using specific query
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --target-dir /user/cloudera/departments --boundary-query "select min(department_id), max(department_id) from departments where department_id <> 8000" --columns department_id,department_name --query "select * from departments"
* –query and –table are mutually exclusive

— import a table without primary key
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments_nopk --target-dir /user/cloudera/departments
* This will error out as sqoop cannot split the records if there is no promary key. In this case we have to give either ‘-m 1’ or ‘–split-by ‘
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments_nopk --target-dir /user/cloudera/departments -m 1
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments_nopk --target-dir /user/cloudera/departments --split-by department_id

— import data by joining the source table
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --query "select * from orders join order_items on orders.order_id=order_items.order_item_order_id where \$CONDITIONS" --split-by order_id --target-dir /user/cloudera/order_join --where "orders.order_id <> 0"
* –table-name cannot be given with –query
* $CONDITIONS is required because sqoop qill append conditions from –where otherwise ‘true’ (if no condition given)
* –splity-by is given because there is no primary_key on the joined dataset

— import into HIVE Tables
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --hive-home /user/hive/warehouse --hive-import --hive-overwrite --hive-table sqoop_import.departments
* –hive-home is optional as it is the default value
* –hive-table should include db name followed by table name OR include –hive-database to have dbname separate
* There are two ways to import data into hive tables, one is to create the table and then import into the existing table via –hive-table(above query), and other option is to create table while importing itself via –create-hive-table
* Hive import will first download data into the temp dir (i.e, home dir of user /user/cloudera/) and then loads into the hive table, hence make sure the dir with the table name is deleted in your home directory

Incremental Load

* In Incremental Loads – Before importing we connect to log table or log file to check for the delta condition (using sqoop eval or IO API) and then do import and update the log table/file after import is successfull so that next incremental/delta load can look at it
* Incremental Load can be done in two ways – One is using –where argument and other option is to use out of the box incremental options –incremental, –check-column and –last-value

$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --append --target-dir /user/cloudera/sqoop_import/departments/ --where "department_id > 7"
* –append and –where works togeather in incremental loads. If –append not given then it will error out

$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --append --target-dir /user/cloudera/sqoop_import/departments/ --check-column department_id --incremental append --last-value 7
–append is req in this case as well
–check-column : columns against which delta is evaluated
–last-value: last values from where data has to be imported
–incremental: append/lastmodified
* –incremental: append – Used when there are only inserts into the the sql table (NO UPDATES)
* –incremental: lastmodified – Used when there are inserts and updates to the SQL table. For this to use we should have date column in the table and –last-value should be the timestamp

Export data to a MySQL database from HDFS using Sqoop 

— Export HDFS data into new SQL table
sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table order_items_export --export-dir /user/cloudera/sqoop_import/order_items
* –export-dir is option to specify external directory to load the data from hdfs into mysql table
* How number of threads/mappers work in export? In import based on number of mappers(‘-m 12’) sqoop will issue that many queries and imports data from mysql table into the cluster as RDBMS has that capability. But in export, it uses HDFS distributed data blocks to divide the blocks among the threads (‘–num-mappers 12’) and starts uploading the data

— Update/Merge HDFS data into existing SQL table
$ sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --export-dir /user/cloudera/sqoop_import/departments_export/ --batch --update-key department_id --update-mode allowinsert
* –update-key is the primary_key/unique_key against which the update will happen. There has to be a primary key on the table for the above query to work otherwise all records will be inserted (duplicate records). If there is composite key then give comma separated columns
* –update-mode : updateonly/allowinsert
updateonly – It updates the existing record/s and DOES NOT insert new record (DEFAULT MODE), all new records will be ignored. So without passing –update-mode argument, records can only be updated but new records cannot be inserted.
allowinsert – It can updates existing records and also inserts new records
* Without –update-key and –update-mode, it works only as insert mode.

Change the delimiter and file format of data during import using Sqoop

— Change import delimiters on plain HDFS dir
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments_enclosed --enclosed-by \" --fields-terminated-by \| --lines-terminated-by \\n --escaped-by \, --null-string \\N --null-non-string -1
* –enclosed-by: It encloses every field in the data with this character
* –escaped-by: Used to escape any special characters in the data (like , in csv can cause issue with total number of cols in a record)
* –fields-terminated-by: field separater
* –lines-terminated-by: line separater
* –null-string: Replace null in string columns
* –null-non-string: Replace null in non-string(int, double etc) columns
* Default values are Uses MySQL’s default delimiter set: fields: , lines: \n escaped-by: \ optionally-enclosed-by: ‘ [These can be used with explicit arg –mysql-delimiters or dont give any args with respect to delimiters and formats]

— Change import delimiters on hive tables
Sqoop import using –hive-import options will import the data using default hive delimiters as fields: CTRL+A and lines: \n
$sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --hive-home /user/hive/warehouse --hive-import --hive-overwrite --hive-table sqoop_import.departments_test --create-hive-table

— Change export delimiters
All the delimiters in HDFS input in export are appended with –input
* –input-enclosed-by: It encloses every field in the data with this character
* –input-escaped-by: Used to escape any special characters in the data (like , in csv can cause issue with total number of cols in a record)
* –input-fields-terminated-by: field separater
* –input-lines-terminated-by: line separater
* –input-null-string: Replace null in string columns
* –input-null-non-string: Replace null in non-string(int, double etc) columns

But if we are used non-default SQL delimiters when we imported the data and wanted to use same imported directory in export then we have to use above-to-above arguments as well as those delimiters will be stored in the out-dir (java-files) in the imported dir
$ sqoop export --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments_test --export-dir /user/hive/warehouse/sqoop_import.db/departments_test/ --input-fields-terminated-by \\001 --input-lines-terminated-by '\n' --input-null-string NULL --input-null-non-string -1

— file format of data during import
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments --as-sequencefile
–as-sequencefile: will store data in binary format
$ sqoop import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments --as-avrodatafile
–as-avrodatafile will import schema into the user home dir along with the data into the target dir.

  • Schema represents the table structure, columns and datatypes. It is generated with convention sqoop_import_.avsc
    • hive> Create external table departments_avro ROW FORMAT SERDE ‘org.apache.hadoop.hive.serde2.avro.AvroSerDe’ stored as inputformat ‘’ outputformat ‘’ location ‘/user/cloudera/departments/’ tblproperties(‘avro.schema.url’=’/user/cloudera/departments.avsc’);
  • Export have nothing to do with file formats

Sqoop Jobs and Sqoop Merge

This is used to define pre-defined job with all the required parameters for the purpose of reuse
$ sqoop job --create import_job -- import --connect "jdbc:mysql://quickstart.cloudera:3306/retail_db" --username retail_dba --password cloudera --table departments --target-dir /user/cloudera/departments
* — import \ [there should be space between — and import]

$sqoop job --list -> will list all the existing sqoop jobs
$sqoop job --show –> will show the job details and definition
$sqoop job --exec –> To run the job

— Merge
sqoop merge --merge-key department_id --new-data --new-data /user/cloudera/sqoop_merge/departments_delta --onto /user/cloudera/sqoop_merge/departments --target-dir /user/cloudera/sqoop_merge/staging --class-name --jar-file /tmp/sqoop-cloudera/compile/e11d28e872acd71c103d33fbf81ec5c7/departments.jar
* now remove the old dir ‘/user/cloudera/sqoop_merge/departments’
hdfs dfs -rm -R /user/cloudera/sqoop_merge/departments
* rename dir ‘/user/cloudera/sqoop_merge/staging’ to ‘/user/cloudera/sqoop_merge/departments’
hdfs dfs -mv /user/cloudera/sqoop_merge/staging /user/cloudera/sqoop_merge/departments


Order by, Sort by, Distribute By, Cluster by in Hive

These four clauses provided by HIVE are very important in context of Big Data as they provide a way to achieve sorting on big datasets.

  • ORDER BY x: guarantees global ordering, but does this by pushing all data through just one reducer. This is basically unacceptable for large datasets. You end up one sorted file as output.
  • SORT BY x: orders data at each of N reducers, but each reducer can receive overlapping ranges of data. You end up with N or more sorted files with overlapping ranges.
  • DISTRIBUTE BY x: ensures each of N reducers gets non-overlapping ranges of x, but doesn’t sort the output of each reducer. You end up with N or unsorted files with non-overlapping ranges.
  • CLUSTER BY x: ensures each of N reducers gets non-overlapping ranges, then sorts by those ranges at the reducers. This gives you global ordering, and is the same as doing (DISTRIBUTE BY x and SORT BY x). You end up with N or more sorted files with non-overlapping ranges.

More detailed explanation is given here


best possible way to find the duplicates in an array

This is the most frequently asked question in many interviews. This can be done in many ways and simplest way is the brute-force approach by taking 2 for loops with complexity of O(n*n). But the interviewer looks for the solution with the best time complexity and also without using any existing data structures that are available in Java or other languages.

Now I am going to show you the approach with O(n) complexity at the cost of O(n) space. Full source code can be downloaded from here.

Step1: Construct a simple linked list which holds the integer value.

    class MyInt {
		int value;
		MyInt next;
		public MyInt(int value) {
			this.value = value;

Step2: Initialize the above array with the size of the input array.

hashArray = new MyInt[size];

Step3: Loop through the input array and insert elements into the newly constructed hash array

  • In this method we construct the hashcode of each element and then find out the appropriate bucket (array index) in which the element has to be stored.
    public void findDuplicates(Integer currentElement) {
		int idx = getSupplementalHash(currentElement.hashCode()) % size;
		//int idx = (element.hashCode()) % size;
		MyInt existingElement = hashArray[idx];

		for(;existingElement != null; existingElement = {
			if(existingElement.value == currentElement) {
				// duplicate
				System.out.println(currentElement+" this is a duplicate");
			} else {
				System.out.println("identified collision, adding "+
                                                   existingElement.value+" to the list");
		System.out.println("adding "+currentElement+" to the list");
		MyInt mi = new MyInt(currentElement);
		// insert element at the head to avoid tail traversing = hashArray[idx];
		hashArray[idx] = mi;

Example #1
Input Array: {1, 2, 3, 3, 4, 5, 6, 8, 8, 1, 2,8}

3 is a duplicate
8 is a duplicate
1 is a duplicate
2 is a duplicate
8 is a duplicate

But there is a catch here, if the hashcode is not properly implemented by the user the complexity might go upto O(n*n), see the below example

Example #2: In findDuplicates() method above, comment line #2 and uncomment line #3
Input array: {10,0,100,20,20}

Input array size is 5
adding 10 to the list
identified collision, adding 10 to the list
adding 0 to the list
identified collision, adding 0 to the list
identified collision, adding 10 to the list
adding 100 to the list
identified collision, adding 100 to the list
identified collision, adding 0 to the list
identified collision, adding 10 to the list
adding 20 to the list
20 is a duplicate

If you notice above output, to find if 20 is a duplicate, it traversed 4 times which is nothing but O(n) for each element and hence it goes back to O(n*n) overall.

Disadvantage: The downside with this approach when hashcodes are not proper (like all ending with 0, all are even, all are odd, etc), they all go into one bucket and thus becoming another array (with complexity O(n)) instead or hasharray (with complexity O(1)). Because of this our target complexity O(n) will be lost.

To address this we will re-compute the hash code again at our end to make sure it is well distributes the index, here is the code

    private int getSupplementalHash(int h) {
		// This function ensures that hashCodes that differ only by
		// constant multiples at each bit position have a bounded
		// number of collisions (approximately 8 at default load factor).
		h ^= (h >>> 20) ^ (h >>> 12);
		return h ^ (h >>> 7) ^ (h >>> 4);

Example #2 (run with getSupplementalHash() method): : In findDuplicates() method above, comment line #3 and uncomment line #2

Input array size is 5
adding 10 to the list
identified collision, adding 10 to the list
adding 0 to the list
adding 100 to the list
adding 20 to the list
20 is a duplicate

If you notice, the collision is reduced to a greater extent.

Full source code can be downloaded from here.

Different Similarity/Distance Measures in Machine Learning

What is the best similarity/distance measure to be used in machine learning models? This depends on various factors. We will see each of them now.

Distance measures for numeric data points

Minkowski Distance: It is a generic distance metric where Manhattan(r=1) or Euclidean(r=2) distance measures are generalizations of it.

Manhattan Distance: It is the sum of absolute differences between the coordinates. It is also called as Rectilinear Distance, L1-Distance/L1-Norm, Minkowski’s L1 Distance, City Block Distance, Taxi Cab Distance.


Euclidean Distance: It is the square-root of sum of squares of difference between the coordinates and is given by Pythagorean theorem. Also called as L2 Norm, Ruler Distance. In general, default distance is considered as Euclidean distance.


Use Manhattan or Euclidean distance measures if there are no missing values in the training data set (data is dense)

Cosine Similarity

This measures the cosine of angle between two data points (instances). It is a judgment of orientation rather than magnitude between two instances/vectors with respect to the origin. The cosine of 0degrees is 1 which means the data points are similar and cosine of 90degrees is 0 which means data points are dissimilar (all these angular measures are independent of magnitude).

  • Mostly used in high dimensional positive spaces where output is bounded in the range of [0,1].
  • Used in Information Retrieval, Text Mining, Recommendation Systems and Classifications. It is also used to measure cohesion between the clusters in Data Mining.
  • It is very efficient to evaluate sparse vectors(as only the non-zero dimensions in the space are considered)
  • If the attribute vectors are normalized by subtracting the vector means [e.g., Ai – mean(A)], the measure is called centered cosine similarity and is equivalent to the Pearson Correlation Coefficient.


Disadvantage: Cosine similarity is subjective to the domain and application and is not an actual distance metric. For example data points [1,2] and [100,200], are shown similar with cosine similarity, whereas in eucildean distance measure shows they are far away from each other (in a way not similar).

Question #1: If we want to find out closest merchants (like costco, walmart, delta, starbugs, etc) to a given merchant, we would go for cosine similarity because we want merchants in similar domains (i.e, orientation) rather than the magnitude of the feature vector.

Question #2: If we want to find out closest customers to a given customer, we would go for euclidean distance because we want customers who are close in magnitude to the current customer.

Pearson Correlation Coefficient

This is a measure of correlation between two random variables. It ranges between [-1, 1]. 1 is the perfect agreement and -1 is perfect disagreement. We use this when we have grade inflation problem.


What is grade inflation? It means users are not consistent in giving the ratings. In below example Clara’s 4.75 is same as Robert’s 4 rating. In real time as well some people tend to give extreme ratings and some people never give extreme  ratings and because of this an item (rockband in this case) which is extremely good might get ratings between 4-5 which are all same. And this problem is addressed with pierson coefficient.

  Blues traveler Norah Jones Phoenix The Strokes Weird Al
Clara 4.75 4.5 5 4.25 4
Robert 4 3 5 2 1


Jaccard Similarity

Data points in Jaccard Similarity are sets rather than vectors. Sometimes the equality of certain values does not mean anything and this problem is addressed with Jaccard Similarity.
What is set? Collection of unordered elements {a,b}
What is Cardinality? It counts number of elements in A and is denoted by |A|
Intersection of sets? Elements common in both datasets A and B
Union of Sets? All elements present in both sets.

Illustration of iTune songs ordered by number of plays.
Problem Stmt: find similar songs in iTunes
Say, my top track is Moonlight Sonata by Marcus Miller with 25 plays. Chances are that you have played that track zero times. In fact, chances are good that you have not played any of my top tracks. In addition, there are over 15 million tracks in iTunes and I have only four thousand. So the data for a single person is sparse since it has relatively few non-zero attributes (plays of a track). When we compare two people by using the number of plays of the 15 million tracks, mostly they will have shared zeros in common and it doesn’t indicate the similarity between the users. On the other hand, if both users listened to the same song (a value of 1 for each), it is implied that the users have many similarities. In this case, equality of 1 should carry a much higher weight than equality of 0.

There are other distance measures for sequences (Time Series), network nodes(Sim Rank, Random Walk), population distribution(KL Divergence) which I will be covering in later posts.


Reason for data normalization in ML Models

Standardization/Normalization is a common requirement for majority of algorithms (except like ID3 impl of Trees) which transforms asymmetric training data into symmetric. ML Algorithms behave badly if the training data is not brought on to the same scale because of the noise/outliers or the non-guassian properties of features.

Types of normalization

  • Z-transform: This is also called as Standardization.
    • This rescales the features so that they will have the properties of a standard normal distribution with mean=0 and standard_deviation=1.
    • It is useful to standardize attributes for a model that relies on the distribution of attributes such as Gaussian processes.
    • z-transform

  • Normalization: This is also called Min-Max Scaling(based on min max values of the variable).
    • In this data is scaled to a range [0,1].
    • The advantage of this bounded range between 0 and 1 is that it ends up with smaller standard deviations and suppresses the affect of the outliers.
    • We use this method in K-Nearest Neighbors and preparation of coefficients in regression
    • min_max_norm

Apart from normalization/standardization techniques, other pre-processing methods to transform data from non-linear to linear can be logarithmic and square root scaling. These are usually performed when the data is characterized by “bursts”, i.e. the data is well grouped in low values, but some portion of it has relatively larger values.

Feature normalization is to make different features to the same. Illustration

  AcctID Fico Revolving_Balance Num of Cards
Data point #1 10001 755 20000 5
Data point #2 10002 820 5000 2

Features are Fico_score, Revolving_balance and Num_of_cards. Out of these features, one feature ‘Revolving_Balance’ is in 1000s scale, ‘Fico’ in 100s scale and ‘Num of Cards’ in 10s scale.

Now if we calculate distances between data points, since one of the dimensions have very large values, it overrides other dimensions (you can see above example, the distance contributed by number of cards would completely be nullified by the distance contributed by Revolving_Balance, if the data is not normalized)

The only models that does not care about rescaling of data is when we build the decision trees (like ID3, see the implementation here)

When to use which method? It is hard to say, we have to choose based on some experimentation.


Importance of data distribution in training machine learning models

A fundamental task in many statistical analyses is to characterize the location and variability of a data set. A further characterization of the data includes data distribution, skewness and kurtosis.

What is Normal Distribution and why is it important in training our data models in Machine learning? The normal distributions are very important class of statistical distributions. All normal distributions are symmetric and have bell-shaped curves with a single peak (aka Gaussian Distribution). Creating a histogram on variable (variable values on Xaxis and its frequency on Yaxis) would get you a normal distribution.

When the distribution is normal then it obeys 68-95-99.7% rule. Which means

  • 68% of data points/observations fall within +1*(Standard Deviation) to -1*(Standard Deviation) of mean
  • 95% of data points/observations fall within +2*(Standard Deviation) to -2*(Standard Deviation) of mean
  • 7% of data points/observations fall within +13*(Standard Deviation) to -3*(Standard Deviation) of mean


If the data distribution is not normal then it can be skewed to the left or right or completely random. Some of these cases are addressed through skewness and kurtosis.

Skewness: The coefficient of Skewness is a measure for the degree of symmetry in the variable distribution. There are different formulae for calculating this skewness coefficient. Karl Pearson gave couple of formula


Kurtosis: Kurtosis is a measure of whether the data are peaked or flat relative to a normal distribution. That is, data sets with high kurtosis tend to have a distinct peak near the mean, decline rather rapidly, and have heavy tails. Data sets with low kurtosis tend to have a flat top near the mean.

Some basics to recollect to go through the distribution (mean, median and std dev)
Mean: It is the sum of all observations divided by number of observations
Median: When all the observations are sorted in the ascending order, the median is exactly the middle value.
– Median is equal to 50th  percentile.
– If the distribution of the data is Normal, then the median is equal to the arithmetic mean (which also equals Mode).
– The median is not sensitive to extreme values/outliers/noise, and therefore it may be a better measure of central tendency than the arithmetic mean.
Standard Deviation: It gives the measure of the spread of the data. Average of squared differences from the mean is variance and square root of variance is std dev.


ID3 Implementation of Decision Trees

There are different implementations given for Decision Trees. Major ones are

  • ID3: Iternative Dichotomizer was the very first implementation of Decision Tree given by Ross Quinlan.
  • C4.5: Advanced version of ID3 algorithm addressing the issues in ID3. Some of issues it addressed were
    • Accepts continuous features (along with discrete in ID3)
    • Normalized Information Gain
    • Missing Value Imputation: Handling missing values
    • Pruning: Overfitting problem
    • Applying different weights to the features
  • CART: Classification and Regression Trees are similar to C4.5 in many ways except these trees are built on numeric splitting criteria.

We will see the basic implementation of ID3 now and we will extend it from here going further.

Features, training and testing data sets are taken from

FEATURES (sample)

Total: 22
// A listing of the features and their possible values. All features are binary.
age_lte_30 t f
age_betw_31_50 t f
age_gt_50 t f
sex m f


Total: 102
trainPatient1 survived   t f f f n y y y y n y y y y y t t t f f f n
trainPatient2 survived   f t f m n y n y y n y y y y y t f t f f t n
trainPatient3 survived   f f t m y y n y y y y y y y y t t t f f f n
trainPatient4 survived   f t f m y n y y y y y y y y y t t t f f f n


Total: 52
testPatient1 survived    f t f m y y y y y y y y y y y t t f t f f n
testPatient2 survived    f f t m n n y y y n n y y y y t f f f t t y
testPatient3 survived    t f f f y y n n y y n y y y y t f f f t t n
testPatient4 died        f t f m y y n y y y n n y n n t f f t f t y
testPatient5 survived    f f t m n y n n n y n n n y n f f f f t f y

Implementation of the ID3 ALGORITHM

  • Load the data: Load the features, training and testing data.
  • Build the Decision Tree
  • Test the Decision Tree
  • Print the constructed Decision Tree

Build the Decision Tree

 - Check for the base cases
 - For each attribute, find the Information Gain and choose the attribute 
     with Maximum Info Gain or Minimum Entropy (aka Best Attribute)
 - Create a Decision Node on the best attribute chosen in above step.
 - Recursively run above 3 steps for remaining attributes.

public DTNode buildDecisionTree(Features features, Instances trainExamples, 
			String[] outputLabels) {

		if(trainExamples.getExamples().size() ==0 || features.getFeatures().size() ==0) {
			return null;

		Map&amp;amp;lt;String, Integer&amp;amp;gt; outputLabelCt = getOutputLabelDistribution(features.getFeatures(), trainExamples.getExamples(), outputLabels);
		int negCt = outputLabelCt.get(outputLabels[0]);
		int posCt = outputLabelCt.get(outputLabels[1]);
		String debug = posCt + "+ve/" + negCt + "-ve";
		String outputLabel = outputLabels[1];
		if(posCt &amp;amp;gt; negCt) {
			outputLabel = outputLabels[0];

		if(posCt == 0) {
			DTNode node = new DTNode();
			return node;

		if(negCt == 0) {
			DTNode node = new DTNode();
			return node;
		Feature bestFeature = getBestFeature(features.getFeatures(), trainExamples.getExamples(), outputLabels, outputLabelCt);
		Features remainingFtrs = getRemainingFtrs(features.getFeatures(), bestFeature);
		//System.out.println("remainingFtrs size "+remainingFtrs.getFeatures().size());

		// build a node based on the best feature
		DTNode root = new DTNode();

		// traverse left side and construct
		//System.out.println("traversing left");
		Instances filteredExOnBestFtr = getFilteredExamplesOnBestFtr(trainExamples.getExamples(), bestFeature, bestFeature.getfOptions()[0]);
		root.setLeftNode(buildDecisionTree(remainingFtrs, filteredExOnBestFtr, outputLabels));

		// traverse right side and construct
		//System.out.println("traversing right");
		filteredExOnBestFtr = getFilteredExamplesOnBestFtr(trainExamples.getExamples(), bestFeature, bestFeature.getfOptions()[1]);
		root.setRightNode(buildDecisionTree(remainingFtrs, filteredExOnBestFtr, outputLabels));

		// return root of the tree
		return root;

Test the Decision Tree: Tests the decision tree which is built in previous step and prints the accuracy (In our implementation accuracy is around 92%)

Successfully predicted testPatients: 48
Failed example is testPatient5;survived;[f, f, t, m, n, y, n, n, n, y, n, n, n, y, n, f, f, f, f, t, f, y]
Failed example is testPatient19;died;[f, t, f, m, n, n, n, n, y, y, y, y, n, y, y, t, t, f, t, f, t, y]
Failed example is testPatient24;survived;[f, f, t, m, y, y, n, y, y, y, n, n, n, y, n, t, f, t, f, f, t, y]
Failed example is testPatient34;survived;[f, t, f, m, n, y, n, n, n, y, y, y, y, y, y, t, f, f, f, t, f, y]
Number of failed examples: 4, out of total test examples: 52
Accuracy achieved is 92.3076923076923%

OUTPUT Decision Tree

│       ┌── survived#ONLYNEG#1+ve/0-ve
│   ┌── histology#Normal#1+ve/6-ve
│   │   └── died#ONLYPOS#0+ve/6-ve
└── varices#Normal#88+ve/14-ve
    │               ┌── survived#ONLYNEG#4+ve/0-ve
    │           ┌── sex#Normal#9+ve/5-ve
    │           │   │   ┌── died#ONLYPOS#0+ve/4-ve
    │           │   └── sgot_lte_59#Normal#5+ve/5-ve
    │           │       │       ┌── died#ONLYPOS#0+ve/1-ve
    │           │       │   ┌── age_betw_31_50#Normal#1+ve/1-ve
    │           │       │   │   └── survived#ONLYNEG#1+ve/0-ve
    │           │       └── anorexia#Normal#5+ve/1-ve
    │           │           └── survived#ONLYNEG#4+ve/0-ve
    │       ┌── alk_phosphate_lte_127#Normal#35+ve/8-ve
    │       │   │       ┌── survived#ONLYNEG#3+ve/0-ve
    │       │   │   ┌── antivirals#Normal#5+ve/3-ve
    │       │   │   │   │       ┌── survived#ONLYNEG#2+ve/0-ve
    │       │   │   │   │   ┌── sgot_gt_156#Normal#2+ve/1-ve
    │       │   │   │   │   │   └── died#ONLYPOS#0+ve/1-ve
    │       │   │   │   └── liver_big#Normal#2+ve/3-ve
    │       │   │   │       └── died#ONLYPOS#0+ve/2-ve
    │       │   └── spiders#Normal#26+ve/3-ve
    │       │       └── survived#ONLYNEG#21+ve/0-ve
    │   ┌── age_lte_30#Normal#48+ve/8-ve
    │   │   └── survived#ONLYNEG#13+ve/0-ve
    └── fatigue#Normal#87+ve/8-ve
        └── survived#ONLYNEG#39+ve/0-ve

Complete source code can be downloaded at


Decision Trees Learning and Implementation #2

Entropy and Information Gain
Let ‘v’ be the random boolean variable with the following probability distribution:

P(v=0) = 0.2 and P(v=1) = 0.8
Say, v=0 is heads on a coin and v=1 is tails

The surprise S(V=v) of each value ‘v’ is defined as

S(V=v) = -log[P(V=v)]

Which means
– an event with probability 1 gives me ZERO surprise (consider a coin with heads on both sides, and probability of getting heads doesn’t surprise at all)
– an event with probability 0 gives me INFINITE surprise (consider a coin with heads on both sides, and probability of getting tails gives us infinite surprise)

Now that we have individual event surprise, we would be interested in average surprise of all events which introduces the concept of Entropy

What is Entropy? Entropy of ‘v’, denoted as H(v), is defined as the average surprise of describing the result of one trial of ‘v’


Here Sigma(log[P(H=v)]) gives the total surprise and multiplying it with the probability P(H=v) gives the average.

Entropy is also viewed as the measure of uncertainty.


From the graph it is observed that the entropy is maximum at probability=0.5, i.e, when probability is same for both events then the uncertainty is maximum. If you take a fair coin with heads on one side and tails on other side then the

probability P(V=heads)=0.5 and P(V=tails)=0.5
Entropy = -0.5log(0.5)-0.5log(0.5) = 1

In statistics ‘Gene Coefficient’ also generates the same curve as that of entropy.

Now the question is: How this curve helps us to find the best attribute when compared to the error rate? This introduces concept of Information Gain

What is Information Gain or Mutual Information? Consider two dependent random variables A, B. The mutual information between A and B is: the amount of information we learn about B by knowing the value of A.


Mutual Information quantifies how much a feature/attribute ‘x1’ tells about the value of class Y. So now if we look at our previous example




CodeCogsEqn (1)

CodeCogsEqn (2)

CodeCogsEqn (3)

CodeCogsEqn (4)

CodeCogsEqn (7)

If you notice in previous article the error rate was ZERO before and now we have positive information gain as 0.0304 which is telling us that we are making progress in building the decision tree.

Following the same steps one by one we can build the complete decision tree for the previous shown illustration and it looks sth like this.


In this graph, if you notice there are only three features (Outlook, Wind and Humidity) considered. We did not consider the remaining feature ‘temperature’ because it is redundant. Even in real time we select only relevant attributes while training the model.

I will dive into the implementation details on some real time data in the next post.


Mostly technology with occasional sprinkling of other random thoughts


Amir Amintabar's personal page

101 Books

Reading my way through Time Magazine's 100 Greatest Novels since 1923 (plus Ulysses)

Seek, Plunnge and more...

My words, my world...

ARRM Foundation

Do not wait for leaders; do it alone, person to person - Mother Teresa

Executive Management

An unexamined life is not worth living – Socrates

Diabolical or Smart

Nitwit, Blubber, Oddment, Tweak !!


A topnotch site


Just another site

coding algorithms

"An approximate answer to the right problem is worth a good deal more than an exact answer to an approximate problem." -- John Tukey