Decision Trees Learning and Implementation #1

Decision trees are among the most widely used machine learning algorithms in data mining, for two reasons: they are fairly easy to understand and implement, and they scale well.

Hypothesis Space and Learning Algorithms of Decision Trees (all these terms are explained in the previous article)

  • Variable-size hypothesis space: the hypothesis space grows with the input data.
  • Deterministic: every training example in a decision tree is deterministic in nature (i.e., either positive or negative).
  • Parameters: decision trees support both continuous and discrete parameters.
  • Constructive search algorithm: the tree is built by adding nodes.
  • Eager: the learning algorithm of a DT is eager in nature.
  • Batch mode.

Some basic concepts of DT hypothesis space
– What is an Internal Node? A node where we test the value of one feature [Xj]
– What is a Leaf Node? A leaf node makes the class prediction (e.g., whether the person has cancer or not)

Decision Space and Decision Boundaries:
– DTs divide the feature space into axis-parallel rectangles, and each rectangle is labeled with one of ‘k’ classes. Consider a simple decision tree (on the LHS) with two features (X1, X2) whose continuous values range over 0-6. The tree is mapped onto the chart on the RHS, which illustrates the definition above.
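To make the axis-parallel idea concrete, here is a tiny hand-written sketch (the thresholds and class labels are invented for illustration, not taken from any figure): each internal node tests one feature against a threshold, so every leaf corresponds to one rectangle in (X1, X2) space.

```java
// Hypothetical 2-feature tree over X1, X2 in [0, 6]; thresholds are invented.
// Each test cuts the feature space with an axis-parallel line, so the leaves
// tile the space into rectangles, each labeled with one class.
public class TinyTree {
    public static String predict(double x1, double x2) {
        if (x1 < 3.0) {                      // vertical split at X1 = 3
            return (x2 < 4.0) ? "A" : "B";   // horizontal split at X2 = 4
        }
        return (x2 < 2.0) ? "B" : "A";       // horizontal split at X2 = 2
    }
}
```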



Decision Tree Illustration from the Tom Mitchell book

Day  Outlook   Temperature  Humidity  Wind    Play ball
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

From the above data, the features (or attributes) are Outlook, Temperature, Humidity and Wind. With these features we now have to build the decision tree, and the question is: which feature comes first in the tree, and which features follow it?

The order of the features depends on how well each feature separates the data, so that the tree terminates quickly. If we choose the best attribute at every step, we will build a good decision tree.

The ideal case is a feature that perfectly classifies the data (for example, separating spam from non-spam based on some word, say ‘viagra’). Here the decision tree would consist of a single feature (say contains_word_viagra) that classifies the data 100% on one side of the node and 0% on the other, and we are done.

But in practice a feature classifies at some ratio like 60-40, 30-70 or 90-10% (these ratios determine the error rates), and we should choose the feature with the best error rate as our first node, like the 90-10% split in the example above (because the ideal 100-0% split is not achievable in practice).

Unfortunately, the error rate is not the best criterion for choosing an attribute, because sometimes it does not tell us whether we are making any progress while building the decision tree. See the tree below:


This tree does not tell us whether we made any improvement after splitting on X1; therefore we need a better heuristic than the raw error rate.

A better heuristic comes from information theory: using a logarithmic scale, which introduces the concepts of Entropy and Information Gain. These concepts are explained in the next post.
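As a preview, the entropy of a boolean-labeled sample with positive fraction p is H = -p·log2(p) - (1-p)·log2(1-p). A minimal sketch (the class name is mine); for the play-ball table above, with 9 Yes and 5 No, it gives about 0.940 bits:

```java
public class EntropyDemo {
    // Entropy of a boolean-labeled sample: -p*log2(p) - (1-p)*log2(1-p)
    public static double entropy(int positives, int negatives) {
        double total = positives + negatives;
        double h = 0.0;
        for (int count : new int[] { positives, negatives }) {
            if (count == 0) continue;              // 0 * log2(0) is taken as 0
            double p = count / total;
            h -= p * (Math.log(p) / Math.log(2));  // log2 via natural log
        }
        return h;
    }
}
```

A pure 100-0% split gives entropy 0, while a 50-50 split gives the maximum of 1 bit, which is exactly the "progress" signal the raw error rate fails to provide.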

Essence of Inductive/Supervised Learning

Inductive/Supervised Learning: given training examples [x, f(x)] for some unknown function ‘f’, finding a good approximation of ‘f’ is called supervised learning.

Appropriate applications of Supervised Learning

  • Situations where humans can perform the task but cannot describe how they do it.
    • x: bitmap picture of a handwritten character; f(x): ASCII code of the character
  • Situations where the desired function changes very frequently.
    • Predicting stock prices.
  • Situations where each user needs a customized function.
    • Amazon user recommendations.

Let us take this simple example

#  x1  x2  x3  x4 | y
1  0   0   1   0  | 0  
2  0   1   0   0  | 0  
3  0   0   1   1  | 1  
4  1   0   0   1  | 1  
5  0   1   1   0  | 0  
6  1   1   0   0  | 0  
7  0   1   0   1  | 0  
... and so on

We have to predict ‘y’ for a given new combination of (x1, x2, x3, x4). Since there are 4 boolean variables, there are 2^4 = 16 possible input states, and hence the number of possible functions is 2^16.

Since the number of functions is doubly exponential (2^(2^n) for n boolean variables) and real problems have thousands of variables, a brute-force search for the function is not possible; therefore we need generalization.
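The counting argument above is easy to verify in code; this small sketch (the class name is mine) computes 2^(2^n) exactly:

```java
import java.math.BigInteger;

public class FunctionCount {
    // Number of distinct boolean functions over n boolean inputs: 2^(2^n).
    // There are 2^n possible input rows, and each row can be labeled 0 or 1.
    public static BigInteger numFunctions(int n) {
        int rows = 1 << n;                        // 2^n truth-table rows
        return BigInteger.valueOf(2).pow(rows);   // 2^(2^n) labelings
    }
}
```

For n = 4 this gives 2^16 = 65536, matching the example; already at n = 20 the count has over 300,000 decimal digits, which is why brute force is hopeless.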

Two views of Machine Learning to generalize

  • Removal of remaining uncertainty: suppose we know the unknown function is an ‘m of n’ function; then we would use the training examples to infer which one it is.
  • Guessing a good, small hypothesis class: we can start with a very small class and enlarge it until it contains a hypothesis that fits the data.

A framework for hypothesis spaces and learning algorithms

Framework for hypothesis spaces

Key factors to understand the hypothesis space

  • Size: Is the hypothesis space fixed in size (like Naïve Bayes) or variable in size (like Decision Trees)?
    • Fixed-size spaces are easier to understand, but variable-size spaces are more useful.
    • Variable-size spaces introduce the problem of overfitting.
  • Randomness: Is each hypothesis ‘Deterministic’ or ‘Stochastic’? This affects how we evaluate hypotheses.
    • With a deterministic hypothesis, a training example is either consistent (correctly predicted) or inconsistent (incorrectly predicted), e.g. spam/non-spam detection.
    • With a stochastic hypothesis, a training example is more likely or less likely.
  • Parameterization: Is the hypothesis described by a set of symbolic (discrete) choices or by continuous parameters?


Framework for Learning Algorithms

Key factors to understand the learning algorithms

  • Search Procedure
    • Direct Computation: solve for the hypothesis directly.
    • Local Search: start with an initial hypothesis and make small improvements until a local optimum is reached.
    • Constructive Search: start with an empty hypothesis and gradually add structure to it.
  • Timing
    • Eager: analyze the training data and construct an explicit hypothesis (Decision Trees).
    • Lazy: store the training data, wait till the test data is presented, and then construct an ad hoc hypothesis to classify the test instance (KNN algorithm).
  • Online vs Batch Learning
    • If the data changes very often (like stock market prediction), we need online learning.
    • If the data does not change much over time (like drug analytics), then batch learning suffices. Batch learning is more expensive.


Terminology and key issues with Machine Learning

These are some of the terms which are used in machine learning algorithms.

  • Training Example: An example of the form [x, f(x)]. Statisticians call it a ‘Sample’. It is also called a ‘Training Instance’.
  • Target Function: The true function ‘f’ that we are trying to learn.
  • Target Concept: A boolean target function, where
    • instances with f(x) = 1 are called positive instances
    • instances with f(x) = 0 are called negative instances
  • Hypothesis: In every algorithm we try to come up with a hypothesis that is close to the target function ‘f’.
  • Hypothesis Space: The space of all hypotheses that can be output by a program. The Version Space is a subset of this space.
  • Classifier: A discrete-valued function.
    • The classifier is what a learner outputs. A learning program is a program whose output is also a program.
    • Once we have the classifier, we replace the learning algorithm with the classifier.
    • ‘Program vs Output’ and ‘Learner vs Classifier’ are the same relationship.

Some of the notations commonly used in Machine Learning related white papers


Some of the key issues with machine learning algos

  • What is a good hypothesis space? Is past data good enough?
  • What algorithms fit to what spaces? Different spaces need different algorithms
  • How can we optimize accuracy on future data points? (this is also called the ‘Problem of Overfitting‘)
  • How do we select the features from the training examples? (this is also called the ‘Curse of Dimensionality‘)
  • How can we have confidence in results? How much training data is required to find accurate hypothesis (it’s a statistics question)
  • Are learning problems computationally intractable? (Is the solution scalable)
  • Engineering problem? (how to formulate application problems into ML problems)

Note: the Problem of Overfitting and the Curse of Dimensionality come up in most real-world problems; we will look into each of them while studying the individual algorithms.


Machine Learning and its overall territory

Machine Learning is automating automation, or getting computers to program themselves.
*Sometimes writing the software ourselves is the bottleneck (as with face detection, handwriting-to-ASCII mapping, stock prediction, etc.), so we let the data do the work instead.



Every machine learning algorithm has 3 components.

  • Representation
  • Evaluation
  • Optimization

Just as we programmers need a programming language like Java or Scala to develop a program, machine learning needs languages (representations) to accomplish learning. They are:

  • Decision Trees: the most widely used representation in ML.
  • Set of Rules: a simple set of rules (like a bunch of if-else conditions).
  • Instances: one of the easiest, oldest and simplest lazy-learning methods in ML.
  • Graphical Models: Bayes and Markov nets.
  • Neural Networks
  • Support Vector Machines: very important in business-oriented learning and built on sophisticated mathematics.
  • Model Ensembles: the newest ones (ex: Gradient Boosting Machines)

New representations appear much less often than new methods of ‘Evaluation’ and ‘Optimization’, so choosing one is largely a one-time effort. Once a representation is chosen, we have chosen a language, and the actual work starts: ‘Evaluation’.

Evaluation explains how to measure the accuracy of an ML program. There are a few concepts we use to measure accuracy (consider the spam detection example):

  • Accuracy: the fraction of mails classified correctly, counting spams actually marked spam and non-spams marked non-spam.
  • Precision: what fraction of our predicted spams are actually spams (a 0-1 value).
  • Recall
  • Squared Error: the square of the difference between the predicted value and the actual value.
  • Likelihood: how likely what we are seeing is according to our model. Likelihood works well when the learner is not very powerful.
  • Posterior Probability: a combination of likelihood and prior probability. Along with the likelihood, it gives weight to our prior beliefs.
  • Cost/Utility: we should consider the costs of ‘false positives’ and ‘false negatives’, as they can be very different and expensive.
  • Margin: if we draw a line between spams and non-spams, the distance between the line and the nearest spams/non-spams is the margin.
  • Entropy: used to measure the degree of uncertainty.
  • KL Divergence
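Accuracy, precision and recall can be written down directly from the confusion-matrix counts (true/false positives and negatives). A minimal sketch (class name and the example counts are mine):

```java
public class Metrics {
    // Standard definitions over confusion-matrix counts, for the spam example:
    // tp = spams marked spam, fp = non-spams marked spam,
    // tn = non-spams kept,    fn = spams that slipped through.
    public static double accuracy(int tp, int tn, int fp, int fn) {
        return (double) (tp + tn) / (tp + tn + fp + fn);
    }
    public static double precision(int tp, int fp) {
        return (double) tp / (tp + fp);   // fraction of predicted spams that are spam
    }
    public static double recall(int tp, int fn) {
        return (double) tp / (tp + fn);   // fraction of actual spams we caught
    }
}
```

The false-positive cost asymmetry mentioned above is visible here: a filter can trade precision for recall (or vice versa) without changing accuracy much, which is why accuracy alone is rarely enough.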

The type of optimization to use depends on the type of Representation and Evaluation we are using.

  • Combinatorial Optimization: used when the representation is discrete (ex: greedy search)
  • Convex Optimization: used when the representation is continuous (ex: gradient descent)
  • Constrained Optimization: continuous optimization subject to many constraints (in a way a combination of both of the above, like Linear Programming)

This gives the complete picture of Machine Learning and we will dive into each ML algorithm going forward.

Analogy between Binary Search Tree and Quicksort Algorithm

Binary Search Tree (BST) and the Quicksort Algorithm (QA) are similar in nature: they have the same time complexities and perform the same comparisons at every insertion. If you find the quicksort logic difficult to understand and remember, you can always fall back on the BST to recall the quicksort logic. We will now see how quicksort can be viewed as a simple BST by looking at both concepts individually.

Binary Search Tree(BST)
The Binary Search Tree (BST) property is that every inserted node goes to the left of its parent node if its value is less than the parent node's value, and to the right if it is greater.

Binary Search Tree (constructed from random input array)
BST Sort

Binary Search Tree (constructed from sorted input array)
BST Sorted
Complexity of BST: randomized input leads to a balanced BST, in which case each BST insertion takes O(logn) time. If the input is sorted in some order (ascending or descending), the tree degenerates into a chain and each insertion takes time proportional to the length of the array, i.e., O(n).
So for n inserts it would be O(nlogn) or O(n^2).
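A minimal BST insertion sketch (class and method names are mine) makes this easy to experiment with; an in-order traversal of the finished tree yields the keys in sorted order, which is the "BST sort" pictured above:

```java
public class BST {
    static class Node {
        int val;
        Node left, right;
        Node(int val) { this.val = val; }
    }

    Node root;

    // Walk down from the root: smaller keys go left, larger (or equal) go right.
    public void insert(int val) {
        root = insert(root, val);
    }

    private Node insert(Node node, int val) {
        if (node == null) return new Node(val);
        if (val < node.val) node.left = insert(node.left, val);
        else node.right = insert(node.right, val);
        return node;
    }

    // In-order traversal of a BST emits the keys in ascending order.
    public void inorder(Node node, StringBuilder out) {
        if (node == null) return;
        inorder(node.left, out);
        out.append(node.val).append(' ');
        inorder(node.right, out);
    }
}
```

Inserting a sorted sequence makes every insert take the right branch, producing the chain (and the O(n^2) total) described above.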

Quicksort Algorithm
Now we will look at quick sort logic
Plain Quicksort Algorithm

Quicksort algorithm with BST embedded

Complexity of the Quicksort Algorithm: randomized input leads to O(logn) recursion depth, as seen in the diagram above. If the input is sorted in some order, the depth becomes the length of the array, i.e., O(n).
So for n elements it would be O(nlogn) or O(n^2) overall.

Understanding quicksort algorithm

I see many people finding it very difficult to remember this algorithm; hence this article. Complete source code with a few unit tests can be downloaded here

To put it in a simple way, the quicksort algorithm is as simple as this diagram depicts.
Explanation of the above diagram: every time we choose the first element as the pivot and make the partition, i.e., elements less than the pivot go to the left partition and elements greater than the pivot go to the right. The order in which these elements are brought into the left and right partitions is explained in the partition logic below. Partitioning itself is applied recursively as part of the actual quicksort logic.
Note: because this is an in-place algorithm you might find it a bit confusing (but efficient); if you prefer simplicity you can always implement it in other ways.

Complexity of the Quicksort Algorithm: on average quicksort has O(nlogn) complexity, and in the worst case O(n^2), which occurs when the elements of the input array are already sorted (in ascending or descending order). It is an in-place algorithm in the sense that it does not take any additional space.
It is a divide-and-conquer algorithm where we partition the given array with respect to a particular element (called the ‘PIVOT’) such that the lower partition's elements are less than the pivot and the upper partition's elements are greater than the pivot.

Concept of Partition and Pivot:
Rule of Pivot: there is no strict rule for the pivot; it can be any element of the given array. Here I consider the first element as the pivot in every partition. How the choice of pivot affects the distribution of elements across partitions, and its role in the complexity of the quicksort logic, will be discussed in future posts.
Rule of Partition: the lower partition should hold elements less than the pivot and the upper partition elements greater than the pivot. The running time of the partition logic is linear.

Let us take an example array {14,12,16,13,11}
First iteration: let's say the pivot is always the first element of the array (i.e., 14 in this case);

 14  | 12  | 16  | 13  | 11 --> (a[0] is not less than pivot, hence i is stopped
 ---------------------------   at index 0; a[4] is not greater than pivot, hence
                               j is stopped at index 4. Now swap a[0] and a[4])  
 11  | 12  | 16  | 13  | 14 --> (a[2] is not less than pivot, hence i is stopped
 ---------------------------   at index 2; a[3] is not greater than pivot, hence
                               j is stopped at index 3. Now swap a[2] and a[3].)
 11  | 12  | 13  | 16  | 14 
 0   | 1   | 2   | 3   | 4   -->  Index
At this point index i points to 3 and j points to 2 and since i > j, 
partition logic will exit resulting into two partitions
Left Partition:  11  | 12  | 13  
Right Partition: 16  | 14  

Partition logic implemented in Java

	/**
	 * Partition logic
	 * @param a array to be modified based on pivot
	 * @param left lower bound of the array
	 * @param right upper bound of the array
	 * @return the partition index where elements to its left are less than it and 
	 * elements to its right are more than it 
	 */
	public int partition(int[] a, int left, int right) {
		// Get the pivot element
		int pivot = a[left];
		// Break when left > right
		while (left <= right) {
			// increment the lower bound till you find an element not less than the pivot
			while (a[left] < pivot)
				left++;
			// decrement the upper bound till you find an element not greater than the pivot
			while (a[right] > pivot)
				right--;
			// swap the values at which the lower and upper bounds stopped
			if (left <= right) {
				int tmp = a[left];
				a[left] = a[right];
				a[right] = tmp;
				// increment left index and decrement right index
				left++;
				right--;
			}
		}
		return left;
	}

Running time of quick sort logic depends on the distribution of elements across the partitions.

Quicksort Logic:
The quicksort logic is straightforward, with a couple of recursive calls. The quicksort implementation in Java is shown below.

	/**
	 * Recursive quicksort logic
	 * @param a input array
	 * @param i start index of the array
	 * @param j end index of the array
	 */
	public void recursiveQuickSort(int[] a, int i, int j) {
		// Divide the array around the partition index
		int idx = partition(a, i, j);
		// Recursively call quicksort with the left part of the divided array
		if (i < idx - 1) {
			recursiveQuickSort(a, i, idx - 1);
		}
		// Recursively call quicksort with the right part of the divided array
		if (j > idx) {
			recursiveQuickSort(a, idx, j);
		}
	}
First recursive call ‘recursiveQuickSort(a, i, idx-1);’ picks the left partition of the array and repeats the process. And second recursive call ‘recursiveQuickSort(a, idx, j);’ picks the right partition of the array and repeats the process.

Since the partition() method returns the index ‘left’, i.e., idx = 3 in this example, the first recursive call ‘recursiveQuickSort(a, i, idx-1);’ would consider the left partition of the array from index 0 to 2

Left Partition:  11  | 12  | 13  

The second recursive call ‘recursiveQuickSort(a, idx, j);’ would consider the right partition of the array from index 3 to 4

Right Partition: 16  | 14  
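Putting the partition and quicksort methods together, here is a self-contained end-to-end check on the example array (the demo class wrapper is mine):

```java
import java.util.Arrays;

public class QuickSortDemo {
    // Hoare-style partition with the first element as pivot
    public static int partition(int[] a, int left, int right) {
        int pivot = a[left];
        while (left <= right) {
            while (a[left] < pivot) left++;     // scan for element >= pivot
            while (a[right] > pivot) right--;   // scan for element <= pivot
            if (left <= right) {                // swap and advance both bounds
                int tmp = a[left]; a[left] = a[right]; a[right] = tmp;
                left++; right--;
            }
        }
        return left;
    }

    public static void recursiveQuickSort(int[] a, int i, int j) {
        int idx = partition(a, i, j);
        if (i < idx - 1) recursiveQuickSort(a, i, idx - 1);
        if (j > idx) recursiveQuickSort(a, idx, j);
    }

    public static void main(String[] args) {
        int[] a = {14, 12, 16, 13, 11};
        recursiveQuickSort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [11, 12, 13, 14, 16]
    }
}
```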

Complete source code with few unit tests can be downloaded here

Swap every pair of nodes in a single linked list

This algorithm is just the extension of ‘swapping any two given nodes’ in a single linked list.

It can be done in two ways: #1 by swapping the addresses of the nodes, or #2 by swapping the data of the nodes. #2 is straightforward and doesn't need any explanation. We will discuss #1 here.

The following diagram depicts what we are trying to do:

Let’s say the singly linked list has 5 nodes n1,n2,n3,n4,n5
Analysis of the algorithm:
#1 – We swap two nodes as usual (n1 and n2 are swapped, which leads to n2,n1,n3,n4,n5)
#2 – This is the key step. Before swapping the next two nodes, remember that n1’s next pointer must also be changed, because when n3 & n4 are swapped the n1->n3 link will be broken, and hence we must take care of this step. (n3 and n4 are swapped and n1’s next is linked to n4, which leads to n2,n1,n4,n3,n5)

Source Code

	public static Node swap(Node n) {
		Node buffer = null;
		Node newRoot = null;
		while (n != null && n.next != null) {
			// re-link the tail of the previous pair to the new head of this pair
			if (buffer != null) {
				buffer.next.next = n.next;
			}
			// swap the current pair: the second node becomes the head
			buffer = n.next;
			n.next = n.next.next;
			buffer.next = n;
			if (newRoot == null)
				newRoot = buffer;
			n = n.next;
		}
		return newRoot;
	}
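The Node type is assumed to be the usual singly linked node with a next pointer. Here is a self-contained version of the same swap logic, with a minimal Node class (mine) and a quick check on the 5-node list from the analysis:

```java
public class PairSwapDemo {
    static class Node {
        int data;
        Node next;
        Node(int data) { this.data = data; }
    }

    // Same link-swapping logic as above, reproduced so this demo is runnable
    public static Node swap(Node n) {
        Node buffer = null;
        Node newRoot = null;
        while (n != null && n.next != null) {
            if (buffer != null) {
                buffer.next.next = n.next;  // re-link previous pair's tail
            }
            buffer = n.next;                // second node becomes pair head
            n.next = n.next.next;
            buffer.next = n;
            if (newRoot == null) newRoot = buffer;
            n = n.next;
        }
        return newRoot;
    }

    // Render the list as space-separated values for easy checking
    public static String print(Node n) {
        StringBuilder sb = new StringBuilder();
        for (; n != null; n = n.next) sb.append(n.data).append(' ');
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        Node head = new Node(1);
        head.next = new Node(2);
        head.next.next = new Node(3);
        head.next.next.next = new Node(4);
        head.next.next.next.next = new Node(5);
        System.out.println(print(swap(head))); // 2 1 4 3 5
    }
}
```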

Algorithm to reverse words of a string

This is a slightly tricky algorithm and a frequently asked interview question. I will now show how to do it with a time complexity of O(2n), i.e., linear time. Complete source code can be downloaded from here.

Here we will make use of the algorithm to reverse a string, explained in our previous posts, which runs at O(n/2) complexity.

Steps to solve this algorithm

  • Take two pointers, one pointing to the start of the array and another to the end of the array, and swap the values at the respective pointers.
  • While traversing the array, increment the starting pointer and decrement the ending pointer, and continue swapping the values at the respective pointers.
  • Continue this till we reach the middle of the array, which results in the reversed string.
  • In the reversed string, take a pointer that runs through the string and stops at spaces, which mark the end of a word.
  • Go back to the first three steps to reverse a word within this reversed string, then return to the fourth step till the sentence is completed.
  • By the end of this algorithm the last word will not yet be reversed, because there is no space after it; hence reverse the last word at the end.
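The reverse helper comes from the earlier post on reversing a string and is not reproduced in this article; the sketch below is my reconstruction, inferred from how it is called in the source code (start inclusive, end exclusive, with mid precomputed as the stopping point):

```java
public class ReverseDemo {
    // Reverse arr between start (inclusive) and end (exclusive).
    // mid is precomputed as (start + end) / 2 so the loop stops at the middle.
    public static void reverse(char[] arr, int start, int mid, int end) {
        for (int i = start, j = end - 1; i < mid; i++, j--) {
            char tmp = arr[i];
            arr[i] = arr[j];
            arr[j] = tmp;
        }
    }

    public static void main(String[] args) {
        char[] s = "welcome".toCharArray();
        reverse(s, 0, s.length / 2, s.length);
        System.out.println(new String(s)); // emoclew
    }
}
```

Two pointers walk toward each other, so only half the range is visited, which is where the O(n/2) cost quoted above comes from.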

Source Code

	/**
	 * Algorithm to reverse the words of a string
	 * @param arr input character array
	 * @return array with its words reversed
	 */
	public char[] reverseWords(char[] arr) {
	    // reverse the complete string
	    reverse(arr, 0, arr.length / 2, arr.length);
	    // reverse each word of the reversed string
	    int wordIdx = 0;
	    int wordMidIdx = 0;
	    int prevWordLastIdx = 0;
	    // outer loop to track spaces
	    for (int sentenceIdx = 0; sentenceIdx < arr.length; sentenceIdx++) {
	        if (arr[sentenceIdx] != ' ')
	            continue;
	        wordIdx = prevWordLastIdx;
	        int wordLastIdx = sentenceIdx;
	        wordMidIdx = (sentenceIdx + wordIdx) / 2;
	        // reverse each word in the string
	        reverse(arr, wordIdx, wordMidIdx, wordLastIdx);
	        prevWordLastIdx = sentenceIdx + 1;
	    }
	    // reverse the last word (there is no trailing space after it)
	    wordIdx = prevWordLastIdx;
	    wordMidIdx = (arr.length + wordIdx) / 2;
	    reverse(arr, wordIdx, wordMidIdx, arr.length);
	    return arr;
	}

How to arrive at the time complexity?

Step #1: reverse the complete string, using the approach explained in our previous algorithm at O(n/2) complexity.

Step #2: keep a pointer that traverses through the string looking for spaces, which takes O(n).

Step #3: using the above spaces, reverse the words within the string at O(n/2). Let us see why this is O(n/2):

'welcome to coding algorithms' would become 'smhtirogla gnidoc ot emoclew'

Now take the string 'smhtirogla gnidoc ot emoclew' and reverse each word within
 it with the same logic explained in reversing the string. This will take about
 n/2 iterations in total; let us count them

At first iteration  
 'smhtirogla gnidoc ot emoclew' would become 'algorithms gnidoc ot emoclew' 
 at 5 iterations i.e (length of 'smhtirogla')/2)

At second iteration 
 'algorithms gnidoc ot emoclew' would become 'algorithms coding ot emoclew'
 at 3 iterations i.e (length of 'gnidoc')/2)

At third iteration  
 'algorithms coding ot emoclew' would become 'algorithms coding to emoclew'
 at 1 iterations i.e (length of 'ot')/2)

At fourth iteration 
 'algorithms coding to emoclew' would become 'algorithms coding to welcome'
 at 3 iterations i.e (length of 'emoclew')/2)

total length of the string = 28

total iterations = 12 swaps + 3 spaces = 15 (approximately half the length of the 
total string)

Total complexity from Step #1, Step #2 and Step #3 = O(n/2) + O(n) + O(n/2) = O(2n), which is linear.

Complete source code can be downloaded from here.

Algorithm to find if a linked list is cyclic

The best way of doing it is to take two pointers. For every step taken by one pointer, move the other pointer two steps. If there is no cycle in the linked list (i.e., the list is linear), then by the time the first pointer reaches the middle of the list, the second pointer will have reached the end of the list.

The following figure depicts a cycle in a singly linked list.

Linear Linked List Analysis

Node        : 1->200 | 2->300 | 3->400 | 4->500 | 5->600 | 6->700 | 7->NULL |
Node Address: 100    | 200    | 300    | 400    | 500    | 600    | 700     |
Iteration-1 : p1     |        | p2     |        |        |        |         |
Iteration-2 :        | p1     |        |        | p2     |        |         |
Iteration-3 :        |        | p1     |        |        |        | p2      |

Notice that by the time the first pointer reaches the middle of the list, the second pointer has reached the end of the list.

Cyclic Linked List Analysis

Node        : 1->200 | 2->300 | 3->400 | 4->500 | 5->600 | 6->700 | 7->300  |
Node Address: 100    | 200    | 300    | 400    | 500    | 600    | 700     |
Iteration-1 : p1     |        | p2     |        |        |        |         |
Iteration-2 :        | p1     |        |        | p2     |        |         |
Iteration-3 :        |        | p1     |        |        |        | p2      |
Iteration-4 :        |        |        | p2  p1 |        |        |         |

Notice that both pointers now refer to the same node.

Steps to solve this Algorithm:
– For every step taken by one pointer, move another pointer two steps.
– Continue this till either the second pointer or the second pointer's next is the same as the first pointer, which means the second pointer has come up behind the first pointer, indicating a cycle.

Source Code

	public static boolean findCyclicSLL(Node root) {
		if (root == null || root.next == null) {
			return false;
		}
		Node ptr1 = root;
		Node ptr2 = root.next.next;
		while (ptr1 != null && ptr2 != null) {
			if (ptr2 == ptr1 || ptr2.next == ptr1) {
				return true;
			}
			ptr1 = ptr1.next;
			ptr2 = (ptr2.next == null) ? null : ptr2.next.next;
		}
		return false;
	}
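A self-contained check of the two-pointer logic (the Node class and demo wrapper are mine), building the 7-node list from the analysis and then closing the cycle from node 7 back to node 3:

```java
public class CycleDemo {
    static class Node {
        int data;
        Node next;
        Node(int data) { this.data = data; }
    }

    // Slow pointer moves one step, fast pointer moves two steps per iteration;
    // if the fast pointer ever lands on or just behind the slow one, there is a cycle.
    public static boolean findCyclicSLL(Node root) {
        if (root == null || root.next == null) return false;
        Node ptr1 = root;
        Node ptr2 = root.next.next;
        while (ptr1 != null && ptr2 != null) {
            if (ptr2 == ptr1 || ptr2.next == ptr1) return true;
            ptr1 = ptr1.next;
            ptr2 = (ptr2.next == null) ? null : ptr2.next.next;
        }
        return false;
    }

    public static void main(String[] args) {
        // build 1 -> 2 -> ... -> 7
        Node[] nodes = new Node[7];
        for (int k = 0; k < 7; k++) nodes[k] = new Node(k + 1);
        for (int k = 0; k < 6; k++) nodes[k].next = nodes[k + 1];
        System.out.println(findCyclicSLL(nodes[0])); // false: linear list
        nodes[6].next = nodes[2];                    // point node 7 back to node 3
        System.out.println(findCyclicSLL(nodes[0])); // true: cyclic list
    }
}
```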

A small change in this same logic will lead us to another important trick:
How do you find the exact middle element of the list in O(n/2) complexity?

Steps to solve this Algorithm:
– For every step taken by one pointer, move another pointer two steps.
– Continue this till the second pointer is null, which means it has reached the end of the list; the first pointer will then be exactly at the middle of the list.

	public static Node findMiddleElementOfList(Node root) {
		if (root == null) {
			return null;
		}
		Node ptr1 = root;
		Node ptr2 = root.next;
		while (ptr2 != null && ptr2.next != null) {
			ptr1 = ptr1.next;
			ptr2 = ptr2.next.next;
		}
		// When ptr2 reaches the end, ptr1 is at the middle of the list
		return ptr1;
	}
