ID3 Implementation of Decision Trees
November 12, 2015
There are different algorithms for building decision trees. The major ones are:
- ID3: Iterative Dichotomiser 3, the first of Ross Quinlan's decision tree algorithms.
- C4.5: An improved version of the ID3 algorithm that addresses several of its limitations:
  - Accepts continuous features (ID3 handles only discrete ones)
  - Uses normalized information gain (gain ratio) as the splitting criterion
  - Handles missing feature values
  - Prunes the tree to reduce overfitting
  - Allows different weights to be applied to the features
- CART: Classification and Regression Trees are similar to C4.5 in many ways, except that these trees are built on a numeric splitting criterion.
We will look at a basic implementation of ID3 now and extend it from here going forward.
INPUTS
Features, training and testing data sets are taken from https://archive.ics.uci.edu/ml/datasets/Hepatitis
FEATURES (sample)
Total: 22
// A listing of the features and their possible values. All features are binary.
age_lte_30 t f
age_betw_31_50 t f
age_gt_50 t f
sex m f
TRAINING DATA (Sample)
Total: 102
trainPatient1 survived t f f f n y y y y n y y y y y t t t f f f n
trainPatient2 survived f t f m n y n y y n y y y y y t f t f f t n
trainPatient3 survived f f t m y y n y y y y y y y y t t t f f f n
trainPatient4 survived f t f m y n y y y y y y y y y t t t f f f n
TESTING DATA (Sample)
Total: 52
testPatient1 survived f t f m y y y y y y y y y y y t t f t f f n
testPatient2 survived f f t m n n y y y n n y y y y t f f f t t y
testPatient3 survived t f f f y y n n y y n y y y y t f f f t t n
testPatient4 died f t f m y y n y y y n n y n n t f f t f t y
testPatient5 survived f f t m n y n n n y n n n y n f f f f t f y
Implementation of the ID3 ALGORITHM
- Load the data: Load the features, training and testing data (a loading sketch follows this list).
- Build the Decision Tree
- Test the Decision Tree
- Print the constructed Decision Tree
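The "Load the data" step reads the three whitespace-separated files shown under INPUTS. Below is a minimal loading sketch; the Feature and Instance holder classes and the parsing are simplified stand-ins for the Features/Instances classes used in the actual repository, and they assume exactly the line formats shown in the samples above.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class Feature {
    String name;
    String[] options; // e.g. {"t", "f"} or {"m", "f"}
    Feature(String name, String[] options) { this.name = name; this.options = options; }
}

class Instance {
    String id;        // e.g. "trainPatient1"
    String label;     // e.g. "survived" or "died"
    String[] values;  // one value per feature, in the same order as the features file
    Instance(String id, String label, String[] values) { this.id = id; this.label = label; this.values = values; }
}

public class DataLoader {

    // Each feature line looks like: <name> <option1> <option2>
    static List<Feature> loadFeatures(String path) throws IOException {
        List<Feature> features = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] t = line.trim().split("\\s+");
            if (line.startsWith("//") || t.length < 3) continue; // skip comments and the "Total:" line
            features.add(new Feature(t[0], new String[]{t[1], t[2]}));
        }
        return features;
    }

    // Each example line looks like: <id> <label> <value1> ... <valueN>
    static List<Instance> loadInstances(String path) throws IOException {
        List<Instance> examples = new ArrayList<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            String[] t = line.trim().split("\\s+");
            if (line.startsWith("//") || t.length < 3) continue;
            examples.add(new Instance(t[0], t[1], Arrays.copyOfRange(t, 2, t.length)));
        }
        return examples;
    }
}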
Build the Decision Tree
ALGORITHM
- Check for the base cases.
- For each attribute, find the information gain and choose the attribute with maximum info gain / minimum entropy (aka the best attribute).
- Create a decision node on the best attribute chosen in the above step.
- Recursively run the above 3 steps on the remaining attributes.

CODE

public DTNode buildDecisionTree(Features features, Instances trainExamples, String[] outputLabels) {
    if (trainExamples.getExamples().size() == 0 || features.getFeatures().size() == 0) {
        return null;
    }

    Map<String, Integer> outputLabelCt = getOutputLabelDistribution(features.getFeatures(), trainExamples.getExamples(), outputLabels);
    int negCt = outputLabelCt.get(outputLabels[0]);
    int posCt = outputLabelCt.get(outputLabels[1]);
    String debug = posCt + "+ve/" + negCt + "-ve";

    String outputLabel = outputLabels[1];
    if (posCt > negCt) {
        outputLabel = outputLabels[0];
    }

    if (posCt == 0) {
        DTNode node = new DTNode();
        node.setNodeType(DecisionTreeNodeType.OUTPUT_LEAF_NODE);
        node.setLog(outputLabel + "#ONLYPOS#" + debug);
        node.setOutputLabel(outputLabel);
        return node;
    }

    if (negCt == 0) {
        DTNode node = new DTNode();
        node.setNodeType(DecisionTreeNodeType.OUTPUT_LEAF_NODE);
        node.setLog(outputLabel + "#ONLYNEG#" + debug);
        node.setOutputLabel(outputLabel);
        return node;
    }

    Feature bestFeature = getBestFeature(features.getFeatures(), trainExamples.getExamples(), outputLabels, outputLabelCt);
    Features remainingFtrs = getRemainingFtrs(features.getFeatures(), bestFeature);
    //System.out.println("remainingFtrs size "+remainingFtrs.getFeatures().size());

    // build a node based on the best feature
    DTNode root = new DTNode();
    root.setNodeType(DecisionTreeNodeType.FEATURE_NODE);
    root.setLog("#Normal#" + debug);
    root.setFeature(bestFeature);

    // traverse left side and construct
    //System.out.println("traversing left");
    Instances filteredExOnBestFtr = getFilteredExamplesOnBestFtr(trainExamples.getExamples(), bestFeature, bestFeature.getfOptions()[0]);
    root.setLeftNode(buildDecisionTree(remainingFtrs, filteredExOnBestFtr, outputLabels));

    // traverse right side and construct
    //System.out.println("traversing right");
    filteredExOnBestFtr = getFilteredExamplesOnBestFtr(trainExamples.getExamples(), bestFeature, bestFeature.getfOptions()[1]);
    root.setRightNode(buildDecisionTree(remainingFtrs, filteredExOnBestFtr, outputLabels));
    //System.out.println("end");

    // return root of the tree
    return root;
}
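The getBestFeature call above is where ID3 does its real work: compute the entropy of the current example set, compute the information gain of splitting on each remaining feature, and pick the feature with the maximum gain (Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-), and Gain(S, A) = Entropy(S) - sum over options v of (|Sv|/|S|) * Entropy(Sv)). The repository's implementation is not reproduced here; the sketch below shows the standard calculation using the simple Feature/Instance holders from the loading sketch, where featureIdx is the feature's column in each example.

import java.util.ArrayList;
import java.util.List;

public class InfoGain {

    // Entropy of a set of examples with respect to the two output labels:
    // Entropy(S) = -p(+) * log2 p(+) - p(-) * log2 p(-)
    static double entropy(List<Instance> examples, String[] outputLabels) {
        if (examples.isEmpty()) return 0.0;
        double pos = 0;
        for (Instance e : examples) {
            if (e.label.equals(outputLabels[1])) pos++;
        }
        double pPos = pos / examples.size();
        double pNeg = 1.0 - pPos;
        double h = 0.0;
        if (pPos > 0) h -= pPos * (Math.log(pPos) / Math.log(2));
        if (pNeg > 0) h -= pNeg * (Math.log(pNeg) / Math.log(2));
        return h;
    }

    // Information gain of splitting on the feature stored in column featureIdx of each example:
    // Gain(S, A) = Entropy(S) - sum over options v of (|Sv| / |S|) * Entropy(Sv)
    static double informationGain(List<Instance> examples, Feature feature, int featureIdx,
                                  String[] outputLabels) {
        double gain = entropy(examples, outputLabels);
        for (String option : feature.options) {
            List<Instance> subset = new ArrayList<>();
            for (Instance e : examples) {
                if (e.values[featureIdx].equals(option)) subset.add(e);
            }
            gain -= ((double) subset.size() / examples.size()) * entropy(subset, outputLabels);
        }
        return gain;
    }

    // The "best attribute" is simply the feature with maximum information gain.
    static int bestFeatureIndex(List<Instance> examples, List<Feature> features, String[] outputLabels) {
        int best = -1;
        double bestGain = -1.0;
        for (int i = 0; i < features.size(); i++) {
            double g = informationGain(examples, features.get(i), i, outputLabels);
            if (g > bestGain) { bestGain = g; best = i; }
        }
        return best;
    }
}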
Test the Decision Tree: Tests the decision tree built in the previous step and prints the accuracy (in our implementation the accuracy is around 92%).
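Testing is a plain tree walk: for each test example, start at the root, follow the left or right branch according to the example's value for the node's feature, and compare the leaf's output label with the example's actual label. Below is a minimal sketch, assuming DTNode getters that mirror the setters used above and reusing the simple Feature/Instance holders from the loading sketch.

import java.util.Arrays;
import java.util.List;

public class DTTester {

    // Classify one example by walking the tree from the root down to a leaf.
    static String classify(DTNode node, Instance example, List<Feature> allFeatures) {
        while (node.getNodeType() == DecisionTreeNodeType.FEATURE_NODE) {
            Feature f = node.getFeature();
            String value = example.values[featureIndexOf(f, allFeatures)];
            // buildDecisionTree used the first feature option for the left subtree and
            // the second for the right subtree, so follow the same convention here.
            DTNode next = value.equals(f.options[0]) ? node.getLeftNode() : node.getRightNode();
            if (next == null) break; // no training examples fell on this branch; stop at the current node
            node = next;
        }
        return node.getOutputLabel();
    }

    // Classify every test example, log the failures and return the accuracy in percent.
    static double accuracy(DTNode root, List<Instance> testExamples, List<Feature> allFeatures) {
        int correct = 0;
        for (Instance e : testExamples) {
            if (e.label.equals(classify(root, e, allFeatures))) {
                correct++;
            } else {
                System.out.println("Failed example is " + e.id + ";" + e.label + ";" + Arrays.toString(e.values));
            }
        }
        return 100.0 * correct / testExamples.size();
    }

    // Position of a feature in the full feature list, which is also its column in each example.
    static int featureIndexOf(Feature f, List<Feature> allFeatures) {
        for (int i = 0; i < allFeatures.size(); i++) {
            if (allFeatures.get(i).name.equals(f.name)) return i;
        }
        return -1;
    }
}

Running this over the 52 test examples produces output like the following.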
Successfully predicted testPatients: 48
Failed example is testPatient5;survived;[f, f, t, m, n, y, n, n, n, y, n, n, n, y, n, f, f, f, f, t, f, y]
Failed example is testPatient19;died;[f, t, f, m, n, n, n, n, y, y, y, y, n, y, y, t, t, f, t, f, t, y]
Failed example is testPatient24;survived;[f, f, t, m, y, y, n, y, y, y, n, n, n, y, n, t, f, t, f, f, t, y]
Failed example is testPatient34;survived;[f, t, f, m, n, y, n, n, n, y, y, y, y, y, y, t, f, f, f, t, f, y]
Number of failed examples: 4, out of total test examples: 52
Accuracy achieved is 92.3076923076923%
OUTPUT Decision Tree
│       ┌── survived#ONLYNEG#1+ve/0-ve
│   ┌── histology#Normal#1+ve/6-ve
│   │   └── died#ONLYPOS#0+ve/6-ve
└── varices#Normal#88+ve/14-ve
    │               ┌── survived#ONLYNEG#4+ve/0-ve
    │           ┌── sex#Normal#9+ve/5-ve
    │           │   │   ┌── died#ONLYPOS#0+ve/4-ve
    │           │   └── sgot_lte_59#Normal#5+ve/5-ve
    │           │       │       ┌── died#ONLYPOS#0+ve/1-ve
    │           │       │   ┌── age_betw_31_50#Normal#1+ve/1-ve
    │           │       │   │   └── survived#ONLYNEG#1+ve/0-ve
    │           │       └── anorexia#Normal#5+ve/1-ve
    │           │           └── survived#ONLYNEG#4+ve/0-ve
    │       ┌── alk_phosphate_lte_127#Normal#35+ve/8-ve
    │       │       ┌── survived#ONLYNEG#3+ve/0-ve
    │       │   ┌── antivirals#Normal#5+ve/3-ve
    │       │   │       ┌── survived#ONLYNEG#2+ve/0-ve
    │       │   │   │   ┌── sgot_gt_156#Normal#2+ve/1-ve
    │       │   │   │   │   └── died#ONLYPOS#0+ve/1-ve
    │       │   │   └── liver_big#Normal#2+ve/3-ve
    │       │   │       └── died#ONLYPOS#0+ve/2-ve
    │       └── spiders#Normal#26+ve/3-ve
    │           └── survived#ONLYNEG#21+ve/0-ve
    │   ┌── age_lte_30#Normal#48+ve/8-ve
    │   │   └── survived#ONLYNEG#13+ve/0-ve
    └── fatigue#Normal#87+ve/8-ve
        └── survived#ONLYNEG#39+ve/0-ve
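The sideways layout above, with one child subtree printed above its parent and the other below it, can be produced by a short recursive printer along the lines of the sketch below. This is not the repository's actual print routine; it assumes the same DTNode getters and Feature holder as the earlier sketches, and which child ends up on top simply depends on the order of the two recursive calls.

public class DTPrinter {

    // Prints the right subtree above the node, then the node, then the left subtree below it,
    // adding "│" connectors as it recurses; this yields the sideways layout shown above.
    static void printTree(DTNode node, String prefix, boolean isTail) {
        if (node == null) return;
        printTree(node.getRightNode(), prefix + (isTail ? "│   " : "    "), false);
        String label = (node.getNodeType() == DecisionTreeNodeType.FEATURE_NODE)
                ? node.getFeature().name + node.getLog()  // e.g. "varices#Normal#88+ve/14-ve"
                : node.getLog();                          // leaf logs already carry the output label
        System.out.println(prefix + (isTail ? "└── " : "┌── ") + label);
        printTree(node.getLeftNode(), prefix + (isTail ? "    " : "│   "), true);
    }

    static void printTree(DTNode root) {
        printTree(root, "", true);
    }
}

Called as printTree(root) on the tree built earlier, this prints it in the style shown above.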
Complete source code can be downloaded at
https://github.com/ntallapa12/decision_tree
References:
https://en.wikipedia.org/wiki/ID3_algorithm
http://www.drdobbs.com/web-development/classifying-text-with-id3-and-c45/184410304
http://stackoverflow.com/questions/9979461/different-decision-tree-algorithms-with-comparison-of-complexity-or-performance