Decision Tree Lab
- (45%) Correctly implement the ID3 decision tree algorithm
- The algorithm should be able to handle unknown attributes.
- You do not need to handle real-valued attributes.
- Use standard information gain as your basic attribute evaluation metric; a sketch appears after this list. (Note that normal ID3 would usually augment information gain with gain ratio or some other mechanism to penalize statistically insignificant attribute splits. Otherwise, even with approaches like the pruning below, an SSN-style overfit (i.e., an attribute whose values are essentially random, leaving only one example in each partition) could still hurt us.)
- Use a simple data set that you can check by hand (like the lenses data or the homework) to test your algorithm. You should be able to get about 68% (61%-82%) predictive accuracy on lenses.
- Note: if particular combinations of feature values never appear in the training set, then during prediction we may encounter a combination of features that has no complete path through the decision tree. To solve this problem, use the C4.5 approach. The simplest way to implement it is during the tree-building phase: in addition to putting labels on each leaf node, keep a "just in case" label at each non-leaf node that records the majority class of all instances considered at that node. Then, if asked to predict for a set of features whose path dead-ends, return the "just in case" label at the node where the path stops (see the sketch below).
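As a reference for the metric and the "just in case" labels above, here is a minimal sketch, not a required interface: all names (`Node`, `id3`, `predict`) are illustrative, and instances are assumed to be dicts mapping attribute names to values. It computes Gain(S, A) = Entropy(S) - sum_v (|S_v|/|S|) * Entropy(S_v) and stores a majority label at every node for dead-end paths.

```python
from collections import Counter
from math import log2

def entropy(labels):
    # H(S) = -sum_c p(c) * log2(p(c)) over the class distribution
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def info_gain(instances, labels, attr):
    # Gain(S, A) = Entropy(S) - sum_v (|S_v| / |S|) * Entropy(S_v)
    partitions = {}
    for x, y in zip(instances, labels):
        partitions.setdefault(x[attr], []).append(y)
    gain = entropy(labels)
    for subset in partitions.values():
        gain -= (len(subset) / len(labels)) * entropy(subset)
    return gain

class Node:
    def __init__(self, majority):
        self.majority = majority  # "just in case" label for dead-end paths
        self.attr = None          # attribute split on at this node (None = leaf)
        self.children = {}        # attribute value -> child Node

def id3(instances, labels, attrs):
    node = Node(majority=Counter(labels).most_common(1)[0][0])
    if len(set(labels)) == 1 or not attrs:   # pure class, or no attributes left
        return node
    node.attr = max(attrs, key=lambda a: info_gain(instances, labels, a))
    partitions = {}
    for x, y in zip(instances, labels):
        partitions.setdefault(x[node.attr], []).append((x, y))
    remaining = [a for a in attrs if a != node.attr]
    for value, pairs in partitions.items():
        xs, ys = zip(*pairs)
        node.children[value] = id3(list(xs), list(ys), remaining)
    return node

def predict(node, x):
    while node.attr is not None:
        child = node.children.get(x[node.attr])
        if child is None:         # dead-end path: fall back to majority label
            return node.majority
        node = child
    return node.majority
```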
- (20%) You will use your ID3 algorithm to induce decision trees for the cars data set and the voting data set (note this is a different version of the voting dataset than previously used, with missing values "?" now reinserted).
- Do not use a stopping criterion; induce the tree as far as it can go (until classes are pure or there are no more data or attributes to split on).
- With a full tree you will often get 100% accuracy on the training set. (Why would you, and in what cases would you not? Discuss and answer this question in your report.)
- You will need to support unknown attributes in the voting data set.
- Use 10-fold CV on each data set to predict how well the models will do on novel data (check the command-line options); a cross-validation sketch follows this list.
- Report the training and test classification accuracy for each fold and then average the test accuracies to get your prediction.
- Create a table summarizing these accuracy results, and discuss what you observed.
- As a rough sanity check, typical decision tree accuracies for these data sets are: Cars: .90-.95, Vote: .92-.95.
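A hedged sketch of the 10-fold CV loop, reusing the `id3` and `predict` names from the sketch above; the `accuracy` helper and the seeded shuffle are illustrative choices, not requirements:

```python
import random

def accuracy(tree, xs, ys):
    return sum(predict(tree, x) == y for x, y in zip(xs, ys)) / len(ys)

def cross_validate(instances, labels, attrs, k=10, seed=0):
    order = list(range(len(labels)))
    random.Random(seed).shuffle(order)        # shuffle once, then deal into folds
    folds = [order[i::k] for i in range(k)]   # k roughly equal-sized folds
    test_accs = []
    for fold, test_idx in enumerate(folds):
        test_set = set(test_idx)
        train_idx = [j for j in order if j not in test_set]
        tree = id3([instances[j] for j in train_idx],
                   [labels[j] for j in train_idx], attrs)
        train_acc = accuracy(tree, [instances[j] for j in train_idx],
                             [labels[j] for j in train_idx])
        test_acc = accuracy(tree, [instances[j] for j in test_idx],
                            [labels[j] for j in test_idx])
        print(f"fold {fold}: train={train_acc:.3f}  test={test_acc:.3f}")
        test_accs.append(test_acc)
    return sum(test_accs) / k                 # averaged test accuracy = prediction
```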
- (10%) For each of the two problems, summarize in English what the decision tree has learned (i.e., look at the induced tree and describe what rules it has discovered to try to solve each task). If the tree is large, you can just discuss a few of the shallower attribute combinations and the most important decisions made high in the tree.
- (10%) How did you handle unknown attributes in the voting problem? Why did you choose this approach? (Do not simply throw out instances with unknown attributes.) One possible approach is sketched below.
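For illustration only (the choice and its justification are yours to make): one common option is to impute each "?" in the training data with the most frequent known value of that attribute among instances of the same class. At prediction time, where the label is unavailable, you might fall back to the overall majority value or treat "unknown" as its own attribute value. A sketch, assuming the dict-based instances of the earlier sketches:

```python
from collections import Counter

def impute_unknowns(instances, labels, unknown="?"):
    attrs = list(instances[0].keys())
    majority = {}   # (attribute, class) -> most frequent known value
    for a in attrs:
        for cls in set(labels):
            votes = Counter(x[a] for x, y in zip(instances, labels)
                            if y == cls and x[a] != unknown)
            if votes:   # may be empty if every value for this pair is unknown
                majority[(a, cls)] = votes.most_common(1)[0][0]
    return [{a: majority.get((a, y), x[a]) if x[a] == unknown else x[a]
             for a in attrs}
            for x, y in zip(instances, labels)]
```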
- (15%) Implement reduced error pruning to help avoid overfitting (sketched after the points below).
- You will need to take a validation set out of your training data to do this, while still having a test set to test your final accuracy.
- Create a table comparing the original trees created with no overfit avoidance in item 2 above and the trees you create with pruning. This table should compare a) the number of nodes (including leaf nodes) and tree depth of the final decision trees, and b) the generalization (test set) accuracy. (For the unpruned 10-fold CV models, just use their average values in the table.)
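A hedged sketch of reduced error pruning against the `Node` class and `accuracy` helper from the earlier sketches: repeatedly collapse an internal node into a leaf (its "just in case" majority label) whenever that does not reduce validation-set accuracy, stopping when no collapse survives. The `size_and_depth` helper is included because the table item above asks for node counts and depth; all names are illustrative.

```python
def internal_nodes(node):
    # post-order traversal: yield the deepest internal nodes first
    if node.attr is not None:
        for child in node.children.values():
            yield from internal_nodes(child)
        yield node

def prune(root, val_x, val_y):
    # greedy reduced error pruning against a held-out validation set
    improved = True
    while improved:
        improved = False
        base = accuracy(root, val_x, val_y)
        for node in internal_nodes(root):
            saved = (node.attr, node.children)
            node.attr, node.children = None, {}        # tentatively collapse to leaf
            if accuracy(root, val_x, val_y) >= base:   # no worse on validation: keep
                improved = True
                break
            node.attr, node.children = saved           # otherwise undo the collapse
    return root

def size_and_depth(node):
    # node count (leaves included) and depth, for the comparison table
    if node.attr is None:
        return 1, 1
    stats = [size_and_depth(c) for c in node.children.values()]
    return 1 + sum(s for s, _ in stats), 1 + max(d for _, d in stats)
```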
Note: to help you debug this and other projects, we have included some small examples and other hints with actual learned hypotheses so that you can compare your results and ensure your code is working properly. You may also discuss and compare results with classmates.
Acknowledgments
Thanks to Dr. Tony Martinez for help in designing the projects and requirements for this course.