
Nearest Neighbor Lab

  1. (40%) Implement the k-nearest neighbor algorithm and the k-nearest neighbor regression algorithm, including optional distance weighting for both algorithms. Unless specified otherwise, use the Euclidean distance metric. Attach your source code. (A minimal sketch of the core prediction step appears after this list.)
  2. (15%) Use the k nearest neighbor algorithm (without distance weighting) for the magic telescope problem using this training set and this test set (note the toolkit has an option for a static split of training and test sets).
    • Try it with k=3 both with normalization (input features scaled to the range 0 to 1) and without normalization, and discuss the accuracy results on the test set. (A min-max normalization helper is sketched after this list.)
    • For the rest of the experiments use only normalized data.
    • With just the normalized training set as your data, graph classification accuracy on the test set with odd values of k from 1 to 15. Which value of k is the best in terms of classification accuracy?
    • As a rough sanity check, typical k-nn accuracies for the magic telescope data set are 75-85%.
  3. (15%) Use the regression variation of your algorithm (without distance weighting) for the housing price prediction problem using this training set and this test set.
    • Report mean squared error (MSE) on the test set as your accuracy metric for this case. (An MSE helper is included in the sketches after this list.)
    • Experiment using odd values of k from 1 to 15. Which value of k is the best?
  4. (15%) Repeat your experiments for magic telescope and housing using distance-weighted (inverse of distance squared) voting and discuss your results.
  5. (15%) Use the k nearest neighbor algorithm to solve the credit-approval (credit-a) data set.
    • Note that this set has both continuous and nominal attributes, together with “don't know” (missing) values.
    • Implement and justify a distance metric that supports continuous, nominal, and “don't know” attribute values (you need to handle don't-knows with the distance metric, not by imputing a value). One such metric is sketched after this list.
    • Use and document your own choice for k, training/test split, etc.
    • If you're curious what distance metrics others have used, check this out.
    • As a rough sanity check, typical k-nn accuracies for the credit data set are 70-80%.
  6. Note: To help you debug this and other projects, we have included some small examples and other hints with actual learned hypotheses so that you can compare your code's results against them and ensure your code is working properly. You may also discuss and compare results with classmates.
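
For items 1 and 4, here is a minimal sketch of the prediction step, assuming the data are already loaded into NumPy arrays. The function and parameter names (knn_predict, weighted, regression) are illustrative choices, not part of the toolkit.

```python
import numpy as np

def knn_predict(train_X, train_y, query, k=3, weighted=False, regression=False):
    """Predict the label (classification) or value (regression) of one query point."""
    # Euclidean distance from the query to every training instance.
    dists = np.sqrt(((train_X - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest training points

    if weighted:
        # Inverse-distance-squared weights; the small constant guards against
        # division by zero when a training point matches the query exactly.
        weights = 1.0 / (dists[nearest] ** 2 + 1e-8)
    else:
        weights = np.ones(len(nearest))

    if regression:
        # Distance-weighted (or plain) mean of the neighbors' target values.
        return float(np.dot(weights, train_y[nearest]) / weights.sum())

    # Classification: each neighbor votes with its weight; return the heaviest class.
    votes = {}
    for idx, w in zip(nearest, weights):
        votes[train_y[idx]] = votes.get(train_y[idx], 0.0) + w
    return max(votes, key=votes.get)
```

Looping this over every test instance and comparing predictions to the true labels gives the classification accuracy (or the predictions fed into MSE) reported in the experiments.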
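
For the normalization in item 2 and the MSE metric in item 3, helpers along the following lines would work; scaling the test set with the training set's minimum and maximum is one reasonable choice, and the names here are again only illustrative.

```python
import numpy as np

def minmax_normalize(train_X, test_X):
    """Scale each input feature to [0, 1] using the training set's min and max."""
    lo, hi = train_X.min(axis=0), train_X.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)   # avoid dividing by zero on constant columns
    return (train_X - lo) / span, (test_X - lo) / span

def mean_squared_error(y_true, y_pred):
    """MSE on the test set, the accuracy metric for the housing regression task."""
    y_true, y_pred = np.asarray(y_true, dtype=float), np.asarray(y_pred, dtype=float)
    return float(((y_true - y_pred) ** 2).mean())
```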
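
For item 5, one metric that is straightforward to justify is a HEOM-style (Heterogeneous Euclidean-Overlap Metric) distance: continuous attributes contribute a range-normalized difference, nominal attributes contribute 0 or 1, and any comparison involving a don't-know value contributes the maximum per-attribute distance of 1, so missing values are handled by the metric itself rather than imputed. The sketch below assumes missing values are encoded with whatever sentinel is passed as unknown (shown here as "?"); adjust that to match how your data loader represents them.

```python
import math

def heom_distance(a, b, is_nominal, ranges, unknown="?"):
    """
    HEOM-style distance between two instances a and b.
      is_nominal[j] -- True if attribute j is nominal
      ranges[j]     -- max minus min of attribute j over the training set (continuous only)
      unknown       -- sentinel used for "don't know" values
    """
    total = 0.0
    for j in range(len(a)):
        if a[j] == unknown or b[j] == unknown:
            d = 1.0                              # maximal per-attribute distance for missing values
        elif is_nominal[j]:
            d = 0.0 if a[j] == b[j] else 1.0     # simple overlap metric for nominal attributes
        else:
            d = abs(float(a[j]) - float(b[j])) / ranges[j] if ranges[j] else 0.0
        total += d * d
    return math.sqrt(total)
```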

Acknowledgments

Thanks to Dr. Tony Martinez for help in designing the projects and requirements for this course.