
Machine Learning Tool Kit

We will always use ARFF files for our datasets, and we will make the assumption that all data will fit in RAM. Details on ARFF are found here. A collection of data sets already in the ARFF format can be found here.

A basic tool kit is provided in Java and C++ to help you get started implementing learning algorithms. In order to facilitate mentoring, I strongly recommend implementing the projects in Eclipse (see video tutorial below). You are also welcome to code up your own toolkit or modify the source code made available here however you want. If you do so, here are a few things to keep in mind:

  • Some of the labs require a high number of computations. Using an interpreted language may result in longer runtimes. (I am not sure exactly how much longer, since I have never tried any of the labs in an interpreted language. Just be forewarned.)
  • You will get the best help from me if you do it in Java since that is the language that I will use.
  • Advice about dealing with discrete-valued data: A common way to represent each "instance" (a.k.a. "pattern") is to use a vector of numbers. For discrete values, these numbers could be an index to the "name" of that value in the metadata.
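The index-based representation of discrete values described above can be sketched as follows. This is only an illustration, not code from the provided toolkit: the class name `NominalEncodingSketch` and its methods are hypothetical, and the measurements in `main` are illustrative values.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch (not part of the provided toolkit): keeps the names of a
// discrete attribute's values as metadata and represents each value by its index.
class NominalEncodingSketch {
    // Metadata: the ordered value names of one discrete attribute.
    private final List<String> valueNames;

    NominalEncodingSketch(List<String> valueNames) {
        this.valueNames = new ArrayList<>(valueNames);
    }

    // Returns the index of the named value as a double, so it can be stored in
    // the same numeric vector as the continuous attributes.
    double encode(String name) {
        int index = valueNames.indexOf(name);
        if (index < 0) {
            throw new IllegalArgumentException("Unknown value: " + name);
        }
        return index;
    }

    // Recovers the value name from its stored numeric index.
    String decode(double index) {
        return valueNames.get((int) index);
    }

    public static void main(String[] args) {
        NominalEncodingSketch irisClass = new NominalEncodingSketch(
                Arrays.asList("Iris-setosa", "Iris-versicolor", "Iris-virginica"));
        // One instance: four continuous measurements followed by the encoded class.
        double[] instance = {5.1, 3.5, 1.4, 0.2, irisClass.encode("Iris-setosa")};
        System.out.println(Arrays.toString(instance));
        System.out.println(irisClass.decode(instance[4]));
    }
}
```

Keeping the names in the metadata rather than in each instance means every instance stays a plain vector of numbers, and the names are only looked up when results are reported.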

The toolkit is intended as a starting place for working with machine learning algorithms. It provides the following functionality to run your algorithms:

  • Parses and stores the ARFF file
  • Randomizes the instances in the ARFF file
  • Provides four evaluation methods (A more detailed description of these methods is found here):
    1. Training set method: The model is evaluated on the same data set that was used for training
    2. Static split test set method: Two distinct data sets are made available to the learning algorithm; one for training and one for testing
    3. Random split test set method: A single data set is made available to the learning algorithm and the data set is split such that x% of the instances are randomly selected for training and the remainder are used for testing, where you supply the value of x.
    4. N-fold cross-validation method: A single data set is made available to the learning algorithm which is partitioned into N equally sized subsets. Each subset is used once for evaluating the learning algorithm while the remaining instances are used for training. The results of the N runs are then averaged to provide the final accuracy estimate.
  • Parses command-line arguments
  • Normalizes attributes
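The N-fold cross-validation method listed above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the class `CrossValidationSketch` and its methods are hypothetical, and when the number of instances is not divisible by N this sketch lets fold sizes differ by one, which may not match how the toolkit handles that case.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch (not the toolkit's code) of N-fold partitioning.
class CrossValidationSketch {
    // Shuffles the indices 0..numInstances-1 and deals them round-robin into
    // numFolds folds, so fold sizes differ by at most one.
    static List<List<Integer>> foldIndices(int numInstances, int numFolds, long seed) {
        if (numFolds < 2 || numFolds > numInstances) {
            throw new IllegalArgumentException("numFolds must be in [2, numInstances]");
        }
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));

        List<List<Integer>> folds = new ArrayList<>();
        for (int f = 0; f < numFolds; f++) {
            folds.add(new ArrayList<>());
        }
        for (int i = 0; i < numInstances; i++) {
            folds.get(i % numFolds).add(indices.get(i));
        }
        return folds;
    }

    // Averages the per-fold accuracies into the final estimate.
    static double averageAccuracy(double[] foldAccuracies) {
        double sum = 0.0;
        for (double accuracy : foldAccuracies) {
            sum += accuracy;
        }
        return sum / foldAccuracies.length;
    }

    public static void main(String[] args) {
        int numInstances = 150;
        List<List<Integer>> folds = foldIndices(numInstances, 10, 1L);
        // Each fold serves once as the test set; all other instances are used for training.
        for (int f = 0; f < folds.size(); f++) {
            int testSize = folds.get(f).size();
            System.out.println("Fold " + f + ": test on " + testSize
                    + " instances, train on " + (numInstances - testSize));
        }
    }
}
```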

Build Instructions for the Java version

  1. Download the zip file here.
  2. Unzip the zip file
  3. javac *.java

Build Instructions for the C++ version

  1. Open the terminal
  2. wget https://www2.cose.isu.edu/~bodipaul/courses/f20/4478/resources/toolkitc.zip
  3. unzip toolkitc.zip
  4. cd toolkit/src/
  5. make opt
(Though we have not tested it, the C++ version should also work in Microsoft Visual C++. You will need to create a project solution. If you use Windows and have trouble, come see us for help.)

Toolkit Validation Instructions

  1. mkdir datasets
  2. cd datasets/
  3. wget https://www2.cose.isu.edu/~bodipaul/courses/f20/4478/resources/iris.arff
  4. cd ..
  5. ./bin/MLSystemManager -L baseline -A datasets/iris.arff -E training
You should see the results for a baseline classifier (33% accuracy on iris).

Alternate Toolkits

Students from previous semesters have ported the toolkit to various other languages. Below are links to the GitHub repositories for two of the more popular alternative languages, Python and C#. Please note that while previous students have successfully used both of these languages for the course, these ported versions are more likely to contain uncaught bugs. The professor will be better able to help you if you use the Java or C++ version of the toolkit.

Python: https://github.com/tooke7/toolkitPython

C#: https://github.com/tygill/MLToolkitCSharp

Video Tutorial for Java toolkit installation

Download the toolkit here.

Access the Iris dataset here.

Usage Instructions

MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E [EvaluationMethod] {[ExtraParameters]} [-N] [-R seed]

The -N flag normalizes the training and test data sets (the normalization minimum and maximum come from the training set).
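As a rough illustration of taking the normalization range from the training set, here is a minimal sketch. It assumes min-max scaling of continuous attributes to [0, 1], which is an assumption (the exact formula the toolkit uses is not stated here), and the class `NormalizationSketch` and its methods are hypothetical.

```java
import java.util.Arrays;

// Hypothetical sketch (not the toolkit's code): min-max scaling where the
// minimum and maximum of each column are taken from the training set only.
class NormalizationSketch {
    // Returns {mins, maxs}, the per-column minimum and maximum of the training data.
    static double[][] columnRanges(double[][] train) {
        int cols = train[0].length;
        double[] mins = new double[cols];
        double[] maxs = new double[cols];
        Arrays.fill(mins, Double.POSITIVE_INFINITY);
        Arrays.fill(maxs, Double.NEGATIVE_INFINITY);
        for (double[] row : train) {
            for (int c = 0; c < cols; c++) {
                mins[c] = Math.min(mins[c], row[c]);
                maxs[c] = Math.max(maxs[c], row[c]);
            }
        }
        return new double[][] {mins, maxs};
    }

    // Scales every column of data using the supplied training-set ranges.
    static double[][] normalize(double[][] data, double[][] ranges) {
        double[][] out = new double[data.length][];
        for (int r = 0; r < data.length; r++) {
            out[r] = new double[data[r].length];
            for (int c = 0; c < data[r].length; c++) {
                double span = ranges[1][c] - ranges[0][c];
                // A constant column has zero span; map it to 0 to avoid dividing by zero.
                out[r][c] = span == 0.0 ? 0.0 : (data[r][c] - ranges[0][c]) / span;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[][] train = {{1.0, 10.0}, {3.0, 20.0}, {5.0, 30.0}};
        double[][] test = {{2.0, 40.0}};
        double[][] ranges = columnRanges(train);
        System.out.println(Arrays.deepToString(normalize(train, ranges)));
        System.out.println(Arrays.deepToString(normalize(test, ranges))); // prints [[0.25, 1.5]]
    }
}
```

Because the range comes from the training set, a test value can fall outside [0, 1], as the second column of the test instance does here.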

The -R flag allows you to pass in a seed for the random number generator. By default, the data set is shuffled differently each time you run the code. To reproduce the same shuffle, provide a seed such as 1 or 2.
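The effect of a seed can be illustrated with the standard `java.util.Random` and `Collections.shuffle`. This is a sketch of the idea only; the class `SeededShuffleSketch` is hypothetical, and the toolkit's own shuffling code may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch (not the toolkit's code) of seeded versus unseeded shuffling.
class SeededShuffleSketch {
    // Returns a shuffled ordering of the indices 0..numInstances-1.
    // A null seed leaves the generator unseeded, so each run shuffles differently.
    static List<Integer> shuffledOrder(int numInstances, Long seed) {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < numInstances; i++) {
            order.add(i);
        }
        Random rng = (seed == null) ? new Random() : new Random(seed);
        Collections.shuffle(order, rng);
        return order;
    }

    public static void main(String[] args) {
        // The same seed reproduces the same ordering.
        System.out.println(shuffledOrder(10, 1L).equals(shuffledOrder(10, 1L)));
        // Without a seed, the ordering will usually change from run to run.
        System.out.println(shuffledOrder(10, null));
    }
}
```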

Possible evaluation methods are:

  • Training (using same data set for training and testing):

    ./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E training

  • Static Split (2 distinct datasets/ARFF files: one for training and one for testing):

    ./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E static [TestARFF_File]

  • Random Split (1 dataset is split randomly providing x% for training and the rest for testing):

    ./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E random [PercentageForTraining]

  • N-fold Cross-validation (1 dataset is partitioned into N partitions; the learning algorithm is evaluated on each partition and the average accuracy is returned):

    ./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E cross [NumOfFolds]

Here is an example with output:

./MLSystemManager -L dummy -A ../Research/dataSets/iris.arff -E training -N
Dataset name: iris
Dataset is normalized.
Number of instances: 150
Learning algorithm: dummy
Evaluation method: training

Accuracy on the training set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333

Accuracy on the test set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333

Time to train: 5.96046e-06 seconds (Note: If the simulation starts before midnight and ends after, the time will not be accurate)

A DummyLearner class is provided that classifies all instances as the majority class (BaselineLearner). This class can be used as a template for creating your own learning algorithms. The instances are stored in a vector of vectors of doubles (C++ version) or an ArrayList of ArrayLists of doubles (Java version). When you create a new learning algorithm, your class must inherit from the Learner class, and you need to add the corresponding include line to the MLSystemManager file.
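To make the majority-class idea concrete, here is a minimal sketch. It does not use the toolkit's actual Learner class, whose method names and signatures are not shown on this page; `SketchLearner` is a hypothetical stand-in, and `MajorityClassSketch` is an illustrative name. The labels in `main` assume three classes of 50 instances each, matching the 150-instance iris example above.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the toolkit's Learner base class; the real class's
// methods and signatures may differ.
abstract class SketchLearner {
    abstract void train(double[][] features, double[] labels);
    abstract double predict(double[] features);
}

// Ignores the features and always predicts the most common training label.
class MajorityClassSketch extends SketchLearner {
    private double majorityLabel;

    @Override
    void train(double[][] features, double[] labels) {
        Map<Double, Integer> counts = new HashMap<>();
        int best = 0;
        for (double label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            // On a tie, the label that reached the highest count first is kept.
            if (count > best) {
                best = count;
                majorityLabel = label;
            }
        }
    }

    @Override
    double predict(double[] features) {
        return majorityLabel;
    }

    // Fraction of instances whose predicted label equals the true label.
    static double accuracy(SketchLearner learner, double[][] features, double[] labels) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (learner.predict(features[i]) == labels[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        // 150 instances in three equally sized classes (labels 0, 1, 2).
        double[][] features = new double[150][4];
        double[] labels = new double[150];
        for (int i = 0; i < 150; i++) {
            labels[i] = i / 50;
        }
        MajorityClassSketch learner = new MajorityClassSketch();
        learner.train(features, labels);
        // With three balanced classes, accuracy is one third.
        System.out.println("Set accuracy: " + accuracy(learner, features, labels));
    }
}
```

With balanced classes, the tie is broken in favor of the first class encountered, which is consistent with the example output above, where only Iris-setosa is classified correctly.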

Acknowledgments

Thanks to Dr. Tony Martinez for the written tutorial and toolkit development. Thanks to Mike Brodie for the video tutorial.