Paul Bodily

Machine Learning Tool Kit

We will always use ARFF files for our datasets, and we will assume that all data fit in RAM. Details on ARFF are found here. A collection of data sets already in the ARFF format can be found here.

A basic tool kit is provided in Java to help you get started implementing learning algorithms. In order to facilitate mentoring, I strongly recommend implementing the projects in Eclipse (see video tutorial below).

The toolkit is intended as a starting place for working with machine learning algorithms. It provides the following functionality to run your algorithms:

Video Tutorial for Java toolkit installation

Download the toolkit here.

Access the Iris dataset here.

Usage Instructions

MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E [EvaluationMethod] {[ExtraParameters]} [-N] [-R seed]

The -N flag normalizes the training and test data sets. The normalization minimum and maximum are taken from the training set.
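As an illustration of what normalizing with training-set bounds means, here is a minimal sketch of min-max scaling. It is not the toolkit's own code: the class and method names (MinMaxNormalizationSketch, columnRange, normalizeColumn) are hypothetical, and the min-max formula itself is an assumption about how the minimum and maximum are applied.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class MinMaxNormalizationSketch {

    /** Returns {min, max} of one column, computed over the training rows only. */
    static double[] columnRange(List<List<Double>> train, int col) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (List<Double> row : train) {
            min = Math.min(min, row.get(col));
            max = Math.max(max, row.get(col));
        }
        return new double[] {min, max};
    }

    /** Rescales one column of the given rows in place using a precomputed range. */
    static void normalizeColumn(List<List<Double>> rows, int col, double min, double max) {
        double span = max - min;
        for (List<Double> row : rows) {
            // A constant column has zero span; map it to 0 to avoid dividing by zero.
            double scaled = (span == 0.0) ? 0.0 : (row.get(col) - min) / span;
            row.set(col, scaled);
        }
    }

    public static void main(String[] args) {
        List<List<Double>> train = new ArrayList<>();
        train.add(new ArrayList<>(Arrays.asList(2.0)));
        train.add(new ArrayList<>(Arrays.asList(6.0)));
        List<List<Double>> test = new ArrayList<>();
        test.add(new ArrayList<>(Arrays.asList(4.0)));
        test.add(new ArrayList<>(Arrays.asList(8.0)));

        // The range comes from the training rows and is reused for the test rows.
        double[] range = columnRange(train, 0);
        normalizeColumn(train, 0, range[0], range[1]);
        normalizeColumn(test, 0, range[0], range[1]);
        System.out.println(train); // [[0.0], [1.0]]
        System.out.println(test);  // [[0.5], [1.5]]
    }
}
```

Because the bounds come from the training set alone, a test value outside the training range (8.0 here) lands outside [0, 1].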

The -R flag allows you to pass in a seed for the random number generator. By default, the data set is shuffled differently each time you run the code. If you wish to reproduce the same shuffle, provide a seed such as 1 or 2.
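The effect of a seed can be sketched with the standard java.util.Random and Collections.shuffle. This is only an illustration of seeded shuffling in general; the class name SeededShuffleSketch and the helper shuffledIndices are hypothetical, and the toolkit may shuffle its data differently.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SeededShuffleSketch {

    /** Returns the row indices 0..n-1 in shuffled order; a non-null seed makes the order repeatable. */
    static List<Integer> shuffledIndices(int n, Long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        // Without a seed, Random picks its own, so every run shuffles differently.
        Random rng = (seed == null) ? new Random() : new Random(seed);
        Collections.shuffle(indices, rng);
        return indices;
    }

    public static void main(String[] args) {
        // The same seed yields the same order on every call.
        System.out.println(shuffledIndices(5, 1L).equals(shuffledIndices(5, 1L))); // true
    }
}
```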

Possible evaluation methods are:

Here is an example with output:

./MLSystemManager -L dummy -A ../Research/dataSets/iris.arff -E training -N
Dataset name: iris
Dataset is normalized.
Number of instances: 150
Learning algorithm: dummy
Evaluation method: training

Accuracy on the training set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333

Accuracy on the test set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333

Time to train: 5.96046e-06 seconds (Note: If the simulation starts before midnight and ends after, the time will not be accurate)
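The accuracy figures in this output can be reproduced by hand. Iris has 150 instances, 50 per class, and the dummy learner labels every instance Iris-setosa, so 50 of 150 predictions are correct (0.333333). The sketch below does that arithmetic. It assumes that "output classes accuracy" means the fraction of each class's instances that were predicted correctly; the class and method names are hypothetical, not part of the toolkit.

```java
import java.util.Arrays;

public class AccuracySketch {

    /** Fraction of instances whose predicted label equals the target label. */
    static double setAccuracy(int[] targets, int[] predictions) {
        int correct = 0;
        for (int i = 0; i < targets.length; i++) {
            if (targets[i] == predictions[i]) {
                correct++;
            }
        }
        return (double) correct / targets.length;
    }

    /** Of the instances whose target is class c, the fraction predicted as c (assumed definition). */
    static double[] perClassAccuracy(int[] targets, int[] predictions, int numClasses) {
        int[] total = new int[numClasses];
        int[] correct = new int[numClasses];
        for (int i = 0; i < targets.length; i++) {
            total[targets[i]]++;
            if (targets[i] == predictions[i]) {
                correct[targets[i]]++;
            }
        }
        double[] accuracy = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            accuracy[c] = (total[c] == 0) ? 0.0 : (double) correct[c] / total[c];
        }
        return accuracy;
    }

    public static void main(String[] args) {
        // Iris: 150 instances, 50 per class (0 = setosa, 1 = versicolor, 2 = virginica).
        int[] targets = new int[150];
        for (int i = 0; i < 150; i++) {
            targets[i] = i / 50;
        }
        int[] predictions = new int[150]; // all zeros: always predict class 0

        System.out.println(Arrays.toString(perClassAccuracy(targets, predictions, 3))); // [1.0, 0.0, 0.0]
        System.out.println(setAccuracy(targets, predictions)); // about 0.333333
    }
}
```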

A DummyLearner class is provided that classifies every instance as the majority class (BaselineLearner). It can be used as a template for creating your own learning algorithms. The instances are stored in an ArrayList of ArrayLists of doubles. A new learning algorithm must inherit from the Learner class, and you need to add the include line for it in the MLSystemManager file.
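To make the majority-class idea concrete, here is a minimal, self-contained sketch. It does not use the toolkit's actual Learner class, whose method signatures are not shown here. SketchLearner is a hypothetical stand-in, and the train and predict signatures are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class MajorityClassSketch {

    /** Hypothetical stand-in for the toolkit's Learner class; the real signatures may differ. */
    public abstract static class SketchLearner {
        public abstract void train(ArrayList<ArrayList<Double>> features, ArrayList<Double> labels);
        public abstract double predict(ArrayList<Double> features);
    }

    /** Ignores the features and always predicts the most common training label. */
    public static class MajorityLearner extends SketchLearner {
        private double majorityLabel;

        @Override
        public void train(ArrayList<ArrayList<Double>> features, ArrayList<Double> labels) {
            Map<Double, Integer> counts = new HashMap<>();
            int best = 0;
            for (double label : labels) {
                // Increment this label's count and keep track of the running maximum.
                int count = counts.merge(label, 1, Integer::sum);
                if (count > best) {
                    best = count;
                    majorityLabel = label;
                }
            }
        }

        @Override
        public double predict(ArrayList<Double> features) {
            return majorityLabel;
        }
    }

    public static void main(String[] args) {
        // Instances stored as an ArrayList of ArrayLists of doubles, one inner list per row.
        ArrayList<ArrayList<Double>> features = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            features.add(new ArrayList<>(Arrays.asList((double) i)));
        }
        ArrayList<Double> labels = new ArrayList<>(Arrays.asList(0.0, 1.0, 1.0));

        MajorityLearner learner = new MajorityLearner();
        learner.train(features, labels);
        System.out.println(learner.predict(features.get(0))); // 1.0
    }
}
```

When two classes tie, this sketch keeps whichever label reached the top count first. That is one arbitrary tie-breaking choice, not necessarily the one the toolkit makes.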

Acknowledgments

Thanks to Dr. Tony Martinez for the written tutorial and toolkit development. Thanks to Mike Brodie for the video tutorial.