We will always use ARFF files for our datasets, and we will make the assumption that all data will fit in RAM. Details on ARFF are found here. A collection of data sets already in the ARFF format can be found here.
A basic toolkit is provided in Java to help you get started implementing learning algorithms. To facilitate mentoring, I recommend implementing the projects in Eclipse (see video tutorial below).
The toolkit is intended as a starting place for working with machine learning algorithms. It provides the following functionality to run your algorithms:
Download the toolkit here.
Access the Iris dataset here.
MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E [EvaluationMethod] {[ExtraParameters]} [-N] [-R seed]
The -N flag normalizes the training and test data sets. The normalization minimum and maximum come from the training set.
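As a rough illustration of what normalizing with training-set bounds means, here is a minimal sketch. The class and method names (MinMaxNormalizerSketch, columnMinMax, normalize) are hypothetical and are not part of the toolkit, and the sketch assumes every column is continuous and is scaled as (x - min) / (max - min). The toolkit's own treatment of nominal attributes and labels may differ.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; not the toolkit's implementation.
class MinMaxNormalizerSketch {
    // Returns {min[], max[]} computed per column from the training rows only.
    static double[][] columnMinMax(List<List<Double>> train) {
        int cols = train.get(0).size();
        double[] min = new double[cols];
        double[] max = new double[cols];
        Arrays.fill(min, Double.POSITIVE_INFINITY);
        Arrays.fill(max, Double.NEGATIVE_INFINITY);
        for (List<Double> row : train) {
            for (int c = 0; c < cols; c++) {
                min[c] = Math.min(min[c], row.get(c));
                max[c] = Math.max(max[c], row.get(c));
            }
        }
        return new double[][] {min, max};
    }

    // Scales each value in place using the supplied bounds.
    // A constant column (max == min) is mapped to 0 to avoid dividing by zero.
    static void normalize(List<List<Double>> data, double[] min, double[] max) {
        for (List<Double> row : data) {
            for (int c = 0; c < row.size(); c++) {
                double range = max[c] - min[c];
                row.set(c, range == 0.0 ? 0.0 : (row.get(c) - min[c]) / range);
            }
        }
    }

    public static void main(String[] args) {
        List<List<Double>> train = new ArrayList<>();
        train.add(new ArrayList<>(Arrays.asList(2.0, 10.0)));
        train.add(new ArrayList<>(Arrays.asList(4.0, 30.0)));
        List<List<Double>> test = new ArrayList<>();
        test.add(new ArrayList<>(Arrays.asList(3.0, 40.0)));

        double[][] bounds = columnMinMax(train);   // bounds come from training data
        normalize(train, bounds[0], bounds[1]);
        normalize(test, bounds[0], bounds[1]);     // the same bounds are reused for the test data
        System.out.println(train);
        System.out.println(test);
    }
}
```

Because the bounds come only from the training set, normalized test values can fall outside the range 0 to 1, as the second column of the test row does here.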
The -R flag lets you pass in a seed for the random number generator. By default, the data set is shuffled differently each time you run the code. To reproduce the same shuffle, provide a fixed seed such as 1 or 2.
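The effect of a fixed seed can be sketched with the standard java.util.Random and Collections.shuffle. This is only an illustration of the idea. The class and method names are hypothetical, and the toolkit may shuffle its data differently internally.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not the toolkit's implementation.
class SeededShuffleSketch {
    // Builds the row indices 0..n-1 and shuffles them with the given generator.
    static List<Integer> shuffledIndices(int n, Random rand) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, rand);
        return indices;
    }

    public static void main(String[] args) {
        List<Integer> first = shuffledIndices(10, new Random(1));
        List<Integer> second = shuffledIndices(10, new Random(1));
        List<Integer> unseeded = shuffledIndices(10, new Random());

        // Two generators with the same seed produce the same order.
        System.out.println(first.equals(second)); // prints true
        // Without a seed, the order usually changes from run to run.
        System.out.println(unseeded);
    }
}
```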
Possible evaluation methods are:
Training (using same data set for training and testing):
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E training
Static Split (2 distinct datasets/ARFF files: one for training and one for testing):
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E static [TestARFF_File]
Random Split (1 dataset is split randomly providing x% for training and the rest for testing):
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E random [PercentageForTraining]
N-fold Cross-validation (1 dataset is partitioned into N partitions). Each partition is used once as the test set while the learning algorithm trains on the remaining partitions, and then the average accuracy is returned:
./MLSystemManager -L [LearningAlgorithm] -A [ARFF_File] -E cross [NumOfFolds]
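The random split and cross-validation partitioning can be sketched as follows. All names here (SplitSketch, randomSplit, foldBounds, allExcept) are hypothetical, the training share is assumed to be a fraction between 0 and 1, and the rows are assumed to be shuffled already. The toolkit's own rounding and fold boundaries may differ.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch; not the toolkit's implementation.
class SplitSketch {
    // Random split: the first part is the training set, the second the test set.
    static <T> List<List<T>> randomSplit(List<T> rows, double trainFraction) {
        int trainSize = (int) Math.round(rows.size() * trainFraction);
        List<List<T>> parts = new ArrayList<>();
        parts.add(new ArrayList<>(rows.subList(0, trainSize)));
        parts.add(new ArrayList<>(rows.subList(trainSize, rows.size())));
        return parts;
    }

    // Returns {begin, end} (end exclusive) of the test rows for one fold.
    static int[] foldBounds(int numRows, int numFolds, int fold) {
        int begin = fold * numRows / numFolds;
        int end = (fold + 1) * numRows / numFolds;
        return new int[] {begin, end};
    }

    // Training rows for a fold: everything outside [begin, end).
    static <T> List<T> allExcept(List<T> rows, int begin, int end) {
        List<T> train = new ArrayList<>(rows.subList(0, begin));
        train.addAll(rows.subList(end, rows.size()));
        return train;
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 150; i++) {
            rows.add(i);
        }
        Collections.shuffle(rows, new Random(1));

        List<List<Integer>> parts = randomSplit(rows, 0.7);
        System.out.println("random split: train=" + parts.get(0).size()
                + " test=" + parts.get(1).size());

        int numFolds = 10;
        for (int fold = 0; fold < numFolds; fold++) {
            int[] bounds = foldBounds(rows.size(), numFolds, fold);
            List<Integer> test = rows.subList(bounds[0], bounds[1]);
            List<Integer> train = allExcept(rows, bounds[0], bounds[1]);
            // A learner would be trained on `train` and scored on `test` here;
            // the N per-fold accuracies would then be averaged.
            System.out.println("fold " + fold + ": train=" + train.size()
                    + " test=" + test.size());
        }
    }
}
```

With 150 rows, this sketch gives a 105/45 random split at a fraction of 0.7, and ten folds of 15 test rows each.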
Here is an example with output:
./MLSystemManager -L dummy -A ../Research/dataSets/iris.arff -E training -N
Dataset name: iris
Dataset is normalized.
Number of instances: 150
Learning algorithm: dummy
Evaluation method: training
Accuracy on the training set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333
Accuracy on the test set:
Output classes accuracy:
Iris-setosa: 1
Iris-versicolor: 0
Iris-virginica: 0
Set accuracy: 0.333333
Time to train: 5.96046e-06 seconds (Note: If the simulation starts before midnight and ends after, the time will not be accurate)
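The accuracies in the sample output are consistent with every instance being predicted as Iris-setosa: Iris has 50 instances per class, so one class scores 1, the other two score 0, and the set accuracy is 50/150. A minimal sketch of that arithmetic follows. The class and method names are hypothetical, class labels are assumed to be encoded as the integers 0, 1, 2, and the toolkit's own accuracy code may be organized differently.

```java
// Hypothetical sketch; not the toolkit's implementation.
class AccuracySketch {
    // Fraction of correctly predicted instances within each true class.
    static double[] perClassAccuracy(int[] labels, int[] predictions, int numClasses) {
        int[] total = new int[numClasses];
        int[] correct = new int[numClasses];
        for (int i = 0; i < labels.length; i++) {
            total[labels[i]]++;
            if (labels[i] == predictions[i]) {
                correct[labels[i]]++;
            }
        }
        double[] accuracy = new double[numClasses];
        for (int c = 0; c < numClasses; c++) {
            accuracy[c] = total[c] == 0 ? 0.0 : (double) correct[c] / total[c];
        }
        return accuracy;
    }

    // Fraction of correctly predicted instances over the whole set.
    static double setAccuracy(int[] labels, int[] predictions) {
        int correct = 0;
        for (int i = 0; i < labels.length; i++) {
            if (labels[i] == predictions[i]) {
                correct++;
            }
        }
        return (double) correct / labels.length;
    }

    public static void main(String[] args) {
        int[] labels = new int[150];
        for (int i = 0; i < labels.length; i++) {
            labels[i] = i / 50;              // 50 instances each of classes 0, 1, 2
        }
        int[] predictions = new int[150];    // all zeros: always predict class 0

        double[] perClass = perClassAccuracy(labels, predictions, 3);
        for (int c = 0; c < perClass.length; c++) {
            System.out.println("class " + c + ": " + perClass[c]);
        }
        System.out.println("Set accuracy: " + setAccuracy(labels, predictions));
    }
}
```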
A DummyLearner class is provided that classifies all instances as the majority class (BaselineLearner). This class can be used as a template for creating your own learning algorithms. The instances are stored in an ArrayList of ArrayLists of doubles (Java version). When you create a new learning algorithm, your class must inherit from the Learner class, and you need to add the include line for it in the MLSystemManager file.
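To give a feel for the shape of a majority-class learner, here is a minimal sketch. It does not reproduce the toolkit's Learner or DummyLearner classes: SketchLearner is a hypothetical stand-in for the base class, and the train and predict signatures are assumptions, so check the toolkit source for the real method names and parameter types.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the toolkit's Learner base class; real signatures may differ.
abstract class SketchLearner {
    abstract void train(ArrayList<ArrayList<Double>> features, ArrayList<Double> labels);
    abstract double predict(ArrayList<Double> instance);
}

// Always predicts the label seen most often during training.
class MajorityClassLearner extends SketchLearner {
    private double majorityLabel;

    @Override
    void train(ArrayList<ArrayList<Double>> features, ArrayList<Double> labels) {
        Map<Double, Integer> counts = new HashMap<>();
        int best = 0;
        for (double label : labels) {
            int count = counts.merge(label, 1, Integer::sum);
            // On a tie, the label that reached the highest count first is kept.
            if (count > best) {
                best = count;
                majorityLabel = label;
            }
        }
    }

    @Override
    double predict(ArrayList<Double> instance) {
        return majorityLabel; // the features are ignored entirely
    }

    public static void main(String[] args) {
        ArrayList<ArrayList<Double>> features = new ArrayList<>();
        features.add(new ArrayList<>(Arrays.asList(5.1, 3.5)));
        features.add(new ArrayList<>(Arrays.asList(6.2, 2.9)));
        features.add(new ArrayList<>(Arrays.asList(5.9, 3.0)));
        features.add(new ArrayList<>(Arrays.asList(6.5, 3.2)));
        ArrayList<Double> labels = new ArrayList<>(Arrays.asList(0.0, 1.0, 1.0, 2.0));

        MajorityClassLearner learner = new MajorityClassLearner();
        learner.train(features, labels);
        System.out.println(learner.predict(features.get(0))); // prints 1.0
    }
}
```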