To help debug your implementations, you may run them on this data set (don't shuffle or normalize it). Note that this is just the labor data set with an instance-id column added for showing results. (Do NOT use the id column or the output class column as feature data.) The results for 5-means, using the first 5 elements of the data set as the initial centroids, should be this. In HAC, perform just one merge (of the single closest pair of clusters) per iteration. The results for HAC with single link, run down to 5 clusters, should be this, and with complete link down to 5 clusters should be this.
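The one-merge-per-iteration rule above can be sketched as follows. This is a hypothetical illustration, not the course toolkit: the function name hac, the use of 1-D points with absolute-value distance, and passing Python's min/max as the linkage are all simplifying assumptions for clarity.

```python
def hac(points, k, linkage=min):
    """Agglomerative clustering sketch: linkage=min gives single link,
    linkage=max gives complete link. Stops at k clusters."""
    clusters = [[p] for p in points]      # start with every instance in its own cluster
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # linkage over all pairwise distances between the two clusters
                d = linkage(abs(a - b) for a in clusters[i] for b in clusters[j])
                # strict < keeps the earliest pair in list order on a tie,
                # matching the tie-breaking rule described below
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)    # merge ONLY the single closest pair per iteration
    return clusters
```

For example, hac([0.0, 1.0, 9.0, 10.0], 2) first merges 0.0 with 1.0, then 9.0 with 10.0, and stops once 2 clusters remain.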
For the sample files, we ignore missing values when calculating centroids and assign them a distance of 1 when determining total sum squared error. Suppose you had the following instances in a cluster:
?, 1.0, Round
?, 2.0, Square
?, 3.0, Round
Red, ?, ?
Red, ?, Square
The centroid value for the first attribute would be "Red", and the SSE would be 3 (each of the three missing values contributes a distance of 1, and the two "Red" values contribute 0). The centroid value for the second attribute would be 2, and the SSE would be 4 (squared errors of 1 + 0 + 1 for the known values, plus 1 for each of the two missing values). In case of a tie, as with the third attribute, choose the nominal value that appears first in the metadata list. So if the attribute were declared as @attribute Shape{"Round", "Square"}, then the centroid value for the third attribute would be Round and the SSE would be 3 (the two Square values mismatch for a distance of 1 each, plus 1 for the missing value). For other types of ties (a node or cluster with the same distance to another cluster, which should be rare), just go with the earliest cluster in your list. If all the instances in a cluster have "don't know" for one of the attributes, then use "don't know" in the centroid for that attribute.
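The rules above for one attribute of a cluster could be sketched like this. This is a hypothetical helper (the name centroid_and_sse and the use of '?' as the missing-value marker are assumptions, not the toolkit's API):

```python
def centroid_and_sse(column, nominal_values=None):
    """One attribute's values across a cluster ('?' = missing).
    nominal_values is the declared value order for a nominal attribute
    (used to break ties), or None for a continuous attribute."""
    known = [v for v in column if v != '?']
    if not known:
        # every instance has "don't know": centroid is "don't know";
        # each instance still contributes a distance of 1 to the SSE
        return '?', float(len(column))
    if nominal_values is None:
        center = sum(known) / len(known)              # continuous: mean of known values
        sse = sum((v - center) ** 2 for v in known)
    else:
        # nominal: most common known value; max() keeps the earliest
        # declared value on a tie, matching the rule above
        center = max(nominal_values, key=known.count)
        sse = sum(1.0 for v in known if v != center)  # mismatch = distance 1
    sse += column.count('?')                          # each missing value = distance 1
    return center, sse
```

Run on the columns of the worked example, this reproduces the values above: ("Red", 3) for the first attribute, (2.0, 4.0) for the second, and ("Round", 3) for the third.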
Note: To help you debug this and other projects, we have included some small examples and other hints with actual learned hypotheses, so that you can compare your code's results against them and ensure that your code is working properly. You may also discuss and compare results with classmates.