Clustering Lab
- (55%) Implement the k-means clustering algorithm OR the HAC (Hierarchical Agglomerative Clustering) algorithm.
- Attach your source code.
- Use Euclidean distance for continuous attributes and (0,1) distances for nominal and unknown attributes (e.g., matching nominals have distance 0 else 1, unknown attributes have distance 1).
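One possible transcription of this mixed-attribute distance rule, as a sketch (the `is_nominal` flag and the use of `None` to mark unknown values are illustrative choices, not part of the spec):

```python
import math

def distance(a, b, is_nominal):
    """Distance between two instances under the rules above (a sketch).
    is_nominal[i] is True for nominal columns; None marks an unknown value."""
    sq = 0.0
    for x, y, nominal in zip(a, b, is_nominal):
        if x is None or y is None:
            sq += 1.0                        # unknown attributes contribute distance 1
        elif nominal:
            sq += 0.0 if x == y else 1.0     # matching nominals -> 0, else 1
        else:
            sq += (x - y) ** 2               # squared difference for continuous
    return math.sqrt(sq)
```

This treats each per-attribute distance as one squared term under a single square root, which is one common way to combine Euclidean and 0/1 distances.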
- HAC should support both single link and complete link options.
- For k-means you will pass in a specific k value for the number of clusters that should be in the resulting clustering. Since HAC automatically generates groupings for all values of k, you will pass in a range or set of k values for which actual output will be generated.
- The output for the algorithm tested should include for each clustering: a) the number of clusters, b) the centroid values of each cluster, c) the number of instances tied to that centroid, d) the SSE of each cluster, and e) the total SSE of the full clustering.
- The sum squared error (SSE) of a single cluster is the sum of the squared Euclidean distances from each cluster member to the cluster centroid.
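A direct transcription of this SSE definition might look like the sketch below (again, `None` for unknowns and the `is_nominal` flag are assumed conventions, not part of the spec):

```python
def sq_dist(a, b, is_nominal):
    """Squared distance: unknowns cost 1, nominal mismatches cost 1,
    continuous attributes contribute a squared difference (a sketch)."""
    total = 0.0
    for x, y, nominal in zip(a, b, is_nominal):
        if x is None or y is None:
            total += 1.0
        elif nominal:
            total += 0.0 if x == y else 1.0
        else:
            total += (x - y) ** 2
    return total

def cluster_sse(members, centroid, is_nominal):
    """SSE of one cluster: sum of squared distances to the centroid."""
    return sum(sq_dist(m, centroid, is_nominal) for m in members)
```

The total SSE of a clustering is then just the sum of `cluster_sse` over all clusters.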
- Run k-means, or HAC-single link and HAC-complete link, on this exact version of the sponge data set (use all columns and do not shuffle or normalize) and report your exact results for each algorithm with k=4 clusters. For k-means, use the first 4 elements of the data set as initial centroids. This will allow us to check the accuracy of your implementation.
- To help debug your implementations, you may run them on this data set (don't shuffle or normalize it).
- Note that this is just the labor data set with an instance id column added for showing results. (Do NOT use the id column or the output class column as feature data).
- The results for 5-means, using the first 5 elements of the data set as initial centroids, should be this.
- In HAC we just do the one closest merge per iteration.
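The one-closest-merge-per-iteration loop can be sketched naively as below (O(n^3); continuous attributes only for brevity, with `math.dist` standing in for the full mixed-attribute distance; ties go to the earliest pair in list order, matching the tie rule given later):

```python
import math

def hac(points, link="single", stop_k=1):
    """Naive agglomerative clustering sketch: clusters hold point indices,
    and each iteration merges the single closest pair of clusters."""
    d = [[math.dist(p, q) for q in points] for p in points]
    clusters = [[i] for i in range(len(points))]
    agg = min if link == "single" else max   # single link: min pair; complete link: max pair
    while len(clusters) > stop_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cd = agg(d[a][b] for a in clusters[i] for b in clusters[j])
                if best is None or cd < best[0]:   # strict < keeps the earliest pair on ties
                    best = (cd, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)
        # a real implementation would record centroid/SSE output here
        # for each k in the requested range
    return clusters
```

Since merging is incremental, the requested output for a range of k values falls out of the same run: report the clustering each time `len(clusters)` hits a k of interest.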
- The results for HAC-single link up to 5 clusters should be this.
- The results for the HAC complete link up to 5 clusters should be this.
- We ignore missing values when calculating centroids and assign them a distance of 1 when determining total sum squared error. Suppose you had the following instances in a cluster:
- ?, 1.0, Round
- ?, 2.0, Square
- ?, 3.0, Round
- Red, ?, ?
- Red, ?, Square
- The centroid value for the first attribute would be "Red" and the SSE would be 3. The centroid value for the second attribute would be 2, and the SSE would be 4. In case of a tie, as with the third attribute, choose the nominal value which appears first in the metadata list. So if the attribute were declared as @attribute Shape{"Round", "Square"}, then the centroid value for the third attribute would be Round and the SSE would be 3. For other types of ties (a node or cluster with the same distance to another cluster, which should be rare), just go with the earliest cluster in your list. If all the instances in a cluster have “don’t know” for one of the attributes, then use “don’t know” in the centroid for that attribute.
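The worked example above can be checked mechanically. This sketch hard-codes the five instances (with `None` as "don't know") and the declared metadata order; the `declared` table and function name are illustrative:

```python
# The five example instances from the text, one column at a time.
instances = [
    [None, 1.0, "Round"],
    [None, 2.0, "Square"],
    [None, 3.0, "Round"],
    ["Red", None, None],
    ["Red", None, "Square"],
]
# Nominal columns with their values in metadata (declaration) order.
declared = {0: ["Red"], 2: ["Round", "Square"]}

def centroid_and_sse(col, values):
    """Centroid and SSE for one attribute column, per the rules above."""
    known = [v for v in values if v is not None]
    if col in declared:
        # Nominal: most common known value; ties broken by metadata order.
        c = (max(declared[col],
                 key=lambda v: (known.count(v), -declared[col].index(v)))
             if known else None)
        sse = sum(1 for v in values if v != c)   # mismatches and unknowns both cost 1
    else:
        # Continuous: mean of the known values; unknowns ignored for the centroid.
        c = sum(known) / len(known) if known else None
        sse = sum(1.0 if v is None else (v - c) ** 2 for v in values)
    return c, sse
```

Running this on each column reproduces the centroids ("Red", 2, Round) and SSE values (3, 4, 3) stated above.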
- (25%) Run your variation (k-means, or HAC-single link and HAC-complete link) on the full iris data set where you do not include the output label as part of the data set.
- For k-means you should always choose k random points in the data set as initial centroids.
- If you ever end up with any empty clusters in k-means, re-run with different initial centroids.
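One way to honor the restart-on-empty-cluster rule is to wrap plain Lloyd's iteration in a retry loop, as in this sketch (continuous data only; function and parameter names are illustrative, not prescribed):

```python
import random

def kmeans(points, k, max_iter=100):
    """Lloyd's k-means sketch: random initial centroids, restart with a
    fresh random sample whenever any cluster comes up empty."""
    while True:
        centroids = random.sample(points, k)
        for _ in range(max_iter):
            clusters = [[] for _ in range(k)]
            for p in points:
                i = min(range(k),
                        key=lambda c: sum((a - b) ** 2
                                          for a, b in zip(p, centroids[c])))
                clusters[i].append(p)
            if any(not c for c in clusters):
                break                        # empty cluster: restart from new centroids
            new = [tuple(sum(xs) / len(xs) for xs in zip(*c)) for c in clusters]
            if new == centroids:
                return centroids, clusters   # assignments stable: converged
            centroids = new
        else:
            return centroids, clusters       # hit max_iter without converging
```

Exact equality works as a convergence test here because once assignments stop changing, the recomputed means are bit-identical.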
- Run it for k = 2-7.
- State whether you normalize or not (your choice).
- Graph the total SSE for each k and discuss your results (i.e. what kind of clusters are being made).
- Repeat the previous (sub)steps, only now include the output class as one of the input features. Discuss your results and any differences.
- For this final data set, also run k-means 5 times with k=4, each time with different initial random centroids and discuss any variations in the results.
- (20%) Run your variation (k-means, or HAC-single link and HAC-complete link, or both) on the following smaller (500-instance) abalone data set, where you include all attributes, including the normal output attribute “rings”.
- Treat “rings” as a continuous variable, rather than nominal. Why would I suggest that?
- Run it for k = 2-7.
- Graph your SSE results without normalization.
- Run it again with normalization and graph.
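If you do normalize, min-max scaling to [0, 1] is one common choice; the spec leaves the method open, and this sketch handles continuous columns only:

```python
def normalize(points):
    """Min-max scale each column to [0, 1]; constant columns map to 0.0."""
    cols = list(zip(*points))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [tuple((v - l) / (h - l) if h > l else 0.0
                  for v, l, h in zip(p, lo, hi))
            for p in points]
```

Normalization matters here because unscaled attributes with large ranges (such as ring counts) would otherwise dominate the Euclidean distance and the SSE.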
Note: In order to help you debug this and other projects we have included some small examples and other hints with actual learned hypotheses so that you can compare the results of your code and help ensure that your code is working properly. You may also discuss and compare results with classmates.
Acknowledgments
Thanks to Dr. Tony Martinez for help in designing the projects and requirements for this course.