Data Portfolio
Clustering, Turkiye Student Evaluation
Data content was inspected using R’s str() and summary() functions. The data were checked for missing values; there were none. All 28 questions were answered with ratings from 1 to 5, so no scaling was necessary. Variables were renamed to reflect attribute contents more accurately.
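A minimal sketch of these inspection steps, assuming the data are read into a data frame named turkiye (the file name and the example rename are illustrative):

    # Load the evaluation data (file name is illustrative)
    turkiye <- read.csv("turkiye-student-evaluation.csv")

    str(turkiye)         # attribute types and dimensions
    summary(turkiye)     # value ranges for each attribute
    sum(is.na(turkiye))  # total missing values; 0 for this data set

    # Example rename to make an attribute self-describing (new name is illustrative)
    names(turkiye)[names(turkiye) == "nb.repeat"] <- "times_course_taken"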
Student evaluations were collected from students in 13 classes taught by 3 instructors. (Class 5 was taught by two instructors and therefore appears twice, giving 14 instructor-class combinations.) There were 775 surveys for Instructor 1, 1,444 for Instructor 2, and 3,601 for Instructor 3.
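The per-instructor counts can be reproduced with a simple tabulation, assuming the instructor and class identifiers are in columns named instr and class:

    table(turkiye$instr)                 # surveys per instructor
    table(turkiye$instr, turkiye$class)  # surveys per instructor-class combination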
Three unsupervised clustering methods were used in this analysis: agglomerative hierarchical clustering/nesting (AGNES), divisive hierarchical clustering (DIANA), and k-means clustering. All distances in each method are Euclidean, although other measures can be used and may be more appropriate for other applications.
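Because both hierarchical methods operate on pairwise distances, the Euclidean distance matrix can be computed once and reused; a sketch, assuming the question responses are in columns Q1 through Q28:

    # Euclidean distance matrix over the 28 question responses
    d <- dist(turkiye[, paste0("Q", 1:28)], method = "euclidean")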

Analysis
AGNES
AGNES can be computed with several linkage methods, including single linkage, complete linkage, average linkage, and Ward’s method. Each of the four HAC linkages was compared using the agglomerative coefficient, which indicates the strength of the clustering structure found. Ward’s method performed best with an agglomerative coefficient of 0.997, compared to 0.855 for single linkage, 0.932 for complete linkage, and 0.892 for average linkage.
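A sketch of this comparison using agnes() from the cluster package, with d the distance matrix computed above:

    library(cluster)

    # Agglomerative coefficient for each linkage method
    linkages <- c("single", "complete", "average", "ward")
    sapply(linkages, function(m) agnes(d, diss = TRUE, method = m)$ac)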

The dendrogram suggests that good cuts could be made at 3 or 5 clusters.
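A sketch of fitting the Ward’s-method model, plotting its dendrogram, and cutting the tree at the candidate cluster counts:

    hc_ward <- agnes(d, diss = TRUE, method = "ward")

    pltree(hc_ward, main = "AGNES dendrogram, Ward's method")

    # Cut the tree at the candidate numbers of clusters
    cut3 <- cutree(as.hclust(hc_ward), k = 3)
    cut5 <- cutree(as.hclust(hc_ward), k = 5)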

From the silhouette plots, we can see that cutting at 4 clusters produces two clusters (3 and 4) that are very close together. The 5-cluster plot shows more equally distributed clusters; however, cluster 5 is smaller and its average width is very low compared to the overall silhouette average. The 3-cluster plot is the most evenly distributed, with all 3 clusters above the average silhouette width.
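The silhouette plots can be produced as follows, a sketch reusing the distance matrix and tree from above:

    # Silhouette plot and average width for each candidate cut
    for (k in 3:5) {
      sil <- silhouette(cutree(as.hclust(hc_ward), k = k), d)
      plot(sil, main = paste0("Silhouette plot, k = ", k))
      cat("k =", k, "average width:", mean(sil[, "sil_width"]), "\n")
    }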

DIANA
While AGNES uses a bottom-up approach, divisive hierarchical clustering (DIANA) works from the top down by dividing clusters until each observation is its own cluster. The divisive coefficient is 0.920.
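A sketch of the DIANA fit, again reusing the distance matrix d; the divisive coefficient is returned in the dc component:

    dv <- diana(d, diss = TRUE)
    dv$dc  # divisive coefficient; about 0.920 for these data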
This method was not explored further because its coefficient was substantially lower than that of AGNES using Ward’s method and slightly lower than that of AGNES using complete linkage.


k-means
The k-means algorithm partitions observations into k groups by minimizing the within-cluster variance of the attributes. One of the difficulties with k-means is determining the appropriate number of clusters.
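A minimal k-means sketch on the question responses (the seed, k = 3, and nstart values are illustrative; nstart reruns the algorithm from several random starting centers and keeps the best result):

    set.seed(123)  # k-means initializes centers at random
    km <- kmeans(turkiye[, paste0("Q", 1:28)], centers = 3, nstart = 25)

    km$tot.withinss    # total within-cluster sum of squares minimized by k-means
    table(km$cluster)  # cluster sizes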
Silhouette plots for 3, 4, and 5 clusters were produced using the average silhouette method to determine the number of clusters. (The elbow method was also used but is not shown here.)
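A sketch of both selection methods, assuming the factoextra package:

    library(factoextra)

    Q <- turkiye[, paste0("Q", 1:28)]

    # Average silhouette width across candidate values of k
    fviz_nbclust(Q, kmeans, method = "silhouette")

    # Elbow method: total within-cluster sum of squares
    fviz_nbclust(Q, kmeans, method = "wss")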
Results
Both agglomerative nesting and k-means clustering identified clusters in the data set with a fairly large percentage (69.1%) of agreement with the classes previously determined and contained in the raw data. Although supervised learning models can predict class membership with a high degree of accuracy, clustering provides a robust method where predetermined classes are not required.
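One way such an agreement figure can be computed is by matching each cluster to its majority label and measuring purity; a sketch, in which the choice of label column (class) and the matching rule are assumptions rather than the exact procedure used:

    # Cross-tabulate cluster assignments against the labels in the raw data
    tab <- table(km$cluster, turkiye$class)

    # Match each cluster to its majority label and compute the agreement rate
    sum(apply(tab, 1, max)) / sum(tab)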