Clustering: Lecture 6
Introduction
Clustering is an unsupervised learning technique: the method does not have to be trained, or in other words, there is no learning phase. Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Cluster analysis groups data objects based only on information found in the data itself, which describes the objects and their relationships.
Cluster Analysis
The goal of cluster analysis is that objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better (or more distinct) the clustering.
Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.
Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets.
Figure: clustering precipitation in Australia.
How Many Clusters?
The best clustering depends on the nature of the data and on the desired result.
Figure: the same set of points plausibly grouped into two, four, or six clusters.
Types of Clustering
Hierarchical versus partitional:
Partitional clustering divides the set of data objects into non-overlapping subsets (clusters), so that each data object is in exactly one cluster.
Hierarchical clustering: clusters may have subclusters; a set of nested clusters organized as a tree.
Partitional Clustering
Figure: a set of original points and a partitional clustering of them.
Hierarchical Clustering
Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.
Other Types of Clustering
Exclusive versus non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
Fuzzy versus non-fuzzy: in fuzzy clustering, every object belongs to every cluster with a membership weight between 0 (absolutely not a member) and 1 (absolutely a member); the weights must sum to 1.
Partial versus complete: a complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for partial clustering is that some objects in a data set may not belong to any well-defined group; many objects may represent noise or outliers.
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Clusters defined by a shared property or concept
Clusters described by an objective function
Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
Figure: 3 well-separated clusters.
Center-Based Clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
Figure: 4 center-based clusters.
Contiguity-Based Clusters (Nearest Neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. The objects in a group are connected to one another but have no connection to objects outside the group. The data can be represented as a graph, where the objects are nodes and the links represent connections between objects.
Figure: 8 contiguous clusters.
Density-Based Clusters
A cluster is a dense region of objects surrounded by a region of low density. Used when the clusters are irregular, and when noise and outliers are present.
Figure: 6 density-based clusters.
Shared-Property (Conceptual) Clusters
Finds clusters that share some common property or represent a particular concept.
Figure: 2 overlapping circles.
Similarity and Dissimilarity
Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike, and often falls in the range [0, 1].
Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike. The minimum dissimilarity is often 0, while the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.
Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k$th attributes (components) of data objects $p$ and $q$. Standardization is necessary if scales differ.
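A minimal Python sketch of this formula (the function name is illustrative, not from the slides):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points p and q."""
    # Illustrative sketch of the formula above, not a library API.
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Example: the distance between (0, 2) and (3, 6) is sqrt(9 + 16) = 5.
print(euclidean_distance([0, 2], [3, 6]))  # 5.0
```

If the attribute scales differ, each attribute would first be standardized, for example by subtracting its mean and dividing by its standard deviation.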
Euclidean Distance
Figure: example points in the plane and their pairwise distance matrix.
Cosine Similarity
If $x$ and $y$ are two document vectors, then
$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
where $\cdot$ denotes the vector dot product and $\lVert x \rVert$ is the length of vector $x$, $\lVert x \rVert = \sqrt{\sum_{k=1}^{n} x_k^2}$.
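A short sketch of the cosine formula in Python, with vectors as plain lists (names are illustrative):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk ** 2 for xk in x))
    norm_y = math.sqrt(sum(yk ** 2 for yk in y))
    return dot / (norm_x * norm_y)

# Two document term-frequency vectors.
print(cosine_similarity([3, 2, 0, 5], [1, 0, 0, 2]))  # approx. 0.943
```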
Extended Jaccard Coefficient (Tanimoto Coefficient) and Correlation
$$EJ(x, y) = \frac{x \cdot y}{\lVert x \rVert^2 + \lVert y \rVert^2 - x \cdot y}$$
$$\mathrm{corr}(x, y) = \frac{\mathrm{covariance}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)}$$
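A sketch of the extended Jaccard coefficient under the formula above (the function name is an assumption, not from the slides); for correlation, Python's standard statistics.correlation (Python 3.10+) computes the same quantity:

```python
def tanimoto(x, y):
    """Extended Jaccard: (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    # Illustrative sketch; for binary vectors this reduces to the
    # ordinary Jaccard coefficient.
    dot = sum(xk * yk for xk, yk in zip(x, y))
    return dot / (sum(xk ** 2 for xk in x) + sum(yk ** 2 for yk in y) - dot)

print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 / (3 + 2 - 2) = 0.666...
```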
Distance Measures Between Clusters
Maximum distance between elements of the two clusters (complete linkage clustering):
$$d(A, B) = \max\{S_{x,y} : x \in A,\ y \in B\}$$
where $S_{x,y}$ is the distance between two data points $x$ and $y$ taken from clusters $A$ and $B$, respectively.
Minimum distance between elements of the two clusters (single linkage clustering):
$$d(A, B) = \min\{S_{x,y} : x \in A,\ y \in B\}$$
Distance Measures Between Clusters
Average distance between elements of the two clusters (average linkage clustering):
$$d(A, B) = \frac{1}{n_A n_B} \sum_{x \in A} \sum_{y \in B} S_{x,y}$$
where $n_A$ and $n_B$ are the numbers of data points in clusters $A$ and $B$, respectively.
Centroid linkage: the distance between the centroids of clusters $A$ and $B$.
Ward linkage: closely related to centroid linkage, but measures the distance between clusters $A$ and $B$ as the increase in squared error when the two clusters are merged.
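The three pairwise-distance linkages translate directly into code; a sketch assuming the euclidean_distance function from the earlier example (any distance $S_{x,y}$ would do):

```python
def single_linkage(A, B, dist):
    """Minimum distance between any x in A and y in B."""
    return min(dist(x, y) for x in A for y in B)

def complete_linkage(A, B, dist):
    """Maximum distance between any x in A and y in B."""
    return max(dist(x, y) for x in A for y in B)

def average_linkage(A, B, dist):
    """Average of all nA * nB pairwise distances."""
    return sum(dist(x, y) for x in A for y in B) / (len(A) * len(B))

A = [[0, 0], [0, 1]]
B = [[3, 0], [4, 0]]
print(single_linkage(A, B, euclidean_distance))    # 3.0
print(complete_linkage(A, B, euclidean_distance))  # sqrt(17), approx. 4.123
```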
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Fuzzy clustering
K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.
K-means Clustering: Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. A sketch of the basic algorithm follows.
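A compact pure-Python sketch of the basic algorithm, reusing the euclidean_distance function from earlier (in practice a library such as scikit-learn's KMeans would be used):

```python
import random

def kmeans(points, k, max_iters=100):
    """Basic K-means sketch: random initial centroids, then alternate
    between assigning points to the nearest centroid and recomputing
    each centroid as the mean of its cluster."""
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return clusters, centroids
```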
Two Different K-means Clusterings
Figure: the same original points partitioned into an optimal clustering and a sub-optimal clustering.
Importance of Choosing Initial Centroids
Figure: two runs of K-means from different initial centroids, one converging to a good clustering and the other to a poor one.
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(x, m_i)^2$$
where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smallest error. One easy way to reduce the SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
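The SSE formula maps directly onto the output of the kmeans sketch above (assuming clusters and centroids as returned there):

```python
def sse(clusters, centroids):
    """Sum over all clusters of squared point-to-centroid distances."""
    return sum(
        euclidean_distance(p, m) ** 2
        for c, m in zip(clusters, centroids)
        for p in c
    )

# Of several runs with random initial centroids, keep the lowest-SSE result.
```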
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The clusters may also correspond to meaningful taxonomies, for example in the biological sciences (the animal kingdom, phylogeny reconstruction, ...).
Hierarchical Clustering
There are two main types of hierarchical clustering.
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms. See the library-based sketch below.
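In practice this loop is rarely hand-written; as a sketch, SciPy's scipy.cluster.hierarchy implements it, with the method argument selecting how cluster proximity is defined:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy 2-D points forming two obvious groups (made-up example data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# linkage() performs the agglomerative merge loop; 'single', 'complete',
# 'average', and 'ward' correspond to the linkage definitions earlier.
Z = linkage(points, method="single")

# Cut the resulting tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]
```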
Starting Situation
Start with clusters of individual points and a proximity matrix.
Figure: points p1 through p5 and their proximity matrix.
Intermediate Situation
After some merging steps, we have some clusters. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
Figure: clusters C1 through C5 and their proximity matrix.
After Merging
The question is: how do we update the proximity matrix?
Figure: the proximity matrix after merging C2 and C5; the entries between the merged cluster C2 U C5 and the other clusters are unknown (marked '?').
How to Define Inter-Cluster Similarity
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Figure: a sequence of slides highlighting each of these proximity definitions in turn on the same points and proximity matrix.
Hierarchical Clustering: Comparison
Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method, showing how the choice of linkage changes the resulting hierarchy.
References
Tan P., Steinbach M., & Kumar V. 2006. Introduction to Data Mining. Pearson Education, Inc.
Han J. & Kamber M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.
Santosa B. 2007. Data Mining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Teori dan Aplikasi. Graha Ilmu, Jogjakarta.