1 Clustering, Lecture 6 (4/9/2015)

2 Introduction
- Clustering is an unsupervised learning technique: the method does not have to be trained, or in other words, there is no learning phase.
- Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both.
- Cluster analysis groups data objects based solely on information found in the data that describes the objects and their relationships.

3 Cluster Analysis
- The goal of cluster analysis is that objects within a group be similar (or related) to one another and different from (or unrelated to) the objects in other groups.
- The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better (or more distinct) the clustering.
[Figure: inter-cluster distances are maximized; intra-cluster distances are minimized]

4 Applications of Cluster Analysis
- Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
- Summarization: reduce the size of large data sets.
[Figure: clustering precipitation in Australia]

5 The best clustering depends on the nature of the data and on the desired result.
[Figure: the same points grouped as two, four, or six clusters; how many clusters?]

6 Types of Clusterings
- Hierarchical versus partitional.
- Partitional clustering divides the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one cluster.
- Hierarchical clustering: clusters have subclusters; a set of nested clusters organized as a tree. A minimal sketch of both styles follows below.
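To make the distinction concrete, here is a minimal sketch of both styles, assuming scikit-learn and NumPy are available (the lecture names no library); the generated data and the choice of three clusters are illustrative:

```python
# A minimal sketch contrasting partitional and hierarchical clustering.
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc, 0.5, size=(50, 2)) for loc in (0.0, 5.0, 10.0)])

# Partitional: each point lands in exactly one of K non-overlapping clusters.
part_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Hierarchical: nested clusters; here the tree is cut at 3 clusters to compare.
hier_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)

print(np.bincount(part_labels), np.bincount(hier_labels))
```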

7 Partitional Clustering
[Figure: the original points and a partitional clustering of them]

8 Hierarchical Clustering
[Figure: a traditional and a non-traditional hierarchical clustering, each with its dendrogram]

9 Other Types of Clusterings
- Exclusive versus non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
- Fuzzy versus non-fuzzy: every object belongs to every cluster with a membership weight between 0 (absolutely does not belong) and 1 (absolutely belongs); the membership weights must sum to 1.
- Partial versus complete: a complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for partial clustering is that some objects in a data set may not belong to any well-defined group; many objects may represent noise or outliers.

10 Types of Clusters
- Well-separated clusters
- Center-based clusters
- Contiguous clusters
- Density-based clusters
- Property or conceptual clusters
- Clusters described by an objective function

11 Well-Separated Clusters
- A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
[Figure: 3 well-separated clusters]

12 Center-Based Clusters
- A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster.
- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
[Figure: 4 center-based clusters]
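As a small illustration of the two kinds of center, here is a NumPy sketch on made-up points, with Euclidean distance assumed:

```python
# Centroid versus medoid on made-up points (Euclidean distance assumed).
import numpy as np

points = np.array([[1.0, 1.0], [2.0, 1.5], [1.5, 2.0], [8.0, 8.0]])

# Centroid: the coordinate-wise mean; it need not be an actual data point.
centroid = points.mean(axis=0)

# Medoid: the actual data point with the smallest total distance to the rest.
dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
medoid = points[dists.sum(axis=1).argmin()]

print("centroid:", centroid)
print("medoid:  ", medoid)
```

Note how the outlier at (8, 8) pulls the centroid away from the dense group, while the medoid stays on a real point.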

13 Contiguity-Based Clusters
- Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster.
- The objects in a group are connected to one another but have no connection to objects outside the group.
- The data can be represented as a graph, where the objects are nodes and the links represent connections between objects.
[Figure: 8 contiguous clusters]

14 Density-Based Clusters
- A cluster is a dense region of objects surrounded by a region of low density.
- Used when the clusters are irregular, and when noise and outliers are present.
[Figure: 6 density-based clusters]
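The slide names no specific algorithm; DBSCAN is one standard density-based method, sketched here with scikit-learn. The eps and min_samples values are illustrative and would need tuning on real data:

```python
# A hedged sketch of density-based clustering with DBSCAN (scikit-learn).
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(1)
dense = rng.normal(0.0, 0.3, size=(100, 2))   # one dense region of objects
noise = rng.uniform(-3.0, 3.0, size=(10, 2))  # scattered low-density points

labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(np.vstack([dense, noise]))
print("points labeled as noise (-1):", int(np.sum(labels == -1)))
```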

15 Conceptual Clusters
- Shared-property or conceptual clusters: finds clusters that share some common property or represent a particular concept.
[Figure: 2 overlapping circles]

16 Similarity and Dissimilarity
- Similarity: a numerical measure of how alike two data objects are; it is higher when the objects are more alike and often falls in the range [0, 1].
- Dissimilarity: a numerical measure of how different two data objects are; it is lower when the objects are more alike. The minimum dissimilarity is often 0, and the upper limit varies.
- Proximity refers to either a similarity or a dissimilarity.

17 Euclidean Distance
- $d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k$-th attributes (components) of data objects $p$ and $q$.
- Standardization is necessary if scales differ.
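A direct NumPy rendering of the formula, with z-score standardization sketched for the case where attribute scales differ (the slide does not fix a standardization method, so the z-score here is an assumption):

```python
# Euclidean distance as defined above, plus a standardization sketch.
import numpy as np

def euclidean(p, q):
    """sqrt of the sum over the n attributes of (p_k - q_k)^2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

X = np.array([[170.0, 60.0],    # e.g., height in cm, weight in kg
              [180.0, 80.0]])
print(euclidean(X[0], X[1]))

# When scales differ, standardize each attribute first (z-score assumed).
Z = (X - X.mean(axis=0)) / X.std(axis=0)
print(euclidean(Z[0], Z[1]))
```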

18 Euclidean Distance
[Figure: a set of example points and their pairwise distance matrix]

19 Cosine Similarity
- If $x$ and $y$ are two document vectors, then $\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$, where $\cdot$ denotes the vector dot product and $\|x\|$ is the length of vector $x$, $\|x\| = \sqrt{\sum_k x_k^2}$.
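Written out in NumPy, with two illustrative term-count vectors standing in for documents:

```python
# Cosine similarity as defined above.
import numpy as np

def cosine_similarity(x, y):
    """(x . y) / (||x|| * ||y||)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]   # term counts for "document" 1
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]   # term counts for "document" 2
print(cosine_similarity(d1, d2))       # roughly 0.31
```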

20 Extended Jaccard Coefficient (Tanimoto Coefficient) and Correlation
- Extended Jaccard coefficient (Tanimoto coefficient): $T(x, y) = \dfrac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$
- Correlation: $\mathrm{corr}(x, y) = \dfrac{\mathrm{cov}(x, y)}{s_x \, s_y}$, the covariance of $x$ and $y$ divided by the product of their standard deviations.
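Both measures, as reconstructed above, in a short NumPy sketch:

```python
# Extended Jaccard (Tanimoto) and Pearson correlation.
import numpy as np

def tanimoto(x, y):
    """(x . y) / (||x||^2 + ||y||^2 - x . y)"""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    xy = np.dot(x, y)
    return xy / (np.dot(x, x) + np.dot(y, y) - xy)

def correlation(x, y):
    """cov(x, y) / (s_x * s_y), via NumPy's correlation-coefficient matrix."""
    return np.corrcoef(x, y)[0, 1]

x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 3.0, 5.0, 4.0]
print(tanimoto(x, y), correlation(x, y))
```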

21 Distance Measures Between Clusters
- Maximum distance between elements of the clusters (complete linkage clustering): $d(A, B) = \max\{\, s_{x,y} : x \in A,\; y \in B \,\}$, where $s_{x,y}$ is the distance between two data points $x$ and $y$ drawn from clusters $A$ and $B$, respectively.
- Minimum distance between elements of the clusters (single linkage clustering): $d(A, B) = \min\{\, s_{x,y} : x \in A,\; y \in B \,\}$, with $s_{x,y}$ as above.

22 Distance Measures Between Clusters
- Average distance between elements of the clusters (average linkage clustering): $d(A, B) = \dfrac{1}{n_A n_B} \sum_{x \in A} \sum_{y \in B} s_{x,y}$, where $n_A$ and $n_B$ are the numbers of data points in clusters $A$ and $B$, respectively.
- Centroid linkage: $d(A, B) = \|\bar{x}_A - \bar{x}_B\|$, the distance between the centroids of the two clusters.
- Ward linkage: $d(A, B) = \dfrac{n_A n_B}{n_A + n_B} \, d_c(A, B)^2$, where $d_c(A, B)$ is the distance between clusters $A$ and $B$ computed with the centroid linkage formula.
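The linkage measures translate directly into pairwise-distance computations; a naive NumPy sketch for small clusters, with Euclidean distance assumed for $s_{x,y}$:

```python
# Inter-cluster distances from the definitions above (naive, small clusters).
import numpy as np

def pairwise(A, B):
    """Matrix of Euclidean distances s_{x,y} for x in A, y in B."""
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def complete_linkage(A, B):   # maximum over all cross-cluster pairs
    return pairwise(A, B).max()

def single_linkage(A, B):     # minimum over all cross-cluster pairs
    return pairwise(A, B).min()

def average_linkage(A, B):    # mean over all n_A * n_B cross-cluster pairs
    return pairwise(A, B).mean()

def centroid_linkage(A, B):   # distance between the two cluster centroids
    return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
print(complete_linkage(A, B), single_linkage(A, B),
      average_linkage(A, B), centroid_linkage(A, B))
```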

23 Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Fuzzy clustering

24 K-means Clustering
- Partitional clustering approach.
- Each cluster is associated with a centroid (center point).
- Each point is assigned to the cluster with the closest centroid.
- The number of clusters, K, must be specified.
- The basic algorithm is very simple.

25 K-means Clustering: Details
- Initial centroids are often chosen randomly, so the clusters produced vary from one run to another.
- The centroid is (typically) the mean of the points in the cluster.
- 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often relaxed to 'until relatively few points change clusters'. A sketch of the algorithm follows below.
- Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes.
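A from-scratch sketch of the basic algorithm under the defaults described above (random initial centroids, Euclidean closeness, stop when assignments no longer change); the data and variable names are illustrative:

```python
# Minimal K-means: random init, Euclidean assignment, mean recomputation.
import numpy as np

def kmeans(X, K, max_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    # Choose K distinct data points as the initial centroids (random init).
    centroids = X[rng.choice(len(X), size=K, replace=False)].copy()
    labels = None
    for _ in range(max_iters):
        # Assign each point to the cluster with the closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        new_labels = dists.argmin(axis=1)
        if labels is not None and np.array_equal(new_labels, labels):
            break  # assignments are stable: converged
        labels = new_labels
        # Recompute each centroid as the mean of its assigned points.
        for k in range(K):
            if np.any(labels == k):  # guard against an emptied cluster
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.4, size=(30, 2)) for m in (0.0, 3.0)])
labels, centers = kmeans(X, K=2)
print(centers)  # should land near (0, 0) and (3, 3)
```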

26 Two Different K-means Clusterings
[Figure: the original points, a sub-optimal clustering, and an optimal clustering of them]

27 Importance of Choosing Initial Centroids
[Figure: K-means iterations starting from one choice of initial centroids]

28 Importance of Choosing Initial Centroids
[Figure: K-means iterations starting from a different choice of initial centroids]

29 Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE). For each point, the error is its distance to the nearest cluster center; to get the SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$, where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$. It can be shown that $m_i$ corresponds to the center (mean) of the cluster.
- Given two clusterings, we can choose the one with the smallest error.
- One easy way to reduce the SSE is to increase K, the number of clusters; yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
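The SSE formula written out in NumPy, with a tiny hand-checked example (the points and labels are made up):

```python
# SSE: squared distance from each point to its cluster's mean, summed.
import numpy as np

def sse(X, labels, centroids):
    total = 0.0
    for k, m in enumerate(centroids):
        members = X[labels == k]
        total += np.sum((members - m) ** 2)  # squared Euclidean errors
    return total

X = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([X[labels == k].mean(axis=0) for k in (0, 1)])
print(sse(X, labels, centroids))  # 0.5 + 0.5 = 1.0
```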

30 Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree.
- Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

31 Strengths of Hierarchical Clustering
- No particular number of clusters has to be assumed: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level, as in the sketch below.
- The clusters may correspond to meaningful taxonomies, for example in the biological sciences (e.g., animal kingdom, phylogeny reconstruction, ...).
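A sketch of building the tree once and cutting it at several levels, assuming SciPy is available; the data and the average-link choice are illustrative:

```python
# Build one dendrogram, then cut it for any desired number of clusters.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(m, 0.3, size=(20, 2)) for m in (0.0, 3.0, 6.0)])

Z = linkage(X, method="average")   # the full hierarchy, computed once
for k in (2, 3, 4):                # cut the tree at several levels
    labels = fcluster(Z, t=k, criterion="maxclust")
    print(k, np.bincount(labels)[1:])  # fcluster labels start at 1
```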

32 Hierarchical Clustering
- Two main types of hierarchical clustering:
- Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
- Divisive: start with one, all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
- Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

33 Agglomerative Clustering Algorithm
- The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat:
4. Merge the two closest clusters.
5. Update the proximity matrix.
6. Until only a single cluster remains.
- The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms. A naive rendering of these steps appears below.
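A naive rendering of the six steps, using single link (MIN) as the cluster proximity; it recomputes cluster proximities from the point-level matrix on every pass, which is simple but inefficient:

```python
# Naive agglomerative clustering following steps 1-6 (single-link proximity).
import numpy as np

def agglomerative(X, k=1):
    # Step 1: compute the proximity matrix over all points.
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # Step 2: let each data point be a cluster.
    clusters = [[i] for i in range(len(X))]
    # Steps 3 and 6: repeat until only one cluster (or k clusters) remains.
    while len(clusters) > k:
        # Step 4: find and merge the two closest clusters (MIN linkage here).
        best = (np.inf, 0, 1)
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = D[np.ix_(clusters[i], clusters[j])].min()
                if d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Step 5 is implicit: proximities are re-derived from D each pass.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
print(agglomerative(X, k=2))  # expect [[0, 1], [2, 3]]
```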

34 Starting Situation
- Start with clusters of individual points and a proximity matrix.
[Figure: points p1 through p5 and their proximity matrix]

35 Intermediate Situation
- After some merging steps, we have some clusters.
[Figure: clusters C1 through C5 and the corresponding proximity matrix]

36 Intermediate Situation
- We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
[Figure: clusters C1 through C5 and the corresponding proximity matrix]

37 After Merging
- The question is: how do we update the proximity matrix?
[Figure: the merged cluster C2 U C5, with its rows and columns of the proximity matrix marked '?']

38 How to Define Inter-Cluster Similarity
- MIN
- MAX
- Group average
- Distance between centroids
- Other methods driven by an objective function (Ward's method uses squared error)
[Figure: points p1 through p5 and their proximity matrix; which entries define the similarity of two clusters?]

39-42 How to Define Inter-Cluster Similarity (continued)
[Figures: the same list of methods, with MIN, MAX, group average, and distance between centroids each highlighted in turn on the example proximity matrix]

43 Hierarchical Clustering: Comparison
[Figure: the same data set clustered with MIN, MAX, group average, and Ward's method]

44 References
- Tan, P., Steinbach, M., & Kumar, V. Introduction to Data Mining. Pearson Education, Inc.
- Han, J., & Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.
- Santosa, B. Data Mining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Teori dan Aplikasi. Graha Ilmu, Jogjakarta.

