0 Clustering, Lecture 6

1 Introduction Clustering is an unsupervised learning technique: the method does not have to be trained, or in other words, there is no learning phase. Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Cluster analysis groups data objects based solely on information found in the data that describes the objects and their relationships.

2 Cluster Analysis The goal of cluster analysis is for the objects within a group to be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better (or more distinct) the clustering. Inter-cluster distances are maximized; intra-cluster distances are minimized.

3 Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).

4 The Best Clustering Depends on the Data and the Desired Result
How many clusters? (Figure: the same points grouped into two, four, or six clusters.)

5 Types of Clustering: Hierarchical versus Partitional
Partitional clustering divides the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one cluster. Hierarchical clustering: clusters can have subclusters; the set of nested clusters is organized as a tree.

6 Partitional Clustering
(Figure: the original points and a partitional clustering of them.)

7 Hierarchical Clustering
(Figure: traditional hierarchical clustering with its dendrogram, and non-traditional hierarchical clustering with its dendrogram.)

8 Other Types of Clustering
Exclusive versus non-exclusive: in non-exclusive clustering, points can belong to multiple clusters; this can represent multiple classes or 'border' points. Fuzzy versus non-fuzzy: every object belongs to every cluster with a membership weight between 0 (absolutely does not belong) and 1 (absolutely belongs), and the weights for each object must sum to 1; a minimal sketch follows below. Partial versus complete: complete clustering assigns every object to a cluster, whereas partial clustering does not; the motivation for partial clustering is that some objects in a data set may not belong to well-defined groups, and many objects may represent noise or outliers.
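To make the fuzzy membership idea concrete, here is a minimal sketch in Python with NumPy (all values are illustrative, not from the slides): each row of the matrix holds one object's weights across the clusters, and each row sums to 1.

```python
import numpy as np

# Membership matrix: rows are objects, columns are clusters.
# Illustrative values only; each row must sum to 1.
memberships = np.array([
    [0.9, 0.1, 0.0],   # object 0 almost certainly belongs to cluster 0
    [0.2, 0.5, 0.3],   # object 1 is a 'border' point spread across clusters
    [0.0, 0.1, 0.9],   # object 2 almost certainly belongs to cluster 2
])

assert np.allclose(memberships.sum(axis=1), 1.0)  # weights sum to 1
hard_labels = memberships.argmax(axis=1)           # defuzzify: take the largest weight
print(hard_labels)  # [0 1 2]
```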

9 Types of Clusters
Well-separated clusters; center-based clusters; contiguous clusters; density-based clusters; clusters defined by a shared property or concept; clusters described by an objective function.

10 Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. (Figure: 3 well-separated clusters.)

11 Center-Based Clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster. (Figure: 4 center-based clusters.) A sketch contrasting the two notions of center follows.
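The difference between a centroid and a medoid is easy to see in code. This is a minimal NumPy sketch on a made-up three-point cluster (the data are illustrative): the centroid is a coordinate-wise mean and need not be an actual data point, while the medoid is the data point with the smallest total distance to the others.

```python
import numpy as np

def centroid(points):
    """Centroid: the coordinate-wise mean of the cluster's points."""
    return points.mean(axis=0)

def medoid(points):
    """Medoid: the actual point with the smallest total distance
    to all other points in the cluster."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 8.0]])  # toy cluster with one outlier
print(centroid(cluster))  # [1.5 3.333...] -- not an actual data point
print(medoid(cluster))    # [1. 1.] -- an actual point, unaffected by the outlier
```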

12 Contiguity-Based Clusters (Nearest Neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. The objects in a group are connected to one another but have no connection to objects outside the group. The data can be represented as a graph, where objects are nodes and links indicate connections between objects. (Figure: 8 contiguous clusters.)

13 Density-Based Clusters
A cluster is a dense region of objects surrounded by a region of low density. Used when the clusters are irregular and when noise and outliers are present. (Figure: 6 density-based clusters.)

14 Shared-Property (Conceptual) Clusters
Finds clusters that share some common property or represent a particular concept. (Figure: 2 overlapping circles.)

15 Similarity and Dissimilarity
Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0, 1]. Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike; the minimum dissimilarity is often 0, and the upper limit varies. Proximity refers to either a similarity or a dissimilarity.

16 Euclidean Distance
$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q. Standardization is necessary if scales differ.
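As a quick sanity check of the formula, here is a minimal Python implementation (the sample points are illustrative):

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared attribute-wise differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([0, 2], [2, 0]))  # 2.8284... = 2 * sqrt(2)

# If attribute scales differ, standardize first (zero mean, unit variance per column):
X = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```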

17 Euclidean Distance
(Figure: a small set of points and the resulting pairwise distance matrix.)
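The distance matrix in the figure can be reproduced for any point set with a few lines of SciPy; the coordinates below are illustrative, since the slide's actual values did not carry over.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # illustrative coordinates

# Pairwise Euclidean distances: entry (i, j) is the distance from point i to point j.
dist_matrix = cdist(points, points, metric='euclidean')
print(np.round(dist_matrix, 3))
```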

18 Cosine Similarity If x and y are two document vectors, then
$\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$, where $\cdot$ denotes the vector dot product and $\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2}$ is the length of vector x.
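A minimal implementation, with two toy document-term vectors (the term counts are chosen for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # term counts for document 1
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]  # term counts for document 2
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```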

19 Extended Jaccard Coefficient (Tanimoto Coefficient) and Correlation
The extended Jaccard (Tanimoto) coefficient is $EJ(x, y) = \dfrac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$, and the correlation is $\mathrm{corr}(x, y) = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$, where $\mathrm{cov}(x, y)$ is the covariance of x and y, and $s_x$ and $s_y$ are their standard deviations.
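Both measures take only a few lines of NumPy; the binary vectors below are illustrative:

```python
import numpy as np

def tanimoto(x, y):
    """Extended Jaccard (Tanimoto): (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = x.dot(y)
    return dot / (x.dot(x) + y.dot(y) - dot)

def correlation(x, y):
    """Pearson correlation: cov(x, y) / (s_x * s_y)."""
    return float(np.corrcoef(x, y)[0, 1])

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
print(round(tanimoto(x, y), 3))     # 0.5
print(round(correlation(x, y), 3))  # 0.167
```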

20 Distance Measures Between Clusters
Maximum distance between elements of the two clusters (complete-linkage clustering): $d(A, B) = \max\{s_{x,y}\}, x \in A, y \in B$, where $s_{x,y}$ is the distance between two points x and y taken from clusters A and B, respectively. Minimum distance between elements of the two clusters (single-linkage clustering): $d(A, B) = \min\{s_{x,y}\}, x \in A, y \in B$.

21 Distance Measures Between Clusters (continued)
Average distance between elements of the two clusters (average-linkage clustering): $d(A, B) = \dfrac{1}{n_A n_B} \sum_{x \in A} \sum_{y \in B} s_{x,y}$, where $n_A$ and $n_B$ are the numbers of points in clusters A and B, respectively. Centroid linkage: the distance between the centroids of clusters A and B. Ward linkage: the increase in the total within-cluster squared error when clusters A and B are merged, which can be computed from the centroid distance. A sketch of these measures follows.
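Here is a minimal sketch of the first four measures for two small clusters, using SciPy for the pairwise distances (the clusters are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(A, B, method='single'):
    """Distance between clusters A and B (arrays of points)
    under the linkage criteria described above."""
    D = cdist(A, B)                      # all pairwise point-to-point distances
    if method == 'single':               # minimum pairwise distance
        return D.min()
    if method == 'complete':             # maximum pairwise distance
        return D.max()
    if method == 'average':              # mean of all pairwise distances
        return D.mean()
    if method == 'centroid':             # distance between the cluster centroids
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(method)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
for m in ('single', 'complete', 'average', 'centroid'):
    print(m, linkage_distance(A, B, m))  # 3.0, 5.0, 4.0, 4.0
```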

22 Clustering Algorithms
K-means and its variants; hierarchical clustering; fuzzy clustering.

23 K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point); each point is assigned to the cluster with the closest centroid; the number of clusters, K, must be specified. The basic algorithm is very simple.

24 K-means Clustering: Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. A sketch of the basic algorithm follows.
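A minimal NumPy sketch of the basic (Lloyd's) algorithm, assuming Euclidean distance and random initial centroids; the two-blob data set at the bottom is illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two blobs
labels, centroids = kmeans(X, k=2)
```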

25 Two Different K-means Clusterings
(Figure: the same original points, clustered once optimally and once sub-optimally, depending on the initial centroids.)

26 Importance of Choosing Initial Centroids
(Figure only.)

27 Importance of Choosing Initial Centroids (continued)
(Figure only.)

28 Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$, where x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$. It can be shown that $m_i$ corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smallest error. One easy way to reduce the SSE is to increase K, the number of clusters, yet a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K. A sketch follows.
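A minimal SSE computation on a tiny worked example with known centroids (it can equally be fed the labels and centroids produced by the K-means sketch above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of Squared Error: squared Euclidean distance from each
    point to its cluster's centroid, summed over all points."""
    return float(sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    ))

# Two obvious clusters; each point lies 1 unit from its centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sse(X, labels, centroids))  # 4.0
```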

29 Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

30 Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The clusters may also correspond to meaningful taxonomies, as in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...).

31 Hierarchical Clustering
Two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

32 Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward: compute the proximity matrix and let each data point be a cluster; then repeatedly merge the two closest clusters and update the proximity matrix, until only a single cluster remains. The key operation is the computation of the proximity of two clusters; the different approaches to defining the distance between clusters distinguish the different algorithms. A sketch using SciPy follows.
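In practice the whole agglomerative procedure is available in SciPy; this minimal sketch builds the merge tree on random illustrative data, draws the dendrogram, and 'cuts' the tree into three clusters:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)  # 20 random 2-D points (illustrative)

# Build the merge tree; 'single', 'complete', 'average', and 'ward'
# match the inter-cluster distance measures discussed above.
Z = linkage(X, method='average')

dendrogram(Z)              # records the sequence of merges
plt.show()

clusters = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
print(clusters)
```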

33 Starting Situation
Start with clusters of individual points and a proximity matrix. (Figure: points p1, p2, ..., and the initial proximity matrix.)

34 Intermediate Situation
After some merging steps, we have some clusters. (Figure: clusters C1-C5 and the current proximity matrix.)

35 Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. (Figure: clusters C1-C5 and the proximity matrix.)

36 After Merging
The question is: how do we update the proximity matrix? (Figure: the merged cluster C2 U C5, with the rows and columns of the proximity matrix that involve it marked '?'.)

37 How to Define Inter-Cluster Similarity
Options: MIN; MAX; group average; distance between centroids; other methods driven by an objective function (Ward's method uses squared error). (Figure: two candidate clusters and the proximity matrix.)

38-41 How to Define Inter-Cluster Similarity (continued)
(Figures: slides 38-41 repeat the list above, illustrating in turn MIN, MAX, group average, and distance between centroids.)

42 Hierarchical Clustering: Comparison
(Figure: the same six points clustered with MIN, MAX, group average, and Ward's method, with the resulting dendrograms.)

43 References
Tan, P., Steinbach, M., & Kumar, V. Introduction to Data Mining. Pearson Education, Inc.
Han, J., & Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.
Santosa, B. Data Mining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Teori dan Aplikasi. Graha Ilmu, Yogyakarta.

