0 Clustering, Lecture 6

1 Introduction Clustering is an unsupervised learning technique: the method does not have to be trained, or in other words, there is no learning phase. Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Cluster analysis groups data objects based solely on information found in the data that describes the objects and their relationships.

2 Cluster Analysis The goal of cluster analysis is for the objects within a group to be similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better (or more distinct) the clustering. Inter-cluster distances are maximized; intra-cluster distances are minimized.

3 Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations. Summarization: reduce the size of large data sets (e.g., clustering precipitation in Australia).

4 The Best Clustering Depends on the Data and the Desired Result
How many clusters? (Figure: the same points grouped into two, four, or six clusters.)

5 Types of Clustering: Hierarchical versus Partitional
Partitional clustering divides the set of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one cluster. Hierarchical clustering: clusters can have subclusters; the set of nested clusters is organized as a tree.

6 Partitional Clustering
(Figure: the original points and a partitional clustering of them.)

7 Hierarchical Clustering
(Figure: traditional hierarchical clustering with its dendrogram, and non-traditional hierarchical clustering with its dendrogram.)

8 Other Types of Clustering
Exclusive versus non-exclusive: in non-exclusive clustering, points can belong to multiple clusters; this can represent multiple classes or 'border' points. Fuzzy versus non-fuzzy: every object belongs to every cluster with a membership weight between 0 (absolutely does not belong) and 1 (absolutely belongs), and the weights for each object must sum to 1; a minimal sketch follows below. Partial versus complete: complete clustering assigns every object to a cluster, whereas partial clustering does not; the motivation for partial clustering is that some objects in a data set may not belong to well-defined groups, and many objects may represent noise or outliers.
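To make the fuzzy membership idea concrete, here is a minimal sketch in Python with NumPy (all values are illustrative, not from the slides): each row of the matrix holds one object's weights across the clusters, and each row sums to 1.

```python
import numpy as np

# Membership matrix: rows are objects, columns are clusters.
# Illustrative values only; each row must sum to 1.
memberships = np.array([
    [0.9, 0.1, 0.0],   # object 0 almost certainly belongs to cluster 0
    [0.2, 0.5, 0.3],   # object 1 is a 'border' point spread across clusters
    [0.0, 0.1, 0.9],   # object 2 almost certainly belongs to cluster 2
])

assert np.allclose(memberships.sum(axis=1), 1.0)  # weights sum to 1
hard_labels = memberships.argmax(axis=1)           # defuzzify: take the largest weight
print(hard_labels)  # [0 1 2]
```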

9 Types of Clusters
Well-separated clusters; center-based clusters; contiguous clusters; density-based clusters; clusters defined by a shared property or concept; clusters described by an objective function.

10 Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster. (Figure: 3 well-separated clusters.)

11 Center-Based Clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster. (Figure: 4 center-based clusters.) A sketch contrasting the two notions of center follows.
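The difference between a centroid and a medoid is easy to see in code. This is a minimal NumPy sketch on a made-up three-point cluster (the data are illustrative): the centroid is a coordinate-wise mean and need not be an actual data point, while the medoid is the data point with the smallest total distance to the others.

```python
import numpy as np

def centroid(points):
    """Centroid: the coordinate-wise mean of the cluster's points."""
    return points.mean(axis=0)

def medoid(points):
    """Medoid: the actual point with the smallest total distance
    to all other points in the cluster."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=2)
    return points[dists.sum(axis=1).argmin()]

cluster = np.array([[1.0, 1.0], [2.0, 1.0], [1.5, 8.0]])  # toy cluster with one outlier
print(centroid(cluster))  # [1.5 3.333...] -- not an actual data point
print(medoid(cluster))    # [1. 1.] -- an actual point, unaffected by the outlier
```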

12 Contiguity-Based Clusters (Nearest Neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. The objects in a group are connected to one another but have no connection to objects outside the group. The data can be represented as a graph, where objects are nodes and links indicate connections between objects. (Figure: 8 contiguous clusters.)

13 Density-Based Clusters
A cluster is a dense region of objects surrounded by a region of low density. Used when the clusters are irregular and when noise and outliers are present. (Figure: 6 density-based clusters.)

14 Shared-Property (Conceptual) Clusters
Finds clusters that share some common property or represent a particular concept. (Figure: 2 overlapping circles.)

15 Similarity and Dissimilarity
Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike and often falls in the range [0, 1]. Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike; the minimum dissimilarity is often 0, and the upper limit varies. Proximity refers to either a similarity or a dissimilarity.

16 Euclidean Distance
$d(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$, where n is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the kth attributes (components) of data objects p and q. Standardization is necessary if scales differ.
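As a quick sanity check of the formula, here is a minimal Python implementation (the sample points are illustrative):

```python
import numpy as np

def euclidean_distance(p, q):
    """Square root of the sum of squared attribute-wise differences."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return np.sqrt(np.sum((p - q) ** 2))

print(euclidean_distance([0, 2], [2, 0]))  # 2.8284... = 2 * sqrt(2)

# If attribute scales differ, standardize first (zero mean, unit variance per column):
X = np.array([[0.0, 2.0], [2.0, 0.0], [3.0, 1.0]])
X_std = (X - X.mean(axis=0)) / X.std(axis=0)
```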

17 Euclidean Distance
(Figure: a small set of points and the resulting pairwise distance matrix.)
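The distance matrix in the figure can be reproduced for any point set with a few lines of SciPy; the coordinates below are illustrative, since the slide's actual values did not carry over.

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.array([[0, 2], [2, 0], [3, 1], [5, 1]])  # illustrative coordinates

# Pairwise Euclidean distances: entry (i, j) is the distance from point i to point j.
dist_matrix = cdist(points, points, metric='euclidean')
print(np.round(dist_matrix, 3))
```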

18 Cosine Similarity If x and y are two document vectors, then
$\cos(x, y) = \dfrac{x \cdot y}{\|x\| \, \|y\|}$, where $\cdot$ denotes the vector dot product and $\|x\| = \sqrt{\sum_{k=1}^{n} x_k^2}$ is the length of vector x.
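A minimal implementation, with two toy document-term vectors (the term counts are chosen for illustration):

```python
import numpy as np

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    return x.dot(y) / (np.linalg.norm(x) * np.linalg.norm(y))

d1 = [3, 2, 0, 5, 0, 0, 0, 2, 0, 0]  # term counts for document 1
d2 = [1, 0, 0, 0, 0, 0, 0, 1, 0, 2]  # term counts for document 2
print(round(cosine_similarity(d1, d2), 4))  # 0.315
```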

19 Extended Jaccard Coefficient (Tanimoto Coefficient) and Correlation
The extended Jaccard (Tanimoto) coefficient is $EJ(x, y) = \dfrac{x \cdot y}{\|x\|^2 + \|y\|^2 - x \cdot y}$, and the correlation is $\mathrm{corr}(x, y) = \dfrac{\mathrm{cov}(x, y)}{s_x s_y}$, where $\mathrm{cov}(x, y)$ is the covariance of x and y, and $s_x$ and $s_y$ are their standard deviations.
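Both measures take only a few lines of NumPy; the binary vectors below are illustrative:

```python
import numpy as np

def tanimoto(x, y):
    """Extended Jaccard (Tanimoto): (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    dot = x.dot(y)
    return dot / (x.dot(x) + y.dot(y) - dot)

def correlation(x, y):
    """Pearson correlation: cov(x, y) / (s_x * s_y)."""
    return float(np.corrcoef(x, y)[0, 1])

x = [1, 0, 1, 1, 0]
y = [1, 1, 1, 0, 0]
print(round(tanimoto(x, y), 3))     # 0.5
print(round(correlation(x, y), 3))  # 0.167
```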

20 Distance Measures Between Clusters
Maximum distance between elements of the two clusters (complete-linkage clustering): $d(A, B) = \max\{s_{x,y}\}, x \in A, y \in B$, where $s_{x,y}$ is the distance between two points x and y taken from clusters A and B, respectively. Minimum distance between elements of the two clusters (single-linkage clustering): $d(A, B) = \min\{s_{x,y}\}, x \in A, y \in B$.

21 Distance Measures Between Clusters (continued)
Average distance between elements of the two clusters (average-linkage clustering): $d(A, B) = \dfrac{1}{n_A n_B} \sum_{x \in A} \sum_{y \in B} s_{x,y}$, where $n_A$ and $n_B$ are the numbers of points in clusters A and B, respectively. Centroid linkage: the distance between the centroids of clusters A and B. Ward linkage: the increase in the total within-cluster squared error when clusters A and B are merged, which can be computed from the centroid distance. A sketch of these measures follows.
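Here is a minimal sketch of the first four measures for two small clusters, using SciPy for the pairwise distances (the clusters are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def linkage_distance(A, B, method='single'):
    """Distance between clusters A and B (arrays of points)
    under the linkage criteria described above."""
    D = cdist(A, B)                      # all pairwise point-to-point distances
    if method == 'single':               # minimum pairwise distance
        return D.min()
    if method == 'complete':             # maximum pairwise distance
        return D.max()
    if method == 'average':              # mean of all pairwise distances
        return D.mean()
    if method == 'centroid':             # distance between the cluster centroids
        return np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    raise ValueError(method)

A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [5.0, 0.0]])
for m in ('single', 'complete', 'average', 'centroid'):
    print(m, linkage_distance(A, B, m))  # 3.0, 5.0, 4.0, 4.0
```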

22 Clustering Algorithms
K-means and its variants; hierarchical clustering; fuzzy clustering.

23 K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point); each point is assigned to the cluster with the closest centroid; the number of clusters, K, must be specified. The basic algorithm is very simple.

24 K-means Clustering: Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. A sketch of the basic algorithm follows.
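A minimal NumPy sketch of the basic (Lloyd's) algorithm, assuming Euclidean distance and random initial centroids; the two-blob data set at the bottom is illustrative:

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: assign each point to its nearest centroid,
    then recompute each centroid as the mean of its points."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]  # random initial centroids
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its assigned points.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):  # centroids stopped moving
            break
        centroids = new_centroids
    return labels, centroids

X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])  # two blobs
labels, centroids = kmeans(X, k=2)
```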

25 Two Different K-means Clusterings
(Figure: the same original points, clustered once optimally and once sub-optimally, depending on the initial centroids.)

26 Importance of Choosing Initial Centroids
(Figure only.)

27 Importance of Choosing Initial Centroids (continued)
(Figure only.)

28 Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster center; to get the SSE, we square these errors and sum them: $SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(m_i, x)^2$, where x is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$. It can be shown that $m_i$ corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smallest error. One easy way to reduce the SSE is to increase K, the number of clusters, yet a good clustering with a smaller K can have a lower SSE than a poor clustering with a higher K. A sketch follows.
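A minimal SSE computation on a tiny worked example with known centroids (it can equally be fed the labels and centroids produced by the K-means sketch above):

```python
import numpy as np

def sse(X, labels, centroids):
    """Sum of Squared Error: squared Euclidean distance from each
    point to its cluster's centroid, summed over all points."""
    return float(sum(
        np.sum((X[labels == i] - c) ** 2)
        for i, c in enumerate(centroids)
    ))

# Two obvious clusters; each point lies 1 unit from its centroid.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 1.0]])
print(sse(X, labels, centroids))  # 4.0
```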

29 Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram: a tree-like diagram that records the sequences of merges or splits.

30 Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The clusters may also correspond to meaningful taxonomies, as in the biological sciences (e.g., the animal kingdom, phylogeny reconstruction, ...).

31 Hierarchical Clustering
Two main types of hierarchical clustering. Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left. Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters). Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.

32 Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward: compute the proximity matrix and let each data point be a cluster; then repeatedly merge the two closest clusters and update the proximity matrix, until only a single cluster remains. The key operation is the computation of the proximity of two clusters; the different approaches to defining the distance between clusters distinguish the different algorithms. A sketch using SciPy follows.
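In practice the whole agglomerative procedure is available in SciPy; this minimal sketch builds the merge tree on random illustrative data, draws the dendrogram, and 'cuts' the tree into three clusters:

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

X = np.random.rand(20, 2)  # 20 random 2-D points (illustrative)

# Build the merge tree; 'single', 'complete', 'average', and 'ward'
# match the inter-cluster distance measures discussed above.
Z = linkage(X, method='average')

dendrogram(Z)              # records the sequence of merges
plt.show()

clusters = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters
print(clusters)
```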

33 Starting Situation
Start with clusters of individual points and a proximity matrix. (Figure: points p1, p2, ..., and the initial proximity matrix.)

34 Intermediate Situation
After some merging steps, we have some clusters. (Figure: clusters C1-C5 and the current proximity matrix.)

35 Intermediate Situation
We want to merge the two closest clusters (C2 and C5) and update the proximity matrix. (Figure: clusters C1-C5 and the proximity matrix.)

36 After Merging
The question is: how do we update the proximity matrix? (Figure: the merged cluster C2 U C5, with the rows and columns of the proximity matrix that involve it marked '?'.)

37 How to Define Inter-Cluster Similarity
Options: MIN; MAX; group average; distance between centroids; other methods driven by an objective function (Ward's method uses squared error). (Figure: two candidate clusters and the proximity matrix.)

38-41 How to Define Inter-Cluster Similarity (continued)
(Figures: slides 38-41 repeat the list above, illustrating in turn MIN, MAX, group average, and distance between centroids.)

42 Hierarchical Clustering: Comparison
(Figure: the same six points clustered with MIN, MAX, group average, and Ward's method, with the resulting dendrograms.)

43 References
Tan, P., Steinbach, M., & Kumar, V. Introduction to Data Mining. Pearson Education, Inc.
Han, J., & Kamber, M. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.
Santosa, B. Data Mining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Teori dan Aplikasi. Graha Ilmu, Yogyakarta.

