Clustering: Lecture 6
Introduction
Clustering is an unsupervised learning technique: the method does not have to be trained, or in other words, there is no learning phase. Cluster analysis divides data into groups (clusters) that are meaningful, useful, or both. Cluster analysis groups data objects based only on information found in the data itself, which describes the objects and their relationships.
Cluster Analysis
The goal of cluster analysis is that objects within a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. The greater the similarity (or homogeneity) within a group and the greater the difference between groups, the better (or more distinct) the clustering.
Figure: inter-cluster distances are maximized; intra-cluster distances are minimized.
Applications of Cluster Analysis
Understanding: group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations.
Summarization: reduce the size of large data sets.
Figure: clustering precipitation in Australia.
How Many Clusters?
The best clustering depends on the nature of the data and on the desired result.
Figure: the same set of points plausibly grouped into two, four, or six clusters.
Types of Clustering
Hierarchical versus partitional:
Partitional clustering divides the set of data objects into non-overlapping subsets (clusters), so that each data object is in exactly one cluster.
Hierarchical clustering: clusters may have subclusters; a set of nested clusters organized as a tree.
Partitional Clustering
Figure: a set of original points and a partitional clustering of them.
Hierarchical Clustering
Figure: a traditional hierarchical clustering with its dendrogram, and a non-traditional hierarchical clustering with its dendrogram.
Other Types of Clustering
Exclusive versus non-exclusive: in non-exclusive clustering, points may belong to multiple clusters; this can represent multiple classes or 'border' points.
Fuzzy versus non-fuzzy: in fuzzy clustering, every object belongs to every cluster with a membership weight between 0 (absolutely not a member) and 1 (absolutely a member); the weights must sum to 1.
Partial versus complete: a complete clustering assigns every object to a cluster, whereas a partial clustering does not. The motivation for partial clustering is that some objects in a data set may not belong to any well-defined group; many objects may represent noise or outliers.
Types of Clusters
Well-separated clusters
Center-based clusters
Contiguous clusters
Density-based clusters
Clusters defined by a shared property or concept
Clusters described by an objective function
Well-Separated Clusters
A cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster.
Figure: 3 well-separated clusters.
Center-Based Clusters
A cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its own cluster than to the center of any other cluster. The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster.
Figure: 4 center-based clusters.
Contiguity-Based Clusters (Nearest Neighbor or Transitive)
A cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster. The objects in a group are connected to one another but have no connection to objects outside the group. The data can be represented as a graph, where the objects are nodes and the links represent connections between objects.
Figure: 8 contiguous clusters.
Density-Based Clusters
A cluster is a dense region of objects surrounded by a region of low density. Used when the clusters are irregular, and when noise and outliers are present.
Figure: 6 density-based clusters.
Shared-Property (Conceptual) Clusters
Finds clusters that share some common property or represent a particular concept.
Figure: 2 overlapping circles.
Similarity and Dissimilarity
Similarity: a numerical measure of how alike two data objects are; it is higher when objects are more alike, and often falls in the range [0, 1].
Dissimilarity: a numerical measure of how different two data objects are; it is lower when objects are more alike. The minimum dissimilarity is often 0, while the upper limit varies.
Proximity refers to either a similarity or a dissimilarity.
Euclidean Distance
$$\mathrm{dist}(p, q) = \sqrt{\sum_{k=1}^{n} (p_k - q_k)^2}$$
where $n$ is the number of dimensions (attributes) and $p_k$ and $q_k$ are, respectively, the $k$th attributes (components) of data objects $p$ and $q$. Standardization is necessary if scales differ.
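A minimal Python sketch of this formula (the function name is illustrative, not from the slides):

```python
import math

def euclidean_distance(p, q):
    """Euclidean distance between two n-dimensional points p and q."""
    # Illustrative sketch of the formula above, not a library API.
    return math.sqrt(sum((pk - qk) ** 2 for pk, qk in zip(p, q)))

# Example: the distance between (0, 2) and (3, 6) is sqrt(9 + 16) = 5.
print(euclidean_distance([0, 2], [3, 6]))  # 5.0
```

If the attribute scales differ, each attribute would first be standardized, for example by subtracting its mean and dividing by its standard deviation.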
Euclidean Distance
Figure: example points in the plane and their pairwise distance matrix.
Cosine Similarity
If $x$ and $y$ are two document vectors, then
$$\cos(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$
where $\cdot$ denotes the vector dot product and $\lVert x \rVert$ is the length of vector $x$, $\lVert x \rVert = \sqrt{\sum_{k=1}^{n} x_k^2}$.
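A short sketch of the cosine formula in Python, with vectors as plain lists (names are illustrative):

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| * ||y||)."""
    dot = sum(xk * yk for xk, yk in zip(x, y))
    norm_x = math.sqrt(sum(xk ** 2 for xk in x))
    norm_y = math.sqrt(sum(yk ** 2 for yk in y))
    return dot / (norm_x * norm_y)

# Two document term-frequency vectors.
print(cosine_similarity([3, 2, 0, 5], [1, 0, 0, 2]))  # approx. 0.943
```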
Extended Jaccard Coefficient (Tanimoto Coefficient) and Correlation
$$EJ(x, y) = \frac{x \cdot y}{\lVert x \rVert^2 + \lVert y \rVert^2 - x \cdot y}$$
$$\mathrm{corr}(x, y) = \frac{\mathrm{covariance}(x, y)}{\mathrm{std}(x)\,\mathrm{std}(y)}$$
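A sketch of the extended Jaccard coefficient under the formula above (the function name is an assumption, not from the slides); for correlation, Python's standard statistics.correlation (Python 3.10+) computes the same quantity:

```python
def tanimoto(x, y):
    """Extended Jaccard: (x . y) / (||x||^2 + ||y||^2 - x . y)."""
    # Illustrative sketch; for binary vectors this reduces to the
    # ordinary Jaccard coefficient.
    dot = sum(xk * yk for xk, yk in zip(x, y))
    return dot / (sum(xk ** 2 for xk in x) + sum(yk ** 2 for yk in y) - dot)

print(tanimoto([1, 1, 0, 1], [1, 0, 0, 1]))  # 2 / (3 + 2 - 2) = 0.666...
```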
Distance Measures Between Clusters
Maximum distance between elements of the two clusters (complete linkage clustering):
$$d(A, B) = \max\{S_{x,y} : x \in A,\ y \in B\}$$
where $S_{x,y}$ is the distance between two data points $x$ and $y$ taken from clusters $A$ and $B$, respectively.
Minimum distance between elements of the two clusters (single linkage clustering):
$$d(A, B) = \min\{S_{x,y} : x \in A,\ y \in B\}$$
Distance Measures Between Clusters
Average distance between elements of the two clusters (average linkage clustering):
$$d(A, B) = \frac{1}{n_A n_B} \sum_{x \in A} \sum_{y \in B} S_{x,y}$$
where $n_A$ and $n_B$ are the numbers of data points in clusters $A$ and $B$, respectively.
Centroid linkage: the distance between the centroids of clusters $A$ and $B$.
Ward linkage: closely related to centroid linkage, but measures the distance between clusters $A$ and $B$ as the increase in squared error when the two clusters are merged.
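The three pairwise-distance linkages translate directly into code; a sketch assuming the euclidean_distance function from the earlier example (any distance $S_{x,y}$ would do):

```python
def single_linkage(A, B, dist):
    """Minimum distance between any x in A and y in B."""
    return min(dist(x, y) for x in A for y in B)

def complete_linkage(A, B, dist):
    """Maximum distance between any x in A and y in B."""
    return max(dist(x, y) for x in A for y in B)

def average_linkage(A, B, dist):
    """Average of all nA * nB pairwise distances."""
    return sum(dist(x, y) for x in A for y in B) / (len(A) * len(B))

A = [[0, 0], [0, 1]]
B = [[3, 0], [4, 0]]
print(single_linkage(A, B, euclidean_distance))    # 3.0
print(complete_linkage(A, B, euclidean_distance))  # sqrt(17), approx. 4.123
```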
Clustering Algorithms
K-means and its variants
Hierarchical clustering
Fuzzy clustering
K-means Clustering
A partitional clustering approach. Each cluster is associated with a centroid (center point), and each point is assigned to the cluster with the closest centroid. The number of clusters, K, must be specified. The basic algorithm is very simple.
K-means Clustering: Details
Initial centroids are often chosen randomly, so the clusters produced vary from one run to another. The centroid is (typically) the mean of the points in the cluster. 'Closeness' is measured by Euclidean distance, cosine similarity, correlation, etc. K-means will converge for the common similarity measures mentioned above; most of the convergence happens in the first few iterations, so the stopping condition is often changed to 'until relatively few points change clusters'. Complexity is O(n * K * I * d), where n = number of points, K = number of clusters, I = number of iterations, and d = number of attributes. A sketch of the basic algorithm follows.
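A compact pure-Python sketch of the basic algorithm, reusing the euclidean_distance function from earlier (in practice a library such as scikit-learn's KMeans would be used):

```python
import random

def kmeans(points, k, max_iters=100):
    """Basic K-means sketch: random initial centroids, then alternate
    between assigning points to the nearest centroid and recomputing
    each centroid as the mean of its cluster."""
    centroids = random.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each point goes to its closest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: euclidean_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:  # no centroid moved: converged
            break
        centroids = new_centroids
    return clusters, centroids
```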
Two Different K-means Clusterings
Figure: the same original points partitioned into an optimal clustering and a sub-optimal clustering.
Importance of Choosing Initial Centroids
Figure: two runs of K-means from different initial centroids, one converging to a good clustering and the other to a poor one.
Evaluating K-means Clusters
The most common measure is the Sum of Squared Error (SSE). For each point, the error is the distance to the nearest cluster centroid; to get the SSE, we square these errors and sum them:
$$SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}(x, m_i)^2$$
where $x$ is a data point in cluster $C_i$ and $m_i$ is the representative point for cluster $C_i$; one can show that $m_i$ corresponds to the center (mean) of the cluster. Given two clusterings, we can choose the one with the smallest error. One easy way to reduce the SSE is to increase K, the number of clusters, yet a good clustering with smaller K can have a lower SSE than a poor clustering with higher K.
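The SSE formula maps directly onto the output of the kmeans sketch above (assuming clusters and centroids as returned there):

```python
def sse(clusters, centroids):
    """Sum over all clusters of squared point-to-centroid distances."""
    return sum(
        euclidean_distance(p, m) ** 2
        for c, m in zip(clusters, centroids)
        for p in c
    )

# Of several runs with random initial centroids, keep the lowest-SSE result.
```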
Hierarchical Clustering
Produces a set of nested clusters organized as a hierarchical tree. Can be visualized as a dendrogram, a tree-like diagram that records the sequences of merges or splits.
Strengths of Hierarchical Clustering
You do not have to assume any particular number of clusters: any desired number of clusters can be obtained by 'cutting' the dendrogram at the proper level. The clusters may also correspond to meaningful taxonomies, for example in the biological sciences (the animal kingdom, phylogeny reconstruction, ...).
Hierarchical Clustering
There are two main types of hierarchical clustering.
Agglomerative: start with the points as individual clusters; at each step, merge the closest pair of clusters until only one cluster (or k clusters) is left.
Divisive: start with one all-inclusive cluster; at each step, split a cluster until each cluster contains a single point (or there are k clusters).
Traditional hierarchical algorithms use a similarity or distance matrix and merge or split one cluster at a time.
Agglomerative Clustering Algorithm
The more popular hierarchical clustering technique. The basic algorithm is straightforward:
1. Compute the proximity matrix.
2. Let each data point be a cluster.
3. Repeat: merge the two closest clusters and update the proximity matrix, until only a single cluster remains.
The key operation is the computation of the proximity of two clusters; different approaches to defining the distance between clusters distinguish the different algorithms. See the library-based sketch below.
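In practice this loop is rarely hand-written; as a sketch, SciPy's scipy.cluster.hierarchy implements it, with the method argument selecting how cluster proximity is defined:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Four toy 2-D points forming two obvious groups (made-up example data).
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.1, 4.9]])

# linkage() performs the agglomerative merge loop; 'single', 'complete',
# 'average', and 'ward' correspond to the linkage definitions earlier.
Z = linkage(points, method="single")

# Cut the resulting tree into 2 flat clusters.
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # e.g. [1 1 2 2]
```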
Starting Situation
Start with clusters of individual points and a proximity matrix.
Figure: points p1 through p5 and their proximity matrix.
Intermediate Situation
After some merging steps, we have some clusters. We want to merge the two closest clusters (C2 and C5) and update the proximity matrix.
Figure: clusters C1 through C5 and their proximity matrix.
After Merging
The question is: how do we update the proximity matrix?
Figure: the proximity matrix after merging C2 and C5; the entries between the merged cluster C2 U C5 and the other clusters are unknown (marked '?').
How to Define Inter-Cluster Similarity
MIN
MAX
Group Average
Distance Between Centroids
Other methods driven by an objective function (Ward's Method uses squared error)
Figure: a sequence of slides highlighting each of these proximity definitions in turn on the same points and proximity matrix.
Hierarchical Clustering: Comparison
Figure: the same six points clustered with MIN, MAX, Group Average, and Ward's Method, showing how the choice of linkage changes the resulting hierarchy.
References
Tan P., Steinbach M., & Kumar V. 2006. Introduction to Data Mining. Pearson Education, Inc.
Han J. & Kamber M. 2006. Data Mining: Concepts and Techniques. Morgan Kaufmann, San Diego.
Santosa B. 2007. Data Mining: Teknik Pemanfaatan Data untuk Keperluan Bisnis, Teori dan Aplikasi. Graha Ilmu, Jogjakarta.