Decision Trees
Classification: Definition
- Given a collection of records (transactions), the training set; each record consists of attributes, one of which is the class attribute.
- The task is to find a model for the class attribute as a function of the other attributes.
- A test set is used to determine the accuracy of the model. Usually the data is divided into a training set and a test set: the training set is used to build the model and the test set to validate it.
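To make the train/test workflow concrete, here is a minimal sketch using scikit-learn and its bundled Iris dataset; the library, dataset, and 70/30 split are illustrative assumptions, not part of the original slides.

```python
# Minimal sketch of the train/test workflow (assumes scikit-learn).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)

# Training data is used to build the model ...
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42)
model = DecisionTreeClassifier().fit(X_train, y_train)

# ... and the held-out test data is used to validate it.
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```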
Illustration of Classification
[figure]
Classification Models/Techniques (examples)
- Decision tree based methods
- Rule-based methods
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support vector machines
Decision Tree
An example model built from training data whose records have categorical attributes (Refund, MarSt), a continuous attribute (TaxInc), and a class attribute (Cheat). The splitting attributes appear at the internal nodes:

Refund?
- Yes -> NO
- No -> MarSt?
  - Single, Divorced -> TaxInc?
    - < 80K  -> NO
    - >= 80K -> YES
  - Married -> NO
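Read as a program, the tree above is just nested conditionals. A minimal sketch, assuming each record is a dict keyed by the three attribute names used in the tree; the sample record is hypothetical.

```python
def classify(record):
    """Apply the decision tree above to a single record."""
    if record["Refund"] == "Yes":
        return "No"                               # leaf: NO
    if record["MarSt"] == "Married":
        return "No"                               # leaf: NO
    # Single or Divorced: test the continuous attribute
    return "No" if record["TaxInc"] < 80 else "Yes"

# Hypothetical test record: Refund = No, Married, TaxInc = 80K
print(classify({"Refund": "No", "MarSt": "Married", "TaxInc": 80}))  # -> No
```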
Decision Tree (2)
A second tree that fits the same training data, splitting on MarSt first; more than one tree can fit the same data:

MarSt?
- Married -> NO
- Single, Divorced -> Refund?
  - Yes -> NO
  - No  -> TaxInc?
    - < 80K  -> NO
    - >= 80K -> YES
Decision Tree Classification
[figure]
Applying the Model to Test Data
Start from the root of the tree. [figure: a test record and the decision tree]
Applying the Model to Test Data (continued)
At each internal node, follow the branch matching the test record's attribute value: the record passes the Refund test, reaches the Married branch under MarSt, and Cheat is assigned to "No". [figures: step-by-step traversal of the decision tree]
Tree Induction: Issues
- Determining how to split the records:
  - How should the attribute test condition be specified?
  - How is the best split determined?
- Determining when to stop splitting.
A minimal sketch of the resulting greedy procedure follows this list.
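This sketch assumes records as dicts of categorical attributes and uses the Gini index (introduced on the following slides) as the split criterion; the toy data is hypothetical.

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def best_attribute(records, labels, attributes):
    """Pick the attribute whose multi-way split has the lowest weighted Gini."""
    def weighted_gini(attr):
        return sum(
            len(part) / len(labels) * gini(part)
            for v in {r[attr] for r in records}
            for part in [[l for r, l in zip(records, labels) if r[attr] == v]])
    return min(attributes, key=weighted_gini)

def induce(records, labels, attributes):
    # Stop splitting when the node is pure or no attributes remain.
    if len(set(labels)) == 1 or not attributes:
        return max(set(labels), key=labels.count)   # majority-class leaf
    attr = best_attribute(records, labels, attributes)
    rest = [a for a in attributes if a != attr]
    branches = {}
    for v in {r[attr] for r in records}:
        sub = [(r, l) for r, l in zip(records, labels) if r[attr] == v]
        branches[v] = induce([r for r, _ in sub], [l for _, l in sub], rest)
    return (attr, branches)

# Toy example (hypothetical records):
records = [{"Refund": "Yes", "MarSt": "Single"},
           {"Refund": "No",  "MarSt": "Married"},
           {"Refund": "No",  "MarSt": "Single"}]
print(induce(records, ["No", "No", "Yes"], ["Refund", "MarSt"]))
```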
How to Specify the Attribute Test Condition
- Depends on the attribute type: nominal, ordinal, continuous.
- Depends on the number of ways to split: 2-way split, multi-way split.
Splitting on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values: CarType -> {Family}, {Sports}, {Luxury}
- Binary split: divide the values into two subsets: CarType -> {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}; the candidate groupings can be enumerated as in the sketch below.
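The binary groupings shown above come from enumerating all two-subset partitions of the attribute's values:

```python
from itertools import combinations

def binary_partitions(values):
    """Yield each way to split nominal values into two non-empty subsets."""
    values = sorted(values)
    for k in range(1, len(values)):
        for left in combinations(values, k):
            right = tuple(v for v in values if v not in left)
            if left < right:                  # emit each unordered pair once
                yield set(left), set(right)

for left, right in binary_partitions({"Family", "Sports", "Luxury"}):
    print(left, "vs", right)
# Three partitions for 3 values (2^(3-1) - 1 = 3), e.g. {'Family'} vs
# {'Luxury', 'Sports'} and {'Family', 'Luxury'} vs {'Sports'}.
```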
Splitting on Ordinal Attributes
- Multi-way split: Size -> {Small}, {Medium}, {Large}
- Binary split: Size -> {Small, Medium} vs {Large}, or {Medium, Large} vs {Small}. A grouping such as {Small, Large} vs {Medium} violates the value order, so it is not a valid split for an ordinal attribute.
Splitting on Continuous Attributes
Several approaches:
- Discretization to form an ordinal categorical attribute.
- Binary decision: (A < v) or (A >= v); consider all possible split values and find the best one, which can require substantial computation.
Splitting Based on a Continuous Attribute
[figure]
Determining the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1. The greedy strategy prefers the test condition whose child nodes have the most homogeneous (purest) class distribution; the measures on the following slides quantify node impurity.
Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error
GINI
Gini index at node t:

GINI(t) = 1 - \sum_j [p(j | t)]^2

where p(j | t) is the relative frequency of class j at node t.
- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.
Splitting Based on GINI
When a node p is split into k partitions (children), the quality of the split is computed as

GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

where n_i = number of records at child i and n = number of records at node p.
Binary Attributes: Computing the GINI Index
Splitting into two partitions on test B? (Yes -> node N1, No -> node N2), with class counts N1: C1 = 5, C2 = 2 and N2: C1 = 1, C2 = 4:
Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
Gini(children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
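These values can be checked with a few lines; gini and gini_split implement the two formulas from the preceding slides, applied to the class counts of N1 and N2.

```python
def gini(counts):
    """Gini index from a list of per-class record counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Weighted Gini index of the child nodes."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

n1, n2 = [5, 2], [1, 4]                     # class counts (C1, C2) per node
print(f"{gini(n1):.3f}")                    # 0.408
print(f"{gini(n2):.3f}")                    # 0.320
print(f"{gini_split([n1, n2]):.3f}")        # 0.371
```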
Categorical Attributes: Computing the Gini Index
- Multi-way split: one partition per distinct value.
- Two-way split: find the best partition of the values into two groups.
Continuous Attributes: Computing the GINI Index
- Use binary decisions based on one splitting value.
- There are several choices for the splitting value; the number of candidate values equals the number of distinct values of the attribute.
Continuous Attributes: Choosing the Best Split
For efficient computation: sort the attribute values, scan them linearly while updating the class counts, compute the Gini index at each candidate split position, and choose the split position with the smallest Gini index, as in the sketch below. [table: sorted values and candidate split positions]
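A sketch of that sorted scan, assuming the data arrives as (value, label) pairs; candidate thresholds are taken midway between consecutive distinct values, and the running class counts avoid re-counting at every position. The income figures are hypothetical.

```python
from collections import Counter

def gini(counter, n):
    return 1.0 - sum((c / n) ** 2 for c in counter.values())

def best_threshold(pairs):
    """Return (threshold, weighted Gini) minimizing the Gini index."""
    pairs = sorted(pairs)
    n = len(pairs)
    left, right = Counter(), Counter(label for _, label in pairs)
    best = None
    for i, (v, label) in enumerate(pairs[:-1]):
        left[label] += 1                    # move one record to the left side
        right[label] -= 1
        if pairs[i + 1][0] == v:            # split only between distinct values
            continue
        t = (v + pairs[i + 1][0]) / 2
        w = ((i + 1) / n * gini(left, i + 1)
             + (n - i - 1) / n * gini(right, n - i - 1))
        if best is None or w < best[1]:
            best = (t, w)
    return best

data = [(60, "No"), (70, "No"), (75, "No"), (85, "Yes"),
        (90, "Yes"), (95, "Yes"), (100, "No"), (120, "No")]
print(best_threshold(data))                 # -> (80.0, 0.3), up to float rounding
```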
Measure: INFO (Entropy)
Entropy at node t:

Entropy(t) = -\sum_j p(j | t) log2 p(j | t)

where p(j | t) is the relative frequency of class j at node t. Entropy measures the homogeneity of a node.
- Maximum (log2 n_c) when records are equally distributed among all classes, implying the least information.
- Minimum (0.0) when all records belong to one class, implying the most information.
Entropy-based computations are similar to those of the GINI index.
Examples of Computing Entropy
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1:
Entropy = -0 log2 0 - 1 log2 1 = -0 - 0 = 0
P(C1) = 1/6, P(C2) = 5/6:
Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
P(C1) = 2/6, P(C2) = 4/6:
Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
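The three node entropies above can be reproduced directly (log base 2, with 0 * log 0 taken as 0):

```python
from math import log2

def entropy(counts):
    """Entropy from per-class record counts; 0*log(0) is treated as 0."""
    n = sum(counts)
    # log2(n/c) = -log2(c/n), so all terms are non-negative
    return sum(c / n * log2(n / c) for c in counts if c > 0)

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, round(entropy(counts), 2))
# -> [0, 6] 0.0, [1, 5] 0.65, [2, 4] 0.92
```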
Splitting Based on INFO: Information Gain
When a parent node p is split into k partitions (children), with n_i records in partition i:

GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

Choose the split with the maximum GAIN.
Example: TABLE 7.1 (p. 145), a simple flat database of examples for training
The database contains 14 training examples, each with three attributes, X1 in {A, B, C}, X2 (numeric, e.g. 65, 70, 75, 78, 80, 85, 90, 95, 96), X3 in {TRUE, FALSE}, and a class C in {C1, C2}; for example (B, 65, TRUE, C1) and (A, 70, FALSE, C2). Nine examples belong to class C1 and five to C2. [full table not recoverable]
Computing the Gain for x1

Entropy(T) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits

Entropy_x1(T) = (5/14)(-(2/5) log2(2/5) - (3/5) log2(3/5))
              + (4/14)(-(4/4) log2(4/4) - (0/4) log2(0/4))
              + (5/14)(-(3/5) log2(3/5) - (2/5) log2(2/5))
              = 0.694 bits

Gain(x1) = 0.940 - 0.694 = 0.246 bits
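The same arithmetic in code: the per-value class counts for x1 (A: 2 vs 3, B: 4 vs 0, C: 3 vs 2) are read off the calculation above.

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return sum(c / n * log2(n / c) for c in counts if c > 0)

def gain(parent, partitions):
    """Information gain = parent entropy - weighted child entropies."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

parent = [9, 5]                           # 9 examples in C1, 5 in C2
x1_parts = [[2, 3], [4, 0], [3, 2]]       # class counts for X1 = A, B, C
print(f"{entropy(parent):.3f}")           # 0.940
print(f"{gain(parent, x1_parts):.3f}")    # 0.247 (the slide's 0.246 comes from
                                          #  the rounded 0.940 - 0.694)
```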
Splitting Based on INFO: Gain Ratio
When a parent node p is split into k partitions, with n_i records in partition i:

GainRATIO_split = GAIN_split / SplitINFO,   SplitINFO = -\sum_{i=1}^{k} (n_i / n) log2(n_i / n)

The gain ratio adjusts the information gain by the entropy of the partitioning, which penalizes splitting into a large number of small partitions.
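Continuing the Table 7.1 example: x1 splits the 14 records into partitions of sizes 5, 4, and 5, so

```python
from math import log2

def split_info(sizes):
    """SplitINFO = entropy of the partition sizes themselves."""
    n = sum(sizes)
    return sum(s / n * log2(n / s) for s in sizes if s > 0)

si = split_info([5, 4, 5])        # x1 partitions 14 records into 5 / 4 / 5
print(f"{si:.3f}")                # 1.577
print(f"{0.247 / si:.3f}")        # gain ratio for x1, roughly 0.157
```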
Splitting Criteria Based on Classification Error
Classification error at a node t:

Error(t) = 1 - max_i p(i | t)

This measures the misclassification error made by a node.
- Maximum (1 - 1/n_c) when records are equally distributed among all classes, implying the least information.
- Minimum (0.0) when all records belong to one class, implying the most information.
Example
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1:
Error = 1 - max(0, 1) = 1 - 1 = 0
P(C1) = 1/6, P(C2) = 5/6:
Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
P(C1) = 2/6, P(C2) = 4/6:
Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3
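The same three nodes under the error measure, for comparison with the Gini and entropy examples:

```python
def classification_error(counts):
    """Misclassification error from per-class record counts."""
    n = sum(counts)
    return 1 - max(counts) / n

for counts in ([0, 6], [1, 5], [2, 4]):
    print(counts, f"{classification_error(counts):.3f}")
# -> [0, 6] 0.000, [1, 5] 0.167 (= 1/6), [2, 4] 0.333 (= 1/3)
```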
Training Examples

Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Selecting the Next Attribute
Humidity: High [3+, 4-], E = 0.985; Normal [6+, 1-], E = 0.592
Gain(S, Humidity) = 0.940 - (7/14)*0.985 - (7/14)*0.592 = 0.151
Wind: Weak [6+, 2-], E = 0.811; Strong [3+, 3-], E = 1.0
Gain(S, Wind) = 0.940 - (8/14)*0.811 - (6/14)*1.0 = 0.048
Humidity provides greater information gain than Wind with respect to the target classification.
Selecting the Next Attribute
Outlook: Sunny [2+, 3-], E = 0.971; Overcast [4+, 0-], E = 0.0; Rain [3+, 2-], E = 0.971
Gain(S, Outlook) = 0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971 = 0.247
Selecting the Next Attribute
The information gain values for the 4 attributes are:
Gain(S, Outlook) = 0.247
Gain(S, Humidity) = 0.151
Gain(S, Wind) = 0.048
Gain(S, Temperature) = 0.029
where S denotes the collection of training examples. These values can be recomputed from the table, as the sketch below shows.
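The entropy and gain helpers below repeat the formulas from the earlier slides, applied to the Training Examples table.

```python
from math import log2

# PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return sum(labels.count(c) / n * log2(n / labels.count(c))
               for c in set(labels))

def gain(rows, attr):
    col, labels = ATTRS[attr], [r[-1] for r in rows]
    g = entropy(labels)
    for v in {r[col] for r in rows}:
        part = [r[-1] for r in rows if r[col] == v]
        g -= len(part) / len(rows) * entropy(part)
    return g

for a in ATTRS:
    print(a, f"{gain(DATA, a):.3f}")
# -> Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048
#    (0.152 vs. the slide's 0.151 is intermediate-rounding)
```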
Selecting the Next Attribute (Sunny branch)
Splitting [D1, D2, ..., D14] [9+, 5-] on Outlook gives:
- Sunny -> S_sunny = [D1, D2, D8, D9, D11] [2+, 3-] (?)
- Overcast -> [D3, D7, D12, D13] [4+, 0-] (Yes)
- Rain -> [D4, D5, D6, D10, D14] [3+, 2-] (?)
For the Sunny branch:
Gain(S_sunny, Humidity) = 0.970 - (3/5)*0.0 - (2/5)*0.0 = 0.970
Gain(S_sunny, Temp) = 0.970 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0 = 0.570
Gain(S_sunny, Wind) = 0.970 - (2/5)*1.0 - (3/5)*0.918 = 0.019
so Humidity is selected for the Sunny branch.
ID3 Algorithm: Resulting Tree

Outlook?
- Sunny -> Humidity?
  - High   -> No  [D1, D2, D8]
  - Normal -> Yes [D9, D11]
- Overcast -> Yes [D3, D7, D12, D13]
- Rain -> Wind?
  - Strong -> No  [D6, D14]
  - Weak   -> Yes [D4, D5, D10]
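As a closing check, a compact entropy-based recursion (the core of ID3) run on the Training Examples table reproduces this tree; the data layout repeats the sketch after the gain-summary slide.

```python
from math import log2

# PlayTennis examples: (Outlook, Temp, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"), ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"), ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"), ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return sum(labels.count(c) / n * log2(n / labels.count(c))
               for c in set(labels))

def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                 # pure node: leaf
        return labels[0]
    if not attrs:                             # no tests left: majority leaf
        return max(set(labels), key=labels.count)

    def gain(a):                              # information gain of attribute a
        col, g = ATTRS[a], entropy(labels)
        for v in {r[col] for r in rows}:
            part = [r[-1] for r in rows if r[col] == v]
            g -= len(part) / len(rows) * entropy(part)
        return g

    best = max(attrs, key=gain)               # attribute with maximum gain
    col = ATTRS[best]
    rest = [a for a in attrs if a != best]
    return {best: {v: id3([r for r in rows if r[col] == v], rest)
                   for v in {r[col] for r in rows}}}

print(id3(DATA, list(ATTRS)))
# -> {'Outlook': {'Overcast': 'Yes',
#                 'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#                 'Rain': {'Wind': {'Strong': 'No', 'Weak': 'Yes'}}}}
#    (branch order may vary)
```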