Decision Trees
Classification: Definition. Given a collection of records (the training set), where each record consists of attributes and one attribute is the class attribute. The task is to find a model for the class attribute as a function of the other attributes. A test set is used to determine the accuracy of the model: the data is usually divided into training data and test data, where the training data is used to build the model and the test data is used to validate it.
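As a concrete illustration of this train/test methodology, here is a minimal Python sketch. It assumes scikit-learn and a synthetic dataset, neither of which is part of these slides: the training split fits a decision tree, and the held-out test split estimates its accuracy.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic records: attribute vectors X and a class attribute y.
X, y = make_classification(n_samples=200, n_features=4, random_state=0)

# Training data builds the model; test data validates it.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))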
Illustration of Classification
Examples of classification models/techniques:
- Decision Tree based methods
- Rule-based methods
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support Vector Machines
Decision Tree Example. Training data: categorical attributes Refund and MarSt, a continuous attribute TaxInc, and the class attribute Cheat. Model (a decision tree over the splitting attributes):
Refund?
  Yes -> NO
  No -> MarSt?
    Married -> NO
    Single, Divorced -> TaxInc?
      < 80K -> NO
      >= 80K -> YES
Decision Tree (2). A different tree that fits the same training data (more than one tree can fit the same data):
MarSt?
  Married -> NO
  Single, Divorced -> Refund?
    Yes -> NO
    No -> TaxInc?
      < 80K -> NO
      >= 80K -> YES
Decision Tree Classification
Applying the Model to Test Data. Start from the root of the tree; at each internal node, follow the branch matching the test record's attribute value until a leaf is reached, then assign the leaf's class label to the record. For the test record (Refund = No, MarSt = Married, TaxInc = 80K): Refund = No leads to the MarSt test, and MarSt = Married leads to the leaf NO, so Cheat is assigned "No".
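To make the traversal concrete, the example tree can be hand-coded as nested if/else tests; this is an illustrative sketch (the function name and the 80,000 encoding of "80K" are ours, not from the slides):

def classify_cheat(refund, marst, taxinc):
    # Root node: test Refund first.
    if refund == "Yes":
        return "No"
    # Refund = No: test marital status.
    if marst == "Married":
        return "No"
    # Single or Divorced: test taxable income against the 80K threshold.
    return "No" if taxinc < 80_000 else "Yes"

# The test record traced above: Refund = No, MarSt = Married.
print(classify_cheat("No", "Married", 80_000))  # -> No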
Tree Induction Issues. How to split the records: how do we choose the attribute for the splitting node, and how do we determine the best split? And when should we stop splitting?
How to Specify the Attribute Test Condition. Depends on the attribute type: nominal, ordinal, or continuous. Depends on the number of ways to split: 2-way split or multi-way split.
Splitting on Nominal Attributes. Multi-way split: use as many partitions as there are distinct values, e.g. CarType -> {Family | Sports | Luxury}. Binary split: divide the values into two subsets, e.g. CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}.
Splitting on Ordinal Attributes. Multi-way split: Size -> {Small | Medium | Large}. Binary split: {Small, Medium} vs. {Large}, or {Medium, Large} vs. {Small}. A split such as {Small, Large} vs. {Medium} is usually avoided because it violates the order of the values.
Splitting on Continuous Attributes. Several approaches: discretization to form an ordinal categorical attribute, or a binary decision (A < v) or (A >= v), which requires examining all possible split points v to find the best one. This can be computationally expensive.
Splitting on a Continuous Attribute (illustration)
Determining the Best Split. Before splitting: 10 records of class 0 and 10 records of class 1. Splits that produce nodes with homogeneous class distributions are preferred, so a measure of node impurity is needed.
Measures of Node Impurity: Gini index, entropy, misclassification error.
GINI. The Gini index at node t is GINI(t) = 1 - Σ_j [p(j|t)]², where p(j|t) is the relative frequency of class j at node t. It is maximum (1 - 1/n_c, with n_c the number of classes) when records are equally distributed over all classes, indicating the least interesting information, and minimum (0.0) when all records belong to a single class, indicating the most interesting information.
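A direct Python transcription of this definition (the function name is ours); the two calls check the extreme cases for a two-class node:

def gini(counts):
    # GINI(t) = 1 - sum_j p(j|t)^2, from the class counts at node t.
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(gini([0, 6]))  # 0.0  (pure node: minimum)
print(gini([3, 3]))  # 0.5  (uniform two-class node: maximum, 1 - 1/2)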
Splitting Based on GINI. When node p is split into k partitions (children), the quality of the split is computed as GINI_split = Σ_{i=1..k} (n_i/n) · GINI(i), where n_i is the number of records at child i and n is the number of records at node p.
Binary Attributes: Computing the GINI Index. Splitting on B divides the node into two parts: node N1 with class counts (C1 = 5, C2 = 2) and node N2 with (C1 = 1, C2 = 4). Gini(N1) = 1 - (5/7)² - (2/7)² = 0.408; Gini(N2) = 1 - (1/5)² - (4/5)² = 0.320; Gini(children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371.
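The same numbers can be reproduced with a short sketch (gini is as defined above):

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    # GINI_split = sum_i (n_i / n) * GINI(child i), over the k children.
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * gini(c) for c in children)

print(round(gini([5, 2]), 3))                  # 0.408
print(round(gini([1, 4]), 3))                  # 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371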
Categorical Attributes: Computing the Gini Index. For each distinct value, gather the counts for each class. Multi-way split: one partition per distinct value. Two-way split: group the values into two subsets and find the best partition of values, as sketched below.
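Finding the best two-way partition means enumerating the candidate groupings; for k distinct values there are 2^(k-1) - 1 of them. A small sketch (the function name is ours), applied to the CarType values from the earlier slide:

from itertools import combinations

def binary_partitions(values):
    # Yield each unordered two-way partition of the values exactly once.
    values = list(values)
    for r in range(1, len(values)):
        for left in combinations(values, r):
            right = tuple(v for v in values if v not in left)
            if left < right:  # avoid emitting each split twice
                yield left, right

for split in binary_partitions(["Family", "Sports", "Luxury"]):
    print(split)
# 3 splits: {Family} vs {Sports, Luxury}, {Family, Sports} vs {Luxury},
# {Family, Luxury} vs {Sports}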
Continuous Attributes: Computing the Gini Index. Use a binary decision on one value. There are several choices for the splitting value: the number of candidate split points equals the number of distinct values.
Continuous Attributes (efficient computation). Sort the attribute values, then linearly scan the candidate split positions, updating the class counts at each step and computing the Gini index; choose the split position with the smallest Gini index.
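A sketch of this sort-and-scan procedure; the function name and the example data are illustrative, not from the slides:

def best_gini_split(values, labels):
    def gini(counts):
        # Gini index from a dict of class counts.
        n = sum(counts.values())
        return 1.0 - sum((c / n) ** 2 for c in counts.values()) if n else 0.0

    data = sorted(zip(values, labels))
    left = {c: 0 for c in set(labels)}
    right = dict(left)
    for _, y in data:
        right[y] += 1

    n = len(data)
    best_g, best_t = float("inf"), None
    for i in range(n - 1):
        v, y = data[i]
        left[y] += 1   # move one record across the split boundary
        right[y] -= 1
        if data[i + 1][0] == v:
            continue   # only split between distinct values
        g = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if g < best_g:
            best_g, best_t = g, (v + data[i + 1][0]) / 2
    return best_g, best_t

print(best_gini_split([60, 70, 75, 85, 90, 95, 100, 120, 125, 220],
                      ["No", "No", "No", "Yes", "Yes", "Yes",
                       "No", "No", "No", "No"]))  # -> (0.3, 97.5)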
Measure Based on INFO. Entropy at node t: Entropy(t) = -Σ_j p(j|t) log2 p(j|t), where p(j|t) is the relative frequency of class j at node t. Entropy measures the homogeneity of a node: it is maximum (log2 n_c) when records are equally distributed over all classes (least information) and minimum (0.0) when all records belong to one class (most information). Entropy-based computations are similar to those for the GINI index.
Example calculations (taking 0 log2 0 = 0):
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Entropy = -0 log2 0 - 1 log2 1 = 0.
P(C1) = 1/6, P(C2) = 5/6: Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65.
P(C1) = 2/6, P(C2) = 4/6: Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92.
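The same calculations in Python, treating 0 · log2 0 as 0 (the function name is ours):

import math

def entropy(counts):
    # Entropy(t) = -sum_j p(j|t) * log2 p(j|t), skipping empty classes.
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92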
Splitting Based on INFO: Information Gain. When parent node p is split into k partitions, GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i/n) · Entropy(i), where n_i is the number of records in partition i. Choose the split with the maximum GAIN. A drawback: information gain tends to prefer splits that result in a large number of small but pure partitions.
Example (TABLE 7.1, p.145: a simple flat database of examples for training). The database has 14 training examples with attributes X1 ∈ {A, B, C}, X2 (numeric, with values such as 65, 70, 75, 78, 80, 85, 90, 95, 96), X3 ∈ {TRUE, FALSE}, and class C ∈ {C1, C2}; sample rows include (B, 65, TRUE, C1) and (A, 70, FALSE, C2).
For attribute x1, with 9 examples in class C1 and 5 in C2 (taking 0 log2 0 = 0):
Entropy(T) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940 bits
Entropy_x1(T) = (5/14)(-(2/5) log2(2/5) - (3/5) log2(3/5)) + (4/14)(-(4/4) log2(4/4) - (0/4) log2(0/4)) + (5/14)(-(3/5) log2(3/5) - (2/5) log2(2/5)) = 0.694 bits
Gain(x1) = 0.940 - 0.694 = 0.246 bits
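These bit counts can be verified numerically; the (C1, C2) counts below come from the example above. Note that the exact gain is 0.247 (the slide's 0.246 comes from rounding the intermediate entropies):

import math

def entropy(counts):
    n = sum(counts)
    return sum(-(c / n) * math.log2(c / n) for c in counts if c)

T = [9, 5]                           # class counts (C1, C2) in T
x1_parts = [[2, 3], [4, 0], [3, 2]]  # counts within x1 = A, B, C

n = sum(T)
info_x1 = sum(sum(p) / n * entropy(p) for p in x1_parts)
print(f"{entropy(T):.3f}")            # 0.940
print(f"{info_x1:.3f}")               # 0.694
print(f"{entropy(T) - info_x1:.3f}")  # 0.247 = Gain(x1)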
Splitting Based on INFO: Gain Ratio. GainRATIO_split = GAIN_split / SplitINFO, with SplitINFO = -Σ_{i=1..k} (n_i/n) log2(n_i/n), where parent node p is split into k partitions and n_i is the number of records in partition i. SplitINFO grows with the number of partitions, so the gain ratio penalizes splits into many small partitions and adjusts for information gain's bias.
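Continuing the x1 example (a sketch; the function name is ours): x1 splits the 14 records into partitions of sizes 5, 4, and 5, so

import math

def split_info(sizes):
    # SplitINFO = -sum_i (n_i / n) * log2(n_i / n), over the k partition sizes.
    n = sum(sizes)
    return sum(-(s / n) * math.log2(s / n) for s in sizes if s)

gain_x1 = 0.247                # from the previous sketch
si = split_info([5, 4, 5])
print(f"{si:.3f}")             # 1.577
print(f"{gain_x1 / si:.3f}")   # 0.157 = GainRATIO(x1)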
Splitting Criteria Based on Classification Error. Classification error at node t: Error(t) = 1 - max_i p(i|t). It measures the misclassification error made by a node: maximum (1 - 1/n_c) when the class distribution is uniform (least information), minimum (0.0) when all records belong to one class (most information).
Examples:
P(C1) = 0/6 = 0, P(C2) = 6/6 = 1: Error = 1 - max(0, 1) = 1 - 1 = 0.
P(C1) = 1/6, P(C2) = 5/6: Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6.
P(C1) = 2/6, P(C2) = 4/6: Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3.
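The corresponding one-liner (the function name is ours):

def classification_error(counts):
    # Error(t) = 1 - max_i p(i|t), from the class counts at node t.
    return 1.0 - max(counts) / sum(counts)

print(classification_error([0, 6]))            # 0.0
print(round(classification_error([1, 5]), 3))  # 0.167 (= 1/6)
print(round(classification_error([2, 4]), 3))  # 0.333 (= 1/3)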
Training Examples (the PlayTennis data):
Day  Outlook   Temp  Humidity  Wind    PlayTennis
D1   Sunny     Hot   High      Weak    No
D2   Sunny     Hot   High      Strong  No
D3   Overcast  Hot   High      Weak    Yes
D4   Rain      Mild  High      Weak    Yes
D5   Rain      Cool  Normal    Weak    Yes
D6   Rain      Cool  Normal    Strong  No
D7   Overcast  Cool  Normal    Strong  Yes
D8   Sunny     Mild  High      Weak    No
D9   Sunny     Cool  Normal    Weak    Yes
D10  Rain      Mild  Normal    Weak    Yes
D11  Sunny     Mild  Normal    Strong  Yes
D12  Overcast  Mild  High      Strong  Yes
D13  Overcast  Hot   Normal    Weak    Yes
D14  Rain      Mild  High      Strong  No
Selecting the Next Attribute. Humidity: High -> [3+, 4-], E = 0.985; Normal -> [6+, 1-], E = 0.592; Gain(S, Humidity) = 0.940 - (7/14)·0.985 - (7/14)·0.592 = 0.151. Wind: Weak -> [6+, 2-], E = 0.811; Strong -> [3+, 3-], E = 1.0; Gain(S, Wind) = 0.940 - (8/14)·0.811 - (6/14)·1.0 = 0.048. Humidity provides greater information gain than Wind with respect to the target classification.
Selecting the Next Attribute. Outlook: Sunny -> [2+, 3-], E = 0.971; Overcast -> [4+, 0-], E = 0.0; Rain -> [3+, 2-], E = 0.971; Gain(S, Outlook) = 0.940 - (5/14)·0.971 - (4/14)·0.0 - (5/14)·0.971 = 0.247.
Selecting the Next Attribute. The information gain values for the four attributes are Gain(S, Outlook) = 0.247, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, and Gain(S, Temperature) = 0.029, where S denotes the collection of training examples. Outlook is therefore chosen as the root attribute.
Growing the tree below the root. S = [D1, ..., D14] with [9+, 5-]. Outlook = Sunny -> S_sunny = [D1, D2, D8, D9, D11] with [2+, 3-] (must be split further); Outlook = Overcast -> [D3, D7, D12, D13] with [4+, 0-] (leaf: Yes); Outlook = Rain -> [D4, D5, D6, D10, D14] with [3+, 2-] (must be split further). For the Sunny branch: Gain(S_sunny, Humidity) = 0.970 - (3/5)·0.0 - (2/5)·0.0 = 0.970; Gain(S_sunny, Temp) = 0.970 - (2/5)·0.0 - (2/5)·1.0 - (1/5)·0.0 = 0.570; Gain(S_sunny, Wind) = 0.970 - (2/5)·1.0 - (3/5)·0.918 = 0.019. Humidity is chosen for the Sunny branch.
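All of these gains can be recomputed from the table; here is a sketch with the rows encoded as tuples (the data structure and names are ours). The exact values differ from the slides in the third decimal place where the slides round intermediate entropies:

import math

# (Outlook, Temp, Humidity, Wind, Play) for days D1..D14, from the table above.
ROWS = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = {"Outlook": 0, "Temp": 1, "Humidity": 2, "Wind": 3}

def entropy(labels):
    n = len(labels)
    return sum(-(labels.count(v) / n) * math.log2(labels.count(v) / n)
               for v in set(labels))

def gain(rows, attr):
    # Information gain of splitting `rows` on `attr` w.r.t. the Play label.
    i = ATTRS[attr]
    parts = {}
    for r in rows:
        parts.setdefault(r[i], []).append(r[-1])
    n = len(rows)
    return entropy([r[-1] for r in rows]) - sum(
        len(p) / n * entropy(p) for p in parts.values())

for a in ATTRS:
    print(a, f"{gain(ROWS, a):.3f}")
# Outlook 0.247, Temp 0.029, Humidity 0.152, Wind 0.048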
ID3 Algorithm: the resulting tree.
Outlook?
  Sunny -> Humidity?
    High -> No [D1, D2, D8]
    Normal -> Yes [D9, D11]
  Overcast -> Yes [D3, D7, D12, D13]
  Rain -> Wind?
    Strong -> No [D6, D14]
    Weak -> Yes [D4, D5, D10]
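Finally, the whole induction can be expressed as a short recursive ID3 sketch. It reuses ROWS, ATTRS, and gain() from the previous block (run the two together); a majority-class leaf handles the case where no attributes remain:

from collections import Counter

def id3(rows, attrs):
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:
        return labels[0]                             # pure node: leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]  # majority-class leaf
    best = max(attrs, key=lambda a: gain(rows, a))   # highest information gain
    rest = [a for a in attrs if a != best]
    i = ATTRS[best]
    return {best: {v: id3([r for r in rows if r[i] == v], rest)
                   for v in set(r[i] for r in rows)}}

print(id3(ROWS, list(ATTRS)))
# {'Outlook': {'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}},
#              'Overcast': 'Yes',
#              'Rain': {'Wind': {'Weak': 'Yes', 'Strong': 'No'}}}}
# (branch key order may vary)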