2 - Statistical Learning
25 August 2015
Training and test data
Bias and variance
Error rate & confidence interval
Linear regression
Lab: data summarization, visualization, linear regression
Classification: Definition
Given a collection of records (training set)
  – Each record contains a set of attributes; one of the attributes is the class.
Find a model for the class attribute as a function of the values of the other attributes.
Goal: previously unseen records should be assigned a class as accurately as possible.
  – A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets, with the training set used to build the model and the test set used to validate it.
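The train/test workflow described above can be sketched in a few lines. This is a minimal illustration in Python (the course lab itself uses R); the records, the function names `holdout_split`, `fit_majority_class`, and `accuracy`, and the "majority class" model are all hypothetical stand-ins chosen only to keep the example short.

```python
import random
from collections import Counter

def holdout_split(records, train_fraction=2/3, seed=0):
    """Shuffle the records and split them into (training set, test set)."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def fit_majority_class(train):
    """Toy 'model': always predict the most common class in the training set."""
    majority = Counter(label for _, label in train).most_common(1)[0][0]
    return lambda attributes: majority

def accuracy(model, test):
    """Fraction of test records whose class the model predicts correctly."""
    correct = sum(1 for attributes, label in test if model(attributes) == label)
    return correct / len(test)

# Hypothetical records of the form (attributes, class)
records = [((i,), "fraud" if i % 5 == 0 else "legit") for i in range(30)]
train, test = holdout_split(records)
model = fit_majority_class(train)
test_accuracy = accuracy(model, test)
print(len(train), len(test), test_accuracy)
```

The model is built from the training set only, and the accuracy is measured only on records the model has not seen. Any real classifier can replace `fit_majority_class` without changing the rest of the workflow.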
Illustrating Classification Task
Examples of Classification Task
Predicting tumor cells as benign or malignant
Classifying credit card transactions as legitimate or fraudulent
Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
Categorizing news stories as finance, weather, entertainment, sports, etc.
Classification Techniques
Logistic regression
Decision tree based methods
Rule-based methods
Memory-based reasoning
Neural networks
Naïve Bayes and Bayesian belief networks
Support vector machines
Methods of Estimation
Holdout (proportion)
  – Reserve 2/3 for training and 1/3 for testing (the proportion depends on the analyst)
Random subsampling
  – Repeated holdout
Cross-validation
  – Partition data into k disjoint subsets
  – k-fold: train on k-1 partitions, test on the remaining one
  – Leave-one-out: k = n
Stratified sampling
  – Oversampling vs. undersampling
Bootstrap (Averaging / Ensemble)
  – Sampling with replacement
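The partitioning behind k-fold cross-validation, leave-one-out, and the bootstrap can be sketched as index manipulation. This is a minimal Python illustration (the course lab uses R); the function names `k_fold_indices`, `train_test_splits`, and `bootstrap_sample` are hypothetical, and the sketch only produces index sets, leaving model fitting out.

```python
import random

def k_fold_indices(n, k, seed=0):
    """Partition indices 0..n-1 into k disjoint folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def train_test_splits(n, k, seed=0):
    """Yield (train_indices, test_indices); each fold is the test set once."""
    folds = k_fold_indices(n, k, seed)
    for i, test in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test

def bootstrap_sample(n, seed=0):
    """Draw n indices with replacement; some repeat, some are left out."""
    rng = random.Random(seed)
    return [rng.randrange(n) for _ in range(n)]

n = 12
four_fold = list(train_test_splits(n, k=4))
leave_one_out = list(train_test_splits(n, k=n))  # leave-one-out is k = n
boot = bootstrap_sample(n)

for train, test in four_fold:
    print("test:", sorted(test), "train size:", len(train))
print("bootstrap:", boot)
```

With k = n every test fold holds exactly one record, which is why leave-one-out needs n model fits while 10-fold needs only ten.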
Linear Regression
A response y is a continuous measurement variable such as sales or profit.
The function f(·) is linear in the k regressor (predictor) variables.
The parameters are usually estimated by least squares (which, for independent normal errors, is identical to maximum likelihood estimation).
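Least squares estimation can be illustrated by solving the normal equations (X'X) b = X'y directly. The following is a minimal pure-Python sketch (the course lab uses R); the function names and the data are made up for illustration, and the solver assumes the regressors are not perfectly collinear.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting.

    Assumes the matrix a is square and non-singular.
    """
    n = len(b)
    # Augmented matrix [a | b], copied so the inputs are not modified
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def least_squares(xs, ys):
    """Return [b0, b1, ..., bk] minimising the sum of squared errors."""
    design = [[1.0] + list(row) for row in xs]  # prepend the intercept column
    p = len(design[0])
    xtx = [[sum(r[i] * r[j] for r in design) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * y for r, y in zip(design, ys)) for i in range(p)]
    return solve_linear_system(xtx, xty)

# Made-up data generated exactly from y = 1 + 2*x1 - 3*x2 (no noise)
xs = [(0, 0), (1, 0), (0, 1), (1, 1), (2, 1), (1, 3)]
ys = [1 + 2 * x1 - 3 * x2 for x1, x2 in xs]
coefficients = least_squares(xs, ys)
print(coefficients)
```

Because the example data contain no noise, the estimated coefficients recover the generating values up to floating-point rounding. Statistical software typically uses a more numerically stable decomposition than the normal equations.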
Linear Regression Concepts
Linear Regression Evaluation
Given a set of predictions for m new cases, we can evaluate them using:
  – The mean error, which should be close to zero; a mean error different from zero indicates a bias in the forecasts.
  – The root mean square error, which expresses the magnitude of the forecast error in the units of the response variable.
  – The mean absolute percent error, which expresses the forecast error in percentage terms.
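The three measures above (ME, RMSE, MAPE) can be computed as follows. This is a minimal Python sketch (the course lab uses R); the function name `forecast_errors` and the numbers are hypothetical, and the sign convention error = actual - predicted is an assumption, since the slide does not state one.

```python
import math

def forecast_errors(actual, predicted):
    """Return (ME, RMSE, MAPE) for m predictions.

    Assumes error = actual - predicted and that no actual value is zero,
    since MAPE divides by the actual values.
    """
    errors = [a - p for a, p in zip(actual, predicted)]
    m = len(errors)
    me = sum(errors) / m                               # mean error (bias)
    rmse = math.sqrt(sum(e * e for e in errors) / m)   # in units of the response
    mape = 100 * sum(abs(e) / abs(a) for e, a in zip(errors, actual)) / m
    return me, rmse, mape

# Made-up values for illustration
actual = [2.0, 4.0, 5.0]
predicted = [1.0, 5.0, 5.0]
me, rmse, mape = forecast_errors(actual, predicted)
print(me, rmse, mape)
```

In this example the positive and negative errors cancel, so ME is zero even though individual predictions are off; RMSE and MAPE still reflect the size of the errors.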
ESTIMATION IN R
EXAMPLE 1 (3.1): FUEL EFFICIENCY OF AUTOMOBILES
We try to model fuel efficiency, measured in GPM (gallons per 100 miles, GPM = 100/MPG), as a function of:
  – WT = weight of the car (in 1000 lb),
  – DIS = cubic displacement (in cubic inches),
  – NC = number of cylinders,
  – HP = horsepower,
  – ACC = acceleration (in seconds from 0 to 60 mph),
  – ET = engine type (V-type and straight (coded as 1)).
QUIZ 1 (10 minutes)
1. What is data mining, in your own view?
2. Name 4 tasks in data mining.
3. What is meant by predictive and descriptive tasks in data mining?
4. Give two examples for each of the following data types:
   1. Nominal
   2. Ordinal
   3. Interval
   4. Ratio
ASSIGNMENT 1: Data Analysis and Linear Regression
1. Use the FuelEff data
   1. Explain the meaning of each line of code in the leave-one-out cross-validation of the linear regression.
   2. Modify the cross-validation code to perform 10-fold cross-validation of the linear regression.
   3. Briefly analyze whether the 10-fold ME, RMSE, and MAPE are better or worse than those from leave-one-out.
2. Use the Orange Juice data
   1. Follow and try to understand the steps in the DMBAR book, section 2.3 (not to be submitted).
   2. Give a summary of the Orange Juice data: which attributes are nominal and which are numeric.
   3. Perform linear regression (numeric data only) with ‘logmove’ as the variable to be predicted.
      1. Use all numeric variables.
      2. Choose the one numeric variable most strongly correlated with ‘logmove’.
      3. Perform cross-validation, both leave-one-out and 10-fold, for all variables and for the variable chosen in item 2.
      4. Which combination do you choose: leave-one-out or 10-fold, all variables or the chosen one? Give clear reasons.
Due: 8 September 2015
Submit to your shared folder
Late penalty: -25% per day