Data Mining: 7. Estimation and Forecasting Algorithms
Romi Satria Wahono
Romi Satria Wahono
SD Sompok Semarang (1987)
SMPN 8 Semarang (1990)
SMA Taruna Nusantara Magelang (1993)
B.Eng, M.Eng and Ph.D in Software Engineering from Saitama University Japan ( ) Universiti Teknikal Malaysia Melaka (2014)
Research Interests: Software Engineering, Machine Learning
Founder and Coordinator of IlmuKomputer.Com
Researcher at LIPI ( )
Founder and CEO of PT Brainmatics Cipta Informatika
Course Outline
1. Introduction to Data Mining
2. Data Mining Process
3. Data Preparation
4. Classification Algorithms
5. Clustering Algorithms
6. Association Algorithms
7. Estimation and Forecasting Algorithms
8. Text Mining
7. Estimation and Forecasting Algorithms
7.1 Linear Regression
7.2 Neural Network
7.3 Support Vector Machine
7.4 Time Series Forecasting
7.1 Linear Regression
Steps of the Linear Regression Algorithm
1. Prepare the data
2. Identify the attributes and the label
3. Compute X², Y², XY and the total of each
4. Compute a and b using the given equations
5. Build the simple linear regression model
1. Data Preparation
Table columns: Date | Average Room Temperature (X) | Number of Defects (Y)
2. Identify the Attributes and the Label
Y = a + bX
where:
Y = dependent variable (the label)
X = independent variable (the attribute)
a = constant (intercept)
b = regression coefficient (slope); the magnitude of the response produced by the predictor variable X

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2}$$

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2}$$
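The two formulas for a and b can be sketched directly in code. This is a minimal illustration, not part of the original slides: the function name `fit_simple_linear_regression` and the four sample points are hypothetical, chosen so that the data lie exactly on Y = 1 + 2X.

```python
def fit_simple_linear_regression(xs, ys):
    """Return (a, b) for the model Y = a + bX using the sum-based formulas."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("xs and ys must have the same length (at least 2)")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_x2 = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    # Shared denominator of both formulas: n(Σx²) − (Σx)²
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all X values are identical; the slope is undefined")
    a = (sum_y * sum_x2 - sum_x * sum_xy) / denom
    b = (n * sum_xy - sum_x * sum_y) / denom
    return a, b


# Hypothetical illustration data (not from the slides): points on Y = 1 + 2X.
a, b = fit_simple_linear_regression([1, 2, 3, 4], [3, 5, 7, 9])
print(a, b)
```

Because the sample points are exactly linear, the fit should recover the intercept 1 and slope 2; with real data the line is only the least-squares best fit.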
3. Compute X², Y², XY and the Total of Each
Table columns: Date | Average Room Temperature (X) | Number of Defects (Y) | X² | Y² | XY
4. Compute a and b Using the Given Equations
Computing the constant (a):

$$a = \frac{(\sum y)(\sum x^2) - (\sum x)(\sum xy)}{n(\sum x^2) - (\sum x)^2} = \frac{(72)(4876) - (220)(1640)}{10(4876) - (220)^2} \approx -27.02$$

Computing the regression coefficient (b):

$$b = \frac{n(\sum xy) - (\sum x)(\sum y)}{n(\sum x^2) - (\sum x)^2} = \frac{10(1640) - (220)(72)}{10(4876) - (220)^2} \approx 1.56$$
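As a quick check of the arithmetic, the totals used in this step (n = 10, Σx = 220, Σy = 72, Σx² = 4876, Σxy = 1640) can be plugged into the same formulas. The variable names below are illustrative only.

```python
# Totals taken from the worked example in this step.
n = 10
sum_x, sum_y = 220, 72
sum_x2, sum_xy = 4876, 1640

# Shared denominator: n(Σx²) − (Σx)² = 48760 − 48400 = 360
denominator = n * sum_x2 - sum_x ** 2

a = (sum_y * sum_x2 - sum_x * sum_xy) / denominator  # constant (intercept)
b = (n * sum_xy - sum_x * sum_y) / denominator       # regression coefficient (slope)

print(f"a = {a:.2f}, b = {b:.2f}")  # prints "a = -27.02, b = 1.56"
```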
5. Build the Simple Linear Regression Model
Y = a + bX
Y = -27.02 + 1.56X
Testing
1. Predict the number of production defects when the temperature (variable X) is high, for example 30 °C:
Y = -27.02 + 1.56X
Y = -27.02 + 1.56(30) = 19.78
2. If production defects (variable Y) are targeted at no more than 5 units, what room temperature is required to reach that target?
5 = -27.02 + 1.56X
1.56X = 5 + 27.02
X = 32.02 / 1.56
X ≈ 20.53
So the predicted room temperature best suited to reach the defect target is about 20.53 °C.
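Both test questions use the same fitted line, once in the forward direction (temperature to defects) and once inverted (target defects to temperature). A small sketch, assuming the rounded coefficients from the model above; the function names are hypothetical.

```python
# Rounded coefficients of the fitted model Y = A + B * X.
A = -27.02
B = 1.56


def predict_defects(temperature_c):
    """Forward use of the model: estimated defects at a given room temperature."""
    return A + B * temperature_c


def temperature_for_target(defects):
    """Inverse use of the model: solve defects = A + B * X for X."""
    if B == 0:
        raise ValueError("slope is zero; the model cannot be inverted")
    return (defects - A) / B


print(round(predict_defects(30), 2))
print(round(temperature_for_target(5), 2))
```

Inverting the line only makes sense while the target stays inside the range of temperatures seen in the training data; outside that range the estimate is an extrapolation.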
7.1.2 CRISP-DM Case Study
Heating Oil Consumption: Estimation
(Matthew North, Data Mining for the Masses, 2012, Chapter 8 Estimation)
Dataset: HeatingOil-Training.csv and HeatingOil-Scoring.csv
Exercise
Carry out the experiment on Heating Oil Consumption following Matthew North, Data Mining for the Masses, 2012, Chapter 8 Estimation.
Dataset: HeatingOil-Training.csv and HeatingOil-Scoring.csv
CRISP-DM
Context and Perspective
Sarah, the regional sales manager, is back for more help. Business is booming, her sales team is signing up thousands of new clients, and she wants to be sure the company will be able to meet this new level of demand. She is now hoping we can help her do some prediction as well.
She knows that there is some correlation between the attributes in her data set (things like temperature, insulation, and occupant ages), and she is now wondering if she can use the previous data set to predict heating oil usage for new customers.
These new customers have not begun consuming heating oil yet, there are a lot of them (42,650 to be exact), and she wants to know how much oil she needs to keep in stock to meet their demand.
Can she use data mining to examine household attributes and known past consumption quantities to anticipate and meet her new customers' needs?
1. Business Understanding
Sarah's new data mining objective is pretty clear: she wants to anticipate demand for a consumable product.
We will use a linear regression model to help her with her desired predictions.
She has data: 1,218 observations that give an attribute profile for each home, along with those homes' annual heating oil consumption.
She wants to use this data set as training data to predict the usage that 42,650 new clients will bring to her company.
She knows that these new clients' homes are similar in nature to her existing client base, so the existing customers' usage behavior should serve as a solid gauge for predicting future usage by new customers.
2. Data Understanding
We create a data set comprised of the following attributes:
Insulation: a density rating, ranging from one to ten, indicating the thickness of each home's insulation. A home with a density rating of one is poorly insulated, while a home with a density of ten has excellent insulation.
Temperature: the average outdoor ambient temperature at each home for the most recent year, measured in degrees Fahrenheit.
Heating_Oil: the total number of units of heating oil purchased by the owner of each home in the most recent year.
Num_Occupants: the total number of occupants living in each home.
Avg_Age: the average age of those occupants.
Home_Size: a rating, on a scale of one to eight, of the home's overall size. The higher the number, the larger the home.
3. Data Preparation
A CSV data set for this chapter's example is available for download at the book's companion web site.
4. Modeling
5. Evaluation
6. Deployment
7.2 Neural Network
7.3 Support Vector Machine
7.4 Time Series Forecasting
Time Series Forecasting
Time series forecasting is one of the oldest known predictive analytics techniques. It has existed and been in widespread use since before the term "predictive analytics" was coined.
Independent or predictor variables are not strictly necessary for univariate time series forecasting, but are strongly recommended for multivariate time series.
Time series forecasting methods:
1. Data Driven Method: There is no difference between a predictor and a target. Techniques such as time series averaging or smoothing are considered data-driven approaches to time series forecasting.
2. Model Driven Method: Similar to "conventional" predictive models, which have independent and dependent variables, but with a twist: the independent variable is now time.
Data Driven Methods
There is no difference between a predictor and a target: the predictor is also the target variable.
Data driven methods:
- Naïve Forecast
- Simple Average
- Moving Average
- Weighted Moving Average
- Exponential Smoothing
- Holt's Two-Parameter Exponential Smoothing
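The data-driven methods listed here all forecast the next value from the series' own past values. The sketch below shows one common textbook form of each; the function names, the initialisation choices (first observation as the starting level, first difference as the starting trend), and the sample series are assumptions made for illustration, not taken from the slides.

```python
def naive_forecast(series):
    """Next value = last observed value."""
    return series[-1]


def simple_average(series):
    """Next value = mean of all observed values."""
    return sum(series) / len(series)


def moving_average(series, window):
    """Next value = mean of the last `window` values."""
    if not 1 <= window <= len(series):
        raise ValueError("window must be between 1 and len(series)")
    return sum(series[-window:]) / window


def weighted_moving_average(series, weights):
    """Weights apply to the last len(weights) values, oldest first."""
    if not 1 <= len(weights) <= len(series):
        raise ValueError("need between 1 and len(series) weights")
    recent = series[-len(weights):]
    return sum(w * v for w, v in zip(weights, recent)) / sum(weights)


def exponential_smoothing(series, alpha):
    """Simple exponential smoothing, level initialised to the first value."""
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    level = series[0]
    for value in series[1:]:
        level = alpha * value + (1 - alpha) * level
    return level


def holt_forecast(series, alpha, beta, horizon=1):
    """Holt's two-parameter smoothing: a level plus a trend component."""
    if len(series) < 2:
        raise ValueError("need at least two observations")
    level = series[0]
    trend = series[1] - series[0]  # assumed initial trend: first difference
    for value in series[1:]:
        previous_level = level
        level = alpha * value + (1 - alpha) * (previous_level + trend)
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level + horizon * trend


# Hypothetical series with a steady upward trend.
demand = [10, 12, 14, 16]
print(naive_forecast(demand))
print(simple_average(demand))
print(moving_average(demand, window=2))
print(weighted_moving_average(demand, weights=[1, 3]))
print(exponential_smoothing(demand, alpha=0.5))
print(holt_forecast(demand, alpha=0.5, beta=0.5))
```

On a trending series like this one, the averaging methods lag behind the latest values, while Holt's method continues the trend because it tracks a separate trend term.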
Model Driven Methods
In model-driven methods, time is the predictor or independent variable and the time series value is the dependent variable.
Model-based methods are generally preferable when the time series appears to have a "global" pattern. The idea is that the model parameters will capture this pattern, which lets us make predictions for any number of steps ahead under the assumption that the pattern is going to repeat.
For a time series with local patterns instead of a global pattern, the model-driven approach requires specifying how and when the patterns change, which is difficult.
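The simplest model-driven case is a linear trend: the time index t = 1, 2, ..., n plays the role of X and the series value plays the role of Y. A minimal sketch, assuming a plain least-squares line; the function names and the sample series are hypothetical.

```python
def fit_linear_trend(series):
    """Fit value = a + b * t by least squares, with t = 1..n as the predictor."""
    n = len(series)
    if n < 2:
        raise ValueError("need at least two observations")
    times = range(1, n + 1)
    sum_t = sum(times)
    sum_y = sum(series)
    sum_t2 = sum(t * t for t in times)
    sum_ty = sum(t * y for t, y in zip(times, series))
    denom = n * sum_t2 - sum_t ** 2
    a = (sum_y * sum_t2 - sum_t * sum_ty) / denom
    b = (n * sum_ty - sum_t * sum_y) / denom
    return a, b


def forecast_trend(series, steps_ahead):
    """Extend the fitted global trend `steps_ahead` time steps past the data."""
    a, b = fit_linear_trend(series)
    return a + b * (len(series) + steps_ahead)


# Hypothetical series following value = 1 + 2t exactly.
history = [3, 5, 7, 9]
print(forecast_trend(history, steps_ahead=1))
print(forecast_trend(history, steps_ahead=3))
```

Because the fitted parameters describe the whole series, the same line yields a forecast for any step ahead, which is exactly the "global pattern" assumption described above.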
Model Driven Methods
- Linear Regression
- Polynomial Regression
- Linear Regression with Seasonality
- Autoregression Models and ARIMA
How to Implement
RapidMiner's approach to time series is based on two main data transformation processes:
1. The first is windowing, which transforms the time series data into a generic data set. This step converts the last row of a window within the time series into a label or target variable.
2. We then apply any of the "learners" or algorithms to predict the target variable, and thus predict the next time step in the series.
Windowing Concept
The parameters of the Windowing operator allow changing the size of the windows, the overlap between consecutive windows (also known as step size), and the prediction horizon, which is used for forecasting.
The prediction horizon controls which row in the raw data series ends up as the label variable in the transformed series.
RapidMiner Windowing Operator
Windowing Operator Parameters
Window size: Determines how many "attributes" are created for the cross-sectional data. Each row of the original time series within the window width becomes a new attribute. We choose w = 6.
Step size: Determines how to advance the window. Let us use s = 1.
Horizon: Determines how far out to make the forecast. If the window size is 6 and the horizon is 1, then the seventh row of the original time series becomes the first sample for the "label" variable. Let us use h = 1.
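The effect of the three parameters can be sketched in plain Python. This illustrates the windowing concept only and is not RapidMiner's implementation; the function name `window_series` and the sample series are hypothetical.

```python
def window_series(series, window_size=6, step_size=1, horizon=1):
    """Turn a time series into (attributes, label) rows.

    Each row uses `window_size` consecutive values as attributes and the
    value `horizon` steps after the end of the window as the label.
    """
    if window_size < 1 or step_size < 1 or horizon < 1:
        raise ValueError("window_size, step_size and horizon must be >= 1")
    rows = []
    start = 0
    # Index of the label for a window starting at `start`.
    while start + window_size + horizon - 1 < len(series):
        attributes = series[start:start + window_size]
        label = series[start + window_size + horizon - 1]
        rows.append((attributes, label))
        start += step_size
    return rows


# Hypothetical series whose values equal their row numbers (1..10).
prices = list(range(1, 11))
for attributes, label in window_series(prices, window_size=6, step_size=1, horizon=1):
    print(attributes, "->", label)
```

With w = 6, s = 1 and h = 1, the first row uses rows 1 to 6 as attributes and row 7 as the label, matching the description above. The resulting rows form an ordinary data set that any learner can be trained on.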
Exercise
Train a linear regression model on the dataset hargasaham-training.xls.
Use Split Data to separate the dataset into 90% for training and 10% for testing.
The dataset must go through the Windowing process.
Plot the label against the predicted values using a chart.
Exercise
Find a time series data set on the internet (any data will do).
Apply the data mining process to that data and examine the patterns that emerge.
Post-Test
1. Explain the difference between data, information, and knowledge.
2. Explain what you know about data mining.
3. Name the main roles of data mining.
4. Name applications of data mining in various fields.
5. What knowledge or patterns can we obtain from the data below?

NIM | Gender | National Exam Score (Nilai UN) | School of Origin | IPS1 | IPS2 | IPS3 | IPS4 | ... | Graduated on Time
10001 | M | 28 | SMAN | | | | | | Yes
10002 | F | 27 | SMAN | | | | | | No
10003 | F | 24 | SMAN | | | | | | No
10004 | M | 26.4 | SMAN | | | | | | Yes
 | M | 23.4 | SMAN | | | | | | Yes