Text Mining Patrick Cash.

Presentation transcript:

Text Mining Patrick Cash

Outline: Introduction; Data Mining; Text Mining; Text Mining Process; Text Mining Applications; Challenges in Text Mining; Conclusion

Introduction. Why text mining? A massive amount of new information is being created: the world's data doubles every 18 months (Jacques Vallee, Ph.D.). 80-90% of all data is held in various unstructured formats, and useful information can be derived from this unstructured data.

Introduction. Intelligence in text mining is based on NLP techniques. Text mining can be used as a preprocessing technique to harvest data and get an initial understanding of the patterns that exist in the data. It is often seen as a special case of data mining, but there is an important difference.

Data Mining. Extraction of interesting (non-trivial, implicit, previously unknown, and potentially useful) information or patterns from data. Data mining: a misnomer? Other names include knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, and business intelligence.

Data Mining. Descriptive: understanding underlying processes or behavior (patterns and trends, clustering). Predictive: predicting an unseen or unmeasured value (future projections and missing values, classification).

SEMMA: Sample, Explore, Modify, Model, Assess. Sample: input data source, data sampling, partitioning. Explore: patterns, trends, outliers, visualization. Modify: clustering, feature reduction. Model: regression, tree, network. Assess: report, pass to next step in analysis.

Search vs. Discover
                            Search (goal-oriented)    Discover (opportunistic)
Structured data             Data retrieval            Data mining
Unstructured data (text)    Information retrieval     Text mining

Text Mining. There are many different but similar definitions. Text Mining = Statistical NLP + Data Mining. Text mining is a process that employs two components. Statistical NLP: a set of algorithms for converting unstructured text into structured data objects. Data Mining: the quantitative methods that analyze these data objects to discover knowledge.

Text Mining. Descriptive: pattern and trend analysis, knowledge base creation, summarization, visualization. Predictive: classification, question answering, pattern and trend forecasting.

Text Mining Techniques. Information Retrieval: indexing and retrieval of textual documents. Information Extraction: extraction of partial knowledge in the text. Web Mining: indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building). Clustering: generating collections of similar text documents.
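
To make "indexing and retrieval of textual documents" concrete, here is a minimal sketch of an inverted index with boolean AND search. It is an illustration only, not something taken from the slides: the function names and the three sample documents are assumptions, and tokenization is a plain whitespace split.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every query term (boolean AND)."""
    terms = query.lower().split()
    if not terms:
        return set()
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

# Hypothetical sample collection, for illustration only.
docs = {
    1: "text mining finds patterns in text",
    2: "data mining finds patterns in tables",
    3: "information retrieval indexes text documents",
}
index = build_inverted_index(docs)
hits = search(index, "text mining")
```

With this sample collection, only document 1 contains both "text" and "mining", so `hits` holds that single id.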

Text Mining Process

Text Mining Process. Text Preprocessing: syntactic/semantic text analysis. Features Generation: bag of words. Features Selection: simple counting, statistics. Text/Data Mining: classification (supervised) / clustering (unsupervised). Analyzing results.

Text Mining Process: Text Preprocessing. Part-of-speech (POS) tagging: find the corresponding POS for each word. Word sense disambiguation: context based or proximity based. Parsing: generates a parse tree (graph) for each sentence; each sentence is a stand-alone graph.

Text Mining Process: Feature Generation. A text document is represented by the words it contains (and their occurrences). The order of words is not that important for certain applications (bag of words). Stemming: identifies a word by its root, which reduces dimensionality. Stop words: the common words unlikely to help text mining.
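
The feature generation steps on this slide can be sketched in a few lines of Python. This is a rough illustration under stated assumptions: the stop-word list is a tiny hypothetical one, and `crude_stem` is a toy suffix stripper standing in for a real stemmer, so its roots are not linguistically accurate.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "to", "in", "is", "are"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Strip a few common English suffixes (a toy stand-in for a real stemmer)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def bag_of_words(text):
    """Return a Counter mapping each stemmed, non-stop word to its count."""
    return Counter(crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS)

bow = bag_of_words("The miners are mining texts and the text is mined")
```

Here "texts" and "text" collapse into one feature, as do "mining" and "mined", which shows how stemming shrinks the vocabulary; word order is discarded entirely.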

Text Mining Process: Feature Selection. Reduce dimensionality: learners have difficulty addressing tasks with high dimensionality, and we are only interested in the information relevant to what is being analyzed. Irrelevant features: not all features help.
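
One simple way to reduce dimensionality is to filter terms by document frequency. The slides do not prescribe this particular method; the sketch below is one possible "simple counting" approach, and the thresholds `min_df` and `max_df_ratio` are assumed names and values chosen for illustration.

```python
from collections import Counter

def select_features(docs, min_df=2, max_df_ratio=0.9):
    """Keep terms found in at least min_df documents but in no more than
    max_df_ratio of all documents."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # count each term once per document
    return sorted(
        term for term, count in df.items()
        if count >= min_df and count / n_docs <= max_df_ratio
    )

# Hypothetical tokenized documents, for illustration only.
docs = [
    ["text", "mining", "finds", "patterns"],
    ["data", "mining", "finds", "trends"],
    ["text", "mining", "uses", "nlp"],
]
selected = select_features(docs)
```

Terms that occur in only one document are dropped as too rare to generalize, and "mining" is dropped because it occurs in every document and so cannot distinguish between them.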

Text Mining Process: Classification definition. Given: a collection of labeled records (training set); each record contains a set of features (attributes) and the true class (label). Find: a model for the class as a function of the values of the features. Goal: previously unseen records should be assigned a class as accurately as possible.
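
As an illustration of this definition, the sketch below trains a multinomial Naive Bayes model with add-one smoothing on a labeled training set and assigns a class to an unseen record. Naive Bayes is chosen here only as a compact example; the slides do not name a specific classifier, and the training data and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(labeled_docs):
    """labeled_docs: list of (tokens, label). Returns the counts used for prediction."""
    class_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for tokens, label in labeled_docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab

def predict(model, tokens):
    """Return the label with the highest log posterior under add-one smoothing."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, float("-inf")
    for label, n in class_counts.items():
        score = math.log(n / total_docs)  # log prior
        total_words = sum(word_counts[label].values())
        for t in tokens:
            if t in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][t] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training set: features are tokens, labels are "spam" or "ham".
training_set = [
    (["win", "money", "now"], "spam"),
    (["cheap", "money", "offer"], "spam"),
    (["meeting", "agenda", "today"], "ham"),
    (["project", "meeting", "notes"], "ham"),
]
model = train_naive_bayes(training_set)
label = predict(model, ["money", "offer", "today"])
```

The unseen record shares two words with the spam examples and one with the ham examples, so the model labels it "spam".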

Text Mining Process: Clustering definition. Given: a set of documents and a similarity measure among documents. Find: clusters such that documents in one cluster are more similar to one another and documents in separate clusters are less similar to one another. Goal: finding a correct set of document clusters.
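
The definition above needs two ingredients, a similarity measure and a grouping procedure. The sketch below uses cosine similarity over term counts and a single-pass "leader" clustering with a fixed threshold. Both choices, the threshold value, and the sample documents are assumptions made for illustration; the slides do not specify an algorithm.

```python
import math
from collections import Counter

def cosine_similarity(a, b):
    """Cosine similarity between two term-count vectors stored as Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def leader_clustering(vectors, threshold=0.3):
    """Assign each document to the first cluster whose leader (first member)
    is similar enough; otherwise start a new cluster."""
    clusters = []  # each cluster is a list of document indices
    for i, vec in enumerate(vectors):
        for cluster in clusters:
            if cosine_similarity(vec, vectors[cluster[0]]) >= threshold:
                cluster.append(i)
                break
        else:
            clusters.append([i])
    return clusters

# Hypothetical documents, for illustration only.
documents = [
    "stock market prices fall",
    "market prices rise as stock demand grows",
    "football team wins the match",
    "the team loses the football match",
]
vectors = [Counter(d.split()) for d in documents]
clusters = leader_clustering(vectors)
```

On this sample the two finance documents end up in one cluster and the two football documents in another. Leader clustering depends on document order and on the threshold, so it is a convenient demonstration rather than a recommended method.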

Text Mining Process. Supervised learning (classification): the training data is labeled, indicating the class; new data is classified based on the training set. Correct classification: the known label of the test sample is identical to the class produced by the classification model. Unsupervised learning (clustering): the class labels of the training data are unknown; the task is to establish the existence of classes or clusters in the data. Good clustering method: high intra-cluster similarity.

Text Mining Process: Analyzing the results. Are the results satisfactory? Does more mining need to be done? Does a different technique need to be used? Does another iteration of one or more steps in the process need to be done?

Text Mining Applications. Bioinformatics: genomics research (DNA sequencing). Medical: mining medical records to improve care. Business intelligence: risk analysis. Research: analyzing research publications. Basically anywhere there is a large amount of unstructured text data.

Text Mining Applications. Classification (categorization): spam detection, document organization. Clustering: trend analysis, topic identification. Web mining: trend analysis, opinion mining, ontology creation. Classical NLP: text summarization, question answering, information extraction.

Text Mining Applications: smaller scale applications. Relationship analysis: if A is related to B, and B is related to C, there is potentially a relationship between A and C. Trend analysis: occurrences of A peak in October. Mixed applications: co-occurrence of A together with B peaks in November (shopping cart analysis).
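
The relationship analysis idea can be sketched by counting co-occurrences and then proposing indirect links. This is a minimal sketch, assuming that "related" simply means "appears in the same record"; the function names and the sample records are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence_counts(records):
    """Count how often each unordered pair of items appears in the same record."""
    pair_counts = Counter()
    for items in records:
        for a, b in combinations(sorted(set(items)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts

def indirect_candidates(pair_counts):
    """Return pairs (a, c) that never co-occur directly but share a neighbor b."""
    neighbors = defaultdict(set)
    for a, b in pair_counts:
        neighbors[a].add(b)
        neighbors[b].add(a)
    candidates = set()
    for shared in neighbors.values():
        for a, c in combinations(sorted(shared), 2):
            if (a, c) not in pair_counts:
                candidates.add((a, c))
    return candidates

# Hypothetical records: A occurs with B, and B occurs with C, but never A with C.
records = [["A", "B"], ["B", "C"], ["A", "B"]]
pair_counts = cooccurrence_counts(records)
candidates = indirect_candidates(pair_counts)
```

With these records, A and B co-occur twice, B and C once, and the pair (A, C) is flagged as a potential relationship because both are linked to B. Adding a date to each record and grouping counts by month would extend the same counting to the trend examples on the slide.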

Challenges in Text Mining. Remember: Text Mining = Statistical NLP + Data Mining. Text mining suffers from the same challenges as Statistical NLP and Data Mining, plus the additional difficulties associated with the data not being structured.

Challenges in Text Mining: Statistical NLP. Ambiguity; context; tokenization / sentence detection; stemming; POS tagging; coreference resolution.

Challenges in Text Mining: Data Mining. Data preprocessing. Ability to process the data: massive amounts of data. Determining and extracting information of interest. Availability of NLP tools to work with data mining. Discovery process: no training data available.

Conclusion. Text Mining = Statistical NLP + Data Mining. It is the culmination of all the NLP techniques covered in this course, and a growing research area that will become more important as information growth (and the need to extract knowledge from that information) increases.

Questions?