Upload presentasi
Presentasi sedang didownload. Silahkan tunggu
1
Text Mining Patrick Cash
2
Outline Introduction Data Mining Text Mining Text Mining Applications
Text Mining Process Text Mining Applications Challenges in Text Mining Conclusion
3
Introduction Why Text Mining?
Massive amount of new information being created World’s data doubles every 18 months (Jacques Vallee Ph.D) 80-90% of all data is held in various unstructured formats Useful information can be derived from this unstructured data
4
Introduction Intelligence in text mining is based on NLP techniques
Can be used as a preprocessing technique to harvest data and get an initial understanding of the patterns that exist in the data Often seen as a special case of data mining but there is an important difference
5
Data Mining Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) information (or patterns) from data Data Mining: a misnomer? Knowledge discovery, knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
6
Data Mining Descriptive: understanding underlying processes or behavior Patterns and trends Clustering Predictive: predict an unseen or unmeasured value Future projections and missing values Classification
7
SEMMA Search Explore Modify Model Assess
Input data source, data sampling, partitioning Explore Patterns, trends, outliers, visualization Modify Clustering, feature reduction Model Regression, tree, network Assess Report, pass to next step in analysis
8
Search vs. Discover Search (goal-oriented) Discover (opportunistic)
Structured Data Data Retrieval Data Mining Unstructured Data (Text) Information Retrieval Text Mining
9
Text Mining Many different by similar definitions
Text Mining = Statistical NLP + Data Mining Text Mining is a process that employs Statistical NLP: a set of algorithms for converting unstructured text into structured data objects Data Mining: the quantitative methods that analyze these data objects to discover knowledge
10
Text Mining Descriptive Predictive Pattern and trend analysis
Knowledge base creation Summarization Visualization Predictive Classification Question answering Pattern and trend forecasting
11
Text Mining Techniques
Information Retrieval Indexing and retrieval of textual documents Information Extraction Extraction of partial knowledge in the text Web Mining Indexing and retrieval of textual documents and extraction of partial knowledge using the web (ontology building) Clustering Generating collections of similar text documents
12
Text Mining Process
13
Text Mining Process Text Preprocessing Features Generation
Syntactic/Semantic text analysis Features Generation Bag of words Features Selection Simple counting Statistics Text/Data Mining Classification (Supervised) / Clustering (Unsupervised) Analyzing results
14
Text Mining Process Text preprocessing Part Of Speech (POS) tagging
Find the corresponding POS for each word. Word sense disambiguation Context based or proximity based Parsing Generates a parse tree (graph) for each sentence Each sentence is a stand alone graph
15
Text Mining Process Feature Generation
Text document is represented by the words it contains (and their occurrences) Order of words is not that important for certain applications (Bag of words) Stemming: identifies a word by its root Reduce dimensionality Stop words: The common words unlikely to help text mining
16
Text Mining Process Feature Selection Reduce dimensionality
Learners have difficulty addressing tasks with high dimensionality Only interested in the information relevant to what is being analyzed Irrelevant features Not all features help
17
Text Mining Process Text Mining: Classification definition
Given: a collection of labeled records (training set) Each record contains a set of features (attributes), and the true class (label) Find: a model for the class as a function of the values of the features Goal: previously unseen records should be assigned a class as accurately as possible
18
Text Mining Process Text Mining: Clustering definition
Given: a set of documents and a similarity measure among documents Find: clusters such that: Documents in one cluster are more similar to one another Documents in separate clusters are less similar to one another Goal: Finding a correct set of documents clusters
19
Text Mining Process Supervised learning (classification)
The training data is labeled indicating the class New data is classified based on the training set Correct classification: The known label of test sample is identical with the class result from the classification model Unsupervised learning (clustering) The class labels of training data are unknown Establish the existence of classes or clusters in the data Good clustering method: high intra-cluster similarity
20
Text Mining Process Analyzing the results
Are the results satisfactory? Does more mining need to be done? Does a different technique need to be used? Does another iteration of one or more steps in the process need to be done?
21
Text Mining Applications
Bioinformatics Genomics research (DNA sequencing) Medical Mining medical records to improve care Business intelligence Risk analysis Research Analyzing research publications Basically anywhere there is large amount of unstructured text data
22
Text Mining Application
Classification (Categorization) Spam detection, Document organization Clustering Trend analysis, Topic identification Web Mining Trend analysis, Opinion mining, Ontology creation Classical NLP Text summarization, Question answering, Information extraction
23
Text Mining Application
Smaller scale applications Relationship Analysis If A is related to B, and B is related to C, there is potentially a relationship between A and C. Trend analysis Occurrences of A peak in October. Mixed applications Co-occurrence of A together with B peak in November. (Shopping Cart Analysis)
24
Challenges in Text Mining
Remember Text Mining = Statistical NLP + Data Mining Text mining suffers from the same challenges as Statistical NLP and Data Mining Add in the additional difficulties associated with the data not being structured
25
Challenges in Text Mining
Statistical NLP Ambiguity Context Tokenization \ Sentence Detection Stemming POS Tagging Coreference Resolution
26
Challenges in Text Mining
Data Mining Data preprocessing Ability to process the data Massive amounts of data Determining and extracting information of interest Availability of NLP tools to work with data mining Discovery process No training data available
27
Conclusion Text Mining = Statistical NLP + Data Mining
Culmination of all the NLP techniques covered in this course Growing research area that will be important as information growth (and need to extract knowledge from that information) increases
28
References Even-Zohar, Y. Introduction to Text Mining. Supercomputing, Treloar, N AvaQuest Inc. Witte, R. Faculty of Informatics Institute for Program Structures and Data Organization (IPD)
29
Questions ? 29
Presentasi serupa
© 2024 SlidePlayer.info Inc.
All rights reserved.