* Overview of text mining
* From textual information to numerical vectors
* Using text for prediction
* Information retrieval and text mining
* Finding structure in a document collection
* Looking for information in documents
* Case studies
* Emerging directions
* Appendix: software notes
* References
* Author & subject indexes
Data mining is a mature technology. The prediction problem, looking for predictive patterns in data, has been widely studied. Strong methods are available to the practitioner. These methods process structured numerical information, where uniform measurements are taken over a sample of data. Text is often described as unstructured information. So, it would seem, text and numerical data are different, requiring different methods. Or are they? In our view, a prediction problem can be solved by the same methods, whether the data are structured numerical measurements or unstructured text. Text and documents can be transformed into measured values, such as the presence or absence of words, and the same methods that have proven successful for predictive data mining can be applied to text. Yet, there are key differences. Evaluation techniques must be adapted to the chronological order of publication and to alternative measures of error. Because the data are documents, more specialized analytical methods may be preferred for text. Moreover, the methods must be modified to accommodate very high dimensions: tens of thousands of words and documents. Still, the central themes are similar.
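As a rough illustration of the transformation described above, the following minimal sketch turns documents into vectors that record the presence or absence of each word. The function names, the whitespace tokenization, and the two sample documents are assumptions made for this example; they are not taken from the book.

```python
def build_vocabulary(documents):
    """Return a sorted list of the distinct lowercase words in the documents."""
    words = set()
    for doc in documents:
        # Assumption: words are separated by whitespace; no stemming or stopword removal.
        words.update(doc.lower().split())
    return sorted(words)


def to_binary_vector(document, vocabulary):
    """Return 1 for each vocabulary word present in the document, else 0."""
    present = set(document.lower().split())
    return [1 if word in present else 0 for word in vocabulary]


# Hypothetical sample documents, invented for illustration only.
documents = [
    "stocks rise on strong earnings",
    "earnings fall as stocks slide",
]

vocabulary = build_vocabulary(documents)
vectors = [to_binary_vector(doc, vocabulary) for doc in documents]

for doc, vec in zip(documents, vectors):
    print(vec, doc)
```

Each document becomes a fixed-length row of 0s and 1s, which is the structured numerical form that standard prediction methods expect. With a realistic collection the vocabulary grows to tens of thousands of words, so a sparse representation would likely be preferable to the dense lists used here.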
Text mining searches for regularities, patterns, or trends in natural-language text. Inspired by data mining, which discovers major patterns in highly structured databases, text mining aims to extract useful knowledge from unstructured text. This authoritative and highly accessible text/reference, written by a team of authorities on text mining, develops the foundational concepts, principles, and methods needed to expand beyond structured, numeric data to automated mining of text samples. Researchers, computer scientists, and advanced undergraduate and graduate students with work and interests in data mining, machine learning, databases, and computational linguistics will find the work an essential resource.