Natural Language Annotation for Machine Learning: A Guide to Corpus-Building for Applications

by James Pustejovsky, Amber Stubbs

eBook

Price: $23.99 (original price $31.99; you save 25%)


Overview

Create your own natural language training corpus for machine learning. Whether you’re working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle—the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don’t need any programming or linguistics experience to get started.

Using detailed examples at every step, you’ll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.

  • Define a clear annotation goal before collecting your dataset (corpus)
  • Learn tools for analyzing the linguistic content of your corpus
  • Build a model and specification for your annotation project
  • Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
  • Create a gold standard corpus that can be used to train and test ML algorithms
  • Select the ML algorithms that will process your annotated data
  • Evaluate the test results and revise your annotation task
  • Learn how to use lightweight software for annotating texts and adjudicating the annotations

This book is a perfect companion to O’Reilly’s Natural Language Processing with Python.


Product Details

ISBN-13: 9781449359768
Publisher: O'Reilly Media, Incorporated
Publication date: 10/11/2012
Sold by: Barnes & Noble
Format: eBook
Pages: 342
File size: 4 MB

About the Author

James Pustejovsky teaches and does research in Artificial Intelligence and Computational Linguistics in the Computer Science Department at Brandeis University. His main areas of interest include lexical meaning, computational semantics, temporal and spatial reasoning, and corpus linguistics. He is active in the development of standards for interoperability between language processing applications, and led the creation of the recently adopted ISO standard for time annotation, ISO-TimeML. He is currently heading the development of a standard for annotating spatial information in language. More information on publications and research activities can be found at his webpage: pusto.com.


Amber Stubbs recently completed her Ph.D. in Computer Science at Brandeis University, and is currently a Postdoctoral Associate at SUNY Albany. Her dissertation focused on creating an annotation methodology to aid in extracting high-level information from natural language files, particularly biomedical texts. Her website can be found at http://pages.cs.brandeis.edu/~astubbs/

Table of Contents

Preface; Natural Language Annotation for Machine Learning; Audience; Organization of This Book; Software Requirements; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Chapter 1: The Basics; 1.1 The Importance of Language Annotation; 1.2 A Brief History of Corpus Linguistics; 1.3 Language Data and Machine Learning; 1.4 The Annotation Development Cycle; 1.5 Summary
Chapter 2: Defining Your Goal and Dataset; 2.1 Defining Your Goal; 2.2 Background Research; 2.3 Assembling Your Dataset; 2.4 The Size of Your Corpus; 2.5 Summary
Chapter 3: Corpus Analytics; 3.1 Basic Probability for Corpus Analytics; 3.2 Counting Occurrences; 3.3 Language Models; 3.4 Summary
Chapter 4: Building Your Model and Specification; 4.1 Some Example Models and Specs; 4.2 Adopting (or Not Adopting) Existing Models; 4.3 Different Kinds of Standards; 4.4 Summary
Chapter 5: Applying and Adopting Annotation Standards; 5.1 Metadata Annotation: Document Classification; 5.2 Text Extent Annotation: Named Entities; 5.3 Linked Extent Annotation: Semantic Roles; 5.4 ISO Standards and You; 5.5 Summary
Chapter 6: Annotation and Adjudication; 6.1 The Infrastructure of an Annotation Project; 6.2 Specification Versus Guidelines; 6.3 Be Prepared to Revise; 6.4 Preparing Your Data for Annotation; 6.5 Writing the Annotation Guidelines; 6.6 Annotators; 6.7 Choosing an Annotation Environment; 6.8 Evaluating the Annotations; 6.9 Creating the Gold Standard (Adjudication); 6.10 Summary
Chapter 7: Training: Machine Learning; 7.1 What Is Learning?; 7.2 Defining Our Learning Task; 7.3 Classifier Algorithms; 7.4 Sequence Induction Algorithms; 7.5 Clustering and Unsupervised Learning; 7.6 Semi-Supervised Learning; 7.7 Matching Annotation to Algorithms; 7.8 Summary
Chapter 8: Testing and Evaluation; 8.1 Testing Your Algorithm; 8.2 Evaluating Your Algorithm; 8.3 Problems That Can Affect Evaluation; 8.4 Final Testing Scores; 8.5 Summary
Chapter 9: Revising and Reporting; 9.1 Revising Your Project; 9.2 Reporting About Your Work; 9.3 Summary
Chapter 10: Annotation: TimeML; 10.1 The Goal of TimeML; 10.2 Related Research; 10.3 Building the Corpus; 10.4 Model: Preliminary Specifications; 10.5 Annotation: First Attempts; 10.6 Model: The TimeML Specification Used in TimeBank; 10.7 Annotation: The Creation of TimeBank; 10.8 TimeML Becomes ISO-TimeML; 10.9 Modeling the Future: Directions for TimeML; 10.10 Summary
Chapter 11: Automatic Annotation: Generating TimeML; 11.1 The TARSQI Components; 11.2 Improvements to the TTK; 11.3 TimeML Challenges: TempEval-2; 11.4 Future of the TTK; 11.5 Summary
Chapter 12: Afterword: The Future of Annotation; 12.1 Crowdsourcing Annotation; 12.2 Handling Big Data; 12.3 NLP Online and in the Cloud; 12.4 And Finally...
List of Available Corpora and Specifications; Corpora; Specifications, Guidelines, and Other Resources; Representation Standards
List of Software Resources; Annotation and Adjudication Software; Machine Learning Resources
MAE User Guide; Installing and Running MAE; Loading Tasks and Files; Saving Files; Defining Your Own Task; Frequently Asked Questions
MAI User Guide; Installing and Running MAI; Loading Tasks and Files; Adjudicating; Saving Files
Bibliography; References for Using Amazon’s Mechanical Turk/Crowdsourcing
Colophon