Tapping into Unstructured Data: Integrating Unstructured Data and Textual Analytics into Business Intelligence [NOOK Book]

NOOK Book (eBook)
$22.99 BN.com price
$39.99 list price (save 42%)

Overview

“The authors, the best minds on the topic, are breaking new ground. They show how every organization can realize the benefits of a system that can search and present complex ideas or data from what has been a mostly untapped source of raw data.”

--Randy Chalfant, CTO, Sun Microsystems

The Definitive Guide to Unstructured Data Management and Analysis--From the World’s Leading Information Management Expert

A wealth of invaluable information exists in unstructured textual form, but organizations have found it difficult or impossible to access and utilize it. This is changing rapidly: new approaches finally make it possible to glean useful knowledge from virtually any collection of unstructured data.

William H. Inmon--the father of data warehousing--and Anthony Nesavich introduce the next data revolution: unstructured data management. Inmon and Nesavich cover all you need to know to make unstructured data work for your organization. You’ll learn how to bring it into your existing structured data environment, leverage existing analytical infrastructure, and implement textual analytic processing technologies to solve new problems and uncover new opportunities. Inmon and Nesavich introduce breakthrough techniques covered in no other book--including the powerful role of textual integration, new ways to integrate textual data into data warehouses, and new SQL techniques for reading and analyzing text. They also present five chapter-length, real-world case studies--demonstrating unstructured data at work in medical research, insurance, chemical manufacturing, contracting, and beyond.

This book will be indispensable to every business and technical professional trying to make sense of a large body of unstructured text: managers, database designers, data modelers, DBAs, researchers, and end users alike.

Coverage includes

  • What unstructured data is, and how it differs from structured data
  • First generation technology for handling unstructured data, from search engines to ECM--and its limitations
  • Integrating text so it can be analyzed with a common, colloquial vocabulary: integration engines, ontologies, glossaries, and taxonomies
  • Processing semistructured data: uncovering patterns, words, identifiers, and conflicts
  • Novel processing opportunities that arise when text is freed from context
  • Architecture and unstructured data: Data Warehousing 2.0
  • Building unstructured relational databases and linking them to structured data
  • Visualizations and Self-Organizing Maps (SOMs), including Compudigm and Raptor solutions
  • Capturing knowledge from spreadsheet data and email
  • Implementing and managing metadata: data models, data quality, and more

William H. Inmon is founder, president, and CTO of Inmon Data Systems. He is the father of the data warehouse concept, the corporate information factory, and the government information factory. Inmon has written 47 books on data warehouse, database, and information technology management, as well as more than 750 articles for trade journals such as Data Management Review, Byte, Datamation, and ComputerWorld. His b-eye-network.com newsletter currently reaches 55,000 people.

Anthony Nesavich worked at Inmon Data Systems, where he developed multiple reports that successfully query unstructured data.

Table of Contents

Preface

1 Unstructured Textual Data in the Organization

2 The Environments of Structured Data and Unstructured Data

3 First Generation Textual Analytics

4 Integrating Unstructured Text into the Structured Environment

5 Semistructured Data

6 Architecture and Textual Analytics

7 The Unstructured Database

8 Analyzing a Combination of Unstructured Data and Structured Data

9 Analyzing Text Through Visualization

10 Spreadsheets and Email

11 Metadata in Unstructured Data

12 A Methodology for Textual Analytics

13 Merging Unstructured Databases into the Data Warehouse

14 Using SQL to Analyze Text

15 Case Study--Textual Analytics in Medical Research

16 Case Study--A Database for Harmful Chemicals

17 Case Study--Managing Contracts Through an Unstructured Database

18 Case Study--Creating a Corporate Taxonomy (Glossary)

19 Case Study--Insurance Claims

Glossary

Index


Product Details

  • ISBN-13: 9780132712910
  • Publisher: Pearson Education
  • Publication date: 12/25/2007
  • Sold by: Barnes & Noble
  • Format: eBook
  • Edition number: 1
  • Pages: 288
  • Sales rank: 485,071
  • File size: 4 MB

Meet the Author

Bill Inmon--the "father of data warehousing"--has written 50 books and published in nine languages on subjects such as data warehousing, database design, and architecture.

For current events, seminars, conference speaking schedules, and a lot of other information related to data warehousing, unstructured data, and textual ETL, take a look at Bill Inmon’s Web site at inmoncif.com.

Anthony aka “Tony” Nesavich received his master's degree in computer information technology from Regis University in Denver, Colorado. He worked with Bill Inmon at Inmon Data Systems (IDS), where he was instrumental in the development of the IDS Foundation software. Many of Tony’s contributions to IDS are discussed in this book. Tony lives in Denver, Colorado, with his wife Melissa and his faithful dog, Lola.


Read an Excerpt

Preface

There have been two environments that have grown up side by side—the structured environment and the unstructured environment. The structured environment is typified by transactions, databases, records, keys, and attributes. The unstructured environment is typified by email, spreadsheets, medical records, documents, and reports.

It is amazing that at the same time that these worlds have grown up side by side, they have grown separately. It is as if these worlds exist in alternate universes.

The world of analytics and business intelligence has grown up around structured information. With business intelligence, we have displays of information, summaries, pivots, and an entire world of analytical processing. With business intelligence, we can make sense of the numbers, facts, and figures that hide out in the systems that run our corporations.

For analyses of text—unstructured information—there is nowhere near the amount of sophistication that exists in the structured environment. In the unstructured world, a few search engines can find documents and that is about it.

Does that mean that there is no important or useful information in the unstructured environment? The answer is—of course not. There is a wealth of important and useful information in the unstructured environment, but it is not as easily recoverable as information in the structured environment. The information in the unstructured environment is much more difficult to get a handle on.

There are many reasons why textual data is more difficult to handle than structured, transaction-oriented data. The primary reason is the lack of repeatability of textual data and the lack of predictability about the contents of the data. Textual data is hard to handle because it is hard to find, and it is hard to find because it does not entail repetition to any great degree.

This book is about doing textual analytics and the technologies that can be used to do textual analytics.

Two major architectural and technological approaches to doing textual analytics are used. One approach is to look at and gather the textual data in the unstructured environment. When there, the textual data is analyzed and manipulated in the unstructured environment. The unstructured environment seems like a natural place to do textual analytics because, after all, the text resides in the unstructured environment.

The other architectural approach is to look at and gather the textual data in the unstructured environment and then bring the textual data to the structured environment to do the textual analytics there.

It might seem strange or even unnatural to take the approach of accessing and gathering textual data in the unstructured environment and then bringing the textual data to the structured environment for analytical processing; however, there are good reasons for doing exactly that.

Some of those reasons follow:

  • The analytical environment has already been created in the structured environment. If we bring unstructured data to that environment, we can leverage existing investments. We already have trained end users, trained support staff, and licenses in place. So, why not bring the unstructured text to the structured environment where analytical tools are already in place?
  • Proprietary software. When we bring in technology to do analytical processing in the unstructured environment, that technology is proprietary. Do we actually want more proprietary software in our world? Isn’t it a much more rational approach to use open software that has thousands of users and uses around the world, rather than bring in proprietary software that might or might not meet the long-term goals of the organization?
  • By bringing unstructured text to the structured environment, we can create links between the unstructured data and our structured data, making possible analysis that otherwise would not have been possible. In doing so, we can build an integrated data warehouse that takes into account both structured and unstructured data.
  • If we don’t bring unstructured data to the structured environment, we are going to have to re-create the analytical infrastructure in the unstructured environment. Is that something that is advisable to do? We already have an analytical infrastructure. Why not use it?

For these reasons, this book is about what is required to go to the unstructured environment, find and integrate the textual data there, and then bring the unstructured textual data to the structured environment and organize it in a meaningful manner. After the textual data is in the structured analytical environment, a new world of analyses opens up.
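The idea of bringing text into the structured environment and linking it to structured data can be sketched briefly. What follows is only an illustration, not the authors' method or software: the table names (`customers`, `email_terms`), the sample emails, and the helper functions are all hypothetical, and Python's built-in `sqlite3` module stands in for whatever database an organization already has.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with one structured table and one
    table of words extracted from (invented) email text."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE customers (customer_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE email_terms (customer_id INTEGER, term TEXT);
        """
    )
    conn.executemany(
        "INSERT INTO customers VALUES (?, ?)",
        [(1, "Acme"), (2, "Globex")],
    )
    # Hypothetical unstructured input: (customer_id, email body).
    emails = [
        (1, "Invoice overdue, please send refund"),
        (2, "Refund received, thanks"),
        (1, "Second refund request"),
    ]
    for customer_id, body in emails:
        # Naive tokenization: lowercase, drop commas, split on whitespace.
        for word in body.lower().replace(",", " ").split():
            conn.execute(
                "INSERT INTO email_terms VALUES (?, ?)", (customer_id, word)
            )
    return conn


def mentions_by_customer(conn, term):
    """Count how often a term appears in each customer's emails by
    joining the text-derived table to the structured customer table."""
    rows = conn.execute(
        """
        SELECT c.name, COUNT(*)
        FROM customers AS c
        JOIN email_terms AS t ON t.customer_id = c.customer_id
        WHERE t.term = ?
        GROUP BY c.name
        ORDER BY c.name
        """,
        (term,),
    )
    return rows.fetchall()


conn = build_demo_db()
print(mentions_by_customer(conn, "refund"))
```

Once the words sit in an ordinary table, the same SQL tooling used for structured data can query them, which is the kind of leverage the reasons above describe.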

One of the recurring themes of this book is the need for integration of text before it is useful. In most environments and in most circumstances, text is nonhomogeneous. People might talk in English, but for all practical purposes, they speak in dialects. Before analytical processing can be done effectively, there must be a common tongue established. Analyses can be done effectively only when a common tongue is established. Stated differently, if all you do is gather text and throw it into a database, you end up with the Tower of Babel. The Tower of Babel led nowhere, certainly not up to God.

One of the requirements of textual analytical processing is accessing and analyzing text in a colloquial vocabulary and a common vocabulary. The textual analyst needs both abilities.
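One way to picture keeping both vocabularies is a glossary lookup that records each word as written alongside its common term. This is a minimal sketch under stated assumptions: the glossary entries and the `integrate` function are invented for illustration and are not taken from the book.

```python
import re

# Hypothetical glossary: colloquial term -> common term (invented entries).
GLOSSARY = {
    "car": "vehicle",
    "auto": "vehicle",
    "truck": "vehicle",
    "bill": "invoice",
    "statement": "invoice",
}


def integrate(text):
    """Return (colloquial, common) pairs for every word in the text.

    Words missing from the glossary map to themselves, so the original
    wording is always preserved next to the common term.
    """
    words = re.findall(r"[a-z]+", text.lower())
    return [(word, GLOSSARY.get(word, word)) for word in words]


pairs = integrate("The auto bill and the truck statement")
common_terms = [common for _, common in pairs]
print(pairs)
print(common_terms.count("vehicle"), common_terms.count("invoice"))
```

Storing the pair rather than only the replacement lets an analyst search by the original wording or by the common term.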

The classical approach to text and text processing is to use semantics and natural language processing. This book describes a different approach. Without fail, the approach taken in this book is that text, made up of words, is just another form of data. The approach that looks at words as just another unit of data frees the analyst from the trap of context. It is true that words taken out of context can have twisted meanings on some occasions. It is also true that freeing words from context opens up the door to entirely new and novel kinds of processing that simply are not possible when having to stop and consider the context of text at every turn.

There is a tradeoff. Paying attention to context when dealing with text entails a certain set of opportunities and precision. However, freeing text from context opens up entirely new and exciting vistas.

This book assumes that words are treated as just another unit of data and does not take context into consideration in 99.99 percent of the cases.
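As a rough illustration of treating words as plain units of data, the sketch below counts terms across a few documents with no attempt to interpret context. The sample documents, the stop-word list, and the `term_counts` function are assumptions made for this example only.

```python
import re
from collections import Counter

# Hypothetical stop words to ignore; a real list would be much longer.
STOP_WORDS = frozenset({"the", "a", "and", "of", "was", "to"})


def term_counts(documents):
    """Count every non-stop word across all documents, ignoring context."""
    counts = Counter()
    for text in documents:
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in STOP_WORDS)
    return counts


# Invented sample documents.
documents = [
    "The claim was denied",
    "Claim approved and paid",
    "Denied claim appealed",
]
counts = term_counts(documents)
print(counts.most_common(2))
```

The counts say nothing about why a claim was denied, which is the precision given up; in exchange, the words can be tallied, grouped, and compared like any other column of data.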

This book is for a wide audience. It is for students of computer science, general managers, database designers, data modelers, database administrators, researchers, and end users—in short, it is for anyone facing the challenge of taking a body of text and trying to make sense of it. In addition, this book answers the questions, “How do we bridge the gap between structured and unstructured systems?” and “How do we create an integrated data warehouse that incorporates both structured and unstructured data?”

The discipline of textual analytics is in its infancy; it is entirely predictable that more discussion and more advances will be made in the future about this subject. This book represents merely the first step in what is likely to be a massive field of endeavor in years to come.

We hope that you find the book full of useful information. We hope the book at least sets you down the right path to enjoying the fruits of textual analytics.

Bill Inmon, Jan 11, 2007

Tony Nesavich, Jan 11, 2007


