Entity Resolution and Information Quality

Entity Resolution and Information Quality presents topics and definitions, and clarifies confusing terminology regarding entity resolution and information quality. It takes a very wide view of IQ, including its six-domain framework and the skills defined by the International Association for Information and Data Quality (IAIDQ). The book includes chapters that cover the principles of entity resolution and the principles of Information Quality, in addition to their concepts and terminology. It also discusses the Fellegi-Sunter theory of record linkage, the Stanford Entity Resolution Framework, and the Algebraic Model for Entity Resolution, which are the major theoretical models that support Entity Resolution. In relation to this, the book briefly discusses entity-based data integration (EBDI) and its model, which serve as an extension of the Algebraic Model for Entity Resolution. There is also an explanation of how three commercial ER systems operate and a description of the non-commercial open-source system known as OYSTER. The book concludes by discussing trends in entity resolution research and practice. Students taking IT courses and IT professionals will find this book invaluable.

- First authoritative reference explaining entity resolution and how to use it effectively
- Provides practical system design advice to help you get a competitive advantage
- Includes a companion site with synthetic customer data for applicatory exercises, and access to a Java-based Entity Resolution program

eBook

$51.95

Product Details

ISBN-13:	9780123819734
Publisher:	Morgan Kaufmann Publishers
Publication date:	01/14/2011
Sold by:	Barnes & Noble
Format:	eBook
Pages:	256
File size:	4 MB

About the Author

Dr. John R. Talburt is Professor of Information Science at the University of Arkansas at Little Rock (UALR), where he is the Coordinator for the Information Quality Graduate Program and the Executive Director of the UALR Center for Advanced Research in Entity Resolution and Information Quality (ERIQ). He is also the Chief Scientist for Black Oak Partners, LLC, an information quality solutions company. Prior to his appointment at UALR he was the leader for research and development and product innovation at Acxiom Corporation, a global leader in information management and customer data integration. Professor Talburt holds several patents related to customer data integration, has written numerous articles on information quality and entity resolution, and is the author of Entity Resolution and Information Quality (Morgan Kaufmann, 2011). He also holds the IAIDQ Information Quality Certified Professional (IQCP) credential.

Read an Excerpt

ENTITY RESOLUTION AND INFORMATION QUALITY

By JOHN R. TALBURT

MORGAN KAUFMANN

Copyright © 2011 Elsevier Inc.
All rights reserved.
ISBN: 978-0-12-381973-4

Chapter One

PRINCIPLES OF ENTITY RESOLUTION

Entity Resolution

Entity resolution (ER) is the process of determining whether two references to real-world objects are referring to the same object or to different objects. The term entity describes the real-world object, a person, place, or thing, and the term resolution is used because ER is fundamentally a decision process to answer (resolve) the question, Are the references to the same or to different entities? Although the ER process is defined between pairs of references, it can be systematically and successively applied to a larger set of references so as to aggregate all the references to the same object into subsets or clusters. Viewed in this larger context, ER is also defined as "the process of identifying and merging records judged to represent the same real-world entity" (Benjelloun, Garcia-Molina, Menestrina, et al., 2009).
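As a rough illustration of how a pairwise decision can be applied across a larger set of references to form clusters, the following Python sketch may help. It is not taken from the book: the `same_entity` rule (equal normalized name and date of birth), the field names, and the sample references are all assumptions chosen for the example, and linked references are grouped with a simple union-find structure.

```python
from itertools import combinations


def same_entity(ref_a, ref_b):
    """Hypothetical pairwise decision: equal normalized name and birth date."""
    def norm(text):
        return " ".join(text.lower().split())
    return norm(ref_a["name"]) == norm(ref_b["name"]) and ref_a["dob"] == ref_b["dob"]


def resolve(references):
    """Apply the pairwise decision to every pair and group linked references."""
    parent = list(range(len(references)))

    def find(i):
        # Follow parent links to the cluster representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(references)), 2):
        if same_entity(references[i], references[j]):
            parent[find(j)] = find(i)  # merge the two clusters

    clusters = {}
    for i in range(len(references)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


# Made-up references: the first two differ only in spacing and letter case.
references = [
    {"name": "Mary  Doe", "dob": "1980-02-01"},
    {"name": "mary doe", "dob": "1980-02-01"},
    {"name": "John Smith", "dob": "1975-07-04"},
]
clusters = resolve(references)
print(clusters)
```

Comparing every pair is quadratic in the number of references, so this approach is only reasonable for small sets; it is meant to show the idea of going from pairwise decisions to clusters, not a production design.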

Entities are described in terms of their characteristics, called attributes. The values of these attributes provide information about a specific entity. Identity attributes are those that when taken together distinguish one entity from another. Identity attributes for people are things such as name, address, date of birth, and fingerprint, the kinds of things often asked for to identify the person requesting a driver's license or hospital admission. For a product, identity attributes might be model number, size, manufacturer, or universal product code (UPC).

A reference is a collection of attribute values for a specific entity. When two references are to the same entity, they are sometimes said to co-refer (Chen, Kalashnikov, Mehrotra, 2009) or to be matching references (Benjelloun, et al., 2009). However, for reasons that will be clear later, the term equivalent references will be used throughout this text to describe references to the same entity.

An important assumption throughout the following discussion of ER is the unique reference assumption. The unique reference assumption simply states that a reference is always created to refer to one, and only one, entity. The reason for this assumption is that in real-world situations a reference may appear to be ambiguous; that is, it could refer to more than one entity or possibly no entity. For example, a salesperson could write a product description on a sales order, but because the description is incomplete, the person processing the order might not be clear about which product is to be ordered. Despite this problem, it was the intent of the salesperson to reference a specific product. The degree of completeness, accuracy, timeliness, believability, consistency, accessibility, and other aspects of reference data can affect the operation of ER processes and produce better or worse outcomes. This is one of the reasons that ER is so closely related to the field of information quality (IQ).

Background

The concepts of entity and attribute are foundational to the entity-relation model (ERM) that is at the very core of modern data modeling and database schema design. The entity-relation diagram (ERD) is the graphical representation of an ERM and has long been considered a necessary artifact for any database development project. The relational model, first described by E. F. Codd (1970), was later refined into what we now know as the ERM by Peter Chen (1976). In the ERM, information systems are conceptualized as a collection of entities, each having a set of descriptive attributes and also having well-defined relationships with other entities.

Figure 1.1 shows a simple ERD illustrating a data model with three entity types: Instructor, Course, and Student. The line connecting the Instructor and Course entity types indicates that there is a relation between them. Similarly, the diagram shows that Course and Student entity types are related. Furthermore, in the ERD style used here, the adornments on the relation line give more detail about these relationships. For example, the triangular configuration of short lines, sometimes called a crow's foot, at the junction of the relation line with an entity indicates a many-to-one relationship. In this example it indicates that one Instructor entity may be related to (be the instructor for) more than one Course entity. The additional adornment of a single bar with the crow's foot further constrains the relation by indicating that each Instructor entity must be related to (assigned to) at least one Course entity. The double bar at the junction of this same relation and the Instructor entity is used to indicate an exactly-one relationship. Here it represents the constraint that each Course entity must be related to (has assigned to it) one, and only one, Instructor entity. The crow's foot symbol with a circle that appears at both ends of the relation between the Course and Student entities indicates a zero-to-many relation. This means that any given Student entity may be related to (enrolled in) several Course entities, or in none. Conversely, any given Course entity may be related to (have in it) several Student entities, or none.

Each entity type also has a set of attributes that describes the entity. For example, the Instructor entity type has the three attributes FacultyID, Name, and Department. Assigning values to these attributes defines a particular instructor, called an instance of the Instructor entity. By the previous definition, an instance of an entity is also an entity reference. A fundamental rule of ERM is that every instance of an entity should have a unique identifier. Codd (1970) called this the Entity Identity Rule. A primary key is an identity attribute or group of identity attributes selected by the data modeler because the combination of values taken on by these attributes will be unique for each entity instance. However, at the design stage, it is not always clear that a particular combination of descriptive attributes will have this property, or if it does, that the combination will continue to be unique as more instances of the entity are acquired. For this reason data modelers often play it safe by adding another attribute to an entity type that does not describe any intrinsic characteristic of the entity but is simply there to guarantee that each instance of the entity has a primary key. For example, in Figure 1.1, with only name and department as the identity attributes for the Instructor entity, it is conceivable that a department could have two instructors with the same name. If this were to happen, the combination of name and department would no longer meet the requirements to form a primary key. By adding a FacultyID attribute as a third attribute and by controlling the values assigned to FacultyID, it is possible to guarantee that each instance of the Instructor entity has a unique primary key value. Called surrogate keys, the values for these artificial keys have no intrinsic meaning, such as a FacultyID value of "T1234" or an Employee_Number of "387."
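The name-and-department collision described above can be sketched in a few lines of Python. This is only an illustration under assumed data: the instructor rows are invented, and the "T" plus four-digit pattern for FacultyID is a guess at a format modeled on the "T1234" example, not a scheme the book prescribes.

```python
from itertools import count

# Invented Instructor instances; the first two are different people who happen
# to share a name and a department.
instructors = [
    {"Name": "Pat Lee", "Department": "Math"},
    {"Name": "Pat Lee", "Department": "Math"},
    {"Name": "Ana Cruz", "Department": "Physics"},
]

# Natural key: keying rows by (Name, Department) silently collapses the two
# Pat Lee instances into one entry, so the pair cannot serve as a primary key.
by_natural_key = {(row["Name"], row["Department"]): row for row in instructors}

# Surrogate key: the values carry no meaning; uniqueness comes only from
# controlling how they are assigned (here, a simple counter).
next_id = count(1)
by_surrogate_key = {f"T{next(next_id):04d}": row for row in instructors}

print(len(by_natural_key), len(by_surrogate_key))
print(list(by_surrogate_key))
```

With the natural key, one of the three instances is lost; with the surrogate key, all three remain addressable.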

In theory, ER should never be a problem in a well-designed database because two entity instances should be equivalent if, and only if, they have the same primary key. When this is true, it allows information about the same entity in different tables of the database to be brought together by simply matching instances with the same primary key value through what is called a table join operation.
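A minimal Python sketch of such a join, using the Instructor and Course entity types from Figure 1.1, might look like the following. The CourseID and Title columns, the function name, and all of the data values are assumptions made for the example; only FacultyID, Name, and Department come from the text.

```python
# Instructor table keyed by its primary key, FacultyID.
instructor_table = {
    "T1234": {"Name": "Pat Lee", "Department": "Math"},
    "T5678": {"Name": "Ana Cruz", "Department": "Physics"},
}

# Course table; each row carries the FacultyID of its one assigned instructor.
course_table = [
    {"CourseID": "C1", "Title": "Calculus I", "FacultyID": "T1234"},
    {"CourseID": "C2", "Title": "Linear Algebra", "FacultyID": "T1234"},
    {"CourseID": "C3", "Title": "Mechanics", "FacultyID": "T5678"},
]


def join_courses_to_instructors(courses, instructors):
    """Inner join: pair each course row with the instructor row sharing its key."""
    joined = []
    for course in courses:
        instructor = instructors.get(course["FacultyID"])
        if instructor is not None:  # rows without a matching key are dropped
            joined.append({**course, **instructor})
    return joined


joined = join_courses_to_instructors(course_table, instructor_table)
for row in joined:
    print(row["CourseID"], row["Title"], row["Name"], row["Department"])
```

The join works only because both tables agree on the key values, which is exactly the condition that tends to fail in practice.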

The problem is that these artificial primary keys must be assigned when the instance is entered into the database and maintained throughout the life cycle of the entity, and there is no guarantee that this will always be done correctly. An even greater problem is that the same entity may be represented in different databases or even different tables within the same database, using a different primary key. In other situations the references may lack key values because they came from a nondatabase source or were extracted from a database without including the key. ER in a database context is sometimes referred to as the problem of heterogeneous database join (Thuraisingham, 2003; Sidló, 2009).

ER systems that provide heterogeneous database join functionality are often employed by law enforcement and intelligence agencies, where each agency maintains a separate database of entities of interest, with each using a different scheme for primary keys. In this setting, the ER system acts as a "hub" that connects to each of the databases. When an entity reference from an investigation is entered, the system reformats the reference information as a query appropriate to each database and returns the matching results to the user. The Identity Resolution Engine® by Infoglide Software®, discussed in Chapter 5, is an example of a commercial system that provides this type of functionality. Chapter 7 discusses the growing trend to use ER hub architectures as a solution to the problem of bringing together information about a common set of entities held in independently maintained systems.
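The hub idea can be sketched roughly as follows. This is a toy illustration, not a description of how the Identity Resolution Engine or any other commercial system works: the two agency databases, their key schemes and field names, and the exact-match rule on normalized name and date of birth are all hypothetical.

```python
def normalize(name):
    """Lowercase and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(name.lower().split())


# Two hypothetical agency databases with different key schemes and field names.
agency_a = {
    "A-001": {"full_name": "Mary Doe", "birth_date": "1980-02-01"},
    "A-002": {"full_name": "John Smith", "birth_date": "1975-07-04"},
}
agency_b = {
    9001: {"given": "Mary", "surname": "Doe", "dob": "1980-02-01"},
    9002: {"given": "Ana", "surname": "Cruz", "dob": "1990-11-30"},
}


def query_a(ref):
    """Express the reference in agency A's schema and return matching keys."""
    return [key for key, row in agency_a.items()
            if normalize(row["full_name"]) == normalize(ref["name"])
            and row["birth_date"] == ref["dob"]]


def query_b(ref):
    """Express the reference in agency B's schema and return matching keys."""
    return [key for key, row in agency_b.items()
            if normalize(f"{row['given']} {row['surname']}") == normalize(ref["name"])
            and row["dob"] == ref["dob"]]


# The hub knows one query adapter per connected database.
SOURCES = {"agency_a": query_a, "agency_b": query_b}


def hub_lookup(ref):
    """Send one reference to every source and collect the matches per source."""
    return {source: query(ref) for source, query in SOURCES.items()}


result = hub_lookup({"name": "mary  doe", "dob": "1980-02-01"})
print(result)
```

Because no shared primary key exists, the match has to rest on identity attributes, and each adapter hides its database's own schema from the user of the hub.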

Entity versus Entity Reference

Although instances of an entity type are often called entities by data modelers, it is important to understand that in the context of ER, instances of an ERM entity type are not entities. An instance of an entity type, such as the Student entity type in Figure 1.1, is just a row in the Student database table inside the computer. The instance is only a reference to a real student walking around campus and attending classes. In an ER context, entities do not exist in the information system—they exist in the real world. More than a nuance in terminology, the distinction between an entity and an entity reference is fundamental to understanding ER.

ER Principle #1: Information systems store and manipulate references to entities, not the entities.

Figure 1.2 illustrates how many combinations and variations of identity attributes such as name, size, quantity, manufacturer, and product code can lead to multiple references to the same item. The same situation can occur with place entities that have attributes such as postal address, global positioning system (GPS) coordinates, or landmark references, and event entities that have attributes such as name, date, time, attendees, and location.

As a business example, suppose that the entity type is a customer of a business—a person. The same customer may be referenced by many different records in the company's information system or, in some cases, multiple systems. There are many reasons that a company might create multiple references to the same customer. It may be the result of the customer having purchased items through different branches, departments, or sales channels of the company. Each sales channel often has its own database, and that database may not be properly integrated with other databases in the company. Databases that do not share information across the company with other systems are sometimes called data silos. Recognition that information about critical business entity types such as customer and product should be synchronized across the entire enterprise has given rise to the practice of master data management (MDM) (Loshin, 2008).

Another reason for the proliferation of customer references in a business is that customer characteristics, especially contact information, change over time. If changes to customer contact information such as name, mailing address, telephone, or email address are not captured and managed properly, the system may assume that transactions using the unrecorded contact information represent a new customer rather than one the system already recognizes. Recognizing that these records are actually references to the same customer is the essence of ER.

In other cases the problem may simply be the lack of proper information quality controls on manual data entry, which allows errors or variations in data values to enter the system. Maydanchik (2007) describes a number of ways in which data quality errors can be introduced into an information system, including data brought in from outside, processes that change data within the system, and data decay.

ER in the context of customers, whether the customers are consumers (individuals) or other businesses, is called customer data integration (CDI). CDI is an essential component of customer relationship management (CRM). CRM is an enterprisewide process intended to give a company a competitive advantage by improving each customer's experience with the business. Dyché and Levy (2006) describe CRM as implying "that a company is thinking about and acting toward its customers individually ..." An obvious first step in accomplishing this objective is to have a complete picture of a customer's interactions with the company through effective CDI.

(Continues...)

Excerpted from ENTITY RESOLUTION AND INFORMATION QUALITY by JOHN R. TALBURT Copyright © 2011 by Elsevier Inc. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

Chapter 1 Principles of Entity Resolution
Chapter 2 Principles of Information Quality
Chapter 3 Entity Resolution Models
Chapter 4 Entity-Based Data Integration
Chapter 5 Entity Resolution Systems
Chapter 6 The OYSTER Project
Chapter 7 Trends in Entity Resolution Research and Applications

What People are Saying About This

From the Publisher

Learn how to integrate and use your customer and product information data to stay ahead of your competition!

Editorial Reviews

"This book is comprehensive, timely, and on the leading edge of the topic. In addition to being comprehensive and systematic, the book has two distinct characteristics: (1) it addresses the issue of entity relationships, which go beyond entity matching. This novel approach generates much richer information about entities; (2) it discusses not only techniques, but also systems that implement the techniques. This system-oriented approach helps the reader to see how to apply the techniques for problem solving." —Dr. Hongwei (Harry) Zhu - Assistant Professor of Information Technology in the College of Business and Public Administration, Old Dominion University

"Talburt, the author of this book, is one of the organizers of the first graduate degree program in information quality, hosted by the University of Arkansas at Little Rock. The book contains seven easy-to-read chapters. A chapter on trends and research topics in entity resolution closes this short textbook. Some of the suggestions will undoubtedly encourage graduate students to pursue their research on data integration topics. The book offers interesting pointers and bibliographic references for exploring new avenues of research." —Computing Reviews

"Talburt (information science, U. of Arkansas-Little Rock) presents a textbook developed from a graduate course on the two emerging specialties within information science. Students tend to come from a number of disciplines, so no deep background in information science is assumed, and the material may even be suitable for upper-level undergraduate courses. He covers principles of entity resolution and information quality, entity resolution models and systems, entity-based data integration, the OYSTER open-source software development project, and trends in research and applications." —SciTech Book News
