Data Analysis And Harmonization

by Jeff Voivoda



Product Details

ISBN-13: 9781450298247
Publisher: iUniverse, Incorporated
Publication date: 03/24/2011
Pages: 156
Product dimensions: 5.50(w) x 8.50(h) x 0.33(d)

iUniverse, Inc.

Copyright © 2011 Jeff Voivoda
All rights reserved.

ISBN: 978-1-4502-9824-7

Chapter One

This Organization Has Problems

Let's suppose for a minute that you are the person responsible for producing and disseminating a voluminous (and dreaded) "monthly" report. By good luck or bad, by knowledge or naivety, you're the poor sap who has to locate, collect, manipulate, crunch, grind, produce, and distribute this information to coworkers, colleagues, stakeholders, and clients.

First things first, let's gather all the required data. Okay, gathering all the required data isn't as easy as it sounds. As a top-notch employee, you know that sales figures come from the sales system. That's housed in the sales department. You also know the inventory totals are stored over in the warehouse system, part of the inventory management department. You can't access that system, but you can get Bob to dump those numbers into a file so you can use them (hopefully, Bob isn't on vacation!). Oh yes, and the expired customers need to be notified that their accounts are about to be closed. Who do you call for that, again? Oh, right, customer service has that data. You'd better get cracking!

Once you receive the sales figures, don't forget to apply the conversion program because sales volumes are expressed in pieces and the inventory amounts are expressed in components. Also, the part number associated with the pieces in the sales system can contain alphabetic characters, but the part number used for components in the warehouse system is only numeric. And you can't forget to review the product descriptions since the sales system truncates the description field at twenty-five characters, so the forty-character description fields from the inventory system sometimes just don't match. If there are any problems, you'll have to print out the inconsistencies, make certain the report lists the part numbers from both systems, and send it to Jane in the quality control (QC) department for her to review and (hopefully) rectify the problems. Of course, all this needs to be completed before the final report is delivered. While that's transpiring, you'd better start reviewing the list of expired accounts. Last month, we sent four expiration notices to the same person because the addresses were basically the same, except for some slight differences in each record in the database. How embarrassing!
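To make the duplicate-notice problem concrete, here is a minimal Python sketch of one way to catch address records that are "basically the same." It is an illustration only: the abbreviation table, the function names, and the sample records are all hypothetical, and a real cleanup effort would need far more thorough rules.

```python
import re
from collections import defaultdict

# Hypothetical abbreviation table; a real one would be much larger.
ABBREVIATIONS = {
    "street": "st",
    "avenue": "ave",
    "road": "rd",
    "apartment": "apt",
    "suite": "ste",
}

def normalize_address(raw: str) -> str:
    """Lowercase, drop punctuation, and collapse common spelling variants."""
    words = re.sub(r"[^a-z0-9\s]", " ", raw.lower()).split()
    return " ".join(ABBREVIATIONS.get(word, word) for word in words)

def group_duplicates(records):
    """Group (name, address) records that share the same normalized key."""
    groups = defaultdict(list)
    for name, address in records:
        key = (name.strip().lower(), normalize_address(address))
        groups[key].append((name, address))
    # Only groups with more than one record are potential duplicates.
    return [group for group in groups.values() if len(group) > 1]

# Made-up sample data for illustration.
records = [
    ("Pat Doe", "12 Main Street, Apt. 4"),
    ("Pat Doe", "12 Main St Apt 4"),
    ("pat doe", "12 MAIN ST., APT 4"),
    ("Lee Roe", "98 Oak Avenue"),
]
duplicates = group_duplicates(records)
```

With this sample, the three Pat Doe records land in a single group, so only one expiration notice would go out. Exact matching on a normalized key is a deliberately simple choice; it will not catch misspellings or transposed digits.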

Does This Problem Sound Familiar?

Should we continue this confusing, ineffective scenario, or do you get the picture? I think we can easily see this process needs some streamlining, and with little trouble we can spot the inefficiencies. Although the poor processing and data issues were exaggerated to make the point, there are indeed some organizations that operate in a haphazard way alarmingly similar to the disjointed and ineffective manner described in the scenario. Maybe you've even been part of a mess like this! I know I have!

A few years ago, I was working with a government agency that processed applications and issued certifications to applicants based on the data contained in the applications. There were two basic requirements to get the agency-issued certification: First, the application had to be complete, which meant that all the required data had to be supplied and any complementary documentation had to be provided. Second, the applicant had to pay a fee to apply and obtain the certification. Sounds reasonable, right? This agency had a major problem: there were three types of certifications that could be obtained, and each type of certification was processed using a different application and hence was stored in one of three different databases! Even though most of the application information was the same or very similar for each applicant (e.g., applicant name, applicant address, and so on), each was stored in its own database. And to make matters worse, the payment information was kept in yet another database in a completely different department. I was part of a team of analysts that came into this agency and harmonized their disparate data sources, streamlined the application process (in this case by allowing electronic submission of a single application form), and cut the length of time from application to certification almost in half!

Inefficient processes are usually the product of poorly structured data and databases. This stands to reason because if you have to access multiple databases on different technology platforms in order to gather, process, and present information, the process itself and the poor data storage strategies will be equally to blame for your issues!

You might say, "So what? What's the harm as long as the work gets done and the reports go out the door on time?" Shame on you! I hope you're not saying that! Don't be lulled into complacency by the mere fact that the work is getting done and the reports are being delivered! The data contained in those reports is questionable, or worse, inaccurate. Accepting erroneous data as accurate and reliable exposes you and your client to many potential risks! For example, let's say you've just laid the concrete foundation of a structure, and you hire an engineer to test the stability of the walls. If the data regarding the soundness of the material is inaccurate, you may build your structure on a faulty foundation. Consider the risk of that scenario! The data provided in any situation needs to be structured, consolidated, accurate, and then made more reliable. This data needs to be harmonized.

Analyzing and harmonizing data is not a new subject. Organizations at all levels and of all sizes have struggled with managing disparate data sources and maintaining multiple naming conventions as well as with a general lack of data reliability and standardization since long before computers came along.

There are many ways to systematically organize data. Data harmonization is neither new nor complex. In fact, it closely follows the principles of normalization. If you have ever categorized or classified data into groups and then identified relationships between those groups, guess what? You've already performed data harmonization at a high level! But before you pat yourself on the back too hard, let's dive deeper into this process.

Before we launch into data harmonization in all its glory, let's talk about the concerns that arise from having and enabling stovepipe systems and silos of information. A stovepipe system is a computer system whose functionality and processes are narrowly focused to provide specific data to a specific recipient. I've already cited an example from my past experience earlier in this book (the government agency) in which time and resources were wasted due to information silos. I'm betting you have an example or two you could share as well.

Issues with Information Silos

Generally speaking, here are some issues that every data analyst should be concerned with when confronted with disparate data sources:

• Redundancy: Data redundancy occurs when a data element is stored in more than one location or database at the same time. This creates issues with the reliability of the data being retrieved. Quite simply, the question becomes, Which occurrence of the data is correct?

• Authoritative data: This is "officially recognized data that can be certified and provided by an authoritative source." In other words, authoritative data is data that your organization provides and that is accepted by the consumer as reliable and accurate. If data is stored redundantly and/or must be converted in order to be presented, you are vulnerable to irregularities, mistakes, and errors.

• Stewardship: Data stewardship is the responsible management of all aspects of data and related metadata. In order for the data to be reliable, you must have a single point of contact (POC) for maintenance of the data and that POC should be a subject matter expert (or SME, pronounced "smee") for that specific data area. When multiple people update redundant data elements in multiple data stores, things quickly spiral out of control.

• Accessibility: Accessibility addresses the ability or authority to access, view, and update the data contained in the data stores. Accessibility should be restricted or constrained to only those persons who need access to the data. If one department can update the same data that is stored but restricted in another department, the data quickly becomes out of sync and inaccurate.

• Transfer: Data transfer issues arise when the format of data stored in the sending entity is not the same as the format in the receiving entity. There are methods for getting around such issues, including an interchange language such as eXtensible Markup Language (XML). But if the problem lies in the integrity of the data from the sending entity or in the application of a conversion routine, rather than in the intermediate exchange technology, then XML is of little value: it will faithfully transfer data that is wrong.

• Timeliness: Timeliness addresses the issue of retrieving the required data in a timeframe that allows the decision maker ample time to review, analyze, and make the correct decision based on accurate and reliable information. If the data has to be retrieved from several different data stores and then massaged, converted, and reformatted, chances are your stakeholder will not receive the information in a timely manner.
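The redundancy question ("Which occurrence of the data is correct?") can at least be surfaced automatically. The Python sketch below flags data elements whose values disagree between data stores. The store names, field names, and sample rows are hypothetical, and it assumes the part numbers have already been mapped to a shared key, which the sales and warehouse systems described earlier would not give you for free.

```python
from collections import defaultdict

def find_conflicts(stores, key_field, compare_field):
    """Return keys whose compare_field values disagree between stores."""
    seen = defaultdict(dict)  # key -> {store name: value}
    for store_name, rows in stores.items():
        for row in rows:
            seen[row[key_field]][store_name] = row[compare_field]
    # A conflict needs at least two stores holding different values.
    return {
        key: values
        for key, values in seen.items()
        if len(values) > 1 and len(set(values.values())) > 1
    }

# Made-up rows; the sales description is cut off at twenty-five characters.
stores = {
    "sales": [
        {"part": "1001", "description": "Widget, large, blue, with"},
        {"part": "1002", "description": "Gasket"},
    ],
    "inventory": [
        {"part": "1001", "description": "Widget, large, blue, with bracket"},
        {"part": "1002", "description": "Gasket"},
    ],
}
conflicts = find_conflicts(stores, "part", "description")
```

Here only part 1001 is reported, because the two stores hold different descriptions for it. The sketch finds the disagreement; deciding which store is authoritative is still a stewardship decision.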

How Does This Happen?

The issues raised by stovepipe systems can be considered the lowlights of the IT department. All of the issues listed, plus other issues, potentially expose your client to risks: risks to their reputation and credibility, and even possible financial risks in certain situations. Data becomes stovepiped for many reasons: if response times are too slow for one department, data may be held redundantly for performance purposes; one person or department may require data that is slightly different from what was already contained in a data store; geographic considerations may play a role in separating the data; and I've even seen people download data from a department data store to a desktop application in order to manipulate the numbers and then label that spreadsheet a "system" or authoritative source.

No information technology organization is immune to unreliable and disparate data stores. Stovepipes tend to evolve over time and fly under the radar. They usually go unnoticed until a stakeholder makes a poor decision based on data that someone in the organization provided. That's when questions come up about where the data resides and who is maintaining it—but by then, it may be too late!

In many cases, the size of the business drives the accuracy and integrity of the data and the associated data stores. The data structures of small businesses tend to be smaller, more centralized, and more accurate. They're simply holding less data. Generally speaking, the less voluminous the data, the easier it is to manage. As companies grow in size, complexity, and geographic diversity, the amount of data they need to store to sustain themselves also rises. Heck, even the federal government is not impervious to the stovepipe predicament. In fact, the federal government is engaged in a very active campaign to break down the silos of information, consolidate data sources where possible, and start leveraging data as an enterprise asset, rather than an agency-controlled, locally owned possession.

Successful data harmonization provides a means for your clients and customers to interact and communicate using data and information that is reliable, not overlapping, wholly understood, and definable across all required organizational boundaries. By eliminating duplicative data that exists in multiple data stores, you'll realize substantial financial gains because you no longer must maintain the silos of information and their associated technology platforms.

You also eliminate the need for the conversion programs that were composed to convert one data element in one system format to an acceptable format in another data store. And don't forget the personnel side of the equation either. Once the data has been harmonized and consolidated, just think of all the additional time some of your employees will have in which to complete other, more important tasks in the organization.

An Example Organization

One of the best ways to understand the data harmonization process is to look at an example. To that end, we will be following the data issues and activities of Monty's Import Service.

Let's first look at the corporate profile of Monty's Import Service:

• Monty's is a fairly small import operation that deals with imports only—no exports.

• Monty's only operates at three ports of entry that are passable by truck (land borders).

• Monty's clients regulate the importation of exactly ten commodities into the United States.

• Four of the ten commodities are categorized as perishable.

• Five of the ten commodities are categorized as nonperishable.

• One of the commodities is imported infrequently and doesn't fall into either of the other two categories.

So far, so good? Good! Let's continue.

In order to import products into the United States, importing companies have to submit import forms to Monty's Import Service to secure entry of the commodities across the borders in question. Here is a sample of the three forms utilized by Monty's Import Service:

Form 100-P: Permit to Import Perishable Goods


Form 100-NP: Permit to Import Nonperishable Goods


Form 200-NC: Permit to Import Special Goods


The three ports of entry that Monty's supervises are geographically diverse. One is in the state of Washington; another is in North Dakota; and the third is in Maine. And one last point to note: any of the ten products can be imported through any of the three ports of entry. (By the way, even though these points of entry are on land, they are still referred to as "ports" in import/export jargon.)

So as not to overcomplicate this example, let's assume that the business process is intact and working reasonably well for the import practice. The problem we're trying to solve lies mainly with the data and data storage, not with the business processes (although they may be at least partially to blame, and we will be reviewing them at a very high level).

As we progress in this book, I'll refer back to Monty's Import Service when necessary to help you gain a clearer understanding of how data analysis and harmonization function and how these methods can be practically applied to streamline data and make processes and overall operations more efficient.


The inefficient collection, storage, and maintenance of corporate data in some organizations today can lead to multiple business issues over the course of time. Business processes that once were sound and performed effectively can become fraught with ineffective procedures, redundantly held data, and data sets that have serious integrity and accuracy anomalies.

Data harmonization seeks to rectify data redundancy, unclear authoritative data sources, poor data stewardship, inaccessibility of data, and a lack of timeliness. By harmonizing data stores, you will notice not only cleaner data but also easier and timelier access to more reliable data as well as time and resource savings due to streamlined business processes. Your data is the most important corporate asset that you or your client owns. You must manage it well!

Chapter Two

Identify the Data Sources

As I already mentioned, data harmonization is an iterative process. Remember, we are imagining that you have been hired by a client to analyze and consolidate their disparate data stores; so your initial effort for the harmonization process should be directed toward identifying all data that your client and your client's business processes (and possibly their stakeholders) require to make business decisions and successfully complete their existing (and possibly target) business processes. In short, you want to give them the data they need to do their jobs—when they need it and in a format that makes sense.

Let me point out that the same process can (and should) be followed when you're performing data harmonization within an organization—not just when you are a data consultant to a client. We want to focus on the tasks that need to be accomplished, the process to be followed, and the corresponding benefits; it doesn't matter whether you're a hired gun or an internal analyst.

Getting Started on the Data Trail

The best way to start gathering the required data is to identify all the forms that the client is using. These can be paper forms if it's a paper-based process or online forms if the submission of the forms is electronic. Regardless of whether it is electronic or paper-based (or a mix of both), your initial goal is to identify all the data that is contained in those forms. Each field on the forms will be identified as an individual data element. As you identify the data elements, record available peripheral data, or metadata, about each of the data elements identified on the forms. During this process, you should also identify relationships that exist between and among the data elements, as this information will be essential later in the harmonization process.
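One way to keep track of what you find is a simple data element inventory. The Python sketch below records each form field as a data element with a little metadata and notes which forms it appears on. The form numbers are Monty's; the field names, types, and lengths are invented for illustration, since the actual form layouts are not reproduced here.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class DataElement:
    """One form field plus the metadata recorded about it."""
    name: str
    data_type: str
    max_length: Optional[int] = None
    forms: set = field(default_factory=set)  # forms on which the element appears

def build_inventory(form_fields):
    """Merge per-form field lists into one inventory keyed by element name."""
    inventory = {}
    for form_id, fields in form_fields.items():
        for name, data_type, max_length in fields:
            element = inventory.setdefault(
                name, DataElement(name, data_type, max_length)
            )
            element.forms.add(form_id)
    return inventory

# Hypothetical field lists: (name, data type, maximum length).
form_fields = {
    "100-P": [
        ("importer_name", "text", 40),
        ("port_of_entry", "text", 20),
        ("commodity", "text", 25),
    ],
    "100-NP": [
        ("importer_name", "text", 40),
        ("port_of_entry", "text", 20),
        ("commodity", "text", 25),
    ],
    "200-NC": [
        ("importer_name", "text", 40),
        ("port_of_entry", "text", 20),
    ],
}
inventory = build_inventory(form_fields)

# Elements that appear on more than one form are candidates for consolidation.
shared = sorted(name for name, e in inventory.items() if len(e.forms) > 1)
```

Elements that show up on more than one form are the first place to look for relationships and redundancy. Note that this sketch treats two fields as the same element whenever their names match; in practice the metadata (type, length, meaning) has to agree too before you can call them the same.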


Excerpted from Data Analysis and Harmonization by Jeff Voivoda. Copyright © 2011 by Jeff Voivoda. Excerpted by permission of iUniverse, Inc. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.

Table of Contents


Chapter 1
Chapter 2
Chapter 3
Chapter 4
Chapter 5
Chapter 6
Chapter 7
Chapter 8
Chapter 9
Chapter 10
Chapter 11
Chapter 12
