Read an Excerpt
JOE CELKO'S Analytics and OLAP in SQL
By Joe Celko
MORGAN KAUFMANNCopyright © 2006 Elsevier Inc.
All right reserved.
Chapter OneBasic Reports and History
Prisoner: Where am I?
Number Two: In the Village.
Prisoner: What do you want?
Number Two: Information.
Prisoner: Which side are you on?
Number Two: That would be telling. We want information, information, information ...
Prisoner: You won't get it.
Number Two: By hook or by crook we will.
Prisoner: Who are you?
Number Two: The new Number Two.
Prisoner: Who is Number One?
Number Two: You are Number Six.
Prisoner: I am not a number. I am a free man.
Number Two: Ha, ha, ha, ha ...
—The Prisoner (1967)
Once upon a time, long, long ago, we had punch cards and magnetic tapes for data processing. We printed reports onto paper. There were no databases or cheap video terminals. Instead we produced tons of continuous-form paper outputs, and I do mean tons.
The problem was that nobody read these reports. In fact, there was a standard trick in software engineering consulting in the 1980s to find out about the usage. You took one of these huge stacks of paper and inserted an envelope with a $10 bill in the middle of the report. The note told the person who opened the envelope to keep the money and return the note, which had the name of the report on it, to the consultant. Only about one out of 10 or 12 envelopes were discovered.
Data was failing to be turned into information. People simply got a sense of security and importance from looking at a large stack of paper. When hardcopy reports were used, it was usually just the summary-level data. The whole report would sit around in a binder until the next time it was run. The new report would replace the old in the binder, and a few summary pages would be put into a second binder.
If the report did not give you the data in a format you could use, you had to reformat it by hand with a calculator. Later on, you would use a spreadsheet, but the idea was the same.
The reason for all of this was that computers and computer time were very costly, and display terminals were years away. Running a paper report once and letting people share it was the economical sound decision.
The simplest possible report is a printout of each record on a separate page or in a few lines on a case-by-case basis. In the old days, however, you could not afford to run a program to print a few pages on demand. We would print out a set of these cases with a date on them, do updates manually, and then key in the changes in batches. The real "database" was a vertical filing cabinet.
This meant that you had to have a sorting key that defined the order of the output. This usually meant alphabetical or numerical order on one or more fields. The classic example is a printed telephone book. If you know a name and alphabetical order, it is quite easy. However, if you wanted to find all the people who live on Elm Street, you would want to get the same data in a very different order.
People do not appreciate the power of alphabetical ordering for data. Hong Kong telephone operators commit thousands of names and numbers to memory because Chinese does not have a way to map sounds to characters.
Much of the computer time in the early days of data processing was spent sorting data to get reports in the desired order. Since magnetic tapes are sequential by their physical nature, we got very good at merging and sorting algorithms. This also meant that we assumed that all data had to be in sequential order and that we needed to have a physical sort key.
This assumption became the primary key in SQL. By the time we realized that a relational key has nothing to do with the physical ordering of the data's storage, and that all keys are equally keys, the language was defined.
Today, cases can be called up on a screen and updated instantly. But, even more significantly, you do not have to use one sort key to get to the data.
1.2 Control-Break Reports
The next type of report goes by several names: control-break or banded reports are the most common. They consist of lines of detailed information, followed by summary information at various levels of aggregation. The assumption is that your data falls into a hierarchy. Today, a dimension hierarchy is just another type of control break. Like the alphabet, hierarchies are a useful way to organize data. Some things don't change, they simply get mechanized.
Its layout was determined by the hardware that we originally used to produce these reports. The electronic accounting machines had a limited number of accumulators and almost no internal data storage. A record was read from a deck of cards (later from a magnetic tape) and printed on the output paper, and the relevant fields were added to the accumulators as the data passed through the hardware. When the record that started the next group was read, the accumulator contents was output and reset. The number of levels of aggregation was determined by the hardware.
The way that people actually used these reports was to ignore the details and flip the stack of paper to the final summary pages. The grand totals were on the last pages, so people would put them into a binder and throw everything else out.
The obvious way to design such a report would have been to start with the grand totals, then move to subtotals, then sub-subtotals, and finally to the details. Today, this is what is done with terminals and drilldown displays.
1.3 Cross-Tabulation Reports
The third most common report is a cross-tabulation report, also known as "crosstabs" in statistics. This has become more important, because it is the basis for cubes and other OLAP storage methods, which we will discuss later.
A simple crosstabs is an array of cells on at least two attributes. The values of one attribute go on the horizontal axis, and the values of the other attribute go on the vertical axis. Each cell has a statistic about the cases for the entities that have both values. For example, a sex by race crosstabs would have a total count for white females, white males, black females, black males, and so forth.
Most such reports also include row and column totals, such as males and females, and a total for each racial category. While simple counts are the most common statistic used, you can also compute averages, extrema, and more powerful statistics such as chi-squares.
Interactive packages, such as Brio, allow the user to arrange and trim the grid with interactive tools.
1.4 Presentation Graphics
The fourth kind of report was not very popular in the old days, because it had to be drawn by hand. These were the graphic presentations, such as pie charts, bar charts, and line charts.
The early computers did not have the output devices to do these things for us. I am not going to bother with a discussion of this type of report for two reasons. The first reason is that I am more concerned with data that is represented in a purely computerized format and ways to manipulate it within a computer.
The second reason is that the best material about this type of report has been presented by Edward Tufte in his books and courses (http:// www.edwardtufte.com). His book, The Visual Display of Quantitative Information, is a classic that anyone who works with data should own. Stephen Few (www.perceptualedge.com) is also writing good stuff on information presentation.
1.5 Local Databases
The final kind of report is not really a traditional report at all. Personal computers and mass storage are cheap. We can send the user a small, almost personal database of his or her own to use on a local machine as he or she sees fit. This is not the same thing as sending a report to the user that cannot be altered. We download raw data from a trusted source, usually a central company database.
The user can be a subscriber who gets new data on a regular basis from a central data source. This is handy for things that have a regular schedule, such as monthly inventory changes or sales leads. One of the best examples of this kind of reporting is continuous stock market downloads. The raw data comes from a service and gets formatted and analyzed locally by the subscriber's program.
The other model is to download data on request, but the idea of local report generation is still the same. The most common example of this is the Internet. You probably do not think of it as a reporting service.
In recent years, companies have also set up servers that do nothing but provide reports.
Across-tabulation, or crosstabs for short, is a common statistical report. It can be done in IBM's QMF tool, using the ACROSS summary option, and in many other SQL-based reporting packages. SPSS, SAS, and other statistical packages have library procedures or language constructs for crosstabs. Many spreadsheets can load the results of SQL queries and perform a crosstabs within the spreadsheet. You can even do it with Microsoft Access (the poor man's analytics) crosstab queries!
If you can use a reporting package on the server in a client/server system instead of the following method, do so. It will run faster and in less space than the method discussed here. However, if you have to use the reporting package on the client side, the extra time required to transfer data will make these methods on the server side much faster.
A crosstabs takes the values in a column, turns them into columns in the report. For example, if you have a table with Sales data, a crosstabs would have columns for each year that appears in the sales date "year" axis and the product codes as rows on the other "Products" axis. The following SQL code is based on a trick that mathematician call an identity matrix. When you do a CROSS JOIN with the CrossTabs table, you multiply a value by one if you want to keep it and multiply by 0 if you want to drop the value.
Obviously, (S1.product_price * S1.qty) is the total dollar amount of each product in each year. The yearN column will be either a 1 or a 0. If it is a 0, the total dollar amount in the SUM() is 0; if it is a 1, the total dollar amount in the SUM() is unchanged.
This solution lets you adjust the time frame being shown in the report by replacing the values in the year column to whichever consecutive years you wish. A two-way crosstabs takes two variables and produces a spreadsheet with all values of one variable in the rows and all values of the other represented by the columns. Each cell in the table holds the COUNT of entities that have those values for the two variables. NULLs will not fit into a crosstabs very well, unless you decide to make them a group of their own.
Another trick is to use the POSITION() function to convert a string into a 1 or a 0. For example, assume we have a "day of the week" function that returns a three-letter abbreviation, and we want to report the sales of items by day of the week in a horizontal list.
There are also totals for each column and each row, as well as a grand total. Crosstabs of (n) variables are defined by building an n-dimensional spreadsheet. But you cannot easily print (n) dimensions on two-dimensional paper. The usual trick is to display the results as a two-dimensional grid with one or both axes as a tree structure. The way the values are nested on the axis is usually under program control; thus, "race within sex" shows sex broken down by race, whereas "sex within race" shows race broken down by sex.
Assume that we have a table, Personnel (emp_nbr, sex, race, job_nbr, salary_amt), keyed on employee number, with no NULLs in any columns. We wish to write a crosstabs of employees by sex and race, which would look like this:
asian black caucasian hispanic Other TOTALS ==========================================================
Male 3 2 12 5 5 27 Female 1 10 20 2 9 42 TOTAL 4 12 32 7 14 69
The first thought is to use a GROUP BY and write a simple query, thus:
SELECT sex, race, COUNT(*) FROM Personnel GROUP BY sex, race;
This approach works fine for two variables and would produce a table that could be sent to a report writer program to give a final version. But where are your column and row totals? This means you also need to write these two queries:
SELECT race, COUNT(*) FROM Personnel GROUP BY race; SELECT sex, COUNT(*) FROM Personnel GROUP BY sex;
However, what I wanted was a table with a row for males and a row for females, with columns for each of the racial groups, just as I drew it.
Excerpted from JOE CELKO'S Analytics and OLAP in SQL by Joe Celko Copyright © 2006 by Elsevier Inc. . Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.