Read an Excerpt
JOE CELKO'S SQL PUZZLES & ANSWERS
By Joe Celko
MORGAN KAUFMANN PUBLISHERSCopyright © 2007 Elsevier Inc.
All right reserved.
Chapter OnePUZZLE 1 FISCAL YEAR TABLES
Let's write some CREATE TABLE statements that are as complete as possible. This little exercise is important because SQL is a declarative language and you need to learn how to specify things in the database instead of in the code.
The table looks like this:
CREATE TABLE FiscalYearTable1 (fiscal_year INTEGER, start_date DATE, end_date DATE);
It stores date ranges for determining what fiscal year any given date belongs to. For example, the federal government runs its fiscal year from October 1 until the end of September. The scalar subquery you would use to do this table lookup is:
(SELECT F1.fiscal_year FROM FiscalYearTable1 AS F1 WHERE outside_date BETWEEN F1.start_date AND F1.end_date)
Your assignment is to add all the constraints you can think of to the table to guarantee that it contains only correct information.
While vendors all have different date and time functions, let's assume that all we have is the SQL-92 temporal arithmetic and the function EXTRACT ([YEAR | MONTH | DAY] FROM <date expression>), which returns an integer that represents a field within a date.
1. First things first; make all the columns NOT NULL since there is no good reason to allow them to be NULL.
2. Most SQL programmers immediately think in terms of adding a PRIMARY KEY, so you might add the constraint PRIMARY KEY (fiscal_year, start_date, end_date) because the fiscal year is really another name for the pair (start_date, end_date). This is not enough, because it would allow this sort of error:
(1995, '1994-10-01', '1995-09-30') (1996, '1995-10-01', '1996-08-30') <== error! (1997, '1996-10-01', '1997-09-30') (1998, '1997-10-01', '1997-09-30')
You could continue along the same lines and fix some problems by adding the constraints UNIQUE (fiscal_year), UNIQUE (start_date), and UNIQUE (end_date), since we do not want duplicate dates in any of those columns.
3. The constraint that almost everyone forgets to add because it is so obvious is:
CHECK (start_date < end_date) or CHECK (start_date <= end_date), as is appropriate.
4. A better way would be to use the constraint PRIMARY KEY (fiscal_year) as before, but then since the start and end dates are the same within each year, you could use constraints on those column declarations:
CREATE TABLE FiscalYearTable1 (fiscal_year INTEGER NOT NULL PRIMARY KEY, start_date DATE NOT NULL, CONSTRAINT valid_start_date CHECK ((EXTRACT (YEAR FROM start_date) = fiscal_year - 1) AND (EXTRACT (MONTH FROM start_date) = 10) AND CHECK (EXTRACT (DAY FROM start_date) = 01)), end_date DATE NOT NULL, CONSTRAINT valid_end_date CHECK ((EXTRACT (YEAR FROM end_date) = fiscal_year) AND (EXTRACT (MONTH FROM end_date) = 09) AND (EXTRACT (DAY FROM end_date) = 30)));
You could argue for making each predicate a separate constraint to give more detailed error messages. The predicates on the year components of the start_date and end_date columns also guarantee uniqueness because they are derived from the unique fiscal year.
5. Unfortunately, this method does not work for all companies. Many companies have an elaborate set of rules that involve taking into account the weeks, weekends, and weekdays involved.
They do this to arrive at exactly 360 days or 52 weeks in their accounting year. In fact, there is a fairly standard accounting practice of using a "4 weeks, 4 weeks, 5 weeks" quarter with some fudging at the end of the year; you can have a leftover week between 3 and 11 days. The answer is a FiscalMonth table along the same lines as this FiscalYears example.
A constraint that will work surprisingly well for such cases is:
CHECK ((end_date - start_date) = INTERVAL 359 DAYS)
where you adjust the number of days to fit your rules (i.e., 52 weeks * 7 days = 364 days). If the rules allow some variation in the size of the fiscal year, then replace the equality test with a BETWEEN predicate.
Now, true confession time. When I have to load such a table in a database, I get out my spreadsheet and build a table using the built-in temporal functions. Spreadsheets have much better temporal functions than databases, and there is a good chance that the accounting department already has the fiscal calendar in a spreadsheet.
PUZZLE 2 ABSENTEES
This problem was presented on the MS ACCESS forum on CompuServe by Jim Chupella. He wanted to create a database that tracks employee absentee rates. Here is the table you will use:
CREATE TABLE Absenteeism (emp_id INTEGER NOT NULL REFERENCES Personnel (emp_id), absent_date DATE NOT NULL, reason_code CHAR (40) NOT NULL REFERENCES ExcuseList (reason_code), severity_points INTEGER NOT NULL CHECK (severity_points BETWEEN 1 AND 4), PRIMARY KEY (emp_id, absent_date));
An employee ID number identifies each employee. The reason_code is a short text explanation for the absence (for example, "hit by beer truck," "bad hair day," and so on) that you pull from an ever-growing and imaginative list, and severity point is a point system that scores the penalty associated with the absence.
If an employee accrues 40 severity points within a one-year period, you automatically discharge that employee. If an employee is absent more than one day in a row, it is charged as a long-term illness, not as a typical absence. The employee does not receive severity points on the second, third, or later days, nor do those days count toward his or her total absenteeism.
Your job is to write SQL to enforce these two business rules, changing the schema if necessary.
Looking at the first rule on discharging personnel, the most common design error is to try to drop the second, third, and later days from the table. This approach messes up queries that count sick days, and makes chains of sick days very difficult to find.
The trick is to allow a severity score of zero, so you can track the long-term illness of an employee in the Absenteeism table. Simply change the severity point declaration to "CHECK (severity_points BETWEEN 0 AND 4)" so that you can give a zero to those absences that do not count. This is a trick newbies miss because storing a zero seems to be a waste of space, but zero is a number and the event is a fact that needs to be noted.
UPDATE Absenteeism SET severity_points= 0, reason_code = 'long term illness' WHERE EXISTS (SELECT * FROM Absenteeism AS A2 WHERE Absenteeism.emp_id = A2.emp_id AND Absenteeism.absent_date = (A2.absent_date INTERVAL 1 DAY));
When a new row is inserted, this updatewill look for another absence on the day before and change its severity point score and reason_code in accordance with your first rule.
The second rule for firing an employee requires that you know what his or her current point score is. You would write that query as follows:
SELECT emp_id, SUM(severity_points) FROM Absenteeism GROUP BY emp_id;
This is the basis for a grouped subquery in the DELETE statement you finally want. Personnel with less than 40 points will return a NULL, and the test will fail.
DELETE FROM Personnel WHERE emp_id = (SELECT A1.emp_id FROM Absenteeism AS A1 WHERE A1.emp_id = Personnel.emp_id GROUP BY A1.emp_id HAVING SUM(severity_points) >= 40);
The GROUP BY clause is not really needed in SQL-92, but some older SQL implementations will require it.
Bert Scalzo, a senior instructor for Oracle Corporation, pointed out that the puzzle solution had two flaws and room for performance improvements.
The flaws are quite simple. First, the subquery does not check for personnel accruing 40 or more severity points within a one-year period, as required. It requires the addition of a date range check in the WHERE clause:
DELETE FROM Personnel WHERE emp_id = (SELECT A1.emp_id FROM Absenteeism AS A1 WHERE A1.emp_id = Personnel.emp_id AND absent_date BETWEEN CURRENT_TIMESTAMP - INTERVAL 365 DAYS AND CURRENT_TIMESTAMP GROUP BY A1.emp_id HAVING SUM(severity_points) >= 40);
Second, this SQL code deletes only offending personnel and not their absences. The related Absenteeism row must be either explicitly or implicitly deleted as well. You could replicate the above deletion for the Absenteeism table. However, the best solution is to add a cascading deletion clause to the Absenteeism table declaration:
CREATE TABLE Absenteeism ( ... emp_id INTEGER NOT NULL REFERENCES Personnel(emp_id) ON DELETE CASCADE, ...);
The performance suggestions are based on some assumptions. If you can safely assume that the UPDATE is run regularly and people do not change their departments while they are absent, then you can improve the UPDATE command's subquery:
UPDATE Absenteeism AS A1 SET severity_points = 0, reason_code = 'long term illness' WHERE EXISTS (SELECT * FROM absenteeism as A2 WHERE A1.emp_id = A2.emp_id AND (A1.absent_date + INTERVAL 1 DAY) = A2.absent_date);
There is still a problem with long-term illnesses that span weeks. The current situation is that if you want to spend your weekends being sick, that is fine with the company. This is not a very nice place to work. If an employee reports in absent on Friday of week number 1, all of week number 2, and just Monday of week number 3, the UPDATE will catch only the five days from week number 2 as long-term illness. The Friday and Monday will show up as sick days with severity points. The subquery in the UPDATE requires additional changes to the missed-date chaining.
I would avoid problems with weekends by having a code for scheduled days off (weekends, holidays, vacation, and so forth) that carry a severity point of zero. A business that has people working weekend shifts would need such codes.
Excerpted from JOE CELKO'S SQL PUZZLES & ANSWERS by Joe Celko Copyright © 2007 by Elsevier Inc.. Excerpted by permission of MORGAN KAUFMANN PUBLISHERS. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.