Joe Celko's Thinking in Sets: Auxiliary, Temporal, and Virtual Tables in SQL

by Joe Celko

Overview

Perfectly intelligent programmers often struggle when forced to work with SQL. Why? Joe Celko believes the problem lies with their procedural programming mindset, which keeps them from taking full advantage of the power of declarative languages. The result is overly complex and inefficient code, not to mention lost productivity. This book will change the way you think about the problems you solve with SQL programs. Focusing on three key table-based techniques, Celko reveals their power through detailed examples and clear explanations. As you master these techniques, you'll find you are able to conceptualize problems as rooted in sets and solvable through declarative programming. Before long, you'll be coding more quickly, writing more efficient code, and applying the full power of SQL.

- Filled with the insights of one of the world's leading SQL authorities, noted for his knowledge and his ability to teach what he knows
- Focuses on auxiliary tables (for computing functions and other values by joins), temporal tables (for temporal queries, historical data, and audit information), and virtual tables (for improved performance)
- Presents clear guidance for selecting and correctly applying the right table technique

Product Details

ISBN-13: 9780080557526
Publisher: Morgan Kaufmann Publishers
Publication date: 01/22/2008
Series: The Morgan Kaufmann Series in Data Management Systems
Sold by: Barnes & Noble
Format: eBook
Pages: 384
File size: 2 MB

About the Author

Joe Celko served 10 years on the ANSI/ISO SQL Standards Committee and contributed to the SQL-89 and SQL-92 standards. Mr. Celko is the author of a series of books on SQL and RDBMS for Elsevier/MKP. He is an independent consultant based in Austin, Texas. He has written over 1,200 columns in the computer trade and academic press, mostly dealing with data and databases.

Read an Excerpt

JOE CELKO'S THINKING IN SETS

Auxiliary, Temporal, and Virtual Tables in SQL
By Joe Celko

MORGAN KAUFMANN

Copyright © 2008 Elsevier Inc.
All rights reserved.

ISBN: 978-0-08-055752-6


Chapter One

SQL Is Declarative, Not Procedural

In the preface I told a short story about FORTRAN programmers who could only solve problems using loops and a LISP programmer who could only solve problems recursively. This is not uncommon, because we love the tools we know. Let me tell a joke instead of a story: A mathematician, a physicist, and a database programmer were all given a rubber ball and told to find its volume.

The mathematician carefully measured the diameter and either evaluated the volume-of-a-sphere formula or used a triple integral if the ball was not perfectly round.
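
For reference, the two computations the mathematician chooses between look like this (a sketch; r is half the measured diameter, and the integral is just the standard volume integral in spherical coordinates over the ball B):

$$V = \frac{4}{3}\pi r^3 \qquad \text{or} \qquad V = \iiint_B \rho^2 \sin\phi \; d\rho \, d\phi \, d\theta$$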

The physicist filled a beaker with water, put the ball in the water, and measured the total displacement. He does not care about the details of the shape of the ball.

The database programmer looked up the model and serial numbers in his rubber ball manufacturer's on-line database. He does not care about the actual ball. But he has information about the tolerances to which it was made, the expected shape and size, and a bunch of other things that apply to the entire rubber ball production process.

The moral of the story is: The mathematician knows how to compute. The physicist knows how to measure. The database guy knows how to look up data. Each person grabs his tools to solve the problem.

Now change the problem to an inventory of thousands of rubber balls. The mathematician and the physicist are stuck with a lot of manual labor. The database guy does a few downloads, and he can back his answers in court with rubber ball industry standards (assuming that there are such things) and detailed documentation.

1.1 Different Programming Models

Perfecting oneself is as much unlearning as it is learning. —Edsger Dijkstra

There are many models of programming. Procedural programming languages use a sequence of procedural steps guided by flow-of-control statements (WHILE-DO, IF-THEN-ELSE, and BEGIN-END) that change the input data into output data. This was the traditional view of programming, and it is often called the von Neumann model, after John von Neumann, the mathematician who was responsible for it. The same source code runs through the same compiler and generates the same executable module every time. The same program will work exactly the same way every time it is invoked. The key words in this model are predictable and deterministic. It is also subject to some mathematical analysis, because it is deterministic.
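
To make the contrast with SQL concrete, here is a small sketch (the Personnel table, its salary_amt column, and the procedure name are invented for illustration): the procedural version marches over the rows one at a time with a cursor and an explicit loop, while the declarative version only states the result that is wanted and leaves the access path to the optimizer.

-- Procedural mindset: visit each row in a loop and accumulate a total
-- (an SQL/PSM-style sketch; Personnel and salary_amt are invented names).
CREATE PROCEDURE Total_Salaries (OUT total DECIMAL(12,2))
BEGIN
  DECLARE emp_salary DECIMAL(12,2);
  DECLARE done INTEGER DEFAULT 0;
  DECLARE Salary_Cursor CURSOR FOR SELECT salary_amt FROM Personnel;
  DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = 1;
  SET total = 0.00;
  OPEN Salary_Cursor;
  WHILE done = 0 DO
    FETCH Salary_Cursor INTO emp_salary;
    IF done = 0 THEN SET total = total + emp_salary; END IF;
  END WHILE;
  CLOSE Salary_Cursor;
END;

-- Declarative mindset: state what you want, not how to get it;
-- the engine picks the access path.
SELECT SUM(salary_amt) AS total_salary
  FROM Personnel;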

There are some variations on the theme. Some languages use different flow control statements. FORTRAN and COBOL allocated all the storage for the data at the start of the program. Later, the Algol family of languages did dynamic storage allocation based on the scope of the data within a block-structured language.

Edsger Dijkstra (see his archives at www.cs.utexas.edu/users/EWD/) came up with a language that was nondeterministic. Statements, called guarded commands, have a control that either blocks or allows the statement to be executed, but there is no particular order of execution among the open statements. This model was never implemented in a commercial product, but it demonstrated that something we had thought was necessary for programming (determinism) could be dropped.

Functional programming languages are based on solving problems as a series of nested function calls. The concept of higher-order functions that change one function into another is important in these languages. The derivative and integral transforms are mathematical examples of such higher-order functions. One of the goals of such languages is to avoid side effects in programs so that they can be optimized algebraically. In particular, once two expressions are known to be equal (in some sense of equality), they can be substituted for each other without affecting the result of the computation.

APL is the most successful functional programming language and had a fad period as a teaching language when Ken Iverson wrote his book A Programming Language in 1962. IBM produced special keyboards that included the obscure mathematical symbols used in APL for their desktop machines. Most of the functional languages never made it out of academia, but some survive in commercial applications today. Erlang is used for concurrent applications; R is a statistical language; Mathematica is a popular symbolic mathematics product; and Kx Systems uses the K language for large-volume financial analysis. More recently, the ML and Haskell programming languages have become popular among Linux and UNIX programmers.

Here we dropped another concept that had been regarded as fundamental: There is no flow of control in these languages.

Constraint or constraint logic programming languages are a series of constraints on a problem domain. As you add more constraints, the system figures out which answers are possible and which are not. The most popular such language is PROLOG, which also had an academic fad many years ago when Borland Software (www.borland.com) made a cheap student version available. The website On-Line Guide to Constraint Programming by Roman Bartak is a good place to start if you are interested in this topic (http://kti.ms.mff.cuni.cz/~bartak/constraints/index.html).

Here we dropped the concept of an algorithm altogether and just provided a problem specification.

Object-oriented (OO) programming is based on the ideas of objects that have both data and behavior in the same module of code. The programming model is a collection of independent cooperating objects instead of a single program invoking functions. An object is capable of receiving messages, processing data, and sending messages to other objects.

The idea is that each object can be maintained and written independently of any particular application and dropped into place where it is needed. Imagine a community of people who do particular jobs. They receive orders from their customers, process them, and return a result.

Many years ago, the INCITS H2 Database Standards Committee (nee ANSI X3H2 Database Standards Committee) had a meeting in Rapid City, South Dakota. We had Mount Rushmore and Bjarne Stroustrup as special attractions. Mr. Stroustrup did his slide show with overhead transparencies (yes, this was before PowerPoint was ubiquitous!) about Bell Labs inventing C++ and OO programming, and we got to ask questions.

One of the questions was how we should put OO features into the working model of the next version of the SQL standard, which was known as SQL3 internally. His answer was that Bell Labs, with all their talent, had tried four different approaches to this problem and they came to the conclusion that it should not be done. OO was great for programming but deadly for data.

I have watched people try to force OO models into SQL, and it falls apart in about a year. Every typo becomes a new attribute or class, queries that would have been so easy in a relational model are now multitable monster outer joins, redundancy grows at an exponential rate, constraints are virtually impossible to write so you can kiss data integrity goodbye, and so forth.

With all these programming models, why should we not have different data models?

1.2 Different Data Models

Consider the humble punch card. Punch cards had been used in France to control textile looms since the early 1700s; the method was perfected by Joseph Marie Jacquard in 1801 with his Jacquard loom.

Flash forward to the year 1890, when a man named Herman Hollerith invented a punch card and tabulating machines for that year's United States Census. His census project was so successful that Mr. Hollerith left the government and started the Tabulating Machine Company in 1896. After a series of mergers and name changes, this company became IBM. You might have heard of it.

Up to the 1970s, the "IBM card" and related machinery were everywhere. The most common card was the IBM 5081, and that part number became the common term for it—even across vendors! The punch card was data processing back then.

The physical characteristics of the card determined how we stored and processed data for decades afterwards. The card was the size of an 1887 United States dollar bill (3.25 inches by 7.375 inches). The reason for that size was simple: when Hollerith worked on the Census, he could get drawers to store the decks of cards from the Department of the Treasury across the street.

The cards had a grid of 80 columns of 12 rows, which could accommodate holes. This was for physical reasons again. But once the 80-column convention was established, it stuck. The early video terminals that replaced the key punch machines used screens with 80 columns of text and 24 or 25 rows—that is, two punch cards high and possibly a line for error messages.

Magnetic tapes started replacing punch cards in the 1970s, but they also mimicked the 80-column convention, although there was no longer any need. Many of the early ANSI tape standards for header records are based on this convention. Legacy systems simply replaced card readers with magnetic tape units for obvious reasons, but new applications continued to be built to this standard, too.

The physical nature of the cards meant that data was written and read from left to right in sequential order. Likewise, the deck of cards was written and read from front to back in sequential order.

A magnetic tape file is also written and read in the same way, but with the added bonus that when you drop a tape on the floor, it does not get scrambled like a deck of cards. The downside of a tape over a deck of cards is that it cannot be rearranged manually on purpose either.

Card and tape files are pretty passive creatures and will take whatever an application program throws at them without much objection. Files are also independent of each other, simply because they are connected to one application program at a time and therefore have no idea what other files look like.

Early disk systems also mimicked this model—physically contiguous storage read in a sequential order, with meaning given to the data by the program reading it.

It was a while before disk systems realized that the read/write heads could be moved to any physical position on the disk. This gave us random access storage. We still have a contiguous storage concept within each field and each record, however.

The Relational Model was a big jump, because it divorced the physical and logical models of data. If you read the specifications for many of the early programming languages, they describe physically contiguous data and storage methods. SQL describes only the behavior of the data without any reference to physical storage methods.

1.2.1 Columns Are Not Fields

A field within a record is defined by the application program that reads it. A column in a row in a table is defined independently of any application by the database schema in DDL. The data types in a column are always scalar and NULL-able.

This is a problem for files. If I mount the wrong tape on a tape drive, say a COBOL file, and read it with a FORTRAN program, it can produce meaningless output. The program simply counts the number of bytes from the start of the tape and slices off so many characters into each field from left to right.

The order of the application program variables in the READ or INPUT statements is important, because the values are read into the program variables in that order. In SQL, columns are referenced only by their names. Yes, there are shorthands like the SELECT * clause and "INSERT INTO <table name>" statements that expand into a list of column names in the physical order in which the column names appear within their table declaration, but these are shorthands that resolve to named lists. This is a leftover from the early days of SQL, when we were doing our unlearning and still had a "record-oriented" mindset.
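
A minimal sketch of that expansion, with an invented Personnel table: both shorthands below resolve to the named column list (emp_id, emp_name, salary_amt) taken from the table declaration, so writing the names out yourself makes the statements independent of declaration order.

-- A hypothetical table, for illustration only.
CREATE TABLE Personnel
(emp_id INTEGER NOT NULL PRIMARY KEY,
 emp_name VARCHAR(35) NOT NULL,
 salary_amt DECIMAL(12,2) NOT NULL);

-- This shorthand...
SELECT * FROM Personnel;
-- ...resolves to the named list, in declaration order:
SELECT emp_id, emp_name, salary_amt FROM Personnel;

-- Likewise, this shorthand...
INSERT INTO Personnel
VALUES (1823, 'Hollerith', 1000.00);
-- ...resolves to an explicit column list:
INSERT INTO Personnel (emp_id, emp_name, salary_amt)
VALUES (1823, 'Hollerith', 1000.00);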

The use of NULLs in SQL is also unique to the language. Fields do not support a missing data marker as part of the field, record, or file itself. Nor do fields have constraints that can be added to them in the record, like the DEFAULT and CHECK() clauses in SQL.

Nor do fields have a data type. Fields have meaning and are defined by the program reading them, not in themselves. Thus, four columns on a punch card containing 1223 might be an integer in one program, a string in a second program, or read as four fields instead of one in a third program.
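
Here is a short sketch of the difference (Skill_Codes and its columns are hypothetical names): in SQL the data type, the missing-data marker, and the validation rules all live in the column's declaration, not in whichever program happens to read the bytes.

-- The column carries its own type, default, and validation rule.
CREATE TABLE Skill_Codes
(skill_code CHAR(4) NOT NULL
   CHECK (skill_code SIMILAR TO '[0-9][0-9][0-9][0-9]'),
 skill_level INTEGER DEFAULT 1 NOT NULL
   CHECK (skill_level BETWEEN 1 AND 5),
 remarks VARCHAR(100)); -- NULL-able, so a NULL marks missing data

-- Every program that reads skill_code sees a four-character string the
-- schema has already validated; no program can quietly reinterpret the
-- punch card's 1223 as an integer or as four separate fields.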

(Continues...)



Excerpted from JOE CELKO'S THINKING IN SETS by Joe Celko. Copyright © 2008 by Elsevier Inc. Excerpted by permission of MORGAN KAUFMANN. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.

Table of Contents

1 SQL Is Declarative, Not Procedural
  1.1 Different Programming Models
  1.2 Different Data Models
    1.2.1 Columns Are Not Fields
    1.2.2 Rows Are Not Records
    1.2.3 Tables Are Not Files
    1.2.4 Relational Keys Are Not Record Locators
    1.2.5 Kinds of Keys
    1.2.6 Desirable Properties of Relational Keys
    1.2.7 Unique But Not Invariant
  1.3 Tables as Entities
  1.4 Tables as Relationships
  1.5 Statements Are Not Procedures
  1.6 Molecular, Atomic, and Subatomic Data Elements
    1.6.1 Table Splitting
    1.6.2 Column Splitting
    1.6.3 Temporal Splitting
    1.6.4 Faking Non-1NF Data
    1.6.5 Molecular Data Elements
    1.6.6 Isomer Data Elements
    1.6.7 Validating a Molecule
2 Hardware, Data Volume, and Maintaining Databases
  2.1 Parallelism
  2.2 Cheap Main Storage
  2.3 Solid-State Disk
  2.4 Cheaper Secondary and Tertiary Storage
  2.5 The Data Changed
  2.6 The Mindset Has Not Changed
3 Data Access and Records
  3.1 Sequential Access
    3.1.1 Tape-Searching Algorithms
  3.2 Indexes
    3.2.1 Single-Table Indexes
    3.2.2 Multiple-Table Indexes
    3.2.3 Type of Indexes
  3.3 Hashing
    3.3.1 Digit Selection
    3.3.2 Division Hashing
    3.3.3 Multiplication Hashing
    3.3.4 Folding
    3.3.5 Table Lookups
    3.3.6 Collisions
  3.4 Bit Vector Indexes
  3.5 Parallel Access
  3.6 Row and Column Storage
    3.6.1 Row-Based Storage
    3.6.2 Column-Based Storage
  3.7 JOIN Algorithms
    3.7.1 Nested-Loop Join Algorithm
    3.7.2 Sort-Merge Join Method
    3.7.3 Hash Join Method
    3.7.4 Shin's Algorithm
4 Lookup Tables
  4.1 Data Element Names
  4.2 Multiparameter Lookup Tables
  4.3 Constants Table
  4.4 OTLT or MUCK Table Problems
  4.5 Definition of a Proper Table
5 Auxiliary Tables
  5.1 Sequence Table
    5.1.1 Creating a Sequence Table
    5.1.2 Sequence Constructor
    5.1.3 Replacing an Iterative Loop
  5.2 Permutations
    5.2.1 Permutations via Recursion
    5.2.2 Permutations via CROSS JOIN
  5.3 Functions
    5.3.1 Functions without a Simple Formula
  5.4 Encryption via Tables
  5.5 Random Numbers
  5.6 Interpolation
6 Views
  6.1 Mullins VIEW Usage Rules
    6.1.1 Efficient Access and Computations
    6.1.2 Column Renaming
    6.1.3 Proliferation Avoidance
    6.1.4 The VIEW Synchronization Rule
  6.2 Updatable and Read-Only VIEWs
  6.3 Types of VIEWs
    6.3.1 Single-Table Projection and Restriction
    6.3.2 Calculated Columns
    6.3.3 Translated Columns
    6.3.4 Grouped VIEWs
    6.3.5 UNIONed VIEWs
    6.3.6 JOINs in VIEWs
    6.3.7 Nested VIEWs
  6.4 Modeling Classes with Tables
    6.4.1 Class Hierarchies in SQL
    6.4.2 Subclasses via ASSERTIONs and TRIGGERs
  6.5 How VIEWs Are Handled in the Database System
    6.5.1 VIEW Column List
    6.5.2 VIEW Materialization
  6.6 In-Line Text Expansion
  6.7 WITH CHECK OPTION Clause
    6.7.1 WITH CHECK OPTION as CHECK() Clause
  6.8 Dropping VIEWs
  6.9 Outdated Uses for VIEWs
    6.9.1 Domain Support
    6.9.2 Table Expression VIEWs
    6.9.3 VIEWs for Table Level CHECK() Constraints
    6.9.4 One VIEW per Base Table
7 Virtual Tables
  7.1 Derived Tables
    7.1.1 Column Naming Rules
    7.1.2 Scoping Rules
    7.1.3 Exposed Table Names
    7.1.4 LATERAL() Clause
  7.2 Common Table Expressions
    7.2.1 Nonrecursive CTEs
    7.2.2 Recursive CTEs
  7.3 Temporary Tables
    7.3.1 ANSI/ISO Standards
    7.3.2 Vendors Models
  7.4 The Information Schema
    7.4.1 The INFORMATION_SCHEMA Declarations
    7.4.2 A Quick List of VIEWS and Their Purposes
    7.4.3 DOMAIN Declarations
    7.4.4 Definition Schema
    7.4.5 INFORMATION_SCHEMA Assertions
8 Complicated Functions via Tables
  8.1 Functions without a Simple Formula
    8.1.1 Encryption via Tables
  8.2 Check Digits via Tables
    8.2.1 Check Digits Defined
    8.2.2 Error Detection versus Error Correction
  8.3 Classes of Algorithms
    8.3.1 Weighted-Sum Algorithms
    8.3.2 Power-Sum Check Digits
    8.3.3 Luhn Algorithm
    8.3.4 Dihedral Five Check Digit
  8.4 Declarations, Not Functions, Not Procedures
  8.5 Data Mining for Auxiliary Tables
9 Temporal Tables
  9.1 The Nature of Time
    9.1.1 Durations, Not Chronons
    9.1.2 Granularity
  9.2 The ISO Half-Open Interval Model
    9.2.1 Use of NULL for "Eternity"
    9.2.2 Single Timestamp Tables
    9.2.3 Overlapping Intervals
  9.3 State Transition Tables
  9.4 Consolidating Intervals
    9.4.1 Cursors and Triggers
    9.4.

What People are Saying About This

From the Publisher

New advice for novice SQL programmers: think like a pro, think in sets!
