Read an Excerpt
Chapter 1:INTRODUCTION TO MYSQL
Topics in this Chapter
- How People Use Databases
- Why Databases are Better than Files
- Why MySQL
- MySQL History
In this first chapter, I will introduce you to the database concept and how MySQL implements this concept. You will learn about how MySQL attempts to solve common data management problems. You will also learn why MySQL in particular is a good choice from among many database servers.
As this is an overview, it will cover a lot of territory, but not get too in-depth. The following chapters will explain these topics in detail. If you're a database guru, most of this will be review. You can safely skip this chapter if you wish. If you're not sure what a database is exactly, this chapter will make you excited to learn more.
Reading through this book will be like visiting a museum in a skyscraper. You will start at the bottom and climb short stairs to each successive floor. The concepts will build upon each other until you reach high enough to understand the whole. This chapter is an exception, however. Imagine, if you will, that this chapter is a helicopter ride delivering you to the entrance of the museum. You will start from a high elevation and you'll speed by the floors as you make your descent. The people on the "replication" floor might wave to you as you go by, but you'll have to climb several floors to find out exactly why they're so excited.
How People Use Databases
You've heard people speak of databases. Our lives are sufficiently computerized as to allow conversation like, "I will now look up your record in the database." In this way, a database seems like a big hole where the phone company dumps all customer information, and where they sometimes have the information about you among the many piles.
Well, if such a place existed, technically it would be a database, a place where data is headquartered. Except, when we speak of databases, we're including the idea that the data is easy to find. Furthermore, we're talking about computerized systems.
Computerized Filing Systems
We introduce computers into our lives because they can be programmed to automate the boring, repetitive tasks we dread or allow us to do things that we would never be able to do without their computational speed or data storage capacity. (Well, unless you're like me, and computers are an end to themselves.) Putting information on paper and organizing the paper into folders and filing cabinets is a refined process, but it was nice for people when the process began to transform into moving electronic documents in folders on a hard drive.
Databases take things past just ordering files; they introduce a system for indexing the information so you don't need to flip past half the folders before finding the one you need, much like the card catalog in the library.
Not all databases are created equal, but traditionally they follow the idea that information should be organized into records. Each record is a fixed set of fields. The records are put into a table, and the database may have several tables.
Let's look at the example of a used car dealership. Ken the Car Dealer owns more than a hundred cars at any time. He can't remember everything he has in stock if someone asks, so he decides to organize the information for use with a database.
His database contains a table of information specific to each automobile, including the make, model, and year. He also keeps track of his wholesale cost and any repairs he's made. Each automobile in the United States has a vehicle identification number (VIN) that's unique. Ken adds this field to his car table so that he can distinguish cars that are otherwise identical.
Database Management Systems
Ken has a simple design for his database, but he needs a database management system (DBMS) to host his database. A DBMS is a software system that takes care of all the aspects of data management. It offers ways to create tables, insert data, find data, and even delete data.
Luckily, Ken chooses MySQL, an excellent solution because it's fast, reliable, and inexpensive. Ken's margins are small, so he can't afford to spend a lot on an enterprise solution, but he also can't afford to have the database crash. So, he picks an open-source package in which he can have higher confidence. Additionally, he noticed the abundance of free support available on the MySQL mailing lists. If he needs guaranteed, professional support, he has the option of purchasing support from MySQL AB.
MySQL takes care of storing records and tables on the hard drive efficiently. Ken doesn't need to worry much about how it's done. He only needs to issue the right commands.
MySQL, like many databases, operates in a client/server relationship. This concept describes a network architecture where computers on a network play the role of client or server. Servers are typically more powerful and are intended to provide services to a group of clients. Processing power and data are concentrated in the servers. Clients are lightweight, and they are the interface for the user to get to the resources on a server. shows information flowing between Ken's computer and the server's hard drive.
Ken runs the MySQL client program, which is a command-line utility. It connects to the server over the network. When Ken issues commands, the server attempts to follow them, most of the time reading and writing to the local hard disk.
Ken enjoys using command-line utilities. For example, he often uses the vi editor. However, he also runs MySQLGUI, one of many graphical clients for interacting with and administrating MySQL servers. Chapter 3, "Interacting with MySQL," discusses your client options.
This separation of labor allows Ken a lot of flexibility. Right now he's running both the client and the server on his desktop machine, but he could easily put the server on its own machine later.
The flow of data.
Speaking the Language
MySQL speaks a language called structured query language (SQL). Some people like to pronounce it "sequel." Others, like me, refer to it by the letters, "esk-cue-ell." Most people understand either name, but the MySQL developers seem to like the latter. They've declared their product is pronounced "my-esk-cue-ell." Having identified the fields he wants in his car table, Ken must use SQL to create the table in his MySQL database. He starts up the MySQL shell, which is a command-line utility, and he issues the command in Listing 1.1.
Listing 1.1 Creating a Table
CREATE TABLE car ( VIN VARCHAR(17) NOT NULL, Make VARCHAR(16) NOT NULL, Model VARCHAR(16) NOT NULL, ModelYear INT(4) NOT NULL, WholesalePrice FLOAT(6,2) NOT NULL, Color VARCHAR(8) NOT NULL, Mileage INT(11) NOT NULL, Comments TEXT, PRIMARY KEY(VIN) );
This command represents more detail than we've discussed before, so let's take a closer look. It's clear that this is a command for creating a table, and the name of that table is car. There's a list of formatted lines that describe the fields in the table inside the parentheses that follow.
The first field is named VIN and is of type VARCHAR, which is short for variable-length character. It can hold a value up to 17 characters, which happens to be the length of a VIN. The line ends with NOT NULL, which tells MySQL that this column has to hold some value. MySQL uses NULLto represent the absence of value, which is different from an empty string. Chapter 6, "Structured Query Language," explores this concept.
There are other fields that have different types of data. The Whole-salePrice field holds a FLOAT, which is short for floating-point number. It can have up to six digits before the decimal point and two after. The Model-Year field is a four-digit integer.
The last entry in the list of fields is not a field definition, but the definition of an index. It designates the VIN field as the primary key for the table. As discussed earlier, Ken expects the VIN to be a unique identifier for each vehicle. Making the field the primary key enforces this condition. Inserting two vehicles with the same VIN is an error and MySQL will not allow it.
In the database vernacular, we refer to commands as statements or queries. While any command sent to the database server can be called a statement, queries are statements that return information. Queries are questions posed to the database, sort of like playing Old Maid. If you want to find a record in the database, you might ask, "Do you have any records where the color is white?" If the database has any, it will return them.
The query in Listing 1.2 is a select statement. It selects a number of records that match its criterion: that the color field is white. It doesn't return everything in the record, however. It returns four fields: VIN, Make, Model, and ModelYear.
Listing 1.2 Searching for White Cars
SELECT VIN, Make, Model, ModelYear FROM car WHERE Color='White';
Notice that the select statement almost reads like an English sentence. SQL is a fourth-generation language (4GL). For the record, first-generation languages are native machine code that the CPU understands. Second-generation languages are assemblers. Third-generation languages are those high-level languages in which most programming is done; examples are C and PHP. So, a fourth-generation language is a step closer to being like human language.
Ken can execute the query in Listing 1.2, or any number of variations, but the result will be the same. No records will be found because he hasn't inserted any records yet! So, he must issue one or more insert statements.
Officially, an insert statement is not a query. It puts information into the database rather than retrieving it. In casual conversation, you might hear people refer to an insert query, however. Following the 4GL credo, the insert statement reads like a sentence. The query in Listing 1.3 inserts a single record into the car table. The order of the values matches the order of columns in the table.
Listing 1.3 Inserting a Record
INSERT INTO car VALUES ( '12345678901234567', 'Plymouth', 'Roadrunner', 1969, 5500.00, 'Blue', 148123, 'Unrestored' );
Ken proceeds to issue an insert statement for each of the cars on this lot. Afterward, he can search for cars by putting conditions in select statements. When a customer asks him if any Fords on the lot are less than $10,000, he makes the query shown in Listing 1.4.
Listing 1.4 Searching for Cheap Fords
SELECT * FROM car WHERE Make = 'Ford' AND WholesalePrice < 9000.00;
The asterisk (*) causes all fields to be returned. Ken figures $1,000 is the minimum profit he wants to make on any car and uses an appropriate figure for the wholesale price.
While trying to find Fords, Ken notices an obvious mistake in one record. One of the cars has "White" for its color. To fix this problem, Ken must issue an update query. Ken notes the VIN for this car and forms the update statement in Listing 1.5.
Listing 1.5 Updating Car Color
UPDATE car SET Color = 'White' WHERE VIN = '10203040506070809';
The update statement looks a lot like the other SQL we've seen so far. Like the select statement, it uses the where keyword to narrow the effect.
At the end of the week, Ken has sold two cars. So that they don't show up in his searches, Ken decides to delete them from the car table. He issues the delete statement in Listing 1.6.
Listing 1.6 Deleting Records
DELETE FROM car WHERE VIN IN ('12345678901234567', '10203040506070809');
The query in Listing 1.6 introduces the idea of a set. This is shorthand for a couple of identity statements. Ken is instructing the database to delete records from the car table where the VIN is either of the two numbers provided.
Ken discovered a problem inherent in his color field. It's just a free-form name. He can type the name of a color that doesn't exist, or he can easily make spelling or typing mistakes and insert bad data. The database can help him be more organized and avoid errors. The database can facilitate treating the fields as abstract pointers to other tables.
Ken reasons that instead of typing the same color names repeatedly, he should assign numbers to each color. A database table is the natural place to put this information, so he creates a table with two fields: a color code and a color name. Listing 1.7 demonstrates the SQL for creating the table.
Listing 1.7 Creating the Color Table
CREATE TABLE car_color ( ColorCode INT(6) NOT NULL AUTO_INCREMENT, Name VARCHAR(16), PRIMARY KEY(ColorCode) );
Ken wants each color to have its own code, so he makes the code an integer, and he also makes it the primary key. Ken doesn't want to be responsible for assigning these numerical codes to colors, so he uses the AUTO_ INCREMENT keyword that's part of MySQL. It tells MySQL to assign integers in sequence to new records.
Ken needs to start off his car_color table, so he makes a query on his car table to find the full set of colors with no duplicates. Ken issues the query in Listing 1.8.
Listing 1.8 Selecting Distinct Values
SELECT DISTINCT Color FROM car;
The distinct keyword tells MySQL to remove duplicate values in the result set. Ken could type a series of insert statements to insert each of these colors, but there are two reasons to find a better way. First, it's an annoying task, one that the database ought to help solve. And it also introduces many chances for a color name to be mistyped.
Ken decides to pair his select statement with an insert statement. He wants MySQL to insert the results of his query directly into the new table. The statement in Listing 1.9 inserts all the colors.
Listing 1.9 Inserting from a Select Statement
INSERT INTO car_color (Name) SELECT DISTINCT Color FROM car;
This statement assembles a set of data to insert into the car_color table from a select statement. The insert statement specifies that Name field will be the only field provided. When fields are left out in this way, MySQL substitutes NULL, a special value meaning nothing. The field being left out is the Code field, which we specified as not accepting null values; however, we also specified it as AUTO_INCREMENT. An auto-incrementing field will intercept null values and substitute a unique integer.
To complete the abstraction, Ken must alter or rebuild the car table. Ideally, Ken would return to the design phase, but in this case he chooses expedience. He uses another SQL statement to change the Color column to an integer, shown in Listing 1.10.
Listing 1.10 Altering the Car Table
ALTER TABLE car CHANGE Color ColorCode INT(6) NOT NULL;
Unfortunately for Ken, expedience means he just destroyed all the color information in the car table. MySQL will do its best to convert from the old field type to the new, but it has no idea that the colors are now in another table. The color names are converted to integers, but they all end up being zero.
Ken could have made a backup table before altering the original. Somehow, he might have used the data in the color table to look up color codes. However, manipulating the data this way isn't easy to imagine when you're first learning to use databases. Ken just issues an update for each record. When MySQL 4.0 becomes available, Ken will be able to update multiple tables with a single statement.
Now that all the cars have proper codes in their Color fields, Ken can refer to colors in the car table by their codes. White turned out to be code 1, and Midnight Blue is 13. Ken doesn't want to memorize all the color codes; he wants to see the color names when he makes searches on the car table.
The solution is to make a join between the two tables. This is a special type of select statement where every record in one table is matched to the records in another. There are several types of joins, but the most useful is the left join shown in Listing 1.11. It's results are shown in .
Listing 1.11 Left Join
SELECT car.VIN, car_color.Name FROM car LEFT JOIN car_color ON car.ColorCode = car_color.ColorCode WHERE car.Make = 'Plymouth';
The left join syntax is a bit harder to read than the other queries we've looked at so far. Instead of just specifying a table with the from keyword, we're placing an abstract table created by joining the car table to the car_colortable. Each record in the car table is matched up to a record in the car_colortable by matching the color codes.
Result of left join.
Notice that there isn't a one-to-one relationship between cars and colors. Said another way, there are bound to be several cars of any color. There are principles behind why we don't run out of matches when performing the join, which I discuss in later chapters.
The SQL in Listing 1.11 also introduces a complication involved with querying multiple tables. When referring to fields in a join query, you must avoid ambiguous names. Both tables contain a field named ColorCode. To refer to fields precisely, the query uses a dot-notation. The table name is appended to the field name, separated by a period. In the list of fields to display, the query uses dot-notation as well.
There's Even More
Ken the car dealer has only touched the surface of the power of relational databases, but we gained an overview of how they are used in everyday situations. The remaining chapters examine database concepts in detail, but before that, we owe it to ourselves to find convincing evidence that a database is the best solution.
Why Databases Are Better than Files
Most computers use disk drives for permanent storage. And most operating systems use the concept of files, which are long strings of data written to a disk. It's natural to store information in files. They seem to be an analogue to those sheets of paper filling folders and filing cabinets in most offices.
A Historical View of Data Management
When data sets are small and little or no reporting is required, a paper-based filing system is adequate. Imagine a filing cabinet with three drawers containing 10 folders each. The folders are labeled according to the type of documents inside. You know that to find electric bills for the past year, you look in the first drawer for the Electric Bills folder.
If you need to find the bill from one year ago, you can flip through everything in the folder. If you want to plot monthly costs for the past year, you'll have to find the last 12 statements. You can copy the charges onto a list and then draw a line graph by hand.
This process is relatively simple, but as you begin inventing more complex reports or repeating the processes, you begin to dream of automating the process with a computer. Naturally, the inclination is to duplicate the process in a programming language. The paper files become electronic files on the hard disk, and the process is written in a 3GL like C or Perl.
The files are grouped into directories according to some logic. And the contents of the files are structured to allow your program to read them quickly. For each type of file, you must write a set of routines to manipulate the files, including reading, writing, and updating.
Each report requires a custom-crafted routine for assembling the relevant data and presenting it. A report that compares trends in electricity bills versus natural gas bills probably requires some special code for matching units.
All this work adds up to a high cost. It requires lots of programming time. For your own home, you probably leave this information in paper form. A business, however, has a great incentive to computerize the process. The more information a business has, the better it can make decisions that affect its profitability.
Business needs are likely to drive the development of increasingly more complex reports. It's a competitive advantage to know more about your business than your competitors know about theirs. However, the high cost of developing new reports can prohibit their development. Furthermore, if the delay between recognizing need and producing a report is long enough, the report may no longer have value by the time it's ready.
Databases were developed to address the growing problem of creating custom routines to manipulate data and produce reports.
Databases Provide Uniform Interfaces
The first motivation for using a database is the freedom of having a uniform interface. You don't need to reinvent and custom-craft routines for manipulating the data. A central service provides a narrow interface to the data.
Databases treat data in a uniform manner. Information about how the data are structured and how they relate to each other are stored as metadata. This allows the system to treat the data as a commodity and use a generalized set of code.
Flat files require a 3GL, which require professionals with highly specialized skills. Flat files do not have a uniform interface. Each routine likely treats each file according to its unique attributes.
One consequence of having customized interfaces for each file is that a change to data structure causes cascading changes throughout the system. Imagine a field changed from a five-digit ZIP code to a nine-digit ZIP code. The field may have been designed to accept five characters, but now must be changed to accept minimally nine digits, perhaps 10 if the dash is included.
The existing data must be translated into the new format using a custom program. And the interface must be changed to handle the longer field. The custom code encourages a tight coupling where the application code assumes a certain form for the data. Any program that uses the file eventually will need to be changed to adapt to the changes in the data.
Databases afford a more flexible interface to the data. They allow the application to treat data from a logical perspective and ignore its physical storage. A change to the underlying data format may not require any changes in the application due to a narrow, uniform interface provided by the database.
Furthermore, modern databases offer a special query language for building ad hoc reports. The query language makes it easier for nonprogrammers to formulate instructions for building reports.
Databases Increase Performance
Databases improve performance by concentrating data manipulation into a single, cohesive unit that can be optimized. Databases take the responsibility of managing data in the most efficient manner. The details of how data is handled on the physical layer are hidden from the user, who sees the data from a logical perspective.
By unifying the routines for managing data, improvements to the system affect all data within the system at once. Compare this to improving one routine for one flat file.
Another source of increased performance is the allowance for multiple users. Databases present a single interface, and they are in control of all activity. They can accept multiple requests for data management simultaneously. Compare this to flat files, which can be opened by one process at a time.
A consequence of having a single point of control over data is the integration of data security. A DBMS can enforce certain rights to identified users. It can also take steps to protect data using backups without interrupting users' access.
Databases Encourage Integrity
Databases make it easy to carefully structure data to avoid duplication. Specifically, joins reduce the cost of abstracting data. Rather than repeat information about an entity, you can refer to that entity with a unique key and keep the information in one table.
Elimination of duplication helps avoid conflicting data that results from an incomplete update. If the home address for a customer is stored in two different places, it could be changed in one place and not the other. A relational database allows you to put the address in a single table and reference it in different tables as necessary.
A DBMS can also help enforce integrity based on the meta-data defined when the database is created. When a field is defined as being a date, it will accept nothing but a date. Physically, an integer may fit into a date field in a flat file, but a DBMS will reject data it can't interpret appropriately. In many cases, MySQL attempts to reformat values for the appropriate context, but sometimes it must fall back on a default value. For example, MySQL interprets "apple" as 0 when it expects an integer.
A DBMS can also enforce rules about the linking tables. You can define a relationship between a field in one table that matches a field in another. The DBMS can enforce that the range of valid values for the first table defines the allowable values in the second.
Databases Are Not a Cure-All
Despite all the great things about databases, they do not solve all problems with data management. Furthermore, like all tools, they must be used properly to be useful. Nothing stops you from duplicating data in several placesnothing except a careful design.
The variance between specific database implementations offers a respectable challenge. Although there are international standards for what features should appear in SQL, few DBMSs resist the temptation to extend it with their own features. The result is that it can be very difficult to switch from one DBMS to another.
Despite this difficulty, Chapter 28, "Transitioning," can guide you through the process of transitioning to another database server. If you anticipate changing servers, you can avoid the use of features unique to MySQL. This significantly aids the porting process later.
This book is about MySQL, but on the way to explaining the use of MySQL, it's necessary to learn about databases and SQL. And why not Oracle or Post-greSQL? These are valid alternatives to MySQL, and there are uses for any well-made tool. You may find MySQL particularly appealing because it's easy to learn. It's also remarkably flexible to various contexts. Where other database servers enforce one model, MySQL offers choices.
The most important aspect of MySQL is that it is a real database server. Later in this book we'll look at the characteristics of relational databases, and we examine how MySQL fills these requirements. MySQL implements most of the basic principles of relational databases, and it offers SQL for making queries. It provides all the features necessary for learning database concepts.
MySQL is fast. The mysql.com Web site displays the results of tests comparing MySQL to other relational databases. MySQL is genuinely faster than most database servers. Increased performance is of universal value, of course.
One reason MySQL is so fast is because of its elegant design. It reflects a single, practical development toward the goal of being nothing more than a DBMS. It doesn't have extra features hanging off of it like ornaments on a Christmas tree. MySQL's features reflect careful planning that maximize ease of use and flexibility.
The MySQL development team strives to include all features described in the ANSI standard. For users who don't need every feature, MySQL provides the option of leaving them out at compile time. Again, this reflects the dedication of letting you make choices.
The major force that keeps MySQL so focused is the nature of its development process. MySQL is open source and it benefits from strong leadership. You may review every line of code, and you may fix errors. In fact, many people do. But MySQL's development team knows where the project is going, so they keep new features in the spirit of the project. A small team of developers approves all code included in MySQL. This allows MySQL to maintain a high level of quality.
Being open source, MySQL has a very low cost of both acquisition and ownership. MySQL can be downloaded from several spots on the Web, and it's included in most Linux distributions. Once you're running MySQL, free support is plentiful on the mailing lists. MySQL AB employees monitor the firstname.lastname@example.org mailing list and resolve problems at no cost. In addition, MySQL AB supplies guaranteed, personal support for a modest fee.
Furthermore, MySQL is reliable. Having the source code available for review facilitates users' finding and fixing bugs as they appear.
Another consequence of being open source, MySQL is available for many platforms. It runs on most UNIX operation systems, Linux, Windows, and even less-popular operating systems like IBM OS/2.
Michael "Monty" Widenius wanted to add SQL as an interface on top of his legacy database that he had been developing for 15 years at TCX DataKon-sult AB. He considered using open-source relational database mSQL as an interface on top of his database code, but he couldn't interest author David J. Hughes. Monty began work on his own RDMS using a client interface modeled after mSQL's API in order to port existing tools written for mSQL. It was never Monty's intention to implement SQL completely, but after sharing the source code, the response was overwhelming.
Monty completed the first version of MySQL, an interface to his legacy database, in May 1995. He wrote it in three months. From that point, MySQL grew to be the most popular database used on the Web.
TCX DataKonsult AB was later renamed MySQL AB, and today Monty is chief technical officer. This Swedish company dedicates itself to supporting MySQL and its development. MySQL AB has been profitable since inception by selling commercial support, consulting services, and licenses for an embedded version of MySQL. It's a great example of the viability of the open-source software model.
In June 2000, MySQL became available under the GNU Public License (GPL), a popular open-source license. It ensures that anyone can share and improve MySQL without paying licensing fees.