Read an Excerpt
Professional SQL Server 2008 Internals and Troubleshooting
By Christian Bolton Justin Langford Brent Ozar James Rowland-Jones Steven Wort
John Wiley & SonsCopyright © 2010 John Wiley & Sons, Ltd
All right reserved.
Chapter OneSQL Server Architecture
WHAT'S IN THIS CHAPTER
* Understanding database transactions and the ACID properties
* Architectural components used to fulfill a read request
* Architectural components used to fulfill an update request
* Database recovery and the transaction log
* Dirty pages, checkpoints, and the lazywriter
* Where the SQLOS fits in and why it's needed
A basic grasp of SQL Server's architecture is fundamental to intelligently approach trouble-shooting a problem, but selecting the important bits to learn about can be challenging, as SQL Server is such a complex piece of software.
This chapter distills the core architecture of SQL Server and puts the most important components into the context of executing a simple query to help you understand the fundamentals of the core engine.
You will learn how SQL Server deals with your network connection, unravels what you're asking it to do, decides how it will execute your request, and finally how data is retrieved and modified on your behalf.
You will also discover when the transaction log is used and how it's affected by the configured recovery model; what happens when acheckpoint occurs and how you can influence the frequency; and what the lazywriter does.
The chapter starts by defining a "transaction" and what the requirements are for a database system to reliably process them. You'll then look at the life cycle of a simple query that reads data, taking a walk through the components employed to return a result set, before looking at how the process differs when data needs to be modified.
Finally, you'll read about the components and terminology that support the recovery process in SQL Server, and the SQLOS "framework" introduced in SQL Server 2005 that consolidates a lot of the low-level functions required by many SQL Server components.
A transaction is a unit of work in a database that typically contains several commands that read from and write to the database. The most well-known feature of a transaction is that it must complete all of the commands in their entirety or none of them. This feature, called atomicity, is just one of four properties defined in the early days of database theory as requirements for a database transaction, collectively known as ACID properties.
The four required properties of a database transaction are atomicity, consistency, isolation, and durability.
Atomicity means that all the effects of the transaction must complete successfully or the changes are rolled back. A classic example of an atomic transaction is a withdrawal from an ATM machine; the machine must both dispense the cash and debit your bank account. Either of those actions completing independently would cause a problem for either you or the bank.
The consistency requirement ensures that the transaction cannot break the integrity rules of the database; it must leave the database in a consistent state. For example, your system might require that stock levels cannot be a negative value, a spare part cannot exist without a parent object, or the data in a sex field must be male or female. In order to be consistent, a transaction must not break any of the constraints or rules defined for the data.
Isolation refers to keeping the changes of incomplete transactions running at the same time separate from one another. Each transaction must be entirely self-contained, and changes it makes must not be readable by any other transaction, although SQL Server does allow you to control the degree of isolation in order to find a balance between business and performance requirements.
Once a transaction is committed, it must persist even if there is a system failure - that is, it must be durable. In SQL Server, the information needed to replay changes made in a transaction is written to the transaction log before the transaction is considered to be committed.
SQL Server Transactions
There are two types of transactions in SQL Server that are differentiated only by the way they are created: implicit and explicit.
Implicit transactions are used automatically by SQL Server to guarantee the ACID properties of single commands. For example, if you wrote an update statement that modified 10 rows, SQL Server would run it as an implicit transaction so that the ACID properties would apply, and all 10 rows would be updated or none of them would.
Explicit transactions are started by using the BEGIN TRANSACTION T-SQL command and are stopped by using the COMMIT TRANSACTION or ROLLBACK TRANSACTION commands.
Committing a transaction effectively means making the changes within the transaction permanent, whereas rolling back a transaction means undoing all the changes that were made within the transaction. Explicit transactions are used to group together changes to which you want to apply the ACID properties as a whole, which also enables you to roll back the changes at any point if your business logic determines that you should cancel the change.
THE LIFE CYCLE OF A QUERY
To introduce the high-level components of SQL Server's architecture, this section uses the example of a query's life cycle to put each component into context in order to foster your understanding and create a foundation for the rest of the book.
It looks at a basic SELECT query first in order to reduce the scope to that of a READ operation, and then introduces the additional processes involved for a query that performs an UPDATE operation. Finally, you'll read about the terminology and processes that SQL Server uses to implement recovery while optimizing performance.
Figure 1-1 shows the high-level components that are used within the chapter to illustrate the life cycle of a query.
The Relational and Storage Engines
As shown in Figure 1-1, SQL Server is split into two main engines: the Relational Engine and the Storage Engine.
The Relational Engine is also sometimes called the query processor because its primary function is query optimization and execution. It contains a Command Parser to check query syntax and prepare query trees, a Query Optimizer that is arguably the crown jewel of any database system, and a Query Executor responsible for execution.
The Storage Engine is responsible for managing all I/O to the data, and contains the Access Methods code, which handles I/O requests for rows, indexes, pages, allocations and row versions, and a Buffer Manager, which deals with SQL Server's main memory consumer, the buffer pool. It also contains a Transaction Manager, which handles the locking of data to maintain Isolation (ACID properties) and manages the transaction log.
The Buffer Pool
The other major component you need to know about before getting into the query life cycle is the buffer pool, which is the largest consumer of memory in SQL Server. The buffer pool contains all the different caches in SQL Server, including the plan cache and the data cache, which is covered as the sections follow the query through its life cycle.
The buffer pool is covered in detail in Chapter 2.
A Basic Select Query
The details of the query used in this example aren't important - it's a simple SELECT statement with no joins, so you're just issuing a basic read request. Start at the client, where the first component you touch is the SQL Server Network Interface (SNI).
SQL Server Network Interface
The SQL Server Network Interface (SNI) is a protocol layer that establishes the network connection between the client and the server. It consists of a set of APIs that are used by both the database engine and the SQL Server Native Client (SNAC). SNI replaces the net-libraries found in SQL Server 2000 and the Microsoft Data Access Components (MDAC), which are included with Windows.
SNI isn't configurable directly; you just need to configure a network protocol on the client and the server. SQL Server has support for the following protocols:
* Shared memory: Simple and fast, shared memory is the default protocol used to connect from a client running on the same computer as SQL Server. It can only be used locally, has no configurable properties, and is always tried first when connecting from the local machine. * TCP/IP: TCP/IP is the most commonly used access protocol for SQL Server. It enables you to connect to SQL Server by specifying an IP address and a port number. Typically, this happens automatically when you specify an instance to connect to. Your internal name resolution system resolves the hostname part of the instance name to an IP address, and either you connect to the default TCP port number 1433 for default instances or the SQL Browser service will find the right port for a named instance using UDP port 1434. * Named Pipes: TCP/IP and Named Pipes are comparable protocols in the architectures in which they can be used. Named Pipes was developed for local area networks (LANs) but it can be inefficient across slower networks such as wide area networks (WANs).
To use Named Pipes you first need to enable it in SQL Server Configuration Manager (if you'll be connecting remotely) and then create a SQL Server alias, which connects to the server using Named Pipes as the protocol. Named Pipes uses TCP port 445, so ensure that the port is open on any firewalls between the two computers, including the Windows Firewall. * VIA: Virtual Interface Adapter is a protocol that enables high-performance communications between two systems. It requires specialized hardware at both ends and a dedicated connection. Like Named Pipes, to use the VIA protocol you first need to enable it in SQL Server Configuration Manager and then create a SQL Server alias that connects to the server using VIA as the protocol. Regardless of the network protocol used, once the connection is established, SNI creates a secure connection to a TDS endpoint (described next) on the server, which is then used to send requests and receive data. For the purpose here of following a query through its life cycle, you're sending the SELECT statement and waiting to receive the result set.
TDS (Tabular Data Stream) Endpoints
TDS is a Microsoft-proprietary protocol originally designed by Sybase that is used to interact with a database server. Once a connection has been made using a network protocol such as TCP/IP, a link is established to the relevant TDS endpoint that then acts as the communication point between the client and the server.
There is one TDS endpoint for each network protocol and an additional one reserved for use by the dedicated administrator connection (DAC). Once connectivity is established, TDS messages are used to communicate between the client and the server.
The SELECT statement is sent to the SQL Server as a TDS message across a TCP/IP connection (TCP/IP is the default protocol).
When the protocol layer in SQL Server receives your TDS packet, it has to reverse the work of the SNI at the client and unwrap the packet to find out what request it contains. The protocol layer is also responsible for packaging up results and status messages to send back to the client as TDS messages.
Our SELECT statement is marked in the TDS packet as a message of type "SQL Command," so it's passed on to the next component, the Query Parser, to begin the path toward execution.
Figure 1-2 shows where our query has gone so far. At the client, the statement was wrapped in a TDS packet by the SQL Server Network Interface and sent to the protocol layer on the SQL Server where it was unwrapped, identified as a SQL Command, and the code sent to the Command Parser by the SNI.
The Command Parser's role is to handle T-SQL language events. It first checks the syntax and returns any errors back to the protocol layer to send to the client. If the syntax is valid, then the next step is to generate a query plan or find an existing plan. A query plan contains the details about how SQL Server is going to execute a piece of code. It is commonly referred to as an execution plan.
To check for a query plan, the Command Parser generates a hash of the T-SQL and checks it against the plan cache to determine whether a suitable plan already exists. The plan cache is an area in the buffer pool used to cache query plans. If it finds a match, then the plan is read from cache and passed on to the Query Executor for execution. (The following section explains what happens if it doesn't find a match.)
Creating execution plans can be time consuming and resource intensive, so it makes sense that if SQL Server has already found a good way to execute a piece of code that it should try to reuse it for subsequent requests.
The plan cache, part of SQL Server's buffer pool, is used to store execution plans in case they are needed later. You can read more about execution plans and plan cache in Chapters 2 and 5.
If no cached plan is found, then the Command Parser generates a query tree based on the T-SQL. A query tree is an internal structure whereby each node in the tree represents an operation in the query that needs to be performed. This tree is then passed to the Query Optimizer to process.
Our basic query didn't have an existing plan so a query tree was created and passed to the Query Optimizer.
Figure 1-3 shows the plan cache added to the diagram, which is checked by the Command Parser for an existing query plan. Also added is the query tree output from the Command Parser being passed to the optimizer because nothing was found in cache for our query.
The Optimizer is the most prized possession of the SQL Server team and one of the most complex and secretive parts of the product. Fortunately, it's only the low-level algorithms and source code that are so well protected (even within Microsoft), and research and observation can reveal how the Optimizer works.
It is what's known as a "cost-based" optimizer, which means that it evaluates multiple ways to execute a query and then picks the method that it deems will have the lowest cost to execute. This "method" of executing is implemented as a query plan and is the output from the optimizer.
Based on that description, you would be forgiven for thinking that the optimizer's job is to find the best query plan because that would seem like an obvious assumption. Its actual job, however, is to find a good plan in a reasonable amount of time, rather than the best plan. The optimizer's goal is most commonly described as finding the most efficient plan.
If the optimizer tried to find the "best" plan every time, it might take longer to find the plan than it would to just execute a slower plan (some built-in heuristics actually ensure that it never takes longer to find a good plan than it does to just find a plan and execute it).
As well as being cost based, the optimizer also performs multi-stage optimization, increasing the number of decisions available to find a good plan at each stage. When a good plan is found, optimization stops at that stage.
The first stage is known as pre-optimization, and queries drop out of the process at this stage when the statement is simple enough that the most efficient plan is obvious, obviating the need for additional costing. Basic queries with no joins are regarded as "simple," and plans produced as such have zero cost (because they haven't been costed) and are referred to as trivial plans.
Excerpted from Professional SQL Server 2008 Internals and Troubleshooting by Christian Bolton Justin Langford Brent Ozar James Rowland-Jones Steven Wort Copyright © 2010 by John Wiley & Sons, Ltd. Excerpted by permission.
All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.
Excerpts are provided by Dial-A-Book Inc. solely for the personal use of visitors to this web site.