TCP/IP Sockets in C: Practical Guide for Programmers, 2nd Edition is a quick and affordable way to gain the knowledge and skills needed to develop sophisticated and powerful web-based applications. The book's focused, tutorial-based approach enables the reader to master the tasks and techniques essential to virtually all client-server projects using sockets in C. This edition has been expanded to include new advancements such as support for IPv6 as well as detailed defensive programming strategies.
If you program using Java, be sure to check out this book’s companion, TCP/IP Sockets in Java: Practical Guide for Programmers, 2nd Edition.
Includes completely new and expanded sections that address the IPv6 network environment, defensive programming, and the select() system call, thereby allowing the reader to program in accordance with the most current standards for internetworking.
Streamlined and concise tutelage in conjunction with line-by-line code commentary allows readers to quickly program web-based applications without having to wade through unrelated and discursive networking tenets.
Grants the reader access to online source code, which can then be used to directly implement sockets programming procedures.
"Despite my having developed systems software with Sockets and C for 20+ years, I find myself still needing a book like this one. It covers all the subtleties and gotchas that one encounters when writing distributed applications in C with Sockets."--- Bobby Krupczak, The Krupczak Organization
Michael J. Donahoo teaches networking to undergraduate and graduate students at Baylor University, where he is an assistant professor. He received his Ph.D. in computer science from the Georgia Institute of Technology. His research interests are in large-scale information dissemination and management.
Kenneth L. Calvert is an associate professor at University of Kentucky, where he teaches and does research on the design and implementation of computer network protocols. He has been doing networking research since 1987, and teaching since 1991. He holds degrees from MIT, Stanford, and the University of Texas at Austin.
Today people use computers to make phone calls, watch TV, send instant messages to their friends, play games with other people, and buy almost anything you can think of, from songs to automobiles. The ability of programs to communicate over the Internet makes all this possible. It's hard to say how many individual computers are now reachable over the Internet, but we can safely say that the number is growing rapidly; it won't be long before it is in the billions. Moreover, new applications are being developed every day. With the push for ever-increasing bandwidth and access, the impact of the Internet will continue to grow for the foreseeable future.
How does a program communicate with another program over a network? The goal of this book is to start you on the road to understanding the answer to that question, in the context of the C programming language. For a long time, C was the language of choice for implementing network communication software. Indeed, the application programming interface (API) known as Sockets was first developed in C.
Before we delve into the details of sockets, however, it is worth taking a brief look at the big picture of networks and protocols to see where our code will fit in. Our goal here is not to teach you how networks and TCP/IP work—many fine texts are available for that purpose—but rather to introduce some basic concepts and terminology.
1.1 Networks, Packets, and Protocols
A computer network consists of machines interconnected by communication channels. We call these machines hosts and routers. Hosts are computers that run applications such as your Web browser, your IM agent, or a file-sharing program. The application programs running on hosts are the real "users" of the network. Routers (also called gateways) are machines whose job is to relay, or forward, information from one communication channel to another. They may run programs but typically do not run application programs. For our purposes, a communication channel is a means of conveying sequences of bytes from one host to another; it may be a wired (e.g., Ethernet), wireless (e.g., WiFi), or other kind of connection.
Routers are important simply because it is not practical to connect every host directly to every other host. Instead, a few hosts connect to a router, which connects to other routers, and so on to form the network. This arrangement lets each machine get by with a relatively small number of communication channels; most hosts need only one. Programs that exchange information over the network, however, do not interact directly with routers and generally remain blissfully unaware of their existence.
By information we mean sequences of bytes that are constructed and interpreted by programs. In the context of computer networks, these byte sequences are generally called packets. A packet contains control information that the network uses to do its job and sometimes also includes user data. An example of control information is the information identifying the packet's destination. Routers use such control information to figure out how to forward each packet.
A protocol is an agreement about the packets exchanged by communicating programs and what they mean. A protocol tells how packets are structured—for example, where the destination information is located in the packet and how big it is—as well as how the information is to be interpreted. A protocol is usually designed to solve a specific problem using given capabilities. For example, the HyperText Transfer Protocol (HTTP) solves the problem of transferring hypertext objects between servers, where they are stored or generated, and Web browsers that make them visible and useful to users. Instant messaging protocols solve the problem of enabling two or more users to exchange brief text messages.
Implementing a useful network requires solving a large number of different problems. To keep things manageable and modular, different protocols are designed to solve different sets of problems. TCP/IP is one such collection of solutions, sometimes called a protocol suite. It happens to be the suite of protocols used in the Internet, but it can be used in stand-alone private networks as well. Henceforth when we talk about the network, we mean any network that uses the TCP/IP protocol suite. The main protocols in the TCP/IP suite are the Internet Protocol (IP), the Transmission Control Protocol (TCP), and the User Datagram Protocol (UDP).
It turns out to be useful to organize protocols into layers; TCP/IP and virtually all other protocol suites are organized this way. Figure 1.1 shows the relationships among the protocols, applications, and the Sockets API in the hosts and routers, as well as the flow of data from one application (using TCP) to another. The boxes labeled TCP and IP represent implementations of those protocols. Such implementations typically reside in the operating system of a host. Applications access the services provided by UDP and TCP through the Sockets API, represented as a dashed line. The arrow depicts the flow of data from the application, through the TCP and IP implementations, through the network, and back up through the IP and TCP implementations at the other end.
In TCP/IP, the bottom layer consists of the underlying communication channels, such as Ethernet or dial-up modem connections. Those channels are used by the network layer, which deals with the problem of forwarding packets toward their destination (i.e., what routers do). The single network-layer protocol in the TCP/IP suite is the Internet Protocol; it solves the problem of making the sequence of channels and routers between any two hosts look like a single host-to-host channel.
The Internet Protocol provides a datagram service: every packet is handled and delivered by the network independently, like letters or parcels sent via the postal system. To make this work, each IP packet has to contain the address of its destination, just as every package that you mail is addressed to somebody. (We'll say more about addresses shortly.) Although most delivery companies guarantee delivery of a package, IP is only a best-effort protocol: it attempts to deliver each packet, but it can (and occasionally does) lose, reorder, or duplicate packets in transit through the network.
The layer above IP is called the transport layer. It offers a choice between two protocols: TCP and UDP. Each builds on the service provided by IP, but they do so in different ways to provide different kinds of transport, which are used by application protocols with different needs. TCP and UDP have one function in common: addressing. Recall that IP delivers packets to hosts; clearly, a finer granularity of addressing is needed to get a packet to a particular application program, perhaps one of many using the network on the same host. Both TCP and UDP use addresses, called port numbers, to identify applications within hosts. TCP and UDP are called end-to-end transport protocols because they carry data all the way from one program to another (whereas IP only carries data from one host to another).
TCP is designed to detect and recover from the losses, duplications, and other errors that may occur in the host-to-host channel provided by IP. TCP provides a reliable byte-stream channel, so that applications do not have to deal with these problems. It is a connection-oriented protocol: before using it to communicate, two programs must first establish a TCP connection, which involves completing an exchange of handshake messages between the TCP implementations on the two communicating computers. Using TCP is also similar in many ways to file input/output (I/O). In fact, a file that is written by one program and read by another is a reasonable model of communication over a TCP connection. UDP, on the other hand, does not attempt to recover from errors experienced by IP; it simply extends the IP best-effort datagram service so that it works between application programs instead of between hosts. Thus, applications that use UDP must be prepared to deal with losses, reordering, and so on.
1.2 About Addresses
When you mail a letter, you provide the address of the recipient in a form that the postal service can understand. Before you can talk to someone on the phone, you must supply a phone number to the telephone system. In a similar way, before a program can communicate with another program, it must tell the network something to identify the other program. In TCP/IP, it takes two pieces of information to identify a particular program: an Internet address, used by IP, and a port number, the additional address interpreted by the transport protocol (TCP or UDP).
Internet addresses are binary numbers. They come in two flavors, corresponding to the two versions of the Internet Protocol that have been standardized. The most common is version 4 (IPv4); the other is version 6 (IPv6), which is just beginning to be deployed. IPv4 addresses are 32 bits long; because this is only enough to identify about 4 billion distinct destinations, they are not really big enough for today's Internet. (That may seem like a lot, but because of the way they are allocated, many are wasted. More than half of the total IPv4 address space has already been allocated.) For that reason, IPv6 was introduced. IPv6 addresses are 128 bits long.
1.2.1 Writing Down IP Addresses
In representing Internet addresses for human consumption (as opposed to using them inside programs), different conventions are used for the two versions of IP. IPv4 addresses are conventionally written as a group of four decimal numbers separated by periods (e.g., 10.1.2.3); this is called the dotted-quad notation. The four numbers in a dotted-quad string represent the contents of the four bytes of the Internet address—thus, each is a number between 0 and 255.
The 16 bytes of an IPv6 address, on the other hand, by convention are represented as groups of hexadecimal digits, separated by colons (e.g., 2000:fdb8:0000:0000:0001:00ab:853c:39a1). Each group of digits represents 2 bytes of the address; leading zeros may be omitted, so the fifth and sixth groups in the foregoing example might be rendered as just :1:ab:. Also, one sequence of groups that contains only zeros may be omitted altogether (while leaving the colons that would separate them from the rest of the address). So the example above could be written as 2000:fdb8::1:00ab:853c:39a1.
Technically, each Internet address refers to the connection between a host and an underlying communication channel—in other words, a network interface. A host may have several interfaces; it is not uncommon, for example, for a host to have connections to both wired (Ethernet) and wireless (WiFi) networks. Because each such network connection belongs to a single host, an Internet address identifies a host as well as its connection to the network. However, the converse is not true, because a single host can have multiple interfaces, and each interface can have multiple addresses. (In fact, the same interface can have both IPv4 and IPv6 addresses.)
1.2.2 Dealing with Two Versions
When the first edition of this book was written, IPv6 was not widely supported. Today most systems are capable of supporting IPv6 "out of the box." To smooth the transition from IPv4 to IPv6, most systems are dual-stack, simultaneously supporting both IPv4 and IPv6. In such systems, each network interface (channel connection) may have at least one IPv4 address and one IPv6 address.
The existence of two versions of IP complicates life for the socket programmer. In general, you will need to choose either IPv4 or IPv6 as the underlying protocol when you create a socket to communicate. So how can you write an application that works with both versions? Fortunately, dual-stack systems handle interoperability by supporting both protocol versions and allowing IPv6 sockets to communicate with either IPv4 or IPv6 applications. Of course, IPv4 and IPv6 addresses are quite different; however, IPv4 addresses can be mapped into IPv6 addresses using IPv4 mapped addresses. An IPv4 mapped address is formed by prefixing the four bytes in the IPv4 address with ::ffff. For example, the IPv4 mapped address for 220.127.116.11 is ::ffff:220.127.116.11. To aid in human readability, the last four bytes are typically written in dotted-quad notation. We discuss protocol interoperability in greater detail in Chapter 3.
Unfortunately, having an IPv6 Internet address is not sufficient to enable you to communicate with every other IPv6-enabled host across the Internet. To do that, you must also arrange with your Internet Service Provider (ISP) to provide IPv6 forwarding service.