Delivery Errors and Retrying
This chapter is all about temporary delivery errors, and how Exim deals with them. In an ideal world, every message would either be delivered at the first attempt, or be bounced, and temporary errors would not arise. In the real world, this does not happen; hosts are down from time to time, or are not responding, and network connections fail. An MTA has to be prepared to hold on to messages for some time, while trying every now and again to deliver them. Some rules are needed for deciding how often the retrying is to occur, and when to give up because the retrying has been going on for too long.
A related topic is how to handle messages destined for hosts that are connected to the Internet only intermittently (for example, by dial-up lines). In this case, incoming messages have to be kept on some server host because they cannot be delivered immediately. Exim was not designed for this, and is not ideal for it, but because it is being used in such circumstances, the final section of this chapter discusses how it can best be configured.
Retrying After Errors
Delivering a message costs resources, so it is a good idea not to retry unreasonably often. Trying to deliver a failing message every minute for several days, for example, is not sensible. Even trying as often as every 15 minutes is wasteful over a long period. Furthermore, if one message has just suffered a temporary connection failure, immediately trying to deliver another message to the same host is also a waste of resources.
A number of MTAs use message-based retrying; that is, they apply a retry schedule to each message independently. This can cause hosts to be tried several times in quick succession. Exim is not like this; for failures that are not related to a specific message, it uses host-based retrying, which means that if a host fails, all messages that are routed to it are delayed until its next retry time arrives.
In fact, Exim normally bases these retry operations on the failing IP address, rather than the hostname. If a host has more than one IP address, each is treated independently as far as retrying is concerned. In the discussion that follows, we use the word "host" when talking about remote delivery errors to make it easier to read. It should be understood, however, that this refers to a single IP address, so that a host with several network interfaces is, in effect, treated as several independent hosts.
Information about temporary delivery failures is kept in a hints database called retry in the db subdirectory of Exim's spool directory. You can read the contents of this if you want to, using the exim_dumpdb or exinext utilities, which are described in Chapter 21, Administering Exim. The information includes details of the error, the time of the first failure, the time of the most recent failure, and the time before which it is not reasonable to try again.
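As a rough illustration of the kind of record described above, here is a minimal Python sketch of a retry store keyed by IP address. It is not Exim's hints database format; the names `RetryRecord`, `retry_db`, `record_failure`, and `may_try` are hypothetical and exist only for this example.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class RetryRecord:
    """One entry in a hypothetical in-memory stand-in for the retry hints data."""
    error: str           # details of the error
    first_failed: float  # time of the first failure (seconds)
    last_failed: float   # time of the most recent failure
    next_try: float      # time before which it is not reasonable to try again


# Keyed by IP address rather than hostname, so each address of a host
# with several interfaces is tracked independently.
retry_db: Dict[str, RetryRecord] = {}


def record_failure(ip: str, error: str, now: float, delay: float) -> None:
    """Create or update the retry record for a failing IP address."""
    previous = retry_db.get(ip)
    first = previous.first_failed if previous else now
    retry_db[ip] = RetryRecord(error, first, now, now + delay)


def may_try(ip: str, now: float) -> bool:
    """True if there is no retry record, or its retry time has arrived."""
    record = retry_db.get(ip)
    return record is None or now >= record.next_try
```

Because the check depends only on the IP address, every message routed to a failing address is held back until the same retry time, which is the host-based behavior described earlier.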
Exim uses a set of configurable retry rules in the fifth section of the configuration file for deciding when next to try a failing delivery. These rules allow you to specify fixed or increasing retry intervals, or a combination of the two. Details of the rules are given later in this chapter, after the different kinds of error are described.
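To make "fixed or increasing retry intervals, or a combination of the two" more concrete, the following Python sketch computes the delay before the next attempt from a list of rule parts. The tuple layout, the example values, and the function name `next_interval` are assumptions made for illustration; this is neither Exim's configuration syntax nor its internal algorithm.

```python
from typing import List, Optional, Tuple

# (kind, cutoff_seconds, interval_seconds, factor); kind is "fixed" or
# "increasing". This layout is hypothetical, not Exim syntax.
Rule = Tuple[str, float, float, float]

MINUTE, HOUR, DAY = 60.0, 3600.0, 86400.0

# Illustrative values only: fixed 15-minute retries for 2 hours, then
# intervals growing by a factor of 1.5 from 1 hour until 16 hours, then
# fixed 6-hour retries until 4 days after the first failure.
EXAMPLE_RULES: List[Rule] = [
    ("fixed", 2 * HOUR, 15 * MINUTE, 1.0),
    ("increasing", 16 * HOUR, 1 * HOUR, 1.5),
    ("fixed", 4 * DAY, 6 * HOUR, 1.0),
]


def next_interval(
    rules: List[Rule],
    since_first_failure: float,
    previous_interval: Optional[float],
) -> Optional[float]:
    """Return the delay before the next attempt, or None to give up."""
    for kind, cutoff, interval, factor in rules:
        if since_first_failure < cutoff:
            if kind == "fixed" or previous_interval is None:
                return interval
            # Increasing: grow the previous interval, but never drop
            # below this rule part's starting interval.
            return max(interval, previous_interval * factor)
    return None  # past the last cutoff: retrying has gone on too long
```

Returning `None` once the last cutoff has passed models the "when to give up" decision mentioned at the start of the chapter.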
Remote Delivery Errors
Most, but not all, delays and retries are concerned with deliveries to remote hosts. Three different kinds of error are recognized during a remote delivery: host errors, message errors, and recipient errors.
A host error is not associated with a particular message, nor with a particular recipient of a message. The host errors are as follows:
- Refusal of connection to a remote host.
- Timeout of a connection attempt.
- An error code in response to setting up a connection.
- An error code in response to HELO or EHLO.
- Loss of connection at any time, except after the final dot that ends a message.
- I/O errors at any time.
- Timeout during the SMTP session, other than in response to MAIL, RCPT, or the dot at the end of the data.
When a permanent SMTP error code (5xx) is given at the start of a connection or in response to a HELO or EHLO command, all the addresses that are routed to the host are failed, and returned to the sender in a bounce message.
The other kinds of host error are treated as temporary, and they cause all addresses routed to the host to be deferred. Retry data is created for the host, and it is not tried again, for any message, until its retry time arrives. If the current set of addresses are not all delivered to some backup host by this delivery process, the message is added to a list of those waiting for the failing host. This is a hint that Exim uses if it makes a subsequent successful delivery to the host. It checks to see if there are any other messages waiting for the host, and if so, sends them down the same SMTP connection.
A message error is associated with a particular message when sent to a particular host, but not with a particular recipient of the message. The message errors are as follows:
- An error code in response to MAIL, DATA, or the dot that terminates the data.
- Timeout after sending MAIL.
- Timeout or loss of connection after the dot that terminates the data. A timeout after the DATA command itself is treated as a host error, as is loss of connection at any other time.
If the remote host specifies support for the SIZE parameter in its response to EHLO, Exim adds SIZE=nnn to the MAIL command, so an overlarge message causes a permanent message error, because it arrives as a response to MAIL. However, when SIZE is not in use, some hosts respond to unacceptably large messages by just dropping the connection. This leads to a temporary message error if it is detected after the whole message has been sent. Better behaved hosts give a permanent error return after the end of the message; this allows the message to be bounced without retries.
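The SIZE behavior can be sketched as a small Python function that builds the MAIL command. The function name `mail_command` and the representation of the EHLO response as a list of keyword lines are assumptions for this example, not Exim code.

```python
from typing import Iterable


def mail_command(sender: str, message_size: int, ehlo_keywords: Iterable[str]) -> str:
    """Build a MAIL command, adding SIZE= only if the server advertised SIZE.

    ehlo_keywords holds the extension lines from the EHLO response with the
    reply code already stripped, for example "SIZE 10485760".
    """
    advertised = {kw.split()[0].upper() for kw in ehlo_keywords if kw.strip()}
    command = f"MAIL FROM:<{sender}>"
    if "SIZE" in advertised:
        command += f" SIZE={message_size}"
    return command
```

With SIZE on the command, a server that considers the message too large can reject it in its response to MAIL, before any message data is transferred.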
A recipient error is associated with a particular recipient of a message. The recipient errors are as follows:
- An error code in response to RCPT.
- Timeout after RCPT.
The message is not added to the list of those waiting for this host. Use of the host for other recipient addresses is unaffected, and except in the case of a timeout, other recipients are processed independently, and may be successfully delivered in the current SMTP session. After a timeout, it is, of course, impossible to proceed with the session, so all addresses are deferred. However, those other than the one that failed do not suffer any subsequent retry delays. Therefore, if one recipient is causing trouble, the others have a chance of getting through when a subsequent delivery attempt occurs before the failing recipient's retry time.
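The three lists above amount to a decision table. The Python sketch below encodes it under stated assumptions: the stage and event labels are hypothetical names chosen for this example, and the function does not address whether an error is permanent or temporary.

```python
from enum import Enum


class ErrorKind(Enum):
    HOST = "host"
    MESSAGE = "message"
    RECIPIENT = "recipient"


def classify(stage: str, event: str) -> ErrorKind:
    """Classify a remote delivery error as a host, message, or recipient error.

    stage: "connect", "helo", "mail", "rcpt", "data", or "dot"
           ("dot" means after the final dot that ends the message).
    event: "refused", "timeout", "error_code", "connection_lost", or "io_error".
    These labels are hypothetical; they are not Exim identifiers.
    """
    if event in ("refused", "io_error"):
        return ErrorKind.HOST
    if event == "connection_lost":
        # Loss of connection is a host error except after the final dot.
        return ErrorKind.MESSAGE if stage == "dot" else ErrorKind.HOST
    if event == "timeout":
        if stage == "rcpt":
            return ErrorKind.RECIPIENT
        if stage in ("mail", "dot"):
            return ErrorKind.MESSAGE
        return ErrorKind.HOST  # includes a timeout after the DATA command
    if event == "error_code":
        if stage == "rcpt":
            return ErrorKind.RECIPIENT
        if stage in ("mail", "data", "dot"):
            return ErrorKind.MESSAGE
        return ErrorKind.HOST  # connection setup, HELO, or EHLO
    raise ValueError(f"unknown event: {event!r}")
```

For example, `classify("data", "timeout")` yields a host error, while `classify("dot", "timeout")` yields a message error, matching the distinction drawn in the message error list.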
Problems of Error Classification
Some hosts have been observed to give temporary error responses to every MAIL command at certain times ("insufficient space" has been seen). These are treated as message errors. It would be nice if such circumstances could be recognized instead as host errors, and retry data for the host itself created, but this is not possible within the current Exim design. What actually happens is that retry data for every (host, message) combination is created.
The reason that timeouts after MAIL and RCPT are treated specially is that these can sometimes arise as a result of the remote host's verification procedures taking a very long time. Exim makes this assumption, and treats them as if a temporary error response had been received. A timeout after the final dot is treated specially because it is known that some broken implementations fail to recognize the end of the message if the last character of the last line is a binary zero. Thus, it is helpful to treat this case as a message error.
Timeouts at other times are treated as host errors, assuming a problem with the host, or the connection to it. If a timeout after MAIL, RCPT, or the final dot is really a connection problem, the assumption is that at the next try, the timeout is likely to occur at some other point in the dialog, causing it to be treated as a host error.
There is experimental evidence that some MTAs drop the connection after the terminating dot if they do not like the contents of the message for some reason. This is in contravention of the RFC, which indicates that a 5xx response should be given. That is why Exim treats this case as a message error rather than a host error, in order not to delay other messages to the same host.
Delivery to Multiple Hosts
In all cases of temporary delivery error, if there are other hosts (or IP addresses) available for the current set of addresses (for example, from multiple MX records), they are tried in this run for any undelivered addresses, subject of course to their own retry data. This means that newly created recipient error retry data does not affect the current delivery process; instead, it takes effect the next time a delivery process for the message is run.....
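The fallback to other hosts can be sketched as a loop over candidate IP addresses that skips any whose retry time has not yet arrived. This is a simplified Python illustration; `deliver_to_first_available`, the `next_try` mapping, and the `attempt` callback are hypothetical names, not part of Exim.

```python
from typing import Callable, Dict, List, Optional


def deliver_to_first_available(
    hosts: List[str],
    next_try: Dict[str, float],
    now: float,
    attempt: Callable[[str], bool],
) -> Optional[str]:
    """Try each candidate IP address in order.

    Returns the address that accepted the delivery, or None if every
    candidate was skipped or failed.
    """
    for ip in hosts:
        if now < next_try.get(ip, 0.0):
            continue  # this address is subject to its own retry data
        if attempt(ip):
            return ip
    return None
```

In this sketch the `next_try` mapping is only read during the run, never updated, which mirrors the point that newly created retry data takes effect in the next delivery process rather than the current one.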