BEYOND GREATNESS
Four Thoroughbred Legends
By Charles Justice
AuthorHouse
Copyright © 2011 Charles Justice
All rights reserved.
Chapter One
This chapter gives an overview of how the four principal thoroughbreds in this study will be discussed. Of necessity, basic statistics must be used around which to make objective discussions or comparisons of past performance data. There is no other way to draw remotely unbiased conclusions concerning the relative merits of one horse versus another, especially when sample sizes are small – as they are in this study.
However, the primary purpose of this effort is not to judge one horse against another as 'greatest of all time'. The limitations on making value judgments from statistics are discussed, and cautioned against, in any basic statistics text.
The comparison methods used herein serve primarily as a guide for readers concerned about how to make objective judgments regarding thoroughbred greatness. This book is more about suggestions than carved-in-granite conclusions.
You will find no statements herein that Horse A is better than Horse B. Statistics, in fact, cannot validly make such statements. Statistics usually gets a bad name because statements purporting to compare quality between individuals – horses or otherwise – are erroneously attached to it. Statistics tests hypotheses only. It cannot validly comment on comparative greatness – period!
* * *
An effective way to introduce the basic statistical concepts used herein is via an example of importance in thoroughbred racing history.
That event is the 1967 Woodward Stakes, run as the seventh race at Aqueduct Race Track on Saturday, September 30 of that year.
By the consensus of many racing pundits, that particular Woodward featured three of the greatest colts of the past century. They were Dr. Fager, Buckpasser and Damascus. That race has, in fact, often been called the race of the century.
The Blood-Horse, Inc. (2) rated these horses as sixth, fourteenth and sixteenth, respectively, among the one hundred greatest Thoroughbreds of the 20th century.
Many readers may not know the results of the 1967 Woodward. Even if you do, it is interesting and instructive to apply some basic statistics and attempt to "predict" the race results, even though the outcome is known. The actual race results can then be compared with the prediction. This is probably as good a way as any to see immediately how statistics can safely be applied and what its limitations are.
This approach also highlights the advantages and pitfalls of all such statistical methods regarding whether they accurately mirror reality.
Linear Trend or Regression
Before examining the race data, some basic groundwork must be established. We begin by explaining what is meant by a linear trend line. Figure 1 displays such a line specifically tailored to this example.
The trend line in Figure 1 was designed to fit its data points perfectly. Look near the upper-right-hand corner of Figure 1. You see the equation y = 3x + 7 and, beneath it, the R² value 1.00.
A value of 1.00 for R², called the Coefficient of Determination or COD, means that the straight line falls exactly through each data point it was intended to predict. That's logical because the equation was designed to do just that for this example.
Using Microsoft Excel, two columns of data were placed on the same spreadsheet. In a column of cells labeled 'X,' the numbers 2, 3, 4, 8, 10 and 13 were entered. In the column of cells immediately adjacent and to the right of the 'X' column, the numbers were generated by Excel according to the pre-entered formula shown in Figure 1: y = 3x + 7. This column was labeled 'Y'.
Thus, for x = 2, the y value 13, or (3x + 7), was generated by Excel. Similarly, the remaining five values for Y were generated according to the values entered for x.
In this case, the resulting straight line must perfectly fit the data.
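The spreadsheet construction described above can also be reproduced outside Excel. The following is a minimal sketch in Python, assuming the NumPy library is available; the variable names are illustrative and do not come from the book. It regenerates the 'Y' column from y = 3x + 7, fits a straight line, and computes R² from its usual definition.

```python
import numpy as np

# X values entered by hand in the Figure 1 example
x = np.array([2, 3, 4, 8, 10, 13], dtype=float)
# Y values generated from the contrived formula y = 3x + 7
y = 3 * x + 7

# Least-squares fit of a straight line (a degree-1 polynomial)
slope, intercept = np.polyfit(x, y, 1)

# Coefficient of determination: 1 - (residual sum of squares / total sum of squares)
y_hat = slope * x + intercept
ss_resid = np.sum((y - y_hat) ** 2)
ss_total = np.sum((y - y.mean()) ** 2)
r_squared = 1 - ss_resid / ss_total

print(f"y = {slope:.2f}x + {intercept:.2f}, R^2 = {r_squared:.2f}")
```

Because the Y values were built from the formula itself, the fit should recover a slope of 3, an intercept of 7 and an R² of 1.00, apart from floating-point rounding.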
Statisticians often use the fancier name linear regression when referring to such linear plots. But, as my grandmother Reid liked to say, "However you slice it, it's still bologna!" By any name, the plot is simply interpreted by noting that the '3' multiplying x indicates the slope of the line (how fast it rises) while the '7' is called the y-intercept.
Study the graph, and you'll see that if the straight line were extended backward to where it intersects the y-axis, that point would fall seven units above the x-axis. Similarly, if you begin at any given value of x and move one unit to the right along the x-axis, you will find that the line rises three units parallel to the vertical or y-axis. That is the precise meaning of the slope value, three, in this example. There is nothing else to know regarding the placement and orientation of straight lines in such trend analyses.
It is beneficial, however, to know that the horizontal or x-axis is traditionally used to show values of the independent variable while the vertical or y-axis is used for locating the corresponding dependent variable values. For racing data, distances run by a given horse are independent variables, and the corresponding y values, the dependent variables, are the predicted times that the given horse will take to run those distances.
Excel's LINEST routine produces ten numbers, arranged in a two-column-by-five-row table for a simple linear regression such as Figure 1 simulates. Only four of these ten values are needed for a basic interpretation of the output; these values are highlighted in Table 1 and are explained in the accompanying text.
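For readers who want to see where those ten numbers come from, here is a rough sketch in Python, assuming NumPy. The function name `linest_like` and the sample data are hypothetical and are not Buckpasser's data; the layout follows what LINEST returns for a simple regression when its statistics option is switched on (slope and intercept, their standard errors, R² and the standard error of estimate, the F statistic and degrees of freedom, and the two sums of squares).

```python
import numpy as np

def linest_like(x, y):
    """Return a five-row, two-column table laid out like LINEST's statistics output."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    df = n - 2                                   # degrees of freedom for a line
    sxx = np.sum((x - x.mean()) ** 2)
    slope = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    intercept = y.mean() - slope * x.mean()
    resid = y - (slope * x + intercept)
    ss_resid = np.sum(resid ** 2)                # residual sum of squares
    ss_total = np.sum((y - y.mean()) ** 2)
    ss_reg = ss_total - ss_resid                 # regression sum of squares
    see = np.sqrt(ss_resid / df)                 # standard error of estimate
    se_slope = see / np.sqrt(sxx)
    se_intercept = see * np.sqrt(1.0 / n + x.mean() ** 2 / sxx)
    r_squared = ss_reg / ss_total
    f_stat = ss_reg / (ss_resid / df)            # undefined for a perfect fit
    return [
        [slope, intercept],
        [se_slope, se_intercept],
        [r_squared, see],
        [f_stat, df],
        [ss_reg, ss_resid],
    ]

# Hypothetical sample data, made up purely for illustration
table = linest_like([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
for row in table:
    print(f"{row[0]:12.4f} {row[1]:12.4f}")
```

The four values a basic interpretation needs are the first row (slope and intercept) and the third row (R² and the standard error of estimate).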
In real life, straight line trends seldom, if ever, fit data points perfectly. This is principally because many random factors influence how a given Y value is generated from a given X value, as opposed to the example contrived for Figure 1.
Let us see how all this applies to the basic past performance data of Dr. Fager, Buckpasser and Damascus and then try to determine the results for the 1967 Woodward based on this analysis. The data used herein to produce linear trend lines or other statistical models for a given horse were obtained from the book Champions, published in 2000 by Daily Racing Form (3), unless otherwise noted.
Table 1 gives the LINEST output for Buckpasser based on his complete set of racing data, excluding the Woodward results, for 1967, since we must exclude what we wish to predict later. Buckpasser ran five other races that year. We will use these results to predict how long it will take Buckpasser to run ten furlongs (1.25 miles), the length of the Woodward.
From Table 1, we see that Buckpasser's linear regression equation, using only the slope and intercept values highlighted in row 1, is:
ŷ = 107.91x – 13.19
The y-circumflex, ŷ, is a statistics convention indicating that it is a predicted value (of time) based on the equation, as opposed to an actual run time. The values in the four highlighted cells are the only ones needed from the LINEST output to completely interpret simple regression equations such as this. The other values are 'niceties.'
Two values in Table 1 remain to be explained. They are the R² or coefficient of determination in column two, row three, and the standard error of estimate (SEE), or standard error of the mean, as it is also known, in column four, row three.
The coefficient of determination indicates how well the straight line predicts the data trend. In this case, when the value of the COD is multiplied by one hundred, it is directly converted to a percent. The conversion gives 99.98 percent. It means that the linear trend line for Buckpasser's data accounts for 99.98 percent of the variation in his running times as the distance run changes. This is an extraordinarily close fit compared to most real-world data. A similarly close fit will be seen to hold for all racing data described herein.
The standard error of estimate value, 0.3534, is basically the standard deviation to be expected in the given data prediction. Using this example, if you substitute the value 1.25 for 'x' in the given equation, it predicts that Buckpasser's expected time for a ten-furlong race (1.25 mi), such as the Woodward, will be 121.70 s, rounded to two decimal places (with 'feet' hereafter abbreviated 'ft' and 'seconds' abbreviated 's').
The expected error limits of this prediction are then found using the standard error of estimate (SEE) value 0.3534. Multiplying SEE by three and then successively adding and subtracting the result from 121.70 gives you the 'three-sigma' boundary limits within which one can expect Buckpasser's predicted times to match actual times at ten furlongs.
The ±3 SEE range for this data is from (121.70 - 1.06) to (121.70 + 1.06), again rounding to two decimal places. These values are 120.64 and 122.76, respectively.
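The arithmetic in the last three paragraphs can be checked with a few lines of Python. This is only a sketch using the slope, intercept and SEE values quoted in the text; the variable names are illustrative.

```python
# Values quoted in the text for Buckpasser (from Table 1)
slope = 107.91       # seconds per mile
intercept = -13.19   # seconds
see = 0.3534         # standard error of estimate, in seconds

distance_miles = 1.25  # ten furlongs, the length of the Woodward

# Predicted time from the trend equation
predicted = slope * distance_miles + intercept

# Three-sigma limits: predicted time plus or minus 3 x SEE
margin = 3 * see
lower = predicted - margin
upper = predicted + margin

print(f"predicted time: {predicted:.2f} s")
print(f"3-SEE range: {lower:.2f} s to {upper:.2f} s")
```

Rounded to two decimal places, this gives 121.70 s for the prediction and 120.64 s to 122.76 s for the range, matching the figures above.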
Figure 2 is the graph of Buckpasser's linear trend equation. The trend equation and COD are displayed near the upper right-hand corner of the graph. As in the preceding example, Buckpasser's slope of 107.91 means that, for an increase in x of one mile, the predicted time will increase by 107.91 s. You should verify this result using the graph and the equation for practice.
When LINEST is run for Damascus and Dr. Fager, their resulting predicted mean times to run 1.25 miles are 121.81 s and 120.72 s, respectively.
Thus, based on this analysis, one would expect Dr. Fager to win slightly over Buckpasser and for Damascus to come in third, or to show.
The actual result, however, differed, and that is the important lesson of this example. The actual running of the 1967 Woodward resulted in the following final times for the three horses – Damascus: 120.60 s, Buckpasser: 122.08 s and Dr. Fager: 122.15 s.
Damascus actually won the race by ten lengths over Buckpasser and by ten and one-half lengths over Dr. Fager.
Does this mean that such analyses are useless? No. It simply means one must always remember that statistics only estimates likely outcomes. Any estimate is always subject to error limits as given by the SEE. In this case, the following explanation for the race results clarifies much of the discrepancy between predicted and actual values.
The primary reason for Dr. Fager's relatively poor showing was that he was 'baited' into an early sprint-like duel with Damascus' stablemate, Hedevar. This ploy worked mainly because Dr. Fager hated to let other horses lead him. By the time he reached about the final half mile of the race he had spent his energy and slowed dramatically.
In four total real-life matches, Dr. Fager actually had a two-two split in wins against Damascus. He was every inch Damascus' equal as a runner.
The z-score: An Important Comparison Factor
A theoretical z-score gives the relative performance level for a horse in any given race. It is calculated by taking the horse's actual time in a particular race, subtracting his predicted average time for that distance from his linear trend analysis (LTA), and then dividing the result by his LTA standard error of the mean (SEE).
The z-score formula is not given here. It can be found in any elementary statistics text. Excel also calculates it. The z-scores for the 1967 Woodward for Buckpasser, Damascus and Dr. Fager are: 1.09, -1.06, and 2.04, respectively.
Their z-scores are thus consistent with the race results. They indicate that Damascus ran near one standard deviation below, or faster than, his expected average time (thus, the negative value). Buckpasser ran about one standard deviation above, or slower than, his expected average time, and Dr. Fager showed the poorest overall relative performance, running a full two standard deviations above his expected average time. This explains his loss.
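The verbal definition of the z-score given above can be sketched in Python as follows. Only Buckpasser is computed, because his SEE is the only one quoted in this excerpt; the function name `z_score` is illustrative.

```python
def z_score(actual_time, predicted_time, see):
    """Actual time minus predicted time, measured in units of the SEE."""
    return (actual_time - predicted_time) / see

# Buckpasser's predicted ten-furlong time from his trend equation
buckpasser_predicted = 107.91 * 1.25 - 13.19

# His actual 1967 Woodward time was 122.08 s; his SEE is 0.3534 s
buckpasser_z = z_score(122.08, buckpasser_predicted, 0.3534)

print(f"Buckpasser z-score: {buckpasser_z:.2f}")
```

With the rounded coefficients printed in the text, this comes out at roughly 1.08, close to the 1.09 quoted above. The small difference presumably reflects rounding in the published slope and intercept. A positive value means a slower-than-predicted run.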
A Physical Analogy for a Race
It may help to visualize what statistics is trying to attain via mathematical analysis of sample data by imagining a horse race as a hockey puck lying on a flat, smooth surface and having various forces acting on it at random points around its circumference.
Imagine an arrow representing each force. The length and position of the arrows abstractly and symbolically represent the different relative influences each force has on the race's outcome. Longer arrows always imply stronger influences. Figure 3 aids this visualization.
Figure 3 shows twelve arrows impinging on the circumference of the puck. Each represents some influence on the race's outcome. That outcome is expressed by the final direction the puck moves due to the predominant resultant force acting on it. Any set of such forces can always be reduced to a single resultant force acting at a specific angle.
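The idea that any set of forces reduces to one resultant can be illustrated with a short Python sketch. The magnitudes and directions below are hypothetical, chosen only to make the arithmetic easy to follow; they are not taken from Figure 3.

```python
import math

def resultant(forces):
    """Combine (magnitude, angle in degrees) pairs into one resultant force.

    Returns the resultant's magnitude and its angle in degrees,
    measured counterclockwise from the positive x-axis.
    """
    # Add up the horizontal and vertical components of every force
    fx = sum(m * math.cos(math.radians(a)) for m, a in forces)
    fy = sum(m * math.sin(math.radians(a)) for m, a in forces)
    magnitude = math.hypot(fx, fy)
    angle = math.degrees(math.atan2(fy, fx)) % 360
    return magnitude, angle

# Two hypothetical influences: 3 units pushing right, 4 units pushing up
magnitude, angle = resultant([(3.0, 0.0), (4.0, 90.0)])
print(f"resultant: {magnitude:.2f} units at {angle:.1f} degrees")
```

Here the two arrows combine into a single force of 5 units at about 53 degrees. The same function accepts twelve arrows, or any other number.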
It is not difficult to imagine at least twelve such factors that influence the result of a given race. One group might include, for example: track condition; length of race; weight carried (impost); post position; number of starters; rest time between races; gate conditions; overall health of the horse; climate factors; crowd noise; paddock incidents and track variant.
At the very least, it is nearly assured that the first seven listed conditions have some influence on the horse-jockey team, although their effects may not be quantifiable.
The four gray arrows in Figure 3 denote a random grouping of four factors that are assumed to affect the given horse most strongly on the particular day. We can imagine that such clusters of factors may affect a horse more at one time than at another and that they change from race to race. The analysis still remains valid.
If we calculate how many ways four of twelve factors may be randomly selected, we find that 495 unique groupings are possible. There is a formula for calculating this number that will not be given here. It is verbally stated as: the number of combinations of n items taken r at a time. It is symbolized: nCr or sometimes with the 'n' superscripted and the 'r' subscripted to the 'C'. In this case, r is four and n is twelve.
Many hand-held calculators compute nCr automatically with one button press once the total number of items and the group size are entered into memory.
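The same count can be checked in Python. This sketch assumes Python 3.8 or later, where `math.comb` is available; it also evaluates the standard factorial expression n! / (r! (n - r)!) for comparison.

```python
import math

n_factors = 12   # total number of influencing factors (n)
group_size = 4   # factors assumed to dominate on a given day (r)

# Number of combinations of n items taken r at a time
groupings = math.comb(n_factors, group_size)

# The same number from the factorial definition n! / (r! (n - r)!)
by_formula = math.factorial(n_factors) // (
    math.factorial(group_size) * math.factorial(n_factors - group_size)
)

print(groupings, by_formula)  # both are 495
```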
It is quite reasonable to imagine four factors highly influencing a race outcome on a given day at a given time and place. This emphasizes why it is impossible to make a practical estimate of which horse in a group of, say, ten starters will win.
Obviously this situation makes betting the suspenseful pleasure that it is. If each race were nearly perfectly predictable, much of the sport's fun would be nullified.
Correlation and Statistics
To illustrate how the numerical effect (correlation) of random factors is calculated, data from Seabiscuit's past performance record are presented to examine the influence of several factors on his running times.
Fortunately, Seabiscuit ran enough races at a given distance on tracks having the same rated surface condition that this may be done with some degree of confidence. The data are from his 1936 racing season. He was then three years old.
Seabiscuit ran twenty-three total races in 1936. That is actually two races more than either Secretariat or Man o' War ran in their entire careers of two years apiece.
Of those twenty-three races, eight were run at eight and a half furlongs, or 1.0625 miles. Fortunately for analysis purposes, seven of those races were run on tracks rated fast, and three of the eight were run on the same track between Saturday, August 22 and Saturday, September 26. The running conditions for the three same-track races would logically be rated nearly identical, and so they would not be expected to bias the results.
Having this comparison allows us to see how much running equal distances on the same track and under the same track condition affects the correlations between the selected variables and the running time. It is not often that the small data samples available in thoroughbred racing comparisons give even this large a sample size wherein track conditions can be considered constant.
Excerpted from BEYOND GREATNESS by Charles Justice Copyright © 2011 by Charles Justice. Excerpted by permission of AuthorHouse. All rights reserved. No part of this excerpt may be reproduced or reprinted without permission in writing from the publisher.