Special Guest: John Barile, The Tampa Bay Downs Handicapper
Exacta Betting: Using a Tiered Approach (Part 2)
To Download Complete Video and/or Audio and Play on Your Computer
To Watch Online (iPad, Large Screen, Browser, etc.)
To Download Related Materials (PDFs, Spreadsheets, etc.)
Join the discussion in our private Talking Handicapping with Dave Schwartz Facebook Group.
Steve G says
I wish we’d had more time with John Barile, because we never did get to what I consider to be the biggest pitfall in trainer handicapping: how do you distinguish between a trainer’s strengths and random events?
If you look at the various performance categories Mr. Barile monitors (1st off a layoff, 2nd off a layoff, class moves, distance moves, surface moves, etc.) and count the possible classifications in each category, by my count there are 138k possible combinations. If one were to think in terms of a 95% confidence interval, this implies that for each trainer there will be almost 7,000 false positives, i.e., combinations that show excellent results purely as a result of randomness and so would not persist going forward. I don’t know how one would go about separating these false positives from true strengths that could be reliably followed. Any thoughts?
Dave Schwartz says
Steve,
I love hearing this! It is just great to see that there is a topic that really excites people. Trainer handicapping is a wonderful way to do things but it can be a lot of work.
John and I are already talking about a return visit – maybe even a special show.
As for false positives, I am not sure I understand your math. Perhaps you could explain it more in depth.
Steve G says
Sure, I can give it a shot. As I said, I came up with 138k combinations of trainer performance characteristics by looking at how many mutually exclusive classifications there could be for each metric. For example, for the shipper dimension, the horse could be a shipper, a 2shipper, or neither: 3 possibilities. For the surface dimension, the horse could be FTT (which implies dirt to turf); 2TT or 3TT, both of which imply no surface change; or none of those, combined with TtoD, DtoT, or no change: hence 6 possible classifications for the surface dimension. Do this for each category, multiply out the product, and I get 138k possible combinations.
When assessing the trainer’s performance in any possible combination of situations, you’d be looking for something that deviates greatly from his “norm”. However, you will get values that fall into the tails of the distribution purely due to randomness. The more trials you perform (in this case, a “trial” being assessing a trainer “situation”, e.g. “stretchout + L1-3”), the more will fall into the extremes via random chance. I arbitrarily used 95% as a confidence interval… in other words, you’re 95% sure that what you’re observing is not due to random chance. Put another way, there’s a 5% chance it IS due to random chance. When doing one trial, this is fine; when doing 138k trials, you’re going to be overwhelmed by Type I errors.
Let me try this analogy: let’s say you want to test if a penny is a fair coin. You toss it 10 times, and it comes up tails every time. The chances of this happening are 1 in 1000, so you conclude it is NOT a fair coin. Makes sense.
But suppose you had a bucket of 138k pennies. You toss each one 10 times. Chances are, about 130-150 of those pennies are going to come up tails all 10 times. Would it make sense to conclude all of those are not fair coins? No. This is merely the mathematically expected result. Is it possible that some of those really WERE biased coins? Absolutely! But I have no way of knowing which of the 130-150 they are. This is precisely the problem I can’t solve with trainer handicapping.
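A quick simulation makes the expected count concrete (a sketch of the bucket-of-pennies experiment, assuming 138,000 fair coins and 10 flips each; not part of the original exchange):

```python
# Flip 138,000 fair coins 10 times each and count how many come up
# tails on every flip -- the "false positives" in the analogy.
import numpy as np

rng = np.random.default_rng(0)
n_coins, n_flips = 138_000, 10

flips = rng.integers(0, 2, size=(n_coins, n_flips))   # 1 = tails
all_tails = (flips.sum(axis=1) == n_flips).sum()

expected = n_coins / 2 ** n_flips                      # 138,000 / 1,024 ≈ 135
print(f"Coins that went 10-for-10 tails: {all_tails} (expected about {expected:.0f})")
```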
Dave Schwartz says
Shouldn’t the strength of the sample matter in this calculation?
I mean, if you have a guy who is (say) 70-for-110 with a characteristic, that is far better than being 7-for-11.
Steve G says
Sure, the more the better… that puts it farther into the tail of the distribution, with less probability that you’re seeing random noise; but this data is getting sliced so fine that even when looking at 3 years’ worth of data for a pretty active trainer, Mr. Barile’s spreadsheets don’t show anywhere near 110 starters for any of his profile entries. Of the 17 strong angles for Potts, the median number of starters is only 12. If one can successfully draw viable conclusions from 12 starts over 3 years, there’s got to be an enormous amount of intuition that goes into making that assessment. It would also explain why Mr. Barile has made no effort to automate his intensely manual process. The intuition that he’s overlaying on top of the raw numbers is something I’d love to hear him address in a follow-up session.
P.S. One correction I need to make to earlier numbers: in reviewing Mr. Barile’s spreadsheets, I see that he is treating first-off-a-layoff and second-off-a-layoff as one dimension. I’d imagine that if both exist, he’s giving precedence to the 1st-off-a-layoff. But since this is a single dimension, I have to revise my number of combinations from 138k to 60k.
Dave Schwartz says
That makes more sense to me. However, when you consider 7,000 false positives out of 138,000 possibilities, the chances that you have a good one are pretty reasonable, no?
Steve G says
There’s no way of knowing. At least none that I’m aware of. Go back to my pennies analogy. After you do your 10 flips of 138k pennies, and you end up with say 133 that came up tails 10 times in a row, would you feel you could confidently say anything as to how many are truly biased? (or, in the verbiage above, how many of them are “good ones”?)
To carry the analogy further, let’s say you get to take as many of these 133 pennies as you want and bet “tails” going forward. I would have absolutely no idea how many, or which ones, to select. Even if someone were to tell me ahead of time that the bucket DID include a handful of biased coins, how do I distinguish those from the ones that ended up among the 133 purely by chance? If I take all 133 and bet tails on all of them against a 20 percent takeout, there are too many fair coins among them for me to make a profit.
Dave Schwartz says
To me, the issue of small, highly similar data samples versus larger, less similar data samples has been around forever. The real question is whether the larger samples actually work better.
Other than a very few highly sophisticated Probit and Logit models, I am not aware of anyone who has made large sample analysis work. Have you found different?
Do you have a better suggestion?
John J. Barile says
Gentlemen, I am not a mathematician and therefore cannot debate Steve’s “randomness” argument, but I will reject it out of hand completely.
We are not talking about flipping pennies or every race in North America for the last ten years.
What we are talking about is hard working professionals who have a finite number of opportunities each year to execute a plan (and each in his own way) that will supplement the shortfall in their operating budgets by backing their own horses at the betting windows.
For 25 years (a lifetime, and that’s all we’ve got) I’ve watched trainers getting the job done with their own special methods, in their own time and for their own pleasure and profit. They don’t always win, but they win with particular multiple-angles often enough to stay in the game and live the life they please.
Trainer Peter Wasiluk goes 2% (7-313) with Layoffs in L3T and then comes back 33% ITM (31-89) with Plain Shortenups at average odds of $11-1…..AND THERE IS NOTHING RANDOM ABOUT IT! I often joke with Mr. Wasiluk that he’s not very good at most things and he just winks and gives me the Cheshire cat smile.
There is a time and a place for mathematical analysis and conclusion. This is not one of them…….
John J. Barile says
As a matter of fact, Steve G. accidentally highlights one of the strengths of multiple angle trainer profiling……one need only know the four rules of arithmetic (addition, subtraction, multiplication and division) in order to be a successful trainer profiler, and many of the calculations can be done on your ten fingers…..
Dave Salvini says
With regard to trainer handicapping, I would imagine that any significant change in the trainer’s horse stock, or a change in owners, could have a significant influence (either positive or negative) on that trainer’s subsequent performance.
John J. Barile says
Could you further clarify your statement, Dave S.? Thank you….
Dave Salvini says
Yes John, I could see a situation where a majority of a trainer’s wins were with only a couple of horses (particularly for a small stable). Now if that trainer no longer has these horses in his barn, then I could foresee that trainer not doing so well in the upcoming meet.
John J. Barile says
Let me just say that at some meetings a trainer will be #15 in the standings and the next he will be #30, but horses come and go and the competent trainer will maintain his strike rate and his MO will not change much from one meeting to the next, regardless of his stock.
steve gazis says
I respect your experience here, but as a data professional, it’s not in my nature to “reject out of hand completely” without investigation and explanation. Perhaps if we walk through an example that illustrates my point, it would be helpful.
When dealing with large amounts of data, rather than look at a trainer and identify the conditions under which he has success, it makes more sense to take a condition and identify the trainers who are successful with it. Purely at random I decided to look at L1-3 and shortening up. Taking trainer data from 2011-2012, I identified all trainers that had particular success under this condition. I defined “particular success” as a win rate that exceeded their overall win rate by at least 20 percentage points. To cut down on noise, I required at least 10 starters under L1-3/shortening.
Altogether, I found 19 trainers who excelled under these conditions:
Trainer                     All starts   All win%   L1-3+Sh starts   L1-3+Sh win%
BINGHAM WILLIAM B               106       0.09434         10           0.30000
BRINSON CLAY                    206       0.31553         23           0.60870
CHABOT ROB                      114       0.27193         10           0.50000
CONDILENIOS DINO K              241       0.21577         13           0.46154
DACOSTA JASON                   218       0.15596         15           0.40000
DENZIK, JR. WILLIAM             140       0.20000         15           0.40000
DOWNING WILLIAM                 216       0.12500         13           0.38462
DUHON PAUL                      137       0.11679         15           0.33333
GULICK JAMES M                  221       0.09050         13           0.30769
HICKEY WILLIAM J                 93       0.20430         14           0.42857
JENSEN MARK                     168       0.13095         14           0.42857
LENZINI MICHAEL                 207       0.07246         10           0.30000
MCLEAN BILL                     264       0.11364         13           0.46154
NOLAN WILLIAM J                 123       0.07317         10           0.30000
SCHNELL DON                     191       0.13089         11           0.36364
SOTO SALVADOR R                 125       0.05600         10           0.30000
TARRANT AMY                     161       0.12422         12           0.33333
THOMAS JAMEY R                  185       0.11892         18           0.38889
ZULUETA MARCOS                  121       0.28099         10           0.60000
Looking forward into the first 6 months of 2013, betting all these trainers under L1-3/shortening yielded 46 plays and almost broke even. 5 of these trainers are showing a profit, while 6 (Chabot, DaCosta, Denzik, Downing, McLean and Zulueta) are a combined 3 for 28 with a flat-bet loss of 86 cents on the dollar. They killed all the profit the rest generated. Perhaps this group of 6 will rebound. Or it’s possible that what we saw as prior success was merely the result of the statistical noise that is inevitable when looking at thousands of trainers. My problem is that when I look at the performance from 2011-2012, I don’t see anything about those 6 that looks noticeably different from the rest that would alert me that their performance might not carry over. How do I know if it’s smart to continue betting Tarrant and Lenzini, but that Chabot is going to go from 50% winners to sending out 6 horses that don’t even hit the board?
I propose that, whether you’re consciously aware of it or not, your knowledge of and insights into the training colony at Tampa Bay are helping you make those kinds of determinations, which simple number-crunching cannot.
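For reference, a minimal sketch of the kind of screen described above (the column names 'trainer', 'won', and a boolean angle flag are assumptions for illustration, not the actual query used):

```python
# Flag trainers whose win rate in a situation beats their overall win
# rate by 20+ points, with at least 10 situational starters.
import pandas as pd

def find_angle_trainers(starts: pd.DataFrame, angle: str,
                        min_starters: int = 10, edge: float = 0.20) -> pd.DataFrame:
    """starts: one row per start, with columns 'trainer', 'won' (0/1),
    and a boolean column named by `angle` (e.g. 'l13_shorten')."""
    overall = starts.groupby("trainer")["won"].agg(all_starts="size", all_win="mean")
    situ = (starts[starts[angle]]
            .groupby("trainer")["won"]
            .agg(angle_starts="size", angle_win="mean"))
    merged = overall.join(situ, how="inner")
    mask = (merged["angle_starts"] >= min_starters) & \
           (merged["angle_win"] >= merged["all_win"] + edge)
    return merged[mask].sort_values("angle_win", ascending=False)
```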
John J. Barile says
First things first. I apologize if I offended Steve G. …I certainly did not intend to do so…….nonetheless I think there is a defect in Steve’s approach of looking at the multiple angle(s) rather than the trainer…..
for example:
Thursday, I said that Jamie Ness won only 7% (2-28) with L1-3 + S + No Class Change types, but that at the same time he won 53% (8-15) with L1-3 + S + Dropdown + Surface Change types.
I had to keep digging until I identified the winning multiple angle. If I had sorted Ness simply by L1-3 + S, he would look mediocre at best…. I would propose that Steve is not looking deep enough to find the winning subset (which could be 3, 4, even 5 angles deep)……
I appreciate his compliment that my experience and intuition are largely at play here…..and perhaps it’s a factor…..but I think it is limited to some of the searches I intuitively embark upon…..i.e., does Clement win just as many races when he’s not the favorite as when he is, and with a nice ROI too? …and other such searches……..
steve gazis says
apologies… the board didn’t maintain the formatting of the table of numbers
Dave Schwartz says
My experience with trainer stats indicates that looking at anything across multiple circuits is a total waste of time. It is, in fact, almost random.
I would point out one more thing, Steve. You mentioned looking for a 20-point difference from normal for that trainer. I suggest that looking for improvement above normal for a really poor trainer does not necessarily provide you with a profitable scenario.
Consider this line of thinking:
At the outset, consider if we had an unlimited amount of data (which, of course, nobody has).
1. If I use an ordinal system to examine a factor (such as “speed rating in the last race”) and create a system using that factor as a component but with a simple linear weight (such as 100 for 1st, 90 for 2nd, 80 for 3rd, etc.), I get SOME RESULT.
2. If I create a table where I address the question, “How good are the horses in the different rank positions?” and use those values instead, I should see a better result (because it is PROBABLY a more accurate portrayal of reality).
3. If I create that table again, but this time make it trainer-specific (i.e. “How does THIS trainer do with horses ranked 3rd?”), the results should improve again (a rough sketch of this follows below).
4. If I make that trainer data specific to one track, region or circuit, the results should improve further.
All of these are suppositions, of course.
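Here is a rough sketch of suppositions 2 and 3, using assumed column names (an illustration of the idea, not anyone's actual data layout):

```python
# Replace a simple linear rank weight with an empirical "how good are
# horses ranked k-th?" table, optionally made trainer-specific.
import pandas as pd

def rank_value_table(history: pd.DataFrame, by_trainer: bool = False) -> pd.Series:
    """history: one row per start with columns 'speed_rank', 'won' (0/1),
    and 'trainer'. Returns win rate per rank (or per trainer + rank)."""
    keys = ["trainer", "speed_rank"] if by_trainer else ["speed_rank"]
    return history.groupby(keys)["won"].mean()

# Supposition 1: a simple linear weight (100 for 1st, 90 for 2nd, 80 for 3rd, ...).
linear_weight = lambda rank: max(110 - 10 * rank, 0)

# Suppositions 2 and 3: look up the empirical value instead.
# table = rank_value_table(history)                      # global table
# table_t = rank_value_table(history, by_trainer=True)   # trainer-specific table
```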
Questions for Steve:
1. What do you think of the 4 suppositions above?
2. WITH NEAR UNLIMITED DATA, are more specific samples likely to be more predictive than those that are less specific?
3. If you have less than “a giant sample” of data and choose to lump everything together in order to have “enough,” does the quality of your prediction suffer?
4. In other words, if I handicap 2f baby races with the same sample I use to handicap marathon turf routes and everything in between, am I likely to build a system that works across all spectrums of racing?
What are your thoughts?
John J. Barile says
I might just add that I have a notional problem with the use of all of a trainer’s starts (versus track-specific starts). I can’t prove it, but I believe that trainers come to the meeting with specific goals in mind and that what he/she may hope to do at Tampa is not necessarily what he/she would like to accomplish at Thistledown……..
Dave Schwartz says
John, I think the other problem is that with larger training operations that are spread over the country, the named trainer actually has very little to do with the actual training regimen. He has assistants for that, who have some of their own techniques. In addition, the local track’s characteristics – such as (but not limited to) when the condition book is available, stall space, demands from the racing secretary, etc. – have an impact on everything from workout patterns to class drops and distance switches.
steve gazis says
A lot to take in here… first, absolutely no offense taken. I watched the webinar in hopes of answering these sorts of questions that have dogged me in the past, and I appreciate the opportunity to engage in a dialogue.
Also, I’m afraid I’m not being as clear as I’d like. Those 19 trainer situations above were winning subsets in 2011-12, without having to dig deeper and add anything beyond the L1-3/shortening; my issue is that moving on to 2013, only half were able to maintain that success; the other half were awful. Figuring out which would be which ahead of time was the nut I could never crack.
Dave, excellent point about trainers with large operations. That’s a good argument all by itself for isolating the analysis to a track-by-track basis, or at least a circuit-by-circuit basis. My initial thinking was that if a trainer had particular tendencies that he relied on to get a horse ready for a win, the location didn’t really matter; but as you say, there’s no guarantee that the name in the program is the person actually in charge of the horse. I’ll have to try some more data wrangling with this in mind, and see where it leads.
re: the “success” designation… yes, a minimum win% cutoff, not just a 20-point improvement, would be a good idea; but in this particular example, that wouldn’t have helped. Those 6 trainers who’ve crapped out so far in 2013 had higher win rates in 2011-2012 than those who’ve continued their success.
As for your 4 questions:
1 – If I understand you correctly, this does sound somewhat similar to something I’ve also experimented with. I wanted to see how my top rated horses performed when trained and/or ridden by those who’d had a history of good success with horses I also had as top-rated. Unfortunately, what I found is that, like they say in those tv commercials, past success is no guarantee of future results.
2- I would assume yes
3 – “suffer” compared to what? The hypothetical unlimited specific dataset? I’d say yes. Compared to an overfit tiny specific sample? I’d say no, but neither is a good option.
4 – Of course not. But at the same time, I also believe that if you build a system based on the 8 baby dashes in your database, and another based on the 4 marathon turf races in your database, neither system is going to work going forward.
But yes, if the point of that was to argue that trainer tendencies should be analyzed in a smaller and more specific way, i.e., track-specific, then yes, I’m already on board with that and prepared to take a look at it that way.
Dave Schwartz says
Steve,
My point was not just about trainer statistics.
I’d like to make the same point about ALL data. As a “data scientist,” one of your biggest challenges must be a constant battle of general vs. specific; of large samples vs. small.
My belief is that whenever you build any type of systematic approach (or for that matter come to any conclusions whatsoever) from a data sample, you are actually building a system that works better for some sub-segments of the data than others. Permit me to be more specific in this conversation.
Let’s say that you have a data sample of 100,000 races. Let’s further say that you have built a system that works to some degree.
My contention is that this system works best against the AVERAGE race in the sample. That is, if we were to graph the races in a two-dimensional bar graph, based upon a similarity scale to the AVERAGE race, the graph would describe a bell-shaped curve. The system developed will work best against the portion in the middle and less so the tails at each end. To me this is a logical assessment.
I think it is logical (and probably close to true) that one standard deviation from the midpoint is probably the sweet spot of races for such a system.
If you can agree with that, and I think most statistically trained people would – but I am open to being educated differently – then the next logical progression in system development would be to remove the tails (or “outliers” as you might want to call them) from the sample and redesign the system from scratch!
By the way, if you were to assume that 95% confidence level that you spoke of in your earlier post, I think it would be logical to say that IF you had a 95% likelihood of success in the center of the graph, it would rapidly scale downward as you moved towards the tails. So much so that the “chances” might be 15% or less of actually having the same system work on the tails that worked in the center. This is just my relatively untrained statistical brain applying some degree of logic to the question at hand. Please comment on this.
My reasoning is that not only does our system not work on those tail-end races, but in addition, those races serve to distort our answers away from the optimum system (whatever that may be).
I think it is further logical that this process of eliminating tails and (effectively) narrowing the segments will cause:
A. Improvement in the system in the development.
B. Improvement in the system in the real world going forward.
C. Diminishing of sample size.
D. Increasing complexity simply because there are more systems to build and maintain.
This is where I tie in my original premise – that as a “data scientist” your challenge is to break races into as many segments as you can while maintaining a “reasonable” sample size.
As the number of segments expands, the shape of the “similarity” curve begins to bunch towards the middle. This means that the definition of “average race” is a closer match to more (or even most) races in the sample.
Some races are just plain difficult. Let’s say that you are considering the races at 6.5f on the SA downhill turf course. There is a very small sample of such races run each year. Can you reasonably expect to EVER build a system for those races? Based upon sample size, probably not.
The problem is that every race comes with its own variety of UNIQUENESS. Most races – probably two-thirds or so – fall outside of the thing called “average races” anyway.
My experience with “statistical types” is that they have a natural tendency to think in terms of global models. And why wouldn’t they? Doing it that way, you have just the one model to build, test and maintain. Breaking things into segments adds at least a full order of magnitude to the complexity of the problem, as well as a geometric increase in the potential solution space.
Of course, the non-statistical types go exactly in the opposite extreme. They want to build a system out of 4 races. LOL
Building a single, global model is a huge undertaking. Yet it is relatively small in comparison to building (say) 50 individual systems instead. Getting them all to work is probably impossible.
However, building the global model comes with the built-in loss factor from the tails.
I invite the comments of others as well as Steve in this matter. It is a great topic, I think.
easwaran_india says
thank you Dave for your kindness in sharing your wisdom
thanks
easwaran
India
steve gazis says
Yes Dave, we’re pretty much on the same page here concerning global vs. specific, the only difference might be what constitutes an acceptable level of “similarity”. But you’re right, mixing together samples that are dissimilar waters down the predictive capability of any good factors you might be using.
I’m reminded of a job I had years ago, doing response modeling for direct mail. When I first got there, they were using only two models: one for paid customers, and another for the small subset with open orders. Given that they mailed as many as 40 million customers in a single mailing, scoring almost all of them with a single model seemed nuts. Evaluating your best customers with the exact same criteria as customers who might have ordered only 3 times in 12 years couldn’t possibly work as well as developing models on smaller, targeted populations. By the time I left, there were more than a dozen models in play, each targeting a different layer of customer quality, with model components defined and scaled appropriately to its associated segment.
So yes, I would never argue for an entirely global approach when it comes to horses either. The only question is, where does one draw the line on specificity? When you said “every race comes with its own variety of uniqueness”, I don’t know how literally you meant that, but I actually do believe that if one were so inclined, every single race could be legitimately defined as a unique event via non-trivial criteria. Once you get past the obvious ones such as track, distance, surface, track condition, class, you could take it further and layer in things like field size, and variables that describe the pace setup of the race in excruciating detail, and so on, until you end up with 100,000 samples of size 1 each. So the question then becomes when does “non-identical” constitute “similar”? I doubt there’s a correct answer, and different people will have their own biases when they answer that question. A class handicapper might be appalled at someone who lumps together 5k claimers and nw2 allowances, and a pace handicapper might be equally appalled at someone who combines races with a lone speed with those with 3 need-to-lead quitters and a lone presser.
Yes, this really is a great topic, but I also don’t want to run the risk of derailing the trainer handicapping topic, so I’d like to return to that now. As I said last night, I would revisit my earlier analysis, with the addition of track segmentation. I also added races back to 2009, just to have more data to work with. As before, I took a trainer “situation” at random, this time 2L/no-shipper. Using the 2009-2012 data, I identified 21 trainers who excelled with it:
Track   Trainer                      2L/no-shipper starts   Win%
BEL     ASMUSSEN STEVEN M                    14              50%
BEL     DOMINO CARL J                        13              38%
BEL     MCGAUGHEY, III CLAUD                 18              39%
CDX     BOREL CECIL P                        13              54%
CNL     O’CONNELL KATHLEEN                   16              38%
CRC     DACOSTA JASON                        11              36%
FEX     DREXLER MARTIN                       14              50%
GGX     JENDA CHARLES J                      18              39%
HOL     EURTON PETER                         11              45%
HOU     TORREZ JERENESTO                     11              45%
HOU     WILLIS MINDY J                       10              40%
HST     RYCROFT KELLY D                      14              36%
LRL     DILODOVICO DAMON R                   15              47%
MPM     CLULEY DENIS                         10              40%
MTH     MCLAUGHLIN KIARAN P                  11              45%
NPX     STRUMECKI ALBERT                     13              46%
OPX     CALHOUN W. BRET                      12              42%
SAR     SERPE PHILIP M                       11              36%
TDN     ROWE DONN A                          21              43%
TPX     CONNELLY WILLIAM R                   14              43%
TPX     ROMANS DALE L                        15              40%
Taking these 21 track/trainer combinations and betting them into the first half of 2013 resulted in 7 trainers with 27 total plays. One of them (Calhoun) did very well, one (Torrez) slightly bettered breakeven, and the other 5 (Shug, DaCosta, Eurton, Cluley, Strumecki) were only 1 for 12 (a $7.80 winner). Altogether, the 27 plays were -5%, which is not bad at all, but my problem with this is more conceptual: if these 5 trainers had 4 years of success with this pattern and are now 1 for 12 on forward, independent data, does this mean their prior observed success was merely a statistical anomaly? If these 5 really are ~45% successful with this pattern, 1 for 12 is pretty far outside the level of performance one would expect. Is there anything about the performance of these 5 in 2009-2012 that would suggest their success would not continue?
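To put a number on “pretty far outside,” here is a quick binomial check (an added illustration, not a calculation anyone in the thread ran):

```python
# How unlikely is "1 win from 12 starts" if these trainers really won
# ~45% of the time with this pattern?
from scipy.stats import binom

p_true, n, wins = 0.45, 12, 1
p_tail = binom.cdf(wins, n, p_true)   # P(at most 1 win in 12 tries)
print(f"P(<= {wins} wins in {n} starts at {p_true:.0%}) = {p_tail:.4f}")
# ≈ 0.008, i.e. under a 1% chance if the 45% rate were real
```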
Given all the data above, what would one do in the second half of 2013? Would it be wise to bet those 5 assuming they will rebound? Would it be wise to bet Calhoun and Torrez, or is there a fear they’ll revert to the mean? I’m curious to see how it will shake out, but if I had to predict, I don’t feel like I have anything on which to base a guess.
Dave Schwartz says
Steve,
I agree. Great topic.
Personally, I use a different approach to trainer stats, but I will leave that discussion for another time. Suffice it to say that I am more likely to use EVERYTHING as opposed to a handful of 2-deep or 3-deep factor combinations.
Back on track – is your point here that the small sample is questionable? Statistically speaking, if a trainer had an angle with 20 starters out of a total of 1,500 starts and another trainer had the same angle with 20 starts but only 30 total starts, is the likelihood of a false positive the same? In other words, does the sample size they are drawn from matter?
steve gazis says
No, that doesn’t matter. In either case you’re observing 20 starters. This is a common fallacy when it comes to things like political polling. When people hear a poll is based on 500 people, they don’t understand how it could be reliable when there are 300 million people in the country. But whether those 500 people are representative of 300 million or 1,000, the margin of error in the data collected is still the same, because either way, it’s only dependent upon the variation inherent within 500 people.
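The usual back-of-the-envelope margin-of-error calculation for the 500-person poll makes the point (a sketch, assuming simple random sampling):

```python
# Margin of error depends on the sample size, not the population size.
import math

n = 500
p = 0.5                                    # worst-case proportion
moe_95 = 1.96 * math.sqrt(p * (1 - p) / n)
print(f"95% margin of error for n={n}: ±{moe_95:.1%}")   # ≈ ±4.4%
```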
Back on topic… I wouldn’t really say I have a “point”. What I have are questions based on my attempts at building a workable framework around trainer handicapping. To put it as straightforwardly as possible: how can I make it work when I can observe trainers enjoying tremendous success for 4 years with a particular pattern, but when applied to independent, forward data, they perform poorly with it? In other words, how do I distinguish the “false positives” from the trainers who really ARE effective with a pattern? Harking back to the previous example, how do I distinguish a Calhoun from a Eurton? Is there an objective, data-driven way of doing it, or does it just come down to accurate instincts about each trainer’s true capabilities?
Dave Schwartz says
On the topic of the 20 starters… when you use the 95% confidence example, would you not find that the larger the total sample of factors the greater the likelihood that some of them would be false?
In other words, if I have 100 positive trainer factors from a single trainer, will that 95% confidence level mean that 5 of them are likely false positives, that with 200 it would be 10, and with 20 it would be only one?
Or does the sample size of each one stand on its own? I am assuming it does.
Another question: How does one transfer the likelihood of a false positive to $net? I understand how to do it with hit rate, but the question is, here are some number of starts in a sample and (say) a +40% ROI. Obviously one cannot use “average payoff” to determine likelihood because the very point would be that this trainer’s profitability MAY be based upon high odds.
Thus, if he had (say) a factor that collectively produced a +40% ROI from (say) 100 starts but the profit was fueled by long prices… how would you figure THAT out? I am thinking that you do not have enough information without breaking the sample down into individual wagers and computing them all. Or else doing it on a subset by odds (or some other criterion), in which case the sample size could be very small.
I guess what I am saying is that the fact that a trainer characteristic is profitable across (say) 100 starts does little to tell us whether he is profitable with today’s horse at 5/2. True?
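One way to do that wager-by-wager computation is to bootstrap the individual $2 payoffs (a suggested sketch, not something from the webinar; it assumes you have the per-start payoffs Dave describes needing):

```python
# Bootstrap the per-start returns to see how fragile a long-price-driven
# ROI really is.
import numpy as np

def roi_bootstrap(payoffs, n_boot=10_000, seed=0):
    """payoffs: $2 return per start (0.0 for losers). Returns the observed
    ROI and the 5th percentile of the bootstrapped ROIs."""
    rng = np.random.default_rng(seed)
    payoffs = np.asarray(payoffs, dtype=float)
    roi = payoffs.mean() / 2.0 - 1.0
    boots = rng.choice(payoffs, size=(n_boot, len(payoffs)), replace=True)
    boot_roi = boots.mean(axis=1) / 2.0 - 1.0
    return roi, np.percentile(boot_roi, 5)

# e.g. 100 starts at +40% ROI on paper: if the 5th-percentile bootstrap ROI
# is still above zero, the edge is less likely to rest on one or two bombs.
```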
BTW, again I would like to thank you for your participation.
John J. Barile says
Once again, my experience has been that looking at Non-shipping 2Ls (as in Steve’s example) may be inadequate.
Trainer A may win 50% with Non-shipping 2Ls but 76% with Non-shipping 2L + ? + ?.
If the same is true for Trainers B, C and D, it would make complete sense that together they would be 1-12 with Non-shipping 2Ls if their ? characteristics were not present in their more recent tries…..
steve gazis says
John, please elaborate on this point. If I see a trainer who is 5 for 10 with non-shipping 2Ls, do I really want to layer on more criteria? Wouldn’t 5 for 10 be an acceptable level of success? Or do you keep layering on criteria until you can no longer improve it? At what point would you get concerned about a lack of representative races? In other words, if a trainer is 5 for 10 with A+B, do I want to add C if that makes him 3 for 3?
This is of particular concern when crunching these numbers mechanically… with this many dimensions of factors in play, you can do so much more sample splitting than you could by hand, so the question needs to be “when do I stop?”
John J. Barile says
Steve, the Potts example was very productive but perhaps not the best to use (for the show)….many of the trainers have as many as 150 (or more) starts with any single characteristic when I begin the process of “sample splitting” (and I like that description)….ultimately leaving me with a three or four deep angle that is say 9-12 win and 11-12 ITM….this is the cream that we want to see rising to the top….clearly, the greater the sample size the more confidence there is in using it….5-8 ITM at $9-1 odds presents an ethical dilemma….it’s accurate but not precise….to play or not to play???? And maybe that’s where the intuition comes into play……..also note that in the Potts Plain Stretchout example he won 7-20 at good odds going into the meeting, but he was also 5-11 with Plain Stretchout + Class Change types (and I overlooked this). Kabooom!!! At the last meeting he won 5 of 12 with the Plain Stretchout + Class Change (and that made me cringe)!!!
John J. Barile says
Anyway, you’ve hit the nail squarely on the head; when to stop layering in another characteristic is the greatest challenge of all…..with the exception of the monster angle (13-13 with L1-3 + S + Dropdown + RtS + TtoD), it can, at times, be equal parts art & science, but IMO there will always be enough solid angle plays to keep it profitable….
John J. Barile says
Kathy Guciardo went StR + DtoT only twice in the last four meetings….she was ITM both times at $75-1 odds….come on….tell me you wouldn’t put a couple of bucks down next time…..hahahaha
John J. Barile says
For example:
Greg Griffith was 7-9 ITM with Non-layoff Stretchout + Dropdown types in L3T and only 1-12 ITM with Non-layoff Stretchouts at the last meeting (while winning with the only starter that had the additional + Dropdown characteristic)……
steve gazis says
It’s even worse than that; if only 5 of 100 *positive* factors were false, it wouldn’t matter, because the 95 legitimate ones would be making you rich. It’s 5 out of ANY 100 TESTED that will be false positives. So if you test 100 and find 5 that look good, that is what you’d expect to find in completely homogeneous data via sheer luck. Even if there were a magical genie who could tell you that 1 of your 5 really was legitimate and will continue to be profitable going forward, I don’t know if there’s any way of knowing which one it is.
Not sure exactly what you meant by “the sample size of each stands on its own”, but if I’m understanding correctly, the sample size of each factor set would come into play in that it determines how far it has to deviate from the norm to be considered significant. E.g., you can’t just require a particular win% without specifying how many races need to be in that sample.
And yes, I’ve never found $net to be a good metric to use in these types of analyses, because, as you said, prior winners that paid decent prices dominate the sample to an unacceptable level. In all the tinkering I’ve done with this sort of stuff, purely requiring a prior $net never works. Prior win rate always needs to come into play as well; the two working in tandem generally seem to be preferable to either individually. In the case of the type of analysis we’re looking at here, the prior win rates are already so obscenely high that the prior $nets are very high as well.
One idea I thought might be worth pursuing is to see if ITM% would be a good predictor of future performance… the theory being that a high win rate coupled with lots of near misses would be even better than just a high win rate. Unfortunately, in my most recent set of example data, it worked just the opposite. Two of those 5 trainers that have done poorly in 2013 were the two of the 21 with the highest ITM% in 2009-2012. Eurton, who I singled out above as the sort of false positive I want to identify ahead of time, was 91% ITM in 2009-2012. Meanwhile, one of the profitable 2013 trainers, Torrez, had only 1 of his 11 2009-2012 starters finish 2nd or 3rd. So this line of thinking is off to a very bad start.
Dave Schwartz says
So you are saying that 5% of all factors are false positives?
Why is 95% the magic number?
Why not 90% or 70% or 4%?
steve gazis says
It’s not magic. When people deal with confidence intervals, they generally speak in terms of 95% or 99%. It’s just convention, but also has to do with the perceived cost of a false positive. For example, if we’re talking about a medical test that will lead to dangerous surgery if the patient tests positive, 5% false positives would be completely unacceptable. You’d require a test reading much, much farther from the mean before considering the patient to have tested “positive”.
And no, I’m not exactly saying that 5% are false positives, because I’m not using any kind of statistical precision in my measure of “prior success” to designate these positive factors. Given that there are thousands of trainers, and I’m only identifying around 10-20 in these examples as being “positives”, the requirement of a win rate more than 20 points greater than the norm is way more stringent than a 95% confidence interval.
If you google something like “Type I errors multiple hypothesis testing”, you’ll find a lot having to do with this problem I’m talking about, many of which will do a much better job than I am of explaining it. I found one in particular that’s very clearly written and also describes ways of addressing it (basically, the idea is to require much higher levels of significance, which as noted above, I’m already doing).
http://www.aaos.org/news/aaosnow/apr12/research7.asp
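Here is a minimal sketch of the “require much higher significance” idea, using a Bonferroni-style correction (an illustration with placeholder angle records, not the method from the linked article verbatim):

```python
# Divide the usual alpha by the number of angles tested, so only very
# extreme records survive when thousands of angles are screened at once.
from scipy.stats import binom

def significant_angles(angles, alpha=0.05):
    """angles: list of (name, wins, starts, baseline_win_rate).
    Keeps only angles whose one-sided binomial p-value survives a
    Bonferroni correction for the number of angles tested."""
    m = len(angles)
    keep = []
    for name, wins, starts, base in angles:
        # P(this many or more wins by chance at the trainer's baseline rate)
        p = binom.sf(wins - 1, starts, base)
        if p < alpha / m:
            keep.append((name, p))
    return keep

# e.g. testing 60,000 angles at alpha=0.05 requires p < 0.05 / 60,000 ≈ 8e-7.
```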
Dave Schwartz says
That makes sense.
I have read Precision: Statistical and Mathematical Methods in Horse Racing. Have you? If so, what did you think of that book?
Recently I discovered the joys of using R2 for so many things. What a simple but valuable tool.
Since you are doing so well in this thread, would you care to suggest a good way to apply statistical analysis to horse racing?
steve gazis says
Never heard of that book until now. Just looked it up on Amazon. The table of contents certainly looks interesting. My only concern is that I get the impression the focus of the book is Hong Kong racing. The pools there are so full of unsophisticated money that if one did have access to good data, finding winning techniques doesn’t seem like it’d be nearly as challenging. Let me see the author tackle weekdays at Laurel, then I’ll be impressed 🙂
What did you think of it? Was there stuff in there that could actually be applied, or was it more theoretical in nature?
I’m afraid I don’t have a great answer to your question. Naturally I’ve tried many modelling exercises in the past, but never with good results. The problem for me is coming up with a good dependent variable. If you model wins, you end up predicting favorites and get a $1.80 $net. If you model win pay, you end up modeling the noise generated by the better paying winners in the sample, and you generally get a bizarre looking model that would never work on independent data.
My strength has always been data mining; sifting for whatever looks like it might work. And by “work”, I don’t rely on any sort of statistical tests and measurements. My criterion is way more simple and direct: is it profitable on independent data?
The biggest mistake people make is believing what they see in the data they’re analyzing, without independently confirming it on fresh data… which, as I type this, give me an idea for something new to try with the trainer analysis:
1 – Use 2009-2011 to identify potential profitable patterns
2 – confirm whether or not profitable in 2012
3 – If so, bet forward into 2013
This would also allow for a lower bar to identify successful patterns, since step 2 would theoretically be eliminating most of the extra false positives that would be brought in. Looks like I have a busy weekend ahead of me…
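A rough sketch of steps 2 and 3 of that split, assuming a `candidates` table produced by a 2009-2011 screen like the one sketched earlier (all column names here are assumptions for illustration):

```python
# Confirm 2009-2011 candidates on the 2012 holdout year, then return the
# plays that would actually be bet into 2013.
import pandas as pd

def confirm_and_bet(starts: pd.DataFrame, candidates: pd.DataFrame,
                    angle: str, edge: float = 0.15):
    """starts: one row per start with 'year', 'trainer', 'won', and a
    boolean `angle` column. candidates: indexed by trainer, with an
    'all_win' baseline column from the development years."""
    confirm = starts[(starts["year"] == 2012) & starts[angle]]
    bet     = starts[(starts["year"] == 2013) & starts[angle]]

    # Step 2: keep only candidates that also beat their baseline in 2012.
    holdout = confirm.groupby("trainer")["won"].agg(starts_2012="size",
                                                    win_2012="mean")
    confirmed = candidates.join(holdout, how="inner")
    confirmed = confirmed[confirmed["win_2012"] >= confirmed["all_win"] + edge]

    # Step 3: these are the forward plays.
    return confirmed, bet[bet["trainer"].isin(confirmed.index)]
```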
Dave Schwartz says
It was translated from the author’s native language, so there are some grammatical issues.
The content is (basically) how to build an iterative regression approach. Very whale-like.
It is a worthwhile read. The closest thing to what the whales are doing, if you have an interest in understanding that.
steve gazis says
Yes, it does sound interesting… is it fair to assume that since it’s describing a whale-based approach, that access to a tote feed and automated betting is required for what he’s describing?
Dave Schwartz says
Yes. I would say that current odds and bet uploading should be in everyone’s arsenal.
However, in this age where there is no such thing as final odds, I think the approach needs to be modified to create an artificial line and then wager into that.
steve gazis says
Got the book and started giving it a quick browse. So far, I’m not seeing anything in here that’s new to me. His modeling approach looks to be what I referred to in my own modeling exercises: modeling on “win” as the dependent variable. Aside from the pitfall I described earlier (that the highest-scored horses will generally be the favorites), if you use such a model to make oddslines and bet overlays, there are two more problems: the obvious one of not being able to bet at fixed odds, and the one I ran into, which is that when the model identifies a significant overlay that would not be affected by late odds moves, it’s generally because your model isn’t accounting for unusual circumstances that are affecting the model inputs.
Dave Schwartz says
I can tell you that this approach is similar to what the whales use to make millions. However, the whales use an iterative version of regression. That is actually quite different.
Dave Schwartz says
In reply to Steve re: “The Hong Kong Book.”
The favorite SHOULD be the best horse in most races. After all, he wins more often (by far) than anyone else.
john renwick says
Hi guys. There was a piece in Horseplayer Magazine on how to handicap. One thing he said (and I still have them) was trainer patterns to track. One was sprint to route, from 1985 to 1997, with at least a 20% win and/or 30% win-place rate and at least 10 starts; over those 12 years, that was only the top 10 trainers at that track, which makes sense to me; nobody had a win % below 16%.
Dave Schwartz says
John,
Ed Bain has an approach he calls “4+30” which means “30% win rate and at least 4 wins.” This is a very powerful way to look at trainer stats.
John J. Barile says
I will give that a going over too…..isn’t he a mystery writer?
Dave Schwartz says
LOL – Only if you consider Hong Kong racing a mystery.