The Parable of the Golfers

Why a high match% to Rybka is usually not evidence of cheating, unless...

In the Gospels, a parable is a hypothetical example told for the express purpose of comparison to real-life situations.

According to the official study by mathematician Francis Scheid summarized here, the odds of a "low handicapper" golfer hitting a hole-in-one on a standard par-3 hole are about 1-in-5,000. Many of us hackers will never see one, let alone make one.

But if you assemble 10,000 such golfers to take a swing, you can expect to see not just one but two holes-in-one. And there will be nothing sinister about those who get them---it is just the law of averages.
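Those expectations are easy to check. Here is a minimal sketch in Python using the 1-in-5,000 figure from the text (the variable names are my own):

```python
# Holes-in-one among 10,000 golfers, each with 1-in-5,000 odds per swing.
n, p = 10_000, 1 / 5_000

expected = n * p                   # mean of a Binomial(n, p) distribution
at_least_one = 1 - (1 - p) ** n    # chance of seeing one or more aces

print(expected)        # 2.0 aces expected on average
print(at_least_one)    # ~0.86, so an ace is no surprise at all
```

The near-87% chance of at least one ace is why nobody should raise an eyebrow when it happens.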

(Photo source: Payne Stewart.)

However, suppose you slip a black spot into the jacket pockets of 10 of the golfers, and one of them hits a hole-in-one. Now you have a coincidence! The odds against are roughly 500-1, well within the common civil-court standard for statistical unlikelihood.
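The black-spot arithmetic works the same way; a quick sketch of the 10-marked-golfers calculation, again with the text's numbers:

```python
p = 1 / 5_000    # per-golfer hole-in-one probability
marked = 10      # golfers carrying the black spot

# Chance that at least one *marked* golfer aces:
p_marked_ace = 1 - (1 - p) ** marked
odds_against = (1 - p_marked_ace) / p_marked_ace

print(p_marked_ace)   # ~0.002
print(odds_against)   # ~500-to-1 against, hence the coincidence
```

The point is that the rarity attaches to the small pre-marked group, not to the field of 10,000 as a whole.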

The application to chess starts with my having taken over 28,000 human performances, about 10,000 from recent years and a different 10,000 from grandmaster-level players of all eras. Would you guess I have seen some holes-in-one? You already know enough to say that by pure chance alone, I almost surely have...

So what can constitute evidence of cheating?

The answer is, what is a black spot?

The policy I set on this site almost five years ago is that the spot can only be physical or observational evidence of cheating---something independent of chess analysis and of statistical matching to a computer.

Thus the statistical analysis can only be supporting evidence of cheating, in cases that have some other concrete distinguishing mark. When such a mark is present, the stats can be effective, and can meet civil court standards of evidence, which begin with a two-sigma deviation representing odds of roughly 40-1 (one-sided) or 20-1 (two-sided) against the "null hypothesis" of a chance occurrence. One other thing the stats can do is quantify how much benefit in rating-point terms was obtained by the alleged means.

(I am told that in certain industrial contexts, a five-sigma or six-sigma deviation---talking odds over a million to one against---can be prima-facie evidence of malfeasance. Moreover as I noted here, five-sigma is the particle-physics standard for discovery, while 3.5-sigma (roughly 2000-1 odds) enables one to claim evidence. There just aren't enough moves and games and players to get even a sniff of five-sigma, but 3.5 sigma, that happens...) Update 1/13/13: Oh, 5-sigma happens too, but there still aren't enough games and players to compare against actual frequencies of such deviations.
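For reference, here is a sketch of how sigma levels convert to odds against the null hypothesis, via the normal tail probability (the function name is mine). The one-sided two-sigma figure of about 43-1 is what gets rounded to the "roughly 40-1" above, and the "roughly 2000-1" for 3.5-sigma matches the two-sided reading:

```python
from math import erfc, sqrt

def odds_against(sigma, one_sided=True):
    """Normal-tail probability of a deviation of `sigma`, expressed as odds against."""
    p = 0.5 * erfc(sigma / sqrt(2))   # one-sided tail P(Z > sigma)
    if not one_sided:
        p *= 2                        # both tails
    return (1 - p) / p

print(round(odds_against(2)))                     # ~43   ("roughly 40-1")
print(round(odds_against(2, one_sided=False)))    # ~21   ("roughly 20-1")
print(round(odds_against(3.5, one_sided=False)))  # ~2148 ("roughly 2000-1")
print(odds_against(5) > 1_000_000)                # 5-sigma: over a million to 1
```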

One thing that does not constitute a black spot is an unfounded accusation of cheating. Such accusations are all too cheap; importantly, since in cases from "Toiletgate" onward move-matching itself has been the alleged evidence, they are not independent either.

The general statistical principle about holes-in-one described here is called Littlewood's Law. Another example: if you play 40 games, chances are that at least one of them will present the 40-1 statistical-outlier situation. And every 250 players can expect one of those in the last round of a six-round Swiss. Note that you already need the full model described in my papers even just to judge this kind of outlier---if you merely get a lot of matching, you may simply have played an unusually forcing game.
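The 40-games claim can be checked the same way, under the simplifying assumption that the outlier is an independent "40-1 against" (probability 1/41) event in each game:

```python
p = 1 / 41    # a "40-1 against" outlier in any single game
games = 40

# Chance of hitting the outlier situation at least once in 40 games:
p_some_outlier = 1 - (1 - p) ** games
print(p_some_outlier)   # ~0.63: "chances are" you see at least one
```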

Still another example: my Intrinsic Ratings Compendium states "95% confidence intervals" for every IPR, both the "e"xpected ones (theoretical, assuming no error in the rest of my modeling) and the "a"ctual ones from preliminary field tests, which are 1.4 times wider. Note that they are pretty wide, often over 100 Elo points either way from the middle. The stated IPR's are "best estimates", but individual ones can be off the mark from "true" values by the amounts allowed. In particular, for every 40 data points---and there are already well over 40 there---you can expect one to be gonzo-high above that range and another gonzo-low, since 38/40 = 95%. This may be so in particular for the runs on some world championship matches. But when you take the aggregate you see that the scheme is on the whole pretty close: the average is within 4 Elo of the players' ratings taking each match at face value, and within 9 Elo when weighting by (games or) moves.
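The 38/40 = 95% point can be made concrete; a small sketch assuming each IPR independently lands inside its stated 95% interval:

```python
n_points = 40     # roughly the number of IPR data points discussed above
coverage = 0.95

expected_outside = n_points * (1 - coverage)   # ~2: one gonzo-high, one gonzo-low
p_any_outside = 1 - coverage ** n_points       # chance of at least one miss

print(expected_outside)   # ~2 points expected outside their intervals
print(p_any_outside)      # ~0.87, so an outlier or two is the norm
```

So a couple of wayward IPR's are not evidence that the intervals are wrong; their complete absence would be the stranger outcome.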

The Cat 20+ tournaments, on the other hand, are systematically low. My current hypothesis is that this is due to a higher mean rating difference among players in tournaments, whereas the training sets are all games with both players rated within 20 points of each other; there, as in matches, the perception of skill parity may increase the circumspection of the play. That would cause higher IPR's both for the training sets, which set the scale, and for the matches. But then again, the run on the 2011 Canadian Open, with much greater rating disparity, came out systematically high, so other factors may be present. Plus, as I noted here, I will be upgrading the fitting procedure, so numbers may change---though the last change mattered less than +-5 Elo, so I've ignored it.