In the Gospels, a *parable* means a hypothetical example told for the
express purpose of com*par*ison to real-life situations.

According to the official study by mathematician Francis Scheid summarized here on Golf.About.com, the odds of a "low handicapper" golfer hitting a hole-in-one on a standard par-3 hole are about 1-in-5,000. Many of us hackers will never see one, let alone make one.

But if you assemble 10,000 such golfers to each take a swing, you can expect to see not just one hole-in-one but two. And there will be nothing sinister about those who get them---it is just the run of averages.

However, suppose you slip a black spot into the jacket pockets of 10 of the
golfers, and one of *them* hits a hole-in-one. Now you have a
coincidence! The odds are over 500-1 against, well within the common
civil court standard for statistical unlikelihood.
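
For readers who like to see the arithmetic, here is a minimal sketch in Python of the two calculations above; the only input taken from the parable is the quoted 1-in-5,000 figure, and the rest is plain binomial reasoning.

```python
# Minimal sketch of the parable's arithmetic; the 1-in-5,000 figure is the
# one quoted from Scheid's study, the rest is plain binomial reasoning.

p = 1 / 5000                      # chance a low handicapper aces a given par-3 swing

# 10,000 golfers each take one swing: expected number of holes-in-one.
expected_aces = 10000 * p
print(expected_aces)              # 2.0

# Mark 10 golfers with a black spot.  The chance that at least one of
# *them* aces is just about 10 * p, i.e. odds of roughly 500-1 against.
p_marked_ace = 1 - (1 - p) ** 10
print(p_marked_ace, 1 / p_marked_ace)   # ~0.002, so odds of ~500-1 against
```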

The application in chess starts with my having taken over 28,000 human performances, about 10,000 in recent years and a different 10,000 from grandmaster-level players of all years. Would you guess I have seen some holes-in-one? You already know enough to justify saying that by pure chance alone, I almost certainly have...

So what can constitute evidence of cheating?

The answer turns on another question: what is a black spot?

The policy which I set on this site almost five years ago is that the spot
can only be physical or observational evidence of
cheating, something *independent* of the consideration of
chess analysis and statistical matching to a computer.

Thus the statistical analysis can only be *supporting* evidence
of cheating, in cases that have some other concrete distinguishing mark.
When such a mark is present, the stats can be effective, and can meet civil
court standards of evidence, which begin with a two-sigma deviation
representing odds of roughly 40-1 (one-sided) or 20-1 (two-sided) against
the "null hypothesis" of a chance occurrence.
One other thing the stats can do is quantify how
much benefit in rating-point terms was obtained by the alleged means.

(I am told that in certain industrial contexts, a five-sigma or
six-sigma deviation---talking odds over a million to one against---can
be prima-facie evidence of malfeasance. Moreover, as I noted
here, five-sigma is the particle-physics standard for *discovery*,
while 3.5-sigma (roughly 2000-1 odds)
enables one to claim *evidence*.
There just aren't enough moves and games and players to get even a sniff
of five-sigma, but 3.5 sigma, that happens...)
**Update 1/13/13:** Oh, 5-sigma happens too, but there still aren't enough
games and players to compare against actual frequencies of such deviations.
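
To make the sigma-to-odds conversions quoted above concrete, here is a small Python sketch using the standard normal tail; only the z-values (2, 3.5, 5) come from the discussion, and the formula is the ordinary complementary-error-function tail, not anything specific to my model.

```python
# Convert a z-score (number of sigmas) into "N-to-1 against" odds under the
# standard normal "null hypothesis" of a chance occurrence.
from math import erfc, sqrt

def tail_odds(z):
    one_sided = erfc(z / sqrt(2)) / 2    # P(Z > z)
    two_sided = erfc(z / sqrt(2))        # P(|Z| > z)
    return 1 / one_sided, 1 / two_sided

for z in (2.0, 3.5, 5.0):
    one, two = tail_odds(z)
    print(f"{z}-sigma: ~{one:,.0f}-1 one-sided, ~{two:,.0f}-1 two-sided")

# 2-sigma:   ~44-1 one-sided, ~22-1 two-sided        (the "40-1 / 20-1" above)
# 3.5-sigma: ~4,300-1 one-sided, ~2,150-1 two-sided  ("roughly 2000-1")
# 5-sigma:   ~3.5 million-1 one-sided, ~1.7 million-1 two-sided
```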

One thing that does **not** constitute a black spot is an unfounded
accusation of cheating. Such accusations are all too cheap, and, importantly,
since in the cases after "Toiletgate" the alleged evidence has been
move-matching itself, they are not independent either.

The general statistical principle about holes-in-one described here
is called Littlewood's Law.
Another example is that if you play 40 games, chances are you will have
one that presents the 40-1 statistical outlier situation.
And one in every 250 players will have one of those in the last round
of a six-round Swiss. Note that you already need the full model
described in my papers even just to judge this kind of outlier---if you
merely get a lot of matching, you may simply have played an unusually
forcing game.
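
Here is a minimal sketch of the 40-game claim, under the simplifying assumption that each game independently has a 1-in-40 chance of presenting the outlier-level situation; judging whether a real game actually is such an outlier still needs the full model, as noted above.

```python
# If each game independently has a 1-in-40 chance of being a "40-1 outlier",
# then over 40 games you expect one such game, and the chance of seeing at
# least one is roughly 64%.
p_outlier = 1 / 40
games = 40

expected_outliers = games * p_outlier            # 1.0
p_at_least_one = 1 - (1 - p_outlier) ** games    # ~0.64
print(expected_outliers, round(p_at_least_one, 2))
```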

Still another example notes that my Intrinsic Ratings Compendium states "95% confidence intervals" for every IPR, both an "e"xpected one (theoretical, assuming no error in the rest of my modeling) and an "a"ctual one from preliminary field tests that is 1.4 times wider. And note that they are pretty wide, often over 100 Elo points either way from the middle. The stated IPR's are "best estimates", but individual ones can be off the mark from "true" values by the amounts allowed. In particular, for every 40 data points---and there are already well over 40 there---you can expect that one of them will be gonzo-high above that range and another will be gonzo-low, since 38/40 = 95%. This may be so in particular for the runs on some world championship matches. But when you take the aggregate you see that the scheme is on the whole pretty close---the average is within 4 Elo of the players' ratings taking each match at face value, and within 9 Elo if weighting by (games or) moves.
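
The "38/40 = 95%" point is just coverage arithmetic; here is a quick sketch, assuming the intervals really are 95% intervals and treating the entries as independent (the round count of 40 entries is illustrative, since the Compendium has somewhat more).

```python
# With genuine 95% intervals, each entry has a 2.5% chance of landing above
# its interval and a 2.5% chance of landing below it.
n = 40                         # illustrative; the Compendium has well over 40 entries
expected_high = 0.025 * n      # 1.0 expected "gonzo-high" entry
expected_low = 0.025 * n       # 1.0 expected "gonzo-low" entry
p_some_high = 1 - 0.975 ** n   # ~0.64 chance of at least one high outlier
print(expected_high, expected_low, round(p_some_high, 2))
```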

The Cat 20+ *tournaments*,
on the other hand, are systematically low. My current hypothesis is that
this is due to a higher mean difference in ratings among players in
tournaments, whereas the training sets are all games with both players rated
within 20 points of each other; there, as in matches, the perception of skill
parity may increase the circumspection of the play. That would cause higher
IPR's both for the training sets, which set the scale, and for the matches.
But then again, the run on the 2011 Canadian Open, with its much greater
rating disparity between opponents, came out systematically
*high*, so other factors may be present. Plus, as I noted
here, I will be upgrading the fitting procedure, so numbers may change---though
the last change mattered less than ±5 Elo, so I have ignored it.