Chess Engine Anomalies, evidently caused by hash-table collisions and effects of multithreading.  By KWR, Fall 2007.

Screenshots of Deep Fritz 10, running with 3 cores active under Boot Camp on my quad-core 2.66GHz Xeon Intel Mac Pro desktop computer.  DF10 has 1024MB hash, *no* EGTs, and default settings otherwise.  This first screenshot is not a “cold start”---I had been analyzing lines with hash preserved for over a day---but the phenomenon also shows when the program is started up directly in this position, which is known to be drawn.




Deep Fritz is saying that taking the pawn by 74.Bxe4 will eventually put White 7.49 points ahead, which could happen only if it sees that White can make a Queen and win, since a Queen is valued at 9 points.  But the position after 74.Bxe4 is in fact drawn by the reply …Ne5, as shown by the online server of perfect play for positions with 6 or fewer pieces maintained by Knowledge for IT in Germany, next screenshot:




The reason it is drawn even though White is 2 pawns ahead is that Black’s Knight prevents White’s King from moving.  After 75.Bb1 Black must retreat the King (…Kh8, …Kg8, and …Kf8 are equally good), but this only lets White’s King out as far as Kh6, still hemmed in by Black’s Knight.  White can force the Knight to move by stalemating Black’s King, e.g. by 74.Bxe4 Ne5 75.Bb1 Kg8 76.Kh6 Kh8 77.Ba2, but then Black’s Knight can sacrifice itself for stalemate by 77…Ng6! (77…Nf7+ works too).  It looks like White makes progress by 78.h5 Ne5 79.Bb1 releasing the stalemate, but the final point is that 79…Nf7+ 80.Kg6 Nxg5! leaves White with the infamous Rook Pawn + Bishop of the Wrong Color, which is a draw because White can never force Black’s King out of the corner.  I see *nothing else going on* in this position needed to explain the draw.  And Deep Fritz (unlike Rybka) seems to understand the wrong-color Bishop problem perfectly and immediately even when its tablebases are switched off.  


The final point is that when run on a *single-core* machine---or when I drop the number of core threads to 1 in the "Engine parameters" dialog on this machine---this same DF10 program gives only a normal +2.50 or so to White.  Moreover, DF10 has been given this same position except with Black’s Knight starting on a different square from which it reaches e5, namely f3 or c6, and has given it only +2.50, i.e. normal for White being 2 passed pawns ahead.  Hence the 5.81-and-higher evaluations must be classed as a bug.  Further stepping got the eval to over a Queen ahead!





I believe that the "9.30" arises because many entries in the hash table have been overwritten with this value: a "bandwagon effect", just like in crowd psychology!  I was able to continue the illusion well past move 100...
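To illustrate the kind of collision that could seed such a bandwagon, here is a toy sketch in Python.  This is my own illustration of the general mechanism, not Fritz's actual table format: a table entry is addressed by an index plus a short stored signature, and two different 64-bit position keys that happen to agree on both will share one entry, so a score stored for one position comes back when probing a completely different one.

```python
import random

TABLE_SIZE = 1 << 16   # toy size; real engines use millions of entries
SIG_BITS = 8           # deliberately tiny signature, to make false hits easy

table = {}             # index -> (signature, score)

def signature(key):
    return (key >> 48) & ((1 << SIG_BITS) - 1)

def store(key, score):
    table[key % TABLE_SIZE] = (signature(key), score)

def probe(key):
    entry = table.get(key % TABLE_SIZE)
    if entry is not None and entry[0] == signature(key):
        return entry[1]    # may be a false hit if two keys collide on both
    return None

random.seed(1)
key1 = random.getrandbits(64)
key2 = key1 ^ (1 << 20)    # differs only in a bit outside index and signature
store(key1, 930)           # a "+9.30" stored for one position...
assert probe(key2) == 930  # ...is retrieved for a different position
```

Once such a false score is probed, it feeds into other searches and gets stored again under further indices, which is the self-reinforcing "bandwagon" conjectured above.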



...and the "9.30" then re-established itself in the same think:




Repeating the experiment on a *cold start*, meaning exiting the program (out to the welcome screen or altogether, which seem to be equivalent) and re-starting in this position with no prior hash, shows that the effect reproduces:




Another retry shows that sometimes the effect doesn’t show up until very high search depths.  Generally I've found that results with DF10 on multiple cores are *not* reproducible *exactly*, even in cases where single-core results come out the same.



Later in that run I got the eval up as high as 10.43!





Daftness in a different drawn position, one with only 5 pieces.  Still DF10 on 3 cores, 1024MB hash, no EGTs.



Here is a case of what I call "hash-table stiction":


DF10 has paused for 15+ minutes thinking about a line where 65.h7 obviously makes a Queen and wins---yes, this causes a massive "fail-low", but the question is why DF10 was playing 65.Bg4 for so long.  This is with 3 cores, 1024MB hash, and all 3-4-5 EGTs switched on.  Here is another case, with DF10 taking a while to realize its Knight is hanging.



"Hash stiction" does not happen from a cold start, and I only recall it in cases when I've stepped backward from other lines.  However, I think it is likewise caused by hash-table overwrites from lines with lower evals that have higher search-depth tags, hence are given more authority even though they do not relate to the current position!  In all these cases I believe that several or many collisions are involved, not just a single one at a critical point in the search, and that a probabilistic "infection-fighting" model may be the best one to use.


Here's an example of what I call "candystriping", in holiday colours!  This never happens from a "cold start", and it's not a bug per se---it is caused by having stepped through lines proceeding from this position that were run to various search depths.  Aside from clearing hash, the fix is to let the program run to high depth; as you can see, it clears out the alternation at depth 28.



Here finally is an extreme case of the most common hash effect I see. 



After you play 61...f5, the program suddenly wakes up and sees White's win, even at a much lower search depth.  Again this does not happen from a cold start, but this case happened while entering analysis anew, i.e. without my having been through the positions after 63.g5! (or 62.g5) before.  Mind you, it is difficult to see ahead that White's King can stop Black's resulting 3 passers, eventually capturing them all to force Black's King off f6 in a fatal Zugzwang.  Again this is not a bug on the scale of the first two examples, where DF10 gives huge evals in positions known to be completely drawn.  But it illustrates the perils of analyzing with these engines, as DF10 was giving evals under 0.25 to the moves leading up to this position, even though it was part of a winning line I'd recorded months before.  The point is, how many winning lines have I *missed*---in cases where the point is concealed not so much by hidden chess tactics as by hash-table effects?


The possible significance goes beyond chess and chess programs themselves.  The programs employ a memory-mapping scheme called subset-sum (a.k.a. "Zobrist" or subset-XOR) hashing, whose efficacy is a central problem in computational complexity theory.  Several programs, including Fritz, generate their hash-key bits by (unnecessarily) employing pseudorandom generators, which are an even more central topic.  The programs themselves constitute statistical tests of these generators of a far different nature from the tests that are commonly applied.  Insofar as Fritz & Co. exemplify large-scale digital simulations in general, research on anomalies in their behavior may have applications in areas of science that are not "just games", such as molecular simulations for drug synthesis and atmospheric models.
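For concreteness, Zobrist hashing assigns one fixed pseudorandom key to each (piece, square) pair and XORs together the keys of the pieces actually on the board; making a move updates the hash incrementally by XOR-ing out the old placement and XOR-ing in the new.  A minimal Python sketch (the seed, key width, and helper names are my own illustrative choices; real engines also hash side to move, castling rights, and en-passant files):

```python
import random

random.seed(2007)  # a fixed PRNG seed, as engines typically use

PIECES = "PNBRQKpnbrqk"   # 12 piece types: White uppercase, Black lowercase
NUM_SQUARES = 64

# One pseudorandom 64-bit key per (piece, square) pair.
ZOBRIST = {(p, sq): random.getrandbits(64)
           for p in PIECES for sq in range(NUM_SQUARES)}

def zobrist_hash(board):
    """board: dict mapping square index (0..63) -> piece letter."""
    h = 0
    for sq, piece in board.items():
        h ^= ZOBRIST[(piece, sq)]   # XOR in the key for each placed piece
    return h

def make_move(h, piece, from_sq, to_sq):
    """Incremental update: XOR out the old square, XOR in the new one."""
    return h ^ ZOBRIST[(piece, from_sq)] ^ ZOBRIST[(piece, to_sq)]

# The incremental update agrees with hashing the new position from scratch:
before = {4: 'K', 60: 'k', 28: 'P'}
after  = {4: 'K', 60: 'k', 36: 'P'}
h1 = make_move(zobrist_hash(before), 'P', 28, 36)
assert h1 == zobrist_hash(after)
```

Because distinct positions are mapped into 64 bits, collisions of the kind discussed above are inevitable in principle; how often they occur in practice depends on the quality of those pseudorandom keys, which is the point of the paragraph above.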




Dr. Kenneth W. Regan                          12/28/07

Department of CSE

University at Buffalo (SUNY)

201 Bell Hall

Buffalo, NY 14260-2000