114 points by chmaynard 4 days ago | 61 comments
intrasight 3 days ago |
In 1985, as an EE student, I took a course in modern CPU architectures. I still recall having my mind blown when learning about branch prediction and speculative execution. It was a humbling moment - as were pretty much all of my studies at CMU.
bee_rider 3 days ago |
But the memorization capacity of the branch predictor must be a trade-off, right? This generate_random_value function is presumably impossible to predict using heuristics, so the question is how often we actually encounter 30k-long branch patterns like that.
Which isn’t to say I have evidence to the contrary. I just have no idea how useful this capacity actually is, haha.
Night_Thastus 3 days ago |
It's a tiny, trivial example with 1 branch that behaves in a pseudo-random way (random, but fixed seed). I'm not sure that's a really good example of real world branching.
How would the various branch predictors perform when the branch taken varies from 0% likely to 100% likely, in say, 5% increments?
How would they perform when the contents of both paths are very heavy, which involves a lot of pipeline/SE flushing?
How would they perform when many different branches all occur in sequence?
How costly are their branch mispredictions, relative to one another?
Without info like that, this feels a little pointless.
Paul_Clayton 3 days ago |
(Tags could be made to differ by, e.g., XORing a limited amount of global history with the hash of the address.)
It is also possible that the AMD Zen 5 and Apple M4 have similar unused predictor capacity and simply have much larger predictors.
I did not think even TAGE predictors used 5k branch history, so there may be some compression of the data (which is only pseudorandom).
It might be interesting to unroll the loop (with sufficient spacing between branches to ensure different indexing) to see if that measurably affected the results.
Of course, since "write to buffer" is just a store and an increment, and the compiler should be able to guarantee no buffer overflow (buffer sized for the worst case) and that the store has no side effects, the branch could be predicated: select either the new value or the old value, and always store. This would be a little extra work and might cause store-queue issues (if not all store-queue entries can have the same address but different version numbers), so it might not be a safe optimization.
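The predication idea described above can be sketched as follows - instead of branching on whether to append, always store (selecting between the new value and whatever is already there) and advance the index by 0 or 1. This assumes, as the comment stipulates, that the buffer is allocated for the worst case; function names are illustrative.

```c
#include <stddef.h>
#include <stdint.h>

/* branchy original: the predictor must guess cond */
size_t append_if(uint32_t *buf, size_t n, uint32_t v, int cond) {
    if (cond) buf[n++] = v;
    return n;
}

/* predicated version: unconditional store, conditional select */
size_t append_predicated(uint32_t *buf, size_t n, uint32_t v, int cond) {
    uint32_t old = buf[n];      /* worst-case allocation makes this read safe */
    buf[n] = cond ? v : old;    /* typically compiles to a select/cmov, not a jump */
    return n + (cond != 0);     /* advance by 0 or 1 */
}
```

Both produce the same buffer contents and final index; the predicated form trades a possible misprediction for an extra load and an unconditional store on every iteration.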
stevefan1999 2 days ago |
The simplest binary saturating counter, à la the bimodal predictor, already achieves more than a 90% success rate. What comes next is just extensions of that - use multiple bimodal predictors and build a forest of them - but the core idea of treating branch prediction as a Bayesian problem never fades.
It is a combined effort between hardware design and the software compiler, though.
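The bimodal baseline mentioned above is small enough to sketch in full: a 2-bit saturating counter per branch, where states 0-1 predict not-taken, 2-3 predict taken, and each outcome nudges the counter one step, saturating at the ends.

```c
/* 2-bit saturating counter ("bimodal") predictor for one branch */
typedef struct { unsigned ctr; } bimodal; /* ctr in [0,3], init 0 */

static int predict(const bimodal *p) { return p->ctr >= 2; }

static void update(bimodal *p, int taken) {
    if (taken) { if (p->ctr < 3) p->ctr++; } /* saturate at strongly-taken */
    else       { if (p->ctr > 0) p->ctr--; } /* saturate at strongly-not-taken */
}
```

The two bits of hysteresis are why this does so well on real code: a loop-closing branch that is taken many times and then exits mispredicts only at the exit, and the counter (now at 2, still predicting taken) gets it right again when the loop is next entered.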
infinitewars 3 days ago |
At least that's my prediction.
rsmtjohn 2 days ago |
The clearest wins I've found: replacing conditional returns in hot loops with branchless arithmetic. The predictor loves it when you stop giving it choices. Lookup tables for small bounded ranges are another one that consistently surprises me with how much headroom there still is.
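Both patterns from this comment can be sketched with hypothetical functions (these are illustrative examples, not the commenter's code): a conditional-return chain rewritten as branchless selects, and a lookup table over a small bounded range.

```c
/* branchy: the predictor must guess which early return fires */
int clamp_branchy(int x) {
    if (x < 0)   return 0;
    if (x > 255) return 255;
    return x;
}

/* branchless: same result via selects, which most compilers
 * lower to conditional moves rather than jumps */
int clamp_branchless(int x) {
    int lo = x < 0 ? 0 : x;
    return lo > 255 ? 255 : lo;
}

/* lookup table over a small bounded range, replacing a chain of
 * comparisons (month in 1..12; leap years ignored in this sketch) */
static const int DAYS_IN_MONTH[13] =
    {0, 31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31};
int days_in_month(int month) { return DAYS_IN_MONTH[month]; }
```

In both cases there is simply no conditional jump left for the predictor to get wrong; the cost moves to a data dependency (the select or the table load) instead.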
themafia 3 days ago |
https://blog.cloudflare.com/branch-predictor/
Should be titled: How I Learned to Stop Worrying and Love the Branch Predictor
atq2119 2 days ago |
Are they really keeping a branch history that's 30k deep? Or is there some kind of hashing going on, and AMD's hash just happens to be more attuned to the PRNG used here?
Would be interesting to see how robust these results are against the choice of PRNG and seed.
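On the hashing question: history-based predictors in the TAGE family do not index tables with a raw 30k-bit history; they fold a long global-history register down to a short index, commonly XORed with bits of the branch address. A sketch of that folding, with illustrative widths (no claim that any real CPU uses these exact parameters):

```c
#include <stdint.h>

#define TABLE_BITS 12  /* 4096-entry table (illustrative size) */

/* fold a 64-bit history into TABLE_BITS bits by XORing chunks */
static uint32_t fold(uint64_t h) {
    uint32_t r = 0;
    while (h) {
        r ^= (uint32_t)h & ((1u << TABLE_BITS) - 1);
        h >>= TABLE_BITS;
    }
    return r;
}

/* table index = folded history XOR low branch-address bits */
static uint32_t table_index(uint64_t history, uint64_t pc) {
    return (fold(history) ^ (uint32_t)(pc >> 2)) & ((1u << TABLE_BITS) - 1);
}
```

Because different hash choices alias different history patterns, one microarchitecture's fold can happen to line up better with a particular PRNG's bit sequence than another's - which is exactly why varying the PRNG and seed, as suggested above, would be a good robustness check.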
stefantalpalaru about 15 hours ago |
But that's clearly not right, because apparently the specific data it's branching on matters too? Like, "test memory location X, and branch at location Y" - it remembers both the specific memory location and which specific branch branches off of it? That's really impressive; I didn't think branch predictors worked like that.
Or does it learn the exact pattern? "After the pattern ...0101101011000 (each 0/1 representing the branch not taken/taken), it's probably 1 next time"?
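The second guess is essentially how classic two-level adaptive predictors work: a shift register records the last N outcomes (the "...0101101011000" string in the question), and that bit pattern indexes a table of saturating counters. A minimal sketch with illustrative sizes, not any specific CPU's design:

```c
#include <stdint.h>

#define HIST_BITS 12
static uint32_t ghr;                   /* global history register */
static uint8_t  pht[1 << HIST_BITS];   /* 2-bit counters, init 0 */

/* predict from the counter selected by the recent outcome pattern */
static int predict_from_pattern(void) {
    return pht[ghr & ((1u << HIST_BITS) - 1)] >= 2;
}

/* train the selected counter, then append the outcome to the history */
static void train(int taken) {
    uint32_t i = ghr & ((1u << HIST_BITS) - 1);
    if (taken) { if (pht[i] < 3) pht[i]++; }
    else       { if (pht[i] > 0) pht[i]--; }
    ghr = (ghr << 1) | (taken & 1);
}
```

After enough repetitions, any pattern that fits in the history window becomes perfectly predictable: the counter indexed by "...0101" learns one outcome and the counter indexed by "...1010" learns the other.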