212 points by birdculture 4 days ago | 71 comments | View on ycombinator
orlp 1 day ago |
kleiba2 1 day ago |
for (int i = 0; i < 1000; i++) {
small_numbers[smlen] = numbers[i];
smlen += (numbers[i] < 500);
}
is much faster than the conventional version with a conditional branch: for (int i = 0; i < 1000; i++) {
if (numbers[i] < 500) {
small_numbers[smlen] = numbers[i];
smlen += 1;
}
}
Been staring at this for a bit, but my brain is not working properly today: could someone please explain how these to loops compute the same value for small_numbers[smlen]?teo_zero 1 day ago |
> #define BLQS_CMP(a, b) ((a) < (b))
A function that returns true when one operand is Less Than the other, should be called BLQS_LT. The CMP abbreviation is idiomatic for a function that returns -1,0, or 1.
bagxrvxpepzn 1 day ago |
This is true but it's misleading. The reality is that modern out-of-order superscalar CPUs are so good at branch prediction that it's nearly always better to branch in a tight loop (to allow more ILP) than introduce a data-dependency in a tight loop (which limits ILP). Cf. https://mazzo.li/posts/value-speculation.html, https://yarchive.net/comp/linux/cmov.html
Branchless code should generally be avoided because modern CPUs are not designed to optimize that use case. There are exceptions of course, but those are exceptions.
quuxplusone 1 day ago |
T k; // default-construct
if (i > 0) k = left[--i]; // copy-assign
This fairly obviously could be replaced with "copy-construct." Could it be replaced with "move-construct"? I don't know.
Again, in `partition_small`, we have T swbuf[SMALLPART];
which default-constructs a bunch of Ts. I think we're just going to overwrite that memory in a moment anyway, so constructing all those Ts is a waste of cycles; but I'm not sure.All of my "I don't knows" and "I'm not sures" are due to my own lack of digging into the code; I'm sure one could find out if one really looked.
None of that matters if you're just sorting `int` or the benchmarked `struct entry`. But it matters a great deal if you're taking the README literally and trying to sort "types with higher copy costs [...] (such as strings)".
mgaunard 1 day ago |
Why not compare against that?
davidkwast 1 day ago |
Tomte 1 day ago |
Obviously, readable code wins, but at least once I had the computing time budget to be able to have a central function go straight through by calculating all five or so variations (it was about several kinds of encodings of the output values) and just pick the correct one in the end. That felt good.
kvuj 1 day ago |
>
>for (int i = 0; i < 1000; i++) {
> small_numbers[smlen] = numbers[i];
> smlen += (numbers[i] < 500);
>}
Excuse my terrible ignorance but isn't there still a branch? If numbers[i] < 500 then 1 else 0? I would think something like addition plus a bit comparison would avoid said branch. Unless compilers already optimize the code, but then wouldn't they also optimize the naive piece of code?
undefined 1 day ago |
undefined 1 day ago |
dekdrop 1 day ago |
Aardwolf about 23 hours ago |
globular-toast 1 day ago |
benj111 about 22 hours ago |
So if you use Rust, you get these by simply calling [T]::sort(_unstable). Great performance out of the box :)
On my machine (Apple M2), using the benchmarks from the repository on Apple clang 17 and Rust 1.98 nightly:
And now for a cool party trick, let's repeat the 50 million doubles experiment again, but have the first 90% already sorted, last 10% random: