Very neat. I never would have considered doing a sparse bucket sort :)
Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.
(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.
I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor
Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations
Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
10*32=320 cycles assuming input is in ZP.
;Collect sorted values
beq next1: ;Skip if empty.
lda #value1 ;Value corresponding to this bucket.
bne *-2 ;In case not unique.
sty Buckets ;Empty the buckets.