[CSDb] - User Forums

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Sorting

2007-10-08 16:08

Oswald

Registered: Apr 2002
Posts: 5127

Sorting

are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :)

... 193 posts hidden. Click here to view all posts....

2007-10-15 08:53

ChristopherJam

Registered: Aug 2004
Posts: 1424

Exactly :)

I traverse a singly linked actor_next list in the end-of-screen interrupt running their movement scripts, and after each has been moved (unless it has just died) I insert it into a singly linked actor_prev list.

Then once all the actors have been processed, I reverse the list.

ldx actor_prev,y
tay
sta actor_next,x
beq done
...

2007-10-15 11:29

JackAsser

Registered: Jun 2002
Posts: 2038

Quote: Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)

Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.

2007-10-15 11:37

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quote: Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.

Unfortunately we still need to sort the first eight sprites in order to reuse them as soon as possible. Unless you use a radically different approach like what Cruzer suggested of course.

There may be cases where we can determine ahead of time that there will be "enough" free raster time anyway and don't perform a perfect sort on some segment of sprites, but as to how you'd implement such a beast in practice..

2017-08-12 09:39

ChristopherJam

Registered: Aug 2004
Posts: 1424

Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)

I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.

After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.

For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.

By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.

Total time, a shade under 40 raster lines \O/ (yes, doynax still wins /o\)

Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort

2017-08-13 11:04

ChristopherJam

Registered: Aug 2004
Posts: 1424

removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.

Down to 38.5 raster lines now, only 12% slower than radix sort.

2017-08-13 19:26

doynax
Account closed

Registered: Oct 2004
Posts: 212

Very neat. I never would have considered doing a sparse bucket sort :)

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)

2017-08-14 07:08

ChristopherJam

Registered: Aug 2004
Posts: 1424

Quoting doynax

Very neat. I never would have considered doing a sparse bucket sort :)

Thanks!

Quote:

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

Yes, I wasn't sure how best to benchmark. Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.

FWIW most of the time is "per used bucket", so if there are only 24(?) distinct Y values, I suspect I'd come out ahead, but if there are that many horizontally aligned enemies a zone sort's likely to be sufficient anyway.

I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor, for a fairly unappealing total of (34+62)*32=3072 cycles per frame (48.7 raster lines)

I suspect that if one just wanted to minimise CPU time expended by the sorter in in the border, a plain bucket sort would be better; insertion's only 14 cycles per actor (7 rasters!), and the times it takes a while for 'fetch next' are precisely when the next actor is a fair way down the screen from the one you just dealt with, so timing wouldn't be quite so critical anyway.

Quote:

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)

Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations

2017-08-14 12:18

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting ChristopherJam

Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.

Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Quoting ChristopherJam

I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor

Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Quote:

Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations

Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.

2017-08-14 13:34

ChristopherJam

Registered: Aug 2004
Posts: 1424

Quoting doynax

Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Indeed. And for the first 8 sprites, you can just LDX immediate directly from $017f down.

Quote:

Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Exactly. The plain bucket sort has the same advantage (TBH I pinched the idea from Playstation order-tables - it's how they z-sort the texture mapped triangles, by inserting them into a linked list of null graphics primitives)

Quote:

Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.

Yes; I got there eventually :)

2017-09-04 16:19

Rastah Bar
Account closed

Registered: Oct 2012
Posts: 336

A brute force approach seems already very efficient. I think that can be done in 35.6 lines or less for 32 sprite positions and 220 buckets.

;Fill Buckets
ldx input
inc Buckets,x
...
ldx input+31
inc Buckets,x
-------------
10*32=320 cycles assuming input is in ZP.

;Collect sorted values
ldy Buckets
beq next1:   ;Skip if empty.
lda #value1  ;Value corresponding to this bucket.       
pha
dey
bne *-2      ;In case not unique.
sty Buckets  ;Empty the buckets.
next1:
ldy Buckets+1
beq next2:
lda #value2
pha
dey
bne *-2
sty Buckets
next2:
;etc.

If all input numbers are unique the collection stage costs 7 cycles for empty buckets and 19 cycles for the buckets with 1 input value for a total of 7*(220-32)+19*32+320 = 35.6 lines.

If not all input numbers are unique, the cost for a bucket with k numbers is 12+(k-1)*8+7 cycles. Adding one more to a non-empty bucket costs 8 cycles, but since there is then one bucket less with one contribution, and one more empty bucket, one gains 4 cycles overall.

But I could have made a mistake in these calculations or missed something else (initialization, overhead ...).

Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | ... | 21 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

algorithm
TheRyk/MYD!
Peacemaker/CENSOR/Hi..
aNdy/AL/Cosine
Scrap/Genesis Project
Matt
RetroGL
Alakran_64
Guests online: 400

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Codeboys & Endians  (9.7)
4 Mojo  (9.6)
5 Coma Light 13  (9.6)
6 Edge of Disgrace  (9.6)
7 Signal Carnival  (9.6)
8 Wonderland XIV  (9.5)
9 Uncensored  (9.5)
10 Comaland 100%  (9.5)

Top onefile Demos

1 Nine  (9.7)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.5)
6 Scan and Spin  (9.5)
7 Onscreen 5k  (9.5)
8 Grey  (9.5)
9 Dawnfall V1.1  (9.5)
10 Rainbow Connection  (9.5)

Top Groups

1 Artline Designs  (9.3)
2 Booze Design  (9.3)
3 Performers  (9.3)
4 Oxyron  (9.3)
5 Censor Design  (9.3)

Top Cover Designers

1 Duce  (9.8)
2 Electric  (9.8)
3 Junkie  (9.6)
4 The Elegance  (9.5)
5 Mermaid  (9.3)

Page generated in: 0.123 sec.