Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in 
CSDb User Forums


Forums > C64 Coding > Sorting
2007-10-08 18:08
Oswald

Registered: Apr 2002
Posts: 4092
Sorting

are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :)
 
... 149 posts hidden. Click here to view all posts....
 
2017-08-12 11:39
ChristopherJam

Registered: Aug 2004
Posts: 656
Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)

I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.

After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.

For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.

By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.

Total time, a shade under 40 raster lines \O/ (yes, doynax still wins… /o\)

Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort
2017-08-13 13:04
ChristopherJam

Registered: Aug 2004
Posts: 656
…removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.

Down to 38.5 raster lines now, only 12% slower than radix sort.
2017-08-13 21:26
doynax

Registered: Oct 2004
Posts: 209
Very neat. I never would have considered doing a sparse bucket sort :)

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
2017-08-14 09:08
ChristopherJam

Registered: Aug 2004
Posts: 656
Quoting doynax
Very neat. I never would have considered doing a sparse bucket sort :)


Thanks!

Quote:
Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.


Yes, I wasn't sure how best to benchmark. Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.

FWIW most of the time is "per used bucket", so if there are only 24(?) distinct Y values, I suspect I'd come out ahead, but if there are that many horizontally aligned enemies a zone sort's likely to be sufficient anyway.

I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor, for a fairly unappealing total of (34+62)*32=3072 cycles per frame (48.7 raster lines)

I suspect that if one just wanted to minimise CPU time expended by the sorter in in the border, a plain bucket sort would be better; insertion's only 14 cycles per actor (7 rasters!), and the times it takes a while for 'fetch next' are precisely when the next actor is a fair way down the screen from the one you just dealt with, so timing wouldn't be quite so critical anyway.


Quote:
(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)


Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
2017-08-14 14:18
doynax

Registered: Oct 2004
Posts: 209
Quoting ChristopherJam
Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.
Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Quoting ChristopherJam
I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Quote:
Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
2017-08-14 15:34
ChristopherJam

Registered: Aug 2004
Posts: 656
Quoting doynax

Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Indeed. And for the first 8 sprites, you can just LDX immediate directly from $017f down.

Quote:
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Exactly. The plain bucket sort has the same advantage (TBH I pinched the idea from Playstation order-tables - it's how they z-sort the texture mapped triangles, by inserting them into a linked list of null graphics primitives)

Quote:
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
Yes; I got there eventually :)
2017-09-04 18:19
Color Bar

Registered: Oct 2012
Posts: 133
A brute force approach seems already very efficient. I think that can be done in 35.6 lines or less for 32 sprite positions and 220 buckets.
;Fill Buckets
ldx input
inc Buckets,x
...
ldx input+31
inc Buckets,x
-------------
10*32=320 cycles assuming input is in ZP.

;Collect sorted values
ldy Buckets
beq next1:   ;Skip if empty.
lda #value1  ;Value corresponding to this bucket.       
pha
dey
bne *-2      ;In case not unique.
sty Buckets  ;Empty the buckets.
next1:
ldy Buckets+1
beq next2:
lda #value2
pha
dey
bne *-2
sty Buckets
next2:
;etc.

If all input numbers are unique the collection stage costs 7 cycles for empty buckets and 19 cycles for the buckets with 1 input value for a total of 7*(220-32)+19*32+320 = 35.6 lines.

If not all input numbers are unique, the cost for a bucket with k numbers is 12+(k-1)*8+7 cycles. Adding one more to a non-empty bucket costs 8 cycles, but since there is then one bucket less with one contribution, and one more empty bucket, one gains 4 cycles overall.

But I could have made a mistake in these calculations or missed something else (initialization, overhead ...).
2017-09-04 22:56
ChristopherJam

Registered: Aug 2004
Posts: 656
Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.

(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )
2017-09-05 06:53
Oswald

Registered: Apr 2002
Posts: 4092
35-40 lines for 32 sprites is crazy good guys 1.x line per sprite.
2017-09-05 13:52
Color Bar

Registered: Oct 2012
Posts: 133
Quote: Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.

(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )


Oh, I see, thanks.

The fact that multiple sprites can have the same Y values makes the problem considerably harder, it seems.
Previous - 1 | ... | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | ... | 17 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Holy Moses/Role
Acidchild/Padua
Joe/WD/HXS/EXON/Arts..
Jammer/TooMany
Mike
Rulas_International
Zoolon/AC64SID
Soren/Camelot, MoN, ..
Mason/Unicess
Guests online: 39
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 Coma Light 13  (9.6)
4 The Shores of Reflec..  (9.6)
5 Lunatico  (9.6)
6 Quad Core 100%  (9.5)
7 Comaland 100%  (9.5)
8 Incoherent Nightmare  (9.5)
9 Wonderland XII  (9.5)
10 Comaland  (9.5)
Top onefile Demos
1 Pandemoniac Part 2 o..  (9.6)
2 Synthesis  (9.6)
3 Dawnfall V1.1  (9.5)
4 Daah, Those Acid Pil..  (9.5)
5 Field Sort  (9.4)
6 Treu Love [reu]  (9.4)
7 Dawnfall  (9.3)
8 Globe 2016 [reu]  (9.3)
9 KAOS 64  (9.3)
10 Hardware Accelerated..  (9.2)
Top Groups
1 Booze Design  (9.4)
2 Censor Design  (9.4)
3 Oxyron  (9.4)
4 Crest  (9.3)
5 Finnish Gold  (9.3)
Top NTSC-Fixers
1 Horizon  (9.7)
2 Pudwerx  (9.6)
3 Stormbringer  (9.6)
4 Fungus  (9.6)
5 Moloch  (9.3)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 0.732 sec.