Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in 
CSDb User Forums


Forums > C64 Coding > Sorting
2007-10-08 18:08
Oswald

Registered: Apr 2002
Posts: 4124
Sorting

are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :)
 
... 149 posts hidden. Click here to view all posts....
 
2007-10-14 20:47
Oswald

Registered: Apr 2002
Posts: 4124
it should be noted tho, that worst case sorting never happens with progressive sort, especially if you optimize for only adding/removing 1 sprite to the sort per frame. I'm also realizing that the zp method presented above is not good, as the rare case with insertion sort (my current approach) is when there's no need to swap.
2007-10-14 21:29
JackAsser

Registered: Jun 2002
Posts: 1249
Quote: hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.

I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.

There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.

A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg

#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)

SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)

will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.


Hmmm... I was sure sorting networks were O(n log n) but apparantly Ajtai, Komlós and Szemerédi have O(log n) sorting network. I.e. the maximum depth (or comparators executed) is log n. It feels highly counter intuitive imo but I won't question it. Anyways, point is that while radix-sort is O(n*m) (where n is # of input and m is # of digits), an optimal sorting network is O(log n).

Now, this O() notation always hides huges constants for a c64 but for a f.e. 48 actor sorter it should require at most 48*log2(48) comparators statically in memory and at most log2(48) of those are executed at any given input. => NICE!
2007-10-15 02:01
ChristopherJam

Registered: Aug 2004
Posts: 682
Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)
2007-10-15 08:45
Oswald

Registered: Apr 2002
Posts: 4124
actually I have just arrived to what you're talking about: insertion sort on a linked list of actors :) as one of the drawbacks of insertion sort is the need to move many elements, but with a linked list you can insert the actor into the right position without shuffling bytes around.

ldx actorlink,x
cmp actory,x
bcc nextelement

- insert actor into chain & return to the outer unrolled loop-

nextelement

ldy actorlink,x
cmp actory,y
bcc nextelement

only 11 cycles per comparison. you even have the prev actor's link in one of the registers, so it can be a one way chain easily. guess you already do it like this ;)
2007-10-15 10:53
ChristopherJam

Registered: Aug 2004
Posts: 682
Exactly :)

I traverse a singly linked actor_next list in the end-of-screen interrupt running their movement scripts, and after each has been moved (unless it has just died) I insert it into a singly linked actor_prev list.

Then once all the actors have been processed, I reverse the list.

ldx actor_prev,y
tay
sta actor_next,x
beq done
...

2007-10-15 13:29
JackAsser

Registered: Jun 2002
Posts: 1249
Quote: Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)


Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.
2007-10-15 13:37
doynax

Registered: Oct 2004
Posts: 211
Quote: Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.


Unfortunately we still need to sort the first eight sprites in order to reuse them as soon as possible. Unless you use a radically different approach like what Cruzer suggested of course.

There may be cases where we can determine ahead of time that there will be "enough" free raster time anyway and don't perform a perfect sort on some segment of sprites, but as to how you'd implement such a beast in practice..
2017-08-12 11:39
ChristopherJam

Registered: Aug 2004
Posts: 682
Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)

I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.

After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.

For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.

By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.

Total time, a shade under 40 raster lines \O/ (yes, doynax still wins… /o\)

Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort
2017-08-13 13:04
ChristopherJam

Registered: Aug 2004
Posts: 682
…removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.

Down to 38.5 raster lines now, only 12% slower than radix sort.
2017-08-13 21:26
doynax

Registered: Oct 2004
Posts: 211
Very neat. I never would have considered doing a sparse bucket sort :)

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | ... | 17 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Stone/Prosonix/Offence
Erol/Hitmen/TempesT
CreaMD/React
Unkle K/Reset Magazi..
Guests online: 44
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 Coma Light 13  (9.6)
4 The Shores of Reflec..  (9.6)
5 Lunatico  (9.6)
6 Comaland 100%  (9.5)
7 Incoherent Nightmare  (9.5)
8 Wonderland XII  (9.5)
9 Comaland  (9.5)
10 Wonderland XIII  (9.5)
Top onefile Demos
1 FMX Music Demo  (9.6)
2 Pandemoniac Part 2 o..  (9.6)
3 Daah, Those Acid Pil..  (9.5)
4 Synthesis  (9.5)
5 Dawnfall V1.1  (9.5)
6 Dawnfall  (9.4)
7 Treu Love [reu]  (9.4)
8 Field Sort  (9.4)
9 Pro Memoria 4  (9.3)
10 KAOS 64  (9.3)
Top Groups
1 Oxyron  (9.4)
2 Booze Design  (9.4)
3 Censor Design  (9.3)
4 Crest  (9.3)
5 The Judges  (9.3)
Top Mega Swappers
1 Nightshade  (9.4)
2 Aslive  (9.3)
3 R.C.S.  (9.3)
4 Dishy  (9.2)
5 Calypso  (9.2)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 0.312 sec.