Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in 
CSDb User Forums


Forums > C64 Coding > Sorting
2007-10-08 18:08
Oswald

Registered: Apr 2002
Posts: 4058
Sorting

are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :)
2007-10-08 18:23
doynax

Registered: Oct 2004
Posts: 209
I tried both a bubble sort, an insertion sort and a bucket sort for my game. All of them progressive. The insertion sort was usually the fastest but in the end the bucket sort was the most stable and reliable (i.e. fever missed frames) even with the loss of precision.
It's just not worth having to worry about spawning new sprites at an appropriate position, and not let things move too fast lest the list get too far out of sync.

I can't remember precisely how much raster time it consumes right now but it isn't big deal really compared with all the other work involved.
2007-10-08 18:30
Oswald

Registered: Apr 2002
Posts: 4058
what is the name of your game? I'd like to check it :) I have made an unrolled insertion sort, which will be progressive (if I ever got to that). it takes 22 cycles including the cmp/bcc when a number is moved (swapped) around.
2007-10-08 18:40
Burglar

Registered: Dec 2004
Posts: 782
some games use sprite regions, so there's no need for sorting, just put the sprites in the right regions and you're done. I think Turrican uses that method.

you might want to check out games from Hannes Sommer/Cosmos Designs as well (ie Lions of the Universe +7 ), those should have some good multiplexers...
2007-10-08 18:42
doynax

Registered: Oct 2004
Posts: 209
It far from finished so maybe I shouldn't be giving advice, nevertheless I always appreciate a chance to plug my newest project. http://www.minoan.ath.cx/~doynax/6502/Balls_of_the_Scrolling_Th..

Anyway, the bucket sort probably takes something like 50 cycles per sprite. But that's still far better than the worst case of the insertion sort.
2007-10-08 18:44
Groepaz

Registered: Dec 2001
Posts: 8120
uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)

the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets.
2007-10-08 18:48
Oswald

Registered: Apr 2002
Posts: 4058
groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines.
2007-10-08 18:48
JackAsser

Registered: Jun 2002
Posts: 1227
Quote: uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)

the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets.


Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen.
2007-10-08 18:53
doynax

Registered: Oct 2004
Posts: 209
Quote: Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen.

Not the worst case perhaps. But the difference I saw between the best and worst case, even when catering to the multiplexer when inserting things, was still something like 300%.
And in an action game it's all about maintaining a smooth framerate no matter what happens.
2007-10-08 18:53
Oswald

Registered: Apr 2002
Posts: 4058
jack, well its just ammazing to me that how slow this operation really is. without tricks sorting alone can eat most of a frame. Taking into account how unoptimized games used to be its nice they can pull off their plexed sprites, and the rest in 50fps :)

edit: to gro, 132 rasterlines is the time it takes to sort 32 totally random numbers.
2007-10-08 18:59
Groepaz

Registered: Dec 2001
Posts: 8120
Quote:

groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines.


nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did)
2007-10-08 19:01
doynax

Registered: Oct 2004
Posts: 209
Quote:

nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did)
I actually do a progressive bucket sort which seems to help a bit. Bullets and enemy formations and such tends to move in step with each other after all.
2007-10-08 19:06
Oswald

Registered: Apr 2002
Posts: 4058
doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P
2007-10-08 19:09
Groepaz

Registered: Dec 2001
Posts: 8120
Quote:

also rejecting 9th sprites seems to be messy


why reject? just make it so it's ignored :) (ie bucket can take more than 8 sprites, but plexer will only use first 8)
2007-10-08 19:13
doynax

Registered: Oct 2004
Posts: 209
Quote: doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P

The d800 copy is double buffered and evenly distributed across 16 frames, so it really doesn't take much raster time at all ;)
As for the bucket sort I don't quite see what you mean. The
output is still just an ordered list of sprites, just like for the insertion sort.
As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote).
2007-10-08 19:18
Groepaz

Registered: Dec 2001
Posts: 8120
Quote:

As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote).


thats probably how most game multiplexers work. (look at SEUCK...argl =D)
2007-10-08 19:23
Oswald

Registered: Apr 2002
Posts: 4058
you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :)
2007-10-08 19:26
doynax

Registered: Oct 2004
Posts: 209
Quote: you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :)

Really? Feel free to disassemble the code :)

As for the sprites if they aren't perfectly in order then the worst that can happen is that they won't be as tightly packed as possible. If it's properly written anyway.
2007-10-08 19:28
Groepaz

Registered: Dec 2001
Posts: 8120
keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap)
2007-10-08 19:38
doynax

Registered: Oct 2004
Posts: 209
Quote: keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap)

In fact it's often useful to write them in reverse order to insure reasonably good priorities in top-down games and the like (i.e. so lower sprites overlaps the higher ones).
Another possibility is to "center" the multiplexer in such a way as to insure the priority of a single sprite (i.e. always assign the player to the first sprite).
2007-10-08 19:42
Oswald

Registered: Apr 2002
Posts: 4058
you might miss sprites otherwise possible to display with the bucket sort.


spr1=70
spr2=74

bucket sort result:

spr2
spr1

then you fire an irq for the next sprite that will be spr2, and already late from spr1
2007-10-08 19:45
doynax

Registered: Oct 2004
Posts: 209
Quote: you might miss sprites otherwise possible to display with the bucket sort.


spr1=70
spr2=74

bucket sort result:

spr2
spr1

then you fire an irq for the next sprite that will be spr2, and already late from spr1


The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler.
2007-10-08 19:46
cadaver

Registered: Feb 2002
Posts: 1042
If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :)
2007-10-08 19:46
Groepaz

Registered: Dec 2001
Posts: 8120
now you are assuming too much :) ofcourse the displayer code must be written in a way that such thing can not happen.
2007-10-08 20:30
Oswald

Registered: Apr 2002
Posts: 4058
Quote: If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :)

what was so bad about it, when you could get it 'right' afterall ?
2007-10-08 20:33
Oswald

Registered: Apr 2002
Posts: 4058
Quote: The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler.

hmm 'before the new' sprite approach allows more tight packing ;) clever solution nevertheless.
2007-10-08 20:38
cadaver

Registered: Feb 2002
Posts: 1042
Oswald: well it was never completely "right" in those games, you could create unnecessary artifacts for example when several motorcycles (2x2 sprites) were coming at you and you jumped. But for example, if you have mostly airborne enemies coming at you from several heights and you're mostly on ground, it doesn't matter.
2007-10-09 12:22
Oswald

Registered: Apr 2002
Posts: 4058
radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines.
2007-10-09 12:31
doynax

Registered: Oct 2004
Posts: 209
Quote: radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines.

The bucket sort we talked about earlier is just bucket sort is just a special case of the radix sort. Except it sorts in one step and with some loss of precision to reduce the number of buckets.

Anyway an optimized and unrolled two-step radix sort shouldn't be quite as bad as that. It ought to be possible to get it well below 55 lines. In fact it'd probably be faster than a bucket sort combined with a bubble sort fix-up stage.

Dammit.. Now I have to implement one just to see how it turns out ;)
2007-10-09 13:05
Oswald

Registered: Apr 2002
Posts: 4058
here is my implementation:

there's 2 bucket arrays, each 256 bytes. each bucket can hold max 15 elements, 16th is the counter of the elements. adress of fex. bucket 3's element 5 is: 3*16+5.

bucket1 is used for the first pass
bucket2 for the 2nd.

each code snipplet is a code segment out of an unrolled loop. (ie stuff is missing like lda bucket1+bucketnr*16,x)

I dont see how this could be much faster :) some1 proove me wrong :)


;pass1
	ldx sort+0
	ldy spry,x        ; get sprite y coord
	lax and#0fmul16,y ;get bucket startadress 
	ora bucket1+15,x  ;get adress INside the bucket
	inc bucket1+15,x  ;inc nr of elements
	tax		
smod	sty bucket1,x	  ;store spr to bucket



;pass2

	ldx bucket1+15
	beq next	;empty bucket dont bother
blp1	

	ldy bucket1,x
	lda spry,y         ;get spr y
	and #%11110000     ;upper 4 bits only this time
	tay                ;which is exactly our pointer
	ora bucket2+15,y   ;addy inside bucket
	inc bucket2+15,y   ;inc bucket counter
	tay                ;final bucket addy
	sta bucket2,y      ;store sprite
	dex
	bne blp1	   ;any more in bucket1?

next

;pass3
        ldy #$00 ; nr of sprites counter this is done only once

	ldx bucket2+15
	beq next             ;empty bucket?
blp2
	lda bucket2,x        ;get sprite nr
smod	sta final,y          ;store to final list
	iny                  ;final spr count
	dex           
	bne blp2             ;any more in curr. bucket?



edit, ok, well some bugs there, destroying regs, but assuming they werent.. fixing those would make it even more slower tho
2007-10-09 13:37
doynax

Registered: Oct 2004
Posts: 209
I'm attempting something like this:

	;; initialize the buckets
	ldx #$81
	!for i,0,16 {
	sta lsd_bucket+i
	}
	lda #$fe
	!for i,0,6 {
	sax msd_bucket+i*2+0
	stx msd_bucket+i*2+1
	sbx #-3
	}

	;; lsd sort
	lda #$0f
	!for i,0,32 {
	ldx actor_ypos+i
	sbx #$00
	ldy lsd_bucket,x
	sty actor_link+i
	ldy #i
	sty lsd_bucket,x
	}

	;; msd sort
	!for i,0,13 {
	ldx lsd_bucket+i
	bmi .next

.msd	ldy actor_ypos,x
	lda msd_table,y
	tay

	lda msd_bucket,y
	stx msd_bucket,y

	ldy actor_link,x
	sta actor_link,x
	bmi .next

	ldx actor_ypos,y
	lda msd_table,x
	tax

	lda msd_bucket,x
	sty msd_bucket,x

	ldx actor_link,y
	sta actor_link,y
	bne .msd

.next	}

	;; finally in the mux writer.
	;; for each sprite, alternating x and y
	ldx actor_link,y
	bpl .ok
.bucket	lda msd_bucket-$80,x
	tax
	bmi .bucket
.ok	...


The idea is to use linked lists for the buckets. Sort all possible actors in the first pass (invalid ones having high y values), you can optimize it by trying to keep the maximum actor number as low as possible and skipping those high entries. And we can link the actors together in the actual multiplexer instead of a separate pass. Also the sentinels in the msd buckets can help us to skip buckets easily.
Finally we don't need more than about 13 buckets since not all y coordinates are valid, and as a bonus we can collect the invalid ones into a single "death" list automatically by tweaking the division table. Another bonus is that by sticking a store in the bucket skipping code of the multiplexer you can easily link together a complete list of actors which IMO is more convenient to work with than an order list (i.e. you only need to know the current actor's index to move to the next one, no need to keep track of an order index). Furthermore just about everything is kept on the zeropage, thought only temporarily of course.

All in all it seems to work out to about 28 raster lines. Except I'm not at all certain whether this will actually work yet.. ;)
2007-10-09 14:22
Oswald

Registered: Apr 2002
Posts: 4058
wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;)
2007-10-09 14:40
doynax

Registered: Oct 2004
Posts: 209
Quote: wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;)

It depends really. A progressive sort is faster in 95% of the cases but it's that twentieth missed frame which fucks up the flow in an action game. Then again I'm writing a twitch game which aspires to be a bullet hell shooter, but a platformer (say) would obviously have other requirements.

The radix sort is only something like 15 cycles (per actor) slower than my current bucket sort. So if I can get it to work I'll probably switch, and I'd be happy to share the final code if I do. Weighting the extra cycles vs. the extra sprites makes it a tough choice.
2007-10-09 14:53
ChristopherJam

Registered: Aug 2004
Posts: 620
I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.

I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do.
2007-10-09 14:59
doynax

Registered: Oct 2004
Posts: 209
Quote: I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.

I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do.


But then you've mostly got horizontal movement, right? I've got to worry about a laser moving 21 lines per frame and other nasty things :(
Of course managing not to overload the multiplexer in a horizontal shooter is much harder. So I guess I'm still better off ;)
2007-10-09 16:24
Oswald

Registered: Apr 2002
Posts: 4058
btw some inspiration on the attack waves, and enemies: http://youtube.com/watch?v=Aj23K8Ri68E&mode=related&search= a lot if this could be done on a c64 imho.
2007-10-09 16:47
Groepaz

Registered: Dec 2001
Posts: 8120
an ikaruga-like shooter would be generally a cool idea :)
2007-10-10 11:00
ChristopherJam

Registered: Aug 2004
Posts: 620
Yeah, the movement is largely horizontal, with a few loop-the-loops, and that "X of death" formation the reviewer complained of. The times I drop a formation in from above or raise it from below it can struggle a little if there's already a dense formation it has to cross over.

A danmaku shooter like Ikaruga would indeed be awesome, but also a serious challenge to implement!
2007-10-10 13:37
Croozor

Registered: Dec 2001
Posts: 871
I don't think I've ever done a multiplexor where the sprites could move freely around, but is it really necessary to sort them?

Wouldn't it be enough with a map consiting of a byte per rasterline where the bit pattern indicated which sprites were used. To allocate a new sprite, just ora all the bytes for the sprite's y-position and 21 positions down, and see which 0's are left, or if the result is $ff it's too bad.

21 oras per sprite might still be a little heavy, but this could of course be optimized by reducing the precission a bit, e.g. to 3rd line, which means only 7 oras/sprite.

Edit: And of course you need to insert 7 values in the table afterwards as well. :-) Still think it's faster than sorting though.
2007-10-10 13:48
doynax

Registered: Oct 2004
Posts: 209
I guess that's basically a "zone" multiplexer except with better granularity, sort of.. Don't forget that you've got to search through the scanlines first to see if any sprite channel is free all the way down before marking them. Then you'd also need another set of tables to keep track of which virtual sprite is bound to which sprite channel.
Finally you'd have to walk through the array line-by-line to figure out when to trigger the interrupts. I suppose the easiest way would be to walk the array from the bottom up to find the earliest possible line to program the sprite.

Except it'd have to be fairly coarse to win-out over a sort-based approach.

edit: I suppose you wouldn't have to check all entries, just the first and the last one. Or maybe only the first if the flag was interpreted to mean that the sprite is free to be reused here. Hmm..

Either way I figure it'd be a nice way to deal with y-expanded sprites and other special cases for linked sprites as well as for dealing with sprite priorities, i.e. the sprites would be allocated in the order of priority.
2007-10-10 14:04
Oswald

Registered: Apr 2002
Posts: 4058
mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;)
2007-10-10 14:08
doynax

Registered: Oct 2004
Posts: 209
Quote: mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;)

Care to elaborate?
Superficially that sounds exactly like what my bucket sort does today, except I think you meant something entirely different..
2007-10-10 14:23
Oswald

Registered: Apr 2002
Posts: 4058
well at a second look it might be not a good idea, as walking the chain each time a new sprite is inserted is something like an insertion sort using doublechained elements :P it would be better not to chain the raster list, instead using a 256 byte table, the end result would be counting sort with chained sprites :) or if you want like your radix sort without separate lsd/msd :)

edit, ok this one fails when you have to reset the 256 byte table :P maybe you could collect the used entrys somewhere during the proces tho.
2007-10-10 16:00
ChristopherJam

Registered: Aug 2004
Posts: 620
Hmm. Perhaps you could insert 'start actor' and 'end actor' events into one each of (say) 50 buckets, one every four rasters.

Then whenever an actor is done with you could add the sprite you used to a free list, and whenever a sprite is required you could take one off the free list. No sorting required - you just need to look ahead in the bucket table to see when the next raster interrupt is required.
2007-10-10 20:20
Oswald

Registered: Apr 2002
Posts: 4058
how about binary trees? %)
2007-10-10 20:35
doynax

Registered: Oct 2004
Posts: 209
Quote: how about binary trees? %)

I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?

Unless you mean a heap sort with it's implicit tree?
2007-10-10 20:58
ChristopherJam

Registered: Aug 2004
Posts: 620
Hmm. I'm sure I've used heap-sort for something on this beasty. May have been hidden-surface-removal for a solid 3d renderer.. (edge events as you traverse a scanline push/pop surfaces)
2007-10-11 06:28
JackAsser

Registered: Jun 2002
Posts: 1227
Quote: I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?

Unless you mean a heap sort with it's implicit tree?


Nah... stick to radix sort and the like since they're O(n). General sorting using trees etc are O(n log n) and on this platform they'll have a nice quite big constant put in front also due to overhead. Those constants are ofcourse negliable for very large n's, however a multiplexer doesn't have large n. ;D

For a complete sort I have serious doubts that anything can out perform a well written radix sort (2 step bucket sort as you have shown before).

If you have similar sort order from frame to frame I'd recommend bubble sort actually. They're O(n^2) worst case but O(n) if sort order is similar + they're dead easy to implement and optimize.

Maybe one could have some code that would realise when to use a full sort with radix-sort and when to "just fix" the sort order using bubble sort. This would improve the average time ofcourse but slightly worsen the worst case ofcourse.

Also as Cadaver mentioned, it's seldom the sort order actually changes that much from one frame to another. On could be smart when you get new enemies etc by spreading the sort for the actors that are about to enter the screen, over several frames.

F.e. you will need some kind of data structure to know what kind of actors that will emerge soon in the scrolling direction. You could have those somewhat pre-sorter and/or prepare-sort them on the way there.

Tons of ideas.... =)
2007-10-11 09:12
ChristopherJam

Registered: Aug 2004
Posts: 620
Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive.

That would even allow you to determine in advance when a slightly shuffled order is acceptable :)


Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions)
2007-10-11 09:20
doynax

Registered: Oct 2004
Posts: 209
Quote:
Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive.
Yeah, that'd probably work. If you were writing Armalyte 2..
But it would be way to limiting for my tastes. You've got to at least have bullets fired towards the player in the kind of game I'm trying to write.

Quote:
Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions)
Sure. Bullets in a vertical shooter tend to be separate from the player and able to reuse the player's sprite. As long as they are spawned a couple of scanlines above the player at least.

Besides I'm happy with allowing a bit of flicker every now and then, especially if it's mostly in the two player mode, in return for a greater amount of objects and not having to worry myself to death over how the enemy paths should be designed as to guarantee zero flicker. Best-effort multiplexing FTW!
2007-10-11 09:57
Oswald

Registered: Apr 2002
Posts: 4058
I'm strongly beside that the player sprite shouldnt be handled exclusively, except the case of more than 8 sprites on a line. also if this happens I'd alternatively reject sprites frame to frame, thus in a flickery fashion but still all sprites would be visible. I've seen this in some soccer game, and I like it
2007-10-11 10:09
ChristopherJam

Registered: Aug 2004
Posts: 620
Fair enough on the flicker risk.

I'd actually been assuming chars or bitmap plotting for the bullets, but yes, if they are to be sprites then an 8 sprite muxer makes more sense than special casing the player.
2007-10-13 10:14
ChristopherJam

Registered: Aug 2004
Posts: 620
This one's has taken up residence in my brain..

Now I've gone back and read the thread properly instead of skimming the first three quarters before reading the rest, I've actually noticed the earlier points about a danmaku shooter being the aim well before the Ikaruga subthread, and that the whole idea was to minimise worst case in a very frame incoherent shooter - methinks my insertion sort commentary was a little out of place.

I did have some inspiration last night when I realised that for any displayable set of actors there will never be more than eight sprites in any 21 line bucket (else the first and last will be too close together to display), so then I went off and did some research on optimal sorting for small sets of numbers. I eventually found
this page on Bose-Nelson sorting networks, got a swap macro down to 22 cycles ( ldx a
ldy b
lda actor_y,x
cmp actor_y,y
bge noswap
stx b
sty a
), decided that was too slow and started handcoding cases for smaller lists of sprites in a bucket (eg 70 cycles to take 3 zero page locations containing actor indices, sort them by actor_y and add them to the actor_next list),
but even then, I lose about 26 cycles just to placing each actor onto the end of the zero page array of actors-in-that bucket, which only leaves me an average of 19 cycles per actor for the sorting and adding to the actor_next list if I'm to beat the routine above.

It looks like @doynax' two pass radix sort still wins - I'm impressed!
2007-10-13 11:00
doynax

Registered: Oct 2004
Posts: 209
ChristopherJam: In practice that radix-sort ended up needing 34 scanlines, for 32-sprites, rather than 28. It just wasn't practical to keep keep the set of sprites compacted like that. Not when combined with certain other features anyway.
So maybe a Bose-Nelson sorting network, whatever that is, could still be competitive?
2007-10-14 15:29
ChristopherJam

Registered: Aug 2004
Posts: 620
hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.

I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.

There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.

A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg

#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)

SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)

will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.
2007-10-14 19:31
Oswald

Registered: Apr 2002
Posts: 4058
lda (),y
cmp (),y
bcc

is faster when there's no need to swap than

ldx pointer
ldy pointer
lda spry,x
cmp spry,y
2007-10-14 20:04
ChristopherJam

Registered: Aug 2004
Posts: 620
Presumably you are proposing I keep y=0 and interleave my actor indices with the high byte of the actor_y table?

Good thinking, except I am attempting to optimise the worst case, so I do not wish to sacrifice that for a faster average case.
2007-10-14 20:47
Oswald

Registered: Apr 2002
Posts: 4058
it should be noted tho, that worst case sorting never happens with progressive sort, especially if you optimize for only adding/removing 1 sprite to the sort per frame. I'm also realizing that the zp method presented above is not good, as the rare case with insertion sort (my current approach) is when there's no need to swap.
2007-10-14 21:29
JackAsser

Registered: Jun 2002
Posts: 1227
Quote: hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.

I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.

There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.

A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg

#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)

SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)

will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.


Hmmm... I was sure sorting networks were O(n log n) but apparantly Ajtai, Komlós and Szemerédi have O(log n) sorting network. I.e. the maximum depth (or comparators executed) is log n. It feels highly counter intuitive imo but I won't question it. Anyways, point is that while radix-sort is O(n*m) (where n is # of input and m is # of digits), an optimal sorting network is O(log n).

Now, this O() notation always hides huges constants for a c64 but for a f.e. 48 actor sorter it should require at most 48*log2(48) comparators statically in memory and at most log2(48) of those are executed at any given input. => NICE!
2007-10-15 02:01
ChristopherJam

Registered: Aug 2004
Posts: 620
Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)
2007-10-15 08:45
Oswald

Registered: Apr 2002
Posts: 4058
actually I have just arrived to what you're talking about: insertion sort on a linked list of actors :) as one of the drawbacks of insertion sort is the need to move many elements, but with a linked list you can insert the actor into the right position without shuffling bytes around.

ldx actorlink,x
cmp actory,x
bcc nextelement

- insert actor into chain & return to the outer unrolled loop-

nextelement

ldy actorlink,x
cmp actory,y
bcc nextelement

only 11 cycles per comparison. you even have the prev actor's link in one of the registers, so it can be a one way chain easily. guess you already do it like this ;)
2007-10-15 10:53
ChristopherJam

Registered: Aug 2004
Posts: 620
Exactly :)

I traverse a singly linked actor_next list in the end-of-screen interrupt running their movement scripts, and after each has been moved (unless it has just died) I insert it into a singly linked actor_prev list.

Then once all the actors have been processed, I reverse the list.

ldx actor_prev,y
tay
sta actor_next,x
beq done
...

2007-10-15 13:29
JackAsser

Registered: Jun 2002
Posts: 1227
Quote: Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)


Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.
2007-10-15 13:37
doynax

Registered: Oct 2004
Posts: 209
Quote: Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.


Unfortunately we still need to sort the first eight sprites in order to reuse them as soon as possible. Unless you use a radically different approach like what Cruzer suggested of course.

There may be cases where we can determine ahead of time that there will be "enough" free raster time anyway and don't perform a perfect sort on some segment of sprites, but as to how you'd implement such a beast in practice..
2017-08-12 11:39
ChristopherJam

Registered: Aug 2004
Posts: 620
Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)

I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.

After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.

For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.

By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.

Total time, a shade under 40 raster lines \O/ (yes, doynax still wins… /o\)

Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort
2017-08-13 13:04
ChristopherJam

Registered: Aug 2004
Posts: 620
…removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.

Down to 38.5 raster lines now, only 12% slower than radix sort.
2017-08-13 21:26
doynax

Registered: Oct 2004
Posts: 209
Very neat. I never would have considered doing a sparse bucket sort :)

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
2017-08-14 09:08
ChristopherJam

Registered: Aug 2004
Posts: 620
Quoting doynax
Very neat. I never would have considered doing a sparse bucket sort :)


Thanks!

Quote:
Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.


Yes, I wasn't sure how best to benchmark. Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.

FWIW most of the time is "per used bucket", so if there are only 24(?) distinct Y values, I suspect I'd come out ahead, but if there are that many horizontally aligned enemies a zone sort's likely to be sufficient anyway.

I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor, for a fairly unappealing total of (34+62)*32=3072 cycles per frame (48.7 raster lines)

I suspect that if one just wanted to minimise CPU time expended by the sorter in in the border, a plain bucket sort would be better; insertion's only 14 cycles per actor (7 rasters!), and the times it takes a while for 'fetch next' are precisely when the next actor is a fair way down the screen from the one you just dealt with, so timing wouldn't be quite so critical anyway.


Quote:
(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)


Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
2017-08-14 14:18
doynax

Registered: Oct 2004
Posts: 209
Quoting ChristopherJam
Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.
Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Quoting ChristopherJam
I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Quote:
Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
2017-08-14 15:34
ChristopherJam

Registered: Aug 2004
Posts: 620
Quoting doynax

Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Indeed. And for the first 8 sprites, you can just LDX immediate directly from $017f down.

Quote:
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Exactly. The plain bucket sort has the same advantage (TBH I pinched the idea from Playstation order-tables - it's how they z-sort the texture mapped triangles, by inserting them into a linked list of null graphics primitives)

Quote:
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
Yes; I got there eventually :)
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Thinker/Bytestar
hedning/G★P
Zaz/Dual Crew
Tom-Cat/Nostalgia
Peiselulli/tRSi
lft
grass/LETHARGY
Frantic/Hack'n'Trade
reidrac/usebox.net
Knut Clausen/SHAPE
tlr
Guests online: 42
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 Coma Light 13  (9.6)
4 The Shores of Reflec..  (9.6)
5 Lunatico  (9.6)
6 Comaland 100%  (9.5)
7 Incoherent Nightmare  (9.5)
8 Wonderland XII  (9.5)
9 Comaland  (9.5)
10 Wonderland XIII  (9.5)
Top onefile Demos
1 Veterans of Style  (9.5)
2 Dawnfall V1.1  (9.5)
3 Daah, Those Acid Pil..  (9.5)
4 Treu Love [reu]  (9.4)
5 Dawnfall  (9.3)
6 SidRok  (9.3)
7 F1 Evolution  (9.3)
8 One-Der  (9.2)
9 Tunnel Vision  (9.2)
10 Game of Thrones [2sid]  (9.1)
Top Groups
1 Pond  (10)
2 Booze Design  (9.4)
3 Censor Design  (9.4)
4 Oxyron  (9.4)
5 Crest  (9.3)
Top Coders
1 Axis  (9.8)
2 Crossbow  (9.8)
3 Graham  (9.8)
4 HCL  (9.8)
5 Lft  (9.7)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 0.949 sec.