Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in 
CSDb User Forums


Forums > C64 Coding > Sorting
2007-10-08 18:08
Oswald

Registered: Apr 2002
Posts: 4090
Sorting

are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :)
2007-10-08 18:23
doynax

Registered: Oct 2004
Posts: 209
I tried both a bubble sort, an insertion sort and a bucket sort for my game. All of them progressive. The insertion sort was usually the fastest but in the end the bucket sort was the most stable and reliable (i.e. fever missed frames) even with the loss of precision.
It's just not worth having to worry about spawning new sprites at an appropriate position, and not let things move too fast lest the list get too far out of sync.

I can't remember precisely how much raster time it consumes right now but it isn't big deal really compared with all the other work involved.
2007-10-08 18:30
Oswald

Registered: Apr 2002
Posts: 4090
what is the name of your game? I'd like to check it :) I have made an unrolled insertion sort, which will be progressive (if I ever got to that). it takes 22 cycles including the cmp/bcc when a number is moved (swapped) around.
2007-10-08 18:40
Burglar

Registered: Dec 2004
Posts: 782
some games use sprite regions, so there's no need for sorting, just put the sprites in the right regions and you're done. I think Turrican uses that method.

you might want to check out games from Hannes Sommer/Cosmos Designs as well (ie Lions of the Universe +7 ), those should have some good multiplexers...
2007-10-08 18:42
doynax

Registered: Oct 2004
Posts: 209
It far from finished so maybe I shouldn't be giving advice, nevertheless I always appreciate a chance to plug my newest project. http://www.minoan.ath.cx/~doynax/6502/Balls_of_the_Scrolling_Th..

Anyway, the bucket sort probably takes something like 50 cycles per sprite. But that's still far better than the worst case of the insertion sort.
2007-10-08 18:44
Groepaz

Registered: Dec 2001
Posts: 8213
uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)

the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets.
2007-10-08 18:48
Oswald

Registered: Apr 2002
Posts: 4090
groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines.
2007-10-08 18:48
JackAsser

Registered: Jun 2002
Posts: 1242
Quote: uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)

the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets.


Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen.
2007-10-08 18:53
doynax

Registered: Oct 2004
Posts: 209
Quote: Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen.

Not the worst case perhaps. But the difference I saw between the best and worst case, even when catering to the multiplexer when inserting things, was still something like 300%.
And in an action game it's all about maintaining a smooth framerate no matter what happens.
2007-10-08 18:53
Oswald

Registered: Apr 2002
Posts: 4090
jack, well its just ammazing to me that how slow this operation really is. without tricks sorting alone can eat most of a frame. Taking into account how unoptimized games used to be its nice they can pull off their plexed sprites, and the rest in 50fps :)

edit: to gro, 132 rasterlines is the time it takes to sort 32 totally random numbers.
2007-10-08 18:59
Groepaz

Registered: Dec 2001
Posts: 8213
Quote:

groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines.


nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did)
2007-10-08 19:01
doynax

Registered: Oct 2004
Posts: 209
Quote:

nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did)
I actually do a progressive bucket sort which seems to help a bit. Bullets and enemy formations and such tends to move in step with each other after all.
2007-10-08 19:06
Oswald

Registered: Apr 2002
Posts: 4090
doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P
2007-10-08 19:09
Groepaz

Registered: Dec 2001
Posts: 8213
Quote:

also rejecting 9th sprites seems to be messy


why reject? just make it so it's ignored :) (ie bucket can take more than 8 sprites, but plexer will only use first 8)
2007-10-08 19:13
doynax

Registered: Oct 2004
Posts: 209
Quote: doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P

The d800 copy is double buffered and evenly distributed across 16 frames, so it really doesn't take much raster time at all ;)
As for the bucket sort I don't quite see what you mean. The
output is still just an ordered list of sprites, just like for the insertion sort.
As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote).
2007-10-08 19:18
Groepaz

Registered: Dec 2001
Posts: 8213
Quote:

As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote).


thats probably how most game multiplexers work. (look at SEUCK...argl =D)
2007-10-08 19:23
Oswald

Registered: Apr 2002
Posts: 4090
you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :)
2007-10-08 19:26
doynax

Registered: Oct 2004
Posts: 209
Quote: you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :)

Really? Feel free to disassemble the code :)

As for the sprites if they aren't perfectly in order then the worst that can happen is that they won't be as tightly packed as possible. If it's properly written anyway.
2007-10-08 19:28
Groepaz

Registered: Dec 2001
Posts: 8213
keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap)
2007-10-08 19:38
doynax

Registered: Oct 2004
Posts: 209
Quote: keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap)

In fact it's often useful to write them in reverse order to insure reasonably good priorities in top-down games and the like (i.e. so lower sprites overlaps the higher ones).
Another possibility is to "center" the multiplexer in such a way as to insure the priority of a single sprite (i.e. always assign the player to the first sprite).
2007-10-08 19:42
Oswald

Registered: Apr 2002
Posts: 4090
you might miss sprites otherwise possible to display with the bucket sort.


spr1=70
spr2=74

bucket sort result:

spr2
spr1

then you fire an irq for the next sprite that will be spr2, and already late from spr1
2007-10-08 19:45
doynax

Registered: Oct 2004
Posts: 209
Quote: you might miss sprites otherwise possible to display with the bucket sort.


spr1=70
spr2=74

bucket sort result:

spr2
spr1

then you fire an irq for the next sprite that will be spr2, and already late from spr1


The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler.
2007-10-08 19:46
cadaver

Registered: Feb 2002
Posts: 1046
If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :)
2007-10-08 19:46
Groepaz

Registered: Dec 2001
Posts: 8213
now you are assuming too much :) ofcourse the displayer code must be written in a way that such thing can not happen.
2007-10-08 20:30
Oswald

Registered: Apr 2002
Posts: 4090
Quote: If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :)

what was so bad about it, when you could get it 'right' afterall ?
2007-10-08 20:33
Oswald

Registered: Apr 2002
Posts: 4090
Quote: The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler.

hmm 'before the new' sprite approach allows more tight packing ;) clever solution nevertheless.
2007-10-08 20:38
cadaver

Registered: Feb 2002
Posts: 1046
Oswald: well it was never completely "right" in those games, you could create unnecessary artifacts for example when several motorcycles (2x2 sprites) were coming at you and you jumped. But for example, if you have mostly airborne enemies coming at you from several heights and you're mostly on ground, it doesn't matter.
2007-10-09 12:22
Oswald

Registered: Apr 2002
Posts: 4090
radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines.
2007-10-09 12:31
doynax

Registered: Oct 2004
Posts: 209
Quote: radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines.

The bucket sort we talked about earlier is just bucket sort is just a special case of the radix sort. Except it sorts in one step and with some loss of precision to reduce the number of buckets.

Anyway an optimized and unrolled two-step radix sort shouldn't be quite as bad as that. It ought to be possible to get it well below 55 lines. In fact it'd probably be faster than a bucket sort combined with a bubble sort fix-up stage.

Dammit.. Now I have to implement one just to see how it turns out ;)
2007-10-09 13:05
Oswald

Registered: Apr 2002
Posts: 4090
here is my implementation:

there's 2 bucket arrays, each 256 bytes. each bucket can hold max 15 elements, 16th is the counter of the elements. adress of fex. bucket 3's element 5 is: 3*16+5.

bucket1 is used for the first pass
bucket2 for the 2nd.

each code snipplet is a code segment out of an unrolled loop. (ie stuff is missing like lda bucket1+bucketnr*16,x)

I dont see how this could be much faster :) some1 proove me wrong :)


;pass1
	ldx sort+0
	ldy spry,x        ; get sprite y coord
	lax and#0fmul16,y ;get bucket startadress 
	ora bucket1+15,x  ;get adress INside the bucket
	inc bucket1+15,x  ;inc nr of elements
	tax		
smod	sty bucket1,x	  ;store spr to bucket



;pass2

	ldx bucket1+15
	beq next	;empty bucket dont bother
blp1	

	ldy bucket1,x
	lda spry,y         ;get spr y
	and #%11110000     ;upper 4 bits only this time
	tay                ;which is exactly our pointer
	ora bucket2+15,y   ;addy inside bucket
	inc bucket2+15,y   ;inc bucket counter
	tay                ;final bucket addy
	sta bucket2,y      ;store sprite
	dex
	bne blp1	   ;any more in bucket1?

next

;pass3
        ldy #$00 ; nr of sprites counter this is done only once

	ldx bucket2+15
	beq next             ;empty bucket?
blp2
	lda bucket2,x        ;get sprite nr
smod	sta final,y          ;store to final list
	iny                  ;final spr count
	dex           
	bne blp2             ;any more in curr. bucket?



edit, ok, well some bugs there, destroying regs, but assuming they werent.. fixing those would make it even more slower tho
2007-10-09 13:37
doynax

Registered: Oct 2004
Posts: 209
I'm attempting something like this:

	;; initialize the buckets
	ldx #$81
	!for i,0,16 {
	sta lsd_bucket+i
	}
	lda #$fe
	!for i,0,6 {
	sax msd_bucket+i*2+0
	stx msd_bucket+i*2+1
	sbx #-3
	}

	;; lsd sort
	lda #$0f
	!for i,0,32 {
	ldx actor_ypos+i
	sbx #$00
	ldy lsd_bucket,x
	sty actor_link+i
	ldy #i
	sty lsd_bucket,x
	}

	;; msd sort
	!for i,0,13 {
	ldx lsd_bucket+i
	bmi .next

.msd	ldy actor_ypos,x
	lda msd_table,y
	tay

	lda msd_bucket,y
	stx msd_bucket,y

	ldy actor_link,x
	sta actor_link,x
	bmi .next

	ldx actor_ypos,y
	lda msd_table,x
	tax

	lda msd_bucket,x
	sty msd_bucket,x

	ldx actor_link,y
	sta actor_link,y
	bne .msd

.next	}

	;; finally in the mux writer.
	;; for each sprite, alternating x and y
	ldx actor_link,y
	bpl .ok
.bucket	lda msd_bucket-$80,x
	tax
	bmi .bucket
.ok	...


The idea is to use linked lists for the buckets. Sort all possible actors in the first pass (invalid ones having high y values), you can optimize it by trying to keep the maximum actor number as low as possible and skipping those high entries. And we can link the actors together in the actual multiplexer instead of a separate pass. Also the sentinels in the msd buckets can help us to skip buckets easily.
Finally we don't need more than about 13 buckets since not all y coordinates are valid, and as a bonus we can collect the invalid ones into a single "death" list automatically by tweaking the division table. Another bonus is that by sticking a store in the bucket skipping code of the multiplexer you can easily link together a complete list of actors which IMO is more convenient to work with than an order list (i.e. you only need to know the current actor's index to move to the next one, no need to keep track of an order index). Furthermore just about everything is kept on the zeropage, thought only temporarily of course.

All in all it seems to work out to about 28 raster lines. Except I'm not at all certain whether this will actually work yet.. ;)
2007-10-09 14:22
Oswald

Registered: Apr 2002
Posts: 4090
wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;)
2007-10-09 14:40
doynax

Registered: Oct 2004
Posts: 209
Quote: wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;)

It depends really. A progressive sort is faster in 95% of the cases but it's that twentieth missed frame which fucks up the flow in an action game. Then again I'm writing a twitch game which aspires to be a bullet hell shooter, but a platformer (say) would obviously have other requirements.

The radix sort is only something like 15 cycles (per actor) slower than my current bucket sort. So if I can get it to work I'll probably switch, and I'd be happy to share the final code if I do. Weighting the extra cycles vs. the extra sprites makes it a tough choice.
2007-10-09 14:53
ChristopherJam

Registered: Aug 2004
Posts: 655
I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.

I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do.
2007-10-09 14:59
doynax

Registered: Oct 2004
Posts: 209
Quote: I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.

I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do.


But then you've mostly got horizontal movement, right? I've got to worry about a laser moving 21 lines per frame and other nasty things :(
Of course managing not to overload the multiplexer in a horizontal shooter is much harder. So I guess I'm still better off ;)
2007-10-09 16:24
Oswald

Registered: Apr 2002
Posts: 4090
btw some inspiration on the attack waves, and enemies: http://youtube.com/watch?v=Aj23K8Ri68E&mode=related&search= a lot if this could be done on a c64 imho.
2007-10-09 16:47
Groepaz

Registered: Dec 2001
Posts: 8213
an ikaruga-like shooter would be generally a cool idea :)
2007-10-10 11:00
ChristopherJam

Registered: Aug 2004
Posts: 655
Yeah, the movement is largely horizontal, with a few loop-the-loops, and that "X of death" formation the reviewer complained of. The times I drop a formation in from above or raise it from below it can struggle a little if there's already a dense formation it has to cross over.

A danmaku shooter like Ikaruga would indeed be awesome, but also a serious challenge to implement!
2007-10-10 13:37
Cruzer

Registered: Dec 2001
Posts: 892
I don't think I've ever done a multiplexor where the sprites could move freely around, but is it really necessary to sort them?

Wouldn't it be enough with a map consiting of a byte per rasterline where the bit pattern indicated which sprites were used. To allocate a new sprite, just ora all the bytes for the sprite's y-position and 21 positions down, and see which 0's are left, or if the result is $ff it's too bad.

21 oras per sprite might still be a little heavy, but this could of course be optimized by reducing the precission a bit, e.g. to 3rd line, which means only 7 oras/sprite.

Edit: And of course you need to insert 7 values in the table afterwards as well. :-) Still think it's faster than sorting though.
2007-10-10 13:48
doynax

Registered: Oct 2004
Posts: 209
I guess that's basically a "zone" multiplexer except with better granularity, sort of.. Don't forget that you've got to search through the scanlines first to see if any sprite channel is free all the way down before marking them. Then you'd also need another set of tables to keep track of which virtual sprite is bound to which sprite channel.
Finally you'd have to walk through the array line-by-line to figure out when to trigger the interrupts. I suppose the easiest way would be to walk the array from the bottom up to find the earliest possible line to program the sprite.

Except it'd have to be fairly coarse to win-out over a sort-based approach.

edit: I suppose you wouldn't have to check all entries, just the first and the last one. Or maybe only the first if the flag was interpreted to mean that the sprite is free to be reused here. Hmm..

Either way I figure it'd be a nice way to deal with y-expanded sprites and other special cases for linked sprites as well as for dealing with sprite priorities, i.e. the sprites would be allocated in the order of priority.
2007-10-10 14:04
Oswald

Registered: Apr 2002
Posts: 4090
mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;)
2007-10-10 14:08
doynax

Registered: Oct 2004
Posts: 209
Quote: mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;)

Care to elaborate?
Superficially that sounds exactly like what my bucket sort does today, except I think you meant something entirely different..
2007-10-10 14:23
Oswald

Registered: Apr 2002
Posts: 4090
well at a second look it might be not a good idea, as walking the chain each time a new sprite is inserted is something like an insertion sort using doublechained elements :P it would be better not to chain the raster list, instead using a 256 byte table, the end result would be counting sort with chained sprites :) or if you want like your radix sort without separate lsd/msd :)

edit, ok this one fails when you have to reset the 256 byte table :P maybe you could collect the used entrys somewhere during the proces tho.
2007-10-10 16:00
ChristopherJam

Registered: Aug 2004
Posts: 655
Hmm. Perhaps you could insert 'start actor' and 'end actor' events into one each of (say) 50 buckets, one every four rasters.

Then whenever an actor is done with you could add the sprite you used to a free list, and whenever a sprite is required you could take one off the free list. No sorting required - you just need to look ahead in the bucket table to see when the next raster interrupt is required.
2007-10-10 20:20
Oswald

Registered: Apr 2002
Posts: 4090
how about binary trees? %)
2007-10-10 20:35
doynax

Registered: Oct 2004
Posts: 209
Quote: how about binary trees? %)

I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?

Unless you mean a heap sort with it's implicit tree?
2007-10-10 20:58
ChristopherJam

Registered: Aug 2004
Posts: 655
Hmm. I'm sure I've used heap-sort for something on this beasty. May have been hidden-surface-removal for a solid 3d renderer.. (edge events as you traverse a scanline push/pop surfaces)
2007-10-11 06:28
JackAsser

Registered: Jun 2002
Posts: 1242
Quote: I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?

Unless you mean a heap sort with it's implicit tree?


Nah... stick to radix sort and the like since they're O(n). General sorting using trees etc are O(n log n) and on this platform they'll have a nice quite big constant put in front also due to overhead. Those constants are ofcourse negliable for very large n's, however a multiplexer doesn't have large n. ;D

For a complete sort I have serious doubts that anything can out perform a well written radix sort (2 step bucket sort as you have shown before).

If you have similar sort order from frame to frame I'd recommend bubble sort actually. They're O(n^2) worst case but O(n) if sort order is similar + they're dead easy to implement and optimize.

Maybe one could have some code that would realise when to use a full sort with radix-sort and when to "just fix" the sort order using bubble sort. This would improve the average time ofcourse but slightly worsen the worst case ofcourse.

Also as Cadaver mentioned, it's seldom the sort order actually changes that much from one frame to another. On could be smart when you get new enemies etc by spreading the sort for the actors that are about to enter the screen, over several frames.

F.e. you will need some kind of data structure to know what kind of actors that will emerge soon in the scrolling direction. You could have those somewhat pre-sorter and/or prepare-sort them on the way there.

Tons of ideas.... =)
2007-10-11 09:12
ChristopherJam

Registered: Aug 2004
Posts: 655
Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive.

That would even allow you to determine in advance when a slightly shuffled order is acceptable :)


Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions)
2007-10-11 09:20
doynax

Registered: Oct 2004
Posts: 209
Quote:
Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive.
Yeah, that'd probably work. If you were writing Armalyte 2..
But it would be way to limiting for my tastes. You've got to at least have bullets fired towards the player in the kind of game I'm trying to write.

Quote:
Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions)
Sure. Bullets in a vertical shooter tend to be separate from the player and able to reuse the player's sprite. As long as they are spawned a couple of scanlines above the player at least.

Besides I'm happy with allowing a bit of flicker every now and then, especially if it's mostly in the two player mode, in return for a greater amount of objects and not having to worry myself to death over how the enemy paths should be designed as to guarantee zero flicker. Best-effort multiplexing FTW!
2007-10-11 09:57
Oswald

Registered: Apr 2002
Posts: 4090
I'm strongly beside that the player sprite shouldnt be handled exclusively, except the case of more than 8 sprites on a line. also if this happens I'd alternatively reject sprites frame to frame, thus in a flickery fashion but still all sprites would be visible. I've seen this in some soccer game, and I like it
2007-10-11 10:09
ChristopherJam

Registered: Aug 2004
Posts: 655
Fair enough on the flicker risk.

I'd actually been assuming chars or bitmap plotting for the bullets, but yes, if they are to be sprites then an 8 sprite muxer makes more sense than special casing the player.
2007-10-13 10:14
ChristopherJam

Registered: Aug 2004
Posts: 655
This one's has taken up residence in my brain..

Now I've gone back and read the thread properly instead of skimming the first three quarters before reading the rest, I've actually noticed the earlier points about a danmaku shooter being the aim well before the Ikaruga subthread, and that the whole idea was to minimise worst case in a very frame incoherent shooter - methinks my insertion sort commentary was a little out of place.

I did have some inspiration last night when I realised that for any displayable set of actors there will never be more than eight sprites in any 21 line bucket (else the first and last will be too close together to display), so then I went off and did some research on optimal sorting for small sets of numbers. I eventually found
this page on Bose-Nelson sorting networks, got a swap macro down to 22 cycles ( ldx a
ldy b
lda actor_y,x
cmp actor_y,y
bge noswap
stx b
sty a
), decided that was too slow and started handcoding cases for smaller lists of sprites in a bucket (eg 70 cycles to take 3 zero page locations containing actor indices, sort them by actor_y and add them to the actor_next list),
but even then, I lose about 26 cycles just to placing each actor onto the end of the zero page array of actors-in-that bucket, which only leaves me an average of 19 cycles per actor for the sorting and adding to the actor_next list if I'm to beat the routine above.

It looks like @doynax' two pass radix sort still wins - I'm impressed!
2007-10-13 11:00
doynax

Registered: Oct 2004
Posts: 209
ChristopherJam: In practice that radix-sort ended up needing 34 scanlines, for 32-sprites, rather than 28. It just wasn't practical to keep keep the set of sprites compacted like that. Not when combined with certain other features anyway.
So maybe a Bose-Nelson sorting network, whatever that is, could still be competitive?
2007-10-14 15:29
ChristopherJam

Registered: Aug 2004
Posts: 655
hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.

I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.

There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.

A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg

#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)

SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)

will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.
2007-10-14 19:31
Oswald

Registered: Apr 2002
Posts: 4090
lda (),y
cmp (),y
bcc

is faster when there's no need to swap than

ldx pointer
ldy pointer
lda spry,x
cmp spry,y
2007-10-14 20:04
ChristopherJam

Registered: Aug 2004
Posts: 655
Presumably you are proposing I keep y=0 and interleave my actor indices with the high byte of the actor_y table?

Good thinking, except I am attempting to optimise the worst case, so I do not wish to sacrifice that for a faster average case.
2007-10-14 20:47
Oswald

Registered: Apr 2002
Posts: 4090
it should be noted tho, that worst case sorting never happens with progressive sort, especially if you optimize for only adding/removing 1 sprite to the sort per frame. I'm also realizing that the zp method presented above is not good, as the rare case with insertion sort (my current approach) is when there's no need to swap.
2007-10-14 21:29
JackAsser

Registered: Jun 2002
Posts: 1242
Quote: hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.

I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.

There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.

A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg

#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)

SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)

will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.


Hmmm... I was sure sorting networks were O(n log n) but apparantly Ajtai, Komlós and Szemerédi have O(log n) sorting network. I.e. the maximum depth (or comparators executed) is log n. It feels highly counter intuitive imo but I won't question it. Anyways, point is that while radix-sort is O(n*m) (where n is # of input and m is # of digits), an optimal sorting network is O(log n).

Now, this O() notation always hides huges constants for a c64 but for a f.e. 48 actor sorter it should require at most 48*log2(48) comparators statically in memory and at most log2(48) of those are executed at any given input. => NICE!
2007-10-15 02:01
ChristopherJam

Registered: Aug 2004
Posts: 655
Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)
2007-10-15 08:45
Oswald

Registered: Apr 2002
Posts: 4090
actually I have just arrived to what you're talking about: insertion sort on a linked list of actors :) as one of the drawbacks of insertion sort is the need to move many elements, but with a linked list you can insert the actor into the right position without shuffling bytes around.

ldx actorlink,x
cmp actory,x
bcc nextelement

- insert actor into chain & return to the outer unrolled loop-

nextelement

ldy actorlink,x
cmp actory,y
bcc nextelement

only 11 cycles per comparison. you even have the prev actor's link in one of the registers, so it can be a one way chain easily. guess you already do it like this ;)
2007-10-15 10:53
ChristopherJam

Registered: Aug 2004
Posts: 655
Exactly :)

I traverse a singly linked actor_next list in the end-of-screen interrupt running their movement scripts, and after each has been moved (unless it has just died) I insert it into a singly linked actor_prev list.

Then once all the actors have been processed, I reverse the list.

ldx actor_prev,y
tay
sta actor_next,x
beq done
...

2007-10-15 13:29
JackAsser

Registered: Jun 2002
Posts: 1242
Quote: Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.

I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.

Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)


Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.
2007-10-15 13:37
doynax

Registered: Oct 2004
Posts: 209
Quote: Ah yeah, later that night I realized that the O(log N) was due to parallelism.

To optimize further I think we should look at the aims here:

* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.

First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.

Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.


Unfortunately we still need to sort the first eight sprites in order to reuse them as soon as possible. Unless you use a radically different approach like what Cruzer suggested of course.

There may be cases where we can determine ahead of time that there will be "enough" free raster time anyway and don't perform a perfect sort on some segment of sprites, but as to how you'd implement such a beast in practice..
2017-08-12 11:39
ChristopherJam

Registered: Aug 2004
Posts: 655
Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)

I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.

After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.

For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.

By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.

Total time, a shade under 40 raster lines \O/ (yes, doynax still wins… /o\)

Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort
2017-08-13 13:04
ChristopherJam

Registered: Aug 2004
Posts: 655
…removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.

Down to 38.5 raster lines now, only 12% slower than radix sort.
2017-08-13 21:26
doynax

Registered: Oct 2004
Posts: 209
Very neat. I never would have considered doing a sparse bucket sort :)

Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.

(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
2017-08-14 09:08
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting doynax
Very neat. I never would have considered doing a sparse bucket sort :)


Thanks!

Quote:
Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.


Yes, I wasn't sure how best to benchmark. Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.

FWIW most of the time is "per used bucket", so if there are only 24(?) distinct Y values, I suspect I'd come out ahead, but if there are that many horizontally aligned enemies a zone sort's likely to be sufficient anyway.

I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor, for a fairly unappealing total of (34+62)*32=3072 cycles per frame (48.7 raster lines)

I suspect that if one just wanted to minimise CPU time expended by the sorter in in the border, a plain bucket sort would be better; insertion's only 14 cycles per actor (7 rasters!), and the times it takes a while for 'fetch next' are precisely when the next actor is a fair way down the screen from the one you just dealt with, so timing wouldn't be quite so critical anyway.


Quote:
(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)


Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
2017-08-14 14:18
doynax

Registered: Oct 2004
Posts: 209
Quoting ChristopherJam
Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.
Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Quoting ChristopherJam
I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Quote:
Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations…
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
2017-08-14 15:34
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting doynax

Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.

Indeed. And for the first 8 sprites, you can just LDX immediate directly from $017f down.

Quote:
Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.

Exactly. The plain bucket sort has the same advantage (TBH I pinched the idea from Playstation order-tables - it's how they z-sort the texture mapped triangles, by inserting them into a linked list of null graphics primitives)

Quote:
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects.
Yes; I got there eventually :)
2017-09-04 18:19
Color Bar

Registered: Oct 2012
Posts: 132
A brute force approach seems already very efficient. I think that can be done in 35.6 lines or less for 32 sprite positions and 220 buckets.
;Fill Buckets
ldx input
inc Buckets,x
...
ldx input+31
inc Buckets,x
-------------
10*32=320 cycles assuming input is in ZP.

;Collect sorted values
ldy Buckets
beq next1:   ;Skip if empty.
lda #value1  ;Value corresponding to this bucket.       
pha
dey
bne *-2      ;In case not unique.
sty Buckets  ;Empty the buckets.
next1:
ldy Buckets+1
beq next2:
lda #value2
pha
dey
bne *-2
sty Buckets
next2:
;etc.

If all input numbers are unique the collection stage costs 7 cycles for empty buckets and 19 cycles for the buckets with 1 input value for a total of 7*(220-32)+19*32+320 = 35.6 lines.

If not all input numbers are unique, the cost for a bucket with k numbers is 12+(k-1)*8+7 cycles. Adding one more to a non-empty bucket costs 8 cycles, but since there is then one bucket less with one contribution, and one more empty bucket, one gains 4 cycles overall.

But I could have made a mistake in these calculations or missed something else (initialization, overhead ...).
2017-09-04 22:56
ChristopherJam

Registered: Aug 2004
Posts: 655
Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.

(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )
2017-09-05 06:53
Oswald

Registered: Apr 2002
Posts: 4090
35-40 lines for 32 sprites is crazy good guys 1.x line per sprite.
2017-09-05 13:52
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.

(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )


Oh, I see, thanks.

The fact that multiple sprites can have the same Y values makes the problem considerably harder, it seems.
2017-09-06 23:34
Perplex

Registered: Feb 2009
Posts: 197
Here is some old sort code I have used, I think it's originally from Armalyte, or at least based on that code (by Gary Liddon and Dan Phillips.)

It uses the stack for buckets, so lower your stack pointer to somewhere below 50 (or whatever is your lowest Y-value for sorting), then initialize the stack area above that with $ff values.

NOBJECTS  = 24	; Number of objects to be sorted

OBJ_YPOS  = $80	; Y-values of all objects
OBJ_SORT  = $A0	; Indices of all objects sorted by Y-value
TEMP_SP   = $FF	; Temporary stack pointer

StackSort:

	; Save stack pointer
	tsx		
	stx TEMP_SP

	; Phase 1 - Store Y-values in buckets
	.repeat NOBJECTS, i
	  lda OBJ_YPOS + i
	  ;lsr
	  tax
	  txs
	  pla
	  bpl *-1 ; Already value here, try next
	  lda #i
	  pha
	.endrep

	; Phase 2 - Gather (indices of) sorted values
	;ldx #(50/2)-1
	ldx #50-1
	txs
	.repeat NOBJECTS, i
	  pla
	  bmi *-1
	  sta OBJ_SORT + i
	  lda #$ff ; Clean up after ourselves
	  pha
	  pla
	.endrep

	; Restore stack pointer
	ldx TEMP_SP
	txs
	rts

The two commented out lines are for speeding up the gathering phase by halving each Y-value. Depending on how your multiplexer works, you may not need exact order for very close objects if you handle those at the same time.

In circumstances where limiting the amount of available stack is not viable, it can easily be modified to use some other memory area at the cost of a few cycles per object.
2017-09-07 09:40
The Human Code Machine

Registered: Sep 2005
Posts: 101
While working on Vincenct 2 in the nineties I had the same problem finding a fast solution for sorting sprites and for most purposes this sorter here: http://codebase64.org/doku.php?id=base:flexible_32_sprite_multi..
was the best I came up with. Inspired by Rune Gram-Madsen's Bubble a la Rune sorting routine, published at the Asm-One manual, I wrote a fast implementation for the c64. It's a modified bubble sort working bidirectional and for presorted arrays it's very fast.
2017-09-07 11:17
Color Bar

Registered: Oct 2012
Posts: 132
Perplex, that is interesting, thanks for sharing the code. Do you know of an efficient modification that ensures exact ordering?
2017-09-07 14:53
Perplex

Registered: Feb 2009
Posts: 197
If you need exact ordering even with many values clustered close together, then simply moving on to the next unoccupied bucket won't really work, so you'd need a quite different approach, I think.
2017-09-07 22:19
Hein

Registered: Apr 2004
Posts: 802
Using bucket sort ($10 size) and insertion sort to sort the indices of a list of y-values. In this case it takes around 40 rasterlines for 32 sprites when placed on zp. It assumes there's no need for more than 8 sprites per bucket, but 15 are possible.

The multiplexer checks per bucket based on the counter at the end of each bucket.


	*=$10
clear_bucket_counter

	lda #0
	sta buckets+15
	sta buckets+15+16*1
	sta buckets+15+16*2
	sta buckets+15+16*3
	sta buckets+15+16*4
	sta buckets+15+16*5
	sta buckets+15+16*6
	sta buckets+15+16*7
	sta buckets+15+16*8
	sta buckets+15+16*9
	sta buckets+15+16*10
	sta buckets+15+16*11
	sta buckets+15+16*12
	sta buckets+15+16*13
	sta buckets+15+16*14
	sta buckets+15+16*15

	
	ldx #32
set_element_index
	stx element_index
	
	lda sprite_y,x
	ldx #$f0
	sbx #$00	;select right bucket

	ldy buckets+15,x	;get bucket counter (# elements in bucket)
	inc buckets+15,x	;increase bucket counter

	stx set_2+1	;set high nybble bucket index
	dey
	bmi place_current_element	;element count=0 skip sort
			
	stx get_1+1	
	stx get_2+1	
	inx
	stx set_1+1
	
			
	sty slot_hi+1	;highest index
	
compare_element_y
get_1	ldx buckets,y
	cmp sprite_y,x
	dey
	bcc insert_lower_element	;woohoo, found a spot
	bpl compare_element_y
	
insert_lower_element
	sty slot_lo+1	

slot_hi	ldy #0

move_higher_elements

get_2	lda buckets,y		;move higher elements up
set_1	sta buckets,y
	dey
slot_lo	cpy #$ff		
 	bcc move_higher_elements

place_current_element		;finally, insert the lower element index in the right place
	iny
	lax element_index
set_2	sta buckets,y
	dex
	bne set_element_index

	rts

element_index
	.byte 0	


	*=$1000
buckets
	.fill 256,0


2017-09-08 13:16
Fresh

Registered: Jan 2005
Posts: 81
I use a slightly modified version of "Ocean" sorter, with a (somewhat) optimized backward sorting routine.
Its worst case is significantly better than plain Ocean.
Code in kickass syntax.

.const	NSPRITES  	= 32
.const	indexes		= $60
.const	ypos		= $80

// SORT FORWARD
// ************
// Check if values are sorted, if necessary call sortbw routine to
// put an unsorted value in the correct position
sortfw:	
	.for(var i=0;i<NSPRITES;i++) {
		over:
			.if (i!=NSPRITES-1) {
				ldy indexes+1+i
				ldx indexes+i
				lda ypos,y
				cmp ypos,x
				bcs sortfw[i+1].over								
				jsr sortbw[NSPRITES-2-i].back				
			}
			else	rts			
	}

// SORT BACKWARD
// *************
// Move backward a value in .A which is less than the one in current position.
// All values at the left of the current position are ordered so, once we found
// the correct spot for .A, we don't need to check forward, we can get back
// to where the function was called
sortbw:
	.for(var i=0;i<NSPRITES-1;i++)	{
		back:
			stx indexes+NSPRITES-1-i
			.if (i!=NSPRITES-2) {
				ldx indexes+NSPRITES-3-i
				cmp ypos,x						
				bcc sortbw[i+1].back		
			}
			sty indexes+NSPRITES-2-i
			rts			
	}
2017-09-11 00:29
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.

(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )


I think that can be done in less than 37 rasterlines.

Consider the following unrolled code for outputting sprite indices, sorted according to Y-value (phase 2)

.bucketK
bcc .nextBucket		;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
lda #sprite_index	;index of last sprite
pha		
...	
lda #sprite_index  	;index of first sprite.
pha		
sty .bucketK+1	;Restore branch (to skip all sprites for empty buckets). Call with y=$1b for max 8 sprites per bucket. 

.nextBucket	;unrolled
-------------
3 cycles for an empty bucket, 12 voor buckets with 1 sprite => 3*(220-32)+12*32 = 948 cycli 


Phase1: The following unrolled code jumps to the code that modifies the above phase 2 code for the right bucket:
dex   ;counts sprite index. Initialized with $20.
bmi .phase2
lda Yvalues,x   
asl
bcs *+8   ;There are more than 128 buckets, so I use 2 jump tables
sta *+4
jmp ($xxxx) ;jump table for buckets <128	;jumps to code that fills out phase2 Bucket code (see below)
sta *+4
jmp ($xxxx) ;jump table for buckets >=128	;21/22 cycles sofar
.phase2
clc	;to branch always (either to next bucket or to fetch the sprite indices in the current bucket).
ldy #27		;to restore the branch value of the bucket
jmp .Bucket0
           

;code to fill out bucketK
ldy .bucketK+1   ;The trick is that I use this as an index for the number of sprites already in the bucket
lda newBranchValue,y		;fetch value to branch to the code for one more sprite 
sta .bucketK+1   ;since this is unrolled code, the address of .bucketK is known
tay
txa
sta .bucketK+3,y  ;Store new sprite index. This works since the branch value decreases by 3 for each new sprite, but the address where to store the sprite index as well!
...	;Unrolled => repeat all the above code starting with dex.
--------------
41/42 cycles ->41.5*32 = 1328

In total 1328+948 =2276 -> 36.1 lines.
But there will be at least 25 additional cycles from page boundary crossings. Slightly less than 37 lines I think if I did not overlook something.

Table with branch values (for max 8 sprites per bucket) looks like (x=don't care)
newBranchValue
 BYTE 0 x x 0 x x 3 x x 6 x x 9 x x 12 x x 15 x x 18 x x x x x 21 x 

The first zero makes that a full bucket replaces the last sprite with the new one.
2017-09-12 11:28
ChristopherJam

Registered: Aug 2004
Posts: 655
Excellent work, Color Bar.

If every sprite lands in the second jump table, that pushes you up to 42 cycles for every insertion, but total time in that case is still only
3*(220-32)+12*32 + 42*32 = 2292 cycles, or 36.4 raster lines.

The 25 additional cycles from page boundary crossings you mention can be avoided by padding the output routine out to 32 bytes by adding three bytes of padding after the BCC .nextBucket, and starting bucket0 at $xxff

It takes up another 660 bytes, but the routine is already occupying most of a bank as it is, so that's small change at this point ;)
2017-09-12 12:36
ChristopherJam

Registered: Aug 2004
Posts: 655
Should be able to save about 1760 bytes by only including the last 12 bytes of the insertion code in every third routine; the preceding and succeeding routine for each complete routine can just jump to the nearest complete one (the part from the second sta *+4 onward).

Timing's unchanged, but drags it back down to around 14k including jump tables.
2017-09-12 13:48
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam
Excellent work, Color Bar.
The 25 additional cycles from page boundary crossings you mention can be avoided by padding the output routine out to 32 bytes by adding three bytes of padding after the BCC .nextBucket, and starting bucket0 at $xxff

It takes up another 660 bytes, but the routine is already occupying most of a bank as it is, so that's small change at this point ;)

Those 3 padding bytes can be put to good use by allowing 9 sprites in a bucket.

The routine probably uses too much memory for many applications.
2017-09-12 13:49
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam
Should be able to save about 1760 bytes by only including the last 12 bytes of the insertion code in every third routine; the preceding and succeeding routine for each complete routine can just jump to the nearest complete one (the part from the second sta *+4 onward).

Not sure I understand this. You need a separate fill routine for every possible Y-value/bucket.
2017-09-12 14:10
Color Bar

Registered: Oct 2012
Posts: 132
Btw, if you just want to sort 32 random numbers, the phase 2 code could be modified into
.bucketK
bcc .nextBucket
lda #K
pha
lda #K
pha
...
pha
sty .BucketK+1

and the fill code can be limited to
ldy .bucketK+1 
lda newBranchValue,y		
sta .bucketK+1


Resulting in 3*(220-32)+12*32 + 33.5*32 -> 32 lines. Put perhaps that application requires 256 buckets and then it takes 33.8 lines.
2017-09-12 18:19
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Color Bar
Not sure I understand this. You need a separate fill routine for every possible Y-value/bucket.


True, but every one of the 220 routines ends with
.hitableN
  sta *+4
  jmp ($xxxx) ;jump table for buckets >=128
.phase2N
  clc	                ;to branch always.
  ldy #27	   	;to restore the branch value of the bucket
  jmp .Bucket0


You can leave that tail off two thirds of them, eg only keeping it for buckets 1,4,7,10 etc.
The routine for bucket 0 would replace the bcs *+8 with a branch to the hitable lable in the the routine for bucket 1, as would the routine for bucket 2
2017-09-12 18:23
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Color Bar
The routine probably uses too much memory for many applications.


A quarter of all RAM? Yes, that's probably a tad excessive in practice :)

It'd be interesting to have a map of best worst case routine for different numbers of actors, and different limits on memory use... could even have CPU time as altitude…
2017-09-12 19:30
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam
Quoting Color Bar
Not sure I understand this. You need a separate fill routine for every possible Y-value/bucket.


True, but every one of the 220 routines ends with
.hitableN
  sta *+4
  jmp ($xxxx) ;jump table for buckets >=128
.phase2N
  clc	                ;to branch always.
  ldy #27	   	;to restore the branch value of the bucket
  jmp .Bucket0


You can leave that tail off two thirds of them, eg only keeping it for buckets 1,4,7,10 etc.
The routine for bucket 0 would replace the bcs *+8 with a branch to the hitable lable in the the routine for bucket 1, as would the routine for bucket 2


Yes, I understand it now. Thanks!
2017-09-12 19:41
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Quoting Color Bar
The routine probably uses too much memory for many applications.


A quarter of all RAM? Yes, that's probably a tad excessive in practice :)

It'd be interesting to have a map of best worst case routine for different numbers of actors, and different limits on memory use... could even have CPU time as altitude…


I am wondering if there aren't any theoretical bounds known in computer science, about memory usage and/or/vs CPU time. But I guess these would have to be C64 specific to be of any use.
2017-09-13 08:58
ChristopherJam

Registered: Aug 2004
Posts: 655
Got a new record now, I think.

We can pack 221 addresses into a 222 byte table by using the high byte of each address as the low byte of the next address.
If we use something like this:
for i in 0..73:
    .byt 64+(i%74)*(1+i//74*24)%74)
.. there are no two addresses less than 24 bytes apart.


The fill routine then drops down to this:
    ;code to fill out bucketK.   24 bytes, 36 cycles
fillK
    ldy .bucketK+1
    lda newBranchValue,y
    sta .bucketK+1         ; 12
    tay
    txa
    sta .bucketK+3,y       ;  9

    dex
    lda Yvalues,x
    sta *+4
    jmp ($xxxx)            ; 15

    ; 36*32 =  1152 cycle, 36 per actor

Just 24 bytes long :)
We avoid the check for x becoming negative by placing actor $ff at y=0, and having the 0 entry of the branch table point to a routine
that checks X to determine whether we've reached the end of the actor list, or are just dealing with an actor that has gone offscreen.

The bucket emptying code remains the same as in Color Bar's original:
.bucketK ;32 bytes
    bcc .nextBucket        ;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
    !byt 0,0,0
    lda #sprite_index   ; repeat lda/pha eight times
    pha
...
    sty .bucketK+1    ;Restore branch (to skip all sprites for empty buckets). Call with y=$1e for max 8 sprites per bucket.
.nextBucket    ;unrolled

    ; 3 cycles for an empty bucket, 12 voor buckets with 1 sprite => 3*(220-32)+12*32 = 948 cycli , 29.6 per actor



The bucket filling routines are packed three to a page over 74 pages, so around 19kb. Each page has an average of 184 bytes empty, so we can probably pack the bucket emptying code into the gaps. Note that some of the buckets would then need to branch past a group of bucket filling routines, so we'd need to use both X and Y for the two different values to restore the branches to "this bucket is empty"

So - total time, 36*32+3*(220-32)+12*32= 220*3+45*32 = 2100 cycles. 65.6 per sprite, 33.33 rasters total.

For 37 sprites or more, we're down to under a raster per sprite :D
2017-09-13 09:16
Color Bar

Registered: Oct 2012
Posts: 132
Great! I still have to study your post in detail, but maybe another 64 cycles gain be gained by adding 3 NOPs before the first sprite in the bucket, like this:
.bucketK
bcc .nextBucket
lda #sprite_index8
pha
...
nop
nop
nop
lda #sprite_index1
pha
sty .bucketK+1


Then the TAY can be removed from the fill code:
ldy .bucketK+1
lda newBranchValue,y
sta .bucketK+1
txa
sta .bucketK,y

and the branch value table changes to:
newBranchValue
 BYTE 0 x x 0 x x 3 x x 6 x x 9 x x 12 x x 15 x x 18 x x 21 x x x x x 27 x

Worst case for the phase 2 code is still 948 cycles, I think.
2017-09-13 09:46
ChristopherJam

Registered: Aug 2004
Posts: 655
Ah, so you're always storing 3 bytes earlier than the previous branch target?

I like it, but I'm not sure if that works for the very first sprite inserted into that bucket, unless you have the branch starting out by landing on the sty .bucketK+1

..which would then cost you an extra four cycles per empty bucket.
2017-09-13 09:53
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam
Got a new record now, I think.

We can pack 221 addresses into a 222 byte table by using the high byte of each address as the low byte of the next address.

Clever idea!

I am not familiar with this notation
for i in 0..73:
    .byt 64+(i%74)*(1+i//74*24)%74)

What do i%74 and i//74 mean?
2017-09-13 10:01
ChristopherJam

Registered: Aug 2004
Posts: 655
Thanks!

Mod and divide (TBH I should have dropped from // back to / when I switched from python to pseudo asm). Oh, and that loop should have been to 74*3

Really, better expressed as

for i in 0..73:
    .byt 64+(i*1)%74

for i in 0..73:
    .byt 64+(i*25)%74

for i in 0..73:
    .byt 64+(i*49)%74
2017-09-13 10:10
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Ah, so you're always storing 3 bytes earlier than the previous branch target?

I like it, but I'm not sure if that works for the very first sprite inserted into that bucket, unless you have the branch starting out by landing on the sty .bucketK+1

..which would then cost you an extra four cycles per empty bucket.


Yes, you are right.
2017-09-13 12:18
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam


The bucket emptying code remains the same as in Color Bar's original:
.bucketK ;32 bytes
    bcc .nextBucket        ;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
    !byt 0,0,0
    lda #sprite_index   ; repeat lda/pha eight times
    pha
...
    sty .bucketK+1


!byt 0,0,0 could be replaced by LDA #value + PHA, i.e., allow a maximum of 9 sprites per bucket, but I guess it doesn't really matter.
2017-09-14 16:12
ChristopherJam

Registered: Aug 2004
Posts: 655
The Human Code Machine, thanks for that codebase link. Must admit, I didn't find that area when I added the flagged bucket sort to the maths and algorithms section. Might need to add some crosslinks..

FWIW though, looks like that one has the same worst case as the one Oswald referenced in the opening remark to this thread. (32*31/2)=496 swaps, 26 cycles per swap, 496*26/63=204 raster lines.
2017-09-14 16:19
Frantic

Registered: Mar 2003
Posts: 1309
I added some crosslinks @ codebase to make people aware about this partial overlap in contents in the two different sections.

Preferably generic sorting algorithms should be placed in the maths and algorithms section, and sorting tailored specifically towards fast sorting of sprites/multiplexing could be kept on the sprites page, if someone feels like cleaning this up.
2017-09-15 05:47
ChristopherJam

Registered: Aug 2004
Posts: 655
Thanks Frantic, I'll move my post.
2017-09-16 09:31
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam

    ;code to fill out bucketK.   24 bytes, 36 cycles
fillK
    ldy .bucketK+1
    lda newBranchValue,y
    sta .bucketK+1         ; 12
    tay
    txa
    sta .bucketK+3,y       ;  9

    dex
    lda Yvalues,x
    sta *+4
    jmp ($xxxx)            ; 15

    ; 36*32 =  1152 cycle, 36 per actor


I think another 2 cycles per actor can be gained by using A as the actor counter:
ldy .bucketK+1
ldx newBranchValue,y
stx .bucketK+1
sta .bucketK+3,x
sbc #$01
.start
tax
ldy Yvalues,x
sty *+4
jmp ($xxxx)

Initialize with SEC, LDA #$1f, JMP .start

Total time: 220*3+43*32 = 2036 cycles or 32.3 rasters!
2017-09-16 16:32
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Color Bar

I think another 2 cycles per actor can be gained by using A as the actor counter


Sweet! Yes, that should work. Should probably do a test implementation and upload this beast somewhere. Now if we could just shave off one more cycle per actor...
2017-09-18 12:03
Color Bar

Registered: Oct 2012
Posts: 132
Perhaps not practical, but maybe interesting to discuss:
If you are sure the maximum number of actors in a bucket is never exceeded, 2 more cycles per actor (and 2 bytes) can be shaved off the bucket filling routines by using some BOCs (bonus opcodes = illegals)
lax .bucketK+1
sbx #$06
stx .bucketK+1
tya
sta .bucketK+3,x
dey
lda Yvalues,y
sta *+4
jmp ($xxxx)

where Y is now the actor counter. The bucket emptying code should then be modified according to
bcc .nextBucket
lda #sprite_index
pha
skw $abcd          ;opcode $0c
lda #sprite_index  ;Always 6 bytes for every sprite.
pha                ;Therefore sbx #$06 can be used in the fill routines.
skw $abcd
...
lda #sprite_index
pha
sty .bucketK+1

If we allow up to 10 actors and patch the filling routines with 2 bytes to 64 bytes total, the page crossing penalties can be avoided. Although this algorithm uses a LOT of memory, it may be worth it in some demo effects if speed is the bottleneck. And besides, the absolute minimum possible CPU time is a worthy goal by itself.

As Christopher Jam suggested, we can have the 0 entry of the branch table point to a routine that checks whether we've reached the end of the actor list, or are just dealing with an actor that has gone offscreen.

I think we can then shave off 2 more bytes if the Y values are stored on the stack (terminated by 0) and LDA Yvalues,Y is replaced by PLA.
2017-09-18 13:44
Trash

Registered: Jan 2002
Posts: 96
I dont think this is really exactly what you guys are discussing but I suggest you all to have a look at the sortingroutines used in Time Machine, I have been given an explanation but I don't really understand how it works. I have code that sorts 32 elements just shy of 1400 cycles (if I remember correctly) with a constant speed using the explanation I was given, since it's HCLs I wont share it without his explicit permission but dropping a hint on where to look must be forgiven.
2017-09-18 14:25
ChristopherJam

Registered: Aug 2004
Posts: 655
re. SKW etc: Ooh, dangerous! So I think we're up to around 10k for the PHA routines now?

Will have to see if that can be sanely(!) combined with my thinking from today:

I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
    ldx bucketK+1
    ldy newBranchTableValue,x
    sty bucketK+1
    pla ; grab this actor's index
    sta bucketK+3,y       ;21

    pla ; grab next Yvalue
    sta *+4
    jmp (jumptable)       ;11

(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).

so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :)
2017-09-18 14:26
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Trash
...1400 cycles...


!!!
2017-09-18 15:27
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam

I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :)

Awesome! Less than 1 raster per actor :-)
I actually thought of that as well, but won't that cost you somewhere else? I mean you have to write updated Y values to the stack in that form and maybe that is more costly than writing them non-interleaved? But perhaps not when you use unrolled code ...
2017-09-18 15:41
Color Bar

Registered: Oct 2012
Posts: 132
Quote: I dont think this is really exactly what you guys are discussing but I suggest you all to have a look at the sortingroutines used in Time Machine, I have been given an explanation but I don't really understand how it works. I have code that sorts 32 elements just shy of 1400 cycles (if I remember correctly) with a constant speed using the explanation I was given, since it's HCLs I wont share it without his explicit permission but dropping a hint on where to look must be forgiven.

Since it is Time Machine, it must be ahead of us :-)

Very likely I won't understand that code without any thorough explanation, but is it for the same range of Y values (number of buckets)? And can multiple actors have the same Y values?

Maybe HCL wants to join the discussion?
2017-09-18 16:30
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam

(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).

I remember this post:
SID to calculate line slope
Maybe that could work?
2017-09-18 17:25
lft

Registered: Jul 2007
Posts: 303
Quoting ChristopherJam
I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea


I haven't been following this discussion in detail, so I don't know if it helps, but: If your routine takes e.g. 31 cycles, then you can mask the timer value by 31, and it will appear to increment.
2017-09-18 18:16
ChristopherJam

Registered: Aug 2004
Posts: 655
Ok, a few random thoughts.


I see now that the routine in comment #103 takes the same number of cycles as that in #105; it just fetches an actor ID with TYA:DEY instead of PLA, and fetches the next Y coordinate with lda YV,Y instead of PLA.

Even if the Y coordinates are on the stack, you still need the DEY to update the actor index.

I can't see a way to combine them, so suspect I'd lean towards my routine (same speed, 5k less unrolled code, 64 bytes more stack usage, faster performance when many actors are on the one line).

Rather than copying the Y coordinates to the stack, I'd make that their home, and also only write the actor IDs once at startup. You would of course have to be very careful to ensure no interrupts while running that section of code, NMI included


Quoting Color Bar

SID to calculate line slope
Maybe that could work?


Oh god. Quite possibly, but I'd rather avoid hamstringing music/effects.

lft; sadly no cycles left for masking :)

*downloads Time Machine to examine later*
2017-09-18 18:54
Color Bar

Registered: Oct 2012
Posts: 132
Quoting ChristopherJam
Ok, a few random thoughts.


I see now that the routine in comment #103 takes the same number of cycles as that in #105; it just fetches an actor ID with TYA:DEY instead of PLA, and fetches the next Y coordinate with lda YV,Y instead of PLA.

Even if the Y coordinates are on the stack, you still need the DEY to update the actor index.

I can't see a way to combine them, so suspect I'd lean towards my routine (same speed, 5k less unrolled code, 64 bytes more stack usage, faster performance when many actors are on the one line).

Rather than copying the Y coordinates to the stack, I'd make that their home, and also only write the actor IDs once at startup. You would of course have to be very careful to ensure no interrupts while running that section of code, NMI included


Yes, your approach has some advantages, including not having to assume that the maximum number of actors per bucket is never exceeded. Unfortunately it does not use any illegals though :-(
2017-09-18 20:57
Trash

Registered: Jan 2002
Posts: 96
So i examined the tickcounts of the code I've got, with heavy use of zeropage it can sort 32 elements in 1188 cycles, without zp it does the same job in 1448 cycles (unless my calculations are incorrect). The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work (eg two lda, sta in two different tables)...'

*edit*

HCL actually drops a hint for this sorter in the commentsection of Classics
2017-09-18 23:22
Color Bar

Registered: Oct 2012
Posts: 132
Quote: re. SKW etc: Ooh, dangerous! So I think we're up to around 10k for the PHA routines now?

Will have to see if that can be sanely(!) combined with my thinking from today:

I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
    ldx bucketK+1
    ldy newBranchTableValue,x
    sty bucketK+1
    pla ; grab this actor's index
    sta bucketK+3,y       ;21

    pla ; grab next Yvalue
    sta *+4
    jmp (jumptable)       ;11

(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).

so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :)


I'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster.
2017-09-19 08:04
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Color Bar
I'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster.


Damn, you're right. The SBX#$06 takes two cycles less than the LDY newBranchTableValue,x

FWIW, the 'more than one sprites on a line' case could be improved by replacing the SKW instructions with JMP *+3; it saves a cycle. And you're still using SBX at least :)
2017-09-19 08:07
ChristopherJam

Registered: Aug 2004
Posts: 655
Quoting Trash
. The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work


Ah, it's not a perfect sort of eight bit values? Comparing apples and oranges in that case..
2017-09-19 08:44
Axis/Oxyron

Registered: Apr 2007
Posts: 82
Quote: So i examined the tickcounts of the code I've got, with heavy use of zeropage it can sort 32 elements in 1188 cycles, without zp it does the same job in 1448 cycles (unless my calculations are incorrect). The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work (eg two lda, sta in two different tables)...'

*edit*

HCL actually drops a hint for this sorter in the commentsection of Classics


That ADC STA sequence mentioned in the comments of classics looks like 1 of the passes of a radix-sort. But in pure speed that wont be competitive to the ideas already posted here. Atleast as long as you dont reduce accuracy and really work with 200 different y-positions. My latest version has around 2400 cycles for the given use-case, if I remember right.

But the radix approach has some really nice properties compared to the bucket-sort mostly used here. It sorts in perfect 8 bit accuracy and sorts stable. That makes it more generally usable (e.g. for sorting 3D Bobs and polygons without any flickery). And in critical cases it can even avoid bugs in multiplexing. E.g. when big groups of sprites batch in small areas.
2017-09-19 10:47
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Quoting Color Bar
I'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster.


Damn, you're right. The SBX#$06 takes two cycles less than the LDY newBranchTableValue,x

FWIW, the 'more than one sprites on a line' case could be improved by replacing the SKW instructions with JMP *+3; it saves a cycle. And you're still using SBX at least :)


Yes, I was a bit too eager to use illegals. Nevertheless, what I don't like about post #103 is that it will crash when there are more than 10 sprites on a line.

So I lean to sticking to the combination of posts #91 and #101.
2017-09-19 10:59
Color Bar

Registered: Oct 2012
Posts: 132
Quoting Axis/Oxyron
That ADC STA sequence mentioned in the comments of classics looks like 1 of the passes of a radix-sort. But in pure speed that wont be competitive to the ideas already posted here. Atleast as long as you dont reduce accuracy and really work with 200 different y-positions. My latest version has around 2400 cycles for the given use-case, if I remember right.

But the radix approach has some really nice properties compared to the bucket-sort mostly used here. It sorts in perfect 8 bit accuracy and sorts stable. That makes it more generally usable (e.g. for sorting 3D Bobs and polygons without any flickery). And in critical cases it can even avoid bugs in multiplexing. E.g. when big groups of sprites batch in small areas.


The approach proposed here (the combination of posts #91 and #101 or #103) counts down the actor list, so when there are multiple actors with the same Y value, they are added to the bucket in reverse order, but in the second pass the order is restored. So I think this approach is stable as well.

If think the method also has full 8 bit accuracy. The numbers of cycles mentioned are for visible sprite positions and that means 220 buckets. This can be increased to 255 buckets.
2017-09-20 17:41
lft

Registered: Jul 2007
Posts: 303
Field Sort
2017-09-20 18:10
Color Bar

Registered: Oct 2012
Posts: 132
Nice!

I think this 32K of ram usage you refer to applies to combining posts #91 and #103. When combining #91 and #101 it can be considerably lower (but it will still be a lot).

Btw, how many sprites with the same Y value can your routine handle?
2017-09-20 18:46
ChristopherJam

Registered: Aug 2004
Posts: 655
Yes, nice work! This will take some investigating to unravel.
SHX $FE00,Y
^ is basically stx $fe00,y on account of the mask being $fe+1, yes?
2017-09-20 19:39
lft

Registered: Jul 2007
Posts: 303
Color Bar, yes I realise that you went through a number of different versions, trading memory for cycles. But I thought it fair to mention the version that beats mine in running time.

There can be any number of sprites on the same line; They are linked. Of course, the multiplexer drops them if there are too many, as is evident from the flickering in the demo.

This is a generic full random sorting routine for 8-bit values, although in the demo it is limited to 220 different y-values to make it comparable with your benchmarks.

Christopher Jam: That is correct. It's nice to find a use for this opcode where it actually improves performance, because it's quite constrained: The masking drops off if the instruction is suspended by DMA, and page crossings cause the wrong value to be written. So the best bet is indeed to stick to the operand $fe00.
2017-09-20 19:42
Color Bar

Registered: Oct 2012
Posts: 132
Quote: Yes, nice work! This will take some investigating to unravel.
SHX $FE00,Y
^ is basically stx $fe00,y on account of the mask being $fe+1, yes?


Yes, check out this document (pages 38 and 41):
No More Secrets v0.91
2017-09-20 19:47
Groepaz

Registered: Dec 2001
Posts: 8213
lft: now it would be very cool if you could expand this routine to handle excess sprites in a way that they will be displayed on alternating frames ("interlaced") - then it would work perfectly fine for a game (MUCH better than sprites disappearing)
2017-09-20 23:13
ChristopherJam

Registered: Aug 2004
Posts: 655
Dear lords. So the n-actors snippets leading up to $fe00 insert each actor ID in to a linked list with the head at $aa00+yposition, and the 'next' pointers are at $10+actorID, and also replaces the INY at $fe00+yposition with a JMP to a bucket emptying routine.

Here's a sample snippet:
.C:fb52  A4 90       LDY $90+0     ; Y coordinate of actor 0
.C:fb54  9E 00 FE    SHX $FE00,Y   ; this is effectively STX $fe00,y
.C:fb57  B9 00 AA    LDA $AA00,Y   ; bucket head
.C:fb5a  85 10       STA $10+0     ; next pointer for actor 0
.C:fb5c  A9 00       LDA #$00      ; actor id 0
.C:fb5e  99 00 AA    STA $AA00,Y


The INYs that fill $fe00 to $feff increment Y through all the potential bucket indices, jumping off to a bucket emptying routine wherever an INY has been replaced with a JMP

There are four routines at 4c4c, 4cc8, c84c and c8c8 to handle each of the cases where zero to two of the next two buckets are also nonempty.

The bucket emptying routine takes advantage of the list heads being stored in page $aa to reuse the $aa as a TAX instruction when looping over a bucket that contains more than one actor, and the lists are terminated with $c8, which is both negative and doubles as the INY that needs to be written back to $fexx to clear the bucket for the next frame. (it's also written to $aaxx, just like the flagged bucket sort on codebase).

Each bucket emptying routine returns to the scene of the crime after writing the INY back, which bumps Y the value needed for the next bucket. Hence, the four bucket emptying routines are in fact identical; here's a sample:
.C:4c4c  8C 5E 4C    STY $4C5E     ; patch return address
.C:4c4f  BF 00 AA    LAX $AA00,Y   ; fetch bucket head
.C:4c52  48          PHA
.C:4c53  B5 10       LDA $10,X     ; read 'next' pointer.
.C:4c55  10 FA       BPL $4C51     ; Actor ids are positive, INY is not
.C:4c57  99 00 FE    STA $FE00,Y   ; restore INY
.C:4c5a  99 00 AA    STA $AA00,Y   ; empty linked list
.C:4c5d  4C 7E FE    JMP $FE7E


I think that just about covers it. Lft, that's a mighty fine piece of work.
2017-09-20 23:24
ChristopherJam

Registered: Aug 2004
Posts: 655
Color Bar, thanks for the No More Secrets rec; I wasn't familiar with that one.
2017-09-21 00:57
Trash

Registered: Jan 2002
Posts: 96
The sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).

It should be competive to the discussed code both in size and speed.
2017-09-21 01:03
Cruzer

Registered: Dec 2001
Posts: 892
Hilariously constrained addresses indeed. :) And my respect for lft just grew even higher.

Quoting Groepaz
lft: now it would be very cool if you could expand this routine to handle excess sprites in a way that they will be displayed on alternating frames ("interlaced") - then it would work perfectly fine for a game (MUCH better than sprites disappearing)
I would like no interlacing or disappearing sprites even better. In the ideal world, the routine moving the sprites around should avoid situations with >8 sprites on the same line. And the multiplexor should of course strive for being able to display any situation with <=8 sprites per line.
2017-09-21 01:54
Groepaz

Registered: Dec 2001
Posts: 8213
thats not really useful for a game - in a demo of course that is always preferred - but in a game it would be SO much better if you could move around a large amount of objects without having to care about multiplexer limitations in the game logic. (if you look at gameboy or NES games for example - thats how a lot of them work)
2017-09-21 05:27
ChristopherJam

Registered: Aug 2004
Posts: 655
…or then there was Alien Syndrome, where you could never have more than one alien on any given raster 😂
2017-09-21 07:28
lft

Registered: Jul 2007
Posts: 303
Quoting Trash
The sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).

It should be competive to the discussed code both in size and speed.


Thanks for the link! That does indeed seem useful.

Here is a straighforward implementation:

for each actor i
   ldx ypos+i
   inc count,x

   lda #0
   clc
for each count k
   sta start+k
   adc count+k

for each actor i
   ldx ypos+i
   ldy start,x
   lda #i
   sta output,y
   inc start,x
   lda #0
   sta count,x


Total number of cycles is 10*i + 4 + 8*k + 28*i, so for our benchmark case with 32 actors and 220 y-positions that would be 2980 cycles, or 47.3 rasterlines. So the above implementation is not competitive.

So I suppose the game is on to try to improve it!
2017-09-21 08:21
Trash

Registered: Jan 2002
Posts: 96
Quoting lft
So I suppose the game is on to try to improve it!


Try losing precision on K by the right amount and in the right way and it will improve without losing to much accuracy....
2017-09-21 23:55
Fresh

Registered: Jan 2005
Posts: 81
A less "aggressive" approach to lft's solution can be obtained by using brk instead of jmp.
It's clearly a bit slower but the emptying routine is only one and be put everywhere: a bit less messy and more "game-friendly".

irq:
	sty $ff+stack	// Fix stack for RTI
	lda buckethead,y
irqloop:	
	sta output,x
	tay
	inx
	lda list,y
	bpl irqloop
	ldy $ff+stack
	sta table,y
	sta sectable,y
	rti	
2017-09-22 07:06
lft

Registered: Jul 2007
Posts: 303
Fresh, that is true, and you even get to place the INY field on any page you like. That's because the opcode is 0, so it doesn't matter that SHX randomly masks it with the destination address + 1.

I would suggest that you store-absolute the output, perhaps to the zero-page, and self-modify the instruction to increment the pointer. That'd free up X as a scratch register, and you wouldn't have to restore Y from the stack area.

But there are drawbacks with this approach. As you mention, it is a bit slower: 15 cycles per actor, after implementing my suggestion above, so that'd give us a total of 43 rasterlines instead of 35. And you cannot do the sorting in main context while interrupts are running, because you'd lose even more cycles distinguishing between BRK/IRQ. More subtly, if a BRK coincides with an interrupt, the interrupt is lost, although the peripheral will typically keep asserting the (maskable) interrupt until it is eventually serviced.
2017-09-22 09:48
HCL

Registered: Feb 2003
Posts: 667
Oh.. i'm late into the discussion again :P

First, i didn't invent the kind of sorting that is used in TimeMachine (and many other Booze-demos). It seems that i am not very good in inventing things. Perhaps i'm better at looking at other's code, understanding it, and making benefit to my own needs. I think i stole it from some demo from the late-80:s..

The nice thing about it is that i wrote a code-generator for it. Perhaps it was for the 3d-Bob-part in TimeMachine.. When i change 3d-bob-object it generates sorting code for the number of bobs that i currently have. Neither that being rocket science, but it helped making up to 32 bobs update each frame :).
2017-09-22 12:11
Fresh

Registered: Jan 2005
Posts: 81
@lft: perfect analysis, no surprise from you! :)
2017-09-22 21:29
lft

Registered: Jul 2007
Posts: 303
@HCL: Could you comment on how to reduce the number of buckets without losing too much accuracy? Trash is being deliberately enigmatic for fear of breaking a confidence, so we could really use a hint. Or would you rather have us dig into the code and see for ourselves?
2017-09-22 22:17
Trash

Registered: Jan 2002
Posts: 96
Quote: @HCL: Could you comment on how to reduce the number of buckets without losing too much accuracy? Trash is being deliberately enigmatic for fear of breaking a confidence, so we could really use a hint. Or would you rather have us dig into the code and see for ourselves?

I do dare to tell you now when I know he borrowed the solution...

This is for sprites (eight per row):

If you remove the last three bits by using lsr you have your new K-value BUT you have to pair it with the original value. So what you are doing is that you assign your actors a value remove lose accuracy (lsr, lsr, lsr or a LUT (K = max value of that table)) store that value in the table you actually sort.

I might be wrong but this works for me using almost exactly your example code...
2017-09-24 01:38
Copyfault

Registered: Dec 2001
Posts: 208
This Countingsort got me thinking...

So I had a closer look at lft's straight forward implementation given before and remembered
Quoting Trash
Try losing precision on K by the right amount and in the right way and it will improve without losing to much accuracy....

What about doing it as follows:
   ldx #offset
for SPRITE = 0 to n-1
   dcp (ypos(SPRITE),x)  ;operand byte is also memory for ypos
next
   lda #$ff
   ldx #$00
for LINE = 0 to 2*(floor((k/2)-2)) STEP 2
   stx start+LINE
   sbx #$00              ;operand byte is memory for the count-array
next
   stx start+2*(floor((k/2)-1))
for SPRITE = 0 to n-1
   ldx ypos(SPRITE)
   ldy start,x
   lda #SPRITE
   sta output,y
   inc start,x
   lda #0
   sta (offset,x)
next

Remarks:
- all loops have to be unrolled
- instead of incrementing the count-array entries for every occurance of the corresponding LINE-value they are decreased, giving [$00-#occurances of LINE] in each array position count[LINE]. But as the adc-instruction in the 2nd loop is also substituted by an sbx-command, this leads to the same result without having to care about the state of carry.
- 2nd byte of each dec-cmp-instruction suits also as memory where the ypos-val for the corresponding sprite-number is stored, i.e.
   ...
Sprite0_ypos = *+1
   dcp (ypos(Sprite0),x)
Sprite1_ypos = *+1
   dcp (ypos(Sprite1),x)
   ...
- only evenly spaced values for the ypositions admitted, i.e. consecutive values must have distance=2
- each ypos corresponds to a pointer in zp pointing to the corresponding operand byte of the sbx#$..-instructions. The 2nd loop would look like
   lda #$ff
   ldx #$00
   stx start+0
count_ypos0 = *+1
   sbx #$00
   stx start+2
count_ypos2 = *+1
   sbx #$00
   stx start+4
   ...
with the corresponding zp part
ypos0 .byte <count_ypos0, >count_ypos0
ypos2 .byte <count_ypos2, >count_ypos2
...

- "offset" is introduced to be able to keep away from the $00/$01-adresses


Beside the ram usage for the start- and output-table the given approach needs a fair amount of zp-memory (one vector per possible sprite-ypos!). This most probably makes it not very usable for real in-game use; what's more: due to the fact that a vector pointer always consists of two consecutive bytes only even-numbered y-positions are handled (alternatively only odd-numbered, just one type...). This could be seen as "loosing precision on the ypos-values";)

On the positive side, counting cylces the approach gets 2 + n*8 + 4 + ((k/2)-1)*6 + 4 + n*29, so for n=32 and k=220 the routine needs 1848 cycles in total -> 29+1/3 rasterlines.
RAM-usage is at $2 + n*$2 + $4 + ((k/2)-1)*$5 + $3 + n*$12 = $4aa
(plus 220 bytes in zp plus 110 bytes for the "starting" table plus 32 bytes for "output")

Mind that it would be unfair to compare this to the cycle count lft gave for the direct approach as the count-array has been halved. But a notable gain in cycle consumption comes from cutting the inner part of the 2nd loop down to 6 cycles which would also make a full loop over 220 array entries considerably faster. Maybe there's a clever way to handle full precision on the ypos-values by appropriately pre-conditioning the ypos-values plus doing a sensible correction loop afterwards... remarks and additions highly welcome;)
2017-09-24 10:58
Color Bar

Registered: Oct 2012
Posts: 132
Quote: The sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).

It should be competive to the discussed code both in size and speed.


Thanks! I found a short article about Counting Sort in C=hacking Issue #18:
http://www.ffd2.com/fridge/chacking/c=hacking18.txt
2017-09-24 12:00
Color Bar

Registered: Oct 2012
Posts: 132
Copyfault, that is a very interesting approach. But to halve the Y position resolution an AND #$fe has to be applied, which costs an additional 2 cycles per actor. And if you want to preserve the full resolution values as well (for use in the Y updating routine for example), you will have to store them somewhere, adding more to the cycle count.
2017-09-24 17:54
Copyfault

Registered: Dec 2001
Posts: 208
Quote: Copyfault, that is a very interesting approach. But to halve the Y position resolution an AND #$fe has to be applied, which costs an additional 2 cycles per actor. And if you want to preserve the full resolution values as well (for use in the Y updating routine for example), you will have to store them somewhere, adding more to the cycle count.

Ofcourse this works only if the game concept allows for all objects moving with suitable speed (all values hsve to be a multiple of two).

Otherwise you're right, all the necessary additional position handling will most probably eat up too much cycles. What's the current best case cycle count??
2017-09-24 18:22
Color Bar

Registered: Oct 2012
Posts: 132
See posts #91 and #101. It is about 3*k+43*n.
For lft's Field Sort (post #126) it is 2*k+55*n.
2017-09-25 01:25
Trash

Registered: Jan 2002
Posts: 96
I have made a mistake in my earlier posting about how to reduce the accuracy of the sorter...

Imagine you have 32 actors you want to sort and that K is 200, what you need is to reduce K so it holds the upper 5 bits (%11111 = 31).

The result will be a fast sorter that is inaccurate but accurate enough for sprite sorting.

64tass-style:

; YSortPositions contains the reduced spritepositions in Y
NUMBEROFBYTESTOBESORTED = 32
; With tables on zero-page
; 2 + ((3 + 9 + 6 + 19) * NUMBEROFBYTESTOBESORTED) - 3 + 2 + 3 + 12 = 1200

; With tables not on zero-page
; 2 + ((4 + 11 + 8 + 22) * NUMBEROFBYTESTOBESORTED) - 4 + 2 + 3 + 12 = 1455

DoSort
	lda #0				;  2
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
	sta SortTable + i		;  4 /  3
.next
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
	ldx YSortPositions + i		;  4 /  3
	inc YSortPositions,x		;  7 /  6
					;--------
					; 11 /  9
.next
	clc				;  2
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
	sta SortOrder + i		;  4 /  3
	.if i < (NUMBEROFBYTESTOBESORTED - 1)
		adc SortTable + i	;  4 /  3
	.fi
					;--------
					;  8 / 6
.next
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
	ldx ySortPos + i		;  4 /  3
	ldy SortOrder,x			;  4
	inc SortOrder,x			;  7 /  6
	lda #i				;  2
	sta Sorted,y			;  5 /  4
					;--------
					; 22 / 19
.next
	rts				; 12 (jsr + rts)
2017-09-25 07:02
lft

Registered: Jul 2007
Posts: 303
Typo on line 4, should increment at SortTable,x.

The problem with this approach is that you have to compute the upper 5 bits of the y-position somehow, and that will take time. It can be done during sorting, at the cost of an additional 2 cycles per actor, like this:

DoSort
    lda #0
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
    sta SortTable + 8*i
.next
    lda #$f8
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
    ldx YSortPositions + i
    sbx #0
    inc SortTable,x
.next
...


But I'm not convinced that the quality of this "rough sorting" is comparable with the real thing. Consider for instance the case where all actors are positioned 3 rasterlines apart. Whenever a hardware sprite becomes free, the multiplexer must take the next sprite according to the true sorting order, otherwise all sprite units will be busy eight actors later, so an actor must be dropped.

Nevertheless, it is nice to know that this option is available, in situations where the sprites aren't packed too closely but rastertime is tight.
2017-09-25 08:32
Trash

Registered: Jan 2002
Posts: 96
Quoting lft
Typo on line 4, should increment at SortTable,x.



Thanks, missed that..


Quoting lft
The problem with this approach is that you have to compute the upper 5 bits of the y-position somehow, and that will take time. It can be done during sorting, at the cost of an additional 2 cycles per actor, like this:


Or it could be done in a precalculated table, whatever fits your needs..

Quoting lft
Nevertheless, it is nice to know that this option is available, in situations where the sprites aren't packed too closely but rastertime is tight.


Yes, but if you have small amounts of rastertime you either should pre-sort your data or have data that fits with a bubblesort like the one THCM implemented...

I am (like you) not convinced that this sorter is the best for all use-cases but for some it's certainly the best option.
2017-09-25 10:13
The Human Code Machine

Registered: Sep 2005
Posts: 101
In most game scenarios an optimized bubble sort with special routines to add or remove an actor should be faster. In my example the sorter takes ~20 raster lines most of the time with sprites moving up and down. For a game, accurate sorting is important, to get an optimal multiplexing result with reusing a free sprite slot as soon as possible. (as lft said above)
2017-09-25 22:35
Digger

Registered: Mar 2005
Posts: 246
btw. Nice tool for visualising various sorting algos, Counting Sort included: https://visualgo.net/en/sorting
2017-09-25 22:57
ChristopherJam

Registered: Aug 2004
Posts: 655
Well, an unrolled bubble iteration should only take about 8 lines for 32 actors if there's no change, 10ish if there are four or five swaps.

I do like Falco Paul's suggestion of kicking off an Ocean Sort but aborting and switching to bucket sort (or some other constant time algorithm) if the number of swaps passes a certain threshold.
(cf http://www.codebase64.org/doku.php?id=base:sprite_multiplexer_s.. )

As for making lft's suggestion a little more memory flexible, I'm wondering if writing an RTS into the increment field could be cost effective?
2017-09-26 06:52
lft

Registered: Jul 2007
Posts: 303
I have thought of a different way to relax a memory constraint. Consider this variant of the bucket-emptying routine:

        sty     endptr+1
        lax     ylink,y
        pha
        lda     actorlink,x
        sta     ylink,y
        bpl     endptr

        sta     inyfield,y
endptr  jmp     inyfield


The worst-case execution time is the same, while the case of multiple actors in the same bucket is now slower. But now the location of the link table is unconstrained.
2017-09-26 06:59
lft

Registered: Jul 2007
Posts: 303
Quoting ChristopherJam
I'm wondering if writing an RTS into the increment field could be cost effective?


That would imply a JSR out of the increment field, right? That still means you have to put the bucket-emptying routine in four different, fixed places. Meanwhile the execution time increases by a rasterline.

Perhaps you meant the other way around, JSR into the increment field and RTS to the bucket-emptying routine. That would indeed make the memory layout more flexible, but at the cost of three rasterlines.

(Edit: I see now that I misread your post. Of course you mean the second thing.)
2017-09-26 08:54
lft

Registered: Jul 2007
Posts: 303
(removed)
2017-09-26 10:00
ChristopherJam

Registered: Aug 2004
Posts: 655
(intrigued)
2017-09-26 10:05
lft

Registered: Jul 2007
Posts: 303
I wrote a post about how JSR/RTS would interfere with pushing the result to the stack. But that is only a problem if you do it the wrong way, putting JSR in the field and RTS in the emptying routine. So it was a non-issue after all.
2017-09-26 10:11
ChristopherJam

Registered: Aug 2004
Posts: 655
Ah!

Well, FWIW I forgot the results were being pushed to the stack when I was wondering if a TXS:JMP would be a faster way to return to the field… (ie, just keep start of empty routine at top of stack)
2017-09-29 11:22
JackAsser

Registered: Jun 2002
Posts: 1242
Just some thoughts...

Multiplexers in game scenarios, is the sorting normally done in the vertical blank or is the sorting performed in the main loop bur for the next frame's sprite setup?

Was just thinking of the ideas of not sorting at all and the fact that the vertical blank period is a bit crowded. A lot of other stuff must be updated there such as the scrolling etc.

What if the multiplexer simply scans the remaining actors for the next entry? This is in total of course a O(n^2/2) algorithm and dead slow. Worst case to handle would be 8 sprites having to be multiplexed directly below. So you have 21/8 raster lines to find the lowest index in the remaining actors.

Think I'll test this approach some day soon..
2017-09-29 12:46
cadaver

Registered: Feb 2002
Posts: 1046
There are games that do both. I remember at least Midnight Resistance *not* doublebuffering the sorted sprites, so it was doing the sort in the vblank / scorepanel area.

Generally I'd recommend not making something timecritical that absolutely doesn't need to be, therefore rather pre-calculate the sorted sprites anywhere when the main program has time.

If you do the sorting "on the fly", you can't take advantage of last frame's sorting result. In a tight sprite formation, you barely have enough time to load the sprite registers from pre-sorted data, so would imagine you would run into trouble with a "find the next sprite" approach, even with unrolled code.
2017-09-29 13:21
JackAsser

Registered: Jun 2002
Posts: 1242
Quote: There are games that do both. I remember at least Midnight Resistance *not* doublebuffering the sorted sprites, so it was doing the sort in the vblank / scorepanel area.

Generally I'd recommend not making something timecritical that absolutely doesn't need to be, therefore rather pre-calculate the sorted sprites anywhere when the main program has time.

If you do the sorting "on the fly", you can't take advantage of last frame's sorting result. In a tight sprite formation, you barely have enough time to load the sprite registers from pre-sorted data, so would imagine you would run into trouble with a "find the next sprite" approach, even with unrolled code.


Ok, thanks for the insights.
2017-09-29 15:35
ChristopherJam

Registered: Aug 2004
Posts: 655
For applications where border time is a lot scarcer than display time, I suspect I'd lean towards a nice simple linked list based bucket sort; the times it takes a while to find next sprite are precisely when the next sprite is a fair way down the screen, so the time taken wouldn't be so critical. If the next sprite is in the next line or three there'd only be a few links to follow, which shouldn't take too long..
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
hedning/G★P
Mermaid/Vision^G★P
finchy
TheRyk/MYD
Didi/Laxity
kazmat
ciccior2003/HF
E$G/I-IokutO ForcE
Marq/Fit^Lieves!Tuor..
algorithm
ptoing
iAN CooG/HVSC
Fix/[HF] + [ONS]
Compyx/Focus
Guests online: 74
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 Coma Light 13  (9.6)
4 The Shores of Reflec..  (9.6)
5 Lunatico  (9.6)
6 Quad Core 100%  (9.5)
7 Comaland 100%  (9.5)
8 Incoherent Nightmare  (9.5)
9 Wonderland XII  (9.5)
10 Comaland  (9.5)
Top onefile Demos
1 Pandemoniac Part 2 o..  (9.6)
2 Synthesis  (9.6)
3 Dawnfall V1.1  (9.5)
4 Daah, Those Acid Pil..  (9.5)
5 Field Sort  (9.4)
6 Treu Love [reu]  (9.4)
7 Dawnfall  (9.3)
8 Globe 2016 [reu]  (9.3)
9 KAOS 64  (9.3)
10 Hardware Accelerated..  (9.2)
Top Groups
1 Booze Design  (9.4)
2 Censor Design  (9.4)
3 Oxyron  (9.4)
4 Crest  (9.3)
5 Finnish Gold  (9.3)
Top NTSC-Fixers
1 Horizon  (9.7)
2 Pudwerx  (9.6)
3 Stormbringer  (9.6)
4 Fungus  (9.6)
5 Moloch  (9.3)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 4.409 sec.