| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Sorting
are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :) |
|
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
I tried both a bubble sort, an insertion sort and a bucket sort for my game. All of them progressive. The insertion sort was usually the fastest but in the end the bucket sort was the most stable and reliable (i.e. fever missed frames) even with the loss of precision.
It's just not worth having to worry about spawning new sprites at an appropriate position, and not let things move too fast lest the list get too far out of sync.
I can't remember precisely how much raster time it consumes right now but it isn't big deal really compared with all the other work involved. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
what is the name of your game? I'd like to check it :) I have made an unrolled insertion sort, which will be progressive (if I ever got to that). it takes 22 cycles including the cmp/bcc when a number is moved (swapped) around. |
| |
Burglar
Registered: Dec 2004 Posts: 1090 |
some games use sprite regions, so there's no need for sorting, just put the sprites in the right regions and you're done. I think Turrican uses that method.
you might want to check out games from Hannes Sommer/Cosmos Designs as well (ie Lions of the Universe +7 ), those should have some good multiplexers... |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
It far from finished so maybe I shouldn't be giving advice, nevertheless I always appreciate a chance to plug my newest project. http://www.minoan.ath.cx/~doynax/6502/Balls_of_the_Scrolling_Th..
Anyway, the bucket sort probably takes something like 50 cycles per sprite. But that's still far better than the worst case of the insertion sort. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)
the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: uh, you should be able to sort 64 sprites in the border easily =) even my crappy multiplexer does 48 sprites in border, and i know it's not the most optimized one around :)
the basic idea in games usually is bucket sort first (what burglar called regions), then (if necessary) bubblesort on the buckets.
Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: Also... when re-using the sort order from frame to frame most algorithms are close to O(n). It's seldom a game hits the theoretical maximum complexity. I really dunno if you should calculate on worst case here since worst case never will happen.
Not the worst case perhaps. But the difference I saw between the best and worst case, even when catering to the multiplexer when inserting things, was still something like 300%.
And in an action game it's all about maintaining a smooth framerate no matter what happens. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
jack, well its just ammazing to me that how slow this operation really is. without tricks sorting alone can eat most of a frame. Taking into account how unoptimized games used to be its nice they can pull off their plexed sprites, and the rest in 50fps :)
edit: to gro, 132 rasterlines is the time it takes to sort 32 totally random numbers. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
Quote:
groppie, I'm afraid you're comparing here 'progressive' performance with full rnd sorting. as said swapping around ~4 sprites takes less than 10 lines.
nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote:
nope, my plexer doesnt take advantage of that, its full random sorting. the key is to use a reasonable amount of buckets, plus an unrolled bubblesort. (and if you can live with a little error you can leave out the bubblesort completely, which is what i did) I actually do a progressive bucket sort which seems to help a bit. Bullets and enemy formations and such tends to move in step with each other after all. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
Quote:
also rejecting 9th sprites seems to be messy
why reject? just make it so it's ignored :) (ie bucket can take more than 8 sprites, but plexer will only use first 8)
|
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: doynax, whoaw, now thats a massive amount of sprites :) \o/ how do you make the d800 copy, and where ? :) btw what bugs me about bucket sort is how to deal with the random order in the multiplexer, also rejecting 9th sprites seems to be messy :P
The d800 copy is double buffered and evenly distributed across 16 frames, so it really doesn't take much raster time at all ;)
As for the bucket sort I don't quite see what you mean. The
output is still just an ordered list of sprites, just like for the insertion sort.
As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote). |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
Quote:
As for rejecting the 9th sprite I don't do any fancy multiplexer "overload" testing at all. The code just tries to optimistically display as many sprites as possible and if doesn't fit then it's just not displayed (at least that's what happens in the type of multiplexer I wrote).
thats probably how most game multiplexers work. (look at SEUCK...argl =D) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: you cant doublebuffer d800 :) for the bucket sort, I have to think it through as I'm totally new to this idea .) ie a multiplexer relies on the sorted list, wont bugs happen when sprites regs are not written in the order they should be ? :)
Really? Feel free to disassemble the code :)
As for the sprites if they aren't perfectly in order then the worst that can happen is that they won't be as tightly packed as possible. If it's properly written anyway. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: keep in mind that eg your first "logical" sprite doesnt always have to be the first hardware sprite, it could be any of the 8 hardware sprites in the first bucket. you only have to maintain that if priorities matter (which they often dont in games, since sprites dont overlap)
In fact it's often useful to write them in reverse order to insure reasonably good priorities in top-down games and the like (i.e. so lower sprites overlaps the higher ones).
Another possibility is to "center" the multiplexer in such a way as to insure the priority of a single sprite (i.e. always assign the player to the first sprite). |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
you might miss sprites otherwise possible to display with the bucket sort.
spr1=70
spr2=74
bucket sort result:
spr2
spr1
then you fire an irq for the next sprite that will be spr2, and already late from spr1 |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: you might miss sprites otherwise possible to display with the bucket sort.
spr1=70
spr2=74
bucket sort result:
spr2
spr1
then you fire an irq for the next sprite that will be spr2, and already late from spr1
The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
now you are assuming too much :) ofcourse the displayer code must be written in a way that such thing can not happen. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Quote: If you preprocess the sprite-IRQs, you can take care to always take the upmost Y-coordinate as the basis for the IRQ, even if it isn't the first sprite in sort order. Did this in MW1-3 which had inexact sprite sorting. Brr, never again! :)
what was so bad about it, when you could get it 'right' afterall ? |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Quote: The way I do it is to attach the IRQ to the end of the last use of the new sprite, and thus try to reprogram the "channel" as soon as possible. Then all you have to do is check if you're too late to generate next IRQ (if it is to run before the current scanline) and if so jump straight to the next handler.
hmm 'before the new' sprite approach allows more tight packing ;) clever solution nevertheless. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Oswald: well it was never completely "right" in those games, you could create unnecessary artifacts for example when several motorcycles (2x2 sprites) were coming at you and you jumped. But for example, if you have mostly airborne enemies coming at you from several heights and you're mostly on ground, it doesn't matter. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: radix sort is a very interesting approach :) offers a constant sort time, but sadly that is much worse than progressive insertion sort. when fully unrolled I estimate a running time of ~55 rasterlines.
The bucket sort we talked about earlier is just bucket sort is just a special case of the radix sort. Except it sorts in one step and with some loss of precision to reduce the number of buckets.
Anyway an optimized and unrolled two-step radix sort shouldn't be quite as bad as that. It ought to be possible to get it well below 55 lines. In fact it'd probably be faster than a bucket sort combined with a bubble sort fix-up stage.
Dammit.. Now I have to implement one just to see how it turns out ;) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
here is my implementation:
there's 2 bucket arrays, each 256 bytes. each bucket can hold max 15 elements, 16th is the counter of the elements. adress of fex. bucket 3's element 5 is: 3*16+5.
bucket1 is used for the first pass
bucket2 for the 2nd.
each code snipplet is a code segment out of an unrolled loop. (ie stuff is missing like lda bucket1+bucketnr*16,x)
I dont see how this could be much faster :) some1 proove me wrong :)
;pass1
ldx sort+0
ldy spry,x ; get sprite y coord
lax and#0fmul16,y ;get bucket startadress
ora bucket1+15,x ;get adress INside the bucket
inc bucket1+15,x ;inc nr of elements
tax
smod sty bucket1,x ;store spr to bucket
;pass2
ldx bucket1+15
beq next ;empty bucket dont bother
blp1
ldy bucket1,x
lda spry,y ;get spr y
and #%11110000 ;upper 4 bits only this time
tay ;which is exactly our pointer
ora bucket2+15,y ;addy inside bucket
inc bucket2+15,y ;inc bucket counter
tay ;final bucket addy
sta bucket2,y ;store sprite
dex
bne blp1 ;any more in bucket1?
next
;pass3
ldy #$00 ; nr of sprites counter this is done only once
ldx bucket2+15
beq next ;empty bucket?
blp2
lda bucket2,x ;get sprite nr
smod sta final,y ;store to final list
iny ;final spr count
dex
bne blp2 ;any more in curr. bucket?
edit, ok, well some bugs there, destroying regs, but assuming they werent.. fixing those would make it even more slower tho |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
I'm attempting something like this:
;; initialize the buckets
ldx #$81
!for i,0,16 {
sta lsd_bucket+i
}
lda #$fe
!for i,0,6 {
sax msd_bucket+i*2+0
stx msd_bucket+i*2+1
sbx #-3
}
;; lsd sort
lda #$0f
!for i,0,32 {
ldx actor_ypos+i
sbx #$00
ldy lsd_bucket,x
sty actor_link+i
ldy #i
sty lsd_bucket,x
}
;; msd sort
!for i,0,13 {
ldx lsd_bucket+i
bmi .next
.msd ldy actor_ypos,x
lda msd_table,y
tay
lda msd_bucket,y
stx msd_bucket,y
ldy actor_link,x
sta actor_link,x
bmi .next
ldx actor_ypos,y
lda msd_table,x
tax
lda msd_bucket,x
sty msd_bucket,x
ldx actor_link,y
sta actor_link,y
bne .msd
.next }
;; finally in the mux writer.
;; for each sprite, alternating x and y
ldx actor_link,y
bpl .ok
.bucket lda msd_bucket-$80,x
tax
bmi .bucket
.ok ...
The idea is to use linked lists for the buckets. Sort all possible actors in the first pass (invalid ones having high y values), you can optimize it by trying to keep the maximum actor number as low as possible and skipping those high entries. And we can link the actors together in the actual multiplexer instead of a separate pass. Also the sentinels in the msd buckets can help us to skip buckets easily.
Finally we don't need more than about 13 buckets since not all y coordinates are valid, and as a bonus we can collect the invalid ones into a single "death" list automatically by tweaking the division table. Another bonus is that by sticking a store in the bucket skipping code of the multiplexer you can easily link together a complete list of actors which IMO is more convenient to work with than an order list (i.e. you only need to know the current actor's index to move to the next one, no need to keep track of an order index). Furthermore just about everything is kept on the zeropage, thought only temporarily of course.
All in all it seems to work out to about 28 raster lines. Except I'm not at all certain whether this will actually work yet.. ;) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: wow, very creative ! :) 28 lines sounds very good :) tho I have yet not an idea how progressive sort can perform when there's little changes ;)
It depends really. A progressive sort is faster in 95% of the cases but it's that twentieth missed frame which fucks up the flow in an action game. Then again I'm writing a twitch game which aspires to be a bullet hell shooter, but a platformer (say) would obviously have other requirements.
The radix sort is only something like 15 cycles (per actor) slower than my current bucket sort. So if I can get it to work I'll probably switch, and I'd be happy to share the final code if I do. Weighting the extra cycles vs. the extra sprites makes it a tough choice. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.
I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: I just used insertion sort starting with the ordering from the previous frame in Teradyne, setting up the first 8 sprites in the end-of-frame IRQ, and then moving each sprite down with an IRQ at the end of its previous use. If the next sprite needs moving before I returning from the IRQ I do it then instead of setting up another.
I stagger creation of all the sprites in a new formation over several frames to spread the workload, then there's never too much work for the sorter to do.
But then you've mostly got horizontal movement, right? I've got to worry about a laser moving 21 lines per frame and other nasty things :(
Of course managing not to overload the multiplexer in a horizontal shooter is much harder. So I guess I'm still better off ;) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
btw some inspiration on the attack waves, and enemies: http://youtube.com/watch?v=Aj23K8Ri68E&mode=related&search= a lot if this could be done on a c64 imho. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
an ikaruga-like shooter would be generally a cool idea :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Yeah, the movement is largely horizontal, with a few loop-the-loops, and that "X of death" formation the reviewer complained of. The times I drop a formation in from above or raise it from below it can struggle a little if there's already a dense formation it has to cross over.
A danmaku shooter like Ikaruga would indeed be awesome, but also a serious challenge to implement! |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
I don't think I've ever done a multiplexor where the sprites could move freely around, but is it really necessary to sort them?
Wouldn't it be enough with a map consiting of a byte per rasterline where the bit pattern indicated which sprites were used. To allocate a new sprite, just ora all the bytes for the sprite's y-position and 21 positions down, and see which 0's are left, or if the result is $ff it's too bad.
21 oras per sprite might still be a little heavy, but this could of course be optimized by reducing the precission a bit, e.g. to 3rd line, which means only 7 oras/sprite.
Edit: And of course you need to insert 7 values in the table afterwards as well. :-) Still think it's faster than sorting though.
|
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
I guess that's basically a "zone" multiplexer except with better granularity, sort of.. Don't forget that you've got to search through the scanlines first to see if any sprite channel is free all the way down before marking them. Then you'd also need another set of tables to keep track of which virtual sprite is bound to which sprite channel.
Finally you'd have to walk through the array line-by-line to figure out when to trigger the interrupts. I suppose the easiest way would be to walk the array from the bottom up to find the earliest possible line to program the sprite.
Except it'd have to be fairly coarse to win-out over a sort-based approach.
edit: I suppose you wouldn't have to check all entries, just the first and the last one. Or maybe only the first if the flag was interpreted to mean that the sprite is free to be reused here. Hmm..
Either way I figure it'd be a nice way to deal with y-expanded sprites and other special cases for linked sprites as well as for dealing with sprite priorities, i.e. the sprites would be allocated in the order of priority. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: mixing some ideas this could be powerful. I'm thinking of a chained list of rasterlines having sprites, and each element in this list would point to another list having the sprites. ;)
Care to elaborate?
Superficially that sounds exactly like what my bucket sort does today, except I think you meant something entirely different.. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
well at a second look it might be not a good idea, as walking the chain each time a new sprite is inserted is something like an insertion sort using doublechained elements :P it would be better not to chain the raster list, instead using a 256 byte table, the end result would be counting sort with chained sprites :) or if you want like your radix sort without separate lsd/msd :)
edit, ok this one fails when you have to reset the 256 byte table :P maybe you could collect the used entrys somewhere during the proces tho. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Hmm. Perhaps you could insert 'start actor' and 'end actor' events into one each of (say) 50 buckets, one every four rasters.
Then whenever an actor is done with you could add the sprite you used to a free list, and whenever a sprite is required you could take one off the free list. No sorting required - you just need to look ahead in the bucket table to see when the next raster interrupt is required. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
how about binary trees? %) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: how about binary trees? %)
I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?
Unless you mean a heap sort with it's implicit tree? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Hmm. I'm sure I've used heap-sort for something on this beasty. May have been hidden-surface-removal for a solid 3d renderer.. (edge events as you traverse a scanline push/pop surfaces) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I doubt it. Any kind of balancing would be way to complicated, and you'd almost certainly encounter nasty edge cases without it. Having to implement an AVL or red-black tree on this machine terrifies me. A splay tree just might be feasible but I'm not sure whether it'd solve the problem. Anyone here with a real CS background who can tell for sure?
Unless you mean a heap sort with it's implicit tree?
Nah... stick to radix sort and the like since they're O(n). General sorting using trees etc are O(n log n) and on this platform they'll have a nice quite big constant put in front also due to overhead. Those constants are ofcourse negliable for very large n's, however a multiplexer doesn't have large n. ;D
For a complete sort I have serious doubts that anything can out perform a well written radix sort (2 step bucket sort as you have shown before).
If you have similar sort order from frame to frame I'd recommend bubble sort actually. They're O(n^2) worst case but O(n) if sort order is similar + they're dead easy to implement and optimize.
Maybe one could have some code that would realise when to use a full sort with radix-sort and when to "just fix" the sort order using bubble sort. This would improve the average time ofcourse but slightly worsen the worst case ofcourse.
Also as Cadaver mentioned, it's seldom the sort order actually changes that much from one frame to another. On could be smart when you get new enemies etc by spreading the sort for the actors that are about to enter the screen, over several frames.
F.e. you will need some kind of data structure to know what kind of actors that will emerge soon in the scrolling direction. You could have those somewhat pre-sorter and/or prepare-sort them on the way there.
Tons of ideas.... =) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive.
That would even allow you to determine in advance when a slightly shuffled order is acceptable :)
Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote:Hmm. For most shooters the enemy paths are completely predetermined. You could pre-sort all of them (keeping dead enemies onscreen with a blank sprite definition), or at least provide new sortlists for any points in time that a bubble sort is going to be expensive. Yeah, that'd probably work. If you were writing Armalyte 2..
But it would be way to limiting for my tastes. You've got to at least have bullets fired towards the player in the kind of game I'm trying to write.
Quote:Speaking of load management, can anyone think of a situation that an 8-sprite muxer that needs to handle a player in an arbitrary position will always be able to cope with that a 7-sprite muxer with the player on sprite 0 would not? (other than forcing the player to avoid certain y-ranges by using scenery collisions) Sure. Bullets in a vertical shooter tend to be separate from the player and able to reuse the player's sprite. As long as they are spawned a couple of scanlines above the player at least.
Besides I'm happy with allowing a bit of flicker every now and then, especially if it's mostly in the two player mode, in return for a greater amount of objects and not having to worry myself to death over how the enemy paths should be designed as to guarantee zero flicker. Best-effort multiplexing FTW! |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
I'm strongly beside that the player sprite shouldnt be handled exclusively, except the case of more than 8 sprites on a line. also if this happens I'd alternatively reject sprites frame to frame, thus in a flickery fashion but still all sprites would be visible. I've seen this in some soccer game, and I like it |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Fair enough on the flicker risk.
I'd actually been assuming chars or bitmap plotting for the bullets, but yes, if they are to be sprites then an 8 sprite muxer makes more sense than special casing the player. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
This one's has taken up residence in my brain..
Now I've gone back and read the thread properly instead of skimming the first three quarters before reading the rest, I've actually noticed the earlier points about a danmaku shooter being the aim well before the Ikaruga subthread, and that the whole idea was to minimise worst case in a very frame incoherent shooter - methinks my insertion sort commentary was a little out of place.
I did have some inspiration last night when I realised that for any displayable set of actors there will never be more than eight sprites in any 21 line bucket (else the first and last will be too close together to display), so then I went off and did some research on optimal sorting for small sets of numbers. I eventually found
this page on Bose-Nelson sorting networks, got a swap macro down to 22 cycles ( ldx a
ldy b
lda actor_y,x
cmp actor_y,y
bge noswap
stx b
sty a
), decided that was too slow and started handcoding cases for smaller lists of sprites in a bucket (eg 70 cycles to take 3 zero page locations containing actor indices, sort them by actor_y and add them to the actor_next list),
but even then, I lose about 26 cycles just to placing each actor onto the end of the zero page array of actors-in-that bucket, which only leaves me an average of 19 cycles per actor for the sorting and adding to the actor_next list if I'm to beat the routine above.
It looks like @doynax' two pass radix sort still wins - I'm impressed! |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
ChristopherJam: In practice that radix-sort ended up needing 34 scanlines, for 32-sprites, rather than 28. It just wasn't practical to keep keep the set of sprites compacted like that. Not when combined with certain other features anyway.
So maybe a Bose-Nelson sorting network, whatever that is, could still be competitive? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.
I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.
There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.
A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg
#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)
SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)
will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
lda (),y
cmp (),y
bcc
is faster when there's no need to swap than
ldx pointer
ldy pointer
lda spry,x
cmp spry,y |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Presumably you are proposing I keep y=0 and interleave my actor indices with the high byte of the actor_y table?
Good thinking, except I am attempting to optimise the worst case, so I do not wish to sacrifice that for a faster average case. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
it should be noted tho, that worst case sorting never happens with progressive sort, especially if you optimize for only adding/removing 1 sprite to the sort per frame. I'm also realizing that the zp method presented above is not good, as the rare case with insertion sort (my current approach) is when there's no need to swap. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: hmm - so I'd need to manage less than (34*63)/32=67 cycles per actor.
I've still got some optimising to do, but at the moment the worst case cost per actor ranges from 46 to 87 cycles depending on the number of other actors are in the same bucket (1 to 7 others) plus a small overhead for each of the 10 buckets.
There are a lot of unnecessary loads and stores in the more expensive cases though, so I'll see how I go over this week.
A sorting network is basically a list of optional exchanges of predetermined pairs, that after they are complete guarantee the list is sorted. It avoids computing array indices and a lot of the branching - eg
#define SWAP(a,b) .(:ldx zpt+a:ldy zpt+b:lda act_y,x:cmp act_y,y:bcs noswap:stx zpt+b:sty zpt+a:noswap:.)
SWAP(0,1)
SWAP(2,3)
SWAP(0,2)
SWAP(1,3)
SWAP(1,2)
will ensure that four actor indices stored in zpt..zpt+3 will be sorted by y position. For an 8 element list, you only need 19 swaps.
Hmmm... I was sure sorting networks were O(n log n) but apparantly Ajtai, Komlós and Szemerédi have O(log n) sorting network. I.e. the maximum depth (or comparators executed) is log n. It feels highly counter intuitive imo but I won't question it. Anyways, point is that while radix-sort is O(n*m) (where n is # of input and m is # of digits), an optimal sorting network is O(log n).
Now, this O() notation always hides huges constants for a c64 but for a f.e. 48 actor sorter it should require at most 48*log2(48) comparators statically in memory and at most log2(48) of those are executed at any given input. => NICE! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.
I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.
Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
actually I have just arrived to what you're talking about: insertion sort on a linked list of actors :) as one of the drawbacks of insertion sort is the need to move many elements, but with a linked list you can insert the actor into the right position without shuffling bytes around.
ldx actorlink,x
cmp actory,x
bcc nextelement
- insert actor into chain & return to the outer unrolled loop-
nextelement
ldy actorlink,x
cmp actory,y
bcc nextelement
only 11 cycles per comparison. you even have the prev actor's link in one of the registers, so it can be a one way chain easily. guess you already do it like this ;) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Exactly :)
I traverse a singly linked actor_next list in the end-of-screen interrupt running their movement scripts, and after each has been moved (unless it has just died) I insert it into a singly linked actor_prev list.
Then once all the actors have been processed, I reverse the list.
ldx actor_prev,y
tay
sta actor_next,x
beq done
...
|
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Ajtai, Komlós and Szemerédi's network is O(log N) depth, but still has O(N log N) comparitors - their timings are for parallel hardware, where you can swap elements 0 and 1 at the same time as 2 and 3, but swapping 0 with 2 would have to wait until the next cycle. The sample network I gave above they would categorise has having depth 3.
I was looking at something that on serial hardware like our humble c64 would be O(N log k), where k is the maximum number of sprites in any given bucket, which for buckets less than 22 lines high will be 8.
Oswald: I know the stuff I'm playing with this week will not give optimal results for a relatively stable arrangement - it's why I usually use insertion sort on a linked list of actors. I believe Doynax is attempting to avoid the huge spikes of CPU time a progressive sort would take whenever there is a massive change in order, like when two multi-sprite bosses are crossing vertically while you shoot at them with vertically fast moving sprite bullets ;)
Ah yeah, later that night I realized that the O(log N) was due to parallelism.
To optimize further I think we should look at the aims here:
* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.
First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.
Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: Ah yeah, later that night I realized that the O(log N) was due to parallelism.
To optimize further I think we should look at the aims here:
* We want a sorted order to trigger IRQs.
* We want to assign each actor to a sprite channel.
* We want to kick out actors when there all channels are full.
First thing we can note is that the top 8 sprites doesn't really have to be sorted in respect to each other, we only need to assign them to arbitrary unique sprite channels + find the top most for the first IRQ trigger.
Just ideas... but what I mean is that if we instead focus on what really need to be done instead of just blindly sort all actors first, then assign etc.
Unfortunately we still need to sort the first eight sprites in order to reuse them as soon as possible. Unless you use a radically different approach like what Cruzer suggested of course.
There may be cases where we can determine ahead of time that there will be "enough" free raster time anyway and don't perform a perfect sort on some segment of sprites, but as to how you'd implement such a beast in practice.. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Necro thread time! (I know, I know, if I'd waited another two months it'd be the ten year anniversary. But I want to publish this now so I can move on to something else :P)
I've been meaning for a while to try and accelerate a perfect order bucket sort by finding a good way to skip over unused entries.
After a bit of experimentation this year, the best I've managed is to use a set of flag bytes, with one bit for each bucket to say whether there are any actors in it on this frame.
For any non-zero flag byte, a lookup table used to quickly determine the index of the least significant one bit, so that that that bucket can then be emptied by pushing each sprite index from the list starting with that bucket entry onto the stack. The used buckets are cleared as they are emptied, so no post-cleanup is required, and the untouched buckets take no CPU time at all.
By packing eight flags into each flag byte, and unrolling the code to one block per flag byte, the worst case overhead per bucket is only 1.25 cycles. Thus, the CPU time for a perfect sort is only marginally worse than what it would be for a sort routine that only used half as many buckets.
Total time, a shade under 40 raster lines \O/ (yes, doynax still wins
/o\)
Code is up at codebase, cf http://www.codebase64.org/doku.php?id=base:flagged_bucket_sort |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
removed a redundant LDA#255 from each flag byte block, just before the STA back into the bucket list (the accumulator already contains a list terminator at that point), and remembered to factor in the flag bytes being in zero page.
Down to 38.5 raster lines now, only 12% slower than radix sort. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Very neat. I never would have considered doing a sparse bucket sort :)
Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.
(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting doynaxVery neat. I never would have considered doing a sparse bucket sort :)
Thanks!
Quote:Oh, and you probably shouldn't take my stated radix sort timings at face-value. The sparse sort implementation may easily be faster already.
Plus counting sorting cycles in isolation is a bit of a limited model. A major part depends on whether the sorting steps slot into the game's actor and multiplexer updates such that the loops may be folded and loaded data reused.
Yes, I wasn't sure how best to benchmark. Code as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices.
FWIW most of the time is "per used bucket", so if there are only 24(?) distinct Y values, I suspect I'd come out ahead, but if there are that many horizontally aligned enemies a zone sort's likely to be sufficient anyway.
I did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor, for a fairly unappealing total of (34+62)*32=3072 cycles per frame (48.7 raster lines)
I suspect that if one just wanted to minimise CPU time expended by the sorter in in the border, a plain bucket sort would be better; insertion's only 14 cycles per actor (7 rasters!), and the times it takes a while for 'fetch next' are precisely when the next actor is a fair way down the screen from the one you just dealt with, so timing wouldn't be quite so critical anyway.
Quote:(Off-topic, but the coarse bucket sort method can be improved for less frantic applications by using a stable sort while cycling through the remainders as offsets for the division table lookup each frame, eventually reaching the correct order for static displays. It helps if the permutation is chosen to minimize settling errors, and perhaps using a prime divisor or randomized shuffle to avoid interference against sprite movement.)
Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations
|
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting ChristopherJamCode as posted has an unrolled loop for pulling out the indices into a linear list, so the time for subsequent 'fetch next index' operations is just 10 cycles per actor (inc *+4:ldy $xxxx:bmi noneleft), for a total of 2742 cycles per frame (43.5 raster lines) to sort and then fetch 32 indices. Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.
Quoting ChristopherJamI did experiment with just inserting into the sparse bucket list in the sort call (34 cycles per actor, 17.2 raster lines), then only pulling them out as required, but that basically requires either coroutines or a lot of loading and storing of sentinels, so my best fetch time was up around 62 cycles per actor Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.
Quote:Interesting point. At first I was thinking that was kind of shell sort related, but the latter is intrinsically unstable. Might have to run some simulations
Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting doynax
Well, `PLA : TAX` with the branch and sentinel baked into the actor class dispatch might work. And if the update happens to leave the Y-coordinate lying around then you might as well make use of it.
Indeed. And for the first 8 sprites, you can just LDX immediate directly from $017f down.
Quote:Yeah, the bitset introduces state which is difficult to get rid whereas the radix sorts gets away with it by chaining the buckets together into a unified list.
Exactly. The plain bucket sort has the same advantage (TBH I pinched the idea from Playstation order-tables - it's how they z-sort the texture mapped triangles, by inserting them into a linked list of null graphics primitives)
Quote:Oh, nothing remotely fancy. I just meant that a bucket sort need not necessarily to use the same partition every frame. So with bucket=(y+k)/N then by cycling through 0≤k<N each frame a stable sort produces the correct ordering in at most N frames for static scenery. This is effectively free and with a carefully permutation the error tends to decay exponentially even for moving objects. Yes; I got there eventually :) |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
A brute force approach seems already very efficient. I think that can be done in 35.6 lines or less for 32 sprite positions and 220 buckets.
;Fill Buckets
ldx input
inc Buckets,x
...
ldx input+31
inc Buckets,x
-------------
10*32=320 cycles assuming input is in ZP.
;Collect sorted values
ldy Buckets
beq next1: ;Skip if empty.
lda #value1 ;Value corresponding to this bucket.
pha
dey
bne *-2 ;In case not unique.
sty Buckets ;Empty the buckets.
next1:
ldy Buckets+1
beq next2:
lda #value2
pha
dey
bne *-2
sty Buckets
next2:
;etc.
If all input numbers are unique the collection stage costs 7 cycles for empty buckets and 19 cycles for the buckets with 1 input value for a total of 7*(220-32)+19*32+320 = 35.6 lines.
If not all input numbers are unique, the cost for a bucket with k numbers is 12+(k-1)*8+7 cycles. Adding one more to a non-empty bucket costs 8 cycles, but since there is then one bucket less with one contribution, and one more empty bucket, one gains 4 cycles overall.
But I could have made a mistake in these calculations or missed something else (initialization, overhead ...). |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.
(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 ) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
35-40 lines for 32 sprites is crazy good guys 1.x line per sprite. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.
(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )
Oh, I see, thanks.
The fact that multiple sprites can have the same Y values makes the problem considerably harder, it seems. |
| |
Perplex
Registered: Feb 2009 Posts: 255 |
Here is some old sort code I have used, I think it's originally from Armalyte, or at least based on that code (by Gary Liddon and Dan Phillips.)
It uses the stack for buckets, so lower your stack pointer to somewhere below 50 (or whatever is your lowest Y-value for sorting), then initialize the stack area above that with $ff values.
NOBJECTS = 24 ; Number of objects to be sorted
OBJ_YPOS = $80 ; Y-values of all objects
OBJ_SORT = $A0 ; Indices of all objects sorted by Y-value
TEMP_SP = $FF ; Temporary stack pointer
StackSort:
; Save stack pointer
tsx
stx TEMP_SP
; Phase 1 - Store Y-values in buckets
.repeat NOBJECTS, i
lda OBJ_YPOS + i
;lsr
tax
txs
pla
bpl *-1 ; Already value here, try next
lda #i
pha
.endrep
; Phase 2 - Gather (indices of) sorted values
;ldx #(50/2)-1
ldx #50-1
txs
.repeat NOBJECTS, i
pla
bmi *-1
sta OBJ_SORT + i
lda #$ff ; Clean up after ourselves
pha
pla
.endrep
; Restore stack pointer
ldx TEMP_SP
txs
rts
The two commented out lines are for speeding up the gathering phase by halving each Y-value. Depending on how your multiplexer works, you may not need exact order for very close objects if you handle those at the same time.
In circumstances where limiting the amount of available stack is not viable, it can easily be modified to use some other memory area at the cost of a few cycles per object. |
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
While working on Vincenct 2 in the nineties I had the same problem finding a fast solution for sorting sprites and for most purposes this sorter here: http://codebase64.org/doku.php?id=base:flexible_32_sprite_multi..
was the best I came up with. Inspired by Rune Gram-Madsen's Bubble a la Rune sorting routine, published at the Asm-One manual, I wrote a fast implementation for the c64. It's a modified bubble sort working bidirectional and for presorted arrays it's very fast. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Perplex, that is interesting, thanks for sharing the code. Do you know of an efficient modification that ensures exact ordering? |
| |
Perplex
Registered: Feb 2009 Posts: 255 |
If you need exact ordering even with many values clustered close together, then simply moving on to the next unoccupied bucket won't really work, so you'd need a quite different approach, I think. |
| |
Hein
Registered: Apr 2004 Posts: 948 |
Using bucket sort ($10 size) and insertion sort to sort the indices of a list of y-values. In this case it takes around 40 rasterlines for 32 sprites when placed on zp. It assumes there's no need for more than 8 sprites per bucket, but 15 are possible.
The multiplexer checks per bucket based on the counter at the end of each bucket.
*=$10
clear_bucket_counter
lda #0
sta buckets+15
sta buckets+15+16*1
sta buckets+15+16*2
sta buckets+15+16*3
sta buckets+15+16*4
sta buckets+15+16*5
sta buckets+15+16*6
sta buckets+15+16*7
sta buckets+15+16*8
sta buckets+15+16*9
sta buckets+15+16*10
sta buckets+15+16*11
sta buckets+15+16*12
sta buckets+15+16*13
sta buckets+15+16*14
sta buckets+15+16*15
ldx #32
set_element_index
stx element_index
lda sprite_y,x
ldx #$f0
sbx #$00 ;select right bucket
ldy buckets+15,x ;get bucket counter (# elements in bucket)
inc buckets+15,x ;increase bucket counter
stx set_2+1 ;set high nybble bucket index
dey
bmi place_current_element ;element count=0 skip sort
stx get_1+1
stx get_2+1
inx
stx set_1+1
sty slot_hi+1 ;highest index
compare_element_y
get_1 ldx buckets,y
cmp sprite_y,x
dey
bcc insert_lower_element ;woohoo, found a spot
bpl compare_element_y
insert_lower_element
sty slot_lo+1
slot_hi ldy #0
move_higher_elements
get_2 lda buckets,y ;move higher elements up
set_1 sta buckets,y
dey
slot_lo cpy #$ff
bcc move_higher_elements
place_current_element ;finally, insert the lower element index in the right place
iny
lax element_index
set_2 sta buckets,y
dex
bne set_element_index
rts
element_index
.byte 0
*=$1000
buckets
.fill 256,0
|
| |
Fresh
Registered: Jan 2005 Posts: 101 |
I use a slightly modified version of "Ocean" sorter, with a (somewhat) optimized backward sorting routine.
Its worst case is significantly better than plain Ocean.
Code in kickass syntax.
.const NSPRITES = 32
.const indexes = $60
.const ypos = $80
// SORT FORWARD
// ************
// Check if values are sorted, if necessary call sortbw routine to
// put an unsorted value in the correct position
sortfw:
.for(var i=0;i<NSPRITES;i++) {
over:
.if (i!=NSPRITES-1) {
ldy indexes+1+i
ldx indexes+i
lda ypos,y
cmp ypos,x
bcs sortfw[i+1].over
jsr sortbw[NSPRITES-2-i].back
}
else rts
}
// SORT BACKWARD
// *************
// Move backward a value in .A which is less than the one in current position.
// All values at the left of the current position are ordered so, once we found
// the correct spot for .A, we don't need to check forward, we can get back
// to where the function was called
sortbw:
.for(var i=0;i<NSPRITES-1;i++) {
back:
stx indexes+NSPRITES-1-i
.if (i!=NSPRITES-2) {
ldx indexes+NSPRITES-3-i
cmp ypos,x
bcc sortbw[i+1].back
}
sty indexes+NSPRITES-2-i
rts
}
|
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Color Bar, that is indeed a fast way to get a sorted list of Y values, but what is needed is a sorted list of the indices of the sprites that have those Y positions.
(eg if sprites 1, 2, 3 and 4 are at positions 10, 46, 73 and 32, the required output is 1,4,2,3, not 10,32,46,73 )
I think that can be done in less than 37 rasterlines.
Consider the following unrolled code for outputting sprite indices, sorted according to Y-value (phase 2)
.bucketK
bcc .nextBucket ;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
lda #sprite_index ;index of last sprite
pha
...
lda #sprite_index ;index of first sprite.
pha
sty .bucketK+1 ;Restore branch (to skip all sprites for empty buckets). Call with y=$1b for max 8 sprites per bucket.
.nextBucket ;unrolled
-------------
3 cycles for an empty bucket, 12 voor buckets with 1 sprite => 3*(220-32)+12*32 = 948 cycli
Phase1: The following unrolled code jumps to the code that modifies the above phase 2 code for the right bucket:
dex ;counts sprite index. Initialized with $20.
bmi .phase2
lda Yvalues,x
asl
bcs *+8 ;There are more than 128 buckets, so I use 2 jump tables
sta *+4
jmp ($xxxx) ;jump table for buckets <128 ;jumps to code that fills out phase2 Bucket code (see below)
sta *+4
jmp ($xxxx) ;jump table for buckets >=128 ;21/22 cycles sofar
.phase2
clc ;to branch always (either to next bucket or to fetch the sprite indices in the current bucket).
ldy #27 ;to restore the branch value of the bucket
jmp .Bucket0
;code to fill out bucketK
ldy .bucketK+1 ;The trick is that I use this as an index for the number of sprites already in the bucket
lda newBranchValue,y ;fetch value to branch to the code for one more sprite
sta .bucketK+1 ;since this is unrolled code, the address of .bucketK is known
tay
txa
sta .bucketK+3,y ;Store new sprite index. This works since the branch value decreases by 3 for each new sprite, but the address where to store the sprite index as well!
... ;Unrolled => repeat all the above code starting with dex.
--------------
41/42 cycles ->41.5*32 = 1328
In total 1328+948 =2276 -> 36.1 lines.
But there will be at least 25 additional cycles from page boundary crossings. Slightly less than 37 lines I think if I did not overlook something.
Table with branch values (for max 8 sprites per bucket) looks like (x=don't care)
newBranchValue
BYTE 0 x x 0 x x 3 x x 6 x x 9 x x 12 x x 15 x x 18 x x x x x 21 x
The first zero makes that a full bucket replaces the last sprite with the new one. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Excellent work, Color Bar.
If every sprite lands in the second jump table, that pushes you up to 42 cycles for every insertion, but total time in that case is still only
3*(220-32)+12*32 + 42*32 = 2292 cycles, or 36.4 raster lines.
The 25 additional cycles from page boundary crossings you mention can be avoided by padding the output routine out to 32 bytes by adding three bytes of padding after the BCC .nextBucket, and starting bucket0 at $xxff
It takes up another 660 bytes, but the routine is already occupying most of a bank as it is, so that's small change at this point ;) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Should be able to save about 1760 bytes by only including the last 12 bytes of the insertion code in every third routine; the preceding and succeeding routine for each complete routine can just jump to the nearest complete one (the part from the second sta *+4 onward).
Timing's unchanged, but drags it back down to around 14k including jump tables. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJamExcellent work, Color Bar.
The 25 additional cycles from page boundary crossings you mention can be avoided by padding the output routine out to 32 bytes by adding three bytes of padding after the BCC .nextBucket, and starting bucket0 at $xxff
It takes up another 660 bytes, but the routine is already occupying most of a bank as it is, so that's small change at this point ;)
Those 3 padding bytes can be put to good use by allowing 9 sprites in a bucket.
The routine probably uses too much memory for many applications. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJamShould be able to save about 1760 bytes by only including the last 12 bytes of the insertion code in every third routine; the preceding and succeeding routine for each complete routine can just jump to the nearest complete one (the part from the second sta *+4 onward).
Not sure I understand this. You need a separate fill routine for every possible Y-value/bucket. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Btw, if you just want to sort 32 random numbers, the phase 2 code could be modified into
.bucketK
bcc .nextBucket
lda #K
pha
lda #K
pha
...
pha
sty .BucketK+1
and the fill code can be limited to
ldy .bucketK+1
lda newBranchValue,y
sta .bucketK+1
Resulting in 3*(220-32)+12*32 + 33.5*32 -> 32 lines. Put perhaps that application requires 256 buckets and then it takes 33.8 lines. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Color BarNot sure I understand this. You need a separate fill routine for every possible Y-value/bucket.
True, but every one of the 220 routines ends with
.hitableN
sta *+4
jmp ($xxxx) ;jump table for buckets >=128
.phase2N
clc ;to branch always.
ldy #27 ;to restore the branch value of the bucket
jmp .Bucket0
You can leave that tail off two thirds of them, eg only keeping it for buckets 1,4,7,10 etc.
The routine for bucket 0 would replace the bcs *+8 with a branch to the hitable lable in the the routine for bucket 1, as would the routine for bucket 2 |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Color BarThe routine probably uses too much memory for many applications.
A quarter of all RAM? Yes, that's probably a tad excessive in practice :)
It'd be interesting to have a map of best worst case routine for different numbers of actors, and different limits on memory use... could even have CPU time as altitude
|
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJamQuoting Color BarNot sure I understand this. You need a separate fill routine for every possible Y-value/bucket.
True, but every one of the 220 routines ends with
.hitableN
sta *+4
jmp ($xxxx) ;jump table for buckets >=128
.phase2N
clc ;to branch always.
ldy #27 ;to restore the branch value of the bucket
jmp .Bucket0
You can leave that tail off two thirds of them, eg only keeping it for buckets 1,4,7,10 etc.
The routine for bucket 0 would replace the bcs *+8 with a branch to the hitable lable in the the routine for bucket 1, as would the routine for bucket 2
Yes, I understand it now. Thanks! |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Quoting Color BarThe routine probably uses too much memory for many applications.
A quarter of all RAM? Yes, that's probably a tad excessive in practice :)
It'd be interesting to have a map of best worst case routine for different numbers of actors, and different limits on memory use... could even have CPU time as altitude
I am wondering if there aren't any theoretical bounds known in computer science, about memory usage and/or/vs CPU time. But I guess these would have to be C64 specific to be of any use. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Got a new record now, I think.
We can pack 221 addresses into a 222 byte table by using the high byte of each address as the low byte of the next address.
If we use something like this:
for i in 0..73:
.byt 64+(i%74)*(1+i//74*24)%74)
.. there are no two addresses less than 24 bytes apart.
The fill routine then drops down to this:
;code to fill out bucketK. 24 bytes, 36 cycles
fillK
ldy .bucketK+1
lda newBranchValue,y
sta .bucketK+1 ; 12
tay
txa
sta .bucketK+3,y ; 9
dex
lda Yvalues,x
sta *+4
jmp ($xxxx) ; 15
; 36*32 = 1152 cycle, 36 per actor
Just 24 bytes long :)
We avoid the check for x becoming negative by placing actor $ff at y=0, and having the 0 entry of the branch table point to a routine
that checks X to determine whether we've reached the end of the actor list, or are just dealing with an actor that has gone offscreen.
The bucket emptying code remains the same as in Color Bar's original:
.bucketK ;32 bytes
bcc .nextBucket ;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
!byt 0,0,0
lda #sprite_index ; repeat lda/pha eight times
pha
...
sty .bucketK+1 ;Restore branch (to skip all sprites for empty buckets). Call with y=$1e for max 8 sprites per bucket.
.nextBucket ;unrolled
; 3 cycles for an empty bucket, 12 voor buckets with 1 sprite => 3*(220-32)+12*32 = 948 cycli , 29.6 per actor
The bucket filling routines are packed three to a page over 74 pages, so around 19kb. Each page has an average of 184 bytes empty, so we can probably pack the bucket emptying code into the gaps. Note that some of the buckets would then need to branch past a group of bucket filling routines, so we'd need to use both X and Y for the two different values to restore the branches to "this bucket is empty"
So - total time, 36*32+3*(220-32)+12*32= 220*3+45*32 = 2100 cycles. 65.6 per sprite, 33.33 rasters total.
For 37 sprites or more, we're down to under a raster per sprite :D |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Great! I still have to study your post in detail, but maybe another 64 cycles gain be gained by adding 3 NOPs before the first sprite in the bucket, like this:
.bucketK
bcc .nextBucket
lda #sprite_index8
pha
...
nop
nop
nop
lda #sprite_index1
pha
sty .bucketK+1
Then the TAY can be removed from the fill code:
ldy .bucketK+1
lda newBranchValue,y
sta .bucketK+1
txa
sta .bucketK,y
and the branch value table changes to:
newBranchValue
BYTE 0 x x 0 x x 3 x x 6 x x 9 x x 12 x x 15 x x 18 x x 21 x x x x x 27 x
Worst case for the phase 2 code is still 948 cycles, I think. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ah, so you're always storing 3 bytes earlier than the previous branch target?
I like it, but I'm not sure if that works for the very first sprite inserted into that bucket, unless you have the branch starting out by landing on the sty .bucketK+1
..which would then cost you an extra four cycles per empty bucket. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJamGot a new record now, I think.
We can pack 221 addresses into a 222 byte table by using the high byte of each address as the low byte of the next address.
Clever idea!
I am not familiar with this notation
for i in 0..73:
.byt 64+(i%74)*(1+i//74*24)%74)
What do i%74 and i//74 mean? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Thanks!
Mod and divide (TBH I should have dropped from // back to / when I switched from python to pseudo asm). Oh, and that loop should have been to 74*3
Really, better expressed as
for i in 0..73:
.byt 64+(i*1)%74
for i in 0..73:
.byt 64+(i*25)%74
for i in 0..73:
.byt 64+(i*49)%74
|
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Ah, so you're always storing 3 bytes earlier than the previous branch target?
I like it, but I'm not sure if that works for the very first sprite inserted into that bucket, unless you have the branch starting out by landing on the sty .bucketK+1
..which would then cost you an extra four cycles per empty bucket.
Yes, you are right. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJam
The bucket emptying code remains the same as in Color Bar's original:
.bucketK ;32 bytes
bcc .nextBucket ;Branches always: decrease this value by 3 with selfmodifying code in phase 1 for every new sprite in the bucket.
!byt 0,0,0
lda #sprite_index ; repeat lda/pha eight times
pha
...
sty .bucketK+1
!byt 0,0,0 could be replaced by LDA #value + PHA, i.e., allow a maximum of 9 sprites per bucket, but I guess it doesn't really matter. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
The Human Code Machine, thanks for that codebase link. Must admit, I didn't find that area when I added the flagged bucket sort to the maths and algorithms section. Might need to add some crosslinks..
FWIW though, looks like that one has the same worst case as the one Oswald referenced in the opening remark to this thread. (32*31/2)=496 swaps, 26 cycles per swap, 496*26/63=204 raster lines. |
| |
Frantic
Registered: Mar 2003 Posts: 1647 |
I added some crosslinks @ codebase to make people aware about this partial overlap in contents in the two different sections.
Preferably generic sorting algorithms should be placed in the maths and algorithms section, and sorting tailored specifically towards fast sorting of sprites/multiplexing could be kept on the sprites page, if someone feels like cleaning this up. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Thanks Frantic, I'll move my post. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJam
;code to fill out bucketK. 24 bytes, 36 cycles
fillK
ldy .bucketK+1
lda newBranchValue,y
sta .bucketK+1 ; 12
tay
txa
sta .bucketK+3,y ; 9
dex
lda Yvalues,x
sta *+4
jmp ($xxxx) ; 15
; 36*32 = 1152 cycle, 36 per actor
I think another 2 cycles per actor can be gained by using A as the actor counter:
ldy .bucketK+1
ldx newBranchValue,y
stx .bucketK+1
sta .bucketK+3,x
sbc #$01
.start
tax
ldy Yvalues,x
sty *+4
jmp ($xxxx)
Initialize with SEC, LDA #$1f, JMP .start
Total time: 220*3+43*32 = 2036 cycles or 32.3 rasters! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Color Bar
I think another 2 cycles per actor can be gained by using A as the actor counter
Sweet! Yes, that should work. Should probably do a test implementation and upload this beast somewhere. Now if we could just shave off one more cycle per actor... |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Perhaps not practical, but maybe interesting to discuss:
If you are sure the maximum number of actors in a bucket is never exceeded, 2 more cycles per actor (and 2 bytes) can be shaved off the bucket filling routines by using some BOCs (bonus opcodes = illegals)
lax .bucketK+1
sbx #$06
stx .bucketK+1
tya
sta .bucketK+3,x
dey
lda Yvalues,y
sta *+4
jmp ($xxxx)
where Y is now the actor counter. The bucket emptying code should then be modified according to
bcc .nextBucket
lda #sprite_index
pha
skw $abcd ;opcode $0c
lda #sprite_index ;Always 6 bytes for every sprite.
pha ;Therefore sbx #$06 can be used in the fill routines.
skw $abcd
...
lda #sprite_index
pha
sty .bucketK+1
If we allow up to 10 actors and patch the filling routines with 2 bytes to 64 bytes total, the page crossing penalties can be avoided. Although this algorithm uses a LOT of memory, it may be worth it in some demo effects if speed is the bottleneck. And besides, the absolute minimum possible CPU time is a worthy goal by itself.
As Christopher Jam suggested, we can have the 0 entry of the branch table point to a routine that checks whether we've reached the end of the actor list, or are just dealing with an actor that has gone offscreen.
I think we can then shave off 2 more bytes if the Y values are stored on the stack (terminated by 0) and LDA Yvalues,Y is replaced by PLA. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
I dont think this is really exactly what you guys are discussing but I suggest you all to have a look at the sortingroutines used in Time Machine, I have been given an explanation but I don't really understand how it works. I have code that sorts 32 elements just shy of 1400 cycles (if I remember correctly) with a constant speed using the explanation I was given, since it's HCLs I wont share it without his explicit permission but dropping a hint on where to look must be forgiven. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
re. SKW etc: Ooh, dangerous! So I think we're up to around 10k for the PHA routines now?
Will have to see if that can be sanely(!) combined with my thinking from today:
I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
ldx bucketK+1
ldy newBranchTableValue,x
sty bucketK+1
pla ; grab this actor's index
sta bucketK+3,y ;21
pla ; grab next Yvalue
sta *+4
jmp (jumptable) ;11
(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).
so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Trash...1400 cycles...
!!! |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJam
I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :)
Awesome! Less than 1 raster per actor :-)
I actually thought of that as well, but won't that cost you somewhere else? I mean you have to write updated Y values to the stack in that form and maybe that is more costly than writing them non-interleaved? But perhaps not when you use unrolled code ... |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: I dont think this is really exactly what you guys are discussing but I suggest you all to have a look at the sortingroutines used in Time Machine, I have been given an explanation but I don't really understand how it works. I have code that sorts 32 elements just shy of 1400 cycles (if I remember correctly) with a constant speed using the explanation I was given, since it's HCLs I wont share it without his explicit permission but dropping a hint on where to look must be forgiven.
Since it is Time Machine, it must be ahead of us :-)
Very likely I won't understand that code without any thorough explanation, but is it for the same range of Y values (number of buckets)? And can multiple actors have the same Y values?
Maybe HCL wants to join the discussion? |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJam
(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).
I remember this post:
SID to calculate line slope
Maybe that could work? |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting ChristopherJamI was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea
I haven't been following this discussion in detail, so I don't know if it helps, but: If your routine takes e.g. 31 cycles, then you can mask the timer value by 31, and it will appear to increment. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ok, a few random thoughts.
I see now that the routine in comment #103 takes the same number of cycles as that in #105; it just fetches an actor ID with TYA:DEY instead of PLA, and fetches the next Y coordinate with lda YV,Y instead of PLA.
Even if the Y coordinates are on the stack, you still need the DEY to update the actor index.
I can't see a way to combine them, so suspect I'd lean towards my routine (same speed, 5k less unrolled code, 64 bytes more stack usage, faster performance when many actors are on the one line).
Rather than copying the Y coordinates to the stack, I'd make that their home, and also only write the actor IDs once at startup. You would of course have to be very careful to ensure no interrupts while running that section of code, NMI included
Quoting Color Bar
SID to calculate line slope
Maybe that could work?
Oh god. Quite possibly, but I'd rather avoid hamstringing music/effects.
lft; sadly no cycles left for masking :)
*downloads Time Machine to examine later* |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting ChristopherJamOk, a few random thoughts.
I see now that the routine in comment #103 takes the same number of cycles as that in #105; it just fetches an actor ID with TYA:DEY instead of PLA, and fetches the next Y coordinate with lda YV,Y instead of PLA.
Even if the Y coordinates are on the stack, you still need the DEY to update the actor index.
I can't see a way to combine them, so suspect I'd lean towards my routine (same speed, 5k less unrolled code, 64 bytes more stack usage, faster performance when many actors are on the one line).
Rather than copying the Y coordinates to the stack, I'd make that their home, and also only write the actor IDs once at startup. You would of course have to be very careful to ensure no interrupts while running that section of code, NMI included
Yes, your approach has some advantages, including not having to assume that the maximum number of actors per bucket is never exceeded. Unfortunately it does not use any illegals though :-( |
| |
Trash
Registered: Jan 2002 Posts: 122 |
So i examined the tickcounts of the code I've got, with heavy use of zeropage it can sort 32 elements in 1188 cycles, without zp it does the same job in 1448 cycles (unless my calculations are incorrect). The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work (eg two lda, sta in two different tables)...'
*edit*
HCL actually drops a hint for this sorter in the commentsection of Classics |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: re. SKW etc: Ooh, dangerous! So I think we're up to around 10k for the PHA routines now?
Will have to see if that can be sanely(!) combined with my thinking from today:
I managed to trim two cycles off the populate routine by switching to just using even numbers for the actor IDs (can interleave most of the data tables), and using a section of the stack to store Y values interleaved with actor IDs.
ldx bucketK+1
ldy newBranchTableValue,x
sty bucketK+1
pla ; grab this actor's index
sta bucketK+3,y ;21
pla ; grab next Yvalue
sta *+4
jmp (jumptable) ;11
(I was going to read the actor ID out of one of the CIA timers while pulling Y value off the stack, but realised that the timer will be counting down while the stack address increases, so had to abandon that idea).
so, 220*3+41*32=1972 cycles, 61.6 per actor, 31.3 raster lines :)
I'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Color BarI'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster.
Damn, you're right. The SBX#$06 takes two cycles less than the LDY newBranchTableValue,x
FWIW, the 'more than one sprites on a line' case could be improved by replacing the SKW instructions with JMP *+3; it saves a cycle. And you're still using SBX at least :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quoting Trash. The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work
Ah, it's not a perfect sort of eight bit values? Comparing apples and oranges in that case.. |
| |
Axis/Oxyron Account closed
Registered: Apr 2007 Posts: 91 |
Quote: So i examined the tickcounts of the code I've got, with heavy use of zeropage it can sort 32 elements in 1188 cycles, without zp it does the same job in 1448 cycles (unless my calculations are incorrect). The drawback is that every value you want to sort has to be reduced in accuracy in order for it to work (eg two lda, sta in two different tables)...'
*edit*
HCL actually drops a hint for this sorter in the commentsection of Classics
That ADC STA sequence mentioned in the comments of classics looks like 1 of the passes of a radix-sort. But in pure speed that wont be competitive to the ideas already posted here. Atleast as long as you dont reduce accuracy and really work with 200 different y-positions. My latest version has around 2400 cycles for the given use-case, if I remember right.
But the radix approach has some really nice properties compared to the bucket-sort mostly used here. It sorts in perfect 8 bit accuracy and sorts stable. That makes it more generally usable (e.g. for sorting 3D Bobs and polygons without any flickery). And in critical cases it can even avoid bugs in multiplexing. E.g. when big groups of sprites batch in small areas. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Quoting Color BarI'm sorry, but I think you missed 2 cycles: PLA / STA *+4 / JMP (jumptable) takes 13 cycles. This means that the approach in post #103 is 2 cycles faster.
Damn, you're right. The SBX#$06 takes two cycles less than the LDY newBranchTableValue,x
FWIW, the 'more than one sprites on a line' case could be improved by replacing the SKW instructions with JMP *+3; it saves a cycle. And you're still using SBX at least :)
Yes, I was a bit too eager to use illegals. Nevertheless, what I don't like about post #103 is that it will crash when there are more than 10 sprites on a line.
So I lean to sticking to the combination of posts #91 and #101. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting Axis/OxyronThat ADC STA sequence mentioned in the comments of classics looks like 1 of the passes of a radix-sort. But in pure speed that wont be competitive to the ideas already posted here. Atleast as long as you dont reduce accuracy and really work with 200 different y-positions. My latest version has around 2400 cycles for the given use-case, if I remember right.
But the radix approach has some really nice properties compared to the bucket-sort mostly used here. It sorts in perfect 8 bit accuracy and sorts stable. That makes it more generally usable (e.g. for sorting 3D Bobs and polygons without any flickery). And in critical cases it can even avoid bugs in multiplexing. E.g. when big groups of sprites batch in small areas.
The approach proposed here (the combination of posts #91 and #101 or #103) counts down the actor list, so when there are multiple actors with the same Y value, they are added to the bucket in reverse order, but in the second pass the order is restored. So I think this approach is stable as well.
If think the method also has full 8 bit accuracy. The numbers of cycles mentioned are for visible sprite positions and that means 220 buckets. This can be increased to 255 buckets. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Field Sort |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Nice!
I think this 32K of ram usage you refer to applies to combining posts #91 and #103. When combining #91 and #101 it can be considerably lower (but it will still be a lot).
Btw, how many sprites with the same Y value can your routine handle? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Yes, nice work! This will take some investigating to unravel.
SHX $FE00,Y ^ is basically stx $fe00,y on account of the mask being $fe+1, yes? |
| |
lft
Registered: Jul 2007 Posts: 369 |
Color Bar, yes I realise that you went through a number of different versions, trading memory for cycles. But I thought it fair to mention the version that beats mine in running time.
There can be any number of sprites on the same line; They are linked. Of course, the multiplexer drops them if there are too many, as is evident from the flickering in the demo.
This is a generic full random sorting routine for 8-bit values, although in the demo it is limited to 220 different y-values to make it comparable with your benchmarks.
Christopher Jam: That is correct. It's nice to find a use for this opcode where it actually improves performance, because it's quite constrained: The masking drops off if the instruction is suspended by DMA, and page crossings cause the wrong value to be written. So the best bet is indeed to stick to the operand $fe00. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Yes, nice work! This will take some investigating to unravel.
SHX $FE00,Y ^ is basically stx $fe00,y on account of the mask being $fe+1, yes?
Yes, check out this document (pages 38 and 41):
No More Secrets v0.91 |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
lft: now it would be very cool if you could expand this routine to handle excess sprites in a way that they will be displayed on alternating frames ("interlaced") - then it would work perfectly fine for a game (MUCH better than sprites disappearing) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Dear lords. So the n-actors snippets leading up to $fe00 insert each actor ID in to a linked list with the head at $aa00+yposition, and the 'next' pointers are at $10+actorID, and also replaces the INY at $fe00+yposition with a JMP to a bucket emptying routine.
Here's a sample snippet:
.C:fb52 A4 90 LDY $90+0 ; Y coordinate of actor 0
.C:fb54 9E 00 FE SHX $FE00,Y ; this is effectively STX $fe00,y
.C:fb57 B9 00 AA LDA $AA00,Y ; bucket head
.C:fb5a 85 10 STA $10+0 ; next pointer for actor 0
.C:fb5c A9 00 LDA #$00 ; actor id 0
.C:fb5e 99 00 AA STA $AA00,Y
The INYs that fill $fe00 to $feff increment Y through all the potential bucket indices, jumping off to a bucket emptying routine wherever an INY has been replaced with a JMP
There are four routines at 4c4c, 4cc8, c84c and c8c8 to handle each of the cases where zero to two of the next two buckets are also nonempty.
The bucket emptying routine takes advantage of the list heads being stored in page $aa to reuse the $aa as a TAX instruction when looping over a bucket that contains more than one actor, and the lists are terminated with $c8, which is both negative and doubles as the INY that needs to be written back to $fexx to clear the bucket for the next frame. (it's also written to $aaxx, just like the flagged bucket sort on codebase).
Each bucket emptying routine returns to the scene of the crime after writing the INY back, which bumps Y the value needed for the next bucket. Hence, the four bucket emptying routines are in fact identical; here's a sample:
.C:4c4c 8C 5E 4C STY $4C5E ; patch return address
.C:4c4f BF 00 AA LAX $AA00,Y ; fetch bucket head
.C:4c52 48 PHA
.C:4c53 B5 10 LDA $10,X ; read 'next' pointer.
.C:4c55 10 FA BPL $4C51 ; Actor ids are positive, INY is not
.C:4c57 99 00 FE STA $FE00,Y ; restore INY
.C:4c5a 99 00 AA STA $AA00,Y ; empty linked list
.C:4c5d 4C 7E FE JMP $FE7E
I think that just about covers it. Lft, that's a mighty fine piece of work. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Color Bar, thanks for the No More Secrets rec; I wasn't familiar with that one. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
The sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).
It should be competive to the discussed code both in size and speed. |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Hilariously constrained addresses indeed. :) And my respect for lft just grew even higher.
Quoting Groepazlft: now it would be very cool if you could expand this routine to handle excess sprites in a way that they will be displayed on alternating frames ("interlaced") - then it would work perfectly fine for a game (MUCH better than sprites disappearing) I would like no interlacing or disappearing sprites even better. In the ideal world, the routine moving the sprites around should avoid situations with >8 sprites on the same line. And the multiplexor should of course strive for being able to display any situation with <=8 sprites per line. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11377 |
thats not really useful for a game - in a demo of course that is always preferred - but in a game it would be SO much better if you could move around a large amount of objects without having to care about multiplexer limitations in the game logic. (if you look at gameboy or NES games for example - thats how a lot of them work) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
or then there was Alien Syndrome, where you could never have more than one alien on any given raster 😂 |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting TrashThe sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).
It should be competive to the discussed code both in size and speed.
Thanks for the link! That does indeed seem useful.
Here is a straighforward implementation:
for each actor i
ldx ypos+i
inc count,x
lda #0
clc
for each count k
sta start+k
adc count+k
for each actor i
ldx ypos+i
ldy start,x
lda #i
sta output,y
inc start,x
lda #0
sta count,x
Total number of cycles is 10*i + 4 + 8*k + 28*i, so for our benchmark case with 32 actors and 220 y-positions that would be 2980 cycles, or 47.3 rasterlines. So the above implementation is not competitive.
So I suppose the game is on to try to improve it! |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quoting lftSo I suppose the game is on to try to improve it!
Try losing precision on K by the right amount and in the right way and it will improve without losing to much accuracy.... |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
A less "aggressive" approach to lft's solution can be obtained by using brk instead of jmp.
It's clearly a bit slower but the emptying routine is only one and be put everywhere: a bit less messy and more "game-friendly".
irq:
sty $ff+stack // Fix stack for RTI
lda buckethead,y
irqloop:
sta output,x
tay
inx
lda list,y
bpl irqloop
ldy $ff+stack
sta table,y
sta sectable,y
rti
|
| |
lft
Registered: Jul 2007 Posts: 369 |
Fresh, that is true, and you even get to place the INY field on any page you like. That's because the opcode is 0, so it doesn't matter that SHX randomly masks it with the destination address + 1.
I would suggest that you store-absolute the output, perhaps to the zero-page, and self-modify the instruction to increment the pointer. That'd free up X as a scratch register, and you wouldn't have to restore Y from the stack area.
But there are drawbacks with this approach. As you mention, it is a bit slower: 15 cycles per actor, after implementing my suggestion above, so that'd give us a total of 43 rasterlines instead of 35. And you cannot do the sorting in main context while interrupts are running, because you'd lose even more cycles distinguishing between BRK/IRQ. More subtly, if a BRK coincides with an interrupt, the interrupt is lost, although the peripheral will typically keep asserting the (maskable) interrupt until it is eventually serviced. |
| |
HCL
Registered: Feb 2003 Posts: 727 |
Oh.. i'm late into the discussion again :P
First, i didn't invent the kind of sorting that is used in TimeMachine (and many other Booze-demos). It seems that i am not very good in inventing things. Perhaps i'm better at looking at other's code, understanding it, and making benefit to my own needs. I think i stole it from some demo from the late-80:s..
The nice thing about it is that i wrote a code-generator for it. Perhaps it was for the 3d-Bob-part in TimeMachine.. When i change 3d-bob-object it generates sorting code for the number of bobs that i currently have. Neither that being rocket science, but it helped making up to 32 bobs update each frame :). |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
@lft: perfect analysis, no surprise from you! :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
@HCL: Could you comment on how to reduce the number of buckets without losing too much accuracy? Trash is being deliberately enigmatic for fear of breaking a confidence, so we could really use a hint. Or would you rather have us dig into the code and see for ourselves? |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quote: @HCL: Could you comment on how to reduce the number of buckets without losing too much accuracy? Trash is being deliberately enigmatic for fear of breaking a confidence, so we could really use a hint. Or would you rather have us dig into the code and see for ourselves?
I do dare to tell you now when I know he borrowed the solution...
This is for sprites (eight per row):
If you remove the last three bits by using lsr you have your new K-value BUT you have to pair it with the original value. So what you are doing is that you assign your actors a value remove lose accuracy (lsr, lsr, lsr or a LUT (K = max value of that table)) store that value in the table you actually sort.
I might be wrong but this works for me using almost exactly your example code... |
| |
Copyfault
Registered: Dec 2001 Posts: 476 |
This Countingsort got me thinking...
So I had a closer look at lft's straight forward implementation given before and remembered
Quoting TrashTry losing precision on K by the right amount and in the right way and it will improve without losing to much accuracy....
What about doing it as follows:
ldx #offset
for SPRITE = 0 to n-1
dcp (ypos(SPRITE),x) ;operand byte is also memory for ypos
next
lda #$ff
ldx #$00
for LINE = 0 to 2*(floor((k/2)-2)) STEP 2
stx start+LINE
sbx #$00 ;operand byte is memory for the count-array
next
stx start+2*(floor((k/2)-1))
for SPRITE = 0 to n-1
ldx ypos(SPRITE)
ldy start,x
lda #SPRITE
sta output,y
inc start,x
lda #0
sta (offset,x)
next
Remarks:
- all loops have to be unrolled
- instead of incrementing the count-array entries for every occurance of the corresponding LINE-value they are decreased, giving [$00-#occurances of LINE] in each array position count[LINE]. But as the adc-instruction in the 2nd loop is also substituted by an sbx-command, this leads to the same result without having to care about the state of carry.
- 2nd byte of each dec-cmp-instruction suits also as memory where the ypos-val for the corresponding sprite-number is stored, i.e.
...
Sprite0_ypos = *+1
dcp (ypos(Sprite0),x)
Sprite1_ypos = *+1
dcp (ypos(Sprite1),x)
...
- only evenly spaced values for the ypositions admitted, i.e. consecutive values must have distance=2
- each ypos corresponds to a pointer in zp pointing to the corresponding operand byte of the sbx#$..-instructions. The 2nd loop would look like
lda #$ff
ldx #$00
stx start+0
count_ypos0 = *+1
sbx #$00
stx start+2
count_ypos2 = *+1
sbx #$00
stx start+4
...
with the corresponding zp partypos0 .byte <count_ypos0, >count_ypos0
ypos2 .byte <count_ypos2, >count_ypos2
...
- "offset" is introduced to be able to keep away from the $00/$01-adresses
Beside the ram usage for the start- and output-table the given approach needs a fair amount of zp-memory (one vector per possible sprite-ypos!). This most probably makes it not very usable for real in-game use; what's more: due to the fact that a vector pointer always consists of two consecutive bytes only even-numbered y-positions are handled (alternatively only odd-numbered, just one type...). This could be seen as "loosing precision on the ypos-values";)
On the positive side, counting cylces the approach gets 2 + n*8 + 4 + ((k/2)-1)*6 + 4 + n*29, so for n=32 and k=220 the routine needs 1848 cycles in total -> 29+1/3 rasterlines.
RAM-usage is at $2 + n*$2 + $4 + ((k/2)-1)*$5 + $3 + n*$12 = $4aa
(plus 220 bytes in zp plus 110 bytes for the "starting" table plus 32 bytes for "output")
Mind that it would be unfair to compare this to the cycle count lft gave for the direct approach as the count-array has been halved. But a notable gain in cycle consumption comes from cutting the inner part of the 2nd loop down to 6 cycles which would also make a full loop over 220 array entries considerably faster. Maybe there's a clever way to handle full precision on the ypos-values by appropriately pre-conditioning the ypos-values plus doing a sensible correction loop afterwards... remarks and additions highly welcome;) |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: The sorting algorithm implemented by HCL is actually called Counting Sort (I just found that out..) and it is considered to be O(n).
It should be competive to the discussed code both in size and speed.
Thanks! I found a short article about Counting Sort in C=hacking Issue #18:
http://www.ffd2.com/fridge/chacking/c=hacking18.txt |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Copyfault, that is a very interesting approach. But to halve the Y position resolution an AND #$fe has to be applied, which costs an additional 2 cycles per actor. And if you want to preserve the full resolution values as well (for use in the Y updating routine for example), you will have to store them somewhere, adding more to the cycle count. |
| |
Copyfault
Registered: Dec 2001 Posts: 476 |
Quote: Copyfault, that is a very interesting approach. But to halve the Y position resolution an AND #$fe has to be applied, which costs an additional 2 cycles per actor. And if you want to preserve the full resolution values as well (for use in the Y updating routine for example), you will have to store them somewhere, adding more to the cycle count.
Ofcourse this works only if the game concept allows for all objects moving with suitable speed (all values hsve to be a multiple of two).
Otherwise you're right, all the necessary additional position handling will most probably eat up too much cycles. What's the current best case cycle count?? |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
See posts #91 and #101. It is about 3*k+43*n.
For lft's Field Sort (post #126) it is 2*k+55*n. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
I have made a mistake in my earlier posting about how to reduce the accuracy of the sorter...
Imagine you have 32 actors you want to sort and that K is 200, what you need is to reduce K so it holds the upper 5 bits (%11111 = 31).
The result will be a fast sorter that is inaccurate but accurate enough for sprite sorting.
64tass-style:
; YSortPositions contains the reduced spritepositions in Y
NUMBEROFBYTESTOBESORTED = 32
; With tables on zero-page
; 2 + ((3 + 9 + 6 + 19) * NUMBEROFBYTESTOBESORTED) - 3 + 2 + 3 + 12 = 1200
; With tables not on zero-page
; 2 + ((4 + 11 + 8 + 22) * NUMBEROFBYTESTOBESORTED) - 4 + 2 + 3 + 12 = 1455
DoSort
lda #0 ; 2
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
sta SortTable + i ; 4 / 3
.next
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
ldx YSortPositions + i ; 4 / 3
inc YSortPositions,x ; 7 / 6
;--------
; 11 / 9
.next
clc ; 2
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
sta SortOrder + i ; 4 / 3
.if i < (NUMBEROFBYTESTOBESORTED - 1)
adc SortTable + i ; 4 / 3
.fi
;--------
; 8 / 6
.next
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
ldx ySortPos + i ; 4 / 3
ldy SortOrder,x ; 4
inc SortOrder,x ; 7 / 6
lda #i ; 2
sta Sorted,y ; 5 / 4
;--------
; 22 / 19
.next
rts ; 12 (jsr + rts)
|
| |
lft
Registered: Jul 2007 Posts: 369 |
Typo on line 4, should increment at SortTable,x.
The problem with this approach is that you have to compute the upper 5 bits of the y-position somehow, and that will take time. It can be done during sorting, at the cost of an additional 2 cycles per actor, like this:
DoSort
lda #0
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
sta SortTable + 8*i
.next
lda #$f8
.for i = 0, i < NUMBEROFBYTESTOBESORTED, i = i + 1
ldx YSortPositions + i
sbx #0
inc SortTable,x
.next
...
But I'm not convinced that the quality of this "rough sorting" is comparable with the real thing. Consider for instance the case where all actors are positioned 3 rasterlines apart. Whenever a hardware sprite becomes free, the multiplexer must take the next sprite according to the true sorting order, otherwise all sprite units will be busy eight actors later, so an actor must be dropped.
Nevertheless, it is nice to know that this option is available, in situations where the sprites aren't packed too closely but rastertime is tight. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quoting lftTypo on line 4, should increment at SortTable,x.
Thanks, missed that..
Quoting lftThe problem with this approach is that you have to compute the upper 5 bits of the y-position somehow, and that will take time. It can be done during sorting, at the cost of an additional 2 cycles per actor, like this:
Or it could be done in a precalculated table, whatever fits your needs..
Quoting lftNevertheless, it is nice to know that this option is available, in situations where the sprites aren't packed too closely but rastertime is tight.
Yes, but if you have small amounts of rastertime you either should pre-sort your data or have data that fits with a bubblesort like the one THCM implemented...
I am (like you) not convinced that this sorter is the best for all use-cases but for some it's certainly the best option. |
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
In most game scenarios an optimized bubble sort with special routines to add or remove an actor should be faster. In my example the sorter takes ~20 raster lines most of the time with sprites moving up and down. For a game, accurate sorting is important, to get an optimal multiplexing result with reusing a free sprite slot as soon as possible. (as lft said above) |
| |
Digger
Registered: Mar 2005 Posts: 432 |
btw. Nice tool for visualising various sorting algos, Counting Sort included: https://visualgo.net/en/sorting |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Well, an unrolled bubble iteration should only take about 8 lines for 32 actors if there's no change, 10ish if there are four or five swaps.
I do like Falco Paul's suggestion of kicking off an Ocean Sort but aborting and switching to bucket sort (or some other constant time algorithm) if the number of swaps passes a certain threshold.
(cf http://www.codebase64.org/doku.php?id=base:sprite_multiplexer_s.. )
As for making lft's suggestion a little more memory flexible, I'm wondering if writing an RTS into the increment field could be cost effective? |
| |
lft
Registered: Jul 2007 Posts: 369 |
I have thought of a different way to relax a memory constraint. Consider this variant of the bucket-emptying routine:
sty endptr+1
lax ylink,y
pha
lda actorlink,x
sta ylink,y
bpl endptr
sta inyfield,y
endptr jmp inyfield
The worst-case execution time is the same, while the case of multiple actors in the same bucket is now slower. But now the location of the link table is unconstrained. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting ChristopherJamI'm wondering if writing an RTS into the increment field could be cost effective?
That would imply a JSR out of the increment field, right? That still means you have to put the bucket-emptying routine in four different, fixed places. Meanwhile the execution time increases by a rasterline.
Perhaps you meant the other way around, JSR into the increment field and RTS to the bucket-emptying routine. That would indeed make the memory layout more flexible, but at the cost of three rasterlines.
(Edit: I see now that I misread your post. Of course you mean the second thing.) |
| |
lft
Registered: Jul 2007 Posts: 369 |
(removed) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
(intrigued) |
| |
lft
Registered: Jul 2007 Posts: 369 |
I wrote a post about how JSR/RTS would interfere with pushing the result to the stack. But that is only a problem if you do it the wrong way, putting JSR in the field and RTS in the emptying routine. So it was a non-issue after all. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ah!
Well, FWIW I forgot the results were being pushed to the stack when I was wondering if a TXS:JMP would be a faster way to return to the field
(ie, just keep start of empty routine at top of stack) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Just some thoughts...
Multiplexers in game scenarios, is the sorting normally done in the vertical blank or is the sorting performed in the main loop bur for the next frame's sprite setup?
Was just thinking of the ideas of not sorting at all and the fact that the vertical blank period is a bit crowded. A lot of other stuff must be updated there such as the scrolling etc.
What if the multiplexer simply scans the remaining actors for the next entry? This is in total of course a O(n^2/2) algorithm and dead slow. Worst case to handle would be 8 sprites having to be multiplexed directly below. So you have 21/8 raster lines to find the lowest index in the remaining actors.
Think I'll test this approach some day soon.. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
There are games that do both. I remember at least Midnight Resistance *not* doublebuffering the sorted sprites, so it was doing the sort in the vblank / scorepanel area.
Generally I'd recommend not making something timecritical that absolutely doesn't need to be, therefore rather pre-calculate the sorted sprites anywhere when the main program has time.
If you do the sorting "on the fly", you can't take advantage of last frame's sorting result. In a tight sprite formation, you barely have enough time to load the sprite registers from pre-sorted data, so would imagine you would run into trouble with a "find the next sprite" approach, even with unrolled code. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: There are games that do both. I remember at least Midnight Resistance *not* doublebuffering the sorted sprites, so it was doing the sort in the vblank / scorepanel area.
Generally I'd recommend not making something timecritical that absolutely doesn't need to be, therefore rather pre-calculate the sorted sprites anywhere when the main program has time.
If you do the sorting "on the fly", you can't take advantage of last frame's sorting result. In a tight sprite formation, you barely have enough time to load the sprite registers from pre-sorted data, so would imagine you would run into trouble with a "find the next sprite" approach, even with unrolled code.
Ok, thanks for the insights. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
For applications where border time is a lot scarcer than display time, I suspect I'd lean towards a nice simple linked list based bucket sort; the times it takes a while to find next sprite are precisely when the next sprite is a fair way down the screen, so the time taken wouldn't be so critical. If the next sprite is in the next line or three there'd only be a few links to follow, which shouldn't take too long.. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Just wanted to mention an O(n) sort I invented long ago I call Fibonacci sort, for how one step does a running sum.
Now I realize it's been invented long ago and is called the counting sort.
I think it takes about 50 cycles per number.
#Fibonacci Sort
unsorted = [3, 1, 4, 1, 5, 9, 2, 6, 0]
counts = [0] * 10
fib = [0] * (10 + 1)
sortd = [0] * len(unsorted)
#First pass: find the counts
for n in unsorted:
counts[n]+=1
#Second pass: do the Fibonacci magic
i = 0
total = 0
for n in counts:
fib[i] = total
total += n
i+=1
#Third pass: output sorted
for n in unsorted:
sortd[fib[n]] = n
fib[n] += 1
#Results
print(sortd)
a very buggy version in assembler:
;count each number
ldx #$ff
count ldy unsort,x
inc count,y
dex
bne count;15/ea
;Fibonacci step
clc
lda count
inx
fib adc count,x
sta sta count,x
inx
bne fib;13/ea
;Copy sorted
sort ldy unsort
ldx count,y
sty sort,x
inc count,y
inc sort+1
bne sort/27/ea
|
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
this is one case for (,x) if you want the usual indice indirection too. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Nice work independently discovering counting sort, Repose
It's worth noting that this one is O(n)+O(m), where m is the number of buckets.
Pass 2 for a "perfect" sort of n sprites would take 220 iterations (256 in your example code)
That's around 220*10 cycles if you unroll the loop a little, so about 35 lines of overhead. Of course, you can cut that down considerably if you don't need as much accuracy or range.
Clearing the counts array also takes time, though you can drop that back from O(m) to O(n) by keeping it seperate to your "fib" array, persisting it frame to frame, and only clearing the entries you incremented in the first place. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
found out this one, posting if someone finds a flaw, somehow I am unsure if I'm overlooking something, because in speed its close to the pha pha method, which feels unbeatable.
phase1 bucketsort, 4bit per buckets:
;this code repeats * numofsprites
ldy yvalues+spriteindex
ldx table,y ;x = y / 16 *2
lda #spriteindex
sta (bucket,x) ;needs some prep on zp
inc bucket,x ;every 2nd zp points to a buckettab
phase2 sorting network:
ldy bucket+z ;19 such compare snippets per bucket
lda yvalues,y ;can sort max 8 sprite per bucket
ldx bucket+q ;z and q is precalculated
cmp yvalues,x
bcc + ; or bcs?
sty bucket+q
stx bucket+z ;26 cycles worst case
with $20 sprites phase1 is 768 cycles, 1976 cycles for phase 2 at worst case (highly unlikely) , overall ~43 rasterlines.
probably will use different sorting networks based on nr of sprites in bucket. that will reduce nr of compare/swap snippets per bucket. |
| |
Krill
Registered: Apr 2002 Posts: 2971 |
Something i keep thinking briefly about whenever sprite sorting pops up is... somehow having hardware-assisted constant-time sorting.
Like, when you have buckets of 4 or 8 coordinates, which are quickly generated as seen in Oswald's post, how about sorting the coordinates within a bucket using, say, CIA timer interrupts or VIC collision detection? :)
For timer interrupts, one could set up 4 timers and then sample the interrupt flags, which would give results according to the values to be sorted.
Collision detection could also help for buckets of 8 coordinates, with some smart sprite-char collision scheme, albeit at the expense of some memory.
But... i never came far with these ideas, but then again i haven't really tried so far. :)
Edit: Of course, with 4 timer interrupts and a raster interrupt, one could sort 5 coordinates on the fly and have the multiplexer interrupt handler triggered in sorted order. But i guess this is out of scope for these discussions. =) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Nice one Oswald.
As is, it's a little slower than https://www.codebase64.org/doku.php?id=base:flagged_bucket_sort (38.5 rasters worst case) but you have got me thinking about sorting networks again.
Could perhaps dispatch to the sorting network routines by spacing the zero page pointers slightly further apart, so that one could jmp($xxxx) to one byte earlier, using the low byte of the storage address as the high byte of the sort routine address
Can also combine pairs of sorting network swap macros to avoid some register spilling, saving about five or six cycles off the pair
You've also reminded me that sometime the last few years I finally realised that sorting network based methods might be able to take advantage of seperating out the player sprite so that each bucket then only needs to deal with at most seven sprites per bucket, or six if there's a second player or a multiple in one player mode.
A six element sorting element is only 12 swaps, a good 15% less per sprite than the 19 for an eight element network. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
thanks! in the meantime I have implemented it, and I am at 46 lines with an avg case, using adaptive sort network. (something off with my speed calcs earlier?)
the swap macro combine is nice I can shave off 8 cycles with that :)
to save memory I'm using another variant than posted:
ldy buckets+q,x
lda yvalues,y
ldy buckets+z,x
cmp yvalues,y
bcs skipswap
lda buckets+q,x
sty buckets+q,x
sta buckets+z,x
.
4 cycles slower but 1/16th mem usage. no need for different code for each bucket.
Gunnar I dont understand how you could do that, not even the timer method :P start the timers and poll the flag register to see which one runs out first, then 2nd etc ?
the player sprite, why would one handle it differently ? it just reduces the flexibility of displayable sprites? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Regarding the player sprite, if it’s free roaming then the multiplexer can only guarantee seven free sprites for antagonists on any given line. So, you get the same capabilities from {putting the player into the actor list and letting the multiplexer drive eight sprites} as you do from {putting the player into sprite zero directly and driving the remaining seven from the multiplexer}.
Same logic applies to a second player, or an orbiting multiple. And if the sorter only needs to deal with at most six sprites per bucket, then the worst case sorting network only has 12 swap macros.
Speaking of buckets, if you’re determining which bucket with a table lookup, you only need 11 of them, each being 21px high. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Reusable sort code sounds good btw. It’d be nice to have a clearer view of code size/speed trade offs. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
btw one thing I could never grasp, how would sprite plexer work with bucket sort, ie not fully sorted but accurate enough ? order of irq doesnt matter for 8 sprites ? just use 1 irq for all 8 ? |
| |
Burglar
Registered: Dec 2004 Posts: 1090 |
Quoting Oswaldbtw one thing I could never grasp, how would sprite plexer work with bucket sort, ie not fully sorted but accurate enough ? order of irq doesnt matter for 8 sprites ? just use 1 irq for all 8 ? I'd go for buckets of 4 sprites I think. |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
I know I'm late to the thread and apologies if I repeat what others have said. I spent a long time optimizing the sort for AAII (probably too long, but it is a nostalgic hobby after all). Couple of comments on that basis
- I found using an insertion sort was faster than bubble; much faster and for several reasons
- When you reuse the sorting order from frame to frame you are almost never worst case and it is pretty quick to change the order of only a couple of (neighboring) elements
- You can help it by e.g. only introducing 1 new object per frame. So rather than e.g. 10 objects in a formation coming in at once, spread them over 10 frames. Helps in general w performance as well (not instantiating 10 new objects in one frame)
- You have to optimize equally for the implementation in 6502 machine code. Some algorithms are theoretically faster, but with only 3 registers you might end up doing a lot of temp storing that kills the theory. On modern processors that is of course different
- We are working with small numbers (relative to general sorting). The algorithm for thousands of elements would be different
- In terms of the player I check specially for that sprite to see that a physical one is free. If not I override the most previous free one. Better to have an enemy or bullet flicker than the player
- As Oswald hits on the IRQ's are equally important. They take up about as long as the sort (rough order magnitude). So as always there are tradeoffs
I'll dig out my performance stats when I get home from work and can also share code if that is helpful. |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
I enjoyed reading through this thread but in the end it wasn't clear to me which sorting algorithm was the fastest (while using a reasonable amount of memory).
Field Sort by Lft (post #120) seemed the fastest and it was also well described at Field Sort
Still, people kept referring to Radix Sort by Doynax (post #31) as the fastest. It wasn't obvious how to compare the posted Radix Sort implementation to Field Sort because it didn't explicitly output the resulting actor order and also didn't state the allowed range of y-positions.
Field Sort outputs the resulting actor order to the stack and the valid range of y-positions is 0-219.
To compare Radix Sort to Field Sort I decided to implement Radix Sort (inspired by the code posted by Doynax) and optimize it as much as I could.
The resulting Radix Sort implementation outputs the sorted actors to the stack and handles y-positions in the range 0-223.
The result?
Field Sort: Worst case performance of 2208 cycles using "< 2K" of memory.
Radix Sort: Worst case performance of 2044 cycles using 1248 bytes for the code and 62 bytes of zeropage (in addition to the y-positions).
Here's my Radix Sort implementation (in Kick Assembler format):
// Radix sort of 32 "actors" based on their y-positions [0,223] in ascending order
//
// Worst case execution time: 2044 cycles
//
// 1248 bytes of code
// 62 bytes of zeropage (in addition to y-positions)
// * = $xx85 for no branch page boundary crossings
.const NUM_SPRITES = 32
.const NUM_LSD_BUCKETS = 16
.const NUM_MSD_BUCKETS = 14
// Init
// Constant 92 cycles
lda #$ff // 2 cycles
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
sta lsd_buckets + i // 3
.for (var i = 0; i < NUM_MSD_BUCKETS; i++)
sta msd_buckets + 8*i // 3
// LSD sort
// Constant 576 cycles
lda #NUM_LSD_BUCKETS - 1 // 2
.for (var i = 0; i < NUM_SPRITES; i++)
{
ldx ypos + i // 3
axs #$00 // 2
ldy lsd_buckets,x // 4
sty links + i // 3
.if (i == NUM_LSD_BUCKETS - 1)
sta lsd_buckets,x // 4
else
{
ldy #i // 2
sty lsd_buckets,x // 4
}
}
// MSD sort
//
// Execution time per LSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * (26 + 28) + 27 cycles
// - N (even) elements: 5 + (N - 2) / 2 * (26 + 28) + 26 + 27 cycles
//
// Worst case 958 cycles (15 empty LSD lists, 1 32-element LSD list)
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
{
ldx lsd_buckets + i // 3
bmi NextBucket // 2/3
Next:
lda ypos,x // 4
alr #$f0 // 2
tay // 2
lda msd_buckets,y // 4
stx msd_buckets,y // 4
ldy links,x // 4
sta links,x // 4
bmi NextBucket // 2/3
lda ypos,y // 4
alr #$f0 // 2
tax // 2
lda msd_buckets,x // 4
sty msd_buckets,x // 4
ldx links,y // 4
sta links,y // 5 !!!
bpl Next // 2/3
NextBucket:
}
// Output final sorted order
//
// Execution time per MSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * 21 + 10 cycles
// - N (even) elements: 5 + (N - 2) / 2 * 21 + 20 cycles
//
// Worst case 418 cycles (13 empty MSD lists, 1 32-element MSD list)
.for (var i = NUM_MSD_BUCKETS - 1; i >= 0; i--)
{
lax msd_buckets + 8*i // 3
bmi NextBucket // 2/3
Next:
pha // 3
lda links,x // 4
bmi NextBucket // 2/3
pha // 3
tay // 2
lax links,y // 4
bpl Next // 2/3
NextBucket:
}
|
| |
lft
Registered: Jul 2007 Posts: 369 |
This is nice! I feel motivated to improve my routine.
But since you are storing the MSD buckets 8 bytes apart, they cannot be on the zero-page. Thus you can't do:
stx msd_buckets,y // 4
(At least not while also storing ypos and links on the zero-page.)
If you put msd_buckets at e.g. $fe00, then you can use the shx instruction, but it will cost 5 cycles. |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
It does fit on the zero-page since there are 14 MSD buckets (not 32). |
| |
lft
Registered: Jul 2007 Posts: 369 |
Right, but in that case, where do you fit 32 consecutive bytes of links and ypos? |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
Here's the zero-page usage in my test code. It obviously wastes 7 unused bytes per MSD bucket list head but it's just for testing.
I'd be happy to send you the source code. PM me if you're interested.
.const NUM_SPRITES = 32
.const NUM_LSD_BUCKETS = 16
.const NUM_MSD_BUCKETS = 14
* = $10 "Zeropage data" virtual
// Actor y positions in [0,223]
ypos:
.fill NUM_SPRITES,0
// Head of linked list per LSD bucket.
lsd_buckets:
.fill NUM_LSD_BUCKETS, 0
// Next "pointer" in (LSD or MSD) bucket list.
links:
.fill NUM_SPRITES, 0
// Head of linked list per MSD bucket.
msd_buckets:
.fill NUM_MSD_BUCKETS * 8, 0
Memory Map
----------
Default-segment:
*$0010-$00cf Zeropage data
|
| |
lft
Registered: Jul 2007 Posts: 369 |
It makes sense now. Thank you! |
| |
CyberBrain Administrator
Posts: 392 |
Cool routine!! Thanx for sharing! Analyzing how it works was pretty illuminating (aka. fun) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Nice work, Bezerk. Good to finally have a comparable radix sort implementation, and a well optimised one at that. Extraordinarily close to one raster line per actor!
The flagged bucket sort I put up at https://www.codebase64.org/doku.php?id=base:flagged_bucket_sort takes around 2423 cycles - significantly slower than field sort or radix.
Somewhat less onerous memory requirements at least; no magic pages, and the zero page use is contiguous :) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Bezerk, can you explain it in detail, frankly I dont understand anything of it :) |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
You should Google "radix sort" for a thorough explanation but I can try to explain how the posted code works.
It's a two pass sort.
In the first (LSD) pass you assign each actor to a bucket based on the least significant digit (LSD) of the actor's y-position ($0-$f). Each bucket has a linked list of the actors whose y-position LSD is the same as the bucket index. For example, LSD bucket number $7 has a linked list of actors with y-position LSD = $7 (i.e. $07, $17, ..., $d7). The order in which the actors are processed doesn't matter in this pass.
In the second (MSD) pass you assign each actor to a bucket based on the most significant digit (MSD) of the y-position ($0-$d). In this pass the actors are processed in LSD bucket order (from $0 to $f) and each actor is prepended to the MSD bucket list. The end result is that each MSD bucket list contains actors whose y-position MSD is the same as the bucket index and sorted based on LSD. For example, MSD bucket number $d will have a linked list of actors with y-position MSD = $d (e.g. $da, $d3, $d0). Again, the important point is that each MSD bucket list of actors is sorted in descending y-position order because the actors were added in LSD bucket order.
The final sorted order is then output to the stack by outputting the MSD bucket lists from index $d to $0.
Hopefully this helps enough to let you decipher the code. |
| |
Frantic
Registered: Mar 2003 Posts: 1647 |
@bezerk: Don't hesitate to put the radix sort implementation on codebase64 :) |
| |
Krill
Registered: Apr 2002 Posts: 2971 |
Finally, the C-64 scene delivers an answer to the age-old question of which sorting algorithm actually is best. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
thanks, now I get it :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
I was able to improve the radix sort.
Writeup at codebase64.
This version can sort 32 actors in 1970 cycles. That's 61.6 cycles per actor.
Quoting ChristopherJamExtraordinarily close to one raster line per actor! Boom! |
| |
Frantic
Registered: Mar 2003 Posts: 1647 |
@lft: Much appreciated! Thank you! :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Excellent work, both of you!
I've been meaning to write up a comprehensive comparison of all the approaches mentioned in this post since sometime in 2017, but as that's clearly not going to happen soon, here's a quick table comparing several of the 'sub 2500 cycle worst case' implementations; 32 actors in each instance.
+---------------+------------------+---------+--------+-----------+------+
| author | algorithm | worst | code+ | zeropage | year |
| | | cycles | data | | |
+---------------+------------------+---------+--------+-----------+------+
| cjam | flagged buckets | 2422 | ~3kb | 59b | 2017 |
| lft | field sort | 2200 | sparse | 64b | 2017 |
| colorbar+cjam | inline buckets | 2086 | 19kb | 0b | 2017 |
| bezerk | radix | 2044 | ~2kb | 94/185b | 2020 |
| bezerk+lft | radix | 1970 | ~2kb | 32b | 2020 |
+---------------+------------------+---------+--------+-----------+------+
(cycle counts do not include jsr/rts, or decanting results from index lists in stack)
Bezerk had already just scraped in below my and Colorbar's approach, with a relatively miniscule memory footprint - and now lft has improved that to hit the holy grail of sub-raster performance in the 32 actor category :D |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
That's brilliant @Lft, well done!
@ChristopherJam, no need to put me + Lft on this one in the comparison table. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Lovely! Question: if one would sort according to high nybble first, what would be the complexity?
And could someone post an equation for the the latest approach that shows the number of cycles as a function of Y-range and number of actors, please? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
I should have included an e&oe really :)
ok, next draft just lft for final radix entry. Any comments on fieldsort memory usage will be integrated too, I just didn't manage to look that up before dinner.
Rastah Bar - Going by lft's writeup at codebase, it's an extra 51 cycles per actor regardless of order.
Flagged bucket sort's a little messier as it depends how many are on the same line or in the same eight line group.
graphs one of these days.. |
| |
lft
Registered: Jul 2007 Posts: 369 |
My approach uses 60 bytes of zero-page, so this should be updated in the table. It can be reduced to 32 bytes at the cost of a few extra cycles (I think 12, but haven't tested).
Adding support for the full range ($00-$ff, adding two more lists for the high nybble) will bump the zero-page size up to 64 bytes, and add 22 cycles. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Quote: My approach uses 60 bytes of zero-page, so this should be updated in the table. It can be reduced to 32 bytes at the cost of a few extra cycles (I think 12, but haven't tested).
Adding support for the full range ($00-$ff, adding two more lists for the high nybble) will bump the zero-page size up to 64 bytes, and add 22 cycles.
a general solution would be nice to have on codebase, not everyone wants to sort sprites. also can it handle all entries showing up in 1 bucket ? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Quote: a general solution would be nice to have on codebase, not everyone wants to sort sprites. also can it handle all entries showing up in 1 bucket ?
Doesn't need to support more than eight entries per bucket for sprites… |
| |
lft
Registered: Jul 2007 Posts: 369 |
@Oswald Yes, it handles all cases, always using the same amount of cycles. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
The costs per actor are lower for inline buckets. I was trying to figure out if it could be beneficial to combine the new radix sort and the inline bucket sort. F.e. one stage of radix sort followed by an inline bucket sort for each bucket of the first stage. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
<Original post deleted>
I thought I had found an improvement to inline buckets, but ChristopherJam already proposed that. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ah, sorry about that :)
But also damn, I'd forgotten whether or not I'd already implemented what you wrote, and was looking forward to rummaging through my code to see if I could integrate it..
Interesting idea about combining radix with IBS - after all, a bucket sort is just a one digit radix sort (compared to the two digit radix sorts used elsewhere in this discussion) |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
Here is my insertion sort, if of any interest. Takes about 800-1000 cycles for 20 actors/sprites under actual load in game.
I wont be the fastest in worst case, but under realistic game loads where you control the number of new actors per frame (i.e. start with a mostly sorted list) it is pretty fast and compact (w 32 actors I have not had it exceed 2000 cycles).
sprite_sorty ldy sorted_indexz+1 ; first iteration only compares element 0 & 1 in the table, so no loop needed
lda object_yl,y ; y position of object in position 1
ldx sorted_indexz ; get pointer to the object in position 0
cmp object_yl,x ; compare y position of object 0 to y position of object 1 (stored in AR)
bcs @second ; if Obj 1 > Obj 0 we are done
stx sorted_indexz+1 ; otherwise swap them
sty sorted_indexz
@second ldx #2 ; get first position to be compared
@loop stx current_pos ; position in index table we are finding the right place for
ldy sorted_indexz,x ; get pointer (offset) to current object to find the right place for
sty current_index ; store current index number dynamically in code for later storage
lda object_yl,y ; y position of current object into AR
@in_right_place ldy sorted_indexz-1,x ; check if object is already in the right place ... saves time if list is already partially sorted
cmp object_yl,y ; ... if it is we dont need to swap any positions but can move right on
bcs @next_position
sty sorted_indexz,x
dex
@compare ldy sorted_indexz-1,x ; get pointer to the object we are comparing current one with
cmp object_yl,y ; compare y position of this object to y position of current object (stored in AR)
bcs @found_place ; is y of current object (in AR) >= y of compare object
sty sorted_indexz,x ; move the compared to index one right in the table
dex ; loop to check next position
bne @compare
@found_place lda current_index ; index to current object
sta sorted_indexz,x ; store to the right of the one it is >= of
ldx current_pos
@next_position inx
cpx #NUM_SPR_OBJ ; end of table (number of sprite objects)?
bcc @loop
@end rts
I have N bytes (number of actors) in zero page in terms of the sorted index table. If you also place the y-coordinates in zero page it is 2*N bytes. And the actual code is 61 bytes. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
ldy sorted_indexz+1
lda object_yl,y
why not use lda (),y for this ? you can do the indirection in one step: zp pointers are the sorted indexes, and Y points to Y coordinate attribute within virtual sprite data |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
Good question @Oswald that sent me on a spin incl. considering if I would use (zp,x) instead. I had not really contemplated using the index as an index of 16 bit addresses, rather than a simple list of 8 bit numbers.
However unless I misunderstand you the resulting code including the swapping of indexes would become:
@loop sty current_pos ; position in the index
lda (sorted_indexz),y ; y coordinate of object
ldx sorted_indexz,y ; object number/index
stx current_index
@in_right_place cmp (sorted_indexz-2),y ; compare to y coordinate of next object in index
bcs @next_position
ldx sorted_indexz-2,y ; move compared objects index to the right sorted index list
stx sorted_indexz,y
dey
dey
@compare cmp (sorted_indexz-2),y
bcs @found_place
ldx sorted_indexz-2,y
stx sorted_indexz,y
dey
dey
bne @compare
@found_place ldx current_index
stx sorted_indexz,y
@next_position ldy current_pos
iny
iny
cpy #2*NUM_SPR_OBJ
bcc @loop
which is actually a couple of cycle longer per iteration and the index will be twice as long in zp. Also when I later populate the sprite IRQ's I will have two dey's to iterate through the indexes (small extra cost, but still).
It might be worth noting that all my (virtual) sprite data sits in array's of similar properties. I.e. y1,y2,y3, ... x1,x2,x3, ... and not y1,x1 ... y2,x2. I access all values using the index of the object and not the index of the attribute.
I could have missed something. It is early on a Saturday and I have not had enough coffee yet :-)
PS: I did spot a possible optimization more that I will take a look at now. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
I mean to use that kind of indirection every time when you want to read Y coo.
so instead of
ldy sorted_virtual_sprite_pointers
lda ycoordinates,y
ldy sorted.....+1
cmp ycoordinsates,y
lda (sortedpointers),y
cmp (sortedpointers+2),y |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
Yes that is what I did above. The load and cmp of y coordinates is faster, but the sorting of the indexes becomes slower. And with the double dey all in all the original implementation is faster (unless I missed something). |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
I see, too bad |