| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Sorting
are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :) |
|
... 193 posts hidden. Click here to view all posts.... |
| |
Krill
Registered: Apr 2002 Posts: 2969 |
Something i keep thinking briefly about whenever sprite sorting pops up is... somehow having hardware-assisted constant-time sorting.
Like, when you have buckets of 4 or 8 coordinates, which are quickly generated as seen in Oswald's post, how about sorting the coordinates within a bucket using, say, CIA timer interrupts or VIC collision detection? :)
For timer interrupts, one could set up 4 timers and then sample the interrupt flags, which would give results according to the values to be sorted.
Collision detection could also help for buckets of 8 coordinates, with some smart sprite-char collision scheme, albeit at the expense of some memory.
But... i never came far with these ideas, but then again i haven't really tried so far. :)
Edit: Of course, with 4 timer interrupts and a raster interrupt, one could sort 5 coordinates on the fly and have the multiplexer interrupt handler triggered in sorted order. But i guess this is out of scope for these discussions. =) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Nice one Oswald.
As is, it's a little slower than https://www.codebase64.org/doku.php?id=base:flagged_bucket_sort (38.5 rasters worst case) but you have got me thinking about sorting networks again.
Could perhaps dispatch to the sorting network routines by spacing the zero page pointers slightly further apart, so that one could jmp($xxxx) to one byte earlier, using the low byte of the storage address as the high byte of the sort routine address
Can also combine pairs of sorting network swap macros to avoid some register spilling, saving about five or six cycles off the pair
You've also reminded me that sometime the last few years I finally realised that sorting network based methods might be able to take advantage of seperating out the player sprite so that each bucket then only needs to deal with at most seven sprites per bucket, or six if there's a second player or a multiple in one player mode.
A six element sorting element is only 12 swaps, a good 15% less per sprite than the 19 for an eight element network. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
thanks! in the meantime I have implemented it, and I am at 46 lines with an avg case, using adaptive sort network. (something off with my speed calcs earlier?)
the swap macro combine is nice I can shave off 8 cycles with that :)
to save memory I'm using another variant than posted:
ldy buckets+q,x
lda yvalues,y
ldy buckets+z,x
cmp yvalues,y
bcs skipswap
lda buckets+q,x
sty buckets+q,x
sta buckets+z,x
.
4 cycles slower but 1/16th mem usage. no need for different code for each bucket.
Gunnar I dont understand how you could do that, not even the timer method :P start the timers and poll the flag register to see which one runs out first, then 2nd etc ?
the player sprite, why would one handle it differently ? it just reduces the flexibility of displayable sprites? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Regarding the player sprite, if it’s free roaming then the multiplexer can only guarantee seven free sprites for antagonists on any given line. So, you get the same capabilities from {putting the player into the actor list and letting the multiplexer drive eight sprites} as you do from {putting the player into sprite zero directly and driving the remaining seven from the multiplexer}.
Same logic applies to a second player, or an orbiting multiple. And if the sorter only needs to deal with at most six sprites per bucket, then the worst case sorting network only has 12 swap macros.
Speaking of buckets, if you’re determining which bucket with a table lookup, you only need 11 of them, each being 21px high. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Reusable sort code sounds good btw. It’d be nice to have a clearer view of code size/speed trade offs. |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
btw one thing I could never grasp, how would sprite plexer work with bucket sort, ie not fully sorted but accurate enough ? order of irq doesnt matter for 8 sprites ? just use 1 irq for all 8 ? |
| |
Burglar
Registered: Dec 2004 Posts: 1088 |
Quoting Oswaldbtw one thing I could never grasp, how would sprite plexer work with bucket sort, ie not fully sorted but accurate enough ? order of irq doesnt matter for 8 sprites ? just use 1 irq for all 8 ? I'd go for buckets of 4 sprites I think. |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
I know I'm late to the thread and apologies if I repeat what others have said. I spent a long time optimizing the sort for AAII (probably too long, but it is a nostalgic hobby after all). Couple of comments on that basis
- I found using an insertion sort was faster than bubble; much faster and for several reasons
- When you reuse the sorting order from frame to frame you are almost never worst case and it is pretty quick to change the order of only a couple of (neighboring) elements
- You can help it by e.g. only introducing 1 new object per frame. So rather than e.g. 10 objects in a formation coming in at once, spread them over 10 frames. Helps in general w performance as well (not instantiating 10 new objects in one frame)
- You have to optimize equally for the implementation in 6502 machine code. Some algorithms are theoretically faster, but with only 3 registers you might end up doing a lot of temp storing that kills the theory. On modern processors that is of course different
- We are working with small numbers (relative to general sorting). The algorithm for thousands of elements would be different
- In terms of the player I check specially for that sprite to see that a physical one is free. If not I override the most previous free one. Better to have an enemy or bullet flicker than the player
- As Oswald hits on the IRQ's are equally important. They take up about as long as the sort (rough order magnitude). So as always there are tradeoffs
I'll dig out my performance stats when I get home from work and can also share code if that is helpful. |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
I enjoyed reading through this thread but in the end it wasn't clear to me which sorting algorithm was the fastest (while using a reasonable amount of memory).
Field Sort by Lft (post #120) seemed the fastest and it was also well described at Field Sort
Still, people kept referring to Radix Sort by Doynax (post #31) as the fastest. It wasn't obvious how to compare the posted Radix Sort implementation to Field Sort because it didn't explicitly output the resulting actor order and also didn't state the allowed range of y-positions.
Field Sort outputs the resulting actor order to the stack and the valid range of y-positions is 0-219.
To compare Radix Sort to Field Sort I decided to implement Radix Sort (inspired by the code posted by Doynax) and optimize it as much as I could.
The resulting Radix Sort implementation outputs the sorted actors to the stack and handles y-positions in the range 0-223.
The result?
Field Sort: Worst case performance of 2208 cycles using "< 2K" of memory.
Radix Sort: Worst case performance of 2044 cycles using 1248 bytes for the code and 62 bytes of zeropage (in addition to the y-positions).
Here's my Radix Sort implementation (in Kick Assembler format):
// Radix sort of 32 "actors" based on their y-positions [0,223] in ascending order
//
// Worst case execution time: 2044 cycles
//
// 1248 bytes of code
// 62 bytes of zeropage (in addition to y-positions)
// * = $xx85 for no branch page boundary crossings
.const NUM_SPRITES = 32
.const NUM_LSD_BUCKETS = 16
.const NUM_MSD_BUCKETS = 14
// Init
// Constant 92 cycles
lda #$ff // 2 cycles
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
sta lsd_buckets + i // 3
.for (var i = 0; i < NUM_MSD_BUCKETS; i++)
sta msd_buckets + 8*i // 3
// LSD sort
// Constant 576 cycles
lda #NUM_LSD_BUCKETS - 1 // 2
.for (var i = 0; i < NUM_SPRITES; i++)
{
ldx ypos + i // 3
axs #$00 // 2
ldy lsd_buckets,x // 4
sty links + i // 3
.if (i == NUM_LSD_BUCKETS - 1)
sta lsd_buckets,x // 4
else
{
ldy #i // 2
sty lsd_buckets,x // 4
}
}
// MSD sort
//
// Execution time per LSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * (26 + 28) + 27 cycles
// - N (even) elements: 5 + (N - 2) / 2 * (26 + 28) + 26 + 27 cycles
//
// Worst case 958 cycles (15 empty LSD lists, 1 32-element LSD list)
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
{
ldx lsd_buckets + i // 3
bmi NextBucket // 2/3
Next:
lda ypos,x // 4
alr #$f0 // 2
tay // 2
lda msd_buckets,y // 4
stx msd_buckets,y // 4
ldy links,x // 4
sta links,x // 4
bmi NextBucket // 2/3
lda ypos,y // 4
alr #$f0 // 2
tax // 2
lda msd_buckets,x // 4
sty msd_buckets,x // 4
ldx links,y // 4
sta links,y // 5 !!!
bpl Next // 2/3
NextBucket:
}
// Output final sorted order
//
// Execution time per MSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * 21 + 10 cycles
// - N (even) elements: 5 + (N - 2) / 2 * 21 + 20 cycles
//
// Worst case 418 cycles (13 empty MSD lists, 1 32-element MSD list)
.for (var i = NUM_MSD_BUCKETS - 1; i >= 0; i--)
{
lax msd_buckets + 8*i // 3
bmi NextBucket // 2/3
Next:
pha // 3
lda links,x // 4
bmi NextBucket // 2/3
pha // 3
tay // 2
lax links,y // 4
bpl Next // 2/3
NextBucket:
}
|
| |
lft
Registered: Jul 2007 Posts: 369 |
This is nice! I feel motivated to improve my routine.
But since you are storing the MSD buckets 8 bytes apart, they cannot be on the zero-page. Thus you can't do:
stx msd_buckets,y // 4
(At least not while also storing ypos and links on the zero-page.)
If you put msd_buckets at e.g. $fe00, then you can use the shx instruction, but it will cost 5 cycles. |
Previous - 1 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 - Next |