| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Sorting
are sorters really so slow in games? :) I have made my unrolled version for my theoretical game (;), and it takes 132 rlines to sort 32 numbers o_O worst case is ~200 lines. tho when its the case of 4 numbers have to be swapped only it does the job in ~10 lines. wastes a lot of memory but I like it :) |
|
... 193 posts hidden. Click here to view all posts.... |
| |
mhindsbo
Registered: Dec 2014 Posts: 51 |
I know I'm late to the thread and apologies if I repeat what others have said. I spent a long time optimizing the sort for AAII (probably too long, but it is a nostalgic hobby after all). Couple of comments on that basis
- I found using an insertion sort was faster than bubble; much faster and for several reasons
- When you reuse the sorting order from frame to frame you are almost never worst case and it is pretty quick to change the order of only a couple of (neighboring) elements
- You can help it by e.g. only introducing 1 new object per frame. So rather than e.g. 10 objects in a formation coming in at once, spread them over 10 frames. Helps in general w performance as well (not instantiating 10 new objects in one frame)
- You have to optimize equally for the implementation in 6502 machine code. Some algorithms are theoretically faster, but with only 3 registers you might end up doing a lot of temp storing that kills the theory. On modern processors that is of course different
- We are working with small numbers (relative to general sorting). The algorithm for thousands of elements would be different
- In terms of the player I check specially for that sprite to see that a physical one is free. If not I override the most previous free one. Better to have an enemy or bullet flicker than the player
- As Oswald hits on the IRQ's are equally important. They take up about as long as the sort (rough order magnitude). So as always there are tradeoffs
I'll dig out my performance stats when I get home from work and can also share code if that is helpful. |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
I enjoyed reading through this thread but in the end it wasn't clear to me which sorting algorithm was the fastest (while using a reasonable amount of memory).
Field Sort by Lft (post #120) seemed the fastest and it was also well described at Field Sort
Still, people kept referring to Radix Sort by Doynax (post #31) as the fastest. It wasn't obvious how to compare the posted Radix Sort implementation to Field Sort because it didn't explicitly output the resulting actor order and also didn't state the allowed range of y-positions.
Field Sort outputs the resulting actor order to the stack and the valid range of y-positions is 0-219.
To compare Radix Sort to Field Sort I decided to implement Radix Sort (inspired by the code posted by Doynax) and optimize it as much as I could.
The resulting Radix Sort implementation outputs the sorted actors to the stack and handles y-positions in the range 0-223.
The result?
Field Sort: Worst case performance of 2208 cycles using "< 2K" of memory.
Radix Sort: Worst case performance of 2044 cycles using 1248 bytes for the code and 62 bytes of zeropage (in addition to the y-positions).
Here's my Radix Sort implementation (in Kick Assembler format):
// Radix sort of 32 "actors" based on their y-positions [0,223] in ascending order
//
// Worst case execution time: 2044 cycles
//
// 1248 bytes of code
// 62 bytes of zeropage (in addition to y-positions)
// * = $xx85 for no branch page boundary crossings
.const NUM_SPRITES = 32
.const NUM_LSD_BUCKETS = 16
.const NUM_MSD_BUCKETS = 14
// Init
// Constant 92 cycles
lda #$ff // 2 cycles
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
sta lsd_buckets + i // 3
.for (var i = 0; i < NUM_MSD_BUCKETS; i++)
sta msd_buckets + 8*i // 3
// LSD sort
// Constant 576 cycles
lda #NUM_LSD_BUCKETS - 1 // 2
.for (var i = 0; i < NUM_SPRITES; i++)
{
ldx ypos + i // 3
axs #$00 // 2
ldy lsd_buckets,x // 4
sty links + i // 3
.if (i == NUM_LSD_BUCKETS - 1)
sta lsd_buckets,x // 4
else
{
ldy #i // 2
sty lsd_buckets,x // 4
}
}
// MSD sort
//
// Execution time per LSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * (26 + 28) + 27 cycles
// - N (even) elements: 5 + (N - 2) / 2 * (26 + 28) + 26 + 27 cycles
//
// Worst case 958 cycles (15 empty LSD lists, 1 32-element LSD list)
.for (var i = 0; i < NUM_LSD_BUCKETS; i++)
{
ldx lsd_buckets + i // 3
bmi NextBucket // 2/3
Next:
lda ypos,x // 4
alr #$f0 // 2
tay // 2
lda msd_buckets,y // 4
stx msd_buckets,y // 4
ldy links,x // 4
sta links,x // 4
bmi NextBucket // 2/3
lda ypos,y // 4
alr #$f0 // 2
tax // 2
lda msd_buckets,x // 4
sty msd_buckets,x // 4
ldx links,y // 4
sta links,y // 5 !!!
bpl Next // 2/3
NextBucket:
}
// Output final sorted order
//
// Execution time per MSD list as a function of number of list elements:
// - 0 elements: 6 cycles
// - N (odd) elements: 5 + (N - 1) / 2 * 21 + 10 cycles
// - N (even) elements: 5 + (N - 2) / 2 * 21 + 20 cycles
//
// Worst case 418 cycles (13 empty MSD lists, 1 32-element MSD list)
.for (var i = NUM_MSD_BUCKETS - 1; i >= 0; i--)
{
lax msd_buckets + 8*i // 3
bmi NextBucket // 2/3
Next:
pha // 3
lda links,x // 4
bmi NextBucket // 2/3
pha // 3
tay // 2
lax links,y // 4
bpl Next // 2/3
NextBucket:
}
|
| |
lft
Registered: Jul 2007 Posts: 369 |
This is nice! I feel motivated to improve my routine.
But since you are storing the MSD buckets 8 bytes apart, they cannot be on the zero-page. Thus you can't do:
stx msd_buckets,y // 4
(At least not while also storing ypos and links on the zero-page.)
If you put msd_buckets at e.g. $fe00, then you can use the shx instruction, but it will cost 5 cycles. |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
It does fit on the zero-page since there are 14 MSD buckets (not 32). |
| |
lft
Registered: Jul 2007 Posts: 369 |
Right, but in that case, where do you fit 32 consecutive bytes of links and ypos? |
| |
Bezerk Account closed
Registered: Mar 2020 Posts: 5 |
Here's the zero-page usage in my test code. It obviously wastes 7 unused bytes per MSD bucket list head but it's just for testing.
I'd be happy to send you the source code. PM me if you're interested.
.const NUM_SPRITES = 32
.const NUM_LSD_BUCKETS = 16
.const NUM_MSD_BUCKETS = 14
* = $10 "Zeropage data" virtual
// Actor y positions in [0,223]
ypos:
.fill NUM_SPRITES,0
// Head of linked list per LSD bucket.
lsd_buckets:
.fill NUM_LSD_BUCKETS, 0
// Next "pointer" in (LSD or MSD) bucket list.
links:
.fill NUM_SPRITES, 0
// Head of linked list per MSD bucket.
msd_buckets:
.fill NUM_MSD_BUCKETS * 8, 0
Memory Map
----------
Default-segment:
*$0010-$00cf Zeropage data
|
| |
lft
Registered: Jul 2007 Posts: 369 |
It makes sense now. Thank you! |
| |
CyberBrain Administrator
Posts: 392 |
Cool routine!! Thanx for sharing! Analyzing how it works was pretty illuminating (aka. fun) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Nice work, Bezerk. Good to finally have a comparable radix sort implementation, and a well optimised one at that. Extraordinarily close to one raster line per actor!
The flagged bucket sort I put up at https://www.codebase64.org/doku.php?id=base:flagged_bucket_sort takes around 2423 cycles - significantly slower than field sort or radix.
Somewhat less onerous memory requirements at least; no magic pages, and the zero page use is contiguous :) |
| |
Oswald
Registered: Apr 2002 Posts: 5086 |
Bezerk, can you explain it in detail, frankly I dont understand anything of it :) |
Previous - 1 | ... | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 - Next |