 
DanPhillips
Registered: Jan 2003 Posts: 18 
Sparse Bubble Sort
Hey,
Over the holidays I was messing with a "get as many sprites around with a proper multiplexor" background for a boot menu.
Had some success, unrolling and using some illegal opcodes, bit packing some data to reduce the amount of data read from memory and some algorithmic re working.
Final attempt sees 82 sprites bouncing (They all move at 50fps, each allows x,y,msb,color,definition and bounce off the left/right border and scanline 0 and $FD.
<Not sure if this is original, but thought it was worth pointing out)
While looking at the simple bubble sort and doing the obvious unrolling I noticed something about the data being read/reread.
The project is in DASM, so DASM macro syntax...
The original:
MAC BubbleYSortMacro ;param {1} index, param {2} source offset
LDY YORDER+1+{1}
LDX YORDER+{1}
LDA YTAB+{2},X
CMP YTAB+{2},Y
BCS .NoNeedToSwap
STX YORDER+1+{1}
STY YORDER+{1}
.NoNeedToSwap
ENDM
The YTAB is double buffered.
New macros (probably only need 2, but 3 for added complexity)
MAC BubbleYSortMacro0 ;param 1 index, param 2 source offset
LDY YORDER+1+{1}
LDX YORDER+{1}
LDA YTAB+{2},X
CMP YTAB+{2},Y
BCS .NoNeedToSwap1
STX YORDER+1+{1}
STY YORDER+{1}
.if {1} > 0
LDX YORDER+{1}
LDA YTAB+{2},X
.endif
.NoNeedToSwap1
ENDM
; x needs to be {1}+1 a needs to be YTAB+{2},x
MAC BubbleYSortMacro1 ;param 1 index, param 2 source offset
LDY YORDER+{1}
CMP YTAB+{2},Y
BCC .NoNeedToSwap2
STY YORDER+1+{1}
STX YORDER+{1}
.if {1} > 0
LDY YORDER+{1}
.endif
.NoNeedToSwap2
ENDM
; y needs to be {1}+1
MAC BubbleYSortMacro2 ;param 1 index, param 2 source offset
LDX YORDER+{1}
LDA YTAB+{2},X
CMP YTAB+{2},Y
BCS .NoNeedToSwap3
STX YORDER+1+{1}
STY YORDER+{1}
.if {1} > 0
LDX YORDER+{1}
LDA YTAB+{2},X
.endif
.NoNeedToSwap3
ENDM
You use the 1st macro, then alternate the 2nd and 3rd.
The savings come from not reloading and alternating the comparison.
If no swapping is needed this saves 7 cycles for the 2nd macro and 3 for the 3rd macro.
In my boot menu background with 82 sprites this saves 360 cycles (average of 10 swaps needed per frame)
Amyway, not sure if this is new, but kinda funny looking at the same simple bubble sort for 30 odd years and not spotting this pattern :D
Cheers
Dan 

... 2 posts hidden. Click here to view all posts.... 
 
Oswald
Registered: Apr 2002 Posts: 4507 
why not use the (),y to make the indirect step directly ?
lda (sprindex1),y
cmp (sprindex2),y
bcs noswap
lda sprindex1
ldx sprindex2
sta sprindex2
stx sprindex1 
 
DanPhillips
Registered: Jan 2003 Posts: 18 
Quote: why not use the (),y to make the indirect step directly ?
lda (sprindex1),y
cmp (sprindex2),y
bcs noswap
lda sprindex1
ldx sprindex2
sta sprindex2
stx sprindex1
That would work (Assuming YTable is page aligned)
It would save 1 cycle per 2 sprites if no swaps were needed.
Would be 1 cycle better for even swaps and 3 cycles worse for odd.
But in this particular case I'd need 164 bytes of zero page.
Which would mean either the XSpeed or the YSpeed tables would need to move out of zero page, both of which would add at least a cycle per sprite overhead.
Cheers
Dan 
 
DanPhillips
Registered: Jan 2003 Posts: 18 
And coding this up there's an overhead if your YTable is double buffered you need to update the (ZP) pairs, all of them, 3 cycles per sprite.
Cheers
Dan 
 
DanPhillips
Registered: Jan 2003 Posts: 18 
Of could use the Y register :)
Cheers
Dan 
 
Oswald
Registered: Apr 2002 Posts: 4507 
what is xspeed / yspeed table ? I think its enough to be able to adress this way only the Y tables. the per swap/noswap gain is not much sadly. 
 
ChristopherJam
Registered: Aug 2004 Posts: 1001 
Speaking of making the swaps faster:
I did wonder about using the Y values to bucket the IDs, then just sorting the list of Y values to avoid all that indirection, but I think the overhead is way higher than any time likely to be saved in the sort routine proper.
nactors* {
ldy sorted_indices+i ;3
lax actorY,y ;4
sta sort_table,y ;5
lda bucket,x ;4
sta bucketnext,y ;5
tya ;2
sta bucket,x ;5 total 28*na, assuming last frame's sort indices are in zp
}
..then sort the sort_table (which just contains Y values), then..nactors * {
ldy sort_table+i ; 3
ldx bucket,y ; 4
lda bucket_next,x ; 4
sta bucket,y ; 5
sta sort_indices+i ; 3 total 19*na
}
19+28=47 cycles of overhead. Need a lot of savings on the sort to justify that...
Nice that the linked list at each bucket doesn't need to be terminated. Each bucket is mentioned as many times in sort_table as there are entries in that bucket, so the correct number of indices will be pulled out post sort. 
 
Pararaum
Registered: Sep 2018 Posts: 4 
You should try the Gnome Sort it is perfect for your task... 
 
ChristopherJam
Registered: Aug 2004 Posts: 1001 
Hah! I've been thinking about trying to optimise a gnome sort (in particular, not putting the pot down while moving it left), but I didn't know it had a name.
Thanks, Pararaum :)
edit: oh wait... I just realised that the difference between gnome sort and insertion sort is that the gnome sort "forgets" the high water mark, and does redundant comparisons on its way back to the next unseen pot. Back to insertion sort it is, probably. 
 
Oswald
Registered: Apr 2002 Posts: 4507 
Quote: Hah! I've been thinking about trying to optimise a gnome sort (in particular, not putting the pot down while moving it left), but I didn't know it had a name.
Thanks, Pararaum :)
edit: oh wait... I just realised that the difference between gnome sort and insertion sort is that the gnome sort "forgets" the high water mark, and does redundant comparisons on its way back to the next unseen pot. Back to insertion sort it is, probably.
you can remember where you started and skip those checks. read wiki as I did :) 
 
ChristopherJam
Registered: Aug 2004 Posts: 1001 
Yes, but then (as the wikipedia article itself concedes) it's just insertion sort with a different name :P 
Previous  1  2  Next 