Bitbreaker
Optimizing span filler
To prove a quote recently made in another thread by Skid Row wrong, I am going to ask in public :-)
"Average programmers do ask friends or in forums (like Lemon64, Forum64, CSDB Forum...) and experienced programmers... well,they just don't need to ask! ;)"
I optimized my span filler some time ago and have used it on various occasions with different effects. In the meantime I have shrunk it down in size, but I would love to make a few things even faster, most of all the inner loop, which looks like the following:
* = $0010
fill
;x = x2
;y = y2
lda #$f8
sax <f_jmp+1 ;set initially, as it is not set on every turn later on
f_back ;common entry point where all code segments reenter when done
dey
f_yend cpy #$00 ;forces carry to be set \o/
bcc f_end
f_err lda #$00 ;restore error
f_dx1 sbc #$00 ;do that bresenhamthingy for xend, code will be setup for either flat or steep slope
f_code bcs + ;inx ;in case of flat slopes
bcs_start dex ;bcs * - 3 ;
f_dx2 adc #$00
sta <f_err+1
lda #$f8
sax <f_jmp+1 ;update start of span, depending on bit 7 stuff is rendered to buffer 1 or 2 (offset of $80 in the table)
bne ++ ;so buttugly, but need to skip
bcs_end
+
sta <f_err+1 ;save error
++
lda xstart,y ;load previously calced x1
sta <f_msk+1 ;setup mask without tainting X
arr #$78 ;-> carry is still set, bit 7 always cleared. This way we generate values from $80 .. $bc, a range to which we adopt the memory layout of the row tables
sta <f_jmp+2 ;update byte of jump responsible to select all code-segments that start with xstart
f_patt lda patt_0,y ;fetch pattern
f_msk and maskl ;apply mask for left edge
f_jmp jmp ($1000) ;do it! \o/
f_end
rts
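To spell out what the arr #$78 does there (a worked example; the exact table layout is my reading of the comments):

;say xstart,y = $53 = %01010011, carry is set as noted above
;and #$78 -> %01010000 ($50), keeping bits 3..6
;ror      -> %10101000 ($a8), the carry rotating into bit 7

So f_jmp+2 becomes $80 + (x1 & $78)/2, one of $80/$84/../$bc, selecting the pointer page derived from x1, while the sax earlier placed x2 & $f8 (with bit 7 selecting buffer 1 or 2) into f_jmp+1. Together they address the vector pointing at the matching speedcode chunk.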
Keep in mind that this loop shall not be unrolled or even duplicated, as the entry point f_back is fixed and used by all speedcode chunks that are entered through the indirect jump. Doing so would mean multiplying the speedcode chunks or giving them a variable entry point by introducing another indirect jump (which wastes another 2 cycles + setup).
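Just to put numbers on that alternative: with a variable entry point, each chunk would have to end in something like

       jmp (f_ret)   ;5 cycles instead of 3 for the plain jmp f_back

with f_ret being a hypothetical vector that would also need to be rewritten every line, hence the 2 cycles + setup mentioned above.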
The code at f_dx1 (3 bytes) is self-modifying, depending on whether a steep or a flat slope is generated. Patterns alternate each line, so the additional lookup (lda patt_0,y) is somewhat necessary.
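For those puzzled by the commented-out opcodes: the 3 bytes at f_code/bcs_start hold one of two shapes, patched in by the setup code. A sketch of the two variants as I read them (inx/dex swapped depending on the x direction):

;steep slopes - at most one x step per line
f_code     bcs +        ;error still >= 0, no step this line
bcs_start  dex          ;error underflowed, step once, f_dx2 restores the error

;flat slopes - one or more x steps per line
f_code     inx          ;step
bcs_start  bcs * - 3    ;error still >= 0 -> back to f_dx1 for the next step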
One of the many speedcode chunks could look like:
sta .addr,y ;write through, smart poly order avoids clashes
lda (f_patt+1),y ;refetch pattern, expensive, but at least less than sta patt, lda patt
sta .addr + $080,y
sta .addr + $100,y
sta .addr + $180,y
sta .addr + $200,y
sta .addr + $280,y
sta .addr + $300,y
and maskr,x ;right edge
ora .addr + $380,y ;need to ora here
sta .addr + $380,y
jmp f_back
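For a rough cycle budget of that chunk, ignoring page crossings: the eight sta abs,y take 5 cycles each, the lda (zp),y takes 5, the and abs,x and ora abs,y take 4 each, and the jmp takes 3 - roughly 56 cycles for a chunk of this width, on top of the per-line loop above.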
So, any suggestions on how to save further cycles? As you can see, there are a few awkward and painful spots: creating the xstart table in a separate step, refetching the pattern again via lda (zp),y, the pattern lookup, the bne ++, and the fact that I cannot use any register without facing a register store-and-load galore that would slow things down to death.
Or is this already the ultimate, hats-off ("Hut ab!") optimum?