| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Optimizing span filler
To prove wrong a quote recently made by Skid Row in another thread, I am going to ask in public :-)
"Average programmers do ask friends or in forums (like Lemon64, Forum64, CSDB Forum...) and experienced programmers... well,they just don't need to ask! ;)"
I optimized my span filler some time ago and have used it on various occasions with different effects. I have meanwhile shrunk it down in size, but I would love to make a few things even faster, above all the inner loop, which looks like the following:
* = $0010
fill
;x = x2
;y = y2
lda #$f8
sax <f_jmp+1 ;set initially, as it is not set on every turn later on
f_back ;common entry point where all code segments reenter when done
dey
f_yend cpy #$00 ;forces carry to be set \o/
bcc f_end
f_err lda #$00 ;restore error
f_dx1 sbc #$00 ;do that bresenham thingy for xend, code will be set up for either flat or steep slope
f_code bcs + ;inx ;in case of flat slopes
bcs_start dex ;bcs * - 3 ;
f_dx2 adc #$00
sta <f_err+1
lda #$f8
sax <f_jmp+1 ;update start of span, depending on bit 7 stuff is rendered to buffer 1 or 2 (offset of $80 in the table)
bne ++ ;so buttugly, but need to skip
bcs_end
+
sta <f_err+1 ;save error
++
lda xstart,y ;load previously calced x1
sta <f_msk+1 ;setup mask without tainting X
arr #$78 ;-> carry is still set, bit 7 always cleared. This way we generate values from $80 .. $bc, a range to which we adapt the memory layout of the row tables
sta <f_jmp+2 ;update byte of jump responsible to select all code-segments that start with xstart
f_patt lda patt_0,y ;fetch pattern
f_msk and maskl ;apply mask for left edge
f_jmp jmp ($1000) ;do it! \o/
f_end
rts
Keep in mind that this loop shall not be unrolled or even duplicated, as the entry point f_back is fixed and used by all speedcode chunks that are entered through the indirect jump. Doing so would mean multiplying the speedcode chunks or giving them a variable entry point by introducing another indirect jump (which wastes another 2 cycles + setup).
The code at f_dx1 (3 bytes) is self-modifying, depending on whether a steep or a flat slope is generated. Patterns alternate each line, so the additional lookup (lda patt_0,y) is somewhat necessary.
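To make the self-modification concrete, here is a sketch of the two shapes the code around f_code/bcs_start can take (following the inline comments above; the actual patching happens in the setup code, not shown here):
;steep slope: x moves at most one pixel per line
f_code bcs + ;no borrow -> x stays, take the plain error store at +
bcs_start dex ;borrow -> step x once, error is re-adjusted at f_dx2
;flat slope: x may move several pixels per line
f_code inx ;step x
bcs_start bcs * - 3 ;loop back to the sbc at f_dx1 until a borrow occurs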
One of the many speedcode chunks could look like:
sta .addr,y ;write through, smart poly order avoids clashes
lda (f_patt+1),y ;refetch pattern, expensive, but at least cheaper than a sta patt / lda patt pair
sta .addr + $080,y
sta .addr + $100,y
sta .addr + $180,y
sta .addr + $200,y
sta .addr + $280,y
sta .addr + $300,y
and maskr,x ;right edge
ora .addr + $380,y ;need to ora here
sta .addr + $380,y
jmp f_back
So, any suggestions on how to save further cycles? As you see, there are a few awkward and painful spots: creating the xstart table in a separate step, refetching the pattern via lda (zp),y, the pattern lookup, the bne ++, and the fact that I cannot use any register without facing a register store-and-load galore that would slow things down to death.
Or is this already the ultimate "optimum hut ab"? |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Don't think you can get much further with that.
Unrolling the slope calcs would lead to a cleaner and probably faster routine:
a) you can keep all slope vars in registers while calcing
b) you can immediately self-mod the results into the jump & mask calcs
The
inx
bcs *-
construct is also expensive; I'd use 16-bit adds for such slopes. A log div is good enough for that. |
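Something like this per line (rough sketch, 8.8 fixed point; the f_x*/f_dx* labels are just for illustration):
clc
f_xlo lda #$00 ;fractional part of x (self-modified immediate)
f_dxlo adc #$00 ;fractional part of the per-line slope
sta <f_xlo+1
f_xhi lda #$00 ;integer part of x
f_dxhi adc #$00 ;integer part of the per-line slope (two's complement, so negative slopes work too)
sta <f_xhi+1 ;integer x, ready for the mask/jmp setup
One constant-time add per line instead of looping inx/bcs, at the cost of precalculating the 8.8 slope (which is where the log div comes in).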
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Unfortunately, writing out the second slope as well is more expensive, as it inflicts a store and another load (+7 cycles minimum per line). Been there, and I even used the stack for it. It is true, however, that flat slopes are not optimal. But they don't cost much overall, as the flat portions usually result in just a few lines to be rendered, and rendering the lines is the really expensive part.
Edit: also, when separating into flat and steep slopes you would need two inner loops, which would again either slow down the returning jump or double the speedcode, which already has quite some size. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Interesting.. what did he write? Something like.. "span fillers are just slow and ugly.." I guess :D. |
| |
bepp
Registered: Jun 2010 Posts: 265 |
Lol I read "spam filter" and didn't get a thing. Even now I don't =) |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
HCL: there was this Amiga ball compo a while back, remember ;-P |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Too lazy to count all the cycles, but I think there is a small chance that unrolling both slopes and the span selector works out better. Then you could use jsr abs, since the lo/hi jmp address can be tabled, hence the edge can be kept in a register. |
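Roughly (sketch only, table and label names made up):
lda jmptab_lo,x ;patch a jsr abs from lo/hi tables indexed by the edge kept in X
sta <f_jsr+1
lda jmptab_hi,x
sta <f_jsr+2
f_jsr jsr $1000 ;speedcode chunk then ends with rts instead of jmp f_back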
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I have now replaced the right-slope precalculation with an on-the-fly calculation, just as I do for the left side, and it turns out to be even faster (though with penalties due to not being able to use the X register). So much for that :-D |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Failing to improve on this one, but it's inspiring some thoughts for an alternate approach... |
| |
lft
Registered: Jul 2007 Posts: 369 |
This may or may not result in a net gain. Perhaps you could try it out with some actual data:
First, at the cost of two cycles, change
f_err lda #$00
into
pla
Then, at no cycle cost, change each of the two
sta <f_err+1
into
pha
Next, relocate the code so that f_jmp+1 = $48, which is the opcode PHA.
Now you can remove bne ++, and point bcs_end/+ to the operand of the SAX.
The code path for flat slopes is now one cycle faster, but the code path for steep slopes is either two cycles slower or one cycle faster depending on whether the branch is taken. Hence the need to measure on real data. |
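The resulting layout would look roughly like this (sketch only; f_save is just an illustrative label, and the code must sit at an address where f_jmp+1 = $48):
f_err pla ;restore error from the stack (was lda #$00, costs 2 cycles more)
f_dx1 sbc #$00
f_code bcs f_save ;no borrow -> jump straight into the sax operand byte below
bcs_start dex ;borrow -> step x
f_dx2 adc #$00
pha ;save error (was sta <f_err+1)
lda #$f8
.byte $87 ;opcode of sax zp, its operand is the next byte
f_save .byte $48 ;= f_jmp+1: operand of the sax, and PHA when executed via f_code
;either way execution continues with lda xstart,y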
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
LFT: That's one ossom idea, and I would have wished it performed better, as it is just to my taste! However, it performs slightly (but really only slightly) worse; the case where the jmp gets updated doesn't happen too often, it seems. Such a pity :-( |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
This is, btw, what the loop currently looks like:
f_back
f_yend = * + 1
cpy #$00
beq f_end
dey
;--------RIGHT SLOPE-----------------------------------------
f_err = * + 1
lda #$00
f_dy = * + 1
sbc #$00
f_code bcs +
bcs_start dex
f_dx = * + 1
adc #$00
sta <f_err
lda #$f8
sax <f_jmp+1
bne ++
bcs_end
+
sta <f_err
++
;--------LEFT SLOPE-----------------------------------------
f_err2 = * + 1
lda #$00
f_dy2 = * + 1
sbc #$00
f_code2 bcs +
bcs_start2 dec <f_x2
f_dx2 = * + 1
adc #$00
sta <f_err2
lda <f_x2
arr #$78
sta <f_jmp+2
bne ++
bcs_end2
+
sta <f_err2
++
;--------PATTERN SETUP and ---------------------------------
f_patt lda patt_0,y
f_x2 = * + 1
and maskl
f_jmp jmp ($1000)
The overall gain so far, due to an easier setup and faster loops, is ~5%.