[CSDb] - User Forums - Optimizing span filler

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Optimizing span filler

2015-04-15 12:57

Bitbreaker

Registered: Oct 2002
Posts: 508

Optimizing span filler

To prove a quote recently made in another thread by Skid Row wrong, i am going to ask in public :-)
"Average programmers do ask friends or in forums (like Lemon64, Forum64, CSDB Forum...) and experienced programmers... well,they just don't need to ask! ;)"

I had optimized my span filler some time ago and used it on various occasions with different effects. I meanwhile shrank it down by size, but i would love to get a few things even faster, most of all the inner loop that looks like the following:

                * = $0010
fill
;x = x2
;y = y2

                lda #$f8
                sax <f_jmp+1            ;set initially, as it is not set on every turn later on
f_back                                  ;common entry point where all code segments reenter when done
                dey
f_yend          cpy #$00                ;forces carry to be set \o/
                bcc f_end

f_err           lda #$00                ;restore error
f_dx1           sbc #$00                ;do that bresenhamthingy for xend, code will be setup for either flat or steep slope
f_code          bcs +                   ;inx                    ;in case of flat slopes
bcs_start       dex                     ;bcs * - 3              ;
f_dx2           adc #$00
                sta <f_err+1
                lda #$f8
                sax <f_jmp+1            ;update start of span, depending on bit 7 stuff is rendered to buffer 1 or 2 (offset of $80 in the table)
                bne ++                  ;so buttugly, but need to skip
bcs_end
+
                sta <f_err+1            ;save error
++
                lda xstart,y            ;load previously calced x1
                sta <f_msk+1            ;setup mask without tainting X
                arr #$78                ;-> carry is still set, bit 7 always cleared. This way we generate values from $80 .. $bc, a range to which we adopt the memory layout of the row tables
                sta <f_jmp+2            ;update byte of jump responsible to select all code-segments that start with xstart
f_patt          lda patt_0,y            ;fetch pattern
f_msk           and maskl               ;apply mask for left edge
f_jmp           jmp ($1000)             ;do it! \o/
f_end
                rts

Keep in mind that this loop shall not be unrolled or even duplicated, as the entrypoint f_back is fix and used by all speedcode chunks that are entered through the indirect jump. Doing so would mean multiplying the speedcode chunks or giving them a variable entry point by introducing another indirect jump (which wastes another 2 cycles + setup)
The code at f_dx1 (3 bytes is self modifying, depending on if there's a steep or flat slope generated. Patterns are alternating each line, so the additional lookup (lda patt_0,y is somewhat necessary)

One of the many speedcode chunks could look like:

                sta .addr,y             ;write through, smart poly order avoids clashes
                lda (f_patt+1),y        ;refetch pattern, expensive, but at least less than sta patt, lda patt
                sta .addr + $080,y
                sta .addr + $100,y
                sta .addr + $180,y
                sta .addr + $200,y
                sta .addr + $280,y
                sta .addr + $300,y

                and maskr,x             ;right edge
                ora .addr + $380,y      ;need to ora here
                sta .addr + $380,y
                jmp f_back

So any suggestions on how to save further cycles? As you see, there's a few awkward and painful spots, like creating xstart table in a separate step, refetching the pattern again via lda (zp),y, the pattern lookup, the bne++ and the fact that i cannot use any register or facing a register store and load galore that will slow down things to death.
Or is this already the ultimate "optimum hut ab"?

2015-04-15 13:46

Oswald

Registered: Apr 2002
Posts: 5094

dont think you can get much further with that.

unrolling the slope calcs would lead to a cleaner and probably faster routine.

a) you can keep all slope vars in registers while calcing
b) immediately selfmod results into jump & mask calcs

inx
bcs *- is also expensive, I'd use 16bit adds for such slopes. a log div is good enough for that.

2015-04-15 14:15

Bitbreaker

Registered: Oct 2002
Posts: 508

Unfortunately writing out the second slope as well is more expensive as it inflicts a store and another load (+7 cycles minimum per line). Been there, and even was using the stack therefore. It is true however that flat slopes are not optimal. However they do not mean much beef, as the flat portions usually result in just a few lines to be rendered. The latter is the real expensive part.

Edit: also when separating into flat and steep slopes you would need two inner loops, what would again either slow down the returning jump or double the speedcode size, which has quite some size.

2015-04-15 18:31

HCL

Registered: Feb 2003
Posts: 728

Interesting.. what did he write? something like.. "span fillers are just slow and ugly.." i guess :D.

2015-04-15 20:46

bepp

Registered: Jun 2010
Posts: 265

Lol I read "spam filter" and didn't get a thing. Even now I don't =)

2015-04-16 10:14

Bitbreaker

Registered: Oct 2002
Posts: 508

HCL: there was this amiga ball compo somewhen, remember ;-P

2015-04-16 10:28

Oswald

Registered: Apr 2002
Posts: 5094

too lazy to count all the cycles, but I think there might be a little chance of unrolling both slopes and span selector might be better. then you may use jsr abs, since lo/hi jmp address can be tabled, hence edge can be kept in a register.

2015-04-20 13:01

Bitbreaker

Registered: Oct 2002
Posts: 508

I replaced the right slope precalculation by an on the fly calculation now, just as i do it with the left side, and it turns out to be even faster (though having penalties due to not being able to use x-register). So much about that :-D

2015-04-21 16:59

ChristopherJam

Registered: Aug 2004
Posts: 1409

Failing to improve on this one, but it's inspiring some thoughts for an alternate approach...

2015-04-21 19:36

lft

Registered: Jul 2007
Posts: 369

This may or may not result in a net gain. Perhaps you could try it out with some actual data:

First, at the cost of two cycles, change

f_err         lda #$00

into

pla

Then, at no cycle cost, change each of the two

              sta <f_err+1

into

pha

Next, relocate the code so that f_jmp+1 = $48, which is the opcode PHA.

Now you can remove bne ++, and point bcs_end/+ to the operand of the SAX.

The code path for flat slopes is now one cycle faster, but the code path for steep slopes is either two cycles slower or one cycle faster depending on whether the branch is taken. Hence the need to measure on real data.

2015-04-22 06:44

Bitbreaker

Registered: Oct 2002
Posts: 508

LFT: That's one ossom idea, and i would have wished it performs better as it is just my taste! However it performs slightly (but really slightly) worse, the case when the jmp is updated doesn't happen too often as it seems, such a pity :-(

2015-04-22 06:48

Bitbreaker

Registered: Oct 2002
Posts: 508

That is btw. what the loop currently looks like:

f_back
f_yend = * + 1  
                cpy #$00
                beq f_end
                dey
;--------RIGHT SLOPE-----------------------------------------
f_err = * + 1   
                lda #$00
f_dy = * + 1
                sbc #$00
f_code          bcs +
bcs_start       dex
f_dx = * + 1
                adc #$00
                sta <f_err
                lda #$f8
                sax <f_jmp+1
                bne ++
bcs_end
+
                sta <f_err
++
;--------LEFT SLOPE-----------------------------------------
f_err2 = * + 1  
                lda #$00
f_dy2 = * + 1   
                sbc #$00
f_code2         bcs +
bcs_start2      dec <f_x2

f_dx2 = * + 1   
                adc #$00
                sta <f_err2
                lda <f_x2
                arr #$78
                sta <f_jmp+2
                bne ++
bcs_end2
+
                sta <f_err2
++
;--------PATTERN SETUP and ---------------------------------
f_patt          lda patt_0,y
f_x2 = * + 1
                and maskl
f_jmp           jmp ($1000)

The overall gain so far due to a easier setup and faster loops is ~5%

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

algorithm
csabanw
crayon/Hokuto Force
Guests online: 116

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 No Listen  (9.6)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top Musicians

1 Rob Hubbard  (9.7)
2 Mutetus  (9.7)
3 Jeroen Tel  (9.7)
4 Linus  (9.6)
5 Stinsen  (9.6)

Page generated in: 0.062 sec.