| |
Trap
Registered: Jul 2010 Posts: 223 |
How to make efficient double-sine calculations
Hi,
I am trying to improve a little on my effect animation skills. To that purpose I'd like to hear how you guys solve the issue of double-sine table calculations. As I am by no means a math-guru - not even close, try to keep it at a practical level :)
Surely there has to be some clever way around this. Usually I'd do something like the following mock-up code:
lda Counter1 // Copy counters to indexes
sta Index1
lda Counter2
sta Index2
ldx #TableSize
!CalcAnim: ldy Index1
lda SineWave1,y // Get first value
iny // Index1 Delta + 1
sty Index1
ldy Index2
clc
adc SineWave2,y // Add second value
iny // Index2 Delta + 1
sty Index2
tay
lda Lookuptable,y // Find the value and store it
sta Destinationtable,x
dex
bne !CalcAnim-
lda Counter1
clc
adc #1 // Velocity 1
sta Counter1
lda Counter2
clc
adc #1 // Velocity 2
sta Counter2
Apart from unrolling the loop, I am short of good ideas on how to make this efficient. Use of ZP for the indexes saves a few cycles as well.
How do you guys approach this in your demos? |
|
| |
Mixer
Registered: Apr 2008 Posts: 452 |
Consider whether some of the maths give constant results and precalculate those. For instance if the velocities are the same all the time, then (sin(a)+sin(b)) could perhaps be precalculated to a single lookup.
Align the sine tables to a page boundary so the low byte of the address can serve as the index, and run the code from zero page.
If the add or subtract is always 1, then inc/dec may be better.
Sometimes the second lookup can be encoded directly in the sine data. It depends on what is desired. |
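A rough model of the precalculation idea above, written in Python rather than 6502 since it only concerns the arithmetic. When both indices advance in lockstep, the phase difference between them is constant, so the two lookups and the add can collapse into a single table. The table contents, `PHASE` and the function names are made up for illustration.

```python
import math

# Hypothetical 256-entry sine tables, scaled so two entries never sum past 255.
SINE1 = [round(64 + 63 * math.sin(i * math.pi / 128)) for i in range(256)]
SINE2 = [round(64 + 63 * math.sin(i * math.pi / 64)) for i in range(256)]

PHASE = 40  # assumed constant difference Counter2 - Counter1

# Precalculated once: COMBINED[a] = SINE1[a] + SINE2[a + PHASE]
COMBINED = [SINE1[a] + SINE2[(a + PHASE) & 255] for a in range(256)]

def two_lookups(counter1, i):
    """What the original loop computes for step i of a frame."""
    counter2 = (counter1 + PHASE) & 255
    return SINE1[(counter1 + i) & 255] + SINE2[(counter2 + i) & 255]

def one_lookup(counter1, i):
    """The same value fetched from the single precalculated table."""
    return COMBINED[(counter1 + i) & 255]
```

This only holds while the two velocities and deltas stay equal. As soon as the phase difference changes, the combined table would have to be rebuilt.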
| |
Glasnost Account closed
Registered: Aug 2011 Posts: 26 |
If it is very time critical, I would use speedcode, with X and Y for the two counters. The following code requires the tables to be duplicated to fill e.g. 2x256 bytes:
( i times)
lda sin1+i,x
*clc
adc sin2+i,y
sta destination+i
*clc is optional in some cases: you know whether your sines can mess up the carry or not.
If you want it looped you can init zp1 to sin1+counter1 and zp2 to sin2+counter2. This example only works for i up to 128, because bpl only keeps looping while Y is in the range 0..127.
ldy #(i-1)
!loop:
lda (zp1),y
*clc
adc (zp2),y
sta destination,y
dey
bpl !loop-
Lastly, a bit about the sine addition. Note that if you want better precision you can use:
lda sin1,x
adc sin2,y
ror |
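A small Python model of why the ROR helps, assuming binary mode and a clear carry going into the ADC. The ninth bit of the sum lands in the carry and ROR rotates it back in as bit 7, so the result is the full 9-bit sum divided by two. The function names are only for illustration.

```python
def adc(a, operand, carry):
    """6502 ADC in binary mode: returns (8-bit result, carry out)."""
    total = a + operand + carry
    return total & 0xFF, total >> 8

def ror(a, carry):
    """6502 ROR A: carry goes into bit 7, bit 0 goes out to carry."""
    return (carry << 7) | (a >> 1), a & 1

def average(s1, s2):
    """Models: lda sin1,x / clc / adc sin2,y / ror"""
    a, c = adc(s1, s2, 0)
    return ror(a, c)
```

For example, `average(200, 100)` gives 150 even though 200+100 does not fit in eight bits. The carry left behind is the discarded low bit of the sum, so it cannot be relied on to be clear for the next ADC.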
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
PROTIP: Code looks better in a [code] block. |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting Glasnost:
lda sin1,x
adc sin2,y
ror
Remember clc after ror. Alternatively, if the sum of the two sines is always < 256 you can use:
lda sin1,x
adc sin2,y
alr #$fe // throw away least significant bit and then lsr
// (always results in cleared carry) |
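For anyone who has not met ALR before, here is a Python sketch of what it does to the accumulator, assuming the usual description of the undocumented opcode (AND with the immediate, then LSR). The helper names are made up.

```python
def alr(a, imm):
    """Undocumented ALR #imm: A = (A AND imm) >> 1, carry = the bit shifted out."""
    masked = a & imm
    return masked >> 1, masked & 1

def half_sum(s1, s2):
    """Models: lda sin1,x / adc sin2,y / alr #$fe with carry clear on entry."""
    total = s1 + s2
    # ALR never sees the carry from ADC, so the sum has to fit in 8 bits.
    assert total < 256
    return alr(total, 0xFE)
```

Because the AND clears bit 0 before the shift, the bit that falls into the carry is always 0, which is what makes the following CLC unnecessary.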
| |
Digger
Registered: Mar 2005 Posts: 437 |
Great tip with bit shifting to smooth the sine, never thought about that. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
So, if all your tables are page aligned, and if you also follow the sine table with a second copy of itself, the following should have the same result:
lda Counter1 // Copy counters to indexes
sta rna0+1
lda Counter2
sta rna1+1
ldx #TableSize
ldy #0
!CalcAnim:
clc
rna0:
lda SineWave1,y // Get first value (low byte patched with Counter1)
rna1:
adc SineWave2,y // Add second value (low byte patched with Counter2)
sta rna2+1
rna2:
lda Lookuptable // Find the value and store it
sta Destinationtable,x
iny
dex
bne !CalcAnim-
lda Counter1
clc
adc #1 // Velocity 1
sta Counter1
lda Counter2
clc
adc #1 // Velocity 2
sta Counter2
But the above loop only needs separate indices for source and destination because Y is increasing and X is decreasing.
Also, as others have pointed out, you don't need the CLC if you know the results will never overflow (e.g. because your sine tables contain 64+63*sin(x*pi/128) )
So, you should be able to get the same effects from an inner loop like this
!loop:
lda sin+counter1,y
adc sin+counter2,y
tax
lda lut,x
sta dst,y
dey
bne !loop-
You can also halve the number of DEY/BNEs by a partial unroll, dividing the output into first/second half:
ldy #TableSize/2
!loop:
lda sin+counter1,y
adc sin+counter2,y
tax
lda lut,x
sta dst,y
lda sin+counter1+TableSize/2,y
adc sin+counter2+TableSize/2,y
tax
lda lut,x
sta dst+TableSize/2,y
dey
bne !loop-
The ALR mentioned above is good to know about (it's news to me!), but if all you want is a straight divide by two of the eight bit result, you can fold that into the table lookup and save the cycles.
(Alternately, if you put a ROR after the ADC you can use 128+127*sin(x*pi/128) - the carry will contain noise, but even without a CLC before the ADC the result will still be more accurate than using the smaller scale factor on your tables.) |
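A Python sketch of the simple inner loop above, to check the two assumptions it leans on: with 64+63*sin(x*pi/128) tables the ADC never sets the carry, and with a doubled table the index never has to wrap. `TABLE_SIZE` and the contents of `LUT` are placeholders, not anything from the thread.

```python
import math

TABLE_SIZE = 128  # hypothetical number of output values per frame

# One period of 64+63*sin(x*pi/128), followed by a second copy of itself,
# so that counter + y (at most 255 + TABLE_SIZE) stays inside the table.
ONE_PERIOD = [round(64 + 63 * math.sin(x * math.pi / 128)) for x in range(256)]
SIN = ONE_PERIOD + ONE_PERIOD

LUT = [v ^ 0xFF for v in range(256)]  # stand-in lookup table, contents arbitrary

def frame(counter1, counter2):
    """Models: lda sin+counter1,y / adc sin+counter2,y / tax / lda lut,x / sta dst,y"""
    dst = [0] * (TABLE_SIZE + 1)
    carry = 0  # cleared once before the loop, never again
    for y in range(TABLE_SIZE, 0, -1):  # dey / bne visits TABLE_SIZE down to 1
        total = SIN[counter1 + y] + SIN[counter2 + y] + carry
        carry = total >> 8
        dst[y] = LUT[total & 0xFF]
    return dst, carry
```

The table entries range from 1 to 127, so the largest possible sum is 254 and the carry stays clear for the whole frame. Note that the dey/bne form fills `dst[1]` to `dst[TABLE_SIZE]` and leaves index 0 untouched.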
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
depending on what your lookup values are you may skip the entire lookup by already putting the lookup values into sin1/sin2 or fabricating them so that after adc you get the right values :P :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
Or use character mode, and integrate the lookup table into the font. |
| |
Trap
Registered: Jul 2010 Posts: 223 |
So much awesome info. Thanks guys.
I think some of you are thinking of a traditional movement system, but that is not what I was after. Pre-calculating the movements for a table would produce an awful lot of data: for every line, I'd have at least 128 bytes (for a semi-smooth experience). I am not looking for a follow-path, but creating dynamic waves.
With your input I've managed a 33% speed improvement and there are still some cycles left that I can shave off.
The ALR/ROR tips were awesome <3
Lft, that's an interesting thought - could you elaborate? |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
the examples are not about precalculating movements into a table. the most extreme one here folds your loop count and velocity values into the table addresses of an unrolled loop, and that unrolling (speedcode generation) can also happen in real time when the effect movement changes.
unrolling your loop with your labels would give you:
ldx Counter1
ldy Counter2
.for (var i = 0; i < TableSize; i++) { // unrolled at assembly time
lda sin1 + i*Velocity1,x
clc
adc sin2 + i*Velocity2,y
sta Destinationtable + i
}
(out of registers for lookuptable)
speed increase would be manyfold instead of a lowly 33% |
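One way to get at the speedcode without typing it by hand is to generate the source with a script. The Python sketch below prints the unrolled listing for a given table size and pair of velocities. The label names follow the post above, while the function itself is just an illustration and not anyone's actual tool.

```python
def unrolled_speedcode(table_size, velocity1, velocity2):
    """Return assembly source lines for the unrolled loop."""
    lines = ["ldx Counter1", "ldy Counter2"]
    for i in range(table_size):
        lines += [
            f"lda sin1+{i * velocity1},x",   # offset fixed at generation time
            "clc",
            f"adc sin2+{i * velocity2},y",
            f"sta Destinationtable+{i}",
        ]
    return lines

if __name__ == "__main__":
    print("\n".join(unrolled_speedcode(4, 1, 2)))
```

Since X and Y can each be as large as 255, the sine tables would have to extend (or repeat) at least 255 bytes past the largest offset used.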
| |
Dano
Registered: Jul 2004 Posts: 234 |
Me doing it mostly the Oswald-way these days. With ROR if needed.. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting Trap: Lft, that's an interesting thought - could you elaborate?
Well, it depends on what you need the values for, of course. The simplest case would be an 8x8 plasma, where you just use sin(x)+sin(y) as the character value, and put a gradient pattern in the chars. With ECM you can have many colours in the gradient. To animate, you just modify the scale and origin of the coordinate system (x and y independently), and recompute.
Now suppose you want to render f(sin(x) + sin(y)) all over the screen (1x1 plasma), and let's say you limit yourself to sixteen different values of sin(y) and sixteen different values of sin(x). Then you can precalc full lines (wider than the screen) of graphics for each possible sin(y) inside each half of each font bank. Based on your desired scale and offset for the x axis, you compute char references into the first row of each of two video matrices -- one that picks chars from the first half of the font, and one that picks from the second half. Then you fetch that into the VIC chip with a badline, and repeat the same line over the entire screen, while switching banks according to sin(y) on each line. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
lft, that sounds excitingly cool, but your description is impossible to follow for me. 1x1 pixels? how ? :D
one way of 8x8 (char reso) plasma I can imagine is having charsets like:
cset0 char0: picks gradient 0+0
cset1 char0: picks gradient 0+1
cset2 char0: picks gradient 0+2
cset0 char1: picks gradient 1+0
cset1 char1: picks gradient 1+1
cset2 char1: picks gradient 1+2
that way by switching charsets one can do the +sin(y) with the VIC-II. The beauty of this is that it is enough to render only one line, compared to the usual "raster plasma" (my terminology) routines, which need to precompute all lines for the possible sin(y) values per frame. the downside is only 8 different charsets, maybe that's why you talk about half fonts and 2 screens, to have 16 virtual charsets. |
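A toy Python model of the charset layout described above, under the assumption that char j in charset k is drawn with gradient step j+k. The wave tables are arbitrary stand-ins for sin(x) and sin(y), and all names are hypothetical.

```python
COLUMNS = 40   # chars per screen row
ROWS = 25      # char rows
CHARSETS = 8   # charsets assumed available for switching

wave_x = [(c * 3) % 16 for c in range(COLUMNS)]     # stands in for sin(x)
wave_y = [(r * 5) % CHARSETS for r in range(ROWS)]  # stands in for sin(y)

def gradient_shown(charset, char):
    """Char `char` in charset `charset` is drawn with gradient step char + charset."""
    return char + charset

# Only one row of char codes is rendered per frame...
row = list(wave_x)
# ...and every screen line just selects a charset for that same row.
screen = [[gradient_shown(wave_y[r], ch) for ch in row] for r in range(ROWS)]
```

The CPU only writes the 40 char codes in `row`. The per-line add is done by whichever charset is selected on that line.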
| |
lft
Registered: Jul 2007 Posts: 369 |
Sorry for my confusing description. It was a bit hurried. The last part is quite wrong too, because if we use two video matrices in that way, we have to trigger badlines on every line.
Anyway, here is an example of what I mean. A single VIC-bank is used. In the Y direction, it's normal FPP with 16 choices. In the X direction, by rewriting the video matrices, we are looking at a 40-chars-wide subsection of a complete cyclic pattern of 123 chars (leaving 5 chars = 40 bytes for the first video matrix row). The animation (x-wobble and colour cycle) is built into the pattern. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
thanks, that makes it clear :) |