 
Repose
Registered: Oct 2010 Posts: 129 
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, which apply when the multiply routine's time depends on the bits of the multiplier. Usually, with a standard shift and add routine, each 1 bit in the multiplier costs a "bit" more time.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, half the bits are 1 on average. In my case, that becomes the worst case, and the average is about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by pre- and post-processing the operands and results. The improvement is about 20%.
Another speedup is optimizing for the same multiplier applied to multiple multiplicands. This saves a little by processing the multiplier bits only once, and can save another 15%.
Using the square table method will be faster, but it uses a lot of data and a lot of code.
Would anyone be interested in this?
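The post doesn't say which transformations are used, so here is a rough Python sketch of just one possibility (complementing the multiplier when more than half its bits are set). The function names and the 8-bit width are assumptions for illustration, not the author's routine:

```python
def shift_add(x, m):
    """Plain shift-and-add; returns (product, number of adds performed)."""
    product, adds = 0, 0
    while m:
        if m & 1:          # every 1 bit costs one extra add
            product += x
            adds += 1
        x <<= 1
        m >>= 1
    return product, adds


def mul_fewer_ones(x, m, bits=8):
    """Complement the multiplier when more than half of its bits are 1."""
    mask = (1 << bits) - 1
    if bin(m).count("1") * 2 <= bits:
        return shift_add(x, m)
    # m = mask - (~m & mask), so x*m = x*mask - x*(~m & mask)
    partial, adds = shift_add(x, ~m & mask)
    return (x << bits) - x - partial, adds
```

With this one transformation the worst case is half the bits set, which matches the worst case quoted above; getting the average down to a quarter would need more than this.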


 
Repose
Registered: Oct 2010 Posts: 129 
After some work I decided to go with a square table based large multiply. 32bitx32bit=64bit in 850 cycles or so at best. I discovered a trick I wish I knew a long time ago; to add long lists of numbers you can do this
ldy #0
clc
lda a0
adc b0
bcc s1
iny
clc
s1 adc c0
sta d0
bcc s2
iny
s2 sty d1
It's basically 6 cycles per row and 7 cycles per column. When adding the 16 (two byte) partials from the 32bit multiply, it averages to 7.5 cycles per add. This is really optimal as the most basic add is 6 cycles (lda a0:adc b0).
I also found out that there's carries slightly less than half the time on average.
I'm working on a routine to do 8x8=16, 8x16=24, 16x16=16 with rounding, 32x32=32 with rounding, 32x32=64, and 32x32 floating point in one routine with different entry points, where (almost) each entry is optimal even for a standalone routine, and this is all in one routine. It's just a matter of arranging the order of operations carefully to progressively create the result.
I think you can get 1000flops on the c64.
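For anyone who wants to check the carry-counting idea off the machine, here is a small Python model of it (a sketch only; the function name is made up). A plays the 8-bit accumulator and Y counts the carries, so Y ends up as the high byte of the sum:

```python
def add_bytes_counting_carries(values):
    """Sum bytes in an 8-bit accumulator, counting carries like the INY trick."""
    a = values[0]            # lda a0
    y = 0                    # ldy #0
    for v in values[1:]:
        total = a + v        # adc with carry clear
        a = total & 0xFF
        if total > 0xFF:     # bcc skips the next line when there is no carry
            y += 1           # iny (followed by clc on the 6502)
    return a, y              # low byte, high byte
```

For example, adding $ff+$ff+$ff leaves A=$fd and Y=2, i.e. $02fd.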

 
ChristopherJam
Registered: Aug 2004 Posts: 620 
1000 flops is well within reach - it just depends how much precision you want, and what you define as a flop :)
Single precision IEEE754 only has 24 bits for the mantissa, and you don't need a 48 bit result unless you want to get the rounding exact for multiply-accumulates.
Have you considered expressing the multiplies as a difference of squares? Should mean instead of needing nine 8x8->16 multiplies for a 24x24 bit you have six 8x8s and a few table lookups. 
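A minimal Python check of the difference of squares identity with a truncating divide by 4 (names are illustrative). The truncation is harmless because a+b and a-b always have the same parity, so both squares lose the same fraction:

```python
def f(n):
    """Quarter square with truncation, as it would be stored in a table."""
    return n * n // 4


def mul8(a, b):
    """a*b as a difference of two quarter squares."""
    return f(a + b) - f(abs(a - b))
```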
 
Repose
Registered: Oct 2010 Posts: 129 
part 1
z1:z0=(x0+y0)^2/4-(x0-y0)^2/4 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;48 cycles
z1:z0=x0*y1/2 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y1
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;56 cycles
z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles
c1:c0=a0^2/4-b0^2/4;48 cycles
c3:c2=a1^2/4-b1^2/4;48 cycles
t1:t0=a0*a1/2;56 cycles
t3:t2=b0*b1/2;56 cycles
t1:t0-=t3:t2;20 cycles
c2:c1+=t1:t0;20 cycles
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1 lda #0;8/11
adc c3:sta c3;6
which is 262+ cycles for your method, 16bit.

 
Repose
Registered: Oct 2010 Posts: 129 
You're proposing,
a*b=((a+b)^2 - (a-b)^2)/4,
f(x)=x^2/4,
a*b=f(a+b) - f(a-b)
Where a,b are 24 bit values.
So we need a 24bit square function.
You can get that by,
(x2 * 2^16 + x1 * 2^8 + x0)^2=
x2^2 * 2^32 + x1^2 * 2^16 + x0^2 + y,
g(x)=x^2/4,
=g(x2) * 2^32 + g(x1) * 2^16 + g(x0)
which are three table lookups and,
y= 2*x1*x0/4 * 2^8 + 2*x2*x0/4 * 2^16 + 2*x2*x1/4 * 2^24
which are three 8bit multiplies.
In general for n byte numbers you get
2n squaring lookups,
2 * nC2 mults where nCr=n!/((n-r)!r!),
2n shifts
O(n) = 2n*48 + 2*nC2*(56+20) + 14, n>1
2C2=1
3C2=3
4C2=6
n cycles
2 262
3 530
4 1310
So without optimization, it doesn't look good...

 
Skate
Registered: Jul 2003 Posts: 464 
Quote:
So without optimization, it doesn't look good...
when it comes to math, what looks good on c64 without optimization?
creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as it is.
imho, best method is to create the library with possible optimizations which doesn't effect the accuracy much. at the last stage, replace all possible calculations with reasonable lookup tables and where lookup table is not an option, inline the math functions and optimize them as much as possible, change the register usages for saving a few more cycles, sacrifice some accuracy if it doesn't effect the result badly etc. 
 
Repose
Registered: Oct 2010 Posts: 129 
You're thinking in terms of demos. I was thinking of applying this as a compiler library. I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach, when I say it doesn't look good, I meant his approach.
Update: slight problem in how I count things
O(n) = n*48 + nC2*(56+20) + 14, in fact 1 works and 1C2 is 0, also the 14 term is related to this
n cycles
1 48
2 262
3 386
4 662
This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass... 
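A quick way to tabulate the updated count, as a sketch in Python: it evaluates the formula as written, assuming the 14 cycle term only applies for n>1 (that reading is mine, and the function name is made up).

```python
from math import comb


def square_cost(n):
    """Estimated cycles for an n-byte square/4: n lookups plus nC2 multiplies."""
    cycles = n * 48 + comb(n, 2) * (56 + 20)
    if n > 1:
        cycles += 14   # assumed to apply only when there are cross terms
    return cycles
```

This reproduces the 48, 386 and 662 rows. For n=2 this reading gives 186, so the 262 in the table may have been carried over from the earlier count.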
 
Bitbreaker
Registered: Oct 2002 Posts: 425 
Quoting Repose: part 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1
lda #0;8/11
adc c3
sta c3
this will most likely crash, as it only skips one byte and then runs into the $00, which is a brk. Also, this is not an optimizing trick, neither speedwise nor sizewise:
Better be done like:
anc #$80 ;copy bit 7 to carry and A = A & $80
bcc + ;A is zero on plus, all fine
sbc #$81 ;subtract $81 from $80 = $ff
+
;carry always clear as a free gift
Also, lots of other possibilities to optimize that code, and also some mistakes within the code. 
 
Repose
Registered: Oct 2010 Posts: 129 
I worked on this again. There were a lot of bugs in my post 'part 1'.
Basically, this message is working out how long it takes to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.
Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.

z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg
z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg
z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles
The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them
;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z2-=t3:t2;20 cycles
;190*2+40=420 cycles+
There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.
This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed). 
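To see that the decomposition itself is sound (even if it is slow on the 6502), here is a Python sketch of the same structure: two byte-wide squares and one halved cross multiply for each of f(x+y) and f(x-y). The function names are made up, and the byte sums are kept as 9-bit values as in the post:

```python
def f(n):
    """Truncated quarter square."""
    return n * n // 4


def mul16_via_squares(x, y):
    """16x16 multiply built from byte-wide quarter squares and cross terms."""
    x1, x0 = x >> 8, x & 0xFF
    y1, y0 = y >> 8, y & 0xFF
    s1, s0 = x1 + y1, x0 + y0          # byte sums, up to 9 bits each
    d1, d0 = x1 - y1, x0 - y0          # byte differences, may be negative
    hi = f(s1) - f(d1)                 # equals x1*y1
    lo = f(s0) - f(d0)                 # equals x0*y0
    mid = (s1 * s0) // 2 - (d1 * d0) // 2   # equals x1*y0 + x0*y1
    return (hi << 16) + (mid << 8) + lo
```

The three pieces come out as exactly the partial products of the usual four-multiply method, which is why this route costs more work for the same result.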
 
Repose
Registered: Oct 2010 Posts: 129 
I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.
The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.
I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.
It's only a bit better than the one posted already in codebase.
I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Have you checked my Seriously fast multiplication?
Edit: Yes you have.. sorry. 
 
Repose
Registered: Oct 2010 Posts: 129 
Yes, that's the one I beat already. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Yes, that's the one I beat already.
Interesting approach! I'll try it myself someday! 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Stop distracting me, I've got other stuff I'm meant to be working on! 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Stop distracting me, I've got other stuff I'm meant to be working on!
Circle closed. You were the first to teach me multiplications using squares. I remember I found you on some forum ages ago. :D 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairly sure the truncating divide by 4 was one of my contributions?) 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairly sure the truncating divide by 4 was one of my contributions?)
Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the high byte with X and then use ,Y in the table lookup. That way the X+Y comes for "free" 
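A small Python model of that implicit add, using the zero-page pointer variant rather than self-modifying code. The table addresses are assumptions for illustration. One operand goes into the low byte of a page-aligned pointer, the other is the index, so the effective address already points at the entry for x+y:

```python
SQRLO, SQRHI = 0xC000, 0xC200      # page-aligned table bases (assumed)

memory = bytearray(0x10000)
for i in range(511):               # x+y can reach 510
    memory[SQRLO + i] = (i * i // 4) & 0xFF
    memory[SQRHI + i] = (i * i // 4) >> 8


def lookup_sum_square(x, y):
    """Model of lda (ptr),y with x stored in the pointer's low byte."""
    ptr_lo = SQRLO | x             # sta ptr   (high byte already set up)
    ptr_hi = SQRHI | x
    return memory[ptr_lo + y] | (memory[ptr_hi + y] << 8)
```

No explicit add of x and y is ever executed; the address calculation does it.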
 
Repose
Registered: Oct 2010 Posts: 129 
That very idea I always thought was brilliant, because it made a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than could ever be possible. I call it a 'super' algorithm.
And yes, Judd told me that at the time we were working on the C=Hacking article together.
Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ? 
 
Frantic
Registered: Mar 2003 Posts: 1304 
Quote: Yes, that's the one I beat already.
Don't hesitate to write an article, or at least post some source code, at Codebase64. :) Would be much appreciated! 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Repose, did you get around to completing a 32x32->64 routine? I'm trying out some of the ideas from this and the sets of add/sub threads; currently down to around 1300 cycles, though I've still got some optimising to go.
I'm using inx to track carries, but have yet to remove the CLCs after each branch. 
 
Repose
Registered: Oct 2010 Posts: 129 
Yes I wrote a 16x16 to test, and it takes 211 cycles (that's avg with equal branches). The one posted on codebase is 231 cycles. I am just testing it now but having some trouble finding a single step debugger that works for me, see thread.
The time to beat for 32x32 is at least 1300 as posted on codebase so you have to keep going...
I can derive the formula for my current version to estimate exact timing of 32x32 I'll have to get back to you in a bit. 
 
Repose
Registered: Oct 2010 Posts: 129 
Here's some untested code:
x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99
sqrlo=$c000;510 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600
umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17
lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17
ldx #0;start column 0
ldy y0
SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1
;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1
;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2
;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3
rts
makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1
;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
makesqrtables_loop1:
tya
adc #$00
makesqrtables_sm1:
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2
rts
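For checking results against the untested listing, a Python reference model of what the tables and umult16 are meant to compute (a sketch; it models the arithmetic only, not the carry handling or cycle counts):

```python
# Tables as described in the listing's comments, 511 entries each
sqr = [i * i // 4 for i in range(511)]               # sqr(i) = i^2/4
negsqr = [(255 - i) ** 2 // 4 for i in range(511)]   # negsqr(i) = (255-i)^2/4


def mul8(x, y):
    """8x8 multiply: pointer low byte is x (or x eor $ff), index is y."""
    return sqr[x + y] - negsqr[(x ^ 0xFF) + y]


def umult16(x, y):
    """Unsigned 16x16 -> 32 from the four byte partials."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    return (mul8(x0, y0)
            + ((mul8(x1, y0) + mul8(x0, y1)) << 8)
            + (mul8(x1, y1) << 16))
```

Indexing negsqr at (x eor $ff)+y = 255-x+y lands on the entry for (x-y)^2/4, which is why the same Y register can be shared between both tables.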

 
Repose
Registered: Oct 2010 Posts: 129 
Post correction would be slower on the 16x16, not a long enough run of add/sub. 
 
Repose
Registered: Oct 2010 Posts: 129 
Partials cheatsheet
y1 y0
x1 x0

x0*y0h x0*y0l
x1*y0h x1*y0l
x0*y1h x0*y1l
x1*y1h x1*y1l

24x24bits
x2 x1 x0
y2 y1 y0

y0x0h y0x0l
y0x1h y0x1l
y0x2h y0x2l
y1x0h y1x0l
y1x1h y1x1l
y1x2h y1x2l
y2x0h y2x0l
y2x1h y2x1l
y2x2h y2x2l
These facts are useful to estimating the time of any size calc:
Number of columns is 2*n, n is bytes of each number.
Rows of additions is like 1 3 3 1, 1 3 5 5 3 1, 1 3 5 7 7 5 3 1 for 16,24 and 32 bit mults, and each one being f(x)-g(x), so really double that number of addsubs.
The total addsubs is n^2*2. (each is about 10 cycles).
Number of times to change (or pointers to set) of the multiplier is n, then each one is used >n times with the multiplicand (ldy multiplicand), when doing in column order (tbd total). 
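The row counts can be generated for any size with a few lines of Python (a sketch; the function name is made up). Each partial x_i*y_j puts its low byte in column i+j and its high byte in column i+j+1:

```python
def addends_per_column(n):
    """Number of partial-product bytes landing in each of the 2*n columns."""
    counts = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            counts[i + j] += 1       # low byte of x_i*y_j
            counts[i + j + 1] += 1   # high byte of x_i*y_j
    return counts
```

addends_per_column(2) gives [1, 3, 3, 1], and the counts always sum to 2*n^2 before doubling for the f and g halves.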
 
Repose
Registered: Oct 2010 Posts: 129 
Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
I've replaced all the subtractions with additions by using g(x)=$4000-(x*x/4) and offsetting my start number, and rolled the carry correction into the addition sequence.
Removing the CLCs is probably not worthwhile for 32x32, as there are only 33 of them, so I'd have to spend considerably less than 4 cycles per output byte on the post correction (unlikely).
I'm down to around 800 cycles for a 32x32 now, 776 cycles for zero times zero. 
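A quick Python sanity check of the add-only idea, using the g() definition from the table generator posted in this thread (the wrapper function is made up for illustration). Each 8x8 product comes out $4000 too high, which is what the offset start number has to cancel:

```python
def f(i):
    return i * i // 4


def g(i):
    # second table: $4000 minus the quarter square of (i - 255)
    return (0x4000 - f(i - 255)) & 0xFFFF


def mul8_add_only(x, y):
    """x*y from two table reads that are added; the $4000 bias is removed after."""
    biased = f(x + y) + g((x ^ 0xFF) + y)
    return biased - 0x4000
```

The largest quarter square in the second table is $3f80, so g() never goes negative.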
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
fo=open("tables.inc","w")
lo=lambda x:x&255
hi=lambda x:(x>>8)
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
dumpArrayToA65(fo, "id", [lo( i ) for i in range(512)])
fo=open("mc.inc","w")
mAcc=0
for i in range(4):
    for j in range(4):
        mAcc-=0x40<<(8*(1+i+j))
initialValue = [((mAcc>>s)&0xff) for s in range(0,64,8)]
def addB(yv,zp,tb):
    global lasty
    if yv!=lasty:
        print(""" ldy mT2+{yv}""".format(yv=yv), file=fo)
        lasty=yv
    print(""" adc ({zp}),y""".format(zp=zp), file=fo)
    if tb<7:
        print(""" bcc *+4:inx:clc""", file=fo)
    else:
        print(""" bcc *+3:clc""", file=fo)
lasty=None
for tb in range(8):
    print(""" ; tb={tb} """.format(tb=tb),file=fo)
    if tb==0:
        print(""" clc """,file=fo)
        print(""" ldx#0 """,file=fo)
        print(""" lda #${iv:02x} """.format(iv=initialValue[tb]),file=fo)
    else:
        print(""" txa""", file=fo)
        if tb<7:
            print(""" ldx#0 """,file=fo)
        print(""" adc#${iv:02x}""".format(iv=initialValue[tb]), file=fo)
        if initialValue[tb]>0xef:
            print(""" bcc *+4:inx:clc""", file=fo)
    for j in range(4):
        i=tb-j
        if i in [0,1,2,3]:
            addB(i, "zp_fl{j}".format(j=j), tb)
            addB(i, "zp_gl{j}".format(j=j), tb)
        i=tb-j-1
        if i in [0,1,2,3]:
            addB(i, "zp_fh{j}".format(j=j), tb)
            addB(i, "zp_gh{j}".format(j=j), tb)
    print(""" sta mRes+{tb}""".format(tb=tb), file=fo)
fo.close()

 
ChristopherJam
Registered: Aug 2004 Posts: 620 
(obviously also need four sets of
lda mT1+ 0
sta zp_fl0
sta zp_fh0
eor#255
sta zp_gl0
sta zp_gh0
every time the multiplier changes (included in my cycle times above),
plus also some init code to set the high bytes of the zero page pointers before a set of multiplications is performed (not included in my timings)) 
 
Repose
Registered: Oct 2010 Posts: 129 
Good job, that's right in the range of what I thought was possible.
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
by offset $4000, doesn't that reduce the domain?
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :) 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Quoting Repose: Good job, that's right in the range of what I thought was possible.
Thanks!
Quote: I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
Doing :)
Quote: by offset $4000, doesn't that reduce the domain?
Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80
Quote:Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)
Quote:Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Oh, faster correction:
sec
sbc id,x
sta z2
Still not gonna do it, mind ;) 
 
Repose
Registered: Oct 2010 Posts: 129 
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Let's say in the middle columns where there's 14 adds per column, that's 28 cycles half the time saved from not using CLC, or 14 cycles on average, vs 8 cycles for correction, it still saves 6.
I actually found the stats for the carries, most of them are about half, but adding a higher proportion of high bytes gives less carries. 
 
Repose
Registered: Oct 2010 Posts: 129 
Good catch on the id,x, I was thinking of that a few days ago but it didn't click in for this situation yet :)
And yes, I worked hard at mixing add/sub properly, it still doesn't really make sense but it works. I thought it wouldn't if you DEX to $FF but it still works. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Quoting Repose: About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts.. 
 
Repose
Registered: Oct 2010 Posts: 129 
Yes it hurts :) I posted the explanation in the add/sub thread if you can follow it.
Try 0 - ff - ff in your head. Have fun! :) 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Thanks!
Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.
Down to 699 cycles for 0*0, btw ;) 
 
Repose
Registered: Oct 2010 Posts: 129 
I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.
Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes." 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway  instant 2x speedup :D 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.
Funny, I just wrote my own Doom renderer using original WAD files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day. 
 
Oswald
Registered: Apr 2002 Posts: 4054 
Quote: Funny, I just wrote my own Doom renderer using original WAD files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.
cool, I hope its for some c64 demo thingie :) 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: cool, I hope its for some c64 demo thingie :)
No sorry. :) just curiosity. 
 
Oswald
Registered: Apr 2002 Posts: 4054 
Quote: No sorry. :) just curiosity.
nah next time you make a true doom fx, or you'll be kicked out :) 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: nah next time you make a true doom fx, or you'll be kicked out :)
Hehe, wasn't the Andropolis vector enough? ;) there I only used 16bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and removing the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure. 
 
Repose
Registered: Oct 2010 Posts: 129 
Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code) 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Quoting ChristopherJam: Quoting Repose: About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..
OK, done now and right you are; saved me 13½ cycles on average for my ten random cases (down to 747.2 cycles now), and only cost me an extra five cycles on my best case (up to 697)
I ended up replacing the lda#offset at the starts of each of the middle six columns with an sbc id+$ff-offset,x at the ends.
My code generator has fairly generic treatment of addends now, so it takes care of all the carry logic for me. Core is now
for tb in range(8):
    bspecs=getbspecs(tb)
    doCounts=(tb<7)
    iv=initialValue[tb]
    doClears=len(bspecs)<4 and iv!=0
    op="lda"
    if tb==1:
        emit(" ldx#0")
    elif tb>1:
        emit(" txa")
        op="adc"
        if doCounts: emit(" ldx#0")
    if iv!=0:
        if doClears:
            bspecs=[pointerReturner( "#${iv:02x}".format(iv=iv), co=(iv>0xef))]+bspecs
        else:
            bspecs=bspecs+[pointerReturner( "id+${nv:02x},x".format(nv=0xff-iv), negate=True)]
    for n,s in enumerate(bspecs):
        addB(s, op, moveCarry=(n!=len(bspecs)-1), doCounts=doCounts, doClears=doClears)
        op="adc"
    emit(" sta mRes+{tb}\n\n".format(tb=tb))

 
Repose
Registered: Oct 2010 Posts: 129 
Beautiful! Only thing else I could see is to autogenerate for any precision, which isn't much of a leap. Should have option to include tablegenerator code too, and 6816(?).
You can estimate timing much the same way I do it by hand, by including counts per column based on number of adds, then overhead per column, and total overhead. I use variables Px for branches with an initial estimate of Px=.5 over all multiplies. Was gonna have a P generator given a stream of numbers too, that's really fancy of course.
Was gonna say, put the carries in the table (there's even opportunity for a table with builtin accumulate, would be slightly faster).
I have bad news though, I have doubts about my sec fix for postcorrection :( I'm just working that out now (it's simple though, just ensure the running total is offset by 1 for each carry including the first). Did you verify the outputs? 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)
Not major perhaps, a lot of time is spent on line drawing and EOR-filling anyway. But maybe 30%, i dunno. Long time ago now and I'm not really sure about how much time is spent on what. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Thanks!
I test the multiply using a set of 20 randomly selected 32 bit integers, comparing the product of each to a known result, and I time each multiply using CIA#2 with screen blanked and IRQs disabled. So, it's not comprehensive, but I'm reasonably confident.
Yes, generalising for a set of operand widths would probably be quite handy. Good idea instrumenting the generator to calculate the likely execution time. I've been meaning to do something similar with another project..
The post fixup is fine if it's the last addition performed, then the carry from that addition can propagate through to the next column.
Do you mean 65816? I started learning that once, but SuperCPU never really took off, and I tend to ignore most platforms between C64 and modern hardware these days, at least for hobby coding. REU last year was already something of a leap for me. 
 
Repose
Registered: Oct 2010 Posts: 129 
Btw, one more idea: squaring and cubing based on this can be optimized significantly as well. 
 
Repose
Registered: Oct 2010 Posts: 129 
Test your routine with these magic numbers:
00 00 54 56
00 00 03 03
If my theory is correct, that's the only case that will fail (not really, just the only one I'm bothering to solve for).
It's quite a subtle bug and random values won't catch it, you need 'whitebox' testing.
The failure is rare and of the form that the high byte is higher by 1. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Still works :)
Thanks for the test case.
Here are the first two columns of code (and remember that my g(x)=0x4000-f(x-255) ):
clc
ldy mT2+0
lda (zp_fl0),y
adc (zp_gl0),y
sta mRes+0
ldx#0
lda (zp_fh0),y
adc (zp_gh0),y
bcc *+3
inx
adc (zp_fl1),y
bcc *+3
inx
adc (zp_gl1),y
bcc *+3
inx
ldy mT2+1
adc (zp_fl0),y
bcc *+3
inx
adc (zp_gl0),y
bcc *+3
inx
sbc id+$3f,x
sta mRes+1
The inverse borrow from the final SBC carries forward to the next column; the SBC itself corrects for the false carries while also compensating for the excess 0x40 in the high byte of the g() table. 
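Here is a Python model of one such column, as a sketch under my own reading of the listing (the helper names and the iv parameter are illustrative). The ADCs run without CLC, X counts the carries, and a single SBC of id+$ff-iv,x at the end removes the falsely re-added carries and folds in the column's start value iv:

```python
def adc(a, m, c):
    """6502 ADC in binary mode: returns (result byte, carry out)."""
    s = a + m + c
    return s & 0xFF, s >> 8


def sbc(a, m, c):
    """6502 SBC in binary mode: carry in/out is the inverted borrow."""
    s = a - m - (1 - c)
    return s & 0xFF, 1 if s >= 0 else 0


def column_with_post_fix(values, carry_in, iv):
    """lda v0 / adc v1 / bcc *+3 / inx / ... / sbc id+$ff-iv,x"""
    a, c, x = values[0], carry_in, 0
    for v in values[1:]:
        a, c = adc(a, v, c)
        x += c                      # bcc *+3 : inx
    m = (0xFF - iv) + x             # value read from the identity table
    assert m <= 0xFF                # model assumes no wrap in the id table
    a, c = sbc(a, m, c)
    return a, x, c                  # result byte, carry count, final carry
```

In this model the stored byte equals the low byte of carry_in + sum(values) + iv, and X plus the final carry (which is what txa followed by the next adc picks up) equals everything above the low byte.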
 
Repose
Registered: Oct 2010 Posts: 129 
Oh, I know why it works, I constructed those special values for the normal sense, I mean
54 56
03 03

01 xx
00 ff
00 ff
The whole point was to get those 3 partials to be added, ff+ff+01. Where you are adding with offset, I have to construct the multipliers differently. Not only that, but I'm doubly wrong here - I need to find multipliers which cause the f(x)'s to result that way (where my example works only on the production of f()-g()).
I'll have to finish this later. In the meantime, I suggest you test every possible 16x16. Not so easy I know, I had to write such things in a 6502 simulator in C, actually just simulated the few instructions I needed, but there's a source code out there you could use for a full simulator. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
I'm going to have to think some more about how to synthesise the equivalent test case.
I did start coding an exhaustive test in 6502 (can determine the required result just by repeated adds; a*(b+1)=a*b+a), then realised it wasn't 2**16 test cases but 2**32. Even at 30x realtime that would take VICE 28 hours assuming 700 cycles per iteration.. 
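The 28 hour figure is easy to reproduce. A sketch in Python, assuming the PAL clock of 985248 Hz (the post doesn't say which clock was meant; NTSC comes out about an hour less):

```python
def exhaustive_test_hours(cases, cycles_per_case=700,
                          clock_hz=985_248, speedup=30):
    """Wall-clock hours to run `cases` multiplies in an emulator at `speedup`x."""
    seconds = cases * cycles_per_case / clock_hz / speedup
    return seconds / 3600
```

With 2**32 cases this lands a little over 28 hours, while 2**16 cases would take under two seconds.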
 
Repose
Registered: Oct 2010 Posts: 129 
00 01 02 03 * 04 05 06 07 and manipulate the tables to what you want to test adds for every branch, and number of carries per column up to 14, think that should do it. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Had a thought this morning - the difference of squares is already well established, the only thing that really needs testing is the carry handling for each column. I'll post about that over at sets of add/sub shortly. 
 
Repose
Registered: Oct 2010 Posts: 129 
That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Quote: That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column.
Fair point - I guess I got distracted by your talk of table manipulation.
Posting some analysis of the individual carries in the other thread shortly.
But back to multiplies - I was curious as to how you got away with not offsetting the g() table, then it finally struck me - using SBC instead of ADC is exactly equivalent to doing an ADC of a $ffff-g() table.
Do you have working code yet? I would expect you to need a different offset for each column. 
 
Repose
Registered: Oct 2010 Posts: 129 
Just about to work out the subs, though I'm sure it works in some equivalent way, I'm thinking at most a sec or clc when switching between runs of adds and runs of subs. You can do one fixup at the end. The way I'm doing it makes sense too. No offsets needed.
(ps why did Ice T suddenly flash in my mind singing, no beepers needed?)
Sounds like mine is gonna be a lot cleaner, not to mention faster but we'll see :) 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations.
How does this compare to my stuff on Codebase? Also unsigned? 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes. 
 
Frantic
Registered: Mar 2003 Posts: 1304 
10% faster ain't bad! 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes.
Nice!!! Haven't checked in detail, same table space overhead?
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Thanks!
An extra 256 bytes for the id table, so five pages of tables altogether. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Actually, scratch that; for the 16x16->32 case, I only ever read from 13 bytes of the identity table.
Make that 2061 bytes of tables required. Also 16 bytes of zero page for pointers. 
 
Repose
Registered: Oct 2010 Posts: 129 
Ok, I don't know who won because I'm counting things differently. I've always assumed branches are taken equally and averaged them. By that method, JackAsser's is clearly 233+12 for jsr/rts=245.
One of the alternate versions I already have written is 207+12=219.
Yet, you reported 241 for JackAsser's which is 3 less, and if you scale the same way, I get 219 for yours, exactly the same. As far as I can tell, we're tied.
They are not the same at all. My alternate version is very straightforward and doesn't use the repeated add technique. Still working on that (sorry I'm so slow). We're tied, but this isn't even my final entry.
JackAsser's
setup 74
mults 116
adds 43
total 233

 
ChristopherJam
Registered: Aug 2004 Posts: 620 
I hadn't counted the cycles in the source yet, so I just timed both my and JackAsser's routines with CIA for ten randomly selected pairs of numbers.
Now I've annotated my source, if I assume branches are taken equally and page crossings on table lookups also cost on average half a cycle, I get 206.5+12=218.5
So yes, ridiculously close at the moment given the different approaches I gather we've taken.
I can shave another two cycles off mine, but only if I bump the memory back up to 5 pages of tables again. (or rather, I need to place a 7 byte table somewhere in the last 64 bytes of a page… long story..)
I gather you have something better in the wings; I look forward to seeing it! 
 
Repose
Registered: Oct 2010 Posts: 129 
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
My current best:
mul1616
lda mT1+ 0 ; 3
sta zp_fl0 ; 3
sta zp_fh0 ; 3
eor#255 ; 2
sta zp_gl0 ; 3
sta zp_gh0 ; 3
lda mT1+ 1 ; 3
sta zp_fl1 ; 3
sta zp_fh1 ; 3
eor#255 ; 2
sta zp_gl1 ; 3
sta zp_gh1 ; 3
clc ; 2
ldy mT2+0 ; 3
lda (zp_fl0),y ; 5.5
adc (zp_gl0),y ; 5.5
sta mRes+0 ; 3
ldx#0 ; 2
lda (zp_fh0),y ; 5.5
adc (zp_gh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_fl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
ldy mT2+1 ; 3
adc (zp_fl0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
sbc ofste_3f,x ; 4
sta mRes+1 ; 3
txa ; 2
ldx#$bf ; 2
adc (zp_fh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_fl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
ldy mT2+0 ; 3
adc (zp_fh1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gh1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
sbc ofste_80-$bf,x ; 4
sta mRes+2 ; 3
txa ; 2
ldy mT2+1 ; 3
adc (zp_fh1),y ; 5.5
clc ; 2
adc (zp_gh1),y ; 5.5
sta mRes+3 ; 3
rts
; total=204.5
ofste_3f
.byt $3f,$40,$41,$42,$43,$44
.dsb <$bf-*,0
ofste_80
.byt $80,$81,$82,$83,$84,$85,$86
And of course there's a one-off init of the high bytes of the zero page pointers, and square tables generated as follows:
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
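To see why these tables work, here is a self-contained Python check. It reads g(x) as 0x4000 - f(x - 255), and the `lo`/`hi` helpers are assumptions, since the post doesn't show them. It confirms that looking up f at a+b and g at (a^255)+b and adding the two gives a*b plus a fixed $4000 offset, which the ofste tables in the routine then remove.

```python
def lo(v):
    """Low byte of a 16-bit value (assumed helper, not shown in the post)."""
    return v & 0xFF


def hi(v):
    """High byte of a 16-bit value (assumed helper, not shown in the post)."""
    return (v >> 8) & 0xFF


f = lambda x: x * x // 4
g = lambda x: (0x4000 - f(x - 255)) & 0xFFFF


def check_identity():
    """f(a+b) + g((a^255)+b) == a*b + $4000 (mod 2**16) for all bytes a, b."""
    for a in range(256):
        for b in range(256):
            # pointer low byte a indexes f at a+b;
            # pointer low byte a^255 indexes g at (255-a)+b
            total = f(a + b) + g((a ^ 255) + b)
            if total & 0xFFFF != (a * b + 0x4000) & 0xFFFF:
                return False
    return True


flo = [lo(f(i)) for i in range(512)]
ghi = [hi(g(i)) for i in range(512)]
print(check_identity(), len(flo), len(ghi))
```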

 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Quoting Repose: Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.
Haha, then I should probably acknowledge that my ~750 cycle 32x32->64 requires 32 bytes of zero page pointers, on top of the 16 bytes of zero page split between the multiplier, multiplicand and result :)
 
Repose
Registered: Oct 2010 Posts: 129 
Wow, you really squeezed that dry, good job! Even though my approach is different (in the runs of adds vein), it follows the same pattern as yours, and thus will turn out almost exactly the same time.
There's other approaches to faster adds though. 
 
Repose
Registered: Oct 2010 Posts: 129 
I think I have 195, but it requires a different approach. You're going to have to wait another day though :) 
 
Repose
Registered: Oct 2010 Posts: 129 
Finally, after fixing a bad sqrtab...
The World's Fastest Published 16x16 Unsigned Mult on 6502.
Testing methodology

To measure a particular multiply, you can enter the monitor in VICE and type the following (where pc is your start address):
r pc=c000
Then type:
z
and keep hitting enter to step through the code. Copy and paste the first line. When you've reached the line *after* the last you want to measure, copy and paste that (also include any points in between). You can now subtract the two to get the timing.
Example:
LDA $FB  A:00 X:0A Y:00 SP:eb ....IZ. 72210316
CLC  A:00 X:0A Y:00 SP:eb ....IZC 72210476
RTS  A:00 X:0A Y:00 SP:eb ....IZ. 72210520
This shows me that my multiply body was 160 cycles, and the adds were 44 cycles, for a total of 204 (never mind the slow times, I had nothing in zp for testing purposes).
I used that as a guide, but really I added by hand and averaged page crossings and branches for the reported total.
Tell me the speed

158 cycles for the multiply part, with no variation, and the inputs/outputs in zp.
43 cycles for the final additions, with each branch equally likely.
The total is 201. However, if you include the simple variation which requires that part of the code is in zp, you save 3 cycles for a total of 198 (I just wanted to say I could break 200).
add 12 for jsr/rts. I report this way to be consistent with CJ's results above
The Code

;World's fastest 16x16 unsigned mult for 6502
;you can go faster, but not without more code and/or data
;and being less elegant and harder to follow.
;by Repose 2017
;tables of squares
;sqr(x)=x^2/4
;negsqr(x)=(255-x)^2/4
sqrlo=$c000;511 bytes
sqrhi=$c200;511 bytes
negsqrlo=$c400;511 bytes
negsqrhi=$c600;511 bytes
;pointers to square tables above
p_sqr_lo=$8b;2 bytes
p_sqr_hi=$8d;2 bytes
p_invsqr_lo=$8f;2 bytes
p_invsqr_hi=$91;2 bytes
;the inputs and outputs
x0=$fb;multiplier, 2 bytes
x1=$fc
y0=$fd;multiplicand, 2 bytes
y1=$fe
z0=$80;product, 4 bytes
z1=$81
z2=$82
z3=$83
;not shown is a routine to make the tables
;also you need to init the pointers' high bytes to the tables
umult16:
;set multiplier as x0
lda x0
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17
ldy y0
sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y;note these two lines taken as 11 total
sta z0;x0*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c1a+1;x0*y0h;31
;c1a means column 1, row a (partial product to be added later)
ldy y1
;sec ;notice that the high byte of sub above is always +ve
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1b+1;x0*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2a+1;x0*y1h;31
;set multiplier as x1
lda x1
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17
ldy y0
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1c+1;x1*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2b+1;x1*y0h;31
ldy y1
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c2c+1;x1*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta z3;x1*y1h;31
;4*31+2*17 so far=158
;add partials
;add first two numbers in column 1
;jmp do_adds;put in zp to save 3 cycles :)
do_adds:
clc
c1a lda #0
c1b adc #0;add first two rows of column 1
sta z1;9
;continue to first two numbers in column 2
c2a lda #0
c2b adc #0
sta z2;7
bcc c1c;3 taken/9 not taken, avg 6
inc z3
clc
;add last number of column 1 (row c)
c1c lda #0
adc z1
sta z1;8
;add last number of column 2
c2c lda #0
adc z2
sta z2;8
bcc fin;3/7 avg 5
inc z3
;9+7+6+8+8+5=43
fin rts
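For anyone wanting to check the logic without an assembler, here is a Python model of the routine above. It is only a sketch: it follows the same table definitions (with negsqr read as (255-x)^2/4), the same four partial products and the same order of column adds, but it does not model cycle counts or the carry state left by the SBCs. The helper names `mul8` and `add8` are made up for illustration.

```python
SQR = [x * x // 4 for x in range(511)]              # sqr(x) = x^2/4
NEGSQR = [(255 - x) ** 2 // 4 for x in range(511)]  # negsqr(x) = (255-x)^2/4


def mul8(x, y):
    """8x8 -> 16 product from two table lookups, as in lda (p_sqr),y / sbc (p_invsqr),y."""
    return SQR[x + y] - NEGSQR[(x ^ 0xFF) + y]


def add8(a, b, carry):
    """One ADC: returns (low byte, carry out)."""
    total = a + b + carry
    return total & 0xFF, total >> 8


def umult16(x, y):
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    p00, p01 = mul8(x0, y0), mul8(x0, y1)
    p10, p11 = mul8(x1, y0), mul8(x1, y1)
    z0, c1a = p00 & 0xFF, p00 >> 8
    c1b, c2a = p01 & 0xFF, p01 >> 8
    c1c, c2b = p10 & 0xFF, p10 >> 8
    c2c, z3 = p11 & 0xFF, p11 >> 8
    z1, c = add8(c1a, c1b, 0)   # first two rows of column 1
    z2, c = add8(c2a, c2b, c)   # first two rows of column 2
    z3 += c                     # bcc / inc z3
    z1, c = add8(c1c, z1, 0)    # last row of column 1
    z2, c = add8(c2c, z2, c)    # last row of column 2
    z3 += c                     # bcc / inc z3
    return z0 | z1 << 8 | z2 << 16 | z3 << 24


print(hex(umult16(0x1234, 0x5678)))
```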

 
JackAsser
Registered: Jun 2002 Posts: 1226 
Repose, really nice!
Some further optimization:
sta c2b+1;x1*y1h;31 => tax
c2a lda #0
c2b adc #0 =>
c2b txa
c2a adc #0
But I somehow like the fact that X is kept clean otoh. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Also, if z3 is assumed to be in the Y register when done instead:
sta z3;x1*y1h;31 => tay
all inc z3 => iny
I think that's ok since you'd probably do some lda z3 afterwards anyway, and instead you have it readily available in Y. 
 
Repose
Registered: Oct 2010 Posts: 129 
Thanks :) I screwed up my timing though, the first multiply part is 33 not 31 because of the single SEC, bringing my total to 203 (and VICE was right, the multiply part is 160). Because of this, I think I'll use your optimizations to bring it back down to 201. Therefore it still stands :) 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Also, in a real-life situation, e.g. subpixel vectors, you'd only keep z3 and z2 as a screen-limited 8.8 result. Typically one would do:
Rotated x,y,z in 8.8
Rotated z in 8.8
Reciprocal z = 1/z = 0.16
Perspective:
8.8 * 0.16 = 8.24 (keep only the top 8.8, i.e. pixel and the bresenham initial error) 
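The fixed-point bookkeeping above can be sketched in Python as follows. This is an unsigned-only illustration with made-up helper names (`to_fixed`, `perspective`), not code from any of the routines in this thread.

```python
def to_fixed(value, frac_bits):
    """Convert a non-negative real number to fixed point with frac_bits fraction bits."""
    return int(round(value * (1 << frac_bits)))


def perspective(coord_8_8, recip_0_16):
    """8.8 * 0.16 -> 8.24; keep only the top 16 bits (z3:z2) as an 8.8 result."""
    product_8_24 = coord_8_8 * recip_0_16   # full 32-bit product z3..z0
    return product_8_24 >> 16               # drop z1 and z0


x = to_fixed(100.5, 8)        # rotated coordinate in 8.8
rz = to_fixed(1 / 4.0, 16)    # reciprocal of z = 4.0 in 0.16
screen = perspective(x, rz)   # 8.8: pixel in the high byte, fraction in the low byte
print(screen >> 8, screen & 0xFF)
```

Here 100.5 / 4 = 25.125, so the high byte is the pixel (25) and the low byte is the fractional part that would seed the Bresenham error term.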
 
Repose
Registered: Oct 2010 Posts: 129 
Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).
So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.
Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.
Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).
So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.
Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.
Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.
I've had similar ideas and I even did a frame difference experiment in Java. Problem was that since the diffs are small, so will the triangles be. They'll be extreme, sharp and 'problematic'. Hard to render correctly, and since they're diffs any render error will accumulate.
Regarding transforms I came to the conclusion to cut most of the stack and forget about how it works conventionally.
Regarding divs we all do mul by the reciprocal. For Andropolis I did what Graham did and calced the reciprocal by linear interpolation:
X is 8.8 call it a.b:
1/a.b ~= invtab[a]*(1-b) + invtab[a+1]*b
invtab[x] is 0.16 result of 65536/x for x 1..255 
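A minimal Python sketch of that interpolation, assuming the formula reads invtab[a]*(1-b) + invtab[a+1]*b with b taken as a fraction of 256. The extra table entry at index 256 is an assumption added here so that a = 255 has a right-hand neighbour, and Python integers hide the fact that 65536/1 does not fit in 16 bits.

```python
# invtab[x] = 65536/x; index 0 is unused, index 256 is an added assumption.
INVTAB = [0] + [65536 // x for x in range(1, 257)]


def reciprocal(x_8_8):
    """Approximate 65536/x for an 8.8 value x = a.b with a in 1..255."""
    a, b = x_8_8 >> 8, x_8_8 & 0xFF
    # weight the two neighbouring table entries by (1-b) and b
    return (INVTAB[a] * (256 - b) + INVTAB[a + 1] * b) >> 8


approx = reciprocal(0x0A80)   # x = 10.5
exact = 65536 / 10.5
print(approx, round(exact, 1))
```

Because 1/x is convex, the interpolated value lands slightly above the true reciprocal; the relative error shrinks quickly as a grows and is largest for small a.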
 
Repose
Registered: Oct 2010 Posts: 129 
You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.
I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing? 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.
I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?
Hehe, I suck at writing articles! :)
Imagine the simplest case: axis aligned rooms, no height differences and just rotate around Y, i.e. Wolfenstein.
The diff formed by the edge between the wall and the floor/ceiling will be an elongated thin triangle. The diff might be 3-4 pixels high close to the camera, which is easy, but further down the corridor the height of the diff will be <1, yet pixels must still be put here and there. Surely an extended Bresenham that "draws" both the top and the bottom line simultaneously could handle it, and calc the total height between the lines and still propagate individual errors.
 
Repose
Registered: Oct 2010 Posts: 129 
Ok, I think that makes sense now.
Now 196
I managed to save another 7 cycles from the correct total of 203, bringing it down to 159+37=196. As a bonus, the two highest bytes are returned in registers.
If I put do_adds in zp, there's one more optimization to save 4 instead of 3, for 192. Finally, if you don't need the lowest bytes, you can save 3 cycles by deleting sta z1.
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay;x1*y1h;Y=z3, 30 cycles
do_adds:
;add the first two numbers of column 1
clc
c1a: lda #0
c1b: adc #0
sta z1;9
;continue to first two numbers of column 2
c2a: lda #0
c2b: adc #0
tax;X=z2, 6 cycles
bcc c1c;3/6 avg 4.5
iny;z3++
clc
;add last number of column 1
c1c: lda #0
adc z1
sta z1;8
;add last number of column 2
txa;A=z2
c2c: adc #0
tax;X=z2, 6
bcc fin;3/4 avg 3.5
iny;z3++
;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin: rts

 
Oswald
Registered: Apr 2002 Posts: 4054 
I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ? 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Damn, so you're now 8.5 cycles faster than me? I was not expecting partial products to be faster than the optimisations we've been working on for runs of adc. Going to have to study this more closely.
Canada holds the record, for now. Nice work! 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?
Adding in one go would lose the carry information 
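A small Python illustration of the point, not taken from either routine: if three bytes of one column are chained through ADC, a carry out of the first add is folded back into the same column as +1 instead of being passed to the next column. The function names are hypothetical.

```python
def add_column_naive(values):
    """lda a / adc b / adc c: each carry feeds the next add in the same column."""
    acc, carry = 0, 0
    for v in values:
        total = acc + v + carry
        acc, carry = total & 0xFF, total >> 8
    return acc, carry            # only the last carry survives


def add_column_counting(values):
    """Start each add with carry clear and count the carries for the next column."""
    acc, carries = 0, 0
    for v in values:
        total = acc + v
        acc = total & 0xFF
        carries += total >> 8
    return acc, carries


column = [0xFF, 0x02, 0x03]      # true sum is $104
print(add_column_naive(column), add_column_counting(column))
```

The naive chain returns a low byte of $05 with no carry, while the true sum $104 needs a low byte of $04 and one carry into the next column.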
 
Repose
Registered: Oct 2010 Posts: 129 
To be fair, CJ can save 2 cycles by returning in regs like me, then they are just 6.5 or 10.5 cycles apart. Most of our instructions are the same, there's just a difference in overhead. But I can't seem to find any idea that's faster; this could be the actual optimal routine.
I tried all branches for the adds, that's a bit slower. I mean code like:
clc
lda c1a
adc c1b
bcs c1bc        c1bc:
                clc
adc c1c         adc c1c
sta z1          sta z1
                lda #1
lda c2a         adc c2a
bcs c2ac;7                      c2ac:
                                clc
adc c2b         adc c2b         adc c2b
And it did seem to work, but I get about 39.5 cycles.
I also looked into the crazy timer idea, but each correction is sbc timerA, which is already too slow for this small number of adds.
I have one idea that can be dramatically faster but it's not very practical.
;y0 in X, y1 in Y, x0-1, x1-1 init to $02, all ram swapped in
x0=$fc;fb-fc is pointer
x1=$fe;fd-fe is pointer
jmp (x0-1)
multk:
;multiply by constant multiplier k
;and multiplicands X, Y
;then add
rts
There are never any setups, so that saves 34 cycles. Many of the multipliers can be optimized; that wouldn't reduce the average much, but important cases like 0, 1, and powers of 2 could be faster.
You also need room for tables so it could only handle 7 bits at a time, unless you did each one in pure code.
Even with such an extreme approach, the worst case is probably close to a normal multiply.
What do you think, is this the end? 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Well, without memory restrictions you could do crazy stuff like have variants of the f() or g() tables offset by 1..N so you could move the carry correction into selecting which table to use... but I'm not sure how much that would gain you. 