Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers when the multiply routine's running time depends on the bits of the multiplier. Usually, with a standard shift-and-add routine, each 1 bit in the multiplier costs a "bit" more time for that bit.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by pre- and post-processing of results. The improvement is about 20%.
Another speedup is optimizing for the same multiplier applied to multiple multiplicands, which saves a little by processing the multiplier bits only once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
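The post doesn't spell out the transformations, but one classic recoding of that flavour (a sketch only, not necessarily the method meant here) is to multiply by the two's complement of a 1-bit-heavy multiplier, since x*y = x*256 - x*(256-y):

```python
def popcount(v):
    return bin(v).count("1")

def mul8(x, y):
    # Plain shift-and-add: one extra add per 1 bit in the multiplier y.
    acc = 0
    for bit in range(8):
        if (y >> bit) & 1:
            acc += x << bit
    return acc

def mul8_recoded(x, y):
    # If y is 1-bit-heavy, multiply by its complement instead:
    # x*y = x*256 - x*(256-y), and 256-y has few 1 bits when y has many.
    if popcount(y) > 4:
        return (x << 8) - mul8(x, 256 - y)
    return mul8(x, y)

# Average 1-bit count the shift-and-add loop actually sees, before/after
plain = sum(popcount(y) for y in range(256)) / 256
recoded = sum(popcount(256 - y) if popcount(y) > 4 else popcount(y)
              for y in range(256)) / 256
```

With this recoding the inner loop never sees more than four 1 bits per 8-bit multiplier, at the cost of one fixup subtraction.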
Repose
Registered: Oct 2010 Posts: 225 |
After some work I decided to go with a square table based large multiply: 32x32=64 bit in 850 cycles or so at best. I discovered a trick I wish I'd known a long time ago; to add long lists of numbers you can do this
ldy #0
clc
lda a0
adc b0
bcc s1
iny
clc
s1 adc c0
sta d0
bcc s2
iny
s2 sty d1
It's basically 6 cycles per row and 7 cycles per column. When adding the 16 (two byte) partials from the 32bit multiply, it averages to 7.5 cycles per add. This is really optimal as the most basic add is 6 cycles (lda a0:adc b0).
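A quick Python model of the trick, counting carries in a register (the INY) instead of propagating them immediately:

```python
def add_column(partials):
    """Model of the 6502 trick: sum one column of 8-bit partials,
    counting carry-outs in Y rather than rippling them up right away."""
    a = 0
    y = 0              # carry counter (INY on each carry-out)
    for b in partials:
        a += b         # ADC
        if a > 0xFF:   # carry out -> INY : CLC
            a &= 0xFF
            y += 1
    return a, y        # low byte of the column, carries into next column

lo, carries = add_column([0xF0, 0xF0, 0x40])
# lo + 256*carries reconstructs the full column sum
```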
I also found out that there's carries slightly less than half the time on average.
I'm working on a routine to do 8x8=16, 8x16=24, 16x16=16 with rounding, 32x32=32 with rounding, 32x32=64, and 32x32 floating point, all in one routine with different entry points, where (almost) every entry is as fast as a standalone routine. It's just a matter of arranging the order of operations carefully to progressively create the result.
I think you can get 1000 flops on the C64.
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
1000 flops is well within reach - it just depends how much precision you want, and what you define as a flop :)
Single precision IEEE-754 only has 24 bits for the mantissa, and you don't need a 48 bit result unless you want to get the rounding exact for multiply-accumulates.
Have you considered expressing the multiplies as a difference of squares? Should mean instead of needing nine 8x8->16 multiplies for a 24x24 bit you have six 8x8s and a few table lookups. |
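The difference-of-squares identity is easy to sanity-check with a model of the f table; the truncation in n^2/4 cancels exactly because a+b and a-b always share parity (names here are illustrative):

```python
# f(n) = n*n // 4 for n in 0..510 -- 511 entries cover any 8-bit a+b.
f = [n * n // 4 for n in range(511)]

def mul_ds(a, b):
    # a*b = f(a+b) - f(|a-b|); exact despite the //4 truncation,
    # since a+b and a-b are both even or both odd.
    return f[a + b] - f[abs(a - b)]
```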
Repose
Registered: Oct 2010 Posts: 225 |
part 1
z1:z0=(x0+y0)^2/4-(x0-y0)^2/4 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;48 cycles
z1:z0=x0*y1/2 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y1
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;56 cycles
z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles
c1:c0=a0^2/4-b0^2/4;48 cycles
c3:c2=a1^2/4-b1^2/4;48 cycles
t1:t0=a0*a1/2;56 cycles
t3:t2=b0*b1/2;56 cycles
t1:t0-=t3:t2;20 cycles
c2:c1+=t1:t0;20 cycles
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1 lda #0;8/11
adc c3:sta c3;6
which is 262+ cycles for your method, 16bit.
Repose
Registered: Oct 2010 Posts: 225 |
You're proposing,
a*b=(a+b)^2 - (a-b)^2,
f(x)=x^2/4,
a*b=f(a+b) - f(a-b)
Where a,b are 24 bit values.
So we need a 24bit square function.
You can get that by,
(x2 * 2^16 + x1 * 2^8 + x0)^2=
x2^2 * 2^32 + x1^2 * 2^16 + x0^2 + y,
g(x)=x^2/4,
=g(x2) * 2^32 + g(x1) * 2^16 + g(x0)
which are three table lookups and,
y= 2*x1*x0/4 * 2^8 + 2*x2*x0/4 * 2^16 + 2*x2*x1/4 * 2^24
which are three 8bit multiplies.
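The decomposition can be checked exactly if the /4 scaling is left out (the table version divides everything through by 4, with the truncation bookkeeping discussed later); a quick model:

```python
def square24(x2, x1, x0):
    # (x2*2^16 + x1*2^8 + x0)^2, built from three square lookups
    # plus the three 8x8 cross products (the "y" terms above).
    sq = lambda v: v * v
    return ((sq(x2) << 32) + (sq(x1) << 16) + sq(x0)
            + (2 * x1 * x0 << 8)
            + (2 * x2 * x0 << 16)
            + (2 * x2 * x1 << 24))
```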
In general for n byte numbers you get
2n squaring lookups,
2 * nC2 mults, where nCr = n!/((n-r)!r!),
2n shifts
O(n) = 2n*48 + 2*nC2*(56+20) + 14, n>1
2C2=2
3C2=3
4C2=6
n cycles
2 262
3 530
4 1310
So without optimization, it doesn't look good...
Skate
Registered: Jul 2003 Posts: 494 |
Quote:
So without optimization, it doesn't look good...
when it comes to math, what looks good on c64 without optimization?
creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as it is.
imho, best method is to create the library with possible optimizations which don't affect the accuracy much. at the last stage, replace all possible calculations with reasonable look-up tables and where a look-up table is not an option, inline the math functions and optimize them as much as possible, change the register usage for saving a few more cycles, sacrifice some accuracy if it doesn't affect the result badly etc. |
Repose
Registered: Oct 2010 Posts: 225 |
You're thinking in terms of demos. I was thinking of applying this as a compiler library. I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach; when I said it doesn't look good, I meant his approach.
Update: slight problem in how I count things
O(n) = n*48 + nC2*(56+20) + 14; in fact n=1 works too since 1C2 is 0, and the 14 term only applies for n>1
n cycles
1 48
2 262
3 386
4 662
This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass... |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quoting Reposepart 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1
lda #0;8/11
adc c3
sta c3
this will a) most likely crash, as it only skips one byte and then runs into $00, which will be a BRK. Also this is not an optimizing trick, speed-wise nor size-wise:
Better be done like:
anc #$80 ;copy bit 7 to carry and A = A & $80
bcc + ;A is zero on plus, all fine
sbc #$81 ;subtract $81 from $80 = $ff
+
;carry always clear as a free gift
Also, lots of other possibilities to optimize that code, and also some mistakes within the code. |
Repose
Registered: Oct 2010 Posts: 225 |
I worked on this again. There were a lot of bugs in my post 'part 1'.
Basically, this message is working out how long to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.
Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.
----------------
z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg
z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg
z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles
The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them
;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z2-=t3:t2;20 cycles
;190*2+40=420 cycles+
There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.
This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed). |
Repose
Registered: Oct 2010 Posts: 225 |
I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.
The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.
I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.
It's only a bit better than the one posted already in codebase.
I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies. |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Have you checked my Seriously fast multiplication?
Edit: Yes you have.. sorry. |
Repose
Registered: Oct 2010 Posts: 225 |
Yes, that's the one I beat already. |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Yes, that's the one I beat already.
Interesting approach! I'll try it myself someday! |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Stop distracting me, I've got other stuff I'm meant to be working on! |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Stop distracting me, I've got other stuff I'm meant to be working on!
Circle closed. You were the first to teach me multiplications using squares. I remember I found you on some forum ages ago. :D |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairly sure the truncating divide by 4 was one of my contributions?)
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?)
Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the highbyte with X and then use ,Y in the table lookup. That way the X+Y comes for "free" |
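A small model of that implicit-add trick, with a hypothetical table base address, just to show where the free X+Y addition comes from:

```python
# The squares/4 table lives at a fixed, page-aligned "address"; writing one
# operand into the pointer's low byte and putting the other in Y means the
# (zp),y addressing mode computes base + x + y during the fetch itself.
SQR_BASE = 0xC000                      # illustrative location only
mem = [0] * 0x10000
for i in range(511):
    mem[SQR_BASE + i] = i * i // 4     # f(i) = i^2/4

def table_fetch(x, y):
    ptr = SQR_BASE | x     # self-mod / zp pointer: low byte = x
    return mem[ptr + y]    # lda (ptr),y reads f(x+y) -- the add is free
```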
Repose
Registered: Oct 2010 Posts: 225 |
That very idea I always thought was brilliant, because it makes a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than seems possible. I call it a 'super' algorithm.
And yes, Judd told me that at the time we were working on the C=Hacking article together.
Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ? |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: Yes, that's the one I beat already.
Don't hesitate to write an article, or at least post some source code, at Codebase64. :) Would be much appreciated! |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Repose, did you get around to completing a 32x32->64 routine? I'm trying out some of the ideas from this and the sets of add/sub threads; currently down to around 1300 cycles, though I've still got some optimising to go.
I'm using inx to track carries, but have yet to remove the CLCs after each branch. |
Repose
Registered: Oct 2010 Posts: 225 |
Yes I wrote a 16x16 to test, and it takes 211 cycles (that's avg with equal branches). The one posted on codebase is 231 cycles. I am just testing it now but having some trouble finding a single step debugger that works for me, see thread.
The time to beat for 32x32 is at least 1300 as posted on codebase so you have to keep going...
I can derive the formula for my current version to estimate exact timing of 32x32 I'll have to get back to you in a bit. |
Repose
Registered: Oct 2010 Posts: 225 |
Here's some untested code:
x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99
sqrlo=$c000;510 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600
umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17
lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17
ldx #0;start column 0
ldy y0
SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1
;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;-x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1
;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;-x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;-x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;-x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2
;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;-x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3
rts
makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1
;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
makesqrtables_loop1:
tya
adc #$00
makesqrtables_sm1:
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2
rts
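A Python model of what umult16's table lookups compute (not cycle-exact, just the arithmetic): setting a pointer's low byte to x makes (ptr),y fetch table[x+y], and the EOR #$ff pointer fetches negsqr[(255-x)+y], which is (x-y)^2/4.

```python
# sqr[i]    = i*i//4          for i in 0..510  (the $c000/$c200 tables)
# negsqr[i] = (i-255)**2//4   for i in 0..510  (the $c400/$c600 tables)
sqr    = [i * i // 4 for i in range(511)]
negsqr = [(i - 255) ** 2 // 4 for i in range(511)]

def mul8_tables(x, y):
    # sqr[x+y] - negsqr[(x EOR $ff)+y] = f(x+y) - f(x-y) = x*y
    return sqr[x + y] - negsqr[(x ^ 0xFF) + y]
```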
Repose
Registered: Oct 2010 Posts: 225 |
Post correction would be slower on the 16x16, not a long enough run of add/sub. |
Repose
Registered: Oct 2010 Posts: 225 |
Partials cheatsheet
y1 y0
x1 x0
------
x0*y0h x0*y0l
x1*y0h x1*y0l
x0*y1h x0*y1l
x1*y1h x1*y1l
----------------
24x24bits
x2 x1 x0
y2 y1 y0
---------------
y0x0h y0x0l
y0x1h y0x1l
y0x2h y0x2l
y1x0h y1x0l
y1x1h y1x1l
y1x2h y1x2l
y2x0h y2x0l
y2x1h y2x1l
y2x2h y2x2l
These facts are useful to estimating the time of any size calc:
Number of columns is 2*n, n is bytes of each number.
Rows of additions go like 1 3 3 1, then 1 3 5 5 3 1, then 1 3 5 7 7 5 3 1 for 16, 24 and 32 bit mults, and each one is an f(x)-g(x), so really double that number of add-subs.
The total add-subs is n^2*2. (each is about 10 cycles).
Number of times to change (or pointers to set) of the multiplier is n, then each one is used >n times with the multiplicand (ldy multiplicand), when doing in column order (tbd total). |
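The column heights quoted above can be generated mechanically, since each of the n*n byte products drops its low byte into one column and its high byte into the next; a sketch:

```python
def partial_rows(n):
    """Column heights (partial bytes per column) for an n-byte x n-byte
    multiply, before doubling for the f/g pair of each product."""
    cols = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            cols[i + j] += 1       # low byte of partial i*j
            cols[i + j + 1] += 1   # high byte lands one column up
    return cols
```

This reproduces the 1 3 3 1 / 1 3 5 5 3 1 / 1 3 5 7 7 5 3 1 patterns, and the total always comes to 2*n^2.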
Repose
Registered: Oct 2010 Posts: 225 |
Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I've replaced all the subtractions with additions by using g(x)=$4000-(x*x/4) and offsetting my start number, and rolled the carry correction in to the addition sequence.
Removing the CLCs is probably not worthwhile for 32x32, as there are only 33 of them, so I'd have to spend considerably less than 4 cycles per output byte on the post correction (unlikely).
I'm down to around 800 cycles for a 32x32 now, 776 cycles for zero times zero. |
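Why the $4000 offset turns everything into additions can be checked in a few lines, assuming the g pointer is set via EOR #$ff as in the earlier routines (the real table also masks to 16 bits):

```python
f = lambda x: x * x // 4
g = lambda x: 0x4000 - f(x - 255)   # offset second table, never negative

def mul8_addonly(x, y):
    # f indexed by x+y, g indexed by (x EOR $ff)+y = (255-x)+y, so
    # f(x+y) + g(255-x+y) = x*y + 0x4000: additions only, with a
    # constant bias per product that the init values compensate for.
    return f(x + y) + g((255 - x) + y) - 0x4000
```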
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
fo=open("tables.inc","w")
lo=lambda x:x&255
hi=lambda x:(x>>8)
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
dumpArrayToA65(fo, "id", [lo( i ) for i in range(512)])

fo=open("mc.inc","w")

mAcc=0
for i in range(4):
    for j in range(4):
        mAcc-=0x40<<(8*(1+i+j))
initialValue = [((mAcc>>s)&0xff) for s in range(0,64,8)]

def addB(yv,zp,tb):
    global lasty
    if yv!=lasty:
        print(""" ldy mT2+{yv}""".format(yv=yv), file=fo)
        lasty=yv
    print(""" adc ({zp}),y""".format(zp=zp), file=fo)
    if tb<7:
        print(""" bcc *+4:inx:clc""", file=fo)
    else:
        print(""" bcc *+3:clc""", file=fo)

lasty=None
for tb in range(8):
    print(""" ; tb={tb} """.format(tb=tb),file=fo)
    if tb==0:
        print(""" clc """,file=fo)
        print(""" ldx#0 """,file=fo)
        print(""" lda #${iv:02x} """.format(iv=initialValue[tb]),file=fo)
    else:
        print(""" txa""", file=fo)
        if tb<7:
            print(""" ldx#0 """,file=fo)
        print(""" adc#${iv:02x}""".format(iv=initialValue[tb]), file=fo)
        if initialValue[tb]>0xef:
            print(""" bcc *+4:inx:clc""", file=fo)
    for j in range(4):
        i=tb-j
        if i in [0,1,2,3]:
            addB(i, "zp_fl{j}".format(j=j), tb)
            addB(i, "zp_gl{j}".format(j=j), tb)
        i=tb-j-1
        if i in [0,1,2,3]:
            addB(i, "zp_fh{j}".format(j=j), tb)
            addB(i, "zp_gh{j}".format(j=j), tb)
    print(""" sta mRes+{tb}""".format(tb=tb), file=fo)
fo.close()
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
(obviously also need four sets of
lda mT1+ 0
sta zp_fl0
sta zp_fh0
eor#255
sta zp_gl0
sta zp_gh0
every time the multiplier changes (included in my cycle times above),
plus also some init code to set the high bytes of the zero page pointers before a set of multiplications is performed (not included in my timings)) |
Repose
Registered: Oct 2010 Posts: 225 |
Good job, that's right in the range of what I thought was possible.
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
By offsetting by $4000, doesn't that reduce the domain?
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :) |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeGood job, that's right in the range of what I thought was possible. Thanks!
Quote:I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers. Doing :)
Quote:by offset $4000, doesn't that reduce the domain?
Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80
Quote:Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)
Quote:Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, faster correction:
sec
sbc id,x
sta z2
Still not gonna do it, mind ;) |
Repose
Registered: Oct 2010 Posts: 225 |
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Let's say in the middle columns, where there's 14 adds per column: that's 28 cycles of CLCs, saved half the time, or 14 cycles on average, vs 8 cycles for correction, so it still saves 6.
I actually found the stats for the carries, most of them are about half, but adding a higher proportion of high bytes gives less carries. |
Repose
Registered: Oct 2010 Posts: 225 |
Good catch on the id,x, I was thinking of that a few days ago but it didn't click in for this situation yet :)
And yes, I worked hard at mixing add/sub properly; it still doesn't really make sense, but it works. I thought it wouldn't if you DEX to $FF, but it still works. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeAbout the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts.. |
Repose
Registered: Oct 2010 Posts: 225 |
Yes it hurts :) I posted the explanation in the add/sub thread if you can follow it.
Try 0 - ff - ff in your head. Have fun! :) |
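The "0 - ff - ff" puzzle works out cleanly with a minimal model of SBC (result = A - M - (1 - C), carry holds the inverted borrow):

```python
def sbc(a, m, c):
    # 6502 SBC semantics: A = A - M - (1 - C); carry=1 means no borrow.
    r = a - m - (1 - c)
    return r & 0xFF, int(r >= 0)

a, c = 0x00, 1          # SEC : LDA #$00
a, c = sbc(a, 0xFF, c)  # first borrow
a, c = sbc(a, 0xFF, c)  # second borrow, fed the cleared carry
```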
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Thanks!
Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.
Down to 699 cycles for 0*0, btw ;) |
Repose
Registered: Oct 2010 Posts: 225 |
I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.
Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes." |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway - instant 2x speedup :D |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.
Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day. |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.
cool, I hope its for some c64 demo thingie :) |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: cool, I hope its for some c64 demo thingie :)
No sorry. :) just curiosity. |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: No sorry. :) just curiosity.
nah next time you make a true doom fx, or you'll be kicked out :) |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: nah next time you make a true doom fx, or you'll be kicked out :)
Hehe, wasn't the Andropolis vector enough? ;) there I only used 16-bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and remove the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure. |
Repose
Registered: Oct 2010 Posts: 225 |
Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code) |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ChristopherJamQuoting ReposeAbout the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..
OK, done now and right you are; saved me 13½ cycles on average for my ten random cases (down to 747.2 cycles now), and only cost me an extra five cycles on my best case (up to 697)
I ended up replacing the lda#offset at the starts of each of the middle six columns with an sbc id+$ff-offset,x at the ends.
My code generator has fairly generic treatment of addends now, so it takes care of all the carry logic for me. Core is now
for tb in range(8):
    bspecs=getbspecs(tb)
    doCounts=(tb<7)
    iv=initialValue[tb]
    doClears=len(bspecs)<4 and iv!=0
    op="lda"
    if tb==1:
        emit(" ldx#0")
    elif tb>1:
        emit(" txa")
        op="adc"
        if doCounts: emit(" ldx#0")
    if iv!=0:
        if doClears:
            bspecs=[pointerReturner( "#${iv:02x}".format(iv=iv), co=(iv>0xef))]+bspecs
        else:
            bspecs=bspecs+[pointerReturner( "id+${nv:02x},x".format(nv=0xff-iv), negate=True)]
    for n,s in enumerate(bspecs):
        addB(s, op, moveCarry=(n!=len(bspecs)-1), doCounts=doCounts, doClears=doClears)
        op="adc"
    emit(" sta mRes+{tb}\n\n".format(tb=tb))
|
Repose
Registered: Oct 2010 Posts: 225 |
Beautiful! Only thing else I could see is to auto-generate for any precision, which isn't much of a leap. Should have option to include table-generator code too, and 6816(?).
You can estimate timing much the same way I do it by hand, by including counts per column based on number of adds, then overhead per column, and total overhead. I use variables Px for branches with an initial estimate of Px=.5 over all multiplies. Was gonna have a P generator given a stream of numbers too, that's really fancy of course.
Was gonna say, put the carries in the table (there's even opportunity for a table with built-in accumulate, would be slightly faster).
I have bad news though, I have doubts about my sec fix for post-correction :( I'm just working that out now (it's simple though, just ensure the running total is offset by 1 for each carry including the first). Did you verify the outputs? |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)
Not major perhaps, a lot of time is spend on line drawing and EOR-filling anyway. But maybe 30%, i dunno. Long time ago now and I'm not really sure about how much time is spend on what. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Thanks!
I test the multiply using a set of 20 randomly selected 32 bit integers, comparing the product of each to a known result, and I time each multiply using CIA#2 with screen blanked and IRQs disabled. So, it's not comprehensive, but I'm reasonably confident.
Yes, generalising for a set of operand widths would probably be quite handy. Good idea instrumenting the generator to calculate the likely execution time. I've been meaning to do something similar with another project..
The post fixup is fine if it's the last addition performed, then the carry from that addition can propagate through to the next column.
Do you mean 65816? I started learning that once, but SuperCPU never really took off, and I tend to ignore most platforms between C64 and modern hardware these days, at least for hobby coding. REU last year was already something of a leap for me. |
Repose
Registered: Oct 2010 Posts: 225 |
Btw, one more idea: squaring and cubing based on this can be optimized significantly as well. |
Repose
Registered: Oct 2010 Posts: 225 |
Test your routine with these magic numbers:
00 00 54 56
00 00 03 03
If my theory is correct, that's the only case that will fail (not really, just the only one I'm bothering to solve for).
It's quite a subtle bug and random values won't catch it, you need 'whitebox' testing.
The failure is rare and of the form that the high byte is higher by 1. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Still works :)
Thanks for the test case.
Here are the first two columns of code (and remember that my g(x)=0x4000-f(x-255) ):
clc
ldy mT2+0
lda (zp_fl0),y
adc (zp_gl0),y
sta mRes+0
ldx#0
lda (zp_fh0),y
adc (zp_gh0),y
bcc *+3
inx
adc (zp_fl1),y
bcc *+3
inx
adc (zp_gl1),y
bcc *+3
inx
ldy mT2+1
adc (zp_fl0),y
bcc *+3
inx
adc (zp_gl0),y
bcc *+3
inx
sbc id+$3f,x
sta mRes+1
The inverse borrow from the final SBC carries forward to the next column; the SBC itself corrects for the false carries while also compensating for the excess 0x40 in the high byte of the g() table. |
Repose
Registered: Oct 2010 Posts: 225 |
Oh, I know why it works, I constructed those special values for the normal sense, I mean
54 56
03 03
-----
01 xx
00 ff
00 ff
The whole point was to get those 3 partials to be added, ff+ff+01. Where you are adding with offset, I have to construct the multipliers differently. Not only that, but I'm doubly wrong here - I need to find multipliers which cause the f(x)'s to result that way (where my example works only on the production of f()-g()).
I'll have to finish this later. In the meantime, I suggest you test every possible 16x16. Not so easy I know, I had to write such things in a 6502 simulator in C, actually just simulated the few instructions I needed, but there's a source code out there you could use for a full simulator. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I'm going to have to think some more about how to synthesise the equivalent test case.
I did start coding an exhaustive test in 6502 (can determine the required result just by repeated adds; a*(b+1)=a*b+a), then realised it wasn't 2**16 test cases but 2**32. Even at 30x realtime that would take VICE 28 hours assuming 700 cycles per iteration..
Repose
Registered: Oct 2010 Posts: 225 |
00 01 02 03 * 04 05 06 07 and manipulate the tables to what you want to test adds for every branch, and number of carries per column up to 14, think that should do it. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Had a thought this morning - the difference of squares is already well established, the only thing that really needs testing is the carry handling for each column. I'll post about that over at sets of add/sub shortly. |
Repose
Registered: Oct 2010 Posts: 225 |
That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column. |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quote: That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column.
Fair point - I guess I got distracted by your talk of table manipulation.
Posting some analysis of the individual carries in the other thread shortly.
But back to multiplies - I was curious as to how you got away with not offsetting the g() table, then it finally struck me - using SBC instead of ADC is exactly equivalent to doing an ADC of a $ffff-g() table.
Do you have working code yet? I would expect you to need a different offset for each column. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Just about to work out the subs, though I'm sure it works in some equivalent way, I'm thinking at most a sec or clc when switching between runs of adds and runs of subs. You can do one fixup at the end. The way I'm doing it makes sense too. No offsets needed.
(ps why did Ice T suddenly flash in my mind singing, no beepers needed?)
Sounds like mine is gonna be a lot cleaner, not to mention faster but we'll see :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations.
How does this compare to my stuff on Codebase? Also unsigned? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
10% faster ain't bad! |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes.
Nice!!! Haven't checked in detail - same table space overhead? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Thanks!
An extra 256 bytes for the id table, so five pages of tables altogether. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Actually, scratch that; for the 16x16->32 case, I only ever read from 13 bytes of the identity table.
Make that 2061 bytes of tables required. Also 16 bytes of zero page for pointers. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Ok, I don't know who won because I'm counting things differently. I've always assumed branches are taken equally and averaged them. By that method, JackAsser's is clearly 233+12 for jsr/rts = 245.
One of the alternate versions I already have written is 207+12 = 219.
Yet you reported 241 for JackAsser's, which is 3 less, and if you scale the same way, I get 219 for yours - exactly the same. As far as I can tell, we're tied.
They are not the same at all. My alternate version is very straightforward and doesn't use the repeated add technique. Still working on that (sorry I'm so slow). We're tied, but this isn't even my final entry.
JackAsser's
setup 74
mults 116
adds 43
total 233
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I hadn't counted the cycles in the source yet, so I just timed both my and JackAsser's routines with CIA for ten randomly selected pairs of numbers.
Now that I've annotated my source, if I assume branches are taken equally and page crossings on table lookups also cost on average half a cycle, I get 206.5+12=218.5
So yes, ridiculously close at the moment given the different approaches I gather we've taken.
I can shave another two cycles off mine, but only if I bump the memory back up to 5 pages of tables again. (or rather, I need to place a 7 byte table somewhere in the last 64 bytes of a page… long story..)
I gather you have something better in the wings; I look forward to seeing it! |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
My current best:
mul1616
lda mT1+ 0 ; 3
sta zp_fl0 ; 3
sta zp_fh0 ; 3
eor#255 ; 2
sta zp_gl0 ; 3
sta zp_gh0 ; 3
lda mT1+ 1 ; 3
sta zp_fl1 ; 3
sta zp_fh1 ; 3
eor#255 ; 2
sta zp_gl1 ; 3
sta zp_gh1 ; 3
clc ; 2
ldy mT2+0 ; 3
lda (zp_fl0),y ; 5.5
adc (zp_gl0),y ; 5.5
sta mRes+0 ; 3
ldx#0 ; 2
lda (zp_fh0),y ; 5.5
adc (zp_gh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_fl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
ldy mT2+1 ; 3
adc (zp_fl0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
sbc ofste_3f,x ; 4
sta mRes+1 ; 3
txa ; 2
ldx#$bf ; 2
adc (zp_fh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gh0),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_fl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gl1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
ldy mT2+0 ; 3
adc (zp_fh1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
adc (zp_gh1),y ; 5.5
bcc *+3 ; 2.5
inx ; 1
sbc ofste_80-$bf,x ; 4
sta mRes+2 ; 3
txa ; 2
ldy mT2+1 ; 3
adc (zp_fh1),y ; 5.5
clc ; 2
adc (zp_gh1),y ; 5.5
sta mRes+3 ; 3
rts
; total=204.5
ofste_3f
.byt $3f,$40,$41,$42,$43,$44
.dsb <$bf-*,0
ofste_80
.byt $80,$81,$82,$83,$84,$85,$86
(and of course a one off init of the high bytes of the zero page pointers, and square tables as follows:
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
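A quick host-side check of those table definitions (Python; it assumes the pointer low bytes are the multiplier a and its complement, as in the listing above, so the lookups land on f[a+b] and g[255-a+b]):

```python
# Quarter squares give f(a+b) - f(b-a) = a*b exactly (the two
# arguments always have the same parity, so the //4 floors cancel).
# g folds in a $4000 bias, which the ofste_* tables remove later.
f = lambda x: x * x // 4
g = lambda x: (0x4000 - f(x - 255)) & 0xffff

for a in range(256):
    for b in range(256):
        s = (f(a + b) + g(255 - a + b)) & 0xffff
        assert s == (a * b + 0x4000) & 0xffff
```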
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeSame, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.
Haha, then I should probably acknowledge that my ~750 cycle 32x32->64 requires 32 bytes of zero page pointers, on top of the 16 bytes of zero page split between the multiplier, multiplicand and result :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Wow, you really squeezed that dry - good job! Even though my approach is different (in the runs of adds vein), it follows the same pattern as yours, and thus will turn out almost exactly the same time.
There's other approaches to faster adds though. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I think I have 195, but it requires a different approach. You're going to have to wait another day though :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Finally, after fixing a bad sqrtab...
The World's Fastest Published 16x16 Unsigned Mult on 6502.
Testing methodology
-------------------
To measure a particular multiply, you can enter the monitor in vice and type (where pc is your start address):
r pc=c000
Then type:
z
and keep hitting enter to step through the code. Copy and paste the first line. When you've reached the line *after* the last you want to measure, copy and paste that (also include any points in between). You can now subtract the two to get the timing.
Example:
LDA $FB - A:00 X:0A Y:00 SP:eb ..-..IZ. 72210316
CLC - A:00 X:0A Y:00 SP:eb ..-..IZC 72210476
RTS - A:00 X:0A Y:00 SP:eb ..-..IZ. 72210520
This shows me that my multiply body was 160 cycles, and the adds were 44 cycles, for a total of 204 (nevermind the slow times, I had nothing in zp for testing purposes).
I used that as a guide, but really I added by hand and averaged page crossings and branches for the reported total.
Tell me the speed
-----------------
158 cycles for the multiply part, with no variation, and the inputs/outputs in zp.
43 cycles for the final additions, with each branch equally likely.
The total is 201. However, if you include the simple variation which requires that part of the code is in zp, you save 3 cycles for a total of 198 (I just wanted to say I could break 200).
-add 12 for jsr/rts. I report this way to be consistent with CJ's results above-
The Code
--------
;World's fastest 16x16 unsigned mult for 6502
;you can go faster, but not without more code and/or data
;and being less elegant and harder to follow.
;by Repose 2017
;tables of squares
;sqr(x)=x^2/4
;negsqr(x)=(255-x)^2/4
sqrlo=$c000;511 bytes
sqrhi=$c200;511 bytes
negsqrlo=$c400;511 bytes
negsqrhi=$c600;511 bytes
;pointers to square tables above
p_sqr_lo=$8b;2 bytes
p_sqr_hi=$8d;2 bytes
p_invsqr_lo=$8f;2 bytes
p_invsqr_hi=$91;2 bytes
;the inputs and outputs
x0=$fb;multiplier, 2 bytes
x1=$fc
y0=$fd;multiplicand, 2 bytes
y1=$fe
z0=$80;product, 4 bytes
z1=$81
z2=$82
z3=$83
;not shown is a routine to make the tables
;also you need to init the pointers' high bytes to the tables
umult16:
;set multiplier as x0
lda x0
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17
ldy y0
sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y;note these two lines taken as 11 total
sta z0;x0*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c1a+1;x0*y0h;31
;c1a means column 1, row a (partial product to be added later)
ldy y1
;sec ;notice that the high byte of sub above is always +ve
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1b+1;x0*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2a+1;x0*y1h;31
;set multiplier as x1
lda x1
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17
ldy y0
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1c+1;x1*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2b+1;x1*y0h;31
ldy y1
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c2c+1;x1*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta z3;x1*y1h;31
;4*31+2*17 so far=158
;add partials
;-add first two numbers in column 1
;jmp do_adds;put in zp to save 3 cycles :)
do_adds:
clc
c1a lda #0
c1b adc #0;add first two rows of column 1
sta z1;9
;-continue to first two numbers in column 2
c2a lda #0
c2b adc #0
sta z2;7
bcc c1c;3 taken/9 not taken, avg 6
inc z3
clc
;-add last number of column 1 (row c)
c1c lda #0
adc z1
sta z1;8
;-add last number of column 2
c2c lda #0
adc z2
sta z2;8
bcc fin;3/7 avg 5
inc z3
;9+7+6+8+8+5=43
fin rts
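A host-side model of the arithmetic (Python, not cycle-accurate; `mul8` and `umult16` are my names): each 8x8 partial is formed exactly as the sec/sbc lookup pairs above compute it, and the column adds are checked against a plain 32-bit product.

```python
# With multiplier m in the pointer low bytes, (p_sqr),y reads
# sqr[m+y] and (p_invsqr),y reads negsqr[255-m+y]; their 16-bit
# difference is m*y by the quarter-square identity.
def sqr(t):    return t * t // 4
def negsqr(t): return (255 - t) ** 2 // 4
def mul8(m, y):
    return sqr(m + y) - negsqr(255 - m + y)

def umult16(x, y):
    x0, x1 = x & 0xff, x >> 8
    y0, y1 = y & 0xff, y >> 8
    # four partials, combined per the column adds in do_adds
    return (mul8(x0, y0)
            + ((mul8(x0, y1) + mul8(x1, y0)) << 8)
            + (mul8(x1, y1) << 16))

assert all(mul8(a, b) == a * b for a in range(256) for b in range(256))
assert umult16(0xdead, 0xbeef) == 0xdead * 0xbeef
```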
|
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Repose, really nice!
Some further optimization:
sta c2b+1;x1*y1h;31 => tax
c2a lda #0
c2b adc #0 =>
c2b txa
c2a adc #0
But I somehow like the fact that X is kept clean otoh. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Also if z3 is assume in the y-reg when done instead:
sta z3;x1*y1h;31 => tay
all inc z3 => iny
I think that's ok since you'd probably do some lda z3 afterwards anyway, and instead you have it readily available in Y. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Thanks :) I screwed up my timing though, the first multiply part is 33 not 31 because of the single SEC, bringing my total to 203 (and VICE was right, the multiply part is 160). Because of this, I think I'll use your optimizations to bring it back down to 201. Therefore it still stands :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Also in a real life situation, e.g. subpixel vectors, you'd only keep z3 and z2 as a screen limited 8.8 result. Typically one would do:
Rotated x,y,z in 8.8
Rotated z in 8.8
Reciprocal z = 1/z = 0.16
Perspective:
8.8 * 0.16 = 8.24 (keep only the top 8.8, i.e. pixel and the bresenham initial error) |
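The 8.8 × 0.16 step can be modelled host-side. A Python sketch of the idea only - the scaling and rounding choices here are illustrative, not taken from JackAsser's code, and it assumes z > 1 so the 0.16 reciprocal fits:

```python
# Perspective divide as multiply-by-reciprocal in fixed point:
# coord is 8.8, rz = round(65536/z) is 0.16, the product is 8.24;
# keeping the top 16 bits yields the 8.8 screen coordinate.
def project(coord_8_8, z):
    rz_0_16 = round(65536 / z)          # reciprocal as 0.16
    return (coord_8_8 * rz_0_16) >> 16  # 8.24 -> keep top 8.8

x = int(3.5 * 256)                       # 3.5 in 8.8
assert project(x, 2) == int(1.75 * 256)  # 3.5 / 2 = 1.75
```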
| |
Repose
Registered: Oct 2010 Posts: 225 |
Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).
So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.
Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.
Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).
So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.
Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.
Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.
I've had similar ideas and I even did an frame difference experiment in Jave. Problem was that since diffs are small so will the triangles be. They'll be extreme, sharp and 'problematic'. Hard to render correctly and since they're diffs any render error will accumulate.
Regarding transforms I came to the conclusion to cut most of the stack and forget about how it works conventionally.
Regarding divs we all do mul by the reciprocal. For Andropolis I did what Graham did and calced the reciprocal by linear interpolation:
X is 8.8 call it a.b:
1/a.b ~= invtab[a]*(1-b) + invtab[a+1]*b
invtab[x] is 0.16 result of 65536/x for x 1..255 |
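A host-side check of that interpolation (Python; the error bound asserted at the end is my own estimate, not Graham's):

```python
# invtab[a] = 65536//a (0.16 reciprocal); interpolate between
# invtab[a] and invtab[a+1] by the fraction b/256.
invtab = [0] + [65536 // a for a in range(1, 256)]

def recip_8_8(a, b):                  # 1/(a + b/256) as 0.16
    return (invtab[a] * (256 - b) + invtab[a + 1] * b) // 256

# linear interpolation of a convex function overestimates, with
# relative error ~1/(4*a^2); table truncation dominates at large a.
# For a >= 16 the combined relative error stays under 1% here.
worst = 0.0
for a in range(16, 255):
    for b in range(256):
        exact = 16777216 / (256 * a + b)   # 65536/(a + b/256)
        worst = max(worst, abs(recip_8_8(a, b) - exact) / exact)
assert worst < 0.01
```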
| |
Repose
Registered: Oct 2010 Posts: 225 |
You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.
I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing? |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.
I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?
Hehe, I suck at writing articles! :)
Imagine the simplest case: axis aligned rooms, no height differences and just rotate around Y, i.e. Wolfenstein.
The diff formed by the edge between the wall and the floor/ceiling will be an elongated thin triangle. The diff might be 3-4 pixels high close to the camera - this is easy - but further down the corridor the height of the diff will be <1, yet pixels here and there must still be put. Surely an extended Bresenham that "draws" both the top and the bottom line simultaneously could handle it, calculating the total height between the lines while still propagating individual errors. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Ok, I think that makes sense now.
Now 196
I managed to save another 7 cycles from the correct total of 203, bringing it down to 159+37=196. As a bonus, the two highest bytes are returned in registers.
If I put do_adds in zp, there's one more optimization to save 4 instead of 3, for 192. Finally, if you don't need the lowest bytes, you can save 3 cycles by deleting sta z1.
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay;x1*y1h;Y=z3, 30 cycles
do_adds:
;-add the first two numbers of column 1
clc
c1a: lda #0
c1b: adc #0
sta z1;9
;-continue to first two numbers of column 2
c2a: lda #0
c2b: adc #0
tax;X=z2, 6 cycles
bcc c1c;3/6 avg 4.5
iny;z3++
clc
;-add last number of column 1
c1c: lda #0
adc z1
sta z1;8
;-add last number of column 2
txa;A=z2
c2c: adc #0
tax;X=z2, 6
bcc fin;3/4 avg 3.5
iny;z3++
;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin: rts
|
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Damn, so you're now 8.5 cycles faster than me? I was not expecting partial products to be faster than the optimisations we've been working on for runs of adc. Going to have to study this more closely.
Canada holds the record, for now. Nice work! |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?
Adding in one go would lose the carry information |
| |
Repose
Registered: Oct 2010 Posts: 225 |
To be fair, CJ can save 2 cycles by returning in regs like me, then they are just 6.5 or 10.5 cycles apart. Most of our instructions are the same, there's just a difference in overhead. But I can't seem to find any idea that's faster; this could be the actual optimal routine.
I tried all branches for the adds, that's a bit slower. I mean code like:
;(two-column listing: left = carry-clear path, right = carry-set path)
clc
lda c1a
adc c1b
bcs c1bc          c1bc:
clc
adc c1c           adc c1c
sta z1            sta z1
                  lda #1
lda c2a           adc c2a
bcs c2ac;7        c2ac:
clc
adc c2b           adc c2b       adc c2b
And it did seem to work, but I get about 39.5 cycles.
I also looked into the crazy timer idea, but each correction is sbc timerA, which is already too slow for this small number of adds.
I have one idea that can be dramatically faster but it's not very practical.
;y0 in X, y1 in Y, x0-1, x1-1 init to $02, all ram swapped in
x0=$fc;fb-fc is pointer
x1=$fe;fd-fe is pointer
jmp (x0-1)
mult-k:
;multiply by constant multiplier k
;and multiplicands X, Y
;then add
rts
There's never any setups so that saves 34 cycles, many of the multipliers can be optimized, though not reducing the average much, important cases like 0,1, and powers of 2 could be faster.
You also need room for tables so it could only handle 7 bits at a time, unless you did each one in pure code.
Even with such an extreme approach, the worst case is probably close to a normal multiply.
What do you think, is this the end? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Well, without memory restrictions you could do crazy stuff like have variants of the f() or g() tables offset by 1..N so you could move the carry correction into selecting which table to use... but I'm not sure how much that would gain you. |
| |
Strobe
Registered: Jun 2007 Posts: 6 |
Thanks Repose, ChristopherJam etc for a great 16x16 routine!
I only really needed a 8x16=24 multiply so I've been hacking Reposes one down to what I settled on below.
Uses the table & ZP setup of the original.
At ~89 cycles (?) and 54 bytes it's twice as fast and half the size of the 16x16 one. Also doesn't touch X.
See my thoughts on tweaks in the comments at the end, but I'd like to get it faster and feel there are some obvious improvements staring me in the face which I'll be happy for others to point out :)
;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
;set multiplier as x0
lda x0 ;comment out and call with A=x0 -2b3c
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17
ldy y0 ;comment out and call with Y=y0 -2b3c
sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y;note these two lines taken as 11 total
sta z0;x0*y0l comment out if you don't care about z0, -2b3c, OR
lda (p_sqr_hi),y ; ..replace with TAX -1b1c (but destroys X obviously)
sbc (p_invsqr_hi),y
sta c1a+1;x0*y0h;33
;c1a means column 1, row a (partial product to be added later)
ldy y1
;sec ;notice that the high byte of sub above is always +ve
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1b+1;x0*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay ;Y=c2a;x0*y1h; 29 cycles
;17+33+29=79 cycles for main multiply part
do_adds:
;-add the first two numbers of column 1
clc
c1a: lda #0
c1b: adc #0 ;A=z1 6 cycles
;-continue to column 2
bcc fin ;2+
iny ;add carry
;Y=z2, 4-5 cycles (avoiding page boundary cross)
;6+(4+)=10-11 cycles for adder
;total 89-90 cycles
fin: rts
;Diagram of the additions for 16x8=24
; y1 y0
; x x0
; --------
; x0y0h x0y0l
;+ x0y1h x0y1l
;------------------------
; z2 z1 z0
;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c (*)
; OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
; suggests because you would lose the 2 cycles you gained with the JMP, BUT..
; ..you could put the ENTIRE ROUTINE in zero page, -2b2c
; AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
; with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
; extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
; (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Not quite on topic (sorry), but i often find myself needing a "downscaling MUL": 8x16=16, with the 8-bit factor scaling the 16-bit argument by [0..1). (Basically, the lowmost 8 bits of the 24-bit result are discarded.)
Then i re-invent the wheel every time, but it seems to get rounder with each iteration. =)
What would the required resources (cycles, memory) for this case in your optimised routine be? |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
It's there - he writes "comment out sta z0" if you only need 16 bits. "Experts" could comment out the whole calc of z0; it's just 1 bit of precision loss out of 16. Depends on what you use it for whether that 1 bit is needed or not. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I'm back! What a coincidence, I decided to work on this again. Changelog:
-Using advanced ACME assembler
-Macros for calling/return conventions, lowered precision
-Improved label names, formatting/style, comments
-Proper test suite
-small unsigned 8x8=16, fast unsigned 8x8=16, fast unsigned 16x16=32. Will add fast unsigned 8x16=24, fast unsigned 8x16=16
-All routines separated to library, include what you want with options you want
Currently working on BASIC multiply replacement, just for fun. How fast will a FOR/NEXT loop be? |
| |
Repose
Registered: Oct 2010 Posts: 225 |
This is beta, needs cleaning up, but should work. It's down to 187 cycles apparently. This isn't the more flexible library version, just what I have that works right now.
; unsigned integer multiplication library
x0=$fb
x1=$fc
y0=$fd
y1=$fe
p_sqr_lo=$8b ;2 bytes
p_sqr_hi=$8d ;2 bytes
p_negsqr_lo=$a3 ;2 bytes
p_negsqr_hi=$a5 ;2 bytes
sqrlo=$c000 ;511 bytes
sqrhi=$c200 ;511 bytes
negsqrlo=$c400 ;511 bytes
negsqrhi=$c600 ;511 bytes
!macro mult8_snippet mand,low,high {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, high
;note: it's up to the caller to SEC, as some can be left out
ldy mand
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
sta high ;multiplier * Y, high byte
;11+4+11+4 = 30 cycles
}
!macro mult8_reg_snippet mand,low,reg {
;with a multipler stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, register
;note: it's up to the caller to SEC, as some can be left out
!if mand="Y" {
} else {
ldy mand
}
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
;multiplier * Y, high byte
!if reg="X" {
tax
} else if reg="Y" {
tay
}
;11+4+11+2 = 28 cycles
}
!macro mult8_set_mier_snippet mier {
;set multiplier as mier
lda mier
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_negsqr_lo
sta p_negsqr_hi ;17 cycles
}
makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta p_sqr_lo+1
lda #>sqrhi
sta p_sqr_hi+1
lda #>negsqrlo
sta p_negsqr_lo+1
lda #>negsqrhi
sta p_negsqr_hi+1
;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
lb1: tya
adc #$00
ml1: sta sqrhi,x
tay
cmp #$40
txa
ror
ml9: adc #$00
sta ml9+1
inx
ml0: sta sqrlo,x
bne lb1
inc ml0+2
inc ml1+2
clc
iny
bne lb1
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
mt1:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne mt1
rts
umult16:
;unsigned 16x16 mult
;inputs:
; x0 x1
; y0 y1
;outputs:
; z0 z1 A Y
;set multiplier as x0
+mult8_set_mier_snippet x0
;z0 = low(x0*y0)
;X = high(x0*y0)
sec
+mult8_reg_snippet y0,z0,"X"
;x0_y1l = low(x0*y1)
;x0_y1h = high(x0*y1)
+mult8_snippet y1,x0_y1l+1,x0_y1h+1
;set multiplier as x1
+mult8_set_mier_snippet x1
;x1_y0l = low(x1*y0)
;x1_y0h = high(x1*y0)
+mult8_snippet y0,x1_y0l+1,x1_y0h+1
;x1_y1l = low(x1*y1)
;Y = high(x1*y1)
+mult8_reg_snippet y1,x1_y1l+1,"Y"
;17+2+28+30+17+30+28 = 152 cycles for main multiply part
;add partials
;add the first two numbers of column 1
clc
x0_y0h: txa
x0_y1l: adc #0
sta z1 ;9 cycles (z1 zp)
;continue to first two numbers of column 2
x0_y1h: lda #0 ;includes carry from previous column
x1_y0h: adc #0
tax ;X=z2 so far, 6 cycles
bcc x1_y0l ;9/12 avg 10.5 cycles
;Y=x1*y1h
iny ;x1y1h/z3 ++
clc
;add last number of column 1
x1_y0l: lda #0
adc z1 ;contains x0y0h + x0y1l
sta z1 ;8 cycles
;add last number of column 2
txa ;contains x0y1h + x1y0h
x1_y1l: adc #0
;A=z2
bcc fin ;7/8 avg 7.5
iny ;z3++
;Y=z3
;Y=z3, A=z2
;9+10.5+8+7.5=35
;total = 152+35 = 187 cycles
sta z2 ; needed calling conventions for test routine
tya
fin: rts |
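A Python reference model of the four tables (the asm generator above should produce exactly these bytes; this is a cross-check sketch, the variable names are mine):

```python
# sqr(x) = x^2/4 and negsqr(x) = (255-x)^2/4, 511 entries each,
# split into low/high bytes the way the four tables are laid out.
sq  = [x * x // 4 for x in range(511)]
nsq = [(255 - x) ** 2 // 4 for x in range(511)]
sqrlo,    sqrhi    = [v & 0xff for v in sq],  [v >> 8 for v in sq]
negsqrlo, negsqrhi = [v & 0xff for v in nsq], [v >> 8 for v in nsq]

# identity used by mult8_snippet: pointer bases are offset by the
# multiplier m and by m EOR $ff, so the lookups land on sq[m+y]
# and nsq[255-m+y]; the 16-bit subtraction is exactly m*y
for m in range(256):
    for y in range(256):
        p = ((sqrlo[m+y] | (sqrhi[m+y] << 8))
             - (negsqrlo[255-m+y] | (negsqrhi[255-m+y] << 8)))
        assert p == m * y
```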
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Great stuff Repose. How convinient it is that I’m working on 3D stuff now… ;) |
| |
Quiss
Registered: Nov 2016 Posts: 43 |
Nice work! I like the use of (ind), y for the fast inlined multiply, instead of the 4 x "sta zp+x"; "jsr zp" that I've seen some demos use.
Also, JackAsser working on 3D stuff makes me nervous. I still haven't digested the hidden line vectors in 1991. |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
This thread is an interesting coincidence as I've been thinking about C64 3D again. :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I wrote an alpha 16x16=32 unsigned, it's 1024 cycles, but if you inline the jsr umult16, it should be 930 cycles. Now I'm exploring not using umult16's at all, but to consider it as a large set of 8bit multiplies; the point is to optimize setting the multiplier (the high bytes of f(x)). I also am playing with the "zip" adds (where I add from right to left then go back to the right, next row), and optimizing register use, basically tax:txa for saving your spot in the beginning column.
The other news is, by using a gray code for the order of operations, I saved a ldy so my umult16 is 195 now.
I have even better ideas; using Knuth's equivalence I can do just 3xumult16.
cont'd... |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Knuth's idea
Given x=x1*256+x0, y=y1*256+y0
y1 y0
*x1 x0
Calculation Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1 0-FE01
m=y0*x0 0-FE01
n=l+m-k 0-1FC02
l+m 0-1FC02
l-k or m-k +-FE01
x*y=l*65536+n*256+m (for 8 bit Xn/Yn)
Example (16x16 bits):
y1 y0 ->$20 10
*x1 x0 40 30
k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00
llll0000 -> 08000000
 nnnnn00 -> 000a0000
    mmmm -> 00000300
            --------
2010*4030 = 080a0300
x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)
example with 16-bit values:
2000 1000
4000 3000
k=1000*1000=0100 0000
l=2000*4000=0800 0000
m=1000*3000=0300 0000
n=l+m-k    =0a00 0000
0800 0000 0000 0000
0000 0a00 0000 0000
0000 0000 0300 0000
-------------------
0800 0a00 0300 0000
If multiplies are expensive, this is faster. Estimating 32x32=64;
3x umult16 ~585
adds ~94
total ~679
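The identity is quick to confirm host-side (Python, with `knuth16` as an illustrative name; exhaustive over the subtractive-Karatsuba algebra via random 16-bit pairs):

```python
import random

# k=(y1-y0)*(x1-x0), l=y1*x1, m=y0*x0, n=l+m-k
# expands to n = x0*y1 + x1*y0, so l*65536 + n*256 + m = x*y,
# at the cost of three multiplies instead of four.
def knuth16(x, y):
    x1, x0 = x >> 8, x & 0xff
    y1, y0 = y >> 8, y & 0xff
    k = (y1 - y0) * (x1 - x0)   # may be negative
    l = y1 * x1
    m = y0 * x0
    n = l + m - k
    return l * 65536 + n * 256 + m

assert knuth16(0x2010, 0x4030) == 0x080a0300   # the worked example
for _ in range(10000):
    x, y = random.randrange(65536), random.randrange(65536)
    assert knuth16(x, y) == x * y
```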
|
| |
Repose
Registered: Oct 2010 Posts: 225 |
About gray codes, ref.
https://en.wikipedia.org/wiki/Gray_code#Constructing_an_n-bit_G..
The use of gray codes is to minimize setting the multiplier/multiplicand, i.e.
;set multiplier as mier
lda mier ;3
sta p_sqr_lo ;3 (often published as f(x))
sta p_sqr_hi ;3
eor #$ff ;2
sta p_negsqr_lo ;3 (also known as g(x))
sta p_negsqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
and
;set multiplicand
ldy mand ;3
In the following table, order 01 would mean change y first. Across the top are the starting variables; i.e. 00 means start with setting x0 and y0 as m'ier and m'and.
        start xy
order   00    01    10    11
  01    00    01    10    11
        01    00    11    10
        11    10    01    00
        10    11    00    01
  10    00    01    10    11
        10    11    00    01
        11    10    01    00
        01    00    11    10
The gray code sequence can be best understood by following these values in a circle; with a starting point and a direction of clockwise or counter-clockwise:
01
00 11
10
There are two ways to end with the high bytes: start with x0*y1 and change y first, or start with x1*y0 and change x first. The reason I want to end with the high bytes is to return them in registers, and these bytes are most useful if you only want 16x16=16.
x0*y0
x0*y1 ;change y first i.e. ldy y1
x1*y1 ;change x with +set_mier x1
x1*y0
x0 y0
x1 y0 ;change x first
x1 y1
x0 y1
x0 y1
x0 y0 ;change y first
x1 y0
x1 y1
x0 y1
x1 y1 ;change x first
x1 y0
x0 y0
x1 y0
x1 y1 ;change y first
x0 y1
x0 y0
x1 y0
x0 y0 ;change x first
x0 y1
x1 y1
x1 y1
x1 y0 ;change y first
x0 y0
x0 y1
x1 y1
x0 y1 ;change x first
x0 y0
x1 y0
This is all assuming you are storing each partial result into a self-mod summing routine to come after, in which case the order of operations of the multiplies doesn't matter. If you were adding the result during the multiples, you couldn't make use of this trick to minimize setting of the m'ier. |
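The saving can be counted host-side. A small Python sketch, using the cycle weights from the snippets above (17 for a multiplier rewrite, 3 for an ldy); `setup_cost` and the pair encoding are mine:

```python
from itertools import permutations

# Each product is a pair (x half, y half). Moving to the next product
# costs a 17-cycle set_mier when x changes and a 3-cycle ldy when y
# changes.
def setup_cost(order):
    cost = 17 + 3                        # initial multiplier + multiplicand
    for (xa, ya), (xb, yb) in zip(order, order[1:]):
        cost += (17 if xa != xb else 0) + (3 if ya != yb else 0)
    return cost

products = [(0, 0), (0, 1), (1, 0), (1, 1)]
costs = {p: setup_cost(p) for p in permutations(products)}
gray = ((0, 0), (0, 1), (1, 1), (1, 0))  # change one operand per step
# a gray-code order needs one set_mier and three ldy's: 20+17+3+3 = 43
assert costs[gray] == min(costs.values()) == 43
```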
| |
Repose
Registered: Oct 2010 Posts: 225 |
Survey of multiply methods/ideas
If I really want to explore every way that could be the fastest, here's a list of ideas:
identity mults
1 - sin(a)*cos(b)=(sin(a+b)+sin(a-b))/2
2 - cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
3 - sin(a)*sin(b)=(cos(a-b)-cos(a+b))/2
4 - a*b=exp(log(a)+log(b))
5 - a*b = [(a + b)² - (a - b)²]/4
6 - k=(y1-y0)*(x1-x0), l=y1*x1, m=y0*x0, n=l+m-k, x*y=l*65536+n*256+m
Karatsuba mult
y1 y0
x1 x0
-----
z2=x1*y1
z0=x0*y0
z1=(x1+x0)*(y1+y0)-z2-z0
Booth multiply
Booth-2 is (n+1)=3 bits a time, overlapping by 1
which is how many?
aaa
bbb
ccc
dd_
or 4
nybble mult
use a table of every a*b where a,b are 4-bit numbers.
residue number system
a set of modulo numbers can represent any number
you can add the residues separately without carry.
example of a (3,5) residue number system, supports 0-14:
n x1 x2
0 0 0
1 1 1
2 2 2
3 0 3
4 1 4
5 2 0
6 0 1
7 1 2
8 2 3
9 0 4
10 1 0
11 2 1
12 0 2
13 1 3
14 2 4
you can pick better moduli; like 2^k and 2^k-1, like 256 and 255 which are co-prime. It encodes 65280 values.
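A host-side sketch of the (256, 255) system (Python; the reconstruction uses the fact that 256 ≡ 1 mod 255, which makes the CRT solve trivial - the function names are illustrative):

```python
M0, M1 = 256, 255        # co-prime moduli, encoding 256*255 = 65280 values

def to_rns(n):           # residues can be added independently, no carries
    return n % M0, n % M1

def from_rns(r0, r1):    # CRT: n = r0 + 256*t with t = (r1 - r0) mod 255
    return r0 + M0 * ((r1 - r0) % M1)

a, b = 12345, 40000
ra, rb = to_rns(a), to_rns(b)
s = ((ra[0] + rb[0]) % M0, (ra[1] + rb[1]) % M1)   # carry-free add
assert from_rns(*s) == (a + b) % 65280
```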
REU multiply
Address Description
------- -----------
$de00 256 bytes of data (See $dfc0-$dfc3 for more information)
$df7e write to this location to activate the RAMLink hardware
$df7f write to this location to deactivate the RAMLink hardware.
$dfa0 lo byte of requested RAMCard memory page
$dfa1 hi byte of requested RAMCard memory page
$dfc0 write to this location to show RL variable RAM at $de00 (default)
$dfc1 write to this location to show RAMCard memory at $de00
$dfc2 write to this location to show the RAM Port device $de00 page at $de00
$dfc3 write to this location to show Pass-Thru Port dev. $de00 page at $de00
;8bit mult, of x*y to z0;z1
rl_page=$de00
rl_sel_ramcard=$dfc1
rl_lo=$dfa0
rl_hi=$dfa1
rl_activate=$df7e
rl_deactivate=$df7f
sta rl_activate;4
stx rl_lo;4
lda #0;2
sta rl_hi;4
sta rl_sel_ramcard;4
lda rl_page,x;5?
sta z0;3
inc rl_hi;is this r/w? 6
lda rl_page,x;5
sta z1;3
sta rl_deactivate;4
rts
total 42
sine mult example
cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
x cos(x)
-pi/2 0
0 1
pi/2 0
pi -1
3pi/2 0
2pi 1
example, .5 * .75
cos-1(.5)=1.0471975511965977461542144610932
cos-1(.75)=0.72273424781341561117837735264133
a+b=1.7699317990100133573325918137345
a-b=0.32446330338318213497583710845183
cos(a+b)=-0.197821961869480000823505899216
cos(a-b)=0.947821961869480000823505899216
cos(a+b)+cos(a-b)=.75
/2=.375
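The worked example checks out numerically (Python, floats only - a table-based 6502 version would quantize all of these values):

```python
import math

# cos(a)*cos(b) = (cos(a+b) + cos(a-b))/2, evaluated for .5 * .75
a = math.acos(0.5)
b = math.acos(0.75)
product = (math.cos(a + b) + math.cos(a - b)) / 2
assert math.isclose(math.cos(a + b) + math.cos(a - b), 0.75)
assert math.isclose(product, 0.5 * 0.75)
```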
Convolution mult
Lost my notes here - something starting with W... a lot of adds for a few mults.
https://en.wikipedia.org/wiki/Convolution#Fast_convolution_algo..
Neural net multiplication
not much of a gain
direct massive lookup tables
various low-level code optimizations
illegal opcodes
order of operations
branch-tree optimizations (keeping state in the code location, i.e. branch of every possible multiplier etc.)
Applications of multiplication
-fast way to arrange complex multiply
-fast matrix multiply
-using inverse tables to divide
Different types of number systems
-modular arithmetic, Chinese Remainder system
-storing numbers as primes
-keeping numbers as fractions
-2's complement, 1's complement, offset |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
I've been pondering a hardware multiply...
Write control byte: xy
y = bytes for operand 1
x = bytes for operand 2
Write bytes in MSB order for operand 1, then operand 2.
Then read bytes in MSB order for multiply result.
For example:
Write hex: 21 47 23 01
Read: b5 50 00 |
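A minimal Python model of the proposed port, under my reading of the control byte (low nibble = operand 1 width, high nibble = operand 2 width, operands and result MSB-first); the function name and framing are mine:

```python
# Model of the proposed hardware-multiply port: control byte $xy gives the
# operand widths, operand bytes follow MSB-first, and the product reads
# back MSB-first with n1+n2 bytes.
def hw_multiply(written):
    ctrl = written[0]
    n1 = ctrl & 0x0f            # y nibble: bytes in operand 1
    n2 = ctrl >> 4              # x nibble: bytes in operand 2
    op1 = int.from_bytes(bytes(written[1:1 + n1]), 'big')
    op2 = int.from_bytes(bytes(written[1 + n1:1 + n1 + n2]), 'big')
    return list((op1 * op2).to_bytes(n1 + n2, 'big'))
```

Under this model, writing hex 11 0c 0d (two 1-byte operands, 12*13) would read back 00 9c.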
| |
Repose
Registered: Oct 2010 Posts: 225 |
Me too, just yesterday!
https://media.digikey.com/pdf/Data%20Sheets/Parallax%20PDFs/uM-..
The uM-FPU is a 32-bit floating point coprocessor that can be easily interfaced with microcontrollers to provide
support for 32-bit IEEE 754 floating point operations and long integer operations. The uM-FPU is easy to
connect using either an I2C or SPI compatible interface.
Which can be connected to the user port. I believe the serial port would work, then it's 16 cycles per byte. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
The only acceptable arithmetics accelerator is the 6502 in the disk drive. =) |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
Quote: Me too, just yesterday!
https://media.digikey.com/pdf/Data%20Sheets/Parallax%20PDFs/uM-..
The uM-FPU is a 32-bit floating point coprocessor that can be easily interfaced with microcontrollers to provide
support for 32-bit IEEE 754 floating point operations and long integer operations. The uM-FPU is easy to
connect using either an I2C or SPI compatible interface.
Which can be connected to the user port. I believe the serial port would work, then it's 16 cycles per byte.
I was thinking more like a MC68882 :)
But finding something similar operating at 5V, with parallel address/data rather than serial and still in production, is tough. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Some kind of extra SPI to parallel circuit would help. Since the chip I mentioned has 4MHz serial speed, it's fast enough to read a parallel port at full speed.
https://highfieldtales.wordpress.com/2011/12/04/very-fast-spi-t..
In the end, what do we want though? You could make a GPU cart for the c64, which runs its own 2d/3d graphics engine, where you simply define the objects. The motions, display updates, and collisions could all be handled automatically; all you need to do is keep score, play music, run the display mode, and handle the joystick. Just like Blue REU, the output could be stuffed directly into display memory using an advanced FLI mode with practically full color. It would be like a brand new game machine in 160x200x16.
Or, it could be an assist, using memory mapped registers as you describe, and leave the c64 more in control.
I think of this a lot - do I even want to buy a c64 ultimate? It's convenient and fun to play demos and games with full power, but I think I still want to do obscure hardware hacks and real assists. For example, I've always wanted to use the CIA to measure variations in 50/60 Hz power line frequency. I want to attach the serial port to an SD card and bit bang the storage. |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Quoting ReposeSome kind of extra SPI to parallel circuit would help. Since the chip I mentioned has 4MHz serial speed, it's fast enough to read a parallel port at full speed.
https://highfieldtales.wordpress.com/2011/12/04/very-fast-spi-t..
You do already have two CIA synchronous serial ports though, maybe they could be used? |
| |
Repose
Registered: Oct 2010 Posts: 225 |
That's what I suggested earlier - just to be clear, Martin wanted a fast, efficient ALU ability by writing arguments to memory, which would be latched until the LSB was written, then immediately read the product back. I digressed a little by suggesting a GPU cart, which takes the idea of assisted computing to its logical conclusion, and I felt it was a bit pointless by then.
I then suggested to use the serial port to communicate with the SPI port of a proposed ALU (which is still in stock). This would be a bit slow, as I think the SP port could only clock out in 16 cycles.
To overcome that difficulty, one could use a circuit or chip to clock at the full 4MHz of the ALU chip, and leave the result at the parallel port of the CIA, on the Userport, somehow.
To be fair, many functions can be done fairly quickly with large tables, using either the Ramlink or REU, and I wrote sample code for both. There are also methods documented at https://wilsonminesco.com/16bitMathTables/
Large Look-up Tables for Hyperfast, Accurate, 16-Bit Scaled-Integer Math
Plus: The Unexpected Power of Scaled-Integer Math |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Quoting ReposeThat's what I suggested earlier - just to be clear, Martin wanted a fast, efficient ALU ability by writing arguments to memory, which would be latched until the LSB were written, then immediately read the product back. I digressed a little by suggesting a GPU cart, which takes the idea of assisted computing to it's logical conclusion, and I felt it's a bit pointless by then.
I then suggested to use the serial port to communicate with the SPI port of a proposed ALU (which is still in stock). This would be a bit slow, as I think the SP port could only clock out in 16 cycles.
I obviously should have read more of the thread. :)
I think it'll take 32 cycles. 250kbit/s max IIRC. |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
I was thinking a 74LS165 or MCP23S08 might be able to talk to the uM-FPU 32-bit floating point coprocessor and provide an interface to the C64. Either via user port or cartridge port. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Regarding convolutions, I remembered the term I was looking for. There is a fast convolution credited to Winograd. It works out to a lot of additions. I think it might be slow. At the end, I think you have to reduce overflowed digits into a mod plus carry to the next digit. |
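These convolution-style transforms all trade multiplies for additions. As a small sketch of the flavour (this is Karatsuba rather than Winograd's algorithm, but it shows the same trade, including the digit-overflow fixup mentioned above):

```python
# Karatsuba on two base-256 digits: 3 multiplies instead of 4,
# at the cost of extra adds/subs.
def mul16_karatsuba(a, b):
    a0, a1 = a & 0xff, a >> 8
    b0, b1 = b & 0xff, b >> 8
    lo = a0 * b0                            # multiply 1
    hi = a1 * b1                            # multiply 2
    mid = (a0 + a1) * (b0 + b1) - lo - hi   # multiply 3; equals a0*b1 + a1*b0
    # mid can exceed one byte: on 8-bit hardware this is where you reduce
    # the overflowed digit into a mod plus a carry to the next digit
    return lo + (mid << 8) + (hi << 16)
```

For larger operands the same split is applied recursively, and the add count grows while the multiply count shrinks, which matches the "a lot of adds for a few mults" observation.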
| |
Repose
Registered: Oct 2010 Posts: 225 |
I've been working hard on my mathlib. I think I've worked out all the use cases and made completely flexible macros that work together in a system, almost like a macro language. These can produce fully optimal code in an easy way.
Covered:
-squaring
-constant multipliers
-re-using multipliers
-distributed multiplication (a*(b,c))
-chained multiplication (a*b*c)
-zp and self-mod versions; and they are the same speed
-use cases for multiple pointers
-cycle timing variables that respond to all variations
-automatic carry flag tracking/setting
-custom table generation for built-in constant add/multiply
-low-precision outputs with optional rounding
To do:
-update test program
-verify the 32x32
-add signed and other versions
-squaring
-polynomials
At least a few more weeks to release something I'm satisfied with :)
I also discovered xc_basic and I'd like to add my library to their source, which is using a simple shift/add loop. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)
Hehe probably none due to the square tables.
@repose: regarding signing you can just post-fix that if you want, not optimal though but rather fast.
; Apply sign (See C=Hacking16 for details).
lda T1+1
bpl :+
sec
lda PRODUCT+2
sbc T2+0
sta PRODUCT+2
lda PRODUCT+3
sbc T2+1
sta PRODUCT+3
:
lda T2+1
bpl :+
sec
lda PRODUCT+2
sbc T1+0
sta PRODUCT+2
lda PRODUCT+3
sbc T1+1
sta PRODUCT+3
:
But again, you already knew that no doubt. :) |
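The post-fix JackAsser shows (subtract the other operand from the high word for each negative input) can be checked numerically. A sketch with 16-bit two's-complement inputs, mirroring the sbc pairs on PRODUCT+2/3 above; variable names are mine:

```python
# Signed multiply from an unsigned 16x16=32 core plus the C=Hacking fixup:
# compute the unsigned product of the raw two's-complement bit patterns,
# then subtract (other operand << 16) for each negative factor, mod 2^32.
def smul16(a, b):
    au, bu = a & 0xffff, b & 0xffff   # raw two's-complement patterns
    p = au * bu                       # what the unsigned core produces
    if a < 0:
        p -= bu << 16                 # the sbc on PRODUCT+2/3 with T2
    if b < 0:
        p -= au << 16                 # the sbc on PRODUCT+2/3 with T1
    return p & 0xffffffff
```

This works because (a + 2^16)(b + 2^16) expands to ab plus exactly the two cross terms being subtracted, modulo 2^32.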
| |
Repose
Registered: Oct 2010 Posts: 225 |
Thanks for the tip! I hadn't considered it in detail. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.
| |
Repose
Registered: Oct 2010 Posts: 225 |
Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)
Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful. You can use a half-sized table and still be fast, in fact this was the usual expression of the squares method before it was fully optimized. Other than that, an unrolled shift-add loop is all I can think of (but I am so tired of seeing those :) |
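For reference, the squares method being discussed is the quarter-square identity: a*b = floor((a+b)^2/4) - floor((a-b)^2/4), which is exact because a+b and a-b always have the same parity. A sketch of the half-sized-table variant:

```python
# Quarter-square multiply: one 511-entry table of n*n//4 covers all of
# 8x8=16. The two floors cancel since a+b and |a-b| share parity.
q = [n * n // 4 for n in range(511)]

def mult8(a, b):
    return q[a + b] - q[abs(a - b)]
```

The fully optimized 6502 version splits this into separate lo/hi and negated tables so the a+b and a-b indexing comes for free from self-modified pointers, which is where the table size grows to ~2 KB.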
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Thanks for the tip! I hadn't considered it in details. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.
I didn't really think much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..
And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ReposeJackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful. Yeah, the reason i'm asking is drive-code, of course.
I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.
Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Quote: Quoting ReposeJackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful. Yeah, the reason i'm asking is drive-code, of course.
I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.
Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =)
Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ReposeCool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files. I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)
* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Quote: Quoting ReposeCool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files. I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)
* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems.
This is a common problem which has been solved. For ethernet, it's called CSMA/CD. Then there's the method from CAN, the 2 wire car bus, which is a better protocol.
https://copperhilltech.com/blog/controller-area-network-can-bus.. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ReposeThis is a common problem which has been solved. For ethernet, it's called CSMA/CD. Then there's the method from CAN, the 2 wire car bus, which is a better protocol. I am very well aware of that. But we're still talking about stock C-64 plus original periphery, aren't we? |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Sorry I wasn't clear, this CAN thing works on an I2C bus, to put it in simple terms, it works exactly the same way as the IEC bus. You should be able to translate the ideas regardless, there will always be something that works. If I get time I can explain it better. Besides which, the kernal itself does this, there is the c64 as the master and it can negotiate to talk to any number of drives. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ReposeIf I get time I can explain it better. Again, no need to. I've worked professionally with I2C and CAN for about a decade.
And those do bus arbitration in hardware, doing this in software efficiently is the problem.
Quoting ReposeBesides which, the kernal itself does this, there is the c64 as the master and it can negotiate to talk to any number of drives. Afaik, bus arbitration is not a thing with single-master buses, like when only the C-64 acts as the master (the drives cannot assert the ATN line).
Different thing with a multi-master bus when any participant may wish to talk at any time. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Continued in Multi-master on IEC bus |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Quoting JackAsserI didn't really though much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..
And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.
Hi,
I didn't get back to you on this. I didn't mean to EOR the product, I meant to EOR the signs.
X Y P
+ + +
- + -
+ - -
- - +
So if the signs were stored with 0=+ and 1=-, then EORing the signs would give the sign of the result.
When multiplying, there is no fixing of the products; they always stay as a "positive" number and don't lose the sign bit.
Add/sub would be slower, you need to know their relative magnitudes. |
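A sketch of this separate-sign (sign-magnitude) scheme in Python, with values as (sign, magnitude) pairs where 0 is + and 1 is -; the representation and helper names are mine:

```python
# Sign-magnitude arithmetic: magnitudes stay "positive", and the sign of a
# product is just the EOR of the input sign bytes - no product fixup needed.
def smul(x, y):
    (sx, mx), (sy, my) = x, y
    return (sx ^ sy, mx * my)

def sadd(x, y):
    # add/sub is the slow path: it needs a sign compare and, when the
    # signs differ, a magnitude compare to pick the subtraction order
    (sx, mx), (sy, my) = x, y
    if sx == sy:
        return (sx, mx + my)
    big, small = (x, y) if mx >= my else (y, x)
    return (big[0], big[1] - small[1])   # note: can yield (1, 0), i.e. "-0"
```

This shows the trade-off in the post: multiply gets cheaper, add/sub gets more conditional.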
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Quoting JackAsserI didn't really though much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..
And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.
Hi,
I didn't get back to you on this. I didn't mean to EOR the product, I meant to EOR the signs.
X Y P
+ + +
- + -
+ - -
- - +
So if the signs were stored with 0=+ and 1=-, then EORing the signs would give the sign of the result.
When multiplying, there is no fixing of the products; they always stay as a "positive" number and don't lose the sign bit.
Add/sub would be slower, you need to know their relative magnitudes.
Moving away from 2's complement?
| |
Repose
Registered: Oct 2010 Posts: 225 |
This doesn't need to be an either/or; it's just something I'm evaluating. Unless you're doing a factorial, I see most calculations using a multiply followed by an add (polynomial evaluation, matrices), so I just want to see the tradeoff.
If adds and mults always come in pairs, my guess is that either system is about equal. If they are equal, I'd give the nod to separate sign as it gives more range. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
@ChristopherJam (does that work?)
I did the offset $4000 in g thing myself, but tried the zip-add approach. I'm afraid it's a disaster; your adds are basically 9.5 cycles while mine are 11.5:
;continue column 1
.c1_1: lda #0 ;self-modded to contain high(x0*y0) f
adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
sta .c1_2+1 ;save partial
;11.5
This is missing the corrections (for $40 in the high bytes of g(x), or p_neg_sqr_hi_x in my notation)
umult16:
;unsigned 16x16 mult
;inputs:
; x0 x1
; y0 y1
;outputs:
; A z2 z1 z0
* = $c000
!zone multlib
;these are defaults, but should be set from the calling program
!addr {
multlib_p_sqr_lo_x0 = $8b ;2 bytes
multlib_p_sqr_hi_x0 = $8d ;2 bytes
multlib_p_neg_sqr_lo_x0 = $a3 ;2 bytes
multlib_p_neg_sqr_hi_x0 = $a5 ;2 bytes
multlib_p_sqr_lo_x1 = $8b ;2 bytes
multlib_p_sqr_hi_x1 = $8d ;2 bytes
multlib_p_neg_sqr_lo_x1 = $a3 ;2 bytes
multlib_p_neg_sqr_hi_x1 = $a5 ;2 bytes
multlib_sqr_lo = $c000 ;511 bytes
multlib_sqr_hi = $c200 ;511 bytes
multlib_neg_sqr_lo = $c400 ;511 bytes
multlib_neg_sqr_hi = $c600 ;511 bytes
}
x0=$fb
x1=$fc
y0=$fd
y1=$fe
z0=$8000
z1=$8001
z2=$8002
z3=$8003
;set multiplier as x0
lda x0 ;3
sta multlib_p_sqr_lo_x0 ;3
sta multlib_p_sqr_hi_x0 ;3
eor #$ff ;2
sta multlib_p_neg_sqr_lo_x0 ;3
sta multlib_p_neg_sqr_hi_x0 ;3; 3+3+3+2+3+3 = 17 cycles
;set multiplier as x1
lda x1 ;3
sta multlib_p_sqr_lo_x1 ;3
sta multlib_p_sqr_hi_x1 ;3
eor #$ff ;2
sta multlib_p_neg_sqr_lo_x1 ;3
sta multlib_p_neg_sqr_hi_x1 ;3; 3+3+3+2+3+3 = 17 cycles
;set multiplicand as y0
ldy y0 ;3
;storing z3 in X, clear X
ldx #0 ;2
clc ;2
;x0_y1l = low(x0*y1)
;x0_y1h = high(x0*y1)
;zip add all mult8's directly
;adding requires offset $4000 in multlib_p_neg_sqr table and compensation
;7
;column 0
lda (multlib_p_sqr_lo_x0),y ;low(x0*y0) f
adc (multlib_p_neg_sqr_lo_x0),y ;11 cycles low(x0*y0) g
sta z0 ;multiplier * Y, low byte, 3 cycles
;need correction factor for g in offset $400
;14
;continue to column 1
adc (multlib_p_sqr_hi_x0),y ;high(x0*y0) f
sta .c1_1+1 ;partial addition result of column 1 (self-mod to be continued)
;9.5
;continue to column 2
adc (multlib_p_sqr_hi_x1),y ;high(x1*y0) f
sta .c2_1+1 ;save partial
bcc .c1_1
inx ;inc z3
clc ;3/6=4.5
;14
;continue column 1
.c1_1: lda #0 ;self-modded to contain high(x0*y0) f
adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
sta .c1_2+1 ;save partial
;11.5
;continue column 2
.c2_1: lda #0 ;self-modded to contain high(x1*y0) f
adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y0) g
sta .c2_2+1 ;save partial
bcc .c1_2
inx ;inc z3
clc ;4.5
;16
;continue column 1
.c1_2: lda #0 ;self-modded to contain high(x0*y0) g
adc (multlib_p_sqr_lo_x1),y ;low(x1*y0) f
sta .c1_3+1 ;save partial
;11.5
;continue column 2
;set multiplicand as y1
ldy y1 ;3
.c2_2: lda #0 ;self-modded to contain high(x1*y0) g
adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) f
sta .c2_3+1 ;save partial
bcc .c1_3
inx ;inc z3
clc ;4.5
;19
;continue column 1 (extra y0)
ldy y0 ;3
.c1_3: lda #0 ;self-modded to contain low(x1*y0) f
adc (multlib_p_neg_sqr_lo_x1),y ;low(x1*y0) g
sta .c1_4+1 ;save partial
;14.5
;continue column 2
ldy y1 ;3
.c2_3: lda #0 ;self-modded to contain high(x0*y1) f
adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) g
sta .c2_4+1
bcc .c1_4
inx ;inc z3
clc ;4.5
;19
;continue column 1
.c1_4: lda #0 ;self-modded to contain low(x1*y0) g
adc (multlib_p_sqr_lo_x0),y ;low(x0*y1) f
sta .c1_5+1 ;save partial
;11.5
;continue column 2
.c2_4: lda #0 ;self-modded to contain high(x0*y1) g
adc (multlib_p_sqr_hi_x1),y ;low(x1*y1) f
sta .c2_5+1 ;save partial
bcc .c1_5
inx ;inc z3
clc ;4.5
;16
;finish column 1
.c1_5: lda #0 ;self-modded to contain low(x0*y1) f
adc (multlib_p_neg_sqr_lo_x0),y ;low(x0*y1) g
sta z1 ;need correction factor for g in offset $400
;10.5
;finish column 2
.c2_5: lda #0 ;self-modded to contain low(x1*y1) f
adc (multlib_p_neg_sqr_hi_x1),y ;low(x1*y1) g
sta z2 ;need correction factor for g in offset $400
;10.5
;column 3
txa ;carries from c2
adc (multlib_p_sqr_hi_x1),y ;high(x1*y1) f
adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y1) g
;need correction factor for g in offset $400
;A=z3
;13
;=231.5
fin: rts
|
| |
Repose
Registered: Oct 2010 Posts: 225 |
A guide to the order of additions (optimized for changes in y):
               y1    y0
  x            x1    x0
  ---------------------
             x0y0h x0y0l
     + x1y0h x1y0l
     + x0y1h x0y1l
+ x1y1h x1y1l
------------------------
    z3    z2    z1    z0
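The column sums in this addition scheme can be sanity-checked in a few lines. A sketch of the same byte-level bookkeeping, with explicit carries rippling from each column into the next:

```python
# Verify the 16x16=32 partial-product columns:
#   z0 = lo(x0*y0)
#   col1 = hi(x0*y0) + lo(x1*y0) + lo(x0*y1)
#   col2 = hi(x1*y0) + hi(x0*y1) + lo(x1*y1) + carry
#   z3   = hi(x1*y1) + carry
def mult16(x, y):
    x0, x1 = x & 0xff, x >> 8
    y0, y1 = y & 0xff, y >> 8
    lo = lambda p: p & 0xff
    hi = lambda p: p >> 8
    z0 = lo(x0 * y0)
    c1 = hi(x0 * y0) + lo(x1 * y0) + lo(x0 * y1)
    z1 = c1 & 0xff
    c2 = (c1 >> 8) + hi(x1 * y0) + hi(x0 * y1) + lo(x1 * y1)
    z2 = c2 & 0xff
    z3 = ((c2 >> 8) + hi(x1 * y1)) & 0xff
    return z0 | (z1 << 8) | (z2 << 16) | (z3 << 24)
```

Each hi()/lo() pair here corresponds to one square-table lookup pair in the assembly, and the carry terms are the bcc/inx paths.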
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ah yes, saving those partials is expensive.
Apologies for the delay - while the @-mention did nothing, I did see the post, but then I had to find the time to reread the last 129 comments here..
I had a few ideas for optimizations while I was failing to get around to reading back, but it looks like I either acknowledged or attempted all of them back in 2017. Excellent point you had back then about it being well worth removing the CLCs for the taller columns.
One new thought I did have was to replace the bcc/inx combo with a ROL of some value read further down, which would simultaneously clear carry and accumulate a '1' into an index into a table counting the number of 1 bits. It could even be a table selector (if the ROL was of the high byte of a table address), allowing the correction to be incorporated into one of the square table lookups.
However, a non-accumulator ROL is always at least 5 cycles, which is slower than either path through bcc/inx, so I suspect this one's a no go. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I've been going all over the place with distractions, but, I'm exploring a completely new number system which has potential. As I mentioned in my big list, there's many more ways to do math than just 2's complement. Here is a log based arithmetic system, where multiply is as fast as an add, and add is a bit slower:
# -*- coding: utf-8 -*-
"""
Created on Fri Jan 28 22:46:53 2022

@author: Repose

A module to analyze arithmetic via logs

ref.
https://www.siranah.de/html/sail042q.htm
"""
#imports
import sys
from math import log2

#functions/classes
scale=256/log2(256) #8-bit range
lo=lambda x:x&255 #util for creating tables (for future development)
hi=lambda x:(x>>8)
logadd=lambda x:round(log2(1+1/pow(2,x/scale))*scale) #scaled table to add two logs
logsub=lambda x:round(log2(1+pow(2,x/scale))*scale) #scaled table to sub two logs
exp=lambda x:round(pow(2,x/scale)) #scaled exponential 2^x table
lg=lambda x:round(log2(x)*scale) #scaled log2(x) table

def mult(x,y):
    '''Multiply two integers via their logs

    Keyword arguments:
    x,y -- integers 0 to 255

    Returns:
    The approximate product
    '''
    return exp(lg(x)+lg(y))

def add(x,y):
    '''Add two integers via their logs

    Keyword arguments:
    x,y -- integers 0 to 255

    Returns:
    The approximate sum
    '''
    return exp(logadd(lg(y)-lg(x))+lg(y))

def output_table(fmt, fname, name, addr, tab, start, end):
    '''Output a table as an assembly include file

    Keyword arguments:
    fmt -- string, "ACME" or "DASM"
    fname -- string, filename of output file
    name -- string, table name
    addr -- integer start address of table
    tab -- generator function
    start -- integer start index value
    end -- integer end index value + 1

    Returns:
    0 (success)
    1 (unknown assembler format)
    5 (failed to open file)
    '''
    try:
        with open(fname, 'w') as f:
            print(f'writing {fname}')
            if fmt.lower()=="acme":
                fmt_data='!by '
            elif fmt.lower()=="dasm":
                fmt_data='dc.b '
            else:
                print(f'unknown assembler format {fmt}')
                return 1 #traditional DOS incorrect function
            #header
            f.write(f'{name}=${addr:04X}\n')
            step=16
            for i in range(start,end,step):
                #line start
                f.write(fmt_data)
                k=i+step if i+step<end else end
                first_invalid=True
                first_overflow=True
                for j in range(i,k):
                    try:
                        val=tab(j)
                    except ValueError:
                        val=0
                        if first_invalid:
                            print(f'warning: invalid table value replaced with 0 at index {j}')
                            first_invalid=False
                    if val>0xff:
                        val=val&0xff
                        if first_overflow:
                            print(f'warning: table value overflowed at index {j}')
                            first_overflow=False
                    f.write(f'${val:02X}')
                    if j!=k-1:
                        f.write(',')
                    else:
                        f.write('\n')
    except OSError:
        print(f"Couldn't open {fname} for writing")
        return 5 #traditional DOS error code
    return 0

#main
def main():
    '''Analyze arithmetic of two integers via their logs

    Keyword arguments:
    none

    Returns:
    0 (success)
    '''
    print('Gaussian logs analysis\n')
    print('accuracy test\n')
    #should write a helper class for this
    erm=0 #maximum multiply error
    era=0 #maximum addition error
    for x in range(1,256):
        for y in range(x, 256):
            er=mult(x,y)-x*y
            if er>erm:
                erm=er
                pxm=x
                pym=y
            er=add(x,y)-(x+y)
            if er>era:
                era=er
                pxa=x
                pya=y
    #requires python 3.6+ due to f-strings
    print('multiply')
    print(f'max err was {erm} ({erm/(pxm*pym)*100:.2f}%) at x={pxm} and y={pym}')
    print(f'{pxm}*{pym}={pxm*pym}, mult({pxm},{pym})={mult(pxm,pym)}\n')
    print('add')
    print(f'max err was {era} ({era/(pxa+pya)*100:.2f}%) at x={pxa} and y={pya}')
    print(f'{pxa}+{pya}={pxa+pya}, add({pxa},{pya})={add(pxa,pya)}\n')
    #write tables
    print('writing tables')
    tables=[{"func":lg,"name":"logtab"}, {"func":exp,"name":"exptab"}, {"func":logadd,"name":"logadd"}]
    addr=0xc000
    assembler='acme'
    for table in tables:
        #output_table(fmt, fname, name, addr, tab, start, end+1)
        output_table(assembler, table["name"]+'.asm', table["name"], addr, table["func"], 0, 256)
        addr+=0x100
    return 0

if __name__ == '__main__':
    sys.exit(main())
writing logtab.asm
warning: invalid table value replaced with 0 at index 0
warning: table value overflowed at index 254
writing exptab.asm
writing logadd.asm
Main.asm to come... |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
The log/exp approach isn't new on this system, though. =)
It's especially useful in downscaling operations, where either of the factors is in [0..1).
Some care needs to be taken to minimise table size for an acceptable accuracy. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: The log/exp approach isn't new on this system, though. =)
It's especially useful in downscaling operations, where either of the factors is in [0..1).
Some care needs to be taken to minimise table size for an acceptable accuracy.
Iirc WVL used a very controlled 8-bit sin/asin for the maths in Pinball Dreams even |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Yes, log has been used for divide in a C=Hacking article. In this case, I can add without converting from log and back when switching between add/sub and mult/div. Consider finding the escape iteration for a Mandelbrot set; you never need to know the values, just when the log proxy meets the test zr*zr+zi*zi>4. I'm hoping this can be faster and more accurate.
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
log div is good enough for lines. (unless you're goind for subpix:) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed my previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible, you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.
The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.
The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my earlier version on Codebase64 is 198.6.
To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).
;This section performs the 4 sub-multiplies to form
; the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
; to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1 ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3 ;32
+mult8_set_mier_snippet x0 ;17
+mult8_snippet "Y", x0y1l+1, "X" ;28
+mult8_snippet y0, z0, "A" ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3 z2 z1
clc
x0y1l adc #0 ;x0y0h + x0y1l
tay ;6
txa
x1y0h adc #0 ;x0y1h + x1y0h
tax
bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
+ tya
x1y0l adc #0 ;+ x1y0l
sta z1 ;7
txa
x1y1l adc #0 ;+ x1y1l
bcc done ;7
inc z3 ;(+4) taken 42% of the time
Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204). |
| |
Repose
Registered: Oct 2010 Posts: 225 |
A diagram of the calculations:
               y1    y0
  x            x1    x0
  ---------------------
             x0y0h x0y0l
       x0y1h x0y1l
       x1y0h x1y0l
 x1y1h x1y1l
------------------------
    z3    z2    z1    z0
A listing of the macros is quite long, as a lot of effort went into parsing the arguments the way I'm using them, plus some other features in my library. Requires the latest C64 Studio 7.5, as there were a number of features I had added. Here is one of them:
!macro mult8_set_mier_snippet .mier {
;set multiplier as mier
;mier can be m/A
;requires .p_.sqr* and .p_.neg.sqr* pointers,
;(mathlib.p_sqr_lo, mathlib.p_neg_sqr_lo, mathlib.p_sqr_hi, mathlib.p_neg_sqr_hi)
;uses these macros from mathlib_util_macros.asm:
;reset_cycles, add_cycles_const
!if .mier = "X" or .mier = "Y" {
!error "multlib: mult8_snippet: mier must be m/A, not ", .mier
}
+reset_cycles
!if .mier != "A" {
lda .mier
!if .mier<$100 {
+add_cycles_const 3
} else {
+add_cycles_const 4
}
}
sta mathlib.p_sqr_lo ;3
sta mathlib.p_sqr_hi ;3
eor #$ff ;2
sta mathlib.p_neg_sqr_lo ;3
sta mathlib.p_neg_sqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
+add_cycles_const 14
}
Here is the order of multiplies I used:
mier = x1
(x1) * y0 -> x1y0l, x1y0h
(x1) * y1 -> x1y1l, z3
mier = x0
(x0) * (y1) -> x0y1l, X
(x0) * y0 -> z0, A
|
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Very good! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Nice work, Repose!
Going to have to take another run at this one of these days.. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.
Even my old routine has won out of 120 routines tested independently.
Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results
This was a thrilling moment for me :)
But in general, it's interesting to read about the various approaches, and to be fair, if you need a smaller routine, there are better choices.
| |
Raistlin
Registered: Mar 2007 Posts: 680 |
Very nice write up, Repose! |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.
Even my old routine has won out of 120 routines tested independently.
Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results
This was a thrilling moment for me :)
But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.
Congratulations! I totally get that it must have been thrilling. :D |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Repose: The time is 188.1 cycles, averaged over all possible inputs.
Pretty impressive. :)
It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?
If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist). |
| |
Repose
Registered: Oct 2010 Posts: 225 |
The code seems to be 120 bytes and the tables are 2044 bytes. I think I need some 30 more bytes of code to generate the tables. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Very very nice write up @Repose! |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
Hehe, tbf Graham taught me this trick (i.e. self-modding the code to get a+b for free, and using neg tables to get a-b for free) iirc, or maybe it was CJam, can't remember.
We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham taught me, which I believe HCL also uses: y/x = y*(1/x) instead of doing a div, and then another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x. |
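The reciprocal-plus-interpolation idea described in the post above might be sketched like this in Python (the 16-bit scale, table size, and function name div_approx are assumptions for illustration, not taken from any actual routine):

```python
# Division via a reciprocal table plus linear interpolation:
# y/x ~ y*(1/x), refined by interpolating between 1/floor(x) and
# 1/(floor(x)+1) using the fractional part of x.
SCALE = 1 << 16                                  # 16-bit fixed-point reciprocals
recip = [0] + [SCALE // x for x in range(1, 258)]  # indices 1..257

def div_approx(y, x_int, x_frac):
    """Approximate y / (x_int + x_frac/256) with two multiplies."""
    r0 = recip[x_int]                      # 1/x_int, scaled
    r1 = recip[x_int + 1]                  # 1/(x_int+1), scaled
    r = r0 + (r1 - r0) * x_frac // 256     # interpolate by the fraction
    return y * r // SCALE

# x = 7.5: interpolation keeps the error small vs. just truncating x to 7.
assert abs(div_approx(1000, 7, 128) - 1000 / 7.5) < 2
```

The interpolation costs one extra multiply but avoids the large error a plain truncated-reciprocal lookup gives between table entries.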
| |
Repose
Registered: Oct 2010 Posts: 225 |
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations. |
| |
Count Zero
Registered: Jan 2003 Posts: 1932 |
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication gave me the thrills a while ago.
Everybody here ofc is invited to contribute as well :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Thanks!
A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm
I can't find the AMD paper, but it's similar to this:
https://ieeexplore.ieee.org/document/762835
It detailed the implementation in AMD's K7 CPU (a very old one).
I don't wanna get too excited to dive into that, until I finish the multiplies :)
But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)
So there are 4 cases. If exactly one of the arguments is negative, the resulting product will be negative, and to get that you only need to do the square-table subtractions in the reverse order, which is equivalent to taking the 2's complement of the product.
So, I'm hoping that the little sign-dispatch routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think both inputs being negative will be the worst case.
Another approach is to use different representations. Does an offset-128 representation work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.
Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.
Another way: do an unsigned multiply but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes; it's explained in C=Hacking issue 16, in the 3D article. Very good article, which I co-wrote too :) |
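The last approach in the post above (unsigned multiply, then fix up the high word) can be sketched as follows, assuming 16-bit two's-complement inputs; the helper name smult16 is illustrative:

```python
# Signed 16x16 via unsigned multiply plus a high-word fixup: when an
# input is negative, subtract the *other* input (as an unsigned bit
# pattern) from the upper 16 bits of the unsigned product.
# Why it works: the bit pattern of a negative x is u = x + 2^16, so
# u*v = x*y + (x<0)*v*2^16 + (y<0)*u*2^16  (mod 2^32).
def smult16(x, y):
    u, v = x & 0xFFFF, y & 0xFFFF          # two's-complement bit patterns
    p = u * v                              # plain unsigned 16x16=32
    if x < 0:
        p -= v << 16                       # fixup for negative x
    if y < 0:
        p -= u << 16                       # fixup for negative y
    return p & 0xFFFFFFFF                  # 32-bit two's-complement result

assert smult16(-123, -456) == 123 * 456
```

On the 6502 this costs at most two 16-bit subtracts on top of the unsigned routine, which is why it competes well with the sign-dispatch and pre/post-negation variants.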
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
No idea whether Graham also came up with the self-mod trick to get A+B for free independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me, after whoever it was introduced me to the difference-of-squares approach, back around when Steve was writing 3D rendering articles for C=Hacking in the late 90s.
I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Oh, you definitely have added some great ideas!
However, I researched it before, and a few people came up with that idea before any of us, and they weren't C64 sceners either. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I've finally published my new fastest routine, the 2023 version. It's even faster now: 187.1 cycles.
https://codebase64.org/doku.php?id=base:fastest_multiplication_..
And independently benchmarked here
https://github.com/tobyLobster/multiply_test/?tab=readme-ov-fil..
Both faster and shorter, so not bad. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Nice work! |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Of course, only now do I realize a dumb trick:
x1 = p_sqr_lo
umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi
; set multiplier as x0
; *x1 is no longer preserved
To easily save 3 cycles and 1 byte zp.
Going a bit further: if you add 6 more zp bytes and use two sets of pointers, then x0 and x1 can be p_sqr_lo1 and p_sqr_lo2. One more step is to make the caller set up both pointers, but that's a bit more difficult, as we can't assume the caller can use A for the eor #$ff calculation (yes, A will be trashed when you call umult16, but I mean at the point when the caller has the other half of x_ ready).
So, save 3, 6 or 12 cycles.
One more hack is passing y0 in Y, but in the worst case that's not convenient for the caller; if the value happens to be in X, it's easier to do stx y0 than txa:tay.
So, I would say saving 3 or 6 cycles are good options, and the 3 for sure. That brings the routine down to 180.1 cycles :) Saving 14.5 cycles and being 10% faster definitely makes me happy; losing another 6 bytes of zp, not so much. |