CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 225
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, for the case where the multiply routine's time depends on the bits of the multiplier. With a standard shift and add routine, every 1 bit in the multiplier usually costs a "bit" more time for that bit.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, you average half 1 bits. In my case that becomes the worst case, and there are about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by using pre- and post-processing of the results. The improvement is about 20%.
Another speedup is optimizing for the same multiplier applied to multiple multiplicands. This saves a little by processing the multiplier bits only once. It can save another 15%.
Using the square table method would be faster still, but it uses a lot of data and a lot of code.
Would anyone be interested in this?
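(To illustrate why the multiplier's 1 bits matter: a minimal Python model of a shift-and-add multiply, counting the add steps. The function is purely illustrative, not any of the routines discussed below.)

def shift_and_add(multiplier, multiplicand):
    # standard 8-bit shift-and-add: one add per 1 bit in the multiplier
    product, adds = 0, 0
    for bit in range(8):
        if (multiplier >> bit) & 1:
            product += multiplicand << bit
            adds += 1
    return product, adds

print(shift_and_add(0x0f, 200))  # (3000, 4)  - four 1 bits, four adds
print(shift_and_add(0xff, 200))  # (51000, 8) - eight 1 bits, eight adds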

2012-06-10 00:25
Repose

Registered: Oct 2010
Posts: 225
After some work I decided to go with a square table based large multiply: 32bit x 32bit = 64bit in 850 cycles or so at best. I discovered a trick I wish I'd known a long time ago; to add long lists of numbers you can do this:
ldy #0;Y counts the carries out of this column
clc
lda a0
adc b0
bcc s1;on a carry, count it and clear for the next add
iny
clc
s1 adc c0
sta d0;low byte of the column sum
bcc s2
iny
s2 sty d1;carry count = high byte of the sum

It's basically 6 cycles per row and 7 cycles per column. When adding the 16 (two byte) partials from the 32bit multiply, it averages to 7.5 cycles per add. This is really optimal as the most basic add is 6 cycles (lda a0:adc b0).

I also found out that carries occur slightly less than half the time on average.
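(A quick host-side check of that statistic: a rough Python model, with my own naming, of summing one column of random 8-bit partials while counting carries separately, the same idea as the iny trick above.)

import random

def add_column(values):
    # running 8-bit total in "A", carries counted separately (the INYs)
    total, carries = 0, 0
    for v in values:
        total += v
        if total > 0xff:
            total &= 0xff
            carries += 1
    return total, carries

random.seed(1)
trials = [add_column([random.randrange(256) for _ in range(3)]) for _ in range(10000)]
print(sum(c for _, c in trials) / (2 * len(trials)))  # ~0.498, a bit under half per add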

I'm working on a routine to do 8x8=16, 8x16=24, 16x16=16 with rounding, 32x32=32 with rounding, 32x32=64, and 32x32 floating point, all in one routine with different entry points, where (almost) every entry point is as fast as a dedicated standalone routine. It's just a matter of arranging the order of operations carefully to build up the result progressively.

I think you can get 1000 flops on the C64.
2012-06-10 10:28
ChristopherJam

Registered: Aug 2004
Posts: 1409
1000 flops is well within reach - it just depends how much precision you want, and what you define as a flop :)

Single precision IEEE-754 only has 24 bits for the mantissa, and you don't need a 48 bit result unless you want to get the rounding exact for multiply-accumulates.

Have you considered expressing the multiplies as a difference of squares? Should mean instead of needing nine 8x8->16 multiplies for a 24x24 bit you have six 8x8s and a few table lookups.
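(A little Python check of the identity, using the truncating f(x)=x*x//4 that turns up later in the thread; the truncation errors cancel because a+b and a-b always have the same parity.)

f = lambda x: x * x // 4
assert all(f(a + b) - f(a - b) == a * b
           for a in range(256) for b in range(256) if a >= b)
# for a < b, just swap the operands before the table lookups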
2012-06-11 01:47
Repose

Registered: Oct 2010
Posts: 225
part 1

z1:z0=(x0+y0)^2/4-(x0-y0)^2/4 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;48 cycles

z1:z0=x0*y1/2 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y1
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;56 cycles

z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles

c1:c0=a0^2/4-b0^2/4;48 cycles
c3:c2=a1^2/4-b1^2/4;48 cycles
t1:t0=a0*a1/2;56 cycles
t3:t2=b0*b1/2;56 cycles
t1:t0-=t3:t2;20 cycles
c2:c1+=t1:t0;20 cycles
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1 lda #0;8/11
adc c3:sta c3;6

which is 262+ cycles for your method, 16bit.

2012-06-11 02:02
Repose

Registered: Oct 2010
Posts: 225
You're proposing,
a*b=(a+b)^2 - (a-b)^2,
f(x)=x^2/4,
a*b=f(a+b) - f(a-b)
Where a,b are 24 bit values.

So we need a 24bit square function.
You can get that by,
(x2 * 2^16 + x1 * 2^8 + x0)^2=
x2^2 * 2^32 + x1^2 * 2^16 + x0^2 + y,
g(x)=x^2/4,
=g(x2) * 2^32 + g(x1) * 2^16 + g(x0)
which are three table lookups and,
y= 2*x1*x0/4 * 2^8 + 2*x2*x0/4 * 2^16 + 2*x2*x1/4 * 2^24
which are three 8bit multiplies.
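(The decomposition is easy to sanity-check in Python; the helper name is just for illustration.)

import random
def square24(x0, x1, x2):
    sq    = x2*x2 * 2**32 + x1*x1 * 2**16 + x0*x0                # three squaring lookups
    cross = 2*x1*x0 * 2**8 + 2*x2*x0 * 2**16 + 2*x2*x1 * 2**24   # three 8bit multiplies
    return sq + cross

random.seed(2)
for _ in range(1000):
    x0, x1, x2 = (random.randrange(256) for _ in range(3))
    n = x2 * 2**16 + x1 * 2**8 + x0
    assert square24(x0, x1, x2) == n * n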

In general for n byte numbers you get
2n squaring lookups,
2 * nC2 mults, where nCr = n!/((n-r)! r!),
2n shifts

O(n) = 2n*48 + 2*nC2*(56+20) + 14, n>1

2C2=2
3C2=3
4C2=6

n cycles
2 262
3 530
4 1310

So without optimization, it doesn't look good...

2012-06-11 09:59
Skate

Registered: Jul 2003
Posts: 494
Quote:

So without optimization, it doesn't look good...

when it comes to math, what looks good on c64 without optimization?

creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as is.

imho, the best method is to create the library with whatever optimizations don't affect the accuracy much. at the last stage, replace all possible calculations with reasonable look-up tables and, where a look-up table is not an option, inline the math functions and optimize them as much as possible, change the register usage to save a few more cycles, sacrifice some accuracy if it doesn't affect the result badly, etc.
2012-06-11 10:43
Repose

Registered: Oct 2010
Posts: 225
You're thinking in terms of demos. I was thinking of applying this as a compiler library; I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach; when I say it doesn't look good, I mean his approach.

Update: slight problem in how I count things.
O(n) = n*48 + nC2*(56+20) + 14; in fact n=1 works too, since 1C2 is 0 (and the 14-cycle term is tied to that part as well)

n cycles
1 48
2 262
3 386
4 662

This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass...
2012-06-11 19:33
Bitbreaker

Registered: Oct 2002
Posts: 508
Quoting Repose
part 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1
lda #0;8/11
adc c3
sta c3


this will a) most likely crash, as it only skips one byte and then runs into a $00, which will be a BRK, and b) it's not an optimizing trick anyway, speed-wise or size-wise:

Better be done like:
    anc #$80 ;copy bit 7 to carry and A = A & $80
    bcc +    ;A is zero on plus, all fine
    sbc #$81 ;subtract $81 from $80 = $ff
+
    ;carry always clear as a free gift


Also, lots of other possibilities to optimize that code, and also some mistakes within the code.
2017-03-28 18:46
Repose

Registered: Oct 2010
Posts: 225
I worked on this again. There were a lot of bugs in my post 'part 1'.

Basically, this message is working out how long it takes to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.

Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.

----------------
z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg

z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg

z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles

The routine
Again note: (a1:a0)^2/4 is built from a1^2/4, a0^2/4 and a1*a0/2,
where a1:a0 = x1:x0 + y1:y0.
So a 16bit square/4 is two 8bit squares and a multiply-with-divide-by-2,
all added together.
We need two such 16bit squares and then to subtract them.

;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z2-=t3:t2;20 cycles
;190*2+40=420 cycles+

There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.

This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed).
2017-03-28 18:57
Repose

Registered: Oct 2010
Posts: 225
I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials in column order from low to high; or else I can do the multiplies in the optimal order that avoids changing the zp pointers, but then have to generate and add the partials in a non-optimal order.

The two places where I can avoid changing the zp pointers save 28 cycles. Doing the adds, then having to save them and add a different column, is still an unknown to me.

I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.

It's only a bit better than the one already posted on codebase.

I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies.
2017-03-29 08:12
JackAsser

Registered: Jun 2002
Posts: 2014
Have you checked my Seriously fast multiplication?

Edit: Yes you have.. sorry.
2017-03-29 10:21
Repose

Registered: Oct 2010
Posts: 225
Yes, that's the one I beat already.
2017-03-29 11:46
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Yes, that's the one I beat already.

Interesting approach! I'll try it myself someday!
2017-03-29 12:03
ChristopherJam

Registered: Aug 2004
Posts: 1409
Stop distracting me, I've got other stuff I'm meant to be working on!
2017-03-29 12:22
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Stop distracting me, I've got other stuff I'm meant to be working on!

Circle closed. You were the first to teach me multiplications using squares. I remember I found you on some forum ages ago. :D
2017-03-29 13:46
ChristopherJam

Registered: Aug 2004
Posts: 1409
Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.

It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairly sure the truncating divide by 4 was one of my contributions?)
2017-03-29 14:41
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.

It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairly sure the truncating divide by 4 was one of my contributions?)


Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the highbyte with X and then use ,Y in the table lookup. That way the X+Y comes for "free"
2017-03-29 15:34
Repose

Registered: Oct 2010
Posts: 225
That very idea I always thought was brilliant, because it made a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than you'd think possible. I call it a 'super' algorithm.

And yes, Judd told me that at the time we were working on the C=Hacking article together.

Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ?
2017-03-29 15:48
Frantic

Registered: Mar 2003
Posts: 1648
Quote: Yes, that's the one I beat already.

Don't hesitate to write an article, or at least post some source code, at Codebase64. :) Would be much appreciated!
2017-04-07 22:12
ChristopherJam

Registered: Aug 2004
Posts: 1409
Repose, did you get around to completing a 32x32->64 routine? I'm trying out some of the ideas from this and the sets of add/sub threads; currently down to around 1300 cycles, though I've still got some optimising to go.

I'm using inx to track carries, but have yet to remove the CLCs after each branch.
2017-04-08 03:52
Repose

Registered: Oct 2010
Posts: 225
Yes I wrote a 16x16 to test, and it takes 211 cycles (that's avg with equal branches). The one posted on codebase is 231 cycles. I am just testing it now but having some trouble finding a single step debugger that works for me, see thread.

The time to beat for 32x32 is at least 1300 as posted on codebase so you have to keep going...

I can derive the formula for my current version to estimate exact timing of 32x32 I'll have to get back to you in a bit.
2017-04-08 04:38
Repose

Registered: Oct 2010
Posts: 225
Here's some untested code:
x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99

sqrlo=$c000;511 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600

umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17

lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17

ldx #0;start column 0
ldy y0

SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1

;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;-x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1

;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;-x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;-x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;-x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2

;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;-x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3

rts

makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1

;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9   ; CMP #immediate - skip TYA and clear carry flag
      
makesqrtables_loop1:  
tya
adc #$00
makesqrtables_sm1:  
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
      
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
      
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2
rts
2017-04-08 04:43
Repose

Registered: Oct 2010
Posts: 225
Post-correction would be slower on the 16x16; there isn't a long enough run of adds/subs.
2017-04-08 05:00
Repose

Registered: Oct 2010
Posts: 225
Partials cheatsheet

                        y1  y0
                        x1  x0
                        ------
                 x0*y0h x0*y0l
          x1*y0h x1*y0l
          x0*y1h x0*y1l
   x1*y1h x1*y1l

----------------
24x24bits

                    x2     x1    x0
                    y2     y1    y0
                    ---------------
                        y0x0h y0x0l
                  y0x1h y0x1l
            y0x2h y0x2l
                  y1x0h y1x0l
            y1x1h y1x1l
      y1x2h y1x2l
            y2x0h y2x0l
      y2x1h y2x1l
y2x2h y2x2l


These facts are useful to estimating the time of any size calc:

Number of columns is 2*n, where n is the number of bytes in each number.
The rows of additions go like 1 3 3 1, 1 3 5 5 3 1, 1 3 5 7 7 5 3 1 for 16, 24 and 32 bit mults, and each one is an f(x)-g(x) pair, so really double that number of add-subs.
The total number of add-subs is 2*n^2 (each is about 10 cycles).
The number of times to change the multiplier (i.e. pointers to set) is n, and each one is then used n or more times against the multiplicand (ldy multiplicand) when working in column order (total tbd).
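(Those counts are easy to reproduce in Python; the helper below is mine, purely to check the column pattern and the 2*n^2 total.)

def column_counts(n):
    # how many partial-product bytes land in each of the 2n result columns
    counts = [0] * (2 * n)
    for i in range(n):
        for j in range(n):
            counts[i + j] += 1      # low byte of partial i*j
            counts[i + j + 1] += 1  # high byte of partial i*j
    return counts

print(column_counts(2))  # [1, 3, 3, 1]
print(column_counts(3))  # [1, 3, 5, 5, 3, 1]
print(column_counts(4))  # [1, 3, 5, 7, 7, 5, 3, 1]
print([sum(column_counts(n)) == 2*n*n for n in (2, 3, 4)])  # all True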
2017-04-08 05:44
Repose

Registered: Oct 2010
Posts: 225
Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc.
2017-04-08 07:42
ChristopherJam

Registered: Aug 2004
Posts: 1409
I've replaced all the subtractions with additions by using g(x)=$4000-(x*x/4) and offsetting my start number, and rolled the carry correction into the addition sequence.

Removing the CLCs is probably not worthwhile for 32x32, as there are only 33 of them, so I'd have to spend considerably less than 4 cycles per output byte on the post correction (unlikely).

I'm down to around 800 cycles for a 32x32 now, 776 cycles for zero times zero.
2017-04-08 07:43
ChristopherJam

Registered: Aug 2004
Posts: 1409
# dumpArrayToA65: helper (not shown) that writes a named byte array out as assembler source
fo=open("tables.inc","w")
lo=lambda x:x&255
hi=lambda x:(x>>8)
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
dumpArrayToA65(fo, "id",  [lo(  i ) for i in range(512)])

fo=open("mc.inc","w")
mAcc=0
for i in range(4):
    for j in range(4):
        mAcc-=0x40<<(8*(1+i+j))

initialValue = [((mAcc>>s)&0xff) for s in range(0,64,8)]

def addB(yv,zp,tb):
    global lasty
    if yv!=lasty:
        print("""    ldy mT2+{yv}""".format(yv=yv), file=fo)
        lasty=yv
    print("""    adc ({zp}),y""".format(zp=zp), file=fo)
    if tb<7:
        print("""    bcc *+4:inx:clc""", file=fo)
    else:
        print("""    bcc *+3:clc""", file=fo)

lasty=None
for tb in range(8):
    print(""" ; tb={tb} """.format(tb=tb),file=fo)
    if tb==0:
        print("""    clc """,file=fo)
        print("""    ldx#0 """,file=fo)
        print("""    lda #${iv:02x} """.format(iv=initialValue[tb]),file=fo)
    else:
        print("""    txa""", file=fo)
        if tb<7:
            print("""    ldx#0 """,file=fo)
        print("""    adc#${iv:02x}""".format(iv=initialValue[tb]), file=fo)

        if initialValue[tb]>0xef:
            print("""    bcc *+4:inx:clc""", file=fo)


    for j in range(4):
        i=tb-j
        if i in [0,1,2,3]:
            addB(i, "zp_fl{j}".format(j=j), tb)
            addB(i, "zp_gl{j}".format(j=j), tb)
        i=tb-j-1
        if i in [0,1,2,3]:
            addB(i, "zp_fh{j}".format(j=j), tb)
            addB(i, "zp_gh{j}".format(j=j), tb)
    print("""    sta mRes+{tb}""".format(tb=tb), file=fo)

fo.close()
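(A quick Python sanity check of why the offset tables work: each f()/g() lookup pair yields the partial product plus a constant $4000, which is exactly what the initialValue correction above subtracts again. The check itself is mine.)

f = lambda x: x * x // 4
g = lambda x: (0x4000 - f(x - 255)) & 0xffff

# the pointer setup stores x in the f pointers and 255-x in the g pointers,
# so the two lookups for a partial x*y are f(x+y) and g((255-x)+y)
for x in range(256):
    for y in range(256):
        assert f(x + y) + g(255 - x + y) == x * y + 0x4000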
2017-04-08 07:48
ChristopherJam

Registered: Aug 2004
Posts: 1409
(obviously also need four sets of
    lda mT1+ 0
    sta zp_fl0
    sta zp_fh0
    eor#255
    sta zp_gl0
    sta zp_gh0

every time the multiplier changes (included in my cycle times above),
plus also some init code to set the high bytes of the zero page pointers before a set of multiplications is performed (not included in my timings))
2017-04-08 08:37
Repose

Registered: Oct 2010
Posts: 225
Good job, that's right in the range of what I thought was possible.

I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.

By offsetting by $4000, doesn't that reduce the domain?

Correction is fast, it's only
stx z3
sec
sbc z3
sta z2


Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
2017-04-08 09:50
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting Repose
Good job, that's right in the range of what I thought was possible.
Thanks!

Quote:
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
Doing :)

Quote:
by offset $4000, doesn't that reduce the domain?

Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80

Quote:
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2

True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)

Quote:
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)

Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with.
2017-04-08 10:19
ChristopherJam

Registered: Aug 2004
Posts: 1409
Oh, faster correction:
  sec
  sbc id,x
  sta z2

Still not gonna do it, mind ;)
2017-04-08 10:59
Repose

Registered: Oct 2010
Posts: 225
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Let's say in the middle columns, where there are 14 adds per column: that's up to 28 cycles of CLCs, needed about half the time, so 14 cycles on average, vs 8 cycles for the correction; it still saves 6.

I actually found the stats for the carries, most of them are about half, but adding a higher proportion of high bytes gives less carries.
2017-04-08 11:01
Repose

Registered: Oct 2010
Posts: 225
Good catch on the id,x; I was thinking of that a few days ago but it hadn't clicked for this situation yet :)

And yes, I worked hard at mixing add/sub properly; it still doesn't really make sense, but it works. I thought it wouldn't if you DEX to $FF, but it still works.
2017-04-08 11:14
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting Repose
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..
2017-04-08 11:24
Repose

Registered: Oct 2010
Posts: 225
Yes it hurts :) I posted the explanation in the add/sub thread if you can follow it.
Try 0 - ff - ff in your head. Have fun! :)
2017-04-08 12:18
ChristopherJam

Registered: Aug 2004
Posts: 1409
Thanks!

Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.

Down to 699 cycles for 0*0, btw ;)
2017-04-08 14:09
Repose

Registered: Oct 2010
Posts: 225
I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.
2017-04-10 09:53
ChristopherJam

Registered: Aug 2004
Posts: 1409
Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.

Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes."
2017-04-10 09:55
ChristopherJam

Registered: Aug 2004
Posts: 1409
Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway - instant 2x speedup :D
2017-04-10 12:45
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.


Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.
2017-04-10 17:44
Oswald

Registered: Apr 2002
Posts: 5094
Quote: Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.

cool, I hope its for some c64 demo thingie :)
2017-04-10 18:42
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: cool, I hope its for some c64 demo thingie :)

No sorry. :) just curiosity.
2017-04-11 04:52
Oswald

Registered: Apr 2002
Posts: 5094
Quote: No sorry. :) just curiosity.

nah next time you make a true doom fx, or you'll be kicked out :)
2017-04-11 04:59
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: nah next time you make a true doom fx, or you'll be kicked out :)

Hehe, wasn't the Andropolis vector enough? ;) there I only used 16-bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and remove the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure.
2017-04-11 21:49
Repose

Registered: Oct 2010
Posts: 225
Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)
2017-04-12 04:12
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting ChristopherJam
Quoting Repose
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..


OK, done now and right you are; saved me 13½ cycles on average for my ten random cases (down to 747.2 cycles now), and only cost me an extra five cycles on my best case (up to 697)

I ended up replacing the lda#offset at the starts of each of the middle six columns with an sbc id+$ff-offset,x at the ends.

My code generator has fairly generic treatment of addends now, so it takes care of all the carry logic for me (getbspecs, emit and pointerReturner are helpers elsewhere in the generator). Core is now
for tb in range(8):
    bspecs=getbspecs(tb)
    doCounts=(tb<7)
    iv=initialValue[tb]
    doClears=len(bspecs)<4 and iv!=0

    op="lda"

    if tb==1:                                                                                  
        emit("    ldx#0")                                                          
    elif tb>1:
        emit("    txa")
        op="adc"
        if doCounts: emit("    ldx#0")

    if iv!=0:
        if doClears:
            bspecs=[pointerReturner( "#${iv:02x}".format(iv=iv), co=(iv>0xef))]+bspecs
        else:
            bspecs=bspecs+[pointerReturner( "id+${nv:02x},x".format(nv=0xff-iv), negate=True)]

    for n,s in enumerate(bspecs):
        addB(s, op, moveCarry=(n!=len(bspecs)-1), doCounts=doCounts, doClears=doClears)
        op="adc"

    emit("    sta mRes+{tb}\n\n".format(tb=tb))
2017-04-12 08:19
Repose

Registered: Oct 2010
Posts: 225
Beautiful! The only thing else I could see is to auto-generate for any precision, which isn't much of a leap. It should have an option to include the table-generator code too, and 6816(?).

You can estimate timing much the same way I do it by hand, by including counts per column based on the number of adds, then overhead per column, and total overhead. I use variables Px for branches, with an initial estimate of Px=.5 over all multiplies. I was gonna have a P generator given a stream of numbers too, but that's really fancy of course.

Was gonna say, put the carries in the table (there's even an opportunity for a table with built-in accumulate, which would be slightly faster).

I have bad news though: I have doubts about my sec fix for post-correction :( I'm just working that out now (it's simple though, just ensure the running total is offset by 1 for each carry, including the first). Did you verify the outputs?
2017-04-12 08:40
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)

Not major perhaps, a lot of time is spend on line drawing and EOR-filling anyway. But maybe 30%, i dunno. Long time ago now and I'm not really sure about how much time is spend on what.
2017-04-12 08:42
ChristopherJam

Registered: Aug 2004
Posts: 1409
Thanks!

I test the multiply using a set of 20 randomly selected 32 bit integers, comparing the product of each to a known result, and I time each multiply using CIA#2 with screen blanked and IRQs disabled. So, it's not comprehensive, but I'm reasonably confident.

Yes, generalising for a set of operand widths would probably be quite handy. Good idea instrumenting the generator to calculate the likely execution time. I've been meaning to do something similar with another project..

The post fixup is fine if it's the last addition performed, then the carry from that addition can propagate through to the next column.

Do you mean 65816? I started learning that once, but SuperCPU never really took off, and I tend to ignore most platforms between C64 and modern hardware these days, at least for hobby coding. REU last year was already something of a leap for me.
2017-04-12 12:14
Repose

Registered: Oct 2010
Posts: 225
Btw, one more idea: squaring and cubing based on this can be optimized significantly as well.
2017-04-13 08:42
Repose

Registered: Oct 2010
Posts: 225
Test your routine with these magic numbers:
00 00 54 56
00 00 03 03

If my theory is correct, that's the only case that will fail (not really, just the only one I'm bothering to solve for).
It's quite a subtle bug and random values won't catch it, you need 'whitebox' testing.

The failure is rare and of the form that the high byte is higher by 1.
2017-04-13 09:47
ChristopherJam

Registered: Aug 2004
Posts: 1409
Still works :)

Thanks for the test case.

Here are the first two columns of code (and remember that my g(x)=0x4000-f(x-255) ):
    clc 
    ldy mT2+0
    lda (zp_fl0),y
    adc (zp_gl0),y
    sta mRes+0


    ldx#0
    lda (zp_fh0),y
    adc (zp_gh0),y
    bcc *+3
    inx
    adc (zp_fl1),y
    bcc *+3
    inx
    adc (zp_gl1),y
    bcc *+3
    inx
    ldy mT2+1
    adc (zp_fl0),y
    bcc *+3
    inx
    adc (zp_gl0),y
    bcc *+3
    inx
    sbc id+$3f,x
    sta mRes+1


The inverse borrow from the final SBC carries forward to the next column; the SBC itself corrects for the false carries while also compensating for the excess 0x40 in the high byte of the g() table.
2017-04-13 10:18
Repose

Registered: Oct 2010
Posts: 225
Oh, I know why it works, I constructed those special values for the normal sense, I mean
   54 56
   03 03
   -----
   01 xx
00 ff
00 ff

The whole point was to get those 3 partials to be added, ff+ff+01. Where you are adding with an offset, I have to construct the multipliers differently. Not only that, but I'm doubly wrong here: I need to find multipliers which cause the f(x)'s to come out that way (my example only works on the result of f()-g()).

I'll have to finish this later. In the meantime, I suggest you test every possible 16x16. Not so easy, I know; I had to write such things in a 6502 simulator in C (actually I just simulated the few instructions I needed), but there's source code out there you could use for a full simulator.
2017-04-13 13:24
ChristopherJam

Registered: Aug 2004
Posts: 1409
I'm going to have to think some more about how to synthesise the equivalent test case.

I did start coding an exhaustive test in 6502 (you can determine the required result just by repeated adds; a*(b+1)=a*b+a), then realised it wasn't 2**16 test cases but 2**32. Even at 30x realtime that would take VICE 28 hours, assuming 700 cycles per iteration..
2017-04-14 00:55
Repose

Registered: Oct 2010
Posts: 225
00 01 02 03 * 04 05 06 07, and manipulate the tables to whatever you want; you want to test the adds for every branch, and every number of carries per column up to 14. I think that should do it.
2017-04-14 15:15
ChristopherJam

Registered: Aug 2004
Posts: 1409
Had a thought this morning - the difference of squares is already well established, the only thing that really needs testing is the carry handling for each column. I'll post about that over at sets of add/sub shortly.
2017-04-14 20:36
Repose

Registered: Oct 2010
Posts: 225
That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column.
2017-04-15 06:35
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quote: That's basically what I just said - multiplying is just adding from a table. Test coverage would include each carry and each amount of carries per column.

Fair point - I guess I got distracted by your talk of table manipulation.

Posting some analysis of the individual carries in the other thread shortly.

But back to multiplies - I was curious as to how you got away with not offsetting the g() table, then it finally struck me - using SBC instead of ADC is exactly equivalent to doing an ADC of a $ffff-g() table.

Do you have working code yet? I would expect you to need a different offset for each column.
2017-04-15 06:45
Repose

Registered: Oct 2010
Posts: 225
Just about to work out the subs, though I'm sure it works in some equivalent way, I'm thinking at most a sec or clc when switching between runs of adds and runs of subs. You can do one fixup at the end. The way I'm doing it makes sense too. No offsets needed.
(ps why did Ice T suddenly flash in my mind singing, no beepers needed?)
Sounds like mine is gonna be a lot cleaner, not to mention faster but we'll see :)
2017-04-15 11:19
ChristopherJam

Registered: Aug 2004
Posts: 1409
OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS

(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations.
2017-04-15 13:01
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS

(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations.


How does this compare to my stuff on Codebase? Also unsigned?
2017-04-15 14:22
ChristopherJam

Registered: Aug 2004
Posts: 1409
Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?

Unsigned, yes.
2017-04-15 14:47
Frantic

Registered: Mar 2003
Posts: 1648
10% faster ain't bad!
2017-04-15 16:49
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?

Unsigned, yes.


Nice!!! Havn't checked in detail, same table space overhead?
2017-04-15 16:55
ChristopherJam

Registered: Aug 2004
Posts: 1409
Thanks!

An extra 256 bytes for the id table, so five pages of tables altogether.
2017-04-15 18:48
ChristopherJam

Registered: Aug 2004
Posts: 1409
Actually, scratch that; for the 16x16->32 case, I only ever read from 13 bytes of the identity table.

Make that 2061 bytes of tables required. Also 16 bytes of zero page for pointers.
2017-04-15 20:39
Repose

Registered: Oct 2010
Posts: 225
Ok, I don't know who won because I'm counting things differently. I've always assumed branches are taken equally and averaged them. By that method, jackasser's is clearly 233+12 for jsr/rts=245.

One of the alternate versions I already have written is 207+12=219.

Yet, you reported 241 for JackAssers which is 3 less, and if you scale the same way, I get 219 for yours, exactly the same. As far as I can tell, we're tied.

They are not the same at all. My alternate version is very straightforward and doesn't use the repeated add technique. Still working on that (sorry I'm so slow). We're tied, but this isn't even my final entry.
JackAsser's
setup   74
mults  116
adds    43
total  233
2017-04-15 21:08
ChristopherJam

Registered: Aug 2004
Posts: 1409
I hadn't yet counted the cycles in the source yet, so I just timed both my and JackAsser's routines with CIA for ten randomly selected pairs of numbers.

Now I've annotated my source, if I assume branches are taken equally and page crossings on table lookups also cost on average half a cycle, I get 206.5+12=218.5

So yes, ridiculously close at the moment given the different approaches I gather we've taken.

I can shave another two cycles off mine, but only if I bump the memory back up to 5 pages of tables again. (or rather, I need to place a 7 byte table somewhere in the last 64 bytes of a page... long story..)

I gather you have something better in the wings; I look forward to seeing it!
2017-04-15 21:21
Repose

Registered: Oct 2010
Posts: 225
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.
2017-04-15 21:25
ChristopherJam

Registered: Aug 2004
Posts: 1409
My current best:
mul1616
    lda mT1+ 0      ; 3
    sta zp_fl0      ; 3
    sta zp_fh0      ; 3
    eor#255         ; 2
    sta zp_gl0      ; 3
    sta zp_gh0      ; 3

    lda mT1+ 1      ; 3
    sta zp_fl1      ; 3
    sta zp_fh1      ; 3
    eor#255         ; 2
    sta zp_gl1      ; 3
    sta zp_gh1      ; 3

    clc             ; 2
    ldy mT2+0       ; 3
    lda (zp_fl0),y  ; 5.5
    adc (zp_gl0),y  ; 5.5
    sta mRes+0      ; 3

    ldx#0           ; 2
    lda (zp_fh0),y  ; 5.5
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+1       ; 3
    adc (zp_fl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_3f,x  ; 4
    sta mRes+1      ; 3

    txa             ; 2
    ldx#$bf         ; 2
    adc (zp_fh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+0       ; 3
    adc (zp_fh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_80-$bf,x  ; 4
    sta mRes+2      ; 3

    txa             ; 2
    ldy mT2+1       ; 3
    adc (zp_fh1),y  ; 5.5
    clc             ; 2
    adc (zp_gh1),y  ; 5.5
    sta mRes+3      ; 3

    rts

    ; total=204.5

ofste_3f
	.byt $3f,$40,$41,$42,$43,$44

    .dsb <$bf-*,0
ofste_80
	.byt $80,$81,$82,$83,$84,$85,$86


(and of course a one off init of the high bytes of the zero page pointers, and square tables as follows:
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
2017-04-15 21:28
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting Repose
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.


Haha, then I should probably acknowledge that my ~750 cycle 32x32->64 requires 32 bytes of zero page pointers, on top of the 16 bytes of zero page split between the multiplier, multiplicand and result :)
2017-04-15 22:26
Repose

Registered: Oct 2010
Posts: 225
Wow, you really squeezed that dry - good job! Even though my approach is different (in the runs of adds vein), it follows the same pattern as yours, and thus will turn out almost exactly the same time.

There's other approaches to faster adds though.
2017-04-16 05:27
Repose

Registered: Oct 2010
Posts: 225
I think I have 195, but it requires a different approach. You're going to have to wait another day though :)
2017-04-17 09:30
Repose

Registered: Oct 2010
Posts: 225
Finally, after fixing a bad sqrtab...

The World's Fastest Published 16x16 Unsigned Mult on 6502.

Testing methodology
-------------------
To measure a particular multiply, you can enter the monitor in vice and type (where pc is your start address):
r pc=c000

Then type:
z

and keep hitting enter to step through the code. Copy and paste the first line. When you've reached the line *after* the last one you want to measure, copy and paste that too (also include any points in between). You can now subtract the two cycle counts to get the timing.

Example:
LDA $FB        - A:00 X:0A Y:00 SP:eb ..-..IZ.   72210316
CLC            - A:00 X:0A Y:00 SP:eb ..-..IZC   72210476
RTS            - A:00 X:0A Y:00 SP:eb ..-..IZ.   72210520

This shows me that my multiply body was 160 cycles and the adds were 44 cycles, for a total of 204 (never mind the slow times, I had nothing in zp for testing purposes).

I used that as a guide, but really I added by hand and averaged page crossings and branches for the reported total.

Tell me the speed
-----------------
158 cycles for the multiply part, with no variation, and the inputs/outputs in zp.
43 cycles for the final additions, with each branch equally likely.
The total is 201. However, if you include the simple variation which requires that part of the code is in zp, you save 3 cycles for a total of 198 (I just wanted to say I could break 200).
-add 12 for jsr/rts. I report this way to be consistent with CJ's results above-

The Code
--------
;World's fastest 16x16 unsigned mult for 6502
;you can go faster, but not without more code and/or data
;and being less elegant and harder to follow.
;by Repose 2017

;tables of squares
;sqr(x)=x^2/4
;negsqr(x)=(255-x)^2/4
sqrlo=$c000;511 bytes
sqrhi=$c200;511 bytes
negsqrlo=$c400;511 bytes
negsqrhi=$c600;511 bytes

;pointers to square tables above
p_sqr_lo=$8b;2 bytes
p_sqr_hi=$8d;2 bytes
p_invsqr_lo=$8f;2 bytes
p_invsqr_hi=$91;2 bytes

;the inputs and outputs
x0=$fb;multiplier, 2 bytes
x1=$fc
y0=$fd;multiplicand, 2 bytes
y1=$fe
z0=$80;product, 4 bytes
z1=$81
z2=$82
z3=$83

;not shown is a routine to make the tables
;also you need to init the pointers' high bytes to the tables

umult16:
;set multiplier as x0
lda x0
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17

ldy y0
sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y;note these two lines taken as 11 total
sta z0;x0*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c1a+1;x0*y0h;31
;c1a means column 1, row a (partial product to be added later)

ldy y1
;sec  ;notice that the high byte of sub above is always +ve
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1b+1;x0*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2a+1;x0*y1h;31

;set multiplier as x1
lda x1
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17

ldy y0
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1c+1;x1*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2b+1;x1*y0h;31

ldy y1
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c2c+1;x1*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta z3;x1*y1h;31

;4*31+2*17 so far=158
;add partials
;-add first two numbers in column 1
;jmp do_adds;put in zp to save 3 cycles :)
do_adds:
clc
c1a lda #0 
c1b adc #0;add first two rows of column 1
sta z1;9
;-continue to first two numbers in column 2                   
c2a lda #0
c2b adc #0 
sta z2;7
bcc c1c;3 taken/9 not taken, avg 6 
inc z3
clc
;-add last number of column 1 (row c)
c1c lda #0 
adc z1
sta z1;8
;-add last number of column 2
c2c lda #0 
adc z2
sta z2;8
bcc fin;3/7 avg 5
inc z3
;9+7+6+8+8+5=43
fin rts
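(Not part of the routine: a small Python reference model of the same partial-product/column layout, handy for cross-checking results from VICE or a simulator. Each square-table lookup pair is modelled as a plain 8x8 multiply; names are mine.)

def umult16_model(x, y):
    x0, x1 = x & 0xff, x >> 8
    y0, y1 = y & 0xff, y >> 8
    p00, p01, p10, p11 = x0*y0, x0*y1, x1*y0, x1*y1               # the four 8x8 partials
    z0   = p00 & 0xff
    col1 = (p00 >> 8) + (p01 & 0xff) + (p10 & 0xff)               # c1a + c1b + c1c
    col2 = (p01 >> 8) + (p10 >> 8) + (p11 & 0xff) + (col1 >> 8)   # c2a + c2b + c2c + carry
    z1, z2 = col1 & 0xff, col2 & 0xff
    z3   = (p11 >> 8) + (col2 >> 8)
    return z0 | (z1 << 8) | (z2 << 16) | (z3 << 24)

assert all(umult16_model(a, b) == a * b
           for a in (0, 1, 0x5456, 0xffff) for b in (0, 3, 0x0303, 0xffff))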
2017-04-17 18:54
JackAsser

Registered: Jun 2002
Posts: 2014
Repose, really nice!

Some further optimization:

sta c2b+1;x1*y1h;31 => tax

c2a lda #0
c2b adc #0 =>

c2b txa
c2a adc #0

But I somehow like the fact that X is kept clean otoh.
2017-04-17 18:56
JackAsser

Registered: Jun 2002
Posts: 2014
Also if z3 is assume in the y-reg when done instead:

sta z3;x1*y1h;31 => tay

all inc z3 => iny

I think that's ok since you'd probably do some lda z3 afterwards anyway, and instead you have it readily available in Y.
2017-04-17 19:27
Repose

Registered: Oct 2010
Posts: 225
Thanks :) I screwed up my timing though, the first multiply part is 33 not 31 because of the single SEC, bringing my total to 203 (and VICE was right, the multiply part is 160). Because of this, I think I'll use your optimizations to bring it back down to 201. Therefore it still stands :)
2017-04-17 20:54
JackAsser

Registered: Jun 2002
Posts: 2014
Also in a real life situation, f.e. subpixel vectors you'd only keep z3 and z2 as a screen limited 8.8 result. Typically one would do:

Rotated x,y,z in 8.8
Rotated z in 8.8
Reciprocal z = 1/z = 0.16

Perspective:
8.8 * 0.16 = 8.24 (keep only the top 8.8, i.e. pixel and the bresenham initial error)
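(In Python terms, assuming the fixed-point formats above; the numbers are made-up examples, not values from Andropolis.)

def perspective(coord_8_8, recip_z_0_16):
    # 8.8 * 0.16 -> 8.24; keep the top 16 bits as the 8.8 result
    return (coord_8_8 * recip_z_0_16) >> 16

z     = int(2.5 * 256)        # rotated z in 8.8
recip = (1 << 24) // z        # 1/z in 0.16 format
x     = int(40.0 * 256)       # rotated x in 8.8
print(perspective(x, recip) / 256)  # ~15.99, i.e. 40/2.5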
2017-04-17 21:11
Repose

Registered: Oct 2010
Posts: 225
Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).

So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.

Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.

Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.
2017-04-17 22:24
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).

So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.

Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.

Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.


I've had similar ideas and I even did a frame-difference experiment in Java. Problem was that since the diffs are small, so will the triangles be. They'll be extreme, sharp and 'problematic'. Hard to render correctly, and since they're diffs, any render error will accumulate.

Regarding transforms I came to the conclusion to cut most of the stack and forget about how it works conventionally.

Regarding divs we all do mul by the reciprocal. For Andropolis I did what Graham did and calced the reciprocal by linear interpolation:

X is 8.8 call it a.b:

1/a.b ~= invtab[a]*(1-b) + invtab[a+1]*b

invtab[x] is 0.16 result of 65536/x for x 1..255
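(A rough Python rendering of that interpolation, the way I read it, with b scaled as a fraction of 256; invtab is extended to index 256 here so the a+1 lookup always works.)

invtab = [0] + [65536 // i for i in range(1, 257)]   # 1/x in 0.16, x = 1..256

def recip_8_8(x_8_8):
    # approximate 65536/x for an 8.8 input by linear interpolation
    a, b = x_8_8 >> 8, x_8_8 & 0xff
    return (invtab[a] * (256 - b) + invtab[a + 1] * b) >> 8

x = int(40.5 * 256)
print(recip_8_8(x), (1 << 24) // x)   # 1618 1618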
2017-04-17 23:35
Repose

Registered: Oct 2010
Posts: 225
You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.

I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?
2017-04-18 04:58
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.

I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?


Hehe, I suck at writing articles! :)

Imagine the simplest case: axis-aligned rooms, no height differences and just rotating around Y, i.e. Wolfenstein.

The diff formed by the edge between the wall and the floor/ceiling will be an elongated thin triangle. The diff might be 3-4 pixels high close to the camera, which is easy, but further down the corridor the height of the diff will be <1, yet pixels here and there must still be put. Surely an extended bresenham that "draws" both the top and the bottom line simultaneously could handle it, calculating the total height between the lines while still propagating the individual errors.
2017-04-18 09:28
Repose

Registered: Oct 2010
Posts: 225
Ok, I think that makes sense now.

Now 196
I managed to save another 7 cycles from the correct total of 203, bringing it down to 159+37=196. As a bonus, the two highest bytes are returned in registers.

If I put do_adds in zp, there's one more optimization to save 4 instead of 3, for 192. Finally, if you don't need the lowest bytes, you can save 3 cycles by deleting sta z1.
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay;x1*y1h;Y=z3, 30 cycles

do_adds:
;-add the first two numbers of column 1
	clc
c1a:	lda #0
c1b:	adc #0
	sta z1;9

;-continue to first two numbers of column 2
c2a:	lda #0
c2b:	adc #0
	tax;X=z2, 6 cycles
	bcc c1c;3/6 avg 4.5
	iny;z3++
	clc

;-add last number of column 1
c1c:	lda #0
	adc z1
	sta z1;8

;-add last number of column 2
	txa;A=z2
c2c:	adc #0
	tax;X=z2, 6
	bcc fin;3/4 avg 3.5
	iny;z3++

;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin:	rts
2017-04-18 13:11
Oswald

Registered: Apr 2002
Posts: 5094
I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?
2017-04-18 13:43
ChristopherJam

Registered: Aug 2004
Posts: 1409
Damn, so you're now 8.5 cycles faster than me? I was not expecting partial products to be faster than the optimisations we've been working on for runs of adc. Going to have to study this more closely.

Canada holds the record, for now. Nice work!
2017-04-18 17:31
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?

Adding in one go would lose the carry information
2017-04-20 11:57
Repose

Registered: Oct 2010
Posts: 225
To be fair, CJ can save 2 cycles by returning in regs like me, then they are just 6.5 or 10.5 cycles apart. Most of our instructions are the same, there's just a difference in overhead. But I can't seem to find any idea that's faster; this could be the actual optimal routine.

I tried all branches for the adds, that's a bit slower. I mean code like:
clc
lda c1a
adc c1b
bcs c1bc	c1bc:
		clc
adc c1c	        adc c1c
sta z1		sta z1
		lda #1
lda c2a	        adc c2a
		bcs c2ac;7	c2ac:
				clc
adc c2b	        adc c2b	        adc c2b

And it did seem to work, but I get about 39.5 cycles.

I also looked into the crazy timer idea, but each correction is sbc timerA, which is already too slow for this small number of adds.

I have one idea that can be dramatically faster but it's not very practical.
;y0 in X, y1 in Y, x0-1, x1-1 init to $02, all ram swapped in
x0=$fc;fb-fc is pointer
x1=$fe;fd-fe is pointer
jmp (x0-1)
mult-k:
;multiply by constant multiplier k
;and multiplicands X, Y
;then add
rts

There's never any setups so that saves 34 cycles, many of the multipliers can be optimized, though not reducing the average much, important cases like 0,1, and powers of 2 could be faster.

You also need room for tables so it could only handle 7 bits at a time, unless you did each one in pure code.

Even with such an extreme approach, the worst case is probably close to a normal multiply.

What do you think, is this the end?
2017-04-20 13:22
ChristopherJam

Registered: Aug 2004
Posts: 1409
Well, without memory restrictions you could do crazy stuff like have variants of the f() or g() tables offset by 1..N so you could move the carry correction into selecting which table to use... but I'm not sure how much that would gain you.
2021-03-02 08:14
Strobe

Registered: Jun 2007
Posts: 6
Thanks Repose, ChristopherJam etc for a great 16x16 routine!

I only really needed an 8x16=24 multiply, so I've been hacking Repose's one down to what I settled on below.
Uses the table & ZP setup of the original.
At ~89 cycles (?) and 54 bytes it's twice as fast and half the size of the 16x16 one. Also doesn't touch X.
See my thoughts on tweaks in the comments at the end, but I'd like to get it faster and feel there are some obvious improvements staring me in the face which I'll be happy for others to point out :)
;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
;set multiplier as x0
    lda x0              ;comment out and call with A=x0 -2b3c
    sta p_sqr_lo
    sta p_sqr_hi
    eor #$ff
    sta p_invsqr_lo
    sta p_invsqr_hi;17

    ldy y0              ;comment out and call with Y=y0 -2b3c
    sec
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y;note these two lines taken as 11 total
    sta z0;x0*y0l       comment out if you don't care about z0, -2b3c, OR
    lda (p_sqr_hi),y    ; ..replace with TAX -1b1c (but destroys X obviously)
    sbc (p_invsqr_hi),y
    sta c1a+1;x0*y0h;33
;c1a means column 1, row a (partial product to be added later)

    ldy y1
    ;sec  ;notice that the high byte of sub above is always +ve
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y
    sta c1b+1;x0*y1l
    lda (p_sqr_hi),y
    sbc (p_invsqr_hi),y
    tay ;Y=c2a;x0*y1h; 29 cycles

;17+33+29=79 cycles for main multiply part
do_adds:
;-add the first two numbers of column 1
        clc
c1a:	lda #0
c1b:	adc #0  ;A=z1  6 cycles

;-continue to column 2
        bcc fin   ;2+
        iny  ;add carry
        ;Y=z2, 4-5 cycles (avoiding page boundary cross)
;6+(4+)=10-11 cycles for adder
;total 89-90 cycles
fin:    rts

;Diagram of the additions for 16x8=24
;                y1    y0
;             x        x0
;                --------
;             x0y0h x0y0l
;+      x0y1h x0y1l
;------------------------
;          z2    z1    z0 

;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c  (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c  (*)
;   OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
;   suggests because you would lose the 2 cycles you gained with the JMP, BUT..
;   ..you could put the ENTIRE ROUTINE in zero page, -2b2c
;   AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
;   with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
;   extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
;   (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
2021-03-02 10:47
Krill

Registered: Apr 2002
Posts: 2980
Not quite on topic (sorry), but i often find myself needing a "downscaling MUL": 8x16=16, with the 8-bit factor scaling the 16-bit argument by [0..1). (Basically, the lowmost 8 bits of the 24-bit result are discarded.)

Then i re-invent the wheel every time, but it seems to get rounder with each iteration. =)

What would the required resources (cycles, memory) for this case in your optimised routine be?
2021-03-03 07:10
Oswald

Registered: Apr 2002
Posts: 5094
It's there; he writes: comment out sta z0 if you only need 16 bits. "Experts" could comment out the whole calc of z0, it's just 1 bit of precision loss out of 16; depends on what you use it for whether that 1 bit is needed or not.
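(In Python terms, what Krill's "downscaling MUL" computes, treating the 8-bit factor f as f/256; the function name is mine.)

def scale_8x16(factor_8, value_16):
    # keep only the top 16 bits of the 24-bit product (z2:z1 in the routine above)
    return (factor_8 * value_16) >> 8

print(scale_8x16(0x80, 1000))    # 500   (scale by 0.5)
print(scale_8x16(0x40, 60000))   # 15000 (scale by 0.25)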
2021-11-28 15:19
Repose

Registered: Oct 2010
Posts: 225
I'm back! What a coincidence, I decided to work on this again. Changelog:
-Using advanced ACME assembler
-Macros for calling/return conventions, lowered precision
-Improved label names, formatting/style, comments
-Proper test suite
-small unsigned 8x8=16, fast unsigned 8x8=16, fast unsigned 16x16=32. Will add fast unsigned 8x16=24, fast unsigned 8x16=16
-All routines separated to library, include what you want with options you want

Currently working on BASIC multiply replacement, just for fun. How fast will a FOR/NEXT loop be?
2021-11-28 15:23
Repose

Registered: Oct 2010
Posts: 225
This is beta, needs cleaning up, but should work. It's down to 187 cycles apparently. This isn't the more flexible library version, just what I have that works right now.

; unsigned integer multiplication library

x0=$fb
x1=$fc
y0=$fd
y1=$fe
p_sqr_lo=$8b ;2 bytes
p_sqr_hi=$8d ;2 bytes
p_negsqr_lo=$a3 ;2 bytes
p_negsqr_hi=$a5 ;2 bytes

sqrlo=$c000 ;511 bytes
sqrhi=$c200 ;511 bytes
negsqrlo=$c400 ;511 bytes
negsqrhi=$c600 ;511 bytes

!macro mult8_snippet mand,low,high {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, high
;note: it's up to the caller to SEC, as some can be left out
ldy mand
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
sta high ;multiplier * Y, high byte
;11+4+11+4 = 30 cycles
}

!macro mult8_reg_snippet mand,low,reg {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, register
;note: it's up to the caller to SEC, as some can be left out
!if mand="Y" {
} else {
ldy mand
}
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
;multiplier * Y, high byte
!if reg="X" {
tax
} else if reg="Y" {
tay
}
;11+4+11+2 = 28 cycles
}

!macro mult8_set_mier_snippet mier {
;set multiplier as mier
lda mier
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_negsqr_lo
sta p_negsqr_hi ;17 cycles
}

makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta p_sqr_lo+1
lda #>sqrhi
sta p_sqr_hi+1
lda #>negsqrlo
sta p_negsqr_lo+1
lda #>negsqrhi
sta p_negsqr_hi+1

;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
lb1: tya
adc #$00
ml1: sta sqrhi,x
tay
cmp #$40
txa
ror
ml9: adc #$00
sta ml9+1
inx
ml0: sta sqrlo,x
bne lb1
inc ml0+2
inc ml1+2
clc
iny
bne lb1

;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
mt1:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne mt1
rts

umult16:
;unsigned 16x16 mult
;inputs:
; x0 x1
; y0 y1
;outputs:
; z0 z1 A Y
;set multiplier as x0
+mult8_set_mier_snippet x0

;z0 = low(x0*y0)
;X = high(x0*y0)
sec
+mult8_reg_snippet y0,z0,"X"

;x0_y1l = low(x0*y1)
;x0_y1h = high(x0*y1)
+mult8_snippet y1,x0_y1l+1,x0_y1h+1

;set multiplier as x1
+mult8_set_mier_snippet x1
;x1_y0l = low(x1*y0)
;x1_y0h = high(x1*y0)
+mult8_snippet y0,x1_y0l+1,x1_y0h+1

;x1_y1l = low(x1*y1)
;Y = high(x1*y1)
+mult8_reg_snippet y1,x1_y1l+1,"Y"

;17+2+28+30+17+30+28 = 152 cycles for main multiply part

;add partials

;add the first two numbers of column 1
clc
x0_y0h: txa
x0_y1l: adc #0
sta z1 ;9 cycles (z1 zp)

;continue to first two numbers of column 2
x0_y1h: lda #0 ;includes carry from previous column
x1_y0h: adc #0
tax ;X=z2 so far, 6 cycles
bcc x1_y0l ;9/12 avg 10.5 cycles
;Y=x1*y1h
iny ;x1y1h/z3 ++
clc

;add last number of column 1
x1_y0l: lda #0
adc z1 ;contains x0y0h + x0y1l
sta z1 ;8 cycles

;add last number of column 2
txa ;contains x0y1h + x1y0h
x1_y1l: adc #0
;A=z2
bcc fin ;7/8 avg 7.5
iny ;z3++
;Y=z3

;Y=z3, A=z2
;9+10.5+8+7.5=35
;total = 152+35 = 187 cycles
sta z2 ;needed for the test routine's calling conventions
tya
fin: rts
2021-11-28 19:58
JackAsser

Registered: Jun 2002
Posts: 2014
Great stuff Repose. How convenient it is that I’m working on 3D stuff now… ;)
2021-11-30 18:23
Quiss

Registered: Nov 2016
Posts: 43
Nice work! I like the use of (ind), y for the fast inlined multiply, instead of the 4 x "sta zp+x"; "jsr zp" that I've seen some demos use.

Also, JackAsser working on 3D stuff makes me nervous. I still haven't digested the hidden line vectors in 1991.
2021-12-10 16:56
Martin Piper

Registered: Nov 2007
Posts: 722
This thread is an interesting coincidence as I've been thinking about C64 3D again. :)
2021-12-14 10:29
Repose

Registered: Oct 2010
Posts: 225
I wrote an alpha 16x16=32 unsigned, it's 1024 cycles, but if you inline the jsr umult16, it should be 930 cycles. Now I'm exploring not using umult16's at all, but to consider it as a large set of 8bit multiplies; the point is to optimize setting the multiplier (the high bytes of f(x)). I also am playing with the "zip" adds (where I add from right to left then go back to the right, next row), and optimizing register use, basically tax:txa for saving your spot in the beginning column.

The other news is, by using a gray code for the order of operations, I saved a ldy so my umult16 is 195 now.

I have even better ideas; using Knuth's equivalence I can do just 3xumult16.

con't...
2021-12-14 11:33
Repose

Registered: Oct 2010
Posts: 225
Knuth's idea

Given x=x1*256+x0, y=y1*256+y0
 y1 y0
*x1 x0

Calculation       Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1           0-FE01
m=y0*x0           0-FE01
n=l+m-k           0-1FC02

l+m               0-1FC02
l-k or m-k        +-FE01

x*y=l*65536+n*256+m (for 8 bit Xn/Yn)

Example (16x16 bits):
 y1 y0 ->$20 10
*x1 x0    40 30

k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00

llll0000 -> 08000000
 nnnnn00     00a0000
    mmmm        0300
----------------
2010*4030 = 080a0300

x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)

example with 16-bit values:
2000 1000
4000 3000
k=1000*1000=0100 0000
l=2000*4000=0800 0000
m=1000*3000=0300 0000
n=0A00 0000

800 0000 0000 0000
     a00 0000 0000
          300 0000
------------------
800 0a00 0300 0000

If multiplies are expensive, this is faster. Estimating 32x32=64;

3x umult16 ~585
adds        ~94
total      ~679
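
A quick Python check of the identity (not meant as an implementation, just to convince yourself that n really is the sum of the two cross products):

import random

def knuth_mul16(x, y):
    x1, x0 = x >> 8, x & 0xff
    y1, y0 = y >> 8, y & 0xff
    k = (y1 - y0) * (x1 - x0)   # can be negative, range +-$FE01
    l = y1 * x1
    m = y0 * x0
    n = l + m - k               # = y1*x0 + y0*x1 (the two cross products)
    return (l << 16) + (n << 8) + m

assert knuth_mul16(0x2010, 0x4030) == 0x2010 * 0x4030 == 0x080a0300
for _ in range(10000):
    a, b = random.randrange(65536), random.randrange(65536)
    assert knuth_mul16(a, b) == a * b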
2021-12-14 11:40
Repose

Registered: Oct 2010
Posts: 225
About gray codes, ref.
https://en.wikipedia.org/wiki/Gray_code#Constructing_an_n-bit_G..

The use of gray codes is to minimize setting the multiplier/multiplicand, i.e.
;set multiplier as mier
lda mier ;3
sta p_sqr_lo ;3 (often published as f(x))
sta p_sqr_hi ;3
eor #$ff ;2
sta p_negsqr_lo ;3 (also known as g(x))
sta p_negsqr_hi ;3; 3+3+3+2+3+3 = 17 cycles

and

;set multiplicand
ldy mand ;3

In the following table, order 01 would mean change y first. Across the top are the starting variables; i.e. 00 means start with setting x0 and y0 as m'ier and m'and.
		start xy
order xy	00	01	10	11
01	00	01	10	11
	01	00	11	10
	11	10	01	00
	10	11	00	01

10	00	01	10	11
	10	11	00	01
	11	10	01	00
	01	00	11	10

The gray code sequence can be best understood by following these values in a circle; with a starting point and a direction of clockwise or counter-clockwise:
	01
      00  11
	10

there's two ways to end with the high bytes;
start with x0*y1 and change y first, or
start with x1*y0 and change x first. The reason I want to end with the high bytes is to return them in registers, and these bytes are most useful if you only want 16x16=16.

x0*y0
x0*y1 ;change y first i.e. ldy y1
x1*y1 ;change x with +set_mier x1
x1*y0

x0 y0
x1 y0 ;change x first
x1 y1
x0 y1

x0 y1
x0 y0 ;change y first
x1 y0
x1 y1

x0 y1
x1 y1 ;change x first
x1 y0
x0 y0

x1 y0
x1 y1 ;change y first
x0 y1
x0 y0

x1 y0
x0 y0 ;change x first
x0 y1
x1 y1

x1 y1
x1 y0 ;change y first
x0 y0
x0 y1

x1 y1
x0 y1 ;change x first
x0 y0
x1 y0

This is all assuming you are storing each partial result into a self-mod summing routine to come after, in which case the order of operations of the multiplies doesn't matter. If you were adding the results during the multiplies, you couldn't make use of this trick to minimize setting of the m'ier.
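
To put a number on it, here's a little Python scorer for the setup cost of an ordering (using the figures above: 17 cycles whenever the m'ier changes, 3 for the ldy whenever the m'and changes). The gray-code orders save one ldy over the naive row-by-row order:

def setup_cost(order):
    # order = sequence of (mier, mand) pairs, e.g. [(0,0),(0,1),(1,1),(1,0)]
    cost, cur_mier, cur_mand = 0, None, None
    for mier, mand in order:
        if mier != cur_mier:
            cost += 17          # set_mier snippet
        if mand != cur_mand:
            cost += 3           # ldy mand
        cur_mier, cur_mand = mier, mand
    return cost

print(setup_cost([(0,0), (0,1), (1,1), (1,0)]))  # gray-code order:  43 cycles
print(setup_cost([(0,0), (0,1), (1,0), (1,1)]))  # row-by-row order: 46 cycles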
2021-12-14 12:27
Repose

Registered: Oct 2010
Posts: 225
Survey of multiply methods/ideas

If I really want to explore every way that could be the fastest, here's a list of ideas:

identity mults
1 - sin(a)*cos(b)=(sin(a+b)+sin(a-b))/2
2 - cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
3 - sin(a)*sin(b)=(cos(a-b)-cos(a+b))/2
4 - a*b=exp(log(a)+log(b))
5 - a*b = [(a + b)² - (a - b)²]/4
6 - k=(y1-y0)*(x1-x0), l=y1*x1, m=y0*x0, n=l+m-k, x*y=l*65536+n*256+m

Karatsuba mult

y1 y0
x1 x0
-----
z2=x1*y1
z0=x0*y0
z1=(x1+x0)*(y1+y0)-z2-z0

Booth multiply

Booth-2 is (n+1)=3 bits a time, overlapping by 1
which is how many?
aaa
  bbb
    ccc
      dd_

or 4

nybble mult
use a table of every a*b where a,b are 4-bit numbers.

residue number system
a set of modulo numbers can represent any number
you can add the residues separately without carry.
example of a (3,5) residue number system, supports 0-14:
n	x1	x2
0	0	0
1	1	1
2	2	2
3	0	3
4	1	4
5	2	0
6	0	1
7	1	2
8	2	3
9	0	4
10	1	0
11	2	1
12	0	2
13	1	3
14	2	4

you can pick better moduli; like 2^k and 2^k-1, like 256 and 255 which are co-prime. It encodes 65280 values.
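
A tiny Python illustration of the (3,5) system above (not 6502 code, just to show the carry-free add/mul and reading the result back off the table):

M = (3, 5)   # pairwise co-prime moduli, covers 0..14

def to_rns(n):
    return tuple(n % m for m in M)

def rns_add(a, b):
    return tuple((i + j) % m for i, j, m in zip(a, b, M))

def rns_mul(a, b):
    return tuple((i * j) % m for i, j, m in zip(a, b, M))

def from_rns(r):
    # brute-force lookup, same as reading the table backwards
    return next(n for n in range(3 * 5) if to_rns(n) == r)

assert from_rns(rns_add(to_rns(4), to_rns(7))) == 11   # (1,4)+(1,2) -> (2,1)
assert from_rns(rns_mul(to_rns(3), to_rns(4))) == 12   # (0,3)*(1,4) -> (0,2)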

REU/RAMLink multiply
Address Description
------- -----------
$de00   256 bytes of data (See $dfc0-$dfc3 for more information)
$df7e   write to this location to activate the RAMLink hardware
$df7f   write to this location to deactivate the RAMLink hardware.
$dfa0   lo byte of requested RAMCard memory page
$dfa1   hi byte of requested RAMCard memory page
$dfc0   write to this location to show RL variable RAM at $de00 (default)
$dfc1   write to this location to show RAMCard memory at $de00
$dfc2   write to this location to show the RAM Port device $de00 page at $de00
$dfc0   write to this location to show Pass-Thru Port dev. $de00 page at $de00

;8bit mult, of x*y to z0;z1
rl_page=$de00
rl_sel_ramcard=$dfc1
rl_lo=$dfa0
rl_hi=$dfa1
rl_activate=$df7e
rl_deactivate=$df7f
sta rl_activate;4
stx rl_lo;4
lda #0;2
sta rl_hi;4
sta rl_sel_ramcard;4
lda rl_page,x;5?
sta z0;3
inc rl_hi;is this r/w? 6
lda rl_page,x;5
sta z1;3
sta rl_deactivate;4
rts

total 42

sine mult example
cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
x	cos(x)
-pi/2	0
0	1
pi/2	0
pi	-1
3pi/2	0
2pi	1

example, .5 * .75
cos-1(.5)=1.0471975511965977461542144610932
cos-1(.75)=0.72273424781341561117837735264133
a+b=1.7699317990100133573325918137345
a-b=0.32446330338318213497583710845183
cos(a+b)=-0.197821961869480000823505899216
cos(a-b)=0.947821961869480000823505899216
cos(a+b)+cos(a-b)=.75
/2=.375

Convolution mult
Lost my notes here - something starting with W... a lot of adds for a few mults.
https://en.wikipedia.org/wiki/Convolution#Fast_convolution_algo..

Neural net multiplication
not much of a gain

direct massive lookup tables
various low-level code optimizations
illegal opcodes
order of operations
branch-tree optimizations (keeping state in the code location, i.e. branch of every possible multiplier etc.)

Applications of multiplication
-fast way to arrange complex multiply
-fast matrix multiply
-using inverse tables to divide

Different types of number systems
-modular arithmetic, Chinese Remainder system
-storing numbers as primes
-keeping numbers as fractions
-2's complement, 1's complement, offset
2021-12-15 09:01
Martin Piper

Registered: Nov 2007
Posts: 722
I've been pondering a hardware multiply...
Write control byte: xy
y = bytes for operand 1
x = bytes for operand 2

Write bytes in MSB for operand 1 then 2.

Then read bytes in MSB order for multiply result.

For example:
Write hex: 21 47 23 01
Read: b5 50 00
2021-12-15 11:59
Repose

Registered: Oct 2010
Posts: 225
Me too, just yesterday!
https://media.digikey.com/pdf/Data%20Sheets/Parallax%20PDFs/uM-..

The uM-FPU is a 32-bit floating point coprocessor that can be easily interfaced with microcontrollers to provide
support for 32-bit IEEE 754 floating point operations and long integer operations. The uM-FPU is easy to
connect using either an I2C or SPI compatible interface.

Which can be connected to the user port. I believe the serial port would work, then it's 16 cycles per byte.
2021-12-15 13:23
Krill

Registered: Apr 2002
Posts: 2980
The only acceptable arithmetics accelerator is the 6502 in the disk drive. =)
2021-12-15 15:18
Martin Piper

Registered: Nov 2007
Posts: 722
Quote: Me too, just yesterday!
https://media.digikey.com/pdf/Data%20Sheets/Parallax%20PDFs/uM-..

The uM-FPU is a 32-bit floating point coprocessor that can be easily interfaced with microcontrollers to provide
support for 32-bit IEEE 754 floating point operations and long integer operations. The uM-FPU is easy to
connect using either an I2C or SPI compatible interface.

Which can be connected to the user port. I believe the serial port would work, then it's 16 cycles per byte.


I was thinking more like a MC68882 :)
But finding something similar operating at 5V, with parallel address/data rather than serial and still in production, is tough.
2021-12-15 19:52
Repose

Registered: Oct 2010
Posts: 225
Some kind of extra SPI to parallel circuit would help. Since the chip I mentioned has 4MHz serial speed, it's fast enough to read a parallel port at full speed.
https://highfieldtales.wordpress.com/2011/12/04/very-fast-spi-t..

In the end, what do we want though? You could make a GPU cart for the c64, which runs its own 2d/3d graphic engine, where you simply define the objects. The motions, display updates, collisions, could all be handled automatically; all you need to do is keep score, play music, run the display mode, and handle the joystick. Just like Blue REU, the output could be stuffed directly to display memory using an advanced FLI mode with practically full color. It would be like a brand new game machine in 160x200x16.

Or, it could be an assist, using memory mapped registers as you describe, and leave the c64 more in control.

I think of this a lot - do I even want to buy a c64 ultimate? It's convenient and fun to play demos and games with full power, but I think I still want to do obscure hardware hacks and real assists. For example, I've always wanted to use the CIA to measure variations in 50/60 Hz power line frequency. I want to attach the serial port to an SD card and bit bang the storage.
2021-12-15 20:27
tlr

Registered: Sep 2003
Posts: 1790
Quoting Repose
Some kind of extra SPI to parallel circuit would help. Since the chip I mentioned has 4MHz serial speed, it's fast enough to read a parallel port at full speed.
https://highfieldtales.wordpress.com/2011/12/04/very-fast-spi-t..

You do already have two CIA synchronous serial ports though, maybe they could be used?
2021-12-16 00:53
Repose

Registered: Oct 2010
Posts: 225
That's what I suggested earlier - just to be clear, Martin wanted a fast, efficient ALU ability by writing arguments to memory, which would be latched until the LSB was written, then immediately read the product back. I digressed a little by suggesting a GPU cart, which takes the idea of assisted computing to its logical conclusion, and I felt it's a bit pointless by then.

I then suggested to use the serial port to communicate with the SPI port of a proposed ALU (which is still in stock). This would be a bit slow, as I think the SP port could only clock out in 16 cycles.

To overcome that difficulty, one could use a circuit or chip to clock at the full 4MHz of the ALU chip, and leave the result at the parallel port of the CIA, on the Userport, somehow.

To be fair, many functions can be done fairly quickly with large tables, using either the Ramlink or REU, and I wrote sample code for both. There are also methods documented at https://wilsonminesco.com/16bitMathTables/
Large Look-up Tables for Hyperfast, Accurate, 16-Bit Scaled-Integer Math
Plus: The Unexpected Power of Scaled-Integer Math
2021-12-16 11:54
tlr

Registered: Sep 2003
Posts: 1790
Quoting Repose
That's what I suggested earlier - just to be clear, Martin wanted a fast, efficient ALU ability by writing arguments to memory, which would be latched until the LSB were written, then immediately read the product back. I digressed a little by suggesting a GPU cart, which takes the idea of assisted computing to it's logical conclusion, and I felt it's a bit pointless by then.

I then suggested to use the serial port to communicate with the SPI port of a proposed ALU (which is still in stock). This would be a bit slow, as I think the SP port could only clock out in 16 cycles.

I obviously should have read more of the thread. :)

I think it'll take 32 cycles. 250kbit/s max IIRC.
2021-12-16 14:59
Martin Piper

Registered: Nov 2007
Posts: 722
I was thinking a 74LS165 or MCP23S08 might be able to talk to the uM-FPU 32-bit floating point coprocessor and provide an interface to the C64. Either via user port or cartridge port.
2021-12-31 12:15
Repose

Registered: Oct 2010
Posts: 225
Regarding convolutions, I remembered the term I was looking for. There is a fast convolution credited to Winograd. It works out to a lot of additions. I think it might be slow. At the end, I think you have to reduce overflowed digits into a mod plus carry to the next digit.
2022-01-03 16:53
Repose

Registered: Oct 2010
Posts: 225
I've been working hard on my mathlib. I think I've worked out all the use cases and made completely flexible macros that work together in a system, almost like a macro language. These can produce fully optimal code in an easy way.
Covered:
-squaring
-constant multipliers
-re-using multipliers
-distributed multiplication (a*(b,c))
-chained multiplication (a*b*c)
-zp and self-mod versions; and they are the same speed
-use cases for multiple pointers
-cycle timing variables that respond to all variations
-automatic carry flag tracking/setting
-custom table generation for built-in constant add/multiply
-low-precision outputs with optional rounding

To do:
-update test program
-verify the 32x32
-add signed and other versions
-squaring
-polynomials

At least a few more weeks to release something I'm satisfied with :)
I also discovered xc_basic and I'd like to add my library to their source, which is using a simple shift/add loop.
2022-01-03 17:29
Krill

Registered: Apr 2002
Posts: 2980
How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)
2022-01-03 19:24
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)

Hehe probably none due to the square tables.

@repose: regarding signing you can just post-fix that if you want, not optimal though but rather fast.


 ; Apply sign (See C=Hacking16 for details).
                lda T1+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T2+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T2+1
                    sta PRODUCT+3
                :
                lda T2+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T1+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T1+1
                    sta PRODUCT+3
                :


But again, you already knew that no doubt. :)
2022-01-03 22:31
Repose

Registered: Oct 2010
Posts: 225
Thanks for the tip! I hadn't considered it in detail. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.
2022-01-03 22:44
Repose

Registered: Oct 2010
Posts: 225
Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)

Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful. You can use a half-sized table and still be fast, in fact this was the usual expression of the squares method before it was fully optimized. Other than that, an unrolled shift-add loop is all I can think of (but I am so tired of seeing those :)
2022-01-03 22:48
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Thanks for the tip! I hadn't considered it in details. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.

I didn't really think much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..

And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.
2022-01-03 23:23
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful.
Yeah, the reason i'm asking is drive-code, of course.

I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.

Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =)
2022-01-04 05:15
Repose

Registered: Oct 2010
Posts: 225
Quote: Quoting Repose
Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful.
Yeah, the reason i'm asking is drive-code, of course.

I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.

Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =)


Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.
2022-01-04 10:24
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.
I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)

* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems.
2022-01-04 13:03
Repose

Registered: Oct 2010
Posts: 225
Quote: Quoting Repose
Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.
I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)

* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems.


This is a common problem which has been solved. For ethernet, it's called CSMA/CD. Then there's the method from CAN, the 2 wire car bus, which is a better protocol.
https://copperhilltech.com/blog/controller-area-network-can-bus..
2022-01-04 13:20
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
This is a common problem which has been solved. For ethernet, it's called CSMA/CD. Then there's the method from CAN, the 2 wire car bus, which is a better protocol.
I am very well aware of that. But we're still talking about stock C-64 plus original periphery, aren't we?
2022-01-04 15:36
Repose

Registered: Oct 2010
Posts: 225
Sorry I wasn't clear, this CAN thing works on an I2C bus, to put it in simple terms, it works exactly the same way as the IEC bus. You should be able to translate the ideas regardless, there will always be something that works. If I get time I can explain it better. Besides which, the kernal itself does this, there is the c64 as the master and it can negotiate to talk to any number of drives.
2022-01-04 16:29
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
If I get time I can explain it better.
Again, no need to. I've worked professionally with I2C and CAN for about a decade.
And those do bus arbitration in hardware, doing this in software efficiently is the problem.

Quoting Repose
Besides which, the kernal itself does this, there is the c64 as the master and it can negotiate to talk to any number of drives.
Afaik, bus arbitration is not a thing with single-master buses, like when only the C-64 acts as the master (the drives cannot assert the ATN line).
Different thing with a multi-master bus when any participant may wish to talk at any time.
2022-01-04 17:34
Repose

Registered: Oct 2010
Posts: 225
Continued in Multi-master on IEC bus
2022-01-07 07:36
Repose

Registered: Oct 2010
Posts: 225
Quoting JackAsser
I didn't really though much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..

And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.


Hi,
I didn't get back to you on this. I didn't mean to EOR the product, I meant to EOR the signs.
X Y P
+ + +
- + -
+ - -
- - +

So if the signs were stored with 0=+ and 1=-, then EORing the signs would give the sign of the result.
When multiplying, there is no fixing of the products; they always stay as a "positive" number and don't lose the sign bit.
Add/sub would be slower, you need to know their relative magnitudes.
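
In Python terms (just a sketch of the idea, not 6502 code):

def sm_mul(sx, mx, sy, my):
    # magnitudes stay unsigned, product sign is just the EOR of the sign bytes
    return sx ^ sy, mx * my

def sm_add(sx, mx, sy, my):
    # the slow direction: unlike signs need a magnitude compare first
    if sx == sy:
        return sx, mx + my
    if mx >= my:
        return sx, mx - my
    return sy, my - mx

assert sm_mul(1, 5, 0, 3) == (1, 15)   # -5 * 3 = -15
assert sm_add(1, 5, 0, 3) == (1, 2)    # -5 + 3 = -2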
2022-01-07 17:44
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Quoting JackAsser
I didn't really though much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..

And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.


Hi,
I didn't get back to you on this. I didn't mean to EOR the product, I meant to EOR the signs.
X Y P
+ + +
- + -
+ - -
- - +

So if the signs were stored with 0=+ and 1=-, then EORing the signs would give the sign of the result.
When multiplying, there is no fixing of the products; they always stay as a "positive" number and don't lose the sign bit.
Add/sub would be slower, you need to know their relative magnitudes.


Moving away from 2-complement?
2022-01-07 17:52
Repose

Registered: Oct 2010
Posts: 225
This doesn't need to be an either/or; it's just something I'm evaluating. Unless you're doing a factorial, I see most calculations using a multiply followed by an add (polynomial evaluation, matrices), so I just want to see the tradeoff.
If adds and mults always come in pairs, my guess is that either system is about equal. If they are equal, I'd give the nod to separate sign as it gives more range.
2022-01-11 23:02
Repose

Registered: Oct 2010
Posts: 225
@ChristopherJam (does that work?)
I did the offset $4000 in g thing myself, but tried the zip-add approach. I'm afraid it's a disaster; your adds are basically 9.5 cycles while mine are 11.5:
        ;continue column 1
.c1_1:  lda #0 ;self-modded to contain high(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
        sta .c1_2+1 ;save partial
        ;11.5


This is missing the corrections (for $40 in the high bytes of g(x), or p_neg_sqr_hi_x in my notation)

umult16:
        ;unsigned 16x16 mult 
        ;inputs:
        ; x0 x1 
        ; y0 y1 
        ;outputs:
        ; A z2 z1 z0
        * = $c000
      
!zone multlib
   
;these are defaults, but should be set from the calling program
!addr {
  multlib_p_sqr_lo_x0 = $8b ;2 bytes
  multlib_p_sqr_hi_x0 = $8d ;2 bytes
  multlib_p_neg_sqr_lo_x0 = $a3 ;2 bytes
  multlib_p_neg_sqr_hi_x0 = $a5 ;2 bytes
  
  multlib_p_sqr_lo_x1 = $8b ;2 bytes
  multlib_p_sqr_hi_x1 = $8d ;2 bytes
  multlib_p_neg_sqr_lo_x1 = $a3 ;2 bytes
  multlib_p_neg_sqr_hi_x1 = $a5 ;2 bytes

  multlib_sqr_lo = $c000 ;511 bytes
  multlib_sqr_hi = $c200 ;511 bytes
  multlib_neg_sqr_lo = $c400 ;511 bytes
  multlib_neg_sqr_hi = $c600 ;511 bytes
}

  x0=$fb
  x1=$fc
  y0=$fd
  y1=$fe
  z0=$8000
  z1=$8001
  z2=$8002
  z3=$8003

;set multiplier as x0
        lda x0 ;3
        sta multlib_p_sqr_lo_x0 ;3
        sta multlib_p_sqr_hi_x0 ;3
        eor #$ff ;2
        sta multlib_p_neg_sqr_lo_x0 ;3
        sta multlib_p_neg_sqr_hi_x0 ;3; 3+3+3+2+3+3 = 17 cycles
        
;set multiplier as x1
        lda x1 ;3
        sta multlib_p_sqr_lo_x1 ;3
        sta multlib_p_sqr_hi_x1 ;3
        eor #$ff ;2
        sta multlib_p_neg_sqr_lo_x1 ;3
        sta multlib_p_neg_sqr_hi_x1 ;3; 3+3+3+2+3+3 = 17 cycles
        
        ;set multiplicand as y0
        ldy y0 ;3
        ;storing z3 in X, clear X
        ldx #0 ;2
        
        clc ;2
        ;x0_y1l =  low(x0*y1)
        ;x0_y1h = high(x0*y1)
        ;zip add all mult8's directly
        ;adding requires offset $4000 in multlib_p_neg_sqr table and compensation
        ;7
        
        ;column 0
        lda (multlib_p_sqr_lo_x0),y ;low(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;11 cycles low(x0*y0) g
        sta z0 ;multiplier * Y, low byte, 3 cycles
        ;need correction factor for g in offset $400
        ;14
        
        ;continue to column 1
        adc (multlib_p_sqr_hi_x0),y ;high(x0*y0) f
        sta .c1_1+1 ;partial addition result of column 1 (self-mod to be continued)
        ;9.5
        
        ;continue to column 2
        adc (multlib_p_sqr_hi_x1),y ;high(x1*y0) f
        sta .c2_1+1 ;save partial
        bcc .c1_1
        inx ;inc z3
        clc ;3/6=4.5
        ;14
        
        ;continue column 1
.c1_1:  lda #0 ;self-modded to contain high(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
        sta .c1_2+1 ;save partial
        ;11.5
        
        ;continue column 2
.c2_1:  lda #0 ;self-modded to contain high(x1*y0) f
        adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y0) g
        sta .c2_2+1 ;save partial
        bcc .c1_2
        inx ;inc z3
        clc ;4.5
        ;16
        
        ;continue column 1
.c1_2:  lda #0 ;self-modded to contain high(x0*y0) g
        adc (multlib_p_sqr_lo_x1),y ;low(x1*y0) f
        sta .c1_3 ;save partial
        ;11.5
        
        ;continue column 2
        ;set multiplicand as y1
        ldy y1 ;3
.c2_2:  lda #0 ;self-modded to contain high(x1*y0) g
        adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) f
        sta .c2_3+1 ;save partial
        bcc .c1_3
        inx ;inc z3
        clc ;4.5
        ;19
        
        ;continue column 1 (extra y0)
        ldy y0 ;3
.c1_3:  lda #0 ;self-modded to contain low(x1*y0) f
        adc (multlib_p_neg_sqr_lo_x1),y ;low(x1*y0) g
        sta .c1_4+1 ;save partial
        ;14.5
        
        ;continue column 2
        ldy y1 ;3
.c2_3:  lda #0 ;self-modded to contain high(x0*y1) f
        adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) g
        sta .c2_4+1
        bcc .c1_4
        inx ;inc z3
        clc ;4.5
        ;19
        
        ;continue column 1
.c1_4:  lda #0 ;self-modded to contain low(x1*y0) g
        adc (multlib_p_sqr_lo_x0),y ;low(x0*y1) f
        sta .c1_5+1 ;save partial
        ;11.5

        ;continue column 2
.c2_4:  lda #0 ;self-modded to contain high(x0*y1) g
        adc (multlib_p_sqr_hi_x1),y ;low(x1*y1) f
        sta .c2_5+1 ;save partial
        bcc .c1_5
        inx ;inc z3
        clc ;4.5
        ;16
        
        ;finish column 1
.c1_5:  lda #0 ;self-modded to contain low(x0*y1) f
        adc (multlib_p_neg_sqr_lo_x0),y ;low(x0*y1) g
        sta z1 ;need correction factor for g in offset $400
        ;10.5
        
        ;finish column 2
.c2_5:  lda #0 ;self-modded to contain low(x1*y1) f
        adc (multlib_p_neg_sqr_hi_x1),y ;low(x1*y1) g
        sta z2 ;need correction factor for g in offset $400
        ;10.5
        
        ;column 3
        txa ;carries from c2
        adc (multlib_p_sqr_hi_x1),y ;high(x1*y1) f
        adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y1) g
        ;need correction factor for g in offset $400
        ;A=z3
        ;13
        ;=231.5

fin:    rts
2022-01-11 23:05
Repose

Registered: Oct 2010
Posts: 225
A guide to the order of additions (optimized for changes in y):
                y1    y0
             x  x1    x0
                --------
             x0y0h x0y0l
+      x1y0h x1y0l
+      x0y1h x0y1l
+x1y1h x1y1l
------------------------
    z3    z2    z1    z0 
2022-01-19 11:30
ChristopherJam

Registered: Aug 2004
Posts: 1409
Ah yes, saving those partials is expensive.

Apologies for the delay - while the @-mention did nothing, I did see the post, but then I had to find the time to reread the last 129 comments here..

I had a few ideas for optimizations while I was failing to get around to reading back, but it looks like I either acknowledged or attempted all of them back in 2017. Excellent point you had back then about it being well worth removing the CLCs for the taller columns.


One new thought I did have was to replace the bcc/inx combo with a ROL of some value read further down, which would simultaneously clear carry and accumulate a '1' into an index into a table counting the number of 1 bits. It could even be a table selector (if the ROL was of the high byte of a table address), allowing the correction to be incorporated into one of the square table lookups.

However, a non-accumulator ROL is always at least 5 cycles, which is slower than either path through bcc/inx, so I suspect this one's a no go.
2022-01-30 11:51
Repose

Registered: Oct 2010
Posts: 225
I've been going all over the place with distractions, but, I'm exploring a completely new number system which has potential. As I mentioned in my big list, there's many more ways to do math than just 2's complement. Here is a log based arithmetic system, where multiply is as fast as an add, and add is a bit slower:

# -*- coding: utf-8 -*-
"""
Created on Fri Jan 28 22:46:53 2022

@author: Repose
A module to analyze arithmetic via logs
ref.
https://www.siranah.de/html/sail042q.htm
"""

#imports
import sys
from math import log2

#functions/classes
scale=256/log2(256) #8-bit range
lo=lambda x:x&255 #util for creating tables (for future development)
hi=lambda x:(x>>8)
logadd=lambda x:round(log2(1+1/pow(2,x/scale))*scale) #scaled table to add two logs
logsub=lambda x:round(log2(1+pow(2,x/scale))*scale) #scaled table to sub two logs
exp=lambda x:round(pow(2,x/scale)) #scaled exponential 2^x table
lg=lambda x:round(log2(x)*scale) #scaled log2(x) table

def mult(x,y):
    '''Multiply two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate product
    '''
    return exp(lg(x)+lg(y))

def add(x,y):
    '''Add two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate sum
    '''
    return exp(logadd(lg(y)-lg(x))+lg(y))

def output_table(fmt, fname, name, addr, tab, start, end):
    '''Output a table as an assembly include file

    Keyword arguments:
        fmt -- string, "ACME"
        fname -- string, filename of output file
        name -- string, table name
        addr -- integer start address of table
        tab -- generator function
        start -- integer start index value
        end -- integer end index value + 1
    Returns:
        0 (success)
        5 (failed to open file)
    '''
    try:
        with open(fname, 'w') as f:
            print(f'writing {fname}')
            
            if fmt.lower()=="acme":
                fmt_data='!by '
            elif fmt.lower()=="dasm":
                fmt_data='dc.b '
            else:
                print(f'unknown assembler format {fmt}')
                return 1 #traditional DOS incorrect function

            #header
            f.write(f'{name}=${addr:04X}\n')
            step=16
            for i in range(start,end,step):
                #line start
                f.write(fmt_data)
                k=i+step if i+step<end else end
                first_invalid=True
                first_overflow=True
                for j in range(i,k):
                    try:
                        val=tab(j)
                    except ValueError:
                        val=0
                        if first_invalid:
                            print(f'warning: invalid table value replaced with 0 at index {j}')
                            first_invalid=False
                    if val>0xff:
                        val=val & 0xff
                        if first_overflow:
                            print(f'warning: table value overflowed at index {j}')
                            first_overflow=False
                    f.write(f'${val:02X}')
                    if j!=k-1:
                        f.write(',')
                    else:
                        f.write('\n')

    except:
            print(f"Couldn't open {fname} for writing")
            return 5 #traditional DOS error code

#main
def main():
    '''Analyze arithmetic of two integers via their logs

    Keyword arguments:
        none
    Returns:
        0 (success)
    '''
    print('Gaussian logs analysis\n')
    print('accuracy test\n')
    
    #should write a helper class for this
    erm=0 #maximum multiply error
    era=0 #maximum addition error
    
    for x in range(1,256):
        for y in range(x, 256):
            
            er=mult(x,y)-x*y
            if er>erm:
                erm=er
                pxm=x
                pym=y
                
            er=add(x,y)-(x+y)
            if er>era:
                era=er
                pxa=x
                pya=y
    
    #requires python 3.6+ due to f strings
    print('multiply')
    print(f'max err was {erm} ({erm/(pxm*pym)*100:.2f}%) at x={pxm} and y={pym}')
    print(f'{pxm}*{pym}={pxm*pym}, mult({pxm},{pym})={mult(pxm,pym)}\n')

    print('add')
    print(f'max err was {era} ({era/(pxa+pya)*100:.2f}%) at x={pxa} and y={pya}')
    print(f'{pxa}+{pya}={pxa+pya}, add({pxa},{pya})={add(pxa,pya)}\n')
    
    #write tables
    print('writing tables')
    tables=[{"func":lg,"name":"logtab"}, {"func":exp,"name":"exptab"}, {"func":logadd,"name":"logadd"}]
    addr=0xc000
    assembler='acme'
    for table in tables:
        #output_table(fmt, fname, name, addr, tab, start, end+1)
        output_table(assembler, table["name"]+'.asm', table["name"], addr, table["func"], 0, 256)
        addr+=0x100


if __name__ == '__main__':
	sys.exit(main())
    

writing logtab.asm
warning: invalid table value replaced with 0 at index 0
warning: table value overflowed at index 254
writing exptab.asm
writing logadd.asm

Main.asm to come...
2022-01-30 12:10
Krill

Registered: Apr 2002
Posts: 2980
The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.
2022-01-30 17:26
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.


Iirc WVL used a very controlled 8-bit sin/asin for the maths in Pinball Dreams even
2022-02-01 07:32
Repose

Registered: Oct 2010
Posts: 225
Yes, log has been used for divide in a C=Hacking article. In this case, I can add without converting from log and back when switching between add/sub and mult/div. Consider finding the escape iteration for a Mandelbrot set; you never need to know the values, just when the log proxy meets the test zr*zr+zi*zi>4. I'm hoping this can be faster and more accurate.
2022-02-01 08:10
Oswald

Registered: Apr 2002
Posts: 5094
log div is good enough for lines. (unless you're going for subpix:)
2023-08-22 05:44
Repose

Registered: Oct 2010
Posts: 225
It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed my previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).

;This section performs the 4 sub-multiplies to form
;  the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
;  to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1          ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3      ;32
+mult8_set_mier_snippet x0          ;17
+mult8_snippet "Y", x0y1l+1, "X"    ;28
+mult8_snippet y0, z0, "A"          ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3           z2           z1
                           clc
                     x0y1l adc #0 ;x0y0h + x0y1l
                           tay ;6
              txa
        x1y0h adc #0 ;x0y1h + x1y0h
              tax
              bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
                         + tya
                     x1y0l adc #0 ;+ x1y0l
                           sta z1 ;7
              txa
        x1y1l adc #0 ;+ x1y1l
              bcc done ;7
inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).
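
For anyone who wants to check the 7% / 42% figures, here's a quick Monte-Carlo script (random inputs rather than all 2^32, so only an estimate) that models the two column-2 adds exactly as written above:

import random

def carry_rates(n=200000):
    first = second = 0
    for _ in range(n):
        x0, x1, y0, y1 = (random.randrange(256) for _ in range(4))
        x0y0h = (x0 * y0) >> 8
        x0y1l, x0y1h = (x0 * y1) & 255, (x0 * y1) >> 8
        x1y0l, x1y0h = (x1 * y0) & 255, (x1 * y0) >> 8
        x1y1l = (x1 * y1) & 255
        t1 = x0y0h + x0y1l                 # first column-1 add (carry feeds column 2)
        t2 = x0y1h + x1y0h + (t1 >> 8)     # first column-2 add -> the bcc/inc z3/clc path
        if t2 > 255:
            first += 1
        z1 = (t1 & 255) + x1y0l            # second column-1 add
        if (t2 & 255) + x1y1l + (z1 >> 8) > 255:   # second column-2 add -> inc z3
            second += 1
    return first / n, second / n

print(carry_rates())   # should land near (0.07, 0.42)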
2023-08-22 06:09
Repose

Registered: Oct 2010
Posts: 225
A diagram of the calculations:
               y1    y0
            x  x1    x0
               --------
            x0y0h x0y0l
      x0y1h x0y1l
      x1y0h x1y0l
x1y1h x1y1l
-----------------------
   z3    z2    z1    z0


A listing of the macros is quite long, as a lot of effort went into parsing the arguments the way I'm using them, plus some other features in my library. Requires the latest C64 Studio 7.5, as there were a number of features I had added. Here is one of them:
!macro mult8_set_mier_snippet .mier {
;set multiplier as mier
;mier can be m/A
;requires .p_.sqr* and .p_.neg.sqr* pointers,
;(mathlib.p_sqr_lo, mathlib.p_neg_sqr_lo, mathlib.p_sqr_hi, mathlib.p_neg_sqr_hi)
;uses these macros from mathlib_util_macros.asm:
;reset_cycles, add_cycles_const
  !if .mier = "X" or .mier = "Y" {
    !error "multlib: mult8_snippet: mier must be m/A, not ", .mier
  } 
  +reset_cycles
  !if .mier != "A" {
    lda .mier
    !if .mier<$100 {
        +add_cycles_const 3
      } else {
        +add_cycles_const 4
      }
  }
  sta mathlib.p_sqr_lo ;3
  sta mathlib.p_sqr_hi ;3
  eor #$ff ;2
  sta mathlib.p_neg_sqr_lo ;3
  sta mathlib.p_neg_sqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
  +add_cycles_const 14
}

Here is the order of multiplies I used:
mier = x1
(x1) * y0 -> x1y0l, x1y0h
(x1) * y1 -> x1y1l, z3
mier = x0
(x0) * (y1) -> x0y1l, X
(x0) * y0 -> z0, A
2023-08-22 13:32
Frantic

Registered: Mar 2003
Posts: 1648
Very good!
2023-08-23 04:05
ChristopherJam

Registered: Aug 2004
Posts: 1409
Nice work, Repose!

Going to have to take another run at this one of these days..
2023-12-11 09:35
Repose

Registered: Oct 2010
Posts: 225
I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has come out on top of the 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, it's interesting to read about the various approaches, and to be fair, if you need a smaller routine, there are better choices.
2023-12-11 10:00
Raistlin

Registered: Mar 2007
Posts: 680
Very nice write up, Repose!
2023-12-11 12:51
Frantic

Registered: Mar 2003
Posts: 1648
Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.


Congratulations! I totally get that it must have been thrilling. :D
2023-12-11 13:57
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
The time is 188.1 cycles, averaged over all possible inputs.
Pretty impressive. :)

It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?

If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist).
2023-12-11 14:09
Repose

Registered: Oct 2010
Posts: 225
The code seems to be 120 bytes and the tables are 2044 bytes. I think I need to add some 30 more bytes to create the tables.
2023-12-12 09:31
JackAsser

Registered: Jun 2002
Posts: 2014
Very very nice write up @Repose!
2023-12-12 17:47
Repose

Registered: Oct 2010
Posts: 225
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
2023-12-12 20:16
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)

Hehe, tbf Graham taught me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.

We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham taught me, which I believe HCL also uses: y/x = y*(1/x) instead of doing a div, and then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x.
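
Something like this in Python terms (just a sketch with an assumed 0.16 fixed-point reciprocal table, not the actual routine):

recip = [0] + [round(65536 / i) for i in range(1, 257)]   # 1/i in 0.16 fixed point
# note recip[1] = 65536 won't fit in 16 bits; a real routine would special-case small x

def div_approx(y, x_int, x_frac):
    # y / (x_int + x_frac/256) as y * (1/x), interpolating between
    # the table entries for 1/x_int and 1/(x_int+1) by the fraction of x
    r0, r1 = recip[x_int], recip[x_int + 1]
    r = r0 - (((r0 - r1) * x_frac) >> 8)
    return (y * r) >> 16

print(div_approx(50000, 100, 128))   # 50000 / 100.5 = 497.5 -> 497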
2023-12-12 20:23
Repose

Registered: Oct 2010
Posts: 225
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations.
2023-12-13 23:06
Count Zero

Registered: Jan 2003
Posts: 1932
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication gave me the thrills a while ago.

Everybody here ofc is invited to contribute as well :)
2023-12-13 23:55
Repose

Registered: Oct 2010
Posts: 225
Thanks!

A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm

I can't find the AMD paper, but it's similar to this:
https://ieeexplore.ieee.org/document/762835

It was detailing the implementation in the K7 CPU by AMD (very old one).

I don't wanna get too excited to dive into that, until I finish the multiplies :)

But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)

So there are 4 cases, and if just one of the arguments is negative the resulting product will be negative; to get that, you only need to do the square-table subtraction in the reverse order, which is equivalent to taking the 2's complement of the product.

So, I'm hoping that that little sort routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think if both inputs are negative will be the worst case.

Another approach is to use different representations. Does offset 128 work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.

Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.

Another way, do unsigned but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes, it's explained in C=Hacking 16 under the 3d article. Very good article, which I co-wrote too :)
2023-12-14 10:26
ChristopherJam

Registered: Aug 2004
Posts: 1409
No idea whether Graham also came up with the self-mod to get the A+B for free trick independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me after whoever it was introduced me to the difference of squares approach, back around when Steve was writing 3d rendering articles for C=Hacking in the late 90s.

I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more.
2023-12-14 13:26
Repose

Registered: Oct 2010
Posts: 225
Oh, you definitely have added some great ideas!

However, I researched it before and there's a few people who came up with that idea before any of us, not c64 sceners either.
2024-02-13 08:13
Repose

Registered: Oct 2010
Posts: 225
I've finally published my new fastest routine, 2023 version. It's even faster now, 187.1.
https://codebase64.org/doku.php?id=base:fastest_multiplication_..
And independently benchmarked here
https://github.com/tobyLobster/multiply_test/?tab=readme-ov-fil..
Both faster and shorter, so not bad.
2024-02-13 13:37
ChristopherJam

Registered: Aug 2004
Posts: 1409
Nice work!
2024-02-13 20:58
Repose

Registered: Oct 2010
Posts: 225
Of course, only now do I realize a dumb trick:

x1 = p_sqr_lo

umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi

; set multiplier as x0
; *x1 is no longer preserved

To easily save 3 cycles and 1 byte zp.
Going a bit further, if you add 6 zp, using two sets of pointers, then x0 and x1 can be p_sqr_lo1 and p_sqr_lo2. And one more step, make the caller set up both pointers, but that's a bit more difficult as we can't assume the caller can use A for the calculation eor #$ff (yes, A will be trashed when you call the umult16, but I mean when the caller has the other half of x_ ready).
So, save 3, 6 or 12 cycles.
One more hack is passing y0 in Y, but in the worst case, that's not convenient to the caller, if it happens to be in X, then it's easier to stx y0 rather than txa:tay.

So, I would say saving 3 or 6 cycles are good options, and 3 for sure. That brings the routine down to 180.1 :) Saving 14.5 cycles and being 10% faster definitely makes me happy, losing another 6 bytes zp not so much.