[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

2012-06-10 00:25

Repose

Registered: Oct 2010
Posts: 222

After some work I decided to go with a square table based large multiply. 32bitx32bit=64bit in 850 cycles or so at best. I discovered a trick I wish I knew a long time ago; to add long lists of numbers you can do this
ldy #0
clc
lda a0
adc b0
bcc s1
iny
clc
s1 adc c0
sta d0
bcc s2
iny
s2 sty d1

It's basically 6 cycles per row and 7 cycles per column. When adding the 16 (two byte) partials from the 32bit multiply, it averages to 7.5 cycles per add. This is really optimal as the most basic add is 6 cycles (lda a0:adc b0).

I also found out that there's carries slightly less than half the time on average.

I'm working on a routine to do 8x8=16, 8x16=24, 16x16=16 with rounding, 32x32=32 with rounding, 32x32=64, and 32x32 floating point in one routine with different entry points, where (almost) each entry is optimal even for a standalone routine, and this is all in one routine. It's just a matter of arranging the order of operations carefully to progressively create the result.

I think you can get 1000flops on the c64.

2012-06-10 10:28

ChristopherJam

Registered: Aug 2004
Posts: 1378

1000 flops is well within reach - it just depends how much precision you want, and what you define as a flop :)

Single precision IEEE-754 only has 24 bits for the mantissa, and you don't need a 48 bit result unless you want to get the rounding exact for multiply-accumulates.

Have you considered expressing the multiplies as a difference of squares? Should mean instead of needing nine 8x8->16 multiplies for a 24x24 bit you have six 8x8s and a few table lookups.

2012-06-11 01:47

Repose

Registered: Oct 2010
Posts: 222

part 1

z1:z0=(x0+y0)^2/4-(x0-y0)^2/4 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;48 cycles

z1:Z0=x0*y1/2 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y1
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;56 cycles

z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 ccyles

c1:c0=a0^2/4-b0^2/4;48 cycles
c3:c2=a1^2/4-b1^2/4;48 cycles
t1:t0=a0*a1/2;56 cycles
t3:t2=b0*b1/2;56 cycles
t1:t0-=t3:t2;20 cycles
c2:c1+=t1:t0;20 cycles
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1 lda #0;8/11
adc c3:sta c3;6

which is 262+ cycles for your method, 16bit.

2012-06-11 02:02

Repose

Registered: Oct 2010
Posts: 222

You're proposing,
a*b=(a+b)^2 - (a-b)^2,
f(x)=x^2/4,
a*b=f(a+b) - f(a-b)
Where a,b are 24 bit values.

So we need a 24bit square function.
You can get that by,
(x2 * 2^16 + x1 * 2^8 + x0)^2=
x2^2 * 2^32 + x1^2 * 2^16 + x0^2 + y,
g(x)=x^2/4,
=g(x2) * 2^32 + g(x1) * 2^16 + g(x0)
which are three table lookups and,
y= 2*x1*x0/4 * 2^8 + 2*x2*x0/4 * 2^16 + 2*x2*x1/4 * 2^24
which are three 8bit multiplies.

In general for n byte numbers you get
2n squaring lookups,
2 * nC2 mults where nCr=n!/(n-r)!r!,
2n shifts

O(n) = 2n*48 + 2*nC2*(56+20) + 14, n>1

2C2=2
3C2=3
4C2=6

n cycles
2 262
3 530
4 1310

So without optimization, it doesn't look good...

2012-06-11 09:59

Skate

Registered: Jul 2003
Posts: 490

Quote:

So without optimization, it doesn't look good...

when it comes to math, what looks good on c64 without optimization?

creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as it is.

imho, best method is to create the library with possible optimizations which doesn't effect the accuracy much. at the last stage, replace all possible calculations with reasonable look-up tables and where look-up table is not an option, inline the math functions and optimize them as much as possible, change the register usages for saving a few more cycles, sacrifice some accuracy if it doesn't effect the result badly etc.

2012-06-11 10:43

Repose

Registered: Oct 2010
Posts: 222

You're thinking in terms of demos. I was thinking of applying this as a compiler library. I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach, when I say it doesn't look good, I meant his approach.

Update: slight problem in how I count things
O(n) = n*48 + nC2*(56+20) + 14, in fact 1 works and 1C2 is 0, also the 14 term is related to this

n cycles
1 48
2 262
3 386
4 662

This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass...

2012-06-11 19:33

Bitbreaker

Registered: Oct 2002
Posts: 500

Quoting Repose

part 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1
lda #0;8/11
adc c3
sta c3

this will a) most likely crash as it only skips one byte and then runs on $00 what will be a brk. Also this is not a optimizing trick speedwise nor sizewise:

Better be done like:

    anc #$80 ;copy bit 7 to carry and A = A & $80
    bcc +    ;A is zero on plus, all fine
    sbc #$81 ;subtract $81 from $80 = $ff
+
    ;carry always clear as a free gift

Also, lots of other possibilities to optimize that code, and also some mistakes within the code.

2017-03-28 18:46

Repose

Registered: Oct 2010
Posts: 222

I worked on this again. There were a lot of bugs in my post 'part 1'.

Basically, this message is working out how long to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.

Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.

----------------
z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg

z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg

z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles

The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them

;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z3-=t3:t2;20 cycles
;190*2+40=420 cycles+

There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.

This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed).

2017-03-28 18:57

Repose

Registered: Oct 2010
Posts: 222

I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.

The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.

I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.

It's only a bit better than the one posted already in codebase.

I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies.

2017-03-29 08:12

JackAsser

Registered: Jun 2002
Posts: 1989

Have you checked my Seriously fast multiplication?

Edit: Yes you have.. sorry.

... 144 posts hidden. Click here to view all posts....