[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 225

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2012-06-11 09:59

Skate

Registered: Jul 2003
Posts: 494

Quote:

So without optimization, it doesn't look good...

when it comes to math, what looks good on c64 without optimization?

creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as it is.

imho, best method is to create the library with possible optimizations which doesn't effect the accuracy much. at the last stage, replace all possible calculations with reasonable look-up tables and where look-up table is not an option, inline the math functions and optimize them as much as possible, change the register usages for saving a few more cycles, sacrifice some accuracy if it doesn't effect the result badly etc.

2012-06-11 10:43

Repose

Registered: Oct 2010
Posts: 225

You're thinking in terms of demos. I was thinking of applying this as a compiler library. I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach, when I say it doesn't look good, I meant his approach.

Update: slight problem in how I count things
O(n) = n*48 + nC2*(56+20) + 14, in fact 1 works and 1C2 is 0, also the 14 term is related to this

n cycles
1 48
2 262
3 386
4 662

This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass...

2012-06-11 19:33

Bitbreaker

Registered: Oct 2002
Posts: 508

Quoting Repose

part 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1
lda #0;8/11
adc c3
sta c3

this will a) most likely crash as it only skips one byte and then runs on $00 what will be a brk. Also this is not a optimizing trick speedwise nor sizewise:

Better be done like:

    anc #$80 ;copy bit 7 to carry and A = A & $80
    bcc +    ;A is zero on plus, all fine
    sbc #$81 ;subtract $81 from $80 = $ff
+
    ;carry always clear as a free gift

Also, lots of other possibilities to optimize that code, and also some mistakes within the code.

2017-03-28 18:46

Repose

Registered: Oct 2010
Posts: 225

I worked on this again. There were a lot of bugs in my post 'part 1'.

Basically, this message is working out how long to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.

Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.

----------------
z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg

z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg

z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles

The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them

;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z3-=t3:t2;20 cycles
;190*2+40=420 cycles+

There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.

This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed).

2017-03-28 18:57

Repose

Registered: Oct 2010
Posts: 225

I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.

The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.

I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.

It's only a bit better than the one posted already in codebase.

I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies.

2017-03-29 08:12

JackAsser

Registered: Jun 2002
Posts: 2014

Have you checked my Seriously fast multiplication?

Edit: Yes you have.. sorry.

2017-03-29 10:21

Repose

Registered: Oct 2010
Posts: 225

Yes, that's the one I beat already.

2017-03-29 11:46

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: Yes, that's the one I beat already.

Interesting approach! I'll try it myself someday!

2017-03-29 12:03

ChristopherJam

Registered: Aug 2004
Posts: 1409