Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
 Welcome to our latest new user Equinoxe_4 ! (Registered 2017-04-23) You are not logged in 
CSDb User Forums

Forums > C64 Coding > Fast large multiplies
2012-06-09 21:45

Registered: Oct 2010
Posts: 129
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

2012-06-10 02:25

Registered: Oct 2010
Posts: 129
After some work I decided to go with a square table based large multiply. 32bitx32bit=64bit in 850 cycles or so at best. I discovered a trick I wish I knew a long time ago; to add long lists of numbers you can do this
ldy #0
lda a0
adc b0
bcc s1
s1 adc c0
sta d0
bcc s2
s2 sty d1

It's basically 6 cycles per row and 7 cycles per column. When adding the 16 (two byte) partials from the 32bit multiply, it averages to 7.5 cycles per add. This is really optimal as the most basic add is 6 cycles (lda a0:adc b0).

I also found out that there's carries slightly less than half the time on average.

I'm working on a routine to do 8x8=16, 8x16=24, 16x16=16 with rounding, 32x32=32 with rounding, 32x32=64, and 32x32 floating point in one routine with different entry points, where (almost) each entry is optimal even for a standalone routine, and this is all in one routine. It's just a matter of arranging the order of operations carefully to progressively create the result.

I think you can get 1000flops on the c64.
2012-06-10 12:28

Registered: Aug 2004
Posts: 581
1000 flops is well within reach - it just depends how much precision you want, and what you define as a flop :)

Single precision IEEE-754 only has 24 bits for the mantissa, and you don't need a 48 bit result unless you want to get the rounding exact for multiply-accumulates.

Have you considered expressing the multiplies as a difference of squares? Should mean instead of needing nine 8x8->16 multiplies for a 24x24 bit you have six 8x8s and a few table lookups.
2012-06-11 03:47

Registered: Oct 2010
Posts: 129
part 1

z1:z0=(x0+y0)^2/4-(x0-y0)^2/4 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y0
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;48 cycles

z1:Z0=x0*y1/2 means
lda x0:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4:ldy y1
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;56 cycles

z1:z0+=x1:x0 means
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 ccyles

c1:c0=a0^2/4-b0^2/4;48 cycles
c3:c2=a1^2/4-b1^2/4;48 cycles
t1:t0=a0*a1/2;56 cycles
t3:t2=b0*b1/2;56 cycles
t1:t0-=t3:t2;20 cycles
c2:c1+=t1:t0;20 cycles
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
.1 lda #0;8/11
adc c3:sta c3;6

which is 262+ cycles for your method, 16bit.

2012-06-11 04:02

Registered: Oct 2010
Posts: 129
You're proposing,
a*b=(a+b)^2 - (a-b)^2,
a*b=f(a+b) - f(a-b)
Where a,b are 24 bit values.

So we need a 24bit square function.
You can get that by,
(x2 * 2^16 + x1 * 2^8 + x0)^2=
x2^2 * 2^32 + x1^2 * 2^16 + x0^2 + y,
=g(x2) * 2^32 + g(x1) * 2^16 + g(x0)
which are three table lookups and,
y= 2*x1*x0/4 * 2^8 + 2*x2*x0/4 * 2^16 + 2*x2*x1/4 * 2^24
which are three 8bit multiplies.

In general for n byte numbers you get
2n squaring lookups,
2 * nC2 mults where nCr=n!/(n-r)!r!,
2n shifts

O(n) = 2n*48 + 2*nC2*(56+20) + 14, n>1


n cycles
2 262
3 530
4 1310

So without optimization, it doesn't look good...

2012-06-11 11:59

Registered: Jul 2003
Posts: 459

So without optimization, it doesn't look good...

when it comes to math, what looks good on c64 without optimization?

creating a math library for general purposes is a good thing. it's very useful for testing an effect on c64 without spending too much time. but when it comes to finalization, usually most parts of this library won't be used as it is.

imho, best method is to create the library with possible optimizations which doesn't effect the accuracy much. at the last stage, replace all possible calculations with reasonable look-up tables and where look-up table is not an option, inline the math functions and optimize them as much as possible, change the register usages for saving a few more cycles, sacrifice some accuracy if it doesn't effect the result badly etc.
2012-06-11 12:43

Registered: Oct 2010
Posts: 129
You're thinking in terms of demos. I was thinking of applying this as a compiler library. I'm just looking for a general fast calculation.
I'm comparing CJ's suggestion to my first approach, when I say it doesn't look good, I meant his approach.

Update: slight problem in how I count things
O(n) = n*48 + nC2*(56+20) + 14, in fact 1 works and 1C2 is 0, also the 14 term is related to this

n cycles
1 48
2 262
3 386
4 662

This is now good :)
If I can generate combinations in lists, I could generate any precision math routine in kickass...
2012-06-11 21:33

Registered: Oct 2002
Posts: 423
Quoting Repose
part 1
;sign extend for c3
bpl .1
lda #$ff
.byt $24;thanks optimizing tricks thread
lda #0;8/11
adc c3
sta c3

this will a) most likely crash as it only skips one byte and then runs on $00 what will be a brk. Also this is not a optimizing trick speedwise nor sizewise:

Better be done like:
    anc #$80 ;copy bit 7 to carry and A = A & $80
    bcc +    ;A is zero on plus, all fine
    sbc #$81 ;subtract $81 from $80 = $ff
    ;carry always clear as a free gift

Also, lots of other possibilities to optimize that code, and also some mistakes within the code.
2017-03-28 20:46

Registered: Oct 2010
Posts: 129
I worked on this again. There were a lot of bugs in my post 'part 1'.

Basically, this message is working out how long to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.

Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.

z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg

z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg

z1:z0+=x1:x0 means
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles

The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them

;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
z1:z0-=t1:t0;20 cycles
z3:z3-=t3:t2;20 cycles
;190*2+40=420 cycles+

There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.

This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed).
2017-03-28 20:57

Registered: Oct 2010
Posts: 129
I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.

The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.

I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.

It's only a bit better than the one posted already in codebase.

I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies.
2017-03-29 10:12

Registered: Jun 2002
Posts: 1204
Have you checked my Seriously fast multiplication?

Edit: Yes you have.. sorry.
... 77 posts hidden. Click here to view all posts....
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Users Online
Soren/Camelot, MoN, ..
Radiant/Panda Design
Guests online: 43
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 The Shores of Reflec..  (9.7)
4 Coma Light 13  (9.6)
5 Lunatico  (9.6)
6 Comaland 100%  (9.6)
7 Incoherent Nightmare  (9.5)
8 Wonderland XII  (9.5)
9 Comaland  (9.5)
10 Wonderland XIII  (9.5)
Top onefile Demos
1 Dawnfall V1.1  (9.5)
2 Daah, Those Acid Pil..  (9.4)
3 SidRok  (9.4)
4 Treu Love [reu]  (9.4)
5 Tunnel Vision  (9.3)
6 Dawnfall  (9.3)
7 One-Der  (9.2)
8 Globe 2016 [reu]  (9.2)
9 Hardware Accelerated..  (9.2)
10 Safe VSP  (9.1)
Top Groups
1 Booze Design  (9.4)
2 Oxyron  (9.4)
3 Censor Design  (9.4)
4 Crest  (9.3)
5 Camelot  (9.2)
Top Original Suppliers
1 Derbyshire Ram  (9.3)
2 hedning  (9.2)
3 Jazzcat  (9.1)
4 Fungus  (9.0)
5 Irata  (9.0)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 0.163 sec.