| |
Repose
Registered: Oct 2010 Posts: 222 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| |
Repose
Registered: Oct 2010 Posts: 222 |
I worked on this again. There were a lot of bugs in my post 'part 1'.
Basically, this message is working out how long to multiply 16x16 using the idea of f(x+y)-f(x-y)=x*y with x,y as 16bit numbers and f(n)=n^2/4.
Note that squaring a 16bit number takes two 8bit squares and a multiply.
Compare this with the usual way, which is 4 multiplies of x0*y0,x1*y0,x0*y1,x1*y1 then adding those together.
----------------
z1:z0=(x0+y0)^2/4 means
lda x0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
ldy y0
sec
lda (zp1),y:sbc (zp3),y:sta z0
lda (zp2),y:sbc (zp4),y:sta z1;50 cycles avg
z1:z0=(x0+y0)*(x1+y1)/2 means
lda x0:clc:adc y0
sta zp1:sta zp2
eor #$ff
sta zp3:sta zp4
lda x1:clc:adc y1
tay;32
sec
lda (zp1),y:sbc (zp3),y:tax
lda (zp2),y:sbc (zp4),y:lsr:sta z1
txa:ror:sta z0;70 cycles avg
z1:z0+=x1:x0 means
clc
lda z0:adc x0:sta z0
lda z1:adc x1:sta z1;20 cycles
The routine
Again note, (a1:a0)^2/4=a1^2/4,a0^2/4,a1*a0/2
where a1:a0=x1:x0+y1:y0
So a 16bit square/4 is two 8bit squares and a multiply with divide/2
then added all together
We need two 16bit squares and then subtract them
;z=f(x+y) where f(n)=n^2/4 and x,y are 16bit numbers
z1:z0=(x0+y0)^2/4;50 cycles (x+y)^2/4 low
z3:z2=(x1+y1)^2/4;50 cycles (x+y)^2/4 high
t1:t0=(x0+y0)*(x1+y1)/2;70 cycles
z2:z1+=t1:t0;20 cycles
;add the carry to z3 (not shown or calculated) 190+cycles for 16bit square/4
;
;t=f(x-y)
t1:t0=(x0-y0)^2/4;50 cycles
t3:t2=(x1-y1)^2/4;50 cycles
t5:t4=(x0-y0)*(x1-y1)/2;70 cycles
t2:t1+=t5:t4;20 cycles
;
;z-=t
z1:z0-=t1:t0;20 cycles
z3:z3-=t3:t2;20 cycles
;190*2+40=420 cycles+
There's still some mistakes here and room to optimize the adds,
also the setup part of the table lookups can be arranged to not
change the pointers so much,
but basically it shows that this is very slow and doesn't work out.
This could have been 4 multiplies if done the usual way, instead
it's like 6 multiplies (officially 2, then 4 sqrtable lookups of 9 bit
numbers, which is the same speed). |
| |
Repose
Registered: Oct 2010 Posts: 222 |
I only have two options left. I can use the results of the 'add/sub a set' thread for adding the partials of large multiplies, where I add the partials by order of column from low to high, or else do the multiplies in an optimal order of not changing the zp pointers, but then having to generate and add the partials in a non-optimal order.
The two times I can avoid changing the zp pointers saves 28 cycles. Doing adds then having to save them and add a different column is still an unknown to me.
I'm thinking that doing the multiplies optimally will be best, and the result will be just over 200 cycles for an unsigned 16x16.
It's only a bit better than the one posted already in codebase.
I still have to look more into the cosine formula, but I think I am close to the optimal speed for large multiplies. |
| |
JackAsser
Registered: Jun 2002 Posts: 1987 |
Have you checked my Seriously fast multiplication?
Edit: Yes you have.. sorry. |
| |
Repose
Registered: Oct 2010 Posts: 222 |
Yes, that's the one I beat already. |
| |
JackAsser
Registered: Jun 2002 Posts: 1987 |
Quote: Yes, that's the one I beat already.
Interesting approach! I'll try it myself someday! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1370 |
Stop distracting me, I've got other stuff I'm meant to be working on! |
| |
JackAsser
Registered: Jun 2002 Posts: 1987 |
Quote: Stop distracting me, I've got other stuff I'm meant to be working on!
Circle closed. You were the first to teach me multiplications using squares. I remember I found you on some forum ages ago. :D |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1370 |
Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?) |
| |
JackAsser
Registered: Jun 2002 Posts: 1987 |
Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?)
Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the highbyte with X and then use ,Y in the table lookup. That way the X+Y comes for "free" |
| |
Repose
Registered: Oct 2010 Posts: 222 |
That very idea I always thought was brilliant, because it made a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than could ever be possible. I call it a 'super' algorithm.
And yes, Judd told me that at the time we were working on the C=Hacking article together.
Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ? |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next |