| |
Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.
It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?)
Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the highbyte with X and then use ,Y in the table lookup. That way the X+Y comes for "free" |
| |
Repose
Registered: Oct 2010 Posts: 225 |
That very idea I always thought was brilliant, because it made a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than could ever be possible. I call it a 'super' algorithm.
And yes, Judd told me that at the time we were working on the C=Hacking article together.
Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ? |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: Yes, that's the one I beat already.
Don't hesitate to write an article, or at least post some source code, at Codebase64. :) Would be much appreciated! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Repose, did you get around to completing a 32x32->64 routine? I'm trying out some of the ideas from this and the sets of add/sub threads; currently down to around 1300 cycles, though I've still got some optimising to go.
I'm using inx to track carries, but have yet to remove the CLCs after each branch. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Yes I wrote a 16x16 to test, and it takes 211 cycles (that's avg with equal branches). The one posted on codebase is 231 cycles. I am just testing it now but having some trouble finding a single step debugger that works for me, see thread.
The time to beat for 32x32 is at least 1300 as posted on codebase so you have to keep going...
I can derive the formula for my current version to estimate exact timing of 32x32 I'll have to get back to you in a bit. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Here's some untested code:
x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99
sqrlo=$c000;510 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600
umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17
lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17
ldx #0;start column 0
ldy y0
SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1
;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;-x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1
;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;-x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;-x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;-x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2
;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;-x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3
rts
makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1
;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
makesqrtables_loop1:
tya
adc #$00
makesqrtables_sm1:
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2:
rts
|
| |
Repose
Registered: Oct 2010 Posts: 225 |
Post correction would be slower on the 16x16, not a long enough run of add/sub. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Partials cheatsheet
y1 y0
x1 x0
------
x0*y0h x0*y0l
x1*y0h x1*y0l
x0*y1h x0*y1l
x1*y1h x1*y1l
----------------
24x24bits
x2 x1 x0
y2 y1 y0
---------------
y0x0h y0x0l
y0x1h y0x1l
y0x2h y0x2l
y1x0h y1x0l
y1x1h y1x1l
y1x2h y1x2l
y2x0h y2x0l
y2x1h y2x1l
y2x2h y2x2l
These facts are useful to estimating the time of any size calc:
Number of columns is 2*n, n is bytes of each number.
Rows of additions is like 1 3 3 1, 1 3 5 5 3 1, 1 3 5 7 7 5 3 1 for 16,24 and 32 bit mults, and each one being f(x)-g(x), so really double that number of add-subs.
The total add-subs is n^2*2. (each is about 10 cycles).
Number of times to change (or pointers to set) of the multiplier is n, then each one is used >n times with the multiplicand (ldy multiplicand), when doing in column order (tbd total). |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc. |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next |