| | Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| | Repose
Registered: Oct 2010 Posts: 225 |
Here's some untested code:
x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99
sqrlo=$c000;510 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600
umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17
lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17
ldx #0;start column 0
ldy y0
SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1
;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;-x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1
;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;-x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;-x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;-x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2
;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;-x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3
rts
makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1
;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
makesqrtables_loop1:
tya
adc #$00
makesqrtables_sm1:
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2:
rts
|
| | Repose
Registered: Oct 2010 Posts: 225 |
Post correction would be slower on the 16x16, not a long enough run of add/sub. |
| | Repose
Registered: Oct 2010 Posts: 225 |
Partials cheatsheet
y1 y0
x1 x0
------
x0*y0h x0*y0l
x1*y0h x1*y0l
x0*y1h x0*y1l
x1*y1h x1*y1l
----------------
24x24bits
x2 x1 x0
y2 y1 y0
---------------
y0x0h y0x0l
y0x1h y0x1l
y0x2h y0x2l
y1x0h y1x0l
y1x1h y1x1l
y1x2h y1x2l
y2x0h y2x0l
y2x1h y2x1l
y2x2h y2x2l
These facts are useful to estimating the time of any size calc:
Number of columns is 2*n, n is bytes of each number.
Rows of additions is like 1 3 3 1, 1 3 5 5 3 1, 1 3 5 7 7 5 3 1 for 16,24 and 32 bit mults, and each one being f(x)-g(x), so really double that number of add-subs.
The total add-subs is n^2*2. (each is about 10 cycles).
Number of times to change (or pointers to set) of the multiplier is n, then each one is used >n times with the multiplicand (ldy multiplicand), when doing in column order (tbd total). |
| | Repose
Registered: Oct 2010 Posts: 225 |
Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I've replaced all the subtractions with additions by using g(x)=$4000-(x*x/4) and offsetting my start number, and rolled the carry correction in to the addition sequence.
Removing the CLCs is probably not worthwhile for 32x32, as there are only 33 of them, so I'd have to spend considerably less than 4 cycles per output byte on the post correction (unlikely).
I'm down to around 800 cycles for a 32x32 now, 776 cycles for zero times zero. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
fo=open("tables.inc","w")
lo=lambda x:x&255
hi=lambda x:(x>>8)
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
dumpArrayToA65(fo, "id", [lo( i ) for i in range(512)])
fo=open("mc.inc","w")
mAcc=0
for i in range(4):
for j in range(4):
mAcc-=0x40<<(8*(1+i+j))
initialValue = [((mAcc>>s)&0xff) for s in range(0,64,8)]
def addB(yv,zp,tb):
global lasty
if yv!=lasty:
print(""" ldy mT2+{yv}""".format(yv=yv), file=fo)
lasty=yv
print(""" adc ({zp}),y""".format(zp=zp), file=fo)
if tb<7:
print(""" bcc *+4:inx:clc""", file=fo)
else:
print(""" bcc *+3:clc""", file=fo)
lasty=None
for tb in range(8):
print(""" ; tb={tb} """.format(tb=tb),file=fo)
if tb==0:
print(""" clc """,file=fo)
print(""" ldx#0 """,file=fo)
print(""" lda #${iv:02x} """.format(iv=initialValue[tb]),file=fo)
else:
print(""" txa""", file=fo)
if tb<7:
print(""" ldx#0 """,file=fo)
print(""" adc#${iv:02x}""".format(iv=initialValue[tb]), file=fo)
if initialValue[tb]>0xef:
print(""" bcc *+4:inx:clc""", file=fo)
for j in range(4):
i=tb-j
if i in [0,1,2,3]:
addB(i, "zp_fl{j}".format(j=j), tb)
addB(i, "zp_gl{j}".format(j=j), tb)
i=tb-j-1
if i in [0,1,2,3]:
addB(i, "zp_fh{j}".format(j=j), tb)
addB(i, "zp_gh{j}".format(j=j), tb)
print(""" sta mRes+{tb}""".format(tb=tb), file=fo)
fo.close()
|
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
(obviously also need four sets of
lda mT1+ 0
sta zp_fl0
sta zp_fh0
eor#255
sta zp_gl0
sta zp_gh0
every time the multiplier changes (included in my cycle times above),
plus also some init code to set the high bytes of the zero page pointers before a set of multiplications is performed (not included in my timings)) |
| | Repose
Registered: Oct 2010 Posts: 225 |
Good job, that's right in the range of what I thought was possible.
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
by offset $4000, doesn't that reduce the domain?
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :) |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeGood job, that's right in the range of what I thought was possible. Thanks!
Quote:I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers. Doing :)
Quote:by offset $4000, doesn't that reduce the domain?
Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80
Quote:Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)
Quote:Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, faster correction:
sec
sbc id,x
sta z2
Still not gonna do it, mind ;) |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next | |