| | Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I've replaced all the subtractions with additions by using g(x)=$4000-(x*x/4) and offsetting my start number, and rolled the carry correction in to the addition sequence.
Removing the CLCs is probably not worthwhile for 32x32, as there are only 33 of them, so I'd have to spend considerably less than 4 cycles per output byte on the post correction (unlikely).
I'm down to around 800 cycles for a 32x32 now, 776 cycles for zero times zero. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
fo=open("tables.inc","w")
lo=lambda x:x&255
hi=lambda x:(x>>8)
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
dumpArrayToA65(fo, "id", [lo( i ) for i in range(512)])
fo=open("mc.inc","w")
mAcc=0
for i in range(4):
for j in range(4):
mAcc-=0x40<<(8*(1+i+j))
initialValue = [((mAcc>>s)&0xff) for s in range(0,64,8)]
def addB(yv,zp,tb):
global lasty
if yv!=lasty:
print(""" ldy mT2+{yv}""".format(yv=yv), file=fo)
lasty=yv
print(""" adc ({zp}),y""".format(zp=zp), file=fo)
if tb<7:
print(""" bcc *+4:inx:clc""", file=fo)
else:
print(""" bcc *+3:clc""", file=fo)
lasty=None
for tb in range(8):
print(""" ; tb={tb} """.format(tb=tb),file=fo)
if tb==0:
print(""" clc """,file=fo)
print(""" ldx#0 """,file=fo)
print(""" lda #${iv:02x} """.format(iv=initialValue[tb]),file=fo)
else:
print(""" txa""", file=fo)
if tb<7:
print(""" ldx#0 """,file=fo)
print(""" adc#${iv:02x}""".format(iv=initialValue[tb]), file=fo)
if initialValue[tb]>0xef:
print(""" bcc *+4:inx:clc""", file=fo)
for j in range(4):
i=tb-j
if i in [0,1,2,3]:
addB(i, "zp_fl{j}".format(j=j), tb)
addB(i, "zp_gl{j}".format(j=j), tb)
i=tb-j-1
if i in [0,1,2,3]:
addB(i, "zp_fh{j}".format(j=j), tb)
addB(i, "zp_gh{j}".format(j=j), tb)
print(""" sta mRes+{tb}""".format(tb=tb), file=fo)
fo.close()
|
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
(obviously also need four sets of
lda mT1+ 0
sta zp_fl0
sta zp_fh0
eor#255
sta zp_gl0
sta zp_gh0
every time the multiplier changes (included in my cycle times above),
plus also some init code to set the high bytes of the zero page pointers before a set of multiplications is performed (not included in my timings)) |
| | Repose
Registered: Oct 2010 Posts: 225 |
Good job, that's right in the range of what I thought was possible.
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
by offset $4000, doesn't that reduce the domain?
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :) |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeGood job, that's right in the range of what I thought was possible. Thanks!
Quote:I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers. Doing :)
Quote:by offset $4000, doesn't that reduce the domain?
Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80
Quote:Correction is fast, it's only
stx z3
sec
sbc z3
sta z2
True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)
Quote:Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, faster correction:
sec
sbc id,x
sta z2
Still not gonna do it, mind ;) |
| | Repose
Registered: Oct 2010 Posts: 225 |
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Let's say in the middle columns where there's 14 adds per column, that's 28 cycles half the time saved from not using CLC, or 14 cycles on average, vs 8 cycles for correction, it still saves 6.
I actually found the stats for the carries, most of them are about half, but adding a higher proportion of high bytes gives less carries. |
| | Repose
Registered: Oct 2010 Posts: 225 |
Good catch on the id,x, I was thinking of that a few days ago but it didn't click in for this situation yet :)
And yes, I worked hard at mixing add/sub properly, it still doesn't really made sense but it works. I thought it wouldn't if you DEX to $FF but it still works. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeAbout the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.
Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts.. |
| | Repose
Registered: Oct 2010 Posts: 225 |
Yes it hurts :) I posted the explanation in the add/sub thread if you can follow it.
Try 0 - ff - ff in your head. Have fun! :) |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next | |