[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 225

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2017-04-12 04:12

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting ChristopherJam

Quoting Repose
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..

OK, done now and right you are; saved me 13½ cycles on average for my ten random cases (down to 747.2 cycles now), and only cost me an extra five cycles on my best case (up to 697)

I ended up replacing the lda#offset at the starts of each of the middle six columns with an sbc id+$ff-offset,x at the ends.

My code generator has fairly generic treatment of addends now, so it takes care of all the carry logic for me. Core is now

for tb in range(8):
    bspecs=getbspecs(tb)
    doCounts=(tb<7)
    iv=initialValue[tb]
    doClears=len(bspecs)<4 and iv!=0

    op="lda"

    if tb==1:                                                                                  
        emit("    ldx#0")                                                          
    elif tb>1:
        emit("    txa")
        op="adc"
        if doCounts: emit("    ldx#0")

    if iv!=0:
        if doClears:
            bspecs=[pointerReturner( "#${iv:02x}".format(iv=iv), co=(iv>0xef))]+bspecs
        else:
            bspecs=bspecs+[pointerReturner( "id+${nv:02x},x".format(nv=0xff-iv), negate=True)]

    for n,s in enumerate(bspecs):
        addB(s, op, moveCarry=(n!=len(bspecs)-1), doCounts=doCounts, doClears=doClears)
        op="adc"

    emit("    sta mRes+{tb}\n\n".format(tb=tb))

2017-04-12 08:19

Repose

Registered: Oct 2010
Posts: 225

Beautiful! Only thing else I could see is to auto-generate for any precision, which isn't much of a leap. Should have option to include table-generator code too, and 6816(?).

You can estimate timing much the same way I do it by hand, by including counts per column based on number of adds, then overhead per column, and total overhead. I use variables Px for branches with an initial estimate of Px=.5 over all multiplies. Was gonna have a P generator given a stream of numbers too, that's really fancy of course.

Was gonna say, put the carries in the table (there's even opportunity for a table with built-in accumulate, would be slightly faster).

I have bad news though, I have doubts about my sec fix for post-correction :( I'm just working that out now (it's simple though, just ensure the running total is offset by 1 for each carry including the first). Did you verify the outputs?

2017-04-12 08:40

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)

Not major perhaps, a lot of time is spend on line drawing and EOR-filling anyway. But maybe 30%, i dunno. Long time ago now and I'm not really sure about how much time is spend on what.

2017-04-12 08:42

ChristopherJam

Registered: Aug 2004
Posts: 1409

Thanks!

I test the multiply using a set of 20 randomly selected 32 bit integers, comparing the product of each to a known result, and I time each multiply using CIA#2 with screen blanked and IRQs disabled. So, it's not comprehensive, but I'm reasonably confident.

Yes, generalising for a set of operand widths would probably be quite handy. Good idea instrumenting the generator to calculate the likely execution time. I've been meaning to do something similar with another project..

The post fixup is fine if it's the last addition performed, then the carry from that addition can propagate through to the next column.

Do you mean 65816? I started learning that once, but SuperCPU never really took off, and I tend to ignore most platforms between C64 and modern hardware these days, at least for hobby coding. REU last year was already something of a leap for me.

2017-04-12 12:14

Repose

Registered: Oct 2010
Posts: 225

Btw, one more idea: squaring and cubing based on this can be optimized significantly as well.

2017-04-13 08:42

Repose

Registered: Oct 2010
Posts: 225

Test your routine with these magic numbers:
00 00 54 56
00 00 03 03

If my theory is correct, that's the only case that will fail (not really, just the only one I'm bothering to solve for).
It's quite a subtle bug and random values won't catch it, you need 'whitebox' testing.

The failure is rare and of the form that the high byte is higher by 1.

2017-04-13 09:47

ChristopherJam

Registered: Aug 2004
Posts: 1409

Still works :)

Thanks for the test case.

Here are the first two columns of code (and remember that my g(x)=0x4000-f(x-255) ):

    clc 
    ldy mT2+0
    lda (zp_fl0),y
    adc (zp_gl0),y
    sta mRes+0


    ldx#0
    lda (zp_fh0),y
    adc (zp_gh0),y
    bcc *+3
    inx
    adc (zp_fl1),y
    bcc *+3
    inx
    adc (zp_gl1),y
    bcc *+3
    inx
    ldy mT2+1
    adc (zp_fl0),y
    bcc *+3
    inx
    adc (zp_gl0),y
    bcc *+3
    inx
    sbc id+$3f,x
    sta mRes+1

The inverse borrow from the final SBC carries forward to the next column; the SBC itself corrects for the false carries while also compensating for the excess 0x40 in the high byte of the g() table.

2017-04-13 10:18

Repose

Registered: Oct 2010
Posts: 225

Oh, I know why it works, I constructed those special values for the normal sense, I mean

   54 56
   03 03
   -----
   01 xx
00 ff
00 ff

The whole point was to get those 3 partials to be added, ff+ff+01. Where you are adding with offset, I have to construct the multipliers differently. Not only that, but I'm doubly wrong here - I need to find multipliers which cause the f(x)'s to result that way (where my example works only on the production of f()-g()).

I'll have to finish this later. In the meantime, I suggest you test every possible 16x16. Not so easy I know, I had to write such things in a 6502 simulator in C, actually just simulated the few instructions I needed, but there's a source code out there you could use for a full simulator.

2017-04-13 13:24

ChristopherJam

Registered: Aug 2004
Posts: 1409

I'm going to have to think some more about how to synthesise the equivalent test case.

I did start coding an exhaustive test in 6502 (can determine the required result just by repeated adds; a*(b+1)=a*b+b), then realised it wasn't 2**16 test cases but 2*32. Even at 30x realtime that would take VICE 28 hours assuming 700 cycles per iteration..

2017-04-14 00:55

Repose

Registered: Oct 2010
Posts: 225

00 01 02 03 * 04 05 06 07 and manipulate the tables to what you want to test adds for every branch, and number of carries per column up to 14, think that should do it.

Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | ... | 16 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

Mike
Paulko64
Krill/Plush
HOL2001/Quantum
MP Software/Hokuto F..
Operator Teleksu
Matt
Guests online: 98

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 Layers  (9.6)
2 No Listen  (9.6)
3 Party Elk 2  (9.6)
4 Cubic Dream  (9.6)
5 Copper Booze  (9.6)
6 Rainbow Connection  (9.5)
7 Dawnfall V1.1  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top Webmasters

1 Slaygon  (9.6)
2 Perff  (9.6)
3 Sabbi  (9.5)
4 Morpheus  (9.4)
5 CreaMD  (9.1)

Page generated in: 0.047 sec.