[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2017-04-10 09:55

ChristopherJam

Registered: Aug 2004
Posts: 1378

Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway - instant 2x speedup :D

2017-04-10 12:45

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.

Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.

2017-04-10 17:44

Oswald

Registered: Apr 2002
Posts: 5017

Quote: Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.

cool, I hope its for some c64 demo thingie :)

2017-04-10 18:42

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: cool, I hope its for some c64 demo thingie :)

No sorry. :) just curiosity.

2017-04-11 04:52

Oswald

Registered: Apr 2002
Posts: 5017

Quote: No sorry. :) just curiosity.

nah next time you make a true doom fx, or you'll be kicked out :)

2017-04-11 04:59

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: nah next time you make a true doom fx, or you'll be kicked out :)

Hehe, wasn't the Andropolis vector enough? ;) there I only used 16-bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and remove the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure.

2017-04-11 21:49

Repose

Registered: Oct 2010
Posts: 222

Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)

2017-04-12 04:12

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting ChristopherJam

Quoting Repose
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..

OK, done now and right you are; saved me 13½ cycles on average for my ten random cases (down to 747.2 cycles now), and only cost me an extra five cycles on my best case (up to 697)

I ended up replacing the lda#offset at the starts of each of the middle six columns with an sbc id+$ff-offset,x at the ends.

My code generator has fairly generic treatment of addends now, so it takes care of all the carry logic for me. Core is now

for tb in range(8):
    bspecs=getbspecs(tb)
    doCounts=(tb<7)
    iv=initialValue[tb]
    doClears=len(bspecs)<4 and iv!=0

    op="lda"

    if tb==1:                                                                                  
        emit("    ldx#0")                                                          
    elif tb>1:
        emit("    txa")
        op="adc"
        if doCounts: emit("    ldx#0")

    if iv!=0:
        if doClears:
            bspecs=[pointerReturner( "#${iv:02x}".format(iv=iv), co=(iv>0xef))]+bspecs
        else:
            bspecs=bspecs+[pointerReturner( "id+${nv:02x},x".format(nv=0xff-iv), negate=True)]

    for n,s in enumerate(bspecs):
        addB(s, op, moveCarry=(n!=len(bspecs)-1), doCounts=doCounts, doClears=doClears)
        op="adc"

    emit("    sta mRes+{tb}\n\n".format(tb=tb))

2017-04-12 08:19

Repose

Registered: Oct 2010
Posts: 222

Beautiful! Only thing else I could see is to auto-generate for any precision, which isn't much of a leap. Should have option to include table-generator code too, and 6816(?).

You can estimate timing much the same way I do it by hand, by including counts per column based on number of adds, then overhead per column, and total overhead. I use variables Px for branches with an initial estimate of Px=.5 over all multiplies. Was gonna have a P generator given a stream of numbers too, that's really fancy of course.

Was gonna say, put the carries in the table (there's even opportunity for a table with built-in accumulate, would be slightly faster).

I have bad news though, I have doubts about my sec fix for post-correction :( I'm just working that out now (it's simple though, just ensure the running total is offset by 1 for each carry including the first). Did you verify the outputs?

2017-04-12 08:40

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)

Not major perhaps, a lot of time is spend on line drawing and EOR-filling anyway. But maybe 30%, i dunno. Long time ago now and I'm not really sure about how much time is spend on what.

Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

csio/monarchy c+4
Didi/Laxity
iceout/Avatar/HF
AlexC
TheRyk/MYD!
DeMOSic/HF^MS^BCC^LSD
Guests online: 108

Top Demos

1 Next Level  (9.8)
2 Mojo  (9.7)
3 Coma Light 13  (9.7)
4 Edge of Disgrace  (9.6)
5 Comaland 100%  (9.6)
6 No Bounds  (9.6)
7 Uncensored  (9.6)
8 Wonderland XIV  (9.6)
9 Bromance  (9.6)
10 Memento Mori  (9.6)

Top onefile Demos

1 It's More Fun to Com..  (9.7)
2 Party Elk 2  (9.7)
3 Cubic Dream  (9.6)
4 Copper Booze  (9.5)
5 TRSAC, Gabber & Pebe..  (9.5)
6 Rainbow Connection  (9.5)
7 Onscreen 5k  (9.5)
8 Wafer Demo  (9.5)
9 Dawnfall V1.1  (9.5)
10 Quadrants  (9.5)

Top Groups

1 Oxyron  (9.3)
2 Nostalgia  (9.3)
3 Booze Design  (9.3)
4 Censor Design  (9.3)
5 Crest  (9.3)

Top Swappers

1 Derbyshire Ram  (10)
2 Jerry  (9.8)
3 Violator  (9.8)
4 Acidchild  (9.7)
5 Starlight  (9.6)

Page generated in: 0.048 sec.