[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2023-08-23 04:05

ChristopherJam

Registered: Aug 2004
Posts: 1381

Nice work, Repose!

Going to have to take another run at this one of these days..

2023-12-11 09:35

Repose

Registered: Oct 2010
Posts: 222

I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

2023-12-11 10:00

Raistlin

Registered: Mar 2007
Posts: 575

Very nice write up, Repose!

2023-12-11 12:51

Frantic

Registered: Mar 2003
Posts: 1629

Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

Congratulations! I totally get that it must have been thrilling. :D

2023-12-11 13:57

Krill

Registered: Apr 2002
Posts: 2854

Quoting Repose

The time is 188.1 cycles, averaged over all possible inputs.

Pretty impressive. :)

It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?

If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist).

2023-12-11 14:09

Repose

Registered: Oct 2010
Posts: 222

The code seems to be 120 bytes and the tables are 2044 bytes. I think I need to add some 30 more to create the tables.

2023-12-12 09:31

JackAsser

Registered: Jun 2002
Posts: 1990

Very very nice write up @Repose!

2023-12-12 17:47

Repose

Registered: Oct 2010
Posts: 222

Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)

2023-12-12 20:16

JackAsser

Registered: Jun 2002
Posts: 1990

Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)

Hehe, tbf Graham tought me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.

We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham tought me, which I believe HCL also uses, y/x = y* (1/x) instead of doing div. And then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x.

2023-12-12 20:23

Repose

Registered: Oct 2010
Posts: 222

Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations.