| |
Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.
Even my old routine has won out of 120 routines tested independently.
Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results
This was a thrilling moment for me :)
But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.
Congratulations! I totally get that it must have been thrilling. :D |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting ReposeThe time is 188.1 cycles, averaged over all possible inputs. Pretty impressive. :)
It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?
If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist). |
| |
Repose
Registered: Oct 2010 Posts: 225 |
The code seems to be 120 bytes and the tables are 2044 bytes. I think I need to add some 30 more to create the tables. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Very very nice write up @Repose! |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
Hehe, tbf Graham tought me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.
We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham tought me, which I believe HCL also uses, y/x = y* (1/x) instead of doing div. And then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations. |
| |
Count Zero
Registered: Jan 2003 Posts: 1932 |
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication gave me the thrills a while ago.
Everybody here ofc is invited to contribute as well :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Thanks!
A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm
I can't find the AMD paper, but its similar to this:
https://ieeexplore.ieee.org/document/762835
It was detailing the implementation in the K7 CPU by AMD (very old one).
I don't wanna get too excited to dive into that, until I finish the multiplies :)
But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)
So, there are 4 cases, and if one of the arguments is negative, the resulting product will be negative, and to do that, you only need to do the square tables subtraction in the reverse order, which is equivalent of taking the 2's complement of the product.
So, I'm hoping that that little sort routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think if both inputs are negative will be the worst case.
Another approach is to use different representations. Does offset 128 work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.
Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.
Another way, do unsigned but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes, it's explained in C=Hacking 16 under the 3d article. Very good article, which I co-wrote too :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
No idea whether Graham also came up with the self-mod to get the A+B for free trick independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me after whoever it was introduced me to the difference of squares approach, back around when Steve was writing 3d rendering articles for C=Hacking in the late 90s.
I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more. |
Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next |