| |
Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
Hehe, tbf Graham tought me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.
We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham tought me, which I believe HCL also uses, y/x = y* (1/x) instead of doing div. And then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations. |
| |
Count Zero
Registered: Jan 2003 Posts: 1932 |
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication gave me the thrills a while ago.
Everybody here ofc is invited to contribute as well :) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Thanks!
A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm
I can't find the AMD paper, but its similar to this:
https://ieeexplore.ieee.org/document/762835
It was detailing the implementation in the K7 CPU by AMD (very old one).
I don't wanna get too excited to dive into that, until I finish the multiplies :)
But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)
So, there are 4 cases, and if one of the arguments is negative, the resulting product will be negative, and to do that, you only need to do the square tables subtraction in the reverse order, which is equivalent of taking the 2's complement of the product.
So, I'm hoping that that little sort routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think if both inputs are negative will be the worst case.
Another approach is to use different representations. Does offset 128 work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.
Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.
Another way, do unsigned but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes, it's explained in C=Hacking 16 under the 3d article. Very good article, which I co-wrote too :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
No idea whether Graham also came up with the self-mod to get the A+B for free trick independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me after whoever it was introduced me to the difference of squares approach, back around when Steve was writing 3d rendering articles for C=Hacking in the late 90s.
I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Oh, you definitely have added some great ideas!
However, I researched it before and there's a few people who came up with that idea before any of us, not c64 sceners either. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I've finally published my new fastest routine, 2023 version. It's even faster now, 187.1.
https://codebase64.org/doku.php?id=base:fastest_multiplication_..
And independently benchmarked here
https://github.com/tobyLobster/multiply_test/?tab=readme-ov-fil..
Both faster and shorter, so not bad. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Nice work! |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Of course, only now do I realize a dumb trick:
x1 = p_sqr_lo
umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi
; set multiplier as x0
; *x1 is no longer preserved
To easily save 3 cycles and 1 byte zp.
Going a bit further, if you add 6 zp, using two sets of pointers, then x0 and x1 can be p_sqr_lo1 and p_sqr_lo2. And one more step, make the caller set up both pointers, but that's a bit more difficult as we can't assume the caller can use A for the calculation eor #$ff (yes, A will be trashed when you call the umult16, but I mean when the caller has the other half of x_ ready).
So, save 3, 6 or 12 cycles.
One more hack is passing y0 in Y, but in the worst case, that's not convenient to the caller, if it happens to be in X, then it's easier to stx y0 rather than txa:tay.
So, I would say saving 3 or 6 cycles are good options, and 3 for sure. That brings the routine down to 180.1 :) Saving 14.5 cycles and being 10% faster definitely makes me happy, losing another 6 bytes zp not so much. |
Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next |