Welcome to our latest new user
BritBlaster
! (Registered 2025-03-12)
You are not logged in -
nap
CSDb User Forums
Forums
>
C64 Coding
>
Fast large multiplies
2012-06-09
19:45
Repose
Registered: Oct 2010
Posts: 227
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
... 145 posts hidden. Click
here
to view all posts....
2023-12-12
20:16
JackAsser
Registered: Jun 2002
Posts: 2038
Quote:
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
Hehe, tbf Graham tought me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.
We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham tought me, which I believe HCL also uses, y/x = y* (1/x) instead of doing div. And then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x.
2023-12-12
20:23
Repose
Registered: Oct 2010
Posts: 227
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations.
2023-12-13
23:06
Count Zero
Registered: Jan 2003
Posts: 1946
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication
gave me the thrills a while ago.
Everybody here ofc is invited to contribute as well :)
2023-12-13
23:55
Repose
Registered: Oct 2010
Posts: 227
Thanks!
A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm
I can't find the AMD paper, but its similar to this:
https://ieeexplore.ieee.org/document/762835
It was detailing the implementation in the K7 CPU by AMD (very old one).
I don't wanna get too excited to dive into that, until I finish the multiplies :)
But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)
So, there are 4 cases, and if one of the arguments is negative, the resulting product will be negative, and to do that, you only need to do the square tables subtraction in the reverse order, which is equivalent of taking the 2's complement of the product.
So, I'm hoping that that little sort routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think if both inputs are negative will be the worst case.
Another approach is to use different representations. Does offset 128 work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.
Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.
Another way, do unsigned but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes, it's explained in C=Hacking 16 under the 3d article. Very good article, which I co-wrote too :)
2023-12-14
10:26
ChristopherJam
Registered: Aug 2004
Posts: 1416
No idea whether Graham also came up with the self-mod to get the A+B for free trick independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me after whoever it was introduced me to the difference of squares approach, back around when Steve was writing 3d rendering articles for C=Hacking in the late 90s.
I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more.
2023-12-14
13:26
Repose
Registered: Oct 2010
Posts: 227
Oh, you definitely have added some great ideas!
However, I researched it before and there's a few people who came up with that idea before any of us, not c64 sceners either.
2024-02-13
08:13
Repose
Registered: Oct 2010
Posts: 227
I've finally published my new fastest routine, 2023 version. It's even faster now, 187.1.
https://codebase64.org/doku.php?id=base:fastest_multiplication_..
And independently benchmarked here
https://github.com/tobyLobster/multiply_test/?tab=readme-ov-fil..
Both faster and shorter, so not bad.
2024-02-13
13:37
ChristopherJam
Registered: Aug 2004
Posts: 1416
Nice work!
2024-02-13
20:58
Repose
Registered: Oct 2010
Posts: 227
Of course, only now do I realize a dumb trick:
x1 = p_sqr_lo
umult16
; set multiplier as x1
lda x1
;sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_neg_sqr_lo
sta p_neg_sqr_hi
; set multiplier as x0
; *x1 is no longer preserved
To easily save 3 cycles and 1 byte zp.
Going a bit further, if you add 6 zp, using two sets of pointers, then x0 and x1 can be p_sqr_lo1 and p_sqr_lo2. And one more step, make the caller set up both pointers, but that's a bit more difficult as we can't assume the caller can use A for the calculation eor #$ff (yes, A will be trashed when you call the umult16, but I mean when the caller has the other half of x_ ready).
So, save 3, 6 or 12 cycles.
One more hack is passing y0 in Y, but in the worst case, that's not convenient to the caller, if it happens to be in X, then it's easier to stx y0 rather than txa:tay.
So, I would say saving 3 or 6 cycles are good options, and 3 for sure. That brings the routine down to 180.1 :) Saving 14.5 cycles and being 10% faster definitely makes me happy, losing another 6 bytes zp not so much.
2025-01-01
13:13
Repose
Registered: Oct 2010
Posts: 227
Another year and I'm just a day late, but I do have a 2024 version of multiplies which are faster again, breaking the world record 3 times in a row. I still have some work to do to fully verify and explore all the possibilities, but it's coming soon. I think it will be another 6% faster at best.
I keep wondering, how could I not see this? How can I open my mind better to the possibilities? You have to always go back to first principles and explore every variation.
So, this is just a pre-announcement.
Previous
-
1
| ... |
6
|
7
|
8
|
9
|
10
|
11
|
12
|
13
|
14
|
15
| 16 - Next
Refresh
Subscribe to this thread:
You need to be logged in to post in the forum.
Search the forum:
Search
All forums
C64 Coding
C64 Composing
C64 Pixeling
C64 Productions
CSDb Bug Reports
CSDb Development
CSDb Discussions
CSDb Entries
CSDb Feedback
CSDb Info
CSDb moderators
CSDb Questions
Messages to moderators
Requests
for
in
Writer & text
Text
Writer
All times are CET.
Search CSDb
All
Releases
Groups
Sceners
Events
BBS
SIDs
-------
Forum
Comments
Advanced
Users Online
hedning/G★P
Alakran_64
megasoftargentina
Mason/Unicess
Guests online: 108
Top Demos
1
Next Level
(9.7)
2
13:37
(9.7)
3
Codeboys & Endians
(9.7)
4
Coma Light 13
(9.6)
5
Mojo
(9.6)
6
Edge of Disgrace
(9.6)
7
Uncensored
(9.6)
8
Comaland 100%
(9.6)
9
What Is The Matrix 2
(9.5)
10
Wonderland XIV
(9.5)
Top onefile Demos
1
Nine
(9.8)
2
Layers
(9.6)
3
Cubic Dream
(9.6)
4
Party Elk 2
(9.6)
5
Copper Booze
(9.6)
6
Dawnfall V1.1
(9.5)
7
Rainbow Connection
(9.5)
8
Onscreen 5k
(9.5)
9
Morph
(9.5)
10
Barry Boomer - Trapp..
(9.5)
Top Groups
1
Oxyron
(9.3)
2
Booze Design
(9.3)
3
Performers
(9.3)
4
Censor Design
(9.2)
5
Triad
(9.2)
Top Organizers
1
Burglar
(9.9)
2
Sixx
(9.8)
3
MWS
(9.7)
4
Irata
(9.7)
5
hedning
(9.7)
Home
-
Disclaimer
Copyright © No Name 2001-2025
Page generated in: 0.053 sec.