Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 225
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

 
... 144 posts hidden. Click here to view all posts....
 
2023-12-11 12:51
Frantic

Registered: Mar 2003
Posts: 1648
Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.


Congratulations! I totally get that it must have been thrilling. :D
2023-12-11 13:57
Krill

Registered: Apr 2002
Posts: 2980
Quoting Repose
The time is 188.1 cycles, averaged over all possible inputs.
Pretty impressive. :)

It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?

If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist).
2023-12-11 14:09
Repose

Registered: Oct 2010
Posts: 225
The code seems to be 120 bytes and the tables are 2044 bytes. I think I need to add some 30 more to create the tables.
2023-12-12 09:31
JackAsser

Registered: Jun 2002
Posts: 2014
Very very nice write up @Repose!
2023-12-12 17:47
Repose

Registered: Oct 2010
Posts: 225
Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)
2023-12-12 20:16
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Bruh! You were my hero before I started this :) You held the record before me, and still we are both way ahead of most posts and published books. Respects :)

Hehe, tbf Graham tought me this trick (i.e. selfmod the code to get a+b for free, and using neg-tables to get a-b for free) iirc, or maybe it was CJam, can't remember.

We should add a whole chapter about quick division also. I'm currently (since 2004) using a trick Graham tought me, which I believe HCL also uses, y/x = y* (1/x) instead of doing div. And then using another multiply to linearly interpolate between y*(1/x) and y*(1/(x+1)) by the fractional part of x.
2023-12-12 20:23
Repose

Registered: Oct 2010
Posts: 225
Yes, I know about that trick, it's used in CPUs too! I read a whitepaper on the AMD FPU design.
I'm working on a mathlib of all the fastest routines, I'll be adding division after I'm done with all the multiply and x^2 variations.
2023-12-13 23:06
Count Zero

Registered: Jan 2003
Posts: 1932
Many thanks to Repose for his codebase contributions at this point as well!
https://codebase64.org/doku.php?id=base:fastest_multiplication gave me the thrills a while ago.

Everybody here ofc is invited to contribute as well :)
2023-12-13 23:55
Repose

Registered: Oct 2010
Posts: 225
Thanks!

A lot of the division approaches are mentioned here:
https://en.wikipedia.org/wiki/Division_algorithm

I can't find the AMD paper, but its similar to this:
https://ieeexplore.ieee.org/document/762835

It was detailing the implementation in the K7 CPU by AMD (very old one).

I don't wanna get too excited to dive into that, until I finish the multiplies :)

But I worked on that today. There's a lot of ways to handle signs. I was thinking of something like
lda x1 (high byte of first input)
asl (get high bit/sign into carry)
lda y1
bcc + (x1 is positive)
bpl + (y1 is positive)

So, there are 4 cases, and if one of the arguments is negative, the resulting product will be negative, and to do that, you only need to do the square tables subtraction in the reverse order, which is equivalent of taking the 2's complement of the product.

So, I'm hoping that that little sort routine above will lead me to variations which are mostly the same speed, so there's little overhead. I think if both inputs are negative will be the worst case.

Another approach is to use different representations. Does offset 128 work better? What about storing the sign in a separate byte? Then everything is always unsigned, and you just EOR the signs together for the result.

Another way is to take the 2's complement of the inputs first, do an unsigned, then 2's complement them again if needed.

Another way, do unsigned but fix up the answer depending on the arguments. You have to subtract one or both inputs from the high bytes, it's explained in C=Hacking 16 under the 3d article. Very good article, which I co-wrote too :)
2023-12-14 10:26
ChristopherJam

Registered: Aug 2004
Posts: 1409
No idea whether Graham also came up with the self-mod to get the A+B for free trick independently, but I know I came up with it myself when optimising some code either Steve Judd or George Taylor showed me after whoever it was introduced me to the difference of squares approach, back around when Steve was writing 3d rendering articles for C=Hacking in the late 90s.

I'm no longer sure who it was I was having those discussions with; might've been on IRC or might've been email correspondence, but either way I don't have the logs any more.
Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Magic/Nah-Kolor
Martinland
Freeze/Blazon
WVL/Xenon
Mason/Unicess
Steffan/BOOM!
ΛΛdZ
astaroth/TRSI
iAN CooG/HVSC
Copyfault/Extend^tsn..
A3/AFL
E$G/HF ⭐ 7
slimeysmine
The Syndrom/TIA/Pret..
Walt/Bonzai
algorithm
iceout/Avatar/HF
Guests online: 87
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 Layers  (9.6)
2 No Listen  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top Original Suppliers
1 Derbyshire Ram  (9.7)
2 Fungus  (9.3)
3 Black Beard  (9.2)
4 Baracuda  (9.2)
5 hedning  (9.1)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.048 sec.