[CSDb] - User Forums - Fast large multiplies

Welcome to our latest new user Nicron ! (Registered 2024-05-21)

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2022-01-03 16:53

Repose

Registered: Oct 2010
Posts: 222

I've been working hard on my mathlib. I think I've worked out all the use cases and made completely flexible macros that work together in a system, almost like a macro language. These can produce fully optimal code in an easy way.
Covered:
-squaring
-constant multipliers
-re-using multipliers
-distributed multiplication (a*(b,c))
-chained multiplication (a*b*c)
-zp and self-mod versions; and they are the same speed
-use cases for multiple pointers
-cycle timing variables that respond to all variations
-automatic carry flag tracking/setting
-custom table generation for built-in constant add/multiply
-low-precision outputs with optional rounding

To do:
-update test program
-verify the 32x32
-add signed and other versions
-squaring
-polynomials

At least a few more weeks to release something I'm satisfied with :)
I also discovered xc_basic and I'd like to add my library to their source, which is using a simple shift/add loop.

2022-01-03 17:29

Krill

Registered: Apr 2002
Posts: 2854

How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)

2022-01-03 19:24

JackAsser

Registered: Jun 2002
Posts: 1993

Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)

Hehe probably none due to the square tables.

@repose: regarding signing you can just post-fix that if you want, not optimal though but rather fast.


 ; Apply sign (See C=Hacking16 for details).
                lda T1+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T2+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T2+1
                    sta PRODUCT+3
                :
                lda T2+1
                bpl :+
                    sec
                    lda PRODUCT+2
                    sbc T1+0
                    sta PRODUCT+2
                    lda PRODUCT+3
                    sbc T1+1
                    sta PRODUCT+3
                :

But again, you already knew that no doubt. :)

2022-01-03 22:31

Repose

Registered: Oct 2010
Posts: 222

Thanks for the tip! I hadn't considered it in details. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.

2022-01-03 22:44

Repose

Registered: Oct 2010
Posts: 222

Quote: How many and which of your routines would fit in say, just under 2 kilobytes of RAM? =)

Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful. You can use a half-sized table and still be fast, in fact this was the usual expression of the squares method before it was fully optimized. Other than that, an unrolled shift-add loop is all I can think of (but I am so tired of seeing those :)

2022-01-03 22:48

JackAsser

Registered: Jun 2002
Posts: 1993

Quote: Thanks for the tip! I hadn't considered it in details. However, why use 2's complement? I was thinking to just keep the sign in a byte. When multiplying, it's just EORing the bytes then. Adding/subtracting could cause an overflow, which is the only case that needs a fix.

I didn't really though much about it and just used the hint from that article. EOR doesn't suffice. EOR+1 though..

And it also depends on if the inputs are both signed (no EOR+1 needed) or if just one of them is.

2022-01-03 23:23

Krill

Registered: Apr 2002
Posts: 2854

Quoting Repose

Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful.

Yeah, the reason i'm asking is drive-code, of course.

I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.

Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =)

2022-01-04 05:15

Repose

Registered: Oct 2010
Posts: 222

Quote: Quoting Repose
Jackasser is right, initially I am focused on pure speed, but I realize that smaller options would be useful.
Yeah, the reason i'm asking is drive-code, of course.

I've had a co-processor drive-code maths library on my list for years, and now that the plumbing (proper custom-drive code API in the IRQ loader) is ready, building stuff on top of that (like the saver plug-in) is quite easy.

Demos using the drive to speed up a certain class of algorithms are still quite rare, after all these years, but there may be a flower pot or two to win, still. =)

Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.

2022-01-04 10:24

Krill

Registered: Apr 2002
Posts: 2854

Quoting Repose

Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.

I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)

* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems.

2022-01-04 13:03

Repose

Registered: Oct 2010
Posts: 222

Quote: Quoting Repose
Cool! I've thought of this as well, and it's always been my dream to make either a RAID array or CPU multiprocessing cluster. I put it aside though, when I realized how slow the communication was. It would work best for something like fractals where you do a lot of loops, with small data. Now, on an expanded drive, especially a 1571, that would be quite useful.
Even further, I thought of more database functions a drive could do, or searching, compressing files.
I was also thinking of a computer cluster, with a raytracing/raycasting application to showcase.
It's trivially parallelisable in theory*, highly scalable on many dimensions, and you can put all kinds of programmable devices on the bus, including a C-128 (or C16/+4, etc.) on the other end. =)

* In practice, there's one big problem i've been thinking on and off about but not solved yet: the bus arbitration.
The computers on either end should not just poll all devices but do work themselves and be notified whenever some other work is completed, but that gives rise to race conditions and other problems.

This is a common problem which has been solved. For ethernet, it's called CSMA/CD. Then there's the method from CAN, the 2 wire car bus, which is a better protocol.
https://copperhilltech.com/blog/controller-area-network-can-bus..