CSDb User Forums


2012-06-09 21:45
Repose
Account closed

Registered: Oct 2010
Posts: 129
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, for cases where the multiply routine's time depends on the bits of the multiplier. Usually, with a standard shift-and-add routine, each 1 bit in the multiplier costs a "bit" more time for that bit.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, you average half 1 bits. In my case, that becomes the worst case, and there are about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by pre- and post-processing the results. The improvement is about 20%.
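One transformation that makes "half the bits set" the worst case is to complement the multiplier whenever more than half of its bits are 1, then fix the result up afterwards. The Python sketch below only illustrates that idea. The 16-bit width, the function names, and counting adds as the cost are all assumptions; the post does not say which transformations it actually uses.

```python
def shift_add(a, b, bits=16):
    """Plain shift-and-add multiply; returns (product, number of adds)."""
    product, adds = 0, 0
    for i in range(bits):
        if (b >> i) & 1:          # each 1 bit in the multiplier costs one add
            product += a << i
            adds += 1
    return product, adds


def multiply_fewer_ones(a, b, bits=16):
    """Hypothetical pre/post-processing: complement b if it is mostly 1 bits."""
    mask = (1 << bits) - 1
    if bin(b).count("1") * 2 > bits:
        # b = mask - ~b, so a*b = a*mask - a*(~b) = (a << bits) - a - a*(~b)
        partial, adds = shift_add(a, ~b & mask, bits)
        return (a << bits) - a - partial, adds
    return shift_add(a, b, bits)
```

With this change the inner loop never performs more than `bits / 2` adds. The two extra subtractions in the post-processing step are not counted here.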
Another speedup is optimizing for the same multiplier applied to multiple multiplicands. This saves a little by processing the multiplier bits only once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 77 posts hidden ...
2017-04-08 10:37
Repose
Account closed

Registered: Oct 2010
Posts: 129
Good job, that's right in the range of what I thought was possible.

I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.

by offset $4000, doesn't that reduce the domain?

Correction is fast, it's only
stx z3
sec
sbc z3
sta z2


Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)
2017-04-08 11:50
ChristopherJam

Registered: Aug 2004
Posts: 707
Quoting Repose
Good job, that's right in the range of what I thought was possible.
Thanks!

Quote:
I have an improvement; instead of trashing A to change the multiplier, you can prestuff pointers with the 4 multipliers.
Doing :)

Quote:
by offset $4000, doesn't that reduce the domain?

Nah, second table only contains x**2/4 for x in -255 to 255, so it already maxed out at $3f80
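For reference, the $3f80 figure can be checked with a small Python model of the quarter-square identity a*b = f(a+b) - f(a-b), where f(x) = floor(x*x/4). The table and function names are made up for this sketch, and how the $4000 offset is folded into the real tables is not shown in the thread, so only the headroom is checked.

```python
# Quarter-square tables for 8-bit operands (illustrative layout, not the
# routine's actual memory layout).
SUM_TABLE = [n * n // 4 for n in range(511)]             # index a + b, 0..510
DIFF_TABLE = {d: d * d // 4 for d in range(-255, 256)}   # index a - b, -255..255


def multiply8(a, b):
    """8x8 -> 16 bit multiply via the difference of two quarter squares."""
    return SUM_TABLE[a + b] - DIFF_TABLE[a - b]


largest_diff_entry = max(DIFF_TABLE.values())   # 255*255 // 4
print(hex(largest_diff_entry))                  # 0x3f80
print(hex(largest_diff_entry + 0x4000))         # 0x7f80, still fits in 16 bits
```

The identity is exact even with the floor, because a+b and a-b always have the same parity, so both table entries drop the same fraction.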

Quote:
Correction is fast, it's only
stx z3
sec
sbc z3
sta z2

True, but that's an extra 64 cycles, and removing the CLCs saves at most 64 cycles, sometimes as little as zero (if the branches skip over them all)

Quote:
Also yours shouldn't be any faster than my approach from what I can tell, though I do have some ideas to speed up adds again.. we'll see :)

Yes, there should be an equivalent that mixes ADC and SBC, I just found it easier to wrap my brain around the edge cases and carry handling by converting it to ADC only. I'll be interested to see what you come up with.
2017-04-08 12:19
ChristopherJam

Registered: Aug 2004
Posts: 707
Oh, faster correction:
  sec
  sbc id,x
  sta z2

Still not gonna do it, mind ;)
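The reason the shorter version works: `id` is an identity table (`id[x] = x`), so `sbc id,x` subtracts X directly without parking it in zero page first. By the standard 6502 timings, `stx zp` (3) + `sec` (2) + `sbc zp` (3) is 8 cycles, while `sec` (2) + `sbc abs,x` (4, assuming no page crossing) is 6. Here is a Python model of the equivalence; the helper names are invented for the sketch.

```python
ID = list(range(256))   # identity table: ID[x] == x


def sbc(a, m, carry=1):
    """Model of 6502 SBC in binary mode: returns (result, carry_out)."""
    diff = a - m - (1 - carry)
    return diff & 0xFF, 0 if diff < 0 else 1


def correct_via_zero_page(a, x):
    z3 = x                  # stx z3
    return sbc(a, z3)       # sec / sbc z3


def correct_via_id_table(a, x):
    return sbc(a, ID[x])    # sec / sbc id,x
```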
2017-04-08 12:59
Repose
Account closed

Registered: Oct 2010
Posts: 129
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Let's say in the middle columns where there's 14 adds per column, that's 28 cycles half the time saved from not using CLC, or 14 cycles on average, vs 8 cycles for correction, it still saves 6.
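The arithmetic in that paragraph can be written as a tiny cost model. This is a sketch under the post's stated numbers only (2 cycles per CLC, a carry about half the time, 8 cycles for the correction); the function and parameter names are assumptions.

```python
def average_saving(adds, clc_cycles=2, carry_rate=0.5, correction_cycles=8):
    """Average cycles saved per column by dropping the CLCs and correcting once."""
    saved_clcs = adds * clc_cycles * carry_rate   # a CLC only runs after a carry
    return saved_clcs - correction_cycles


print(average_saving(14))                        # 6.0, the middle-column case
print(average_saving(14, correction_cycles=6))   # 8.0 with a 6-cycle correction
```

Under this simplified model the break-even lands at 8 adds rather than the 7 quoted above, so the real routine presumably has details the model does not capture.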

I actually found the stats for the carries: most of them are about half, but adding a higher proportion of high bytes gives fewer carries.
2017-04-08 13:01
Repose
Account closed

Registered: Oct 2010
Posts: 129
Good catch on the id,x, I was thinking of that a few days ago but it didn't click in for this situation yet :)

And yes, I worked hard at mixing add/sub properly; it still doesn't really make sense, but it works. I thought it wouldn't if you DEX to $FF, but it still works.
2017-04-08 13:14
ChristopherJam

Registered: Aug 2004
Posts: 707
Quoting Repose
About the correction, I think you're adding things up wrong. I only use correction for those columns where it's faster, and I found the break even at 7 adds, so it should work. All but the outer 1 or 2 columns can use it.

Ah, good point. Only remaining issue is what to do with the borrow if the correction underflows. My brain hurts..
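A toy Python model of the borrow problem, for anyone following along. It adds a column of bytes with ADC and no CLC, counts the carries in X, then subtracts the leaked carries in one go. How the final carry is folded in is an assumption of this sketch and may differ from the actual routines discussed here.

```python
def column_sum_no_clc(column):
    """Add bytes with ADC but never CLC; count the carries in x."""
    a, carry, x = 0, 0, 0
    for byte in column:
        total = a + byte + carry      # carry from the previous add leaks in
        a, carry = total & 0xFF, total >> 8
        x += carry                    # bcc skip / inx
    return a, carry, x


def corrected_column(column):
    """Return (low byte, carries into the next column) for the true sum."""
    a, carry, x = column_sum_no_clc(column)
    # Every carry except the last one was added into a later byte, so the
    # raw low byte is too large by (x - carry).
    adjust, low = divmod(a - (x - carry), 256)   # adjust is -1 on underflow
    return low, x + adjust
```

If the correction underflows, the low byte wraps and one carry has to be taken back out of the count, which is the `adjust` term.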
2017-04-08 13:24
Repose
Account closed

Registered: Oct 2010
Posts: 129
Yes it hurts :) I posted the explanation in the add/sub thread if you can follow it.
Try 0 - ff - ff in your head. Have fun! :)
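The exercise can also be traced in Python. This sketch assumes a SEC before every SBC and a DEX on each borrow, which is one reading of the suggestion rather than the code from the add/sub thread.

```python
def sub_chain(start, values):
    """Subtract bytes from an 8-bit accumulator, counting borrows down in x."""
    a, x = start, 0
    for v in values:
        diff = a - v                  # sec / sbc
        if diff < 0:
            x = (x - 1) & 0xFF        # bcs skip / dex, wraps to $FF
        a = diff & 0xFF
    return a, x


a, x = sub_chain(0x00, [0xFF, 0xFF])
print(hex(a), hex(x))                 # 0x2 0xfe
```

Read as a 16-bit value, $FE02 is -510 in two's complement, which is why letting X wrap through $FF still gives the right answer.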
2017-04-08 14:18
ChristopherJam

Registered: Aug 2004
Posts: 707
Thanks!

Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.

Down to 699 cycles for 0*0, btw ;)
2017-04-08 16:09
Repose
Account closed

Registered: Oct 2010
Posts: 129
I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
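As a rough illustration of reciprocal division with a remainder: multiply by a precomputed reciprocal, then get the remainder with one more multiply and at most one fix-up step. The fixed-point scaling and the names below are assumptions for the sketch, not a design from this thread.

```python
def make_reciprocal(d, bits=16):
    """Fixed-point reciprocal floor(2**(2*bits) / d), computed once per divisor."""
    return (1 << (2 * bits)) // d


def divmod_by_reciprocal(n, d, recip, bits=16):
    """Quotient and remainder of n / d for 0 <= n < 2**bits, d >= 1."""
    q = (n * recip) >> (2 * bits)     # estimate: either the true quotient or one less
    r = n - q * d                     # the remainder costs one more multiply
    if r >= d:                        # single correction step
        q += 1
        r -= d
    return q, r


recip = make_reciprocal(10)
print(divmod_by_reciprocal(65535, 10, recip))   # (6553, 5)
```

The estimate can only be low by one because the reciprocal is rounded down, so a single compare and subtract is enough to fix both quotient and remainder.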

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.
2017-04-10 11:53
ChristopherJam

Registered: Aug 2004
Posts: 707
Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.

Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes."
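To make the rounding question concrete, here is a Python sketch that multiplies two 24-bit significands and either truncates or rounds to nearest even. It assumes normalized inputs with the top bit set and ignores signs, exponents beyond a +0/+1/+2 adjustment, NaNs, infinities and denormals; the function name is made up.

```python
def multiply_significands(ma, mb, round_nearest_even=True):
    """Multiply two normalized 24-bit significands.

    Returns (24-bit significand, exponent adjustment).
    """
    assert (1 << 23) <= ma < (1 << 24) and (1 << 23) <= mb < (1 << 24)
    product = ma * mb                         # 48-bit result in [2**46, 2**48)
    exp_adjust = 1 if product >> 47 else 0    # product >= 2.0 needs one extra shift
    shift = 23 + exp_adjust
    result = product >> shift                 # truncation ("good enough") result
    if round_nearest_even:
        remainder = product & ((1 << shift) - 1)
        half = 1 << (shift - 1)
        if remainder > half or (remainder == half and result & 1):
            result += 1
            if result == 1 << 24:             # rounding overflowed the significand
                result >>= 1
                exp_adjust += 1
    return result, exp_adjust


print(multiply_significands(0x800001, 0xC00000, round_nearest_even=False))
print(multiply_significands(0x800001, 0xC00000, round_nearest_even=True))
```

The two calls differ only in the last bit ($C00001 vs $C00002), which is the kind of difference the "exact rounding" question is about. The extra work is the remainder test and the possible renormalization.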
All times are CET.