[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2017-04-08 12:18

ChristopherJam

Registered: Aug 2004
Posts: 1378

Thanks!

Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.

Down to 699 cycles for 0*0, btw ;)

2017-04-08 14:09

Repose

Registered: Oct 2010
Posts: 222

I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.

2017-04-10 09:53

ChristopherJam

Registered: Aug 2004
Posts: 1378

Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.

Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes."

2017-04-10 09:55

ChristopherJam

Registered: Aug 2004
Posts: 1378

Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway - instant 2x speedup :D

2017-04-10 12:45

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.

If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.

Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.

Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.

2017-04-10 17:44

Oswald

Registered: Apr 2002
Posts: 5017

Quote: Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.

cool, I hope its for some c64 demo thingie :)

2017-04-10 18:42

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: cool, I hope its for some c64 demo thingie :)

No sorry. :) just curiosity.

2017-04-11 04:52

Oswald

Registered: Apr 2002
Posts: 5017

Quote: No sorry. :) just curiosity.

nah next time you make a true doom fx, or you'll be kicked out :)

2017-04-11 04:59

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: nah next time you make a true doom fx, or you'll be kicked out :)

Hehe, wasn't the Andropolis vector enough? ;) there I only used 16-bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and remove the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure.

2017-04-11 21:49

Repose

Registered: Oct 2010
Posts: 222

Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code)