| |
Repose
Registered: Oct 2010 Posts: 225 |
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
|
|
... 144 posts hidden. Click here to view all posts.... |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Thanks!
Of course, going to have to start forking off "best best case" vs "best worst case" vs "best average time" pretty soon.
Down to 699 cycles for 0*0, btw ;) |
| |
Repose
Registered: Oct 2010 Posts: 225 |
I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Yes, my average for multiplying 10 randomly selected pairs is around 760 cycles at the moment, ranging from around 740 to 780.
Floats only need 24bit x 24bit, so that should be a lot faster. The shifting for adds will be a bit of a hassle. Do you care about correct behaviour for NaNs etc? And how critical is exact rounding? I'm guessing IEEE standard would be considerably slower than "good enough for most purposes." |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Also, if you're only planning on supporting 32bit ints, there's no way to access the high 32 bits of the result from C anyway - instant 2x speedup :D |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I've thought about how to decide or statistically optimize by input, I think 0 and 1 would be good cases to be faster, but not at the expense of a lot of avg speed, which will vastly dominate in any sane situation.
If we finish this, next steps are signed, floating, and the big one is division. With a great multiply, you can use reciprocal division, but you still need remainder.
Ultimately I'd like to replicate all the basic arithmetic of a 32bit cpu, then it would be a complete library for the doom port (which compiles to an emulated cpu), C compilers, etc.
Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: Funny, I just wrote my own doom-renderer using original WAD-files, albeit in C/SDL, not C64. :) But still, it was a fun exercise down memory lane. Coded a lot of such stuff back in the day.
cool, I hope its for some c64 demo thingie :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: cool, I hope its for some c64 demo thingie :)
No sorry. :) just curiosity. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: No sorry. :) just curiosity.
nah next time you make a true doom fx, or you'll be kicked out :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: nah next time you make a true doom fx, or you'll be kicked out :)
Hehe, wasn't the Andropolis vector enough? ;) there I only used 16-bit muls and divs (8.8 fixpoint) hence the coarse resolution of the level (thickness of walls and minimum stair step height). I also optimised to only have axis-aligned rooms. Moving to 16.16 muls and remove the axis alignment would perhaps slow it down by 30% but it would be able to render an untextured Doom level I'm sure. |
| |
Repose
Registered: Oct 2010 Posts: 225 |
Very impressive. So if I can speed up a 16x16, that would be of practical benefit in speeding the framerate of Andropolis? Now there's a goal, would you be able to substitute my routine to see what happens? (If I can make a major speedup to your code) |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 - Next |