 
Repose
Registered: Oct 2010 Posts: 129 
Fast large multiplies
I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre and post processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?


... 77 posts hidden. Click here to view all posts.... 
 
Repose
Registered: Oct 2010 Posts: 129 
Just about to work out the subs, though I'm sure it works in some equivalent way, I'm thinking at most a sec or clc when switching between runs of adds and runs of subs. You can do one fixup at the end. The way I'm doing it makes sense too. No offsets needed.
(ps why did Ice T suddenly flash in my mind singing, no beepers needed?)
Sounds like mine is gonna be a lot cleaner, not to mention faster but we'll see :) 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations. 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: OK, 16x16 done and tested. Minimum 205 cycles, mean of around 216, including 12 cycles for the JSR/RTS
(assuming multiplier, multiplicand and destination all in ZP). I've just modified the codegen for the 32x32 for now, will have a look later to see if I've missed any obvious optimisations.
How does this compare to my stuff on Codebase? Also unsigned? 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes. 
 
Frantic
Registered: Mar 2003 Posts: 1304 
10% faster ain't bad! 
 
JackAsser
Registered: Jun 2002 Posts: 1226 
Quote: Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?
Unsigned, yes.
Nice!!! Havn't checked in detail, same table space overhead? 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Thanks!
An extra 256 bytes for the id table, so five pages of tables altogether. 
 
ChristopherJam
Registered: Aug 2004 Posts: 620 
Actually, scratch that; for the 16x16>32 case, I only ever read from 13 bytes of the identity table.
Make that 2061 bytes of tables required. Also 16 bytes of zero page for pointers. 
 
Repose
Registered: Oct 2010 Posts: 129 
Ok, I don't know who won because I'm counting things differently. I've always assumed branches are taken equally and averaged them. By that method, jackasser's is clearly 233+12 for jsr/rts=245.
One of the alternate versions I already have written is 207+12=219.
Yet, you reported 241 for JackAssers which is 3 less, and if you scale the same way, I get 219 for yours, exactly the same. As far as I can tell, we're tied.
They are not the same at all. My alternate version is very straightforward and doesn't use the repeated add technique. Still working on that (sorry I'm so slow). We're tied, but this isn't even my final entry.
JackAsser's
setup 74
mults 116
adds 43
total 233

 
ChristopherJam
Registered: Aug 2004 Posts: 620 
I hadn't yet counted the cycles in the source yet, so I just timed both my and JackAsser's routines with CIA for ten randomly selected pairs of numbers.
Now I've annotated my source, if I assume branches are taken equally and page crossings on table lookups also cost on average half a cycle, I get 206.5+12=218.5
So yes, ridiculously close at the moment given the different approaches I gather we've taken.
I can shave another two cycles off mine, but only if I bump the memory back up to 5 pages of tables again. (or rather, I need to place a 7 byte table somewhere in the last 64 bytes of a page… long story..)
I gather you have something better in the wings; I look forward to seeing it! 
Previous  1  2  3  4  5  6  7  8  9  Next 