Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 225
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

 
... 144 posts hidden. Click here to view all posts....
 
2017-04-15 14:22
ChristopherJam

Registered: Aug 2004
Posts: 1409
Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?

Unsigned, yes.
2017-04-15 14:47
Frantic

Registered: Mar 2003
Posts: 1648
10% faster ain't bad!
2017-04-15 16:49
JackAsser

Registered: Jun 2002
Posts: 2014
Quote: Under the same conditions, your stuff averages ~241 cycles, with a minimum of 232. So, only about 10% faster?

Unsigned, yes.


Nice!!! Havn't checked in detail, same table space overhead?
2017-04-15 16:55
ChristopherJam

Registered: Aug 2004
Posts: 1409
Thanks!

An extra 256 bytes for the id table, so five pages of tables altogether.
2017-04-15 18:48
ChristopherJam

Registered: Aug 2004
Posts: 1409
Actually, scratch that; for the 16x16->32 case, I only ever read from 13 bytes of the identity table.

Make that 2061 bytes of tables required. Also 16 bytes of zero page for pointers.
2017-04-15 20:39
Repose

Registered: Oct 2010
Posts: 225
Ok, I don't know who won because I'm counting things differently. I've always assumed branches are taken equally and averaged them. By that method, jackasser's is clearly 233+12 for jsr/rts=245.

One of the alternate versions I already have written is 207+12=219.

Yet, you reported 241 for JackAssers which is 3 less, and if you scale the same way, I get 219 for yours, exactly the same. As far as I can tell, we're tied.

They are not the same at all. My alternate version is very straightforward and doesn't use the repeated add technique. Still working on that (sorry I'm so slow). We're tied, but this isn't even my final entry.
JackAsser's
setup   74
mults  116
adds    43
total  233
2017-04-15 21:08
ChristopherJam

Registered: Aug 2004
Posts: 1409
I hadn't yet counted the cycles in the source yet, so I just timed both my and JackAsser's routines with CIA for ten randomly selected pairs of numbers.

Now I've annotated my source, if I assume branches are taken equally and page crossings on table lookups also cost on average half a cycle, I get 206.5+12=218.5

So yes, ridiculously close at the moment given the different approaches I gather we've taken.

I can shave another two cycles off mine, but only if I bump the memory back up to 5 pages of tables again. (or rather, I need to place a 7 byte table somewhere in the last 64 bytes of a pageĀ… long story..)

I gather you have something better in the wings; I look forward to seeing it!
2017-04-15 21:21
Repose

Registered: Oct 2010
Posts: 225
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.
2017-04-15 21:25
ChristopherJam

Registered: Aug 2004
Posts: 1409
My current best:
mul1616
    lda mT1+ 0      ; 3
    sta zp_fl0      ; 3
    sta zp_fh0      ; 3
    eor#255         ; 2
    sta zp_gl0      ; 3
    sta zp_gh0      ; 3

    lda mT1+ 1      ; 3
    sta zp_fl1      ; 3
    sta zp_fh1      ; 3
    eor#255         ; 2
    sta zp_gl1      ; 3
    sta zp_gh1      ; 3

    clc             ; 2
    ldy mT2+0       ; 3
    lda (zp_fl0),y  ; 5.5
    adc (zp_gl0),y  ; 5.5
    sta mRes+0      ; 3

    ldx#0           ; 2
    lda (zp_fh0),y  ; 5.5
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+1       ; 3
    adc (zp_fl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_3f,x  ; 4
    sta mRes+1      ; 3

    txa             ; 2
    ldx#$bf         ; 2
    adc (zp_fh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+0       ; 3
    adc (zp_fh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_80-$bf,x  ; 4
    sta mRes+2      ; 3

    txa             ; 2
    ldy mT2+1       ; 3
    adc (zp_fh1),y  ; 5.5
    clc             ; 2
    adc (zp_gh1),y  ; 5.5
    sta mRes+3      ; 3

    rts

    ; total=204.5

ofste_3f
	.byt $3f,$40,$41,$42,$43,$44

    .dsb <$bf-*,0
ofste_80
	.byt $80,$81,$82,$83,$84,$85,$86


(and of course a one off init of the high bytes of the zero page pointers, and square tables as follows:
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
2017-04-15 21:28
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting Repose
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.


Haha, then I should probably acknowledge that my ~750 cycle 32x32->64 requires 32 bytes of zero page pointers, on top of the 16 bytes of zero page split between the multiplier, multiplicand and result :)
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | ... | 16 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Flashback
No-XS
CRT
iAN CooG/HVSC
Lead/House Designs
Holy Moses/Role
Beast/Crescent
Technotron/I-I F
sachy/BOOM!
Guests online: 129
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 Layers  (9.6)
2 No Listen  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Rainbow Connection  (9.5)
7 Dawnfall V1.1  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top Coders
1 Axis  (9.8)
2 Graham  (9.8)
3 Lft  (9.8)
4 Crossbow  (9.8)
5 HCL  (9.8)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.085 sec.