Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 222
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

 
... 144 posts hidden. Click here to view all posts....
 
2017-04-18 04:58
JackAsser

Registered: Jun 2002
Posts: 1987
Quote: You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.

I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?


Hehe, I suck at writing articles! :)

Imagine the simpliest case: axis aligned rooms, no height differences and just rotate around Y, i.e. Wolfenstein.

The diff formed by the edge between the wall and the floor/ceiling will be an enlogated thin triangle. The diff might be 3-4 pixels high close to the camera, this is easy, but further down the corridor the height of the diff will be <1 but still pixels here and there must be pur. Surely an extended bresenham that "draws" both the top and the bottom line simultaneous could handle it, and calc the total height between the lines and still propagate individual errors.
2017-04-18 09:28
Repose

Registered: Oct 2010
Posts: 222
Ok, I think that makes sense now.

Now 196
I managed to save another 7 cycles from the correct total of 203, bringing it down to 159+37=196. As a bonus, the two highest bytes are returned in registers.

If I put do_adds in zp, there's one more optimization to save 4 instead of 3, for 192. Finally, if you don't need the lowest bytes, you can save 3 cycles by deleting sta z1.
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay;x1*y1h;Y=z3, 30 cycles

do_adds:
;-add the first two numbers of column 1
	clc
c1a:	lda #0
c1b:	adc #0
	sta z1;9

;-continue to first two numbers of column 2
c2a:	lda #0
c2b:	adc #0
	tax;X=z2, 6 cycles
	bcc c1c;3/6 avg 4.5
	iny;z3++
	clc

;-add last number of column 1
c1c:	lda #0
	adc z1
	sta z1;8

;-add last number of column 2
	txa;A=z2
c2c:	adc #0
	tax;X=z2, 6
	bcc fin;3/4 avg 3.5
	iny;z3++

;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin:	rts
2017-04-18 13:11
Oswald

Registered: Apr 2002
Posts: 5007
I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?
2017-04-18 13:43
ChristopherJam

Registered: Aug 2004
Posts: 1359
Damn, so you're now 8.5 cycles faster than me? I was not expecting partial products to be faster than the optimisations we've been working on for runs of adc. Going to have to study this more closely.

Canada holds the record, for now. Nice work!
2017-04-18 17:31
JackAsser

Registered: Jun 2002
Posts: 1987
Quote: I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?

Adding in one go would lose the carry information
2017-04-20 11:57
Repose

Registered: Oct 2010
Posts: 222
To be fair, CJ can save 2 cycles by returning in regs like me, then they are just 6.5 or 10.5 cycles apart. Most of our instructions are the same, there's just a difference in overhead. But I can't seem to find any idea that's faster; this could be the actual optimal routine.

I tried all branches for the adds, that's a bit slower. I mean code like:
clc
lda c1a
adc c1b
bcs c1bc	c1bc:
		clc
adc c1c	        adc c1c
sta z1		sta z1
		lda #1
lda c2a	        adc c2a
		bcs c2ac;7	c2ac:
				clc
adc c2b	        adc c2b	        adc c2b

And it did seem to work, but I get about 39.5 cycles.

I also looked into the crazy timer idea, but each correction is sbc timerA, which is already too slow for this small number of adds.

I have one idea that can be dramatically faster but it's not very practical.
;y0 in X, y1 in Y, x0-1, x1-1 init to $02, all ram swapped in
x0=$fc;fb-fc is pointer
x1=$fe;fd-fe is pointer
jmp (x0-1)
mult-k:
;multiply by constant multiplier k
;and multiplicands X, Y
;then add
rts

There's never any setups so that saves 34 cycles, many of the multipliers can be optimized, though not reducing the average much, important cases like 0,1, and powers of 2 could be faster.

You also need room for tables so it could only handle 7 bits at a time, unless you did each one in pure code.

Even with such an extreme approach, the worst case is probably close to a normal multiply.

What do you think, is this the end?
2017-04-20 13:22
ChristopherJam

Registered: Aug 2004
Posts: 1359
Well, without memory restrictions you could do crazy stuff like have variants of the f() or g() tables offset by 1..N so you could move the carry correction into selecting which table to use... but I'm not sure how much that would gain you.
2021-03-02 08:14
Strobe

Registered: Jun 2007
Posts: 6
Thanks Repose, ChristopherJam etc for a great 16x16 routine!

I only really needed a 8x16=24 multiply so I've been hacking Reposes one down to what I settled on below.
Uses the table & ZP setup of the original.
At ~89 cycles (?) and 54 bytes it's twice as fast and half the size of the 16x16 one. Also doesn't touch X.
See my thoughts on tweaks in the comments at the end but I'd like to get it faster and feel there is some obvious improvements staring me in the face which I'll be happy for others to point out :)
;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
;set multiplier as x0
    lda x0              ;comment out and call with A=x0 -2b3c
    sta p_sqr_lo
    sta p_sqr_hi
    eor #$ff
    sta p_invsqr_lo
    sta p_invsqr_hi;17

    ldy y0              ;comment out and call with Y=y0 -2b3c
    sec
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y;note these two lines taken as 11 total
    sta z0;x0*y0l       comment out if you don't care about z0, -2b3c, OR
    lda (p_sqr_hi),y    ; ..replace with TAX -1b1c (but destroys X obviously)
    sbc (p_invsqr_hi),y
    sta c1a+1;x0*y0h;33
;c1a means column 1, row a (partial product to be added later)

    ldy y1
    ;sec  ;notice that the high byte of sub above is always +ve
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y
    sta c1b+1;x0*y1l
    lda (p_sqr_hi),y
    sbc (p_invsqr_hi),y
    tay ;Y=c2a;x0*y1h; 29 cycles

;17+33+29=79 cycles for main multiply part
do_adds:
;-add the first two numbers of column 1
        clc
c1a:	lda #0
c1b:	adc #0  ;A=z1  6 cycles

;-continue to column 2
        bcc fin   ;2+
        iny  ;add carry
        ;Y=z2, 4-5 cycles (avoiding page boundary cross)
;6+(4+)=10-11 cycles for adder
;total 89-90 cycles
fin:    rts

;Diagram of the additions for 16x8=24
;                y1    y0
;             x        x0
;                --------
;             x0y0h x0y0l
;+      x0y1h x0y1l
;------------------------
;          z2    z1    z0 

;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c  (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c  (*)
;   OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
;   suggests because you would lose the 2 cycles you gained with the JMP, BUT..
;   ..you could put the ENTIRE ROUTINE in zero page, -2b2c
;   AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
;   with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
;   extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
;   (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
2021-03-02 10:47
Krill

Registered: Apr 2002
Posts: 2804
Not quite on topic (sorry), but i often find myself needing a "downscaling MUL": 8x16=16, with the 8-bit factor scaling the 16-bit argument by [0..1). (Basically, the lowmost 8 bits of the 24-bit result are discarded.)

Then i re-invent the wheel every time, but it seems to get rounder with each iteration. =)

What would the required resources (cycles, memory) for this case in your optimised routine be?
2021-03-03 07:10
Oswald

Registered: Apr 2002
Posts: 5007
its there, he write comment out sta z0 if you only need 16 bits, "experts" could comment out the whole calc of z0 its just 1 bit precision loss out of 16, depends on what you use it for if that 1 bit is needed or not.
Previous - 1 | ... | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | ... | 16 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
jmin
katon/Lepsi De
iceout/Avatar/HF
Enforcer/Deers
kbs/Pht/Lxt
Didi/Laxity
Båtsman/Y-front
csabanw
Guests online: 353
Top Demos
1 Next Level  (9.8)
2 Mojo  (9.7)
3 Coma Light 13  (9.7)
4 Edge of Disgrace  (9.6)
5 No Bounds  (9.6)
6 Comaland 100%  (9.6)
7 Uncensored  (9.6)
8 The Ghost  (9.6)
9 Wonderland XIV  (9.6)
10 Bromance  (9.6)
Top onefile Demos
1 Party Elk 2  (9.7)
2 Cubic Dream  (9.6)
3 Copper Booze  (9.5)
4 Rainbow Connection  (9.5)
5 TRSAC, Gabber & Pebe..  (9.5)
6 Onscreen 5k  (9.5)
7 Dawnfall V1.1  (9.5)
8 Quadrants  (9.5)
9 Daah, Those Acid Pil..  (9.5)
10 Birth of a Flower  (9.5)
Top Groups
1 Booze Design  (9.3)
2 Nostalgia  (9.3)
3 Oxyron  (9.3)
4 Censor Design  (9.3)
5 Crest  (9.3)
Top Original Suppliers
1 Derbyshire Ram  (9.5)
2 Black Beard  (9.4)
3 hedning  (9.2)
4 Baracuda  (9.1)
5 Irata  (8.5)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.197 sec.