[CSDb] - User Forums - Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers when the multiply routine's time depends on the bits of the multiplier. With a standard shift-and-add routine, each 1 bit in the multiplier usually costs a "bit" more time for that bit.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, you average half 1 bits. In my case, that becomes the worst case, and there are about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by pre- and post-processing the results. The improvement is about 20%.
Another speedup is optimizing for the same multiplier applied to multiple multiplicands. This saves a little by processing the multiplier bits only once, and can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
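
The post does not say which transformations are used. As one hypothetical illustration of the general idea (an assumption, not necessarily Repose's method), complementing the multiplier whenever more than half of its bits are set caps the number of adds at half the bit width, at the cost of a fixed correction afterwards. The function names are made up for this sketch, which is a Python model rather than 6502 code.

```python
def shift_add_mul(x, m, bits=8):
    """Plain shift-and-add. Returns (product, number of adds performed)."""
    acc = 0
    adds = 0
    for i in range(bits):
        if (m >> i) & 1:          # only 1 bits cost an add
            acc += x << i
            adds += 1
    return acc, adds


def complemented_mul(x, m, bits=8):
    """Hypothetical pre/post-processing: multiply by the complement of m
    when m has more 1 bits than 0 bits, then correct the result."""
    mask = (1 << bits) - 1
    if bin(m).count("1") * 2 <= bits:
        return shift_add_mul(x, m, bits)
    inv = m ^ mask                # inv = mask - m, has fewer 1 bits than m
    partial, adds = shift_add_mul(x, inv, bits)
    # x*m = x*(mask - inv) = (x << bits) - x - x*inv
    return (x << bits) - x - partial, adds
```

With this transform alone, an 8-bit multiplier never needs more than 4 adds, so "half 1 bits" becomes the worst case rather than the average.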

... 144 posts hidden ...

2021-03-02 08:14

Strobe

Registered: Jun 2007
Posts: 6

Thanks Repose, ChristopherJam etc for a great 16x16 routine!

I only really needed an 8x16=24 multiply, so I've been hacking Repose's one down to what I settled on below.
Uses the table & ZP setup of the original.
At ~89 cycles (?) and 54 bytes it's twice as fast and half the size of the 16x16 one. Also doesn't touch X.
See my thoughts on tweaks in the comments at the end, but I'd like to get it faster and feel there are some obvious improvements staring me in the face, which I'll be happy for others to point out :)

;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
;set multiplier as x0
    lda x0              ;comment out and call with A=x0 -2b3c
    sta p_sqr_lo
    sta p_sqr_hi
    eor #$ff
    sta p_invsqr_lo
    sta p_invsqr_hi;17

    ldy y0              ;comment out and call with Y=y0 -2b3c
    sec
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y;note these two lines taken as 11 total
    sta z0;x0*y0l       comment out if you don't care about z0, -2b3c, OR
    lda (p_sqr_hi),y    ; ..replace with TAX -1b1c (but destroys X obviously)
    sbc (p_invsqr_hi),y
    sta c1a+1;x0*y0h;33
;c1a means column 1, row a (partial product to be added later)

    ldy y1
    ;sec  ;notice that the high byte of sub above is always +ve
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y
    sta c1b+1;x0*y1l
    lda (p_sqr_hi),y
    sbc (p_invsqr_hi),y
    tay ;Y=c2a;x0*y1h; 29 cycles

;17+33+29=79 cycles for main multiply part
do_adds:
;-add the first two numbers of column 1
        clc
c1a:	lda #0
c1b:	adc #0  ;A=z1  6 cycles

;-continue to column 2
        bcc fin   ;2+
        iny  ;add carry
        ;Y=z2, 4-5 cycles (avoiding page boundary cross)
;6+(4+)=10-11 cycles for adder
;total 89-90 cycles
fin:    rts

;Diagram of the additions for 16x8=24
;                y1    y0
;             x        x0
;                --------
;             x0y0h x0y0l
;+      x0y1h x0y1l
;------------------------
;          z2    z1    z0 

;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c  (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c  (*)
;   OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
;   suggests because you would lose the 2 cycles you gained with the JMP, BUT..
;   ..you could put the ENTIRE ROUTINE in zero page, -2b2c
;   AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
;   with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
;   extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
;   (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
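
For readers who want to check the arithmetic without an assembler, here is a rough Python model of what the routine above computes. It assumes the tables hold sqr(n)=n^2/4 and (255-n)^2/4 (rounded down) for n=0..510, as in the table generator posted later in this thread; `mul8` and `umult8x16_model` are names invented for this sketch.

```python
# Quarter-square tables, 511 entries each (index range 0..510)
SQR = [n * n // 4 for n in range(511)]
NEGSQR = [(255 - n) ** 2 // 4 for n in range(511)]


def mul8(x, y):
    """8x8=16 via x*y = (x+y)^2/4 - (x-y)^2/4.
    The second index mirrors the EOR #$ff pointer trick: (x ^ 0xFF) + y = 255 - x + y."""
    return SQR[x + y] - NEGSQR[(x ^ 0xFF) + y]


def umult8x16_model(x0, y):
    """Returns (z2, z1, z0), the three result bytes of x0 * y."""
    y0, y1 = y & 0xFF, y >> 8
    p0 = mul8(x0, y0)                  # x0*y0
    p1 = mul8(x0, y1)                  # x0*y1
    z0 = p0 & 0xFF
    col1 = (p0 >> 8) + (p1 & 0xFF)     # c1a + c1b in the listing
    z1 = col1 & 0xFF
    z2 = (p1 >> 8) + (col1 >> 8)       # high byte plus carry (the INY)
    return z2, z1, z0
```

The two floor divisions cancel exactly because x+y and x-y always have the same parity.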

2021-03-02 10:47

Krill

Registered: Apr 2002
Posts: 2845

Not quite on topic (sorry), but i often find myself needing a "downscaling MUL": 8x16=16, with the 8-bit factor scaling the 16-bit argument by [0..1). (Basically, the lowest 8 bits of the 24-bit result are discarded.)

Then i re-invent the wheel every time, but it seems to get rounder with each iteration. =)

What would the required resources (cycles, memory) for this case in your optimised routine be?

2021-03-03 07:10

Oswald

Registered: Apr 2002
Posts: 5017

It's there: he writes to comment out "sta z0" if you only need 16 bits. "Experts" could comment out the whole calculation of z0; it's just 1 bit of precision lost out of 16, and whether that bit is needed depends on what you use it for.
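
A small Python model of the downscaling case, under one reading of the remark above: dropping z0 gives exactly floor(x0*y/256), while also skipping the low-byte subtraction of x0*y0 (so its borrow never reaches the high byte) gives a result that is at most 1 too high. The table contents and the function names are assumptions made for this sketch.

```python
SQR = [n * n // 4 for n in range(511)]
NEGSQR = [(255 - n) ** 2 // 4 for n in range(511)]


def scale_exact(x0, y):
    """8x16=16: the 24-bit product with the lowest byte discarded."""
    return (x0 * y) >> 8


def scale_no_low_byte(x0, y):
    """Same, but the low byte of x0*y0 is never computed, so the
    borrow it would have produced is ignored."""
    y0, y1 = y & 0xFF, y >> 8
    # high bytes of the table entries only, subtracted without borrow
    hi0 = (SQR[x0 + y0] >> 8) - (NEGSQR[(x0 ^ 0xFF) + y0] >> 8)
    # x0*y1 is still computed in full
    p1 = SQR[x0 + y1] - NEGSQR[(x0 ^ 0xFF) + y1]
    return p1 + hi0
```

In this model the difference between the two functions is always 0 or 1.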

2021-11-28 15:19

Repose

Registered: Oct 2010
Posts: 222

I'm back! What a coincidence, I decided to work on this again. Changelog:
-Using advanced ACME assembler
-Macros for calling/return conventions, lowered precision
-Improved label names, formatting/style, comments
-Proper test suite
-small unsigned 8x8=16, fast unsigned 8x8=16, fast unsigned 16x16=32. Will add fast unsigned 8x16=24, fast unsigned 8x16=16
-All routines separated to library, include what you want with options you want

Currently working on BASIC multiply replacement, just for fun. How fast will a FOR/NEXT loop be?

2021-11-28 15:23

Repose

Registered: Oct 2010
Posts: 222

This is beta, needs cleaning up, but should work. It's down to 187 cycles apparently. This isn't the more flexible library version, just what I have that works right now.

; unsigned integer multiplication library

x0=$fb
x1=$fc
y0=$fd
y1=$fe
p_sqr_lo=$8b ;2 bytes
p_sqr_hi=$8d ;2 bytes
p_negsqr_lo=$a3 ;2 bytes
p_negsqr_hi=$a5 ;2 bytes

sqrlo=$c000 ;511 bytes
sqrhi=$c200 ;511 bytes
negsqrlo=$c400 ;511 bytes
negsqrhi=$c600 ;511 bytes

!macro mult8_snippet mand,low,high {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, high
;note: it's up to the caller to SEC, as some can be left out
ldy mand
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
sta high ;multiplier * Y, high byte
;11+4+11+4 = 30 cycles
}

!macro mult8_reg_snippet mand,low,reg {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, register
;note: it's up to the caller to SEC, as some can be left out
!if mand="Y" {
} else {
ldy mand
}
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
;multiplier * Y, high byte
!if reg="X" {
tax
} else if reg="Y" {
tay
}
;11+4+11+2 = 28 cycles
}

!macro mult8_set_mier_snippet mier {
;set multiplier as mier
lda mier
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_negsqr_lo
sta p_negsqr_hi ;17 cycles
}

makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta p_sqr_lo+1
lda #>sqrhi
sta p_sqr_hi+1
lda #>negsqrlo
sta p_negsqr_lo+1
lda #>negsqrhi
sta p_negsqr_hi+1

;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
lb1: tya
adc #$00
ml1: sta sqrhi,x
tay
cmp #$40
txa
ror
ml9: adc #$00
sta ml9+1
inx
ml0: sta sqrlo,x
bne lb1
inc ml0+2
inc ml1+2
clc
iny
bne lb1

;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
mt1:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne mt1
rts

umult16:
;unsigned 16x16 mult
;inputs:
; x0 x1
; y0 y1
;outputs:
; z0 z1 A Y
;set multiplier as x0
+mult8_set_mier_snippet x0

;z0 = low(x0*y0)
;X = high(x0*y0)
sec
+mult8_reg_snippet y0,z0,"X"

;x0_y1l = low(x0*y1)
;x0_y1h = high(x0*y1)
+mult8_snippet y1,x0_y1l+1,x0_y1h+1

;set multiplier as x1
+mult8_set_mier_snippet x1
;x1_y0l = low(x1*y0)
;x1_y0h = high(x1*y0)
+mult8_snippet y0,x1_y0l+1,x1_y0h+1

;x1_y1l = low(x1*y1)
;Y = high(x1*y1)
+mult8_reg_snippet y1,x1_y1l+1,"Y"

;17+2+28+30+17+30+28 = 152 cycles for main multiply part

;add partials

;add the first two numbers of column 1
clc
x0_y0h: txa
x0_y1l: adc #0
sta z1 ;9 cycles (z1 zp)

;continue to first two numbers of column 2
x0_y1h: lda #0 ;includes carry from previous column
x1_y0h: adc #0
tax ;X=z2 so far, 6 cycles
bcc x1_y0l ;9/12 avg 10.5 cycles
;Y=x1*y1h
iny ;x1y1h/z3 ++
clc

;add last number of column 1
x1_y0l: lda #0
adc z1 ;contains x0y0h + x0y1l
sta z1 ;8 cycles

;add last number of column 2
txa ;contains x0y1h + x1y0h
x1_y1l: adc #0
;A=z2
bcc fin ;7/8 avg 7.5
iny ;z3++
;Y=z3

;Y=z3, A=z2
;9+10.5+8+7.5=35
;total = 152+35 = 187 cycles
sta z2 ; needed calling conventions for test routine (note: skipped when the bcc fin above is taken)
tya
fin: rts
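
As a cross-check of the listing above, here is a Python model of the same data flow: split lo/hi tables, a two-byte subtract with borrow for each 8x8 product, and the column adds for the four partial products. It sums each column in one go instead of following the exact add order of the assembly, and the names `make_tables`, `mul8` and `umult16_model` are invented for this sketch.

```python
def make_tables():
    """sqr(n)=n^2/4 and negsqr(n)=(255-n)^2/4 for n=0..510, split into bytes."""
    sqr = [n * n // 4 for n in range(511)]
    neg = [(255 - n) ** 2 // 4 for n in range(511)]

    def split(table):
        return [v & 0xFF for v in table], [v >> 8 for v in table]

    return split(sqr) + split(neg)


SQRLO, SQRHI, NEGLO, NEGHI = make_tables()


def mul8(mier, mand):
    """Models SEC / LDA (p_sqr_lo),y / SBC (p_negsqr_lo),y / LDA hi / SBC hi."""
    i = mier + mand                 # index into the sqr tables
    j = (mier ^ 0xFF) + mand        # index into the negsqr tables
    lo = SQRLO[i] - NEGLO[j]
    borrow = 1 if lo < 0 else 0
    hi = SQRHI[i] - NEGHI[j] - borrow
    return lo & 0xFF, hi & 0xFF


def umult16_model(x, y):
    """Returns (z3, z2, z1, z0), the four result bytes of x * y."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    x0y0l, x0y0h = mul8(x0, y0)
    x0y1l, x0y1h = mul8(x0, y1)
    x1y0l, x1y0h = mul8(x1, y0)
    x1y1l, x1y1h = mul8(x1, y1)
    z0 = x0y0l
    col1 = x0y0h + x0y1l + x1y0l                   # column 1
    z1 = col1 & 0xFF
    col2 = x0y1h + x1y0h + x1y1l + (col1 >> 8)     # column 2 plus carries
    z2 = col2 & 0xFF
    z3 = x1y1h + (col2 >> 8)
    return z3, z2, z1, z0
```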

2021-11-28 19:58

JackAsser

Registered: Jun 2002
Posts: 1989

Great stuff Repose. How convenient it is that I’m working on 3D stuff now… ;)

2021-11-30 18:23

Quiss

Registered: Nov 2016
Posts: 37

Nice work! I like the use of (ind), y for the fast inlined multiply, instead of the 4 x "sta zp+x"; "jsr zp" that I've seen some demos use.

Also, JackAsser working on 3D stuff makes me nervous. I still haven't digested the hidden line vectors in 1991.

2021-12-10 16:56

Martin Piper

Registered: Nov 2007
Posts: 634

This thread is an interesting coincidence as I've been thinking about C64 3D again. :)

2021-12-14 10:29

Repose

Registered: Oct 2010
Posts: 222

I wrote an alpha 32x32=64 unsigned; it's 1024 cycles, but if you inline the jsr umult16 calls, it should be 930 cycles. Now I'm exploring not using umult16 at all, and instead treating it as a large set of 8-bit multiplies; the point is to optimize setting the multiplier (the high bytes of f(x)). I'm also playing with the "zip" adds (where I add from right to left, then go back to the right for the next row), and optimizing register use, basically tax:txa for saving your spot in the beginning column.

The other news is, by using a gray code for the order of operations, I saved a ldy so my umult16 is 195 now.

I have even better ideas; using Knuth's equivalence I can do just 3xumult16.

con't...

2021-12-14 11:33

Repose

Registered: Oct 2010
Posts: 222

Knuth's idea

Given x=x1*256+x0, y=y1*256+y0

 y1 y0
*x1 x0

Calculation       Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1           0-FE01
m=y0*x0           0-FE01
n=l+m-k           0-1FC02

l+m               0-1FC02
l-k or m-k        +-FE01

x*y=l*65536+n*256+m (for 8 bit Xn/Yn)

Example (16x16 bits):
 y1 y0 ->$20 10
*x1 x0    40 30

k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00

llll0000 -> 08000000
 nnnnn00     00a0000
    mmmm        0300
----------------
2010*4030 = 080a0300

x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)

example with 16-bit values:
2000 1000
4000 3000
k=1000*1000=0100 0000
l=2000*4000=0800 0000
m=1000*3000=0300 0000
n=0A00 0000

800 0000 0000 0000
     a00 0000 0000
          300 0000
------------------
800 0a00 0300 0000

If multiplies are expensive, this is faster. Estimating 32x32=64:

3x umult16 ~585
adds        ~94
total      ~679
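
The identity above can be checked with a short Python sketch. `mul3` is a name invented here; `half_bits` selects 8-bit halves (16x16) or 16-bit halves (32x32). Note that k can be negative (the +-FE01 range in the table), so on the 6502 the third multiply would need some form of sign handling, which this model gets for free from Python's integers.

```python
def mul3(x, y, half_bits=8):
    """Multiply using three half-width products instead of four."""
    mask = (1 << half_bits) - 1
    x0, x1 = x & mask, x >> half_bits
    y0, y1 = y & mask, y >> half_bits
    k = (y1 - y0) * (x1 - x0)     # signed: either factor may be negative
    l = y1 * x1
    m = y0 * x0
    n = l + m - k                 # equals y1*x0 + y0*x1, never negative
    return (l << (2 * half_bits)) + (n << half_bits) + m
```

Both worked examples from the post come out as written: `mul3(0x2010, 0x4030)` gives `0x080A0300`, and `mul3(0x20001000, 0x40003000, half_bits=16)` gives `0x08000A0003000000`.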
