[CSDb] - User Forums - Fast large multiplies

Welcome to our latest new user Gabezcommodore ! (Registered 2025-07-24)

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose
Account closed

Registered: Oct 2010
Posts: 227

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 145 posts hidden. Click here to view all posts....

2017-03-29 13:46

ChristopherJam

Registered: Aug 2004
Posts: 1427

Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.

It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?)

2017-03-29 14:41

JackAsser

Registered: Jun 2002
Posts: 2038

Quote: Oh, sweet. I remember hammering that one out with Stephen Judd on IRC back in the day.

It was a long time ago now, but I think the difference of squares concept was one he introduced me to, I just helped optimise the implementation (I'm fairy sure the truncating divide by 4 was one of my contributions?)

Yeah maybe. I remember Graham did the implicit add-optimization to it. I.e. self-mod the highbyte with X and then use ,Y in the table lookup. That way the X+Y comes for "free"

2017-03-29 15:34

Repose
Account closed

Registered: Oct 2010
Posts: 227

That very idea I always thought was brilliant, because it made a useful calculation with the indexing at the same time as a lookup. As a result, this algorithm is faster than could ever be possible. I call it a 'super' algorithm.

And yes, Judd told me that at the time we were working on the C=Hacking article together.

Anyone remember that coding thread in comp.sys.cbm and later, the z80 vs 6502 'contest' ?

2017-03-29 15:48

Frantic

Registered: Mar 2003
Posts: 1661

Quote: Yes, that's the one I beat already.

Don't hesitate to write an article, or at least post some source code, at Codebase64. :) Would be much appreciated!

2017-04-07 22:12

ChristopherJam

Registered: Aug 2004
Posts: 1427

Repose, did you get around to completing a 32x32->64 routine? I'm trying out some of the ideas from this and the sets of add/sub threads; currently down to around 1300 cycles, though I've still got some optimising to go.

I'm using inx to track carries, but have yet to remove the CLCs after each branch.

2017-04-08 03:52

Repose
Account closed

Registered: Oct 2010
Posts: 227

Yes I wrote a 16x16 to test, and it takes 211 cycles (that's avg with equal branches). The one posted on codebase is 231 cycles. I am just testing it now but having some trouble finding a single step debugger that works for me, see thread.

The time to beat for 32x32 is at least 1300 as posted on codebase so you have to keep going...

I can derive the formula for my current version to estimate exact timing of 32x32 I'll have to get back to you in a bit.

2017-04-08 04:38

Repose
Account closed

Registered: Oct 2010
Posts: 227

Here's some untested code:

x0=$fb
x1=$fc
y0=$fd
y1=$fe
x0_sqr_lo=$8b;2 bytes
x0_sqr_hi=$8d
x0_negsqr_lo=$8f
x0_negsqr_hi=$91
x1_sqr_lo=$93;2 bytes
x1_sqr_hi=$95
x1_negsqr_lo=$97
x1_negsqr_hi=$99

sqrlo=$c000;510 bytes
sqrhi=$c200
negsqrlo=$c400
negsqrhi=$c600

umult16:
;init zp square tables pointers
lda x0
sta x0_sqr_lo
sta x0_sqr_hi
eor #$ff
sta x0_negsqr_lo
sta x0_negsqr_hi;17

lda x1
sta x1_sqr_lo
sta x1_sqr_hi
eor #$ff
sta x1_negsqr_lo
sta x1_negsqr_hi;17

ldx #0;start column 0
ldy y0

SEC
LDA (x0_sqr_lo),y
SBC (x0_negsqr_lo),y
sta z0;x0*y0 lo, C=1

;start column 1
;Y=y0
clc
LDA (x0_sqr_hi),y;x0*y0 hi
ADC (x1_sqr_lo),y;+x1*y0 lo
bcc c1s1;8.5/11.5 avg
inx
clc
c1s1 sbc (x0_negsqr_hi),y;x0*y0 hi
bcc c1s2
dex
clc
c1s2 sbc (x1_negsqr_lo),y;-x1*y0 lo
bcc c1s3
dex
clc
c1s3 ldy y1
adc (x0_sqr_lo),y;x0*y1 lo
bcc c1s4
inx
clc
c1s4 SBC (x0_negsqr_lo),y;A=x0*y1 lo
bcc c1s5
dex
clc
;end of column 1
c1s5 sta z1;column 1

;start column 2
ldy y0
txa;carries from column 1
ldx #0;reset carries
clc
adc (x1_sqr_hi),y;+x1*y0 hi
bcc c2s1
inx
c2s1 sbc (x1_negsqr_hi),y;-x1*y0 hi
bcc c2s2
dex
clc
c2s2 ldy y1
adc (x0_sqr_hi),y;+x0*y1 hi
bcc c2s3
inx
clc
c2s3 adc (x1_sqr_lo),y;+x1*y1 lo
bcc c2s4
inx
clc
c2s4 sbc (x0_negsqr_hi),y;-x0*y1 hi
bcc c2s5
dex
clc
c2s5 sbc (x1_negsqr_lo),y;-x1*y1 lo
bcc c2s6
dex
clc
c2s6 sta z2;column 2

;start column 3
;Y=y1
txa;carries from column 2
clc
adc (x1_sqr_hi),y;+x1*y1 hi
sbc (x1_negsqr_hi),y;-x1*y1 hi
;shouldn't be any carries in the msb
sta z3;column 3

rts

makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta x0_sqr_lo+1
sta x1_sqr_lo+1
lda #>sqrhi
sta x0_sqr_hi+1
sta x1_sqr_hi+1
lda #>negsqrlo
sta x0_negsqr_lo+1
sta x1_negsqr_lo+1
lda #>negsqrhi
sta x0_negsqr_hi+1
sta x1_negsqr_hi+1

;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9   ; CMP #immediate - skip TYA and clear carry flag
      
makesqrtables_loop1:  
tya
adc #$00
makesqrtables_sm1:  
sta sqrhi,x
tay
cmp #$40
txa
ror
makesqrtables_sm2:
adc #$00
sta makesqrtables_sm2+1
inx
makesqrtables_sm3:
sta sqrlo,x
bne makesqrtables_loop1
inc makesqrtables_sm3+2
inc makesqrtables_sm1+2
clc
iny
bne makesqrtables_loop1
      
;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
      
maketables_loop2:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne maketables_loop2:
rts

2017-04-08 04:43

Repose
Account closed

Registered: Oct 2010
Posts: 227

Post correction would be slower on the 16x16, not a long enough run of add/sub.

2017-04-08 05:00

Repose
Account closed

Registered: Oct 2010
Posts: 227

Partials cheatsheet

                        y1  y0
                        x1  x0
                        ------
                 x0*y0h x0*y0l
          x1*y0h x1*y0l
          x0*y1h x0*y1l
   x1*y1h x1*y1l

----------------
24x24bits

                    x2     x1    x0
                    y2     y1    y0
                    ---------------
                        y0x0h y0x0l
                  y0x1h y0x1l
            y0x2h y0x2l
                  y1x0h y1x0l
            y1x1h y1x1l
      y1x2h y1x2l
            y2x0h y2x0l
      y2x1h y2x1l
y2x2h y2x2l

These facts are useful to estimating the time of any size calc:

Number of columns is 2*n, n is bytes of each number.
Rows of additions is like 1 3 3 1, 1 3 5 5 3 1, 1 3 5 7 7 5 3 1 for 16,24 and 32 bit mults, and each one being f(x)-g(x), so really double that number of add-subs.
The total add-subs is n^2*2. (each is about 10 cycles).
Number of times to change (or pointers to set) of the multiplier is n, then each one is used >n times with the multiplicand (ldy multiplicand), when doing in column order (tbd total).

2017-04-08 05:44

Repose
Account closed

Registered: Oct 2010
Posts: 227

Changes in multiplicand is 2*n (in my case, ldy y(x) ).
eg ldy y0
...
ldy y1
...
ldy y0
...
etc.

Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | ... | 16 | 17 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

Andy/AEG
Tim/Silicon Limited
iAN CooG/HVSC
dev0
ΛΛdZ
grass/LETHARGY
MagerValp/G★P
Acidchild/Padua
Twilight/Excess/Arcade
zzarko/Avatar
Sulevi/Virtual Dreams
YTM/Elysium
Guests online: 28

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Harminc  (9.7)
4 Codeboys & Endians  (9.7)
5 Mojo  (9.7)
6 Coma Light 13  (9.6)
7 Edge of Disgrace  (9.6)
8 Comaland 100%  (9.6)
9 Signal Carnival  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 Morphing Starball  (9.9)
2 Nine  (9.8)
3 Layers  (9.6)
4 Cubic Dream  (9.6)
5 Party Elk 2  (9.6)
6 Copper Booze  (9.5)
7 Scan and Spin  (9.5)
8 Onscreen 5k  (9.5)
9 Dawnfall V1.1  (9.5)
10 Grey  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Censor Design  (9.3)
4 Oxyron  (9.3)
5 Artline Designs  (9.3)

Top Cover Designers

1 Duce  (9.8)
2 Electric  (9.8)
3 Junkie  (9.6)
4 The Elegance  (9.5)
5 Mermaid  (9.3)

Page generated in: 0.047 sec.