CSDb User Forums


2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 225
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the multiplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time for that bit.
The method uses several ways of transforming the input to have fewer 1 bits. Normally, if every value appears equally often, half the bits are 1 on average. In my case, that becomes the worst case, and the average is about a quarter 1 bits. This can speed up any routine, even the one that happens to be in ROM, by using pre- and post-processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little by processing the multiplier bits only once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?
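
The post does not say which transforms are used. As a rough illustration only, here is a Python sketch of one candidate (an assumption, not the author's method): when the multiplier has more than four 1 bits, multiply by its complement and correct afterwards. The names `shift_add` and `complemented` are made up for this sketch.

```python
def shift_add(a, b):
    """Classic shift-and-add multiply. Returns (product, number of adds)."""
    product, adds = 0, 0
    while b:
        if b & 1:          # each 1 bit in the multiplier costs one add
            product += a
            adds += 1
        a <<= 1
        b >>= 1
    return product, adds


def complemented(a, b):
    """Assumed transform: use the complement of b when b has more than four 1 bits."""
    if bin(b).count("1") <= 4:
        return shift_add(a, b)
    partial, adds = shift_add(a, b ^ 0xFF)
    # b + (b ^ 0xFF) == 255, so a*b == a*256 - a - a*(b ^ 0xFF)
    return (a << 8) - a - partial, adds


# Average number of adds over all 8-bit multipliers
plain = sum(shift_add(1, b)[1] for b in range(256)) / 256
reduced = sum(complemented(1, b)[1] for b in range(256)) / 256
print(plain, reduced)
```

With this single transform the worst case drops to four adds, matching the "half becomes the worst case" remark. The average only falls from 4 to about 2.9, so reaching a quarter would need further transforms, and the post-processing subtractions have a cost of their own.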

 
... 144 posts hidden ...
 
2021-11-28 15:19
Repose

Registered: Oct 2010
Posts: 225
I'm back! What a coincidence, I decided to work on this again. Changelog:
-Using advanced ACME assembler
-Macros for calling/return conventions, lowered precision
-Improved label names, formatting/style, comments
-Proper test suite
-small unsigned 8x8=16, fast unsigned 8x8=16, fast unsigned 16x16=32. Will add fast unsigned 8x16=24, fast unsigned 8x16=16
-All routines separated to library, include what you want with options you want

Currently working on BASIC multiply replacement, just for fun. How fast will a FOR/NEXT loop be?
2021-11-28 15:23
Repose

Registered: Oct 2010
Posts: 225
This is beta, needs cleaning up, but should work. It's down to 187 cycles apparently. This isn't the more flexible library version, just what I have that works right now.

; unsigned integer multiplication library

x0=$fb
x1=$fc
y0=$fd
y1=$fe
p_sqr_lo=$8b ;2 bytes
p_sqr_hi=$8d ;2 bytes
p_negsqr_lo=$a3 ;2 bytes
p_negsqr_hi=$a5 ;2 bytes

sqrlo=$c000 ;511 bytes
sqrhi=$c200 ;511 bytes
negsqrlo=$c400 ;511 bytes
negsqrhi=$c600 ;511 bytes

!macro mult8_snippet mand,low,high {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, high
;note: it's up to the caller to SEC, as some can be left out
ldy mand
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
sta high ;multiplier * Y, high byte
;11+4+11+4 = 30 cycles
}

!macro mult8_reg_snippet mand,low,reg {
;with a multiplier stored in the p_sqr* and p_negsqr* pointers,
;a multiplicand in mand,
;and suitable squares tables,
;multiply unsigned two 8 bit numbers
;and store the 16 bit result in low, register
;note: it's up to the caller to SEC, as some can be left out
!if mand="Y" {
} else {
ldy mand
}
lda (p_sqr_lo),y
sbc (p_negsqr_lo),y ;11 cycles
sta low ;multiplier * Y, low byte, 4 cycles
lda (p_sqr_hi),y
sbc (p_negsqr_hi),y
;multiplier * Y, high byte
!if reg="X" {
tax
} else if reg="Y" {
tay
}
;11+4+11+2 = 28 cycles
}

!macro mult8_set_mier_snippet mier {
;set multiplier as mier
lda mier
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_negsqr_lo
sta p_negsqr_hi ;17 cycles
}

makesqrtables:
;init zp square tables pointers
lda #>sqrlo
sta p_sqr_lo+1
lda #>sqrhi
sta p_sqr_hi+1
lda #>negsqrlo
sta p_negsqr_lo+1
lda #>negsqrhi
sta p_negsqr_hi+1

;generate sqr(x)=x^2/4
ldx #$00
txa
!by $c9 ; CMP #immediate - skip TYA and clear carry flag
lb1: tya
adc #$00
ml1: sta sqrhi,x
tay
cmp #$40
txa
ror
ml9: adc #$00
sta ml9+1
inx
ml0: sta sqrlo,x
bne lb1
inc ml0+2
inc ml1+2
clc
iny
bne lb1

;generate negsqr(x)=(255-x)^2/4
ldx #$00
ldy #$ff
mt1:
lda sqrhi+1,x
sta negsqrhi+$100,x
lda sqrhi,x
sta negsqrhi,y
lda sqrlo+1,x
sta negsqrlo+$100,x
lda sqrlo,x
sta negsqrlo,y
dey
inx
bne mt1
rts

umult16:
;unsigned 16x16 mult
;inputs:
; x0 x1
; y0 y1
;outputs:
; z0 z1 A Y
;set multiplier as x0
+mult8_set_mier_snippet x0

;z0 = low(x0*y0)
;X = high(x0*y0)
sec
+mult8_reg_snippet y0,z0,"X"

;x0_y1l = low(x0*y1)
;x0_y1h = high(x0*y1)
+mult8_snippet y1,x0_y1l+1,x0_y1h+1

;set multiplier as x1
+mult8_set_mier_snippet x1
;x1_y0l = low(x1*y0)
;x1_y0h = high(x1*y0)
+mult8_snippet y0,x1_y0l+1,x1_y0h+1

;x1_y1l = low(x1*y1)
;Y = high(x1*y1)
+mult8_reg_snippet y1,x1_y1l+1,"Y"

;17+2+28+30+17+30+28 = 152 cycles for main multiply part

;add partials

;add the first two numbers of column 1
clc
x0_y0h: txa
x0_y1l: adc #0
sta z1 ;9 cycles (z1 zp)

;continue to first two numbers of column 2
x0_y1h: lda #0 ;includes carry from previous column
x1_y0h: adc #0
tax ;X=z2 so far, 6 cycles
bcc x1_y0l ;9/12 avg 10.5 cycles
;Y=x1*y1h
iny ;x1y1h/z3 ++
clc

;add last number of column 1
x1_y0l: lda #0
adc z1 ;contains x0y0h + x0y1l
sta z1 ;8 cycles

;add last number of column 2
txa ;contains x0y1h + x1y0h
x1_y1l: adc #0
;A=z2
bcc fin ;7/8 avg 7.5
iny ;z3++
;Y=z3

;Y=z3, A=z2
;9+10.5+8+7.5=35
;total = 152+35 = 187 cycles
fin: sta z2 ;store z2 on both paths (test routine calling convention)
tya ;returns A=Y=z3
rts
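
For readers who want to check the arithmetic without an assembler, here is a Python model of what the listing computes. It is a sketch under assumptions: it works on whole integers instead of separate low/high byte tables, so it ignores the carry handling between the two SBCs, and the Python function names simply mirror the labels above.

```python
# sqr[i] = floor(i^2 / 4) for i = 0..510 (511 entries, as in the listing)
SQR = [(i * i) // 4 for i in range(511)]
# negsqr[i] = floor((255 - i)^2 / 4) over the same index range
NEGSQR = [((255 - i) ** 2) // 4 for i in range(511)]


def mult8(mier, mand):
    """Model of mult8_snippet: the pointers hold mier and mier EOR $ff, Y holds mand."""
    # SQR[mier + mand] is (a+b)^2/4; NEGSQR[(255 - mier) + mand] is (a-b)^2/4
    return SQR[mier + mand] - NEGSQR[(mier ^ 0xFF) + mand]


def umult16(x, y):
    """Model of umult16: four 8x8 partial products added by column."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    return (mult8(x0, y0)
            + ((mult8(x0, y1) + mult8(x1, y0)) << 8)
            + (mult8(x1, y1) << 16))


print(hex(umult16(0x2010, 0x4030)))
```

The floor divisions are harmless because a+b and a-b always have the same parity, so the two discarded fractions cancel.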
2021-11-28 19:58
JackAsser

Registered: Jun 2002
Posts: 2014
Great stuff Repose. How convenient it is that I’m working on 3D stuff now… ;)
2021-11-30 18:23
Quiss

Registered: Nov 2016
Posts: 43
Nice work! I like the use of (ind), y for the fast inlined multiply, instead of the 4 x "sta zp+x"; "jsr zp" that I've seen some demos use.

Also, JackAsser working on 3D stuff makes me nervous. I still haven't digested the hidden line vectors in 1991.
2021-12-10 16:56
Martin Piper

Registered: Nov 2007
Posts: 722
This thread is an interesting coincidence as I've been thinking about C64 3D again. :)
2021-12-14 10:29
Repose

Registered: Oct 2010
Posts: 225
I wrote an alpha 32x32=64 unsigned, it's 1024 cycles, but if you inline the jsr umult16 calls, it should be 930 cycles. Now I'm exploring not using umult16 at all, but instead treating it as a large set of 8-bit multiplies; the point is to optimize setting the multiplier (the high bytes of f(x)). I am also playing with the "zip" adds (where I add from right to left then go back to the right, next row), and optimizing register use, basically tax:txa for saving your spot in the beginning column.

The other news is, by using a gray code for the order of operations, I saved a ldy so my umult16 is 195 now.

I have even better ideas; using Knuth's equivalence I can do just 3xumult16.

con't...
2021-12-14 11:33
Repose

Registered: Oct 2010
Posts: 225
Knuth's idea

Given x=x1*256+x0, y=y1*256+y0
 y1 y0
*x1 x0

Calculation       Range
k=(y1-y0)*(x1-x0) +-FE01
l=y1*x1           0-FE01
m=y0*x0           0-FE01
n=l+m-k           0-1FC02

l+m               0-1FC02
l-k or m-k        +-FE01

x*y=l*65536+n*256+m (for 8 bit Xn/Yn)

Example (16x16 bits):
 y1 y0 ->$20 10
*x1 x0    40 30

k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00

llll0000 -> 08000000
 nnnnn00     00a0000
    mmmm        0300
----------------
2010*4030 = 080a0300

x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)

example with 16-bit values:
2000 1000
4000 3000
k=1000*1000=0100 0000
l=2000*4000=0800 0000
m=1000*3000=0300 0000
n=0A00 0000

800 0000 0000 0000
     a00 0000 0000
          300 0000
------------------
800 0a00 0300 0000

If multiplies are expensive, this is faster. Estimating 32x32=64;

3x umult16 ~585
adds        ~94
total      ~679
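
The identity above can be checked with a short Python sketch. The function name `knuth_mul` and the `bits` parameter (width of each half) are made up for illustration; the formulas are the ones from the post.

```python
def knuth_mul(x, y, bits=8):
    """Multiply two (2*bits)-bit numbers with three half-width multiplies."""
    mask = (1 << bits) - 1
    x0, x1 = x & mask, x >> bits
    y0, y1 = y & mask, y >> bits
    k = (y1 - y0) * (x1 - x0)   # signed: either difference may be negative
    l = y1 * x1
    m = y0 * x0
    n = l + m - k               # equals y1*x0 + y0*x1, the middle column
    return (l << (2 * bits)) + (n << bits) + m


print(hex(knuth_mul(0x4030, 0x2010)))                   # 16x16 example
print(hex(knuth_mul(0x40003000, 0x20001000, bits=16)))  # 32x32 example
```

Note that k needs a sign, so a real 6502 version has to handle the sign of both differences before using an unsigned multiply for it.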
2021-12-14 11:40
Repose

Registered: Oct 2010
Posts: 225
About gray codes, ref.
https://en.wikipedia.org/wiki/Gray_code#Constructing_an_n-bit_G..

The use of gray codes is to minimize setting the multiplier/multiplicand, i.e.
;set multiplier as mier
lda mier ;3
sta p_sqr_lo ;3 (often published as f(x))
sta p_sqr_hi ;3
eor #$ff ;2
sta p_negsqr_lo ;3 (also known as g(x))
sta p_negsqr_hi ;3; 3+3+3+2+3+3 = 17 cycles

and

;set multiplicand
ldy mand ;3

In the following table, order 01 would mean change y first. Across the top are the starting variables; i.e. 00 means start with setting x0 and y0 as m'ier and m'and.
	start                   xy
order	00	01	10	11
xy
01	00	01	10	11
	01	00	11	10
	11	10	01	00
	10	11	00	01

10	00	01	10	11
	10	11	00	01
	11	10	01	00
	01	00	11	10

The gray code sequence can be best understood by following these values in a circle; with a starting point and a direction of clockwise or counter-clockwise:
	01
      00  11
	10

there's two ways to end with the high bytes;
start with x0*y1 and change y first, or
start with x1*y0 and change x first. The reason I want to end with the high bytes is to return them in registers, and these bytes are most useful if you only want 16x16=16.

x0*y0
x0*y1 ;change y first i.e. ldy y1
x1*y1 ;change x with +set_mier x1
x1*y0

x0 y0
x1 y0 ;change x first
x1 y1
x0 y1

x0 y1
x0 y0 ;change y first
x1 y0
x1 y1

x0 y1
x1 y1 ;change x first
x1 y0
x0 y0

x1 y0
x1 y1 ;change y first
x0 y1
x0 y0

x1 y0
x0 y0 ;change x first
x0 y1
x1 y1

x1 y1
x1 y0 ;change y first
x0 y0
x0 y1

x1 y1
x0 y1 ;change x first
x0 y0
x1 y0

This is all assuming you are storing each partial result into a self-mod summing routine to come after, in which case the order of operations of the multiplies doesn't matter. If you were adding the result during the multiplies, you couldn't make use of this trick to minimize setting of the m'ier.
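
To put numbers on the orderings, here is a Python sketch that generates a gray-code order and adds up the setup cost, using the 17 cycles for setting the multiplier and 3 cycles for `ldy mand` quoted above. The helper names `gray_order` and `setup_cost` are made up, and the model assumes an operand is only reloaded when it changes.

```python
SET_MIER = 17  # cycles to set the multiplier pointers (from the listing above)
LDY_MAND = 3   # cycles for ldy mand


def gray_order(start, change_first):
    """Return the four (x, y) index pairs, flipping one operand per step."""
    x, y = start
    seq = [(x, y)]
    for which in ("yxy" if change_first == "y" else "xyx"):
        if which == "x":
            x ^= 1
        else:
            y ^= 1
        seq.append((x, y))
    return seq


def setup_cost(seq):
    """Cycles spent loading operands, reloading only the one that changed."""
    cost = SET_MIER + LDY_MAND
    for (px, py), (x, y) in zip(seq, seq[1:]):
        if x != px:
            cost += SET_MIER
        if y != py:
            cost += LDY_MAND
    return cost


print(setup_cost(gray_order((0, 0), "y")))            # change y first
print(setup_cost(gray_order((0, 0), "x")))            # change x first
print(setup_cost([(0, 0), (0, 1), (1, 0), (1, 1)]))   # order used in the beta listing
```

Under these assumptions, changing y first costs 43 cycles of setup against 46 for the order in the beta listing, which is the saved ldy; changing x first costs 57 because the expensive multiplier setup happens three times.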
2021-12-14 12:27
Repose

Registered: Oct 2010
Posts: 225
Survey of multiply methods/ideas

If I really want to explore every way that could be the fastest, here's a list of ideas:

identity mults
1 - sin(a)*cos(b)=(sin(a+b)+sin(a-b))/2
2 - cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
3 - sin(a)*sin(b)=(cos(a-b)-cos(a+b))/2
4 - a*b=exp(log(a)+log(b))
5 - a*b = [(a + b)² - (a - b)²]/4
6 - k=(y1-y0)*(x1-x0), l=y1*x1, m=y0*x0, n=l+m-k, x*y=l*65536+n*256+m

Karatsuba mult

y1 y0
x1 x0
-----
z2=x1*y1
z0=x0*y0
z1=(x1+x0)*(y1+y0)-z2-z0
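
A quick Python check of the Karatsuba split for 16x16, as a sketch only (the function name `karatsuba16` is made up). Unlike identity 6, the middle product here takes sums, so its operands can be 9 bits wide, which matters if the 8x8 multiply is table based.

```python
def karatsuba16(x, y):
    """16x16 multiply from three smaller multiplies (Karatsuba)."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    z2 = x1 * y1
    z0 = x0 * y0
    z1 = (x1 + x0) * (y1 + y0) - z2 - z0   # operands up to 510, i.e. 9 bits
    return (z2 << 16) + (z1 << 8) + z0


print(hex(karatsuba16(0x4030, 0x2010)))
```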

Booth multiply

Booth-2 is (n+1)=3 bits a time, overlapping by 1
which is how many?
aaa
  bbb
    ccc
      dd_

or 4

nybble mult
use a table of every a*b where a,b are 4-bit numbers.

residue number system
a set of modulo numbers can represent any number
you can add the residues separately without carry.
example of a (3,5) residue number system, supports 0-14:
n	x1	x2
0	0	0
1	1	1
2	2	2
3	0	3
4	1	4
5	2	0
6	0	1
7	1	2
8	2	3
9	0	4
10	1	0
11	2	1
12	0	2
13	1	3
14	2	4

you can pick better moduli; like 2^k and 2^k-1, like 256 and 255 which are co-prime. It encodes 65280 values.
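
A small Python sketch of the residue idea, including the conversion back via the Chinese Remainder Theorem. The function names are made up; it needs Python 3.8+ for `math.prod` and the modular inverse form of `pow`.

```python
from math import prod


def to_rns(n, moduli):
    """Represent n by its remainders modulo each modulus."""
    return tuple(n % m for m in moduli)


def rns_mul(a, b, moduli):
    """Multiply residue by residue; no carries pass between the channels."""
    return tuple((x * y) % m for x, y, m in zip(a, b, moduli))


def from_rns(residues, moduli):
    """Chinese Remainder Theorem reconstruction (moduli must be pairwise co-prime)."""
    big_m = prod(moduli)
    total = 0
    for r, m in zip(residues, moduli):
        mi = big_m // m
        total += r * mi * pow(mi, -1, m)   # pow(mi, -1, m) is the inverse of mi mod m
    return total % big_m


small = (3, 5)
print(to_rns(8, small))                       # row n=8 of the table above
wide = (256, 255)
product = rns_mul(to_rns(200, wide), to_rns(300, wide), wide)
print(from_rns(product, wide))                # 200*300, which fits below 65280
```

The catch for a multiply routine is the conversion back: the result is only correct while the true product stays below the product of the moduli.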

REU multiply
Address Description
------- -----------
$de00   256 bytes of data (See $dfc0-$dfc3 for more information)
$df7e   write to this location to activate the RAMLink hardware
$df7f   write to this location to deactivate the RAMLink hardware.
$dfa0   lo byte of requested RAMCard memory page
$dfa1   hi byte of requested RAMCard memory page
$dfc0   write to this location to show RL variable RAM at $de00 (default)
$dfc1   write to this location to show RAMCard memory at $de00
$dfc2   write to this location to show the RAM Port device $de00 page at $de00
$dfc3   write to this location to show Pass-Thru Port dev. $de00 page at $de00

;8bit mult, of x*y to z0;z1
rl_page=$de00
rl_sel_ramcard=$dfc1
rl_lo=$dfa0
rl_hi=$dfa1
rl_activate=$df7e
rl_deactivate=$df7f
sta rl_activate;4
stx rl_lo;4
lda #0;2
sta rl_hi;4
sta rl_sel_ramcard;4
lda rl_page,y;4 (index by the other operand in Y; $de00,y never crosses a page)
sta z0;3
inc rl_hi;is this r/w? 6
lda rl_page,y;4
sta z1;3
sta rl_deactivate;4
rts

total 42

sine mult example
cos(a)*cos(b)=(cos(a+b)+cos(a-b))/2
x	cos(x)
-pi/2	0
0	1
pi/2	0
pi	-1
3pi/2	0
2pi	1

example, .5 * .75
cos-1(.5)=1.0471975511965977461542144610932
cos-1(.75)=0.72273424781341561117837735264133
a+b=1.7699317990100133573325918137345
a-b=0.32446330338318213497583710845183
cos(a+b)=-0.197821961869480000823505899216
cos(a-b)=0.947821961869480000823505899216
cos(a+b)+cos(a-b)=.75
/2=.375
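
The same worked example in Python, as a sketch (the name `cos_mul` is made up). It only works for operands in [-1, 1], and a 6502 version would need arccos and cos tables in place of the math library.

```python
import math


def cos_mul(p, q):
    """Multiply p*q via cos(a)*cos(b) = (cos(a+b) + cos(a-b)) / 2."""
    a, b = math.acos(p), math.acos(q)
    return (math.cos(a + b) + math.cos(a - b)) / 2


print(cos_mul(0.5, 0.75))   # close to 0.375, up to floating point rounding
```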

Convolution mult
Lost my notes here - something starting with W... a lot of adds for a few mults.
https://en.wikipedia.org/wiki/Convolution#Fast_convolution_algo..

Neural net multiplication
not much of a gain

direct massive lookup tables
various low-level code optimizations
illegal opcodes
order of operations
branch-tree optimizations (keeping state in the code location, i.e. branch of every possible multiplier etc.)

Applications of multiplication
-fast way to arrange complex multiply
-fast matrix multiply
-using inverse tables to divide

Different types of number systems
-modular arithmetic, Chinese Remainder system
-storing numbers as primes
-keeping numbers as fractions
-2's complement, 1's complement, offset
2021-12-15 09:01
Martin Piper

Registered: Nov 2007
Posts: 722
I've been pondering a hardware multiply...
Write control byte: xy
y = bytes for operand 1
x = bytes for operand 2

Write bytes in LSB order for operand 1 then 2.

Then read bytes in LSB order for multiply result.

For example ($47 * $0123 = $0050b5):
Write hex: 21 47 23 01
Read: b5 50 00
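
A small Python model of this proposed interface, as a sketch only. Two things are assumptions: that x and y are the high and low nibbles of the control byte, and that bytes go least significant first, which is the reading that reproduces the example bytes ($47 * $0123 = $0050b5). The name `hw_multiply` is made up.

```python
def hw_multiply(written):
    """Model the proposed device: control byte, operand bytes in, result bytes out."""
    control = written[0]
    n2, n1 = control >> 4, control & 0x0F          # x = bytes in operand 2, y = bytes in operand 1
    op1 = int.from_bytes(bytes(written[1:1 + n1]), "little")
    op2 = int.from_bytes(bytes(written[1 + n1:1 + n1 + n2]), "little")
    # The result is as wide as both operands together
    return list((op1 * op2).to_bytes(n1 + n2, "little"))


print([hex(b) for b in hw_multiply([0x21, 0x47, 0x23, 0x01])])
```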
All times are CET.