Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in 
CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 21:45
Repose

Registered: Oct 2010
Posts: 129
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

 
... 77 posts hidden. Click here to view all posts....
 
2017-04-15 23:21
Repose

Registered: Oct 2010
Posts: 129
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.
2017-04-15 23:25
ChristopherJam

Registered: Aug 2004
Posts: 581
My current best:
mul1616
    lda mT1+ 0      ; 3
    sta zp_fl0      ; 3
    sta zp_fh0      ; 3
    eor#255         ; 2
    sta zp_gl0      ; 3
    sta zp_gh0      ; 3

    lda mT1+ 1      ; 3
    sta zp_fl1      ; 3
    sta zp_fh1      ; 3
    eor#255         ; 2
    sta zp_gl1      ; 3
    sta zp_gh1      ; 3

    clc             ; 2
    ldy mT2+0       ; 3
    lda (zp_fl0),y  ; 5.5
    adc (zp_gl0),y  ; 5.5
    sta mRes+0      ; 3

    ldx#0           ; 2
    lda (zp_fh0),y  ; 5.5
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+1       ; 3
    adc (zp_fl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_3f,x  ; 4
    sta mRes+1      ; 3

    txa             ; 2
    ldx#$bf         ; 2
    adc (zp_fh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh0),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_fl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gl1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    ldy mT2+0       ; 3
    adc (zp_fh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    adc (zp_gh1),y  ; 5.5
    bcc *+3         ; 2.5
    inx             ; 1
    sbc ofste_80-$bf,x  ; 4
    sta mRes+2      ; 3

    txa             ; 2
    ldy mT2+1       ; 3
    adc (zp_fh1),y  ; 5.5
    clc             ; 2
    adc (zp_gh1),y  ; 5.5
    sta mRes+3      ; 3

    rts

    ; total=204.5

ofste_3f
	.byt $3f,$40,$41,$42,$43,$44

    .dsb <$bf-*,0
ofste_80
	.byt $80,$81,$82,$83,$84,$85,$86


(and of course a one off init of the high bytes of the zero page pointers, and square tables as follows:
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
2017-04-15 23:28
ChristopherJam

Registered: Aug 2004
Posts: 581
Quoting Repose
Same, I can easily take 4 off mine at the expense of ~36 bytes more zp, but I don't consider that elegant or worthwhile.


Haha, then I should probably acknowledge that my ~750 cycle 32x32->64 requires 32 bytes of zero page pointers, on top of the 16 bytes of zero page split between the multiplier, multiplicand and result :)
2017-04-16 00:26
Repose

Registered: Oct 2010
Posts: 129
Wow, you really squeezed that dry - good job! Even though my approach is different (in the runs of adds vein), it follows the same pattern as yours, and thus will turn out almost exactly the same time.

There's other approaches to faster adds though.
2017-04-16 07:27
Repose

Registered: Oct 2010
Posts: 129
I think I have 195, but it requires a different approach. You're going to have to wait another day though :)
2017-04-17 11:30
Repose

Registered: Oct 2010
Posts: 129
Finally, after fixing a bad sqrtab...

The World's Fastest Published 16x16 Unsigned Mult on 6502.

Testing methodology
-------------------
To measure a particular multiply, you can enter the monitor in vice and type (where pc is your start address):
r pc=c000

Then type:
z

and keep hitting enter to step through the code. Copy and paste the first line. When you've reached the line *after* the last you want to measure, copy and paste that (also include any points in between). You can now subtract the two to get the timing.

Example:
LDA $FB        - A:00 X:0A Y:00 SP:eb ..-..IZ.   72210316
CLC            - A:00 X:0A Y:00 SP:eb ..-..IZC   72210476
RTS            - A:00 X:0A Y:00 SP:eb ..-..IZ.   72210520

This shows me that my multiply body was 160 cycles, and the adds were 44 cycles, for a total of 204 (nevermind the slow times, I had nothing in zp for testing purposes).

I used that as a guide, but really I added by hand and averaged page crossings and branches for the reported total.

Tell me the speed
-----------------
158 cycles for the multiply part, with no variation, and the inputs/outputs in zp.
43 cycles for the final additions, with each branch equally likely.
The total is 201. However, if you include the simple variation which requires that part of the code is in zp, you save 3 cycles for a total of 198 (I just wanted to say I could break 200).
-add 12 for jsr/rts. I report this way to be consistent with CJ's results above-

The Code
--------
;World's fastest 16x16 unsigned mult for 6502
;you can go faster, but not without more code and/or data
;and being less elegant and harder to follow.
;by Repose 2017

;tables of squares
;sqr(x)=x^2/4
;negsqr(x)=(255-x)^2/4
sqrlo=$c000;511 bytes
sqrhi=$c200;511 bytes
negsqrlo=$c400;511 bytes
negsqrhi=$c600;511 bytes

;pointers to square tables above
p_sqr_lo=$8b;2 bytes
p_sqr_hi=$8d;2 bytes
p_invsqr_lo=$8f;2 bytes
p_invsqr_hi=$91;2 bytes

;the inputs and outputs
x0=$fb;multiplier, 2 bytes
x1=$fc
y0=$fd;multiplicand, 2 bytes
y1=$fe
z0=$80;product, 4 bytes
z1=$81
z2=$82
z3=$83

;not shown is a routine to make the tables
;also you need to init the pointers' high bytes to the tables

umult16:
;set multiplier as x0
lda x0
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17

ldy y0
sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y;note these two lines taken as 11 total
sta z0;x0*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c1a+1;x0*y0h;31
;c1a means column 1, row a (partial product to be added later)

ldy y1
;sec  ;notice that the high byte of sub above is always +ve
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1b+1;x0*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2a+1;x0*y1h;31

;set multiplier as x1
lda x1
sta p_sqr_lo
sta p_sqr_hi
eor #$ff
sta p_invsqr_lo
sta p_invsqr_hi;17

ldy y0
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c1c+1;x1*y0l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta c2b+1;x1*y1h;31

ldy y1
;sec
lda (p_sqr_lo),y
sbc (p_invsqr_lo),y
sta c2c+1;x1*y1l
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
sta z3;x1*y1h;31

;4*31+2*17 so far=158
;add partials
;-add first two numbers in column 1
;jmp do_adds;put in zp to save 3 cycles :)
do_adds:
clc
c1a lda #0 
c1b adc #0;add first two rows of column 1
sta z1;9
;-continue to first two numbers in column 2                   
c2a lda #0
c2b adc #0 
sta z2;7
bcc c1c;3 taken/9 not taken, avg 6 
inc z3
clc
;-add last number of column 1 (row c)
c1c lda #0 
adc z1
sta z1;8
;-add last number of column 2
c2c lda #0 
adc z2
sta z2;8
bcc fin;3/7 avg 5
inc z3
;9+7+6+8+8+5=43
fin rts
2017-04-17 20:54
JackAsser

Registered: Jun 2002
Posts: 1204
Repose, really nice!

Some further optimization:

sta c2b+1;x1*y1h;31 => tax

c2a lda #0
c2b adc #0 =>

c2b txa
c2a adc #0

But I somehow like the fact that X is kept clean otoh.
2017-04-17 20:56
JackAsser

Registered: Jun 2002
Posts: 1204
Also if z3 is assume in the y-reg when done instead:

sta z3;x1*y1h;31 => tay

all inc z3 => iny

I think that's ok since you'd probably do some lda z3 afterwards anyway, and instead you have it readily available in Y.
2017-04-17 21:27
Repose

Registered: Oct 2010
Posts: 129
Thanks :) I screwed up my timing though, the first multiply part is 33 not 31 because of the single SEC, bringing my total to 203 (and VICE was right, the multiply part is 160). Because of this, I think I'll use your optimizations to bring it back down to 201. Therefore it still stands :)
2017-04-17 22:54
JackAsser

Registered: Jun 2002
Posts: 1204
Also in a real life situation, f.e. subpixel vectors you'd only keep z3 and z2 as a screen limited 8.8 result. Typically one would do:

Rotated x,y,z in 8.8
Rotated z in 8.8
Reciprocal z = 1/z = 0.16

Perspective:
8.8 * 0.16 = 8.24 (keep only the top 8.8, i.e. pixel and the bresenham initial error)
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
SAM
sailor/Triad
E$G/I-IokutO ForcE
rail slave/Hokuto Fo..
Romppainen/ΤRIΛD
Pantaloon/Fairlight
B3L4/Hokuto Force
Guests online: 49
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 The Shores of Reflec..  (9.7)
4 Coma Light 13  (9.6)
5 Lunatico  (9.6)
6 Comaland 100%  (9.6)
7 Incoherent Nightmare  (9.5)
8 Wonderland XII  (9.5)
9 Comaland  (9.5)
10 Wonderland XIII  (9.5)
Top onefile Demos
1 Dawnfall V1.1  (9.5)
2 Daah, Those Acid Pil..  (9.4)
3 SidRok  (9.4)
4 Treu Love [reu]  (9.4)
5 Tunnel Vision  (9.3)
6 Dawnfall  (9.3)
7 One-Der  (9.2)
8 Globe 2016 [reu]  (9.2)
9 Hardware Accelerated..  (9.2)
10 Safe VSP  (9.1)
Top Groups
1 Booze Design  (9.4)
2 Oxyron  (9.4)
3 Censor Design  (9.3)
4 Crest  (9.3)
5 Camelot  (9.2)
Top Swappers
1 Jerry  (10)
2 Zyron  (10)
3 Derbyshire Ram  (10)
4 Splatterhead  (9.8)
5 Walker  (9.7)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 0.297 sec.