[CSDb] - User Forums - sets of add/sub

Welcome to our latest new user maak ! (Registered 2024-04-18)

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > sets of add/sub

2017-03-18 12:02

Repose

Registered: Oct 2010
Posts: 222

sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.

2017-03-18 13:10

Repose

Registered: Oct 2010
Posts: 222

tbh I can't use Y like first solution because I would need that with a mult routine so ya...

Some more stats

sqr mult

ab = [(a + b)² - (a - b)²]/4

Multiplying two 16bit numbers, x and y, consisting of low bytes (x0, y0) and high bytes (x1, y1), into partial sums of p0-p3, again low and high bytes. These are added to the 32bit result.

         y1  y0
         x1  x0
         ------
        p0h p0l
    p1h p1l
    p2h p2l
p3h p3l

stats for sqr mult
------------------

995688448	23.2% 	p0h+p1l>255
2105475200	49%	p1l+p2l>255
1991145471	46.4%	p0h+p1l+p2l>255

1030820094	24%	p1h+p2h>255
1031913855	24%	p2h+p3l>255
2044225475	47.6%	p1h+p2h+p3l>255

2335172480	54.4%	one carry in p0h column
 765991168	17.8%	two carries in p0h column

1819237028	42.4%	one carry in p1h column
 243496921	 5.7%	two carries in p1h column

Why bother with stats, when it could only save a few cycles?

That's not really what I'm looking at, it's more about coincidently finding that the range is reduced at some point that I can make a more significant optimization. It's a longshot yeah but I'm trying hard.

In the big picture, I'm actually looking at optimizing adds as a way of speeding up multiplication, but so far I have found that keeping the pointers intact saves more time, saving 14 cycles per reuse. What I mean is, storing x0 to the table pointer, to reuse in the multiplication of x0*y0 and x0*y1, for example.

For now I'm only focusing on the possibility of faster adds to explore that option.

I also had some stats for the (a + b)² - (a - b)² part, they seem close to 50% of the time there is a borrow, starting at 1 * 31.

It seems that stringing subs and adds would be slow as they have competing needs for the Carry, with sub a clear carry means a borrow to the next higher byte, with add it means add 1 to the next byte. Also I looked at the possibility that a+c-b could be done in a row. Not sure yet, but it would be great if the virtual 9 bit sum could be kept for the next sub.

Or you could just save me the whole bother and point me to a 16x16 multiply in less than 270 cycles.

Why does anyone need such fast 16x16 mult anyhow, it's not a demo effect?

tinylib for Doom port :) (would use 16bit adds in the 16bit code of course) Would love to see the frame rate go up, and for such cool projects you need a good arithmetic library.

Besides tbh I just got hooked on this as a challenge one day, but also because I wanted to include it as a basic replacement or compiler to speed it up a lot (like for next loops would be so much faster with x% variables).

Finally, consider that you do 16 mults for 32bit numbers and cycles do add up.

2017-03-18 13:59

Oswald

Registered: Apr 2002
Posts: 5017

you're too deep in the rabbit hole to follow you, just shooting a random idea:

lda a
clc
adc c
bcc +
inc resulthi
adc e
bcc +
inc resulthi

sbc b
bcs +
dec resulthi
sbc d
bcs +
dec resulthi

2017-03-18 14:14

Repose

Registered: Oct 2010
Posts: 222

;adds
ldx #0
lda a
clc
adc c
bcc s1;13 if taken or 12
inx;14
s1 adc e
bcc s2;+5 if taken or 4
inx
;18-20 I think
;subs
s2 sec; I think you need this here
sbc b
bcs s3;+5/4
dex
s3 sbc d
bcs s4;+6
dex
s4 sta r
stx r+1;+6
so 18+14+6=38-42

Thanks for your input. This also has to continue for 16bit adds. I'm sorry I meant a-b+c-d as 16bit numbers actually.

It's averaging about 6 cycles per add/sub, seems pretty good to me. Makes more sense than the way I was thinking of it.

I'll have to get back on this though.

One problem, X could rollover to $FE. Wish I could INY for each sub then:
sty temp
txa
sbc temp
sta r+1;or adc a+1 etc.

2017-03-20 17:48

ChristopherJam

Registered: Aug 2004
Posts: 1370

Perhaps use control flow to track number of overflows? This (untested) routine has four tails, for different addends to the high byte. The labels indicate which numbers have had their low bytes added to the accumulator, and the total number of carries that need passing on to the high byte.


  lda n1
  adc n2
  bcc n1n2_0
  clc
n1n2_1
  adc n3
  bcc n1n2n3_1
  clc
n1n2n3_2
  adc n4
  bcc n1n2n3n4_2
  clc
n1n2n3n4_3
  adc n5
  sta sum

  lda#3
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts


n1n2_0
  adc n3
  bcc n1n2n3_0
  clc
n1n2n3_1
  adc n4
  bcc n1n2n3n4_1
  clc
n1n2n3n4_2
  adc n5
  sta sum

  lda#2
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3_0
  adc n4
  bcc n1n2n3n4_0
  clc
n1n2n3n4_1
  adc n5
  sta sum

  lda#1
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3n4_0
  adc n5
  sta sum

  lda#0
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

Can also ditch the CLCs in the tails if you know there will be no overflow (eg if your numbers are unsigned, or represented in excess $2000). Would take you down to 6-7 cycles per additional low byte, 3 cycles per additional high byte. For enough addends it'd probably also be worth removing the CLCs from the low byte code sequences and just subtracting the misplaced carries from the final result instead.

2017-03-21 03:41

ChristopherJam

Registered: Aug 2004
Posts: 1370

Quoting Repose

One problem, X could rollover to $FE.

Could also avoid that by computing a+(-b)+c+(-d)+e+(-f), if b d and f are coming from a table (as they would be for the multiplies I gather this is for) Just store the negated value in the table instead :)

2017-03-21 06:05

Repose

Registered: Oct 2010
Posts: 222

Thanks for the input!

Quick reply for now,
- Yes, I'd thought of storing the values as 2's complement negative, the sign bit steals precision and reduces the domain of the multply. It could be useful to have an alternate and faster routine around, but for now I'm trying to solve a full unsigned 16x16.

-The final sum of a-f will be a positive number, since this is coming from an unsigned multiply partial product. Also, each a-b has a>=b.

I used that to construct the following 16x16 unsigned multiply where the partials are computed and added in order.

Conventions
f means f(x)=x^2/4, g means g(x)=(255-x)^2/4
x0/x1 are low/high of multiplier, x0/y1 multiplicand
Note there is a difference. I take the multiplier as meaning the one that requires setup, aka lda multiplier:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4

"setup x0" is a macro-like notation which means what I just stated above.

This is a bit confusing if you really try to follow it, but here's the order of partials:

         y1  y0
         x1  x0
         ------
        p0h p0l ;x0*y0
    p1h p1l ;x0*y1 for l and y1*x0 for high
    p2h p2l ;x1*y0
p3h p3l ;y1*x1

f(a+b)-g(a-b)

    in pointers         in Y
p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do it in lsb to msb order to do the adds,
x0*y0 low -> r0 (p0l)

x0*y0 high (p0h)
x0*y1 low (p1l)
x1*y0 low -> r1 (p2l)

x1*y0 high (p2h)
y1*x0 high (p1h) this is different than p1l
y1*x1 low -> r2 (p3l)

y1*x1 high -> r4 (p3h)

And now the code:

setup x0;14 cycles
setup x1
setup y1;42c
ldy y0
sec
lda (f_low_x0),y
sbc (g_low_x0),y
sta r0;p0l, 18c

;the adds of column 1
ldx #0;stores carries, reset every column with adds
clc; Y=y0
lda (f_high_x0),y;A=p0h, (a+b) part
ldy y1
adc (f_low_x0),y;+p1l (a+b) part
bcc s1
inx
clc
s1 ldy y0
adc (f_low_x1),y;+p2l (a+b) part
bcc s2; 31c
inx
clc

;the subs of column 1
s2 sec; Y=y0
;A=the (a+b) parts added, carry in X
sbc (g_high_x0),y
bcs s3
dex
sec
s3 ldy y1
sbc (g_low_x0),y;-p1l (a-b) part
bcs s4
dex
sec
s4 ldy y0
sbc (g_low_x1),y;-p2l (a-b) part
bcs s5
dex
s5 sta r1;35 or 65 per column

;the adds of column 2
txa; get the carries from column 1
clc

...

;column 3
txa
clc
adc (f_high_y1),y;p3h (a+b) part
sec
sbc (g_high_y1),y;p3h (a-b) part
sta r3;19
...
42+18+65+65+19=209

This beats the 16x16 posted on codebase, which is 224 by my estimate.
Times are minimal times, really there's a bit more added when there's overflows/carries.
It's good enough to make a fair comparison though.

2017-03-21 06:22

Repose

Registered: Oct 2010
Posts: 222

The breakthrough I had today, was to realize I don't have to reuse the multiplier pointers.
I was stuck on thinking I had to stuff the zp in the middle of an add when all registers were in use.
Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.

I'll have to try again with the "the position in code is that data" idea.

So here's the two variations:
1) as above, 3 setups, partials added in order
2) like codebase, 2 setups, partials added out of order.
Which one will reign victorious?

[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..

[2] http://www.6502.org/source/integers/fastmult.htm

2017-03-21 10:48

JackAsser

Registered: Jun 2002
Posts: 1987

a-b+c-d+e-f = a+c+e - (b+d+f)

; Assuming 8-bit unsigned input and 16-bit unsigned output
lda #0
sta tmp1_hi
sta tmp2_hi

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   inc tmp1_hi
   clc
:
adc e
sta tmp1_lo
bcc :+
   inc tmp1_hi
   clc
:

; Perform tmp2=b+d+f
lda b
adc c
bcc :+
   inc tmp2_hi
   clc
:
adc f
sta tmp2_lo
bcc :+
   inc tmp2_hi
:

; Perform tmp1-=tmp2
sec
lda tmp1_lo
sbc tmp2_lo
sta tmp1_lo
lda tmp2_hi
sbc tmp2_hi
sta tmp2_hi

; Result in tmp1

2017-03-21 11:22

Frantic

Registered: Mar 2003
Posts: 1627

If it is ok to use the y-register as well, one could do:

; Assuming 8-bit unsigned input and 16-bit unsigned output
ldy #0

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   iny
   clc
:
adc e
sta tmp1_lo
bcc :+
   iny
   clc
:
sty tmp1_hi

[...]

2017-03-21 12:15

lft

Registered: Jul 2007
Posts: 369

Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.

... 45 posts hidden. Click here to view all posts....

Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next

Refresh

Subscribe to this thread: