[CSDb] - User Forums - sets of add/sub

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > sets of add/sub

2017-03-18 12:02

Repose
Account closed

Registered: Oct 2010
Posts: 227

sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.

... 45 posts hidden. Click here to view all posts....

2017-03-20 17:48

ChristopherJam

Registered: Aug 2004
Posts: 1424

Perhaps use control flow to track number of overflows? This (untested) routine has four tails, for different addends to the high byte. The labels indicate which numbers have had their low bytes added to the accumulator, and the total number of carries that need passing on to the high byte.

  lda n1
  adc n2
  bcc n1n2_0
  clc
n1n2_1
  adc n3
  bcc n1n2n3_1
  clc
n1n2n3_2
  adc n4
  bcc n1n2n3n4_2
  clc
n1n2n3n4_3
  adc n5
  sta sum

  lda#3
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts


n1n2_0
  adc n3
  bcc n1n2n3_0
  clc
n1n2n3_1
  adc n4
  bcc n1n2n3n4_1
  clc
n1n2n3n4_2
  adc n5
  sta sum

  lda#2
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3_0
  adc n4
  bcc n1n2n3n4_0
  clc
n1n2n3n4_1
  adc n5
  sta sum

  lda#1
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3n4_0
  adc n5
  sta sum

  lda#0
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

Can also ditch the CLCs in the tails if you know there will be no overflow (eg if your numbers are unsigned, or represented in excess $2000). Would take you down to 6-7 cycles per additional low byte, 3 cycles per additional high byte. For enough addends it'd probably also be worth removing the CLCs from the low byte code sequences and just subtracting the misplaced carries from the final result instead.

2017-03-21 03:41

ChristopherJam

Registered: Aug 2004
Posts: 1424

Quoting Repose

One problem, X could rollover to $FE.

Could also avoid that by computing a+(-b)+c+(-d)+e+(-f), if b d and f are coming from a table (as they would be for the multiplies I gather this is for) Just store the negated value in the table instead :)

2017-03-21 06:05

Repose
Account closed

Registered: Oct 2010
Posts: 227

Thanks for the input!

Quick reply for now,
- Yes, I'd thought of storing the values as 2's complement negative, the sign bit steals precision and reduces the domain of the multply. It could be useful to have an alternate and faster routine around, but for now I'm trying to solve a full unsigned 16x16.

-The final sum of a-f will be a positive number, since this is coming from an unsigned multiply partial product. Also, each a-b has a>=b.

I used that to construct the following 16x16 unsigned multiply where the partials are computed and added in order.

Conventions
f means f(x)=x^2/4, g means g(x)=(255-x)^2/4
x0/x1 are low/high of multiplier, x0/y1 multiplicand
Note there is a difference. I take the multiplier as meaning the one that requires setup, aka lda multiplier:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4

"setup x0" is a macro-like notation which means what I just stated above.

This is a bit confusing if you really try to follow it, but here's the order of partials:

         y1  y0
         x1  x0
         ------
        p0h p0l ;x0*y0
    p1h p1l ;x0*y1 for l and y1*x0 for high
    p2h p2l ;x1*y0
p3h p3l ;y1*x1

f(a+b)-g(a-b)

    in pointers         in Y
p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do it in lsb to msb order to do the adds,
x0*y0 low -> r0 (p0l)

x0*y0 high (p0h)
x0*y1 low (p1l)
x1*y0 low -> r1 (p2l)

x1*y0 high (p2h)
y1*x0 high (p1h) this is different than p1l
y1*x1 low -> r2 (p3l)

y1*x1 high -> r4 (p3h)

And now the code:

setup x0;14 cycles
setup x1
setup y1;42c
ldy y0
sec
lda (f_low_x0),y
sbc (g_low_x0),y
sta r0;p0l, 18c

;the adds of column 1
ldx #0;stores carries, reset every column with adds
clc; Y=y0
lda (f_high_x0),y;A=p0h, (a+b) part
ldy y1
adc (f_low_x0),y;+p1l (a+b) part
bcc s1
inx
clc
s1 ldy y0
adc (f_low_x1),y;+p2l (a+b) part
bcc s2; 31c
inx
clc

;the subs of column 1
s2 sec; Y=y0
;A=the (a+b) parts added, carry in X
sbc (g_high_x0),y
bcs s3
dex
sec
s3 ldy y1
sbc (g_low_x0),y;-p1l (a-b) part
bcs s4
dex
sec
s4 ldy y0
sbc (g_low_x1),y;-p2l (a-b) part
bcs s5
dex
s5 sta r1;35 or 65 per column

;the adds of column 2
txa; get the carries from column 1
clc

...

;column 3
txa
clc
adc (f_high_y1),y;p3h (a+b) part
sec
sbc (g_high_y1),y;p3h (a-b) part
sta r3;19
...
42+18+65+65+19=209

This beats the 16x16 posted on codebase, which is 224 by my estimate.
Times are minimal times, really there's a bit more added when there's overflows/carries.
It's good enough to make a fair comparison though.

2017-03-21 06:22

Repose
Account closed

Registered: Oct 2010
Posts: 227

The breakthrough I had today, was to realize I don't have to reuse the multiplier pointers.
I was stuck on thinking I had to stuff the zp in the middle of an add when all registers were in use.
Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.

I'll have to try again with the "the position in code is that data" idea.

So here's the two variations:
1) as above, 3 setups, partials added in order
2) like codebase, 2 setups, partials added out of order.
Which one will reign victorious?

[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..

[2] http://www.6502.org/source/integers/fastmult.htm

2017-03-21 10:48

JackAsser

Registered: Jun 2002
Posts: 2038

a-b+c-d+e-f = a+c+e - (b+d+f)

; Assuming 8-bit unsigned input and 16-bit unsigned output
lda #0
sta tmp1_hi
sta tmp2_hi

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   inc tmp1_hi
   clc
:
adc e
sta tmp1_lo
bcc :+
   inc tmp1_hi
   clc
:

; Perform tmp2=b+d+f
lda b
adc c
bcc :+
   inc tmp2_hi
   clc
:
adc f
sta tmp2_lo
bcc :+
   inc tmp2_hi
:

; Perform tmp1-=tmp2
sec
lda tmp1_lo
sbc tmp2_lo
sta tmp1_lo
lda tmp2_hi
sbc tmp2_hi
sta tmp2_hi

; Result in tmp1

2017-03-21 11:22

Frantic

Registered: Mar 2003
Posts: 1661

If it is ok to use the y-register as well, one could do:

; Assuming 8-bit unsigned input and 16-bit unsigned output
ldy #0

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   iny
   clc
:
adc e
sta tmp1_lo
bcc :+
   iny
   clc
:
sty tmp1_hi

[...]

2017-03-21 12:15

lft

Registered: Jul 2007
Posts: 369

Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.

2017-03-21 12:41

JackAsser

Registered: Jun 2002
Posts: 2038

Quote: Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.

2017-03-21 14:46

ChristopherJam

Registered: Aug 2004
Posts: 1424

We're all terrible at reading each other's comments. I wrote #5 having only read the thread up to #2, all three of Repose, Frantic and myself have proposed different ways of accumulating the carries from the low byte computation, and both lft and myself have independently proposed fixing up the low byte in post rather than clearing carry as you go.

I do like lft's contribution (addition?) of using the carry from the last addition as inverse borrow for the fixup, mind :)

2017-03-21 14:49

ChristopherJam

Registered: Aug 2004
Posts: 1424

Quoting JackAsser

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.

I wonder if you could just get away with checking the Y value every N potential overflows, and just see if it has decreased since the last test... it'd be quicker than doing a branch every time.

N may well be as high as 255; I'd have to think a bit harder to be sure of that.

Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 - Next

Refresh

Subscribe to this thread: