[CSDb] - User Forums - sets of add/sub

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > sets of add/sub

2017-03-18 12:02

Repose

Registered: Oct 2010
Posts: 225

sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.

2017-03-18 13:10

Repose

Registered: Oct 2010
Posts: 225

tbh I can't use Y like first solution because I would need that with a mult routine so ya...

Some more stats

sqr mult

ab = [(a + b)² - (a - b)²]/4

Multiplying two 16bit numbers, x and y, consisting of low bytes (x0, y0) and high bytes (x1, y1), into partial sums of p0-p3, again low and high bytes. These are added to the 32bit result.

         y1  y0
         x1  x0
         ------
        p0h p0l
    p1h p1l
    p2h p2l
p3h p3l

stats for sqr mult
------------------

995688448	23.2% 	p0h+p1l>255
2105475200	49%	p1l+p2l>255
1991145471	46.4%	p0h+p1l+p2l>255

1030820094	24%	p1h+p2h>255
1031913855	24%	p2h+p3l>255
2044225475	47.6%	p1h+p2h+p3l>255

2335172480	54.4%	one carry in p0h column
 765991168	17.8%	two carries in p0h column

1819237028	42.4%	one carry in p1h column
 243496921	 5.7%	two carries in p1h column

Why bother with stats, when it could only save a few cycles?

That's not really what I'm looking at, it's more about coincidently finding that the range is reduced at some point that I can make a more significant optimization. It's a longshot yeah but I'm trying hard.

In the big picture, I'm actually looking at optimizing adds as a way of speeding up multiplication, but so far I have found that keeping the pointers intact saves more time, saving 14 cycles per reuse. What I mean is, storing x0 to the table pointer, to reuse in the multiplication of x0*y0 and x0*y1, for example.

For now I'm only focusing on the possibility of faster adds to explore that option.

I also had some stats for the (a + b)² - (a - b)² part, they seem close to 50% of the time there is a borrow, starting at 1 * 31.

It seems that stringing subs and adds would be slow as they have competing needs for the Carry, with sub a clear carry means a borrow to the next higher byte, with add it means add 1 to the next byte. Also I looked at the possibility that a+c-b could be done in a row. Not sure yet, but it would be great if the virtual 9 bit sum could be kept for the next sub.

Or you could just save me the whole bother and point me to a 16x16 multiply in less than 270 cycles.

Why does anyone need such fast 16x16 mult anyhow, it's not a demo effect?

tinylib for Doom port :) (would use 16bit adds in the 16bit code of course) Would love to see the frame rate go up, and for such cool projects you need a good arithmetic library.

Besides tbh I just got hooked on this as a challenge one day, but also because I wanted to include it as a basic replacement or compiler to speed it up a lot (like for next loops would be so much faster with x% variables).

Finally, consider that you do 16 mults for 32bit numbers and cycles do add up.

2017-03-18 13:59

Oswald

Registered: Apr 2002
Posts: 5094

you're too deep in the rabbit hole to follow you, just shooting a random idea:

lda a
clc
adc c
bcc +
inc resulthi
adc e
bcc +
inc resulthi

sbc b
bcs +
dec resulthi
sbc d
bcs +
dec resulthi

2017-03-18 14:14

Repose

Registered: Oct 2010
Posts: 225

;adds
ldx #0
lda a
clc
adc c
bcc s1;13 if taken or 12
inx;14
s1 adc e
bcc s2;+5 if taken or 4
inx
;18-20 I think
;subs
s2 sec; I think you need this here
sbc b
bcs s3;+5/4
dex
s3 sbc d
bcs s4;+6
dex
s4 sta r
stx r+1;+6
so 18+14+6=38-42

Thanks for your input. This also has to continue for 16bit adds. I'm sorry I meant a-b+c-d as 16bit numbers actually.

It's averaging about 6 cycles per add/sub, seems pretty good to me. Makes more sense than the way I was thinking of it.

I'll have to get back on this though.

One problem, X could rollover to $FE. Wish I could INY for each sub then:
sty temp
txa
sbc temp
sta r+1;or adc a+1 etc.

2017-03-20 17:48

ChristopherJam

Registered: Aug 2004
Posts: 1409

Perhaps use control flow to track number of overflows? This (untested) routine has four tails, for different addends to the high byte. The labels indicate which numbers have had their low bytes added to the accumulator, and the total number of carries that need passing on to the high byte.


  lda n1
  adc n2
  bcc n1n2_0
  clc
n1n2_1
  adc n3
  bcc n1n2n3_1
  clc
n1n2n3_2
  adc n4
  bcc n1n2n3n4_2
  clc
n1n2n3n4_3
  adc n5
  sta sum

  lda#3
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts


n1n2_0
  adc n3
  bcc n1n2n3_0
  clc
n1n2n3_1
  adc n4
  bcc n1n2n3n4_1
  clc
n1n2n3n4_2
  adc n5
  sta sum

  lda#2
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3_0
  adc n4
  bcc n1n2n3n4_0
  clc
n1n2n3n4_1
  adc n5
  sta sum

  lda#1
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

n1n2n3n4_0
  adc n5
  sta sum

  lda#0
  adc n1+1
  clc
  adc n2+1
  clc
  adc n3+1
  clc
  adc n4+1
  clc
  adc n5+1
  sta sum+1
  rts

Can also ditch the CLCs in the tails if you know there will be no overflow (eg if your numbers are unsigned, or represented in excess $2000). Would take you down to 6-7 cycles per additional low byte, 3 cycles per additional high byte. For enough addends it'd probably also be worth removing the CLCs from the low byte code sequences and just subtracting the misplaced carries from the final result instead.

2017-03-21 03:41

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting Repose

One problem, X could rollover to $FE.

Could also avoid that by computing a+(-b)+c+(-d)+e+(-f), if b d and f are coming from a table (as they would be for the multiplies I gather this is for) Just store the negated value in the table instead :)

2017-03-21 06:05

Repose

Registered: Oct 2010
Posts: 225

Thanks for the input!

Quick reply for now,
- Yes, I'd thought of storing the values as 2's complement negative, the sign bit steals precision and reduces the domain of the multply. It could be useful to have an alternate and faster routine around, but for now I'm trying to solve a full unsigned 16x16.

-The final sum of a-f will be a positive number, since this is coming from an unsigned multiply partial product. Also, each a-b has a>=b.

I used that to construct the following 16x16 unsigned multiply where the partials are computed and added in order.

Conventions
f means f(x)=x^2/4, g means g(x)=(255-x)^2/4
x0/x1 are low/high of multiplier, x0/y1 multiplicand
Note there is a difference. I take the multiplier as meaning the one that requires setup, aka lda multiplier:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4

"setup x0" is a macro-like notation which means what I just stated above.

This is a bit confusing if you really try to follow it, but here's the order of partials:

         y1  y0
         x1  x0
         ------
        p0h p0l ;x0*y0
    p1h p1l ;x0*y1 for l and y1*x0 for high
    p2h p2l ;x1*y0
p3h p3l ;y1*x1

f(a+b)-g(a-b)

    in pointers         in Y
p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do it in lsb to msb order to do the adds,
x0*y0 low -> r0 (p0l)

x0*y0 high (p0h)
x0*y1 low (p1l)
x1*y0 low -> r1 (p2l)

x1*y0 high (p2h)
y1*x0 high (p1h) this is different than p1l
y1*x1 low -> r2 (p3l)

y1*x1 high -> r4 (p3h)

And now the code:

setup x0;14 cycles
setup x1
setup y1;42c
ldy y0
sec
lda (f_low_x0),y
sbc (g_low_x0),y
sta r0;p0l, 18c

;the adds of column 1
ldx #0;stores carries, reset every column with adds
clc; Y=y0
lda (f_high_x0),y;A=p0h, (a+b) part
ldy y1
adc (f_low_x0),y;+p1l (a+b) part
bcc s1
inx
clc
s1 ldy y0
adc (f_low_x1),y;+p2l (a+b) part
bcc s2; 31c
inx
clc

;the subs of column 1
s2 sec; Y=y0
;A=the (a+b) parts added, carry in X
sbc (g_high_x0),y
bcs s3
dex
sec
s3 ldy y1
sbc (g_low_x0),y;-p1l (a-b) part
bcs s4
dex
sec
s4 ldy y0
sbc (g_low_x1),y;-p2l (a-b) part
bcs s5
dex
s5 sta r1;35 or 65 per column

;the adds of column 2
txa; get the carries from column 1
clc

...

;column 3
txa
clc
adc (f_high_y1),y;p3h (a+b) part
sec
sbc (g_high_y1),y;p3h (a-b) part
sta r3;19
...
42+18+65+65+19=209

This beats the 16x16 posted on codebase, which is 224 by my estimate.
Times are minimal times, really there's a bit more added when there's overflows/carries.
It's good enough to make a fair comparison though.

2017-03-21 06:22

Repose

Registered: Oct 2010
Posts: 225

The breakthrough I had today, was to realize I don't have to reuse the multiplier pointers.
I was stuck on thinking I had to stuff the zp in the middle of an add when all registers were in use.
Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.

I'll have to try again with the "the position in code is that data" idea.

So here's the two variations:
1) as above, 3 setups, partials added in order
2) like codebase, 2 setups, partials added out of order.
Which one will reign victorious?

[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..

[2] http://www.6502.org/source/integers/fastmult.htm

2017-03-21 10:48

JackAsser

Registered: Jun 2002
Posts: 2014

a-b+c-d+e-f = a+c+e - (b+d+f)

; Assuming 8-bit unsigned input and 16-bit unsigned output
lda #0
sta tmp1_hi
sta tmp2_hi

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   inc tmp1_hi
   clc
:
adc e
sta tmp1_lo
bcc :+
   inc tmp1_hi
   clc
:

; Perform tmp2=b+d+f
lda b
adc c
bcc :+
   inc tmp2_hi
   clc
:
adc f
sta tmp2_lo
bcc :+
   inc tmp2_hi
:

; Perform tmp1-=tmp2
sec
lda tmp1_lo
sbc tmp2_lo
sta tmp1_lo
lda tmp2_hi
sbc tmp2_hi
sta tmp2_hi

; Result in tmp1

2017-03-21 11:22

Frantic

Registered: Mar 2003
Posts: 1648

If it is ok to use the y-register as well, one could do:

; Assuming 8-bit unsigned input and 16-bit unsigned output
ldy #0

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   iny
   clc
:
adc e
sta tmp1_lo
bcc :+
   iny
   clc
:
sty tmp1_hi

[...]

2017-03-21 12:15

lft

Registered: Jul 2007
Posts: 369

Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.

2017-03-21 12:41

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.

2017-03-21 14:46

ChristopherJam

Registered: Aug 2004
Posts: 1409

We're all terrible at reading each other's comments. I wrote #5 having only read the thread up to #2, all three of Repose, Frantic and myself have proposed different ways of accumulating the carries from the low byte computation, and both lft and myself have independently proposed fixing up the low byte in post rather than clearing carry as you go.

I do like lft's contribution (addition?) of using the carry from the last addition as inverse borrow for the fixup, mind :)

2017-03-21 14:49

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting JackAsser

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.

I wonder if you could just get away with checking the Y value every N potential overflows, and just see if it has decreased since the last test... it'd be quicker than doing a branch every time.

N may well be as high as 255; I'd have to think a bit harder to be sure of that.

2017-03-21 20:08

lft

Registered: Jul 2007
Posts: 369

Quoting ChristopherJam

We're all terrible at reading each other's comments.

Whoops, you're quite right. Sorry!

2017-03-22 00:50

Fresh

Registered: Jan 2005
Posts: 101

While thinking about this topic I've come up with a completely different approach which is not ideal for this case but may be useful if we need to add more numbers. Hope this is not too off-topic.
Beware, I haven't checked if this needs some kind of CIA version fix: this works fine with old CIAs.
BTW, I'm not using the carry.

.const  zp0=$fb
.const  zp1=$fc
.const  zp2=$fd

.pc=$1000
start:
        sei
!:	bit $d011
	bpl !-   // No BLs!
	lda zp1
	sta b+1
	lda zp2
	sta c+1	
	lda #$00
	sta $dd0f
	sta $dd06
	lda #$01
	sta $dd07
	sta $dd0f		
	
	ldy zp0		// 3
b:
	ldx tab,y	// 4/5
c:
	ldy tab,x	// 4/5
	
	// Here we could add more ldx/ldy
lo:		
	ldx $dd06	// 4 => Wait=15/17	
	lda carrytab,x
	// HI/LO in A/Y
	sty $63
	sta $62
        cli
	jmp $bdd1	

	
.align $0100
tab:
.fill 512,i&$ff
carrytab:
.fill 256,[$f3-i]&$ff

2017-03-22 05:32

Repose

Registered: Oct 2010
Posts: 225

I can't follow your code yet, but an idea just came to me (which probably you used), that any addition crossing a page boundary takes one more cycle and you can simply difference the cycles to find the number of carries. Very clever!

My own idea which just came to mind based on this concept, is

addtab is a 510 byte table, starting at a page boundary (very important!), that is just a count (0,1,1...255,0..)

ldx n1
lda n2
sta sm1+1
lda n3
sta sm2+1;or just imagine the numbers are already stored there

;initialize and start the timer to a high value
lda TimerA_Low
sta start

sm1 lda addtab,x; if addtab+x crosses page, add 1 cycle
tax; transfer running sum (n1+n2) to X
sm2 lda addtab,x;add n3 to running sum
...
sta r0;the low byte of sum
lda start
sec
sbc timerAlo
sbc #some constant
sta r1;the high byte of sum

It's not fast but, it's an interesting concept that could be applied to optimized code somehow.

If I store n2,n3... directly in sm1+1, sm2+1..., then it optimizes out to:

ldx n1
sm1 lda addtab,x;n1+n2
tax
sm2 lda addtab,x;n3+ (n1+n2)
tax
sm3 lda addtab,x;n4+ (n1+n2+n3)
...
carries are time of code minus the constant 6*n (the number of values added)

This is also quite faster so it's actually practical!

I just invented the missing add instruction. The indexed addressing is doing the adds, and the timer is doing the carries, in parallel.

2017-03-22 06:33

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting Repose

The indexed addressing is doing the adds, and the timer is doing the carries, in parallel.

which is exactly what Fresh is proposing \O/

@lft - all good, I missed that Repose was tracking the carry count in X in post #4

2017-03-22 06:41

Repose

Registered: Oct 2010
Posts: 225

Even better:

clc
lda n1
adc n2
bcs *+2;basically do nothing as quickly as possible to make carry consume one more cycle
adc n3
bcs *+2
...
sta sum
;diff timer
sta sum+1

My first method was 6/7, this one is 5/6. The probability of a carry is just under half, so basically approaches 5.5 cycles per number added.

I love to see someone's face try to decode that though, confuse the @#@ out of them :)

2017-03-22 07:18

Repose

Registered: Oct 2010
Posts: 225

You can even add/sub in the same sequence

;now do subtractions
sec
sbc n5
bcs *+2; 'subtract' one cycle from timing if there's a borrow
sbc n6
bcs *+2
...

2017-03-22 07:21

Repose

Registered: Oct 2010
Posts: 225

Why stop there?

lda n
ror
bcs *+2
x8
;diff timer
sta popcnt

Population count in 4.5 cycles per bit. Of course a table would be much faster. What else useful could you do?

2017-03-22 08:29

Repose

Registered: Oct 2010
Posts: 225

bcs/bcc, bmi/bpl, bvc can all be used in this technique to accumulate the times those flags were in a certain state. By extension, you can accumulate times a certain bit was set or value was reached, like cmp #$40:bcc *+2:
So think of the timer as an extra register that you can increment or decrement (by using the opposite branch), using branches or page crossings, that is only fast to use over a longish sequence. It can do corrections after the fact or count things. There is also a free divide, IIRC you can set the speed the timer counts? Or is it fixed to decrement each clock cycle? No matter, use the SID voice 3 output and set the frequency for a free division.

Question, what use can this be put to? And can you use the flags as part of a calculation ,while using the accumulation of times the flags happened for a parallel computation using the timer (which has no sideeffects) to do double the work? (I mean, use the branch to do one thing, then at the end do another thing with the timer value).

In another direction, can you do some computation unrolled then interrupt it at a certain time to see the value and that's your result?

example,
rol
bcs +2
inx
... x8
interrupt after 24 cycles
X will be 4-(popcnt(A and $#f)/6) or something (the full formula is kinda long)

Another idea I've used to to make 'infinite' loops that can only be stopped by an interrupt
lda #0
sta $4000
sta $4001
..
decide which area of the bitmap you want to fill
jump into the code
stop it with an interrupt
of course you can 'plot' an RTS into the middle as well, then restore it.
Interrupt idea might turn out faster

This is getting off topic though.

2017-03-22 08:42

ChristopherJam

Registered: Aug 2004
Posts: 1409

I've definitely considered using an interrupt to terminate an unrolled fill loop, but yes, plotting an RTS seemed like a saner option at the time.

2017-03-22 08:53

ChristopherJam

Registered: Aug 2004
Posts: 1409

Could also use CIA to see count how many iterations a fractal takes to escape, freeing up registers for state :)

2017-03-22 08:55

Pex Mahoney Tufvesson

Registered: Sep 2003
Posts: 52

I love you guys! :-D
I've used timers a lot for stopping infinite loops. One example is the part in BitLive4-demo where I draw a full-screen all-border white/black line using $d020 and CPU only. No time for keeping track of where to stop. :)
---
Have a noise night!
http://mahoney.c64.org

2017-03-22 09:04

Repose

Registered: Oct 2010
Posts: 225

@cj Very useful idea, that would definitely save time. In the end, just use the Timer value as a colour and plot it. It doesn't really matter the values involved, you can always arrange them to be something different depending on escape "time" literally... then animate the 'palette' by starting one cycle later..
See what I mean? I'm saying if you plot a fractal in real time in one frame say, then the colours are determined by the times to iterate the formula, and that value from the timer directly translate to a colour, then starting later shifts the colour palette down one, then you have a free animated fractal.

@mahooney Nice, that's a very good use because it's the only way possible to create an effect. Will check out your demo thanks for the comment :)

2017-03-22 09:48

Repose

Registered: Oct 2010
Posts: 225

Another free calculation on the Timer, remember it's 16bit so you can cause a borrow from the high byte, so lda TimerA+1:bmi xx or bne is like asking was there at least n counts (carries, borrows, whatever you're doing).

Say my adds were 6 cycles each, I'm adding 4 numbers, I start timer at $124, then lda TimerA+1:bne means no carries. Don't know how it's useful either, just throwing out a lot of ideas. You gotta brainstorm and not hold yourself back, useful stuff will stick eventually.

There's also two timers, you can have them read different values.

You don't need the sbc #constant because you should set the timer to #constant + variation at the start, then at the end the timer reads exactly the high byte +1 (since Timer can't be zero)
so
ldx timerA
dex
stx sum+1

2017-04-01 00:43

Repose

Registered: Oct 2010
Posts: 225

I've finished analyzing this thread.

My 2nd version (1)

clc:lda n1;5 cycles overhead
adc n2:bcs s1;5 if no carry
adc n3;3 which means 2 saved on last number
sta r0
lda #0:adc #0:sta r1:rts;10 w/o rts
s1 clc;8 if carry, so 5/8 avg 6.5
adc n3
sta r0
lda #1:adc #0:sta r1:rts

Total: 13+(n-1)*6.5
---------------
Frantic's suggestion (2)

This is the same as mine as inspired by Oswald

ldy #0:lda a:clc;7 cycles overhead
adc c:bcc +3:iny:clc;6/9 avg 7.5
adc e:bcc +3:iny:clc
...
sta r0:sty r1;6 cycles overhead

Total: 13+(n-1)*7.5, n is number of zp 8 bit numbers to be added to 16 bit result.
---------------
lft's suggestion (3)

ldy #0:lda a:clc;7
adc c:bcc +3:iny;6/7 avg 6.5
adc c:bcc +3:iny;note that carry wasn't reset so we falsely are +1
...
sty r1;save high byte of result
sec:sbc r1:sta r0;subract all the false carries for the low byte of result
;11

Total: 18+(n-1)*6.5

Note that (2) breaks even with (3) at n=5 or when 6 numbers are added together. With longer sequences, (3) is slightly faster.
---------------
CJ's suggestion (4)

This is a similar but expanded version of my (1)

lda n1;overhead 3
adc n2:bcc n1n2_0;7 per add if carry
clc
n1n2_1 adc n3:bcc n1n2n3_1
...
n1n2_0 adc n3:bcc ...;6 per add if no carry, avg 6.5
...
sta sum
lda #x:adc ...:sta sum+1;eqv. overhead is 13 to be almost exactly
like mine

Total: 13+(n-1)*6.5
can avg to (n-1)*5.5 if you use post correction.

---------------
My timer version (5)

overhead + (n-1)*5.5

Where the overhead is fairly big.

-----------
I only analyzed the low bytes, but it gives some idea of the approaches. (1)/(4) use a large amount of code and rely on the idea that the position in the code encodes the number of carries.

(2)/(3) use a register to track the carries and can be sped up over long runs by post correction of the low byte. The break-even point is at 6 numbers.

Finally (5) requires precise timing but can add long runs quickly. It's the same average speed as (4). It's up to you if you want to have huge code or short code but no screen or interrupts (screen timing could be factored out in some cases but not interrupts).

The differences are very small, from 5.5 to 7.5 per add.

My actual application is in adding the partial sums of large multiplies when using the squares multiply method. For a 32x32 unsigned multiply, there's up to 7 rows of numbers to be added, which is actually 7 subs (from squares subtraction) and 8 adds (previous carry and 7 rows). That's 15 numbers add/sub together with a final carry.

Conclusion: using (3) (with post correction) for most of the adds would be the fastest and convenient, at 6.5 per byte. If you went insane with code size, you could save another 72 cycles or so.

Estimating how much this would add to the multiply is a little tricky; the numbers themselves are part of the multiply, and the add/subs are integrated into the routine, so it's only the 'extra' part to track the carries that is important, which averages 4.5.
There's about 72 numbers add/sub so at least 324 beyond the usual 38 cycles per multiply.
I'm guessing between 800-1000 cycles in total, which means the c64 can reach 1000flops single precision :)

2017-04-01 00:55

Repose

Registered: Oct 2010
Posts: 225

For comparison, basic is 2300 and http://www.codebase64.org/doku.php?id=base:fast_floating_point_.. is 1300. I haven't estimated the floating point overhead to the multiply, but clearly my routine will be noticeably quicker.

I've also come to conclusions about the order of multiplies, but that's not about adding so I'll update the 'fast large multiplies' thread.

Thanks for all your input. Topic closed.

2017-04-03 21:44

Repose

Registered: Oct 2010
Posts: 225

Ok I've tested adding/subbing sequences of numbers now. There's one tricky bit.

First consider this routine for repeated subtractions:
(warning: following routine is wrong, which I explain why)
ldx #0
lda n1
sec;7 cycles overhead here
sbc n2
bcs s1;Carry is still set if n1>=n2 meaning no borrow
dex;keep track of borrows in X
s1 sbc n3
bcs s3;6 if no borrow, 7 if borrow, avg 6.5
dex
s3 stx r1;save high byte of result
sec
sbc r1;post correction of low byte
sta r0;save low byte
;post correction overhead is 5, 6 more to store result
Total time: 18+(n-1)*6.5 cycles

*note that without post correction, the timing is 12+7.5*(n-1), and they break even at 7 subs, which means it's faster to use post correction if you do 8+ subs in a row, but only saves a cycle per extra sub.

Now let's go through an example calculation.
0-FF-FF or 0-255-255 = FE 02 or -510.

If you get confused about 2's complement, look at it this way, what would you have to add to FE02 to get (01) 00 00?
FE+02=00 carry 1, this means you need to add 254 to the low byte to start.
Now the high byte, FE+carry(1)=FF. What do you need to add to the high byte? FF+01=(01) 00. Adding one to the high byte is like adding 256. In total, we added 256+254=510, so we must have had -510 to start with.

lda #0;n1=0
ldx #0;X tracks borrows, starts at 0.
sec;C=1
sbc #$ff;0-FF=1 as in -255 or (FF) 01, C=0 for borrow
bcs s1:dex;X=FF
s1 sbc #$ff;1-ff-not(C) or 1-ff-1=1, C=0 or borrow
bcs s2:dex:X=FE
s2 stx r1;store correct high byte as FE
sec
sbc r1;A=A-r1, 01-FE=03 as in 1-254=-253 or (FF) 03.

Our answer is too positive. What went wrong?

The idea of the post correction is that the running tally in A gets a little more wrong with each borrow, since we aren't using SEC before each subtraction. There were two borrows, but only one of them messed up our tally.
All we have to do is use CLC instead of SEC at the beginning, which seems weird but it works.

Here's the revised values:
C=0
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FF.
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FE.
Store correct high byte as FE.
C=1
Low byte is 00-FE=(FF) 02.
Final correct result is FE 02.

The same routine but ADC instead of SBC and INX instead of DEX also works, and can come before this. The only overhead is one extra CLC before the sequence of SBC starts.

You'd expect a similar bug in the adding version, but it works there because I start with CLC, and each carry messes up the running sum, so likewise, in the sub version, we must make sure each SBC messes up the tally, so we start with CLC as well.

In conclusion, you can add/sub long sequences of numbers with post correction with only 6.5 cycles per add, which is nearly the fastest possible if over 7 adds. You can add 257 numbers this way until X overflows.

I also want to point out, in my previous post where the carries are implied as the position in the code, the code branches grow so quickly that you need to add cycles for page boundary crossings in the branches.

2017-04-11 13:13

Repose

Registered: Oct 2010
Posts: 225

My clc fix was a little hasty, this works better:

ldx #0
clc
lda n1
adc n2
bcc s1
inx
adc #0;fix
sec;fix
s1 adc n3
bcc s2
inx
stx h
sec
sbc h
sta l

And something similar for subs. It's 2 cycles slower on avg though, however you can get that back with

sec
sbc id,x

In actual use you don't stx h, you immediately go to

txa
ldx #0

I have to do a bit more testing, post correction may not actually work and you have to use clc each time.

2017-04-11 19:20

Oswald

Registered: Apr 2002
Posts: 5094

random idea #2 one could accumulate C's with rol-s too

clc
lda n1
adc n2
ror selfm+1
adc n3
ror selfm+1
adc n4
ror selfm+1
selfm jmp fixhibyte

2017-04-11 21:43

Repose

Registered: Oct 2010
Posts: 225

That's an original idea, however that only gives 2 bytes between code, as the values could be 1,3,5,7 etc. If you give it a final ror, that still wouldn't be enough. I think you need 8 bytes to correct.
lda #correction:sta l:rts/bpl etc.
This pushes the overhead way up.
You could use the roring as an index to a table,
ldx correction
adc cortable,x

2017-04-12 10:42

Oswald

Registered: Apr 2002
Posts: 5094

you could shift in bits even from the most significant bit to solve the code gap.

reading table could be simpler:

ror selfmod+1

selfmod lda rortable
sta hibyte

2017-04-12 10:47

ChristopherJam

Registered: Aug 2004
Posts: 1409

It would even clear carry for you, so you wouldn't have to fixup :)

Unfortunately bcc *+3:inx is about 2.5 cycles faster per add. Pity, I prefer the ROR solution. It does take less instruction ram at least?

2017-04-12 21:17

Repose

Registered: Oct 2010
Posts: 225

Actually came up with yet another way, which I think is kinda cool, as it only works for 16x16 partials so it fits perfectly.

24bit+2x16bit add
This strange combination is equivalent to the sum of the partial products of
a 16x16 multiply.

a2 a1 a0
   b1 b0
+  c1 c0
--------
r2 r1 r0

2 clc
3 lda a0
3 adc b0
2 tax;a0+b0
3 lda a1
3 adc b1
2 tay;a1+b1
3 adc a2
3 sta r2
2 txa
;no clc as there's never a carry from r2, in a mult
3 adc c0
3 sta r0
2 tya
3 adc c1
3 sta r1
3 bcc s1;3/7 avg 5
5 inc r2
s1 rts

45 cycles

2017-04-13 09:18

Repose

Registered: Oct 2010
Posts: 225

So bad news. The post correction idea has a subtle bug that can't be fixed.

I'll summarize.

This thread was exploring quick ways to add several 16 bit numbers quickly (in my case, I wanted to add a 24bit to two 16bit numbers, which corresponds to the addition of the partial products in an unsigned 16x16 multiply).

One of the solutions was to just go ahead and add them all without clearing carry and get the wrong answer. This was an attractive solution because it was fast, being only 6.5 cycles per add.

It looks like this:

clc
lda n1
adc n2
bcc s1
inx
s1 adc n3
bcc s2
inx
s2 sta low
stx high;or add the next column (the high bytes of n1-n3) with txa:ldx #0

Then the question was, how to fix that at the end? It seemed the obvious answer was:

s2 stx high
sec
sbc high
sta low;remove the carries which were added by NOT using CLC

Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.

Quick example,
ff+ff=fe with carry set, then +ff+Carry=fe. Only one false carry was added, but the high byte is 2, so 2 was subtracted from the low byte, giving fc, where the answer should be fd (ff+ff+ff=02 fd).

My first approach was to use SEC at the very start. Obviously this won't work, as it just adds 1 every time, so it won't work for 1+2+3.

Then I realized I need adc #0 at the very end, to offset A (the low byte) by an equal number of carries.

This works.

It looks like this:

s2 adc #0
stx high
sec
sbc high
sta low

The subtle bug is, the high byte can now be wrong, because the carries aren't always happening when they should be. This only shows up in the case ff+ff+01.
ff+ff=01 fe, Carry set (correct). fe+Carry+01=00 Carry set. The carry being set here, causes an INX. Now we have ff+ff+01=02 ff, which is wrong, it should be 01 ff.

Now the question is, is it possible for such partial products to ever appear in a multiply?

Solving this is also tricky. You want x0*y1=ff and x1*y0=ff and x0*y0 (high byte)=01.

To solve this, I found the factors of 255: (1, 255) (3, 85) (5, 51) (15, 17) , and I knew x0*y0 had to be >256, and that the other factors had to be an equal distance from 85 (so they could add to 510). It's hard to explain right now, but you'll see that (3, 85+-1) works out perfectly.

Let me be clear now, the 16bit numbers to be multiplied that cause this bug are, 5456 * 0303.

This is an amazingly sneaky and subtle bug that randomized testing could never find. I suspect there's several more cases with higher precision values. It would be impossible even with todays computers to test every possible 32x32 multiply. This has to be carefully analyzed.

In a way, it's one of the most subtle bugs I've seen.

2017-04-13 09:40

JackAsser

Registered: Jun 2002
Posts: 2014

At each inx you simply need to have a clc also, not? :)

2017-04-13 10:09

Repose

Registered: Oct 2010
Posts: 225

Right, unless you know you will never multiply 5654 x 0303 :) It's gonna add back some 13 cycles to the 32x32 though, which is not much problem.

Edit: thought I lost 72 cycles for a moment there, whew!

2017-04-13 20:39

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting Repose

Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.

Au contraire! The carry is itself part of the running total.

As you say, after adding ff, ff and ff, we have A=$fe, C=1, X=2
so if we take our running total to be A+C, it is $ff, i.e. the expected two higher than the correct result. Continuing from your s2:

s2 stx hi
   sbc hi   ; this adds 255-x to the current value of A+C
   sta low  ; will now contain <(n1+n2+n3+0xff)

   lda hi
   adc#0
   sta hi   ; will now contain >(n1+n2+n3+0xff)

Note that the above computes the 16 bit sum of (n1+n2+n3+0xff), rather than (n1+n2+n3). But that's trivial to repair; just set carry at the very beginning, and do an adc#$ff instead of #0 at the beginning of the 'hi' sum

I've tested the above on 0+0+0, $ff+$ff+$ff, $80+$80+$00 and $b0+$b0+$9f; the last case tests for correct behaviour when the sbc hi underflows. All work!

2017-04-13 20:41

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quote: At each inx you simply need to have a clc also, not? :)

Yes, but the point of the post fixup is to remove them all!

It's only cost effective if you're adding more than about half a dozen addends, Repose mentioned something about that over on the other thread.

2017-04-13 20:43

Repose

Registered: Oct 2010
Posts: 225

Btw, that test case isn't valid for the way I'm adding the f(x) values in a row first (f(x)=x^2/256). So I really need to construct better values.

Sure, I can just avoid the buggy addition routine, but there's a big point here - randomized testing is very unlikely to ever show up the bug.

What we need is 100% test coverage of all the branches/carries to avoid any such bugs in the future. For example, getting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.

Now comes the interesting problem of which values cause carries in each sub-addition.

This is a very clear case where 100% testing is impossible, randomized testing will fail, and only careful analysis will catch bugs.

So I want to work on this and I believe CJ is as well.
I think it lies in finding factors and doing small sets of trials around that.

2017-04-13 20:48

ChristopherJam

Registered: Aug 2004
Posts: 1409

Oh, just tried ff+ff+01. Sums to 01ff as expected

2017-04-13 20:49

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting Repose

getting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.

..exactly why I went for a code generator.

But you're right, getting good test coverage for edge cases is hard

2017-04-13 20:50

Repose

Registered: Oct 2010
Posts: 225

Ok, seems I took a wrong turn somewhere, and that when you went down the optimization path with sbc tab,x you avoided my particular bug.

My point about test coverage is still valid though.

I'll have to grok your post later.

2017-04-13 20:54

ChristopherJam

Registered: Aug 2004
Posts: 1409

Quoting Repose

Ok, seems I took a wrong turn somewhere, and that when you went down the optimization path with sbc tab,x you avoided my particular bug.

The key is to not set carry before doing the sbc high - that SEC was throwing away information.
Quote:

My point about test coverage is still valid though.

Agreed.

2017-04-13 21:01

Repose

Registered: Oct 2010
Posts: 225

Ok quick post yes I realized I need to keep the last carry which is why I did adc #0 before the subtract, I think we're on the page.

2017-04-14 15:54

ChristopherJam

Registered: Aug 2004
Posts: 1409

So, for the NxN multiply, as Repose pointed out it's only worthwhile skipping the CLCs for columns with a reasonable number of addends. Coincidentally, all of those columns also require the addition of a non-zero offset, to account for my excess $4000 representation of the g() table.

I end each of those columns with the following sequence:
sbc id+($ff-Kn),x
sta res+n
txa

where Kn is the additional constant for that column (must be at least as high as the maximum value X can reach, lest the indexing could cross the page boundary)

I've just run an exhaustive test in VICE, comparing a more traditionally computed reference value with the optimised code above, and it matches for all 16842752 valid sets of inputs. I've included the test code below, eliding the loop iterator and display routines for brevity.

; test parameters are
;  A  (00..ff)
;  C  (00..01)
;  X  (00..ff)
;  K  (X..ff)   ; routine requires that X+(0xff-K)<= 0xff,  ie X<=K
;
; result should be A+C+K+255*X

;-----------------------------
;  calculate reference value
;-----------------------------

    lda tC
    lsr
    lda tA
    adc tK
    sta ref+0
    lda tX
    adc#0
    sta ref+1  ; ref is now A+C+K+256*X

    sec
    lda ref+0
    sbc tX
    sta ref+0
    lda ref+1
    sbc#0
    sta ref+1   ; ref is now A+C+K+255*X
;-----------------------------
; prepare registers for test
;-----------------------------
    lda tK
    eor#$ff
    sta rn+1
    lda tC
    lsr
    lda tA
    ldx tX
;-----------------------------
; run test code
;-----------------------------
rn  sbc id+$3f,x
    sta res+0
    txa
    adc#0
    sta res+1

;-----------------------------
; compare result to reference value
;-----------------------------
    lda res+0
    cmp ref+0
    bne fail
    lda res+1
    cmp ref+1
    bne fail

2017-04-15 06:39

Repose

Registered: Oct 2010
Posts: 225

I got it to work now thanks :)
And good news, I found something even faster now! It always beats the clc version, even for 3 adds. I might be able to beat whatever you have :)

See if you can generate 16x16 your way, that will be my first working example for now.

2017-04-15 06:42

ChristopherJam

Registered: Aug 2004
Posts: 1409

And now proof that the adc:bcc:inx sequence retains sufficient information to calculate the sum of the column, regardless of when carries occur.

If we define T to be equal to the current value of A+C+255*X,
it's possible to prove that after each adc:bcc:inx sequence, T increases by the value added.

; before:
; T=A+C+255*X, so A=T-C-255*X

then we execute

    adc #V
    bcc *+3
    inx

Let C', X' and A' be the values of carry, accumulator and X after running the code sequence.

Now, either the add didn't set C,
in which case

  C'=0
  X'=X
  A'=A+V+C
    =(T-C-255*X)+V+C
    =T+V-255*X

  T'=A'+C'+255*X'
    =(T+V-255*X)+0+255*X
    =T+V

or, the add did set C, in which case

  C'=1
  X'=X+1          # as the increment happens
  A'=A+V+C-256    # as the add has overflowed into C
    =(T-C-255*X)+V+C-256
    =T+V-255*X-256

  T'=A'+C'+255*X'
    =(T+V-255*X-256)+1+255*(X+1)
    = T+V-255*X-255+255*X+255
    = T+V

Either way, T has increased by the correct amount

2017-04-15 06:48

Repose

Registered: Oct 2010
Posts: 225

Nice proof! Couldn't have said it better myself.

2017-04-15 06:53

Repose

Registered: Oct 2010
Posts: 225

So my adds are 14+9*(n-1) for column 0, 11+9*(n-1) for the middle columns, and something decent for the last column (overhead 14 or less). The point being it always beats 12+10*(n-1) for the clc version.

2017-04-15 09:28

Repose

Registered: Oct 2010
Posts: 225

I figured this out my way, and why the new fix is better and why the old version failed.

Follow along with the registers and code in two cases:

	 A C  X r1 r0
ldx #0	     00	
sec	   1 
lda #ff ff  
adc #2	02 1
bcc s1
inx	     01
adc #ff	02 1
bcc s2
inx	     02
stx r1	        02
sbc r1	00 1
sta r0	           00
txa	02
sbc #0  02 1
sta r1	        02

	 A C  X r1 r0
ldx #0	     00	
sec	   1 
lda #ff ff  
adc #1	01 1
bcc s1
inx	     01
adc #ff	01 1
bcc s2
inx	     02
stx r1	        02
sbc r1	ff 0
sta r0	           ff
txa	02
sbc #0	01 1
sta r1	        01

Essentially, if carry is 0 after the low bytes, decrement the high byte. It's that fix to the high byte which was missing in my first version. I didn't realize how to test for that particular case, but all I needed was hiding in the Carry after the low bytes. The workthrough above is assuming you are only adding one column of numbers, if you continued to another column, you'd end it differently:

s2 sbc id,x
sta r0
;column 2
txa
sbc #0;correction to high byte
ldx #0
adc n1h;add high bytes

2017-04-16 15:49

ChristopherJam

Registered: Aug 2004
Posts: 1409

Ok, I think I follow how that last one works now.

A slight optimisation first, going to replace

    stx r1
    sbc r1
with
    sbc id,x

It's two cycles faster, only needs as many bytes of table as you have INXs (so, under a dozen), and makes it easier to
present the proof in the rest of this comment. Don't need to touch r1, as it gets overwritten later anyway.

Now, again defining T to be A+C+255*X

    ldx #0
a)  sec
     ;  initialise T to 1

    lda n1
     ;        now T is A+C+255*X = n1+1+255*0 = n1+1

    adc n2
    bcc *+3
    inx
     ;        now T is 1+n1+n2, cf proof in comment about the adc:bcc:inx sequence earlier in this thread

    adc n3
    bcc *+3
    inx
     ;        now T is 1+n1+n2+n3

  ; so all that's required now is adding A+C+255*X-1 and storing them

b)  sbc id,x  ; now sbc nn is the same as adc $ff-nn, so here we are adding $ff and subtracting X
    sta r0

    txa
c)  sbc #0    ; again, equivalent to adc $ff-0, so here we are adding x, $ff, and any carry from the low byte
    sta r1    ; ie, adding 256*X and $ff00 to the total in r1:r0

So, between points (a) (b) and (c), you've added 0001, 00ff, and ff00 to your total, which should result in carry being set
at the end of the routine, but no other effect on the result. Lines (b) and (c) contribute 255*x, leaving the sum in r1:r0

Usefully, the set carry flag is just what's needed for the start of the next column. Overall, nice work!

2017-04-16 16:28

Fresh

Registered: Jan 2005
Posts: 101

Looks like you've already reached your target, just want to share a similar approach that would avoid messing with the carry.
This would clearly need to be executed on ZP to be really efficient but, in that case, the overhead would be (a bit) less than yours.

	lax eorval+1
	ldy #$00
val1:
	axs #$00
	
val2:
	axs #$00
	bcs val3
	iny
val3:
	axs #$00
	bcs val4
	iny
val4:
	axs #$00
	bcs storehi
	iny	
	
storehi:
	sty r1
	txa
eorval:
	eor #$ff
	sta r0

2017-04-17 02:20

ChristopherJam

Registered: Aug 2004
Posts: 1409

It's certainly a fast way of adding a bunch of constant values, but I fear any speed advantage would more than counterbalanced by having to fetch the addends in an earlier code segment and writing them into the routine.

There are no registers left for addressing (eg) table entries.

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

Walt/Bonzai
Steffan/BOOM!
Mike
iAN CooG/HVSC
deetsay
Twilight/Excess/Arcade
XmikeX
Airwolf/F4CG
Copyfault/Extend^tsn..
New Design/Excess
Guests online: 138

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 Layers  (9.6)
2 No Listen  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top NTSC-Fixers

1 Pudwerx  (10)
2 Booze  (9.7)
3 Stormbringer  (9.7)
4 Fungus  (9.6)
5 Grim Reaper  (9.3)

Page generated in: 0.109 sec.