Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
 Welcome to our latest new user Druid ! (Registered 2017-06-23) You are not logged in 
CSDb User Forums

Forums > C64 Coding > sets of add/sub
2017-03-18 13:02

Registered: Oct 2010
Posts: 129
sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.
... 45 posts hidden. Click here to view all posts....
2017-04-13 23:01

Registered: Oct 2010
Posts: 129
Ok quick post yes I realized I need to keep the last carry which is why I did adc #0 before the subtract, I think we're on the page.
2017-04-14 17:54

Registered: Aug 2004
Posts: 601
So, for the NxN multiply, as Repose pointed out it's only worthwhile skipping the CLCs for columns with a reasonable number of addends. Coincidentally, all of those columns also require the addition of a non-zero offset, to account for my excess $4000 representation of the g() table.

I end each of those columns with the following sequence:
sbc id+($ff-Kn),x
sta res+n

where Kn is the additional constant for that column (must be at least as high as the maximum value X can reach, lest the indexing could cross the page boundary)

I've just run an exhaustive test in VICE, comparing a more traditionally computed reference value with the optimised code above, and it matches for all 16842752 valid sets of inputs. I've included the test code below, eliding the loop iterator and display routines for brevity.

; test parameters are
;  A  (00..ff)
;  C  (00..01)
;  X  (00..ff)
;  K  (X..ff)   ; routine requires that X+(0xff-K)<= 0xff,  ie X<=K
; result should be A+C+K+255*X

;  calculate reference value

    lda tC
    lda tA
    adc tK
    sta ref+0
    lda tX
    sta ref+1  ; ref is now A+C+K+256*X

    lda ref+0
    sbc tX
    sta ref+0
    lda ref+1
    sta ref+1   ; ref is now A+C+K+255*X
; prepare registers for test
    lda tK
    sta rn+1
    lda tC
    lda tA
    ldx tX
; run test code
rn  sbc id+$3f,x
    sta res+0
    sta res+1

; compare result to reference value
    lda res+0
    cmp ref+0
    bne fail
    lda res+1
    cmp ref+1
    bne fail
2017-04-15 08:39

Registered: Oct 2010
Posts: 129
I got it to work now thanks :)
And good news, I found something even faster now! It always beats the clc version, even for 3 adds. I might be able to beat whatever you have :)

See if you can generate 16x16 your way, that will be my first working example for now.
2017-04-15 08:42

Registered: Aug 2004
Posts: 601
And now proof that the adc:bcc:inx sequence retains sufficient information to calculate the sum of the column, regardless of when carries occur.

If we define T to be equal to the current value of A+C+255*X,
it's possible to prove that after each adc:bcc:inx sequence, T increases by the value added.

; before:
; T=A+C+255*X, so A=T-C-255*X

then we execute
    adc #V
    bcc *+3

Let C', X' and A' be the values of carry, accumulator and X after running the code sequence.

Now, either the add didn't set C,
in which case


or, the add did set C, in which case
  X'=X+1          # as the increment happens
  A'=A+V+C-256    # as the add has overflowed into C

    = T+V-255*X-255+255*X+255
    = T+V

Either way, T has increased by the correct amount
2017-04-15 08:48

Registered: Oct 2010
Posts: 129
Nice proof! Couldn't have said it better myself.
2017-04-15 08:53

Registered: Oct 2010
Posts: 129
So my adds are 14+9*(n-1) for column 0, 11+9*(n-1) for the middle columns, and something decent for the last column (overhead 14 or less). The point being it always beats 12+10*(n-1) for the clc version.
2017-04-15 11:28

Registered: Oct 2010
Posts: 129
I figured this out my way, and why the new fix is better and why the old version failed.

Follow along with the registers and code in two cases:
	 A C  X r1 r0
ldx #0	     00	
sec	   1 
lda #ff ff  
adc #2	02 1
bcc s1
inx	     01
adc #ff	02 1
bcc s2
inx	     02
stx r1	        02
sbc r1	00 1
sta r0	           00
txa	02
sbc #0  02 1
sta r1	        02

	 A C  X r1 r0
ldx #0	     00	
sec	   1 
lda #ff ff  
adc #1	01 1
bcc s1
inx	     01
adc #ff	01 1
bcc s2
inx	     02
stx r1	        02
sbc r1	ff 0
sta r0	           ff
txa	02
sbc #0	01 1
sta r1	        01

Essentially, if carry is 0 after the low bytes, decrement the high byte. It's that fix to the high byte which was missing in my first version. I didn't realize how to test for that particular case, but all I needed was hiding in the Carry after the low bytes. The workthrough above is assuming you are only adding one column of numbers, if you continued to another column, you'd end it differently:
s2 sbc id,x
sta r0
;column 2
sbc #0;correction to high byte
ldx #0
adc n1h;add high bytes
2017-04-16 17:49

Registered: Aug 2004
Posts: 601
Ok, I think I follow how that last one works now.

A slight optimisation first, going to replace
    stx r1
    sbc r1
    sbc id,x

It's two cycles faster, only needs as many bytes of table as you have INXs (so, under a dozen), and makes it easier to
present the proof in the rest of this comment. Don't need to touch r1, as it gets overwritten later anyway.

Now, again defining T to be A+C+255*X
    ldx #0
a)  sec
     ;  initialise T to 1

    lda n1
     ;        now T is A+C+255*X = n1+1+255*0 = n1+1

    adc n2
    bcc *+3
     ;        now T is 1+n1+n2, cf proof in comment about the adc:bcc:inx sequence earlier in this thread

    adc n3
    bcc *+3
     ;        now T is 1+n1+n2+n3

  ; so all that's required now is adding A+C+255*X-1 and storing them

b)  sbc id,x  ; now sbc nn is the same as adc $ff-nn, so here we are adding $ff and subtracting X
    sta r0

c)  sbc #0    ; again, equivalent to adc $ff-0, so here we are adding x, $ff, and any carry from the low byte
    sta r1    ; ie, adding 256*X and $ff00 to the total in r1:r0

So, between points (a) (b) and (c), you've added 0001, 00ff, and ff00 to your total, which should result in carry being set
at the end of the routine, but no other effect on the result. Lines (b) and (c) contribute 255*x, leaving the sum in r1:r0

Usefully, the set carry flag is just what's needed for the start of the next column. Overall, nice work!
2017-04-16 18:28

Registered: Jan 2005
Posts: 78
Looks like you've already reached your target, just want to share a similar approach that would avoid messing with the carry.
This would clearly need to be executed on ZP to be really efficient but, in that case, the overhead would be (a bit) less than yours.

	lax eorval+1
	ldy #$00
	axs #$00
	axs #$00
	bcs val3
	axs #$00
	bcs val4
	axs #$00
	bcs storehi
	sty r1
	eor #$ff
	sta r0
2017-04-17 04:20

Registered: Aug 2004
Posts: 601
It's certainly a fast way of adding a bunch of constant values, but I fear any speed advantage would more than counterbalanced by having to fetch the addends in an earlier code segment and writing them into the routine.

There are no registers left for addressing (eg) table entries.
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Users Online
Bytebreaker/Hokuto F..
Guests online: 85
Top Demos
1 Uncensored  (9.7)
2 Edge of Disgrace  (9.7)
3 ES1RA  (9.6)
4 The Shores of Reflec..  (9.6)
5 Coma Light 13  (9.6)
6 Lunatico  (9.6)
7 Comaland 100%  (9.5)
8 Incoherent Nightmare  (9.5)
9 Wonderland XII  (9.5)
10 Comaland  (9.5)
Top onefile Demos
1 Dawnfall V1.1  (9.5)
2 Daah, Those Acid Pil..  (9.5)
3 Treu Love [reu]  (9.4)
4 Dawnfall  (9.3)
5 SidRok  (9.3)
6 Achtung 5 Years Mayd..  (9.3)
7 One-Der  (9.2)
8 Tunnel Vision  (9.2)
9 Globe 2016 [reu]  (9.1)
10 Safe VSP  (9.1)
Top Groups
1 Booze Design  (9.4)
2 Oxyron  (9.4)
3 Censor Design  (9.4)
4 Crest  (9.3)
5 Arsenic  (9.3)
Top NTSC-Fixers
1 Pudwerx  (9.8)
2 Horizon  (9.7)
3 Stormbringer  (9.6)
4 Fungus  (9.6)
5 Booze  (9.5)

Home - Disclaimer
Copyright © No Name 2001-2017
Page generated in: 1.975 sec.