Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > sets of add/sub
2017-03-18 12:02
Repose

Registered: Oct 2010
Posts: 225
sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.
 
... 45 posts hidden. Click here to view all posts....
 
2017-03-22 09:48
Repose

Registered: Oct 2010
Posts: 225
Another free calculation on the Timer, remember it's 16bit so you can cause a borrow from the high byte, so lda TimerA+1:bmi xx or bne is like asking was there at least n counts (carries, borrows, whatever you're doing).

Say my adds were 6 cycles each, I'm adding 4 numbers, I start timer at $124, then lda TimerA+1:bne means no carries. Don't know how it's useful either, just throwing out a lot of ideas. You gotta brainstorm and not hold yourself back, useful stuff will stick eventually.

There's also two timers, you can have them read different values.

You don't need the sbc #constant because you should set the timer to #constant + variation at the start, then at the end the timer reads exactly the high byte +1 (since Timer can't be zero)
so
ldx timerA
dex
stx sum+1
2017-04-01 00:43
Repose

Registered: Oct 2010
Posts: 225
I've finished analyzing this thread.

My 2nd version (1)

clc:lda n1;5 cycles overhead
adc n2:bcs s1;5 if no carry
adc n3;3 which means 2 saved on last number
sta r0
lda #0:adc #0:sta r1:rts;10 w/o rts
s1 clc;8 if carry, so 5/8 avg 6.5
adc n3
sta r0
lda #1:adc #0:sta r1:rts

Total: 13+(n-1)*6.5
---------------
Frantic's suggestion (2)

This is the same as mine as inspired by Oswald

ldy #0:lda a:clc;7 cycles overhead
adc c:bcc +3:iny:clc;6/9 avg 7.5
adc e:bcc +3:iny:clc
...
sta r0:sty r1;6 cycles overhead

Total: 13+(n-1)*7.5, n is number of zp 8 bit numbers to be added to 16 bit result.
---------------
lft's suggestion (3)

ldy #0:lda a:clc;7
adc c:bcc +3:iny;6/7 avg 6.5
adc c:bcc +3:iny;note that carry wasn't reset so we falsely are +1
...
sty r1;save high byte of result
sec:sbc r1:sta r0;subract all the false carries for the low byte of result
;11

Total: 18+(n-1)*6.5

Note that (2) breaks even with (3) at n=5 or when 6 numbers are added together. With longer sequences, (3) is slightly faster.
---------------
CJ's suggestion (4)

This is a similar but expanded version of my (1)

lda n1;overhead 3
adc n2:bcc n1n2_0;7 per add if carry
clc
n1n2_1 adc n3:bcc n1n2n3_1
...
n1n2_0 adc n3:bcc ...;6 per add if no carry, avg 6.5
...
sta sum
lda #x:adc ...:sta sum+1;eqv. overhead is 13 to be almost exactly
like mine

Total: 13+(n-1)*6.5
can avg to (n-1)*5.5 if you use post correction.

---------------
My timer version (5)

overhead + (n-1)*5.5

Where the overhead is fairly big.

-----------
I only analyzed the low bytes, but it gives some idea of the approaches. (1)/(4) use a large amount of code and rely on the idea that the position in the code encodes the number of carries.

(2)/(3) use a register to track the carries and can be sped up over long runs by post correction of the low byte. The break-even point is at 6 numbers.

Finally (5) requires precise timing but can add long runs quickly. It's the same average speed as (4). It's up to you if you want to have huge code or short code but no screen or interrupts (screen timing could be factored out in some cases but not interrupts).

The differences are very small, from 5.5 to 7.5 per add.

My actual application is in adding the partial sums of large multiplies when using the squares multiply method. For a 32x32 unsigned multiply, there's up to 7 rows of numbers to be added, which is actually 7 subs (from squares subtraction) and 8 adds (previous carry and 7 rows). That's 15 numbers add/sub together with a final carry.

Conclusion: using (3) (with post correction) for most of the adds would be the fastest and convenient, at 6.5 per byte. If you went insane with code size, you could save another 72 cycles or so.

Estimating how much this would add to the multiply is a little tricky; the numbers themselves are part of the multiply, and the add/subs are integrated into the routine, so it's only the 'extra' part to track the carries that is important, which averages 4.5.
There's about 72 numbers add/sub so at least 324 beyond the usual 38 cycles per multiply.
I'm guessing between 800-1000 cycles in total, which means the c64 can reach 1000flops single precision :)
2017-04-01 00:55
Repose

Registered: Oct 2010
Posts: 225
For comparison, basic is 2300 and http://www.codebase64.org/doku.php?id=base:fast_floating_point_.. is 1300. I haven't estimated the floating point overhead to the multiply, but clearly my routine will be noticeably quicker.

I've also come to conclusions about the order of multiplies, but that's not about adding so I'll update the 'fast large multiplies' thread.

Thanks for all your input. Topic closed.
2017-04-03 21:44
Repose

Registered: Oct 2010
Posts: 225
Ok I've tested adding/subbing sequences of numbers now. There's one tricky bit.

First consider this routine for repeated subtractions:
(warning: following routine is wrong, which I explain why)
ldx #0
lda n1
sec;7 cycles overhead here
sbc n2
bcs s1;Carry is still set if n1>=n2 meaning no borrow
dex;keep track of borrows in X
s1 sbc n3
bcs s3;6 if no borrow, 7 if borrow, avg 6.5
dex
s3 stx r1;save high byte of result
sec
sbc r1;post correction of low byte
sta r0;save low byte
;post correction overhead is 5, 6 more to store result
Total time: 18+(n-1)*6.5 cycles

*note that without post correction, the timing is 12+7.5*(n-1), and they break even at 7 subs, which means it's faster to use post correction if you do 8+ subs in a row, but only saves a cycle per extra sub.

Now let's go through an example calculation.
0-FF-FF or 0-255-255 = FE 02 or -510.

If you get confused about 2's complement, look at it this way, what would you have to add to FE02 to get (01) 00 00?
FE+02=00 carry 1, this means you need to add 254 to the low byte to start.
Now the high byte, FE+carry(1)=FF. What do you need to add to the high byte? FF+01=(01) 00. Adding one to the high byte is like adding 256. In total, we added 256+254=510, so we must have had -510 to start with.

lda #0;n1=0
ldx #0;X tracks borrows, starts at 0.
sec;C=1
sbc #$ff;0-FF=1 as in -255 or (FF) 01, C=0 for borrow
bcs s1:dex;X=FF
s1 sbc #$ff;1-ff-not(C) or 1-ff-1=1, C=0 or borrow
bcs s2:dex:X=FE
s2 stx r1;store correct high byte as FE
sec
sbc r1;A=A-r1, 01-FE=03 as in 1-254=-253 or (FF) 03.

Our answer is too positive. What went wrong?

The idea of the post correction is that the running tally in A gets a little more wrong with each borrow, since we aren't using SEC before each subtraction. There were two borrows, but only one of them messed up our tally.
All we have to do is use CLC instead of SEC at the beginning, which seems weird but it works.

Here's the revised values:
C=0
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FF.
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FE.
Store correct high byte as FE.
C=1
Low byte is 00-FE=(FF) 02.
Final correct result is FE 02.

The same routine but ADC instead of SBC and INX instead of DEX also works, and can come before this. The only overhead is one extra CLC before the sequence of SBC starts.

You'd expect a similar bug in the adding version, but it works there because I start with CLC, and each carry messes up the running sum, so likewise, in the sub version, we must make sure each SBC messes up the tally, so we start with CLC as well.

In conclusion, you can add/sub long sequences of numbers with post correction with only 6.5 cycles per add, which is nearly the fastest possible if over 7 adds. You can add 257 numbers this way until X overflows.

I also want to point out, in my previous post where the carries are implied as the position in the code, the code branches grow so quickly that you need to add cycles for page boundary crossings in the branches.
2017-04-11 13:13
Repose

Registered: Oct 2010
Posts: 225
My clc fix was a little hasty, this works better:
ldx #0
clc
lda n1
adc n2
bcc s1
inx
adc #0;fix
sec;fix
s1 adc n3
bcc s2
inx
stx h
sec
sbc h
sta l


And something similar for subs. It's 2 cycles slower on avg though, however you can get that back with
sec
sbc id,x

In actual use you don't stx h, you immediately go to
txa
ldx #0


I have to do a bit more testing, post correction may not actually work and you have to use clc each time.
2017-04-11 19:20
Oswald

Registered: Apr 2002
Posts: 5094
random idea #2 one could accumulate C's with rol-s too


clc
lda n1
adc n2
ror selfm+1
adc n3
ror selfm+1
adc n4
ror selfm+1
selfm jmp fixhibyte
2017-04-11 21:43
Repose

Registered: Oct 2010
Posts: 225
That's an original idea, however that only gives 2 bytes between code, as the values could be 1,3,5,7 etc. If you give it a final ror, that still wouldn't be enough. I think you need 8 bytes to correct.
lda #correction:sta l:rts/bpl etc.
This pushes the overhead way up.
You could use the roring as an index to a table,
ldx correction
adc cortable,x
2017-04-12 10:42
Oswald

Registered: Apr 2002
Posts: 5094
you could shift in bits even from the most significant bit to solve the code gap.

reading table could be simpler:

ror selfmod+1

selfmod lda rortable
sta hibyte
2017-04-12 10:47
ChristopherJam

Registered: Aug 2004
Posts: 1409
It would even clear carry for you, so you wouldn't have to fixup :)

Unfortunately bcc *+3:inx is about 2.5 cycles faster per add. Pity, I prefer the ROR solution. It does take less instruction ram at least?
2017-04-12 21:17
Repose

Registered: Oct 2010
Posts: 225
Actually came up with yet another way, which I think is kinda cool, as it only works for 16x16 partials so it fits perfectly.


24bit+2x16bit add
This strange combination is equivalent to the sum of the partial products of
a 16x16 multiply.
a2 a1 a0
   b1 b0
+  c1 c0
--------
r2 r1 r0

2 clc
3 lda a0
3 adc b0
2 tax;a0+b0
3 lda a1
3 adc b1
2 tay;a1+b1
3 adc a2
3 sta r2
2 txa
;no clc as there's never a carry from r2, in a mult
3 adc c0
3 sta r0
2 tya
3 adc c1
3 sta r1
3 bcc s1;3/7 avg 5
5 inc r2
s1 rts

45 cycles
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Chesser/Blazon
Alakran_64
Freeze/Blazon
YPS
algorithm
Ben Breton/Absence o..
Guests online: 120
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 Layers  (9.6)
2 No Listen  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top NTSC-Fixers
1 Pudwerx  (10)
2 Booze  (9.7)
3 Stormbringer  (9.7)
4 Fungus  (9.6)
5 Grim Reaper  (9.3)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.062 sec.