Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > sets of add/sub
2017-03-18 12:02
Repose

Registered: Oct 2010
Posts: 225
sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.
 
... 45 posts hidden. Click here to view all posts....
 
2017-03-22 08:29
Repose

Registered: Oct 2010
Posts: 225
bcs/bcc, bmi/bpl, bvc can all be used in this technique to accumulate the times those flags were in a certain state. By extension, you can accumulate times a certain bit was set or value was reached, like cmp #$40:bcc *+2:
So think of the timer as an extra register that you can increment or decrement (by using the opposite branch), using branches or page crossings, that is only fast to use over a longish sequence. It can do corrections after the fact or count things. There is also a free divide, IIRC you can set the speed the timer counts? Or is it fixed to decrement each clock cycle? No matter, use the SID voice 3 output and set the frequency for a free division.

Question, what use can this be put to? And can you use the flags as part of a calculation ,while using the accumulation of times the flags happened for a parallel computation using the timer (which has no sideeffects) to do double the work? (I mean, use the branch to do one thing, then at the end do another thing with the timer value).

In another direction, can you do some computation unrolled then interrupt it at a certain time to see the value and that's your result?

example,
rol
bcs +2
inx
... x8
interrupt after 24 cycles
X will be 4-(popcnt(A and $#f)/6) or something (the full formula is kinda long)

Another idea I've used to to make 'infinite' loops that can only be stopped by an interrupt
lda #0
sta $4000
sta $4001
..
decide which area of the bitmap you want to fill
jump into the code
stop it with an interrupt
of course you can 'plot' an RTS into the middle as well, then restore it.
Interrupt idea might turn out faster

This is getting off topic though.
2017-03-22 08:42
ChristopherJam

Registered: Aug 2004
Posts: 1409
I've definitely considered using an interrupt to terminate an unrolled fill loop, but yes, plotting an RTS seemed like a saner option at the time.
2017-03-22 08:53
ChristopherJam

Registered: Aug 2004
Posts: 1409
Could also use CIA to see count how many iterations a fractal takes to escape, freeing up registers for state :)
2017-03-22 08:55
Pex Mahoney Tufvesson

Registered: Sep 2003
Posts: 52
I love you guys! :-D
I've used timers a lot for stopping infinite loops. One example is the part in BitLive4-demo where I draw a full-screen all-border white/black line using $d020 and CPU only. No time for keeping track of where to stop. :)
---
Have a noise night!
http://mahoney.c64.org
2017-03-22 09:04
Repose

Registered: Oct 2010
Posts: 225
@cj Very useful idea, that would definitely save time. In the end, just use the Timer value as a colour and plot it. It doesn't really matter the values involved, you can always arrange them to be something different depending on escape "time" literally... then animate the 'palette' by starting one cycle later..
See what I mean? I'm saying if you plot a fractal in real time in one frame say, then the colours are determined by the times to iterate the formula, and that value from the timer directly translate to a colour, then starting later shifts the colour palette down one, then you have a free animated fractal.

@mahooney Nice, that's a very good use because it's the only way possible to create an effect. Will check out your demo thanks for the comment :)
2017-03-22 09:48
Repose

Registered: Oct 2010
Posts: 225
Another free calculation on the Timer, remember it's 16bit so you can cause a borrow from the high byte, so lda TimerA+1:bmi xx or bne is like asking was there at least n counts (carries, borrows, whatever you're doing).

Say my adds were 6 cycles each, I'm adding 4 numbers, I start timer at $124, then lda TimerA+1:bne means no carries. Don't know how it's useful either, just throwing out a lot of ideas. You gotta brainstorm and not hold yourself back, useful stuff will stick eventually.

There's also two timers, you can have them read different values.

You don't need the sbc #constant because you should set the timer to #constant + variation at the start, then at the end the timer reads exactly the high byte +1 (since Timer can't be zero)
so
ldx timerA
dex
stx sum+1
2017-04-01 00:43
Repose

Registered: Oct 2010
Posts: 225
I've finished analyzing this thread.

My 2nd version (1)

clc:lda n1;5 cycles overhead
adc n2:bcs s1;5 if no carry
adc n3;3 which means 2 saved on last number
sta r0
lda #0:adc #0:sta r1:rts;10 w/o rts
s1 clc;8 if carry, so 5/8 avg 6.5
adc n3
sta r0
lda #1:adc #0:sta r1:rts

Total: 13+(n-1)*6.5
---------------
Frantic's suggestion (2)

This is the same as mine as inspired by Oswald

ldy #0:lda a:clc;7 cycles overhead
adc c:bcc +3:iny:clc;6/9 avg 7.5
adc e:bcc +3:iny:clc
...
sta r0:sty r1;6 cycles overhead

Total: 13+(n-1)*7.5, n is number of zp 8 bit numbers to be added to 16 bit result.
---------------
lft's suggestion (3)

ldy #0:lda a:clc;7
adc c:bcc +3:iny;6/7 avg 6.5
adc c:bcc +3:iny;note that carry wasn't reset so we falsely are +1
...
sty r1;save high byte of result
sec:sbc r1:sta r0;subract all the false carries for the low byte of result
;11

Total: 18+(n-1)*6.5

Note that (2) breaks even with (3) at n=5 or when 6 numbers are added together. With longer sequences, (3) is slightly faster.
---------------
CJ's suggestion (4)

This is a similar but expanded version of my (1)

lda n1;overhead 3
adc n2:bcc n1n2_0;7 per add if carry
clc
n1n2_1 adc n3:bcc n1n2n3_1
...
n1n2_0 adc n3:bcc ...;6 per add if no carry, avg 6.5
...
sta sum
lda #x:adc ...:sta sum+1;eqv. overhead is 13 to be almost exactly
like mine

Total: 13+(n-1)*6.5
can avg to (n-1)*5.5 if you use post correction.

---------------
My timer version (5)

overhead + (n-1)*5.5

Where the overhead is fairly big.

-----------
I only analyzed the low bytes, but it gives some idea of the approaches. (1)/(4) use a large amount of code and rely on the idea that the position in the code encodes the number of carries.

(2)/(3) use a register to track the carries and can be sped up over long runs by post correction of the low byte. The break-even point is at 6 numbers.

Finally (5) requires precise timing but can add long runs quickly. It's the same average speed as (4). It's up to you if you want to have huge code or short code but no screen or interrupts (screen timing could be factored out in some cases but not interrupts).

The differences are very small, from 5.5 to 7.5 per add.

My actual application is in adding the partial sums of large multiplies when using the squares multiply method. For a 32x32 unsigned multiply, there's up to 7 rows of numbers to be added, which is actually 7 subs (from squares subtraction) and 8 adds (previous carry and 7 rows). That's 15 numbers add/sub together with a final carry.

Conclusion: using (3) (with post correction) for most of the adds would be the fastest and convenient, at 6.5 per byte. If you went insane with code size, you could save another 72 cycles or so.

Estimating how much this would add to the multiply is a little tricky; the numbers themselves are part of the multiply, and the add/subs are integrated into the routine, so it's only the 'extra' part to track the carries that is important, which averages 4.5.
There's about 72 numbers add/sub so at least 324 beyond the usual 38 cycles per multiply.
I'm guessing between 800-1000 cycles in total, which means the c64 can reach 1000flops single precision :)
2017-04-01 00:55
Repose

Registered: Oct 2010
Posts: 225
For comparison, basic is 2300 and http://www.codebase64.org/doku.php?id=base:fast_floating_point_.. is 1300. I haven't estimated the floating point overhead to the multiply, but clearly my routine will be noticeably quicker.

I've also come to conclusions about the order of multiplies, but that's not about adding so I'll update the 'fast large multiplies' thread.

Thanks for all your input. Topic closed.
2017-04-03 21:44
Repose

Registered: Oct 2010
Posts: 225
Ok I've tested adding/subbing sequences of numbers now. There's one tricky bit.

First consider this routine for repeated subtractions:
(warning: following routine is wrong, which I explain why)
ldx #0
lda n1
sec;7 cycles overhead here
sbc n2
bcs s1;Carry is still set if n1>=n2 meaning no borrow
dex;keep track of borrows in X
s1 sbc n3
bcs s3;6 if no borrow, 7 if borrow, avg 6.5
dex
s3 stx r1;save high byte of result
sec
sbc r1;post correction of low byte
sta r0;save low byte
;post correction overhead is 5, 6 more to store result
Total time: 18+(n-1)*6.5 cycles

*note that without post correction, the timing is 12+7.5*(n-1), and they break even at 7 subs, which means it's faster to use post correction if you do 8+ subs in a row, but only saves a cycle per extra sub.

Now let's go through an example calculation.
0-FF-FF or 0-255-255 = FE 02 or -510.

If you get confused about 2's complement, look at it this way, what would you have to add to FE02 to get (01) 00 00?
FE+02=00 carry 1, this means you need to add 254 to the low byte to start.
Now the high byte, FE+carry(1)=FF. What do you need to add to the high byte? FF+01=(01) 00. Adding one to the high byte is like adding 256. In total, we added 256+254=510, so we must have had -510 to start with.

lda #0;n1=0
ldx #0;X tracks borrows, starts at 0.
sec;C=1
sbc #$ff;0-FF=1 as in -255 or (FF) 01, C=0 for borrow
bcs s1:dex;X=FF
s1 sbc #$ff;1-ff-not(C) or 1-ff-1=1, C=0 or borrow
bcs s2:dex:X=FE
s2 stx r1;store correct high byte as FE
sec
sbc r1;A=A-r1, 01-FE=03 as in 1-254=-253 or (FF) 03.

Our answer is too positive. What went wrong?

The idea of the post correction is that the running tally in A gets a little more wrong with each borrow, since we aren't using SEC before each subtraction. There were two borrows, but only one of them messed up our tally.
All we have to do is use CLC instead of SEC at the beginning, which seems weird but it works.

Here's the revised values:
C=0
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FF.
00-FF-not(C)=0-FF-1=(FF) 00. C=0, X=FE.
Store correct high byte as FE.
C=1
Low byte is 00-FE=(FF) 02.
Final correct result is FE 02.

The same routine but ADC instead of SBC and INX instead of DEX also works, and can come before this. The only overhead is one extra CLC before the sequence of SBC starts.

You'd expect a similar bug in the adding version, but it works there because I start with CLC, and each carry messes up the running sum, so likewise, in the sub version, we must make sure each SBC messes up the tally, so we start with CLC as well.

In conclusion, you can add/sub long sequences of numbers with post correction with only 6.5 cycles per add, which is nearly the fastest possible if over 7 adds. You can add 257 numbers this way until X overflows.

I also want to point out, in my previous post where the carries are implied as the position in the code, the code branches grow so quickly that you need to add cycles for page boundary crossings in the branches.
2017-04-11 13:13
Repose

Registered: Oct 2010
Posts: 225
My clc fix was a little hasty, this works better:
ldx #0
clc
lda n1
adc n2
bcc s1
inx
adc #0;fix
sec;fix
s1 adc n3
bcc s2
inx
stx h
sec
sbc h
sta l


And something similar for subs. It's 2 cycles slower on avg though, however you can get that back with
sec
sbc id,x

In actual use you don't stx h, you immediately go to
txa
ldx #0


I have to do a bit more testing, post correction may not actually work and you have to use clc each time.
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Didi/Laxity
Alakran_64
Yogibear/Protovision
Guests online: 131
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 No Listen  (9.6)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top Diskmag Editors
1 Magic  (9.8)
2 hedning  (9.6)
3 Jazzcat  (9.5)
4 Elwix  (9.1)
5 Remix  (9.1)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.068 sec.