Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > sets of add/sub
2017-03-18 12:02
Repose

Registered: Oct 2010
Posts: 225
sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.

First I'll look at adding 3x 16bit adds and generalizing

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.
 
... 45 posts hidden. Click here to view all posts....
 
2017-04-11 19:20
Oswald

Registered: Apr 2002
Posts: 5094
random idea #2 one could accumulate C's with rol-s too


clc
lda n1
adc n2
ror selfm+1
adc n3
ror selfm+1
adc n4
ror selfm+1
selfm jmp fixhibyte
2017-04-11 21:43
Repose

Registered: Oct 2010
Posts: 225
That's an original idea, however that only gives 2 bytes between code, as the values could be 1,3,5,7 etc. If you give it a final ror, that still wouldn't be enough. I think you need 8 bytes to correct.
lda #correction:sta l:rts/bpl etc.
This pushes the overhead way up.
You could use the roring as an index to a table,
ldx correction
adc cortable,x
2017-04-12 10:42
Oswald

Registered: Apr 2002
Posts: 5094
you could shift in bits even from the most significant bit to solve the code gap.

reading table could be simpler:

ror selfmod+1

selfmod lda rortable
sta hibyte
2017-04-12 10:47
ChristopherJam

Registered: Aug 2004
Posts: 1409
It would even clear carry for you, so you wouldn't have to fixup :)

Unfortunately bcc *+3:inx is about 2.5 cycles faster per add. Pity, I prefer the ROR solution. It does take less instruction ram at least?
2017-04-12 21:17
Repose

Registered: Oct 2010
Posts: 225
Actually came up with yet another way, which I think is kinda cool, as it only works for 16x16 partials so it fits perfectly.


24bit+2x16bit add
This strange combination is equivalent to the sum of the partial products of
a 16x16 multiply.
a2 a1 a0
   b1 b0
+  c1 c0
--------
r2 r1 r0

2 clc
3 lda a0
3 adc b0
2 tax;a0+b0
3 lda a1
3 adc b1
2 tay;a1+b1
3 adc a2
3 sta r2
2 txa
;no clc as there's never a carry from r2, in a mult
3 adc c0
3 sta r0
2 tya
3 adc c1
3 sta r1
3 bcc s1;3/7 avg 5
5 inc r2
s1 rts

45 cycles
2017-04-13 09:18
Repose

Registered: Oct 2010
Posts: 225
So bad news. The post correction idea has a subtle bug that can't be fixed.

I'll summarize.

This thread was exploring quick ways to add several 16 bit numbers quickly (in my case, I wanted to add a 24bit to two 16bit numbers, which corresponds to the addition of the partial products in an unsigned 16x16 multiply).

One of the solutions was to just go ahead and add them all without clearing carry and get the wrong answer. This was an attractive solution because it was fast, being only 6.5 cycles per add.

It looks like this:
clc
lda n1
adc n2
bcc s1
inx
s1 adc n3
bcc s2
inx
s2 sta low
stx high;or add the next column (the high bytes of n1-n3) with txa:ldx #0


Then the question was, how to fix that at the end? It seemed the obvious answer was:
s2 stx high
sec
sbc high
sta low;remove the carries which were added by NOT using CLC

Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.

Quick example,
ff+ff=fe with carry set, then +ff+Carry=fe. Only one false carry was added, but the high byte is 2, so 2 was subtracted from the low byte, giving fc, where the answer should be fd (ff+ff+ff=02 fd).

My first approach was to use SEC at the very start. Obviously this won't work, as it just adds 1 every time, so it won't work for 1+2+3.

Then I realized I need adc #0 at the very end, to offset A (the low byte) by an equal number of carries.

This works.

It looks like this:
s2 adc #0
stx high
sec
sbc high
sta low


The subtle bug is, the high byte can now be wrong, because the carries aren't always happening when they should be. This only shows up in the case ff+ff+01.
ff+ff=01 fe, Carry set (correct). fe+Carry+01=00 Carry set. The carry being set here, causes an INX. Now we have ff+ff+01=02 ff, which is wrong, it should be 01 ff.

Now the question is, is it possible for such partial products to ever appear in a multiply?

Solving this is also tricky. You want x0*y1=ff and x1*y0=ff and x0*y0 (high byte)=01.

To solve this, I found the factors of 255: (1, 255) (3, 85) (5, 51) (15, 17) , and I knew x0*y0 had to be >256, and that the other factors had to be an equal distance from 85 (so they could add to 510). It's hard to explain right now, but you'll see that (3, 85+-1) works out perfectly.

Let me be clear now, the 16bit numbers to be multiplied that cause this bug are, 5456 * 0303.

This is an amazingly sneaky and subtle bug that randomized testing could never find. I suspect there's several more cases with higher precision values. It would be impossible even with todays computers to test every possible 32x32 multiply. This has to be carefully analyzed.

In a way, it's one of the most subtle bugs I've seen.
2017-04-13 09:40
JackAsser

Registered: Jun 2002
Posts: 2014
At each inx you simply need to have a clc also, not? :)
2017-04-13 10:09
Repose

Registered: Oct 2010
Posts: 225
Right, unless you know you will never multiply 5654 x 0303 :) It's gonna add back some 13 cycles to the 32x32 though, which is not much problem.

Edit: thought I lost 72 cycles for a moment there, whew!
2017-04-13 20:39
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quoting Repose
Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.


Au contraire! The carry is itself part of the running total.

As you say, after adding ff, ff and ff, we have A=$fe, C=1, X=2
so if we take our running total to be A+C, it is $ff, i.e. the expected two higher than the correct result. Continuing from your s2:
s2 stx hi
   sbc hi   ; this adds 255-x to the current value of A+C
   sta low  ; will now contain <(n1+n2+n3+0xff)

   lda hi
   adc#0
   sta hi   ; will now contain >(n1+n2+n3+0xff)

Note that the above computes the 16 bit sum of (n1+n2+n3+0xff), rather than (n1+n2+n3). But that's trivial to repair; just set carry at the very beginning, and do an adc#$ff instead of #0 at the beginning of the 'hi' sum

I've tested the above on 0+0+0, $ff+$ff+$ff, $80+$80+$00 and $b0+$b0+$9f; the last case tests for correct behaviour when the sbc hi underflows. All work!
2017-04-13 20:41
ChristopherJam

Registered: Aug 2004
Posts: 1409
Quote: At each inx you simply need to have a clc also, not? :)

Yes, but the point of the post fixup is to remove them all!

It's only cost effective if you're adding more than about half a dozen addends, Repose mentioned something about that over on the other thread.
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
csabanw
Brataccas/HF
Genius/Xenon
JackAsser/Booze Design
Jammer
Guests online: 100
Top Demos
1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)
Top onefile Demos
1 No Listen  (9.6)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.6)
6 Dawnfall V1.1  (9.5)
7 Rainbow Connection  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)
Top Groups
1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)
Top Organizers
1 Burglar  (9.9)
2 Sixx  (9.8)
3 hedning  (9.7)
4 Irata  (9.7)
5 Tim  (9.7)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.06 sec.