| | Repose
Registered: Oct 2010 Posts: 225 |
sets of add/sub
I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.
First I'll look at adding 3x 16bit adds and generalizing
Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum
;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts
In general,
Time is 16+(n-2)*13+2=18+(n-2)*13
n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44
Fast Version 1
--------------
2 clc
3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles
Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.
The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------
In a partial sums routine, only some bytes of the product are added
case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296
Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.
ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example. |
|
... 45 posts hidden. Click here to view all posts.... |
| | Repose
Registered: Oct 2010 Posts: 225 |
So bad news. The post correction idea has a subtle bug that can't be fixed.
I'll summarize.
This thread was exploring quick ways to add several 16 bit numbers quickly (in my case, I wanted to add a 24bit to two 16bit numbers, which corresponds to the addition of the partial products in an unsigned 16x16 multiply).
One of the solutions was to just go ahead and add them all without clearing carry and get the wrong answer. This was an attractive solution because it was fast, being only 6.5 cycles per add.
It looks like this:
clc
lda n1
adc n2
bcc s1
inx
s1 adc n3
bcc s2
inx
s2 sta low
stx high;or add the next column (the high bytes of n1-n3) with txa:ldx #0
Then the question was, how to fix that at the end? It seemed the obvious answer was:
s2 stx high
sec
sbc high
sta low;remove the carries which were added by NOT using CLC
Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.
Quick example,
ff+ff=fe with carry set, then +ff+Carry=fe. Only one false carry was added, but the high byte is 2, so 2 was subtracted from the low byte, giving fc, where the answer should be fd (ff+ff+ff=02 fd).
My first approach was to use SEC at the very start. Obviously this won't work, as it just adds 1 every time, so it won't work for 1+2+3.
Then I realized I need adc #0 at the very end, to offset A (the low byte) by an equal number of carries.
This works.
It looks like this:
s2 adc #0
stx high
sec
sbc high
sta low
The subtle bug is, the high byte can now be wrong, because the carries aren't always happening when they should be. This only shows up in the case ff+ff+01.
ff+ff=01 fe, Carry set (correct). fe+Carry+01=00 Carry set. The carry being set here, causes an INX. Now we have ff+ff+01=02 ff, which is wrong, it should be 01 ff.
Now the question is, is it possible for such partial products to ever appear in a multiply?
Solving this is also tricky. You want x0*y1=ff and x1*y0=ff and x0*y0 (high byte)=01.
To solve this, I found the factors of 255: (1, 255) (3, 85) (5, 51) (15, 17) , and I knew x0*y0 had to be >256, and that the other factors had to be an equal distance from 85 (so they could add to 510). It's hard to explain right now, but you'll see that (3, 85+-1) works out perfectly.
Let me be clear now, the 16bit numbers to be multiplied that cause this bug are, 5456 * 0303.
This is an amazingly sneaky and subtle bug that randomized testing could never find. I suspect there's several more cases with higher precision values. It would be impossible even with todays computers to test every possible 32x32 multiply. This has to be carefully analyzed.
In a way, it's one of the most subtle bugs I've seen. |
| | JackAsser
Registered: Jun 2002 Posts: 2014 |
At each inx you simply need to have a clc also, not? :) |
| | Repose
Registered: Oct 2010 Posts: 225 |
Right, unless you know you will never multiply 5654 x 0303 :) It's gonna add back some 13 cycles to the 32x32 though, which is not much problem.
Edit: thought I lost 72 cycles for a moment there, whew! |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeThen I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.
Au contraire! The carry is itself part of the running total.
As you say, after adding ff, ff and ff, we have A=$fe, C=1, X=2
so if we take our running total to be A+C, it is $ff, i.e. the expected two higher than the correct result. Continuing from your s2:
s2 stx hi
sbc hi ; this adds 255-x to the current value of A+C
sta low ; will now contain <(n1+n2+n3+0xff)
lda hi
adc#0
sta hi ; will now contain >(n1+n2+n3+0xff)
Note that the above computes the 16 bit sum of (n1+n2+n3+0xff), rather than (n1+n2+n3). But that's trivial to repair; just set carry at the very beginning, and do an adc#$ff instead of #0 at the beginning of the 'hi' sum
I've tested the above on 0+0+0, $ff+$ff+$ff, $80+$80+$00 and $b0+$b0+$9f; the last case tests for correct behaviour when the sbc hi underflows. All work! |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quote: At each inx you simply need to have a clc also, not? :)
Yes, but the point of the post fixup is to remove them all!
It's only cost effective if you're adding more than about half a dozen addends, Repose mentioned something about that over on the other thread. |
| | Repose
Registered: Oct 2010 Posts: 225 |
Btw, that test case isn't valid for the way I'm adding the f(x) values in a row first (f(x)=x^2/256). So I really need to construct better values.
Sure, I can just avoid the buggy addition routine, but there's a big point here - randomized testing is very unlikely to ever show up the bug.
What we need is 100% test coverage of all the branches/carries to avoid any such bugs in the future. For example, getting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.
Now comes the interesting problem of which values cause carries in each sub-addition.
This is a very clear case where 100% testing is impossible, randomized testing will fail, and only careful analysis will catch bugs.
So I want to work on this and I believe CJ is as well.
I think it lies in finding factors and doing small sets of trials around that. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, just tried ff+ff+01. Sums to 01ff as expected |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Reposegetting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.
..exactly why I went for a code generator.
But you're right, getting good test coverage for edge cases is hardÂ… |
| | Repose
Registered: Oct 2010 Posts: 225 |
Ok, seems I took a wrong turn somewhere, and that when you went down the optimization path with sbc tab,x you avoided my particular bug.
My point about test coverage is still valid though.
I'll have to grok your post later. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting ReposeOk, seems I took a wrong turn somewhere, and that when you went down the optimization path with sbc tab,x you avoided my particular bug.
The key is to not set carry before doing the sbc high - that SEC was throwing away information.
Quote:My point about test coverage is still valid though.
Agreed. |
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next | |