| |
Repose
Registered: Oct 2010 Posts: 222 |
sets of add/sub
I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers, no hopes there, any pair can carry/borrow with about 50% chance.
First I'll look at adding 3x 16bit adds and generalizing
Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum
;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts
In general,
Time is 16+(n-2)*13+2=18+(n-2)*13
n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44
Fast Version 1
--------------
2 clc
3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles
Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.
The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------
In a partial sums routine, only some bytes of the product are added
case carries % total
l l h 2908442244 67.7 4294967296
h h l 2044355014 47.6 4294967296
Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The differences are so little that it hardly makes any difference.
ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example. |
|
... 45 posts hidden. Click here to view all posts.... |
| |
Repose
Registered: Oct 2010 Posts: 222 |
The breakthrough I had today, was to realize I don't have to reuse the multiplier pointers.
I was stuck on thinking I had to stuff the zp in the middle of an add when all registers were in use.
Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.
I'll have to try again with the "the position in code is that data" idea.
So here's the two variations:
1) as above, 3 setups, partials added in order
2) like codebase, 2 setups, partials added out of order.
Which one will reign victorious?
[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..
[2] http://www.6502.org/source/integers/fastmult.htm |
| |
JackAsser
Registered: Jun 2002 Posts: 1989 |
a-b+c-d+e-f = a+c+e - (b+d+f)
; Assuming 8-bit unsigned input and 16-bit unsigned output
lda #0
sta tmp1_hi
sta tmp2_hi
; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
inc tmp1_hi
clc
:
adc e
sta tmp1_lo
bcc :+
inc tmp1_hi
clc
:
; Perform tmp2=b+d+f
lda b
adc c
bcc :+
inc tmp2_hi
clc
:
adc f
sta tmp2_lo
bcc :+
inc tmp2_hi
:
; Perform tmp1-=tmp2
sec
lda tmp1_lo
sbc tmp2_lo
sta tmp1_lo
lda tmp2_hi
sbc tmp2_hi
sta tmp2_hi
; Result in tmp1
|
| |
Frantic
Registered: Mar 2003 Posts: 1627 |
If it is ok to use the y-register as well, one could do:
; Assuming 8-bit unsigned input and 16-bit unsigned output
ldy #0
; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
iny
clc
:
adc e
sta tmp1_lo
bcc :+
iny
clc
:
sty tmp1_hi
[...]
|
| |
lft
Registered: Jul 2007 Posts: 369 |
Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.
Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning. |
| |
JackAsser
Registered: Jun 2002 Posts: 1989 |
Quote: Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.
Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.
Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1378 |
We're all terrible at reading each other's comments. I wrote #5 having only read the thread up to #2, all three of Repose, Frantic and myself have proposed different ways of accumulating the carries from the low byte computation, and both lft and myself have independently proposed fixing up the low byte in post rather than clearing carry as you go.
I do like lft's contribution (addition?) of using the carry from the last addition as inverse borrow for the fixup, mind :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1378 |
Quoting JackAsserIndeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.
I wonder if you could just get away with checking the Y value every N potential overflows, and just see if it has decreased since the last test... it'd be quicker than doing a branch every time.
N may well be as high as 255; I'd have to think a bit harder to be sure of that. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting ChristopherJamWe're all terrible at reading each other's comments.
Whoops, you're quite right. Sorry! |
| |
Fresh
Registered: Jan 2005 Posts: 101 |
While thinking about this topic I've come up with a completely different approach which is not ideal for this case but may be useful if we need to add more numbers. Hope this is not too off-topic.
Beware, I haven't checked if this needs some kind of CIA version fix: this works fine with old CIAs.
BTW, I'm not using the carry.
.const zp0=$fb
.const zp1=$fc
.const zp2=$fd
.pc=$1000
start:
sei
!: bit $d011
bpl !- // No BLs!
lda zp1
sta b+1
lda zp2
sta c+1
lda #$00
sta $dd0f
sta $dd06
lda #$01
sta $dd07
sta $dd0f
ldy zp0 // 3
b:
ldx tab,y // 4/5
c:
ldy tab,x // 4/5
// Here we could add more ldx/ldy
lo:
ldx $dd06 // 4 => Wait=15/17
lda carrytab,x
// HI/LO in A/Y
sty $63
sta $62
cli
jmp $bdd1
.align $0100
tab:
.fill 512,i&$ff
carrytab:
.fill 256,[$f3-i]&$ff
|
| |
Repose
Registered: Oct 2010 Posts: 222 |
I can't follow your code yet, but an idea just came to me (which probably you used), that any addition crossing a page boundary takes one more cycle and you can simply difference the cycles to find the number of carries. Very clever!
My own idea which just came to mind based on this concept, is
addtab is a 510 byte table, starting at a page boundary (very important!), that is just a count (0,1,1...255,0..)
ldx n1
lda n2
sta sm1+1
lda n3
sta sm2+1;or just imagine the numbers are already stored there
;initialize and start the timer to a high value
lda TimerA_Low
sta start
sm1 lda addtab,x; if addtab+x crosses page, add 1 cycle
tax; transfer running sum (n1+n2) to X
sm2 lda addtab,x;add n3 to running sum
...
sta r0;the low byte of sum
lda start
sec
sbc timerAlo
sbc #some constant
sta r1;the high byte of sum
It's not fast but, it's an interesting concept that could be applied to optimized code somehow.
If I store n2,n3... directly in sm1+1, sm2+1..., then it optimizes out to:
ldx n1
sm1 lda addtab,x;n1+n2
tax
sm2 lda addtab,x;n3+ (n1+n2)
tax
sm3 lda addtab,x;n4+ (n1+n2+n3)
...
carries are time of code minus the constant 6*n (the number of values added)
This is also quite faster so it's actually practical!
I just invented the missing add instruction. The indexed addressing is doing the adds, and the timer is doing the carries, in parallel. |
Previous - 1 | 2 | 3 | 4 | 5 | 6 - Next |