CSDb User Forums


2017-03-18 12:02
Repose

Registered: Oct 2010
Posts: 222
sets of add/sub

I want to do a-b+c-d+e-f. I wonder if you can save some cycles somewhere? I looked at the stats of the actual numbers and there's no hope there: any pair can carry/borrow with about a 50% chance.

First I'll look at adding three 8-bit numbers into a 16-bit sum, then generalize.

Standard version
-----------------
We will keep the running sum in a register
n1,n2,n3 - numbers to add
r, r+1,... - the sum

;assume n1 stored in r
2 clc
3 lda r
3 adc n2
2 tax;n2+n1 save low byte in X
2 lda #0;small optimization first time when you know r+1 is 0.
2 adc #0; basically convert carry to 1
2 tay;16c save high byte in Y
;-- repeat for n>=3 (n is number of numbers to add)
;clc not needed
2 txa
3 adc n3
3 sta r;r+n3=n1+n2+n3 or tax until final add
2 tya
2 adc #0
3 sta r+1;16+15=31 or tay until final add
rts

In general,
Time is 16+(n-2)*13+2=18+(n-2)*13

n time, cycles
2 18 =18
3 18+13 =31
4 18+13+13 =44
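
The formula can be checked by adding up the per-instruction cycle counts from the listing. This is a small Python sketch, not 6502 code; the names FIRST_ADD, NEXT_ADD and standard_cycles are made up for illustration, and zero-page operands are assumed, as the 3-cycle lda/adc/sta counts imply. The rts is left out, as in the totals above.

```python
# Cycle counts taken from the listing above (zero-page operands assumed).
FIRST_ADD = [2, 3, 3, 2, 2, 2, 2]   # clc, lda r, adc n2, tax, lda #0, adc #0, tay
NEXT_ADD = [2, 3, 2, 2, 2, 2]       # txa, adc n, tax, tya, adc #0, tay
FINAL_STORE_EXTRA = 2               # the last tax/tay pair becomes sta r / sta r+1


def standard_cycles(n):
    """Cycles to add n 8-bit numbers into a 16-bit sum with the standard version."""
    if n < 2:
        raise ValueError("need at least two numbers")
    return sum(FIRST_ADD) + (n - 2) * sum(NEXT_ADD) + FINAL_STORE_EXTRA


for n in (2, 3, 4):
    print(n, standard_cycles(n))
```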

Fast Version 1
--------------

2 clc

3 lda n2
3 adc r;n2+n1
2 bcs s1;2+9=11 if taken
3 adc n3
3 sta r
2 lda #0
2 adc #0
3 sta r+1;2+8+13=23 if no carry in n1+n2
rts;or bcc s2 =26
;11c
2 s1 clc;there was a carry in first add, A=n1+n2
3 adc n3
3 sta r
2 lda #1;representing the carry from n1+n2>255
2 adc #0
3 sta r+1;11+15=26 if carry in n1+n2
s2 rts;23/26, average 24.5, saving 6.5 cycles from standard (31c)
realistically always 26 or saving 5 cycles
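
To double-check both the result and the 23/26 cycle counts, here is a Python model of Fast Version 1. It is only a sketch: fast_add3 is a hypothetical name, the cycle numbers are copied from the listing (zero-page operands, rts excluded), and a taken branch is counted as 3 cycles with no page crossing.

```python
def fast_add3(n1, n2, n3):
    """Model Fast Version 1: returns (16-bit sum, cycles excluding rts)."""
    cycles = 2 + 3 + 3            # clc, lda n2, adc r
    total = n1 + n2
    a, carry = total & 0xFF, total >> 8
    if carry:                     # bcs s1 taken
        cycles += 3 + 2           # taken branch, then clc at s1
        high = 1                  # lda #1 stands for the carry from n1+n2
    else:
        cycles += 2               # branch not taken, carry already clear
        high = 0                  # lda #0
    total = a + n3                # adc n3
    a, carry = total & 0xFF, total >> 8
    high += carry                 # adc #0
    cycles += 3 + 3 + 2 + 2 + 3   # adc n3, sta r, lda #imm, adc #0, sta r+1
    return (high << 8) | a, cycles
```

The model takes 23 cycles when n1+n2 does not carry and 26 when it does, matching the comments in the listing.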

Conclusions
Adding 3 numbers, I went from 31 to 26 cycles.

The statistics of adding 3 numbers drawn from multiplies
--------------------------------------------------------

In a partial sums routine, only some bytes of the product are added

case    carries      %      total
l l h   2908442244   67.7   4294967296
h h l   2044355014   47.6   4294967296

Conclusions
High bytes of products are always less likely to create a carry when added to other numbers. The difference is small enough that it hardly matters.

ps took a while to simulate 4 billion multiplies and gather stats. I'd hate to figure out stats for larger mults, would have to actually think through it instead. But I can tell you they change quite a bit, in some cases there's hardly two carries in a row for example.
 
... 45 posts hidden ...
 
2017-03-21 06:05
Repose

Registered: Oct 2010
Posts: 222
Thanks for the input!

Quick reply for now,
- Yes, I'd thought of storing the values as 2's complement negative, but the sign bit steals precision and reduces the domain of the multiply. It could be useful to have an alternate and faster routine around, but for now I'm trying to solve a full unsigned 16x16.

-The final sum of a-f will be a positive number, since this is coming from an unsigned multiply partial product. Also, each a-b has a>=b.

I used that to construct the following 16x16 unsigned multiply where the partials are computed and added in order.

Conventions
f means f(x)=x^2/4, g means g(x)=(255-x)^2/4
x0/x1 are low/high of multiplier, y0/y1 low/high of multiplicand
Note there is a difference. I take the multiplier as meaning the one that requires setup, aka lda multiplier:sta zp1:sta zp2:eor #$ff:sta zp3:sta zp4

"setup x0" is a macro-like notation which means what I just stated above.
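
For reference, the quarter-square identity behind f and g can be checked in a few lines of Python. This is a sketch under assumptions: the tables are kept as whole integers instead of separate low/high byte tables, they are sized for indices 0..510, and the names F, G and qs_multiply are invented here. The `a ^ 0xFF` mirrors the `eor #$ff` in the setup, so the g lookup lands on (a-b)^2/4.

```python
# f(x) = x^2/4 and g(x) = (255-x)^2/4, both truncated to integers.
F = [x * x // 4 for x in range(511)]
G = [(255 - x) * (255 - x) // 4 for x in range(511)]


def qs_multiply(a, b):
    """8x8 unsigned multiply via table lookups: a*b = f(a+b) - g((255-a)+b)."""
    return F[a + b] - G[(a ^ 0xFF) + b]   # a ^ 0xFF mirrors the eor #$ff in the setup
```

The truncation is harmless because a+b and a-b always have the same parity, so both lookups drop the same fraction.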

This is a bit confusing if you really try to follow it, but here's the order of partials:
         y1  y0
         x1  x0
         ------
        p0h p0l ;x0*y0
    p1h p1l ;x0*y1 for l and y1*x0 for high
    p2h p2l ;x1*y0
p3h p3l ;y1*x1

f(a+b)-g(a-b)

    in pointers         in Y
p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do it in lsb to msb order to do the adds,
x0*y0 low -> r0 (p0l)

x0*y0 high (p0h)
x0*y1 low (p1l)
x1*y0 low -> r1 (p2l)

x1*y0 high (p2h)
y1*x0 high (p1h) this is different than p1l
y1*x1 low -> r2 (p3l)

y1*x1 high -> r3 (p3h)
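
The lsb-to-msb ordering above can be modelled in Python to confirm that the four result bytes really add up to the full product. This sketch uses plain multiplication for the 8x8 partials instead of the f/g tables, and mul16_by_columns is a hypothetical name; only the column grouping is taken from the list above.

```python
def mul16_by_columns(x, y):
    """16x16 unsigned multiply built from four 8x8 partials, summed lsb to msb."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    p0, p1, p2, p3 = x0 * y0, x0 * y1, x1 * y0, x1 * y1

    r0 = p0 & 0xFF                                  # p0l
    col1 = (p0 >> 8) + (p1 & 0xFF) + (p2 & 0xFF)    # p0h + p1l + p2l
    r1 = col1 & 0xFF
    # carries out of column 1, then p1h + p2h + p3l
    col2 = (col1 >> 8) + (p1 >> 8) + (p2 >> 8) + (p3 & 0xFF)
    r2 = col2 & 0xFF
    r3 = ((col2 >> 8) + (p3 >> 8)) & 0xFF           # carries out of column 2 + p3h
    return r0 | (r1 << 8) | (r2 << 16) | (r3 << 24)
```

Each column can produce a carry of 0, 1 or 2 into the next one, which is what the X register tracks in the assembly.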


And now the code:

setup x0;14 cycles
setup x1
setup y1;42c
ldy y0
sec
lda (f_low_x0),y
sbc (g_low_x0),y
sta r0;p0l, 18c

;the adds of column 1
ldx #0;stores carries, reset every column with adds
clc; Y=y0
lda (f_high_x0),y;A=p0h, (a+b) part
ldy y1
adc (f_low_x0),y;+p1l (a+b) part
bcc s1
inx
clc
s1 ldy y0
adc (f_low_x1),y;+p2l (a+b) part
bcc s2; 31c
inx
clc

;the subs of column 1
s2 sec; Y=y0
;A=the (a+b) parts added, carry in X
sbc (g_high_x0),y
bcs s3
dex
sec
s3 ldy y1
sbc (g_low_x0),y;-p1l (a-b) part
bcs s4
dex
sec
s4 ldy y0
sbc (g_low_x1),y;-p2l (a-b) part
bcs s5
dex
s5 sta r1;35 or 65 per column

;the adds of column 2
txa; get the carries from column 1
clc

...

;column 3
txa
clc
adc (f_high_y1),y;p3h (a+b) part
sec
sbc (g_high_y1),y;p3h (a-b) part
sta r3;19
...
42+18+65+65+19=209

This beats the 16x16 posted on codebase, which is 224 by my estimate.
Times are minimal times; really there's a bit more added when there are overflows/carries.
It's good enough to make a fair comparison though.
2017-03-21 06:22
Repose

Registered: Oct 2010
Posts: 222
The breakthrough I had today was realizing I don't have to reuse the multiplier pointers.
I was stuck on thinking I had to stuff the zp in the middle of an add when all registers were in use.
Then I realized I could prestuff the pointers and use many of them, then I was able to add the partials in order.

I'll have to try again with the "the position in code is that data" idea.

So here's the two variations:
1) as above, 3 setups, partials added in order
2) like codebase, 2 setups, partials added out of order.
Which one will reign victorious?

[1] http://www.codebase64.org/doku.php?id=base:seriously_fast_multi..

[2] http://www.6502.org/source/integers/fastmult.htm
2017-03-21 10:48
JackAsser

Registered: Jun 2002
Posts: 1987
a-b+c-d+e-f = a+c+e - (b+d+f)

; Assuming 8-bit unsigned input and 16-bit unsigned output
lda #0
sta tmp1_hi
sta tmp2_hi

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   inc tmp1_hi
   clc
:
adc e
sta tmp1_lo
bcc :+
   inc tmp1_hi
   clc
:

; Perform tmp2=b+d+f
lda b
adc d
bcc :+
   inc tmp2_hi
   clc
:
adc f
sta tmp2_lo
bcc :+
   inc tmp2_hi
:

; Perform tmp1-=tmp2
sec
lda tmp1_lo
sbc tmp2_lo
sta tmp1_lo
lda tmp1_hi
sbc tmp2_hi
sta tmp1_hi

; Result in tmp1
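
A quick Python model of the regrouping, as a sanity check on the idea and not a transcription of the assembly. The function names are invented for this sketch, and the result is wrapped to 16 bits the way the final sbc pair would wrap it.

```python
def add3_16(p, q, r):
    """Sum three bytes into a 16-bit value, counting carries in the high byte."""
    lo, hi = p, 0
    for n in (q, r):
        lo += n
        if lo > 0xFF:        # bcc not taken: inc hi, clc
            hi += 1
            lo &= 0xFF
    return (hi << 8) | lo


def alternating_sum(a, b, c, d, e, f):
    """a-b+c-d+e-f computed as (a+c+e) - (b+d+f) with 16-bit wraparound."""
    tmp1 = add3_16(a, c, e)
    tmp2 = add3_16(b, d, f)
    return (tmp1 - tmp2) & 0xFFFF
```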
2017-03-21 11:22
Frantic

Registered: Mar 2003
Posts: 1626
If it is ok to use the y-register as well, one could do:
; Assuming 8-bit unsigned input and 16-bit unsigned output
ldy #0

; Perform tmp1=a+c+e
clc
lda a
adc c
bcc :+
   iny
   clc
:
adc e
sta tmp1_lo
bcc :+
   iny
   clc
:
sty tmp1_hi

[...]
2017-03-21 12:15
lft

Registered: Jul 2007
Posts: 369
Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.
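
Here is a Python model of that fixup, written as a sketch under assumptions: carry is set before the first add, Y counts every carry out, and the final subtraction uses the last carry as its inverted borrow. The function name is hypothetical, and the adjustment of the high byte when the fixup itself borrows is an addition of this sketch, needed to make the model exact.

```python
def sum_bytes_lazy_carry(values):
    """Sum bytes without clearing carry between adds; fix up at the end.

    Models: sec / lda v0 / (adc v : bcc skip : iny : skip) ... then sbc with Y.
    """
    a, carry, y = values[0], 1, 0     # sec at the very beginning
    for v in values[1:]:
        total = a + v + carry         # adc v, carry left over from the previous add
        a, carry = total & 0xFF, total >> 8
        y += carry                    # bcc skip / iny
    if y > 0xFF:
        raise OverflowError("carry counter no longer fits in 8 bits")
    diff = a - y - (1 - carry)        # sbc: final carry acts as inverted borrow
    lo = diff & 0xFF
    hi = y - (1 if diff < 0 else 0)   # a borrow in the fixup costs one from the high byte
    return (hi << 8) | lo
```

Every carry that is followed by another adc adds one extra to the accumulator, and the initial sec adds one more, so subtracting Y with the last carry as inverted borrow removes exactly those extras.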
2017-03-21 12:41
JackAsser

Registered: Jun 2002
Posts: 1987
Quote: Neat idea, probably off topic: If you are summing a lot of bytes (e.g. computing a checksum), and you track the MSB as in Frantic's post, then you don't have to bother with keeping carry clear. After the computation, simply correct the error by subtracting the MSB.

Conveniently, the carry from the last addition can be included in the subtraction, but then you have to set carry at the very beginning.


Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.
2017-03-21 14:46
ChristopherJam

Registered: Aug 2004
Posts: 1359
We're all terrible at reading each other's comments. I wrote #5 having only read the thread up to #2, all three of Repose, Frantic and myself have proposed different ways of accumulating the carries from the low byte computation, and both lft and myself have independently proposed fixing up the low byte in post rather than clearing carry as you go.

I do like lft's contribution (addition?) of using the carry from the last addition as inverse borrow for the fixup, mind :)
2017-03-21 14:49
ChristopherJam

Registered: Aug 2004
Posts: 1359
Quoting JackAsser
Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.


I wonder if you could just get away with checking the Y value every N potential overflows, and just see if it has decreased since the last test... it'd be quicker than doing a branch every time.

N may well be as high as 255; I'd have to think a bit harder to be sure of that.
2017-03-21 20:08
lft

Registered: Jul 2007
Posts: 369
Quoting ChristopherJam
We're all terrible at reading each other's comments.


Whoops, you're quite right. Sorry!
2017-03-22 00:50
Fresh

Registered: Jan 2005
Posts: 101
While thinking about this topic I've come up with a completely different approach which is not ideal for this case but may be useful if we need to add more numbers. Hope this is not too off-topic.
Beware, I haven't checked if this needs some kind of CIA version fix: this works fine with old CIAs.
BTW, I'm not using the carry.


.const  zp0=$fb
.const  zp1=$fc
.const  zp2=$fd

.pc=$1000
start:
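        sei
!:      bit $d011
        bpl !-          // No BLs!
        lda zp1
        sta b+1         // patch low address byte of the first lookup
        lda zp2
        sta c+1         // patch low address byte of the second lookup
        lda #$00
        sta $dd0f       // stop CIA2 timer B
        sta $dd06       // timer B low byte = 0
        lda #$01
        sta $dd07       // timer B high byte = 1
        sta $dd0f       // start timer B

        ldy zp0         // 3
b:
        ldx tab,y       // 4/5
c:
        ldy tab,x       // 4/5

        // Here we could add more ldx/ldy
lo:
        ldx $dd06       // 4 => Wait=15/17
        lda carrytab,x
        // HI/LO in A/Y
        sty $63
        sta $62
        cli
        jmp $bdd1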

	
.align $0100
tab:
.fill 512,i&$ff
carrytab:
.fill 256,[$f3-i]&$ff
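
The idea, as far as it can be read from the listing, is that each indexed lookup costs one extra cycle when the index crosses a page, so the elapsed time encodes the number of carries. Below is a Python sketch of that reading. It does not model the CIA timer or carrytab; it simply subtracts the no-carry cycle count, and add3_by_page_crossing is a made-up name.

```python
def add3_by_page_crossing(v0, v1, v2):
    """Model of the chained table lookups: returns (sum, cycles for the lookups)."""
    cycles = 3                      # ldy zp0
    low = v0
    for v in (v1, v2):              # ldx tab,y then ldy tab,x
        cycles += 4
        if low + v > 0xFF:          # indexing past the page boundary costs one cycle
            cycles += 1
        low = (low + v) & 0xFF      # tab holds i & $ff, so the lookup wraps the sum
    high = cycles - (3 + 4 + 4)     # extra cycles = number of carries
    return (high << 8) | low, cycles
```

In the real routine the timer reading replaces the cycle count, and carrytab turns it back into the high byte.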