         y1  y0
         x1  x0
        -------
        p0h p0l
    p1h p1l
    p2h p2l
p3h p3l

stats for sqr mult
------------------
 995688448  23.2%  p0h+p1l>255
2105475200  49%    p1l+p2l>255
1991145471  46.4%  p0h+p1l+p2l>255
1030820094  24%    p1h+p2h>255
1031913855  24%    p2h+p3l>255
2044225475  47.6%  p1h+p2h+p3l>255
2335172480  54.4%  one carry in p0h column
 765991168  17.8%  two carries in p0h column
1819237028  42.4%  one carry in p1h column
 243496921   5.7%  two carries in p1h column
        lda n1
        adc n2
        bcc n1n2_0
        clc
n1n2_1
        adc n3
        bcc n1n2n3_1
        clc
n1n2n3_2
        adc n4
        bcc n1n2n3n4_2
        clc
n1n2n3n4_3
        adc n5
        sta sum
        lda #3
        adc n1+1
        clc
        adc n2+1
        clc
        adc n3+1
        clc
        adc n4+1
        clc
        adc n5+1
        sta sum+1
        rts

n1n2_0
        adc n3
        bcc n1n2n3_0
        clc
n1n2n3_1
        adc n4
        bcc n1n2n3n4_1
        clc
n1n2n3n4_2
        adc n5
        sta sum
        lda #2
        adc n1+1
        clc
        adc n2+1
        clc
        adc n3+1
        clc
        adc n4+1
        clc
        adc n5+1
        sta sum+1
        rts

n1n2n3_0
        adc n4
        bcc n1n2n3n4_0
        clc
n1n2n3n4_1
        adc n5
        sta sum
        lda #1
        adc n1+1
        clc
        adc n2+1
        clc
        adc n3+1
        clc
        adc n4+1
        clc
        adc n5+1
        sta sum+1
        rts

n1n2n3n4_0
        adc n5
        sta sum
        lda #0
        adc n1+1
        clc
        adc n2+1
        clc
        adc n3+1
        clc
        adc n4+1
        clc
        adc n5+1
        sta sum+1
        rts
One problem, X could rollover to $FE.
         y1  y0
         x1  x0
        -------
        p0h p0l   ;x0*y0
    p1h p1l       ;x0*y1 for l and y1*x0 for high
    p2h p2l       ;x1*y0
p3h p3l           ;y1*x1

f(a+b)-g(a-b) in pointers in Y

p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do the adds in LSB to MSB order:

x0*y0 low  -> r0 (p0l)

x0*y0 high       (p0h)
x0*y1 low        (p1l)
x1*y0 low  -> r1 (p2l)

x1*y0 high       (p2h)
y1*x0 high       (p1h) this is different than p1l
y1*x1 low  -> r2 (p3l)

y1*x1 high -> r3 (p3h)
        setup x0            ;14 cycles
        setup x1
        setup y1            ;42c

        ldy y0
        sec
        lda (f_low_x0),y
        sbc (g_low_x0),y
        sta r0              ;p0l, 18c

;the adds of column 1
        ldx #0              ;stores carries, reset every column with adds
        clc                 ; Y=y0
        lda (f_high_x0),y   ;A=p0h, (a+b) part
        ldy y1
        adc (f_low_x0),y    ;+p1l (a+b) part
        bcc s1
        inx
        clc
s1      ldy y0
        adc (f_low_x1),y    ;+p2l (a+b) part
        bcc s2              ; 31c
        inx
        clc
;the subs of column 1
s2      sec                 ; Y=y0
                            ;A=the (a+b) parts added, carry in X
        sbc (g_high_x0),y
        bcs s3
        dex
        sec
s3      ldy y1
        sbc (g_low_x0),y    ;-p1l (a-b) part
        bcs s4
        dex
        sec
s4      ldy y0
        sbc (g_low_x1),y    ;-p2l (a-b) part
        bcs s5
        dex
s5      sta r1              ;35 or 65 per column

;the adds of column 2
        txa                 ; get the carries from column 1
        clc
        ...
;column 3
        txa
        clc
        adc (f_high_y1),y   ;p3h (a+b) part
        sec
        sbc (g_high_y1),y   ;p3h (a-b) part
        sta r3              ;19
        ...

42+18+65+65+19=209
; Assuming 8-bit unsigned input and 16-bit unsigned output
        lda #0
        sta tmp1_hi
        sta tmp2_hi
; Perform tmp1=a+c+e
        clc
        lda a
        adc c
        bcc :+
        inc tmp1_hi
        clc
:       adc e
        sta tmp1_lo
        bcc :+
        inc tmp1_hi
        clc
:
; Perform tmp2=b+d+f
        lda b
        adc d
        bcc :+
        inc tmp2_hi
        clc
:       adc f
        sta tmp2_lo
        bcc :+
        inc tmp2_hi
:
; Perform tmp1-=tmp2
        sec
        lda tmp1_lo
        sbc tmp2_lo
        sta tmp1_lo
        lda tmp1_hi
        sbc tmp2_hi
        sta tmp1_hi
; Result in tmp1
; Assuming 8-bit unsigned input and 16-bit unsigned output
        ldy #0
; Perform tmp1=a+c+e
        clc
        lda a
        adc c
        bcc :+
        iny
        clc
:       adc e
        sta tmp1_lo
        bcc :+
        iny
        clc
:       sty tmp1_hi
[...]
Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.
We're all terrible at reading each other's comments.
.const zp0=$fb
.const zp1=$fc
.const zp2=$fd

.pc=$1000
start:
        sei
!:      bit $d011
        bpl !-              // No BLs!
        lda zp1
        sta b+1
        lda zp2
        sta c+1
        lda #$00
        sta $dd0f
        sta $dd06
        lda #$01
        sta $dd07
        sta $dd0f
        ldy zp0             // 3
b:      ldx tab,y           // 4/5
c:      ldy tab,x           // 4/5
        // Here we could add more ldx/ldy
lo:     ldx $dd06           // 4 => Wait=15/17
        lda carrytab,x      // HI/LO in A/Y
        sty $63
        sta $62
        cli
        jmp $bdd1

        .align $0100
tab:
        .fill 512,i&$ff
carrytab:
        .fill 256,[$f3-i]&$ff
The indexed addressing is doing the adds, and the timer is doing the carries, in parallel.
        ldx #0
        clc
        lda n1
        adc n2
        bcc s1
        inx
        adc #0      ;fix
        sec         ;fix
s1      adc n3
        bcc s2
        inx
s2      stx h
        sec
        sbc h
        sta l
        sec
        sbc id,x
        txa
        ldx #0
   a2 a1 a0
      b1 b0
+     c1 c0
   --------
   r2 r1 r0

2       clc
3       lda a0
3       adc b0
2       tax         ;a0+b0
3       lda a1
3       adc b1
2       tay         ;a1+b1
3       adc a2
3       sta r2
2       txa
                    ;no clc as there's never a carry from r2, in a mult
3       adc c0
3       sta r0
2       tya
3       adc c1
3       sta r1
3       bcc s1      ;3/7 avg 5
5       inc r2
s1      rts
        clc
        lda n1
        adc n2
        bcc s1
        inx
s1      adc n3
        bcc s2
        inx
s2      sta low
        stx high    ;or add the next column (the high bytes of n1-n3) with txa:ldx #0
s2      stx high
        sec
        sbc high
        sta low     ;remove the carries which were added by NOT using CLC
s2      adc #0
        stx high
        sec
        sbc high
        sta low
Then I noticed this didn't work in every case, because the first carry doesn't offset the running total, only the 2nd carries forward.
s2      stx hi
        sbc hi      ; this adds 255-x to the current value of A+C
        sta low     ; will now contain <(n1+n2+n3+0xff)
        lda hi
        adc #0
        sta hi      ; will now contain >(n1+n2+n3+0xff)
Getting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.
Ok, seems I took a wrong turn somewhere, and that when you went down the optimization path with sbc tab,x you avoided my particular bug.
My point about test coverage is still valid though.
; test parameters are
; A (00..ff)
; C (00..01)
; X (00..ff)
; K (X..ff)
; routine requires that X+(0xff-K)<= 0xff, ie X<=K
;
; result should be A+C+K+255*X

;-----------------------------
; calculate reference value
;-----------------------------
        lda tC
        lsr
        lda tA
        adc tK
        sta ref+0
        lda tX
        adc #0
        sta ref+1
; ref is now A+C+K+256*X
        sec
        lda ref+0
        sbc tX
        sta ref+0
        lda ref+1
        sbc #0
        sta ref+1
; ref is now A+C+K+255*X

;-----------------------------
; prepare registers for test
;-----------------------------
        lda tK
        eor #$ff
        sta rn+1
        lda tC
        lsr
        lda tA
        ldx tX

;-----------------------------
; run test code
;-----------------------------
rn      sbc id+$3f,x
        sta res+0
        txa
        adc #0
        sta res+1

;-----------------------------
; compare result to reference value
;-----------------------------
        lda res+0
        cmp ref+0
        bne fail
        lda res+1
        cmp ref+1
        bne fail
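For anyone who wants to run the same check off the machine, here is a rough Python model of the harness above. It is only a sketch: it assumes binary-mode SBC, assumes id is a page-aligned identity table (id[i] = i) so that the self-modified operand low byte acts as an offset, and the names routine, reference and find_mismatch are made up for this post. To keep it quick it only tries a handful of A values rather than all 256.

```python
def sbc(a, m, c):
    """6502 SBC in binary mode: A + (M eor $ff) + C, returns (A, carry)."""
    s = a + (m ^ 0xFF) + c
    return s & 0xFF, s >> 8

# Assumption: id is a page-aligned identity table, id[i] = i
ID = list(range(256))

def routine(a, c, x, k):
    """Model of: sbc id+(K eor $ff),x / sta res+0 / txa / adc #0 / sta res+1"""
    lo, c = sbc(a, ID[(k ^ 0xFF) + x], c)   # index stays in the table only if X <= K
    hi = (x + c) & 0xFF                      # txa : adc #0
    return (hi << 8) | lo

def reference(a, c, x, k):
    """The value the harness expects: A+C+K+255*X as a 16-bit number."""
    return (a + c + k + 255 * x) & 0xFFFF

def find_mismatch(a_values=(0x00, 0x01, 0x7F, 0x80, 0xFE, 0xFF)):
    """Return the first failing (A, C, X, K), or None if every case agrees."""
    for a in a_values:
        for c in (0, 1):
            for x in range(256):
                for k in range(x, 256):      # the routine requires X <= K
                    if routine(a, c, x, k) != reference(a, c, x, k):
                        return (a, c, x, k)
    return None
```

Calling find_mismatch(range(256)) covers the full A range as well, it just takes noticeably longer.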
        adc #V
        bcc *+3
        inx
C'=0
X'=X
A'=A+V+C
  =(T-C-255*X)+V+C
  =T+V-255*X
T'=A'+C'+255*X'
  =(T+V-255*X)+0+255*X
  =T+V
C'=1
X'=X+1                  # as the increment happens
A'=A+V+C-256            # as the add has overflowed into C
  =(T-C-255*X)+V+C-256
  =T+V-255*X-256
T'=A'+C'+255*X'
  =(T+V-255*X-256)+1+255*(X+1)
  =T+V-255*X-255+255*X+255
  =T+V
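The two cases above can also be checked by brute force. This is a small Python sketch, assuming binary-mode ADC and keeping X small enough that it never wraps at 256 (the function names are invented for the example). It confirms that T = A+C+255*X grows by exactly V across one adc:bcc:inx step, whichever way the carry falls.

```python
def adc(a, v, c):
    """6502 ADC in binary mode: returns (new A, new carry)."""
    s = a + v + c
    return s & 0xFF, s >> 8

def step(a, c, x, v):
    """Model of: adc #V / bcc *+3 / inx"""
    a, c = adc(a, v, c)
    if c:
        x += 1          # X is kept small here, so the 8-bit wrap never triggers
    return a, c, x

def total(a, c, x):
    """The running total T = A + C + 255*X."""
    return a + c + 255 * x

def invariant_holds():
    """True if T' == T + V for every A, C, V and a few small X."""
    for a in range(256):
        for c in (0, 1):
            for x in range(4):
                for v in range(256):
                    a2, c2, x2 = step(a, c, x, v)
                    if total(a2, c2, x2) != total(a, c, x) + v:
                        return False
    return True
```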
           A  C  X  r1 r0
ldx #0           00
sec           1
lda #ff    ff
adc #2     02 1
bcc s1
inx              01
adc #ff    02 1
bcc s2
inx              02
stx r1              02
sbc r1     00 1
sta r0                 00
txa        02
sbc #0     02 1
sta r1              02

           A  C  X  r1 r0
ldx #0           00
sec           1
lda #ff    ff
adc #1     01 1
bcc s1
inx              01
adc #ff    01 1
bcc s2
inx              02
stx r1              02
sbc r1     ff 0
sta r0                 ff
txa        02
sbc #0     01 1
sta r1              01
s2      sbc id,x
        sta r0
;column 2
        txa
        sbc #0      ;correction to high byte
        ldx #0
        adc n1h     ;add high bytes
        stx r1
        sbc r1
with
        sbc id,x
        ldx #0
a)      sec         ; initialise T to 1
        lda n1      ; now T is A+C+255*X = n1+1+255*0 = n1+1
        adc n2
        bcc *+3
        inx         ; now T is 1+n1+n2, cf proof in comment about the adc:bcc:inx sequence earlier in this thread
        adc n3
        bcc *+3
        inx         ; now T is 1+n1+n2+n3
                    ; so all that's required now is adding A+C+255*X-1 and storing them
b)      sbc id,x    ; now sbc nn is the same as adc $ff-nn, so here we are adding $ff and subtracting X
        sta r0
        txa
c)      sbc #0      ; again, equivalent to adc $ff-0, so here we are adding x, $ff, and any carry from the low byte
        sta r1      ; ie, adding 256*X and $ff00 to the total in r1:r0
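As a sanity check on steps a) to c), here is a Python model of the whole three-byte add. It is a sketch under a few assumptions: binary-mode arithmetic, id[x] = x so that sbc id,x simply subtracts X, and sum3 is a name invented for the example. It reproduces both traces posted earlier ($ff+$02+$ff = $0200 and $ff+$01+$ff = $01ff).

```python
def adc(a, m, c):
    """6502 ADC in binary mode: returns (new A, new carry)."""
    s = a + m + c
    return s & 0xFF, s >> 8

def sbc(a, m, c):
    """6502 SBC is ADC of the one's complement of the operand."""
    return adc(a, m ^ 0xFF, c)

def sum3(n1, n2, n3):
    """Model of the routine above; returns r1:r0 as a 16-bit value."""
    x = 0                    # ldx #0
    c = 1                    # a) sec, so T starts one too high
    a = n1                   # lda n1
    for n in (n2, n3):       # adc n / bcc *+3 / inx, never clearing carry
        a, c = adc(a, n, c)
        x += c
    a, c = sbc(a, x, c)      # b) sbc id,x with id[x] == x
    r0 = a                   # sta r0
    a, c = sbc(x, 0, c)      # txa / c) sbc #0
    r1 = a                   # sta r1
    return (r1 << 8) | r0

# Spot values used for a quick (non-exhaustive) sweep
SAMPLE = list(range(0, 256, 7)) + [255]

def sweep_ok():
    """True if sum3 matches plain addition for every sampled triple."""
    return all(sum3(p, q, r) == p + q + r
               for p in SAMPLE for q in SAMPLE for r in SAMPLE)
```

Replacing SAMPLE with range(256) makes the sweep exhaustive, at the cost of about 16.7 million iterations.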
        lax eorval+1
        ldy #$00
val1:   axs #$00
val2:   axs #$00
        bcs val3
        iny
val3:   axs #$00
        bcs val4
        iny
val4:   axs #$00
        bcs storehi
        iny
storehi:
        sty r1
        txa
eorval: eor #$ff
        sta r0