         y1  y0
         x1  x0
        -------
        p0h p0l   ;x0*y0
    p1h p1l       ;x0*y1 for l and y1*x0 for high
    p2h p2l       ;x1*y0
p3h p3l           ;y1*x1

Each product is f(a+b)-g(a-b), with one operand in the pointers and the other in Y:

p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do the adds in lsb to msb order:

x0*y0 low  -> r0 (p0l)

x0*y0 high       (p0h)
x0*y1 low        (p1l)
x1*y0 low  -> r1 (p2l)

x1*y0 high       (p2h)
y1*x0 high       (p1h) this is different than p1l
y1*x1 low  -> r2 (p3l)

y1*x1 high -> r3 (p3h)
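The layout above can be sanity-checked with a small model. This Python sketch assumes the usual quarter-square tables, f(n) = g(n) = floor(n*n/4) with g indexed by |a-b|; the post only names the pointers, so the table definitions and the names `f`, `g`, `mul8`, `mul16` and `result_bytes` are my own illustration, not the poster's code.

```python
# Quarter-square model: a*b = f(a+b) - g(a-b), with f(n) = g(n) = n*n // 4.
# Assumption: these are the usual table definitions; the post does not state them.

def f(n):
    return (n * n) // 4


def g(n):
    return (n * n) // 4  # indexed by |a - b|


def mul8(a, b):
    """8x8 -> 16-bit product from two table lookups."""
    return f(a + b) - g(abs(a - b))


def mul16(x, y):
    """16x16 -> 32-bit product assembled from the four partial products."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    p0 = mul8(x0, y0)  # weight 256**0
    p1 = mul8(x0, y1)  # weight 256**1
    p2 = mul8(x1, y0)  # weight 256**1
    p3 = mul8(x1, y1)  # weight 256**2
    return p0 + ((p1 + p2) << 8) + (p3 << 16)


def result_bytes(x, y):
    """Return [r0, r1, r2, r3], least significant byte first."""
    return list(mul16(x, y).to_bytes(4, "little"))
```

The floor in f and g is harmless because a+b and a-b always have the same parity, so both squares leave the same remainder when divided by 4.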
setup x0;14 cycles
setup x1
setup y1;42c

ldy y0
sec
lda (f_low_x0),y
sbc (g_low_x0),y
sta r0;p0l, 18c

;the adds of column 1
ldx #0;stores carries, reset every column with adds
clc; Y=y0
lda (f_high_x0),y;A=p0h, (a+b) part
ldy y1
adc (f_low_x0),y;+p1l (a+b) part
bcc s1
inx
clc
s1 ldy y0
adc (f_low_x1),y;+p2l (a+b) part
bcc s2; 31c
inx
clc

;the subs of column 1
s2 sec; Y=y0
;A=the (a+b) parts added, carry in X
sbc (g_high_x0),y
bcs s3
dex
sec
s3 ldy y1
sbc (g_low_x0),y;-p1l (a-b) part
bcs s4
dex
sec
s4 ldy y0
sbc (g_low_x1),y;-p2l (a-b) part
bcs s5
dex
s5 sta r1;35 or 65 per column

;the adds of column 2
txa; get the carries from column 1
clc
...

;column 3
txa
clc
adc (f_high_y1),y;p3h (a+b) part
sec
sbc (g_high_y1),y;p3h (a-b) part
sta r3;19
...

42+18+65+65+19=209
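To check the column arithmetic independently of the cycle counting, here is a rough Python model of the same idea: add the (a+b) bytes, subtract the (a-b) bytes, and keep a signed carry count the way X does in the listing. It assumes f(n) = g(n) = floor(n*n/4), and all helper names are made up for the illustration. Note that the model passes the borrow out of the r0 column into the next column, whereas the listing resets X with `ldx #0` there; treat that as my assumption about the intended arithmetic.

```python
# Column-wise accumulation with a signed carry counter (the role of X).
# Assumption: f(n) = g(n) = n*n // 4; every name here is illustrative.

def quarter(n):
    return (n * n) // 4


def lo(v):
    return v & 0xFF


def hi(v):
    return (v >> 8) & 0xFF


def column(adds, subs, carry_in):
    """Add, then subtract, bytes in 8-bit steps while counting carries.

    Returns (result_byte, carry_out); carry_out is negative for a net borrow.
    """
    acc = carry_in & 0xFF   # txa: carries from the previous column
    x = carry_in >> 8       # sign extension: -1 if carry_in was negative
    for b in adds:
        acc += b
        if acc > 0xFF:      # carry set -> inx
            acc -= 0x100
            x += 1
    for b in subs:
        acc -= b
        if acc < 0:         # borrow -> dex
            acc += 0x100
            x -= 1
    return acc, x


def mul16_columns(x, y):
    """Return [r0, r1, r2, r3] for x*y, least significant byte first."""
    x0, x1, y0, y1 = x & 0xFF, x >> 8, y & 0xFF, y >> 8
    pairs = ((x0, y0), (x0, y1), (x1, y0), (x1, y1))
    fv = [quarter(a + b) for a, b in pairs]       # f parts of p0..p3
    gv = [quarter(abs(a - b)) for a, b in pairs]  # g parts of p0..p3
    cols = [
        ([lo(fv[0])], [lo(gv[0])]),
        ([hi(fv[0]), lo(fv[1]), lo(fv[2])], [hi(gv[0]), lo(gv[1]), lo(gv[2])]),
        ([hi(fv[2]), hi(fv[1]), lo(fv[3])], [hi(gv[2]), hi(gv[1]), lo(gv[3])]),
        ([hi(fv[3])], [hi(gv[3])]),
    ]
    result, carry = [], 0
    for adds, subs in cols:
        byte, carry = column(adds, subs, carry)
        result.append(byte)
    return result
```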
; Assuming 8-bit unsigned input and 16-bit unsigned output
    lda #0
    sta tmp1_hi
    sta tmp2_hi

; Perform tmp1=a+c+e
    clc
    lda a
    adc c
    bcc :+
    inc tmp1_hi
    clc
:   adc e
    sta tmp1_lo
    bcc :+
    inc tmp1_hi
    clc
:

; Perform tmp2=b+d+f
    lda b
    adc d
    bcc :+
    inc tmp2_hi
    clc
:   adc f
    sta tmp2_lo
    bcc :+
    inc tmp2_hi
:

; Perform tmp1-=tmp2
    sec
    lda tmp1_lo
    sbc tmp2_lo
    sta tmp1_lo
    lda tmp1_hi
    sbc tmp2_hi
    sta tmp1_hi

; Result in tmp1
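For anyone who wants to check the arithmetic off the machine, here is a small Python model of that routine: sum three bytes twice with 8-bit adds while counting carries into a high byte, then do one 16-bit subtract. The function names `add3` and `sum_diff` are hypothetical, and the result wraps modulo 65536 just as the two-byte `sbc` would.

```python
def add3(p, q, r):
    """Sum three bytes with 8-bit adds; each carry bumps the high byte."""
    low, high = p, 0
    for v in (q, r):
        low += v
        if low > 0xFF:      # carry set -> inc tmpN_hi
            low &= 0xFF
            high += 1
    return low, high


def sum_diff(a, b, c, d, e, f):
    """Return (a + c + e) - (b + d + f) as a 16-bit unsigned value."""
    t1_lo, t1_hi = add3(a, c, e)
    t2_lo, t2_hi = add3(b, d, f)
    diff_lo = t1_lo - t2_lo
    borrow = 1 if diff_lo < 0 else 0          # carry clear after the low sbc
    diff_hi = t1_hi - t2_hi - borrow
    return ((diff_hi & 0xFF) << 8) | (diff_lo & 0xFF)
```

With all three addends at 255 and nothing subtracted, the model gives 765, which is the largest value tmp1 can hold before the subtract.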
; Assuming 8-bit unsigned input and 16-bit unsigned output
    ldy #0

; Perform tmp1=a+c+e
    clc
    lda a
    adc c
    bcc :+
    iny
    clc
:   adc e
    sta tmp1_lo
    bcc :+
    iny
    clc
:   sty tmp1_hi
[...]
Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.
We're all terrible at reading each other's comments.
.const zp0=$fb
.const zp1=$fc
.const zp2=$fd

.pc=$1000

start:
    sei
!:  bit $d011
    bpl !-
    // No BLs!
    lda zp1
    sta b+1
    lda zp2
    sta c+1
    lda #$00
    sta $dd0f
    sta $dd06
    lda #$01
    sta $dd07
    sta $dd0f
    ldy zp0          // 3
b:  ldx tab,y        // 4/5
c:  ldy tab,x        // 4/5
    // Here we could add more ldx/ldy
lo: ldx $dd06        // 4 => Wait=15/17
    lda carrytab,x
    // HI/LO in A/Y
    sty $63
    sta $62
    cli
    jmp $bdd1

.align $0100
tab:
    .fill 512,i&$ff
carrytab:
    .fill 256,[$f3-i]&$ff
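As I read it, the trick is that each patched `ldx tab,y` / `ldy tab,x` adds two bytes through the index, and costs one extra cycle exactly when the add carries (a page crossing), so the CIA timer ends up holding the high byte. Here is a Python model of just that principle. It does not model real CIA latency: `BASE_ELAPSED` is a value I inferred from the `$f3` constant in `carrytab` (the timer low byte that should map to zero carries), and the function names are hypothetical.

```python
TAB = [i & 0xFF for i in range(512)]                # .fill 512,i&$ff
CARRYTAB = [(0xF3 - i) & 0xFF for i in range(256)]  # .fill 256,[$f3-i]&$ff

TIMER_START = 0x100                # latch written via $dd06/$dd07
BASE_ELAPSED = TIMER_START - 0xF3  # assumed cycles elapsed with no page crossing


def indexed_load(base_lo, index):
    """Model an absolute,indexed load from TAB with a patched operand low byte.

    Returns (value, extra_cycle); extra_cycle is 1 when base_lo + index
    crosses a page boundary, i.e. when the 8-bit add carries.
    """
    addr = base_lo + index
    return TAB[addr], addr >> 8


def add3_via_timer(zp0, zp1, zp2):
    """Return (high, low) of zp0 + zp1 + zp2, with high read from the timer."""
    x, c1 = indexed_load(zp1, zp0)   # b: ldx tab,y
    y, c2 = indexed_load(zp2, x)     # c: ldy tab,x
    timer_lo = (TIMER_START - (BASE_ELAPSED + c1 + c2)) & 0xFF
    return CARRYTAB[timer_lo], y     # HI/LO in A/Y
```

Adding more `ldx`/`ldy` stages extends the chain in the same way: every extra cycle moves the timer reading one step further down `carrytab`.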