; sum = n1+n2+n3+n4+n5 (16-bit result).
; The number of low-byte carries is tracked by which code path is taken.
; Note: carry must be clear on entry (add a clc if the caller does not guarantee it).
    lda n1
    adc n2
    bcc n1n2_0
    clc
n1n2_1
    adc n3
    bcc n1n2n3_1
    clc
n1n2n3_2
    adc n4
    bcc n1n2n3n4_2
    clc
n1n2n3n4_3
    adc n5
    sta sum
    lda #3          ;three carries so far, the fourth (if any) is still in C
    adc n1+1
    clc
    adc n2+1
    clc
    adc n3+1
    clc
    adc n4+1
    clc
    adc n5+1
    sta sum+1
    rts

n1n2_0
    adc n3
    bcc n1n2n3_0
    clc
n1n2n3_1
    adc n4
    bcc n1n2n3n4_1
    clc
n1n2n3n4_2
    adc n5
    sta sum
    lda #2
    adc n1+1
    clc
    adc n2+1
    clc
    adc n3+1
    clc
    adc n4+1
    clc
    adc n5+1
    sta sum+1
    rts

n1n2n3_0
    adc n4
    bcc n1n2n3n4_0
    clc
n1n2n3n4_1
    adc n5
    sta sum
    lda #1
    adc n1+1
    clc
    adc n2+1
    clc
    adc n3+1
    clc
    adc n4+1
    clc
    adc n5+1
    sta sum+1
    rts

n1n2n3n4_0
    adc n5
    sta sum
    lda #0
    adc n1+1
    clc
    adc n2+1
    clc
    adc n3+1
    clc
    adc n4+1
    clc
    adc n5+1
    sta sum+1
    rts
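As a cross-check of the carry-counting idea in the routine above, here is a minimal Python model (not 6502 code). The function name `sum5_carry_count` is made up for illustration, and the model assumes the carry is clear on entry and that the result is truncated to 16 bits, as the `sum`/`sum+1` pair suggests.

```python
def sum5_carry_count(nums):
    """Model of the routine: add the low bytes while counting carries,
    then start the high-byte sum from that carry count."""
    lo = 0        # models A during the low-byte adds
    carries = 0   # models the carry count (branch path plus final C flag)
    for n in nums:
        lo += n & 0xFF
        if lo > 0xFF:          # carry out of the 8-bit add
            lo -= 0x100
            carries += 1
    hi = carries               # models lda #k followed by adc with C
    for n in nums:
        hi = (hi + (n >> 8)) & 0xFF   # carries out of the high byte are dropped
    return (hi << 8) | lo


print(hex(sum5_carry_count([0x00FF] * 5)))
```

With five inputs of $00FF the model counts four low-byte carries and returns $04FB, which matches 5 * 255 = 1275.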

One problem: X could roll over to $FE.

            y1  y0
            x1  x0
            ------
           p0h p0l   ;x0*y0
       p1h p1l       ;x0*y1 for l and y1*x0 for high
       p2h p2l       ;x1*y0
   p3h p3l           ;y1*x1

f(a+b)-g(a-b), via pointers indexed by Y:

p0l=f_low_x0-g_low_x0 * y0
p0h=f_high_x0-g_high_x0 * y0
p1l=f_low_x0-g_low_x0 * y1
p1h=f_high_y1-g_high_y1 * x0
p2l=f_low_x1-g_low_x1 * y0
p2h=f_high_x1-g_high_x1 * y0
p3l=f_low_y1-g_low_y1 * x1
p3h=f_high_y1-g_high_y1 * x1

If you do the adds in lsb to msb order:

x0*y0 low  -> r0 (p0l)

x0*y0 high       (p0h)
x0*y1 low        (p1l)
x1*y0 low  -> r1 (p2l)

x1*y0 high       (p2h)
y1*x0 high       (p1h) this is different than p1l
y1*x1 low  -> r2 (p3l)

y1*x1 high -> r3 (p3h)

    setup x0            ;14 cycles
    setup x1
    setup y1            ;42c

    ldy y0
    sec
    lda (f_low_x0),y
    sbc (g_low_x0),y
    sta r0              ;p0l, 18c

;the adds of column 1
    ldx #0              ;stores carries, reset every column with adds
    clc                 ; Y=y0
    lda (f_high_x0),y   ;A=p0h, (a+b) part
    ldy y1
    adc (f_low_x0),y    ;+p1l (a+b) part
    bcc s1
    inx
    clc
s1  ldy y0
    adc (f_low_x1),y    ;+p2l (a+b) part
    bcc s2              ; 31c
    inx
    clc

;the subs of column 1
s2  sec                 ; Y=y0
                        ;A=the (a+b) parts added, carry in X
    sbc (g_high_x0),y
    bcs s3
    dex
    sec
s3  ldy y1
    sbc (g_low_x0),y    ;-p1l (a-b) part
    bcs s4
    dex
    sec
s4  ldy y0
    sbc (g_low_x1),y    ;-p2l (a-b) part
    bcs s5
    dex
s5  sta r1              ;35 or 65 per column

;the adds of column 2
    txa                 ; get the carries from column 1
    clc
    ...

;column 3
    txa
    clc
    adc (f_high_y1),y   ;p3h (a+b) part
    sec
    sbc (g_high_y1),y   ;p3h (a-b) part
    sta r3              ;19
    ...

42+18+65+65+19=209
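To sanity-check the partial-product layout and the column-by-column adds above, here is a rough Python model of a 16x16 quarter-square multiply. It is only a sketch under assumptions: it uses a single table `SQ4` with `abs(a - b)` in place of the separate f and g tables, and the names `SQ4`, `mul8` and `mul16` are invented for this example. It does not reproduce the pointer setup or the carry bookkeeping in X.

```python
# Quarter-square table: SQ4[n] = floor(n*n/4) for n in 0..510.
# For 8-bit a and b, a*b == SQ4[a+b] - SQ4[|a-b|], because a+b and a-b
# have the same parity and the dropped fractions cancel.
SQ4 = [n * n // 4 for n in range(511)]


def mul8(a, b):
    """8x8 -> 16-bit product via f(a+b) - g(a-b)."""
    return SQ4[a + b] - SQ4[abs(a - b)]


def mul16(x, y):
    """16x16 -> 32-bit product from four 8x8 partial products."""
    x0, x1 = x & 0xFF, x >> 8
    y0, y1 = y & 0xFF, y >> 8
    p0 = mul8(x0, y0)   # x0*y0
    p1 = mul8(x0, y1)   # x0*y1
    p2 = mul8(x1, y0)   # x1*y0
    p3 = mul8(x1, y1)   # x1*y1

    # Add the columns from lsb to msb, passing carries upward.
    r0 = p0 & 0xFF
    col1 = (p0 >> 8) + (p1 & 0xFF) + (p2 & 0xFF)
    r1 = col1 & 0xFF
    col2 = (col1 >> 8) + (p1 >> 8) + (p2 >> 8) + (p3 & 0xFF)
    r2 = col2 & 0xFF
    col3 = (col2 >> 8) + (p3 >> 8)
    r3 = col3 & 0xFF
    return r0 | (r1 << 8) | (r2 << 16) | (r3 << 24)


print(hex(mul16(0x1234, 0x5678)))
```

The middle columns can each produce a carry of up to 2, which is why the assembly keeps a carry counter per column rather than relying on the C flag alone.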

; Assuming 8-bit unsigned input and 16-bit unsigned output
    lda #0
    sta tmp1_hi
    sta tmp2_hi
; Perform tmp1=a+c+e
    clc
    lda a
    adc c
    bcc :+
    inc tmp1_hi
    clc
:
    adc e
    sta tmp1_lo
    bcc :+
    inc tmp1_hi
    clc
:
; Perform tmp2=b+d+f
    lda b
    adc d
    bcc :+
    inc tmp2_hi
    clc
:
    adc f
    sta tmp2_lo
    bcc :+
    inc tmp2_hi
:
; Perform tmp1-=tmp2
    sec
    lda tmp1_lo
    sbc tmp2_lo
    sta tmp1_lo
    lda tmp1_hi
    sbc tmp2_hi
    sta tmp1_hi
; Result in tmp1
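For checking the routine above against known values, here is a small byte-level Python model of tmp1 = (a+c+e) - (b+d+f). The helper names `add3` and `sum_diff` are hypothetical. One assumption worth stating: when b+d+f is larger than a+c+e, the 16-bit result wraps around (two's complement), which the "unsigned output" comment does not address.

```python
def add3(x, y, z):
    """Add three 8-bit values, returning (lo, hi) with hi counting the carries."""
    lo, hi = x, 0
    for v in (y, z):
        lo += v
        if lo > 0xFF:      # carry out: models bcc not taken, inc hi
            lo -= 0x100
            hi += 1
    return lo, hi


def sum_diff(a, b, c, d, e, f):
    """Model of tmp1 = (a+c+e) - (b+d+f) as a 16-bit value."""
    t1_lo, t1_hi = add3(a, c, e)
    t2_lo, t2_hi = add3(b, d, f)
    lo = t1_lo - t2_lo
    borrow = 1 if lo < 0 else 0            # models C clear after the low sbc
    lo &= 0xFF
    hi = (t1_hi - t2_hi - borrow) & 0xFF   # high-byte sbc with borrow
    return (hi << 8) | lo


print(sum_diff(10, 1, 20, 2, 30, 3))
```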

; Assuming 8-bit unsigned input and 16-bit unsigned output
    ldy #0
; Perform tmp1=a+c+e
    clc
    lda a
    adc c
    bcc :+
    iny
    clc
:
    adc e
    sta tmp1_lo
    bcc :+
    iny
    clc
:
    sty tmp1_hi
[...]

Indeed, but special care has to be taken after 256 potential overflows. That would of course imply a >16-bit result in the end.