        clc
        lda n1
        adc n2
        bcc s1
        inx
s1      adc n3
        bcc s2
        inx
s2      sta low
        stx high    ;or add the next column (the high bytes of n1-n3) with txa:ldx #0

s2      stx high
        sec
        sbc high
        sta low     ;remove the carries which were added by NOT using CLC

s2      adc #0
        stx high
        sec
        sbc high
        sta low
Then I noticed this didn't work in every case, because the first carry doesn't offset the running total; only the second one carries forward.
s2      stx hi
        sbc hi      ; this adds 255-x to the current value of A+C
        sta low     ; will now contain <(n1+n2+n3+0xff)
        lda hi
        adc #0
        sta hi      ; will now contain >(n1+n2+n3+0xff)
Getting all those additions in the right order without making a typo, for some 72 of them (considering a 32x32), would be very difficult.
OK, it seems I took a wrong turn somewhere, and by going down the optimization path with sbc tab,x you avoided my particular bug.
My point about test coverage is still valid though.
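To make the coverage point concrete, here is a rough Python model of the three-byte column add with the two "stx high / sec / sbc high" endings at s2 (with and without a leading adc #0). It is only a sketch under stated assumptions: X is 0 on entry, decimal mode is off, and the helper names (adc, sbc, add3, failures) are made up for illustration. It is not a cycle-accurate emulator.

```python
def adc(a, m, c):
    """6502 ADC in binary mode: returns (new A, new carry)."""
    total = a + m + c
    return total & 0xFF, 1 if total > 0xFF else 0


def sbc(a, m, c):
    """6502 SBC in binary mode: A - M - (1 - C); carry set means no borrow."""
    diff = a - m - (1 - c)
    return diff & 0xFF, 1 if diff >= 0 else 0


def add3(n1, n2, n3, fold_last_carry):
    """Model of the column add. Assumption: X = 0 on entry.

    fold_last_carry=False models the ending 'stx high / sec / sbc high'.
    fold_last_carry=True  models the ending that runs 'adc #0' first.
    Returns (low, high).
    """
    x = 0
    a, c = adc(n1, n2, 0)        # clc / lda n1 / adc n2
    if c:                        # bcc s1 / inx
        x += 1
    a, c = adc(a, n3, c)         # s1: adc n3 (carry deliberately not cleared)
    if c:                        # bcc s2 / inx
        x += 1
    if fold_last_carry:
        a, c = adc(a, 0, c)      # s2: adc #0
    high = x                     # stx high
    a, c = sbc(a, high, 1)       # sec / sbc high
    return a, high               # sta low


def failures(fold_last_carry):
    """List every (n1, n2, n3) tried whose 16-bit result is wrong.

    The state after the first ADC depends only on n1 + n2, so one
    (n1, n2) pair per sum 0..510 is enough to reach every case.
    """
    bad = []
    for s12 in range(511):
        n1 = min(s12, 255)
        n2 = s12 - n1
        for n3 in range(256):
            low, high = add3(n1, n2, n3, fold_last_carry)
            if (high << 8) | low != n1 + n2 + n3:
                bad.append((n1, n2, n3))
    return bad


if __name__ == "__main__":
    for fold in (False, True):
        bad = failures(fold)
        totals = sorted({sum(t) for t in bad})
        print("adc #0 ending" if fold else "plain ending", len(bad), totals[:5])
```

Under these assumptions the model reports that the adc #0 ending goes wrong only when n1+n2+n3 is exactly 511 (for example 255+255+1, which comes out as $02FF instead of $01FF), while the plain ending goes wrong whenever the second adc carries. A handful of spot checks would very likely miss the 511 case, which is why I'd want an exhaustive sweep like this before trusting 72 hand-ordered additions.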