               setup  mults  adds  total
JackAsser's       74    116    43    233
mul1616
    lda mT1+0           ; 3
    sta zp_fl0          ; 3
    sta zp_fh0          ; 3
    eor #255            ; 2
    sta zp_gl0          ; 3
    sta zp_gh0          ; 3
    lda mT1+1           ; 3
    sta zp_fl1          ; 3
    sta zp_fh1          ; 3
    eor #255            ; 2
    sta zp_gl1          ; 3
    sta zp_gh1          ; 3
    clc                 ; 2
    ldy mT2+0           ; 3
    lda (zp_fl0),y      ; 5.5
    adc (zp_gl0),y      ; 5.5
    sta mRes+0          ; 3
    ldx #0              ; 2
    lda (zp_fh0),y      ; 5.5
    adc (zp_gh0),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_fl1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_gl1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    ldy mT2+1           ; 3
    adc (zp_fl0),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_gl0),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    sbc ofste_3f,x      ; 4
    sta mRes+1          ; 3
    txa                 ; 2
    ldx #$bf            ; 2
    adc (zp_fh0),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_gh0),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_fl1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_gl1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    ldy mT2+0           ; 3
    adc (zp_fh1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    adc (zp_gh1),y      ; 5.5
    bcc *+3             ; 2.5
    inx                 ; 1
    sbc ofste_80-$bf,x  ; 4
    sta mRes+2          ; 3
    txa                 ; 2
    ldy mT2+1           ; 3
    adc (zp_fh1),y      ; 5.5
    clc                 ; 2
    adc (zp_gh1),y      ; 5.5
    sta mRes+3          ; 3
    rts                 ; total=204.5

ofste_3f .byt $3f,$40,$41,$42,$43,$44
         .dsb <$bf-*,0
ofste_80 .byt $80,$81,$82,$83,$84,$85,$86
f=lambda x:x*x//4
g=lambda x:(0x4000-f(x-255))&0xffff
dumpArrayToA65(fo, "flo", [lo(f(i)) for i in range(512)])
dumpArrayToA65(fo, "fhi", [hi(f(i)) for i in range(512)])
dumpArrayToA65(fo, "glo", [lo(g(i)) for i in range(512)])
dumpArrayToA65(fo, "ghi", [hi(g(i)) for i in range(512)])
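For anyone wondering why the g table carries a $4000 bias and why the routine stores the operand EORed with 255 into the g pointers, here is a rough sanity check in Python. It assumes lo() and hi() just pick the low and high byte (they aren't shown above), and mul8() and check_all() are made-up names for illustration, not part of the generator. It only models a single 8x8 partial product, not the full 16x16 carry chain.

```python
# Sketch only: f and g are copied from the generator above.
# lo/hi, mul8 and check_all are assumptions made for this illustration.
f = lambda x: x * x // 4
g = lambda x: (0x4000 - f(x - 255)) & 0xffff

lo = lambda v: v & 0xff          # assumed: low byte of a 16-bit value
hi = lambda v: (v >> 8) & 0xff   # assumed: high byte of a 16-bit value

flo = [lo(f(i)) for i in range(512)]
fhi = [hi(f(i)) for i in range(512)]
glo = [lo(g(i)) for i in range(512)]
ghi = [hi(g(i)) for i in range(512)]

def mul8(a, b):
    """8x8 -> 16-bit product built from the four byte tables."""
    i = a + b            # f pointer low byte = a, indexed by y = b
    j = (a ^ 255) + b    # g pointer low byte = a eor 255, indexed by y = b
    fv = flo[i] | (fhi[i] << 8)   # f(a+b)
    gv = glo[j] | (ghi[j] << 8)   # $4000 - f(a-b)
    # f(a+b) + g(255-a+b) = a*b + $4000, so remove the bias at the end.
    return fv + gv - 0x4000

def check_all():
    """True if the tables reproduce a*b for every pair of bytes."""
    return all(mul8(a, b) == a * b for a in range(256) for b in range(256))
```

Because of the bias, both lookups can be combined with adc instead of a subtract; as far as I can tell, the ofste_3f / ofste_80 tables in the routine are where the accumulated $40 offsets (plus the counted carries) get taken back out.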
Same, I can easily take 4 off mine at the expense of ~36 more bytes of zero page, but I don't consider that elegant or worthwhile.