;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
    ;set multiplier as x0
    lda x0                ;comment out and call with A=x0 -2b3c
    sta p_sqr_lo
    sta p_sqr_hi
    eor #$ff
    sta p_invsqr_lo
    sta p_invsqr_hi       ;17

    ldy y0                ;comment out and call with Y=y0 -2b3c
    sec
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y   ;note these two lines taken as 11 total
    sta z0                ;x0*y0l comment out if you don't care about z0, -2b3c, OR
    lda (p_sqr_hi),y      ; ..replace with TAX -1b1c (but destroys X obviously)
    sbc (p_invsqr_hi),y
    sta c1a+1             ;x0*y0h;33
    ;c1a means column 1, row a (partial product to be added later)

    ldy y1
    ;sec                  ;notice that the high byte of sub above is always +ve
    lda (p_sqr_lo),y
    sbc (p_invsqr_lo),y
    sta c1b+1             ;x0*y1l
    lda (p_sqr_hi),y
    sbc (p_invsqr_hi),y
    tay                   ;Y=c2a;x0*y1h; 29 cycles

    ;17+33+29=79 cycles for main multiply part
do_adds:
    ;-add the first two numbers of column 1
    clc
c1a: lda #0
c1b: adc #0               ;A=z1 6 cycles

    ;-continue to column 2
    bcc fin               ;2+
    iny                   ;add carry
    ;Y=z2, 4-5 cycles (avoiding page boundary cross)
    ;6+(4+)=10-11 cycles for adder
    ;total 89-90 cycles
fin: rts

;Diagram of the additions for 16x8=24
;                y1    y0
;             x        x0
;                --------
;             x0y0h x0y0l
;+      x0y1h x0y1l
;------------------------
;          z2    z1    z0

;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c (*)
;   OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
;   suggests because you would lose the 2 cycles you gained with the JMP, BUT..
;   ..you could put the ENTIRE ROUTINE in zero page, -2b2c
;   AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
;   with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
;   extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
;   (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
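
A small Python reference model can help when checking the routine above against known products. It is only a sketch: the table contents are not shown in this file, so the layout below (quarter-square tables holding floor(n^2/4), with the second table indexed by (x0 EOR $ff)+y) is an assumption, and the names SQR, INVSQR, umult8 and umult8x16_model are made up for illustration. The column additions follow the 16x8=24 diagram.

```python
# Reference model only. ASSUMPTION: the tables hold quarter squares,
# floor(n*n/4); the real table layout is not shown in the source.
SQR = [(i * i) // 4 for i in range(511)]              # index x+y, 0..510
INVSQR = [((i - 255) ** 2) // 4 for i in range(511)]  # index (x^$ff)+y

def umult8(x, y):
    """8x8=16 product via x*y = f(x+y) - f(x-y), f(n) = floor(n*n/4)."""
    return SQR[x + y] - INVSQR[(x ^ 0xFF) + y]

def umult8x16_model(x0, y0, y1):
    """Return (z2, z1, z0) for x0 * (y1:y0), mirroring the diagram."""
    p0 = umult8(x0, y0)             # x0y0h:x0y0l
    p1 = umult8(x0, y1)             # x0y1h:x0y1l
    z0 = p0 & 0xFF                  # low byte goes straight to z0
    col1 = (p0 >> 8) + (p1 & 0xFF)  # c1a + c1b
    z1 = col1 & 0xFF
    z2 = (p1 >> 8) + (col1 >> 8)    # tay, then iny if column 1 carried
    return z2, z1, z0

if __name__ == "__main__":
    z2, z1, z0 = umult8x16_model(0xFF, 0xFF, 0xFF)
    print(f"${z2:02x}{z1:02x}{z0:02x}")
```

Comparing (z2 << 16) | (z1 << 8) | z0 with x0 * ((y1 << 8) | y0) over a range of inputs gives a quick sanity check for any of the tweaks listed above.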
   y1  y0
*  x1  x0

Calculation          Range
k=(y1-y0)*(x1-x0)    +-FE01
l=y1*x1              0-FE01
m=y0*x0              0-FE01
n=l+m-k              0-1FC02

l+m                  0-1FC02
l-k or m-k           +-FE01

x*y=l*65536+n*256+m (for 8 bit Xn/Yn)

Example (16x16 bits):
 y1 y0 -> $20 10
*x1 x0     40 30

k=(20-10)*(40-30)=100
l=20*40=800
m=10*30=300
n=800+300-100=a00

llll0000 -> 08000000
 nnnnn00     00a0000
    mmmm        0300
----------------
2010*4030 = 080a0300

x*y=l*4294967296+n*65536+m (for 16 bit Xn/Yn)

example with 16-bit values:
 2000 1000
 4000 3000
k=1000*1000=0100 0000
l=2000*4000=0800 0000
m=1000*3000=0300 0000
n=00A00 0000

 800 0000 0000 0000
      a00 0000 0000
           300 0000
------------------
 800 0a00 0300 0000

If multiplies are expensive, this is faster.
Estimating 32x32=64;
3x umult16 ~585
adds       ~94
total      ~679
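
The three-multiply scheme in the notes above can be sketched in Python to check the worked examples. The function name mul_3mult and the half_bits parameter are made up for illustration; k is allowed to go negative, as in the range table (+-FE01 for 8-bit halves).

```python
def mul_3mult(x, y, half_bits=8):
    """Multiply x*y using three half-width multiplies (k, l, m)."""
    mask = (1 << half_bits) - 1
    x1, x0 = x >> half_bits, x & mask
    y1, y0 = y >> half_bits, y & mask
    k = (y1 - y0) * (x1 - x0)  # signed cross term
    l = y1 * x1                # high halves
    m = y0 * x0                # low halves
    n = l + m - k              # equals y1*x0 + y0*x1
    return (l << (2 * half_bits)) + (n << half_bits) + m

if __name__ == "__main__":
    print(hex(mul_3mult(0x2010, 0x4030)))              # 16x16 example
    print(hex(mul_3mult(0x20001000, 0x40003000, 16)))  # 16-bit halves
```

With half_bits=8 this reproduces x*y=l*65536+n*256+m, and with half_bits=16 it reproduces x*y=l*4294967296+n*65536+m.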