        lda (p_sqr_hi),y
        sbc (p_invsqr_hi),y
        tay;x1*y1h;Y=z3, 30 cycles
do_adds:
;-add the first two numbers of column 1
        clc
c1a:    lda #0
c1b:    adc #0
        sta z1;9
;-continue to first two numbers of column 2
c2a:    lda #0
c2b:    adc #0
        tax;X=z2, 6 cycles
        bcc c1c;3/6 avg 4.5
        iny;z3++
        clc
;-add last number of column 1
c1c:    lda #0
        adc z1
        sta z1;8
;-add last number of column 2
        txa;A=z2
c2c:    adc #0
        tax;X=z2, 6
        bcc fin;3/4 avg 3.5
        iny;z3++
;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin:    rts
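
The carry handling in do_adds can be checked with a small model. The following is a minimal Python sketch, not part of the original routine: `adc` and the Python `do_adds` are hypothetical stand-ins that mirror the instructions one by one, with the six self-modified operands (c1a..c2c) and the incoming Y (z3) passed as arguments.

```python
def adc(a, m, carry):
    """6502 ADC in binary mode: returns (result byte, carry out)."""
    t = a + m + carry
    return t & 0xFF, t >> 8


def do_adds(c1a, c1b, c1c, c2a, c2b, c2c, y):
    """Model of do_adds: y is z3 on entry; returns (z3, z2, z1)."""
    z1, c = adc(c1a, c1b, 0)   # clc / lda c1a / adc c1b / sta z1
    x, c = adc(c2a, c2b, c)    # lda c2a / adc c2b / tax (takes column 1 carry)
    if c:                      # bcc c1c not taken
        y = (y + 1) & 0xFF     # iny, followed by clc
    z1, c = adc(c1c, z1, 0)    # lda c1c / adc z1 / sta z1 (carry is clear on both paths)
    x, c = adc(x, c2c, c)      # txa / adc c2c / tax
    if c:                      # bcc fin not taken
        y = (y + 1) & 0xFF     # iny
    return y, x, z1
```

Each column is summed in two steps so that at most one carry is produced per step; the two `iny` sites are the only places where column 2 can overflow into z3.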

        clc
        lda c1a
        adc c1b
        bcs c1bc
c1bc:   clc
        adc c1c         adc c1c
        sta z1          sta z1
        lda #1          lda c2a
        adc c2a         bcs c2ac;7
c2ac:   clc
        adc c2b         adc c2b         adc c2b

;y0 in X, y1 in Y, x0-1, x1-1 init to $02, all ram swapped in
x0=$fc;fb-fc is pointer
x1=$fe;fd-fe is pointer
        jmp (x0-1)
mult-k:
;multiply by constant multiplier k
;and multiplicands X, Y
;then add
        rts

;8x16=24 version by Strobe of Repose's original 16x16=32 "fastest multiplication"
;How to use: put numbers in x0/y0+y1 and result is Y reg (z2), A reg (z1), z0
;Clobbers Y, A but not X (in original form anyway)
umult8x16:
;set multiplier as x0
        lda x0 ;comment out and call with A=x0 -2b3c
        sta p_sqr_lo
        sta p_sqr_hi
        eor #$ff
        sta p_invsqr_lo
        sta p_invsqr_hi;17

        ldy y0 ;comment out and call with Y=y0 -2b3c
        sec
        lda (p_sqr_lo),y
        sbc (p_invsqr_lo),y;note these two lines taken as 11 total
        sta z0;x0*y0l
;comment out "sta z0" if you don't care about z0, -2b3c, OR
;..replace with TAX -1b1c (but destroys X obviously)
        lda (p_sqr_hi),y
        sbc (p_invsqr_hi),y
        sta c1a+1;x0*y0h;33
;c1a means column 1, row a (partial product to be added later)

        ldy y1
        ;sec ;notice that the high byte of sub above is always +ve
        lda (p_sqr_lo),y
        sbc (p_invsqr_lo),y
        sta c1b+1;x0*y1l
        lda (p_sqr_hi),y
        sbc (p_invsqr_hi),y
        tay ;Y=c2a;x0*y1h; 29 cycles

;17+33+29=79 cycles for main multiply part
do_adds:
;-add the first two numbers of column 1
        clc
c1a:    lda #0
c1b:    adc #0 ;A=z1 6 cycles

;-continue to column 2
        bcc fin ;2+
        iny ;add carry

;Y=z2, 4-5 cycles (avoiding page boundary cross)
;6+(4+)=10-11 cycles for adder
;total 89-90 cycles
fin:    rts

;Diagram of the additions for 16x8=24
;                y1    y0
;             x        x0
;                --------
;             x0y0h x0y0l
;+      x0y1h x0y1l
;------------------------
;          z2    z1    z0

;Possible tweaks:
;1. call with A=x0 and comment out "lda x0" -2b3c (*)
;2. call with Y=y0 and comment out "ldy y0" -2b3c
;3. if you don't need z0 (I didn't), comment out "sta z0" -2b3c (*)
;   OR replace with TAX -1b1c (but destroys X register obviously which was safe)
;4. There's no point having do_adds in ZP with a JMP like the 16x16 version
;   suggests because you would lose the 2 cycles you gained with the JMP, BUT..
;   ..you could put the ENTIRE ROUTINE in zero page, -2b2c
;   AND if you do that you might as well replace "lda x0", "ldy y0" and "ldy y1"
;   with immediate versions and point x0,y0 & y1 to the appropriate ZP spot for
;   extra -3c, so -2b5c combined.
;5. OR forget the ZP stuff and just in-line it, saving the 12 cycle JSR/RTS
;   (it's only $36 bytes)
;6. mix 1 and/or 2 and/or 3 with 4 or 5
;(*) these also apply to Repose' original 16x16=32 routine.
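
To sanity-check the arithmetic of the 8x16 routine, here is a rough Python model of it. It is a sketch under assumptions, not the author's code: the square tables are not shown in the listing, so the layout below (`SQR`, `NEGSQR`, both hypothetical names) is a guess at the usual quarter-square arrangement, and `sbc`, `mul8x8` and the Python `umult8x16` are illustrative helpers that follow the assembly step by step.

```python
# Assumed table layout (the tables themselves are not part of the listing):
#   SQR[i]    = floor(i^2 / 4)          read via p_sqr_*    at index x0 + y
#   NEGSQR[i] = floor((255 - i)^2 / 4)  read via p_invsqr_* at index (x0 eor $ff) + y
SQR = [(i * i) // 4 for i in range(511)]
NEGSQR = [((255 - i) ** 2) // 4 for i in range(511)]


def sbc(a, m, carry):
    """6502 SBC in binary mode: returns (result byte, carry out)."""
    t = a - m - (1 - carry)
    return t & 0xFF, 1 if t >= 0 else 0


def mul8x8(x, y, carry):
    """x*y by table lookup; returns (low byte, high byte, carry out)."""
    i = x + y                  # index into the square table
    j = (x ^ 0xFF) + y         # index into the "inverse" square table
    lo, carry = sbc(SQR[i] & 0xFF, NEGSQR[j] & 0xFF, carry)  # lda lo / sbc lo
    hi, carry = sbc(SQR[i] >> 8, NEGSQR[j] >> 8, carry)      # lda hi / sbc hi
    return lo, hi, carry


def umult8x16(x0, y0, y1):
    """Model of the routine: returns (z2, z1, z0) for x0 * (y1:y0)."""
    z0, c1a, carry = mul8x8(x0, y0, 1)      # sec, then x0*y0
    c1b, z2, carry = mul8x8(x0, y1, carry)  # no sec: carry is still set
    total = c1a + c1b                       # clc / lda c1a / adc c1b
    z1 = total & 0xFF
    if total > 0xFF:                        # bcc fin not taken
        z2 += 1                             # iny
    return z2, z1, z0
```

For example, `umult8x16(0x12, 0x34, 0x56)` models the product $12 * $5634. Under this table layout the carry returned by `mul8x8` is always 1 when it is entered with carry set, which is the property the commented-out `sec` relies on.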