[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 227

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 145 posts hidden. Click here to view all posts....

2022-01-07 17:52

Repose

Registered: Oct 2010
Posts: 227

This doesn't need to be an either/or; just something I'm evaluating. unless you're doing a factorial, I see most calculations using multiply followed by an add (polynomial evaluation, matrices), so I just want to see the tradeoff.
If adds and mults always come in pairs, my guess is that either system is about equal. If they are equal, I'd give the nod to separate sign as it gives more range.

2022-01-11 23:02

Repose

Registered: Oct 2010
Posts: 227

@ChristopherJam (does that work?)
I did the offset $4000 in g thing myself, but tried the zip-add approach. I'm afraid it's a disaster; your adds are basically 9.5 cycles while mine are 11.5:

        ;continue column 1
.c1_1:  lda #0 ;self-modded to contain high(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
        sta .c1_2+1 ;save partial
        ;11.5

This is missing the corrections (for $40 in the high bytes of g(x), or p_neg_sqr_hi_x in my notation)

umult16:
        ;unsigned 16x16 mult 
        ;inputs:
        ; x0 x1 
        ; y0 y1 
        ;outputs:
        ; A z2 z1 z0
        * = $c000
      
!zone multlib
   
;these are defaults, but should be set from the calling program
!addr {
  multlib_p_sqr_lo_x0 = $8b ;2 bytes
  multlib_p_sqr_hi_x0 = $8d ;2 bytes
  multlib_p_neg_sqr_lo_x0 = $a3 ;2 bytes
  multlib_p_neg_sqr_hi_x0 = $a5 ;2 bytes
  
  multlib_p_sqr_lo_x1 = $8b ;2 bytes
  multlib_p_sqr_hi_x1 = $8d ;2 bytes
  multlib_p_neg_sqr_lo_x1 = $a3 ;2 bytes
  multlib_p_neg_sqr_hi_x1 = $a5 ;2 bytes

  multlib_sqr_lo = $c000 ;511 bytes
  multlib_sqr_hi = $c200 ;511 bytes
  multlib_neg_sqr_lo = $c400 ;511 bytes
  multlib_neg_sqr_hi = $c600 ;511 bytes
}

  x0=$fb
  x1=$fc
  y0=$fd
  y1=$fe
  z0=$8000
  z1=$8001
  z2=$8002
  z3=$8003

;set multiplier as x0
        lda x0 ;3
        sta multlib_p_sqr_lo_x0 ;3
        sta multlib_p_sqr_hi_x0 ;3
        eor #$ff ;2
        sta multlib_p_neg_sqr_lo_x0 ;3
        sta multlib_p_neg_sqr_hi_x0 ;3; 3+3+3+2+3+3 = 17 cycles
        
;set multiplier as x1
        lda x1 ;3
        sta multlib_p_sqr_lo_x1 ;3
        sta multlib_p_sqr_hi_x1 ;3
        eor #$ff ;2
        sta multlib_p_neg_sqr_lo_x1 ;3
        sta multlib_p_neg_sqr_hi_x1 ;3; 3+3+3+2+3+3 = 17 cycles
        
        ;set multiplicand as y0
        ldy y0 ;3
        ;storing z3 in X, clear X
        ldx #0 ;2
        
        clc ;2
        ;x0_y1l =  low(x0*y1)
        ;x0_y1h = high(x0*y1)
        ;zip add all mult8's directly
        ;adding requires offset $4000 in multlib_p_neg_sqr table and compensation
        ;7
        
        ;column 0
        lda (multlib_p_sqr_lo_x0),y ;low(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;11 cycles low(x0*y0) g
        sta z0 ;multiplier * Y, low byte, 3 cycles
        ;need correction factor for g in offset $400
        ;14
        
        ;continue to column 1
        adc (multlib_p_sqr_hi_x0),y ;high(x0*y0) f
        sta .c1_1+1 ;partial addition result of column 1 (self-mod to be continued)
        ;9.5
        
        ;continue to column 2
        adc (multlib_p_sqr_hi_x1),y ;high(x1*y0) f
        sta .c2_1+1 ;save partial
        bcc .c1_1
        inx ;inc z3
        clc ;3/6=4.5
        ;14
        
        ;continue column 1
.c1_1:  lda #0 ;self-modded to contain high(x0*y0) f
        adc (multlib_p_neg_sqr_lo_x0),y ;high(x0*y0) g
        sta .c1_2+1 ;save partial
        ;11.5
        
        ;continue column 2
.c2_1:  lda #0 ;self-modded to contain high(x1*y0) f
        adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y0) g
        sta .c2_2+1 ;save partial
        bcc .c1_2
        inx ;inc z3
        clc ;4.5
        ;16
        
        ;continue column 1
.c1_2:  lda #0 ;self-modded to contain high(x0*y0) g
        adc (multlib_p_sqr_lo_x1),y ;low(x1*y0) f
        sta .c1_3 ;save partial
        ;11.5
        
        ;continue column 2
        ;set multiplicand as y1
        ldy y1 ;3
.c2_2:  lda #0 ;self-modded to contain high(x1*y0) g
        adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) f
        sta .c2_3+1 ;save partial
        bcc .c1_3
        inx ;inc z3
        clc ;4.5
        ;19
        
        ;continue column 1 (extra y0)
        ldy y0 ;3
.c1_3:  lda #0 ;self-modded to contain low(x1*y0) f
        adc (multlib_p_neg_sqr_lo_x1),y ;low(x1*y0) g
        sta .c1_4+1 ;save partial
        ;14.5
        
        ;continue column 2
        ldy y1 ;3
.c2_3:  lda #0 ;self-modded to contain high(x0*y1) f
        adc (multlib_p_sqr_hi_x1),y ;high(x0*y1) g
        sta .c2_4+1
        bcc .c1_4
        inx ;inc z3
        clc ;4.5
        ;19
        
        ;continue column 1
.c1_4:  lda #0 ;self-modded to contain low(x1*y0) g
        adc (multlib_p_sqr_lo_x0),y ;low(x0*y1) f
        sta .c1_5+1 ;save partial
        ;11.5

        ;continue column 2
.c2_4:  lda #0 ;self-modded to contain high(x0*y1) g
        adc (multlib_p_sqr_hi_x1),y ;low(x1*y1) f
        sta .c2_5+1 ;save partal
        bcc .c1_5
        inx ;inc z3
        clc ;4.5
        ;16
        
        ;finish column 1
.c1_5:  lda #0 ;self-modded to contain low(x0*y1) f
        adc (multlib_p_neg_sqr_lo_x0),y ;low(x0*y1) g
        sta z1 ;need correction factor for g in offset $400
        ;10.5
        
        ;finish column 2
.c2_5:  lda #0 ;self-modded to contain low(x1*y1) f
        adc (multlib_p_neg_sqr_hi_x1),y ;low(x1*y1) g
        sta z2 ;need correction factor for g in offset $400
        ;10.5
        
        ;column 3
        txa ;carries from c2
        adc (multlib_p_sqr_hi_x1),y ;high(x1*y1) f
        adc (multlib_p_neg_sqr_hi_x1),y ;high(x1*y1) g
        ;need correction factor for g in offset $400
        ;A=z3
        ;13
        ;=231.5

fin:    rts

2022-01-11 23:05

Repose

Registered: Oct 2010
Posts: 227

A guide to the order of additions (optimized for changes in y):

                y1    y0
             x  x1    x0
                --------
             x0y0h x0y0l
+      x1y0h x1y0l
+      x0y1h x0y1l
+x1y1h x1y1l
------------------------
    z3    z2    z1    z0

2022-01-19 11:30

ChristopherJam

Registered: Aug 2004
Posts: 1423

Ah yes, saving those partials is expensive.

Apologies for the delay - while @ did nothing, I did see the post, but then I had to find the time to reread the last 129 comments here..

I had a few ideas for optimizations while I was failing to get around to reading back, but it looks like I either acknowledged or attempted all of them back in 2017. Excellent point you had back then about it being well worth removing the CLCs for the taller columns.

One new thought I did have was to replace the bcc/inx combo with a ROL of some value read further down, that would simultaneously clear carry, and accumulate a '1' into an index into a table that would count the number of 1 bits. It could even a table selector (if the ROL was of the high byte of a table address) allowing the correction to be incorporated into one of the square table lookups.

However, a non-accumulator ROL is always at least 5 cycles, which is slower than either path through bcc/inx, so I suspect this one's a no go.

2022-01-30 11:51

Repose

Registered: Oct 2010
Posts: 227

I've been going all over the place with distractions, but, I'm exploring a completely new number system which has potential. As I mentioned in my big list, there's many more ways to do math than just 2's complement. Here is a log based arithmetic system, where multiply is as fast as an add, and add is a bit slower:

# -*- coding: utf-8 -*-
"""
Created on Fri Jan 28 22:46:53 2022

@author: Repose
A module to analyze arithmetic via logs
ref.
https://www.siranah.de/html/sail042q.htm
"""

#imports
import sys
from math import log2

#functions/classes
scale=256/log2(256) #8-bit range
lo=lambda x:x&255 #util for creating tables (for future development)
hi=lambda x:(x>>8)
logadd=lambda x:round(log2(1+1/pow(2,x/scale))*scale) #scaled table to add two logs
logsub=lambda x:round(log2(1+pow(2,x/scale))*scale) #scaled table to sub two logs
exp=lambda x:round(pow(2,x/scale)) #scaled exponential 2^x table
lg=lambda x:round(log2(x)*scale) #scaled log2(x) table

def mult(x,y):
    '''Multiply two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate product
    '''
    return exp(lg(x)+lg(y))

def add(x,y):
    '''Add two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate sum
    '''
    return exp(logadd(lg(y)-lg(x))+lg(y))

def output_table(fmt, fname, name, addr, tab, start, end):
    '''Output a table as an assembly include file

    Keyword arguments:
        fmt -- string, "ACME"
        fname -- string, filename of output file
        name -- string, table name
        addr -- integer start address of table
        tab -- generator function
        start -- integer start index value
        end -- integer end index value + 1
    Returns:
        0 (success)
        5 (failed to open file)
    '''
    try:
        with open(fname, 'w') as f:
            print(f'writing {fname}')
            
            if fmt.lower()=="acme":
                fmt_data='!by '
            elif fmt.lower()=="dasm":
                fmt_data='dc.b '
            else:
                print(f'unknown assembler format {fmt}')
                return 1 #traditional DOS incorrect function

            #header
            f.write(f'{name}=${addr:04X}\n')
            step=16
            for i in range(start,end,step):
                #line start
                f.write(fmt_data)
                k=i+step if i+step<end else end
                first_invalid=True
                first_overflow=True
                for j in range(i,k):
                    try:
                        val=tab(j)
                    except ValueError:
                        val=0
                        if first_invalid:
                            print(f'warning: invalid table value replaced with 0 at index {j}')
                            first_invalid=False
                    if val>0xff:
                        val=val & 0xff
                        if first_overflow:
                            print(f'warning: table value overflowed at index {j}')
                            first_overflow=False
                    f.write(f'${val:02X}')
                    if j!=k-1:
                        f.write(',')
                    else:
                        f.write('\n')

    except:
            print(f"Couldn't open {fname} for writing")
            return 5 #traditional DOS error code

#main
def main():
    '''Analyze arithmetic of two integers via their logs

    Keyword arguments:
        none
    Returns:
        0 (success)
    '''
    print('Guassian logs analysis\n')
    print('accuracy test\n')
    
    #should write a helper class for this
    erm=0 #maximum multiply error
    era=0 #maximum addition error
    
    for x in range(1,256):
        for y in range(x, 256):
            
            er=mult(x,y)-x*y
            if er>erm:
                erm=er
                pxm=x
                pym=y
                
            er=add(x,y)-(x+y)
            if er>era:
                era=er
                pxa=x
                pya=y
    
    #requires python 3.6+ due to f strings
    print('multiply')
    print(f'max err was {erm} ({erm/(pxm*pym)*100:.2f}%) at x={pxm} and y={pym}')
    print(f'{pxm}*{pym}={pxm*pym}, mult({pxm},{pym})={mult(pxm,pym)}\n')

    print('add')
    print(f'max err was {era} ({era/(pxa+pya)*100:.2f}%) at x={pxa} and y={pya}')
    print(f'{pxa}+{pya}={pxa+pya}, add({pxa},{pya})={add(pxa,pya)}\n')
    
    #write tables
    print('writing tables')
    tables=[{"func":lg,"name":"logtab"}, {"func":exp,"name":"exptab"}, {"func":logadd,"name":"logadd"}]
    addr=0xc000
    assembler='acme'
    for table in tables:
        #output_table(fmt, fname, name, addr, tab, start, end+1)
        output_table(assembler, table["name"]+'.asm', table["name"], addr, table["func"], 0, 256)
        addr+=0x100


if __name__ == '__main__':
	sys.exit(main())

writing logtab.asm
warning: invalid table value replaced with 0 at index 0
warning: table value overflowed at index 254
writing exptab.asm
writing logadd.asm

Main.asm to come...

2022-01-30 12:10

Krill

Registered: Apr 2002
Posts: 3083

The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

2022-01-30 17:26

JackAsser

Registered: Jun 2002
Posts: 2038

Quote: The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

Iirc WVL used a very controlled 8-bit sin/asin for the maths in Pinball Dreams even

2022-02-01 07:32

Repose

Registered: Oct 2010
Posts: 227

Yes, log has been used for divide in a C=Hacking article. In this case, I can add without converting from log and back when switching between add/sub and mult/div. Consider finding the escape iteration for a Mandelbrot set; you never need to know the values, just when the log proxy meets the test zr*zr-zi*zi>4. I'm hoping this can be faster and more accurate.

2022-02-01 08:10

Oswald

Registered: Apr 2002
Posts: 5126

log div is good enough for lines. (unless you're goind for subpix:)

2023-08-22 05:44

Repose

Registered: Oct 2010
Posts: 227

It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed by previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).

;This section performs the 4 sub-multiplies to form
;  the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
;  to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1          ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3      ;32
+mult8_set_mier_snippet x0          ;17
+mult8_snippet "Y", x0y1l+1, "X"    ;28
+mult8_snippet y0, z0, "A"          ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3           z2           z1
                           clc
                     x0y1l adc #0 ;x0y0h + x0y1l
                           tay ;6
              txa
        x1y0h adc #0 ;x0y1h + x1y0h
              tax
              bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
                         + tya
                     x1y0l adc #0 ;+ x1y0l
                           sta z1 ;7
              txa
        x1y1l adc #0 ;+ x1y1l
              bcc done ;7
inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).

Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

GI-Joe/MYD!
tlr
bugjam
TheRyk/MYD!
Brittle/Dentifrice^(?)
CreaMD/React
anonym/padua
iAN CooG/HVSC
grass/LETHARGY
Guests online: 285

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Codeboys & Endians  (9.7)
4 Mojo  (9.6)
5 Coma Light 13  (9.6)
6 Edge of Disgrace  (9.6)
7 Signal Carnival  (9.6)
8 Uncensored  (9.5)
9 Wonderland XIV  (9.5)
10 No Bounds  (9.5)

Top onefile Demos

1 Nine  (9.7)
2 Layers  (9.6)
3 Cubic Dream  (9.6)
4 Party Elk 2  (9.6)
5 Copper Booze  (9.5)
6 Scan and Spin  (9.5)
7 Onscreen 5k  (9.5)
8 Grey  (9.5)
9 Dawnfall V1.1  (9.5)
10 Rainbow Connection  (9.5)

Top Groups

1 Artline Designs  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Performers  (9.3)
5 Censor Design  (9.3)

Top Musicians

1 Rob Hubbard  (9.7)
2 Jeroen Tel  (9.7)
3 Stinsen  (9.7)
4 LMan  (9.7)
5 Linus  (9.6)

Page generated in: 0.154 sec.