[CSDb] - User Forums - Fast large multiplies

Welcome to our latest new user Nicron ! (Registered 2024-05-21)

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 222

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2022-01-30 11:51

Repose

Registered: Oct 2010
Posts: 222

I've been going all over the place with distractions, but, I'm exploring a completely new number system which has potential. As I mentioned in my big list, there's many more ways to do math than just 2's complement. Here is a log based arithmetic system, where multiply is as fast as an add, and add is a bit slower:

# -*- coding: utf-8 -*-
"""
Created on Fri Jan 28 22:46:53 2022

@author: Repose
A module to analyze arithmetic via logs
ref.
https://www.siranah.de/html/sail042q.htm
"""

#imports
import sys
from math import log2

#functions/classes
scale=256/log2(256) #8-bit range
lo=lambda x:x&255 #util for creating tables (for future development)
hi=lambda x:(x>>8)
logadd=lambda x:round(log2(1+1/pow(2,x/scale))*scale) #scaled table to add two logs
logsub=lambda x:round(log2(1+pow(2,x/scale))*scale) #scaled table to sub two logs
exp=lambda x:round(pow(2,x/scale)) #scaled exponential 2^x table
lg=lambda x:round(log2(x)*scale) #scaled log2(x) table

def mult(x,y):
    '''Multiply two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate product
    '''
    return exp(lg(x)+lg(y))

def add(x,y):
    '''Add two integers via their logs

    Keyword arguments:
        x,y -- integers 0 to 255
    Returns:
        The approximate sum
    '''
    return exp(logadd(lg(y)-lg(x))+lg(y))

def output_table(fmt, fname, name, addr, tab, start, end):
    '''Output a table as an assembly include file

    Keyword arguments:
        fmt -- string, "ACME"
        fname -- string, filename of output file
        name -- string, table name
        addr -- integer start address of table
        tab -- generator function
        start -- integer start index value
        end -- integer end index value + 1
    Returns:
        0 (success)
        5 (failed to open file)
    '''
    try:
        with open(fname, 'w') as f:
            print(f'writing {fname}')
            
            if fmt.lower()=="acme":
                fmt_data='!by '
            elif fmt.lower()=="dasm":
                fmt_data='dc.b '
            else:
                print(f'unknown assembler format {fmt}')
                return 1 #traditional DOS incorrect function

            #header
            f.write(f'{name}=${addr:04X}\n')
            step=16
            for i in range(start,end,step):
                #line start
                f.write(fmt_data)
                k=i+step if i+step<end else end
                first_invalid=True
                first_overflow=True
                for j in range(i,k):
                    try:
                        val=tab(j)
                    except ValueError:
                        val=0
                        if first_invalid:
                            print(f'warning: invalid table value replaced with 0 at index {j}')
                            first_invalid=False
                    if val>0xff:
                        val=val & 0xff
                        if first_overflow:
                            print(f'warning: table value overflowed at index {j}')
                            first_overflow=False
                    f.write(f'${val:02X}')
                    if j!=k-1:
                        f.write(',')
                    else:
                        f.write('\n')

    except:
            print(f"Couldn't open {fname} for writing")
            return 5 #traditional DOS error code

#main
def main():
    '''Analyze arithmetic of two integers via their logs

    Keyword arguments:
        none
    Returns:
        0 (success)
    '''
    print('Guassian logs analysis\n')
    print('accuracy test\n')
    
    #should write a helper class for this
    erm=0 #maximum multiply error
    era=0 #maximum addition error
    
    for x in range(1,256):
        for y in range(x, 256):
            
            er=mult(x,y)-x*y
            if er>erm:
                erm=er
                pxm=x
                pym=y
                
            er=add(x,y)-(x+y)
            if er>era:
                era=er
                pxa=x
                pya=y
    
    #requires python 3.6+ due to f strings
    print('multiply')
    print(f'max err was {erm} ({erm/(pxm*pym)*100:.2f}%) at x={pxm} and y={pym}')
    print(f'{pxm}*{pym}={pxm*pym}, mult({pxm},{pym})={mult(pxm,pym)}\n')

    print('add')
    print(f'max err was {era} ({era/(pxa+pya)*100:.2f}%) at x={pxa} and y={pya}')
    print(f'{pxa}+{pya}={pxa+pya}, add({pxa},{pya})={add(pxa,pya)}\n')
    
    #write tables
    print('writing tables')
    tables=[{"func":lg,"name":"logtab"}, {"func":exp,"name":"exptab"}, {"func":logadd,"name":"logadd"}]
    addr=0xc000
    assembler='acme'
    for table in tables:
        #output_table(fmt, fname, name, addr, tab, start, end+1)
        output_table(assembler, table["name"]+'.asm', table["name"], addr, table["func"], 0, 256)
        addr+=0x100


if __name__ == '__main__':
	sys.exit(main())

writing logtab.asm
warning: invalid table value replaced with 0 at index 0
warning: table value overflowed at index 254
writing exptab.asm
writing logadd.asm

Main.asm to come...

2022-01-30 12:10

Krill

Registered: Apr 2002
Posts: 2854

The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

2022-01-30 17:26

JackAsser

Registered: Jun 2002
Posts: 1993

Quote: The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

Iirc WVL used a very controlled 8-bit sin/asin for the maths in Pinball Dreams even

2022-02-01 07:32

Repose

Registered: Oct 2010
Posts: 222

Yes, log has been used for divide in a C=Hacking article. In this case, I can add without converting from log and back when switching between add/sub and mult/div. Consider finding the escape iteration for a Mandelbrot set; you never need to know the values, just when the log proxy meets the test zr*zr-zi*zi>4. I'm hoping this can be faster and more accurate.

2022-02-01 08:10

Oswald

Registered: Apr 2002
Posts: 5028

log div is good enough for lines. (unless you're goind for subpix:)

2023-08-22 05:44

Repose

Registered: Oct 2010
Posts: 222

It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed by previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).

;This section performs the 4 sub-multiplies to form
;  the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
;  to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1          ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3      ;32
+mult8_set_mier_snippet x0          ;17
+mult8_snippet "Y", x0y1l+1, "X"    ;28
+mult8_snippet y0, z0, "A"          ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3           z2           z1
                           clc
                     x0y1l adc #0 ;x0y0h + x0y1l
                           tay ;6
              txa
        x1y0h adc #0 ;x0y1h + x1y0h
              tax
              bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
                         + tya
                     x1y0l adc #0 ;+ x1y0l
                           sta z1 ;7
              txa
        x1y1l adc #0 ;+ x1y1l
              bcc done ;7
inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).

2023-08-22 06:09

Repose

Registered: Oct 2010
Posts: 222

A diagram of the calculations:

               y1    y0
            x  x1    x0
               --------
            x0y0h x0y0l
      x0y1h x0y1l
      x1y0h x1y0l
x1y1h x1y1l
-----------------------
   z3    z2    z1    z0

A listing of the macros is quite long, as a lot of effort went into parsing the arguments the way I'm using them, plus some other features in my libary. Requires the latest C64 Studio 7.5, as there were a number of features I had added. Here is one of them:

!macro mult8_set_mier_snippet .mier {
;set multiplier as mier
;mier can be m/A
;requires .p_.sqr* and .p_.neg.sqr* pointers,
;(mathlib.p_sqr_lo, mathlib.p_neg_sqr_lo, mathlib.p_sqr_hi, mathlib.p_neg_sqr_hi)
;uses these macros from mathlib_util_macros.asm:
;reset_cycles, add_cycles_const
  !if .mier = "X" or .mier = "Y" {
    !error "multlib: mult8_snippet: mier must be m/A, not ", .mier
  } 
  +reset_cycles
  !if .mier != "A" {
    lda .mier
    !if .mier<$100 {
        +add_cycles_const 3
      } else {
        +add_cycles_const 4
      }
  }
  sta mathlib.p_sqr_lo ;3
  sta mathlib.p_sqr_hi ;3
  eor #$ff ;2
  sta mathlib.p_neg_sqr_lo ;3
  sta mathlib.p_neg_sqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
  +add_cycles_const 14
}

Here is the order of multiplies I used:

mier = x1
(x1) * y0 -> x1y0l, x1y0h
(x1) * y1 -> x1y1l, z3
mier = x0
(x0) * (y1) -> x0y1l, X
(x0) * y0 -> z0, A

2023-08-22 13:32

Frantic

Registered: Mar 2003
Posts: 1629

Very good!

2023-08-23 04:05

ChristopherJam

Registered: Aug 2004
Posts: 1381

Nice work, Repose!

Going to have to take another run at this one of these days..

2023-12-11 09:35

Repose

Registered: Oct 2010
Posts: 222

I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

Wisdom/Crescent
tlr
iceout/Avatar/HF
Hoild/Ultimate Newco..
Paul Bearer
Nicron
psenough
Low Spirit
Acidchild/Padua
Linus/MSL
DonChaos
Guests online: 105

Top Demos

1 Next Level  (9.8)
2 13:37  (9.7)
3 Coma Light 13  (9.7)
4 Edge of Disgrace  (9.6)
5 Mojo  (9.6)
6 Comaland 100%  (9.6)
7 Uncensored  (9.6)
8 No Bounds  (9.6)
9 Bromance  (9.5)
10 Wonderland XII  (9.5)

Top onefile Demos

1 Layers  (9.6)
2 Cubic Dream  (9.6)
3 Party Elk 2  (9.6)
4 Copper Booze  (9.6)
5 TRSAC, Gabber & Pebe..  (9.5)
6 Rainbow Connection  (9.5)
7 It's More Fun to Com..  (9.5)
8 Dawnfall V1.1  (9.5)
9 Quadrants  (9.5)
10 Daah, Those Acid Pil..  (9.5)

Top Groups

1 Oxyron  (9.3)
2 Booze Design  (9.3)
3 Censor Design  (9.3)
4 Crest  (9.3)
5 Performers  (9.3)

Top Swappers

1 Derbyshire Ram  (10)
2 Jerry  (9.8)
3 Violator  (9.8)
4 Acidchild  (9.7)
5 Starlight  (9.6)

Page generated in: 0.048 sec.