[CSDb] - User Forums - Fast large multiplies

Welcome to our latest new user lotus_skylight ! (Registered 2024-09-25)

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 225

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2022-01-30 12:10

Krill

Registered: Apr 2002
Posts: 2940

The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

2022-01-30 17:26

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: The log/exp approach isn't new on this system, though. =)

It's especially useful in downscaling operations, where either of the factors is in [0..1).

Some care needs to be taken to minimise table size for an acceptable accuracy.

Iirc WVL used a very controlled 8-bit sin/asin for the maths in Pinball Dreams even

2022-02-01 07:32

Repose

Registered: Oct 2010
Posts: 225

Yes, log has been used for divide in a C=Hacking article. In this case, I can add without converting from log and back when switching between add/sub and mult/div. Consider finding the escape iteration for a Mandelbrot set; you never need to know the values, just when the log proxy meets the test zr*zr-zi*zi>4. I'm hoping this can be faster and more accurate.

2022-02-01 08:10

Oswald

Registered: Apr 2002
Posts: 5076

log div is good enough for lines. (unless you're goind for subpix:)

2023-08-22 05:44

Repose

Registered: Oct 2010
Posts: 225

It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed by previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).

;This section performs the 4 sub-multiplies to form
;  the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
;  to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1          ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3      ;32
+mult8_set_mier_snippet x0          ;17
+mult8_snippet "Y", x0y1l+1, "X"    ;28
+mult8_snippet y0, z0, "A"          ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3           z2           z1
                           clc
                     x0y1l adc #0 ;x0y0h + x0y1l
                           tay ;6
              txa
        x1y0h adc #0 ;x0y1h + x1y0h
              tax
              bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
                         + tya
                     x1y0l adc #0 ;+ x1y0l
                           sta z1 ;7
              txa
        x1y1l adc #0 ;+ x1y1l
              bcc done ;7
inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).

2023-08-22 06:09

Repose

Registered: Oct 2010
Posts: 225

A diagram of the calculations:

               y1    y0
            x  x1    x0
               --------
            x0y0h x0y0l
      x0y1h x0y1l
      x1y0h x1y0l
x1y1h x1y1l
-----------------------
   z3    z2    z1    z0

A listing of the macros is quite long, as a lot of effort went into parsing the arguments the way I'm using them, plus some other features in my libary. Requires the latest C64 Studio 7.5, as there were a number of features I had added. Here is one of them:

!macro mult8_set_mier_snippet .mier {
;set multiplier as mier
;mier can be m/A
;requires .p_.sqr* and .p_.neg.sqr* pointers,
;(mathlib.p_sqr_lo, mathlib.p_neg_sqr_lo, mathlib.p_sqr_hi, mathlib.p_neg_sqr_hi)
;uses these macros from mathlib_util_macros.asm:
;reset_cycles, add_cycles_const
  !if .mier = "X" or .mier = "Y" {
    !error "multlib: mult8_snippet: mier must be m/A, not ", .mier
  } 
  +reset_cycles
  !if .mier != "A" {
    lda .mier
    !if .mier<$100 {
        +add_cycles_const 3
      } else {
        +add_cycles_const 4
      }
  }
  sta mathlib.p_sqr_lo ;3
  sta mathlib.p_sqr_hi ;3
  eor #$ff ;2
  sta mathlib.p_neg_sqr_lo ;3
  sta mathlib.p_neg_sqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
  +add_cycles_const 14
}

Here is the order of multiplies I used:

mier = x1
(x1) * y0 -> x1y0l, x1y0h
(x1) * y1 -> x1y1l, z3
mier = x0
(x0) * (y1) -> x0y1l, X
(x0) * y0 -> z0, A

2023-08-22 13:32

Frantic

Registered: Mar 2003
Posts: 1641

Very good!

2023-08-23 04:05

ChristopherJam

Registered: Aug 2004
Posts: 1403

Nice work, Repose!

Going to have to take another run at this one of these days..

2023-12-11 09:35

Repose

Registered: Oct 2010
Posts: 225

I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

2023-12-11 10:00

Raistlin

Registered: Mar 2007
Posts: 625

Very nice write up, Repose!