[CSDb] - User Forums - Fast large multiplies

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast large multiplies

2012-06-09 19:45

Repose

Registered: Oct 2010
Posts: 225

Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

... 144 posts hidden. Click here to view all posts....

2023-08-22 05:44

Repose

Registered: Oct 2010
Posts: 225

It's 2023 - is your 16x16=32 bit unsigned multiply under 190 cycles yet? What? No?!? Let me help you out buddy!
Much to my surprise, I have surpassed by previous best by 10.5 cycles or 5%, when I thought it couldn't be optimized further. How is this possible you ask?
The first trick had to do with how cycles are counted. As a shortcut, we usually average branches/boundary crossings. However, some carries happen only 7% of the time, so optimizing for the average case actually slows you down. You should be optimizing for the no carry case.
The other trick was playing with using registers to accumulate results. It wasn't as simple as it sounds; the order of multiplies had to be optimized to leave a target value in the A register.

The goal: multiply the two 16-bit numbers in x0, x1 and y0, y1 (low bytes and high bytes) and return the results in the fastest way possible (which may leave some results in registers, preferably in a useful way like the high bytes). I measure the timing in this convention to expose a timing which could be used in-line, as is often the case for demo effects. In this case, the 3rd byte is returned in A, which could be useful for a multiple accumulate where only the high 16-bits are kept.

The time is 188.1 cycles, averaged over all possible inputs. The (accurate) time of my version on codebase is 198.6

To be clear, z=x*y or (z3, z2, z1, z0) = (x1, x0) * (y1, y0).

;This section performs the 4 sub-multiplies to form
;  the partials to be added later in self-mod code.
;Addresses like "x1y0l+1" refer to low(x1*y0) stored
;  to the argument of a later "adc #{value}"
;mult8_set_mier_snippet {multiplier}
;mult8_snippet {multiplicand}, {low result}, {high result}
+mult8_set_mier_snippet x1          ;17
+mult8_snippet y0, x1y0l+1, x1y0h+1 ;35
+mult8_snippet y1, x1y1l+1, z3      ;32
+mult8_set_mier_snippet x0          ;17
+mult8_snippet "Y", x0y1l+1, "X"    ;28
+mult8_snippet y0, z0, "A"          ;28
;results in X=x0y1h and A=x0y0h
;multiply part total: 149-165, average 157
;z3           z2           z1
                           clc
                     x0y1l adc #0 ;x0y0h + x0y1l
                           tay ;6
              txa
        x1y0h adc #0 ;x0y1h + x1y0h
              tax
              bcc + ;9
inc z3
clc ;(+6) taken 7% of the time
                         + tya
                     x1y0l adc #0 ;+ x1y0l
                           sta z1 ;7
              txa
        x1y1l adc #0 ;+ x1y1l
              bcc done ;7
inc z3 ;(+4) taken 42% of the time

Column 1 takes 13 cycles, column 2 takes 16 cycles, and column 3 takes 2.1 cycles.
The total time to add the partial products is 29-39 with an average of 31.1.
The results are returned as a 32-bit result:
z3, A(=z2), z1, z0.
The total time is 156.96875+31.1=188.06875 (178-204).

2023-08-22 06:09

Repose

Registered: Oct 2010
Posts: 225

A diagram of the calculations:

               y1    y0
            x  x1    x0
               --------
            x0y0h x0y0l
      x0y1h x0y1l
      x1y0h x1y0l
x1y1h x1y1l
-----------------------
   z3    z2    z1    z0

A listing of the macros is quite long, as a lot of effort went into parsing the arguments the way I'm using them, plus some other features in my libary. Requires the latest C64 Studio 7.5, as there were a number of features I had added. Here is one of them:

!macro mult8_set_mier_snippet .mier {
;set multiplier as mier
;mier can be m/A
;requires .p_.sqr* and .p_.neg.sqr* pointers,
;(mathlib.p_sqr_lo, mathlib.p_neg_sqr_lo, mathlib.p_sqr_hi, mathlib.p_neg_sqr_hi)
;uses these macros from mathlib_util_macros.asm:
;reset_cycles, add_cycles_const
  !if .mier = "X" or .mier = "Y" {
    !error "multlib: mult8_snippet: mier must be m/A, not ", .mier
  } 
  +reset_cycles
  !if .mier != "A" {
    lda .mier
    !if .mier<$100 {
        +add_cycles_const 3
      } else {
        +add_cycles_const 4
      }
  }
  sta mathlib.p_sqr_lo ;3
  sta mathlib.p_sqr_hi ;3
  eor #$ff ;2
  sta mathlib.p_neg_sqr_lo ;3
  sta mathlib.p_neg_sqr_hi ;3; 3+3+3+2+3+3 = 17 cycles
  +add_cycles_const 14
}

Here is the order of multiplies I used:

mier = x1
(x1) * y0 -> x1y0l, x1y0h
(x1) * y1 -> x1y1l, z3
mier = x0
(x0) * (y1) -> x0y1l, X
(x0) * y0 -> z0, A

2023-08-22 13:32

Frantic

Registered: Mar 2003
Posts: 1648

Very good!

2023-08-23 04:05

ChristopherJam

Registered: Aug 2004
Posts: 1409

Nice work, Repose!

Going to have to take another run at this one of these days..

2023-12-11 09:35

Repose

Registered: Oct 2010
Posts: 225

I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

2023-12-11 10:00

Raistlin

Registered: Mar 2007
Posts: 680

Very nice write up, Repose!

2023-12-11 12:51

Frantic

Registered: Mar 2003
Posts: 1648

Quote: I can now support my claim with some confidence that I have achieved a world record for a published 16x16=32 unsigned mult.

Even my old routine has won out of 120 routines tested independently.

Take a look at the entry mult15.a here:
https://github.com/TobyLobster/multiply_test#the-results

This was a thrilling moment for me :)

But in general, its interesting to read about the various approaches, and to be fair, if you need a smaller routine, there's better choices.

Congratulations! I totally get that it must have been thrilling. :D

2023-12-11 13:57

Krill

Registered: Apr 2002
Posts: 2980

Quoting Repose

The time is 188.1 cycles, averaged over all possible inputs.

Pretty impressive. :)

It seems that most of the required memory of about 2 KB is spent by tables (which are generated), not so much the code itself, right?

If so, i guess this 16x16=32 routine could come in quite handy for something like https://github.com/exoticorn/upkr which still doesn't seem to have a C-64/plain 6502 unpacker (while 6502 unpackers using hardware-accelerated mul exist).

2023-12-11 14:09

Repose

Registered: Oct 2010
Posts: 225

The code seems to be 120 bytes and the tables are 2044 bytes. I think I need to add some 30 more to create the tables.

2023-12-12 09:31

JackAsser

Registered: Jun 2002
Posts: 2014

Very very nice write up @Repose!

Previous - 1 | ... | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 - Next

Refresh

Subscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search for in
All times are CET.

Search CSDb

Advanced

Users Online

Chesser/Blazon
Impetigo/Crescent
LordNikon/Dekadence
DJ Gruby/TRiAD
Krill/Plush
Acidchild/Padua
katon/Lepsi De
Unlock/Padua/Albion
Guests online: 99

Top Demos

1 Next Level  (9.7)
2 13:37  (9.7)
3 Mojo  (9.7)
4 Coma Light 13  (9.6)
5 Edge of Disgrace  (9.6)
6 What Is The Matrix 2  (9.6)
7 The Demo Coder  (9.6)
8 Uncensored  (9.6)
9 Comaland 100%  (9.6)
10 Wonderland XIV  (9.6)

Top onefile Demos

1 Layers  (9.6)
2 No Listen  (9.6)
3 Party Elk 2  (9.6)
4 Cubic Dream  (9.6)
5 Copper Booze  (9.6)
6 Rainbow Connection  (9.5)
7 Dawnfall V1.1  (9.5)
8 Onscreen 5k  (9.5)
9 Morph  (9.5)
10 Libertongo  (9.5)

Top Groups

1 Performers  (9.3)
2 Booze Design  (9.3)
3 Oxyron  (9.3)
4 Triad  (9.3)
5 Censor Design  (9.3)

Top Musicians

1 Rob Hubbard  (9.7)
2 Mutetus  (9.7)
3 Jeroen Tel  (9.7)
4 Linus  (9.6)
5 Stinsen  (9.6)

Page generated in: 0.071 sec.