Log inRegister an accountBrowse CSDbHelp & documentationFacts & StatisticsThe forumsAvailable RSS-feeds on CSDbSupport CSDb Commodore 64 Scene Database
You are not logged in - nap
CSDb User Forums


Forums > C64 Coding > Fast large multiplies
2012-06-09 19:45
Repose

Registered: Oct 2010
Posts: 222
Fast large multiplies

I've discovered some interesting optimizations for multiplying large numbers, if the multiply routine time depends on the bits of the mulitplier. Usually if there's a 1 bit in the multiplier, with a standard shift and add routine, there's a "bit" more time or that bit.
The method uses several ways of transforming the input to have less 1 bits. Normally, if every value appears equally, you average half 1 bits. In my case, that becomes the worst case, and there's about a quarter 1 bits. This can speed up any routine, even the one that happens to be in rom, by using pre- and post- processing of results. The improvement is about 20%.
Another speedup is optimizing the same multiplier applied to multiple multiplicands. This saves a little in processing the multiplier bits once. This can save another 15%.
Using the square table method will be faster but use a lot of data and a lot of code.
Would anyone be interested in this?

 
... 144 posts hidden. Click here to view all posts....
 
2017-04-17 18:56
JackAsser

Registered: Jun 2002
Posts: 1987
Also if z3 is assume in the y-reg when done instead:

sta z3;x1*y1h;31 => tay

all inc z3 => iny

I think that's ok since you'd probably do some lda z3 afterwards anyway, and instead you have it readily available in Y.
2017-04-17 19:27
Repose

Registered: Oct 2010
Posts: 222
Thanks :) I screwed up my timing though, the first multiply part is 33 not 31 because of the single SEC, bringing my total to 203 (and VICE was right, the multiply part is 160). Because of this, I think I'll use your optimizations to bring it back down to 201. Therefore it still stands :)
2017-04-17 20:54
JackAsser

Registered: Jun 2002
Posts: 1987
Also in a real life situation, f.e. subpixel vectors you'd only keep z3 and z2 as a screen limited 8.8 result. Typically one would do:

Rotated x,y,z in 8.8
Rotated z in 8.8
Reciprocal z = 1/z = 0.16

Perspective:
8.8 * 0.16 = 8.24 (keep only the top 8.8, i.e. pixel and the bresenham initial error)
2017-04-17 21:11
Repose

Registered: Oct 2010
Posts: 222
Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).

So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.

Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.

Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.
2017-04-17 22:24
JackAsser

Registered: Jun 2002
Posts: 1987
Quote: Maybe I should use what I've learned to do 3d rotations and perspective transform? I think A Different Perspective 2017 3d Creators Update is in order :) (I'm one of the original authors).

So I had a plan for this fast multiply, it can lead to a fast division because of multiplying by the inverse of the divisor. I can also do squares and cubes faster than this.

Edit: was thinking multiply is only the beginning. I made it 16% faster than your routine but if I can make such gains throughout the transform stack it will add up.

Also for Andropolis, I was thinking to not use EOR fill but a straight store (in fact that's the insight I had on A Different Perspective), and also to calc frame differences and plot those only.


I've had similar ideas and I even did an frame difference experiment in Jave. Problem was that since diffs are small so will the triangles be. They'll be extreme, sharp and 'problematic'. Hard to render correctly and since they're diffs any render error will accumulate.

Regarding transforms I came to the conclusion to cut most of the stack and forget about how it works conventionally.

Regarding divs we all do mul by the reciprocal. For Andropolis I did what Graham did and calced the reciprocal by linear interpolation:

X is 8.8 call it a.b:

1/a.b ~= invtab[a]*(1-b) + invtab[a+1]*b

invtab[x] is 0.16 result of 65536/x for x 1..255
2017-04-17 23:35
Repose

Registered: Oct 2010
Posts: 222
You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.

I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?
2017-04-18 04:58
JackAsser

Registered: Jun 2002
Posts: 1987
Quote: You should write an article on how you did that, it sounds interesting. Obviously I'm a noob at this problem and would have a lot of research to do.

I wasn't thinking triangles exactly but just doom like hallways, wouldn't that work for differencing?


Hehe, I suck at writing articles! :)

Imagine the simpliest case: axis aligned rooms, no height differences and just rotate around Y, i.e. Wolfenstein.

The diff formed by the edge between the wall and the floor/ceiling will be an enlogated thin triangle. The diff might be 3-4 pixels high close to the camera, this is easy, but further down the corridor the height of the diff will be <1 but still pixels here and there must be pur. Surely an extended bresenham that "draws" both the top and the bottom line simultaneous could handle it, and calc the total height between the lines and still propagate individual errors.
2017-04-18 09:28
Repose

Registered: Oct 2010
Posts: 222
Ok, I think that makes sense now.

Now 196
I managed to save another 7 cycles from the correct total of 203, bringing it down to 159+37=196. As a bonus, the two highest bytes are returned in registers.

If I put do_adds in zp, there's one more optimization to save 4 instead of 3, for 192. Finally, if you don't need the lowest bytes, you can save 3 cycles by deleting sta z1.
lda (p_sqr_hi),y
sbc (p_invsqr_hi),y
tay;x1*y1h;Y=z3, 30 cycles

do_adds:
;-add the first two numbers of column 1
	clc
c1a:	lda #0
c1b:	adc #0
	sta z1;9

;-continue to first two numbers of column 2
c2a:	lda #0
c2b:	adc #0
	tax;X=z2, 6 cycles
	bcc c1c;3/6 avg 4.5
	iny;z3++
	clc

;-add last number of column 1
c1c:	lda #0
	adc z1
	sta z1;8

;-add last number of column 2
	txa;A=z2
c2c:	adc #0
	tax;X=z2, 6
	bcc fin;3/4 avg 3.5
	iny;z3++

;Y=z3, X=z2
;9+6+4.5+8+6+3.5=37
fin:	rts
2017-04-18 13:11
Oswald

Registered: Apr 2002
Posts: 5007
I did not follow closely, could you enlighten me why c1a/b/c is not added up in one go ?
2017-04-18 13:43
ChristopherJam

Registered: Aug 2004
Posts: 1359
Damn, so you're now 8.5 cycles faster than me? I was not expecting partial products to be faster than the optimisations we've been working on for runs of adc. Going to have to study this more closely.

Canada holds the record, for now. Nice work!
Previous - 1 | ... | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | ... | 16 - Next
RefreshSubscribe to this thread:

You need to be logged in to post in the forum.

Search the forum:
Search   for   in  
All times are CET.
Search CSDb
Advanced
Users Online
Nova
Airwolf/F4CG
Mason/Unicess
Freeze/Blazon
radius75
Dr. Doom/RAD
Nordischsound/Hokuto..
iAN CooG/HVSC
Sokrates
Guests online: 1138
Top Demos
1 Next Level  (9.8)
2 Mojo  (9.7)
3 Coma Light 13  (9.7)
4 Edge of Disgrace  (9.6)
5 No Bounds  (9.6)
6 Comaland 100%  (9.6)
7 Uncensored  (9.6)
8 The Ghost  (9.6)
9 Wonderland XIV  (9.6)
10 Bromance  (9.6)
Top onefile Demos
1 Party Elk 2  (9.7)
2 Cubic Dream  (9.6)
3 Copper Booze  (9.5)
4 Rainbow Connection  (9.5)
5 TRSAC, Gabber & Pebe..  (9.5)
6 Onscreen 5k  (9.5)
7 Dawnfall V1.1  (9.5)
8 Quadrants  (9.5)
9 Daah, Those Acid Pil..  (9.5)
10 Birth of a Flower  (9.5)
Top Groups
1 Booze Design  (9.3)
2 Nostalgia  (9.3)
3 Oxyron  (9.3)
4 Censor Design  (9.3)
5 Crest  (9.3)
Top Organizers
1 Burglar  (9.9)
2 Sixx  (9.8)
3 hedning  (9.7)
4 Irata  (9.7)
5 MWS  (9.6)

Home - Disclaimer
Copyright © No Name 2001-2024
Page generated in: 0.046 sec.