| |
Luke
Registered: Dec 2004 Posts: 19 |
8 or 16bit muls/divs
About multiply:
Bad one:
a*b=((a+b)/2)^2-(((a+b)/2-b))^2 with x*x matrix
It's near 30-40 cycles, but unstable with (a+b)bit0 bcoz ror after adc.
Some ppl using "nybble tables" $0n x $xy with $1000 size of tables and swap+add, but more than 50 cycles left.
And last bad is old asl bcc adc ror routine, but very slow
like 80+ cycles.
Now I trying to do any 16x16bit signed multiply, but it's really hard to make "short time" routine. Anyone got any idea?
About Divu/Divs
Classical lsr bcs sbc rol too slow.
But a/b= e^(lna-lnb) looks quite fast with e^x matrix, but
quite inaccurate.
Somebody got other "faster" idea for it? Particular 16/16bit routines.
|
|
| |
Graham Account closed
Registered: Dec 2002 Posts: 990 |
For 16 bit mul simply do this:
(256*x_hi + x_lo)*(256*y_hi + y_lo) = 65536*x_hi*y_hi + 256*x_hi*y_lo + 256*x_lo*y_hi + x_lo*y_lo
So a 16 bit mul ist just four 8 bit muls. Ofcourse, the sign is a problem, you have to remove it before mul and apply it afterwards again.
And div is always a problem. That's why DIV is much slower on all CPUs: There is no good way to optimize it. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
I remember that, but after tests slower than classical asl bcc adc ror :) But I'll check it again.
edit. You are right, can be faster with tables. Damn, how to crunch that tables now , it's too much memory left :)
edit2. btw. Anyone got any fast arithmetical a=x*x method? |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
to my knowledge the best method is the a*b=((a+b)/2)^2-(((a+b)/2-b))^2 considering both speed and accuracy.
for divide it heavily depends on what you want to do. for getting line slopes the a/b= e^(lna-lnb) method is accurate enough, if you need accuracy you have to use multiply ((1/a)*b) imho. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
I need 16/16 div for 3d engine routines. Particular for shots 3d world across observer's axes ("pyramid of visibility" ? :D ). It mean method when observer can fly between objects into 3dworld.
|
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
well if I were you I'd first do it the slow and accurate way, and only when it works would change for optimized shit.
btw do u know stephen judd's cool world ? google for commodore hacking, and find the issue with the article about it, it describes and demonstrates a true 3d engine for the c64. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
ffd2.com rox :DDDD lol nice but why not jsrffd2.com ?:DDDD Thank you, I will check :) It looks very interesting :) Give hope nice evening today :)
btw. What you mean "getting line slopes" , you mean clasical (dy/dx)or something more?
|
| |
Skate
Registered: Jul 2003 Posts: 494 |
Like Oswald said, you can always use (1/a)*b for div routines. Use your coordinate system between 0 and 1 and divide this interval to 65536 pieces. So when you will never need to calculate sth like 43256/9721. Instead you'll always multiply these vaules and get the high byte of the result. For example;
39532*3424/65536 => (16bit x 16 bit) >> 16 bit
You will need a 16 bit x 16 bit = 32 bit multiply routine which is always faster than 16 bit/16 bit with other alternative algorithms (as I know). And you won't need low part (16 bit) of the 32 bits result. You will use only high part. So it will shorten the calculation a bit. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I mean for line slopes the dx/dy thingie. |
| |
ready.
Registered: Feb 2003 Posts: 441 |
Hi,
about division, in the demo Aurora (still final version has to be uploaded) I needed the division routine. A VERY fast method is base on the following:
Let's say you want to scale an x quantity. Do
lda X
lsr
sta d1 ;x/2
lsr
sta d2 ;x/4
lsr
sta d3 ;x/8
....
Then at some point you just add the d? you want:
d1+d2+d3=x*7/8
d1+d2 =x*3/4
d1+d3 =....
in this way just by adding terms, you can build the percentage of x you want. In the example the highest resolution is x/8 but do more lsr and you can get x/16, x/32,....
This method proved to be quite fast for zooming the flower shapes at the end of the demo. I'm sure it's also easy to implement it for 16 bit. Give it a try.
ciao,
Ready.
|
| |
WVL
Registered: Mar 2002 Posts: 902 |
Quote: I remember that, but after tests slower than classical asl bcc adc ror :) But I'll check it again.
edit. You are right, can be faster with tables. Damn, how to crunch that tables now , it's too much memory left :)
edit2. btw. Anyone got any fast arithmetical a=x*x method?
you can make a x^2 table by doing like this :
0,1,4,9,16,25,36
as you can see, the differences are
0,1,3,5,7,9,11,etc
so first number is 0, and then calc all numbers by nextnumber=previousnumber+1+2*positionoldnumber
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Just use drive code and the plain old binary mul/div loops suit perfectly. Works great for me. :) |
| |
White Flame
Registered: Sep 2002 Posts: 136 |
ready.: Unfortunately, that's a multiply routine, not a divide. You're multiplying by a percentage or ratio, which is great if you already have the ratio, but doesn't avoid the division if you need to find it. |
| |
ready.
Registered: Feb 2003 Posts: 441 |
White Flame: check it out better, it's a division.
Each lsr divides by 2, right? Many by 2 divisions provide 1/2, 1/4, 1/8,.... then sum toghether the terms you need to get the percentage you want of your original number. Don't you agree?
Besides in this way, there's no need for tables.
ciao,
Ready. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
ready. - you are wrong :)
1) x = lda #$7f lsr lsr lsr lsr lsr lsr lsr sta $nn (nn)=$00
so for example 3x #$7f/ 128 = 0 ???!!!
Too many lost carry flags. It's inaccurate at all. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
@ready: So... if a line is dx=34 and dy=78 and I wanna calculate the line slope dx/dy = 34/78, how do I do that using your algorithm? |
| |
ready.
Registered: Feb 2003 Posts: 441 |
Luke:
x = lda #$7f lsr lsr lsr lsr lsr lsr lsr sta $nn (nn)=$00
so for example 3x #$7f/ 128 = 0 ???!!!
is absolutely correct. #$7f=127 and 127/128=0.something. My routine deals only with integer. |
| |
ready.
Registered: Feb 2003 Posts: 441 |
Luke:
x = lda #$7f lsr lsr lsr lsr lsr lsr lsr sta $nn (nn)=$00
so for example 3x #$7f/ 128 = 0 ???!!!
is absolutely correct. #$7f=127 and 127/128=0.something. My routine deals only with integer. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
No :D 3x ($7f/128) = dec 2.97 not 0 :)
Show me that zoomer :))))))
edit: JackAsser: :DDDDDDD 34/78 = 0 too:D Did you know?:D |
| |
ready.
Registered: Feb 2003 Posts: 441 |
Luke: ok, you are right, but working with integers, it also depends on what operation you do first. If you do 3x#$7f=#$17d and then divide by 128 you get a more accurate result. The routine I put in my previous post was meant to be used for 8bit division. I just wanted to give the idea, not the detailed routine.
Doing division first causes you to loose significant digits.
For 16bit number, of course you have to use ror instead of lsr.
JackAsser: the same answer is valid for you too. #$34/#$78 is less then 1, so you should do something like #$134/#$78 to store some significant digits into memory as a result.
But so far I have used this division routine just with 8bit numbers and it works.
Ready. |
| |
Luke
Registered: Dec 2004 Posts: 19 |
I understand, sorry :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
ready, you need 256 different routines for something general this way. |
| |
Danzig
Registered: Jun 2002 Posts: 440 |
@Oswald: but that 256 "unrolled" codesnippets would be smaller than a multiply-table-based routine I coded back in 1989 :D
Edit: Okay, that routine was 16bit (makes it not better at all ;) ) |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ready: How would you divide a number, say 0..255, by 3 and get correct truncated or rounded integer results, using your method? |
| |
ready.
Registered: Feb 2003 Posts: 441 |
My method does not always provide exact result. For example to divide by 3, you have to get a number as close as possible to 1/3=0.33333, by summing terms of 1/(2^n). So you don't get actually 1/3 but for example 1/4+1/16+1/64=0.328125
If you don't want to do too many ror (or lsr), just 1/4+1/16=0.3125
In my case I didn't need much accurancy, I just wanted to scale the original shape I had to do zoom effect, so I use 1/2,1/4,1/8, which gives me 7 possible levels of zoom:
1/8
1/8+1/4
1/8+1/4+1/2
1/4+1/2
1/8+1/2
1/2
1/4
Enogh to see the shape disappear into the screen.
Ready. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Yes, and the zooming was exceptionally smooth. |
| |
White Flame
Registered: Sep 2002 Posts: 136 |
Quote: White Flame: check it out better, it's a division.
Each lsr divides by 2, right? Many by 2 divisions provide 1/2, 1/4, 1/8,.... then sum toghether the terms you need to get the percentage you want of your original number. Don't you agree?
Besides in this way, there's no need for tables.
ciao,
Ready.
Key word there is "percentage". You're multiplying a number by N/256, where N is 0-255, and the bits of N are lookups in the table. N is in the numerator of that ratio, so you're multiplying. Which table entries you add together come from the 0.8bit fixed point representation of N/256, and the number you LSRed into the table is the number you're multiplying by. It's an unrolling of a shift-add loop with one of the values hardcoded into the source.
Take the 'divide by 3' example. You add together the 1/2^n table portions to approximate 0.33333~. You can only do this because the percentage (1/3rd) was known when writing your source code, and you can add the bits from the table that add to 1/3. But what if you want to take 1/Nth of a number where N is only known at runtime? It takes a division to get the percentage, and your routine deals with only the percentage end. Sure, you can do an 8-bit table of 1/N and do a shift-add with your LSRed table, but that's horribly imprecise for general division. But this works very well for a zoomer where you're only dealing with percentages scaled to the range of 0-255. |
| |
ready.
Registered: Feb 2003 Posts: 441 |
Today I had the need to zoom an 8-bit sampled waveform, with sampling rate around 1kHz, with real time scaling of each sample, so all calculations have to be done extremely fast for the powerful C64.
So I thought I could reuse the same concept used in my old zooming routine.
Using the well known fast multiplication routine, it doesn't take much for the C64 to execute:
K*L/16=M
where K is the original sample, L is the zooming factor (16=original waveform; ......; 1=fully reduced waveform; 0=no waveform), M is the reduced sample.
This code provides 16+1 levels of zoom, +1 is the "no waveform level":
lda K
ldy L
jsr mult ;K*L = (byte low, byte high) = (x reg, accu)
stx ZP
lsr ;/2
ror ZP
lsr ;/4
ror ZP
lsr ;/8
ror ZP
lsr ;/16
ror ZP
lda ZP
sta M
If more/less zooming levels are needed the general formula is K*L/2^n=M, giving 2^n+1 zooming levels. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
for 16+1 levels I'd use tables if there's enough memoriy. |
| |
WVL
Registered: Mar 2002 Posts: 902 |
what oswald said :D
now i think of it..
another way is to add 2 sines and using the 'interference' to calculate the multiplication.
Basically what you're doing is multiplying your sample value (A) with a number (B) between 1/16 and 1. Result is C.
C = A * B
Now let's see what happens when we write this as sines ;)
A mathematical formula for multiplying sines is :
sin(X) * sin(Y) = 0.5 * (cos (x-y) - cos(x+y))
let's say that
A = sin (x)
B = sin (y)
C = sin (x) * sin(y)
now, instead of storing A in your sample data, we store x instead. Next, we have a small table (17 bytes) to get the value y if we want a multiplation of 0/16,1/16,...,16/16.
(this value only changes when you change the volume!)
Now, we can calculate the product very easily, since we know x (which is your sample data!) and y (we modify the code!)
C = 0,5 * (cos(x-y) - cos(x+y)
code for changing the volume level:
;this bit only when we change volume
ldx #B (volume 0-16)
lda volumetable,x
sta sbcadress+1
eor #$ff
clc
adc #1
sta ldaadress+1
this is the code for multiplying a sample value
;this is for multiplying one sample value
ldx sampledata (stored as inverse sines)
ldaadress:
lda COSTABLE+256-y,x (the -y is done by modifying the adress)
sec
sbcadress:
sbc COSTABLE+y,x (the +y is done by modifying the adress)
ror
tadaa! multiplied :) The nice thing is that you can make almost 256 different volume levels ;)
Note : your COSTABLE table has to be twice as long. 2*256 bytes. But still a lot less memory use than 17 256-byte long tables for oswald's way.
The only 'problem' is that you lose around 1 bit of accuracy in the whole process, mainly because you cannot specify A exactly in terms of x (there's always numbers missing from your COSTABLE table, sometimes there's a gap of 2 or 3 instead of 1).
Damn.. what a long post.. I can has cookie?
|
| |
ready.
Registered: Feb 2003 Posts: 441 |
@WVL: I only saw your post now, you must have edited it after I read it the first time. Your method seems very interesting and for zooming it might be the fastest, if not using Oswald's tables.
Anyhow, just today I improved my routine greatly. I noticed that if I have 256 volume/zoom levels instead of 16, 32, 64,.. I can cut the code greately:
K*L/256=M
K: sampled data
L: volume/zoom level
M: scaled sample
lda K
ldy L
jsr mult ;K*L = (byte low, byte high) = (x reg, accu)
sta M
thus getting rid of all the ror lsr and having 256 levels of resolution! After all in the application I am working on 256 levels are better than 16.
Furthermore the X register, which is the least significant part of the multiplication, contains the reminder, which might come handy in some applications. In fact:
K*L=256*M+X
Still I think WVL's method is faster. I will give it a try sooner or later. Great idea! |
| |
Skate
Registered: Jul 2003 Posts: 494 |
@ready: please read Oswald's and my 5 years old posts above and compare with what you've come up with today. :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
WVL's post is interesting looks exactly like the square method, but using cos/sin? WTF?! I dont get it. :) |
| |
ready.
Registered: Feb 2003 Posts: 441 |
@skate: ah, I shuold have gone through this old thread before posting!! Well, I kinda thought that this way must have been already used somewhere.
Anyhow, it's always nice to make it by yourself without looking at the solution, even if somebody has already made it before :)
|
| |
WVL
Registered: Mar 2002 Posts: 902 |
Quote: WVL's post is interesting looks exactly like the square method, but using cos/sin? WTF?! I dont get it. :)
Ha! Finally I managed to outsmart Oswald!
I have to say that i only once used this technique before, in that particular case to let 2 movements warp into eachother. It was the part in Arcanum with the vertical bars. The nice thing was that all the movements are made from sines ;) So I already had my x1 for movement 1 and x2 for movement 2. Then i just multiplied the value for movement 1 from 1...0 (changing in time) and for movement 2 from 0...1 (changing in time) and the result is a smooth change in the different movements without any effort :)
Anyway, think of it like this :
if value i want (A) is 255 and x is such that cos(x)=255 (x=0') and my y = 0' (multiplication of 1), you get :
C = (cos(x-0) + cos(x+0))/2 = (255 + 255)/2 = 255 (I multiplied by 1).
now, if you select y such that the values are opposite in phase (ie y = 90 degrees), you get that both cos values added give 0.
change the value of y inbetween 0' and 90' and the result is a smooth variance from 0...255 (or any value you like as a result). The only thing you have to do is select the appropiate y-value from an inverse cosine table.
Note that this way of multiplying is not good at all if you first need to get the value of x (but for samples, we simply store x instead of A!)
Oh, one more thing :
the nice thing about this is that it automagically also works OK with negative numbers, unlike the squares method where you have to take care of some stuff by yourself. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
You got me so many times already :) its also interesting from the math viewpont, how come two so different things are so similar ? tho there's an extra 'lookup' on the left side of the equation. apart from that its the same code. BTW signs also automagically work in the squares method, but it only works for +/-64, and I guess the same is true for sines, because everything is the same except the table data :) |
| |
WVL
Registered: Mar 2002 Posts: 902 |
Well, maybe it's easier to understand it if you relate it to
e^a * e^b = e^(a+b)
knowing that you can write sines as the imaginary part of e^ix and cosines as the real part of e^ix... They are heavily related :) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
WVL:
I had a vague memory of my old algebra textbook mentioning that trigonometric identities and tables had once been used to carry out multiplication and division, before the discovery of the logarithm. I just looked it up and apparently the method is called prosthaphaeresis (which sounds more like a disease than an algorithm if you ask me.)
I guess that just goes to show that there's nothing new under the sun. I suppose the ancient Babylonians used the table-of-squares method to do all of their 8-bit arithmetic... |
| |
ready.
Registered: Feb 2003 Posts: 441 |
ah, ah, back in school my math teacher always insisted on these prosthaphaeresis formulas and I have always wondered for what I would ever use them in real life. Well, didn't know I was going to make demos about 20 years later :)
|
| |
Graham Account closed
Registered: Dec 2002 Posts: 990 |
Quote: Well, maybe it's easier to understand it if you relate it to
e^a * e^b = e^(a+b)
knowing that you can write sines as the imaginary part of e^ix and cosines as the real part of e^ix... They are heavily related :)
Yes but this has nothing to do that you can use it for a multiplication. You can use almost any concave function which starts at 0 and a lookup table to make a multiplication.
|