| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Speedcode
Let's say my loop has to iterate 16K times.
Unrolling the loop would not be ideal, because the code would then take 16K times the number of bytes inside the loop.
What does speedcode actually do? Unroll loops?
I saw a lecture by Ninja at a conference about using just LDAs and STAs to specific address locations. But what do these address locations store? Tables, or opcode+data?
First I thought about actually loading and storing the opcode and data, but afterwards I thought this wouldn't be right! The LDAs and STAs each take some number of cycles that I really don't need; the opcodes and data themselves are the only thing I need :D
I now have an idea about using only one opcode for the same operation, and loading and storing just the data in 1- or 2-byte form. Perhaps the LDAs and STAs would then need to jump back and forth as a consequence, I don't know.
Still, it's a subject I have recently started looking into, since I would guess (as an approximation) that my routine will only run at 7 fps as it is now. Ideas or ramblings about this issue and topic are welcome!
(I forgot to ask: do you make your own tools or generators for speedcode generation?) |
|
... 15 posts hidden ... |
| |
algorithm
Registered: May 2002 Posts: 705 |
It's also not necessary to unroll everything unless you need every last cycle, or need to free up a register that would otherwise be used as the loop counter. Unrolling more than 8 times usually gives much smaller gains than the first few steps of unrolling.
Changing the routine itself will give the biggest speed gains. The usual cycle-saving tricks apply: where a branch is required, make it jump to the code that is less likely to run, avoid dec/inc on memory where possible (a huge 5-6 cycles each), and of course code in zeropage, some illegal opcodes, table usage and so on.
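To put rough numbers on the diminishing returns from unrolling, here is a minimal sketch. It assumes the loop is closed by DEX (2 cycles) plus a taken BNE (3 cycles, no page crossing), so every pass carries 5 cycles of overhead that are shared by however many copies of the body it contains. The function names and the 10-cycle body are made up for illustration.

```c
#include <stdio.h>

/* Loop overhead per pass: DEX (2 cycles) + taken BNE (3 cycles).
   Assumes the branch does not cross a page boundary. */
#define LOOP_OVERHEAD_CYCLES 5.0

/* Average cycles spent per element when a body of `body_cycles`
   is repeated `unroll` times inside one loop pass. */
double cycles_per_element(double body_cycles, int unroll)
{
    return body_cycles + LOOP_OVERHEAD_CYCLES / unroll;
}

/* Print the cost for a hypothetical 10-cycle body at several unroll factors. */
void print_unroll_table(void)
{
    for (int unroll = 1; unroll <= 16; unroll *= 2)
        printf("unroll x%-2d: %.4f cycles per element\n",
               unroll, cycles_per_element(10.0, unroll));
}
```

Under these assumptions, going from 1x to 2x saves 2.5 cycles per element, while going from 8x to 16x saves only about 0.3.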
|
| |
Fresh
Registered: Jan 2005 Posts: 101 |
A bit off topic.
Quote:
Regarding the hybrid approaches Bitbreaker mentioned, where you need to jump to various chunks of speedcode, the following article may be useful reading:
http://codebase64.org/doku.php?id=base:dispatch_on_a_byte
That article may be a little misleading: choose the right jumptable and there's no reason you can't have your 256 indirect jump entries.
|
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Fucking around with branch optimizing and shit like that doesn't help you much when your approach is slow. Try to think backwards: don't translate an algorithm to 6502 opcodes, try to translate 6502 opcodes into effects. The more primitive the solution you find, the faster the code will get. Don't try to be smart, try to be primitive and lazy. Try to see the world as the CPU, and then the power will be with you, my son :) |
| |
algorithm
Registered: May 2002 Posts: 705 |
There's only so much that simple speedcode with fewer branches can do. You can, however, reduce computation time in things like decompression: when the data consists of packed bits, jump to a routine selected by the byte containing them and do it all in one go, instead of bit shifting and comparing. One example was in the Just Dance 64 demo, which can decode 4 bytes from one byte (adpcm2 decode) in a single raster line. That is a potential of around 50000 bytes per second. Better to be smart and fast :-) |
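The idea of handling a whole packed byte in one go can be sketched in C with a 256-entry table indexed by the packed byte, so that no shifting or comparing happens at decode time. This only illustrates the lookup principle and is not the decoder from the demo; the 2-bit field layout and the names `unpack_lut`, `build_unpack_lut` and `unpack4` are assumptions.

```c
#include <stdint.h>

/* unpack_lut[b][k] holds the k-th 2-bit field of byte b (field 0 = lowest bits).
   Assumed layout: four 2-bit codes per packed byte. */
static uint8_t unpack_lut[256][4];

/* Precompute all 256 * 4 results once, outside the time-critical path. */
void build_unpack_lut(void)
{
    for (int b = 0; b < 256; b++)
        for (int k = 0; k < 4; k++)
            unpack_lut[b][k] = (uint8_t)((b >> (2 * k)) & 3);
}

/* Decode one packed byte into four codes using table reads only. */
void unpack4(uint8_t packed, uint8_t out[4])
{
    for (int k = 0; k < 4; k++)
        out[k] = unpack_lut[packed][k];
}
```

On the 6502 a table like this would probably be laid out as four 256-byte pages, each indexed with the packed byte in X or Y.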
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Ok. I have 3 branches in my loop (a for-loop, to be exact). They are where the "?" and ":" conditionals are.
Just for the heck of it I'll paste it here:
(I have yet to make those LUT tables one-dimensional)
for (ushort w=0; w<16384; w++)
{
    ushort i = w>>3;
    uchar j = (w&7);
    uchar a = shrlut[p[i]][j]&1;
    uchar b = (j<7 ? shrlut[p[i]][j+1] : p[(i+128)&2047])&1;
    uchar c = shrlut[p[(i+1)&0x7ff]][j]&1;
    uchar d = (j<7 ? shrlut[p[(i+1)&2047]][j+1] : p[(i+129)&2047])&1;
    uchar z = a|shllut2[b][0]|shllut2[c][1]|shllut2[d][2];
    z = swaplut[rand()&7][z];
    uchar s = shllut[j];
    p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
}
Hope you don't mind me pasting C code here. This was not what Oswald wanted, but it's just to let you know how much complicated arithmetic and how many branches this code needs. It also has some 16-bit numbers, although I know that for (w&7) I only need to AND the lo-byte, not the hi-byte. Anyway, here you see what challenge I've got. 'r' is a 16-bit number. And for those who don't remember the types in C: uchar is unsigned char (8 bits) and ushort is unsigned short (16 bits). |
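The remark about only needing the lo-byte for (w&7) carries over to w>>3 as well. A minimal sketch, assuming w is held as two separate bytes the way it would be on the 6502; the names `split_j`, `split_i` and `count_split_mismatches` are hypothetical.

```c
#include <stdint.h>

/* j = w & 7 depends on the low byte only. */
uint8_t split_j(uint8_t lo)
{
    return (uint8_t)(lo & 7);
}

/* i = w >> 3: the top 5 bits of the low byte, with the high byte
   moved in above them. For w < 16384 the result fits in 11 bits. */
uint16_t split_i(uint8_t lo, uint8_t hi)
{
    return (uint16_t)(((uint16_t)hi << 5) | (lo >> 3));
}

/* Compare the byte-wise versions with the 16-bit expressions
   from the loop for every w in 0..16383; returns the mismatch count. */
int count_split_mismatches(void)
{
    int mismatches = 0;
    for (uint16_t w = 0; w < 16384; w++) {
        uint8_t lo = (uint8_t)(w & 0xFF);
        uint8_t hi = (uint8_t)(w >> 8);
        if (split_j(lo) != (w & 7) || split_i(lo, hi) != (w >> 3))
            mismatches++;
    }
    return mismatches;
}
```

If the outer loop counts i and the inner loop counts j directly, none of this splitting is needed at run time at all.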
| |
algorithm
Registered: May 2002 Posts: 705 |
Most of the code there is linear. Just note where the bottleneck is and pay attention to optimising that. In that example the bottleneck is not the branches but the other code. Of course, optimise those as well. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I beg your pardon, but per frame you have 18656 available cycles, Sir :-) I'd split things up into two nested loops: that saves the shifting and splitting up of w, it also saves you the test on j<7, and it would be sufficient to unroll the inner loop, which then has 8 runs. If there's space, incorporate the & 1 into the LUTs. On the other hand, you fetch 8 consecutive bits from a LUT, so it might be even easier to fetch a single byte and shift out the bits. As you have 8 inner loop runs, that is exactly those 8 bits.
The shllut2 lookups are unnecessary and can be incorporated into the previous lookups (if there's mem :-) ).
for (ushort i=0; i<2048; i++)
{
    //here set up all luts, e.g.
    //lda #i&255
    //sta shrlut
    //lda #i>>8
    //sta shrlut+1
    //so now you can later on do:
    //ldy #$00
    //lda (shrlut),y
    //iny
    //...
    //until y == 7
    //then load different values for b and d
    for (uchar j = 0; j < 8; j++) {
        uchar a = shrlut[p[i]][j]&1;
        uchar b = (j<7 ? shrlut[p[i]][j+1] : p[(i+128)&2047])&1;
        uchar c = shrlut[p[(i+1)&0x7ff]][j]&1;
        uchar d = (j<7 ? shrlut[p[(i+1)&2047]][j+1] : p[(i+129)&2047])&1;
        uchar z = a|shllut2[b][0]|shllut2[c][1]|shllut2[d][2];
        z = swaplut[rand()&7][z];
        uchar s = shllut[j];
        p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
    }
}
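The "fetch a single byte and shift out the bits" idea might look like this in C. It assumes that shrlut[v][j] is simply v >> j, which the thread does not confirm; `bit_by_lookup` and `bits_by_shifting` are hypothetical names.

```c
#include <stdint.h>

/* What the inner loop computes today under the stated assumption:
   shrlut[v][j] & 1 == (v >> j) & 1. */
uint8_t bit_by_lookup(uint8_t v, uint8_t j)
{
    return (uint8_t)((v >> j) & 1);
}

/* Alternative: fetch the byte once, then take the lowest bit and
   shift right once per inner-loop run (an LSR on the 6502). */
void bits_by_shifting(uint8_t v, uint8_t bits[8])
{
    for (int j = 0; j < 8; j++) {
        bits[j] = (uint8_t)(v & 1);
        v >>= 1;
    }
}
```

Both routes give the same eight bits, but the shifting version reads the source byte once per outer-loop run instead of doing one table lookup per bit.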
|
| |
algorithm
Registered: May 2002 Posts: 705 |
I was only talking about converting the code to 6502 in linear form, as in the example by Rudi. Of course the way to go is to change things around to make it more efficient, instead of keeping the structure of the code as posted. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Oh sorry Algorithm, I was begging for Rudi's pardon, not yours :-) Your post just came in while I was writing mine :-) |
| |
algorithm
Registered: May 2002 Posts: 705 |
That's ok :-) |