| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Speedcode
Let's say my loop has to iterate 16K times.
unrolling the loop would not be ideal, because then it would be 16K*the amount of bytes inside the loop.
What does speedcode actually do? Unroll loops?
I saw lecture by Ninja at a conference about just LDA's and STA's to specific address locations. But what do these address locations store? tables or opcode+data?
First I thought about actually loading and storing the opcode and data, but afterwards I thought this wouldn't be right! The LDA's and STA's take each x-amount of cycles that I really dont need, and the opcodes and data themselves are the only thing I need :D
I got an idea now about using only one opcode for doing a same operation, but loading and storing the data in 1 or 2 byte format. Perhaps the LDA and STA needs to jump back and forth (as a consequence, i dont know).
Still its a subject that I recently am looking into since i would guess (as an approximation that) my routine will only run in 7 fps as it is now. Ideas or ramblings are welcome about this issue and topic!
(I forgot to ask. Do you make your own tools/or generators for speedcode generation?) |
|
... 15 posts hidden. Click here to view all posts.... |
| |
algorithm
Registered: May 2002 Posts: 705 |
Most of the code there is linear. Just note where the bottleneck is and pay attention to optimising that. The bottleneck is not the branches but the othercode from that exampke. Ofcourse optimise that as well |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I beg your pardon, but per frame you have 18656 available cycles, Sir :-) I'd split up the things into two loops, as it saves the shifting and splitting up of w, also it saves you the test on j<7 and it would be sufficient to unroll the inner loop with then 8 runs. If there's space, incorporate the & 1 into the LUTs. On the other hand: you fetch 8 consecutive bits from a lut, so it might then even be easier to fetch a single byte and shift out the bits. As you have 8 inner loop runs, that is exactly those 8 bits.
The shllut2 lookups are unnecessary and can be incorportaed into the the previous lookups (if there's mem :-) ).
for (ushort i=0; i<2048; i++)
{
//here set up all luts for e.g.
//lda #i&255
//sta shrlut
//lda #i>>8
//sta shrlut+1
//so now you can later on do:
//ldy #$00
//lda (shrlut),y
//iny
//...
//until y == 7
//then load different values for b and d
for (uchar j = 0; j < 8; j++) {
uchar a = shrlut[p[i]][j]&1;
uchar b = (j<7 ? shrlut[p[i]][j+1]:p[i+128&2047])&1;
uchar c = shrlut[p[i+1&0x7ff]][j]&1;
uchar d = (j<7 ? shrlut[p[i+1&2047]][j+1]:p[i+129&2047])&1;
uchar z = a|shllut2[b][0]|shllut2[c][1]|shllut2[d][2];
z = swaplut[rand()&7][z];
uchar s = shllut[j];
p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
}
}
|
| |
algorithm
Registered: May 2002 Posts: 705 |
I was only going on about convertibg the code to 6502 in linear format as the example by rudi. Of course the method is to change things around to make it more efficient instead of the structure in the code posted. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Oh sorry Algorithm, i was begging for rudi's pardon, not yours :-) Your post just came in between while i was writing that post :-) |
| |
algorithm
Registered: May 2002 Posts: 705 |
Thats ok :-) |
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
yes. some interesting solutions here. its wicked to actually optimize it, and then not see that it can be optimized more.. but right now i do see it clearly lol. strange that i didnt see that before. i tend to be locked into some code and not see around it. but thanks again! i think ill do the 1-dimensional pointer before i actually do anything more. since maybe perhaps the shift luts have to be indexed differently.
bitbreaker: i can start of with integrating shllut into shllut2 but not the other way around. shllut2 is a 16-bit shift table while shllut is 8-bit. i just wondering in your example code are you generating lut-table in the loop? isnt it better to have static tables? or was that just example code. yes, i generate my tables in C not in asm. just because i am lazy :P |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Oh, i should have looked right, it is p(i) being used as index for the first demension of the array. So you can't set up a fixed i beforehand, but you can place p into one of the Index-Registers for fast access then.
However you can easy do a two-dimensional arraylookup via either lda (y,x) or lda (x),y so one dimension is choosen by the index register and the other one by one/two sta's to the zeropage. Illegal opcodes could speed up things here as you can use lax in both cases if you need the value in A and X.
In any case, take care that your data is aligned nicely so that the lookup stays easy. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
this can also be done with a r>>(z+1) to move the bit to be tested into carry. This saves the and #$01 + cmp and you can directly branch with a bcc/bcs to either do the |s / &(s^255).
Also if you store p(i) in X and s in A you can make use of the SBX command to do the X = A & X. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I'm too tired and cant properly read C, but that looks a hell of lot of stuff to do 2048 times. RAND pops out to me if you cant get away with some timer/raster register mixups, it will be expensive. j<7 seems pointless when j=w&7 (0..7) anyway if its a special case for j=7 then I'd shuffle it until the branch would work with j=0 you can save a cmp then as Z bit sets itself usually. best would be ofcourse to get rid of the branch and make the LUT have a special case for j=7.
not sure what's the table doing, but using a lookup for bit shifting seems expensive, asl rol / lsr ror is less expensive than setting up a table lookup. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
"RAND pops out to me if you cant get away with some timer/raster register mixups, it will be expensive."
unless you really need "proper" random values (which you almost never do for demo effects), then RAND can be done very cheap:
updaterand: <- call once per frame
jsr rand <- your favourite PRNG
cnt:
sta rtab
inc cnt+1
rts
getrand: <- use instead of the PRNG
cnt:
lda rtab
inc cnt+1
rts
that said, dont listen to whoever tells you to use timers or raster or something like that for "random". it gives horrible results, even a static table will work better (and is faster).
|
Previous - 1 | 2 | 3 - Next |