Rudi Account closed
Registered: May 2010 Posts: 125 |
Speedcode
Let's say my loop has to iterate 16K times.
Unrolling the loop would not be ideal, because then the code would be 16K times the number of bytes inside the loop.
What does speedcode actually do? Unroll loops?
I saw a lecture by Ninja at a conference about just LDAs and STAs to specific addresses. But what do those addresses store? Tables, or opcode+data?
First I thought about actually loading and storing the opcode and data, but then I realized this wouldn't be right: the LDAs and STAs each take x cycles that I really don't need, and the opcodes and data themselves are the only thing I need :D
Now I have an idea about using only one opcode for a given operation, but loading and storing the data in 1- or 2-byte format. Perhaps the LDA and STA need to jump back and forth as a consequence, I don't know.
Still, it's a subject I've only recently been looking into, since I'd estimate my routine will only run at 7 fps as it is now. Ideas or ramblings about this issue and topic are welcome!
(I forgot to ask: do you make your own tools/generators for speedcode generation?)
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Speedcode can have many faces. In its cheapest form it can just be an unrolled loop. Generated speedcode might do more, like representing an intrinsic mapping, e.g. for colors (fading/cycling) or such.
More complex situations might have a bunch of speedcode fragments that are jumped to on request to do subtasks. Here it is also important to have a slick and fast way to set up the jump target.
I recently added another small article about optimizing on codebase64 (http://www.codebase64.org/doku.php?id=base:loops_vs_unrolled), but you might also want to read the other material about optimization there. Finally, it would help to know your project a bit more closely, to give suggestions that fit your specific problem/algorithm.
Frantic
Registered: Mar 2003 Posts: 1648 |
Speedcode is the idea of repeating code literally (hence filling up a lot of memory), also called "unrolled code", instead of repeating it with assembler instructions (looping via branch/compare instructions etc.). The gain is obviously that you get rid of the branch/compare overhead and execute only the operations that actually do the work. It's really not any more complicated than that. You can of course build various hybrid approaches, as Bitbreaker says.
Since unrolled code can become quite large, as you noticed yourself, many people write code that generates the unrolled speedcode at runtime, rather than loading it directly from disk. A lazier approach is to generate chunks of speedcode using the macro features of your assembler, or a script that generates unrolled source code before assembling it. Then you get the problem of large binaries, of course, since the code is already present in the assembled binary. (Then again, I suppose a good cruncher does a decent job packing the kind of repeated byte sequences you find in unrolled code.)
Have you checked these articles on codebase?
http://codebase64.org/doku.php?id=base:speedcode
http://codebase64.org/doku.php?id=base:loops_vs_unrolled
Oswald
Registered: Apr 2002 Posts: 5094 |
Here's a speedcode generator for adding sinus tables. The 256-byte table at <sinus> should be page-aligned and repeated once; sinus movement is achieved by calling the routine with stepping values in X/Y:
w0      lda sinus,x
w1      adc sinus,y
w2      sta result
w3

speedcodegenerator
        lda #<speedcode
        sta fc
        lda #>speedcode
        sta fd
        lda #0
        sta counter
        sta w0+1
        sta w1+1
--      ldy #$00
-       lda w0,y
        sta (fc),y      ;copy speedcode fragment (w0-w3) into memory
        iny
        cpy #w3-w0
        bne -
        inc w2+1        ;update sta address
        bne *+5
        inc w2+2
        lda w0+1        ;update load addresses
        clc
        adc #$03
        sta w0+1
        lda w1+1
        clc
        adc #$07
        sta w1+1
        tya             ;advance the 16-bit pointer to where the speedcode is copied
        clc
        adc fc
        sta fc
        bcc *+4
        inc fd
        inc counter
        lda counter
        cmp #100
        bne --          ;outer loop label; must not branch back into the copy loop
        ldy #$00
        lda #$60
        sta (fc),y      ;RTS at the end of the speedcode
        rts
Frantic
Registered: Mar 2003 Posts: 1648 |
Regarding the hybrid approaches Bitbreaker mentioned, where you need to jump to various chunks of speedcode, the following article may be useful reading:
http://codebase64.org/doku.php?id=base:dispatch_on_a_byte
Rudi Account closed
Registered: May 2010 Posts: 125 |
Thanks a lot for the feedback! You gave me some helpful ideas I'm going to look further into.
algorithm
Registered: May 2002 Posts: 705 |
Also, it's not necessary to unroll everything unless you need every last bit of CPU time, or need to free an additional register (if one is used as the loop counter) etc. Usually unrolling more than 8 times gives little extra gain compared to unrolling between two and eight times.
Changing the routine itself gives the biggest speed gains. Then come the usual cycle-saving tricks, such as arranging branches so that they jump to the code that is less likely to run, avoiding inc/dec on memory where possible (a hefty 5-6 cycles each), and of course code in zeropage, some illegal opcodes, table usage and more.
Fresh
Registered: Jan 2005 Posts: 101 |
A bit off topic.
Quote:
Regarding the hybrid approaches Bitbreaker mentioned, where you need to jump to various chunks of speedcode, the following article may be useful reading:
http://codebase64.org/doku.php?id=base:dispatch_on_a_byte
That article may be a little misleading: choose the right jump table and there's no reason you can't have all 256 indirect-jump entries.
Oswald
Registered: Apr 2002 Posts: 5094 |
Fucking around with branch optimizing and stuff like that doesn't help you much when your approach is slow. Try to think backwards: don't translate an algorithm to 6502 opcodes, translate 6502 opcodes into effects. The more primitive the solution you find, the faster the code will get. Don't try to be smart; try to be primitive and lazy. Try to see the world as the CPU does, and then the power will be with you, my son :)
algorithm
Registered: May 2002 Posts: 705 |
There's only so much that plain speedcode with fewer branches can do. But you can also reduce computational time, for example in decompression, by using the byte containing the packed bits as a jump index and handling them all in one go, instead of bit-shifting and comparing. One example was in the Just Dance 64 demo, where the ADPCM2 decoder decodes 4 samples from one byte in a single raster line. That is around 50000 bytes per second of potential throughput. Better to be smart and fast :-)
Rudi Account closed
Registered: May 2010 Posts: 125 |
Ok. I have 3 branches in my loop (or for-loop, to be exact). They're where the "?" and ":" conditionals are.
Just for the heck of it, I'll paste it here:
(I have yet to make those LUTs one-dimensional)
for (ushort w=0; w<16384; w++)
{
    ushort i = w>>3;
    uchar j = (w&7);
    uchar a = shrlut[p[i]][j]&1;
    uchar b = (j<7 ? shrlut[p[i]][j+1] : p[(i+128)&2047])&1;
    uchar c = shrlut[p[(i+1)&2047]][j]&1;
    uchar d = (j<7 ? shrlut[p[(i+1)&2047]][j+1] : p[(i+129)&2047])&1;
    uchar z = a|shllut2[b][0]|shllut2[c][1]|shllut2[d][2];
    z = swaplut[rand()&7][z];
    uchar s = shllut[j];
    p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
}
Hope you don't mind me pasting C code here. This wasn't what Oswald wanted, but it shows how much complicated arithmetic and how many branches this code needs. It also has some 16-bit numbers, although I know that for (w&7) I only need to AND the low byte, not the high byte. Anyway, here you see the challenge I've got. 'r' is a 16-bit number. And for those who don't remember the types in C: uchar is unsigned char (8-bit) and ushort is unsigned short (16-bit).
algorithm
Registered: May 2002 Posts: 705 |
Most of the code there is linear. Just note where the bottleneck is and pay attention to optimizing that. The bottleneck is not the branches but the other code in that example. Of course, optimize the branches as well.
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I beg your pardon, but per frame you have 18656 available cycles, Sir :-) I'd split the work into two loops, as that saves the shifting and splitting of w; it also saves the test on j<7, and it would then be sufficient to unroll the inner loop of 8 runs. If there's space, incorporate the &1 into the LUTs. On the other hand: you fetch 8 consecutive bits from a LUT, so it might be even easier to fetch a single byte and shift out the bits; with 8 inner-loop runs, that is exactly those 8 bits.
The shllut2 lookups are unnecessary and can be incorporated into the previous lookups (if there's mem :-) ).
for (ushort i=0; i<2048; i++)
{
    //here set up all LUTs, e.g.:
    //lda #i&255
    //sta shrlut
    //lda #i>>8
    //sta shrlut+1
    //so that later on you can do:
    //ldy #$00
    //lda (shrlut),y
    //iny
    //...
    //until y == 7
    //then load different values for b and d
    for (uchar j = 0; j < 8; j++) {
        uchar a = shrlut[p[i]][j]&1;
        uchar b = (j<7 ? shrlut[p[i]][j+1] : p[(i+128)&2047])&1;
        uchar c = shrlut[p[(i+1)&2047]][j]&1;
        uchar d = (j<7 ? shrlut[p[(i+1)&2047]][j+1] : p[(i+129)&2047])&1;
        uchar z = a|shllut2[b][0]|shllut2[c][1]|shllut2[d][2];
        z = swaplut[rand()&7][z];
        uchar s = shllut[j];
        p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
    }
}
algorithm
Registered: May 2002 Posts: 705 |
I was only talking about converting the code to 6502 in linear form, as in Rudi's example. Of course the real method is to restructure things to make them more efficient, instead of keeping the structure of the posted code.
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Oh sorry Algorithm, I was begging for Rudi's pardon, not yours :-) Your post just came in while I was writing mine :-)
algorithm
Registered: May 2002 Posts: 705 |
That's ok :-)
Rudi Account closed
Registered: May 2010 Posts: 125 |
Yes, some interesting solutions here. It's wicked to optimize something and then not see that it can be optimized further, but right now I do see it clearly, lol. Strange that I didn't see it before; I tend to get locked into some code and not see around it. But thanks again! I think I'll do the one-dimensional pointer before I do anything else, since the shift LUTs may have to be indexed differently.
Bitbreaker: I can start off by integrating shllut into shllut2, but not the other way around; shllut2 is a 16-bit shift table while shllut is 8-bit. I was just wondering: in your example code, are you generating the LUT inside the loop? Isn't it better to have static tables, or was that just example code? Yes, I generate my tables in C, not in asm, just because I'm lazy :P
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Oh, I should have looked more carefully: it is p[i] being used as the index for the first dimension of the array. So you can't set up a fixed i beforehand, but you can place p in one of the index registers for fast access.
However, you can easily do a two-dimensional array lookup via either lda (zp,x) or lda (zp),y, so one dimension is chosen by the index register and the other by one or two STAs to the zeropage. Illegal opcodes could speed things up here, as you can use LAX in both cases if you need the value in both A and X.
In any case, take care that your data is aligned nicely so that the lookup stays simple.
Bitbreaker
Registered: Oct 2002 Posts: 508 |
p[i] = (r>>z)&1 ? p[i]|s : p[i]&(s^255);
This can also be done with r>>(z+1), which moves the bit to be tested into carry. That saves the AND #$01 and CMP, and you can branch directly with BCC/BCS to do either the |s or the &(s^255).
Also, if you store p[i] in X and s in A, you can make use of the SBX opcode to do X = A & X.
Oswald
Registered: Apr 2002 Posts: 5094 |
I'm too tired and can't properly read C, but that looks like a hell of a lot of stuff to do 2048 times. RAND pops out to me: if you can't get away with some timer/raster register mixups, it will be expensive. j<7 seems pointless when j=w&7 (0..7) anyway; if it's a special case for j=7, then I'd shuffle things until the branch works with j=0, because then you can save a CMP as the Z flag usually sets itself. Best of course would be to get rid of the branch and make the LUT have a special case for j=7.
Not sure what the table is doing, but using a lookup for bit shifting seems expensive; asl rol / lsr ror is cheaper than setting up a table lookup.
chatGPZ
Registered: Dec 2001 Posts: 11386 |
"RAND pops out to me if you cant get away with some timer/raster register mixups, it will be expensive."
unless you really need "proper" random values (which you almost never do for demo effects), then RAND can be done very cheap:
updaterand: <- call once per frame
jsr rand <- your favourite PRNG
cnt:
sta rtab
inc cnt+1
rts
getrand: <- use instead of the PRNG
cnt:
lda rtab
inc cnt+1
rts
that said, dont listen to whoever tells you to use timers or raster or something like that for "random". it gives horrible results, even a static table will work better (and is faster).
Rudi Account closed
Registered: May 2010 Posts: 125 |
Groepaz: I used this in my noise generator several months ago:
rol
adc $d012
and experimented with exchanging adc for eor/sbc; in some cases that worked out better. The $a2 jiffy-clock byte could also work. Anyway, this wasn't about random number generation; the PRNG will be an experiment once the effect works, which will take months or perhaps years.
I've decided that regular mode can't be done, so I'll have to rewrite the whole algorithm and use multicolor or some kind of stretched sprites for this effect. But I've learned a lot in this thread, and I'll be re-reading some of it later, because there are things here that are useful to know.
enthusi
Registered: May 2004 Posts: 677 |
For Assembloids I generate a proper sbox ;-) |
johncl
Registered: Aug 2007 Posts: 37 |
A fun RND routine is one that uses a 256-byte buffer filled by the simple RND routine on codebase:
http://codebase64.org/doku.php?id=base:fast_8bit_ranged_random_..
Every frame you add some new numbers, wrapping around the 256-byte buffer. So whenever you need a random number in your other code, it's likely a new one, or one you had 256 reads back in time. In many cases this is more than enough.
Generally, whenever you can use a lookup table for something, use it to speed up the code.
chatGPZ
Registered: Dec 2001 Posts: 11386 |
look what i posted above =) |
Danzig
Registered: Jun 2002 Posts: 440 |
Quote: that said, dont listen to whoever tells you to use timers or raster or something like that for "random". it gives horrible results, even a static table will work better (and is faster).
Please follow the quoted advice.
Raster and timer increase sequentially, so the randomness depends entirely on the time between requests for a value. Even things like EORing the rasterline with a timer deserve to be called "predictable" in the widest sense. Using the current rasterline as the seed, though, is OK.