[CSDb] - User Forums - Fast way to rotate a char?

2017-01-04 08:32

Rudi
Account closed

Registered: May 2010
Posts: 125

Fast way to rotate a char?

I'm not talking about rol or ror, but about swapping bits so that the char is rotated 90 degrees:

Example:

a char (and the bits can be random):

10110010 byte 1..
11010110 byte 2.. etc..
00111001
01010110
11011010
10110101
00110011
10110100

after "rotation" (rows and columns are swapped):

Is it possible to use lookup tables for this, or would that lookup table be too big?
Or some other lookup table for getting and setting bits?

-Rudi
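To pin down what "rotated 90 degrees" means here, this is a minimal and deliberately slow reference sketch in C. The direction (counter-clockwise) and the name rotate_char_ccw_slow are assumptions for illustration, since the post does not say which way the char should turn. Row r is byte r, and the leftmost pixel is bit 7.

```c
#include <stdint.h>
#include <string.h>

/* Reference rotation, one pixel at a time (assumed direction: counter-clockwise).
   src[r] is row r of the char; column 0 is bit 7 (the leftmost pixel). */
static void rotate_char_ccw_slow(const uint8_t src[8], uint8_t dst[8])
{
    memset(dst, 0, 8);
    for (int r = 0; r < 8; r++) {
        for (int c = 0; c < 8; c++) {
            /* the pixel at (row r, column c) moves to (row 7 - c, column r) */
            if (src[r] & (0x80 >> c))
                dst[7 - c] |= (uint8_t)(0x80 >> r);
        }
    }
}
```

Any faster routine can be checked against this one byte by byte.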

... 105 posts hidden ...

2017-01-08 13:37

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: Rudi, except for the fact that you used EOR instead of ORA and didn't use LAX for the first read from zp (which would be possible if you swap the x and y registers), this looks identical to the code I posted pretty early in this thread. Is there a special reason to use EOR?

The EOR was a consequence of the formulas I used for masking and swapping (I derived this from Kalms tutor):

Example for the 4x4 swapping:

tmp0 = byte0 & 0xf0; //xxxx----
tmp1 = byte1 & 0xf0; //xxxx----
tmp2 = byte2 & 0xf0; //xxxx----
tmp3 = byte3 & 0xf0; //xxxx----
tmp4 = byte4 & 0x0f; //----xxxx
tmp5 = byte5 & 0x0f; //----xxxx
tmp6 = byte6 & 0x0f; //----xxxx
tmp7 = byte7 & 0x0f; //----xxxx
data0 = byte0 << 4;
data1 = byte1 << 4;
data2 = byte2 << 4;
data3 = byte3 << 4;
data4 = byte4 >> 4;
data5 = byte5 >> 4;
data6 = byte6 >> 4;
data7 = byte7 >> 4;
data0 ^= tmp4;
data1 ^= tmp5;
data2 ^= tmp6;
data3 ^= tmp7;
data4 ^= tmp0;
data5 ^= tmp1;
data6 ^= tmp2;
data7 ^= tmp3;

Sorry for the long code..

EOR is used for the final XOR merge, since I cannot use a lookup table for two different values (e.g. data0 and tmp4). The last eight operations above are done with the EOR instruction.
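As a rough sketch of how the 4x4 step above extends to a full rotation: the same mask, shift and merge pattern can be repeated at 2x2 and 1x1 granularity. The helper names quadrant_pass and rotate_char_ccw are hypothetical, and applying the same formula with masks 0xcc and 0xaa at the smaller sizes is an assumption based on the table definitions in this post, not code from the thread. With these formulas the char turns counter-clockwise.

```c
#include <stdint.h>
#include <string.h>

/* One pass over row pairs (k, k + shift). hi_mask selects the left block
   of each pair of blocks: 0xf0 for 4x4, 0xcc for 2x2, 0xaa for 1x1. */
static void quadrant_pass(uint8_t b[8], int shift, uint8_t hi_mask)
{
    uint8_t lo_mask = (uint8_t)~hi_mask;
    uint8_t out[8];
    for (int k = 0; k < 8; k++) {
        if (k & shift)
            continue;                   /* only start from the upper row of a pair */
        uint8_t top = b[k], bot = b[k + shift];
        /* same shape as data0 = (byte0 << 4) ^ (byte4 & 0x0f) */
        out[k]         = (uint8_t)(((top & lo_mask) << shift) ^ (bot & lo_mask));
        /* same shape as data4 = (byte4 >> 4) ^ (byte0 & 0xf0) */
        out[k + shift] = (uint8_t)((top & hi_mask) ^ ((bot & hi_mask) >> shift));
    }
    memcpy(b, out, 8);
}

/* Rotate the char in place by running the three passes in turn. */
static void rotate_char_ccw(uint8_t b[8])
{
    quadrant_pass(b, 4, 0xf0);
    quadrant_pass(b, 2, 0xcc);
    quadrant_pass(b, 1, 0xaa);
}
```

Each pass only moves blocks around without turning their contents, which is why the next smaller pass is needed to finish the job.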

Some of the lookup tables I derived do the EOR, AND and shift at the same time:

shl2_eor_cc[i] = (i ^ (i & 0xcc)) << 2;
shr2_eor_33[i] = (i ^ (i & 0x33)) >> 2;
shl1_eor_aa[i] = (i ^ (i & 0xaa)) << 1;
shr1_eor_55[i] = (i ^ (i & 0x55)) >> 1;

I scratched my head over how your ORA worked. And since I didn't understand that Dreamass macro code, I wrote mine from scratch. But maybe I should look at your LAX method next.

Edit: Now I see that it doesn't really matter whether one uses ORA or EOR for this technique.
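A small sketch of why ORA and EOR come out the same: the two operands of each merge never have a bit in common, so XOR and OR agree. The table formulas are the ones given above; the function names, and the choice of the 2x2 merge (a value masked with 0x33 combined with shl2_eor_cc[b]) as the case to test, are assumptions for illustration.

```c
#include <stdint.h>

static uint8_t shl2_eor_cc[256], shr2_eor_33[256];
static uint8_t shl1_eor_aa[256], shr1_eor_55[256];

/* Fill the four tables using the formulas from the post. */
static void build_tables(void)
{
    for (int i = 0; i < 256; i++) {
        shl2_eor_cc[i] = (uint8_t)((i ^ (i & 0xcc)) << 2);
        shr2_eor_33[i] = (uint8_t)((i ^ (i & 0x33)) >> 2);
        shl1_eor_aa[i] = (uint8_t)((i ^ (i & 0xaa)) << 1);
        shr1_eor_55[i] = (uint8_t)((i ^ (i & 0x55)) >> 1);
    }
}

/* Count the (a, b) pairs for which an EOR merge and an ORA merge differ.
   Call build_tables() first. */
static int count_eor_ora_differences(void)
{
    int differences = 0;
    for (int a = 0; a < 256; a++) {
        for (int b = 0; b < 256; b++) {
            uint8_t masked = (uint8_t)(a & 0x33);   /* bits only inside 0x33 */
            uint8_t shifted = shl2_eor_cc[b];       /* bits only inside 0xcc */
            if ((masked ^ shifted) != (masked | shifted))
                differences++;
        }
    }
    return differences;
}
```

The count stays at zero because 0x33 and 0xcc do not overlap.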

2017-01-08 13:46

Axis/Oxyron

Registered: Apr 2007
Posts: 91

But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)
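The identity is easy to confirm by brute force over all 256 byte values. A minimal sketch in C (the function name identity_holds is made up for this example): XORing i with its own masked bits clears exactly those bits, which is the same as ANDing with the inverted mask.

```c
#include <stdint.h>

/* Return 1 if i ^ (i & mask) equals i & ~mask for every byte i, else 0. */
static int identity_holds(uint8_t mask)
{
    uint8_t inverted = (uint8_t)~mask;
    for (int i = 0; i < 256; i++) {
        uint8_t lhs = (uint8_t)(i ^ (i & mask));
        uint8_t rhs = (uint8_t)(i & inverted);
        if (lhs != rhs)
            return 0;
    }
    return 1;
}
```

So a table named after "eor cc" is really just an "and 33" table followed by the shift.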

2017-01-08 14:45

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)

No, didn't think about that, hehe.

Btw, 312 cycles now (with LAX).

2017-01-08 14:54

Bitbreaker

Registered: Oct 2002
Posts: 499

Quoting Axis/Oxyron

But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)

Smells like the version of the table that shifts only 1 bit to the right could be replaced by some ASR magic?
Also, the and maskX looks like it could be folded into something else; it is too static to be done that often :-)

2017-01-08 15:22

Rudi
Account closed

Registered: May 2010
Posts: 125

XAA might be something too.

2017-01-08 18:29

Rudi
Account closed

Registered: May 2010
Posts: 125

Here's a different approach to it:

ldx $82			;3	
xaa #$33		;2	a=(x & 0x33)
ldy $80			;3
eor shl2_eor_cc, y	;4*
sta $90			;3
lda shr2_eor_33, x	;4*
eor tab_cc, y		;4*
sta $92			;3

It uses the same number of cycles, though.

Bitbreaker: yes, one could probably optimize the 1x1 rotator with other illegal opcodes, since some of them do a single shift.

2017-01-09 07:35

Bitbreaker

Registered: Oct 2002
Posts: 499

Besides that, it will produce rubbish, as XAA can add some unpredictable value to A before doing the TXA and AND part :-)
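To make the problem concrete, here is a sketch of the model commonly given for XAA #imm in documentation of undocumented 6502 opcodes: A = (A | magic) & X & imm. The function name xaa_model and passing the magic value as a parameter are assumptions for illustration; on real hardware that value is not under the programmer's control and is reported to vary between chips and conditions.

```c
#include <stdint.h>

/* Model of XAA #imm (assumption: A = (A | magic) & X & imm).
   magic stands for the unpredictable value mixed into A first. */
static uint8_t xaa_model(uint8_t a, uint8_t x, uint8_t imm, uint8_t magic)
{
    return (uint8_t)((a | magic) & x & imm);
}
```

Under this model the result equals X & imm only when every bit that matters is set in either A or the magic value, so a snippet that does not control A beforehand cannot rely on getting x & 0x33.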

2017-01-09 10:10

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Color Bar

I may have found a method that takes 432 cycles....

If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.

2017-01-09 11:52

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: Quoting Color Bar
I may have found a method that takes 432 cycles....

If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.

Are you using the masking method?

2017-01-09 12:25

Axis/Oxyron

Registered: Apr 2007
Posts: 91

I just want to share some thoughts on my merges that didn't work out. Perhaps I'm just missing the last twist.

The first idea was to make relative merges. I discussed that back in the 90s with some Amiga coders, and on 68030-68060 it saves some cycles.
The idea is that the inputs need not always be shifted by the exact amounts, as long as the difference between the shifts of the 2 inputs stays correct. The disadvantage is that the last merge needs some rol/ror to compensate.

This resulted in something like this:

lda {src1}
ldy {src2}
and #$aa
ldx {bittab1},y
sax {dst1}
eor {src1} ;invert and #$aa to and #$55
ldx {bittab2},y
sax {dst2}

Unfortunately it only saves 1 cycle per merge, which is completely eaten up by the last merge, which loses 2 cycles for the correction.

Another idea was to interleave the temp arrays with hi-byte pointers so that they can be used both as pointers for indirect y-indexing and as direct values. The code would look like this:

lax {src1}
and #{mask1}
ora ({src2}),y
sta {dst1}
lda ({src2}),y
ora {bittab2},x
sta {dst2}

This would also save 1 cycle per merge. But the unsolved problem is that the 2 uses of {src2} would have to point to 2 different tables. *grrr*

What definitely works is reordering the merges so that the last 2 merges of a resolution don't need to store the tmp values into the zp and the first 2 of the next resolution don't need to read them.

so the last
sta {dst2}
and the first
lax {src1}
would merge into
tax.

Saves 4 times 3=12 cycles.