 
Rudi
Registered: May 2010 Posts: 82 
Fast way to rotate a char?
Im not talking about rol or ror, but swap bits so that they are rotated 90 degrees:
Example:
a char (and the bits can be random):
10110010 byte 1..
11010110 byte 2.. etc..
00111001
01010110
11011010
10110101
00110011
10110100 after "rotation" (rows and columns are swapped):
11001101
01011000
10100111
11111111
00101000
01010101
11011010
00100110 is it possible to use lookup tables for this or would that lookup table be too big?
or other lookuptable for getting and setting bits?
Rudi 

 
Oswald
Registered: Apr 2002 Posts: 3991 
my first idea:
lsr row7
ror a
lsr row6
ror a
...
ror row0
ror a
sta result_row7
then again:
lsr row7
ror a
...
lsr row0
ror a
sta result_row6
etc. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Depends on resolution. In multicolor/2x2 you can do it with 4 bittables, just like you would do a 2xn chunkyFX. In hires things get complicated. The shift/rol method is a good starting point. Perhaps the Amiga C2P techniques based on merges would be faster. Not sure, yet. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Sounds like a coding challenge!
Should be doable with 64 tables and 464 cycles, with neither source nor destination in zero page. But that's a lot of RAM.. Oswald's concept's a lot saner. Maybe fastest routine in under 4k code+tables would be interesting to hunt for?
ldx byte0
ldy byte1
for j in range(8)
lda t{j}0,x
ora t{j}1,y
sta $xxxx
ldx byte2
ldy byte3
for j in range(8)
lda#$xx
ora t{j}2,x
ora t{j}3,y
sta $xxxx
ldx byte4
ldy byte5
for j in range(8)
lda#$xx
ora t{j}4,x
ora t{j}5,y
sta $xxxx
ldx byte6
ldy byte7
for j in range(8)
lda#$xx
ora t{j}6,x
ora t{j}7,y
sta $xxxx
(table ij copies a bit from location i to location j, clearing the rest) 
 
Rudi
Registered: May 2010 Posts: 82 
Interesting ideas. I think I understand Oswalds code.
Quoting ChristopherJamSounds like a coding challenge! Feel free to make it a coding challenge. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Haha, googling C2P leads back to csdb (Identify this doughnut).
Looks like Michiel's swappingtwoaddressbitsatatime technique at http://stackoverflow.com/questions/1667591/rotatingabitmap90.. ?
Going to have to think about efficient ways to swap low nybble of one byte with high nybble of another now.. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Challenge accepted. ;o) 
 
Shadow
Registered: Apr 2002 Posts: 310 
Is it just my spatial thinking that is a bit off now, or is your original example wrong? To me it doesn't seem like a correct 90 degree rotation is done, it looks like it's mirrored as well? 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Got a first version running using Amiga style merges (328 cycles). Code still looks ugly and unoptimized.
The used tables are 256 bytes each and filled as follows:
lsrtab1: and #$aa, lsr
lsrtab2: and #$cc, lsr, lsr
lsrtab4: lsr, lsr, lsr, lsr
asltab1: and #$55, asl
asltab2: and #$33, asl, asl
asltab4: asl, asl, asl, asl
Code is using Dreamassstyle macros.
src & dst are memory locations of the source and dest char.
tmp1 & tmp2 are 8 byte zparrays each.
;28 cycles mem>zp
;26 cycles zp>zp
;28 cycles zp>mem
#macro MERGE(src1, src2, mask1, mask2, bittab1, bittab2, dst1, dst2)
{
lax {src1}
ldy {src2}
and #{mask1}
ora {bittab1},y
sta {dst1}
tya
and #{mask2}
ora {bittab2},x
sta {dst2}
}
.MERGE(src+0, src+1, $aa, $55, lsrtab1, asltab1, tmp1+0, tmp1+1)
.MERGE(src+2, src+3, $aa, $55, lsrtab1, asltab1, tmp1+2, tmp1+3)
.MERGE(src+4, src+5, $aa, $55, lsrtab1, asltab1, tmp1+4, tmp1+5)
.MERGE(src+6, src+7, $aa, $55, lsrtab1, asltab1, tmp1+6, tmp1+7)
.MERGE(tmp1+0, tmp1+2, $cc, $33, lsrtab2, asltab2, tmp2+0, tmp2+2)
.MERGE(tmp1+1, tmp1+3, $cc, $33, lsrtab2, asltab2, tmp2+1, tmp2+3)
.MERGE(tmp1+4, tmp1+6, $cc, $33, lsrtab2, asltab2, tmp2+4, tmp2+6)
.MERGE(tmp1+5, tmp1+7, $cc, $33, lsrtab2, asltab2, tmp2+5, tmp2+7)
.MERGE(tmp2+0, tmp2+4, $f0, $0f, lsrtab4, asltab4, dst+0, dst+4)
.MERGE(tmp2+1, tmp2+5, $f0, $0f, lsrtab4, asltab4, dst+1, dst+5)
.MERGE(tmp2+2, tmp2+6, $f0, $0f, lsrtab4, asltab4, dst+2, dst+6)
.MERGE(tmp2+3, tmp2+7, $f0, $0f, lsrtab4, asltab4, dst+3, dst+7)
The code still looks pretty unoptimized. I think reordering merges can save some loads/stores to the tmparrays. Also SAX or AXS could come handy here.
If you are interested in the basic concept of this kind of C2p read here: http://www.lysator.liu.se/~mikaelk/doc/c2ptut/ 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Damn, should have asked Graham. He used nearly exactly my C2P code in the hiddenpart of Deus Ex Machina. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Sweet. I thought I had an approach that would take ~360 cycles, but it appears I've made a mistake somewhere, so it's both slower than yours *and* broken..
(was going to do a three skews rotation, with ROL tables for the first and third passes, and swap macros for shuffling bits down in the middle pass)
I'll ponder more after dinner. 
 
Cruzer
Registered: Dec 2001 Posts: 865 
Wow, that's some sick, slick code, Axis. No way I can understand it right now, but I'll look into it if I ever get the need. Did we determine the resolution btw? I'm guessing singlecolor, since you can't do a lossless rotation of a multicolor char, unless it's 2x2. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
I was assuming hires 8x8 chars. That makes the most sense and is the most interesting challenge. And looks like it´s what Rudi is watching for. 
 
Rudi
Registered: May 2010 Posts: 82 
yes, i was initially looking at hires 8x8 char.
Edit: current techniques just seem too slow for many chars. wished someone could do some magic tricks :) 
 
JackAsser
Registered: Jun 2002 Posts: 1156 
Btw, Axis is talking about this (from Mikael Kalms): http://www.lysator.liu.se/~mikaelk/doc/c2ptut/ 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Thanks for sharing my link jacky.^^ 
 
JackAsser
Registered: Jun 2002 Posts: 1156 
Quote: Thanks for sharing my link jacky.^^
Jesus, I'm tired... 
 
Oswald
Registered: Apr 2002 Posts: 3991 
Quote: Btw, Axis is talking about this (from Mikael Kalms): http://www.lysator.liu.se/~mikaelk/doc/c2ptut/
huh, that needs a relaxed state and time to have it sink in :) 
 
Groepaz
Registered: Dec 2001 Posts: 7840 
JackAsser is halfway there apparently :=) 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
When he is through, its time to go Amigaaaah!!! 
 
JackAsser
Registered: Jun 2002 Posts: 1156 
Quote: When he is through, its time to go Amigaaaah!!!
Haha, I got semi tired of it after setting up that HAM8mode. :) But it was a fun excercies and fun to figure out the bitpatterns myself, then you told me, and I was actually correct (for once). 
 
Perplex
Registered: Feb 2009 Posts: 187 
Quoting Rudicurrent techniques just seem too slow for many chars. wished someone could do some magic tricks :)
Magic sometimes happens when you ditch working on a generic solution and instead concentrate on one that fits your particular data set or use case.
Not implying that a generic solution is less interesting, mind. 
 
Rudi
Registered: May 2010 Posts: 82 
Some notes to self. altho, i dont know if it will do any miracles. maybe one can reduce it to Axiss solution, but I have no idea:
Basically what happens from a layman perspective (like me), bits are copied like this:
where each number are the bitindices. consider the second number (after the arrow) to be the destination (of zero memory), and source (infront of arrow):
copy from (bitindex) > to:0>7
1>15
2>23
3>31
4>39
.
.(and so on..)
.
63>56 looking at those destination values in binary produce a pattern like this:07: 00000111
15: 00001111
23: 00010111
31: 00011111
39: 00100111
47: 00101111
55: 00110111
63: 00111111 and06: 00000110
14: 00001110
22: 00010110
30: 00011110
38: 00100110
46: 00101110
54: 00110110
62: 00111110 and so on...
notice last column (trits) are same, but decrease, and msbs are increasing. wether or not one can derive something out of this i dont know. of course these are just indexvalues and dont have any sensible meaning for copying from and to memory. cycles would be massive using this technique, but my thought was to get something sensible out of this (which i did not yet)... 
 
Slammer
Registered: Feb 2004 Posts: 401 
I think its a shame that your cyclecounts for this problem is so low allready. I would like to see the spritecollision solution. Something like storing the 8 bytes on row 7 of succeding chars. Stretch the line and use the LFT trick to calculate the solution. But that would take 8 lines. 
 
tlr
Registered: Sep 2003 Posts: 1092 
Quote: I think its a shame that your cyclecounts for this problem is so low allready. I would like to see the spritecollision solution. Something like storing the 8 bytes on row 7 of succeding chars. Stretch the line and use the LFT trick to calculate the solution. But that would take 8 lines.
Interesting idea! It would not necessarily take 8 lines. IIRC collisions can be retriggered during the line. You will be limited by the amount of pixels a sprite covers though. 
 
Rudi
Registered: May 2010 Posts: 82 
I dont see how sprites will solve the C2Pproblem (with as few cycles as possible)? but then again, Im not too experienced with sprites on the C64. A goal would be to calculate as many chars (per frame) as possible for C2P. 
 
Groepaz
Registered: Dec 2001 Posts: 7840 
tlr: wasnt there some VIC that only clears (or retriggers?) the collisions once? 
 
tlr
Registered: Sep 2003 Posts: 1092 
Quote: tlr: wasnt there some VIC that only clears (or retriggers?) the collisions once?
That was lightpen triggering. It's working very differently on 6569R1. 
 
Skate
Registered: Jul 2003 Posts: 451 
I just made a rough calculation. Using 16k of precalc data (which is already terrible), each char takes 500600 cycles to rotate. So, c2p routine looks like the best solution in every aspect. I'm still thinking an alternative approach though. 
 
Flavioweb
Registered: Nov 2011 Posts: 327 
Quoting SkateI just made a rough calculation. Using 16k of precalc data (which is already terrible), each char takes 500600 cycles to rotate.
Just to say (...maybe i'm offtopic...) this code took 544 cicles rotate a char counterclockwise.
*=$C000
.FOR S=0, S<8, S=S+1
LDA SOURCE+S
.FOR D=0, D<8, D=D+1
ASL
ROL DEST+D
.NEXT
.NEXT
RTS
;
SOURCE
.BYTE %11111111
.BYTE %10000000
.BYTE %01000000
.BYTE %00100000
.BYTE %00010000
.BYTE %00001000
.BYTE %00000100
.BYTE %11111111
DEST
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
.BYTE %00000000
; 
 
Rudi
Registered: May 2010 Posts: 82 
This takes ~472 cycles, and is basically similar to Oswalds approach: ;counterclockwise rotation by 90 degrees.
;column 1
ror $74
rol
ror $75
rol
ror $76
rol
ror $77
rol
ror $78
rol
ror $79
rol
ror $7a
rol
ror $7b
rol
sta $7c
;tot = 59 cycles
;column 2
ror $74
rol
ror $75
etc..
ror $7b
rol
sta $83
;tot = 59*8 = 472 cycles did a test in zp. so rotated (destination) bytes are in $7c to $83. this req eight bytes to be copied to zeropage ($74 to $7b) (as an example). 
 
Oswald
Registered: Apr 2002 Posts: 3991 
axis' one is close to max possible I guess, go for it if understandable version is not enough :) 
 
Skate
Registered: Jul 2003 Posts: 451 
Table lookup doesn't necessarily be faster than bit shifting. What i was saying is even using large tables to get rid of bit shifting doesn't save cycles unless you start thinking out of the box. 
 
Skate
Registered: Jul 2003 Posts: 451 
here is an algo which looks similar to c2p method without 8 bit restrictions.
void transpose8(unsigned char A[8], int m, int n,
unsigned char B[8]) {
unsigned x, y, t;
// Load the array and pack it into x and y.
x = (A[0]<<24)  (A[m]<<16)  (A[2*m]<<8)  A[3*m];
y = (A[4*m]<<24)  (A[5*m]<<16)  (A[6*m]<<8)  A[7*m];
t = (x ^ (x >> 7)) & 0x00AA00AA; x = x ^ t ^ (t << 7);
t = (y ^ (y >> 7)) & 0x00AA00AA; y = y ^ t ^ (t << 7);
t = (x ^ (x >>14)) & 0x0000CCCC; x = x ^ t ^ (t <<14);
t = (y ^ (y >>14)) & 0x0000CCCC; y = y ^ t ^ (t <<14);
t = (x & 0xF0F0F0F0)  ((y >> 4) & 0x0F0F0F0F);
y = ((x << 4) & 0xF0F0F0F0)  (y & 0x0F0F0F0F);
x = t;
B[0]=x>>24; B[n]=x>>16; B[2*n]=x>>8; B[3*n]=x;
B[4*n]=y>>24; B[5*n]=y>>16; B[6*n]=y>>8; B[7*n]=y;
}

 
HCL
Registered: Feb 2003 Posts: 652 
Ah, that's just beautiful Axis!!
..as usual, i'm the last one to read the thread.. but if anyone still wonders what that code does, it is mirroring the 8x8matrix by starting with 2x2bitblocks, then 4x4blocks, then finally the whole 8x8block. Mirroring and rotation is of course just a matter of configuring the macros :).
..and of course Graham did this 15+ years ago, hope that makes you all feel as retard as i do ;). 
 
Pex Mahoney Tufvesson
Registered: Sep 2003 Posts: 32 
I used Rudi's 472cycle approach "This takes ~472 cycles, and is basically similar to Oswalds approach" in Storebror
There was no need to make it any faster, since it was a onetime calculation just before the android starts rotating. My routine included not only rotation, but mirroring across horizontal, vertical and diagonals as well.
And HCL... I'm even later than you are. So don't worry, you're ok! ;)

Have a noise night!
http://mahoney.c64.org 
 
CreaMD
Registered: Dec 2001 Posts: 2422 
I wonder if this would significantly speed up my proportional routine and help me "posthumously" win the internal compo I had with Wotnau ;)
So which one is the fastest? I need to rotate full line of bitmap 320 bytes char by char, +90 degrees each 
 
dano
Registered: Jul 2004 Posts: 80 
Haha, did my a proportial Routine for NewsPress back then the same way as CreaMD did. Switched to a standard one some time later as this was even faster. Guess i did the rotation the Oswaldway too back then. 
 
Oswald
Registered: Apr 2002 Posts: 3991 
don think you can beat a bitshifter with this c2p, couple hundred cycles even with axis method, while bitshifter worst case is about 170 (lsr a ror zp * 8 lines * 3 times) 
 
Rudi
Registered: May 2010 Posts: 82 
Im trying to figure my head around this:
Since the rotation is just a bunch of bitswaps. For example swapping bit 0 and 7, bit 1 and 15, 2 and 23 etc.00 01 02 03 04 05 06 07
08 09 10 11 12 13 14 15
etc..
56 57 58 59 60 61 62 63. How would a swapbittable look for this? The number of swaps are only 32. only 32 lda's and sta's? lets say ~8 cycles in total per swap: 8*32 = 256 cycles for all the swaps. but still I dont know if its possible to have a table for this..
Edit: hm, woops. i think perhaps 32 is wrong, since both locations need to be written to, so maybe two sta's instead of just one. 
 
Cruzer
Registered: Dec 2001 Posts: 865 
Oswald: Why 3 times and not 8? 
 
Oswald
Registered: Apr 2002 Posts: 3991 
Quote: Oswald: Why 3 times and not 8?
coz you rotate in 2 buffers a,b:
a b
a b
a b
but if shifting is bigger than 3 you can just shift bits to opposite dir and then swap a,b (on screen). 3 is just a wild guess maybe 4 is the max, it was a long time ago. 
 
Rudi
Registered: May 2010 Posts: 82 
There is some interesting cyclic dynamic going on when looking at a bitswapping method for 8x8 bits:
these figures shows that the outer layer is rotated 8 times first, then the next layer is rotated 6 times and so forth:xxxxxxxx ........ ........ ........
x......x .xxxxxx. ........ ........
x......x .x....x. ..xxxx.. ........
x......x .x....x. ..x..x.. ...xx...
x......x .x....x. ..x..x.. ...xx...
x......x .x....x. ..xxxx.. ........
x......x .xxxxxx. ........ ........
xxxxxxxx ........ ........ ........ in this example one start with c0r0 and swap that with c7r0, then c1r0>c7r1, c2r0>c7r2 etc. so it goes around like a circle would. when that process is finished, the next inner layer does the same thing. until all four layers are done. and the result is a 90 deg rotation. its just one other way of looking at it. but still, a lookuptable for this? its difficult for me to see how. (im just giving out a few ideas for trying out other approach maybe). 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Problem with bitshifter is it only deals with one bit at a time.
c2p exchanges four bitpairs in only 2030 cycles. The expensive part is moving bits to matching positions within the two bytes they're being exchanged between... 
 
Digger
Registered: Mar 2005 Posts: 217 
Nice read. But what fx do you have in mind Rudi? 
 
Rudi
Registered: May 2010 Posts: 82 
Quote: Nice read. But what fx do you have in mind Rudi?
My initial thinking was to make a sinescroller, where the sinustransform is done in chunky and then transformed using c2p. Of course the c2proutine has to be fast enough for this. It would also bring out some new ideas for other vertical effects (that are first done with horisontal tricks, and then transformed using c2p). Unfortunately there seem to be no _really_ fast c2p routine.
Im still figuring out how that 4x4, 2x2 plus 1x1 rotator works. or that is; looking at every detail of the bit swaps.
Edit: hiressinescroller. 
 
Frantic
Registered: Mar 2003 Posts: 1274 
It is not really what you specified in your first post, but just a thought that may be worth considering: Would it be okay to represent the original char in some different way? If so, there might (possibly) be ways to represent that char that would make it easier to "rotate" it to +90˚ angle although at the cost that you may then also need to do some operations on the data to be able to "read out" the normal 0˚ angle.
Just thinking aloud here. Trying to rethink the problem rather than the solution, so to speak. Not sure if it would be allowed. 
 
Rudi
Registered: May 2010 Posts: 82 
Frantic? Sorry mate, but my first post is what i need. the figures dont lie :) 
 
Copyfault
Registered: Dec 2001 Posts: 187 
Interesting Thread :)
Sticking to the figures of the initial post the goal is to have a char mirror* routine with a cycle count as low as possbile for 8x8Chars HiRes with arbitrary bit patterns [*mirror in the sense of mirroring along the diagonal line from bitpos (0,0) to (7,7)].
Maybe it's possible to weaken the requirement of _arbitrary_ bitpatterns... I'm thinking about using a charset with chars $00$7f for the "given" bitpatterns and $80$ff for the corresponding "mirrored" patterns. This way, mirroring would only be a question of setting the msb of the corresponding bytes in the screenram.
Guess in order to have a chance to make this approach work you'd need several charsets spread along the screen  and this still requires that appropriate slices of the screen will have at most 128 different bitpatterns.
Another downside: expanding this approach to 90° rotations will lower the number of different bitpatterns even further to 64.
But maybe this thought is of any help; it just came up while reading so I wanted to share it. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
For things like sine scrollers, there may be value in doing a partial rotation; only takes 240 cycles per char to reshape eight bytes into
; 00224466
; 00224466
; 00224466
; 00224466
; 11335577
; 11335577
; 11335577
; 11335577
(digit is index of source byte)
using something like
ldx s+0
ldy s+1 ; 8 cycles priming the xy line cache. Need to do this on row 0 and row 4
lda t00,x
ora t01,y ; 8
ldx s+4
ldy s+6 ; 8
ora t02,x
ora t03,y ; 8
sta dst+0 ; 4 28 cycles for this block; one of these per dest byte
lda t12,x
ora t13,y
ldx s+0
ldy s+2
ora t10,x
ora t11,y
sta dst+1
etc 
 
Krill
Registered: Apr 2002 Posts: 809 
Rudi: So you want to rotate chars by arbitrary angles, not just swap X and Y (90°)? What's the problem with Frantic's suggestion of precalculating another representation of the original characters to optimise runtime rendering of the rotated characters, thus trading memory for speed? 
 
Rudi
Registered: May 2010 Posts: 82 
Krill: No. I ONLY want to swap x and y 90 degrees. No other degrees. Im working on Axis's method with masks and so on.. (bcoz it seems to be the one technique that use fewest cycles). 
 
Skate
Registered: Jul 2003 Posts: 451 
I think Rudi wants to rotate dynamically created chars, not a static charset. Rudi, did i get you right? 
 
Frantic
Registered: Mar 2003 Posts: 1274 
I think so, yes, as he wrote that "the bits can be random". (But that doesn't exclude the possibility of playing with other types of representations as such.) 
 
Rudi
Registered: May 2010 Posts: 82 
Quote: I think Rudi wants to rotate dynamically created chars, not a static charset. Rudi, did i get you right?
Yes Skate! That is right. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
I may have found a method that takes 432 cycles. An 8x8 rotation can be performed by moving around 4x4 blocks, then 2x2 subblocks and then the bits in each subblock. Now you can identify for each resulting destination bit pair (that is, bist 7+6, or 5+4, ..., or 1+0) the nybbles of the source bytes it came from. So an approach could be to take two source bytes, merge the desired nybbles into one byte, and then use look up tables to transform it to the desired bit pairs (at the right location, i.e., a bit pair 7+6 from the merged source nybbles could be transformed to a destination bit pair 3+2, for example).
ldx first_source_byte
lda second_source_byte
and #mask ;mask=f0 if you need the high nybble and 0f for the low nybble
ora byte2DesiredNybble,x ;look up table to merge the right nybble in the right way
tax
lda first_destination
ora bits76toRightPosition,x ; Table filters out the desired bits and puts them in the right order at the right position.
sta first_destination
lda second_destination
ora bits54toRightPosition,x
sta second_destination
...
lda fourth_destination
ora bits10toRightPosition,x
sta fourth_destination
+
54 cycles (source and destination bytes in ZP).
This fills in 4 times 2 bits, so you have to do that 8 times to fill the entire destination matrix. So that is 432 cycles in total. But you need a bunch of lookup tables, so it is not very memory efficient.
I hope I did not overlook something ... 
 
Krill
Registered: Apr 2002 Posts: 809 
While 8x8 tile transpose is an entertaining academic exercise, i'm dubious of its practical applications for C64 purposes.
In fact, i'd be hardpressed to find an effect that is rendered faster with generating "horizontal" 8x8 tiles first and then rotating them, as opposed to rendering "vertical" 8x8 tiles right away.
Also, an actual chunky to char approach seems more feasible. Render to linear chunky bitmap (1 byte per pixel), then collect the bits and render final 8x8 tiles. 
 
Rudi
Registered: May 2010 Posts: 82 
Krill: Bear in mind im not too an experienced coder on C64 that I see solutions right away. I figured out this might be an nice intellectual exercise for myself, hopefully for others too(?) but wether or not it has any practical usage, that I have no idea..
Btw, I have a working 336 cycles. It consists of twelve "copies" of this code (pseudocode):lda $<zp1>
tay
and #<mask1>
ldx $<zp2>
eor $<tab1>, x
sta $<zp3>
txa
and #<mask2>
eor $<tab2>, y
sta $<zp4> 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Rudi, except of the fact that you used EOR instead of ORA and didnt use LAX for the first read on zp (would be possible if you swap x and y registers), this looks identical to the code I posted pretty early in this thread. Is there a special reason to use EOR? 
 
Rudi
Registered: May 2010 Posts: 82 
Quote: Rudi, except of the fact that you used EOR instead of ORA and didnt use LAX for the first read on zp (would be possible if you swap x and y registers), this looks identical to the code I posted pretty early in this thread. Is there a special reason to use EOR?
The eor was a consequence of the formulas i used for masking and swapping (I derived this from Kalms tutor):
Example for the 4x4 swapping:tmp0 = byte0 & 0xf0; //xxxx
tmp1 = byte1 & 0xf0; //xxxx
tmp2 = byte2 & 0xf0; //xxxx
tmp3 = byte3 & 0xf0; //xxxx
tmp4 = byte4 & 0x0f; //xxxx
tmp5 = byte5 & 0x0f; //xxxx
tmp6 = byte6 & 0x0f; //xxxx
tmp7 = byte7 & 0x0f; //xxxx
data0 = byte0 << 4;
data1 = byte1 << 4;
data2 = byte2 << 4;
data3 = byte3 << 4;
data4 = byte4 >> 4;
data5 = byte5 >> 4;
data6 = byte6 >> 4;
data7 = byte7 >> 4;
data0 ^= tmp4;
data1 ^= tmp5;
data2 ^= tmp6;
data3 ^= tmp7;
data4 ^= tmp0;
data5 ^= tmp1;
data6 ^= tmp2;
data7 ^= tmp3; Sorry for the long code..
EOR is used for the last xorswapping. Since I cannot use lookuptable for two different values (fex. data0 and tmp4). The last eight operations in the above are done with the EORinstruction.
Some of the lookuptables i derived are doing EOR, AND and SHIFTS at the same time:shl2_eor_cc[i] = (i ^ (i & 0xcc)) << 2;
shr2_eor_33[i] = (i ^ (i & 0x33)) >> 2;
shl1_eor_aa[i] = (i ^ (i & 0xaa)) << 1;
shr1_eor_55[i] = (i ^ (i & 0x55)) >> 1; I scratched my head around how your ORA worked. And since I didnt understand that Dreamassmacrocode I wrote mine from scratch. But maybe I should look at your LAXmethod next.
Edit: Now I see that it doesnt really matter if one use ora or eor for this technique. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
But you know that:
(i ^ (i & 0xcc))
is the same as:
i & 0x33
;o) 
 
Rudi
Registered: May 2010 Posts: 82 
Quote: But you know that:
(i ^ (i & 0xcc))
is the same as:
i & 0x33
;o)
No, didnt think about that hehe.
Btw, 312 cycles now (with LAX). 
 
Bitbreaker
Registered: Oct 2002 Posts: 418 
Quoting Axis/OxyronBut you know that:
(i ^ (i & 0xcc))
is the same as:
i & 0x33
;o)
smells like the version of the tab that shifts only 1 bit to the right could be substituted by some asr magic?
Also the and maskX looks like it could be included into something, too static to be done that often :) 
 
Rudi
Registered: May 2010 Posts: 82 
XAA might be something too. 
 
Rudi
Registered: May 2010 Posts: 82 
Here's a different approach to it:ldx $82 ;3
xaa #$33 ;2 a=(x & 0x33)
ldy $80 ;3
eor shl2_eor_cc, y ;4*
sta $90 ;3
lda shr2_eor_33, x ;4*
eor tab_cc, y ;4*
sta $92 ;3 uses the same amount of cycles though.
Bitbreaker: yes, one could probably optimize the 1x1 rotator with other illegalopcodes. sine some of them do one shift. 
 
Bitbreaker
Registered: Oct 2002 Posts: 418 
Besides that it will produce rubbish as xaa can add some unpredictable value to A before doing the txa and and part :) 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting Color BarI may have found a method that takes 432 cycles....
If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles. 
 
Rudi
Registered: May 2010 Posts: 82 
Quote: Quoting Color BarI may have found a method that takes 432 cycles....
If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.
Are you using the masking method? 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
I just want to share some thoughts on my merges that didnt work out. Perhaps I´m just missing the last twist.
First idea was to make relative merges. I discussed that back in the 90´s with some Amiga coders and on 6803068060 it saves some cycles.
Idea is, that shifting of the input must not always have the exact values, as long as the delta of the shift of the 2 inputs stays correct. Disadvantage of that is, that the last merge needs to make some rol/ror to compensate.
This resulted in something like this:
lda {src1}
ldy {src2}
and #$aa
ldx {bittab1},y
sax {dst1}
eor {src1} ;invert and #$aa to and #$55
ldx {bittab2},y
sax {dst2}
Unluckily it only saves 1 cycle per merge which is completely eaten up by the last merge that looses 2 cycles for the correction.
Another idea was to interleave the temparrays with hibyte pointers so that they can be used both as pointers for indirect yindexing and as direct values. Code would look like this:
lax {src1}
and #{mask1}
ora ({src2}),y
sta {dst1}
lda ({src2}),y
ora {bittab2},x
sta {dst2}
Would also save 1 cycle per merge. But the unsolved problem is, that the 2 usages of {src2} should be pointing to 2 different tables. *grrr*
What definitely works is reordering the merges, so that the last 2 merges of a resolution dont need to store the tmpvalues into the zp and the first two of the next resolution doesnt need to read the tmpvalues.
so the last
sta {dst2}
and the first
lax {src1}
would merge into
tax.
Saves 4 times 3=12 cycles. 
 
Rudi
Registered: May 2010 Posts: 82 
Interesting ideas. Reordering merges was something I tried out but failed. Maybe I should look at it again. 
 
Rudi
Registered: May 2010 Posts: 82 
Ok, so I did merge 4x4 and 2x2, and it worked. Am now at 299 cycles. There is probably more that can be optimized, because now the code looks like a mess. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quote: Are you using the masking method?
These diagrams illustrate what I am thinking of
xxxxxxxx source0
xxxxxxxx source1
xxxxxxxx
xxxxxxxx
aabbccdd
aabbccdd
aabbccdd
aabbccdd source7
xxxxaaaa destination0
xxxxaaaa destination1
xxxxbbbb
xxxxbbbb
xxxxcccc
xxxxcccc
xxxxdddd
xxxxdddd destination7
The source bits in the block indicated by the a's should end up in the proper locations in destination bytes. Using look up tables the 8 a's are put into one byte such that the high nybble is the desired nybble of destination0 and then the low nybble that of destination1. The same procedure can be followed for all 4x2 source blocks 'b','c','d', etcetera.
Code for doing that can look like
ldy source4
lda mergingTable1,y ;This table takes two bits and puts them in a byte, such that they are in the right positions to use as nybble for the destination.
ldx source5
ora mergingTable2,x
ldx source6
ora mergingTable3,x
ldx source7
ora mergingTable4,x ;Now the byte is filled with all bits from block 'a'.
tax
and #$0f
sta destination0
lda moveHighNybbleToLowNybble,x
sta destination1
+
42 cycles.
This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.
The lower half of the source matrix is therefore moved in 42+3*41 = 165 cycles.
The other half of the matrix goes very similar, but since the destination bytes are no longer empty, we have to add 'ORA destination0' and 'ORA destination1' before the 'STA destination0' and 'STA destination1', respectively. So that will take 4*6 cycles extra, totalling 2*165+24=354 cycles. But the method is very memory inefficient. 
 
White Flame
Registered: Sep 2002 Posts: 123 
Here's another idea which I think fails, but might be salvageable.
So the basic LSR, ROR accumulation grabs a bit and stores a bit, taking 2 instructions per bit. However, if done inplace just like many basic shiftadd multiplication routines, ROR both sets a final bit as well as reads the next bit to place. This means that in dream land, the flip can be done in about 64 RORs. If fully in zp, that would be 320 cycles, but only 128 bytes of code with no tables.
However, the tactic I took doesn't seem to have a nice clean loop of RORs linking source bit locations to final bit locations. Various CMP #80s and other bytemasking & merging seems to be required, which would likely bloat it back up to 400+ cycles. But maybe by shuffling it around differently, an arrangement could be made that's both fast and short. 
 
Rudi
Registered: May 2010 Posts: 82 
ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill! 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quote: ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill!
I think my method needs 18 tables of 1 page = 4.5k. Using TXS/TSX will shave off 2 cycles in total. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting Color Bar
This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.
I meant that the 'LDY source4' can be omitted. The TAY doesn't make sense and should be omitted, so the code takes 342 cycli.
I can gain another 6 cycli by saving X. If the code
tax
lda moveHighNybbleToLowNybble,x
is replaced by
sta selfmod:+1
selfmod:
lda moveHighNybbleToLowNybble
then source7 is kept in X and the code for block 'b' can start with fetching the bits of source7 from a table and omit the 'LDX source7'. One cycle is gained that way. In that block source6 must then be saved in X as shown above, and in the following block source5. The code for the last block can stay as it was in the original post. That saves 3 cycles for half the matrix and 6 in total, and the algorithm takes 336 cycles. 
 
Oswald
Registered: Apr 2002 Posts: 3991 
still awful lot of cycles. it seems, whatever the goal is, it is faster to construct the bits into final orientation right away. 
 
enthusi
Registered: May 2004 Posts: 634 
I could still image some topview char based game where rotating in 'realtime' might actually make sense.
But you would certainly need some very special case when this is favorable over replacing chars from a lookup. 
 
Rudi
Registered: May 2010 Posts: 82 
Still got 299 cycles after merging the 4x4 and 2x2 (195 cycles) plus 1x1 (104 cycles). Can't get it any lower than that at the moment. Im trying to look at a different method.
Anyway; Im wondering if the checkerboard pattern (ANDmasking) approach with the shifts can be minimized.. Or changed to a more effective method. If it turns out that the checkerboard pattern is the most optimal then there's no other way than trying to optimize the actual assembly code. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Pretty good. Below the magic 300 cycles barrier. :) 
 
Copyfault
Registered: Dec 2001 Posts: 187 
@Rudi: what do you mean with "merge 4x4 and 2x2"? I understand the 312cycles version, though this is only valid if you stick to zp load/store. If you access absolute adresses (which will be necessary to read/write to charset data or bitmap resp.), this very same version ends up with 328 cycles > identical to Axis' first routine.
By reordering the merges just like Axis pointed out this can be lowered to 328  12 = 316 (300 if you stick to zp load/store).
So how do you come down to 299 cycles??? 
 
Rudi
Registered: May 2010 Posts: 82 
What I really meant was reordering merges, which I think Axis pointed on in one of his posts. Here I only managed to reorder the 2x2 rotation into 4x4 by removing some sta's and reusing the accumulator iirc. But unfortunately the code is so messy that I myself easily get tired after trying to optimize this. So, here comes a long list of code if you dont mind. I used tables for the shifts, and dot mind the eortag its the same as or. the next tag is the and value. Also used specific addresses in zp since this is just a test.
;
;4x4 & 2x2 rotation (195 cycles)
;
ldy $61
lax $65
and #$0f
eor shl4, y
sta $81
tya
and #$f0
eor shr4, x
sta $85
ldy $63
lda $67
and #$0f
eor shl4, y
ldy $81
tax
and #$33
eor shl2_eor_cc, y
sta $91
tya
and #$cc
eor shr2_eor_33, x
sta $93
lda $63
ldx $67
and #$f0
eor shr4, x
ldy $85
tax
and #$33
eor shl2_eor_cc, y
sta $95
tya
and #$cc
eor shr2_eor_33, x
sta $97
ldy $60
lax $64
and #$0f
eor shl4, y
sta $80
tya
and #$f0
eor shr4, x
sta $84
ldy $62
lax $66
and #$0f
eor shl4, y
ldy $80
and #$33
eor shl2_eor_cc, y
sta $90
tya
and #$cc
eor shr2_eor_33, x
sta $92
lda #$62
and #$f0
eor shr4, x
ldy $84
tax
and #$33
eor shl2_eor_cc, y
sta $94
tya
and #$cc
eor shr2_eor_33, x
sta $96
;
;1x1 rotation (104 cycles)
;
ldy $90
lax $91
and #$55
eor shl1_eor_aa, y
sta $70
tya
and #$aa
eor shr1_eor_55, x
sta $71
ldy $92
lax $93
and #$55
eor shl1_eor_aa, y
sta $72
tya
and #$aa
eor shr1_eor_55, x
sta $73
ldy $94
lax $95
and #$55
eor shl1_eor_aa, y
sta $74
tya
and #$aa
eor shr1_eor_55, x
sta $75
ldy $96
lax $97
and #$55
eor shl1_eor_aa, y
sta $76
tya
and #$aa
eor shr1_eor_55, x
sta $77 I wanted to make the 4x4vs2x2 rotator shorter by using other method, but havent found a solution for this yet (the 1x1 rotator also, for that matter, by optimizing those two distinct sections).
The character is at $60$67 and the result is at $70$77. 
 
Copyfault
Registered: Dec 2001 Posts: 187 
I have to admit that the approach with those "Amigastyle"merges is really outstanding. Hardly possible to force oneself to think different approaches...
Ok, at least I tried ;) One idea I had is to use the CMPinstruction to read out the bits of every byte. E.g.
ldx #$7f
cpx byte0
rol
cpx byte1
rol
cpx byte2
rol
cpx byte3
rol
cpx byte4
rol
cpx byte5
rol
cpx byte6
rol
cpx byte7
rol
eor #$ff
sta dest0
but this already sums up to 47cycle (if sticking to zp lda/sta/cpx). Furthermore, this will only work for dest0 unless one masks out bit7bit1 step by step > every byte needed 8 times, too many table lookups > most probably not feasible :/
Another idea would be to store the char bit patterns along the diagonal. Usually we have
byte0: b7 b6 b5 b4 b3 b2 b1 b0
byte1: b7 b6 b5 b4 b3 b2 b1 b0
byte2: b7 b6 b5 b4 b3 b2 b1 b0
byte3: b7 b6 b5 b4 b3 b2 b1 b0
byte4: b7 b6 b5 b4 b3 b2 b1 b0
byte5: b7 b6 b5 b4 b3 b2 b1 b0
byte6: b7 b6 b5 b4 b3 b2 b1 b0
byte7: b7 b6 b5 b4 b3 b2 b1 b0
Now we could store the same information as e.g.
data0: byte0.b0 X X X X X X X
data1: byte0.b1 byte1.b0 X X X X X X
...
data7: byte0.b7 byte1.b6 byte2.b5 byte3.b4 byte4.b3 byte5.b2 byte6.b1 byte7.b0
...
dataE: byte7.b7 X X X X X X X
The huge advantage of such a data structure is obvious: mirroring the bitpatterns is equivalent to reversing the order of the dataXbytes. Drawback: in order to display the corresponding bitpattern a conversion routine is _always_ needed (no matter if the "original" bitpattern should be displayed or the "mirrored" pattern). I couldn't come up with a decent routine to compensate for this.
Maybe someone can take the good points out of these thoughts... 
 
Color Bar
Registered: Oct 2012 Posts: 98 
I'm now at 326 cycli. 
 
Rudi
Registered: May 2010 Posts: 82 
Looking at the 4x4 rotation an recipe that I have made look like this:1. rol 4 times higher 4 bytes.
2. swap lower nybbles of byte0>byte4, byte1>byte5 etc.
3. rol 4 times lower 4 bytes. Done.
1 and 3 can be done with lookuptables. 2 is a more tricky.
So, what no.2 need is a fast way to swap lonybbles of two bytes, but it seems to be difficult. One would have to do that in 14 cycles or so. Impossible.
Someone gave me this xorswap algorithm which takes 27 cycles:LDA byte1
AND #$0f
EOR byte2
STA byte2
AND #$0f
EOR byte1
STA byte1
AND #$0f
EOR byte2
STA byte2 I also made this, but it takes one cycle more than the former:LDX byte1
LDY byte2
LDA lowCleared,x
ORA andTab,y
STA byte1
LDA lowCleared,y
ORA andTab,x
STA byte2 I guess this wont help at all. Because 27*4 = 108 cycles. Allready reached the limit from the 312 version where each rotatorsection take 104 cycles. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting RudiSomeone gave me this xorswap algorithm which takes 27 cycles:LDA byte1
AND #$0f
EOR byte2
STA byte2
AND #$0f
EOR byte1
STA byte1
AND #$0f
EOR byte2
STA byte2
26 cycles:
lax byte1
and #$f0
ldy byte2
ora grabLowNybble,y ;This table performs AND #$0f
sta byte1 ;Now low nybble of byte2 is in byte1
tya
and #$f0
ora grabLowNybble,x ;byte1 was kept in X
sta byte2

 
Oswald
Registered: Apr 2002 Posts: 3991 
you just need a 64k table
lda byte1byte2
sta result
:) 
 
Rudi
Registered: May 2010 Posts: 82 
Colorbar: nice
Oswald: hah yeah, like thats gonna happen. :P 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Flip disk ...
Rotate monitor clockwise ... 
 
Oswald
Registered: Apr 2002 Posts: 3991 
del. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quote: Flip disk ...
Rotate monitor clockwise ...
Girls They Want to Have Fun 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting Rudi
2. swap lower nybbles of byte0>byte4, byte1>byte5 etc.
That helped. If all bits from block 'a' in my post above are put entirely in destination0, and the bits of the 4x2 block above that in destination1, nybbles have to be swapped at the end, but the code
sta selfmod:+1
and #$0f
sta destination0
selfmod:
lda moveHighNybbleToLowNybble,x
sta destination1
simplifies to just 'STA destination' and with this I can reduce the cycle count to 312, I think. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
I'm not managing to get below 308 cycles (source and destination non zero page).
However, my second stage is only 23 cycles, less for when stage chaining is done. Perhaps this can help optimise one of the routines above?
#define swap1(s1,s2,m,d1,d2) \
.( :\
lda s1 :\
eor s2 :\
sta ztb :\
and#m :\
eor s2 :\
sta d1 :\
eor ztb :\
sta d2 :\
.)
(also  I could get down to 300 by moving the code to zero page, but then you'd have to jmp there and back, and couldn't unroll for multiple source/destinations  one char takes most of the page) 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting ChristopherJamI'm not managing to get below 308 cycles (source and destination non zero page).
However, my second stage is only 23 cycles.
Yes, thanks! That brings my lookup table approach down to 301 cycles when source and dest are in ZP and to 329 cycles when they are not. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Christopher: That 3x EOR thing is exactly what we did on Amiga back in the days. Didnt expect this to have an advantage on 6502. But where is the shifting taking place? Or is this only in 1 of the 3 passes, and the other pathes correct the bitorder with a table lookup? 
 
Rudi
Registered: May 2010 Posts: 82 
Color Bar: 301 cycles is nice, just 2 cycles more than what my code does.
If anyone can get below 299 cycles then pls explain what method you use :p 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Quoting Axis/OxyronChristopher: That 3x EOR thing is exactly what we did on Amiga back in the days. Didnt expect this to have an advantage on 6502. But where is the shifting taking place? Or is this only in 1 of the 3 passes, and the other pathes correct the bitorder with a table lookup?
Sweet. Yes, the code above is only used in the second of the three passes; first and third are very similar to yours, only with a bit shuffle on one of the input bytes in each pair on the third pass. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Quoting RudiIf anyone can get below 299 cycles then pls explain what method you use :p
Could we make the rules the same as for Axis' contribution before we compare cycle counts?
neither input nor output bytes on zero page
zero page intermediate results are fine
code is not relocated to zero page either.
In practice it's unlikely there'd be enough space in ZP for source and destination charsets, and copying the data in and out would add at least an extra 102 cycles per char. 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Good idea. I am down to 327 cycles now according to these rules. 
 
Rudi
Registered: May 2010 Posts: 82 
No, I dont think thats a good idea. But feel free to restrict yourselves to your own rules. I totally dont know what you are talking about anyway... 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quote: No, I dont think thats a good idea. But feel free to restrict yourselves to your own rules. I totally dont know what you are talking about anyway...
OK, I will keep mentioning both cases (all input and output bytes either in ZP or in Mem).
One version uses 301 cycles if all are in ZP, another version takes 327 cycles when all are in Mem. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
OK, in that case mine is 292 cycles if source and dest are in ZP, 308 cycles when all in mem.
Here's a diagram of the how the bits are shuffled at each stage:
Each block shows two input bytes for a swap macro in the top half, two outputs in the lower half. Digits in left border of each box are indexes into the input/output arrays for that stage.
Note that each input pair for the last four blocks contain one byte that's a shuffled version of an output byte from the previous stage, marked with a *
Shuffle's so I can then do an Axis style swap (which needs half the bits to be correctly located within the byte), and it's performed by storing the result of the previous stage into the low byte of an LDY absolute that references the shuffle table.
++ ++ ++ ++
0 a0 b0 c0 d0 e0 f0 g0 h0  1 a1 b1 c1 d1 e1 f1 g1 h1  2 a2 b2 c2 d2 e2 f2 g2 h2  3 a3 b3 c3 d3 e3 f3 g3 h3 
4 a4 b4 c4 d4 e4 f4 g4 h4  5 a5 b5 c5 d5 e5 f5 g5 h5  6 a6 b6 c6 d6 e6 f6 g6 h6  7 a7 b7 c7 d7 e7 f7 g7 h7 
++ ++ ++ ++
0 a0 b0 c0 d0 a4 b4 c4 d4  1 c1 d1 a1 b1 c5 d5 a5 b5  2 c2 d2 a2 b2 c6 d6 a6 b6  3 a3 b3 c3 d3 a7 b7 c7 d7 
4 e0 f0 g0 h0 e4 f4 g4 h4  5 g1 h1 e1 f1 g5 h5 e5 f5  6 g2 h2 e2 f2 g6 h6 e6 f6  7 e3 f3 g3 h3 e7 f7 g7 h7 
++ ++ ++ ++
++ ++ ++ ++
0 a0 b0 c0 d0 a4 b4 c4 d4  1 c1 d1 a1 b1 c5 d5 a5 b5  4 e0 f0 g0 h0 e4 f4 g4 h4  5 g1 h1 e1 f1 g5 h5 e5 f5 
2 c2 d2 a2 b2 c6 d6 a6 b6  3 a3 b3 c3 d3 a7 b7 c7 d7  6 g2 h2 e2 f2 g6 h6 e6 f6  7 e3 f3 g3 h3 e7 f7 g7 h7 
++ ++ ++ ++
0 a0 b0 a2 b2 a4 b4 a6 b6  1 c1 d1 c3 d3 c5 d5 c7 d7  4 e0 f0 e2 f2 e4 f4 e6 f6  5 g1 h1 g3 h3 g5 h5 g7 h7 
2 c2 d2 c0 d0 c6 d6 c4 d4  3 a3 b3 a1 b1 a7 b7 a5 b5  6 g2 h2 g0 h0 g6 h6 g4 h4  7 e3 f3 e1 f1 e7 f7 e5 f5 
++ ++ ++ ++
++ ++ ++ ++
0 a0 b0 a2 b2 a4 b4 a6 b6  1 c1 d1 c3 d3 c5 d5 c7 d7  4 e0 f0 e2 f2 e4 f4 e6 f6  6*g0 h0 g2 h2 g4 h4 g6 h6 
3*a1 b1 a3 b3 a5 b5 a7 b7  2*c0 d0 c2 d2 c4 d4 c6 d6  7*e1 f1 e3 f3 e5 f5 e7 f7  5 g1 h1 g3 h3 g5 h5 g7 h7 
++ ++ ++ ++
0 a0 a1 a2 a3 a4 a5 a6 a7  3 d0 d1 d2 d3 d4 d5 d6 d7  4 e0 e1 e2 e3 e4 e5 e6 e7  6 g0 g1 g2 g3 g4 g5 g6 g7 
1 b0 b1 b2 b3 b4 b5 b6 b7  2 c0 c1 c2 c3 c4 c5 c6 c7  5 f0 f1 f2 f3 f4 f5 f6 f7  7 h0 h1 h2 h3 h4 h5 h6 h7 
++ ++ ++ ++

 
Color Bar
Registered: Oct 2012 Posts: 98 
Quoting ChristopherJam OK, in that case mine is 292 cycles if source and dest are in ZP, 308 cycles when all in mem.
Quote:
Ver neat!
Quote:
++ ++ ++ ++
0 a0 b0 c0 d0 e0 f0 g0 h0  1 a1 b1 c1 d1 e1 f1 g1 h1  2 a2 b2 c2 d2 e2 f2 g2 h2  3 a3 b3 c3 d3 e3 f3 g3 h3 
4 a4 b4 c4 d4 e4 f4 g4 h4  5 a5 b5 c5 d5 e5 f5 g5 h5  6 a6 b6 c6 d6 e6 f6 g6 h6  7 a7 b7 c7 d7 e7 f7 g7 h7 
++ ++ ++ ++
0 a0 b0 c0 d0 a4 b4 c4 d4  1 c1 d1 a1 b1 c5 d5 a5 b5  2 c2 d2 a2 b2 c6 d6 a6 b6  3 a3 b3 c3 d3 a7 b7 c7 d7 
4 e0 f0 g0 h0 e4 f4 g4 h4  5 g1 h1 e1 f1 g5 h5 e5 f5  6 g2 h2 e2 f2 g6 h6 e6 f6  7 e3 f3 g3 h3 e7 f7 g7 h7 
++ ++ ++ ++
I would implement the code for the second and third byte pairs like this
lax s1
lda shuffle1,x
and #$f0
ldy s2
ora merge1,y
sta tmp1
tya
lda shuffle2,y
and #$0f
ora merge2,x
sta tmp2
+
36 cycles (s1 and s2 in mem, tmp1 and tmp2 in zp)
For the first and 4th byte pairs the "LDA shuffle"s can be omitted (28 cycles).
Is there a more efficient way to do it? 
 
Oswald
Registered: Apr 2002 Posts: 3991 
why shuffle tabs dont include the ANDs? 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Yes, you are right, the tables should include the ANDs. So that makes it 32 cycles for the 2nd and 3rd byte pairs. 
 
Axis/Oxyron
Registered: Apr 2007 Posts: 79 
Quoting Color Bar
tya
lda shuffle2,y
WOOT? ;o) 
 
Pex Mahoney Tufvesson
Registered: Sep 2003 Posts: 32 
> WOOT? ;o)
It's a new way of doing nothing; it's just a little too complicated for you Axis, so go back and play with your 300fps milliondots 6510 dot spheres. :P

Have a noise night!
http://mahoney.c64.org 
 
Color Bar
Registered: Oct 2012 Posts: 98 
Quote: Quoting Color Bar
tya
lda shuffle2,y
WOOT? ;o)
Lol. I needed some sleep. Honestly, this stuff is keeping me awake at night.
Pex: Good that you contribute something to this thread too. 
 
ChristopherJam
Registered: Aug 2004 Posts: 503 
Haha, you guys.
Um, my second and third shuffles from the first phase is pretty brute force on the tables front:
ldx s1
ldy s2
lda sb_t2,x
ora sb_t3,y
sta d1
lda sb_t0,x
ora sb_t1,y
sta d2
30 cycles with mem source, zp destination.. 