[CSDb] - User Forums - Fast way to rotate a char?

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Fast way to rotate a char?

2017-01-04 08:32

Rudi
Account closed

Registered: May 2010
Posts: 125

Fast way to rotate a char?

Im not talking about rol or ror, but swap bits so that they are rotated 90 degrees:

Example:

a char (and the bits can be random):

10110010 byte 1..
11010110 byte 2.. etc..
00111001
01010110
11011010
10110101
00110011
10110100

after "rotation" (rows and columns are swapped):

is it possible to use lookup tables for this or would that lookup table be too big?
or other lookuptable for getting and setting bits?

-Rudi

2017-01-04 08:59

Oswald

Registered: Apr 2002
Posts: 5017

my first idea:

lsr row7
ror a
lsr row6
ror a
...
ror row0
ror a
sta result_row7

then again:

lsr row7
ror a
...
lsr row0
ror a
sta result_row6

etc.

2017-01-04 09:06

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Depends on resolution. In multicolor/2x2 you can do it with 4 bit-tables, just like you would do a 2xn chunky-FX. In hires things get complicated. The shift/rol method is a good starting point. Perhaps the Amiga C2P techniques based on merges would be faster. Not sure, yet.

2017-01-04 09:07

ChristopherJam

Registered: Aug 2004
Posts: 1378

Sounds like a coding challenge!

Should be doable with 64 tables and 464 cycles, with neither source nor destination in zero page. But that's a lot of RAM.. Oswald's concept's a lot saner. Maybe fastest routine in under 4k code+tables would be interesting to hunt for?

    ldx byte0
    ldy byte1
    for j in range(8)
        lda t{j}0,x
        ora t{j}1,y
        sta $xxxx


    ldx byte2
    ldy byte3
    for j in range(8)
        lda#$xx
        ora t{j}2,x
        ora t{j}3,y
        sta $xxxx


    ldx byte4
    ldy byte5
    for j in range(8)
        lda#$xx
        ora t{j}4,x
        ora t{j}5,y
        sta $xxxx


    ldx byte6
    ldy byte7
    for j in range(8)
        lda#$xx
        ora t{j}6,x
        ora t{j}7,y
        sta $xxxx

(table ij copies a bit from location i to location j, clearing the rest)

2017-01-04 09:26

Rudi
Account closed

Registered: May 2010
Posts: 125

Interesting ideas. I think I understand Oswalds code.
Quoting ChristopherJam

Sounds like a coding challenge!

Feel free to make it a coding challenge.

2017-01-04 09:51

ChristopherJam

Registered: Aug 2004
Posts: 1378

Haha, googling C2P leads back to csdb (Identify this doughnut).

Looks like Michiel's swapping-two-address-bits-at-a-time technique at http://stackoverflow.com/questions/1667591/rotating-a-bitmap-90.. ?

Going to have to think about efficient ways to swap low nybble of one byte with high nybble of another now..

2017-01-04 09:51

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Challenge accepted. ;o)

2017-01-04 10:13

Shadow
Account closed

Registered: Apr 2002
Posts: 355

Is it just my spatial thinking that is a bit off now, or is your original example wrong? To me it doesn't seem like a correct 90 degree rotation is done, it looks like it's mirrored as well?

2017-01-04 11:12

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Got a first version running using Amiga style merges (328 cycles). Code still looks ugly and unoptimized.

The used tables are 256 bytes each and filled as follows:

lsrtab1: and #$aa, lsr
lsrtab2: and #$cc, lsr, lsr
lsrtab4: lsr, lsr, lsr, lsr
asltab1: and #$55, asl
asltab2: and #$33, asl, asl
asltab4: asl, asl, asl, asl

Code is using Dreamass-style macros.
src & dst are memory locations of the source and dest char.
tmp1 & tmp2 are 8 byte zp-arrays each.

;28 cycles mem->zp
;26 cycles zp->zp
;28 cycles zp->mem
#macro MERGE(src1, src2, mask1, mask2, bittab1, bittab2, dst1, dst2)
{
lax {src1}
ldy {src2}
and #{mask1}
ora {bittab1},y
sta {dst1}
tya
and #{mask2}
ora {bittab2},x
sta {dst2}
}

.MERGE(src+0, src+1, $aa, $55, lsrtab1, asltab1, tmp1+0, tmp1+1)
.MERGE(src+2, src+3, $aa, $55, lsrtab1, asltab1, tmp1+2, tmp1+3)
.MERGE(src+4, src+5, $aa, $55, lsrtab1, asltab1, tmp1+4, tmp1+5)
.MERGE(src+6, src+7, $aa, $55, lsrtab1, asltab1, tmp1+6, tmp1+7)
.MERGE(tmp1+0, tmp1+2, $cc, $33, lsrtab2, asltab2, tmp2+0, tmp2+2)
.MERGE(tmp1+1, tmp1+3, $cc, $33, lsrtab2, asltab2, tmp2+1, tmp2+3)
.MERGE(tmp1+4, tmp1+6, $cc, $33, lsrtab2, asltab2, tmp2+4, tmp2+6)
.MERGE(tmp1+5, tmp1+7, $cc, $33, lsrtab2, asltab2, tmp2+5, tmp2+7)
.MERGE(tmp2+0, tmp2+4, $f0, $0f, lsrtab4, asltab4, dst+0, dst+4)
.MERGE(tmp2+1, tmp2+5, $f0, $0f, lsrtab4, asltab4, dst+1, dst+5)
.MERGE(tmp2+2, tmp2+6, $f0, $0f, lsrtab4, asltab4, dst+2, dst+6)
.MERGE(tmp2+3, tmp2+7, $f0, $0f, lsrtab4, asltab4, dst+3, dst+7)

The code still looks pretty unoptimized. I think reordering merges can save some loads/stores to the tmp-arrays. Also SAX or AXS could come handy here.

If you are interested in the basic concept of this kind of C2p read here: http://www.lysator.liu.se/~mikaelk/doc/c2ptut/

2017-01-04 12:24

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Damn, should have asked Graham. He used nearly exactly my C2P code in the hidden-part of Deus Ex Machina.

2017-01-04 12:27

ChristopherJam

Registered: Aug 2004
Posts: 1378

Sweet. I thought I had an approach that would take ~360 cycles, but it appears I've made a mistake somewhere, so it's both slower than yours *and* broken..

(was going to do a three skews rotation, with ROL tables for the first and third passes, and swap macros for shuffling bits down in the middle pass)

I'll ponder more after dinner.

2017-01-04 12:55

Cruzer

Registered: Dec 2001
Posts: 1048

Wow, that's some sick, slick code, Axis. No way I can understand it right now, but I'll look into it if I ever get the need. Did we determine the resolution btw? I'm guessing singlecolor, since you can't do a lossless rotation of a multicolor char, unless it's 2x2.

2017-01-04 13:01

Axis/Oxyron

Registered: Apr 2007
Posts: 91

I was assuming hires 8x8 chars. That makes the most sense and is the most interesting challenge. And looks like it´s what Rudi is watching for.

2017-01-04 13:38

Rudi
Account closed

Registered: May 2010
Posts: 125

yes, i was initially looking at hires 8x8 char.

Edit: current techniques just seem too slow for many chars. wished someone could do some magic tricks :)

2017-01-04 13:46

JackAsser

Registered: Jun 2002
Posts: 1989

Btw, Axis is talking about this (from Mikael Kalms): http://www.lysator.liu.se/~mikaelk/doc/c2ptut/

2017-01-04 13:50

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Thanks for sharing my link jacky.^^

2017-01-04 14:12

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: Thanks for sharing my link jacky.^^

Jesus, I'm tired...

2017-01-04 14:25

Oswald

Registered: Apr 2002
Posts: 5017

Quote: Btw, Axis is talking about this (from Mikael Kalms): http://www.lysator.liu.se/~mikaelk/doc/c2ptut/

huh, that needs a relaxed state and time to have it sink in :)

2017-01-04 14:27

chatGPZ

Registered: Dec 2001
Posts: 11107

JackAsser is halfway there apparently :=)

2017-01-04 14:35

Axis/Oxyron

Registered: Apr 2007
Posts: 91

When he is through, its time to go Amigaaaah!!!

2017-01-04 14:45

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: When he is through, its time to go Amigaaaah!!!

Haha, I got semi tired of it after setting up that HAM8-mode. :) But it was a fun excercies and fun to figure out the bitpatterns myself, then you told me, and I was actually correct (for once).

2017-01-04 14:50

Perplex

Registered: Feb 2009
Posts: 254

Quoting Rudi

current techniques just seem too slow for many chars. wished someone could do some magic tricks :)

Magic sometimes happens when you ditch working on a generic solution and instead concentrate on one that fits your particular data set or use case.

Not implying that a generic solution is less interesting, mind.

2017-01-04 15:03

Rudi
Account closed

Registered: May 2010
Posts: 125

Some notes to self. altho, i dont know if it will do any miracles. maybe one can reduce it to Axiss solution, but I have no idea:

Basically what happens from a layman perspective (like me), bits are copied like this:
where each number are the bit-indices. consider the second number (after the arrow) to be the destination (of zero memory), and source (infront of arrow):

copy from (bit-index) -> to:

0->7
1->15
2->23
3->31
4->39
.
.(and so on..)
.
63->56

looking at those destination values in binary produce a pattern like this:

07: 00000|111|
15: 00001|111|
23: 00010|111|
31: 00011|111|
39: 00100|111|
47: 00101|111|
55: 00110|111|
63: 00111|111|

and

06: 00000|110|
14: 00001|110|
22: 00010|110|
30: 00011|110|
38: 00100|110|
46: 00101|110|
54: 00110|110|
62: 00111|110|

and so on...

notice last column (trits) are same, but decrease, and msbs are increasing. wether or not one can derive something out of this i dont know. of course these are just index-values and dont have any sensible meaning for copying from and to memory. cycles would be massive using this technique, but my thought was to get something sensible out of this (which i did not yet)...

2017-01-04 15:40

Slammer

Registered: Feb 2004
Posts: 416

I think its a shame that your cyclecounts for this problem is so low allready. I would like to see the spritecollision solution. Something like storing the 8 bytes on row 7 of succeding chars. Stretch the line and use the LFT trick to calculate the solution. But that would take 8 lines.

2017-01-04 15:58

tlr

Registered: Sep 2003
Posts: 1714

Quote: I think its a shame that your cyclecounts for this problem is so low allready. I would like to see the spritecollision solution. Something like storing the 8 bytes on row 7 of succeding chars. Stretch the line and use the LFT trick to calculate the solution. But that would take 8 lines.

Interesting idea! It would not necessarily take 8 lines. IIRC collisions can be retriggered during the line. You will be limited by the amount of pixels a sprite covers though.

2017-01-04 16:02

Rudi
Account closed

Registered: May 2010
Posts: 125

I dont see how sprites will solve the C2P-problem (with as few cycles as possible)? but then again, Im not too experienced with sprites on the C64. A goal would be to calculate as many chars (per frame) as possible for C2P.

2017-01-04 16:04

chatGPZ

Registered: Dec 2001
Posts: 11107

tlr: wasnt there some VIC that only clears (or retriggers?) the collisions once?

2017-01-04 19:07

tlr

Registered: Sep 2003
Posts: 1714

Quote: tlr: wasnt there some VIC that only clears (or retriggers?) the collisions once?

That was lightpen triggering. It's working very differently on 6569R1.

2017-01-04 19:22

Skate

Registered: Jul 2003
Posts: 490

I just made a rough calculation. Using 16k of precalc data (which is already terrible), each char takes 500-600 cycles to rotate. So, c2p routine looks like the best solution in every aspect. I'm still thinking an alternative approach though.

2017-01-04 19:54

Flavioweb

Registered: Nov 2011
Posts: 447

Quoting Skate

I just made a rough calculation. Using 16k of precalc data (which is already terrible), each char takes 500-600 cycles to rotate.

Just to say (...maybe i'm off-topic...) this code took 544 cicles rotate a char counterclockwise.

    *=$C000
    .FOR S=0, S<8, S=S+1
    LDA SOURCE+S
    .FOR D=0, D<8, D=D+1
    ASL
    ROL DEST+D
    .NEXT
    .NEXT
    RTS
;-------------------
SOURCE
    .BYTE %11111111
    .BYTE %10000000
    .BYTE %01000000
    .BYTE %00100000
    .BYTE %00010000
    .BYTE %00001000
    .BYTE %00000100
    .BYTE %11111111
DEST
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
    .BYTE %00000000
;---------------------------------------

2017-01-04 19:57

Rudi
Account closed

Registered: May 2010
Posts: 125

This takes ~472 cycles, and is basically similar to Oswalds approach:

	;counter-clockwise rotation by 90 degrees.
	
	;column 1
	ror $74
	rol
	ror $75
	rol
	ror $76
	rol
	ror $77
	rol
	ror $78
	rol
	ror $79
	rol
	ror $7a
	rol
	ror $7b
	rol
	sta $7c
	;tot = 59 cycles

	;column 2
	ror $74
	rol
	ror $75
etc..
	ror $7b
	rol
	sta $83
;tot = 59*8 = 472 cycles

did a test in zp. so rotated (destination) bytes are in $7c to $83. this req eight bytes to be copied to zeropage ($74 to $7b) (as an example).

2017-01-04 20:35

Oswald

Registered: Apr 2002
Posts: 5017

axis' one is close to max possible I guess, go for it if understandable version is not enough :)

2017-01-04 20:42

Skate

Registered: Jul 2003
Posts: 490

Table lookup doesn't necessarily be faster than bit shifting. What i was saying is even using large tables to get rid of bit shifting doesn't save cycles unless you start thinking out of the box.

2017-01-04 20:56

Skate

Registered: Jul 2003
Posts: 490

here is an algo which looks similar to c2p method without 8 bit restrictions.

void transpose8(unsigned char A[8], int m, int n, 
                unsigned char B[8]) {
   unsigned x, y, t; 

   // Load the array and pack it into x and y. 

   x = (A[0]<<24)   | (A[m]<<16)   | (A[2*m]<<8) | A[3*m]; 
   y = (A[4*m]<<24) | (A[5*m]<<16) | (A[6*m]<<8) | A[7*m]; 

   t = (x ^ (x >> 7)) & 0x00AA00AA;  x = x ^ t ^ (t << 7); 
   t = (y ^ (y >> 7)) & 0x00AA00AA;  y = y ^ t ^ (t << 7); 

   t = (x ^ (x >>14)) & 0x0000CCCC;  x = x ^ t ^ (t <<14); 
   t = (y ^ (y >>14)) & 0x0000CCCC;  y = y ^ t ^ (t <<14); 

   t = (x & 0xF0F0F0F0) | ((y >> 4) & 0x0F0F0F0F); 
   y = ((x << 4) & 0xF0F0F0F0) | (y & 0x0F0F0F0F); 
   x = t; 

   B[0]=x>>24;    B[n]=x>>16;    B[2*n]=x>>8;  B[3*n]=x; 
   B[4*n]=y>>24;  B[5*n]=y>>16;  B[6*n]=y>>8;  B[7*n]=y; 
}

2017-01-05 07:37

HCL

Registered: Feb 2003
Posts: 716

Ah, that's just beautiful Axis!!

..as usual, i'm the last one to read the thread.. but if anyone still wonders what that code does, it is mirroring the 8x8-matrix by starting with 2x2-bit-blocks, then 4x4-blocks, then finally the whole 8x8-block. Mirroring and rotation is of course just a matter of configuring the macros :).

..and of course Graham did this 15+ years ago, hope that makes you all feel as retard as i do ;).

2017-01-05 08:50

Pex Mahoney Tufvesson

Registered: Sep 2003
Posts: 50

I used Rudi's 472-cycle approach "This takes ~472 cycles, and is basically similar to Oswalds approach" in Storebror
There was no need to make it any faster, since it was a one-time calculation just before the android starts rotating. My routine included not only rotation, but mirroring across horizontal, vertical and diagonals as well.

And HCL... I'm even later than you are. So don't worry, you're ok! ;)
---
Have a noise night!
http://mahoney.c64.org

2017-01-05 10:19

CreaMD

Registered: Dec 2001
Posts: 3034

I wonder if this would significantly speed up my proportional routine and help me "posthumously" win the internal compo I had with Wotnau ;-)

So which one is the fastest? I need to rotate full line of bitmap 320 bytes char by char, +90 degrees each

2017-01-05 11:01

Dano

Registered: Jul 2004
Posts: 226

Haha, did my a proportial Routine for NewsPress back then the same way as CreaMD did. Switched to a standard one some time later as this was even faster. Guess i did the rotation the Oswald-way too back then.

2017-01-05 11:39

Oswald

Registered: Apr 2002
Posts: 5017

don think you can beat a bitshifter with this c2p, couple hundred cycles even with axis method, while bitshifter worst case is about 170 (lsr a ror zp * 8 lines * 3 times)

2017-01-05 12:52

Rudi
Account closed

Registered: May 2010
Posts: 125

Im trying to figure my head around this:
Since the rotation is just a bunch of bitswaps. For example swapping bit 0 and 7, bit 1 and 15, 2 and 23 etc.

00 01 02 03 04 05 06 07
08 09 10 11 12 13 14 15
etc..
56 57 58 59 60 61 62 63.

How would a swapbit-table look for this? The number of swaps are only 32. only 32 lda's and sta's? lets say ~8 cycles in total per swap: 8*32 = 256 cycles for all the swaps. but still I dont know if its possible to have a table for this..
Edit: hm, woops. i think perhaps 32 is wrong, since both locations need to be written to, so maybe two sta's instead of just one.

2017-01-05 13:11

Cruzer

Registered: Dec 2001
Posts: 1048

Oswald: Why 3 times and not 8?

2017-01-05 13:38

Oswald

Registered: Apr 2002
Posts: 5017

Quote: Oswald: Why 3 times and not 8?

coz you rotate in 2 buffers a,b:

a b
a b
a b

but if shifting is bigger than 3 you can just shift bits to opposite dir and then swap a,b (on screen). 3 is just a wild guess maybe 4 is the max, it was a long time ago.

2017-01-05 14:30

Rudi
Account closed

Registered: May 2010
Posts: 125

There is some interesting cyclic dynamic going on when looking at a bit-swapping method for 8x8 bits:

these figures shows that the outer layer is rotated 8 times first, then the next layer is rotated 6 times and so forth:

xxxxxxxx    ........    ........    ........
x......x    .xxxxxx.    ........    ........
x......x    .x....x.    ..xxxx..    ........
x......x    .x....x.    ..x..x..    ...xx...
x......x    .x....x.    ..x..x..    ...xx...
x......x    .x....x.    ..xxxx..    ........
x......x    .xxxxxx.    ........    ........
xxxxxxxx    ........    ........    ........

in this example one start with c0r0 and swap that with c7r0, then c1r0->c7r1, c2r0->c7r2 etc. so it goes around like a circle would. when that process is finished, the next inner layer does the same thing. until all four layers are done. and the result is a 90 deg rotation. its just one other way of looking at it. but still, a lookup-table for this? its difficult for me to see how. (im just giving out a few ideas for trying out other approach maybe).

2017-01-05 16:23

ChristopherJam

Registered: Aug 2004
Posts: 1378

Problem with bitshifter is it only deals with one bit at a time.

c2p exchanges four bitpairs in only 20-30 cycles. The expensive part is moving bits to matching positions within the two bytes they're being exchanged between...

2017-01-05 19:58

Digger

Registered: Mar 2005
Posts: 421

Nice read. But what fx do you have in mind Rudi?

2017-01-05 21:54

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: Nice read. But what fx do you have in mind Rudi?

My initial thinking was to make a sine-scroller, where the sinus-transform is done in chunky and then transformed using c2p. Of course the c2p-routine has to be fast enough for this. It would also bring out some new ideas for other vertical effects (that are first done with horisontal tricks, and then transformed using c2p). Unfortunately there seem to be no _really_ fast c2p routine.

Im still figuring out how that 4x4, 2x2 plus 1x1 rotator works. or that is; looking at every detail of the bit swaps.

Edit: hires-sinescroller.

2017-01-05 23:02

Frantic

Registered: Mar 2003
Posts: 1627

It is not really what you specified in your first post, but just a thought that may be worth considering: Would it be okay to represent the original char in some different way? If so, there might (possibly) be ways to represent that char that would make it easier to "rotate" it to +90˚ angle although at the cost that you may then also need to do some operations on the data to be able to "read out" the normal 0˚ angle.

Just thinking aloud here. Trying to rethink the problem rather than the solution, so to speak. Not sure if it would be allowed.

2017-01-05 23:16

Rudi
Account closed

Registered: May 2010
Posts: 125

Frantic? Sorry mate, but my first post is what i need. the figures dont lie :)

2017-01-05 23:47

Copyfault

Registered: Dec 2001
Posts: 466

Interesting Thread :)

Sticking to the figures of the initial post the goal is to have a char mirror* routine with a cycle count as low as possbile for 8x8-Chars HiRes with arbitrary bit patterns [*mirror in the sense of mirroring along the diagonal line from bitpos (0,0) to (7,7)].

Maybe it's possible to weaken the requirement of _arbitrary_ bitpatterns... I'm thinking about using a charset with chars $00-$7f for the "given" bitpatterns and $80-$ff for the corresponding "mirrored" patterns. This way, mirroring would only be a question of setting the msb of the corresponding bytes in the screenram.

Guess in order to have a chance to make this approach work you'd need several charsets spread along the screen - and this still requires that appropriate slices of the screen will have at most 128 different bitpatterns.

Another downside: expanding this approach to 90° rotations will lower the number of different bitpatterns even further to 64.

But maybe this thought is of any help; it just came up while reading so I wanted to share it.

2017-01-06 09:32

ChristopherJam

Registered: Aug 2004
Posts: 1378

For things like sine scrollers, there may be value in doing a partial rotation; only takes 240 cycles per char to reshape eight bytes into

(digit is index of source byte)
using something like

  ldx s+0
  ldy s+1     ; 8   cycles priming the xy line cache. Need to do this on row 0 and row 4

  lda t00,x
  ora t01,y   ; 8
  ldx s+4
  ldy s+6     ; 8
  ora t02,x
  ora t03,y   ; 8
  sta dst+0   ; 4      28 cycles for this block; one of these per dest byte

  lda t12,x
  ora t13,y
  ldx s+0
  ldy s+2
  ora t10,x
  ora t11,y
  sta dst+1

etc

2017-01-07 11:20

Krill

Registered: Apr 2002
Posts: 2839

Rudi: So you want to rotate chars by arbitrary angles, not just swap X and Y (90°)? What's the problem with Frantic's suggestion of pre-calculating another representation of the original characters to optimise run-time rendering of the rotated characters, thus trading memory for speed?

2017-01-07 15:51

Rudi
Account closed

Registered: May 2010
Posts: 125

Krill: No. I ONLY want to swap x and y 90 degrees. No other degrees. Im working on Axis's method with masks and so on.. (bcoz it seems to be the one technique that use fewest cycles).

2017-01-07 19:58

Skate

Registered: Jul 2003
Posts: 490

I think Rudi wants to rotate dynamically created chars, not a static charset. Rudi, did i get you right?

2017-01-07 21:06

Frantic

Registered: Mar 2003
Posts: 1627

I think so, yes, as he wrote that "the bits can be random". (But that doesn't exclude the possibility of playing with other types of representations as such.)

2017-01-07 21:10

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: I think Rudi wants to rotate dynamically created chars, not a static charset. Rudi, did i get you right?

Yes Skate! That is right.

2017-01-08 10:53

Rastah Bar

Registered: Oct 2012
Posts: 336

I may have found a method that takes 432 cycles. An 8x8 rotation can be performed by moving around 4x4 blocks, then 2x2 subblocks and then the bits in each subblock. Now you can identify for each resulting destination bit pair (that is, bist 7+6, or 5+4, ..., or 1+0) the nybbles of the source bytes it came from. So an approach could be to take two source bytes, merge the desired nybbles into one byte, and then use look up tables to transform it to the desired bit pairs (at the right location, i.e., a bit pair 7+6 from the merged source nybbles could be transformed to a destination bit pair 3+2, for example).

ldx first_source_byte
lda second_source_byte
and #mask   ;mask=f0 if you need the high nybble and 0f for the low nybble
ora byte2DesiredNybble,x  ;look up table to merge the right nybble in the right way
tax
lda first_destination
ora bits76toRightPosition,x  ; Table filters out the desired bits and puts them in the right order at the right position.
sta first_destination
lda second_destination
ora bits54toRightPosition,x
sta second_destination
...
lda fourth_destination
ora bits10toRightPosition,x
sta fourth_destination
-----------------------+
54 cycles (source and destination bytes in ZP).

This fills in 4 times 2 bits, so you have to do that 8 times to fill the entire destination matrix. So that is 432 cycles in total. But you need a bunch of look-up tables, so it is not very memory efficient.

I hope I did not overlook something ...

2017-01-08 11:38

Krill

Registered: Apr 2002
Posts: 2839

While 8x8 tile transpose is an entertaining academic exercise, i'm dubious of its practical applications for C-64 purposes.

In fact, i'd be hard-pressed to find an effect that is rendered faster with generating "horizontal" 8x8 tiles first and then rotating them, as opposed to rendering "vertical" 8x8 tiles right away.

Also, an actual chunky to char approach seems more feasible. Render to linear chunky bitmap (1 byte per pixel), then collect the bits and render final 8x8 tiles.

2017-01-08 12:30

Rudi
Account closed

Registered: May 2010
Posts: 125

Krill: Bear in mind im not too an experienced coder on C64 that I see solutions right away. I figured out this might be an nice intellectual exercise for myself, hopefully for others too(?) but wether or not it has any practical usage, that I have no idea..

Btw, I have a working 336 cycles. It consists of twelve "copies" of this code (pseudocode):

lda $<zp1>
tay
and #<mask1>
ldx $<zp2>
eor $<tab1>, x
sta $<zp3>
txa
and #<mask2>
eor $<tab2>, y
sta $<zp4>

2017-01-08 13:22

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Rudi, except of the fact that you used EOR instead of ORA and didnt use LAX for the first read on zp (would be possible if you swap x and y registers), this looks identical to the code I posted pretty early in this thread. Is there a special reason to use EOR?

2017-01-08 13:37

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: Rudi, except of the fact that you used EOR instead of ORA and didnt use LAX for the first read on zp (would be possible if you swap x and y registers), this looks identical to the code I posted pretty early in this thread. Is there a special reason to use EOR?

The eor was a consequence of the formulas i used for masking and swapping (I derived this from Kalms tutor):

Example for the 4x4 swapping:

tmp0 = byte0 & 0xf0; //xxxx----
tmp1 = byte1 & 0xf0; //xxxx----
tmp2 = byte2 & 0xf0; //xxxx----
tmp3 = byte3 & 0xf0; //xxxx----
tmp4 = byte4 & 0x0f; //----xxxx
tmp5 = byte5 & 0x0f; //----xxxx
tmp6 = byte6 & 0x0f; //----xxxx
tmp7 = byte7 & 0x0f; //----xxxx
data0 = byte0 << 4;
data1 = byte1 << 4;
data2 = byte2 << 4;
data3 = byte3 << 4;
data4 = byte4 >> 4;
data5 = byte5 >> 4;
data6 = byte6 >> 4;
data7 = byte7 >> 4;
data0 ^= tmp4;
data1 ^= tmp5;
data2 ^= tmp6;
data3 ^= tmp7;
data4 ^= tmp0;
data5 ^= tmp1;
data6 ^= tmp2;
data7 ^= tmp3;

Sorry for the long code..

EOR is used for the last xor-swapping. Since I cannot use lookup-table for two different values (fex. data0 and tmp4). The last eight operations in the above are done with the EOR-instruction.

Some of the lookup-tables i derived are doing EOR, AND and SHIFTS at the same time:

shl2_eor_cc[i] = (i ^ (i & 0xcc)) << 2;
shr2_eor_33[i] = (i ^ (i & 0x33)) >> 2;
shl1_eor_aa[i] = (i ^ (i & 0xaa)) << 1;
shr1_eor_55[i] = (i ^ (i & 0x55)) >> 1;

I scratched my head around how your ORA worked. And since I didnt understand that Dreamass-macrocode I wrote mine from scratch. But maybe I should look at your LAX-method next.

Edit: Now I see that it doesnt really matter if one use ora or eor for this technique.

2017-01-08 13:46

Axis/Oxyron

Registered: Apr 2007
Posts: 91

But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)

2017-01-08 14:45

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)

No, didnt think about that hehe.

Btw, 312 cycles now (with LAX).

2017-01-08 14:54

Bitbreaker

Registered: Oct 2002
Posts: 500

Quoting Axis/Oxyron

But you know that:
(i ^ (i & 0xcc))

is the same as:
i & 0x33

;o)

smells like the version of the tab that shifts only 1 bit to the right could be substituted by some asr magic?
Also the and maskX looks like it could be included into something, too static to be done that often :-)

2017-01-08 15:22

Rudi
Account closed

Registered: May 2010
Posts: 125

XAA might be something too.

2017-01-08 18:29

Rudi
Account closed

Registered: May 2010
Posts: 125

Here's a different approach to it:

ldx $82			;3	
xaa #$33		;2	a=(x & 0x33)
ldy $80			;3
eor shl2_eor_cc, y	;4*
sta $90			;3
lda shr2_eor_33, x	;4*
eor tab_cc, y		;4*
sta $92			;3

uses the same amount of cycles though.

Bitbreaker: yes, one could probably optimize the 1x1 rotator with other illegal-opcodes. sine some of them do one shift.

2017-01-09 07:35

Bitbreaker

Registered: Oct 2002
Posts: 500

Besides that it will produce rubbish as xaa can add some unpredictable value to A before doing the txa and and part :-)

2017-01-09 10:10

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Color Bar

I may have found a method that takes 432 cycles....

If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.

2017-01-09 11:52

Rudi
Account closed

Registered: May 2010
Posts: 125

Quote: Quoting Color Bar
I may have found a method that takes 432 cycles....

If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.

Are you using the masking method?

2017-01-09 12:25

Axis/Oxyron

Registered: Apr 2007
Posts: 91

I just want to share some thoughts on my merges that didnt work out. Perhaps I´m just missing the last twist.

First idea was to make relative merges. I discussed that back in the 90´s with some Amiga coders and on 68030-68060 it saves some cycles.
Idea is, that shifting of the input must not always have the exact values, as long as the delta of the shift of the 2 inputs stays correct. Disadvantage of that is, that the last merge needs to make some rol/ror to compensate.

This resulted in something like this:

lda {src1}
ldy {src2}
and #$aa
ldx {bittab1},y
sax {dst1}
eor {src1} ;invert and #$aa to and #$55
ldx {bittab2},y
sax {dst2}

Unluckily it only saves 1 cycle per merge which is completely eaten up by the last merge that looses 2 cycles for the correction.

Another idea was to interleave the temp-arrays with hi-byte pointers so that they can be used both as pointers for indirect y-indexing and as direct values. Code would look like this:

lax {src1}
and #{mask1}
ora ({src2}),y
sta {dst1}
lda ({src2}),y
ora {bittab2},x
sta {dst2}

Would also save 1 cycle per merge. But the unsolved problem is, that the 2 usages of {src2} should be pointing to 2 different tables. *grrr*

What definitely works is reordering the merges, so that the last 2 merges of a resolution dont need to store the tmp-values into the zp and the first two of the next resolution doesnt need to read the tmp-values.

so the last
sta {dst2}
and the first
lax {src1}
would merge into
tax.

Saves 4 times 3=12 cycles.

2017-01-09 14:27

Rudi
Account closed

Registered: May 2010
Posts: 125

Interesting ideas. Reordering merges was something I tried out but failed. Maybe I should look at it again.

2017-01-09 15:00

Rudi
Account closed

Registered: May 2010
Posts: 125

Ok, so I did merge 4x4 and 2x2, and it worked. Am now at 299 cycles. There is probably more that can be optimized, because now the code looks like a mess.

2017-01-09 15:11

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Are you using the masking method?

These diagrams illustrate what I am thinking of

xxxxxxxx source0
xxxxxxxx source1
xxxxxxxx
xxxxxxxx
aabbccdd
aabbccdd
aabbccdd
aabbccdd source7

xxxxaaaa destination0
xxxxaaaa destination1
xxxxbbbb
xxxxbbbb
xxxxcccc
xxxxcccc
xxxxdddd
xxxxdddd destination7

The source bits in the block indicated by the a's should end up in the proper locations in destination bytes. Using look up tables the 8 a's are put into one byte such that the high nybble is the desired nybble of destination0 and then the low nybble that of destination1. The same procedure can be followed for all 4x2 source blocks 'b','c','d', etcetera.

Code for doing that can look like

ldy source4
lda mergingTable1,y  ;This table takes two bits and puts them in a byte, such that they are in the right positions to use as nybble for the destination.
ldx source5
ora mergingTable2,x
ldx source6
ora mergingTable3,x
ldx source7
ora mergingTable4,x  ;Now the byte is filled with all bits from block 'a'.

tax
and #$0f
sta destination0
lda moveHighNybbleToLowNybble,x
sta destination1
-------------------+
42 cycles.

This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.

The lower half of the source matrix is therefore moved in 42+3*41 = 165 cycles.

The other half of the matrix goes very similar, but since the destination bytes are no longer empty, we have to add 'ORA destination0' and 'ORA destination1' before the 'STA destination0' and 'STA destination1', respectively. So that will take 4*6 cycles extra, totalling 2*165+24=354 cycles. But the method is very memory inefficient.

2017-01-09 15:18

White Flame

Registered: Sep 2002
Posts: 136

Here's another idea which I think fails, but might be salvageable.

So the basic LSR, ROR accumulation grabs a bit and stores a bit, taking 2 instructions per bit. However, if done in-place just like many basic shift-add multiplication routines, ROR both sets a final bit as well as reads the next bit to place. This means that in dream land, the flip can be done in about 64 RORs. If fully in zp, that would be 320 cycles, but only 128 bytes of code with no tables.

However, the tactic I took doesn't seem to have a nice clean loop of RORs linking source bit locations to final bit locations. Various CMP #80s and other byte-masking & merging seems to be required, which would likely bloat it back up to 400+ cycles. But maybe by shuffling it around differently, an arrangement could be made that's both fast and short.

2017-01-09 15:26

Rudi
Account closed

Registered: May 2010
Posts: 125

ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill!

2017-01-09 15:44

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill!

I think my method needs 18 tables of 1 page = 4.5k. Using TXS/TSX will shave off 2 cycles in total.

2017-01-10 11:11

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Color Bar

This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.

I meant that the 'LDY source4' can be omitted. The TAY doesn't make sense and should be omitted, so the code takes 342 cycli.

I can gain another 6 cycli by saving X. If the code

tax
lda moveHighNybbleToLowNybble,x

is replaced by

sta selfmod:+1
selfmod:
lda moveHighNybbleToLowNybble

then source7 is kept in X and the code for block 'b' can start with fetching the bits of source7 from a table and omit the 'LDX source7'. One cycle is gained that way. In that block source6 must then be saved in X as shown above, and in the following block source5. The code for the last block can stay as it was in the original post. That saves 3 cycles for half the matrix and 6 in total, and the algorithm takes 336 cycles.

2017-01-10 12:04

Oswald

Registered: Apr 2002
Posts: 5017

still awful lot of cycles. it seems, whatever the goal is, it is faster to construct the bits into final orientation right away.

2017-01-10 13:38

enthusi

Registered: May 2004
Posts: 675

I could still image some top-view char based game where rotating in 'realtime' might actually make sense.
But you would certainly need some very special case when this is favorable over replacing chars from a lookup.

2017-01-10 18:17

Rudi
Account closed

Registered: May 2010
Posts: 125

Still got 299 cycles after merging the 4x4 and 2x2 (195 cycles) plus 1x1 (104 cycles). Can't get it any lower than that at the moment. Im trying to look at a different method.

Anyway; Im wondering if the checkerboard pattern (AND-masking) approach with the shifts can be minimized.. Or changed to a more effective method. If it turns out that the checkerboard pattern is the most optimal then there's no other way than trying to optimize the actual assembly code.

2017-01-10 18:59

Rastah Bar

Registered: Oct 2012
Posts: 336

Pretty good. Below the magic 300 cycles barrier. :-)

2017-01-10 21:12

Copyfault

Registered: Dec 2001
Posts: 466

@Rudi: what do you mean with "merge 4x4 and 2x2"? I understand the 312-cycles version, though this is only valid if you stick to zp load/store. If you access absolute adresses (which will be necessary to read/write to charset data or bitmap resp.), this very same version ends up with 328 cycles -> identical to Axis' first routine.

By reordering the merges just like Axis pointed out this can be lowered to 328 - 12 = 316 (300 if you stick to zp load/store).

So how do you come down to 299 cycles???

2017-01-10 21:35

Rudi
Account closed

Registered: May 2010
Posts: 125

What I really meant was reordering merges, which I think Axis pointed on in one of his posts. Here I only managed to reorder the 2x2 rotation into 4x4 by removing some sta's and reusing the accumulator iirc. But unfortunately the code is so messy that I myself easily get tired after trying to optimize this. So, here comes a long list of code if you dont mind. I used tables for the shifts, and dot mind the eor-tag its the same as or. the next tag is the and value. Also used specific addresses in zp since this is just a test.

;-------------------------------
;4x4 & 2x2 rotation (195 cycles)
;-------------------------------
ldy $61
lax $65
and #$0f
eor shl4, y
sta $81
tya
and #$f0
eor shr4, x
sta $85

ldy $63
lda $67
and #$0f
eor shl4, y
ldy $81
tax
and #$33
eor shl2_eor_cc, y
sta $91
tya
and #$cc
eor shr2_eor_33, x
sta $93
lda $63
ldx $67
and #$f0
eor shr4, x
ldy $85
tax
and #$33
eor shl2_eor_cc, y
sta $95
tya
and #$cc
eor shr2_eor_33, x
sta $97

ldy $60
lax $64
and #$0f
eor shl4, y
sta $80
tya
and #$f0
eor shr4, x
sta $84


ldy $62
lax $66
and #$0f
eor shl4, y
ldy $80
and #$33
eor shl2_eor_cc, y
sta $90
tya
and #$cc
eor shr2_eor_33, x
sta $92
lda #$62
and #$f0
eor shr4, x
ldy $84
tax
and #$33
eor shl2_eor_cc, y
sta $94
tya
and #$cc
eor shr2_eor_33, x
sta $96

;-------------------------
;1x1 rotation (104 cycles)
;-------------------------
ldy $90
lax $91
and #$55
eor shl1_eor_aa, y
sta $70
tya
and #$aa
eor shr1_eor_55, x
sta $71

ldy $92
lax $93
and #$55
eor shl1_eor_aa, y
sta $72
tya
and #$aa 
eor shr1_eor_55, x
sta $73

ldy $94
lax $95
and #$55
eor shl1_eor_aa, y
sta $74
tya
and #$aa
eor shr1_eor_55, x
sta $75

ldy $96
lax $97
and #$55
eor shl1_eor_aa, y
sta $76
tya
and #$aa
eor shr1_eor_55, x
sta $77

I wanted to make the 4x4vs2x2 rotator shorter by using other method, but havent found a solution for this yet (the 1x1 rotator also, for that matter, by optimizing those two distinct sections).

The character is at $60-$67 and the result is at $70-$77.

2017-01-10 21:43

Copyfault

Registered: Dec 2001
Posts: 466

I have to admit that the approach with those "Amiga-style"-merges is really outstanding. Hardly possible to force oneself to think different approaches...

Ok, at least I tried ;) One idea I had is to use the CMP-instruction to read out the bits of every byte. E.g.

ldx #$7f
cpx byte0
rol
cpx byte1
rol
cpx byte2
rol
cpx byte3
rol
cpx byte4
rol
cpx byte5
rol
cpx byte6
rol
cpx byte7
rol
eor #$ff
sta dest0

but this already sums up to 47cycle (if sticking to zp lda/sta/cpx). Furthermore, this will only work for dest0 unless one masks out bit7-bit1 step by step -> every byte needed 8 times, too many table lookups -> most probably not feasible :/

Another idea would be to store the char bit patterns along the diagonal. Usually we have

byte0: b7 b6 b5 b4 b3 b2 b1 b0
byte1: b7 b6 b5 b4 b3 b2 b1 b0
byte2: b7 b6 b5 b4 b3 b2 b1 b0
byte3: b7 b6 b5 b4 b3 b2 b1 b0
byte4: b7 b6 b5 b4 b3 b2 b1 b0
byte5: b7 b6 b5 b4 b3 b2 b1 b0
byte6: b7 b6 b5 b4 b3 b2 b1 b0
byte7: b7 b6 b5 b4 b3 b2 b1 b0

Now we could store the same information as e.g.

data0: byte0.b0 X X X X X X X
data1: byte0.b1 byte1.b0 X X X X X X
...
data7: byte0.b7 byte1.b6 byte2.b5 byte3.b4 byte4.b3 byte5.b2 byte6.b1 byte7.b0
...
dataE: byte7.b7 X X X X X X X

The huge advantage of such a data structure is obvious: mirroring the bitpatterns is equivalent to reversing the order of the dataX-bytes. Drawback: in order to display the corresponding bitpattern a conversion routine is _always_ needed (no matter if the "original" bitpattern should be displayed or the "mirrored" pattern). I couldn't come up with a decent routine to compensate for this.

Maybe someone can take the good points out of these thoughts...

2017-01-11 10:39

Rastah Bar

Registered: Oct 2012
Posts: 336

I'm now at 326 cycli.

2017-01-11 13:52

Rudi
Account closed

Registered: May 2010
Posts: 125

Looking at the 4x4 rotation an recipe that I have made look like this:

1. rol 4 times higher 4 bytes.
2. swap lower nybbles of byte0->byte4, byte1->byte5 etc.
3. rol 4 times lower 4 bytes.

Done.

1 and 3 can be done with lookuptables. 2 is a more tricky.
So, what no.2 need is a fast way to swap lo-nybbles of two bytes, but it seems to be difficult. One would have to do that in 14 cycles or so. Impossible.

Someone gave me this xor-swap algorithm which takes 27 cycles:

LDA byte1
AND #$0f
EOR byte2
STA byte2
AND #$0f
EOR byte1
STA byte1
AND #$0f
EOR byte2
STA byte2

I also made this, but it takes one cycle more than the former:

LDX byte1
LDY byte2
LDA lowCleared,x
ORA andTab,y
STA byte1
LDA lowCleared,y
ORA andTab,x
STA byte2

I guess this wont help at all. Because 27*4 = 108 cycles. Allready reached the limit from the 312 version where each rotator-section take 104 cycles.

2017-01-11 14:24

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Rudi

Someone gave me this xor-swap algorithm which takes 27 cycles:
LDA byte1 AND #$0f EOR byte2 STA byte2 AND #$0f EOR byte1 STA byte1 AND #$0f EOR byte2 STA byte2

26 cycles:

lax byte1
and #$f0
ldy byte2
ora grabLowNybble,y  ;This table performs AND #$0f
sta byte1  ;Now low nybble of byte2 is in byte1
tya
and #$f0
ora grabLowNybble,x  ;byte1 was kept in X
sta byte2

2017-01-11 14:43

Oswald

Registered: Apr 2002
Posts: 5017

you just need a 64k table

lda byte1byte2
sta result

:)

2017-01-11 14:52

Rudi
Account closed

Registered: May 2010
Posts: 125

Colorbar: nice
Oswald: hah yeah, like thats gonna happen. :P

2017-01-11 16:52

Rastah Bar

Registered: Oct 2012
Posts: 336

Flip disk ...
Rotate monitor clockwise ...

2017-01-11 18:45

Oswald

Registered: Apr 2002
Posts: 5017

del.

2017-01-11 18:55

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Flip disk ...
Rotate monitor clockwise ...

Girls They Want to Have Fun

2017-01-12 09:31

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Rudi

2. swap lower nybbles of byte0->byte4, byte1->byte5 etc.

That helped. If all bits from block 'a' in my post above are put entirely in destination0, and the bits of the 4x2 block above that in destination1, nybbles have to be swapped at the end, but the code

sta selfmod:+1
and #$0f
sta destination0
selfmod:
lda moveHighNybbleToLowNybble,x
sta destination1

simplifies to just 'STA destination' and with this I can reduce the cycle count to 312, I think.

2017-01-13 09:05

ChristopherJam

Registered: Aug 2004
Posts: 1378

I'm not managing to get below 308 cycles (source and destination non zero page).

However, my second stage is only 23 cycles, less for when stage chaining is done. Perhaps this can help optimise one of the routines above?

#define swap1(s1,s2,m,d1,d2) \
    .(        :\
    lda s1    :\
    eor s2    :\
    sta ztb   :\
    and#m     :\
    eor s2    :\
    sta d1    :\
    eor ztb   :\
    sta d2    :\
    .)

(also - I could get down to 300 by moving the code to zero page, but then you'd have to jmp there and back, and couldn't unroll for multiple source/destinations - one char takes most of the page)

2017-01-13 10:32

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting ChristopherJam

I'm not managing to get below 308 cycles (source and destination non zero page).

However, my second stage is only 23 cycles.

Yes, thanks! That brings my lookup table approach down to 301 cycles when source and dest are in ZP and to 329 cycles when they are not.

2017-01-13 13:14

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Christopher: That 3x EOR thing is exactly what we did on Amiga back in the days. Didnt expect this to have an advantage on 6502. But where is the shifting taking place? Or is this only in 1 of the 3 passes, and the other pathes correct the bitorder with a table lookup?

2017-01-13 17:10

Rudi
Account closed

Registered: May 2010
Posts: 125

Color Bar: 301 cycles is nice, just 2 cycles more than what my code does.

If anyone can get below 299 cycles then pls explain what method you use :p

2017-01-13 18:17

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Axis/Oxyron

Christopher: That 3x EOR thing is exactly what we did on Amiga back in the days. Didnt expect this to have an advantage on 6502. But where is the shifting taking place? Or is this only in 1 of the 3 passes, and the other pathes correct the bitorder with a table lookup?

Sweet. Yes, the code above is only used in the second of the three passes; first and third are very similar to yours, only with a bit shuffle on one of the input bytes in each pair on the third pass.

2017-01-13 18:22

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Rudi

If anyone can get below 299 cycles then pls explain what method you use :p

Could we make the rules the same as for Axis' contribution before we compare cycle counts?

-neither input nor output bytes on zero page
-zero page intermediate results are fine
-code is not relocated to zero page either.

In practice it's unlikely there'd be enough space in ZP for source and destination charsets, and copying the data in and out would add at least an extra 102 cycles per char.

2017-01-13 18:32

Rastah Bar

Registered: Oct 2012
Posts: 336

Good idea. I am down to 327 cycles now according to these rules.

2017-01-13 20:41

Rudi
Account closed

Registered: May 2010
Posts: 125

No, I dont think thats a good idea. But feel free to restrict yourselves to your own rules. I totally dont know what you are talking about anyway...

2017-01-13 21:31

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: No, I dont think thats a good idea. But feel free to restrict yourselves to your own rules. I totally dont know what you are talking about anyway...

OK, I will keep mentioning both cases (all input and output bytes either in ZP or in Mem).

One version uses 301 cycles if all are in ZP, another version takes 327 cycles when all are in Mem.

2017-01-14 03:56

ChristopherJam

Registered: Aug 2004
Posts: 1378

OK, in that case mine is 292 cycles if source and dest are in ZP, 308 cycles when all in mem.

Here's a diagram of the how the bits are shuffled at each stage:
Each block shows two input bytes for a swap macro in the top half, two outputs in the lower half. Digits in left border of each box are indexes into the input/output arrays for that stage.
Note that each input pair for the last four blocks contain one byte that's a shuffled version of an output byte from the previous stage, marked with a *
Shuffle's so I can then do an Axis style swap (which needs half the bits to be correctly located within the byte), and it's performed by storing the result of the previous stage into the low byte of an LDY absolute that references the shuffle table.

 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 b0 c0 d0 e0 f0 g0 h0 |  1 a1 b1 c1 d1 e1 f1 g1 h1 |  2 a2 b2 c2 d2 e2 f2 g2 h2 |  3 a3 b3 c3 d3 e3 f3 g3 h3 |
 4 a4 b4 c4 d4 e4 f4 g4 h4 |  5 a5 b5 c5 d5 e5 f5 g5 h5 |  6 a6 b6 c6 d6 e6 f6 g6 h6 |  7 a7 b7 c7 d7 e7 f7 g7 h7 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 b0 c0 d0 a4 b4 c4 d4 |  1 c1 d1 a1 b1 c5 d5 a5 b5 |  2 c2 d2 a2 b2 c6 d6 a6 b6 |  3 a3 b3 c3 d3 a7 b7 c7 d7 |
 4 e0 f0 g0 h0 e4 f4 g4 h4 |  5 g1 h1 e1 f1 g5 h5 e5 f5 |  6 g2 h2 e2 f2 g6 h6 e6 f6 |  7 e3 f3 g3 h3 e7 f7 g7 h7 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
                                                             
                                                             
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 b0 c0 d0 a4 b4 c4 d4 |  1 c1 d1 a1 b1 c5 d5 a5 b5 |  4 e0 f0 g0 h0 e4 f4 g4 h4 |  5 g1 h1 e1 f1 g5 h5 e5 f5 |
 2 c2 d2 a2 b2 c6 d6 a6 b6 |  3 a3 b3 c3 d3 a7 b7 c7 d7 |  6 g2 h2 e2 f2 g6 h6 e6 f6 |  7 e3 f3 g3 h3 e7 f7 g7 h7 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 b0 a2 b2 a4 b4 a6 b6 |  1 c1 d1 c3 d3 c5 d5 c7 d7 |  4 e0 f0 e2 f2 e4 f4 e6 f6 |  5 g1 h1 g3 h3 g5 h5 g7 h7 |
 2 c2 d2 c0 d0 c6 d6 c4 d4 |  3 a3 b3 a1 b1 a7 b7 a5 b5 |  6 g2 h2 g0 h0 g6 h6 g4 h4 |  7 e3 f3 e1 f1 e7 f7 e5 f5 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
                                                             
                                                             
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 b0 a2 b2 a4 b4 a6 b6 |  1 c1 d1 c3 d3 c5 d5 c7 d7 |  4 e0 f0 e2 f2 e4 f4 e6 f6 |  6*g0 h0 g2 h2 g4 h4 g6 h6 |
 3*a1 b1 a3 b3 a5 b5 a7 b7 |  2*c0 d0 c2 d2 c4 d4 c6 d6 |  7*e1 f1 e3 f3 e5 f5 e7 f7 |  5 g1 h1 g3 h3 g5 h5 g7 h7 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+
 0 a0 a1 a2 a3 a4 a5 a6 a7 |  3 d0 d1 d2 d3 d4 d5 d6 d7 |  4 e0 e1 e2 e3 e4 e5 e6 e7 |  6 g0 g1 g2 g3 g4 g5 g6 g7 |
 1 b0 b1 b2 b3 b4 b5 b6 b7 |  2 c0 c1 c2 c3 c4 c5 c6 c7 |  5 f0 f1 f2 f3 f4 f5 f6 f7 |  7 h0 h1 h2 h3 h4 h5 h6 h7 |
 +-------------------------+  +-------------------------+  +-------------------------+  +-------------------------+

2017-01-15 10:29

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting ChristopherJam

OK, in that case mine is 292 cycles if source and dest are in ZP, 308 cycles when all in mem.
Quote:

Ver neat!
Quote:

+-------------------------+ +-------------------------+ +-------------------------+ +-------------------------+ 0 a0 b0 c0 d0 e0 f0 g0 h0 | 1 a1 b1 c1 d1 e1 f1 g1 h1 | 2 a2 b2 c2 d2 e2 f2 g2 h2 | 3 a3 b3 c3 d3 e3 f3 g3 h3 | 4 a4 b4 c4 d4 e4 f4 g4 h4 | 5 a5 b5 c5 d5 e5 f5 g5 h5 | 6 a6 b6 c6 d6 e6 f6 g6 h6 | 7 a7 b7 c7 d7 e7 f7 g7 h7 | +-------------------------+ +-------------------------+ +-------------------------+ +-------------------------+ 0 a0 b0 c0 d0 a4 b4 c4 d4 | 1 c1 d1 a1 b1 c5 d5 a5 b5 | 2 c2 d2 a2 b2 c6 d6 a6 b6 | 3 a3 b3 c3 d3 a7 b7 c7 d7 | 4 e0 f0 g0 h0 e4 f4 g4 h4 | 5 g1 h1 e1 f1 g5 h5 e5 f5 | 6 g2 h2 e2 f2 g6 h6 e6 f6 | 7 e3 f3 g3 h3 e7 f7 g7 h7 | +-------------------------+ +-------------------------+ +-------------------------+ +-------------------------+

I would implement the code for the second and third byte pairs like this

lax s1
lda shuffle1,x
and #$f0
ldy s2
ora merge1,y
sta tmp1
tya
lda shuffle2,y
and #$0f
ora merge2,x
sta tmp2
------------------+
36 cycles (s1 and s2 in mem, tmp1 and tmp2 in zp)

For the first and 4th byte pairs the "LDA shuffle"s can be omitted (28 cycles).
Is there a more efficient way to do it?

2017-01-15 13:37

Oswald

Registered: Apr 2002
Posts: 5017

why shuffle tabs dont include the ANDs?

2017-01-15 17:29

Rastah Bar

Registered: Oct 2012
Posts: 336

Yes, you are right, the tables should include the ANDs. So that makes it 32 cycles for the 2nd and 3rd byte pairs.

2017-01-16 08:04

Axis/Oxyron

Registered: Apr 2007
Posts: 91

Quoting Color Bar

tya lda shuffle2,y

WOOT? ;o)

2017-01-16 08:16

Pex Mahoney Tufvesson

Registered: Sep 2003
Posts: 50

> WOOT? ;o)

It's a new way of doing nothing; it's just a little too complicated for you Axis, so go back and play with your 300fps million-dots 6510 dot spheres. :P

---
Have a noise night!
http://mahoney.c64.org

2017-01-16 09:03

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Quoting Color Bar

tya lda shuffle2,y

WOOT? ;o)

Lol. I needed some sleep. Honestly, this stuff is keeping me awake at night.

Pex: Good that you contribute something to this thread too.

2017-01-16 11:30

ChristopherJam

Registered: Aug 2004
Posts: 1378

Haha, you guys.

Um, my second and third shuffles from the first phase is pretty brute force on the tables front:

    ldx s1
    ldy s2
    lda sb_t2,x
    ora sb_t3,y
    sta d1
    lda sb_t0,x
    ora sb_t1,y
    sta d2

30 cycles with mem source, zp destination..

2017-02-01 21:22

Rudi
Account closed

Registered: May 2010
Posts: 125

nothing new here i guess..

2019-11-11 22:22

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rudi

nothing new here i guess..

Almost 3 years later...

Pretty much what White Flame had in mind, i guess:
Quoting White Flame

Here's another idea which I think fails, but might be salvageable.

So the basic LSR, ROR accumulation grabs a bit and stores a bit, taking 2 instructions per bit. However, if done in-place just like many basic shift-add multiplication routines, ROR both sets a final bit as well as reads the next bit to place. This means that in dream land, the flip can be done in about 64 RORs. If fully in zp, that would be 320 cycles, but only 128 bytes of code with no tables.

However, the tactic I took doesn't seem to have a nice clean loop of RORs linking source bit locations to final bit locations. Various CMP #80s and other byte-masking & merging seems to be required, which would likely bloat it back up to 400+ cycles. But maybe by shuffling it around differently, an arrangement could be made that's both fast and short.

So (everything in zeropage):

               ; from:
         ; row7: a7 a6 a5 a4 a3 a2 a1 a0
         ; row6: b7 b6 b5 b4 b3 b2 b1 b0
         ; row5: c7 c6 c5 c4 c3 c2 c1 c0
         ; row4: d7 d6 d5 d4 d3 d2 d1 d0
         ; row3: e7 e6 e5 e4 e3 e2 e1 e0
         ; row2: f7 f6 f5 f4 f3 f2 f1 f0
         ; row1: g7 g6 g5 g4 g3 g2 g1 g0
         ; row0: h7 h6 h5 h4 h3 h2 h1 h0

               ; to:
dest7   .byte 0; a7 b7 c7 d7 e7 f7 g7 h7
dest6   .byte 0; a6 b6 c6 d6 e6 f6 g6 h6
dest5   .byte 0; a5 b5 c5 d5 e5 f5 g5 h5
dest4   .byte 0; a4 b4 c4 d4 e4 f4 g4 h4
dest3   .byte 0; a3 b3 c3 d3 e3 f3 g3 h3
dest2   .byte 0; a2 b2 c2 d2 e2 f2 g2 h2
dest1   .byte 0; a1 b1 c1 d1 e1 f1 g1 h1
dest0   .byte 0; a0 b0 c0 d0 e0 f0 g0 h0

transpose
                 ; cc  c    bits
row7 = * + 1
        lda #0   ; 2,       a7 a6 a5 a4 a3 a2 a1 a0
        asl      ; 2, a7 <- a6 a5 a4 a3 a2 a1 a0 00
        rol      ; 2, a6 <- a5 a4 a3 a2 a1 a0 00 a7
        rol row6 ; 5, b7 <- b6 b5 b4 b3 b2 b1 b0 a6
        rol      ; 2, a5 <- a4 a3 a2 a1 a0 00 a7 b7
        rol row5 ; 5, c7 <- c6 c5 c4 c3 c2 c1 c0 a5
        rol      ; 2, a4 <- a3 a2 a1 a0 00 a7 b7 c7
        rol row4 ; 5, d7 <- d6 d5 d4 d3 d2 d1 d0 a4
        rol      ; 2, a3 <- a2 a1 a0 00 a7 b7 c7 d7
        rol row3 ; 5, e7 <- e6 e5 e4 e3 e2 e1 e0 a3
        rol      ; 2, a2 <- a1 a0 00 a7 b7 c7 d7 e7
        rol row2 ; 5, f7 <- f6 f5 f4 f3 f2 f1 f0 a2
        rol      ; 2, a1 <- a0 00 a7 b7 c7 d7 e7 f7
        rol row1 ; 5, g7 <- g6 g5 g4 g3 g2 g1 g0 a1
        rol      ; 2, a0 <- 00 a7 b7 c7 d7 e7 f7 g7
        rol row0 ; 5, h7 <- h6 h5 h4 h3 h2 h1 h0 a0
        rol      ; 2, 00 <- a7 b7 c7 d7 e7 f7 g7 h7
        sta dest7; 3,       a7 b7 c7 d7 e7 f7 g7 h7
              ; = 58
       ;clc
row6 = * + 1
        lda #0   ; 2,       b6 b5 b4 b3 b2 b1 b0 a6
        and #$fe ; 2,       b6 b5 b4 b3 b2 b1 b0 00
        adc row6 ; 3, b6 <- b5 b4 b3 b2 b1 b0 00 a6
        rol      ; 2, b5 <- b4 b3 b2 b1 b0 00 a6 b6
        rol row5 ; 5, c6 <- c5 c4 c3 c2 c1 c0 a5 b5
        rol      ; 2, b4 <- b3 b2 b1 b0 00 a6 b6 c6
        rol row4 ; 5, d6 <- d5 d4 d3 d2 d1 d0 a4 b4
        rol      ; 2, b3 <- b2 b1 b0 00 a6 b6 c6 d6
        rol row3 ; 5, e6 <- e5 e4 e3 e2 e1 e0 a3 b3
        rol      ; 2, b2 <- b1 b0 00 a6 b6 c6 d6 e6
        rol row2 ; 5, f6 <- f5 f4 f3 f2 f1 f0 a2 b2
        rol      ; 2, b1 <- b0 00 a6 b6 c6 d6 e6 f6
        rol row1 ; 5, g6 <- g5 g4 g3 g2 g1 g0 a1 b1
        rol      ; 2, b0 <- 00 a6 b6 c6 d6 e6 f6 g6
        rol row0 ; 5, h6 <- h5 h4 h3 h2 h1 h0 a0 b0
        rol      ; 2, 00 <- a6 b6 c6 d6 e6 f6 g6 h6
        sta dest6; 3,       a6 b6 c6 d6 e6 f6 g6 h6
              ; = 54
       ;clc
row5 = * + 1
        lda #0   ; 2,       c5 c4 c3 c2 c1 c0 a5 b5
        and #$fc ; 2,       c5 c4 c3 c2 c1 c0 00 00
        adc row5 ; 3, c5 <- c4 c3 c2 c1 c0 00 a5 b5
        rol      ; 2, c4 <- c3 c2 c1 c0 00 a5 b5 c5
        rol row4 ; 5, d5 <- d4 d3 d2 d1 d0 a4 b4 c4
        rol      ; 2, c3 <- c2 c1 c0 00 a5 b5 c5 d5
        rol row3 ; 5, e5 <- e4 e3 e2 e1 e0 a3 b3 c3
        rol      ; 2, c2 <- c1 c0 00 a5 b5 c5 d5 e5
        rol row2 ; 5, f5 <- f4 f3 f2 f1 f0 a2 b2 c2
        rol      ; 2, c1 <- c0 00 a5 b5 c5 d5 e5 f5
        rol row1 ; 5, g5 <- g4 g3 g2 g1 g0 a1 b1 c1
        rol      ; 2, c0 <- 00 a5 b5 c5 d5 e5 f5 g5
        rol row0 ; 5, h5 <- h4 h3 h2 h1 h0 a0 b0 c0
        rol      ; 2, 00 <- a5 b5 c5 d5 e5 f5 g5 h5
        sta dest5; 3,       a5 b5 c5 d5 e5 f5 g5 h5
              ; = 47
       ;clc
row4 = * + 1
        lda #0   ; 2,       d4 d3 d2 d1 d0 a4 b4 c4
        and #$f8 ; 2,       d4 d3 d2 d1 d0 00 00 00
        adc row4 ; 3, d4 <- d3 d2 d1 d0 00 a4 b4 c4
        rol      ; 2, d3 <- d2 d1 d0 00 a4 b4 c4 d4
        rol row3 ; 5, e4 <- e3 e2 e1 e0 a3 b3 c3 d3
        rol      ; 2, d2 <- d1 d0 00 a4 b4 c4 d4 e4
        rol row2 ; 5, f4 <- f3 f2 f1 f0 a2 b2 c2 d2
        rol      ; 2, d1 <- d0 00 a4 b4 c4 d4 e4 f4
        rol row1 ; 5, g4 <- g3 g2 g1 g0 a1 b1 c1 d1
        rol      ; 2, d0 <- 00 a4 b4 c4 d4 e4 f4 g4
        rol row0 ; 5, h4 <- h3 h2 h1 h0 a0 b0 c0 d0
        rol      ; 2, 00 <- a4 b4 c4 d4 e4 f4 g4 h4
        sta dest4; 3,       a4 b4 c4 d4 e4 f4 g4 h4
              ; = 40
       ;clc
row3 = * + 1
        lda #0   ; 2,       e3 e2 e1 e0 a3 b3 c3 d3
        and #$f0 ; 2,       e3 e2 e1 e0 00 00 00 00
        adc row3 ; 3, e3 <- e2 e1 e0 00 a3 b3 c3 d3
        rol      ; 2, e2 <- e1 e0 00 a3 b3 c3 d3 e3
        rol row2 ; 5, f3 <- f2 f1 f0 a2 b2 c2 d2 e2
        rol      ; 2, e1 <- e0 00 a3 b3 c3 d3 e3 f3
        rol row1 ; 5, g3 <- g2 g1 g0 a1 b1 c1 d1 e1
        rol      ; 2, e0 <- 00 a3 b3 c3 d3 e3 f3 g3
        rol row0 ; 5, h3 <- h2 h1 h0 a0 b0 c0 d0 e0
        rol      ; 2, 00 <- a3 b3 c3 d3 e3 f3 g3 h3
        sta dest3; 3,       a3 b3 c3 d3 e3 f3 g3 h3
              ; = 33
row2 = * + 1
        lda #0   ; 2,       f2 f1 f0 a2 b2 c2 d2 e2
        asl      ; 2, f2 <- f1 f0 a2 b2 c2 d2 e2 00
        adc #$80 ; 2, f1 <- ?? f0 a2 b2 c2 d2 e2 f2
        rol row1 ; 5, g2 <- g1 g0 a1 b1 c1 d1 e1 f1
        rol      ; 2, ?? <- f0 a2 b2 c2 d2 e2 f2 g2
        cmp #$80 ; 2, f0 <- f0 a2 b2 c2 d2 e2 f2 g2
        rol row0 ; 5, h2 <- h1 h0 a0 b0 c0 d0 e0 f0
        rol      ; 2, f0 <- a2 b2 c2 d2 e2 f2 g2 h2
        sta dest2; 3,       a2 b2 c2 d2 e2 f2 g2 h2
              ; = 25
row1 = * + 1
        lda #0   ; 2,       g1 g0 a1 b1 c1 d1 e1 f1
        asl      ; 2, g1 <- g0 a1 b1 c1 d1 e1 f1 00
        adc #$80 ; 2, g0 <- ?? a1 b1 c1 d1 e1 f1 g1
        rol row0 ; 5, h1 <- h0 a0 b0 c0 d0 e0 f0 g0
        rol      ; 2, ?? <- a1 b1 c1 d1 e1 f1 g1 h1
        sta dest1; 3,       a1 b1 c1 d1 e1 f1 g1 h1
              ; = 16
row0 = * + 1
        lda #0   ; 2,       h0 a0 b0 c0 d0 e0 f0 g0
        cmp #$80 ; 2, h0 <- h0 a0 b0 c0 d0 e0 f0 g0
        rol      ; 2, h0 <- a0 b0 c0 d0 e0 f0 g0 h0
        sta dest0; 3,       a0 b0 c0 d0 e0 f0 g0 h0
               ; = 9
        rts    ; = 9 + 16 + 25 + 33 + 40 + 47 + 54 + 58 = 282

Thus...
Quoting ChristopherJam

Problem with bitshifter is it only deals with one bit at a time.

c2p exchanges four bitpairs in only 20-30 cycles. The expensive part is moving bits to matching positions within the two bytes they're being exchanged between...

Turns out both approaches are pretty much in the same ballpark. =)
(And this approach apparently coming out just ever so slightly faster. \=D/)

Might be possible to squeeze out a few more cycles here and there, so feel free.

I have a hunch that the general problem can't be solved in fewer than 280-ish cycles, though.

2019-11-12 10:21

JackAsser

Registered: Jun 2002
Posts: 1989

@Krill: Nice!!!

So, now what about the shortest code to rotate a char? (or a reference to the message number if it's already stated here)

2019-11-12 12:46

Krill

Registered: Apr 2002
Posts: 2839

Jackasser: Shortest? Going fully academic, eh? =D

I guess it would be the naïve approach, in a nested loop (untested):

SOURCE = $02
DEST   = $0a

        ldy #7      ; 2
-       ldx #7      ; 2
-       lsr SOURCE,x; 2
        ror         ; 1
        dex         ; 1
        bpl -       ; 2
        sax DEST,y  ; 2
        dey         ; 1
        bpl --      ; 2

15 bytes.

But i'd rather have a faster than 280-ish cycles approach. =)

2019-11-12 13:58

Oswald

Registered: Apr 2002
Posts: 5017

sax for giggles ?

2019-11-12 14:46

Krill

Registered: Apr 2002
Posts: 2839

SAX because STA zp,Y does not exist.

2019-11-12 15:35

Oswald

Registered: Apr 2002
Posts: 5017

*clapping*

Refresh

Subscribe to this thread: