| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Fast way to rotate a char?
Im not talking about rol or ror, but swap bits so that they are rotated 90 degrees:
Example:
a char (and the bits can be random):
10110010 byte 1..
11010110 byte 2.. etc..
00111001
01010110
11011010
10110101
00110011
10110100 after "rotation" (rows and columns are swapped):
11001101
01011000
10100111
11111111
00101000
01010101
11011010
00100110 is it possible to use lookup tables for this or would that lookup table be too big?
or other lookuptable for getting and setting bits?
-Rudi |
|
... 105 posts hidden. Click here to view all posts.... |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting Color BarI may have found a method that takes 432 cycles....
If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles. |
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Quote: Quoting Color BarI may have found a method that takes 432 cycles....
If I merge columns of 2 bits wide and 4 bits high into one byte and then extract the destination nybbles I can reduce that to 354 cycles.
Are you using the masking method? |
| |
Axis/Oxyron Account closed
Registered: Apr 2007 Posts: 91 |
I just want to share some thoughts on my merges that didnt work out. Perhaps I´m just missing the last twist.
First idea was to make relative merges. I discussed that back in the 90´s with some Amiga coders and on 68030-68060 it saves some cycles.
Idea is, that shifting of the input must not always have the exact values, as long as the delta of the shift of the 2 inputs stays correct. Disadvantage of that is, that the last merge needs to make some rol/ror to compensate.
This resulted in something like this:
lda {src1}
ldy {src2}
and #$aa
ldx {bittab1},y
sax {dst1}
eor {src1} ;invert and #$aa to and #$55
ldx {bittab2},y
sax {dst2}
Unluckily it only saves 1 cycle per merge which is completely eaten up by the last merge that looses 2 cycles for the correction.
Another idea was to interleave the temp-arrays with hi-byte pointers so that they can be used both as pointers for indirect y-indexing and as direct values. Code would look like this:
lax {src1}
and #{mask1}
ora ({src2}),y
sta {dst1}
lda ({src2}),y
ora {bittab2},x
sta {dst2}
Would also save 1 cycle per merge. But the unsolved problem is, that the 2 usages of {src2} should be pointing to 2 different tables. *grrr*
What definitely works is reordering the merges, so that the last 2 merges of a resolution dont need to store the tmp-values into the zp and the first two of the next resolution doesnt need to read the tmp-values.
so the last
sta {dst2}
and the first
lax {src1}
would merge into
tax.
Saves 4 times 3=12 cycles. |
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Interesting ideas. Reordering merges was something I tried out but failed. Maybe I should look at it again. |
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
Ok, so I did merge 4x4 and 2x2, and it worked. Am now at 299 cycles. There is probably more that can be optimized, because now the code looks like a mess. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: Are you using the masking method?
These diagrams illustrate what I am thinking of
xxxxxxxx source0
xxxxxxxx source1
xxxxxxxx
xxxxxxxx
aabbccdd
aabbccdd
aabbccdd
aabbccdd source7
xxxxaaaa destination0
xxxxaaaa destination1
xxxxbbbb
xxxxbbbb
xxxxcccc
xxxxcccc
xxxxdddd
xxxxdddd destination7
The source bits in the block indicated by the a's should end up in the proper locations in destination bytes. Using look up tables the 8 a's are put into one byte such that the high nybble is the desired nybble of destination0 and then the low nybble that of destination1. The same procedure can be followed for all 4x2 source blocks 'b','c','d', etcetera.
Code for doing that can look like
ldy source4
lda mergingTable1,y ;This table takes two bits and puts them in a byte, such that they are in the right positions to use as nybble for the destination.
ldx source5
ora mergingTable2,x
ldx source6
ora mergingTable3,x
ldx source7
ora mergingTable4,x ;Now the byte is filled with all bits from block 'a'.
tax
and #$0f
sta destination0
lda moveHighNybbleToLowNybble,x
sta destination1
-------------------+
42 cycles.
This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.
The lower half of the source matrix is therefore moved in 42+3*41 = 165 cycles.
The other half of the matrix goes very similar, but since the destination bytes are no longer empty, we have to add 'ORA destination0' and 'ORA destination1' before the 'STA destination0' and 'STA destination1', respectively. So that will take 4*6 cycles extra, totalling 2*165+24=354 cycles. But the method is very memory inefficient. |
| |
White Flame
Registered: Sep 2002 Posts: 136 |
Here's another idea which I think fails, but might be salvageable.
So the basic LSR, ROR accumulation grabs a bit and stores a bit, taking 2 instructions per bit. However, if done in-place just like many basic shift-add multiplication routines, ROR both sets a final bit as well as reads the next bit to place. This means that in dream land, the flip can be done in about 64 RORs. If fully in zp, that would be 320 cycles, but only 128 bytes of code with no tables.
However, the tactic I took doesn't seem to have a nice clean loop of RORs linking source bit locations to final bit locations. Various CMP #80s and other byte-masking & merging seems to be required, which would likely bloat it back up to 400+ cycles. But maybe by shuffling it around differently, an arrangement could be made that's both fast and short. |
| |
Rudi Account closed
Registered: May 2010 Posts: 125 |
ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill! |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quote: ok, ok. another idea is to use TXS and TSX (takes only 2 cycles each) for having X as temporary variable in optimizing. might require the lookup to switch from ,x and ,y to ,y and ,x however. I don't mind having a 4k+ big table for making it really fast. 32k+ table however thats overkill!
I think my method needs 18 tables of 1 page = 4.5k. Using TXS/TSX will shave off 2 cycles in total. |
| |
Rastah Bar Account closed
Registered: Oct 2012 Posts: 336 |
Quoting Color Bar
This has to be repeated for blocks 'b', 'c', 'd' (using the right lookup tables), but since source4 was saved in Y, that code can start with TAY instead of 'LDY source4' to save 1 cycle.
I meant that the 'LDY source4' can be omitted. The TAY doesn't make sense and should be omitted, so the code takes 342 cycli.
I can gain another 6 cycli by saving X. If the code
tax
lda moveHighNybbleToLowNybble,x
is replaced by
sta selfmod:+1
selfmod:
lda moveHighNybbleToLowNybble
then source7 is kept in X and the code for block 'b' can start with fetching the bits of source7 from a table and omit the 'LDX source7'. One cycle is gained that way. In that block source6 must then be saved in X as shown above, and in the following block source5. The code for the last block can stay as it was in the original post. That saves 3 cycles for half the matrix and 6 in total, and the algorithm takes 336 cycles. |
Previous - 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 - Next |