[CSDb] - User Forums - shortest CIA-stable raster


2009-04-04 12:39

Hermit

Registered: May 2008
Posts: 209

shortest CIA-stable raster

<Post edited by moderator on 4/4-2009 14:47>

Hi, Guys :)

While preparing for a compo I've developed what may be the shortest
CIA-type stable raster solution (it fits in 64 bytes, 24 asm rows).
If you can do it even shorter, I'm curious :)

It works fine in practice: you don't have to type novels to achieve
a stable raster, and there's no need for a raster IRQ; the CMP $d012 method is enough.
If you find it useful for fast & short demo-writing, we may add it
to codebase64.

;syncing CIA1 timer A to the beam at the start of the program:
-----------------------------------------------------------

     sei                   ;we don't want to lose cycles to IRQ calls :)
sync cmp $d012             ;scan for the start rasterline (A=$11 after first return)
     bne *-3       ;wait if rasterline #$11 is not reached yet
     ldy #8        ;the value for the CIA timer latch & for the Y delay loop
     sty $dc04     ;CIA timer will count from 8,8 down to 7,6,5,4,3,2,1
     dey           ;Y=Y-1 (8 iterations: 7,6,5,4,3,2,1,0)
     bne *-1       ;loop needed to pad the poll delay with 39 cycles
     sty $dc05     ;timer hi-byte must be 0 (anything else would mess it up)
     sta $dc0e,y   ;forced restart of the timer at value 8 (set in $dc04)
     lda #$11      ;value for the $d012 scan and for the timer start in $dc0e
     cmp $d012     ;check if the line ended (new line) or not (same line)
     sty $d015     ;switch off sprites, they eat cycles when fetched
     bne sync      ;if the line changed after 63 cycles, resynchronize!
     .... the rest (this is also a stable-timed point, can be used for something)
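
As a sanity check on the 39-cycle figure in the comments, here is a minimal Python sketch that adds up the cycles of the dey / bne *-1 loop. The cycle counts (DEY = 2, BNE = 3 when taken, 2 when not taken) are the standard 6502 values, and the sketch assumes the branch does not cross a page boundary. The function name is made up for illustration only.

```python
# Standard 6502 cycle counts (assumption: the branch never crosses a page).
DEY_CYCLES = 2
BNE_TAKEN = 3
BNE_NOT_TAKEN = 2


def delay_loop_cycles(start_y: int) -> int:
    """Total cycles of 'dey / bne *-1' when entered with Y = start_y (1..255)."""
    cycles = 0
    y = start_y
    while True:
        y = (y - 1) & 0xFF          # dey
        cycles += DEY_CYCLES
        if y != 0:                  # bne loops back while Y is non-zero
            cycles += BNE_TAKEN
        else:                       # final pass falls through
            cycles += BNE_NOT_TAKEN
            return cycles


print(delay_loop_cycles(8))
```

With Y starting at 8 this gives seven looping passes of 5 cycles plus a final pass of 4 cycles, which matches the figure quoted in the listing.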

;EXAMPLE - using timer A to stabilize the 7-cycle jitter when using CMP $d012:
-----------------------------------------------------------------------
scan ldx #$31    ;a good value: not a badline, in the border, and low nibble 1=white
     cpx $d012   ;scan rasterline
     bne *-3     ;wait until the rasterline is $31
     lda $dc04   ;read timer A, here it jitters between 7...1
     eor #7      ;A=7-A so the jitter will be 0...6 in A
     sta corr+1  ;self-modifying code, the bpl branch offset = A
corr bpl *+2     ;branch to the byte selected by the timer (A)
     cmp #$c9    ;if A=0, cmp #$c9; if A=1, cmp #$c9 again 2 cycles later
     cmp #$c9    ;if A=2, cmp #$c9; if A=3, cmp #$2c 2 cycles later
     bit $ea24   ;if A=4, bit $ea24; if A=5, bit $ea; if A=6, only nop

     stx $d020   ;low nibble of X is 1, so the border is white at the stable cycle
     sty $d020   ;Y ended at 0 in the sync routine, so the border is black 4 cycles later
     jmp scan    ;go to the raster again (or go on to a new raster line)
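
To see why branching into the middle of these instructions removes the jitter, the following Python sketch decodes the seven assembled bytes of cmp #$c9 / cmp #$c9 / bit $ea24 from every entry offset and adds up the cycles. It is only a model: it knows just the four opcodes that occur here (with their standard 6502 lengths and cycle counts), it ignores the cost of the bpl itself, and the names OPCODES, SLED and sled_cycles are invented for this example.

```python
# opcode -> (length in bytes, cycles); only the opcodes found in the sled.
OPCODES = {
    0xC9: (2, 2),  # CMP #imm
    0x2C: (3, 4),  # BIT abs
    0x24: (2, 3),  # BIT zp
    0xEA: (1, 2),  # NOP
}

# cmp #$c9 / cmp #$c9 / bit $ea24 as they sit in memory
SLED = bytes([0xC9, 0xC9, 0xC9, 0xC9, 0x2C, 0x24, 0xEA])


def sled_cycles(sled: bytes, offset: int) -> int:
    """Cycles spent from entry at `offset` until execution leaves the sled."""
    pc, cycles = offset, 0
    while pc < len(sled):
        length, cost = OPCODES[sled[pc]]
        pc += length
        cycles += cost
    if pc != len(sled):
        raise ValueError("instruction ran past the end of the sled")
    return cycles


delays = [sled_cycles(SLED, a) for a in range(7)]
print(delays)
```

In this model every entry offset decodes cleanly to the end of the sled, and each increment of A shortens the delay by exactly one cycle, which is what lets the routine cancel the timer-measured jitter.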

-----------------------------------------------------------------------
Opinions?

Hermit Software Hungary

... 17 posts hidden ...

2009-04-05 18:49

Ninja

Registered: Jan 2002
Posts: 418

Nice to see such a coding thread on CSDb \o/

While nothing beats the experience gained by doing your own timing stuff, I see some practical issues with this routine:

- As Copyfault mentioned, the 8-cycle jitter is not taken care of.

- It was often useful to me to have a counter synced to a rasterline (i.e. counting 63 cycles not 9). That makes it easier to abuse it more than once IMHO. Might be a personal preference, though.

- This routine may need several frames to reach a stable raster. Some code generation or depacking might have happened in the meantime.

- If you want to be really short, your approach won't beat $d013-based techniques, I am afraid.

Still, nice to see you playing around with it. While I see the above issues, there are still nice ideas in this one.

2009-04-05 20:13

Copyfault

Registered: Dec 2001
Posts: 487

Hey Ninja, greetings my friend,

When I saw that you had posted a reply here, I first thought you had found a way to further optimize this branch approach...

Maybe this is the limit (considering the bytes-to-cycles ratio).

Copyfault

2010-11-26 15:44

Hermit

Registered: May 2008
Posts: 209

I'd like to emphasize what Copyfault pointed out to me. The 8-cycle jitter is really something that we have to pay attention to...
I'm coding a program, and some weird things happened when the main program (outside the IRQ) executed more instructions than a simple jmp or so (especially when the IRQ loader started to operate).
I had to slightly modify the stable-raster waiter routine in the IRQ. Adding another line of bit $ea24 seems to prevent any issues coming from the 8-cycle jitter... (even 9)

lda $dc04 ;read timer A, here it jitters between 7...1
eor #7 ;A=7-A so the jitter will be 0...6 in A
sta corr+1 ;self-modifying code, the bpl branch offset = A
corr bpl *+2 ;branch to the byte selected by the timer (A)
cmp #$c9 ;if A=0, cmp #$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp #$c9; if A=3, cmp #$2c 2 cycles later
bit $ea24 ;if A=4, bit $ea24; if A=5, bit $ea; if A=6, only nop
bit $ea24 ;IMPORTANT to handle the 8th cycle of jitter

It's not a big effort, but from now on I have to keep this in mind when writing stable-raster IRQs.
(A simple NOP did the job too, though it's not the best solution, as it may not be as stable at the 8th cycle.)
Thanks for telling me about this 8-9 cycle jitter thingy.

Hermit Software Hungary

2011-03-25 23:54

Repose
Account closed

Registered: Oct 2010
Posts: 227

I'm going to try to develop a mathematical proof of the shortest/quickest code.
According to http://visual6502.org/wiki/index.php?title=6502_all_256_Opcodes
We have these possibilities:
1 byte: 2-4 cycles
2 bytes: 2-6, 8 cycles
3 bytes: 4-7 cycles

And we are trying to create 8 delay states.
Now here's the table of delay states:
1 byte: 3 states (2, 3 or 4 cycles)
2 byte: 5 states (2, 3, 4, 5, or 6)*
3 byte: 4 states (4, 5, 6, or 7 cycles)
*We'll have to special case this later for the 8 cycles instructions;
there's no way to use two of these to get 15 cycles.
You can see that 1 byte instructions are most efficient for consuming
states.
Combining instructions doesn't double the states! I think of it this way:
different combinations add up to the same total number of cycles, so every
added instruction loses one state possibility. The formula is:
total states = (states)*(n) - (n-1), where n is the number of times the instruction
is repeated, and states is the number of cycle possibilities it has.
Here's a table:
states  n  total
3       1  3
3       2  5
3       3  7
3       4  9
5       1  5
5       2  9
4       1  4
4       2  7
4       3  10
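
The formula can be cross-checked by brute force. This short Python sketch enumerates every combination of n instructions drawn from one cycle range, counts the distinct totals, and compares that with states*n-(n-1). The helper names are invented for the example, and the 8-cycle special case of the 2-byte opcodes is deliberately left out, as in the table.

```python
from itertools import product


def distinct_totals(cycle_options, n):
    """Number of distinct total cycle counts from n instructions in a row."""
    return len({sum(combo) for combo in product(cycle_options, repeat=n)})


def formula(states, n):
    """The closed form quoted in the post: states*n - (n-1)."""
    return states * n - (n - 1)


ONE_BYTE = (2, 3, 4)
TWO_BYTE = (2, 3, 4, 5, 6)      # 8-cycle special case left out here
THREE_BYTE = (4, 5, 6, 7)

for options in (ONE_BYTE, TWO_BYTE, THREE_BYTE):
    for n in range(1, 5):
        # brute force and closed form side by side
        print(len(options), n, distinct_totals(options, n), formula(len(options), n))
```

The two columns agree because each cycle range is contiguous, so the sums of n instructions also form one contiguous range.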

So which combinations give at least 8 states?
size  n  total states  total bytes
1     4  9             4
2     2  9             4
3     3  10            9
In this table, size is the bytes in the opcode, n is the number of times
an instruction of that length appears in a row.
But 4 bytes isn't the best because we're overdoing it. If we use
a combination of 2-byte and 1-byte opcodes, can we get 8 states in less
memory?
Combining the 1-byte and 2-byte opcodes, we get (2,3,4) cycles + (2,3,4,5,6) cycles,
which is 4-10 cycles or, by our formula, (3 states + 5 states) - (2-1) = 7 states.
This isn't quite enough. However, now we consider the special case of 8 cycles,
which turns out to be very special!
We get (2,3,4)cycles+(2,3,4,5,6,8)cycles=4-12 cycles or 9 states.
Notice that there is just enough overlap here, i.e. (4+6)=10cycles and
(3+8)=11 cycles and (4+8)=12 cycles.
So theoretically, we can use just 3 bytes to write a delay between 4 to 12
cycles.
These formulas can also be used for e.g. the Z80 in the C128 to set a limit
on optimized code.
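
A quick way to convince yourself that the 8-cycle opcode closes the gap is to list the reachable totals. The Python sketch below pairs one 1-byte instruction with one 2-byte instruction, once without and once with the 8-cycle case; the variable names are illustrative only.

```python
ONE_BYTE = (2, 3, 4)
TWO_BYTE = (2, 3, 4, 5, 6)
TWO_BYTE_WITH_8 = TWO_BYTE + (8,)


def reachable(first, second):
    """Sorted list of distinct cycle totals for one instruction from each set."""
    return sorted({a + b for a in first for b in second})


without_8 = reachable(ONE_BYTE, TWO_BYTE)
with_8 = reachable(ONE_BYTE, TWO_BYTE_WITH_8)
print(len(without_8), without_8)
print(len(with_8), with_8)
```

Without the 8-cycle opcode the totals stop at 10 cycles (7 states); with it, 11 and 12 become reachable and no value in between is skipped.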

The Table of Fixed Delays
So how does this help us write the shortest delay routine? The only use
for this in short code is to use it with a computed branch. The obvious
way to do it is with something like:
lda timer1;4 cycles
asl;2 cycles
sta *+3;4 cycles
bne *+2; selfmod to branch into delay fragments (3 cycles)
xxx;3 state opcode
xxx;6 state opcode (4-12 cycles)
bne continue;do raster processing (3 cycles)
This is obviously horrible for code size but for quickest sync it's promising;
it's 20-28 cycles.
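
The 20-28 cycle figure follows directly from the per-line annotations. A small Python sketch of the sum, assuming both branches are taken and taking the 4-12 cycle fragment from the analysis above (the dictionary keys are just labels for this example):

```python
# Cycle costs as annotated in the listing; both branches assumed taken.
FIXED_CYCLES = {
    "lda timer1": 4,
    "asl": 2,
    "sta (selfmod)": 4,
    "bne (computed)": 3,
    "bne continue": 3,
}
DELAY_RANGE = range(4, 13)   # the 4..12 cycle delay fragment

fixed = sum(FIXED_CYCLES.values())
totals = [fixed + delay for delay in DELAY_RANGE]
print(fixed, min(totals), max(totals))
```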

The "multi-threaded" code trick
Using BIT followed by data which happens to be a valid opcode, you can do
a computed branch into what are, coincidentally, up to 3 different instructions, giving
you 3 states in 3 bytes. This seems obviously the most efficient way to make
a short delay; you are trading multiple custom code fragments for a single
code thread array. The code array is indexed by a byte at a time so it
can only consume one state. A first estimate of such techniques is 8 bytes
to consume 8 states. While it saves memory, it can't possibly sync quicker
than the delay above.

The Computed Delay Trick
The last way to make a delay is by calculations which take varying amounts
of time. You can use branches to add 1 cycle of delay based on each of
the flags, N, Z, and C. This can create 4 states of delay. We'll have
to do another calculation to create more delay states. It should take
two calculations and a whole bunch of branches to do it.

Combining Methods
What if you used something like bne $EA to combine threading with computation?
I think you've squeezed an extra state in there somewhere. I think this
might work but your code would be scattered all over the place; it's still
valid though as you can write other code in between.

I haven't fully worked out all the ideas but I believe I've generalized the
approaches.

This reminds me of the 3 or 4 ways of speed optimizing; you can make a loop,
unroll a loop, use "decision tree optimization" where every decision leads
to its own code fragment, or of course the easier way of doing it, which is
a table of subroutines for every possible argument. This is a way to
make a two argument table but with less memory.
Has anyone made a multiply routine for every possible multiplier? I looked
at this and it's about 32 cycles max and sometimes quite less, much faster
than the table of squares method.

2011-03-27 22:37

Copyfault

Registered: Dec 2001
Posts: 487

@Repose: what exactly do you want to prove? The smallest number of bytes needed for a de-jitter-routine?

If scattering of routine fragments is allowed I guess Ninja's approach used in his 2x2-FLI-routines is the shortest possible.

Maybe I didn't fully get the idea behind your lines, but don't we need something like 'axiomatic semantics' to do a correct mathematical proof?

2011-07-19 14:00

ready.

Registered: Feb 2003
Posts: 441

Quoting Hermit

lda $dc04 ;read timer A, here it jitters between 7...1
eor #7 ;A=7-A so the jitter will be 0...6 in A
sta corr+1 ;self-modifying code, the bpl branch offset = A
corr bpl *+2 ;branch to the byte selected by the timer (A)
cmp #$c9 ;if A=0, cmp #$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp #$c9; if A=3, cmp #$2c 2 cycles later
bit $ea24 ;if A=4, bit $ea24; if A=5, bit $ea; if A=6, only nop
bit $ea24 ;IMPORTANT to handle the 8th cycle of jitter

@Hermit: this code doesn't fix the previous one when it encounters the 8-cycle jitter. Just check this: with $dc04=8, EOR #7 produces $0f, and the bpl takes you outside your code.

2011-07-20 10:32

Frantic

Registered: Mar 2003
Posts: 1661

@Ready: Maybe I'm totally missing the point, just judging from the surface of things, but are you overlooking the following?

0
1
2
3
4
5
6
7

= 8 different states, which means that $dc04==8 never happens and that $dc04==7 really is the 8th state? (As I said, I may very well be wrong now, because I don't know what $dc04 might actually end up being in the code discussed here..)

2011-07-20 12:44

ready.

Registered: Feb 2003
Posts: 441

I might be wrong as well, since I based my feedback on the VICE monitor only (VICE 2.2). Still, I can confirm that when my routine runs in VICE I sometimes get $dc04=8. I checked the setup of the code and it is correct.

2011-07-20 20:39

Copyfault

Registered: Dec 2001
Posts: 487

@Frantic: the experiments I did back then showed that $DC04 will never reach the value '0'. This was already mentioned in this thread and in the old one (see some posts above).

@Hermit: this "eor #$07" line in your code must indeed cause problems, due to the fact that $DC04 != 0. But of course you could sync your timer to have e.g. values between $10..$17 at the reading cycle of "lda $DC04"; then "eor #$17" should fix your example code.
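
The arithmetic behind the last few posts is easy to check. This Python sketch only models the EOR step, not the CIA itself; the timer ranges are the ones quoted in the thread, and branch_offset is a made-up helper name.

```python
def branch_offset(timer_value: int, mask: int) -> int:
    """Value stored into the bpl operand: timer reading EOR mask."""
    return timer_value ^ mask


# Timer readings 1..7 with eor #7, as in the original routine
original = [branch_offset(t, 0x07) for t in range(1, 8)]

# The problem case reported from VICE: a reading of 8
stray = branch_offset(8, 0x07)

# Suggested fix: readings $10..$17 with eor #$17
shifted = [branch_offset(t, 0x17) for t in range(0x10, 0x18)]

print(original, hex(stray), shifted)
```

A reading of 8 lands on offset $0f, far past the delay instructions, while a timer window of $10..$17 combined with eor #$17 maps all eight readings onto the contiguous offsets 7 down to 0.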

2012-01-09 07:48

ChristopherJam

Registered: Aug 2004
Posts: 1424

Another approach, albeit the same number of lines of code:

;setting the CIA1-timerA to beam in the program beginning:
;-----------------------------------------------------------

     sei           ;we don't want to lose cycles to IRQ calls ;)
sync lda #$1c
     cmp $d012     ;scan for line to force DMA
     bne *-3
     sta $d011     ;trigger badline to absorb jitter
     lda #$11
     ldy #8        ;the value for the CIA timer latch & for y-delay loop
     sty $dc04     ;CIA timer will count from 8,8 down to 7,6,5,4,3,2,1
     ldy #0
     sty $dc05     ;timer hi-byte must be 0 (anything else would mess it up)
     sta $dc0e,y   ;forced restart of the timer at value 8 (set in $dc04)
     dec $d011     ;undo tiny scroll from above
     bmi sync      ;oops, we were in the bottom border
