[CSDb] - User Forums - shortest CIA-stable raster

2009-04-04 12:39

Hermit

Registered: May 2008
Posts: 208

shortest CIA-stable raster

<Post edited by moderator on 4/4-2009 14:47>

Hi, Guys :)

While preparing for a compo I developed what may be the shortest
CIA-type stable raster solution (fits in 64 bytes, 24 asm rows).
If you can do it even shorter, I'm curious :)

It works fine in practice: you don't have to type novels to achieve a
stable raster, and there is no need for a raster IRQ; polling with CMP $d012 is enough.
If you find it useful for fast & short demo writing, we could add it
to Codebase64.

;setting the CIA1-timerA to beam in the program beginning:
-----------------------------------------------------------

     sei                   ;we don't want lost cycles by IRQ calls :)
sync cmp $d012             ;scan for begin rasterline (A=$11 after first return)
     bne *-3       ;wait if not reached rasterline #$11 yet
     ldy #8        ;the value for cia timer fetch & for y-delay loop
     sty $dc04     ;CIA Timer will count from 8,8 down to 7,6,5,4,3,2,1
     dey           ;Y=Y-1 (8 iterations: 7,6,5,4,3,2,1,0)
     bne *-1       ;loop needed to complete the poll-delay with 39 cycles
     sty $dc05     ;no need Hi-byte for timer at all (or it will mess up)
     sta $dc0e,y   ;forced restart of the timer to value 8 (set in dc04)
     lda #$11      ;value for d012 scan and for timerstart in dc0e
     cmp $d012     ;check if line ended (new line) or not (same line)
     sty $d015     ;switch off sprites, they eat cycles when fetched
     bne sync      ;if line changed after 63 cycles, resynchronize it!
     .... the rest (this is also a stable-timed point, can be used for sg.)
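
The cycle counts quoted in the comments above can be cross-checked with a few lines of Python. This is only a bookkeeping sketch under stated assumptions: a PAL machine with 63 cycles per rasterline, a continuously running CIA timer whose period is latch+1 cycles, and a delay loop that does not cross a page boundary. All names are made up for illustration.

```python
# Cycle bookkeeping for the sync routine (assumption: PAL, 63 cycles per rasterline).
CYCLES_PER_LINE_PAL = 63   # assumption: PAL C64
TIMER_LATCH = 8            # value written to $dc04 (high byte $dc05 = 0)

# A continuously running CIA timer underflows every latch+1 cycles.
timer_period = TIMER_LATCH + 1

# dey/bne loop starting from Y=8: dey = 2 cycles, bne = 3 taken / 2 not taken
# (assuming the branch stays within one page).
iterations = 8
loop_cycles = (iterations - 1) * (2 + 3) + (2 + 2)

print(timer_period, CYCLES_PER_LINE_PAL % timer_period, loop_cycles)
```

Under these assumptions the timer period divides the line length evenly, so the timer value read at a fixed point of any rasterline stays the same from line to line, and the loop length agrees with the 39 cycles mentioned in the comment.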

B;EXAMPLE-using timerA to stabilize 7 cycle jitter when using CMPd012:
-----------------------------------------------------------------------
scan ldx #$31    ;a good value that's not badline, in border and 1=white
     cpx $d012   ;scan rasterline
     bne *-3     ;wait until rasterline will be $31
     lda $dc04   ;check timer A, here it jitters between 7...1
     eor #7      ;A=7-A so jitter will be 0...6 in A
     sta corr+1  ;self-writing code, the bpl jump-address = A
corr bpl *+2     ;the jump to timer (A) dependent byte
     cmp #$c9    ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
     cmp #$c9    ;if A=2, cmp#$c9, if A=3, CMP #$2C 2 cycles later
     bit $ea24   ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP

     stx $d020   ;x was 1 so border is white at the stable cycle
     sty $d020   ;y ended in 0 in sync routine, so border black after 4 cycles
     jmp scan    ;go to the raster again (or can go new raster)

-----------------------------------------------------------------------
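
To see why branching into the middle of the cmp/cmp/bit block removes one cycle per step, the bytes can be decoded from every entry offset. The following Python sketch is a simplified model, not an emulator: the opcode table only covers the four opcodes that occur in the block, and the names `OPCODES`, `DELAY_BLOCK` and `cycles_from` are invented for this illustration.

```python
# opcode -> (length in bytes, cycles); only the opcodes that occur in the delay block
OPCODES = {
    0xC9: (2, 2),  # cmp #imm
    0x2C: (3, 4),  # bit abs
    0x24: (2, 3),  # bit zp
    0xEA: (1, 2),  # nop
}

# cmp #$c9 / cmp #$c9 / bit $ea24 as raw bytes
DELAY_BLOCK = bytes([0xC9, 0xC9, 0xC9, 0xC9, 0x2C, 0x24, 0xEA])

def cycles_from(offset, block=DELAY_BLOCK):
    """Sum the cycles executed when entering `block` at `offset` and running to its end."""
    pc, total = offset, 0
    while pc < len(block):
        length, cycles = OPCODES[block[pc]]
        pc += length
        total += cycles
    return total

delays = [cycles_from(a) for a in range(7)]
print(delays)
```

In this model the entry offsets 0..6 cost 8, 7, 6, 5, 4, 3 and 2 cycles, so each extra byte skipped by the self-modified bpl shortens the delay by exactly one cycle. At odd offsets the CPU executes operand bytes as opcodes, for example offset 3 decodes as cmp #$2c followed by bit $ea.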
Opinions?

Hermit Software Hungary

... 17 posts hidden ...

2009-04-05 08:14

Hermit

Registered: May 2008
Posts: 208

Good to see this approach. I was thinking about a similar delayer built from branch commands but haven't managed to realize it yet.
Where I could, I avoided illegal opcodes, because some assemblers (or machine types?) make them hard to use, and beginners also need clear code they can understand on Codebase64.

I've tried this routine and it really works, and GREAT NEWS: there's no need for ASR #7, a plain LSR is enough. Why? Because our CIA timer has a period of only 9 cycles, and at the LDA $DC04 only 7..1 appears, so there is no need to mask off any bits. 1 byte saved again.

pha
lda $dc04
lsr
bcs *+2
LSR
bcc *+6
bcs *+4
bne .end
bne *-2
.end

That said, if you can recommend a C64 Turbo Assembler that accepts illegal opcodes, I would be happy.

Another idea: we could build this bne/bcc/bcs-style delayer with a halving method, which may reduce the number of steps.

My other approach is to load $dc04 into X or Y and make an indexed jump into a delay routine. Then there is no need to invert (EOR) $dc04, which unfortunately counts BACKWARDS from 7 to 1 (8 down to 0 (8), to be precise). Alternatively, the "JMP ($dc03)" method could help reduce rows.

Hermit Software Hungary

2009-04-05 13:56

Copyfault

Registered: Dec 2001
Posts: 466

I really wonder how a simple LSR instead of ASR #$07 can work. The timing registers NEVER go down to #$00. After #$01, instead of counting down to #$00 the regs are directly reset to the initial value (here $3e most probably). So without masking, we can not make sure that the bne-commands work as intended.

Be careful with the jitter range: it is true that a (legal) RMW opcode eats max. 7 cycles, but in combination with branch opcodes the number of cycles to be considered for jitter can be even longer. This also depends on page crossings. I once started a thread here about this... will post a link later once I've found it.

If desired I can send some acme-source code which clearly shows that there are 8 possible jitter states (value read by $dc04 goes from e.g. $10 to $17). You can do these tests yourself by experimenting with some code like

inc abs,x
bpl *-3
bmi *-5

as main routine.

The jmp ($dc03)-approach has already been done before. Ninja took the idea behind it to perfection. IIRC there was some article out there in VN.

Copyfault

2009-04-05 14:47

Copyfault

Registered: Dec 2001
Posts: 466

Me again,

the discussion I mentioned above was about Stable Raster via Timer.

Copyfault

2009-04-05 18:49

Ninja

Registered: Jan 2002
Posts: 404

Nice to see such a coding thread on CSDb \o/

While nothing beats the experience gained by doing your own timing stuff, I see some practical issues with this routine:

- As Copyfault mentioned, the 8-cycle jitter is not taken care of.

- It was often useful to me to have a counter synced to a rasterline (i.e. counting 63 cycles not 9). That makes it easier to abuse it more than once IMHO. Might be a personal preference, though.

- This routine could need several frames to reach a stable raster. Some code generation or depacking might have happened meanwhile.

- If you want to be really short, your approach won't beat $d013-based techniques, I am afraid.

Still, nice to see you playing around with it. While I see the above issues, there are still nice ideas in this one.

2009-04-05 20:13

Copyfault

Registered: Dec 2001
Posts: 466

Hey Ninja, greetings my friend,

when seeing that you posted a reply here I first thought you found a way to further optimize this branch-approach...

Maybe this is the limit (considering the used bytes/cycles-ratio).

Copyfault

2010-11-26 15:44

Hermit

Registered: May 2008
Posts: 208

I'd like to emphasize what Copyfault pointed out to me. The 8-cycle jitter really is something we have to pay attention to...
I'm coding a program, and weird things happened when the main program (outside the IRQ) executed more than a simple jmp or so (especially once the IRQ loader started to operate).
I had to slightly modify the stable-raster waiter routine in the IRQ. Adding one more bit $ea24 line seems to prevent any issues coming from the 8-cycle jitter... (even 9)

lda $dc04 ;check timer A, here it jitters between 7...1
eor #7 ;A=7-A so jitter will be 0...6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$2C 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
bit $ea24 ;IMPORTANT to handle 8th cycle jitter

Not a big effort, but from now on I have to write my stable-raster IRQs with this in mind.
(Not the best solution; a simple NOP did the job too, but that may not be as stable on the 8th cycle..)
Thanks for telling me about this 8-9 cycle jitter thingy..

Hermit Software Hungary

2011-03-25 23:54

Repose

Registered: Oct 2010
Posts: 222

I'm going to try to develop a mathematical proof of the shortest/quickest code.
According to http://visual6502.org/wiki/index.php?title=6502_all_256_Opcodes
We have these possibilities:
1 byte: 2-4 cycles
2 bytes: 2-6, 8 cycles
3 bytes: 4-7 cycles

And we are trying to create 8 delay states.
Now here's the table of delay states:
1 byte: 3 states (2, 3 or 4 cycles)
2 byte: 5 states (2, 3, 4, 5, or 6)*
3 byte: 4 states (4, 5, 6, or 7 cycles)
*We'll have to special case this later for the 8 cycles instructions;
there's no way to use two of these to get 15 cycles.
You can see that 1 byte instructions are most efficient for consuming
states.
Combining instructions doesn't double the states! I think of it this way;
at the longest delay, each instruction has the same number
of cycles, which loses a state possibility. The formula is:
total states=(state)*(n)-(n-1), where n is the number of times the instruction
is repeated, and state is the number of cycles possibilities it has.
Here's a table:
states  n  total
3       1  3
3       2  5
3       3  7
3       4  9
5       1  5
5       2  9
4       1  4
4       2  7
4       3  10

So which combinations gives at least 8 states?
size  n  total states  total bytes
1     4  9             4
2     2  9             4
3     3  10            9
In this table, size is the bytes in the opcode, n is the number of times
an instruction of that length appears in a row.
But 4 bytes isn't the best because we're overdoing it, if we use
a combination of 2 byte and 1 byte opcodes can we get 8 states in less
memory?
Combining the 1 byte and 2 byte opcodes, we get (2,3,4)cycles+(2,3,4,5,6)cycles
which is 4-10cycles or by our formula, (3 states+5 states)-(2-1)=7 states.
This isn't quite enough. However, now we consider the special case of 8 cycles,
which turns out to be very special!
We get (2,3,4)cycles+(2,3,4,5,6,8)cycles=4-12 cycles or 9 states.
Notice that there is just enough overlap here, i.e. (4+6)=10cycles and
(3+8)=11 cycles and (4+8)=12 cycles.
So theoretically, we can use just 3 bytes to write a delay between 4 to 12
cycles.
These formulas can also be used for e.g. the Z80 in the C128 to set a limit
on optimized code.
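
The state formula and the combination argument above can be checked by brute force. This is a small Python sketch; the cycle sets are taken from the lists above, while the helper names `total_states` and `reachable` are made up for this illustration.

```python
from itertools import product

def total_states(states, n):
    """Formula from the post: n repeats of an instruction with `states` cycle options."""
    return states * n - (n - 1)

def reachable(*cycle_sets):
    """All distinct total cycle counts obtainable by picking one value per instruction."""
    return sorted({sum(combo) for combo in product(*cycle_sets)})

ONE_BYTE = (2, 3, 4)                  # 1-byte opcodes: 2-4 cycles
TWO_BYTE = (2, 3, 4, 5, 6)            # 2-byte opcodes without the 8-cycle case
TWO_BYTE_WITH_8 = TWO_BYTE + (8,)     # including the special 8-cycle instructions

print(total_states(3, 4), total_states(5, 2), total_states(4, 3))
print(reachable(ONE_BYTE, TWO_BYTE))          # 1-byte + 2-byte opcode
print(reachable(ONE_BYTE, TWO_BYTE_WITH_8))   # same, with the 8-cycle option
```

Enumerating the sums agrees with the text: a 1-byte plus a 2-byte opcode covers 4 to 10 cycles (7 states), and allowing the 8-cycle instructions extends this to every value from 4 to 12 (9 states) without gaps.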

The Table of Fixed Delays
So how does this help us write the shortest delay routine? The only use
for this in short code is to use it with a computed branch. The obvious
way to do it is with something like:
lda timer1;4 cycles
asl;2 cycles
sta *+3;4 cycles
bne *+2; selfmod to branch into delay fragments (3 cycles)
xxx;3 state opcode
xxx;6 state opcode (4-12 cycles)
bne continue;do raster processing (3 cycles)
This is obviously horrible for code size but for quickest sync it's promising;
it's 20-28 cycles.
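
The 20-28 cycle figure follows from adding the fixed instruction costs quoted in the listing to the 4-12 cycle delay fragment. A quick Python check, using only the cycle counts given in the comments above (the dictionary keys are just labels):

```python
# Fixed costs as annotated in the listing (taken branches, no page crossing assumed)
FIXED_CYCLES = {
    "lda timer1": 4,
    "asl": 2,
    "sta *+3": 4,
    "bne *+2 (selfmod)": 3,
    "bne continue": 3,
}
DELAY_MIN, DELAY_MAX = 4, 12   # range of the two delay opcodes

fixed = sum(FIXED_CYCLES.values())
total = (fixed + DELAY_MIN, fixed + DELAY_MAX)
print(fixed, total)
```

The fixed part is 16 cycles, which gives the quoted range of 20 to 28 cycles.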

The "multi-threaded" code trick
Using BIT followed by data which happens to be a valid opcode, you can do
a computed branch into coincidentally up to 3 different instructions giving
you 3 states in 3 bytes. This seems obviously the most efficient way to make
a short delay; you are trading multiple custom code fragments for a single
code thread array. The code array is indexed by a byte at a time so it
can only consume one state. A first estimate of such techniques is 8 bytes
to consume 8 states. While it saves memory, it can't possibly sync quicker
than the delay above.

The Computed Delay Trick
The last way to make a delay is by calculations which take varying amounts
of time. You can use branches to add 1 cycle of delay based on each of
the flags, N, Z, and C. This can create 4 states of delay. We'll have
to do another calculation to create more delay states. It should take
two calculations and a whole bunch of branches to do it.

Combining Methods
What if you used something like bne $EA to combine threading with computation?
I think you've squeezed an extra state in there somewhere. I think this
might work but your code would be scattered all over the place; it's still
valid though as you can write other code in between.

I haven't fully worked out all the ideas but I believe I've generalized the
approaches.

This reminds me of the 3 or 4 ways of speed optimizing; you can make a loop,
unroll a loop, use "decision tree optimization" where every decision leads
to its own code fragment, or of course the easier way of doing it which is
a table of subroutines for every possible argument. This is a way to
make a two argument table but with less memory.
Has anyone made a multiply routine for every possible multiplier? I looked
at this and it's about 32 cycles max and sometimes quite less, much faster
than the table of squares method.

2011-03-27 22:37

Copyfault

Registered: Dec 2001
Posts: 466

@Repose: what exactly do you want to prove? The smallest number of bytes needed for a de-jitter-routine?

If scattering of routine fragments is allowed I guess Ninja's approach used in his 2x2-FLI-routines is the shortest possible.

Maybe I didn't fully get the idea behind your lines, but don't we need something like 'axiomatic semantics' to do a correct mathematical proof?

2011-07-19 14:00

ready.

Registered: Feb 2003
Posts: 441

Quoting Hermit

lda $dc04 ;check timer A, here it jitters between 7...1
eor #7 ;A=7-A so jitter will be 0...6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$2C 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
bit $ea24 ;IMPORTANT to handle 8th cycle jitter

@Hermit: this code doesn't patch the previous one when encountering the 8-cycle jitter. Just check this: $dc04=8, EOR #7 produces $0f and with bpl you end up out of your code.
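
The point about EOR #7 can be reproduced with a byte-level model of the patched delay block. The Python sketch below is a simplification under stated assumptions: it only decodes the four opcodes present in the block, it does not claim which timer values actually occur on real hardware, and the names `OPCODES`, `PATCHED_BLOCK` and `cycles_from` are invented for this illustration.

```python
# opcode -> (length in bytes, cycles); only the opcodes that occur in the delay block
OPCODES = {
    0xC9: (2, 2),  # cmp #imm
    0x2C: (3, 4),  # bit abs
    0x24: (2, 3),  # bit zp
    0xEA: (1, 2),  # nop
}

# cmp #$c9 / cmp #$c9 / bit $ea24 / bit $ea24 as raw bytes
PATCHED_BLOCK = bytes([0xC9] * 4 + [0x2C, 0x24, 0xEA] * 2)

def cycles_from(offset, block=PATCHED_BLOCK):
    """Sum the cycles executed when entering `block` at `offset` and running to its end."""
    pc, total = offset, 0
    while pc < len(block):
        length, cycles = OPCODES[block[pc]]
        pc += length
        total += cycles
    return total

# Branch offset produced by "eor #7" for each candidate timer reading 0..8
offsets = {timer: timer ^ 7 for timer in range(9)}
for timer, off in offsets.items():
    inside = off < len(PATCHED_BLOCK)
    print(timer, off, cycles_from(off) if inside else "outside the delay block")
```

In this model a timer reading of 8 yields offset 15, which lies beyond the 10-byte block, as described above. Readings 7..1 map to offsets 0..6 with delays of 12 down to 6 cycles, but offset 7 (which a reading of 0 would produce) costs 4 cycles rather than the 5 needed to continue the one-cycle-per-step pattern.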

2011-07-20 10:32

Frantic

Registered: Mar 2003
Posts: 1627

@Ready: Maybe i totally miss the point now just judging from the surface of things but are you missing the following?

0
1
2
3
4
5
6
7

= 8 different states, which means that $dc04==8 never happens and that $dc04==7 really is the 8th state? (As I said, I may very well be wrong now, because I don't know what $dc04 might actually end up being in the code discussed here..)
