[CSDb] - User Forums - Shortest code for stable raster timer setup

2020-01-20 16:20

Krill

Registered: Apr 2002
Posts: 2839

Shortest code for stable raster timer setup

While working on my ICC 2019 4K entry (now postponed to ICC 2020, but i hope it'll be worth the wait), i came up with this (14 bytes):

initstabilise   lda $d012
                ldx #10          ; 2
-               dex              ;   (10 * 5) + 4
                bpl -            ; 54
                nop              ; 2
                eor $d012 - $ff,x; 5 = 63
                bne initstabilise; 7 = 70

                [...]; timer setup

The idea is to loop until the same current raster line is read at the very beginning (first cycle) and at the very end (last cycle) of a raster line, implying 0 cycles jitter.

With 63 cycles per line on PAL, the delay between the reads must be 63 cycles (and not 62), reading $d012 at cycle 0 and cycle 63 of a video frame's last line (311), which is one cycle longer due to the vertical retrace.
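The cycle comments in the listing can be re-added with a few lines of Python. This is only a sanity check, not part of the routine. It assumes standard 6502 timings, a taken bne that stays within a page, and that each instruction reads $d012 in its last cycle.

```python
# Each entry: (instruction, cycles, bytes), assuming standard 6502 timings.
LOOP = [
    ("lda $d012",          4, 3),
    ("ldx #10",            2, 2),
    ("dex / bpl (x11)",   54, 3),  # 10 taken passes * 5 cycles + final pass of 4
    ("nop",                2, 1),
    ("eor $d012-$ff,x",    5, 3),  # abs,x with X=$ff crosses a page: 4 + 1
    ("bne initstabilise",  3, 2),  # taken, assumed not to cross a page
]

total_cycles = sum(cycles for _, cycles, _ in LOOP)
total_bytes = sum(size for _, _, size in LOOP)

# Both reads occur in the last cycle of their instruction, so the distance
# between them is everything after the lda up to and including the eor.
read_to_read = sum(cycles for _, cycles, _ in LOOP[1:5])

print(total_cycles, total_bytes, read_to_read)  # 70 14 63
```

So the loop is 70 cycles and 14 bytes, with the two reads 63 cycles apart, matching the comments.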

The downside is that effectively only one line per video frame is attempted, so the loop may take a few frames to terminate, and the worst case is somewhere just beyond 1 second.

The upside is that it always comes out at the same X raster position AND raster line (0), plus it leaves with accu = 0 and X = $ff, which can be economically re-used for further init code.

Now, is there an even shorter approach, or at least a same-size solution without the possibly-long wait drawback?

2020-01-20 16:26

Frantic

Registered: Mar 2003
Posts: 1627

Looks like great material for a Codebase 64 article. :) Neat code!

2020-01-20 17:03

Krill

Registered: Apr 2002
Posts: 2839

Thanks! =) But i'm not quite satisfied with this approach yet.

I've just realised that the overall number of cycles in a frame is still 312 * 63 (otherwise the timer would drift, which it doesn't). There might be something with testing $d012 on two or more lines. Possibly with more than 14 bytes of code, though.

2020-01-20 17:49

Krill

Registered: Apr 2002
Posts: 2839

13 bytes, still long wait. =)

initstabilise   ldx #11
                lda $d012
-               dex              ;   (11 * 5) + 4
                bpl -            ; 59
                eor $d012        ; 4 = 63
                bne initstabilise; 9 = 72

2020-01-20 18:44

Frantic

Registered: Mar 2003
Posts: 1627

In an actual 4K context, I guess this should be pretty minimal, since I guess you would remove the x register delay counting stuff and replace it with some other unrelated init code that you would have to execute anyway (and that wouldn't break things even if it was executed a number of times).

2020-01-20 21:18

Oswald

Registered: Apr 2002
Posts: 5017

"The idea is to loop until the same current raster line is read at the very beginning (first cycle) and at the very end (last cycle) of a raster line, implying 0 cycles jitter."

Basically it's not 0 jitter, it's just that you then know how the raster and the CPU are synchronized :P Cool idea! I don't think anyone ever coded this for size. I guess you could make it run faster if you did this in the border on consecutive lines; a bne would then jump out of the loop.

2020-01-20 21:22

Krill

Registered: Apr 2002
Posts: 2839

Frantic: True that. =)

But i wasn't so happy with the possible long wait of about one second.

Here's a fast version with 15 bytes:

initstabilise   ldx #10
                lda $d012
                lsr              ; 2
                rol              ; 2
-               dex              ;   (10 * 5) + 4
                bpl -            ; 54
                cmp $d012        ; 4 = 62
                bne initstabilise; 9 = 71

This only considers even-numbered raster lines, so the problematic last line 311 won't terminate the loop.

Note that when replacing the X register delay stuff with some code of the same cycle count, the loop must still have 71 cycles, that is, a number of cycles that is co-prime with 63. Otherwise, not all cycles in a raster line may be reached and the loop may spin endlessly.
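Why the co-prime requirement matters can be seen in a tiny Python model. It is idealised: every line is treated as exactly 63 cycles (the 62/64-cycle lines are ignored), and phases_visited is just a helper name made up for this sketch.

```python
from math import gcd

def phases_visited(loop_cycles, line_cycles=63):
    """Cycle positions within a line at which successive loop passes start."""
    phase, seen = 0, set()
    while phase not in seen:
        seen.add(phase)
        phase = (phase + loop_cycles) % line_cycles
    return seen

for loop_cycles in (71, 72):
    print(loop_cycles, gcd(loop_cycles, 63), len(phases_visited(loop_cycles)))
```

In this model a 71-cycle loop walks through all 63 positions, while a 72-cycle loop (common factor 9 with 63) only ever sees 7 of them.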

2020-01-20 21:26

Krill

Registered: Apr 2002
Posts: 2839

Quoting Oswald

Basically it's not 0 jitter, it's just that you then know how the raster and the CPU are synchronized :P

I meant that the loop always comes out at the same cycle on a raster line, with no jitter. =)

Quoting Oswald

I guess you could make it run faster if you did this in the border on consecutive lines; a bne would then jump out of the loop.

One must avoid the 64-cycles line, because this would in fact produce a 1-cycle jitter with a 63-cycles check. But i think i've got this, now. :D

2020-01-20 21:33

Krill

Registered: Apr 2002
Posts: 2839

Ah, bummer. This is the correct one with even-numbered lines only. =)

initstabilise   ldx #10
                lda $d012
                lsr              ; 2
                asl              ; 2
-               dex              ;   (10 * 5) + 4
                bpl -            ; 54
                cmp $d012        ; 4 = 62
                bne initstabilise; 9 = 71

2020-01-20 21:50

ChristopherJam

Registered: Aug 2004
Posts: 1378

13 bytes, at most a frame.

But, only works 99.9% of the time (fails to trigger DMA if it starts during line $ee), and puts about 3 and a half lines of black at the bottom of the screen for a frame.

    ldx#$ee
:   cpx $d012
    bne :-
:   dex
    bmi :-
    stx $d011

2020-01-20 22:58

Krill

Registered: Apr 2002
Posts: 2839

Hah, that's pretty dirty. :)

I briefly considered DMA-based methods, but yeah, they usually come with visual artefacts or VSP hazards and the like.

One could argue that

- nop
  lda $d012
  lsr
  asl
  [54 cycles worth of user code not touching the accu]
  cmp $d012
  bne -

with 11 bytes net size as proposed by Frantic is shorter, though. =D

2020-01-20 23:06

ChristopherJam

Registered: Aug 2004
Posts: 1378

Haha well if we can pad with other code, just put something else that doesn't touch X in place of dex:bmi *-1, and we're down to 10 bytes :)

But yes, I'm not that keen on visible artefacts even for a frame. Easier to do a DMA on line $30 at the start of a blanked frame, and put all the init code somewhere that'll be overwritten by decrunched graphics or mainloop code.

2020-01-20 23:20

Krill

Registered: Apr 2002
Posts: 2839

Quoting ChristopherJam

Haha well if we can pad with other code, just put something else that doesn't touch X in place of dex:bmi *-1, and we're down to 10 bytes :)

True, but it's quite hard to hit exactly, uhm, 1189 cycles. :)

Quoting ChristopherJam

and put all the init code somewhere that'll be overwritten by decrunched graphics or mainloop code.

In the usual size coding categories, you want the init code to be as small as possible as well, though, as the executable size counts. :)

2020-01-20 23:25

ChristopherJam

Registered: Aug 2004
Posts: 1378

Well, I'm assuming more "just enough to pad the gap between comparison becoming true and being in the DMA enabled area", so just a couple dozen cycles should be safe.

Fair point on minimizing initcode.

2020-01-21 16:16

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Krill

With 63 cycles per line on PAL, the delay between the reads must be 63 cycles (and not 62), reading $d012 at cycle 0 and cycle 63 of a video frame's last line (311), which is one cycle longer due to the vertical retrace.

Funny, didn't know that. Does Vice emulate it? Hoxs doesn't.

2020-01-21 16:52

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rastah Bar

Quoting Krill
With 63 cycles per line on PAL, the delay between the reads must be 63 cycles (and not 62), reading $d012 at cycle 0 and cycle 63 of a video frame's last line (311), which is one cycle longer due to the vertical retrace.

Funny, didn't know that. Does Vice emulate it? Hoxs doesn't.

From https://sourceforge.net/p/vice-emu/code/HEAD/tree/trunk/vice/sr..

    /* Line 0 is 62 cycles long, while line (SCREEN_HEIGHT - 1) is 64
       cycles long.  As a result, the counter is incremented one
       cycle later on line 0.  */

2020-01-21 17:33

Rastah Bar

Registered: Oct 2012
Posts: 336

Oh, I see. Thanks.

2020-01-27 21:18

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Krill

Ah, bummer. This is the correct one with even-numbered lines only. =)
initstabilise   ldx #10
                lda $d012
                lsr              ; 2
                asl              ; 2
-               dex              ;   (10 * 5) + 4
                bpl -            ; 54
                cmp $d012        ; 4 = 62
                bne initstabilise; 9 = 71

Absolutely fantastic, Krill! It feels tempting to shave off another byte by replacing [lsr:asl] with an and-instruction, i.e.

newline_loop:
           ldx #$f7
           lda $d012-$f7,x
           and inc_opcode:#$e8
           //-p-a-g-e-b-r-e-a-k
           bne inc_opcode
           cmp $d012
           bne newline_loop

Unfortunately, rasterline $00 (or $100 resp.) breaks this :(

Also gave it a try to utilise the operand of the [ldx #10] in your code as $0a -> asl, but it all ends up with >=15 bytes total netweight.

Anyway, thanks for sharing and "en passant" giving something to ponder about ;)

2020-01-28 11:37

Rastah Bar

Registered: Oct 2012
Posts: 336

I tried something like this:

start:  ldx $d011
        bpl start:
loop:   pha
        cpx $d012
        pla
        inc safe_mem,x
        bcs loop:

But it can stabilize either on the PHA or on the INC safe_mem,x so no cigar.

2020-01-28 20:12

ChristopherJam

Registered: Aug 2004
Posts: 1378

Like the mathematician's hypothetical can opener, let us assume we have 50 cycles worth of init code that we are happy to run as many as seven times over.

Then we can sync with just ten bytes of code, in at most seven frames.

sync:
    lda $d012   ; will be zero on cycles 0,1,2,3,4,5 or 6
    bne sync

    .res 25,$ea ; replace this with 50 cycles of init code

    lsr $d012   ; if it's zero, either we're too early on line 0, or we're on line 256..  
    bcc sync    ; fall through if we read on cycle 62, 56 after cycle 6

We use lsr instead of lda for the second test, as lda would result in a 63 cycle loop.
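The lsr-versus-lda remark can be verified by adding up the cycles. A rough Python tally, assuming standard 6502 timings, no page crossing on the branches, and exactly 50 cycles of payload whose bytes are not counted:

```python
from math import gcd

def sync_loop(second_test_cycles):
    """(instruction, cycles, bytes) for one full pass that branches back to sync."""
    return [
        ("lda $d012",            4, 3),
        ("bne sync (not taken)", 2, 2),
        ("init code",           50, 0),   # payload, bytes not counted
        ("lsr or lda $d012",    second_test_cycles, 3),
        ("bcc sync (taken)",     3, 2),
    ]

def tally(instrs):
    return sum(c for _, c, _ in instrs), sum(b for _, _, b in instrs)

lsr_cycles, lsr_bytes = tally(sync_loop(6))   # lsr abs is a 6-cycle RMW
lda_cycles, _ = tally(sync_loop(4))           # lda abs takes 4 cycles

print(lsr_cycles, lsr_bytes, gcd(lsr_cycles, 63))  # 65 10 1
print(lda_cycles, gcd(lda_cycles, 63))             # 63 63
```

With lda the pass would be exactly one raster line long, so the test would hit the same cycle position every time; the two extra cycles of lsr avoid that.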

2020-01-28 21:43

Copyfault

Registered: Dec 2001
Posts: 466

Quoting ChristopherJam

Like the mathematician's hypothetical can opener, let us assume we have 50 cycles worth of init code that we are happy to run as many as seven times over.

Then we can sync with just ten bytes of code, in at most seven frames.

sync:
    lda $d012   ; will be zero on cycles 0,1,2,3,4,5 or 6
    bne sync

    .res 25,$ea ; replace this with 50 cycles of init code

    lsr $d012   ; if it's zero, either we're too early on line 0, or we're on line 256..
    bcc sync    ; fall through if we read on cycle 62, 56 after cycle 6

We use lsr instead of lda for the second test, as lda would result in a 63 cycle loop.

Woohaaaa, lovely! I knew you still have some variants hidden deep in your brains. So lucky I was right:)
This [lsr $d012] is really tricky. Hope it does no harm that we actually *set* a value for the raster-irq this way, but it shouldn't as long as we have irqs disabled (which I silently assumed to be a sane setting). If the init-code uses index registers -which is usually the case-, one might want to style this approach even further...

sync:
    ldy #val
    lax $d012
    bne sync
    // here we are at cycle 3..9 of rasterline $00 (or 2..8 of rasterline $100)
    /*
    // 50 cycles of init code go here
    */
    // after that init code block we are @cycle 53..59 of rasterline $00(52..58 of line $100)
    lda $d012   // the R-cycle occurs @cycle 57..63 -> 63 = cycle 0 of line 1
                // in rasterline $100 this will end up @cycle 56..62, thus always without reaching the next line 
    beq sync    // branch not taken only when line 1 is reached

This way, we'd get x=0 and y loaded with a value of our choice on entering the init code block and we avoid _writing_ something to $d012, while still ensuring that the upper $d012-reads happen in a regular 65-cycle-distance.
However, the 10-byte-trophy stays with you, CJ! (though I'd like to argue that this [ldy#val] belongs more to the init code;);)).

Let's wait another day, maybe Krill will come up with yet another "brainbomb":)

2020-01-28 21:52

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Copyfault

[...]
sync:
    ldy #val
    lax $d012
    bne sync
    // here we are at cycle 3..9 of rasterline $00 (or 2..8 of rasterline $100)
    /*
    // 50 cycles of init code go here
    */
    // after that init code block we are @cycle 53..59 of rasterline $00(52..58 of line $100)
    lda $d012   // the R-cycle occurs @cycle 57..63 -> 63 = cycle 0 of line 1
                // in rasterline $100 this will end up @cycle 56..62, thus always without reaching the next line
    beq sync    // branch not taken only when line 1 is reached

Hmm, wait, wouldn't having a pagebreak between the upper [lax $d012] and the [beq sync] be enough?

sync:
    lax $d012
    bne sync
    /*
    // 50 cycles of init code go here
    */
    lda $d012
    //-p-a-g-e-b-r-e-a-k-
    beq sync

Ok, that nice [ldy #val] is no more, but this way the cycle-distance between the upper $d012-reads amounts to 64, which is coprime to 63.
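To see where the 64 comes from: a taken branch costs one extra cycle when its target lies in another page. A short Python sketch of the period between two upper $d012 reads, assuming standard 6502 timings and exactly 50 cycles of init code (loop_period is a name invented here):

```python
from math import gcd

def loop_period(branch_crosses_page):
    lax_abs = 4            # lax $d012
    bne_not_taken = 2      # bne sync falls through
    init_code = 50         # payload between the two tests
    lda_abs = 4            # lda $d012
    beq_taken = 4 if branch_crosses_page else 3
    return lax_abs + bne_not_taken + init_code + lda_abs + beq_taken

for crosses in (False, True):
    period = loop_period(crosses)
    print(crosses, period, gcd(period, 63))
```

Without the page break the period is 63 and the loop would test the same cycle position forever; with it the period becomes 64 and shares no factor with 63.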

2020-01-29 05:52

ChristopherJam

Registered: Aug 2004
Posts: 1378

Cheers, Copyfault.

I was wondering about the pagebreak myself, but didn't want to constrain the code (of course, as writ it needs to avoid a page break, but that's slightly easier).

I agree that the ldy#val you've proposed belongs to the init code, and hence is a valid way to avoid the RMW without increasing the code size.

All that said, my submission and your modifications to it are all vulnerable to failing if they enter in the middle of line zero - only a 0.3% chance, but still not ideal. The following take avoids that, albeit at the cost of re-introducing the write to $d012. Possibly a bonus if you want to set up a raster interrupt for line $000 or $100 mind :)

sync:
    inc $d012   ; result will be zero on cycles 0-7 or 8 of raster line $ff
    bne sync    ; (or very rarely, cycle 9..62)

    .res 27,$ea ; wait 62-8=54 cycles. Replace with 54 cycles of init code

    inc $d012   ; if result is nonzero then we are too late.
    bne sync    ; carry on if we read on cycle 62, 62 cycles after cycle 0

edit: this one probably also compresses better than the earlier versions - the two code snippets both start with EE 12 D0 D0
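Both claims are easy to check in Python. The sketch below assumes the 27-byte NOP placeholder from the listing (real init code would change the second branch offset) and standard 6502 timings; inc_d012_bne is a helper name made up for the illustration.

```python
from math import gcd

INC_ABS = 0xEE   # opcode of inc absolute
BNE = 0xD0       # opcode of bne

def inc_d012_bne(rel):
    """Machine code for 'inc $d012 : bne <rel>' (rel is a signed offset)."""
    return bytes([INC_ABS, 0x12, 0xD0, BNE, rel & 0xFF])

PAYLOAD_BYTES = 27                               # the .res 27,$ea placeholder
first = inc_d012_bne(-5)                         # bne back to sync
second = inc_d012_bne(-(5 + PAYLOAD_BYTES + 5))  # bne back to sync from the end

# inc abs, bne not taken, payload, inc abs, bne taken
loop_cycles = 6 + 2 + 54 + 6 + 3

print(first.hex(), second.hex(), loop_cycles, gcd(loop_cycles, 63))
```

The two snippets share the prefix EE 12 D0 D0 and differ only in the branch offset, and a full pass takes 71 cycles, which is co-prime with 63.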

2020-01-29 16:53

Krill

Registered: Apr 2002
Posts: 2839

Quoting ChristopherJam

sync:
    inc $d012   ; result will be zero on cycles 0-7 or 8 of raster line $ff
    bne sync    ; (or very rarely, cycle 9..62)

    .res 27,$ea ; wait 62-8=54 cycles. Replace with 54 cycles of init code

    inc $d012   ; if result is nonzero then we are too late.
    bne sync    ; carry on if we read on cycle 62, 62 cycles after cycle 0

edit: this one probably also compresses better than the earlier versions - the two code snippets both start with EE 12 D0 D0

Excellent! =D

Seems to work just fine here in my real-world code. Fast, reliable, short, packs well AND is pretty elegant. A keeper. :)

2020-01-29 18:14

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: Quoting ChristopherJam
sync:
    inc $d012   ; result will be zero on cycles 0-7 or 8 of raster line $ff
    bne sync    ; (or very rarely, cycle 9..62)

    .res 27,$ea ; wait 62-8=54 cycles. Replace with 54 cycles of init code

    inc $d012   ; if result is nonzero then we are too late.
    bne sync    ; carry on if we read on cycle 62, 62 cycles after cycle 0

edit: this one probably also compresses better than the earlier versions - the two code snippets both start with EE 12 D0 D0
Excellent! =D

Seems to work just fine here in my real-world code. Fast, reliable, short, packs well AND is pretty elegant. A keeper. :)

I haven't really tried to understand it actually. Will this lock on any raster line or just 0 or 256? Or what are the limitations?

2020-01-29 18:27

Krill

Registered: Apr 2002
Posts: 2839

$100 only, i think.

2020-01-29 18:47

Frantic

Registered: Mar 2003
Posts: 1627

How nice! :)

2020-01-29 20:59

ChristopherJam

Registered: Aug 2004
Posts: 1378

Cheers all.

Yup, locks to line $100.

Assumes sprites disabled and no interrupts occur.

2020-01-29 21:39

Oswald

Registered: Apr 2002
Posts: 5017

I dont get it, Z flag is set according to result of inc, thus line should be $ff ?

2020-01-29 21:52

ChristopherJam

Registered: Aug 2004
Posts: 1378

Sure - depends how you measure it.
The last time the 54 cycles worth of code between the snippets are executed, they will be on line $0ff.

The code always exits a few cycles into line $100.

2020-01-30 00:22

Copyfault

Registered: Dec 2001
Posts: 466

Quote: Cheers, Copyfault.

I was wondering about the pagebreak myself, but didn't want to constrain the code (of course, as writ it needs to avoid a page break, but that's slightly easier).

I agree that the ldy#val you've proposed belongs to the init code, and hence is a valid way to avoid the RMW without increasing the code size.

All that said, my submission and your modifications to it are all vulnerable to failing if they enter in the middle of line zero - only a 0.3% chance, but still not ideal. The following take avoids that, albeit at the cost of re-introducing the write to $d012. Possibly a bonus if you want to set up a raster interrupt for line $000 or $100 mind :)

sync:
    inc $d012   ; result will be zero on cycles 0-7 or 8 of raster line $ff
    bne sync    ; (or very rarely, cycle 9..62)

    .res 27,$ea ; wait 62-8=54 cycles. Replace with 54 cycles of init code

    inc $d012   ; if result is nonzero then we are too late.
    bne sync    ; carry on if we read on cycle 62, 62 cycles after cycle 0

edit: this one probably also compresses better than the earlier versions - the two code snippets both start with EE 12 D0 D0

Ooh yes, I see... the uber-motivation was too huge yesterday;)

But... this should be fixable without INC-opcodes. The chance of failing comes from the fact that we used a BEQ-check after the stuffed-in init-code with the strategy of a unique overflow situation in mind.

Now if we reverse this strategy, we could check for a unique non-overflow-situation, by just prolonging the init code part by a suitable no. of cycles.

sync:
    ldy #val
line0x100_wait:
    lax $d012
    bne line0x100_wait
    /*
    // 56 cycles of init code go here
    */
    lax $d012   // the R-cycle occurs exactly after 62 cycles of the upper R-cycle of the lax $d012
    bne sync    // this gives 0 if and only if the upper $d012-read was exactly @cycle=0 of the rasterline
                // as rasterline 0 is only 62 cycles long, this will only be
                // the case if the upper $d012-read was @cyc=0 of line=$100

The [ldy #val] is still needed for ensuring coprimeness (i.e. no. of cycles between two upper $d012-checks must not have a common divisor with 63=7*9).

Should work and should need 63 frames in the worst case. Ahh, and the lower [lax $d012] was just for having two identical codeblocks, thus should also do them compressing algorithms a favor;)

2020-01-30 07:38

Oswald

Registered: Apr 2002
Posts: 5017

At most how many frames would this awesome inc solution need to sync up?

I'd be interested in a solution that's simple AND fast. Not necessarily shortest. Would be a nice addendum to Codebase.

What I currently have, by Ninja, has a lot of code checking how it misses the end of a rasterline; a version that fits into a dozen lines would be neater.

2020-01-30 09:08

ChristopherJam

Registered: Aug 2004
Posts: 1378

The INC solution takes at most nine frames, and should average 4.5

Copyfault, there's really no harm in writing to d012, unless you wish to set it to some other value than zero in the init code between the two bookends.

2020-01-30 09:12

Oswald

Registered: Apr 2002
Posts: 5017

wow, then the inc solution is good for everything :)

2020-01-30 20:05

Copyfault

Registered: Dec 2001
Posts: 466

Quote: The INC solution takes at most nine frames, and should average 4.5

Copyfault, there's really no harm in writing to d012, unless you wish to set it to some other value than zero in the init code between the two bookends.

I somehow tend to avoid writing to a reg if there's no real purpose behind.

But back to your INC-based solution: why does it take 9 frames at most? If the upper INC happens to come at some cycle >=9 of rasterline $ff, it should take longer, more or less comparable to the alternative I presented. Or am I missing something here? AFAIU, both approaches do the same, just the rasterline where the syncing finishes is different (yours at line $100, mine at $101).

Of course, it might also be a wanted side-effect to set $D012=0 if a first raster IRQ at line 0 (or $100 resp.) makes sense.

2020-01-31 00:24

Copyfault

Registered: Dec 2001
Posts: 466

Quote: At most how many frames would this awesome inc solution need to sync up?

I'd be interested in a solution that's simple AND fast. Not necessarily shortest. Would be a nice addendum to Codebase.

What I currently have, by Ninja, has a lot of code checking how it misses the end of a rasterline; a version that fits into a dozen lines would be neater.

So you basically look for a solution that has the least raster-time demand for syncing, or am I on the wrong path?

Something like this should finish in at most eight rasterlines:

        lda #$08
        sta zp_val
        
        ldx #$fe
loop:   
wait_startline:
        cpx $d012
        bne wait_startline
        inx
        bmi wait_startline
        //at cycle 6..12 of line $ff
        ldy zp_val
waste_cycles:
        dey
        bpl waste_cycles
        
        cpx $d012
        bne loop //leaves at cycle 2 of the first line in which raster is stable ($100..$106)
done:

By debouncing the starting line, we can ensure that the cycle count at the start of the actual syncing loop lies exactly in the interval [6..12] (and never outside it). So the syncing can be done by variance cancellation, which needs one rasterline per correction cycle. As there are seven different possibilities for the variance (6,7,8,9,10,11,12), (up to) seven rasterlines are needed in total (plus the first one for ensuring a "safe start").

Maybe this can be done with shorter code, but I think not really faster (unless you really want to do variance halving, which will blow up code size too much for my taste).

2020-01-31 00:38

Oswald

Registered: Apr 2002
Posts: 5017

Sorry, I did not phrase it properly. With fast I meant that it stabilizes fast, by which I mean max ~0.3 seconds, a time span that doesn't matter for us humans :) So 9 frames max will do. However, looking at the new version and explanation: your skills at this are truly impressive, sir.

2020-01-31 11:03

Rastah Bar

Registered: Oct 2012
Posts: 336

What Krill said.

Here is another method (13 bytes, stabilizes in less than a frame). When entering from Basic, timer A of CIA#1 is running. That can be used to check if the last cycle of an RMW instruction falls on the first "BA low, AEC high" cycle of a badline, as follows:

sync: lda $dc04
      sec
      sta ZP    ;RMW instruction
      sbc $dc04
      cmp #51   
      bne sync:

If and only if the last cycle of STA ZP is executed on the first "BA low, AEC high" cycle of a badline will there be exactly 51 cycles between LDA $DC04 and SBC $DC04 and the routine will exit on the last cycle of a badline.

2020-01-31 23:50

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Copyfault

[...]
But back to your INC-based solution: why does it take 9 frames at most?[...]

This kept me awake for quite some time now. Think I have an explanation for it - finally!

If I do the calculations correctly (read: set up my surrounding framework including those frame-counters right;)), the lda-based method takes at most 7 frames. Uh, why is it now 7, even less than those 9 frames maximum for the inc-based approach?

The answer lies in the respective entry points of the delay loops. Taking a look at the INC-method, we see that it starts with

waitline:
   inc $d012
   bne waitline
   ...

Once this first waiting loop has finished, the delay part begins (the part we decided to fill with init code, for example). To simplify things, let's set aside for a moment the case of starting this code in the middle of line=$ff (the check would instantly come true while being off by more than 9 cycles from the start of that line). How many cycles into the line are we when leaving the waiting loop? It's 4 cycles iff $d012=$ff on the fourth R-cycle of the INC, but it amounts to 12 cycles iff $d012 becomes $ff one cycle later! So this gives a variance of 12-4=8.

Exactly this variance is what we need to get rid of to have a stable raster. The INC-&LDA-loops presented in this thread cancel one cycle of variance per frame. For the INC-approach, this means we need 8 frames for the worst case (i.e. 12 cycles off).
Now we still have that "bad case" I had ignored for the sake of simplification. In fact, it does not do too much harm: in case the loop really starts in the middle of the testing line ($ff in the INC-approach), the first run of the delay loop will fail. As the loop construction ensures 71 cycles between the $d012-checks at the start of each delay loop, with gcd(71,63)=1 (coprimeness), plus the fact that one run of the waiting loop takes 9 cycles, the next start of the delay loop will be at a cycle c of type c = 9*k + 71 = 9*(k+8) - 1 ≡ -1 (mod 9) [mind that 9 is a factor of 63=7*9, thus skipping a multiple of 9 will get you to the exact same cycle position of any other line (or the same in the next frame); that -1 (mod 9) ensures that the position is changed!]
This means that, from the second run of the delay loop onwards, we step through the first nine cycles of the line.
So back to counting the no. of frames needed at most: this "bad case" adds one to the frame count. So the INC-approach has a max frame count of 9.

Looking at the LDA-based method, we have a waiting loop like this:

waitline:
   lda $d012
   bne waitline
   ...

This part is finished 2-8 cycles after the beginning of line=$00. Following the above arguments, this approach needs at most 6+1(for the "bad case")=7 frames. An interesting fact is that the length of the waiting loop here is also a factor of 63 (=7*9), i.e. 7 cycles for one run. So here the c-formula looks like this: c = 7*k + 71 = 7*(k+10) + 1 ≡ 1 (mod 7). Thus we deal with a 7-cycle window in this case.
One other thing to mention is that with that lda, there's no chance to check explicitly for a unique rasterline (unless you use compare opcodes, but that'll take more bytes!!!). The fact that line=$000 consists only of 62 cycles and the construction of the delay loop ensure that the check on this line will always fail. This is no real problem either, as we hit line=$100 once per frame, so the overall approach will come to an end!
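The two congruences can be checked mechanically. A small Python sketch, assuming the 71-cycle period and the wait-loop lengths named above (9 cycles for inc/bne, 7 for lda/bne); entry_phases is just a helper name made up for this illustration, and the model ignores badlines and interrupts.

```python
from math import gcd

LOOP = 71            # cycles between the $d012 checks that start each delay loop
INC_WAIT = 6 + 3     # inc $d012 + taken bne
LDA_WAIT = 4 + 3     # lda $d012 + taken bne

def entry_phases(wait_cycles, runs, start=0):
    """Entry cycle of successive delay-loop runs, modulo the wait-loop length."""
    return [(start + n * LOOP) % wait_cycles for n in range(runs)]

print(gcd(LOOP, 63))                  # 1
print(entry_phases(INC_WAIT, 9))      # [0, 8, 7, 6, 5, 4, 3, 2, 1]
print(entry_phases(LDA_WAIT, 7))      # [0, 1, 2, 3, 4, 5, 6]
```

Each run moves the entry point by -1 within a 9-cycle window for the INC variant and by +1 within a 7-cycle window for the LDA variant, so every position in the window is reached.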

Maybe someone is interested enough to read this, maybe this was all clear to you. Anyway, I felt the urge to write it down now that I finally understood it (I think).

2020-02-01 00:06

Copyfault

Registered: Dec 2001
Posts: 466

Quote: Sorry, I did not phrase it properly. With fast I meant that it stabilizes fast, by which I mean max ~0.3 seconds, a time span that doesn't matter for us humans :) So 9 frames max will do. However, looking at the new version and explanation: your skills at this are truly impressive, sir.

Ah come on, I'm just too fond of playing around with things that seem to keep certain mathematical mysteries inside;) It does not really help to get things *done*.
On the contrary: I'd say you are the one to adore here! I will never ever reach the level of coding that you simply own, Oswald! I mean it:)

But thanks for your kind words. Gives me the positive feeling that there are people like you out there that care about explanations'n'stuff!

2020-02-01 11:54

Rastah Bar

Registered: Oct 2012
Posts: 336

I find this problem surprisingly hard to understand. I think I get most of what you are saying, but aren't you neglecting the presence of badlines? The number of cycles available to the CPU is less on badlines and can even vary because of RMW instructions in the init code. So it seems there may be cases where neither of the approaches (INC, LAX) locks. Or am I mistaken?

2020-02-01 12:02

Copyfault

Registered: Dec 2001
Posts: 466

Quote: I find this problem surprisingly hard to understand. I think I get most of what you are saying, but aren't you neglecting the presence of badlines? The number of cycles available to the CPU is less on badlines and can even vary because of RMW instructions in the init code. So it seems there may be cases where neither of the approaches (INC, LAX) locks. Or am I mistaken?

I compared the approaches with having a clean setup before doing the stabilization routine, i.e. no badlines, no irqs.

If you allow badlines, for example, the reasoning in my large posting above no longer holds. I did some measurements yesterday that show both approaches take more than 9 (resp. 7) frames with badlines enabled. I have to admit that I had no motivation to do the calculations taking the badlines in between into account, but it *could* be done...

2020-02-01 12:13

Rastah Bar

Registered: Oct 2012
Posts: 336

I guess it can be easily fixed by blanking the screen in the init code. This is often required anyway when setting up the graphics, so this is not really a constraint.

I have tried to analyze my timer-based approach.

One loop takes 18 cycles. Between the same cycle of two consecutive badlines there are 461 available cycles. If the STA ZP starts on a certain cycle of a badline (and there is no lock), it will start 7 cycles later on the next badline, because 461 = 26*18 - 7. Since a non-locking badline has 20 cycles, which is coprime with 7, the algorithm will always lock.
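Here is the arithmetic spelled out in Python. The figure of 43 CPU cycles lost per badline is an assumption on my side; it is what makes 8 * 63 come out at the 461 quoted above. The stepping model at the end is a simplification: the argument relies on the step of 7 sharing no factor with the 20-cycle window.

```python
from math import gcd

LINE_CYCLES = 63          # PAL
STOLEN_ON_BADLINE = 43    # assumption: cycles the VIC takes from the CPU on a badline
available = 8 * LINE_CYCLES - STOLEN_ON_BADLINE   # CPU cycles from one badline to the next

LOOP_CYCLES = 4 + 2 + 3 + 4 + 2 + 3   # lda abs, sec, sta zp, sbc abs, cmp #, bne taken
passes = -(-available // LOOP_CYCLES) # smallest number of passes covering the gap
shift = passes * LOOP_CYCLES - available

WINDOW = LINE_CYCLES - STOLEN_ON_BADLINE   # CPU cycles left on a badline
offsets = {(n * shift) % WINDOW for n in range(WINDOW)}

print(available, LOOP_CYCLES, passes, shift)     # 461 18 26 7
print(WINDOW, gcd(shift, WINDOW), len(offsets))  # 20 1 20
```

Stepping by 7 within a 20-cycle window reaches all 20 offsets, so in this model the locking position cannot be skipped forever.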

What are your thoughts about this?

2020-02-01 16:54

Rastah Bar

Registered: Oct 2012
Posts: 336

I can shave off one byte:

sync: lax $dc04
      sbx #51
      sta ZP      ;RMW instruction
      cpx $dc04
      bne sync:

The loop is 16 cycles and since 461 = 29*16 - 3, this also should always lock. It needs at most 20 consecutive badlines, so the very worst case is that the lower border is reached after 19 badlines and you have to start again at the first badline. So locking is guaranteed in less than 1.4 frames.

2020-02-01 21:36

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: I can shave off one byte:

sync: lax $dc04
      sbx #51
      sta ZP      ;RMW instruction
      cpx $dc04
      bne sync:

The loop is 16 cycles and since 461 = 29*16 - 3, this also should always lock. It needs at most 20 consecutive badlines, so the very worst case is that the lower border is reached after 19 badlines and you have to start again at the first badline. So locking is guaranteed in less than 1.4 frames.

Exploiting kernel setup values in dc04 and dc05 (different on PAL and NTSC)?! But we're only in PAL domain in this thread anyways.

2020-02-01 22:24

Rastah Bar

Registered: Oct 2012
Posts: 336

See post #38 for what I have in mind. Do you think this could work? I'm always a little bit afraid that I missed something.

It should lock also on NTSC since 477 = 30*16 - 3, but the routine exits on a different cycle number.
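The PAL and NTSC figures for the 16-cycle loop can be compared side by side. A Python sketch, assuming 63 and 65 cycles per line and 43 cycles lost to the VIC per badline on both systems (an assumption that reproduces the 461 and 477 quoted in this thread); drift is a helper name invented for the example.

```python
def drift(line_cycles, loop_cycles, stolen=43, lines_between_badlines=8):
    """Return (available CPU cycles, loop passes, cycles of shift per badline)."""
    available = lines_between_badlines * line_cycles - stolen
    passes = -(-available // loop_cycles)   # ceiling division
    return available, passes, passes * loop_cycles - available

LOOP_CYCLES = 4 + 2 + 3 + 4 + 3   # lax abs, sbx #, sta zp, cpx abs, bne taken

pal = drift(63, LOOP_CYCLES)
ntsc = drift(65, LOOP_CYCLES)
print(pal)    # (461, 29, 3)
print(ntsc)   # (477, 30, 3)
```

In both cases the loop start shifts by 3 cycles from one badline to the next.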

2020-02-02 12:09

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

I guess it can be easily fixed by blanking the screen in the init code. This is often required anyway when setting up the graphics, so this is not really a constraint.[...]

Forgot to stress this detail, but I had this in mind: you even do not have to set it before the start of the stabilization loop (the first check of $d012 might be a "bad case" anyway), so it suffices to blank screen/kill irqs/etc in the init code blob.

Quoting Rastah Bar

I can shave off one byte:

sync: lax $dc04
      sbx #51
      sta ZP      ;RMW instruction
      cpx $dc04
      bne sync:

The loop is 16 cycles and since 461 = 29*16 - 3, this also should always lock. It needs at most 20 consecutive badlines, so the very worst case is that the lower border is reached after 19 badlines and you have to start again at the first badline. So locking is guaranteed in less than 1.4 frames.

This one looks quite clever, though I did not deep-check "all the math" behind it. One thing (besides the badline-timing) that might also cause a cycle-mismatch at the cpx $dc04-instruction is the behaviour of the timers: afair, it never reaches the $00-value, but gets initialized with the max-value (so $dc04 outputs the same value in two consecutive cycles, but never $00).
And as a sidenote: a STA ZP is just a write-instruction, no Read-Modify-Write (the RMW-comment in your code examples confused me a little;)). But the idea you posted with one write-cycle is correct and should work...

2020-02-02 18:21

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

Quoting Rastah Bar
I guess it can be easily fixed by blanking the screen in the init code. This is often required anyway when setting up the graphics, so this is not really a constraint.[...]
Forgot to stress this detail, but I had this in mind: you even do not have to set it before the start of the stabilization loop (the first check of $d012 might be a "bad case" anyway), so it suffices to blank screen/kill irqs/etc in the init code blob.

Yes, you are right.
Quoting Copyfault

This one looks quite clever, though I did not deep-check "all the math" behind it. One thing (besides the badline-timing) that might also cause a cycle-mismatch at the cpx $dc04-instruction is the behaviour of the timers: afair, it never reaches the $00-value, but gets initialized with the max-value (so $dc04 outputs the same value in two consecutive cycles, but never $00).
And as a sidenote: a STA ZP is just a write-instruction, no Read-Modify-Write (the RMW-comment in your code examples confused me a little;)). But the idea you posted with one write-cycle is correct and should work...

Thanks for your feedback. $dc04 can reach 0 because it is linked with $dc05. So as long as $dc05>0, $dc04 goes from 0 to $ff, and there is no problem. But when ($dc05,$dc04)=$0001 it goes directly to $4025 after that (on PAL), but that cannot cause an accidental lock. It only may delay the locking a bit. So there is no problem there, I think.

You are right, STA ZP is an RRW instruction, but the W-cycle at the end is important.

2020-02-03 11:43

Rastah Bar

Registered: Oct 2012
Posts: 336

How would one setup a timer in NTSC? Since 65 = 5*13 a timer period of 13 seems the most obvious choice.

2020-07-02 17:06

Quiss

Registered: Nov 2016
Posts: 37

This is an idea I got after talking to Copyfault.
At least in the cycle-correct version of Vice (i.e., x64sc) this seems to work. Haven't tried on a real machine.

* = $0f00  ; Some address with (H+1)&1 = 0 and (H+1)&$10 = $10

       ldy #$00
loop:  ldx #$11
       shx cont, y
cont:  bpl loop

It uses the fact that we will AND the written value with H+1 unless a badline pauses the CPU between the third and fourth cycle of shx. The latter then changes the "bpl" into an "ora" and drops us out of the loop at horizontal position 61.

2020-07-02 17:36

Rastah Bar

Registered: Oct 2012
Posts: 336

Only 9 bytes! Very nice use of the peculiarities of SHX too!

2020-07-02 18:19

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: This is an idea I got after talking to Copyfault.
At least in the cycle-correct version of Vice (i.e., x64sc) this seems to work. Haven't tried on a real machine.

* = $0f00  ; Some address with (H+1)&1 = 0 and (H+1)&$10 = $10

       ldy #$00
loop:  ldx #$11
       shx cont, y
cont:  bpl loop

It uses the fact that we will AND the written value with H+1 unless a badline pauses the CPU between the third and fourth cycle of shx. The latter then changes the "bpl" into an "ora" and drops us out of the loop at horizontal position 61.

Haha! Wow!

2020-07-02 19:05

Burglar

Registered: Dec 2004
Posts: 1031

Quoting Quiss

* = $0f00  ; Some address with (H+1)&1 = 0 and (H+1)&$10 = $10

       ldy #$00
loop:  ldx #$11
       shx cont, y
cont:  bpl loop

wait what?? I need to look up SHX... at first glance this does not make any sense to me :)

2020-07-02 19:17

ChristopherJam

Registered: Aug 2004
Posts: 1378

Holy shit, that’s brilliant! Well found.

2020-07-02 19:19

Burglar

Registered: Dec 2004
Posts: 1031

even Crossbow cannot beat this!

2020-07-02 19:29

chatGPZ

Registered: Dec 2001
Posts: 11113

i so have to steal this and use as an example in my pdf :)

edit: quick test on C64 confirms it works :)

2020-07-02 20:16

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: i so have to steal this and use as an example in my pdf :)

edit: quick test on C64 confirms it works :)

So at a controlled X pos but at a ”random” y*8+c pos depending on $d011, which is good enough to launch a 63c timer ofc.

2020-07-02 20:43

TWW

Registered: Jul 2009
Posts: 541

Damn, nice one.

2020-07-03 00:44

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Quiss

This is an idea I got after talking to Copyfault.
At least in the cycle-correct version of Vice (i.e., x64sc) this seems to work. Haven't tried on a real machine.

* = $0f00  ; Some address with (H+1)&1 = 0 and (H+1)&$10 = $10

       ldy #$00
loop:  ldx #$11
       shx cont, y
cont:  bpl loop

It uses the fact that we will AND the written value with H+1 unless a badline pauses the CPU between the third and fourth cycle of shx. The latter then changes the "bpl" into an "ora" and drops us out of the loop at horizontal position 61.

Lovely!!! Quiss, I knew you will come up with exactly this kind of brilliance sooner or later. Sooo good to have you back;)

If you want to "overdo" (optimize, erm) this, let's save another 2 bytes:

* = $0faa  ; _a very nice_ address with (H+1)&1 = 0 and (H+1)&$10 = $10

0FAA   loop:  ldx #$11
0FAC          shx cont, y
0FAF   cont:  bpl loop

with start address $0FAD (you guess the operand bytes of the SHX ; ))

Branching directly to the SHX-opcode should also work (8-cycle loop instead of 10-cycle loop, both coprime to 63), though I'm not sure which one will be faster.

Only "drawback" is that you do not know at which raster position you end up, only that it will be (at the very end of) a badline. Not too bad for my taste :))

2020-07-03 08:47

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

If you want to "overdo" (optimize, erm) this, let's save another 2 bytes:

* = $0faa  ; _a very nice_ address with (H+1)&1 = 0 and (H+1)&$10 = $10

0FAA   loop:  ldx #$11
0FAC          shx cont, y
0FAF   cont:  bpl loop

with start address $0FAD (you guess the operand bytes of the SHX ; ))

Interesting idea, but I do not completely understand it. How does the Y register get the right value?

I was thinking about possibly saving one byte, if one could find a suitable start address and a ZP location with the right contents after entering from basic

* = $????     ;magic address that allows us to save 1 byte

      lax ZP  ;another one of those magic addresses
      tay
loop: shx cont,y
cont: bpl loop:

There might also exist variations where you let the SHX instruction change itself or change the value after the BPL into f.e. 0 (or another suitable value).

2020-07-03 17:28

Quiss

Registered: Nov 2016
Posts: 37

Neat! Right, no reason to make those two address bytes go to waste. :)

Another amusing thing to contemplate is how this code could be placed at, say, $08xx. Preferably without messing up the basic upstart.

Also, careful with the loop length. The number of CPU cycles between two badlines is 461, except when the loop's one write cycle (last cycle of SHX) sneaks into the three cycle RDY grace period. Then it's 462 ticks.
(Imagine a graph with n nodes, in which node i is connected to node (i+461)%n for 0 < i < n-1 and to (i+462)%n for i = 0. Node n-1 isn't connected to anything. You want that graph to be acyclic.)
In the range 5-20, the lengths that do work are 5, 10, 12, 16, 18 and 19. But note that in particular, length 8 (a.k.a. branching directly to the SHX) does not.
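The graph described above can be written down directly. The following Python sketch is a literal reading of that description, not Quiss's own tool: node n-1 is treated as the state that exits the loop, node 0 advances by 462, every other node by 461, and a loop length counts as working if every starting node reaches n-1. The function name `always_exits` is made up for illustration.

```python
def always_exits(n, step=461, slipped_step=462):
    """True if every start node of the n-node graph reaches node n-1."""

    def successor(i):
        # Node 0 is the case where the write cycle slips into the RDY
        # grace period, so the distance to the next badline is 462.
        return (i + (slipped_step if i == 0 else step)) % n

    for start in range(n - 1):       # node n-1 itself is the exit state
        node, seen = start, set()
        while node != n - 1:
            if node in seen:         # revisited a node: stuck in a cycle
                return False
            seen.add(node)
            node = successor(node)
    return True


working = [n for n in range(5, 20) if always_exits(n)]
print(working)
```

For lengths 5 to 19 this reproduces the list given above, and it rejects length 8. Under this literal reading length 20 would also pass, which the quoted list does not include, so treat the sketch as an illustration of the model rather than a replacement for the original analysis.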

2020-07-03 18:05

Oswald

Registered: Apr 2002
Posts: 5017

nice to see Quiss rising from his grave, hopefully it means some Rfx demo is cooking :)

2020-07-03 21:40

Rastah Bar

Registered: Oct 2012
Posts: 336

A loop length of 8 does not work because sometimes you have 462 cycles?

2020-07-04 02:11

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Quoting Copyfault

If you want to "overdo" (optimize, erm) this, let's save another 2 bytes:

* = $0faa  ; _a very nice_ address with (H+1)&1 = 0 and (H+1)&$10 = $10

0FAA   loop:  ldx #$11
0FAC          shx cont, y
0FAF   cont:  bpl loop

with start address $0FAD (you guess the operand bytes of the SHX ; ))

Interesting idea, but I do not completely understand it. How does the Y register get the right value?
[...]

The SHX will be SHX $0fa0,y. Then you start this with a JMP $0fad which is just an LDY #$0f. This is also the reason for that code blob to begin at $0faa.

Quoting Quiss

[...]Another amusing thing to contemplate is how this code could be placed at, say, $08xx. Preferably without messing up the basic upstart.

Hmm, sounds like a good weapon against boredom;) Speak up if you found a nice variant!

Quoting Quiss

Also, careful with the loop length. The number of CPU cycles between two badlines is 461, except when the loop's one write cycle (last cycle of SHX) sneaks into the three cycle RDY grace period. Then it's 462 ticks.
(Imagine a graph with n nodes, in which node i is connected to node (i+461)%n for 0 < i < n-1 and to (i+462)%n for i = 0. Node n-1 isn't connected to anything. You want that graph to be acyclic.)
In the range 5-20, the lengths that do work are 5, 10, 12, 16, 18 and 19. But note that in particular, length 8 (a.k.a. branching directly to the SHX) does not.

Yes, yes, it's so true! After I wrote this post, two things haunted me some hours later: 1. that branching to SHX is not possible when doing the SHX $0fa0,Y-trick, so it was confusing to start with it and writing that branch-idea after it
2. those 63 cycles do only apply for non-badlines, but your approach needs badlines badly (pun intended). So my calculation was wrong. Thanks for putting this right, with the corresponding cycle calculations included :)

2020-07-04 09:39

Perplex

Registered: Feb 2009
Posts: 254

Quoting Copyfault

0FAA   loop:  ldx #$11
0FAC          shx cont, y
0FAF   cont:  bpl loop

with start address $0FAD (you guess the operand bytes of the SHX ; ))

Smooth, but I think you must do "SHX cont-15,y" for this to work? Otherwise there will be $AF at $0FAD.

2020-07-04 09:48

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

The SHX will be SHX $0fa0,y. Then you start this with a JMP $0fad which is just an LDY #$0f. This is also the reason for that code blob to begin at $0faa.

Excellent! Thanks for the explanation. I had the same problem as Perplex. The code you posted gives LDA $100F.

Quoting CopyFault

Yes, yes, it's so true! After I wrote this post, two things haunted me some hours later: 1. that branching to SHX is not possible when doing the SHX $0fa0,Y-trick, so it was confusing to start with it and writing that branch-idea after it
2. those 63 cycles do only apply for non-badlines, but your approach needs badlines badly (pun intended). So my calculation was wrong. Thanks for putting this right, with the corresponding cycle calculations included :)

461 is a prime number, so I don't understand the (a)cyclic graph explanation. Could someone explain it in a bit more depth, please?

2020-07-04 20:12

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Quoting Copyfault

The SHX will be SHX $0fa0,y. Then you start this with a JMP $0fad which is just an LDY #$0f. This is also the reason for that code blob to begin at $0faa.

Excellent! Thanks for the explanation. I had the same problem as Perplex. The code you posted gives LDA $100F.
[...]

Ooops, this was of course wrong in the post, but correct in my head. So, excuses to Perplex, Rastah Bar and all readers - though I think the optimization was obvious, more or less;)

So next task is to get this <7 bytes *ducks+runs*

CF

2020-07-04 21:39

Rastah Bar

Registered: Oct 2012
Posts: 336

Perhaps it might be possible in 6 bytes if done something like this:

*=$XX??     ;Suitable starting address we have to find.

loop:	tay
uoc:	shx cont-offset,y
cont:	bpl loop:

If cont-offset equals $XXA7, jumping to uoc+1 (like in Copyfault's modification) would execute LAX $XX first.
The problem is to find a suitable offset and starting address.

Note that the SHX instruction does not necessarily have to change the BPL instruction, but could change the byte after the BPL or perhaps even the SHX instruction itself.
This gives a bit more freedom to solve the problem. Perhaps the BPL could be a different branch instruction too.

2020-07-04 22:53

Quiss

Registered: Nov 2016
Posts: 37

Quoting Rastah Bar

461 is a prime number, so I don't understand the (a)cyclic graph explanation. Could someone explain it in a bit more depth, please?

I drew a diagram of the state transitions: https://photos.app.goo.gl/g6Ba3YzbzBe1GSq67
Maybe that clarifies things? Happy to elaborate further.

2020-07-05 10:13

Rastah Bar

Registered: Oct 2012
Posts: 336

Thanks! That makes it a lot clearer. So the possibility of having 462 cycles due to the W-cycle can result in a periodic loop that excludes the "golden" R-cycle that would end the loop.

I think though that the borders could make some of the other loop lengths work.

The border is 112*63 cycles = 7056 cycles.

So perhaps a loop, when it is stuck in a wrong loop on the visible screen, will be "shifted out" of that loop in the border.

This will happen for a loop length of 17, since 7056 = 415*17 + 1.
Loop lengths 11, 13, and 15 still don't seem to work.

(Btw, with longer loop lengths there may be more W-cycles in the loop.)

2020-07-05 20:39

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

So next task is to get this <7 bytes *ducks+runs*

Here is one example of a six bytes loop. ZP-address $39 contains the basic line number. So that one is free to choose, if we choose 17 decimal, then all registers will contain $11 after the SYS with the code:

17 SYS 14774 : REM $39B6

*=$39B4 (14772 decimal)
39B4 loop:  TAY
39B5        SHX $39A7,Y
39B8        BPL loop

The SYS 14774 jumps to LAX $39, which contains $11. Since $3A AND $11 equals $10, the SHX stores the BPL opcode at $39B8, except when the &(H+1) drops off due to DMA. Then it stores opcode $11, which is ORA (ZP),Y and the loop exits.
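The byte values in this example can be checked with a few lines of Python. This is only a sketch of the SHX behaviour as described in this thread (value X & (H+1) normally, plain X when DMA hits the critical cycle), with no page crossing involved; `shx_value` is an illustrative helper, not an emulator.

```python
def shx_value(x, operand_hi, dma):
    """Value SHX stores: X & (H+1) normally, plain X if DMA drops the AND."""
    return x if dma else x & ((operand_hi + 1) & 0xFF)


X = Y = 0x11            # BASIC line number 17, loaded by LAX $39 and TAY
OPERAND = 0x39A7        # SHX $39A7,Y

target = OPERAND + Y    # low byte $A7 + $11 stays within the page
normal = shx_value(X, OPERAND >> 8, dma=False)
on_dma = shx_value(X, OPERAND >> 8, dma=True)

print(hex(target), hex(normal), hex(on_dma))
```

The write lands on the branch opcode at $39B8; it is rewritten as $10 (BPL) on every normal pass and becomes $11 only when the AND with H+1 drops off.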

Surely there are other working examples, with perhaps more convenient addresses?

2020-07-05 22:23

Quiss

Registered: Nov 2016
Posts: 37

Quoting Rastah Bar

I think though that the borders could make some of the other loop lengths work.

The border is 112*63 cycles = 7056 cycles.

So perhaps a loop, when it is stuck in a wrong loop on the visible screen, will be "shifted out" of that loop in the border.

This will happen for a loop length of 17, since 7056 = 415*17 + 1.

Oh, good point! Indeed, loop length of 17 gets "fixed" by the border. Depending on which cycle you land on initially, loop exit gets delayed by up to three frames, but it'll eventually align.
A similar thing happens with 27, which takes up to four frames to align.
Those seem to be the only "special" (multi-frame) cases below 30.

Quoting Rastah Bar

(Btw, with longer loop lengths there may be more W-cycles in the loop.)

Right. Any extra W cycles would make things more complicated. I guess they could both break loops and create loops.

2020-07-05 23:14

Rastah Bar

Registered: Oct 2012
Posts: 336

The six byte example as shown does not work since it could get interrupted. An SEI should be executed first before jumping into the loop.

2020-07-06 01:10

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

The six byte example as shown does not work since it could get interrupted. An SEI should be executed first before jumping into the loop.

Hmm, this is more or less the same for all examples we had before, or do I miss smth?? Ok, here a basic sys-line is part of the trick, but it could also jump to an init routine that does SEI or LDA #$7F:STA $DC0D:etc. first.

Another instability of the approach with having the operand bytes of the SHX actually doing something useful (i.e. filling a register) is that y != 0 in all the examples we had until now. This can lead to the ORA ($f9),Y taking one cycle more, depending on the content of $f9/$fa.

So I vote for $14 instead of $11 as value for X. It'll change the BPL into a NOP zp,X which always takes 4 cycles.

2020-07-06 09:04

Rastah Bar

Registered: Oct 2012
Posts: 336

It can be fixed by putting an SEI in a 7 byte loop. Adding your improvement as well:

20 SYS 14777 : REM $39B9

*=$39B6 (14774 decimal)

39B6 loop:  SEI
39B7        TAY
39B8        SHX $39A7,Y
39BB        BPL loop

But since the basic SYS 14777 instruction occupies one more byte in memory than a SYS to $08xx, I suppose we have to count this as an eight bytes method.

2020-11-27 11:13

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Quiss

Neat! Right, no reason to make those two address bytes go to waste. :)

Another amusing thing to contemplate is how this code could be placed at, say, $08xx. Preferably without messing up the basic upstart.
[...]

Not necessarily the shortest piece of code, but it satisfies your requirement to have the sync routine at $08xx:

08A3 A2 9E   LDX #$9E
08A5 A0 08   LDY #$08
08A7 E0 00   CPX #$00
08A9 D0 F9   BNE $08A4

Branching to $08A4 leads to SHX $08A0,Y, so the operand byte of the CPX (at $08A8) is altered continuously. As long as the "&(hi+1)" plays in, the compare operand will be =$08. When "&(hi+1)" disappears, the full $9E is written to the operand byte and the loop will end. This happens iff the critical SHX-cycle happens on a badline.
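A quick way to double-check the values involved is a small Python sketch. It assumes the SHX behaviour described in this thread (value X & (H+1) unless DMA hits the critical cycle) and no page crossing; `stored_value` and `bne_taken` are illustrative names only.

```python
X, Y, OPERAND = 0x9E, 0x08, 0x08A0   # LDX #$9E, LDY #$08, SHX $08A0,Y


def stored_value(dma):
    """Byte that SHX writes into the CPX operand."""
    hi_plus_1 = ((OPERAND >> 8) + 1) & 0xFF
    return X if dma else X & hi_plus_1


def bne_taken(dma):
    """CPX #operand followed by BNE: the branch is taken while X differs."""
    return X != stored_value(dma)


target = OPERAND + Y                 # address that SHX overwrites
print(hex(target), hex(stored_value(False)), bne_taken(False), bne_taken(True))
```

With Y = $08 the write lands at $08A8, so that is where the CPX operand byte has to live for the trick to work; the loop keeps branching until the un-ANDed $9E is stored.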

2020-11-27 22:37

Rastah Bar

Registered: Oct 2012
Posts: 336

Excellent!

2020-11-28 11:56

Jammer

Registered: Nov 2002
Posts: 1289

Stupid question from a layman - where this stabilizing piece of code is supposed to go exactly to do its job and not crash the whole thing? :)

2020-11-28 12:06

Rastah Bar

Registered: Oct 2012
Posts: 336

$08A3

2020-11-28 12:22

Jammer

Registered: Nov 2002
Posts: 1289

Quoting Rastah Bar

$08A3

LOL! :D

So it's supposed to be in fixed place in order to work properly? I asked rather broad, relatively to usual structure of inits etc.

2020-11-28 12:44

Rastah Bar

Registered: Oct 2012
Posts: 336

I guess you did not follow the thread very carefully. The (currently) shortest code that can be placed anywhere was proposed in post #44. Quiss came up with a very bright idea in post #50 that uses the instabilities of the SHX instruction. It uses less RAM, but it has some restrictions on code location. Shorter variants were found, but they have much stronger location restrictions.

2020-11-28 12:51

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Jammer

Stupid question from a layman - where this stabilizing piece of code is supposed to go exactly to do its job and not crash the whole thing? :)

No worries, we're all laymen in some field(s);)

Short answer: it's a short routine that ends on a fixed cycle position of a PAL-rasterline.

If you want a bit more: quite often a raster stabilizing routine is needed for some funky raster code. One approach is to initialize a timer in such a way that you can read it at the beginning of your RASTER irq and use it as a counter for the cycle jitter. This requires some routine that actually initializes the timer somewhere in your init code.

Since every raster line has the exact same no. of cycles, it boils down to init the timer relative to that total no. of cycles of a line. Quiss came up with the initial idea to "wait for a badline" utilising the SHX abs,Y. So effectively his (and also my) routine loops until you hit a badline at an exact cycle position, thus ending up on a unique cycle after the badline. Here you'd usually insert your timer start trigger.

If you *really* want even more detail, feel free to pm me and I'll act as your personal explainer ;)

2020-11-28 13:14

Jammer

Registered: Nov 2002
Posts: 1289

I understand stabilizing in general but I wasn't sure if this one is supposed to be triggered once at all, once per vbl, once per line or sth. To my knowledge timer based interrupts are supposed to be stabilized every call. That's at least what THCM does :)

2020-11-28 13:18

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: I understand stabilizing in general but I wasn't sure if this one is supposed to be triggered once at all, once per vbl, once per line or sth. To my knowledge timer based interrupts are supposed to be stabilized every call. That's at least what THCM does :)

The code examples in this thread are meant to run once prior to setting a timer at a precisely known cycle. This timer can then be used every time at the start of an interrupt routine to stabilize it (that is, compensate for the jitter).

2020-11-28 18:36

Copyfault

Registered: Dec 2001
Posts: 466

Yes, Rastah Bar pointed it out already: all routines here are meant to run only once during init. Purpose of such a timer-init-routine is to know the exact cycle position (at least relative to a rasterline) at the end of the routine - without using timers, ofcourse;)

Main reason for adding yet another variant was Quiss' question whether his routine can be moved to mem-area $08xx. Turns out that it *is* possible;)

Ah, and to disarm the "routine must sit at a fixed mem-pos"-argument: with my approach the timer-init can be placed in almost any mem page - only the position within that page is fixed! So not fully flexible, but not too rigid either.

2020-11-29 00:22

Copyfault

Registered: Dec 2001
Posts: 466

Found another one, but alas this time fully mem-adress-fixed:

$183c a0 9e  ldy #$9e
$183e a2 19  ldx #$19
$1840 18     clc
$1841 10 fa  bpl $183d

This saves another byte \o/

The CLC is just a 2-cycle place-holder that is replaced by $19 = ORA abs,Y when the SHX $19a2,Y hits the correct cycle in a badline. This ORA abs,Y effectively "eats up" the branch and ends the loop.

Mind we could also use other branch instructions (BCC springs to mind) but I decided to use BPL to avoid page-crossing (also saving a cpu cycle).

This example also shows that with SHX abs,Y and the likes, page-crossing must be carefully planned ($19a2 + y = $19a2 + $9e = $1a40, but the hi-byte is distorted s.t. it ends up as x & (hi+1) = $19 & $1a = $18).

2020-11-29 11:06

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Found another one, but alas this time fully mem-adress-fixed:

$183c a0 9e  ldy #$9e
$183e a2 19  ldx #$19
$1840 18     clc
$1841 10 fa  bpl $183d
This saves another byte \o/

The CLC is just a 2-cycle place-holder that is replaced by $19 = ORA abs,Y when the SHX $19a2,Y hits the correct cycle in a badline. This ORA abs,Y effectively "eats up" the branch and ends the loop.

Mind we could also use other branch instructions (BCC springs to mind) but I decided to use BPL to avoid page-crossing (also saving a cpu cycle).

This example also shows that with SHX abs,Y and the likes, page-crossing must be carefully planned ($19a2 + y = $19a2 + $9e = $1a40, but the hi-byte is distorted s.t. it ends up as x & (hi+1) = $19 & $1a = $18).

The code location can't be $183c, can it? Also the SHX instruction behaves unpredictably when a page is crossed, so I'm afraid this one won't work.

Btw, maybe you could summarize all the known allowed code locations where any SHX or SHY variant could work?

2020-11-29 14:13

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

The code location can't be $183c, can't it? Also the SHX instruction behaves unpredictable when a page is crossed, so I'm afraid this one won't work.

Hm, I think it should, as the SHX has $19a2 as operand bytes. Adding Y=$9e and taking the wrong fixup into account gives $19a2 + $9e = $1a40 =(wrong fixup)= $1840. This is the address of the CLC.

So starting the code at $183c is mandatory for that snippet to work.

Quoting Rastah Bar

Btw, maybe you could summarize all the known allowed code locations where any SHX or SHY variant could work?

I remember CJam had examined this in detail and posted it in some thread here on csdb. After a short forum scan I found it: https://csdb.dk/forums/?roomid=11&topicid=94460.

CJam summarised his findings in a very nice table (adjusted to SHX):

high byte of address written to, when:
 +--------+------------------+---------------+
 |        | no DMA on cycleN | DMA on cycleN |
 +--------+------------------+---------------+
 |page    |                  |               |
 |not     |        H         |       H       |
 |crossed |                  |               |
 +--------+------------------+---------------+
 |page    |                  |               |
 |crossed |     X&(H+1)      |   X&(H+1)     |
 |        |                  |               |
 +--------+------------------+---------------+

value written, when:
 +--------+------------------+---------------+
 |        | no DMA on cycleN | DMA on cycleN |
 +--------+------------------+---------------+
 |page    |                  |               |
 |not     |     X&(H+1)      |       X       |
 |crossed |                  |               |
 +--------+------------------+---------------+
 |page    |                  |               |
 |crossed |     X&(H+1)      |       X       |
 |        |                  |               |
 +--------+------------------+---------------+

Here H is the hi-byte of the SHX-operand. For my example, this means there's page-crossing all the time (since y=$9e)- luckily, the strange hi-byte-fixup does not depend on the DMA-at-cycle-N-condition.
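The two tables can be folded into one small function. The Python sketch below is just a transcription of the tables as quoted here, applied to the $183c example; `shx` is an illustrative helper and not a cycle-exact emulation.

```python
def shx(x, y, operand, dma):
    """Return (address, value) written by SHX operand,Y per the tables."""
    hi = operand >> 8
    mask = (hi + 1) & 0xFF
    lo_sum = (operand & 0xFF) + y
    value = x if dma else x & mask       # second table: value written
    if lo_sum > 0xFF:                    # page crossed
        addr_hi = x & mask               # first table: distorted high byte
    else:
        addr_hi = hi
    return (addr_hi << 8) | (lo_sum & 0xFF), value


# The $183c example: LDY #$9E, LDX #$19, SHX $19A2,Y
addr, normal = shx(0x19, 0x9E, 0x19A2, dma=False)
_, on_dma = shx(0x19, 0x9E, 0x19A2, dma=True)
print(hex(addr), hex(normal), hex(on_dma))
```

Under the tables' rules the write always lands on $1840 (the CLC), storing $18 (CLC) normally and $19 (ORA abs,Y) when DMA hits the critical cycle.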

2020-11-29 15:46

Rastah Bar

Registered: Oct 2012
Posts: 336

Thanks! I should have looked at the latest version of the "No More Secrets" document.

2020-12-02 01:24

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Thanks! I should have looked at the latest version of the "No More Secrets" document.

True that! Should've used the opportunity to do some advertisement for this fine document - my group fellow Groepaz does an outstanding job creating and continuously extending it!

Quoting Rastah Bar

[...]The (currently) shortest code that can be placed anywhere was proposed in post #44.

Well, I feel like a little correction is appropriate:
1. while the approach with the timers is fine and I really like it, it still depends on the timers being correctly set. Ok for most init situations, but in general dangerous to rely on.
2. The "code bracket approach" introduced by CJam (post#23) and the pure LDA-variant thereof I posted in #31 are also 12 (or even 10) bytes long (still arguable if that LDY #val must really be counted, but ok) and have no mem location constraints other than the fact that some lines of code must be put inside of this "bracket".

Quoting Rastah Bar

Quiss came up with a very bright idea in post #50 that uses the instabilities of the SHX instruction. It uses less RAM, but it has some restrictions on code location. Shorter variants were found, but they have much stronger location restrictions.

So let's get rid of these location restrictions;) Since all the routines are meant to be run during init, why not let the sync loop do some init'ing also? So if we want a zp-adress zp_pos=$00..$fe to be init'ed with an initval(!=0), this routine will do it:

      ldx #initval
      ldy #<(zp_pos+1)
loop: shx $ffff,y
      lda <($100 + zp_pos - initval),x
      beq loop

The routine ends with accu=initval, which is also stored at zp_pos in the zeropage. It works from any mem location. Only restriction is that zp_pos=$ff is not permitted. It can be shortened even further iff special init values and/or zp-addresses are used (got it down to 8 bytes so far), but I leave it as is now.
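To see why this lands on zp_pos from any code location, here is a small Python sketch of the address arithmetic. It assumes the SHX rules from the table earlier in the thread, including that H+1 wraps to $00 for H = $ff; `initval` and `zp_pos` are given arbitrary example values.

```python
initval, zp_pos = 0x42, 0x10          # arbitrary example values (assumptions)

x = initval                           # ldx #initval
y = (zp_pos + 1) & 0xFF               # ldy #<(zp_pos+1)

mask = (0xFF + 1) & 0xFF              # H+1 for operand $ffff wraps to $00
lo_sum = 0xFF + y                     # always crosses the page for y >= 1
shx_addr = ((x & mask) << 8) | (lo_sum & 0xFF)
value_normal = x & mask               # stored on an ordinary pass
value_on_dma = x                      # stored when DMA drops the AND

# lda <($100 + zp_pos - initval),x  (zero page indexed, wraps within page 0)
lda_addr = (((0x100 + zp_pos - initval) & 0xFF) + x) & 0xFF

loop_cycles = 5 + 4 + 3               # shx abs,y + lda zp,x + beq taken
print(hex(shx_addr), hex(lda_addr), value_normal, hex(value_on_dma), loop_cycles)
```

Both accesses resolve to zp_pos, the ordinary pass stores 0 (so BEQ loops), and the pass that hits the badline stores initval and exits, with a 12-cycle loop length.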

Will check codebase within the next days and add some of the routines of this thread, unless someone tells me it's already up there.

2020-12-02 13:33

Rastah Bar

Registered: Oct 2012
Posts: 336

The STA $ZP instruction (see post #44) can be made part of the init code, which reduces the timer-based stabilization approach to effectively 10 bytes:

      ldy #init_value  ;Init code
sync: lax $dc04
      sbx #51
      sty ZP      ;RRW instruction. Part of init code.
      cpx $dc04
      bne sync:

STY ABS is also allowed, in combination with SBX #52.

If I'm not mistaken, this should work on PAL, NTSC, and DREAN, but the loop exit cycle may depend on the system.

2020-12-02 18:13

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

The STA $ZP instruction (see post #44) can be made part of the init code, which reduces the timer-based stabilization approach to effectively 10 bytes:

      ldy #init_value  ;Init code
sync: lax $dc04
      sbx #51
      sty ZP      ;RRW instruction. Part of init code.
      cpx $dc04
      bne sync:

STY ABS is also allowed, in combination with SBX #52.

If I'm not mistaken, this should work on PAL, NTSC, and DREAN, but the loop exit cycle may depend on the system.

But we both agree that this is not the shortest, but rather one of the shortest approaches that work without mem loc constraints, don't we?;) The code bracket with INCs is 10 bytes long, the other one with LDAs sums up to 10 + 2(for an LDY #val that effectively is also part of the init code) in total...

Now concerning the different VIC systems in your timer based approach, I wonder whether the operand of the sbx must be adjusted according to the no. of cycles per line or if this works with subtrahend 51 on all systems for some magic reason...

2020-12-02 18:18

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

      ldx #initval
      ldy #<(zp_pos+1)
loop: shx $ffff,y
      lda <($100 + zp_pos - initval),x
      beq loop

Nice! LDA <($100 + zp_pos - initval),X can simply be replaced with LDA zp_pos, I think.

2020-12-02 18:26

Copyfault

Registered: Dec 2001
Posts: 466

Quote: Quoting Copyfault

      ldx #initval
      ldy #<(zp_pos+1)
loop: shx $ffff,y
      lda <($100 + zp_pos - initval),x
      beq loop

Nice! LDA <($100 + zp_pos - initval),X can simply be replaced with LDA zp_pos, I think.

No, the loop'd take only 11 cycles which does not suffice, see also post #61 by Quiss. This was the reason to use the x-indexed version.

2020-12-02 18:26

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Quoting Rastah Bar
The STA $ZP instruction (see post #44) can be made part of the init code, which reduces the timer-based stabilization approach to effectively 10 bytes:

      ldy #init_value  ;Init code
sync: lax $dc04
      sbx #51
      sty ZP      ;RRW instruction. Part of init code.
      cpx $dc04
      bne sync:

STY ABS is also allowed, in combination with SBX #52.

If I'm not mistaken, this should work on PAL, NTSC, and DREAN, but the loop exit cycle may depend on the system.
But we both agree that this is not the shortest, but rather one of the shortest approaches that work without mem loc constraints, don't we?;) The code bracket with INCs is 10 bytes long, the other one with LDAs sums up to 10 + 2(for an LDY #val that effectively is also part of the init code) in total...

Now concerning the different VIC systems in your timer based approach, I wonder whether the operand of the sbx must be adjusted according to the no. of cycles per line or if this works with subtrahend 51 on all systems for some magic reason...

CJam's approach also takes 10 bytes, but maybe it requires blanking the screen? I'm not sure, but if it does, it is an extra constraint, since you do not always want that. For example, in a demo you can have some effect working on the startup screen, and then you don't want to blank it.

VIC always steals the same amount of cycles from the CPU on a badline. This is system independent.

2020-12-02 18:30

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: No, the loop'd take only 11 cycles which does not suffice, see also post #61 by Quiss. This was the reason to use the x-indexed version.

Oh yes, I forgot about that. Hard to keep track of all the subtleties :-)

2020-12-02 18:44

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

CJam's approach also takes 10 bytes, but maybe it requires blanking the screen? I'm not sure, but if it does, it is an extra constraint, since you do not always want that. For example, in a demo you can have some effect working on the startup screen, and then you don't want to blank it.

Ok, that's true. However screen blanking could be done inside the code bracket, but it is a constraint as it's not possible to display anything during this init.

Quoting Rastah Bar

VIC always steals the same amount of cycles from the CPU on a badline. This is system independent.

YES of course!!! Thanks for the oh so obvious yet clever point! So the drawback that remains is the need for a running timer. With my last proposal (based on Quiss' SHX-idea), we'd have no constraints at all, but it needs 11 bytes in its current state. So definitely not the shortest one, but at least something with SHX :)

2020-12-02 18:49

Rastah Bar

Registered: Oct 2012
Posts: 336

The more different approaches, the merrier :-)

It could very well be that screen blanking is not required for CJam's approach, but it needs to be analyzed. With likely multiple W instructions in the init code, that seems very difficult. And I'm too tired right now to think about it very hard. Not that I can figure it out, anyway :-)

2020-12-03 12:48

ChristopherJam

Registered: Aug 2004
Posts: 1378

Woah, this thread has been busy, nice work all :)

Um, no need to blank the screen for my bracketing method. There's never character DMA on line $0ff, which is the only one for which there are 63 cycles for which an INC $d012 will read $ff and write $00.

(assuming of course that I'm remembering correctly rumours that line $00 is only 62 cycles long - can anyone point me at documentation to confirm that? There's nothing in the venerable VIC Article [english], and I'm not spotting it anywhere on CodeBase either)

2020-12-03 13:05

Frantic

Registered: Mar 2003
Posts: 1627

cjam: See post 16 in this thread.

2020-12-03 14:21

ChristopherJam

Registered: Aug 2004
Posts: 1378

Arrgh, I'm going blind. Thank you Frantic.

2020-12-03 14:24

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting ChristopherJam

Woah, this thread has been busy, nice work all :)

Um, no need to blank the screen for my bracketing method. There's never character DMA on line $0ff, which is the only one for which there are 63 cycles for which an INC $d012 will read $ff and write $00.

(assuming of course that I'm remembering correctly rumours that line $00 is only 62 cycles long - can anyone point me at documentation to confirm that? There's nothing in the venerable VIC Article [english], and I'm not spotting it anywhere on CodeBase either)

I was just wondering if some W cycles within the init code might cause some kind of alignment on badlines, such that the first INC $d012 always appears on line $ff on a non-syncing cycle?

Btw, since the LDX #initval in Copyfault's method of post #90 is part of the init code, that one is only 9 bytes and hence the shortest unconstrained method.

2020-12-03 14:49

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Rastah Bar

Btw, since the LDX #initval in Copyfault's method of post #90 is part of the init code, that one is only 9 bytes and hence the shortest unconstrained method.

Oh yes, I'm well aware that (compressibility aside??) the mantle of shortest routine has moved on :)

Quote:

I was just wondering if some W cycles within the init code might cause some kind of alignment on badlines, such that the first INC $d012 always appears on line $ff on a non-syncing cycle?

Oh, I see - the requisite phase drift might not occur.. Yes, I can see that's a potential issue, and one that would be quite a nightmare to debug if someone hadn't already pointed out the possibility. Well spotted!

btw - with the various routines that have an entry point inside the loop, I was originally thinking "wait, aren't you then spending more bytes to branch into the start point?" but then I remembered that this code is probably running post decrunch, and most crunchers will happily let you set whatever start point you want, and kill CIA for you to boot.

Of course, if you're being this stingy with bytes there's also a fair chance you're *not* using an off the shelf decruncher...

2020-12-03 17:22

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Btw, since the LDX #initval in Copyfault's method of post #90 is part of the init code, that one is only 9 bytes and hence the shortest unconstrained method.

Well, what can I say? Thanks!!! I happily take this "medal" :)) Makes all other approaches not a tiny bit less attractive (and I still have to do the transferring to codebase, so @Frantic: sorry for the delay!)

Quoting ChristopherJam

btw - with the various routines that have an entry point inside the loop, I was originally thinking "wait, aren't you then spending more bytes to branch into the start point?" but then I remembered that this code is probably running post decrunch, and most crunchers will happily let you set whatever start point you want, and kill CIA for you to boot.

This was also some point I always wanted to get rid of. I'm over-happy that my last proposal does not need a jmp inside of the loop but can just start with the first opcode, the LDX #initval. But fair point with the context, the routine is usually called after the cruncher finished its job, so a jmp to whatever address shouldn't be a real constraint.

2020-12-03 20:37

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting ChristopherJam

btw - with the various routines that have an entry point inside the loop, I was originally thinking "wait, aren't you then spending more bytes to branch into the start point?" but then I remembered that this code is probably running post decrunch, and most crunchers will happily let you set whatever start point you want, and kill CIA for you to boot.

Of course, if you're being this stingy with bytes there's also a fair chance you're *not* using an off the shelf decruncher...

I do not know exactly how crunchers tailored to 6502 code work, but what will turn out to be the shortest crunched routine, could depend on the code or data around it, is that correct?

2020-12-04 00:23

Copyfault

Registered: Dec 2001
Posts: 466

Here's another nice one I just found: if we first have an init-routine that initialises zp-adr $9f with a non-zero(!) value val, the following routine can be run afterwards to get in sync:

loop: lax $9f
      ldy #$ff
      cmp <($100 + $9f - val),x
      bne loop+1

Takes 8 bytes and though it operates on zp, it is non-destructive, i.e. the value of $9f is restored again when the loop terminates.

2020-12-04 00:47

Copyfault

Registered: Dec 2001
Posts: 466

Ah well, f**k it, it obviously falls through on the first loop run. Damn!

2020-12-04 01:29

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Rastah Bar

I do not know exactly how crunchers tailored to 6502 code work, but what will turn out to be the shortest crunched routine, could depend on the code or data around it, is that correct?

Could do - but every single cruncher in use is some kind of variation on "replacing things it has seen before with a reference to the earlier version" (though LZMA does also try and compress the bits in literal values if it can).

The code crunching specializations just separate out the operand stream from the opcode stream, either directly or by favouring offsets for copying bytes from 5 or 6 bytes earlier in the output-thus-far.

Any kind of guarantee that "hey I've seen nearly half these bytes already' is going to be hard to beat.

2020-12-04 14:40

Rastah Bar

Registered: Oct 2012
Posts: 336

@CJam: Thanks for explaining. It is a nice puzzle to find short uncrunched versions, but I understand that it remains to be seen how useful they really are.

@all:
Another variant: after some init code often the values in X and Y will be known. If one of them is larger than 127, then SHX or SHY can be used with an address with {H+1}<128 for a 7-byte, 10-cycle loop. F.e. let's say X>$7F and Y=$10 (any value Y<$FF can be made to work):

HH0c   shx $HH00,Y
HH0f   lda #any_value
HH11   bpl *-5

HH is a number smaller than $7F. The low byte of the start address should be adjusted according to the value of Y.
X or Y >$7f is hardly a constraint, since something like LDX #$80, STX $d020 already does the trick.
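
As a sanity check of the masking argument in this post, here is a minimal Python sketch (not part of the original post). It assumes the SHX behaviour used throughout this thread: X & {H+1} is stored normally, and plain X is stored when DMA hits the critical cycle. The helper name `shx_value` is made up for illustration.

```python
def shx_value(x, hh, dma_hit):
    """Value SHX $HH00,Y stores, per the behaviour assumed in this thread."""
    return x if dma_hit else x & ((hh + 1) & 0xFF)

# X > $7F and HH < $7F, as required in the post above.
for x in range(0x80, 0x100):
    for hh in range(0x00, 0x7F):
        # Without DMA the LDA operand stays positive, so BPL keeps looping.
        assert shx_value(x, hh, dma_hit=False) < 0x80
        # With DMA the LDA operand becomes X (negative), so BPL falls through.
        assert shx_value(x, hh, dma_hit=True) >= 0x80

# With Y = $10 the store lands on the LDA operand at $HH10 (LDA opcode at $HH0F).
y = 0x10
lda_operand_offset = 0x0F + 1
assert y == lda_operand_offset
```

This only checks the bit arithmetic; whether the 10-cycle loop length actually drifts into the syncing cycle is a separate question (see post #61).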

2020-12-04 21:49

Copyfault

Registered: Dec 2001
Posts: 466

I was about to begin with "Let's cheat it down to 7 bytes...", but hats off for being faster with a 7-bytes-version, Rastah Bar! Great stuff, really! But let's face it: the lower the amount of bytes for the actual sync routine, the more (and weirder) the constraints ;)

Here's my idea. After decrunch, let's assume we have two zp-addresses set to specific values: $9e=$fa; and $a0=$00 (other combinations are possible, but this one's fairly nice to illustrate the method). This'd allow us to sync with the following three lines of code:

loop:  lda $a6,x
       shx $00a0,y
       beq loop

The routine must be called with a JMP loop+1, so it starts with

entry: ldx $9e
       ldy #$00
       beq loop

Ok, it needs special zp addresses set to special values, but this is usually the case (one's tempted to say: "choose your vectors wisely";)). At least this routine can be placed at any mem position and the zp-values stay unchanged.

Don't know which constraints are less disturbing, but I think it won't get any smaller than 7 bytes. It's still a challenge to try to make it completely free of any constraint while keeping this small size!

2020-12-04 22:35

Rastah Bar

Registered: Oct 2012
Posts: 336

Nice, but it would be quite a coincidence that you would need exactly these presettings in the rest of the intro or demo. Perhaps there are ZP addresses that normally (I mean, after a cold start), have the required values.

2020-12-04 22:56

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Nice, but it would be quite a coincidence that you would need exactly these presettings in the rest of the intro or demo. Perhaps there are ZP addresses that normally (I mean, after a cold start), have the required values.

Did not dig deeper through the default zp settings, but since the sync-loop must be started by jumping inside, calling it after decrunching is mandatory more or less. So why not establish some special vector settings;))?

And though other combinations are possible, it's all quite rigid and every variant needs extra checks asf. Getting rid of the vectors completely would be awesome (without mem constraints & 7 bytes in total), but well - this whole problem had a black hole effect for far too long on my mind... and obviously still has :(

2020-12-04 23:22

Rastah Bar

Registered: Oct 2012
Posts: 336

I know almost nothing about decrunchers, so I don't have a clue what they can do in terms of "initial conditions" of ZP-addresses or registers, etc.

If they can, for example, give you a desired value of X and Y (without increasing net code size), then perhaps a 6-byte loop is possible with something like this:

shx $HH00,y
BYTE any_value
bne *-4

The code location and $HH should be such that X & {H+1} is the opcode for instructions like TXA, TYA, while X should contain the opcode for a 3-byte instruction.

So without DMA, the byte after the SHX instruction is replaced by e.g. TYA ensuring the branch is taken, and with DMA the loop exits with the 3-byte instruction whose opcode was in X. But this is stretching it really far!

2020-12-04 23:56

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

I know almost nothing about decrunchers, so I don't have a clue what they can do in terms of "initial conditions" of ZP-addresses or registers, etc.

With "after decrunch" I just mean that all memory is initialised with values as needed and that the jump to whatever starting point belongs to the decruncher code.

Quoting Rastah Bar

If they can, for example, give you a desired value of X and Y (without increasing net code size), then perhaps a 6-byte loop is possible with something like this:

shx $HH00,y
BYTE any_value
bne *-4

The code location and $HH should be such that X & {H+1} is the opcode for instructions like TXA, TYA, while X should contain the opcode for a 3-byte instruction.

So without DMA, the byte after the SHX instruction is replaced by e.g. TYA ensuring the branch is taken, and with DMA the loop exits with the 3-byte instruction whose opcode was in X. But this is stretching it really far!

Yes this should work. But you're right, it's really shifting *a lot* of preparations to the reign of decruncher & init code. Still quite doable I think. Time to dig out the shortest-code-medal and polish it for the new owner;)

2020-12-05 15:35

Rastah Bar

Registered: Oct 2012
Posts: 336

The code cannot be freely placed in memory, so you may keep that medal :-)

One example (there are probably a lot more):
X = $38 (opcode for SEC)

SHX $HH00,Y
CLC
BCC *-4

HH can be $17..$1E, $57..$5E, $97..$9E, $D7..$DE. Without DMA, the CLC (opcode $18) does not change, with DMA it is replaced with SEC.
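
The list of usable pages can be reproduced with a few lines of Python. This is only a sketch of the bit arithmetic, assuming (as elsewhere in this thread) that SHX stores X & {H+1} when no DMA interferes; `pages_for` is a hypothetical helper name.

```python
CLC, SEC = 0x18, 0x38

def pages_for(x, wanted):
    """All high bytes HH for which X & (HH+1) equals the wanted opcode."""
    return [hh for hh in range(0x100) if x & ((hh + 1) & 0xFF) == wanted]

# X holds SEC; without DMA the stored byte must still read as CLC.
pages = pages_for(SEC, CLC)
print(" ".join(f"${hh:02X}" for hh in pages))
```

The condition boils down to HH+1 having bits 3 and 4 set and bit 5 clear, with bits 6 and 7 free, which gives the four 8-page ranges listed above.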

2020-12-05 21:06

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: The STA $ZP instruction (see post #44) can be made part of the init code, which reduces the timer-based stabilization approach to effectively 10 bytes:

        ldy #init_value ; Init code
sync:   lax $dc04
        sbx #51
        sty ZP          ; RRW instruction. Part of init code.
        cpx $dc04
        bne sync

STY ABS is also allowed, in combination with SBX #52.

If I'm not mistaken, this should work on PAL, NTSC, and DREAN, but the loop exit cycle may depend on the system.

Correction: STY ABS is not guaranteed to lock(*), but STY ZP is, on all models (PAL, old and new NTSC, DREAN).

(*) Unless the border saves it, but I still have to check that.

2020-12-06 19:02

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Quoting Copyfault
Quoting Rastah Bar
If they can, for example, give you a desired value of X and Y (without increasing net code size), then perhaps a 6-byte loop is possible with something like this:

shx $HH00,y
BYTE any_value
bne *-4
[...]
Yes this should work. But you're right, it's really shifting *a lot* of preparations to the reign of decruncher & init code. Still quite doable I think. Time to dig out the shortest-code-medal and polish it for the new owner;)
The code cannot be freely placed in memory, so you may keep that medal :-)[...]

Well, in large parts it's the same as what I proposed in post#86. But now that we entered the territory of over-stretching, here a version that does it in only 5 bytes (again putting all required reg settings on the decruncher's bill) :

$fdfc  9E D0 FD   shx $fdd0,y
$fdff  D0 FC      bne $fdfd

Comes with all constraints one could think of: mem loc fixed, y=$2f fixed val mandatory, x=$d1 fixed val mandatory, setting of vector $fc/$fd has influence on the no. of cycles that are taken when the loop is left, to be started with z-flag=0, ... maybe more! Ok, it's possible to do it with any branch-opcode, but this doesn't really make it any better;)

2020-12-06 19:16

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

Well, in large parts it's the same as what I proposed in post#86.

Yes, you are right. It's also very similar to what I wrote in post#71. I lost track a bit of all the variants.
Quote:

But now that we entered the territory of over-stretching, here a version that does it in only 5 bytes (again putting all required reg settings on the decruncher's bill) :

$fdfc  9E D0 FD   shx $fdd0,y
$fdff  D0 FC      bne $fdfd

Comes with all constraints one could think of: mem loc fixed, y=$2f fixed val mandatory, x=$d1 fixed val mandatory, setting of vector $fc/$fd has influence on the no. of cycles that are taken when the loop is left, to be started with z-flag=0, ... maybe more! Ok, it's possible to do it with any branch-opcode, but this doesn't really make it any better;)

Very ingenious, but an 8-cycle loop doesn't work, does it? See post #61.

2020-12-06 19:28

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Quoting Copyfault
Well, in large parts it's the same as what I proposed in post#86.

Yes, you are right. I lost track a bit of all the variants.
Quote:

But now that we entered the territory of over-stretching, here a version that does it in only 5 bytes (again putting all required reg settings on the decruncher's bill) :

$fdfc  9E D0 FD   shx $fdd0,y
$fdff  D0 FC      bne $fdfd

Comes with all constraints one could think of: mem loc fixed, y=$2f fixed val mandatory, x=$d1 fixed val mandatory, setting of vector $fc/$fd has influence on the no. of cycles that are taken when the loop is left, to be started with z-flag=0, ... maybe more! Ok, it's possible to do it with any branch-opcode, but this doesn't really make it any better;)

Very ingenious, but an 8-cycle loop doesn't work, does it? See post #61.

It's actually a 12-cycle loop, cause the first branch is 4-cycles long (page-break!), the branch in the operand of the SHX takes 3 cycles and the SHX itself 5 -> 12 cycles in total;)
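
The cycle count can be re-derived from the five bytes alone. The following Python sketch (not from the original post) uses the standard 6502 rule that a taken branch costs 3 cycles, plus 1 if the target lies on a different page than the instruction after the branch; the helper names are made up for illustration.

```python
def branch_target(opcode_addr, offset_byte):
    """Target of a 6502 relative branch whose opcode sits at opcode_addr."""
    rel = offset_byte - 0x100 if offset_byte & 0x80 else offset_byte
    return (opcode_addr + 2 + rel) & 0xFFFF

def taken_branch_cycles(opcode_addr, offset_byte):
    """3 cycles for a taken branch, 4 if it crosses a page boundary."""
    next_pc = (opcode_addr + 2) & 0xFFFF
    target = branch_target(opcode_addr, offset_byte)
    return 3 + int((next_pc ^ target) >> 8 != 0)

# $fdfc: 9E D0 FD / $fdff: D0 FC
outer = (0xFDFF, 0xFC)   # the visible "bne $fdfd"
inner = (0xFDFD, 0xFD)   # the SHX operand bytes D0 FD, executed as a second bne

SHX_ABS_Y_CYCLES = 5
loop_cycles = (taken_branch_cycles(*outer)
               + taken_branch_cycles(*inner)
               + SHX_ABS_Y_CYCLES)
print(hex(branch_target(*outer)), hex(branch_target(*inner)), loop_cycles)
```

The same arithmetic covers the register constraints: $fdd0 + $2f = $fdff, so the SHX rewrites the BNE opcode, and $d1 & ($fd + 1) = $d0, which is BNE again as long as no DMA interferes.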

It could even be done with just 4 bytes (continuing the abuse of the byte-counting):

loop:  sha (vec),y
       bne loop

If this is located at the end of a page s.t. the BNE comes with a pb, it's a 10-cycle-loop in total.

Still, too far-fetched, too many things must be configured correctly. Personally, I think the 7-bytes-solution (as in post#110) that "only" comes with requirements on zp-values set in a special way is the best compromise between flexibility and byte-count!

2020-12-06 19:47

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

Quoting Rastah Bar

Very ingenious, but an 8-cycle loop doesn't work, does it? See post #61.
Quote:
It's actually a 12-cycle loop, cause the first branch is 4-cycles long (page-break!), the branch in the operand of the SHX takes 3 cycles and the SHX itself 5 -> 12 cycles in total;)

Yes, I misread the branch. I thought it was to $FDFC.
Quote:

It could even be done with just 4 bytes (continuing the abuse of the byte-counting):

loop:  sha (vec),y
       bne loop
If this is located at the end of a page s.t. the BNE comes with a pb, it's a 10-cycle-loop in total.

Awesome! With SHA(vec),y even 3 bytes is possible for a 12-cycle loop. One example:

$5f00  SHA (VEC),y
$5f02  RTS

If we assume that the decruncher provides the following initial conditions: {A&X} = $EA (opcode of NOP), Y = 2, the ZP addresses VEC and VEC+1 point to $5F00 and the stack is completely filled with the return address $5F00. Without DMA the SHA writes $EA & {$5F+1} = $60 (opcode for RTS) and repeats that until a DMA makes it write an NOP.
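
The numbers in this example are easy to check. A minimal Python sketch, assuming the SHA behaviour described in the post ({A&X} & {H+1} is written without DMA, plain {A&X} with DMA) and the standard cycle counts of 6 for SHA (zp),y and 6 for RTS:

```python
NOP, RTS = 0xEA, 0x60

a_and_x = NOP          # initial condition: A & X = $EA
vec = 0x5F00           # zero-page vector contents
y = 2

store_addr = vec + y                              # where SHA writes
without_dma = a_and_x & (((vec >> 8) + 1) & 0xFF) # masked with H+1 = $60
with_dma = a_and_x                                # mask dropped

loop_cycles = 6 + 6    # SHA (zp),y + RTS

print(hex(store_addr), hex(without_dma), hex(with_dma), loop_cycles)
```

So the byte at $5f02 stays an RTS until the DMA hit turns it into a NOP, at which point execution falls through into whatever follows.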
Quote:

Still, too far-fetched, too many things must be configured correctly. Personally, I think the 7-bytes-solution (as in post#110) that "only" comes with requirements on zp-values set in a special way is the best compromise between flexibility and byte-count!

I'll leave that judgement to the people who want to use any of the variants.

2020-12-16 00:07

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Awesome! With SHA(vec),y even 3 bytes is possible for a 12-cycle loop. One example:

$5f00  SHA (VEC),y
$5f02  RTS

If we assume that the decruncher provides the following initial conditions: {A&X} = $EA (opcode of NOP), Y = 2, the ZP addresses VEC and VEC+1 point to $5F00 and the stack is completely filled with the return address $5F00. Without DMA the SHA writes $EA & {$5F+1} = $60 (opcode for RTS) and repeats that until a DMA makes it write an NOP.

Yeah, already told you that I like this approach for its level of insanity alone :)) Maybe instead of $5f00 one could choose $5f5f as "start address" so the whole stack can be filled with the same byte and no matter at which position the SP will be, it will always return to the right spot!

2020-12-16 02:16

ChristopherJam

Registered: Aug 2004
Posts: 1378

Oh this is great.

May I suggest replacing the RTS with a BRK? Then you only need a single vector pointing at the routine start, instead of all of stack :) 13 cycle works I think; it divides cycles per frame but not cycles per character row.

(edit - assuming no issues with cycle stealing from all the stack writes for the BRK, of course. I've not tested this)
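
The divisibility claim can be checked directly. A small Python sketch using the PAL figures quoted earlier in this thread (63 cycles per line, 312 lines per frame) and 8 lines per character row; it says nothing about stolen cycles, which is the caveat raised in the edit above.

```python
CYCLES_PER_LINE = 63       # PAL
LINES_PER_FRAME = 312      # PAL
LINES_PER_CHAR_ROW = 8

cycles_per_frame = CYCLES_PER_LINE * LINES_PER_FRAME          # 19656
cycles_per_char_row = CYCLES_PER_LINE * LINES_PER_CHAR_ROW    # 504

# 13 divides the frame length but not the character row length.
print(cycles_per_frame % 13, cycles_per_char_row % 13)   # prints: 0 10
```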

2020-12-16 12:55

chatGPZ

Registered: Dec 2001
Posts: 11113

read the whole thread again and by scientific measures you all turned out pretty insane.

2020-12-16 14:36

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Groepaz

read the whole thread again and by scientific measures you all turned out pretty insane.

Well, I can't really argue with that.

I can suggest replacing the BRK (or rather the contents of A) with the opcode of the next instruction in the init routine. Place the SHA in $ffxx, and it will zero out whatever is written until the time is right. Just saved another byte \o/

2020-12-16 15:24

Copyfault

Registered: Dec 2001
Posts: 466

Quoting ChristopherJam

Oh this is great.

May I suggest replacing the RTS with a BRK? Then you only need a single vector pointing at the routine start, instead of all of stack :) 13 cycle works I think; it divides cycles per frame but not cycles per character row.

(edit - assuming no issues with cycle stealing from all the stack writes for the BRK, of course. I've not tested this)

I also had this idea to do it with a BRK instead of RTS (I mean: when there's some ANDing in play, a $00-byte for the "continue-loop"-case feels tempting;)), but afaiu post#61 by Quiss, 13 cycles does not work. So I guess the complete stack would've to be "configured adequately" :)

Quoting Groepaz

read the whole thread again and by scientific measures you all turned out pretty insane.

Well, don't have a good argument against this - but insanity is the obvious state when stepping beyond science ;)

2020-12-16 15:30

Copyfault

Registered: Dec 2001
Posts: 466

Quoting ChristopherJam

[...]
(edit - assuming no issues with cycle stealing from all the stack writes for the BRK, of course. I've not tested this)

Oh wait... Quiss' calculation for the loop-length was based on R-cycle only... so the BRK *will* change it. Need to fiddle out the permitted loop-lengths under this new precondition - maybe it works...

2020-12-16 15:36

ChristopherJam

Registered: Aug 2004
Posts: 1378

I do hope it works. Because then we can use the entire SHA instruction and its opcode as the address operand of a preceding instruction (eg placing the zp pointer at an address that doubles as the high byte of an IO address), then the entire routine can vanish altogether. Zero bytes :D :D

2020-12-16 15:54

Copyfault

Registered: Dec 2001
Posts: 466

Quoting ChristopherJam

I do hope it works. Because then we can use the entire SHA instruction and its opcode as the address operand of a preceding instruction (eg placing the zp pointer at an address that doubles as the high byte of an IO address), then the entire routine can vanish altogether. Zero bytes :D :D

Oh, where are we now??? Insanity^infty /o\ But ok, splendid idea to make SHA (vec),y an operand of a preceding opcode, like STA $d093,y :)) First thought it'd rather be a 1-byte-solution due to the opcode following that STA $d093,y, but the sync-loop exits always when the full value was written, thus also this byte is completely free to choose.

2020-12-16 17:46

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: read the whole thread again and by scientific measures you all turned out pretty insane.

2020-12-16 18:25

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: I do hope it works. Because then we can use the entire SHA instruction and its opcode as the address operand of a preceding instruction (eg placing the zp pointer at an address that doubles as the high byte of an IO address), then the entire routine can vanish altogether. Zero bytes :D :D

Fantastic!

2020-12-16 19:16

Frantic

Registered: Mar 2003
Posts: 1627

Zero bytes routine... Sounds like a world's first to me. :)

(And, yes, this thread obviously continued way past its end station.)

2020-12-16 22:13

Copyfault

Registered: Dec 2001
Posts: 466

The 13-cycle-loop won't work :((( The additional W-Cycles do not cancel out every possible cyclicity in the corresponding graph.

Talking in terms of cycles, we have the following:

sha ($d0),y ; RRRRSW
brk         ; RRWWWRR

The cycles are denoted as usual, with "S" being the special cycle that must land on cycle #12 of a badline (starting the rastercycle-count of a line with 1).

Now a 13-cycle-loop that is spread among 461 cycles (from badline to badline) gives a mod(-,13)-operator. Since mod(461,13)=6, we're actually adding 6 to the relative cycle position in our loop when getting from one badline to the next and look at a fixed cycle in the line.

The mod-operator also uniquely numbers the cycles of the loop, i.e. the first R-cycle of the SHA is cycle=0, the second R-cycle is cycle=1, etc. and the last R-cycle of the BRK is cycle=12. Now if we e.g. fix cycle #11 (the one before the first DMA-overtake-cycle) and have loop-cycle#11 at this spot, the +6-operation will lead to loop-cycle=11+6=17=4(mod13), i.e. the "S"-cycle will be at cycle#11 of the next badline. But here the drama happens: the cycle following this "S"-cycle is a W-cycle, thus executed since W-cycles can be executed on DMA-overtake-cycles. So the cycle-loop-count increases by one, and in the next badline at cycle#11 we will meet loop-cycle=5+6=11. This is the cycle we started with, so we're in an endless loop :(
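
The cyclic argument can be replayed in a few lines of Python. This is only a simplified model of the reasoning in this post, not a cycle-exact emulation: it tracks which loop-cycle sits at raster cycle #11 on successive badlines, takes the 461 cycles from the post as given, and adds one extra cycle whenever the "S" cycle is found there (for the W cycle that still executes during the DMA takeover).

```python
LOOP = "RRRRSW" + "RRWWWRR"     # sha ($d0),y followed by brk
LOOP_LEN = len(LOOP)            # 13
BADLINE_TO_BADLINE = 461        # CPU cycles, figure taken from the post above

step = BADLINE_TO_BADLINE % LOOP_LEN   # 6

phase = 11                      # loop-cycle found at raster cycle #11
seen = []
for _ in range(4):
    seen.append(phase)
    if LOOP[phase] == "S":
        phase += 1              # the W cycle after S runs during the DMA takeover
    phase = (phase + step) % LOOP_LEN

print(step, seen)
```

Under these assumptions the phase alternates between 11 and 4 and never moves the "S" cycle onto raster cycle #12, which is the endless loop described above.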

Can this be broken by the no. of cycles on the upper/lower border? Or can we circumvent the trouble by using ROM-vectors ($0314ff)? This would give us a 41/42-cycle-loop, since (at least afair) there's an LDA $0104,X in the ROM-IRQ-routine, with X=SP, so this may lead to a varying no. of cycles if a pb is happening...

2020-12-17 00:16

Copyfault

Registered: Dec 2001
Posts: 466

A 12-cycle-loop works, even when we have the three W-cycles of the BRK in the game. So one could argue that the following is kinda 0-byte-ish:

$ffd0   sta $d09e,y
$ffd3   isb $d000,x
$ffd6   brk

BRK-vector points to $ffd1, so the "loop-view" of the code is

$ffd0   .byte $99
$ffd1   shx $ffd0,y
$ffd4   brk
$ffd5   bne $ffd7

This offers several ways to end the loop; choosing X s.t. some store-opcode is put to $ffd4 when SHX hits the correct cycle gives an access to zp-address $d0, but ideally, the ISB has already done a good job, thus also one of the NOP-opcodes like $1a at $ffd4 followed by the BNE can be ok.
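
That both listings describe the same seven bytes can be verified with a toy decoder. The Python sketch below knows only the five opcodes involved (with their standard encodings and lengths); `MEM`, `OPCODES` and `decode` are hypothetical names for illustration.

```python
# Bytes of: sta $d09e,y / isb $d000,x / brk, placed at $ffd0.
MEM = {0xFFD0 + i: b
       for i, b in enumerate([0x99, 0x9E, 0xD0, 0xFF, 0x00, 0xD0, 0x00])}

# Only the opcodes needed here: (mnemonic, instruction length in bytes).
OPCODES = {
    0x99: ("sta abs,y", 3),
    0xFF: ("isb abs,x", 3),
    0x9E: ("shx abs,y", 3),
    0x00: ("brk", 1),
    0xD0: ("bne rel", 2),
}

def decode(start, count):
    """Decode `count` instructions starting at `start`."""
    out, pc = [], start
    for _ in range(count):
        name, size = OPCODES[MEM[pc]]
        out.append((pc, name))
        pc += size
    return out

entry_view = decode(0xFFD0, 3)   # as seen when falling in from the init code
loop_view = decode(0xFFD1, 3)    # as seen via the BRK vector at $ffd1

loop_cycles = 5 + 7              # shx abs,y + brk
```

The SHX operand in the loop view is $ffd0 (bytes D0 FF), so with a suitable Y the store lands on the BRK at $ffd4, as the post describes.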

2020-12-17 00:27

Krill

Registered: Apr 2002
Posts: 2839

I've been following this thread somewhat out of base personal interest... but so far, Quiss' original approach in https://csdb.dk/forums/?roomid=11&topicid=140414#143496 seems the most feasible for real-world purposes, imho.

Now, if the magic code could reside somewhere at $08xx... =)

2020-12-17 00:34

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Krill

[...]
Now, if the magic code could reside somewhere at $08xx... =)

Well, it can, see f.e. post#76 ;)

2020-12-17 00:37

Krill

Registered: Apr 2002
Posts: 2839

Quoting Copyfault

Quoting Krill
[...]
Now, if the magic code could reside somewhere at $08xx... =)
Well, it can, see f.e. post#76 ;)

Oh okay, sorry, seem to have missed that!

So same assertion with https://csdb.dk/forums/?roomid=11&topicid=140414#146475 instead!
8 bytes, excellent! =)

2020-12-17 02:48

ChristopherJam

Registered: Aug 2004
Posts: 1378

Quoting Copyfault

The 13-cycle-loop won't work :(((

Aww, that's a shame.

Kind of ironic to be derailed by W accesses, given the number of earlier approaches that relied on cycle stealing to work at all.

Thanks for doing the analysis!

2020-12-17 10:37

Rastah Bar

Registered: Oct 2012
Posts: 336

It may still work, because the border can get it out of such an endless loop, but this should be checked.

2020-12-17 10:39

Oswald

Registered: Apr 2002
Posts: 5017

Quote: Quoting Copyfault
Quoting Krill
[...]
Now, if the magic code could reside somewhere at $08xx... =)
Well, it can, see f.e. post#76 ;)
Oh okay, sorry, seem to have missed that!

So same assertion with https://csdb.dk/forums/?roomid=11&topicid=140414#146475 instead!
8 bytes, excellent! =)

this post link doesnt work when csdb is also set to display only a set nr of posts from a topic.

2020-12-17 10:41

Krill

Registered: Apr 2002
Posts: 2839

Quoting Oswald

this post link doesnt work when csdb is also set to display only a set nr of posts from a topic.

I think that's a bug report and should go to another sub-forum, no? :)

2020-12-17 10:52

Rastah Bar

Registered: Oct 2012
Posts: 336

Krill, what is your opinion on the timer-based approach of post#91? Is it too dangerous to rely on a timer when loading the next part of a demo?

2020-12-17 11:04

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rastah Bar

Krill, what is your opinion on the timer-based approach of post#91? Is it too dangerous to rely on a timer when loading the next part of a demo?

Timers are not used by loaders usually, then there is no hazard for loading itself. Typical IRQ loaders load in a background thread, so they are interrupted themselves but don't interrupt other threads.

Some IRQ loaders exist which use timer or raster IRQs to periodically receive data from the drive, and it seems like that could disturb your approach, as it does not seem to like being interrupted.

Depending on how long a loader is not "scheduled" due to a long-running critical section, it could be starved and trigger a drive-side watchdog IRQ to reset the drive due to protocol violation.

2020-12-17 11:24

Rastah Bar

Registered: Oct 2012
Posts: 336

Thanks for your detailed answer. None of the methods in this thread likes to be interrupted, but that is easy to guarantee since the routines only have to run once.

I phrased my question a bit clumsily, though. What I meant was: suppose you want to use the timer-based syncing routine not directly after the start from basic, but later, for some other part of a demo. Is there some risk in relying on $dc04? I suspect not, since the coder should know the state of the timers, but I don't know if there are some cases where you can't be sure of the state of the timers, because f.e. a loader has used it.

2020-12-17 11:32

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rastah Bar

What I meant was: suppose you want to use the timer-based syncing routine not directly after the start from basic, but later, for some other part of a demo. Is there some risk in relying on $dc04? I suspect not, since the coder should know the state of the timers, but I don't know if there are some cases where you can't be sure of the state of the timers, because f.e. a loader has used it.

I have understood your question exactly like this. Is there anything unclear in my answer? :)

2020-12-17 11:39

chatGPZ

Registered: Dec 2001
Posts: 11113

relying on anything being already initialized isnt really a good idea though. perhaps acceptable for something like a 4k - but for anything bigger i'd rather not do this. you can never know what some cartridge or kernal replacement does.

2020-12-17 11:40

Copyfault

Registered: Dec 2001
Posts: 466

Quoting ChristopherJam

Quoting Copyfault
The 13-cycle-loop won't work :(((

Aww, that's a shame.

Kind of ironic to be derailed by W accesses, given the number of earlier approaches that relied on cycle stealing to work at all.

Thanks for doing the analysis!

Had to be done;) And I kinda like rastercycle-joggling... really sad it did not "pay off".

2020-12-17 11:49

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Quoting Rastah Bar
What I meant was: suppose you want to use the timer-based syncing routine not directly after the start from basic, but later, for some other part of a demo. Is there some risk in relying on $dc04? I suspect not, since the coder should know the state of the timers, but I don't know if there are some cases where you can't be sure of the state of the timers, because f.e. a loader has used it.
I have understood your question exactly like this. Is there anything unclear in my answer? :)

No, except that I did not see an argument against the timer-based approach that does not hold for the other approaches, but Groepaz' answer does.

2020-12-17 11:51

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: relying on anything being already initialized isnt really a good idea though. perhaps acceptable for something like a 4k - but for anything bigger i'd rather not do this. you can never know what some cartridge or kernal replacement does.

Thanks! Yes, that is a strong argument against it.

2020-12-17 12:03

Krill

Registered: Apr 2002
Posts: 2839

Quoting Groepaz

relying on anything being already initialized isnt really a good idea though. perhaps acceptable for something like a 4k - but for anything bigger i'd rather not do this. you can never know what some cartridge or kernal replacement does.

Cartridge and KERNAL stuff do not play much of a role in demos, so it's not unusual to have timers for IRQ jitter compensation run during the entire multi-side demo without re-initialisation.

It's different for games, of course, but there you rarely need stable interrupts.

But then both multiload demos and games do not have much of a space issue, so just re-initialise timers at strategic/natural points and be done with it.

2020-12-17 12:16

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

It may still work, because the border can get it out of such an endless loop, but this should be checked.

Hmm, at least my counter example from post#132 seems to get fixed: if we have loop-cycle#4 at cycle#11 of the last rasterline in one frame, we will get to loop-cycle

change                  |   loop-cycle
starting situation      |   4
W-cycle following "S"   |   +1 = 5
cycles to bl in next fr |   +3 = 8
W-cycles of BRK         |   +2 = 10
cycle to next bl        |   +6 = 16 = 3

so the next loop-cycle (which is the S-cycle) hits raster-cycle #12 and the loop ends.

Similarly, when starting with loop-cycle#11:

change                  |   loop-cycle
starting situation      |   11
cycles to bl in next fr |   +3 = 14 = 1
cycle to next bl        |   +6 = 7
W-cycles of BRK         |   +3 = 10
cycle to next bl        |   +6 = 16 = 3

Now we have to check all starting situations, since the routine might start at a last badline of a frame. But I'm warily optimistic ;)
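The +6 and +3 steps in the two tables above can be reproduced with a little modular arithmetic. A minimal Python sketch, under assumptions that are not stated in the post itself: PAL timing with 63 cycles per line and 312 lines per frame, 25 badlines per frame spaced 8 lines apart, and 43 cycles per badline during which the CPU cannot continue the 13-cycle loop. The W-cycle steps (+1, +2, +3) are taken from the tables as given and are not derived here.

```python
# Sketch only: reproduces the loop-cycle bookkeeping from the tables above.
# All constants are assumptions about PAL timing, not taken from the post.
LOOP_LEN = 13            # length of the sync loop in cycles
CYCLES_PER_LINE = 63     # PAL
LINES_PER_FRAME = 312    # PAL
BADLINE_SPACING = 8      # raster lines between two badlines
BADLINES_PER_FRAME = 25
STALL_CYCLES = 43        # cycles per badline in which the loop cannot advance


def step_to_next_badline():
    """Loop-cycle shift between two consecutive badlines of one frame."""
    return (BADLINE_SPACING * CYCLES_PER_LINE - STALL_CYCLES) % LOOP_LEN


def step_to_next_frame():
    """Loop-cycle shift from the last badline to the first one of the next frame."""
    lines = LINES_PER_FRAME - (BADLINES_PER_FRAME - 1) * BADLINE_SPACING
    return (lines * CYCLES_PER_LINE - STALL_CYCLES) % LOOP_LEN


def walk(start, steps):
    """Apply each step modulo the loop length and return all loop-cycles visited."""
    trace = [start]
    for step in steps:
        trace.append((trace[-1] + step) % LOOP_LEN)
    return trace


if __name__ == "__main__":
    bl = step_to_next_badline()
    fr = step_to_next_frame()
    print("first table: ", walk(4, [1, fr, 2, bl]))
    print("second table:", walk(11, [fr, bl, 3, bl]))
```

Both walks end on loop-cycle 3, matching the last row of each table.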

2020-12-17 12:49

Rastah Bar

Registered: Oct 2012
Posts: 336

I checked all starting situations and if I did not make a mistake, the loop always ends!

The next exercise is to check this for C64 models other than PAL.

2020-12-17 12:59

chatGPZ

Registered: Dec 2001
Posts: 11113

Quote:

Cartridge and KERNAL stuff do not play much of a role in demos, so it's not unusual to have timers for IRQ jitter compensation run during the entire multi-side demo without re-initialisation.

thats not the point. the point is that you cant rely on the kernal having initialized the timer to a certain value when your code starts up.

2020-12-17 13:01

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Quoting Groepaz
relying on anything being already initialized isnt really a good idea though. perhaps acceptable for something like a 4k - but for anything bigger i'd rather not do this. you can never know what some cartridge or kernal replacement does.
Cartridge and KERNAL stuff do not play much of a role in demos, so it's not unusual to have timers for IRQ jitter compensation run during the entire multi-side demo without re-initialisation.

It's different for games, of course, but there you rarely need stable interrupts.

But then both multiload demos and games do not have much of a space issue, so just re-initialise timers at strategic/natural points and be done with it.

I did assume that $dc04 goes through all 256 values. If it runs at a period of 9 cycles, the SBX #51 will fail. If it runs at a period of 63 cycles, I don't know if it will work. That has to be analyzed. If it does, I expect that it often will take much longer to lock. Then again, as you point out, a timer running at these periods usually means that a raster syncing procedure has already been performed.

2020-12-17 13:16

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rastah Bar

I did assume that $dc04 goes through all 256 values.

Note that by default, you cannot expect a timer period that is divisible by 256. That is, on timer underflow, the timer lo-byte will wrap to some other value than $ff.
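A small Python sketch of that point. The latch values below are assumptions (the values the stock KERNAL is commonly documented to load into CIA 1 timer A), as is the model that a timer period is latch + 1 cycles and that the counter reloads from the latch on underflow.

```python
# Sketch only: the latch values are assumed stock KERNAL defaults, not taken
# from the thread.
KERNAL_LATCH = {"PAL": 0x4025, "NTSC": 0x4295}


def period(latch):
    """Timer period in cycles, assuming the counter runs latch..0 and reloads."""
    return latch + 1


def low_byte_after_underflow(latch):
    """Value $dc04 shows right after the reload."""
    return latch & 0xFF


if __name__ == "__main__":
    for model, latch in KERNAL_LATCH.items():
        print(model,
              "period", period(latch),
              "period mod 256 =", period(latch) % 256,
              "lo-byte after underflow = $%02x" % low_byte_after_underflow(latch))
```

Under these assumptions neither period is a multiple of 256, so the low byte jumps from $00 to a value other than $ff once per period.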

2020-12-17 13:30

Copyfault

Registered: Dec 2001
Posts: 466

The approach with the timer has some pitfalls but the idea behind it is really beautiful: to check whether the timer has changed by a large amount even though the read accesses to the timer register are just a few cycles apart!

In some of my former posts I already wondered whether the timer underflow never misaligns the read values s.t. a badline might be detected while not being in one. And somehow it feels strange to use timers in a routine which aims to fix a raster cycle position s.t. a timer can be initialised in a stable way. Nonetheless a valuable addition to the pool of (mostly insane) approaches :)

2020-12-17 14:10

Rastah Bar

Registered: Oct 2012
Posts: 336

Quote: Quoting Rastah Bar
I did assume that $dc04 goes through all 256 values.
Note that by default, you cannot expect a timer period that is divisible by 256. That is, on timer underflow, the timer lo-byte will wrap to some other value than $ff.

Yes, but for the default ~60 Hz setting, that will only postpone locking somewhat.

2020-12-17 14:15

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

The approach with the timer has some pitfalls but the idea behind it is really beautiful: to check whether the timer has changed by a large amount even though the read accesses to the timer register are just a few cycles apart!

Thanks!
Quoting Copyfault

In some of my former posts I already wondered whether the timer underflow never misaligns the read values s.t. a badline might be detected while not being in one. And somehow it feels strange to use timers in a routine which aims to fix a raster cycle position s.t. a timer can be initialised in a stable way. Nonetheless a valuable addition to the pool of (mostly insane) approaches :)

The default Kernal settings are such that a false positive cannot occur. But, as Groepaz pointed out, you can never know what some cartridge or kernal replacement does to the settings.

2020-12-17 18:31

Krill

Registered: Apr 2002
Posts: 2839

Quoting Rastah Bar

The default Kernal settings are such that a false positive cannot occur. But, as Groepaz pointed out, you can never know what some cartridge or kernal replacement does to the settings.

This is the same hazard as relying on any kind of pre-initialised variable. Usually to be avoided, but okay for very tight size-restricted productions, such as 4K or smaller demos. =)

2020-12-18 00:38

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

I checked all starting situations and if I did not make a mistake, the loop always ends!

Can confirm :) Depending on the loop-cycle at raster-cycle#11 of, say, badline at $30, the loop-cycle-no. either rather quickly gets to 3 (and thus the S-cycle lands where it should), or the loop-cycle-no. gets into an "4/5<->11 endless cycle". No other endless cycles are possible.

So it suffices to check what happens when the loop-cycle at raster-cycle#11 of the last badline of the frame is 4/5 or 11. For this, see post#150. "QED", I'd say 8)=)

Quoting Rastah Bar

The next exercise is to check this for C64 models other than PAL.

Puhhh, not now, I'm just too happy it works for PAL :)

Quoting ChristopherJam

Quoting Copyfault
[...]Thanks for doing the analysis!
Had to be done;) And I kinda like rastercycle-joggling... really sad it did not "pay off".

And now it DID pay off :) I'm really happy now, 'cause we have just "proven" that

sta $d093,y
brk

with the brk replaced by any wanted value on the sync-loop's exit works!!! A 0-byte-sync-routine (of course PLUS all the preparations, but those may serve other purposes, too).

2020-12-18 01:31

ChristopherJam

Registered: Aug 2004
Posts: 1378

Victory! \:D/ \:D/

2020-12-18 09:33

Rastah Bar

Registered: Oct 2012
Posts: 336

I checked NTSC1, NTSC2, and DREAN, but it doesn't work for these models. NTSC2 and DREAN have 65 cycles per line, and since this is a multiple of the loop length of 13 cycles, the border changes nothing and won't get us out of an endless loop.

I found examples of endless loops for all models, and for NTSC1 the border does not save it either. So PAL only.
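The divisibility argument can be checked in a few lines of Python. The 65-cycle figure for NTSC2 and DREAN is from the post; the 63-cycle PAL and 64-cycle NTSC1 line lengths are assumptions added here. The sketch only shows how far one DMA-free raster line shifts the phase of a 13-cycle loop; it does not model badlines.

```python
# Sketch only: phase shift of a 13-cycle loop per DMA-free raster line.
LOOP_LEN = 13
CYCLES_PER_LINE = {
    "PAL": 63,      # assumption
    "NTSC1": 64,    # assumption
    "NTSC2": 65,    # from the post
    "DREAN": 65,    # from the post
}


def line_shift(model):
    """Loop-cycle shift caused by one raster line without DMA."""
    return CYCLES_PER_LINE[model] % LOOP_LEN


if __name__ == "__main__":
    for model in CYCLES_PER_LINE:
        print(model, "shift per line:", line_shift(model))
```

A shift of 0 means that border lines leave the loop phase unchanged, which is why they cannot break an endless loop on the 65-cycle models.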

2020-12-19 00:13

Copyfault

Registered: Dec 2001
Posts: 466

There's still some more to squeeze out of the "0-byte-approach", though one may argue that it becomes a 1-byte-routine this way...

Instead of changing the byte following the SHA (vec),y, we could change the low-byte of the BRK-vector. By doing so, the mem constraint of the routine (best fit was the $ffxx-page) can be lowered a bit:

*= page*$100 - 1
sta $d093,y
brk

Now init the BRK-vector with $00/#page, put #($fe-y)/$ff at $d0/$d1 and choose values for accu and x s.t. x&a=1 and the routine will still work (also widens the choice for a and x somewhat).

2020-12-19 06:16

ChristopherJam

Registered: Aug 2004
Posts: 1378

Oh nice. There's always a use for a value pre-initialised to zero too, so I'm sure that would not be wasted.

2020-12-19 11:45

Rastah Bar

Registered: Oct 2012
Posts: 336

The choice of A&X can be much wider than s.t. A&X=1.
A&X=3 would also work to continue with the code after the BRK instruction. Or, if you store some useful data such as a small table after the BRK, it can be skipped with a larger value of A&X. Ideally the BRK instruction would serve as the first element of a table :-)

2020-12-19 22:13

Copyfault

Registered: Dec 2001
Posts: 466

Yes, you're both right, the BRK can be used as some data and the step from the beginning of the page to the address where to continue can be chosen almost arbitrarily... but it imposes more (and weirder!!) restrictions on the choices for a and x;)

The example with A&X=1 was intentional, 'cause this way the $d0 turns into a BNE to the next opcode, and since A&X!=0, it's quite safe to assume that the zero-flag is not set when the loop starts;))

2020-12-20 10:37

Rastah Bar

Registered: Oct 2012
Posts: 336

Quoting Copyfault

Yes, you're both right, the BRK can be used as some data and the step from the beginning of the page to the address where to continue can be chosen almost arbitrarily... but it imposes more (and weirder!!) restrictions on the choices for a and x;)

Not more restrictions, you just have more choice.

2020-12-20 13:12

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Rastah Bar

Quoting Copyfault
Yes, you're both right, the BRK can be used as some data and the step from the beginning of the page to the address where to continue can be chosen almost arbitrarily... but it imposes more (and weirder!!) restrictions on the choices for a and x;)

Not more restrictions, you just have more choice.

Hmm, don't know, but the more bits are fixed by the AND-condition, the fewer variants for the pair (a,x) are permitted, no?

2020-12-20 13:26

Rastah Bar

Registered: Oct 2012
Posts: 336

Yes, but the coder can choose to use A&X=1, or A&X=3, or something else. So in that sense there are fewer restrictions.
A priori he has more possibilities to choose from.

Some may not work, but he can always fall back to A&X = 1 or A&X = 3 if he can't make a data table work with the corresponding restrictions on A and X, and to A&X = 1 if A&X = 3 can't be met. Besides, A&X = 2^K, K = 2,...,7 also only fixes 1 bit.

2020-12-20 17:54

Copyfault

Registered: Dec 2001
Posts: 466

Ah, ok, looking from that meta-level I agree of course. The only case that causes trouble is A&X=2, everything else will work.

2021-01-30 18:15

Krill

Registered: Apr 2002
Posts: 2839

Quoting Copyfault

Quoting Quiss
Neat! Right, no reason to make those two address bytes go to waste. :)

Another amusing thing to contemplate is how this code could be placed at, say, $08xx. Preferably without messing up the basic upstart.
[...]
Not necessarily the shortest piece of code, but it satisfies your requirement to have the sync routine at $08xx:

0843 A2 9E LDX #$9E
0845 A0 08 LDY #$08
0847 E0 00 CPX #$00
0849 D0 F9 BNE $0844

Branching to $0844 leads to SHX $08A0,Y, so the operand byte of the CPX is altered continuously. As long as the "&(hi+1)" plays in, the compare operand will be =$08. When "&(hi+1)" disappears, the full $9E is written to the operand byte and the loop will end. This happens iff the critical SHX-cycle happens on a badline.

Nice trolljob there! =)

Took me a while and Rastah Bar's successive comment to figure out that code must sit at $08A3 instead, and not at $0843.
Seems to work nicely now in real-world code on realthing and x64sc, while on x64 the code just twiddles thumbs in an endless loop. =)

2021-01-30 20:43

Copyfault

Registered: Dec 2001
Posts: 466

Quote: Quoting Copyfault
Quoting Quiss
Neat! Right, no reason to make those two address bytes go to waste. :)

Another amusing thing to contemplate is how this code could be placed at, say, $08xx. Preferably without messing up the basic upstart.
[...]
Not necessarily the shortest piece of code, but it satisfies your requirement to have the sync routine at $08xx:

0843 A2 9E LDX #$9E
0845 A0 08 LDY #$08
0847 E0 00 CPX #$00
0849 D0 F9 BNE $0844

Branching to $0844 leads to SHX $08A0,Y, so the operand byte of the CPX is altered continuously. As long as the "&(hi+1)" plays in, the compare operand will be =$08. When "&(hi+1)" disappears, the full $9E is written to the operand byte and the loop will end. This happens iff the critical SHX-cycle happens on a badline.
Nice trolljob there! =)

Took me a while and Rastah Bar's successive comment to figure out that code must sit at $08A3 instead, and not at $0843.
Seems to work nicely now in real-world code on realthing and x64sc, while on x64 the code just twiddles thumbs in an endless loop. =)

Oops, thought I corrected this in a later post, but it seems I forgot to.

However, other approaches have been discovered during the discussion that are comparable in size but more flexible regarding mem location.

So take my deep apologies... "trolljob" sounds really evil :(:(:(... and I really really did not intend to fool anyone.

Hopefully it won't happen again.

2021-01-30 21:27

Krill

Registered: Apr 2002
Posts: 2839

No worries, i actually took it for a mistake.

Though i wonder how that could happen, did you drunkenly type in some notes scribbled on a napkin, mistaking an A for a 4? =)

2021-01-31 00:42

Raistlin

Registered: Mar 2007
Posts: 555

So you're suggesting there was a fault in Copyfault's copy?

2021-01-31 00:51

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Raistlin

So you're suggesting there was a fault in Copyfault's copy?

Yeah, must be the handle, obviously ;)

2021-01-31 10:24

Oswald

Registered: Apr 2002
Posts: 5017

0843 A2 9E LDX #$9E
0845 A0 08 LDY #$08
0847 E0 00 CPX #$00
0849 D0 F9 BNE $0844

could someone explain how this works ? :D

2021-01-31 11:34

Krill

Registered: Apr 2002
Posts: 2839

Quoting Oswald

0843 A2 9E LDX #$9E
0845 A0 08 LDY #$08
0847 E0 00 CPX #$00
0849 D0 F9 BNE $0844

could someone explain how this works ? :D

Corrected version of the code and Copyfault's explanation annotated:

08A3 A2 9E   LDX #$9E
08A5 A0 08   LDY #$08
08A7 E0 00   CPX #$00
08A9 D0 F9   BNE $08A4

Branching to $08A4 leads to SHX $08A0,Y with Y = 8, so the operand byte of the CPX #imm at $08A8 is altered continuously.
As long as the "&(hi+1)" plays in, the CPX #imm operand will be $9E & $09 = $08, which is not equal X = $9E, so branching back.
When "&(hi+1)" disappears, the full $9E is written to the CPX's operand byte and the loop will end (X = $9E with CPX #$9E will yield Z=1). This happens if and only if the critical SHX-cycle appears on a badline.
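The two values the CPX operand can take are easy to check by hand or in Python. A minimal sketch of the store behaviour exactly as described in the annotation above; the function name and the and_dropped flag are hypothetical, and page-crossing effects of SHX are deliberately not modelled.

```python
# Sketch only: models SHX abs,Y as described in the post above.
def shx_store(x, base, y, and_dropped=False):
    """Return (target address, value written) for SHX base,Y.

    Normally the value is X & (high byte of base + 1); when the "&(hi+1)"
    term drops off (critical cycle hit by DMA), plain X is written.
    """
    target = (base + y) & 0xFFFF
    if and_dropped:
        value = x
    else:
        value = x & (((base >> 8) + 1) & 0xFF)
    return target, value


if __name__ == "__main__":
    X, BASE, Y = 0x9E, 0x08A0, 0x08
    for dropped in (False, True):
        target, value = shx_store(X, BASE, Y, dropped)
        leaves_loop = value == X   # CPX #value sets Z=1 only when value == X
        print("dropped=%s: $%04x <- $%02x, loop exits: %s"
              % (dropped, target, value, leaves_loop))
```

The store always lands on $08A8, the CPX operand; only the dropped case writes $9E and lets the BNE fall through.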

2021-01-31 12:35

Oswald

Registered: Apr 2002
Posts: 5017

why does the &hi disappear on a badline?

2021-01-31 12:39

chatGPZ

Registered: Dec 2001
Posts: 11113

thats what the opcode does - why exactly it happens is unknown, but its likely some analog effect

2021-01-31 13:21

Krill

Registered: Apr 2002
Posts: 2839

Yes, something to do with DMA interference* disturbing the inner workings of that opcode. And it's a non-intended (illegal) opcode anyways, so short-circuiting inner logic to begin with. :)

* It has been observed that other 6502-based platforms without DMA (1541, e.g.) do not exhibit the &H-dropoff behaviour.

2021-01-31 15:04

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Oswald

why does the &hi disappear on a badline?

In my text "badline" was short for "the 1st DMA-overtake-cycle @cycle#12 in every badline".

If the 4th cycle of the SHX abs,Y coincides with a DMA-overtake cycle, the &(hi+1) drops off, see e.g. post#88 or the latest No More Secrets V0.95.

Of course all sprites must be turned off while the routine is running (they'd throw in other DMA-overtake-cycles and break the whole thing). So the basic trick (courtesy of Quiss;)) is to uniquely mark badlines with the DMA-overtake-cycles and loop until the aforementioned 4th cycle of the SHX lands on the correct cycle (or short: "on a badline";)).

2021-02-01 00:44

Devia

Registered: Oct 2004
Posts: 401

What an awful thread.. fell down the rabbit hole, turned utterly insane, lost several days of sleep...
Quiss' original approach is short and elegant, but opcode rewriting confuses my old brain.
Adding a single byte to its size, the flexibility can be increased somewhat (from 64 to 112 possible pages) and no opcode rewrites, just operand rewrite.

	* = $xy00	; where x=$0-$f and y=$0-$6
loop:	ldy	#$08
	ldx	#$0a
	shx	loop,y
	jmp	loop
$xx0A:

Loop times kept at either 10 or 12 cycles, depending on address choice.

As an added bonus, A is not touched - which may or may not matter in your overall timer setup.

So, it's obvious my priority is not size, but readability - and I find this approach a tad more readable ;-)
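One reading of where the 112 pages and the two loop times come from, sketched in Python. The interpretation is an assumption, not something the post spells out: the SHX rewrites the low byte of the JMP operand with X & (hi+1), so the JMP lands on $xy00 (the full 12-cycle loop) or on $xy02 (skipping the LDY, 10 cycles), and only the unmasked $0a leaves the loop. The cycle counts are the standard ones for LDY #imm, LDX #imm and JMP abs plus 5 for SHX abs,Y.

```python
# Sketch only: enumerates the pages stated in the post and the loop length
# each one would give under the interpretation described above.
X_VALUE = 0x0A
LOOP_CYCLES = {0x00: 2 + 2 + 5 + 3,   # jmp -> ldy, ldx, shx, jmp
               0x02: 2 + 5 + 3}       # jmp -> ldx, shx, jmp


def jmp_low_byte(hi):
    """Low byte written into the JMP operand while the AND term is present."""
    return X_VALUE & ((hi + 1) & 0xFF)


def loop_cycles(hi):
    """Loop length for a page, or None if the JMP would land somewhere else."""
    return LOOP_CYCLES.get(jmp_low_byte(hi))


def stated_pages():
    """Pages $xy with x = $0-$f and y = $0-$6, mapped to their loop length."""
    return {(x << 4) | y: loop_cycles((x << 4) | y)
            for x in range(16) for y in range(7)}


if __name__ == "__main__":
    pages = stated_pages()
    print(len(pages), "pages, loop lengths:", sorted(set(pages.values())))
```

Under this reading every page in the stated range yields a 10- or 12-cycle loop, while a page such as $07 would send the JMP to $0708, in the middle of the JMP instruction itself.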

2021-02-01 10:19

Frantic

Registered: Mar 2003
Posts: 1627

Yes, this thread should have some kind of warning sign attached to it. :)

2021-02-01 15:52

Copyfault

Registered: Dec 2001
Posts: 466

Quoting Devia

What an awful thread.. fell down the rabbit hole, turned utterly insane, lost several days of sleep...

How can it be so awful when it guided you into Wonderland :) ??

Quoting Devia

Quiss' original approach is short and elegant, but opcode rewriting confuses my old brain.
Adding a single byte to its size, the flexibility can be increased somewhat (from 64 to 112 possible pages) and no opcode rewrites, just operand rewrite.

	* = $xy00	; where x=$0-$f and y=$0-$6
loop:	ldy	#$08
	ldx	#$0a
	shx	loop,y
	jmp	loop
$xx0A:

Loop times kept at either 10 or 12 cycles, depending on address choice.

Wow, loop times depending on the address high byte. But yeah, a JMP-instruction is needed to change the basic structure from opcode-change to operand-change... (unless one wants to exit from the loop by jumping back, but let's not go that route)

Quoting Devia

As an added bonus, A is not touched - which may or may not matter in your overall timer setup.

So, it's obvious my priority is not size, but readability - and I find this approach a tad more readable ;-)

Hm, ok, I agree that readability was not the main focus most of the times... but at least some variants that leave A untouched were given already (basically those with a CPX#val).

Still, I think the snippet presented in post#90 (and all variants thereof) is best regarding trade-off between readability and size. And it imposes almost no restriction on the highbyte that can be used;)

Thanks for your version; it just rang the "things that I wanted to do"-bell for me (writing it all up on codebase, that is;)).

2021-02-01 18:10

Devia

Registered: Oct 2004
Posts: 401

Quoting Copyfault

Still, I think the snippet presented in post#90 (and all variants thereof) is best regarding trade-off between readability and size. And it imposes almost no restriction on the highbyte that can be used;)

That's what I mean about the insanity..I do remember reading that post and the following ones, but apparently didn't recognize their brilliancy the first time around - that one is a keeper! ;-)

2021-02-01 19:06

Oswald

Registered: Apr 2002
Posts: 5017

is there any requirement for the number of cycles for the whole loop to make sure it will 'always' land on a different cycle of the shx opcode on the badline's 12th cycle?

2021-02-01 19:44

Rastah Bar

Registered: Oct 2012
Posts: 336

Yes, see posts 61 and 72 (and 69 for an explanation):

"In the range 5-20, the lengths that do work are 5, 10, 12, 16, 18 and 19. But note that in particular, length 8 (a.k.a. branching directly to the SHX) does not."

"Indeed, loop length of 17 gets "fixed" by the border. Depending on which cycle you land on initially, loop exit gets delayed by up to three frames, but it'll eventually align.
A similar thing happens with 27, which takes up to four frames to align.
Those seem to be the only "special" (multi-frame) cases below 30."

2021-02-01 20:47

Oswald

Registered: Apr 2002
Posts: 5017

thanks to everyone for the explanations, this shx method is really cool :)

2023-05-11 19:00

Quiss

Registered: Nov 2016
Posts: 37

Sometimes, you want to set up a timer that loops once per frame, not per line, and on a stable cycle.
The shx code snippets posted so far give us a constant x position, but on a random y, so won't work for that.

But you can do:

        ldy #$01
loop:
        ldx #243         ; or any other badline
        cpx $d012
        bne *-3
        shy $ffff, x
        lda $ff, x
        beq loop

This always drops us out on line 243 cycle 62, and takes up to seven frames to do so.

All times are CET.