[CSDb] - User Forums - Improved clock-slide

2017-02-28 07:21

lft

Registered: Jul 2007
Posts: 369

Improved clock-slide

If you use timer-based jitter correction, or just VSP, here's a way to shave off one cycle:

http://codebase64.org/doku.php?id=base:improved_clockslide

2017-02-28 07:40

oziphantom

Registered: Oct 2014
Posts: 490

There is a special satisfaction for when one makes a thorn into a rose... the French probably have a word for it... but that is what I felt when I read that.

2017-02-28 09:50

Trash

Registered: Jan 2002
Posts: 122

Not only a cycle, a byte also...

2017-02-28 09:51

chatGPZ

Registered: Dec 2001
Posts: 11386

that optimization is so obvious... i feel silly now :(

2017-02-28 09:56

JackAsser

Registered: Jun 2002
Posts: 2014

Now put that bpl at $00ff to shave off another byte and cycle and do the jitter-compensation and IRQ-handling on $01xx.

2017-02-28 10:13

lft

Registered: Jul 2007
Posts: 369

I'm afraid that would leave the page-boundary in the wrong place.

2017-02-28 10:47

Martin Piper

Registered: Nov 2007
Posts: 722

Nicely done. Bravo.

2017-02-28 13:49

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: I'm afraid that would leave the page-boundary in the wrong place.

True. At $00fx then. But this of course applies to the standard method as well.

2017-02-28 14:58

Copyfault

Registered: Dec 2001
Posts: 478

So simple, so beautiful ;)

Now that I come to think of this, it should be possible to shave off another byte by utilising that other additional branch cycle (in case of a taken branch).

Assuming that the accu holds the no. of bytes to skip (which is common for this approach) we could do the following:

;-----------------------------
;A=0..n-1=no.of bytes to skip
;must have been calculated
;directly before to ensure
;correct setting of the z-flag
;-----------------------------
     sta bra+1
bra  bne *
     nop       ;2 cycles
     lda #$a9  ;2
     lda #$a9  ;2
     lda #$a9  ;2
     lda $ea   ;3
;-----------------------------
;page break here
;-----------------------------
code ...

If A=0, the branch is not taken; thus the slide runs in full and takes 11 cycles (not counting the 2 cycles of the untaken branch itself). For a non-zero A, the NOP instruction is skipped (-2 cycles) but the additional "branch taken" cycle comes in. Mind that if you want to slide down to a 2-cycle delay, the page break is mandatory! A one-cycle delay is not possible, but this is the same with the "BPL" instruction, which always comes with that additional cycle.
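The cycle arithmetic above can be checked with a small Python model. This is only a sketch, not an emulator: it assumes the standard 6502 timings (NOP 2, LDA #imm 2, LDA zp 3, branch 2 if not taken, 3 if taken, 4 if taken across a page boundary), and `SLIDE_START` is a hypothetical address chosen so that the slide ends exactly at a page end.

```python
# Byte stream of the slide: nop / lda #$a9 x3 / lda $ea
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])
OPS = {0xEA: (1, 2), 0xA9: (2, 2), 0xA5: (2, 3)}  # opcode -> (size, cycles)
SLIDE_START = 0x1000 - len(SLIDE)  # hypothetical: `code` starts at $1000


def slide_cycles(offset):
    """Cycles spent in the slide when execution enters at byte `offset`."""
    total = 0
    while offset < len(SLIDE):
        size, cycles = OPS[SLIDE[offset]]
        offset += size
        total += cycles
    if offset != len(SLIDE):
        raise ValueError("entry point does not end exactly at `code`")
    return total


def bne_and_slide(a):
    """Cycles from the start of `bne *` (operand patched to a) until `code`."""
    if a == 0:                       # Z set: not taken, whole slide runs
        return 2 + slide_cycles(0)
    target = SLIDE_START + a         # PC after the branch is SLIDE_START
    crossed = (target & 0xFF00) != (SLIDE_START & 0xFF00)
    return 3 + crossed + slide_cycles(a)


costs = [bne_and_slide(a) for a in range(10)]
print(costs)  # [13, 12, 11, 10, 9, 8, 7, 6, 5, 4]
```

Under these assumptions each increment of A removes exactly one cycle, and the A=9 case only fits the sequence because of the page-crossing penalty.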

2017-02-28 15:07

Frantic

Registered: Mar 2003
Posts: 1648

I think the best part is that LFT wrote a Codebase article about it *before* I had to ask him about it.

2017-02-28 19:56

lft

Registered: Jul 2007
Posts: 369

Copyfault, that is an excellent improvement!

2017-03-02 10:35

ChristopherJam

Registered: Aug 2004
Posts: 1409

Oh that's gorgeous. Nice work, both of you!

2017-03-02 12:08

Frantic

Registered: Mar 2003
Posts: 1648

@Copyfault: Don't hesitate to write about that improvement on Codebase. If LFT doesn't mind, perhaps it can be written as an extension of his article?

2017-03-02 22:38

Copyfault

Registered: Dec 2001
Posts: 478

Reading the reactions to my idea makes me happy&smile :))

But to be fair: my ideas are not as much of an optimization as they look at first glance!

Sticking to lft's example, the routine can cope with a jitter of 10, which means 11 different latencies (or cycle delay states, as I prefer to call them). Applying my "optimization", the number of delay states drops by one.

;-----------------------
;cycle no. taken from
;lft's example
;-----------------------
                 ;32..41 (31 not poss. with the opt)
                 ;A=0..9 (10 not poss.)
     sta bra+1   
                 ;36..45
bra  bne *       
                 ;38 (only for A=0)
     nop         
                 ;40 (branch taken, nop skipped)
     lda #$a9    
                 ;42
     lda #$a9    
                 ;44
     lda #$a9    
                 ;46
     lda $ea     
                 ;49
;-----------------------------
;page break here
;-----------------------------
code ...

You can easily see that the delay state "A=10" is no longer handled after the optimization (it would most probably lead to a crash due to a branch to "code+1"), whereas it is fully treated in lft's approach.
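The annotated cycle numbers can be replayed in a few lines of Python. This is a sketch under stated assumptions: standard 6502 timings, a hypothetical address `CODE = $1000` for the page break, and the branch starting at cycle 36 + A (read off the "36..45" annotation).

```python
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])
OPS = {0xEA: (1, 2), 0xA9: (2, 2), 0xA5: (2, 3)}  # opcode -> (size, cycles)
CODE = 0x1000                    # hypothetical address of `code` (page start)
SLIDE_START = CODE - len(SLIDE)  # first slide byte, right after `bne *`


def cycles_from_bne(a):
    """Return (cycles, landing address) for `bne *` patched with operand a."""
    taken = a != 0
    pos = SLIDE_START + (a if taken else 0)
    cycles = 2                   # branch not taken
    if taken:                    # 3 cycles, 4 if the target is on another page
        cycles = 4 if (pos & 0xFF00) != (SLIDE_START & 0xFF00) else 3
    while pos < CODE:            # run slide instructions until `code`
        size, cost = OPS[SLIDE[pos - SLIDE_START]]
        pos += size
        cycles += cost
    return cycles, pos


# Assumption: the bne starts at cycle 36 + A.
arrivals = [36 + a + cycles_from_bne(a)[0] for a in range(10)]
print(arrivals)                       # every entry is 49
print(cycles_from_bne(10)[1] - CODE)  # 1, i.e. A=10 lands on code+1
```

All ten states A=0..9 reach `code` at cycle 49 in this model, while A=10 jumps one byte past the start of `code`.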

Thus the idea I had is more of a "cosmetic kind". In order to fully take advantage of that extra "branch taken" cycle, one would have to ensure that the very first byte of the clock slide code is reached by a taken branch (for the "A=0"-case, i.e. the one which has to compensate the most cycles) and also just by passing that branch instruction (usually the "A=1"-case), but this would require some more touch-up of the accu before starting the actual dejitter part.

So before feeding the codebase I better ask here if you still want the idea to be added there.

2017-03-02 23:18

Copyfault

Registered: Dec 2001
Posts: 478

Speaking of that touch-up of the accu in my previous post, one could do it using table lookup. So the code would be something like

;---------------------------------
;trying to stick to lft's example
;with all the cycle numbers
;---------------------------------
                 ;23..33
     ldx timer   
                 ;27..37
     lda table,x
                 ;31..41
                 ;A=0..10
     sta bra+1   
                 ;35..45
bra  bpl *       
                 ;38 (35+3 for A=0 or 36+2 for A=$ff)
     nop         ;40
     lda #$a9    ;42
     lda #$a9    ;44
     lda #$a9    ;46
     lda $ea     ;49
code ...

table
     !by $09,$08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
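A quick Python model of this table-driven variant, as a sketch only. Assumptions not stated in the listing: standard 6502 timings, a page break directly before `code` as in the earlier listings (hypothetical address $1000), no page crossing in the table read, and X = 33 minus the start cycle of `ldx timer` (the timer counts down, so a later start reads a smaller value).

```python
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])
OPS = {0xEA: (1, 2), 0xA9: (2, 2), 0xA5: (2, 3)}  # opcode -> (size, cycles)
TABLE = [0x09, 0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0xFF, 0x00]
CODE = 0x1000                    # hypothetical page start holding `code`
SLIDE_START = CODE - len(SLIDE)  # first slide byte, right after `bpl *`


def cycles_from_bpl(a):
    """Cycles from the start of `bpl *` (operand a, N flag from a) to `code`."""
    if a & 0x80:                 # negative: branch not taken, full slide
        pos, cycles = SLIDE_START, 2
    else:                        # taken; operand doubles as the skip count
        pos = SLIDE_START + a
        cycles = 4 if (pos & 0xFF00) != (SLIDE_START & 0xFF00) else 3
    while pos < CODE:
        size, cost = OPS[SLIDE[pos - SLIDE_START]]
        pos += size
        cycles += cost
    return cycles


arrivals = []
for x, a in enumerate(TABLE):
    start = 33 - x               # assumed mapping of timer value to start cycle
    bpl_at = start + 4 + 4 + 4   # ldx abs, lda abs,x, sta abs
    arrivals.append(bpl_at + cycles_from_bpl(a))
print(arrivals)                  # eleven times 49
```

In this model the $ff entry supplies the 13-cycle state (untaken branch plus full slide) and the $00 entry the 14-cycle state (taken branch to the first slide byte).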

This way all 11 different delay states can be coped with, but at the cost of extra bytes for the table. Now if I want to be smart, I'd align the table to also have a page break for the "A=0"-case ;))

;---------------------------------
                 ;23..33
     ldx timer   
                 ;27..37
     lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
                 ;32..41
                 ;A=0..9
     sta bra+1   
                 ;36..45
bra  bpl *       
                 ;39 (36+3 for A=0 or 37+2 for A=$ff)
     lda #$a9    ;41
     lda #$a9    ;43
     lda #$a5    ;45
     nop         ;48 (ends one cycle earlier as the first dejitter cycle is the lookup-table penalty cycle)
code ...

table
     !by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
     !by $00

Needs even one byte less for the clock slide part... but of course, any advantage is eaten up by all the drawbacks like page-break requirements (now for that table also!), need for an index register, higher "minimum overhead cost" (lda #const: sbc timer is cheaper in this respect!), etc.

But maybe this idea qualifies a little better for a contribution to the mighty codebase?!??

[Edit]
Oops, that was too optimistic ;)) Of course it must be

;---------------------------------
                 ;23..33
     ldx timer   
                 ;27..37
     lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
                 ;32..41
                 ;A=0,0,$ff,1,..,8 (see table)
     sta bra+1   
                 ;36..45
bra  bpl *       
                 ;39 (36+3 for A=0 or 37+2 for A=$ff)
     lda #$a9    ;41
     lda #$a9    ;43
     lda #$a9    ;45
     lda $ea     ;48 (ends one cycle earlier as the first dejitter cycle is the lookup-table penalty cycle)
code ...

table
     !by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
     !by $00

The clock slide part is of course _one_ byte less, not two ;p

2017-03-04 06:11

ChristopherJam

Registered: Aug 2004
Posts: 1409

Mind like a sieve. Look what I found on an old disk image from somewhere around 1989-1992.

Pretty sure I got the BPL from John West, after he independently discovered VSP some time around 1989-90

edit: argh, all this proves is that we *didn't* discover copyfault's improvement. I need some more sleep.

Also, welcome to my horrible source code from before I recanted my 'all cross developing is cheating' stance.

2017-03-04 07:53

oziphantom

Registered: Oct 2014
Posts: 490

6510+?

2017-03-04 08:31

ChristopherJam

Registered: Aug 2004
Posts: 1409

FASSEM

2017-03-04 11:21

Copyfault

Registered: Dec 2001
Posts: 478

Woke up this morning with the thought "Ending up on cycle 48 feels odd somehow...". After taking a shower it became clear to me. Scratch that bullshit with the "ends up one cycle earlier...". It's simply wrong! Instead, the clock slide code part must start with a NOP.

;---------------------------------
                 ;23..33
     ldx timer   
                 ;27..37
     lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
                 ;32..41
                 ;A=0,0,$ff,1,..,8 (see table)
     sta bra+1   
                 ;36..45
bra  bpl *       
                 ;39 (36+3 for A=0 or 37+2 for A=$ff)
     nop         ;41
     lda #$a9    ;43
     lda #$a9    ;45
     lda #$a5    ;47
     nop         ;49
;-------
;pb here
;-------
code ...

table
     !by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
     !by $00
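The corrected listing can be sanity-checked with a Python model. It is a sketch under assumptions: standard 6502 timings (including the extra cycle of `lda abs,x` when the indexed address crosses a page), hypothetical addresses `CODE` and `TABLE_ADDR` chosen to produce the two page breaks, and X = 33 minus the start cycle of `ldx timer`.

```python
# Slide bytes: nop / lda #$a9 / lda #$a9 / lda #$a5 / nop
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])
OPS = {0xEA: (1, 2), 0xA9: (2, 2), 0xA5: (2, 3)}  # opcode -> (size, cycles)
TABLE = [0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0xFF, 0x00, 0x00]
CODE = 0x1000                    # hypothetical: page break before `code`
SLIDE_START = CODE - len(SLIDE)
TABLE_ADDR = 0x20F6              # hypothetical: entry 10 sits on the next page


def cycles_from_bpl(a):
    """Cycles from the start of `bpl *` (operand a) until `code` is reached."""
    if a & 0x80:                 # $ff: not taken, full slide runs
        pos, cycles = SLIDE_START, 2
    else:
        pos = SLIDE_START + a
        cycles = 4 if (pos & 0xFF00) != (SLIDE_START & 0xFF00) else 3
    while pos < CODE:
        size, cost = OPS[SLIDE[pos - SLIDE_START]]
        pos += size
        cycles += cost
    return cycles


arrivals = []
for x, a in enumerate(TABLE):
    start = 33 - x                                    # assumed timer mapping
    lda_cost = 4 + ((TABLE_ADDR & 0xFF) + x > 0xFF)   # +1 on page crossing
    bpl_at = start + 4 + lda_cost + 4                 # ldx, lda table,x, sta
    arrivals.append(bpl_at + cycles_from_bpl(a))
print(arrivals)                  # eleven times 49
```

Here the two earliest states (X=9 and X=10) both load A=0; they are told apart only by the page-crossing penalty of the table read.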

@Frantic: codebase-worthy in this state?

2017-03-06 09:53

ChristopherJam

Registered: Aug 2004
Posts: 1409

Yes, all you need is to start and end with NOP.

Kind of wondering about bit shifting approaches now.. For A in 0..3:

  lsr a
  bcs plus1   ;2 or 3 cycles
plus1
  lsr a
  bcs plus2  ;2 or 4 cycles
plus2     ; <- page boundary

..But do you still get the extra cycle cost when branching to the next instruction?

2017-03-06 10:36

HCL

Registered: Feb 2003
Posts: 728

..that's the kind of timing i have seen in some Crest-demos i think. It saves some space, but needs a few more cycles for LSR and an extra branch.

    lsr
    sta br+1
    bcc br
br: bpl..
    nop
    nop
    ..
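As a rough illustration of why this works, here is a Python cycle count for the fragment above. It is a sketch with assumptions: the number of NOPs is elided in the post, so `N_NOPS = 4` is an arbitrary choice; `sta br+1` is taken to be an absolute store (4 cycles); and no branch crosses a page boundary.

```python
N_NOPS = 4  # hypothetical slide length; the post elides the real count


def crest_style_cycles(a):
    """Cycles from `lsr` to the end of the NOP slide for delay value a."""
    odd = a & 1            # bit 0 ends up in carry after lsr
    skip = a >> 1          # remaining value is patched in as the bpl operand
    cycles = 2             # lsr
    cycles += 4            # sta br+1
    cycles += 2 if odd else 3          # bcc br: taken (3) only if carry clear
    cycles += 3            # bpl, always taken since skip >= 0
    cycles += 2 * (N_NOPS - skip)      # NOPs that still get executed
    return cycles


costs = [crest_style_cycles(a) for a in range(2 * N_NOPS + 2)]
print(costs)  # [20, 19, 18, 17, 16, 15, 14, 13, 12, 11]
```

The `bcc` to the very next instruction contributes the single-cycle step, so the slide itself only needs 2-cycle granularity.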

2017-03-06 10:38

chatGPZ

Registered: Dec 2001
Posts: 11386

that's the one posted by hannes sommer in 64er mag about a hundred years ago... :=)

2017-03-06 13:41

Ninja

Registered: Jan 2002
Posts: 411

Using page-crossing branches to waste a cycle is not exactly brand new either... ;)

2017-03-06 22:52

Copyfault

Registered: Dec 2001
Posts: 478

If I leave out the "Ninja-method" for the moment (which is unbeatable without question) my favourite way of dejittering looks like this:

        lda timer   ;synced to give A=$17,...,$10
        lsr         ;A=$0b,$0b,$0a,$0a,$09,$09,$08,$08, C set for odd values
        bcs .skip1  ;that bit shifting trick found in Crest-demos as mentioned before by HCL
.skip1  asr #$03    ;after "AND #$03": A=$03,$03,$02,$02,$01,$01,$00,$00
                    ;after "LSR": A=$01,$01,$01,$01,$00,$00,$00,$00, C set for odd values
        bcc .skip2  ;waste 2 cycles if C set
        bcs .skip2
.skip4  bne .end    ;waste 4 cycles for non-vanishing A
.skip2  bne .skip4		
.end    
        ...
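The branch chain above can be traced in Python. This is a sketch assuming standard 6502 timings, `lda timer` as an absolute load (4 cycles), `asr #imm` (AND then LSR) at 2 cycles, and no branch crossing a page boundary.

```python
def dejitter_cycles(timer):
    """Cycles from `lda timer` to `.end` for a timer value in $10..$17."""
    cycles = 4                        # lda timer
    a, carry = timer >> 1, timer & 1
    cycles += 2                       # lsr
    cycles += 3 if carry else 2       # bcs .skip1 (target is next instruction)
    a &= 0x03                         # asr #$03: AND #$03 ...
    a, carry = a >> 1, a & 1          # ... followed by LSR
    cycles += 2
    cycles += 5 if carry else 3       # bcc .skip2 taken, or bcc + bcs .skip2
    cycles += 6 if a else 2           # bne .skip4 + bne .end, or fall through
    return cycles


for t in range(0x17, 0x0F, -1):
    print(hex(t), dejitter_cycles(t))

# Assumption: the read happens ($17 - t) cycles later than the earliest case.
arrivals = [(0x17 - t) + dejitter_cycles(t) for t in range(0x10, 0x18)]
print(arrivals)                       # eight times 22
```

In this model the routine costs 15 + (timer AND 7) cycles: the three branch stages add 1, 2 and 4 cycles for bits 0, 1 and 2 of the timer value.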

This is more or less a slightly "stretched" version of the approach found in the Crest-demos. No need for any pb's, even no need for an sbc/eor. It can cope only with eight different delay states, though.

I tried to find a way to get rid of one branch instruction but didn't succeed. Either it's totally trivial and I'm just blind or it is UNpossible :))

2017-03-06 23:13

Copyfault

Registered: Dec 2001
Posts: 478

Quoting ChristopherJam

[...]
Kind of wondering about bit shifting approaches now.. For A in 0..3:

  lsr a
  bcs plus1   ;2 or 3 cycles
plus1
  lsr a
  bcs plus2  ;2 or 4 cycles
plus2     ; <- page boundary

..But do you still get the extra cycle cost when branching to the next instruction?

Iirc you don't get that page-crossing penalty cycle if the branch instruction is at a page end (maybe it was the other way around that you always get it also for a non-taken branch, don't remember correctly anymore). Either way, this

...
lsr
bcs plus2
;---
;pb
;---
plus2

won't work as expected to compensate a two-cycle jitter step.

2017-03-07 05:45

lft

Registered: Jul 2007
Posts: 369

Correct. The offset in the branch instruction is added to the PC, but the PC has already been incremented to the new page so there's no carry and hence no extra cycle.
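The rule described here can be written down as a small Python helper. It is a sketch of the timing rule only; the function name and the example addresses are made up for illustration.

```python
def taken_branch_cycles(branch_addr, offset):
    """Cycles for a taken 6502 branch at branch_addr with operand byte offset.

    The penalty depends on the PC *after* the two-byte branch instruction,
    not on the address of the branch itself.
    """
    pc_after = (branch_addr + 2) & 0xFFFF
    signed = offset - 0x100 if offset & 0x80 else offset
    target = (pc_after + signed) & 0xFFFF
    return 4 if (target ^ pc_after) & 0xFF00 else 3


# Branch in the last two bytes of a page, target = next instruction:
print(taken_branch_cycles(0x10FE, 0x00))  # 3, no penalty
# Branch a few bytes before the page end, target on the next page:
print(taken_branch_cycles(0x10F5, 0x09))  # 4, page-crossing penalty
```

So a branch sitting at the very end of a page and targeting the following instruction costs the same 3 cycles as any other taken branch.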

2017-03-07 06:04

ChristopherJam

Registered: Aug 2004
Posts: 1409

Aww, I had a bad feeling about that page crossing. Thanks guys.

I also gather the CIA timers never read zero when they're counting down? (which kills another idea I had..)
