| |
lft
Registered: Jul 2007 Posts: 369 |
Improved clock-slide
If you use timer-based jitter correction, or just VSP, here's a way to shave off one cycle:
http://codebase64.org/doku.php?id=base:improved_clockslide |
|
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
There is a special satisfaction for when one makes a thorn into a rose... the French probably have a word for it... but that is what I felt when I read that. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Not only a cycle, a byte also... |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
that optimization is so obvious... i feel silly now :( |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Now put that bpl at $00ff to shave off another byte and cycle and do the jitter-compensation and IRQ-handling on $01xx. |
| |
lft
Registered: Jul 2007 Posts: 369 |
I'm afraid that would leave the page-boundary in the wrong place. |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
Nicely done. Bravo. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I'm afraid that would leave the page-boundary in the wrong place.
True. At $00fx then. But this of course applies to the standard method as well. |
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
So simple, so beautiful ;)
Now that I come to think of this, it should be possible to shave off another byte by utilising that other additional branch cycle (in case of a taken branch).
Assuming that the accu holds the no. of bytes to skip (which is common for this approach) we could do the following:
;-----------------------------
;A=0..n-1=no.of bytes to skip
;must have been calculated
;directly before to ensure
;correct setting of the z-flag
;-----------------------------
sta bra+1
bra bne *
nop ;2 cycles
lda #$a9 ;2
lda #$a9 ;2
lda #$a9 ;2
lda $ea ;3
;-----------------------------
;page break here
;-----------------------------
code ...
If A=0, the branch is not taken; thus the slide instructions add up to 11 cycles (on top of the 2 cycles of the non-taken branch). In case of a nonzero A, the NOP instruction is skipped (-2 cycles) but the additional "branch taken" cycle comes in. Mind that if you want to slide down to a 2-cycle delay, the page break is mandatory! A one-cycle delay is not possible, but this is the same with the "BPL" instruction, which always comes with that additional cycle. |
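The cycle counts of the listing above can be checked with a small model. This is only a sketch in Python, not part of the original post: it knows just the three opcodes that occur in the slide, and the address `CODE_ADDR` is a hypothetical placement chosen so that `code` starts on a new page.

```python
# Minimal cycle model of the slide; only the three opcodes used here are known.
OPCODES = {
    0xEA: (1, 2),  # nop       -> (size in bytes, cycles)
    0xA9: (2, 2),  # lda #imm
    0xA5: (2, 3),  # lda zp
}

# nop / lda #$a9 / lda #$a9 / lda #$a9 / lda $ea
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])

CODE_ADDR = 0x1000                   # hypothetical address: "code" starts a page
SLIDE_ADDR = CODE_ADDR - len(SLIDE)  # also the PC right after the 2-byte bne


def slide_cycles(a):
    """Cycles from the start of `bne` until `code` is reached, for operand a."""
    if a == 0:
        cycles, pc = 2, 0            # Z set: branch not taken
    else:
        target = SLIDE_ADDR + a
        crossed = (target >> 8) != (SLIDE_ADDR >> 8)
        cycles, pc = 3 + crossed, a  # taken branch, +1 on a page crossing
    while pc < len(SLIDE):
        size, cost = OPCODES[SLIDE[pc]]
        pc += size
        cycles += cost
    assert pc == len(SLIDE), "decoding ran past the slide"
    return cycles


delays = [slide_cycles(a) for a in range(10)]
print(delays)  # [13, 12, 11, 10, 9, 8, 7, 6, 5, 4]
```

Under these assumptions each step of A removes exactly one cycle, and the last step (A=9, branching straight to `code`) only works out because of the page-crossing cycle.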
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
I think the best part is that LFT wrote a Codebase article about it *before* I had to ask him about it. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Copyfault, that is an excellent improvement! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh that's gorgeous. Nice work, both of you! |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
@Copyfault: Don't hesitate to write about that improvement on Codebase. If LFT doesn't mind, perhaps it can be written as an extension of his article? |
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
Reading the reactions to my idea makes me happy&smile :))
But to be fair: my idea is not as much of an optimization as it looks at first glance!
Sticking to lft's example, the routine can cope with a jitter of 10 which means 11 different latencies (or cycle delay states as I prefer to call it). Applying my "optimization" the number of delay states drops by one.
;-----------------------
;cycle no. taken from
;lft's example
;-----------------------
;32..41 (31 not poss. with the opt)
;A=0..9 (10 not poss.)
sta bra+1
;36..45
bra bne *
;38 (only for A=0)
nop
;40 (branch taken, nop skipped)
lda #$a9
;42
lda #$a9
;44
lda #$a9
;46
lda $ea
;49
;-----------------------------
;page break here
;-----------------------------
code ...
You can easily see that the delay state "A=10" is no longer handled after the optimization (it would most probably lead to a crash due to a branch to "code+1"), whereas it is fully treated in lft's approach.
Thus the idea I had is more of a "cosmetic kind". In order to fully take advantage of that extra "branch taken" cycle, one would have to ensure that the very first byte of the clock slide code is reached by a taken branch (for the "A=0"-case, i.e. the one which has to compensate the most cycles) and also just by passing that branch instruction (usually the "A=1"-case), but this would require some more touch-up of the accu before starting the actual dejitter part.
So before feeding the codebase I better ask here if you still want the idea to be added there. |
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
Speaking of that touch-up of the accu in my previous post, one could do it using table lookup. So the code would be smth like
;---------------------------------
;trying to stick to lft's example
;with all the cycle numbers
;---------------------------------
;23..33
ldx timer
;27..37
lda table,x
;31..41
;A=0..10
sta bra+1
;35..45
bra bpl *
;38 (35+3 for A=0 or 36+2 for A=$ff)
nop ;40
lda #$a9 ;42
lda #$a9 ;44
lda #$a9 ;46
lda $ea ;49
code ...
table
!by $09,$08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
This way all 11 different delay states can be coped with, but at the cost of extra bytes for the table. Now if I want to be smart, I'd align the table to also have a page break for the "A=0"-case ;))
;---------------------------------
;23..33
ldx timer
;27..37
lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
;32..41
;A=0..9
sta bra+1
;36..45
bra bpl *
;39 (36+3 for A=0 or 37+2 for A=$ff)
lda #$a9 ;41
lda #$a9 ;43
lda #$a5 ;45
nop ;48 (ends one cycle earlier as the first dejitter cycle is the lookup-table penalty cycle)
code ...
table
!by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
!by $00
Needs even one byte less for the clock slide part... but of course, any advantage is eaten up by all the drawbacks like page-break requirements (now for that table also!), need for an index register, higher "minimum overhead cost" (lda #const: sbc timer is cheaper in this respect!), etc.
But maybe this idea qualifies a little better for a contribution to the mighty codebase?!??
[Edit]
Oops, that was too optimistic ;)) Of course it must be
;---------------------------------
;23..33
ldx timer
;27..37
lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
;32..41
;A=0,0,$ff,1,..,8 (see table)
sta bra+1
;36..45
bra bpl *
;39 (36+3 for A=0 or 37+2 for A=$ff)
lda #$a9 ;41
lda #$a9 ;43
lda #$a9 ;45
lda $ea ;48 (ends one cycle earlier as the first dejitter cycle is the lookup-table penalty cycle)
code ...
table
!by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
!by $00
The clock slide part is of course _one_ byte less, not two ;p |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Mind like a sieve. Look what I found on an old disk image from somewhere around 1989-1992.
Pretty sure I got the BPL from John West, after he independently discovered VSP some time around 1989-90
edit: argh, all this proves is that we *didn't* discover copyfault's improvement. I need some more sleep.
Also, welcome to my horrible source code from before I recanted my 'all cross developing is cheating' stance. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
6510+? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
FASSEM |
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
Woke up this morning with the thought "Ending up on cycle 48 feels odd somehow...". After taking a shower it became clear to me. Scratch that bullshit with the "ends up one cycle earlier...". It's simply wrong! Instead, the clock slide code part must start with a NOP.
;---------------------------------
;23..33
ldx timer
;27..37
lda table,x ;if timer holds the max-val, the table access reads above the page end -> extra cycle!
;32..41
;A=0,0,$ff,1,..,8 (see table)
sta bra+1
;36..45
bra bpl *
;39 (36+3 for A=0 or 37+2 for A=$ff)
nop ;41
lda #$a9 ;43
lda #$a9 ;45
lda #$a5 ;47
nop ;49
;-------
;pb here
;-------
code ...
table
!by $08,$07,$06,$05,$04,$03,$02,$01,$ff,$00
;-------------------------------------------------
;page break here
;-------------------------------------------------
!by $00
@Frantic: codebase-worthy in this state? |
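As a cross-check of the cycle numbers in this final listing, here is a sketch in Python, not part of the original post. It assumes the timer drops by one per cycle (so timer value 10 corresponds to the earliest `ldx` at cycle 23), that `timer` is read with absolute addressing, and it uses the hypothetical addresses `CODE_ADDR` and `TABLE_ADDR` to place the two page breaks.

```python
OPCODES = {0xEA: (1, 2), 0xA9: (2, 2), 0xA5: (2, 3)}  # nop, lda #imm, lda zp

# nop / lda #$a9 / lda #$a9 / lda #$a5 / nop
SLIDE = bytes([0xEA, 0xA9, 0xA9, 0xA9, 0xA9, 0xA9, 0xA5, 0xEA])
TABLE = [0x08, 0x07, 0x06, 0x05, 0x04, 0x03, 0x02, 0x01, 0xFF, 0x00, 0x00]

CODE_ADDR = 0x1000                   # hypothetical: page break right before "code"
SLIDE_ADDR = CODE_ADDR - len(SLIDE)  # also the PC right after the 2-byte bpl
TABLE_ADDR = 0x2000 - 10             # hypothetical: entry 10 lies on the next page
FIRST_LDX_CYCLE = 23                 # cycle numbering taken from the listing


def end_cycle(x):
    """Cycle at which `code` is reached for timer value x (0..10)."""
    cycle = FIRST_LDX_CYCLE + (10 - x)   # assumed: larger timer value = earlier start
    cycle += 4                           # ldx timer (absolute)
    crossed = ((TABLE_ADDR + x) >> 8) != (TABLE_ADDR >> 8)
    cycle += 4 + crossed                 # lda table,x (+1 when indexing crosses a page)
    cycle += 4                           # sta bra+1
    a = TABLE[x]
    if a & 0x80:
        cycle, pc = cycle + 2, 0         # N set by the lda: bpl not taken
    else:
        target = SLIDE_ADDR + a
        cycle += 3 + ((target >> 8) != (SLIDE_ADDR >> 8))
        pc = a
    while pc < len(SLIDE):
        size, cost = OPCODES[SLIDE[pc]]
        pc, cycle = pc + size, cycle + cost
    assert pc == len(SLIDE), "decoding ran past the slide"
    return cycle


print([end_cycle(x) for x in range(11)])
```

In this model all eleven timer values reach `code` at cycle 49, which matches the numbers annotated in the listing.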
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Yes, all you need is to start and end with NOP.
Kind of wondering about bit shifting approaches now.. For A in 0..3:
lsr a
bcs plus1 ;2 or 3 cycles
plus1
lsr a
bcs plus2 ;2 or 4 cycles
plus2 ; <- page boundary
..But do you still get the extra cycle cost when branching to the next instruction? |
| |
HCL
Registered: Feb 2003 Posts: 728 |
..that's the kind of timing i have seen in some Crest-demos i think. It saves some space, but needs a few more cycles for LSR and an extra branch.
lsr
sta br+1
bcc br
br: bpl..
nop
nop
..
|
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
that's the one posted by hannes sommer in 64er mag about a hundred years ago... :=) |
| |
Ninja
Registered: Jan 2002 Posts: 411 |
Using page-crossing branches to waste a cycle is not exactly brand new either... ;) |
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
If I leave out the "Ninja-method" for the moment (which is unbeatable without question) my favourite way of dejittering looks like this:
lda timer ;synced to give A=$17,...,$10
lsr ;A=$0b,$0b,$0a,$0a,$09,$09,$08,$08, C set for odd values
bcs .skip1 ;that bit shifting trick found in Crest-demos as mentioned before by HCL
.skip1 asr #$03 ;after "AND #$03": A=$03,$03,$02,$02,$01,$01,$00,$00
;after "LSR": A=$01,$01,$01,$01,$00,$00,$00,$00, C set for odd values
bcc .skip2 ;waste 2 cycles if C set
bcs .skip2
.skip4 bne .end ;waste 4 cycles for non-vanishing A
.skip2 bne .skip4
.end
...
This is more or less a slightly "stretched" version of the approach found in the Crest-demos. No need for any pb's, even no need for an sbc/eor. It can cope only with eight different delay states, though.
I tried to find a way to get rid of one branch instruction but didn't succeed. Either it's totally trivial and I'm just blind or it is UNpossible :)) |
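The branch chain above can be traced with a short Python sketch (not part of the original post). It assumes that none of the branches crosses a page, that the timer drops by one per cycle, and it models `asr #$03` as `and #$03` followed by `lsr`, as described in the listing's comments.

```python
def dejitter_cycles(timer):
    """Cycles spent from `lsr` up to `.end`, assuming no branch crosses a page."""
    taken, not_taken = 3, 2
    cycles = 2                                 # lsr
    carry, a = timer & 1, timer >> 1
    cycles += taken if carry else not_taken    # bcs .skip1 (target is the next instruction)
    a &= 0x03                                  # asr #$03: and #$03 ...
    carry, a = a & 1, a >> 1                   # ... followed by lsr
    cycles += 2
    if carry:
        cycles += not_taken + taken            # bcc falls through, bcs .skip2 is taken
    else:
        cycles += taken                        # bcc .skip2 is taken
    if a != 0:
        cycles += taken + taken                # bne .skip4, then bne .end
    else:
        cycles += not_taken                    # bne .skip4 falls through to .end
    return cycles


for timer in range(0x17, 0x0F, -1):
    print(hex(timer), dejitter_cycles(timer))
```

In this model the cost falls from 18 cycles for $17 to 11 cycles for $10, one cycle per timer step, so the later the routine starts, the less it delays.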
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
Quoting ChristopherJam [...]
Kind of wondering about bit shifting approaches now.. For A in 0..3:
lsr a
bcs plus1 ;2 or 3 cycles
plus1
lsr a
bcs plus2 ;2 or 4 cycles
plus2 ; <- page boundary
..But do you still get the extra cycle cost when branching to the next instruction?
Iirc you don't get that page-crossing penalty cycle if the branch instruction is at a page end (maybe it was the other way around that you always get it also for a non-taken branch, don't remember correctly anymore). Either way, this
...
lsr
bcs plus2
;---
;pb
;---
plus2
won't work as expected to compensate a two-cycle jitter step. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Correct. The offset in the branch instruction is added to the PC, but the PC has already been incremented to the new page so there's no carry and hence no extra cycle. |
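The rule described here can be written down as a small Python sketch; the addresses in the two calls are hypothetical and only chosen to place a page break at $1100.

```python
def branch_cycles(branch_addr, offset, taken):
    """Cycle count of a 6502 relative branch located at branch_addr."""
    if not taken:
        return 2
    next_pc = (branch_addr + 2) & 0xFFFF   # PC after fetching the 2-byte branch
    target = (next_pc + offset) & 0xFFFF   # offset is a signed value, -128..127
    # The extra cycle depends on the page of next_pc, not of the branch itself.
    return 3 + ((target >> 8) != (next_pc >> 8))


# bcs plus2 in the last two bytes of a page, plus2 at the start of the next page
print(branch_cycles(0x10FE, 0, taken=True))  # 3
# clock-slide case: branch inside the page, target just past the page break
print(branch_cycles(0x10F5, 9, taken=True))  # 4
```

The first call shows why a branch to the immediately following instruction never yields the 4-cycle case, even when the branch sits at the very end of a page.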
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Aww, I had a bad feeling about that page crossing. Thanks guys.
I also gather the CIA timers never read zero when they're counting down? (which kills another idea I had..) |