| | Hermit
Registered: May 2008 Posts: 208 |
shortest CIA-stable raster
<Post edited by moderator on 4/4-2009 14:47>
Hi, Guys :)
While preparing for compo I've developed maybe the shortest
CIA-type stable raster solution (fits in 64 bytes, 24 asm-rows).
If you can do even shorter, I'm curious :)
It works fine in practice, don't have to type novels to achieve
stable raster, and no need for raster-IRQ,CMPd012 method is enough.
If you find it useful for fast & short demo-writing, we may implement it
into codebase64.
;setting the CIA1-timerA to beam in the program beginning:
-----------------------------------------------------------
sei ;we don't want lost cycles by IRQ calls :)
sync cmp $d012 ;scan for begin rasterline (A=$11 after first return)
bne *-3 ;wait if not reached rasterline #$11 yet
ldy #8 ;the walue for cia timer fetch & for y-delay loop
sty $dc04 ;CIA Timer will count from 8,8 down to 7,6,5,4,3,2,1
dey ;Y=Y-1 (8 iterations: 7,6,5,4,3,2,1,0)
bne *-1 ;loop needed to complete the poll-delay with 39 cycles
sty $dc05 ;no need Hi-byte for timer at all (or it will mess up)
sta $dc0e,y ;forced restart of the timer to value 8 (set in dc04)
lda #$11 ;value for d012 scan and for timerstart in dc0e
cmp $d012 ;check if line ended (new line) or not (same line)
sty $d015 ;switch off sprites, they eat cycles when fetched
bne sync ;if line changed after 63 cycles, resyncronize it!
.... the rest (this is also a stable-timed point, can be used for sg.)
B;EXAMPLE-using timerA to stabilize 7 cycle jitter when using CMPd012:
-----------------------------------------------------------------------
scan ldx #$31 ;a good value that's not badline, in border and 1=white
cpx $d012 ;scan rasterline
bne *-3 ;wait until rasterline will be $31
lda $dc04 ;check timer A, here it jitters between 7...1
eor #7 ;A=7-A so jitter will be 0...6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$EA 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
stx $d020 ;x was 1 so border is white at the stable cycle
sty $d020 ;y ended in 0 in sync routine, so border black after 4 cycles
jmp scan ;go to the raster again (or can go new raster)
-----------------------------------------------------------------------
Opinions?
Hermit Software Hungary |
|
| | Hermit
Registered: May 2008 Posts: 208 |
I'm working on the form, but no spaces appear in post. :(
Hermit Software Hungary |
| | chatGPZ
Registered: Dec 2001 Posts: 11356 |
fixed it (use the "code" tag =)) |
| | Hermit
Registered: May 2008 Posts: 208 |
Many thaks
Hermit Software Hungary |
| | Oswald
Registered: Apr 2002 Posts: 5086 |
nice one, but a bit too overoptimised. expecting A being $11 at the start is over the top, as it will be used as a subroutine inside someone's code most probably. secondly I'd prefer to see it setup the timer for a raster irq and a raster irq example. thirdly if this is done put it in codebase, there's a huge demand for such code :) |
| | Hermit
Registered: May 2008 Posts: 208 |
Hi, Oswald.
Yes, it's very optimised, but does the work in any circumstances I've tried, and the aim was the least typing work.
It makes no sense whatever value is in A when calling this routine, because first D012 check can be anywhere, and A will surely be $11 after the first syncronizing phase. And 1st phase appears.
OK,whait a while, I'll work out & share also the IRQ-version today. And of course put it to codebase.
!!One important thing to pay attention: The stx $d020, sty$d020 is only an example, and works.
But put some delay, if only several rows are in this section, because checking d012 ("scan") will appear again in the same raster-row with other timings, that can -of course- crash the program.
It's not a problem in real demo-programming, we very-rarely use a raster interrupt only for some D020 modifying commands...in that case no danger of cmpd012 repetition on the same line.
Hermit Software Hungary |
| | Hermit
Registered: May 2008 Posts: 208 |
Good news, the CIA setter-routine (first part) works well also for raster IRQ. :)
Set the IRQ then,
The form after IRQ entry (when IRQ calld) should be something like this:
pha
lda $dc04
eor #7
sta *+4
bpl *+2
lda #$a9
lda #$a9
lda $eaa5
txa
pha
tya
pha
...
...the rest processes in the IRQ routine
...
asl $d019
pla
tay
pla
tax
pla
rti
I'll refresh Codebase64 with this info:)
Hermit Software Hungary |
| | Copyfault
Registered: Dec 2001 Posts: 475 |
Quote: Good news, the CIA setter-routine (first part) works well also for raster IRQ. :)
Set the IRQ then,
The form after IRQ entry (when IRQ calld) should be something like this:
pha
lda $dc04
eor #7
sta *+4
bpl *+2
lda #$a9
lda #$a9
lda $eaa5
txa
pha
tya
pha
...
...the rest processes in the IRQ routine
...
asl $d019
pla
tay
pla
tax
pla
rti
I'll refresh Codebase64 with this info:)
Hermit Software Hungary
Hmm, the first part of your code can be "optimized" ;)
What about...
pha
lda $dc04
lsr
bcs *+2
asr #$07
bcc *+6
bcs *+4
bne .end
bne *-2
.end
...
eats up the same amount of cycles but saves one byte (unless you're running the code within the zp, which makes it even more unflexible)
Maybe one can even cut it down further by two bytes... it's tackling to try to get rid of one of these branch opcodes, but up to now I didn't see a way to do it.
Copyfault |
| | Hermit
Registered: May 2008 Posts: 208 |
Good to see this approach, I was thinkig about a similar delayer with branch-commands but couldn't realize yet.
As I could, I avoided to use illegal opcodes, because some assemblers (or machine-types?) make it hardly, and also the beginners need a clear code to understand on Codebase64.
I've tried this routine and really works, and GREAT NEWS: no need for ASR#7, an LSR is pretty enough. Why? Because our CIA is counting only in 9 cycles, and at the LDA DC04 only 7..1 appears, no need to turn off any bits. 1 byte saved again.
pha
lda $dc04
lsr
bcs *+2
LSR
bcc *+6
bcs *+4
bne .end
bne *-2
.end
Although, If you can advise a C64 turbo assembler that accepts illegals, I would be happy.
Other idea is, we could do this bne-bcc-bcs..etc like delayer with a halving method, that may reduce steps..
My other approach is to load dc04 to X or Y, and make an indexed jump to a delay-routine. So no need to invert (EOR) the dc04 which is unfortunately counting BACK from 7 to 1 (8 to 0 (8) to be true). Or "JMP ($dc03)" method can be useful to reduce rows.
Hermit Software Hungary |
| | Copyfault
Registered: Dec 2001 Posts: 475 |
I really wonder how a simple LSR instead of ASR #$07 can work. The timing registers NEVER go down to #$00. After #$01, instead of counting down to #$00 the regs are directly reset to the initial value (here $3e most probably). So without masking, we can not make sure that the bne-commands work as intended.
Be careful with the jitter range: it is true that a (legal) RMW-Opcode eats max. 7 cycles, but in combination with branch-opcodes the number of cycles to be considered for jitter can be even longer. This also depends on page_breaks. I once started a thread here about this... will post a link later when I found it.
If desired I can send some acme-source code which clearly shows that there are 8 possible jitter states (value read by $dc04 goes from e.g. $10 to $17). You can do these tests yourself by experimenting with some code like
inc abs,x
bpl *-3
bmi *-5
as main routine.
The jmp ($dc03)-approach has already been done before. Ninja took the idea behind it to perfection. IIRC there was some article out there in VN.
Copyfault |
| | Copyfault
Registered: Dec 2001 Posts: 475 |
Me again,
the discussion I mentioned above was about Stable Raster via Timer.
Copyfault |
| | Ninja
Registered: Jan 2002 Posts: 411 |
Nice to see such a coding thread on CSDb \o/
While nothing beats the experience gained by doing your own timing stuff, I see some practical issues with this routine:
- As Copyfault mentioned, there is not taken care of 8-cycle jitter.
- It was often useful to me to have a counter synced to a rasterline (i.e. counting 63 cycles not 9). That makes it easier to abuse it more than once IMHO. Might be a personal preference, though.
- This routine could need several frames to reach a stable raster. Some code generation or depacking might have happened meanwhile.
- If you want to be really short, your approach won't beat $d013-based techniques, I am afraid.
Still, nice to see you playing around with it. While I see the above issues, there are still nice ideas in this one.
|
| | Copyfault
Registered: Dec 2001 Posts: 475 |
Hey Ninja, greetings my friend,
when seeing that you posted a reply here I first thought you found a way to further optimize this branch-approach...
Maybe this is the limit (considering the used bytes/cycles-ratio).
Copyfault |
| | Hermit
Registered: May 2008 Posts: 208 |
I'd like to emphasize the fact what Copyfault notified me. The 8 cycles jitter is really something that we have to pay attention to...
I'm coding a program, and there were some weird things when the main program (out of the irq) executed more commands, than a simple jmp or so. (especially when irq loader started to operate).
I had to slightly modify the stableraster-waiter routine in the irq. Adding a new line of bit $ea24 seems to prevent any issues coming from 8 cycle jitter... (even 9)
lda $dc04 ;check timer A, here it jitters between 7...1
eor #7 ;A=7-A so jitter will be 0...6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$EA 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
bit $ea24 ;IMPORTANT to handle 8th cycle jitter
Not a big effort though, but from now I have to start writing stableraster-irq keeping this in mind.
(Not the best solution, but a simple NOP did the work too, however that may not be as stable in 8th cycle..)
Thanks for telling me this 8-9 cycle jitter thingy..
Hermit Software Hungary |
| | Repose
Registered: Oct 2010 Posts: 225 |
I'm going to try to develop a mathematical proof of the shortest/quickest cde.
According to http://visual6502.org/wiki/index.php?title=6502_all_256_Opcodes
We have these possibilities:
1 byte: 2-4 cycles
2 bytes: 2-6, 8 cycles
3 bytes: 4-7 cycles
And we are trying to create 8 delay states.
Now here's the table of delay states:
1 byte: 3 states (2, 3 or 4 cycles)
2 byte: 5 states (2, 3, 4, 5, or 6)*
3 byte: 4 states (4, 5, 6, or 7 cycles)
*We'll have to special case this later for the 8 cycles instructions;
there's no way to use two of these to get 15 cycles.
You can see that 1 byte instructions are most efficient for consuming
states.
Combining instructions doesn't double the states! I think of it this way;
at the longest delay, each instruction has the same number
of cycles, which loses a state possibility. The formula is:
total states=(state)*(n)-(n-1), where n is the number of times the instruction
is repeated, and state is the number of cycles possibilities it has.
Here's a table:
states n total
3 1 3
3 2 5
3 3 7
3 4 9
5 1 5
5 2 9
4 1 4
4 2 7
4 3 10
So which combinations gives at least 8 states?
size n total states total bytes
1 4 9 4
2 2 9 4
3 3 10 9
In this table, size is the bytes in the opcode, n is the number of times
an instruction of that length appears in a row.
But 4 bytes isn't the best because we're overdoing it, if we use
a combination of 2 byte and 1 byte opcodes can we get 8 states in less
memory?
Combining the 1 byte and 2 byte opcodes, we get (2,3,4)cycles+(2,3,4,5,6)cycles
which is 4-10cycles or by our formula, (3 states+5 states)-(2-1)=7 states.
This isn't quite enough. However, now we consider the special case of 8 cycles,
which turns out to be very special!
We get (2,3,4)cycles+(2,3,4,5,6,8)cycles=4-12 cycles or 9 states.
Notice that there is just enough overlap here, i.e. (4+6)=10cycles and
(3+8)=11 cycles and (4+8)=12 cycles.
So theoretically, we can use just 3 bytes to write a delay between 4 to 12
cycles.
These formulas can also be used for e.g. the Z80 in the C128 to set a limit
on optimized code.
The Table of Fixed Delays
So how does this help us write the shortest delay routine? The only use
for this in short code is to use it with a computed branch. The obviously
only way to do it with with something like:
lda timer1;4 cycles
asl;2 cycles
sta *+3;4 cycles
bne *+2; selfmod to branch into delay fragments (3 cycles)
xxx;3 state opcode
xxx;6 state opcode (4-12 cycles)
bne continue;do raster processing (3 cycles)
This is obviously horrible for code size but for quickest sync it's promising;
it's 20-28 cycles.
The "multi-threaded" code trick
Using BIT followed by data which happens to be a valid opcode, you can do
a computed branch into coincidently up to 3 different instructions giving
you 3 states in 3 bytes. This seems obviously the most efficient way to make
a short delay; you are trading multiple custom code fragments for a single
code thread array. The code array is indexed by a byte at a time so it
can only consume one state. A first estimate of such techniques is 8 bytes
to consume 8 states. While it saves memory, it can't possibly sync quicker
than the delay above.
The Computed Delay Trick
The last way to make a delay is by calculations which take varying amounts
of time. You can use branches to add 1 cycle of delay based on each of
the flags, N, Z, and C. This can create 4 states of delay. We'll have
to do another calculation to create more delay states. It should take
two calculations and a whole bunch of branches to do it.
Combining Methods
What if you used something like bne $EA to combine threading with computation?
I think you've squeezed an extra state in there somewhere. I think this
might work but your code would be scattered all over the place; it's still
valid though as you can write other code in between.
I haven't fully worked out all the ideas but I believe I've generalized the
approaches.
This reminds me of the 3 or 4 ways of speed optimizing; you can make a loop,
unroll a loop, use "decision tree optimization" where every decision leads
to it's own code fragment, or of course the easier way of doing it which is
a table of subroutines for every possible argument. This is a way to
make a two argument table but with less memory.
Has anyone made a multiply routine for every possible multiplier? I looked
at this and it's about 32 cycles max and sometimes quite less, much faster
than the table of squares method. |
| | Copyfault
Registered: Dec 2001 Posts: 475 |
@Repose: what exactly do you want to prove? The smallest number of bytes needed for a de-jitter-routine?
If scattering of routine fragments is allowed I guess Ninja's approach used in his 2x2-FLI-routines is the shortest possible.
Maybe I didn't fully get the idea behind your lines but don't we need smth like 'axomatic semantics' to do a correct mathematical proof? |
| | ready.
Registered: Feb 2003 Posts: 441 |
Quoting name
lda $dc04 ;check timer A, here it jitters between 7...1
eor #7 ;A=7-A so jitter will be 0...6 in A
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to timer (A) dependent byte
cmp #$c9 ;if A=0, cmp#$c9; if A=1, cmp #$c9 again 2 cycles later
cmp #$c9 ;if A=2, cmp#$c9, if A=3, CMP #$EA 2 cycles later
bit $ea24 ;if A=4,bit$ea24; if A=5, bit $ea, if A=6, only NOP
bit $ea24 ;IMPORTANT to handle 8th cycle jitter
@Hermit: this code doesn't patch the previous one when encoutering the 8-cycle jitter. Just check this: $dc04=8, EOR #7 produces $0f and with bpl you end up out of your code.
|
| | Frantic
Registered: Mar 2003 Posts: 1646 |
@Ready: Maybe i totally miss the point now just judging from the surface of things but are you missing the following?
0
1
2
3
4
5
6
7
= 8 different states, which means that $dc04==8 never happens and that $dc04==7 really is the 8th state? (As I said, I may very well be wrong now, because I don't know what $dc04 might actually end up being in the code discussed here..) |
| | ready.
Registered: Feb 2003 Posts: 441 |
I might be wrong as well, since I based my feedback on VICE monitor only (VICE 2.2). Still I confirm that in my routine ran in VICE sometimes I get $dc04=8. I checked the setup of the code and it is correct.
|
| | Copyfault
Registered: Dec 2001 Posts: 475 |
@Frantic: the experiments I did back then showed that $DC04 will never reach value '0'. This was already mentioned in this thread and in the old one (look @some posts above).
@Hermit: this "eor #$07"-line in your code must indeed cause problems - due to the fact that $DC04 != 0. But ofcourse you could sync your timer to have e.g. values between $10..$17 at the reading cycle of "lda $DC04" - thus, "eor #$17" should fix your example code. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Another approach, albeit the same number of lines of code:
;setting the CIA1-timerA to beam in the program beginning:
;-----------------------------------------------------------
sei ;we don't want lost cycles by IRQ calls ;)
sync lda#$1c
cmp$d012 ; scan for line to force DMA
bne *-3
sta $d011 ;trigger badline to absorb jitter
lda #$11
ldy #8 ;the walue for cia timer fetch & for y-delay loop
sty $dc04 ;CIA Timer will count from 8,8 down to 7,6,5,4,3,2,1
ldy#0
sty $dc05 ;no need Hi-byte for timer at all (or it will mess up)
sta $dc0e,y ;forced restart of the timer to value 8 (set in dc04)
dec $d011 ;undo tiny scroll from above
bmi sync ;oops, we were in the bottom border
|
| | spider-j
Registered: Oct 2004 Posts: 498 |
Sorry to dig this up, but I also stumbled over this lda $dc04/$dd04 returns $08 and therefore eor #7 produces $0f "thingy". I did some experiments with NMI (CIA2 TIMER B counting PAL cycles and TIMER A to "stabilize") and Krill loader instead of IRQ and made a "dirty fix" like this to help myself:
pha
lda $dd04
eor #7
sta *+4
bpl *+2
cmp #$c9
cmp #$c9
bit $ea24
bit $ea24
jmp *+8
nop
nop
jmp *+3
txa
pha
tya
pha
Yes, I know it's a lot more bytes & cycles "wasted". I just saw while linking Trafolta that Achim used this code snippet and played around a little bit.
Maybe someone who is a bit more creative than me should update the codebase64 page with a proper solution. Or at least there should be some kind of warning ... or whatsoever... |
| | Repose
Registered: Oct 2010 Posts: 225 |
lol so I come back exactly 6 years later and this thread is still going.
What I meant was, I wanted to answer two questions, 1) What is a methodical approach to finding the shortest or quickest sync code 2) how to tell if you've found the best possible solution.
Instead of playing around with ideas and guessing, I was thoroughly going through every opcode to see how they could be combined to make various delays. By using that method, it can be proven the best way to do this and then say it's done and forever. I guess no one really understood it, but I still found my own post very interesting of course it makes sense to me.
I'll have to look over the latest proposed segments and decide if I feel any of them are probably the last answer. |
| | Repose
Registered: Oct 2010 Posts: 225 |
Ok so the conclusion of my msg is, "we can use just 3 bytes to write a delay between 4 to 12 cycles". I mean two instructions of the right type, a one byte and two byte one, together can add up to any possibility of time from 4 (nop:nop) to 12 (unspecified 4 cycle and 8 cycle instruction) cycles.
If we jump into each segment, that's one approach.
So what I'm saying is, for that approach this is the smallest code you could ever write.
Then I give two other approaches that could be shorter, but not faster overall (if that's important). |
| | ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Repose, so it looks like you were searching for the smallest number of bytes for which there are a set of at least eight routines of that length that between them cover a run of eight different durations?
Nine delay states easily confirmable in Python thusly:
>>> b1={2,3,4}
>>> b2={2,3,4,5,6,8}
>>> set([x+y for x in b1 for y in b2])
{4, 5, 6, 7, 8, 9, 10, 11, 12}
I'm not clear how closely related that is to finding the shortest possible anti jitter routine mind, as multiple fragments would ordinarily introduce the cost of dispatching to them and returning. (also note that the jmp ($dc03) approach doesn't require same-length delay routines)
It does raise the entertaining possibility of this construct, mind:
ldy $dc04
lda frag1,y
sta rna+1
lda frag2,y
sta rna+2
rna:
.byt 0,0,0
24 to 32 cycles, depending on the content of $dc04. A 9 cycle timer wouldn't work because of the duplicated-8 issue, but a 63 cycle timer should be fine if the alignment is appropriate (alternately, avoid undefined opcodes in main). |
| | lft
Registered: Jul 2007 Posts: 369 |
That is an excellent idea, and it can be taken further!
We can do something like this:
ldy $dc04
lda opcodes,y
sta mod
mod .byt $00,$1b,$a9,$13,$ea
; 18-27 cycles later...
So that's six cycles faster, and one more cycle of jitter supported.
Here is a clarifying table:
y code cycles trashes
1 a9 1b|a9 13|ea 6
2 a5 1b|a9 13|ea 7
3 b5 1b|a9 13|ea 8
4 06 1b|a9 13|ea 9 1b
5 a1 1b|a9 13|ea 10
6 ea|1b a9 13|ea 11 13a9,y
7 ad 1b a9|13 ea 12 (ea),y
8 99 1b a9|13 ea 13 a91b,y (ea),y
9 0e 1b a9|13 ea 14 a91b (ea),y
10 1b 1b a9|13 ea 15 a91b,y (ea),y
This can be varied according to taste, to trash different memory locations. Note that the value of Y is known, so the exact address of the trashed location is also known. |
| | ChristopherJam
Registered: Aug 2004 Posts: 1408 |
Ooh, very nice indeed.
I guess the next question is, what's the fewest cycles required for each jitter length; lft's 18 cycle minimum is likely optimal for ten jitter states; if there's only two (eg we know we're interrupting NOPs) one could just
ldy $dc04
lda $xxnn,y
(eight or nine cycles, depending whether $dc04 is greater than 255-nn)
I'm not wrapping my brain around the sets of bcX *+n above at the moment; it's been a long day. (oh, and I should have been doing STA to rna+0 and rna+1 two comments ago too, but you've probably guessed that already) |
| | Repose
Registered: Oct 2010 Posts: 225 |
Apparently I worked on this 5 years ago.
http://forum.6502.org/viewtopic.php?p=18148#18148
This does 14+A in the range A=(1,8) or 13 with A=0.
;A=1..8
*=$1000
clc
adc #$ff-8;A=8-A so result will be 7
0 in A
eor #$ff
sta corr+1 ;self-writing code, the bpl jump-address = A
corr bpl *+2 ;the jump to (A) dependent byte (13 cycles so far)
cmp #$c9 ;A=8->A=0->BPL +2
cmp #$c9 ;
cmp #$c9 ;
cmp $ea ;3 =9 (13+9=22 max delay)
Nothing innovative, just different idea. |
|