Frantic
Registered: Mar 2003 Posts: 1648
ACME macro for delaying X cycles
Anybody got a macro handy for the ACME assembler for delaying X number of cycles? It is OK if it kills the A or X register.
chatGPZ
Registered: Dec 2001 Posts: 11386
doynax: another nice source for subtle bugs =)
Oswald
Registered: Apr 2002 Posts: 5094
"doesn't clobber any registers or flags. Still requires two bytes of stack."
almost perfect, now one without jsr and 2-63 cycles please for the ultimate macro :)
Han
Registered: Apr 2017 Posts: 8
Funny to see this question now when I was writing my own macro last week :)
Maybe this is useful for somebody (KickAssembler):
.macro waitx(Cycles)
{
    // Parameters of fast loop (branch does not cross a page boundary)
.var LC=5 // Cycles per loop iteration (DEX, BNE)
.var LoopCount=max(1, floor((Cycles-1)/LC)) // Loop counter
.if((LoopCount>1) && (Cycles - (LoopCount*LC+1)==1)) { .eval LoopCount-- } // Handle only 1 remaining cycle
.var ExtraCycles=max(0, Cycles - (LoopCount*LC+1)) // Cycles outside the loop
.var ExtraBytes=max(0, ceil(ExtraCycles/2)) // Bytes required outside the loop
// Parameters of slow loop (branch over page boundary)
.var P_LC=6
.var P_LoopCount=max(1, floor(Cycles/P_LC))
.if((P_LoopCount>1) && (Cycles - (P_LoopCount*P_LC)==1)) { .eval P_LoopCount-- }
.var P_ExtraCycles=max(0, Cycles - (P_LoopCount*P_LC))
.var P_ExtraBytes=max(0, ceil(P_ExtraCycles/2))
.var Relocate=false
.var IsPageCrossed=(((<*)>=$fb) && ((<*)<=$fd))
.if(IsPageCrossed)
{ // Check if fast loop could be relocated to be slow and would also be smaller
.var adr=*+ExtraBytes
.if((ExtraBytes<P_ExtraBytes) && (((<adr)<$fb) || ((<adr)>$fd)))
{
.eval Relocate=true
}
else
{
.eval LoopCount=P_LoopCount
.eval ExtraCycles=P_ExtraCycles
.eval ExtraBytes=P_ExtraBytes
}
}
else
{ // Check if slow loop could be relocated to be fast and would also be smaller
.var adr=*+P_ExtraBytes
.if((P_ExtraBytes<ExtraBytes) && (((<adr)>=$fb) && ((<adr)<=$fd)))
{
.eval LoopCount=P_LoopCount
.eval ExtraCycles=P_ExtraCycles
.eval ExtraBytes=P_ExtraBytes
.eval Relocate=true
}
}
.if(ceil(Cycles/2) <= (5+ExtraBytes))
{ // Loopless wait is smaller than using a loop
wait(Cycles)
}
else
{ // All that hassle for this small (relocated) loop :)
.if(Relocate) { wait(ExtraCycles) }
ldx #LoopCount
dex
bne *-1
.if(!Relocate) { wait(ExtraCycles) }
}
}
.macro wait(Cycles)
{
.if(Cycles>0)
{
.if(Cycles<2) .error "Can't delay 1 cycle"
.if((Cycles & 1)==0) { nop } else { bit $00 } // Delay 2 or 3 cycles
.for(var i=1; i<floor(Cycles/2); i++) { nop } // Remaining even amount
}
}
This builds an optimal loop + NOP + BIT combination that takes page boundaries into account.
Depending on the number of delay cycles and on the location of the loop, the number of required extra cycles varies. So this macro checks whether the extra bytes can be used to move the loop onto or off a page boundary so that the resulting code is as small as possible. (Of course it uses a loopless delay if that's even better.)
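The cycle arithmetic behind waitx() can be checked outside the assembler. Below is a rough Python model, assuming standard NMOS 6502 timings: LDX #imm and DEX take 2 cycles each, a taken BNE takes 3 cycles (4 when it crosses a page), and the final untaken BNE takes 2. The names fast_loop_cycles, slow_loop_cycles and plan are made up for this sketch; plan() only mirrors the macro's LoopCount/ExtraCycles/ExtraBytes formulas.

```python
# Rough model of the cycle arithmetic in waitx(); assumes NMOS 6502 timings.
import math

FAST_LC = 5  # DEX (2) + taken BNE (3), branch stays on the same page
SLOW_LC = 6  # DEX (2) + taken BNE (4), branch crosses a page boundary


def fast_loop_cycles(n: int) -> int:
    """LDX #n, then n DEX, n-1 taken BNE (3 each) and one untaken BNE (2)."""
    return 2 + 2 * n + 3 * (n - 1) + 2  # = 5n + 1


def slow_loop_cycles(n: int) -> int:
    """Same loop when every taken BNE pays the page-crossing penalty."""
    return 2 + 2 * n + 4 * (n - 1) + 2  # = 6n


def plan(cycles: int, lc: int, offset: int) -> tuple[int, int, int]:
    """Mirror of waitx()'s LoopCount / ExtraCycles / ExtraBytes.

    offset is the constant part of the loop cost: 1 for the fast loop
    (5n + 1), 0 for the slow loop (6n).
    """
    count = max(1, (cycles - offset) // lc)
    if count > 1 and cycles - (count * lc + offset) == 1:
        count -= 1  # a single spare cycle can't be burned, so drop an iteration
    extra = max(0, cycles - (count * lc + offset))
    extra_bytes = math.ceil(extra / 2)  # NOPs plus at most one BIT zp
    return count, extra, extra_bytes


print(plan(28, FAST_LC, 1))  # (5, 2, 1)
print(plan(28, SLOW_LC, 0))  # (4, 4, 2)
```

For 28 cycles the fast loop needs one NOP of padding and the slow loop needs two, which is the size difference the relocation check weighs.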
Example: your code starts at $08fd and you want to wait 24 cycles:
$08fd LDX #$04
$08ff DEX
$0900 BNE $08FF // Page crossing
If instead you wanted to wait 28 cycles at this location, you could append 2 NOPs. But it's smaller to prepend just one NOP, which moves the loop off the page boundary and allows one more iteration:
$08fd NOP
$08fe LDX #$05
$0900 DEX
$0901 BNE $0900 // No page crossing
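To double-check the two listings, here is a small Python sketch that derives the branch penalty from the actual addresses. It assumes the layout LDX #imm (2 bytes), DEX (1 byte), BNE (2 bytes), and the 6502 rule that a taken branch costs one extra cycle when its target is on a different page than the instruction following the branch. The helper names are hypothetical.

```python
# Address-level check of the LDX/DEX/BNE loop; layout and timings assumed:
#   loop_start:   LDX #n  (2 bytes, 2 cycles)
#   loop_start+2: DEX     (1 byte,  2 cycles)  <- branch target
#   loop_start+3: BNE     (2 bytes, 2 untaken, 3 taken, 4 taken across a page)


def branch_crosses_page(loop_start: int) -> bool:
    """Penalty applies when the target (DEX) and the address after BNE
    (loop_start + 5) lie on different pages."""
    target = loop_start + 2
    after_branch = loop_start + 5
    return (target >> 8) != (after_branch >> 8)


def loop_cycles(loop_start: int, n: int) -> int:
    taken = 4 if branch_crosses_page(loop_start) else 3
    return 2 + 2 * n + taken * (n - 1) + 2


NOP_CYCLES = 2

# First listing: loop at $08fd, 4 iterations, branch target on the other page.
print(loop_cycles(0x08FD, 4))
# Second listing: NOP at $08fd, loop moved to $08fe, 5 iterations.
print(NOP_CYCLES + loop_cycles(0x08FE, 5))
# Alternative: keep the loop at $08fd and append two NOPs.
print(loop_cycles(0x08FD, 4) + 2 * NOP_CYCLES)
```

All three print 24, 28 and 28 respectively. The same helper reproduces the macro's $fb..$fd test: only loop start low bytes $fb, $fc and $fd put the DEX target and the byte after BNE on different pages. Size-wise the prepended NOP version is 1 + 5 = 6 bytes versus 5 + 2 = 7 bytes with the appended NOPs.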
The wait() macro is just a simple loopless delay that's used inside waitx(). Using PHA/PLA, the code size could be reduced even more, so maybe I'll include that later.
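The instruction choice in wait() is easy to model as well. This Python sketch assumes NOP = 2 cycles/1 byte and BIT zero page = 3 cycles/2 bytes, reproduces what the .if/.for in wait() emit, and lets you confirm that the byte count always equals ceil(Cycles/2), the formula waitx() uses for ExtraBytes. Keep in mind that BIT $00 also overwrites the N, V and Z flags.

```python
# Model of which instructions wait(Cycles) emits; assumes NOP = 2 cycles/1 byte,
# BIT zp = 3 cycles/2 bytes.
CYCLES = {"nop": 2, "bit $00": 3}
BYTES = {"nop": 1, "bit $00": 2}


def wait_sequence(cycles: int) -> list[str]:
    if cycles <= 0:
        return []  # the macro's .if(Cycles>0) emits nothing
    if cycles < 2:
        raise ValueError("Can't delay 1 cycle")
    # One NOP for an even count, one BIT $00 for an odd count ...
    seq = ["nop"] if cycles % 2 == 0 else ["bit $00"]
    # ... then floor(Cycles/2) - 1 further NOPs, as in the .for loop.
    seq += ["nop"] * (cycles // 2 - 1)
    return seq


for c in (2, 3, 7, 10):
    seq = wait_sequence(c)
    print(c, seq, sum(BYTES[i] for i in seq), "bytes")
```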
Please note that I did test this, but it's still a work in progress.
Cruzer
Registered: Dec 2001 Posts: 1048
Just got a crazy idea for delaying 13 cycles in 1 byte:
pause:
rti
delay13Cycles:
brk
This requires that the IRQ/BRK vector is set to the pause label, and that no IRQs occur at the same time, which I guess is unlikely anyway when cycle-exact timing is going on. However, after a little test it seems like the PC skips a byte after returning with rti, so in reality it takes two bytes:
pause:
rti
delay13Cycles:
brk
.by 0
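Where the 13 cycles and the skipped byte come from, as a simplified Python model: BRK takes 7 cycles and pushes the address of the BRK opcode plus 2 (high byte first), then the status register; RTI takes 6 cycles and pulls them back. The model ignores the I flag and the real flag contents of the pushed status byte, and the function name is made up.

```python
# Simplified model of BRK followed by an immediate RTI (the "pause" handler).
BRK_CYCLES = 7  # push PCH, PCL, P; set I; fetch the vector at $FFFE/$FFFF
RTI_CYCLES = 6  # pull P, PCL, PCH


def brk_then_rti(mem: bytearray, brk_addr: int, sp: int) -> tuple[int, int]:
    """Return (address where execution resumes, stack pointer afterwards)."""
    ret = (brk_addr + 2) & 0xFFFF  # BRK pushes its own address + 2
    # PCH, PCL, then P (0x30 = B flag and bit 5 set, other flags clear: illustrative)
    for value in (ret >> 8, ret & 0xFF, 0x30):
        mem[0x100 + sp] = value
        sp = (sp - 1) & 0xFF
    sp = (sp + 1) & 0xFF  # RTI pulls P (ignored in this model)
    sp = (sp + 1) & 0xFF
    lo = mem[0x100 + sp]  # then PCL
    sp = (sp + 1) & 0xFF
    hi = mem[0x100 + sp]  # then PCH
    return (hi << 8) | lo, sp


mem = bytearray(0x10000)
resume, sp = brk_then_rti(mem, 0x1000, 0xFF)
print(hex(resume), BRK_CYCLES + RTI_CYCLES)  # 0x1002 13
```

Execution resumes two bytes after the BRK, which is why the padding byte is needed.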
Krill
Registered: Apr 2002 Posts: 2980
Yes, BRK is a two-byte instruction. The operand byte is supposed to be an argument for the software interrupt you're triggering, pretty much similar to TRAP #<X> or INT <X> on other platforms.
It was intended for OS calls, I think, but I can't come up with an example that actually uses the argument byte.
The 1581 ROM code only has a dummy parameter:
.8:959d 08 PHP
.8:959e 58 CLI
.8:959f 95 02 STA $02,X
.8:95a1 00 BRK
.8:95a2 EA NOP
Cruzer
Registered: Dec 2001 Posts: 1048
Interesting, did not know that. Wonder why BRK isn't usually interpreted as having an argument by assemblers/disassemblers.
Oswald
Registered: Apr 2002 Posts: 5094
so the byte after brk is loaded into A? or is it just thrown away? isn't it just some kind of side effect from jsr?
ChristopherJam
Registered: Aug 2004 Posts: 1409
Quoting Cruzer: "However, after a little test it seems like the PC skips a byte after returning with rti, so in reality it takes two bytes"
I guess you could make it a single-byte, 19-cycle delay by decrementing the return address in the interrupt handler, assuming you know the stack depth at the time of execution, and also avoid page boundary crossings in the 'caller'.
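A sketch of the bookkeeping for that 19-cycle variant in Python. It assumes the vector points straight at a handler consisting of DEC $01xx (absolute, 6 cycles) followed by RTI, and that the stack pointer at the BRK is known in advance so $01xx is a constant. Both function names are hypothetical. It also shows why page boundaries matter: DEC only touches the pushed low byte, so a BRK at $xxFE resumes on the wrong page.

```python
# Bookkeeping for a one-byte BRK delay whose handler is "DEC $01xx / RTI".
# Assumes the handler is entered straight from the hardware vector and that
# the stack pointer at the BRK is known, so $01xx is a constant.
BRK_CYCLES = 7
DEC_ABS_CYCLES = 6  # DEC absolute
RTI_CYCLES = 6


def pcl_slot(sp_before_brk: int) -> int:
    """BRK stores PCH at $0100+S and PCL at $0100+S-1, so this is what to DEC."""
    return 0x0100 + ((sp_before_brk - 1) & 0xFF)


def resume_address(brk_addr: int) -> int:
    """Where RTI lands after DEC has modified only the pushed low byte."""
    ret = (brk_addr + 2) & 0xFFFF
    lo = ((ret & 0xFF) - 1) & 0xFF  # DEC wraps within the byte, no borrow into PCH
    return (ret & 0xFF00) | lo


print(BRK_CYCLES + DEC_ABS_CYCLES + RTI_CYCLES)  # 19
print(hex(pcl_slot(0xF0)))                       # 0x1ef
print(hex(resume_address(0x1000)))               # 0x1001, one byte after the BRK
print(hex(resume_address(0x10FE)))               # 0x11ff, wrong page
```

For example, with S = $F0 at the BRK the handler would be DEC $01EF followed by RTI.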
Krill
Registered: Apr 2002 Posts: 2980
Quoting Cruzer: "Wonder why BRK isn't usually interpreted as having an argument by assemblers/disassemblers."
Usually, yes. Some assemblers allow an optional argument. The default is without, as BRK is usually used to end a program, discarding any code or data after it.
Quoting Oswald: "so the byte after brk is loaded into A? or is it just thrown away? isn't it just some kind of side effect from jsr?"
The byte needs to be retrieved manually: use TSX to find the return address on the stack, then read the byte just before that address.
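That manual retrieval could look like the following Python model of the stack. It assumes the handler is entered directly from the hardware vector with nothing else pushed: after TSX, the pushed status is at $0101,X, the return address low byte at $0102,X and the high byte at $0103,X, and the argument byte sits one byte before that return address. (On the C64, the KERNAL IRQ entry pushes A, X and Y first, which shifts these offsets by three.) The function names and the $42 argument are illustrative only.

```python
# Model of reading the byte after BRK from inside the handler (TSX-based).
# Assumes nothing but BRK's own three bytes was pushed before the handler runs.


def simulate_brk(mem: bytearray, brk_addr: int, sp: int) -> int:
    """Push PCH, PCL (of brk_addr + 2) and P like BRK does; return the new S."""
    ret = (brk_addr + 2) & 0xFFFF
    for value in (ret >> 8, ret & 0xFF, 0x30):
        mem[0x100 + sp] = value
        sp = (sp - 1) & 0xFF
    return sp


def brk_argument(mem: bytearray, s: int) -> int:
    x = s                           # TSX
    lo = mem[0x0102 + x]            # LDA $0102,X (pushed PCL)
    hi = mem[0x0103 + x]            # LDA $0103,X (pushed PCH)
    ret = (hi << 8) | lo            # = address of BRK + 2
    return mem[(ret - 1) & 0xFFFF]  # the byte right after the BRK opcode


mem = bytearray(0x10000)
mem[0x2000] = 0x00                  # BRK
mem[0x2001] = 0x42                  # hypothetical argument byte
s = simulate_brk(mem, 0x2000, 0xFF)
print(hex(brk_argument(mem, s)))    # 0x42
```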
It may be possible that this is just a side-effect of saving gates or re-using some other logic (but probably not JSR with its two argument bytes).
But there was one real-world application which at least mildly suggests it was a conscious decision. The 6502 was designed as a micro-controller for industrial machines, not a general-purpose CPU for home computers. Back then, PROMs were used for custom or low-volume machines, which would be turned on and immediately manipulate physical objects in the real world. The PROMs came with all bits set, and were programmed by blowing fuses to flip bits to 0, but those bits could never be reset to 1.
Now, the BRK opcode is $00, and it could be used to patch code in PROMs. Upon encountering a BRK (which had formerly been some other instruction), the interrupt handler could look up the argument byte (in addition to, or instead of, the return address on the stack), decide which patch routine for that location (located in a patch area of the PROM) to execute, and then resume operation.
Has anybody interviewed Mr Peddle about this? :)
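The PROM argument relies on fuse programming being one-way. A minimal Python check, assuming a blank PROM byte reads as $FF and programming can only clear bits: a byte can be overwritten with a new value only if the new value has no 1 bit where the old one has a 0 bit. BRK ($00) satisfies that for every possible old byte, while, for example, NOP ($EA) does not.

```python
# One-way fuse programming: bits can go from 1 to 0 but never back.
ERASED = 0xFF  # blank PROM byte, all bits set


def can_program(old: int, new: int) -> bool:
    """True if `new` can be burned over `old` by only clearing bits."""
    return (old & new) == new


BRK, NOP = 0x00, 0xEA
print(all(can_program(b, BRK) for b in range(256)))  # True: BRK fits over anything
print(can_program(ERASED, NOP))                      # True: blank byte
print(can_program(0xA9, NOP))                        # False: $A9 can't become NOP
```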
lft
Registered: Jul 2007 Posts: 369
But in that case, the byte following BRK would be some random byte from the original code. If multiple patches were used, there would be no guarantee that the extra bytes would be different from each other.
Meanwhile, the *address* of the extra byte is available on the stack, and you would have to retrieve it anyway in order to read the extra byte. Hence, it is easier to just use the address (which is unique) to distinguish between different patches.
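The same point as a hypothetical Python sketch: the byte after each patch BRK is whatever original code byte happened to follow, so two patch sites can share it, while the BRK address (return address minus 2) is always unique. The patch table, addresses, ROM bytes and routine names are invented for illustration.

```python
# Hypothetical patch table keyed by BRK address.
PATCHES = {0x1010: "patch_a", 0x1040: "patch_b"}

# Hypothetical PROM contents around two patch sites: each BRK ($00) replaced
# an opcode, and the following byte is left over from the original code.
ROM = {0x1010: 0x00, 0x1011: 0x20, 0x1040: 0x00, 0x1041: 0x20}


def argument_byte(return_address: int) -> int:
    return ROM[(return_address - 1) & 0xFFFF]


def patch_for(return_address: int) -> str:
    return PATCHES[(return_address - 2) & 0xFFFF]  # BRK pushed its address + 2


# Both sites would report the same "argument" ...
print(argument_byte(0x1012) == argument_byte(0x1042))  # True
# ... but their addresses tell them apart.
print(patch_for(0x1012), patch_for(0x1042))           # patch_a patch_b
```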