| |
Frantic
Registered: Mar 2003 Posts: 1648 |
ACME macro for delaying X cycles
Anybody got a macro handy for the ACME assembler for delaying X number of cycles? It is OK if it kills the A or X register. |
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
No, but this might help with writing one
;minimal bytes
; preserves a,x,y,sp
; may clobber stack and flags
; 2 cycles (1 byte)
nop
; 3 cycles (2 bytes)
bit 3
; 4 cycles (2 bytes)
nop
nop
; 5 cycles (3 bytes)
nop
bit 3
; 6 cycles (3 bytes)
nop
nop
nop
; 7 cycles (2 bytes)
pha
pla
; 8 cycles (4 bytes)
nop
nop
nop
nop
; 9 cycles (3 bytes)
pha
nop
pla
;10 cycles (4 bytes)
pha
bit 3
pla
;11 cycles (4 bytes)
pha
nop
nop
pla
;12 cycles (5 bytes)
pha
nop
bit 3
pla
;13 cycles (5 bytes)
pha
nop
nop
nop
pla
;14 cycles (4 bytes)
pha
pha
pla
pla
|
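The counts in the table above can be double-checked mechanically. The following Python sketch is only an illustration: the `COST` and `TABLE` names and the `tally` helper are made up here, and the timings are the standard documented 6502 values for these instructions.

```python
# (cycles, bytes) for the instructions used in the table above.
COST = {
    "nop": (2, 1),
    "bit zp": (3, 2),
    "pha": (3, 1),
    "pla": (4, 1),
}

# Sequences from the table, keyed by the delay they are meant to give.
TABLE = {
    2: ["nop"],
    3: ["bit zp"],
    4: ["nop", "nop"],
    5: ["nop", "bit zp"],
    6: ["nop", "nop", "nop"],
    7: ["pha", "pla"],
    8: ["nop"] * 4,
    9: ["pha", "nop", "pla"],
    10: ["pha", "bit zp", "pla"],
    11: ["pha", "nop", "nop", "pla"],
    12: ["pha", "nop", "bit zp", "pla"],
    13: ["pha", "nop", "nop", "nop", "pla"],
    14: ["pha", "pha", "pla", "pla"],
}


def tally(seq):
    """Return (total cycles, total bytes) for a list of mnemonics."""
    cycles = sum(COST[op][0] for op in seq)
    size = sum(COST[op][1] for op in seq)
    return cycles, size


if __name__ == "__main__":
    for wanted, seq in TABLE.items():
        cycles, size = tally(seq)
        print(f"{wanted:2d} requested -> {cycles:2d} cycles, {size} bytes")
```

Every entry comes out at the requested cycle count, with the byte sizes quoted in the comments.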
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Thanks! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
np.
Also this:
#define wait_min24(x) lda#88-x:jsr waitN
waitN
sta *+4
bne *+2
.dsb 64,$a9
lda $ea
rts |
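To see why the routine above gives single-cycle resolution, it helps to decode the sled from every entry point. The sketch below is a rough Python model, not part of the original post: `sled_cycles` and `wait_min24_total` are hypothetical names, and it assumes the patched BNE is always taken and does not cross a page boundary.

```python
def sled_cycles(k):
    """Cycles spent in the sled when the patched BNE lands k bytes into it."""
    # 64 x $a9, then "lda $ea" ($a5 $ea), then rts ($60)
    code = [0xA9] * 64 + [0xA5, 0xEA, 0x60]
    pc, cycles = k, 0
    while code[pc] != 0x60:              # run until the RTS
        op = code[pc]
        if op == 0xA9:                   # LDA #imm: 2 bytes, 2 cycles
            pc, cycles = pc + 2, cycles + 2
        elif op == 0xA5:                 # LDA zp: 2 bytes, 3 cycles
            pc, cycles = pc + 2, cycles + 3
        else:                            # $ea decoded as NOP: 1 byte, 2 cycles
            pc, cycles = pc + 1, cycles + 2
    return cycles


def wait_min24_total(x):
    """Total cycles for wait_min24(x), including the lda # and the jsr."""
    k = 88 - x                           # value patched into the BNE operand
    if not 1 <= k <= 64:
        raise ValueError("model only covers 24 <= x <= 87")
    # lda # (2) + jsr (6) + sta abs (4) + bne taken (3) + sled + rts (6)
    return 2 + 6 + 4 + 3 + sled_cycles(k) + 6
```

An odd entry point swallows the $a5 as an immediate operand and then executes the $ea as a NOP, which is one cycle cheaper than running `lda $ea` from an even entry point. That is where the odd/even difference comes from.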
| |
TWW
Registered: Jul 2009 Posts: 545 |
And this (part of my delay pseudo) ;-)
.if(cycles.getValue()==15) {
pha // 3
nop // 2
nop // 2
nop // 2
nop // 2
pla // 4 <- 15 cycles | 6 bytes
}
.if(cycles.getValue()==16) {
pha // 3
pha // 3
nop // 2
pla // 4
pla // 4 <- 16 cycles | 5 bytes
}
.if(cycles.getValue()==17) {
pha // 3
pha // 3
bit $00 // 3
pla // 4
pla // 4 <- 17 cycles | 6 bytes
}
.if(cycles.getValue()==18) {
pha // 3
pha // 3
nop // 2
nop // 2
pla // 4
pla // 4 <- 18 cycles | 6 bytes
}
.if(cycles.getValue()==19) {
pha // 3
pha // 3
nop // 2
bit $00 // 3
pla // 4
pla // 4 <- 19 cycles | 7 bytes
}
.if(cycles.getValue()==20) {
pha // 3
pha // 3
nop // 2
nop // 2
nop // 2
pla // 4
pla // 4 <- 20 cycles | 7 bytes
}
.if(cycles.getValue()==21) {
pha // 3
pha // 3
pha // 3
pla // 4
pla // 4
pla // 4 <- 21 cycles | 6 bytes
}
.if(cycles.getValue()==22) {
pha // 3
lda #%00000010 // 2
lsr // 2 2
nop // 2 2
bcc *-2 // 3 2
pla // 4 <- 22 cycles | 8 bytes
}
.if(cycles.getValue()==23) {
pha // 3
lda #%00000100 // 2
lsr // 2 2 2
bcc *-1 // 3 3 2
pla // 4 <- 23 cycles | 7 bytes
}
.if(cycles.getValue()==24) {
pha // 3
pha // 3
pha // 3
bit $00 // 3
pla // 4
pla // 4
pla // 4 <- 24 cycles | 8 bytes
}
.if(cycles.getValue()==25) {
pha // 3
lda #%00000100 // 2
lsr // 2 2 2
bcc *-1 // 3 3 2
nop // 2
pla // 4 <- 25 cycles | 8 bytes
}
.if(cycles.getValue()==26) {
pha // 3
pha // 3
pha // 3
nop // 2
bit $00 // 3
pla // 4
pla // 4
pla // 4 <- 26 cycles | 9 bytes
}
.if(cycles.getValue()==27) {
pha // 3
pha // 3
pha // 3
nop // 2
nop // 2
nop // 2
pla // 4
pla // 4
pla // 4 <- 27 cycles | 9 bytes
}
.if(cycles.getValue()==28) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
bcc *-1 // 3 3 3 2
pla // 4 <- 28 cycles | 7 bytes
}
.if(cycles.getValue()==29) {
pha // 3
lda #%00000100 // 2
lsr // 2 2 2
nop // 2 2 2
bcc *-2 // 3 3 2
pla // 4 <- 29 cycles | 8 bytes
}
.if(cycles.getValue()==30) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
bcc *-1 // 3 3 3 2
nop // 2
pla // 4 <- 30 cycles | 8 bytes
}
.if(cycles.getValue()==31) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
bcc *-1 // 3 3 3 2
bit $00 // 3
pla // 4 <- 31 cycles | 9 bytes
}
.if(cycles.getValue()==32) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
bcc *-1 // 3 3 3 2
nop // 2
nop // 2
pla // 4 <- 32 cycles | 9 bytes
}
.if(cycles.getValue()==33) {
pha // 3
lda #%00010000 // 2
lsr // 2 2 2 2 2
bcc *-1 // 3 3 3 3 2
pla // 4 <- 33 cycles | 7 bytes
}
.if(cycles.getValue()==34) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
bcc *-1 // 3 3 3 2
nop // 2
nop // 2
nop // 2
pla // 4 <- 34 cycles | 10 bytes
}
.if(cycles.getValue()==35) {
pha // 3
lda #%00010000 // 2
lsr // 2 2 2 2 2
bcc *-1 // 3 3 3 3 2
nop // 2
pla // 4 <- 35 cycles | 8 bytes
}
.if(cycles.getValue()==36) {
pha // 3
lda #%00001000 // 2
lsr // 2 2 2 2
nop // 2 2 2 2
bcc *-2 // 3 3 3 2
pla // 4 <- 36 cycles | 8 bytes
}
|
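The LSR/BCC variants above are easy to get wrong by one cycle, so here is a small Python model for checking them. It is only a sketch: `lsr_loop_delay` and its parameters are invented for this illustration, and it assumes the BCC never crosses a page boundary.

```python
def lsr_loop_delay(value, pad_nops=0, tail_cycles=0):
    """Cycles for: pha / lda #value / (lsr, pad_nops x nop, bcc loop) / tail / pla."""
    a = value & 0xFF
    if a == 0:
        raise ValueError("value needs a set bit, otherwise the loop never ends")
    cycles = 3 + 2                       # pha, lda #imm
    while True:
        carry = a & 1                    # lsr shifts bit 0 into the carry
        a >>= 1
        cycles += 2 + 2 * pad_nops       # lsr plus any nops inside the loop
        if carry:
            cycles += 2                  # bcc not taken, loop ends
            break
        cycles += 3                      # bcc taken (same page assumed)
    return cycles + tail_cycles + 4      # padding after the loop, then pla


if __name__ == "__main__":
    print(lsr_loop_delay(0b00000010, pad_nops=1))   # the 22-cycle case
    print(lsr_loop_delay(0b00010000))               # the 33-cycle case
```

The position of the set bit picks the number of iterations, `pad_nops` models the NOP inside the loop, and `tail_cycles` models the NOP/BIT padding between the loop and the PLA.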
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting ChristopherJam
;minimal bytes
...
; 6 cycles (3 bytes)
nop
nop
nop
The following is more minimal. =)
; 6 cycles (2 bytes)
cmp (0,x)
The routines for 8, 12 and 13 cycles can be shortened with the same technique. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Thanks all!
@lft: cool. Wasn't aware of that one. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
i'd refrain from using branches in those macros... or at least have guards that give warnings (or even adjust the macro accordingly) when the branch crosses a page boundary. without them you will get nice heisenbugs that appear and disappear randomly when you add/remove code :) |
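The guard described above boils down to one comparison. A minimal Python sketch of the rule follows; `branch_crosses_page` is a hypothetical helper, and a real macro would express the same test in the assembler's own conditional syntax.

```python
def branch_crosses_page(branch_addr, target_addr):
    """True if a taken relative branch at branch_addr costs the extra cycle.

    The 6502 adds one cycle to a taken branch when the target lies in a
    different page than the instruction following the 2-byte branch.
    """
    next_instr = (branch_addr + 2) & 0xFFFF
    return (next_instr >> 8) != (target_addr >> 8)


if __name__ == "__main__":
    # A "bcc *-1" style loop whose branch sits at the very end of a page:
    print(branch_crosses_page(0x08FE, 0x08FD))
    # The same loop moved two bytes further on:
    print(branch_crosses_page(0x0900, 0x08FF))
    print(branch_crosses_page(0x0902, 0x0901))
```

Note that the comparison uses the address after the branch, not the address of the branch itself, so a loop that sits entirely inside one page can still pay the penalty when it ends exactly at the page end.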
| |
Kruthers
Registered: Jul 2016 Posts: 21 |
Can't help posting my "stable" delay macro, though it's 64tass not ACME. Maybe somebody will find it useful. It allows you to tweak the delay without causing code to shift around because it always uses 8 bytes. Alas, it can't do 2 or 4 cycles...
Only real regret is that it sometimes needs to trash a ZP location. And branching is not optional of course. Liberal use of ".option allow_branch_across_page" helps. :)
; macro to generate delay in cycles, always using 8 bytes
; requires the x or y register and one zeropage location which may be trashed
;
; usage: #delay8b 1282, y, $02
; #delay8b 23, x, $ff
delay8b .macro cycles, reg, zp
; validation
.cerror (\cycles < 3 || \cycles == 4), "8-byte-delay cannot be 1, 2 or 4 cycles"
.cerror (\cycles > 1282), "8-byte-delay must be less than 1283 cycles"
.cerror (\reg != "x" && \reg != "y"), "Unknown register", \reg
.cerror (\zp < $00 || \zp > $ff), "Zeropage argument is required"
; 3 to 10 cycles are hard coded
.switch \cycles
.case 3
jmp *+8
bit \zp
bit \zp
nop
.case 5
nop
jmp *+7
bit \zp
bit \zp
.case 6
bit \zp
jmp *+6
bit \zp
nop
.case 7
nop
nop
jmp *+6
bit \zp
nop
.case 8
nop
bit \zp
jmp *+5
bit \zp
.case 9
bit \zp
bit \zp
jmp *+4
nop
.case 10
nop
nop
bit \zp
jmp *+4
nop
; 11 or more cycles follows a repeating pattern
.default
; determine number of each operation
loop := (\cycles - 4) / 5
n_nop := 1
n_bit := 0
n_ldy := 0
n_jmp := 0
n_inc := 0
.if (\cycles % 5) == 0
n_ldy := 1
.elsif ((\cycles - 1) % 5) == 0
n_bit := 1
.elsif ((\cycles - 2) % 5) == 0
n_nop := 3
.elsif ((\cycles - 3) % 5) == 0
n_inc := 1
.elsif ((\cycles - 4) % 5) == 0
n_nop := 0
n_jmp := 1
.endif
; write out the code
; extra ldy (or ldx)
.if n_ldy > 0
.if \reg == "y"
ldy #($100-loop)
.else
ldx #($100-loop)
.endif
.endif
; loop
.if \reg == "y"
ldy #($100-loop)
iny
.else
ldx #($100-loop)
inx
.endif
bne *-1
; bit
.if n_bit
bit \zp
.endif
; nops
.rept n_nop
nop
.next
; inc
.if n_inc
inc \zp
.endif
; jmps
.if n_jmp
jmp *+3
.endif
.endswitch
.endm
|
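The "repeating pattern" in the default case of the macro above can be checked exhaustively. This Python sketch mirrors the macro's arithmetic; the names `delay8b_plan` and `delay8b_cost` are made up for the illustration, and it assumes the `bne *-1` does not cross a page boundary.

```python
# (cycles, bytes) of the filler instructions the macro chooses between.
COST = {
    "ldy #": (2, 2),
    "nop": (2, 1),
    "bit zp": (3, 2),
    "inc zp": (5, 2),
    "jmp": (3, 3),
}


def delay8b_plan(cycles):
    """Mirror the macro's default case: (loop count, filler instructions)."""
    loop = (cycles - 4) // 5
    filler = {
        0: ["ldy #", "nop"],             # extra ldy/ldx plus the default nop
        1: ["bit zp", "nop"],
        2: ["nop", "nop", "nop"],
        3: ["inc zp", "nop"],
        4: ["jmp"],                      # jmp *+3, no nop
    }[cycles % 5]
    return loop, filler


def delay8b_cost(cycles):
    """Return (total cycles, total bytes) of the generated code."""
    loop, filler = delay8b_plan(cycles)
    # ldy # (2) + loop x (iny 2 + bne 3), minus 1 for the final untaken bne
    total = 2 + 5 * loop - 1 + sum(COST[op][0] for op in filler)
    size = 5 + sum(COST[op][1] for op in filler)   # ldy #, iny, bne = 5 bytes
    return total, size
```

For every value from 11 to 1282 the model gives exactly the requested cycle count in exactly 8 bytes.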
| |
TWW
Registered: Jul 2009 Posts: 545 |
@ GPZ: That is a good point. Straight fwd. to add an assertion based on .pc.
@lft: Nice one. Will shamelessly update my routines with this one, saving a byte where I can ;-)
Forgot to say: my other criteria were to leave the registers untouched and to have it used simply as :DELAY 5 |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Actually, no one should ever use my wait_min24 from above on a 6510, unless they have ideological objections to unintended opcodes.
This is better; covers shorter delays and doesn't clobber any registers or flags. Still requires two bytes of stack.
#define wait_min14(x) jsr wait14+14-(x)
.dsb 64,$80 ; NOP#nn
.byt $04 ; NOP zp
wait14
nop
wait12
rts
(if you don't like illegals, replace the NOPs with BITs, and you get one that preserves registers but not flags..)
oh, and lft, thanks for the cmp (0,x)! |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
@lft: Cute, but beware of inadvertently touching I/O registers with side-effects on read (e.g. $DC0D or $DD0D). |
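To make the risk concrete: `cmp (0,x)` reads from whatever address the zero-page pointer at X happens to hold. A small Python sketch of that address calculation follows. The helper names and the example pointer are hypothetical, and the $D000-$DFFF range assumes the usual C64 configuration with I/O banked in.

```python
def indexed_indirect_target(zero_page, operand, x):
    """Address that an (operand,x) instruction such as cmp (0,x) reads from."""
    ptr = (operand + x) & 0xFF                   # pointer address wraps in zero page
    lo = zero_page[ptr]
    hi = zero_page[(ptr + 1) & 0xFF]             # high byte also wraps in zero page
    return lo | (hi << 8)


def reads_io(addr):
    """True if addr is inside the C64 I/O area (when I/O is banked in)."""
    return 0xD000 <= addr <= 0xDFFF


if __name__ == "__main__":
    zero_page = [0] * 256
    zero_page[0x10], zero_page[0x11] = 0x0D, 0xDC   # made-up pointer to $DC0D
    target = indexed_indirect_target(zero_page, 0x00, 0x10)
    print(hex(target), reads_io(target))
```

So the instruction is only safe as a delay when X and the zero-page contents at that point are known, or when the operand is chosen so that the pointer cannot land in I/O space.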
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
doynax: another nice source for subtle bugs =) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
"doesn't clobber any registers or flags. Still requires two bytes of stack."
almost perfect, now one without jsr and 2-63 cycles please for the ultimate macro :) |
| |
Han
Registered: Apr 2017 Posts: 8 |
Funny to see this question now when I was writing my own macro last week :)
Maybe this is useful for somebody (KickAssembler):
.macro waitx(Cycles)
{
// Parameters of fast loop (outside a page boundary)
.var LC=5 // Cycles per loop iteration (DEX, BNE)
.var LoopCount=max(1, floor((Cycles-1)/LC)) // Loop counter
.if((LoopCount>1) && (Cycles - (LoopCount*LC+1)==1)) { .eval LoopCount-- } // Handle only 1 remaining cycle
.var ExtraCycles=max(0, Cycles - (LoopCount*LC+1)) // Cycles outside the loop
.var ExtraBytes=max(0, ceil(ExtraCycles/2)) // Bytes required outside the loop
// Parameters of slow loop (branch over page boundary)
.var P_LC=6
.var P_LoopCount=max(1, floor(Cycles/P_LC))
.if((P_LoopCount>1) && (Cycles - (P_LoopCount*P_LC)==1)) { .eval P_LoopCount-- }
.var P_ExtraCycles=max(0, Cycles - (P_LoopCount*P_LC))
.var P_ExtraBytes=max(0, ceil(P_ExtraCycles/2))
.var Relocate=false
.var IsPageCrossed=(((<*)>=$fb) && ((<*)<=$fd))
.if(IsPageCrossed)
{ // Loop would be slow here: check if it could be relocated to be fast and would also be smaller
.var adr=*+ExtraBytes
.if((ExtraBytes<P_ExtraBytes) && (((<adr)<$fb) || ((<adr)>$fd)))
{
.eval Relocate=true
}
else
{
.eval LoopCount=P_LoopCount
.eval ExtraCycles=P_ExtraCycles
.eval ExtraBytes=P_ExtraBytes
}
}
else
{ // Loop would be fast here: check if it could be relocated to be slow and would also be smaller
.var adr=*+P_ExtraBytes
.if((P_ExtraBytes<ExtraBytes) && (((<adr)>=$fb) && ((<adr)<=$fd)))
{
.eval LoopCount=P_LoopCount
.eval ExtraCycles=P_ExtraCycles
.eval ExtraBytes=P_ExtraBytes
.eval Relocate=true
}
}
.if(ceil(Cycles/2) <= (5+ExtraBytes))
{ // Loopless wait is smaller than using a loop
wait(Cycles)
}
else
{ // All that hassle for this small (relocated) loop :)
.if(Relocate) { wait(ExtraCycles) }
ldx #LoopCount
dex
bne *-1
.if(!Relocate) { wait(ExtraCycles) }
}
}
.macro wait(Cycles)
{
.if(Cycles>0)
{
.if(Cycles<2) .error "Can't delay 1 cycle"
.if((Cycles & 1)==0) { nop } else { bit $00 } // Delay 2 or 3 cycles
.for(var i=1; i<floor(Cycles/2); i++) { nop } // Remaining even amount
}
}
What this does is build an optimal Loop+Nop+Bit combination that takes page boundaries into account.
Depending on the number of delay cycles and on the location of the loop the number of required extra cycles varies. So this macro checks if the extra bytes can be used to relocate the loop from/onto a page boundary so that the resulting number of bytes is minimal. (Of course it uses a loopless delay if that's even better.)
Example: your code starts at $08fd and you want to wait 24 cycles:
$08fd LDX #$04
$08ff DEX
$0900 BNE $08FF // Page crossing
If instead you wanted to wait 28 cycles at this location you could append 2 NOPs. But it's smaller to prepend just one NOP, thus relocating the loop off of the page boundary and adding one iteration:
$08fd NOP
$08fe LDX #$05
$0900 DEX
$0901 BNE $0900 // No page crossing
The wait() macro is just a simple loopless delay that's used inside waitx(). Using pha/pla the code size could be reduced even more so maybe I'll include that later.
Please note that I did test this but it's still work in progress.. |
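The two worked examples above follow from the loop timings 5n+1 (fast) and 6n (slow). Here is a rough Python model for checking them. It is a sketch only: `loop_crosses_page` and `loop_cycles` are names invented here, and the low-byte test is copied from the macro's `IsPageCrossed` expression.

```python
def loop_crosses_page(start):
    """The macro's IsPageCrossed test: low byte of the LDX address in $fb..$fd."""
    return 0xFB <= (start & 0xFF) <= 0xFD


def loop_cycles(start, count):
    """Cycles for 'ldx #count / dex / bne *-1' assembled at address start."""
    taken = 4 if loop_crosses_page(start) else 3    # +1 when the branch crosses a page
    # ldx # (2) + count x dex (2) + (count - 1) taken branches + 1 untaken (2)
    return 2 + count * 2 + (count - 1) * taken + 2


if __name__ == "__main__":
    print(loop_cycles(0x08FD, 4))        # slow loop on the page boundary
    print(loop_cycles(0x08FD, 4) + 4)    # same loop plus two appended NOPs, 7 bytes
    print(2 + loop_cycles(0x08FE, 5))    # one prepended NOP plus fast loop, 6 bytes
```

Both 28-cycle variants take the same time. The relocated one is simply a byte shorter, which is the trade-off the macro is looking for.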
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Just got a crazy idea for delaying 13 cycles in 1 byte:
pause:
rti
delay13Cycles:
brk
Requires that the IRQ/BRK vector is set to the pause label, and no IRQs occur at the same time, which I guess is unlikely anyway when cycle-exact timing is going on. However, after a little test it seems like the PC skips a byte after returning with rti, so in reality it takes two bytes:
pause:
rti
delay13Cycles:
brk
.by 0 |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Yes, BRK is a two-byte instruction. The operand byte is supposed to be an argument for the software interrupt you're triggering, pretty much similar to TRAP #<X> or INT <X> on other platforms.
It was intended for OS calls, i think, but i fail to come up with an example that actually uses the argument byte.
The 1581 ROM code only has a dummy parameter:
.8:959d 08 PHP
.8:959e 58 CLI
.8:959f 95 02 STA $02,X
.8:95a1 00 BRK
.8:95a2 EA NOP |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Interesting, did not know that. Wonder why BRK isn't usually interpreted as having an argument by assemblers/disassemblers. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
so byte after brk is loaded into A ? or just thrown away ? isnt it just some kind of side effect from jsr ? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Cruzer
However, after a little test it seems like the PC skips a byte after returning with rti, so in reality it takes two bytes
I guess you could make it a single byte 19 cycle delay by decrementing the return address in the interrupt handler, assuming you know the stack depth at the time of execution, and also avoid page boundary crossings in the 'caller'.
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Cruzer
Wonder why BRK isn't usually interpreted as having an argument by assemblers/disassemblers.
Usually, yes. Some assemblers allow an optional argument. Default is without, as usually BRK is used to end a program, discarding any code or data after it.
Quoting Oswald
so byte after brk is loaded into A ? or just thrown away ? isnt it just some kind of side effect from jsr ?
The byte needs to be retrieved manually, by reading it through the return address found on the stack via TSX.
It may be possible that this is just a side-effect of saving gates or re-using some other logic (but probably not JSR with its two argument bytes).
But there was one real-world application which at least mildly suggests it was a conscious decision. The 6502 was designed as a micro-controller for industrial machines, not a general-purpose CPU for home computers. Back then, PROMs were used for custom or low-volume machines, which would be turned on and immediately manipulate physical objects in the real world. The PROMs came with all bits set, and were programmed by blowing fuses to flip bits to 0, but those bits could never be reset to 1.
Now, the BRK opcode is $00, and it could be used to patch code in PROMs. Upon encountering BRK (which was some other instruction formerly), the interrupt handler could then look up the argument byte (in addition or alternatively to the return address on stack) and decide which patch routine for that location (located in a patch area on the PROM) to execute, then resume operation.
Has anybody interviewed Mr Peddle about this? :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
But in that case, the byte following BRK would be some random byte from the original code. If multiple patches were used, there would be no guarantee that the extra bytes would be different from each other.
Meanwhile, the *address* of the extra byte is available on the stack, and you would have to retrieve it anyway in order to read the extra byte. Hence, it is easier to just use the address (which is unique) to distinguish between different patches. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
thanks for the clarification.
"The 6502 was designed as a micro-controller for industrial machines"
more accurately Chuck Peddle had ATM machines and POS terminals (till/cash register) in mind, and so on. Cheap CPU that can be built into everyday products.
this is in "on the edge". |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
This is getting off topic |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting lft
But in that case, the byte following BRK would be some random byte from the original code. If multiple patches were used, there would be no guarantee that the extra bytes would be different from each other.
Collection of 8 bits, to be precise, each of which you can null individually until having a unique byte. And if that fails, null the bits of consecutive bytes and read past null bytes in the BRK handler. But anyhow, yes, that argument byte's just a bonus, which might or might not come in handy depending on the use-case. And patches were probably relatively rare. :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
This is what I'm using in ca65 at the moment.
Given that I'm still clobbering flags with pha/pla, it's a bit OTT using illegals for NOP zeropage instead of BIT zeropage, but there's no harm in it, and this way you can convert the macro into a flag preserving one by removing the N=7 case.
I elected not to use the (0,x) thing, because I don't want to watch out for accidentally reading from IO.
; preserves a,x,y,sp
; may use a few bytes of stack
; Remove the N=7 case to get one that also preserves flags
.macro wait N
.if N = 2
nop
.elseif N = 3
.byt $04,3 ; NOP zp
.elseif N = 7
pha
pla
.elseif N < 9
nop
wait N-2
.elseif N = 12
jsr wait12
.elseif N < 14
wait 7
wait N-7
.elseif N<77
jsr wait14+14-(N)
.elseif N<84
wait 76
wait (N)-76
.elseif N<152
wait (N)/2
wait ((N)+1)/2
.else
wait 76
wait (N)-76
.endif
.endmacro
.segment "CODE"
.res 61,$80 ; NOP#nn
.byt $04 ; NOP zp
wait14:
nop
wait12:
rts
(I'll probably add it to codebase after I've done some more testing, unless someone else gets there first) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Off-topic but I don't suppose that anyone here has gotten around to writing a usable 6502 super-optimizer? It might be handy to brute-force this type of thing, finding a use for the more obscure illegal-opcodes, etc. Then again the solution in practice is typically to redefine the problem with the 6510 architecture in mind, so perhaps not.
Quoting ChristopherJam
Given that I'm still clobbering flags with pha/pla, it's a bit OTT using illegals for NOP zeropage instead of BIT zeropage, but there's no harm in it, and this way you can convert the macro into a flag preserving one by removing the N=7 case.
PHP/PLP? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting doynax
PHP/PLP?
OMG, of course!
*fixes* |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting doynax
Off-topic but I don't suppose that anyone here has gotten around to writing a usable 6502 super-optimizer? It might be handy to brute-force this type of thing, finding a use for the more obscure illegal-opcodes, etc.
Not yet, much as I've wondered about it from time to time.. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Erm, replace the tail of that macro with
.elseif N<77
jsr wait14+14-(N)
.elseif N=77
wait 2
wait 75
.elseif N<84
wait 76
wait (N)-76
.elseif N<153
wait (N)/2
wait ((N)+1)/2
.else
wait 76
wait (N)-76
.endif
.endmacro
or you'll get errors for a few edge cases (it tries to call wait 1 and falls into an infinite recursion, triggering a fatal error: Too many nested .IFs) |
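The edge cases are easy to reproduce with a model of the recursion. This Python sketch is not from the thread: `sled_cycles` and `wait` are stand-ins that follow the macro's branches. One simplification is that it raises an error for `wait 1`, where ca65 would recurse until it gives up.

```python
def sled_cycles(n):
    """Cycles for 'jsr wait14+14-(n)' with the 61 x $80, $04 sled before wait14."""
    code = [0x80] * 61 + [0x04, 0xEA, 0x60]   # NOP #imm sled, NOP zp, nop, rts
    pc = 62 - (n - 14)                        # wait14 (the nop) sits at index 62
    cycles = 6                                # jsr
    while code[pc] != 0x60:
        if code[pc] == 0x80:                  # NOP #imm, unintended opcode: 2 bytes, 2 cycles
            pc, cycles = pc + 2, cycles + 2
        elif code[pc] == 0x04:                # NOP zp, unintended opcode: 2 bytes, 3 cycles
            pc, cycles = pc + 2, cycles + 3
        else:                                 # $ea, plain NOP: 1 byte, 2 cycles
            pc, cycles = pc + 1, cycles + 2
    return cycles + 6                         # rts


def wait(n, fixed=True):
    """Cycles produced by the macro; fixed=False models the first posted tail."""
    if n < 2:
        raise ValueError(f"wait {n} cannot be assembled")
    if n in (2, 3):
        return n                              # nop, or NOP zp
    if n == 7:
        return 3 + 4                          # pha/pla (php/plp costs the same)
    if n < 9:
        return 2 + wait(n - 2, fixed)
    if n == 12:
        return 6 + 6                          # jsr wait12 + rts
    if n < 14:
        return wait(7, fixed) + wait(n - 7, fixed)
    if n < 77:
        return sled_cycles(n)
    if fixed and n == 77:
        return wait(2) + wait(75)
    if n < 84:
        return wait(76, fixed) + wait(n - 76, fixed)
    if n < (153 if fixed else 152):
        return wait(n // 2, fixed) + wait((n + 1) // 2, fixed)
    return wait(76, fixed) + wait(n - 76, fixed)
```

With the first tail, 77 and 153 are the values below 160 that end up requesting `wait 1`. With the corrected tail every value tested gives exactly N cycles.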
| |
Copyfault
Registered: Dec 2001 Posts: 478 |
The unintended RMW opcodes allow zp-indexed addressing, plus they can be paired so that at least part of the change that is performed by the first opcode is inverted by the latter, e.g.
ISB ($00,x)
DCP ($00,x)
The cycle demand per RMW-zp-indexed-instruction is always 8. This way, it should be possible to cut down the amount of bytes for the delay routine if the number of delay cycles is high enough (>=16). It will take some extra effort (-> extra delay cycles) to also save accu and/or flags (e.g. some kind of PHP&PHA+PLP&PLA-bracket should do).
With these instructions, it's even more important to make sure that no accidental accesses of I/O-registers occur. This means the macro will have to handle it by choosing the operand byte accordingly.
Just a few thoughts, didn't implement such delay macros as of yet. |
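As a rough check of the pairing idea, the memory side of ISB followed by DCP can be modelled in a few lines of Python. This is a sketch under assumptions, not a tested delay macro: the function names are invented here, decimal mode is assumed to be off, and only A, the carry and the memory byte are tracked.

```python
def isb(a, carry, m):
    """ISB: increment memory, then SBC it from A (binary mode)."""
    m = (m + 1) & 0xFF
    diff = a - m - (1 - carry)
    return diff & 0xFF, int(diff >= 0), m    # new A, new carry, new memory


def dcp(a, m):
    """DCP: decrement memory, then CMP it with A (A itself is unchanged)."""
    m = (m - 1) & 0xFF
    return int(a >= m), m                    # carry from the compare, new memory


def isb_dcp_pair(a, carry, m):
    """Run ISB then DCP on the same memory byte."""
    a, carry, m = isb(a, carry, m)
    carry, m = dcp(a, m)
    return a, carry, m


if __name__ == "__main__":
    print(isb_dcp_pair(0x10, 1, 0x00))
```

The memory byte always comes back unchanged, while A and the flags do not, which is why the PHP/PHA bracket mentioned above would be needed for a register-preserving version.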
| |
Remdy
Registered: Feb 2019 Posts: 26 |
Hi, Could anyone please post an ACME delay macro here? |