| |
Norrland
Registered: Aug 2011 Posts: 14 |
Problems with opening the sideborder + sprites
Hi there! First post here, so bear with me: I'm going to spam you with some semi-newbie questions for a while...
I'll start with my problem opening the sideborder. I'm showing a bitmap picture, and the plan is to have 4 sprites in the sideborder, in the middle of the screen, all at the same y-position. I have managed to make a stable IRQ routine and have set $d015 to #$f0 to enable sprites 4-7 (consecutive sprites, the ones with the lowest priority). I've also made sure that sprites 0-3 have completely different y-positions from sprites 4-7.
Further down the screen, at the same position as the sprites, I open the border with dec/inc $d016 (cycle 56). Everything goes well on the normal lines, but on bad lines I end up 3 cycles late (if I've understood everything correctly), even though I do the dec/inc $d016 right after the previous line's dec/inc.
I've tried to follow Christian Bauer's text about the VIC (plus Codebase, C=Hacking, posts here and elsewhere) on raster timing and opening borders, but I don't understand what I'm doing wrong. In my code I put 20 NOPs between the dec/inc pairs, which by my count adds up to 52 cycles including the dec/inc. If that is right, it suggests that the VIC uses 11 cycles (63-52=11) (2 per sprite and 3 for the BA signal??) to fetch the data for 4 sprites on non-badlines.
If my calculations are right, I don't understand why I have problems on badlines, where I should have 23 cycles (?). Even if I drop one sprite I'm still one cycle late, and I've read that 4 sprites should be possible...
The code inside the IRQ, and a screenshot showing my timing:
dec $d021 ;line 1
inc $d021
.byte $ea, $ea, $ea, $ea, $ea, $ea, $ea ;20 NOPs (40 cycles)
.byte $ea, $ea, $ea, $ea, $ea, $ea, $ea
.byte $ea, $ea, $ea, $ea, $ea, $ea
dec $d021 ;line 2, 6 cycles
inc $d021 ;6 cycles, total 52
.byte $ea, $ea, $ea, $ea, $ea, $ea, $ea ;20 NOPs (40 cycles)
.byte $ea, $ea, $ea, $ea, $ea, $ea, $ea
.byte $ea, $ea, $ea, $ea, $ea, $ea
dec $d021 ;line 3, 6 cycles
inc $d021 ;6 cycles, total 52
dec $d020 ;line 4, BADLINE: all hell breaks loose... (well, sort of)
inc $d020
http://i1120.photobucket.com/albums/l491/lordborak/kastabort.png
Is my understanding right? Do I need to consider or set up anything other than what's described above? |
|
| |
TWW
Registered: Jul 2009 Posts: 545 |
sty $d016
sta $d016,x
sty $d016
sta $d016
That saves just the few cycles you need, I reckon :-)
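You can check the saving by adding up cycles. The Python sketch below does this for the original loop body and for the store-based replacement. The cycle table covers only these opcodes and assumes standard NMOS 6510 timings. The mode names ("dec abs" and so on) are descriptive strings, not assembler syntax.

```python
# Cycle costs on the NMOS 6502/6510 for the instructions used so far.
CYCLES = {
    "nop": 2,
    "dec abs": 6,    # read-modify-write, absolute
    "inc abs": 6,
    "sta abs": 4,
    "sty abs": 4,
    "sta abs,x": 5,  # indexed stores always take 5 cycles, page cross or not
}


def count(seq):
    """Total cycles for a straight-line list of instructions."""
    return sum(CYCLES[op] for op in seq)


# Original loop body: 20 NOPs followed by a DEC/INC pair.
op_line = ["nop"] * 20 + ["dec abs", "inc abs"]
# Store-based version: the two $d016 values are already in Y and A.
store_pair = ["sty abs", "sta abs"]
# Same pair, with the indexed store burning one extra cycle (X = 0).
store_pair_slow = ["sty abs", "sta abs,x"]

print(count(op_line))          # 52
print(63 - count(op_line))     # 11 cycles left over on a 63-cycle PAL line
print(count(store_pair))       # 8
print(count(store_pair_slow))  # 9
```

Two stores cost 8 cycles against 12 for DEC/INC. Swapping one store for the indexed form gives a one-cycle adjustment without adding a NOP.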
|
| |
Mr. SID
Registered: Jan 2003 Posts: 424 |
What TWW said.
Also a little tip for timing. Put this in your code somewhere:
delay64: nop
delay62: nop
delay60: nop
delay58: nop
delay56: nop
delay54: nop
delay52: nop
delay50: nop
delay48: nop
delay46: nop
delay44: nop
delay42: nop
delay40: nop
delay38: nop
delay36: nop
delay34: nop
delay32: nop
delay30: nop
delay28: nop
delay26: nop
delay24: nop
delay22: nop
delay20: nop
delay18: nop
delay16: nop
delay14: nop
delay12: rts
delay63: nop
delay61: nop
delay59: nop
delay57: nop
delay55: nop
delay53: nop
delay51: nop
delay49: nop
delay47: nop
delay45: nop
delay43: nop
delay41: nop
delay39: nop
delay37: nop
delay35: nop
delay33: nop
delay31: nop
delay29: nop
delay27: nop
delay25: nop
delay23: nop
delay21: nop
delay19: nop
delay17: nop
delay15: .byte $04, $00 ; NOOP $00 = 3 cycles
rts
Then just do a jsr delay40 in your code instead of putting 20 nops into it. That way you'll make sure that changing your timing delays is not going to push your code around. Otherwise a branch might end up crossing a page border and you'll lose a cycle somewhere which usually takes a long time to find. |
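To convince yourself that every entry point is correct, you can model the table in Python. Each label falls through its NOPs to the next RTS, and the caller's 6-cycle JSR is included. The helper names are hypothetical. The rows mirror the listing above.

```python
def build_table():
    """The delay table as (label, instruction, cycles) rows, top to bottom."""
    rows = [(f"delay{n}", "nop", 2) for n in range(64, 13, -2)]   # delay64..delay14
    rows.append(("delay12", "rts", 6))
    rows += [(f"delay{n}", "nop", 2) for n in range(63, 16, -2)]  # delay63..delay17
    rows.append(("delay15", ".byte $04, $00", 3))                 # 3-cycle NOP zp
    rows.append((None, "rts", 6))
    return rows


def delay_of(rows, label):
    """Cycles used by 'jsr label': the JSR (6) plus everything up to the RTS."""
    start = next(i for i, row in enumerate(rows) if row[0] == label)
    total = 6
    for _, instr, cycles in rows[start:]:
        total += cycles
        if instr == "rts":
            return total
    raise ValueError("no rts after " + label)


rows = build_table()
print(delay_of(rows, "delay40"))  # 40
# Every delay from 12 to 64 exists except 13: JSR+RTS is 12 and nothing costs 1 cycle.
print(all(delay_of(rows, f"delay{n}") == n for n in range(12, 65) if n != 13))  # True
```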
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Quoting Mr. SID: Otherwise a branch might end up crossing a page border and you'll lose a cycle somewhere which usually takes a long time to find.
.assert >* = >delay, error, "delay loop crosses page boundary"
Though I wish ca65 had the ability to warn on branches across a page boundary. |
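The `.assert` only checks that the table as a whole stays on one page. A per-branch warning would have to apply the 6502 rule for relative branches: 2 cycles if not taken, 3 if taken, and 4 if taken to a different page than the instruction after the branch. The small Python sketch below shows that check. The function names are hypothetical and not part of any assembler.

```python
def crosses_page(branch_addr, target):
    """True if a taken branch at branch_addr to target pays the extra cycle.

    The comparison uses the address of the *next* instruction
    (branch_addr + 2), not the address of the branch opcode itself.
    """
    next_pc = (branch_addr + 2) & 0xFFFF
    return (next_pc & 0xFF00) != (target & 0xFF00)


def branch_cycles(branch_addr, target, taken):
    """2 cycles if not taken, 3 if taken, 4 if taken across a page boundary."""
    if not taken:
        return 2
    return 4 if crosses_page(branch_addr, target) else 3


# 'bne *-1' sitting at $10fe: next instruction at $1100, target at $10fd.
print(branch_cycles(0x10FE, 0x10FD, taken=True))  # 4
# The same loop one byte earlier stays entirely on page $10.
print(branch_cycles(0x10FD, 0x10FC, taken=True))  # 3
```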
| |
TWW
Registered: Jul 2009 Posts: 545 |
Or you could be cool and use KickAss with the following pseudo-commands (unless you're short on RAM):
irq2:
:IRQ_LeadIn #1
ldx #$07
dex
bne *-1
bit $ffff
nop
ldx #$00
d16:
ldy #$00
:SET_GRAPHICS_BANKS($4400,$5800)
lda #$1b
sta $d011
lda #$0f
sty x
sta x,x //32
sty x
sta x //33
:timer1 //34
:timer1 //35
:timer1 //36
:timer1 //37
:timer1 //38
:timer1 //39
:timer2 //3a & 3b
:timer1 //3c
:timer1 //3d
:timer1 //3e
:timer1 //3f
:timer1 //40
:timer1 //41
:timer2 //42 & 43
:timer1 //44
:timer1 //45
:timer1 //46
ldx #106+21
stx $d009
stx $d00b
stx $d00d
stx $d00f
ldx #$41
stx $47fc
ldx #$45
stx $47fd
ldx #$50
stx $47fe
ldx #$50
stx $47ff
ldx #$00
sty x
sta x
:timer1
:timer1
:timer2
:timer1
:timer1
:timer1
:timer1
:timer1
:timer1
:timer2
:timer1
:timer1
:timer1
:timer1
:timer1
:timer1
:timer2
ldx #106+42
stx $d009
stx $d00b
stx $d00d
stx $d00f
ldx #$42
stx $47fc
ldx #$46
stx $47fd
ldx #$50
stx $47fe
ldx #$50
stx $47ff
ldx #$00
sty x
sta x
:timer1
:timer1
:timer1
:timer1
:timer1
.pseudocommand timer1 {
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
sty x
sta x
}
.pseudocommand timer2 {
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
bit $ffff
sty x
sta x,x
sty x
sta x
}
This should open the border for 6 character rows with sprites, with the initial raster IRQ triggering before the first badline (adjust the timing as necessary!). Here you can also control the $d016 hardware scrolling.
Cheers. Buy me a beer anytime! |
| |
Mr. SID
Registered: Jan 2003 Posts: 424 |
I think this is cooler:
#pybegin
badline_offset = 2
for i in range(16):
    if i % 8 == badline_offset:
        # bad line
        print("sta $d016,y")
    else:
        print("jsr delay44")
    print("sta $d016")
    print("stx $d016")
#pyend
:) |
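Outside a `#pybegin` preprocessor you can try the same idea as a plain Python 3 function and check the per-line cycle cost of what it emits. The function name and the `COST` table are illustrative assumptions. The 44 for `jsr delay44` assumes a delay routine whose count includes the calling JSR, as in Mr. SID's delay table.

```python
# Cycle cost of each emitted line; "jsr delay44" already includes the JSR.
COST = {"jsr delay44": 44, "sta $d016": 4, "stx $d016": 4, "sta $d016,y": 5}


def open_border_code(badline_offset=2, lines=16):
    """Unrolled border-opening code as one list of source lines per raster line."""
    per_line = []
    for i in range(lines):
        first = "sta $d016,y" if i % 8 == badline_offset else "jsr delay44"
        per_line.append([first, "sta $d016", "stx $d016"])
    return per_line


code = open_border_code()
print("\n".join(line for group in code for line in group))
print([sum(COST[op] for op in group) for group in code[:4]])  # [52, 52, 13, 52]
```

A normal line gets 52 CPU cycles of code. A bad line gets only 13.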
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Using 3 bytes to waste 4 cycles is just wrong, when you can use 2 bytes to waste 7 cycles. :) (bit $ffff, vs pha+pla)
|
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Yes, it is important to waste cycles with style. |
| |
TWW
Registered: Jul 2009 Posts: 545 |
Alright, then this is the coolest :-)
:open_border (2,16)
.macro open_border(badline_offset,total_lines) {
.for (i=0;i<total_lines;i++) {
jsr delay44
.if i&7 == badline_offset:
stx $d016
sta $d016,y
}
stx $d016
sta $d016
}
}
Delay44:
.for (i=0;i<4;i++) { pha pla }
nop
nop
rts
+ it wastes cycles with style^^
ps. Untested and might be bugged! |
| |
Mr. SID
Registered: Jan 2003 Posts: 424 |
Now we're getting somewhere... ;) |
| |
TWW
Registered: Jul 2009 Posts: 545 |
Minor Bugfix 8-D
.var i = 0
:open_border(2,16)
.macro open_border(badline_offset,total_lines) {
.for (i=0;i<total_lines;i++) {
jsr Delay44
stx $d016
.if([i&7] == badline_offset) {
sta $d016,y
stx $d016
}
sta $d016
}
}
Delay44:
.for (i=0;i<4;i++) { pha pla }
nop
nop
rts
Looks OK when compiled. I haven't done any real testing though, so it might need some cycle adjustment if my math is off. |
| |
Norrland
Registered: Aug 2011 Posts: 14 |
Damn it!! I knew I did something wrong, I just didn't realize that I wasn't wasting the cycles with enough style... I'll probably get better at that in the future, but meanwhile I'll stick to the NOPs...
Big thanks for the solution, works great! And I'll be using your tip with the "delay table" in the future when I need careful timing.
You talk about beer, I want beer, you write crazy sideborder-opening routines, my program works, everyone is happy! But I still don't _understand_ why I don't have enough cycles left on badlines: shouldn't 23 cycles be enough for sprite fetching & dec/inc? Does anyone have a suggestion? |
| |
Radiant
Registered: Sep 2004 Posts: 639 |
Hi Macaroni: Well, how many cycles you have available on a badline depends on what your code looks like, especially if you also have sprites enabled. :-) The key is the BA line on the VIC-II: BA goes low three cycles before the VIC needs exclusive memory access. If the CPU executes a read cycle while BA is low, it stalls, and AEC is pulled low until the VIC is finished with its business. After that, the CPU continues from where it was. Therefore you have 20-23 cycles available on a badline with no sprites, depending on your code. |
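A rough Python model shows how the numbers in this thread combine. It uses only figures already given here: 2 stolen cycles per sprite and a 3-cycle BA lead-in (from the original question), and 40 character-fetch cycles on a bad line (implied by the 20-23 figure above). It assumes that the enabled sprites are consecutive, so their fetches form one block with one BA lead-in, and that this block does not overlap the bad-line fetch. It is an estimate, not an exact VIC-II timing model.

```python
PAL_LINE = 63     # CPU cycles per raster line on a PAL C64
BA_LEAD = 3       # BA drops 3 cycles before the VIC takes the bus
CHAR_FETCH = 40   # cycles stolen on a bad line
PER_SPRITE = 2    # cycles stolen per enabled sprite


def free_cycles(badline, sprites):
    """Rough (worst, best) number of CPU cycles left on one raster line.

    Each DMA block (sprites, bad-line fetch) is preceded by a 3-cycle BA
    lead-in. The CPU can still finish write cycles there but stalls on its
    first read: "best" assumes only writes, "worst" assumes an immediate read.
    """
    blocks = []
    if sprites:
        blocks.append(sprites * PER_SPRITE)
    if badline:
        blocks.append(CHAR_FETCH)
    best = PAL_LINE - sum(blocks)
    worst = best - BA_LEAD * len(blocks)
    return worst, best


print(free_cycles(badline=False, sprites=4))  # (52, 55)
print(free_cycles(badline=True, sprites=0))   # (20, 23)
print(free_cycles(badline=True, sprites=4))   # (9, 15)
```

Under this model, a bad line with sprites 4-7 enabled leaves about 9-15 cycles rather than 23, because the sprite fetches still happen on bad lines. The worst case is less than the 12 cycles a DEC/INC pair takes on its own. For comparison, TWW's `timer2` pseudo-command spends 61 cycles across a normal line plus a bad line (11 BIT abs plus four stores), which happens to equal 52 + 9.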
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
delay44:
nop
jsr :+
: jsr :+
: rts |
| |
Norrland
Registered: Aug 2011 Posts: 14 |
radiantx: thx |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: delay44:
nop
jsr :+
: jsr :+
: rts
Cool! I removed my post about how to count the cycles when I realized my mistake... |
| |
Radiant
Registered: Sep 2004 Posts: 639 |
MagerValp: Mind = blown |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
@Magervalp: Nice, yes! Stylish, yes! I think I have seen this kind of delay code before somewhere though. Am I right? Is it in some routine in C=Hacking or something?
Anyway.. I don't think your particular little routine is correct. :) Unless you call the routine with a JSR, there will be one RTS too much and you'll end up in stack space when RTS'ing to Eternia. ...but if you DO call the routine with a JSR it will waste a total of 44+6 cycles (with the initial JSR to the routine included) rather than 44 cycles. ...or did I miss something? Maybe this is actually how it was supposed to be?
Not as elegant, but unless I did some mistake, this routine should provide a relatively stylish wasting of 44 cycles (with the initial 6 cycles of the JSR to the waste44 routine included):
[SOME CODE HERE]
jsr waste44
[SOME CODE HERE]
waste44: ;the routine is 9 bytes
nop
jsr :+
jsr :+
: nop
rts
|
| |
WVL
Registered: Mar 2002 Posts: 902 |
check!
6 jsr delay44
2 nop
6 jsr 1
2 nop
6 rts
6 jsr 2
2 nop
6 rts
2 nop
6 rts
total : 44 |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Yes, it depends on if you count 6 cycles of the calling jsr or not. I couldn't be arsed to count the cycles of the previous examples :P
Don't know if I found it somewhere or if I thought of it myself. |
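The disagreement comes down to whether the caller's JSR is counted. A tiny Python simulator of NOP/PHA/PLA/JSR/RTS settles it by always counting that JSR. `anon1`/`anon2` are hypothetical names standing in for the ca65 unnamed `:` labels. PHA/PLA are modelled for time only, because they are balanced in these routines.

```python
CYCLES = {"nop": 2, "pha": 3, "pla": 4, "jsr": 6, "rts": 6}


def run(program, entry):
    """Cycles for 'jsr entry' called from outside, including that JSR.

    program is a list of (label, opcode, operand) rows; only the return
    addresses pushed by JSR are tracked on the stack.
    """
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    stack = [None]            # None stands for the outside caller
    pc = labels[entry]
    cycles = CYCLES["jsr"]    # the caller's JSR
    while True:
        _, op, arg = program[pc]
        cycles += CYCLES[op]
        if op == "jsr":
            stack.append(pc + 1)
            pc = labels[arg]
        elif op == "rts":
            pc = stack.pop()
            if pc is None:
                return cycles
        else:
            pc += 1


magervalp = [                          # nop / jsr :+ / : jsr :+ / : rts
    ("delay44", "nop", None),
    (None, "jsr", "anon1"),
    ("anon1", "jsr", "anon2"),
    ("anon2", "rts", None),
]
frantic = [                            # nop / jsr :+ / jsr :+ / : nop / rts
    ("waste44", "nop", None),
    (None, "jsr", "anon1"),
    (None, "jsr", "anon1"),
    ("anon1", "nop", None),
    (None, "rts", None),
]
tww = [(None, "pha", None), (None, "pla", None)] * 4   # four PHA/PLA pairs
tww += [(None, "nop", None), (None, "nop", None), (None, "rts", None)]
tww[0] = ("Delay44", "pha", None)

print(run(magervalp, "delay44"))  # 50
print(run(frantic, "waste44"))    # 44
print(run(tww, "Delay44"))        # 44
```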
| |
TWW
Registered: Jul 2009 Posts: 545 |
Delay44:
lda #%00010000
sec
ror
// PAGEBREAK HERE
bcc *-1
rts // A exits with #8 and done in 7 bytes.
Another variant with 1 byte less ;-) |
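This 7-byte loop only reaches 44 cycles if the taken BCC crosses a page boundary, which is what the `// PAGEBREAK HERE` comment is for. The quick Python simulation below counts the cycles, including the calling JSR, and reports the final A. The function name is hypothetical.

```python
def ror_delay(page_cross=True):
    """Return (cycles incl. the calling JSR and the RTS, final A)."""
    cycles = 6                 # JSR Delay44
    a = 0b00010000             # LDA #%00010000
    cycles += 2
    carry = 1                  # SEC
    cycles += 2
    while True:
        # ROR A: old carry goes into bit 7, old bit 0 becomes the new carry.
        a, carry = (carry << 7) | (a >> 1), a & 1
        cycles += 2
        if carry == 0:         # BCC *-1 taken: 3 cycles, 4 across a page
            cycles += 4 if page_cross else 3
        else:                  # BCC not taken
            cycles += 2
            break
    cycles += 6                # RTS
    return cycles, a


print(ror_delay(page_cross=True))   # (44, 8)
print(ror_delay(page_cross=False))  # (40, 8)
```

The loop runs five RORs and four taken branches. If the page break goes missing, the routine quietly becomes 4 cycles shorter.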
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
you are making it overcomplicated. 1 byte less and primitive:
$10=5
6 jsr delay44
delay44
3 ldx $10
2 dex ;
3 bpl *-1 ;x5=25 cycles +when exits 4 cycles -> 29
6 rts
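The count is easy to check in Python, assuming the branch goes back to the DEX (i.e. `bpl *-1`; a target of `*-3` would land on the LDX and reload X forever) and does not cross a page. The sketch also covers the BNE variant with $10 = 6 that comes up later in the thread. The function name is hypothetical.

```python
def dex_delay(x_init, branch="bpl"):
    """Cycles (incl. JSR and RTS) and final X for 'ldx zp / dex / bpl|bne'."""
    cycles = 6 + 3              # JSR + LDX zp
    x = x_init
    while True:
        x = (x - 1) & 0xFF      # DEX
        cycles += 2
        if branch == "bpl":
            taken = x < 0x80    # N flag clear
        else:
            taken = x != 0      # Z flag clear
        if taken:
            cycles += 3
        else:
            cycles += 2
            break
    cycles += 6                 # RTS
    return cycles, x


print(dex_delay(5, "bpl"))  # (44, 255)
print(dex_delay(6, "bne"))  # (44, 0)
```

Both variants waste the same 44 cycles. The BPL version leaves X at $ff and the BNE version leaves it at 0.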
|
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: you are making it overcomplicated. 1 byte less and primitive:
$10=5
6 jsr delay44
delay44
3 ldx $10
2 dex ;
3 bpl *-1 ;x5=25 cycles +when exits 4 cycles -> 29
6 rts
Yes, but you destroys teh regiztorz! == CHEAT! :D |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
oh, didnt think about that :) |
| |
TWW
Registered: Jul 2009 Posts: 545 |
plus you need to set $10 which will set you back 4 bytes ;-)
Trust me I gave it some thought^^ |
| |
Mace
Registered: May 2002 Posts: 1799 |
CYC #$2c
Super illegal opcode. |
| |
Slammer
Registered: Feb 2004 Posts: 416 |
:pause #$2c (From codebase, a bit down on the page)
Just put in your own optimizations. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: plus you need to set $10 which will set you back 4 bytes ;-)
Trust me I gave it some thought^^
yeah, that's why you destroy A, plus $10 doesn't need to be set each time, so in the long run it uses less mem anyway :) |
| |
Martin Piper
Registered: Nov 2007 Posts: 722 |
I've often thought some kind of external tool that generates ASM would be good here. You tell the tool which registers/memory you want set to what value, and on which raster line and cycle. Then it figures out the optimal timed code, including bad lines, sprite DMA etc. |
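The simplest part of such a tool is filling an exact gap between two timed instructions. The Python sketch below pads with 2-cycle NOPs plus one 3-cycle `NOP zp` (the `.byte $04, $00` trick from the delay table) when the gap is odd. Several simplifications are assumptions: it times instruction *starts* rather than the write cycle, it does not model bad lines or sprite DMA, and `fill`/`schedule` are hypothetical names.

```python
NOP_ZP = ".byte $04, $00"   # illegal 3-cycle NOP zp


def fill(cycles):
    """Pad exactly `cycles` cycles with NOPs (plus one 3-cycle NOP if odd)."""
    if cycles < 2:
        raise ValueError("1 cycle cannot be wasted with a single instruction")
    out = []
    if cycles % 2:
        out.append(NOP_ZP)
        cycles -= 3
    return out + ["nop"] * (cycles // 2)


def schedule(events):
    """events: (start cycle, source line, cycles) for each timed instruction.

    Returns straight-line code that starts each instruction on its cycle,
    counting from cycle 0.
    """
    out, now = [], 0
    for at, text, cost in sorted(events):
        gap = at - now
        if gap < 0 or gap == 1:
            raise ValueError(f"cannot start {text!r} on cycle {at}")
        if gap:
            out += fill(gap)
        out.append(text)
        now = at + cost
    return out


code = schedule([(0, "sta $d016", 4), (9, "stx $d016", 4)])
print(code)  # ['sta $d016', '.byte $04, $00', 'nop', 'stx $d016']
```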
| |
Perplex
Registered: Feb 2009 Posts: 255 |
If you write a tool in some high level language to generate the timed code for you, better make it waste cycles in between the timing critical instructions by interleaving it with other useful code instead of just nops and the like. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: If you write a tool in some high level language to generate the timed code for you, better make it waste cycles in between the timing critical instructions by interleaving it with other useful code instead of just nops and the like.
That is exactly what the code multiplexer in S:T Lars Meeting III - Invite does. It has two code snippets, one that updates the 4x4-effect using A,X and Y registers. And one that opens the border. The multiplexer then replaces all NOPs in the timing critical code with the code from the 4x4-updater, keeping track of A,X and Y usage etc. |
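The core of that NOP-replacement idea can be sketched in Python. Each run of NOPs in the timed code becomes a cycle budget. Useful instructions are taken from a work list in order while they fit, and what remains is padded. Unlike the multiplexer described above, this hypothetical `interleave` does not track A/X/Y or flag usage. It also stops at the first instruction that would overrun the gap or leave a single unfillable cycle.

```python
def interleave(timed, work):
    """Replace runs of NOPs in `timed` with lines from `work`, keeping timing.

    timed: source lines; every "nop" is a free 2-cycle slot.
    work:  (source line, cycles) pairs that must run in this order.
    Returns (new code, work that did not fit).
    """
    out, work, i = [], list(work), 0
    while i < len(timed):
        if timed[i] != "nop":
            out.append(timed[i])
            i += 1
            continue
        budget = 0
        while i < len(timed) and timed[i] == "nop":  # measure the NOP run
            budget += 2
            i += 1
        while work:
            text, cost = work[0]
            left = budget - cost
            if left < 0 or left == 1:    # overrun, or 1 cycle nothing can fill
                break
            out.append(text)
            budget = left
            work.pop(0)
        if budget % 2:                   # odd remainder, always >= 3 here
            out.append(".byte $04, $00")   # 3-cycle NOP zp
            budget -= 3
        out += ["nop"] * (budget // 2)
    return out, work


timed = ["sta $d016"] + ["nop"] * 5 + ["stx $d016"]
useful = [("lda $1000", 4), ("sta $0400,y", 5), ("inx", 2)]
code, rest = interleave(timed, useful)
print(code)  # ['sta $d016', 'lda $1000', 'nop', 'nop', 'nop', 'stx $d016']
print(rest)  # [('sta $0400,y', 5), ('inx', 2)]
```

The 5-cycle store is held back here because it would leave exactly 1 cycle in the 10-cycle gap. A real tool would carry it over to the next gap.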
| |
TWW
Registered: Jul 2009 Posts: 545 |
Quote: yeah, thats why you destroy A, plus $10 doesnt needs to be set each time, so in the long run it uses less mem anyway :)
A was 8 upon entry and is 8 upon exit.
lda #$08 <----- see, 8!
ldy #$00
OPEN DA BOOORDEEEERRZZZZZZZ!!!!!
Still 8 here!
You however, destroy X and add bytes ;-)
Come to think of it, one could probably use X as a counter to shorten this even further...
<- entry point (X set to the badline offset, $02 set to the number of 8-line character rows)
!: jsr delayXX
sta $d016
sty $d016
dex
bne !-
jsr delayXX-1
sta $d016
sty $d016
sta $d016,x
sty $d016
ldx #$07
jsr delayZZ
dec $02
bne !-+3
rts
Then again this is straight out of my ass^^ |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
cool, then I win by simply modifying my code to:
$10=6
delay44
3 ldx $10
2 dex ;
3 bne *-1 ;x5=25 cycles +when exits 4 cycles -> 29
6 rts
x needs to be 0, and my x will be 0 :)
edit: you also need to be on a pagebreak, which is not a nice thing either :) |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: cool, then I win by simply modifying my code to:
$10=6
delay44
3 ldx $10
2 dex ;
3 bne *-1 ;x5=25 cycles +when exits 4 cycles -> 29
6 rts
x needs to be 0, and my x will be 0 :)
edit: you also need to be on a pagebreak, which is not a nice thing either :)
...but is it stylish? |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
it does achieve what it was intended to: do it in fewer bytes. changing the goal later, isn't that unfair? :) |
| |
TWW
Registered: Jul 2009 Posts: 545 |
Quote: cool, then I win by simply modifying my code to:
$10=6
delay44
3 ldx $10
2 dex ;
3 bne *-1 ;x5=25 cycles +when exits 4 cycles -> 29
6 rts
x needs to be 0, and my x wil be 0 :)
edit: you also need to be on a pagebreak, which is not a nice thing either :)
Fair enough. Page breaks aren't cool, but they're still somewhat cool if someone rips your code and doesn't think about it (hehe).
You still need to do sta/sty/stx $10 "somewhere" to ensure $10 = 6. That adds 2 bytes (maybe 2 more, unless one of your registers already holds 6), for a total of 8 bytes vs. my 7 bytes.
I guess you would also use 3 cycles more overall due to the sta/sty/stx $10 ;-D |