| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Faster charmap scrolling
Some interesting asides over in the Pixeling forum about speeding up charmaps (cf Graphician for intense EF game). As Oswald pointed out, switching from tiles to a straight unpacked charmap doesn't really save you much, as you can avoid dealing with tile indices for most of the screen by just copying most of the chars from within VM. Besides, even tile index reads can be amortised over multiple VM writes.
However, there are other possibilities. If you've got a little RAM to spare (eg because all your level data is in EF), then why not unroll the update loop into one hardcoded routine per column?
Could easily dedicate 5k to
lda#$xx
sta vm,x
lda#$xx
sta vm+40,x
lda#$xx
sta vm+2*40,x
..
lda#$xx
sta vm+24*40,x
which gets you down to 7 cycles per char (14 if you also do video ram)
You only need to update one column of source each time you scroll one char, and call the columns in sequence with increasing values of X.
Might have to divide it into upper/lower halves of the screen to avoid tearing.
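As a rough illustration of the per-column routine described above, here is a minimal Python sketch of a generator that emits the 6502 bytes for one column (`lda #imm` / `sta vm+row*40,x`, 25 times, then `rts`). The function names, the `vm_base` of `$0400` and the trailing `rts` are assumptions made for the example, not details from the post.

```python
# Sketch only: emits 6502 machine code for one hardcoded column routine.
LDA_IMM = 0xA9    # lda #imm   (2 bytes, 2 cycles)
STA_ABS_X = 0x9D  # sta abs,x  (3 bytes, 5 cycles)
RTS = 0x60        # rts, assumed here as the way each routine returns

ROWS, COLS = 25, 40


def build_column_routine(column_chars, vm_base=0x0400):
    """Return the bytes of 'lda #c / sta vm+row*40,x' for each row, plus rts.

    vm_base=$0400 is an assumed video matrix address (hypothetical).
    """
    assert len(column_chars) == ROWS
    code = bytearray()
    for row, char in enumerate(column_chars):
        addr = vm_base + row * COLS
        code += bytes([LDA_IMM, char & 0xFF])
        code += bytes([STA_ABS_X, addr & 0xFF, addr >> 8])
    code.append(RTS)
    return bytes(code)


def cycles_per_column():
    """Cycles for the 25 load/store pairs, excluding jsr/rts overhead."""
    return ROWS * (2 + 5)


routine = build_column_routine(list(range(ROWS)))
print(len(routine), "bytes per column,", COLS * len(routine), "bytes for 40 columns")
print(cycles_per_column() * COLS, "cycles per full screen, excluding call overhead")
```

Under these assumptions one routine is 126 bytes, so 40 of them come to roughly the 5k mentioned above, and a full screen costs 40 * 25 * 7 = 7000 cycles before call overhead.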
Of course, if you want to be really extravagant, you could generate a routine per column of level data, and skip any redundant loads by grouping identical indices; kind of like compiled sprites on PC.
That would eat shedloads of flash if you stored them all in advance of course (a tad less with duplicate removal), or you could try generating them on the fly
|
|
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
My latest full screen scroll code involves:
stx SCREEN1+$0000
stx SCREEN2+$0000
inx
stx SCREEN1+$0028
stx SCREEN2+$0028
inx
stx SCREEN1+$0050
stx SCREEN2+$0050
inx
stx SCREEN1+$0078
stx SCREEN2+$0078
inx
stx SCREEN1+$00a0
stx SCREEN2+$00a0
stx SCREEN1+$0001
stx SCREEN2+$0001
inx
stx SCREEN1+$00c8
stx SCREEN2+$00c8
stx SCREEN1+$0029
stx SCREEN2+$0029
inx
stx SCREEN1+$00f0
stx SCREEN2+$00f0
stx SCREEN1+$0051
stx SCREEN2+$0051
inx
stx SCREEN1+$0118
stx SCREEN2+$0118
stx SCREEN1+$0079
stx SCREEN2+$0079
inx
.
.
.
Segment (start, stop, size):
SCROLLER 007C00 0094CF 0018D0
For what and how it's used is a secret and will be revealed at a future demo party. :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Haha, nice. I can think of a few things you could do with that.. |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
jackie, twister with half chars ? :) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
CJ, took me a minute until I got it, that's an awesome idea :) tho the usage is limited to horizontal scrolling. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: CJ, took me a minute until I got it, that's an awesome idea :) tho the usage is limited to horizontal scrolling.
Free directional (mine that is). CJ's I dunno. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Oswald: CJ, took me a minute until I got it, that's an awesome idea :) tho the usage is limited to horizontal scrolling.
Thanks! Yes, I probably should have mentioned that limitation. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Given you have your chars mapped out like so
$8000 row 1 of chars
$8100 row 2 of chars
$8200 row 3 of chars
....
$a000 row 1 of colours
$a100 row 2 of colours
.....
And you are double buffered. Screen 1 at $4400 and Screen 2 at $4800
You have a window defined as Lindex and Rindex which start at 0 and 39.
Lets also assume that you are viewing Screen 1 to start with
When you scroll left, such that it appears the player has moved right, you use an unrolled loop to copy
4401 -> 4800
4402 -> 4801
.....
You then need to plot your new column, in this case in the right edge of screen 2.
so
ldx Rindex
lda $8000,x
sta $4827
lda $8100,x
sta $484f
lda $8200,x
...
then you need to do the CRAM so again unrolled loop
d801 -> d800
d802 -> d801
.....
and plot the CRAM side
ldx Rindex
lda $A000,x
sta $d827
lda $A100,x
sta $d84f
lda $A200,x
...
So you need 4 unrolled Screen copy routines
Screen1 to Screen2 forwards
Screen2 to Screen1 forwards
Screen1 to Screen2 backwards
Screen2 to Screen1 backwards
and 2 CRAM copy routines
CRAM+1 to CRAM
CRAM to CRAM+1
And 4 column routines
LeftEdge Screen1
LeftEdge Screen2
RightEdge Screen1
RightEdge Screen2
Now, to scroll, you move your "window": to get to the next char column you inc Rindex and Lindex, and to go back you dec them.
But this only gets you a 256 char wide map, so you need to add a RindexBank and LindexBank as well. Once an index rolls over, you inc/dec its bank, or even look up what the next bank is in a bank map, allowing you to repeat banks for extra length, or to wrap maps around. Since the banks are always at $8000 you don't need to use any pointers, or indexes. Your unrolled loops can now service up to 1MB worth of map.
It eats a good 24~5K, but with all the map data and other code being able to be stored in ROM, I don't think it is going to matter. It gives top speed with free range "boundless" map size and a unique colour per 1x1 map tile, and you don't need any timing critical VIC tricks to worry about patching to support NTSC. However, you can't modify the map, so dynamic parts of the map must become "entities". It could also be modified to support up/down and even possibly 8 way scrolling versions. |
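The scheme in the post above can be modelled in a few lines of Python. This is only a sketch of the data flow, not of the unrolled 6502 code: `scroll_left`, `advance` and the demo map are hypothetical names and data, and the copy is done per row, which gives the same result as the linear `4401 -> 4800` copy once the right-hand column has been plotted over it.

```python
ROWS, COLS = 25, 40


def scroll_left(front, back, map_rows, r_index):
    """Copy front columns 1..39 into back columns 0..38, then plot the new
    right-edge column from the row tables (one 256-byte page per row)."""
    for row in range(ROWS):
        base = row * COLS
        back[base:base + COLS - 1] = front[base + 1:base + COLS]
        back[base + COLS - 1] = map_rows[row][r_index]


def advance(index, bank, bank_map):
    """Increment an 8-bit window index; on rollover, look up the next bank.

    bank_map is a hypothetical table: bank_map[current] -> next bank.
    """
    index = (index + 1) & 0xFF
    if index == 0:
        bank = bank_map[bank]
    return index, bank


# Demo data (made up): one bank of 25 row pages, 256 columns wide.
map_rows = [bytes((x + row) & 0xFF for x in range(256)) for row in range(ROWS)]
screen1 = bytearray(ROWS * COLS)
screen2 = bytearray(ROWS * COLS)
for row in range(ROWS):
    screen1[row * COLS:(row + 1) * COLS] = map_rows[row][0:COLS]

bank_map = [1, 0]                    # two banks that follow each other
l_index, l_bank = advance(0, 0, bank_map)
r_index, r_bank = advance(39, 0, bank_map)
scroll_left(screen1, screen2, map_rows, r_index)
print("window is now columns", l_index, "to", r_index)
```

The colour RAM pass has the same shape, with a single buffer at `$d800` instead of two screens. In a real build `map_rows` would be whichever bank is currently mapped in at `$8000` for the edge being plotted.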
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
you're better off with reu, esp. since 1541u supports it so big user base. now reu can copy memory at 1 cycle / byte speed for scrolling. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Oziphantom: Agreed with Oswald, I believe you can not get appreciably faster whole screen or color-RAM update with just the CPU no matter what you're doing.
Unrolling will help, as will writing the same value to both screen & color-RAM (basically you could have 16 chars of the same color, Quod Init Exit IIm does this), but still the load/store operations dominate the load.
However you're probably not going to do the screen update every frame, so use that to your advantage, e.g. instead of waiting, you can have the main program always calculate at least 1 frame ahead, while IRQs perform the screen update. This means that the large chunk of CPU time taken by it isn't as devastating, as your next frame may already be half ready when interrupted by the screen update, which hopefully leaves enough time to finish. |
| |
Compyx
Registered: Jan 2005 Posts: 631 |
Seems like VSP would not be a bad idea, seeing how all scrolling is only horizontal. Saves a shitload of raster time, except for the one time you have to scroll colorram up.
But like Cadaver said, you can 'cheat' updating the colorram by carefully timing when it happens. But that might screw with any multiplexer, so perhaps move $d800-updating out of IRQ. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
dudes, for simple horizontal scroller neither vsp nor reu nor whatever is needed.... unless you want to scroll way too fast =) it can be done easily in 2 frames, even more easily in more frames. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Compyx: re-entrant IRQ should not mess with the multiplexer as long as IRQ source is acknowledged and CPU registers are stored properly. I've got a test scroll/sprite engine going where exactly the color-RAM scroll happens in IRQ and no problems so far. |
| |
Compyx
Registered: Jan 2005 Posts: 631 |
Quote: dudes, for simple horizontal scroller neither vsp nor reu nor whatever is needed.... unless you want to scroll way too fast =) it can be done easily in 2 frames, even more easily in more frames.
I know, the problem is the colorram updates. Double buffering removes most of the raster time problems, especially when scrolling at 1-4 pixels.
For me, the colorram has always been the problem. Although I managed to avoid that using VSP and linecrunch (infinite scrolling in all directions), but that disabled any use of sprites. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Typical approach is to split color-RAM update in two halves, you can start updating the top half already when raster beam is in the bottom half. This may be critical if you're after NTSC compatibility. Though in the "color-RAM update in IRQ" method I start simply at the bottom of the scrolling area, because I don't fire an IRQ earlier. |
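To get a feel for why the split matters, here is a back-of-the-envelope Python sketch of how many raster lines a colour RAM shift occupies. It assumes an unrolled `lda abs` / `sta abs` copy at 8 cycles per cell and roughly 1000 cells, and it ignores bad lines and sprite DMA, so real figures will be worse. The 63 cycles x 312 lines (PAL) and 65 cycles x 263 lines (common NTSC VIC-II) figures are standard; everything else is an assumption for illustration.

```python
import math

PAL = {"lines": 312, "cycles_per_line": 63}
NTSC = {"lines": 263, "cycles_per_line": 65}   # common NTSC VIC-II revision


def raster_lines(cycles, system):
    """Raster lines consumed by `cycles`, ignoring bad lines and sprite DMA."""
    return math.ceil(cycles / system["cycles_per_line"])


FULL_COPY = 1000 * (4 + 4)   # lda abs (4) + sta abs (4) per colour cell
HALF_COPY = FULL_COPY // 2

for name, system in (("PAL", PAL), ("NTSC", NTSC)):
    print(name,
          "full:", raster_lines(FULL_COPY, system), "lines,",
          "half:", raster_lines(HALF_COPY, system), "lines,",
          "frame:", system["lines"], "lines")
```

Each half needs on the order of 60-odd lines under these assumptions, which is why starting the top half while the beam is still in the bottom half of the screen buys useful slack, particularly on NTSC with its shorter frame.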
| |
Compyx
Registered: Jan 2005 Posts: 631 |
Yes, doing the colorram update somewhere half-screen worked for me. Although I never considered NTSC systems. I'm a demo coder, so for me PAL is all I care about. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
yup just double buffer the screen and then update colorram as cadaver said.... not really hard to do at all :) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
Quote: yup just double buffer the screen and then update colorram as cadaver said.... not really hard to do at all :)
except with a multiplexer + game logics I'd guess. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
multiplexer is no problem really - just use re entrant interrupts like cadaver said |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
In the context of my Project as posted in the Graphics forum
I don't think an REU is the better choice. REU and EF are different and have different capabilities, and at the moment I feel the EF wins. It might be the case that I get 3/4 of the way through, realise I don't need the EF's benefits over the REU's and switch, but at the moment I think the EF wins.
In the context of somebody who wants to scroll super fast or needs every last drop of CPU and has a scroller then yes, using a REU gets you insane scrolling speed, you can start in the lower border and be done in the lower border and it takes 36 bytes or something equally minute to do the scroll code. If the map was stored in the REU probably half of that.
If you just need speed and not worried about length or needing extra RAM then VSP can be an option. It is a pain in the butt, but it does save you a lot of raster most of the time. You need to do a lot of juggling to get things in the right place and you get the "this game is stupid you can't jump off the top of the screen, so lame" comments ;) Also worth noting with an REU you get the "this is not a real 64, it is cheating, not standard man..." comments ;) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
Quote: multiplexer is no problem really - just use re entrant interrupts like cadaver said
the problem is to reach constant 50fps with all the shit - plexer-scroller-logics. Cadaver shows this in his rants nicely. Certainly not a walk in the park if you have to resort to doing things like not running AI in each frame, etc. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
the whole "trick" is to run only the critical stuff in irq and then the rest in main loop.... which indeed may not always run at 50Hz then - but that doesnt really matter unless you have a *lot* of *fast* moving objects and/or very complicated AI (and most games dont have, nor need, either) |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Some of the info in the past rants may amount to bullshit, and for example a "lazy" AI round-robin update can make the game behave unpredictably different in different situations, now I'd just advise to:
- Don't wait in the main program unless you're 1 frame ahead and can't buffer more stuff for the IRQs to show
- Game entity logic can be run at half framerate and sprite movements interpolated
- O(n^2) algorithms like collision detection can be most painful to the framerate, so optimize or try to avoid |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ok, so running with the "take EF as read" assumption, and also sticking with horizontal scrolling as per the OP, I still think there's space for using a code generator for building speedcode to do the d800 update.
Oziphantom, is it safe to assume you're using pure MCM, hence only needing values 8 to 15? |
| |
Peacemaker
Registered: Sep 2004 Posts: 275 |
I have been recently working on a bitmap scroller and had an idea which fits in here very well i guess.
Some say it's a problem to write the colorram every, let's say, 8th frame, as it eats a lot of rastertime. That is true, but here is a trick you could use if your scroller is not very fast (i.e. it does not scroll a full char every frame, but only every 4th or 8th frame; even every 3rd will work).
You double buffer, if you prefer. You scroll the screen as usual, and at the same time as you do the screen calcs, you store the d800 (colorram) colors into a routine, one chunk of data every frame, to keep the rastertime down: 1000/8 values per frame. Once the routine is filled with the new values for the next colorram update, stop, and execute the display routine when the colorram update is due, for example on the 8th frame.
update_routine_every_frame_with_a_chunk_of_data: ; 1000/8 values per frame
lda colorramsource,x
sta storehere1+1
lda colorramsource+1,x
sta storehere2+1
etc pp.
..................
call_this_routine_at_colorram_update:
storehere1:
lda #$00
sta $d800
storehere2:
lda #$00
sta $d801
etc.
When the screen is updated (d018 / dd00 switch if you use double buffer), you just call the colorram routine, which displays the new colors in the same frame.
=)
i hope i could help |
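The chunked patching described in the post above can be modelled as follows. This Python sketch builds the display routine as bytes (`lda #imm` / `sta $d800+i`, 1000 times) and patches 125 immediate operands per frame over 8 frames. The helper names and the test colour data are made up for the example, and the trailing `rts` is an assumption.

```python
LDA_IMM, STA_ABS, RTS = 0xA9, 0x8D, 0x60
CRAM = 0xD800
CELLS, FRAMES = 1000, 8
CHUNK = CELLS // FRAMES        # 125 operands patched per frame
PAIR_SIZE = 5                  # lda #imm (2 bytes) + sta abs (3 bytes)


def build_display_routine():
    """Unrolled 'lda #$00 / sta $d800+i' for every colour cell, plus rts."""
    code = bytearray()
    for i in range(CELLS):
        addr = CRAM + i
        code += bytes([LDA_IMM, 0x00, STA_ABS, addr & 0xFF, addr >> 8])
    code.append(RTS)
    return code


def patch_chunk(code, colour_source, frame):
    """Patch the immediate operands for this frame's slice of cells."""
    start = frame * CHUNK
    for i in range(start, start + CHUNK):
        code[i * PAIR_SIZE + 1] = colour_source[i] & 0x0F


code = build_display_routine()
source = [(i * 7) & 0x0F for i in range(CELLS)]   # made-up colour data
for frame in range(FRAMES):
    patch_chunk(code, source, frame)               # one call per frame

runtime_cycles = CELLS * (2 + 4)                   # lda #imm + sta abs
print(len(code), "bytes of speedcode,", runtime_cycles, "cycles to run it")
```

Under these assumptions the routine is about 5KB and runs in 6000 cycles, at the price of a patching pass spread over the preceding frames.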
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
ChristopherJam That would be a question for the artist ;) I would think it is not though, I was also thinking that it might be possible to either have a 64char set or just use the first 64chars of the main set, with the top few rows being in ECBM mode. This way the far background could be in hires with more smaller pixels to make things look smaller, and the extra colours to help give it depth. Not sure it would be useful though.
Doing the speed code doesn't really save raster though; it saves you 1000 clocks on the CRAM frame at the cost of 8000 clocks over the other frames, right? |
| |
Perplex
Registered: Feb 2009 Posts: 255 |
Peacemaker: Nice if you have 5KB to spare for speedcode and need lots of time for other stuff besides D800-copying during the crucial frame(s). On the other hand it wastes a lot of cycles modifying the speedcode if you are doing other stuff like loading new bitmap data from disk in the background. I guess it all depends on what you'll be using it for. |
| |
Peacemaker
Registered: Sep 2004 Posts: 275 |
Perplex:
Sure, this method is of course very useful for "D800-copying during the crucial frame(s)". It will work even better if you are using VSP, as then you actually have a lot of frames to fill the speedcode with new values. And then a loader won't suffer that much =) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oziphantom OK, I will redo my numbers allowing for 16 possible colours; think it means I'll have to switch to a section of speedcode per pair of columns, otherwise I lose 600 cycles of my savings.
I'll do that tomorrow, but for now here are my 8 colour numbers:
At eight possible colours with one section of speedcode per character column, it only takes me around 5700 cycles to update all of d800, so a savings of 2300 cycles over the direct copy within d800 + column fetch from EF.
Building a section of speedcode only takes around 2200 cycles, including fetching new values from EF; an easy cost to bear even if you were scrolling four pixels per frame. (hah; just noticed the build cost is similar to the savings on the update frames; all I've done is balance the load a little) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Sorry, just realised where your 1000 cycles came from. Yes, it's true, my first suggestion above takes 7000 cycles for a full update, but also you only need a very quick update of one of the 39 segments of code each 8 pixels; should take around 300 cycles to fill from EF. The rest are reused by using progressively lower values of X to select a destination column.
Peacemaker's solution has a lower runtime (6000 cycles), but you cannot reuse the speedcode as the destinations are hardcoded and unindexed.
The 5700 cycle version I was referring to in my last comment only performs eight immediate loads per column, but to get down to that I have to use a counting-sort to reorder the stores, which is considerably slower. At least I get to use sta (zp,x) in the speedcode generator :D |
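One way to picture the grouped variant is the Python sketch below: rows are bucketed by colour value (in the spirit of the counting sort mentioned above), so each distinct colour is loaded once and followed by all of its `sta abs,x` stores. This is an illustration of the idea, not the poster's generator; the names, the `$d800` base and the small interpreter used to check the output are assumptions.

```python
LDA_IMM, STA_ABS_X, RTS = 0xA9, 0x9D, 0x60
ROWS, COLS = 25, 40


def build_grouped_column(column_colours, cram_base=0xD800):
    """Emit one lda #colour per distinct colour, then sta abs,x for each row
    using it. Returns (code bytes, cycles excluding jsr/rts)."""
    buckets = [[] for _ in range(16)]          # one bucket per colour value
    for row, colour in enumerate(column_colours):
        buckets[colour & 0x0F].append(row)
    code = bytearray()
    cycles = 0
    for colour, rows in enumerate(buckets):
        if not rows:
            continue
        code += bytes([LDA_IMM, colour])
        cycles += 2
        for row in rows:
            addr = cram_base + row * COLS
            code += bytes([STA_ABS_X, addr & 0xFF, addr >> 8])
            cycles += 5
    code.append(RTS)
    return bytes(code), cycles


def run(code, x):
    """Tiny interpreter for the three opcodes above; returns {address: value}."""
    memory, a, pc = {}, 0, 0
    while code[pc] != RTS:
        if code[pc] == LDA_IMM:
            a = code[pc + 1]
            pc += 2
        elif code[pc] == STA_ABS_X:
            memory[(code[pc + 1] | (code[pc + 2] << 8)) + x] = a
            pc += 3
        else:
            raise ValueError("unexpected opcode")
    return memory


column = [8 + (row % 8) for row in range(ROWS)]   # worst case: all 8 colours
code, cycles = build_grouped_column(column)
written = run(code, x=39)
print(len(code), "bytes,", cycles, "cycles for this column")
```

With all eight colours present, a column costs 25 * 5 + 8 * 2 = 141 cycles, or 5640 for 40 columns before call overhead, which is in line with the "around 5700" figure quoted above.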
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Slack bastard time. Generate a new segment of speedcode every two charscrolls, use the hidden char column so that the same set of 20 double-width columns can be used twice.
; 20*50*5 = 5000 cycles for storing values
; 20*16*2 = 640 cycles for loading values to write
; 20* 7 = 140 cycles for double-decrementing x and skipping to next routine
; TOTAL 5780 cycles
It'd be about 90 cycles less if only 19 routines were called, and special case code was done for first/last column, but I'm heading back to drive coding for now. I'll write up the speedcode generator another day. |
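The tally above is easy to re-run for other layouts. The Python sketch below reproduces the 20-routine, double-width figure and compares its worst-case load cost with 40 single-column routines at 16 colours; the helper name is hypothetical, and the per-routine overhead for the single-column case is left at zero because the thread gives no figure for it.

```python
CYCLES_STA_ABS_X = 5
CYCLES_LDA_IMM = 2


def update_cost(routines, stores_each, loads_each, overhead_each):
    """Return (store, load, overhead, total) cycles for one full update.

    loads_each is the worst case: every possible colour appears in
    every routine.
    """
    stores = routines * stores_each * CYCLES_STA_ABS_X
    loads = routines * loads_each * CYCLES_LDA_IMM
    overhead = routines * overhead_each
    return stores, loads, overhead, stores + loads + overhead


pairs = update_cost(routines=20, stores_each=50, loads_each=16, overhead_each=7)
singles = update_cost(routines=40, stores_each=25, loads_each=16, overhead_each=0)

print("double-width columns:", pairs)
print("single columns, loads only:", singles[1])
print("extra load cycles with single columns:", singles[1] - pairs[1])
```

The 640-cycle difference in load cost is roughly the "600 cycles" that going to 16 colours with one routine per column would give back, which is the motivation for pairing columns.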