| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Exomizer on-the-fly loading/decompressing
Hey,
Anyone want to share the lowest disk interleave you've managed to use with on-the-fly Exomizer decompression while loading?
I'm currently at 11, using a 2-bit transfer and a lame drivecode (jobcodes only) + 1-sector buffering. However, I don't think the drivecode is the problem; if I try to decrease to IL 10, the C64 often doesn't have to wait for the drive at all for the next sector's data, but occasionally the depack takes too long, resulting in a missed revolution.
I've already done some optimization to the depack routine, including inlining the single-bit fetch (for the literal/sequence decision and for reading the gamma).
Don't think I would switch to another packer just for speed, but nevertheless interested in any battle stories. |
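For readers not fluent in interleave numbers, here is a minimal Python sketch of what the figure means on a standard 21-sector 1541 track (the sector count and the collision handling are illustrative assumptions, not taken from cadaver's loader): it lists the physical order in which a file's consecutive blocks sit on the track, i.e. how many sector times pass before the next wanted block comes around and thus how long the C64 has for transfer plus depacking.

# Illustrative sketch: which physical sector each consecutive file block
# occupies on one 21-sector 1541 track, for a given interleave.
SECTORS = 21                                # speed zone for tracks 1-17

def block_order(interleave, sectors=SECTORS):
    order, s = [], 0
    for _ in range(sectors):
        order.append(s)
        s = (s + interleave) % sectors
        while s in order and len(order) < sectors:
            s = (s + 1) % sectors           # skip already-used sectors
    return order

for il in (10, 11):
    print(il, block_order(il))
# With interleave 11 the next block starts 11 sector times after the previous
# one; with 10 only 10 - which is why a slow depack misses the revolution.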
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Ok. Got it down to 10, not 100% reliably but reliably enough to result in a definite speed increase compared to 11.
Final changes were further inlining in the gamma/decision handling (don't jsr to a "refill bitbuf" routine; jsr directly to getbyte instead) and not saving the X register except when needed. I'm also purposefully disallowing literal sequences, since at least for my use case they don't result in actual disk blocks being saved.
For those interested, the current loader code (may not be useful for general use): https://github.com/cadaver/hessian/blob/master/loader.s
(I'm checking for a sprite Y-coordinate range in the sector transfer, that could simply be removed if sprites were always off to save approx. 40 lines) |
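To make the "reading the gamma" part concrete for readers who don't know Exomizer internals, here is a rough Python sketch of the bit-level reads involved (structure and names are mine and simplified; the real depacker's bit order and tables differ, and the refill call marked below is the jsr that got inlined in the 6502 version):

class BitReader:
    """Simplified model of an Exomizer-style bit reader."""
    def __init__(self, packed):
        self.packed = iter(packed)
        self.bitbuf = 0
        self.bits_left = 0

    def get_byte(self):
        return next(self.packed)            # next byte from the packed stream

    def get_bit(self):
        if self.bits_left == 0:             # refill: this is the hot path
            self.bitbuf = self.get_byte()   # that benefits from inlining
            self.bits_left = 8
        self.bits_left -= 1
        return (self.bitbuf >> self.bits_left) & 1

    def get_gamma(self):
        """Unary prefix of an Elias-gamma-style code: count 1-bits until a 0.
        The count indexes Exomizer's bits/base tables for lengths/offsets."""
        n = 0
        while self.get_bit():
            n += 1
        return n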
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Can't help, but I hope this means a cool cadaver game :) |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Perhaps you should try limiting copy lengths to 256 with -M256 as well. Then you can rewrite the copy loop and possibly even the bits/base decoder to optimize for that. Shouldn't reduce compression much. |
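The practical gain from -M256 is on the 6502 side, where a length that always fits in eight bits can be counted in a single index register; a small hedged sketch of the corresponding copy (Python standing in for the copy loop, names are mine):

def copy_match(dest, offset, length):
    # With -M256 the length is at most 256, so storing length-1 in one byte
    # (and one register) is enough; no 16-bit length handling in the loop.
    assert 1 <= length <= 256
    src = len(dest) - offset
    for i in range(length):
        dest.append(dest[src + i])          # byte by byte, so overlapping
    return dest                             # (RLE-like) copies work too

print(bytes(copy_match(bytearray(b"ab"), offset=1, length=4)))  # b'abbbbb'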
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Actually I just did that :)
Though it seemed that it would still generate longer RLE sequences, so I patched Exomizer2 a bit to honor the -M parameters also for RLE. |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
So did you get any gain from doing that? It's going to generate a few more primary units so perhaps the increased number of bits used for encoding those eats up the gain in the loop? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Surely the 'perfect' interleave would depend on the data being decompressed? I'd expect a bunch of 'copy substrings' would consume a lot more cycles per byte of input than a chunk of literals. |
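A back-of-the-envelope illustration of that point (the cycle figures below are invented placeholders, not measured Exomizer numbers): two packed blocks of the same size can take very different times to depack depending on their literal/match mix.

CYCLES_PER_LITERAL_INPUT_BYTE = 30     # placeholder figure
CYCLES_PER_MATCH_INPUT_BYTE   = 120    # placeholder: length/offset decode + copy

def block_depack_cycles(literal_in_bytes, match_in_bytes):
    return (literal_in_bytes * CYCLES_PER_LITERAL_INPUT_BYTE +
            match_in_bytes   * CYCLES_PER_MATCH_INPUT_BYTE)

print(block_depack_cycles(200, 54))    # literal-heavy 254-byte block
print(block_depack_cycles(54, 200))    # match-heavy 254-byte block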
| |
lft
Registered: Jul 2007 Posts: 369 |
Next up: Make a cruncher that simulates disk access times, and weighs that into the cost function that determines when to produce a copy or a literal unit. Too bad access times aren't really predictable. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
tlr: I first checked how often it was generating those longer sequences; in my case it wasn't often (a few times in the longer files).
The changes have resulted in slightly longer data here and there, but at least not in any additional blocks used.
I haven't been able to bump the interleave down from 10, but the optimizations have allowed a larger "safety factor", i.e. in-game loading will be slower as there are raster IRQs and possibly sprites on, and having a faster decompressor reduces the chances of missed-revolution hiccups.
EDIT: lft: rather than crippling the compressed output for speed, it should be possible to build the disk image dynamically by emulating the CPU during loading/depacking to guarantee optimal interleave. Don't think I'll be going there, but it's a possibility. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
I'm with ChristopherJam here, and i don't see an actual benefit unless you really have constant depacking time on every block. That is pretty unlikely to be achieved, except maybe with prohibitive degradation of overall performance.
The goal is the shortest combined loading and depacking time, and given that you want to download a block from the drive just when it is ready, you do that and prioritise this over depacking. Now, it may happen that time is wasted when the blocks arrive in a non-linear order, such that the depacker is doing nothing between new blocks arriving, until the next block in the linear packed data stream is ready.
However, it is fairly simple to use an interleave that makes all blocks arrive in linear order (even when the loader/depacker supports out-of-order loading for the occasional hiccup). Then you can depack as much as possible until the next block is ready, and after downloading it from the drive, go on depacking where you left off. No time is wasted, and the rest of the file will be depacked when all blocks have been loaded. |
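The scheme Krill outlines, as a minimal scheduling sketch in Python (the timing model, the one-time-unit depack step and the function names are all invented for illustration):

def load_and_depack(arrival, transfer_time, depack_step):
    """arrival[i]: time block i is ready in the drive (blocks in linear order).
    depack_step(blocks_available): do a slice of depacking; returns False when
    it needs a block not downloaded yet, or when the whole file is done."""
    now, have = 0.0, 0
    while have < len(arrival):
        # depack as much as possible until the next block is ready...
        while now < arrival[have] and depack_step(have):
            now += 1.0
        # ...then prioritise downloading it from the drive
        now = max(now, arrival[have]) + transfer_time
        have += 1
    while depack_step(have):            # rest depacks after loading is done
        now += 1.0
    return now                          # combined loading + depacking time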
| |
Krill
Registered: Apr 2002 Posts: 2980 |
lft, cadaver: IMHO, the biggest problem when factoring in actual run-time data to optimise combined loading and depacking is that you have at least one big source of error: the inter-track skew.
As conventional tools to transfer disk images to physical disks (just like most copy programs) do not line up the blocks on every track to the same respective sectors (blocks and sectors are different things here, as the 1541 does not have an index hole sensor), individual copies will start reading a different block when going from one track to the next.
Thus, you will see different timings from one disk to the next, even with the same drive. |
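A small numerical illustration of the skew issue (revolution time and sector count are standard 1541 figures, the step time and skew values are placeholders):

REV_MS  = 200.0                     # one revolution at 300 rpm
SECTORS = 19                        # e.g. a speed-zone-2 track
SECTOR_MS = REV_MS / SECTORS

def wait_after_step(step_ms, skew_sectors, wanted=0):
    """Extra wait before the wanted block comes around after stepping to the
    next track; depends on the (unknown) skew the mastering tool produced."""
    pos = (step_ms / SECTOR_MS + skew_sectors) % SECTORS
    return ((wanted - pos) % SECTORS) * SECTOR_MS

print(wait_after_step(step_ms=20, skew_sectors=0))   # one copy of the disk
print(wait_after_step(step_ms=20, skew_sectors=7))   # another copy, other skew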
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Krill: sure, the depacker cycle-use measurement I had in mind would only be useful within a track, and only as long as the depack CPU time dominates.
In my case I need to minimize the loader/depacker resident code size, so unfortunately multiple-sector buffering and/or out-of-order loading are out of the question for me. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Why are buffering multiple sectors and out-of-order loading out of the question for you?
The former is not a big issue, as the packed data can be stored at the end of the unpacked area, plus a safety margin so that no memory cell gets overwritten with unpacked data before its packed data has been read (see the sketch after this post).
And out-of-order loading... i see no problem there at all, save maybe for some more resident code.
So the main downsides are a few bytes wasted for the safety margin at the end of each uncompressed file, and some resident code to handle ooo loading.
So why are those such big problems? :) |
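The buffer layout Krill describes, sketched in Python (addresses and the margin value are arbitrary example figures; the point is only that the packed stream is parked at the end of the destination area so the depacker's write pointer never overtakes its read pointer):

def inplace_layout(dest_start, unpacked_len, packed_len, margin):
    """Load address for the packed data so the file can be depacked in place,
    growing from dest_start, without clobbering not-yet-read packed bytes."""
    dest_end  = dest_start + unpacked_len
    load_addr = dest_end + margin - packed_len
    return load_addr, dest_end + margin     # load address, last used byte + 1

# Example: a 2 KB music module packed into 1.5 KB, 3-byte safety margin:
print([hex(v) for v in inplace_layout(0x1000, 0x0800, 0x0600, 3)])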
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quoting cadaver
In my case I need to minimize the loader/depacker resident code size so unfortunately that means multiple sector buffering and/or out-of-order loading are out of question for me.
Regarding the size of the exomizer depacker (including tables), this would just be a small addition. Quite a lot of the out-of-order (and serialization) handling can be offloaded to the floppy, where you don't have much code yet, right? |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Bitbreaker
Regarding the size of the exomizer depacker (including tables) this would just be a small addition.
Right, why does it have to be exomizer in the first place, which is neither fast nor small? :) |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
My top priority is still the compression ratio, over loading speed or runtime size.
Also, I have cases where a file is not only a single packed stream, but might consist of..
- first subfile memory allocation info (3 unpacked bytes)
- first subfile packed stream
- second subfile..
in which case out-of-order loading would need to support reading those unpacked bytes in between the packed streams.
Plus, I also have to support Kernal loading, preferably with drop-in replacements for opening a file and reading a byte, without any mandatory buffering. In this case Exomizer's API allows the easiest replacement, though the performance is not optimal compared to e.g. Doynamite, which always reads from a buffer. |
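To illustrate the layout in question, a hypothetical sketch of reading such a multi-part file through a byte-at-a-time interface (read_byte and depack_stream stand in for the drop-in open/read-byte replacements and the streaming depacker; none of these names come from the actual code):

def load_multipart(read_byte, depack_stream, subfile_count):
    """Each subfile: 3 unpacked bytes of memory allocation info, then its
    packed stream. Out-of-order loading would have to cope with these raw
    bytes sitting between the packed streams."""
    subfiles = []
    for _ in range(subfile_count):
        alloc_info = bytes(read_byte() for _ in range(3))
        subfiles.append((alloc_info, depack_stream(read_byte)))
    return subfiles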
| |
Krill
Registered: Apr 2002 Posts: 2980 |
You don't want any of those pesky crackers to come up with a shorter version, eh? :)
So yes, if 3 bytes shorter is what you want, combined with the best pack ratio so far, then exomizer indeed.
The subfiles could be split into separate files for unpacked and ooo-loading. Then slap "IFFL"-style loading on top to make up for the block granularity overhead.
Kernal loading is not a problem, as ooo-loading and buffer-based depacking do not contradict it. My loader has Kernal load fallback and still supports ooo-loading and Doynax LZ. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
I'll stop when I reach 0 blocks free on the disk side, so using Exo allows me to cram the most content in :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
Nah, you don't stop when you reach 0 blocks free. That's the starting point. From there on, you have to do some actual thinking about how to best represent each individual effect/level/whatever based on its content, to produce optimal input for the generic cruncher.
I admit I have limited experience of this for C64. For other platforms, though... |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
In MW4, I remember shuffling some dialogue around to improve compression, as well as saving only sprites facing one way (generating the flipped ones at runtime after loading), quite late into the project when I had hit 0 blocks. This time I'm already doing the sprite flipping from the start, though.
But yeah, something like reordering tiles/chars in levels (or patterns in songs) has the potential to save some additional space without changing anything functionally. |
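For reference, the runtime sprite flipping amounts to something like the following (Python sketch assuming hires sprite data; multicolour sprites would need bit pairs reversed instead, and the real thing would of course be 6502 code):

def flip_sprite_hires(sprite):
    """Mirror one 24x21 hires C64 sprite (63 data bytes, 3 bytes per row)."""
    rev8 = lambda b: int(f"{b:08b}"[::-1], 2)   # reverse the bits in a byte
    out = bytearray(63)
    for row in range(21):
        a, b, c = sprite[row * 3: row * 3 + 3]
        out[row * 3: row * 3 + 3] = bytes((rev8(c), rev8(b), rev8(a)))
    return bytes(out)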
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Why is adding another disk side not an option? :)
(Since you want Kernal loading as a fallback, denser custom formats and extended tracks are out of the question. Oh, and i guess you're using the directory track for storage already?) |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
I want to see the crackers' reaction when they want to add an intro but there's no space free :) I.e. do they use IFFL, add a "boot side", improve compression, or what?
If we're being serious, a boot side would always be an option (freeing around 100 blocks), actual in-game disk flipping is less preferable since you travel back and forth, and I'm not using the directory track yet. But I don't even know yet whether I'll run into serious trouble; if I have extra space in the end, then I'll just add extra cutscene pictures, a movie-credits end sequence or such. |
| |
algorithm
Registered: May 2002 Posts: 705 |
What lft also mentioned in a previous post: it's just as important (if not more important) to transform the data to make it more compressible, or to generate it, before the packing. Even just simple methods such as low/high nibble swaps from byte pairs (for bitmap data) or delta for samples. |
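Two of those transforms sketched in Python (generic examples and one possible reading of the nibble idea; not taken from any particular production):

def delta_encode(samples):
    """Delta for sample data: store differences, so slowly changing waveforms
    become long runs of small values that pack much better."""
    out, prev = bytearray(), 0
    for s in samples:
        out.append((s - prev) & 0xFF)
        prev = s
    return bytes(out)

def split_nibbles(data):
    """Group all high nibbles together, then all low nibbles, so similar
    values end up adjacent - often helpful for bitmap-style data."""
    hi = bytes(b >> 4 for b in data)
    lo = bytes(b & 0x0F for b in data)
    return hi + lo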
| |
Didi
Registered: Nov 2011 Posts: 487 |
Don't expect any cracker to be really bothered by a full disk. ;-) |
| |
SIDWAVE Account closed
Registered: Apr 2002 Posts: 2238 |
In my world, exomizer is too slow to be used in a constant-loading production, so I never would.
With some buffering the problem is solved, instead of complicated code to make it "real time".
No? |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
cadaver: make sure to also use 40 tracks then :) |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
make that 41 + kabuto's encoding scheme... ;) |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Minor update to this subject: dug up ideas to speed up drive->C64 sector transfer (for example Newcomer has a nice near-optimal transfer routine, and V-Max games like Rocket Ranger transfer multiple bytes per sync). Could get below interleave 10 that way. Custom decode in drivecode still wasn't necessary, as Exomizer (and potential slowing conditions like sprites) are still the bottleneck, but when the transfer is tuned for speed, it can be hard for the drive to keep up, so I used a 256-byte table to speed up the high-nybble transfer. If a custom GCR decode leaves the high & low nybbles in separate buffers, then it's probably also going to be fast enough without a large table. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
I'm still not sure whether you're on the right track here. Why is decoupling block download and decompression, using out-of-order loading and in-place depacking between the arrival of new blocks, not an option, again?
You'd combine good pack ratio and as-quick-as-possible loading times while being able to independently tune the serial transfer protocol. The sub-files must be split up into separate files, then, but the overhead should be negligible.
But another thing: I've considered patching Exomizer (and other compressors) to get rid of the "safety offset", i.e., the handful of clobbered bytes beyond the unpacked data after depacking in-place.
That is, ensure that the write (decompressed data) pointer always points to memory before the read (downloaded blocks, packed data) pointer. This requires picking different (possibly less efficient) packed representations of data, but this might not be required so often and may be relevant only when crossing block boundaries. Having looked at Exomizer more recently than i, do you think this is possible? |
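What such a patch would have to guarantee can be stated as a simple check; a sketch with a generic stream of (packed bytes consumed, unpacked bytes produced) tokens standing in for Exomizer's actual encoding:

def inplace_safe(tokens, packed_len, unpacked_len, margin=0):
    """Simulate in-place depacking with the packed data parked at the end of
    the destination area and verify the write pointer never passes the read
    pointer. A patched cruncher would reject token choices violating this."""
    read_ptr  = unpacked_len + margin - packed_len
    write_ptr = 0
    for in_bytes, out_bytes in tokens:
        read_ptr  += in_bytes            # packed bytes consumed by this token
        write_ptr += out_bytes           # unpacked bytes it produces
        if write_ptr > read_ptr:
            return False                 # would clobber not-yet-read data
    return True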
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
I could examine the in-place depacking, but I'm not sure if it will work due to the safety offset requirement. For example, assume that the rest of the memory is used by code, level data and graphics, and thus can't be clobbered, and I need to load & depack new music into a 2KB buffer. In the worst case the music module will utilize the full 2KB. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
But with Exomizer, this is really just a few bytes, like 3. Surely you can work around that? :) |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
True :)
Another matter, though, is that I like to keep just one copy of the drivecode, which I modify for the 1581 & other drives, to keep the bootpart size low. The 1541-specific out-of-order scanning & possible custom decode would necessitate a whole second, 1541-specific drivecode and would make the bootpart larger.
Anyway, this is already quite specific to my requirements, and in general I'm already happy with the loading speed, even though in theory I'm operating a technically substandard loader. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
You can let the drives get their custom drive code directly from the disk, rather than the roundabout way of loading drive-specific code into C-64 RAM and then writing just a part of it back to the drive in question. This will make the boot part smaller than it is now, too. |
| |
cadaver
Registered: Feb 2002 Posts: 1160 |
Good point. Naturally the different drivecodes still exist as part of the total data (which can mean a few blocks less free for game data), but the initial boot-up time can be reduced that way, as I currently have quite an M-W + M-R + M-E monstrosity going on for the drive type detection on the C64 side :)