| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
Load first, decrunch later, or stream decrunch?
Work on my first attempt on an IRQ-loading demo continues. I am using Dreamload, and everything is working great.
However, some parts tend to be rather big and eat both diskspace and load time.
Obvious solution would of course be to compress the data.
As I see it, there should be two options.
1. Load entire compressed part, then decompress.
2. Load byte-by-byte and stream-decompress
At the moment I am leaning towards solution 1. To get this working, the unpacking must allow for overlapping between packed and unpacked data, since I don't have space for both obviously. But I guess a smart decompressor works 'in reverse' so to speak, so overlapping should not be a problem as long as unpacked data is larger than packed...
I have looked at Exomizer, and it seems like it does things that way, and the decompressor code is fairly compact, so it could be a way to go.
Option 2 I have not looked into as much. Dreamload does support a callback on byte-for-byte basis, so it should be possible I guess.
So, I ask all veterans in this field - how is it usually done? Any tips and general good ideas? |
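For what it's worth, option 1's overlap requirement can be illustrated with a toy decruncher. This uses a made-up token format (not Exomizer's actual bitstream) and Python rather than 6502; it only shows why packed and unpacked data can share one buffer as long as the write pointer never overtakes the unread packed bytes:

```python
def decrunch_in_place(buf, packed_start, out_size):
    """Unpack a toy LZ stream sitting at buf[packed_start:] into
    buf[0:out_size], all within one buffer: the write pointer w
    must always stay behind the read pointer r."""
    r, w = packed_start, 0
    while w < out_size:
        n = buf[r]; r += 1
        if n < 0x80:                       # literal run of n bytes
            assert w + n <= r, "output would overwrite unread packed data"
            buf[w:w + n] = buf[r:r + n]
            r += n; w += n
        else:                              # back-reference of (n - 0x80) bytes
            length = n - 0x80
            off = buf[r]; r += 1
            for _ in range(length):
                buf[w] = buf[w - off]      # copies from already-unpacked output
                w += 1
    return w

# "ABCABCABC" packed as: 3 literals "ABC", then copy 6 bytes from 3 back
packed = bytes([3, 65, 66, 67, 0x86, 3])
buf = bytearray(9)                         # exactly the unpacked size
buf[9 - len(packed):] = packed             # park the packed stream at the end
decrunch_in_place(buf, 9 - len(packed), 9)
```

Exomizer's real decruncher does the equivalent backwards from the top of memory, but the invariant is the same.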
|
| |
Ninja
Registered: Jan 2002 Posts: 411 |
Yes, DreamLoad can handle both situations, and there is also example source code for option 2 in the demo/ folder.
Exomizer is in almost all cases a good choice.
Besides this, I wonder if there is a "usual" method. It depends on taste and the actual case. I tend to load first and decrunch later, because it often happens that I don't have enough memory for the depacked part while loading and showing another effect. Then again, if you always have enough memory, it may be faster to depack on the fly, as it can partially hide waiting for a new sector. |
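Ninja's trade-off can be put into rough numbers. The point of depacking on the fly is that decrunch work hides inside the loader's idle time (head stepping, waiting for the next sector). A crude Python model of that; the figures and the `idle_fraction` parameter are assumptions for illustration, not measurements:

```python
def serial_time(t_load, t_depack):
    # option 1: load everything first, then decrunch afterwards
    return t_load + t_depack

def overlapped_time(t_load, t_depack, idle_fraction):
    # option 2: decrunching hides inside the loader's idle time
    # (seeks, waiting for sectors); only the part that does not
    # fit into that idle time is paid for on top
    hidden = min(t_depack, t_load * idle_fraction)
    return t_load + t_depack - hidden
```

With e.g. 5.6 s of loading, 4.3 s of depacking, and the drive idle half the time, overlapping brings 9.9 s down to about 7.1 s; with no idle time at all (`idle_fraction = 0`) it degenerates to the serial case.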
| |
Burglar
Registered: Dec 2004 Posts: 1101 |
also, choose your cruncher wisely, exomizer gives best pack result, but is a bit slow with depacking, pucrunch depacks faster with a slightly bigger filesize.
according to krill, load+depack at the same time takes the shortest amount of time.
I'd first test exomizer to see if you still need to speed things up. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
if loading (plus depacking) speed is critical i'd also give good old levelsqueezer (or any of its variants) a try - it packs reasonably good, but can be depacked very fast on the fly. |
| |
Burglar
Registered: Dec 2004 Posts: 1101 |
Quote: if loading (plus depacking) speed is critical i'd also give good old levelsqueezer (or any of its variants) a try - it packs reasonably good, but can be depacked very fast on the fly.
true, but afaik there is no crossplatform levelsqueezer... I don't think anyone would want to manually crunch files in VICE each time they want to test a build... |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
levelsqueezer was the first packer ever that existed on pc =) it's in the old taboo-assembler (6502tass or whatever it was called) package if i recall correctly. |
| |
Burglar
Registered: Dec 2004 Posts: 1101 |
in 64tass? got a link? the sources I have are from 1.45b, and they don't compile... also, there's no levelsqueezer sources in there... |
| |
hollowman
Registered: Dec 2001 Posts: 474 |
Partly out of old habit I use Levelcrusher and decrunch while loading. Levelcrusher for DOS is really fast, and I think Krill's loader in combination with LC has given me the best results when it comes to speed. Levelcrusher for DOS is included in the bonus directory of tass here 6502TASS V1.31 , along with a depack routine. You could ask Krill for his loader source code, which includes an LC depacker routine.
I've also used Levelcrusher together with DreamLoad, and it didn't take much code to be able to decrunch while loading, load and decrunch later, and load unpacked. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
i vaguely remember the source for that cruncher being available as well.... mmmmh *shrug* |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
what ninja said. |
| |
Ninja
Registered: Jan 2002 Posts: 411 |
Well, you won't get around experimenting a bit and developing a feeling for an appropriate solution.
Depacking on the fly can be quite a bit faster, if you know what you are doing. If not, there is hardly a difference.
Exomizer can be slower when depacking, but maybe it pays off as you might have to transfer one or a couple of sectors less? And what if you rewrite the decompressor just a bit to speed it up (replacing JSR getbit with a macro already helps)?
So many options and we haven't talked about interleave yet ;)
@Shadow: If you don't need maximum speed, take an easily understandable approach. Doing the first trackmo is usually complicated enough. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Didn't i just post in this thread!? WTF, where is it? |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Yeye.. really phun to rewrite the whole post again :(. Well, to keep it short..
I of course (also) have the solution to these problems. My loader system has ByteBoozer decrunching built in, so you can either load+decrunch at the same time, or load/decrunch separately. You have to use ByteBoozer then of course, but there are both C64 and PC versions of it + source, so that should be bearable.
Looking at performance, it's really the fastest way to load and decrunch at the same time. The byte-per-byte approach sounds cute, but is really inefficient if you look at the properties of crunched data. You will have chunks of data that should be copied into memory, and doing one JSR for each byte there will have an impact on the performance for sure. I had an idea of a hybrid solution there also, using a cyclic buffer, but let's just forget that for the moment. |
| |
Stryyker
Registered: Dec 2001 Posts: 468 |
I know Jolz used a 1 sector buffer in C64 memory in Vandalism News loader. Load a sector to RAM, process that while the 1541 moves to next sector. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Yep, something like that. There is some room for tweaking there i guess on the buffer size, in some situations i think that a buffer of 2 sectors would work better. ..But i haven't tried it out, so i really shouldn't come with wise comments. |
| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
Alright, thanks for the input all. To get going, I did some preliminary tests:
Loading 27.4kb of unpacked data took 12.4 seconds.
The same data packed with exomizer ended up at 12.5kb.
With the exomizer self-extractor (I assume this uses the same code as the depacker source included) it took 4.3 seconds to unpack. I guess it will take a bit longer in the demo though, since I will have music IRQ still going stealing a bit of time. Still, should be below 5 seconds.
If we assume that loading 12.5kb instead of 27.4kb takes 12.5*(12.4/27.4) ≈ 5.7 seconds, the total for loading and unpacking would be less than the time required for just loading the unpacked data.
However, I suspect that this reasoning might not be correct: there is probably some 'init' time to the loading (finding the right track etc.), meaning that it will likely take more than 5.7 seconds to load the 12.5kb.
Still, looks like it could be a way to go. Now I need to find some space to put the exomizer depacker in though. Perhaps $0200-$03ff, that is just about the only space I have left. Wonder if the exomizer depacker will fit in there... |
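Shadow's estimate can be sanity-checked in a couple of lines. This is a deliberately crude model: it assumes a constant transfer rate and ignores seek and init overhead, which is exactly the caveat raised above:

```python
# Shadow's figures: 27.4 kB unpacked loads in 12.4 s; packed it is
# 12.5 kB and self-extracts in 4.3 s.
rate = 27.4 / 12.4                  # measured transfer rate, ~2.21 kB/s
t_load_packed = 12.5 / rate         # ~5.7 s to load the packed file
t_depack = 4.3                      # measured depack time
t_total = t_load_packed + t_depack  # ~10.0 s, vs 12.4 s for unpacked
```

So even with serial load-then-depack, the packed route should win by roughly 2.4 seconds under these assumptions, which is in the same ballpark as the 1.4 s win Shadow later measured with IRQ and seek overhead included.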
| |
HCL
Registered: Feb 2003 Posts: 728 |
The thing you're not considering is that you can actually do some depacking *at the same time* while loading. The loader spends some time just waiting for the disk drive to find the next sector and reading it. This time can be used for decrunching instead.
Then there's also difference in performance on different decrunchers :P.
ByteBoozer decruncher is easily less than $100 bytes. Integrated with loader it ends up on ~$220 bytes. Your choice ;).
|
| |
Danzig
Registered: Jun 2002 Posts: 440 |
@hcl: sounds like you're trying to sell it, but you forgot the pricing ;) |
| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
Haha, yeah, I also started contemplating whether HCL perhaps has a background in the sales department... ;)
Well, it was a nice pitch, and I would perhaps consider it if I had more time, but switching loaders etc. at this stage is not something I would venture into (must finish the demo in time for LCP08!)
Anyway, I did implement the exomize-after-load variant, and the result was very satisfying:
The time for data loading + decrunching for my test part ended up at 11 sec, so a 1.4 second win in total time compared to the unpacked data, as well as saving more than 50% diskspace! |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Hehe :D, well i don't know if i could sell vacuum cleaners when i can't even sell this, which is free! I'll even buy you a beer if you use it ;).
Horrible to hear that you're ok with such an un-optimized solution. But in the age of warp speed, who cares anymore :(. Just for teh phun Shadow, plz send me your test philez and i'll see if my shit is really that much faster. |
| |
Ninja
Registered: Jan 2002 Posts: 411 |
Loading and depacking 50 blocks in 11 seconds sounds indeed quite slow. Is there something happening on the screen while loading? What interleave did you use? |
| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
I have a 2x music player running, as well as a little something else, so the raster IRQ kicks in 4 times per frame. Not sure exactly how much CPU time I'm using, but I'm guessing less than $20 lines total.
HCL: Are you crazy, man? The test data is one of the effects from my new demo, you would be able to rip it, release it right now and get all the glory!!! :D
Just kidding, I'd be glad to provide the file if you want to do some performance testing. However, as I actually do have some stuff running during loading, any time comparisons would be a bit unfair. Your solution is probably much faster anyway, no doubt.
-edit- I'm creating the disk with 'makedisk.exe' and I have set the interleave to 10. Not sure why, I think it was recommended in one of my earlier threads where I asked for advice about IRQ-loading! :)
|
| |
HCL
Registered: Feb 2003 Posts: 728 |
"interleave at 10".. hmm, funny choice. Well, in that case an integrated "load and decrunch at the same time" would probably help you a lot. This technique is very forgiving of a non-optimal sector interleave, since all otherwise wasted time is used for goodie goodie decrunch :).
Never mind, i still think your stuff runs better than many of today's demos. Only the fact that you're asking shows that you have some clues :). ..and sure, i can wait with the benchmarking until you've released the demo :). |
| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
I have converted all parts now to 'load-then-deexomize' and saved a whole bunch of space! The previous build of the demo had 131 blocks free, the new has 432. Guess I need to code more parts! :D |
| |
Ninja
Registered: Jan 2002 Posts: 411 |
Yeah, Exomizer can be frustrating :D
Try lowering the interleave, that may help a bit (Yeah, I hear you, guys: "Depack while waiting!" "Load the sectors as they come!"...) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Agreed, i give up my propaganda ;). It sounds like interleave 6 would work fine in this case. If you have different loading parts with different CPU load, perhaps 8 is safer. At 10 you must have really, really much shit going on to miss the next sector. |
| |
Shadow Account closed
Registered: Apr 2002 Posts: 355 |
I tested with interleave 6 instead of 10, but then the load time for the largest part (53 blocks now) went from 5.7 to 13.9 seconds...?!?
-edit- I am not suggesting that HCL is giving bad advice, more likely it is my stuff that is weird :) |
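The 5.7 → 13.9 second jump is the classic symptom of a too-tight interleave: if the C64 side is still busy when the wanted sector passes the head, the drive has to wait nearly a full revolution for it to come around again. A toy Python model of reading one track (idealized: fixed rotation speed and uniform per-sector processing, which real drives don't have):

```python
def track_read_time(n_sectors, interleave, busy_sectors):
    """Return disk revolutions needed to read one full track.
    busy_sectors: per-sector processing time, measured in units of
    one sector passing under the head."""
    passed = 1                          # sectors gone by: first one read
    for _ in range(n_sectors - 1):
        gap = interleave - 1            # sectors skipped between reads
        if busy_sectors > gap:          # too slow: missed it, wait a rev
            passed += interleave + n_sectors
        else:
            passed += interleave
    return passed / n_sectors
```

With 19 sectors per track, interleave 6 beats 10 while the processing fits in the gap, but as soon as it doesn't, every sector costs an extra revolution and interleave 6 becomes far slower: the same pattern as the 5.7 vs 13.9 s measurement.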
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Thou shalt never suggest that HCL is giving bad advice. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Shadow: Didn't you say your shit needs less than $20 rasterlines!? If that's true, it's really weird :). |
| |
Danzig
Registered: Jun 2002 Posts: 440 |
"shit wasting $20 rasterlines" is a pretty huge "sausage" by the way *puke* just imagine, a screen wide and 32 pixels high bruchwurst... |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Aber Danzig.. <:) |
| |
Danzig
Registered: Jun 2002 Posts: 440 |
go and measure ;) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
:D |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
or decrunch first, load later. |
| |
Danzig
Registered: Jun 2002 Posts: 440 |
Quote: or decrunch first, load later.
sounds like "enroll your penis, then jerk off"... oswaldb0rgar? :D |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
grab your dick and doubleclick :) |
| |
The Shadow
Registered: Oct 2007 Posts: 304 |
In my day, the method of decrunching while loading was considered best, since you would not have to wait after loading. The Sharks used the Levelsqueezer and the GI Joe IRQ loader with a desqueeze routine that operated while loading on most of their multi-load cracks. Levelsqueezer 2.0 does a fine job of reducing file sizes, and the desqueezer uses fewer bytes than the Exomizer decrunch routine. |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
It all depends on the intended flow of the demo. If you're going for quick transitions, decrunching while loading is a problem when loading during effects: the decruncher will work slowly and will most probably make the drive wait.
The best method (although lots of work) is to optimize loading for each file, which means using the optimal interleave per file, according to the free CPU during its load. With fast loaders, this method pays off big time! |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@raven: I used to think this way too, but have changed my mind after doing some testing. There is no way to choose the *optimal* interleave without risking missing a sector at some point, and that will be *really* expensive with your method. ..and even if you choose a good interleave, there will be some waiting time anyway, which can be used for goodies.
Also.. in practice, there are some problems with your method. If you have file1 with optimized loading interleave 7, and file2 works best with 5, the track where file1 ends and file2 starts will be quite fragmented -> not possible to have the desired interleave. Especially with many small files it's a problem. |
| |
Danzig
Registered: Jun 2002 Posts: 440 |
@hcl: what brings us to the conclusion that onefilers are teh b3tt3r demos *rofl* |
| |
Higgie
Registered: Apr 2002 Posts: 127 |
@danzig: yeah! that gives more points for the release. better also add some decent trainers! what i wish for some demos is a 'see the end now' option. ;) |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Raven: Please check out my loader to see that decrunching while loading is faster than first loading and then decrunching, plus interleave is something you can forget now. :D |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
@HCL: well, you need to compromise a little if you aim for the highest possible speed with a given loader. I used to concentrate files with equal interleaves in "zones", thus minimizing the gaps. Also, in situations in the demo where there is plenty of time to load, a not-so-optimal interleave can be used, one that fills the gaps :)
@Krill: Well, when my new loader is finally finished (been way too long in development, on and off) we can do some tests.
I still think there's no substitute for properly optimizing the disk: saving files sequentially, starting from track 1, with the best interleave for the loader. This also ensures the head only moves 1 track at a time, which greatly minimizes seek times between tracks. Maybe this all sounds like lots of work, but if you need to milk every little bit of speed it might be worth it. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Raven: Yes, of course it's still worth it to shove all the philez in order starting from track#1. That will always minimize the searching time for each new file, plus it makes the poor 1541 STFU AMAP.
Then I second Krill: You can (almost) forget about the interleave when you're done with your loader, just set it to 8 or something and you'll be fine(st) in more or less all situations. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ack. Interleave optimizing can still be done within very narrow limitations, but with doubtful gain. The drive has varying rotation speed, plus your IRQ handler sometimes runs longer, sometimes shorter, and interrupts the loader at pretty much random points.. So you can't really optimize this like a raster routine. Same goes for other parts of the loader, but i guess you know this after the experience with Insomnia. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh and by the way: With a slight modification of the file format (you'll lose one or two bytes of data per block depending on certain options) you can have a speed equivalent to the per-load (!) optimum sector interleave. That is, even with varying spindle speeds, IRQ handler run-times, &c., you'll have the best virtual interleave for any instance of loading.
My implementation is somewhere in the far future, but i can disclose my ideas to anyone interested and willing to implement it. |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
@HCL: interleave 8 is slooow :) I mostly used 5 and 4 in Insomnia, and even that wasn't enough at times. Of course I couldn't use that interleave when a "full" effect was running, so I usually disabled some of the routines for a short time (start/end the part in stages) to free cycles for loading.
@Krill: I also fooled around with various ideas involving file-system changes, but didn't implement any of them. I decided to first simply try to get the most out of what I already have and see how far it gets me. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Raven: Quote:interleave 8 is slooow :)
Come on, get a grip! You can forget the interleave. It doesn't matter if it's 4 or 8, it will be faster than your 4 + decrunch afterwards anyway. That's the whole point. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
*mutters something about out-of-order loading and re-ordering on the c64 side* |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Yeah, that sounds kinda funny, JackAsser was babbling about it also (non-soberly ;). But how do you avoid loading stuff that might belong to another phile!? You're wasting data for knowing that!?
Dunno, since we have all forgotten about the interleave by now, what's the use of catching sectors!? ;). |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Yeah, that sounds kinda funny, JackAsser was babbling about it also (non-soberly ;). But how do you avoid loading stuff that might belong to another phile!? You're wasting data for knowing that!?
Dunno, since we have all forgotten about the interleave by now, what's the use of catching sectors!? ;).
Nevermind me dear chap! ;D |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Perhaps sacrificing a byte or two in the track with a file ID or something?
Just guessing here. I'm not that deep into the mutual dynamics of loading-depacking dependencies. ;) |
| |
Trash
Registered: Jan 2002 Posts: 122 |
One byte for the file ID, two bytes for the destination in C64 memory would perhaps do it? |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
Three bits for file ID, five for offset inside current track. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
8 files should be enough for everyone ? |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
Eight different files on a single track should be enough for everyone who is not trying to break the loader deliberately. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I'd rather use 2 bytes than mess with the extra complexity & data needed to keep track of which track has which file IDs for which files...
(edit: not to mention the mess the only-5-bit offset means) |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
lda sector_id
eor wanted_id
cmp #$20
bcs .skip
tax
lda track_address
adc offset_lo,x
sta sector_address
lda track_address+1
adc offset_hi,x
sta sector_address+1
wanted_id = full_8_bit_index<<5 so it's in the top bits. The offset table is X*253 if you have normal T & S links. track_address is updated when the track changes.
I see very little extra complexity, data and mess :) |
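For reference, TNT's bit layout and the eor/cmp test can be mirrored in a few lines of Python to make the layout explicit (the helper names here are made up, not from the thread):

```python
def pack_id(file_index, block_offset):
    # top 3 bits: file number (0-7); low 5 bits: block offset in track (0-31)
    assert 0 <= file_index < 8 and 0 <= block_offset < 32
    return (file_index << 5) | block_offset

def accept(sector_id, wanted_file):
    # mirrors: eor wanted_id / cmp #$20 / bcs .skip
    # xor clears the top three bits iff the file number matches,
    # leaving just the offset, which is always < $20
    x = sector_id ^ (wanted_file << 5)
    return x if x < 0x20 else None     # block offset, or None to skip
```

The xor clears the top bits exactly when the file number matches, leaving an offset below $20, which is what the cmp #$20 / bcs pair tests for in the 6502 version above.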
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
nice one indeed. but it looks to me like you'll end up having to store extra track/sector_id and track/file/offset tables, with extra routines to handle them. Do you want to load those tables before starting to load a file, or keep them in memory? why not instead waste that extra 1 byte/sector? |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
@hcl: Any idea where I might find the source for ByteBoozer's decruncher? In my tests it gets compression ratios in the same ballpark as Exomizer (which itself has a scary tendency to beat zip, gzip, bzip, and 7z on small files) and I'm in need of something with a better chance of keeping up with my new-and-improved loader.
I began disassembling the binary embedded in cruncher.c but gave up when it turned out to relocate itself to the zeropage, abusing the same words both as absolute addresses in instructions and as indirect addresses as well..
By the way do common formatting utilities, and the 1541 ROM's implementation for that matter, line up the sectors nicely when changing tracks? That is when you've just finished reading the last sector on a track can you expect to see roughly the same sector number under the read head? After all just missing that new sector is in practice roughly equivalent to increasing the interleave by one. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@doynax: Oh my, stop disassembling :). it's here: ByteBoozer V1.0. There is even a loader with integrated ByteBoozer decruncher, check out the testprog.. All in TurboAssembler phormat though, haven't exported it to PC-textfile. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quote: @doynax: Oh my, stop disassembling :). it's here: ByteBoozer V1.0. There is even a loader with integrated ByteBoozer decruncher, check out the testprog.. All in TurboAssembler phormat though, haven't exported it to PC-textfile.
Gah! How did I manage to miss that one?
Now then, let's see if I can add a sliding window to get true streaming decompression, and perhaps see if there's anything to gain by optimizing for speed rather than size.
As usual I'm feeling an irresistible urge to reimplement my own new and slightly crappier version of the whole thing. Goddammit this project will never get finished.. |
| |
AlexC
Registered: Jan 2008 Posts: 299 |
Reading this very interesting topic, I started to think about using the additional ($2000-$BFFF) drive memory. While I'm aware that extensions like RAMBoard aren't that common, let's face it: at one point we will be forced to forget about original drives, and MMC64 and 1541 Ultimate are probably just the beginning of 1541 evolution; VICE supports additional drive memory without any problems. Anyway, I was wondering if somebody has done some testing of decrunching in the drive and then sending the code to the C64. I haven't experimented with the transmission yet, so I don't have any idea about potential bottlenecks. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
I'm working on ULoad Mini, optimized to use as little RAM as possible on the C64 side. It does LZMV decrunching using a 256-byte ring buffer in drive RAM, and sends over an uncompressed stream. The performance hit is pretty severe, but the loader fits in a fraction of the space on the C64 side (currently 138 bytes for full 16-char filenames, less if you go for 2 chars).
|
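A sketch of the ring-buffer idea in Python. LZMV's actual token format isn't shown in the thread, so the window mechanics below are a generic illustration of streaming LZ decompression with a small window, not MagerValp's real code:

```python
class RingWindow:
    """Sliding window for streaming LZ decompression: keeps only the
    last `size` output bytes, so back-references must stay within it."""
    def __init__(self, size=256):
        self.buf = bytearray(size)
        self.pos = 0

    def emit(self, b, out):
        # remember the byte in the window and pass it on downstream
        self.buf[self.pos] = b
        self.pos = (self.pos + 1) % len(self.buf)
        out.append(b)

    def copy(self, offset, length, out):
        # back-reference: offset counts back from the current position
        for _ in range(length):
            b = self.buf[(self.pos - offset) % len(self.buf)]
            self.emit(b, out)

win = RingWindow()
out = bytearray()
for b in b"AB":
    win.emit(b, out)    # literals
win.copy(2, 4, out)     # back-reference into the window
```

On the 1541 side the same structure fits in a single 256-byte page, with the wrap-around coming for free from 8-bit index arithmetic.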
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
1541U or MMC would have to become kind of a standard setup, and even be supported in emulators, for this to become a reality. I don't think it will, since for soon 30 years the standard has been C64+drive. |
| |
AlexC
Registered: Jan 2008 Posts: 299 |
Quote: I'm working on ULoad Mini, optimized to use a little ram as possible in the C64. It does LZMV decrunching using a 256-byte ring buffer in drive ram, and sends over an uncompressed stream. The performance hit is pretty severe, but the loader fits in a fraction of the space on the C64 side (currently 138 bytes for full 16 char filenames, less if you go for 2 char).
Very interesting. Is there any chance to take a look at it? This would make it a perfect solution for releases that occupy almost all free memory, for example. I can imagine that the user could wait a bit longer during initial loading to be free from additional loading during gameplay. Sometimes even switching out the IO space at $D000 for a relocator, for example, is not enough to fit everything into C64 RAM. |
| |
AlexC
Registered: Jan 2008 Posts: 299 |
Quote: 1541u or mmc should become kind of a standard setup, and even supported in emulators to this become a reality. I don think it would, since soon 30 yrs the standard is c64+drive.
At one point they will all break, due to mechanical reasons for example. Secondly, I prefer - and I guess I'm not the only one - loading files from an SD card. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Hmm.. and sooner or later all those 6510 chips are gonna break too, so why don't we just code PC demos instead?
No kidding.. I'm not saying we should not use the 1541U, but the day we stop supporting the original 1541 drive we have probably taken a step in the wrong direction. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Quote: Very interesting. Is there any chance to take a loot at it? This make it perfect solution for releases that occupy almost all free memory for example. I can image that the user can wait a bit longer during initial loading to be free from additional loading during gameplay for example. Sometimes even switching out IO space at $D000 for relocator for example is not enought to fit everything into c64 RAM.
The 1-bit protocol I was going to use (which is basically the same as Covert Bitops et al) doesn't work in this scenario - I need a protocol that does generic getbyte/sendbyte. Currently the loader just hangs on startup, so it's of little use to anyone right now.
|
| |
AlexC
Registered: Jan 2008 Posts: 299 |
Quote: The 1-bit protocol I was going to use (which is basically the same as Covert Bitops et al) doesn't work in this scenario - I need a protocol that does generic getbyte/sendbyte. Currently the loader just hangs on startup, so it's of little use to anyone right now.
Thanks for the feedback. BTW: have you considered burst option in c128/1571? |