| |
lft
Registered: Jul 2007 Posts: 369 |
GCR decoding on the fly
Here's how to do it:
http://linusakesson.net/programming/gcr-decoding/index.php |
|
| |
PAL
Registered: Mar 2009 Posts: 292 |
the only thing i do understand here is that you are insanely great at making solutions to problems that no one else has been able to solve... when krill says that it is awesome i guess your loader is pretty awesome... hats off for that! |
| |
Skate
Registered: Jul 2003 Posts: 495 |
omfg! amazing thinking on how to get rid of bit shiftings using tables. probably people who had tried this missed using "no sequence of three zeros" rule as the key point for creating tables. biiiig thumbs up! |
| |
Killsquad Account closed
Registered: Jun 2005 Posts: 17 |
foooOOOFFF.. the sound of this as it went straight over my head. Brilliant work, again, LFT. Even though I didn't understand all of this :D. |
| |
Zyron
Registered: Jan 2002 Posts: 2381 |
As above. ;) |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Very clever implementation! Excellent use of undocumented opcodes to mask a single read value in more than one way.
I love it how you are also a good technical writer being able to explain things in a concise manner and at the same time providing the context of the problem.
Keep inventing! |
| |
algorithm
Registered: May 2002 Posts: 705 |
Very innovative! Keep up all this work. Nothing is impossible :-) |
| |
Ejner
Registered: Oct 2012 Posts: 43 |
That is just really mindblowing! I'll have to read that again a few times! :-) It's really interesting, but I kind of lost track and concentration when I came to the part with illegal opcodes and "zeroes are stronger than ones" and stuff... It seems that this is what no one else has been able to do or figure out for the past 30+ years. -You finally did it! :-) A milestone in c64 history... Thanks for sharing! :-) |
| |
Clarence
Registered: Mar 2004 Posts: 121 |
Great achievement Lft, can you tell an estimation, how much faster can a loader be using this technique? |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
I had to read it twice, since it was too brilliant for my puny mind to grasp on the first try.
Quoting Clarence: Great achievement Lft, can you tell an estimation, how much faster can a loader be using this technique?
Instead of read -> decode -> transfer, it's now read & decode -> transfer. If your loader currently is using, say, interleave 5, it can now use interleave 4 (unless I'm missing something). |
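As a rough illustration of what the interleave figures above mean, the following Python sketch lays out a track's sectors with a given interleave and estimates how many revolutions it takes to read them all in that order. The function names and the 21-sector track are assumptions chosen for illustration; they are not taken from any of the loaders discussed here.

```python
def interleave_order(sectors, interleave):
    """Return the order in which sectors are visited for a given interleave."""
    order, used, s = [], set(), 0
    for _ in range(sectors):
        while s in used:              # slot already taken: slide to the next free one
            s = (s + 1) % sectors
        order.append(s)
        used.add(s)
        s = (s + interleave) % sectors
    return order


def revolutions(sectors, interleave):
    """Approximate revolutions needed to read every sector once, in order."""
    order = interleave_order(sectors, interleave)
    slots = 1                         # the first sector itself
    for prev, nxt in zip(order, order[1:]):
        slots += (nxt - prev) % sectors   # sector slots passing under the head
    return slots / sectors


if __name__ == "__main__":
    for k in (5, 4):
        print("interleave", k, "->", round(revolutions(21, k), 2), "revolutions")
```

Under these assumptions, dropping from interleave 5 to interleave 4 saves roughly one revolution per track, which is the gain being estimated in the post above.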
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
okay no words for this. congratulations you are officially a genius. chuck norris of c64 :) |
| |
Bob
Registered: Nov 2002 Posts: 71 |
Excellent job, truly, I read your doc, and I think understood it :) , I had a harder time to understand why data was packed as such from the drive though ;)
Now the question is that this has speeded up the reading to a new level, or unloaded the process for other purpose?
this tech is very interesting for our future coming demos where one of the main problems in trackmo is the loader/cruncher relation, where nowadays the uncrunching takes longer time than loading a part ..but either way this shortens the whole loading procedure which is very good... especially where we use all memory :)
LFT, once again, hats off for your remarkable innovative approaches and questioning and challenging the unquestionable! ... Where people have accepted that is how it is... and you just like say WHY?... I love that.. keep up the spirit.. if you need anything help support or be in Censor just let me know ;)
|
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Yet again, amazing solution and description of a problem - this time one I had been blissfully ignorant about until now. Hopefully this will become a standard in trackmo loaders and lead to more action packed demos. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :) |
| |
Kabuto Account closed
Registered: Sep 2004 Posts: 58 |
Truly amazing what you did, and also how well you've documented it. |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
This is awesome, it was such a simple approach in the end.
Congrats! :)
@MagerValp:
Interleave 4 can be quite easily done without full on-the-fly decoding, I wonder if its possible to reach interleave 2 with this... |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes (Vorpal V1 and derivatives like Heureka-Sprint), and there only with non-interruptible transfer. Last time i checked, that is. |
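To put a rough number on the time budget described above, here is a small Python sketch. It assumes a PAL C64 clock of 985248 Hz, a 300 rpm drive and a 21-sector track, and it ignores sector headers, gaps, sync marks and badlines, so treat the result as a ballpark only.

```python
PAL_CLOCK_HZ = 985_248        # C64 PAL CPU clock (assumption: PAL machine)
REV_SECONDS = 60 / 300        # one revolution at 300 rpm


def cycles_per_byte(sectors_per_track, payload_bytes):
    """C64 cycles available per transferred byte while one sector passes by."""
    sector_seconds = REV_SECONDS / sectors_per_track
    return PAL_CLOCK_HZ * sector_seconds / payload_bytes


if __name__ == "__main__":
    for payload in (0xFE, 0xE0):
        print(hex(payload), "bytes:", round(cycles_per_byte(21, payload), 1), "cycles/byte")
```

With these assumptions the budget comes out at somewhere around 37 to 42 cycles per byte, which gives a feel for why only a tight, non-interruptible transfer loop has a chance of keeping up.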
| |
Fungus
Registered: Sep 2002 Posts: 691 |
lft : this is pretty brilliant, I have to say.
Do you think it may be possible to decode without recombining the nybbles into whole bytes?
As you know, there is extra overhead when converting the bytes back into nybbles again for transfer over serial which in normal loaders really wastes a lot of time. So it would be very interesting if this extra step can also be eliminated like it is in Krill's, Mine, or the action replay loaders.
In my own loaders I also am calculating the checksum after decoding the nybbles with an eor/sta pair for each nybble, and finally checking those against the nybbles read from the checksum. I did that of course to eliminate another loop of overhead to check the contents of the decoded bytes against the checksum.
I think it would be INSANELY fast if the methods could be combined. |
| |
Axis/Oxyron Account closed
Registered: Apr 2007 Posts: 91 |
Gosh, this is really fucking brilliant. And thanks for the article. This was a very interesting read. I never knew about that GCR encoding problem.
So there are still unanswered questions.
How big is the practical gain? What is the cost of GCR encoding e.g. in Krills loader?
Are there any disadvantages like e.g. compatibility issues?
@Krill: When will we get our hands on a version of your loader using this technique? ;o)
|
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting Krill: Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :)
Yes please! :D |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Quoting Krill: You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes
So don't use the whole sector, only decode as much as you can transfer while the next sector flies by. You'd lose 20% but you can transfer a full track in two revolutions... |
| |
Fungus
Registered: Sep 2002 Posts: 691 |
If you are ok with 20% data loss, might as well use a 3/4 encoding scheme like rapidlok which is much easier to decode, requiring only and'ing and or'ing.
|
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Don't want to lose D64 compatibility though, but yeah, full sector and interleave 2 is probably more realistic. |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
Fungus: combining the nybbles is what allows using the stack as decode buffer.
If you dont combine, you'll need to store each nybble separately, which means additional cycles in the read/decode loop.
I do agree about the overhead, but i think this seems like a necessary compromise in this case. |
| |
Stone
Registered: Oct 2006 Posts: 172 |
Brilliant again, with easy to understand explanations. You are the Richard Feynman of C64 coding :) |
| |
Ymgve
Registered: May 2002 Posts: 84 |
Eh, it's no Warp25.
(Just kidding, awesome work!) |
| |
Mixer
Registered: Apr 2008 Posts: 455 |
Such things are the reason why I like the c-64 scene so much :)
But, how about writing to a disk? Is it doable in equally paced manner?
|
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quote: Quoting Krill: You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes
So don't use the whole sector, only decode as much as you can transfer while the next sector flies by. You'd lose 20% but you can transfer a full track in two revolutions...
You can still load a full GCR-encoded track in 2 revolutions. Just that the trick is somewhere else (and not applicable to IRQ loading). :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Axis: The practical gain is somewhat of a mixed bag.
Generally, if you can decode a block while it's being read AND also do something smart with the checksumming, you save about one revolution per track. For reference: On 1541, my loader (v138) has (virtual) interleave 5, and on 1571 (2 Mhz) it's 4. (Plus a scan revolution, but that is another issue and will soon be optional.)
Lft's approach, however, moves the checksumming to the C-64, as the approach is still not fast enough to do that while reading as well. So there are two options now: Do a separate checksumming pass on the C-64 side (wastes valuable time for decompression), or modify the format on a high level (EOR all the bytes before saving, then EOR again during transfer), so checksumming basically comes for free with the transfer. The second option is more feasible speed-wise, but less so from a compatibility/useability standpoint, given that you save in a somewhat non-standard, albeit high-level D64-compatible, format.
Now, moving over the checksumming in a similar manner, pre-existing approaches (like in my loader) would shave off that one revolution as well.
So this is largely an academic problem in the context of IRQ loading.
Furthermore, Lft's approach does need quite a bit of memory in the very limited drive RAM, which i cannot afford due to a few architectural goals i'm not willing to sacrifice (ease of use and compatibility by loading via directory and filenames, support for non-1541 drives, disabled true drive emulation, etc. - Lft's system loads via a given set of tracks/sectors, limiting all that).
Since checksumming after transfer also solves another problem (flipped bits during transfer in heavy electrostatic environments like parties), chances are good i'll add such an option (which, as mentioned, will also have the same actual speed improvement). But then i always said there are a few more options to speed up things, yet nobody really ever needed it so far, seeming content with the speed as it is.
Bottom line: Limited practical gain, but awesome somebody finally made it after many have tried. |
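The idea of letting the checksum ride along with the transfer can be sketched in a few lines of Python. This is only an illustration of the general principle (XOR every byte as it is received and compare against a checksum byte sent last); the function names are made up, and it does not reproduce the on-disk format change described in the post above.

```python
from functools import reduce


def send_block(payload: bytes) -> bytes:
    """Drive side (hypothetical): append the XOR of all payload bytes."""
    checksum = reduce(lambda a, b: a ^ b, payload, 0)
    return payload + bytes([checksum])


def receive_block(stream: bytes):
    """C64 side (hypothetical): accumulate the XOR while storing each byte."""
    running = 0
    data = bytearray()
    for byte in stream[:-1]:
        running ^= byte               # one extra EOR per byte, no separate pass
        data.append(byte)
    return bytes(data), running == stream[-1]


if __name__ == "__main__":
    block = send_block(bytes(range(254)))
    data, ok = receive_block(block)
    print("transfer ok:", ok)
```

Because the check happens on the receiving side, a single verification covers both read errors and bits flipped on the serial bus, at the cost of needing a retry path back to the drive.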
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quote: Quoting Krill: Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :)
Yes please! :D
So will YOU need that speed? I have the vague feeling you're more the screen-off-non-IRQ-loading kind of guy, and for that... I do have two ideas on paper waiting to be tried out, both D64 compatible, giving 50x and 32x speed. |
| |
Axis/Oxyron Account closed
Registered: Apr 2007 Posts: 91 |
I think every little speedup on loading would help everyone producing c64 trackmos. We all struggle a lot and invest months of our life to optimize loading-/decrunchtimes of our parts to get the pace a little bit up and to have time left to do some nice transition work.
Perhaps it would be easier to get the gain on optimising decrunchers rather than loaders. I think there is also a big potential on that.
But perfect solution would be to optimize both to the max.
Because we are still so far away from what the amiga guys were doing back in the 90s due to slow loading and decrunching.
|
| |
HCL
Registered: Feb 2003 Posts: 728 |
One brilliant piece of code in here, but i don't see that you gain very much in the total solution. Then i have to admit that i didn't even find any transfer-code in there yet :P, which is a quite essential part of the drive coding (and the major bottle neck of course).
LFT always has a fresh approach to c64 problems, and always seem to come up with something new!! A genius by definition :). Using extensive tables for gcr-decoding was used in my own loader(s) (in Cycle, EoD etc..) but not this beautifully of course.
One funny detail is that last fall i was also working on my disk-loader. I also came down to 100% on the fly decoding, but was not satisfied because it uses SAX, which doesn't work on all my (2) drives. So i reworked it and the result can be found in TimeMachine. Don't use that loader as any reference though, it seems to still have problems on 128DCR which are not solved yet. |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
why checksumming btw? if the data is fucked, running the demo is fucked either way, you know it explicitly or not.
approx how much time is a track revolution?
|
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
@Oswald:
My thoughts exactly, which is why I didnt bother with checksumming in my loader.
I mean, its a trackmo, even if there is a loading error, there's usually no time to reload, so why bother? :)
@HCL:
Will you ever fix EoD so I can finally watch it on my 128D (the only machine I've been using for years) ?
Its a loader problem, which I'm too lazy to debug :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
A revolution at 300 rpm takes 1/5 seconds.
Axis: Need for maximum speed noted. Give me a figure to work with for your next future release, and we can work on optimizing things towards that goal.
At some point you simply go into uneasy-trade-off territory and have to drop one or the other feature for the sake of that extra bit of performance.
As for packers, there are some choices to make, too. When going for maximum speed, the simpler/speed-optimized algorithms (e.g. ByteBoozer, but i also think Doynax and WVL made big progress when it comes to speed) are most often a better choice for best combined loading+decompression speed, compared to squeezing out every last bit at the cost of decompression speed (PuCrunch, Exomizer). Adding another decompressor to the loader is easy, but i can do that quickly following a specific request.
HCL: So you made it as well? Have to check that code. :)
Oswald: Good question. Disk errors are rare with good disks and under lab conditions, but there are spurious/transitional errors with not so good disks and again, under party conditions. Simply dropping the checksumming is an option, but IMHO not a good idea, as the drives and disks do age as well.
raven: And a demo that cannot tolerate slower loading (except for obvious sync problems) is not a good idea either. Drive coding is not like coding raster bars. Eventually you will find a drive/disk/track-skew combo that's slower than anything you tested, with the hardware still in good condition. And since you will end up with some slack anyways to have more of an error margin, why not use that slack with faster drives for retries? :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Krill: Well.. No i didn't, as i didn't keep the code i was unsatisfied with. I could of course do it again, but it would be a 1541-only loader for drives that can cope with SAX. The loader from the Cycle-era uses lots of tables and is faster. I took that code and used your trick of only synchronizing 2 bytes of 5, and got it down to zero overhead in the read-loop. But what i wanted in TimeMachine (and always) was a loader using less tables, so i have space for dir-table in the drive. So, first i removed SAX to make it work on my other drive, and then i made it use only one $20-table for gcr-decoding..
@Raven: Oh, i didn't know that was an open issue ;). Well, in fact this loader from TimeMachine may be able to replace the EoD-loader. Before now i didn't have any loader that could do all that the EoD-loader can. So.. perhaps some day :). |
| |
Graham Account closed
Registered: Dec 2002 Posts: 990 |
Quoting raven: @Oswald:
My thoughts exactly, which is why I didnt bother with checksumming in my loader.
I mean, its a trackmo, even if there is a loading error, there's usually no time to reload, so why bother? :)
A load error message is better than continuing with broken stuff. Also demos is not the only use of a loader.
|
| |
HCL
Registered: Feb 2003 Posts: 728 |
Hmm.. sorry i have to correct myself. I do not gcr-decode on the fly while reading data in my loader, i just manage to split the data into 8 chunks. Then the data in those 8 chunks are used to index into those different gcr-decoding tables on the fly when transmitting the data. I think this is more or less like Krill does and many other loaders i have peeked into during the years :). The only improvement i did was that i didn't need to waste any time between the reading and the transmitting, which on the other hand is exactly what LFT did as well :) |
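For readers who have not looked at GCR before, the "8 chunks" mentioned above are the eight 5-bit groups packed into every 5 bytes read from disk; each group maps back to one 4-bit nybble. The Python sketch below uses the standard Commodore 4-to-5 GCR table and shows the split-then-look-up structure in its plainest form. It is a reference model only, with hypothetical function names, and says nothing about how any particular loader arranges its tables.

```python
# Standard Commodore GCR codes for nybbles 0..15.
GCR_ENCODE = [0x0A, 0x0B, 0x12, 0x13, 0x0E, 0x0F, 0x16, 0x17,
              0x09, 0x19, 0x1A, 0x1B, 0x0D, 0x1D, 0x1E, 0x15]
GCR_DECODE = {code: nybble for nybble, code in enumerate(GCR_ENCODE)}


def gcr_encode(data4: bytes) -> bytes:
    """Encode 4 data bytes (8 nybbles) into 5 GCR bytes (8 quintets)."""
    bits = 0
    for b in data4:
        bits = (bits << 5) | GCR_ENCODE[b >> 4]
        bits = (bits << 5) | GCR_ENCODE[b & 0x0F]
    return bits.to_bytes(5, "big")


def split_quintets(gcr5: bytes) -> list:
    """Split 5 GCR bytes into the eight 5-bit chunks, first chunk first."""
    bits = int.from_bytes(gcr5, "big")
    return [(bits >> shift) & 0x1F for shift in range(35, -1, -5)]


def gcr_decode(gcr5: bytes) -> bytes:
    """Look up each quintet and recombine nybble pairs into 4 data bytes."""
    nybbles = [GCR_DECODE[q] for q in split_quintets(gcr5)]
    return bytes((nybbles[i] << 4) | nybbles[i + 1] for i in range(0, 8, 2))


if __name__ == "__main__":
    raw = bytes([0x12, 0x34, 0xAB, 0xFF])
    print(gcr_decode(gcr_encode(raw)) == raw)
```

The awkward part on a 6502 is `split_quintets`: the 5-bit groups straddle byte boundaries, which is why the thread talks about shifting, masking and table tricks.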
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
lft did it with less memory needed :P :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Quote:lft did it with less memory needed :P :)
Uhm.. not sure about that. He is using 2 pages of tables, the whole stack area plus half the zeropage. I managed to squeeze in all my tables into one page, but wasting another page for half-converted data.
..plus the major draw back: He should of course have generated those tables in the drive code!! Now he is wasting 2 precious blocks of disk space for stopid but beautiful tables, that could have been generated in less than half a block probably!! I mean, we *are* discussing beauty of code here, aren't we ;) |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting Krill: So will YOU need that speed? I have the vague feeling you're more the screen-off-non-IRQ-loading kind of guy, and for that... I do have two ideas on paper waiting to be tried out, both D64 compatible, giving 50x and 32x speed.
Ofcuz I will need it for the EoD-killer I'm planning, but the super fast spacemo-loader definitely sounds like something I could use as well.
Btw, I'm definitely willing to sacrifice features like drive agnosticism and dir loading if it can lead to a faster IRQ-loader. |
| |
Clarence
Registered: Mar 2004 Posts: 121 |
I would opt for the fastest possible demo friendly loader also, but maintaining compatibility with all the standard C= drives including the different 1571 versions. That illegal opcode incompatibility HCL mentions, sounds bitter already. :/ |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
From memory, I think there is only one (or two) drive clones that have a problem with illegal opcodes.
I used SAX and didn't lose any sleep about it ;)
Also, I remember only one complaint about it - from HCL!
I wonder how many people still use these drives.
HCL: which make/model is it exactly? |
| |
enthusi
Registered: May 2004 Posts: 677 |
Just make it load fast from SD and 1541u !!!1111
Just kidding :)
But then again, if some non 1541-drives fail, well, they fail. |
| |
algorithm
Registered: May 2002 Posts: 705 |
The bottleneck in most cases seem to be certain types of decompressors (for example with Exomizer it almost takes the same amount of time to load and more) just to depack (even if depacking while loading)
Can't a quick CRC be done at the beginning of each linked part (once all data is loaded and decompressed?) Sure, this would not be a solid way of doing things - in particular if the crc checker at the beginning has load errors, but should be a minimal issue
@krill. interested in your 50speed non-irq loader :-) |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quoting algorithm: Can't a quick CRC be done at the beginning of each linked part (once all data is loaded and decompressed?) Sure, this would not be a solid way of doing things - in particular if the crc checker at the beginning has load errors, but should be a minimal issue
Is there any gain in doing a separate CRC step? Another drawback is that you won't detect partially corrupt T/S links.
|
| |
algorithm
Registered: May 2002 Posts: 705 |
Yes, the data check would be sub-optimal, but may have some type of gain. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Raven: You are funny :). You want me to fix the EoD loader to work on your c128D, but you don't want to care about my disk drive. It's a digilog drive btw, already mentioned in another recent thread. |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
@hcl: Maybe you can put your digilog drive in his C128D?
There, problem solved! |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
@HCL: I believe many people have a 1571 compared to a few with the digilog ;)
Anyway, I seem to remember an email from you saying you removed the SAX's and got the demo working on your drive :) |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting enthusi: Just make it load fast from SD and 1541u !!!1111
Almost agree, actually. If it works on 1541, it works on 1541u, Chameleon, etc. That should be enough for everyone. It's ofcuz nice if it works on other drives as well, but no need to sacrifice speed for that.
Quoting algorithm: The bottleneck in most cases seem to be certain types of decompressors
Which is why I think that in most cases it's not even worth compressing the files. For Pimp My Snail I skipped it, and I think it worked pretty well. I just placed all code/data/gfx together in a tight lump, and "unpacked" it using custom routines that could be optimized for the specific data and speedcode. It might have helped a bit compressing some of the simple gfx though, in a lightweight way optimized for decrunching speed. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Yes, and soon you'll end up with an algorithm very similar to ByteBoozer and other LZ77 variants. Good speed, good pack results, often best combined loading+depack performance. :) |
| |
Clarence
Registered: Mar 2004 Posts: 121 |
Cruzer, next time for a similar task, consider using Werner's LZWVL packer. If you need something fast, and with a better compression ratio than RLE, it is great I think: LZWVL |
| |
HCL
Registered: Feb 2003 Posts: 728 |
..Ok, so it sounds like i should bring back that zero overhead read-loop with all illegals and release together with a new version of EoD? Then Cruzer has to promise to start using ByteBoozer, and we will all be happy :). |
| |
Clarence
Registered: Mar 2004 Posts: 121 |
HCL, I think avoiding support for non C= brand drives is acceptable, then again I don't have any drives as such, so I might be biased. :D |
| |
WVL
Registered: Mar 2002 Posts: 903 |
My 2 cents :
I see more profit in finding a way to read from disk and transfer to c64 at the same time. Should really have a go at that once.. :) prolly i'll quickly see why that isnt possible though..
Also have a look at Doynax's packer, my tests showed that it both compressed better + was faster in decompressing than Byteboozer (sorry David!) |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Thanks for the tips on packers. LZWVL looks promising, at least from the lovely chart Werner did. If I get the time I would like to do a similar test for loading + decrunching combined, with different kinds of files, to see where which kind of compression should be used, and where it should be avoided. E.g. I doubt that it makes sense to pack code, unless it's mixed up with data or full of "align to next page" statements. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting WVL: I see more profit in finding a way to read from disk and transfer to c64 at the same time. Should really have a go at that once.. :) prolly i'll quickly see why that isnt possible though..
It is very much possible. But it isn't suitable for IRQ-loading and very probably needs a disabled screen, too. Have a look at Mafiosino Trackloader (19x) which reads a track in two revolutions. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting Cruzer: If I get the time I would like to do a similar test for loading + decrunching combined
Keep in mind that this involves actually adding the missing decompressors to a loader, as combined loading + decrunching involves decrunching between fetching sectors (hence it is faster than loading first and decrunching after). |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting Cruzer: E.g. I doubt that it makes sense to pack code, unless it's mixed up with data or full of "align to next page" statements.
My experience with packing code has shown that it can indeed be sensible to pack code by separating op-code stream and operand stream. For even better compression, make sure to actually add redundancy to the code (e.g., lda #$00:ldx #$00 is likely to pack better than lda #$00:tax in the end). However, it is not feasible to go to these lengths for anything bigger than 4K :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@WVL: Yes i know about Doynax's packer. It was based on ByteBoozer, at least from the beginning, and optimized from there in all (?) possible ways. I'm just honored by his work :). ..and again, transferring while reading is a no-go if you want to be interruptable, and i would not sacrifice that. |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting Krill: My experience with packing code has shown that it can indeed be sensible to pack code by separating op-code stream and operand stream.
Clever!
Quote: For even better compression, make sure to actually add redundancy to the code (e.g., lda #$00:ldx #$00 is likely to pack better than lda #$00:tax in the end).
That would cause bigger code, resulting in a potentially worse effect, so I would never do a thing like that.
The priority for a trackmo should be effect quality > loading time > file size. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
As i said, 4K.
But it really depends, there is no real loss in "lda #$00:ldx #$00" vs. "lda #$00:tax": same amount of cycles, just one byte more. The packed file is shorter with the former, the unpacked file longer. No problem. :) |
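The two ideas in this exchange (separate opcode and operand streams, and the size/cycle comparison of the two instruction sequences) can be sketched in Python. The opcode table below covers only the three 6502 instructions mentioned and the helper names are hypothetical; a real tool would need the full instruction set.

```python
# opcode byte -> (mnemonic, operand bytes, cycles); only the instructions discussed
OPCODES = {
    0xA9: ("lda #imm", 1, 2),
    0xA2: ("ldx #imm", 1, 2),
    0xAA: ("tax", 0, 2),
}


def split_streams(code: bytes):
    """Separate machine code into an opcode stream and an operand stream."""
    opcodes, operands = bytearray(), bytearray()
    i = 0
    while i < len(code):
        op = code[i]
        operand_len = OPCODES[op][1]
        opcodes.append(op)
        operands += code[i + 1:i + 1 + operand_len]
        i += 1 + operand_len
    return bytes(opcodes), bytes(operands)


def cycles(code: bytes) -> int:
    """Total cycle count of a straight-line sequence of the known opcodes."""
    return sum(OPCODES[op][2] for op in split_streams(code)[0])


if __name__ == "__main__":
    redundant = bytes([0xA9, 0x00, 0xA2, 0x00])   # lda #$00 : ldx #$00
    compact = bytes([0xA9, 0x00, 0xAA])           # lda #$00 : tax
    for name, code in (("redundant", redundant), ("compact", compact)):
        print(name, len(code), "bytes,", cycles(code), "cycles", split_streams(code))
```

Both sequences cost the same number of cycles and differ by one byte unpacked; whether the more regular version really packs smaller depends on the packer and the surrounding code.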
| |
WVL
Registered: Mar 2002 Posts: 903 |
Talking about that Doynax packer, I can't find it on CSDb.. Has it been released or is it just a few people that got it from Doynax himself? |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
http://csdb.dk/forums/?roomid=11&topicid=59374#59404 -> http://doynax.googlepages.com/lz.zip .
Seems like he deleted his account, and before that never officially released the packer. Weird :\ |
| |
WVL
Registered: Mar 2002 Posts: 903 |
Whatever happened to him? :-( |
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
http://sh.scs-trc.net/hereyougo/doynax_lz.zip |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
First off, let me congratulate lft for an elegant solution and an entertaining write-up.
I have attempted to write a loader without an intermediate swizzling stage myself but could never manage it. This despite skipping the checksum, abusing the entire stack and zeropages as buffers, transmitting the resulting bits in a wholly bizarre order, and abusing every illegal opcode in the book.
Quoting Krill: Seems like he deleted his account, and before that never officially released the packer. Weird :\
Oh, I'm still alive and lurking.
I rather lost interest after solving the technical problems but if anyone cares I'll put together a package with a bit of documentation plus binaries and "release" it somewhere reputable.
For the record it was designed to be easy to integrate into streaming loaders.
Personally I'm tempted to abandon D64 compatibility and try out Kabuto's beautiful 7-bit/byte GCR coder. I sort-of get the general idea but will really have to sit down with pen-and-paper to work out the finer details and convince myself that it really can code all inputs.
Anything to put off having to write actual game logic ;) |
| |
Fungus
Registered: Sep 2002 Posts: 691 |
I would love to see a proper release of your compressor doynax. |
| |
Dano
Registered: Jul 2004 Posts: 240 |
+1 for that! |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Quote: I would love to see a proper release of your compressor doynax.
Me too! |
| |
Isildur
Registered: Sep 2006 Posts: 275 |
Regarding Doynax, ByteBoozer, Exomizer, Level Crusher in one place:
http://csdb.dk/release/?id=117165&rss
Do you find this tool useful?
(BTW This is not any kind of adv - just asking)
|
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
isildur, yea, its quite useful, especially since you guys included benchmark disks of the various packers. including your own bongo cruncher, which seems to perform best (only slightly longer than exomizer, but beating every other one in both size and speed).
now, what would really be interesting and useful is to include other irq loaders in the benchmarks (like Krill's, lft's, ...). |
| |
Isildur
Registered: Sep 2006 Posts: 275 |
Burglar, I'll talk to Wegi about that :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
GUI coding tools other than text editors aren't for me.
And to be honest, the code needs serious clean-up, if not complete rewrite, to be useable.
A generic packer/loader test tool would be a nice thing, however, i guess it's hard to accurately measure different design and useability properties in various usecases against mere performance numbers.
|
| |
Isildur
Registered: Sep 2006 Posts: 275 |
It will be hard because Wegi is like Derbyshire Ram on Polish scene (but with tough personality) ;P |
| |
Fungus
Registered: Sep 2002 Posts: 691 |
No, I don't find it useful because my build environment is all command line tools. |
| |
Isildur
Registered: Sep 2006 Posts: 275 |
Fungus, there is command line tool. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I am just about doing a complete rewrite of the packer in C. So the tool will have clean code soon and be ready for commandline driven build-environments. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
Quote:It will be hard because Wegi is like Derbyshire Ram
no. really. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I finally got around to reading lft's post. Amazing work, especially realising you can get down to just two tables. And yes, squeezing every last cycle out of a loop can be quite the brainworm :)
Well done! |
| |
lft
Registered: Jul 2007 Posts: 369 |
I got some benchmark figures for my loader. The unit is the number of
revolutions needed to load a track, a.k.a. optimal interleave (although with
out-of-order loading you don't need to think about interleave). The test
conditions are: No sprites, no interrupts, 25 badlines. This models the most
optimal setup which is useful in practice, either for silent loading while
displaying the BASIC screen (or something else), or for loading with a blanked
screen and a normal sid playroutine being called every frame.
The first row represents the version of the loader used in Shards of Fancy, but
without any decrunching. This version verifies the checksum in a separate pass
after reading a sector, to detect read errors. Then the checksum is verified
again on the C64 side to detect transmission errors.
As I mentioned to Krill at Revision, I had an idea to combine these into a
single checksum verification performed on the C64 side, and then re-read
(possibly another sector) and re-transmit on error. This was implemented, and
corresponds to the second row in the table.
Finally, I optimised the transfer routine and got it down to 74 C64 cycles per
byte. This is a regular atn handshake protocol, with the checksum computed
during transfer. The correct checksum is transmitted as an extra byte at the
end. The performance of this version is shown in the third row.
For easy comparison, I also computed a rough loading speed for this last
version by dividing the number of bytes loaded by the time needed for the given
number of revolutions. This figure should not be confused with actual loading
speed, for which you'd also need to take into account such things as overhead
from the high-level format (necessary to compensate for out-of-order loading),
track stepping, motor spin-up time and skipping sectors that don't belong to
the file. But it provides a rough estimate, and a maximum.
I have verified that the latest version works on real hardware, but the
measurements were obtained using Vice.
track:                       1-17  18-24  25-30  31-35
------------------------------------------------------
v1 (shards)                     4      4      4      3
v2 (combined checksum)          4      4      3      3
v3 (74-cycle transfer)          4      3      3      3
v3 raw loading speed (B/s)   6720   8107   7680   7253
|
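The "raw loading speed" row can be reproduced from the description in the post: bytes per track divided by the time for the given number of revolutions. The Python sketch below assumes the standard 1541 sector counts per speed zone (21, 19, 18, 17), 256 bytes per sector and 300 rpm; the 256-byte figure is an assumption that happens to match the posted numbers.

```python
SECTORS_PER_TRACK = {"1-17": 21, "18-24": 19, "25-30": 18, "31-35": 17}
V3_REVOLUTIONS = {"1-17": 4, "18-24": 3, "25-30": 3, "31-35": 3}  # third table row
BYTES_PER_SECTOR = 256        # assumption: full sector counted
REV_SECONDS = 60 / 300        # one revolution at 300 rpm


def raw_speed(zone: str) -> float:
    """Bytes per second when a whole track loads in the given revolutions."""
    track_bytes = SECTORS_PER_TRACK[zone] * BYTES_PER_SECTOR
    return track_bytes / (V3_REVOLUTIONS[zone] * REV_SECONDS)


if __name__ == "__main__":
    for zone in SECTORS_PER_TRACK:
        print(zone, round(raw_speed(zone)))
```

As the post itself notes, this is an upper bound rather than a real-world loading speed, since stepping, spin-up and format overhead are left out.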
| |
HCL
Registered: Feb 2003 Posts: 728 |
Interesting results! I made some tweaks to my own loader also last week, re-introduced SAX to get zero overhead for the reading loop. This however gives only a speed increase of less than 10% from before (Cycle loader, EoD etc..).
Fair enough, then i went on to the transfer loop, which is (just as LFT's loop) 74 cycles. Here there should be room for some optimizations, but when i cut cycles, the transfer screws up! There is probably some theoretical explanation to this, less than 18 cycles between each read of $dd00 makes it not work. That is when using ATN handshake of course, else it's possible to reduce it a lot..
Anyone got any ideas why 18 cycles seem to be the limit? If it's confirmed then i'm pretty much done.. 2 cycles left to optimize, and that's all, not even sure i'm going for those 2 in that case :). |
| |
Fungus
Registered: Sep 2002 Posts: 691 |
Be sure to check it on some 1541-II and 1571 drives, later VIA revisions sometimes need an extra cycle for the handshake. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
I think 18 cycles is used by most loaders, at least on some of the 2-bit-pairs. I have 18+18+20+18, most others have more, but if it doesn't work with 18, then 90% of all modern demos would not work on 1571 or 1541-II. I have a 1541-II myself and it works there of course :).
Now this is on the computer side, i should say. On the drive you should of course go below 18, at least here and there, to be safe if the drive is running a fragment faster than the computer. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Interesting results! I get close to them with my planned speed-ups, but don't quite reach them yet. Must hurry now to optimize a bit more and push the next release out the door i guess :)
I have added a new experimental protocol reaching 70 cycles per byte (including loop and store overhead) a while ago, this has a few strange-seeming limitations though (like 0 or 5-8 sprites are okay, but not 1-4). No sprite limitations gets it to a whopping 82 cycles. There might be room for improvement in both versions, but this is yet to be explored.
As for the 18-cycle limit with plain 2bit+ATN, which i confirm: My explanation is that waiting for ATN flip in a loop is 6 cycles minimum, then 7 cycles for a miss which does happen, then 4 cycles to set next bitpair, then another cycle due to slightly different clocks, wire delay, missed sampling windows and whatnot. Makes 6+7+4+1=18 cycles.
HCL: I might be wrong, but the drive being slightly faster actually gives you more than 18 cycles here and there on the drive side, according to my understanding. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Quote:HCL: I might be wrong, but the drive being slightly faster actually gives you more than 18 cycles here and there on the drive side, according to my understanding.
Oh, yes of course :). So, in case the drive is a fragment slower, you need to go below 18 cycles here and there. The drive is after all waiting for the computer when needed. I tend to believe my transfer loop is actually working since it has been around for ~10 years by now in numerous demos. Don't know if i have done loading while displaying a sprite multiplexer though, with loading *on* the sprites :). Perhaps the AFLI-zoomer in EoD?.. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting Krill: As for the 18-cycle limit with plain 2bit+ATN, which i confirm: My explanation is that waiting for ATN flip in a loop is 6 cycles minimum, then 7 cycles for a miss which does happen, then 4 cycles to set next bitpair, then another cycle due to slightly different clocks, wire delay, missed sampling windows and whatnot. Makes 6+7+4+1=18 cycles.
I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)
The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Hmm, how does that speed up the drive side, which is the bottleneck here, as it has to wait for the C-64 and respond to ATN flips asap? |
| |
HCL
Registered: Feb 2003 Posts: 728 |
i would say the computer side is the bottle neck, at least i have NOPs in my transfer loop on the computer side.
@Doynax: Hehe.. cool. And you were actually able to do something useful with that data you got from those instructions also.. Impressive! |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting HCL: So, in case the drive is a fragment slower, you need to go below 18 cycles here and there. The drive is after all waiting for the computer when needed.
But the drive is never slower than the C-64. If the drive code is in theory less than 18 cycles between bitpairs when not branching in the loop, then the protocol is not violated in practice, as the branch will be taken, so of course your code should be just fine. :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting HCL: i would say the computer side is the bottle neck, at least i have NOPs in my transfer loop on the computer side.
If you have NOPs on the computer side to make up for drive slowness, how does that make the computer side the bottle neck? :) |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting doynax: I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)
The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.)
Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank? |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Krill: The drive loop is faster (of course, else it would not work), it's the computer side that can not suck out the data faster because of that timing issue you just explained.. Even if i reduce the drive loop to 12 or 14 cycles, the computer side still has to be 18 cycles -> the bottle neck. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting Krill: Hmm, how does that speed up the drive side, which is the bottleneck here, as it has to wait for the C-64 and respond to ATN flips asap?
Excellent question.
It seems to work in practice and has done so for a while even under IRQ/DMA heavy conditions, though that doesn't necessarily mean much given how few drives I've tested. At 15 cycles it starts to crap out once in a blue moon.
I suppose that I don't quite buy your 6 + 7 cycle sum for the ATN cost. Presumably if the first ATN is late then you've only got the branch of the first loop left to execute, plus the seven of the second, equals 9 in total.
Still, it's likely I'm just confused. Anyone care to write up a little simulator to generate some sequence diagrams? |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
doynax: Yes, very likely my quickly-thought-out explanation is wrong.
HCL: Well my point was the relevant bottle-neck drive loop is the wait for ATN flip (the branch back to bit $1800), not the no-branch time between two bitpair updates. But maybe we're just saying the same thing in different words.
lft: I also had the VIC bank restriction thought about using RMW opcodes on $dd00, didn't find a solution either. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting lft: Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank?
To be honest I never even tried. I'm working on a game with somewhat limited VIC tricks so I've gotten away with using bank 3 almost exclusively. |
| |
lft
Registered: Jul 2007 Posts: 369 |
This is how I understand the timing constraints. On the drive side, here's how you transmit two bit pairs:
; prepare value in A
bit $1800
bmi *-3
sta $1800
; prepare value in A
bit $1800
bpl *-3
sta $1800
It is clear that a bit pair cannot be guaranteed to be on the serial bus earlier than 13 cycles after ATN changes, because if ATN changes just after it was sampled during the last cycle of a bit instruction, we need 3 (bpl) + 4 (bit) + 2 (bpl) + 4 (sta) = 13 cycles to put the new value into the VIA.
For this reason, we can use up to 7 cycles to prepare each bit pair. The C64 will not toggle ATN earlier than 4 cycles after reading out the last bit pair. Following this cycle, 3 (remaining preparation) + 4 (bit) + 2 (bpl) + 4 (sta) = 13 cycles.
On the C64 side, after reading a bit pair, we spend 4 cycles writing a new value to ATN. Then we read the new bit pair after 14 cycles. Hence, 18 in total. Why can't we read already after 13 cycles? This is because the clocks of the C64 and the 1541 are almost always out of phase. After updating ATN on a C64 clock tick, it will take on average half a cycle before the next 1541 clock tick. When sending the bits back, there is again a delay before the next C64 clock tick, and the total delay will be one C64 cycle (unless we're really lucky and the 1541 cycles, being a tad shorter, fit perfectly in between the C64 cycles).
C64 1-------2-------3-------4-------
1541 ----1------2------3------4------
(not to scale)
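The cycle counts in this explanation can be checked with a small model. It is only a sketch under simplifying assumptions: the drive loop is treated as sampling ATN once every 7 cycles (3 for the taken branch plus 4 for the bit), the clock phase difference is lumped into a single C64 cycle as described above, and the slightly shorter 1541 cycles are ignored. All names are made up for illustration.

```python
# Instruction timings on the drive's 6502 (cycles).
BIT_ABS = 4           # bit $1800, samples ATN on its last cycle
BRANCH_TAKEN = 3      # bpl/bmi looping back to the bit
BRANCH_NOT_TAKEN = 2  # falling through once ATN has changed
STA_ABS = 4           # sta $1800, value reaches the VIA on its last cycle

def drive_latency(cycles_after_sample):
    """Cycles from an ATN change until the new bit pair has been written.

    cycles_after_sample: how long after the previous sample ATN changed
    (0 = immediately after it, which is the worst case).
    """
    loop_period = BRANCH_TAKEN + BIT_ABS  # 7 cycles between two samples
    wait_for_next_sample = loop_period - cycles_after_sample
    return wait_for_next_sample + BRANCH_NOT_TAKEN + STA_ABS

worst = max(drive_latency(t) for t in range(BRANCH_TAKEN + BIT_ABS))

C64_ATN_WRITE = 4  # cycles spent writing the new ATN value to $dd00
PHASE_SLACK = 1    # one C64 cycle lost to the unsynchronised clocks
per_bit_pair = C64_ATN_WRITE + worst + PHASE_SLACK
print(worst, per_bit_pair)
```

Four bit pairs at 18 cycles come to 72 cycles, which is close to the 74-cycle loops mentioned earlier in the thread (HCL's 18+18+20+18).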
|
| |
HCL
Registered: Feb 2003 Posts: 728 |
Ah, for once i think i understand :). LFT, what is that book you have? everyone should have it ;). |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Yes, this explains everything. :) |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quoting lft: Quoting doynax: I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)
The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.)
Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank?
Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting tlr: Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.
Not sure about the possibility of your idea, but the $dd02 trick is often used already so that the VIC bank can be set by a simple lda #bank:sta $dd00 in IRQ handlers. This prevents possible visual glitches by IRQs hitting between loader-executed lda value/sta $dd00 (and saves masking overhead, too). So setting $dd02 from user code is forbidden, while in your idea, setting $dd00 is. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting Krill: Yes, this explains everything. :)
Indeed. That explanation actually makes sense to me.
Quoting tlr: Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.
Why didn't I think of that?
The SLO keeps zeroes and the SRE doesn't appear to reach the least-significant bits with any ones. I just ran a quick test by poking at $dd02 and I can't see any VIC bank switching during loading.
For the record the basic transfer loop looks something like this:
;(y = %00000100)
;16 cycles, raises ATN
and #%01100000 ;0ba00000
cmp $00,y
sty $dd00
slo $dd00 ;cba010--
;16 cycles, lowers ATN
inx
ror ;dcba010-
lsr ;0dcba010
cmp #%01000000
arr #%00111000 ;d00cba00
sre $dd00 ;dfecba--
;16 cycles, raises ATN
alr #%11111100 ;0dfecba0
sta merge+1
sty $dd00
slo $dd00 ;g-------
;16 cycles, lowers ATN
and #%10000000 ;g0000000
merge: adc #%00000000 ;gdfecbah
sta sector,x
sre $dd00-$04,y ;-ba-----
I wonder if it would be possible to get the bits through in the right order without sacrificing performance.. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting tlr
Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.
No, unfortunately that won't work. When reading dd00, the bits still reflect what is on the lines. If bank 1 is selected in this way, the two least significant bits in dd00 were written as 00 and the bits in dd02 were written as 01. This makes the lines high-low, and so when you read dd00 you get 10. Now suppose you rotate right (the same applies for bank 2 if you rotate left). Even if you can control the bit that gets shifted in from the left to be a zero, this will write 01 to dd00. The lines are now high-high, and the wrong bank has been selected. |
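The failure case can be traced with a tiny model of the two low port bits. This is a sketch, not CIA documentation: it assumes that a bit configured as output drives its line to the data register value, that a bit configured as input floats high, and that reading dd00 returns the line levels, as described above. The function names are hypothetical.

```python
def port_lines(data, ddr, bits=2):
    """Line levels: outputs follow the data register, inputs are assumed high."""
    mask = (1 << bits) - 1
    return ((data & ddr) | ~ddr) & mask

def read_port(data, ddr):
    """Reading $dd00 returns the line levels, not the data register contents."""
    return port_lines(data, ddr)

ddr = 0b01   # low bits of $dd02: bit 0 is an output, bit 1 an input
data = 0b00  # low bits of $dd00 as written

before = read_port(data, ddr)    # lines high-low: bank 1 selected
shifted = before >> 1            # what a right-shifting RMW writes back
after = read_port(shifted, ddr)  # lines high-high: the wrong bank
print(format(before, "02b"), format(after, "02b"))
```

The read-modify-write turns the floating high on bit 1 into a driven high on bit 0, which is exactly the bank change described in the post.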
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Good point. Then RMW doesn't work unless the bit is forced low by the instruction, like the lsb when using SLO. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting doynax: I wonder if it would be possible to get the bits through in the right order without sacrificing performance..
This is one of the reasons i do this peculiar nibble-wise intermediate storing as mentioned by lft in his blog article. Since i have the block data in two pages of GCR nibbles in the drive RAM, i can do a table lookup (table size 32 bytes) while transferring, getting the bits nicely swapped and inverted into $1800 so that the computer receives them in the correct order and orientation. No extra table is needed on the computer side, thus yielding a minimum resident code size of $0100 bytes.
This, of course, has a few drawbacks, as in a few more cycles per byte and no easy possibility to checksum the data during transfer. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Quote:So setting $dd02 from user code is forbidden, while in your idea, setting $dd00 is. In my loader system, it's the other way around. Setting $dd00 in user code is forbidden. The loader uses $dd00 and the user uses $dd02, though it's still possible to use $dd00 in limited ways if you really have to..
@Doynax: Interesting transfer loop, do you really get out what you want there? :P. Gotta check it once again :). |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting doynax: I wonder if it would be possible to get the bits through in the right order without sacrificing performance..
Hmm, looking at it a little longer, your problem is not only getting the bits over the wire in the right order, but also through your funky receive logics? :)
I guess it should be easy for you to simply shuffle the bits around on the disk so that it arrives in computer memory in the correct order. And in that case, no problem, is there? I mean, you sacrificed other general-use requirements (VIC bank) before, so.. :) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting tlr: Good point. Then RMW doesn't work unless the bit is forced low by the instruction,
:(
Quoting HCL: Interesting transfer loop, do you really get out what you want there? :P. Gotta check it once again :).
It is loading a compressed executable so I'd be somewhat surprised if it works despite dropping bits ;)
Quoting Krill: I guess it should be easy for you to simply shuffle the bits around on the disk so that it arrives in computer memory in the correct order. And in that case, no problem, is there? I mean, you sacrificed other general-use requirements (VIC bank) before, so.. :)
Pretty much though it is a tad inconvenient. Still, if a bit of pre-processing saves me a byte or a cycle I'm willing to do it.
The saving grace is that it's easy to reverse the transformation when uploading bytes to the drive, e.g. when saving, and thankfully the EOR checksum shouldn't care. |
| |
Danzig
Registered: Jun 2002 Posts: 441 |
Quote: Ah, for once i think i understand :). LFT, what is that book you have? everyone should have it ;).
Maybe this is the right moment: I still got the book "Das große Floppy Buch 1541" from Data Becker for sale.. Anyone? :D |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
protip: a pdf of that one is at spiros website =P |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
About the dd00/dd02 issue - wouldn't it be an idea to add a feature where you can ask the loader on the drive side to ignore the register for x amount of seconds, if you have an effect that absolutely has to use the "wrong register"?
Of course this is rare, and the feature would take valuable bytes on the drive side, so maybe it wouldn't be an idea after all. :) |
| |
Danzig
Registered: Jun 2002 Posts: 441 |
Quote: protip: a pdf of that one is at spiros website =P
you've narrowed my possible profit to smth like zero. ;) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
you can still bring it to next X and throw it at whoever managed to use a broken loader =) |
| |
Danzig
Registered: Jun 2002 Posts: 441 |
Quote: you can still bring it to next X and throw it at whoever managed to use a broken loader =)
I can bring it to next X, take each sheet and roll you pipes you have to smoke all the way... |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
sounds like a plan =) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quote: About the dd00/dd02 issue - wouldn't it be an idea to add a feature where you can ask the loader on the drive side to ignore the register for x amount of seconds, if you have an effect that absolutely has to use the "wrong register"?
Of course this is rare, and the feature would take valuable bytes on the drive side, so maybe it wouldn't be an idea after all. :)
Do you mean:
.define IDLE_BUS_LOCK 0 ; C-64 only: allow for arbitrary $DD00 writes ($00-$FF) when the loader
; is idle (good for raster routines with LDA #value:STA $D018:STA $DD00, e.g.)
Has been a feature since day 1. And yes, nobody seems to have used it so far. :) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
Quote: Do you mean:
.define IDLE_BUS_LOCK 0 ; C-64 only: allow for arbitrary $DD00 writes ($00-$FF) when the loader
; is idle (good for raster routines with LDA #value:STA $D018:STA $DD00, e.g.)
Has been a feature since day 1. And yes, nobody seems to have used it so far. :)
you have invented that for me, for soiled legacy's chessboard stretcher part ;) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
..also have it in my loader, but i think i only used it in "1991" so far. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quote: you have invented that for me, for soiled legacy's chessboard stretcher part ;)
Yes, but that was a previous loader, different code base and all. But i withdraw the nobody-ever-used-it part of my statement, sorry. :) |
| |
Cruzer
Registered: Dec 2001 Posts: 1048 |
Quoting Krill: Has been a feature since day 1.
Quoting HCL: ..also have it in my loader
Silly me for thinking I could come up with something new for loaders. :) Problem solved then I guess? |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
I guess so. But there are soooo many more things on the list waiting to be implemented.. :) |
| |
Pantaloon
Registered: Aug 2003 Posts: 124 |
So when is a faster version of the krill loader going to be released :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Soon. I am currently adding Doynax's LZ packer, first tests gave really good throughput results of 8.5-10.5 kB/s for decrunch during load. |
| |
Pantaloon
Registered: Aug 2003 Posts: 124 |
oh nice :) |
| |
Pantaloon
Registered: Aug 2003 Posts: 124 |
tell me if u want testing help on various 1541:s :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Seems like C128DCR is more of an issue at the moment.. :) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
noooooo.....screw 128DCR !!!!111eleven |
| |
HCL
Registered: Feb 2003 Posts: 728 |
I also get error reports for c128dcr's on my demos.. what's with that machine actually? Should be common knowledge by now one would think :P |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
good old hammer fix to the rescue :) |
| |
raven Account closed
Registered: Jan 2002 Posts: 137 |
There's nothing wrong with the 128D, fix yer loaders!
I'm getting random crashes on many demos from X2012, all during drive activity. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
its only the C128DCR - and the timing IS broken there :) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
been ages since I last used my dcr, but it was rock stable on demos until early 2000s. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
try using jiffydos on it :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
@Groepaz: You seem to know what is wrong with the dcr, is it transfer timing or is it read-head timing? If it's true what Oswald says, then it should of course be possible to do fast loaders that work on the dcr also.. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
apparently the delay caused by the cable connection is slightly (really, less than a cycle) different than expected.... someone did a bunch of measurements (because chameleon has had a similar problem) - see http://www.forum64.de/wbb3/board65-neue-hardware/board289-diver.. ... yes its possible to make it work, by detecting and specifically handling C128D that is. (and then everything will break again with more than 1 drive on the bus ... =P) |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Those are really nice measurements confirming and detailing the problem.
I thought it was common knowledge that the transfer timing was different? Less capacitance on the bus equals faster transitions.
Transfer routines that are "unidirectional" avoid the problem. i.e send sync pattern from the drive side and then the data (like the AR turbo). Anything that relies on the round trip c64 -> drive -> c64 is sensitive to timing variations for instance by bus loading.
Using the unidirectional technique might not be feasible in demo contexts though. |
| |
soci
Registered: Sep 2003 Posts: 481 |
There's no problem with JiffyDOS on my C128DCR, but only after it had warmed up ;) The problem is on reading the last bits as I remember. |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting Groepaz: apparently the delay caused by the cable connection is slightly (really, less than a cycle) different than expected.... someone did a bunch of measurements (because chameleon has had a similar problem)
Bear with me, it's been a long time since I took German, but the gist of it is that added capacitance in the DCR slows down signalling by less than a cycle for a round-trip but still sufficient to require an extra cycle of waiting on the host before a response can safely be read?
In effect limiting the speed of a traditional IRQ loader to 19x4 cycles/byte (17x4 with RMW opcodes.)
Quoting Groepaz: ... yes its possible to make it work, by detecting and specifically handling C128D that is. (and then everything will break again with more than 1 drive on the bus ... =P)
Detecting and handling it by introducing extra delays on the host, with more than one drive on a non-GCR system exhibiting similar behaviour?
As an aside how am I supposed to handle multiple IEC devices on the bus in a two-bit IRQ loader? My best idea so far is to manually detect and handle as many of them as possible by installing code to put them into tight ATN-acknowledgement loops to avoid blocking the DATA line, but presumably that will fail miserably on many devices.
On a vaguely related note I've been thinking about what other limits the unwary drive coder with limited resources for testing might run into.
Specifically:
- How many tracks are safe to use?
- Which speedzones are safe for the various tracks?
- How fast may the head be stepped? Can it be improved with acceleration?
- What is the range of rotation speeds which should be supported?
- How hard are the GCR constraints? May clock recovery be reliable with more than two zero bits in a row?
- Which illegal opcodes, if any, may safely be relied upon?
- How much of a margin is required to account for the larger numbers of devices and longer cables on a worst-case IEC bus?
I realize that most of these are judgement calls but it would be good to know what the consensus is and what trouble you actually risk running into. |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quoting doynax: - How fast may the head be stepped? Can it be improved with acceleration?
Kernal uses 15 ms + 75 ms settle.
I've used 8 ms + 8 ms settle in DMA loader II which seems pretty stable.
Graham employs acceleration in WarpCopy64. It seems reasonable that will allow higher top speeds.
Perhaps Graham can elaborate on how it works in practice?
Quoting doynax: - What is the range of rotation speeds which should be supported?
There is some discussion here: Searching for a fast-writing floppy routine
Graham claims 280-320 rpm. TNT has only seen 295-305 rpm.
The drives I've encountered have all been very close to 300 rpm, even over time (i.e since '85 or so).
Oh, and my statement in the above thread about static intersector gaps has since been retracted; Graham was right. I've done it the right way in Format II.
Quoting doynax: - How hard are the GCR constraints? May clock recovery be reliable with more than two zero bits in a row?
The main reason for that restriction is how the 0 bit recovery was constructed.
It is done by a simple 4-bit counter which is reset on seeing a transition. It will dead count 4 steps for each bit position according to the currently set speed zone. The lowest two counter bits are used for timing and the upper two are used to generate a 1 for the first step and 0's for the following 3 steps.
Now the weird thing is that it wraps! This means that after seeing a transition (a '1') plus ~3 non-transitions ('0's), a 1 will appear even though there isn't any transition coming from the disk.
The '~' in ~3 non-transitions ('0's) is because if the 1 that _must_ follow the last 0 is a tiny bit late, a fake 1 will be generated and then immediately afterwards the real transition will come and reset the counter and generate another 1.
In addition to this there might be analog factors as well, e.g noise, clock jitter, mechanical vibration causing bit jitter.
That said, it should be possible to allow three 0's in a row as long as the data rate (+ jitter) coming from the disk is strictly faster than the bit clock in the drive in all situations.
This requires you to write at a higher bit rate than you read.
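The wrap-around can be illustrated with a deliberately simplified model. It is a sketch of the description above, not of the real circuit: it works at bit-cell granularity, collapses the two low timing bits of the counter, and ignores the analog effects. `decode_cells` is a made-up name.

```python
def decode_cells(flux):
    """Decode a list of bit cells (1 = flux transition in that cell).

    The upper two counter bits are modelled as a step counter 0..3 that is
    reset by a transition, emits a 1 on step 0 and 0 on steps 1..3, and
    wraps around after step 3.
    """
    out = []
    step = 0
    for transition in flux:
        if transition:
            step = 0                    # a real transition resets the counter
        out.append(1 if step == 0 else 0)
        step = (step + 1) % 4           # the wrap is what creates a fake 1
    return out

three_zeros = decode_cells([1, 0, 0, 0, 1])    # still decoded correctly
four_zeros = decode_cells([1, 0, 0, 0, 0, 1])  # fourth 0 comes out as a 1
print(three_zeros, four_zeros)
```

In this idealised model three zeros survive; the timing margin discussed above is what makes even three zeros risky on real drives.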
There is a note about this in conjunction with early V-MAX implementations that it wasn't reliable on some drives (1541-II?). http://c64preservation.com/dp.php?pg=vmax (at the top) and here http://markus.brenner.de/mnib/vmaxtech.txt |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Thanks for the info!
Quoting tlr: Graham claims 280-320 rpm. TNT has only seen 295-305 rpm.
The drives I've encountered have all been very close to 300 rpm, even over time (i.e since '85 or so).
Ouch. Speed-zone 3 written at 320 RPM and read back at 280 is only 22 1/4 cycles per byte.
Quoting tlr: That said, it should be possible to allow three 0's in a row as long as the data rate (+ jitter) coming from the disk is strictly faster than the bit clock in the drive in all situations.
This requires you to write at a higher bit rate than you read.
Interesting. So with proper authoring and a devilishly clever algorithm we might theoretically squeeze out 15 bits per 16-bit word of entropy. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
My loader uses acceleration, too, originally suggested by Graham.
.define MINSTPSP $18 ; min. r/w head stepping speed on 1541/41-C/41-II/70/71/71CR
This is a value which should be safe even on old 1541s. (The unit is 256 cycles per half-track, so the same as the original firmware would write to a timer hi-byte.)
.define MAXSTPSP $10 ; max. r/w head stepping speed on 1541/41-C/41-II/70/71/71CR
These figures have been used for years now, without any negative reports so far.
.define STEPRACC $1c ; r/w head stepping acceleration on 1541/41-C/41-II/70/71/71CR
Acceleration is above figure, added once every timer lo-byte underflow. See source for details ;) |
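For comparison with the millisecond figures quoted earlier in the thread, the two speed constants can be converted to time per half-track. This is a rough sketch that assumes the drive CPU runs at 1 MHz and uses only the stated unit of 256 cycles; it does not model how STEPRACC is applied, which is left to the loader source. `step_time_ms` is a made-up helper.

```python
DRIVE_CLOCK_HZ = 1_000_000  # assumption: 1541 CPU clocked at 1 MHz
CYCLES_PER_UNIT = 256       # unit stated above: timer hi-byte steps

def step_time_ms(timer_hi):
    """Delay per half-track step, in milliseconds, for a timer hi-byte value."""
    return timer_hi * CYCLES_PER_UNIT * 1000 / DRIVE_CLOCK_HZ

slowest = step_time_ms(0x18)  # MINSTPSP: lowest stepping speed
fastest = step_time_ms(0x10)  # MAXSTPSP: highest stepping speed
print(slowest, fastest)
```

Under these assumptions the head starts at roughly 6.1 ms per half-track and speeds up to roughly 4.1 ms.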
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting doynax: - Which illegal opcodes, if any, may safely be relied upon?
Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)
Quoting doynax: As an aside how am I supposed to handle multiple IEC devices on the bus in a two-bit IRQ loader? My best idea so far is to manually detect and handle as many of them as possible by installing code to put them into tight ATN-acknowledgement loops to avoid blocking the DATA line, but presumably that will fail miserably on many devices.
This is what i will add to my loader, too. First tests worked just fine with loading from any drive in a 4-units daisy chain using standard CBM cables. I do have one or the other cycle more in the transfer loop compared to your ultra-tight fixed VIC bank approach, though. :)
The actual problem though is that the other drives have to be somewhat protocol-savvy to correctly decide when to reset, as i do use a watchdog approach so you can safely reset the host computer at any time without locking up your drives, or you can cart-freeze and implicitly uninstall by using standard KERNAL drive access. (Not all C-64s issue a reset signal via serial.) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting Krill: My loader uses acceleration, too, originally suggested by Graham.
.
.
.
Acceleration is above figure, added once every timer lo-byte underflow. See source for details ;)
Thanks for the timing data!
I realize that it isn't much of an issue in demos but in my current project I'm forced to do an awful lot of random-access.
Quoting Krill: Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)
Damn :(
As you've probably guessed my transfer loop relies on SAX to mask out the ATN acknowledgment.
Oh, well.. I suppose that extra "DCR" accommodation cycle should free up some space.
Quoting Krill: This is what i will add to my loader, too. First tests worked just fine with loading from any drive in a 4-units daisy chain using standard CBM cables.
Good to know I'm at least on the right track.
Have you encountered any IEC devices where this scheme won't work? Personally I have no experience with printers or modems or anything besides floppy drives really.
Quoting Krill: The actual problem though is that the other drives have to be somewhat protocol-savvy to correctly decide when to reset, as i do use a watchdog approach so you can safely reset the host computer at any time without locking up your drives, or you can cart-freeze and implicitly uninstall by using standard KERNAL drive access. (Not all C-64s issue a reset signal via serial.)
Thanks for the heads up! The reset signal not being reliable is precisely the kind of thing to catch the inexperienced drive coder by surprise.
Unfortunately I rather trash all of RAM, including using the full stack and zero-page as buffers, so routing a hardware timer interrupt through the ROM vectors may take a bit of juggling. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting doynax: I realize that it isn't much of an issue in demos but in my current project I'm forced to do an awful lot of random-access.
Yes, and this is precisely where acceleration comes in handy.
Quoting doynax: As you've probably guessed my transfer loop relies on SAX to mask out the ATN acknowledgment.
If my suspicion about different timing on different bit positions doesn't hold nor matter, it's alright to demand genuine C= gear. :)
Quoting doynax: Have you encountered any IEC devices where this scheme won't work? Personally I have no experience with printers or modems or anything besides floppy drives really.
Me neither, but i think it's okay to ignore those. I don't see why anybody should use printers and modems and whatnot on the serial bus of a C-64 these days. Those who do probably care more for GEOS rather than games or even demos.
Quoting doynax: Thanks for the heads up! The reset signal not being reliable is precisely the kind of thing to catch the inexperienced drive coder by surprise.
Oh, it works or doesn't quite reliably. Just that some ASSY #s do and some don't pull serial reset out low upon host reset.
Quoting doynax: Unfortunately I rather trash all of RAM, including using the full stack and zero-page as buffers, so routing a hardware timer interrupt through the ROM vectors may take a bit of juggling.
Same in my loader, there's basically not a single unused byte. But installing your own interrupt handler for watchdog purposes is not that difficult, you basically just run an "execute code in block" job and make sure that all other job codes are ineffective. Again, see source for details. :) |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
that watchdog feature can kill your trackmo if you starve the loader's need for cpu for a few frames. ie, timer interrupt will trigger after a few frames thinking the machine has been reset, and will reset the drive. |
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quoting Krill: Quoting doynax: - Which illegal opcodes, if any, may safely be relied upon?
Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)
Do you have anything to back that timing suspicion up?
I can't see how the bit timing could be different all the way out to the port pins. Surely there must be at least one pipeline step through the VIA, so even if there is a difference in timing on the bus out from the 6502 it will be reclocked. |
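For readers unfamiliar with the opcode under discussion: SAX stores the bitwise AND of the A and X registers to memory without changing either register. The sketch below only models that store in Python. The bit position used for the ATN acknowledge output (bit 4 of the drive's VIA port B at $1800) is the standard 1541 layout, but the mask and register values are illustrative assumptions, not taken from doynax's or Krill's actual transfer loops.

```python
# Minimal model of the undocumented 6502 SAX store: memory <- A AND X.
# The values below are hypothetical, chosen only to illustrate masking.

ATNA_BIT = 4  # ATN acknowledge output, 1541 VIA1 port B ($1800)


def sax(a: int, x: int) -> int:
    """Return the byte SAX would write; A and X themselves stay unchanged."""
    return a & x & 0xFF


# Assumed setup: X holds a mask with the ATN acknowledge bit cleared,
# A holds a byte whose bit 4 happens to be set.
mask = 0xFF ^ (1 << ATNA_BIT)   # $EF
a = 0b0001_1010                 # $1A
stored = sax(a, mask)

print(f"A=${a:02X} X=${mask:02X} -> stored ${stored:02X}")
```

The appeal is that one opcode performs the mask and the store together, which is also why a CPU variant on which SAX misbehaves breaks such a loop outright.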
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
also MiST from visual6502 actually did all the tests in a 1541 - with no sign of special behaviour. |
| |
WVL
Registered: Mar 2002 Posts: 903 |
Quote: that watchdog feature can kill your trackmo if you starve the loader's need for cpu for a few frames. ie, timer interrupt will trigger after a few frames thinking the machine has been reset, and will reset the drive.
Had that problem as well, was really happy I could pinpoint it with some help from Krill.. We could've known you could 'starve' a loader? :) |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Yes, well, that was my fault mainly. I knew that the watchdog timeout is max. 65536 cycles, without any trick known to me to extend it without serious overhead or other repercussions, and at some point i noticed that this would be the problem. |
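To put the 65536-cycle limit in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the nominal 1 MHz drive clock of a 1541 and a PAL C64 frame of 312 lines of 63 cycles at 985248 Hz; NTSC figures would differ slightly.

```python
# Rough arithmetic only: how long is a 65536-cycle drive-side timeout,
# measured in PAL C64 video frames? Clock figures are nominal values.

DRIVE_CLOCK_HZ = 1_000_000        # 1541 CPU clock (nominal)
WATCHDOG_MAX_CYCLES = 65536       # maximum timeout quoted in the post

C64_PAL_CLOCK_HZ = 985_248        # PAL C64 CPU clock
PAL_CYCLES_PER_FRAME = 312 * 63   # raster lines * cycles per line

timeout_s = WATCHDOG_MAX_CYCLES / DRIVE_CLOCK_HZ
frame_s = PAL_CYCLES_PER_FRAME / C64_PAL_CLOCK_HZ
timeout_frames = timeout_s / frame_s

print(f"timeout: {timeout_s * 1000:.1f} ms, about {timeout_frames:.2f} PAL frames")
```

Under these assumptions the timeout comes to a little over three frames, which matches Oswald's remark that starving the loader "for a few frames" is enough to trip the watchdog.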
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting tlr: Do you have anything to back that timing suspicion up? I can't see how the bit timing could be different all the way out to the port pins. Surely there must be at least one pipeline step through the VIA, so even if there is a difference in timing on the bus out from the 6502 it will be reclocked.
No, hence my doubting my doubts. Probably there is no such problem if the SAX opcode itself works. That it doesn't with some drives may be the main problem to consider here. |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quote: also MiST from visual6502 actually did all the tests in a 1541 - with no sign of special behaviour.
Original MOS 6502, yes. I have no idea about the clones floating around, if they use the original circuitry and whatnot. After all, SAX does not work on that Synertek variant. But i haven't checked if this is somehow connected with it not being NMOS or anything, IF it isn't.. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
i'd just ignore that drive then, really :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: i'd just ignore that drive then, really :)
Ignore HCL's drive?!? Now that's bold... ;) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
OK, then i'll ignore all c128dcr drives.. ..and this means WAAAAR!!!
;) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11391 |
every good irq handler has an inc $d030 in it =) |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
I've been testing the 16-cycle RMW ATN acknowledgment scheme discussed above and have run into a bit of trouble. It appears to work fine on my working (1571) drive, 1541U, VICE and Hoxs64. However the reaction is occasionally a cycle late in CCS64.
Anyway, I've isolated the issue into a little timing test comparing ASL $DD00 to ASL+STA $DD00.
I'd much appreciate it if anyone else would run these on hardware to compare the RMW/WR cases, confirm whether this is a known bug, or spot the error in my thinking.
https://sites.google.com/site/doynax/iec_repro.zip
|
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quote: I've been testing the 16-cycle RMW ATN acknowledgment scheme discussed above and have run into a bit of trouble. It appears to work fine on my working (1571) drive, 1541U, VICE and Hoxs64. However the reaction is occasionally a cycle late in CCS64.
Anyway, I've isolated the issue into a little timing test comparing ASL $DD00 to ASL+STA $DD00.
I'd much appreciate it if anyone else would run these on hardware to compare the RMW/WR cases, confirm whether this is a known bug, or spot the error in my thinking.
https://sites.google.com/site/doynax/iec_repro.zip
didn't examine it in detail but the asl $dd00 will shift CLKin into DATAout (=DATA out from the c64 will be the inverse of the state of the CLK line).
Maybe that is what bites you? |
| |
doynax Account closed
Registered: Oct 2004 Posts: 212 |
Quoting tlr: didn't examine it in detail but the asl $dd00 will shift CLKin into DATAout (=DATA out from the c64 will be the inverse of the state of the CLK line). Maybe that is what bites you?
Good idea. That's thinking outside of the box.
Unfortunately I think you've got the shift direction mixed up. The CLK input is bit 6 with DATA out in bit 5 just below it, so an ASL should not pick up the two significant input bits. |
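The shift-direction argument can be checked mechanically. The Python sketch below models only the arithmetic of ASL on a byte laid out like $DD00 (bit 3 ATN out, bit 4 CLK out, bit 5 DATA out, bit 6 CLK in, bit 7 DATA in). It ignores the data direction register and the dummy write of a read-modify-write instruction, so it says nothing about the cycle-level timing issue being tested.

```python
# Bit layout of CIA2 port A ($DD00) as used for the serial bus.
ATN_OUT, CLK_OUT, DATA_OUT, CLK_IN, DATA_IN = 3, 4, 5, 6, 7


def bit(value: int, n: int) -> int:
    """Extract bit n of a byte."""
    return (value >> n) & 1


def asl(value: int) -> tuple[int, int]:
    """6502 ASL on a byte: returns (result, carry). Bit 7 falls into carry."""
    return (value << 1) & 0xFF, bit(value, 7)


# After ASL, every bit holds what the bit BELOW it held before,
# so the new DATA out (bit 5) is the old CLK out (bit 4).
data_out_follows_clk_out = all(
    bit(asl(v)[0], DATA_OUT) == bit(v, CLK_OUT) for v in range(256)
)

# Flipping either input bit (6 or 7) never changes the resulting output bits.
OUTPUT_MASK = (1 << ATN_OUT) | (1 << CLK_OUT) | (1 << DATA_OUT)
inputs_do_not_leak = all(
    asl(v)[0] & OUTPUT_MASK == asl(v ^ flip)[0] & OUTPUT_MASK
    for v in range(256)
    for flip in (1 << CLK_IN, 1 << DATA_IN)
)

print(data_out_follows_clk_out, inputs_do_not_leak)
```

A right shift (LSR) would be the case tlr had in mind: there the old CLK in (bit 6) would land in DATA out (bit 5).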
| |
tlr
Registered: Sep 2003 Posts: 1791 |
Quoting doynax: Unfortunately I think you've got the shift direction mixed up. The CLK input is bit 6 with DATA out in bit 5 just below it, so an ASL should not pick up the two significant input bits.
Doh! :) |