[CSDb] - User Forums - GCR decoding on the fly

2013-03-31 12:46

lft

Registered: Jul 2007
Posts: 369

GCR decoding on the fly

Here's how to do it:

http://linusakesson.net/programming/gcr-decoding/index.php

2013-03-31 13:24

PAL

Registered: Mar 2009
Posts: 292

the only thing i do understand here is that you are insanely great at making solutions to problems that nobody else has been able to solve... when krill says it is awesome i guess your loader is pretty awesome... hats off for that!

2013-03-31 13:52

Skate

Registered: Jul 2003
Posts: 495

omfg! amazing thinking on how to get rid of bit shifting using tables. probably the people who tried this before missed using the "no sequence of three zeros" rule as the key point for creating the tables. biiiig thumbs up!
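
For readers who haven't seen the encoding itself, here is a minimal Python sketch of the standard Commodore 4-to-5 GCR code table, with a check of the "no three zeros in a row" property. It only illustrates the encoding; it is not lft's decoder, and the function names are made up for this example.

```python
# Standard Commodore 4-bit -> 5-bit GCR code table (index = nybble value).
GCR_CODES = [
    0b01010, 0b01011, 0b10010, 0b10011,
    0b01110, 0b01111, 0b10110, 0b10111,
    0b01001, 0b11001, 0b11010, 0b11011,
    0b01101, 0b11101, 0b11110, 0b10101,
]
GCR_DECODE = {code: nybble for nybble, code in enumerate(GCR_CODES)}


def gcr_encode(data: bytes) -> str:
    """Encode bytes as a string of '0'/'1' characters, 10 bits per byte."""
    bits = []
    for byte in data:
        for nybble in (byte >> 4, byte & 0x0F):
            bits.append(format(GCR_CODES[nybble], "05b"))
    return "".join(bits)


def gcr_decode(bits: str) -> bytes:
    """Decode a bit string produced by gcr_encode back into bytes."""
    nybbles = [GCR_DECODE[int(bits[i:i + 5], 2)] for i in range(0, len(bits), 5)]
    return bytes((hi << 4) | lo for hi, lo in zip(nybbles[0::2], nybbles[1::2]))


if __name__ == "__main__":
    sample = bytes(range(256))
    encoded = gcr_encode(sample)
    print("round trip ok:", gcr_decode(encoded) == sample)
    print("contains '000':", "000" in encoded)
```

Since every nybble becomes 5 bits, 4 data bytes turn into 5 bytes on disk and the 5-bit groups straddle byte boundaries, which is where the bit shifting in a naive decoder comes from.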

2013-03-31 14:01

Killsquad
Account closed

Registered: Jun 2005
Posts: 17

foooOOOFFF.. the sound of this as it went straight over my head. Brilliant work, again, LFT. Even though I didn't understand all of this :D.

2013-03-31 15:21

Zyron

Registered: Jan 2002
Posts: 2381

As above. ;)

2013-03-31 16:07

tlr

Registered: Sep 2003
Posts: 1791

Very clever implementation! Excellent use of undocumented opcodes to mask a single read value in more than one way.
I love how you are also a good technical writer, able to explain things concisely while still providing the context of the problem.

Keep inventing!

2013-03-31 16:22

algorithm

Registered: May 2002
Posts: 705

Very innovative! Keep up all this work. Nothing is impossible :-)

2013-03-31 17:11

Ejner

Registered: Oct 2012
Posts: 43

That is just really mindblowing! I'll have to read that again a few times! :-) It's really interesting, but I kind of lost track and concentration when I came to the part with illegal opcodes and "zeroes are stronger than ones" and stuff... It seems that this is what no one else has been able to do or figure out for the past 30+ years. -You finally did it! :-) A milestone in c64 history... Thanks for sharing! :-)

2013-03-31 17:35

Clarence

Registered: Mar 2004
Posts: 121

Great achievement Lft, can you give an estimate of how much faster a loader can be using this technique?

2013-03-31 17:57

MagerValp

Registered: Dec 2001
Posts: 1078

I had to read it twice, since it was too brilliant for my puny mind to grasp on the first try.

Quoting Clarence

Great achievement Lft, can you give an estimate of how much faster a loader can be using this technique?

Instead of read -> decode -> transfer, it's now read & decode -> transfer. If your loader currently is using, say, interleave 5, it can now use interleave 4 (unless I'm missing something).
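
To make the interleave numbers concrete, here is a small Python sketch of how consecutive logical blocks could be laid out on a track for a given interleave. It is a simplified model (the collision rule and the function name are assumptions for illustration), not taken from any particular loader or disk tool.

```python
def interleave_order(sectors_per_track: int, interleave: int) -> list[int]:
    """Return the physical sector used for each consecutive logical block.

    Assumed rule: step forward by `interleave` sectors each time; if that
    sector is already taken, move on to the next free one.
    """
    order = []
    used = set()
    pos = 0
    for _ in range(sectors_per_track):
        while pos in used:
            pos = (pos + 1) % sectors_per_track
        order.append(pos)
        used.add(pos)
        pos = (pos + interleave) % sectors_per_track
    return order


if __name__ == "__main__":
    for k in (5, 4):
        print(k, interleave_order(21, k))
```

With interleave k the head passes k sector positions per block fetched, so a whole track takes roughly k revolutions. Going from 5 to 4 would therefore save about one revolution per track.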

2013-03-31 18:56

Oswald

Registered: Apr 2002
Posts: 5095

okay no words for this. congratulations you are officially a genius. chuck norris of c64 :)

2013-03-31 19:57

Bob

Registered: Nov 2002
Posts: 71

Excellent job, truly. I read your doc, and I think I understood it :) , though I had a harder time understanding why data is packed like that on the drive ;)

Now the question is whether this has sped up the reading to a new level, or freed up the drive for other purposes?

this tech is very interesting for our future demos, where one of the main problems in a trackmo is the loader/cruncher relation: nowadays the uncrunching takes longer than loading a part.. but either way this shortens the whole loading procedure, which is very good... especially where we use all memory :)

LFT, once again, hats off for your remarkably innovative approaches, and for questioning and challenging the unquestionable! ... Where people have accepted that this is how it is... you just say WHY?... I love that.. keep up the spirit.. if you need any help or support, or want to be in Censor, just let me know ;)

2013-03-31 22:14

Cruzer

Registered: Dec 2001
Posts: 1048

Yet again, amazing solution and description of a problem - this time one I had been blissfully ignorant about until now. Hopefully this will become a standard in trackmo loaders and lead to more action packed demos.

2013-03-31 22:22

Krill

Registered: Apr 2002
Posts: 2982

Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :)

2013-03-31 23:10

Kabuto
Account closed

Registered: Sep 2004
Posts: 58

Truly amazing what you did, and also how well you've documented it.

2013-03-31 23:35

raven
Account closed

Registered: Jan 2002
Posts: 137

This is awesome, it was such a simple approach in the end.
Congrats! :)

@MagerValp:

Interleave 4 can be quite easily done without full on-the-fly decoding, I wonder if it's possible to reach interleave 2 with this...

2013-04-01 06:03

Krill

Registered: Apr 2002
Posts: 2982

You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes (Vorpal V1 and derivatives like Heureka-Sprint), and there only with non-interruptable transfer. Last time i checked, that is.

2013-04-01 08:06

Fungus

Registered: Sep 2002
Posts: 691

lft : this is pretty brilliant, I have to say.

Do you think it may be possible to decode without recombining the nybbles into whole bytes?

As you know, there is extra overhead when converting the bytes back into nybbles again for transfer over serial, which in normal loaders really wastes a lot of time. So it would be very interesting if this extra step can also be eliminated like it is in Krill's, mine, or the Action Replay loaders.

In my own loaders I am also calculating the checksum after decoding the nybbles, with an eor/sta pair for each nybble, and finally checking those against the nybbles read from the checksum. I did that of course to eliminate another loop of overhead to check the contents of the decoded bytes against the checksum.

I think it would be INSANELY fast if the methods could be combined.

2013-04-01 08:36

Axis/Oxyron
Account closed

Registered: Apr 2007
Posts: 91

Gosh, this is really fucking brilliant. And thanks for the article. This was a very interesting read. I never knew about that GCR encoding problem.

So there are still unanswered questions.
How big is the practical gain? What is the cost of GCR decoding e.g. in Krill's loader?
Are there any disadvantages like e.g. compatibility issues?
@Krill: When will we get our hands on a version of your loader using this technique? ;o)

2013-04-01 09:22

Cruzer

Registered: Dec 2001
Posts: 1048

Quoting Krill

Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :)

Yes please! :D

2013-04-01 10:09

MagerValp

Registered: Dec 2001
Posts: 1078

Quoting Krill

You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes

So don't use the whole sector, only decode as much as you can transfer while the next sector flies by. You'd lose 20% but you can transfer a full track in two revolutions...

2013-04-01 11:03

Fungus

Registered: Sep 2002
Posts: 691

If you are ok with 20% data loss, might as well use a 3/4 encoding scheme like rapidlok which is much easier to decode, requiring only and'ing and or'ing.

2013-04-01 11:04

MagerValp

Registered: Dec 2001
Posts: 1078

Don't want to lose D64 compatibility though, but yeah, full sector and interleave 2 is probably more realistic.

2013-04-01 13:17

raven
Account closed

Registered: Jan 2002
Posts: 137

Fungus: combining the nybbles is what allows using the stack as decode buffer.
If you don't combine, you'll need to store each nybble separately, which means additional cycles in the read/decode loop.

I do agree about the overhead, but i think this seems like a necessary compromise in this case.

2013-04-01 14:41

Stone

Registered: Oct 2006
Posts: 172

Brilliant again, with easy to understand explanations. You are the Richard Feynman of C64 coding :)

2013-04-01 15:17

Ymgve

Registered: May 2002
Posts: 84

Eh, it's no Warp25.

(Just kidding, awesome work!)

2013-04-01 16:45

Mixer

Registered: Apr 2008
Posts: 455

Such things are the reason why I like the c-64 scene so much :)

But, how about writing to a disk? Is it doable in equally paced manner?

2013-04-01 17:00

Krill

Registered: Apr 2002
Posts: 2982

Quote: Quoting Krill
You cannot transfer $fe bytes during the time a sector flies by, this is just barely possible with those $e0 bytes per sector schemes

So don't use the whole sector, only decode as much as you can transfer while the next sector flies by. You'd lose 20% but you can transfer a full track in two revolutions...

You can still load a full GCR-encoded track in 2 revolutions. Just that the trick is somewhere else (and not applicable to IRQ loading). :)

2013-04-01 17:32

Krill

Registered: Apr 2002
Posts: 2982

Axis: The practical gain is somewhat of a mixed bag.

Generally, if you can decode a block while it's being read AND also do something smart with the checksumming, you save about one revolution per track. For reference: On 1541, my loader (v138) has (virtual) interleave 5, and on 1571 (2 MHz) it's 4. (Plus a scan revolution, but that is another issue and will soon be optional.)

Lft's approach, however, moves the checksumming to the C-64, as the approach is still not fast enough to do that while reading as well. So there are two options now: Do a separate checksumming pass on the C-64 side (wastes valuable time for decompression), or modify the format on a high level (EOR all the bytes before saving, then EOR again during transfer), so checksumming basically comes for free with the transfer. The second option is more feasible speed-wise, but less so from a compatibility/usability standpoint, given that you save in a somewhat non-standard, albeit high-level D64-compatible, format.

Now, moving over the checksumming in a similar manner, pre-existing approaches (like in my loader) would shave off that one revolution as well.

So this is largely an academic problem in the context of IRQ loading.

Furthermore, Lft's approach does need quite a bit of memory in the very limited drive RAM, which i cannot afford due to a few architectural goals i'm not willing to sacrifice (ease of use and compatibility by loading via directory and filenames, support for non-1541 drives, disabled true drive emulation, etc. - Lft's system loads via a given set of tracks/sectors, limiting all that).
Since checksumming after transfer also solves another problem (flipped bits during transfer in heavy electrostatic environments like parties), chances are good i'll add such an option (which, as mentioned, will also have the same actual speed improvement). But then i always said there are a few more options to speed up things, yet nobody really ever needed it so far, seeming content with the speed as it is.

Bottom line: Limited practical gain, but awesome somebody finally made it after many have tried.

2013-04-01 17:43

Krill

Registered: Apr 2002
Posts: 2982

Quote: Quoting Krill
Cruzer: Was that sort of a challenge of who'd build the fastest IRQ loader now? :)
Yes please! :D

So will YOU need that speed? I have the vague feeling you're more the screen-off-non-IRQ-loading kind of guy, and for that... I do have two ideas on paper waiting to be tried out, both D64 compatible, giving 50x and 32x speed.

2013-04-02 07:57

Axis/Oxyron
Account closed

Registered: Apr 2007
Posts: 91

I think every little speedup on loading would help everyone producing c64 trackmos. We all struggle a lot and invest months of our life to optimize loading-/decrunchtimes of our parts to get the pace a little bit up and to have time left to do some nice transition work.
Perhaps it would be easier to get the gain on optimising decrunchers rather than loaders. I think there is also a big potential on that.
But perfect solution would be to optimize both to the max.
Because we are still so far away from what the amiga guys were doing back in the 90s due to slow loading and decrunching.

2013-04-02 08:43

HCL

Registered: Feb 2003
Posts: 728

One brilliant piece of code in here, but i don't see that you gain very much in the total solution. Then i have to admit that i didn't even find any transfer-code in there yet :P, which is a quite essential part of the drive coding (and the major bottleneck of course).

LFT always has a fresh approach to c64 problems, and always seems to come up with something new!! A genius by definition :). Using extensive tables for gcr-decoding was used in my own loader(s) (in Cycle, EoD etc..) but not this beautifully of course.

One funny detail is that last fall i was also working on my disk-loader. I also came down to 100% on the fly decoding, but was not satisfied because it uses SAX, which doesn't work on all my (2) drives. So i reworked it and the result can be found in TimeMachine. Don't use that loader as any reference though, it seems to still have problems on 128DCR which are not solved yet.

2013-04-02 08:57

Oswald

Registered: Apr 2002
Posts: 5095

why checksumming btw? if the data is fucked, running the demo is fucked either way, you know it explicitly or not.

approx how much time is a track revolution?

2013-04-02 09:55

raven
Account closed

Registered: Jan 2002
Posts: 137

@Oswald:

My thoughts exactly, which is why I didn't bother with checksumming in my loader.
I mean, it's a trackmo, even if there is a loading error, there's usually no time to reload, so why bother? :)

@HCL:

Will you ever fix EoD so I can finally watch it on my 128D (the only machine I've been using for years) ?
Its a loader problem, which I'm too lazy to debug :)

2013-04-02 10:19

Krill

Registered: Apr 2002
Posts: 2982

A revolution at 300 rpm takes 1/5 seconds.

Axis: Need for maximum speed noted. Give me a figure to work with for your next future release, and we can work on optimizing things towards that goal.
At some point you simply go into uneasy-trade-off territory and have to drop one or the other feature for the sake of that extra bit of performance.
As for packers, there are some choices to make, too. When going for maximum speed, the simpler/speed-optimized algorithms (e.g. ByteBoozer, but i also think Doynax and WVL made big progress when it comes to speed) are most often a better choice for best combined loading+decompression speed, compared to squeezing out every last bit at the cost of decompression speed (PuCrunch, Exomizer). Adding another decompressor to the loader is easy, and i can do that quickly following a specific request.

HCL: So you made it as well? Have to check that code. :)

Oswald: Good question. Disk errors are rare with good disks and under lab conditions, but there are spurious/transitional errors with not so good disks and again, under party conditions. Simply dropping the checksumming is an option, but IMHO not a good idea, as the drives and disks do age as well.
raven: And a demo that cannot tolerate slower loading (except for obvious sync problems) is not a good idea either. Drive coding is not like coding raster bars. Eventually you will find a drive/disk/track-skew combo that's slower than anything you tested, with the hardware still in good condition. And since you will end up with some slack anyways to have more of an error margin, why not use that slack with faster drives for retries? :)

2013-04-02 11:18

HCL

Registered: Feb 2003
Posts: 728

@Krill: Well.. No i didn't, as i didn't keep the code i was unsatisfied with. I could of course do it again, but it would be a 1541-only loader for drives that can cope with SAX. The loader from the Cycle-era uses lots of tables and is faster. I took that code and used your trick of only synchronizing 2 bytes of 5, and got it down to zero overhead in the read-loop. But what i wanted in TimeMachine (and always) was a loader using less tables, so i have space for dir-table in the drive. So, first i removed SAX to make it work on my other drive, and then i made it use only one $20-table for gcr-decoding..

@Raven: Oh, i didn't know that was an open issue ;). Well, in fact this loader from TimeMachine may be able to replace the EoD-loader. Before now i didn't have any loader that could do all that the EoD-loader can. So.. perhaps some day :).

2013-04-02 12:32

Graham
Account closed

Registered: Dec 2002
Posts: 990

Quoting raven

@Oswald:

My thoughts exactly, which is why I didn't bother with checksumming in my loader.
I mean, it's a trackmo, even if there is a loading error, there's usually no time to reload, so why bother? :)

A load error message is better than continuing with broken stuff. Also, demos are not the only use of a loader.

2013-04-02 13:46

HCL

Registered: Feb 2003
Posts: 728

Hmm.. sorry i have to correct myself. I do not gcr-decode on the fly while reading data in my loader, i just manage to split the data into 8 chunks. Then the data in those 8 chunks are used to index into those different gcr-decoding tables on the fly when transmitting the data. I think this is more or less like Krill does and many other loaders i have peeked into during the years :). The only improvement i did was that i didn't need to waste any time between the reading and the transmitting, which on the other hand is exactly what LFT did as well :)

2013-04-02 13:57

Oswald

Registered: Apr 2002
Posts: 5095

lft did it with less memory needed :P :)

2013-04-02 14:00

HCL

Registered: Feb 2003
Posts: 728

Quote:

lft did it with less memory needed :P :)

Uhm.. not sure about that. He is using 2 pages of tables, the whole stack area plus half the zeropage. I managed to squeeze in all my tables into one page, but wasting another page for half-converted data.

..plus the major drawback: He should of course have generated those tables in the drive code!! Now he is wasting 2 precious blocks of disk space for stopid but beautiful tables, that could have been generated in less than half a block probably!! I mean, we *are* discussing beauty of code here, aren't we ;)

2013-04-02 14:29

Cruzer

Registered: Dec 2001
Posts: 1048

Quoting Krill

So will YOU need that speed? I have the vague feeling you're more the screen-off-non-IRQ-loading kind of guy, and for that... I do have two ideas on paper waiting to be tried out, both D64 compatible, giving 50x and 32x speed.

Ofcuz I will need it for the EoD-killer I'm planning, but the super fast spacemo-loader definitely sounds like something I could use as well.

Btw, I'm definitely willing to sacrifice features like drive agnosticism and dir loading if it can lead to a faster IRQ-loader.

2013-04-02 14:42

Clarence

Registered: Mar 2004
Posts: 121

I would opt for the fastest possible demo friendly loader also, but maintaining compatibility with all the standard C= drives including the different 1571 versions. That illegal opcode incompatibility HCL mentions, sounds bitter already. :/

2013-04-02 15:35

raven
Account closed

Registered: Jan 2002
Posts: 137

From memory, I think there is only one (or two) drive clones that have a problem with illegal opcodes.
I used SAX and didn't lose any sleep about it ;)
Also, I remember only one complaint about it - from HCL!

I wonder how many people still use these drives.
HCL: which make/model is it exactly?

2013-04-02 15:51

enthusi

Registered: May 2004
Posts: 677

Just make it load fast from SD and 1541u !!!1111
Just kidding :)
But then again, if some non 1541-drives fail, well, they fail.

2013-04-02 16:40

algorithm

Registered: May 2002
Posts: 705

The bottleneck in most cases seems to be certain types of decompressors (for example, with Exomizer the depacking alone takes almost as long as the loading, or longer, even when depacking while loading).

Can't a quick CRC be done at the beginning of each linked part (once all data is loaded and decompressed?) Sure, this would not be a solid way of doing things - in particular if the CRC checker at the beginning has load errors - but that should be a minimal issue

@krill. interested in your 50speed non-irq loader :-)

2013-04-02 17:21

tlr

Registered: Sep 2003
Posts: 1791

Quoting algorithm

Can't a quick CRC be done at the beginning of each linked part (once all data is loaded and decompressed?) Sure, this would not be a solid way of doing things - in particular if the CRC checker at the beginning has load errors - but that should be a minimal issue

Is there any gain in doing a separate CRC step? Another drawback is that you won't detect partially corrupt T/S links.

2013-04-02 18:05

algorithm

Registered: May 2002
Posts: 705

Yes, the data check would be sub-optimal, but may have some type of gain.

2013-04-02 18:08

HCL

Registered: Feb 2003
Posts: 728

@Raven: You are funny :). You want me to fix the EoD loader to work on your c128D, but you don't want to care about my disk drive. It's a digilog drive btw, already mentioned in another recent thread.

2013-04-02 18:10

tlr

Registered: Sep 2003
Posts: 1791

@hcl: Maybe you can put your digilog drive in his C128D?
There, problem solved!

2013-04-02 19:54

raven
Account closed

Registered: Jan 2002
Posts: 137

@HCL: I believe many people have a 1571 compared to a few with the digilog ;)

Anyway, I seem to remember an email from you saying you removed the SAX's and got the demo working on your drive :)

2013-04-02 20:29

Cruzer

Registered: Dec 2001
Posts: 1048

Quoting enthusi

Just make it load fast from SD and 1541u !!!1111

Almost agree, actually. If it works on 1541, it works on 1541u, Chameleon, etc. That should be enough for everyone. It's ofcuz nice if it works on other drives as well, but no need to sacrifice speed for that.

Quoting algorithm

The bottleneck in most cases seem to be certain types of decompressors

Which is why I think that in most cases it's not even worth compressing the files. For Pimp My Snail I skipped it, and I think it worked pretty well. I just placed all code/data/gfx together in a tight lump, and "unpacked" it using custom routines that could be optimized for the specific data and speedcode. It might have helped a bit compressing some of the simple gfx though, in a lightweight way optimized for decrunching speed.

2013-04-02 20:53

Krill

Registered: Apr 2002
Posts: 2982

Yes, and soon you'll end up with an algorithm very similar to ByteBoozer and other LZ77 variants. Good speed, good pack results, often best combined loading+depack performance. :)

2013-04-02 21:18

Clarence

Registered: Mar 2004
Posts: 121

Cruzer, next time for a similar task, consider using Werner's LZWVL packer. If you need something fast, and with a better compression ratio than RLE, it is great I think: LZWVL

2013-04-02 21:36

HCL

Registered: Feb 2003
Posts: 728

..Ok, so it sounds like i should bring back that zero overhead read-loop with all illegals and release together with a new version of EoD? Then Cruzer has to promise to start using ByteBoozer, and we will all be happy :).

2013-04-02 21:40

Clarence

Registered: Mar 2004
Posts: 121

HCL, I think avoiding support for non C= brand drives is acceptable, then again I don't have any drives as such, so I might be biased. :D

2013-04-02 21:45

WVL

Registered: Mar 2002
Posts: 903

My 2 cents :

I see more profit in finding a way to read from disk and transfer to c64 at the same time. Should really have a go at that once.. :) prolly i'll quickly see why that isn't possible though..

Also have a look at Doynax's packer, my tests showed that it both compressed better + was faster in decompressing than Byteboozer (sorry David!)

2013-04-02 22:20

Cruzer

Registered: Dec 2001
Posts: 1048

Thanks for the tips on packers. LZWVL looks promising, at least from the lovely chart Werner did. If I get the time I would like to do a similar test for loading + decrunching combined, with different kinds of files, to see where which kind of compression should be used, and where it should be avoided. E.g. I doubt that it makes sense to pack code, unless it's mixed up with data or full of "align to next page" statements.

2013-04-03 05:24

Krill

Registered: Apr 2002
Posts: 2982

Quoting WVL

I see more profit in finding a way to read from disk and transfer to c64 at the same time. Should really have a go at that once.. :) prolly i'll quickly see why that isn't possible though..

It is very much possible. But it isn't suitable for IRQ-loading and very probably needs a disabled screen, too. Have a look at Mafiosino Trackloader (19x) which reads a track in two revolutions.

2013-04-03 05:29

Krill

Registered: Apr 2002
Posts: 2982

Quoting Cruzer

If I get the time I would like to do a similar test for loading + decrunching combined

Keep in mind that this involves actually adding the missing decompressors to a loader, as combined loading + decrunching involves decrunching between fetching sectors (hence it is faster than loading first and decrunching after).

2013-04-03 06:14

Krill

Registered: Apr 2002
Posts: 2982

Quoting Cruzer

E.g. I doubt that it makes sense to pack code, unless it's mixed up with data or full of "align to next page" statements.

My experience with packing code has shown that it can indeed be sensible to pack code by separating op-code stream and operand stream. For even better compression, make sure to actually add redundancy to the code (e.g., lda #$00:ldx #$00 is likely to pack better than lda #$00:tax in the end). However, it is not feasible to go to these lengths for anything bigger than 4K :)
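
A rough Python sketch of the stream-splitting idea, assuming a tiny hand-picked subset of 6502 opcodes (a real tool would need the full instruction-length table). The function names are hypothetical and this is not Krill's actual tooling.

```python
# Instruction lengths in bytes for a small subset of 6502 opcodes.
# A real tool would need the complete table; this subset is only for the demo.
INSTRUCTION_LENGTH = {
    0xA9: 2,  # LDA #imm
    0xA2: 2,  # LDX #imm
    0xAA: 1,  # TAX
    0x8D: 3,  # STA abs
    0x60: 1,  # RTS
}


def split_streams(code: bytes) -> tuple[bytes, bytes]:
    """Split machine code into an opcode stream and an operand stream."""
    opcodes = bytearray()
    operands = bytearray()
    i = 0
    while i < len(code):
        length = INSTRUCTION_LENGTH[code[i]]
        opcodes.append(code[i])
        operands += code[i + 1:i + length]
        i += length
    return bytes(opcodes), bytes(operands)


def join_streams(opcodes: bytes, operands: bytes) -> bytes:
    """Inverse of split_streams: put each opcode's operands back after it."""
    code = bytearray()
    pos = 0
    for op in opcodes:
        n = INSTRUCTION_LENGTH[op] - 1
        code.append(op)
        code += operands[pos:pos + n]
        pos += n
    return bytes(code)


if __name__ == "__main__":
    # lda #$00 : ldx #$00 : sta $d020 : rts
    program = bytes([0xA9, 0x00, 0xA2, 0x00, 0x8D, 0x20, 0xD0, 0x60])
    ops, args = split_streams(program)
    print(ops.hex(), args.hex())
    print("round trip ok:", join_streams(ops, args) == program)
```

Each stream would then be handed to the packer separately and re-joined after depacking, on the assumption that opcodes and operands each repeat more among themselves than they do when interleaved.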

2013-04-03 06:15

HCL

Registered: Feb 2003
Posts: 728

@WVL: Yes i know about Doynax's packer. It was based on ByteBoozer, at least from the beginning, and optimized from there in all (?) possible ways. I'm just honored by his work :). ..and again, transferring while reading is a no-go if you want to be interruptable, and i would not sacrifice that.

2013-04-03 13:54

Cruzer

Registered: Dec 2001
Posts: 1048

Quoting Krill

My experience with packing code has shown that it can indeed be sensible to pack code by separating op-code stream and operand stream.

Clever!

Quote:

For even better compression, make sure to actually add redundancy to the code (e.g., lda #$00:ldx #$00 is likely to pack better than lda #$00:tax in the end).

That would cause bigger code, resulting in a potentially worse effect, so I would never do a thing like that.

The priority for a trackmo should be effect quality > loading time > file size.

2013-04-03 16:43

Krill

Registered: Apr 2002
Posts: 2982

As i said, 4K.

But it really depends, there is no real loss in "lda #$00:ldx #$00" vs. "lda #$00:tax": same amount of cycles, just one byte more. The packed file is shorter with the former, the unpacked file longer. No problem. :)

2013-04-03 17:40

WVL

Registered: Mar 2002
Posts: 903

Talking about that Doynax packer, I can't find it on CSDb.. Has it been released or is it just a few people that got it from Doynax himself?

2013-04-03 19:52

Krill

Registered: Apr 2002
Posts: 2982

http://csdb.dk/forums/?roomid=11&topicid=59374#59404 -> http://doynax.googlepages.com/lz.zip .

Seems like he deleted his account, and before that never officially released the packer. Weird :\

2013-04-03 20:03

WVL

Registered: Mar 2002
Posts: 903

Whatever happened to him? :-(

2013-04-03 21:55

Burglar

Registered: Dec 2004
Posts: 1105

http://sh.scs-trc.net/hereyougo/doynax_lz.zip

2013-04-04 18:18

doynax
Account closed

Registered: Oct 2004
Posts: 212

First off, let me congratulate lft for an elegant solution and an entertaining write-up.
I have attempted to write a loader without an intermediate swizzling stage myself but could never manage it. This despite skipping the checksum, abusing the entire stack and zeropage as buffers, transmitting the resulting bits in a wholly bizarre order, and abusing every illegal opcode in the book.

Quoting Krill

Seems like he deleted his account, and before that never officially released the packer. Weird :\

Oh, I'm still alive and lurking.
I rather lost interest after solving the technical problems but if anyone cares I'll put together a package with a bit of documentation plus binaries and "release" it somewhere reputable.
For the record it was designed to be easy to integrate into streaming loaders.

Personally I'm tempted to abandon D64 compatibility and try out Kabuto's beautiful 7-bit/byte GCR coder. I sort-of get the general idea but will really have to sit down with pen-and-paper to work out the finer details and convince myself that it really can code all inputs.
Anything to put off having to write actual game logic ;)

2013-04-04 20:18

Fungus

Registered: Sep 2002
Posts: 691

I would love to see a proper release of your compressor doynax.

2013-04-05 08:18

Dano

Registered: Jul 2004
Posts: 240

+1 for that!

2013-04-05 12:10

Frantic

Registered: Mar 2003
Posts: 1648

Quote: I would love to see a proper release of your compressor doynax.

Me too!

2013-04-06 09:13

Isildur

Registered: Sep 2006
Posts: 275

Regarding Doynax, ByteBoozer, Exomizer, Level Crusher in one place:

http://csdb.dk/release/?id=117165&rss

Do you find this tool useful?
(BTW This is not any kind of adv - just asking)

2013-04-06 18:25

Burglar

Registered: Dec 2004
Posts: 1105

isildur, yea, it's quite useful, especially since you guys included benchmark disks of the various packers. including your own bongo cruncher, which seems to perform best (only slightly longer than exomizer, but beating every other one in both size and speed).

now, what would really be interesting and useful is to include other irq loaders in the benchmarks (like Krill's, lft's, ...).

2013-04-06 20:15

Isildur

Registered: Sep 2006
Posts: 275

Burglar, I'll talk to Wegi about that :)

2013-04-06 20:36

Krill

Registered: Apr 2002
Posts: 2982

GUI coding tools other than text editors aren't for me.

And to be honest, the code needs serious clean-up, if not complete rewrite, to be useable.

A generic packer/loader test tool would be a nice thing, however, i guess it's hard to accurately measure different design and usability properties in various use cases against mere performance numbers.

2013-04-06 20:56

Isildur

Registered: Sep 2006
Posts: 275

It will be hard because Wegi is like Derbyshire Ram on Polish scene (but with tough personality) ;P

2013-04-06 23:13

Fungus

Registered: Sep 2002
Posts: 691

No, I don't find it useful because my build environment is all command line tools.

2013-04-06 23:34

Isildur

Registered: Sep 2006
Posts: 275

Fungus, there is command line tool.

2013-04-07 07:06

Bitbreaker

Registered: Oct 2002
Posts: 508

I am just about doing a complete rewrite of the packer in C. So the tool will have clean code soon and be ready for commandline driven build-environments.

2013-04-07 13:12

chatGPZ

Registered: Dec 2001
Posts: 11391

Quote:

It will be hard because Wegi is like Derbyshire Ram

no. really.

2013-04-08 07:35

ChristopherJam

Registered: Aug 2004
Posts: 1409

I finally got around to reading lft's post. Amazing work, especially realising you can get down to just two tables. And yes, squeezing every last cycle out of a loop can be quite the brainworm :)

Well done!

2013-04-08 19:52

lft

Registered: Jul 2007
Posts: 369

I got some benchmark figures for my loader. The unit is the number of
revolutions needed to load a track, a.k.a. optimal interleave (although with
out-of-order loading you don't need to think about interleave). The test
conditions are: No sprites, no interrupts, 25 badlines. This models the most
optimal setup which is useful in practice, either for silent loading while
displaying the BASIC screen (or something else), or for loading with a blanked
screen and a normal sid playroutine being called every frame.

The first row represents the version of the loader used in Shards of Fancy, but
without any decrunching. This version verifies the checksum in a separate pass
after reading a sector, to detect read errors. Then the checksum is verified
again on the C64 side to detect transmission errors.

As I mentioned to Krill at Revision, I had an idea to combine these into a
single checksum verification performed on the C64 side, and then re-read
(possibly another sector) and re-transmit on error. This was implemented, and
corresponds to the second row in the table.

Finally, I optimised the transfer routine and got it down to 74 C64 cycles per
byte. This is a regular atn handshake protocol, with the checksum computed
during transfer. The correct checksum is transmitted as an extra byte at the
end. The performance of this version is shown in the third row.

For easy comparison, I also computed a rough loading speed for this last
version by dividing the number of bytes loaded by the time needed for the given
number of revolutions. This figure should not be confused with actual loading
speed, for which you'd also need to take into account such things as overhead
from the high-level format (necessary to compensate for out-of-order loading),
track stepping, motor spin-up time and skipping sectors that don't belong to
the file. But it provides a rough estimate, and a maximum.

I have verified that the latest version works on real hardware, but the
measurements were obtained using Vice.

                       
                        track:   1-17   18-24   25-30   31-35
                                -----------------------------
v1 (shards)                         4       4       4       3
v2 (combined checksum)              4       4       3       3
v3 (74-cycle transfer)              4       3       3       3
v3 raw loading speed (B/s)       6720    8107    7680    7253
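
The last row can be reproduced from the revolution counts in the v3 row. This is only a sketch: the 300 rpm nominal speed, the 256-byte sector size and the sectors-per-track layout (21/19/18/17) are the usual 1541 figures and are assumptions here, not something stated in the post.

```python
# Sketch: derive the "raw loading speed" row from revolutions per track.
# Assumed (not stated in the post): 300 rpm, 256-byte sectors, and the
# standard 1541 sectors-per-track layout.

RPM = 300
BYTES_PER_SECTOR = 256

# track range -> (sectors per track, revolutions needed by v3)
ZONES = {
    "1-17": (21, 4),
    "18-24": (19, 3),
    "25-30": (18, 3),
    "31-35": (17, 3),
}

def raw_speed(sectors, revolutions):
    """Bytes per second when a whole track is read in `revolutions` turns."""
    return sectors * BYTES_PER_SECTOR * RPM / (revolutions * 60)

speeds = {zone: round(raw_speed(s, r)) for zone, (s, r) in ZONES.items()}

for zone, speed in speeds.items():
    print(f"tracks {zone}: {speed} B/s")
```

Under those assumptions the rounded values match the table row.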

2013-04-09 06:38

HCL

Registered: Feb 2003
Posts: 728

Interesting results! I made some tweaks to my own loader also last week, re-introduced SAX to get zero overhead for the reading loop. This however gives only a speed increase of less than 10% from before (Cycle loader, EoD etc..).

Fair enough, then i went on to the transfer loop, which is (just as LFT's loop) 74 cycles. Here there should be room for some optimizations, but when i cut cycles, the transfer screws up! There is probably some theoretical explanation to this, less than 18 cycles between each read of $dd00 makes it not work. That is when using ATN handshake of course, else it's possible to reduce it a lot..

Anyone got any ideas why 18 cycles seem to be the limit? If it's confirmed then i'm pretty much done.. 2 cycles left to optimize, and that's all, not even sure i'm going for those 2 in that case :).

2013-04-09 06:46

Fungus

Registered: Sep 2002
Posts: 691

Be sure to check it on some 1541-II and 1571 drives, later VIA revisions sometimes need an extra cycle for handshake.

2013-04-09 06:50

HCL

Registered: Feb 2003
Posts: 728

I think 18 cycles is used by most loaders, at least on some of the 2-bit-pairs. I have 18+18+20+18, most others have more, but if it doesn't work with 18, then 90% of all modern demos would not work on 1571 or 1541-II. I have a 1541-II myself and it works there of course :).

Now this is on the computer side, i should say. On the drive you should of course go below 18, at least here and there, to be safe if the drive is running a fragment faster than the computer.

2013-04-09 10:45

Krill

Registered: Apr 2002
Posts: 2982

Interesting results! I get close to them with my planned speed-ups, but don't quite reach them yet. Must hurry now to optimize a bit more and push the next release out the door i guess :)

I have added a new experimental protocol reaching 70 cycles per byte (including loop and store overhead) a while ago, this has a few strange-seeming limitations though (like 0 or 5-8 sprites are okay, but not 1-4). No sprite limitations gets it to a whopping 82 cycles. There might be room for improvement in both versions, but this is yet to be explored.

As for the 18-cycle limit with plain 2bit+ATN, which i confirm: My explanation is that waiting for ATN flip in a loop is 6 cycles minimum, then 7 cycles for a miss which does happen, then 4 cycles to set next bitpair, then another cycle due to slightly different clocks, wire delay, missed sampling windows and whatnot. Makes 6+7+4+1=18 cycles.

HCL: I might be wrong, but the drive being slightly faster actually gives you more than 18 cycles here and there on the drive side, according to my understanding.

2013-04-09 11:43

HCL

Registered: Feb 2003
Posts: 728

Quote:

HCL: I might be wrong, but the drive being slightly faster actually gives you more than 18 cycles here and there on the drive side, according to my understanding.

Oh, yes of course :). So, in case the drive is a fragment slower, you need to go below 18 cycles here and there. The drive is after all waiting for the computer when needed. I tend to believe my transfer loop is actually working since it has been around for ~10 years by now in numerous demos. Don't know if i have done loading while displaying a sprite multiplexer though, with loading *on* the sprites :). Perhaps the AFLI-zoomer in EoD?..

2013-04-09 11:54

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting Krill

As for the 18-cycle limit with plain 2bit+ATN, which i confirm: My explanation is that waiting for ATN flip in a loop is 6 cycles minimum, then 7 cycles for a miss which does happen, then 4 cycles to set next bitpair, then another cycle due to slightly different clocks, wire delay, missed sampling windows and whatnot. Makes 6+7+4+1=18 cycles.

I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)

The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.)

2013-04-09 12:18

Krill

Registered: Apr 2002
Posts: 2982

Hmm, how does that speed up the drive side, which is the bottleneck here, as it has to wait for the C-64 and respond to ATN flips asap?

2013-04-09 12:28

HCL

Registered: Feb 2003
Posts: 728

i would say the computer side is the bottle neck, at least i have NOPs in my transfer loop on the computer side.

@Doynax: Hehe.. cool. And you were actually able to do something useful with that data you got from those instructions also.. Impressive!

2013-04-09 12:31

Krill

Registered: Apr 2002
Posts: 2982

Quoting HCL

So, in case the drive is a fragment slower, you need to go below 18 cycles here and there. The drive is after all waiting for the computer when needed.

But the drive is never slower than the C-64. If the drive code is in theory less than 18 cycles between bitpairs when not branching in the loop, then the protocol is not violated in practice, as the branch will be taken, so of course your code should be just fine. :)

2013-04-09 12:35

Krill

Registered: Apr 2002
Posts: 2982

Quoting HCL

i would say the computer side is the bottle neck, at least i have NOPs in my transfer loop on the computer side.

If you have NOPs on the computer side to make up for drive slowness, how does that make the computer side the bottle neck? :)

2013-04-09 12:54

lft

Registered: Jul 2007
Posts: 369

Quoting doynax

I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)

The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.)

Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank?

2013-04-09 12:58

HCL

Registered: Feb 2003
Posts: 728

@Krill: The drive loop is faster (of course, else it would not work), it's the computer side that can not suck out the data faster because of that timing issue you just explained.. Even if i reduce the drive loop to 12 or 14 cycles, the computer side still has to be 18 cycles -> the bottle neck.

2013-04-09 13:00

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting Krill

Hmm, how does that speed up the drive side, which is the bottleneck here, as it has to wait for the C-64 and respond to ATN flips asap?

Excellent question.

It seems to work in practice and has done so for a while even under IRQ/DMA heavy conditions, though that doesn't necessarily mean much given how few drives I've tested. At 15 cycles it starts to crap out once in a blue moon.

I suppose that I don't quite buy your 6 + 7 cycle sum for the ATN cost. Presumably if the first ATN is late then you've only got the branch of the first loop left to execute, plus the seven of the second, equals 9 in total.

Still, it's likely I'm just confused. Anyone care to write up a little simulator to generate some sequence diagrams?

2013-04-09 13:08

Krill

Registered: Apr 2002
Posts: 2982

doynax: Yes, very likely my quickly-thought-out explanation is wrong.

HCL: Well my point was the relevant bottle-neck drive loop is the wait for ATN flip (the branch back to bit $1800), not the no-branch time between two bitpair updates. But maybe we're just saying the same thing in different words.

lft: I also had the VIC bank restriction thought about using RMW opcodes on $dd00, didn't find a solution either.

2013-04-09 13:22

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting lft

Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank?

To be honest I never even tried. I'm working on a game with somewhat limited VIC tricks so I've gotten away with using bank 3 almost exclusively.

2013-04-09 14:00

lft

Registered: Jul 2007
Posts: 369

This is how I understand the timing constraints. On the drive side, here's how you transmit two bit pairs:

; prepare value in A
bit $1800
bmi *-3
sta $1800
; prepare value in A
bit $1800
bpl *-3
sta $1800

It is clear that a bit pair cannot be guaranteed to be on the serial bus earlier than 13 cycles after ATN changes, because if ATN changes just after it was sampled during the last cycle of a bit instruction, we need 3 (bpl) + 4 (bit) + 2 (bpl) + 4 (sta) = 13 cycles to put the new value into the VIA.

For this reason, we can use up to 7 cycles to prepare each bit pair. The C64 will not toggle ATN earlier than 4 cycles after reading out the last bit pair. Following this cycle, 3 (remaining preparation) + 4 (bit) + 2 (bpl) + 4 (sta) = 13 cycles.

On the C64 side, after reading a bit pair, we spend 4 cycles writing a new value to ATN. Then we read the new bit pair after 14 cycles. Hence, 18 in total. Why can't we read already after 13 cycles? This is because the clocks of the C64 and the 1541 are almost always out of phase. After updating ATN on a C64 clock tick, it will take on average half a cycle before the next 1541 clock tick. When sending the bits back, there is again a delay before the next C64 clock tick, and the total delay will be one C64 cycle (unless we're really lucky and the 1541 cycles, being a tad shorter, fit perfectly in between the C64 cycles).

C64   1-------2-------3-------4-------
1541  ----1------2------3------4------
(not to scale)
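
The cycle budget above can be tallied as a quick check. The numbers are the ones given in this post and in HCL's earlier "18+18+20+18"; the variable names are purely illustrative.

```python
# Sketch: add up the cycle counts from the explanation above.
# All names are illustrative; the figures come from the thread.

BPL_TAKEN = 3      # branch back: ATN had not flipped at the last sample
BIT_ABS = 4        # bit $1800
BPL_NOT_TAKEN = 2  # fall through once the flip is seen
STA_ABS = 4        # sta $1800

# Worst case on the drive: ATN flips just after it was sampled.
drive_latency = BPL_TAKEN + BIT_ABS + BPL_NOT_TAKEN + STA_ABS

C64_ATN_WRITE = 4  # writing the new ATN value after reading a bit pair
PHASE_PENALTY = 1  # the two half-cycle phase delays, taken as one C64 cycle

c64_per_bitpair = C64_ATN_WRITE + drive_latency + PHASE_PENALTY
per_byte_floor = 4 * c64_per_bitpair  # four bit pairs, no loop overhead

hcl_loop = [18, 18, 20, 18]           # HCL's per-bit-pair timings
hcl_per_byte = sum(hcl_loop)

print(drive_latency, c64_per_bitpair, per_byte_floor, hcl_per_byte)
```

With these figures, a 74-cycle loop sits two cycles above the four-times-18 floor, which agrees with HCL's remark about having 2 cycles left to optimize.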

2013-04-09 14:48

HCL

Registered: Feb 2003
Posts: 728

Ah, for once i think i understand :). LFT, what is that book you have? everyone should have it ;).

2013-04-09 15:25

Krill

Registered: Apr 2002
Posts: 2982

Yes, this explains everything. :)

2013-04-09 16:52

tlr

Registered: Sep 2003
Posts: 1791

Quoting lft

Quoting doynax
I think I've managed 16 cycles actually (66.5 per byte in practice with 2x unrolling.)

The trick is to reduce the delay between reading the bits and flipping ATN by combining both in a single RMW instruction (e.g. SLO/SRE.)

Yes, I also had this idea. But I couldn't figure out a way to do it without restricting the user to vicbank 0 (and maybe also 3). Did you find a way that works regardless of vicbank?

Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.

2013-04-09 17:41

Krill

Registered: Apr 2002
Posts: 2982

Quoting tlr

Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.

Not sure about the possibility of your idea, but the $dd02 trick is often used already so that the VIC bank can be set by a simple lda #bank:sta $dd00 in IRQ handlers. This prevents possible visual glitches by IRQs hitting between loader-executed lda value/sta $dd00 (and saves masking overhead, too). So setting $dd02 from user code is forbidden, while in your idea, setting $dd00 is.

2013-04-09 18:15

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting Krill

Yes, this explains everything. :)

Indeed. That explanation actually makes sense to me.

Quoting tlr

Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.

Why didn't I think of that?

The SLO keeps zeroes and the SRE doesn't appear to reach the least-significant bits with any ones. I just ran a quick test by poking at $dd02 and I can't see any VIC bank switching during loading.

For the record the basic transfer loop looks something like this:

	;(y = %00000100)
	;16 cycles, raises ATN
	and #%01100000		;0ba00000
	cmp $00,y
	sty $dd00
	slo $dd00		;cba010--
	;16 cycles, lowers ATN
	inx
	ror			;dcba010-
	lsr			;0dcba010
	cmp #%01000000
	arr #%00111000		;d00cba00
	sre $dd00		;dfecba--
	;16 cycles, raises ATN
	alr #%11111100		;0dfecba0
	sta merge+1
	sty $dd00
	slo $dd00		;g-------
	;16 cycles, lowers ATN
	and #%10000000		;g0000000
merge:	adc #%00000000		;gdfecbah
	sta sector,x
	sre $dd00-$04,y		;-ba-----

I wonder if it would be possible to get the bits through in the right order without sacrificing performance..

2013-04-09 18:44

lft

Registered: Jul 2007
Posts: 369

Quoting tlr

Couldn't the $dd00 bank bits just be kept 00? Then switching can be done via $dd02.

No, unfortunately that won't work. When reading dd00, the bits still reflect what is on the lines. If bank 1 is selected in this way, the two least significant bits in dd00 were written as 00 and the bits in dd02 were written as 01. This makes the lines high-low, and so when you read dd00 you get 10. Now suppose you rotate right (the same applies for bank 2 if you rotate left). Even if you can control the bit that gets shifted in from the left to be a zero, this will write 01 to dd00. The lines are now high-high, and the wrong bank has been selected.
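
A toy model of the two bank bits makes the failure visible. It assumes the simplest possible port behaviour: a bit set in dd02 drives the corresponding dd00 bit onto the line, and a cleared bit leaves the line pulled high. The function name and the model itself are illustrative, not taken from any loader.

```python
# Sketch: two-bit model of the VIC bank lines on $dd00/$dd02.
# Assumed model: direction bit 1 = output (line follows the data bit),
# direction bit 0 = input (line is pulled high).

def read_lines(data, direction):
    """Value read back from the two bank lines."""
    return ((data & direction) | ~direction) & 0b11

results = {}

# Bank selected via dd02 = %01 with dd00 bits kept at %00, then a
# right-shifting RMW (a zero comes in from the left).
direction = 0b01
before = read_lines(0b00, direction)        # what the RMW reads
written = before >> 1                       # what the RMW writes back
results["shift right"] = (before, read_lines(written, direction))

# Mirror case: dd02 = %10 with a left-shifting RMW (a zero comes in
# from the right).
direction = 0b10
before = read_lines(0b00, direction)
written = (before << 1) & 0b11
results["shift left"] = (before, read_lines(written, direction))

for name, (old, new) in results.items():
    print(f"{name}: lines {old:02b} -> {new:02b}")
```

In both cases the lines end up at %11, so in this model the bank changes as soon as the read-modify-write instruction writes its result.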

2013-04-09 19:06

tlr

Registered: Sep 2003
Posts: 1791

Good point. Then RMW doesn't work unless the bit is forced low by the instruction, like the lsb when using SLO.

2013-04-09 20:17

Krill

Registered: Apr 2002
Posts: 2982

Quoting doynax

I wonder if it would be possible to get the bits through in the right order without sacrificing performance..

This is one of the reasons i do this peculiar nibble-wise intermediate storing as mentioned by lft in his blog article. Since i have the block data in two pages of GCR nibbles in the drive RAM, i can do a table lookup (table size 32 bytes) while transferring, getting the bits nicely swapped and inverted into $1800 so that the computer receives them in the correct order and orientation. No extra table is needed on the computer side, thus yielding a minimum resident code size of $0100 bytes.
This, of course, has a few drawbacks, as in a few more cycles per byte and no easy possibility to checksum the data during transfer.

2013-04-10 06:41

HCL

Registered: Feb 2003
Posts: 728

Quote:

So setting $dd02 from user code is forbidden, while in your idea, setting $dd00 is.

In my loader system, it's the other way around. Setting $dd00 in user code is forbidden. The loader uses $dd00 and the user uses $dd02, though it's still possible to use $dd00 in limited ways if you really have to..

@Doynax: Interesting transfer loop, do you really get out what you want there? :P. Gotta check it once again :).

2013-04-10 11:01

Krill

Registered: Apr 2002
Posts: 2982

Quoting doynax

I wonder if it would be possible to get the bits through in the right order without sacrificing performance..

Hmm, looking at it a little longer, your problem is not only getting the bits over the wire in the right order, but also through your funky receive logic? :)
I guess it should be easy for you to simply shuffle the bits around on the disk so that it arrives in computer memory in the correct order. And in that case, no problem, is there? I mean, you sacrificed other general-use requirements (VIC bank) before, so.. :)

2013-04-10 12:09

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting tlr

Good point. Then RMW doesn't work unless the bit is forced low by the instruction,

:(

Quoting HCL

Interesting transfer loop, do you really get out what you want there? :P. Gotta check it once again :).

It is loading a compressed executable so I'd be somewhat surprised if it works despite dropping bits ;)

Quoting Krill

I guess it should be easy for you to simply shuffle the bits around on the disk so that it arrives in computer memory in the correct order. And in that case, no problem, is there? I mean, you sacrificed other general-use requirements (VIC bank) before, so.. :)

Pretty much though it is a tad inconvenient. Still, if a bit of pre-processing saves me a byte or a cycle I'm willing to do it.

The saving grace is that it's easy to reverse the transformation when uploading bytes to the drive, e.g. when saving, and thankfully the EOR checksum shouldn't care.
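
A rough sketch of that pre-processing step, under a loudly hypothetical assumption: the letters in the final comment of the loop above (gdfecbah, most significant bit first) are taken to mean that bit a is bit 0 of the byte as sent, b is bit 1, and so on. Any inversion on the wire is ignored. The helper names are made up for illustration.

```python
# Sketch: pre-shuffle bytes on disk so that a bit-permuting transfer
# delivers them in the right order. The mapping of letters to bit
# numbers is an assumption, not taken from the actual loader.

from functools import reduce
from operator import xor

LAYOUT = "gdfecbah"  # received byte, most significant bit first

def scramble(byte):
    """What the transfer does: source bit 'a'..'h' lands at its LAYOUT slot."""
    out = 0
    for pos, letter in enumerate(LAYOUT):
        src_bit = ord(letter) - ord("a")
        dst_bit = 7 - pos
        out |= ((byte >> src_bit) & 1) << dst_bit
    return out

def unscramble(byte):
    """Inverse permutation, applied when authoring the disk."""
    out = 0
    for pos, letter in enumerate(LAYOUT):
        src_bit = ord(letter) - ord("a")
        dst_bit = 7 - pos
        out |= ((byte >> dst_bit) & 1) << src_bit
    return out

data = [0x00, 0xFF, 0x12, 0xA5, 0x3C]
on_disk = [unscramble(b) for b in data]
received = [scramble(b) for b in on_disk]

# A bit permutation commutes with XOR, so the checksum of the stored
# bytes is just the permuted checksum of the real data.
checksum_matches = scramble(reduce(xor, on_disk)) == reduce(xor, data)
print(received == data, checksum_matches)
```

The same two helpers cover the save path mentioned above: bytes uploaded to the drive would go through the inverse direction.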

2013-04-10 14:28

Danzig

Registered: Jun 2002
Posts: 441

Quote: Ah, for once i think i understand :). LFT, what is that book you have? everyone should have it ;).

Maybe this is the right moment: I still got the book "Das große Floppy Buch 1541" from Data Becker for sale.. Anyone? :D

2013-04-10 18:55

chatGPZ

Registered: Dec 2001
Posts: 11391

protip: a pdf of that one is at spiros website =P

2013-04-10 19:35

Cruzer

Registered: Dec 2001
Posts: 1048

About the dd00/dd02 issue - wouldn't it be an idea to add a feature where you can ask the loader on the drive side to ignore the register for x amount of seconds, if you have an effect that absolutely has to use the "wrong register"?

Of course this is rare, and the feature would take valuable bytes on the drive side, so maybe it wouldn't be an idea after all. :)

2013-04-10 19:47

Danzig

Registered: Jun 2002
Posts: 441

Quote: protip: a pdf of that one is at spiros website =P

you've narrowed my possible profit to smth like zero. ;)

2013-04-10 20:05

chatGPZ

Registered: Dec 2001
Posts: 11391

you can still bring it to next X and throw it at whoever managed to use a broken loader =)

2013-04-10 21:34

Danzig

Registered: Jun 2002
Posts: 441

Quote: you can still bring it to next X and throw it at whoever managed to use a broken loader =)

I can bring it to next X, take each sheet and roll you pipes you have to smoke all the way...

2013-04-10 21:50

chatGPZ

Registered: Dec 2001
Posts: 11391

sounds like a plan =)

2013-04-11 01:18

Krill

Registered: Apr 2002
Posts: 2982

Quote: About the dd00/dd02 issue - wouldn't it be an idea to add a feature where you can ask the loader on the drive side to ignore the register for x amount of seconds, if you have an effect that absolutely has to use the "wrong register"?

Of course this is rare, and the feature would take valuable bytes on the drive side, so maybe it wouldn't be an idea after all. :)

Do you mean:

.define IDLE_BUS_LOCK 0 ; C-64 only: allow for arbitrary $DD00 writes ($00-$FF) when the loader
                        ; is idle (good for raster routines with LDA #value:STA $D018:STA $DD00, e.g.)

Has been a feature since day 1. And yes, nobody seems to have used it so far. :)

2013-04-11 06:10

Oswald

Registered: Apr 2002
Posts: 5095

Quote: Do you mean:

.define IDLE_BUS_LOCK 0 ; C-64 only: allow for arbitrary $DD00 writes ($00-$FF) when the loader ; is idle (good for raster routines with LDA #value:STA $D018:STA $DD00, e.g.)

Has been a feature since day 1. And yes, nobody seems to have used it so far. :)

you have invented that for me, for soiled legacy's chessboard stretcher part ;)

2013-04-11 06:25

HCL

Registered: Feb 2003
Posts: 728

..also have it in my loader, but i think i only used it in "1991" so far.

2013-04-11 07:13

Krill

Registered: Apr 2002
Posts: 2982

Quote: you have invented that for me, for soiled legacy's chessboard stretcher part ;)

Yes, but that was a previous loader, different code base and all. But i withdraw the nobody-ever-used-it part of my statement, sorry. :)

2013-04-11 11:43

Cruzer

Registered: Dec 2001
Posts: 1048

Quoting Krill

Has been a feature since day 1.

Quoting HCL

..also have it in my loader

Silly me for thinking I could come up with something new for loaders. :) Problem solved then I guess?

2013-04-11 11:45

Krill

Registered: Apr 2002
Posts: 2982

I guess so. But there are soooo many more things on the list waiting to be implemented.. :)

2013-04-11 12:43

Pantaloon

Registered: Aug 2003
Posts: 124

So when is a faster version of the krill loader going to be released :)

2013-04-11 12:45

Krill

Registered: Apr 2002
Posts: 2982

Soon. I am currently adding Doynax's LZ packer, first tests gave really good throughput results of 8.5-10.5 kB/s for decrunch during load.

2013-04-11 12:50

Pantaloon

Registered: Aug 2003
Posts: 124

oh nice :)

2013-04-11 12:50

Pantaloon

Registered: Aug 2003
Posts: 124

tell me if u want testing help on various 1541:s :)

2013-04-11 13:00

Krill

Registered: Apr 2002
Posts: 2982

Seems like C128DCR is more of an issue at the moment.. :)

2013-04-11 13:03

chatGPZ

Registered: Dec 2001
Posts: 11391

noooooo.....screw 128DCR !!!!111eleven

2013-04-11 13:34

HCL

Registered: Feb 2003
Posts: 728

I also get error reports for c128dcr's on my demos.. what's with that machine actually? Should be common knowledge by now one would think :P

2013-04-11 13:46

chatGPZ

Registered: Dec 2001
Posts: 11391

good old hammer fix to the rescue :)

2013-04-11 15:14

raven
Account closed

Registered: Jan 2002
Posts: 137

There's nothing wrong with the 128D, fix yer loaders!

I'm getting random crashes on many demos from X2012, all during drive activity.

2013-04-11 15:30

chatGPZ

Registered: Dec 2001
Posts: 11391

its only the C128DCR - and the timing IS broken there :)

2013-04-11 15:53

Oswald

Registered: Apr 2002
Posts: 5095

been ages since I last used my dcr, but it was rock stable on demos until early 2000s.

2013-04-11 16:12

chatGPZ

Registered: Dec 2001
Posts: 11391

try using jiffydos on it :)

2013-04-11 18:21

HCL

Registered: Feb 2003
Posts: 728

@Groepaz: You seem to know what is wrong with the dcr, is it transfer timing or is it read-head timing? If it's true what Oswald says, then it should of course be possible to do fast loaders that work on the dcr also..

2013-04-11 18:30

chatGPZ

Registered: Dec 2001
Posts: 11391

apparently the delay caused by the cable connection is slightly (really, less than a cycle) different than expected.... someone did a bunch of measurements (because chameleon has had a similar problem) - see http://www.forum64.de/wbb3/board65-neue-hardware/board289-diver.. ... yes its possible to make it work, by detecting and specifically handling C128D that is. (and then everything will break again with more than 1 drive on the bus ... =P)

2013-04-11 18:42

tlr

Registered: Sep 2003
Posts: 1791

Those are really nice measurements confirming and detailing the problem.
I thought it was common knowledge that the transfer timing was different? Less capacitance on the bus equals faster transitions.

Transfer routines that are "unidirectional" avoid the problem. i.e send sync pattern from the drive side and then the data (like the AR turbo). Anything that relies on the round trip c64 -> drive -> c64 is sensitive to timing variations for instance by bus loading.
Using the unidirectional technique might not be feasible in demo contexts though.

2013-04-12 05:25

soci

Registered: Sep 2003
Posts: 481

There's no problem with JiffyDOS on my C128DCR, but only after it had warmed up ;) The problem is on reading the last bits as I remember.

2013-04-23 07:17

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting Groepaz

apparently the delay caused by the cable connection is slightly (really, less than a cycle) different than expected.... someone did a bunch of measurements (because chameleon has had a similar problem)

Bear with me, it's been a long time since I took German, but the gist of it is that added capacitance in the DCR slows down signalling by less than a cycle for a round-trip but still sufficient to require an extra cycle of waiting on the host before a response can safely be read?

In effect limiting the speed of a traditional IRQ loader to 19x4 cycles/byte (17x4 with RMW opcodes.)

Quoting Groepaz

... yes its possible to make it work, by detecting and specifically handling C128D that is. (and then everything will break again with more than 1 drive on the bus ... =P)

Detecting and handling it by introducing extra delays on the host, with more than one drive on a non-GCR system exhibiting similar behaviour?

As an aside how am I supposed to handle multiple IEC devices on the bus in a two-bit IRQ loader? My best idea so far is to manually detect and handle as many of them as possible by installing code to put them into tight ATN-acknowledgement loops to avoid blocking the DATA line, but presumably that will fail miserably on many devices.

On a vaguely related note I've been thinking about what other limits the unwary drive coder with limited resources for testing might run into.

Specifically:
- How many tracks are safe to use?
- Which speedzones are safe for the various tracks?
- How fast may the head be stepped? Can it be improved with acceleration?
- What is the range of rotation speeds which should be supported?
- How hard are the GCR constraints? May clock recovery be reliable with more than two zero bits in a row?
- Which illegal opcodes, if any, may safely be relied upon?
- How much of a margin is required to account for the larger numbers of devices and longer cables on a worst-case IEC bus?

I realize that most of these are judgement calls but it would be good to know what the consensus is and what trouble you actually risk running into.

2013-04-23 17:59

tlr

Registered: Sep 2003
Posts: 1791

Quoting doynax

- How fast may the head be stepped? Can it be improved with acceleration?

Kernal uses 15 ms + 75 ms settle.

I've used 8 ms + 8 ms settle in DMA loader II which seems pretty stable.

Graham employs acceleration in WarpCopy64. It seems reasonable that will allow higher top speeds.
Perhaps Graham can elaborate on how it works in practice?

Quoting doynax

- What is the range of rotation speeds which should be supported?

There is some discussion here: Searching for a fast-writing floppy routine

Graham claims 280-320 rpm. TNT has only seen 295-305 rpm.
The drives I've encountered have all been very close to 300 rpm, even over time (i.e since '85 or so).

Oh, and my statement in the above thread about static intersector gaps has since been retracted, Graham was right. I've done it the right way in Format II.

Quoting doynax

- How hard are the GCR constraints? May clock recovery be reliable with more than two zero bits in a row?

The main reason for that restriction is how the 0 bit recovery was constructed.
It is done by a simple 4-bit counter which is reset on seeing a transition. It will dead count 4 steps for each bit position according to the current set speed zone. The lowest two counter bits are used for timing and the upper two are used to generate a 1 for the first step and 0's for the following 3 steps.
Now the weird thing is that it wraps! This means that after seeing a transition (a '1') plus ~3 non-transitions ('0's), a 1 will appear even though there isn't any transition coming from the disk.
The '~' in ~3 non-transitions ('0's) is because if the 1 that _must_ follow the last 0 is a tiny bit late, a fake 1 will be generated and then immediately afterwards the real transition will come and reset the counter and generate another 1.
In addition to this there might be analog factors as well, e.g noise, clock jitter, mechanical vibration causing bit jitter.

That said, it should be possible to allow three 0's in a row as long as the data rate (+ jitter) coming from the disk is strictly faster than the bit clock in the drive in all situations.
This requires you to write at a higher bit rate than you read.
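
The wrap-around can be illustrated with a very simplified model of the counter as described above: one tick is a quarter of a bit cell, a transition resets the counter, and a bit is emitted whenever the lower two counter bits are zero. This is a sketch of the description in this post, not of the real 1541 circuit, and the function name is made up.

```python
# Sketch: simplified model of the 4-bit bit-recovery counter described
# above. One tick = a quarter of a bit cell; tick 0 is assumed to be a
# flux transition.

def recover_bits(transition_ticks, total_ticks):
    transitions = set(transition_ticks)
    counter = 0
    bits = []
    for tick in range(total_ticks):
        if tick in transitions:
            counter = 0                      # transition resets the counter
        else:
            counter = (counter + 1) & 0x0F   # free-running, wraps at 16
        if counter & 0b11 == 0:              # start of a bit cell
            # upper two bits zero -> 1, otherwise 0
            bits.append(1 if counter >> 2 == 0 else 0)
    return bits

# "1 0 0 0 1" on disk: the second transition is nominally at tick 16.
on_time = recover_bits([0, 16], 20)
early = recover_bits([0, 15], 19)   # data slightly faster than the clock
late = recover_bits([0, 17], 21)    # data slightly slower than the clock

print(on_time, early, late)
```

In this model the early and on-time cases decode as 1 0 0 0 1, while the late case picks up a fake 1 from the wrap followed by the real one, which is the failure described above.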

There is a note about this in conjunction with early V-MAX implementations that it wasn't reliable on some drives (1541-II?). http://c64preservation.com/dp.php?pg=vmax (at the top) and here http://markus.brenner.de/mnib/vmaxtech.txt

2013-04-23 19:01

doynax
Account closed

Registered: Oct 2004
Posts: 212

Thanks for the info!

Quoting tlr

Graham claims 280-320 rpm. TNT has only seen 295-305 rpm.
The drives I've encountered have all been very close to 300 rpm, even over time (i.e since '85 or so).

Ouch. Speed-zone 3 written at 320 RPM and read back at 280 is only 22 1/4 cycles per byte.

Quoting tlr

That said, it should be possible to allow three 0's in a row as long as the data rate (+ jitter) coming from the disk is strictly faster than the bit clock in the drive in all situations.
This requires you to write at a higher bit rate than you read.

Interesting. So with proper authoring and a devilishly clever algorithm we might theoretically squeeze out 15 bits per 16-bit word of entropy.

2013-04-25 07:56

Krill

Registered: Apr 2002
Posts: 2982

My loader uses acceleration, too, originally suggested by Graham.

.define MINSTPSP                 $18 ; min. r/w head stepping speed on 1541/41-C/41-II/70/71/71CR

This is a value which should be safe even on old 1541s. (The unit is 256 cycles per half-track, so the same as the original firmware would write to a timer hi-byte.)

.define MAXSTPSP                 $10 ; max. r/w head stepping speed on 1541/41-C/41-II/70/71/71CR

These figures have been used for years now, without any negative reports so far.

.define STEPRACC                 $1c ; r/w head stepping acceleration on 1541/41-C/41-II/70/71/71CR

Acceleration is above figure, added once every timer lo-byte underflow. See source for details ;)
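
For comparison with the millisecond figures tlr quoted earlier, the two speed constants can be converted using the stated unit of 256 cycles per half-track. The sketch assumes a 1 MHz drive CPU clock and does not model the acceleration term; the helper name is illustrative.

```python
# Sketch: convert the stepping-speed constants to milliseconds per
# half-track. Assumes a 1 MHz drive clock; acceleration is not modelled.

DRIVE_CLOCK_HZ = 1_000_000
CYCLES_PER_UNIT = 256       # unit given above: timer hi-byte value

MINSTPSP = 0x18             # slowest stepping rate (start of a seek)
MAXSTPSP = 0x10             # fastest stepping rate

def step_ms(value):
    """Milliseconds per half-track for a timer hi-byte value."""
    return value * CYCLES_PER_UNIT * 1000 / DRIVE_CLOCK_HZ

slowest_ms = step_ms(MINSTPSP)
fastest_ms = step_ms(MAXSTPSP)

print(f"slowest: {slowest_ms:.3f} ms, fastest: {fastest_ms:.3f} ms")
```

Under that assumption the constants work out to roughly 6.1 ms and 4.1 ms per half-track.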

2013-04-25 08:07

Krill

Registered: Apr 2002
Posts: 2982

Quoting doynax

- Which illegal opcodes, if any, may safely be relied upon?

Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)

Quoting doynax

As an aside how am I supposed to handle multiple IEC devices on the bus in a two-bit IRQ loader? My best idea so far is to manually detect and handle as many of them as possible by installing code to put them into tight ATN-acknowledgement loops to avoid blocking the DATA line, but presumably that will fail miserably on many devices.

This is what i will add to my loader, too. First tests worked just fine with loading from any drive in a 4-units daisy chain using standard CBM cables. I do have one or the other cycle more in the transfer loop compared to your ultra-tight fixed VIC bank approach, though. :)
The actual problem though is that the other drives have to be somewhat protocol-savvy to correctly decide when to reset, as i do use a watchdog approach so you can safely reset the host computer at any time without locking up your drives, or you can cart-freeze and implicitly uninstall by using standard KERNAL drive access. (Not all C-64s issue a reset signal via serial.)

2013-04-25 10:39

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting Krill

My loader uses acceleration, too, originally suggested by Graham.
.
.
.
Acceleration is above figure, added once every timer lo-byte underflow. See source for details ;)

Thanks for the timing data!
I realize that it isn't much of an issue in demos but in my current project I'm forced to do an awful lot of random-access.

Quoting Krill

Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)

Damn :(
As you've probably guessed my transfer loop relies on SAX to mask out the ATN acknowledgment.

Oh, well.. I suppose that extra "DCR" accommodation cycle should free up some space.
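
For readers unfamiliar with the opcode: SAX stores the bitwise AND of A and X without changing either register or the flags, which is what makes it handy for masking a bit on its way out to the port. A minimal sketch in Python, assuming the usual 1541 VIA1 port B layout at $1800 (bit 1 DATA out, bit 3 CLK out, bit 4 ATN acknowledge). The mask value and the names are illustrative and not taken from any actual loader:

```python
# Model of the undocumented 6502 SAX opcode: memory <- A AND X.
# Bit positions assume the standard 1541 VIA1 port B layout ($1800);
# the constant and function names are hypothetical.
DATA_OUT = 0x02   # bit 1
CLK_OUT = 0x08    # bit 3
ATN_ACK = 0x10    # bit 4

def sax(a: int, x: int) -> int:
    """Value SAX would store: A AND X, registers and flags untouched."""
    return a & x & 0xFF

# A holds the bits to send plus a stray ATN acknowledge bit,
# X holds a mask that clears only the ATN acknowledge bit.
a = DATA_OUT | CLK_OUT | ATN_ACK
x = 0xFF ^ ATN_ACK
stored = sax(a, x)
print(f"{stored:08b}")
```

Without SAX the same masking needs a separate AND before the store, which costs extra cycles in a tight transfer loop.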

Quoting Krill

This is what i will add to my loader, too. First tests worked just fine with loading from any drive in a 4-units daisy chain using standard CBM cables.

Good to know I'm at least on the right track.
Have you encountered any IEC devices where this scheme won't work? Personally I have no experience with printers or modems or anything besides floppy drives really.

Quoting Krill

The actual problem though is that the other drives have to be somewhat protocol-savvy to correctly decide when to reset, as i do use a watchdog approach so you can safely reset the host computer at any time without locking up your drives, or you can cart-freeze and implicitly uninstall by using standard KERNAL drive access. (Not all C-64s issue a reset signal via serial.)

Thanks for the heads up! The reset signal not being reliable is precisely the kind of thing to catch the inexperienced drive coder by surprise.

Unfortunately I pretty much trash all of RAM, including using the full stack and zero-page as buffers, so routing a hardware timer interrupt through the ROM vectors may take a bit of juggling.

2013-04-25 13:45

Krill

Registered: Apr 2002
Posts: 2982

Quoting doynax

I realize that it isn't much of an issue in demos but in my current project I'm forced to do an awful lot of random-access.

Yes, and this is precisely where acceleration comes in handy.

Quoting doynax

As you've probably guessed my transfer loop relies on SAX to mask out the ATN acknowledgment.

If my suspicion about different timing on different bit positions doesn't hold or doesn't matter, it's alright to demand genuine C= gear. :)

Quoting doynax

Have you encountered any IEC devices where this scheme won't work? Personally I have no experience with printers or modems or anything besides floppy drives really.

Me neither, but i think it's okay to ignore those. I don't see why anybody should use printers and modems and whatnot on the serial bus of a C-64 these days. Those who do probably care more for GEOS rather than games or even demos.

Quoting doynax

Thanks for the heads up! The reset signal not being reliable is precisely the kind of thing to catch the inexperienced drive coder by surprise.

Oh, it reliably either works or doesn't. It's just that some ASSY #s pull the serial reset line low upon host reset and some don't.

Quoting doynax

Unfortunately I pretty much trash all of RAM, including using the full stack and zero-page as buffers, so routing a hardware timer interrupt through the ROM vectors may take a bit of juggling.

Same in my loader, there's basically not a single unused byte. But installing your own interrupt handler for watchdog purposes is not that difficult, you basically just run an "execute code in block" job and make sure that all other job codes are ineffective. Again, see source for details. :)

2013-04-25 13:52

Oswald

Registered: Apr 2002
Posts: 5095

that watchdog feature can kill your trackmo if you starve the loader of cpu for a few frames. i.e., the timer interrupt will trigger after a few frames, thinking the machine has been reset, and will reset the drive.

2013-04-25 16:49

tlr

Registered: Sep 2003
Posts: 1791

Quoting Krill

Quoting doynax
- Which illegal opcodes, if any, may safely be relied upon?
Same as for the 6510, minus SAX. SAX is reportedly not working on HCL's 1541 clone (using a Synertek 6502 clone), plus i suspect it to be generally a bad idea when used on the bus port bits. Data and clock might be updated at different times within a cycle. (But this might not be an issue after all.)

Do you have anything to back that timing suspicion up?
I can't see how the bit timing could be different all the way out to the port pins. Surely there must be at least one pipeline step through the VIA, so even if there is a difference in timing on the bus out from the 6502 it will be reclocked.

2013-04-25 18:43

chatGPZ

Registered: Dec 2001
Posts: 11391

also MiST from visual6502 actually did all the tests in a 1541 - with no sign of special behaviour.

2013-04-25 20:41

WVL

Registered: Mar 2002
Posts: 903

Quote: that watchdog feature can kill your trackmo if you starve the loader of cpu for a few frames. i.e., the timer interrupt will trigger after a few frames, thinking the machine has been reset, and will reset the drive.

Had that problem as well, was really happy I could pinpoint it with some help from Krill.. We could've known you could 'starve' a loader? :)

2013-04-25 20:54

Krill

Registered: Apr 2002
Posts: 2982

Yes, well, that was my fault mainly. I knew that the watchdog timeout is max. 65536 cycles, without any trick known to me to extend it without serious overhead or other repercussions, and at some point i noticed that this would be the problem.
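
To put that figure in perspective, here is a rough calculation, assuming the drive's 6502 runs at 1 MHz and the C-64 is a PAL machine (312 lines of 63 cycles at 985248 Hz). The constants are those commonly quoted values, not measurements from the loader:

```python
# Rough estimate of how long a 16-bit watchdog timer lasts in video frames.
DRIVE_CLOCK_HZ = 1_000_000        # 1541 CPU clock (assumed)
WATCHDOG_CYCLES = 65536           # maximum for a 16-bit timer
PAL_CYCLES_PER_FRAME = 312 * 63   # 19656 cycles per PAL frame
PAL_CLOCK_HZ = 985_248            # PAL C-64 CPU clock

timeout_s = WATCHDOG_CYCLES / DRIVE_CLOCK_HZ
frame_s = PAL_CYCLES_PER_FRAME / PAL_CLOCK_HZ
frames = timeout_s / frame_s
print(f"timeout: {timeout_s * 1000:.1f} ms, about {frames:.1f} PAL frames")
```

That comes out at roughly 65.5 ms, or a little over three PAL frames, which fits the "few frames" mentioned above.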

2013-04-25 20:56

Krill

Registered: Apr 2002
Posts: 2982

Quoting tlr

Do you have anything to back that timing suspicion up?
I can't see how the bit timing could be different all the way out to the port pins. Surely there must be at least one pipeline step through the VIA, so even if there is a difference in timing on the bus out from the 6502 it will be reclocked.

No, hence my doubting my doubts. Probably there is no such problem if the SAX opcode itself works. That it doesn't with some drives may be the main problem to consider here.

2013-04-25 21:00

Krill

Registered: Apr 2002
Posts: 2982

Quote: also MiST from visual6502 actually did all the tests in a 1541 - with no sign of special behaviour.

Original MOS 6502, yes. I have no idea about the clones floating around, if they use the original circuitry and whatnot. After all, SAX does not work on that Synertek variant. But i haven't checked if this is somehow connected with it not being NMOS or anything, IF it isn't..

2013-04-26 18:53

chatGPZ

Registered: Dec 2001
Posts: 11391

i'd just ignore that drive then, really :)

2013-04-26 19:37

JackAsser

Registered: Jun 2002
Posts: 2014

Quote: i'd just ignore that drive then, really :)

Ignore HCL's drive?!? Now that's bold... ;)

2013-04-26 20:23

HCL

Registered: Feb 2003
Posts: 728

OK, then i'll ignore all c128dcr drives.. ..and this means WAAAAR!!!

;)

2013-04-26 20:48

chatGPZ

Registered: Dec 2001
Posts: 11391

every good irq handler has an inc $d030 in it =)

2013-06-23 20:43

doynax
Account closed

Registered: Oct 2004
Posts: 212

I've been testing the 16-cycle RMW ATN acknowledgment scheme discussed above and have run into a bit of trouble. It appears to work fine on my working (1571) drive, 1541U, VICE and Hoxs64. However the reaction is occasionally a cycle late in CCS64.

Anyway, I've isolated the issue into a little timing test comparing ASL $DD00 to ASL+STA $DD00.

I'd much appreciate it if anyone else would run these on hardware to compare the RMW/WR cases, confirm whether this is a known bug, or spot the error in my thinking.

https://sites.google.com/site/doynax/iec_repro.zip
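
For context on why the two cases can differ at all: on an NMOS 6502/6510 an absolute read-modify-write instruction such as ASL $DD00 takes six cycles and writes twice, first the unmodified value and then the shifted one, whereas a plain STA writes once in its last cycle. A minimal sketch of the bus accesses to the target address, assuming "ASL+STA" means shifting in the accumulator and then storing. The function names are made up for illustration:

```python
# Bus accesses to the target address for two ways of writing a shifted value.
# Cycle numbers are relative to the first cycle of the instruction sequence.

def asl_abs(old: int):
    """ASL abs (6 cycles): read, dummy write of old value, write of result."""
    new = (old << 1) & 0xFF
    return [(4, "R", old), (5, "W", old), (6, "W", new)]

def asl_a_then_sta(a: int):
    """ASL A (2 cycles) followed by STA abs (4 cycles): a single write."""
    new = (a << 1) & 0xFF
    return [(6, "W", new)]

for name, trace in (("ASL $DD00", asl_abs(0x03)),
                    ("ASL A : STA $DD00", asl_a_then_sta(0x03))):
    print(name, trace)
```

In this model the final write lands in the same cycle either way; the only difference visible on the bus is the extra write of the old value in cycle 5.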

2013-06-24 18:10

tlr

Registered: Sep 2003
Posts: 1791

Quote: I've been testing the 16-cycle RMW ATN acknowledgment scheme discussed above and have run into a bit of trouble. It appears to work fine on my working (1571) drive, 1541U, VICE and Hoxs64. However the reaction is occasionally a cycle late in CCS64.

Anyway, I've isolated the issue into a little timing test comparing ASL $DD00 to ASL+STA $DD00.

I'd much appreciate it if anyone else would run these on hardware to compare the RMW/WR cases, confirm whether this is a known bug, or spot the error in my thinking.

https://sites.google.com/site/doynax/iec_repro.zip

didn't examine it in detail but the asl $dd00 will shift CLKin into DATAout (=DATA out from the c64 will be the inverse of the state of the CLK line).

Maybe that is what bites you?

2013-06-24 18:34

doynax
Account closed

Registered: Oct 2004
Posts: 212

Quoting tlr

didn't examine it in detail but the asl $dd00 will shift CLKin into DATAout (=DATA out from the c64 will be the inverse of the state of the CLK line).

Maybe that is what bites you?

Good idea. That's thinking outside of the box.

Unfortunately I think you've got the shift direction mixed up. The CLK input is bit 6 with DATA out in bit 5 just below it, so an ASL should not pick up the two significant input bits.
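
The bit layout can be checked with a quick sketch, assuming the standard CIA2 port A assignment at $DD00 (bit 3 ATN out, bit 4 CLK out, bit 5 DATA out, bit 6 CLK in, bit 7 DATA in). The helper name is illustrative:

```python
# Where each $DD00 bit ends up after ASL (shift left by one).
BITS = {3: "ATN out", 4: "CLK out", 5: "DATA out", 6: "CLK in", 7: "DATA in"}

def asl(value: int):
    """Return (result, carry) of a 6502 ASL on an 8-bit value."""
    return (value << 1) & 0xFF, (value >> 7) & 1

for bit, name in BITS.items():
    result, carry = asl(1 << bit)
    dest = "carry" if carry else BITS.get(result.bit_length() - 1, "unused")
    print(f"{name:8} (bit {bit}) -> {dest}")
```

CLK in moves up into bit 7 and DATA in falls into the carry, so neither lands on an output bit. It is ATN out and CLK out that shift into CLK out and DATA out.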

2013-06-24 19:07

tlr

Registered: Sep 2003
Posts: 1791

Quoting doynax

Unfortunately I think you've got the shift direction mixed up. The CLK input is bit 6 with DATA out in bit 5 just below it, so an ASL should not pick up the two significant input bits.

Doh! :)
