| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Release id #205568 : Spindle 3.0
Let's continue this here :-)
So the kernal + basic fits 10 times on a disk in your case and the FLI 12 times. I just tried that, and the kernal+basic fits 12 times on disk with quite a few blocks still left, so that's more than 105 extra blocks on disk. The FLI fits 15 times on a disk with some blocks still free, so that's at least 126 extra blocks. Speed was 6.11 kb/s for the kernal and 8.05 kb/s for the FLI with 100% of the CPU cycles available.
Seeing that you cache a second sector, are you making use of a 512-byte window now, or still a 256-byte window per block? |
|
| |
Krill
Registered: Apr 2002 Posts: 2982 |
And i am somewhat surprised that shifting the checksumming from the GCR read+decode loop to an extra loop after reading apparently does not slow things down.
But what was the reason for that change? Smaller decoding tables and bigger block cache? |
| |
lft
Registered: Jul 2007 Posts: 369 |
Previously, I was checksumming on the C64 side during the transfer. But the serial protocol was changed to avoid sending a full padded sector at the end of each job. However, with variable payloads the system can get stuck if there's a transmission error in the size prefix, and it wasn't feasible to checksum the size first (in a separate transmission) before transferring the corresponding data, especially not within one page of resident code. So I moved the checksumming to the drive.
Most of the time, under normal trackmo conditions, the drive is waiting for the host to decrunch (while also using rastertime for other stuff). Copying the newly read sector from the stack to a separate buffer takes time, so it's not possible to read the immediately following sector, which means that the drive has nothing to do until the second-next sector appears. That's why computing the checksum on the drive doesn't affect things all that much. |
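(To illustrate what that separate pass boils down to: the CBM data-block checksum is just an EOR over the 256 decoded bytes, so checking it after the read+decode loop amounts to something like the rough C model below. The real code is 6502 assembly on the 1541; the names here are made up.)

    #include <stdint.h>
    #include <stdio.h>

    /* Verify a decoded sector in a separate pass after the GCR read+decode
       loop: EOR all 256 data bytes and compare against the stored checksum. */
    static int checksum_ok(const uint8_t data[256], uint8_t stored)
    {
        uint8_t sum = 0;
        for (int i = 0; i < 256; i++)
            sum ^= data[i];
        return sum == stored;
    }

    int main(void)
    {
        uint8_t sector[256] = { 0x01, 0x02, 0x03 };        /* toy data, rest zero */
        printf("%s\n", checksum_ok(sector, 0x01 ^ 0x02 ^ 0x03) ? "ok" : "bad");
        return 0;
    }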
| |
lft
Registered: Jul 2007 Posts: 369 |
The decruncher does something new: The complete file is compressed in one go, with references up to 1 KB away. The compressed data is then split into blocks that are transferred independently, in any order. When a block is received, the compressed stream is decoded and literal data items are stored at their target locations.
Copy-items can't be carried out yet, because they might refer to data in a block that hasn't been received, so *a representation of the copy item is stored in the gap*. These representations form a linked list.
When a part of this chain is known to be complete (typically at track boundaries), it is traversed and the delayed copy operations are performed. Meanwhile, new blocks are loaded and the next chain starts to build up.
Obviously, there are many details to get right, but it can be done in quite a small footprint, and it's fast. |
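(A rough C model of the deferred-copy idea, with made-up field layout and names; the real implementation is 6502 assembly and handles ordering, block boundaries and edge cases that this sketch glosses over.)

    #include <stdint.h>
    #include <string.h>
    #include <stdio.h>

    static uint8_t  ram[65536];
    static uint16_t chain_head = 0;   /* 0 = empty chain (simplification) */

    /* Stage 1, while a block is decoded: literals go straight to their target
       addresses; a copy that may reference a not-yet-received block is only
       recorded, and the record is written into the very gap it will later fill:
       [next_lo][next_hi][src_lo][src_hi][len], linked to the previous record. */
    static void defer_copy(uint16_t dst, uint16_t src, uint8_t len)
    {
        ram[dst + 0] = chain_head & 0xff;
        ram[dst + 1] = chain_head >> 8;
        ram[dst + 2] = src & 0xff;
        ram[dst + 3] = src >> 8;
        ram[dst + 4] = len;
        chain_head = dst;             /* new head of the linked list of gaps */
    }

    /* Stage 2, once the chain is known to be complete (e.g. at a track
       boundary): walk the list and perform the copies, overwriting the records.
       The real thing has to resolve the copies in an order where every source
       is already filled in; that detail is ignored here. */
    static void flush_chain(void)
    {
        uint16_t dst = chain_head;
        while (dst) {
            uint16_t next = ram[dst] | (ram[dst + 1] << 8);
            uint16_t src  = ram[dst + 2] | (ram[dst + 3] << 8);
            uint8_t  len  = ram[dst + 4];
            for (uint8_t i = 0; i < len; i++)     /* forward copy, LZ semantics */
                ram[dst + i] = ram[src + i];
            dst = next;
        }
        chain_head = 0;
    }

    int main(void)
    {
        memcpy(&ram[0x1000], "ABCDEFGH", 8);   /* literal data already in place */
        defer_copy(0x2000, 0x1000, 8);         /* back reference, deferred      */
        flush_chain();                         /* later, at a "track boundary"  */
        printf("%.8s\n", (char *)&ram[0x2000]);
        return 0;
    }

This also shows why there is a minimum length for a deferred copy: the record has to fit into the gap it describes (five bytes in this made-up layout).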
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting lft: "Most of the time, under normal trackmo conditions, the drive is waiting for the host to decrunch (while also using rastertime for other stuff)." Unexpected assertion. =) Are you implying that any pending transfers are stalled until a block is fully decrunched? I.e., that block decrunching isn't interruptible by new blocks rolling in?
Quoting lft: "That's why computing the checksum on the drive doesn't affect things all that much." My question was why checksumming isn't done on-the-fly in the read+decode loop, but in a separate loop afterwards.
But then i realised there never was a read+decode+checksum loop with Spindle. Still wonder about the speed impact if it were. =) |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting Krill: "Still wonder about the speed impact if it were."
I tried disabling the checksumming entirely, and saw only a very minor speedup. It wasn't enough of a gain to motivate a complete rewrite of the decoder, although it might come to that in the future. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting Krill: "Are you implying that any pending transfers are stalled until a block is fully decrunched? I.e., that block decrunching isn't interruptible by new blocks rolling in?"
My decrunching happens in two stages: The literal items are handled first, and this part has to finish before the buffer page can be reused. The copy items are handled in a second stage that can be interrupted.
But that is beside the point. In the end, the host has more work to do than the drive. If you transfer as much as possible as early as possible, you still end up with a big slab of crunched data to work through at the end. Moving things around in time doesn't affect the total duration --- *except* if you insert dead-time by making the host (who has the most to do) wait for the drive. That is why a large drive-side buffer is important, along with prefetching, so the drive can supply a lot of data right at the beginning. |
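(A toy C model of that argument, with made-up numbers: as long as the drive can hand over a block whenever the host asks for one, the total time is just the sum of the host's work, no matter how the transfers are arranged in time; only host idle time stretches the run, which is what a large drive-side buffer and prefetching are there to avoid.)

    #include <stdio.h>

    int main(void)
    {
        const int N   = 16;   /* blocks in the file                        */
        const int H   = 10;   /* host time (transfer + decrunch) per block */
        const int D   = 12;   /* drive time per block, deliberately > H    */
        const int PRE = 4;    /* blocks buffered/prefetched before start   */

        int host_free = 0, idle = 0;
        for (int i = 0; i < N; i++) {
            int ready = (i < PRE) ? 0 : (i - PRE + 1) * D;   /* delivery time */
            int start = ready > host_free ? ready : host_free;
            idle += start - host_free;                       /* host waits    */
            host_free = start + H;
        }
        printf("total %d, host idle %d\n", host_free, idle);
        /* prints "total 160, host idle 0"; with PRE = 0 the same run takes
           202 units, because the host keeps waiting for the drive */
        return 0;
    }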
| |
Sparta
Registered: Feb 2017 Posts: 49 |
@Lft, very cool decompression algorithm! I really like how you squeezed all eight GCR decoding tables into one page. :)
As for your graphs, I do have an observation. I re-created the 100% CPU tests with Spindle 3 and Sparkle 2, and while my results with Spindle 3 matched yours, I got $378 frames, 7380 B/s for the Basic+Kernal test, and $32b frames, 8491 B/s for the FLI test using Sparkle 2. I wonder what the cause of the difference is. Did you use LoadNext calls in your tests with Sparkle 2?
I am happy to share my test d64s. |
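(For anyone cross-checking such figures: assuming PAL timing, B/s should simply be the total number of bytes loaded divided by frames/50. The payload size below is a made-up example, not a number from this thread; it just shows that the quoted frame count and rate are consistent under that assumption.)

    #include <stdio.h>

    int main(void)
    {
        const double frames  = 0x378;      /* 888 frames, as reported        */
        const double payload = 131072.0;   /* hypothetical total bytes loaded */
        const double fps     = 50.0;       /* PAL frames per second (approx) */
        printf("%.0f B/s\n", payload * fps / frames);   /* ~7380 B/s */
        return 0;
    }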
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting lft: "If you transfer as much as possible as early as possible, you still end up with a big slab of crunched data to work through at the end. Moving things around in time doesn't affect the total duration." Hmm, sounds sound for block-based decrunch or your new approach, but not so sure about the traditional bytestream-based crunchers.
Tbh, i'm not quite sure which of those is ultimately better in a C-64 IRQ-loading context. Schemes not working on bytestreams come with a worse pack ratio pretty much by definition, as their back references are limited to something less than the entire preceding unpacked file. Then again, they may make up for that with more speed, thanks to a reduced penalty for missing blocks in an out-of-order regime.
Alas, let's all implement both benchmarks (yours and Bitfire's/Sparkle's), then ponder the results on current versions. =)
(I'll need a few more weeks to finish mine.) |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting Sparta: "I re-created the 100% CPU tests with Spindle 3 and Sparkle 2, and while my results with Spindle 3 matched yours, I got $378 frames, 7380 B/s for the Basic+Kernal test, and $32b frames, 8491 B/s for the FLI test using Sparkle 2. Wonder what the cause of the difference is. Did you use LoadNext calls in your tests with Sparkle 2?"
Hmm, that's quite a big difference. Yes, I did use LoadNext. Let's compare notes in PM. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting lft: "Copy-items can't be carried out yet, because they might refer to data in a block that hasn't been received, so *a representation of the copy item is stored in the gap*. These representations form a linked list."
Oh, that's rather clever :)
It does put a lower bound on the length of a copy item, but it's quite rare to copy fewer than three or four bytes from distant locations anyway, so I can't see there being much of a compression ratio penalty for that. I wonder if it would be worth having two kinds of copy items: one intra-block, and one deferred... |
| |
Krill
Registered: Apr 2002 Posts: 2982 |
Quoting lft: "Hmm, that's quite a big difference. Yes, I did use LoadNext. Let's compare notes in PM." Would it be possible to publish the benchmark's disk images and source?
I'm adding it to my loader's example folder (as i did with the Bitfire/Sparkle benchmark described in Loader Benchmarks) but would like to avoid reinventing the wheel and possibly skewing/biasing results. =) |