| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Nucrunch 0.1
Continuing from the benchmarks WVL posted in Doynamite 1.x:
I dusted off my unfinished nucrunch in December to pack just enough of the second page of Reutastic to give me some workspace for some precalculations. Pity I didn't schedule enough time to pack the entire demo, else it would have been ~90 blocks instead of 190, but I digress. I've spent bits of the past month cleaning up the code, optimizing the packer (mostly by porting it from python to rust :P), and adding reverse direction support.
It's still no more than a component, with a commandline packer and an asm decrunch subroutine, but no tools yet for generating an executable from a single commandline. It does at least now support multiple input segments that are unpacked to their destination addresses, and it's also now usable enough for me to do some benchmarking.
In short, doynamite's ratio looks pretty unbeatable for anything lz based; my ratio's almost identical despite a somewhat different encoding.
Where I can win though is speed at that ratio; nucrunch is usually ten to twenty percent faster. The one exception in the test corpus is 6.bin, where it's 20% slower; not sure why yet.
I've added the times for pucrunch -ffast below to complete the comparison. The last two columns are nucrunch and nucrunch -r (the latter decodes in reverse, which should make it a more useful component for single filers).
If anyone wants to have a play at this stage, poke me and I'll upload some source. Failing that I'll hold off until I at least have something that can make onefilers without any faffing about with relocating the last couple of pages by hand.
filesizes
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 11008 8020 4529 4151 4329 3383 3711 3265 3225 3230
2 4973 4314 3532 3309 3423 2648 3005 2512 2498 2490
3 3949 3498 2991 2617 2972 2187 2530 2108 2091 2093
4 7016 6456 4242 4085 4225 3681 3924 3617 3622 3614
5 34760 27647 25781 24895 25210 21306 21182 20405 20447 20516
6 31605 12511 11283 10923 11614 9194 9203 8904 8915 8894
7 20392 17295 12108 11285 11445 9627 9789 9289 9140 9144
8 5713 5407 4179 3916 3936 3251 3656 3132 3165 3187
9 8960 7986 6914 6896 6572 5586 6000 5430 5502 5486
filesize in %
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 100 72.9 41.1 37.7 39.3 30.7 33.7 29.7 29.3 29.3
2 100 86.7 71.0 66.5 68.8 53.2 60.4 50.5 50.2 50.1
3 100 88.6 75.7 66.3 75.3 55.4 64.1 53.4 53.0 53.0
4 100 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5
5 100 79.5 74.2 71.6 72.5 61.3 60.9 58.7 58.8 59.0
6 100 39.6 35.7 34.6 36.7 29.1 29.1 28.2 28.2 28.1
7 100 84.8 59.4 55.3 56.1 47.2 48.0 45.6 44.8 44.8
8 100 94.6 73.1 68.5 68.9 56.9 64.0 54.8 55.4 55.8
9 100 89.1 77.2 77.0 73.3 62.3 67.0 60.6 61.4 61.2
number of frames to depack
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 0 11 13 14 15 58 54 27 22 22
2 0 5 7 7 9 38 39 17 14 14
3 0 4 6 6 7 28 31 12 10 10
4 0 8 9 9 10 43 51 20 17 18
5 0 36 39 42 59 300 298 119 104 107
6 0 20 25 25 37 126 152 49 59 59
7 0 22 25 26 32 138 139 60 51 52
8 0 6 8 8 10 43 47 18 16 17
9 0 9 12 12 16 73 81 32 28 29
kilobytes output per second
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 - 49.0 41.4 38.5 35.9 9.3 10.0 20.0 24.5 24.5
2 - 48.7 34.8 34.8 27.0 6.4 6.2 14.3 17.4 17.4
3 - 48.3 32.2 32.2 27.6 6.9 6.2 16.1 19.3 19.3
4 - 42.9 38.2 38.2 34.3 8.0 6.7 17.2 20.2 19.1
5 - 47.3 43.6 40.5 28.8 5.7 5.7 14.3 16.4 15.9
6 - 77.4 61.9 61.9 41.8 12.3 10.2 31.6 26.2 26.2
7 - 45.4 39.9 38.4 31.2 7.2 7.2 16.6 19.6 19.2
8 - 46.6 35.0 35.0 28.0 6.5 6.0 15.5 17.5 16.5
9 - 48.7 36.5 36.5 27.4 6.0 5.4 13.7 15.7 15.1
cycles per byte consumed
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 0 27 56 66 68 337 286 163 134 134
2 0 23 39 42 52 282 255 133 110 111
3 0 22 39 45 46 252 241 112 94 94
4 0 24 42 43 47 230 255 109 92 98
5 0 26 30 33 46 277 277 115 100 103
6 0 31 44 45 63 269 325 108 130 130
7 0 25 41 45 55 282 279 127 110 112
8 0 22 38 40 50 260 253 113 99 105
9 0 22 34 34 48 257 265 116 100 104
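The two derived tables above appear to follow directly from the file sizes and frame counts. A minimal sketch, assuming a PAL C64 (985248 Hz CPU clock, 63 x 312 = 19656 cycles per frame), 1 KB = 1024 bytes, and that "bytes consumed" means the packed size. None of these are stated in the post, but they reproduce the nucrunch figures for file 1. The function names are made up for illustration.

```python
# Assumed PAL C64 timing constants (not stated in the post).
CYCLES_PER_FRAME = 63 * 312   # 19656 cycles per video frame
CPU_HZ = 985248               # PAL CPU clock in Hz


def kb_per_second(unpacked_bytes, frames):
    """Kilobytes of output produced per second of depacking."""
    seconds = frames * CYCLES_PER_FRAME / CPU_HZ
    return unpacked_bytes / 1024 / seconds


def cycles_per_byte_consumed(packed_bytes, frames):
    """CPU cycles spent per byte of packed input."""
    return frames * CYCLES_PER_FRAME / packed_bytes


# File 1, nucrunch column: 11008 bytes unpacked, 3225 bytes packed, 22 frames.
print(round(kb_per_second(11008, 22), 1))          # 24.5
print(round(cycles_per_byte_consumed(3225, 22)))   # 134
```

Since the frame counts are whole numbers, both derived figures are only as precise as one frame out of the total.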
decrunch time for nucrunch/rnucrunch relative to doynamite
1: 81.5% (-18.5%) 81.5% (-18.5%)
2: 82.4% (-17.6%) 82.4% (-17.6%)
3: 83.3% (-16.7%) 83.3% (-16.7%)
4: 85.0% (-15.0%) 90.0% (-10.0%)
5: 87.4% (-12.6%) 89.9% (-10.1%)
6: 120.4% ( 20.4%) 120.4% ( 20.4%)
7: 85.0% (-15.0%) 86.7% (-13.3%)
8: 88.9% (-11.1%) 94.4% ( -5.6%)
9: 87.5% (-12.5%) 90.6% ( -9.4%)
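The percentages above match the ratio of the frame counts in the "number of frames to depack" table. A small sketch (the function name is made up for illustration):

```python
def relative_time(frames, baseline_frames):
    """Return (percent of baseline time, signed percent difference)."""
    pct = 100.0 * frames / baseline_frames
    return round(pct, 1), round(pct - 100.0, 1)


# File 6: nucrunch takes 59 frames, doynamite 49.
print(relative_time(59, 49))   # (120.4, 20.4)
# File 1: nucrunch takes 22 frames, doynamite 27.
print(relative_time(22, 27))   # (81.5, -18.5)
```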
|
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Well yes, I was assuming only checking in case of a match
Intriguing that end addr test can be done at zero cost in cycles.
The clobbering cases - is it because of the match spec crossing a byte boundary? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Trash: (the tradeoff would be that some data in the end would be overwritten)...
Oh yes, aware of that. I want an exact endpoint so I can implement MagerValp's end-alignment, where not even the crunched data passes the endpoint. |
| |
enthusi
Registered: May 2004 Posts: 677 |
The cruncher of choice obviously depends on the environment as well, in particular on the speed at which the data is loaded into memory: which fastloader, tape, cart...
Maybe someone could add some loading bitrates to the graphs? ;-)
TurboTape is ~ 450 Byte/sec.
1541 ROM: ~ 410 Byte/sec
Cart (lda $8000,x; sta $0800,x) ~ 160 KB/sec
(i.e. for Caren I now used page-aligned ,x loops of RAW data, which is still faster than RLE with a byte-at-boundary check. A dedicated RLE that never crosses banked-in pages should be slightly faster) |
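To put depack times next to load times, here is a rough sketch using the rates quoted in this post and the file 5 figures from the first post (34760 bytes raw; 20447 bytes and 104 frames for nucrunch). It assumes loading and depacking happen one after the other, 50 frames per second, and 1 KB = 1024 bytes. The function and constant names are made up for illustration.

```python
FRAMES_PER_SECOND = 50   # assumed PAL rate, rounded


def total_seconds(stored_bytes, load_bytes_per_sec, depack_frames=0):
    """Load time plus depack time, assuming the two do not overlap."""
    return stored_bytes / load_bytes_per_sec + depack_frames / FRAMES_PER_SECOND


ROM_1541 = 410       # bytes/sec, as quoted above
CART = 160 * 1024    # bytes/sec, as quoted above

for name, rate in (("1541 ROM", ROM_1541), ("cart", CART)):
    raw = total_seconds(34760, rate)
    packed = total_seconds(20447, rate, depack_frames=104)
    print(f"{name}: raw {raw:.1f}s, nucrunch {packed:.1f}s")
```

With the slow loader the packed file wins comfortably, while on the cart the raw copy is faster because the depack time dominates.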
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quote: Well yes, I was assuming only checking in case of a match
Intriguing that end addr test can be done at zero cost in cycles.
The clobbering cases - is it because of the match spec crossing a byte boundary?
Well, it is possible for the read_pointer to cross the write_pointer before the last literal/match is reached, at least when we encode stuff with variable bit lengths. I did a working prototype (more tests pending) that gets the cruncher to spit out a final binary blob as soon as this happens. Some files thus tend to become bigger; others, where this works out, usually get 2 bytes smaller.
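One way to picture the pointer crossing is to track how far the output has outrun the input after each token. The sketch below is a simplified model, not Bitbreaker's prototype: the token list, the function name, and the byte-granular treatment of the bit stream are all assumptions made for illustration.

```python
import math


def in_place_plan(tokens):
    """Model in-place depacking for a list of (packed_bits, unpacked_bytes).

    Returns (overlap_bytes, cut_index):
      overlap_bytes: how far the packed data must extend past the end of
                     the unpacked data for a plain in-place depack;
      cut_index:     how many tokens to keep if the rest is stored raw so
                     that nothing extends past the end.
    """
    packed_bits = 0
    unpacked = 0
    best_lead = 0.0
    cut_index = 0
    for i, (bits, out_bytes) in enumerate(tokens, start=1):
        packed_bits += bits
        unpacked += out_bytes
        # Lead = bytes written minus bytes read so far.
        lead = unpacked - packed_bits / 8
        if lead > best_lead:
            best_lead, cut_index = lead, i
    final_lead = unpacked - packed_bits / 8
    # If the lead shrinks towards the end, the writer would pass the reader
    # unless the packed data starts that much later in memory.
    overlap = math.ceil(best_lead - final_lead)
    return overlap, cut_index


# Hypothetical stream: two tokens that compress well, two that expand.
print(in_place_plan([(12, 4), (20, 8), (27, 3), (18, 2)]))   # (1, 2)
```

In this made-up example a plain in-place depack needs one byte of overlap, while keeping only the first two tokens and storing the remaining five bytes raw needs none.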
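As for the decruncher, a few things need to be changed. In my case I can forgo the terminator check:
        tay
        beq .lz_end_of_file
Instead, I check when I add the match length to the destination pointer:
        tya                     ; A = match length
        adc .lz_dst             ; add to destination pointer, low byte
        sta .lz_dst
        bcc .lz_end_low         ; no carry: high byte unchanged
.lz_maximum
        inc .lz_dst+1           ; carry: bump the high byte
        lda .lz_dst
.lz_end_low
        cmp #$00                ; end address, low byte (operand presumably patched at setup)
.lz_skip_poll
        bne .lz_poll            ; low byte differs: not at the end yet
        lda .lz_dst+1
.lz_end_hi
        cmp #$00                ; end address, high byte (operand presumably patched at setup)
        bne .lz_skip_end
        rts                     ; destination pointer equals end address: done
.lz_poll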
However, you also can't rely on the cruncher's EOF test anymore, so after the cruncher returns you need to continue loading until the loader terminates with an EOF, to be sure any remaining binary blob is still loaded if it goes over a block boundary. The latter of course only applies if you depack with on-the-fly loading. |
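The end-address test in the assembly above can be modelled in a few lines. This is a sketch under assumptions: the two `cmp #$00` operands are taken to hold the end address, the carry flag is taken to be clear before the addition, and the function name is made up for illustration.

```python
def add_match_length(dst, length, end):
    """Advance a 16-bit destination pointer and test it against `end`.

    The high byte is compared only when the low byte already matches.
    Returns (new_dst, finished, hi_compared).
    """
    lo = (dst & 0xFF) + length
    hi = dst >> 8
    if lo > 0xFF:                  # carry set: bump the high byte
        lo &= 0xFF
        hi = (hi + 1) & 0xFF
    new_dst = (hi << 8) | lo
    if lo != (end & 0xFF):         # common case: low bytes differ
        return new_dst, False, False
    finished = hi == (end >> 8)    # rare case: check the high byte too
    return new_dst, finished, True


print(add_match_length(0x10F0, 0x20, 0x2000))   # (4368, False, False)
print(add_match_length(0x1FF0, 0x10, 0x2000))   # (8192, True, True)
```

In this model the high-byte comparison is reached only when the low byte of the new pointer equals the low byte of the end address, so for most matches the extra test costs a single compare and branch.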
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ah of course - you're replacing the EOF check with something of (99.6% of the time) equal cost.
Interesting point about streaming loaders, but surely in that case you'd just be loading to a one page buffer rather than loading to destination address? |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I never used a buffer but was always depacking in place, though still with a small overlap. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Here is a first experimental test. At least my usual benchmark stuff passes. The overlap for normal cases also decreased; it is possible that the old algorithm did something wrong (or the new one does, haha).
As said, highly experimental, hurt yourself at your own risk. One might want to feed more files to the testsuite in the benchmark folder. The checksums + sizes and loadaddresses are not automated yet, some sed wizardry should help out there soon.
Now, with the end address check, the non-overflowing case of the dst-pointer addition is favoured, and thus one cycle is saved compared to the standard version. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
There's one problem arising with in-place decompression:
If you happen to have the source data not in place but at some other location in mem, the last literal will not be copied unless you include it as a literal sequence in the control stream. Then, however, if a file ends with a literal, the end check is missed when it is only done upon matches. That means one has to either add another bogus match, which makes the depacked data 2 bytes longer, or test in both cases, which bloats and slows down the depacker. Also, the old sentinel could be used for that. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Well if the source data is not in place but at some other location in mem, then you're not doing in place decompression :P |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Sure, but one might want to use the same decruncher code for both purposes, e.g. in a demo ;-) |