| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Nucrunch 0.1
Continuing from the benchmarks WVL posted in Doynamite 1.x:
I dusted off my unfinished nucrunch in December to pack just enough of the second page of Reutastic to give me some workspace for some precalculations. Pity I didn't schedule enough time to pack the entire demo, else it would have been ~90 blocks instead of 190, but I digress. I've spent bits of the past month cleaning up the code, optimizing the packer (mostly by porting it from python to rust :P), and adding reverse direction support.
It's still no more than a component, with an commandline packer and asm decrunch subroutine, but no tools yet for generating an executable from a single commandline. It does at least now support multiple input segments that are unpacked to their destination addresses, and it's also now useable enough to for me to do some benchmarking.
In short, doynamite's ratio looks pretty unbeatable for anything lz based; my ratio's almost identical despite a somewhat different encoding.
Where I can win though is speed at that ratio; nucrunch is usually ten to twenty percent faster. The one exception in the test corpus is 6.bin, where it's 20% slower; not sure why yet.
I've added the times for pucrunch -ffast below for to complete the comparison. Last two columns are nucrunch, and nucrunch -r (the latter decodes in reverse; should be a more useful component for single filers)
If anyone wants to have a play at this stage, poke me and I'll upload some source. Failing that I'll hold off until I at least have something that can make onefilers without any faffing about with relocating the last couple of pages by hand.
filesizes
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 11008 8020 4529 4151 4329 3383 3711 3265 3225 3230
2 4973 4314 3532 3309 3423 2648 3005 2512 2498 2490
3 3949 3498 2991 2617 2972 2187 2530 2108 2091 2093
4 7016 6456 4242 4085 4225 3681 3924 3617 3622 3614
5 34760 27647 25781 24895 25210 21306 21182 20405 20447 20516
6 31605 12511 11283 10923 11614 9194 9203 8904 8915 8894
7 20392 17295 12108 11285 11445 9627 9789 9289 9140 9144
8 5713 5407 4179 3916 3936 3251 3656 3132 3165 3187
9 8960 7986 6914 6896 6572 5586 6000 5430 5502 5486
filesize in %
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 100 72.9 41.1 37.7 39.3 30.7 33.7 29.7 29.3 29.3
2 100 86.7 71.0 66.5 68.8 53.2 60.4 50.5 50.2 50.1
3 100 88.6 75.7 66.3 75.3 55.4 64.1 53.4 53.0 53.0
4 100 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5
5 100 79.5 74.2 71.6 72.5 61.3 60.9 58.7 58.8 59.0
6 100 39.6 35.7 34.6 36.7 29.1 29.1 28.2 28.2 28.1
7 100 84.8 59.4 55.3 56.1 47.2 48.0 45.6 44.8 44.8
8 100 94.6 73.1 68.5 68.9 56.9 64.0 54.8 55.4 55.8
9 100 89.1 77.2 77.0 73.3 62.3 67.0 60.6 61.4 61.2
number of frames to depack
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 0 11 13 14 15 58 54 27 22 22
2 0 5 7 7 9 38 39 17 14 14
3 0 4 6 6 7 28 31 12 10 10
4 0 8 9 9 10 43 51 20 17 18
5 0 36 39 42 59 300 298 119 104 107
6 0 20 25 25 37 126 152 49 59 59
7 0 22 25 26 32 138 139 60 51 52
8 0 6 8 8 10 43 47 18 16 17
9 0 9 12 12 16 73 81 32 28 29
kilobytes output per second
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 49.0 41.4 38.5 35.9 9.3 10.0 20.0 24.5 24.5
2 48.7 34.8 34.8 27.0 6.4 6.2 14.3 17.4 17.4
3 48.3 32.2 32.2 27.6 6.9 6.2 16.1 19.3 19.3
4 42.9 38.2 38.2 34.3 8.0 6.7 17.2 20.2 19.1
5 47.3 43.6 40.5 28.8 5.7 5.7 14.3 16.4 15.9
6 77.4 61.9 61.9 41.8 12.3 10.2 31.6 26.2 26.2
7 45.4 39.9 38.4 31.2 7.2 7.2 16.6 19.6 19.2
8 46.6 35.0 35.0 28.0 6.5 6.0 15.5 17.5 16.5
9 48.7 36.5 36.5 27.4 6.0 5.4 13.7 15.7 15.1
cycles per byte consumed
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr
- ----- ----- ----- ----- ----- ----- ----- ----- ----- -----
1 0 27 56 66 68 337 286 163 134 134
2 0 23 39 42 52 282 255 133 110 111
3 0 22 39 45 46 252 241 112 94 94
4 0 24 42 43 47 230 255 109 92 98
5 0 26 30 33 46 277 277 115 100 103
6 0 31 44 45 63 269 325 108 130 130
7 0 25 41 45 55 282 279 127 110 112
8 0 22 38 40 50 260 253 113 99 105
9 0 22 34 34 48 257 265 116 100 104
decrunch time for nucrunch/rnucrunch relative to doynamite
1: 81.5% (-18.5%) 81.5% (-18.5%)
2: 82.4% (-17.6%) 82.4% (-17.6%)
3: 83.3% (-16.7%) 83.3% (-16.7%)
4: 85.0% (-15.0%) 90.0% (-10.0%)
5: 87.4% (-12.6%) 89.9% (-10.1%)
6: 120.4% ( 20.4%) 120.4% ( 20.4%)
7: 85.0% (-15.0%) 86.7% (-13.3%)
8: 88.9% (-11.1%) 94.4% ( -5.6%)
9: 87.5% (-12.5%) 90.6% ( -9.4%)
|
|
| |
Fungus
Registered: Sep 2002 Posts: 691 |
could you include exo in the tests pls. |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
demo coders wouldnt mind for 10x speed for 10% bigger file. or smth like that. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
How's about the size of the decoder itself? Can it easily be integrated into some loader system? |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quote: demo coders wouldnt mind for 10x speed for 10% bigger file. or smth like that.
That would just make the decruncher wait even longer for new data on a on the fly loading/decrunching process :-) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Oh noes.. i get compared again on 10+ years old code. But i don't blame noone but myself, i just have to click the release-button :P. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
@Fungus, I'll add that to my todo list, unless someone else gets to it first!
@Oswald well, for the moment the KB per second table should be useful for choosing between crunchers, but I'm planning on adding some tuning parameters to nucrunch next
@Bitbreaker Currently around 380 bytes for the standard decruncher/450 for the reverse. But there's some fairly aggressive inlining in there, probably wouldn't lose too much performance if I scale that back a bit.
The plan is to integrate it with marmaload, but my progress on that is pretty glacial at the moment. Should be pretty trivial to replace the 'next page' callback with something that signals a loader/waits for the next page in any case. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
@HCL it's ok, you're still 20% faster than pucrunch -ffast, and that was my go-to until *very* recently :D |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quoting ChristopherJam
@Bitbreaker Currently around 380 bytes for the standard decruncher/450 for the reverse. But there's some fairly aggressive inlining in there, probably wouldn't lose too much performance if I scale that back a bit.
Sounds huge yet, but i'd give a smaller version a try, so mind sharing code? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quote: Quoting ChristopherJam
@Bitbreaker Currently around 380 bytes for the standard decruncher/450 for the reverse. But there's some fairly aggressive inlining in there, probably wouldn't lose too much performance if I scale that back a bit.
Sounds huge yet, but i'd give a smaller version a try, so mind sharing code?
OK, I'll bundle something up tomorrow. |
| |
Kabuto Account closed
Registered: Sep 2004 Posts: 58 |
Could you please include my ALZ64 packer too? It's an LZMA-based packer, compression is very good, but decompression is very slow yet still acceptable for 4 KB intros, and some of your test cases are in that range :) |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
Can you add Spindle's cruncher to the list? |
| |
Fungus
Registered: Sep 2002 Posts: 691 |
HCL: Yours is still best for cracks too, if lots of small files. :) |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
I couldn't find the test files along with the code used to replicate the tests? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
@Martin, I grabbed the test files from LZWVL (bin.rar) |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Quote: @Martin, I grabbed the test files from LZWVL (bin.rar)
Cheers :)
The number of bytes in the top table, is that the raw file compressed only? No self extracting code? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
np. Raw file compressed only. (Or at least, that's what I was assuming for rle, wvl, bb & doyna, and it's what I measured for nucrunch, tinycrunch and pucrunch) |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Oh cool. :) I also get similar compression results to yours with my old LZ based code. The circular recent history buffer (updateHistoryBuffer) that stores length and offset helps squeeze more compression at the expense of decompression speed.
Current LZMPi results:
1 3,184
2 2,410
3 1,931
4 3,571
5 20,362
6 8,719
7 9,256
8 3,048
9 5,563
https://github.com/martinpiper/C64Public/tree/master/Compression
https://github.com/martinpiper/C64Public/tree/master/Decompress.. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
For funsies here's the result from LZMV+RLE with a 4k window:
# bin rle wvl-f wvl-s tc bb pu-f doyna nucru rnucr LZMV4k
1 100 72.9 41.1 37.7 39.3 30.7 33.7 29.7 29.3 29.3 38.2
2 100 86.7 71.0 66.5 68.8 53.2 60.4 50.5 50.2 50.1 64.0
3 100 88.6 75.7 66.3 75.3 55.4 64.1 53.4 53.0 53.0 64.6
4 100 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5 61.9
5 100 79.5 74.2 71.6 72.5 61.3 60.9 58.7 58.8 59.0 68.6
6 100 39.6 35.7 34.6 36.7 29.1 29.1 28.2 28.2 28.1 33.6
7 100 84.8 59.4 55.3 56.1 47.2 48.0 45.6 44.8 44.8 54.7
8 100 94.6 73.1 68.5 68.9 56.9 64.0 54.8 55.4 55.8 69.3
9 100 89.1 77.2 77.0 73.3 62.3 67.0 60.6 61.4 61.2 72.6
https://github.com/MagerValp/u4remastered/blob/master/tools/bac..
https://github.com/MagerValp/u4remastered/blob/master/src/u4loa.. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
OK, got some updated benchmarks. Still no exo or ALZ64, but I've updated ByteBoozer to version 2.0, optimised nucrunch a little, and I've switched to counting cycles for timing BB, Pu & Nu. I've added half a frame to all the other times on the assumption that they were originally all rounded down.
I've also included sizes for LZMPi and ~LZMV4k (thanks guys!), no timings for those yet. I gather bitnax/doynax is a little faster now, too; update yet to come.
filesizes
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 11008 8020 4529 4151 4329 3322 3711 3265 3225 3230 3184 4205
2 4973 4314 3532 3309 3423 2513 3005 2512 2498 2490 2410 3183
3 3949 3498 2991 2617 2972 2098 2530 2108 2091 2093 1931 2551
4 7016 6456 4242 4085 4225 3682 3924 3617 3622 3614 3571 4343
5 34760 27647 25781 24895 25210 20530 21182 20405 20447 20516 20362 23845
6 31605 12511 11283 10923 11614 8998 9203 8904 8915 8896 8719 10619
7 20392 17295 12108 11285 11445 9241 9789 9289 9140 9145 9256 11154
8 5713 5407 4179 3916 3936 3165 3656 3132 3166 3187 3048 3959
9 8960 7986 6914 6896 6572 5491 6000 5430 5502 5487 5563 6505
filesize in %
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 100.0 72.9 41.1 37.7 39.3 30.2 33.7 29.7 29.3 29.3 28.9 38.2
2 100.0 86.7 71.0 66.5 68.8 50.5 60.4 50.5 50.2 50.1 48.5 64.0
3 100.0 88.6 75.7 66.3 75.3 53.1 64.1 53.4 53.0 53.0 48.9 64.6
4 100.0 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5 50.9 61.9
5 100.0 79.5 74.2 71.6 72.5 59.1 60.9 58.7 58.8 59.0 58.6 68.6
6 100.0 39.6 35.7 34.6 36.7 28.5 29.1 28.2 28.2 28.1 27.6 33.6
7 100.0 84.8 59.4 55.3 56.1 45.3 48.0 45.6 44.8 44.8 45.4 54.7
8 100.0 94.6 73.1 68.5 68.9 55.4 64.0 54.8 55.4 55.8 53.4 69.3
9 100.0 89.1 77.2 77.0 73.3 61.3 67.0 60.6 61.4 61.2 62.1 72.6
number of frames to depack
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 0.0 11.5 13.5 14.5 15.5 46.1 54.7 27.5 22.0 21.3
2 0.0 5.5 7.5 7.5 9.5 31.2 39.6 17.5 13.7 13.6
3 0.0 4.5 6.5 6.5 7.5 22.7 31.9 12.5 10.0 10.1
4 0.0 8.5 9.5 9.5 10.5 37.5 52.0 20.5 16.6 17.1
5 0.0 36.5 39.5 42.5 59.5 222.2 298.6 119.5 101.4 100.6
6 0.0 20.5 25.5 25.5 37.5 112.3 152.8 49.5 57.9 56.6
7 0.0 22.5 25.5 26.5 32.5 109.6 139.8 60.5 50.3 50.0
8 0.0 6.5 8.5 8.5 10.5 35.4 47.7 18.5 15.6 15.8
9 0.0 9.5 12.5 12.5 16.5 62.3 81.7 32.5 27.1 27.2
kilobytes output per second
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 46.9 39.9 37.2 34.8 11.7 9.9 19.6 24.5 25.3
2 44.3 32.5 32.5 25.6 7.8 6.1 13.9 17.7 17.9
3 43.0 29.7 29.7 25.8 8.5 6.1 15.5 19.4 19.2
4 40.4 36.2 36.2 32.7 9.2 6.6 16.8 20.6 20.1
5 46.6 43.1 40.0 28.6 7.7 5.7 14.2 16.8 16.9
6 75.5 60.7 60.7 41.3 13.8 10.1 31.3 26.7 27.3
7 44.4 39.1 37.7 30.7 9.1 7.1 16.5 19.8 20.0
8 43.0 32.9 32.9 26.6 7.9 5.9 15.1 18.0 17.7
9 46.2 35.1 35.1 26.6 7.0 5.4 13.5 16.2 16.1
cycles per byte consumed
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 0.0 28.2 58.6 68.7 70.4 272.9 289.7 165.6 134.2 129.4
2 0.0 25.1 41.7 44.6 54.6 244.4 259.3 136.9 108.0 107.5
3 0.0 25.3 42.7 48.8 49.6 212.3 247.9 116.6 93.8 94.6
4 0.0 25.9 44.0 45.7 48.8 200.3 260.4 111.4 90.3 93.1
5 0.0 26.0 30.1 33.6 46.4 212.8 277.1 115.1 97.5 96.4
6 0.0 32.2 44.4 45.9 63.5 245.3 326.3 109.3 127.7 125.1
7 0.0 25.6 41.4 46.2 55.8 233.2 280.7 128.0 108.2 107.4
8 0.0 23.6 40.0 42.7 52.4 219.9 256.4 116.1 96.6 97.2
9 0.0 23.4 35.5 35.6 49.3 223.0 267.7 117.6 96.9 97.4
|
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Are you timing with VICE stopwatch to get the cycle times? With the screen off? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Chained DC00 timers (based on a routine from Glasnost) and blanked screen for nucrunch
VICE stopwatch for BB & pu, not blanked; which in the light of day seems a tad unfair - forgot that at 3am.
I need to either translate BB or install the assembler HCL uses so I can build it into my testbed instead of just running the produced executable. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Hmm, compression rate of B2 was something like expected.. but performance was not that impressive, only 15-25% faster than before. I guess i was comparing to original bb, and you probably had the latest optimized decruncher. Never mind, thanx for adding it so quickly!! |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
I added exoraw 2.0.9 with default options:
filesizes
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 11008 8020 4529 4151 4329 3322 3711 3265 3225 3230 3184 4205 2988
2 4973 4314 3532 3309 3423 2513 3005 2512 2498 2490 2410 3183 2241
3 3949 3498 2991 2617 2972 2098 2530 2108 2091 2093 1931 2551 1817
4 7016 6456 4242 4085 4225 3682 3924 3617 3622 3614 3571 4343 3454
5 34760 27647 25781 24895 25210 20530 21182 20405 20447 20516 20362 23845 19631
6 31605 12511 11283 10923 11614 8998 9203 8904 8915 8896 8719 10619 8337
7 20392 17295 12108 11285 11445 9241 9789 9289 9140 9145 9256 11154 8751
8 5713 5407 4179 3916 3936 3165 3656 3132 3166 3187 3048 3959 3059
9 8960 7986 6914 6896 6572 5491 6000 5430 5502 5487 5563 6505 5295
filesize in %
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 100.0 72.9 41.1 37.7 39.3 30.2 33.7 29.7 29.3 29.3 28.9 38.2 27.1
2 100.0 86.7 71.0 66.5 68.8 50.5 60.4 50.5 50.2 50.1 48.5 64.0 45.1
3 100.0 88.6 75.7 66.3 75.3 53.1 64.1 53.4 53.0 53.0 48.9 64.6 46.0
4 100.0 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5 50.9 61.9 49.2
5 100.0 79.5 74.2 71.6 72.5 59.1 60.9 58.7 58.8 59.0 58.6 68.6 56.5
6 100.0 39.6 35.7 34.6 36.7 28.5 29.1 28.2 28.2 28.1 27.6 33.6 26.4
7 100.0 84.8 59.4 55.3 56.1 45.3 48.0 45.6 44.8 44.8 45.4 54.7 42.9
8 100.0 94.6 73.1 68.5 68.9 55.4 64.0 54.8 55.4 55.8 53.4 69.3 53.5
9 100.0 89.1 77.2 77.0 73.3 61.3 67.0 60.6 61.4 61.2 62.1 72.6 59.1
For a graph and results sorted by average file size: https://www.icloud.com/numbers/00060NSRmk5sdCbLjtye-UGCA#LZMV_B.. |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Very cool compression with exoraw. Something to strive towards. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Impressively small! Any chance of cycle timings for the new additions? |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
I don't have time to code anything from scratch at the moment, but if you share your benchmarking code I could probably do LZMV and exomizer.
And while exomizer produces small files, it really is butt slow. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Here you go:
https://dl.dropboxusercontent.com/u/62734159/csdb/20160210_cycl.. |
| |
Axis/Oxyron Account closed
Registered: Apr 2007 Posts: 91 |
So any volunteers to optimize the exo-decruncher? ;oP |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quote: So any volunteers to optimize the exo-decruncher? ;oP
Doesn't make much sense as long as you also want the decruncher to be tiny, most of all, as it needs a 156 byte scratchpad during decrunching. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Added timed benchmarks for LZMV, both the original 256 byte sliding window and the 4k version I used in U4. Note that neither is optimized for speed right now, but size and convenience.
filesizes
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 11008 8020 4529 4151 4329 3322 3711 3265 3225 3230 3184 4205 4539 2988
2 4973 4314 3532 3309 3423 2513 3005 2512 2498 2490 2410 3183 3575 2241
3 3949 3498 2991 2617 2972 2098 2530 2108 2091 2093 1931 2551 3018 1817
4 7016 6456 4242 4085 4225 3682 3924 3617 3622 3614 3571 4343 4314 3454
5 34760 27647 25781 24895 25210 20530 21182 20405 20447 20516 20362 23845 26116 19631
6 31605 12511 11283 10923 11614 8998 9203 8904 8915 8896 8719 10619 11352 8337
7 20392 17295 12108 11285 11445 9241 9789 9289 9140 9145 9256 11154 12188 8751
8 5713 5407 4179 3916 3936 3165 3656 3132 3166 3187 3048 3959 3987 3059
9 8960 7986 6914 6896 6572 5491 6000 5430 5502 5487 5563 6505 6943 5295
filesize in %
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 100.0 72.9 41.1 37.7 39.3 30.2 33.7 29.7 29.3 29.3 28.9 38.2 41,2 27.1
2 100.0 86.7 71.0 66.5 68.8 50.5 60.4 50.5 50.2 50.1 48.5 64.0 71,9 45.1
3 100.0 88.6 75.7 66.3 75.3 53.1 64.1 53.4 53.0 53.0 48.9 64.6 76,4 46.0
4 100.0 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5 50.9 61.9 61,5 49.2
5 100.0 79.5 74.2 71.6 72.5 59.1 60.9 58.7 58.8 59.0 58.6 68.6 75,1 56.5
6 100.0 39.6 35.7 34.6 36.7 28.5 29.1 28.2 28.2 28.1 27.6 33.6 35,9 26.4
7 100.0 84.8 59.4 55.3 56.1 45.3 48.0 45.6 44.8 44.8 45.4 54.7 59,8 42.9
8 100.0 94.6 73.1 68.5 68.9 55.4 64.0 54.8 55.4 55.8 53.4 69.3 69,8 53.5
9 100.0 89.1 77.2 77.0 73.3 61.3 67.0 60.6 61.4 61.2 62.1 72.6 77,5 59.1
number of frames to depack
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 0.0 11.5 13.5 14.5 15.5 46.1 54.7 27.5 22.0 21.3 16.9 13.7
2 0.0 5.5 7.5 7.5 9.5 31.2 39.6 17.5 13.7 13.6 9.6 8.4
3 0.0 4.5 6.5 6.5 7.5 22.7 31.9 12.5 10.0 10.1 7.1 6.2
4 0.0 8.5 9.5 9.5 10.5 37.5 52.0 20.5 16.6 17.1 11.5 9.4
5 0.0 36.5 39.5 42.5 59.5 222.2 298.6 119.5 101.4 100.6 58.4 46.6
6 0.0 20.5 25.5 25.5 37.5 112.3 152.8 49.5 57.9 56.6 35.8 38.2
7 0.0 22.5 25.5 26.5 32.5 109.6 139.8 60.5 50.3 50.0 35.0 29.1
8 0.0 6.5 8.5 8.5 10.5 35.4 47.7 18.5 15.6 15.8 10.3 8.9
9 0.0 9.5 12.5 12.5 16.5 62.3 81.7 32.5 27.1 27.2 16.6 14.2
kilobytes output per second
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 46.9 39.9 37.2 34.8 11.7 9.9 19.6 24.5 25.3 33.1 41.0
2 44.3 32.5 32.5 25.6 7.8 6.1 13.9 17.7 17.9 26.4 30.1
3 43.0 29.7 29.7 25.8 8.5 6.1 15.5 19.4 19.2 28.4 32.6
4 40.4 36.2 36.2 32.7 9.2 6.6 16.8 20.6 20.1 31.1 38.1
5 46.6 43.1 40.0 28.6 7.7 5.7 14.2 16.8 16.9 30.3 38.0
6 75.5 60.7 60.7 41.3 13.8 10.1 31.3 26.7 27.3 44.9 42.1
7 44.4 39.1 37.7 30.7 9.1 7.1 16.5 19.8 20.0 29.7 35.6
8 43.0 32.9 32.9 26.6 7.9 5.9 15.1 18.0 17.7 28.1 32.6
9 46.2 35.1 35.1 26.6 7.0 5.4 13.5 16.2 16.1 27.4 32.1
cycles per byte consumed
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 0.0 28.2 58.6 68.7 70.4 272.9 289.7 165.6 134.2 129.4 30.2 24.4
2 0.0 25.1 41.7 44.6 54.6 244.4 259.3 136.9 108.0 107.5 37.8 33.2
3 0.0 25.3 42.7 48.8 49.6 212.3 247.9 116.6 93.8 94.6 35.2 30.6
4 0.0 25.9 44.0 45.7 48.8 200.3 260.4 111.4 90.3 93.1 32.2 26.2
5 0.0 26.0 30.1 33.6 46.4 212.8 277.1 115.1 97.5 96.4 33.0 26.3
6 0.0 32.2 44.4 45.9 63.5 245.3 326.3 109.3 127.7 125.1 22.3 23.8
7 0.0 25.6 41.4 46.2 55.8 233.2 280.7 128.0 108.2 107.4 33.7 28.1
8 0.0 23.6 40.0 42.7 52.4 219.9 256.4 116.1 96.6 97.2 35.5 30.6
9 0.0 23.4 35.5 35.6 49.3 223.0 267.7 117.6 96.9 97.4 36.5 31.1
|
| |
mankeli
Registered: Oct 2010 Posts: 146 |
just for comparison:
cycles per byte consumed (bin): 4+4 = 8
kilobytes output per second (bin): (312*63*50)/8 -> 122.85kB/s
etc.
:) |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
That requires an unrolled loop and hardcoded addresses, so it's not a realistic comparison. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
also with direct loading into registers and presorting to maximize register reuse upon same values, we get below 6 cycles ;-) For further brainfuck we can combine registers with SAX, SHA, ... on some stores :-D |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Bitbreakeralso with direct loading into registers and presorting to maximize register reuse upon same values, we get below 6 cycles ;-)
Exactly what's required for half-cell Koalas, or there wouldn't be enough CPU time to update 37 bytes of d800 values every four lines :)
But yeah, I was assuming the .bin was already in place, so this is the excess cycles over time taken to transfer from external storage, be it 1541 or something more exotic
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting MagerValpAdded timed benchmarks for LZMV, both the original 256 byte sliding window and the 4k version I used in U4. Note that neither is optimized for speed right now, but size and convenience.
Thank you! Looking very similar to TinyCrunch there, maybe I should be dusting that off too
|
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Added results of exomizer mem. Butt slow. I also added average compression and total time to depack all 9.
filesizes
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exomem
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 11008 8020 4529 4151 4329 3322 3711 3265 3225 3230 3184 4205 4539 2988
2 4973 4314 3532 3309 3423 2513 3005 2512 2498 2490 2410 3183 3575 2225
3 3949 3498 2991 2617 2972 2098 2530 2108 2091 2093 1931 2551 3018 1808
4 7016 6456 4242 4085 4225 3682 3924 3617 3622 3614 3571 4343 4314 3442
5 34760 27647 25781 24895 25210 20530 21182 20405 20447 20516 20362 23845 26116 19715
6 31605 12511 11283 10923 11614 8998 9203 8904 8915 8896 8719 10619 11352 8322
7 20392 17295 12108 11285 11445 9241 9789 9289 9140 9145 9256 11154 12188 8765
8 5713 5407 4179 3916 3936 3165 3656 3132 3166 3187 3048 3959 3987 3081
9 8960 7986 6914 6896 6572 5491 6000 5430 5502 5487 5563 6505 6943 5304
filesize in %
# bin rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exomem
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 100.0 72.9 41.1 37.7 39.3 30.2 33.7 29.7 29.3 29.3 28.9 38.2 41.2 27.1
2 100.0 86.7 71.0 66.5 68.8 50.5 60.4 50.5 50.2 50.1 48.5 64.0 71.9 44.7
3 100.0 88.6 75.7 66.3 75.3 53.1 64.1 53.4 53.0 53.0 48.9 64.6 76.4 45.8
4 100.0 92.0 60.5 58.2 60.2 52.5 55.9 51.6 51.6 51.5 50.9 61.9 61.5 49.1
5 100.0 79.5 74.2 71.6 72.5 59.1 60.9 58.7 58.8 59.0 58.6 68.6 75.1 56.7
6 100.0 39.6 35.7 34.6 36.7 28.5 29.1 28.2 28.2 28.1 27.6 33.6 35.9 26.3
7 100.0 84.8 59.4 55.3 56.1 45.3 48.0 45.6 44.8 44.8 45.4 54.7 59.8 43.0
8 100.0 94.6 73.1 68.5 68.9 55.4 64.0 54.8 55.4 55.8 53.4 69.3 69.8 53.9
9 100.0 89.1 77.2 77.0 73.3 61.3 67.0 60.6 61.4 61.2 62.1 72.6 77.5 59.2
Average 80.9 63.1 59.5 61.2 48.4 53.7 48.1 48.1 48.1 47.1 58.6 63.2 45.1
number of frames to depack
# rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exomem
--- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 11.5 13.5 14.5 15.5 46.1 54.7 27.5 22.0 21.3 16.9 13.7 57.9
2 5.5 7.5 7.5 9.5 31.2 39.6 17.5 13.7 13.6 9.6 8.4 38.5
3 4.5 6.5 6.5 7.5 22.7 31.9 12.5 10.0 10.1 7.1 6.2 28.5
4 8.5 9.5 9.5 10.5 37.5 52.0 20.5 16.6 17.1 11.5 9.4 53.1
5 36.5 39.5 42.5 59.5 222.2 298.6 119.5 101.4 100.6 58.4 46.6 295.9
6 20.5 25.5 25.5 37.5 112.3 152.8 49.5 57.9 56.6 35.8 38.2 142.3
7 22.5 25.5 26.5 32.5 109.6 139.8 60.5 50.3 50.0 35.0 29.1 139.2
8 6.5 8.5 8.5 10.5 35.4 47.7 18.5 15.6 15.8 10.3 8.9 44.8
9 9.5 12.5 12.5 16.5 62.3 81.7 32.5 27.1 27.2 16.6 14.2 78.9
Sum 125.5 148.5 153.5 199.5 679.3 898.8 358.5 314.6 312.3 201.2 174.7 879.1
kilobytes output per second
# rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exomem
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 46.9 39.9 37.2 34.8 11.7 9.9 19.6 24.5 25.3 33.1 41.0 9.7
2 44.3 32.5 32.5 25.6 7.8 6.1 13.9 17.7 17.9 26.4 30.1 6.6
3 43.0 29.7 29.7 25.8 8.5 6.1 15.5 19.4 19.2 28.4 32.6 7.0
4 40.4 36.2 36.2 32.7 9.2 6.6 16.8 20.6 20.1 31.1 38.1 6.7
5 46.6 43.1 40.0 28.6 7.7 5.7 14.2 16.8 16.9 30.3 38.0 6.0
6 75.5 60.7 60.7 41.3 13.8 10.1 31.3 26.7 27.3 44.9 42.1 11.3
7 44.4 39.1 37.7 30.7 9.1 7.1 16.5 19.8 20.0 29.7 35.6 7.5
8 43.0 32.9 32.9 26.6 7.9 5.9 15.1 18.0 17.7 28.1 32.6 6.5
9 46.2 35.1 35.1 26.6 7.0 5.4 13.5 16.2 16.1 27.4 32.1 5.8
cycles per byte consumed
# rle wvl-f wvl-s tc bb2.0 pu-f doyna nucru rnucr LZMPi LZMV4k LZMV256 exomem
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------- ------
1 28.2 58.6 68.7 70.4 272.9 289.7 165.6 134.2 129.4 30.2 24.4 103.4
2 25.1 41.7 44.6 54.6 244.4 259.3 136.9 108.0 107.5 37.8 33.2 152.0
3 25.3 42.7 48.8 49.6 212.3 247.9 116.6 93.8 94.6 35.2 30.6 141.9
4 25.9 44.0 45.7 48.8 200.3 260.4 111.4 90.3 93.1 32.2 26.2 148.7
5 26.0 30.1 33.6 46.4 212.8 277.1 115.1 97.5 96.4 33.0 26.3 167.3
6 32.2 44.4 45.9 63.5 245.3 326.3 109.3 127.7 125.1 22.3 23.8 88.5
7 25.6 41.4 46.2 55.8 233.2 280.7 128.0 108.2 107.4 33.7 28.1 134.2
8 23.6 40.0 42.7 52.4 219.9 256.4 116.1 96.6 97.2 35.5 30.6 154.0
9 23.4 35.5 35.6 49.3 223.0 267.7 117.6 96.9 97.4 36.5 31.1 173.0
|
| |
Fungus
Registered: Sep 2002 Posts: 691 |
Thanks for that MV. |
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
|
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
|
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Now just give us the Pareto Optimal set too and then we've settled this properly. ;) |
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
It would be invalid to treat Pareto efficiency as equivalent to social optimality. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
...but it would still be great. :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
I'm sorry CJam, but those numbers for b2 decrunching times just don't make sense. I have setup the same tests (i hope) here at my place, and i get numbers in the same area as Doynax and Nucrunch.. which is expected. Could you please double check you're not using the old bb-decruncher on the b2-files or something.. :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Curious indeed. I just grabbed the new release, built files with
bench_bb_speed/%.prg: corpus/%.prg Makefile
mkdir -p bench_bb_speed
cp $< bench_bb_speed
b2 -c 1000 $@
mv $@.b2 $@
and set watches in Vice on 080d and 1000, zeroing stopwatch at the first and recording it at the second.
I was going have another go anyway after patching the decoder to zero $d011, but that would only gain 4-5%. I'll keep you posted. |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Ah, so you are making executables (-c).. that explains it :). The executables have a smaller decruncher which is alot slower than the normal one. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
That would do it!
Funny, I was just disassembling in Vice on the off chance the embedded decoder was out of date, and noticing the absence of the sequence of starting STYs :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
I should have understood that when you were talking about translating the decruncher.. |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
So are we meant to be timing with self extracting space optimised exes or speed optimised decompression code? :)
Or both? :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: So are we meant to be timing with self extracting space optimised exes or speed optimised decompression code? :)
Or both? :)
From a demo coder point of view, the self extracting values are completely useless. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Yup, speed optimised. I just didn't realise that wasn't what was bundled into the self-extracting archive. |
| |
enthusi
Registered: May 2004 Posts: 677 |
For games you often (and for tapes always) want in-place depacking of 'raw' packed data.
(load data to $1xxx-$2004, depack to $1000-$2000)
Not streamed and not self-executing.
Doynamite ist just great for that.
Many tools/loaders seem to be tuned for demo-usage, not so much games :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: For games you often (and for tapes always) want in-place depacking of 'raw' packed data.
(load data to $1xxx-$2004, depack to $1000-$2000)
Not streamed and not self-executing.
Doynamite ist just great for that.
Many tools/loaders seem to be tuned for demo-usage, not so much games :)
This is exactly what we do in demos always, isn't it? |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
I have sketched out a version of LZMV that decrunches in place, with the added finesse that it doesn't need any overlap - e.g. load to $1000-$xxxx, decrunch to $1000-$1fff. That 4 byte overlap always bugged me :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I have sketched out a version of LZMV that decrunches in place, with the added finesse that it doesn't need any overlap - e.g. load to $1000-$xxxx, decrunch to $1000-$1fff. That 4 byte overlap always bugged me :)
I suppose you have a scratchpad somewhere in memory, like in the decruncher code to handle it then? |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
just stumbled over this:
"Exactly what's required for half-cell Koalas, or there wouldn't be enough CPU time to update 37 bytes of d800 values every four lines :)"
you have 4 rasterlines to update d800, because it is only read on bad lines. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Oswaldjust stumbled over this:
"Exactly what's required for half-cell Koalas, or there wouldn't be enough CPU time to update 37 bytes of d800 values every four lines :)"
you have 4 rasterlines to update d800, because it is only read on bad lines.
Well, yes. 4 raster lines less the 40-43 cycles of DMA, 4 cycles to update d011, and 4 cycles to flip d018. That leaves at best 63*4-40-4-4=204 cycles, or 5.51 cycles per byte of d800, and that's assuming you've magically got useful values for d011 and d018 to hand. |
| |
Oswald
Registered: Apr 2002 Posts: 5095 |
true. its harder than I thought. (as always)
anyhow 16 values vs 37 writes its should fit easily with register reuse (ok that was what you said, alright:). also eod wasnt doing nowhere near 30 bytes when it was dynamic. now we know why :) |
| |
algorithm
Registered: May 2002 Posts: 705 |
Just doing the lda #$00 sta dxxx sta xxxx.... lda #$01 sta dxxx dxxx.. should give max cycle usage of (4*37 for the writes and (2*16) for the register preload = 180 cycles (and that is in the worst case scenario - if there are 16 different colors per line. Should still be ample time to update d800. |
| |
algorithm
Registered: May 2002 Posts: 705 |
To reduce it further. can do the delta writes in the second half of the 4 rasterlines.. only change colors that are different in the same position as directly above.. |
| |
algorithm
Registered: May 2002 Posts: 705 |
Actually should be able to do a delta every 4 lines..
Set up d800-d827 before the first badline..
1
when the badline occurs, do the delta update for the next 4 lines
force second badline 4 pixels down. at this stage, the updates to d800-d827 will occur
then write the delta to change back the values to how they were originally at d828-d84f
repeat from 1 |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
If x is preloaded for a whole frame one can easily write 16 values with only 8 register loads:
ldx #$0e
lda #$01
sax $d800
sta $d801
lda #$03
sax $d802
sta $d803
lda #$05
sax $d804
sta $d805
...
|
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: just stumbled over this:
"Exactly what's required for half-cell Koalas, or there wouldn't be enough CPU time to update 37 bytes of d800 values every four lines :)"
you have 4 rasterlines to update d800, because it is only read on bad lines.
Really interesting, but why on earth contaminate THIS thread? :D |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Because he can :-D |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Because he can :-D
I'm OK with that! :) |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Is this thread just being hijacked!? |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Is this thread just being hijacked!?
Oswald 'The Hijacker' |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting algorithmActually should be able to do a delta every 4 lines..
Set up d800-d827 before the first badline..
1
when the badline occurs, do the delta update for the next 4 lines
force second badline 4 pixels down. at this stage, the updates to d800-d827 will occur
then write the delta to change back the values to how they were originally at d828-d84f
repeat from 1
Exactly what I did. Still waiting on art from someone, or you'd have seen a release by now. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Quote: I suppose you have a scratchpad somewhere in memory, like in the decruncher code to handle it then?
No, no extra memory, that's the beauty of it, only the end address or a counter. I'll sketch something out to show how it works, there might be a corner case that I'm not thinking of. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
But wouldn't that only work if we end with a match then? As long as the last thing to do is copying a literal this would still fail as with packers like Doynax, BB and similar, as one would have some control stream bits/bytes before that to signal a literal and give a length. If the following literal bytes would not overlap, the control bits would have been overwritten by the preceeding action and things would fail. In case things end with a match, all would be fine, only the copying of the match would then kill the already read control bits and the decruncher would end afterwards. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Update time!!
All results from above combined, as well as some new ones
Thanks MagerValp for LZMV* & exo*, and Bitbreaker for Bitfire.
Also thanks to HCL for his patience, I've finally switched to the fast decompressor for BB.
Columns sorted by output rate (guessed positions for LZMPi and exoraw, the only ones I don't yet have times for):
filesizes
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax LZMPi exomem pu-f exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 11008 8020 4529 4151 4539 4329 4205 3230 3225 3322 3324 3265 3184 2988 3711 2988
2 4973 4314 3532 3309 3575 3423 3183 2490 2498 2513 2515 2512 2410 2225 3005 2241
3 3949 3498 2991 2617 3018 2972 2551 2093 2091 2098 2097 2108 1931 1808 2530 1817
4 7016 6456 4242 4085 4314 4225 4343 3614 3622 3682 3682 3617 3571 3442 3924 3454
5 34760 27647 25781 24895 26116 25210 23845 20516 20447 20530 20531 20405 20362 19715 21182 19631
6 31605 12511 11283 10923 11352 11614 10619 8896 8915 8998 9004 8904 8719 8322 9203 8337
7 20392 17295 12108 11285 12188 11445 11154 9145 9140 9241 9242 9289 9256 8765 9789 8751
8 5713 5407 4179 3916 3987 3936 3959 3187 3166 3165 3162 3132 3048 3081 3656 3059
9 8960 7986 6914 6896 6943 6572 6505 5487 5502 5491 5491 5430 5563 5304 6000 5295
filesize in %
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax LZMPi exomem pu-f exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 100.0 72.9 41.1 37.7 41.2 39.3 38.2 29.3 29.3 30.2 30.2 29.7 28.9 27.1 33.7 27.1
2 100.0 86.7 71.0 66.5 71.9 68.8 64.0 50.1 50.2 50.5 50.6 50.5 48.5 44.7 60.4 45.1
3 100.0 88.6 75.7 66.3 76.4 75.3 64.6 53.0 53.0 53.1 53.1 53.4 48.9 45.8 64.1 46.0
4 100.0 92.0 60.5 58.2 61.5 60.2 61.9 51.5 51.6 52.5 52.5 51.6 50.9 49.1 55.9 49.2
5 100.0 79.5 74.2 71.6 75.1 72.5 68.6 59.0 58.8 59.1 59.1 58.7 58.6 56.7 60.9 56.5
6 100.0 39.6 35.7 34.6 35.9 36.7 33.6 28.1 28.2 28.5 28.5 28.2 27.6 26.3 29.1 26.4
7 100.0 84.8 59.4 55.3 59.8 56.1 54.7 44.8 44.8 45.3 45.3 45.6 45.4 43.0 48.0 42.9
8 100.0 94.6 73.1 68.5 69.8 68.9 69.3 55.8 55.4 55.4 55.3 54.8 53.4 53.9 64.0 53.5
9 100.0 89.1 77.2 77.0 77.5 73.3 72.6 61.2 61.4 61.3 61.3 60.6 62.1 59.2 67.0 59.1
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
~ 100.0 80.9 63.1 59.5 63.2 61.3 58.6 48.1 48.1 48.4 48.4 48.1 47.1 45.1 53.7 45.1
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
number of frames to depack
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax LZMPi exomem pu-f exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 0.0 11.5 13.5 14.5 13.7 15.5 16.9 21.3 22.0 24.3 24.4 27.5 0.0 57.9 54.7 0.0
2 0.0 5.5 7.5 7.5 8.4 9.5 9.6 13.6 13.7 15.1 15.1 17.5 0.0 38.5 39.6 0.0
3 0.0 4.5 6.5 6.5 6.2 7.5 7.1 10.1 10.0 10.9 10.9 12.5 0.0 28.5 31.9 0.0
4 0.0 8.5 9.5 9.5 9.4 10.5 11.5 17.1 16.6 18.1 18.1 20.5 0.0 53.1 52.0 0.0
5 0.0 36.5 39.5 42.5 46.6 59.5 58.4 100.6 101.4 107.7 107.4 119.5 0.0 295.9 298.6 0.0
6 0.0 20.5 25.5 25.5 38.2 37.5 35.8 56.6 57.9 61.5 62.0 49.5 0.0 142.3 152.8 0.0
7 0.0 22.5 25.5 26.5 29.1 32.5 35.0 50.0 50.3 54.5 54.5 60.5 0.0 139.2 139.8 0.0
8 0.0 6.5 8.5 8.5 8.9 10.5 10.3 15.8 15.6 16.9 17.0 18.5 0.0 44.8 47.7 0.0
9 0.0 9.5 12.5 12.5 14.2 16.5 16.6 27.2 27.1 29.6 29.4 32.5 0.0 78.9 81.7 0.0
kilobytes output per second
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax LZMPi exomem pu-f exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 46.9 39.9 37.2 39.3 34.8 31.9 25.3 24.5 22.2 22.1 19.6 9.3 9.9
2 44.3 32.5 32.5 29.0 25.6 25.4 17.9 17.7 16.1 16.1 13.9 6.3 6.1
3 43.0 29.7 29.7 31.2 25.8 27.2 19.2 19.4 17.7 17.8 15.5 6.8 6.1
4 40.4 36.2 36.2 36.5 32.7 29.9 20.1 20.6 18.9 19.0 16.8 6.5 6.6
5 46.6 43.1 40.0 36.5 28.6 29.1 16.9 16.8 15.8 15.8 14.2 5.8 5.7
6 75.5 60.7 60.7 40.5 41.3 43.2 27.3 26.7 25.2 25.0 31.3 10.9 10.1
7 44.4 39.1 37.7 34.3 30.7 28.5 20.0 19.8 18.3 18.3 16.5 7.2 7.1
8 43.0 32.9 32.9 31.4 26.6 27.2 17.7 18.0 16.5 16.4 15.1 6.2 5.9
9 46.2 35.1 35.1 30.9 26.6 26.4 16.1 16.2 14.8 14.9 13.5 5.6 5.4
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
~ 0.0 47.8 38.8 38.0 34.4 30.3 29.9 20.1 20.0 18.4 18.4 17.4 0.0 7.2 7.0 0.0
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
cycles per byte consumed
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax LZMPi exomem pu-f exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 0.0 28.2 58.6 68.7 59.3 70.4 79.0 129.4 134.0 143.6 144.1 165.6 0.0 380.9 289.7 0.0
2 0.0 25.1 41.7 44.6 46.2 54.6 59.3 107.5 108.0 118.4 118.4 136.9 0.0 340.1 259.3 0.0
3 0.0 25.3 42.7 48.8 40.4 49.6 54.7 94.6 93.8 102.4 102.0 116.6 0.0 309.8 247.9 0.0
4 0.0 25.9 44.0 45.7 42.8 48.8 52.0 93.1 90.3 96.8 96.4 111.4 0.0 303.2 260.4 0.0
5 0.0 26.0 30.1 33.6 35.1 46.4 48.1 96.4 97.5 103.1 102.8 115.1 0.0 295.0 277.1 0.0
6 0.0 32.2 44.4 45.9 66.1 63.5 66.3 125.1 127.7 134.3 135.4 109.3 0.0 336.1 326.3 0.0
7 0.0 25.6 41.4 46.2 46.9 55.8 61.7 107.4 108.2 115.9 116.0 128.0 0.0 312.2 280.7 0.0
8 0.0 23.6 40.0 42.7 43.9 52.4 51.1 97.2 96.6 105.2 105.9 116.1 0.0 285.8 256.4 0.0
9 0.0 23.4 35.5 35.6 40.2 49.3 50.2 97.4 96.9 105.9 105.4 117.6 0.0 292.4 267.7 0.0
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
|
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Quote: But wouldn't that only work if we end with a match then? As long as the last thing to do is copying a literal this would still fail as with packers like Doynax, BB and similar, as one would have some control stream bits/bytes before that to signal a literal and give a length. If the following literal bytes would not overlap, the control bits would have been overwritten by the preceeding action and things would fail. In case things end with a match, all would be fine, only the copying of the match would then kill the already read control bits and the decruncher would end afterwards.
Yes, that's exactly the case. The trick is to ensure that the compressed stream ends with a match. If it ends with a literal, include the literal data without any control bytes at the correct location. Illustration here:
https://www.dropbox.com/s/9zttmj3avjm6krl/LZMV%20in%20place%20u..
Note that it packs backwards, so the start is at the top of memory and the ending literal "!" in the second example is at the bottom. Unpacking backwards enables dey/bpl copy loops for speed and lets you keep the load address. Speed for this should be on par with WVL's packers. |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
@ChristopherJam: Lovely! The pattern I see here is that there are three speed classes. You have the fast LZSS variants (WVL, LZMV), you have the efficient but slow LZ + Huffman (exo, pu), and in the middle you have nu/bb2/bitfire/doynax.
It also makes it obvious that LZMV (and maybe tc) needs speed optimization to be competitive :) |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Thanks MagerValp, that was the tiny bit of info that was missing, in deed one can leave the last literal just there without any more control bits :-D As an extra, when we are guaranteed to end with a match, it is also sufficient to do the end check on matches only, as before with that maximum run as an eof marker. It will however add a few more bytes to the decruncher, but speedwise it would be mostly as fast as a tay + beq as eof detection by having a cmp + beq (in most cases that is, as we would require to check on 16 bit in case)
I'll give this a try the next days :-) |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
Go for it! :)
And Excel just crashed on me while I was trying to make a pretty scatter plot of the results :/ |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Very slow LZMPi decompression cycles:
1.bin 1558777
2.bin 1044272
3.bin 813537
4.bin 1482657
5.bin 8471668
6.bin 4323352
7.bin 4044852
8.bin 1278691
9.bin 2307100
Hey, it's optimised for space not speed. :D |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
It's almost as if BB2 and Bitfire is the same code! :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Martin PiperVery slow LZMPi decompression cycles:
Hey, it's optimised for space not speed. :D
Ah, thank you!
First couple of speed related tables:
number of frames to depack
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax exomem pu-f LZMPi exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 0.0 11.5 13.5 14.5 13.7 15.5 16.9 21.3 22.0 24.3 24.4 27.5 57.9 54.7 79.3 0.0
2 0.0 5.5 7.5 7.5 8.4 9.5 9.6 13.6 13.7 15.1 15.1 17.5 38.5 39.6 53.1 0.0
3 0.0 4.5 6.5 6.5 6.2 7.5 7.1 10.1 10.0 10.9 10.9 12.5 28.5 31.9 41.4 0.0
4 0.0 8.5 9.5 9.5 9.4 10.5 11.5 17.1 16.6 18.1 18.1 20.5 53.1 52.0 75.4 0.0
5 0.0 36.5 39.5 42.5 46.6 59.5 58.4 100.6 101.4 107.7 107.4 119.5 295.9 298.6 431.0 0.0
6 0.0 20.5 25.5 25.5 38.2 37.5 35.8 56.6 57.9 61.5 62.0 49.5 142.3 152.8 220.0 0.0
7 0.0 22.5 25.5 26.5 29.1 32.5 35.0 50.0 50.3 54.5 54.5 60.5 139.2 139.8 205.8 0.0
8 0.0 6.5 8.5 8.5 8.9 10.5 10.3 15.8 15.6 16.9 17.0 18.5 44.8 47.7 65.1 0.0
9 0.0 9.5 12.5 12.5 14.2 16.5 16.6 27.2 27.1 29.6 29.4 32.5 78.9 81.7 117.4 0.0
kilobytes output per second
# bin rle wvl-f wvl-s LZMV25 tc LZMV4k rnucru nucrun bb2.0 bitfir doynax exomem pu-f LZMPi exoraw
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
1 46.9 39.9 37.2 39.3 34.8 31.9 25.3 24.5 22.2 22.1 19.6 9.3 9.9 6.8
2 44.3 32.5 32.5 29.0 25.6 25.4 17.9 17.7 16.1 16.1 13.9 6.3 6.1 4.6
3 43.0 29.7 29.7 31.2 25.8 27.2 19.2 19.4 17.7 17.8 15.5 6.8 6.1 4.7
4 40.4 36.2 36.2 36.5 32.7 29.9 20.1 20.6 18.9 19.0 16.8 6.5 6.6 4.6
5 46.6 43.1 40.0 36.5 28.6 29.1 16.9 16.8 15.8 15.8 14.2 5.8 5.7 3.9
6 75.5 60.7 60.7 40.5 41.3 43.2 27.3 26.7 25.2 25.0 31.3 10.9 10.1 7.0
7 44.4 39.1 37.7 34.3 30.7 28.5 20.0 19.8 18.3 18.3 16.5 7.2 7.1 4.9
8 43.0 32.9 32.9 31.4 26.6 27.2 17.7 18.0 16.5 16.4 15.1 6.2 5.9 4.3
9 46.2 35.1 35.1 30.9 26.6 26.4 16.1 16.2 14.8 14.9 13.5 5.6 5.4 3.7
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
~ 0.0 47.8 38.8 38.0 34.4 30.3 29.9 20.1 20.0 18.4 18.4 17.4 7.2 7.0 4.9 0.0
- ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------ ------
|
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting MagerValp@ChristopherJam: Lovely! The pattern I see here is that there are three speed classes. You have the fast LZSS variants (WVL, LZMV), you have the efficient but slow LZ + Huffman (exo, pu), and in the middle you have nu/bb2/bitfire/doynax.
It also makes it obvious that LZMV (and maybe tc) needs speed optimization to be competitive :)
Yes, I've done some quick graphs in Matplotlib (no lables yet) and there's some quite pronounced clustering.
exomem, pu-f & LZMPi are all below 10k per second, with exomem by far the smallest.
rle is also in a class of it's own; definite winner for speed if only light compression is required.
Quoting JackAsserIt's almost as if BB2 and Bitfire is the same code! :)
Yes, they are remarkably close! I suspect identical encoding for a start
|
| |
Burglar
Registered: Dec 2004 Posts: 1105 |
|
| |
Fungus
Registered: Sep 2002 Posts: 691 |
An how about a comparison of uncompressor size? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Size is definitely pertinent. nucrunch is hardly small at the moment.
nucrunch 342
rnucrunch 394
tinycrunch 151
|
| |
HCL
Registered: Feb 2003 Posts: 728 |
Comparing B2 and Bitfire.. The encoders are generating the same format, however not compatible. It has been the same since the first release of ByteBoozer1.0. ByteBoozer2.0 and Bitfire (and probably most others today) are searching through *all* possible ways to win bits, where as Bb1.0 has a faster gready algorithm. Remember it was running on the c64 also(!), but that's why B2 compresses better. B2 and Bitfire encoders do have rather different ways of achieving their goals, which explains the small differences we see in compression rate.
B2 and Bitfire decrunchers are more identical than the encoders. I was working like hell to find another code design than Bitbreaker's that still wasn't slower, but i guess he already examined all those possibilities before me :). So i guess we sort of agree on this rather optimal way of decrunching lz-code.. Now i'm very qurious on what new tricks Nucrunch is bringing up to the surface :). |
| |
MagerValp
Registered: Dec 2001 Posts: 1078 |
LZMV4k decruncher is 137 bytes, LZMV256 is 89 bytes. |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
Depends if you're including all the post BASIC setup and memory moving code. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Martin PiperDepends if you're including all the post BASIC setup and memory moving code.
Just the decrunch engine; assume something else has loaded the data/set $01/cleared interrupts etc and the source and destination areas placed so as to not need special handling. |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting MagerValpThe trick is to ensure that the compressed stream ends with a match. If it ends with a literal, include the literal data without any control bytes at the correct location.
That is absolutely brilliant! |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting lftQuoting MagerValpThe trick is to ensure that the compressed stream ends with a match. If it ends with a literal, include the literal data without any control bytes at the correct location.
That is absolutely brilliant!
It is, rather. MagerValp, how were you planning to terminate the stream?
My EOF's over a byte long, and checking end address seems slow if it's not necessarily on a page boundary. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting HCLComparing B2 and Bitfire.. The encoders are generating the same format, however not compatible. It has been the same since the first release of ByteBoozer1.0. ByteBoozer2.0 and Bitfire (and probably most others today) are searching through *all* possible ways to win bits, where as Bb1.0 has a faster gready algorithm. Remember it was running on the c64 also(!), but that's why B2 compresses better...
...Now i'm very qurious on what new tricks Nucrunch is bringing up to the surface :).
Impressive on the first one doing so well with a native compressor. But yes, I'm doing an exhaustive search too, with the full file rather than a sliding window. I pinched the idea of literals always being followed by matches from Doynax.
Source is out there now (NuCrunch 0.1), but I suspect most of the gains come from a combination of aggressive inlining, and splitting the bitstream into bits, bitpairs, and nybbles, so I can leave out a lot of the 'next byte?' tests. |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quote: Quoting lftQuoting MagerValpThe trick is to ensure that the compressed stream ends with a match. If it ends with a literal, include the literal data without any control bytes at the correct location.
That is absolutely brilliant!
It is, rather. MagerValp, how were you planning to terminate the stream?
My EOF's over a byte long, and checking end address seems slow if it's not necessarily on a page boundary.
Why use an EOF at all? Inlude how many bytes you should unpack in the header, for speed you could limit that to pages (the tradeoff would be that some data in the end would be overwritten)... |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
As already posted earlier, it is sufficient to do any check only in case of a match, be it with an eof marker or by an end address. The check is not more expensive and can even be made without clobbering carry when being done with eor, but takes more size codewise. I have made tests with zero overlap on bitnax, but fail yet on some files with overwriting myself :-/ |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Well yes, I was assuming only checking in case of a match
Intriguing that end addr test can be done at zero cost in cycles.
The clobbering cases - is it because of the match spec crossing a byte boundary? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Quoting Trash(the tradeoff would be that some data in the end would be overwritten)...
Oh yes, aware of that. I want an exact endpoint so I can implement MagerValp's end-alignment, where not even the crunched data passes the endpoint. |
| |
enthusi
Registered: May 2004 Posts: 677 |
The cruncher of choice obviously depends on the environment as well. In particular the speed with which the data is loaded into memory. Which fastloader, tape, cart...
Maybe someone adds certain loading-bitrates to the graphs? ;-)
TurboTape is ~ 450 Byte/sec.
1541 ROM: ~ 410 Byte/sec
Cart (lda $8000,x; sta $0800,x) ~ 160 KB/sec
(i.e. for Caren I now used page aligned ,x loops of RAW data which is still faster than RLE with byte-at-boundary-check. A dedicated RLE that never crosses banked in pages should be slighly faster) |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Quote: Well yes, I was assuming only checking in case of a match
Intriguing that end addr test can be done at zero cost in cycles.
The clobbering cases - is it because of the match spec crossing a byte boundary?
Well, it is possible that the read_pointer crosses the write_pointer already before the last literal/match, at least then, when we encode stuff with variable bitlengths. I did a working prototype (more tests pending) to get the cruncher to spit out a final binary blob as soon as this happens. Thus some files tend to become bigger, others where this works out, get usually 2 bytes smaller.
As for the decruncher a few things need to be changed, in my case i can forgo on the terminator check:
tay
beq .lz_end_of_file
Therefore i check when i add the match length to the destination pointer:
tya
adc .lz_dst
sta .lz_dst
bcc .lz_end_low
.lz_maximum
inc .lz_dst+1
lda .lz_dst
.lz_end_low
cmp #$00
.lz_skip_poll
bne .lz_poll
lda .lz_dst+1
.lz_end_hi
cmp #$00
bne .lz_skip_end
rts
.lz_poll
However you also can't rely anymore on the crunchers EOF test, so you need to continue loading after the cruncher returns to be sure any remaining binary blob is still loaded if it goes over a block boundary, until the loader terminates with an EOF. The later of course only applies if you depack with on the fly loading. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Ah of course - you're replacing the EOF check with something of (99.6% of the time) equal cost.
Interesting point about streaming loaders, but surely in that case you'd just be loading to a one page buffer rather than loading to destination address? |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
I never used a buffer but was always depacking in place, yet however still with a small overlap. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Here is a first experimental test. At least my usual benchmark stuff passes, also overlap for normal cases decreased, possible that the the old algorithm did something wrong (or the new does, haha)
As said, highly experimental, hurt yourself at your own risk. One might want to feed more files to the testsuite in the benchmark folder. The checksums + sizes and loadaddresses are not automated yet, some sed wizardry should help out there soon.
Now, with the end address check the non overflowing case on the dst-pointer addition is favoured and thus one cycled saved compared to the standard version. |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
There's one problem with the in place decompression arising:
If you happen to have the source data not in place but at some other location in mem, the last literal will not be copied, unless you include it as a literal sequence into the control stream. Then however, if a file ends with a literal, the end check is missed when it is only done upon matches. Means: one has to either add another bogus match which makes the depacked data 2 bytes longer, or test on both cases, what bloats and slows down the depacker. Also The old sentinel could be used for that. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Well if the source data is not in place but at some other location in mem, then you're not doing in place decompression :P |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
Sure, but one might want to use the same decruncher code for both purposes, for e.g. in a demo ;-) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Fair point! |
| |
Bitbreaker
Registered: Oct 2002 Posts: 508 |
For now, i decided to add the check back in, that ends decrunching after a 257 byte match is decoded. This decreases speed slightly, but i have made other optimizations to compensate for more than that. |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
LZMPi with -cu option:
https://github.com/martinpiper/C64Public/tree/master/bin
bytes
1 3,143
2 2,372
3 1,912
4 3,524
5 19,036
6 8,147
7 8,809
8 2,912
9 5,207 |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Oh, thanks for posting those.
How many cycles to decrunch? (or, speed in bytes per second produced or consumed, I can work out the rest from there) |
| |
Martin Piper
Registered: Nov 2007 Posts: 726 |
| file | cycles |
| 1.bin | 420115 |
| 2.bin | 264355 |
| 3.bin | 199139 |
| 4.bin | 311702 |
| 5.bin | 1930613 |
| 6.bin | 1063796 |
| 7.bin | 954401 |
| 8.bin | 288310 |
| 9.bin | 515944 |
I've been porting this to the Amiga for a self extracting executable, so got around to testing the cycle times for this data: https://github.com/martinpiper/C64Public/blob/master/Decompress.. |