| |
ChristopherJam
Registered: Aug 2004 Posts: 1423 |
On the relative pros and cons of various RLE schemes
So over in Native crunch/decrunch code. I just suggested encoding
abbcdddeffffg as
abb{0}cdd{1}eff{2}g
Raistlin kindly liked the idea, Krill quite sensibly didn't want us to hijack the post and suggested a new post for the new topic. so... Have at it! |
|
... 38 posts hidden. Click here to view all posts.... |
| |
Krill
Registered: Apr 2002 Posts: 3066 |
Quoting tlrbut here you'll be vulnerable to something not too uncommon. I think for a run count represented with a regular byte, the escape code method is going to be smaller than both this and the lit/run chunk with count method for almost all normal c64 input. As RLE isn't that crunchy anyways, performance should be the main regard (where crunchiness plays a rather weak role, but it still has some impact).
Some test corpus wouldn't hurt, ofc. Maybe the good old Bitfire benchmark files? |
| |
CyberBrain Administrator
Posts: 392 |
<Post edited by CyberBrain on 22/4-2024 23:52>
Cool, thank you, tlr!
Ah, ok, i can see how the "Escape byte" representation doesn't have to waste anything for 2-byte repeats (but requires an extra pass over the input data to find the least used byte when packing).
If understand the "lit/run chunk" representation correctly, it looks like it wastes a byte on each literal-chunk (so at least 2 bytes per 256 packed bytes, as far as i can tell). Looks like this could struggle a bit if there are many short repeats mixed in between non-repeating data. Edit: No, it couldn't - fake news.
Those are the 3 standard representations on C64? |
| |
tlr
Registered: Sep 2003 Posts: 1802 |
Quoting CyberBrainThose are the 3 standard representations on C64?
I'd say escape byte is by far the most common. I've seldom seen the lit/run chunk, but it's used in G-Packer V1.0 and also in Amiga IFF ILBM files IIRC. I've never seen the one cjam spoke of in the wild. |
| |
Bansai
Registered: Feb 2023 Posts: 51 |
In the absence of an escape byte that needs to be consulted every input byte for driving decisions in decoding, you could still have a *mode* escape of sorts where a reserved repeat count value or signaling bit in the length indicates something like "directly copy [arbitrary number of] bytes from source to dest" or there is a some form of encoded RLE skip/copy count that follows the reserved value. This puts more complexity in the compressor, but the decode is still fairly straightforward.
In an input stream if the value XX refers to the same value:
XX XX RC
if RC is positive, emit f(RC) more repeats of XX, for example, RC+1 is the repeat count, but this could grow to include large repeat counts to handle long runs of zeros, etc.
otherwise directly copy the number of bytes that is a function of -RC. |
| |
Raistlin
Registered: Mar 2007 Posts: 730 |
Isn’t it more common to use a bit to define “constant or not”?
$00-7F .. count of non-repeats ($01-$80)
$80-FF .. count of repeats ($04-$83)
There’s no point of course in having 1-2 repeats. I’m not sure there’s much point in 3 either, definitely not if breaking from a non-repeat segment to present them?
This also makes the depack code a little simpler as you simply check with BPL/BMI. |
| |
Fungus
Registered: Sep 2002 Posts: 736 |
I did a lot of RLE work in the past, and I found that output that gave the most equal sequences was best for crunching afterwards. Of course there are many different packer implementations but almost all of them either use one or the other method.
For my own I chose to use scan codes, using the least used byte in the file as the code to have minimum expansion. Some packers used more than one code to represent hard coded lengths which were scanned for in order to shorten the encoding.
Examples like
(code)(byte) for runs of 3
or
(code)(byte) where the code was a lookup table for various lengths
or simply multiple codes, but that always added more expansion.
I chose to limit runs to no more than 256 bytes, so for a length of say 4095 bytes the output would be
(code)(byte)(00)(code)(byte)(00)(code)(byte)(00)(code)(byte)(ff)
This gave an output that crunched better than other packers which would give
(code)(01=16bit)(lengthlow)(lengthhi)(byte)
or similar...
depends on use case of course.
I never did look into super zipper v8 or similar to see what they were doing though. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1423 |
Quoting CyberBrainThat looks like the encoding scheme from the codebase64 article? (https://codebase64.org/doku.php?id=base:rle_pack_unpack)
Oh! Yes, it's almost exactly that, except I didn't think to reserve one count value as an end-of-data marker.
Quote:Yes, that way of encoding it, wastes one byte for each 2-byte repeat (starts saving bytes for 4-byte repeats and longer), so could be bad if there are too many of those in the input data
Yeah it's a size loss if there are more 2-byte repeats than occurrences of a potential escape byte value (assuming counts are stored as whole bytes of course). But the implementation is nice and simple.
Quote:(I actually used that one in my latest demo to get room for more pics in memory, and this ultra simple RLE-based packing, together with delta-encoding, worked surprisingly (to me) well - it saved multiple $100s of bytes off some pics (they did have large background areas, but still), so the 2-repeats doesn't have to be a problem)
Nice!
Quote:It could be interesting to hear about the other ways of representing the bytes of RLE-encoding, which are used on the C64.
For example, what is this "the usual escape byte based scheme" you guys have mentioned in this and the other thread? And the counter representation?
Oh, things like (esc)(byte)(repeat count), where (esc) is the least commonly occurring byte in the source data. It does mean than any occurrences of the escape byte are inflated.
As for counter representation, most RLE implementations use a whole byte for the repeat count, but you can do things like read a smaller number of bits to make the repeat counts more compact, much as that slows decrunch a bit (you still keep the literals byte aligned of course). I do like that idea Fungus mentioned of having a few different escape codes for common repeat counts. |
| |
Fungus
Registered: Sep 2002 Posts: 736 |
In the case of my packer I used esc,esc to represent a literal. It still caused expansion, but it also was a common sequence that would be crunched by whatever lz compressor I was using at the time, either byteboiler/level ab2, or level crusher.
The esc,esc worked for the other encoding way too.
Lightemizer and Oneway's packers would be interesting to look into these days. Back then I was trying to understand algos and make my own rather than using other peoples code. But you don't get accused of ripping anymore so looking at stuff is OK these days. |
| |
Krill
Registered: Apr 2002 Posts: 3066 |
Quoting FungusBut you don't get accused of ripping anymore so looking at stuff is OK these days. Ripping and looking at other people's code are different things, though. =) |
| |
The Human Code Machine
Registered: Sep 2005 Posts: 112 |
Here's an exmaple of a byte packer using different pack codes: https://codebase64.org/doku.php?id=base:mdg_bytesmasher_0.04&s[.. |
Previous - 1 | 2 | 3 | 4 | 5 - Next |