Radiant
Registered: Sep 2004 Posts: 639
Plaintext compression
As some of you know, I'm playing around a bit with creating a game...
Me being the person I am, I want to include a not insubstantial amount of text in the game. A lot of text eats a lot of memory, so I'm pondering different ways of compressing it. Since the text will most likely consist only of A-Z, 0-9 and some miscellaneous delimiting characters, I figure I could easily get by with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which isn't immediately advantageous (at least to me), considering it would only cut memory usage by a quarter and add serious overhead to the print routine.
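For illustration, here is a minimal sketch in C of what packing and unpacking 6-bit codes could look like; the LSB-first bit order and the helper names are assumptions made up for the example, not something specified in the post:

#include <stdint.h>
#include <stddef.h>

/* Pack an array of 6-bit symbol codes (values 0-63) into a byte buffer,
   LSB first.  The output buffer must be zero-initialized beforehand.
   Returns the number of bytes written. */
size_t pack6(const uint8_t *codes, size_t n, uint8_t *out)
{
    size_t bitpos = 0;
    for (size_t i = 0; i < n; i++) {
        size_t   byte  = bitpos >> 3;
        unsigned shift = bitpos & 7;
        out[byte] |= (uint8_t)(codes[i] << shift);
        if (shift > 2)                        /* code straddles a byte boundary */
            out[byte + 1] |= (uint8_t)(codes[i] >> (8 - shift));
        bitpos += 6;
    }
    return (bitpos + 7) >> 3;
}

/* Fetch the i-th 6-bit code back out of the packed buffer. */
uint8_t unpack6(const uint8_t *in, size_t i)
{
    size_t   bitpos = i * 6;
    size_t   byte   = bitpos >> 3;
    unsigned shift  = bitpos & 7;
    unsigned v = in[byte] >> shift;
    if (shift > 2)                            /* upper bits live in the next byte */
        v |= (unsigned)in[byte + 1] << (8 - shift);
    return (uint8_t)(v & 0x3F);
}

On the 6510 the same thing boils down to a handful of shifts and ORs, and the byte-straddling cases are exactly where the print-routine overhead comes from.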
Another thing I could do would be to analyze my writing and create a table containing the most frequent words I use, and just replace all occurrences of those words with table indexes. This could save me quite a bit of RAM, and has the advantage of very little overhead... However, it feels a bit unreliable.
RLE wouldn't be effective.
So, what are your favourite methods for squeezing as much text as possible into, say, 4 KiB? :-) I'm really at a loss here, as you can see.
JackAsser
Registered: Jun 2002 Posts: 2014
The standard method of compressing plain text is to use Huffman trees. Basically, a Huffman tree creates a bit pattern for each symbol, where the most frequently used symbol gets the shortest pattern. This typically yields a 2x compression ratio; used together with your 6 bits of input, I'd say you'll end up with an 8x compression ratio. As you say yourself, the print routine will suffer substantial overhead, although I think a Huffman decoder can be written quite efficiently. Search the web for algorithm details.
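To give an idea of how small such a decoder can be, here is a rough sketch of a tree-walking Huffman decoder in C; the node layout, bit order and type names are assumptions for the example, and a real table would be generated offline from the text's symbol frequencies:

#include <stdint.h>
#include <stddef.h>

/* Huffman tree stored as parallel arrays: for node i, child[i][bit] is the
   index of the next node, or -1 if node i is a leaf; sym[i] holds the
   decoded symbol for leaf nodes. */
typedef struct {
    const int16_t (*child)[2];
    const uint8_t *sym;
    int            root;
} huff_tree;

/* Decode one symbol: walk from the root, consuming one bit per step.
   *bitpos is a running bit index into the packed bitstream. */
uint8_t huff_decode(const huff_tree *t, const uint8_t *bits, size_t *bitpos)
{
    int node = t->root;
    while (t->child[node][0] != -1) {                  /* stop at a leaf */
        unsigned bit = (bits[*bitpos >> 3] >> (*bitpos & 7)) & 1;
        node = t->child[node][bit];
        (*bitpos)++;
    }
    return t->sym[node];
}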
Oswald
Registered: Apr 2002 Posts: 5094
I think Huffman with or without the 6-bit encoding would give the same result, as Huffman in general already reduces the number of bits needed to encode the bytes to the possible minimum.
Anyway, IMHO Huffman, with all the bit shifting and binary tree lookups, is not really for the 6510. I advise LZW.
algorithm
Registered: May 2002 Posts: 705
Huffman does not give a good compression ratio for text, and neither does RLE. You could use LZ-based compression instead, which should give a compression ratio of around 3:1 to 5:1. You can always use a compression utility such as Exomizer or Pucrunch to compress the text and decompress specific sections when required.
Radiant
Registered: Sep 2004 Posts: 639
Yep, but I'd prefer to be able to decompress directly in the print routine, character for character, which rules out Exomizer and the likes.
LZW has the same problem. Of course, I could just store all the text in compressed chunks and then decompress each one to a temporary buffer as needed, but I wonder if there'd be that much gain from doing that, considering the size of the buffer etc... Perhaps it'll be necessary, we'll see.
Any more suggestions?
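For what it's worth, the compressed-chunks-plus-temporary-buffer idea above could look roughly like this in C; decompress(), chunk_offset and packed_text are hypothetical stand-ins for whatever decruncher and packing tool are actually used, and the buffer size is made up for the sketch:

#include <stdint.h>

#define CHUNK_BUF_SIZE 256               /* assumed size of one decoded chunk */

/* Hypothetical decruncher back end (LZSS, a streamed decruncher, etc.) --
   not from this thread, just a stand-in for the sketch. */
extern void decompress(const uint8_t *packed, uint8_t *out);

/* Offsets of each compressed chunk inside the packed text blob, emitted by
   whatever tool packs the text.  Both tables are placeholders here. */
extern const uint16_t chunk_offset[];
extern const uint8_t  packed_text[];

static uint8_t chunk_buf[CHUNK_BUF_SIZE];
static int     cached_chunk = -1;

/* Fetch a character by (chunk, index); a chunk is only decompressed into
   the temporary buffer when a different chunk is requested. */
uint8_t text_char(int chunk, uint8_t index)
{
    if (chunk != cached_chunk) {
        decompress(packed_text + chunk_offset[chunk], chunk_buf);
        cached_chunk = chunk;
    }
    return chunk_buf[index];
}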
algorithm
Registered: May 2002 Posts: 705
Maybe some simple dictionary-based compression.
Each word would be analysed and replaced with a marker, and the word itself placed in the dictionary; further occurrences of the same word would point to the same entry, and new words would be appended to the dictionary.
This would obviously require additional space for the dictionary as well as for the markers, but if you are planning on having a substantial amount of text, this may be the method to use. And it's extremely fast: just read the marker, which points into a lookup table holding the word.
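A minimal sketch in C of this kind of marker expansion; the dictionary contents and the 0x80 marker convention are made up for the example:

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

/* Toy word dictionary -- in practice it would be built by scanning the
   game's text for its most frequent words. */
static const char *dict[] = { "the ", "you ", "door ", "north " };

/* Decode a token stream: bytes below 0x80 are literal characters, bytes
   0x80 and up are indices into the word table. */
void print_tokens(const uint8_t *src, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        if (src[i] >= 0x80)
            fputs(dict[src[i] - 0x80], stdout);   /* expand dictionary word */
        else
            putchar(src[i]);                      /* plain character */
    }
}

int main(void)
{
    /* Expands to "the door to the north is open." */
    const uint8_t msg[] = { 0x80, 0x82, 't', 'o', ' ', 0x80, 0x83,
                            'i', 's', ' ', 'o', 'p', 'e', 'n', '.', '\n' };
    print_tokens(msg, sizeof msg);
    return 0;
}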
Compyx
Registered: Jan 2005 Posts: 631
Maybe you should take a look at bzip2: http://en.wikipedia.org/wiki/Bzip2
It's a fairly simple algorithm, it compresses text better than zip or gzip, and it's patent-free, unlike LZW. (See http://en.wikipedia.org/wiki/LZW#Patent_issues)
Scout
Registered: Dec 2002 Posts: 1570
Quote: Yep, but I'd prefer to be able to decompress directly in the print routine, character for character, which rules out Exomizer and the likes.
LZW has the same problem. Of course, I could just store all the text in compressed chunks and then decompress each one to a temporary buffer as needed, but I wonder if there'd be that much gain from doing that, considering the size of the buffer etc... Perhaps it'll be necessary, we'll see.
Any more suggestions?
That's not completely true.
You can still use Exomizer for this.
Check the streamed-decompression sources that are included in the package.
R.
---
-= Silicon Ltd. =-
http://forum.siliconlimited.com
Commodore 64 Scenemusic Podcast
http://8bitmayhem.blogspot.com/
MagerValp
Registered: Dec 2001 Posts: 1078
LZSS is dead simple, and works well for text. No algorithm will work well for short strings though, so you should probably look into coding the strings as blocks.
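To show how simple the decoder side is, here is a sketch of an LZSS decoder in C; the exact bitstream layout (flag bytes, 12-bit offsets, 4-bit lengths giving matches of 3-18 bytes) is an assumption for the example, not a fixed standard:

#include <stdint.h>
#include <stddef.h>

/* Minimal LZSS decoder, assuming a well-formed stream: a flag byte precedes
   each group of 8 items; a set bit means a literal byte follows, a clear
   bit means a 2-byte back-reference (12-bit offset, 4-bit length). */
size_t lzss_decode(const uint8_t *src, size_t srclen, uint8_t *dst)
{
    size_t si = 0, di = 0;
    while (si < srclen) {
        uint8_t flags = src[si++];
        for (int bit = 0; bit < 8 && si < srclen; bit++) {
            if (flags & (1u << bit)) {
                dst[di++] = src[si++];                     /* literal byte */
            } else {
                unsigned ofs = src[si] | ((src[si + 1] & 0xF0u) << 4);
                unsigned len = (src[si + 1] & 0x0Fu) + 3;
                si += 2;
                while (len--) {                            /* copy from history */
                    dst[di] = dst[di - ofs];
                    di++;
                }
            }
        }
    }
    return di;                                             /* decoded size */
}

Decoding is just copies and shifts; all the expensive work (finding the matches) happens at compression time, which is one reason LZSS suits a slow target.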
Radiant
Registered: Sep 2004 Posts: 639
Yeah, Burrows-Wheeler transform + RLE seems to be quite OK for text. A bit much overhead, though, and the LZ-based algorithms have that problem too.
Scout: But that's just streamed decompression, right? I'd probably need some kind of random access, which is probably hard (impossible?) to achieve with any non-trivial compression... I'll check it out.
Anyway, thanks for the suggestions, I'm a lot wiser now.
JackAsser
Registered: Jun 2002 Posts: 2014
By overhead, do you mean computation time, memory space for the depacker, laziness about implementing it, or a combo of all three? =D Just curious..