[CSDb] - User Forums - Plaintext compression

You are not logged in - nap

CSDb User Forums

Forums > C64 Coding > Plaintext compression

2005-11-28 01:33

Radiant

Registered: Sep 2004
Posts: 639

Plaintext compression

As some of you know, I'm playing around a bit with creating a game...

Me being the person I am, I wanna include a not insubstantial amount of text in the game. A lot of text eats a lot of memory. Therefore I'm pondering different ways of compressing the text. Since the text will most likely only consist of A-Z, 0-9 and some miscellaneous delimiting characters, I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageable (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.

Another thing I could do would be to analyze my writing and create a table containing the most frequent words I use, and just replace all references to those with table indexes. This could save me quite a bit of RAM, and has the advantage of very little overhead... However, it feels a bit unreliable.

RLE wouldn't be effective.

So, which are your favourite methods for squeezing as much text as possible into, say, 4 KiB? :-) I'm really at a loss here, as you can see.

... 30 posts hidden. Click here to view all posts....

2007-08-23 14:28

johncl

Registered: Aug 2007
Posts: 37

You shuold be able to get 4000 bytes of text down to at least 3000.

I did some small enhancements to my compression test and by weighing dictionary words and endings with their length and add many more possible endings (but only use the most frequently used ones) I was able to get my 9525 bytes file down to 5640 bytes.

There is still some work I can do to optimize the selection of dictionary words versus endings and I might do a test to reduce data to lower case and add special symbols to indicate upper case. This will free 25 more "slots" for dictionary lookups (if I use one to indicate next letter is upper case). Depending on the type of text I realised that a frequent pattern is a period followed by a space and an upper case letter. By reserving one char code for this I am quite sure I can get below 5k for my sample text but I am not happy until I have half the size and still a simple decoding algorithm. :)

2007-08-23 14:47

Mace

Registered: May 2002
Posts: 1799

Radiantx wrote:Quote:

(...) I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageable (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.

JohnCL wrote:Quote:

You shuold be able to get 4000 bytes of text down to at least 3000.

Yeah, *duh*... if you do the 6-bit trick, 4000 bytes become 3000 bytes :-)
So there's nothing fancy about achieving that result.

2007-08-23 15:37

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: Radiantx wrote:Quote:
(...) I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageable (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.

JohnCL wrote:Quote:
You shuold be able to get 4000 bytes of text down to at least 3000.

Yeah, *duh*... if you do the 6-bit trick, 4000 bytes become 3000 bytes :-)
So there's nothing fancy about achieving that result.

"...add serious overhead to the print routine...."

I did exactly this (6bits / char) in Eye of the Beholder and if you're smart it doesn't add much to the depacker. You can have my decoder if you like, it's just 20 lines of ASM or so...

2007-08-23 17:17

Radiant

Registered: Sep 2004
Posts: 639

JackAsser: This thread is old and rotten. But thanks for offering. Kind of a non-problem at the moment though. :-)

2007-08-23 17:41

JackAsser

Registered: Jun 2002
Posts: 1989

Quote: JackAsser: This thread is old and rotten. But thanks for offering. Kind of a non-problem at the moment though. :-)

Yeah I suspected... ;D

2007-08-24 08:44

WVL

Registered: Mar 2002
Posts: 886

Oops :D didnt notice how old the starting post was :D

2007-08-24 08:52

johncl

Registered: Aug 2007
Posts: 37

Yeah, sorry for bringing this post back on the top. I just found the task an interesting one. :)

I guess 6bit per char and shifting isnt that expensive for a realtime print routine.

2007-08-24 11:49

Oswald

Registered: Apr 2002
Posts: 5017

"This will free 25 more "slots" for dictionary lookups"

I think you can achive better results with reserving more bits for dictionary lookup indexes.

2007-08-24 11:57

johncl

Registered: Aug 2007
Posts: 37

Yes thats what I meant, by using a code to indicate next char should be upper case, you have 25 more character codes to be used as dictionary lookups.

I tried this in my algorithm and it reduced the filesize by a tiny bit, not much. It depends on the amount of upper case letters.

Experimenting some with some special cases now, e.g. "period, space, uppercase" combos and encoding numbers binary. Naturally these has to be used enough for them to occupy precious dictionary lookup spots in the map. But if the total number of bytes used for encoding e.g all numbers binary, I can "free" another 9 character codes for dictionary usage instead.

I am quite sure this area has been analyzed to death by experts but its just fun to reinvent the wheel now and then. :)

2007-08-24 12:28

Oswald

Registered: Apr 2002
Posts: 5017

no, I mean I would use 8 or even more bits for dictionary indexes ALONE. the benefit is HUGE, as already 3 letter words can be packed into 16 bits (if the dictionary index is 16 bit that is). If I would work on this I would definitly determine how many words does my text use, and scale the nr of bits in the dictionary accordingly. thus putting almost all words into a dictionary. tho this is getting to sound lot like lzw :D

Previous - 1 | 2 | 3 | 4 - Next

Refresh

Subscribe to this thread: