CSDb User Forums


Forums > C64 Coding > Plaintext compression
2005-11-28 01:33
Radiant

Registered: Sep 2004
Posts: 639
Plaintext compression

As some of you know, I'm playing around a bit with creating a game...

Me being the person I am, I wanna include a not insubstantial amount of text in the game. A lot of text eats a lot of memory. Therefore I'm pondering different ways of compressing the text. Since the text will most likely only consist of A-Z, 0-9 and some miscellaneous delimiting characters, I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageous (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.
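For illustration, here is a minimal host-side sketch of that 6-bit idea (Python; the alphabet mapping and the pack/unpack helpers are made up for the example, not Radiant's actual format):

```python
# Pack a sequence of 6-bit codes (0..63) into bytes, four codes per three
# bytes, and unpack them again. The alphabet mapping below is just an example.
ALPHABET = " abcdefghijklmnopqrstuvwxyz0123456789.,:;!?'\"-()"   # 48 of 64 codes used

def pack6(codes):
    bits, nbits, out = 0, 0, bytearray()
    for c in codes:
        bits = (bits << 6) | (c & 0x3F)
        nbits += 6
        while nbits >= 8:
            nbits -= 8
            out.append((bits >> nbits) & 0xFF)
    if nbits:
        out.append((bits << (8 - nbits)) & 0xFF)   # flush the last bits, left-aligned
    return bytes(out)

def unpack6(data, count):
    bits, nbits, codes = 0, 0, []
    for b in data:
        bits = (bits << 8) | b
        nbits += 8
        while nbits >= 6 and len(codes) < count:
            nbits -= 6
            codes.append((bits >> nbits) & 0x3F)
    return codes

text = "hello world"
codes = [ALPHABET.index(ch) for ch in text]
packed = pack6(codes)
assert [ALPHABET[c] for c in unpack6(packed, len(codes))] == list(text)
print(len(text), "chars ->", len(packed), "bytes")   # 11 chars -> 9 bytes
```

This is exactly the 1/4 reduction mentioned above: four characters end up in three bytes, at the price of the bit shifting in the decoder.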

Another thing I could do would be to analyze my writing and create a table containing the most frequent words I use, and just replace all references to those with table indexes. This could save me quite a bit of RAM, and has the advantage of very little overhead... However, it feels a bit unreliable.

RLE wouldn't be effective.

So, which are your favourite methods for squeezing as much text as possible into, say, 4 KiB? :-) I'm really at a loss here, as you can see.
2005-11-28 06:35
JackAsser

Registered: Jun 2002
Posts: 1990
The standard method of compressing plain text is to use Huffman trees. Basically, a Huffman tree assigns a bit pattern to each symbol, where the most frequently used symbols get the shortest bit patterns. This typically yields around a 2x compression ratio; used on your restricted 64-symbol alphabet, the variable-length codes pick up the 6-bit saving more or less automatically. As you say yourself, the print routine will suffer some overhead, but I think a Huffman decoder can be written quite efficiently. Search the web for algorithm details.
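A small host-side sketch of building such a code (Python; this is the standard heap-based textbook construction, not any particular C64 packer):

```python
import heapq
from collections import Counter

def huffman_codes(text):
    """Build a prefix code: the most frequent symbols get the shortest bit patterns."""
    freq = Counter(text)
    # heap entries are (frequency, tiebreak, tree); a tree is a symbol or a (left, right) pair
    heap = [(f, i, sym) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tiebreak = len(heap)
    while len(heap) > 1:
        f1, _, t1 = heapq.heappop(heap)
        f2, _, t2 = heapq.heappop(heap)
        heapq.heappush(heap, (f1 + f2, tiebreak, (t1, t2)))
        tiebreak += 1
    codes = {}
    def walk(tree, prefix):
        if isinstance(tree, tuple):
            walk(tree[0], prefix + "0")
            walk(tree[1], prefix + "1")
        else:
            codes[tree] = prefix or "0"     # degenerate single-symbol case
        return codes
    return walk(heap[0][2], "")

text = "this is just some example text for the huffman coder"
codes = huffman_codes(text)
bits = sum(len(codes[ch]) for ch in text)
print(f"{len(text) * 8} bits plain -> {bits} bits Huffman-coded")
```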
2005-11-28 09:02
Oswald

Registered: Apr 2002
Posts: 5025
I think Huffman with or without the 6-bit encoding would give the same result, as Huffman in general reduces the number of bits needed to encode the symbols to the minimum possible.

anyway imho huffman with all the bit shifting and binary tree lookups is not really for the 6510, I advise lzw.
2005-11-28 09:57
algorithm

Registered: May 2002
Posts: 702
Huffman does not give a good compression ratio for text, and neither does RLE. You could use LZ-based compression, which should give a compression ratio of around 3:1 to 5:1. Alternatively, you can always use a compression utility such as Exomizer or Pucrunch to compress the text and decompress specific sections when required.
2005-11-28 10:33
Radiant

Registered: Sep 2004
Posts: 639
Yep, but I'd prefer to be able to decompress directly in the print routine, character by character, which rules out Exomizer and the like.

LZW has the same problem. Of course, I could just store all the text in compressed chunks, and then decompress it to a temporary buffer as needed, but I wonder if there'd be that much gain from doing that, considering the size of the buffer etc... Perhaps it'll be necessary, we'll see.

Any more suggestions?
2005-11-28 11:01
algorithm

Registered: May 2002
Posts: 702
Maybe some simple dictionary based compression.

Each word would be analysed and replaced with a marker; the word itself would then be placed in the dictionary, further occurrences of the same word would point to the same entry, and new words would be added to the dictionary.

This would obviously require additional space for the dictionary as well as for the markers. If you are planning on having a substantial amount of text, then this method may be the one to use (and it's extremely fast: just read the marker, which points into a lookup table containing the word).
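A rough sketch of that dictionary/marker idea (Python, host-side; the one-byte markers with the high bit set are an assumed format, not a fixed one):

```python
# Dictionary coder sketch: every distinct word gets an index, and the encoded
# stream stores markers (indexes) instead of the words themselves.

def build_dictionary(text, max_words=128):
    words = sorted(set(text.split()), key=lambda w: -text.count(w))
    return words[:max_words]

def encode(text, dictionary):
    lookup = {w: i for i, w in enumerate(dictionary)}
    out = []
    for word in text.split():
        if word in lookup:
            out.append(0x80 | lookup[word])      # marker: high bit set + dictionary index
        else:
            out.extend(ord(c) for c in word)     # literal word, one byte per char
        out.append(ord(" "))
    return bytes(out[:-1])                        # drop the trailing space

def decode(data, dictionary):
    out = []
    for b in data:
        out.append(dictionary[b & 0x7F] if b & 0x80 else chr(b))
    return "".join(out)

text = "the cat sat on the mat and the cat slept"
dictionary = build_dictionary(text)
packed = encode(text, dictionary)
assert decode(packed, dictionary) == text
print(len(text), "->", len(packed), "bytes (dictionary itself not counted)")
```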

2005-11-28 12:28
Compyx

Registered: Jan 2005
Posts: 631
Maybe you should take a look at bzip2: http://en.wikipedia.org/wiki/Bzip2

It's a fairly simple algorithm and compresses text better than zip or gzip, and it's patent-free, unlike LZW. (See http://en.wikipedia.org/wiki/LZW#Patent_issues)
2005-11-28 13:19
Scout

Registered: Dec 2002
Posts: 1568
Quote: Yep, but I'd prefer to be able to decompress directly in the print routine, character by character, which rules out Exomizer and the like.

LZW has the same problem. Of course, I could just store all the text in compressed chunks, and then decompress it to a temporary buffer as needed, but I wonder if there'd be that much gain from doing that, considering the size of the buffer etc... Perhaps it'll be necessary, we'll see.

Any more suggestions?


That's not completely true.

You can still use Exomizer for this.
Check the sources for streamed decompression which are included in the package.

R.

---
-= Silicon Ltd. =-
http://forum.siliconlimited.com

Commodore 64 Scenemusic Podcast
http://8bitmayhem.blogspot.com/
2005-11-28 13:44
MagerValp

Registered: Dec 2001
Posts: 1059
LZSS is dead simple, and works well for text. No algorithm will work well for short strings though, so you should probably look into coding the strings as blocks.
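A compact LZSS sketch (Python, host-side; the window size, match limits and tuple-based token list are arbitrary assumptions standing in for real bit-packed output):

```python
# LZSS sketch: literals and (distance, length) back-references into the
# already-decoded text. Tuples stand in for flag bits and packed tokens.
WINDOW = 256
MIN_MATCH = 3
MAX_MATCH = 18

def lzss_encode(data):
    tokens, i = [], 0
    while i < len(data):
        best_len, best_dist = 0, 0
        for j in range(max(0, i - WINDOW), i):     # brute-force match search
            length = 0
            while (length < MAX_MATCH and i + length < len(data)
                   and data[j + length] == data[i + length]):
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= MIN_MATCH:
            tokens.append(("ref", best_dist, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens

def lzss_decode(tokens):
    out = bytearray()
    for t in tokens:
        if t[0] == "lit":
            out.append(t[1])
        else:
            _, dist, length = t
            for _ in range(length):                # byte-by-byte copy handles overlaps
                out.append(out[-dist])
    return bytes(out)

sample = b"it was the best of times, it was the worst of times"
tokens = lzss_encode(sample)
assert lzss_decode(tokens) == sample
refs = sum(1 for t in tokens if t[0] == "ref")
print(f"{len(sample)} bytes -> {len(tokens)} tokens ({refs} back-references)")
```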
2005-11-28 14:48
Radiant

Registered: Sep 2004
Posts: 639
Yeah, Burrows-Wheeler transform + RLE seems to be quite OK for text. A bit much overhead, though, just like the LZ-based algorithms.

Scout: But that's just streamed decompression, right? I'd probably need some kind of random access, though that is probably hard (impossible?) to obtain with any non-trivial compression... I'll check it out.

Anyway, thanks for the suggestions, I am a lot wiser now.
2005-11-28 15:07
JackAsser

Registered: Jun 2002
Posts: 1990
With overhead do you mean computation time, memory space for the depacker, laziness in implementing it, or a combo of all three? =D Just curious..
2005-11-28 15:13
Radiant

Registered: Sep 2004
Posts: 639
Mainly memory wastage and computational time, actually. :-) Though the laziness factor is not to be underestimated...
2005-11-28 15:36
Scout

Registered: Dec 2002
Posts: 1568
Quote:

Scout: But that's just streamed decompression, right? I'd probably need some kind of random access, though that is probably hard (impossible?) to obtain with any non-trivial compression... I'll check it out.


Well, you can compress your texts in segments.
In your program you set the decompress-pointers to the segment (in memory) of your choice and stream-decompress it.
The only thing you have to do is make a table with pointers to all of your compressed text segments and you'll be able to access them randomly.
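A small sketch of that layout (Python; zlib is only a stand-in for the real cruncher, since the point here is the per-segment compression plus the pointer table):

```python
import zlib   # placeholder for Exomizer-style stream compression

def pack_segments(segments):
    """Compress each segment on its own and build a table of start offsets."""
    blob, offsets = bytearray(), []
    for seg in segments:
        offsets.append(len(blob))
        blob += zlib.compress(seg.encode("ascii"))
    return bytes(blob), offsets

def get_segment(blob, offsets, n):
    """Random access: jump to the n-th pointer and stream-decompress from there."""
    end = offsets[n + 1] if n + 1 < len(offsets) else len(blob)
    return zlib.decompress(blob[offsets[n]:end]).decode("ascii")

texts = ["You are standing in an open field west of a white house.",
         "There is a small mailbox here.",
         "It is pitch black. You are likely to be eaten by a grue."]
blob, table = pack_segments(texts)
assert get_segment(blob, table, 2) == texts[2]
print("pointer table:", table)
```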

R.
---
-= Silicon Ltd. =-
http://forum.siliconlimited.com

Commodore 64 Scenemusic Podcast
http://8bitmayhem.blogspot.com/
2005-11-28 16:35
Radiant

Registered: Sep 2004
Posts: 639
Scout: Yeah, I figured that out after posting my initial reply. :-) I blame lack of sleep...

I'm probably gonna go with whatever method saves me the most space, including the depacking algorithm. If I don't have to code the depacker myself, that's a bonus I'll consider as well. :-)
2005-11-29 16:21
Ben
Account closed

Registered: Feb 2003
Posts: 163
Hmzz.. I did want to put some of my knowledge on display.. but.. you do not need random access, I'm quite sure, so why not dynamically update the range of texts with one of those good loaders around.. Rid yourself of all kinds of self-imposed constraints.. All you need is a couple of look-up tables..
2005-11-29 16:32
Monte Carlos

Registered: Jun 2004
Posts: 351
Perhaps use a dictionary and then compress the dictionary.
2005-11-29 17:10
algorithm

Registered: May 2002
Posts: 702
Personally I would go for the LZ method. It is very easy to use in the code: e.g. the decompressor is pointed at the compressed text for the required section, given a destination memory location, and then run.
2005-11-29 18:19
TNT
Account closed

Registered: Oct 2004
Posts: 189
The problem with LZ is that you need plaintext history, so you need quite a big buffer for it to be effective. LZW avoids that, but its data tree easily takes the same amount of memory.

I would use a combination of dictionary and Huffman, which allows random access without extensive memory usage.
2005-11-29 19:42
JackAsser

Registered: Jun 2002
Posts: 1990
Huffman is not randomly accessible unless you put resync markers in every now and then. You can never be sure that you've synchronized correctly onto the bitstream; each random combination of bits will produce non-verifiable results on the other end. I should add that this is as I remember it from school, at least, to reduce the risk of public flaming etc. =)
2005-11-30 10:54
Graham
Account closed

Registered: Dec 2002
Posts: 990
huffman is not a very efficient compression algo, it is just a "clean" algo so the profs at the universities love it :)

hack-algos like the LZ-clones perform much, much better than huffman, especially on text. LZW also gives good compression results, but needs way too much memory, so it cannot be used on the C64.
2005-11-30 11:14
Steppe

Registered: Jan 2002
Posts: 1510
Outsourcing the text to sequential files and loading them with an IRQ loader when needed is not an option, I guess?
2005-11-30 17:06
enthusi

Registered: May 2004
Posts: 675
Though I'm not that experienced with compression, I think you'd do OK with both of your ideas: don't use the full 8 bits per char but some 6 bits per char, and still keep some values reserved for (long) words you often use, chosen manually. This is probably what all the profs would hate :) but it's easy to implement and provide: either some native code or, for cross-dev, some small C (or whatever) routine to encode 8-bit to 6-bit plus a small lookup table. I think using 8 bits/char and a large lookup would get inefficient, since you'd need some large table (I'm so clever)...
This at least is how I'd do it (which is possibly why I don't do it :)

edit: I'd also recommend some streaming IRQ tape loader for texts, so you can handle each bit individually anyway
2005-11-30 17:13
Steppe

Registered: Jan 2002
Posts: 1510
streaming-irq ->TAPE<- loader ??? O_o
2005-11-30 17:24
TNT
Account closed

Registered: Oct 2004
Posts: 189
Huffman is randomly accessible unless it's adaptive; you just need the bit address of the starting point. With an even distribution that saves one bit per seek position (versus byte-aligned seek positions).

Huffman is not only "clean", it also has quite low memory requirements and it's seekable. LZ(W) methods need to decompress from the beginning (with the exception of LZW when the tree is full and hasn't been discarded, but even then you need to decompress until that happens) and require quite a large history buffer to be effective.

Experiment :)
2005-12-01 08:44
HCL

Registered: Feb 2003
Posts: 717
I would probably go for a Huffman clone in this case. It has some advantages that you seem to want, like getChar(), and not being forced to 'depack' a whole bunch of text just to get one sentence. It's not the optimal packer, as people have already stated, but you have to weigh your needs in different aspects.

I made a Huffman packer a long time ago; it's actually been used in some demos that I've done as well :). What you'll get is three short lookup tables of ~64 bytes each in your case, and the depacking code is about 15 instructions, so it's all very nice and tidy. Just like the profs want it, rite ;). Just call me if you're interested.
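One common way to get a Huffman decoder down to a few short tables is canonical Huffman: per code length you only need a count, a first-code value and a flat symbol table. A host-side sketch (Python; the toy code lengths are made up and this is not necessarily HCL's actual scheme):

```python
# Canonical Huffman sketch: three small tables drive the whole decode loop.
# A real packer derives the code lengths from symbol frequencies; these are
# hand-picked for a toy alphabet.
LENGTHS = {"e": 2, "t": 2, " ": 3, "a": 3, "o": 3, "n": 4, "s": 4}

def build_tables(lengths):
    symbols = sorted(lengths, key=lambda s: (lengths[s], s))
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for s in symbols:
        count[lengths[s]] += 1
    first = [0] * (max_len + 1)          # first canonical code of each length
    code = 0
    for ln in range(1, max_len + 1):
        first[ln] = code
        code = (code + count[ln]) << 1
    return symbols, count, first

def encode(text, symbols, count, first):
    codes, offset = {}, 0
    for ln in range(1, len(count)):
        for k in range(count[ln]):
            codes[symbols[offset + k]] = (first[ln] + k, ln)
        offset += count[ln]
    return "".join(format(c, f"0{n}b") for c, n in (codes[ch] for ch in text))

def decode(bits, symbols, count, first):
    out, code, ln, offset, i = [], 0, 0, 0, 0
    while i < len(bits):
        code = (code << 1) | int(bits[i])
        ln += 1
        i += 1
        if code - first[ln] < count[ln]:          # code is complete at this length
            out.append(symbols[offset + code - first[ln]])
            code, ln, offset = 0, 0, 0
        else:
            offset += count[ln]
    return "".join(out)

symbols, count, first = build_tables(LENGTHS)
msg = "notes on a tone test"
bits = encode(msg, symbols, count, first)
assert decode(bits, symbols, count, first) == msg
print(f"{len(msg) * 8} bits plain -> {len(bits)} bits")
```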
2007-08-20 13:44
polonus
Account closed

Registered: Aug 2007
Posts: 9
The best method in my opinion would be something like this:

1. Write all the text you want to include.
2. Build a dictionary of the words (only those longer than 3 letters; omit any digits, punctuation etc.).
3. All the base letters can be encoded using 5 bits.
4. Re-encode the words in the dictionary as 5 bits per letter. For example, a 6-letter word would occupy only 30 bits, so let's say 4 bytes - roughly a third shorter in memory. And that's the dictionary :)
5. The encoded text keeps all the single punctuation signs, digits and words of up to 3 letters as they are. Otherwise (we have a longer word on track ;) ) the word is decoded from its 5-bit packed form in the dictionary. The token can be a two-byte value with the highest bit turned on.

It is quite a good method, verified to work by me a long time ago. It is faaast :D
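A host-side sketch of the encoding side of that scheme (Python; the 5-bit letter packing and the two-byte token with the high bit set follow the description above, but the exact layout is an assumption and the decoder is left out):

```python
import re

def pack5(word):
    """Pack a lower-case a-z word into 5 bits per letter (1..26, 0 unused)."""
    bits = 0
    for ch in word:
        bits = (bits << 5) | (ord(ch) - ord("a") + 1)
    nbytes = (5 * len(word) + 7) // 8
    return (bits << (nbytes * 8 - 5 * len(word))).to_bytes(nbytes, "big")

def encode(text):
    tokens = re.findall(r"[a-z]+|[^a-z]", text)
    long_words = sorted({t for t in tokens if t.isalpha() and len(t) > 3})
    dictionary = [pack5(w) for w in long_words]       # 5-bit packed dictionary
    index = {w: i for i, w in enumerate(long_words)}
    out = bytearray()
    for t in tokens:
        if t in index:                                # long word -> two-byte token
            out += bytes([0x80 | (index[t] >> 8), index[t] & 0xFF])
        else:                                         # digits, punctuation, short words
            out += t.encode("ascii")
    return bytes(out), dictionary

text = "the quick brown fox jumps over the lazy dog, and the quick fox wins"
packed, dictionary = encode(text)
dict_bytes = sum(len(d) for d in dictionary)
print(f"{len(text)} -> {len(packed)} bytes of text + {dict_bytes} bytes of packed dictionary")
```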
2007-08-21 07:06
ChristopherJam

Registered: Aug 2004
Posts: 1380
I don't think the 'best' method is determinable without seeing the text you wish to compress.

Why not make a compression challenge, where you publish a sample script and an API (e.g. jsr to the decoder with Y containing the ID of the string to decrunch; the decoder must call your 'putchar' with the char in A for each character decoded; putchar preserves A, X and Y but not the flags), and offer kudos to the person who can store the text, decoder and working space in the least memory?
2007-08-21 12:27
polonus
Account closed

Registered: Aug 2007
Posts: 9
Well, we don't know anything more specific about the text to be encoded. Is it one long text file or a lot of strings (e.g. a text adventure)? Since we don't know that, it would be wise to consider a memory-saving solution:

Input:

The text file, with some special characters used within the text flow to mark text segments (i.e. dialogues or bigger parts). Like this: text1[#]text2[#]text3 .. and so on, where [#] stands for a special code.

Encoding:

The result of encoding should be something like a music routine, with program, dictionary and data all generated together - and, probably in a separate file, an index of all text parts with their numbers so they can be recalled later.

Using:

The user provides a text number (as per the known index file) and initialises the 'text-player', then calls the DecodeNextCharacter routine from the text-player once per character of the original (decoded) text until EOF is received. This solution saves memory, since no temporary space is needed to 'deflate' the text - God knows how long it might be ;) It is similar to the byte-by-byte read routines used when reading from floppy.

This way you get a packed solution in the form of a text-player with a minimal memory footprint while decoding, which is important when handling long texts in one go.
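A small sketch of that text-player interface (Python; plain 6-bit packing is used as the storage format purely to keep the example self-contained, and the alphabet and EOF code are assumptions): the caller pulls one character at a time and no buffer for the whole decoded text is ever needed.

```python
# "Text-player" sketch: an index of segment offsets plus a stateful
# decode-next-character routine; nothing is ever decompressed to a buffer.
ALPHABET = " abcdefghijklmnopqrstuvwxyz.,!?0123456789'\"-:;()"   # assumed mapping
EOF_CODE = 63                                                    # assumed end marker

class TextPlayer:
    def __init__(self, blob, index, number):
        self.blob = blob
        self.bitpos = index[number] * 8          # index stores byte offsets of segments

    def decode_next_character(self):
        code = 0
        for _ in range(6):                       # pull six bits from the packed stream
            byte = self.blob[self.bitpos >> 3]
            code = (code << 1) | ((byte >> (7 - (self.bitpos & 7))) & 1)
            self.bitpos += 1
        return None if code == EOF_CODE else ALPHABET[code]

def pack_segment(text):
    codes = [ALPHABET.index(ch) for ch in text] + [EOF_CODE]
    bits = "".join(format(c, "06b") for c in codes)
    bits += "0" * (-len(bits) % 8)               # pad to a byte boundary
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))

segments = ["you wake up in a dark room.", "a rusty key lies on the floor."]
blob, index = bytearray(), []
for s in segments:
    index.append(len(blob))
    blob += pack_segment(s)

player = TextPlayer(blob, index, 1)
out = []
while (ch := player.decode_next_character()) is not None:
    out.append(ch)
assert "".join(out) == segments[1]
```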

It is time to download a C64 emulator for my Mac, I suppose :D

2007-08-23 10:38
johncl

Registered: Aug 2007
Posts: 37
Interesting thread. I have always wanted to tinker with a self-made text compression engine. I remember when I made my first big text adventure on the C64, with some 110 rooms and lots of objects, I eventually ran into "Out of Memory Error". Of course, back then everything was written in BASIC.

Intrigued by the thread, I started experimenting a bit with the initial requirement that the decoder has to be as simple and as fast as possible.

My initial thought was that all text contains a lot of spaces, and that a pretty full character set including upper and lower case chars, numbers and some symbols could easily fit in 7 bits. So a simple algorithm would be to reserve the top bit as an indication that a space follows the character. This reduces my 9525-byte sample text file to 8040 bytes. Not much, but at least something.

My second thought was the obvious dictionary. Since the total number of chars used is not 128 but actually 78 in my sample text, I could use the free char indexes as dictionary lookups. Naturally the selection of words is based on the most used words; my sample has e.g. "the", "to", "and", "of" and "in" as frequent words. By using every free character index and replacing it with one of the most used words (plus the 1-bit space flag), my file is reduced to 6295 bytes (from 9525). This size does not include the dictionary though, although that is only 100 to 200 bytes.

Well, I then thought that the English language contains many typical endings like "ing", "ion", "ed", "er". So I reserved 6 spots in the dictionary for some common endings, and this further reduced the data to 5942 bytes.

The decoding involved here is very simple: read a char, AND with 128 to check the space flag, do a quick lookup to see if the remaining code is a dictionary entry (full word or ending), print it directly if it is not, and finally, if bit 7 was set, print a space after it.
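A host-side sketch of that decode step (Python; the character set and the dictionary/ending codes are made-up placeholders for whatever the encoder would actually pick, and the tiny greedy encoder exists only so the example round-trips):

```python
# Decoder sketch for the scheme above: bit 7 of each byte means "print a space
# afterwards"; codes outside the plain-character range are dictionary entries.
CHARSET = "abcdefghijklmnopqrstuvwxyz.,!?'"                # assumed plain characters
DICTIONARY = {64: "the", 65: "and", 66: "of", 67: "to",    # assumed full words
              68: "ing", 69: "ion", 70: "ed", 71: "er"}    # assumed endings

def decode(data):
    out = []
    for byte in data:
        code = byte & 0x7F                       # low 7 bits select the symbol
        out.append(DICTIONARY[code] if code in DICTIONARY else CHARSET[code])
        if byte & 0x80:                          # bit 7 set: a space follows
            out.append(" ")
    return "".join(out)

def encode(text):
    """Tiny reference encoder so the sketch round-trips (greedy, longest fragment first)."""
    frags = sorted(DICTIONARY.items(), key=lambda kv: -len(kv[1]))
    out, i = [], 0
    while i < len(text):
        if text[i] == " ":                       # fold the space into the previous byte
            out[-1] |= 0x80
            i += 1
            continue
        for code, frag in frags:
            if text.startswith(frag, i):
                out.append(code)
                i += len(frag)
                break
        else:
            out.append(CHARSET.index(text[i]))
            i += 1
    return bytes(out)

sample = "the printer paused and the coder kept on packing the text."
packed = encode(sample)
assert decode(packed) == sample
print(f"{len(sample)} chars -> {len(packed)} bytes")
```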

This can naturally be improved if you don't need lower case chars, or you could have rules such as the first char of a text always being upper case, and the same applying to the first char after a period. You would need special handling for names, but that should be easy now that you have 2 bits free per char for 4 special rules. The idea is to reduce the space the most common text takes and accept some overhead on the exceptions.

I wanted to avoid a lot of shifting etc. in my algorithm, so this was an OK tradeoff. If I compress the test text with RAR at its best compression setting I get a file of 3578 bytes, so there is still some way to go to get there! :)
2007-08-23 10:50
Marauder/GSS
Account closed

Registered: Jul 2006
Posts: 224
Yeah, this thread reminds me of my good old "Word Cruncher" that I did under Visage Studios. Has anyone seen it? I still miss it...

Actually I don't remember exactly how my compression algorithm worked (it would be nice to have a peek at it again), but I know it used several passes to build up a dictionary, and words were replaced by some control codes (entry IDs).

If I remember correctly, the first pass analyzed the scrolltext and checked how often words occurred in the text. The next pass(es) replaced the most frequent words (starting with the longer words first, then the smaller words, down to a minimum of 3 chars)... something like that.

The decompression routine worked in realtime, calling a function to get the next char of your scrolltext, so you didn't need to decompress it completely...

Has anyone seen this "Word Cruncher"? It must have been done around '90-'91...
2007-08-23 12:34
WVL

Registered: Mar 2002
Posts: 886
Some questions :

- how is the complete text stored in memory right now? And how do you know where certain bits of text begin?

The reason I'm asking is that I made a sort of LZ cruncher a while ago that doesn't use the _depacked_ data as a dictionary, but instead uses the _packed_ data as a dictionary (I call it LZW-VL ;D). It's easy to understand the advantage if you want pseudo-random access ;) You can add markers to the original file to let the packer know where the beginning of a data chunk is.

The packer will shape the output file in such a way that all the separate data chunks are immediately accessible, at the cost of a pointer table. You simply get the start address of the data chunk you want decompressed from the table and start decompressing...

Can you send me a bin file so I can check my packer against it? :) Would be nice to test :D

edit :

Btw, I'm wondering if 4000 bytes of text is enough to get decent compression.. There aren't too many bytes to work with.. but maybe :)
2007-08-23 14:28
johncl

Registered: Aug 2007
Posts: 37
You should be able to get 4000 bytes of text down to at least 3000.

I did some small enhancements to my compression test, and by weighting dictionary words and endings by their length and adding many more possible endings (but only using the most frequent ones), I was able to get my 9525-byte file down to 5640 bytes.

There is still some work I can do to optimize the selection of dictionary words versus endings, and I might do a test that reduces the data to lower case and adds special symbols to indicate upper case. This will free 25 more "slots" for dictionary lookups (if I use one code to indicate that the next letter is upper case). Depending on the type of text, a frequent pattern is a period followed by a space and an upper case letter; by reserving one char code for this I am quite sure I can get below 5k for my sample text. But I won't be happy until I have half the size and still a simple decoding algorithm. :)
2007-08-23 14:47
Mace

Registered: May 2002
Posts: 1799
Radiantx wrote:Quote:
(...) I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageous (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.

JohnCL wrote:Quote:
You should be able to get 4000 bytes of text down to at least 3000.

Yeah, *duh*... if you do the 6-bit trick, 4000 bytes become 3000 bytes :-)
So there's nothing fancy about achieving that result.
2007-08-23 15:37
JackAsser

Registered: Jun 2002
Posts: 1990
Quote: Radiantx wrote:Quote:
(...) I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageous (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.

JohnCL wrote:Quote:
You should be able to get 4000 bytes of text down to at least 3000.

Yeah, *duh*... if you do the 6-bit trick, 4000 bytes become 3000 bytes :-)
So there's nothing fancy about achieving that result.


"...add serious overhead to the print routine...."

I did exactly this (6bits / char) in Eye of the Beholder and if you're smart it doesn't add much to the depacker. You can have my decoder if you like, it's just 20 lines of ASM or so...
2007-08-23 17:17
Radiant

Registered: Sep 2004
Posts: 639
JackAsser: This thread is old and rotten. But thanks for offering. Kind of a non-problem at the moment though. :-)
2007-08-23 17:41
JackAsser

Registered: Jun 2002
Posts: 1990
Quote: JackAsser: This thread is old and rotten. But thanks for offering. Kind of a non-problem at the moment though. :-)

Yeah I suspected... ;D
2007-08-24 08:44
WVL

Registered: Mar 2002
Posts: 886
Oops :D Didn't notice how old the starting post was :D
2007-08-24 08:52
johncl

Registered: Aug 2007
Posts: 37
Yeah, sorry for bringing this thread back to the top. I just found the task an interesting one. :)

I guess 6 bits per char and the shifting isn't that expensive for a realtime print routine.
2007-08-24 11:49
Oswald

Registered: Apr 2002
Posts: 5025
"This will free 25 more "slots" for dictionary lookups"

I think you can achive better results with reserving more bits for dictionary lookup indexes.
2007-08-24 11:57
johncl

Registered: Aug 2007
Posts: 37
Yes, that's what I meant: by using a code to indicate that the next char should be upper case, you get 25 more character codes to use as dictionary lookups.

I tried this in my algorithm and it reduced the file size by a tiny bit, not much. It depends on the amount of upper case letters.

I'm experimenting with some special cases now, e.g. "period, space, uppercase" combos and encoding numbers in binary. Naturally these have to occur often enough to be worth their precious dictionary lookup spots in the map. But if encoding e.g. all numbers in binary pays off in total bytes, I can "free" another 9 character codes for dictionary usage instead.

I am quite sure this area has been analyzed to death by experts, but it's just fun to reinvent the wheel now and then. :)
2007-08-24 12:28
Oswald

Registered: Apr 2002
Posts: 5025
No, I mean I would use 8 or even more bits for the dictionary indexes ALONE. The benefit is HUGE, as already 3-letter words can be packed into 16 bits (if the dictionary index is 16 bits, that is). If I were to work on this I would definitely determine how many words my text uses and scale the number of bits in the dictionary index accordingly, thus putting almost all words into a dictionary. Though this is getting to sound a lot like LZW :D
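A quick sketch of that back-of-the-envelope idea (Python, purely illustrative; the tokenizer and the cost estimate are assumptions): give every distinct word a fixed-width index sized to the vocabulary and see roughly what the coded text plus the dictionary would cost.

```python
import math
import re

def word_dictionary_estimate(text):
    """Split text into word/separator tokens and estimate the cost of coding
    every token as a fixed-width dictionary index."""
    tokens = re.findall(r"[a-z']+|[^a-z']+", text.lower())
    vocab = sorted(set(tokens))
    index_bits = max(1, math.ceil(math.log2(len(vocab))))
    coded_bytes = (len(tokens) * index_bits + 7) // 8
    dict_bytes = sum(len(w) + 1 for w in vocab)        # word bytes + terminator
    return len(tokens), len(vocab), index_bits, coded_bytes + dict_bytes

text = ("you are standing in an open field west of a white house "
        "with a boarded front door. there is a small mailbox here.")
ntokens, nwords, bits, total = word_dictionary_estimate(text)
print(f"{len(text)} bytes of text, {nwords} distinct tokens, "
      f"{bits}-bit indexes -> about {total} bytes including the dictionary")
```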