CSDb User Forums


Forums > C64 Coding > Plaintext compression
2005-11-28 01:33
Radiant

Registered: Sep 2004
Posts: 639
Plaintext compression

As some of you know, I'm playing around a bit with creating a game...

Me being the person I am, I wanna include a not insubstantial amount of text in the game. A lot of text eats a lot of memory. Therefore I'm pondering different ways of compressing the text. Since the text will most likely only consist of A-Z, 0-9 and some miscellaneous delimiting characters, I figure I could easily do with just 64 different character codes, including some special codes for switching case, changing colour, etc. This would of course allow me to fit each character code into 6 bits, which is not immediately advantageous (at least to me), considering it would only result in a 1/4 memory usage reduction and add serious overhead to the print routine.
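
For illustration, a minimal cross-dev sketch of the 6-bit packing (Python; the code set is a made-up example, with the unused values left for the case/colour controls mentioned above):

  # Hypothetical code set: anything below len(ALPHABET) is a literal
  # character, the remaining 6-bit values are free for control codes.
  ALPHABET = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!?'()-"

  def pack6(text):
      """Pack 6-bit codes into bytes, MSB first."""
      bits, nbits, out = 0, 0, bytearray()
      for ch in text:
          bits = (bits << 6) | ALPHABET.index(ch)
          nbits += 6
          while nbits >= 8:
              nbits -= 8
              out.append((bits >> nbits) & 0xFF)
      if nbits:
          out.append((bits << (8 - nbits)) & 0xFF)  # zero-pad the tail
      return bytes(out)

  def unpack6(data, count):
      """What the print routine has to do: fetch 6 bits per character."""
      bits, nbits, pos, out = 0, 0, 0, []
      for _ in range(count):
          while nbits < 6:
              bits = (bits << 8) | data[pos]
              pos += 1
              nbits += 8
          nbits -= 6
          out.append(ALPHABET[(bits >> nbits) & 0x3F])
      return "".join(out)

On the 6502 the unpack loop turns into a handful of shifts and masks per character, which is exactly the print-routine overhead weighed against the 25% saving.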

Another thing I could do would be to analyze my writing and create a table containing the most frequent words I use, and just replace all occurrences of those words with table indexes. This could save me quite a bit of RAM, and has the advantage of very little overhead... However, it feels a bit unreliable.

RLE wouldn't be effective, since plain text rarely contains long runs of the same character.

So, which are your favourite methods for squeezing as much text as possible into, say, 4 KiB? :-) I'm really at a loss here, as you can see.
 
... 30 posts hidden ...
 
2005-11-30 17:06
enthusi

Registered: May 2004
Posts: 675
Though I'm not that experienced with compression, I think you'd do OK with both of your ideas: don't use the full 8 bits per char but some 6 bits per char, and you'd still have some values reserved for (long) words you often use, chosen manually. This is probably what all the profs would hate :) but it's easy to implement and provide, either with some native code or, for cross-dev, some small C (or whatever) routine to encode 8-bit to 6-bit plus a small lookup table. I think using 8 bits/char and a large lookup would get inefficient, since you'd need some large table (I'm so clever)...
This at least is how I'd do it (which is possibly why I don't do it :)

edit: I'd also recommend some streaming IRQ tape loader for the texts, so you can handle each bit individually anyway
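
For what it's worth, a rough sketch of that combined scheme (Python as the "whatever" cross-dev language; the word list is a made-up example, chosen manually as suggested):

  # Codes below len(CHARS) are literal characters; the spare 6-bit values
  # index a hand-picked table of frequent (long) words.
  CHARS = " ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789.,:;!?'()-"
  WORDS = ["THE ", "AND ", "WITH ", "DRAGON "]   # chosen manually
  BASE = len(CHARS)                              # first word-token code
  assert BASE + len(WORDS) <= 64                 # must all fit in 6 bits

  def tokenize(text):
      """Yield one 6-bit code per character, or one per matched word."""
      i = 0
      while i < len(text):
          for w, word in enumerate(WORDS):
              if text.startswith(word, i):
                  yield BASE + w
                  i += len(word)
                  break
          else:
              yield CHARS.index(text[i])
              i += 1

The resulting code stream can then be bit-packed 6-to-8 exactly as in the earlier sketch.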
2005-11-30 17:13
Steppe

Registered: Jan 2002
Posts: 1510
streaming-irq ->TAPE<- loader ??? O_o
2005-11-30 17:24
TNT
Account closed

Registered: Oct 2004
Posts: 189
Huffman is randomly accessible unless it's adaptive; you just need the bit address of the starting point. With an even distribution that saves one bit per seek position (versus byte-aligned seek positions).

Huffman is not only "clean", it also has quite low memory requirements and it's seekable. LZ(W) methods need to decompress from the beginning (with the exception of LZW once the tree is full and hasn't been discarded, but even then you need to decompress until that happens) and require quite a large history buffer to be effective.

Experiment :)
2005-12-01 08:44
HCL

Registered: Feb 2003
Posts: 717
I would probably go for a Huffman clone in this case. It has some advantages that you seem to want, like getChar(), and not being forced to 'depack' a whole bunch of text just to get one sentence. It's not the optimal packer, as people have already stated, but you have to consider your needs in different aspects.

I made a Huffman packer a long time ago, and it's actually been used in some demos that I've done as well :). What you'll get is three short lookup tables of ~64 bytes each in your case, and the depacking code is about 15 instructions, so it's all very nice and tidy. Just like the profs want it, rite ;). Just call me if you're interested.
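
HCL's exact packer isn't shown here, but a canonical Huffman decoder of that general shape is easy to sketch (Python standing in for the 6502 code; the table names are mine, not HCL's):

  def canonical_tables(code_lengths):
      """code_lengths: {symbol: bit length} from any Huffman build.
      Returns the three small tables the decoder needs."""
      maxlen = max(code_lengths.values())
      counts = [0] * (maxlen + 1)          # codes per bit length
      for l in code_lengths.values():
          counts[l] += 1
      symbols = sorted(code_lengths, key=lambda s: (code_lengths[s], s))
      first = [0] * (maxlen + 1)           # first canonical code per length
      code = 0
      for l in range(1, maxlen + 1):
          first[l] = code
          code = (code + counts[l]) << 1
      return counts, first, symbols

  def getchar(nextbit, counts, first, symbols):
      """Decode one symbol; nextbit() delivers the stream bit by bit."""
      code, index, length = 0, 0, 0
      while True:
          code = (code << 1) | nextbit()
          length += 1
          offset = code - first[length]
          if offset < counts[length]:      # code belongs to this length
              return symbols[index + offset]
          index += counts[length]          # skip symbols of shorter codes

For a 64-symbol alphabet all three tables stay in the ~64-byte range, and since every string has a well-defined starting bit address, this also gives the random access TNT describes.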
2007-08-20 13:44
polonus
Account closed

Registered: Aug 2007
Posts: 9
The best method in my opinion would be like this:

1. Write all the text you want to include.
2. Build a dictionary of words (only those longer than 3 letters; omit digits, punctuation etc.).
3. All the base letters can be encoded using 5 bits.
4. Re-encode the dictionary words as sequences of 5 bits per letter. For example, a 6-letter word would occupy only 30 bits, so let's say 4 bytes, roughly 1/3 shorter in memory. And that's the dictionary :)
5. The encoded text keeps all single punctuation signs, digits and words up to 3 letters long as they are. Otherwise (we have a longer word on track ;) ) the word is decoded via the dictionary from its 5-bit packed form. The token can be a two-byte value with the highest bit turned on.

It is quite a good method, verified to work by me a long time ago. It is faaast :D
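
A rough Python model of this scheme, under my own reading of it (function names and the exact token format are illustrative):

  def pack_word(word):
      """Pack an A-Z word at 5 bits per letter, MSB first: this is the
      dictionary storage, e.g. a 6-letter word -> 30 bits -> 4 bytes."""
      bits = 0
      for ch in word:
          bits = (bits << 5) | (ord(ch) - ord('A'))
      nbits = 5 * len(word)
      pad = (-nbits) % 8                       # zero-pad to a byte boundary
      return (bits << pad).to_bytes((nbits + pad) // 8, 'big')

  def encode(words, dictionary):
      """words: the text pre-split into tokens; a toy model that ignores
      punctuation and case handling."""
      out = bytearray()
      for word in words:
          if len(word) > 3 and word in dictionary:
              token = 0x8000 | dictionary.index(word)
              out += token.to_bytes(2, 'big')  # high bit marks a token
          else:
              out += word.encode('ascii')      # short words stay literal
          out.append(ord(' '))
      return bytes(out[:-1])

Since plain character codes stay below 0x80, the decoder can tell a two-byte token from literal text by its top bit alone.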
2007-08-21 07:06
ChristopherJam

Registered: Aug 2004
Posts: 1380
I don't think the 'best' method is determinable without seeing the text you wish to compress.

Why not make a compression challenge, where you publish a sample script and an API (e.g. jsr to the decoder with Y containing the ID of the string to decrunch; the decoder must call your 'putchar' with the char in A for each character decoded; putchar preserves A, X and Y but not the flags), and offer kudos to the person who can store the text, decoder and working space in the least memory?
2007-08-21 12:27
polonus
Account closed

Registered: Aug 2007
Posts: 9
Well, we don't know anything more specific about the text to be encoded. Is it a single long text file or a lot of strings (e.g. a text adventure)? Since we don't know, it would be wise to consider a memory-saving solution:

Input:

A text file with some special characters used within the text flow to mark text segments (e.g. dialogues or bigger parts), like this: text1[#]text2[#]text3 ... and so on, where [#] stands for the special code.

Encoding:

The output of the encoding should be something like a music routine, with program, dictionary and data all generated together, plus, probably in a separate file, an index of all text parts with their numbers so they can be recalled later.

Using:

The user provides a text number (as per the known index file) and initialises the 'text-player', then calls the DecodeNextCharacter routine from the text-player as many times as there are characters in the original (decoded) text, until EOF is received. This solution saves memory, since no temporary space is needed to 'deflate' the text, God knows how long it might be ;) It is similar to binary read-by-byte routines used when reading from floppy or the like.

This way you could have the whole packed solution as a text-player with a minimal memory footprint while decoding, which is important when considering long texts in one go.
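
A minimal Python model of that text-player interface (the real thing would keep its state in zero page; the decode step here is just a placeholder for the actual dictionary/Huffman routine):

  def text_player(packed, index, text_number):
      """Yield one decoded character per call until the terminator."""
      pos = index[text_number]             # index file: start of each text
      while packed[pos] != 0:              # 0 plays the role of the [#] code
          yield chr(packed[pos])           # placeholder for real decoding
          pos += 1

  # Usage, mirroring the DecodeNextCharacter loop:
  #   for ch in text_player(data, index, 3):
  #       putchar(ch)

No deflate buffer is needed; each call produces exactly one character, just like a read-byte routine on the floppy.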

It's time to download a C64 emulator for my Mac, I suppose :D

2007-08-23 10:38
johncl

Registered: Aug 2007
Posts: 37
Interesting thread. I have always wanted to tinker with a self-made text compression engine. I remember when I made my first big text adventure on the C64, with some 110 rooms and lots of objects, I eventually ran into "Out of Memory Error". Of course, back then everything was written in BASIC.

Intrigued by the thread, I started experimenting a bit, with the initial requirement that the decoder has to be as simple and as fast as possible.

My initial thought was that all text contains a lot of spaces, and that a pretty full character set including upper and lower case chars, numbers and some symbols could easily fit in 7 bits. So a simple algorithm would be to reserve the last bit as an indication that a space follows the letter. This reduces my sample 9525-byte text file to 8040 bytes. Not much, but at least something.

My second thought was the obvious dictionary. Since the number of distinct chars used in my sample text is not 128 but actually 78, I could use the free char indexes for dictionary lookups. Naturally the selection of words is based on the most used ones; my sample has e.g. "the", "to", "and", "of" and "in" as frequent words. By using every free character index and replacing it with one of the most used words plus the 1-bit space flag, my file is reduced to 6295 bytes (from 9525). This size does not include the dictionary, though that is only 100 to 200 bytes.

I then thought that the English language consists of many typical endings like "ing", "ion", "ed", "er". I reserved 6 spots in the dictionary for some common endings, and this further reduced the data to 5942 bytes.

The decoding involved here is very simple: read a char, AND with 128 to test the space flag, do a basic lookup to see if the char index is a dictionary entry (full word or ending), print all others literally, and finally, if bit 7 was set, print a space after the char.
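
A Python rendering of that decode loop, under my reading of the scheme (the dictionary layout is my own toy illustration):

  # Assumed byte layout: bit 7 = "space follows"; the low 7 bits are either
  # a printable char code or one of the otherwise-unused indexes, which
  # select an entry in the word/ending dictionary.
  DICTIONARY = {0x01: "the", 0x02: "and", 0x03: "ing", 0x04: "ion"}  # toy

  def decode(data, putchar):
      for byte in data:
          code = byte & 0x7F               # strip the space flag
          if code in DICTIONARY:           # dictionary hit: word or ending
              for ch in DICTIONARY[code]:
                  putchar(ch)
          else:
              putchar(chr(code))           # plain printable character
          if byte & 0x80:                  # bit 7 set: emit the space
              putchar(' ')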

This can naturally be improved if you don't need lower case chars: you could have rules that the first char of a text is always upper case, and the same applies to the first char after a period. You would need special handling for names, but that should be easy now that you have 2 bits free per char for 4 special rules. The idea is to reduce the space the most common text takes and accept some overhead on the exceptions.

I wanted to avoid a lot of shifting etc. in my algorithm, so this was an OK tradeoff. If I compress the same test text with RAR at best compression, I get a 3578-byte file, so there is still some way to go! :)
2007-08-23 10:50
Marauder/GSS
Account closed

Registered: Jul 2006
Posts: 224
Yeah, this thread reminds me of my good old "Word Cruncher", done under Visage Studios. Has anyone seen it? I still miss it...

Actually I don't remember exactly how my compression algorithm worked (it would be nice to have a peek at it again), but I know it used several passes to build up a dictionary, and words were replaced by control codes (entry IDs).

If I remember correctly, the first pass analyzed the scrolltext and checked how often words occurred in it.
The next pass(es) replaced the most frequent words (starting with the longer words first, then the shorter ones, down to a minimum of 3 chars)... something like that.

The decompression routine worked in real time, calling a function to get the next char of your scrolltext, so you didn't need to decompress it completely...

Has anyone seen this "Word Cruncher"? It must have been done around '90-'91...
2007-08-23 12:34
WVL

Registered: Mar 2002
Posts: 886
Some questions :

- How is the complete text stored in memory right now, and how do you know where certain bits of text begin?

The reason I'm asking is that I made a sort of LZ cruncher a while ago that doesn't use the _depacked_ data as a dictionary, but instead uses the _packed_ data as a dictionary (I call it LZW-VL ;D). It's easy to understand the advantage if you want pseudo-random access ;) You can add markers to the original file to let the packer know where the beginning of a data chunk is.

The packer will shape the output file in such a way that all the separate data chunks are immediately accessible, at the cost of a pointer table... You simply get the start address of the data chunk you want decompressed from the table and start decompressing...
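
Whatever the cruncher, the random-access part is just a pointer table over independently decompressible chunks. A sketch (Python; decompress_chunk stands in for the actual LZ decoder):

  def build_pointer_table(packed_chunks):
      """Concatenate crunched chunks and record each start offset."""
      table, blob, offset = [], bytearray(), 0
      for chunk in packed_chunks:
          table.append(offset)
          blob += chunk
          offset += len(chunk)
      return table, bytes(blob)

  def get_text(blob, table, n, decompress_chunk):
      """Random access: jump straight to chunk n and decrunch only it."""
      start = table[n]
      end = table[n + 1] if n + 1 < len(table) else len(blob)
      return decompress_chunk(blob[start:end])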

Can you send me a bin file so I can check my packer against it? :) Would be nice to test :D

edit :

Btw, I'm wondering if 4000 bytes of text is enough to get decent compression... There aren't too many bytes to work with... but maybe :)