| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Packer vs Code VM vs Code Generator
Has anybody experimented with, or know of, code VMs or generators?
Just thinking that to get code size down they might perform better than Exomizer et al.
I know of Sweet16 by Woz; any others? |
|
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
a decent code generator will always win. unless speed doesn't matter much, then you can gain a lot by using some kind of interpreted VM (most obvious example: the BASIC interpreter) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Has anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others?
The vectors in our 1991 demo use a VM to encode the generated code for each object. It outperforms a general cruncher in both size and speed, since it's tailored to the problem domain. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I think it's a stretch to call a data set that describes how to generate code a VM. but prove me wrong :) |
| |
Claus_2015
Registered: Oct 2012 Posts: 53 |
I would say this obviously strongly depends on the properties of the data that needs compression. If any kind of simple rule can be applied, code/data generation is probably the way to go. If the data consists of permutations of non-trivial functional elements a VM is probably good. If no suitable property can be found, a generic compressor might be the only way out. Different parts of the compressed data might have different properties, so combining all of these might be the best solution. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I think its stretching to call VM a data set that describes how to generate code. but prove me wrong :)
True, it's just opcodes that select data chunks, together with parameters for altering the data written. |
| |
White Flame
Registered: Sep 2002 Posts: 136 |
I've used this VM for some tool development. I've got a v2 that isn't pushed to github that I need to finish. Speed hit is under 10x slower than asm for degenerate cases, which is pretty good for this field, and it's intended for the user to add large heavyweight instructions to it which basically makes it as fast as you want.
http://acheronvm.github.io/acheronvm/ |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting oziphantomHas anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others? As has been said before, specially-made code generators do indeed outperform generic compressors in terms of compression ratio and decompression speed. Just look at any effect-driven small-size demo (4k, one-filers) worth its salt. (Using code generation first and then generic compression later.)
VMs aren't encountered as often in demos as they are in games (which you might be going for), as they usually execute much slower than native code, but have a smaller memory footprint. Of course, they may compile to native code at run-time, as in Jackasser's example, which is a form of code generation again. For efficient 6502 VM techniques, you might want to take a look at http://codebase64.org/doku.php?id=base:dispatch_on_a_byte . |
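The codebase64 technique linked above boils down to using each bytecode value as an index into a handler table. As a rough illustration only (Python, with made-up opcode values; a real 6502 version would use a jump table indexed by the fetched byte, as the article shows):

```python
# Illustrative sketch only: the "dispatch on a byte" idea in Python,
# with invented opcode values 0x00/0x01/0x02.

def make_vm():
    regs = {"a": 0}  # a single accumulator is enough for the sketch

    def op_lda(pc, code):        # 0x01: load immediate into A
        regs["a"] = code[pc]
        return pc + 1

    def op_add(pc, code):        # 0x02: add immediate to A
        regs["a"] = (regs["a"] + code[pc]) & 0xFF
        return pc + 1

    def op_halt(pc, code):       # 0x00: stop execution
        return None

    dispatch = {0x01: op_lda, 0x02: op_add, 0x00: op_halt}

    def run(code):
        pc = 0
        while pc is not None:
            op = code[pc]
            pc = dispatch[op](pc + 1, code)
        return regs["a"]

    return run

run = make_vm()
assert run(bytes([0x01, 0x10, 0x02, 0x05, 0x00])) == 0x15
```

On a 6502, the dispatch step itself costs only a handful of cycles when done via an indexed jump table, which is why this layout is the usual starting point for small VMs.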
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting OswaldI think its stretching to call VM a data set that describes how to generate code. but prove me wrong :) Quoting JackAsserTrue, it's just opcodes that selects data chunks together with parameters for altering the data written. Quoting WikipediaVirtual machines operate based on the computer architecture and functions of a real or hypothetical computer. I think Jackasser's example can be considered a VM on the semantic level, and that the implementation details are irrelevant, no matter how trivial.
Just like i think context-switching via interrupts on 6502 is a threading implementation and the KERNAL is an OS.
Really, the definitions are vague on those, and the general concepts can be applied to all computing machines, even at the lowest end. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting oziphantomHas anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others? As has been said before, specially-made code generators do indeed outperform generic compressors in terms of compression ratio and decompression speed. Just look at any effect-driven small-size demo (4k, one-filers) worth its salt. (Using code generation first and then generic compression later.)
VMs aren't encountered as often in demos as they are in games (which you might be going for), as they usually execute much slower than native code, but have a smaller memory footprint. Of course, they may compile to native code at run-time, as in Jackasser's example, which is a form of code generation again. For efficient 6502 VM techniques, you might want to take a look at http://codebase64.org/doku.php?id=base:dispatch_on_a_byte .
I'm thinking Game. Reset 4K ;) I want to do scrolling, and at the moment the smallest way is for me to make a simple code unroller to do LDA/STA pairs to shift CRAM and SCREEN ;)
But doing common things like
LDA #$00
STA $08
LDA #$80
STA $09
could be LS16 08 80 00 = 4 bytes vs 8
LDA $FC
CLC
ADC #$06
STA $FC
could be ADCI FC 06 = 3 bytes vs 7
IO needs at most 6 bits to say which register is needed, plus 2 bits to specify VIC, SID, CIA1 or CIA2, so you can make
LDA #$7F
STA $D015 = 5 bytes
into
ST VIC15 7F = 3 bytes
LDA $D015
AND #$7E
STA $D015 = 8 bytes
AND VIC15 7E = 3 bytes
I would think in a 4K game (maybe even a 1-meg game) you could fit most variables into 256 bytes. I normally put them at $0200, so I could have a bit in a command byte that specifies that the next store/load is variables, and hence the next byte is really $02XX
With flags a STZ or Store 1 would also probably be handy
to which STZ XX for zero page is 2 bytes. STZV XX for variables is also 2 bytes.
LDA CMP BXX chains could be dropped to LCBXX AA BB CC
Where the game would be unpacked with Exomizer, then the code packer would parse the code bytes to spit out the expanded code, then the game would be run.
But it could all work out worse in the end. Curious to know what people have already tried, what failed, etc. |
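To make the proposed savings concrete, here is a hypothetical decoder sketch in Python. The pseudo-op numbering (LS16 = $01, ADCI = $02) is invented for illustration; only the byte counts match the post above.

```python
# Hypothetical decoder for the LS16/ADCI pseudo-ops proposed above.
# Opcode values and encodings are assumptions, not an existing format.

LS16, ADCI = 0x01, 0x02

def expand(stream):
    out, i = bytearray(), 0
    while i < len(stream):
        op = stream[i]; i += 1
        if op == LS16:                      # LS16 zp hi lo: 4 bytes -> 8
            zp, hi, lo = stream[i:i + 3]; i += 3
            out += bytes([0xA9, lo, 0x85, zp,        # LDA #lo : STA zp
                          0xA9, hi, 0x85, zp + 1])   # LDA #hi : STA zp+1
        elif op == ADCI:                    # ADCI zp imm: 3 bytes -> 7
            zp, imm = stream[i:i + 2]; i += 2
            out += bytes([0xA5, zp, 0x18,            # LDA zp : CLC
                          0x69, imm, 0x85, zp])      # ADC #imm : STA zp
        else:
            raise ValueError("unknown pseudo-op")
    return bytes(out)

# "LS16 08 80 00" expands to LDA #$00:STA $08:LDA #$80:STA $09
assert expand(bytes([LS16, 0x08, 0x80, 0x00])) == \
    bytes([0xA9, 0x00, 0x85, 0x08, 0xA9, 0x80, 0x85, 0x09])
# "ADCI FC 06" expands to LDA $FC:CLC:ADC #$06:STA $FC
assert expand(bytes([ADCI, 0xFC, 0x06])) == \
    bytes([0xA5, 0xFC, 0x18, 0x69, 0x06, 0x85, 0xFC])
```

The decoder itself is the catch: on the C64 side it has to fit in the same binary, so the per-instruction savings must amortise its cost.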
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting OswaldI think its stretching to call VM a data set that describes how to generate code. but prove me wrong :) Quoting JackAsserTrue, it's just opcodes that selects data chunks together with parameters for altering the data written. Quoting WikipediaVirtual machines operate based on the computer architecture and functions of a real or hypothetical computer. I think Jackasser's example can be considered a VM on the semantic level, and that the implementation details are irrelevant, no matter how trivial.
Just like i think context-switching via interrupts on 6502 is a threading implementation and the KERNAL is an OS.
Really, the definitions are vague on those, and the general concepts can be applied to all computing machines, even at the lowest end.
To me, if
the code is run once, then it is a code generator;
the code reads data, makes code, runs it, comes back, reads more data, makes code, runs it... then it is a JIT VM |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and i thought you were about to implement some kind of script interpreter for a domain-specific language of your own devising.
To be honest, i don't think the code generation you have in mind for that scroll routine will get you very far, and the compression ratio seems to be above 50%. The standard approach for that is to have a template with some placeholders for values and addresses, which you will modify and copy many times to create a flat unrolled copy/render routine.
Quoting oziphantomWhere the game would be unpacked with exomiser, then the code packer would parse the code bytes to spit out the expanded code, then the game would be run.
But it could all work out worse in the end. Curious to know what people already had/have tried and failed etc. If you check the source that comes with Artefacts, you'll see that an approach that worked well for me was to split up the 6502 instruction stream into two separate streams, one for the op-codes and one for the arguments. This, along with code tailored with that in mind (you'd do LDA #$00 : LDX #$00 rather than LDA #$00 : TAX), gave me quite a plus on exomizer's compression ratio (plus i used Ninja's size-optimized exo decompressor). There are other techniques involved, such as packing function tables (sine curves etc.) lossily using polynomials, and the good old code generation. |
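The stream-splitting idea can be sketched like this (Python; my own reconstruction of the general technique, not the actual Artefacts code, and the length table only covers the opcodes used in the example):

```python
# Split a 6502 instruction stream into an op-code stream and an
# argument stream, so a generic cruncher sees two more regular
# sequences. A full implementation needs all 256 length entries.

LENGTH = {0xA9: 2, 0x8D: 3, 0xA2: 2, 0x85: 2, 0x60: 1}

def split(code):
    ops, args = bytearray(), bytearray()
    i = 0
    while i < len(code):
        n = LENGTH[code[i]]
        ops.append(code[i])           # op-code goes to one stream
        args += code[i + 1:i + n]     # operand bytes to the other
        i += n
    return bytes(ops), bytes(args)

def join(ops, args):                  # inverse, run after decompression
    out, j = bytearray(), 0
    for op in ops:
        n = LENGTH[op] - 1
        out.append(op)
        out += args[j:j + n]
        j += n
    return bytes(out)

code = bytes([0xA9, 0x7F, 0x8D, 0x15, 0xD0,    # LDA #$7F : STA $D015
              0xA9, 0x00, 0x8D, 0x27, 0xD0])   # LDA #$00 : STA $D027
ops, args = split(code)
assert ops == bytes([0xA9, 0x8D, 0xA9, 0x8D])  # highly repetitive now
assert join(ops, args) == code                 # lossless round trip
```

The split is lossless and size-neutral on its own; the gain comes entirely from the cruncher finding longer repeats in each stream.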
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Krill, you cut off the first part of the definition: "In computing, a virtual machine (VM) is an emulation of a given computer system"
Jackie's code generator doesn't emulate anything. It's a data set on how to generate code. How to add up and scale vectors to calculate vertices is not the emulation of a computer system, we should agree I think... |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and about exomizer... For a 4K, you'd probably get better results with ALZ64, which is a 6502-optimised LZMA variant. It's very slow, but not too slow for 4Ks. (I still need to check whether Artefacts is shorter using ALZ64 rather than exo.) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
oziphantom: I'd just collect often-used routines into subs and jsr them. it depends how tight you want it and how much time you have. if both are a lot, then what you suggest could work. |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Quote: Krill, you cut off the first part of the definition: "In computing, a virtual machine (VM) is an emulation of a given computer system"
Jackie's code generator doesnt emulate anything. Its a data set on how to generate code. How to add up and scale vectors to calculate vertices is not a computer system's emulation we should agree I think...
If it's turing complete I'd argue it's a crude VM, albeit running fixed code to generate the data set. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
What tlr said, but i think Turing completeness isn't necessary. You could build a VM for a stack machine (with exactly one stack) and it'd still be a VM. :) |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
I believe the correct term would be automaton. :)
Automata Theory
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
"For stack machines in automata theory, see pushdown automaton." - Both are correct, "PDA" is just less ambiguous without the context (which is clear here). :) |
| |
TWW
Registered: Jul 2009 Posts: 545 |
You guys need to get laid... |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting TWWYou guys need to get laid... How extraordinarily creative, smart and funny, yet dignified and mature. Surely, "TWW" is short for "The Witty Wisecracker"? I applaud you, good sir!
|
| |
enthusi
Registered: May 2004 Posts: 677 |
Stay tuned for the coming Above & Beyond #2 for an article by Ulli on this subject ;-) |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
I wish I could experience the warmth of a woman even once. I'm sure that would help me to let go of these.... feelings... and thoughts... about VM's.... and automatons.... |
| |
HCL
Registered: Feb 2003 Posts: 728 |
This thread is finally getting enjoyable :) Thank you! |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: This thread is finally getting enjoyable :) Thank you!
Speaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
Quote:CSDB Code score / number of kids ratio (C2K^-1)
LOL! i demand a chart for this on the front page |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
5.7 5.3 5.3 NaN 7.3 3.4 NaN |
| |
Scan
Registered: Dec 2015 Posts: 111 |
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN .....
Bat-maaaaannn!!! |
| |
TWW
Registered: Jul 2009 Posts: 545 |
Quote: Quoting TWWYou guys need to get laid... How extraordinarily creative, smart and funny, yet dignified and mature. Surely, "TWW" is short for "The Witty Wisecracker"? I applaud you, good sir!
Hahaha, I like juvenile humor, sue me. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Quote: Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D
Hmm.. exactly what i was about to reply also.. but for *some* reason i refused to push the "submit" button :P |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quote: Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D
Having children is nothing but proof that their mother is getting laid, not that you are... ;-) |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Mothers are sceners too. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
back to topic: reread ozi's post, and actually I find it a cool idea.
kickass would suit nicely for this, but one problem I see is how one would compile this so that addresses are correct after the code is blown up to normal size. illegals should be used for the pseudo-instructions. maybe I'd use a macro which adds enough nops after "own" instructions to keep the code size the same, so that addresses work out, then a preprocessor would remove the nops. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I've made macros, but the tass64 verbose option doesn't list when it "invokes" one. I was thinking you could assemble with the macros expanded; this gives you a labels file with all the correct placements. Then run through the listing and pack the macros.
Maybe a post-process that just hunts the code for patterns and then applies them as it finds them, with macros to "enforce" a pattern, is the best way to go. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswaldkickass would suit nicely for this, but So it's not suited nicely :) Really, it amazes me when people try to shoehorn in stuff that, in this case, the tool really wasn't built for (just a guess, though). You'd fare better with the standard approach of writing your own standalone encoder and decoder. However, i still reckon the gain is rather poor in comparison to other approaches, at least when having to provide the decoder in the same binary. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Thinking further about it, it may be a better approach to decrease code density rather than increase it.
As the program will ultimately be compressed with a general-purpose compressor anyways, you want to support it. In the case of code, that'd mean a more regular structure, with a lot of repeating or similar sequences.
So you'd "RISCify" the code, meaning to come up with an orthogonal set of instructions: INC a, INC x and INC y will all have the same op-code, but different arguments. Together with the separated op-code and argument streams i mentioned above, this might pay off.
Of course, again, there is a rather high initial cost for the decoder, which may or may not amortise in 4K, plus the code will probably perform (slightly?) worse compared to highly optimised hand-crafted native code. |
| |
Mixer
Registered: Apr 2008 Posts: 452 |
In my opinion, 1K, 4K and other small-space code challenges are more about redefining the problem than about making a language or VM.
I like Krill's idea of an opcode stream and an arguments stream. It is like having two stacks of data that one combines into a program. I don't think it has 4K benefits, but as a way to "stream" or VM programs it is interesting. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
It DID have benefits in Artefacts. And trust me, there was a lot of trial and error before i settled for that one (and some other techniques). :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
there's a lot more going on in artefacts than I'd thought, kudos :) |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I thought about simplistic code, and I think the macros will actually help there as well for just normal compression, but the issue I see is this:
LDA #7f
STA $d015
LDA #0
STA $d027
LDA #84
STA $43E7
At this point the compressor sees an LDA, a unique number, STA, a unique number, D0, LDA, a unique number, STA, a unique number, D0, LDA, a unique number, STA, a unique number, a unique number. I feel that by the time the compression system encodes the custom byte stream, then the 2-byte pairs, then back to custom, then 2 bytes again, it would actually consume more memory than taking the whole lot as a custom stream. Conventional wisdom says code compresses badly and data compresses well, so make more data.
Any code that is mostly the same will be JSR-optimised by the coder, so you just end up with the annoying sequences where it's not worth the extra instructions to jsr, as you end up with a loss.
+ lda $3
bne +
dec $4
+ dec $3
lda $6
bne +
dec $7
+ dec $6
So close, yet so far.
Which I think is proven by Krill's genius idea of splitting data and code this way you would get a nice run of
LDA STA LDA STA LDA STA and just pay for the 7f 15 d0 00 27 d0 84 e7 43
Whether or not the code expansion system pays off in 4K is hard to tell until all the chips are on the table. It might only be viable at 16K. I feel that for things like demos, with highly algorithmic effects and code, the split will pay off very well, as you do lots of similar code, while a game tends to have large amounts of unique code with random structure. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I wonder if IO register setting code is common enough to benefit from a pseudo-op that moves an eight bit immediate into $Dxxx, where xxx has a number of implicit zeros - or possibly even a series of N such operations
|
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Well you can get
STA SSRRRRRR Value
where SS is
00 VIC
01 SID
10 CIA 1
11 CIA 2
And R is the register
allowing you to pack any IO address into a byte. In a game context I don't think that will save you much. Typically the screen mode and charset registers are set and forget. The multiplexer will usually deal with the sprites in one function, and then you have a few D012/D019/D01A accesses for the IRQs. |
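The SSRRRRRR packing above can be sketched as a pair of helper functions (Python, purely illustrative):

```python
# Two bits select the chip, six bits the register, so any common
# IO address fits in one byte, as described in the posts above.

BASE = (0xD000, 0xD400, 0xDC00, 0xDD00)   # 00 VIC, 01 SID, 10 CIA1, 11 CIA2

def pack_io(addr):
    for ss, base in enumerate(BASE):
        if base <= addr < base + 0x40:    # 6 register bits = 64 addresses
            return (ss << 6) | (addr - base)
    raise ValueError("not a packable IO address")

def unpack_io(b):
    return BASE[b >> 6] + (b & 0x3F)

assert pack_io(0xD015) == 0x15                  # VIC sprite enable
assert unpack_io(pack_io(0xD418)) == 0xD418     # SID volume, round trip
```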
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I've been working under the impression that packing all the data near each other and the code together would help with compression.
so
CODE
DATA
is this misguided do you think? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Grouping is generally good.
Most crunchers encode "copy recent substring" with fewer bits the more recent the substring, so grouping all your code separate from your data means the cruncher doesn't have to look as far to find prior examples, and hence can use fewer bits for the copy offset. |
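As a toy model of that cost (not any specific cruncher's actual format; exomizer's real offset coding differs), assume an Elias-gamma-style code, where the bit cost of an offset grows with its magnitude:

```python
# Toy illustration of why grouping helps: with a gamma-style code,
# nearer matches cost fewer bits than distant ones.

from math import floor, log2

def gamma_bits(n):
    """Bits to encode the positive integer n in Elias gamma."""
    return 2 * floor(log2(n)) + 1

assert gamma_bits(1) == 1        # match 1 byte back
assert gamma_bits(200) == 15     # match within the same region
assert gamma_bits(6000) == 25    # match across an interleaved region
```

So moving a match from ~6000 bytes away to ~200 bytes away saves around 10 bits every time it is referenced, which is exactly what grouping code with code and data with data buys you.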
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting oziphantomit would actually consume more memory than taking the whole lot as a custom stream The two separate streams take exactly the same amount of memory as the conventional 6502 counterpart. Your example would translate to $a9,$8d,$a9,$8d,$a9,$8d
$7f,$15,$d0,$00,$27,$d0,$84,$e7,$43 and it's plain to see that the op-codes would compress well, and the data arguments too at some later point, when using the same literals and addresses many times.
Of course, these two chunks will have to be re-combined after decompression, which adds an initial cost of about $0180 or so bytes (which in my case amortised for the 4K, so overall i saved a few pages).
But please, do look at the Artefacts source code at some point to see what i did. |
| |
soci
Registered: Sep 2003 Posts: 480 |
Storing opcode lengths in a 64 byte bit-array, what a waste of space ;) But the use of (,x) to switch streams is smart. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
:D One of the few occasions where (ZP,X) actually comes in handy. Which is mostly whenever you have a number of "buckets" to select from AND need to push or pull stuff to and fro.
As for op-code lengths, there didn't seem to be a nice enough pattern to store them more efficiently. Talk about a non-orthogonal instruction set. So i went for a solution that somehow suited exomizer best. :) |
| |
soci
Registered: Sep 2003 Posts: 480 |
From the processor's viewpoint, the least significant 5 bits of the opcode fully determine the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes. |
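A sketch of that scheme (Python; the 32-entry table is my own reconstruction of the 6502 addressing-mode layout and worth double-checking against an opcode chart before relying on it):

```python
# Instruction length from the low five op-code bits, with a small
# exception table for the irregular ops discussed above.

LOW5_LEN = [2, 2, 2, 2, 2, 2, 2, 2,    # $00-$07: imm / (zp,x) / zp
            1, 2, 1, 2, 3, 3, 3, 3,    # $08-$0F: implied / imm / A / abs
            2, 2, 1, 2, 2, 2, 2, 2,    # $10-$17: rel / (zp),y / zp,x
            1, 3, 1, 3, 3, 3, 3, 3]    # $18-$1F: implied / abs,y / abs,x

EXCEPTIONS = {0x20: 3,                     # JSR abs
              0x00: 1, 0x40: 1, 0x60: 1}  # BRK, RTI, RTS as encoded

def insn_length(opcode):
    return EXCEPTIONS.get(opcode, LOW5_LEN[opcode & 0x1F])

assert insn_length(0xA9) == 2   # LDA #imm
assert insn_length(0x8D) == 3   # STA abs
assert insn_length(0x20) == 3   # JSR
assert insn_length(0x60) == 1   # RTS
```

This trades the 64-byte bit-array for 32 table entries plus a few special cases, which is the size argument soci is making.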
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: From the processor's viewpoint the least significant 5 bits of opcode fully determines the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes.
It would be very slow, but couldn't you depack that stream without a table? :) Imagine placing a BRK after the opcode in the stream, then using the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting oziphantomit would actually consume more memory than taking the whole lot as a custom stream The two separate streams take exactly the same amount of memory as the conventional 6502 counterpart. Your example would translate to$a9,$8d,$a9,$8d,$a9,$8d
$7f,$15,$d0,$00,$84,$27,$d0,$84,$e7,$43 and it's plain to see that the op-codes would compress well, and data arguments too at some later point when using the same literals and addresses many times.
Of course, these two chunks will have to be re-combined after decompression, which adds an initial cost of about $0180 or so bytes (which in my case amortised for the 4K, so overall i saved a few pages).
But please, do look at the Artefacts source code at some point to see what i did.
My line is referencing a normal packer like exo, not your split system. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ah, sorry, didn't read thoroughly enough.
But i guess the compressor itself should be oblivious to the semantic details of the blob you serve it. It should not know about op-codes/arguments/data separation and just crunch everything as well as it can.
It should be served data it can compress well, and this is where any special coding takes place before final compression and after initial decompression.
In principle this is akin to the Burrows-Wheeler transform/move-to-front, which makes the data much more compressible. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting JackAsserIt would be very slow, but couldn't you depack that stream without a table! :) Imagine placing a BRK after the opcode in the stream then use the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! Excellent idea! However, i have a hunch that the code for all that would, in the end, still consume more compressed data than the rather straightforward approach with a table, which is probably better to compress. Not to speak of execution hazards with funny op-codes in an environment that doesn't really allow for sandboxing. :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Quoting JackAsserIt would be very slow, but couldn't you depack that stream without a table! :) Imagine placing a BRK after the opcode in the stream then use the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! Excellent idea! However, i have a hunch that the code for all that would, in the end, still consume more compressed data than the rather straightforward approach with a table, which is probably better to compress. Not to speak of execution hazards with funny op-codes in an environment that doesn't really allow for sandboxing. :)
Indeed, it's a bit tricky, but possible. For instance, RTI/RTS handle the return address a bit differently (one adds one, the other doesn't). But I guess having argument $02,$02 and then prefilling the stack return address properly should handle most degenerate cases. But yeah, there's a big chance it will compress worse in total tbh. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
"RTI/RTS handles the return address a bit different "
wot? didn't know this. why? |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: "RTI/RTS handles the return address a bit different "
wot ? didnt knew this. why ?
Dunno, but it does. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
iirc there was a longish thread explaining it on 6502.org |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswald"RTI/RTS handles the return address a bit different "
wot ? didnt knew this. why ? Why, because JSR pushes the return address MINUS 1 on stack while an interrupt simply pushes the return address, of course. :D
But seriously, as groepaz implied, the reason for that is likely due to some 6502 design intricacies. Something like JSR pushing next op-code minus 1 because PC at that stage isn't incremented yet to the next op-code, while interrupts are handled exactly between op-codes (after increasing PC). |
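The asymmetry Krill describes can be mimicked in a few lines (Python, purely illustrative):

```python
# JSR pushes the address of its own last byte (return address minus
# one) and RTS adds the one back; an interrupt pushes the exact
# return address and RTI restores it unchanged.

def jsr(stack, pc):
    stack.append(pc + 2)        # JSR is 3 bytes; push addr of 3rd byte
    return "subroutine"

def rts(stack):
    return stack.pop() + 1      # correct the off-by-one

def irq(stack, pc):
    stack.append(pc)            # exact address of the next instruction
    return "handler"

def rti(stack):
    return stack.pop()          # no correction needed

stack = []
jsr(stack, 0x1000)              # JSR at $1000; next insn at $1003
assert rts(stack) == 0x1003

irq(stack, 0x2000)              # interrupted just before $2000
assert rti(stack) == 0x2000
```

This is exactly why JackAsser's yield code has to add one (and push the status) when converting an RTS-style stack frame into an RTI-style one.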
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: iirc there was a longish thread explaining it on 6502.org
Ahh ok, didn't know. I was painfully aware of it when I implemented my preemptive multitasking to support yield (i.e. give up the CPU to another thread). Normally the task switch is performed by some interrupt (i.e. the return/continuation point for the thread is placed on the stack and resumable with an RTI). However, when I want to yield in "userspace", i.e. not wait until the next task switch, the task switch is performed by a subroutine and is thus resumable with an RTS. So I need to keep that in mind so that I don't accidentally mix up those two types of resume possibilities. Basically the yield call manipulates the stack to convert it from RTS-compatible to RTI-compatible by adding to the return address and pushing the status as well. Hairy, but it works... :) (And it's useless, but fun!) |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Why didn't you use BRK, which was basically invented for things like this? :) (It's the 6502 equivalent of TRAP.) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
brk is more useful for systemcall type of things, not "yield" imho |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
But it triggers an interrupt from software. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: brk is more useful for systemcall type of things, not "yield" imho
It's for a demo part and not an OS, and I have many other IRQs going on, so I decided that 1) I don't want to clutter each handler with a B-flag check, 2) I couldn't take the risk of a possible uncontrolled extra stall in e.g. a tight multiplexer. But yeah, I considered BRK long and hard before I decided to do the manipulation manually (and thus needed semaphores to block the scheduler from interrupting a "user mode" task switch). |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
(Sorry for being OT, one thing led to another...) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
in my first demo I had "multitasking" with raster irqs. for about 20% of each frame I switched to the loader, then back. So only one type of return address, but I had other problems: the two threads messed up each other's stacks, so I simply copied each one's stack (not all of it) back and forth to a temp buffer. also swapped all registers. worked nicely. today I'd just halve the stack and replace S tho. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: in my first demo I had "multitasking" with raster irqs. about in 20% for each frame I switched to loader then back. So only one type of return adresses, but I had other problems, the two threads messed up each other's stacks, so I simply copyed back and forth each one's stack (not all of it) to a temp buffer. also all register swapped. worked nicely today I'd just halve the stack and replace S tho.
I do that too, but split the stack properly. 25% to loader, 75% to plotter, loader yields when not loading. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Make 128 demos, problem solved :D |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: I do that too, but split the stack properly. 25% to loader, 75% to plotter, loader yields when not loading.
you could just set some flag to tell the irq to stop scheduling the loader in, instead of that super sci-fi method? :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: you could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :)
I did consider that also, but it was too boring to code |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
well yeah certainly a more sexy solution :) brk could work too tho if you replace all your raster IRQs with NMIs :) |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
NMI happening during BRK will cause BRK to be ignored. See http://visual6502.org/wiki/index.php?title=6502_BRK_and_B_bit#N.. |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
@Krill: did you try separate streams for immediate/low-byte operands and high bytes? I found that high bytes compress better when separated from the low bytes.
@soci: no need to add special cases for RTI/RTS - just make the preprocessor warn if the following byte (fake operand) is non-zero. Additionally, the preprocessor can replace the zero with any other value if that improves compression. JSR needs to be handled separately, of course. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
TNT: I don't remember trying that. Also the RTS/RTI padding is a good idea. Certainly something to play around with for the next 4K of mine! :D |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswaldyou could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :) Why is the obvious solution of simply splitting the stack (TSX/TXS exists) the "super sci-fi" method and painstakingly copying it back and forth not? :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Quoting Oswaldyou could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :) Why is the obvious solution of simply splitting the stack (TSX/TXS exists) the "super sci-fi" method and painstakingly copying it back and forth not? :)
He was referring to implementing yield vs. pausing/resuming a thread |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ah, thanks. I should not morning-post before the third coffee. :D |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting sociFrom the processor's viewpoint the least significant 5 bits of opcode fully determines the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes.
Also remember that this doesn't need to be 100% accurate. It's just a heuristic, and things will work as long as the encoder and the decoder make exactly the same errors. So treating jsr as having a one-byte operand (or brk/rti/rts/ldy#/cpy#/cpx# as having two-byte operands) will throw things out of sync for a while, but the compression ratio might not increase very much, and may even accidentally decrease for some input data. Meanwhile the decoder would be greatly simplified. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Yes, but i'd try to get things back in sync again as quickly as possible. In an environment where every byte counts, "tainting" the streams by switching their data may (or may not) result in some losses.
And the question would be how often do the streams get in and out of alignment? Surely you want to keep that number low, otherwise the entire exercise is moot.
Now, i really like the idea of padding in zeros for simplified decoding, which may avoid unaligned streams altogether and have no drawbacks in comparison.
But i guess the goal would be to make the op-code lengths as regular as possible, possibly even getting rid of the table altogether in favour of a simple expression, while keeping the streams as well-aligned as possible by padding and/or allowing transient alignment errors only on rare operations (prefer brk/rti/rts/jsr over lda#/cmp#/etc.) with a quick recovery (also via some sort of padding?). |
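The zero-padding idea can be illustrated with a toy splitter that gives every instruction a fixed three-byte slot, so the opcode and operand streams never drift out of alignment (a sketch; `length_of` stands in for whatever length heuristic the codec uses):

```python
def split_streams(code, length_of):
    """Split machine code into an opcode stream and an operand stream.

    Every instruction gets a fixed slot: one opcode byte plus two
    operand bytes, zero-padded. The streams stay in lockstep, so the
    decoder needs no length table at all, at the cost of feeding the
    compressor some extra zeroes.
    """
    opcodes, operands = bytearray(), bytearray()
    i = 0
    while i < len(code):
        n = length_of(code[i])
        opcodes.append(code[i])
        ops = code[i + 1:i + n]
        operands += ops + bytes(2 - len(ops))  # pad short operands with zeroes
        i += n
    return bytes(opcodes), bytes(operands)
```

Runs of zeroes compress well, so the padding may cost little while buying guaranteed alignment.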
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and of course, most illegals and also quite a few regular operations aren't used much (or can be replaced), especially in size-constrained productions. This may also have a significant influence on the "regularity" of the instruction lengths. |
| |
TBH
Registered: Mar 2010 Posts: 21 |
I've played around with what might be considered code VM systems for the C64, with varying results:
1) Event Management System
The use of Events allows code to be deferred ("fire and forget") within an imperative program design. It's basically a set of tables, pointers and indices tied together with a few routines.
The basic features of such a system are:
* Subscribe To An Event - Add a subroutine address to an Event's list of Handlers.
* Unsubscribe From An Event - Remove a subscription.
* Raise An Event - Notify the Event system that an Event of a certain type occurred.
* Event Manager - (called from main loop) Executes the subscribed Handlers for each raised event in its buffer. Events can be queued and prioritised, eg animation Events handled once every 8 frames.
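In pseudo-Python, the skeleton of such an Event system might look like this (a sketch of the four features listed above, without the prioritisation and every-N-frames scheduling):

```python
class EventManager:
    def __init__(self):
        self.handlers = {}   # event name -> list of subscribed routines
        self.queue = []      # raised events waiting for the next pass

    def subscribe(self, event, handler):
        # Add a routine to an Event's list of Handlers.
        self.handlers.setdefault(event, []).append(handler)

    def unsubscribe(self, event, handler):
        # Remove a subscription.
        self.handlers.get(event, []).remove(handler)

    def raise_event(self, event, *args):
        # Note that an Event occurred; nothing runs yet ("fire and forget").
        self.queue.append((event, args))

    def dispatch(self):
        # Called from the main loop: run the Handlers for every
        # Event raised since the previous pass.
        pending, self.queue = self.queue, []
        for event, args in pending:
            for handler in self.handlers.get(event, []):
                handler(*args)
```

On the C64 the dictionaries and lists would of course be flat tables with indices, but the control flow is the same.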
2) Reactive (Event-Driven) Extended Finite State machine
A state machine compatible with OMG's UML2 specification. Relatively slow, but can handle complex sets of transitions. However, it's probably overkill for most purposes.
3) Decision Table
Similar to a lookup table, but with more features and a routine that (optionally) compiles the data and executes the logic expressed by the data and structure.
A table consists of a set of Rules. Each Rule has a set of Actions that are executed if a set of Conditions are met.
The benefits:
a) Rules are expressed explicitly when tabularised, instead of (often) implicitly when in code.
b) No IF-THEN code is required.
Many different tables can exist, but only one routine is needed to execute them all. This minimises code, and the payoff can be big in a program that needs to evaluate a lot of conditions. Alternatively, a code generator can emit a code representation of the table.
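The "one routine executes any table" variant can be sketched like so (the optional compile-to-code step is left out):

```python
def run_table(rules, state):
    """Execute every Rule whose Conditions all hold.

    rules: list of (conditions, actions) pairs, where conditions is a
    list of predicates over state and actions is a list of procedures
    applied to state. One generic routine, any number of tables.
    """
    for conditions, actions in rules:
        if all(cond(state) for cond in conditions):
            for action in actions:
                action(state)
```

Each table is pure data, so the same interpreter serves every subsystem that needs rule evaluation.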
4) Inline Function Parameters
This is simple but useful in that it remains slightly more readable than using a data table outside the code. Implementations can vary, but they all involve manipulating the return address on the stack. The result is code like this:
; PrintAt routine params: Column, Row, Colour, @-terminated string
JSR PrintAt
.byt 10, 2, LtBlue
.asc "The quick brown fox@"
JSR PrintAt
.byt 10, 3, White
.asc "jumps over the lazy dog.@" |
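The underlying mechanism can be mimicked in a high-level sketch: the routine reads its arguments from the bytes at the "return address" and hands back an adjusted address past them (on the 6510 this means pulling the return address off the stack, walking past the inline data, and pushing the corrected address back):

```python
def print_at(mem, pc, out):
    # pc points at the inline parameter block right after the JSR:
    # column, row, colour, then an '@'-terminated string.
    col, row, colour = mem[pc], mem[pc + 1], mem[pc + 2]
    pc += 3
    text = bytearray()
    while mem[pc] != ord('@'):
        text.append(mem[pc])
        pc += 1
    out.append((col, row, colour, text.decode("ascii")))
    return pc + 1  # execution resumes just past the terminator
```

Here `out` merely collects what a real routine would print; the interesting part is the returned address.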
| |
lft
Registered: Jul 2007 Posts: 369 |
I've implemented the Z-machine for C64+REU. Zeugma 1.2 |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Once upon a time, AEG was working on a VM for a 4K demo of his, which, to my knowledge, was never finished. It implemented 32-bit arithmetic (including MUL and DIV) via an orthogonal instruction set operating on virtual registers.
I wonder if, for a purpose like that, a stack-machine approach similar to Intel's 8087 would be well suited. |
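For illustration, a toy 32-bit stack machine in the 8087 spirit might look like this (unsigned wrap-around arithmetic and integer division; purely a sketch, not AEG's actual design):

```python
MASK = 0xFFFFFFFF  # 32-bit wrap-around

def execute(program, stack=None):
    # Operands live on a small value stack, so arithmetic opcodes
    # need no register operands at all; only "push" carries data.
    stack = stack or []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0] & MASK)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append((a * b) & MASK)
        elif op == "div":
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)
    return stack
```

The upside for a 4K is visible even in the sketch: operand-free opcodes make the bytecode very dense.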
| |
White Flame
Registered: Sep 2002 Posts: 136 |
Having played with a bunch of VM strategies on the 64, I'd say that a stack machine is nice for numeric evaluation, but not for general purpose use. If you're only going to have one or the other, I'd stick with some notion of register file. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
I once coded a 6510-emu routine to be run on C64. :D Sounds like one of the more reasonable things to do, right? ;)
It actually executed the instructions, one after another, setting/restoring cpu flags and such, but it checked some constraints before and after executing the opcodes.
(The purpose was to be able to execute randomly generated 6510 code, but to be able to disallow access to certain memory areas and stuff like that e.g. to avoid crashes. It was related to some experimentation with Genetic Programming.) |
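The guard itself can be very small. A sketch of the kind of write check such a constrained emulator might apply before committing each store (names and shapes are illustrative, not Frantic's actual code):

```python
class SandboxViolation(Exception):
    pass

def guarded_store(mem, addr, value, lo, hi):
    # Refuse any write outside the [lo, hi) workspace before it
    # happens; a constrained emulator would run this check for every
    # store the randomly generated code attempts, instead of letting
    # it crash the machine.
    if not lo <= addr < hi:
        raise SandboxViolation("write to $%04x outside workspace" % addr)
    mem[addr] = value
```

Reads, jumps and stack operations would get analogous checks, which is exactly why interpreting beats running the code raw for this purpose.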
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Frantic: Interesting! Did your evil experiments go anywhere? What did you train the software for? What were the fitness functions? After which generation did the software do anything noteworthy? Did it become self-aware? :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
frantic you could have just inserted a brk _before_ each instruction, and the irq would have examined whether it's within limits :) |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Hehe.. Yes, I pulled the plug and stopped being active in the scene when the C64 threatened to kill me.
Well.. No, I didn't get very far. If I remember correctly I think Alih came up with some "code a c64 game" competition that I made a small game for instead of continuing with the 6510-emu thing. That game was also based on similar evolutionary principles, but I didn't make it before the deadline, so I just stopped. It was released as a preview by Triad though, a whole bunch of years later:
Drone Processor Preview
From an evolutionary perspective that game is not very interesting though, as precisely that part was what needed some work. So not much came out of that either. :D That game didn't generate 6510 code; I just made some sort of simple VM without much thought at all behind it. Just fiddling around for fun.
...and then I coded defMON instead. :)
http://toolsforscholars.com/defmon/doku.php
Defmon &D
Sorry for being off topic. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Oswald: I guess I could have done that, but if I remember correctly I didn't want to insert anything in the code. I wanted it to have precisely the form it would have when being executed for real, so that potentially an opcode could be used as an operand and vice versa, depending on where things branched, and that it could self modify in whatever way (as long as it stayed within its workspace, so to speak) during the execution, and so forth... |
| |
Digger
Registered: Mar 2005 Posts: 437 |
A little off-topic but I remember MARS being implemented in Core Wars (http://www.koth.org/info/corewars_for_dummies/dummies.html)
It would be cool to run some OOP on C64 as TBH suggests :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
In terms of making more compressible code, I wonder if it would be worth normalising most or all instructions to three bytes long?
A table of all the numbers from 0 to 255 on page $ea would let you turn all immediate operands into absolute, while padding all branches with the same third byte as the immediates. There would be fewer distinct opcodes in the stream to be compressed, too.
Of course, that still leaves the single byte instructions. Pack these into groups of one to three opcodes, padded with a nop or two? |
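A sketch of the immediate-to-absolute rewrite (it covers only the cc=01 immediates, where the absolute opcode is the immediate one plus 4; LDX#/LDY#/CPX#/CPY# need a different adjustment, and one-byte instructions are left untouched here):

```python
IDENTITY_PAGE = 0xEA  # $EA00-$EAFF holds the bytes $00..$FF in order

def normalise(insn):
    """Rewrite one instruction into a fixed three-byte form where possible."""
    op = insn[0]
    if op & 0x1F == 0x09:
        # Immediate (ORA#, AND#, ADC#, LDA#, CMP#, SBC#, ...) becomes an
        # absolute read into the identity table: LDA #$05 -> LDA $EA05.
        return bytes([op + 4, insn[1], IDENTITY_PAGE])
    if op & 0x1F == 0x10:
        # Relative branch: pad with the same fixed third byte, which
        # the decoder strips again.
        return bytes([op, insn[1], IDENTITY_PAGE])
    return bytes(insn)  # already 3 bytes, or needs separate treatment
```

After the rewrite every covered instruction is opcode/lo/hi with $EA as a very frequent third byte, which is the regularity the compressor is meant to exploit.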
| |
Krill
Registered: Apr 2002 Posts: 2980 |
I'm sceptical. Assuming it would pay off in terms of compression ratio, you'd end up with either pretty inefficient code or rather expensive decoding (in terms of size, to recover the original 1- or 2-byte instructions). |