| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Packer vs Code VM vs Code Generator
Has anybody experimented with, or know of, code VMs or generators?
Just thinking that to get code size down they might perform better than Exomizer et al.
I know of Sweet16 by Woz; any others? |
|
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
a decent code generator will always win. unless speed doesn't matter much, then you can gain a lot by using some kind of interpreted VM (most obvious example: the BASIC interpreter) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Has anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others?
The vectors in our 1991 demo use a VM to encode the generated code for each object. It outperforms a general cruncher in both size and speed, since it's tailored to the problem domain. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
I think it's a stretch to call a data set that describes how to generate code a VM. but prove me wrong :) |
| |
Claus_2015
Registered: Oct 2012 Posts: 53 |
I would say this obviously strongly depends on the properties of the data that needs compression. If any kind of simple rule can be applied, code/data generation is probably the way to go. If the data consists of permutations of non-trivial functional elements a VM is probably good. If no suitable property can be found, a generic compressor might be the only way out. Different parts of the compressed data might have different properties, so combining all of these might be the best solution. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: I think its stretching to call VM a data set that describes how to generate code. but prove me wrong :)
True, it's just opcodes that select data chunks, together with parameters for altering the data written. |
| |
White Flame
Registered: Sep 2002 Posts: 136 |
I've used this VM for some tool development. I've got a v2 that isn't pushed to github that I need to finish. Speed hit is under 10x slower than asm for degenerate cases, which is pretty good for this field, and it's intended for the user to add large heavyweight instructions to it which basically makes it as fast as you want.
http://acheronvm.github.io/acheronvm/ |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting oziphantomHas anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others? As has been said before, specially-made code generators do indeed outperform generic compressors in terms of compression ratio and decompression speed. Just look at any effect-driven small-size demo (4k, one-filers) worth its salt. (Using code generation first and then generic compression later.)
VMs aren't encountered as often in demos as they are in games (which you might be going for), as they usually execute much slower than native code, but have a smaller memory footprint. Of course, they may compile to native code at run-time, as in Jackasser's example, which is a form of code generation again. For efficient 6502 VM techniques, you might want to take a look at http://codebase64.org/doku.php?id=base:dispatch_on_a_byte . |
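The codebase64 technique linked above boils down to using each bytecode value as an index into a handler table. As a rough illustration only (Python, with made-up opcode values; a real 6502 version would use a jump table indexed by the fetched byte, as the article shows):

```python
# Illustrative sketch only: the "dispatch on a byte" idea in Python,
# with invented opcode values 0x00/0x01/0x02.

def make_vm():
    regs = {"a": 0}  # a single accumulator is enough for the sketch

    def op_lda(pc, code):        # 0x01: load immediate into A
        regs["a"] = code[pc]
        return pc + 1

    def op_add(pc, code):        # 0x02: add immediate to A
        regs["a"] = (regs["a"] + code[pc]) & 0xFF
        return pc + 1

    def op_halt(pc, code):       # 0x00: stop execution
        return None

    dispatch = {0x01: op_lda, 0x02: op_add, 0x00: op_halt}

    def run(code):
        pc = 0
        while pc is not None:
            op = code[pc]
            pc = dispatch[op](pc + 1, code)
        return regs["a"]

    return run

run = make_vm()
assert run(bytes([0x01, 0x10, 0x02, 0x05, 0x00])) == 0x15
```

On a 6502, the dispatch step itself costs only a handful of cycles when done via an indexed jump table, which is why this layout is the usual starting point for small VMs.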
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting OswaldI think its stretching to call VM a data set that describes how to generate code. but prove me wrong :) Quoting JackAsserTrue, it's just opcodes that selects data chunks together with parameters for altering the data written. Quoting WikipediaVirtual machines operate based on the computer architecture and functions of a real or hypothetical computer. I think Jackasser's example can be considered a VM on the semantic level, and that the implementation details are irrelevant, no matter how trivial.
Just like i think context-switching via interrupts on 6502 is a threading implementation and the KERNAL is an OS.
Really, the definitions are vague on those, and the general concepts can be applied to all computing machines, even at the lowest end. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting oziphantomHas anybody experimented with/know of code VMs or Generators.
Just thinking to get code size down they might perform better than exomiser et al.
I know of Sweet16 by Woz any others? As has been said before, specially-made code generators do indeed outperform generic compressors in terms of compression ratio and decompression speed. Just look at any effect-driven small-size demo (4k, one-filers) worth its salt. (Using code generation first and then generic compression later.)
VMs aren't encountered as often in demos as they are in games (which you might be going for), as they usually execute much slower than native code, but have a smaller memory footprint. Of course, they may compile to native code at run-time, as in Jackasser's example, which is a form of code generation again. For efficient 6502 VM techniques, you might want to take a look at http://codebase64.org/doku.php?id=base:dispatch_on_a_byte .
I'm thinking Game. Reset 4K ;) I want to do scrolling, and at the moment the smallest way is for me to make a simple code unroller to do LDA/STA pairs to shift CRAM and SCREEN ;)
But doing common things like
LDA #$00
STA $08
LDA #$80
STA $09
could be LS16 08 80 00 = 4 bytes vs 8
LDA $FC
CLC
ADC #$06
STA $FC
could be ADCI FC 06 = 3 bytes vs 7
IO needs at most 6 bits to say which register is needed, plus 2 bits to specify VIC, SID, CIA1 or CIA2, so you can make
LDA #$7F
STA $D015 = 5 bytes
into
ST VIC15 7F = 3 bytes
LDA $D015
AND #$7E
STA $D015 = 8 bytes
AND VIC15 7E = 3 bytes
I would think in a 4K game (maybe even a 1-meg game) you could fit most variables into 256 bytes. I normally put them at $0200, so I could have a bit in a command byte that specifies that the next store/load is variables, and hence the next byte is really $02XX
With flags a STZ or Store 1 would also probably be handy
to which STZ XX for zero page is 2 bytes. STZV XX for variables is also 2 bytes.
LDA CMP BXX chains could be dropped to LCBXX AA BB CC
Where the game would be unpacked with Exomizer, then the code packer would parse the code bytes to spit out the expanded code, then the game would be run.
But it could all work out worse in the end. Curious to know what people have already tried, what failed, etc. |
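To make the proposed savings concrete, here is a hypothetical decoder sketch in Python. The pseudo-op numbering (LS16 = $01, ADCI = $02) is invented for illustration; only the byte counts match the post above.

```python
# Hypothetical decoder for the LS16/ADCI pseudo-ops proposed above.
# Opcode values and encodings are assumptions, not an existing format.

LS16, ADCI = 0x01, 0x02

def expand(stream):
    out, i = bytearray(), 0
    while i < len(stream):
        op = stream[i]; i += 1
        if op == LS16:                      # LS16 zp hi lo: 4 bytes -> 8
            zp, hi, lo = stream[i:i + 3]; i += 3
            out += bytes([0xA9, lo, 0x85, zp,        # LDA #lo : STA zp
                          0xA9, hi, 0x85, zp + 1])   # LDA #hi : STA zp+1
        elif op == ADCI:                    # ADCI zp imm: 3 bytes -> 7
            zp, imm = stream[i:i + 2]; i += 2
            out += bytes([0xA5, zp, 0x18,            # LDA zp : CLC
                          0x69, imm, 0x85, zp])      # ADC #imm : STA zp
        else:
            raise ValueError("unknown pseudo-op")
    return bytes(out)

# "LS16 08 80 00" expands to LDA #$00:STA $08:LDA #$80:STA $09
assert expand(bytes([LS16, 0x08, 0x80, 0x00])) == \
    bytes([0xA9, 0x00, 0x85, 0x08, 0xA9, 0x80, 0x85, 0x09])
# "ADCI FC 06" expands to LDA $FC:CLC:ADC #$06:STA $FC
assert expand(bytes([ADCI, 0xFC, 0x06])) == \
    bytes([0xA5, 0xFC, 0x18, 0x69, 0x06, 0x85, 0xFC])
```

The decoder itself is the catch: on the C64 side it has to fit in the same binary, so the per-instruction savings must amortise its cost.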
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting OswaldI think its stretching to call VM a data set that describes how to generate code. but prove me wrong :) Quoting JackAsserTrue, it's just opcodes that selects data chunks together with parameters for altering the data written. Quoting WikipediaVirtual machines operate based on the computer architecture and functions of a real or hypothetical computer. I think Jackasser's example can be considered a VM on the semantic level, and that the implementation details are irrelevant, no matter how trivial.
Just like i think context-switching via interrupts on 6502 is a threading implementation and the KERNAL is an OS.
Really, the definitions are vague on those, and the general concepts can be applied to all computing machines, even at the lowest end.
To me, if
the code is run once, then it is a code generator;
the code reads data, makes code, runs it, comes back, reads more data, makes code, runs it... then it is a JIT VM |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and i thought you were about to implement some kind of script interpreter for a domain-specific language of your own devising.
To be honest, i don't think the code generation you have in mind for that scroll routine will get you very far, and the compression ratio seems to be above 50%. The standard approach for that is to have a template with some placeholders for values and addresses, which you will modify and copy many times to create a flat unrolled copy/render routine.
Quoting oziphantomWhere the game would be unpacked with exomiser, then the code packer would parse the code bytes to spit out the expanded code, then the game would be run.
But it could all work out worse in the end. Curious to know what people already had/have tried and failed etc. If you check the source that comes with Artefacts, you'll see that an approach that worked well for me was to split up the 6502 instruction stream into two separate streams, one for the op-codes and one for the arguments. This, along with code tailored with that in mind (you'd do LDA #$00 : LDX #$00 rather than LDA #$00 : TAX), gave me quite a plus on exomizer's compression ratio (plus i used Ninja's size-optimized exo decompressor). There are other techniques involved, such as packing function tables (sine curves etc.) lossily using polynomials, and the good old code generation. |
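The stream-splitting idea can be sketched like this (Python; my own reconstruction of the general technique, not the actual Artefacts code, and the length table only covers the opcodes used in the example):

```python
# Split a 6502 instruction stream into an op-code stream and an
# argument stream, so a generic cruncher sees two more regular
# sequences. A full implementation needs all 256 length entries.

LENGTH = {0xA9: 2, 0x8D: 3, 0xA2: 2, 0x85: 2, 0x60: 1}

def split(code):
    ops, args = bytearray(), bytearray()
    i = 0
    while i < len(code):
        n = LENGTH[code[i]]
        ops.append(code[i])           # op-code goes to one stream
        args += code[i + 1:i + n]     # operand bytes to the other
        i += n
    return bytes(ops), bytes(args)

def join(ops, args):                  # inverse, run after decompression
    out, j = bytearray(), 0
    for op in ops:
        n = LENGTH[op] - 1
        out.append(op)
        out += args[j:j + n]
        j += n
    return bytes(out)

code = bytes([0xA9, 0x7F, 0x8D, 0x15, 0xD0,    # LDA #$7F : STA $D015
              0xA9, 0x00, 0x8D, 0x27, 0xD0])   # LDA #$00 : STA $D027
ops, args = split(code)
assert ops == bytes([0xA9, 0x8D, 0xA9, 0x8D])  # highly repetitive now
assert join(ops, args) == code                 # lossless round trip
```

The split is lossless and size-neutral on its own; the gain comes entirely from the cruncher finding longer repeats in each stream.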
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Krill, you cut off the first part of the definition: "In computing, a virtual machine (VM) is an emulation of a given computer system"
Jackie's code generator doesn't emulate anything. It's a data set on how to generate code. How to add up and scale vectors to calculate vertices is not the emulation of a computer system, we should agree I think... |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and about exomizer... For a 4K, you'd probably get better results with ALZ64, which is a 6502-optimised LZMA variant. It's very slow, but not too slow for 4Ks. (I still need to check whether Artefacts is shorter using ALZ64 rather than exo.) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
oziphantom: I'd just collect often-used routines into subs and jsr them. it depends how tight you want it and how much time you have. if both are a lot, then what you suggest could work. |
| |
tlr
Registered: Sep 2003 Posts: 1790 |
Quote: Krill, you cut off the first part of the definition: "In computing, a virtual machine (VM) is an emulation of a given computer system"
Jackie's code generator doesnt emulate anything. Its a data set on how to generate code. How to add up and scale vectors to calculate vertices is not a computer system's emulation we should agree I think...
If it's turing complete I'd argue it's a crude VM, albeit running fixed code to generate the data set. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
What tlr said, but i think Turing completeness isn't necessary. You could build a VM for a stack machine (with exactly one stack) and it'd still be a VM. :) |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
I believe the correct term would be automaton. :)
Automata Theory
|
| |
Krill
Registered: Apr 2002 Posts: 2980 |
"For stack machines in automata theory, see pushdown automaton." - Both are correct, "PDA" is just less ambiguous without the context (which is clear here). :) |
| |
TWW
Registered: Jul 2009 Posts: 545 |
You guys need to get laid... |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting TWWYou guys need to get laid... How extraordinarily creative, smart and funny, yet dignified and mature. Surely, "TWW" is short for "The Witty Wisecracker"? I applaud you, good sir!
|
| |
enthusi
Registered: May 2004 Posts: 677 |
Stay tuned for the coming Above & Beyond #2 for an article by Ulli on this subject ;-) |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
I wish I could experience the warmth of a woman even once. I'm sure that would help me to let go of these.... feelings... and thoughts... about VM's.... and automatons.... |
| |
HCL
Registered: Feb 2003 Posts: 728 |
This thread is finally getting enjoyable :) Thank you! |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: This thread is finally getting enjoyable :) Thank you!
Speaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
Quote:CSDB Code score / number of kids ratio (C2K^-1)
LOL! i demand a chart for this on the front page |
| |
mankeli
Registered: Oct 2010 Posts: 146 |
5.7 5.3 5.3 NaN 7.3 3.4 NaN |
| |
Scan
Registered: Dec 2015 Posts: 111 |
NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN .....
Bat-maaaaannn!!! |
| |
TWW
Registered: Jul 2009 Posts: 545 |
Quote: Quoting TWWYou guys need to get laid... How extraordinarily creative, smart and funny, yet dignified and mature. Surely, "TWW" is short for "The Witty Wisecracker"? I applaud you, good sir!
Hahaha, I like juvenile humor, sue me. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D |
| |
HCL
Registered: Feb 2003 Posts: 728 |
Quote: Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D
Hmm.. exactly what i was about to reply also.. but for *some* reason i refused to push the "submit" button :P |
| |
Trash
Registered: Jan 2002 Posts: 122 |
Quote: Quoting JackAsserSpeaking about getting laid. I think you HCL have the highest CSDB Code score / number of kids ratio (C2K^-1) actually. Except... how is having kids related to getting laid... on a highly frequent basis? I demand some _scientifically sound_ facts! :D
Having children is nothing but proof that their mother is getting laid, not that you are... ;-) |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Mothers are sceners too. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
back to topic: reread ozi's post, and actually I find it a cool idea.
kickass would suit nicely for this, but one problem I see is how one would compile this so that addresses are correct after the code is blown up to normal size. illegals should be used for the pseudo-instructions. maybe I'd use a macro which adds enough nops after "own" instructions to keep the code size the same, so that addresses work out, then a preprocessor would remove the nops. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I've made macros, but the tass64 verbose option doesn't list when it "invokes" one. I was thinking you could assemble with the macros expanded; this gives you a labels file with all the correct placements. Then run through the listing and pack the macros.
Maybe a post-process that just hunts the code for patterns and then applies them as it finds them, with macros to "enforce" a pattern, is the best way to go. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswaldkickass would suit nicely for this, but So it's not suited nicely :) Really, it amazes me when people try to shoehorn in stuff that, in this case, the tool really wasn't built for (just a guess, though). You'd fare better with the standard approach of writing your own standalone encoder and decoder. However, i still reckon the gain is rather poor in comparison to other approaches, at least when having to provide the decoder in the same binary. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Thinking further about it, it may be a better approach to decrease code density rather than increase it.
As the program will ultimately be compressed with a general-purpose compressor anyways, you want to support it. In the case of code, that'd mean a more regular structure, with a lot of repeating or similar sequences.
So you'd "RISCify" the code, meaning to come up with an orthogonal set of instructions: INC a, INC x and INC y will all have the same op-code, but different arguments. Together with the separated op-code and argument streams i mentioned above, this might pay off.
Of course, again, there is a rather high initial cost for the decoder, which may or may not amortise in 4K, plus the code will probably perform (slightly?) worse compared to highly optimised hand-crafted native code. |
| |
Mixer
Registered: Apr 2008 Posts: 452 |
In my opinion, 1K, 4K and other small-space code challenges are more about redefining the problem than about making a language or VM.
I like Krill's idea of an opcode stream and an arguments stream. It is like having two stacks of data that one combines into a program. I don't think it has 4K benefits, but as a way to "stream" or VM programs it is interesting. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
It DID have benefits in Artefacts. And trust me, there was a lot of trial and error before i settled for that one (and some other techniques). :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
there's a lot more going on in artefacts than I'd thought, kudos :) |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I thought about simplistic code, and I think the macros will actually help there as well for just normal compression, but the issue I see is this:
LDA #7f
STA $d015
LDA #0
STA $d027
LDA #84
STA $43E7
At this point the compressor sees an LDA, a unique number, STA, a unique number, D0, LDA, a unique number, STA, a unique number, D0, LDA, a unique number, STA, a unique number, a unique number. I feel that by the time the compression system encodes the custom byte stream, then the 2-byte pairs, then back to custom, then 2 bytes again, it would actually consume more memory than taking the whole lot as a custom stream. Conventional wisdom says code compresses badly and data compresses well, so make more data.
Any code that is mostly the same will be JSR-optimised by the coder, so you just end up with the annoying sequences where it's not worth the extra instructions to jsr, as you end up with a loss.
+ lda $3
bne +
dec $4
+ dec $3
lda $6
bne +
dec $7
+ dec $6
So close, yet so far.
Which I think is proven by Krill's genius idea of splitting data and code this way you would get a nice run of
LDA STA LDA STA LDA STA and just pay for the 7f 15 d0 00 27 d0 84 e7 43
Whether or not the code expansion system pays off in 4K is hard to tell until all the chips are on the table. It might only be viable at 16K. I feel that for things like demos, with highly algorithmic effects and code, the split will pay off very well, as you do lots of similar code, while a game tends to have large amounts of unique code with random structure. |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
I wonder if IO register setting code is common enough to benefit from a pseudo-op that moves an eight bit immediate into $Dxxx, where xxx has a number of implicit zeros - or possibly even a series of N such operations
|
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Well you can get
STA SSRRRRRR Value
where SS is
00 VIC
01 SID
10 CIA 1
11 CIA 2
And R is the register
allowing you to pack any IO address into a byte. In a game context I don't think that will save you much. Typically the screen mode and charset registers are set and forget. The multiplexer will usually deal with the sprites in one function, and then you have a few D012/D019/D01A accesses for the IRQs. |
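The SSRRRRRR packing above can be sketched as a pair of helper functions (Python, purely illustrative):

```python
# Two bits select the chip, six bits the register, so any common
# IO address fits in one byte, as described in the posts above.

BASE = (0xD000, 0xD400, 0xDC00, 0xDD00)   # 00 VIC, 01 SID, 10 CIA1, 11 CIA2

def pack_io(addr):
    for ss, base in enumerate(BASE):
        if base <= addr < base + 0x40:    # 6 register bits = 64 addresses
            return (ss << 6) | (addr - base)
    raise ValueError("not a packable IO address")

def unpack_io(b):
    return BASE[b >> 6] + (b & 0x3F)

assert pack_io(0xD015) == 0x15                  # VIC sprite enable
assert unpack_io(pack_io(0xD418)) == 0xD418     # SID volume, round trip
```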
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
I've been working under the impression that packing all the data near each other and the code together would help with compression.
so
CODE
DATA
is this misguided do you think? |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
Grouping is generally good.
Most crunchers encode "copy recent substring" with fewer bits the more recent the substring, so grouping all your code separate from your data means the cruncher doesn't have to look as far to find prior examples, and hence can use fewer bits for the copy offset. |
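As a toy model of that cost (not any specific cruncher's actual format; exomizer's real offset coding differs), assume an Elias-gamma-style code, where the bit cost of an offset grows with its magnitude:

```python
# Toy illustration of why grouping helps: with a gamma-style code,
# nearer matches cost fewer bits than distant ones.

from math import floor, log2

def gamma_bits(n):
    """Bits to encode the positive integer n in Elias gamma."""
    return 2 * floor(log2(n)) + 1

assert gamma_bits(1) == 1        # match 1 byte back
assert gamma_bits(200) == 15     # match within the same region
assert gamma_bits(6000) == 25    # match across an interleaved region
```

So moving a match from ~6000 bytes away to ~200 bytes away saves around 10 bits every time it is referenced, which is exactly what grouping code with code and data with data buys you.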
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting oziphantomit would actually consume more memory than taking the whole lot as a custom stream The two separate streams take exactly the same amount of memory as the conventional 6502 counterpart. Your example would translate to $a9,$8d,$a9,$8d,$a9,$8d
$7f,$15,$d0,$00,$27,$d0,$84,$e7,$43 and it's plain to see that the op-codes would compress well, and the data arguments too at some later point, when using the same literals and addresses many times.
Of course, these two chunks will have to be re-combined after decompression, which adds an initial cost of about $0180 or so bytes (which in my case amortised for the 4K, so overall i saved a few pages).
But please, do look at the Artefacts source code at some point to see what i did. |
| |
soci
Registered: Sep 2003 Posts: 480 |
Storing opcode lengths in a 64 byte bit-array, what a waste of space ;) But the use of (,x) to switch streams is smart. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
:D One of the few occasions where (ZP,X) actually comes in handy. Which is mostly whenever you have a number of "buckets" to select from AND need to push or pull stuff to and fro.
As for op-code lengths, there didn't seem to be a nice enough pattern to store them more efficiently. Talk about a non-orthogonal instruction set. So i went for a solution that somehow suited exomizer best. :) |
| |
soci
Registered: Sep 2003 Posts: 480 |
From the processor's viewpoint, the least significant 5 bits of the opcode fully determine the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes. |
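A sketch of that scheme (Python; the 32-entry table is my own reconstruction of the 6502 addressing-mode layout and worth double-checking against an opcode chart before relying on it):

```python
# Instruction length from the low five op-code bits, with a small
# exception table for the irregular ops discussed above.

LOW5_LEN = [2, 2, 2, 2, 2, 2, 2, 2,    # $00-$07: imm / (zp,x) / zp
            1, 2, 1, 2, 3, 3, 3, 3,    # $08-$0F: implied / imm / A / abs
            2, 2, 1, 2, 2, 2, 2, 2,    # $10-$17: rel / (zp),y / zp,x
            1, 3, 1, 3, 3, 3, 3, 3]    # $18-$1F: implied / abs,y / abs,x

EXCEPTIONS = {0x20: 3,                     # JSR abs
              0x00: 1, 0x40: 1, 0x60: 1}  # BRK, RTI, RTS as encoded

def insn_length(opcode):
    return EXCEPTIONS.get(opcode, LOW5_LEN[opcode & 0x1F])

assert insn_length(0xA9) == 2   # LDA #imm
assert insn_length(0x8D) == 3   # STA abs
assert insn_length(0x20) == 3   # JSR
assert insn_length(0x60) == 1   # RTS
```

This trades the 64-byte bit-array for 32 table entries plus a few special cases, which is the size argument soci is making.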
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: From the processor's viewpoint the least significant 5 bits of opcode fully determines the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes.
It would be very slow, but couldn't you depack that stream without a table? :) Imagine placing a BRK after the opcode in the stream, then using the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Quoting KrillQuoting oziphantomit would actually consume more memory than taking the whole lot as a custom stream The two separate streams take exactly the same amount of memory as the conventional 6502 counterpart. Your example would translate to$a9,$8d,$a9,$8d,$a9,$8d
$7f,$15,$d0,$00,$84,$27,$d0,$84,$e7,$43 and it's plain to see that the op-codes would compress well, and data arguments too at some later point when using the same literals and addresses many times.
Of course, these two chunks will have to be re-combined after decompression, which adds an initial cost of about $0180 or so bytes (which in my case amortised for the 4K, so overall i saved a few pages).
But please, do look at the Artefacts source code at some point to see what i did.
My line is referencing a normal packer like exo, not your split system. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ah, sorry, didn't read thoroughly enough.
But i guess the compressor itself should be oblivious to the semantic details of the blob you serve it. It should not know about op-codes/arguments/data separation and just crunch everything as well as it can.
It should be served data it can compress well, and this is where any special coding takes place before final compression and after initial decompression.
In principle this is akin to the Burrows-Wheeler transform/move-to-front, which makes the data much more compressible. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting JackAsserIt would be very slow, but couldn't you depack that stream without a table! :) Imagine placing a BRK after the opcode in the stream then use the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! Excellent idea! However, i have a hunch that the code for all that would, in the end, still consume more compressed data than the rather straightforward approach with a table, which is probably better to compress. Not to speak of execution hazards with funny op-codes in an environment that doesn't really allow for sandboxing. :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Quoting JackAsserIt would be very slow, but couldn't you depack that stream without a table! :) Imagine placing a BRK after the opcode in the stream then use the IRQ to determine the opcode length by checking the pushed PC on the stack! Sneaky! Excellent idea! However, i have a hunch that the code for all that would, in the end, still consume more compressed data than the rather straightforward approach with a table, which is probably better to compress. Not to speak of execution hazards with funny op-codes in an environment that doesn't really allow for sandboxing. :)
Indeed, it's a bit tricky, but possible. For instance, RTI/RTS handle the return address a bit differently (one adds one, the other doesn't). But I guess having argument $02,$02 and then prefilling the stack return address properly should handle most degenerate cases. But yeah, there's a big chance it will compress worse in total tbh. |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
"RTI/RTS handles the return address a bit different "
wot? didn't know this. why? |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: "RTI/RTS handles the return address a bit different "
wot ? didnt knew this. why ?
Dunno, but it does. |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
iirc there was a longish thread explaining it on 6502.org |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswald"RTI/RTS handles the return address a bit different "
wot ? didnt knew this. why ? Why, because JSR pushes the return address MINUS 1 on stack while an interrupt simply pushes the return address, of course. :D
But seriously, as groepaz implied, the reason for that is likely due to some 6502 design intricacies. Something like JSR pushing next op-code minus 1 because PC at that stage isn't incremented yet to the next op-code, while interrupts are handled exactly between op-codes (after increasing PC). |
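The asymmetry Krill describes can be mimicked in a few lines (Python, purely illustrative):

```python
# JSR pushes the address of its own last byte (return address minus
# one) and RTS adds the one back; an interrupt pushes the exact
# return address and RTI restores it unchanged.

def jsr(stack, pc):
    stack.append(pc + 2)        # JSR is 3 bytes; push addr of 3rd byte
    return "subroutine"

def rts(stack):
    return stack.pop() + 1      # correct the off-by-one

def irq(stack, pc):
    stack.append(pc)            # exact address of the next instruction
    return "handler"

def rti(stack):
    return stack.pop()          # no correction needed

stack = []
jsr(stack, 0x1000)              # JSR at $1000; next insn at $1003
assert rts(stack) == 0x1003

irq(stack, 0x2000)              # interrupted just before $2000
assert rti(stack) == 0x2000
```

This is exactly why JackAsser's yield code has to add one (and push the status) when converting an RTS-style stack frame into an RTI-style one.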
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: iirc there was a longish thread explaining it on 6502.org
Ahh ok, didn't know. I was painfully aware of it when I implemented my preemptive multitasking to support yield (i.e. give up the CPU to another thread). Normally the task switch is performed by some interrupt (i.e. the return/continuation point for the thread is placed on the stack and resumable with an RTI). However, when I want to yield in "userspace", i.e. not wait until the next task switch, the task switch is performed by a subroutine and is thus resumable with an RTS. So I need to keep that in mind so that I don't accidentally mix up those two types of resume possibilities. Basically the yield call manipulates the stack to convert it from RTS-compatible to RTI-compatible by adding to the return address and pushing the status as well. Hairy, but it works... :) (And it's useless, but fun!) |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Why didn't you use BRK, which was basically invented for things like this? :) (It's the 6502 equivalent of TRAP.) |
| |
chatGPZ
Registered: Dec 2001 Posts: 11386 |
brk is more useful for systemcall type of things, not "yield" imho |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
But it triggers an interrupt from software. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: brk is more useful for systemcall type of things, not "yield" imho
It's for a demo part and not an OS, and I have many other IRQs going on, so I decided that 1) I don't want to clutter each handler with a B-flag check, 2) I couldn't take the risk of a possible uncontrolled extra stall in e.g. a tight multiplexer. But yeah, I considered BRK long and hard before I decided to do the manipulation manually (and thus needed semaphores to block the scheduler from interrupting a "user mode" task switch). |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
(Sorry for being OT, one thing led to another...) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
in my first demo I had "multitasking" with raster irqs. for about 20% of each frame I switched to the loader, then back. So only one type of return address, but I had other problems: the two threads messed up each other's stacks, so I simply copied each one's stack (not all of it) back and forth to a temp buffer. also swapped all registers. worked nicely. today I'd just halve the stack and replace S tho. |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: in my first demo I had "multitasking" with raster irqs. about in 20% for each frame I switched to loader then back. So only one type of return adresses, but I had other problems, the two threads messed up each other's stacks, so I simply copyed back and forth each one's stack (not all of it) to a temp buffer. also all register swapped. worked nicely today I'd just halve the stack and replace S tho.
I do that too, but split the stack properly. 25% to loader, 75% to plotter, loader yields when not loading. |
| |
oziphantom
Registered: Oct 2014 Posts: 490 |
Make 128 demos, problem solved :D |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
Quote: I do that too, but split the stack properly. 25% to loader, 75% to plotter, loader yields when not loading.
you could just set some flag to tell the irq to stop scheduling the loader in, instead of that super sci-fi method? :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: you could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :)
I did consider that also, but it was too boring to code |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
well yeah certainly a more sexy solution :) brk could work too tho if you replace all your raster IRQs with NMIs :) |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
NMI happening during BRK will cause BRK to be ignored. See http://visual6502.org/wiki/index.php?title=6502_BRK_and_B_bit#N.. |
| |
TNT Account closed
Registered: Oct 2004 Posts: 189 |
@Krill: did you try separate streams for immediate/low-byte operands and high bytes? I found that high bytes compress better when separated from the low bytes.
@soci: no need to add special cases for RTI/RTS - just make the preprocessor warn if the following byte (fake operand) is non-zero. Additionally, the preprocessor can replace the zero with any other value if that improves compression. JSR needs to be handled separately, of course. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
TNT: I don't remember trying that. Also the RTS/RTI padding is a good idea. Certainly something to play around with for the next 4K of mine! :D |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Quoting Oswaldyou could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :) Why is the obvious solution of simply splitting the stack (TSX/TXS exists) the "super sci-fi" method and painstakingly copying it back and forth not? :) |
| |
JackAsser
Registered: Jun 2002 Posts: 2014 |
Quote: Quoting Oswaldyou could just set some flag to tell the irq to stop scheduling the loader in, insead of that supper scifi method? :) Why is the obvious solution of simply splitting the stack (TSX/TXS exists) the "super sci-fi" method and painstakingly copying it back and forth not? :)
He was referring to implementing yield vs. pausing/resuming a thread |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Ah, thanks. I should not morning-post before the third coffee. :D |
| |
lft
Registered: Jul 2007 Posts: 369 |
Quoting sociFrom the processor's viewpoint the least significant 5 bits of opcode fully determines the length of the addressing mode.
Still it seems that BRK/JSR/RTI/RTS are exceptions. But that's only because the modification of PC is covering up that they're actually 2 bytes long.
I think if you reduce the table size the result will be shorter even with the special handling added in for these opcodes.
Also remember that this doesn't need to be 100% accurate. It's just a heuristic, and things will work as long as the encoder and the decoder make exactly the same errors. So treating jsr as having a one-byte operand (or brk/rti/rts/ldy#/cpy#/cpx# as having two-byte operands) will throw things out of sync for a while, but the compression ratio might not increase very much, and may even accidentally decrease for some input data. Meanwhile the decoder would be greatly simplified. |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Yes, but i'd try to get things back in sync again as quickly as possible. In an environment where every byte counts, "tainting" the streams by switching their data may (or may not) result in some losses.
And the question would be how often do the streams get in and out of alignment? Surely you want to keep that number low, otherwise the entire exercise is moot.
Now, i really like the idea of padding in zeros for simplified decoding, which may avoid unaligned streams altogether and have no drawbacks in comparison.
But i guess the goal would be to make the op-code lengths as regular as possible, possibly even getting rid of the table altogether in favour of a simple expression, while keeping the streams as well-aligned as possible by padding and/or allowing transient alignment errors only on rare operations (prefer brk/rti/rts/jsr over lda#/cmp#/etc.) with a quick recovery (also via some sort of padding?). |
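The zero-padding idea can be illustrated with a toy splitter that gives every instruction a fixed three-byte slot, so the opcode and operand streams never drift out of alignment (a sketch; `length_of` stands in for whatever length heuristic the codec uses):

```python
def split_streams(code, length_of):
    """Split machine code into an opcode stream and an operand stream.

    Every instruction gets a fixed slot: one opcode byte plus two
    operand bytes, zero-padded. The streams stay in lockstep, so the
    decoder needs no length table at all, at the cost of feeding the
    compressor some extra zeroes.
    """
    opcodes, operands = bytearray(), bytearray()
    i = 0
    while i < len(code):
        n = length_of(code[i])
        opcodes.append(code[i])
        ops = code[i + 1:i + n]
        operands += ops + bytes(2 - len(ops))  # pad short operands with zeroes
        i += n
    return bytes(opcodes), bytes(operands)
```

Runs of zeroes compress well, so the padding may cost little while buying guaranteed alignment.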
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Oh, and of course, most illegals and also quite a few regular operations aren't used much (or can be replaced), especially in size-constrained productions. This may also have a significant influence on the "regularity" of the instruction lengths. |
| |
TBH
Registered: Mar 2010 Posts: 21 |
I've played around with what might be considered code VM systems for the C64, with varying results:
1) Event Management System
The use of Events allows code to be deferred ("fire and forget") within an imperative program design. It's basically a set of tables, pointers and indices tied together with a few routines.
The basic features of such a system are:
* Subscribe To An Event - Add a subroutine address to an Event's list of Handlers.
* Unsubscribe From An Event - Remove a subscription.
* Raise An Event - Notify the Event system that an Event of a certain type occurred.
* Event Manager - (called from main loop) Executes the subscribed Handlers for each raised event in its buffer. Events can be queued and prioritised, eg animation Events handled once every 8 frames.
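In pseudo-Python, the skeleton of such an Event system might look like this (a sketch of the four features listed above, without the prioritisation and every-N-frames scheduling):

```python
class EventManager:
    def __init__(self):
        self.handlers = {}   # event name -> list of subscribed routines
        self.queue = []      # raised events waiting for the next pass

    def subscribe(self, event, handler):
        # Add a routine to an Event's list of Handlers.
        self.handlers.setdefault(event, []).append(handler)

    def unsubscribe(self, event, handler):
        # Remove a subscription.
        self.handlers.get(event, []).remove(handler)

    def raise_event(self, event, *args):
        # Note that an Event occurred; nothing runs yet ("fire and forget").
        self.queue.append((event, args))

    def dispatch(self):
        # Called from the main loop: run the Handlers for every
        # Event raised since the previous pass.
        pending, self.queue = self.queue, []
        for event, args in pending:
            for handler in self.handlers.get(event, []):
                handler(*args)
```

On the C64 the dictionaries and lists would of course be flat tables with indices, but the control flow is the same.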
2) Reactive (Event-Driven) Extended Finite State machine
A state machine compatible with OMG's UML2 specification. Relatively slow, but can handle complex sets of transitions. However, it's probably overkill for most purposes.
3) Decision Table
Similar to a lookup table, but with more features and a routine that (optionally) compiles the data and executes the logic expressed by the data and structure.
A table consists of a set of Rules. Each Rule has a set of Actions that are executed if a set of Conditions are met.
The benefits:
a) Rules are expressed explicitly when tabularised, instead of (often) implicitly when in code.
b) No IF-THEN code is required.
Many different tables can exist, but only one routine is needed to execute them all. This minimises code, and the payoff can be big in a program that needs to evaluate a lot of conditions. Alternatively, a code generator can emit a code representation of the table.
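The "one routine executes any table" variant can be sketched like so (the optional compile-to-code step is left out):

```python
def run_table(rules, state):
    """Execute every Rule whose Conditions all hold.

    rules: list of (conditions, actions) pairs, where conditions is a
    list of predicates over state and actions is a list of procedures
    applied to state. One generic routine, any number of tables.
    """
    for conditions, actions in rules:
        if all(cond(state) for cond in conditions):
            for action in actions:
                action(state)
```

Each table is pure data, so the same interpreter serves every subsystem that needs rule evaluation.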
4) Inline Function Parameters
This is simple but useful in that it remains slightly more readable than using a data table outside the code. Implementations can vary, but they all involve manipulating the return address on the stack. The result is code like this:
; PrintAt routine params: Column, Row, Colour, @-terminated string
JSR PrintAt
.byt 10, 2, LtBlue
.asc "The quick brown fox@"
JSR PrintAt
.byt 10, 3, White
.asc "jumps over the lazy dog.@" |
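The underlying mechanism can be mimicked in a high-level sketch: the routine reads its arguments from the bytes at the "return address" and hands back an adjusted address past them (on the 6510 this means pulling the return address off the stack, walking past the inline data, and pushing the corrected address back):

```python
def print_at(mem, pc, out):
    # pc points at the inline parameter block right after the JSR:
    # column, row, colour, then an '@'-terminated string.
    col, row, colour = mem[pc], mem[pc + 1], mem[pc + 2]
    pc += 3
    text = bytearray()
    while mem[pc] != ord('@'):
        text.append(mem[pc])
        pc += 1
    out.append((col, row, colour, text.decode("ascii")))
    return pc + 1  # execution resumes just past the terminator
```

Here `out` merely collects what a real routine would print; the interesting part is the returned address.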
| |
lft
Registered: Jul 2007 Posts: 369 |
I've implemented the Z-machine for C64+REU. Zeugma 1.2 |
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Once upon a time, AEG was working on a VM for a 4K demo of his, which, to my knowledge, was never finished. It implemented 32-bit arithmetic (including MUL and DIV) via an orthogonal instruction set operating on virtual registers.
I wonder if, for a purpose like that, a stack-machine approach similar to Intel's 8087 would be well suited. |
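For illustration, a toy 32-bit stack machine in the 8087 spirit might look like this (unsigned wrap-around arithmetic and integer division; purely a sketch, not AEG's actual design):

```python
MASK = 0xFFFFFFFF  # 32-bit wrap-around

def execute(program, stack=None):
    # Operands live on a small value stack, so arithmetic opcodes
    # need no register operands at all; only "push" carries data.
    stack = stack or []
    for op, *arg in program:
        if op == "push":
            stack.append(arg[0] & MASK)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append((a * b) & MASK)
        elif op == "div":
            b, a = stack.pop(), stack.pop()
            stack.append(a // b)
    return stack
```

The upside for a 4K is visible even in the sketch: operand-free opcodes make the bytecode very dense.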
| |
White Flame
Registered: Sep 2002 Posts: 136 |
Having played with a bunch of VM strategies on the 64, I'd say that a stack machine is nice for numeric evaluation, but not for general purpose use. If you're only going to have one or the other, I'd stick with some notion of register file. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
I once coded a 6510-emu routine to be run on C64. :D Sounds like one of the more reasonable things to do, right? ;)
It actually executed the instructions, one after another, setting/restoring cpu flags and such, but it checked some constraints before and after executing the opcodes.
(The purpose was to be able to execute randomly generated 6510 code, but to be able to disallow access to certain memory areas and stuff like that e.g. to avoid crashes. It was related to some experimentation with Genetic Programming.) |
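The guard itself can be very small. A sketch of the kind of write check such a constrained emulator might apply before committing each store (names and shapes are illustrative, not Frantic's actual code):

```python
class SandboxViolation(Exception):
    pass

def guarded_store(mem, addr, value, lo, hi):
    # Refuse any write outside the [lo, hi) workspace before it
    # happens; a constrained emulator would run this check for every
    # store the randomly generated code attempts, instead of letting
    # it crash the machine.
    if not lo <= addr < hi:
        raise SandboxViolation("write to $%04x outside workspace" % addr)
    mem[addr] = value
```

Reads, jumps and stack operations would get analogous checks, which is exactly why interpreting beats running the code raw for this purpose.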
| |
Krill
Registered: Apr 2002 Posts: 2980 |
Frantic: Interesting! Did your evil experiments go anywhere? What did you train the software for? What were the fitness functions? After which generation did the software do anything noteworthy? Did it become self-aware? :) |
| |
Oswald
Registered: Apr 2002 Posts: 5094 |
frantic you could have just inserted a brk _before_ each instruction, and the irq would have examined whether it's within limits :) |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Hehe.. Yes, I pulled the plug and stopped being active in the scene when the C64 threatened to kill me.
Well.. No, I didn't get very far. If I remember correctly I think Alih came up with some "code a c64 game" competition that I made a small game for instead of continuing with the 6510-emu thing. That game was also based on similar evolutionary principles, but I didn't make it before the deadline, so I just stopped. It was released as a preview by Triad though, a whole bunch of years later:
Drone Processor Preview
From an evolutionary perspective that game is not very interesting though, as precisely that part was what needed some work. So not much came out of that either. :D That game didn't generate 6510 code; I just made some sort of simple VM without much thought at all behind it. Just fiddling around for fun.
...and then I coded defMON instead. :)
http://toolsforscholars.com/defmon/doku.php
Defmon &D
Sorry for being off topic. |
| |
Frantic
Registered: Mar 2003 Posts: 1648 |
Oswald: I guess I could have done that, but if I remember correctly I didn't want to insert anything in the code. I wanted it to have precisely the form it would have when being executed for real, so that potentially an opcode could be used as an operand and vice versa, depending on where things branched, and that it could self modify in whatever way (as long as it stayed within its workspace, so to speak) during the execution, and so forth... |
| |
Digger
Registered: Mar 2005 Posts: 437 |
A little off-topic but I remember MARS being implemented in Core Wars (http://www.koth.org/info/corewars_for_dummies/dummies.html)
It would be cool to run some OOP on C64 as TBH suggests :) |
| |
ChristopherJam
Registered: Aug 2004 Posts: 1409 |
In terms of making more compressible code, I wonder if it would be worth normalising most or all instructions to three bytes long?
A table of all the numbers from 0 to 255 on page $ea would let you turn all immediate operands into absolute, while padding all branches with the same third byte as the immediates. There would be fewer distinct opcodes in the stream to be compressed, too.
Of course, that still leaves the single byte instructions. Pack these into groups of one to three opcodes, padded with a nop or two? |
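A sketch of the immediate-to-absolute rewrite (it covers only the cc=01 immediates, where the absolute opcode is the immediate one plus 4; LDX#/LDY#/CPX#/CPY# need a different adjustment, and one-byte instructions are left untouched here):

```python
IDENTITY_PAGE = 0xEA  # $EA00-$EAFF holds the bytes $00..$FF in order

def normalise(insn):
    """Rewrite one instruction into a fixed three-byte form where possible."""
    op = insn[0]
    if op & 0x1F == 0x09:
        # Immediate (ORA#, AND#, ADC#, LDA#, CMP#, SBC#, ...) becomes an
        # absolute read into the identity table: LDA #$05 -> LDA $EA05.
        return bytes([op + 4, insn[1], IDENTITY_PAGE])
    if op & 0x1F == 0x10:
        # Relative branch: pad with the same fixed third byte, which
        # the decoder strips again.
        return bytes([op, insn[1], IDENTITY_PAGE])
    return bytes(insn)  # already 3 bytes, or needs separate treatment
```

After the rewrite every covered instruction is opcode/lo/hi with $EA as a very frequent third byte, which is the regularity the compressor is meant to exploit.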
| |
Krill
Registered: Apr 2002 Posts: 2980 |
I'm sceptical. Assuming it would pay off in terms of compression ratio, you'd end up with either pretty inefficient code or rather expensive decoding (in terms of size, to recover the original 1- or 2-byte instructions). |