    lda #%11111111
    sbx #%10000000
    bcs .x1
    eor #%01111111
    eor pixels,y
    sta pixels,y
    iny
    lda #%01111111
.x1 sbx #%01000000
    bcs .x2
    eor #%00111111
    eor pixels,y
    sta pixels,y
    iny
    lda #%00111111
.x2 sbx #%00100000
    bcs .x3
    . . .
My favorite trick is to precalc the data, since doing realtime math on a 1 MHz 8-bit CPU seems a bit too ambitious for me.
E.g. the face hidden status, or any other status that typically stays the same for a number of frames.
The technique you describe (a bitarray keeping track of which blocks of the screen must be xor filled and which not) is very close to what I did for the big cube viewed from the inside in Natural Wonders. It doesn't use a bitarray but one extra byte per char on the screen, which makes the line drawing and filling faster (you don't need any and/ora/xor for manipulation and testing) at the cost of <1kb. I think using char resolution for this extra data is the obvious choice, because that way you can use precalculated chars for the areas that don't need to be xor filled. I remember TTS, Graham and me discussing this technique back in 1994 as if it was yesterday.
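To make the one-byte-per-char idea a bit more concrete, here is a rough Python model of a single screen column. It is only a sketch of the logic, not the Natural Wonders code: the names `eor_fill_full`, `eor_fill_flagged` and `mark_touched` are made up for illustration, and on the real machine the flags would be set by the line drawer rather than derived by scanning the column afterwards.

```python
CHAR_ROWS = 8  # pixel rows per character cell


def eor_fill_full(column):
    """Reference filler: EOR every byte of the column into a running value."""
    out, acc = [], 0
    for byte in column:
        acc ^= byte
        out.append(acc)
    return out


def mark_touched(column):
    """One flag byte per char cell: 1 if any row in the cell holds line pixels.

    Assumption: a real line drawer would set these flags while plotting.
    """
    cells = len(column) // CHAR_ROWS
    return [1 if any(column[c * CHAR_ROWS:(c + 1) * CHAR_ROWS]) else 0
            for c in range(cells)]


def eor_fill_flagged(column, touched):
    """Fill the column, EORing only inside flagged cells.

    In an unflagged cell every source byte is zero, so the running value
    cannot change and a plain store (or a precalculated char) is enough.
    """
    out, acc = [], 0
    for cell, flag in enumerate(touched):
        rows = column[cell * CHAR_ROWS:(cell + 1) * CHAR_ROWS]
        if flag:
            for byte in rows:
                acc ^= byte
                out.append(acc)
        else:
            out.extend([acc] * CHAR_ROWS)
    return out
```

Both fillers give the same column; the flagged one just skips the EOR wherever the flag byte says nothing was drawn.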
The math is very simple and can be very fast. Natural Wonders does everything 100% realtime, yet most of the objects run at 25 fps. And that with far more than 8 bit math (matrix is 24 bit, rotation and z-scaling 16 bit etc).
In 50% of the cases you just need to EOR the signs of all the deltas and you know the hidden status. In the other 50% you can try to choose other vectors so that the EOR-sign-check works again, or simply do those two muls, which are also quite fast. But for platonic objects like in Natural Wonders you can do a far easier check: simply compare the Z-value of the face midpoint against a visibility threshold.
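As I read it, the sign trick works on the usual 2D cross product dx1*dy2 - dy1*dx2 of two edge vectors of the projected face. Below is a small Python model of that reading. The function names, the winding convention (negative cross product means hidden) and the explicit handling of zero deltas are my assumptions, not something taken from Natural Wonders.

```python
def sign_bit(v):
    """Top bit of a signed value: 1 for negative, 0 otherwise."""
    return 1 if v < 0 else 0


def face_hidden(p0, p1, p2):
    """Backface test on projected (x, y) points.

    Assumed convention: the face is hidden when
    dx1*dy2 - dy1*dx2 is negative.
    """
    dx1, dy1 = p1[0] - p0[0], p1[1] - p0[1]
    dx2, dy2 = p2[0] - p0[0], p2[1] - p0[1]

    if 0 not in (dx1, dy1, dx2, dy2):
        sign_a = sign_bit(dx1) ^ sign_bit(dy2)  # sign of dx1*dy2
        sign_b = sign_bit(dy1) ^ sign_bit(dx2)  # sign of dy1*dx2
        if sign_a ^ sign_b:
            # The two products have opposite signs, so their difference
            # has the sign of the first one: no multiplication needed.
            return sign_a == 1

    # Signs agree (or a delta is zero): fall back to the two muls.
    return dx1 * dy2 - dy1 * dx2 < 0
```

When the fallback is hit, picking a different pair of edges of the same face may give deltas whose signs disagree again, which is the "choose other vectors" option mentioned above.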
I just don't see how you do it. I've only got 16-bit precision in the matrix and 8-bits (9 really..) for the transformations, plus lookup tables for everything, manually built vertices and so forth, yet it *still* takes over 2500 cycles to process a damned cube.
But.. That would be cheating.. =)
* Implement a good frame rate counter early in the process. Many "optimizations" really aren't optimizations at all because the overhead is simply too great. For example, an unrolled EOR-filler takes approx (4+4)*128 = 1024 cycles to fill a column. A potential speed-up would be to EOR-fill only the chars that contain lines and simply STA for the rest. At most you can gain 4*128 = 512 cycles (per column). Any extra overhead in the line drawer, or in the code modifier (JSR+RTS) that switches between the EOR areas and the STA areas, quickly eats up those cycles. So DO add a frame rate counter FIRST so that you can see that you really get bang for your buck.
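The arithmetic above is easy to play with. A throwaway Python sketch, using only the cycle figures from the post (4 for the EOR, 4 for the STA, 128 rows per column); `column_cost` and `max_overhead` are hypothetical helper names, and the real numbers depend on addressing modes and page crossings.

```python
ROWS = 128        # rows per column, as in the example above
EOR_CYCLES = 4    # per-byte cost of the EOR, figure taken from the post
STA_CYCLES = 4    # per-byte cost of the STA, figure taken from the post


def column_cost(eor_rows, overhead=0):
    """Cycles to fill one column when only eor_rows rows need the EOR."""
    sta_only_rows = ROWS - eor_rows
    return (eor_rows * (EOR_CYCLES + STA_CYCLES)
            + sta_only_rows * STA_CYCLES
            + overhead)


def max_overhead(eor_rows):
    """Largest per-column overhead that still beats the plain EOR filler."""
    return column_cost(ROWS) - column_cost(eor_rows)
```

With half the rows needing the EOR, `max_overhead(64)` is only 256 cycles per column, and that budget has to cover all the extra bookkeeping in the line drawer and the filler.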
This one can't be emphasised enough... I'd suggest taking it to an even lower level than that though. Three minutes spent patching VICE (consider a one line patch in cpu.c that printf's the cycle count and current PC address) and a few scripts to sift through the output (try http://artificial-stupidity.net/~alih/ , process-log.c and profiler.py) can help you a lot. Or at least it helped me a lot. YMMV.
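For anyone who wants to roll their own sifting script, here is a minimal sketch of the idea in Python. It is not the profiler.py mentioned above, and the log format is an assumption: one line per executed instruction, containing the running cycle count in decimal and the PC in hex, separated by whitespace.

```python
from collections import Counter


def profile(lines):
    """Sum the cycles spent at each PC address.

    Assumed log line format: "<cycle count, decimal> <pc, hex>".
    The cycles between two consecutive lines are charged to the
    instruction at the earlier PC. Malformed lines are skipped.
    """
    totals = Counter()
    prev_cycle = prev_pc = None
    for line in lines:
        fields = line.split()
        if len(fields) != 2:
            continue
        cycle, pc = int(fields[0]), int(fields[1], 16)
        if prev_pc is not None:
            totals[prev_pc] += cycle - prev_cycle
        prev_cycle, prev_pc = cycle, pc
    return totals


def hottest(totals, n=10):
    """Return the n addresses that consumed the most cycles."""
    return totals.most_common(n)
```

Feed it the lines of the emulator log, e.g. `hottest(profile(open("trace.log")))`, where `trace.log` is just a placeholder file name.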