    lda #%11111111      ; mask: all pixels in this column
    sbx #%10000000      ; illegal SBX/AXS: X = (A AND X) - #imm, carry clear on borrow
    bcs .x1             ; no borrow: fall through to the next step
    eor #%01111111      ; borrow: %11111111 -> %10000000
    eor pixels,y        ; merge the mask into the bitmap with EOR
    sta pixels,y
    iny                 ; advance to the next bitmap byte
    lda #%01111111
.x1 sbx #%01000000      ; same step with the next bit of the error term
    bcs .x2
    eor #%00111111      ; %01111111 -> %01000000
    eor pixels,y
    sta pixels,y
    iny
    lda #%00111111
.x2 sbx #%00100000
    bcs .x3
    ...
This one can't be emphasized enough... I'd suggest taking it even lower level than that, though. Three minutes spent patching VICE (consider a one-line patch in cpu.c that printf's the current cycle count and PC) plus a few scripts to sift through the output (try http://artificial-stupidity.net/~alih/ , process-log.c and profiler.py) can help you a lot. Or at least it helped me a lot. YMMV.
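A hedged sketch of the sifting side, in the spirit of the linked process-log.c but not copied from it: it assumes the VICE patch emits one "cycle pc" line per executed instruction, e.g. something like printf("%lu %04x\n", maincpu_clk, reg_pc) at the opcode fetch (those identifier names are assumptions and differ between VICE versions; grep the CPU core for the fetch point). The program then charges the cycles elapsed between consecutive fetches to the earlier PC and prints a cycles-per-address profile:

    /* Offline profiler sketch.  Input format (assumed): one line per
     * executed instruction, "<cycle> <pc-in-hex>", produced by a printf
     * patched into VICE's CPU core.  Cycles between consecutive lines
     * are attributed to the PC of the earlier line. */
    #include <stdio.h>

    int main(void)
    {
        static unsigned long cycles_at[0x10000]; /* one bucket per address */
        unsigned long clk, prev_clk = 0;
        unsigned int pc, prev_pc = 0;
        int have_prev = 0;

        while (scanf("%lu %x", &clk, &pc) == 2) {
            if (have_prev)
                cycles_at[prev_pc & 0xffff] += clk - prev_clk;
            prev_clk = clk;
            prev_pc = pc;
            have_prev = 1;
        }
        for (unsigned int a = 0; a <= 0xffff; a++)
            if (cycles_at[a])
                printf("$%04x: %lu cycles\n", a, cycles_at[a]);
        return 0;
    }

Sorting that output by cycle count points you straight at the hot loops, which is usually all you need.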
Then allow me to speculate ;) How about keeping a separate bit array with one element per screen tile? When drawing a line you'd plot a separate low-resolution outline into the bit array, i.e. a conservative estimate of which cells in the real bitmap the line might cover. The standard line algorithm should work if modified to mark both adjacent tiles when moving diagonally; see the sketch below. At least that's a scheme I tinkered with, without ever getting it to work, I might add... But I guess setting up and clipping all those extra lines would be a tad too costly to be practical.
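To make the idea concrete, here's a minimal sketch in C (easier to read than asm for this), assuming the C64's 40x25 grid of 8x8-pixel cells and a standard all-quadrant Bresenham; the cell names and the test line in main are made up for illustration. On each diagonal step it also marks the two cells flanking the corner, so the estimate stays conservative:

    #include <stdio.h>
    #include <stdlib.h>

    #define CELLS_X 40   /* C64 char cells across... */
    #define CELLS_Y 25   /* ...and down */

    static unsigned char covered[CELLS_Y][CELLS_X];

    static void mark(int cx, int cy)
    {
        if (cx >= 0 && cx < CELLS_X && cy >= 0 && cy < CELLS_Y)
            covered[cy][cx] = 1;
    }

    /* Bresenham over cell coordinates; on a diagonal step, also mark
     * the two cells flanking the corner the line cuts past. */
    static void cover_line(int x0, int y0, int x1, int y1)
    {
        int dx = abs(x1 - x0), sx = x0 < x1 ? 1 : -1;
        int dy = -abs(y1 - y0), sy = y0 < y1 ? 1 : -1;
        int err = dx + dy;

        for (;;) {
            mark(x0, y0);
            if (x0 == x1 && y0 == y1)
                break;
            int e2 = 2 * err;
            int stepped_x = 0;
            if (e2 >= dy) { err += dy; x0 += sx; stepped_x = 1; }
            if (e2 <= dx) {
                err += dx;
                y0 += sy;
                if (stepped_x) {          /* moved diagonally: */
                    mark(x0 - sx, y0);    /* cell passed vertically   */
                    mark(x0, y0 - sy);    /* cell passed horizontally */
                }
            }
        }
    }

    int main(void)
    {
        cover_line(1, 1, 30, 12);   /* arbitrary test line, cell coords */
        for (int y = 0; y < CELLS_Y; y++, putchar('\n'))
            for (int x = 0; x < CELLS_X; x++)
                putchar(covered[y][x] ? '#' : '.');
        return 0;
    }

The over-marking on diagonals is exactly what makes it a conservative outline rather than an exact one, which is the cheap part; the clipping overhead the post worries about sits outside this loop.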
But as usual my philosophy is: "why do it in realtime if it can be precalculated without taking up too much memory?"