Or you could look at the whole timing problem from a slightly different perspective. That is you might opt to program the sprites as early as possible rather than as late as possible, e.g. right after the previous shape has finished with the sprite channel you're attempting to reuse. That way there's no need to estimate raster timing at all, with the drawback that some sprite which would otherwise be displayed imperfectly simply won't show up at all.
As for speeding up the interrupt handler itself about the best you can do is to unroll it completely and keep a separate interrupt handler per-sprite. That way you can poke sprite values directly into LDA immediates and STA to absolute addresses instead of indexing anything. Whether or not the cycle savings are worth the space depends on your game and how tightly you'll want to pack the sprites, but I've found it a useful method.
I've considered that, but it only works if sprites are never Y-expanded. If some sprites are Y-expanded and some aren't, the place where each sprite ends could not be in the same order as the order where each sprite starts. As the result you'll have to ressort to 2 sorting routine, one to sort the start coordinates as usual, and a second to sort the end coordinate which are the same plus 21/42. So it would become significantly more CPU intensive during VBlank, and more complicated logic overall. Altough you'd save the mess with $d015, as all sprites could be always enabled, and write the first 8 active sprites during VBlank.
Having one IRQ per software sprite could possibly save some time, as you could do lda immediate *but* not only you'd have to change the IRQ vector each time or end up using a jump table or something like that and it would also slow things down. Not to mention the arguemnt of lda immediate would be complicated to change inside the VBlank, and you'll have to have the almost exactly same code about 16-20 times in RAM, and I think it should be a waste to reserve up to 2-3 kb of memory for IRQ code only to save a couple of cycles.
;roughly 75 cycles and 50 bytes per IRQ irq0 sta irq_save_a mux_x0 lda #$00 sta $d000 mux_c0 lda #$00 sta $d027 mux_p0 lda #$00 sta $07f8 mux_m0 lda #$00 sta $d01d mux_y0 lda #$00 sta $d001 asl $d019 mux_l0 lda #$00 sta $d012 cmp $d012 bcc irq1+2 beq irq1+2 lda #<irq1 sta $fffe ; inc $ffff ;every fifth time or so.. lda irq_save_a rti irq1 sta irq_save_a mux_x1 lda #$00 sta $d002 mux_c1 lda #$00 sta $d028 . . .
General-purpose multiplexers rarely bother with any of the bit-packed attributes other than the x bits.
To handle y-expansion effectively I think you'll need to sort by both the top y coordinate and the bottom y coordinate anyway regardless of the interrupt method. At any rate the hardware sprites won't be reprogrammed in strict sequential order anymore.
On top of that there's a matching (unrolled) writer "loop" which goes through the sorted sprite list and pokes the sprite attributes into the immediate constants, compiles the most-significant x bytes, and a few other things. The nice thing is that it doesn't really cost me any extra cycles since I had to preserve the sprite data somewhere anyway.
I don't see the problem as long as I make the interrupt before the sprite, since the starting coordinate is the same regardless if the sprite is Y-expanded. I only regard whether Y-expansion is enabled when computing if sprites overlap, and if there is more than 8 on a scanline
Then I guess I made one of the first true general multiplexer. How could you call it a general multiplexer if it imposes some restrictions to your flags ? I have to admit that in-game sprites will mostly be multicolor, will mostly never be X or Y expanded and will mostly appear on the front of the background, but it's nice to have an engine that isn't that restricting, especially since I plan to have it enabled at all times.
PS : Commodore were really stupid they should have implemented X and Y *inversion* instead of expansion, it would be MUCH more usefull for in-game use.
However I guess I can do inversion by software without having to store the graphics twice.
Imagine that you put one y-expanded sprite on the first scanline. Then starting one line below that you add seven non-expanded sprites. Finally lets add yet another sprite 24 lines below that. Now, the point here is that if you simply sorted by the topmost y coordinate and tried to use the sprites in sequential order then the multiplexer would try to use the same hardware sprite for both the topmost (y-expanded) software sprite and the final bottommost sprite. Clearly not optimal.
I guess the routine is longer than those extra sprite gfx.