The most significant change, from an effort point of view at least, was unrolling the shifted (non-byte-aligned) sprite rendering routine. That took a bit to get right. Previously it was also reading and writing each video byte twice; that's now remedied too. FWIW it makes <1fps difference on the 'moving block' screen.
It's worth noting that it requires 75 cycles to render a single (shifted) byte. That entails reading 4 bytes from data memory, performing 4 table lookups (across 2 different tables) before a read-modify-write of a single byte in video memory.
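For anyone wondering where those reads and lookups come from, here's a rough sketch of one way a shifted byte can be assembled. This is not the actual 6809 routine: the function names, the mask-plus-data byte layout and the "set mask bit means the sprite covers that pixel" convention are all assumptions made for illustration. The idea is that each video byte straddles two adjacent sprite bytes, so it needs a mask and a data byte from each (4 reads), and each of those goes through either a "stays in this byte" table or a "spills into the next byte" table (4 lookups, 2 tables).

```python
# Hypothetical sketch only: names, byte layout and mask polarity are assumptions,
# not taken from the real routine.

def build_shift_tables(shift):
    """Build two 256-entry lookup tables for a pixel shift of 1..7.

    hi[b] is the part of sprite byte b that lands in the current video byte;
    lo[b] is the part that spills over into the next video byte to the right.
    """
    hi = [b >> shift for b in range(256)]
    lo = [(b << (8 - shift)) & 0xFF for b in range(256)]
    return hi, lo


def render_shifted_byte(video, offset, prev_mask, prev_data,
                        cur_mask, cur_data, hi, lo):
    """Render one video byte from two adjacent sprite (mask, data) pairs."""
    # 4 table lookups across 2 tables: spill-over from the previous sprite
    # byte combined with the shifted-down part of the current one.
    mask = lo[prev_mask] | hi[cur_mask]
    data = lo[prev_data] | hi[cur_data]
    # Single read-modify-write: punch a hole with the mask, then OR in the data.
    video[offset] = (video[offset] & ~mask & 0xFF) | data
```

With a shift of 4, a single sprite byte ends up spread over two video bytes: the first call passes zeroes for the "previous" pair, the second passes zeroes for the "current" pair. On the real hardware the table for each shift amount would be precomputed once rather than rebuilt per sprite.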
I also opted to duplicate the (small) routine that calculates the video buffer address from the X,Y position. It originally returned the result in U, but the calling code always then transferred it to either X or Y, depending on whether it was used as a source or destination pointer. 16-bit register transfers are actually surprisingly expensive (6 cycles), and one case required two transfers in order to preserve U as well. I also optimised the calculation itself to save a few cycles.
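For reference, the address calculation is roughly the following arithmetic. This is a sketch under assumptions rather than the real routine: it assumes a linear 1-bit-per-pixel buffer that is 256 pixels (32 bytes) wide, and `VIDEO_BASE`, `BYTES_PER_LINE`, `video_address` and `pixel_shift` are placeholder names. It obviously can't show the register juggling, only the maths.

```python
# Hypothetical sketch: buffer geometry and all names below are assumptions.

BYTES_PER_LINE = 32   # assumed: 256 pixels wide at 1 bit per pixel
VIDEO_BASE = 0x0000   # assumed placeholder for the start of the video buffer


def video_address(x, y):
    """Return the address of the video byte containing pixel (x, y)."""
    # y * 32 is a 5-bit left shift; x / 8 selects the byte within the line.
    return VIDEO_BASE + (y << 5) + (x >> 3)


def pixel_shift(x):
    """Return how many pixels (0..7) a sprite at x is offset within its byte."""
    return x & 7
```

A shift of 0 means the sprite is byte-aligned and can take the cheaper path; anything else goes through the shifted routine.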
I really need to profile the code properly to identify the bottlenecks. Chipping away at the more obvious optimisations isn't having much of an effect on fps.
And just because I haven't posted any pictures for a while...
Showing the fps counter lower right (49fps)
There's also definitely some subtle graphics corruption, or rather, garbage. It appears to be limited to human->wulf transformations, and only when in certain orientations. I'll do more experimenting to nail down the exact conditions, then see if it can be reproduced on the ZX Spectrum...
EDIT: Another upside of unrolling the sprite rendering loops is that it should be easier to add support for CPC (4-colour) graphics!