Friday, 26 August 2016

Keep them doggies unrollin'

More incremental optimisations...

The most significant, from an effort point-of-view at least, was unrolling the shifted (non-byte-aligned) sprite rendering routine. That took a bit to get right. Previously it was also reading & writing to each video byte twice; that's now remedied too. FWIW it makes <1fps difference on the 'moving block' screen.

It's worth noting that it requires 75 cycles to render a single (shifted) byte. That entails reading 4 bytes from data memory and performing 4 table lookups (across 2 different tables), followed by a read-modify-write of a single byte in video memory.
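As a rough illustration of why a shifted byte costs so much, here is a minimal Python sketch of the per-byte work described above. It is not the actual 6809 routine. It assumes the 4 bytes read are the mask and data bytes for two horizontally adjacent sprite bytes, that the two tables hold the "shifted right" and "shifted left" halves for the current pixel offset, and that a set mask bit means "sprite covers this pixel". The names `build_shift_tables` and `render_shifted_byte` are hypothetical.

```python
def build_shift_tables(shift):
    """Build two 256-entry lookup tables for a pixel shift of 1..7.

    right_tbl[b] is the part of byte b that stays in the current video byte;
    left_tbl[b] is the part of the previous byte that spills into it.
    """
    right_tbl = [b >> shift for b in range(256)]
    left_tbl = [(b << (8 - shift)) & 0xFF for b in range(256)]
    return right_tbl, left_tbl


def render_shifted_byte(video, addr, prev_mask, cur_mask,
                        prev_data, cur_data, right_tbl, left_tbl):
    """Combine two adjacent sprite bytes into one video byte."""
    # Four table lookups across two tables (assumed layout).
    mask = left_tbl[prev_mask] | right_tbl[cur_mask]
    data = left_tbl[prev_data] | right_tbl[cur_data]
    # A single read-modify-write of video memory:
    # clear the pixels the sprite covers, then OR in the sprite data.
    video[addr] = (video[addr] & ~mask & 0xFF) | data


video = bytearray([0xFF])
right_tbl, left_tbl = build_shift_tables(4)
render_shifted_byte(video, 0, 0x0F, 0xF0, 0x0A, 0x50, right_tbl, left_tbl)
```

Touching each video byte exactly once, as here, is what removes the double read and write mentioned earlier.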

I also opted to duplicate the (small) routine that calculated the video buffer address from X,Y position. It was originally returning the result in U; however, the calling code always then transferred it to either X or Y, depending on whether it was used as a source or destination pointer. 16-bit register transfers are actually surprisingly expensive (6 cycles), and one case required two transfers to preserve U as well. I also optimised the calculation itself to save a few cycles.
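For readers unfamiliar with this kind of routine, the calculation might look something like the Python sketch below. The screen layout is an assumption, not taken from the port: a linear 1 bit-per-pixel buffer, 32 bytes (256 pixels) per line, with Y counting down from the top. `VIDEO_BASE`, `BYTES_PER_LINE` and `video_address` are hypothetical names.

```python
BYTES_PER_LINE = 32   # assumption: 256 pixels wide at 1 bit per pixel
VIDEO_BASE = 0x0000   # hypothetical start of the video buffer


def video_address(x, y):
    """Return (byte_address, bit_shift) for the pixel at (x, y)."""
    byte_address = VIDEO_BASE + y * BYTES_PER_LINE + (x >> 3)
    bit_shift = x & 7   # non-zero means the sprite is not byte-aligned
    return byte_address, bit_shift


address, shift = video_address(13, 2)
```

In a high-level language one function serves every caller. On the 6809 the register the result lands in matters, which is why having one copy of the routine per destination register avoids the 6-cycle transfers.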

I really need to profile the code properly to identify the bottlenecks. Chipping away at the more obvious optimisations isn't having much of an effect on fps.

And just because I haven't posted any pictures for a while...

Showing the fps counter lower right (49fps)

There's also definitely some subtle graphics corruption, or rather, garbage. It appears to be limited to human->wulf transformations, and only when in certain orientations. I'll do more experimenting to nail down the exact conditions, then see if it can be reproduced on the ZX Spectrum...

EDIT: Another upside of unrolling the sprite rendering loops is that it should be easier to add support for CPC (4-colour) graphics!

Thursday, 25 August 2016

Unrollin' unrollin' unrollin'

Still haven't determined the main bottleneck but have made more progress.

After tweaking a few routines to eliminate unnecessary branches and some direct-page & register juggling I returned my focus to the sprite rendering routine. I was recently reminded that the original Z80 code used a separate execution path for byte-aligned sprites, so I decided to optimise that section.

After adding code to test for and then render specifically byte-aligned sprites, I then unrolled the loop for this case. To make matters worse (or better, depending on how you look at it), I had unnecessarily transferred a loop counter to & from DP memory and register B when I needed only to decrement the in-memory value. So worst case - the widest sprite is 5 bytes - it was taking 65 cycles per sprite line (the tallest sprite is 64 lines) just to test and loop. Contrast that with just 26 cycles set-up per sprite, and no per-line test and loop!
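To put those numbers in context: at 65 cycles per line, a 64-line sprite spent 64 x 65 = 4160 cycles on loop overhead alone, against a one-off 26 cycles now. The Python sketch below illustrates the general idea of choosing a width-specific, unrolled routine once per sprite. It is only an analogy for the 6809 code and reads "no per-line test and loop" as "no per-byte loop within each line", which is an assumption. The names `make_unrolled_blit`, `UNROLLED` and `draw_aligned_sprite`, the separate mask/data arrays and the 32-byte stride are all hypothetical.

```python
def make_unrolled_blit(width):
    """Generate a line-drawing function with one statement per byte."""
    body = "\n".join(
        f"    video[dst + {i}] = "
        f"(video[dst + {i}] & ~mask[src + {i}] & 0xFF) | data[src + {i}]"
        for i in range(width)
    )
    source = f"def blit_line(video, dst, mask, data, src):\n{body}\n"
    namespace = {}
    exec(source, namespace)   # compile the unrolled function
    return namespace["blit_line"]


# One unrolled routine per sprite width (the widest sprite is 5 bytes).
UNROLLED = {width: make_unrolled_blit(width) for width in range(1, 6)}


def draw_aligned_sprite(video, dst, mask, data, width, height, stride=32):
    """Draw a byte-aligned sprite; a set mask bit clears the background."""
    blit_line = UNROLLED[width]   # the only per-sprite set-up
    src = 0
    for _ in range(height):
        blit_line(video, dst, mask, data, src)   # no inner byte loop
        dst += stride
        src += width


video = bytearray([0xFF] * 64)
draw_aligned_sprite(video, 0, bytes([0xFF, 0x0F] * 2),
                    bytes([0x81, 0x02] * 2), width=2, height=2)
```

The trade-off is the usual one for unrolling: more code in exchange for dropping the counter and branch on every byte.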

Empty screens are now too fast to control (turning accurately is problematic) and my troublesome 'moving block' screen, which started out at 5fps, is now 9-10fps - and I'm yet to optimise the non-byte-aligned (shifted) sprite rendering logic! Having said that, patching the code to render all sprites as byte-aligned did little to improve the frame rate on this particular screen, though it does improve on other screens.

The above-mentioned 'moving block' screen appears to be puzzlingly slow, considering there's only a pair of blocks to animate. Patching the code again to remove the blocks did increase the frame rate by around 2fps, but I can see little opportunity to speed up this particular sprite handler.

However, during that process I noticed that a dozen or so routines branched (near/relative) to another routine that simply jumped (far/absolute) to a third routine. Whilst this could conceivably have been done to reduce code size (slightly), it is rather detrimental to performance, so I of course remedied the situation by 'removing the middleman' so-to-speak.

I've not compared the performance with the original Z80 code since starting on the optimisations, but I would guess that it's starting to approach it now. It would be ideal if I could exceed it slightly, then tweak the frame rate smoothing logic to bring it back on par with the ZX Spectrum.

Watch this space!

Monday, 22 August 2016

Good News is No News

I'm still perplexed, but I do have some 'good' news. I had a typo that resulted in the main loop still attempting to smooth the frame rate when profiling was enabled, when the smoothing should have been disabled.

So my frame rate for 'empty' screens is actually up around 43fps (not 29fps). Of course on 'busy' screens the frame rate smoothing logic did nothing so it's still limping along around 7-8fps.

What has me completely bamboozled is the fact that disabling the Z-order calculations still results in no more than 1fps improvement (and I had my hopes up when I first saw the above-mentioned typo), whilst another porter (to 6502) has seen significant improvement by optimising the Z-order algorithm. This means either the 6809 code is vastly more efficient than the translated 6502 code or, more likely, I've done something brain-dead in my attempt to circumvent it.

Oh and I think I saw some transient pixel corruption on one of the animated screens... perhaps that's a clue to something I've done wrong, or perhaps only a side-effect of disabling the Z-order calculations - I'm not sure.

So my quest continues to find the bottleneck on busier screens...

Monday, 15 August 2016

1fps

Had a bit of spare time so I started on the optimisation tonight. The focus was the sprite print routine, though from all reports that's actually not one of the hotter spots in the code. Regardless, I managed to find a few cycles here and there which equated to roughly one frame-per-second improvement on a busier screen - instead of 5/6 it's now 6/7 fps.

Nothing too tricky; found a PSHS/PULS inside a loop that was completely unnecessary, pre-computed an LEAX outside a loop, used post-increment to remove an LEAU, and finally moved a few temporary variables from the stack to the direct page.

Also found a bug in the process, but one that doesn't appear to affect the code - a word-sized variable on the direct page was only allocated a single byte, and AFAICT wasn't used at the same time as the next variable.

Next step is to locate and review my correspondence with fellow porters to identify the critical areas I should be looking at!

Thursday, 11 August 2016

New workspaces and frame rates!

Yes, it has been a while, hasn't it! Long days in the office on my recent overseas work trip followed by late night Skype sessions with the family meant that Retro Ports didn't even get a look-in. And when I did have some spare time in my room, I was either watching the TdF live or catching up on previous days' highlights. Quite removed from my previous jaunt over there...

...and back home since hasn't been much different, to be honest. A full work schedule and trying to balance family time, work around the house and physical activity has meant I have little energy or motivation for sitting in front of a computer screen late at night. And yet here I am at a little past midnight!

I've recently cleared my home computer desk - no longer used on a daily basis - and made it my Coco Space. It now houses my home PC (aka the development machine and theoretically Drivewire server), a PAL Coco 3 with CoCoSDC cartridge installed in a space-age acrylic case, and a TerASIC DE1 running Gary Becker's Coco3FPGA with newly-acquired Zippster Analogue Board. However with only a VGA monitor set up currently, and little extra space, I should probably obtain some form of RGB->VGA adapter for the Coco3.

The new Retro Ports workspace!

So, tonight with the wife on a girls' night out I finally got motivated to crack open Knight Lore again! To recap, I just need to improve the frame rate and the random number generator before it's ready for final release. And tonight in preparation, I added an ISR to count and print the frames-per-second in real time.
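For anyone curious how such a counter can work, here is a minimal Python model of the idea. It is a sketch under assumptions, not the actual ISR: it assumes the interrupt fires once per vertical sync (50 times a second on a PAL machine) and that the main loop bumps a frame counter each time it finishes rendering. `FpsCounter`, `frame_rendered` and `on_vsync` are hypothetical names.

```python
class FpsCounter:
    """Counts rendered frames per second using a periodic interrupt."""

    def __init__(self, ticks_per_second=50):   # assumption: PAL, 50Hz
        self.ticks_per_second = ticks_per_second
        self.ticks = 0
        self.frames = 0
        self.fps = 0

    def frame_rendered(self):
        """Called by the main loop once per rendered frame."""
        self.frames += 1

    def on_vsync(self):
        """Stand-in for the ISR: called once per interrupt."""
        self.ticks += 1
        if self.ticks == self.ticks_per_second:
            # One second has elapsed: latch the count and start again.
            self.fps = self.frames
            self.frames = 0
            self.ticks = 0


counter = FpsCounter()
for tick in range(50):
    if tick % 2 == 0:          # simulate a game that needs 2 ticks per frame
        counter.frame_rendered()
    counter.on_vsync()
```

The latched `fps` value is what would be printed on screen; the ISR itself only increments, compares and occasionally copies, so it should add very little overhead.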

At its best, on all-but-empty screens, I'm getting about 28-29 fps. The worst I saw, although I've only visited a few screens, was down to about 5 fps when entering the screen and then averaging around 8-9 fps on those screens.

You need to keep in mind that in Knight Lore the fps has a direct effect on the speed of both the player and the objects being animated in the room. So it's not about obtaining a high frame rate (fps) to get smooth animation; the game would be unplayable in that case. Indeed, the proper game speed would - and this is purely an estimate at this point - sit around 15-20 fps. If I could achieve 20 fps on the busiest screens I would be very happy indeed, and would likely need to tweak (and enable) the game loop delay routine to throttle the game play.
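Because game speed is tied to frame rate, throttling amounts to padding fast frames out to a minimum duration while leaving slow frames alone. The Python sketch below shows just that arithmetic; it is an assumed model, not the game's actual delay routine, and `throttled_frame_ticks` is a hypothetical name. Time is measured in interrupt ticks (assumed 50 per second).

```python
def throttled_frame_ticks(render_ticks, min_ticks_per_frame):
    """Return the ticks each frame occupies once fast frames are padded."""
    return [max(ticks, min_ticks_per_frame) for ticks in render_ticks]


# Frames that took 1, 2 and 5 ticks to render, with a 3-tick minimum.
padded = throttled_frame_ticks([1, 2, 5], min_ticks_per_frame=3)
```

With 50 ticks per second, a 3-tick minimum would cap the game at roughly 16-17 fps, which falls inside the estimated 15-20 fps range. Frames that already take longer than the minimum are unaffected, so throttling cannot help the busy screens; only optimisation can.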

As I've touched on in previous posts, there are a number of approaches to the optimisation process; some preserving the original rendering algorithm (which I'd prefer to do) and some that modify it but at the same time produce significant performance gain. I've had correspondence - that I need to locate again - from a few authors who have ported Knight Lore to less capable platforms and managed to get it running faster than the original by enhancing the rendering algorithm. So I have little doubt that I'll be able to get it running fast enough - somehow - on the Coco3!

Hopefully now that I've dipped my toe back in the water, I'll maintain the momentum to work on it regularly.