Wednesday 21 September 2016

Yah! - or if that's cheating - Soon we'll be livin' high and wide

I'll skip the post on profiling and come back to that next time.

Tonight, at the suggestion of the author of the Atari 800 Pentagram port, I thought I'd look at the masking logic. It certainly seemed like a good candidate, given the amount of time spent in the rendering routines and the fact that the masking requires two table look-ups per byte rendered.

Rather than try to optimise it, I thought I'd disable it altogether and see what effect it had on the performance. I was surprised, and a little disappointed, to find that it made very little difference. Well, perhaps not completely insignificant: about 4% in the 'busy' room, taking the frame rate from 8-9fps (see screenshot below) to a solid 9fps.

The infamous 'busy' screen, with fps counter

And because I now had a metric, I disabled the Z-ordering again. I clearly underestimated the effect last time, because the performance jumped to 13fps, with the corresponding routine (calc_display_order_and_render) dropping from 25% all the way down to 2%.

I was now a little bummed that disabling masking and Z-order completely still resulted in 13fps, a few frames short of my target 15fps. Curious as to how this compared to the original, I fired up the ZX Spectrum and Coco3 emulators side-by-side and entered the 'busy' room on each.

To my surprise, the Coco3 (with no masking or Z-order) was significantly faster. So I re-enabled Z-order. It was still faster. So I re-enabled masking - back to the complete code - and it was still faster at 8-9fps!!! For this screen at least, I had been trying to optimise something that was already too fast!

So where does that leave the project? What I need to do now is visit a significant number of screens and compare them, side-by-side, with the ZX Spectrum original. If the Coco3 is running faster in every case, then my work here is done! In fact, I'll need to (re-enable and) tweak the throttling function so that the screens are all roughly the same speed.

EDIT: I think I'm going to back-port my fps ISR from the Coco3 to the ZX Spectrum!
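For anyone wondering what an fps counter of this kind boils down to, here is a rough sketch of the idea in Python rather than 6809 or Z80 assembler. It is not my actual ISR: the class and method names are made up for illustration, and the 60Hz tick rate is an assumption (the Spectrum's frame interrupt runs at 50Hz, so the constant would change there).

```python
class FpsCounter:
    """Counts rendered frames between periodic timer ticks (hypothetical names)."""

    def __init__(self, ticks_per_second=60):
        # Assumed tick rate; use 50 for a 50Hz interrupt source.
        self.ticks_per_second = ticks_per_second
        self.ticks = 0
        self.frames = 0
        self.fps = 0  # last completed one-second reading

    def frame_rendered(self):
        # Called by the main loop each time a frame has been drawn.
        self.frames += 1

    def on_tick(self):
        # Called from the periodic interrupt; latches the count once a second.
        self.ticks += 1
        if self.ticks == self.ticks_per_second:
            self.fps = self.frames
            self.frames = 0
            self.ticks = 0
```

The main loop only ever increments a counter, so the measurement itself costs next to nothing; all the bookkeeping happens in the tick handler.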

Then there's the matter of beefing up the Z80 R register emulation, fixing a graphical glitch that is simply a result of not being able to emulate the Spectrum's attribute bytes, and we're done!

Livin' high and wide, one might even say!

Saturday 17 September 2016

My heart's calculatin'

Just a quick update since it's late...

My profiler appears to be working for the most part, although any delusions I had about writing a generic 6809 profiler are pretty much dashed. I'll go into more detail in the next post, but the nature of assembler makes it difficult - nay impossible - to identify the context (subroutine) of the executing code without some comprehensive code analysis (smells like a halting problem to me).

Regardless, with some inside knowledge of Knight Lore, I've got a pretty good handle on what's taking most of the CPU time now.

Addr    Routine         Count    Cycles
------  --------------  -----  --------
0xE97A  calc_pixel_XY_   1626  15175159  (58%)
0xE19E  calc_display_o    249   2924962  (11%)
0xE8AD  blit_to_screen    501   1081830  ( 4%)
0xD858  fill_window       521    845652  ( 3%)
0xE98F  print_sprite      129    820025  ( 3%)
0xDB20  upd_16_to_21_2    235    561370  ( 2%)
0xE02E  set_draw_objs_    235    501364  ( 2%)
0xE12B  save_2d_info    10210    459450  ( 2%)
0xC852  toggle_audio_h    461    411865  ( 2%)
0xE610  get_ptr_object  12960    401760  ( 2%)
0xE967  flip_sprite      2239    380273  ( 1%)
0xE144  list_objects_t    249    337709  ( 1%)
0xE799  update_screen       2    291910  ( 1%)
0xE7BC  render_dynamic    249    249443  ( 1%)
0xE790  clear_scrn_buf      2    172068  ( 1%)
0xD6CF  print_sun_moon    248    145080  ( 1%)
0xDA55  upd_2_4           498    131802  ( 1%)
(snip)
0xFEF7  _IRQ_             904     38264  ( 0%)
(snip)
0xFEF4  _FIRQ_             58      1334  ( 0%)
(snip)
----------------------         --------
Total Cycles                   26085292


That top routine is actually calc_pixel_XY_and_render(), which does some trivial calcs and then calls into print_sprite, which is obviously where all the time is spent!
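To give a flavour of the simplest possible attribution scheme, here is a minimal sketch in Python. It is not how my profiler actually works: it assumes each routine owns every address from its entry point up to the next known entry point, which is exactly the kind of inside knowledge a generic tool doesn't have. The entry addresses are from the listing above; the function names and the trace format are invented for the example.

```python
from bisect import bisect_right
from collections import defaultdict

# Entry addresses from the listing above. Treating a routine as "everything
# up to the next known entry point" is an assumption of this sketch.
SYMBOLS = sorted([
    (0xE8AD, "blit_to_screen"),
    (0xE967, "flip_sprite"),
    (0xE97A, "calc_pixel_XY_and_render"),
    (0xE98F, "print_sprite"),
])
ADDRS = [addr for addr, _ in SYMBOLS]


def routine_for(pc):
    """Return the routine whose entry point is the closest one at or below pc."""
    i = bisect_right(ADDRS, pc) - 1
    return SYMBOLS[i][1] if i >= 0 else "(unknown)"


def profile(trace):
    """Sum cycles per routine for an iterable of (pc, cycles) pairs."""
    totals = defaultdict(int)
    for pc, cycles in trace:
        totals[routine_for(pc)] += cycles
    return dict(totals)
```

Feeding it a hypothetical trace such as `[(0xE97A, 5), (0xE97D, 3), (0xE98F, 7)]` charges 8 cycles to calc_pixel_XY_and_render and 7 to print_sprite. Shared tails, jumps into the middle of another routine and fall-throughs all defeat this scheme, which is where the difficulty mentioned earlier comes in.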

More on this topic next post...

Thursday 15 September 2016

Cut 'em out

Whilst I haven't been working on Knight Lore code per se, I have done some work that will ultimately assist in the optimisation.

I need a profiler, and since none of the Coco3 emulators have that functionality, I have to roll my own. My first preference was to hack MAME/MESS, but the barrier-to-entry is rather high. So the next candidate is Vcc.

Unfortunately Vcc currently compiles under Microsoft Visual C, which I actively try to avoid in my own projects, preferring GNU or otherwise open source toolchains. So I decided to try my hand at porting Vcc to GCC/MINGW. After hitting a few roadblocks that had me stumped for a day or two, I managed to finally get it all building and running! Mostly.

There wasn't a lot of code to modify; in fact, only about 10 lines in a handful of files. One problem I couldn't figure out how to overcome was a call to AfxInitRichEdit(), which is part of MFC and therefore not available under GCC. Interestingly, there is a RICHED20 library in the GCC/MINGW distribution which supposedly implements AfxInitRichEdit2(), but I had no luck. Somewhat encouragingly though, the documentation suggests that it's not always necessary to call this function.

So the solution for now was - Cut 'em out!

There are a couple of issues with the GCC build - for example the tape configuration dialog is mysteriously blank - but the real stick-in-the-mud for me is the fact that Knight Lore doesn't actually run. Galactic Attack ran just fine, so perhaps it's an issue with 32KB cartridges?

It's never easy, is it?

UPDATE: There was/is an issue with 32KB cartridge images; Vcc expects the two 16KB banks to be swapped - for some unknown reason - unlike MAME/MESS, and also unlike what you'd burn to FLASH or EEPROM for that matter. I've done a quick hack for myself so that Knight Lore will run as-is.
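My hack lives inside Vcc itself, but the same effect could be had by pre-swapping the image before loading it. A minimal sketch of that alternative, in Python, assuming the image is exactly two 16KB banks back to back (the function name is made up for illustration):

```python
BANK_SIZE = 16 * 1024  # one 16KB bank


def swap_banks(image: bytes) -> bytes:
    """Return a 32KB cartridge image with its two 16KB banks exchanged."""
    if len(image) != 2 * BANK_SIZE:
        raise ValueError("expected a 32KB image, got %d bytes" % len(image))
    return image[BANK_SIZE:] + image[:BANK_SIZE]
```

Applying it twice gives back the original image, so the same function converts in either direction.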

The tape configuration dialog was blank because it contained rich edit controls; after adding a call to explicitly load riched20.dll that's now sorted too.

That's the last of the obvious GCC/MINGW issues.

Now I should be able to start on the profiling functionality!