Friday 14 May 2021

So it's down to looking at gcc code generation...

The good news is that I've sorted all the issues with race conditions. I learned how to generate extended inline ASM in gcc, refreshed my memory on Neo Geo interrupts, took a look at the ngdevkit crt0.S code, rediscovered how to generate assembler and symbol listings in gcc, and confirmed the findings in the MAME debugger.

In a nutshell, all I had to do is disable the VBLANK interrupt in the function that updates the tilemap, where it sets REG_VRAMADDR and then writes the tile number to REG_VRAMRW. This eliminates all the glitches where tiles would occasionally be written to the wrong location on the tilemap.

Note that this function is (also) called from the VBLANK interrupt, so I had to save/restore the 68K status register rather than simply re-enable interrupts.

The only other critical function that accesses the Neo Geo VRAM is called from the ngdevkit VBLANK rom callback routine, after the interrupt has been acknowledged (so I don't have to do it after all) and in the context of an ISR (interrupts are already disabled). So nothing to do there.

I also did a preliminary optimisation of the code that updates the tilemap attribute (colour) and scroll registers, and all the sprite registers. On the original arcade hardware, this is done in the NMI and is a simple 128-byte LDIR. Not so simple on the Neo Geo, and quite slow with the REG_VRAMADDR mechanism.

So onto the bad news.

The NMI (VBLANK interrupt) code is taking too much CPU time and although it does finish before the next interrupt, the mainline code is starved of CPU. So you can coin up and play the game but tasks like updating the text and so-called head-up display, and drawing some of the landscape isn't being done.

Taking a look at some of the code being generated was an eye-opener to say the least. There are snippets of code that are inexplicably inefficient and complex. It seems that any code involving a structure member variable is just ridiculously inefficient, and that's a static structure so there's no member offset calculations involved at all.

Check out this snippet that contrasts a simple member update...

1604:src/scramble/scramble.c **** wram.score_table_text_ptr = score_table_text;
4554 .loc 1 1604 0
4555 1d86 203C 0000 move.l #score_table_text,%d0
4555 0000
4556 1d8c 2200 move.l %d0,%d1
4557 1d8e 4241 clr.w %d1
4558 1d90 4841 swap %d1
4559 1d92 E049 lsr.w #8,%d1
4560 1d94 1439 0000 move.b wram+1113,%d2
4560 0000
4561 1d9a 0202 0000 and.b #0,%d2
4562 1d9e 8202 or.b %d2,%d1
4563 1da0 13C1 0000 move.b %d1,wram+1113
4563 0000
4564 1da6 2200 move.l %d0,%d1
4565 1da8 4241 clr.w %d1
4566 1daa 4841 swap %d1
4567 1dac 7400 moveq #0,%d2
4568 1dae 4602 not.b %d2
4569 1db0 C481 and.l %d1,%d2
4570 1db2 2F42 000C move.l %d2,12(%sp)
4571 1db6 1239 0000 move.b wram+1114,%d1
4571 0000 4572 1dbc 0201 0000 and.b #0,%d1
4573 1dc0 822F 000F or.b 15(%sp),%d1
4574 1dc4 13C1 0000 move.b %d1,wram+1114
4574 0000
4575 1dca 2200 move.l %d0,%d1
4576 1dcc E089 lsr.l #8,%d1
4577 1dce 7400 moveq #0,%d2
4578 1dd0 4602 not.b %d2
4579 1dd2 C481 and.l %d1,%d2
4580 1dd4 2F42 0008 move.l %d2,8(%sp)
4581 1dd8 1239 0000 move.b wram+1115,%d1
4581 0000
4582 1dde 0201 0000 and.b #0,%d1
4583 1de2 822F 000B or.b 11(%sp),%d1
4584 1de6 13C1 0000 move.b %d1,wram+1115
4584 0000
4585 1dec 7200 moveq #0,%d1
4586 1dee 4601 not.b %d1
4587 1df0 C280 and.l %d0,%d1
4588 1df2 2F41 0004 move.l %d1,4(%sp)
4589 1df6 1039 0000 move.b wram+1116,%d0
4589 0000
4590 1dfc 0200 0000 and.b #0,%d0
4591 1e00 802F 0007 or.b 7(%sp),%d0
4592 1e04 13C0 0000 move.b %d0,wram+1116
4592 0000

  ...with that of a discrete static variable:

1605:src/scramble/scramble.c **** bananas = score_table_text;
4593 .loc 1 1605 0
4594 1e0a 23FC 0000 move.l #score_table_text,bananas

That's quite a difference! There are other instances throughout the code, even for simple uint16_t variables. I'm sure it all adds up. So one option is to move all the variables out of the structure and see what sort of savings I get.

The not so bad news is that it seems to not be too far from the tipping point as-is. If I remove the code that updates the tilemap attributes for example, which comprises a mere 32 writes to VRAM, then the game appears to run just fine.

So a bit of experimentation in optimsation still to be done.

UPDATE: The structure issue could rather be an alignment issue... more to come...

UPDATE #2: Removing #pragma pack(1) from the wram structure produced code akin to the 2nd snippet above. It also decreased the code size from 0xB28C to 0x98F0. However it didn't seem to have any effect on the NMI execution time... hmm...

UPDATE #3: Almost there! Some more optimisation of the tilemap update code together with eliminating most of the structure packing and the game is almost glitch-free. Occasionally on the ufo stage the landscape isn't drawn correctly (more on this later).

No comments:

Post a Comment