The arcade machine uses a pair of so-called "ping-pong" buffers to allow the DVG to render one frame whilst the CPU is building the next. This comes in very handy indeed on the raster ports (Apple IIGS, Coco3): the previous frame's display list is still sitting in the other buffer, so it can be used to erase the previous frame before rendering the current one.
Of course, once the video is double-buffered, the page about to be drawn on still holds the frame prior to the previous frame, so that is the one that needs erasing. The easiest way to implement this with the current architecture is to extend the 2x buffers to 4x and modify the "ping-pong" logic slightly. No more than a handful of instructions in a few strategic locations...
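For anyone trying to picture the 4x scheme, here is a minimal sketch in C rather than 6809 assembly. It is my reading of the idea, not the actual port code: the names (frame_roles, roles_for_frame) are made up for illustration, and it assumes the CPU builds list n+1 whilst list n is being rendered and that there are two video pages flipped alternately.

```c
#include <assert.h>

#define NUM_LISTS 4  /* display-list buffers: the arcade's 2, extended to 4 */
#define NUM_PAGES 2  /* video pages (assumed plain double-buffering) */

/* Which buffer plays which role on frame n (hypothetical structure). */
typedef struct {
    unsigned build;   /* list the CPU is filling for the next frame */
    unsigned render;  /* list being drawn onto the hidden video page */
    unsigned erase;   /* list from two frames ago, used to erase that page */
    unsigned page;    /* video page being erased and redrawn this frame */
} frame_roles;

frame_roles roles_for_frame(unsigned n)
{
    frame_roles r;
    r.render = n % NUM_LISTS;
    r.build  = (n + 1) % NUM_LISTS;
    /* n - 2, written so it cannot underflow for n < 2 */
    r.erase  = (n + NUM_LISTS - 2) % NUM_LISTS;
    r.page   = n % NUM_PAGES;

    /* The list used for erasing must never be one that is being
       built or rendered on the same frame. */
    assert(r.erase != r.build && r.erase != r.render);
    return r;
}
```

With only two lists, the n-2 list would be the very one the CPU is overwriting, which is why the ring has to grow. On the first two frames the erase index points at a list that has not been filled yet, so it is assumed to start out empty.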
At this point the game is running quite slowly due to the sub-optimal (to put it mildly) erase/render code, so there's little point synchronising the page flipping to the VBLANK just yet. As a result the video still exhibits some flicker. However, it is much improved and gives a taste of things to come...
UPDATE: Tonight I thought I'd add some profiling code before starting on any more of the optimisations. When you first start a game (with 4 asteroids on-screen) it's hovering around 55fps. When things get a lot busier, it's down around 20fps, and the lowest I've encountered is 13fps. And when there's all-but-nothing to render, it hits 89fps.
Will be interesting to see where it goes from here...
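For reference, a frame-rate figure like the ones above can be derived by counting rendered frames against a fixed-rate tick. The sketch below is in C and is only an illustration, not the profiling code actually used: it assumes a 60Hz tick (e.g. counted in a vertical sync interrupt handler), and the names TICKS_PER_SECOND and fps_from_ticks are hypothetical.

```c
#define TICKS_PER_SECOND 60u  /* assumed 60Hz tick source */

/* Frames per second, rounded to nearest, given the number of frames
   rendered over 'ticks' timer ticks. Returns 0 if no time has elapsed. */
unsigned fps_from_ticks(unsigned frames, unsigned ticks)
{
    if (ticks == 0)
        return 0;
    return (frames * TICKS_PER_SECOND + ticks / 2) / ticks;
}
```

Sampling over exactly 60 ticks makes the frame count itself the fps figure, which keeps the arithmetic on an 8-bit CPU trivial.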
UPDATE 2: I've just optimised the copyright rendering. The copyright is unique in that it is rendered every frame, in a fixed location, and therefore never needs to be erased.
After some experimentation, and without resorting to stack blasting (which I can't see being optimal in this case due to the OR'ing operation), I came up with the following for each line of 4 words (Y is the video address):
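LDD #0x1234    ; D = first word of copyright bitmap data (A=hi, B=lo)
ORA ,Y         ; merge with what's already on screen
ORB 1,Y
STD ,Y         ; write the combined word back
LDD #0x5678    ; second word
ORA 2,Y
ORB 3,Y
STD 2,Y
...            ; and so on for the remaining words on this line
LEAY 32,Y      ; advance Y to the next video line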
That's the best I can come up with late on a Friday night (37 -> 22/24 cycles/line). Improvements welcome!
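To make it clearer what the unrolled code is doing, here is a small C model of the same OR-blit. It is a sketch only: the function names and the table layout are made up, whereas the real code bakes the bitmap data into the instruction stream as immediates. The 4 words per line and the 32-byte step between lines are taken from the listing above; the big-endian byte order matches the 6809's D register (A high, B low).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define WORDS_PER_ROW 4   /* each line of the copyright is 4 words */
#define ROW_STRIDE    32  /* bytes between video lines (LEAY 32,Y) */

/* OR one row of bitmap words into video memory at y. */
static void or_row(uint8_t *y, const uint16_t row[WORDS_PER_ROW])
{
    for (size_t w = 0; w < WORDS_PER_ROW; w++) {
        y[2 * w]     |= (uint8_t)(row[w] >> 8);    /* ORA n,Y   */
        y[2 * w + 1] |= (uint8_t)(row[w] & 0xFF);  /* ORB n+1,Y */
    }
}

/* OR an nrows-high bitmap into video memory, one video line per row. */
void or_blit(uint8_t *video, const uint16_t (*rows)[WORDS_PER_ROW],
             size_t nrows)
{
    assert(video != NULL && rows != NULL);
    for (size_t r = 0; r < nrows; r++)
        or_row(video + r * ROW_STRIDE, rows[r]);
}
```

Because the data is OR'ed in rather than stored, anything already drawn underneath (a passing asteroid, say) survives, which is why each word costs a read as well as a write.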