This blog chronicles my progress porting various retro games to other retro platforms. The goal in each project - at least when targeting a new CPU - is to effectively replicate the original graphics and the original code line-by-line, to produce a 100% accurate port of the original game.

Tuesday, 17 May 2016

Some lunchtime progress: the 'simple' sprite draw and erase routines. These draw and erase byte-aligned sprites respectively.

Byte-aligned sprites are rendered/erased

Rotating is all the more complicated (=slow) without an 8-bit rotation on the 6809.

Would a table help? Might be overkill, but you should have plenty of space.

Great minds think alike (or is it, fools never differ?)

I tried the lookup table yesterday; I think it's a little slower, plus it uses an extra index register. The table's still there, so once it's all working I'll go back and do a proper cycle count to see which is more efficient.

But for now I came up with:

    ldb   ,y
    rorb
    pshs  cc
    rola
    puls  cc
    ror   ,y

I need an 8-bit rotate because the original video data needs to be preserved after the rotation is complete. One other option is to copy 8 bytes to an intermediate buffer and rotate there.

But for now, I want the 'easiest' code option just to get it all working, and then I'll look at optimisation. I must admit it's all looking like more cycles than I bargained for.

I still haven't ruled out going back to plan A.

It looks like you need:

    ROR8 [Y]
    ROL8 A

Your routine takes 26 cycles.

Could this not work?

    LDB   ,Y    ; 4
    RORB        ; 2
    ROR   ,Y    ; 6
    ROLA        ; 2

(14 cycles)
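To convince myself the shorter sequence really matches the original one, here is a small Python model of the two routines. It is only a sketch: `ror8`, `rol8`, `original_routine` and `shorter_routine` are names made up for this check, and the model tracks just the screen byte, the A register and the carry flag.

```python
def ror8(value, carry):
    """Model of ROR: rotate right through carry. Returns (result, new_carry)."""
    return ((carry << 7) | (value >> 1)) & 0xFF, value & 1


def rol8(value, carry):
    """Model of ROL: rotate left through carry. Returns (result, new_carry)."""
    return ((value << 1) | carry) & 0xFF, (value >> 7) & 1


def original_routine(mem, a, carry=0):
    """ldb ,y / rorb / pshs cc / rola / puls cc / ror ,y"""
    b, carry = ror8(mem, carry)    # carry = bit 0 of the screen byte
    saved = carry                  # pshs cc
    a, carry = rol8(a, carry)      # bit 0 enters A, carry is clobbered
    carry = saved                  # puls cc
    mem, carry = ror8(mem, carry)  # bit 0 re-enters the byte at bit 7
    return mem, a


def shorter_routine(mem, a, carry=0):
    """ldb ,y / rorb / ror ,y / rola"""
    b, carry = ror8(mem, carry)    # carry = bit 0 of the screen byte
    mem, carry = ror8(mem, carry)  # carry out is that same bit 0 again
    a, carry = rol8(a, carry)      # so it can go straight into A
    return mem, a
```

Both versions leave the same values in memory and in A for every input. The one thing that differs is the carry flag on exit, which only matters if the following code relies on it.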

Awesome suggestion thanks! Just tried it and it works a treat - and the ISR is not taking a whole frame anymore. I'm sure there's a few more optimisations I can make too. Clearly I need to lift my game. :(

Once Space Invaders is working properly and released, I need to go back and optimise Knight Lore... hopefully I'll do a better job than I have thus far on Space Invaders.

There are a few tricks that can be used. PSHr, PULr and (especially) TFR can be pretty slow.

PSH/PUL are best when moving more than one register as they take "5+number_of_bytes" cycles.

Sometimes it is better to push a loop counter onto the stack and DEC ,S instead of having something like PSHS A/routine/PULS A/DECA.

Remember that PC is one of the registers! So you can replace:

    PULS  A      ; 6
    RTS          ; 5

with:

    PULS  A,PC   ; 8

Some TFR instructions can be replaced with LEA to save two cycles.

    TFR   Y,U    ; 6 (2)
    LEAU  ,Y     ; 4 (2)
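The "5+number_of_bytes" rule for PSH/PUL is easy to tabulate. A quick Python sketch, where `REG_BYTES` and `stack_op_cycles` are hypothetical helper names rather than anything from an assembler or toolchain:

```python
# Size in bytes of each register that PSHS/PULS can transfer.
REG_BYTES = {
    "CC": 1, "A": 1, "B": 1, "DP": 1,
    "X": 2, "Y": 2, "U": 2, "S": 2, "PC": 2,
}


def stack_op_cycles(*regs):
    """Cycles for one PSHS/PULS moving the given registers: 5 + bytes moved."""
    return 5 + sum(REG_BYTES[r] for r in regs)


RTS_CYCLES = 5

# PULS A followed by RTS, versus the combined PULS A,PC.
separate = stack_op_cycles("A") + RTS_CYCLES
combined = stack_op_cycles("A", "PC")
print(separate, combined)
```

The same rule gives the 6 cycles each for the `pshs cc` and `puls cc` pair in the 26-cycle rotate routine.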

The fastest I can think of is the X-flip I use in my graphic routines.

XFLIP is a table of 256 bytes, containing the flipped byte for each index. But as LDA r,X uses the accumulator r as a signed offset, the table is ordered from $80-$FF, then $00-$7F instead of the usual $00-$FF.

Then I would do something like:

    LDX   #XFLIP+$80   ; 3
    LDB   ,Y           ; 4
    LDA   B,X          ; 5

Which is 12 cycles for one, or 9 cycles if you can keep X set up from earlier on.
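For anyone wanting to generate such a table, here is a rough Python sketch of the signed-offset ordering. It assumes "flipped" means a plain bit reversal (1 bit per pixel); `flip_byte`, `build_xflip`, `lookup` and `as_fcb` are names invented for this example, and the `FCB` output format may need adjusting for your assembler.

```python
def flip_byte(value):
    """Reverse the bit order of one byte (assumed meaning of 'X-flip')."""
    result = 0
    for bit in range(8):
        if value & (1 << bit):
            result |= 0x80 >> bit
    return result


def build_xflip():
    """Table ordered for signed offsets: entries for $80-$FF, then $00-$7F."""
    order = list(range(0x80, 0x100)) + list(range(0x00, 0x80))
    return [flip_byte(i) for i in order]


def lookup(table, b):
    """Model of LDA B,X with X = XFLIP+$80: B is taken as a signed offset."""
    signed = b - 0x100 if b >= 0x80 else b
    return table[0x80 + signed]


def as_fcb(table, per_line=16):
    """Format the table as FCB lines, 16 bytes per line by default."""
    lines = []
    for i in range(0, len(table), per_line):
        chunk = ",".join("$%02X" % v for v in table[i:i + per_line])
        lines.append("        FCB     " + chunk)
    return lines
```

Printing `"\n".join(as_fcb(build_xflip()))` gives text that can be pasted in under the XFLIP label.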

You could just use the A register with this method and have B free for other things.

You did say that using up one of the index registers might be a problem, but I thought I would mention it anyway!

I take it you need to do something like take 8 bytes of column-oriented bits and turn them into 8 bytes of row-oriented bits?

My 6809 skillz are rusty, but looking over "The 6809 Companion" I think the fastest would involve 8 tables of 256 bytes each. tableN[K] = 1 if bit N of K is clear, 0 otherwise. Take your source bytes and store them as the 8 bit offsets in this loop:

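    lda   #8          ; one pass per table
    ldx   #table0
loop:
    ldb   B0,X        ; B0..B7 are the 8-bit offsets patched with the source bytes
    lslb
    orb   B1,X
    lslb
    orb   B2,X
    lslb
    orb   B3,X
    lslb
    orb   B4,X
    lslb
    orb   B5,X
    lslb
    orb   B6,X
    lslb
    orb   B7,X
    stb   ,U+
    leax  256,X       ; step to the next table
    deca
    bne   loop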

Those 8 bit offsets are signed so that table definition is a little trickier and you'll want to start with "ldx #table0+128". Some other details to work out for sure, but I think the idea is sound.
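The idea can be checked off-target with a small Python model of the loop. This is a sketch under assumptions: `build_tables` and `transpose` are hypothetical names, the signed-offset table ordering is left out, and the `invert` flag selects between the definition given above (entry is 1 when the bit is clear) and the opposite polarity (entry is 1 when the bit is set), which yields a plain transpose.

```python
def build_tables(invert=False):
    """Eight 256-entry tables; tables[n][k] is 0 or 1 depending on bit n of k.

    invert=False: 1 when the bit is set (plain transpose).
    invert=True:  1 when the bit is clear (the definition in the comment).
    """
    want = 0 if invert else 1
    return [[1 if ((k >> n) & 1) == want else 0 for k in range(256)]
            for n in range(8)]


def transpose(src, tables):
    """Model of the loop: one output byte per table, built by shift-and-OR."""
    out = []
    for table in tables:           # the leax 256,X / deca / bne loop
        b = 0
        for k in src:              # the ldb / lslb / orb chain over B0..B7
            b = ((b << 1) | table[k]) & 0xFF
        out.append(b)
    return out


src = [0x80, 0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0x01]
print(transpose(src, build_tables()))
```

In this model output byte N collects bit N of every source byte, with the first source byte landing in bit 7.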