Input lag in Psikyo games in MAME


There’s a long-standing belief that MAME has high input lag for Psikyo games (especially earlier titles), so I decided to have a look.

Some people use an ancient MAME fork called “Shmupmame”, since it has less input lag for these games. I assumed this was due to bugs in the MAME driver, so I started probing some PCBs of mine to figure out how they work.

The MAME drivers report 4 frames of lag on the earlier titles (which is also what I saw when emulating them), and while I did find some things worth adjusting in the MAME driver, these things don’t have any impact on latency. I did however find something else interesting…

Early Psikyo PCBs have quite a bit of lag!

So yeah, the TL;DR is that MAME is accurate, because the early Psikyo PCBs have the same amount of input lag. The relevant games are:

  • Sengoku Ace / Samurai Aces
  • Gunbird
  • Battle K-Road
  • Strikers 1945
  • Tengai / Sengoku Blade

“Frames of input lag” is an ambiguous term though, so to be clear:

  • Button/lever is pressed during frame 0
  • No sprite movement on frame 1
  • No sprite movement on frame 2
  • No sprite movement on frame 3
  • Sprite movement on frame 4!

Also note that when playing on original hardware, the input lag will vary by up to one additional frame depending on the raster beam position at the time of the button press, since inputs are sampled once per frame. Additionally, sprites further up on the screen will be updated sooner, since the beam scans from top to bottom.

Here are two typical examples of what it can look like on Tengai, shot at 240fps.

Note that the time from button press until the bomb animation starts is the same as from lever input to player movement in this game (in some other games that is not true).

“Fast case” for Tengai

See images at:

  • Img 5: Button is pressed (see LED), right before it gets sampled.
  • Img 6-9: One frame of no sprite movement
  • Img 10-13: One frame of no sprite movement
  • Img 14-17: One frame of no sprite movement
  • Img 18: Bomb animation starts

“Slower case” for Tengai

See images at:

  • Img 5: Button is pressed right AFTER it gets sampled.
  • Img 6-8: Game still doesn’t know button was pressed
  • Img 9: Finally button is sampled
  • Img 10-13: One frame of no sprite movement
  • Img 14-17: One frame of no sprite movement
  • Img 18-21: One frame of no sprite movement
  • Img 22: Bomb animation starts

Tengai in MAME

In MAME, the easiest way to verify that the timing is similar to the PCB is to do the following:

  • Bind buttons for “Pause” and “Pause – Single Step” (UI input menu)
  • Pause game while not inputting anything
  • Hold the bomb button (or a direction) while paused and doing the steps below
  • Pause Single Step (One frame of no sprite movement)
  • Pause Single Step (One frame of no sprite movement)
  • Pause Single Step (One frame of no sprite movement)
  • Pause Single Step (Animation starts)

Basically… everything seems fine.

What about later games?

Later games have much lower input latency on PCB. These include:

  • Sol Divide
  • Strikers 1945 II
  • Strikers 1945 III
  • Gunbird 2
  • Dragon Blaze

… but they do in MAME too.

When testing Strikers 1945 II on PCB, I get one frame of no sprite movement. MAME produces the same result as PCB if using the Pause+Step method described above.

Basically, the newer games have two frames less input lag on both PCB and MAME compared to the older games.

But shmupmame has less lag on the early games right?

Yes, shmupmame has less lag than MAME and original PCBs for early Psikyo games. It does this through shortcuts: some buffers can be safely skipped in emulation if you aren’t worried about emulating the hardware accurately. If you prefer to play these games like that, then that’s fine too. It’s just not how the games worked originally 🙂

Other stuff I fixed in the MAME Psikyo driver

Still made some solid improvements though.

  • Measured accurate HSync and VSync timings (the previous values were incorrect)
  • Fixed various issues in the documentation
  • Corrected vertical blanking interrupt level (was irq1, should be irq4)
  • Removed MACHINE_IMPERFECT_TIMING from all relevant machines, since they’re now verified to work correctly

For reference, correct timings are:

SYNCS:  HSync 15.700kHz, VSync 59.923Hz
   HSync most likely derived from 14.3181MHz OSC (divided by 912)
   262 lines per frame consisting of:
   - Visible lines: 224
   - VBlank lines: 38 (Front/Back porch: 15 lines, VSync: 8 lines)
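As a quick sanity check, the relationship between the oscillator, the suspected divider, and the line count can be verified with a few lines of arithmetic (a sketch; the variable names are mine):

```python
# Sanity check of the measured sync timings above.
OSC_HZ = 14_318_100      # 14.3181 MHz oscillator
H_DIV = 912              # suspected HSync divider
LINES_PER_FRAME = 262    # 224 visible + 38 VBlank

hsync_hz = OSC_HZ / H_DIV               # close to the measured 15.700 kHz
vsync_hz = hsync_hz / LINES_PER_FRAME   # close to the measured 59.923 Hz
print(hsync_hz, vsync_hz)
```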


MAME is fine and accurate for Psikyo games in terms of input latency, you don’t need old forks.


New JTAG adapter

Bought a fancy Tigard JTAG adapter instead of the USB Blaster clone I’ve been using.

Need to figure out how to bump performance further, but even at default low speeds in urjtag, it seems like a good improvement.

When I converted this Mushi to Pink Sweets Suicide Club, I fucked up a bit and used bad blocks on U2, which is causing minor sprite glitches now. Going to see if I can fix it easily with JTAG so I don’t have to swap the IC again.


More CV1000 Research (now featuring clipping!)

Disclaimer: This is mostly written for myself. It will be hard to follow. Maybe someone will find it interesting anyways.

Following up on the earlier posts on CV1000 research, I’ve been looking at making a patch for MAME implementing this. A tester sent me a nice report that my current implementation had weird amounts of delay after the second midboss in Pink Sweets. This was mostly due to a silly mistake in my code, which was good to find, but when comparing against the actual delays from a PCB running Pink Sweets, I still seemed to be off by quite a bit.

This caused me to spend another week or so figuring out what was up, and I first made some sort of interesting observations regarding clipping:

  • Sprites that would be drawn entirely outside of the visible area when clipping is enabled will not cause any memory copies in the Blitter (this I had already expected, from looking at the operations of Espgaluda 2)
  • Since the Blitter still needs to read these operations into its operations FIFO, this means that if many “invisible draws” are read in a sequence, the Blitter will be idle, causing delay.
  • Draws that have some visible section but cross the boundaries of the visible area will happily write outside of it.

Additionally, I realized that the simplified calculations described in my PDF writeup were faulty. While the general thinking and memory layout were described correctly, write alignments may also cause additional delays. This is described further below as well.

Sprites fully drawn outside of the visible area

Pink Sweets uses an interesting drawing method in sections with scrolling backgrounds featuring “wavy patterns”: it sets up parts of the background in a separate area of VRAM before copying it to the visible buffer. These copies are done as many 1×324 pixel draws, with offsets varying by up to 4 pixels to generate the waves.

These draws are also performed outside of the visible clipping area, and by instrumenting MAME I could see this type of fully invisible draw happening 80 times per frame in a single sequence.

This looks pretty interesting on a logic analyzer attached to a PCB. In the image below, the “BACK” pulses on the bottom row are the Blitter reading operations from SRAM. Each pulse reads 64 bytes of data, which means that a total of 1600 bytes are read. Since each Draw operation is 20 bytes, this maps exactly to the 80 “invisible” Operations.

This means two things in terms of delay simulation:

  • Draws fully outside of the visible area should not be calculated.
  • … but if a Main RAM access contains nothing to draw, the bus access time will still need to be added to the calculations (in this case about 17.5us after subtracting the horizontal line read), since the Blitter will still be considered busy while waiting for things to Draw.
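The byte counts behind this can be sketched quickly. This reproduces the relationship from the capture above: 80 invisible draws at 20 bytes each means 1600 bytes, fetched as 25 separate 64-byte reads (the names here are mine, not from the hardware):

```python
OP_BYTES = 20     # each Draw operation is 20 bytes
READ_BYTES = 64   # each "BACK" pulse reads 64 bytes from SRAM

def operation_reads(num_ops: int) -> int:
    """Number of 64-byte operation reads needed for a run of operations."""
    total_bytes = num_ops * OP_BYTES
    return (total_bytes + READ_BYTES - 1) // READ_BYTES  # round up

print(operation_reads(80))  # the 80 invisible Pink Sweets draws -> 25 reads
```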

Sprites partially drawn outside of the visible area

The waves in the Pink Sweets background are generated by modifying the start offset along a curve with amplitude 4. Most of these draws will start slightly below the visible area.

As an example, a draw may be (note that X/Y are rotated, due to TATE):

// CLIP_AREA is MIN_X=416, MAX_X=735, MIN_Y=128, MAX_Y=368
SRC_X=864, SRC_Y=128
DST_X=414, DST_Y=128
SIZE_X = 324
SIZE_Y = 1

This means that the writes to (X=414,Y=128) and (X=415,Y=128) will not be in the visible area. A logic analyzer capture of this write shows that they are still written, however. The image below has some glitches in the signals, but is annotated for clarity.

1: Read 32 pixels of source data, nicely aligned in same VRAM row.
2: Read 2 pixels of destination for the VRAM row of the invisible data.
3: Write the 2 invisible pixels to that VRAM row.
4: Read remaining 30 pixels of destination data
5: Write the 30 pixels of destination data.
(In practice, this will actually read 4 and 32 pixels instead, since all operations work at 4 pixels per VRAM CLK, but that doesn’t matter much.)

Since each additional VRAM row access has significant overhead, this means that there’s some wasted work going on here… and it gets worse!

Turns out my earlier simplified calculations were wrong. Oops.

In my large writeup earlier, I had the following calculation for Draw operations.

NUM_VRAM_CLK = (NUM_PIXELS * 3 / 4) + NUM_VRAM_ROWS * (5 + 20 + 10) + 10
Time Needed = NUM_VRAM_CLK * 13ns

This has two major issues. First, it assumes that the number of pixels read from source data and the number of pixels read+written to the destination are equal. As described in the section “Writes are always done four pixels at the time, to offsets evenly divisible by four”, this is not true. There can be four pixels of overhead per line, depending on alignment.

Secondly, it assumes that each VRAM row written to will only be written to once. Depending on alignment, this may also be false. If drawing a sprite with X_SIZE=64, Y_SIZE=1 to position X=16,Y=0, the following sequence will happen:

  • 32 bytes are read from source
  • 16 bytes are read and written to VRAM=0
  • 16 bytes are read and written to VRAM=1
  • 32 bytes are read from source
  • 16 bytes are read and written to VRAM=1
  • 16 bytes are read and written to VRAM=2

Every read+write to a destination VRAM row adds 35 CLK of overhead, so even though the number of rows written to here is only 3, the total access overhead will be that of 4 VRAM row accesses. The actual calculation should be something like:

NUM_VRAM_CLK = (NUM_PIXELS * 3 / 4) + NUM_DESTINATION_VRAM_ACCESSES * (5 + 20 + 10) + 10
Time Needed = NUM_VRAM_CLK * 13ns

Calculating the number of VRAM accesses can be done iteratively, but there’s probably a nice formula for it as well, that I’m too tired to figure out right now.
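Here is what that iterative calculation could look like, assuming (per the 64×1 example above) 32-pixel destination VRAM rows and source data read 32 pixels at a time. This is my own sketch, not code from MAME or the hardware:

```python
VRAM_ROW_PIXELS = 32  # destination row width, per the X_SIZE=64 example above
CHUNK_PIXELS = 32     # source data is read 32 pixels at a time

def dest_vram_accesses(dst_x: int, size_x: int) -> int:
    """Count destination VRAM row accesses for one line of a draw.
    Each source chunk opens every destination row it overlaps, so a
    misaligned destination can access the same row twice."""
    accesses = 0
    x, remaining = dst_x, size_x
    while remaining > 0:
        chunk = min(CHUNK_PIXELS, remaining)
        first_row = x // VRAM_ROW_PIXELS
        last_row = (x + chunk - 1) // VRAM_ROW_PIXELS
        accesses += last_row - first_row + 1
        x += chunk
        remaining -= chunk
    return accesses

print(dest_vram_accesses(16, 64))  # the example above: 4 accesses, only 3 rows
print(dest_vram_accesses(0, 64))   # aligned: 2 accesses for 2 rows
```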

The difference alignments can make…

Finally, to show how big the difference in Draw latency can be depending on how the destination is aligned with VRAM rows, both of the pictures below show the first 32 bytes of data being written for a 324×1 pixel part of the background. The first image is aligned well with the destination VRAM rows, while the second is off by 2 pixels.

Destination aligned with VRAM rows. 32 pixels takes 652ns
Destination is off by 2 pixels from VRAM rows. 32 pixels takes 1068ns due to the extra overhead.

Some more fun Akai Katana slowdown pics

As an extra footnote to the CPU Slowdown post about Akai Katana, here are two more fun pictures, showing other stuff that can happen.

First, there are actually some rare occasions where Blitter-induced slowdown can matter. This happens very rarely in Akai Katana, but sometimes when there are a lot of draws going on, the Blitter processing takes more than a frame, and we can see the CPU repeatedly doing ready checks by pulsing CS6.

As mentioned earlier, this is very infrequent in this game, and almost all slowdown is just CPU being slow.

The other fun thing is that the CPU often does so much work that it can’t even finish processing within two frames, which means several frames of slowdown. The example below shows CPU processing and waits causing two extra frames of slowdown, but more can happen too.

It really feels like this was rushed out without much thought given to performance.


CV1000 CPU Slowdown investigated

When looking at CV1000 Blitter performance, one thing I noticed was that behavior seemed to differ heavily between games, and on some games, despite not seeing long Blitter operation lists, there is heavy slowdown.

The main offender here is Akai Katana, which will have very heavy slowdown when in Spirit mode, reflecting bullets. For some PCB footage, here’s an old video I made.

This slowdown occurs despite there not being a lot of draws happening, which indicates that it’s related to the CPU. I went ahead and verified this.

When there’s no slowdown and the game runs at full speed, we can see the following behavior.

For captures below, the signals from top to bottom are:

  • SH-3 BUS Clock (CKIO)
  • IRQ2 (VSYNC)
  • BREQ (Blitter requests operations)
  • CS3 (RAM CS)
  • CS4 (U2, EEPROM, Audio CS)
  • CS6 (Blitter command CS)

First here’s a non-slowdown section.

No slowdown.

When a VSYNC pulse triggers, Blitter operations kick off. These finish well ahead of the next VSYNC. When looking at CS3 pulses, we can see that these become much less frequent a while before VSYNC. This is when the CPU has finished executing game logic and is waiting for the next frame. Since logic finished before a new pulse, things are fine here.

Now let’s look at that chunky Akai Katana slowdown.


The main difference here is that the CPU is actively doing work for longer than a single frame, as can be seen by CS3 activity continuing past the next IRQ2 pulse. This means that the CPU will ignore the interrupt, and then start spin-waiting for a new one. This causes one frame of slowdown.
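In other words, the number of extra frames only depends on how far past the frame time the game logic runs. A toy model of this (the frame time here is approximate; the exact refresh rate doesn’t change the idea):

```python
import math

FRAME_S = 1 / 60.0  # approximate frame time

def extra_frames(cpu_work_s: float) -> int:
    """If game logic for one frame takes longer than a frame, the CPU
    misses that many IRQ2 pulses and spin-waits for the next one."""
    return max(0, math.ceil(cpu_work_s / FRAME_S) - 1)

print(extra_frames(0.015), extra_frames(0.020), extra_frames(0.040))  # 0 1 2
```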

So what does that mean?

For some games like Akai Katana, getting the Blitter timing accurate is not going to do anything for accuracy in emulation.

What is instead needed is work on the SH-3 emulation.

The current MAME emulation of the CPU does not support the wait states introduced by RAM accesses that are either uncached or result in cache misses. The actual cache behavior is described in depth in the SH-3 datasheet and should be rather simple to implement, but getting the wait-state handling in place seems pretty hard (especially with dynamic recompilation enabled).

There are also wait states for stuff like U2, Audio and EEPROM accesses, but compared to RAM this doesn’t seem to be a big deal.

For some other CV1000 games, the Blitter timing does matter though, so getting both parts right is needed to get something that truly behaves like the actual boards.


Research into CV1000 Blitter performance and behavior

I’ve spent some time in December looking into CV1000 Blitter behavior to figure out how it performs in terms of slowdown. I feel I have a good understanding of how it works now, and have put together a doc describing it.

View/Download it here: CV1000_Blitter_Research_by_buffi.pdf

Why do this?

The current simulation of this Blitter in MAME is quite impressive as a high-level reproduction, but there doesn’t seem to have been much time spent researching the timing of operations.

This document describes how the behavior and timing of the Blitter actually work, so that people can use it to make something that’s mostly accurate.

Also, it is very fun to attach a logic analyzer to a PCB and figure out how it works.

Preemptively Answered Questions

Q: But what about tuning Blitter Delay in MAME?
A: Trying to tune the existing Blitter Delay slider in MAME doesn’t really make any sense, since the slowdown introduced from it doesn’t have anything to do with how it works on real hardware. It’s still arguably better than no slowdown at all, which used to be the other option, but that’s about it.

Q: Will this make CV1000 emulation run with proper slowdown?
A: Probably not really. While this should make it possible to have the Blitter part of emulation more accurate, there’s still no emulation of SH-3 wait states, which means that slowdown caused by the CPU not having time to finish processing before VBLANK will still not be accurate. I have no idea how much this matters for most games.

Q: How much work is it to implement this?
A: It should be very simple. And the simplest thing to do would be:

  • Rip out all the existing Blitter delay logic.
  • When sending a Command to start Blitter Operations, estimate the time they will take to compute.
  • Don’t return “Ready” for the Ready Requests until that time has passed.
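A minimal sketch of those three steps (all names here are illustrative, not MAME’s actual API):

```python
class BlitterDelaySim:
    """Estimate how long a batch of Blitter Operations would take on
    hardware, and only report Ready once that much emulated time passed."""

    VRAM_CLK_S = 13e-9  # 13ns per VRAM CLK, per the research doc

    def __init__(self) -> None:
        self.busy_until_s = 0.0  # emulated time when the Blitter frees up

    def start_command(self, now_s: float, estimated_vram_clks: int) -> None:
        # Step 2: estimate the time the queued Operations will take.
        self.busy_until_s = now_s + estimated_vram_clks * self.VRAM_CLK_S

    def ready(self, now_s: float) -> bool:
        # Step 3: don't return "Ready" until the estimate has elapsed.
        return now_s >= self.busy_until_s
```

The game’s own Ready polling then reproduces the busy-wait behavior seen on hardware, without needing to model concurrent execution.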

This still doesn’t reflect how it performs on real hardware (where Operations run concurrently with the CPU, and new Operations are requested when the existing ones finish executing), but in practice I don’t think that should really matter in terms of experienced gameplay performance.


If you have feedback on the document, or suggestions for further work, please reach out to me on Arcade-Project forums, Github or in comments on this blog.