More CV1000 Research (now featuring clipping!)

Disclaimer: This is mostly written for myself. It will be hard to follow. Maybe someone will find it interesting anyways.

Following up on the earlier posts on CV1000 research, I’ve been working on a MAME patch implementing this, but got a nice report from a tester that my current implementation had weird amounts of delay after the second midboss in Pink Sweets. This was mostly due to a silly mistake in my code, which was good to find, but when comparing against the actual delays from a PCB running Pink Sweets, I still seemed to be off by quite a bit.

This caused me to spend another week or so figuring out what was up, and I first made some sort of interesting observations regarding clipping:

  • Sprites that would be drawn entirely outside of the visible area when clipping is enabled will not cause any memory copies in the Blitter (this I had already expected, from looking at the operations of Espgaluda 2)
  • Since the Blitter still needs to read these operations into its operations FIFO, this means that if many “invisible draws” are read in a sequence, the Blitter will be idle, causing delay.
  • Draws that have some visible section, but cross the boundaries of the visible area, will happily write outside of the visible area.

Additionally, I realized that the simplified calculations described in my PDF writeup were faulty. While the general thinking and memory layout were described correctly, write alignments may also cause additional delays. This will be described further below as well.

Sprites fully drawn outside of the visible area

Pink Sweets uses an interesting drawing method in sections with scrolling backgrounds featuring “wavy patterns”, where it sets up parts of the background in a separate area in VRAM before copying it to the visible buffer. These copies are done as many 1×324 pixel draws, with offsets varying by up to 4 pixels to generate the waves.

These draws will also be performed outside of the visible clipping area, and by instrumenting MAME I could see this type of fully invisible draw happening 80 times per frame in a single sequence.

This looks pretty interesting in a logic analyzer attached to a PCB. In the image below, the “BACK” pulses on the bottom row are the Blitter reading operations from SRAM. Each pulse reads 64 bytes of data, which means that a total of 1600 bytes are read (25 pulses). Since each Draw operation is 20 bytes, this maps exactly to the 80 “invisible” operations.

This means two things in terms of delay simulation:

  • Draws fully outside of the visible area should not be calculated.
  • … but if a Main RAM access contains nothing to draw, the bus access time will need to be added to the calculations (in this case about 17.5us after subtracting the Horizontal line read), since the Blitter will still be considered busy while waiting for things to draw.
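
The two rules above can be sketched roughly as follows. This is only an illustration, not the MAME patch: the `Clip` type and the function names are hypothetical, the clip bounds are assumed to be inclusive, and the off-screen coordinates in the usage lines are made up. The 20-byte operation size and 64-byte read size are the figures from the capture described above.

```python
from dataclasses import dataclass

OP_BYTES = 20         # size of one Draw operation
FIFO_READ_BYTES = 64  # bytes fetched per "BACK" pulse


@dataclass(frozen=True)
class Clip:
    """Clip rectangle; bounds are ASSUMED to be inclusive."""
    min_x: int
    max_x: int
    min_y: int
    max_y: int


def fully_clipped(dst_x, dst_y, size_x, size_y, clip):
    """True when no pixel of the draw lands inside the clip area."""
    return (dst_x + size_x - 1 < clip.min_x or dst_x > clip.max_x or
            dst_y + size_y - 1 < clip.min_y or dst_y > clip.max_y)


def fifo_reads(num_ops):
    """Number of 64-byte reads needed to fetch num_ops Draw operations."""
    total_bytes = num_ops * OP_BYTES
    return (total_bytes + FIFO_READ_BYTES - 1) // FIFO_READ_BYTES


clip = Clip(min_x=416, max_x=735, min_y=128, max_y=368)

# Hypothetical draw far outside the clip area: no VRAM copy, but the
# operation still has to be read into the FIFO.
print(fully_clipped(0, 0, 324, 1, clip))      # True
# A draw that overlaps the clip area is not skipped.
print(fully_clipped(414, 128, 324, 1, clip))  # False
# 80 invisible operations: 1600 bytes, i.e. 25 reads of 64 bytes.
print(fifo_reads(80))                         # 25
```

In a delay simulation, a run of operations for which `fully_clipped` is true would contribute only the time of the reads counted by `fifo_reads`, with no draw time added.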

Sprites partially drawn outside of the visible area

The waves in the Pink Sweets background are generated by modifying the start offset along a curve with amplitude 4. Most of these draws will start slightly below the visible area.

As an example, a draw may be (note that X/Y are rotated, due to TATE):

// CLIP_AREA is MIN_X=416, MAX_X=735, MIN_Y=128, MAX_Y=368
SRC_X=864, SRC_Y=128
DST_X=414, DST_Y=128
SIZE_X = 324
SIZE_Y = 1

This means that the writes to (X=414,Y=128) and (X=415,Y=128) will not be in the visible area. The logic analyzer output for this draw shows, however, that these pixels are still written. The image below has some glitches in the signals, but is annotated for clarity.

1: Read 32 pixels of source data, nicely aligned in same VRAM row.
2: Read 2 pixels of destination for the VRAM row of the invisible data.
3: Write the 2 invisible pixels to that VRAM row.
4: Read remaining 30 pixels of destination data
5: Write the 30 pixels of destination data.
(In practice, this will actually read 4 and 32 pixels instead, since all operations work at 4 pixels per VRAM CLK, but that doesn’t matter much).
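
As a rough sketch of steps 2 to 5, the destination span for one 32-pixel source read can be split wherever it crosses a VRAM row boundary, and each piece then widened to 4-pixel alignment. This assumes VRAM rows are 32 pixels wide; the function names are hypothetical and this is only my reading of the capture, not a verified model of the hardware.

```python
VRAM_ROW_PIXELS = 32  # assumed width of one VRAM row, in pixels
ACCESS_PIXELS = 4     # pixels handled per VRAM CLK


def split_at_vram_rows(dst_x, num_pixels):
    """Split [dst_x, dst_x + num_pixels) into (start, length) pieces,
    one per VRAM row touched."""
    chunks = []
    x = dst_x
    end = dst_x + num_pixels
    while x < end:
        boundary = (x // VRAM_ROW_PIXELS + 1) * VRAM_ROW_PIXELS
        n = min(boundary, end) - x
        chunks.append((x, n))
        x += n
    return chunks


def widen_to_access(start, length):
    """Grow a piece outwards to 4-pixel alignment on both ends."""
    lo = start - start % ACCESS_PIXELS
    hi = (start + length + ACCESS_PIXELS - 1) // ACCESS_PIXELS * ACCESS_PIXELS
    return lo, hi - lo


# First 32 pixels of the example draw, DST_X=414.
pieces = split_at_vram_rows(414, 32)
print(pieces)                                  # [(414, 2), (416, 30)]
print([widen_to_access(*p) for p in pieces])   # [(412, 4), (416, 32)]
```

The first piece covers the two pixels below MIN_X=416, and the widened lengths match the 4 and 32 pixels mentioned above.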

Since each additional VRAM row access has significant overhead, this means that there’s some wasted work going on here… and it gets worse!

Turns out my earlier simplified calculations were wrong. Oops.

In my large writeup earlier, I had the following calculation for Draw operations.

VRAM_ROW_HEIGHT = CEIL( ( (IMAGE_Y % 32) + IMAGE_HEIGHT ) / 32 )
VRAM_ROW_WIDTH = CEIL( ( (IMAGE_X % 32) + IMAGE_WIDTH ) / 32 )
NUM_VRAM_ROWS = VRAM_ROW_HEIGHT * VRAM_ROW_WIDTH
NUM_VRAM_CLK = (NUM_PIXELS * 3 / 4) + NUM_VRAM_ROWS * (5 + 20 + 10) + 10
Time Needed = NUM_VRAM_CLK * 13ns

This has two major issues. First, it assumes that the number of pixels being read from source data, and the number of pixels read+written to destination, are equal. As described in the section “Writes are always done four pixels at the time, to offsets evenly divisible by four”, this is not true. There can be four pixels of overhead per line, depending on alignment.

Secondly, this assumes that each VRAM row written to will only be written to once. Depending on alignment, this may also be false. If drawing a sprite with X_SIZE=64, Y_SIZE=1 to position X=16,Y=0, the following sequence will happen:

  • 32 pixels are read from source
  • 16 pixels are read and written to VRAM row 0
  • 16 pixels are read and written to VRAM row 1
  • 32 pixels are read from source
  • 16 pixels are read and written to VRAM row 1
  • 16 pixels are read and written to VRAM row 2

Every read+write to a destination VRAM row will add 35 CLK of overhead, so even if the number of rows written to here is just 3, the total access overhead will be for 4 VRAM row accesses. The actual calculation should be something like:

NUM_VRAM_CLK = (SRC_PIXELS / 4 + DST_PIXELS * 2 / 4) 
             + NUM_DESTINATION_VRAM_ACCESSES * (5 + 20 + 10) + 10
Time Needed = NUM_VRAM_CLK * 13ns

Calculating the number of VRAM accesses can be done iteratively, but there’s probably a nice formula for it as well, that I’m too tired to figure out right now.
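
A minimal sketch of the iterative approach, for a single-line draw only. It assumes the source is read in 32-pixel chunks starting at the draw's first pixel (that is, a source aligned to its VRAM rows) and that destination VRAM rows are 32 pixels wide. The function names are hypothetical, and how multi-line draws are sequenced is not modelled here.

```python
VRAM_ROW_PIXELS = 32          # assumed width of one VRAM row, in pixels
SRC_CHUNK_PIXELS = 32         # assumed size of one source read
ROW_ACCESS_CLK = 5 + 20 + 10  # per-access overhead from the formula
CLK_NS = 13


def count_dst_accesses_one_line(dst_x, size_x):
    """Count destination VRAM row accesses for one line of a draw.

    Each source chunk is written out separately, so a destination row
    that straddles two source chunks is accessed twice.
    """
    accesses = 0
    done = 0
    while done < size_x:
        n = min(SRC_CHUNK_PIXELS, size_x - done)
        first_row = (dst_x + done) // VRAM_ROW_PIXELS
        last_row = (dst_x + done + n - 1) // VRAM_ROW_PIXELS
        accesses += last_row - first_row + 1
        done += n
    return accesses


def draw_clocks(src_pixels, dst_pixels, dst_accesses):
    """NUM_VRAM_CLK as in the revised formula."""
    return (src_pixels / 4 + dst_pixels * 2 / 4
            + dst_accesses * ROW_ACCESS_CLK + 10)


# X_SIZE=64, Y_SIZE=1 drawn to X=16: 3 distinct rows, but 4 accesses.
print(count_dst_accesses_one_line(16, 64))    # 4

# The 324-pixel background line, aligned versus off by 2 pixels.
print(count_dst_accesses_one_line(416, 324))  # 11
print(count_dst_accesses_one_line(414, 324))  # 22

# Clock count for the 64-pixel example, ignoring 4-pixel write overhead.
clk = draw_clocks(64, 64, count_dst_accesses_one_line(16, 64))
print(clk, clk * CLK_NS)                      # 198.0 2574.0
```

For comparison, the earlier calculation would count only 3 VRAM rows for the 64-pixel example, giving 163 CLK instead of 198.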

The difference alignments can make…

Finally, to show how big the difference in Draw latency can be depending on how the destination is aligned to VRAM rows, the two images here both show the first 32 pixels of data being written for a 324×1 pixel part of the background. The first image is aligned well with the destination VRAM rows, while the second is off by 2 pixels.

Destination aligned with VRAM rows. 32 pixels takes 652ns
Destination is off by 2 pixels from VRAM rows. 32 pixels takes 1068ns due to the extra overhead.
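
As a quick sanity check on these two measurements, the difference can be converted to VRAM clocks, assuming the 13ns clock period used in the calculations above. The variable names are just for illustration.

```python
CLK_NS = 13                   # VRAM clock period used in the formulas
ROW_ACCESS_CLK = 5 + 20 + 10  # per-access overhead used in the formulas

aligned_ns = 652              # measured, destination aligned
misaligned_ns = 1068          # measured, destination off by 2 pixels

extra_ns = misaligned_ns - aligned_ns
extra_clk = extra_ns / CLK_NS

print(extra_ns)        # 416
print(extra_clk)       # 32.0
print(ROW_ACCESS_CLK)  # 35
```

The extra 32 CLK is in the same ballpark as the 35 CLK charged per destination VRAM row access, which fits the misaligned case needing one additional access for the same 32 pixels.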
