Biko's House of Horrors

Still Life with Bitstreams

Photograph of pink flowers on a pink background, in shadowy light

Still life flowers by Lezlie, CC BY-NC-ND 2.0

Back in 2013[1], as I was playing Still Life 2, I had the idea to extract the cutscenes from the game's resources. Unlike the previous game in the series, Still Life, which stores cutscenes as .bik files, this one uses a proprietary format. And so, with zero experience and zero knowledge, I set out to understand the file format and write a parser.

I managed to reverse engineer enough of the file format to get the video and audio data out. But, the video was compressed using some custom algorithm, and I couldn't figure it out. So, I made a hack. And by "hack" I mean take an axe to the game's executable: I copied chunks of x86 assembly from within, wrapped them in __asm blocks, and compiled into my own application.

Hey, if it works — it ain't stupid.

Here's a video I extracted that way (no spoilers):

And I left it at that.

Nine years later, something made me remember this thing, and I thought it might be nice to finish the job: fully reverse engineer the video format. I have more experience, and the tools have gotten much better, so let's see what we can dig up.

For those following along at home, you can find a demo of the game here, or purchase it on GOG or on Steam. It's a pretty decent point-and-click adventure, actually, if you don't mind time-sensitive sequences. The first game is better, though.

Update 2022-04-30: Added some more info about the XOR.

Extracting the video files

Before we can begin dissecting the video format, we must first find the actual video files. Yet again, Still Life 2 employs a custom archive format to store all of its resources.

Luckily, nice people on the internet figured it out for us. All we have to do is grab the QuickBMS tool and the extractor script, and point them at the .dat files from the installed game. The videos are in the .PFF files.

For the remainder of this post I'll focus (mostly) on the SPLASH_SL2.PFF file, from the Sl2cine.dat archive.

Where it all began

The first thing I did back then (as far as I can remember) was to just stare at the PFF files in a hex editor and try to figure out what was going on. Turns out, most of the high-level structures in the file can be understood that way.

I'll try to walk through my process of reversing the file format, as it was back then. As far as I can remember it involved a lot of guesswork and trial-and-error. Unfortunately, I didn't document the process at all, and only somewhat documented the findings, so mostly it'll be vague recollections of what past me was doing, colored by what present me would've done in his place.

¯\_(ツ)_/¯

I think I was relying heavily on XeNTaX's file format reversing guide in my efforts. I looked it over while preparing this post, and it still looks kinda good actually. It's more tailored toward reversing archive formats (think ZIP), but there're some useful general tips as well.

The header

Essentially, a PFF file consists of a header followed by a sequence of frames, with each frame containing video and audio data.

Here's the header from SPLASH_SL2.PFF:

0000h: 50 46 46 30 2E 30 00 56 49 44 45 4F  PFF0.0.VIDEO
000Ch: 5F 44 44 53 00 53 4F 55 4E 44 5F 56  _DDS.SOUND_V
0018h: 4F 52 42 49 53 00 45 4E 00 45 4E 44  ORBIS.EN.END
0024h: 48 45 41 44 45 52 00                 HEADER.

So we have a magic at the beginning of the file ("PFF0.0"), a magic to mark the end of the header ("ENDHEADER"), and some strings in the middle that, presumably, describe the video and audio formats, and the language.

What is DDS? No idea. But Vorbis is a well-known audio format, so that's promising.

In the form of a C structure, the header looks like this:

struct PFF_HEADER
{
  char szMagic[7];            // "PFF0.0"
  char szVideoFormat[10];
  char szSoundFormat[13];
  char szLang[3];
  char szEndHeader[10];       // "ENDHEADER"
};

Frames

Immediately following the file header are the frames. Here's the beginning of the first frame in our file:

002Bh: 46 52 41 4D 45 00 A6 5D 00 00 00 00  FRAME..]....
0037h: 00 00 00 00 00 00 56 49 44 45 4F 00  ......VIDEO.
0043h: 85 3D 00 00 80 A9 03 00 44 44 53 20  .=......DDS
004Fh: 7C 00 00 00 07 10 00 00 58 02 00 00  |.......X...
005Bh: 20 03 00 00 00 00 00 00 00 00 00 00   ...........
0067h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
0073h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
007Fh: 00 00 00 00 00 00 00 00 00 00 00 00  ............
008Bh: 00 00 00 00 00 00 00 00 00 00 00 00  ............
0097h: 20 00 00 00 04 00 00 00 44 58 54 31   .......DXT1
00A3h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
00AFh: 00 00 00 00 00 00 00 00 00 10 00 00  ............
00BBh: 00 00 00 00 00 00 00 00 00 00 00 00  ............
00C7h: 00 00 00 00                          ....

Okay, so there's a magic marking the beginning of the frame ("FRAME"), the string "VIDEO", the string "DDS", and the string "DXT1".

But what's the stuff immediately after the frame magic? Well, between "FRAME" and "VIDEO" are these 12 bytes:

0031h: A6 5D 00 00 00 00 00 00 00 00 00 00  .]..........

We can make a guess and say that the first four bytes are a size. Taking them as little-endian we get 0x00005DA6, or 23,974. And, looking 23,974 bytes after this field we find... another "FRAME" magic!

5DDBh: 46 52 41 4D 45 00 27 01 00 00        FRAME.'...

This time the size is 295 (0x00000127), and again at that offset we find another frame.

Cool, so we figured out that this DWORD is the frame size. What about the other 8 bytes? It's not immediately clear what they are for, so let's leave them for now.

So, right now we know that each frame begins like this:

struct FRAME
{
  char    szFrame[6];     // "FRAME"
  DWORD   cbSize;         // Size after this field
  BYTE    acReserved[8];

  // ... Rest of the frame ...
};
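The size-chasing above is easy to automate. Here is a minimal Python sketch, assuming only what we know so far: a frame starts with the "FRAME" magic, followed by a little-endian DWORD that counts the bytes after the size field. The function name frame_offsets is made up for illustration.

```python
import struct

FRAME_MAGIC = b"FRAME\x00"


def frame_offsets(data: bytes, start: int) -> list[int]:
    """Return the offset of every frame, hopping from one cbSize to the next."""
    offsets = []
    pos = start
    # Stop as soon as whatever sits at `pos` is not a frame magic.
    while data[pos:pos + len(FRAME_MAGIC)] == FRAME_MAGIC:
        offsets.append(pos)
        (cb_size,) = struct.unpack_from("<I", data, pos + len(FRAME_MAGIC))
        # cbSize counts the bytes that follow the size field itself.
        pos += len(FRAME_MAGIC) + 4 + cb_size
    return offsets
```

For SPLASH_SL2.PFF the frames start right after the 0x2B-byte header, so calling frame_offsets(data, 0x2B) should begin with 0x2B and 0x5DDB, the two offsets we found by hand.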

VIDEO section

Immediately after this header we have the "VIDEO" string and all that other stuff we saw before:

003Dh: 56 49 44 45 4F 00 85 3D 00 00 80 A9  VIDEO..=....
0049h: 03 00 44 44 53 20 7C 00 00 00 07 10  ..DDS |.....
0055h: 00 00 58 02 00 00 20 03 00 00 00 00  ..X... .....
0061h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
006Dh: 00 00 00 00 00 00 00 00 00 00 00 00  ............
0079h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
0085h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
0091h: 00 00 00 00 00 00 20 00 00 00 04 00  ...... .....
009Dh: 00 00 44 58 54 31 00 00 00 00 00 00  ..DXT1......
00A9h: 00 00 00 00 00 00 00 00 00 00 00 00  ............
00B5h: 00 00 00 10 00 00 00 00 00 00 00 00  ............
00C1h: 00 00 00 00 00 00 00 00 00 00        ..........

Just after the "VIDEO" string is a DWORD (0x00003D85, aka 15,749) that we can again (correctly) guess as being a size. This time, it's the size of the video section: just after it is the string "SOUND", which we'll get to in a bit.

Then we have another DWORD: 0x0003A980, otherwise known as 240,000. Unfortunately, we don't yet have enough information to understand what it is.

Afterward, we have the "DDS" string.

Some Googling for DDS and DXT1 reveals that this might be referring to the DirectDraw Surface file format. The "DDS" string at the beginning (actually, it is "DDS ", with a space at the end) is a magic value, and immediately after it is a DDS_HEADER structure. If we parse the data in our video frame according to this structure we'll get:

struct DDS_HEADER
{
  DWORD           dwSize;               // 124
  DWORD           dwFlags;              // 0x1007
                                        // DDSD_PIXELFORMAT
                                        // | DDSD_WIDTH
                                        // | DDSD_HEIGHT
                                        // | DDSD_CAPS
  DWORD           dwHeight;             // 600
  DWORD           dwWidth;              // 800
  DWORD           dwPitchOrLinearSize;  // 0
  DWORD           dwDepth;              // 0
  DWORD           dwMipMapCount;        // 0
  DWORD           dwReserved1[11];
  DDS_PIXELFORMAT ddspf;
  DWORD           dwCaps;               // 0x1000
                                        // DDSCAPS_TEXTURE
  DWORD           dwCaps2;              // 0
  DWORD           dwCaps3;              // 0
  DWORD           dwCaps4;              // 0
  DWORD           dwReserved2;
};

And the DDS_PIXELFORMAT structure inside:

struct DDS_PIXELFORMAT
{
  DWORD dwSize;         // 32
  DWORD dwFlags;        // 0x04
                        // DDPF_FOURCC
  DWORD dwFourCC;       // 0x31545844
                        // "DXT1"
  DWORD dwRGBBitCount;  // 0
  DWORD dwRBitMask;     // 0
  DWORD dwGBitMask;     // 0
  DWORD dwBBitMask;     // 0
  DWORD dwABitMask;     // 0
};

The dwSize fields match, the height and width look legit... Could it be that easy?

Unfortunately, no.

Black image with some random colored pixels at the top

This is what we get if we extract all the video section data, along with the DDS header, into an image file, and try to render it.

What's suspicious is that the dwPitchOrLinearSize field in the DDS header is zero[2], when it should actually contain the size of the pixel data for the image. This page provides the formula for computing the size. Given the image dimensions of 800x600 it should be... 240,000. Exactly the value we saw before!
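For reference, here is that size computation as a small Python sketch. DXT1 stores each 4x4 block of pixels in 8 bytes, so the total is the number of blocks times 8. The function name dxt1_linear_size is my own, not something from the game or from Microsoft's headers.

```python
def dxt1_linear_size(width: int, height: int) -> int:
    """Size in bytes of DXT1 pixel data for a single width x height surface."""
    # Dimensions are rounded up to whole 4x4 blocks, with at least one block.
    blocks_wide = max(1, (width + 3) // 4)
    blocks_high = max(1, (height + 3) // 4)
    # DXT1 spends 8 bytes on every 4x4 block.
    return blocks_wide * blocks_high * 8


print(dxt1_linear_size(800, 600))  # 240000
```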

And how much data do we actually have in the video section? 15,617.

Clearly, there's some compression going on here. But, let's leave it for now and summarize what we know. A video section looks like this:

struct VIDEO_SECTION
{
  char        szVideo[6];         // "VIDEO"
  DWORD       cbSize;             // Size after this field
  DWORD       cbDecompressed;     // Size of the DDS
                                  // pixel data after
                                  // decompression
  char        dwDDSMagic[4];      // "DDS "
  DDS_HEADER  tDDS;
  BYTE        acData[cbSize - 8 - sizeof(tDDS)];
};

Okay, onwards!

SOUND section

Immediately after the video section we have the sound section:

3DCCh: 53 4F 55 4E 44 00 FC 1F 00 00 00 4F  SOUND......O
3DD8h: 67 67 53 00 02 00 00 00 00 00 00 00  ggS.........
3DE4h: 00 23 48 00 00 00 00 00 00 77 F1 09  .#H......w..
3DF0h: CF 01 1E 01 76 6F 72 62 69 73 00 00  ....vorbis..
3DFCh: 00 00 02 44 AC 00 00 FF FF FF FF 00  ...D........
3E08h: 77 01 00 FF FF FF FF B8 01 4F 67 67  w........Ogg
3E14h: 53 00 00 00 00 00 00 00 00 00 00 23  S..........#
3E20h: 48 00 00 01 00 00 00 BE 8B 87 4B 10  H.........K.
3E2Ch: 54 FF FF FF FF FF FF FF FF FF FF FF  T...........
3E38h: FF FF FF E2 03 76 6F 72 62 69 73 1D  .....vorbis.
3E44h: 00 00 00 58 69 70 68 2E 4F 72 67 20  ...Xiph.Org
3E50h: 6C 69 62 56 6F 72 62 69 73 20 49 20  libVorbis I
3E5Ch: 32 30 30 35 30 33 30 34 01 00 00 00  20050304....
3E68h: 23 00 00 00 54 72 61 63 6B 20 65 6E  #...Track en
3E74h: 63 6F 64 65 64 20 75 73 69 6E 67 20  coded using
3E80h: 6C 69 6C 79 20 69 6E 74 65 72 66 61  lily interfa
3E8Ch: 63 65 2E 01                          ce..

Once again, we have a magic value denoting the section's beginning, and a DWORD with the size: 0x00001FFC, aka 8,188. After these, we have a single byte with the value of zero, and then the string "OggS". Further down, we have more references to Vorbis.

At this point, it's a pretty safe bet that we're dealing with Vorbis-encoded sound. Specifically, Vorbis sound in an Ogg container. That "OggS" string is a magic denoting the beginning of an Ogg page (an Ogg file is split into pages).

In fact, if we concatenate the data from all sound sections we'll get a valid Ogg file:

(Yes, that's the Ogg file extracted from SPLASH_SL2.PFF. You can download it here and have a look at all the little peculiarities inside.)

Cool, so a sound section looks like this:

struct SOUND_SECTION
{
  char    szSound[6];         // "SOUND"
  DWORD   cbSize;             // Size after this field
  BYTE    cReserved;          // Always 0
  BYTE    acData[cbSize - 1]; // Ogg-Vorbis data
};

An end ...

Immediately after the sound section we have the string "ENDFRAME", and after it the next frame starts. With this in mind, we can fully describe a frame:

struct FRAME
{
  char            szFrame[6];     // "FRAME"
  DWORD           cbSize;         // Size after this field
  BYTE            acReserved[8];
  VIDEO_SECTION   tVideo;
  SOUND_SECTION   tSound;
  char            szEndFrame[9];  // "ENDFRAME"
};

... And a beginning

Okay, let's take a look at the next frame.

5DDBh: 46 52 41 4D 45 00 27 01 00 00 00 00  FRAME.'.....
5DE7h: 00 40 E1 7A A4 3F 56 49 44 45 4F 00  .@.z.?VIDEO.
5DF3h: FD 00 00 00 0F 05 00 04 00 03 00 E1  ............
5DFFh: 80 00 81 FF 80 FF FF FF FF FF FF FF  ............
5E0Bh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E17h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E23h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E2Fh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E3Bh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E47h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E53h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E5Fh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E6Bh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E77h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E83h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E8Fh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5E9Bh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5EA7h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5EB3h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5EBFh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5ECBh: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5ED7h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5EE3h: FF FF FF FF FF FF FF FF FF FF FF FF  ............
5EEFh: FF 00 00 00 88 53 4F 55 4E 44 00 05  .....SOUND..
5EFBh: 00 00 00 00 00 00 00 00 45 4E 44 46  ........ENDF
5F07h: 52 41 4D 45 00                       RAME.

Well, there's no DDS header in the video section, and the 240,000 DWORD is missing. But, the size is correct (0x000000FD in this case, i.e. 253). And the sound section is rather short, but everything in it looks in line with our observations so far.

If we skim through the other frames, we'll see that none of them have the header. It makes sense, really — all frames in the video should have the same parameters, so why store the header multiple times?

We can also note that the last frame in the file has neither video nor audio sections:

16D:0404h: 46 52 41 4D 45 00 11 00 00 00 00 00  FRAME.......
16D:0410h: C0 47 B8 1E 2B 40 45 4E 44 46 52 41  .G..+@ENDFRA
16D:041Ch: 4D 45 00                             ME.

Everything else is in place, though:

  1. The first DWORD after the "FRAME" magic is 0x00000011, i.e. 17, which is precisely the size of the frame data.
  2. After that DWORD are the mysterious 8 bytes.

A mystery solved

So how about those 8 bytes in each frame? Here they are, from the first 5 frames:

0035h: 00 00 00 00 00 00 00 00              ........
5DE5h: 00 00 00 40 E1 7A A4 3F              ...@.z.?
5F16h: 00 00 00 40 E1 7A B4 3F              ...@.z.?
6047h: 00 00 00 E0 51 B8 BE 3F              ....Q..?
6178h: 00 00 00 40 E1 7A C4 3F              ...@.z.?

Can you tell what they are? No? Well, neither could I. But then, I tried looking at them as doubles (i.e. IEEE 754 binary64):

0                        0
0.0399999991059303      ~0.04
0.0799999982118607      ~0.08
0.119999997317791       ~0.12
0.159999996423721       ~0.16

Ignoring the imprecisions inherent in floating-point numbers, the difference between each consecutive pair is ≈0.04.

Or, put another way, 25 FPS.

These are timestamps, in seconds, when each frame should be presented.

Additionally, the last frame — the empty one, with neither video nor sound — has a timestamp of 13.5599996969104 ≈ 13.56s, which matches the length of the audio track we extracted.
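If you want to check the numbers yourself, this short Python sketch decodes the five 8-byte values listed above as little-endian doubles. The variable names are just for illustration.

```python
import struct

# The 8 bytes after cbSize, copied from the first five frames above.
raw_timestamps = [
    bytes.fromhex("0000000000000000"),
    bytes.fromhex("00000040E17AA43F"),
    bytes.fromhex("00000040E17AB43F"),
    bytes.fromhex("000000E051B8BE3F"),
    bytes.fromhex("00000040E17AC43F"),
]

# "<d" means a little-endian IEEE 754 binary64 value.
timestamps = [struct.unpack("<d", raw)[0] for raw in raw_timestamps]
deltas = [later - earlier for earlier, later in zip(timestamps, timestamps[1:])]

print(timestamps)
print(round(1 / deltas[0]))  # 25
```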

Putting it all together

With everything we've learned, we can now write a definition of the PFF format in pseudo-C:

struct PFF
{
  PFF_HEADER  tHeader;
  FRAME       atFrames[...];  // However many as will fit
  PFF_FOOTER  tFooter;
};

struct PFF_HEADER
{
  char szMagic[7];            // "PFF0.0"
  char szVideoFormat[10];
  char szSoundFormat[13];
  char szLang[3];
  char szEndHeader[10];       // "ENDHEADER"
};

struct PFF_FOOTER
{
  char szEndFile[8];          // "ENDFILE"
};

struct FRAME
{
  char            szFrame[6];     // "FRAME"
  DWORD           cbSize;         // Size after this field
  double          nTimestamp;
  VIDEO_SECTION   tVideo;         // Optional
  SOUND_SECTION   tSound;         // Optional
  char            szEndFrame[9];  // "ENDFRAME"
};

struct VIDEO_SECTION
{
  char        szVideo[6];         // "VIDEO"
  DWORD       cbSize;             // Size after this field
  if (first_video_section)
  {
    DWORD       cbDecompressed;   // Size of the DDS
                                  // pixel data after
                                  // decompression
    char        dwDDSMagic[4];    // "DDS "
    DDS_HEADER  tDDS;
    BYTE        acData[cbSize - 8 - sizeof(tDDS)];
  }
  else
  {
    BYTE        acData[cbSize];
  }
};

struct SOUND_SECTION
{
  char    szSound[6];         // "SOUND"
  DWORD   cbSize;             // Size after this field
  BYTE    cReserved;          // Always 0
  BYTE    acData[cbSize - 1]; // Ogg-Vorbis data
};
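To make the layout concrete, here is a rough Python sketch that parses a single frame according to the structures above. It is not the game's code, and the name parse_frame is invented. It returns the VIDEO payload untouched (so for the first video section that still includes cbDecompressed and the DDS header), and it loops over sections until "ENDFRAME" rather than assuming exactly one of each, which is an assumption on my part.

```python
import struct

FRAME_MAGIC = b"FRAME\x00"
END_FRAME_MAGIC = b"ENDFRAME\x00"
SECTION_MAGICS = (b"VIDEO\x00", b"SOUND\x00")


def parse_frame(data: bytes, pos: int):
    """Parse the FRAME at `pos`.

    Returns (timestamp, video, sounds, next_pos). `video` is the raw VIDEO
    payload or None; `sounds` is a list of (first_byte, ogg_bytes) tuples.
    """
    if data[pos:pos + 6] != FRAME_MAGIC:
        raise ValueError(f"no FRAME magic at offset {pos:#x}")
    # cbSize (DWORD) and nTimestamp (double) follow the magic, little-endian.
    cb_size, timestamp = struct.unpack_from("<Id", data, pos + 6)
    frame_end = pos + 6 + 4 + cb_size
    pos += 6 + 4 + 8

    video = None
    sounds = []
    while data[pos:pos + 9] != END_FRAME_MAGIC:
        magic = data[pos:pos + 6]
        if magic not in SECTION_MAGICS:
            raise ValueError(f"unexpected section at offset {pos:#x}")
        (size,) = struct.unpack_from("<I", data, pos + 6)
        payload = data[pos + 10:pos + 10 + size]
        if magic == b"VIDEO\x00":
            video = payload
        else:
            # Assumes a SOUND payload always has at least its one leading byte.
            sounds.append((payload[0], payload[1:]))
        pos += 10 + size

    pos += len(END_FRAME_MAGIC)
    if pos != frame_end:
        raise ValueError("frame size does not match its contents")
    return timestamp, video, sounds, pos
```

Calling it in a loop, feeding next_pos back in until the "ENDFILE" footer shows up, walks the whole file. Concatenating the ogg_bytes of every sound section gives the Ogg file mentioned earlier.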

If you prefer, here is a Hex Workshop structure library I wrote at the time to assist in parsing PFF files. Past me would also like to note that on big files this library will consume a huge amount of memory.

This wouldn't have happened in 010.

Back to the future

As I mentioned, back in 2013 I didn't manage to figure out the compression used for the video frames.

It's 2022 now, and it's time to fix that.

So we fire up Ghidra, point it at SL2.exe, and look for references to the string "PFF0.0", to find the place where the file is parsed. Lucky for us, there's a lot of RTTI left in the binary, so Ghidra can deduce class names, which makes our work much easier. Unfortunately, there're a lot of virtual function calls, too, which we have to devirtualize by hand.

Eventually, we arrive at this function, which seems to be responsible for the decompression of a single video frame (all names except for the class name are my own):

Ghidra decompiler screenshot showing the beginning of a very long function, with a mess of local variables

Unfortunately, there is only so much Ghidra can do, and this is what it looks like after I did my best to clean up the decompiler's output.

The next step was pretty low-tech: copy the decompiler output to a text editor, and start cleaning it up by hand.

Let's take a look at some of the highlights.

Initialization

On the first video frame the code performs some setup:

  1. It reads the decompressed frame size and the DDS header.
  2. It allocates a scratch buffer with this size.
  3. It takes the maximum of the frame's width and height, and rounds it up to the nearest power of 2.
  4. It makes a copy of the original DDS header, but with the width and height changed to the value computed above. For instance, an 800x600 frame will be rounded up to 1024x1024.
  5. It allocates an array of 30 frames, with enough space to hold a DDS header and pixel data for a frame of the updated size.[3] It then copies the new DDS header to each such frame.
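Step 3 can be sketched in Python like this. The name padded_side is made up, and I'm assuming that a dimension which is already a power of 2 is left as it is.

```python
def padded_side(width: int, height: int) -> int:
    """Side of the square, power-of-2 texture that can hold the frame."""
    side = max(width, height)
    # Smallest power of 2 that is >= side (assumed to keep exact powers as-is).
    return 1 << (side - 1).bit_length()


print(padded_side(800, 600))  # 1024
```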

The decoder (described below) places each decompressed frame into the array, cycling to the beginning when it reaches the end. Why allocate so many? No idea.

Why does it do this rounding of the frame dimensions? Maybe it's due to some quirks of the rendering code, that requires all textures to be square, with each side a power of 2? Who knows.

Capture the flag

The initialization above happens only on the first frame. From then on, the flow is the same for all frames.

First, the code reads a single byte which contains some flags. Here they are, with some names that I guessed based on the usage:

enum VIDEO_FRAME_FLAGS
{
  VIDEO_FRAME_FLAG_INTERMEDIATE = (1 << 0),
  VIDEO_FRAME_FLAG_COMPRESSED   = (1 << 1),
  VIDEO_FRAME_FLAG_RLE          = (1 << 2),
  VIDEO_FRAME_FLAG_TWO_PARTS    = (1 << 3),
};

Of particular note is the VIDEO_FRAME_FLAG_COMPRESSED flag. As far as I can tell, this flag is set in all frames, in all the PFF files that I checked. In fact, if this flag is not set then the code will proceed to read some uninitialized memory, or data from previously decoded frames. Fun.
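In Python these map nicely onto an IntFlag. The class name below is mine, and the names are the same guesses as in the enum above. As an example, the first byte of the second frame's video data in the dump earlier (right after the FD 00 00 00 size) is 0x0F, which has all four flags set.

```python
from enum import IntFlag


class VideoFrameFlags(IntFlag):
    INTERMEDIATE = 1 << 0
    COMPRESSED = 1 << 1
    RLE = 1 << 2
    TWO_PARTS = 1 << 3


# Flag byte of the second frame in SPLASH_SL2.PFF.
flags = VideoFrameFlags(0x0F)
print(VideoFrameFlags.RLE in flags)  # True
```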

Huffman coding

Next up, the code reads and decodes some Huffman-coded data from the file.

Huffman coding is a method of compressing a stream of symbols by replacing each symbol with a unique sequence of bits (1s and 0s), called a codeword. This replacement follows two main principles:

  1. Symbols with a higher probability of occurring in the input stream will be replaced by shorter codewords.
  2. No codeword is a prefix of any other codeword (i.e. this is a prefix code).

The first principle is what actually compresses the input stream. The second principle is what allows us to easily read the compressed data: since codewords vary in length and nothing separates them in the bitstream, we need a way to tell where one ends and the next begins. By making the code prefix-free we guarantee that once we start reading the compressed data, there is only one possible interpretation of the codewords.

You can read a better description over at Wikipedia.

Wait, how do we know that the video decoder uses Huffman coding? Well, if the decoding fails it logs the message:

Warning : HUFFMAN problem!!!

So there's that.

What does the encoded data look like? Like this:

struct HUFFMAN_DATA
{
  WORD    nTreeLength;
  WORD    anNodes[nTreeLength];
  DWORD   adwData[];  // Yep, a DWORD. See below.
};

First we have the number of nodes in the Huffman tree, then the actual nodes, with the first node in the array being the root. Since a Huffman tree is a full binary tree, each node is either an internal node with 2 children or a leaf node. On disk, these are distinguished using the high bit:

  1. If the high bit is set, this is a leaf node. The value of the node (i.e. the symbol this node encodes) is then stored in the low 9 bits.
  2. If the high bit is clear, this is an internal node. The left child is stored in the array element immediately adjacent to it. The right child is stored at the index indicated by the node's low 15 bits.

How is the actual compressed data stored? In blocks of 4 bytes. The decoder reads 4 bytes at a time in little-endian order, then processes the bits from most to least significant. As it reads those bits, it traverses the tree — taking the left child if a bit is 0, and the right child if it is 1 — until it reaches a leaf node. The value of the leaf is then output as the decompressed byte.

Wait, but how does it know when to stop? The struct above has no "size" field...

To mark the end of the compressed data the developers used a neat trick — they put in a sentinel. When the decompressor reaches a leaf with the value 0x100, it stops. It works since a byte can never have such a value 😀. And, it can save several bytes of space: instead of storing 4 bytes for the size, the sentinel takes up only a few bits.
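Here is a minimal Python sketch of a decoder for one such blob, written from the description above rather than lifted from the game. The name huffman_decode is made up. It also returns the offset just past the last DWORD it consumed, on the assumption that this is where any following data starts.

```python
import struct

SENTINEL = 0x100


def huffman_decode(blob: bytes, pos: int = 0):
    """Decode one HUFFMAN_DATA blob at `pos`; return (decoded, end_offset)."""
    (n_nodes,) = struct.unpack_from("<H", blob, pos)
    nodes = struct.unpack_from(f"<{n_nodes}H", blob, pos + 2)
    pos += 2 + 2 * n_nodes

    out = bytearray()
    node = 0  # Index of the current node; the root is the first element.
    while True:
        # Data is consumed one little-endian DWORD at a time, MSB first.
        (word,) = struct.unpack_from("<I", blob, pos)
        pos += 4
        for shift in range(31, -1, -1):
            bit = (word >> shift) & 1
            # Left child sits right after its parent; the right child's
            # index is stored in the parent's low 15 bits.
            node = (nodes[node] & 0x7FFF) if bit else node + 1
            if nodes[node] & 0x8000:  # High bit set: a leaf.
                symbol = nodes[node] & 0x1FF
                if symbol == SENTINEL:
                    return bytes(out), pos
                out.append(symbol)
                node = 0


# The Huffman blob from the second frame of SPLASH_SL2.PFF, as dumped earlier:
# a 5-node tree, 236 bytes of 0xFF, and a final DWORD of 00 00 00 88.
blob = (bytes.fromhex("050004000300E1800081FF80")
        + b"\xff" * 236
        + bytes.fromhex("00000088"))
decoded, end = huffman_decode(blob)
print(len(decoded), end)  # 1890 252
```

The tree here has just three leaves: 0xFF, 0xE1 and the sentinel. The decoded data is 1,889 bytes of 0xFF followed by a single 0xE1, and the 252 consumed bytes plus the flags byte add up to exactly the 253 bytes of that video section.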

Now, if the VIDEO_FRAME_FLAG_TWO_PARTS flag is set in the frame flags, and VIDEO_FRAME_FLAG_RLE is not set, then the video decoder will decompress another Huffman blob immediately following the first one, and concatenate the decompressed data to the first decompressed block.

RLE

If the VIDEO_FRAME_FLAG_RLE flag is set, then after the Huffman decompression the video decoder will run the decompressed data through a Run-Length Encoding (RLE) decoder[4].

This particular brand of RLE actually only encodes runs of zeroes. The decompressor goes over every byte of the (compressed) input and:

  1. If the byte has the high bit set, then it treats the low 7 bits as a length, and outputs this many zero bytes.
  2. If the high bit is clear, it again treats the low 7 bits as a length, but this time it copies this many bytes from the input to the output.

Unlike the Huffman decompressor, there are no sentinels here. The decompression stops after processing all of the input data.
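A Python sketch of this scheme, as I understand it, follows. The name rle_decode is invented, and I'm assuming the literal bytes to copy are the ones immediately after the control byte.

```python
def rle_decode(data: bytes) -> bytes:
    """Expand the zero-run RLE described above."""
    out = bytearray()
    pos = 0
    while pos < len(data):
        control = data[pos]
        pos += 1
        length = control & 0x7F
        if control & 0x80:
            # High bit set: a run of `length` zero bytes.
            out += bytes(length)
        else:
            # High bit clear: copy the next `length` bytes verbatim.
            out += data[pos:pos + length]
            pos += length
    return bytes(out)


# Three zeroes, then two literal bytes.
print(rle_decode(bytes([0x83, 0x02, 0xAA, 0xBB])).hex())  # 000000aabb
```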

Interleaving

If the VIDEO_FRAME_FLAG_TWO_PARTS flag is set, the output of all previous decompression stages is now interleaved: the data is split into two halves, and the code goes over each half, DWORD by DWORD, and places each pair of DWORDs into the output alongside each other. So, for instance, this buffer:

0000h: 11111111 33333333 55555555  ....3333UUUU
000Ch: 22222222 44444444 66666666  """"DDDDffff

Becomes:

0000h: 11111111 22222222 33333333  ....""""3333
000Ch: 44444444 55555555 66666666  DDDDUUUUffff
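The same thing as a Python sketch, using the buffer above as the example. The name interleave is mine, and the sketch assumes the input length is a multiple of 8 so that both halves consist of whole DWORDs.

```python
def interleave(data: bytes) -> bytes:
    """Alternate DWORDs from the first and second halves of `data`."""
    half = len(data) // 2
    out = bytearray()
    for i in range(0, half, 4):
        out += data[i:i + 4]                # DWORD from the first half
        out += data[half + i:half + i + 4]  # matching DWORD from the second
    return bytes(out)


before = bytes.fromhex("111111113333333355555555" "222222224444444466666666")
print(interleave(before).hex())
```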

Tables!

Things got a bit confusing there, so let's summarize how the decompression works:

Flags                        Stage 1      Stage 2  Stage 3
None                         Huffman      -        -
VIDEO_FRAME_FLAG_TWO_PARTS   Huffman x 2  -        Interleave
VIDEO_FRAME_FLAG_RLE         Huffman      RLE      -
Both                         Huffman      RLE      Interleave

XOR

Finally, if the VIDEO_FRAME_FLAG_INTERMEDIATE flag is set, the fully decompressed pixel data is XORed with the previous frame's data.

Presumably, this is done to improve compression on similar frames. If two adjacent frames have areas with the same pixels, their XOR will be zero. This will compress really well with the RLE thingy!
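In Python the XOR step is a one-liner. This is just a sketch with a made-up name, apply_delta, and it assumes both buffers have the same length.

```python
def apply_delta(previous: bytes, delta: bytes) -> bytes:
    """XOR the decoded data of an intermediate frame onto the previous frame."""
    return bytes(p ^ d for p, d in zip(previous, delta))


previous = bytes.fromhex("12345678")
# An all-zero delta means "nothing changed": the previous frame comes back.
print(apply_delta(previous, bytes(4)).hex())  # 12345678
```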

Results?

At this point I knew enough about the format and the video compression algorithm to write my own demuxer. Or so I thought.

My first attempt was a Python script that dumps all frames and audio to disk. It was slow, but it worked. Specifically, it worked on files that I had lying around from 9 years ago. They must be from the version of the game I played back then.

And then, while preparing this post, I tried to run it over the files from the demo of the game.

There was a slight hitch.

Remember how I, very confidently, described what the PFF file header looks like? Turns out, there can be multiple audio streams! For instance, this is what the header of CIN006.PFF from the demo looks like:

0000h: 5046 4630 2E30 0056 4944 454F 5F44 4453  PFF0.0.VIDEO_DDS
0010h: 0053 4F55 4E44 5F56 4F52 4249 5300 454E  .SOUND_VORBIS.EN
0020h: 0053 4F55 4E44 5F56 4F52 4249 5300 4652  .SOUND_VORBIS.FR
0030h: 0053 4F55 4E44 5F56 4F52 4249 5300 4445  .SOUND_VORBIS.DE
0040h: 0053 4F55 4E44 5F56 4F52 4249 5300 4553  .SOUND_VORBIS.ES
0050h: 0053 4F55 4E44 5F56 4F52 4249 5300 4954  .SOUND_VORBIS.IT
0060h: 0045 4E44 4845 4144 4552 00              .ENDHEADER.

There are 5 streams!
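A header parser that copes with any number of audio streams could look roughly like this in Python. It is only a sketch: the name parse_header is invented, and it assumes the header is nothing more than NUL-terminated strings, with every sound format followed by its language, up to "ENDHEADER".

```python
def parse_header(data: bytes):
    """Return (video_format, [(sound_format, language), ...], end_offset)."""
    fields = []
    pos = 0
    while True:
        end = data.index(b"\x00", pos)  # Every field is NUL-terminated.
        field = data[pos:end].decode("ascii")
        pos = end + 1
        if field == "ENDHEADER":
            break
        fields.append(field)

    if len(fields) < 2 or fields[0] != "PFF0.0":
        raise ValueError("not a PFF header")
    video_format = fields[1]
    # Assumed layout: the rest is (sound format, language) pairs.
    rest = fields[2:]
    sounds = list(zip(rest[0::2], rest[1::2]))
    return video_format, sounds, pos


header = (b"PFF0.0\x00VIDEO_DDS\x00"
          b"SOUND_VORBIS\x00EN\x00SOUND_VORBIS\x00FR\x00"
          b"SOUND_VORBIS\x00DE\x00SOUND_VORBIS\x00ES\x00"
          b"SOUND_VORBIS\x00IT\x00ENDHEADER\x00")
video_format, sounds, end_offset = parse_header(header)
print([language for _, language in sounds])  # ['EN', 'FR', 'DE', 'ES', 'IT']
```

The returned end_offset is where the first frame begins, 0x6B for the header above.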

This file also resolved the mystery of the cReserved byte in the sound section. It's not always 0, but rather a track index. There are frames that contain data for only some of the audio tracks, and that's how you can match a sound section to the track it provides data for.

There was one other interesting discovery when running the script over all the files at once: sometimes a frame contained only audio data, and no video. I modified my script to just output the previous frame again, which seems reasonable enough, but I don't know what the actual behaviour of the game's renderer is.

Results!

That should've been it, right? You can dump all the contents of a PFF file, which is enough. This is all the information contained in the file. Well, technically there're also the timestamps for each frame, but in all files I've seen they denote a constant framerate of 25 FPS.

You can even feed the individual frames to FFmpeg to create an actual, playable, video!

But of course, I didn't stop there. I went ahead and wrote a PFF demuxer for FFmpeg, all so that I could do:

ffmpeg -i SPLASH_SL2.PFF -c:v libx264 -c:a copy -map 0:v -map 0:a? SPLASH_SL2.mkv

Needless to say, this runs much faster than the Python script.

Funnily enough, the tricky part was actually implementing the extraction of the audio data from each frame. But you can read all about that in the source 🙄.

R̵҉̛͟e̴̡̨s̶̵̴u̕l͝͠҉̷t̡͜͝ş͏̷

Image of an anime character with the text 'There is a point where we needed to stop and we have clearly passed it. But let's keep going and see what happens.'

https://twitter.com/SwiftOnSecurity/status/1483807561010843653

But why stop there? I now definitely know enough about the file format to write a muxer for FFmpeg. So that I can convert any video to PFF, and play it inside the game.

Well, actually I don't know enough. Most importantly, I don't know how to split the Ogg-Vorbis audio into PFF frames. Can I put all audio in the first frame? Should I split by timestamps somehow?

But I do know enough to mux only video. Luckily, there is one PFF file in the game that has no audio — CINEMENU.PFF. This is the file that plays in the main menu.

So I... umm... modified it a little...

Okay, now I'm done. Probably.

Final notes

When I was just starting out on this project, I posted on the XeNTaX forums asking for help. Unfortunately, no one replied. But now, at last, after 9 years, I can answer my own question 😎.

My FFmpeg fork with the PFF muxer/demuxer is here. And you can get some Windows binaries here.

The hacky Python scripts are here.

An 010 Editor template for PFF files is here.

Onward to other things!


  1. Damn, that's a long time ago.

  2. Okay, actually Microsoft's documentation states that you should not rely on this field, and always compute the size using the width, height and compression algorithm. But still!

  3. Yes, that means the code has to fix-up the decompressed frame data, since it has lower dimensions. Specifically, it has to fix up the stride. I'm deliberately not getting into it. It's not pertinent to the decompression anyway.

  4. RLED? RLD?[5] Vote in the comments below.

  5. Run-Length Decoder.