Still life flowers by Lezlie, CC BY-NC-ND 2.0
Back in 2013¹, as I was playing Still Life 2, I had the idea to extract the cutscenes from the game's resources. Unlike the previous game in the series, Still Life, which stores cutscenes as .bik files, this one uses a proprietary format. And so, with zero experience and zero knowledge, I set out to understand the file format and write a parser. I managed to reverse engineer enough of the file format to get the video and audio data out. But, the video was compressed using some custom algorithm, and I couldn't figure it out. So, I made a hack. And by "hack" I mean take an axe to the game's executable: I copied chunks of x86 assembly from within it, wrapped them in __asm blocks, and compiled them into my own application.
Hey, if it works — it ain't stupid.
Here's a video I extracted that way (no spoilers):
And I left it at that.
Nine years later, something made me remember this thing, and I thought it might be nice to finish the job: fully reverse engineer the video format. I have more experience, and the tools have gotten much better, so let's see what we can dig up.
For those following along at home, you can find a demo of the game here, or purchase it on GOG or on Steam. It's a pretty decent point-and-click adventure, actually, if you don't mind time-sensitive sequences. The first game is better, though.
Update 2022-04-30: Added some more info about the XOR.
- Extracting the video files
- Where it all began
- Back to the future
- Results?
- Results!
- R̵҉̛͟e̴̡̨s̶̵̴u̕l͝͠҉̷t̡͜͝ş͏̷
- Final notes
Extracting the video files
Before we can begin dissecting the video format, we must first find the actual video files. Yet again, Still Life 2 employs a custom archive format to store all of its resources.
Luckily, nice people on the internet figured it out for us.
All we have to do is grab the QuickBMS tool and the extractor script, and point them at the .dat files from the installed game. The videos are in the .PFF files.

For the remainder of this post I'll focus (mostly) on the SPLASH_SL2.PFF file, from the Sl2cine.dat archive.
Where it all began
The first thing I did back then (as far as I can remember), is just stare at the PFF files in a hex editor and try to figure out what was going on. Turns out, most of the high-level structures in the file can be understood that way.
I'll try to walk through my process of reversing the file format, as it was back then. As far as I can remember it involved a lot of guesswork and trial-and-error. Unfortunately, I didn't document the process at all, and only somewhat documented the findings, so mostly it'll be vague recollections of what past me was doing, colored by what present me would've done in his place.
¯\_(ツ)_/¯
I think I was relying heavily on XeNTaX's file format reversing guide in my efforts. I looked it over while preparing this post, and it still looks kinda good actually. It's more tailored toward reversing archive formats (think ZIP), but there're some useful general tips as well.
The header
Essentially, a PFF file consists of a header followed by a sequence of frames, with each frame containing video and audio data.
Here's the header from SPLASH_SL2.PFF:
0000h: 50 46 46 30 2E 30 00 56 49 44 45 4F PFF0.0.VIDEO
000Ch: 5F 44 44 53 00 53 4F 55 4E 44 5F 56 _DDS.SOUND_V
0018h: 4F 52 42 49 53 00 45 4E 00 45 4E 44 ORBIS.EN.END
0024h: 48 45 41 44 45 52 00 HEADER.
So we have a magic at the beginning of the file ("PFF0.0"), a magic to mark the end of the header ("ENDHEADER"), and some strings in the middle that, presumably, describe the video and audio formats, and the language.

What is DDS? No idea. But Vorbis is a well-known audio format, so that's promising.
In the form of a C structure, the header looks like this:
struct PFF_HEADER
{
char szMagic[7]; // "PFF0.0"
char szVideoFormat[10];
char szSoundFormat[13];
char szLang[3];
char szEndHeader[10]; // "ENDHEADER"
};
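If we wanted to parse this in code, splitting on the NUL terminators is enough. Here's a quick Python sketch (my own, not the game's code), with data holding the whole file as bytes:

def parse_header(data):
    # The header is just a run of NUL-terminated strings, the last one being "ENDHEADER".
    fields = []
    pos = 0
    while True:
        end = data.index(b"\x00", pos)
        fields.append(data[pos:end].decode("ascii"))
        pos = end + 1
        if fields[-1] == "ENDHEADER":
            return fields, pos    # pos is now the offset of the first frame

parse_header(open("SPLASH_SL2.PFF", "rb").read())
# -> (['PFF0.0', 'VIDEO_DDS', 'SOUND_VORBIS', 'EN', 'ENDHEADER'], 43)
# 43 == 0x2B, which is exactly where the first frame starts.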
Frames
Immediately following the file header are the frames. Here's the beginning of the first frame in our file:
002Bh: 46 52 41 4D 45 00 A6 5D 00 00 00 00 FRAME..]....
0037h: 00 00 00 00 00 00 56 49 44 45 4F 00 ......VIDEO.
0043h: 85 3D 00 00 80 A9 03 00 44 44 53 20 .=......DDS
004Fh: 7C 00 00 00 07 10 00 00 58 02 00 00 |.......X...
005Bh: 20 03 00 00 00 00 00 00 00 00 00 00 ...........
0067h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
0073h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
007Fh: 00 00 00 00 00 00 00 00 00 00 00 00 ............
008Bh: 00 00 00 00 00 00 00 00 00 00 00 00 ............
0097h: 20 00 00 00 04 00 00 00 44 58 54 31 .......DXT1
00A3h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
00AFh: 00 00 00 00 00 00 00 00 00 10 00 00 ............
00BBh: 00 00 00 00 00 00 00 00 00 00 00 00 ............
00C7h: 00 00 00 00 ....
Okay, so there's a magic marking the beginning of the frame ("FRAME"), the string "VIDEO", the string "DDS", and the string "DXT1".

But what's the stuff immediately after the frame magic? Well, between "FRAME" and "VIDEO" are these 12 bytes:
0031h: A6 5D 00 00 00 00 00 00 00 00 00 00 .]..........
We can make a guess and say that the first four bytes are a size. Taking them as little-endian we get 0x00005DA6, or 23,974. And, looking 23,974 bytes after this field we find... another "FRAME" magic!
5DDBh: 46 52 41 4D 45 00 27 01 00 00 FRAME.'...
This time the size is 295 (0x00000127), and again at that offset we find another frame.
Cool, so we figured out that this DWORD is the frame size. What about the other 8 bytes? It's not immediately clear what they are for, so let's leave them for now.
So, right now we know that each frame begins like this:
struct FRAME
{
char szFrame[6]; // "FRAME"
DWORD cbSize; // Size after this field
BYTE acReserved[8];
// ... Rest of the frame ...
};
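In fact, these three fields are already enough to walk the whole file, frame by frame. Here's a minimal Python sketch (error handling omitted; data holds the file's bytes):

import struct

def walk_frames(data):
    # The first "FRAME" chunk sits right after the file header.
    pos = data.index(b"FRAME\x00")
    while data[pos:pos + 6] == b"FRAME\x00":
        (cb_size,) = struct.unpack_from("<I", data, pos + 6)
        yield pos, cb_size
        pos += 6 + 4 + cb_size    # magic + size field + everything counted by it

Running this over SPLASH_SL2.PFF should yield the offsets we just found by hand: 0x2B, 0x5DDB, and so on.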
VIDEO section
Immediately after this header we have the "VIDEO" string and all that other stuff we saw before:
003Dh: 56 49 44 45 4F 00 85 3D 00 00 80 A9 VIDEO..=....
0049h: 03 00 44 44 53 20 7C 00 00 00 07 10 ..DDS |.....
0055h: 00 00 58 02 00 00 20 03 00 00 00 00 ..X... .....
0061h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
006Dh: 00 00 00 00 00 00 00 00 00 00 00 00 ............
0079h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
0085h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
0091h: 00 00 00 00 00 00 20 00 00 00 04 00 ...... .....
009Dh: 00 00 44 58 54 31 00 00 00 00 00 00 ..DXT1......
00A9h: 00 00 00 00 00 00 00 00 00 00 00 00 ............
00B5h: 00 00 00 10 00 00 00 00 00 00 00 00 ............
00C1h: 00 00 00 00 00 00 00 00 00 00 ..........
Just after the "VIDEO"
string is a DWORD (0x00003D85
, aka 15,749) that we can again
(correctly) guess as being a size. This time, it's the size of the video section: just
after it is the string "SOUND"
, which we'll get to in a bit.
Then we have another DWORD: 0x0003A980
, otherwise known as 240,000. Unfortunately,
we don't yet have enough information to understand what it is.
Afterward, we have the "DDS"
string.
Some Googling for DDS and DXT1 reveals that this might be referring to the DirectDraw Surface file format. The "DDS" string at the beginning (actually, it is "DDS ", with a space at the end) is a magic value, and immediately after it is a DDS_HEADER structure. If we parse the data in our video frame according to this structure we'll get:
struct DDS_HEADER
{
DWORD dwSize; // 124
DWORD dwFlags; // 0x1007
// DDSD_PIXELFORMAT
// | DDSD_WIDTH
// | DDSD_HEIGHT
// | DDSD_CAPS
DWORD dwHeight; // 600
DWORD dwWidth; // 800
DWORD dwPitchOrLinearSize; // 0
DWORD dwDepth; // 0
DWORD dwMipMapCount; // 0
DWORD dwReserved1[11];
DDS_PIXELFORMAT ddspf;
DWORD dwCaps; // 0x1000
// DDSCAPS_TEXTURE
DWORD dwCaps2; // 0
DWORD dwCaps3; // 0
DWORD dwCaps4; // 0
DWORD dwReserved2;
};
And the DDS_PIXELFORMAT structure inside:
struct DDS_PIXELFORMAT
{
DWORD dwSize; // 32
DWORD dwFlags; // 0x04
// DDPF_FOURCC
DWORD dwFourCC; // 0x31545844
// "DXT1"
DWORD dwRGBBitCount; // 0
DWORD dwRBitMask; // 0
DWORD dwGBitMask; // 0
DWORD dwBBitMask; // 0
DWORD dwABitMask; // 0
};
The dwSize fields match, the height and width look legit... Could it be that easy?
Unfortunately, no.
This is what we get if we extract all the video section data, along with the DDS header, into an image file, and try to render it.
What's suspicious is that the dwPitchOrLinearSize field in the DDS header is zero², when it should actually contain the size of the pixel data for the image. This page provides the formula for computing the size. Given the image dimensions of 800x600 it should be... 240,000. Exactly the value we saw before!
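For the curious: DXT1 packs each 4x4 block of pixels into 8 bytes, so the computation boils down to this little Python sketch:

def dxt1_linear_size(width, height):
    # Each 4x4 pixel block takes 8 bytes in DXT1.
    return max(1, (width + 3) // 4) * max(1, (height + 3) // 4) * 8

dxt1_linear_size(800, 600)    # 200 * 150 * 8 = 240,000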
And how much data do we actually have in the video section? 15,617.
Clearly, there's some compression going on here. But, let's leave it for now and summarize what we know. A video section looks like this:
struct VIDEO_SECTION
{
char szVideo[6]; // "VIDEO"
DWORD cbSize; // Size after this field
DWORD cbDecompressed; // Size of the DDS
// pixel data after
// decompression
char dwDDSMagic[4]; // "DDS "
DDS_HEADER tDDS;
BYTE acData[cbSize - 8 - sizeof(tDDS)];
};
Okay, onwards!
SOUND section
Immediately after the video section we have the sound section:
3DCCh: 53 4F 55 4E 44 00 FC 1F 00 00 00 4F SOUND......O
3DD8h: 67 67 53 00 02 00 00 00 00 00 00 00 ggS.........
3DE4h: 00 23 48 00 00 00 00 00 00 77 F1 09 .#H......w..
3DF0h: CF 01 1E 01 76 6F 72 62 69 73 00 00 ....vorbis..
3DFCh: 00 00 02 44 AC 00 00 FF FF FF FF 00 ...D........
3E08h: 77 01 00 FF FF FF FF B8 01 4F 67 67 w........Ogg
3E14h: 53 00 00 00 00 00 00 00 00 00 00 23 S..........#
3E20h: 48 00 00 01 00 00 00 BE 8B 87 4B 10 H.........K.
3E2Ch: 54 FF FF FF FF FF FF FF FF FF FF FF T...........
3E38h: FF FF FF E2 03 76 6F 72 62 69 73 1D .....vorbis.
3E44h: 00 00 00 58 69 70 68 2E 4F 72 67 20 ...Xiph.Org
3E50h: 6C 69 62 56 6F 72 62 69 73 20 49 20 libVorbis I
3E5Ch: 32 30 30 35 30 33 30 34 01 00 00 00 20050304....
3E68h: 23 00 00 00 54 72 61 63 6B 20 65 6E #...Track en
3E74h: 63 6F 64 65 64 20 75 73 69 6E 67 20 coded using
3E80h: 6C 69 6C 79 20 69 6E 74 65 72 66 61 lily interfa
3E8Ch: 63 65 2E 01 ce..
Once again, we have a magic value denoting the section's beginning, and a DWORD with the size: 0x00001FFC, aka 8,188. After these, we have a single byte with the value of zero, and then the string "OggS". Further down, we have more references to Vorbis.
At this point, it's a pretty safe bet that we're dealing with Vorbis-encoded sound. Specifically, Vorbis sound in an Ogg container. That "OggS" string is a magic denoting the beginning of an Ogg page (an Ogg file is split into pages).
In fact, if we concatenate the data from all sound sections we'll get a valid Ogg file:
(Yes, that's the Ogg file extracted from SPLASH_SL2.PFF. You can download it here and have a look at all the little peculiarities inside.)
Cool, so a sound section looks like this:
struct SOUND_SECTION
{
char szSound[6]; // "SOUND"
DWORD cbSize; // Size after this field
BYTE cReserved; // Always 0
BYTE acData[cbSize - 1]; // Ogg-Vorbis data
};
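Building on the walk_frames() sketch from earlier, concatenating the audio out of SPLASH_SL2.PFF could look roughly like this in Python (my own sketch, no error handling):

def extract_ogg(data, out_path):
    ogg = bytearray()
    for frame_off, frame_size in walk_frames(data):
        pos = frame_off + 6 + 4 + 8            # skip the magic, cbSize and the 8 mystery bytes
        end = frame_off + 6 + 4 + frame_size
        while pos < end:
            magic = data[pos:pos + 6]
            if magic not in (b"VIDEO\x00", b"SOUND\x00"):
                break                          # "ENDFRAME"
            (cb,) = struct.unpack_from("<I", data, pos + 6)
            if magic == b"SOUND\x00":
                ogg += data[pos + 11:pos + 10 + cb]   # drop the zero byte, keep the Ogg data
            pos += 6 + 4 + cb
    with open(out_path, "wb") as f:
        f.write(bytes(ogg))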
An end ...
Immediately after the sound section we have the string "ENDFRAME", and after it the next frame starts. With this in mind, we can fully describe a frame:
struct FRAME
{
char szFrame[6]; // "FRAME"
DWORD cbSize; // Size after this field
BYTE acReserved[8];
VIDEO_SECTION tVideo;
SOUND_SECTION tSound;
char szEndFrame[9]; // "ENDFRAME"
};
... And a beginning
Okay, let's take a look at the next frame.
5DDBh: 46 52 41 4D 45 00 27 01 00 00 00 00 FRAME.'.....
5DE7h: 00 40 E1 7A A4 3F 56 49 44 45 4F 00 .@.z.?VIDEO.
5DF3h: FD 00 00 00 0F 05 00 04 00 03 00 E1 ............
5DFFh: 80 00 81 FF 80 FF FF FF FF FF FF FF ............
5E0Bh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E17h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E23h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E2Fh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E3Bh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E47h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E53h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E5Fh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E6Bh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E77h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E83h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E8Fh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5E9Bh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5EA7h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5EB3h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5EBFh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5ECBh: FF FF FF FF FF FF FF FF FF FF FF FF ............
5ED7h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5EE3h: FF FF FF FF FF FF FF FF FF FF FF FF ............
5EEFh: FF 00 00 00 88 53 4F 55 4E 44 00 05 .....SOUND..
5EFBh: 00 00 00 00 00 00 00 00 45 4E 44 46 ........ENDF
5F07h: 52 41 4D 45 00 RAME.
Well, there's no DDS header in the video section, and the 240,000 DWORD is missing. But, the size is correct (0x000000FD in this case, i.e. 253). And the sound section is rather short, but everything in it looks in line with our observations so far.
If we skim through the other frames, we'll see that none of them have the header. It makes sense, really — all frames in the video should have the same parameters, so why store the header multiple times?
We can also note that the last frame in the file has neither video nor audio sections:
16D:0404h: 46 52 41 4D 45 00 11 00 00 00 00 00 FRAME.......
16D:0410h: C0 47 B8 1E 2B 40 45 4E 44 46 52 41 .G..+@ENDFRA
16D:041Ch: 4D 45 00 ME.
Everything else is in place, though:
- The first DWORD after the "FRAME" magic is 0x00000011, i.e. 17, which is precisely the size of the frame data.
- After that DWORD are the mysterious 8 bytes.
A mystery solved
So how about those 8 bytes in each frame? Here they are, from the first 5 frames:
0035h: 00 00 00 00 00 00 00 00 ........
5DE5h: 00 00 00 40 E1 7A A4 3F ...@.z.?
5F16h: 00 00 00 40 E1 7A B4 3F ...@.z.?
6047h: 00 00 00 E0 51 B8 BE 3F ....Q..?
6178h: 00 00 00 40 E1 7A C4 3F ...@.z.?
Can you tell what they are? No? Well, neither could I. But then, I tried looking at them as doubles (i.e. IEEE 754 binary64):
0 0
0.0399999991059303 ~0.04
0.0799999982118607 ~0.08
0.119999997317791 ~0.12
0.159999996423721 ~0.16
Ignoring the imprecisions inherent in floating-point numbers, the difference between each pair is ≈0.04.
Or, put another way, 25 FPS.
These are timestamps, in seconds, when each frame should be presented.
Additionally, the last frame — the empty one, with neither video nor sound — has a timestamp of 13.5599996969104 ≈ 13.56s, which matches the length of the audio track we extracted.
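If you want to check the math yourself, reinterpreting the raw bytes as a little-endian double is a one-liner with Python's struct module:

import struct

raw = bytes.fromhex("00 00 00 40 e1 7a a4 3f")   # the second frame's 8 bytes
struct.unpack("<d", raw)[0]                      # -> 0.0399999991059303, i.e. ~0.04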
Putting it all together
With everything we've learned, we can now write a definition of the PFF format in pseudo-C:
struct PFF
{
PFF_HEADER tHeader;
FRAME atFrames[...]; // However many will fit
PFF_FOOTER tFooter;
};
struct PFF_HEADER
{
char szMagic[7]; // "PFF0.0"
char szVideoFormat[10];
char szSoundFormat[13];
char szLang[3];
char szEndHeader[10]; // "ENDHEADER"
};
struct PFF_FOOTER
{
char szEndFile[8]; // "ENDFILE"
};
struct FRAME
{
char szFrame[6]; // "FRAME"
DWORD cbSize; // Size after this field
double nTimestamp;
VIDEO_SECTION tVideo; // Optional
SOUND_SECTION tSound; // Optional
char szEndFrame[9]; // "ENDFRAME"
};
struct VIDEO_SECTION
{
char szVideo[6]; // "VIDEO"
DWORD cbSize; // Size after this field
if (first_video_section)
{
DWORD cbDecompressed; // Size of the DDS
// pixel data after
// decompression
char dwDDSMagic[4]; // "DDS "
DDS_HEADER tDDS;
BYTE acData[cbSize - 8 - sizeof(tDDS)];
}
else
{
BYTE acData[cbSize];
}
};
struct SOUND_SECTION
{
char szSound[6]; // "SOUND"
DWORD cbSize; // Size after this field
BYTE cReserved; // Always 0
BYTE acData[cbSize - 1]; // Ogg-Vorbis data
};
If you prefer, here is a Hex Workshop structure library I wrote at the time to assist in parsing PFF files. Past me would also like to note that on big files this library will consume a huge amount of memory.
With 010 Editor, this wouldn't have happened.
Back to the future
As I mentioned, back in 2013 I didn't manage to figure out the compression used for the video frames.
It's 2022 now, and it's time to fix that.
So we fire up Ghidra, point it at SL2.exe, and look for references to the string "PFF0.0", to find the place where the file is parsed. Lucky for us, there's a lot of RTTI left in the binary, so Ghidra can deduce class names, which makes our work much easier. Unfortunately, there're a lot of virtual function calls, too, which we have to devirtualize by hand.
Eventually, we arrive at this function, which seems to be responsible for the decompression of a single video frame (all names except for the class name are my own):
Unfortunately, there is only so much Ghidra can do, and this is what it looks like after I did my best to clean up the decompiler's output.
The next step was pretty low-tech: copy the decompiler output to a text editor, and start cleaning it up by hand.
Let's take a look at some of the highlights.
Initialization
On the first video frame the code performs some setup:
- It reads the decompressed frame size and the DDS header.
- It allocates a scratch buffer with this size.
- It takes the maximum of the frame's width and height, and rounds it up to the nearest power of 2.
- It makes a copy of the original DDS header, but with the width and height changed to the value computed above. For instance, an 800x600 frame will be rounded up to 1024x1024.
- It allocates an array of 30 frames, with enough space to hold a DDS header and pixel data for a frame of the updated size.³ It then copies the new DDS header to each such frame.
The decoder (described below) places each decompressed frame into the array, cycling to the beginning when it reaches the end. Why allocate so many? No idea.
Why does it do this rounding of the frame dimensions? Maybe it's due to some quirk of the rendering code that requires all textures to be square, with each side a power of 2? Who knows.
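The rounding itself is just "smallest power of two that is at least max(width, height)"; in Python terms (my phrasing, not the game's actual code):

def round_up_pow2(n):
    # Smallest power of 2 that is >= n.
    return 1 << (n - 1).bit_length()

round_up_pow2(max(800, 600))    # -> 1024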
Capture the flag
The initialization above happens only on the first frame. From then on, the flow is the same for all frames.
First, the code reads a single byte which contains some flags. Here they are, with some names that I guessed based on the usage:
enum VIDEO_FRAME_FLAGS
{
VIDEO_FRAME_FLAG_INTERMEDIATE = (1 << 0),
VIDEO_FRAME_FLAG_COMPRESSED = (1 << 1),
VIDEO_FRAME_FLAG_RLE = (1 << 2),
VIDEO_FRAME_FLAG_TWO_PARTS = (1 << 3),
};
Of particular note is the VIDEO_FRAME_FLAG_COMPRESSED flag. As far as I can tell, this flag is set in all frames, in all the PFF files that I checked. In fact, if this flag is not set then the code will proceed to read some uninitialized memory, or data from previously decoded frames. Fun.
Huffman coding
Next up, the code reads and decodes some Huffman-coded data from the file.
Huffman coding is a method of compressing a stream of symbols by replacing each symbol with a unique sequence of bits (1s and 0s) - a codeword. This replacement follows two main principles:
- Symbols with a higher probability of occurring in the input stream are replaced by shorter codewords.
- No codeword is a prefix of any other codeword (i.e. this is a prefix code).

The first principle is what actually compresses the input stream. The second principle is what allows us to easily read the compressed data — since codewords have varying lengths, we need a way to tell where one ends and the next begins. By making the code prefix-free we guarantee that once we start reading the compressed data, there is only one possible way to split it into codewords.
You can read a better description over at Wikipedia.
Wait, how do we know that the video decoder uses Huffman coding? Well, if the decoding fails it logs the message:
Warning : HUFFMAN problem!!!
So there's that.
What does the encoded data look like? Like this:
struct HUFFMAN_DATA
{
WORD nTreeLength;
WORD anNodes[nTreeLength];
DWORD adwData[]; // Yep, a DWORD. See below.
};
First we have the number of nodes in the Huffman tree, then the actual nodes, with the first node in the array being the root. Since a Huffman tree is a full binary tree, each node is either an internal node with 2 children or a leaf node. On disk, these are distinguished using the high bit:
- If the high bit is set, this is a leaf node. The value of the node (i.e. the symbol this node encodes) is then stored in the low 9 bits.
- If the high bit is clear, this is an internal node. The left child is stored in the array element immediately adjacent to it. The right child is stored at the index indicated by the node's low 15 bits.
How is the actual compressed data stored? In blocks of 4 bytes. The decoder reads 4 bytes at a time in little-endian order, then processes the bits from most to least significant. As it reads those bits, it traverses the tree — taking the left child if a bit is 0, and the right child if it is 1 — until it reaches a leaf node. The value of the leaf is then output as the decompressed byte.
Wait, but how does it know when to stop? The struct above has no "size" field...
To mark the end of the compressed data the developers used a neat trick — they put in a sentinel. When the decompressor reaches a leaf with the value 0x100, it stops. It works since a byte can never have such a value 😀. And, it can save several bytes of space: instead of storing 4 bytes for the size, the sentinel takes up only a few bits.
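Putting the node layout, the bit-reading order and the sentinel together, here's a Python sketch of the decoder as I understand it (my own reconstruction, not the game's code; it returns the decompressed bytes plus the position just past the last DWORD consumed):

import struct

def huffman_decode(data, pos):
    # The tree: a node count, then the node array, with the root at index 0.
    (count,) = struct.unpack_from("<H", data, pos)
    nodes = struct.unpack_from("<%dH" % count, data, pos + 2)
    pos += 2 + 2 * count

    out = bytearray()
    node = 0                                            # start at the root
    while True:
        (dword,) = struct.unpack_from("<I", data, pos)  # 4 bytes at a time, little-endian
        pos += 4
        for shift in range(31, -1, -1):                 # bits from most to least significant
            bit = (dword >> shift) & 1
            node = node + 1 if bit == 0 else nodes[node] & 0x7FFF
            if nodes[node] & 0x8000:                    # reached a leaf
                value = nodes[node] & 0x01FF
                if value == 0x100:                      # the sentinel: we're done
                    return bytes(out), pos
                out.append(value)
                node = 0                                # back to the root for the next codeword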
Now, if the VIDEO_FRAME_FLAG_TWO_PARTS flag is set in the frame flags, and VIDEO_FRAME_FLAG_RLE is not set, then the video decoder will decompress another Huffman blob immediately following the first one, and concatenate the decompressed data to the first decompressed block.
RLE
If the VIDEO_FRAME_FLAG_RLE flag is set, then after the Huffman decompression the video decoder will run the decompressed data through a Run-Length Encoding (RLE) decoder⁴.
This particular brand of RLE actually only encodes runs of zeroes. The decompressor goes over every byte of the (compressed) input and:
- If the byte has the high bit set, then it treats the low 7 bits as a length, and outputs this many zero bytes.
- If the high bit is clear, it again treats the low 7 bits as a length, but this time it copies this many bytes from the input to the output.
Unlike the Huffman decompressor, there are no sentinels here. The decompression stops after processing all of the input data.
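As a Python sketch (mine, reconstructed from the description above):

def rle_decode(data):
    out = bytearray()
    pos = 0
    while pos < len(data):
        n = data[pos] & 0x7F
        if data[pos] & 0x80:
            out += b"\x00" * n                   # a run of n zero bytes
            pos += 1
        else:
            out += data[pos + 1:pos + 1 + n]     # n literal bytes, copied as-is
            pos += 1 + n
    return bytes(out)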
Interleaving
If the VIDEO_FRAME_FLAG_TWO_PARTS flag is set, the output of all previous decompression stages is now interleaved: the data is split into two halves, and the code walks both halves in lockstep, DWORD by DWORD, writing one DWORD from the first half and then one from the second into the output. So, for instance, this buffer:
0000h: 11111111 33333333 55555555 ....3333UUUU
000Ch: 22222222 44444444 66666666 """"DDDDffff
Becomes:
0000h: 11111111 22222222 33333333 ....""""3333
000Ch: 44444444 55555555 66666666 DDDDUUUUffff
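In Python, roughly:

def interleave(data):
    # Split into two halves and zip them back together, DWORD by DWORD.
    half = len(data) // 2
    first, second = data[:half], data[half:]
    out = bytearray()
    for i in range(0, half, 4):
        out += first[i:i + 4]
        out += second[i:i + 4]
    return bytes(out)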
Tables!
Things got a bit confusing there, so let's summarize how the decompression works:
| Flags | Stage 1 | Stage 2 | Stage 3 |
|---|---|---|---|
| None | Huffman | | |
| VIDEO_FRAME_FLAG_TWO_PARTS | Huffman x 2 | Interleave | |
| VIDEO_FRAME_FLAG_RLE | Huffman | RLE | |
| Both | Huffman | RLE | Interleave |
XOR
Finally, if the VIDEO_FRAME_FLAG_INTERMEDIATE flag is set, the fully decompressed pixel data is XORed with the previous frame's data.
Presumably, this is done to improve compression on similar frames. If two adjacent frames have areas with the same pixels, their XOR will be zero. This will compress really well with the RLE thingy!
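To tie all the stages together, here's how the whole per-frame decompression looks according to my reading of the decompiled code, as a Python sketch built on the helpers from the previous sections (the FLAG_* constants mirror the VIDEO_FRAME_FLAGS enum above):

FLAG_INTERMEDIATE, FLAG_COMPRESSED, FLAG_RLE, FLAG_TWO_PARTS = 1, 2, 4, 8

def decompress_frame(payload, flags, prev_pixels):
    # payload: the frame's video data, right after the flags byte
    # prev_pixels: the previous frame's decompressed pixel data
    # FLAG_COMPRESSED is assumed to always be set (see above).
    pixels, pos = huffman_decode(payload, 0)
    if flags & FLAG_TWO_PARTS and not flags & FLAG_RLE:
        second, _ = huffman_decode(payload, pos)      # the second Huffman blob
        pixels += second
    if flags & FLAG_RLE:
        pixels = rle_decode(pixels)
    if flags & FLAG_TWO_PARTS:
        pixels = interleave(pixels)
    if flags & FLAG_INTERMEDIATE:
        pixels = bytes(a ^ b for a, b in zip(pixels, prev_pixels))
    return pixels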
Results?
At this point I knew enough about the format and the video compression algorithm to write my own demuxer. Or so I thought.
My first attempt was a Python script that dumps all frames and audio to disk. It was slow, but it worked. Specifically, it worked on files that I had lying around from 9 years ago. They must be from the version of the game I played back then.
And then, while preparing this post, I tried to run it over the files from the demo of the game.
There was a slight hitch.
Remember how I, very confidently, described what the PFF file header looks like? Turns out, there can be multiple audio streams! For instance, this is what the header of CIN006.PFF from the demo looks like:
0000h: 5046 4630 2E30 0056 4944 454F 5F44 4453 PFF0.0.VIDEO_DDS
0010h: 0053 4F55 4E44 5F56 4F52 4249 5300 454E .SOUND_VORBIS.EN
0020h: 0053 4F55 4E44 5F56 4F52 4249 5300 4652 .SOUND_VORBIS.FR
0030h: 0053 4F55 4E44 5F56 4F52 4249 5300 4445 .SOUND_VORBIS.DE
0040h: 0053 4F55 4E44 5F56 4F52 4249 5300 4553 .SOUND_VORBIS.ES
0050h: 0053 4F55 4E44 5F56 4F52 4249 5300 4954 .SOUND_VORBIS.IT
0060h: 0045 4E44 4845 4144 4552 00 .ENDHEADER.
There are 5 streams!
This file also resolved the mystery of the cReserved byte in the sound section. It's not always 0, but rather a track index. There are frames that contain data for only some of the audio tracks, and that's how you can match a sound section to the track it provides data for.
There was one other interesting discovery when running the script over all the files at once: sometimes a frame contained only audio data, and no video. I modified my script to just output the previous frame again, which seems reasonable enough, though I don't know what the game's renderer actually does.
Results!
That should've been it, right? You can dump the entire contents of a PFF file, and that's all the information it contains. Well, technically there're also the timestamps for each frame, but in all the files I've seen they denote a constant framerate of 25 FPS.
You can even feed the individual frames to FFmpeg to create an actual, playable, video!
But of course, I didn't stop there. I went ahead and wrote a PFF demuxer for FFmpeg, all so that I could do:
ffmpeg -i SPLASH_SL2.PFF -c:v libx264 -c:a copy -map 0:v -map 0:a? SPLASH_SL2.mkv
Needless to say, this runs much faster than the Python script.
Funnily enough, the tricky part was actually implementing the extraction of the audio data from each frame. But you can read all about that in the source 🙄.
R̵҉̛͟e̴̡̨s̶̵̴u̕l͝͠҉̷t̡͜͝ş͏̷
https://twitter.com/SwiftOnSecurity/status/1483807561010843653
But why stop there? I now definitely know enough about the file format to write a muxer for FFmpeg. So that I can convert any video to PFF, and play it inside the game.
Well, actually I don't know enough. Most importantly, I don't know how to split the Ogg-Vorbis audio into PFF frames. Can I put all audio in the first frame? Should I split by timestamps somehow?
But I do know enough to mux only video. Luckily, there is one PFF file in the game that has no audio — CINEMENU.PFF. This is the file that plays in the main menu.
So I... umm... modified it a little...
Okay, now I'm done. Probably.
Final notes
When I was just starting out on this project, I posted on the XeNTaX forums asking for help. Unfortunately, no one replied. But now, at last, after 9 years, I can answer my own question 😎.
My FFmpeg fork with the PFF muxer/demuxer is here. And you can get some Windows binaries here.
The hacky Python scripts are here.
An 010 Editor template for PFF files is here.
Onward to other things!
1. Damn, that's a long time ago. ↩︎
2. Okay, actually Microsoft's documentation states that you should not rely on this field, and always compute the size using the width, height and compression algorithm. But still! ↩︎
3. Yes, that means the code has to fix up the decompressed frame data, since it has smaller dimensions. Specifically, it has to fix up the stride. I'm deliberately not getting into it. It's not pertinent to the decompression anyway. ↩︎
4. Run-Length Decoder. ↩︎