Titus the Fox / Moktar SQZ file format
by Jesses (mail at ttf dot mine dot nu)
Visit https://ttf.mine.nu for more TTF/Moktar stuff

SQZ decoding explanation, v1.0.

Version history:
  1.0 - initial version

This file describes how to decompress the SQZ files. One of two compression
algorithms is used: either LZW or Huffman+RLE. The byte at offset 1
determines which of the two is in use.

File format:

  Offset  Size  What
  0       1     bits 0..3: high nibble (bits 16..19) of uncompressed size,
                bits 4..7: unused
  1       1     0x10: LZW is used,
                anything else (0): Huffman+RLE is used
                CDRUN.COM's unpacker considers values above 0x10 invalid
  2       2     low word of uncompressed size

If LZW is used:

  Offset  Size  What
  4       rest  LZW bitstream

If Huffman is used:

  Offset  Size  What
  4       2     HTS: Huffman tree size in bytes
  6       HTS   HT: Huffman tree as array of 16-bit words
  6+HTS   rest  Huffman bitstream

A perl implementation is given in the file "unpack.pl".


How to decode LZW bitstream:

The bitstream consists of variable-length codewords; from here on, their
word size is named 'nbit'. Initially, nbit == 9.

LZW makes use of a dictionary; the codewords are used as indices to perform
lookups. This dictionary is not stored explicitly in the compressed stream;
instead it is derived during decompression.

Initially the dictionary contains 258 entries. For entries 0..255, the value
of each entry is simply the single byte whose value equals its index:
dict[i] = chr(i). The remaining two entries, 0x100 and 0x101, have a special
purpose and aren't used for lookups. The dictionary has a maximum size of
0x1000 entries.

The bitstream codewords can be divided into two categories. The first
category is formed by the two special values mentioned above, CLEAR_CODE
(0x100) and END_CODE (0x101). Whenever CLEAR_CODE is encountered, the decoder
should reset its state to the initial condition, that is, set nbit to 9 and
reset the dictionary to its initial 258 entries. END_CODE, as expected, marks
the end of the stream.

The second category is formed by all other codeword values. For each
codeword, first a new entry is optionally added to the dictionary. No entry
is added immediately after a CLEAR_CODE, nor when the dictionary is full.
Next, the codeword's value is used as an index in the dictionary: the
corresponding entry is output.

The value of the new dictionary entry is the output generated by the previous
codeword, plus one more byte. This byte is normally the first byte of the
current codeword's output. However, sometimes the current codeword points to
a not-yet-existing dictionary entry. In this case, the extra byte's value is
the first byte of the previous output. So, if the previous codeword generated
an output of 'ABC', and the current codeword points beyond the dictionary,
the new dictionary entry would be 'ABCA'.

For this to work correctly, the input stream must fulfill a few conditions,
all of which apply only when the current codeword points to a
not-yet-existing entry. First, the previous codeword can't be a CLEAR_CODE,
as there is no previous output in that case. Secondly, the dictionary cannot
be full, as otherwise there is no room to add the new entry that is needed to
look up the output. Finally, the current codeword must point to the
to-be-added entry, for the same reason. It is the responsibility of the
compressor to make sure these conditions hold, and they do for all
LZW-encoded SQZ files in both Moktar and Titus the Fox.

If, after adding an entry, the number of elements in the dictionary is equal
to 2^nbit, the value of nbit is incremented, unless nbit is already equal
to 12.

The number of unused input bits following END_CODE is at least one and at
most eight, so sometimes an entirely unused input byte is present at the end.

Regarding bit-order: consider the first three input bytes of a file, and
label the bits like this, with most-significant bits on the left in each
byte:

  76543210 FEDCBA98 NMLKJIHG

then the first two codewords (most-significant bit left) are:

  76543210F EDCBA98NM

Constants:

  CLEAR_CODE = 0x100    (0x101 for CDRUN.COM)
  END_CODE   = 0x101    (0x100 for CDRUN.COM)
  MAX_TABLE  = 0x1000

Pseudocode:

  int nbit
  stringarray dict
  int prev := CLEAR_CODE
  while (prev != END_CODE) {
    if (prev == CLEAR_CODE) {
      nbit := 9
      dict := [chr(0)..chr(255), ?, ?]   // 258 entries, ?=unused
    }
    int cw := next nbit-sized codeword
    if (cw != CLEAR_CODE && cw != END_CODE) {
      int newbyte
      if (cw < num_elem(dict)) {
        newbyte := first_byte(dict[cw])
      } else {
        // Assumption 1: prev != CLEAR_CODE
        // Assumption 2: num_elem(dict) < MAX_TABLE
        // Assumption 3: cw == num_elem(dict)
        newbyte := first_byte(dict[prev])
      }
      if ((prev != CLEAR_CODE) && (num_elem(dict) < MAX_TABLE)) {
        append dict[prev] ++ newbyte to dict
        if (num_elem(dict) == 2**nbit && nbit < 12) {
          nbit++
        }
      }
      output dict[cw]
    }
    prev := cw
  }

Implementation note:

The above pseudocode focuses on clarity, not efficiency. It is possible to
apply some optimizations.

A simple optimization is to not explicitly keep the first 258 entries of the
dictionary in memory, since the first 256 are trivially computed and the
remaining two are unused. This saves a few bytes of memory, at the cost of a
few extra conditional statements and subtractions of 0x102.

More interesting optimizations are possible. In particular, the use of string
data types can be completely avoided. Any addition to the dictionary is
formed by a prefix and a single byte, where the prefix already exists in the
dictionary at index 'prev'. Instead of storing this prefix literally, it can
also be stored as a pointer to the index where it occurs. This, together with
the maximum dictionary size, allows the array 'dict' to have a constant size
in bytes.

The perl implementation in "unpack.pl" uses these two optimizations.
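
The pseudocode and the two optimizations above can be sketched in Python as
follows. This is a minimal illustration, not a transcription of "unpack.pl":
the names BitReader, lzw_decode, FIRST_FREE, prefix, suffix and expand are
made up for this example, and the three stream assumptions are checked
explicitly instead of being taken for granted. The input is the LZW
bitstream only, that is, the file contents from offset 4 onwards.

```python
CLEAR_CODE = 0x100
END_CODE = 0x101
MAX_TABLE = 0x1000
FIRST_FREE = 0x102  # index of the first entry added during decoding


class BitReader:
    """Reads bits from a bytes object, most significant bit of each byte first."""

    def __init__(self, data):
        self.data = data
        self.bitpos = 0

    def read(self, nbits):
        value = 0
        for _ in range(nbits):
            byte = self.data[self.bitpos >> 3]  # IndexError on a truncated stream
            value = (value << 1) | ((byte >> (7 - (self.bitpos & 7))) & 1)
            self.bitpos += 1
        return value


def lzw_decode(stream):
    """Decode an LZW bitstream up to and including END_CODE."""
    bits = BitReader(stream)
    out = bytearray()
    prefix = []  # prefix[i]: codeword of the prefix of entry FIRST_FREE + i
    suffix = []  # suffix[i]: last byte of entry FIRST_FREE + i
    nbit = 9
    prev = CLEAR_CODE

    def expand(code):
        # Follow the prefix pointers back to a literal byte (code < 0x100).
        chars = bytearray()
        while code >= FIRST_FREE:
            chars.append(suffix[code - FIRST_FREE])
            code = prefix[code - FIRST_FREE]
        chars.append(code)
        chars.reverse()
        return bytes(chars)

    while prev != END_CODE:
        if prev == CLEAR_CODE:
            nbit = 9
            prefix.clear()
            suffix.clear()
        cw = bits.read(nbit)
        if cw not in (CLEAR_CODE, END_CODE):
            num_elem = FIRST_FREE + len(prefix)
            if cw < num_elem:
                newbyte = expand(cw)[0]
            else:
                # Assumptions 1..3 from the pseudocode, enforced here.
                if prev == CLEAR_CODE or num_elem >= MAX_TABLE or cw != num_elem:
                    raise ValueError("invalid LZW stream")
                newbyte = expand(prev)[0]
            if prev != CLEAR_CODE and num_elem < MAX_TABLE:
                prefix.append(prev)
                suffix.append(newbyte)
                if num_elem + 1 == (1 << nbit) and nbit < 12:
                    nbit += 1
            out += expand(cw)
        prev = cw
    return bytes(out)
```

Calling expand() once per codeword just to get the first byte is wasteful;
caching the first byte of every entry would avoid that, at the cost of a
third array.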

Example output:

LEVEL1.SQZ is identical for both Moktar and Titus the Fox. The first twelve
codewords, which will yield 38 decompressed bytes, are:

  01C 045 053 104 105 106 107 105 097 108 109 10B

After processing these codewords, the dictionary contains literally:

  (1C 45, 45 53, 53[2x], 53[3x], 53[4x], 53[5x], 53[6x], 53[3x] 97, 97 53,
   53[7x], 53[3x] 97 53)

The same dictionary, shown in pointer+byte format as explained above:

  (01C+45, 045+53, 053+53, 104+53, 105+53, 106+53, 107+53, 105+97, 097+53,
   108+53, 109+53)

The first bytes of the decompressed output at offset 0x00000 are:

  1C 45 53[18x] 97 53[9x] 97 53[10x] 97 53[6x] 96 53[3x] 97

At offset 0x00321 in the output, nbit changes its value for the first time.
Surrounding output bytes, starting at offset 0x318, are:

  96 53[3x] 97 53 96 53[3x] 97 96 53[3x] 97 53 96 53[3x] 97

At offset 0x012EC in the output, a CLEAR_CODE occurs in the input. The
surrounding output bytes, starting at offset 0x012E0, are:

  81 BD B6[10x] EB EC B7 B6[9x] EB EC B7 B6[4x] EB 1C 45 53[4x]

Finally, the last 20 bytes are:

  68 47 CA 47 FF 23 03 23 03 0B 00 08 CA 00[3x] B0 0D 40 02

CRC-32 of the uncompressed LEVEL1.SQZ is 0x44800FE5.


How to decode Huffman+RLE bitstream:

In Huffman coding a fixed binary tree structure is used, with output
codewords stored at its leaves. The Huffman bitstream sequentially addresses
these leaf nodes.

The tree is navigated node by node, starting at the root. At each node one
bit is read from the stream. This bit determines the next node to visit: a
bit value of 0 means visit the first (left) child, whereas a value of 1 means
take the second (right) child. Whenever a leaf node is reached, its value is
output and the current node position is reset to the root.

In most Huffman implementations the tree is stored efficiently in what is
known as "canonical" form. In the SQZ files however, this is not the case and
the tree is stored in a very straightforward fashion, making it easy to use.

The binary Huffman tree HT is stored as an array of 16-bit words, where each
word represents a node; each node is either an internal node or a leaf node.
The value stored in the array for a node has two parts: bit 15 is set if it
is a leaf node, bits 0..14 contain the node's value. For leaf nodes this
value is a codeword; for internal nodes it is 2*(index of first child), or
put another way: the byte offset in the array of the first child. The second
child of an internal node is stored immediately after its first child. As the
pseudocode below shows, the two children of the root are stored at indices 0
and 1.

Bit order is such that for each byte, the most significant bit is processed
first.

Pseudocode for reading codewords (Huffman decoding):

  int node := 0
  while (input bits remain) {
    bool b := next bit
    if (b) {
      node++
    }
    if (HT[node] < 0x8000) {
      // internal node
      node := HT[node] / 2
    } else {
      // leaf node
      int cw := HT[node] & 0x7FFF
      process codeword cw
      node := 0
    }
  }

This results in a sequence of RLE codewords. Each codeword 'cw' is processed
in the following way:

  Let L == cw mod 256
  Let H == cw div 256

  Codeword       What to do
  H==0           last := L; output 'last'
  H!=0 && L==0   read next codeword; output cw times 'last'
  H!=0 && L==1   read next codeword; count := L*256;
                 read next codeword; count += L;
                 output count times 'last'
  H!=0 && L>=2   output L times 'last'

After each "read next codeword", cw and L refer to the codeword that was just
read.

This is easily implemented using a state machine: the following pseudocode
can be directly inserted in the loop above.

The pseudocode uses three pre-existing variables:

  int state := 0
  byte last
  int count

Pseudocode for processing a codeword cw (RLE decoding):

  byte L := cw & 255
  byte H := cw >> 8
  if (state == 0) {
    if (H == 0) {
      last := L
      output 'last'
    } else if (L == 0) {
      state := 1
    } else if (L == 1) {
      state := 2
    } else {
      output L times 'last'
    }
  } else if (state == 1) {
    // cw == repeat count
    output cw times 'last'
    state := 0
  } else if (state == 2) {
    // L == high byte of 'count'
    count := L*256
    state := 3
  } else if (state == 3) {
    // L == low byte of 'count'
    count += L
    output count times 'last'
    state := 0
  }

Example output:

The games only have a few Huffman-encoded files. Moktar's only file is
SPRITES.SQZ, Titus the Fox has SPREXP.SQZ.

In SPRITES.SQZ the first 9 input bit groups corresponding to codewords are:

  11 1011 10100 0000101110 11 10010 0001110001 10011 10100

The first 12 intermediate codewords, after Huffman but before RLE:

  0000 0100 0003 007A 0000 0001 0098 0080 0003 0065 00C0 0000

The first 16 bytes of decompressed output for SPRITES.SQZ are:

  00 00 00 00 7A 00 01 98 80 03 65 C0 00 98 40 02

Now for SPREXP.SQZ, the first 12 input bit groups:

  11 11 011010 11 11 0100101 001100000 11 11 0100011 011101 11

The first 12 intermediate codewords:

  0000 0000 0008 0000 0000 0030 0028 0000 0000 0018 0040 0000

And the first 16 output bytes:

  00 00 08 00 00 30 28 00 00 18 40 00 00 3C 08 00

In both files, there is only one codeword where H!=0 && L>=1. The surrounding
codewords, for SPRITES.SQZ starting at output offset 0x2B131 and for
SPREXP.SQZ at 0x2AE29, are:

  0000 0100 0009 000C 0000 0101 0001 0008 0020 0000 0100 0008

The output generated by these codewords is:

  00 00[9x] 0C 00 00[256x] 00[8x] 20 00 00[8x]

Finally, the last 16 bytes of both files are:

  00 00 0C 60 0E E0 07 C0 07 80 01 00 00 00 00 00

CRC-32 of the uncompressed SPRITES.SQZ is 0xAA81B13A; for SPREXP.SQZ it is
0xEB0338C9.
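
The Huffman and RLE stages described above can be combined into a short
Python sketch. It rests on several assumptions that this file does not state
explicitly: all 16-bit words (the size word in the header, HTS, and the tree
entries) are taken to be little-endian, and decoding is taken to stop once
the uncompressed size from the header has been produced, so that padding
bits at the end are ignored. The function names huffman_codewords,
rle_decode and unpack_huffman_sqz are invented for this example.

```python
import struct


def huffman_codewords(tree, bitstream):
    """Yield RLE codewords by walking the tree, MSB of each byte first."""
    node = 0
    for byte in bitstream:
        for shift in range(7, -1, -1):
            if (byte >> shift) & 1:
                node += 1                # second child sits right after the first
            entry = tree[node]
            if entry < 0x8000:
                node = entry // 2        # internal node: byte offset -> word index
            else:
                yield entry & 0x7FFF     # leaf node: emit codeword
                node = 0


def rle_decode(codewords, size):
    """Run the four-state RLE machine until 'size' bytes have been produced."""
    out = bytearray()
    state = 0
    last = 0
    count = 0
    for cw in codewords:
        lo, hi = cw & 0xFF, cw >> 8
        if state == 0:
            if hi == 0:
                last = lo
                out.append(last)
            elif lo == 0:
                state = 1
            elif lo == 1:
                state = 2
            else:
                out += bytes([last]) * lo
        elif state == 1:                 # cw is the repeat count
            out += bytes([last]) * cw
            state = 0
        elif state == 2:                 # lo is the high byte of the count
            count = lo * 256
            state = 3
        else:                            # state 3: lo is the low byte of the count
            count += lo
            out += bytes([last]) * count
            state = 0
        if len(out) >= size:             # assumed stop condition
            break
    return bytes(out[:size])


def unpack_huffman_sqz(data):
    """Decode a whole Huffman+RLE SQZ file held in memory."""
    # Assumption: little-endian words throughout.
    size = ((data[0] & 0x0F) << 16) | struct.unpack_from("<H", data, 2)[0]
    if data[1] == 0x10:
        raise ValueError("file is LZW-compressed, not Huffman+RLE")
    hts = struct.unpack_from("<H", data, 4)[0]
    tree = struct.unpack_from("<%dH" % (hts // 2), data, 6)
    return rle_decode(huffman_codewords(tree, data[6 + hts:]), size)
```

Splitting the decoder into a codeword generator and an RLE consumer keeps the
two stages separate; inserting the state machine directly into the bit loop,
as the pseudocode suggests, works just as well. The CRC-32 values listed
above give a way to check the result against real files.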