Difference between revisions of "TLK"

From Dragon Age Toolset Wiki
Jump to: navigation, search
(adding info about TLK format used in Dragon Age 2)
m (Split new DA2 information to its own TLK (DA2) page, minor reformating)
Line 1: Line 1:
These [[GFF]] files contain strings. They can be opened in the GFF editor.
+
The [[TLK]] file is a [[GFF]] formatted file used to contain strings. They can be opened in the Toolset or the GFF Editor.
 
+
 
+
  
 
Keep in mind that all numbers are in little-endian format if you are looking at the raw file.
 
Keep in mind that all numbers are in little-endian format if you are looking at the raw file.
 
 
 
{the following is related to the format used in Dragon Age 2}
 
 
let's take a look first at the raw content of a talk file
 
 
 
 
 
 
== 1. Raw View ==
 
 
 
 
 
 
[[File:1_raw_view.jpg|200px|thumb|left|raw view]]
 
 
 
 
 
 
this image shows an example of how the raw talk file in Dragon Age 2 looks like. As you can see, the readable text at a first glance is minimal, limited to the few recognizable strings as GFF, type, version, etc.
 
 
 
 
We will assume that there are ways to convert the raw format into something more readable, and in the next paragraphs we will explore some ways to make it happen
 
 
 
 
 
 
== 2. Header View ==
 
 
 
 
 
 
 
 
Firstly, we will look at the big header, but because it's a standard format that never changes across different talk files, and the GFF format is covered in depth in a different article, we will not spend a lot of time on it
 
 
 
 
 
 
 
 
[[File:TLK_2_header_view.jpg|200px|thumb|left|header view]]
 
 
 
 
 
 
 
 
in short,and assuming prior knowledge of the GFF format, this GFF file header has 4 bytes GFF Magic Number,
 
 
- followed by four bytes as version V4.0,
 
 
- followed by four bytes for platform, in this case PC,
 
 
- followed by four bytes as file type TLK,
 
 
- followed by four bytes as file type version V0.5.
 
 
- After that, 4 bytes show the number of structures inside this GFF, in this case only 2.
 
 
- The next four bytes is the data offset and shows number 78 which is hexadecimal for 120, and show at what byte starting from the beginning of file the actual talk file starts, basically this means that the first 120 bytes are the file header.
 
 
- after that we get four bytes with the name(type) of the first out of two structures HTLK,
 
 
- followed by four bytes that show the number of fields inside the structure, in our case 3,
 
 
- followed by four bytes that show the field offset, which means at what byte starting from the beginning of file the data related to this structure actually starts, in our case 3C in hexadecimal, which is decimal 60,
 
 
- followed by four bytes that show the length of the structure,0C hexadecimal in our case, which is decimal 12 which is 4 bytes * no. of fields 3 = 12
 
 
- next, similar information about the second out of two structures named HSTR
 
 
- which has four bytes that showed the number of fields inside the structure, in this case 2,
 
 
- followed by four bytes that show at what byte starting from the beginning of the file the data related to the structure actually starts, in our case hexadecimal 60, which is decimal 96
 
 
- followed by four bytes that show the length of the structure, 08, which is four bytes * no. of fields 2 = 8
 
 
- at this point we reached 3C( hexadecimal), which is 60 decimal byte, which is where we learned from the above information that the first structure HTLK related data starts, here is where the field array starts for this first structure
 
 
- the relevant data here is displayed in a predictable manner according to the GFF documentation, basically you get Label, Field Type, Index (3 field subtype) in increments of four bytes, for each of the two structures fields,HTLK has three fields and HSTR has two fields, 2+3 = 5 total structures multiplied by 3( field subtype) = 15
 
 
- And we reached byte 78 hexadecimal/120 decimal. Once again everything up to this point is identical across all the talk files in Dragon Age 2
 
 
 
 
a quick note about the next sequence of bytes: at bytes 78 hexadecimal, you can see 0C( four bytes). I'm not sure what this is yet, but seems to be the same value across talk files. After that there are a sequence of four more bytes, in our example 90 hexadecimal, which may be the size in bytes of the section of the talk file where the list of strings is saved. It may also be the size of the serialized Huffman tree that would be discussed later on.
 
 
 
 
 
 
 
 
== 3. String ID View ==
 
 
 
 
 
 
Moving on to the next picture, the first four bytes listed 224 hexadecimal, seems to be the size of the compressed string, but I'm not sure. These three values( in our examples 0C,90,224 hexadecimal) are the only one that I'm not 100% sure of, I hope I can do better with the rest of the explanation
 
 
 
 
 
 
 
 
[[File:3_stringID_view.jpg|200px|thumb|left|string ID view]]
 
 
 
 
 
 
 
 
In this section, there is not much to say, after the first four bytes 224 hexadecimal that I'm not sure of, the rest is clearer
 
 
- Four bytes, in our case 10 hexadecimal, 16 decimal is the number of strings contained in this example talk file
 
 
- After that there is a sequence of 16 pairs ( the number of pairs has to be identical with the number identified in the line above, in our example 16 decimal)
 
 
- Each pair consists of a string ID and a number that represents the bit count where the string actually starts inside the compressed string
 
 
- In our example the first string ID in decimal is 6200472 ( little endian 989C5E hexadecimal), followed by the string Start count, in our example AB hexadecimal, which is decimal 171, and basically means that the string ID 6200472 first character starts at position 171 inside the compressed string. The compression is a standard Huffman coding, which uses simple concatenation when compressing, so knowing the position of the first character of any string, can easily help us get the rest of the string decoded from the compressed stream. Hopefully it would be clearer later on in the article.
 
 
- There is not much else to say, except that in this example you can see that we have redundant strings( with different ID but with the same starting point, so basically the strings uncompressed are the same, for what ever reason the Dragon Age 2 creators decided to create identical strings with different string ID). Also, some strings in this particular example are empty. it may not be obvious that they are empty by looking at the raw content, but it's apparent once all the strings are decompressed. Again, I don't know why an empty string is actually needed. this is a side note, and not really relevant, I wouldn't mind if someone has more info about these empty strings
 
 
 
 
 
 
 
 
== 4. Huffman coding ==
 
 
 
 
 
 
 
 
[[File:TLK_4_Huffman_coding.jpg|200px|thumb|left|Huffman coding]]
 
 
 
 
 
 
 
 
In this section we will introduce the new format the developers chose to use for the talk files
 
 
If you look at the picture, the first important set of four bytes is the one that in our example is hexadecimal 64 ( decimal 100). This number shows the number of nodes used in the Huffman tree specific to this talk file. It is a very good idea before moving forward to glance over the Huffman coding article on Wikipedia.
 
 
Each node is represented by four bytes. Some nodes are actually characters( and follow standard ASCII encoding), but the developers decided to use them in a negative number format that's why they look like XX-FF-FF-FF format. These nodes are called leaves. It is very easy to identify which character in ASCII encoding they represent( although it may also be Unicode): the formula is FF-FF-FF-FF -( minus) XX-FF-FF-FF. To simplify we can use only FF-FF - XX-FF. The result is the actual character encoding in hexadecimal. In our example the first sequence of four bytes in the negative representation is D3-FF-FF-FF. If we apply the formula mentioned above FF-FF - D3-FF, the result is 2C hexadecimal which is 44 decimal which represents the character "D" ( the capital letter D). So we know that this specific talk file definitely contains a string that has the character "D".
 
 
The other format is in positive representation XX-00-00-00, and they are internal nodes, which act as fillers in the Huffman tree, these nodes do not represent characters, but instead connect leaves to parent nodes.
 
 
Some more information: in the original Huffman coding the formula is that the Huffman tree contains N leaves + (N internal nodes - 1), so 2N-1 total number of nodes. In Dragon Age 2, there is an extra important leaf needed which is the character that signifies the end of a string. In this case, the negative number format for the specific character was decided to be decimal -1 which is hexadecimal FF-FF-FF-FF. Which is perfect when used with the above formula, because this is how the algorithm knows when the string inside the compressed stream was fully identified and decoded, and that happens when the formula generates 0 as the result of FF-FF -(minus) FF-FF
 
 
Without this character, the original Huffman coding formula would be, in our case 50 leaves + ( 50-1)internal nodes =  99 nodes in the tree, which is one less then the expected 100 decimal/64 hexadecimal that was identified as the number of nodes in the tree in the first step of this section.
 
 
so in our Dragon Age case we end up with N+1 leaves + N-1 internal nodes = 2N
 
 
The pattern to serialize the tree nodes is : the nodes in the tree are evaluated as some sort of Breadth-first search ( check out the article in Wikipedia for more information), basically you insert the nodes in a sequence reading them from left to right and from the lowest leaf to the highest node in the tree ( excluding the root, which is not considered).so in our example letter D represents the lowest leaf from the left of the tree, that's why is the first one that appears in our serialized sequence as D3-FF-FF-FF. And then, one by one in sequence, from left to right and bottom to top enumerate the leaves and/or internal nodes as you encounter them up until the last internal nodes which are the immediate children of the only root( which once again is not considered in the serialization process). As a side note, it is not unexpected that the last negative number which represents an actual leaf/character in our list is DF-FF-FF-FF, which  turns out to be hexadecimal 20, which is decimal 32 which stands for the "SPACE" character, which is the most common character in pretty much any talk file.
 
 
Later on we will discuss an actual example, to test the assumptions we learned above
 
 
 
 
 
 
== 5. Compressed Strings ==
 
 
 
 
 
 
[[File:TLK_5_compressed_strings.jpg|200px|thumb|left|compressed strings]]
 
 
 
 
 
 
 
 
this section is pretty straightforward, after the serialized Huffman binary tree we can see a sequence of four bytes that show number 66 hexadecimal, which is 102 decimal, which is the number of sequences of four bytes that the compressed stream in its entirety consists of, so this number shows that the compressed stream is 102*4= 408 bytes long, and inside this 408 bytes stream the 16 strings from our example are compressed into
 
 
one thing to notice is that because our pattern follows increments of four bytes, the last two bytes in the stream are actually two pairs of zeros, this means that the actual 16 strings need only 406 bytes, but because 406 is not the multiple of four, we had two more empty bytes as padding
 
 
 
 
Now we can take a look at the simple example to see if what we learned above applies
 
 
Let's consider the following starting strings: in this simple example the pattern we would follow is that numbers represent string ID, and capital letters represent the actual string. The character "|"( bar) represents a visual representation of the special character that signals the end of the string as discussed in a previous section
 
 
 
 
1 A|
 
 
2 AB|
 
 
3 ABC|
 
 
4 ABCD|
 
 
5 ABCDE|
 
 
6 ABCDEF|
 
 
7 ABCDEFG|
 
 
8 ABCDEFGH|
 
 
9 ABCDEFGHI|
 
 
 
 
using any Huffman coding program that can be found on the Internet, we get the following Huffman tree
 
 
 
 
Key:A Value:9 Frequency:0.167 Binary:110
 
 
Key:| Value:9 Frequency:0.167 Binary:111
 
 
Key:B Value:8 Frequency:0.148 Binary:101
 
 
Key:C Value:7 Frequency:0.130 Binary:100
 
 
Key:D Value:6 Frequency:0.111 Binary:010
 
 
Key:E Value:5 Frequency:0.093 Binary:001
 
 
Key:F Value:4 Frequency:0.074 Binary:000
 
 
Key:G Value:3 Frequency:0.056 Binary:0110
 
 
Key:H Value:2 Frequency:0.037 Binary:01111
 
 
Key:I Value:1 Frequency:0.019 Binary:01110
 
 
 
 
Let's analyze: we have nine strings, and each string contains "A", so we also have 9 occurrences of "A". Because we use "|" for visual representation where the string terminates, we also have nine occurrences of the special "|" character. And so on for each character we encounter we add it to the table
 
 
 
 
Now let's create a basic visual representation of the tree, so we can use it as an aid when we count left to right and bottom to top each node, regardless if it's a leaf( actual character) or an internal node( used as filler/connector). Remember, we don't consider the root. Also as a convention, we will consider 0 on the left, and 1 on the right,it's easier to read the tree left to right.IN means internal node, and the number increments bottom to top and left to right
 
 
 
 
 
 
 
 
[[File:TLK_7_simple_Huffman_tree_representation.jpg|400px|thumb|left|Huffman tree representation]]
 
 
 
 
 
 
 
 
applying our assumptions that we learned previously, we can see that the first item in our serialized tree will be I ( capital i). If you remember, the actual characters are represented as negative numbers."I" ASCII code is 49 hexadecimal. FF-49=B6, so our first sequence of four bytes in the serialized tree looks like:B6-FF-FF-FF. Reading left to right and bottom to up our visual tree above, the next sequence of four bytes is "H", which going through the motions turns into B7-FF-FF-FF, followed by "G" a.k.a. B8-FF-FF-FF, followed by the first internal node we encounter IN0, which is represented as positive 00-00-00-00.
 
 
now we reached "F" which is B9-FF-FF-FF, and so on and so on, just follow the left to right and bottom to up pattern. From the tree above we can see for example that the last two sequences of four bytes in our serialized tree are 06-00-00-00 and 07-00-00-00
 
 
if we count all the nodes in our tree, we can see that they are 18 decimal/12 hexadecimal, and following the formula 2N, which means N+1 for leaves and N-1 for internal nodes, then we have 10 leaves and 8 internal nodes,N=9
 
 
and we are done serializing the Huffman tree, we have all the information we need, such as complete number of nodes(18d/12h), the correct order of nodes, starting with letter "I", and ending with IN7
 
 
 
 
Now let's move on to handling the compressed stream that contains the nine strings along with the termination character.
 
 
if we look at our sequence of strings from our working example, the first string is "1 A|", which has only two characters: "A" and the termination character "|" (character BAR, not to be confused with letter capital "i")
 
 
now if we look at our table, we can see that "A" is represented as Binary 110, and "|" is represented as Binary 111
 
 
if you recall, Huffman compression is just a simple concatenation, so we can expect that our first string "A|" would be represented as "110"+"111" = 110111
 
 
you may also recall from the section 3 of this article: string ID, that "Each pair consists of a string ID and a number that represents the bit count where the string actually starts inside the compressed string". For our first string, the string ID is 1, and for simplicity we will assume that the count starts at zero.
 
 
This means that the section of the talk file that deals with string IDs, will represent the relevant information about the first string as "01 00 00 00" sequence of four bytes as the string ID, followed by another sequence of four bytes as the counter in the compressed stream, in our case for the first string, the counter starts from zero, represented as "00 00 00 00"
 
 
 
 
now let's take a look at the second string, and apply the same thing: "2 AB|"
 
 
binary speaking "AB|" string is converted into 110101111, according to our Huffman table
 
 
the second string counter starts where the first string ended, the first string started at 0, had a length of 6 bits, and ended at position 5 in the compressed stream, so the second string counter has to start at the next available bit in the stream, which is 6. This way we know that the second string ID is "02 00 00 00", and the counter would be "06 00 00 00"
 
 
 
 
let's do one more string "3 ABC|"
 
 
binary it becomes 110101100111
 
 
its ID is "03 00 00 00", and its counter is 15 decimal which is 0F hexadecimal so it is "0F 00 00 00"
 
 
 
 
And so on, and so on. the binary representation of all the strings converted turns the compressed stream into the following binary stream:
 
 
"1 1 0 1 1 1 1 1 0 1 0 1 1 1 1 1 1 0 1 0 1 1 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 1 1 1 1 1 0 1 0 1 1 0 0 0 1 0 0 0 1 0 0 0 0 1 1 0 0 1 1 1 1 0 1 1 1 0 1 1 1 "
 
 
 
 
For example we can validate by looking at the last three characters in our stream, which we assume it should be "HI|", and indeed the last 13 bits in our compressed stream are the 5 bits representing "H", followed by the 5 bits representing "I", followed by the 3 bits representing the termination character "|"
 
 
01111+01110+111 = "HI|"
 
 
 
 
Now there is one more step, which is to turn the binary stream of zeros and ones into its hexadecimal equivalent. We know that any byte is 8 bits, and any 8 bits combination of 0 and 1, has its equivalent of any combination of 2 hexadecimal numbers from 0 to F. For example if we take the first 8 bits from our compressed stream 11011111, which in compressed form stand for  "A","B" and 2/3 bits of "|", that sequence of eight bits actually has the hexadecimal equivalent of "DF". In our example, the long binary stream turns into a sequence of 22 bytes in hexadecimal representation, something like: "DF BB 37 7B A3 BD 11 F7 46 C4 DE 88 B0 BD 11 61 7E 6F 44 98 77 03". But if you remember, we need to apply some padding so that the sequence of bytes in hexadecimal form that represents the compressed stream is multiple of 4, so in our example the final hexadecimal string looks like "DF BB 37 7B A3 BD 11 F7 46 C4 DE 88 B0 BD 11 61 7E 6F 44 98 77 03 00 00"
 
 
 
 
the raw view of our working example end result can be seen in its full glory below
 
 
 
 
 
 
[[File:TLK_6_example.jpg|200px|thumb|left|raw view of working example]]
 
 
 
 
 
 
 
 
one final note: this article intends to shed some light of the inner workings of the talk file format in Dragon Age 2, but it's not intended to be perfectly accurate, please update it if you find errors or if you have anything you might want to add
 
 
 
 
 
 
== 6. Credits ==
 
 
 
 
Credits to: ThoughtProcess(Rick Gibbed) and Longod(Hikami), their previous work related to Dragon Age 2 TLK inspired this article. Longod's da2tlkconv was used to generate the raw output from the working example end result.
 
 
 
 
{Below original article related to the format used in Dragon Age Origins}
 
  
 
This may be incomplete, as I believe previous TLK files had timing info, etc as well.
 
This may be incomplete, as I believe previous TLK files had timing info, etc as well.
  
 
+
== Structure ==
  
 
A cursory glance at the file leads me to believe it's formatted something like this:
 
A cursory glance at the file leads me to believe it's formatted something like this:
  
 +
<pre>
 +
Header
 +
String ID section
 +
String data section
 +
</pre>
  
 
+
=== Header ===
'''Header'''
+
 
+
 
+
 
+
'''String ID section'''
+
 
+
 
+
 
+
'''String data section'''
+
 
+
 
+
 
+
----
+
 
+
 
+
 
+
'''Header'''
+
 
+
 
+
  
 
*104 bytes
 
*104 bytes
 
 
**20 byte string describing the file, watch the 2 spaces between PC and TLK <pre>GFF V4.0PC  TLK V0.2</pre>
 
**20 byte string describing the file, watch the 2 spaces between PC and TLK <pre>GFF V4.0PC  TLK V0.2</pre>
 
 
**80 bytes of other info I haven't figured out yet. May include info about the table, offsets
 
**80 bytes of other info I haven't figured out yet. May include info about the table, offsets
 
 
**4 bytes - number of entries in the table
 
**4 bytes - number of entries in the table
  
 +
=== String ID Section ===
  
 
+
Tightly packed.
'''String ID section'''
+
 
+
 
+
 
+
Tightly packed
+
 
+
 
+
  
 
*4 bytes  
 
*4 bytes  
 
 
**unsigned 32 bit integer - TALK_STRING_ID
 
**unsigned 32 bit integer - TALK_STRING_ID
 
 
*4 bytes
 
*4 bytes
 
 
**unsigned 32 bit integer - offset to string (appears to be an offset from 0x60)
 
**unsigned 32 bit integer - offset to string (appears to be an offset from 0x60)
  
 
+
=== String Data ===
 
+
'''String Data'''
+
 
+
 
+
  
 
*4 bytes
 
*4 bytes
 
 
** Unsigned 32 bit integer, little endian - number of characters in string.
 
** Unsigned 32 bit integer, little endian - number of characters in string.
 
 
** The toolset seems to ignore this.
 
** The toolset seems to ignore this.
 
 
*String in UTF-16, little endian.  
 
*String in UTF-16, little endian.  
 
 
** Terminated by a NUL character (0x00) (twice, since it is UTF-16LE) and filled up (with 0xff) to be a multiple of 4 bytes.
 
** Terminated by a NUL character (0x00) (twice, since it is UTF-16LE) and filled up (with 0xff) to be a multiple of 4 bytes.
 
 
 
 
  
 
[[Category:File types]]
 
[[Category:File types]]

Revision as of 20:42, 29 March 2015

The TLK file is a GFF formatted file used to contain strings. They can be opened in the Toolset or the GFF Editor.

Keep in mind that all numbers are in little-endian format if you are looking at the raw file.

This may be incomplete, as I believe previous TLK files had timing info, etc as well.

Structure

A cursory glance at the file leads me to believe it's formatted something like this:

Header
String ID section
String data section

Header

  • 104 bytes
    • 20 byte string describing the file, watch the 2 spaces between PC and TLK
      GFF V4.0PC  TLK V0.2
    • 80 bytes of other info I haven't figured out yet. May include info about the table, offsets
    • 4 bytes - number of entries in the table

String ID Section

Tightly packed.

  • 4 bytes
    • unsigned 32 bit integer - TALK_STRING_ID
  • 4 bytes
    • unsigned 32 bit integer - offset to string (appears to be an offset from 0x60)

String Data

  • 4 bytes
    • Unsigned 32 bit integer, little endian - number of characters in string.
    • The toolset seems to ignore this.
  • String in UTF-16, little endian.
    • Terminated by a NUL character (0x00) (twice, since it is UTF-16LE) and filled up (with 0xff) to be a multiple of 4 bytes.