Written by me in October 2014; originally posted as a KAT tutorial.
Below, I will investigate the intricacies of torrent metadata, and exactly how the crucial info_hash identifier is derived from it. A thorough understanding of these details allows us to directly examine, and, subject to the exercise of a lot of caution, even to manually edit, .torrent files. The tutorial is intended as a resource for anyone about to tackle those challenges, as well as for those who are just idly curious. I shall do my best to make the contents broadly accessible; the only prerequisite should be a certain minimal standard of technical literacy.
0) Concepts
"Metadata" refers to "data about data", or, more concretely, "data that describes other data". For example, if we consider the preceding sentence as data, an associated item of metadata could be the number of words it contains (14). Torrent files are filled with such metadata, describing the original content based on which they were created and whose transfer they are intended to facilitate.
This data is stored in a hierarchical manner, using the "Bencode" format, which is even simpler than, and almost as transparent as, the bbCode used in this forum. It employs the following three basic concepts of information theory.
Delimited sequences: One of the ways to store data of variable length is to define a pair of flags marking its start and end points. For this to work, those flags cannot occur within the data itself. bbCode tags, for example, are based on this - the string [b]blah[/b] consists of a start tag, followed by an unspecified number of characters, followed by an end tag.
Value pairs: Associating two values can result in elevating mere data to the level of information; which is to say that, in a very real sense, the meaning contained in such a combination exceeds the sum of the meanings of its parts. Consider, for instance, the word "age" and the number "42", neither of which really tells us much on their own - and then consider the combination "Age: 42".
Lists of items: By contrast, if we have a bunch of data of the same type, such as "apple", "orange", and "plum", we may need a structure that can contain the lot without losing track of where one ends and the next begins - "appleorangeplum" is clearly problematic.
1) Bencode specifications
The format used in .torrent files combines the concepts outlined above in various ways to produce the following four basic structures (described here in order of complexity). Any linebreaks and indentations occurring in the examples below and hereafter are present only for purposes of legibility, and would be absent in any actual use of the format.
Delimited integers: Numbers are stored as text - which has both advantages and disadvantages, compared to storing them in binary form - and are prefixed with an "i" for "integer" and suffixed with an "e" for "end".
i1e
i987e
Tallied strings: The same approach does not work for text and binary data, because whichever characters we choose as delimiters, there is no way to ensure that they don't appear as part of the data. Instead, these are stored as a value pair, consisting of the string itself and a number declaring its length. The number goes first, followed by a colon, followed by the string; this ordering removes any potential for ambiguity, even in cases in which further numbers and colons appear as part of the text.
4:blah
6:Age:42
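As a rough illustration of the two scalar types described so far, here is a minimal Python sketch. The function names are my own invention, and the choice of UTF-8 for text is an assumption; what the format itself counts is bytes, not characters.

```python
def encode_int(n: int) -> bytes:
    """Delimited integer: 'i', the decimal digits, then 'e'."""
    return b"i" + str(n).encode("ascii") + b"e"


def encode_str(data) -> bytes:
    """Tallied string: length in bytes, a colon, then the raw bytes."""
    if isinstance(data, str):
        data = data.encode("utf-8")  # assumption: text is stored as UTF-8
    return str(len(data)).encode("ascii") + b":" + data


print(encode_int(987))       # b'i987e'
print(encode_str("blah"))    # b'4:blah'
print(encode_str("Age:42"))  # b'6:Age:42'
```

Because the length comes first, a reader of the data always knows exactly how many bytes to consume after the colon, no matter what those bytes are.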
Externally delimited lists: The list structure is exceedingly simple, consisting in nothing more than the sequence of items, prefixed with an "l" for "list" and suffixed again with an "e". Unlike in the earlier "appleorangeplum" example, no internal delimiters (such as commas) are required, because unlike ordinary words, each item must be a Bencoded entity in its turn.
l
5:apple
6:orange
4:plum
e
l
i1e
4:blah
l
3:age
i987e
5:apple
e
e
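To see why no internal delimiters are needed, it may help to look at how a reader could walk through such data. The following is only a sketch of one possible decoder, covering the three structures introduced so far; the name decode and its return convention are my own choices, not part of any specification.

```python
def decode(data: bytes, pos: int = 0):
    """Decode one Bencoded item starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":
        # Delimited integer: everything up to the next "e" is the number.
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":
        # List: keep decoding items until the closing "e" is reached.
        items, pos = [], pos + 1
        while data[pos:pos + 1] != b"e":
            item, pos = decode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead.isdigit():
        # Tallied string: read the length, then take exactly that many bytes.
        colon = data.index(b":", pos)
        length = int(data[pos:colon])
        start = colon + 1
        return data[start:start + length], start + length
    raise ValueError(f"unexpected data at offset {pos}")


print(decode(b"l5:apple6:orange4:plume"))
# ([b'apple', b'orange', b'plum'], 23)
print(decode(b"li1e4:blahl3:agei987e5:appleee"))
# ([1, b'blah', [b'age', 987, b'apple']], 30)
```

Each item tells the decoder where it ends, either through its length prefix or through its own "e", so the list only has to mark its outer boundaries.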
Externally delimited ordered dictionaries: The final and most powerful construct is what is referred to as a "dictionary". In common usage, the word describes what boils down to a list of entries, each of which consists of a term and a definition or description of that term. The IT jargon sense is essentially the same, only a bit more general: A list of value pairs, each consisting of a string called the "key" and data associated with that key. Furthermore, and again just as is generally the case for ordinary dictionaries, the items must be listed such that the keys are in ascending order (cf. lexicographical order @ Wikipedia for specifics). As with simple lists, no additional internal delimiters are necessary because each constituent is a self-contained entity.
d
6:fruits
l
5:apple
6:orange
4:plum
e
7:numbers
l
i1e
i987e
e
e
d
3:Age
i42e
4:Name
d
5:First
7:Freddie
4:Last
5:Femur
e
e
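Putting all four structures together, a complete encoder fits in a few lines. This is a sketch under my own assumptions (the name bencode, UTF-8 for text, plain Python types as input), not a reference implementation; the point of interest is the sorting of dictionary keys before they are written out.

```python
def bencode(value) -> bytes:
    """Encode ints, strings/bytes, lists and dicts as Bencode."""
    if isinstance(value, bool):
        raise TypeError("Bencode has no boolean type")
    if isinstance(value, int):
        return b"i" + str(value).encode("ascii") + b"e"
    if isinstance(value, str):
        value = value.encode("utf-8")  # assumption: text is stored as UTF-8
    if isinstance(value, bytes):
        return str(len(value)).encode("ascii") + b":" + value
    if isinstance(value, list):
        return b"l" + b"".join(bencode(item) for item in value) + b"e"
    if isinstance(value, dict):
        # Turn keys into bytes, then sort so they come out in ascending order.
        pairs = sorted(
            ((k.encode("utf-8") if isinstance(k, str) else k, v)
             for k, v in value.items()),
            key=lambda pair: pair[0],
        )
        return b"d" + b"".join(bencode(k) + bencode(v) for k, v in pairs) + b"e"
    raise TypeError(f"cannot Bencode {type(value).__name__}")


# Keys are deliberately given out of order; the encoder sorts them.
person = {"Name": {"Last": "Femur", "First": "Freddie"}, "Age": 42}
print(bencode(person))
# b'd3:Agei42e4:Named5:First7:Freddie4:Last5:Femuree'
```

Because the keys are always sorted, the same dictionary can only ever be written out in one way, which matters a great deal once hashes are calculated over the encoded data.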
2) Torrent file specifications
This now gives us the toolkit to pick up where we started off: The, for lack of a better term, physical contents of any .torrent file are nothing more or less than the Bencoded metadata associated with what we generally think of as the "contents" of the "torrent". The usual means of displaying .torrent files, such as websites like this one and BitTorrent clients like the one you're using, hide the former and show us the latter. To see the contents naked, as it were, we have to open the file using software which doesn't know or care about how the BitTorrent protocol works - a hex editor, for example.
The bulk of the metadata is just what common sense would suggest: A description of the files/the data to be torrented, and a tracker listing. For principally historical reasons, the specific substructure of the corresponding blocks differs somewhat depending on whether there is a single file or tracker, or several of either, so we're going to look at examples of both cases. For illustration purposes, our "original data" will consist of two tiny text files, "abba.txt" and "abc.txt" (backup archive), the contents of which match their names: The "abba" file is exactly 64 kB in size and contains nothing but the character "a" (plus some line breaks) in its first and last quarters, and nothing but "b" in its two middle quarters. Along the same lines, "abc" is 48 kB in size and consists of three equally sized portions, filled with, you guessed it, "a", "b", and "c". The reasoning behind those particular choices will become clear in due course.
First example (single-file torrent with a single tracker)
Using the current mainline client (BitTorrentPlus 7.9.2), I now create a torrent from the "abba" file, adding one tracker and a short comment and setting the piece size to 16 kB. The resultant .torrent file is 324 Bytes in length and has the following contents, formatted for legibility as before.
d
8:announce
38:udp://tracker.publicbt.com:80/announce
7:comment
30:This is a single-file torrent.
10:created by
16:BitTorrent/7.9.2
13:creation date
i1413650210e
8:encoding
5:UTF-8
4:info
d
6:length
i65536e
4:name
8:abba.txt
12:piece length
i16384e
6:pieces
80:
0x1A 0xD6 0xF6 0x4C 0x8D 0x94 0xFA 0x2E 0x20 0x54 0xD3 0xF6 0xE0 0x1A 0xB7 0x2A 0xE3 0x34 0xF2 0xD9
0x13 0xF7 0xEB 0x29 0x20 0x01 0x54 0x6E 0x42 0x9D 0x0C 0x7F 0x81 0x27 0xCD 0xD2 0xB8 0x39 0x0D 0x85
0x13 0xF7 0xEB 0x29 0x20 0x01 0x54 0x6E 0x42 0x9D 0x0C 0x7F 0x81 0x27 0xCD 0xD2 0xB8 0x39 0x0D 0x85
0x1A 0xD6 0xF6 0x4C 0x8D 0x94 0xFA 0x2E 0x20 0x54 0xD3 0xF6 0xE0 0x1A 0xB7 0x2A 0xE3 0x34 0xF2 0xD9
e
e
Starting from the top (both in the linear and the hierarchical sense), the whole thing is a dictionary with a handful of entries, most of which are simple value pairs, while the last one is another dictionary called "info". The first key is "announce", and the value makes clear that this refers to the lone tracker; then comes the comment I added; then a creator signature; a creation timestamp ("in standard UNIX epoch format (integer, seconds since 1-Jan-1970 00:00:00 UTC)", according to the design document); a charset designation for the text-based portions; and finally the second, subordinate dictionary... which is where things get interesting!
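The "creation date" value is easy enough to translate into something human-readable. A quick Python sketch, using nothing but the standard library:

```python
from datetime import datetime, timezone

# The integer stored under "creation date" in the example above.
created = datetime.fromtimestamp(1413650210, tz=timezone.utc)
print(created.isoformat())  # 2014-10-18T16:36:50+00:00
```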
The first of the entries in the "info" dictionary, "length", equates to 64k, so it must refer to the size of the original file - as well as that of the torrent in its entirety, as it contains nothing but said file. Then comes the (file-) "name", the "piece length" equating to 16k, and last but not least a field called "pieces" which contains an 80-byte data block, displayed here in standard hexadecimal notation (each "0x##" snippet corresponds to a single byte). As it turns out and as shown above, this portion is more usefully considered as a series of 20-byte blocks, that being the length of a "SHA1"-type hash value, one of which is derived from, and can later be checked against, each of the (64k/16k=) four pieces. Whenever you instruct your torrent client to perform a "force re-check" on a partially completed download, for example, this is what your local copy is "re-checked" against to determine whether each piece is identical to the corresponding one in the original copy (which is to say, complete) or not (incomplete).
And that's where the precise partitioning of our "abba" file pays off: The first and last of the four pieces are identical, as are the two middle pieces - and as a direct result, so are their hashes!
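The construction of the "pieces" field can be sketched in a few lines of Python. Note the assumptions: piece_hashes is a name I made up, and the stand-in data below leaves out the line breaks of the real "abba" file, so the hash values it produces will not match the ones listed above. The pattern of repetition, however, is the same.

```python
import hashlib


def piece_hashes(data: bytes, piece_length: int) -> bytes:
    """Concatenate the 20-byte SHA1 digests of consecutive pieces."""
    return b"".join(
        hashlib.sha1(data[i:i + piece_length]).digest()
        for i in range(0, len(data), piece_length)
    )


quarter = 16 * 1024  # 16 kB, the piece size used in the example
# Stand-in for abba.txt: "a" quarter, two "b" quarters, "a" quarter.
sample = b"a" * quarter + b"b" * (2 * quarter) + b"a" * quarter

pieces = piece_hashes(sample, quarter)
hashes = [pieces[i:i + 20] for i in range(0, len(pieces), 20)]

print(len(pieces))                                   # 80
print(hashes[0] == hashes[3], hashes[1] == hashes[2])  # True True
print(hashes[0] == hashes[1])                        # False
```

A re-check works the same way in reverse: hash each local piece and compare the result to the corresponding 20-byte block stored in the .torrent file.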
Which leaves us with a series of endings - first the "e" flag for the inner "info" dictionary structure, ditto that for the outer wrapper, and then the end of the file. (And of this example.)
(continued below)