Notebook

File Format Identification¶

In this notebook we will peek inside various files to see their internal structure, their file format. We will discover the high-level differences between formats that serve a number of different purposes, from office documents, to images, to binaries and executables, to compression and "archive" formats. Along the way we will learn how file formats are often specified in standards documents that describe their structure in detail. We will learn how file formats may be detected through various pattern recognition techniques and about the PRONOM and other public databases of file format detection patterns.

Peeking at Files¶

Our first step is to create a little code that will let us peek inside any file, regardless of format. Normally we use application software or viewers that mediate our access to most files, but not this time. The code below opens a file as an unprocessed stream of binary data and prints first ten bytes for us to examine. By default the bytes that are in the range of "ASCII values" (0 to 127) are printed as the appropriate ASCII text characters. Other bytes are printed as their "hexidecimal values" with a \x prefix, such as \xf5. Python let's us know that it is printing a byte value through notation like this: b'value'

In [1]:

with open("samples/Website Concerning Career Opportunities/backblue.gif","rb") as file:
    byte = file.read(1)
    count = 0
    while byte and count < 10:
        print(byte)
        byte = file.read(1)
        count = count + 1

b'G'
b'I'
b'F'
b'8'
b'9'
b'a'
b'\xf5'
b'\x01'
b'\xc8'
b'\x01'

You may have noticed that this GIF image file starts with a sequence of ASCII characters that say, "GIF89a". Let's look at another GIF file.

In [2]:

with open("samples/Website Concerning Career Opportunities/fade.gif","rb") as file:
    byte = file.read(1)
    count = 0
    while byte and count < 10:
        print(byte)
        byte = file.read(1)
        count = count + 1

b'G'
b'I'
b'F'
b'8'
b'9'
b'a'
b'\x08'
b'\x00'
b'\x08'
b'\x00'

Again we see that the first few bytes of this file are the ASCII characters "GIF89a". The first part of a file is often referred to as the "header" and the header of any GIF file is set to this sequence of ASCII characters. There are many file types that can be identified by unique header values like "GIF89a".

But What is a File Really? Bits and Bytes¶

So you may have heard that computers work on zeros and ones, somewhere deep-down, internally. The simplest way to describe a file is as a long sequence of bits (zeros and ones) that has been written to some kind of medium, such as a magnetic hard drive. Generally, we step back from these minute "bits" and talk about "bytes". A byte is a sequence of 8 bits and that is generally the smallest unit of information computers work with. A sequence of bytes, whether held in a file or delivered over a network, is often called a "stream" or a "byte stream". Sometimes you may even see the term "octet-stream". Don't let this throw you. An octet is just another word for byte. Because each byte is composed of 8 bits, a byte is really an 8 digit binary number, such as 11011010. The eight digit binary number in a byte can represent the range of binary values from 00000000 to 11111111 and everything in between. There are 256 possible combinations of eight zeros or ones (2⁸). We start with 00000000 which naturally represents the ordinary decimal value "0". On the higher side, 11111111 represents the decimal value of 255. So in ordinary decimal terms, a byte is a number between 0 and 255.

Let's take a look back at one of our GIF files and instead of printing each "byte" in Python's notation, we'll print the decimal value of each byte.

In [3]:

with open("samples/Website Concerning Career Opportunities/backblue.gif","rb") as file:
    integers = list(file.read(10))
    print(integers)

[71, 73, 70, 56, 57, 97, 245, 1, 200, 1]

So now you can see the list of ten decimal (integer) values that make up the first ten bytes of our file. They are just numbers. One difference that we should note is that a byte is always eight bits long, no matter which value is encoded in those eight bits. So we may want to write the number three into a byte, which may be written in just two binary digits as 11. However, when we write the decimal value 3 into a file, it takes up the whole byte and is written as 00000011. The bits are padded on the left side with zeroes. The full eight bits are always used and this is so that we know where one byte value ends and the next byte value begins.

Encoding and Decoding Text¶

Okay, so we have one of 256 values in each byte. How do we go from there to a character like the "G" in "GIF89a"? For the text in this file we can look at an ASCII chart and find the letter that corresponds with the decimal value of 71 above. You may want to spend a few minutes glancing at the ASCII chart to "decode" the decimal values we extracted from the file in the output above. Computers do this work for us routinely and very fast, but doing this helps us understand in depth how text files are written. ASCII is one of many possible text encoding systems or "encodings". Basic ASCII only uses the "lower half" of the byte values, between 0 and 127. However, there is an extended ASCII character set that makes use of the full range from 0 to 255 and that extended character set includes many more symbols and multilingual characters.

We saw above a series of text characters within the header bytes of a GIF file. Those characters had to be "encoded" as bytes values, i.e. numbers, so that they could be written into the data file. In the case of plain ASCII characters, the text characters range between the numbers 32 and 126, these include the uppercase and lowercase latin alphabet, arabic numerals, and the most commonly used symbols and punctuation. ASCII also allows us to encode non-printable characters that represent things like carriage returns or end of line markers, even bell tones. There are historical reasons for all of these characters and not all of them are used much today. You can look up any ASCII value by consulting an ASCII chart.

Microsoft Windows computers for many years used a different text encoding standard by default, called Windows 1250, which mostly supports western languages using latin-based alphabets. Other alphabets were supported by a long list of separate encoding standards, say for Cyrillic (KOI8-R and others) and Simplified Chinese (GBK, HZ, and others). These text encodings may still be found in files from previous decades and certain software programs, but by and large these standards have been eclipsed by a big new text standard, call Unicode. Unicode discussion began in the late 80s and the Unicode 1.0 standard was finalized in 1991. Unicode includes characters from most alphabets in one large standard. Instead of using just one byte to encode each character, Unicode uses two bytes, i.e. 16 bits. A series of 16 bits can have 65,536 (2¹⁶) possible values, which is why Unicode can encode texts that use so many different characters. This notebook is written in Unicode, not ASCII, which is why we can seamlessly use characters such as Я (Cyrillic), ⻝ (Han), and カ (Katakana). Unicode even includes ancient writing systems, such as Phoenician (𐤇). Unicode also supports right-to-left and vertical writing systems. You can usually explore the full Unicode character set via an operating system utility, typically called the character map. You can also consult various Unicode character reference charts online.

Binary Files, Text Files, and Formats¶

Not all files are composed of text encoded as bytes. Many files encode other kinds of data, such as images or sounds. You may hear programmers refer to these as binary files, as opposed to text files. Binary files use the same 8-bit bytes to store information, but different formats use those bytes in widely different ways. Some binary files even contain software programs and these are composed of instructions that are ready to be run on your computer. The way that a software program or even a person encodes their data into the byte stream of a file is called the file format and there are thousands of them.

Example Format: 8-bit Graphics and the Device Independent Bitmap Format¶

You may recognize the term 8-bit graphics if you have ever played "retro" video games. 8-bit graphics use one byte to represent the color of each pixel that makes up an image. We already know that a byte can hold 256 different possible values. Much like the ASCII text encoding, we can read a byte that represents a pixel into a decimal number. We then go and look up the color for the pixel in a chart containing 256 colors instead of text characters. 8-bit Bitmap image files include a long list of these pixel values, one per pixel that makes up the image. Unlike text files, bitmap files also package their "color lookup chart" within the same file. Near the beginning of most Bitmap files there is a list of colors, called the "indexed color palette". Each color definition typically occupies 4 bytes. The first 3 bytes encode 256 levels of red, 256 levels of green, and 256 levels of blue. So you can have any of 256 levels for each color component (RBG) to create a color. So, to sum up, when the computer reads a new pixel value in 8-bit images, it reads a number between 0 and 255. Let's say this pixel's color value is 230. Then the computer will "decode" the color 230 by looking at the 230th entry in the indexed color palette byte area. There are plenty of variations of this style of image format, but an indexed color palette is a very common feature of image formats. You can read the detailed specifications of most standard file formats on Wikipedia or in their official specifications.

Format Identification¶

We've been discussing the structure of files and looked into some specific formats for storing text and images. Now let's turn to the more general question of how a file format can be detected.

File Extensions¶

Typical files have names that include a dot and a three letter or longer "file extension". You may recognize the ".docx" extension as the modern file extension for Microsoft Word files. Your computer uses the file extension is a hint as to the file format and which software applications may be used to open that format. Your operating system manages a registry of file extensions and the installed software it can use to open each type of file. Sometimes we have to edit this registry when we install new software, but usually the registry is edited for us as part of the software installation process. So file extensions are a useful hint about a file's format and they can be associated with specific software on any particular computer. However, nothing really prevents us from renaming files and changing their file extension, say from ".docx" to ".gif", or ".some_new_extension". This creates some difficulties, because the format within the file will no longer match the file extension. The operating system will get confused and try to open such files with the wrong software.

Exercise: Change a File Extension¶

Go to the File Explorer in your operating system and make a copy of any office document. Now rename the copy, changing the extension from something like ".docx" to ".pdf", for instance. What happened to the file icon? What happens when you double-click on the file? Has the byte stream contents of the file changed at all?

In digital preservation the situation is more complicated than a single computer user. As we collect more and more files, from a variety of users, computers and operating systems, we must deal with a wide variety of file formats. Some of these file formats even share the same file extensions! Yes indeed, there is no governing body that decides on file extensions or limits their use. Anyone can invent their own file extension at any time and they often do. So the file extension is a great ~hint~ for the file format, but it is not sufficient.

Sub-Formats, Versions, and Variations¶

There are nine different versions of the PDF format, the first eight (1.0 to 1.7) were created by the Adobe corporation and then 2.0 was created by the ISO organization. In addition, there are several Adobe extensions to the format that only their software supports. There are also numerous sub-formats within some of these versions that are particular to the software used to create them, such as Microsoft Exchange, or for a particular purpose, such as PDF/E for engineering and PDF/A for archiving. The PDF format is not particularly unique in this proliferation of different versions and variations. Even when a file extension can tell us the major format, it usually cannot tell us the specific format version. The impact of particualr format versions is significant. A PDF file with Adobe's extensions may be unviewable in the non-Adobe software used for archival access.

Overlapping Formats¶

Often files are compliant with several file formats at the same time. The most common example of this is the ZIP format. All recent Microsoft Office files are also ZIP files. If you rename a MS Office file, like ".docx" to ".zip", then you can open that file as a ZIP archive and look at the contents. In a way MS Office documents are folders that are bundled up into compressed ZIP files. Similarly, there are many formats that are also compliant with the text/plain MIME Type. The most generic format of all is application/octet-stream, which simply means that the file contains a stream of bytes.

Exercise: Unzip a MS Office Document¶

Let's unpack a modern MS Office doc to examine the contents. The code below will open the included test.docx file and list all of the files zipped inside of it. Run that code cell and then look at the resulting list of files. If you download this file to your operating system and rename it to test.zip, you can look at these files individually.

In [1]:

from zipfile import ZipFile
with ZipFile('test.docx', 'r') as myzip:
    for x in myzip.namelist():
        print(x)

_rels/.rels
docProps/core.xml
docProps/app.xml
docProps/custom.xml
word/_rels/document.xml.rels
word/document.xml
word/header2.xml
word/styles.xml
word/header1.xml
word/theme/theme1.xml
word/numbering.xml
word/fontTable.xml
word/settings.xml
[Content_Types].xml
word/media/image2.png
word/media/image1.png

By reusing the ZIP format, Microsoft has made a strategic decision not to invent a new bundling format. This helps us in the field of digital preservation, since we can use all of our existing ZIP tools to access this format.

PRONOM Format Repository¶

PRONOM is an information service provided by The National Archives of the UK. PRONOM provides a registry of data file formats the software applications that support them. You will want to read a more detailed description of PRONOM on the National Archives website.

So PRONOM holds details about file formats and is an invaluable resource for those who need to identify formats. Here is their format information about the GIF image format: https://www.nationalarchives.gov.uk/PRONOM/Format/proFormatSearch.aspx?status=detailReport&id=620

It describes the format, it's origins, and any other versions of the format. The documentation tab contains links to any authoritative documents or specifications for the format, such as this 1989 GIF specification: http://www.w3.org/Graphics/GIF/spec-gif89a.txt

Most important for us to understand for identification purposes, is the "Signatures" tab. This area describes in detail the signature characteristics of GIF files and is divided into those signatures found external and internal to the byte stream stored inside the file. External signatures are typically the common file extensions for a format. Internal signatures are patterns of byte values that indicate a format, such as this internal signature for the GIF(89a) format:

Byte sequences:

Position type: Absolute from BOF
Offset: 0
Maximum Offset: 0
Byte order:
Value: 474946383961

Position type: Absolute from EOF
Offset: 0
Maximum Offset: 4
Byte order:
Value: 3B

If we look at the first sequence, we see that the position type is absolute BOF, which stands for the "beginning of the file". So this sequence is positioned at "offset" 0 from the beginning of the first, i.e. these are the bytes at the very beginning of the file's byte sequence. The byte value is given in hexidecimal pairs, which is bound to create confusion for all of us who are used to decimal values, i.e. all of us. However, when we look back at our ASCII code chart, we can see that it conveniently also lists the hexidecimal values for ASCII characters. So let's use that to decode these "hex" values to ASCII text:

47 -> G
49 -> I
46 -> F
38 -> 8
39 -> 9
61 -> a

You can perhaps see where this is going. We see in PRONOM the same sequence of bytes that we read from our own GIF file earlier. We have just manually verified the first internal signature on our local GIF file!

Now let's take a look at the second internal signature. It's position is absolute from the EOF or end of file. It has an offset of 0 and a maximum offset of 4, so the sequence should appear in the last four bytes of our GIF. The value given is 3B (hex). Let's sample the last few bytes of our GIF file and see what we find.

In [2]:

import os
with open("samples/Website Concerning Career Opportunities/fade.gif","rb") as file:
    file.seek(-4, os.SEEK_END)  # skipping to the last 4 bytes..
    bytes = file.read(4)
    print(bytes.hex())

2d20003b

When you run the code above, it opens our GIF file and seeks to the position 4 bytes from the end. It then reads the last four bytes and prints them our as their hexidemical values. We can see that the final sequence of bytes in the file matches with the PRONOM signature. (Upper and lower case are not significant in hex values.) So now we have manually verified both internal signatures for our GIF file. We can make a well informed guess that this file is in fact a GIF. The file could still be some kind of fake GIF file, but those sequences are very unlikely to appear in a random non-GIF file without malicious intent. In other words, someone can easily create a file that matches a PRONOM signature and isn't a GIF file in other ways, but that would be the exception. That leads to discussions of virus scanning and related topics that safeguard archives from disseminating viruses and other malicious files, but that is beyond our scope here.

DROID (Digital Record Object Identification)¶

DROID is a tool created by the UK National Archives to identify the format categories that apply to a collection of files. When you run the DROID tool on your files, it looks for internal signatures in PRONOM that match each file's bytes and then returns a report of the format categories matching each file. DROID has both a graphical user interface (GUI) and a command-line tool. The resulting reports are commas-separated values (CSV) files.

DROID is a Java application that doesn't run in Jupyter notebooks. However, in the folder with this notebook you will find export.csv. This is a DROID report on the formats identified for our two test files, test.docx and fade.gif. You can view export.csv in Jupyter by double-clicking on it.

Exercise: Find PRONOM Format Details for text.docx¶

Open up the export.csv DROID report and get the format that was identified by the tool for test.docx. Find this exact format on the PRONOM website. How was this format identified?

Summary¶

In this notebook we have shown the underlying structure of digital files and demonstrated how both text and images may be encoded into a file's byte stream. We have shown that a file's name and extension are hints about the file's technical contents, but that these hints are not enforced by any technical means. We have discussed formats and how formats may be described in specifications and identified through signature bytes at certain locations in their byte stream. Finally we have discussed some tools that are widely used to perform format identification.

In [ ]: