A Simple, Flexible File Format Method

TAD

Introduction

I remember reading about file format design in a previous Hugi, and having just written the .GIF loader article I thought I would describe some of the nice attributes of the old, and sadly underused, IFF file format. Anyone who wants to create their own fast and flexible file format should find this article useful.

It's not ground breaking, but it should still be useful to know about.

The problem

Most files these days need to contain many different data items of different sizes and types. It's not only archive programs that need to store a variety of data objects; trackers, players, demos, editors and games all need to store and access data quickly. The lazy solution would be to create hundreds of files and folders, leaving all the hard work to the Operating System. The problem with that idea is that there is a greater chance of a file being deleted or lost when copying/moving them around. The advantage of having a single, large file is that it is easy to move around and access, and far less prone to go missing.

But the disadvantage of having a single file is that of storing different items and accessing them quickly. Another consideration in your file format design is expandability: how easy it is to add new data types, and what effect this would have on other programs which do not support these new data items.

In the past hardly anyone bothered to plan their file format structure, and the result was that only the program which created the file could read it back in. This was clearly horrendous. Without having access to either the coder or their source code you simply couldn't access the data trapped inside the file. Anyone who has coded a MOD player will realise how bad some file formats were/still are. In the worst case a block of memory was simply dumped to disk by a program, including its highly cryptic and version-dependent variables. Even if the original programmer decided to code a new version of their software, they usually didn't bother to support their own, older file formats.

IFF I had a hammer

Most of the problems concerning different file formats were elegantly solved by Electronic Arts and others with their IFF (Interchange File Format).

I will briefly describe the methods it employed to give a neat, flexible and "future proof" format which can easily be used in your own productions today, then expanded in the future without any hard work.

Basic Requirements

What we need from a file format is:

1. Easy identification
It's no good relying on a filename or extension these days. Also the file size shouldn't be used, unless there is NO other route to identification.

2. Seekable/Random access
This means we can extract only the data item we want to use without having to decode the entire file.

3. Future proof/Backwards compatible
Even with the best design and planning you will probably think of something later on which you didn't imagine when creating the original file format. Also, if possible, allow different programs with different needs and versions to access the old data while ignoring the new, unsupported items.

4. Error proof
This is a simple case of using a CRC algorithm, or just totalling up all the bytes in a block of data and checking the sum against the stored checksum/total. If the data stored inside your file format is highly sensitive to mistakes, errors or corruption (such as code, or data records) then this is a MUST. Any data compression scheme should really consider a good CRC method.
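To make the idea concrete, here is a minimal sketch in C of the "totalling up" approach. A real CRC (such as CRC-32) catches far more error patterns, but the principle is the same: compute a value over the block when saving, store it, and compare on loading.

```c
#include <stddef.h>
#include <stdint.h>

/* A minimal additive checksum: simply total up every byte in the
   block. On saving, store the result after the block; on loading,
   recompute it and compare against the stored value. */
uint32_t block_checksum(const uint8_t *data, size_t len)
{
    uint32_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum += data[i];
    return sum;
}
```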

Identification

The easiest way is to place a short ASCII string at the very beginning of your file. You may want to place the name of the program which created the file together with the identification string. It is also a good idea to place an EOF (1A hex) character at the very end of this header string; this way, if the user tries to view or print the data file then only the header string will be printed, not a huge 10 meg file of unprintable garbage flashing down the screen.

Some coders prefer "magic" numbers like 12345678 hex or 5555FFFF hex or even DEADFACE hex which are much easier to check for (a single CMP), but using an ASCII string makes identification easier for other people.
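As a quick sketch in C (the program name and version text below are made up for illustration, not any real standard), checking such a header string is just a memcmp against the first bytes of the file:

```c
#include <stddef.h>
#include <string.h>

/* The identification string, ending in the EOF (1A hex) character so
   that DOS-era TYPE/print commands stop after the header text.
   "MYPROG datafile v1" is a made-up example name. */
static const char MAGIC[] = "MYPROG datafile v1\x1A";

int has_signature(const unsigned char *filedata, size_t filelen)
{
    size_t n = sizeof(MAGIC) - 1;   /* exclude the C string's NUL */
    return filelen >= n && memcmp(filedata, MAGIC, n) == 0;
}
```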

In the IFF file format a short string of 4 characters was used. E.g.

                -------------------------------
        +0                 "F" = 46 hex               Signature "FORM"
        +1                 "O" = 4F hex
        +2                 "R" = 52 hex
        +3                 "M" = 4D hex
                -------------------------------
        +4                 "I" = 49 hex               Type of image
        +5                 "L" = 4C hex               (ILBM = InterLeaved
        +6                 "B" = 42 hex                BitMap image).
        +7                 "M" = 4D hex
                -------------------------------

Random access

Now this is the real beauty of the IFF file format for me. Each data item in the file is preceded by a standard 8-byte header structure.

Each header consists of 2 dwords. The first is an ASCII string such as "BODY", "SND8", "BMHD", "TEXT", "VID8" etc., and the second dword is the length of the data block which immediately follows the header.

E.g.

                -------------------------------
        +0                 "B" = 42 hex               Signature "BODY"
        +1                 "O" = 4F hex
        +2                 "D" = 44 hex
        +3                 "Y" = 59 hex
                -------------------------------
        +4                  Length                    DWORD
        +5                    of
        +6                 following
        +7                   block
                -------------------------------

So what? It doesn't look that clever to me.

In fact that simple 8-byte header allows you to quickly seek within a file to find a particular data type by following the chain of headers and also allows some future expansion of the file format without making it incompatible with older programs.

Future proof

The reason why older programs can still read and use parts of a newer file is that 8-byte header. It should be mentioned here that the parts of an IFF file were generally in order, but they didn't have to be. The way to read these fragmented files is to sit inside a loop, checking each signature and either skipping each block or processing the data as required.

For example a tracker file reader could look like this:

        FilePos = 0

   DO
                Seek(FilePos)

        ;; Read the 8-byte header ;;

                Signature = Read(4)
                BlockLength = Read(4)

        ;; Now process each "known" chunk ;;

                IF Signature = "SAMP" then Read_Samples()
                IF Signature = "INST" then Read_Instruments()
                IF Signature = "PATS" then Read_Patterns()
                IF Signature = "SEQU" then Read_Sequence()

        ;; skip this block - this deals with "unknown" or future signatures ;;
        ;; (8 is added to step over the header itself) ;;

                FilePos = FilePos + 8 + BlockLength

   UNTIL FilePos >= FileLength

So the "Signature" allows the loop to filter out any unknown/not needed file chunks. It means that a huge, complex file with hundreds of different data items can be accessed by different programs with different needs. For example a sample editor would only need to handle the "SAMP" (sample) data; all the other chunks can be ignored by skipping past each section using the length field in each header.

Quick, flexible and easy to implement, isn't it?
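The same loop in C might look something like the following sketch. The chunk names are the ones from the example above, the handlers are just counting stubs, and the length dword is assumed here to be stored low byte first (Intel order); the real IFF standard actually stored its lengths high byte first.

```c
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Read a dword stored low byte first (an assumption for this sketch;
   IFF proper used high-low order). */
static uint32_t read_u32(FILE *f)
{
    unsigned char b[4];
    if (fread(b, 1, 4, f) != 4)
        return 0;
    return b[0] | (b[1] << 8) | ((uint32_t)b[2] << 16) | ((uint32_t)b[3] << 24);
}

/* Walk the chain of 8-byte headers, handling known chunks and
   skipping unknown ones by their length field.
   Returns the number of known chunks found. */
int read_chunks(FILE *f)
{
    char sig[4];
    int known = 0;

    while (fread(sig, 1, 4, f) == 4) {
        uint32_t len = read_u32(f);
        long body = ftell(f);           /* start of the chunk's data */

        if      (memcmp(sig, "SAMP", 4) == 0) { known++; /* Read_Samples()     */ }
        else if (memcmp(sig, "INST", 4) == 0) { known++; /* Read_Instruments() */ }
        /* ...any other signature is simply unknown to this program... */

        /* known or not, skip to the next header using the length field */
        if (fseek(f, body + (long)len, SEEK_SET) != 0)
            break;
    }
    return known;
}
```

Note that the seek uses only `ftell` plus the stored length, so a reader written today will still step cleanly over chunk types invented tomorrow.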

Inside the data chunks

Now we have a quick method to navigate our way to each chunk, but here is where the IFF format falls down a little bit. Most chunks in the IFF file formats were stored as raw, binary data and each had its own unique data structure. This kinda defeated the whole point of the 8-byte headers, that is, to make it flexible and future proof.

Endian Formats

If you want your file format to be used on different CPUs (like the 68xxx) then you will probably need to address the low-high and high-low endian-ness of numbers. But how?

The number 12345678 hex stored on a 68000 machine is 12 34 56 78 hex (high-low order), but on an Intel machine it is 78 56 34 12 hex (low-high order).

You could use a byte to denote which CPU/machine number format has been used for a data chunk, but then you need to define standards and tell other people what the next "free" ID code is which they can use.

A simple solution would be to store the number 12345678 hex itself and then to detect the mapping of each of the 4 bytes. The advantage of using 12345678 hex instead of a token byte is that you can deduce the endian-ness with a small loop, and you can even see it by looking at the data in a hex editor.
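A sketch of that detection in C: compare the 4 magic bytes as they appear in the file against how the host CPU lays out 12345678 hex in memory. All 4 byte values are distinct, so any mismatch in ordering is detected.

```c
#include <stdint.h>

/* Returns 1 if the file's stored 12345678 hex magic value has the
   same byte order as this machine, 0 if it needs byte-swapping. */
int file_is_same_endian_as_host(const unsigned char magic_bytes[4])
{
    uint32_t host = 0x12345678;
    const unsigned char *h = (const unsigned char *)&host;

    return h[0] == magic_bytes[0] && h[1] == magic_bytes[1]
        && h[2] == magic_bytes[2] && h[3] == magic_bytes[3];
}
```

If this returns 0, the loader simply reverses the bytes of every multi-byte number it reads from the file.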

Floating Point formats

This is a common problem in the 3-D modelling world; even with standards for representing floating point numbers there is NO single, clear method of storing them in a file.

Most packages allow data and numbers to be saved in an ASCII format. This means that not only can many different processors read/write floating-point numbers by scanning an ASCII string and converting it into the host platform's format, but you can also edit these ASCII files using any half decent text editor.
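For example, a hypothetical "v x y z" vertex line (similar in spirit to the ones some 3-D formats use) can be scanned with nothing more than sscanf, letting each platform's C library do the conversion into its own native float format:

```c
#include <stdio.h>

/* Parse a hypothetical ASCII vertex line of the form "v x y z".
   Returns 1 on success, 0 if the line doesn't match. */
int parse_vertex(const char *line, double *x, double *y, double *z)
{
    return sscanf(line, "v %lf %lf %lf", x, y, z) == 3;
}
```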

Closing Words

Ah well, that's another article done. The IFF method of using an 8-byte header structure for each block can be extended to use an 8- or 16-byte signature string without too much effort. This would be a good idea because the 4 characters available in a dword aren't enough to describe most things. Soon you start getting signatures like "SXK1" or "MO32" for example, which are totally meaningless to other people; you might as well use a token byte.

I would guess that the ASCII file format method will continue to be used in the near future instead of a raw/binary block-based method. Take a look at HTML and all the other web-based languages and data files. Soon parsing ASCII strings might become more common than reading binary files, especially where data needs to be shared between applications and different CPU types.

But, for certain applications (compressed data-streams for example) a raw binary data stream is perhaps the only solution.

The problem of inventing a file format is a difficult one. Do you go for a fast memory dump which suits only your program, or for a more flexible and easy to read format which is much slower?

Happy loading.

Regards

TAD #:o)