I am amazed that programming languages (well, the typical ones, at least) don’t make it easier to manipulate files.
A common way to read files in C is to declare a struct that matches the file format and call fread to read the file directly into it. Isn’t that easy enough?
Not really. This approach is fine in isolation, but it’s non-portable:
- Different architectures or compilers may lay out structs differently. A compiler may insert padding bytes between members to satisfy alignment requirements. Luckily compilers aren’t allowed to do it willy-nilly, and some compilers offer #pragmas to control this.
- Different architectures have different integer sizes. Appropriate typedefs can often mitigate this, but it’s still imperfect since it requires a small porting effort.
- Different architectures use different endianness. If a file format is defined to store integers in big-endian byte order but your architecture is little-endian, then reading an integer out of the struct without first swapping its bytes yields the wrong value.
The typical way to solve these problems is to read a file a byte at a time, copying each byte into the appropriate location within the struct. This is tedious.
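To give a feel for the tedium, here is a minimal sketch of that approach. It assumes a hypothetical 12-byte header (one version byte, three reserved bytes, then two big-endian 32-bit integers) that has already been read into a plain byte buffer; the names header, load_be32, and parse_header are mine, chosen for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { HEADER_SIZE = 12 };  /* assumed on-disk size of the header */

/* Native, in-memory form of the header. Its layout no longer matters,
   because nothing is ever read from disk directly into it. */
struct header {
    uint8_t  version;
    uint32_t num_elements;
    uint32_t data_offset;
};

/* Assemble a big-endian 32-bit integer from four bytes. Shifting by value
   gives the same result whatever the host's byte order is. */
uint32_t load_be32(const unsigned char *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Decode the raw header bytes into the native struct, field by field. */
void parse_header(const unsigned char buf[HEADER_SIZE], struct header *out)
{
    assert(buf != NULL && out != NULL);
    out->version      = buf[0];
    /* buf[1] through buf[3] are reserved and skipped. */
    out->num_elements = load_be32(buf + 4);
    out->data_offset  = load_be32(buf + 8);
}
```

In use, you would fread HEADER_SIZE bytes into an unsigned char array and hand that to parse_header. Every field needs its own line with a hand-computed offset, which is exactly the kind of bookkeeping a compiler could do instead.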
Programming languages should provide a mechanism for programmers to declare a struct that must conform to some external format requirement. Programmers should be able to annotate the struct with attributes that prohibit implicit padding bytes and specify the size and endianness of each field. For example:
file_struct myFileFormat
{
    uint8 version;
    uint8[3]; // Reserved.
    uint32BE numElements;
    uint32BE dataOffset;
};
When retrieving fields from such a struct, the compiler should generate code that automatically performs the necessary byte swaps and internal type promotions.
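Until a language offers something like this, the closest approximation I can think of in plain C is to keep the raw bytes in the struct and convert on access. The following is only a sketch of what such generated code might look like, not a real language feature; my_file_format, get_be32, and the mff_ functions are hypothetical names, and the size check relies on C11's _Static_assert.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Raw on-disk layout. Every member is an array of unsigned char, so no
   member has an alignment requirement and compilers in practice add no
   padding. The static assertion below verifies that assumption. */
struct my_file_format {
    unsigned char version[1];
    unsigned char reserved[3];
    unsigned char num_elements[4];  /* big-endian */
    unsigned char data_offset[4];   /* big-endian */
};

_Static_assert(sizeof(struct my_file_format) == 12, "unexpected padding");

/* Copy raw file bytes into the struct without interpreting them. */
void mff_load(struct my_file_format *f, const unsigned char *bytes, size_t len)
{
    assert(len >= sizeof *f);
    memcpy(f, bytes, sizeof *f);
}

/* Convert four big-endian bytes to a native integer. */
uint32_t get_be32(const unsigned char b[4])
{
    return ((uint32_t)b[0] << 24) | ((uint32_t)b[1] << 16) |
           ((uint32_t)b[2] << 8)  |  (uint32_t)b[3];
}

/* Accessors standing in for what the compiler would generate on each
   field read: the byte swap and the promotion to a native type. */
uint8_t mff_version(const struct my_file_format *f)
{
    return f->version[0];
}

uint32_t mff_num_elements(const struct my_file_format *f)
{
    return get_be32(f->num_elements);
}

uint32_t mff_data_offset(const struct my_file_format *f)
{
    return get_be32(f->data_offset);
}
```

The struct can be filled straight from the file, and callers only ever see native values through the accessors. The cost is that all of this is written and maintained by hand, once per field, which is the boilerplate the proposed declaration would eliminate.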