Recommended file formats for keeping your digital archive readable

Revision by Bart Magnus, 24 Nov 2022, 13:34

If your digital archive is properly backed up and/or you save everything in the cloud, then you still have all your digital files. But are you sure you can still open them? Hopefully, you have your poster in a format other than the PageMaker file from 1994, because there's no suitable software available for it anymore. Yes, you read that right: digital archives don't preserve themselves.

The problem of digital obsolescence

Digital obsolescence is when a file is so old that the software for opening it is no longer available, unless you resort to some (time-intensive) digital archaeology. And even if there is still software available for it, there's a strong chance that later versions will display files differently from older versions.

Software durability is determined by:

  • the extent of backward compatibility: a new version of the software might not be able to properly read files from older versions;
  • the complexity of the software: the more complex the software, the harder it is to guarantee backward compatibility;
  • its distribution in the market or community: a large market means more software for reading files;
  • its open documentation: if the source code is available, programmers can continue to develop the software to read the file format. Using open file formats reduces the risk of being reliant on particular technologies or providers.

The file format determines how the information is coded in a computer file, and is usually indicated by the extension in the file name. A codec is a piece of software or hardware that allows data to be coded or decoded, or compressed or decompressed. You can use DROID to gain an overview of the file formats in your digital archive.
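As a rough first pass before running a dedicated tool, you can count the file extensions that occur in a folder. The sketch below is only an approximation: it trusts the extension in the file name, whereas DROID identifies formats from the file contents. The function name `count_extensions` is a hypothetical helper, not part of DROID.

```python
from collections import Counter
from pathlib import Path


def count_extensions(root):
    """Count files per lower-cased extension below `root`, recursively.

    Assumption: the extension is a usable hint for the format. Renamed
    or extension-less files need a content-based tool instead.
    """
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            counts[path.suffix.lower() or "(none)"] += 1
    return counts
```

Calling `count_extensions("my-archive").most_common()` lists the most frequent extensions first, which makes old or unusual formats easier to spot.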

Other risks

Compression can be an issue for image and video files. Photos are widely saved in JPEG format, for example, which uses lossy compression: part of the image information is discarded to make the file smaller. You can't see this with the naked eye at first, but it leads to problems when migrating the photo to a new format or editing it in image editing software such as Photoshop, because every new save loses a little more quality.

Also take into account the issue of files that refer to each other. An InDesign file, for example, does not contain the images, but links to the images which are stored elsewhere on your drive. This link is lost when the files are moved.

How do you choose the right file format?

Keeping a digital archive readable is essentially the continuous migration of old files to current file formats (which we call a 'migration strategy'), or emulating an old computer environment on the current setup, so that the old software can still work (which we call an 'emulation strategy').

Both strategies become very complex over time, and are often only implemented by specialists. As an artist or arts organisation, it's best to focus primarily on choosing an open and well-documented file format when creating your document. That's the best guarantee for ensuring your digital archive remains readable in the long term. You could also bet on more than one horse, for example by saving images or PDFs of complex 3D models. Secondly, you can check whether there are any potentially 'at-risk' files among your existing digital content. If there are, then please feel free to contact one of the partners in the TRACKS network for more tailored advice.

Below is an overview of tips for each file type.

Word processing documents

Examples: DOC, DOCX, ODT, TXT, RTF

Word processing documents are best saved in ODT, or PDF if the document no longer needs to be modified. It's easy to save documents as ODT or PDF files from within Word. In the latter case, do not choose to print to PDF as this is lower quality than the 'share' or 'export' option. Always select PDF/A as the PDF archiving profile, which is available in Word in the PDF save settings. Saving files in the latest version of Word (DOCX) in their original format is not an ideal solution, even though the risks are currently very low.

ODT

ODT (OpenDocument Text) is the open-standard counterpart of DOC and DOCX. As an open format for formatted text, it is the preferred option.

PDF

PDF files can simply be saved (in the medium term) in PDF format. If possible, make sure that every PDF created within the organisation is saved in a PDF archiving profile (preferably PDF/A, or PDF/E for architectural drawings).

Raster images

Examples: TIFF, JPEG, GIF, PNG, PSD, BMP

A raster image or bitmap is an image in digital form, with the colour set for each pixel. The disadvantage of a raster image is that individual pixels become visible when the image is magnified. Bitmap software is available for editing raster images. The counterpart of a raster image is the vector image.

A photo taken with a digital camera is one example of a raster image: the camera's image chip records the scene as a raster of pixels.

TIFF

TIFF is generally recommended as a durable storage format for raster images. It is best not to use compression for images. Indeed, (lossy) compression results in a loss of quality when editing images. You should therefore make sure that photos with artistic value, used for communication and presentation, are delivered and saved in uncompressed TIFF format.

There are various TIFF profiles. Uncompressed baseline IBM TIFF v6.0 is considered to be the most durable. Make sure that an RGB profile is used as the colour space, if possible Adobe RGB or eciRGB v2. It's also best to create equivalent TIFF versions of Photoshop files, but keep the original file with layer information if you want to edit it further.
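To check whether an existing TIFF is actually uncompressed, you can read the Compression tag (tag 259) from the file header. The sketch below uses only the standard library and looks at the first image in a classic TIFF file; the function name `tiff_compression` is a hypothetical helper, and BigTIFF files or multi-page files would need more work.

```python
import struct


def tiff_compression(data: bytes) -> int:
    """Return the Compression tag value of the first image in a TIFF.

    1 means uncompressed; other values (e.g. 5 for LZW, 7 for JPEG)
    mean the pixel data is compressed.
    """
    if data[:2] == b"II":      # little-endian byte order
        endian = "<"
    elif data[:2] == b"MM":    # big-endian byte order
        endian = ">"
    else:
        raise ValueError("not a TIFF file")
    magic, ifd_offset = struct.unpack(endian + "HI", data[2:8])
    if magic != 42:
        raise ValueError("not a classic TIFF file")
    (entry_count,) = struct.unpack_from(endian + "H", data, ifd_offset)
    for index in range(entry_count):
        entry = ifd_offset + 2 + 12 * index   # each entry is 12 bytes
        (tag,) = struct.unpack_from(endian + "H", data, entry)
        if tag == 259:
            # A single SHORT value sits in the first 2 bytes of the value field.
            (value,) = struct.unpack_from(endian + "H", data, entry + 8)
            return value
    return 1  # TIFF 6.0 default when the tag is absent
```

For a file on disk, pass the file's bytes, for example `tiff_compression(Path("scan.tif").read_bytes())`, where `scan.tif` is a placeholder name.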

JPEG

It's fine to use JPEG files for photos created for the purpose of documenting an exhibition or public event, but don't use any exotic or obsolete formats such as BMP (Bitmap).

PNG

PNG is an open image format that uses lossless compression (so no image information is lost). PNG is used for high-quality online publications and presentations, and logos and graphics.

2D Vector images

Examples: AI, SVG, EPS

A vector image is a graphical representation composed of simple geometric objects such as points, lines, curves, polygons, etc. Complex forms are created by combining these more elementary shapes. The objects' formulas describe the images, so vector images can be enlarged to any desired format without any loss of quality. This is in contrast to a raster or bitmap image, in which individual pixels are coloured in separately on the digital canvas. This means the resolution for the chosen scale is fixed, causing the image to become blurred or chunky when enlarged.

The description of a vector image might say, for example, to draw a circle of a certain colour and size over a text. The absolute size of neither the text nor the circle is set, only the relationship between them. This flexibility means that vector graphics can be displayed at any size while the resolution (the information density) remains the same.

SVG

SVG is generally recommended as a durable file format for vector drawings, so always make sure you have an SVG equivalent of definitive vector images.
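The 'circle over a text' description given earlier can be written out as a few lines of SVG. The sketch below generates such a file from Python; the function name, coordinates and default colour are illustrative assumptions. Because the drawing only declares a `viewBox` and no fixed width or height, a viewer can scale it to any size.

```python
from xml.sax.saxutils import escape


def circle_over_text_svg(label, colour="crimson"):
    """Return a small SVG document: a text label with a circle drawn over it.

    All coordinates are relative to the 100 x 100 viewBox, so only the
    relationship between text and circle is fixed, not their absolute size.
    `colour` is assumed to be a plain CSS colour name or hex code.
    """
    return (
        '<svg xmlns="http://www.w3.org/2000/svg" viewBox="0 0 100 100">\n'
        f'  <text x="50" y="54" font-size="12" text-anchor="middle">{escape(label)}</text>\n'
        f'  <circle cx="50" cy="50" r="40" fill="none" stroke="{colour}" stroke-width="2"/>\n'
        '</svg>\n'
    )
```

Writing the returned string to a file with the `.svg` extension, encoded as UTF-8, gives a drawing that opens in a web browser or a vector editor.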

Text files

Examples: TXT

Text files can simply be saved as such, but note that text can be coded in different ways (e.g. ANSI, ASCII and UTF-8). Where possible, try to ensure that text files are coded in UTF-8.
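Re-encoding a text file as UTF-8 can be sketched as follows. The source encoding is an assumption you have to supply yourself: the default used here, `cp1252`, is what 'ANSI' usually means on Western European Windows systems, but a file from another system may need a different value. The function name `convert_to_utf8` is hypothetical.

```python
from pathlib import Path


def convert_to_utf8(source, target, source_encoding="cp1252"):
    """Write a UTF-8 copy of `source` to `target` and return the text.

    Files that already decode as UTF-8 (including plain ASCII) are
    copied unchanged; anything else is decoded with `source_encoding`,
    which is an assumption and not detected automatically.
    """
    raw = Path(source).read_bytes()
    try:
        text = raw.decode("utf-8")
    except UnicodeDecodeError:
        text = raw.decode(source_encoding)
    # Writing bytes keeps the original line endings untouched.
    Path(target).write_bytes(text.encode("utf-8"))
    return text
```

Writing to a separate target file keeps the original untouched, so you can compare the two before replacing anything in the archive.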

Presentation files

Examples: PPT, PPTX

These files can be saved (in the medium term) in their original format. PDF is more durable, however, so migrate completed presentations to this format. PPT files have already become outdated, so make sure you have equivalents in PPTX or PDF, and choose PDF/A.

Spreadsheets

Examples: XLS, XLSX, ODS

There is no comprehensive solution for spreadsheet files within the archive community, but XLSX and ODS are considered to be sufficiently durable. XLS is outdated. It is therefore recommended to identify important XLS spreadsheets in the archive and create an equivalent in ODS and XLSX.

Video files

Examples: AVI, FLV, MOV, MPEG-1, MPEG-2, MPEG-4, SWF, WMV

Long-term storage of video files is a job for specialists. When you order videos, however, you can require the providers to deliver them in durable formats. In principle, MKV is the most durable format for storing video. MXF, AVI and MOV are other durable formats. File formats for audio and video are simply containers for the audio and video streams, so it's important to determine how audio and video need to be encoded. FFV1 coding, which is lossless, is generally chosen within the archive and heritage sector. For audio streams, LPCM coding is recommended. Make sure that neither the audio nor the video stream uses lossy compression. This often results in large files (for FFV1: 45-50 GB per hour of video!), so use it primarily for valuable videos in which a lot has been invested.
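To get a feel for what these file sizes mean for your storage, you can do the arithmetic up front. The sketch below takes the 45-50 GB per hour mentioned above as its default range and assumes decimal units (1 GB = 10**9 bytes); the function names are hypothetical and real sizes depend on resolution and content.

```python
def storage_range_gb(hours, low_gb_per_hour=45, high_gb_per_hour=50):
    """Return the (lowest, highest) expected storage need in GB."""
    return hours * low_gb_per_hour, hours * high_gb_per_hour


def average_bitrate_mbit(gb_per_hour):
    """Average bitrate in Mbit/s for a given number of GB per hour.

    GB -> bits is a factor 8 * 10**9, one hour is 3600 seconds and
    one Mbit is 10**6 bits, which simplifies to the factor below.
    """
    return gb_per_hour * 8 * 1000 / 3600
```

For ten hours of video, `storage_range_gb(10)` gives between 450 and 500 GB, and 45 GB per hour corresponds to an average of 100 Mbit/s.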

Lower quality standards can be used for less important videos. The video codecs H.262 and H.264, for example, are widely used in MP4 files. You can read a good overview of sustainable video file storage at SCART.

Audio files

Examples: AC3, AIFF, MP3, WAV, WMA

Important audio files are best saved in WAV format. FLAC and AIFF are also durable formats. Use LPCM for the audio signal coding. MP3 can be used as a reference format or for less important audio files, e.g. to access via your website.
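To see how an existing WAV file is encoded, you can read its header with Python's standard `wave` module, which only opens PCM-coded WAV files and raises `wave.Error` for other content. The function name `describe_wav` and the keys of the returned dictionary are illustrative choices.

```python
import wave


def describe_wav(source):
    """Return the basic technical properties of a PCM WAV file.

    `source` may be a file name or a binary file object.
    """
    with wave.open(source, "rb") as wav:
        frame_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "bit_depth": wav.getsampwidth() * 8,   # sample width is in bytes
            "sample_rate": frame_rate,
            "seconds": wav.getnframes() / frame_rate,
        }
```

Storing this information alongside the file documents which sample rate and bit depth the recording was delivered in.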

Email files

Examples: PST, MBOX, MSG

Emails can be saved in different ways. If entire mailboxes are being saved, it's best to opt for the MBOX format. It is, however, recommended to also save important emails (with high informative value for a project) separately in the project dossier. EML format is best for this. Also always save attachments separately from the email. Gmail has functions for exporting emails or saving them in EML and MBOX. Outlook uses application-specific formats, such as PST and MSG, which are not durable. To save Outlook mailboxes, it is therefore best to use an email client like Thunderbird. (See the article on how to archive emails.)
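If you have a mailbox in MBOX format and want separate EML files for a project dossier, Python's standard `mailbox` module can split it. This is a minimal sketch: the function name and the numbered file names are hypothetical, it exports every message rather than a selection, and it does not extract attachments.

```python
import mailbox
from pathlib import Path


def mbox_to_eml(mbox_path, out_dir):
    """Write every message in an MBOX file as its own .eml file.

    Returns the list of files written, in mailbox order.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for index, message in enumerate(box):
            target = out / f"message-{index:05d}.eml"
            # as_bytes() omits the mbox "From " separator line by default.
            target.write_bytes(message.as_bytes())
            written.append(target)
    finally:
        box.close()
    return written
```

The original MBOX file is only read, so it can stay in the archive as the complete copy of the mailbox.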

Websites

Websites are essentially dynamic information entities that are constantly changing. This means you can only capture all a website's information by taking snapshots of it at regular intervals, much like the Internet Archive does (archive.org). Note: it is insufficient to rely solely on the Internet Archive because the snapshots produced by this service are rarely complete. It's also relatively easy to create snapshots yourself. A snapshot of a website is a 'static copy' of all its HTML pages together with all images, style sheets, etc. The system that the website runs on (often a content management system like Drupal or WordPress) is not archived along with it. The archiving format for websites is WARC. You can find strategies for saving websites in the article on how to archive websites.

How effectively you can archive a website often depends on the technology it uses. Flash code, for example, is very difficult to archive. You can measure how archivable your website is at archiveready.com. If you are developing new websites, try as far as possible to make sure they will be easy to archive later on.

Databases

Databases come in many forms and serve many functions. Archiving a database essentially means exporting the information it contains in a form that can be imported into a new database. This often involves Excel tables, CSV files or XML files, but other data files are also possible. It is important to document properly how the database was structured. The same remark applies here as for websites: build databases in such a way that the information can easily be extracted in forms that are simple to import into other databases.
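As an illustration of such an export, the sketch below writes one table of an SQLite database to a UTF-8 CSV file and returns the table definition, which can be kept as documentation of the structure. SQLite is only an assumed example system and `export_table_to_csv` is a hypothetical helper; other database systems have their own export tools.

```python
import csv
import sqlite3


def export_table_to_csv(connection, table, csv_path):
    """Export one SQLite table to CSV and return its CREATE TABLE statement."""
    # Table names cannot be passed as query parameters, so first check
    # that the table really exists in the database catalogue.
    row = connection.execute(
        "SELECT sql FROM sqlite_master WHERE type = 'table' AND name = ?",
        (table,),
    ).fetchone()
    if row is None:
        raise ValueError(f"unknown table: {table}")
    quoted = '"' + table.replace('"', '""') + '"'
    cursor = connection.execute(f"SELECT * FROM {quoted}")
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        # The first row holds the column names.
        writer.writerow([column[0] for column in cursor.description])
        writer.writerows(cursor)
    return row[0]
```

Saving the returned definition in a text file next to the CSV records the column names and types that the CSV alone does not carry.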

2D CAD

Examples: DWG, DXF, VWX, DGN

2D CAD files are best saved in a format that is widely used and easy to open. For 2D CAD drawings, this is usually DWG or DXF. Architects who do not use Autodesk products are advised to save drawings with an exchanged and published status in DWG or DXF. Make sure that files that refer to each other (such as xrefs or plot style files) are kept together (in AutoCAD, for example, this can be done with the eTransmit function). In many cases, 2D CAD drawings are also converted to PDF. Keep these PDFs. Not only do they have legal value, PDF is currently far more durable than any CAD format. At present, PDFs are usually created via the plot or print function. Programs such as AutoCAD and Vectorworks, however, can export drawings directly to PDF. PDFs created this way can contain more information, errors during PDF creation are less likely, and the drafter has more control over exactly which elements end up in the drawing. Choose PDF/A or PDF/E.

3D CAD

Examples: DWG, DXF, VWX, DGN, SKP, 3DM

CAD files are best saved in a format that is widely used and easy to open. For 3D CAD drawings, however, such a format is hardly available. You should therefore keep 3D models in their original format, but document the software and software version used to create the file, along with its system requirements. There are known cases of a 3D CAD file being displayed differently after a software version update. IFC is increasingly establishing itself as the industry standard for exchanging and publishing technical 3D models. IFC is openly documented and durable, but bear in mind that converting a 3D model to IFC always involves some loss.
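One simple way to document the software, its version and the system requirements is a small 'sidecar' file stored next to the model. The sketch below writes that information as JSON; the function name, the `.json` naming convention and the field names are all hypothetical choices, not an established standard.

```python
import json
from pathlib import Path


def write_sidecar(model_path, software, version, system_requirements):
    """Write `<model file name>.json` next to the model and return its path."""
    model = Path(model_path)
    sidecar = model.with_name(model.name + ".json")
    record = {
        "file": model.name,
        "software": software,
        "software_version": version,
        "system_requirements": system_requirements,
    }
    # ensure_ascii=False keeps accented characters readable in the file.
    sidecar.write_text(
        json.dumps(record, indent=2, ensure_ascii=False), encoding="utf-8"
    )
    return sidecar
```

Because the sidecar is plain UTF-8 text, it stays readable even when the model itself can no longer be opened.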

3D modeling files

Examples: 3DS, VRML, X3D, U3D, BLEND

3D modeling files vary too widely to make general statements about their preservation. X3D and U3D are durable file formats, but they are not suitable as a durable format for every 3D model. As with 3D CAD, keep the files in their original format, together with documentation of the original software. 3D models are often made in order to produce other documents, such as 2D renders. The same recommendations apply to such documents as to image files. In some cases a 3D model is not a file but an executable, as with models in Unity. In that case, be sure to document the executable's system requirements. Documenting 3D scenes through snapshots or videos (e.g. screen recordings) is a good option.

Sheet music

The recommended formats for preserving digital sheet music are PDF/A, TIFF and MusicXML. Which format you choose depends on the intended use.

PDF/A and TIFF are good formats for preserving and reading documents. Treat them just as you would any other document in PDF or image in TIFF. MusicXML is an open format that makes it possible to notate and edit sheet music. This means you preserve the information behind the notes and can easily adjust it. It is less convenient for reading and performing music, however. For that purpose, it is best to save the score as PDF/A or TIFF.


Authors: Wim Lowet (VAi) and Nastasia Vanderperren (meemoo)