File Formats - Australian National Data Service

ANDS Guide
File Formats
Level: Awareness
Last updated: December 2016
Web link: www.ands.org.au/guides/file-formats
In simple terms, a file format describes the way information is organised in a computer file. File formats apply to documents, images, audio files, video files and research datasets, for example .doc or .pdf. It is important that organisations implement data management policies that conform to standards and that manage the risk of file format obsolescence or degradation of information storage. A comprehensive list of file formats can be found by a web search using the key term 'file formats list'.
A key related concept is preservation of research data: ANDS Introduction to Preservation
Choosing a suitable file format for data preservation and sharing is vital for the sustainability of future access to, and reuse of, that data. This may require careful analysis of the advantages of proprietary or open standards software to ensure that the way the data is accessed and stored will support its future reuse.
Institutional planning implications
File format types should ideally be considered and decided upon before the commencement of data collection. For example, information lost by storing data using a lossy image, sound or video format cannot be recovered. Migrating data from an unsuitable format to a more sustainable option can be difficult and expensive, and may in some cases be impossible. Uncompressed non-lossy file formats take up a lot more storage space, which needs to be taken into account when budgeting for storage.
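The storage budgeting point can be made concrete with a little arithmetic. The following is a minimal sketch, assuming uncompressed raster images and uncompressed PCM audio. The function names, image dimensions and collection size are illustrative assumptions, not figures from this guide, and real files add a small amount of header overhead.

```python
def uncompressed_image_bytes(width, height, channels=3, bits_per_channel=8):
    """Approximate size of an uncompressed raster image, ignoring headers."""
    return width * height * channels * bits_per_channel // 8


def uncompressed_audio_bytes(seconds, sample_rate=44_100, channels=2,
                             bits_per_sample=16):
    """Approximate size of uncompressed PCM audio, ignoring headers."""
    return seconds * sample_rate * channels * bits_per_sample // 8


# A hypothetical 6000 x 4000 pixel RGB image at 8 bits per channel.
one_image = uncompressed_image_bytes(6000, 4000)   # 72,000,000 bytes

# A hypothetical collection of 10,000 such images.
collection = 10_000 * one_image
print(f"{collection / 1e9:.0f} GB for the image collection")

# One hour of CD-quality stereo audio.
one_hour = uncompressed_audio_bytes(3600)          # 635,040,000 bytes
print(f"{one_hour / 1e6:.0f} MB per hour of audio")
```

Estimates like these are only a starting point. Backup copies and retained derivative formats multiply the total.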
• University of Western Australia: Research Data Preservation Formats
• University of Sydney: Durable Formats
• Monash University: Durable Formats
ands.org.au
Tools to manage file formats
• FIDO (Format Identification for Digital Objects): command-line tool that identifies the file formats of digital objects and is designed for simple integration into automated workflows
• BitCurator Access: open-source software that supports the provision of access to disk images (see also the webinar on using BitCurator)
• Apache Tika: toolkit that detects and extracts metadata and text from over a thousand different file types (such as PPT, XLS, and PDF)
• BWF MetaEdit: free, open source tool that supports embedding, validating, and exporting of metadata in Broadcast WAVE Format (BWF) files
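Format identification tools generally look at a file's content rather than trusting its extension. The sketch below illustrates that general idea by checking a few well-known leading byte signatures. It is not how FIDO or Apache Tika are implemented, the signature table is a tiny illustrative sample, and the function names are hypothetical.

```python
# A few well-known leading byte signatures ("magic numbers").
# Illustrative only: real identification tools use far larger registries.
SIGNATURES = [
    (b"%PDF-", "PDF document"),
    (b"\x89PNG\r\n\x1a\n", "PNG image"),
    (b"\xff\xd8\xff", "JPEG image"),
    (b"II*\x00", "TIFF image (little-endian)"),
    (b"MM\x00*", "TIFF image (big-endian)"),
    (b"PK\x03\x04", "ZIP container (also used by ODF and OOXML)"),
]


def identify_bytes(header: bytes) -> str:
    """Guess a format from the first bytes of a file."""
    # WAV files are RIFF containers whose form type at offset 8 is 'WAVE'.
    if header[:4] == b"RIFF" and header[8:12] == b"WAVE":
        return "WAV audio"
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    return "unknown"


def identify_file(path) -> str:
    """Read the first 16 bytes of a file and guess its format."""
    with open(path, "rb") as handle:
        return identify_bytes(handle.read(16))
```

A file named report.doc that begins with %PDF- would be reported as a PDF document. This is the kind of mismatch that identification tools are used to catch.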
File format obsolescence
File formats can become obsolete for various reasons:
• Software or file formats are upgraded and the new version no longer works with the old version
• Software that supports the format is bought out by a competitor and withdrawn
• The format falls into disuse, or no-one writes software to support or implement it
• The format is no longer compatible with current software or is not backwards compatible with older software
As a result of this obsolescence, it may no longer be possible to access the file, read the file or reuse the data, either entirely or partially. Risks also emerge for users if access to the software required to read the format is restricted, or if the developer changes the licensing or the cost of using that software.
The ICPSR Digital Preservation Management Tutorial provides a useful overview of obsolescence for file formats and software.
Migration
If data is stored using a format that is, or is about to become, obsolete, then it may be necessary to migrate to a more suitable format.
The alternative is to preserve the entire environment needed to access and/or use the data. This approach involves either:
1. maintaining old computer hardware, together with the operating system and all the required software, or
2. writing or obtaining special emulation software that recreates the software operating environment within more recent systems
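Whichever approach is chosen, a useful first step is knowing which stored files use formats considered at risk. The following is a minimal sketch of such an inventory, assuming formats can be approximated by file extension. The watch list and function names are hypothetical, and a real list would come from institutional policy.

```python
from collections import Counter
from pathlib import Path

# Hypothetical watch list of extensions treated as at risk of obsolescence.
AT_RISK_EXTENSIONS = {".wpd", ".wk1", ".sxw"}


def inventory_formats(root):
    """Count files under root by lower-cased extension."""
    counts = Counter()
    for path in Path(root).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(none)"] += 1
    return counts


def migration_candidates(root, at_risk=AT_RISK_EXTENSIONS):
    """List files whose extension appears on the at-risk watch list."""
    return sorted(
        path for path in Path(root).rglob("*")
        if path.is_file() and path.suffix.lower() in at_risk
    )
```

Extensions can be wrong or missing, so an inventory like this is only a rough guide. Content-based identification gives a more reliable picture.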
Open and proprietary formats
A proprietary format is one that is owned by an individual or a corporation. Some common examples of proprietary formats are AutoCAD's DWG drawing format, the MP3 (MPEG Audio Layer 3) format and Adobe Photoshop's PSD native image format. Most proprietary formats are closed, meaning that neither the definition nor the development of the format is available to the public.
This means that data stored in the format can only be accessed using the format owner's software. Some formats are both open and proprietary, e.g. Adobe PDF and Microsoft OOXML.
An open format is one where the description of the format is available to the public.
Open formats are typically developed and maintained by communities of interest. Examples include:
1. Standard image formats: JPEG 2000, PNG and SVG
2. For text: ASCII, PDF, OpenDocument Format and Office Open XML format (the native format for recent versions of Microsoft Word)
3. For the web: HTML, XHTML, RSS and CSS
4. NetCDF for some scientific data
Lossy and lossless formats
A lossless format retains the original detail of the data file, e.g. TIFF for images.
A lossy format discards information permanently in order to reduce the size of the file, in effect lowering the quality of that data, e.g. JPEG for images.
"Lossy" compression is a data encoding method that compresses data by discarding (losing) some of it. The procedure aims to minimise the amount of data that needs to be held, handled, and/or transmitted by a computer. Images become progressively coarser as more of the data that made up the original is discarded (lost). Typically, a substantial amount of data can be discarded before the result is sufficiently degraded to be noticed by the user.
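Why discarded information cannot be recovered can be shown with a toy example. The sketch below uses simple quantisation, which is only one ingredient of real lossy codecs such as JPEG or MP3 and is not how those codecs work in full. The sample values and step size are arbitrary assumptions.

```python
def quantise(samples, step):
    """Lossy encode: keep only which step-sized bucket each sample is in."""
    return [s // step for s in samples]


def dequantise(codes, step):
    """Decode to the midpoint of each bucket; the exact value is gone."""
    return [c * step + step // 2 for c in codes]


original = [12, 13, 14, 200, 201, 255]
codes = quantise(original, 16)     # fewer distinct values to store
restored = dequantise(codes, 16)

print(original)   # [12, 13, 14, 200, 201, 255]
print(restored)   # [8, 8, 8, 200, 200, 248]
```

The values 12, 13 and 14 all decode to 8, so nothing in the stored codes records which one was originally present. Converting the restored data to a lossless format afterwards does not bring the detail back.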
For audio file formats, the ubiquitous MP3 format is lossy, while the WAV format is lossless. The implications of re-saving or converting data from one format to another become apparent when the quality of that data is compromised by this removal of information.
Note: metadata such as file title, description, date etc. is not removed during this process.
Compression
Compression refers to ways of making data take up less storage space. Lossless compression does so without losing any of the content. For long-term preservation, uncompressed formats are less risk prone.
A file that has been compressed losslessly can be restored to its original state, completely unchanged. In the case of lossy formats, the reduction in size is achieved effectively by throwing selected data away (losing it).
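The lossless round trip can be checked directly. The following is a minimal sketch using zlib from the Python standard library as one example of a lossless compressor. The sample text and function name are arbitrary assumptions.

```python
import zlib


def lossless_round_trip(data: bytes):
    """Compress and decompress data; return (compressed size, restored bytes)."""
    compressed = zlib.compress(data, 9)   # 9 = strongest compression level
    restored = zlib.decompress(compressed)
    return len(compressed), restored


sample = b"Station A, 2016-12-01, 21.5 degrees C\n" * 200
size, restored = lossless_round_trip(sample)

print(len(sample), "bytes before,", size, "bytes after compression")
print("Restored copy identical:", restored == sample)
```

Highly repetitive data like this sample compresses well. Data that is already compressed, such as JPEG images, usually shrinks very little.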
The compression process makes data more susceptible to "bit rot". Bit rot is the gradual corruption of stored data, for example when the small electric charge that represents a bit in memory disperses, possibly altering program code or data. The risk is that a change of one bit in a compressed text file may cause major changes across the entire document, rendering it useless. The advantage of retaining a low-resolution lossy format file set is quick navigation or ease of transmission.
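The difference in fragility can be demonstrated by flipping a single bit in a plain text and in a compressed copy of the same text. This is a minimal sketch using zlib as the example compressor. The helper names, sample text and bit position are arbitrary assumptions, and other compressors may fail in different ways.

```python
import zlib


def flip_bit(data: bytes, bit_index: int) -> bytes:
    """Return a copy of data with one bit inverted."""
    damaged = bytearray(data)
    damaged[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(damaged)


def count_differences(a: bytes, b: bytes) -> int:
    """Number of byte positions at which a and b differ."""
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))


def damage_report(text: bytes, bit_index: int):
    """Compare the effect of one flipped bit on plain and compressed data."""
    plain_damage = count_differences(text, flip_bit(text, bit_index))
    packed = zlib.compress(text)
    try:
        recovered = zlib.decompress(flip_bit(packed, bit_index))
        packed_damage = count_differences(text, recovered)
    except zlib.error:
        packed_damage = None   # the stream could not be decompressed at all
    return plain_damage, packed_damage


text = b"The quick brown fox jumps over the lazy dog. " * 50
plain_damage, packed_damage = damage_report(text, 100)
print("Bytes changed in the uncompressed text:", plain_damage)
print("Result for the compressed copy:", packed_damage)
```

In the uncompressed text the flipped bit changes exactly one character. In the compressed copy the same flip typically makes zlib reject the stream because of its integrity check, so the whole document becomes unreadable rather than slightly wrong.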
The importance of standards
Standard file formats are essential for effective data sharing. In many cases a research discipline will have a mandatory or preferred standard for saving and storing research data, e.g. SPSS data files for social science data sets and FITS ('Flexible Image Transport System'), which is a standard data format used in astronomy.
Retaining multiple formats
Retaining multiple formats and instances of data may add to the volume of data being stored or cause synchronisation difficulties; however, doing so can reduce the risk of loss of the original high-resolution file.
An alternative to keeping multiple formats is to use content management system software, e.g. Alfresco, that can convert the data to multiple alternative formats on the fly. For example, a repository might store a text document in a gold-standard preservation format like DocBook XML, but provide a service that can also disseminate the document as HTML, PDF or Word, depending on the preference of the reader.
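The idea of storing one preservation copy and generating display formats on request can be sketched in a few lines. The example below is a toy illustration, not how Alfresco or any particular repository does it. It handles only un-namespaced title and para elements, which is a small assumed subset of DocBook, and the function name is hypothetical.

```python
import html
import xml.etree.ElementTree as ET


def docbook_to_html(xml_text: str) -> str:
    """Render the title and para children of an article element as HTML."""
    root = ET.fromstring(xml_text)
    parts = []
    for element in root:
        # Escape the text so characters such as & and < stay valid in HTML.
        text = html.escape((element.text or "").strip())
        if element.tag == "title":
            parts.append(f"<h1>{text}</h1>")
        elif element.tag == "para":
            parts.append(f"<p>{text}</p>")
    return "<html><body>" + "".join(parts) + "</body></html>"


stored = (
    "<article>"
    "<title>Soil &amp; water survey</title>"
    "<para>Samples were collected at twelve sites.</para>"
    "</article>"
)
print(docbook_to_html(stored))
```

Only the XML is stored. The HTML is a derivative that can be regenerated at any time, so it does not need to be preserved separately.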
Future file formats
Development of file formats in the near future will likely incorporate information that pertains to geospatial platforms and environments.
Mobile technologies and the onset of virtual reality based data creation mean there are a number of consortia, such as the Open Standards for Real-Time 3D Communication, that conform with the International Organization for Standardization (ISO).
The onset of geospatial data also presents new challenges as raster, vector and grid formats develop alongside other formats such as the World file, which is used for geo-referencing a raster image such as a JPEG or BMP file. It is critical that organisations remain vigilant about, and responsive to, the sustainability of formats in order to future-proof the access, re-use and compatibility of data.
An introduction to geospatial resources and formats.
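A World file is a small plain-text sidecar holding six numbers that define an affine transformation from pixel positions to map coordinates. The following is a minimal sketch of reading one and applying it, assuming the conventional line order A, D, B, E, C, F. The coordinate values are invented for illustration and the function names are hypothetical.

```python
def parse_world_file(text: str):
    """Read the six world file parameters in their on-disk order."""
    a, d, b, e, c, f = (float(value) for value in text.split())
    return a, d, b, e, c, f


def pixel_to_map(col, row, params):
    """Map a pixel (column, row) to the coordinates of that pixel's centre."""
    a, d, b, e, c, f = params
    x = a * col + b * row + c
    y = d * col + e * row + f
    return x, y


# Hypothetical world file: 10-unit pixels, no rotation, north-up image.
world_text = """10.0
0.0
0.0
-10.0
500005.0
6000995.0
"""

params = parse_world_file(world_text)
print(pixel_to_map(0, 0, params))   # (500005.0, 6000995.0)
print(pixel_to_map(2, 3, params))   # (500025.0, 6000965.0)
```

The World file does not record which coordinate reference system the numbers belong to, so that information has to be documented alongside it.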
Preservation formats and display formats
High-resolution data, e.g. a lossless uncompressed bitmap, may require conversion to a .jpeg format for ease of visualisation online or transmission via email. Another example is a standard XML format, which is best rendered to HTML or PDF for ease of viewing or printing.
Consideration must therefore be given to the long-term preservation of data, taking into account the storage, display, visualisation, conversion or re-use of data.
The US Library of Congress Sustainability of Digital Formats website provides a dynamic overview of preservation and sustainability of digital formats.
Further information
ANDS Guides and other Resources
ANDS file wrangling further reading
Feedback?
We welcome your feedback on this guide. Please email contact@ands.org.au with any comments or questions.
About ANDS
The Australian National Data Service (ANDS) makes Australia’s research data assets more valuable for researchers, research institutions and the nation.
ANDS is a partnership led by Monash University in collaboration with the Australian National University (ANU) and the Commonwealth Scientific and Industrial Research Organisation (CSIRO). It is funded by the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS).
This work is licensed under a Creative Commons Attribution 4.0 International License. You are free to reuse and republish this work, or any part of it, with attribution to the Australian National Data Service (ANDS).