Business LibreTexts

12: Long‑Term Preservation and Archiving

    Introduction

    Long‑term preservation is the discipline of keeping digital information authentic, accessible, and usable for as long as it has value—years, decades, or even centuries. It differs from backup or disaster recovery: backups restore systems to a recent state; preservation ensures that future users can still render, interpret, and trust specific digital objects and their context. The foundation for modern practice is the Open Archival Information System (OAIS) Reference Model, standardized as ISO 14721 and maintained by the Consultative Committee for Space Data Systems (CCSDS). OAIS offers a common language (e.g., Submission Information Package, Archival Information Package, Dissemination Information Package) and a set of responsibilities an archive must fulfill (ingest, preservation planning, storage, access, and administration). The model was updated in 2024–2025; the current OAIS 3rd edition and its ISO adoption confirm its ongoing relevance across disciplines. [ccsds.org], [iso.org], [OAIS (ISO...rs updated]

    Preservation is an Information Governance (IG) function as much as an archival one. IG aligns strategy, policy, and controls so that high‑value information is identified, appraised, protected, and preserved while low‑value information is disposed of legally and defensibly. National memory institutions and standards bodies provide proven guidance: the U.S. National Archives and Records Administration (NARA) publishes transfer and preservation guidance for permanent federal records; the Library of Congress maintains the Sustainability of Digital Formats knowledge base and the annually updated Recommended Formats Statement (RFS); and the Federal Agencies Digital Guidelines Initiative (FADGI) provides technical targets for digitization and embedded metadata. These resources provide concrete criteria for selecting formats, documenting provenance, and validating quality over the long term. [archives.gov], [loc.gov], [digitizati...elines.gov]

    In this chapter, we translate those standards into practical steps for IG programs: how to mitigate digital longevity risks; when to apply migration, emulation, and normalization; how to preserve AI‑generated content with trustworthy provenance; which standards and tools to adopt; and which checklists to use to ensure you are ready for the next audit—and for the future.


    Challenges of Digital Longevity

    Long‑term preservation faces four interlocking risks: format obsolescence, media degradation/bit rot, software dependency, and organizational discontinuity.

    Format obsolescence

    File formats evolve and sometimes disappear. If you cannot open yesterday’s files in tomorrow’s software, your information is practically lost. The Library of Congress Sustainability of Digital Formats program documents hundreds of formats and evaluates them against “sustainability factors” such as openness, adoption, transparency, and self‑documentation, which correlate with long‑term viability. IG teams use this analysis to favor formats such as PDF/A, TIFF, and XML for preservation masters. [loc.gov]

    The PRONOM registry from The National Archives (UK) complements this by cataloging formats, versions, signatures, and related software, and powering DROID for automated identification. PRONOM’s modernization work (linked data, signature updates) and search services help organizations inventory holdings and spot risky formats early, enabling proactive migration. [nationalar...ves.gov.uk], [nationalar...ves.gov.uk], [github.com]
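
    Signature‑based identification can be illustrated with a minimal sketch. The table below is a toy stand‑in, not PRONOM data or DROID's API: it holds four widely known magic‑byte signatures, and the names `SIGNATURES`, `identify`, and `identify_file` are hypothetical.

```python
from typing import Optional

# Toy signature table: (format label, byte offset, magic bytes).
# Assumption: these four well-known signatures stand in for a real registry,
# which holds far more detailed, versioned signatures.
SIGNATURES = [
    ("PDF", 0, b"%PDF-"),
    ("TIFF (little-endian)", 0, b"II*\x00"),
    ("TIFF (big-endian)", 0, b"MM\x00*"),
    ("PNG", 0, b"\x89PNG\r\n\x1a\n"),
]


def identify(header: bytes) -> Optional[str]:
    """Return the first format label whose magic bytes match, else None."""
    for label, offset, magic in SIGNATURES:
        if header[offset:offset + len(magic)] == magic:
            return label
    return None


def identify_file(path: str, probe_size: int = 16) -> Optional[str]:
    """Read only the first few bytes of a file and identify them."""
    with open(path, "rb") as handle:
        return identify(handle.read(probe_size))
```

    Files that return `None` are the ones worth a closer look during an inventory, since an unidentified format cannot be assessed for obsolescence risk.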

    Media degradation and bit rot

    Digital media—HDDs, SSDs, optical discs, tape—fail over time; silent corruption (“bit rot”) can render files unreadable. While NIST’s SP 800 series focuses on information security rather than archival science, its guidance highlights the need for integrity mechanisms and lifecycle controls. Preservation programs implement fixity checks (e.g., SHA‑256) on ingest and on a schedule, verifying checksums and repairing from redundant copies when corruption is detected. Standards bodies and communities reinforce the principle of lots of copies: programs like LOCKSS operationalize peer‑to‑peer integrity polling and repair across many nodes so that corruption in one replica is detected and healed. [nist.gov], [lockss.org]
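
    A fixity check with repair from a redundant copy might look like the following sketch. It assumes a simple two‑copy layout and a checksum recorded at ingest; the function names and the three status strings are illustrative choices, and this is not how LOCKSS itself polls and repairs.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 so large files do not fill memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def check_and_repair(primary: Path, replica: Path, expected: str) -> str:
    """Return 'ok', 'repaired', or 'unrecoverable' for one stored object."""
    if primary.exists() and sha256_of(primary) == expected:
        return "ok"
    # Only trust the replica if it still matches the checksum recorded at ingest.
    if replica.exists() and sha256_of(replica) == expected:
        shutil.copyfile(replica, primary)
        return "repaired"
    return "unrecoverable"
```

    Running such a check on a schedule, and logging every result, turns silent corruption into a recorded and recoverable event.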

    Software dependency

    Many digital objects depend on specific software stacks (codecs, runtimes, plug‑ins) or even complete environments to render correctly. OAIS anticipates this by emphasizing Representation Information—the documentation and software necessary to render and understand content. Preservation planning thus includes documenting dependencies, acquiring or packaging software, or planning migration/emulation routes for complex objects (e.g., multimedia projects, interactive spreadsheets, CAD/3D, or games). [ccsds.org]

    Organizational and legal discontinuity

    Preservation is a marathon, not a sprint. Funding cycles, mergers, provider shutdowns, and policy changes can interrupt stewardship. NARA’s federal transfer requirements and bulletins (e.g., ERM, Capstone email) try to reduce this risk by mandating standards‑based scheduling, transfer, and metadata across agencies; likewise, LOC’s RFS and FADGI provide continuity for the broader cultural heritage sector. IG’s job is to codify these requirements into retention schedules, legal holds, and disposition overrides so that custody and obligations persist across organizational change. [archives.gov], [loc.gov], [digitizati...elines.gov]


    Core Preservation Strategies

    Effective programs combine migration, emulation, normalization, and open standards selection—guided by OAIS and institutional policy.

    Migration

    Migration moves content to newer formats or media to maintain renderability. Routine examples include normalizing office documents to PDF/A‑2 or PDF/A‑4 for fixed‑layout access and migrating master images to TIFF with embedded metadata. Migration is most successful when driven by risk triggers (e.g., PRONOM signals increased obsolescence) and controlled by documented format policies (e.g., Archivematica’s Format Policy Registry). [pdfa.org], [archivematica.org]

    Migration does not mean discarding originals. OAIS and NARA practice advise retaining the source (for authenticity and re‑processing) while creating a normalized preservation master and access derivatives, with clear provenance metadata explaining each action. [ccsds.org], [archives.gov]

    Emulation

    Emulation recreates the original software/hardware environment so digital objects run unchanged. It is valuable for software‑dependent works (legacy multimedia, interactive art, scientific workflows). OAIS frames emulation as a preservation planning option when migration would deform significant properties. Community projects (e.g., Emulation‑as‑a‑Service) use encapsulated environments with documented Representation Information to deliver authentic experiences. [ccsds.org]

    Normalization

    Normalization converts many incoming formats into a small set of preservation targets with strong sustainability characteristics. Typical targets include PDF/A for documents, TIFF (or JPEG 2000 in some domains) for still images, and WAV/BWF for audio; where structure is key, XML (or JSON with schemas) expresses data in open, self‑documenting form. Normalization is often automated and governed by policy (e.g., Archivematica FPR rules). [pdfa.org], [loc.gov], [w3.org], [archivematica.org]
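
    A policy‑driven normalization step can be sketched as a lookup from source format to preservation target. The table below is hypothetical: it is not Archivematica's FPR schema, the extension list is an assumption, and a production system would key on identified formats rather than file extensions.

```python
from dataclasses import dataclass
from pathlib import PurePath
from typing import Optional

# Hypothetical policy table: source extension -> preservation target.
# Targets mirror the examples in the text; the extensions are assumptions.
NORMALIZATION_POLICY = {
    ".docx": "PDF/A",
    ".doc": "PDF/A",
    ".odt": "PDF/A",
    ".png": "TIFF",
    ".bmp": "TIFF",
    ".aiff": "WAV/BWF",
    ".flac": "WAV/BWF",
}


@dataclass(frozen=True)
class NormalizationPlan:
    source: str
    target_format: Optional[str]  # None: no rule, keep original and flag for review
    keep_original: bool = True    # originals are retained alongside the master


def plan_for(filename: str) -> NormalizationPlan:
    """Look up the preservation target for a file, ignoring extension case."""
    suffix = PurePath(filename).suffix.lower()
    return NormalizationPlan(source=filename,
                             target_format=NORMALIZATION_POLICY.get(suffix))
```

    Keeping the policy as data rather than code makes it easy to review, version, and change when a target format falls out of favor.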

    Favoring open standards and archival variants

    • PDF/A (ISO 19005) ensures device‑independent, self‑contained documents. Multiple parts exist: PDF/A‑1 (2005), PDF/A‑2, PDF/A‑3, and PDF/A‑4 (2020), each building capabilities while constraining risky features. In parts 1 through 3, conformance levels (A/U/B, with level U introduced in PDF/A‑2) balance accessibility, text extraction, and visual fidelity; PDF/A‑4 no longer uses these lettered levels. [iso.org], [pdfa.org], [pdfa.org]
    • TIFF remains widely preferred for master images because of its openness and transparency, especially when used with baseline features and embedded technical metadata; Library of Congress RFS continues to list TIFF among preferred still‑image formats. [loc.gov]
    • XML persists as a durable, human‑ and machine‑readable textual syntax standardized by the W3C, suitable for metadata (e.g., METS, PREMIS) and content that benefits from structured markup and schema validation. [w3.org]
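
    To make the XML point concrete, here is a minimal sketch that writes and re‑reads a small metadata record with Python's standard library. The element names (`record`, `identifier`, `fixity`, and so on) are hypothetical and deliberately simple; they are not METS or PREMIS. Note that `xml.etree.ElementTree` checks well‑formedness only, so schema validation would need an additional tool.

```python
import xml.etree.ElementTree as ET


def build_record(identifier: str, title: str, fmt: str, sha256: str) -> str:
    """Serialize a small descriptive/technical record as XML text."""
    root = ET.Element("record")
    ET.SubElement(root, "identifier").text = identifier
    ET.SubElement(root, "title").text = title
    ET.SubElement(root, "format").text = fmt
    fixity = ET.SubElement(root, "fixity", {"algorithm": "SHA-256"})
    fixity.text = sha256
    return ET.tostring(root, encoding="unicode")


def read_record(xml_text: str) -> dict:
    """Parse the record back; raises ET.ParseError if it is not well-formed."""
    root = ET.fromstring(xml_text)
    record = {child.tag: child.text for child in root}
    record["fixity_algorithm"] = root.find("fixity").get("algorithm")
    return record
```

    Because the record is plain text with named elements, it stays readable in any editor even if the software that produced it is long gone.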

    Table 12.1: Preferred Preservation Formats (illustrative). Recommended file types for long-term archival stability, focusing on open standards and metadata support.

    | Content type | Preferred preservation targets (examples) | Why favored (longevity factors) |
    |---|---|---|
    | Text/page documents | PDF/A‑2, PDF/A‑4 | Self‑contained; standardized; controlled features; accessible text (A/U levels in PDF/A‑2). [pdfa.org], [pdfa.org] |
    | Still images | TIFF (baseline); sometimes JPEG 2000 | Open, widely adopted; lossless; strong metadata support (technical, embedded). [loc.gov] |
    | Audio | WAV/BWF | Lossless PCM; metadata chunks; broadcast industry adoption. [loc.gov] |
    | Structured data / metadata | XML (+ schema), CSV with documentation | Open, text‑based, self‑describing; strong tooling; long‑term readability. [w3.org] |
    | Email | MBOX/EML; emerging EA‑PDF for archival email packages | Open containers; growing guidance for durable email encapsulation. [wwws.loc.gov] |

    Preservation of AI‑Generated Content

    By 2026, organizations create and store vast amounts of AI‑generated and AI‑assisted content—text, images, audio, code, and data. Preserving such content demands provenance, transparency, and reproducibility.

    Provenance metadata and content credentials

    AI outputs must carry provenance: who prompted/approved, what model produced it, when, with what data/parameters, and where it was stored. The C2PA (Coalition for Content Provenance and Authenticity) standard provides a method to bind tamper‑evident provenance assertions (e.g., “content credentials”) to media using embedded metadata and cryptographic signing; the Library of Congress format workplan explicitly references JUMBF (the metadata box format underlying C2PA) as a preservation‑relevant technology. [nvizionsolutions.com], [wwws.loc.gov]
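
    The idea of binding tamper‑evident assertions to content can be sketched with the standard library. This is not the C2PA format or API: C2PA relies on certificate‑based signatures and JUMBF containers, whereas the sketch below substitutes an HMAC with a shared secret key purely for illustration. The manifest layout and function names are hypothetical.

```python
import hashlib
import hmac
import json


def _canonical(payload: dict) -> bytes:
    """Serialize deterministically so the same payload always signs the same."""
    return json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")


def make_manifest(content: bytes, assertions: dict, key: bytes) -> dict:
    """Bind assertions to the content hash and sign both together."""
    payload = {
        "content_sha256": hashlib.sha256(content).hexdigest(),
        "assertions": assertions,
    }
    signature = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    return {"payload": payload, "signature": signature}


def verify_manifest(content: bytes, manifest: dict, key: bytes) -> bool:
    """True only if neither the content nor the assertions have changed."""
    payload = manifest["payload"]
    expected = hmac.new(key, _canonical(payload), hashlib.sha256).hexdigest()
    signature_ok = hmac.compare_digest(expected, manifest["signature"])
    content_ok = payload["content_sha256"] == hashlib.sha256(content).hexdigest()
    return signature_ok and content_ok
```

    Changing either the media bytes or any assertion after signing causes verification to fail, which is the property that makes the provenance record tamper‑evident.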

    Watermarking and detection

    Watermarking can signal AI origin but should not be the only control: technical research continues to show that many watermarks are fragile under transformations. Preserve both the watermark (if present) and the provenance record within the object’s metadata/AIP to support future authenticity checks. (The RFS and Sustainability of Digital Formats emphasize metadata richness and transparency to bolster preservation even when technical markers fail.) [loc.gov], [loc.gov]

    Model and prompt versioning

    AI content cannot be fully understood without model context. IG should require:

    • Model version identifiers (and provider), training cutoffs, and safety filters applied at generation time.
    • Prompt/response archives for significant decisions, captured in XML/JSON with timestamps and user IDs.
    • Reproducibility artifacts where possible (e.g., seeds, temperature), recognizing some models are inherently nondeterministic.
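
    The requirements in this list could be captured in a record like the following sketch. The field names are assumptions chosen to mirror the bullets (model version and provider, prompt and response, timestamp, user ID, seed, temperature); they do not follow any particular standard.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class GenerationRecord:
    user_id: str
    model_provider: str
    model_version: str
    prompt: str
    response: str
    temperature: Optional[float] = None  # None when the provider does not expose it
    seed: Optional[int] = None           # None when generation is not seedable
    timestamp: str = ""                  # filled with the current UTC time if empty

    def __post_init__(self) -> None:
        if not self.timestamp:
            self.timestamp = datetime.now(timezone.utc).isoformat()


def to_json(record: GenerationRecord) -> str:
    """Serialize with sorted keys so records diff cleanly across versions."""
    return json.dumps(asdict(record), sort_keys=True, ensure_ascii=False, indent=2)
```

    Recording `None` explicitly for unavailable parameters documents that reproducibility was considered and could not be guaranteed, rather than leaving future readers to guess.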

    Where models are internally developed, preserve model cards, training data documentation, and versioned artifacts using open tools (e.g., MLflow or DVC) linked to preservation packages, even if the binaries themselves are not preserved indefinitely. Align with OAIS by treating the model and its documentation as Representation Information for derivative AI content. [ccsds.org]

    Ethical considerations

    Preserved AI content can encode bias or misuse. Appraisal should weigh long‑term value against potential harms, applying access controls and use notes where appropriate. LOC’s RFS reminds institutions to consider accessibility, rights, and user communities; OAIS’s Designated Community concept helps tailor documentation and access for ethical clarity. [loc.gov], [ccsds.org]


    Standards and Tools for Archiving

    Flowchart of the OAIS model: SIP is ingested, stored as AIP, and distributed as DIP.


    This page titled 12: Long‑Term Preservation and Archiving is shared under a CC BY-SA 4.0 license.
