Digital Library Building Blocks
The California Digital Library provides software, best practices,
and other tools to facilitate digital library operations.
Curation Micro-Services
Developed with CDL programs and partners (e.g., LoC, UMich),
curation micro-services offer an unbundled alternative to
all-in-one repositories that can be expensive to support and
modify (cf. DSpace, Fedora, LOCKSS).
Using native operating system file and web services, we define
minimal conventions to turn a file system into an "object
system" and provide low-barrier tools for full lifecycle
enrichment (identity, fixity, replication, annotation, etc.) of
objects. For more background see
curation services.
Open specifications and tools.
We welcome feedback on these works in progress.
- Noid (Nice Opaque Identifiers):
Noid provides minting, binding, and resolving services in
support of preservation-ready identifiers. Persistent
identifiers may be obtained by a committed provider with
help from these kinds of identity services. Software:
download.
- Dflat: Simple File-Based Object Storage:
An object residence, or "digital flat". Common amenities,
such as versions, metadata, annotations, administrivia, and
the occupant itself (as intended by the depositor),
if present, are always found under reserved names. We will
likely have "Dflats" at the ends of Pairtree paths.
- Pairtrees for Collection Storage:
A filesystem convention for holding a collection of
digital object directories. The directory path ending at
an object is formed by taking the identifier and making a
new sub-directory for each next pair of characters.
Conversely, one can recover every object and its identifier
simply by "walking" the Pairtree. Software:
download.
- Content Access Node (CAN):
A CAN holds a repository instance, which is a set of
collections (Pairtrees) plus policy configuration files to
govern such things as fixity, replication, indexing, and
annotation, depending on the purpose of the repository.
- CLOP: A Class-Based System for Managing Object Properties:
Allows policy declarations to be attached to files,
versions, objects, and entire repositories.
- Directory Typing with Namaste Tags:
Namaste (NAMe AS TExt) tags are primitive directory-level
metadata exposed directly via filenames. As such, they
greet visitors who request a directory listing with a
glimpse of what the directory holds. Alpha software:
download.
- Reverse Directory Deltas (ReDD):
ReDD is a way to represent differences between two sets of
files, which greatly reduces the cost of storing
multiple versions. To optimize access to recent versions,
a chain of ReDD "reverse deltas" stretches backward in
time. We will likely use ReDD for Dflat version
directories.
- Checkm: a checksum-based manifest format:
Checkm is a general-purpose text-based manifest format
designed to support tools that verify the bit-level
integrity of file groups for such things as content
fixity, replication, import, and export.
- JHOVE2 Architecture for Format-Aware Characterization:
A next-generation framework and application for
format-aware characterization, building on the success of
the original JHOVE system. JHOVE2 generalizes the process
of characterization to include signature-based
identification, validation, feature extraction, and
policy-based assessment.
- BagIt File Package Format:
A "bag" is a hierarchical file package format suitable for
the exchange of generalized archival content via the
network or hard disk. It has just enough structure to
safely enclose its payload but does not require the
receiver to have any deep knowledge of its internal
semantics. Software:
download.
- N2T: Name-to-Thing Resolver:
N2T is a centralized, scheme-agnostic identifier resolver to
protect URL stability for organizations with web server
hostnames that might change.
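
The Pairtree convention described above (one sub-directory for each successive pair of identifier characters) can be sketched as follows. This is a minimal illustration in Python, not the CDL software: the function names `id_to_pairpath` and `pairpath_to_id` are hypothetical, and the sketch assumes the identifier contains only filesystem-safe characters, so any character escaping a real implementation would need is omitted.

```python
def id_to_pairpath(identifier: str) -> str:
    """Map an identifier to a relative directory path, two characters per level.

    Assumption: the identifier is already filesystem-safe; no escaping is done.
    """
    if not identifier:
        raise ValueError("identifier must be non-empty")
    # Take the identifier two characters at a time; an odd-length
    # identifier leaves a single character for the last directory.
    pairs = [identifier[i:i + 2] for i in range(0, len(identifier), 2)]
    return "/".join(pairs)


def pairpath_to_id(pairpath: str) -> str:
    """Recover the identifier by concatenating the directory names in order."""
    return "".join(part for part in pairpath.split("/") if part)


if __name__ == "__main__":
    for ident in ("abcd", "abcde"):
        path = id_to_pairpath(ident)
        # The mapping is reversible, which is what lets a directory
        # walk recover each identifier.
        assert pairpath_to_id(path) == ident
        print(ident, "->", path)
```

Because the mapping is reversible, a tool that walks the directory tree can rebuild each identifier from the path alone, without consulting a separate index.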
Best Practices and Standards
- Archival Resource Key (ARK):
a naming scheme for preservation-ready identifiers.
[HTML]
- WARC File Format (ISO 28500:2009):
co-authored by CDL preservation staff, this international
standard specifies a structure for storing and exchanging
resources harvested from the web and elsewhere.
[HTML]
- CDL guidelines for digital objects, version 2.0: September 2007 [HTML]
[PDF]
- CDL guidelines for digital images, version 2.0: April 2008 [HTML]
[PDF]
- CDL Text Encoding Initiative (TEI) encoding guidelines: [HTML]
- OAC best practice guidelines for Encoded
Archival Description (EAD), version 2.0:
[HTML]
[PDF]
- Minimal level OAC MARC records for CDL, Version 1.1:
[HTML]
- Standards for minimal level MARC bibliographic records
for University of California Libraries:
[DOC]
- Standards for UC Union catalog input records:
[RTF]
Submission Agreements
- CDL/UC libraries digital assets agreement:
[PDF]
- CDL/UC libraries digital assets submission inventory:
[RTF]
Software and Services
- UC-eLinks OpenURL resolution:
The CDL allows UC campus libraries to customize and localize the
SFX OpenURL resolution service, UC-eLinks. For detailed operational
information about campus instances of UC-eLinks, go to the UC-eLinks
Campus Liaisons page.
- CDL Access and Preservation Repositories:
Provides information about the CDL's digital object repositories.
- eXtensible Text Framework (XTF):
Flexible indexing and query tool that supports searching across
collections of heterogeneous data and presents results in a highly
configurable manner.
- 7train: An XSLT 2.0-based tool for generating METS files from standardized XML inputs (e.g., CONTENTdm Standard XML exports, OAI records, etc.).
- Date Normalization Utility: Java code that outputs machine-readable date strings to enrich collections that weren't originally encoded with machine-readable dates.
- Markup data dictionary: Encoding strategy for the data dictionary used for processing of all U.S. census studies.