Preserving the cultural heritage of our nation is a task of incalculable importance. Artifacts of cultural heritage provide the raw material for researchers and historians and are the narrative upon which our national identity is built.
The Web has increasingly become the public forum for publications by non-profit agencies and research groups as well as a locus of grass-roots reaction to historical events. Due to the volatility, diversity, and growing volume of the Web, the resulting narrative has become increasingly difficult to preserve. The goal of the CDL Web Archiving Program is to create tools to help librarians capture and preserve these materials, to participate in collaborative efforts to preserve web publications on a large scale, and to develop policies and services to assist librarians and archivists in this new realm of collection development.
The CDL web archiving program draws on a team of programmers, analysts, and librarians who study web information trends, design and develop tools for capturing and analyzing web data, and support libraries as they expand their collection efforts to web content. This team works closely with the CDL’s user experience analysts to integrate user feedback into projects, and draws on the subject expertise of University of California librarians and archivists from all campuses.
The Web Archiving Service (WAS) is a web-based application providing librarians and other curators with the means to preserve web content. WAS is built using both open source and locally developed technology, and its design was heavily influenced by user feedback. Easy to use, powerful, and flexible, WAS allows curators to efficiently preserve at-risk materials and use them to build subject-specific collections. See the CDL Web Archiving Service page for further information.
The CDL web archiving program staff work consistently to raise awareness of the importance of preserving web content, advocating for it at UC campuses, in professional organizations, and in library publications. In addition, the CDL works closely with a number of national and international organizations devoted to web archiving efforts and contributes to the development of standards for this emerging field.
The International Internet Preservation Consortium
The International Internet Preservation Consortium (IIPC) is a group of institutions that fund and participate in projects and working groups to develop tools and standards for the emerging field of web archiving. The IIPC’s specific goals are:
- To enable the collection, preservation and long-term access of a rich body of Internet content from around the world.
- To foster the development and use of common tools, techniques and standards for the creation of international archives.
- To be a strong international advocate for initiatives and legislation that encourage the collection, preservation and access to Internet content.
- To encourage and support libraries, archives, museums and cultural heritage institutions everywhere to address Internet content collecting and preservation.
The California Digital Library (CDL) has been a member of IIPC since 2007. As an IIPC member, the CDL has contributed to the development of the WARC file format, the emerging standard for captured web content.
The National Digital Information Infrastructure and Preservation Program
The California Digital Library has been part of the NDIIPP collaborative partnership since 2005, when it was awarded a grant for the Web-at-Risk project. Several NDIIPP partners are directly involved in developing web archiving solutions for libraries, ensuring that a variety of strategies can be explored and that collaborative solutions can be developed where possible. The CDL co-authored the “BagIt” specification used by the Library of Congress to transfer large quantities of data produced by all of the NDIIPP projects.
The Internet Archive
The CDL’s web archiving team works in close communication with the Internet Archive and has contributed technical documentation to the Internet Archive’s open source tools, which are widely used in the web archiving community.
WARC
The CDL staff contributed to the development of the Web ARChive (WARC) file format. WARC is a more advanced version of the ARC file format. ARC files are "archives" of other files collected during a web crawl.
BagIt
The CDL staff co-authored the BagIt specification. BagIt is a hierarchical file package format suitable for the exchange of generalized archival content over the network or on hard disk. The "bag" has just enough structure to safely enclose its payload but does not require deep knowledge about its internal semantics.
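As a rough illustration of how little structure a bag needs, the following Python sketch writes a minimal bag and then checks it. It assumes the layout described in the published BagIt specification: a bagit.txt declaration, a data/ directory holding the payload, and a checksum manifest. The function names make_bag and verify_bag, and the sample file name, are hypothetical and chosen only for this example; this is not the CDL's or the Library of Congress's own tooling.

```python
import hashlib
from pathlib import Path


def make_bag(bag_dir: Path, payload: dict[str, bytes]) -> None:
    """Write a minimal bag: declaration, data/ payload, SHA-256 manifest.

    Hypothetical helper for illustration only.
    """
    data_dir = bag_dir / "data"
    data_dir.mkdir(parents=True, exist_ok=True)

    # The bag declaration is the tag file that identifies the directory as a bag.
    (bag_dir / "bagit.txt").write_text(
        "BagIt-Version: 1.0\nTag-File-Character-Encoding: UTF-8\n",
        encoding="utf-8",
    )

    # The manifest lists one checksum per payload file, relative to the bag root.
    manifest_lines = []
    for name, content in sorted(payload.items()):
        (data_dir / name).write_bytes(content)
        digest = hashlib.sha256(content).hexdigest()
        manifest_lines.append(f"{digest}  data/{name}")
    (bag_dir / "manifest-sha256.txt").write_text(
        "\n".join(manifest_lines) + "\n", encoding="utf-8"
    )


def verify_bag(bag_dir: Path) -> bool:
    """Return True if every file listed in the manifest matches its checksum."""
    manifest = (bag_dir / "manifest-sha256.txt").read_text(encoding="utf-8")
    for line in manifest.splitlines():
        expected, rel_path = line.split(maxsplit=1)
        actual = hashlib.sha256((bag_dir / rel_path).read_bytes()).hexdigest()
        if actual != expected:
            return False
    return True
```

The receiver only needs the manifest to confirm that the payload arrived intact; it does not need to understand what the payload files mean, which matches the "just enough structure" idea described above.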
The Web Archiving Program grew from a 2003 Mellon-funded study conducted by the CDL to evaluate the impact of the web as a medium of publication for government information. The final report for that study, “Web-Based Government Information: Evaluating Solutions for Capture, Curation, and Preservation,” served as the basis for CDL’s 2005 “Web-at-Risk” grant proposal. Research and assessment have continued to be a strong focus of the Web Archiving Program, whether evaluating promising technologies or drawing on user-centered design practices for the tools we build.
The Library of Congress, the California Digital Library, the University of North Texas Libraries, the Internet Archive, and the U.S. Government Printing Office have joined together for a collaborative project to preserve public United States Government web sites by January 19, 2009, which is the end of the current presidential administration. In this collaboration, the partners will conduct a comprehensive harvest of the Federal Government (.gov, .mil, etc.) domain. For further information, see the End of Term Harvest Announcement. To participate, contact eotproject@loc.gov.
In 2007 a series of catastrophic wildfires struck Southern California. The web coverage of this event was captured by WAS and the subsequent archive was made available to the Library of Congress.
Hurricane Katrina was covered extensively on the web by both traditional media outlets, such as news sites, and non-traditional outlets, such as blogs. To capture these sources of information, the CDL initiated a harvest of Katrina-related sites. The resulting harvest provided a thorough snapshot of Katrina's web coverage and proved that WAS could successfully undertake large-scale time-critical web crawls.
If you were a high school student, which websites would you want to save for future generations? This is the challenge we posed to students and their teachers. In the spring of 2008, the Internet Archive, the Library of Congress, and the California Digital Library collaborated on a project that explores archiving the Web from the perspective of high school students.
The California Recall Election Project is an undertaking to capture and make available (for non-commercial, educational and scholarly research purposes) a collection of web sites from the historic 2003 California gubernatorial recall election.
Stanford University captured 12 terabytes of content from the .gov domain in 2007. The CDL reformatted the content into ARC files and transmitted this data to the Library of Congress.
Web Archiving Service
Learn more about the tools the CDL has created to capture, curate and preserve web content.
Web-at-Risk Grant
Learn more about the 4.5-year grant effort to develop tools, policies and standards for web archiving.
Digital Preservation Program
The Web Archiving Program is part of the CDL’s more comprehensive Digital Preservation Program, and draws from the practices and technologies developed by the Digital Preservation Group.
Web-Based Government Information: Evaluating Solutions for Capture, Curation, and Preservation