ALIA Information Online 2017 Conference, 13-17 February 2017 Sydney: Data Information Knowledge
This conference paper discusses the National Library of Australia's web archiving programmes.
Abstract: Twenty years ago the National Library of Australia (NLA) established one of the first programmes in the world to systematically collect, preserve and make accessible web content. This was a mere half decade after the functional implementation of the web itself. The NLA has continued to build content over the past two decades and now holds large amounts of archived web content – more than 400 terabytes of data in the combined collections of the PANDORA Archive, the Australian Government Web Archive and whole .au domain harvests. The Library built the prototype fit-for-purpose selective web archiving workflow system, PANDAS, first implemented in 2001 and still operating in its third version. This system has made the collecting, archiving and delivery of Australian web materials a routine activity within the Library’s collection development operation.
While collecting web content demands ongoing and timely application to the collecting tasks, efficient workflow systems and established operational activity run the risk of promoting a degree of ‘operational complacency’ – a sense that the job has been done. However, web archiving demands continual strategic attention, as well as agility and innovation in practice, because of the transforming and dynamic character of the target media – not only in its form and format but also in its function and conceptualisation.
Over the past two years the National Library has entered into a renewed phase of web archiving development. In part this is driven by the need to bring together the Library’s selective and domain harvesting content collected over a long period, but also to make collecting more agile through access to a variety of collecting methods. In part development is also driven by the important strategic objective of integrating the discovery of archived web materials more effectively through its single discovery service, Trove.
This paper will discuss the issues – successes and shortcomings – involved in managing a large amount of unique legacy web archive material while continuing to develop and refocus the infrastructure, workflows and relationships needed to effectively manage the collection, curation and archival discovery of this largest and most complex publishing medium. The paper does not present a case-closed, ‘problem solved’ conclusion but identifies, on the one hand, the unique value and opportunities of large curated collections of web materials and, on the other, the limitations and considerable challenges that yet remain.
This paper will largely relate to the theme of innovation in practice. Web archiving requires a sustained and sustainable commitment to operational innovation to build large data collections of harvested content and to curate this material so as to facilitate access to the information for the development of knowledge.