The Borthwick Institute for Archives uses the Archive-It service to capture copies of websites selected for inclusion in its web archive. Archive-It is a subscription-based web archiving service established by the Internet Archive and is used by organisations throughout the UK, Europe and North America for harvesting and managing web content. It is currently used by over 500 partner organisations, including university libraries and archives, government libraries and archives, museum and art libraries, historical societies and public libraries.
The Archive-It service captures copies of websites using versions of the open-source web crawlers Heritrix and Brozzler.
For more information, please see our Web Archiving policy.
We will always seek your consent and written permission before crawling your website as part of our web archive program. At this time, we will also discuss what web archiving involves, and share information with you about how frequently your site will be captured and how archived copies will be accessed.
University of York websites are captured as part of our University Archive.
The frequency with which we crawl your website will depend on collection guidelines and the nature of the individual website. Websites that are actively maintained and updated will likely be captured and recaptured at regularly scheduled intervals, such as semi-annually or quarterly.
In some cases, a website may only be crawled once or for a specified time period. Occasionally, we may choose to discontinue regularly scheduled crawls. These cases might occur if:
We crawl websites at a rate designed not to interfere with their performance. For actively updated websites, crawls will generally run quarterly or semi-annually and last for a few days. Very large websites may need to be crawled over a period of weeks. Once a crawl is complete, the crawler no longer interacts with your server. It is unlikely that Archive-It’s crawlers will affect the performance of your website, but if you encounter any issues or have any additional questions, please contact us at borthwick-institute@york.ac.uk.
No. The Borthwick does not archive password-protected content. If you believe that your site’s password-protected content should be included as part of our web archive, please contact us.
In order to create an accurate snapshot of your website for future researchers, we aim to capture as much of your site as possible. We will not bypass robots.txt exclusions by default, but may contact you to request changes to rules or seek permission to override them where such exclusions prevent the capture of content.
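Site owners who want to see in advance which pages their robots.txt rules would exclude can check this themselves. The following is a minimal sketch using Python's standard `urllib.robotparser` module. The robots.txt content, the example URLs and the `blocked_urls` helper are hypothetical, and the user-agent string `archive.org_bot` is an assumption that should be confirmed with Borthwick staff before relying on the result.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt rules, for illustration only.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /media/
"""


def blocked_urls(robots_txt: str, user_agent: str, urls: list[str]) -> list[str]:
    """Return the URLs that the given robots.txt rules exclude for user_agent."""
    parser = RobotFileParser()
    # parse() takes the rules as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return [url for url in urls if not parser.can_fetch(user_agent, url)]


# Hypothetical pages on a site being considered for archiving.
urls_to_check = [
    "https://www.example.org/",
    "https://www.example.org/media/lecture.mp4",
    "https://www.example.org/private/minutes.pdf",
]

# "archive.org_bot" is assumed here, not confirmed by this page.
for url in blocked_urls(EXAMPLE_ROBOTS_TXT, "archive.org_bot", urls_to_check):
    print("Excluded by robots.txt:", url)
```

Any URL reported as excluded is content that a crawl respecting robots.txt would leave out of the archived copy, which may be worth raising when discussing the capture of your site.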
No. By archiving your site, the Borthwick is preserving a static snapshot of your site at a particular time. The hosting, management and maintenance of your live website remains your responsibility.
Yes, downloadable media, audio and video files can usually be captured, although content hosted on third-party services like YouTube, Vimeo or SoundCloud can sometimes be challenging to capture. For this reason, we may ask you to provide separate copies of media, audio and video files where possible. We will store these outside of the web archive, which may provide a more robust solution for long-term access.
Archived copies of sites cannot render files that are not directly linked and must instead be retrieved from a database through a user query (for example, where a user has to run a search in order to find and open a file).
The Borthwick makes sites archived as part of its web archiving program freely available to the public via our Archive-It partner page, where website-level metadata is supplied to allow for browsing and full-text search. A prominent banner informs users that they are viewing an archived web page, as opposed to the live site. Relevant links to archived copies of sites will also be included in collection guides and/or catalogue records.
By default, archived sites undergo a six-month embargo period before being made available. Where appropriate and in discussion with site owners, selected content may be subject to longer embargo periods.
If your website includes content that is challenging for Archive-It’s crawlers, Borthwick staff will work with you to find alternative solutions for capturing that content. This may mean storing copies of challenging files separately from the web archive and providing access via alternative means.
However, owners wishing to optimise their site design to support more complete archiving may be interested in Columbia University Libraries’ Guidelines for Preservable Websites.
Users may access the collection for non-commercial research purposes and private study. Content within the collection may be subject to intellectual property rights governed by local, national and/or international laws or regulations. In using this content, you are responsible for abiding by all applicable laws in connection with your research.
Users are also responsible for ensuring that they use archived web resources in an ethical manner. Ethical use includes ensuring that archived web content is not represented as the live or most current version of a site, as well as accurately citing any archived web content used in research.
In addition, users are responsible for any personal data concerning living individuals obtained from archived web resources. When you capture or take away personal information, you become the data controller of this information and are liable for it and for any subsequent use made of it. It is your responsibility under Data Protection Law to ensure that your use of any personal data accessed via the Borthwick Web Archive is fair and lawful.
There are a number of reasons why the archived copy of a website may appear to be incomplete:
The Borthwick Institute for Archives wishes to ensure that content made available as part of its web archive is lawful. If you object to material included as part of the Borthwick’s web archive, please fill out our Notice and Takedown form. We may disable access to the material in question while we assess your objection. If the material in question was supplied to us by a third party, we may need to contact them as part of our enquiries.