Not many people think about archiving websites. And that’s normally fine. But as archivists, we need to think about it. And do something too. And I don’t mean printing them out.
Pretty much all of the news, advice and publications that organisations used to put out in paper form is now being disseminated digitally via their website or social media. The experience of using any website changes every few months, as the structure of the site changes with every new development. The content of some websites, like news pages, changes hourly. It’s been that way for at least fifteen years now, and in some cases twenty. Yet I don’t think there’s a lot of understanding about what that means in terms of the historical record. This blog is about ways of responding to this issue, based on current practice and resources at Manchester. I hope it’s useful.
Why is web archiving a problem?
Recently I asked a heritage project manager about a project website that had disappeared within a year of the project having finished. ‘Yeah well,’ he shrugged. ‘The project’s finished.’ That’s a big problem if the whole point of the project was to collect, and help people engage with, archives. What lasting impact can the project have if its main medium disappears?
I was recently shown around the website of a heritage project partner by the office manager. ‘Don’t worry about this Dave. You don’t need to look at this horrible old thing,’ she said. ‘We’re going to have a shiny new one soon.’ But has the old website been crawled? Will the new one have any of the old newsletters and annual reports attached to it?
Since April 2013 the UK Web Archive has been archiving the whole of the UK web domain, under the terms of the Non-Print Legal Deposit Regulations 2013. This means that every .uk website in existence is now crawled every year. They also collect non-.uk websites (and websites at risk) by a simple online process of nomination. Anybody can nominate a website.
This is a good thing, right?
Yes it is. But it’s important to be clear that although the archive crawls everything, this does NOT mean everything will be available online. All it means is that people can look at the crawled websites on dedicated terminals at the six legal deposit libraries (plus Boston Spa), none of which is in the North West region. The legislation even means that if somebody is looking (say) at the archive copy of Chorlton Heston FC’s website in London, I can’t look at it at the same time in Edinburgh. Unless the copyright has been cleared – in which case it’ll be available online.
Ian Milligan, a web archivist in Canada, has written a great blog post about the limitations of the legal deposit web archive. He concludes with the pretty damning sentence: ‘I could research .. hundreds of times quicker if these websites were printed off and in banker’s boxes than if I was using this interface.’
Unless somebody has nominated your site to @UKWebArchive, AND they have received permission from the copyright holder, it’s not going anywhere near the UK Web Archive website. In my opinion this is a pretty ridiculous state of affairs in the twenty-first century. But there you go. There’s a lot of copyright infringement on the web, so taking any other approach would be an understandably risky business for the British Library. Copyright legislation is going to take a while to relax.
What does this mean for me and my website?
In order for people to be able to see your website, as it is now, in the future you need to nominate it to be included in the online web archive. Then you need to hope that it is accepted (on the plus side, they haven’t turned down any of our suggestions yet). You also need to get whoever owns the copyright on your website to sign a declaration (it helps if you tell the archive who the copyright holder is, and how to contact them, when you nominate). And THEN you’ll be able to see a copy of your website on the online archive. Also, you need to be aware that websites without a .uk in their url (like, say, archivesplus.org) will not automatically be crawled.
What if the website I want to see has already gone?
The Internet Archive, based in California where UK copyright law isn’t a deterrent, has been busy archiving the web since 1996. Their Wayback Machine is a really, really useful tool if what you’re after has already disappeared. For example the Greater Manchester County Record Office website was crawled 117 times, from March 2000 until GMCRO merged with Manchester in 2011. I use the Wayback Machine every now and again to get my hands on bits and pieces that used to be online but are no longer there. Like pdf copies of articles from the Manchester Region History Review, and the Commonwealth Games legacy website.
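If you want to check programmatically whether a vanished page was ever crawled, the Internet Archive exposes a public availability endpoint that returns the snapshot closest to a date you supply. A minimal sketch in Python (the example address in the comment is illustrative, not a real record office url):

```python
import json
import urllib.parse
import urllib.request

API = "https://archive.org/wayback/available"

def availability_query(url, timestamp=""):
    """Build the request URL for the Wayback Machine availability endpoint."""
    return API + "?" + urllib.parse.urlencode(
        {"url": url, "timestamp": timestamp})

def closest_snapshot(url, timestamp=""):
    """Return the archived snapshot URL closest to `timestamp` (YYYYMMDD),
    or None if the Internet Archive never crawled the page."""
    with urllib.request.urlopen(availability_query(url, timestamp)) as resp:
        data = json.load(resp)
    closest = data.get("archived_snapshots", {}).get("closest")
    return closest["url"] if closest and closest.get("available") else None

# e.g. closest_snapshot("www.example.org", "20050101")
```

The same lookup works from a browser by pasting the query url in directly; the Python version is just handy if you have a long list of dead links to check.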
Heritage funders have become more aware of the problem of disappearing heritage websites, and have written sensible requirements into their funding streams. But they rarely have the resources to police it. I think a better approach would be a top-level agreement between the UK Web Archive and the Heritage Lottery Fund, so that the web outcomes of heritage projects are automatically archived in a special HLF collection.
Until then, the best advice is to pay up front for your web domain for five years after the project’s completion. Make sure you nominate, and get copyright permission for, your site so that it’s added to the UK Web Archive and available online beyond the five years.
What gets archived?
Web crawlers can’t grab databases. So the search button or image collection on your website won’t work on the archived copy. But crawlers can capture snapshots of what each page looks like, so you can recreate how the website looked and felt back in the day. Crawlers can also cope with simple files such as pdfs, jpegs, mp3s and .doc files as long as they are hosted by the website itself. This is great news for reports, newsletters and photos.
But modern websites are actually hosting less and less themselves. They are becoming collections of links to social media, a reflection of the fact that social media is where the real action is. This is a problem because all you’ll see on the archive copy of the website is a blank box where Twitter used to sit, and another blank box where Vimeo used to be embedded.
So what can we do about social media and streaming sites?
Well, the good news is the Library of Congress is signed up to archive all tweets ever. But there’s no access to the archive yet, and it’s not clear when this is going to happen, or who will be able to see what. You can at least ask Twitter for a copy of your own Twitter archive (it’s easy – just go to Settings > Account and click the button). You’ll get a link to download a csv file of every tweet and retweet you have sent. You can open it up as a spreadsheet. This will be a really useful tool for historians, who can mine the data for text and statistics.
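As a sketch of the sort of mining historians might do, here’s how you could count word frequencies across a downloaded tweet archive in Python. The column name ("text") is an assumption – check the header row of your own export, as Twitter has changed the format over time:

```python
import csv
from collections import Counter

def word_frequencies(path):
    """Count word occurrences across every tweet in the archive CSV.
    Assumes a column headed "text" -- check your own export's header row."""
    counts = Counter()
    with open(path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            counts.update(row["text"].lower().split())
    return counts

# e.g. word_frequencies("tweets.csv").most_common(10)
```

Because the export is plain csv, the same file opens straight into Excel or LibreOffice for anyone who’d rather not write code.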
But, again, there are a number of issues. Much of Twitter is a conversation about links to articles or photos (actually just links to photos). In your archive you’ll only have one half of these conversations. Also, the links will break if the websites to which they point change. If any of the links are broken (as many of them will be next year, never mind in twenty years’ time) you’ve lost the basis of the conversation. Or at least you might have to go to an archived copy of the source to find out what it was. This is why linking data with persistent urls is so crucial.
Having said all this, paper archives are not complete and archivists are used to doing what they can. For example, I just advised the Bishop of Manchester how to archive his excellent Twitter stream and I’m really looking forward to accessioning that csv file. Gotta work with what we have – no point worrying about what we can’t keep.
Manchester City Council started webcasting its committee meetings a while back. Whether these videos capturing councillors’ fashion sense and accents are critical historical records is questionable, but in any case we are committed to keeping them. They used to use a Vimeo embed for this. So we have a pile of DVDs from Manchester’s Digital Communications team full of the source mp4s to add to our digital archive. They have now switched to a new system and I’m not sure yet how this is going to be archived. The website in any case doesn’t have a .uk in its url. I suspect we’ll have to get the source files as before.
How do we archive these digital files?
Well, first we choose what to keep. Then we check them for viruses (just as we would check a paper archive for bugs). We catalogue them (pretty much as we would any paper archive). We capture information about them like file types which will help us ‘normalise’ them in future if the software they rely on becomes obsolete. And then we transfer them (making sure we ‘checksum’ the files at every stage) to secure, backed-up server space. It’s a pretty time-consuming process.
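The checksum step above is the one that keeps us honest: hash each file before transfer, then confirm the copy on the server produces the same digest. A minimal sketch, using SHA-256 (any strong hash works so long as it’s recorded consistently):

```python
import hashlib

def checksum(path, algorithm="sha256", chunk_size=1 << 20):
    """Return a file's hex digest, reading in chunks so that large
    audio/video accessions don't have to fit in memory."""
    h = hashlib.new(algorithm)
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify(original, copy):
    """True if the transferred copy is bit-for-bit identical."""
    return checksum(original) == checksum(copy)
```

Running this at every stage of the transfer, and keeping the digests alongside the catalogue record, means any later corruption or tampering shows up as a mismatch.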
And then nobody asks to see them. Or at least nobody has yet. But I’m sure they will at some point. Within a few years all the records we receive will be digital. All the school registers, church minutes, photographs. Everything. This will transform our storage requirements, but it will also entail an awful lot of training for our staff and users. Thankfully people like the Digital Preservation Coalition and the Archives and Records Association are on hand to advise and to look ahead.
Just this week the North West Region Digital Preservation Group is starting an Archives and Records Association-funded month-long trial of Preservica, a digital preservation tool that uses cloud storage. I hope that, one way or another, we can find a solution for storage and access that will help us automate some aspects of the digital archives process so that archivists can get on with what we’re trained to do. Selecting and describing the stuff.