We need to
Web archive tool and development
Warcbase is a web archiving platform - it speeds up access to Wayback Machines; it's a way to analyze web archives
works really well on a Raspberry Pi, a personal laptop, or a cluster
in a nutshell - it takes a WARC file and allows you to:
They started developing Warcbase 3 years ago - with Jimmy Lin in computer science
Research use cases:
IIPC - the focus up until the last few years has been on grabbing everything, without a lot of thought about access
there is a survey coming out, and next week at IIPC they will present stats on why people are or are not using archives
Current researchers in LUNA, UK Archive people,
Michigan just ran a great conference on web archiving research
Basic sense of what researchers want:
you need more than the Wayback Machine - you have to know the URL, and you can only browse one page at a time
we need to scale up
while we need search, the search needs to be intelligible
Keyword query with no prioritization of results
The goal of Warcbase is to make things transparent because if
Collaboration between computer scientists and historians - it is working well because the team is half CS and half historians
Historians will say, "I would really like the power to do X" - they create a ticket, CS responds to the ticket and they respond - research questions are actually guiding the development
the importance of doing things open source
hackathon last month - about building a community that gains enough momentum
A lot of initial development required the researcher to dig into the tool - what they are trying to do is use Jupyter notebooks (ideally they have someone to spin up the notebooks - Vagrant), point the web browser at it and then
goal of development - people use Jupyter to prototype what they want to do and then paste that code into the shell
publishing on it - working in an interdisciplinary group (librarians, historians, and CS) - the key to success - everybody wants to publish in different scholarly communities
technical stuff is ending up
the librarian presented at Code4Lib and published in their journal
arts and humanities computing
published an early web archiving piece in
trying to put something together for Digital Humanities Quarterly
he is working on what it was like being a kid building a website in the 1990s
everybody has to be happy - Jimmy needs a reason to collaborate, Ian does and the librarian does - tenure
training - they are running a pilot training workshop in Iceland; if that works, they want to make it universal
one of the goals this summer, RAs that don't have
Software Carpentry workshop - Python, GitHub, etc.
relationship to the library:
will it continue to be run like an open source community or be hosted in the library - he prefers a consortial model for providing resources to support
not having faculty status for librarians at Waterloo means they don't have the bandwidth for research
what can we do to improve:
the most important thing is documentation - if you guys are setting up a collection, you need to have your seed list written down, along with why you decided to collect what you collect
the Canadian political parties collection was set up in 2005, but there was no documentation from the librarian who set it up, so if I publish, a peer reviewer will immediately ask why these sites and not those
beyond documentation, there is the debate between running Heritrix yourself or using Archive-It - Archive-It has a great community, but Nick has been running Heritrix by himself, and that allows him to run it on an experimental basis
advertising is good - University of Toronto - you need to advocate and do outreach ("we have this stuff, if you want to use it") or point people to the Warcbase workshop
social media is super important - the debate is whether you should use Archive-It or the Twitter API to download the data - intersection between
this summer, they used the SHINE program that the UK launched; it just provided faceted search - Columbia human rights also uses a faceted search engine
Wayback Machine / simple Archive-It keyword search - it is overwhelming; if you have good metadata you can start faceting down (dates, languages, subjects) - auto-extracted metadata
the issue is that the underlying data for websites is messy; you have to
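A rough sketch of what "faceting down" means in practice, assuming each archived page already has a small metadata record. The records, field names, and the helpers `facet_counts` and `narrow` are all made up for illustration; this is not Warcbase or SHINE code.

```python
from collections import Counter

# Hypothetical, hand-made metadata records; in practice these would be
# auto-extracted from the archive rather than typed in.
records = [
    {"url": "http://example.org/a", "year": 2005, "language": "en", "subject": "politics"},
    {"url": "http://example.org/b", "year": 2005, "language": "fr", "subject": "politics"},
    {"url": "http://example.org/c", "year": 2006, "language": "en", "subject": "politics"},
    {"url": "http://example.org/d", "year": 2006, "language": "en", "subject": "health"},
]


def facet_counts(records, field):
    """Count how many records fall under each value of one facet."""
    return Counter(r[field] for r in records)


def narrow(records, **facets):
    """Keep only the records that match every chosen facet value."""
    return [r for r in records if all(r.get(k) == v for k, v in facets.items())]


# Show the available language facets, then drill down by year and language.
print(facet_counts(records, "language"))
for r in narrow(records, year=2005, language="en"):
    print(r["url"])
```

The point is that each chosen facet shrinks the result set, instead of a flat keyword list with everything in it.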
he works with the sociologists and network scholars and Matt Weber at Rutgers - when they look at web archives, he wants content and they want the graphs
social scientists want to look at networks, entities and things in a non-digital way
historical contribution: the longitudinal dimension
CS: what you need now
historians are looking for change over time
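To make the content-versus-graphs split concrete, here is a minimal sketch of pulling a domain-to-domain link graph out of page HTML using only the Python standard library. The sample pages, the `LinkExtractor` class, and `domain_edges` are illustrative assumptions, not how Warcbase itself does it.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkExtractor(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def domain_edges(pages):
    """Turn (url, html) pairs into counts of (source_domain, target_domain) links."""
    edges = Counter()
    for url, html in pages:
        parser = LinkExtractor()
        parser.feed(html)
        source = urlparse(url).netloc
        for href in parser.links:
            # Resolve relative links against the page URL before taking the domain.
            target = urlparse(urljoin(url, href)).netloc
            if target and target != source:
                edges[(source, target)] += 1
    return edges


# Hypothetical pages standing in for records read out of an archive.
pages = [
    ("http://party-a.example/home",
     '<a href="http://party-b.example/">B</a> <a href="/about">About</a>'),
    ("http://party-b.example/",
     '<a href="http://party-a.example/news">A</a><a href="http://party-a.example/">A</a>'),
]
edges = domain_edges(pages)
for (source, target), count in sorted(edges.items()):
    print(source, "->", target, count)
```

The network scholars would work from the edge counts; the historian would still go back to the page text itself.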
in a dream world we would love to see other folks using it - so if we use it, we should be in contact with the team; open source stack
if we decide to
the fun of Archive-It and doing funky stuff with it
how researchers get access across institutional collections - they have been exploring with the UT legal team - what kind of MOU do we need to ask for WARC files and
U of Victoria and Alberta have been totally open to it - with the caveat of citations for the source and not sharing the WARC files; when they create public-facing things, they allow the libraries to review them
when they finally get to the document, it will be cited as University of Toronto; he shared a sample MOU document (between the donating university and the PI on the grant) - same with the broader consortial project - agreement between Ian and the University of Alberta
Web Archives for Longitudinal Knowledge