2016-04-07 Meeting notes

Date

Attendees

Agenda

    • GUEST SPEAKER: Ian Milligan, University of Waterloo (Understanding the needs of researchers who want to use web archives as a data set)
      1. What are your thoughts on how researchers are using/want to use web archives? How has this informed Warcbase development?
      2. What was your process for determining the best way to access, store and interrogate the data set when you were working with Archive-It?
      3. Is there a training component currently in place at Waterloo for faculty and students who want to start using Warcbase? If so, where is this training/instruction provided - through the library?
      4. What recommendations do you have for people just starting web-archiving programs? More specifically, what parameters/workflows/specifications should new web-archiving programs set in order to optimize usability and access for future researchers?
    • Round Robin:
      • Upcoming conference presentations?
      • Article/publication CfPs of interest?
      • Policy development
      • Repository progress
      • Campus trends/initiatives and how those may be impacting repository activities

Discussion items

Time | Item | Who | Notes
Round Robin
  • Abby:
    • Porter Olsen is coming to UT from UofM to do research using the Gabriel García Márquez Collection; this will provide concrete feedback on viable access policies for born-digital archival material
  • Laura:
    • Studies the preservation of Islamic cinema from the 1970s and 1980s - even though there is a state-run means for preserving film archives, the country restricts what foreigners can access there, so there has not been a lot of oversight of the actual preservation strategies or efforts being employed
    • When officials in the new regime came in, post-1979, a lot of material was destroyed or damaged
    • Films are in the hands of both the official archives and private collectors who maintain and in some cases reformat them; some of those reformatted films are now popping up online on websites that are not strictly accessible in Iran
  • Shannon:
    • ExMO Collection does not currently have a lot of digital material but is working on collection development in that area
  • Ashley:
    • Just received a large collection of floppy disks from an anthropologist, mostly from the 1990s - the transcripts to the oral history are on those disks and there is a researcher coming to do research; working on an access solution
    • the primary preservation storage approach is vaulting things to tape but the tape drive has failed and is being updated
  • Marianna:
    • Joined the data management committee - focused on institutional data
    • Documentum for faculty and staff
    • Enforces records management laws
    • Might go live next summer - August of 2017
    • Academic data is not included because of IP for students and faculty
  • Iraq invasion
    • So much data but not a lot of access; not clear on how to use it
    • Iraqi issue and Kurdistan issue
    • They have a lot of artifacts about the
  • Melanie:
    • Focused on metadata rather than preservation
    • Texas Conference on Digital Libraries
    • post-custodial model:
    • co-hosting a BoF on Fedora-based repositories to get folks to come into town for the conference
  • Katie:
    • humanities librarian in architecture
    • looking at the preservation of architectural records
    • digital scholars in practice series
 Ian Milligan 
  • We need to
  • Web archive tool and development
  • Warcbase is a web archiving platform - speeds up access to Wayback Machines; it's a way to analyze web archives
  • works really well on a Raspberry Pi, personal laptop or in a cluster
  • in a nutshell - it takes a WARC file and allows you to:
    • look for length structure
    • topic modeling
    • setting up the data
    • network graph
  • They started developing Warcbase 3 years ago - with Jimmy Lin in computer science
  • Research use cases:
    • IIPC - the focus up until the last few years has been about grabbing everything, without a lot of thought on access
    • there is a survey coming out and next week at IIPC they will present stats on why or why not people are using archives
    • Current researchers in LUNA, UK Archive people,
    • Michigan just ran a great conference on web archiving research
    • Basic sense of what researchers want:
      • you need more than the Wayback Machine - you have to know the URL, and you can only browse one page at a time
      • we need to scale up
      • while we need search, the search needs to be intelligible
  • Keyword query with no prioritization of results
  • The goal of Warcbase is to make things transparent because if
  • Collaboration between computer scientists and historians - it is working well because the team is half CS and half historians
  • Historians will say, "I would really like the power to do X" - they create a ticket, CS responds to the tickets and they respond - research questions actually guiding the development
  • the importance of doing things open source
  • hackathon last month - about building a community that gains enough momentum
  • A lot of initial development required the researcher to dig into the tool - what they are trying to do is use Jupyter notebooks (ideally they have someone to spin up the notebooks - Vagrant), point the web browser at it and then
  • goal of development - people use Jupyter to prototype what they want to do and then paste that code into the shell
  • publishing on it - working in an interdisciplinary group (librarians, historians, and CS) - the key to success - everybody wants to publish in different scholarly communities
    • technical stuff is ending up
    • librarian presented at Code4Lib and their journal
    • arts and humanities computing
    • published an early web archiving piece in
    • trying to put something together for Digital Humanities Quarterly
    • he is working on what it was like being a kid building a website in the 1990s
    • everybody has to be happy - Jimmy needs a reason to collaborate, Ian does and the librarian does - tenure
  • training - they are running a pilot training workshop in Iceland; if that works they want to make that universal
  • one of the goals this summer, RAs that don't have
  • Software Carpentry workshop - Python, GitHub, etc.
  • relationship to the library:
    • will it continue to be run like an open source community or be hosted in the library - he prefers a consortial model for providing resources to support
    • not having faculty status for librarians at Waterloo means they don't have the bandwidth for research
  • what can we do to improve:
    • the most important thing is documentation - if you are setting up a collection, you need to have your seed list written down, along with why you decided to collect what you decided to collect
    • the Canadian political parties collection was set up in 2005, but there was no documentation from the librarian that set it up, so if I publish, a peer reviewer will immediately ask why these sites and not those
    • beyond documentation, the debate between using Heritrix yourself or Archive-It - Archive-It has a great community, but Nick has been doing Heritrix by himself and that allows him to run that on an experimental basis
    • advertising is good - University of Toronto - you need to advocate and do outreach that we have this stuff - and if you want to use it or point them to the Warcbase workshop
    • social media is super important - the debate is whether you should use Archive-It or the Twitter API to download the data - intersection between
  • this summer, they used that SHINE program that the UK launched; it just provided faceted search - Columbia human rights also uses a faceted search engine
  • Wayback Machine - with simple Archive-It keywords it is overwhelming; if you have good metadata you can start faceting down (dates, languages, subjects) - auto-extracted metadata
    • they found that was cool and they got a lot of news coverage - people then go, this is cool, but it raises the question of something more sophisticated - the gateway to Warcbase
  • the issue is that the underlying data for websites is messy; you have to
  • he works with sociologists and network scholars, and Matt Weber at Rutgers - when they look at web archives, he wants content and they want the graphs
    • social scientists want to look at networks, entities and things in a non-digital way
    • historical contribution: the longitudinal dimension
    • CS: what you need now
    • historians are looking for change over time
  • in a dream world they would love to see other folks using it - so if we use it, we should be in contact with the team; open source stack
  • if we decide to
  • the fun of Archive-It and do funky stuff with it
  • how researchers get access across institutional collections - they have been exploring this with the UT legal team - what kind of MOU do we need to ask for WARC files and
  • U of Victoria and Alberta have been totally open to it - with the caveat of citations for the source and not sharing the WARC files; when they create public-facing things, they allow the libraries to review them
  • when they finally get to the document, it will be cited as University of Toronto; he shared a sample MOU document (between the donating university and the PI on the grant) - same with the broader consortial project - agreement between Ian and the University of Alberta
  • Web Archives for Longitudinal Knowledge

Action items

  •