What We've Done

To meet the need for normalized, high-quality historic fish occurrence data at various spatial scales, we've done the following:

find data: We estimate that we've contacted nearly one hundred potential data providers, and we scoured the internet for more data. Data came in many formats (spreadsheets, text files, and paper), each requiring different treatment. We also sought out personal accounts of species observations from researchers and the public, and extracted records from the literature (those in track 3 only).
data entry: Data that were not digital had to be hand-entered into digital spreadsheets.
re-formatting (normalizing): Data often came to us in various formats, meaning that a single data field could present essentially the same information in several ways. For example, species names (a single atomized data field for us) were variously combined with higher taxonomy, reduced to a simple species name, or accompanied by a common name. Dates came in various arrangements of day, month, and year, sometimes with Roman numeral months and sometimes with months written out in text. We had to convert those, as well as other fields, into single unified formats.
compile data: We brought all of the disparate datasets into one relational database. The first version was in Microsoft Access and later versions in MySQL and PostgreSQL.
georeference text locations: Since little of our data originally came to us with spatial coordinates, we manually assigned coordinates, along with an error estimate, to each record. We could then visualize the records on a map.
synonymizing taxa and collector names: Species names were often misspelled or provided under multiple historic names, and more rarely as common names, because providers used historical taxonomies and updated them irregularly over a long period of time. We synonymized these variants, along with collector names, so that each resolves to a single standardized form.
detect errors (usually via visualization on a map): Once records were georeferenced, we could map them species by species and spot outliers. The other common method of detecting errors was to group records into collecting events based on combinations of date, locality name, and collector name.
verify/correct determinations: We examined thousands of specimens because we had flagged them as outliers, because they weren't identified to species level, or because they belonged to a species easily mistaken for another.
verify data against documentation: Often outlier specimens, once we had verified that they were correctly determined, turned out to have erroneous locations that could be corrected by examining ledgers, field notes, original labels, and published manuscripts or maps. Georeferences and dates that were imprecise or incorrect could often be refined as well.
photograph specimens, field notes and jar labels: When field notes were available, we photographed them and provide the images. We also photographed specimens and jar labels, prioritizing those that represent edge-of-range records or that document unusual or rare populations.
preserve original data: Original donor verbatim data are preserved and displayed alongside our edited version.
publish data (including useful summaries): All data are published on GBIF and on our website.
publish research products (models, conservation areas): We publish data summary tools, species distribution models, native fish conservation areas, and other products on our stats tab.
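
To make the re-formatting step more concrete, here is a minimal Python sketch of how dates with Roman numeral months or written-out months might be converted to a single format. It is an illustration only, not the code we used: the function name normalize_date, the choice of ISO 8601 (YYYY-MM-DD) as the unified format, and the decision to return None for all-numeric dates whose day/month order is ambiguous are all assumptions made for this example.

```python
import re
from datetime import date
from typing import Optional

# Roman numeral months, as sometimes written on older specimen labels.
ROMAN_MONTHS = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5, "VI": 6,
                "VII": 7, "VIII": 8, "IX": 9, "X": 10, "XI": 11, "XII": 12}

# English month names and their three-letter abbreviations.
_ENGLISH = ["january", "february", "march", "april", "may", "june", "july",
            "august", "september", "october", "november", "december"]
MONTH_NAMES = {}
for number, name in enumerate(_ENGLISH, start=1):
    MONTH_NAMES[name] = number
    MONTH_NAMES[name[:3]] = number


def month_number(token: str) -> Optional[int]:
    """Return 1-12 for a Roman numeral or English month token, else None."""
    if token.upper() in ROMAN_MONTHS:
        return ROMAN_MONTHS[token.upper()]
    return MONTH_NAMES.get(token.lower())


def normalize_date(raw: str) -> Optional[str]:
    """Hypothetical helper: return an ISO 8601 date string, or None if unparseable."""
    # Split on whitespace, commas, slashes, hyphens, and periods.
    tokens = [t for t in re.split(r"[\s,/\-.]+", raw.strip()) if t]
    if len(tokens) != 3:
        return None

    if all(t.isdecimal() for t in tokens):
        # All-numeric dates are accepted only in year-first order;
        # anything else is ambiguous (day/month vs. month/day).
        if len(tokens[0]) != 4:
            return None
        year, month, day = (int(t) for t in tokens)
    else:
        first, second, year_text = tokens
        if not (year_text.isdecimal() and len(year_text) == 4):
            return None
        if first.isdecimal():
            day_text, month_text = first, second
        elif second.isdecimal():
            day_text, month_text = second, first
        else:
            return None
        month = month_number(month_text)
        if month is None:
            return None
        year, day = int(year_text), int(day_text)

    try:
        return date(year, month, day).isoformat()
    except ValueError:
        # Impossible calendar dates (e.g. 31 February) are rejected.
        return None


for example in ["12-VII-1953", "July 12, 1953", "12 Jul. 1953", "7/12/1953"]:
    print(example, "->", normalize_date(example))
```

In a sketch like this, values that come back as None would be set aside for manual review rather than guessed at, and the original verbatim value would be kept alongside the normalized one.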