No specimen left behind: Collections digitisation at the NHM, London*

Slide 0

No specimen left behind: Collections digitisation at the NHM, London*
Vince Smith
Collections for the 21st Century, Florida, 5-6 May 2014

Slide 1

Some history…
“the rate of progress by the UK taxonomic institutions in digitising and making collections information available is disappointingly low… there is a significant risk of damage to the international reputation of major institutions such as The Natural History Museum”
House of Lords Science and Technology Committee Report on Taxonomy and Systematics, 2009

Slide 2

Digitisation rates at the NHM (circa 2009)

Slide 3

The prevailing attitude to collections digitisation
“Digitizing all specimens is not an achievable aim at present”
2010 GBIF Task Group: Global Strategy and Action Plan for the Digitisation of Natural History Collections. Biodiversity Informatics 2010, 7: 120 – 129

Slide 4

More technology, more automation, more speed: whole drawer scanning, herbarium sheet scanning, microscope slide scanning

Slide 5

European collections rising to the challenge: large-scale data capture & digitisation in France, Netherlands & Finland

Slide 6

NHM London Science Strategy 2013-17: A New Voyage of Discovery
Three Focal Areas: 1. Scientific discovery 2. Scientific infrastructure 3. Scientific engagement
Five Challenges: 1. The Digital NHM 2. Origins, evolution & futures 3. Biodiversity discovery 4. Natural resources & hazards 5. Science, society & skills
Resources & funding
Measuring success

Slide 7

Digitisation target: 20M specimens available by 2017 (data.nhm.ac.uk/globe/)

Slide 8

A long way to go, practically, technically & culturally…
NHM collections comprise c.80m objects. Physical register: c.5m. Digital data: 2.8m. Images: 350k.
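To put the figures on this slide in proportion, the short sketch below expresses each count as a share of the c.80m objects. It only does arithmetic on the numbers quoted above; the function and variable names are illustrative and not part of any NHM system.

```python
# Approximate figures quoted on the slide ("c." values in the source).
TOTAL_OBJECTS = 80_000_000
holdings = {
    "physical register": 5_000_000,
    "digital data": 2_800_000,
    "images": 350_000,
}


def share_of_collection(count, total=TOTAL_OBJECTS):
    """Return count as a percentage of the whole collection."""
    return 100.0 * count / total


for label, count in holdings.items():
    print(f"{label}: {share_of_collection(count):.2f}% of c.80m objects")
```

On these numbers, digital records cover a few percent of the collection and images well under one percent.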

Slide 9

NHM Digital Collections Programme
A 2, 5 and 10 year plan...
To collate, organise and make available one of the world’s most important natural history collections as a digital resource, delivering: an online specimen / lot-level database to manage all holdings; core meta-data and / or images for key parts of the collection; flexible informatics tools
£750,000 for first 2 years

Slide 10

Outline
Why: Internal objectives & benefits; Research opportunity (the iCollections example)
What: How much data to digitise; Linking digitisation effort to project benefits
How: Digi-street pilots, quick wins (herbarium, drawer & slide scanning); Crowdsourcing pilots & options
Where: NHM Data Portal; External portals (e.g. GBIF, Europeana)
Links: Crowdfunding; H2020 projects (COST, SYNTHESYS, LOD, VRE, Dig. Inf.); Other museums, herbaria & partners (e.g. CETAF & publishers)
When

Slide 11

1. Why: Objectives

Slide 12

1. Why: Research opportunity & the iCollections pilot
Using the NHM collections to track long-term seasonal response of butterflies to climate change
Digitisation of British and Irish Lepidoptera collection: species poor, specimen rich; ~500,000 specimens, 5,000 drawers
Re-curation, imaging, label data, georeferencing: ~25% complete (started Jan. ’13)
About 50% of specimens ‘useable’
Many specimens in most years (late 19th century to 1970)
Provides a longer time perspective than most observational records (BMS post-1976)

Slide 13

1. Why: Research opportunity & the iCollections pilot
[Figure: Relationship between 10th percentile collection date of Anthocharis cardamines (Orange tip) and mean Mar. – May temp. (N.B. temp. axis reversed)]
1900-2000: strong correlation between initial collection dates & temperature
Critical marker on phenological response prior to recent rapid climate change
Longer time perspective than most observational records (BMS post-1976)
Museum data available for rare or hard to record species
An example of unique biological and ecological data from collections
Brooks, Self, Toloni & Sparks, 2014, Int. J. Biometeorol. DOI 10.1007/s00484-013-0780-6
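The kind of analysis summarised on this slide can be sketched roughly as follows: take the 10th percentile collection day for each year and correlate it with mean spring temperature. This is a minimal illustration, not the method of the cited paper. The records and temperatures are invented, and the linear-interpolation percentile and the Pearson coefficient are assumptions about how such a relationship might be measured.

```python
from collections import defaultdict
from math import sqrt


def percentile_10(values):
    """10th percentile by linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = 0.1 * (len(ordered) - 1)
    lower = int(pos)
    upper = min(lower + 1, len(ordered) - 1)
    return ordered[lower] + (pos - lower) * (ordered[upper] - ordered[lower])


def early_collection_dates(records):
    """Map year -> 10th percentile day-of-year across that year's specimens."""
    by_year = defaultdict(list)
    for year, day_of_year in records:
        by_year[year].append(day_of_year)
    return {year: percentile_10(days) for year, days in by_year.items()}


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Invented (year, day-of-year) label data, for illustration only.
records = [
    (1901, 130), (1901, 135), (1901, 140),
    (1902, 120), (1902, 126), (1902, 131),
    (1903, 110), (1903, 118), (1903, 125),
]
# Invented mean Mar.-May temperatures in degrees C, for illustration only.
spring_temp_c = {1901: 8.0, 1902: 9.0, 1903: 10.0}

dates = early_collection_dates(records)
years = sorted(dates)
r = pearson_r([spring_temp_c[y] for y in years], [dates[y] for y in years])
print(f"correlation between spring temperature and early collection date: {r:.3f}")
```

A negative coefficient here means earlier collection dates in warmer springs, which is the direction of the relationship the slide describes.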

Slide 14

2. What: Linking data capture effort to research benefits

Slide 15

3. How: Digi-street pilots (Herbarium Sheets) PROCESS

Slide 16

3. How: Digi-street pilots (Herbarium Sheets)
33k specimens per day, 3 shifts (6am-10pm); Netherlands collection complete in 1.5 years
€1.29 per specimen image (if outsourced), transcription at similar cost
Video of herbarium sheet digitisation (not available on SlideShare version of this presentation)
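The throughput and unit cost quoted above lend themselves to a quick back-of-envelope estimate. The sketch below assumes the rate and price hold constant at any scale, which the slide does not claim, and the batch size is a hypothetical value chosen for illustration.

```python
import math

SPECIMENS_PER_DAY = 33_000   # throughput quoted on the slide
COST_PER_IMAGE_EUR = 1.29    # outsourced imaging cost quoted on the slide


def imaging_days(specimens, per_day=SPECIMENS_PER_DAY):
    """Whole working days needed to image the given number of specimens."""
    return math.ceil(specimens / per_day)


def imaging_cost_eur(specimens, unit_cost=COST_PER_IMAGE_EUR):
    """Imaging cost in euros, excluding transcription."""
    return specimens * unit_cost


batch = 1_000_000  # hypothetical batch size, not a figure from the talk
print(f"{batch:,} sheets: {imaging_days(batch)} days, "
      f"about EUR {imaging_cost_eur(batch):,.0f} for imaging")
```

Since the slide puts transcription at a similar cost, a combined estimate would be roughly double the imaging figure.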

Slide 17

3. How: Digi-street pilots (Drawer scanning & segmentation)
SatScan whole drawer scanning: 30 million specimens, 130k drawers
Fast, high res. multi-specimen drawer images (5 mins. each)
No specimen handling
Limited drawer / unit tray metadata, plus identifiers
Specimen segmentation problem: digital and physical collection gets out of sync
Need to automate specimen segmentation
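To make the segmentation problem concrete, here is a toy sketch of one simple approach: threshold the drawer image and group dark pixels into connected regions, giving one bounding box per specimen. It is not the segmentation software referred to in the talk. It assumes specimens are darker than the drawer background and do not touch, and the tiny grid of grey values is invented.

```python
from collections import deque


def segment_specimens(image, threshold=128):
    """Return bounding boxes (top, left, bottom, right) of dark regions.

    image is a list of rows of grey values (0-255). Pixels below the
    threshold are treated as specimen, everything else as background.
    """
    rows, cols = len(image), len(image[0])
    seen = [[False] * cols for _ in range(rows)]
    boxes = []
    for r in range(rows):
        for c in range(cols):
            if seen[r][c] or image[r][c] >= threshold:
                continue
            # Breadth-first flood fill over 4-connected dark pixels.
            queue = deque([(r, c)])
            seen[r][c] = True
            top, left, bottom, right = r, c, r, c
            while queue:
                y, x = queue.popleft()
                top, bottom = min(top, y), max(bottom, y)
                left, right = min(left, x), max(right, x)
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and not seen[ny][nx]
                            and image[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes


W, D = 255, 40  # invented grey values: white background, dark specimen
drawer = [
    [W, W, W, W, W, W],
    [W, D, D, W, W, W],
    [W, D, D, W, D, W],
    [W, W, W, W, D, W],
    [W, W, W, W, W, W],
]
boxes = segment_specimens(drawer)
print(boxes)
```

Real drawer images add pins, labels, shadows and overlapping specimens, which is why a plain threshold like this is unlikely to be enough in practice.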

Slide 18

3. How: Digi-street pilots (Drawer scanning & segmentation) Starting image

Slide 19

3. How: Digi-street pilots (Slide scanning)
1. Slides cleaned & barcoded
2. Loaded into hopper (50-100)
3. High resolution scan
4. Images stored & databased

Slide 20

3. How: Crowdsourcing pilot
NHM Bird registers: no advertising, hard to transcribe, a challenging starting project
1 user with 32,629 transcriptions! 92 users with 100+ transcriptions; 363 users with 1 transcription
[Chart: log no. of records transcribed, by ranked users]

Slide 21

3. How: Crowdsourcing options
Zooniverse Projects; Smithsonian Digital Volunteers; Wikisource transcription (WiR); Herbarium@Home
Next steps: survey and review of natural history transcription projects, cf. paying transcribers

Slide 22

4. Where: NHM Data Portal
A focus for deposition and discovery of NHM research & collections data
Stable, citable identifiers on datasets & specimen / lot records
Transparent data quality (un-reviewed, reviewed, reviewed & updated)
Download (DwCA), web-services & Linked Open Data
Built using CKAN, with enhanced mapping functionality
[Screenshots: search; datasets matching criteria; individual dataset results; browse & search criteria; mapping, table & statistical views]
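Because the portal is built on CKAN, dataset discovery could in principle go through CKAN's standard Action API. The sketch below builds a `package_search` request URL and reads dataset titles from a response in CKAN's usual JSON envelope. The base URL is an assumption inferred from the data.nhm.ac.uk address on an earlier slide, the sample response is invented, and nothing here confirms which endpoints the NHM portal actually exposes.

```python
import json
from urllib.parse import urlencode

PORTAL = "https://data.nhm.ac.uk"  # assumed base URL, not confirmed by the talk


def package_search_url(query, rows=10, base=PORTAL):
    """Build a CKAN Action API URL that searches datasets by free text."""
    return f"{base}/api/3/action/package_search?" + urlencode(
        {"q": query, "rows": rows}
    )


def dataset_titles(response_text):
    """Extract dataset titles from a CKAN package_search JSON response."""
    payload = json.loads(response_text)
    if not payload.get("success"):
        raise ValueError("CKAN reported an unsuccessful request")
    return [pkg.get("title", pkg.get("name")) for pkg in payload["result"]["results"]]


# Invented response in the shape CKAN returns, so no network call is needed.
sample_response = json.dumps({
    "success": True,
    "result": {
        "count": 1,
        "results": [{"name": "example-dataset", "title": "Example dataset"}],
    },
})

print(package_search_url("Coleoptera"))
print(dataset_titles(sample_response))
```

Keeping URL construction separate from response parsing means the parsing can be checked against a canned response before any request is made.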

Slide 23

4. Where: External Portals
Flickr, GBIF, Europeana (e.g. NHM Coleoptera)
NHM almost getting data to GBIF!
Submitting to Europeana portal (via Open-Up)
Niche collections on Flickr
Robust API services
Gateway to image analysis projects (e.g. species recognition & trait extraction tools)

Slide 24

5. Links
Crowdfunding: Personalizes donation; Scales well; Requires lots of data; Most crowdsourcing platforms unsuitable; Potential for a data visualization to support our needs
H2020 Projects: EU Research & Innovation funding programme, €80 billion from 2014-2020; Strong record (EDIT, ViBRANT, SYNTHESYS1/2/3); 5 proposals in development for 2014/15; Better alignment with Digital Collections Programme
Partners: Major museums & herbaria (Kew, Smithsonian, & Euro.6); Umbrella organisations & projects (GBIF, CETAF, iDigBio); Universities (e.g. on image analysis); Data publishers (engagement on data & systems)

Slide 25

6. When: Key dates over next 2 years
Herbarium scanning: Pilot – TBC (starting late-2014)
Drawer scanning: Segmentation software (Aug. 2014); Pilots (ongoing)
Slide scanner: Testing 6 systems (complete); Procurement / purchase (July 2014); Pilot projects & system integration (from Sept. 2014)
Crowdsourcing pilots: Draft review paper (Aug. 2014); Additional Notes from Nature project (early 2015)
NHM Data Portal: Internal release (June 2014); Public release (Jan. 2015)
Funding: H2020 projects (submitted, Sept. 14 & Jan. 15)

Slide 26

Acknowledgements
Digital Collections Programme. Planning: Ian Owens, Ben Atkinson, Dave Thomas, Andy Purvis, Emilie Smith & Vince Smith.
iCollections. Project team: Gordon Paterson, Geoff Martin, Martin Honey, Blanca Huertas, Darrell Siebert, Vladimir Blagoderov, Steve Cafferty, Adrian Hine, Chris Sleep, Mike Sadka, Elisa Cane, Lyndsey Douglas, Joanna Durant, Gerardo Mazzetta, Flavia Toloni, Peter Wing, Malcolm Penn & Liz Duffle. Research: Steve Brooks, Angela Self, Flavia Toloni & Tim Sparks.
Drawer scanning. NHM Satscan development: Vladimir Blagoderov, Laurence Livermore & Vince Smith. Software: Pieter Holtzhausen & Stefan van der Walt (Stellenbosch University).
Slide scanner. Testing: Vladimir Blagoderov & Alex Ball.
Crowdsourcing. Pilots (NHM team): Tim Conyers, Lawrence Brooks & Adrian Hine. Review paper: Laurence Livermore & Vince Smith.
NHM Data Portal. Project team: Vince Smith, Darrell Siebert, Dave Thomas & Adrian Hine. Development: Ben Scott & Alice Heaton.
Apologies to anyone I have missed!

Slide 27