The aim of the Joint Research Activity (JRA) is to improve the quality of and increase access to digital collections and data within natural history institutions’ virtual collections.
Objective 1: Automated data collection from digital images includes the following tasks:
- Automatic processing (segmentation) of digital images: research and development of edge detection technology to locate and classify multiple regions of interest within images of NH specimens.
- Automatic metadata capture: Develop software that will automatically identify properties of an image.
- High Resolution 3D colour image acquisition: Complementary approaches (colour surface scanning, photogrammetry) will be developed in order to provide complete information (3D and colour data) of the specimen.
Objective 2: New methods for 3D digitisation of NH collections includes the following tasks:
- Research on different 3D techniques: The size, shape and the different structured surfaces of specimens make it necessary to adapt the process of digitisation and to develop different standards by selecting exemplary object classes with which to optimise the process of 3D imaging. The aim is to create digital 3D objects viewable from every angle and with high depth of focus by using stacking techniques.
- Micro-Computed Tomography (Micro-CT) for NH collections: Develop protocols and workflows for the rapid digitisation of collections (sample preparation, scanning parameters, model creation).
Objective 3: Crowdsourcing metadata enrichment of digital images includes the following tasks:
- Research into crowdsourcing methodologies for NH collections: Identify which digital image data are most appropriate for use with crowdsourcing.
- Development of website to allow crowdsourcing data capture: Engage with existing crowdsourcing sites and use the results of the pilot study (NA 3 task 1.4) to develop an online mechanism for allowing the public to engage with biodiversity research.
Objective 4: Access and management of an integrated European digital collection includes the following tasks:
- Feasibility research on a “digitise on demand” (DoD) service for European NH Institutions: In order to validate the market for a DoD service, feasibility research will be completed to identify the barriers to adopting a DoD approach across Participants.
- Open Access to captured data: Public-facing, Open Access portals will be utilised to publish the resulting SYNTHESYS content (images and metadata).
Obj 1: Automated data collection from digital images
Obj 2: New methods for 3D digitisation of NH collections
Obj 3: Crowdsourcing metadata enrichment of digital images
Obj 4: Access and management of an integrated European digital collection (with NA2)
Obj 1: Automated data collection from digital images
Participants: RBGE (Lead), NHM, RBGK, MNHN, CSIC, BGBM, MfN, HNHM, RBINS, RMCA, NMP
Task 1.1. Automatic processing (segmentation) of digital images
Research and develop edge detection technology to locate and classify multiple regions of interest within images of NH specimens. Using the principle that pixels in a segment are similar with respect to some characteristic or computed property (e.g. colour, intensity, or texture), develop a method to semi-automatically detect, crop and classify these regions of interest such that they can be subject to appropriate additional processing.
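The similarity principle described above can be illustrated with a minimal sketch. It is not the project's method: it swaps in plain intensity thresholding with connected-component labelling rather than edge detection, and it assumes dark specimens on a light background. The function name `segment_regions`, the threshold value and the toy image are all made up for illustration.

```python
from collections import deque

def segment_regions(image, threshold):
    """Return bounding boxes (top, left, bottom, right) of connected
    regions whose pixel intensity is below `threshold`.

    `image` is a list of rows of greyscale values (0 = dark, 255 = light).
    Assumption: specimens are darker than the background.
    """
    height, width = len(image), len(image[0])
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y in range(height):
        for x in range(width):
            if seen[y][x] or image[y][x] >= threshold:
                continue
            # Breadth-first flood fill over 4-connected neighbours.
            top, left, bottom, right = y, x, y, x
            queue = deque([(y, x)])
            seen[y][x] = True
            while queue:
                cy, cx = queue.popleft()
                top, bottom = min(top, cy), max(bottom, cy)
                left, right = min(left, cx), max(right, cx)
                for ny, nx in ((cy - 1, cx), (cy + 1, cx), (cy, cx - 1), (cy, cx + 1)):
                    if (0 <= ny < height and 0 <= nx < width
                            and not seen[ny][nx] and image[ny][nx] < threshold):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            boxes.append((top, left, bottom, right))
    return boxes

# A toy 5x6 "drawer scan" with two dark specimens on a light tray.
toy_scan = [
    [255, 255, 255, 255, 255, 255],
    [255,  40,  50, 255, 255, 255],
    [255,  45, 255, 255,  30, 255],
    [255, 255, 255, 255,  35, 255],
    [255, 255, 255, 255, 255, 255],
]
boxes = segment_regions(toy_scan, threshold=128)
print(boxes)
```

Each bounding box could then be cropped and passed on for classification or other processing. Real specimen images would need a more robust similarity measure, such as colour or texture.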
Task 1.2. Automatic metadata capture
Develop software that will automatically identify properties of an image. These data “facets” will be automatically captured without human intervention and provide categories of information that allow Users to easily search and browse virtual collections more effectively.
Specimen label data will be subjected to Optical Character Recognition (OCR) software to extract the text string, and methods will be researched to improve the accuracy of OCR on handwritten labels. OCR-extracted text collected from handwritten labels will need to be subject to further processing and validation, such as via crowdsourcing methodologies (Obj 3).
Task 1.3. High Resolution 3D colour image acquisition
Complementary approaches (colour surface scanning, photogrammetry) will be developed in order to provide complete information (3D and colour data) of the specimen. Collaborate with existing European projects such as 3DCOFORM whose focus is on Cultural Heritage digitisation. This task will develop 3DCOFORM outputs to enable their use with NH specimens.
Obj 2: New methods for 3D digitisation of NH collections
Participants: MfN (Lead), NHM, RBGE, NHMW, HCMR, BGBM
Task 2.1. Research on different 3D techniques
The size, shape and the different structured surfaces of specimens make it necessary to adapt the process of digitisation and to develop different standards by selecting exemplary object classes with which to optimise the process of 3D imaging. The aim is to create digital 3D objects viewable from every angle and with high depth of focus by using stacking techniques. The resultant 3D images will show all relevant details necessary for determination of the specimen, and it will be possible to zoom into every part of the specimen. It is anticipated that some taxon groups or specimens will not fit the exemplary object classes, and that a determination might not be possible from one 3D scan; high-resolution images attached to the 3D model to show special details (e.g. microscopic pictures of copulatory organs) will make the resultant new virtual collection a multimedia object. The results of this task will feed into the NA2 handbook of best practice and standards for 3D imaging of type specimens (Task 1.2).
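As a rough illustration of how stacking can increase depth of focus, the sketch below merges several images of the same view taken at different focal planes by keeping, for each pixel, the value from the layer that is locally sharpest. The sharpness measure (absolute 4-neighbour Laplacian), the function names and the toy data are all assumptions chosen for illustration, not the technique adopted in this task.

```python
def sharpness(image, y, x):
    """Absolute 4-neighbour Laplacian; out-of-bounds neighbours reuse the centre value."""
    h, w = len(image), len(image[0])
    centre = image[y][x]
    total = 0
    for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
        total += image[ny][nx] if 0 <= ny < h and 0 <= nx < w else centre
    return abs(4 * centre - total)

def focus_stack(stack):
    """Merge same-sized greyscale images, taking each pixel from the sharpest layer."""
    h, w = len(stack[0]), len(stack[0][0])
    merged = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            best = max(stack, key=lambda layer: sharpness(layer, y, x))
            merged[y][x] = best[y][x]
    return merged

# Two toy one-row "photos": layer_a is in focus on the left, layer_b on the right.
layer_a = [[10, 200, 100, 100]]
layer_b = [[100, 100, 10, 200]]
merged = focus_stack([layer_a, layer_b])
print(merged)
```

In practice the layers would first need to be aligned, and a smoothed sharpness map usually gives cleaner results than a per-pixel choice.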
Task 2.2. Micro-Computed Tomography (Micro-CT) for NH collections
Develop protocols and workflows for the rapid digitisation of collections (sample preparation, scanning parameters, model creation). The resulting models will be displayed and disseminated through a web-based framework which will allow the user to manipulate the 3D tomograms through a series of online tools that will be created.
Obj 3: Crowdsourcing metadata enrichment of digital images
Participants: VIZZ (Lead), NHM, RBGE, MNHN, CSIC, MfN, NHMW, NMP, VU
Task 3.1. Research into crowdsourcing methodologies for NH collections
Identify which digital image data are most appropriate for use with crowdsourcing. This work will draw on experience with other citizen science crowdsourcing efforts, such as the Zooniverse (www.zooniverse.org) project.
Work will focus on 1) the potential for crowdsourcing transcription of handwritten materials (e.g. specimen labels, catalogue cards, letters and diaries), which contain a vast and untapped wealth of historical information about the distribution, identity and origin of NH specimens, and 2) image-based identification of unidentified specimens by expert communities. The goal will be to develop a specification that supports these functions on a website that hosts NH crowdsourcing projects. This specification will include a mechanism to integrate output with existing social media sites, maximising the reach to interested parties.
Task 3.2. Development of website to allow crowdsourcing data capture
Engage with existing crowdsourcing sites and use the results of the pilot study (NA 3 task 1.4) to develop an online mechanism for allowing the public to engage with biodiversity research. As part of this work, map the crowdsourced label data to Darwin Core data standard fields, ensuring that the collected data maps to existing NH collections data management systems. Once these integration mechanisms exist, the resultant website will be a sustainable source of volunteers that NH institutions can engage with after the life of the project. SYNTHESYS3 will offer LifeWatch the technology developed to use as a basis for its own crowdsourcing projects. The crowdsourcing website will be monitored and User engagement tracked. This information will be used to further improve User uptake. Recommendations will be made on the organisational embedding and sustainability of the website into Consortium partners’ subsequent workflows.
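A minimal sketch of the kind of field mapping described above might look as follows. The Darwin Core terms on the right-hand side (`recordedBy`, `eventDate`, `scientificName`, `locality`, `country`) are real terms from the standard; the form field names on the left, the function name and the sample transcription are hypothetical and stand in for whatever a crowdsourcing site actually collects.

```python
# Hypothetical transcription-form field names (left) mapped to Darwin Core terms (right).
FORM_TO_DWC = {
    "collector": "recordedBy",
    "collection_date": "eventDate",
    "species_name": "scientificName",
    "place": "locality",
    "country": "country",
}

def to_darwin_core(transcription):
    """Rename known fields to Darwin Core terms; collect unknown fields separately."""
    record, unmapped = {}, {}
    for field, value in transcription.items():
        value = value.strip()
        if not value:
            continue  # skip fields the volunteer left blank
        if field in FORM_TO_DWC:
            record[FORM_TO_DWC[field]] = value
        else:
            unmapped[field] = value
    return record, unmapped

# Invented example of one volunteer's transcription of a specimen label.
transcription = {
    "collector": " A. Smith ",
    "collection_date": "1902-06-14",
    "place": "Edinburgh",
    "species_name": "",
    "notes": "label torn",
}
record, unmapped = to_darwin_core(transcription)
print(record, unmapped)
```

Keeping unmapped fields separate, rather than discarding them, leaves room for curators to decide later how they fit an institution's collections management system.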
Obj 4: Access and management of an integrated European digital collection
Participants: NHM (Lead), RBGE, UCPH, CSIC, NCB, NMP
Task 4.1. Feasibility research on a “digitise on demand” (DoD) service for European NH Institutions
In order to validate the market for a DoD service, feasibility research will be completed to identify the barriers to adopting a DoD approach across Participants. Specific activities include establishing the technical DoD infrastructure for a detailed market validation; conducting market validation with potential adopters; collecting data on the feasibility of the service; and preparing a deployment plan for the future of the DoD network. Research will require careful costing of all activities, including provision for a pay-as-you-go service to help prioritise activities. Requests for digitisation will need to be carefully matched to appropriate technologies through an automated system that incorporates the best practice guide produced in NA3 (task 1.3). Development of a DoD service has the potential to offer access to NH specimens across Europe in a highly scalable manner that can be used either to digitise all material for select groups or to complete gaps in a particular collection.
Task 4.2. Open Access to captured data
Public-facing, Open Access portals will be utilised to publish the resulting SYNTHESYS content (images and metadata). Initially the data will be made available via GBIF and supplied to the EU-funded LifeWatch project. SYNTHESYS will also develop the protocols to ensure that this output can be accessed by the EU-funded Europeana portal and the international Encyclopedia of Life project. Open licensing and comprehensive dissemination are essential to ensure that all audiences are aware of, and able to access, the NH images and metadata.
Inselect
Open source software that can recognise, process and annotate images that contain multiple specimens (e.g. whole drawer scans of pinned insects or slide arrays) has been developed.
A workshop was held in September 2014 to develop a specification and produce a functional software prototype called Inselect.
Read about Inselect in more detail in the workshop presentation and the summary presentation.
This aspect of the SYNTHESYS3 JRA focused on the development of software that is able to automatically identify properties of an image without human intervention, and capture easily searchable information that can be integrated into virtual Natural History Collections.
This research was divided into four ‘sections’:
1. Review of development of tools and workflows which incorporate automatic or semi-automatic metadata capture using Optical Character Recognition (OCR)
2. Review of development of Natural Language Processing (NLP) for parsing OCR text into Darwin Core fields
3. Review of Handwritten Text Recognition (HTR) and (semi-)automatic specimen image classification
4. Review of automatic capture of characters, including colour and shape, as well as EXIF data
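To give a feel for what parsing OCR text into Darwin Core fields involves (section 2 above), here is a deliberately simple sketch. It uses hand-written regular expressions as a stand-in for NLP, which is a much weaker technique than those the review covers. The patterns, the function name and the sample label text are invented for illustration; only the Darwin Core term names (`recordedBy`, `eventDate`, `catalogNumber`) come from the standard.

```python
import re

# Illustrative patterns only; real labels vary far more than this.
PATTERNS = {
    "recordedBy": re.compile(r"(?:leg\.|coll\.)\s*([^,;\n]+)", re.IGNORECASE),
    "eventDate": re.compile(r"\b(\d{4}-\d{2}-\d{2})\b"),
    "catalogNumber": re.compile(r"\bNo\.\s*(\w+)", re.IGNORECASE),
}

def parse_label(ocr_text):
    """Return a dict of Darwin Core terms found in the OCR text."""
    fields = {}
    for term, pattern in PATTERNS.items():
        match = pattern.search(ocr_text)
        if match:
            fields[term] = match.group(1).strip()
    return fields

# Invented OCR output for a single specimen label.
label = "Carabus nemoralis\nScotland, Edinburgh\nleg. A. Smith, 1902-06-14\nNo. E00123"
parsed = parse_label(label)
print(parsed)
```

Terms that no pattern matches are simply left out, so downstream validation (for example by crowdsourcing) can see which fields still need attention.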
You can read the executive summary and full report here (deliverable 4.2).