Corresponding author: Rajapreethi Rajendran (rajapreethi.rajendran@senckenberg.de) © Rajapreethi Rajendran, Claus Weiland, Jonas Grieb, Soulaine Theocharides, Sam Leeflang, Wouter Addink, Sharif Islam. This is an open access preprint distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. Citation:
Rajendran R, Weiland C, Grieb J, Theocharides S, Leeflang S, Addink W, Islam S (2025) Extraction of Quantitative Specimen Data using Machine Learning as a Service in the DiSSCo Research Infrastructure. ARPHA Preprints. https://doi.org/10.3897/arphapreprints.e160486
The Distributed System for Scientific Collections (DiSSCo) is a research infrastructure that digitally integrates European natural science collections (NSCs). Its aim is to facilitate and enhance access to, and the management and analysis of, collection assets in one unified digital collection. Machine Annotation Services (MAS) are essential components of DiSSCo’s Digital Specimen Architecture (DSArch). These services automate the annotation of digital objects, enabling the labeling and categorization of NSCs’ digital assets.
To advance this further, a Machine Learning as a Service (MLaaS) approach was developed that gives researchers access to pre-trained machine learning models for complex tasks such as instance segmentation and morphological analysis of datasets. MLaaS enhances DiSSCo’s scalability and flexibility and allows machine learning tools to be integrated in close alignment with the FAIR (Findable, Accessible, Interoperable, Reusable) principles.
This study employs DiSSCo’s MLaaS framework for the quantitative analysis of herbarium specimens. Machine learning models such as Mask R-CNN and YOLO11 are applied comparatively to detect plant organs in herbarium sheets and to generate pixel-level masks of them. Subsequently, these models are used to reconstruct the scale of the herbarium sheet and to calculate the surface area of the identified plant organs.
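To illustrate how a pixel-level mask and a reconstructed scale combine into a surface area, the following is a minimal sketch and not the service's actual implementation. It assumes that the scale has already been reduced to two pixel coordinates spanning a known physical length (for example, along a detected ruler) and that the organ mask is a binary grid. The function names `pixels_per_cm` and `mask_area_cm2` and the toy values are hypothetical.

```python
from math import hypot


def pixels_per_cm(ruler_start, ruler_end, ruler_length_cm):
    """Estimate the image scale from two pixel coordinates that mark
    a known physical length (hypothetical helper, assumed input)."""
    dx = ruler_end[0] - ruler_start[0]
    dy = ruler_end[1] - ruler_start[1]
    return hypot(dx, dy) / ruler_length_cm


def mask_area_cm2(mask, px_per_cm):
    """Convert a binary mask (rows of 0/1 values) into an area in cm^2.

    Each foreground pixel covers (1 / px_per_cm)^2 square centimetres.
    """
    pixel_count = sum(1 for row in mask for value in row if value)
    return pixel_count / (px_per_cm ** 2)


# Toy example: an 8-pixel "leaf" mask and a ruler segment of 10 cm
# that spans 20 pixels in the image.
leaf_mask = [
    [0, 1, 1, 0],
    [1, 1, 1, 1],
    [0, 1, 1, 0],
]
scale = pixels_per_cm((0, 0), (20, 0), 10.0)
area = mask_area_cm2(leaf_mask, scale)
print(f"scale: {scale} px/cm, area: {area} cm^2")
```

Because the area scales with the square of the pixels-per-centimetre value, small errors in the reconstructed scale propagate quadratically into the reported surface area.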
Based on our finding that YOLO11 performs better than Mask R-CNN for our use case, we deployed a YOLO11-based service as a MAS in DSArch to open up natural science collections at scale for research fields such as plant phenology and climate change science.