<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE article PUBLIC "-//TaxonX//DTD Taxonomic Treatment Publishing DTD v0 20100105//EN" "../../nlm/tax-treatment-NS0.dtd">
<article xmlns:mml="http://www.w3.org/1998/Math/MathML" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:xlink="http://www.w3.org/1999/xlink" xmlns:tp="http://www.plazi.org/taxpub" article-type="research-article" dtd-version="3.0" xml:lang="en">
  <front>
    <journal-meta>
      <journal-id journal-id-type="publisher-id">102</journal-id>
      <journal-id journal-id-type="index">urn:lsid:arphahub.com:pub:73abe0ce-d97c-5d7c-bee5-b8e6e6fe6a17</journal-id>
      <journal-title-group>
        <journal-title xml:lang="en">ARPHA Preprints</journal-title>
        <abbrev-journal-title xml:lang="en">preprints</abbrev-journal-title>
      </journal-title-group>
      <publisher>
        <publisher-name>Pensoft Publishers</publisher-name>
      </publisher>
    </journal-meta>
    <article-meta>
      <article-id pub-id-type="doi">10.3897/arphapreprints.e86014</article-id>
      <article-id pub-id-type="publisher-id">86014</article-id>
      <article-categories>
        <subj-group subj-group-type="heading">
          <subject>Research Article</subject>
        </subj-group>
        <subj-group subj-group-type="scientific_subject">
          <subject>Biodiversity &amp; Conservation</subject>
          <subject>Data mining &amp; Machine learning</subject>
          <subject>Neural networks</subject>
        </subj-group>
        <subj-group subj-group-type="sdg">
          <subject>Life on land</subject>
        </subj-group>
      </article-categories>
      <title-group>
        <article-title>Essential Biodiversity Variables: Extracting Plant Phenological Data from Specimen Labels Using Machine Learning</article-title>
      </title-group>
      <contrib-group content-type="authors">
        <contrib contrib-type="author" corresp="yes">
          <name name-style="western">
            <surname>Mora-Cross</surname>
            <given-names>Maria Auxiliadora</given-names>
          </name>
          <email xlink:type="simple">mariamoracross@gmail.com</email>
          <uri content-type="orcid">https://orcid.org/0000-0002-3457-0963</uri>
          <xref ref-type="aff" rid="A1">1</xref>
        </contrib>
        <contrib contrib-type="author" corresp="no">
          <name name-style="western">
            <surname>Morales-Carmiol</surname>
            <given-names>Adriana</given-names>
          </name>
          <xref ref-type="aff" rid="A2">2</xref>
        </contrib>
        <contrib contrib-type="author" corresp="no">
          <name name-style="western">
            <surname>Chen-Huang</surname>
            <given-names>Te</given-names>
          </name>
          <xref ref-type="aff" rid="A2">2</xref>
        </contrib>
        <contrib contrib-type="author" corresp="no">
          <name name-style="western">
            <surname>Barquero-Pérez</surname>
            <given-names>María José</given-names>
          </name>
          <xref ref-type="aff" rid="A2">2</xref>
        </contrib>
      </contrib-group>
      <aff id="A1">
        <label>1</label>
        <addr-line content-type="verbatim">Costa Rica Institute of Technology, Cartago, Costa Rica</addr-line>
        <institution>Costa Rica Institute of Technology</institution>
        <addr-line content-type="city">Cartago</addr-line>
        <country>Costa Rica</country>
      </aff>
      <aff id="A2">
        <label>2</label>
        <addr-line content-type="verbatim">School of Computer Engineering, Costa Rica Institute of Technology, Alajuela, Costa Rica</addr-line>
        <institution>School of Computer Engineering, Costa Rica Institute of Technology</institution>
        <addr-line content-type="city">Alajuela</addr-line>
        <country>Costa Rica</country>
      </aff>
      <author-notes>
        <fn fn-type="corresp">
          <p>Corresponding author: Maria Auxiliadora Mora-Cross (<email xlink:type="simple">mariamoracross@gmail.com</email>).</p>
        </fn>
        <fn fn-type="edited-by">
          <p>Academic editor: </p>
        </fn>
      </author-notes>
      <pub-date pub-type="collection">
        <year>2022</year>
      </pub-date>
      <pub-date pub-type="epub">
        <day>09</day>
        <month>05</month>
        <year>2022</year>
      </pub-date>
      <volume>3</volume>
      <uri content-type="arpha" xlink:href="http://openbiodiv.net/7C4AE6F7-3EBD-532E-B77E-C8786285FCCD">7C4AE6F7-3EBD-532E-B77E-C8786285FCCD</uri>
      <history>
        <date date-type="received">
          <day>30</day>
          <month>04</month>
          <year>2022</year>
        </date>
        <date date-type="accepted">
          <day>30</day>
          <month>04</month>
          <year>2022</year>
        </date>
      </history>
      <permissions>
        <copyright-statement>Maria Auxiliadora Mora-Cross, Adriana Morales-Carmiol, Te Chen-Huang, María José Barquero-Pérez</copyright-statement>
        <license license-type="creative-commons-attribution" xlink:href="http://creativecommons.org/licenses/by/4.0/" xlink:type="simple">
          <license-p>This is an open access preprint distributed under the terms of the Creative Commons Attribution License (CC BY 4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.</license-p>
        </license>
      </permissions>
      <abstract>
        <label>Abstract</label>
        <p>Essential Biodiversity Variables (EBVs) make it possible to evaluate and monitor the state of biodiversity over time at different spatial scales. Their development is led by the Group on Earth Observations Biodiversity Observation Network (GEO BON) to harmonize, consolidate, and standardize biodiversity data from varied sources. This document presents a mechanism for obtaining baseline data to feed the Species Traits Variable Phenology, or other biodiversity indicators, by extracting species character and structure names from morphological descriptions of specimens and classifying those descriptions using machine learning (ML).</p>
        <p>A workflow that performs Named Entity Recognition (NER) and classification of morphological descriptions using ML algorithms was evaluated. It was implemented using Python, PyTorch, scikit-learn, pomegranate, python-crfsuite, and other libraries, and applied to 106,804 herbarium records from the National Biodiversity Institute of Costa Rica (INBio). Text classification achieved F1 scores between 96% and 99% using three traditional ML methods: Multinomial Naive Bayes (NB), Linear Support Vector Classification (SVC), and Logistic Regression (LR). Furthermore, the extraction of names of species morphological structures (e.g., leaves, trichomes, flowers, petals, sepals) and character names (e.g., length, width, pigmentation patterns, and smell) using NER algorithms gave competitive results (F1 scores between 95% and 98%) with Hidden Markov Models (HMM), Conditional Random Fields (CRFs), and Bidirectional Long Short-Term Memory Networks with CRF (BI-LSTM-CRF).</p>
      </abstract>
    </article-meta>
  </front>
</article>
