ARPHA Preprints, doi: 10.3897/arphapreprints.e196971
How similar are species names and why does this matter for biodiversity data?
André Menegotto, Cristina Ronquillo§, Joaquín Hortal§, Thomas J. Webb
‡ Ecology and Evolutionary Biology, School of Biosciences, University of Sheffield, Sheffield, United Kingdom
§ Department of Biogeography and Global Change, Museo Nacional de Ciencias Naturales (MNCN-CSIC), Madrid, Spain
Open Access
Abstract

Standardising taxonomic names is an essential step in biodiversity studies to ensure robust data aggregation and up-to-date, valid species nomenclature. Fuzzy (inexact) matching is widely used in this process to detect correspondences between scientific names that differ due to typographical errors. This approach assumes that species names are sufficiently distinct that two names differing in just a few characters refer to the same taxon, but this assumption has rarely been evaluated. Across c. 230,000 marine species names, we show that name similarity is common: 28.37% of specific epithets differ by three or fewer edits from another epithet within the same genus. Shared epithets are also widespread within and across phyla, occurring in 73% of all marine species; in 7.35% of these cases, the associated genera differ by three or fewer edits. This level of similarity increases the risk of incorrect matches, limiting the reliability of automated text-string tools in biodiversity big data analyses and highlighting the importance of incorporating systematic and authorship information into taxonomic workflows to support name resolution beyond orthographic similarity.

Keywords
Damerau-Levenshtein distance, epithet, fuzzy match, scientific name, taxonomic harmonisation
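
To make the "three or fewer edits" criterion concrete, the following is a minimal sketch of how an edit distance between epithets could be computed and used to flag similar pairs within one genus. It is not the authors' implementation: it assumes the restricted (optimal string alignment) variant of the Damerau-Levenshtein distance, the function names `osa_distance` and `similar_pairs` are hypothetical, and the epithets in the usage lines are toy strings rather than data from the study.

```python
from itertools import combinations


def osa_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance.

    Counts insertions, deletions, substitutions and transpositions of
    adjacent characters; no substring is edited more than once.
    """
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all i characters of a
    for j in range(cols):
        d[0][j] = j  # insert all j characters of b
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution or match
            )
            # Transposition of two adjacent characters.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def similar_pairs(epithets, max_edits=3):
    """Return (epithet, epithet, distance) for pairs within max_edits.

    Assumes all epithets belong to the same genus (hypothetical helper).
    """
    pairs = []
    for x, y in combinations(sorted(set(epithets)), 2):
        dist = osa_distance(x, y)
        if dist <= max_edits:
            pairs.append((x, y, dist))
    return pairs


# Toy epithets for illustration only, not taken from the study.
print(similar_pairs(["alba", "albus", "elongata"]))
```

With the toy input, only the pair "alba" and "albus" falls within the threshold (one substitution plus one insertion). A fuzzy matcher using the same threshold would therefore treat one as a plausible misspelling of the other, even if both were valid names in the same genus.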