Geohistorical dataset of ten plant species introduced into Occitania (France)

Abstract Background The original dataset presented here is the result of the first near-exhaustive analysis performed on historical data concerning ten plant species introduced in and around Occitania (south-western France) since 1651. Research was carried out on the following species: Alnusincana, Buddlejadavidii, Castaneasativa, Helianthustuberosus, Impatiensglandulifera, Prunuscerasifera, Prunuslaurocerasus, Reynoutriajaponica, Robiniapseudoacacia and Spiraeajaponica. The data file contains 199 occurrence data exclusively based on historical observations and records made between 1651 and 2004 that were retrieved from 111 of the 640 literary sources consulted. All the records are associated with a year and 61% of them have associated spatial coordinates. Initially, the EI2P-VALEEBEE research project focused on the introduction of these species into Occitania (95 occurrences, 47.7%), but mentions found of introductions beyond this territory - mainly in metropolitan France - are also reported. The creation of this dataset involved five stages: (1) selection of species, (2) consultation of historical sources, (3) recording of occurrences in the dataset, (4) dataset standardisation/enrichment and Darwin core mapping and (5) data publication. Quality controls were conducted at each step. The dataset is available on the platform of the Global Biodiversity Information Facility (GBIF) at https://doi.org/10.15468/3kvaeh. It respects the internationally recognised FAIR Data Principles (Findable, Accessible, Interoperable and Reusable). New information The dataset will be progressively enriched by new data during the EI2P-VALEEBEE research project and future projects on invasive plant species conducted by the team.

The data file contains 199 occurrence data exclusively based on historical observations and records made between 1651 and 2004 that were retrieved from 111 of the 640 literary sources consulted. All the records are associated with a year and 61% of them have associated spatial coordinates. Initially, the EI2P-VALEEBEE research project focused on the introduction of these species into Occitania (95 occurrences, 47.7%), but mentions found of introductions beyond this territory -mainly in metropolitan France -are also reported.

Introduction
The introduction of alien species into a given region may be intentional, for ornamental, horticultural or agricultural purposes, but more often, it is involuntary (Pyšek et al. 2009;Laporte-Cru and Aniotsbéhère 2010). Whatever the case, invasive alien species have multiple consequences on biodiversity worldwide, as do the destruction of natural habitats, pollution, overuse of resources and climate change (Pimentel 2011;Bellard et al. 2012;Verma et al. 2020). Introduced species, if and when they become invasive, induce multiple consequences, direct and/or indirect, affecting the native species, the functioning of natural habitats and the services provided by ecosystems, as well as economic activities and human health (IUCN French Committee 2018). For some authors, the invasive character of an alien plant is linked to both the environment and the species (Rejmanek et al. 2005;Richardson and Pysek 2006). Therefore, it is impossible for an observation made in a given place to be generalised to all environments. The history of such introductions is also a criterion that it is important to consider when attempting to understand the invasion dynamics. Introduction patterns and historical factors resulting in the presence of alien species in a given region could provide key information for risk management and the prevention of potentially harmful introductions (Dudeque Zenni 2014).
The data presented here are derived from the EI2P-VALEEBEE project, co-funded by the Région Occitanie and the Maison des Sciences de l'Homme de Toulouse (cf. glossary of acronyms in Suppl. material 1). They combine two issues of major ecological, sociocultural and economic concern: biological invasions and the decline of pollinator populations. The main objective of this project is to better identify the links between plant invasions and changes in ecosystem services (Vaz et al. 2017a;Pisani et al. 2021) so as to better understand the potential or constraints of invasive alien plants in connection with pollinators and try to apprehend invasive processes through the most systemic approach, by taking several dimensions into account (human, spatial, historical, ecological and ethological) and by carefully considering the practices, perceptions and representations that the various stakeholders in the territory have manifested over time (Vaz et al. 2017b).
This systemic approach is also found in the project's method itself, since it brings together several disciplines and tools and also has a diachronic dimension over several time-scales. In carrying out the project, it was decided to focus on the Occitania Region in the southwestern part of France and, more particularly, on two territories: the Pique Valley and the Oussouet Valley (Guillerme et al. 2020), both of which have not only a high level of plant diversity, but also a high rate of invasion.
The issue of alien plant invasions is complex and multifactorial. It includes an important geographical dimension, since the distribution of invasive alien plants is conditioned by the variations of an environment (Souty-Grosset et al. 2015) and it also includes a huge temporal dimension, since it is a process that takes place in the long term. In this perspective and as suggested by , we intend to study the phenomenon in the most transversal and objective way possible to better understand and adapt our actions towards these species. On this basis, the EI2P-VALEEBEE project has chosen to combine the study of plant invasions with the process of decline of pollinators, which is also a multifactorial phenomenon with major ecological and socio-economic importance, since 70% of the plants used in the world for our food depend on pollination by insects (Klein et al. 2007). This decline is partly due to the fall in the quality and quantity of available melliferous resources (Vanbergen et al. 2017). However, some exotic species, considered as alien plants in Occitania, seem to have strong potential in terms of nectar production and melliferous resources.
The EI2P-VALEEBEE project aims to deepen our knowledge of the links between alien plant invasions and changes in ecosystem services, which are still poorly understood today (Drenovsky et al. 2012) and which could potentially change the way we look at alien species and our management strategies towards them. As Simberloff et al. (2012) point out in general, or  more specifically for France, there is a need for general information on these species. Several databases documenting invasive alien species distributions currently exist (CABI: Invasive Species Compendium 2021, Global Invasive Species Database 2020, Global Register of Introduced and Invasive Species 2020, CABI: Invasive Species Compendium 2021). The history of the introduction of a non-native species into a region is linked to the first observations described and establishing it often requires cross-checking of all the information collected. It is a question of tracing the evolution and progression of these alien species from the date and place of their introduction and of noting their behaviour in our ecosystems according to the first observations that were recorded (Laporte-Cru and Aniotsbéhère 2010).
At the scale of Occitania and, more broadly, at the French national scale, our dataset provides novelty in the consideration of the temporal and geohistorical dimension of the invasion phenomenon as it records both observations and historical and literary mentions of the species studied.

General description
Purpose: The data file is the result of a geo-historical study conducted on the introduction and distribution of invasive plant species. Ten plant species were selected that can all be observed in the Pique and Oussouet Valleys. Some of them are considered as invasive, alien species. The study includes research on the introduction dates of the species studied, the locations of their introduction, their interest and past uses, the different human perceptions of them over time, activities that have impacted their local distribution, comments from authors and observers on their abundance and elements of the historical context of their introduction. Historical sources were consulted during 2020 in order to find the oldest elements concerning the ten species.

Additional information: Interest and use of the dataset
Without a historical analysis, it is difficult to understand the current local distribution dynamics of invasive plant species, especially since some of them were introduced into Metropolitan French territory several centuries ago (Souty-Grosset et al. 2015). A major interest of this dataset is to provide historical depth and chronological elements for the understanding of the current distribution of these ten species at local scale. In this perspective, the dataset is relevant to: • identify the different periods of introduction of the ten species studied; • provide accurate data on the main introduction channels, such as the place where species were introduced (mainly ornamental gardens, thermal parks and private gardens, but, for some of them, also much wilder and more natural areas, such as road borders and forests); • identify pathways of colonisation to understand the current localisation of each species; • enable better understanding of why they were introduced, particularly thanks to information on their uses; • provide elements of analysis to understand the temporal phases of species distributions.
The interest of the dataset is directly in line with the values of the EI2P-VALEEBEE project itself, the objective of which is to contribute to a better understanding of plant invasion processes in a transversal way . In this perspective, it can be useful for all local, national and international organisations involved in this issue with current ecological, economic and social implications.

Project description
Title: Geohistorical dataset of ten plant species introduced in Occitania ( (1) selection of species, (2) consultation of historical sources, (3) recording of occurrences in the dataset, (4) dataset standardization/enrichment and Darwin core mapping and (5) data publication.
Step 1: Species selection. Step 2: Consultation of historical sources.
For the consultation of historical sources, a funnel method was applied: 1.
To begin, the existing sources of naturalist data were inventoried at the national level, such as flora (e.g. Flore de France, Coste 1906); seed catalogues (e.g. Andrieux and Vilmorin 1783); horticultural, botanical and beekeeping newspapers and magazines (e.g. Bulletins de la Société Centrale d'Apiculture 1856-1946, Bulletins de la Société Nationale de Protection de la Nature 1882-1888). This first national-scale inventory gave an idea of the presence of the ten species in France during the different periods of history. It also allowed the written sources containing botanical elements on a smaller scale to be identified.

2.
The second stage was the analysis of literary sources at the regional level (Occitania). Local herbaria (e.g. Bordere 1871-1872), regional flora (e.g. Philippe 1859), botanical magazines and newspapers of the region (e.g. Société d'Horticulture 1854-1921) were inventoried and analysed. The seed merchants and horticulturists located in the region were also inventoried to explore their seed catalogues (e.g. Catalogue des plantes vivaces et d'extérieur, Bonamy Frères 1874 ) and potentially find elements on the colonisation routes of the species. In addition, efforts were made to be as exhaustive as possible by consulting archival institutions at the departmental level (Departmental Archives of Hautes-Pyrénées, Departmental Archives of Haute-Garonne). The archived documents consulted were of all kinds: some known botanists' collections (e.g. Bouget 1901Bouget -1948, invoices related to seed orders, documents on the construction of natural parks (Natural Park of the Pyrenees collection, Aymonin 1975) and thermal gardens ("Projet d'aménagement du Parc des Quinconces", Chevalier 1902), together with herbaria, letters and notes. Examining these references for specific information on introduced exotic species and their spatial location was innovative. 3. Once enough regional historical sources had been analysed, it was possible to focus on the scale of our study sites. The main idea was to study historical sources on the scale of the municipalities located in the Pique and the Oussouet Valleys. For this purpose, municipal archives (e.g. Municipal archives of Bagnères-de-Luchon, Municipal archives of Tarbes) were also consulted, taking care to identify the potential pathways of introduction on a very local scale in advance, to facilitate the search in archived articles. As local data sources, archival records enabled consultation of community monographs that presented a chapter on the local flora (e.g. Barèges monograph, Rondou 1907). Finally, newspapers associated with the municipalities (e.g. Revue de Comminges, Société des Études du Comminges 1885-2004) and local herbaria (e.g. Ramond de Carbonnières 1755-1825), offered information on the flora that was much more local and precise.
Throughout the research process, key informants, having good knowledge of the study areas and their backgrounds, contributed information and advice on valuable literary sources, allowing the research to best fit the study areas and to be as complete as possible. It should also be noted that, in order to consult sources of all kinds, as soon as we felt that a literary source could potentially contribute elements on one of the ten species, we consulted it, even if, at first sight, it had no connection with botany (e.g. the recipe book of La Varenne 1651).
As a result, 640 literary sources were consulted during this step. Amongst these, 111 (17.3%) provided information on the introduction and colonisation of the ten species over time (Suppl. material 3). It must be understood that consulting historical sources was one of the most important steps, not only in the creation of this dataset, but also for its future updates, because this work made it easier to identify potential sources of information and distinguish them from blind alleys.
Step 3: Occurrences recorded in the data file.
Each time a species was mentioned in the historical literature consulted, it was recorded in an occurrence data file created in LibreOffice Calc (spreadsheet programme). The file format is OpenDocument Spreadsheet (.ods). When recording an occurrence, attention was paid to the vernacular and scientific synonyms used in the historical literature. We recorded the occurrences for which the historical name is currently identified as a synonym in the Catalogue of Life, INPN and ISSG, but also according to the number of elements in the bibliographic source that allowed the taxon to be identified as such: photographic representation, image or plate of the species, precise description, mention of other known scientific and vernacular names. In the data file, a maximum of elements mentioned by the author were recorded: the synonym cited; the reference code from the French taxonomic referential TAXREF (https://inpn.mnhn.fr/programme/referentiel-taxonomique-taxref? lg=en); the bibliographic reference in which the mention of the species was found; the date of observation of the species or, failing that, the date of its mention; the names of the observers and authors; the type of source; the description of the location as soon as it was mentioned; the species' spatial coverage and abundance; its minimum and maximum altitudes; any comments by the author about the species, the location of observation or mention of the species; the nearest town/village; and the latitude and longitude coordinates. In addition, each element concerning the literature source was recorded in the same software (LibreOffice Calc): bibliographic reference number (identifier), name(s) of the author(s), title, year of publication, collection and publisher if they were mentioned, as well as the city of publication, the name of the journal (if the source was an article), the URL if accessible or the reference of the archive document with the location of the archive institution, the call number and the series.
Step 4: Dataset standardisation/enrichment and Darwin Core mapping.
The Darwin Core Standard (Wieczorek et al. 2012) "offers a stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources. (...) This standardization not only simplifies the process of publishing biodiversity datasets, it also makes it easy for users to discover, search, evaluate and compare datasets as they seek answers to today's data-intensive research and policy questions." (Source: https:// www.gbif.org/darwin-core).
Each column header of the occurrence spreadsheet was searched for an equivalent term in the Darwin Core quick reference guide (https://dwc.tdwg.org/terms/). We also chose the Identification History extension (https://tools.gbif.org/dwca-validator/extension.do?id= dwc:Identification) to manage synonyms of taxon names as they were cited in the literature consulted.
The geographic data (longitude, latitude, WGS84 datum) were structured in two ways: 1) geographic coordinates with an accuracy of 500 metres for occurrences whose locality was precisely identified in the literary source and 2) geographic coordinates with an accuracy of 10,000 metres for occurrences whose literary source mentioned only the name of the municipality (https://www.geonames.org/). For each occurrence, the Darwin core terms "country", "province", "county", "municipality" and "locality" were assigned as far as possible from the elements of the literary sources.
For the publication on the GBIF platform, the Integrated Publishing Toolkit (IPT (gbif.org)) of the GBIF was used to fill out the metadata and to generate the Darwin Core Archive.  FAIRness assessment criteria used for this dataset.

REUSABLE •
The Darwin Core Archive facilitates the reusability of the dataset because it enables publication in the GBIF. This compact package (a ZIP file) contains interconnected text files and enables users to share their data using a common terminology (source: https://www.gbif.org/en/darwin-core).

•
Use an open format for the dataset (OpenDocument.ods) and open source software to reuse it. • EML metadata includes provenance for raw and derived data. • This data paper explains the data processing steps, curation protocol, quality assurance processes, methods and tools that permit long-term integrity and understandability of data. • The spatial/temporal/taxonomic coverage is clearly mentioned in the EML metadata and in this data paper, as well as the CC-BY licence and rules for large reuse.
Quality control: Several quality controls were implemented. First, data cleaning and corrections were performed with proofreading by a third party and the use of pivot tables to check data integrity. Harmonisation and standardisation of content were also necessary to allow and facilitate the mapping with Darwin Core terms. The latter made it possible to identify new additional information such as nomenclatural code, coordinate uncertainty, geodetic datum, georeference source, licence etc. As many standards as possible were chosen to describe country codes (ISO 3166-1-alpha-2), municipality names and their geographic coordinates (geonames.org), taxon scientific names (TAXREF v.13.0 -2019-12-06) and taxon ID from several sources (GBIF, IUCN, Catalogue of Life, IPNI, INPN, Tela Botanica BDTFX). Finally, we used a GIS tool (QGIS 3.10 LTR) to check the geographic coordinates of occurrences.

Geographic coverage
Description: Within the framework of this geohistorical study, we consulted the information on the introductions and distributions of target species existing in archival sources regarding the Oussouet Valley (Pyrenean foothills, Hautes-Pyrénées) and the Pique Valley (Haute-Garonne), in Occitania (95 occurrences, 47.7%). When information concerning other French or European territories: other parts of France (50.8%), Belgium (1%) or the UK (0.5%), was found in the documents consulted, it was also recorded. This explains the European geographical coverage. Of the 199 occurrences, only 122 (61%) occurrences are precisely geolocated (municipality or 500 metre buffer). Fig. 1 shows the geolocation of these 122 occurrences and Fig. 2a and Fig. 2b focus on the occurrences found in the Oussouet Valley and the Pique Valley.
Coordinates: 42°5'52.8''N and 59°15'57.6''N Latitude;8°36'46.8''W Longitude and 8°26'16.8''E Longitude.   For the purposes of the historical study, all the Latin synonyms, identified and validated by the national taxonomic and nomenclatural reference frame TAXREF v.13 related to the ten taxa studied, were considered. During the analysis of historical documents, we also collected all vernacular synonyms as soon as they were associated with a validated Latin synonym (Suppl. material 2); they will help to enrich the current vocabulary designating these species. Phylogenetic classification of the studied taxa ordered by order, family, genus and species, with their number of occurrences and their hyperlink to the subsample by species on gbif.org.

Notes:
The geohistorical database includes observation and record data for the ten species from 1651 until 2004, collected in 2020. The objective was to cover the different periods of introduction of these ten species in Occitania. We consulted literature dating from the 17th century, a period during which some of the species studied seem to have been introduced in France (Helianthus tuberosus, Prunus cerasifera, Prunus laurocerasus and Robinia pseudoacacia).
The number of collected occurrences increased from around 1800 until 1950, which can be explained by the introduction of four of the exotic species into Metropolitan France in the 19th century: Buddleja davidii, Impatiens glandulifera, Reynoutria japonica and Spiraea japonica and also by an increase in the number of historical sources relating to botany and horticulture, which facilitated the identification of mentions or observations of the species studied. Therefore, 75% of the historical collected data dates from about 1800 to 1950 (Fig.  3). The lack of data for the 1950-1974 period could reflect the significant decrease in the number of naturalists' records during this post-war period. Description: The Darwin Core Standard (DwC) was used to offer a "stable, straightforward and flexible framework for compiling biodiversity data from varied and variable sources" (https://www.gbif.org/en/darwin-core). All column labels and descriptions are from https://dwc.tdwg.org/terms/.

Column label Column description
id Same as occurrenceID: An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. This is the primary key of this table. type The nature or genre of the resource. modified The most recent date-time on which the resource was changed.
language A language of the resource.
licence A legal document giving official permission to do something with the resource.
rightsHolder A person or organisation owning or managing rights over the resource.
institutionID An identifier for the institution having custody of the object(s) or information referred to in the record.

institutionCode
The name (or acronym) in use by the institution having custody of the object(s) or information referred to in the record. datasetName The name identifying the dataset from which the record was derived.

ownerInstitutionCode
The name (or acronym) in use by the institution having ownership of the object(s) or information referred to in the record. basisOfRecord The specific nature of the data record. Recommended best practice is to use the standard label of one of the Darwin Core classes.
occurrenceID An identifier for the Occurrence (as opposed to a particular digital record of the occurrence). In the absence of a persistent global unique identifier, construct one from a combination of identifiers in the record that will most closely make the occurrenceID globally unique. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names.

municipality
The full, unabbreviated name of the next smaller administrative region than county (city, municipality etc.) in which the Location occurs. Do not use this term for a nearby named place that does not contain the actual location. Recommended best practice is to use a controlled vocabulary such as the Getty Thesaurus of Geographic Names. locality The specific description of the place. Less specific geographic information can be provided in other geographic terms (higherGeography, continent, country, stateProvince, county, municipality, waterBody, island, islandGroup). This term may contain information modified from the original to correct perceived errors or to standardise the description. verbatimLocality The original textual description of the place.
minimumElevationInMetres The lower limit of the range of elevation (altitude, usually above sea level), in metres.
maximumElevationInMetres The upper limit of the range of elevation (altitude, usually above sea level), in metres. Recommended best practice is to use a date that conforms to ISO 8601-1:2019.
taxonID An identifier for the set of taxon information (data associated with the Taxon class).
May be a global unique identifier or an identifier specific to the dataset.
scientificNameID An identifier for the nomenclatural (not taxonomic) details of a scientific name. scientificName The full scientific name, with authorship and date information, if known. When forming part of an Identification, this should be the name in the lowest level taxonomic rank that can be determined. This term should not contain identification qualifications, which should instead be supplied in the identificationQualifier term. nameAccordingTo The reference to the source in which the specific taxon concept circumscription is defined or implied -traditionally signified by the Latin "sensu" or "sec." (from secundum, meaning "according to"). For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given. kingdom The full scientific name of the kingdom in which the taxon is classified. phylum The full scientific name of the phylum or division in which the taxon is classified. class The full scientific name of the class in which the taxon is classified. order The full scientific name of the order in which the taxon is classified. family The full scientific name of the family in which the taxon is classified.

genus
The full scientific name of the genus in which the taxon is classified. taxonRank The taxonomic rank of the most specific name in the scientificName. Description: The Darwin Core Identification History is an extension allowing multiple identification/determinations of species occurrences, particularly name spellings found in each original text. All identifications including the current one are listed, while the current should also be repeated in the occurrence core for simple access (Source: https://tools.gbif.org/dwca-validator/extension.do?id=dwc:Identification). All column labels and descriptions are from https://dwc.tdwg.org/terms/.

Column label
Column description id An identifier for the Occurrence linked to the occurrence.txt file (same as occurrenceID). It can be repeated as a foreign key here.
identificationID A unique identifier corresponding to the name spelling reported as found in the original text.
This is the primary key of this nameAccordingTo The reference to the source in which the specific taxon concept circumscription is defined or implied -traditionally signified by the Latin "sensu" or "sec." (from secundum, meaning "according to"). For taxa that result from identifications, a reference to the keys, monographs, experts and other sources should be given.
vernacularName A common or vernacular name.
taxonRemarks Comments or notes about the taxon or name.

Maintenance and future work
All data will be maintained by their creators. They will be progressively enriched by new data during the current EI2P-VALEEBEE research project and also during the projects that the team will continue to conduct on invasive plant species thereafter.