Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model
Dataset title | Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model |
---|---|
Dataset creators | Rachel Heaven, British Geological Survey; Phil Atkinson, British Geological Survey; Tarun Joseph, British Geological Survey |
Dataset theme | Geoscientific Information |
Dataset abstract | This dataset consists of sentences extracted from BGS memoirs, DECC/OGA onshore hydrocarbons well reports and Mineral Reconnaissance Programme (MRP) reports. The sentences have been annotated so that the dataset can be used as labelled training data for a Named Entity Recognition model and an Entity Relation Extraction model, both of which are Natural Language Processing (NLP) techniques that assist with extracting structured data from unstructured text. The entities of interest are rock formations, geological ages, rock types, physical properties and locations, with inter-relations such as overlies and observedIn. The entity labels for rock formations and geological ages in the BGS memoirs were extracted from earlier published work (https://github.com/BritishGeologicalSurvey/geo-ner-model, https://zenodo.org/records/4181488). The data can be used to fine-tune a pre-trained large language model using transfer learning, producing a model that can be run in inference mode to create the labels automatically, thereby generating structured data useful for geological modelling and subsurface characterisation. The data is provided in JSONL(Relation) format, which is the export format of doccano, the open source text annotation software (https://doccano.github.io/doccano/) used to create the labels. The source documents are already publicly available, but the MRP and DECC reports are only published as PDF images. These documents had to undergo OCR, which resulted in lower quality text and lower quality training data. The majority of the labelled data is from the higher quality BGS memoirs text. The dataset is a proof of concept. Minimal peer review of the labelling has been conducted, so it should not be treated as a gold standard labelled dataset, and it is of insufficient volume to build a performant model. 
The development of this training data and the text processing scripts was supported by a grant from the UK Government Office for Technology Transfer (GOTT) Knowledge Asset Grant Fund, Project 10083604. |
Dataset content dates | 01/11/2023 - 15/02/2024 |
Dataset spatial coverage | Not Applicable |
Dataset supply format | jsonl |
Dataset language | English-United Kingdom |
Dataset discovery metadata record | Discovery Link to the dataset's BGS Discovery Metadata record |
Dataset publisher | NERC EDS National Geoscience Data Centre |
Dataset publication date | 11/12/2024 |
Dataset digital object identifier (DOI) | 10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd |
Dataset citation text | Heaven, R., Atkinson, P., Joseph, T. (2024). Labelled data for fine tuning a geological Named Entity Recognition and Entity Relation Extraction model. NERC EDS National Geoscience Data Centre. (Dataset). https://doi.org/10.5285/afba2d1d-8a5d-4b96-a6fa-c13b5d8d32cd |
Constraints and terms of use | This data set is available under Open Government Licence, subject to the following acknowledgement accompanying any reproduced materials: "Contains data supplied by permission of the Natural Environment Research Council [YEAR]". |
Access the dataset | https://webapps.bgs.ac.uk/services/ngdc/accessions/index.html#item186633 |
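
The abstract describes the supply format as doccano's JSONL(Relation) export. As a rough illustration of how such a file might be read, the Python sketch below turns each record into (head entity, relation, tail entity) triples. The field names (`text`, `entities`, `relations`, `start_offset`, `end_offset`, `from_id`, `to_id`, `type`) are assumed from doccano's documented relation export and have not been checked against this dataset's files. The sample sentence, the entity labels `FORMATION` and the function name `iter_triples` are hypothetical and are not taken from the dataset.

```python
import json
from typing import Iterable, Iterator, Tuple


def iter_triples(lines: Iterable[str]) -> Iterator[Tuple[str, str, str]]:
    """Yield (head text, relation type, tail text) for each labelled relation.

    Assumes one JSON object per line with doccano-style keys; adjust the
    key names if the actual export differs.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        text = record["text"]
        # Map each entity id to the surface text covered by its character span.
        spans = {
            ent["id"]: text[ent["start_offset"]:ent["end_offset"]]
            for ent in record.get("entities", [])
        }
        for rel in record.get("relations", []):
            yield spans[rel["from_id"]], rel["type"], spans[rel["to_id"]]


# Hypothetical record built in memory; offsets are computed, not hand-typed.
sample_text = "The Mercia Mudstone overlies the Sherwood Sandstone."
upper, lower = "Mercia Mudstone", "Sherwood Sandstone"
sample_record = {
    "id": 1,
    "text": sample_text,
    "entities": [
        {"id": 10, "label": "FORMATION",
         "start_offset": sample_text.index(upper),
         "end_offset": sample_text.index(upper) + len(upper)},
        {"id": 11, "label": "FORMATION",
         "start_offset": sample_text.index(lower),
         "end_offset": sample_text.index(lower) + len(lower)},
    ],
    "relations": [{"id": 100, "from_id": 10, "to_id": 11, "type": "overlies"}],
}

triples = list(iter_triples([json.dumps(sample_record)]))
print(triples)
```

To apply this to a downloaded file, an open file handle can be passed in place of the list, for example `iter_triples(open(path, encoding="utf-8"))`, where `path` points at one of the supplied `.jsonl` files.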