Potential of natural language processing for metadata extraction from environmental scientific publications

Blanchy, Guillaume; Albrecht, Lukas; Koestel, John; Garré, Sarah

doi:https://doi.org/10.5194/soil-9-155-2023

Articles | Volume 9, issue 1

© Author(s) 2023. This work is distributed under
the Creative Commons Attribution 4.0 License.

Original research article

14 Mar 2023

Final revised paper (published on 14 Mar 2023)
Preprint (discussion started on 05 Jul 2022)

Interactive discussion

Status: closed

RC1:
'Comment on egusphere-2022-535', Anonymous Referee #1, 10 Aug 2022

Interesting study regarding the use of natural language processing methods to extract information from the growing volume of scientific literature. The authors not only illustrate the use of different algorithms but also try to evaluate them numerically. In general, a well written manuscript. However, I think there is a lack of discussion and some of their objectives/aims are weakly met. The "relationship extraction" section is interesting and well written and the authors might want to put the same effort in the rest of the sections.

Comments

- Abstract: The beginning of the abstract seems a bit disconnected from the rest of the manuscript. Climate change is a hot topic but the paper itself is not related to that. I would suggest re-framing the abstract to match the content of the manuscript.

- Assessing the ability of an algorithm such as regex: I find this evaluation a bit strange. The algorithm itself is infallible in the sense that it always finds what you tell it to find if it is present in the text. The algorithm is only restricted by the capacity of the user to generate valid regular expressions.

- Topic modelling: There is no discussion.

- How did you achieve your second aim (to illustrate the ability of topic classification to classify a new paper as relevant to a given topic)?

- You mention that topic modelling "can help identify knowledge gaps". How? Did you find any? If your aim is to present a practical workflow, perhaps you should guide the user to achieve that.

- Why did you select 6 topics instead of 9? You only mention that you are trying to maximise the coherence, which is higher for 9 topics.

- How might the number of topics affect your workflow? Is selecting the highest coherence score infallible?

- Could you elaborate on how excluding monograms increased the coherence? From the term frequencies (Fig. 7) I do not see many soil-related terms, which seems strange. Perhaps they were ignored since they appeared as monograms? I do agree that bigrams and even trigrams are important but I have usually seen them added to a selection of monograms.

Citation: https://doi.org/10.5194/egusphere-2022-535-RC1
- AC1: 'Reply on RC1', Guillaume Blanchy, 12 Jan 2023
  
  General:
  Interesting study regarding the use of natural language processing methods to extract information from the growing volume of scientific literature. The authors not only illustrate the use of different algorithms but also try to evaluate them numerically. In general, a well written manuscript. However, I think there is a lack of discussion and some of their objectives/aims are weakly met. The "relationship extraction" section is interesting and well written and the authors might want to put the same effort in the rest of the sections.
  We appreciate that you find the study interesting and we thank you for your useful comments on the content that will help to improve the manuscript. We would like to state that the primary aim of the study was to demonstrate a practical workflow of several NLP techniques for summarising a large body of scientific literature. This was not properly reflected in the aims of our study. We will modify the aims accordingly in the revised version of the manuscript.
  We acknowledge that the “topic analysis” part is less developed and only weakly met objective 2, which was to assess whether a paper is relevant to a given topic. We therefore plan to restructure the content on topic classification in the manuscript. Instead of classifying “new papers” into different topics, we will demonstrate how to identify groups of manuscripts (in our case, groups around different types of “agricultural practices”) and observe which groups are less represented (or absent). In this way, we can show which practices are less studied and identify possible knowledge gaps. This also serves as a first classification to identify, for instance, which topics would be well suited to a meta-analysis.
  
  Specific comments:
  - Abstract: The beginning of the abstract seems a bit disconnected from the rest of the manuscript. Climate change is a hot topic but the paper itself is not related to that. I would suggest re-framing the abstract to match the content of the manuscript.
  We will rephrase the abstract so that the main focus is on NLP techniques to summarise a large body of environmental scientific literature, and then present the OTIM and Meta corpora as a case study to which we applied these techniques.
  - Assessing the ability of an algorithm such as regex: I find this evaluation a bit strange. The algorithm itself is infallible in the sense that it always finds what you tell it to find if it is present in the text. The algorithm is only restricted by the capacity of the user to generate valid regular expressions.
  We agree that the regex algorithm is infallible. In this case, however, we want to estimate how well user-defined regexes are able to recover specific information. We will make clear in the manuscript that we do not assess the ability of the regex algorithm itself but rather the ability of the user-generated regular expressions to match relevant content, considering the trade-off between their generality and their specificity.
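To illustrate the kind of evaluation described in this reply, the following is a minimal sketch, not the authors' code. It scores one user-defined regular expression against manually labelled values using precision and recall. The pattern `RAINFALL_RE`, the helper names, the example sentence and the labelled set are all hypothetical and chosen only to show the generality/specificity trade-off.

```python
import re

# Hypothetical user-defined pattern: any number followed by "mm".
# It is deliberately general, so it can also match values that are not rainfall.
RAINFALL_RE = re.compile(r"\b(\d{1,3}(?:,\d{3})*|\d+)\s?mm\b")


def extract(text):
    """Return the set of numeric strings the pattern captures in the text."""
    return {m.group(1).replace(",", "") for m in RAINFALL_RE.finditer(text)}


def precision_recall(found, expected):
    """Compare extracted values with manually labelled values."""
    true_positives = len(found & expected)
    precision = true_positives / len(found) if found else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


# Toy sentence and manual label (assumptions for illustration only).
text = "Mean annual rainfall was 650 mm; aggregates were sieved at 2 mm."
expected = {"650"}

found = extract(text)
precision, recall = precision_recall(found, expected)
print(sorted(found), precision, recall)
```

On this toy sentence the pattern also captures the 2 mm sieve size, so recall is 1.0 while precision drops to 0.5. A more specific pattern (for example one that requires the word "rainfall" nearby) would raise precision at the risk of missing differently worded sentences.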
  - Topic modelling: There is no discussion.
  Further discussion will be added, especially on how topic classification can be used as one of the first steps of the presented semi-automated NLP workflow for information summary and identifying groups of abundant literature where a meta-analysis can be useful.
  - How did you achieve your second aim (to illustrate the ability of topic classification to classify a new paper as relevant to a given topic)?
  (see general comment)
  - You mention that topic modelling "can help identify knowledge gaps". How? Did you find any? If your aim is to present a practical workflow, perhaps you should guide the user to achieve that.
  We agree that a practical interpretation will be a useful addition to the manuscript. We will give a few examples in the manuscript and explain how we identified them.
  - Why did you select 6 topics instead of 9? You only mention that you are trying to maximise the coherence, which is higher for 9 topics.
  That is a fair point and will be corrected in the next version of the manuscript.
  - How might the number of topics affect your workflow? Is selecting the highest coherence score infallible?
  It is not infallible, and we found that choosing between 6 and 9 topics tends to lead to the same groups. The variability in coherence for each number of topics can be large, especially for a relatively small corpus such as ours. This will be discussed in the revised version of the manuscript.
  - Could you elaborate on how excluding monograms increased the coherence? From the term frequencies (Fig. 7) I do not see many soil-related terms, which seems strange. Perhaps they were ignored since they appeared as monograms? I do agree that bigrams and even trigrams are important but I have usually seen them added to a selection of monograms.
  In our case, the inclusion of monograms led words like ‘soil’, ‘treatment’, ‘water’, ‘crop’ or ‘tillage’ to appear prominently in the different topics. This made it harder to differentiate the topics, and the average topic coherence in this case was Cv = 0.4. With only bigrams, some of these words carried more meaning (“conventional tillage”, “soil water”, “cover crop”), which made it easier to see what each topic is about. This is why, in this case, we preferred to use only bigrams. This remark is a good point and we recognize that adding monograms, as done in other work, can sometimes help. This will be discussed in the revised manuscript.
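The difference between monogram and bigram terms discussed above can be sketched with a toy example. This is an illustration under stated assumptions, not the authors' pipeline: the two sentences, the stop-word list and the helper names are hypothetical, and no topic model or coherence score is computed here.

```python
from collections import Counter
import re

# Small hypothetical stop-word list, for illustration only.
STOP_WORDS = {"the", "under", "with", "of", "and", "in", "on"}


def words(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def monograms(tokens):
    """Single words, with stop words removed."""
    return [t for t in tokens if t not in STOP_WORDS]


def bigrams(tokens):
    """Pairs of adjacent words, dropping any pair that contains a stop word."""
    return [
        f"{a} {b}"
        for a, b in zip(tokens, tokens[1:])
        if a not in STOP_WORDS and b not in STOP_WORDS
    ]


# Toy abstracts (assumptions, not taken from the corpus).
docs = [
    "Conventional tillage reduced soil water under the cover crop.",
    "Cover crop residues increased soil water compared with conventional tillage.",
]

monogram_counts = Counter(t for d in docs for t in monograms(words(d)))
bigram_counts = Counter(t for d in docs for t in bigrams(words(d)))

print(monogram_counts.most_common(6))
print(bigram_counts.most_common(3))
```

In this toy corpus the most frequent monograms are generic words such as "soil", "water" and "crop", whereas the most frequent bigrams are "conventional tillage", "soil water" and "cover crop", which say more about what a group of papers is about.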
  
  Citation: https://doi.org/10.5194/egusphere-2022-535-AC1
RC2:
'Comment on egusphere-2022-535', Anonymous Referee #2, 28 Nov 2022

General:

Overall this manuscript fits well with SOIL, and the methodology as well as the results will be of interest to readers. The nature of the study, involving "natural language processing for metadata extraction from environmental {soil} scientific publications" is inherently multidisciplinary, and complex! The necessary methods are well discussed and well referenced, and the appendix of the NLP software will be a big help to researchers in this field. The results relating agricultural practices and soil and site properties are novel and important.

Specific:

Most SOIL readers are probably substantially unfamiliar with NLP and would benefit from more focused guidance by the authors, which can be accomplished perhaps most easily by a trimmed revision. For example the Abstract is overly complex; the Introduction states the objectives of the study on just four lines 96-100, and a trimmed Abstract could focus simply on achieving the objectives.

The Material and Methods section is appropriately long, given the emphasis on methods, but could be edited to be more uniformly coherent. Perhaps part of that could be fixed by reformatting the variety of figures, and relegating some of them to just the appendix.

Most of the figures in the Results section are important, but much of the other discussions in Results are really recommendations and can be eliminated or partly moved to Conclusions.

Technical:

I see Reviewer #1 listed some technical issues, most of which I believe can be handled by trimming as suggested.

Citation: https://doi.org/10.5194/egusphere-2022-535-RC2
- AC2: 'Reply on RC2', Guillaume Blanchy, 12 Jan 2023
  
  General:
  Overall this manuscript fits well with SOIL, and the methodology as well as the results will be of interest to readers. The nature of the study, involving "natural language processing for metadata extraction from environmental {soil} scientific publications" is inherently multidisciplinary, and complex! The necessary methods are well discussed and well referenced, and the appendix of the NLP software will be a big help to researchers in this field. The results relating agricultural practices and soil and site properties are novel and important.
  We appreciate that you find this manuscript well suited to the journal SOIL and, more specifically, to a multidisciplinary topic related to agricultural practices. We are also glad to hear that our effort towards a reproducible workflow (by means of notebooks and a GitHub repository) is acknowledged.
  
  Specific:
  Most SOIL readers are probably substantially unfamiliar with NLP and would benefit from more focused guidance by the authors, which can be accomplished perhaps most easily by a trimmed revision. For example the Abstract is overly complex; the Introduction states the objectives of the study on just four lines 96-100, and a trimmed Abstract could focus simply on achieving the objectives.
  Agreed. As mentioned in our reply to RC1, we will refocus the abstract around “NLP techniques” and the objectives we want to address in this work. Additionally, we will make sure that the NLP-specific language is explained and simplified to make the abstract accessible to most readers.
  The Material and Methods section is appropriately long, given the emphasis on methods, but could be edited to be more uniformly coherent. Perhaps part of that could be fixed by reformatting the variety of figures, and relegating some of them to just the appendix.
  Figure 3 and Table 2 will be moved to the appendix to ease the flow of the Material and Methods section.
  Most of the figures in the Results section are important, but much of the other discussions in Results are really recommendations and can be eliminated or partly moved to Conclusions.
  Thank you for the feedback. We will edit the results and discussion accordingly and move recommendations to the conclusions section.
  Technical:
  I see Reviewer #1 listed some technical issues, most of which I believe can be handled by trimming as suggested.
  See reply to RC1.
  
  Citation: https://doi.org/10.5194/egusphere-2022-535-AC2

Peer review completion

AR: Author's response | RR: Referee report | ED: Editor decision | EF: Editorial file upload

ED: Publish subject to minor revisions (review by editor) (13 Jan 2023) by Olivier Evrard

AR by Guillaume Blanchy on behalf of the Authors (27 Jan 2023) Author's response Author's tracked changes Manuscript

ED: Publish as is (27 Jan 2023) by Olivier Evrard

ED: Publish as is (03 Feb 2023) by Kristof Van Oost (Executive editor)

AR by Guillaume Blanchy on behalf of the Authors (13 Feb 2023) Manuscript

Short summary

Adapting agricultural practices to future climatic conditions requires us to synthesize the effects of management practices on soil properties with respect to local soil and climate. We showcase different automated text-processing methods to identify topics, extract metadata for building a database and summarize findings from publication abstracts. While human intervention remains essential, these methods show great potential to support evidence synthesis from large numbers of publications.