Data sharing and collaboration are critical to solving large-scale problems. The prevailing soil data-sharing model is based on different groups sending their data to a lead party. This model is centralised and, consequently, requires the participants to cede control and governance over their data to the lead party. Here we explore the use of a distributed ledger (blockchain) to address these issues. We explain what a blockchain is and outline some of its characteristics, and then describe the features that make it an interesting candidate for an inter-institutional database. Finally, we describe the potential use case of developing a global soil spectral library with multiple, independent international institutions constituting the network.
Soil is a key component of ecosystems, and the need for soil information to monitor its condition is
increasing. A large amount of soil data has been collected in the last century, with a marked
increase during the 1970s and 1980s, and many organisations are performing the exhaustive task of
“rescuing” and organising those data in more accessible formats.
Most collected soil data are useful for solving problems locally, but they are too fragmented to tackle more
general issues. This applies at various levels of granularity, including different teams within an
institution, a single institution in different regional locations, and multiple institutions either
within a country or internationally. In these cases, collaboration and data sharing become
paramount. The soil community recognises this collaboration need and has responded by creating
different data-sharing initiatives. For instance,
A potential solution for the data control and governance issues derived from the implementation of a centralised data-sharing system is the use of a distributed ledger or blockchain. The aim of this paper is to delineate the requirements for a functional, decentralised, inter-institutional database (IIDB) to share soil information in a distributed ledger or blockchain. We mainly focus on the technical considerations of data sharing instead of its social, political, and organisational aspects, keeping in mind that the latter are important for any data-sharing system, decentralised or not. First, we introduce some terms that will be used throughout this paper and then explain what a blockchain is and what some of its characteristics are. Second, we describe some features of a blockchain that make it an interesting candidate for an IIDB. Finally, we present a use case of a collaborative effort that could be a good fit for the proposed model.
Before defining what a blockchain is, we introduce a list of definitions that are used throughout this paper.
Single point of failure: a potential risk caused by a poor system design where a single fault at that point can affect the correct functioning of the system.
Hash: an alphanumeric string generated by mapping data of an arbitrary size onto data of a fixed size.
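The mapping from data of an arbitrary size onto a fixed-size string described above can be illustrated with a minimal sketch. It assumes SHA-256 from Python's standard library as the hash function; that choice and the sample records are illustrative assumptions, not something prescribed in this paper.

```python
import hashlib


def digest(data: bytes) -> str:
    """Return the SHA-256 digest of `data` as a 64-character hex string."""
    return hashlib.sha256(data).hexdigest()


# Two inputs of very different sizes (hypothetical soil records).
short_record = b"pH=6.5"
long_record = b"pH=6.5;clay=23.1;" * 10_000

# The digest length is the same regardless of the input size.
print(len(digest(short_record)), len(digest(long_record)))

# Changing a single character produces a different digest.
print(digest(b"pH=6.5") == digest(b"pH=6.6"))
```

Because any change to the input changes the digest, a stored hash can act as a compact fingerprint for checking that a record has not been altered.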
In simple terms, a blockchain is a linked sequence of records of the transactions of digital assets
(Fig.
Diagram of three consecutive blocks (two transactions) within a blockchain.
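The idea of a linked sequence of records can be sketched in a few lines. This is a simplified model under stated assumptions: each block holds one transaction plus the SHA-256 hash of the previous block, and the field names (`previous_hash`, `transaction`) and sample identifiers are hypothetical. It is not meant to reproduce the exact layout of the figure or of any particular blockchain implementation.

```python
import hashlib
import json


def block_hash(block: dict) -> str:
    """Hash a block's full contents, including the previous block's hash."""
    payload = json.dumps(block, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def append_block(chain: list, transaction: dict) -> None:
    """Append a block that stores the hash of the block before it."""
    previous = block_hash(chain[-1]) if chain else None
    chain.append({"previous_hash": previous, "transaction": transaction})


def is_consistent(chain: list) -> bool:
    """Check that every block still stores the hash of its predecessor."""
    return all(
        chain[i]["previous_hash"] == block_hash(chain[i - 1])
        for i in range(1, len(chain))
    )


chain = []
append_block(chain, {"type": "create", "asset": "sample-001"})
append_block(chain, {"type": "transfer", "asset": "sample-001", "to": "user-B"})
append_block(chain, {"type": "transfer", "asset": "sample-001", "to": "user-C"})
print(is_consistent(chain))  # chain exactly as it was built

chain[0]["transaction"]["asset"] = "sample-999"  # alter an early block
print(is_consistent(chain))  # the hash stored in the next block no longer matches
```

Since each block depends on the hash of its predecessor, altering an earlier record breaks the link to every block that follows it.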
It is worth detailing what a key pair is and how it operates in the context of signing
transactions. In asymmetric cryptography, two keys are used – private and public keys
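To make the roles of the two keys concrete, the sketch below signs a transaction with a private key and verifies it with the matching public key. It uses textbook RSA with tiny primes purely for illustration; this is an assumption made to keep the example dependency-free, it is not secure, and production systems rely on vetted cryptographic libraries and signature schemes.

```python
import hashlib

# Textbook RSA with tiny primes: for illustration only, NOT secure.
P, Q = 61, 53
N = P * Q                           # modulus, shared by both keys
E = 17                              # public exponent (public key: N, E)
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (requires Python 3.8+)


def message_digest(message: bytes) -> int:
    """Reduce a SHA-256 digest to an integer smaller than the toy modulus."""
    return int.from_bytes(hashlib.sha256(message).digest(), "big") % N


def sign(message: bytes, private_exponent: int) -> int:
    """Create a signature; only the holder of the private key can do this."""
    return pow(message_digest(message), private_exponent, N)


def verify(message: bytes, signature: int, public_exponent: int) -> bool:
    """Check a signature; anyone who knows the public key can do this."""
    return pow(signature, public_exponent, N) == message_digest(message)


transaction = b"create asset sample-001"   # hypothetical transaction payload
signature = sign(transaction, D)

print(verify(transaction, signature, E))            # genuine signature
print(verify(transaction, (signature + 1) % N, E))  # altered signature
```

The private key never leaves its owner, while the public key can be distributed freely so that every node is able to check who signed a transaction.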
By design, a blockchain usually operates within a network of interconnected nodes
(Fig.
Data flows in two different soil information system infrastructures:
Blockchain technology is a diverse ecosystem with many implementations that differ in their
characteristics and efficiency. For instance, popular implementations such as Bitcoin, Ethereum,
Litecoin, and Monero are computation-intensive and require large energy input due to their consensus
algorithm (proof-of-work), consuming more energy than mineral mining (copper, gold, platinum, and
rare earth oxides) to produce an equivalent market value
Besides providing a solution to the aforementioned problems, namely centralised data control and governance, a blockchain has other characteristics that make it an interesting candidate for an IIDB. Some of these solutions and characteristics are described in this section.
As mentioned before, the main characteristic of a blockchain is the decentralised nature of the system. Each node of the network keeps a copy of the blockchain, which is synchronised after every new transaction (creation or transfer). Assuming that each node of the network is controlled by a different party, there is no centralised data storage, and hence no single point of failure or control. Normally, in a well-designed, diverse network, a significant number of the nodes can be compromised without affecting its integrity.
Because all the nodes have a copy of the blockchain and act as validators, malicious modifications to the data are very difficult (see immutability section). Tampering with the data is only possible if most of the nodes collude, which can be avoided by ensuring a diverse network.
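The reasoning above, that a modified copy stands out because the other nodes hold the same data, can be sketched as follows. The comparison by simple majority of digests, the node names, and the sample record are assumptions made for illustration; real consensus protocols are considerably more involved, and ties are not handled here.

```python
import copy
import hashlib
import json
from collections import Counter


def copy_digest(ledger: list) -> str:
    """Fingerprint one node's copy of the ledger."""
    payload = json.dumps(ledger, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def divergent_nodes(copies: dict) -> list:
    """Return the nodes whose copy differs from the one most nodes hold."""
    digests = {node: copy_digest(ledger) for node, ledger in copies.items()}
    majority_digest, _ = Counter(digests.values()).most_common(1)[0]
    return sorted(node for node, d in digests.items() if d != majority_digest)


reference = [{"asset": "sample-001", "clay_percent": 23.1}]
copies = {
    name: copy.deepcopy(reference)
    for name in ("node-A", "node-B", "node-C", "node-D")
}
copies["node-C"][0]["clay_percent"] = 99.9  # one compromised copy

print(divergent_nodes(copies))  # ['node-C']
```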
For intra-institutional data sharing, a blockchain system can also be implemented to replace a traditional, permissioned database. The advantages are similar to the inter-institutional case, including each team leader having “ownership” of their data, data traceability, data access logging and potentially preventing unauthorised access, and preventing malicious modifications or deletions. Data are among the most valuable assets of any company, and adding this extra layer of security to ensure their integrity should be a priority, and even mandatory for publicly funded institutions.
Data governance defines the norms, principles, and rules under which the activities of a consortium
should be conducted. It might include important details such as data release and rights to publish
with consortium data first, research output rules (e.g. authorship sequence in consortium
publications), whether the data should be shared with non-consortium members
Using a technology such as blockchain does not replace the initial process of negotiations or the
effort of setting rules, but it can help reduce some of the friction points. Many of the clauses
included in a data-sharing agreement can be programmatically enforced and, since the network is
collectively governed, changed over time via a democratic process. Usually, any node of the network
can propose an election process where the rest of the nodes cast a vote transaction, which is also
appended to the chain. If the “super-majority” (usually a large proportion such as at least
When a new asset is created, it is cryptographically signed and assigned to one or more users' public key(s). If the data need to be transferred (either to make corrections or include new information, or to another user), only the owners are capable of doing so by using the corresponding private keys, even if all the blockchain data are available at every node. This process is automatically validated by all the nodes, which check that the signatures match the public key(s) of the owner(s) before proceeding with the transfer.
Here we refer to data ownership as the link between a user and a digital asset, without any legal implication. Like in any database, decentralised or not, we are assuming that the user has legal rights to upload the data, which should be properly acknowledged, following the rules defined by the consortium. All this information can be included within each asset, permanently linking data and metadata, where any change can be recorded in case of ownership changes. If required, the network can perform basic checks to ensure that the metadata are included or even just provide access to encrypted data to authorised users.
Since the blocks of the chain are linked (Fig.
As in the data ownership case, here we assume that the asset contains data that are legitimate and error-free. In any system, decentralised or not, it is difficult to control what happens to the data before their ingestion into the system. Although pre-ingestion checks could be implemented, it would probably always be possible to “cheat the system”. It is important to consider that the parties have implicit incentives to provide legitimate data, such as maintaining their credibility, especially given the transparency of a decentralised system (ownership and immutability).
Although a blockchain data-sharing model has applications at many levels of granularity (inter- and
intra-institutional, and international), we would like to focus on the use case of creating
a multi-party (e.g. multi-institutional, multi-national, global) soil spectral library. Spectral
soil data can be compared to the digital fingerprint of a particular soil sample which encodes
information about its physical, chemical, and biological properties
After all the efforts from different institutions to collaborate in a common initiative, it is only fair that the data-sharing infrastructure is carefully designed to ensure democratic access to the data as well as democratic control and governance over them. We believe that, in general, a decentralised system can guard those interests for all parties involved. Particularly in the case of a global soil spectral library, the use of a decentralised database is of critical importance since the resulting database could be used by national reference centres for soil analysis. The level of transparency and security that a distributed ledger offers ensures that the reference data have not been tampered with and also, given its decentralised model, will maximise accessibility. In the following sections we explore certain implementation aspects of a decentralised data-sharing system in the context of a global soil spectral library.
The potential members of the consortium would have enough analytical capacity to measure the spectral response of soil samples and also to perform laboratory analyses to measure the corresponding physical, chemical, and biological soil properties. This includes universities and commercial soil laboratories from different countries.
Each member should have the computational infrastructure needed to become a node of the network. The requirements are not prohibitive and include enough capacity to store all the data and an Internet connection. Each node should generate its public–private key pair, securely store a copy of the private key, and distribute the public key to the rest of the members. To start the network, all the public keys should be known by all the members. Once the network is functional, more nodes can be added with the approval of most of the current members via an election process.
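The election step for admitting a new node can be sketched as a simple vote tally. The function name, the member identifiers, and the rule that members who do not vote count as not approving are all assumptions made for this example; the approval threshold is left as a parameter because it is something the consortium would have to agree on.

```python
from fractions import Fraction


def election_passes(votes: dict, members: set, threshold: Fraction) -> bool:
    """Return True if the share of approving members exceeds `threshold`.

    Votes from non-members are ignored, and members who did not vote
    count as not approving (an assumption of this sketch).
    """
    approvals = sum(1 for member in members if votes.get(member) is True)
    return Fraction(approvals, len(members)) > threshold


members = {"inst-A", "inst-B", "inst-C", "inst-D"}
votes = {
    "inst-A": True,
    "inst-B": True,
    "inst-C": True,
    "inst-D": False,
    "outsider": True,  # not a member, so this vote is ignored
}

# "Most of the current members" is read here as more than one half.
print(election_passes(votes, members, Fraction(1, 2)))  # 3 of 4 approve
```

Using exact fractions avoids rounding surprises when the approving share sits exactly on the threshold.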
In terms of the network users, it is possible to have multiple users per node (e.g. different researchers from a single university). Ideally, all the users should have their own public–private key pair to sign their transactions, and their public keys should be known to all the users. This information can also be stored in the blockchain as a public ledger of who can access the data.
As mentioned in Sect.
After the network is functional, any member can create new transactions to add data that will be synchronised between all the nodes, ensuring immediate accessibility to the data to all the members. The structure of what constitutes an “asset” should be defined during the consortium initiation period. For instance, the asset could be a single soil sample with its corresponding analytical data (Snippet 1). The system should support the use of numerical and text data to store all the necessary soil properties and metadata. Complex data structures such as soil spectral data can be stored as comma-separated numbers or compressed.
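As a purely illustrative sketch of what such an asset could look like (it is not the paper's Snippet 1), the example below assembles a single soil sample with metadata, measured properties, and a compressed spectrum. All field names and values are made up, and the use of zlib compression with Base64 encoding is an assumption; the consortium would define the actual structure.

```python
import base64
import json
import zlib


def encode_spectrum(reflectance: list) -> str:
    """Compress comma-separated reflectance values into a Base64 string."""
    text = ",".join(f"{value:.4f}" for value in reflectance)
    return base64.b64encode(zlib.compress(text.encode("ascii"))).decode("ascii")


def decode_spectrum(encoded: str) -> list:
    """Invert `encode_spectrum`, returning the values as floats."""
    text = zlib.decompress(base64.b64decode(encoded)).decode("ascii")
    return [float(value) for value in text.split(",")]


spectrum = [0.1021, 0.1043, 0.1078, 0.1102]  # made-up reflectance values

asset = {
    "sample_id": "sample-001",
    "metadata": {"contributor": "inst-A", "depth_cm": [0, 10]},
    "properties": {"clay_percent": 23.1, "ph_water": 6.5},
    "spectrum": {"first_wavelength_nm": 350, "step_nm": 1,
                 "values": encode_spectrum(spectrum)},
}

print(json.dumps(asset, indent=2))
```

Keeping the metadata inside the asset means that data and provenance travel together in every transaction.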
The new transaction should be signed with the user's private key and the asset ownership set to
a user's public key. This provides a way of authenticating the origin of the data and allows the
user, and only that user, to create updated versions of that asset if needed (e.g. when new
properties are measured or to correct potential errors). Before a new transaction is appended to the
blockchain, a “super-majority” of the voting power must agree on the validity of that
transaction. The most basic validation is to ensure that the owner(s) are signing the transaction,
but in practice it is possible to set any logical rules. This provides the opportunity to give
certain groups of users control over an asset, define a minimum number of owners, perform basic
data integrity checks (plausible values, encoding of names), etc. Of course, as mentioned in
Sect.
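The kind of logical rules a node could apply before accepting a transaction can be sketched as a plain validation function. The rule set, the field names (`owners`, `signed_by`, `ph_water`), and the plausible pH range are hypothetical, and the cryptographic verification of each signature is assumed to have happened beforehand, so `signed_by` simply lists the public keys whose signatures were verified.

```python
def validate_transaction(transaction: dict, min_owners: int = 1) -> list:
    """Return a list of rule violations; an empty list means 'valid'."""
    problems = []

    owners = transaction.get("owners", [])
    if len(owners) < min_owners:
        problems.append("too few owners")

    # Every owner must be among the keys whose signatures were verified.
    if not set(transaction.get("signed_by", [])) >= set(owners):
        problems.append("not signed by every owner")

    # Basic plausibility check on one measured property.
    ph = transaction.get("properties", {}).get("ph_water")
    if ph is not None and not 0 <= ph <= 14:
        problems.append("implausible pH")

    return problems


good = {"owners": ["pk-A"], "signed_by": ["pk-A"],
        "properties": {"ph_water": 6.5}}
bad = {"owners": ["pk-A", "pk-B"], "signed_by": ["pk-A"],
       "properties": {"ph_water": 15.2}}

print(validate_transaction(good))
print(validate_transaction(bad, min_owners=3))
```

Returning the list of violations, rather than a single boolean, lets a node report why a transaction was rejected.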
Since every node keeps a copy of the blockchain locally, it is possible to retrieve data from any node from the network, providing extra redundancy and hence assuring accessibility in case of malfunction of some of the nodes. Advanced users can query their local copy of the database directly. A friendlier way of providing access to read the data is via an application programming interface (API) that connects any user with a node. That API can perform tasks such as querying the blockchain to retrieve specific data, provide the history of any asset, and potentially process data using pipelines approved by the consortium.
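The read operations mentioned above, retrieving specific data and the history of an asset, could be exposed by functions such as the ones below. This is a minimal sketch over an in-memory list standing in for a node's local copy of the chain; the function names and the block layout are assumptions for illustration, not the interface of an existing API.

```python
from typing import Optional


def asset_history(chain: list, sample_id: str) -> list:
    """Return every recorded version of an asset, oldest first."""
    return [
        block["asset"] for block in chain
        if block["asset"]["sample_id"] == sample_id
    ]


def latest_version(chain: list, sample_id: str) -> Optional[dict]:
    """Return the most recent version of an asset, or None if unknown."""
    history = asset_history(chain, sample_id)
    return history[-1] if history else None


# Hypothetical local copy of the chain: two versions of one sample
# (a correction to the clay content) and one version of another.
local_chain = [
    {"asset": {"sample_id": "sample-001", "clay_percent": 32.1}},
    {"asset": {"sample_id": "sample-002", "clay_percent": 12.4}},
    {"asset": {"sample_id": "sample-001", "clay_percent": 23.1}},
]

print(len(asset_history(local_chain, "sample-001")))
print(latest_version(local_chain, "sample-001"))
print(latest_version(local_chain, "sample-404"))
```

Because earlier versions are never overwritten, the full history remains available alongside the latest values.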
Most of the specific blockchain operations (i.e. signing and verifying transactions) are performed in the background. There is no extra overhead for the users besides keeping their respective private keys safe. A user interface can be built on top of an API so users can access the system as if it were a traditional data management system (DMS), with capabilities to query and retrieve the data from the network.
In terms of the types of users with access to the system, any person with access to a node has complete reading access to the blockchain. If public access is required to allow non-consortium members to connect to the database, multiple solutions are available, including single or multiple nodes acting as a web server. Using multiple nodes as web servers might reduce latency, especially when the consortium spans different countries (i.e. an external user can connect to the closest node). Again, a platform can be built to ensure the public experience is identical to a normal DMS.
The prevailing soil data-sharing model is centralised, with users ceding control and governance over their data to a lead party. We propose the use of a public ledger (blockchain) to create a decentralised soil data-sharing network. This network provides a series of advantages to the participant institutions, including preserving ownership and control over their data, having instant access to the complete database, ensuring that once the data are appended to the blockchain they cannot be tampered with, and actively participating in governance decisions, such as adding new members through elections facilitated by the system.
Ultimately, any consortium data-sharing agreement is based on trust between the participants. By using a blockchain network, the need for trust is removed since rules can be programmatically enforced and the data become tamper-resistant. This protects the already existing trust bond between the consortium members and, potentially, allows the consortium to expand its reach by working with new parties that are not fully trusted.
For intra-institutional data sharing, a blockchain system can also be implemented to replace a traditional, permissioned database. The advantages include each team leader having “ownership” of their data, data traceability, data access logging and potentially preventing unauthorised access, and preventing malicious modifications or deletions.
JP conceived the concept and wrote the first draft. Both authors contributed to generating and reviewing the subsequent versions of the manuscript.
The authors declare that they have no conflict of interest.
This paper was edited by Jan Vanderborght and reviewed by Dominique Arrouays and two anonymous referees.