Data Integrity



    Data integrity is the property that the data used in a solution is correct, reliable, and useful for all participants. The term “data integrity” is used here in the broader sense ubiquitous in the supply-chain world, referring not only to a resistance to unintended data modification, but also to the completeness, timeliness, and accuracy of the data over its entire lifetime.

    This module covers typical considerations around ensuring that the data used in a blockchain solution is correct, reliable, timely for all participants, and preserved from the point of data creation to the point of usage on the blockchain. This module emphasises that blockchain technology does not necessarily ensure accuracy of data entered on-chain. It highlights that there are indeed multiple stages and steps where data integrity can be compromised.

    The importance of data integrity and key requirements

    What are the key requirements for achieving data integrity in a blockchain context?

    Data integrity is not new to the supply-chain industry – capturing relevant data with integrity has long been a priority. Using a blockchain, however, does not by design ensure the accuracy of the data entered on-chain. Nevertheless, blockchains specifically protect against the manipulation of data, which is immutable once it goes on the shared ledger. Once data is entered and confirmed through the consensus process, blockchain technology provides strong protection from further changes, since any such change would be easily noticed by other participants on the network. Thus, blockchain helps to establish a higher level of traceability and auditability, so that any data that was entered inaccurately prior to consensus can be traced back to its origin.[95]

    Like any supply-chain solution, blockchains, too, must be designed for data integrity; otherwise the solution will be fragile at best and completely non-functional at worst. This module discusses the challenges to data integrity that arise in a blockchain and supply-chain deployment, and offers nuanced answers and thought frameworks to guide decision-makers along this process.

    Because the purpose of using a blockchain is to collect and manage data in a way that is useful to participants, it is a given that the data used must be accurate, reliable, and timely. Achieving data integrity within blockchain applications is broadly composed of three requirements: data origin integrity, oracle integrity, and digital-twin integrity (Figure 9.1 – Data integrity requirements).

    Achieving data integrity within blockchain applications is broadly composed of three pillars: data origin integrity, oracle integrity, and digital-twin integrity.

    Data integrity requirements
    Figure 9.1 – Data integrity requirements

    Data origin integrity: A common misconception is that the use of a blockchain alone can ensure data integrity. However, even though a blockchain can reliably prevent the undetected modification of data once it is confirmed on-chain, it enforces this only on the data it is given. If the data is not accurate to begin with, then making it immutable by storing it on a blockchain does not provide any benefit – “garbage in, garbage out.”

    Blockchain technology can’t solve for the human factor. If someone inputs garbage data onto a blockchain, that garbage is recorded forever and can inadvertently become a flawed source of truth. Thus, an analysis of data hygiene is a critical precursor to any blockchain deployment.

    Sheila Warren, Platform Head – Blockchain, Digital Currency, and Data Policy, World Economic Forum

    Thus, it is clear that in order to guarantee data integrity in a blockchain and supply chain solution, the accuracy and reliability of data must be preserved from the point of creation to the point of usage on the blockchain. This is referred to as data origin integrity. A lack of data origin integrity will prevent blockchain participants from drawing useful insights from the data on the blockchain, since the data itself is faulty.

    In order to guarantee data integrity in a blockchain and supply chain solution, the accuracy and reliability of data must be preserved from the point of creation to the point of usage on the blockchain.

    Oracle integrity: A common step where problems can occur is at the point of submission to the blockchain. Since blockchains themselves cannot directly access information about the real world such as the status of a shipment, weather conditions, and commodity prices, blockchains must rely on third parties to submit this information, commonly referred to as oracles. The entity submitting the information (the oracle) is often the same entity as the one that provides the data (the data provider or data origin). In either case, these oracles are trusted. Depending on the environment the blockchain solution operates in, a degree of care must be taken to ensure that oracles have not modified or omitted data before submission to the blockchain. This is referred to as oracle integrity. A failure to achieve oracle integrity leaves a blockchain system susceptible to manipulation and exploitation by malicious actors.

    These concepts sound similar to the “Oracle Problem”. What’s the difference?

    This problem of ensuring the accuracy and correctness of data at the time it is submitted to the blockchain is widely referred to in the blockchain industry as the “Oracle Problem”. This is simply a different naming convention. The terms “data origin integrity” and “oracle integrity” are used to reflect the fact that the security of the oracle is only one component of the overall solution, and that achieving the broader goal of data integrity requires thinking back to where the data was created in the first place.

    Digital-twin integrity: Lastly, it is common for blockchain and supply-chain solutions to represent real-world objects such as materials and products on the blockchain in a digital form such as a token. This digital representation is referred to as the real-world object’s ‘digital twin’. The idea is that useful real-world data about the object, such as its identity, current location, and other metrics, can be attached to this digital twin in order to yield useful insights about the condition of the object in the real world, and updated as conditions change. The obvious concerns with this design are whether the data attached to the digital twin presents an accurate and timely view of the physical object and whether the link between the physical object and digital twin may have been compromised. These considerations altogether constitute the property of digital-twin integrity. A lack of digital-twin integrity will cause the digital twins to no longer be an accurate representation of reality, which can prevent the detection of lost, stolen, and counterfeit goods.

    What about off-chain data?

    It is common practice in blockchain deployments to only store the hash digest of data on-chain instead of the data itself when the dataset is particularly large, perhaps including documents, images, videos, long strings of text, or other elements. Storing all of this on the blockchain can lead to blockchain bloat.

    To address this issue, the larger dataset may be stored somewhere off-chain, whether in a shared database, another blockchain, or a peer-to-peer network like InterPlanetary File System (IPFS). The on-chain hashes of the data can then help a blockchain refer to the off-chain data as needed.
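    The hash-anchoring pattern described above can be sketched in a few lines of Python. This is an illustrative sketch, not a particular platform's API: the manifest contents and function names are invented for the example, and only the digest would actually be written on-chain.

```python
import hashlib

def anchor(document: bytes) -> str:
    """Compute the digest that would be stored on-chain."""
    return hashlib.sha256(document).hexdigest()

def verify(document: bytes, on_chain_digest: str) -> bool:
    """Re-hash the off-chain document and compare it to the anchored digest."""
    return anchor(document) == on_chain_digest

# A shipment manifest stored off-chain; only its 32-byte digest goes on-chain.
manifest = b'{"shipment": "SH-1042", "items": 18, "carrier": "ACME"}'
digest = anchor(manifest)

assert verify(manifest, digest)                      # untouched data passes
assert not verify(manifest + b" tampered", digest)   # any modification is detected
```

    Because the digest changes unpredictably with any change to the input, unwanted modifications to the off-chain copy cannot go undetected, which is exactly the guarantee discussed in the following paragraph.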

    This arrangement guarantees that unwanted modifications to the data will not go undetected, but the same problems, considerations, and solutions relevant to data integrity still apply. The data must still be validated in some way, but since the smart contracts on the blockchain cannot do this directly, the validation needs to be completed by a different architectural component, such as the clients of participating users or even a trusted execution environment.

    Since a relatively straightforward change in configuration is sufficient to address this concern, data integrity for on-chain vs off-chain data will not be discussed further. The module Data Protection covers additional information on the off-chain approach to protect data.

    The rest of this module discusses these requirements in more detail and presents solutions for each, with an emphasis on techniques and solutions specific to blockchain-based supply chain deployments.

    The data pipeline - from creation to confirmation

    How exactly does data move from the point of origin to a blockchain network? Where should one look for potential data integrity violations?

    In every blockchain solution that relies on external data, data is originated, submitted to the blockchain by an oracle, and finally confirmed and made usable for blockchain applications. In order to clarify the thinking around this process and raise an awareness of common threats to data integrity, it is helpful to conceptualise data as flowing along a pipeline that includes various stages of processing (Figure 9.2 – Different stages in the lifecycle of data).

    Different stages in the lifecycle of data
    Figure 9.2 – Different stages in the lifecycle of data

    Stages in the data pipeline:

    • Creation/Cleaning: Measurements are made, and raw data is produced. It may come in the form of numbers, text, images, videos, or other structured and unstructured formats. It may be inputted manually by humans or collected automatically by computers and devices, or both. The data is cleaned, enhancing its usefulness, which may include quality assurance, standardisation, analysis, and conversion to usable formats. There is always a human or organisation involved in collecting the data, with varying motivations for doing so.
    • Storage/Gateway: The data is stored somewhere. The data may or may not be stored by the same entity that produced it. If necessary, it is made accessible to relevant parties, through some gateway, whether it is a website, a database download, an application programming interface (API), or simply just physical access to paper records. Usually, a request for data through this gateway simply returns a set of existing data, but in some cases, other operations may be performed as well.
    • Oracle: The oracle connects the data gateway to the blockchain. This may be the same or different entity as the creator of the data or the entity that stored it and made it accessible. The oracle takes data from the data gateway, encapsulates the data in a blockchain transaction, signs it, and broadcasts the transaction to the blockchain’s node network, using a blockchain client. An oracle will often also listen for data requests from the blockchain network and relay these requests to the data gateway. For example, a shipping carrier system may serve an active role as an oracle in a blockchain solution, listening for requests for shipping updates and responding accordingly. However, it is far more likely that some specialised blockchain service provider will provide oracle services by interfacing with the API of the shipping carrier system and submitting results to the blockchain, as it doesn’t require any action on the part of the shipping carrier system.
    • Blockchain node network: The transaction undergoes the consensus process, gets stored in a block, and is eventually confirmed on the blockchain network. The data is stored in a variable in some smart contract on the blockchain and can be usefully consumed or referenced by other smart contracts and users.
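    The four stages above can be sketched end-to-end as a toy pipeline. Everything here is an illustrative stand-in – the record format, the in-memory "chain state", and the use of a hash as a transaction ID are all invented for the example, and signing, broadcast, and consensus are omitted:

```python
import hashlib, json

def create_and_clean(raw: str) -> dict:
    """Stage 1: a raw measurement is cleaned and standardised into a record."""
    return {"temp_c": round(float(raw.strip()), 1)}

def gateway(store: list, record: dict) -> dict:
    """Stage 2: the record is stored and made retrievable on request."""
    store.append(record)
    return store[-1]

def oracle_submit(record: dict) -> dict:
    """Stage 3: the oracle encapsulates the data in a transaction
    (signing and broadcast are omitted in this sketch)."""
    body = json.dumps(record, sort_keys=True)
    return {"payload": body, "tx_id": hashlib.sha256(body.encode()).hexdigest()}

def confirm(chain_state: dict, tx: dict) -> None:
    """Stage 4: after consensus, the data lands in smart-contract state."""
    chain_state[tx["tx_id"]] = json.loads(tx["payload"])

store, chain_state = [], {}
tx = oracle_submit(gateway(store, create_and_clean(" 4.9 ")))
confirm(chain_state, tx)
assert list(chain_state.values()) == [{"temp_c": 4.9}]
```

    The point of the sketch is structural: each stage consumes exactly what the previous stage produced, so a fault at any stage silently propagates downstream – the theme of the next paragraph.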

    Since each stage in the data pipeline relies upon what was given by the previous stage, data integrity requires that every stage in the pipeline is secure, reliable, and resistant to malfunction or abuse.

    For example, even if a trustworthy, well-secured, and uncompromised computer serves as an oracle, the integrity of the data it provides would still be violated if it were reliant on measurements made by a broken or tampered-with sensor at the point of data creation. Thus, in order to guarantee data integrity, the accuracy and reliability of the data must be maintained from the point of origin all the way to its point of usage on the blockchain.

    Since security of every stage in the pipeline is a prerequisite, it is also important to deploy good practices of cybersecurity. Refer to the module Cybersecurity for further discussions and approaches to enhance cybersecurity throughout a solution.

    Faults in the data pipeline

    What could cause data submitted to the blockchain to be inaccurate? What could go wrong at each stage in the data pipeline?

    The lists below aim to exemplify the kinds of data integrity risks decision-makers should consider when architecting their own use cases. Data integrity faults are highly use case-specific, so these lists should be used as inspiration to help identify potential challenges unique to the organisation’s own use case, rather than as exhaustive categorisations of all possible faults.

    Benign faults

    Most problems in the data pipeline tend to be benign faults, meaning that they are unintentional and not motivated by malicious intent.

    Examples of benign faults in the data pipeline
    Table 9.1 – Examples of benign faults in the data pipeline

    Malicious faults

    Malicious faults occur much less frequently but are important to consider if the blockchain deployment operates in a highly adversarial, low-trust environment, or if the stakes are high. For example, if one is using a blockchain for real-time data sharing among long-trusted business partners, then protections against malicious behaviour are likely of lower priority, since it is assumed that a highly trusted business partner would not intentionally submit misleading data.

    On the other end of the spectrum, if a blockchain deployment programmatically determines which company will win a multi-million dollar procurement contract based on quantitative measures of vendor performance supplied by oracles, protecting oracle integrity is much more important. In this scenario, the procurement contract provides a substantial incentive for the vendors to collude with an oracle in order to submit false performance metrics at the expense of other stakeholders in the system.

    Malicious faults occur much less frequently but are important to consider if the blockchain deployment operates in a highly adversarial, low-trust environment, or if the stakes are high.

    While most violations of data origin integrity are benign, violations of oracle integrity tend to be malicious, since manipulating data to be false or fraudulent generally does not happen by accident.

    Additionally, changing business conditions, shifts in incentives, and takeovers by new management are all commonplace occurrences in the long run that can cause trustful relationships to break down or turn competitive. If the circumstances allow it, former business partners may turn adversarial as well. That is why it is important for serious blockchain deployments designed to operate over a long period of time to take measures against malicious behaviour.

    Changing business conditions, shifts in incentives, and takeovers by new management are all commonplace occurrences in the long run that can cause trustful relationships to break down or turn competitive.

    Examples of malicious faults in the data pipeline
    Table 9.2 – Examples of malicious faults in the data pipeline

    Solutions for data integrity in a blockchain context

    What techniques and solutions are available to support data integrity in a blockchain deployment?

    Preventing benign faults

    Data integrity problems, especially non-adversarial ones, are not new to the supply-chain world. Thus, the solutions relevant to preventing benign data-integrity faults in a blockchain context don’t differ much from the solutions applied to data-integrity concerns in a more traditional supply-chain context.

    It is relatively straightforward to prevent benign faults, since doing so doesn’t require anticipating the potential actions of intelligent and resourceful attackers. The same traditional principles and techniques apply – employing proper system design, maintenance, management, and business practice will prevent the vast majority of benign faults. This module will place a greater emphasis on the techniques and solutions relevant for protecting against malicious faults, an area that blockchain technology excels at.

    The same traditional principles and techniques around data integrity for other technologies apply to blockchain as well. Employing proper system design, maintenance, management, and business practice will prevent the vast majority of benign faults.

    Protecting against malicious faults

    All malicious faults stem from the abilities and privileges given to participants in the blockchain, whether they are data providers, oracles, organisations, or users. While under ordinary circumstances these powers are constructive and supportive of key functionalities in a blockchain deployment, they can also be abused when conditions worsen.

    Protecting against malicious faults is thus a matter of limiting privileges to only those which are necessary, minimising the potential negative impact of privilege abuse, detecting malicious behaviour and holding responsible parties accountable if it occurs. It is also important to minimise the number of trusted actors the system depends upon.

    In the specific context of ensuring data integrity, solutions aim to maintain a high confidence in the reliability of some data while minimising the trust placed in the parties supplying it. This usually comes in the form of protocols and additional processes that help validate data inputs in some way before being finalised.

    Protecting against malicious faults is a matter of limiting privileges to only those which are necessary, minimising the potential negative impact of privilege abuse, detecting malicious behaviour and holding responsible parties accountable if it occurs, and minimising the number of trusted actors the system depends upon.

    These approaches are all guided by the principle of trust-minimisation, which, as the name suggests, seeks to minimise the trust and reliance placed on any parties involved in transactions. Applying this principle helps to produce a system that remains robust, resilient, and functional in the face of malicious behaviour as well as additional classes of benign faults.

    Trust minimisation is relevant even when some parties are necessarily relied upon by the blockchain system. For example, even if an oracle is the only entity to have access to some information needed by the blockchain, all of their data inputs should be transparently recorded somewhere so that any suspected misbehaviour can be examined at a later time if needed.

    Solutions for protecting against malicious faults are more costly than solutions for preventing benign faults, due to the additional complexity and overhead involved. Naturally, it is up to the designer of the blockchain deployment to determine to what extent these protections are necessary and how to balance these requirements against associated trade-offs, such as cost, technical difficulty, and integration challenges. In general, it is best to employ as many of the following techniques as possible within reasonable constraints.

    Traditional techniques for data integrity

    These comparatively simple techniques for data integrity are well-known, effective, and already widely applied across the supply-chain industry in general, not just blockchain deployments.

    • Vetting trusted actors: Strictly vet any humans or organisations that must be trusted to perform certain duties. The same filtering and qualification procedures that apply to choosing a new employee or new contracting company generally apply also to choosing which humans and organisations get to play privileged roles in blockchain systems. For example, individuals could be asked to provide “know your customer” (KYC) information when necessary, or to undergo a certification process. One could select individuals who are legally obligated to act according to certain rules, such as adapting public notaries to blockchain functions. For organisations, one could look at their performance track record, at their company values and management, and at their overall capabilities.
    • Contractual obligations: Another strategy is to introduce punitive measures through traditional legal contracts, such as a fine defined for certain bad conduct on a network. Such a measure would inherit the known advantages and disadvantages of legal contracts, raising questions such as whether the contract is practically enforceable, whether the value of what is at stake is great enough to warrant legal settlement, and so on. If arbitration is sufficiently streamlined and cost-effective, legal contracts could be an effective way to create a strong obligation to uphold data integrity.

    Advanced techniques for data integrity

    These comparatively advanced techniques are newer and more difficult to apply but are highly relevant to, and compatible with, the data integrity requirements of a blockchain deployment.

    • Reputation system: Over time, the actions and inputs produced by privileged actors generate information that can help gauge the expected continuing trustworthiness and reliability of these actors. For example, an oracle that has never submitted a value differing substantially from those submitted by other oracles (for the same request) can be said to have a strong track record of good performance that may justify assigning their future inputs a slightly higher weight or paying them a higher rate for their services. The track records that form the basis of the reputation system can themselves be used to audit prior behaviour if any foul play is suspected. Performance track records and manual ratings of trusted actors can be incorporated into a reputation system that serves as the basis for increased privileges and rewards in the future.

      However, financial rewards or enhanced privileges awarded to highly-reputable actors should be introduced only with great caution, as they increase the incentive for potential adversaries to try to unfairly take advantage of the system. This can be done by reputation farming, colluding, destroying the reputation of competitors, abusing power for unfair financial gain, or exit scamming, for instance. Difficulty in aligning incentives properly is a large part of why reputation systems are so notoriously hard to make robust.

    • Automation: Another approach is to try to minimise reliance on potentially malicious human actors through automation. Trucks used for shipments could automatically report their location at all times, so that drivers don’t have any opportunity to lie about packages arriving on time. Payments for goods and shipments could be triggered by smart contracts upon the satisfaction of pre-specified conditions, instead of waiting on human and bureaucratic processes that may stall payments, intentionally or not. The legal filings required for international shipments could be digitised on a blockchain in order to circumvent those who benefit from the inefficiency and opaqueness of the current process. A major determinant of whether automation is useful to data integrity is whether the automatic process is robust enough to produce objective results even in the face of reasonable attempts to cheat the system. In addition to process efficiencies and cost reductions, increased levels of automation usually come with the added benefit that human input errors are more easily corrected or prevented, helping to prevent more types of benign faults. However, this technique is limited to the extent that many types of work still require humans to complete.

    • Fraud detection and accountability: Ideally, false or fraudulent data inputs can be detected in some way. This is valuable even if data integrity violations can only be detected after the data has already been used, because offending parties can still be held accountable after the fact.

      Depending on the severity of the faults, whether it was accidental or maliciously motivated, and other factors, offending parties can be reprimanded accordingly, such as by marking down their reputation, suspending or revoking their power to submit data inputs, or even by confiscating the whole or a part of some financial collateral they have deposited for the purpose of attesting to their current and future good behaviour. Other possibilities include the integration of machine learning models to detect anomalous data submissions and raise red flags or reject submissions automatically. There are many possible variations, but the underlying motivation is the same – to detect incorrect data inputs and hold trusted parties accountable to them.
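    As a sketch of automated anomaly detection, the following flags submissions that deviate sharply from peer submissions using a median-absolute-deviation rule. The rule and the threshold value are illustrative choices for the example, not an industry standard:

```python
from statistics import median

def flag_anomalies(values, threshold=3.5):
    """Flag inputs whose deviation from the median is extreme relative to
    the typical (median absolute) deviation of the whole batch."""
    med = median(values)
    mad = median(abs(v - med) for v in values) or 1e-9  # guard divide-by-zero
    return [v for v in values if abs(v - med) / mad > threshold]

# Nine oracles report a commodity price; one submission is wildly off.
reports = [101.2, 100.8, 101.0, 100.9, 101.1, 100.7, 101.3, 101.0, 250.0]
assert flag_anomalies(reports) == [250.0]
```

    A flagged submission would then feed into whatever accountability process the deployment uses – a reputation markdown, a suspension, or an automatic rejection of the input.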

    The decentralised prediction market platform Augur utilises tokens which represent reputation, have monetary value, and which must be owned in order to earn fees as an oracle. If an oracle reports an incorrect value (voting against the majority of what all the other oracles independently reported), a portion of their reputation tokens are confiscated.
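    The Augur-style mechanism can be sketched as a much-simplified stake-slashing round. The oracle names, stake amounts, and 20% slash rate are all invented for the illustration and do not reflect Augur's actual parameters:

```python
def settle_round(stakes: dict, reports: dict, slash_rate: float = 0.2):
    """Confiscate a portion of stake from every oracle that reported
    against the majority answer (a toy model of reputation slashing)."""
    answers = list(reports.values())
    majority = max(set(answers), key=answers.count)
    for oracle, answer in reports.items():
        if answer != majority:
            stakes[oracle] *= (1 - slash_rate)
    return majority, stakes

stakes = {"a": 100.0, "b": 100.0, "c": 100.0}
majority, stakes = settle_round(stakes, {"a": "yes", "b": "yes", "c": "no"})
assert majority == "yes"
assert stakes == {"a": 100.0, "b": 100.0, "c": 80.0}
```

    The economic logic is that an oracle's future earning power depends on its stake, so reporting against independently reproducible truth becomes directly costly.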

    • Aggregation across redundant inputs: In some cases, a single data request can be redundantly answered by multiple oracles, where the final result is taken as the aggregate of the inputs supplied by the oracles, usually with outlier results thrown out. The core idea is that by aggregating across redundant inputs, the maximum negative impact of any individual oracle is reduced, and in some cases, the accuracy of the final result is improved as well. Inputs can be aggregated by taking the median, mode, or mean, or a hybrid of these approaches, but other aggregates could be used, especially for more complicated types of data - it all depends on the use case (Figure 9.3 – Aggregation across redundant inputs).

      The primary drawback to this technique is that a large portion of data relevant in a supply-chain context is only accessible to a single party, and thus cannot be reported redundantly. For example, the current location of a package is only known by the entity that currently has custody of it, and an accurate list of its contents can only be supplied by the entity that originally shipped the package. However, for any data that is publicly available or that can be made available to multiple parties, this technique is effective in improving its reliability, accuracy, and robustness to manipulation.

      For example, regional weather conditions, commodity prices, foreign exchange rates, figures inside a U.S. Securities and Exchange Commission filing, and practically everything that can be downloaded from the internet are all suitable candidates for redundancy and aggregation. If the data can be drawn from the API of an organisation, then it is likely that multiple oracles could integrate with the API.

    By aggregating across redundant inputs, the potential risk and impact of any individual oracle is minimised.

    Aggregation across redundant inputs
    Figure 9.3 – Aggregation across redundant inputs

    The pros and cons of common aggregation methods:

    Here are the most common aggregation methods along with associated benefits and drawbacks.

    1. Mean: Best if each additional oracle input improves the accuracy of the final result, such as when sampling the credit score of a group in a poll. However, since a single oracle could greatly skew the end result by submitting outlier entries, a naive use of the mean aggregate is not resilient to manipulation by even one of the oracles – in which case using multiple oracles is even riskier than using a single one. A more intelligent approach could discard the greatest and least entries as a rule, averaging only the entries that remain.
    2. Median: Best when variance between values is expected to be moderate or high. Since only the ‘middle’ entry is taken, even extreme outliers will have negligible effects on the result. A median is resilient to a small proportion of false data arising from malfunctioning or adversarial manipulation.
    3. Mode: Best when it is expected that every oracle will return the same value. This includes objective, discrete values or when the variance between results is expected to be low or zero. For example, a request for “the number of non-faulty phones manufactured in this batch” is expected to return the same whole number from all oracles queried. Mode also works well for nominal data, which is composed of categories or labels, and ordinal data, which is composed of ordered, non-numeric options, since the values are discrete.

      Note that if the data is numerical, the numbers must be sufficiently coarse-grained in order for the mode to consistently converge on the “right” value. For example, since the price of Bitcoin in USD ranges in the tens of thousands and fluctuates quickly, applying the mode aggregate to converge on a Bitcoin price may require that prices are rounded to the nearest multiple of 10 or even 25 or 100.

    4. Hybrid Approaches: For instance, an approach that entails taking an average of the middle two quartiles would inherit traits from all of the median, mode, and mean approaches, including tolerance to accidentally or intentionally false data. This approach also has the potential to increase accuracy as more data entries are submitted without having to round numbers down as is sometimes necessary with mode. However, gaining all of these advantages may increase the number of oracles required to the point of being cost-prohibitive. The best scheme will vary case-by-case.
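    The four aggregation methods above can be compared side by side on the same batch of reports. The report values and the rounding grain of 25 are invented for the illustration:

```python
from statistics import mean, median, mode

def trimmed_mean(values):
    """Mean with the single greatest and least entries discarded (n >= 3)."""
    v = sorted(values)
    return mean(v[1:-1])

def coarse_mode(values, grain=25):
    """Mode after rounding to the nearest multiple of `grain`, so noisy
    numeric reports can still converge on a single value."""
    return mode(round(v / grain) * grain for v in values)

def interquartile_mean(values):
    """Hybrid: mean of the middle two quartiles (n >= 4)."""
    v = sorted(values)
    q = len(v) // 4
    return mean(v[q:len(v) - q])

# Six oracle reports of a commodity price; the last entry is manipulated.
reports = [30100, 30120, 30110, 30090, 30105, 99999]

assert median(reports) == 30107.5           # outlier has negligible effect
assert trimmed_mean(reports) == 30108.75    # outlier discarded before averaging
assert coarse_mode(reports) == 30100        # converges despite small variance
assert interquartile_mean(reports) == 30108.75
```

    A naive `mean(reports)` here would be pulled above 41,000 by the single manipulated entry, which is exactly the weakness of the unprotected mean aggregate noted above.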

    • Cross-validation. Another approach is to “cross-validate” inputs, meaning that each input submitted is corroborated with nearby inputs. For example, if all of the temperature sensors deployed in a grid-like fashion across a large food storage facility report that the current temperature is around 5°C, with the exception of a single sensor reporting that the temperature is 30°C, it is plausible that the single sensor is malfunctioning, and its input can be automatically thrown out.

      Another example could be applied to the Global Positioning System (GPS) locations of vehicles on the road owned by a large shipping company. A vehicle would not only report its own location, but also the locations of company vehicles nearby, making it more difficult to tamper with a single vehicle’s GPS system without detection. Successfully faking a location would require compromising the GPS systems of all the company vehicles nearby, as opposed to just a single vehicle.

      Input aggregation and cross-validation yield significant data-integrity improvements to the use of oracles. However, since every additional sensor or oracle used incurs fixed and ongoing costs, the key consideration is whether the level of security possible with the chosen quantity and quality of these components is sufficient for the needs of one’s use case.
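    The food-storage example above can be sketched as a simple consensus filter. The sensor IDs, readings, and 5°C tolerance are illustrative values chosen for the example:

```python
from statistics import median

def cross_validate(readings: dict, tolerance=5.0):
    """Discard any sensor whose reading strays more than `tolerance`
    from the consensus (median) of the whole grid."""
    consensus = median(readings.values())
    kept = {s: v for s, v in readings.items() if abs(v - consensus) <= tolerance}
    rejected = {s: v for s, v in readings.items() if s not in kept}
    return kept, rejected

# Temperature grid in a food storage facility; one sensor misreports.
grid = {"s1": 5.1, "s2": 4.9, "s3": 5.0, "s4": 5.2, "s5": 30.0}
kept, rejected = cross_validate(grid)
assert rejected == {"s5": 30.0}
```

    In practice the tolerance would be tuned to the expected physical variance across the grid, so that genuine local differences are not mistaken for faults.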

    In many blockchain projects, oracles don’t have much input in the development process. Therefore, it may be difficult to implement additional security mechanisms for additional integrity of oracles later. In such cases, one possible solution might be to introduce a human-oriented approach to data integrity, such as use of a trusted third party to verify the correctness of oracle data.

    Nishio Yamada, Research and Development Group, Hitachi

    • More data, more evidence: In general, more data allows for higher confidence in the events and conditions implied by the data. For example, the simple policy that a courier should record a short video while placing a valuable package in a deposit-only receptacle gives a much stronger assurance that the package was actually delivered. While it is still possible for the courier to record a video of the delivery and then fish out the package afterwards, getting away with it becomes more difficult as more data is required, since the data must not only be self-consistent but must also be comparable to data produced by similar events. Recording this evidence in an immutable tamper-evident data store such as a blockchain allows events to be audited in the future should a dispute arise, although it’s likely that only a hash of the data would be stored on-chain.

      Determining what additional data and evidence is helpful to data integrity is highly specific to each use case, and requires creative thinking by blockchain architects and supply-chain decision-makers. However, the main idea is constant – while evidence can be faked and preventing fraud entirely is difficult, increasing the amount of data and evidence collected makes it more expensive and time-consuming for an attacker to submit false data, especially when used in tandem with other data-integrity solutions.
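      As noted above, typically only a hash of bulky evidence such as a video would be stored on-chain. A minimal sketch of that pattern (the shipment ID and record layout are hypothetical): compute a fixed-size digest of the evidence file, keep the file in off-chain storage, and anchor only the digest on the blockchain, so any later alteration of the file is detectable:

```python
import hashlib

def evidence_digest(data: bytes) -> str:
    """Return the SHA-256 digest of an evidence file (e.g. a delivery
    video). Only this fixed-size digest needs to go on-chain; the full
    file stays in off-chain storage."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical shipment record anchoring the evidence on-chain.
video = b"...binary contents of the delivery video..."
onchain_record = {"shipment_id": "SN-1234",
                  "evidence_sha256": evidence_digest(video)}

# During a dispute, re-hash the stored file and compare it against the
# on-chain digest to show the evidence is unaltered.
assert evidence_digest(video) == onchain_record["evidence_sha256"]
```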

    • ‘Provably Honest’ Protocols: Another option is to integrate cryptographic protocols and special hardware that allow oracles supplying data inputs to include a corresponding “proof” that the data they are submitting is exactly the data they received from the data source. When the data and proof are received by the blockchain, they are checked against each other, and the data is thrown out if the proof is invalid. The protocols are designed such that it is impossible to generate a proof for some data if it has been modified after receipt from the data source. Hence, oracles that have provided data in this way are “provably honest” and do not have to be trusted except in the sense that they will continue to provide oracle services. However, even if an oracle becomes non-cooperative or discontinues its service, that oracle is fully replaceable. From the perspective of the blockchain, it doesn’t matter which oracle submits the data as long as the associated proof is valid.
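      The trust property described above can be sketched in a few lines: the data source attaches a proof to its data, and the verifier rejects anything the relaying oracle has altered. For simplicity this sketch uses an HMAC with a key shared between source and verifier; real provably-honest protocols instead use asymmetric signatures or TLS-based proofs so that no shared secret is required:

```python
import hmac
import hashlib

# Key shared between the data source and the verifier (a simplified
# stand-in for the signatures or TLS proofs used by real protocols).
SOURCE_KEY = b"data-source-secret"

def source_publish(data: bytes):
    """The data source attaches a proof (here an HMAC) to its data."""
    return data, hmac.new(SOURCE_KEY, data, hashlib.sha256).hexdigest()

def chain_verify(data: bytes, proof: str) -> bool:
    """Accept data only if the proof matches; a relaying oracle cannot
    alter the data without invalidating the proof."""
    expected = hmac.new(SOURCE_KEY, data, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, proof)

data, proof = source_publish(b"temperature=4.2C")
assert chain_verify(data, proof)                     # honest relay accepted
assert not chain_verify(b"temperature=9.9C", proof)  # tampered relay rejected
```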

    • Hypertext Transfer Protocol Secure (HTTPS): For requests made over the internet, one of the best options is TLSNotary, which leverages the TLS protocol underlying HTTPS to allow any computer to produce a proof that a particular web page appeared in its browser. For example, an oracle that connects to the UPS API could use TLSNotary to prove that the tracking information and timestamps it received from UPS were not modified before submission to the blockchain, and a smart contract on the blockchain could verify this proof that the data came from UPS. Unlike the prior techniques that use redundancy to reduce the amount of trust placed in oracles, TLSNotary incurs very low costs, since the only requirements on the oracle are to integrate TLSNotary and maintain a server. TLSNotary can be used with any data source served over the secure HTTPS protocol, which includes the vast majority of websites today.

    • Trusted Execution Environment (TEE): For requests that primarily require some computation to be completed off-chain, one of the leading technologies that can produce a similar proof of correctness is the TEE, such as Intel Software Guard Extensions. Essentially, Intel chips that support this protocol include a special component, completely isolated from the other components in the computer, called the Trusted Execution Environment. Other components in the computer can’t read the memory inside the TEE, nor can they see the inputs or outputs of the TEE’s computations, since all of that data is encrypted while in transit. This TEE can then be used to run highly sensitive code that computes over highly sensitive data, with a strong guarantee that the code ran correctly and without leakage of confidential information to any third party or even the computer that the TEE resides on. TEEs excel at providing strong data-integrity guarantees in highly adversarial environments.

      For instance, even if a computer, all of the software on it, and the internet connection the computer uses are all under the control of a hacker, any computations sent to the TEE would still be executed correctly due to the hardware and cryptographic security properties of the TEE. Any blockchain project that requires an oracle input that can be obtained as the output of some code could theoretically integrate a TEE, enabling a wide range of use cases and possibilities. However, it is worth noting that TEEs today are still a developing technology with a substantive number of unresolved issues.

    The design choice between a smart contract and a TEE forms a trade-off between security and accountability, and it must be made with the specifics of the use case in mind. While a TEE brings greater security benefits, it also limits the transparency of the process. In such cases, an alternative code-verification process trusted by stakeholders can complement the design.

    Takayuki Suzuki, Financial Information Systems Sales Management Division, Hitachi

    Ensuring digital-twin integrity

    How do I ensure that digital twins are synchronised with the physical objects they represent? What are the major components of digital-twin integrity?

    Digital-twin integrity is a more specific type of data integrity that arises whenever physical objects are represented on a blockchain in a digital format. This usually applies to products, parts, and materials, but can apply to virtually any physical component in the supply chain that is useful to track in real-time.

    For example, a luxury handbag tracked on a blockchain may be represented by a blockchain token, with the latest information about its location, current custodian, and stage of manufacturing attached. The digital representation is the ‘digital twin’ of the real, physical object, and the physical object itself may be considered the ‘physical twin’. In order for the digital twin to provide useful insights about the physical object as it is being shipped, it must satisfy three primary conditions:

    1. Accuracy: The data associated with the digital twin is correct and reliable.
    2. Timeliness: The data is recent enough to be useful.
    3. (Cyber-physical) Correspondence: The digital twin represents the physical object it is intended to represent, and the associated data describes the physical object it is intended to describe; the identities of the cyber (digital) and physical twins correspond.

    These three components are the core essence of digital-twin integrity. Digital-twin integrity is important to consider whenever a violation of the accuracy, timeliness, or correspondence of data associated with the digital twin can unacceptably distort one’s view of the supply chain. This in turn may result in item mix-ups, missing items, counterfeit items, or simply data that yields no useful insight. The accuracy and timeliness of the data associated with the digital twin can be ensured using the same techniques applied to data origin integrity and oracle integrity – robust system design, competent management, and minimisation of trust. Ensuring the correspondence between physical and digital twins, however, requires a different way of thinking.

    Cyber-physical correspondence

    What are the different realms of correspondence between physical and digital twins? What are the solutions for common cyber-physical correspondence issues?

    Ensuring the correspondence between physical and digital twins usually only requires that there is a valid identification method for the physical object being tracked. This is usually done by attaching an identifier (ID) directly to the object or recording identifying information about the object.

    The concept of cyber-physical correspondence may also extend to any systematic process that can uniquely identify and consistently differentiate objects from one another.

    In a system with sound data integrity, an object will be assigned a unique ID such as a serial number, and the digital twin on the blockchain records this unique ID, allowing all data collected about the physical object to be associated with this ID. Such a setup enables a blockchain observer to look up information about a physical object by searching for its ID on the blockchain and is sufficient to ensure cyber-physical correspondence in most cases. However, if the physical objects being tracked have a non-trivial risk of loss, theft, or counterfeit, more stringent requirements must be imposed on the method of identification. This idea is best illustrated with an example.
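    The ID-based setup above can be sketched as a small registry keyed by serial number. This is an illustrative schema, not any particular platform’s API: the registry stands in for on-chain lookup, the `is_timely` check captures the timeliness condition, and resolving a scanned physical ID to exactly one twin captures cyber-physical correspondence:

```python
import time

class DigitalTwin:
    """Minimal digital-twin record keyed by the physical object's
    serial number (a hypothetical schema for illustration)."""
    def __init__(self, serial_id: str):
        self.serial_id = serial_id
        self.location = None
        self.updated_at = None

    def update(self, location: str) -> None:
        self.location = location
        self.updated_at = time.time()

    def is_timely(self, max_age_seconds: float = 3600) -> bool:
        """Timeliness: data older than `max_age_seconds` is stale."""
        return (self.updated_at is not None
                and time.time() - self.updated_at <= max_age_seconds)

# Registry mapping serial numbers to twins; a stand-in for on-chain lookup.
registry = {}

def lookup(scanned_id: str) -> DigitalTwin:
    """Correspondence: a scanned physical ID must resolve to exactly
    one digital twin, or the mismatch is flagged."""
    if scanned_id not in registry:
        raise KeyError("no digital twin for physical ID " + scanned_id)
    return registry[scanned_id]

bag = DigitalTwin("HB-0001")          # hypothetical serial number
registry[bag.serial_id] = bag
bag.update("Fulfilment centre, Rotterdam")
assert lookup("HB-0001").is_timely()
```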

    Example of cyber-physical correspondence

    In order to provide faster shipping for its customers, a luxury handbag company stores some of its inventory at third party fulfilment centres around the world, where its handbags are personalised with custom engravings, placed in their final packaging, and shipped directly to the customer. However, the company has discovered that counterfeit products are frequently swapped in for real ones at these fulfilment centres, where the company no longer has direct oversight over the bags.

    The company attempts to solve this problem by tracking the handbags on a blockchain. When a bag is manufactured, it is assigned a serial number that uniquely identifies the bag. This serial number is etched onto a tag attached to the inside of the bag and recorded into a newly created blockchain token that represents the bag. Whenever the bag is transported to a different location and changes hands, its associated token is passed along, recording the location of the bag, the identity of the newly responsible party, and other details.

    Altogether, this establishes a full history of the bag from completion of manufacturing to arrival at the fulfilment centre to final delivery to the end customer. The company believes that keeping this data on the blockchain will prevent counterfeiting, on the grounds that the blockchain’s transparent, immutable nature maintains a verifiable, tamper-proof record of information that will supposedly keep counterfeit products out.
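    The custody history described above can be sketched as a hash-chained event log, where each entry commits to the hash of the previous one, so that rewriting any past record breaks every later link. This is a simplified illustration of the tamper-evidence a blockchain provides, not a full ledger implementation (the event fields are hypothetical):

```python
import hashlib
import json

def add_custody_event(history: list, event: dict) -> list:
    """Append a custody event, chaining it to the hash of the previous
    entry so later rewrites of the history are detectable."""
    prev_hash = history[-1]["hash"] if history else "0" * 64
    payload = json.dumps({"event": event, "prev": prev_hash}, sort_keys=True)
    history.append({"event": event, "prev": prev_hash,
                    "hash": hashlib.sha256(payload.encode()).hexdigest()})
    return history

def verify_history(history: list) -> bool:
    """Recompute every link; return False if any event was altered."""
    prev_hash = "0" * 64
    for entry in history:
        payload = json.dumps({"event": entry["event"], "prev": prev_hash},
                             sort_keys=True)
        if (entry["prev"] != prev_hash
                or entry["hash"] != hashlib.sha256(payload.encode()).hexdigest()):
            return False
        prev_hash = entry["hash"]
    return True

h = []
add_custody_event(h, {"serial": "HB-0001", "holder": "Factory", "loc": "Milan"})
add_custody_event(h, {"serial": "HB-0001", "holder": "Carrier", "loc": "Rotterdam"})
assert verify_history(h)
h[0]["event"]["holder"] = "Imposter"   # tamper with a past record...
assert not verify_history(h)           # ...and the chain breaks
```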

    However, their blockchain solution may not have resolved the counterfeiting problem if either of the following two conditions is not met:

    1. First, the means of identification on the bag must be hard to forge. If the identification does not have sufficient anti-forgery protections, then a fraudster could create a fake identification tag that appears legitimate, attach it to a counterfeit bag, and swap the counterfeit bag with the real one at the fulfilment centre or any other point in the supply chain. The forged product would then pass unnoticed through the rest of the supply chain.
    2. Second, the identification must be tamper-evident or tamper-resistant, such that modifying or substituting an identifier will be difficult or at least leave traces of tampering. This is necessary because even if the company used a hard-to-forge identifier, a fraudster could still detach a real identifier from a real bag, reattach it onto a fake bag, and pass the fake bag along with the real identifier to the rest of the supply chain – all without leaving a trace.

    In either case, the addition of a blockchain for recordkeeping does not alone prevent counterfeiting. A lack of anti-forgery or tamper-evidence / tamper-resistance protections allows a fraudster to profit by obtaining real bags at the cost of fake bags. On the other hand, if the company has adopted a means of identification that is both hard-to-forge and tamper-evident / tamper-resistant, any party in the remainder of the supply chain can notice attempted counterfeiting and report the fraud.

    For more information on the digital identity of “things” and identifiers, refer to the module Digital Identity.

    Solutions for cyber-physical correspondence

    In use cases where the cyber-physical correspondence of an object is substantively threatened by a risk of loss, theft, or counterfeit, two requirements on the identification method for the physical object must be met:

    1. Hard to forge: It is difficult to falsify an identification that passes as legitimate.
    2. Tamper-resistant or tamper-evident: The method of identification is sturdy enough to prevent tampering or tampering leaves behind detectable evidence.

    Both of these requirements are related to physical security. Considerations of physical security, theft, and counterfeiting are not new to the supply-chain industry, so they are not necessary to discuss in full detail here. Just like the approaches to data origin integrity and oracle integrity, solutions for digital-twin integrity and cyber-physical correspondence are very use case-specific and must be balanced against costs, integration difficulty, and other trade-offs.

    The rest of this section provides more information about hard-to-forge and tamper-resistant / tamper-evident identification methods.

    Something that is tamper-proof is supposedly impossible to tamper with. However, nothing is tamper-proof in the same way that nothing is absolutely secure – it depends on the extent to which risks have been mitigated and the resources of a potential adversary.

    It is thus more accurate to use the term tamper-resistant to describe an item considered to be difficult to tamper with, allowing its security against meddling to be evaluated along a spectrum. Examples of tamper- resistant items include steel safes and vaults, padlocks and bolts, or other sturdily constructed objects, locks, and containers. Note that all of these protections could be broken by an attacker with sufficient time, motivation, and resources.

    On the other hand, a tamper-evident item might not be hard to tamper with but will leave evidence of any meddling that will be apparent to a recipient of the item. Numerous examples are in common use today, such as the lids on pharmaceuticals or jarred foods that pop up when opened, and hologram stickers, labels, and seals that leave traces that are difficult to prevent when peeled or broken.

    Something that is hard-to-forge is hard to reproduce fraudulently without also leaving apparent evidence of forgery. Intuitively, anti-forgery techniques are thus similar to techniques designed to provide tamper-evidence. Examples include the many protections we apply to our paper bills, coins, stamps, and coupons, such as holograms, embedded strips, differing bill sizes, colour-shifting and UV-reflective inks, watermarks, and grooves on coins.

    For some examples of identification methods, and their relative effectiveness in satisfying the physical security constraints required for cyber-physical correspondence, see (Table 9.3 – Examples of identification methods and their associated effectiveness levels).

    Examples of identification methods and their associated effectiveness levels
    Table 9.3 – Examples of identification methods and their associated effectiveness levels

    Key questions to approach blockchain data integrity

    The following checklist is a series of guiding questions you can use with your organisation or consortium to approach data integrity concerns. It collects most of the key considerations presented in this module, but the more detailed discussions above should be referenced while going through this checklist. After reading this module, working through this checklist is an ideal starting point for supply chain decision-makers, product managers, solution architects, lead engineers, security experts, corresponding representatives from partner organisations, supply-chain domain experts, and practically anybody who plays a major role in the design and implementation of the blockchain and supply-chain deployment.

    Since data-integrity considerations will have a pervasive impact on the final implementation of the project, these questions should be considered early in the timeline of a blockchain deployment – in the later portions of the design phase, after the core value proposition and mechanics of the use case have been determined but before code development begins.

    Data origin integrity and oracle integrity


    • Has every stage in the data pipeline been examined to determine the faults that may occur?
    • Have proper design, maintenance, and management of this deployment been ensured?
    • Are there measures in place to identify faults unique to the intended use case?
    • Are there malicious faults worth considering in addition to any benign ones?
    • Approximately how much protection against malicious behaviour is required for the intended use case?
    • Have there been measures to protect the integrity of data for every unacceptable fault that has been identified?
    • Has the organisation considered all the techniques and solutions available to address data integrity faults?
    • Is the quantity and quality of protections achievable under any resource constraints for the solution? Have protections been maximised?
    • Does the proposed solution minimise the trust and reliance placed on any and all participating entities?

    Digital-twin integrity


    • How has the accuracy of data associated with the digital twins in the system been ensured?
    • How will data in the system always be kept up-to-date?
    • To what degree does the use case require assurances of cyber-physical correspondence? If there is a significant requirement, does the solution’s design include an identification method that can uniquely identify the objects being tracked? Does the identification method consistently differentiate the objects in question from one another?
    • Are loss, theft, and counterfeiting non-trivial risks to the intended use case? If so, has the proposed solution integrated adequate physical security measures to protect against these risks? Is its method of identification hard-to-forge and tamper-evident or tamper-resistant?
    • Are objects sufficiently protected by anti-theft and anti-counterfeit measures according to the organisation’s needs and cost preferences?

    Techniques and solutions


    • Have strict vetting procedures been applied to all humans or organisations relied upon in the blockchain system? Has the organisation checked participants’ credentials, or have they been certified by a reputable entity? Can the solution rely more heavily on actors who have a legal obligation to act honestly?
    • Has the organisation considered applying legal measures to keep privileged parties contractually obligated to perform their duties as expected? Are these legal measures practically enforceable given the amount of value at stake and time required for arbitration in case of a dispute?
    • Are the actions taken by privileged entities in the system immutably recorded somewhere in order to establish track records of performance? Could these track records serve as the basis for a reputation system? How exactly will bad behaviour be punished? Is it possible to reward good behaviour without creating an incentive misalignment?
    • What parts of the data pipeline can be automated to reduce the number of human errors? Where can systematic validations of data inputs be applied? Can any of the mechanics in the use case be handled programmatically? Is an automated process robust enough to produce objective results even in the presence of adversaries?
    • Can false or fraudulent data inputs be detected in some way? How can offending parties be held accountable for their actions? Are there opportunities to apply machine learning models to detect data anomalies programmatically?
    • Can any of the requests for data in the proposed system be answered redundantly by multiple oracles or sensors? If so, how might redundancy help mitigate systemic risks? Which of the aggregation functions is most suitable for the types of data in the intended use case? How does the organisation determine and throw out outliers? How do outlier data feed into the system of reputation or accountability? Are there any opportunities to introduce cross-validation of data inputs instead of pure redundancy? Does this redundancy meet the system’s security needs for a reasonable cost?
    • How can more data be collected to increase confidence in the conditions reported by the data? What data would be useful for this aim? Where can the solution store this data so that it can help resolve potential disputes in the future?
    • Does the system rely on data accessible over the internet or an API via HTTPS? If so, how can the system use TLSNotary to avoid having to place any trust in oracles? Is a replacement mechanism for the oracles included in the system?
    • Does the system require any computations to be completed off-chain, or do the computational inputs and outputs need to remain confidential? If so, is it possible to conduct these computations with a trusted execution environment? Could the system use a TEE to provide strong data integrity guarantees without having to incur the large overhead costs of redundant sensors and oracles?
    • How do all the system’s data integrity techniques and solutions work together? Is the combination of solutions coherent? Do they altogether form a comprehensive plan to ensure data integrity in the blockchain and supply-chain deployment?