NHI Encryption Methods

Thursday, June 1st, 2006
Jayden MacRae
Team Leader, Information Management
Compass Health Limited
Wellington, New Zealand

Abstract
In New Zealand’s health care sector, the National Health Index Unique Identifier (NHI) is an important tool for identifying individuals within the system for both clinical and analytic purposes. Keeping patient data anonymous is usually a major consideration when using such data for analysis. Several methods can be used to achieve this. This paper looks at various methods of preserving data anonymity and assesses their strengths and weaknesses against a set of seven criteria.

Introduction
The National Health Index Unique Identifier (NHI) is designed to uniquely identify individuals within the health system in New Zealand. It is a seven-character identifier, made up of three alpha characters, three digits and a checksum digit, which combined can produce 12,567,273 valid NHIs. For the purposes of health informatics, the NHI allows easy electronic matching between varying data sets in the health sector.
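The figure of 12,567,273 can be checked with a short calculation. The sketch below assumes the commonly published modulus-11 check-digit rule for this NHI format, which is not described in this paper: the letters I and O are excluded (leaving 24 letters valued 1 to 24), the first six characters are weighted 7, 6, 5, 4, 3 and 2, and any combination whose weighted sum is divisible by 11 has no valid check digit. The function name and constants are illustrative.

```python
# Assumed rule (not stated in the paper): 24 usable letters valued 1-24,
# weights 7,6,5,4,3,2 on the first six characters, and a weighted sum
# divisible by 11 means no valid check digit exists for that prefix.
LETTER_VALUES = range(1, 25)
DIGIT_VALUES = range(10)
POSITIONS = [
    (7, LETTER_VALUES), (6, LETTER_VALUES), (5, LETTER_VALUES),
    (4, DIGIT_VALUES), (3, DIGIT_VALUES), (2, DIGIT_VALUES),
]

def count_valid_nhis() -> int:
    """Count six-character prefixes that admit a valid check digit."""
    # residue_counts[r] = number of partial prefixes with weighted sum r mod 11
    residue_counts = [1] + [0] * 10
    for weight, values in POSITIONS:
        updated = [0] * 11
        for residue, count in enumerate(residue_counts):
            for value in values:
                updated[(residue + weight * value) % 11] += count
        residue_counts = updated
    total_prefixes = sum(residue_counts)       # 24**3 * 10**3
    return total_prefixes - residue_counts[0]  # drop sums divisible by 11

print(count_valid_nhis())
```

Under these assumptions the count agrees with the number quoted above, which is roughly ten elevenths of the 13,824,000 possible six-character prefixes.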

Some of these data sets are used in research and for service improvement and must, consequently, have all personally identifying information removed. It is often important to identify records belonging to the same individual throughout a data set (eg, to determine how many times the individual accesses a service over a period of time). Because the NHI is a unique identifier, it can be used for this purpose.

The NHI also easily allows the data set to be cross-matched with any other accessible data sets also containing NHIs. There are instances where allowing cross-matching is desirable, eg, when it is necessary to provide data longitudinally over time in discrete subsets of a greater superset. Often such data must be supplied in an anonymous form, sometimes by different organisations holding the data, but identifiers between the subsets must remain constant to allow analysis at an individual level in a longitudinal fashion. There are also instances where it is not desirable to allow cross-matching.

When attempting to make data anonymous, a common misconception is that this can be achieved by removing names, dates of birth, addresses and phone numbers while including the NHI. By its very nature, however, the NHI uniquely identifies individuals and should not be included in any anonymous data set. But if it is the only unique identifier of the patient, removing it also removes the ability to link records belonging to the same individual, which the analysis genuinely needs.

By making data anonymous, the privacy of individuals is maintained. Classically, this has been the major consideration for organisations holding this data. As the holders, they are the guardians of the data and must exercise due diligence in its protection.

Even when the privacy of individuals is maintained, organisations holding data may also wish to take precautions against the misuse of data sets, especially ensuring that data supplied for one purpose is not used out of context. This can happen when two data sets are supplied for two distinct purposes, and are then matched for a third, previously unintended purpose. This is not a privacy issue. It is about ensuring that the intended application or use of a data set cannot be deliberately or inadvertently disregarded.

There are various methods of solving the problem of making data anonymous while maintaining the ability to identify multiple occurrences of the same individual in a particular set of data. This paper looks at different methodologies for encrypting NHIs and the relative strengths and weaknesses in each in terms of security. The term “encrypting” is used loosely to describe any method of making an original NHI unidentifiable while retaining some type of unique identifier. This is distinct from the more specific term “cryptographic encryption” which describes the process of using cryptographic functions to encode data in a particular fashion.

Method Considerations
Before considering encryption methods, it is useful to define the main aspects by which their usefulness will be assessed to set the basis for comparison. The criteria are:

  1. Securing the data. Unauthorised persons should be unable to obtain NHI detail for a particular data set while still being able to identify records that belong to the same individual and cross match records for the same individual between data sets. Any method that does not meet this criterion fails the entire purpose of encryption.

  2. Reversible encryption. Authorised persons should be able to reverse the encryption to identify the NHI while still meeting the criterion of “Securing the data”. This ability effectively gives unbridled access to identifiable data and has implications, mainly, for matching large national data sets.

  3. Encryption at source. Those holding the data should be able to encrypt the data without transferring any of the unencrypted data to another agency. This consideration is important in empowering organisations to encrypt their own data. This means that they don’t have to rely on third parties to do it for them, potentially increasing the privacy of the data (by not disclosing it to any other agency) and reducing the overheads in terms of cost and time.

  4. Changeable key. It should be reasonably easy to change the encryption key used. Any method that does not use a key should include the ability to change the encrypted output, so that for the same NHI, the two outputs would differ.

  5. Key management only. Encryption and decryption should rely only on knowledge of a (basic) key. This specifically excludes “one time pads” and mapping tables.

  6. Set sharing. Some organisations hold regional or national sets of data collected from many sources. Any method should continue to allow such organisations, where authorised, to do this. For most data sets this would involve an ability to match encrypted information only, but for historic or longitudinal data sets, it may be necessary to allow decryption back to original NHIs, so the information can be encrypted in a consistent way with the existing data.

  7. Discrete data sets. The method must provide, where necessary, data sets that in the encrypted form are impossible to match to each other. This means that should there be three distinct data sets, set A, set B and set C, it should be possible to selectively encrypt them such that set A cannot be matched to set B or set C, but set B and C can be matched to each other.

Encrypted NHI
The New Zealand Health Information Service (NZHIS) has an encryption algorithm to generate encrypted NHIs (eNHIs). These bear no resemblance to an NHI, in that they follow an entirely different format. Using eNHIs is one solution that has commonly been employed by researchers to attempt to generate anonymous data sets.

Without performing cryptanalysis on the eNHI, it is difficult to assess how secure it is. The algorithm used to generate it is kept secret, a requirement that stems from the relatively small, finite set of valid NHIs.

The eNHI algorithm encrypts NHIs in the same way each time it is applied (much like a hash function evaluates to the same hash value each time it is applied to the same source data). This allows data sets to be matched nationally using the eNHI, while maintaining some level of anonymity of the data.

This is also the source of the major flaw of the algorithm: because it is operating on a small finite set of values, any holder of the algorithm could easily build a reverse lookup table. To do this, an attacker could generate a table containing every possible NHI. They would then apply the algorithm to each entry and store the resultant eNHI. For any given eNHI, they could then find the appropriate entry in the table and the matching NHI value would be the original plain text. This type of attack is called a dictionary attack.[1]
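The attack can be sketched in a few lines. The real eNHI algorithm is secret, so the example below uses a truncated SHA-256 hash purely as a hypothetical stand-in for any keyless, deterministic encoder; the function names and the sample values are illustrative, and the candidate list is deliberately cut down to one letter prefix so that it runs instantly.

```python
import hashlib
from itertools import product

def secret_transform(nhi: str) -> str:
    """Hypothetical stand-in for a keyless, deterministic encoder."""
    return hashlib.sha256(nhi.encode("ascii")).hexdigest()[:16]

def demo_candidates():
    """Reduced domain for the demo: prefix 'ABC' plus every 4-digit suffix."""
    for digits in product("0123456789", repeat=4):
        yield "ABC" + "".join(digits)

def build_reverse_table(candidates):
    """Apply the algorithm to every candidate and index by the output."""
    return {secret_transform(nhi): nhi for nhi in candidates}

reverse_table = build_reverse_table(demo_candidates())

# An "anonymous" value seen in a released data set...
observed = secret_transform("ABC1234")
# ...is reversed with a single dictionary lookup.
print(reverse_table[observed])  # ABC1234
```

Extending `demo_candidates` to every letter prefix would cover the full NHI domain; the table would then hold on the order of ten million entries, which is well within the reach of a desktop computer.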

To maintain the secrecy of the algorithm, its only holder must be NZHIS. This introduces the additional difficulty of needing NZHIS to encrypt all data that is to be encrypted in this way.

Having to keep an algorithm secret is considered to be brittle security. As such, this is deemed to be less secure than flexible security.[2] As employees move through the organisations that hold the eNHI algorithm, they acquire the knowledge of the algorithm, which can never be modified. The encryption effectively becomes weaker over time.

Measured against the encryption method considerations, eNHI succeeds in providing an encryption that is reversible (by NZHIS) and, because it is applied consistently each time, can facilitate matching anonymous data sets.

It fails to allow any holder of NHI-based data to encrypt the data itself. It is not possible to change the output of the algorithm without breaking its compatibility with currently held datasets and therefore the issue of key management does not apply. Because it is encrypted in the same way each time, anything encrypted to an eNHI can be matched to any other data set that uses eNHIs. This, therefore, fails to allow discretely anonymous data sets to be produced.

Arbitrary Identifiers
Assigning arbitrary identifiers (aIDs) to individuals in a data set is another method currently employed in making health data anonymous. Each patient is identified by a number or alphanumeric sequence, which outside the context of the data set is meaningless.

The first step in assigning an aID involves building a table containing a record for each unique NHI in the dataset. The record should contain the NHI and a space to populate with a new identifier. The next step is to populate the aID for each record (ensuring no value is used twice). The result is a mapping table containing two fields, NHI and aID. The number of records in the mapping table corresponds to the number of unique NHIs in the data set. This process is effectively no more than a one-time transformation.
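A minimal sketch of these steps follows. The record layout, the sample NHI values and the choice of random 16-character hexadecimal aIDs are assumptions made for illustration; any scheme that never issues the same aID twice would serve.

```python
import secrets

def build_mapping_table(nhis):
    """Return {NHI: aID} with a distinct random aID for each unique NHI."""
    used = set()
    table = {}
    for nhi in sorted(set(nhis)):
        aid = secrets.token_hex(8)
        while aid in used:            # ensure no value is used twice
            aid = secrets.token_hex(8)
        used.add(aid)
        table[nhi] = aid
    return table

# Hypothetical records: (NHI, detail)
records = [
    ("ABC1234", "visit 1"),
    ("XYZ5678", "visit 1"),
    ("ABC1234", "visit 2"),
]

mapping_table = build_mapping_table(nhi for nhi, _ in records)
anonymous = [(mapping_table[nhi], detail) for nhi, detail in records]

# Reversal is only possible for whoever holds the mapping table.
reverse_lookup = {aid: nhi for nhi, aid in mapping_table.items()}
print(reverse_lookup[anonymous[0][0]])  # ABC1234
```

Because the aIDs are drawn at random, rebuilding the table for another data set yields unrelated identifiers for the same NHIs.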

This is a relatively straight-forward process that most organisations with data should be able to produce. No specialised knowledge of algorithms or cryptography is required.

The security of this method relies on keeping the mapping table a secret. Without obtaining this table, reversing the encryption is literally impossible. As long as the mapping table is stored, the transformation is reversible (using the mapping table as a reverse lookup from aID back to NHI).

The mapping table is, essentially, the key used in this process. In this respect, it is possible to keep the same key from data set to data set or to vary it as required (by building a new mapping table and assigning different aIDs).

In a large data set the relative size of the key to the data is very large. This makes storing the keys more difficult than traditional cryptographic keys. This may be manageable for a few data sets but could quickly get out of hand, especially when sharing the data with different organisations, all requiring different keys for the same data set.

Although it is theoretically possible to supply data in such a way to various organisations and then give each the mapping table enabling them to decrypt the data set and match it, the reality is less practical. For large data sets, as noted above, the key can be large and this quickly becomes impractical and hard to manage. Another complicating factor is the problem of how to securely communicate the key between trusted parties.

Through the creation and management of mapping tables, it is possible to keep data sets discretely encrypted. For each discrete data set, a new key would need to be produced. Again, because such keys are large, having many discrete data sets can quickly become an impractical task.

Measured against the considerations, aIDs secure and keep the data private. They allow reversing of the encryption as long as the original mapping tables are available. This is a relatively straight-forward method that an organisation can perform itself, and by using different mapping tables, data sets can be kept discretely anonymous.

Because whole mapping tables are kept, this method doesn’t rely on key management alone. Although it would be possible to enable sharing of the set through sharing the mapping tables, there is a logistic problem of sharing the mapping tables.

Symmetric Cryptographically Encrypted NHI
Symmetric cryptography refers to a method of encryption using one of many algorithms that rely on a single key for both encryption and decryption. The key is usually between 128 and 512 binary digits (bits) long. Although the security is provided by a combination of the particular algorithm being used and the key itself, the algorithm being used is not secret. All the secrecy of the data, therefore, lies within the key.

Symmetric encryption can be used to make NHI data anonymous. Passing each NHI value through the algorithm generates a symmetric cryptographically encrypted NHI (sceNHI). The holder of the key can easily reverse the sceNHI back to the original NHI because of the nature of the symmetric algorithm.

Although using symmetric cryptography requires some technical skill, the tools and open source programming code are readily available. It is therefore practical to have the holders of data encrypting the data onsite, without having to pass the data set to any third party. This directly addresses consideration 3, making any potential process or method of exchanging data less complicated.

The key can readily be changed at any time, preventing discrete data sets from being matched to each other. Keeping track of the keys used, although requiring some management, should not be onerous, as keys used in this type of encryption tend to be small.
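The behaviour described in this section can be illustrated with a toy cipher. The sketch below is not a vetted algorithm: it is a small Feistel network that uses HMAC-SHA256 as its round function, chosen only because it runs with the Python standard library. In practice a reviewed cipher such as AES from an established library would take its place. The function names, round count, padding scheme and keys are all assumptions.

```python
import hashlib
import hmac

ROUNDS = 8
BLOCK_BYTES = 8  # a 7-character NHI padded with one zero byte

def _round(key: bytes, index: int, half: bytes) -> bytes:
    """Keyed round function: HMAC of the round number and one half-block."""
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[:4]

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encrypt_nhi(key: bytes, nhi: str) -> str:
    block = nhi.encode("ascii").ljust(BLOCK_BYTES, b"\x00")
    left, right = block[:4], block[4:]
    for i in range(ROUNDS):
        left, right = right, _xor(left, _round(key, i, right))
    return (left + right).hex()

def decrypt_nhi(key: bytes, token: str) -> str:
    block = bytes.fromhex(token)
    left, right = block[:4], block[4:]
    for i in reversed(range(ROUNDS)):      # undo the rounds in reverse order
        left, right = _xor(right, _round(key, i, left)), left
    return (left + right).rstrip(b"\x00").decode("ascii")

key_a = b"hypothetical key for data set A"
key_b = b"hypothetical key for data set B"

token_a = encrypt_nhi(key_a, "ABC1234")
print(decrypt_nhi(key_a, token_a))              # ABC1234
print(token_a == encrypt_nhi(key_a, "ABC1234"))  # True: same key, same output
```

Encrypting the same NHI under `key_b` gives an unrelated value, which is what keeps two data sets discrete, while anyone holding `key_a` can both match and reverse the first set.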

It is possible to share the unencrypted data with other organisations by supplying them with the appropriate keys for the data sets being shared. Sharing the keys must be done in a secure way, however. It could be possible to pre-share keys with core agencies (such as NZHIS), although this has a distinct disadvantage for such agencies because they have to manage a different key from each supplying agency, which could run into the hundreds (for PHOs) or thousands (for individual practices). There are also practical implications in sharing these keys in a secure manner.

Measured against the considerations, sceNHI addresses all but the problem of set sharing, which comes directly from the difficulty in sharing keys. Although keys could be pre-shared, this is not logistically practical.

Asymmetric Cryptographically Encrypted NHI
Asymmetric encryption techniques utilise key-pairs. The two keys in a pair are termed the public key and the private key. The public key can be shared with anyone and is still relatively small (usually 1024 bits long). The private key is kept secret by the owner of the key-pair. These keys are one-way: ie, data encrypted with one half of the key-pair can only be decrypted with the other half of the pair and vice-versa. In practice, this means that if organisation Y wishes to encrypt some data that organisation X needs and Y only wants X to be able to decrypt that data, Y can encrypt the data with X’s public key. Although everyone potentially knows the public key for organisation X, that key is useless for decryption: only the private key can decrypt the data. The private key is kept secret and only organisation X knows it, so X is the only organisation that can decrypt the data.

This method of encryption has a far greater computational overhead and is considered much slower than symmetric encryption.[1, 3] The practical implications of this in dealing with numerous short streams of data, such as NHI, are beyond the scope of this paper, but may warrant further investigation. Asymmetric algorithms tend to generate large encrypted values compared to the original data they encrypt. This could significantly increase the size of any data sets containing NHI data encrypted in this fashion.

In this method, both the algorithm and the public key may be known (with this type of encryption there is no requirement to keep either a secret). Consequently, this method is susceptible to the dictionary attack as described in the eNHI section, where an attacker can build a table of all valid NHIs, apply the algorithm with the public key and get all encrypted values matched to original values. This problem can be solved by modifying the method slightly and introducing arbitrary data into the calculation. This is discussed in the next section as a separate method.
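The weakness can be demonstrated with a toy public-key scheme. The sketch below uses unpadded "textbook" RSA with a key pair built from two well-known Mersenne primes; this is an illustrative assumption chosen so the example needs no external library, and it is in no way a secure key. The function names and sample NHI values are hypothetical. The attacker's table is built from the public values `N` and `E` alone.

```python
from itertools import product

# Toy key pair (illustrative only, NOT secure): two Mersenne primes.
P, Q = 2**89 - 1, 2**107 - 1
N = P * Q                            # public modulus
E = 65537                            # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))    # private exponent (Python 3.8+)

def encrypt_public(nhi: str) -> int:
    """Deterministic encryption of an NHI with the public key."""
    message = int.from_bytes(nhi.encode("ascii"), "big")
    return pow(message, E, N)

def decrypt_private(ciphertext: int) -> str:
    """Decryption by the key-pair owner."""
    message = pow(ciphertext, D, N)
    size = (message.bit_length() + 7) // 8
    return message.to_bytes(size, "big").decode("ascii")

def demo_candidates():
    """Reduced domain for the demo: prefix 'ABC' plus every 4-digit suffix."""
    for digits in product("0123456789", repeat=4):
        yield "ABC" + "".join(digits)

# The attacker needs only public information to build the table.
reverse_table = {encrypt_public(nhi): nhi for nhi in demo_candidates()}

ciphertext = encrypt_public("ABC1234")
print(reverse_table[ciphertext])    # ABC1234, recovered without the private key
print(decrypt_private(ciphertext))  # ABC1234, the legitimate route
```

The private key plays no part in the attack: because the same input always yields the same output under a known public key, enumeration of the small NHI domain is enough.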

The keys are changeable in this method in two ways. Each time a dataset is encrypted, a choice is made of the public key of the organisation that will be decrypting the asymmetric cryptographically encrypted NHI (aceNHI). To bar any other organisation from decrypting the aceNHI, the originating organisation’s own public key would be used to encrypt the data. Choosing which public key to use effectively chooses which organisation can decrypt the data. The second dimension to the changeability of the keys is that the whole key-pair can be changed, at any time, for any organisation. This allows organisations to maintain secure key-pairs should a private portion of a particular pair ever be compromised.

One additional problem using an aceNHI introduces is that if the data-holder organisation encrypts using the public key for another organisation, the holder itself will not be able to decrypt the aceNHI at a later date. The solution is to include a separate aceNHI attribute in the data set for each organisation that can decrypt the aceNHI. Each attribute is encrypted with the corresponding public key of the organisation authorised to decrypt. Included in this would be an attribute for the public key of the holder organisation.

Encrypting NHIs using an organisation’s public key gives them an inherent ability to decrypt the encrypted data, while still keeping it secret from everyone else. This has an application in health research, where researchers must use anonymous data but where that data might also be supplied to NZHIS for matching to national data sets. Using this method, it is difficult to keep discrete data sets from being matched. Any two data sets encrypted with the same public key could be matched by anybody.

Measured against the considerations, aceNHI suffers the most important failure, that of not securing the data. It would be a trivial exercise to perform a dictionary attack against any data encrypted in this fashion. It is also difficult to produce discretely anonymous data sets with this method, because the same public key is used for each agency each time, so all data sets will always match together.


Salted Asymmetric Cryptographically Encrypted NHI
The dictionary attack – calculating all possible values in a finite domain of data – is well known and has been discussed above. A common solution to this problem is to introduce arbitrary data into the actual data being encrypted. This arbitrary data is referred to as “salt”. Anyone with the ability to decrypt the data can easily remove the salt (inserted in a predefined pattern) and gain the original data.

The purpose of salt is to make it computationally difficult to generate all possible combinations of the NHI. It is easy to build a reverse look-up table on NHI alone, as the total number of possible combinations is only about 1.3 × 10^7. Introducing salt with the NHI, however, increases the total number of combinations by many orders of magnitude.

Salt with a length of eight characters, each character having a possible 64 different values (a readable alphabet including upper and lower case, numbers and punctuation), combined with the NHI would increase the number of combinations to about 3.5 × 10^21 records (to grasp how large this value is, consider the age of the sun, which is approximately 1.5 × 10^17 seconds or 4.6 billion years [http://en.wikipedia.org/wiki/Sun]).
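The arithmetic behind this comparison can be reproduced directly. The sketch takes the 12,567,273 valid NHIs quoted in the Introduction as its starting point; the variable names and the 365.25-day year are assumptions.

```python
VALID_NHIS = 12_567_273   # count of valid NHIs given in the Introduction
SALT_ALPHABET = 64        # possible values per salt character
SALT_LENGTH = 8           # characters of salt

salt_values = SALT_ALPHABET ** SALT_LENGTH
combinations = VALID_NHIS * salt_values
sun_age_seconds = 4.6e9 * 365.25 * 24 * 60 * 60

print(f"salt values:  {salt_values:.1e}")      # 2.8e+14
print(f"combinations: {combinations:.1e}")     # 3.5e+21
print(f"age of sun:   {sun_age_seconds:.1e}")  # 1.5e+17
print(f"entries per second of the sun's age: {combinations / sun_age_seconds:,.0f}")
```

An attacker would have had to add tens of thousands of entries to the table every second since the sun formed to cover every NHI and salt combination.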

The disadvantage of salting NHI data is that if it is desirable to be able to create comparable data sets in the future, the salt value must be stored somewhere for reuse on those data sets. This is a relatively minor consideration, as the salts are relatively small.

By changing the salt, the same NHI can be supplied in separate datasets to the same organisation, preventing it from matching between datasets and deriving NHIs.
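The effect of the salt on matching can be shown with the same kind of toy public-key scheme. Everything here is an illustrative assumption: unpadded RSA over two Mersenne primes stands in for a real asymmetric algorithm, the salt is simply prepended to the NHI as the "predefined pattern", and the alphabet, function names and sample values are hypothetical.

```python
import secrets
import string

# Toy key pair of the receiving organisation (illustrative only, NOT secure).
P, Q = 2**89 - 1, 2**107 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))

SALT_LENGTH = 8
SALT_ALPHABET = string.ascii_letters + string.digits + "-_"  # 64 symbols

def new_salt() -> str:
    """Draw a fresh random salt for a new, discrete data set."""
    return "".join(secrets.choice(SALT_ALPHABET) for _ in range(SALT_LENGTH))

def encrypt_salted(nhi: str, salt: str) -> int:
    """Prepend the salt, then encrypt with the public key."""
    message = int.from_bytes((salt + nhi).encode("ascii"), "big")
    return pow(message, E, N)

def decrypt_salted(ciphertext: int) -> str:
    """Decrypt with the private key and strip the salt."""
    message = pow(ciphertext, D, N)
    size = (message.bit_length() + 7) // 8
    return message.to_bytes(size, "big").decode("ascii")[SALT_LENGTH:]

salt_a = "Qx7-pL2_"   # stored and reused for every set that must match set A
salt_b = "m9Zt_4Ka"   # a different salt keeps set B discrete from set A

in_set_a = encrypt_salted("ABC1234", salt_a)
in_set_b = encrypt_salted("ABC1234", salt_b)

print(in_set_a == encrypt_salted("ABC1234", salt_a))  # True: sets can be matched
print(in_set_a == in_set_b)                           # False: sets stay discrete
print(decrypt_salted(in_set_b))                       # ABC1234
```

The key-pair owner recovers the NHI from either set, while an outsider without the salt can neither match the two sets nor precompute a table from the public key.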

To allow multiple organisations to decrypt the data, an attribute can be supplied within the data set for each organisation that is authorised to do so. Each attribute would be encrypted using the corresponding public key of that organisation. The same salt may be applied across all attributes or varied for each. Rationalisation of the benefits of providing a salt for each or a common salt for all is beyond the scope of this paper but may warrant further investigation.

In this method, the key determines which organisations can decrypt data and the salt determines which data sets can be matched to each other. This provides a fine level of control of data sets.

As with the aceNHI method of encryption, the size of the encrypted NHI will tend to be large. This is the major practical limitation of this method. Measured against the considerations, this method meets them all. Although technically, both the key and the salt must be managed in this situation, the salts are likely to be small and just as manageable as a straight key.


Combination Cryptographically Encrypted NHI

The problems of increased processing time and resulting data size when using asymmetrically encrypted data are well known. A common method for overcoming these problems is to use a combination of both symmetric and asymmetric encryption. Here, NHIs are encrypted with a symmetric algorithm, making the process fast and the encrypted data relatively small. The symmetric key is then encrypted, using an asymmetric algorithm and public key of an organisation. This encrypted symmetric key can then accompany the data.

Because the symmetric algorithm encrypts each NHI in the same way, multiple occurrences of an individual can be identified in any data set using the same symmetric key.

The organisation whose public key was used to encrypt the symmetric key can determine the unencrypted NHIs by first decrypting the symmetric key (using their private key) and then using that to decrypt the data.

In this method, changing the symmetric key prevents matching of discrete data sets. Any sets encrypted with the same symmetric key can be matched.

The public keys used to encrypt the symmetric key control the organisations that are able to access the unencrypted NHIs.

The key management in this Combination Cryptographically Encrypted NHI method (cceNHI) involves keeping track of public keys for organisations authorised to decrypt the NHIs and the symmetric keys for defining discrete data sets.
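The flow of the combination method can be sketched end to end. Both building blocks below are toy stand-ins chosen so the example runs with the Python standard library only: a small Feistel cipher driven by HMAC-SHA256 in place of a real symmetric algorithm, and unpadded RSA over two Mersenne primes in place of a real asymmetric one. Neither is secure, and all function names, key sizes and sample values are assumptions.

```python
import hashlib
import hmac
import secrets

# Toy key pair of the receiving organisation (illustrative only, NOT secure).
P, Q = 2**89 - 1, 2**107 - 1
N, E = P * Q, 65537
D = pow(E, -1, (P - 1) * (Q - 1))

def _round(key: bytes, index: int, half: bytes) -> bytes:
    return hmac.new(key, bytes([index]) + half, hashlib.sha256).digest()[:4]

def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def sym_encrypt(key: bytes, nhi: str) -> str:
    """Toy symmetric cipher: 8-round Feistel network on an 8-byte block."""
    block = nhi.encode("ascii").ljust(8, b"\x00")
    left, right = block[:4], block[4:]
    for i in range(8):
        left, right = right, _xor(left, _round(key, i, right))
    return (left + right).hex()

def sym_decrypt(key: bytes, token: str) -> str:
    block = bytes.fromhex(token)
    left, right = block[:4], block[4:]
    for i in reversed(range(8)):
        left, right = _xor(right, _round(key, i, left)), left
    return (left + right).rstrip(b"\x00").decode("ascii")

def wrap_key(sym_key: bytes) -> int:
    """Encrypt the symmetric key with the recipient's public key."""
    return pow(int.from_bytes(sym_key, "big"), E, N)

def unwrap_key(wrapped: int, length: int = 16) -> bytes:
    """Recover the symmetric key with the recipient's private key."""
    return pow(wrapped, D, N).to_bytes(length, "big")

# Holder: one symmetric key per discrete data set, wrapped for the recipient.
sym_key = secrets.token_bytes(16)
rows = [sym_encrypt(sym_key, nhi) for nhi in ["ABC1234", "XYZ5678", "ABC1234"]]
package = {"wrapped_key": wrap_key(sym_key), "rows": rows}

# Recipient: unwrap the key, then decrypt the data.
recovered_key = unwrap_key(package["wrapped_key"])
print([sym_decrypt(recovered_key, token) for token in package["rows"]])
```

Rows one and three carry the same token, so repeat occurrences of an individual remain visible within the set even to parties that cannot unwrap the key.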

When measured against the considerations, this method also meets all criteria.

Summary
Table 1 shows a summary of the criteria assessed for each method discussed.

Although the aID method generates a key of sorts, the key is a large lookup table and its size makes it difficult to manage, making this method an impractical choice. Both the aID and sceNHI methods, although they allow sharing of the keys to decrypt, pose problems in secure key transmission and key management for any central organisation needing to decrypt the data.

While the aceNHI method meets almost all criteria, it fails on the most important one of data set security. Its weakness to a reverse-mapping-table attack renders it useless for the purpose of encrypting a finite set of data.

Both the saceNHI and cceNHI methods meet all the security criteria put forward. They provide a fine level of control to a holder organisation, allowing it to determine who may decrypt the data and which data sets can be matched together (and by whom, potentially). For either of these methods to be effective, they would need to gain widespread acceptance and consensus on their implementation. Issues for consideration include the specific algorithms for the encryption, the pattern of salting and the layout and format of data files containing saceNHI or cceNHI fields.

Further work should be done to investigate the consequences of different encryption algorithms, salt sizes, salt patterns, encrypted data sizes and encryption speeds for saceNHI and cceNHI. These, plus the consideration of practical application, will impact on the costs of implementing such methods. Consideration of this is, however, beyond the scope of this paper, and may form the basis of future work in this area.

Table 1: Summary comparison of methods to criteria met

Encryption methodology    eNHI   aID   sceNHI   aceNHI   saceNHI   cceNHI
Securing the data         Y      Y     Y        N        Y         Y
Reversing encryption      Y      Y     Y        Y        Y         Y
Encryption at source      N      Y     Y        Y        Y         Y
Changeable key            N      Y     Y        Y        Y         Y
Key management only       -      N     Y        Y        Y         Y
Set sharing               Y      N     N        Y        Y         Y
Discrete data sets        N      Y     Y        N        Y         Y
Total Yes                 3      5     6        5        7         7
Total No                  3      2     1        2        0         0


References

  1. Schneier B. Applied cryptography. 2nd ed. New York, NY: John Wiley & Sons Inc; 1996.
  2. Schneier B. Beyond fear. 1st ed. New York, NY: Copernicus Books; 2003.
  3. Ferguson N, Schneier B. Practical cryptography. 1st ed. Indianapolis, Indiana: Wiley Publishing Inc; 2003.



Acknowledgements
Thanks to Michael Shapleski and John Grant for reading and making comments on a draft of this paper.