Distance-aware address encoding for Privacy-Preserving Record Linkage

Wilko Henecka
Feb 26, 2019 · 10 min read

Privacy-Preserving Record Linkage (PPRL) is the alignment of records from two different datasets that describe the same entity, without revealing any information about the entities’ data to other parties. One challenge is that the information about an entity in the different data sources is usually not exactly the same. There might be spelling mistakes, abbreviations, or missing fields.

A popular technique for PPRL is based on Bloom filters. A Bloom filter is a space-efficient probabilistic data structure for set comparison. The idea is to encode every entity into a Bloom filter in such a way that similar Bloom filters represent similar entities.

A common encoding technique for inserting entity information into Bloom filters is the n-gram encoding. It efficiently approximates string distance and has been shown to provide good results in real-world applications.

The set of bigrams (n-grams with n = 2) of a string consists of all its substrings of length 2. For example, the bigrams of the name ‘Lucas’ are ‘Lu’, ‘uc’, ‘ca’, and ‘as’. Given a set of j different hash functions h₁, h₂, …, hⱼ, each mapping a bigram to a specific index in the Bloom filter, the process of inserting a bigram into a filter is as follows: for each hash function hᵢ, map the bigram to an index and set the bit of the Bloom filter at this index to 1. Below is a toy example of mapping ‘Lucas’ into a Bloom filter bf₁ and ‘Lukas’ into bf₂.
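A rough sketch of this process in Python (illustrative only: a toy hash construction, not clkhash’s actual CLK encoding, which is keyed and more involved):

```python
import hashlib

def bigrams(word):
    """Return the list of character bigrams of a string."""
    return [word[i:i + 2] for i in range(len(word) - 1)]

def insert(bloom_filter, token, num_hashes, filter_len):
    """Set one bit per hash function for the given token."""
    for i in range(num_hashes):
        # Derive independent indices by salting the hash with i.
        digest = hashlib.sha256(f"{i}:{token}".encode()).digest()
        index = int.from_bytes(digest, "big") % filter_len
        bloom_filter[index] = 1

m = 16              # toy filter length for illustration
bf1, bf2 = [0] * m, [0] * m
for bg in bigrams("Lucas"):
    insert(bf1, bg, num_hashes=2, filter_len=m)
for bg in bigrams("Lukas"):
    insert(bf2, bg, num_hashes=2, filter_len=m)
```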

We can see that the bigrams ‘Lu’ and ‘as’ of both ‘Lucas’ and ‘Lukas’ are mapped to the same indices. Only the bigrams that contain the spelling mistake set differing positions. In general, a one-character difference will only affect two bigrams. Thus, the similarity of the Bloom filter encodings approximates string edit distance.

Now that we have Bloom filter representations for Lucas and Lukas, we can compute their similarity. A popular similarity metric for PPRL is the Jaccard index: the ratio of the number of set bits the two filters share to the number of positions where at least one filter has a set bit. A value of 1 means the filters set exactly the same bits; a value of 0 means they share none. In our example above the Jaccard index is 5/9.
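Continuing the sketch above, the Jaccard index of two such bit lists can be computed directly:

```python
def jaccard(bf1, bf2):
    """Jaccard index of two equal-length bit lists."""
    both = sum(1 for a, b in zip(bf1, bf2) if a and b)
    either = sum(1 for a, b in zip(bf1, bf2) if a or b)
    return both / either if either else 1.0

print(jaccard(bf1, bf2))
```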

The Confidential Computing group at Data61 has released two libraries for privacy-preserving record linkage:

  • clkhash, a library for encoding personal information into a privacy-preserving Bloom filter representation called cryptographic long-term key (CLK).
  • anonlink, an implementation of anonymous linkage using CLKs.

On the temporal stability of personal information

Personal information that can be used for linkage includes features such as given name, surname, date of birth, address, sex, height, eye colour, and phone number.

Some of these features (e.g., date of birth, sex, eye colour) are very unlikely to ever change during a lifetime. Others change only rarely, e.g., a surname after marriage. But there are also some that might change quite frequently, like address or phone number.

In Australia, according to the 2007–08 Survey of Income and Housing, 43% of people aged 15 years and over had moved in the last five years (from Australian Social Trends 2010).

When performing edit distance-based record linkage on data sets collected at different times, the address information contributes an arbitrary similarity score whenever a person has moved. Moving from Oxford Ave in Bankstown to Oxford St in Blacktown yields a higher similarity score than moving to Green Ave in Smithfield: string similarity favours similar place names over locations that are physically close.

In the example above, we can see that the Blacktown and Bankstown addresses have eight bigrams in common, whereas the Bankstown and Smithfield ones share only two. Thus, measuring string similarities in the presence of address changes leads to arbitrary similarity scores.

An interesting observation is that people don’t tend to move very far.

Most people who move house move relatively close to their former address. Of people who had moved house in the year prior to the 2006 HILDA survey, 60% moved a distance of only 0–9 kilometres from their former residence. (Australian Social Trends 2010)

Instead of computing the string similarity of addresses, is there a way of encoding them so that the similarity of encoded addresses relates to the physical distance between them?

Distance-aware address encoding

Computing string similarities to measure the similarity of entities, as the n-gram encoding does, has been shown to produce good results in the presence of spelling mistakes. However, it is a poor choice in the case of address changes, as it leads to rather arbitrary distance values.

Let us instead investigate an encoding for the geographical coordinates of an address. Every address has a unique location on the earth, described by its latitude and longitude. There are many geocoding services that map addresses to their geographic coordinates.
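For example, with the geopy library one could query the free Nominatim geocoder (just one option among many; the post does not name a specific service, and the address string here is made up):

```python
# pip install geopy
from geopy.geocoders import Nominatim

geolocator = Nominatim(user_agent="pprl-address-encoding-demo")
location = geolocator.geocode("Oxford Ave, Bankstown NSW, Australia")
if location is not None:
    print(location.latitude, location.longitude)
```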

But how can we encode the coordinates in a way that the differences of Bloom filter encodings of addresses relate to their respective physical distances? We need a function that maps coordinates to tokens, such that similar coordinates are likely mapped to the same tokens.

One family of functions with this property is locality-sensitive hash functions (LSH). An LSH maps a multi-dimensional input to a one-dimensional value called a bucket, such that similar inputs map to the same bucket with high probability. LSHs have been successfully applied to problems such as near-duplicate detection, similarity identification for images and audio, nearest-neighbour search, and fingerprinting of audio and video.

First try: Random Projection

Random projection is a technique for reducing the dimensionality of points in Euclidean space while approximately preserving the distances between them.

For illustration purposes, assume we have a two-dimensional Euclidean space and an address aᵢ is represented by a vector. A random projection can be visualised by a random projection plane which divides the space (dotted green line). All addresses on the positive side of the projection plane are mapped to 1, and the ones on the negative side to 0. It can be shown mathematically that the probability that two vectors aᵢ, aⱼ are mapped to the same value relates to the angle between them. More formally,

P[h(aᵢ) = h(aⱼ)] = 1 − θ(aᵢ, aⱼ)/π,

with θ(aᵢ, aⱼ) denoting the angle between the two vectors.

Random projection is used in the LSH method SimHash. It is designed to approximate the cosine of the angle between two multi-dimensional vectors.

The idea is to map each input vector to a set of k tokens by applying the same k random projections to each. The Jaccard index of the two token sets (a measure of the similarity of two sets) then approximates the cosine of the angle between the vectors.
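A minimal sketch of the token generation (made-up input vectors and parameters, not the author’s exact implementation):

```python
import numpy as np

def random_projection_tokens(vec, planes):
    """One token per random plane: which side of plane i the vector lies on.

    Two vectors share token i exactly when plane i does not separate them.
    """
    signs = planes @ vec >= 0
    return {(i, bool(s)) for i, s in enumerate(signs)}

rng = np.random.default_rng(42)
k = 200                               # number of tokens per address
planes = rng.standard_normal((k, 3))  # random planes through the origin

# Two made-up, nearby points on the unit sphere (3-D Cartesian coordinates).
a1 = np.array([-0.488, 0.405, -0.773])
a2 = np.array([-0.489, 0.404, -0.773])

tokens_1 = random_projection_tokens(a1, planes)
tokens_2 = random_projection_tokens(a2, planes)
jaccard = len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)
```

Since every input yields exactly one token per plane, the Jaccard index here is matches / (2k − matches), a monotone function of the fraction of agreeing projections.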

This approach fits well with the Bloom filter-based approach, as we can easily insert the tokens into the Bloom filter, and the similarity of different records is computed as the Jaccard index between them.

But first we need to evaluate how well the random projection method works for the similarity of coordinates. We want to limit ourselves to at most 200 tokens per address: the complexity of Bloom filter-based record linkage grows with the size of the filters, and a filter only supports a certain number of insertions before accuracy deteriorates. The more insertions, the higher the chance of a collision (two tokens hashed to the same position), and the more collisions there are, the less accurate the membership tests become. The optimal number of insertions to minimise the effect of collisions is ln 2 · m, with m denoting the length of the Bloom filter. A choice of m that has been shown to achieve good results is 1000 bits, which allows roughly ln 2 · 1000 ≈ 693 insertions; because we also want to insert other features besides the address, we limit ourselves to 200 tokens for the address.

Evaluation Setting

We pick a random coordinate as the centre. We then draw circles with radii 10, 20, …, 200 km around the centre and sample a group of 10,000 random addresses from each of the resulting annuli. This gives us 20 groups of 10,000 addresses, where the addresses in the first group lie 0 to 10 km from the centre, and those in the last group 190 to 200 km away.
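A sketch of this sampling scheme, assuming a spherical earth and a made-up centre (the post picks one at random):

```python
import math
import random

EARTH_RADIUS_KM = 6371.0

def destination(lat, lon, distance_km, bearing_deg):
    """Point reached from (lat, lon) after travelling distance_km
    along the given initial bearing, assuming a spherical earth."""
    d = distance_km / EARTH_RADIUS_KM
    theta = math.radians(bearing_deg)
    phi1, lam1 = math.radians(lat), math.radians(lon)
    phi2 = math.asin(math.sin(phi1) * math.cos(d)
                     + math.cos(phi1) * math.sin(d) * math.cos(theta))
    lam2 = lam1 + math.atan2(math.sin(theta) * math.sin(d) * math.cos(phi1),
                             math.cos(d) - math.sin(phi1) * math.sin(phi2))
    return math.degrees(phi2), (math.degrees(lam2) + 540) % 360 - 180

def sample_annulus(center, r_min_km, r_max_km, n):
    """Sample n points whose distance to the centre lies in [r_min, r_max].

    Taking the square root of a uniform radius-squared makes the
    points uniform by area within the annulus.
    """
    lat, lon = center
    points = []
    for _ in range(n):
        r = math.sqrt(random.uniform(r_min_km**2, r_max_km**2))
        points.append(destination(lat, lon, r, random.uniform(0, 360)))
    return points

centre = (-33.87, 151.21)   # assumed centre for illustration
groups = [sample_annulus(centre, r, r + 10, 10000) for r in range(0, 200, 10)]
```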

We then encode all addresses and compute the similarity score between each encoding and the encoded centre.

The results are displayed as violin plots. Each group is represented by a vertical line; the shape represents the distribution of results, with the thickness indicating how common a value is. Three tick marks denote the extrema and the mean of the data.

Well, that’s a bit disappointing. The similarity scores across all distances are very close to 1, so this will not help us very much.

The problem here is that a 200 km distance on earth corresponds to an angle of less than two degrees between the coordinate vectors (200 km / 6371 km ≈ 0.031 rad ≈ 1.8°), and the cosine of such a small angle is very close to 1.

A distance of 200 km on earth relates to an angle of ~1.8°.

For our purpose, we need something more sensitive to small distances.

Second try: E²LSH

Let’s try an idea from E²LSH, described in Locality-sensitive hashing using stable distributions. Instead of mapping an input element to just two different values, we now quantise the result of the inner product: we divide the range of possible values into buckets of a fixed width, thus mapping to the integers. The closer two values are, the more likely they are to be hashed into the same bucket.

More formally, given a fixed sensitivity parameter w from the reals, a vector a whose entries are chosen at random from the normal distribution with mean 0 and standard deviation 1, and a real number b chosen uniformly from the range [0, w], the hash function h is

h(v) = ⌊(⟨a, v⟩ + b) / w⌋

Note that we can freely choose the value of w such that the sensitivity of the hash function matches our application. The graph below was generated with w=50.
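A minimal sketch of this scheme (the post does not state the units of the Cartesian coordinates, so treating them as kilometres, and hence the scale of w relative to the inputs, is an assumption here):

```python
import numpy as np

def e2lsh_tokens(vec, a, b, w):
    """One token per hash function: token i is (i, floor((<a_i, v> + b_i) / w))."""
    buckets = np.floor((a @ vec + b) / w).astype(int)
    return {(i, int(bucket)) for i, bucket in enumerate(buckets)}

def to_cartesian(lat_deg, lon_deg, radius_km=6371.0):
    """Convert geographic coordinates to 3-D Cartesian coordinates in km."""
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return radius_km * np.array([np.cos(lat) * np.cos(lon),
                                 np.cos(lat) * np.sin(lon),
                                 np.sin(lat)])

rng = np.random.default_rng(0)
k, w = 200, 50.0                     # 200 tokens; w = 50 as in the post
a = rng.standard_normal((k, 3))      # Gaussian projection vectors
b = rng.uniform(0.0, w, size=k)      # offsets drawn uniformly from [0, w)

tokens_1 = e2lsh_tokens(to_cartesian(-33.87, 151.21), a, b, w)
tokens_2 = e2lsh_tokens(to_cartesian(-33.95, 151.10), a, b, w)
jaccard = len(tokens_1 & tokens_2) / len(tokens_1 | tokens_2)
```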

Again, using the same test data as before, we generate 200 tokens for each input, compute the Jaccard index, and plot the results as a violin plot.

Well, that’s a lot better than before. We use almost the entire range of possible similarity scores, and the scores decrease on average with increasing distance. However, there is a lot of variance: a pair of addresses 200 km apart can have a higher similarity score than another pair only 40 km apart.

Third try: E²LSH (now with less variance)

We assume that the high variance is mainly due to a significant number of collisions. In the context of locality-sensitive hash functions, a collision means that two distant values are hashed into the same bucket. The main culprit here is the projection: the inner product of the input with a projection vector reduces three dimensions to one with simple additions, so a difference in one dimension might be (partially) cancelled out by another.

Let’s try reducing the number of dimensions to reduce the chance of such collisions. The previous graphs were generated with the inputs as three-dimensional Cartesian coordinates. But as the earth is (almost) a sphere, we can describe any place on it by just two values: one for latitude and one for longitude. Below is the same graph as above, but this time using the two-dimensional geographic coordinates instead of the three-dimensional Cartesian ones.

That made a huge difference. Of course, we might run into problems with the longitude values: longitude is defined on the range −180° to 180° and wraps around, so −179° is very close to 179°.

As long as your record linkage problem doesn’t span the whole earth (these problems are usually country- or continent-specific), you can simply shift the prime meridian (longitude 0°) to the centre of your target area, as sketched below (don’t feel bad about redefining the prime meridian; it’s been done many times in the past).
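A small helper capturing this shift; the choice of 134° E (roughly the middle of Australia) as the new meridian is just an example:

```python
def shift_longitude(lon_deg, new_meridian_deg):
    """Re-centre longitudes on new_meridian_deg, keeping results in [-180, 180)."""
    return (lon_deg - new_meridian_deg + 180.0) % 360.0 - 180.0

# After the shift, the two formerly wrapping longitudes are only 2 degrees apart:
print(shift_longitude(179.0, 134.0))   #  45.0
print(shift_longitude(-179.0, 134.0))  #  47.0
```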

Conclusion

Frequent address changes pose a challenge for privacy-preserving record linkage, as current techniques for measuring string similarity produce arbitrary results when an address changes.

We presented a novel distance-aware address encoding, where the similarity of two encodings relates to the physical distance between the two addresses.

We believe that this technique can be helpful in improving privacy-preserving record linkage, as it produces more accurate similarities after address changes, considering that people tend to stay in the same area.
