How is it accomplished?

rev. 04/25/2012

Probabilistic record linkage is accomplished by comparing data fields, such as birth date or gender, across two files. Comparing numerous fields leads to a judgment about whether two records refer to the same person and event (and should therefore be linked). This judgment is based on the cumulative weight of agreement and disagreement among field values. The more information a field carries, the greater its impact on the judgment.

For instance, agreement of the gender field alone would not determine that two records refer to the same patient, but agreement on Social Security Number nearly guarantees that two records refer to the same individual. By assigning log-likelihood ratios to field comparisons, it is possible to computerize the judgment process.

Let $m_i$ equal the probability the $i$th field agrees, given that the records are known to refer to the same person (a true match). Let $u_i$ equal the probability that the $i$th field will agree by chance among records known to not match. Then for a given pair of records, if field $i$ agrees, the agreement weight is $w_i = \log_2(m_i / u_i)$.

If field $i$ disagrees, a disagreement weight $w_i = \log_2((1 - m_i) / (1 - u_i))$ is assigned. The composite weight for a record pair will be the sum of agreement and disagreement weights for all fields available for comparison.
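
The weight formulas above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names, the field names, and the m and u values in `PARAMS` are assumptions made for the example, and a missing comparison is represented by `None` so that it is skipped rather than scored.

```python
import math

def field_weight(m, u, agrees):
    """Return the log2 agreement or disagreement weight for one field."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def composite_weight(comparisons, params):
    """Sum the field weights for one record pair.

    comparisons: dict mapping field name -> True, False, or None
                 (None means the field is unavailable and is skipped).
    params:      dict mapping field name -> (m, u).
    """
    total = 0.0
    for field, agrees in comparisons.items():
        if agrees is None:
            continue  # field not available for comparison
        m, u = params[field]
        total += field_weight(m, u, agrees)
    return total

# Illustrative (assumed) m and u probabilities, not values from the text.
PARAMS = {
    "gender": (0.95, 0.50),
    "birth_date": (0.90, 0.01),
}

if __name__ == "__main__":
    pair = {"gender": True, "birth_date": True}
    print(round(composite_weight(pair, PARAMS), 2))
```

With these assumed values, agreement on birth date contributes far more weight than agreement on gender, which mirrors the point that a field rarely agreeing by chance carries more information.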

Examining composite weights for all record pairings allows researchers to assign upper and lower critical values. All pairings with weights higher than the upper critical value are considered matches, while those with weights below the lower critical value are not matched. Pairings with weights between the two critical values are reviewed manually and assigned to either the matched or unmatched sets.
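
The three-way decision described above might look like the following sketch. The function name, the labels, and the example cutoffs are hypothetical; in practice the critical values come from examining the distribution of composite weights. Weights exactly equal to a critical value are sent to manual review here, which is an assumption the text does not settle.

```python
def classify(weight, lower, upper):
    """Assign a record pair to a class based on its composite weight."""
    if weight > upper:
        return "match"
    if weight < lower:
        return "non-match"
    return "review"  # between the critical values: manual review

if __name__ == "__main__":
    # Assumed cutoffs for illustration only.
    for w in (12.3, 4.0, -6.5):
        print(w, classify(w, lower=0.0, upper=8.0))
```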

If the databases are large or the linking information is poor, many true matches may fail to reach the predefined cutoff weight and will therefore be excluded from the analysis of linked records. Probabilistic linkage may then identify only cases with full and accurate information, which can lead to bias if the errors or missing data arise from some systematic mechanism rather than at random. To avoid this bias, researchers can apply multiple imputation techniques to probabilistic linkage. Rather than fixing a lower bound that a match must achieve to be included in the linked dataset, this method takes several weighted samples of possible matches. Analyses are performed on each sample and the results are averaged across samples.
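
One way to picture the sampling idea is the rough sketch below. It rests on several assumptions not stated in the text: each candidate pair already has a match probability attached (how that probability is derived from the composite weight is not covered here), each sample includes a pair with that probability, and `analyze` is a hypothetical user-supplied function returning a single number. It does not reproduce any particular published procedure.

```python
import random
from statistics import mean

def imputed_estimate(candidates, analyze, n_samples=5, seed=0):
    """Average an analysis over several sampled sets of links.

    candidates: list of (record_pair, match_probability) tuples.
    analyze:    function taking a list of record pairs, returning a number.
    """
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    estimates = []
    for _ in range(n_samples):
        # Include each candidate pair with its own match probability.
        sample = [pair for pair, p in candidates if rng.random() < p]
        estimates.append(analyze(sample))
    return mean(estimates)

if __name__ == "__main__":
    pairs = [(("a1", "b7"), 0.98), (("a2", "b3"), 0.60), (("a4", "b9"), 0.15)]
    # Hypothetical analysis: the number of linked pairs in each sample.
    print(imputed_estimate(pairs, analyze=len, n_samples=20))
```

Because uncertain pairs appear in some samples and not others, they contribute to the averaged result in proportion to how plausible they are, instead of being dropped outright by a cutoff.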

To reduce computation time, both files are sorted on one or several data fields, called blocking variables, and comparisons are made only between records that agree on those fields. If an error occurs in a data field that is used for blocking, records that should match will never be compared. To account for this problem, records that fail to match are subjected to subsequent matching attempts after the files are re-blocked on different data fields.
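
Blocking can be sketched as follows. Note one substitution: instead of sorting both files, this example groups one file in a dictionary keyed by the blocking values, which yields the same candidate pairs. The record layout (dictionaries with `id`, `zip`, and `birth_year` keys) and the function name are assumptions for illustration, and records with a missing blocking value are simply skipped.

```python
from collections import defaultdict

def candidate_pairs(file_a, file_b, blocking_fields):
    """Yield (a, b) record pairs that agree on every blocking field."""
    index = defaultdict(list)
    for rec in file_b:
        key = tuple(rec.get(f) for f in blocking_fields)
        if None in key:
            continue  # cannot block on a missing value
        index[key].append(rec)
    for rec in file_a:
        key = tuple(rec.get(f) for f in blocking_fields)
        if None in key:
            continue
        for other in index.get(key, []):
            yield rec, other

if __name__ == "__main__":
    file_a = [{"id": 1, "zip": "84101", "birth_year": 1980}]
    # Same person, but the zip code was recorded with an error.
    file_b = [{"id": 9, "zip": "84110", "birth_year": 1980}]

    print(list(candidate_pairs(file_a, file_b, ["zip"])))         # no pairs
    print(list(candidate_pairs(file_a, file_b, ["birth_year"])))  # one pair
```

The first pass misses the pair because of the zip code error; re-blocking on birth year in a second pass brings the two records together for comparison.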