Abstract
In the biomedical domain, the record linkage is considered as a crucial problem. When the number of records is very large, existing algorithms for record linkage take too much time. Often, we have to link a small set of new records with a large set of old records. This can be done by putting together the old and new records and performing a linkage on all the records. Clearly, this will call for an enormous amount of time. An alternative is to develop algorithms that perform linkage in an incremental manner. We refer to any such algorithm as an Incremental Record Linkage (IRL) algorithm. In this paper we present an efficient IRL algorithm. In addition to taking large amounts of time, existing algorithms might also suffer from a chaining problem and hence introduce some errors in linking. As has been observed in the literature, this chaining problem can be solved by performing clustering under complete linkage. The IRL algorithm we present in this paper employs complete linkage and is called as Incremental Record Linking Algorithm using Complete Linkage "IRLA-CL". We propose sequential and parallel versions of this algorithm. IRLA-CL can handle any number of datasets. In contrast, many of the existing algorithms can only link two datasets at a time. Our algorithm outperforms previous algorithms and offer state-of-the-art solutions to the IRL problem as well. Our algorithms have been tested on millions of records on synthetic and real datasets and outperform the best-known RLA-CL algorithm when the number of new records is up to around 20% of the total number of old records and achieve a very nearly linear speedup in parallel.