The Mystery of Matches


Matching is a daunting task for a Data Management professional working on an MDM solution, given the number of options to choose from when configuring a match engine. First, what kind of matching engine to use: Deterministic, Probabilistic, Heuristic, Fuzzy or Mathematical? As if that were not enough, the professional then has to choose the fields to match on: Name, Address, Identifiers, email, phone, gender, DOB, Product Code and so on. Next comes the question of which match algorithm to use for which field: Hamming Distance, Jaro-Winkler, Edit Distance, Bigram or Bigram frequency. Then there are the match rules or rule sets to define, and there are still more choices after that: the thresholds for deciding which records are exact matches, potential matches and unique.
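To make the last two choices concrete, here is a minimal Python sketch of one of the algorithms named above (Edit Distance) feeding a threshold-based decision. The function names and the two threshold values are assumptions chosen for illustration, not settings from any particular MDM product.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row with dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Scale edit distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Hypothetical thresholds; real values come out of testing and tuning.
MATCH_THRESHOLD = 0.90
POTENTIAL_THRESHOLD = 0.70


def classify(score: float) -> str:
    """Map a similarity score to one of the three outcomes."""
    if score >= MATCH_THRESHOLD:
        return "match"
    if score >= POTENTIAL_THRESHOLD:
        return "potential match"
    return "unique"
```

With these assumed thresholds, `classify(similarity("smith", "smyth"))` lands in the "potential match" band, since one substitution in five characters gives a score of 0.8.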

Just when the data professional thinks he or she has done a good job, has made all the right choices and will get perfect match results, along come the false positives and false negatives. If there is anything the Data Professional dreads most, it is these two.

False Positives: These are records that are identified as matches when they are actually not matches.
False Negatives: These are records that are identified as not matching when they are actually matches.
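As a rough illustration of these two definitions, the sketch below counts false positives and false negatives by comparing the pairs a match engine proposed against a manually verified sample. The helper names `pair` and `count_errors`, and the sample record IDs, are hypothetical.

```python
def pair(a, b) -> frozenset:
    """Order-independent pair of record IDs, so (1, 2) equals (2, 1)."""
    return frozenset((a, b))


def count_errors(predicted: set, actual: set) -> dict:
    """Compare predicted match pairs with a manually verified set of true pairs."""
    return {
        "true_positives": len(predicted & actual),
        "false_positives": len(predicted - actual),  # matched, but should not be
        "false_negatives": len(actual - predicted),  # missed, but should match
    }


# Made-up sample: the engine proposed three pairs, reviewers confirmed three.
predicted = {pair(1, 2), pair(3, 4), pair(5, 6)}
actual = {pair(1, 2), pair(5, 6), pair(7, 8)}
result = count_errors(predicted, actual)
```

Here the pair (3, 4) is a false positive and the pair (7, 8) is a false negative.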

The false negatives are the easier of the two to handle and can generally be resolved through a manual process. It is the false positives that are the real nightmare: this group of records gives the Data Professional sleepless nights because the cost of fixing them is very high. If the records have been merged, there needs to be a process to unmerge them, restore them to their original state and then reapply the updates that were made after the merge.
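One way to picture the unmerge process is a log that keeps a snapshot of each source record at merge time, plus every later update. The sketch below is only an illustration: the `MergeLog` class is hypothetical, and it assumes each post-merge update can be attributed to one of the original source records, which a real system may not be able to guarantee.

```python
import copy


class MergeLog:
    """Hypothetical helper that keeps enough history to undo a merge."""

    def __init__(self):
        self.snapshots = {}  # merged_id -> {source_id: original record}
        self.updates = {}    # merged_id -> [(source_id, field, value), ...]

    def merge(self, merged_id, sources):
        # Deep-copy the originals so later edits cannot alter the snapshot.
        self.snapshots[merged_id] = copy.deepcopy(sources)
        self.updates[merged_id] = []
        merged = {}
        for record in sources.values():
            for field, value in record.items():
                merged.setdefault(field, value)  # first value seen wins
        return merged

    def record_update(self, merged_id, source_id, field, value):
        # Assumption: every update names the source record it belongs to.
        self.updates[merged_id].append((source_id, field, value))

    def unmerge(self, merged_id):
        # Restore the originals, then replay the post-merge updates in order.
        restored = self.snapshots.pop(merged_id)
        for source_id, field, value in self.updates.pop(merged_id):
            restored[source_id][field] = value
        return restored
```

The survivorship rule in `merge` (first value seen wins) is deliberately simplistic; the point is only that the snapshot and the update log must both exist before an unmerge is possible.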

Many MDM Implementation teams go overboard with all the options and spend a lot of time and effort trying to perfect the match process. The team needs to recognise the catch-22 situation (rules tightened to cut false positives tend to produce more false negatives, and vice versa), spend time only on optimising the match process rather than trying to perfect it, and put a process in place to deal with the false positives and false negatives.

Some of the best practices I have come across while designing the Match process are:

  • Cleanse and standardise data before the match process. This avoids the need for complex match algorithms and ensures better match groups.
  • Profile the data to determine the attributes best suited for matching. Look for completeness, accuracy and consistency.
  • Use at least one exact match attribute for every match rule.
  • Too many fuzzy rules can result in too many potential matches, so they need to be tested and tuned before being applied in production.
  • Test your match process on at least 50% of the production data volume for good results.
  • Conduct a minimum of three match cycles during testing to arrive at the best set of match rules.
  • Keep reviewing the results of the match process in production and fine-tune it.
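The first and third practices can be sketched together: standardise the values, then require one exact-match attribute before any fuzzy comparison is attempted. In this sketch the field names (`dob`, `name`), the 0.85 threshold and the function names are assumptions, and the fuzzy score comes from Python's standard `difflib.SequenceMatcher` rather than from any of the algorithms named earlier.

```python
import re
from difflib import SequenceMatcher


def standardise(value: str) -> str:
    """Lower-case, strip punctuation and collapse whitespace."""
    value = re.sub(r"[^\w\s]", "", value.lower())
    return re.sub(r"\s+", " ", value).strip()


def is_match(rec_a: dict, rec_b: dict, name_threshold: float = 0.85) -> bool:
    """Match rule: exact DOB plus a fuzzy comparison of standardised names."""
    # Exact-match attribute: the rule never fires unless DOB agrees.
    if rec_a["dob"] != rec_b["dob"]:
        return False
    name_a = standardise(rec_a["name"])
    name_b = standardise(rec_b["name"])
    return SequenceMatcher(None, name_a, name_b).ratio() >= name_threshold
```

Because both names are standardised first, "  John  O'Neil " and "john oneil" compare as identical strings, so the fuzzy step has far less work to do.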

As your database grows, the match process can take a lot of time and horsepower from your infrastructure. It is therefore worth going back to your match rules, analysing which ones are not coming into play and removing them. This should help boost the performance of the match process and reduce its resource needs.
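A simple way to find the rules that are not coming into play is to count how often each one fires over a sample of candidate pairs. The sketch below assumes rules are plain Python functions keyed by name; `rule_usage` and `unused_rules` are hypothetical helpers, not part of any MDM tool.

```python
from collections import Counter


def rule_usage(rules: dict, pairs: list) -> Counter:
    """Count how many candidate pairs each named rule fires on."""
    hits = Counter({name: 0 for name in rules})  # keep zero-hit rules visible
    for rec_a, rec_b in pairs:
        for name, rule in rules.items():
            if rule(rec_a, rec_b):
                hits[name] += 1
    return hits


def unused_rules(hits: Counter) -> list:
    """Rules that never fired and are candidates for removal."""
    return sorted(name for name, count in hits.items() if count == 0)
```

A rule that never fires on a representative sample is a candidate for removal, though it is worth checking that the sample really covers the data the rule was written for.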

Hope you found the tips useful and will use some of them in your projects. I would like to know about other best practices you have adopted in your matching process; please share them in the comments section below.

  • Malay Baral

    This is really informative, Rohin.