Matching is a daunting task for a data management professional in an MDM solution, given the number of options to choose from when configuring a match engine. What kind of matching engine should be used: deterministic, probabilistic, heuristic, fuzzy or mathematical? As if that were not enough, the professional then needs to choose the fields to be used for matching: name, address, identifiers, email, phone, gender, DOB, product code and so on. Then there is the question of which match algorithm to use for each field: Hamming distance, Jaro-Winkler, edit distance, bigram or bigram frequency. Then come the match rules or rule sets, and there are still more choices: the thresholds for deciding exact matches, potential matches and uniques.
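To make a couple of these choices concrete, here is a minimal Python sketch of one of the fuzzy options mentioned above, bigram (Dice) similarity, together with a threshold step that buckets a score into exact match, potential match or unique. The threshold values are invented purely for illustration, not recommendations:

```python
def bigrams(s):
    """Return the set of two-character substrings of a lowercased string."""
    s = s.lower()
    return {s[i:i + 2] for i in range(len(s) - 1)}

def dice_similarity(a, b):
    """Bigram (Dice) similarity: 2*|shared bigrams| / (total bigrams)."""
    ba, bb = bigrams(a), bigrams(b)
    if not ba or not bb:
        return 0.0
    return 2 * len(ba & bb) / (len(ba) + len(bb))

def classify(score, exact=0.9, potential=0.7):
    """Bucket a similarity score; thresholds here are illustrative only."""
    if score >= exact:
        return "exact match"
    if score >= potential:
        return "potential match"
    return "unique"
```

For example, `classify(dice_similarity("Jon Smith", "John Smith"))` would land the pair in the potential-match bucket for manual review rather than auto-merging it.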
Just when the data professional thinks they have done a good job, made all the right choices and can expect perfect match results, they are saddled with false positives and false negatives. If there is anything a data professional dreads most, it is these two.
False negatives: records that are identified as not a match when they are actually matches. False positives: records that are identified as a match when they are actually distinct.
False negatives are the easier of the two to handle and can generally be resolved through a manual review process. It is the false positives that are the real nightmare; this group of records gives the data professional sleepless nights because the cost of fixing them is very high. If the records have been merged, there needs to be a process to unmerge them, restore them to their original state and then reapply any updates that were applied after the merge.
Many MDM implementation teams go overboard with all these options and spend a lot of time and effort trying to perfect the match process. The team needs to realise the catch-22, spend its time optimising the process rather than trying to perfect it, and put mechanisms in place to deal with the false positives and false negatives.
Some of the best practices I have come across while designing the match process are:
- Cleanse and standardise data before the match process; this avoids the need for complex match algorithms and produces better match groups.
- Profile the data to determine the attributes best suited for matching: look for completeness, accuracy and consistency.
- Use at least one exact match attribute for every match rule.
- Too many fuzzy rules can produce too many potential matches, so they need to be tested and tuned before being applied in production.
- Test your match process on at least 50% of the production data volume to get representative results.
- A minimum of three match cycles should be conducted during testing to arrive at the best set of match rules.
- Keep reviewing the results of the match process in production and fine-tune it accordingly.
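The first best practice, cleansing and standardising before matching, can be sketched in a few lines of Python. This is a minimal illustration, assuming an address-style string and an invented synonym table; real standardisation would use a proper reference dictionary:

```python
import re

def standardise(value, synonyms=None):
    """Uppercase, strip punctuation, collapse whitespace, expand synonyms.

    The synonym table below is a hypothetical example, not a standard list.
    """
    synonyms = synonyms or {"ST": "STREET", "RD": "ROAD", "AVE": "AVENUE"}
    value = re.sub(r"[^\w\s]", " ", value.upper())       # drop punctuation
    tokens = [synonyms.get(t, t) for t in value.split()]  # expand synonyms
    return " ".join(tokens)                               # collapse whitespace
```

After this step, "42, Baker St." and "42 Baker STREET" become identical strings, so a simple exact comparison can replace a complex fuzzy algorithm.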
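The profiling advice above can also be made concrete. Here is a hedged sketch of a completeness check, assuming records are dictionaries; accuracy and consistency checks would need reference data and are omitted:

```python
def completeness(records, field):
    """Fraction of records where the given field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)
```

An attribute with low completeness (say, a phone number populated on 30% of records) is a poor choice for a match rule, however discriminating it is when present.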
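The "at least one exact match attribute per rule" practice can be sketched as follows. This is an illustrative rule only: the field names (`postcode`, `name`) and the 0.85 threshold are invented, and the fuzzy comparison uses Python's standard-library `difflib` rather than any particular engine's algorithm:

```python
from difflib import SequenceMatcher

def rule_matches(a, b):
    """One match rule: an exact attribute anchors it, fuzzy refines it."""
    if a["postcode"] != b["postcode"]:   # exact anchor attribute
        return False
    # Fuzzy comparison applied only within the exact anchor.
    name_sim = SequenceMatcher(None, a["name"].lower(), b["name"].lower()).ratio()
    return name_sim >= 0.85              # illustrative threshold
```

The exact anchor keeps the fuzzy comparison from pairing unrelated records that merely have similar names, which is exactly the kind of false positive the earlier section warns about.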
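Finally, the tuning advice for fuzzy rules can be supported with a small threshold sweep. This sketch assumes you have already scored your candidate pairs; it simply shows, for each candidate threshold, how many pairs would land in the manual-review queue:

```python
def potential_match_counts(scores, thresholds):
    """For each candidate threshold, count pairs at or above it.

    A threshold that floods the review queue with potential matches is a
    sign the fuzzy rule needs tightening before production.
    """
    return {t: sum(1 for s in scores if s >= t) for t in thresholds}
```

Running this across the three or more test match cycles recommended above gives a concrete basis for picking thresholds, instead of guessing them.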
I hope you found these tips useful and will apply some of them in your projects. I would also like to hear about other best practices you have adopted in your matching process; please share them in the comments section below.