
One bite at a time
When you face a large challenge, you cannot always do what you have always done. You cannot cook an elephant like a Thanksgiving turkey: the oven isn’t big enough. As barbecue? Well, you would need a very big grill or fireplace, and while some parts of the meat might be tender and tasty, other parts would be burned or still raw.
The same goes for big deduplications.
The analogy between eating an elephant and performing data operations on a large database is obvious.
If you try to run a deduplication on a very large database, you risk getting the work done poorly and having to wait a very long time before it is “well done”.
Chunking it up
So how do we chunk it up? As with cutting up an animal for cooking, there are certain traditions and preferences, which may of course vary across different cultures and different animals, so if you think there is only one way, think again!
Your database and the data it contains may be very specific to your business, so the way you chunk up your data might have to be unique to you, but here are a few examples of what others have done.
By Data Quality
Deduping a large database isn’t just about cleaning up the database, but also about the amount of time you put into cleaning it up. One approach is therefore to categorize the data into different levels of data quality, focus your deduplication on the high-quality data first, and then work your way down.
There are many ways of identifying the quality of a record. Anything will do, from looking at how well the record is populated across certain key fields to classifying by source, by interest, or by how old the records are (for example, no activity or reply to campaigns over a long period).
Once you have your way of categorizing by quality, the deduplication can begin: you can concentrate on making the good part better, and eventually throw away a serious amount of bad data.
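As an illustration of the "how well the record is populated" idea, here is a minimal sketch that scores each record by the share of key fields that are filled in and maps the score to a quality tier. The field names, the thresholds and the tier names are assumptions made up for this example; substitute whatever matters in your own database.

```python
# Hypothetical key fields; pick the ones that matter for your own matching.
KEY_FIELDS = ("first_name", "last_name", "email", "phone", "company")


def completeness(record):
    """Return the fraction of key fields that hold a non-blank value."""
    filled = sum(1 for f in KEY_FIELDS if str(record.get(f) or "").strip())
    return filled / len(KEY_FIELDS)


def quality_tier(record, high=0.8, low=0.4):
    """Map a completeness score to an illustrative tier name.

    The thresholds are arbitrary assumptions, not recommended values.
    """
    score = completeness(record)
    if score >= high:
        return "high"
    if score >= low:
        return "medium"
    return "low"


records = [
    {"first_name": "Ann", "last_name": "Lee", "email": "ann@example.com",
     "phone": "555-0100", "company": "Acme"},
    {"first_name": "Bob", "last_name": "", "email": None,
     "phone": "555-0101", "company": "Acme"},
    {"last_name": "Cho"},
]

# One tier per record, in the same order as the input list.
tiers = [quality_tier(r) for r in records]
```

You could then run the deduplication on the "high" tier first and work your way down, as described above.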
By Type, Ownership or Territory
As with data quality, you can categorize your records by type, by ownership of the record or by territory, and use this to perform deduplications on smaller subsets of data.
Such a categorization is not only a simple way to divide your database. In many cases, using the record type, ownership or territory may be a mandatory first approach, because merging records from different categories (that is, processing the result of the deduplication) can be quite complex and can affect business processes, account and territory management models, and links into back-office systems.
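A split like this can be sketched in a few lines. The example below groups records into chunks by one or more category fields, so that each chunk can be deduplicated on its own; the field names `record_type`, `owner` and `territory` are hypothetical and stand in for whatever your own data model uses.

```python
from collections import defaultdict


def chunk_by(records, *fields):
    """Group records into chunks keyed by the values of the given fields."""
    chunks = defaultdict(list)
    for record in records:
        key = tuple(record.get(f) for f in fields)
        chunks[key].append(record)
    return dict(chunks)


records = [
    {"name": "Ann Lee", "record_type": "Customer", "owner": "kim", "territory": "EMEA"},
    {"name": "Anne Lee", "record_type": "Customer", "owner": "kim", "territory": "EMEA"},
    {"name": "Bob Roy", "record_type": "Prospect", "owner": "kim", "territory": "EMEA"},
    {"name": "Cy Diaz", "record_type": "Customer", "owner": "lou", "territory": "APAC"},
]

by_territory = chunk_by(records, "territory")
by_type_and_owner = chunk_by(records, "record_type", "owner")
```

Each value in the resulting dictionary is a smaller list that never mixes categories, which keeps the later merge step within one record type, owner or territory.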
By Likeliness
Although you may break your database into categories as mentioned above, one day you might want to match records from different categories against each other, and you will then still end up with a very large set of data. In this context you need to make chunks within which you are more likely to find duplicates than across them, so you might need to understand a little more about how your deduplication solution works.
First of all, you will of course need to consider how you define dupes across different categories, making sure that the relevant fields exist on the records you include in the deduplication. For example, if you use the email address in your process for identifying duplicates, you might as well exclude records without emails.
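The email example can be sketched as a simple pre-filter: keep only the records on which every field used for matching is actually populated. This is a minimal illustration with made-up field names, not a description of how any particular deduplication tool applies its filters.

```python
def has_value(record, field):
    """True if the field is present and not blank."""
    return bool(str(record.get(field) or "").strip())


def eligible(records, required_fields):
    """Keep only the records where every required matching field is filled."""
    return [r for r in records
            if all(has_value(r, f) for f in required_fields)]


records = [
    {"name": "Ann Lee", "email": "ann@example.com"},
    {"name": "Bob Roy", "email": ""},
    {"name": "Cy Diaz"},
    {"name": "Di Wu", "email": "  di@example.com "},
]

# Only records with an email take part in an email-based deduplication.
candidates = eligible(records, ["email"])
```

Here only the two records with an email address remain as candidates, so the matching step works on a smaller set without losing any possible email match.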
By Sound-A-Like
If you break the database into smaller chunks using the names of the contacts, you can of course use, for example, the first character of the last name to create filters, including a, b and c in one chunk, d, e and f in the next, and so on.
A better approach is to group the names into logical groups in which the first characters may sound similar, such as c and s. If your database mostly contains English names, you might consider grouping by Soundex.
We have successfully used the following grouping: A+H, B+F+P+V, C+S+Z, D+K+Q, G+I+J, L+X, M+N, U+W+Y+E+R
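As a rough sketch, the grouping above can be turned into a lookup from the first letter of the last name to a chunk label. The function name and the catch-all "OTHER" chunk are assumptions for this example: names that are blank, or that start with a character not listed in the grouping, have to go somewhere, and the grouping as given does not say where.

```python
# The letter groups exactly as listed above.
GROUPS = ["AH", "BFPV", "CSZ", "DKQ", "GIJ", "LX", "MN", "UWYER"]

# Map each letter to a readable chunk label such as "C+S+Z".
LETTER_TO_CHUNK = {letter: "+".join(group)
                   for group in GROUPS
                   for letter in group}


def chunk_label(last_name):
    """Return the chunk label for a last name.

    Blank names and first characters outside the listed groups fall
    into "OTHER", which is an assumption made for this sketch.
    """
    name = (last_name or "").strip().upper()
    if not name:
        return "OTHER"
    return LETTER_TO_CHUNK.get(name[0], "OTHER")


labels = {n: chunk_label(n) for n in ["Smith", "Zimmer", "Hansen", "philips"]}
```

With this lookup, "Smith" and "Zimmer" land in the same chunk, so a sound-alike pair that starts with S and Z is still compared, which a plain alphabetical split would miss.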
With DataTrim Dupe Alerts

In the next article in this series (July 2010), you will read more about setting up the alerts using filters in DataTrim Dupe Alerts.