Breaking your database into smaller parts will enable you to clean out your duplicates more effectively and get it right. This is particularly true for big deduplications, but recommended for any deduplication. Keep reading if you want to know how…

-One bite at a time

When encountering large challenges, you cannot always do as you have always done.
You cannot cook an elephant like a turkey for Thanksgiving; the oven isn't big enough. As a barbecue?
-Well, you will need a very big grill or fireplace, and while some parts of the meat might be tender and tasty, other parts will be burned or still raw.

The analogy between eating an elephant and performing data operations on a large database is obvious.

Big Deduplications, or How to Eat an Elephant

If you try to run a deduplication on a very large set of data, you risk getting the work done poorly and having to wait a very long time before it is "well done".

Now why is that? Isn’t deduplication just a question of finding dupes and merging them?

Yes, but the question is not just how to find them. It is ALSO about how to merge them and there you cannot always apply the same process.

Big Deduplications and why you should break them up

When deduping a database, you might need to consider what you define as a dupe and how you want to treat the dupes. Let me give a few examples:

  1. Clients and Prospects: Deduping a list of companies where some are existing clients and some are prospects might return combinations where ClientA is a dupe of ClientB, ProspectC is a dupe of ProspectD, and ClientE is a dupe of ProspectF. In this case you might want to treat the Client-Client dupes more carefully than the Prospect-Prospect dupes, and for the Client-Prospect dupes you will certainly want to make sure that the client record takes priority over the prospect record (see the sketch after this list).
  2. What do you really consider a dupe? What if AccountA, AccountB and AccountC are identified as dupes, but AccountB doesn't have an address, and AccountC has an address different from AccountA's? Are these then dupes from your business point of view? Do you sell to the individual sites/divisions of the accounts or only to one central company per country? Couldn't AccountB be the parent of AccountA? Or AccountA and AccountB be two sister companies in a large corporation?
  3. Account Ownership: What if AccountX is an account managed by your AccountManagerA and AccountY is an account created by AccountManagerB? If AccountX and AccountY are the same, which record do you want to keep, i.e. who is the appropriate owner?
  4. Empty match fields and common values: In contact deduplications there may be multiple records for John Smith, so what other criteria, e.g. email, do you want to use to establish the likelihood that two John Smith records really are dupes? And what if some of your records don't have an email?
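To make examples 1 and 3 above concrete, here is a minimal sketch in plain Python of how such a merge rule could pick the surviving record. It is not DataTrim functionality, and the record fields (record_type, owner, last_activity_days) are made-up placeholders for illustration; recency is used as a simple stand-in for deciding the appropriate owner:

```python
from dataclasses import dataclass

# Hypothetical, simplified record; real data would come from your CRM.
@dataclass
class Account:
    name: str
    record_type: str        # "Client" or "Prospect"
    owner: str
    last_activity_days: int

# Lower number = higher survival priority when two records are merged.
TYPE_PRIORITY = {"Client": 0, "Prospect": 1}

def pick_surviving_record(a: Account, b: Account) -> Account:
    """Pick which of two duplicate accounts should survive a merge.

    Rule of thumb sketched from the article: the Client record always wins
    over a Prospect record; within the same type, keep the record with the
    most recent activity (a stand-in for 'who is the appropriate owner').
    """
    if TYPE_PRIORITY[a.record_type] != TYPE_PRIORITY[b.record_type]:
        return a if TYPE_PRIORITY[a.record_type] < TYPE_PRIORITY[b.record_type] else b
    return a if a.last_activity_days <= b.last_activity_days else b

# Example: ClientE is a dupe of ProspectF -> the client record survives.
client_e = Account("Acme Ltd", "Client", "AccountManagerA", 30)
prospect_f = Account("ACME Limited", "Prospect", "AccountManagerB", 5)
print(pick_surviving_record(client_e, prospect_f).record_type)  # Client
```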

In other words, when you start running big deduplications you will naturally find certain characteristics in your data which indicate a logical split.
In the following we will describe some of the most common approaches.

When to chunk it up

There are 3 main reasons why you might want to break your database into smaller parts:

  1. The size and complexity of the match result: As described above, you might want to apply different processes to the end result depending on what type of data you are comparing. So if you know that all the results from a certain deduplication consist of e.g. Client-Prospect relations, you can apply the same merge process to them all. But if the 3 different relation types are all mixed up inside one set of results, you might have to go through the dupes one by one in order to make sure you apply the correct merge action.
  2. How the post-processing of the records is managed: Depending on how your company is organized, you may envision that one person will be centrally responsible for the deduplication, including the review and merge process. With DataTrim Dupe Alerts you have the possibility to share this with other people in the organization, and so it might be logical to break your database into segments based on who or which business unit should review and merge the match result.
  3. Technological limitations: Matching large volumes of data is very time-consuming and demands extensive computer resources. As DataTrim Dupe Alerts is fully integrated into salesforce, our solution has to comply with the governor rules defined for the Apex platform on which it is built. As an example, our system includes a 'safety-net' which will generate a timeout if a deduplication hasn't completed after 2 hours of processing.

Doing the Math: To illustrate the theoretical amount of work performed by a matching process, a comparison of 10 records against themselves requires 45 comparisons. Other values:
1.000 records: 499.500 comparisons
10.000 records: 49.995.000 comparisons
100.000 records: 4.999.950.000 comparisons
1.000.000 records: 499.999.500.000 comparisons
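For reference, these figures follow from the simple formula n*(n-1)/2 for matching a dataset against itself (and n*m when matching two separate datasets). A minimal sketch in plain Python, nothing DataTrim-specific, just to make the arithmetic easy to reproduce:

```python
def self_match_comparisons(n: int) -> int:
    """Comparisons needed to match n records against themselves: n*(n-1)/2."""
    return n * (n - 1) // 2

def cross_match_comparisons(secondary: int, master: int) -> int:
    """Comparisons needed to match a secondary dataset against a separate master dataset."""
    return secondary * master

for n in (10, 1_000, 10_000, 100_000, 1_000_000):
    print(f"{n:>9,} records: {self_match_comparisons(n):>15,} comparisons")
```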

Chunking it up

So how do we chunk it up? As with cutting up an animal for cooking, there are certain traditions and preferences which may of course vary across different cultures and different animals, so if you still think there is only one way, think again!

Your database and the data it contains may, as described above, be very specific to your business, and so the way to chunk up your data might have to be unique to you, but here are a few examples of what others have done.

By Type, Ownership or Territory

You might be able to categorize your records by type, ownership of the record or territory and use this to perform deduplications on smaller subsets of data.

As for the example above, you could divide your account deduplication into 3: Client-Client, Client-Prospect and Prospect-Prospect.

Breaking up by sales territory or geography is another effective way of reducing the size of the matching and making the review and merge process more manageable.

This type of segmentation is very simple to apply using filters on a record basis and is the most commonly used approach.
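As a rough illustration, segmentation by type boils down to building three smaller matching jobs from simple record filters. The sketch below is plain Python (not a DataTrim feature), and the Type and Territory fields are placeholders for whatever your org actually uses:

```python
# Placeholder records; in practice these would come from a report/export or a
# query against your CRM. The field names (Type, Territory) are examples only.
accounts = [
    {"Name": "Acme Ltd", "Type": "Client", "Territory": "EMEA"},
    {"Name": "ACME Limited", "Type": "Prospect", "Territory": "EMEA"},
    {"Name": "Globex", "Type": "Client", "Territory": "APAC"},
]

clients = [a for a in accounts if a["Type"] == "Client"]
prospects = [a for a in accounts if a["Type"] == "Prospect"]

# Three smaller deduplication jobs instead of one big mixed one:
jobs = [
    ("Client-Client", clients, clients),          # self-match
    ("Prospect-Prospect", prospects, prospects),  # self-match
    ("Client-Prospect", prospects, clients),      # secondary vs. master; clients take priority
]

for name, secondary, master in jobs:
    print(f"{name}: {len(secondary)} secondary vs {len(master)} master records")
```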

By Priority or Data Quality

Deduping a large database isn’t just about cleaning up the database, but also about the amount of time you put into cleaning it up, so another approach is to try to categorize the data into different levels of ‘data quality’ and focus your deduplication on the ‘best’ data first and then work your way down.

There are many ways of identifying the quality of a record. Everything from looking at how well the record is populated across certain key fields, to classifications by source, interest or how old the records are (no activity/reply to campaigns over a long period), will often do.

So if you, as an example, have a very large lead database and a large portion of these records have been 'inactive' for months or even years, matching new records against records which have been added/updated within the last N months may be more important than matching a new lead against old, inactive leads (12, 18, 24 months old).

By splitting by age or priority you will be able to prioritize your effort on the most valuable records first, and work on the other records later.

Once you have your way of segmenting by quality the deduplication can begin and you can concentrate on making the good part better, and eventually throw away a serious amount of bad data.
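A minimal sketch of this idea in plain Python (the field names and the 18-month cut-off are arbitrary examples, not DataTrim settings): score each lead by completeness and recency, then dedupe the high-quality slice first and leave the rest for a later pass.

```python
from datetime import date, timedelta

def quality_score(lead: dict, today: date = date.today()) -> int:
    """Crude quality score: +1 per populated key field, +2 if touched within ~18 months."""
    key_fields = ("Email", "Company", "Phone")   # example key fields
    score = sum(1 for f in key_fields if lead.get(f))
    last_activity = lead.get("LastActivityDate")
    if last_activity and (today - last_activity) < timedelta(days=18 * 30):
        score += 2
    return score

leads = [
    {"Email": "john@acme.com", "Company": "Acme", "Phone": "555-1234",
     "LastActivityDate": date.today() - timedelta(days=60)},
    {"Email": None, "Company": "Globex", "Phone": None,
     "LastActivityDate": date.today() - timedelta(days=900)},
]

# Dedupe the best slice first, leave the old/incomplete records for a later pass.
high_priority = [l for l in leads if quality_score(l) >= 3]
low_priority = [l for l in leads if quality_score(l) < 3]
print(len(high_priority), "high-priority leads,", len(low_priority), "to revisit later")
```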

By Likeliness

Although you may break your database into segments as mentioned above, one day you might want to match records from different segments against each other, and you will then still end up with a very large set of data. In this context you will need to make chunks within which you are more likely to find duplicates than in others, and so you might need to understand a little more about the way your deduplication solution works.

First of all, you will need to consider how you will define dupes across different segments and make sure that the relevant information exists on the records you include in the deduplication. As an example, if you want to base the deduplication on e.g. the email (VAT number, external Id, etc.), you might as well make sure you are only including records which have these fields populated, i.e. exclude records without an email.

Another, more creative approach, which is especially relevant for large contact databases, is to use more or less sophisticated match keys (a string which can be used to identify potential duplicates).

If you break the database into smaller chunks using the name of the contacts, you can e.g. use the first character of the last name to create filters, including a, b and c in one chunk, d, e and f in the next, etc.

A better approach is to consider grouping the names into logical groups where the first characters may sound similar, like c and s. If your database mostly contains English names, you might consider grouping by soundex.
We have successfully used the following grouping: A+H, B+F+P+V, C+S+Z, D+K+Q, G+I+J, L+X, M+N, U+W+Y+E+R
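A minimal sketch of this kind of chunking in plain Python (the grouping is the one listed above; records without a last name, or with first letters not covered by the grouping, simply get buckets of their own):

```python
# Chunks of 'sound-alike' first letters, as listed above.
LETTER_GROUPS = ["AH", "BFPV", "CSZ", "DKQ", "GIJ", "LX", "MN", "UWYER"]

def chunk_for(last_name: str) -> str:
    """Return the chunk label for a contact based on the first letter of the last name."""
    if not last_name:
        return "EMPTY"      # records without a last name get their own bucket
    first = last_name[0].upper()
    for group in LETTER_GROUPS:
        if first in group:
            return group
    return "OTHER"          # letters not covered by the grouping above (e.g. O, T)

contacts = ["Smith", "Schmidt", "Brown", "", "Taylor"]
for name in contacts:
    print(f"{name or '<no last name>'}: chunk {chunk_for(name)}")
```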

DataTrim Dupe Alerts

Guidelines for large deduplications

As indicated above, the need to split a database into smaller subsets can have many causes, not just big deduplications, and so it is very difficult to give exact rules for when to do it and when not to.

Recommendations: Based on our experience (from project work and from working with existing clients) we do, though, have some general rules of thumb which may help you define how and when to split your data.

For single dataset deduplications (Secondary = Master), consider splitting the dataset into batches of less than 500.000 records (optimally 250-300K if possible).

For double dataset deduplications (Secondary <> Master), consider splitting the datasets if the sum of the 2 datasets is larger than 600.000 records, or if the ratio between the Secondary and the Master dataset is closer than about 1:4-1:3. Examples: 50K against 550K is OK; 150K against 450K might be close to the limit (using the default settings).

The use of Advanced Parameters like the ExactMatchField might allow you to increase the volumes, as the matching time will be shorter (fewer comparisons to do).
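If you want to sanity-check a planned job against these rules of thumb, a small helper like the one below may be handy. It is a plain-Python sketch of the figures above, not an official DataTrim tool, and the thresholds would shift if you use e.g. the ExactMatchField:

```python
from typing import Optional

def should_split(master: int, secondary: Optional[int] = None) -> bool:
    """Rough check of the rule-of-thumb limits above (default settings assumed).

    Single-dataset job (secondary is None): split above ~500.000 records.
    Two-dataset job: split if the combined volume exceeds ~600.000 records,
    or if the secondary dataset reaches about a third of the master or more.
    """
    if secondary is None:
        return master > 500_000
    return (master + secondary > 600_000) or (secondary * 3 >= master)

print(should_split(300_000))           # False: single dataset, within the 500K guideline
print(should_split(550_000, 50_000))   # False: 600K in total, but a very skewed ratio (1:11)
print(should_split(450_000, 150_000))  # True: 1:3 ratio, close to the limit
```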

Happy Hunting with your Big Deduplications!

The Challenge: The salesforce platform governor rules and the related timeout 'safety-net' in DataTrim Dupe Alerts impose 2 limitations: 1) the amount of records and the time it takes to read the data from the salesforce database, and 2) the amount of time it takes to perform the matching of the data.

Reading data from salesforce: Matching 500.000 records against themselves only involves reading the 500.000 records once (the Master dataset and the Secondary dataset are identical), and matching 250.000 Client records against 250.000 Prospect records also involves reading 500.000 records.

Matching records: When matching e.g. 100.000 records against themselves there are (in theory) about 5.000.000.000 comparisons to do. If comparing 100.000 records against another 100.000 records there are a total of 10.000.000.000 comparisons to do, and for 10.000 records against another 100.000 records there are a total of 1.000.000.000 comparisons (10 times fewer).

Generating results/classification of potential dupes: Logically, if there are many records within a dataset that meet the matching criteria, the matching and processing will take more time.
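To illustrate how differently the first two cost drivers scale, here is a small plain-Python sketch of the scenarios above: reading grows linearly with the number of records, while the number of comparisons grows roughly with the square:

```python
def records_read(master: int, secondary: int, same_dataset: bool) -> int:
    """Records read from salesforce: each dataset is read once."""
    return master if same_dataset else master + secondary

def comparisons(master: int, secondary: int, same_dataset: bool) -> int:
    """Pairwise comparisons: roughly quadratic, unlike the linear read cost."""
    return master * (master - 1) // 2 if same_dataset else master * secondary

# The two scenarios from the paragraph on reading data above:
for label, m, s, same in [("500K against themselves", 500_000, 500_000, True),
                          ("250K clients vs 250K prospects", 250_000, 250_000, False)]:
    print(f"{label}: {records_read(m, s, same):,} reads, {comparisons(m, s, same):,} comparisons")
```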

Don't hesitate to reach out to our support team if you run into any questions regarding your big deduplications or how to use the filters to segment data for more effective deduplications.
