Entity Resolution (ER) is a data management process that identifies, links, and merges records that refer to the same real-world entity, whether across several databases or within a single one. In B2C retail, for example at a travel-gear brand like Away.com, Entity Resolution is pivotal in consolidating customer data from diverse sources. This improves data quality and reliability and gives a more complete picture of each customer, which in turn supports better personalized marketing and customer service. Entity Resolution is a key part of a Master Data Management strategy.
When many records represent the same customers, it can be tricky to identify duplicates and merge them into a true 360-degree view of each customer. To address this, we can write code with Python libraries like Dedupe, a library for linking and deduplicating records.
Our fictitious customer table includes columns like first_name, last_name, city, country, internal_id, MAID, braze_id, and shopify_id, which are crucial for our entity resolution process.
Example setup for loading the dataset
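A minimal sketch of such a setup, assuming the customer table has been exported to CSV with the columns listed above. The inline sample rows, the load_records helper, and the choice of internal_id as the record key are assumptions made for illustration. Dedupe works on a dictionary that maps each record ID to a dictionary of field values, so that is the shape built here.

```python
import csv
import io

# Hypothetical sample standing in for a real export such as open("customers.csv")
SAMPLE_CSV = """internal_id,first_name,last_name,city,country,MAID,braze_id,shopify_id
101,Tejas,Manohar,New York,USA,A123,BCD1,XYZ2
102,Tej,Manohar,New York,USA,A123,BCD2,XYZ1
201,Alec,Haas,Los Angeles,USA,B456,EFG1,UVW2
202,Alex,Haas,Los Angeles,USA,B457,EFG2,UVW2
"""


def load_records(csv_file, id_column="internal_id"):
    """Read CSV rows into a {record_id: {field: value}} dictionary."""
    reader = csv.DictReader(csv_file)
    return {row[id_column]: dict(row) for row in reader}


records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), "records loaded")
```

With a real file, you would pass an open file handle instead of the io.StringIO wrapper.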
Dedupe requires a bit of preparation, including data cleaning and standardization, to make the matching process more effective. It involves defining the fields to be used for matching and possibly converting data into a format that Dedupe can efficiently process.
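As a rough sketch of that preparation, the snippet below normalizes field values and declares which fields to compare. The clean_value and clean_record helpers are hypothetical names. The field list uses the dictionary style of Dedupe 2.x; newer releases define fields with objects from dedupe.variables instead, so treat the exact format as an assumption and check the documentation for your installed version.

```python
import re


def clean_value(value):
    """Normalize one raw field value before matching."""
    if value is None:
        return None
    # Collapse repeated whitespace, trim, and lowercase
    value = re.sub(r"\s+", " ", str(value)).strip().lower()
    # Dedupe expects missing values as None, so map empty strings to None
    return value or None


def clean_record(record):
    """Apply clean_value to every field of one record."""
    return {field: clean_value(value) for field, value in record.items()}


# Fields to compare, in the dict style of Dedupe 2.x (an assumption,
# see the note above about newer versions).
FIELDS = [
    {"field": "first_name", "type": "String"},
    {"field": "last_name", "type": "String"},
    {"field": "city", "type": "String"},
    {"field": "country", "type": "String"},
    {"field": "MAID", "type": "Exact", "has missing": True},
    {"field": "braze_id", "type": "Exact", "has missing": True},
    {"field": "shopify_id", "type": "Exact", "has missing": True},
]

print(clean_record({"first_name": "  Tejas ", "city": "New   York", "MAID": ""}))
```

The platform IDs are declared as Exact here on the assumption that they either match exactly or not at all, while names and places use String so that small spelling differences still count as close.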
Unlike the recordlinkage Python library, Dedupe has to be trained, which means manually labeling a sample of record pairs as matches or non-matches to teach the model how to identify duplicates.
To do so, you label a sample of candidate pairs and then call the train() method on your Dedupe object.
Depending on your environment, the labeling step prompts you to validate matching records in your CLI or your notebook.
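The training step can be sketched roughly as follows. The train_deduper wrapper is a hypothetical helper, and it assumes records is a dictionary of record ID to field values and fields is a field definition in the format your installed Dedupe version expects. The import sits inside the function only so the helper can be defined before the library is installed.

```python
def train_deduper(records, fields):
    """Label candidate pairs interactively, then train a Dedupe model."""
    import dedupe  # third-party package: pip install dedupe

    deduper = dedupe.Dedupe(fields)
    # Sample candidate pairs from the data so there is something to label
    deduper.prepare_training(records)
    # Interactive prompt that asks whether each displayed pair is a match
    dedupe.console_label(deduper)
    # Learn blocking rules and the match classifier from the labeled pairs
    deduper.train()
    return deduper
```

Labeling is interactive, so this function blocks until you finish answering the prompts.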
After training, Dedupe can automatically identify duplicates within the dataset.
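As a hedged sketch of that step, a trained Dedupe object exposes a partition() method that returns clusters as pairs of record IDs and confidence scores. The cluster_rows helper below is a hypothetical convenience that flattens this output into one row per record, and the 0.5 threshold is only a starting point to tune for your data.

```python
def cluster_rows(deduper, records, threshold=0.5):
    """Flatten Dedupe clusters into one row per record.

    Each cluster is a pair: (record IDs in the cluster, one score per record).
    """
    rows = []
    clusters = deduper.partition(records, threshold)
    for cluster_id, (record_ids, scores) in enumerate(clusters, start=1):
        for record_id, score in zip(record_ids, scores):
            rows.append(
                {
                    "cluster_id": cluster_id,
                    "record_id": record_id,
                    "confidence": float(score),
                }
            )
    return rows
```

Records that share a cluster_id are the ones the model considers the same customer.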
Dedupe automatically creates blocks of records that are more likely to be duplicates, significantly reducing the number of comparisons needed. To create blocks, you can run this simple function
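To make the idea of blocking concrete, here is a minimal pure-Python sketch. It is not Dedupe's own blocking code, since Dedupe learns its blocking rules during training. The blocking key used here (last name plus city) and the helper names are assumptions chosen for illustration.

```python
from collections import defaultdict
from itertools import combinations


def block_records(records, key_fields=("last_name", "city")):
    """Group record IDs by a normalized blocking key."""
    blocks = defaultdict(list)
    for record_id, record in records.items():
        key = tuple((record.get(f) or "").strip().lower() for f in key_fields)
        blocks[key].append(record_id)
    return dict(blocks)


def candidate_pairs(blocks):
    """Only records that share a block are compared with each other."""
    pairs = []
    for ids in blocks.values():
        pairs.extend(combinations(sorted(ids), 2))
    return pairs


sample = {
    101: {"last_name": "Manohar", "city": "New York"},
    102: {"last_name": "Manohar", "city": "New York"},
    201: {"last_name": "Haas", "city": "Los Angeles"},
    202: {"last_name": "Haas", "city": "Los Angeles"},
    301: {"last_name": "Johnson", "city": "Chicago"},
    401: {"last_name": "White", "city": "Miami"},
    501: {"last_name": "Brown", "city": "Boston"},
}

pairs = candidate_pairs(block_records(sample))
total_pairs = len(sample) * (len(sample) - 1) // 2
print(f"{len(pairs)} candidate pairs instead of {total_pairs}")
# prints "2 candidate pairs instead of 21"
```

The saving grows quickly with table size, because the number of all-against-all comparisons grows quadratically with the number of records.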
After identifying duplicates, you can merge them or take other actions based on your business rules. This is where you will have to write more code to handle your merging logic. Should you keep the values from the most recently updated record? From the first record created? Or from the one attached to the customer? It is up to you.
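One possible merge rule, sketched under stated assumptions: the most recently updated record wins, and older records only fill in fields the newer one leaves empty. The updated_at column, the sample dates, and the empty shopify_id are hypothetical, since the example table has no timestamp column.

```python
from datetime import date


def merge_cluster(cluster_records, updated_field="updated_at"):
    """Build one merged record: newest values win, older records fill gaps."""
    newest_first = sorted(
        cluster_records, key=lambda r: r[updated_field], reverse=True
    )
    merged = {}
    for record in newest_first:
        for field, value in record.items():
            # Only write a field that is still missing or empty
            if merged.get(field) in (None, ""):
                merged[field] = value
    return merged


cluster = [
    {"internal_id": 101, "first_name": "Tejas", "braze_id": "BCD1",
     "shopify_id": "XYZ2", "updated_at": date(2024, 1, 5)},
    {"internal_id": 102, "first_name": "Tej", "braze_id": "BCD2",
     "shopify_id": None, "updated_at": date(2024, 3, 1)},
]

golden = merge_cluster(cluster)
print(golden)
```

Here the merged record takes its name and braze_id from record 102, the newer one, and its shopify_id from record 101. A different business rule, such as preferring the oldest record, only requires changing the sort order.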
This process consolidates customer records, enhancing data quality and enabling a more personalized customer approach.
| cluster_id | internal_id | first_name | last_name | city | country | MAID | braze_id | shopify_id |
|---|---|---|---|---|---|---|---|---|
| 1 | 101 | Tejas | Manohar | New York | USA | A123 | BCD1 | XYZ2 |
| 1 | 102 | Tej | Manohar | New York | USA | A123 | BCD2 | XYZ1 |
| 2 | 201 | Alec | Haas | Los Angeles | USA | B456 | EFG1 | UVW2 |
| 2 | 202 | Alex | Haas | Los Angeles | USA | B457 | EFG2 | UVW2 |
| 3 | 301 | Alice | Johnson | Chicago | USA | C789 | HIJ1 | STU3 |
| 4 | 401 | Bob | White | Miami | USA | D012 | KLM1 | RST4 |
| 5 | 501 | Charlie | Brown | Boston | USA | E345 | NOP1 | QRS5 |
The output from Dedupe includes clusters of records identified as duplicates with confidence scores. These clusters allow us to:
Leveraging the Dedupe library for entity resolution enables B2C retail companies to enhance their customer data's accuracy and utility. By identifying and merging duplicate records, companies can ensure a consistent and personalized customer experience, laying the groundwork for more informed business strategies and improved data governance.
If you don't want to implement an open-source solution like Dedupe, contact us to see a demo of Census Entity Resolution. We handle the heavy lifting, and you get clean, deduped, and merged data ready to be synced.