Entity Resolution in Python with the Dedupe Package

Daisy McLogan
27 February 2024

Entity Resolution (ER) is a vital data management process that identifies, links, and merges records that refer to the same real-world entity, whether across multiple databases or within a single one. In B2C retail, for example at a travel-gear brand like Away.com, Entity Resolution is pivotal in consolidating customer data from diverse sources. It improves data quality and reliability and gives a more complete understanding of each customer, which in turn supports better personalized marketing and customer service. Entity Resolution is a key part of a Master Data Management strategy.

The Typical Entity Resolution Problem

When we have many records that represent our customers, it can be tricky to identify duplicates and merge them into a true 360-degree view of each customer. To address this, we can write code using Python libraries like Dedupe, a library for linking and deduplicating records.

Implementing Dedupe in Python

Step 1: Understanding the Data

Our fictitious customer table includes columns like first_name, last_name, city, country, internal_id, MAID, braze_id, and shopify_id, which are crucial for our entity resolution process.

Example setup for loading the dataset

our customer data
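As a minimal sketch, here is one way to load the customer table into the structure Dedupe works with: a dictionary keyed by record ID whose values are dictionaries of field values. The inline CSV reuses rows from the sample table in this article; in practice you would read from a file or a warehouse query. The `load_customers` helper is an illustrative name, not part of Dedupe.

```python
import csv
import io

# Sample rows taken from the example table in this article.
SAMPLE_CSV = """internal_id,first_name,last_name,city,country,MAID,braze_id,shopify_id
101,Tejas,Manohar,New York,USA,A123,BCD1,XYZ2
102,Tej,Manohar,New York,USA,A123,BCD2,XYZ1
201,Alec,Haas,Los Angeles,USA,B456,EFG1,UVW2
202,Alex,Haas,Los Angeles,USA,B457,EFG2,UVW2
301,Alice,Johnson,Chicago,USA,C789,HIJ1,STU3
"""


def load_customers(file_obj):
    """Return {internal_id: row_dict} from a CSV file-like object."""
    reader = csv.DictReader(file_obj)
    return {row["internal_id"]: dict(row) for row in reader}


customers = load_customers(io.StringIO(SAMPLE_CSV))
print(len(customers), "records loaded")
print(customers["102"]["first_name"])
```

Keying the dictionary by internal_id means the clusters Dedupe returns later can be traced straight back to rows in the source table.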

Step 2: Preparing Data with Dedupe

Dedupe requires a bit of preparation, including data cleaning and standardization, to make the matching process more effective. It involves defining the fields to be used for matching and possibly converting data into a format that Dedupe can efficiently process.
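The cleaning step might look like the sketch below, assuming each record is a plain dictionary of field values. It collapses whitespace, lowercases text, and turns empty strings into None, which is how Dedupe expects missing values to be represented. `clean_value` and `clean_record` are illustrative helpers, not Dedupe functions, and you may prefer to leave identifier columns untouched.

```python
import re


def clean_value(value):
    """Normalize one field: collapse whitespace, lowercase, empty -> None."""
    if value is None:
        return None
    value = re.sub(r"\s+", " ", str(value)).strip().lower()
    return value or None


def clean_record(record):
    """Apply clean_value to every field of a record."""
    return {field: clean_value(value) for field, value in record.items()}


raw = {"first_name": "  Tejas ", "last_name": "MANOHAR", "city": "New  York", "MAID": ""}
print(clean_record(raw))
# {'first_name': 'tejas', 'last_name': 'manohar', 'city': 'new york', 'MAID': None}
```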

Setup Dedupe.io
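A hedged sketch of the setup step: you tell Dedupe which fields to compare and how. The choice of fields and comparison types below is an assumption made for illustration, and the dictionary style shown is the one used by dedupe 2.x (dedupe 3.x uses variable objects such as dedupe.variables.String("first_name") instead). `build_deduper` is a hypothetical helper name.

```python
# Field definitions in the dictionary style used by dedupe 2.x.
# Which fields to include, and with which types, is an illustrative assumption.
FIELDS = [
    {"field": "first_name", "type": "String"},
    {"field": "last_name", "type": "String"},
    {"field": "city", "type": "String", "has_missing": True},
    {"field": "country", "type": "String", "has_missing": True},
    {"field": "MAID", "type": "Exact", "has_missing": True},
]


def build_deduper(fields=FIELDS):
    """Create a trainable Dedupe object (requires `pip install dedupe`)."""
    import dedupe  # imported lazily so the field list can be inspected without it

    return dedupe.Dedupe(fields)
```

String fields are compared with a fuzzy string distance, which suits names like "Tej" and "Tejas", while Exact only checks equality, which suits identifiers.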

Step 3: Training Dedupe

Unlike the recordlinkage Python library, Dedupe must be trained before it can match records: you manually label a sample of record pairs as matches or non-matches to teach the model how to identify duplicates.

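To do so, you call prepare_training() on your data, label candidate pairs with dedupe.console_label(), and then run train().

Depending on your environment, the labeling step will prompt you to validate matching records in your CLI or your notebook (see example below).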

Manual train dedupe.io
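The training flow can be sketched as follows, assuming `deduper` is a dedupe.Dedupe object and `data` is a dictionary of records keyed by record ID. The `train_deduper` wrapper and the training.json path are illustrative choices; the calls inside it are the library's labeling and training methods.

```python
def train_deduper(deduper, data, training_path="training.json"):
    """Label pairs interactively, train, and save the labels for reuse."""
    import dedupe  # requires `pip install dedupe`

    # Sample candidate pairs from the data for labeling.
    deduper.prepare_training(data)
    # Interactive prompt: you mark each pair as a match, a non-match, or unsure.
    dedupe.console_label(deduper)
    # Fit the classifier and learn blocking rules from the labeled pairs.
    deduper.train()
    # Persist the labels so later runs do not start from scratch.
    with open(training_path, "w") as f:
        deduper.write_training(f)
    return deduper
```

Saving the labeled pairs is worth the extra lines: labeling is the only manual part of the process, and you will not want to repeat it every time the script runs.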

After training, Dedupe can automatically identify duplicates within the dataset.

Step 4: Blocking with Dedupe

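Dedupe automatically groups records that are likely to be duplicates into blocks and only compares records within the same block, which significantly reduces the number of comparisons needed. The blocking rules are learned during training, so there is no separate blocking step to code: calling the partition() method on the trained object blocks, scores, and clusters the records in one go.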
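To see why blocking matters, here is a rough illustration that counts comparisons with and without it. This is not Dedupe's own blocking code: it uses a hand-written blocking key (exact city), whereas Dedupe learns its blocking rules from your labeled pairs. The records come from the sample table in this article.

```python
from collections import defaultdict
from itertools import combinations

# (internal_id, city) pairs from the sample table in this article.
records = [
    ("101", "New York"), ("102", "New York"),
    ("201", "Los Angeles"), ("202", "Los Angeles"),
    ("301", "Chicago"), ("401", "Miami"), ("501", "Boston"),
]

# Without blocking, every record is compared with every other record.
all_pairs = list(combinations([rid for rid, _ in records], 2))

# With a blocking key, only records that share the key are compared.
blocks = defaultdict(list)
for rid, city in records:
    blocks[city].append(rid)
blocked_pairs = [pair for ids in blocks.values() for pair in combinations(ids, 2)]

print(len(all_pairs))      # 21
print(len(blocked_pairs))  # 2
```

With seven records the saving is small, but the number of unblocked pairs grows quadratically, so on a real customer table blocking is what keeps the job feasible.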

Cluster in Dedupe.io
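A minimal sketch of the clustering call, assuming `deduper` has already been trained. partition() returns one (record_ids, scores) pair per cluster, and the hypothetical `cluster_records` helper below flattens that into a lookup from record ID to cluster ID and confidence. The 0.5 threshold is only a starting point to tune.

```python
def cluster_records(deduper, data, threshold=0.5):
    """Return {record_id: (cluster_id, confidence)} from a trained deduper."""
    # partition() yields one (record_ids, scores) pair per cluster.
    clustered_dupes = deduper.partition(data, threshold)
    membership = {}
    for cluster_id, (record_ids, scores) in enumerate(clustered_dupes, start=1):
        for record_id, score in zip(record_ids, scores):
            membership[record_id] = (cluster_id, float(score))
    return membership
```

Raising the threshold favors precision (fewer false merges), while lowering it favors recall (fewer missed duplicates).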

Step 5: Resolving Duplicates

After identifying duplicates, you can merge them or take other actions based on your business rules. This is where you will have to write more code to handle your merging logic. Should you keep the values from the most recently updated record? From the record that was created first? From the one already attached to the customer? It is up to you.

This process consolidates customer records, enhancing data quality and enabling a more personalized customer approach.
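One possible merging rule, sketched under stated assumptions: descriptive fields take the longest non-empty value in the cluster, and identifier fields keep every distinct value so that no link to a source system is lost. Both the rule and the `merge_cluster` helper are illustrative; your business rules may call for something different, such as preferring the most recently updated record.

```python
ID_FIELDS = ("internal_id", "MAID", "braze_id", "shopify_id")


def merge_cluster(records):
    """Merge duplicate records into a single consolidated record."""
    merged = {}
    for field in records[0]:
        values = [r[field] for r in records if r.get(field)]
        if field in ID_FIELDS:
            # Keep every distinct identifier.
            merged[field] = sorted(set(values))
        else:
            # Illustrative rule: prefer the longest (most complete) value.
            merged[field] = max(values, key=len) if values else None
    return merged


cluster = [
    {"internal_id": "101", "first_name": "Tejas", "last_name": "Manohar",
     "city": "New York", "country": "USA",
     "MAID": "A123", "braze_id": "BCD1", "shopify_id": "XYZ2"},
    {"internal_id": "102", "first_name": "Tej", "last_name": "Manohar",
     "city": "New York", "country": "USA",
     "MAID": "A123", "braze_id": "BCD2", "shopify_id": "XYZ1"},
]
print(merge_cluster(cluster))
```

For the two records above, the merged record keeps "Tejas" as the first name and carries both internal IDs, both Braze IDs, and both Shopify IDs.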

Output and Actions

 

cluster_id  internal_id  first_name  last_name  city         country  MAID  braze_id  shopify_id
1           101          Tejas       Manohar    New York     USA      A123  BCD1      XYZ2
1           102          Tej         Manohar    New York     USA      A123  BCD2      XYZ1
2           201          Alec        Haas       Los Angeles  USA      B456  EFG1      UVW2
2           202          Alex        Haas       Los Angeles  USA      B457  EFG2      UVW2
3           301          Alice       Johnson    Chicago      USA      C789  HIJ1      STU3
4           401          Bob         White      Miami        USA      D012  KLM1      RST4
5           501          Charlie     Brown      Boston       USA      E345  NOP1      QRS5

The output from Dedupe includes clusters of records identified as duplicates with confidence scores. These clusters allow us to:

  • Merge Records: Combine duplicates into a single, comprehensive record, maintaining all relevant information and identifiers.
  • Clean Data: Correct inconsistencies identified during the deduplication process.
  • Improve Customer Insights: By resolving duplicates, we create more accurate customer profiles, enhancing marketing strategies and customer service.
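Because each cluster comes with confidence scores, one option is to merge only the high-confidence clusters automatically and send the rest for manual review. The sketch below assumes clusters in the (record_ids, scores) shape described above; the 0.9 cut-off, the `triage_clusters` helper, and the scores in the example are made up for illustration.

```python
def triage_clusters(clustered_dupes, auto_merge_at=0.9):
    """Split multi-record clusters into auto-merge and manual-review lists."""
    auto_merge, needs_review = [], []
    for record_ids, scores in clustered_dupes:
        if len(record_ids) < 2:
            continue  # singleton: nothing to merge
        target = auto_merge if min(scores) >= auto_merge_at else needs_review
        target.append(tuple(record_ids))
    return auto_merge, needs_review


# Made-up scores, for illustration only.
example = [
    (("101", "102"), (0.97, 0.97)),
    (("201", "202"), (0.72, 0.72)),
    (("301",), (1.0,)),
]
auto, review = triage_clusters(example)
print(auto)    # [('101', '102')]
print(review)  # [('201', '202')]
```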

Conclusion

Leveraging the Dedupe library for entity resolution enables B2C retail companies to enhance their customer data's accuracy and utility. By identifying and merging duplicate records, companies can ensure a consistent and personalized customer experience, laying the groundwork for more informed business strategies and improved data governance.

If you don't want to implement an open-source solution like Dedupe, contact us to see a demo of Census Entity Resolution. We handle the heavy lifting, and you get clean, deduped, and merged data ready to be synced.