Master Data Management
Automated golden record creation that consolidates fragmented entity data into reliable, consistent references across your entire organization.

Master data is the authoritative reference for critical business entities: customers, products, vendors, locations. When master data is fragmented across systems with duplicates and inconsistencies, every analysis that relies on these entities becomes questionable. Master data management automation creates golden records—the single, authoritative version of each entity—that downstream systems can trust.
The Master Data Problem
Consider customer data. Your CRM has one set of customer records. Your billing system has another. Your support system has a third. They don't use consistent IDs, so matching them requires name and email matching. And even then, 'John Smith' at 'Acme Corp' might be 'J. Smith' at 'Acme, Inc.' (with a comma) in another system.

The result is fragmentation. Marketing sends an email to the wrong John Smith because different systems hold different addresses. Finance attributes revenue incorrectly because the same customer appears under different company names. Support can't see purchase history because the customer records don't match.

Master data management addresses this by creating authoritative golden records that consolidate information from multiple sources. The golden record knows that 'John Smith' at 'Acme Corp', 'J. Smith' at 'Acme, Inc.', and 'john.smith@acmecorp.com' are all the same person.
What Is Master Data?
Master data is the foundational data that describes the core entities of your business: who you sell to (customers), what you sell (products), who you buy from (vendors), and where you operate (locations). These entities are referenced by transactions but are not generated by transactions. Transactional data (sales, invoices, orders) references master data.
Entity Resolution Techniques
Entity resolution identifies when records across different systems refer to the same real-world entity. This is the core challenge of master data management.

Deterministic matching uses exact rules: if name AND email match exactly, it's the same entity. Simple and fast, but it misses similar-but-not-identical records.

Probabilistic matching uses statistical models to calculate match confidence: 'John Smith' + '123 Main St' might be a match even without email. These models learn from training data and handle variations better, but require more computation.

Rule-based matching combines deterministic and probabilistic rules with business logic: 'Same email is a high-confidence match. Same name plus same zip code is medium confidence. Same phone number alone is low confidence.' Business rules encode organizational knowledge about what makes two records likely the same person.

AI-based matching uses machine learning to identify matches based on many features simultaneously. It can discover matching patterns humans didn't explicitly encode. However, it requires training data with known matches and may be a black box that's hard to explain to business users.
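The contrast between deterministic and rule-based matching can be sketched in a few lines. This is a minimal illustration using Python's `difflib` for fuzzy name similarity; the field names, weights, and confidence values are illustrative assumptions, not a production matcher.

```python
from difflib import SequenceMatcher

def deterministic_match(a: dict, b: dict) -> bool:
    """Exact rule: same normalized name AND same email."""
    return (a["name"].strip().lower() == b["name"].strip().lower()
            and a["email"].strip().lower() == b["email"].strip().lower())

def similarity(x: str, y: str) -> float:
    """Fuzzy string similarity in [0, 1]."""
    return SequenceMatcher(None, x.lower(), y.lower()).ratio()

def match_confidence(a: dict, b: dict) -> float:
    """Rule-based blend: email dominates; name and zip contribute less.
    Weights (0.6 / 0.4) are arbitrary illustrative choices."""
    if a.get("email") and a["email"].lower() == b.get("email", "").lower():
        return 0.99
    score = 0.6 * similarity(a["name"], b["name"])
    score += 0.4 * (1.0 if a.get("zip") == b.get("zip") else 0.0)
    return score

crm     = {"name": "John Smith", "email": "john.smith@acmecorp.com", "zip": "02139"}
billing = {"name": "J. Smith",   "email": "john.smith@acmecorp.com", "zip": "02139"}

assert not deterministic_match(crm, billing)   # exact rule misses the variant name
assert match_confidence(crm, billing) >= 0.99  # shared email -> high confidence
```

The asserts show the core trade-off from the text: the exact rule misses 'J. Smith', while the rule-based score still catches the match via the shared email.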
Building Golden Records
Golden records consolidate attributes from multiple source records into a single authoritative record. The consolidation logic determines which attributes win.

Survivorship rules define which source takes precedence for each attribute. 'CRM email is authoritative for contact info. Billing system address is authoritative for shipping. Support system phone is authoritative for outreach.' These rules encode business knowledge about which system has the most reliable data for each attribute.

Confidence scoring tracks how certain the system is about each attribute. If all sources agree on name, confidence is high. If sources disagree on address, confidence is low and might require human review.

Source tracking maintains provenance of each attribute. 'Address came from Salesforce on 2024-01-15. Updated from HubSpot on 2024-03-20.' This allows auditing why the golden record shows a particular value.

Timeline tracking maintains historical states of the golden record for point-in-time queries. A customer who moved in March has historical addresses accessible for historical analysis.
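A minimal sketch of survivorship with provenance and confidence, assuming hypothetical source systems named `crm`, `billing`, and `support`; a real implementation would add timestamps and timeline tracking.

```python
# Survivorship: per-attribute source precedence, plus provenance and
# confidence. System names and precedence order are illustrative.
SURVIVORSHIP = {
    "email":   ["crm", "billing", "support"],
    "address": ["billing", "crm", "support"],
    "phone":   ["support", "crm", "billing"],
}

def build_golden_record(sources):
    """Pick each attribute from the highest-precedence source that has it,
    recording which source supplied the value (provenance)."""
    golden = {}
    for attr, precedence in SURVIVORSHIP.items():
        for system in precedence:
            value = sources.get(system, {}).get(attr)
            if value:
                golden[attr] = {"value": value, "source": system}
                break
    # Confidence: high only when every source that has the attribute agrees.
    for attr in golden:
        values = {s[attr] for s in sources.values() if s.get(attr)}
        golden[attr]["confidence"] = "high" if len(values) == 1 else "low"
    return golden

sources = {
    "crm":     {"email": "john.smith@acmecorp.com", "address": "123 Main St"},
    "billing": {"address": "500 Oak Ave", "phone": "555-0100"},
}
golden = build_golden_record(sources)
assert golden["email"]["source"] == "crm"        # CRM wins for contact info
assert golden["address"]["source"] == "billing"  # billing wins for address
assert golden["address"]["confidence"] == "low"  # sources disagree -> review
```

Keeping the precedence table as plain data (rather than burying it in code) makes it easy for business stakeholders to review and change the survivorship rules.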
The Merge Conflict Problem
When source records conflict, the system must decide what goes into the golden record. Simple rules (CRM always wins) may create frustration when another system has more current information. Complex rules (evaluate 15 attributes and score them) may be too complicated to maintain. Invest time in survivorship rule design with business stakeholders—these decisions have real operational impact.
Matching and Merging Workflows
Automated matching handles most records but leaves uncertain cases for human review. The workflow design determines how efficiently the system operates.

Auto-merge for high-confidence matches: if the matching algorithm is 99% confident two records are the same person, merge them automatically without human review. Log the merge for auditing.

Review queue for medium-confidence matches: if confidence is 70-98%, route to human reviewers who confirm or reject the match. Batch reviews efficiently: show reviewers the evidence that supports the match and let them accept or reject in bulk.

Reject queue for low-confidence matches: if confidence is below the review threshold, don't merge, but log the potential match for future reconsideration as more data accumulates.

Tune thresholds based on review queue volume: if too many medium-confidence matches pile up, lower the auto-merge threshold or raise the review threshold. Balance speed against quality.
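The three queues reduce to a simple routing function. The 0.99 and 0.70 defaults mirror the figures above; in practice they would be tuned against review queue volume.

```python
def route_match(confidence: float,
                auto_merge_at: float = 0.99,
                review_at: float = 0.70) -> str:
    """Route a candidate match to a queue based on confidence thresholds.
    Threshold defaults are illustrative and should be tuned."""
    if confidence >= auto_merge_at:
        return "auto_merge"    # merge now, log for audit
    if confidence >= review_at:
        return "review_queue"  # human confirms or rejects
    return "reject_log"        # keep for future reconsideration

assert route_match(0.995) == "auto_merge"
assert route_match(0.85)  == "review_queue"
assert route_match(0.40)  == "reject_log"
```

Making the thresholds parameters (rather than constants) is what allows the tuning loop described above: raise or lower them as queue volumes shift.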
Master Data Distribution
Golden records are only valuable if downstream systems use them. Distribution automation publishes authoritative master data to all systems that need it.

Hub-and-spoke architecture maintains master data in a central hub and pushes it to spoke systems. The hub is the system of record; spokes receive synchronized copies. Changes in the hub propagate to spokes on defined schedules or in real time.

API-based distribution provides on-demand access to golden records through REST APIs. When a CRM needs customer information, it calls the master data API rather than relying on its own copy. This ensures always-current data but requires API integration effort.

ETL-based distribution uses batch processing to synchronize golden records to downstream systems on a schedule. It is simpler to implement than API integration but introduces latency between hub updates and spoke synchronization.
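A toy hub-and-spoke sketch: the hub holds golden records and pushes every change to registered spoke callbacks. The callbacks stand in for real connectors (API writes, ETL jobs); the class and function names are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class MasterDataHub:
    """Hub holds golden records (system of record) and propagates
    every change to registered spoke sync callbacks."""
    records: dict = field(default_factory=dict)
    spokes: list = field(default_factory=list)

    def register_spoke(self, sync: Callable[[str, dict], None]) -> None:
        self.spokes.append(sync)

    def upsert(self, entity_id: str, record: dict) -> None:
        self.records[entity_id] = record
        for sync in self.spokes:  # push the change to every spoke copy
            sync(entity_id, record)

# A spoke's local copy, kept in sync by the hub's push.
crm_copy = {}
hub = MasterDataHub()
hub.register_spoke(lambda eid, rec: crm_copy.update({eid: rec}))
hub.upsert("cust-42", {"name": "John Smith"})
assert crm_copy["cust-42"]["name"] == "John Smith"
```

The same shape covers the batch variant: instead of pushing on every `upsert`, a scheduled job would replay accumulated changes to the spokes, trading freshness for simpler integration.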
Customer Data Platforms
Customer data platforms (CDPs) specialize in master data management for customer entities. They automate the creation of unified customer profiles from multiple sources.

Segment collects customer data from all touchpoints (website, mobile, email, in-store) and creates unified profiles with automatic identity resolution. It is strong for marketing use cases and provides activation to advertising platforms.

Salesforce Customer 360 creates a unified customer view across Salesforce clouds. It is best for organizations heavily invested in the Salesforce ecosystem, providing identity resolution across Sales Cloud, Service Cloud, and Marketing Cloud.

mParticle provides customer data infrastructure with a focus on cross-channel data activation. It is strong for mobile-first consumer businesses that need to activate data across many channels.

CDP selection depends on your existing tech stack, primary use cases (marketing vs. sales vs. service), and budget. All three provide value beyond custom master data implementations for customer entities.
Key Takeaways
- Master data fragmentation causes inconsistent analytics and operational failures across systems
- Entity resolution techniques range from deterministic (exact rules) to probabilistic (statistical models) to AI-based
- Golden records consolidate attributes from multiple sources using survivorship rules that encode business logic
- Auto-merge high-confidence matches, route medium-confidence to human reviewers, log low-confidence for future review
- Distribution automation pushes golden records to downstream systems through hub-and-spoke, API, or batch synchronization
- Customer data platforms automate customer master data management for organizations without custom implementation resources