Consolidation of Multiple Public Data Sources Through Advanced Soft-Matching Techniques
Clearly Define Your Business Objectives
A major B2B SAAS provider was seeking additional externally sourced information about their existing clients and prospects. Combining this information with existing CRM data would create a broader view of their client and prospect base. This first required consolidating relevant data sources to create a “single source of truth.”
The primary challenge with consolidating multiple public data sources into one comprehensive database is the likelihood of a single entity occurring in multiple forms throughout the dataset. Accounting for this problem became the first priority.
Acquire & Synthesize Relevant Data
Most of the existing client data was stored in relational databases and Excel workbooks. Existing matching formulas such as Excel’s Vlookup or R’s adist were the most obvious solution, but are not optimized to match business names. Therefore we had to develop a proprietary algorithm to match entities across multiple sources.
These entities usually exist as text data entries. Excel’s Vlookup works by first cross-referencing data sources with each other. For example, using the VLookup function in Excel within existing data-sets returned the following for matching results:
Fig. 1 – Soft Matching Attempts Using Excel Lookup Formula
In the example given in Figure 1, the entities in source 2 are being searched in source 1. Vlookup’s exact match did not produce any matches for the entities in source 2 column, while only returning one correct result with approximate match. Relevant algorithms in R and Python showed similar limitations.
Figure 2 shows comparable results when using FischerJordan’s Algorithm (FJ Algorithm). While we can see 3 entities with similar names as “ABC & Company” in source 1, the nearest match is “ABC & Co.”. For “New man tech”, the nearest match is “Newman Technologies.” Both of these return as matches in the third column.
Fig. 2 – Soft Matching Attempts on Data-Set 1 Using FJ Algorithm
Different public databases have data aggregated at different levels. We needed an algorithm to reconcile different data sources in order to create a single source of truth. Figure 3 shows how the FJ algorithm can recognize and match entities that may exhibit sub-entity or sister-entity relationships (e.g. Kaiser Permanente is the parent organization of Kaiser Foundation Group).
Fig. 3 – Soft Matching Attempts on Data-Set 2 Using FJ Algorithm
Develop an Action Plan
We leveraged the algorithm to aggregate multiple public data sources, combine them with our client’s proprietary CRM data, and create a “database of databases” containing all relevant information about a given customer. This allowed our client to identify firmographic information such as size, years in business, and geography, and combine them with proprietary information to develop a more customized relationship management strategy for each customer.
Partner with You to See it Through
Partner with You to See it Through
We automated the algorithm and combined it with existing web scraping algorithms to continuously update data from internal and external data sources on a quarterly basis without requiring on-site management and reconciliation of incoming data (Fig. 4).
Fig. 4 – Process Flow to Final Analysis Output Leveraging FJ Soft-Matching Algorithm
Soft Matching, New York Consulting, Management Consulting, Microsoft Excel, VLOOKUP, SAAS, B2B, Algorithm, R, Python, Data Analytics, Quantitative Analysis, Analytics, Case Study
Back to ANALYTICS
OUR CASE STUDIES
Copyright © 2019 FischerJordan. All Rights Reserved.