Background
My manager asked me to drive the delivery of an ETL product integrating our sales coverage data with a third-party system of client data. The sales team for US Wealth generates $25 billion annually, representing 30% of annual net revenue.
The team sets sales targets based on revenue generated through trades of our funds by retail financial advisors. For example, it should be possible to determine how many dollars salesperson “Jane” generated in revenue for the firm from her sales coverage of “Dollar Sign Financial”.
Revenue is determined through financial trades, the purchase or sale of a fund such as a mutual fund. Purchases are the ideal, but both trade types are crucial for the analysis known as sales attribution, which we usually just call getting credit. Trade data comes through a third-party vendor that feeds the data to us several times a day.
A separate daily feed from the same vendor includes client data organized by firms, offices, and financial advisors (FAs). This client data feed ties to the trades in the first feed. For simplicity I’ll call these feeds the Trade and the Client Data feeds.
Let’s say Patty works in Seattle for Dollar Sign Financial as an FA. This employment information would come through the Client Data feed. The Trade feed would show, for example, that Patty at Dollar Sign Financial purchased 10 shares of the ABC fund from my firm at $100 per share. Combining the two feeds with our in-house coverage data, we can complete the puzzle: “Since Jane at my firm covers Patty at Dollar Sign Financial, Jane should be credited with the $1,000 from the trade (10 shares × $100).”
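To make the attribution concrete, here is a minimal sketch of that join in Python. The records, field names, and coverage structure are hypothetical stand-ins, not our actual schema.

```python
# Illustrative sketch only -- hypothetical records and field names.

# Client Data feed: employment of FAs at firms and offices.
client_feed = [
    {"fa_id": "FA-001", "fa_name": "Patty", "firm": "Dollar Sign Financial", "office": "Seattle"},
]

# Trade feed: fund purchases and sales tied back to the FA and firm.
trade_feed = [
    {"fa_id": "FA-001", "firm": "Dollar Sign Financial", "fund": "ABC", "shares": 10, "price": 100.00},
]

# In-house coverage data: which of our salespeople covers which FA at which firm.
coverage = {("FA-001", "Dollar Sign Financial"): "Jane"}

# Only credit trades whose FA/firm pair is confirmed by the Client Data feed.
active_employment = {(c["fa_id"], c["firm"]) for c in client_feed}

credit = {}
for trade in trade_feed:
    key = (trade["fa_id"], trade["firm"])
    if key in active_employment and key in coverage:
        salesperson = coverage[key]
        credit[salesperson] = credit.get(salesperson, 0.0) + trade["shares"] * trade["price"]

print(credit)  # {'Jane': 1000.0}
```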
Problem Statement
The trade data feed was already integrated so the mission was to integrate the daily client data feed of firms, offices, and FAs into our in-house database of clients and sales. This would enable connecting the trade and client data feeds to our sales coverage data. The ultimate goal was sales attribution like “Jane at our firm generated X dollars in trades of our funds from the firms she covers.”
The complexity was that my firm already had a database of client data that was used for multiple businesses in addition to US Wealth. This feed impacted 18% of the central database of client data. Naturally, there were multiple factors to consider and solve for:
- The ET of ETL (Extract, Transform, and Load) – solve for bringing the data into our system from files
- The L of ETL – solve for loading that data into our in-house client database
- Mapping the 3rd party data to our existing data model
- Data ingestion logic
- Data issue considerations such as duplicates or missing data
- Contention where different business segments at my firm may view the data differently
- Example: US Wealth expects to see Patty’s primary employment as Dollar Sign Financial, but the Institutional business views her seat on the Board of Directors at Panda Federal as her primary employment. Who should win? (One hypothetical way to model this contention is sketched after this list.)
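To illustrate the contention problem (not the solution we actually shipped), one possible model is to keep every employment and let each business segment define its own “primary” view. The names and priority rule below are invented for the sketch.

```python
# Hypothetical sketch: keep all employments and let each segment pick its own primary,
# rather than forcing a single winner across businesses.

employments = [
    {"person": "Patty", "firm": "Dollar Sign Financial", "role": "Financial Advisor"},
    {"person": "Patty", "firm": "Panda Federal", "role": "Board of Directors"},
]

# Each business segment ranks the roles it cares about most.
segment_role_priority = {
    "US Wealth": ["Financial Advisor", "Board of Directors"],
    "Institutional": ["Board of Directors", "Financial Advisor"],
}

def primary_employment(person, segment):
    """Return the employment a given segment considers primary for this person."""
    priority = segment_role_priority[segment]
    candidates = [e for e in employments if e["person"] == person]
    return min(candidates, key=lambda e: priority.index(e["role"]))

print(primary_employment("Patty", "US Wealth")["firm"])      # Dollar Sign Financial
print(primary_employment("Patty", "Institutional")["firm"])  # Panda Federal
```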
My Role
As product manager, my role was to work through all the factors listed above, carve out the requirements, determine scope and sequencing, and see the product through to delivery. Often it seemed we were solutioning while gathering requirements, in an odd chicken-and-egg sort of way. (Foreshadowing a lesson learned, much?)
I collaborated with product owners of the trade data, business partners, and engineering leads on all workstreams in parallel, documenting decisions and requirements along the way. I also determined the end-to-end testing strategy and communicated with stakeholders to provide transparency into the product process.
Outcomes
It is one of those products that just works quietly behind the scenes, in the best possible way.
This was a massive two year effort carefully designed and executed with seven engineers, four product owners, and countless stakeholders.
ETL was solved by storing the five data files in Hadoop and Sqoop-ing them into a relational staging database in Sybase. This was a relatively small piece of the product solution, as there were already similar processes in place for us to piggyback on.
Mapping the 3rd party data to our existing data model ate up at least 8 months of the 2-year timeline, likely more. The deliverable was a mapping of fields from the five separate files to our system’s data model.
Many mappings were straightforward: firm name in the feed, for example, mapped directly to firm name in our system. The not-so-straightforward mappings had larger impacts on the current system. One challenge was that the client data feed stored branch data differently than our system did, presenting a foundational data conflict.
I worked with engineering to identify all possible solutions, then illustrated the pros and cons with multiple partners to determine the best one. There were many such mapping considerations to work through, decide, and document.
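For the straightforward cases, the work amounted to a field-by-field translation from feed columns to our data model. Here is a minimal, hypothetical sketch of what such a mapping might look like; the field names are invented, and the real mappings spanned five files and far more fields.

```python
# Hypothetical field mapping from one vendor file to our internal firm model.

FIRM_FIELD_MAP = {
    "FIRM_NM": "firm_name",          # straightforward one-to-one mapping
    "FIRM_CRD": "firm_external_id",
    "FIRM_ADDR_1": "address_line_1",
}

def map_firm_record(feed_row):
    """Translate a raw feed row into internal field names, setting unmapped fields aside for review."""
    mapped, unmapped = {}, {}
    for feed_field, value in feed_row.items():
        internal_field = FIRM_FIELD_MAP.get(feed_field)
        if internal_field:
            mapped[internal_field] = value
        else:
            unmapped[feed_field] = value  # e.g. branch fields with no clean equivalent in our model

    return mapped, unmapped

mapped, unmapped = map_firm_record(
    {"FIRM_NM": "Dollar Sign Financial", "FIRM_CRD": "12345", "BRANCH_CD": "SEA-01"}
)
print(mapped)    # {'firm_name': 'Dollar Sign Financial', 'firm_external_id': '12345'}
print(unmapped)  # {'BRANCH_CD': 'SEA-01'}
```

The unmapped bucket is where the harder conversations started: fields like the branch example above forced a decision about changing our model, transforming the feed, or flagging the record for a person to resolve.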
In parallel I worked with my business partner on data ingestion logic. Ideally, the data comes in clean and there are no conflicts with existing data in the system. Unfortunately, this is not always the case.
What if the firm is getting removed by the feed but there are still active offices at the firm? Do we automatically remove the firm’s offices or do we notify the client data team of the data issue?
Also, people modify data during the day through our CRM application. Someone could remove an employee on Monday only to have the feed try to change that employee’s branch the next day. Who wins?
There were many such cases to consider, decide on an action for, and write up as business logic. Through this process, it became clear the team needed a way to manage the data integrity errors that would inevitably arise. This required spinning up a new workstream to design and deliver a suite of separate data quality products in parallel; see Data Quality Dashboard and Merge Tool.
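To give a flavor of how that business logic read once written up, here is an illustrative sketch of two of the cases above. The specific rules below are assumptions for the example, not the ones we ultimately shipped; each real case was decided with business partners.

```python
# Illustrative only: simplified decision logic for two ingestion cases.
from datetime import date

def handle_firm_removal(firm, active_offices, issue_queue):
    """Feed removes a firm that still has active offices: flag for review rather than auto-delete."""
    if active_offices:
        issue_queue.append(f"Firm {firm} removed by feed but has {len(active_offices)} active offices")
        return "flag_for_review"
    return "remove_firm"

def handle_employee_update(last_manual_edit_date, feed_date):
    """If a user edited the record in the CRM on or after the feed's effective date, the manual edit wins."""
    if last_manual_edit_date and last_manual_edit_date >= feed_date:
        return "keep_manual_edit"
    return "apply_feed_change"

issues = []
print(handle_firm_removal("Dollar Sign Financial", ["Seattle"], issues))  # flag_for_review
print(issues)

# Manual removal on Monday, feed change on Tuesday: under this sample rule, the newer feed wins.
print(handle_employee_update(date(2024, 3, 4), date(2024, 3, 5)))  # apply_feed_change
```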
Learning
This is one of those product deliveries I never thought would happen while I was in the thick of things. Then, when it did end, I really missed it. By “end” I mean the product reached its final state and I moved on to build out a different product. The feed still lives on, is considered quite a success, and is enhanced as needed.
Lessons Learned
- err on the side of over-communication
- communicate again and in different ways: emails, calls, meetings, slide decks, wiki, billboards, skywriting, etc.
- use storytelling to explain complexities
- through headlines
- providing a picture to help people grasp the timeline
- simple CYA and escalating possible reasons for delays early
- document key decisions clearly and communicate those decisions often. (Again with the communication.)
- make data mapping decisions first and then solution everything else to avoid rework