Overview
When pulling company data from multiple sources (Crunchbase, PitchBook, LinkedIn, your CRM), the same company appears differently in each. “Acme Inc,” “Acme Corporation,” “Acme,” and “ACME, Inc.” are all the same company, but your system doesn’t know that. Entity resolution is the process of determining which records refer to the same real-world entity. Without it, you can’t aggregate data, build accurate analytics, or track companies properly. It’s one of the hardest and most important data problems in VC. Getting it right was significant to EQT’s Motherbrain. This chapter covers why it matters, how to approach it (start simple, build up only when necessary), and what it takes to build your own system. Spoiler: start with URL matching and avoid building a full system as long as possible.
The Entity Resolution Problem
Company name variations: “Stripe, Inc.” vs “Stripe” vs “Stripe Payments” vs “Stripe Inc” (no period). These are all the same company, but exact string matching fails. Companies also rebrand (“Facebook” became “Meta”), get acquired (should “Instagram” be separate from Meta or merged?), and use different legal names in different jurisdictions.
URL variations: “stripe.com” vs “https://stripe.com” vs “www.stripe.com” vs “stripe.com/”. Same domain, different formats. Some sources include the protocol, others don’t. Some include www, others don’t. Trailing slashes are inconsistent.
Location variations: “San Francisco, CA” vs “San Francisco, California” vs “SF” vs “San Francisco, CA, USA”. Same location, different representations. Some sources use city and state, others add country, others use abbreviations.
People name variations: “Alex Smith” vs “Alexander Smith” vs “A. Smith”. Same person, different formats. People also change names (marriage, personal preference), use nicknames professionally, and have names that transliterate differently from other languages.
Data staleness: A company moved offices, changed its URL, or was acquired. Some data sources have updated information, others don’t. Which is correct? How do you handle temporal changes while maintaining entity consistency?
Missing data: Not every source has every field. Crunchbase might have a LinkedIn URL, but PitchBook doesn’t. One source has founding date, another doesn’t. You need to match entities even with incomplete information.
The fundamental challenge is that there’s no universal, stable identifier for companies or people across all data sources. Each source assigns its own IDs. You need to create mappings between these IDs to know when records refer to the same entity.
Why Entity Resolution Matters for VC
Without entity resolution, your data infrastructure breaks in multiple ways.
Analytics are wrong: If “Acme Inc” and “Acme Corporation” are both in your database as separate companies, your portfolio size is overstated. Your market maps show duplicate companies. Your metrics count the same funding round twice if it appears in both Crunchbase and PitchBook under slightly different names.
You can’t aggregate data: You want to combine Crunchbase’s funding data with PitchBook’s valuation data and LinkedIn’s employee count for the same company. Without entity resolution, you can’t merge these data sources. You’re stuck with fragmented, incomplete information about each company.
Knowledge graphs break: If the same company appears as multiple nodes in your graph, relationships are split incorrectly. Investors appear to have invested in fewer companies than they actually did. People appear to work at multiple companies simultaneously. Competitive relationships are missed.
Sourcing is inefficient: Your sourcing tool identifies “Acme Inc” as a potential investment. But you already passed on this company three months ago when it appeared in your CRM as “Acme Corporation.” Without entity resolution, you waste time re-evaluating the same companies.
Portfolio tracking fails: You invested in a company, but your fund operations system, CRM, and data warehouse all have different records for this company with no linkage. When the company raises a Series B (triggering a markup), you have to manually update multiple systems because they can’t automatically recognize it’s the same company.
This isn’t a nice-to-have. Entity resolution is foundational to making any cross-source data system work correctly.
Start Simple: See How Far You Can Get
Before building anything sophisticated, see how far simple matching rules take you.
Match on URLs: Company websites are relatively unique and stable. If two records have the same domain (stripe.com), they’re almost certainly the same company. Normalize the URLs first (remove protocol, remove www, remove trailing slashes, lowercase everything), then compare. This catches most matches.
Match on LinkedIn URLs: LinkedIn company pages have stable identifiers in their URLs (linkedin.com/company/stripe). If two records share the same LinkedIn URL, they’re the same company. Same approach: normalize, then compare. This also works well for people data.
Match on email addresses: If you don’t have reliable LinkedIn URLs for people, email addresses are usually the next best thing.
Exact string matches: After normalizing names (lowercase, remove punctuation, standardize “Inc” vs “Incorporated”), exact string matches catch many cases. “Stripe Inc” and “Stripe, Inc.” become “stripe inc” and match.
These simple rules will catch 70-80% of matches without any sophisticated infrastructure. Implement these first. Only move to more complex solutions when simple matching stops working.
Why this matters: Building an entity resolution system is expensive in engineering time and ongoing maintenance. Every hour spent on entity resolution is an hour not spent on research platforms, deal flow tools, or portfolio support. If simple rules handle most cases, that might be enough for your fund’s scale.
Sometimes You Can Avoid Entity Resolution Entirely
Before investing in sophisticated entity resolution, consider whether you actually need to combine data sources for your use case.
Use single authoritative sources: Instead of stitching together funding data from Crunchbase, PitchBook, and your CRM, pick the best source for funding data (probably PitchBook) and use only that for market analysis. Don’t try to combine them. This avoids the entity resolution problem entirely for that use case.
Examples where single-source works:
- Analyzing total funding in a market? Use PitchBook exclusively.
- Tracking employee counts? Use LinkedIn data from one of the People Data Providers only.
- Portfolio valuations? Use your fund operations system as the single source of truth.
Entity Resolution as a Service
Before building your own system, consider whether services exist that solve this problem for you.
Services available: Several vendors offer entity resolution as a service:
- Senzing: Real-time entity resolution focused on messy people and organization data. Available on AWS Marketplace, deploys in your infrastructure for data privacy and compliance. Good for company matching with inconsistent names and addresses.
- Tamr: AI-native entity resolution with specific B2B capabilities. Offers automated matching and supports Dun & Bradstreet enrichment for company data. Focuses on master data management and customer 360 use cases.
- AWS Entity Resolution: Amazon’s native service with flexible, configurable workflows. Integrates directly with other AWS services if you’re already in that ecosystem.
Building Your Own Entity Resolution Service
When simple matching stops working and services are too expensive, you need to build your own system. This is where it gets hard.
The canonical ID approach: Assign one internal ID per entity. Every company or person gets exactly one canonical ID in your system. All records from external sources (Crunchbase, PitchBook, LinkedIn, your CRM) map to these canonical IDs. When you query for a company, you query by canonical ID, which pulls data from all sources that reference that entity. This canonical ID becomes your “golden record” or “single source of truth” for each entity. External IDs come and go, but your canonical ID is stable.
Matching strategies for companies: When comparing two company records to decide if they’re the same entity, match on multiple fields:
- Name: Fuzzy string matching with normalization. Convert to lowercase, remove punctuation, standardize abbreviations (“Inc”, “Corp”, “Ltd”). Use string similarity algorithms (Levenshtein distance, Jaro-Winkler) to catch typos and minor variations.
- URL/Domain: As mentioned earlier, this is your strongest signal. If domains match, very high confidence it’s the same company.
- LinkedIn URL: Similarly strong signal if available.
- Location: City and country provide supporting evidence. “San Francisco” and “SF” should normalize to the same value. Location alone isn’t sufficient (multiple companies in the same city), but combined with name similarity it strengthens confidence.
- Founding date: If both records have founding dates and they’re close (within a year), that supports a match. But founding dates are often missing or wrong in data sources, so don’t rely on this.
- Description similarity: Company descriptions can be compared using embeddings or keyword analysis. If two companies with similar names also have similar descriptions, higher confidence they’re the same entity.
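To make the company signals above concrete, here is a minimal sketch in Python. It is one possible shape, not a reference implementation. The field names (`name`, `url`, `founded`) and the suffix list are assumptions for illustration. It drops legal suffixes entirely, which is a simplification of standardizing them. The standard library’s `difflib.SequenceMatcher` stands in for Levenshtein or Jaro-Winkler, which would need a third-party package.

```python
import re
from difflib import SequenceMatcher
from urllib.parse import urlparse

# Illustrative, incomplete list of legal suffixes to ignore when comparing names.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation", "ltd", "limited", "llc"}


def normalize_domain(url: str) -> str:
    """Lowercase, then strip protocol, 'www.', port, path and trailing slash."""
    url = url.strip().lower()
    if "://" not in url:
        url = "//" + url  # makes urlparse treat the input as a host, not a path
    host = urlparse(url).netloc.split("@")[-1].split(":")[0]
    return host[4:] if host.startswith("www.") else host


def normalize_name(name: str) -> str:
    """Lowercase, remove punctuation, drop trailing legal suffixes."""
    words = re.sub(r"[^\w\s]", " ", name.lower()).split()
    while len(words) > 1 and words[-1] in LEGAL_SUFFIXES:
        words.pop()
    return " ".join(words)


def name_similarity(a: str, b: str) -> float:
    """Similarity in [0, 1] between two normalized names."""
    return SequenceMatcher(None, normalize_name(a), normalize_name(b)).ratio()


def company_signals(a: dict, b: dict) -> dict:
    """Compare two records field by field, skipping fields either side lacks."""
    signals = {}
    if a.get("url") and b.get("url"):
        signals["domain_match"] = normalize_domain(a["url"]) == normalize_domain(b["url"])
    if a.get("name") and b.get("name"):
        signals["name_similarity"] = name_similarity(a["name"], b["name"])
    if a.get("founded") and b.get("founded"):
        # Supporting evidence only: founding years within one year of each other.
        signals["founding_close"] = abs(a["founded"] - b["founded"]) <= 1
    return signals
```

Returning per-field signals rather than a single score keeps missing data explicit: a field that one source lacks simply does not appear, and how to weight the signals that do appear is left to whatever decision rule you layer on top.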
Matching strategies for people:
- LinkedIn URL: Strongest signal. LinkedIn profiles have stable IDs.
- Email: If you have email addresses from multiple sources, exact match is strong signal.
- Name + Company: If first name, last name, and current company all match, very likely the same person.
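The people signals can be sketched the same way: derive a list of matching keys per record, strongest first, and report the strongest key two records share. This is an illustrative sketch. The field names (`linkedin_url`, `email`, `first_name`, `last_name`, `company`) are assumptions, and it assumes profile URLs follow the `linkedin.com/in/<slug>` pattern.

```python
import re


def linkedin_slug(url: str):
    """Return the slug from a linkedin.com/in/<slug> URL, or None if absent."""
    match = re.search(r"linkedin\.com/in/([^/?#]+)", url.strip().lower())
    return match.group(1) if match else None


def person_keys(record: dict) -> list:
    """Build matching keys for one person record, strongest signal first."""
    keys = []
    slug = linkedin_slug(record.get("linkedin_url") or "")
    if slug:
        keys.append(("linkedin", slug))
    email = (record.get("email") or "").strip().lower()
    if email:
        keys.append(("email", email))
    parts = [(record.get(f) or "").strip().lower()
             for f in ("first_name", "last_name", "company")]
    if all(parts):  # only usable when all three fields are present
        keys.append(("name_company", "|".join(parts)))
    return keys


def same_person(a: dict, b: dict):
    """Return the type of the strongest key both records share, or None."""
    b_keys = set(person_keys(b))
    for key in person_keys(a):
        if key in b_keys:
            return key[0]
    return None
```

Exact keys like these will not catch “Alex Smith” vs “Alexander Smith” at the same company. That case needs a nickname table or fuzzy name comparison on top.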
Matching workflow: For each incoming record:
- Check if it has a URL or LinkedIn URL that matches an existing canonical entity → if yes, link it
- If no URL match, check for high-confidence name + location match → if yes, link it
- If medium confidence match, flag for human review
- If no match found, create a new canonical entity
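The decision steps above, together with the canonical ID mapping, might look roughly like the following sketch. Everything named here is hypothetical: the `Resolver` class, the record fields (`source`, `source_id`, `domain`), the integer canonical IDs, and the two thresholds. The `score` callable is whatever name-plus-location comparison you choose, and records are assumed to arrive with `domain` already normalized.

```python
HIGH, MEDIUM = 0.9, 0.7  # hypothetical thresholds; tune them against reviewed matches


class Resolver:
    def __init__(self, score):
        self.score = score      # callable(record, canonical_record) -> float in [0, 1]
        self.by_domain = {}     # normalized domain -> canonical ID
        self.canonical = {}     # canonical ID -> first record seen for that entity
        self.links = {}         # (source, source_id) -> canonical ID
        self.review_queue = []  # (record, candidate canonical ID, score)
        self._next_id = 1

    def _link(self, record, canonical_id):
        self.links[(record["source"], record["source_id"])] = canonical_id
        if record.get("domain"):
            self.by_domain.setdefault(record["domain"], canonical_id)
        return canonical_id

    def resolve(self, record):
        """Return the canonical ID for a record, or None if queued for review."""
        # Step 1: a URL match links immediately.
        canonical_id = self.by_domain.get(record.get("domain"))
        if canonical_id is not None:
            return self._link(record, canonical_id)

        # Steps 2 and 3: find the best name + location candidate.
        best_id, best_score = None, 0.0
        for candidate_id, candidate in self.canonical.items():
            s = self.score(record, candidate)
            if s > best_score:
                best_id, best_score = candidate_id, s
        if best_score >= HIGH:
            return self._link(record, best_id)
        if best_score >= MEDIUM:
            self.review_queue.append((record, best_id, best_score))
            return None

        # Step 4: no match, so create a new canonical entity.
        canonical_id = self._next_id
        self._next_id += 1
        self.canonical[canonical_id] = record
        return self._link(record, canonical_id)
```

Scanning every canonical entity per record is fine for a sketch but will not scale. A real system would narrow the candidates first, for example by indexing on normalized name.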
Human-in-the-Loop: You’ll Need It
Automated matching will never be 100% accurate. You will have false positives (incorrectly merged entities) and false negatives (incorrectly split entities). You need tooling for humans to review and correct mistakes.
Review interface: Show suggested matches with confidence scores. A human reviews the match (looking at names, URLs, descriptions, and any other data) and confirms or rejects it. This is especially important for medium-confidence matches where the algorithm isn’t sure.
Merge interface: When you discover two canonical entities that should be one (the algorithm missed a match), you need to merge them. This means:
- Choosing which canonical ID to keep
- Mapping all external references from the old ID to the new ID
- Combining data from both entities into a single record
- Preserving history (you want to know these were merged in case you need to undo it)
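A merge along these lines could be sketched as follows. The data shapes are assumptions for illustration: `links` maps `(source, source_id)` to a canonical ID, `records` maps a canonical ID to a dict of fields, and `history` is an append-only list. The “fill only empty fields” rule is one simple way to combine data, not the only one.

```python
from datetime import datetime, timezone


def merge_entities(keep_id, drop_id, links, records, history):
    """Merge canonical entity drop_id into keep_id, modifying the inputs in place."""
    # Remap every external reference from the old ID to the kept ID.
    moved_refs = [ref for ref, cid in links.items() if cid == drop_id]
    for ref in moved_refs:
        links[ref] = keep_id

    # Combine data: the kept record wins; the dropped record fills its gaps.
    kept = records[keep_id]
    dropped = records.pop(drop_id)
    filled_fields = []
    for name, value in dropped.items():
        if kept.get(name) in (None, "") and value not in (None, ""):
            kept[name] = value
            filled_fields.append(name)

    # Preserve enough history to undo the merge later.
    history.append({
        "action": "merge",
        "kept": keep_id,
        "dropped": drop_id,
        "moved_refs": moved_refs,
        "filled_fields": filled_fields,
        "dropped_record": dropped,
        "at": datetime.now(timezone.utc).isoformat(),
    })
    return kept
```

Storing the moved references and the dropped record in the history entry is what makes an undo possible: without them you would have to guess which external IDs originally belonged to which entity.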
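Split interface: When you discover one canonical entity that is really two (the algorithm merged incorrectly), you need to split it. This means:
- Create a new canonical ID for the second entity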
- Decide which external references belong to which entity (this is hard - you’re untangling merged data)
- Recalculate any analytics or relationships that were affected by the incorrect merge
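Untangling an incorrect merge can be sketched with the same assumed shapes (`links` mapping `(source, source_id)` to a canonical ID, `records` keyed by canonical ID, an append-only `history`). The hard part, deciding which references move, stays with the human reviewer. The hypothetical function below only applies that decision and reports which entities need their analytics recalculated.

```python
def split_entity(canonical_id, refs_to_move, new_id, links, records, history):
    """Move reviewer-chosen external references to a new canonical entity."""
    if new_id in records:
        raise ValueError(f"canonical ID {new_id} already exists")
    for ref in refs_to_move:
        if links.get(ref) != canonical_id:
            raise ValueError(f"{ref} is not linked to canonical entity {canonical_id}")

    for ref in refs_to_move:
        links[ref] = new_id

    # Start the new entity empty; its fields should be rebuilt from the raw
    # source records behind the moved references, not copied from the merged one.
    records[new_id] = {}

    history.append({
        "action": "split",
        "from": canonical_id,
        "to": new_id,
        "moved_refs": list(refs_to_move),
    })
    # Both entities changed, so anything derived from either must be recomputed.
    return {canonical_id, new_id}
```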