
Overview

You will show bad data to a GP. It’s inevitable. A company in your market map that shut down two years ago. Funding amounts that are wrong. Employee counts that are wildly inaccurate. A competitor analysis that misses obvious competitors or includes companies that aren’t actually competitive. This happens to everyone building data infrastructure for VC. The question isn’t whether you’ll have data quality problems. The question is how you minimize them, how you surface them when they exist, and how you maintain trust with your investment team when bad data inevitably makes it through.

Data quality in VC is particularly challenging because you’re aggregating data from multiple external sources, none of which are perfect. Crunchbase has gaps. PitchBook has errors. LinkedIn data is stale. Your CRM has whatever your team bothered to enter. Company websites lie. Every data source has different coverage, accuracy, and freshness characteristics. Unlike operational systems where you control data entry and can enforce validation rules, VC data comes from the outside world. You don’t control when companies update their information. You can’t force data vendors to be more accurate. You’re playing defense, trying to catch errors before they damage your credibility with the people who make investment decisions.

This chapter covers how to think about data quality in VC, how to choose and validate data sources, how to handle conflicting information, and how to build trust with your investment team despite the inherent messiness of external data.

Understanding Data Source Trust Levels

Not all data is equally trustworthy. The first step in data quality is understanding which sources you can trust for what purposes.

High trust: Your firm’s internal data

Your fund’s portfolio companies, investment amounts, ownership percentages, board seats, LP capital calls, distributions. This is data your firm generates and controls. It should be your highest quality data because you have direct access to the source. If your internal fund operations data is wrong, that’s a problem you can and should fix immediately. Use this data as ground truth. When external sources conflict with your internal data about portfolio companies, your internal data is almost always correct.

Medium trust: Established data vendors for their core competency

PitchBook for funding data. LinkedIn for employee counts and job changes. These vendors specialize in their domains and have reasonably good accuracy for what they focus on. They’re not perfect, but they’re reliable enough to use for analysis and decision-making. The tradeoff: PitchBook is usually accurate but slow. They verify information before publishing, which means there’s lag. A Series B that closed last week might not appear in PitchBook for two weeks. Signal data providers like Specter or Harmonic prioritize speed over completeness. They surface companies quickly but aren’t trying to be comprehensive. If you need accuracy and completeness, use PitchBook. If you need speed and real-time signals, use signal providers. Pick based on your use case.

Your CRM data falls into this category too. It’s only as good as what your team enters, but if your team is diligent about tracking companies, it’s reasonably trustworthy. If your CRM says you passed on a company, that’s probably correct. If it says you never talked to a company, that might just mean nobody logged it.

Low trust: Scraped, derived, or transcribed data

Meeting transcriptions from voice-to-text systems. Data scraped from company websites. Derived metrics that combine multiple sources. AI-generated summaries of company descriptions. This data is useful but shouldn’t be trusted without verification. Meeting transcriptions get names wrong, mishear technical terms, and miss context. Scraped data is often stale (companies change but websites don’t update). Derived metrics inherit errors from their sources and add their own. AI summaries hallucinate. Use this data for context and exploration, not for definitive analysis.

Why trust levels matter

This hierarchy should guide your decisions about what data to use for what purposes. Building a market map to present to GPs? Use PitchBook or your CRM, not scraped website data. Feeding context into an LLM for research? Lower trust data is fine because you’re exploring, not deciding. Calculating portfolio performance for LP reporting? Only use high trust internal data. When you’re deciding whether to build a feature that depends on certain data, consider the trust level. If it requires high trust data but you only have medium trust sources, either find better data or don’t build the feature. Showing low trust data to GPs without caveats damages your credibility.
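One lightweight way to make this hierarchy operational is to tag every source in your pipeline with a trust level and gate what each purpose is allowed to consume. Below is a minimal sketch in Python; the source names, purposes, and thresholds are assumptions to illustrate the idea, not a prescribed schema:

```python
from enum import IntEnum

class Trust(IntEnum):
    LOW = 1      # scraped, derived, transcribed, AI-generated
    MEDIUM = 2   # established vendors for their core competency; CRM
    HIGH = 3     # your firm's internal fund data

# Hypothetical mapping of sources to trust levels
SOURCE_TRUST = {
    "fund_ops": Trust.HIGH,
    "pitchbook": Trust.MEDIUM,
    "linkedin": Trust.MEDIUM,
    "crm": Trust.MEDIUM,
    "website_scrape": Trust.LOW,
    "meeting_transcript": Trust.LOW,
}

# Hypothetical minimum trust required for each purpose
PURPOSE_MIN_TRUST = {
    "lp_reporting": Trust.HIGH,
    "gp_market_map": Trust.MEDIUM,
    "llm_research_context": Trust.LOW,
}

def allowed(source: str, purpose: str) -> bool:
    """True if data from `source` is trustworthy enough for `purpose`."""
    return SOURCE_TRUST[source] >= PURPOSE_MIN_TRUST[purpose]

assert allowed("pitchbook", "gp_market_map")          # medium trust is fine for a market map
assert not allowed("website_scrape", "lp_reporting")  # scraped data never reaches LP reporting
```

The point of the guard is not the code itself but forcing the question at build time: if a feature needs high trust data and you only have medium trust sources, the check fails and you find better data or skip the feature.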

Choosing Your Data Sources

Before worrying about validation, make the right choices about which data sources to use. This is where many funds go wrong: they subscribe to multiple overlapping data sources, try to merge them (requiring entity resolution), and end up with more complexity and not much better data.

Use your portfolio companies as ground truth

You know your portfolio companies better than any external data source. Use them to validate the accuracy of vendors. Take 10-20 of your portfolio companies and check how accurately each vendor represents them (a code sketch of this check follows the list):
  • Is the funding data correct (amounts, dates, investors)?
  • Are employee counts reasonably accurate?
  • Are the company descriptions accurate?
  • How quickly does the vendor update after new events (funding rounds, acquisitions)?
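Here is a minimal sketch of that spot-check in Python: compare each vendor’s record for a handful of portfolio companies against your internal ground truth and tally per-field accuracy. The field names, tolerance, and data-loading helpers are assumptions:

```python
def within_10_percent(ours: float, theirs: float) -> bool:
    """Default tolerance: vendor value within 10% of our ground truth."""
    return abs(ours - theirs) <= 0.1 * max(abs(ours), 1)

def field_accuracy(ground_truth: dict, vendor: dict, field: str, close_enough=within_10_percent) -> float:
    """Share of portfolio companies (where we know `field`) that the vendor gets right.

    Missing or mismatched vendor values both count as misses.
    """
    scored = [name for name, truth in ground_truth.items() if field in truth]
    hits = sum(
        1
        for name in scored
        if field in vendor.get(name, {})
        and close_enough(ground_truth[name][field], vendor[name][field])
    )
    return hits / len(scored) if scored else 0.0

# Usage with hypothetical data loaders:
# truth = load_portfolio_ground_truth()   # {"Acme": {"total_funding": 50_000_000, "employee_count": 120}, ...}
# for vendor_name, rows in {"pitchbook": load_pitchbook(), "specter": load_specter()}.items():
#     for field in ("total_funding", "employee_count"):
#         print(vendor_name, field, round(field_accuracy(truth, rows, field), 2))
```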
This test reveals which vendors are reliable for which data types. PitchBook might be great on funding but weak on employee counts. LinkedIn might be accurate on people but terrible on company descriptions. Your CRM might have great qualitative notes but missing quantitative data. Use this analysis to pick your authoritative source for each type of data. Don’t guess. Validate with companies you have ground truth on.

Pick one source per use case

As covered in the Entity Resolution chapter, the best way to avoid data quality problems is to not merge sources when you don’t need to. If you’re analyzing total funding in a market, pick PitchBook and use only that. Don’t try to combine PitchBook, signal providers, and your CRM. The entity resolution required to merge them creates new opportunities for errors, and you probably won’t get meaningfully better data. Only combine sources when each provides unique value. If you need funding data (PitchBook) and employee counts (LinkedIn) and qualitative notes (your CRM) for the same companies, then yes, merge them. But if you just need funding data, stick with one source.

Understand the tradeoffs

Every data source has tradeoffs. PitchBook is accurate but slow and expensive. Signal providers like Specter or Harmonic are faster but less comprehensive. LinkedIn has great employee data but mediocre company data. Your CRM has great context but only for companies you’ve talked to. Choose based on what matters more for your use case. Building a real-time deal flow alert system? Speed matters, use signal providers. Building materials for LP reporting? Accuracy matters, use PitchBook. Tracking warm intro paths? Use LinkedIn and your CRM. Don’t expect any single source to be perfect for everything. Different sources excel at different things.

The Validation Challenge

You want validation rules that catch errors before bad data reaches your investment team. In theory, this makes sense. In practice, it’s much harder than you’d expect.

Why automated validation is limited

In operational systems, you can validate that email addresses have @ symbols, that phone numbers are 10 digits, that dates are in the future for scheduled events. These rules work because the domain is constrained and you control data entry. VC data doesn’t work like this. Should Series B come after Series A? Usually, but not always (some companies skip rounds). Should valuations increase with each round? Usually, but down rounds happen. Should funding dates be chronological? Usually, but sometimes rounds are announced retroactively. Should employee counts increase over time? Usually, but companies do layoffs. Every validation rule you write will have exceptions. Real companies don’t follow clean patterns. Markets are messy. You’ll either have rules so strict they flag too many false positives (wasting time investigating legitimate data), or rules so loose they miss real errors.

Domain knowledge over automation

The most effective validation is humans with industry knowledge reviewing data. Someone who knows the fintech space will immediately spot that “Stripe raised a $10M Series A in 2023” is wrong because Stripe raised their Series A over a decade ago. Someone who follows AI infrastructure companies will know which competitors belong in a market map. Automated systems won’t catch these errors because they lack context. This is why you should know the industries and markets your fund focuses on. If you’re building data infrastructure for a fintech fund, learn fintech. If you’re supporting a deep tech fund, understand deep tech. Domain knowledge is your best validation tool.

Basic sanity checks that help

Some validation is still worth doing (see the sketch after this list):
  • Funding dates should be reasonable (not in the future, not before the company was founded)
  • Funding amounts should be positive numbers
  • URLs should be valid domains
  • Employee counts shouldn’t be negative
  • Location data should be real cities/countries
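A minimal sketch of these sanity checks in Python; the record schema and field names are assumptions, and in practice you would run equivalent checks wherever your ingestion or warehouse layer lives (location validation needs a reference list of cities/countries, so it is omitted here):

```python
from datetime import date

def sanity_issues(company: dict) -> list[str]:
    """Return human-readable problems with a company record (empty list = passes)."""
    issues = []
    founded, funding_date = company.get("founded"), company.get("last_funding_date")
    if funding_date and funding_date > date.today():
        issues.append("funding date is in the future")
    if funding_date and founded and funding_date < founded:
        issues.append("funding date is before the founding date")
    amount = company.get("last_funding_amount")
    if amount is not None and amount <= 0:
        issues.append("funding amount is not a positive number")
    employees = company.get("employee_count")
    if employees is not None and employees < 0:
        issues.append("employee count is negative")
    domain = company.get("domain", "")
    if domain and ("." not in domain or " " in domain):
        issues.append("domain does not look like a valid URL")
    return issues

print(sanity_issues({"founded": date(2021, 1, 1),
                     "last_funding_date": date(2020, 6, 1),
                     "employee_count": -50}))
# ['funding date is before the founding date', 'employee count is negative']
```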
These catch data entry errors and obvious problems. They won’t catch subtle errors, but they prevent embarrassing mistakes like showing a company with -50 employees.

The gut feel test

When you build an analysis, look at it yourself before showing it to anyone. Does it pass the gut feel test? If you’re showing a market map of 200 AI companies and you’ve never heard of 150 of them, something is probably wrong with your sourcing criteria. If funding totals seem way too high or too low compared to what you know about the market, investigate. Your intuition catches problems that automated validation misses. This is another reason why domain knowledge matters. You need a gut feel for what’s reasonable.

Handling Data Staleness

Data gets stale. Companies update their information irregularly. Data vendors scrape on their own schedules. Your CRM is only as fresh as your last conversation with a company. Staleness is inevitable, but you can manage it.

Surface freshness to users

Always show when data was last updated. If you’re displaying employee count, show “500 employees (as of Dec 2024).” If you’re showing funding data, show “Series B, $50M (announced May 2025).” Don’t present stale data as if it’s current. This does two things. First, it sets appropriate expectations. Users know whether to trust the data for current decisions. Second, it shifts responsibility. If a GP uses 6-month-old employee counts and makes a decision based on that, they knew the data was old. You’re not hiding staleness.

Push vendors to refresh

Some data vendors have refresh or rescrape endpoints. If you need current data on a specific company, you can request that they update it. This works for high-priority companies (companies you’re actively diligencing, portfolio companies) but doesn’t scale to refreshing your entire database daily. Use selective refresh strategically. When you’re preparing a market map for Monday’s partner meeting, refresh the key companies in that market over the weekend. Don’t try to keep everything fresh all the time.

Know when staleness matters

For some use cases, stale data is fine. If you’re analyzing funding trends over the past five years, using data from three months ago is perfectly adequate. If you’re building a list of potential competitors for a portfolio company, employee counts from six months ago are probably close enough. Staleness matters most for real-time decision-making. If a GP is deciding whether to take a meeting with a company this week, you want current information. If you’re doing background research on a market, last quarter’s data is fine. Understand which of your use cases are time-sensitive and prioritize freshness there.
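A minimal sketch of putting both ideas together, surfacing the as-of date next to a value and flagging it when it is older than the tolerance for a given use case; the threshold values are assumptions you would tune per workflow:

```python
from datetime import date

# Hypothetical staleness tolerances per use case, in days
MAX_AGE_DAYS = {
    "live_deal_screening": 14,
    "market_research": 180,
}

def with_freshness(label: str, value, as_of: date, use_case: str) -> str:
    """Format a metric with its as-of date and flag it if it is too old for the use case."""
    age_days = (date.today() - as_of).days
    text = f"{label}: {value} (as of {as_of:%b %Y})"
    if age_days > MAX_AGE_DAYS.get(use_case, 90):
        text += "  [stale for this use case - consider a refresh]"
    return text

print(with_freshness("Employees", 500, date(2024, 12, 1), "live_deal_screening"))
# Employees: 500 (as of Dec 2024)  [stale for this use case - consider a refresh]
```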

Building Trust with GPs

You will show bad data to your investment team. When it happens, how you handle it determines whether they trust you going forward.

Demo to friendly associates first

Before presenting anything to partners or GPs, show it to a junior associate you trust. Someone who knows the market, will give you honest feedback, and won’t hold mistakes against you. Ask them to spot-check the data: do these companies make sense? Are the numbers reasonable? Are there obvious errors? This catches embarrassing mistakes before they damage your credibility with decision-makers. An associate telling you “Airbnb didn’t raise a Series A in 2023” is much better than a GP saying it in front of the partnership.

Always cite your data sources

Every analysis, dashboard, or report should include where the data came from. “Funding data from PitchBook, as of January 2026.” “Employee counts from LinkedIn.” “Competitive analysis based on Crunchbase and CRM data.” This does two things. First, it sets expectations. People know PitchBook funding data is reliable. They know LinkedIn employee counts are approximate. They calibrate their trust appropriately. Second, when data is wrong, you can shift responsibility to the vendor. “This is what PitchBook reported” is much better than “this is what I analyzed” when the data is wrong.

Entity resolution makes blame-shifting much harder

When you combine multiple data sources, you can’t easily shift blame. If your market map merges signal providers, PitchBook, and your CRM, and the funding data is wrong, whose fault is it? The data vendor, the entity resolution algorithm that matched companies incorrectly, or your data model that chose which source to trust? This is one of the hidden costs of entity resolution. Simple, single-source analyses let you point to the data provider when things are wrong. Complex, merged analyses make you responsible for errors even when the underlying data was bad.

Learn from each mistake

When bad data reaches GPs, figure out how it happened. Was the source data wrong? Did your entity resolution merge the wrong companies? Did you pull data from a low-trust source when you should have used a high-trust source? Was the data stale and you didn’t surface that? Fix the root cause. If PitchBook has wrong data, report it to them. If your entity resolution is merging companies incorrectly, improve it or add human review. If you used low-trust data for high-stakes analysis, change your workflow to prevent that. Don’t just apologize and move on. Each mistake is a chance to improve your data infrastructure.

Accept that perfect data doesn’t exist

Even with perfect processes, external data will have errors. Companies lie on their websites. Funding announcements exaggerate. Journalists misreport. Data vendors make mistakes. You can minimize errors but not eliminate them. Set this expectation with your investment team. “We use PitchBook for funding data, which is generally accurate but has occasional errors. When we find errors, we report them to PitchBook.” This is better than promising perfect data and failing to deliver.

Prevention and Detection

Good data quality comes from both preventing bad data from entering your systems and detecting it when it does.

Prevention: Good ingestion practices

When you ingest data from external sources (a code sketch follows the list):
  • Validate basic formats (URLs are URLs, dates are dates, numbers are numbers)
  • Normalize data (consistent date formats, consistent naming, lowercase domains)
  • Log what you received so you can trace errors back to the source
  • Version your data so you can see what changed and when
  • Flag data from low-trust sources so it doesn’t get used inappropriately
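Here is a minimal sketch of an ingestion step that applies several of these practices (format handling, normalization, logging the raw payload, and a trust flag); the schema, source names, and trust labels are assumptions:

```python
import json
import logging
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("ingestion")

def ingest_company(raw: dict, source: str, trust: str) -> dict:
    """Normalize one raw company record and keep enough metadata to trace it later."""
    log.info("received from %s: %s", source, json.dumps(raw))  # trace errors back to the source

    domain = (raw.get("domain") or "").strip().lower()
    domain = domain.removeprefix("https://").removeprefix("http://").rstrip("/")  # lowercase, scheme stripped
    employees = raw.get("employees")
    return {
        "name": (raw.get("name") or "").strip(),
        "domain": domain,
        "employee_count": int(employees) if str(employees).isdigit() else None,   # numbers are numbers
        "source": source,                                     # where it came from
        "trust": trust,                                       # so low-trust data is not used inappropriately
        "ingested_at": datetime.now(timezone.utc).isoformat(),  # crude versioning: what changed and when
    }

print(ingest_company({"name": " Acme AI ", "domain": "https://Acme.ai/", "employees": "42"},
                     source="website_scrape", trust="low"))
```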
These practices don’t prevent all errors, but they prevent careless mistakes and make debugging easier when problems arise.

Detection: Monitoring for obvious errors

Set up automated tests that run frequently to catch data quality issues before they reach users. dbt (data build tool) tests are particularly good for this. They let you define tests in SQL that run against your data warehouse and alert you when things break. Examples of useful dbt tests for VC data (see the sketch after this list):
  • Uniqueness: Each company should have only one canonical ID
  • Not null: Required fields like company name, founding date are populated
  • Relationships: Every investment references a valid company and investor
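In dbt these are one-line declarations in a schema.yml file, but to make what they actually check concrete, here is a minimal Python sketch of the equivalent SQL assertions; the table and column names, and the run_query helper, are assumptions:

```python
# Each query selects offending rows; an empty result means the test passes.
CHECKS = {
    "unique_company_id": """
        SELECT company_id FROM companies
        GROUP BY company_id HAVING COUNT(*) > 1""",
    "not_null_company_name": """
        SELECT company_id FROM companies WHERE name IS NULL""",
    "investment_references_valid_company": """
        SELECT i.investment_id FROM investments i
        LEFT JOIN companies c ON c.company_id = i.company_id
        WHERE c.company_id IS NULL""",
}

def run_checks(run_query):
    """`run_query` is a hypothetical callable: SQL string -> list of offending rows."""
    failures = {}
    for name, sql in CHECKS.items():
        rows = run_query(sql)
        if rows:
            failures[name] = rows
            print(f"FAILED {name}: {len(rows)} offending rows")  # wire this to real alerting instead
    return failures
```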
Run these tests frequently (daily or after each data ingestion). When tests fail, you get alerts before bad data reaches dashboards or analyses.

Why automated validation is even harder than it seems

You might think you can set up alerts for obviously wrong data: funding amounts over $1B, companies with over 100,000 employees, founding dates before 1900, or sudden 10x changes in employee count. But all of these show up in legitimate data:
  • Later-stage companies do raise $1B+ rounds (OpenAI, Stripe, etc.)
  • Large public and private companies do have 100,000+ employees
  • Old companies founded in the 1800s are still operating
  • Companies do have massive hiring sprees or layoffs that change headcount 10x
Even “obvious” validation rules have too many false positives to be useful. You end up investigating legitimate data constantly, which trains you to ignore the alerts, which means you miss real errors when they occur. The only reliable automated checks are for truly impossible data: negative funding amounts, null values in required fields, dates in the future, invalid data types. Beyond that, you need domain knowledge and human review.

Human review for important analyses

Anything that will be seen by GPs or partners should have human review. You look at it (gut feel test), a friendly associate looks at it (domain knowledge check), and only then does it go to decision-makers. This doesn’t scale to every query or dashboard, but for high-stakes deliverables, human review is essential. Automation is powerful, but humans catch errors that automated validation misses.

The Bottom Line

Data quality in VC is hard. You’re aggregating external data you don’t control, from sources with different accuracy and freshness characteristics, for users who make high-stakes decisions based on that data. You will show bad data to GPs. It’s inevitable. Your job is to minimize errors, surface uncertainty when it exists, and maintain trust despite the inherent messiness of external data.
  • Understand trust levels: Internal data > established vendors > scraped data. Use this to guide what data you use for what purposes.
  • Use portfolio companies to validate: Test vendor accuracy against companies you know well. Use this to pick authoritative sources for each data type.
  • Pick one source per problem: Don’t try to merge PitchBook and signal providers for the same use case. Pick the best source and stick with it. Entity resolution creates new opportunities for errors.
  • Domain knowledge over automation: Automated validation rules have limited value. Humans with industry knowledge catch more errors than algorithms.
  • Surface freshness: Always show when data was last updated. Don’t present stale data as current.
  • Always cite sources: Let users know where data came from so they can calibrate trust and so you can shift blame when vendors are wrong.
  • Demo to friendly associates first: Catch embarrassing mistakes before they damage credibility with GPs.
  • Accept imperfection: External data will have errors. Set expectations appropriately rather than promising perfect data.
In the next chapter, we’ll cover data warehousing and analytics: once you have data (imperfect as it is), how do you structure it, analyze it, and surface insights that GPs actually use?