Overview
You will show bad data to a GP. It’s inevitable. A company in your market map that shut down two years ago. Funding amounts that are wrong. Employee counts that are wildly inaccurate. A competitor analysis that misses obvious competitors or includes companies that aren’t actually competitive. This happens to everyone building data infrastructure for VC. The question isn’t whether you’ll have data quality problems. The question is how you minimize them, how you surface them when they exist, and how you maintain trust with your investment team when bad data inevitably makes it through.

Data quality in VC is particularly challenging because you’re aggregating data from multiple external sources, none of which are perfect. Crunchbase has gaps. PitchBook has errors. LinkedIn data is stale. Your CRM has whatever your team bothered to enter. Company websites lie. Every data source has different coverage, accuracy, and freshness characteristics.

Unlike operational systems where you control data entry and can enforce validation rules, VC data comes from the outside world. You don’t control when companies update their information. You can’t force data vendors to be more accurate. You’re playing defense, trying to catch errors before they damage your credibility with the people who make investment decisions.

This chapter covers how to think about data quality in VC, how to choose and validate data sources, how to handle conflicting information, and how to build trust with your investment team despite the inherent messiness of external data.

Understanding Data Source Trust Levels
Not all data is equally trustworthy. The first step in data quality is understanding which sources you can trust for what purposes.

High trust: Your firm’s internal data
Your fund’s portfolio companies, investment amounts, ownership percentages, board seats, LP capital calls, distributions. This is data your firm generates and controls. It should be your highest quality data because you have direct access to the source. If your internal fund operations data is wrong, that’s a problem you can and should fix immediately. Use this data as ground truth. When external sources conflict with your internal data about portfolio companies, your internal data is almost always correct.

Medium trust: Established data vendors for their core competency
PitchBook for funding data. LinkedIn for employee counts and job changes. These vendors specialize in their domains and have reasonably good accuracy for what they focus on. They’re not perfect, but they’re reliable enough to use for analysis and decision-making. The tradeoff: PitchBook is usually accurate but slow. They verify information before publishing, which means there’s lag. A Series B that closed last week might not appear in PitchBook for two weeks. Signal data providers like Specter or Harmonic prioritize speed over completeness. They surface companies quickly but aren’t trying to be comprehensive. If you need accuracy and completeness, use PitchBook. If you need speed and real-time signals, use signal providers. Pick based on your use case.

Your CRM data falls into this category too. It’s only as good as what your team enters, but if your team is diligent about tracking companies, it’s reasonably trustworthy. If your CRM says you passed on a company, that’s probably correct. If it says you never talked to a company, that might just mean nobody logged it.

Low trust: Scraped, derived, or transcribed data
Meeting transcriptions from voice-to-text systems. Data scraped from company websites. Derived metrics that combine multiple sources. AI-generated summaries of company descriptions. This data is useful but shouldn’t be trusted without verification. Meeting transcriptions get names wrong, mishear technical terms, and miss context. Scraped data is often stale (companies change but websites don’t update). Derived metrics inherit errors from their sources and add their own. AI summaries hallucinate. Use this data for context and exploration, not for definitive analysis.

Why trust levels matter
This hierarchy should guide your decisions about what data to use for what purposes. Building a market map to present to GPs? Use PitchBook or your CRM, not scraped website data. Feeding context into an LLM for research? Lower trust data is fine because you’re exploring, not deciding. Calculating portfolio performance for LP reporting? Only use high trust internal data. When you’re deciding whether to build a feature that depends on certain data, consider the trust level. If it requires high trust data but you only have medium trust sources, either find better data or don’t build the feature. Showing low trust data to GPs without caveats damages your credibility.
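One lightweight way to make the hierarchy operational is to tag every source with a trust tier and check that tier against the use case before data flows into an analysis. A minimal sketch, assuming illustrative source and use-case names rather than any prescribed taxonomy:

```python
from enum import IntEnum

class Trust(IntEnum):
    LOW = 1     # scraped, derived, transcribed, AI-generated
    MEDIUM = 2  # established vendors, CRM
    HIGH = 3    # internal fund operations data

# Illustrative mapping; adjust to the sources your fund actually uses.
SOURCE_TRUST = {
    "internal_fund_ops": Trust.HIGH,
    "pitchbook": Trust.MEDIUM,
    "linkedin": Trust.MEDIUM,
    "crm": Trust.MEDIUM,
    "website_scrape": Trust.LOW,
    "meeting_transcript": Trust.LOW,
}

# Minimum trust each use case requires.
USE_CASE_MINIMUM = {
    "lp_reporting": Trust.HIGH,
    "gp_market_map": Trust.MEDIUM,
    "llm_research_context": Trust.LOW,
}

def allowed(source: str, use_case: str) -> bool:
    """Return True if the source is trustworthy enough for the use case."""
    return SOURCE_TRUST[source] >= USE_CASE_MINIMUM[use_case]

assert allowed("pitchbook", "gp_market_map")
assert not allowed("website_scrape", "lp_reporting")
```

Even a crude check like this keeps low-trust data from quietly ending up in high-stakes outputs.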
Choosing Your Data Sources
Before worrying about validation, make the right choices about which data sources to use. This is where many funds go wrong: they subscribe to multiple overlapping data sources, try to merge them (requiring entity resolution), and end up with more complexity and not much better data.

Use your portfolio companies as ground truth
You know your portfolio companies better than any external data source. Use them to validate the accuracy of vendors. Take 10-20 of your portfolio companies and check how accurately each vendor represents them (see the sketch after this list):
- Is the funding data correct (amounts, dates, investors)?
- Are employee counts reasonably accurate?
- Are the company descriptions accurate?
- How quickly does the vendor update after new events (funding rounds, acquisitions)?
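A rough way to run this comparison is to score each vendor by how often its records for your portfolio companies agree with your internal data. The sketch below assumes you can export both sides as dictionaries keyed by company name; the field names and tolerance are illustrative:

```python
def funding_matches(internal: dict, vendor: dict, tolerance: float = 0.05) -> bool:
    """Compare last-round amounts within a relative tolerance (vendor data is rarely exact)."""
    if vendor.get("last_round_amount") is None:
        return False
    a, b = internal["last_round_amount"], vendor["last_round_amount"]
    return abs(a - b) <= tolerance * a

def score_vendor(portfolio: dict, vendor_data: dict) -> float:
    """Fraction of portfolio companies where the vendor's funding data matches ground truth."""
    hits = 0
    for name, internal in portfolio.items():
        vendor = vendor_data.get(name)
        if vendor and funding_matches(internal, vendor):
            hits += 1
    return hits / len(portfolio)

portfolio = {"Acme Robotics": {"last_round_amount": 12_000_000}}  # your internal records
vendor_a = {"Acme Robotics": {"last_round_amount": 12_500_000}}   # vendor export
print(f"Vendor A funding accuracy: {score_vendor(portfolio, vendor_a):.0%}")
```

The same pattern extends to employee counts, descriptions, and update lag: score each vendor against companies you know, then pick based on how it performs on the fields you care about.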
The Validation Challenge
You want validation rules that catch errors before bad data reaches your investment team. In theory, this makes sense. In practice, it’s much harder than you’d expect.

Why automated validation is limited
In operational systems, you can validate that email addresses have @ symbols, that phone numbers are 10 digits, that dates are in the future for scheduled events. These rules work because the domain is constrained and you control data entry. VC data doesn’t work like this. Should Series B come after Series A? Usually, but not always (some companies skip rounds). Should valuations increase with each round? Usually, but down rounds happen. Should funding dates be chronological? Usually, but sometimes rounds are announced retroactively. Should employee counts increase over time? Usually, but companies do layoffs. Every validation rule you write will have exceptions. Real companies don’t follow clean patterns. Markets are messy. You’ll either have rules so strict they flag too many false positives (wasting time investigating legitimate data), or rules so loose they miss real errors.

Domain knowledge over automation
The most effective validation is humans with industry knowledge reviewing data. Someone who knows the fintech space will immediately spot that “Stripe raised a $10M Series A in 2023” is wrong because Stripe raised their Series A over a decade ago. Someone who follows AI infrastructure companies will know which competitors belong in a market map. Automated systems won’t catch these errors because they lack context. This is why you should know the industries and markets your fund focuses on. If you’re building data infrastructure for a fintech fund, learn fintech. If you’re supporting a deep tech fund, understand deep tech. Domain knowledge is your best validation tool.

Basic sanity checks that help
Some validation is still worth doing (a sketch follows the list):
- Funding dates should be reasonable (not in the future, not before the company was founded)
- Funding amounts should be positive numbers
- URLs should be valid domains
- Employee counts shouldn’t be negative
- Location data should be real cities/countries
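These checks translate directly into code. A sketch of record-level checks along these lines, assuming a simple dictionary per company with illustrative field names (location checks would need a reference list of real cities and countries, so they are omitted here):

```python
from datetime import date

def sanity_check(company: dict) -> list[str]:
    """Return a list of problems; an empty list means the record passed the basic checks."""
    problems = []
    founded = company.get("founded")
    for rnd in company.get("rounds", []):
        if rnd["date"] > date.today():
            problems.append(f"funding date {rnd['date']} is in the future")
        if founded and rnd["date"] < founded:
            problems.append(f"funding date {rnd['date']} precedes founding date {founded}")
        if rnd["amount"] <= 0:
            problems.append(f"non-positive funding amount {rnd['amount']}")
    if company.get("employee_count", 0) < 0:
        problems.append("negative employee count")
    domain = company.get("domain", "")
    if domain and ("." not in domain or " " in domain):
        problems.append(f"suspicious domain {domain!r}")
    return problems

record = {
    "founded": date(2021, 3, 1),
    "rounds": [{"date": date(2020, 6, 1), "amount": 5_000_000}],
    "employee_count": 40,
    "domain": "example.com",
}
print(sanity_check(record))  # flags the funding date that precedes the founding date
```

Checks like these catch obviously broken records without pretending to judge whether a plausible-looking number is actually true.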
Handling Data Staleness
Data gets stale. Companies update their information irregularly. Data vendors scrape on their own schedules. Your CRM is only as fresh as your last conversation with a company. Staleness is inevitable, but you can manage it.

Surface freshness to users
Always show when data was last updated. If you’re displaying employee count, show “500 employees (as of Dec 2024).” If you’re showing funding data, show “Series B, $50M (announced May 2025).” Don’t present stale data as if it’s current. This does two things. First, it sets appropriate expectations. Users know whether to trust the data for current decisions. Second, it shifts responsibility. If a GP uses 6-month-old employee counts and makes a decision based on that, they knew the data was old. You’re not hiding staleness.

Push vendors to refresh
Some data vendors have refresh or rescrape endpoints. If you need current data on a specific company, you can request that they update it. This works for high-priority companies (companies you’re actively diligencing, portfolio companies) but doesn’t scale to refreshing your entire database daily. Use selective refresh strategically. When you’re preparing a market map for Monday’s partner meeting, refresh the key companies in that market over the weekend. Don’t try to keep everything fresh all the time.

Know when staleness matters
For some use cases, stale data is fine. If you’re analyzing funding trends over the past five years, using data from three months ago is perfectly adequate. If you’re building a list of potential competitors for a portfolio company, employee counts from six months ago are probably close enough. Staleness matters most for real-time decision-making. If a GP is deciding whether to take a meeting with a company this week, you want current information. If you’re doing background research on a market, last quarter’s data is fine. Understand which of your use cases are time-sensitive and prioritize freshness there.
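A small helper makes “surface freshness” concrete: store an as-of date and source alongside every externally sourced value, and render them whenever the value is displayed. A minimal sketch with illustrative field names and thresholds:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Sourced:
    value: object
    source: str
    as_of: date

    def display(self) -> str:
        """Render the value with its provenance so stale data is never shown as current."""
        return f"{self.value} ({self.source}, as of {self.as_of:%b %Y})"

    def is_stale(self, max_age_days: int) -> bool:
        """True if the value is older than the freshness this use case requires."""
        return (date.today() - self.as_of).days > max_age_days

headcount = Sourced(500, "LinkedIn", date(2024, 12, 1))
print(headcount.display())                  # "500 (LinkedIn, as of Dec 2024)"
print(headcount.is_stale(max_age_days=90))  # time-sensitive use cases can set tighter thresholds
```

The per-use-case threshold is the useful part: background research can tolerate a large max_age_days, while data supporting this week’s meeting decision cannot.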
Building Trust with GPs
You will show bad data to your investment team. When it happens, how you handle it determines whether they trust you going forward.

Demo to friendly associates first
Before presenting anything to partners or GPs, show it to a junior associate you trust. Someone who knows the market, will give you honest feedback, and won’t hold mistakes against you. Ask them to spot-check the data: do these companies make sense? Are the numbers reasonable? Are there obvious errors? This catches embarrassing mistakes before they damage your credibility with decision-makers. An associate telling you “Airbnb didn’t raise a Series A in 2023” is much better than a GP saying it in front of the partnership.

Always cite your data sources
Every analysis, dashboard, or report should include where the data came from. “Funding data from PitchBook, as of January 2026.” “Employee counts from LinkedIn.” “Competitive analysis based on Crunchbase and CRM data.” This does two things. First, it sets expectations. People know PitchBook funding data is reliable. They know LinkedIn employee counts are approximate. They calibrate their trust appropriately. Second, when data is wrong, you can shift responsibility to the vendor. “This is what PitchBook reported” is much better than “this is what I analyzed.”

Entity resolution makes blame-shifting much harder
When you combine multiple data sources, you can’t easily shift blame. If your market map merges signal providers, PitchBook, and your CRM, and the funding data is wrong, whose fault is it? The data vendor, the entity resolution algorithm that matched companies incorrectly, or your data model that chose which source to trust? This is one of the hidden costs of entity resolution. Simple, single-source analyses let you point to the data provider when things are wrong. Complex, merged analyses make you responsible for errors even when the underlying data was bad.

Learn from each mistake
When bad data reaches GPs, figure out how it happened. Was the source data wrong? Did your entity resolution merge the wrong companies? Did you pull data from a low-trust source when you should have used a high-trust source? Was the data stale and you didn’t surface that? Fix the root cause. If PitchBook has wrong data, report it to them. If your entity resolution is merging companies incorrectly, improve it or add human review. If you used low-trust data for high-stakes analysis, change your workflow to prevent that. Don’t just apologize and move on. Each mistake is a chance to improve your data infrastructure.

Accept that perfect data doesn’t exist
Even with perfect processes, external data will have errors. Companies lie on their websites. Funding announcements exaggerate. Journalists misreport. Data vendors make mistakes. You can minimize errors but not eliminate them. Set this expectation with your investment team. “We use PitchBook for funding data, which is generally accurate but has occasional errors. When we find errors, we report them to PitchBook.” This is better than promising perfect data and failing to deliver.
Prevention and Detection
Good data quality comes from both preventing bad data from entering your systems and detecting it when it does.

Prevention: Good ingestion practices
When you ingest data from external sources (a sketch follows the list):
- Validate basic formats (URLs are URLs, dates are dates, numbers are numbers)
- Normalize data (consistent date formats, consistent naming, lowercase domains)
- Log what you received so you can trace errors back to the source
- Version your data so you can see what changed and when
- Flag data from low-trust sources so it doesn’t get used inappropriately
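A sketch of an ingestion step that applies these practices: normalize what you can, record provenance and a hash of the raw payload for tracing, and tag the trust level so downstream code can filter on it. The field names are illustrative, not a prescribed schema:

```python
import hashlib
import json
from datetime import datetime, timezone

def ingest(raw: dict, source: str, trust: str) -> dict:
    """Normalize one raw record and attach the metadata needed to trace and filter it later."""
    return {
        # Normalization: consistent casing, trimmed whitespace, typed numbers.
        "name": raw["name"].strip(),
        "domain": raw.get("domain", "").lower().rstrip("/"),
        "employee_count": int(raw["employee_count"]) if raw.get("employee_count") else None,
        # Provenance: where it came from, when, and a fingerprint of the raw payload.
        "source": source,
        "trust": trust,  # "high" / "medium" / "low"
        "ingested_at": datetime.now(timezone.utc).isoformat(),
        "raw_sha256": hashlib.sha256(json.dumps(raw, sort_keys=True).encode()).hexdigest(),
    }

row = ingest(
    {"name": " Acme Robotics ", "domain": "Acme.ai/", "employee_count": "42"},
    source="website_scrape",
    trust="low",
)
```

Keeping the raw payload (or at least its hash) and the ingestion timestamp is what lets you trace an error back to the source and see what changed between versions.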
Detection: Integrity checks
Some constraints should always hold in your data model. Unlike the fuzzy validation rules discussed earlier, violations here indicate real errors rather than messy market reality (see the sketch after this list):
- Uniqueness: Each company should have only one canonical ID
- Not null: Required fields like company name, founding date are populated
- Relationships: Every investment references a valid company and investor
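Because these constraints have no legitimate exceptions, they are safe to enforce automatically. A sketch of what checking them might look like over simple in-memory records (a relational database would express the same rules as unique, not-null, and foreign-key constraints):

```python
def integrity_errors(companies: list[dict], investors: list[dict], investments: list[dict]) -> list[str]:
    """Structural checks that should never fail; any hit is a data bug, not messy market reality."""
    errors = []

    # Uniqueness: one canonical ID per company.
    company_ids = [c["id"] for c in companies]
    if len(company_ids) != len(set(company_ids)):
        errors.append("duplicate company IDs")

    # Not null: required fields are populated.
    for c in companies:
        if not c.get("name") or not c.get("founded"):
            errors.append(f"company {c.get('id')}: missing name or founding date")

    # Relationships: every investment references a valid company and investor.
    known_companies = set(company_ids)
    known_investors = {i["id"] for i in investors}
    for inv in investments:
        if inv["company_id"] not in known_companies:
            errors.append(f"investment {inv['id']}: unknown company {inv['company_id']}")
        if inv["investor_id"] not in known_investors:
            errors.append(f"investment {inv['id']}: unknown investor {inv['investor_id']}")

    return errors
```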
Be careful with statistical anomaly detection, though. Values that look like obvious outliers are often legitimate:
- Later-stage companies do raise $1B+ rounds (OpenAI, Stripe, etc.)
- Large public and private companies do have 100,000+ employees
- Old companies founded in the 1800s are still operating
- Companies do have massive hiring sprees or layoffs that change headcount 10x
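Because every one of these “anomalies” can be legitimate, the safer pattern is to flag outliers for human review rather than reject them automatically, consistent with the earlier point that domain knowledge beats automation. A minimal sketch with illustrative thresholds:

```python
def review_flags(company: dict) -> list[str]:
    """Flag surprising values for a human with domain knowledge to check; never auto-reject them."""
    flags = []
    for rnd in company.get("rounds", []):
        if rnd["amount"] >= 1_000_000_000:
            flags.append(f"${rnd['amount']:,} round: plausible for late stage, verify the source")
    if company.get("employee_count", 0) >= 100_000:
        flags.append("100k+ employees: plausible for a large company, verify")
    prev, curr = company.get("prev_employee_count"), company.get("employee_count")
    if prev and curr and (curr >= 10 * prev or curr <= prev / 10):
        flags.append("headcount changed 10x: could be a hiring spree, layoffs, or a data error")
    return flags
```

Routing these flags to someone who knows the market keeps the rare-but-real cases in your data while still catching the errors that matter.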