Overview
How you model your data determines what questions you can answer, how fast you can build new features, and whether your systems can adapt as your fund’s needs evolve. Good data modeling makes everything easier. Bad data modeling creates technical debt that slows you down for years.
VC data modeling has specific challenges. Companies change constantly (funding rounds, employee growth, pivots). Your relationship with companies evolves through multiple stages (sourcing, diligence, investment, portfolio monitoring). People move between companies, have overlapping experiences, and maintain networks that matter for deal flow. Data comes from multiple sources with conflicting information. And you need to track history because understanding what you knew when you made an investment decision is important for learning from outcomes.
This chapter covers how to model venture capital data: the core entities and their relationships, the critical distinction between companies and deals, how to handle temporal data that changes over time, and how to structure your data warehouse using dbt to manage complexity as you add more data sources. The focus is on practical patterns that work for VC funds, not abstract database theory. These patterns come from building real systems at Inflection and dealing with messy, real-world data from multiple vendors.
Core Entities and Relationships
Your data model will vary based on what you’re building (CRM, research platform, portfolio dashboard), but most VC systems share common entities.
Companies
The companies you’re tracking: startups you might invest in, portfolio companies, competitors, acquirers. Core attributes include name, description, website, founding date, location, sector, and stage. Companies are the center of your data model. Most other entities relate to companies in some way.
Deals
Your fund’s relationship with a company. This is separate from the company itself (covered in detail in the next section). A deal represents an investment opportunity: you sourced Company X, you’re in diligence, you invested, you passed. One company can have multiple deals. You might pass on their seed round, then invest in their Series A. Each is a separate deal.
People
Founders, executives, employees. People you’ve met, people at companies you’re tracking, people at portfolio companies. Core attributes include name, email, LinkedIn URL, current company, current role. People are connected to companies through roles (covered below). The same person might work at multiple companies over time or even simultaneously (advisor roles, board seats).
Roles
The connection between people and companies. A role captures: this person works at this company in this capacity during this time period. Essential for tracking company history and founder backgrounds. Model roles as a separate entity rather than embedding them in the people or companies table. This lets you track full history: a founder worked at Google (2018-2020), then started Company X (2020-present), while also advising Company Y (2021-present).
Education
Where people went to school, what they studied, when. Similar to roles, model this separately so you can capture multiple degrees and overlapping education (joint degrees, part-time programs). This matters for analysis: which universities produce the most founders in your focus areas? Do Stanford CS grads in your pipeline perform differently than MIT grads? You need structured education data to answer these questions.
Funding Rounds
Capital raised by companies: seed, Series A, Series B, etc. Attributes include round type, amount raised, valuation (pre and post), date announced, date closed, lead investors, participating investors. Funding rounds are messy in early-stage VC. Not every company announces rounds. SAFE notes are common but often unreported. Data vendors have incomplete information. Your data model needs to handle this uncertainty (covered in “Dealing with Messy Reality” below).
Investors and Funds
Other VC firms, angels, corporate investors who participate in rounds. You care about investors for co-investment patterns, warm intro paths, and understanding who’s active in your sectors.
Critical: Model the hierarchy as Investor → Fund → Funding Round, not just Investor → Funding Round. Here’s why this matters: One investor (the firm) can have multiple funds. Sequoia Capital has multiple funds. a16z has multiple funds. Your own firm probably has multiple funds (Fund I, Fund II, etc.). When tracking who invested in a company, you need to know which specific fund made the investment, not just which firm. This matters because different funds from the same investor can make different investment decisions. Inflection Mercury Fund might pass on a deal while Inflection Mars Fund invests. If you model this as just “Inflection invested,” you lose critical information.
- Individual LPs: Personal name, SSN/tax ID, home address, bank account for distributions
- Institutional LPs: Organization name, EIN, authorized signatories, wire instructions, contact people
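The relationships described in this section (roles as their own entity, and the Investor → Fund → Funding Round hierarchy) can be sketched as tables. This is a minimal illustration using Python's built-in sqlite3 module, not a production schema. All table names, column names, and the placeholder titles are assumptions made for the example, and the rows mirror the examples from the text.

```python
import sqlite3

# In-memory database; every table and column name here is illustrative.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- A role links one person to one company for a time period.
CREATE TABLE roles (
    id INTEGER PRIMARY KEY,
    person_id INTEGER NOT NULL REFERENCES people(id),
    company_id INTEGER NOT NULL REFERENCES companies(id),
    title TEXT,
    start_year INTEGER,
    end_year INTEGER  -- NULL means the role is current
);
CREATE TABLE investors (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE funds (
    id INTEGER PRIMARY KEY,
    investor_id INTEGER NOT NULL REFERENCES investors(id),
    name TEXT NOT NULL
);
CREATE TABLE funding_rounds (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    round_type TEXT
);
-- Participation is recorded per fund, not per firm.
CREATE TABLE round_participants (
    funding_round_id INTEGER NOT NULL REFERENCES funding_rounds(id),
    fund_id INTEGER NOT NULL REFERENCES funds(id),
    is_lead INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (funding_round_id, fund_id)
);
""")

conn.executemany("INSERT INTO companies (id, name) VALUES (?, ?)",
                 [(1, "Google"), (2, "Company X"), (3, "Company Y")])
conn.execute("INSERT INTO people (id, name) VALUES (1, 'Example Founder')")
# Overlapping roles: one past role and two current ones (titles are placeholders).
conn.executemany(
    "INSERT INTO roles (person_id, company_id, title, start_year, end_year) "
    "VALUES (?, ?, ?, ?, ?)",
    [(1, 1, "Employee", 2018, 2020),
     (1, 2, "Founder", 2020, None),
     (1, 3, "Advisor", 2021, None)])
conn.execute("INSERT INTO investors (id, name) VALUES (1, 'Inflection')")
conn.executemany("INSERT INTO funds (id, investor_id, name) VALUES (?, ?, ?)",
                 [(1, 1, "Inflection Mercury Fund"), (2, 1, "Inflection Mars Fund")])
conn.execute("INSERT INTO funding_rounds (id, company_id, round_type) VALUES (1, 2, 'Seed')")
# Only the Mars fund participates; the Mercury fund has no row for this round.
conn.execute("INSERT INTO round_participants (funding_round_id, fund_id, is_lead) "
             "VALUES (1, 2, 1)")

# Current roles for the person (end_year IS NULL).
current_roles = conn.execute("""
    SELECT c.name, r.title
    FROM roles r JOIN companies c ON c.id = r.company_id
    WHERE r.person_id = 1 AND r.end_year IS NULL
    ORDER BY r.start_year
""").fetchall()

# Which fund, and through it which firm, invested in the round.
round_funds = conn.execute("""
    SELECT i.name, f.name
    FROM round_participants rp
    JOIN funds f ON f.id = rp.fund_id
    JOIN investors i ON i.id = f.investor_id
    WHERE rp.funding_round_id = 1
""").fetchall()
```

Because participation points at a fund rather than a firm, "Inflection invested" is always recoverable by joining up to the investor, while the fund-level detail is never lost.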
Companies vs. Deals: The Critical Distinction
The most important modeling decision in VC systems is separating companies from deals.
A company is an entity that exists independently of your fund. Stripe is a company. It has founders, funding rounds, employees, a product, customers. Stripe exists whether or not your fund ever talks to them.
A deal is your fund’s relationship with a company. You sourced Stripe, you met with the founders, you decided to invest (or pass), you negotiated terms, you closed the investment. The deal represents your fund’s interaction with that company.
Why separate them? Your fund can have multiple deals with the same company. You might:
- Pass on their seed round (Deal 1: Status = Passed)
- Invest in their Series A two years later (Deal 2: Status = Closed)
- Consider a follow-on in their Series B (Deal 3: Status = In Diligence)
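The three-deal scenario above can be sketched as a one-to-many relationship between a companies table and a deals table. This is a minimal illustration with sqlite3. The column names (round_label, status) are assumptions for the example rather than a prescribed schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- The company exists independently of the fund.
CREATE TABLE companies (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
-- Each deal is one interaction between the fund and a company.
CREATE TABLE deals (
    id INTEGER PRIMARY KEY,
    company_id INTEGER NOT NULL REFERENCES companies(id),
    round_label TEXT,          -- hypothetical column: which round the deal concerned
    status TEXT NOT NULL       -- e.g. 'Passed', 'Closed', 'In Diligence'
);
""")

conn.execute("INSERT INTO companies (id, name) VALUES (1, 'Company X')")
conn.executemany(
    "INSERT INTO deals (id, company_id, round_label, status) VALUES (?, ?, ?, ?)",
    [(1, 1, "Seed", "Passed"),
     (2, 1, "Series A", "Closed"),
     (3, 1, "Series B", "In Diligence")])

# All deals for one company, in the order they were created.
company_deals = conn.execute("""
    SELECT d.id, d.round_label, d.status
    FROM deals d JOIN companies c ON c.id = d.company_id
    WHERE c.name = 'Company X'
    ORDER BY d.id
""").fetchall()

# The company row is stored once, no matter how many deals reference it.
company_count = conn.execute("SELECT COUNT(*) FROM companies").fetchone()[0]
```

If status lived on the company row instead, the second and third deals would overwrite the record of the first pass.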
Dealing with Messy Reality
VC data is messy. Your data model needs to account for this rather than assuming perfect information.
Don’t assume GPs make one investment per company
As covered above, model deals separately from companies. Your fund might invest multiple times, or pass then later invest, or invest then later decline a follow-on. The data model needs to support multiple deals per company.
Don’t assume perfect funding round data
Early-stage companies raise money in ways that aren’t always captured in data sources. SAFE notes, convertible notes, rolling closes, stealth rounds. Crunchbase and PitchBook miss things, especially for pre-seed and seed companies. Your data model should allow for:
- Unknown or estimated funding amounts
- Approximate dates (Q2 2024, not a specific day)
- Rounds that might not exist (rumored but unconfirmed)
- Multiple rounds of the same type (Seed, Seed Extension, Seed II)
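One way to represent the four allowances above is to make uncertainty explicit in the record itself. The sketch below uses a Python dataclass. The field names, the enums, and the sample values are all assumptions invented for illustration, not real funding data.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional


class DatePrecision(Enum):
    DAY = "day"
    QUARTER = "quarter"
    YEAR = "year"


class Confirmation(Enum):
    CONFIRMED = "confirmed"
    RUMORED = "rumored"


@dataclass
class FundingRound:
    company: str
    round_type: str                       # normalized category, e.g. "Seed"
    round_label: str                      # label as reported, e.g. "Seed Extension"
    amount_usd: Optional[float] = None    # None means the amount is unknown
    amount_is_estimate: bool = False
    approx_date: Optional[date] = None    # first day of the period when precision is coarse
    date_precision: Optional[DatePrecision] = None
    confirmation: Confirmation = Confirmation.CONFIRMED


def describe_date(r: FundingRound) -> str:
    """Render the date at the precision we actually know."""
    if r.approx_date is None or r.date_precision is None:
        return "unknown"
    if r.date_precision is DatePrecision.QUARTER:
        quarter = (r.approx_date.month - 1) // 3 + 1
        return f"Q{quarter} {r.approx_date.year}"
    if r.date_precision is DatePrecision.YEAR:
        return str(r.approx_date.year)
    return r.approx_date.isoformat()


# Made-up rounds: three of the same type, with decreasing certainty.
rounds = [
    FundingRound("Company X", "Seed", "Seed", amount_usd=2_000_000,
                 approx_date=date(2023, 9, 14), date_precision=DatePrecision.DAY),
    FundingRound("Company X", "Seed", "Seed Extension", amount_usd=1_000_000,
                 amount_is_estimate=True,
                 approx_date=date(2024, 4, 1), date_precision=DatePrecision.QUARTER),
    FundingRound("Company X", "Seed", "Seed II", confirmation=Confirmation.RUMORED),
]

confirmed_labels = [r.round_label for r in rounds
                    if r.confirmation is Confirmation.CONFIRMED]
```

Storing a date together with its precision avoids pretending that "Q2 2024" means April 1, and keeping round_type separate from round_label lets you group all seed-type rounds without losing the vendor's wording.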
Temporal Data and History
Companies change constantly. Employee count grows. Funding rounds happen. Valuations change. Products pivot. You need to decide what history to keep and how to model it.
Start with append-only tables
The simplest approach: never update or delete data. When you get new information about a company, append a new row with a timestamp. This creates an audit trail of every change.
This gives you two tables: pitchbook_company_changes (full history) and pitchbook_companies (latest values). Applications usually query the latest values table. When you need to analyze “how did this company change over time?” you go back to the append-only table or create a custom table for those types of queries.
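A minimal sketch of the append-only pattern, using sqlite3: pitchbook_company_changes only ever receives inserts, and pitchbook_companies is derived from it as a view of the most recent row per company. The table names come from the text. The columns (company_id, employee_count, fetched_at), the record_snapshot helper, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Append-only history: rows are inserted, never updated or deleted.
conn.execute("""
CREATE TABLE pitchbook_company_changes (
    company_id TEXT NOT NULL,
    employee_count INTEGER,
    fetched_at TEXT NOT NULL   -- ISO 8601 date, so string order matches time order
)""")


def record_snapshot(company_id: str, employee_count: int, fetched_at: str) -> None:
    """Hypothetical helper: append one observation with its timestamp."""
    conn.execute(
        "INSERT INTO pitchbook_company_changes VALUES (?, ?, ?)",
        (company_id, employee_count, fetched_at))


record_snapshot("c1", 10, "2024-01-01")
record_snapshot("c1", 25, "2024-06-01")
record_snapshot("c2", 5, "2024-03-01")

# Latest values: the most recent row per company, derived from the history.
conn.execute("""
CREATE VIEW pitchbook_companies AS
SELECT ch.company_id, ch.employee_count, ch.fetched_at
FROM pitchbook_company_changes ch
JOIN (
    SELECT company_id, MAX(fetched_at) AS latest
    FROM pitchbook_company_changes
    GROUP BY company_id
) m ON m.company_id = ch.company_id AND m.latest = ch.fetched_at
""")

latest = conn.execute(
    "SELECT company_id, employee_count FROM pitchbook_companies ORDER BY company_id"
).fetchall()

# "How did this company change over time?" goes back to the history table.
history = conn.execute(
    "SELECT employee_count FROM pitchbook_company_changes "
    "WHERE company_id = 'c1' ORDER BY fetched_at"
).fetchall()
```

A view is used here for brevity. In a warehouse you would more likely materialize the latest-values table so that application queries never scan the full history.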
What history actually matters
Storage is cheap these days. Rather than deciding what history to keep and what to discard, it’s often simpler to just keep everything in your append-only staging tables. A year of daily company snapshots for thousands of companies costs pennies in Postgres or cloud data warehouse storage.
The practical considerations are query performance and data warehouse costs (if you’re using Snowflake or BigQuery where you pay per query). But even there, you’re usually querying the transformed and materialized “latest values” tables, not the full historical append-only tables.
Keep all history in staging tables. When you need to analyze “how did this company change over time?” you have the data. When you don’t need it, you’re not querying it, so it doesn’t cost anything. This is simpler than trying to decide upfront what historical data might be valuable someday.
The main exception: if you’re storing truly high-volume data (real-time metrics, streaming data, logs), you might need retention policies. But for standard VC data (company attributes, funding rounds, people roles), just keep it all.
The dbt Approach to Data Modeling
As you add more data sources (PitchBook, LinkedIn, Harmonic, your CRM, scraped data), your data model needs structure to stay manageable. Follow dbt’s layered approach: staging → intermediate → marts.
Staging: Raw data to columns
Staging models take raw JSON or CSV data from external sources and convert it to structured columns. These are append-only: every fetch creates new rows. Minimal transformation, just making the data queryable. Examples: pitchbook_companies, linkedin_companies, harmonic_companies.
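To make the staging step concrete, here is a rough Python sketch of "raw JSON to columns". In a dbt project this logic would live in a SQL staging model, so treat this only as an illustration of the transformation. The payload shape, the field names, and the stage_company helper are hypothetical and do not reflect any vendor's actual API.

```python
import json
from typing import Any, Dict, List


def stage_company(raw_json: str, fetched_at: str) -> Dict[str, Any]:
    """Flatten one raw payload into named columns, with minimal transformation."""
    payload = json.loads(raw_json)
    employees = payload.get("employees") or {}   # tolerate a missing or null object
    return {
        "source_id": payload.get("id"),
        "name": payload.get("name"),
        "website": payload.get("website"),
        "employee_count": employees.get("count"),
        "fetched_at": fetched_at,
        "raw": raw_json,                         # keep the original for reprocessing
    }


# Append-only: every fetch adds a row, even for a company already staged.
staged_rows: List[Dict[str, Any]] = []

fetches = [
    ('{"id": "c1", "name": "Company X", "employees": {"count": 10}}', "2024-01-01"),
    ('{"id": "c1", "name": "Company X", "website": "https://example.com", '
     '"employees": {"count": 25}}', "2024-06-01"),
    ('{"id": "c2", "name": "Company Y", "employees": null}', "2024-06-01"),
]
for raw, fetched_at in fetches:
    staged_rows.append(stage_company(raw, fetched_at))
```

Missing fields become None rather than errors, which fits the goal of staging: make the data queryable and leave cleaning and reconciliation to later layers.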