Overview
How you model your data determines what you can analyze, how fast you can build features, and whether your systems adapt as needs evolve. This chapter covers core entities, the critical distinction between companies and deals, temporal data handling, and practical patterns from building real VC systems.
Core Entities and Relationships
Your data model will vary based on what you’re building (CRM, research platform, portfolio dashboard), but most VC systems share common entities.
Companies - Startups you might invest in, portfolio companies, competitors. Core attributes: name, description, website, founding date, location, sector, stage. Companies are the center of your data model.
Deals - Your fund’s relationship with a company (covered in detail in the next section). One company can have multiple deals: you might pass on their seed round, then invest in their Series A.
People - Founders, executives, employees. Core attributes: name, email, LinkedIn URL. People connect to companies through roles, not direct relationships.
Roles - The connection between people and companies capturing who works where, in what capacity, and when. Model as a separate entity to track full history: a founder worked at Google (2018-2020), started Company X (2020-present), while advising Company Y (2021-present).
Education - Where people went to school, what they studied, when. Model separately to capture multiple degrees and overlapping education. Enables analysis like “which universities produce the most founders in our focus areas?”
Funding Rounds - Capital raised by companies. Attributes: round type, amount, valuation, dates, investors. Expect messiness: unreported SAFE notes, incomplete data from vendors, unannounced rounds. Your data model needs to handle uncertainty (covered in “Dealing with Messy Reality” below).
Investors and Funds
Other VC firms, angels, corporate investors who participate in rounds. You care about investors for co-investment patterns, warm intro paths, and understanding who’s active in your sectors.
Critical: Model the hierarchy as Investor → Fund → Funding Round, not just Investor → Funding Round. Here’s why this matters:
One investor (the firm) can have multiple funds. Sequoia Capital has multiple funds. a16z has multiple funds. Your own firm probably has multiple funds (Fund I, Fund II, etc.). When tracking who invested in a company, you need to know which specific fund made the investment, not just which firm.
This matters because different funds from the same investor can make different investment decisions. Inflection Mercury Fund might pass on a deal while Inflection Mars Fund invests. If you model this as just “Inflection invested,” you lose critical information.
The SQL below is simplified to illustrate entity relationships - not production-ready code.
-- Investors table (the firms)
CREATE TABLE investors (
id UUID PRIMARY KEY,
name TEXT NOT NULL, -- "Sequoia Capital"
type TEXT, -- VC, Angel, Corporate
headquarters TEXT,
focus_areas TEXT[]
);
-- Funds table (specific funds within firms)
CREATE TABLE funds (
id UUID PRIMARY KEY,
investor_id UUID REFERENCES investors(id),
name TEXT NOT NULL, -- "Sequoia Capital Fund XIV"
vintage_year INTEGER,
fund_size NUMERIC,
status TEXT -- Active, Deployed, Closed
);
-- Funding round participations (which funds invested in which rounds)
CREATE TABLE funding_round_participants (
id UUID PRIMARY KEY,
funding_round_id UUID REFERENCES funding_rounds(id),
fund_id UUID REFERENCES funds(id), -- Links to specific fund, not just investor
amount_invested NUMERIC,
is_lead BOOLEAN
);
Some data providers (some are better at this than others) provide fund-level detail. When available, capture it. When it’s not available (data provider only knows “Sequoia invested” but not which Sequoia fund), you can create a placeholder fund or link directly to the investor, but structure your schema to support fund-level data when you have it.
This distinction has bitten many VC data systems. Get it right from the start.
LPs (Limited Partners) - Individuals and institutions who invest in your fund. Required for fund operations tools or LP portals. LPs can be individuals (HNWIs with personal info) or institutions (pension funds, endowments with organizational info). Model with a type field and nullable columns, or separate tables if managed differently.
The list above isn’t exhaustive. Depending on what you’re building, you might also model: board seats, advisors, customers (for B2B portfolio companies), competitors, acquisitions, exits, fund entities (if you manage multiple funds), and more. Start with the entities your application actually needs, not a comprehensive schema you think you might need someday.
Companies vs. Deals: The Critical Distinction
The most important modeling decision in VC systems is separating companies from deals.
A company is an entity that exists independently of your fund. Stripe is a company. It has founders, funding rounds, employees, a product, customers. Stripe exists whether or not your fund ever talks to them.
A deal is your fund’s relationship with a company. You sourced Stripe, you met with the founders, you decided to invest (or pass), you negotiated terms, you closed the investment. The deal represents your fund’s interaction with that company.
Why separate them?
Your fund can have multiple deals with the same company. You might:
- Pass on their seed round (Deal 1: Status = Passed)
- Invest in their Series A two years later (Deal 2: Status = Closed)
- Consider a follow-on in their Series B (Deal 3: Status = In Diligence)
Each deal has different properties: deal source, who at your fund is leading it, what stage it’s at, when you first talked to them, what your investment thesis was, what terms you negotiated. These properties belong to the deal, not the company.
Meanwhile, company properties (employee count, funding history, product description) are shared across all deals. If you update Stripe’s employee count, that change applies regardless of which deal you’re looking at.
In practice, this means separate tables:
-- Companies table
CREATE TABLE companies (
id UUID PRIMARY KEY,
name TEXT NOT NULL,
website TEXT,
description TEXT,
founded_date DATE,
location TEXT,
sector TEXT,
stage TEXT,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
-- Deals table
CREATE TABLE deals (
id UUID PRIMARY KEY,
company_id UUID REFERENCES companies(id), -- Foreign key to company
deal_name TEXT, -- "Stripe Series A"
status TEXT, -- Sourced, Meeting, Diligence, IC, Term Sheet, Closed, Passed
source TEXT, -- How we found this company (warm intro, Harmonic, conference)
lead_investor UUID REFERENCES people(id), -- Who at our fund is leading
initial_contact_date DATE,
last_activity_date DATE,
investment_amount NUMERIC,
ownership_percentage NUMERIC,
created_at TIMESTAMP,
updated_at TIMESTAMP
);
When you query deals, you join to companies to get company information. When you’re analyzing companies (building market maps, tracking sectors), you query companies directly without caring about deals.
This separation is fundamental to building VC software that works. If you try to embed deals into the companies table or vice versa, you’ll create a mess that’s hard to query and impossible to extend. As emphasized in CRM and Deal Flow on CRM, this distinction is one of the most important architectural decisions you’ll make.
Dealing with Messy Reality
VC data is messy. Your data model needs to account for this rather than assuming perfect information.
Don’t assume GPs make one investment per company
As covered above, model deals separately from companies. Your fund might invest multiple times, or pass then later invest, or invest then later decline a follow-on. The data model needs to support multiple deals per company.
Don’t assume perfect funding round data
Early-stage companies raise money in ways that aren’t always captured in data sources. SAFE notes, convertible notes, rolling closes, stealth rounds. Crunchbase and PitchBook miss things, especially for pre-seed and seed companies.
Your data model should allow for:
- Unknown or estimated funding amounts
- Approximate dates (Q2 2024, not a specific day)
- Rounds that might not exist (rumored but unconfirmed)
- Multiple rounds of the same type (Seed, Seed Extension, Seed II)
Don’t build validation rules that enforce “Series B must come after Series A” or “funding amounts must be known.” Real companies break these rules constantly and your data sources, even more so.
People have overlapping experiences and education
Founders often work multiple jobs simultaneously: running their startup while advising other companies, sitting on boards, teaching part-time. They pursue education while working: part-time MBAs, executive programs, online courses.
This is why roles and education must be separate tables with start and end dates, not single fields in the people table. You need to capture overlapping time periods.
-- Roles table (separate from people and companies)
CREATE TABLE roles (
id UUID PRIMARY KEY,
person_id UUID REFERENCES people(id),
company_id UUID REFERENCES companies(id),
title TEXT,
role_type TEXT, -- Employee, Founder, Advisor, Board Member
start_date DATE,
end_date DATE, -- NULL if current
is_current BOOLEAN,
created_at TIMESTAMP
);
-- Education table (separate from people)
CREATE TABLE education (
id UUID PRIMARY KEY,
person_id UUID REFERENCES people(id),
institution TEXT,
degree TEXT, -- BS, MS, MBA, PhD
field_of_study TEXT,
start_date DATE,
end_date DATE,
created_at TIMESTAMP
);
With this structure, you can query “show me all founders who worked at Google” (join people → roles → companies where company = Google and role_type = Employee). You can analyze “which schools produce the most founders in fintech” (join people → roles → companies → education, filter for fintech sector and founder roles).
If you tried to store current role and education as fields in the people table, you’d lose history and couldn’t answer these questions.
Temporal Data and History
Companies change constantly. Employee count grows. Funding rounds happen. Valuations change. Products pivot. You need to decide what history to keep and how to model it.
Start with append-only tables
The simplest approach: never update or delete data. When you get new information about a company, append a new row with a timestamp. This creates an audit trail of every change.
-- Append-only company data from PitchBook
CREATE TABLE pitchbook_company_changes (
id UUID PRIMARY KEY,
company_id UUID,
fetched_at TIMESTAMP, -- When we got this data
raw_json JSONB, -- Raw response from PitchBook API
name TEXT,
description TEXT,
employee_count INTEGER,
funding_total NUMERIC,
-- ... other fields
created_at TIMESTAMP
);
Every time you fetch company data from PitchBook (daily, weekly, or on-demand), you insert a new row. You never update existing rows. This preserves full history: you can see what PitchBook reported about a company on any date.
Use dbt to transform to latest values
Append-only tables are great for history but inconvenient to query. You usually want the latest information, not every historical value.
Use dbt (covered in Data Warehousing) to transform append-only staging tables into intermediate tables with current values:
-- dbt model: models/intermediate/pitchbook_companies.sql
-- Gets the latest record per company from append-only staging data
SELECT DISTINCT ON (company_id)
company_id,
name,
description,
employee_count,
funding_total,
fetched_at
FROM {{ ref('pitchbook_company_changes') }}
ORDER BY company_id, fetched_at DESC
Now you have both: pitchbook_company_changes (full history) and pitchbook_companies (latest values). Applications usually query the latest values table. When you need to analyze “how did this company change over time?” you go back to the append-only table or create a custom table for those types of queries.
What history actually matters
Storage is cheap these days. Rather than deciding what history to keep and what to discard, it’s often simpler to just keep everything in your append-only staging tables. A year of daily company snapshots for thousands of companies costs pennies in Postgres or cloud data warehouse storage.
The practical considerations are query performance and data warehouse costs (if you’re using Snowflake or BigQuery where you pay per query). But even there, you’re usually querying the transformed and materialized “latest values” tables, not the full historical append-only tables.
Keep all history in staging tables. When you need to analyze “how did this company change over time?” you have the data. When you don’t need it, you’re not querying it, so it doesn’t cost anything. This is simpler than trying to decide upfront what historical data might be valuable someday.
The main exception: if you’re storing truly high-volume data (real-time metrics, streaming data, logs), you might need retention policies. But for standard VC data (company attributes, funding rounds, people roles), just keep it all.
As you add more data sources, follow dbt’s layered approach (staging → intermediate → marts) to stay organized. The Data Warehousing chapter covers this methodology in detail.
The Bottom Line
Accept that VC data is messy: multiple investments per company, imperfect funding data, overlapping experiences. Your data model should handle uncertainty and incomplete information.
Separate companies from deals. Use append-only tables for history (storage is cheap), transform to latest values for queries. Start with what you need and add entities as your systems grow.
In the next chapter, we’ll cover more advanced entity resolution: matching the same company across different data sources when they use different names, formats, and identifiers.