Overview
A data warehouse is a central repository where you store and analyze data from all your sources: CRM, data vendors, fund operations, portfolio companies, your research platform. Instead of querying different systems separately, you pull everything into one place where you can run analyses that span multiple data sources.

For VC funds, a data warehouse serves multiple purposes. It’s an audit log of what happened in your systems (who talked to which companies, when deals moved through your pipeline, how data changed over time). It’s the foundation for analytics and dashboards. It’s where you run ad-hoc queries when a GP asks “how many Series A companies in fintech raised in the last 6 months?” And it’s where you prepare data for tools like LLMs or knowledge graphs.

You don’t need a data warehouse on day one. But it’s a good second step after you have your CRM set up. Even if you’re not immediately running sophisticated analyses, building the infrastructure early creates a foundation that becomes more valuable as your fund scales.

This chapter covers when you actually need a data warehouse, which one to choose, how to structure your data (following dbt best practices), and how to use it for analytics that GPs actually care about.

When You Actually Need a Data Warehouse
Most funds don’t start with a data warehouse. They start with a CRM, maybe some spreadsheets, and direct queries to data vendors. This works fine at first. But as you add more data sources, want to track changes over time, or need to run analyses that combine multiple sources, you hit the limits of this approach.

The right time to set up a data warehouse: after your CRM is set up and being used consistently. The CRM is your source of truth for deal flow and relationships. A data warehouse extends that by:
- Tracking historical changes (you can see how your pipeline evolved over time, not just current state)
- Combining CRM data with external sources (merge your deal flow with funding data from PitchBook, etc.)
- Running analyses that would be painful in your CRM’s query interface
- Serving as an audit log of activity (who talked to which companies, when deals moved stages, what data changed)
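The historical-tracking idea above can be sketched with a daily snapshot table. This is a minimal illustration using an in-memory SQLite database as a stand-in for the warehouse; the table name, schema, and stage names are all hypothetical.

```python
import sqlite3

# In-memory stand-in for the warehouse; in practice this would be
# Postgres or Snowflake. Schema and stage names are illustrative.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE deal_stage_snapshots (
        company        TEXT,
        stage          TEXT,
        snapshot_date  TEXT   -- date the CRM pipeline state was captured
    );
    INSERT INTO deal_stage_snapshots VALUES
        ('Acme', 'sourced',    '2024-01-01'),
        ('Acme', 'first_call', '2024-01-15'),
        ('Acme', 'term_sheet', '2024-02-10');
""")

# Unlike the CRM's current-state view, snapshots let you ask
# "when did this deal enter each stage?"
rows = con.execute("""
    SELECT stage, MIN(snapshot_date) AS entered_on
    FROM deal_stage_snapshots
    WHERE company = 'Acme'
    GROUP BY stage
    ORDER BY entered_on
""").fetchall()
for stage, entered_on in rows:
    print(stage, entered_on)
```

The same pattern (append-only snapshots, query with `MIN`/`MAX` over dates) is what dbt snapshots automate for you at scale.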
Choosing Your Data Warehouse
For small funds: Start with Postgres. If you’re tracking tens of thousands of companies (not millions), Postgres works fine. It’s simpler and cheaper than specialized warehouses, your team already knows it, and you can use the same database for applications and analytics. Before migrating to a specialized warehouse, consider Postgres extensions: pg_mooncake (Iceberg/Delta Lake format on S3) or Hydra (columnar storage, 200x+ speedups for analytics). These let you keep Postgres while getting analytics database benefits.

For larger funds: Snowflake, BigQuery, or Redshift. When Postgres performance degrades (millions of companies, complex aggregations), migrate to a specialized warehouse: Snowflake (industry standard, cloud-agnostic, expensive), BigQuery (fast, Google ecosystem, query-based pricing), or Redshift (AWS integration, cheapest if committed to AWS). Choose based on your cloud ecosystem.

Migration path: Start with Postgres. Migrate when query performance becomes painful. The migration is straightforward - you’re already structuring data in a warehouse pattern, just moving to a more powerful database.

The ELT Approach with dbt
The modern approach is ELT (Extract, Load, Transform) using dbt. Traditional ETL transforms data before loading it; ELT loads raw data first, then transforms it using SQL inside the warehouse. This is simpler because all transformation logic is in SQL, version-controlled, and tested like application code.

How it works: Extract data from sources (CRM, PitchBook, LinkedIn) and load it raw into your warehouse using tools like Fivetran or Airbyte. Write dbt models (SELECT statements) to transform the raw data: clean and normalize, combine sources, calculate metrics, build final tables (marts) for analysis. dbt handles orchestration, testing, and documentation. Analysts query the transformed tables, not the raw data.

Organizing Your Data: dbt Best Practices
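The load-raw-then-transform step can be sketched in miniature. Here an in-memory SQLite database stands in for the warehouse, and a view plays the role of a dbt staging model; the table name, column names, and the model filename mentioned in the comment are all hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Raw table as a Fivetran/Airbyte-style loader would land it:
# source columns untouched. Names here are illustrative.
con.executescript("""
    CREATE TABLE raw_crm_companies (CompanyName TEXT, Stage TEXT, created TEXT);
    INSERT INTO raw_crm_companies VALUES
        ('  Acme Inc  ', 'SERIES A', '2024-03-01');
""")

# A staging "model": in dbt this SELECT would live in its own file
# (e.g. stg_crm__companies.sql) and be materialized by dbt.
con.execute("""
    CREATE VIEW stg_crm__companies AS
    SELECT
        TRIM(CompanyName)  AS company_name,   -- standardize names
        LOWER(Stage)       AS stage,          -- normalize casing
        DATE(created)      AS created_at      -- fix data types
    FROM raw_crm_companies
""")

row = con.execute("SELECT * FROM stg_crm__companies").fetchone()
print(row)
```

The point of the pattern: raw data stays untouched, and every cleanup step is a versioned SELECT that can be tested and re-run.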
The right way to structure data in your warehouse follows dbt best practices. Rather than reinvent this, follow their guide. Here are the key principles:

Layered structure: Staging → Intermediate → Marts
- Staging: Raw data from sources, lightly cleaned (data types fixed, column names standardized, but otherwise unchanged). One staging model per source table. These models just make raw data easier to work with downstream.
- Intermediate: Business logic and transformations. Join data from multiple sources, calculate derived fields, handle entity resolution (linking records across sources). These are building blocks that other models reference.
- Marts: Final tables optimized for specific use cases or business domains. These are what dashboards and analyses query. Examples: `companies_enriched` (companies with all their data from multiple sources), `portfolio_performance` (portfolio company metrics over time), `deal_flow_metrics` (pipeline analytics).
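A mart like the companies table above is typically one wide JOIN over staging models. Below is a minimal sketch using in-memory SQLite; the staging table names and columns are hypothetical, and entity resolution is reduced to an exact name match, which real pipelines rarely get away with.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Two "staging" tables standing in for cleaned CRM and vendor data.
# Names and columns are illustrative.
con.executescript("""
    CREATE TABLE stg_crm__companies (company_name TEXT, pipeline_stage TEXT);
    CREATE TABLE stg_pitchbook__funding (company_name TEXT, last_round TEXT, raised_usd INTEGER);
    INSERT INTO stg_crm__companies VALUES ('Acme', 'first_call'), ('Globex', 'sourced');
    INSERT INTO stg_pitchbook__funding VALUES ('Acme', 'Series A', 8000000);
""")

# The mart: one wide table that dashboards and ad-hoc queries hit
# directly. LEFT JOIN keeps CRM companies the vendor hasn't covered.
con.execute("""
    CREATE TABLE companies_enriched AS
    SELECT
        c.company_name,
        c.pipeline_stage,
        f.last_round,
        f.raised_usd
    FROM stg_crm__companies c
    LEFT JOIN stg_pitchbook__funding f USING (company_name)
""")

rows = con.execute(
    "SELECT * FROM companies_enriched ORDER BY company_name"
).fetchall()
print(rows)
```

In dbt the JOIN would reference staging models via `ref()` so the dependency graph is tracked automatically.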
Analytics and Dashboards
The value of a data warehouse isn’t just storing data. It’s enabling analysis that helps your fund make better decisions.

What GPs actually use is very fund-specific

Different funds care about different metrics. A pre-seed fund might focus on pipeline metrics (how many companies talked to, conversion rates by source, time from first contact to investment). A growth equity fund might focus on portfolio performance (company growth rates, follow-on opportunities, markups/markdowns). A thesis-driven fund might focus on market coverage (how many companies tracked in each thesis area, market maps by sector). There’s no universal dashboard that every fund needs. What matters is having data structured so you can quickly answer the questions that come up.

The value is ad-hoc analysis

The biggest benefit of having a data warehouse isn’t pre-built dashboards. It’s being able to answer questions quickly when GPs ask them. “How many companies did we pass on in 2023 that later raised Series B from our target co-investors?” “Which of our portfolio companies are hiring in engineering?” “What’s our average time from first meeting to investment decision?” With data in a warehouse, these queries are straightforward SQL. Without it, they require manually pulling data from multiple systems, matching companies, and piecing together answers. The warehouse makes ad-hoc analysis feasible.

Tools like Hex for quick dashboards

When you do need dashboards, tools like Hex or Mode let you write SQL queries against your warehouse and turn them into interactive visualizations. These are much faster to build than custom dashboards, and they’re easy to iterate on and share with the team. At Inflection, we use Hex. When a GP wants to see something, we can spin up a dashboard in an hour or two: write SQL queries against the warehouse, add visualizations, share a link. This responsiveness is valuable. You’re not waiting weeks for data engineering to build custom dashboards.
You’re answering questions as they come up.

Examples of common analyses:
- Pipeline metrics (deals in each stage, conversion rates, time in pipeline, source of deals)
- Market maps (companies in specific sectors, with funding, stage, growth data)
- Portfolio tracking (company metrics over time, which companies need attention)
- Investor analysis (which co-investors appear frequently, warm intro paths to founders)
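As a concrete example of the pipeline metrics above, here is what "conversion rate by deal source" looks like as a one-off query. The `deals` table, its columns, and the sample rows are all hypothetical; SQLite again stands in for the warehouse.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Hypothetical deal-flow table; columns are illustrative.
# invested is 1 if the deal closed, 0 otherwise.
con.executescript("""
    CREATE TABLE deals (company TEXT, source TEXT, invested INTEGER);
    INSERT INTO deals VALUES
        ('Acme',     'warm_intro', 1),
        ('Globex',   'warm_intro', 0),
        ('Initech',  'outbound',   0),
        ('Umbrella', 'outbound',   0);
""")

# Conversion rate by source: AVG over a 0/1 flag is the conversion rate.
rows = con.execute("""
    SELECT source,
           COUNT(*)                      AS deals_seen,
           ROUND(AVG(invested) * 100, 1) AS conversion_pct
    FROM deals
    GROUP BY source
    ORDER BY conversion_pct DESC
""").fetchall()
for source, seen, pct in rows:
    print(source, seen, pct)
```

With marts already built, most GP questions reduce to a GROUP BY like this rather than a data-pulling project.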