Schema Design Patterns for Graph Data Snapshots: Specialized vs. Unified Approaches
I've been wrestling with a schema design problem that felt simple at first... but made me rethink my fundamental assumptions about schema "best practices."...
Context
- I'm building snapshot-based storage for prediction market analytics where entities (markets, events, traders) form a complex graph with time-varying properties.
Core Challenge
- I need to capture point-in-time snapshots of this entire graph state, then rebuild analytical models across different time periods.
- ... Should I store these snapshots as specialized files (one per entity type) or unified tables that can hold any entity?
I need to answer questions like "rebuild my volatility model using only data from Q3" or "compare trader behavior patterns between election weeks and normal periods." The schema design makes these queries either trivial or painful.
Let's dive in
In modeling prediction market APIs, relationships matter as much as the entities themselves. A "Series" contains multiple "Events," each Event spawns multiple "Markets," and traders interact across all of them. But unlike a live graph database, I need to freeze these relationships at specific timestamps and efficiently reconstruct historical views of the entire system.
"Best Practices" dictate the following:
- Normalize your schemas: One table per entity type, strongly typed columns, clean foreign key relationships
- Optimize for common queries: Design schemas around your 80% use case, accept complexity elsewhere
- Type safety is non-negotiable: Catch errors at ingestion time, not query time
The schema design affects everything I'll build on top of it: query performance, how painful it is to add new entity types, and whether I can easily answer research questions six months from now.
The naive approach (dump everything into a single table) breaks down quickly. The sophisticated approach (perfectly normalized schemas) makes cross-entity analysis painful. I needed something in between.
Approach 1: Specialized Schemas (The Type-Safe Path)
The specialized approach feels like what a database designer would do: separate tables for each entity type, with strongly-typed columns.
# Specialized parquet file structure
/datalake/
├── nodes/
│   ├── series_nodes/2025_Q1_20250115.parquet
│   ├── event_nodes/2025_Q1_20250115.parquet
│   ├── market_nodes/2025_Q1_20250115.parquet
│   └── trader_nodes/2025_Q1_20250115.parquet
├── edges/
│   ├── series_event_edges/2025_Q1_20250115.parquet
│   └── trader_market_edges/2025_Q1_20250115.parquet
└── temporal_properties/
    ├── price_properties/2025_Q1_20250115.parquet
    ├── volume_properties/2025_Q1_20250115.parquet
    └── liquidity_properties/2025_Q1_20250115.parquet
Each file has a clean, purpose-built schema:
from datetime import datetime
from decimal import Decimal
from typing import Tuple

from pydantic import BaseModel


# market_nodes.parquet schema
class MarketNode(BaseModel):
    node_id: str
    polymarket_id: int
    polymarket_question: str
    outcomes: Tuple[str, str]  # ("Yes", "No")
    date_end: datetime
    restricted: bool
    created_at: datetime
    updated_at: datetime


# price_properties.parquet schema
class PriceProperty(BaseModel):
    property_id: str
    node_id: str  # References MarketNode
    price_yes: Decimal
    price_no: Decimal
    timestamp: datetime
    source: str
This approach has the satisfying clarity of a well-designed API. Each file serves a single purpose. Queries are fast because you're only scanning relevant data. Type safety catches errors at ingestion time.
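To make "catches errors at ingestion time" concrete, here's a minimal sketch that validates a raw record against MarketNode before it ever touches the lake. The payload below is made up for illustration, not a real Polymarket response:

from pydantic import ValidationError

raw = {
    "node_id": "mkt_123",
    "polymarket_id": "not-an-int",        # wrong type on purpose
    "polymarket_question": "Will it rain in NYC tomorrow?",
    "outcomes": ("Yes", "No"),
    "date_end": "2025-03-01T00:00:00",
    "restricted": False,
    "created_at": "2025-01-15T00:00:00",
    "updated_at": "2025-01-15T00:00:00",
}

try:
    market = MarketNode(**raw)            # schema defined above
except ValidationError as exc:
    print(exc)                            # bad rows fail loudly here, not at query time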
But there's a hidden cost: rigidity. Adding a new entity type means creating new schemas, new ingestion pipelines, new parquet files. Cross-entity analysis requires complex joins across multiple files. And when you want to ask questions like "show me all entities created in the last week," you're scanning multiple files and union-ing results.
Approach 2: Unified Schema (The Flexible Path)
The unified approach treats the data lake like a graph database flattened into tables. Four universal schemas handle everything:
# Unified 4-file structure
/datalake/
├── entities/2025_Q1_20250115.parquet    # ALL entity types
├── edges/2025_Q1_20250115.parquet       # ALL relationships
├── temporal/2025_Q1_20250115.parquet    # ALL time-series data
└── parameters/2025_Q1_20250115.parquet  # ALL entity-specific properties
The schemas become more generic but more flexible:
from datetime import datetime
from typing import Optional

from pydantic import BaseModel


# entities.parquet - universal entity table
class UnifiedEntity(BaseModel):
    node_id: str
    node_type: str  # "series", "event", "market", "trader"
    polymarket_id: Optional[int] = None
    created_at: datetime
    updated_at: datetime


# parameters.parquet - entity-specific properties as key-value pairs
class UnifiedParameter(BaseModel):
    node_id: str
    parameter_name: str   # "outcomes", "restricted", "polymarket_question"
    parameter_value: str  # JSON-serialized value
    parameter_type: str   # "string", "boolean", "json_array"


# temporal.parquet - all time-series data
class UnifiedTemporal(BaseModel):
    node_id: str
    property_name: str   # "price_yes", "volume_24h", "liquidity"
    property_value: str  # JSON-serialized value
    timestamp: datetime
    source: str
Now the same market data looks like this:
# entities.parquet
node_id="mkt_123", node_type="market", polymarket_id=568058, created_at=...
# parameters.parquet
node_id="mkt_123", parameter_name="outcomes", parameter_value='["Yes", "No"]', parameter_type="json_array"
node_id="mkt_123", parameter_name="restricted", parameter_value="true", parameter_type="boolean"
# temporal.parquet
node_id="mkt_123", property_name="price_yes", property_value="0.0215", timestamp=...
node_id="mkt_123", property_name="price_no", property_value="0.9785", timestamp=...
This is essentially an Entity-Attribute-Value (EAV) pattern applied to a data lake. It's the kind of design that makes database purists wince, but it felt right so I dove deeper into this rabbit hole...
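Reading a value back means undoing that serialization at query time. A minimal sketch of the parsing step, assuming values were written with `json.dumps` as the schema comments above imply (the helper name is mine):

import json

def parse_parameter(parameter_value: str, parameter_type: str):
    """Turn an EAV string back into a typed Python value."""
    value = json.loads(parameter_value)   # round-trips strings, booleans, and arrays
    if parameter_type == "boolean" and not isinstance(value, bool):
        raise ValueError(f"expected a boolean, got {value!r}")
    return value

parse_parameter('["Yes", "No"]', "json_array")   # -> ['Yes', 'No']
parse_parameter('true', "boolean")               # -> True

That `json.loads` on every read is exactly the runtime parsing overhead the purists are wincing about.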
The Performance Reality Check
I ran some benchmarks to see how these approaches perform in practice. The results surprised me.
Query 1: Get all data for a specific market
-- Specialized schema (3 file scans + joins)
SELECT m.*, p.price_yes, p.price_no, v.volume_24h
FROM market_nodes m
JOIN price_properties p ON m.node_id = p.node_id
JOIN volume_properties v ON m.node_id = v.node_id
WHERE m.polymarket_id = 568058
-- Unified schema (3 file scans + joins, but different pattern)
SELECT e.*,
p1.parameter_value as outcomes,
p2.parameter_value as restricted,
t1.property_value as price_yes,
t2.property_value as volume_24h
FROM entities e
JOIN parameters p1 ON e.node_id = p1.node_id AND p1.parameter_name = 'outcomes'
JOIN parameters p2 ON e.node_id = p2.node_id AND p2.parameter_name = 'restricted'
JOIN temporal t1 ON e.node_id = t1.node_id AND t1.property_name = 'price_yes'
JOIN temporal t2 ON e.node_id = t2.node_id AND t2.property_name = 'volume_24h'
WHERE e.node_type = 'market' AND e.polymarket_id = 568058
Result: Specialized schema wins by a noticeable margin. The joins are simpler and the column pruning is more effective.
Query 2: Cross-entity analysis - "Show volume trends for all entities created this week"
-- Specialized schema (scan all entity files + union + join)
SELECT entities.node_id, entities.type, v.volume_24h, v.timestamp
FROM (
    SELECT node_id, 'series' AS type, created_at FROM series_nodes WHERE created_at > '2025-01-08'
    UNION ALL
    SELECT node_id, 'event' AS type, created_at FROM event_nodes WHERE created_at > '2025-01-08'
    UNION ALL
    SELECT node_id, 'market' AS type, created_at FROM market_nodes WHERE created_at > '2025-01-08'
) AS entities
JOIN volume_properties v ON entities.node_id = v.node_id
-- Unified schema (single scan + filter + join)
SELECT e.node_id, e.node_type, t.property_value as volume, t.timestamp
FROM entities e
JOIN temporal t ON e.node_id = t.node_id AND t.property_name = 'volume_24h'
WHERE e.created_at > '2025-01-08'
Result: Unified schema wins by a significant margin. Single table scans beat multiple file unions every time.
The pattern became clear: specialized schemas excel at entity-specific queries, unified schemas excel at cross-entity analysis. Since I'm building for research workloads where cross-entity questions are common, the unified approach started looking more attractive.
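For what it's worth, this kind of comparison is cheap to rerun: an embedded engine like DuckDB can query the parquet files in place. The engine choice and paths below are my assumptions, not a record of the original benchmark setup:

import time
import duckdb

con = duckdb.connect()

def timed(label: str, sql: str) -> None:
    start = time.perf_counter()
    con.execute(sql).fetchall()
    print(f"{label}: {time.perf_counter() - start:.3f}s")

# Cross-entity query against the unified layout (paths are illustrative)
timed("unified", """
    SELECT e.node_id, e.node_type, t.property_value AS volume, t.timestamp
    FROM read_parquet('datalake/entities/*.parquet') e
    JOIN read_parquet('datalake/temporal/*.parquet') t
      ON e.node_id = t.node_id AND t.property_name = 'volume_24h'
    WHERE e.created_at > TIMESTAMP '2025-01-08'
""")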
The Hidden Costs
But performance isn't the whole story. Each approach has operational costs that might only surface after you've been running the system for months.
Here are some I could speculate on:
Specialized Schema Pain Points:
- Schema proliferation: Every new entity type requires new schemas, new ingestion code, new parquet files. I counted 12 different schemas for what felt like a simple domain.
- Cross-cutting queries: Questions like "show me all entities with high volatility" require scanning multiple files and complex unions. These queries are common in research but painful to write.
- Schema evolution: Adding a field to markets means updating schemas, migration scripts, and potentially reprocessing historical data.
Unified Schema Pain Points:
- Type safety erosion: Everything becomes strings in the parameters table. You lose compile-time validation and gain runtime parsing overhead.
- Query complexity: Simple entity queries require multiple joins and careful filtering. What should be a single table scan becomes a join fest.
- Storage overhead: The EAV pattern is notoriously space-inefficient. Simple boolean fields become three-column rows.
The Scaling Escape Hatch
Here's where the unified approach gets interesting: when files get too big, you can partition by time instead of restructuring schemas.
# When unified files get too large, partition by time
/datalake/
├── entities/
│   ├── 2025_Q1_week01.parquet   # Jan 1-7
│   ├── 2025_Q1_week02.parquet   # Jan 8-14
│   └── 2025_Q1_week03.parquet   # Jan 15-21
└── temporal/
    ├── 2025_Q1_day001.parquet   # Jan 1 (high-frequency data)
    ├── 2025_Q1_day002.parquet   # Jan 2
    └── 2025_Q1_day003.parquet   # Jan 3
This gives you the best of both worlds: schema flexibility when you're small, performance optimization when you're large. The specialized approach doesn't have this escape hatch; you're committed to the schema structure from day one.
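As a sketch of the mechanics, assuming pandas with the pyarrow engine: derive a partition key from the snapshot's own timestamps and let `partition_cols` do the splitting. The `week` column and paths are mine, and this writes one directory per week rather than the flat files shown above:

import pandas as pd

entities_df = pd.read_parquet("datalake/entities/2025_Q1_20250115.parquet")

# Derive a partition key from the snapshot's own timestamps
entities_df["week"] = pd.to_datetime(entities_df["created_at"]).dt.strftime("%Y_week%U")

# Writes datalake/entities_partitioned/week=2025_week02/... one directory per week
entities_df.to_parquet("datalake/entities_partitioned", partition_cols=["week"], index=False)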
Consequences In The Wild
Data Quality: The specialized approach caught more errors at ingestion time, but the unified approach made it easier to spot inconsistencies across entity types. When trader data started showing negative volumes, it was obvious in the unified temporal table but would have been hidden in a trader-specific schema.
Debugging: When ingestion failed, the unified approach made it easier to see what was missing. A single scan of the entities table showed which node types were present. The specialized approach required checking multiple files to understand the failure.
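That "single scan" is close to a one-liner (path illustrative):

import pandas as pd

# Which node types actually made it into this snapshot?
snapshot = pd.read_parquet("datalake/entities/2025_Q1_20250115.parquet")
print(snapshot["node_type"].value_counts())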
Research Velocity: This was the clincher. Research questions almost always span entity types. "How does trader behavior correlate with market volatility?" "Which series have the most predictable events?" These questions are natural in the unified schema but require complex joins in the specialized approach. I could explore ideas an order of magnitude faster.
A Decision Framework
I've come to think about this choice through three lenses:
1. Query Pattern Distribution
- If >70% of queries are entity-specific: specialized schema
- If >50% of queries span entity types: unified schema
2. Developer Composition (purely speculative... :P)
- Database-heavy developer: specialized schema (familiar patterns)
- Research-heavy developer: unified schema (exploration-friendly)
3. Schema Stability
- Stable domain with known entities: specialized schema
- Evolving domain with unknown entities: unified schema
For prediction market analytics, I landed on unified schemas. The domain is still evolving (new entity types emerge regularly), I'm research-focused, and cross-entity analysis is the norm rather than the exception.
Implementation Reality
Here's what the unified approach looks like, more or less:
import json
from datetime import datetime
from typing import Dict, List

import pandas as pd


class GraphParquetStore:
    """Store typed graph data as unified parquet files."""

    def store_graph_snapshot(self,
                             nodes: Dict[str, List[BaseGraphNode]],
                             edges: Dict[str, List[BaseGraphEdge]],
                             temporal_properties: Dict[str, List[BaseTemporalProperty]]) -> None:
        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        # Flatten all node types into the unified entities table
        all_entities = []
        all_parameters = []
        for node_type, node_list in nodes.items():
            for node in node_list:
                # Core entity record
                entity = UnifiedEntity(
                    node_id=node.node_id,
                    node_type=node_type,
                    polymarket_id=getattr(node, 'polymarket_id', None),
                    created_at=node.created_at,
                    updated_at=node.updated_at
                )
                all_entities.append(entity)

                # Entity-specific properties become key-value parameter rows
                for field_name, field_value in node.model_dump().items():
                    if field_name not in ['node_id', 'created_at', 'updated_at']:
                        param = UnifiedParameter(
                            node_id=node.node_id,
                            parameter_name=field_name,
                            # default=str keeps datetimes and Decimals JSON-serializable
                            parameter_value=json.dumps(field_value, default=str),
                            parameter_type=self._infer_type(field_value)
                        )
                        all_parameters.append(param)

        # Store to parquet (edges and temporal_properties are written the same way)
        entities_df = pd.DataFrame([e.model_dump() for e in all_entities])
        entities_df.to_parquet(f"{self.datalake_path}/entities/2025_Q1_{timestamp}.parquet")

        parameters_df = pd.DataFrame([p.model_dump() for p in all_parameters])
        parameters_df.to_parquet(f"{self.datalake_path}/parameters/2025_Q1_{timestamp}.parquet")
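The snapshot method leans on a `_infer_type` helper that isn't shown above. A minimal sketch of how it could map Python values onto the type tags used in the parameters table (this is my guess at the helper, not the original; numbers and datetimes collapse to "string" here):

    def _infer_type(self, value) -> str:
        """Map a Python value to the parameter_type tag stored alongside it."""
        if isinstance(value, bool):          # check bool before anything broader
            return "boolean"
        if isinstance(value, (list, tuple)):
            return "json_array"
        return "string"                      # everything else is tagged as a string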
The code is surprisingly clean. The complexity moves from schema management to query construction, which feels like the right trade-off for a research-focused workflow.
Some Hard-Won Lessons
Performance isn't everything: The performance hit on entity-specific queries was less important than the massive improvement in research velocity.
Schema evolution matters more than I thought: Adding new entity types in the unified approach takes minutes. In the specialized approach, it's a multi-day project involving schema changes, migration scripts, and testing across the entire pipeline.
My cognitive load is real: Four schemas are easier to keep in my head than twelve, even if each individual schema is more complex. When I'm debugging at 2 AM, simplicity wins.
Escape hatches are valuable: The ability to partition by time when files get large gives me a path to optimize performance without restructuring everything. I'm not locked into my initial decision forever.
Domain maturity affects everything: Early in a domain's lifecycle, flexibility trumps performance. As the domain stabilizes and query patterns become predictable, performance becomes more important.
The unified schema approach isn't right for every use case. If I were building operational dashboards with known query patterns and stable schemas, the specialized approach would serve me better. But for research workloads in evolving domains, the flexibility of unified schemas can be transformative.
Sometimes the "wrong" design is the right choice. The unified schema violates database design principles but enables research patterns that would be painful otherwise. When data exploration is what drives the work, that trade-off makes sense.
My mileage varied. Yours will too. But when it does, at least you'll know why.
Postscript
I got a request to elaborate on this idea. Here's a stub of what the tables can look like:
entities.parquet - Universal entity table:
node_id | node_type | polymarket_id | created_at | updated_at
--------|-----------|---------------|------------|------------
abc123 | series | 10000 | 2025-01-15 | 2025-01-15
def456 | event | 34332 | 2025-01-15 | 2025-01-15
ghi789 | market | 568058 | 2025-01-15 | 2025-01-15
jkl012 | trader | NULL | 2025-01-15 | 2025-01-15
parameters.parquet - Entity-specific properties:
node_id | parameter_name | parameter_value | parameter_type
--------|-------------------|---------------------------|---------------
abc123 | recurrence | "weekly" | string
abc123 | series_type | "single" | string
abc123 | ticker | "ELON_TWEETS" | string
def456 | resolution_source | "polymarket_official" | string
def456 | negative_risk | false | boolean
ghi789 | outcomes | ["Yes", "No"] | json_array
ghi789 | restricted | true | boolean
jkl012 | trader_address | "0x1234...abcd" | string
edges.parquet - Universal relationship table:
edge_id | source_node_id | target_node_id | relationship_type | weight | created_at
--------|----------------|----------------|------------------------|--------|------------
edge1 | abc123 | def456 | SERIES_CONTAINS_EVENT | 1.0 | 2025-01-15
edge2 | def456 | ghi789 | EVENT_CONTAINS_MARKET | 1.0 | 2025-01-15
edge3 | jkl012 | ghi789 | TRADER_TRADES_MARKET | 0.8 | 2025-01-15
temporal.parquet - Universal time-series table:
node_id | property_name | property_value | timestamp | source
--------|---------------|----------------|---------------------|----------------
ghi789 | price_yes | 0.0215 | 2025-07-28T00:19:02 | polymarket_api
ghi789 | price_no | 0.9785 | 2025-07-28T00:19:02 | polymarket_api
ghi789 | volume_24h | 2318680.12 | 2025-07-28T00:01:04 | polymarket_api
ghi789 | liquidity | 265041.98 | 2025-07-28T00:01:04 | polymarket_api
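And, to close the loop on query construction: a hedged sketch of pivoting these tables back into a wide, market-shaped view. DuckDB and the paths are my choices here, not part of the original setup:

import duckdb

con = duckdb.connect()
wide_markets = con.execute("""
    SELECT e.node_id,
           e.polymarket_id,
           MAX(CASE WHEN p.parameter_name = 'outcomes'   THEN p.parameter_value END) AS outcomes,
           MAX(CASE WHEN p.parameter_name = 'restricted' THEN p.parameter_value END) AS restricted
    FROM read_parquet('datalake/entities/*.parquet')   e
    JOIN read_parquet('datalake/parameters/*.parquet') p ON e.node_id = p.node_id
    WHERE e.node_type = 'market'
    GROUP BY e.node_id, e.polymarket_id
""").df()
print(wide_markets.head())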