Engineering in the Wild

Schema Design Patterns for Graph Data Snapshots: Specialized vs. Unified Approaches

I've been wrestling with a schema design problem that felt simple at first... but it made me rethink my fundamental assumptions about schema "best practices."

Context

Core Challenge

I need to answer questions like "rebuild my volatility model using only data from Q3" or "compare trader behavior patterns between election weeks and normal periods." The schema design makes these queries either trivial or painful.

Let's dive in

In modeling prediction market APIs, relationships matter as much as the entities themselves. A "Series" contains multiple "Events," each Event spawns multiple "Markets," and traders interact across all of them. But unlike a live graph database, I need to freeze these relationships at specific timestamps and efficiently reconstruct historical views of the entire system.

"Best Practices" dictate the following:

The schema design affects everything I'll build on top of it: query performance, how painful it is to add new entity types, and whether I can easily answer research questions six months from now.

The naive approach (dump everything into a single table) breaks down quickly. The sophisticated approach (create perfectly normalized schemas) makes cross-entity analysis painful. I needed something in between.

Approach 1: Specialized Schemas (The Type-Safe Path)

The specialized approach feels like what a database designer would do: separate tables for each entity type, with strongly-typed columns.

# Specialized parquet file structure
/datalake/
├── nodes/
│   ├── series_nodes/2025_Q1_20250115.parquet
│   ├── event_nodes/2025_Q1_20250115.parquet
│   ├── market_nodes/2025_Q1_20250115.parquet
│   └── trader_nodes/2025_Q1_20250115.parquet
├── edges/
│   ├── series_event_edges/2025_Q1_20250115.parquet
│   └── trader_market_edges/2025_Q1_20250115.parquet
└── temporal_properties/
    ├── price_properties/2025_Q1_20250115.parquet
    ├── volume_properties/2025_Q1_20250115.parquet
    └── liquidity_properties/2025_Q1_20250115.parquet

Each file has a clean, purpose-built schema:

from datetime import datetime
from decimal import Decimal
from typing import Tuple

from pydantic import BaseModel

# market_nodes.parquet schema
class MarketNode(BaseModel):
    node_id: str
    polymarket_id: int
    polymarket_question: str
    outcomes: Tuple[str, str]  # ("Yes", "No")
    date_end: datetime
    restricted: bool
    created_at: datetime
    updated_at: datetime

# price_properties.parquet schema  
class PriceProperty(BaseModel):
    property_id: str
    node_id: str  # References MarketNode
    price_yes: Decimal
    price_no: Decimal
    timestamp: datetime
    source: str

This approach has the satisfying clarity of a well-designed API. Each file serves a single purpose. Queries are fast because you're only scanning relevant data. Type safety catches errors at ingestion time.
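
A quick sketch of what that ingestion-time safety buys, using the MarketNode model above (the bad value is deliberate):

# A malformed record fails at write time instead of corrupting the parquet file.
from pydantic import ValidationError

try:
    MarketNode(
        node_id="mkt_123",
        polymarket_id="not-an-int",          # wrong type: rejected immediately
        polymarket_question="Will X happen?",
        outcomes=("Yes", "No"),
        date_end="2025-12-31T00:00:00",
        restricted=False,
        created_at="2025-01-15T00:00:00",
        updated_at="2025-01-15T00:00:00",
    )
except ValidationError as exc:
    print(exc)  # points at polymarket_id, before anything hits disk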

But there's a hidden cost: rigidity. Adding a new entity type means creating new schemas, new ingestion pipelines, new parquet files. Cross-entity analysis requires complex joins across multiple files. And when you want to ask questions like "show me all entities created in the last week," you're scanning multiple files and union-ing results.
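
For concreteness, here's roughly what "scanning multiple files and union-ing results" works out to in pandas (file paths illustrative):

import pandas as pd

# Every specialized node file has to be touched for a cross-entity question.
node_files = [
    "/datalake/nodes/series_nodes/2025_Q1_20250115.parquet",
    "/datalake/nodes/event_nodes/2025_Q1_20250115.parquet",
    "/datalake/nodes/market_nodes/2025_Q1_20250115.parquet",
    "/datalake/nodes/trader_nodes/2025_Q1_20250115.parquet",
]

recent = pd.concat(
    (pd.read_parquet(f, columns=["node_id", "created_at"]) for f in node_files),
    ignore_index=True,
)
recent = recent[recent["created_at"] > pd.Timestamp("2025-01-08")]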

Approach 2: Unified Schema (The Flexible Path)

The unified approach treats the data lake like a graph database flattened into tables. Four universal schemas handle everything:

# Unified 4-file structure
/datalake/
├── entities/2025_Q1_20250115.parquet      # ALL entity types
├── edges/2025_Q1_20250115.parquet         # ALL relationships
├── temporal/2025_Q1_20250115.parquet      # ALL time-series data
└── parameters/2025_Q1_20250115.parquet    # ALL entity-specific properties

The schemas become more generic but more flexible:

from datetime import datetime
from typing import Optional

from pydantic import BaseModel

# entities.parquet - universal entity table
class UnifiedEntity(BaseModel):
    node_id: str
    node_type: str  # "series", "event", "market", "trader"
    polymarket_id: Optional[int] = None
    created_at: datetime
    updated_at: datetime

# parameters.parquet - entity-specific properties as key-value pairs
class UnifiedParameter(BaseModel):
    node_id: str
    parameter_name: str  # "outcomes", "restricted", "polymarket_question"
    parameter_value: str  # JSON-serialized value
    parameter_type: str  # "string", "boolean", "json_array"

# temporal.parquet - all time-series data
class UnifiedTemporal(BaseModel):
    node_id: str
    property_name: str  # "price_yes", "volume_24h", "liquidity"
    property_value: str  # JSON-serialized value
    timestamp: datetime
    source: str

Now the same market data looks like this:

# entities.parquet
node_id="mkt_123", node_type="market", polymarket_id=568058, created_at=...

# parameters.parquet  
node_id="mkt_123", parameter_name="outcomes", parameter_value='["Yes", "No"]', parameter_type="json_array"
node_id="mkt_123", parameter_name="restricted", parameter_value="true", parameter_type="boolean"

# temporal.parquet
node_id="mkt_123", property_name="price_yes", property_value="0.0215", timestamp=...
node_id="mkt_123", property_name="price_no", property_value="0.9785", timestamp=...

This is essentially an Entity-Attribute-Value (EAV) pattern applied to a data lake. It's the kind of design that makes database purists wince, but it felt right so I dove deeper into this rabbit hole...
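
Getting typed values back out means runtime parsing. Since parameter_value is JSON-serialized, a minimal helper is enough (illustrative, not part of the store); parameter_type mostly serves as a sanity check:

import json

def parse_parameter(parameter_value: str, parameter_type: str):
    # parameter_value is JSON-serialized, so json.loads recovers the typed value
    value = json.loads(parameter_value)
    if parameter_type == "boolean" and not isinstance(value, bool):
        raise ValueError(f"expected boolean, got {value!r}")
    return value

parse_parameter('["Yes", "No"]', "json_array")   # -> ['Yes', 'No']
parse_parameter("true", "boolean")               # -> True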

The Performance Reality Check

I ran some benchmarks to see how these approaches perform in practice. The results surprised me.

Query 1: Get all data for a specific market

-- Specialized schema (3 file scans + joins)
SELECT m.*, p.price_yes, p.price_no, v.volume_24h
FROM market_nodes m
JOIN price_properties p ON m.node_id = p.node_id  
JOIN volume_properties v ON m.node_id = v.node_id
WHERE m.polymarket_id = 568058

-- Unified schema (3 file scans + joins, but different pattern)
SELECT e.*, 
       p1.parameter_value as outcomes,
       p2.parameter_value as restricted,
       t1.property_value as price_yes,
       t2.property_value as volume_24h
FROM entities e
JOIN parameters p1 ON e.node_id = p1.node_id AND p1.parameter_name = 'outcomes'
JOIN parameters p2 ON e.node_id = p2.node_id AND p2.parameter_name = 'restricted'  
JOIN temporal t1 ON e.node_id = t1.node_id AND t1.property_name = 'price_yes'
JOIN temporal t2 ON e.node_id = t2.node_id AND t2.property_name = 'volume_24h'
WHERE e.node_type = 'market' AND e.polymarket_id = 568058

Result: Specialized schema wins by a noticeable margin. The joins are simpler and the column pruning is more effective.
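
Part of that margin comes from column pruning: the specialized files let the engine read only the columns the query touches, while the unified tables always drag in the generic key/value columns before filtering. A rough pandas illustration (paths as in the layouts above):

import pandas as pd

# Specialized: read exactly the columns this lookup needs.
markets = pd.read_parquet(
    "/datalake/nodes/market_nodes/2025_Q1_20250115.parquet",
    columns=["node_id", "polymarket_id"],
)
market_ids = markets.loc[markets["polymarket_id"] == 568058, "node_id"]

# Unified: the key/value columns come along for every lookup.
params = pd.read_parquet(
    "/datalake/parameters/2025_Q1_20250115.parquet",
    columns=["node_id", "parameter_name", "parameter_value"],
)
market_params = params[params["node_id"].isin(market_ids)]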

Query 2: Cross-entity analysis - "Show volume trends for all entities created this week"

-- Specialized schema (scan all entity files + union + join)
SELECT entities.node_id, entities.type, v.volume_24h, v.timestamp
FROM (
    SELECT node_id, 'series' as type, created_at FROM series_nodes WHERE created_at > '2025-01-08'
    UNION ALL
    SELECT node_id, 'event' as type, created_at FROM event_nodes WHERE created_at > '2025-01-08'
    UNION ALL
    SELECT node_id, 'market' as type, created_at FROM market_nodes WHERE created_at > '2025-01-08'
) entities
JOIN volume_properties v ON entities.node_id = v.node_id

-- Unified schema (single scan + filter + join)
SELECT e.node_id, e.node_type, t.property_value as volume, t.timestamp
FROM entities e
JOIN temporal t ON e.node_id = t.node_id AND t.property_name = 'volume_24h'
WHERE e.created_at > '2025-01-08'

Result: Unified schema wins by a significant margin. Single table scans beat multiple file unions every time.

The pattern became clear: specialized schemas excel at entity-specific queries, unified schemas excel at cross-entity analysis. Since I'm building for research workloads where cross-entity questions are common, the unified approach started looking more attractive.
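
None of this needs a database server, for what it's worth; any engine that queries parquet in place works. Here's the cross-entity query run directly over the unified files with DuckDB, which is one option among several rather than a requirement of the design:

import duckdb

volume_trends = duckdb.sql("""
    SELECT e.node_id, e.node_type, t.property_value AS volume, t.timestamp
    FROM read_parquet('/datalake/entities/*.parquet') e
    JOIN read_parquet('/datalake/temporal/*.parquet') t
      ON e.node_id = t.node_id AND t.property_name = 'volume_24h'
    WHERE e.created_at > DATE '2025-01-08'
""").df()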

The Hidden Costs

But performance isn't the whole story. Each approach has operational costs that might only surface after you've been running the system for months.

Here are some I could speculate on:

Specialized Schema Pain Points:

  1. Schema proliferation: Every new entity type requires new schemas, new ingestion code, new parquet files. I counted 12 different schemas for what felt like a simple domain.
  2. Cross-cutting queries: Questions like "show me all entities with high volatility" require scanning multiple files and complex unions. These queries are common in research but painful to write.
  3. Schema evolution: Adding a field to markets means updating schemas, migration scripts, and potentially reprocessing historical data.

Unified Schema Pain Points:

  1. Type safety erosion: Everything becomes strings in the parameters table. You lose compile-time validation and gain runtime parsing overhead.
  2. Query complexity: Simple entity queries require multiple joins and careful filtering. What should be a single table scan becomes a join fest.
  3. Storage overhead: The EAV pattern is notoriously space-inefficient. Simple boolean fields become three-column rows.

The Scaling Escape Hatch

Here's where the unified approach gets interesting: when files get too big, you can partition by time instead of restructuring schemas.

# When unified files get too large, partition by time
/datalake/
├── entities/
│   ├── 2025_Q1_week01.parquet    # Jan 1-7
│   ├── 2025_Q1_week02.parquet    # Jan 8-14
│   └── 2025_Q1_week03.parquet    # Jan 15-21
├── temporal/
│   ├── 2025_Q1_day001.parquet    # Jan 1 (high-frequency data)
│   ├── 2025_Q1_day002.parquet    # Jan 2
│   └── 2025_Q1_day003.parquet    # Jan 3

This gives you the best of both worlds: schema flexibility when you're small, performance optimization when you're large. The specialized approach doesn't have this escape hatch; you're committed to the schema structure from day one.
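
In practice the pruning is just filename logic, since the partition is encoded in the file name. A sketch against the daily temporal files above (the day-numbering filter is illustrative):

import glob
import pandas as pd

# Only read the daily temporal partitions that cover the requested window.
wanted_days = {f"day{d:03d}" for d in (1, 2)}   # Jan 1-2
files = [
    path for path in glob.glob("/datalake/temporal/2025_Q1_day*.parquet")
    if any(day in path for day in wanted_days)
]
window = pd.concat((pd.read_parquet(path) for path in files), ignore_index=True)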

Consequences In The Wild

Data Quality: The specialized approach caught more errors at ingestion time, but the unified approach made it easier to spot inconsistencies across entity types. When trader data started showing negative volumes, it was obvious in the unified temporal table but would have been hidden in a trader-specific schema.

Debugging: When ingestion failed, the unified approach made it easier to see what was missing. A single scan of the entities table showed which node types were present. The specialized approach required checking multiple files to understand the failure.
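
Concretely, the "what made it in?" check is a one-liner against the entities file:

import pandas as pd

entities = pd.read_parquet("/datalake/entities/2025_Q1_20250115.parquet")
print(entities["node_type"].value_counts())
# A missing node_type (or a suspiciously low count) points straight at the failed step.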

Research Velocity: This was the clincher. Research questions almost always span entity types. "How does trader behavior correlate with market volatility?" "Which series have the most predictable events?" These questions are natural in the unified schema but require complex joins in the specialized approach. I could explore ideas dramatically faster.

A Decision Framework

I've come to think about this choice through three lenses:

1. Query Pattern Distribution

2. Developer Composition (purely speculative... :P)

3. Schema Stability

For prediction market analytics, I landed on unified schemas. The domain is still evolving (new entity types emerge regularly), I'm research-focused, and cross-entity analysis is the norm rather than the exception.

Implementation Reality

Here's what the unified approach looks like, more or less:

import json
from datetime import datetime
from typing import Dict, List

import pandas as pd


class GraphParquetStore:
    """Store typed graph data as unified parquet files"""

    def store_graph_snapshot(
        self,
        nodes: Dict[str, List[BaseGraphNode]],
        edges: Dict[str, List[BaseGraphEdge]],
        temporal_properties: Dict[str, List[BaseTemporalProperty]],
    ) -> None:

        timestamp = datetime.now().strftime("%Y%m%d_%H%M%S")

        # Flatten all node types into unified entities table
        all_entities = []
        all_parameters = []

        for node_type, node_list in nodes.items():
            for node in node_list:
                # Core entity record
                entity = UnifiedEntity(
                    node_id=node.node_id,
                    node_type=node_type,
                    polymarket_id=getattr(node, 'polymarket_id', None),
                    created_at=node.created_at,
                    updated_at=node.updated_at
                )
                all_entities.append(entity)

                # Entity-specific properties as parameters
                for field_name, field_value in node.model_dump().items():
                    if field_name not in ['node_id', 'created_at', 'updated_at']:
                        param = UnifiedParameter(
                            node_id=node.node_id,
                            parameter_name=field_name,
                            # default=str keeps datetimes/Decimals serializable
                            parameter_value=json.dumps(field_value, default=str),
                            parameter_type=self._infer_type(field_value)
                        )
                        all_parameters.append(param)

        # Store to parquet
        entities_df = pd.DataFrame([e.model_dump() for e in all_entities])
        entities_df.to_parquet(f"{self.datalake_path}/entities/2025_Q1_{timestamp}.parquet")

        parameters_df = pd.DataFrame([p.model_dump() for p in all_parameters])
        parameters_df.to_parquet(f"{self.datalake_path}/parameters/2025_Q1_{timestamp}.parquet")

        # edges and temporal_properties are flattened and written the same way (elided here)
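
The _infer_type helper isn't shown above; a minimal version consistent with the parameter_type labels used earlier might look like this:

# Lives on GraphParquetStore; maps Python values to the coarse parameter_type labels.
def _infer_type(self, value) -> str:
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, (list, tuple)):
        return "json_array"
    return "string"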

The code is surprisingly clean. The complexity moves from schema management to query construction, which feels like the right trade-off for a research-focused workflow.

Some Hard-Won Lessons

  1. Performance isn't everything: The performance hit on entity-specific queries was less important than the massive improvement in research velocity.

  2. Schema evolution matters more than I thought: Adding new entity types in the unified approach takes minutes. In the specialized approach, it's a multi-day project involving schema changes, migration scripts, and testing across the entire pipeline.

  3. My cognitive load is real: Four schemas are easier to keep in my head than twelve, even if each individual schema is more complex. When I'm debugging at 2 AM, simplicity wins.

  4. Escape hatches are valuable: The ability to partition by time when files get large gives me a path to optimize performance without restructuring everything. I'm not locked into my initial decision forever.

  5. Domain maturity affects everything: Early in a domain's lifecycle, flexibility trumps performance. As the domain stabilizes and query patterns become predictable, performance becomes more important.

The unified schema approach isn't right for every use case. If I were building operational dashboards with known query patterns and stable schemas, the specialized approach would serve me better. But for research workloads in evolving domains, the flexibility of unified schemas can be transformative.

Sometimes the "wrong" design is the right choice. The unified schema violates database design principles but enables research patterns that would be painful otherwise. When data exploration is what drives the work, that trade-off makes sense.

My mileage varied. Yours will too. But when it does, at least you'll know why.


Postscript

I got a request to elaborate on this idea. Here's a stub of what the tables can look like:

entities.parquet - Universal entity table:

node_id | node_type | polymarket_id | created_at | updated_at
--------|-----------|---------------|------------|------------
abc123  | series    | 10000         | 2025-01-15 | 2025-01-15
def456  | event     | 34332         | 2025-01-15 | 2025-01-15  
ghi789  | market    | 568058        | 2025-01-15 | 2025-01-15
jkl012  | trader    | NULL          | 2025-01-15 | 2025-01-15

parameters.parquet - Entity-specific properties:

node_id | parameter_name    | parameter_value           | parameter_type
--------|-------------------|---------------------------|---------------
abc123  | recurrence        | "weekly"                  | string
abc123  | series_type       | "single"                  | string
abc123  | ticker            | "ELON_TWEETS"             | string
def456  | resolution_source | "polymarket_official"     | string
def456  | negative_risk     | false                     | boolean
ghi789  | outcomes          | ["Yes", "No"]             | json_array
ghi789  | restricted        | true                      | boolean
jkl012  | trader_address    | "0x1234...abcd"           | string

edges.parquet - Universal relationship table:

edge_id | source_node_id | target_node_id | relationship_type      | weight | created_at
--------|----------------|----------------|------------------------|--------|------------
edge1   | abc123         | def456         | SERIES_CONTAINS_EVENT  | 1.0    | 2025-01-15
edge2   | def456         | ghi789         | EVENT_CONTAINS_MARKET  | 1.0    | 2025-01-15
edge3   | jkl012         | ghi789         | TRADER_TRADES_MARKET   | 0.8    | 2025-01-15

temporal.parquet - Universal time-series table:

node_id | property_name | property_value | timestamp           | source
--------|---------------|----------------|---------------------|----------------
ghi789  | price_yes     | 0.0215         | 2025-07-28T00:19:02 | polymarket_api
ghi789  | price_no      | 0.9785         | 2025-07-28T00:19:02 | polymarket_api
ghi789  | volume_24h    | 2318680.12     | 2025-07-28T00:01:04 | polymarket_api
ghi789  | liquidity     | 265041.98      | 2025-07-28T00:01:04 | polymarket_api
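
To close the loop on the original goal ("rebuild my volatility model using only data from Q3"), reconstructing a market's price series from the temporal table is a filter plus a pivot. A pandas sketch using the example rows above (path illustrative):

import pandas as pd

temporal = pd.read_parquet("/datalake/temporal/2025_Q1_20250115.parquet")

subset = temporal[
    (temporal["node_id"] == "ghi789")
    & (temporal["property_name"].isin(["price_yes", "price_no"]))
].copy()
subset["value"] = subset["property_value"].astype(float)

prices = (
    subset.pivot_table(index="timestamp", columns="property_name",
                       values="value", aggfunc="last")
    .sort_index()
)
# One row per timestamp with price_yes / price_no columns, ready for a volatility calc.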