Category

Data Engineering

Pipelines, warehouses, GPU-accelerated query engines, and big-data systems.

9 posts

samkhya v1.0: Plug Claude, GPT-4o-mini, or Local Ollama Into Your SQL Query Optimizer
Data Engineering · 16 min read

samkhya v1.0 ships an LLM-pluggable corrector backend for embedded analytical engines: DataFusion, DuckDB, Polars, Postgres, Iceberg, and gpudb. Plug Claude, GPT-4o-mini, or local Ollama into the cardinality-estimation slot via a simple HTTP wire contract (Python FastAPI and Node TypeScript reference servers ship in the box). Every LLM output is clamped from above by a provable pessimistic ceiling (LpJoinBound, 40.95× tighter than the 2008 AGM bound), so the LLM can never make your plan worse than the engine's native estimate. Transport-floor latency measures 0.07–0.11 ms at P95; live-LLM end-to-end results are marked PROJECTED pending API budget.

Why I built a GPU SQL engine in 2026 — when every other one died
Data Engineering · 27 min read

Every standalone GPU database built between 2013 and 2024 was acqui-hired or pivoted. So why ship gpudb in 2026? Because nobody had wired Apple Silicon's unified memory into a SQL engine, and because DuckDB hands you a hundred-thousand-user distribution channel, so you don't have to write a database from scratch.

Databricks vs Snowflake vs The New Wave: The Data Engineering Paradigm Shift
Data Engineering · 5 min read

Snowflake just posted $4.68B in FY26 revenue at 29% growth. Databricks crossed $5.4B ARR in February at 65% growth. And neither chart explains why the most interesting data infrastructure being shipped in 2026 is single-process, embeddable, and runs on a laptop.

Iceberg's Puffin Sidecars: Portable Stats for the Open Lakehouse
Data Engineering · 10 min read

Apache Iceberg's Puffin file format is the most strategically important subsystem nobody is talking about. It is the mechanism by which an open lakehouse can carry warehouse-grade statistics across vendors — write the sketch once in Trino, read it tomorrow in Snowflake, plan a join correctly on the first cold query.

DuckDB Ate the Modern Data Stack
Data Engineering · 6 min read

An embedded analytical engine with no servers, no cluster, and no migration cost quietly displaced Spark for small data and Snowflake XS for medium data. MotherDuck closed its Series B at a $400M post-money valuation. Here's the part everyone undercounts.

Iceberg, Delta, Hudi: Pick One in 2026 and Move On
Data Engineering · 5 min read

The table-format wars are functionally over. Iceberg won on interop. Delta won on installed base. Hudi won on streaming upserts. The decision tree for a new project in 2026 is shorter than the comparison-blog industry wants you to believe.

Polars vs DuckDB in 2026: When To Pick Which
Data Engineering · 9 min read

Polars ate Pandas. DuckDB ate everything below the warehouse. The 2023 expectation was a cage match between two in-process analytical engines; the 2026 reality is that they ate different cakes, and the decision is mostly about whether your team thinks in DataFrames or SQL.

Vector Indexes in OLAP Engines: 2025 Is Where Search Ate Analytics
Data Engineering · 10 min read

DuckDB, ClickHouse, Snowflake, BigQuery, Postgres — by late 2025 every serious analytical engine ships a native vector index. That wasn't an AI-hype reflex. It was the realization that embedding search is just a column scan with a different distance function, and the warehouse-plus-vector-DB split was operational waste for the 90% case.

Apache Arrow IPC vs JSON: The Numbers Behind the Switch
Data Engineering · 10 min read

Most data-API traffic in 2025 still moves as JSON because humans need to read it. But for any system actually shipping columnar batches between services (analytical pipelines, feature stores, embedding services, MCP-style tool calls), Arrow IPC is 3–30× faster end-to-end. This post gives an honest accounting of when the switch pays off and when JSON is still correct.
