Publications

Research papers, patents, and technical whitepapers.

preprint 2026 | theaivibe.org / GitHub

samkhya: An LLM-Pluggable Corrector Backend for Embedded SQL Query Optimizers, with a Provable Safety Envelope

Prateek Singh

samkhya is an engine-agnostic Rust SDK that brings portable, feedback-driven cardinality correction to embedded analytical engines: DataFusion, DuckDB, Polars, Postgres, Iceberg, and the author's prior GPU-accelerated extension gpudb. The contribution is fourfold. First, a pluggable Corrector trait shipped with four production backends in v1.0: a sub-MB gradient-boosted-tree default; an opt-in TabPFN-2.5 foundation-tabular-model backend (measured P95 31.15 ms at B=8 L=128 on RTX 4090 Laptop); an opt-in LLM-via-HTTP backend (LlmHttpCorrector, gated on the llm_http Cargo feature) with documented support for Anthropic Claude (claude-opus-4-7, claude-sonnet-4-6), OpenAI GPT-4o-mini, and local Ollama (llama3.2:1b), plus two reference servers shipped in the repository (Python FastAPI, the canonical one, and Node TypeScript); and a dummy echo backend that produced the H1-A transport-floor PASS at P95 0.07–0.11 ms across batch sizes 1/4/8/16/32. Live-LLM end-to-end latency cells are explicitly marked PROJECTED pending API-key budget and a 30-trial measurement campaign: the mechanism ships and is measured at the transport floor, while the headline live-LLM numbers are next-revision work. Second, an LpJoinBound pessimistic envelope that is strictly tighter than the Atserias-Grohe-Marx AGM bound (PODS 2008) on every cell of the star-5 join-topology grid (Wilcoxon W=0, p=1.73×10⁻⁶, n=30), with a measured 40.95× wallclock speedup vs native DataFusion 46 LpBound tightness (BCa 95% CI [30.93, 47.45]); the same envelope clamps every corrector output, so a hallucinating LLM cannot produce a worse plan than the engine's native estimate. Third, a portable-stats layer built on Iceberg Puffin sidecars with versioned KIND-tagged blobs for five classical sketch families (HLL, Bloom, Count-Min, equi-depth histogram, 2D correlated histogram).
Fourth, and equally important, an honest measurement disclosure. Every pre-registered headline upper bound (≥1.6× join-heavy, ≥1.35× aggregate, ≥1.50× overall) was falsified by the WAVE4-F head-to-head against native DataFusion 46 on the IMDb Join-Order-Benchmark Slow subset; the measured geometric mean is 1.038× wallclock (BCa 95% CI [1.026, 1.056], Wilcoxon p=3.00×10⁻⁶, BH-FDR rejects 24 of 55 cells, record 17 wins / 38 ties / 0 losses). The effect is statistically real and the never-regress safety guarantee held under measurement, but the effect size is small and is reported as such. samkhya v1.0.0 ships as a 13-crate Cargo workspace with ~266 #[test] blocks, 17 property tests, 31M cargo-fuzz executions with zero crashes, and an ACM Artifact Evaluation v1.1 reviewer entry. Apache-2.0 single-license with explicit §3 patent grant.
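
The clamping idea in the abstract above (every corrector output is forced inside a pessimistic envelope) can be illustrated with a minimal Rust sketch. This is not the samkhya API: the function name clamp_to_envelope, its parameters, the choice of 0 as the lower edge, and the fallback to the native estimate on non-finite input are all assumptions made for illustration only.

```rust
/// Hypothetical helper (not part of samkhya): force a corrector's raw
/// cardinality estimate into the interval [0, upper_bound].
///
/// Assumptions made for this sketch:
/// - `upper_bound` stands in for a pessimistic join-size bound;
/// - `native` is the engine's own estimate, used as a fallback;
/// - a non-finite corrector output (NaN, +/-inf) is treated as unusable.
pub fn clamp_to_envelope(raw: f64, native: f64, upper_bound: f64) -> f64 {
    // If the envelope itself is unusable, do not trust the corrector at all.
    if !upper_bound.is_finite() || upper_bound < 0.0 {
        return native;
    }
    // Unusable corrector output: fall back to the native estimate,
    // still kept inside the envelope.
    if !raw.is_finite() {
        return native.max(0.0).min(upper_bound);
    }
    // Normal case: clip the corrector output to [0, upper_bound].
    raw.max(0.0).min(upper_bound)
}
```

Under these assumptions, an absurd estimate such as 1e12 rows against an envelope of 1000 is returned as 1000, and a NaN falls back to the native estimate. Whatever the backend emits, the planner only ever sees a value inside the envelope.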

preprint 2026 | theaivibe.org / GitHub

gpudb: A GPU-Resident Execution Engine for DuckDB on NVIDIA CUDA and Apple Silicon Metal

Prateek Singh

We present gpudb, a DuckDB extension that adds GPU-resident analytical operators with two production backends — NVIDIA CUDA and Apple Silicon Metal — sharing one C++ codebase. To our knowledge it is the first published SQL execution engine that targets Apple GPUs. We describe the four-layer architecture, the CUDA and Metal kernel designs (including a multi-aggregate fusion pass that reaches 87% of LPDDR5X peak bandwidth on M4 Max), a cardinality-aware hybrid CPU/GPU planner that beats both pure-CPU and pure-GPU dispatch at the 1M-unique-key crossover, and an algorithm playbook for SQL window functions that the strongest competing GPU OLAP engine (Sirius, CIDR 2026) does not ship. On TPC-H SF10 lineitem queries, multi-aggregate fusion delivers 22-25× speedups over a 16-thread DuckDB CPU baseline, hash-join probe at 97% selectivity runs 3.7× wall-clock and 107× kernel-only on CUDA, and GROUP BY at 50M-1B rows × 1M unique keys is 3-9× faster on M4 Max. We discuss the GPU-database commercial graveyard of 2013-2024 (HEAVY.AI, BlazingSQL, Voltron Data, Brytlyt, Kinetica), the architectural openings this leaves, and why a DuckDB-extension shape — not a new database — is the only commercially viable position for a 2026 GPU SQL project.