Schedule - PGConf.dev 2026

PostgreSQL as an open data format: 100x faster TPC-H queries through direct storage reads

Date: 2026-05-21
Time: 11:00–11:50
Room: Fletcher
Level: Advanced

PostgreSQL is not typically thought of as an analytical system. Historically, people forked and repackaged PostgreSQL to add analytical capabilities, including native columnar formats, distributed massively parallel processing, and vectorized query executors. However, these systems deviated significantly from core PostgreSQL and became incompatible with standard PostgreSQL tooling. More recently, several projects have leveraged the PostgreSQL extension system to accelerate OLAP-style queries through federation and integration with DuckDB, but these still lag in scale behind purpose-built analytical systems.

One challenge with current approaches to accelerating OLAP queries is the set of limitations imposed by the PostgreSQL architecture, which is better optimized for OLTP-style queries. While several PostgreSQL extensions have added support for a columnar format, the PostgreSQL executor still lacks capabilities required for large-scale analytics, including massively parallel processing and vectorized execution.

What if we flip the model on its head: instead of going through the PostgreSQL execution layer, we read data directly from storage?

Modern analytics has moved towards lakehouse architectures, which adopt open formats like Parquet for files and Iceberg for tables that are organized and governed by a catalog service. We can borrow these concepts to build a “lakebase,” where we separate storage and compute for operational databases to allow for serverless reads of an open format from different engines.

To make this less abstract, think of the PostgreSQL filesystem as an open data format (which it is!) that can be used as a base to build a modern OLAP engine that is fully compatible with PostgreSQL transaction semantics and doesn’t require PostgreSQL compute. This allows you to build an engine that runs OLAP queries by directly accessing PostgreSQL data files, without impacting the performance of operational workloads or requiring additional replicas dedicated to analytics.
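
As a rough illustration of what "reading PostgreSQL files directly" involves at the lowest level, here is a minimal Python sketch that decodes the 24-byte page header and the line pointer array of a single heap page, following the page layout documented in the PostgreSQL manual. It is not the project presented in this talk. It assumes the default 8 kB block size and a little-endian host, and the names `PageHeader`, `parse_page_header`, and `parse_line_pointers` are hypothetical. It deliberately ignores tuple visibility (MVCC), TOAST, and pages not yet flushed to disk, all of which a real engine would have to handle.

```python
import struct
from typing import List, NamedTuple, Tuple

PAGE_SIZE = 8192               # assumption: default BLCKSZ
HEADER_FORMAT = "<IIHHHHHHI"   # assumption: little-endian host, no padding
HEADER_SIZE = struct.calcsize(HEADER_FORMAT)  # 24 bytes
LINE_POINTER_SIZE = 4          # each ItemIdData entry is 4 bytes


class PageHeader(NamedTuple):
    lsn: int             # pd_lsn: WAL position of the last change to the page
    checksum: int        # pd_checksum
    flags: int           # pd_flags
    lower: int           # pd_lower: offset to start of free space
    upper: int           # pd_upper: offset to end of free space
    special: int         # pd_special: offset to start of special space
    page_size: int       # high byte of pd_pagesize_version
    layout_version: int  # low byte of pd_pagesize_version
    prune_xid: int       # pd_prune_xid


def parse_page_header(page: bytes) -> PageHeader:
    """Decode the fixed-size header at the start of one page."""
    (xlogid, xrecoff, checksum, flags, lower, upper,
     special, size_version, prune_xid) = struct.unpack_from(HEADER_FORMAT, page, 0)
    return PageHeader(
        lsn=(xlogid << 32) | xrecoff,
        checksum=checksum,
        flags=flags,
        lower=lower,
        upper=upper,
        special=special,
        page_size=size_version & 0xFF00,
        layout_version=size_version & 0x00FF,
        prune_xid=prune_xid,
    )


def parse_line_pointers(page: bytes, header: PageHeader) -> List[Tuple[int, int, int]]:
    """Return (offset, flags, length) for each line pointer on the page."""
    count = (header.lower - HEADER_SIZE) // LINE_POINTER_SIZE
    pointers = []
    for i in range(count):
        (raw,) = struct.unpack_from("<I", page, HEADER_SIZE + i * LINE_POINTER_SIZE)
        offset = raw & 0x7FFF           # lp_off: low 15 bits
        flags = (raw >> 15) & 0x3       # lp_flags: next 2 bits
        length = (raw >> 17) & 0x7FFF   # lp_len: high 15 bits
        pointers.append((offset, flags, length))
    return pointers
```

To try it, read the first `PAGE_SIZE` bytes of a relation file in a test cluster's data directory and pass them to `parse_page_header`. Everything beyond this point, such as deciding which tuples are visible to a given snapshot, is where the real work lies.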

We begin this talk by outlining the state of analytics in PostgreSQL: how well it works today, what people have done to accelerate analytical queries through extensions, and some of the ongoing challenges. We then review how the lakehouse works, covering key concepts like open file and table formats, why catalogs matter, and how engines like Spark can leverage these to accelerate performance while still respecting transactional semantics and access control. Next, we introduce a new open source project that can read PostgreSQL files directly from storage, outside of the PostgreSQL process, and use a look at the project internals to explain its architecture and the benefits it has for running fast OLAP queries directly on PostgreSQL files. Finally, we demo this project using an integration with an OLAP query engine that shows up to 100x faster TPC-H performance on PostgreSQL data without impacting the running PostgreSQL instances!

Speakers

Hristo Stoyanov
Jonathan Katz