Databricks Doubles Down on Liquid Clustering as Lakehouse Layout Default

Databricks has published a comprehensive case for retiring Hive-style partitioning in favour of Liquid Clustering, citing petabyte-scale customers seeing 5x to 7x query speedups and 27 percent storage reductions.

Published June 1, 2026

Databricks dropped a long-form engineering post on June 1, 2026, methodically dismantling eight common defences of Hive-style partitioning and arguing that Liquid Clustering should be the default layout for any new Delta or Iceberg table. The post is unusual for the Databricks blog in both length and tone. It reads less like a feature announcement and more like a position paper, and it lands two weeks before the Data + AI Summit, where Liquid Clustering is widely expected to be repositioned as the recommended default for all Unity Catalog managed tables.

The technical core of the argument is that partitioning in Delta and Iceberg has not relied on directory pruning for years. Both formats prune on per-file statistics stored in table metadata (the transaction log in Delta, manifest files in Iceberg), which means Liquid Clustering uses the same pruning mechanism as partitioning without any of the cardinality, file-size or evolution constraints. The post walks through benchmarks showing that Liquid is faster than partitioning at both low and high cardinality, supports the same metadata-only operations such as DELETE and COUNT, and scales cleanly to petabyte-sized tables.
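The pruning mechanism described above can be illustrated with a toy model. This is a minimal sketch, not either format's actual reader logic: the `FileStats` class, the file names and the value ranges are all hypothetical, and real metadata tracks statistics for many columns at once.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileStats:
    """Min/max of one column for one data file, as table metadata might record it."""
    path: str
    min_value: int
    max_value: int


def prune_files(files, lower, upper):
    """Keep only files whose [min, max] range overlaps the query range [lower, upper]."""
    return [f for f in files if f.max_value >= lower and f.min_value <= upper]


# Hypothetical metadata for a clustered table: each file covers a narrow key range.
# Note that the paths carry no partition directories at all.
clustered = [
    FileStats("part-0001.parquet", 0, 99),
    FileStats("part-0002.parquet", 100, 199),
    FileStats("part-0003.parquet", 200, 299),
    FileStats("part-0004.parquet", 300, 399),
]

# The same key space written without clustering: every file spans almost everything.
unclustered = [
    FileStats("part-0001.parquet", 0, 390),
    FileStats("part-0002.parquet", 5, 399),
    FileStats("part-0003.parquet", 2, 380),
    FileStats("part-0004.parquet", 10, 395),
]

# A query such as "WHERE key BETWEEN 120 AND 180" only needs overlapping files.
selected_clustered = prune_files(clustered, 120, 180)
selected_unclustered = prune_files(unclustered, 120, 180)

print("clustered files read:", len(selected_clustered))
print("unclustered files read:", len(selected_unclustered))
```

In this model the directory layout never enters the decision. How many files are skipped depends only on how tightly each file's value range is packed, which is the property a clustering layout tries to improve.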

The customer evidence is what gives the post its weight. Arctic Wolf reports that its 3.8-petabyte security telemetry table, ingesting more than a trillion events per day, saw 90-day queries drop from 51 seconds to 6.6 seconds after migrating to Liquid Clustering on Unity Catalog managed tables with Predictive Optimization enabled. File count halved from 4 million to 2 million, and data freshness moved from hours to minutes. Bolt, using the Liquid Conversion private preview, saw write throughput increase 138 percent and read times drop up to 63 percent on a terabyte-scale change data capture table, with zero downtime during the conversion.
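The figures above mix durations, percentage reductions and percentage increases, which are easy to misread as one another. The following small sketch converts each into a plain multiplier. The helper names are illustrative and the inputs are simply the numbers quoted in the paragraph.

```python
def speedup_from_durations(before_s, after_s):
    """Speedup factor when a query goes from before_s to after_s seconds."""
    return before_s / after_s


def speedup_from_time_reduction(percent):
    """Speedup factor when run time drops by the given percentage."""
    return 1 / (1 - percent / 100)


def factor_from_increase(percent):
    """Multiplier corresponding to a percentage increase."""
    return 1 + percent / 100


# Arctic Wolf: 90-day queries went from 51 s to 6.6 s.
print(f"Arctic Wolf query speedup: {speedup_from_durations(51, 6.6):.1f}x")

# Bolt: read times dropped by up to 63 percent.
print(f"Bolt best-case read speedup: {speedup_from_time_reduction(63):.1f}x")

# Bolt: write throughput increased by 138 percent.
print(f"Bolt write throughput: {factor_from_increase(138):.2f}x")
```

A 63 percent drop in read time is a larger effect than it may sound, because the remaining 37 percent of the original run time corresponds to roughly a 2.7x speedup.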

For us at Carrefour, the Bolt case is the more interesting one. We have several terabyte-scale CDC tables landing from our SAP estate into Delta on Databricks, all of them still partitioned by ingestion date because that was the recommended pattern when we first built them in 2022. The cost of rewriting those tables to remove partitioning has always been the blocker, and the new Liquid Conversion command, exposed as ALTER TABLE REPLACE PARTITIONED BY WITH CLUSTER BY, is exactly the in-place migration path we have been waiting for. Because it runs alongside live ingestion without downtime, it is operationally feasible to schedule during a normal change window rather than a maintenance outage.

The two forward-looking items in the post are Co-Clustered Joins and broader Liquid Conversion availability. Co-Clustered Joins, also in private preview, remove the shuffle stage when joining two Liquid tables on their clustering columns. The published benchmark shows a 51 percent latency reduction and an 87 percent reduction in shuffle volume on a representative workload. For analytics teams running sort-merge and shuffle hash joins on multi-terabyte fact tables, that is a meaningful saving in both wall-clock time and cluster cost.
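The post as summarised here does not describe how Co-Clustered Joins are implemented, so the following is only a toy illustration of the general idea behind a shuffle-free join, under the assumption that both inputs are already laid out in matching key ranges. The function names and data are hypothetical, and nothing here reflects Databricks internals.

```python
from collections import defaultdict


def _local_join(left, right):
    """Hash join two small lists of (key, value) pairs that are already co-located."""
    index = defaultdict(list)
    for key, value in right:
        index[key].append(value)
    return [(key, lv, rv) for key, lv in left for rv in index.get(key, ())]


def shuffle_join(left_rows, right_rows, num_buckets):
    """Redistribute both sides by key hash, then join bucket by bucket.

    Returns (joined_rows, rows_moved); rows_moved counts every row that
    had to be redistributed before the join could start.
    """
    left_buckets = defaultdict(list)
    right_buckets = defaultdict(list)
    for key, value in left_rows:
        left_buckets[hash(key) % num_buckets].append((key, value))
    for key, value in right_rows:
        right_buckets[hash(key) % num_buckets].append((key, value))
    joined = []
    for bucket in range(num_buckets):
        joined.extend(_local_join(left_buckets[bucket], right_buckets[bucket]))
    return joined, len(left_rows) + len(right_rows)


def colocated_join(left_chunks, right_chunks):
    """Join chunk i of the left side with chunk i of the right side.

    Assumes (hypothetically) that both sides were written so that chunk i
    covers the same key range on each side, so no redistribution is needed.
    """
    joined = []
    for left, right in zip(left_chunks, right_chunks):
        joined.extend(_local_join(left, right))
    return joined, 0


# Two hypothetical tables whose chunks cover keys 0-9 and 10-19 respectively.
left_chunks = [[(1, "a"), (2, "b")], [(10, "c")]]
right_chunks = [[(1, "x"), (3, "y")], [(10, "z"), (10, "w")]]

flat_left = [row for chunk in left_chunks for row in chunk]
flat_right = [row for chunk in right_chunks for row in chunk]

shuffled, moved_with_shuffle = shuffle_join(flat_left, flat_right, num_buckets=4)
colocated, moved_without_shuffle = colocated_join(left_chunks, right_chunks)

print(sorted(shuffled) == sorted(colocated))
print(moved_with_shuffle, moved_without_shuffle)
```

Both strategies produce the same rows. The difference is that the co-located version never moves a row between workers, which is the cost the quoted shuffle-volume reduction refers to.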

The strategic context is the ongoing competition with Snowflake on the lakehouse format question. Snowflake's Polaris and Horizon catalog work has been pulling Iceberg adoption inside enterprises that were previously Delta-first, and Databricks has responded by aggressively supporting Iceberg as a peer format under Unity Catalog. The Liquid Clustering push fits that strategy because it is a table layout optimisation that works on both Delta and Iceberg, which lets Databricks position the optimisation as a platform capability rather than a form of format lock-in.

For operators, the practical takeaway is to stop creating new tables with Hive-style partitioning. Liquid Clustering has been GA since 2024, the tooling around Predictive Optimization and Automatic Liquid Clustering means that key selection no longer requires deep expertise, and the performance and operational benefits compound at scale. Migrations of existing partitioned tables should be planned around the availability of the Liquid Conversion private preview, which Databricks is expected to open more broadly at the Summit.
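For new tables, the change amounts to swapping a PARTITIONED BY clause for a CLUSTER BY clause in the DDL. Below is a minimal sketch of a helper that builds such a statement as a string. The table and column names are hypothetical, the helper itself is not part of any Databricks library, and the generated SQL follows the CREATE TABLE ... CLUSTER BY form documented for Databricks, which should be checked against the runtime version in use.

```python
def clustered_table_ddl(table, columns, cluster_by=None):
    """Build a CREATE TABLE statement that uses CLUSTER BY instead of PARTITIONED BY.

    table      -- fully qualified table name (hypothetical in the example below)
    columns    -- list of (name, sql_type) pairs
    cluster_by -- clustering column names, or None to request CLUSTER BY AUTO
    """
    column_defs = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    if cluster_by is None:
        # Leaves key selection to the platform (Automatic Liquid Clustering).
        clause = "CLUSTER BY AUTO"
    else:
        known = {name for name, _ in columns}
        missing = [c for c in cluster_by if c not in known]
        if missing:
            raise ValueError(f"unknown clustering columns: {missing}")
        clause = f"CLUSTER BY ({', '.join(cluster_by)})"
    return f"CREATE TABLE {table} ({column_defs}) {clause}"


# Hypothetical CDC table; names are illustrative only.
schema = [("order_id", "BIGINT"), ("event_ts", "TIMESTAMP")]

explicit_ddl = clustered_table_ddl("main.sales.orders_cdc", schema, ["order_id", "event_ts"])
auto_ddl = clustered_table_ddl("main.sales.orders_cdc", schema)

print(explicit_ddl)
print(auto_ddl)
```

On a Databricks cluster the resulting string could be passed to `spark.sql(...)`. The helper deliberately does not attempt the in-place conversion of an existing partitioned table, since that command is still in private preview and its exact grammar is not given in the post summary.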

The one caveat is that not every engine outside Databricks handles Liquid Clustered tables optimally yet. The post correctly points out that Liquid is a write-side optimisation producing standard Parquet files, so any reader that supports file-level statistics will benefit. But engines like Trino and vanilla Apache Spark still trail Databricks Photon on overall query performance, which means teams running heterogeneous query layers should validate the gains in their own environment before committing to a platform-wide migration.

The governance angle is also worth flagging. Liquid Clustering changes the file layout of a table without changing its schema, which means any downstream consumer that depends on file naming conventions, partition directory listings or external metadata stores will need to be reviewed. Most well-behaved consumers using the Delta or Iceberg APIs will not notice, but legacy ETL jobs that read Parquet files directly through HDFS-style paths will break. Inventorying those consumers before a large migration is the kind of unglamorous work that prevents weekend incidents.
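A first pass at that inventory can be automated. The sketch below assumes a hypothetical mapping from job name to the input locations each job reads, and flags any job whose inputs contain Hive-style `key=value` directory segments or point at individual Parquet files. The job names and paths are invented for illustration, and a real inventory would also need to cover external metadata stores.

```python
import re

# Matches a Hive-style partition directory segment such as "ingest_date=2024-01-01".
PARTITION_SEGMENT = re.compile(r"(?:^|/)[A-Za-z_][A-Za-z0-9_]*=[^/]+(?:/|$)")


def depends_on_layout(location):
    """True if a location looks like a partition directory or a raw Parquet file."""
    return bool(PARTITION_SEGMENT.search(location)) or location.endswith(".parquet")


def jobs_to_review(job_inputs):
    """Return the sorted names of jobs with at least one layout-dependent input."""
    return sorted(
        job
        for job, locations in job_inputs.items()
        if any(depends_on_layout(loc) for loc in locations)
    )


# Hypothetical inventory: job name -> locations or table names the job reads.
job_inputs = {
    "nightly_export": ["/mnt/lake/orders/ingest_date=2024-01-01/"],
    "legacy_loader": ["/mnt/lake/orders/part-0001.parquet"],
    "bi_refresh": ["main.sales.orders"],
}

print(jobs_to_review(job_inputs))
```

Jobs that reference the table by name, like the third entry, go through the table format's API and are the ones least likely to be affected by a layout change.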

For our Ahold Delhaize colleagues running mixed Databricks and BigQuery workloads, the cross-engine read story is the most relevant point. BigQuery's external Iceberg support has matured significantly in the past year, and a Liquid Clustered Iceberg table written from Databricks should now be queryable from BigQuery with comparable pruning efficiency. That opens the door to architectures where Databricks owns the write path and BigQuery serves the analytics workload, without the cost of dual ingestion. We expect this pattern to become more common as Iceberg v3 adoption spreads.

Tagged: #data #databricks #lakehouse #news