Today, more and more organizations are adopting the lakehouse pattern, which combines the best elements of data lakes and data warehouses.

But first, let us provide you with some context.

Let’s first talk about data warehouses

We have all worked with data warehouses. They are purpose-built for BI and reporting: organizations extract data from operational systems and load it into data warehouses for analytics. Over time they became essential, and today most organizations run several of them.

However, data warehouses were not built for modern data use cases. They offer very limited data science and machine learning capabilities, they provide poor support for video, audio and text, and, last but not least, they store data in proprietary formats. To overcome these limitations, many leading enterprises adopted open-format data lakes built on object stores such as ADLS, S3, or GCS.

The rise of data lakes

Data lakes emerged around a decade ago. They can indeed handle all your data, which makes them well suited to data science and machine learning use cases. Unfortunately, data lakes cannot fully support data warehousing and BI workloads: they are complex to set up, slow to query, and offer poor data quality controls.

The coexistence of data lakes and warehouses

Given these complementary strengths, organizations require both a data warehouse and a data lake, as they serve different needs and use cases. As a result, most organizations extract data from their data lakes and load it into data warehouses for BI and analytics.

 

Figure 1 – Two-tier data architecture with a separate lake and warehouse

 

This architecture is not ideal. First, it leads to two copies of your data: one in the data lake and another in the data warehouse. Second, when you run BI on the data warehouse, that data is stale, because the most recent version of it lives in the data lake. Finally, it requires rock-solid data ops to keep data flowing from the data lake to the data warehouse.

But what if we could replace this two-tier architecture with one that combines the benefits of data lakes (low cost, open formats and direct access to data) with the management and performance features of traditional analytical DBMSs, i.e. data warehouses?

The Lakehouse architecture

This architecture combines the best of both worlds: it provides a single source of truth for all your data while supporting streaming analytics, BI, data science and machine learning on top of it. Organizations can thus manage a single consolidated system that democratizes access to all their data.

 

Figure 2 – Lakehouse architecture

 

It all starts with a data lake that holds all your data, plus a metadata layer on top of the underlying object store that raises its abstraction level to implement ACID transactions and other management features such as schema enforcement, audit history, and upsert and delete operations. One of the best-known examples is Delta Lake, an open-source storage format and metadata layer released by Databricks.
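To make that concrete, here is a minimal PySpark sketch using the open-source delta-spark package. The storage path, table columns and sample rows are hypothetical; the point is to show a schema-enforced Delta write, an ACID upsert (MERGE) and a look at the audit history described above.

# Minimal Delta Lake sketch (PySpark + the delta-spark pip package).
# The path and data below are illustrative placeholders.
from pyspark.sql import SparkSession
from delta import configure_spark_with_delta_pip
from delta.tables import DeltaTable

builder = (
    SparkSession.builder.appName("lakehouse-demo")
    .config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
    .config("spark.sql.catalog.spark_catalog",
            "org.apache.spark.sql.delta.catalog.DeltaCatalog")
)
spark = configure_spark_with_delta_pip(builder).getOrCreate()

path = "/mnt/lake/customers"  # hypothetical object-store location

# Write a Delta table; schema enforcement rejects mismatched writes by default.
spark.createDataFrame(
    [(1, "alice"), (2, "bob")], ["id", "name"]
).write.format("delta").mode("overwrite").save(path)

# ACID upsert (MERGE): update matching rows, insert the new ones.
updates = spark.createDataFrame([(2, "bobby"), (3, "carol")], ["id", "name"])
target = DeltaTable.forPath(spark, path)
(
    target.alias("t")
    .merge(updates.alias("u"), "t.id = u.id")
    .whenMatchedUpdateAll()
    .whenNotMatchedInsertAll()
    .execute()
)

# Audit history: every transaction is recorded in the Delta log.
target.history().select("version", "operation", "timestamp").show()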

Because the table state lives in this metadata layer, users do not have to keep servers running to maintain and refresh Delta Lake tables; they only need to launch compute when running queries, and so enjoy the benefits of decoupled storage and compute.

While Delta Lake ensures high-quality, reliable data, you still need a query engine to serve your analytical use cases. Databricks recently introduced Delta Engine, a query engine built from the ground up to bring even better performance to Delta Lake on Databricks. External query engines such as Presto, Athena, Redshift Spectrum, Snowflake and Apache Hive can also access Delta tables.
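As a rough illustration of that interoperability, the same table can be queried with plain SQL. The snippet below reuses the Spark session and hypothetical path from the earlier sketch; other engines typically reach Delta tables through a connector or a generated manifest.

# Query the Delta table with plain SQL using Spark's path-based table syntax.
recent = spark.sql("SELECT id, name FROM delta.`/mnt/lake/customers` WHERE id > 1")
recent.show()

# Time travel: read an earlier version recorded in the table's audit history.
v0 = spark.read.format("delta").option("versionAsOf", 0).load("/mnt/lake/customers")
v0.show()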

In summary, although the data lakehouse concept is still at an early stage, we believe it will become the standard way to keep a single source of truth that any data processing and serving tool can work against. Companies are gradually moving to the lakehouse architecture because of its flexibility, cost efficiency and reliance on open standards and open technology.

At Link Redglue, we have been implementing this pattern in our customers’ data architectures since mid-2019. It requires less effort to implement and maintain, so the time to value is much shorter with this new approach.

 

Hugo Almeida

Lead Data Engineer