Data Lakes

We build Lakes, not Swamps

The technical architecture of the data lake has coalesced around a handful of foundational platforms, spanning storage and file management, governance, and the database architecture itself. The main benefit of a data lake is the centralization of disparate content sources.

While it may not be exclusively synonymous with the data lake, the Apache Hadoop Distributed File System (HDFS) is one of the dominant data lake storage platforms. The introduction of Hadoop YARN (Yet Another Resource Negotiator) in 2012 revolutionized the HDFS ecosystem, adding capabilities for real-time and near-real-time processing.

With Hadoop and HDFS dominant in data lake storage, and with different solutions now in place both in the cloud and for on-premises lakes, what about data governance?

In a data lake full of raw data meant to be accessible by a wide range of users, how could the security and provenance of the data be assured?

The answer is that new policies are needed: policies that enable users and stay out of their way while still managing risk. Otherwise you are applying the same data policies as the old, heavily curated enterprise data warehouse model.
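One way to picture "enable users but still manage risk" is an open-by-default access policy that restricts only what is tagged as sensitive. The sketch below is purely illustrative; the names (`Dataset`, `User`, `can_read`, the `pii` tag, and the `privacy_approved` role) are hypothetical, not part of any real governance product.

```python
from dataclasses import dataclass, field

@dataclass
class Dataset:
    name: str
    zone: str                          # e.g. "raw" or "curated"
    tags: set = field(default_factory=set)

@dataclass
class User:
    name: str
    roles: set = field(default_factory=set)

def can_read(user: User, dataset: Dataset) -> bool:
    """Open by default; restrict only datasets tagged as sensitive."""
    if "pii" in dataset.tags:
        return "privacy_approved" in user.roles
    return True  # everything else is readable by any lake user

# A non-sensitive raw feed and a sensitive curated table.
clickstream = Dataset("clickstream", zone="raw")
customers = Dataset("customers", zone="curated", tags={"pii"})

analyst = User("ana")                          # no special roles
steward = User("sam", roles={"privacy_approved"})

print(can_read(analyst, clickstream))  # True
print(can_read(analyst, customers))    # False
print(can_read(steward, customers))    # True
```

The inversion matters: instead of locking everything down and granting exceptions (the warehouse model), the lake stays open and only the risky subset is gated.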

Questions to ask about your Data Lake


Who will provide the data (which departments, which data sets), and who will consume it?
Choosing a data lake technology (Hadoop, Amazon S3, Azure Data Lake, etc.) is only step one of the journey and does not guarantee success.


If a data lake is not properly leveraged, it often ends up being just a data dump, or worse still a 'data swamp', where no one with access can make sense of the information and put it to good use.
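What keeps a lake from becoming a swamp is discoverability: every data set registered with an owner and searchable metadata. As a minimal sketch (the `register` and `search` functions and all field names here are hypothetical, standing in for a real data catalog):

```python
# An in-memory stand-in for a data catalog.
catalog = {}

def register(name, owner, department, description, keywords):
    """Record who owns a data set and how to find it."""
    catalog[name] = {
        "owner": owner,
        "department": department,
        "description": description,
        "keywords": set(keywords),
    }

def search(keyword):
    """Return the names of data sets whose keywords match."""
    return sorted(name for name, meta in catalog.items()
                  if keyword in meta["keywords"])

register("web_logs", "ops-team", "IT",
         "Raw web server logs, landed daily", ["logs", "web"])
register("sales_2023", "bi-team", "Sales",
         "Curated 2023 sales facts", ["sales", "revenue"])

print(search("logs"))   # ['web_logs']
print(search("sales"))  # ['sales_2023']
```

In practice this role is filled by a dedicated catalog service, but the principle is the same: data that cannot be found might as well not be in the lake.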


Data governance refers to the overall management of the availability, usability, integrity, and security of enterprise data. The old policies of traditional data warehousing do not fit here.
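Of those four concerns, integrity is the easiest to make concrete. One common pattern, sketched here with Python's standard `hashlib` (the `fingerprint` and `verify` helpers are illustrative names, not from any specific lake platform), is to record a checksum when data lands in the lake and verify it before consumption:

```python
import hashlib

def fingerprint(data: bytes) -> str:
    """SHA-256 checksum, stored as metadata at ingestion time."""
    return hashlib.sha256(data).hexdigest()

def verify(data: bytes, expected: str) -> bool:
    """Recompute the checksum before trusting the data."""
    return fingerprint(data) == expected

# On ingestion: store the checksum alongside the file.
payload = b"order_id,amount\n1,9.99\n"
recorded = fingerprint(payload)

# On consumption: any change to the bytes is detected.
print(verify(payload, recorded))                # True
print(verify(payload + b"2,0.01\n", recorded))  # False
```

Checksums guard integrity; the availability, usability, and security pieces need their own mechanisms, but each should be just as checkable.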

This is the step beyond technology that really defines how useful, secure, and usable the data inside your data lake will be.