While not strictly synonymous with the data lake, the Apache Hadoop Distributed File System (HDFS) is one of the dominant data lake storage platforms. The introduction of Hadoop YARN (Yet Another Resource Negotiator) in 2012 revolutionized the HDFS ecosystem, adding capabilities for real-time and near-real-time processing.
With Hadoop and HDFS dominating data lake storage, and solutions now in place both in the cloud and on premises, what about data governance?
In a data lake full of raw data meant to be accessible by a wide range of users, how could the security and provenance of the data be assured?
The answer is that it is necessary to craft new policies that enable users and then get out of their way, while still managing risk; otherwise you are simply reapplying the data policies of the old, heavily curated enterprise data warehouse model.
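As a concrete sketch of what "enable users but still manage risk" can look like on HDFS, the commands below use standard HDFS ACLs to open a raw-data zone for broad read access while restricting writes. The paths (`/lake/raw`, `/lake/sensitive`) and the group names (`analysts`, `compliance`) are hypothetical examples, not part of any standard layout.

```shell
# Hypothetical layout: a broadly readable raw zone and a restricted zone.
# Requires dfs.namenode.acls.enabled=true on the cluster.

# Let the (hypothetical) analysts group browse and read raw data
# without being able to modify it.
hdfs dfs -setfacl -m group:analysts:r-x /lake/raw
hdfs dfs -setfacl -m default:group:analysts:r-x /lake/raw

# Keep sensitive data locked down to a (hypothetical) compliance group.
hdfs dfs -setfacl -m group:compliance:rwx /lake/sensitive
hdfs dfs -setfacl -m other::--- /lake/sensitive

# Inspect the resulting policy on each zone.
hdfs dfs -getfacl /lake/raw
hdfs dfs -getfacl /lake/sensitive
```

File-level ACLs like these are only one layer; in practice they are usually combined with a central policy and audit tool (such as Apache Ranger) so that access decisions and data provenance can be reviewed in one place rather than per directory.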