- Posted by redglue
- On May 4, 2019
- big data, databricks, open source
We work with Databricks a lot, so the recent news from the Spark creators is, for us, a great step in the right direction.
Databricks is a cloud-only analytics platform. It is not available on-premises, and some of its best features were not open source. While this is fine for some customers, others find it difficult to be locked into proprietary features.
At the recent Spark + AI Summit, Databricks announced that the Delta Lake project is going open source. This is major news for everyone (like us) who supports the open-source community.
Some of the problems solved by Delta Lake are:
- ACID transactions
- Data versioning and time travel
- Unified batch and streaming writes to the same table
- Support for DML (Update, Delete, Merge)
- Schema validation and management
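To give a feel for how the first two items work, here is a toy sketch (in plain Python, so it runs anywhere) of the idea behind Delta Lake's `_delta_log`: every write appends a numbered JSON commit file, the atomic rename of that file is the "transaction", and time travel simply replays the log up to a chosen version. The `ToyDeltaTable` class and its layout are our own simplification for illustration, not the real Delta Lake format or API.

```python
import json
import os
import tempfile

class ToyDeltaTable:
    """Toy model of a transaction-log-backed table (illustrative only)."""

    def __init__(self, path):
        self.log_dir = os.path.join(path, "_delta_log")
        os.makedirs(self.log_dir, exist_ok=True)

    def commit(self, rows):
        """Record a batch of rows as a new table version, atomically."""
        version = len(os.listdir(self.log_dir))
        commit_file = os.path.join(self.log_dir, f"{version:020d}.json")
        # Write to a temp file first, then rename: the rename is the
        # atomic "commit", so readers never see a half-written version.
        fd, tmp = tempfile.mkstemp(dir=self.log_dir)
        with os.fdopen(fd, "w") as f:
            json.dump(rows, f)
        os.rename(tmp, commit_file)
        return version

    def read(self, as_of_version=None):
        """Read the table, optionally 'time travelling' to an old version."""
        commits = sorted(
            name for name in os.listdir(self.log_dir) if name.endswith(".json")
        )
        if as_of_version is not None:
            commits = commits[: as_of_version + 1]
        rows = []
        for name in commits:
            with open(os.path.join(self.log_dir, name)) as f:
                rows.extend(json.load(f))
        return rows
```

With two commits, `read()` returns all rows, while `read(as_of_version=0)` returns only the first batch, which is exactly the mental model behind Delta Lake's `VERSION AS OF` time-travel queries.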
For the data architects reading this, this is major news: only they know how difficult it is to build a data lake with decent reliability and performance.
Eventually Delta Lake will receive pull requests from the data community, and other vendors (looking at you, Cloudera) will start integrating it into their Hadoop distributions.
Also at the Spark + AI Summit, Databricks announced the Koalas project, which ports the pandas DataFrame API on top of Spark, so you can move your existing "one-computer" code to a distributed computing platform.
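To illustrate the point, here is ordinary single-machine pandas code of the kind Koalas targets; per the announcement, the intent is that roughly the same DataFrame code runs distributed on Spark once you swap the import (shown as a comment, since `databricks.koalas` is not installed here). The data itself is made up for the example.

```python
import pandas as pd

# Plain pandas: runs on one machine.
df = pd.DataFrame({
    "city": ["Lisbon", "Porto", "Lisbon"],
    "sales": [10, 20, 30],
})
totals = df.groupby("city")["sales"].sum()
print(totals["Lisbon"])  # 40

# With Koalas, the same DataFrame operations are meant to execute on Spark;
# the advertised change is essentially the import line, e.g.:
#   import databricks.koalas as ks
#   df = ks.DataFrame({...})
```

That is the appeal: no rewrite of pandas logic into Spark's own DataFrame API just to scale out.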
Happy hacking and coding!