- Posted by redglue
- On March 29, 2017
- hadoop, hdinsight, hive, ranger, security
The democratization of data is a reality nowadays, but many people are worried about how it affects the data security practices they have had in place for years regarding what data can be seen. Data masking is a long-established practice in the RDBMS world, and now that the Hadoop ecosystem handles some workloads, it has become a topic there too, as the data that comes from data lakes is more important than ever.
This is a short post demonstrating how Dynamic Data Masking can be applied and how it works with Apache Hive and Apache Ranger.
Dynamic Data Masking does not alter the original data in HDFS (or other sources such as Azure Data Lake Storage); the data is only obfuscated when it is presented to the user. It is a little like Oracle’s Data Redaction feature.
Masking policies define which data fields are to be masked and which mapping functions should be applied to them.
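To make the idea concrete, here is a minimal Python sketch of masking at presentation time. It is not how Ranger or Hive implement it; the function names (`nullify`, `show_last_4`, `present`) and the `phone` column are hypothetical, made up only for illustration. The policy is just a mapping from column name to a masking function, and the stored rows are never modified.

```python
# Illustrative sketch only: NOT Ranger's implementation.
# All names here (nullify, show_last_4, present, the "phone" column) are hypothetical.

def nullify(value):
    """Hide the value completely."""
    return None

def show_last_4(value):
    """Keep the last four characters and replace everything before them with 'x'."""
    text = str(value)
    return "x" * max(len(text) - 4, 0) + text[-4:]

# The "policy": which columns are masked, and with which mapping function.
MASKING_POLICY = {
    "yearly_income": nullify,
    "phone": show_last_4,
}

def present(rows, policy):
    """Return masked copies of the rows; the stored rows are left untouched."""
    return [
        {col: policy[col](val) if col in policy else val for col, val in row.items()}
        for row in rows
    ]

# Data "at rest" (the phone value is invented for the example).
stored = [{"customer_id": 1, "yearly_income": "$30K - $50K", "phone": "271-555-9715"}]

masked = present(stored, MASKING_POLICY)
print(masked)   # what the user sees
print(stored)   # the original data is unchanged
```

The point of the design is that masking is a function applied on the read path, so removing or changing the policy changes what users see without rewriting any data.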
All this is made possible by Apache Ranger, which is essentially a framework to enable, monitor and manage data security across the Hadoop platform. If you are using a Hortonworks-based platform (HDP 2.5 or Azure HDInsight), Apache Ranger is there for you out of the box. If you are using Azure HDInsight as your Hadoop environment, Apache Ranger can integrate with your on-premises Active Directory.
Data masking is also popular in development and QA environments, where developers and sometimes testers shouldn’t use real data for their activities.
Again, if you are on Azure HDInsight, you can integrate Apache Ranger with your Active Directory, specify granular masking policies for each user and group, and build a robust solution.
For the demo, let’s first query the sensitive information before any masking is applied, using Hive (here via the command line, but you can use whatever you want, from APIs to Ambari Hive Views or even Excel):
hive> use foodmart;
hive> SELECT customer_id, yearly_income FROM customer LIMIT 10;
1 $30K - $50K
2 $70K - $90K
3 $50K - $70K
4 $10K - $30K
5 $30K - $50K
6 $70K - $90K
7 $30K - $50K
8 $50K - $70K
9 $10K - $30K
10 $30K - $50K
Time taken: 0.734 seconds, Fetched: 10 row(s)
As you can see, the data is visible to the user hive (in this case) and no masking is done. Let’s fire up Apache Ranger and create a policy to mask the column called yearly_income.
This policy applies to the group Public and the users hive and admin, and it covers database foodmart, table customer and column yearly_income.
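The scope of such a policy can be pictured as a resource (database, table, column) plus a set of users and groups. The following Python sketch models only that matching logic under my own assumptions; `MaskPolicy` and `applies_to` are hypothetical names, not part of any Ranger API, and real Ranger policies support more than this (wildcards, for example).

```python
from dataclasses import dataclass

# Illustrative sketch only: MaskPolicy and applies_to are hypothetical names,
# not Ranger classes. It models "who + which column" matching and nothing more.

@dataclass(frozen=True)
class MaskPolicy:
    database: str
    table: str
    column: str
    users: frozenset
    groups: frozenset

    def applies_to(self, user, user_groups, database, table, column):
        """True when the column is covered AND the user (or one of their groups) is listed."""
        same_resource = (database, table, column) == (self.database, self.table, self.column)
        same_principal = user in self.users or bool(self.groups & set(user_groups))
        return same_resource and same_principal

# The policy described in this post.
policy = MaskPolicy(
    database="foodmart",
    table="customer",
    column="yearly_income",
    users=frozenset({"hive", "admin"}),
    groups=frozenset({"Public"}),
)

# User hive selecting yearly_income: masked.
print(policy.applies_to("hive", [], "foodmart", "customer", "yearly_income"))
# Same user selecting customer_id: not masked, the column is outside the policy.
print(policy.applies_to("hive", [], "foodmart", "customer", "customer_id"))
```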
The conditions are pretty simple: on “select”, a hash algorithm is applied to the column data. Ranger offers a number of masking options, and you can even define your own.
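If you are wondering what hashing a column does to the values, here is a small Python sketch. I am assuming SHA-256 purely for illustration; the digest that Ranger and Hive actually apply may differ, and `hash_mask` is a hypothetical helper name.

```python
import hashlib

# Illustrative sketch only: SHA-256 is an assumption made for this example,
# not necessarily the algorithm Ranger/Hive apply. hash_mask is a hypothetical name.

def hash_mask(value):
    """Replace a value with the hex digest of its UTF-8 text."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

# A few yearly_income values from the query output above.
incomes = ["$30K - $50K", "$70K - $90K", "$30K - $50K"]
masked = [hash_mask(v) for v in incomes]

for original, shown in zip(incomes, masked):
    print(original, "->", shown)

# Hashing is deterministic: equal inputs give equal masked values.
print(masked[0] == masked[2])
```

Because equal inputs produce equal hashes, a masked column can still be grouped or joined on, even though the real values are hidden. The flip side is that a column with only a handful of distinct values, like this one, is easy to guess back from its hashes.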
Save the policy (this is very fast, as it doesn’t touch any data at rest) and query the data again (this time I’ve used the Ambari Hive View). As you can see, yearly_income now comes back as hashed values.