- Posted by Hugo Almeida
- On June 27, 2018
Our Hadoop HDP IaaS cluster on Azure uses Azure Data Lake Store (ADLS) as its data repository and accesses it through an application user created in Azure Active Directory (AAD). Check this tutorial if you want to connect your own Hadoop cluster to ADLS.
Our ADLS is getting bigger and we’re working on a backup strategy for it. ADLS provides locally-redundant storage (LRS); however, this does not prevent our applications from corrupting or accidentally deleting data. Since Microsoft hasn’t yet shipped a version of ADLS with a clone feature, we had to find our own way to back up all the data stored in our data lake.
We’re going to show you how to do a full ADLS backup with Azure Data Factory (ADF). ADF does not preserve permissions. However, our Hadoop clients can only access the AzureDataLakeStoreFilesystem (adl) through Hive with the “hive” user, so we can generate these permissions before the backup.
Our Hadoop cluster has some distinct characteristics that make it easier to replicate the data permissions for the ADLS backup:
- Applications can access Hadoop through the Hive layer with the “hive” user only.
- Applications can only access Hadoop directly for uploading or downloading files.
- Hive permissions inheritance is set to true.
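Because all access goes through the “hive” user and permission inheritance is on, regenerating permissions mostly means capturing the current ACLs for reference before the backup. A minimal sketch, assuming the warehouse sits at the default HDP location /apps/hive/warehouse (adjust the path to your cluster):
# dump the ACLs of the warehouse tree so we can re-apply equivalent permissions later
hdfs dfs -getfacl -R /apps/hive/warehouse > warehouse-acls.txt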
Let’s get started!
Create a new ADLS and define ADLS Data Permissions
After creating your new ADLS, open Data Explorer and click the “Access” button on the root folder. Then click the “Add” button and select the same identity that has access to the original ADLS and whose credentials are configured in your core-site.xml file. Select the adequate permissions.
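If you prefer to script this step, the same ACL entry can be set with the Azure CLI. A minimal sketch, assuming the new account is called sink (as in the MetaTool example later) and that $SP_OBJECT_ID holds your service principal’s object ID (both placeholders):
# grant the application identity rwx on the root of the new account
az dls fs access set-entry --account sink --path / --acl-spec user:$SP_OBJECT_ID:rwx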
Creating your ADF pipeline
I won’t get into much detail since there are several tutorials out there about ADF; just make sure you create two linked services, two data sets, and a pipeline with a simple copy activity.
Some notes for your pipeline creation:
- Use service principal authentication for your linked services.
- Create two data sets with binary copy enabled.
- Create a “Copy” activity with the “Copy file recursively” option enabled and the copy behavior set to “Preserve hierarchy”, then run your pipeline (see the sketch after this list).
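For reference, the resulting copy activity in the pipeline JSON looks roughly like this. It is a sketch, not our exact definition: the names CopyADLSBackup, BackupSource, and BackupSink are placeholders, and the exact schema depends on your ADF version.
{
    "name": "CopyADLSBackup",
    "type": "Copy",
    "inputs": [ { "referenceName": "BackupSource", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "BackupSink", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "AzureDataLakeStoreSource", "recursive": true },
        "sink": { "type": "AzureDataLakeStoreSink", "copyBehavior": "PreserveHierarchy" }
    }
}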
In this case we had a solid throughput of 152 MB/s.
Set other permissions
If you need to give rwx permissions to other users on your /tmp folder, you can use an hdfs command or the Azure Portal to do that.
We recommend using the hdfs dfs -setfacl command because it gives you more flexibility; just make sure to give your Hadoop user owner access in the ADLS IAM tab.
hdfs dfs -setfacl -m -R default:other::rwx,other::rwx /tmp
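You can confirm the new ACL entries were applied with getfacl:
# list the ACLs on /tmp, including the default entries set above
hdfs dfs -getfacl /tmp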
Point your default file system to your new ADLS
Set fs.defaultFS to your new ADLS URL in core-site.xml.
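In core-site.xml this is a single property. A sketch, using the sink account name from the MetaTool example below:
<property>
  <!-- the new (backup) account becomes the default file system -->
  <name>fs.defaultFS</name>
  <value>adl://sink.azuredatalakestore.net</value>
</property>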
Update your Hive metastore
Some of the entries in the Hive metastore database contain absolute references to the old file system location. After changing the default file system to our new ADLS, we need to update the old values using the Hive MetaTool:
hive --config /etc/hive/conf/conf.server --service metatool -updateLocation adl://sink.azuredatalakestore.net adl://source.azuredatalakestore.net --verbose
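You can verify the update with the MetaTool’s listFSRoot option, which prints the file system roots currently referenced by the metastore:
hive --config /etc/hive/conf/conf.server --service metatool -listFSRoot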
A note on DistCp
We also tried DistCp with all the tuning options and permission preservation, and it created a perfect copy; however, the performance was very poor, at almost 14 hours of backup for 1.2 TB of data. We believe this is because we have very small files (we use ORC tables in Hive with partitioning) and a very complex folder structure, which does not help MapReduce jobs.
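For reference, our DistCp runs looked roughly like this. A sketch: -pugp preserves user, group, and permission bits, and the mapper count is just an example to tune.
# full copy with permission preservation; -m caps the number of map tasks
hadoop distcp -pugp -m 100 adl://source.azuredatalakestore.net/ adl://sink.azuredatalakestore.net/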