As I mentioned in my other posts, Cloudera, as an employer, allows us to take a couple of self-learning weeks each calendar year, at least for those of us in the Support Organisation. We can choose whatever topics we would like to learn; the only requirement is that what we learn helps with our day-to-day work. Last time I chose Spark and gave an internal presentation to our wider team on how to develop Spark applications in a JetBrains IDE on just a laptop, without the need for a working Hadoop cluster. This time I chose Ranger, which was selected as the replacement for Sentry after the Cloudera and Hortonworks merger.
Before I started learning, I tried to gather a few resources for myself, but I noticed there were not many. We have an O’Reilly subscription, but I couldn’t find a book or video course about Ranger at all. A few posts can be found through Google, but they only cover very high-level information and are done in a single post. So I have decided to write a series of posts about Ranger after finishing my own training, to share with the rest of you.
So, let’s get started. Firstly, I would like to do a high-level comparison between Ranger and Sentry, to understand why Sentry is now deprecated and will be replaced by Ranger. I will assume that you have basic knowledge of Hadoop and the CDH or HDP ecosystem.
Let’s have a look at what Sentry provides. According to Sentry’s official documentation, Apache Sentry is a granular, role-based authorisation module for Hadoop. It provides the ability to control and enforce precise levels of privileges on data for authenticated users and applications that run on a Hadoop cluster, particularly CDH. Currently Sentry is well integrated with Apache Hive, Apache Solr, Apache Kafka, Apache Impala and HDFS (limited to data linked to Hive tables via Sentry HDFS sync).
Sentry is role-based, meaning that you create Roles in Sentry and grant privileges to them. Roles are mapped to Groups, defined either at the OS level or in AD, and the Groups in turn contain the end users who need to access Hadoop. You can use Sentry to limit a user’s access to a DB, TABLE, COLUMN or URI. This is done via Sentry commands, which are run from the Impala or Hive interface. More details about those commands can be found in Cloudera’s Sentry Documentation.
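To make the role, group and privilege chain a bit more concrete, here is a minimal sketch in Python that only generates the kind of statements you would then run from Hive or Impala. The helper name `sentry_grant_statements` and the role, group, database and table names are made up for illustration, and the statement shapes follow the commonly documented `CREATE ROLE` / `GRANT` forms, so please check them against Cloudera’s Sentry Documentation for your version before using them.

```python
def sentry_grant_statements(role, group, database, table, columns=None):
    """Return SQL statements granting SELECT to `group` through `role`.

    Hypothetical helper: it only builds strings and never talks to a cluster.
    """
    target = f"{database}.{table}"
    statements = [
        # 1. Create the role that will hold the privileges.
        f"CREATE ROLE {role};",
        # 2. Map the role to a group (OS level or AD).
        f"GRANT ROLE {role} TO GROUP {group};",
    ]
    if columns:
        # 3a. Column-level access: one GRANT per column.
        for column in columns:
            statements.append(
                f"GRANT SELECT({column}) ON TABLE {target} TO ROLE {role};"
            )
    else:
        # 3b. Table-level access.
        statements.append(f"GRANT SELECT ON TABLE {target} TO ROLE {role};")
    return statements


if __name__ == "__main__":
    # All names below are illustrative only.
    for statement in sentry_grant_statements(
        "analyst_role", "analysts", "sales", "orders", ["order_id", "amount"]
    ):
        print(statement)
```

Note that users never appear in these statements: a user gets access only by being a member of a group that has been granted the role.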
Now, let’s have a look at what Ranger has to offer. Again, according to the official Apache Ranger documentation, Apache Ranger is a framework to enable, monitor and manage comprehensive data security across the Hadoop platform. Apache Ranger has the following goals:
- Centralised security administration to manage all security related tasks in a central UI or using REST APIs.
- Fine-grained authorisation to do a specific action and/or operation with a Hadoop component/tool, managed through a central administration tool.
- Standardise authorisation method across all Hadoop components.
- Enhanced support for different authorisation methods – Role based access control, attribute based access control etc.
- Centralise auditing of user access and administrative actions (security related) within all the components of Hadoop.
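As a rough illustration of the REST API goal in the list above, the sketch below builds a policy document as a plain Python dict and prints it as JSON. The field names (`service`, `resources`, `policyItems`, `accesses`) reflect my understanding of Ranger’s public v2 policy model, which as far as I know accepts such a document via a POST to `/service/public/v2/api/policy` on the Ranger Admin server. The helper name `build_hive_policy`, the service name `cm_hive` and the database, table and group names are all assumptions for the example, and nothing is sent to a server here.

```python
import json


def build_hive_policy(service, name, database, table, groups, accesses):
    """Build a Ranger-style policy document for one Hive table.

    Hypothetical helper: the layout is an assumption based on Ranger's
    public v2 policy model and should be checked against your Ranger version.
    """
    return {
        "service": service,  # name of the Hive service defined in Ranger
        "name": name,        # policy name
        "isEnabled": True,
        "resources": {
            "database": {"values": [database]},
            "table": {"values": [table]},
            "column": {"values": ["*"]},  # all columns
        },
        "policyItems": [
            {
                "groups": list(groups),
                "accesses": [
                    {"type": access, "isAllowed": True} for access in accesses
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Illustrative values only.
    policy = build_hive_policy(
        service="cm_hive",
        name="analysts_read_orders",
        database="sales",
        table="orders",
        groups=["analysts"],
        accesses=["select"],
    )
    print(json.dumps(policy, indent=2))
```

The point of the sketch is that the whole policy, including who it applies to and on which resources, lives in one document managed centrally, rather than in grants issued from each component’s own interface.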
As you can see, on top of authorisation itself, Apache Ranger also offers a user-friendly web UI, REST APIs and auditing, which are missing from Sentry. To summarise, I will outline the main differences between the two Apache projects to show why Ranger is the choice for CDP, the successor to CDH:
|Feature|Apache Sentry|Apache Ranger|Comment|
|Impala|Yes|No|For now, WIP for Ranger + Impala|
|HDFS|Yes|Yes|Sentry supports via Sync|
|Tag-based policies|No|Yes|More details to come|
|Row-level filtering|No|Yes|More details to come|
|Column masking|No|Yes|More details to come|
As you can see, Apache Ranger supports more features and integrates with more Hadoop components. Even though Ranger does not currently support Impala, work is in progress and it will be available in a future release of Cloudera’s CDP product.
That’s all for the first episode. I will discuss Ranger’s features in more detail in future episodes. If you have any comments, please feel free to add them below.
Please note that the comparison table above was based on information from before Cloudera’s CDP release, so things may have changed by the time you read this blog post. Ranger support for Impala and HDFS sync is being worked on and will be available in future releases of CDP 7 and later.
Impala added some Apache Ranger support in Impala 3.4, but your article states (in the comparison table) that Impala is not supported. Could you check this, if it is not too much trouble?
Thanks Ansha for your comment. Yes indeed, Ranger now supports Impala in CDP. However, my blog post was written before the release of CDP.
I have updated the post to include a small note that changes might have happened already. The comparison table was meant to illustrate the features each component had before CDP, to show why Ranger was chosen as the replacement for Sentry.
FYI, starting from CDP 7.1.5, Ranger offers HDFS sync using the RMS plugin.
Thanks for visiting and sharing that CDP 7.1.5 has HDFS sync from Ranger. Good to know that new features are being added to the latest Ranger!
Strangely, Ranger roles are acting weird in CDP 7.1.5. A database table access policy given to a role works for Impala but not for Hive SQL queries.
Apologies, I am not at Cloudera anymore, so I do not have access to CDP and Ranger and am unable to test and confirm the issue. You might want to reach out to the Cloudera folks.