Authorizing Hadoop users without Sentry - hadoop

I have a Kerberized CDH cluster, where there are some daily oozie workflows running. All of them use shell, impala-shell, hive and sqoop to ingest data to Hive tables (lets call these tables SensitiveTables)
Now, I want to create 2 new BI users to use the cluster and experiment with some other ingested data.
The requirement is that these new BI users:
should not have access to the SensitiveTables
should be able to spark-submit jobs to the cluster
(optionally) use Hue
Apart from setting-up Apache Sentry (which is the recommended way to go), is there any chance to meet those requirements using file-permissions or ACL and Service Level Authorization ?
So far, I managed (via hadoop fs -chmod o-rwx /user/hive/warehouse/sensitive) to restrict access to SensitiveTables via Hive (which uses user impersonation), but failed to do so via Impala (which submits all jobs to the cluster as user impala). Is there anything else I should try?
Thank you,
Gee

After a lot of research and based on the assumptions I described, the answer is NO. Furthermore, the metastore can not be protected this way.

Related

How to query kerberos enabled hbase using apache drill?

We have a kerberoized hadoop cluster, where HBase is running. Apache drill is running in distributed mode, in another cluster. Now, we need to query the Kerberos enabled HBase from the apache drill, using web UI. The Hadoop cluster is actually running in AWS, the HBase uses s3 as storage. Please help me with steps to achieve successful queries to HBase.
Apache_Drill_version:1.16.0 Hadoop version: 2
Usually, to query HBase, we run kinit with the keytab manually, then get into HBase shell in Hadoop cluster. We wanted to make use of drill, to query in a SQL fashion easily, better readability.

Submitting MR job to Hadoop cluster with different ID's

What is the best way in which we can submit the MR job to hadoop cluster?
Scenario:
Developers have their own id's e.g. dev-user1, dev-user2 etc.
Hadoop cluster has various id's for various components e.g hdfs user for HDFS, yarn for YARN etc.
This means dev-user1 can't read / write HDFS as it is hdfs id that has access to HDFS.
Can anyone help me understand what is the best practice in which a developer can submit a job to hadoop cluster? I don't want to share the hadoop "specific" id details to anyone.
How does it work in real life scenarios.
best practice in which a developer can submit a job to hadoop cluster?
Depends on the job... yarn jar would be a used for MapReduce
This means dev-user1 can't read / write HDFS as it is hdfs id that has access to HDFS.
Not everything is owned by the hdfs user. You need to make /user/dev-user1 HDFS directory owned by that user so that's where the user has a "private" space. You can still make a directory anywhere else on HDFS that multiple users write to.
And permissions are only checked if you've explicitly enabled them on HDFS... And even if you did, then you still can put both users into the same POSIX group, or make directories globally writable by all.
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
In production grade clusters, Hadoop is secured by Kerberos credentials and ACLs are managed via Apache Ranger or Sentry, which both allow fine-grained permission management

Ambari Hive View: How to make Hive queries run as current user

I finished configuring the Hive and HDFS views inside Ambari 2.4.2.. I want execute a query as a user and not a hive user.
I changed hive.server2.enable.doAs is set to “true” but I think is not enough because always my query execute as a hive user when I see it in ranger.
I'm looked in th eforum, that I should to configure impersonation the authParam look like the following:
hive.server2.proxy.user=${username}
I don't find where is this parameter to change it. knowing that I don't have a kerberized cluster hadoop, not yet I setting up the kerberos.
Thank you

How to know that a new data is been added to HDFS?

I am implementing a Notification system based on publish subscribe model to notify about the availability of data as it arrives/loaded to HDFS. I did n't find a ways where to look for this. Is there any HDFS API which can be used to do this or what method should I use to get information of new data written to HDFS? I am using Hadoop v2.0.2 and I don't want to use HCatalog, I want to implement my own tool to do this.
What you are looking for is Oozie Coordinator.
HDFS is a file system, so something must be built on top of HDFS to check for file availability. HBase has coprocessor which are triggered procedures . But it is only available for HBase tables. So it cannot be used for detecting data availabilty in HDFS.
Oozie is a workflow scheduler system to manage Hadoop jobs. Oozie Coordinator jobs are recurrent Oozie Workflow jobs triggered by time (frequency) and data availabilty. Also you can execute other programs from it :
Oozie is integrated with the rest of the Hadoop stack supporting
several types of Hadoop jobs out of the box (such as Java map-reduce,
Streaming map-reduce, Pig, Hive, Sqoop and Distcp) as well as system
specific jobs (such as Java programs and shell scripts).
So you can use the file availability trigger for your notification system too.
If you use HDFS you might want to check out HBase as it has the functionality you want. In HBase, you can create a pre-put (or post-put) coprocessor essentially acting equivilant to a MySQL Trigger- running a bit of code for every time data is written to a table.
If HBase doesn't suit your use case and you must use HDFS, AFAIK there aren't similar triggers. You can try wrapping the HDFS API with your own code to perform the notification whenever data is written to your file system under the appropriate circumstances. Alternatively, you can poll HDFS for changes (which sounds like an ugly alternative)...
Hope that helps

How do add a new user to Hue/Hive with permissions?

I have a Hadoop cluster that I manage via the Hue interface to run Hive queries.
I would like to add another user to Hue and give them access to SOME of the table to run queries on. Is this possible?
Hue is just a view on top of Hive so using Hive Authorization should do it (beware: Hive Authorization is currently being reworked in order to be really secure).
You might want to add that user to the hadoop group that you must have created to run hadoop. Also you might want to create a separate working directory for each user on hadoop.

Resources