Restrict Folder Access in Hadoop - hadoop

Two different groups of people plan to use our Hadoop cluster, but I don't want them to see each other's data.
How can I enforce this on a Hadoop cluster?
I understand that by setting an environment variable you can easily impersonate the Hadoop superuser and access all data in HDFS. Is there a simpler way to prevent this, or is Kerberos- and LDAP-based security the only way to go?

Kerberos is the only way to prevent Hadoop users from impersonating the hdfs superuser and misusing its privileges.
Without it, impersonating the hdfs user (who happens to be the Hadoop superuser in most distributions) is very simple: anyone can do it by setting the environment variable HADOOP_USER_NAME to hdfs.
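Whether this trick works depends on the cluster's authentication mode, since HADOOP_USER_NAME is only trusted under simple authentication. A minimal sketch for checking that, assuming the hdfs client is on the PATH and reads the cluster's configuration (the function name is made up for illustration):

```shell
# Hypothetical helper: succeeds (exit 0) when the client configuration
# reports simple authentication, i.e. HADOOP_USER_NAME would be trusted.
impersonation_possible() {
    mode=$(hdfs getconf -confKey hadoop.security.authentication)
    [ "$mode" = "simple" ]
}

# Example call:
#   impersonation_possible && echo "user names are taken from the client as-is"
```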

Related

Why would I Kerberise my hadoop (HDP) cluster if it already uses AD/LDAP?

I have a HDP cluster.
This cluster is configured to use Active Directory as its authentication and authorization authority. To be more specific, we use Ranger to limit access to HDFS directories, Hive tables and YARN queues once the user has provided a correct username/password combination.
I have been tasked to Kerberise the Cluster, which is very easy thanks to the "press buttons and skip" like option in Ambari.
We Kerberised a test cluster. While interacting with Hive does not require any modification on our existing scripts on the cluster's machines, it is very, very difficult to find a way for end users to interact with Hive from OUTSIDE the cluster (PowerBI, DbVisualizer, PHP application).
Kerberising seems to bring an unnecessary amount of work.
What concrete benefits would I get from Kerberising the cluster (except making the guys higher up in the hierarchy happy because, hey, we Kerberised, yoohoo)?
Edit:
One benefit:
Kerberising the cluster adds security because the cluster runs on Linux machines, which the company's Active Directory is not able to handle.
Ranger with AD/LDAP authentication and authorization is ok for external users, but AFAIK, it will not secure machine-to-machine or command-line interactions.
I'm not sure if it still applies, but on a Cloudera cluster without Kerberos, you could fake a login by setting an environment parameter HADOOP_USER_NAME on the command line:
sh-4.1$ whoami
ali
sh-4.1$ hadoop fs -ls /tmp/hive/zeppelin
ls: Permission denied: user=ali, access=READ_EXECUTE, inode="/tmp/hive/zeppelin":zeppelin:hdfs:drwx------
sh-4.1$ export HADOOP_USER_NAME=hdfs
sh-4.1$ hadoop fs -ls /tmp/hive/zeppelin
Found 4 items
drwx------ - zeppelin hdfs 0 2015-09-26 17:51 /tmp/hive/zeppelin/037f5062-56ba-4efc-b438-6f349cab51e4
For machine-to-machine communications, tools like Storm, Kafka, Solr or Spark are not secured by Ranger, but they are secured by Kerberos, so only dedicated processes can use those services.
Source: https://community.cloudera.com/t5/Support-Questions/Kerberos-AD-LDAP-and-Ranger/td-p/96755
Update: Apparently, Kafka and Solr Integration has been implemented in Ranger since then.

Submitting MR job to Hadoop cluster with different ID's

What is the best way to submit an MR job to a Hadoop cluster?
Scenario:
Developers have their own IDs, e.g. dev-user1, dev-user2, etc.
The Hadoop cluster has separate IDs for its components, e.g. the hdfs user for HDFS, yarn for YARN, etc.
This means dev-user1 can't read / write HDFS as it is hdfs id that has access to HDFS.
Can anyone help me understand what is the best practice in which a developer can submit a job to hadoop cluster? I don't want to share the hadoop "specific" id details to anyone.
How does it work in real-life scenarios?
"best practice in which a developer can submit a job to hadoop cluster?"
Depends on the job... yarn jar would be used for MapReduce.
"This means dev-user1 can't read / write HDFS as it is hdfs id that has access to HDFS."
Not everything is owned by the hdfs user. Make the HDFS directory /user/dev-user1 owned by that user, so the user has a "private" space there. You can still create a directory anywhere else on HDFS that multiple users write to.
Also, permissions are only enforced when dfs.permissions.enabled is true, which is the default. Even with enforcement on, you can still put both users into the same POSIX group, or make directories writable by everyone.
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
In production-grade clusters, Hadoop is secured by Kerberos credentials, and ACLs are managed via Apache Ranger or Sentry, both of which allow fine-grained permission management.
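The directory setup described above could be sketched as follows. This is only an illustration: the helper name, the group name developers and the path /data/shared are assumptions, and the commands must be run as a user allowed to change ownership in HDFS (typically the superuser).

```shell
# Hypothetical helper: create an HDFS directory with a given owner,
# group and mode.
# Usage: make_hdfs_dir <owner> <group> <mode> <hdfs_path>
make_hdfs_dir() {
    owner=$1; group=$2; mode=$3; path=$4
    hdfs dfs -mkdir -p "$path" &&
    hdfs dfs -chown "$owner:$group" "$path" &&
    hdfs dfs -chmod "$mode" "$path"
}

# Example calls (group and shared path are made up):
#   make_hdfs_dir dev-user1 developers 700 /user/dev-user1   # private home
#   make_hdfs_dir hdfs developers 770 /data/shared           # shared by one group
```

On a cluster without Kerberos these permissions only keep honest users apart, since the user name itself can be spoofed.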

Why another user is required to run hadoop?

I have a question regarding Hadoop configuration.
Why do we need to create a user for running Hadoop? Can we not run Hadoop as the root user?
Yes, you can run it as root.
It is not a requirement to have a dedicated user for Hadoop but having one with lesser privileges than root is considered a good practice. It helps in separating Hadoop processes from other services running on the same machine.
This is not Hadoop-specific; it's common good practice in IT to have dedicated users for running daemons, for security reasons (for example, in Hadoop, if you run the MapReduce daemons as root, a malicious user could launch a MapReduce job that deletes not only HDFS data but also operating system data), for better control, etc. Take a look at this:
https://unix.stackexchange.com/questions/29159/why-is-it-recommended-to-create-a-group-and-user-for-some-applications
It is not at all required to create a new user to run Hadoop. Also, the Hadoop user need not be (and should not be) in the sudoers file or be the root user. Your login user for the machine can also act as the Hadoop user. But as mentioned by @Luis and @franklinsijo, it is good practice to have a specific user for a specific service.

Check permission in HDFS

I'm totally new to Hadoop. One of our SAS users has a problem saving a file from SAS Enterprise Guide to Hadoop, and I've been asked to check whether permissions in HDFS have been granted properly, to make sure users are allowed to move data from one side and add it to the other.
Where should I check for it on the SAS servers? Is it a file, and how can I check it?
A detailed answer would be much appreciated.
Thanks.
This question is too vague, but I can offer a few suggestions. First off, the SAS Enterprise Guide user should have a SAS log from their job containing any errors.
The Hadoop distribution, version, services in use (for example, security products such as Knox, Sentry, or Ranger), and authentication (Kerberos) all make a difference. I will assume you are not having Kerberos issues and are not running Knox, Sentry, Ranger, etc., i.e. that you are using core Hadoop without Kerberos. If you need help with those, you must be more specific.
1. You have to check permissions on the Hadoop side for this. You have to know where they are putting the data in Hadoop. These are paths in HDFS, not on the server's file system.
If they connect to Hive without specifying any options, it is likely /user/hive/warehouse or the /user/username folder.
2. The sticky bit on /tmp in HDFS, often set by default, can cause permission problems. With it set, only a file's owner, the directory's owner, or the superuser can delete or move files in /tmp. Some SAS programs write to the /tmp folder in HDFS to save metadata, along with other information.
Run the following command on a Hadoop node to check basic permissions in HDFS.
hadoop fs -ls /
You should see the /tmp folder along with its permissions. If the permission string ends with a "t", as in drwxrwxrwt, the sticky bit is set. If the permissions are drwxrwxrwx, the sticky bit isn't set, which rules it out as a source of permission issues.
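As a rough sketch, the check on the permission string can be automated. The helper name below is made up, and it only inspects the string it is given:

```shell
# Hypothetical helper: succeeds when a permission string such as
# drwxrwxrwt has the sticky bit set. A trailing "+" (shown when ACLs are
# present) is ignored.
has_sticky_bit() {
    perm=${1%+}
    case "$perm" in
        *t|*T) return 0 ;;
        *)     return 1 ;;
    esac
}

# Example: take the permission column of the /tmp line from the listing.
#   perm=$(hadoop fs -ls / | awk '$NF == "/tmp" { print $1 }')
#   has_sticky_bit "$perm" && echo "sticky bit is set on /tmp"
```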
If the sticky bit is set on /tmp, which is usually the default, then you must either remove it or set an HDFS temp directory in the SAS program's libname statement for the Hadoop cluster.
For the libname options, please see the SAS/ACCESS to Hadoop guide: SAS/ACCESS® 9.4 for Relational Databases: Reference, Ninth Edition | LIBNAME Statement Specifics for Hadoop
To remove or change the sticky bit, see the following article or your Hadoop vendor's documentation: Configuring Hadoop Security in CDH 5, Step 14: Set the Sticky Bit on HDFS Directories. You will want to do the opposite of what that article describes in order to remove the sticky bit.
3. SAS + Authentication + Users
If your Hadoop cluster is secured using Kerberos, then each SAS user must have a valid Kerberos ticket to talk to any Hadoop service. There are a number of guides on the SAS Hadoop Support page about Kerberos, along with other resources. With Kerberos they need a Kerberos ticket, not a username or password.
SAS 9.4 Support For Hadoop Reference
If you are not using Kerberos, then you have either the Hadoop default of no authentication, or possibly some services such as Hive with LDAP enabled.
If you don't have LDAP enabled, then you can use any Hadoop username in the libname statement to connect, such as hive, hdfs, or yarn. You do not need to enter any password, and this user doesn't have to be the SAS user account. This is because the default Hadoop configuration does not require authentication. You can also use another account, such as one you create for the SAS user in your Hadoop cluster. If you do this, you must create a /user/username folder in HDFS by running something like the following as the HDFS superuser (or another user with sufficient permissions in Hadoop), then set the ownership to that user.
hadoop fs -mkdir /user/sasdemo
hadoop fs -chown sasdemo:sasusers /user/sasdemo
Then you can check to make sure it exists with
hadoop fs -ls /user/
Basically, whichever user they have in the libname statement of their SAS program must have a user home folder in Hadoop. The Hadoop service users will have one created by default on install, but you will need to create them for any additional users.
If you are using LDAP with Hadoop (not too common from what I've seen), then you will need the LDAP username along with the password for the user account in the libname statement. I believe you can encode the password if you like.
Testing Connections to Hadoop from SAS Program
You can modify the following SAS code to do a basic test to put one of the sashelp datasets into Hadoop using a serial connection to HiveServer2 using SAS Enterprise Guide. This is only a very basic test but should prove you can write to Hadoop.
libname myhive hadoop server=hiveserver.example.com port=10000 schema=default user=hive;
data myhive.cars;set sashelp.cars;run;
Then if you want you can use the Hadoop client of your choice to find the data in Hadoop in the location you stored it, likely /user/hive/warehouse.
hadoop fs -ls /user/hive/warehouse
And/Or you should be able to run a proc contents in SAS Enterprise Guide to display the contents of the Hadoop Hive table you just put into Hadoop.
PROC CONTENTS DATA=myhive.cars;run;
Hope this helps, good luck!
To find which groups can access files in HDFS, check Sentry.
The file ACLs are defined in Sentry, so if you want to grant or revoke access for anyone, it can be done there.
On the left-hand side is the file location and on the right-hand side are the ACLs of the groups.

Hive Access Issue - Not able to access Hive CLI without Write Access

I am new to Hadoop and Hive so this question may be too basic
I am using Hadoop as a non-admin user, i.e., I do not know the hdfs, root or superuser passwords. My objective is to just query the Hive tables and probably do some simple analysis but not write in the hdfs or create any new tables.
Logging initialized using configuration in file:/etc/hive/2.3.2.0-2950/0/hive-log4j.properties
Exception in thread "main" java.lang.RuntimeException: org.apache.hadoop.security.AccessControlException: Permission denied: user=b001195, access=WRITE, inode="/user/b001195":hdfs:hdfs:drwxr-xr-x
at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:319)
I can see that this error says the user does not have write access to HDFS. But I am not trying to write anything. Can anyone please suggest what changes I can make, with my access level, to get rid of this issue?
Thanks in advance for your help.
Use the following command to find an hdfs directory that has 777 permissions (/tmp maybe):
hdfs dfs -ls /
Then once you find that directory, add the following to the command you connect to hive with:
--hiveconf hive.exec.scratchdir=<DIRECTORY FOUND IN PREVIOUS STEP>
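A small sketch of the two steps above, assuming the usual column layout of the ls output and paths without spaces. The filter name is made up, and /tmp in the example is only a guess at what it might print:

```shell
# Hypothetical filter: reads `hdfs dfs -ls` output on stdin and prints the
# paths of directories that everyone can read, write and traverse
# (drwxrwxrwx, or drwxrwxrwt when the sticky bit is set).
world_writable_dirs() {
    awk '$1 ~ /^drwxrwxrw[xt]/ { print $NF }'
}

# Example:
#   hdfs dfs -ls / | world_writable_dirs
#   hive --hiveconf hive.exec.scratchdir=/tmp
```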
If you are in a Kerberized environment, you have to create a principal for your user name in the Kerberos database and then obtain a ticket.
Note: you'll need Kerberos admin privileges to create the principal.
Reference:
https://web.mit.edu/kerberos/krb5-1.12/doc/admin/admin_commands__/kadmin_local.html
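With MIT Kerberos, those two steps might look roughly like the sketch below. The function names and the realm EXAMPLE.COM are placeholders, and the first step has to be run by a Kerberos administrator on the KDC host:

```shell
# Run by a Kerberos administrator on the KDC host; prompts for a new password.
create_principal() {
    kadmin.local -q "addprinc $1"
}

# Run by the user on a cluster client: obtain a ticket, then list the cache.
get_ticket() {
    kinit "$1" && klist
}

# Example (hypothetical realm):
#   create_principal b001195@EXAMPLE.COM
#   get_ticket b001195@EXAMPLE.COM
```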
Then you are able to get into the Hadoop environment, but to read, write or execute in Hadoop directories, one of the following must hold:
you are the owner of the directory
you are in the group of that directory
the directory grants the permission to others
you are in the Hadoop superuser group (which you are not, as you mentioned)
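The first three rules can be modelled in a few lines. This is a simplified sketch with a made-up function name, not HDFS's actual implementation: it ignores the superuser bypass and ACLs, and it only looks at the one directory it is given. For the error in the question, user b001195 is neither the owner (hdfs) nor in the group (hdfs), so the "other" bits r-x apply and WRITE is denied.

```shell
# Hypothetical model of the owner/group/other check.
# Usage: can_access <user> <user_groups (space-separated)> <owner> <group> <perm> <r|w|x>
can_access() {
    user=$1; user_groups=$2; owner=$3; group=$4; perm=$5; want=$6

    # Pick the rwx triplet that applies: owner first, then group, then other.
    if [ "$user" = "$owner" ]; then
        bits=$(printf '%s' "$perm" | cut -c2-4)
    elif printf ' %s ' "$user_groups" | grep -qF " $group "; then
        bits=$(printf '%s' "$perm" | cut -c5-7)
    else
        bits=$(printf '%s' "$perm" | cut -c8-10)
    fi

    # A lowercase "t" means execute plus sticky bit, so treat it as "x".
    bits=$(printf '%s' "$bits" | tr t x)

    case "$bits" in
        *"$want"*) return 0 ;;
        *)         return 1 ;;
    esac
}

# Example from the question (the group name "users" is made up):
#   can_access b001195 "users" hdfs hdfs drwxr-xr-x w || echo "WRITE denied"
```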
Based on my experience with Hadoop, Linux or even Windows servers, the majority of problems are related to permissions.
I'd like to add my quote about this :)
When we manage to live in a world where we neither have nor need any keys, locks, gates, safes, passwords, user permissions, or any other security-related obstacles that waste time and money, we can say we are living in a perfect society.
