Why is another user required to run Hadoop? - hadoop

I have a question regarding Hadoop configuration:
why do we need to create a user for running Hadoop? Can we not run Hadoop as the root user?

Yes, you can run it as root.
It is not a requirement to have a dedicated user for Hadoop, but having one with fewer privileges than root is considered good practice. It helps separate the Hadoop processes from other services running on the same machine.

This is not Hadoop specific; it's a common good practice in IT to have dedicated users for running daemons, both for security reasons and for better control. For example, in Hadoop, if you run the MapReduce daemons as root, a malicious user could launch a MapReduce job that deletes not only HDFS data but also operating system data. Take a look at this:
https://unix.stackexchange.com/questions/29159/why-is-it-recommended-to-create-a-group-and-user-for-some-applications

It is not at all required to create a new user to run Hadoop. Also, the Hadoop user need not (and should not) be in the sudoers file or be a root user [ref]. Your login user for the machine can also act as the Hadoop user. But as mentioned by @Luis and @franklinsijo, it is good practice to have a specific user for a specific service.
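In practice that usually means creating an unprivileged service account before installing Hadoop. A minimal sketch (the names hadoop and hduser are arbitrary examples, not anything Hadoop requires, and the commands need root):

```shell
# Create a dedicated group and an unprivileged user for the Hadoop daemons.
# "hadoop" and "hduser" are example names, not anything Hadoop requires.
sudo groupadd hadoop
sudo useradd -m -g hadoop -s /bin/bash hduser

# Deliberately do NOT add hduser to sudoers; the whole point is that a
# compromised daemon then cannot touch the rest of the system.
id hduser
```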

Related

Submitting an MR job to a Hadoop cluster with different IDs

What is the best way to submit an MR job to a Hadoop cluster?
Scenario:
Developers have their own IDs, e.g. dev-user1, dev-user2, etc.
The Hadoop cluster has various IDs for its components, e.g. the hdfs user for HDFS, yarn for YARN, etc.
This means dev-user1 can't read/write HDFS, as it is the hdfs ID that has access to HDFS.
Can anyone help me understand the best practice by which a developer can submit a job to a Hadoop cluster? I don't want to share the Hadoop-specific ID details with anyone.
How does it work in real-life scenarios?
best practice by which a developer can submit a job to a Hadoop cluster?
Depends on the job... yarn jar would be used for MapReduce.
This means dev-user1 can't read/write HDFS, as it is the hdfs ID that has access to HDFS.
Not everything is owned by the hdfs user. You need to create an HDFS directory /user/dev-user1 owned by dev-user1, so that's where the user has a "private" space. You can still create directories elsewhere on HDFS that multiple users can write to.
And permissions are only checked if you've explicitly enabled them on HDFS. Even then, you can still put both users into the same POSIX group, or make directories globally writable by all.
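A sketch of the commands described above, run as the HDFS superuser (dev-user1 and the group devs are example names):

```shell
# Give dev-user1 a private home directory in HDFS
hdfs dfs -mkdir -p /user/dev-user1
hdfs dfs -chown dev-user1:dev-user1 /user/dev-user1

# A shared area that everyone in one POSIX group can write to
hdfs dfs -mkdir -p /data/shared
hdfs dfs -chgrp devs /data/shared
hdfs dfs -chmod 775 /data/shared   # group-writable; 1777 would make it world-writable like /tmp
```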
https://hadoop.apache.org/docs/r2.7.3/hadoop-project-dist/hadoop-hdfs/HdfsPermissionsGuide.html
In production-grade clusters, Hadoop is secured with Kerberos credentials, and ACLs are managed via Apache Ranger or Sentry, both of which allow fine-grained permission management.

Hadoop User Addition in Secured Cluster

We are using a Kerberized CDH cluster. When adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distro cluster. But with the newly added user IDs, we are not able to execute MapReduce/YARN jobs; they throw a "user not found" exception.
When I researched this, I came across https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html , which says that to execute YARN jobs in a secured cluster, we need the corresponding user on all the nodes, because secure containers execute under the credentials of the job user.
So we added the corresponding user ID to all the nodes, and the jobs now execute.
If this is the case, and the cluster has around 100+ nodes, provisioning each user ID becomes a tedious job.
Can anyone suggest a more effective way, if you have come across the same scenario in your project implementation?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another is to use a configuration management tool (Chef, Puppet) to sync /etc/passwd and /etc/group across your cluster at regular intervals (every hour to every day), or use a cron job to do the same.
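Without full configuration management, the same idea can be sketched as a loop from an admin host. All hostnames, the user name and the UID below are placeholders, and it assumes root SSH access to every node:

```shell
# Create the same account, with a fixed UID, on every worker node so that
# secure containers can find the job user. All values are illustrative.
NODES="node01 node02 node03"
NEW_USER=dev-user1
NEW_UID=2001

for host in ${NODES}; do
  ssh "root@${host}" \
    "id -u ${NEW_USER} >/dev/null 2>&1 || useradd -u ${NEW_UID} -m ${NEW_USER}"
done
```

Pinning the UID matters: YARN and HDFS compare user names, but files written on local disks compare UIDs, so the account should look identical everywhere.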
Otherwise you can buy or use open-source Linux/UNIX user-mapping services like Centrify (commercial), VAS (commercial), FreeIPA (free) or SSSD (free).
If you have an Active Directory or LDAP server, use Hadoop's LDAP user mappings.
References:
https://community.hortonworks.com/questions/57394/what-are-the-best-practises-for-unix-user-mapping.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_sg_ldap_grp_mappings.html
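The LDAP user mapping mentioned above is configured in core-site.xml. A sketch, assuming an Active Directory reachable at ad.example.com (all hostnames and DNs are placeholders):

```xml
<!-- core-site.xml: resolve group membership from LDAP/AD instead of
     the local /etc/group on every node -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ad.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-bind,ou=ServiceAccounts,dc=example,dc=com</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.base</name>
  <value>dc=example,dc=com</value>
</property>
```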

User Accounts for Hadoop Daemons

I read the Hadoop in Secure Mode documentation. Right now all my daemons run under one single account. The doc suggests running different daemons under different accounts. What is the purpose of doing that?
Link to Hadoop in Secure Mode Doc
It's generally good practice to use separate dedicated service accounts for different server processes where possible. This limits attack surface in the event that an attacker compromises one of the processes. For example, if an attacker compromised process A, then the attacker could do things like access files owned by the account running process A. If process B used the same account as process A, then files created by process B would also be compromised. By using a separate account for process B, we can limit the impact of a vulnerability.
Aside from that general principle, there are other considerations specific to the implementation of Hadoop that make it desirable to use separate accounts.
HDFS has a concept of the super-user. The HDFS super-user is the account that is running the NameNode process. The super-user has special privileges to run HDFS administration commands and access all files in HDFS, regardless of the permission settings on those files. YARN and MapReduce daemons do not require HDFS super-user privilege. They can operate as an unprivileged user of HDFS, accessing only files for which they have permission. Running everything with the same account would unintentionally escalate privileges for the YARN and MapReduce daemons.
When running in secured mode, the YARN NodeManager utilizes the LinuxContainerExecutor to launch container processes as the user who submitted the YARN application. This works by using a special setuid executable, which allows the user running the NodeManager to switch to running a process as the user who submitted the application. This ensures that users submitting applications cannot escalate privileges by running code in the context of a different user account. However, setuid executables themselves are powerful tools that can cause privilege escalation problems if used incorrectly. The LinuxContainerExecutor documentation describes very specific steps to take in setting the permissions and configuration for this setuid executable. If a separate account was not used for running the YARN daemons, then this setuid executable would have to be made accessible to a larger set of accounts, which would increase the attack surface.
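As a concrete illustration of those steps, the permissions the LinuxContainerExecutor documentation calls for look roughly like this (the install path and the group name hadoop are assumptions; check your distribution's layout):

```shell
# The container-executor binary must be owned by root, with its group
# containing ONLY the NodeManager user, so no one else can invoke it.
chown root:hadoop /usr/lib/hadoop-yarn/bin/container-executor

# 6050 = setuid + setgid, executable by group only (---Sr-s---):
# the NodeManager's group may run it, and it switches to root just long
# enough to launch the container as the submitting user.
chmod 6050 /usr/lib/hadoop-yarn/bin/container-executor
```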

Restrict Folder Access in Hadoop

Two different groups of people plan to use our Hadoop cluster, but I don't want them to see each other's data.
How can I prevent this on our Hadoop cluster?
I understand that if you set an environment variable you can easily impersonate the Hadoop superuser and access all data in HDFS. Is there a simpler way to prevent this, or is Kerberos- and LDAP-based security the only way to go?
Kerberos is the only way to prevent users in Hadoop from impersonating the hdfs superuser and misusing its privileges.
It is very simple for users to impersonate the hdfs user (who happens to be the superuser of Hadoop in most distributions). Anyone can do it by setting the environment variable HADOOP_USER_NAME to hdfs.
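To see how low the bar is on an unsecured cluster, the impersonation is a one-liner (the HDFS path is a made-up example; don't run this against data you care about):

```shell
# Without Kerberos, Hadoop trusts this variable as your identity, so any
# shell user can act as the superuser for a single command:
HADOOP_USER_NAME=hdfs hdfs dfs -rm -r /some/protected/path

# With Kerberos enabled the variable is ignored: identity comes from the
# ticket obtained with kinit, which an attacker cannot simply set.
```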

hadoop single cluster user

I am reading this document here:
http://hadoop.apache.org/docs/r2.4.0/hadoop-project-dist/hadoop-common/SingleCluster.html#Pseudo-Distributed_Operation
It has this item:
Make the HDFS directories required to execute MapReduce jobs:
$ bin/hdfs dfs -mkdir /user
$ bin/hdfs dfs -mkdir /user/<username>
It is not clear to me what <username> here should be.
Is this the dedicated Linux user which I created for Hadoop, or something else?
I am a beginner at Hadoop; I just installed it today
and am just trying to play with a few basic examples.
Short Answer: It doesn't have to be any username, it's just whatever you choose to call the directory in HDFS where you want to put your output. But using /user/<username> is convention and good practice.
Long-Winded Answer:
Peter, think of the "Hadoop username" merely as a way to keep your stuff in HDFS distinct from that of anyone else who's also using the same Hadoop cluster. It's really just the name of a directory that you're creating or using under /user in HDFS. You don't necessarily have to "log in" to use Hadoop, but very often the hadoop username just mimics your standard username/profile.
For example, at my previous employer, everyone's logins (for email address, chat client, accessing applications, connecting to servers, developing code, etc. -- pretty much anything at work that ever required a username & password) were in the format of <firstname.lastname>, so we'd log in to everything that way. Most of us had execution privileges on our grid, so we would ssh to an appropriate server (e.g. $ ssh trevor.allen@server-of-awesomeness), where we had permission to execute MapReduce jobs on the grid. Just as my user was always first.last on my own machine, as well as on all our Linux servers (e.g. home in /home/trevor.allen/), we would follow this precedent in HDFS as well, pointing any output in HDFS to /user/first.last. Of course, since the "username" was arbitrary (really just the name of a directory), you'd occasionally see typos (/user/john.deo), or someone mixed up between Linux's usr convention and Hadoop's user convention (/user/john.doe vs /usr/john.doe), or just random dropping of last names (/user/john), and so on.
Hope that helps!
The username corresponds to a user in HDFS. So here you can create the same user as your Linux account, or a different one. For example, if you install Hive, Spark or HBase, you will have to create their directories in order to run these services.
The username here is the one you use to log in to Hadoop. By default it is your user account name.
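Putting the answers above together: since <username> is conventionally your own login, you can substitute it directly (a sketch, assuming the pseudo-distributed setup from the linked guide):

```shell
# Create your personal HDFS home directory, named after your Linux login
bin/hdfs dfs -mkdir /user
bin/hdfs dfs -mkdir /user/$(whoami)
```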
