I read the Hadoop in Secure Mode documentation. Right now all my daemons are running under a single account. The documentation suggests running different daemons under different accounts. What is the purpose of doing that?
It's generally good practice to use separate dedicated service accounts for different server processes where possible. This limits attack surface in the event that an attacker compromises one of the processes. For example, if an attacker compromised process A, then the attacker could do things like access files owned by the account running process A. If process B used the same account as process A, then files created by process B would also be compromised. By using a separate account for process B, we can limit the impact of a vulnerability.
Aside from that general principle, there are other considerations specific to the implementation of Hadoop that make it desirable to use separate accounts.
HDFS has a concept of the super-user. The HDFS super-user is the account that is running the NameNode process. The super-user has special privileges to run HDFS administration commands and access all files in HDFS, regardless of the permission settings on those files. YARN and MapReduce daemons do not require HDFS super-user privilege. They can operate as an unprivileged user of HDFS, accessing only files for which they have permission. Running everything with the same account would unintentionally escalate privileges for the YARN and MapReduce daemons.
When running in secure mode, the YARN NodeManager uses the LinuxContainerExecutor to launch container processes as the user who submitted the YARN application. This works by using a special setuid executable, which allows the user running the NodeManager to switch to running a process as the user who submitted the application. This ensures that users submitting applications cannot escalate privileges by running code in the context of a different user account. However, setuid executables themselves are powerful tools that can cause privilege escalation problems if used incorrectly. The LinuxContainerExecutor documentation describes very specific steps to take in setting the permissions and configuration for this setuid executable. If a separate account were not used for running the YARN daemons, then this setuid executable would have to be made accessible to a larger set of accounts, which would increase the attack surface.
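As a rough illustration of what those permission steps amount to, here is a minimal shell sketch that checks a container-executor binary against the ownership and mode given in the Hadoop secure-mode documentation (owned by root, mode 6050). The function name is hypothetical, and it assumes a Linux stat that supports -c.

```shell
# Hypothetical check, not part of Hadoop: succeed only if the given
# container-executor binary is owned by root and has mode 6050
# (setuid + setgid, readable/executable by the group only).
check_container_executor() {
  bin="$1"
  owner=$(stat -c '%U' "$bin" 2>/dev/null) || return 2
  mode=$(stat -c '%a' "$bin" 2>/dev/null) || return 2
  [ "$owner" = "root" ] && [ "$mode" = "6050" ]
}
```

You would call it as check_container_executor /path/to/container-executor (the path is a placeholder) and treat a non-zero exit status as "permissions need fixing". It does not check the group ownership or the accompanying container-executor.cfg, which the documentation also constrains.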


Why is another user required to run Hadoop?

I have a question regarding Hadoop configuration.
Why do we need to create a user for running Hadoop? Can we not run Hadoop as the root user?
Yes, you can run it as root.
It is not a requirement to have a dedicated user for Hadoop but having one with lesser privileges than root is considered a good practice. It helps in separating Hadoop processes from other services running on the same machine.
This is not Hadoop-specific. It's common good practice in IT to have specific users for running daemons, for security reasons (for example, in Hadoop, if you run the MapReduce daemons as root, a malicious user could launch a MapReduce job that deletes not only HDFS data but also operating system data), for better control, etc. Take a look at this:
It is not at all required to create a new user to run Hadoop. Also, the Hadoop user need not be (and should not be) in the sudoers file or be the root user [ref]. Your login user for the machine can also act as the Hadoop user. But as mentioned by @Luis and @franklinsijo, it is good practice to have a specific user for a specific service.

Hadoop User Addition in Secured Cluster

We are using a kerberized CDH cluster. When adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distribution. But with the newly added user IDs, we are not able to execute MapReduce/YARN jobs; they fail with a "user not found" exception.
When I researched this, I came across a link which says that, to execute YARN jobs in a secured cluster, the corresponding user may need to exist on all the nodes, because the secure containers execute under the credentials of the job user.
So we added the corresponding user ID to all the nodes, and the jobs now run.
If this is the case and the cluster has 100+ nodes, provisioning each user ID on every node would become a tedious job.
Can anyone suggest a more effective way, if you have come across the same scenario in your own project?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another one is to use a configuration management tool to sync /etc/passwd and /etc/group (chef, puppet) on your cluster at regular intervals (1 hr - 1 day) or use a cron job to do this.
Otherwise you can buy or use open source Linux/UNIX user mapping services like Centrify (commercial), VAS (commercial), FreeIPA (free) or SSSD (free).
If you have an Active Directory server or LDAP server use the Hadoop LDAP user mappings.
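Whichever approach you pick, it helps to verify that the account actually resolves on every worker before submitting jobs. A minimal sketch, assuming passwordless ssh to the nodes and a file with one hostname per line; the function names and the file are hypothetical:

```shell
# Succeed only if the local OS can resolve the account
# (works the same whether it comes from /etc/passwd, SSSD, LDAP, ...).
user_exists() {
  id -u "$1" >/dev/null 2>&1
}

# Report every node in the given file on which the account does not resolve.
check_cluster() {
  account="$1"
  nodes_file="$2"
  while read -r host; do
    # -n keeps ssh from swallowing the rest of the node list on stdin
    ssh -n "$host" "id -u '$account'" >/dev/null 2>&1 ||
      echo "$account cannot be resolved on $host"
  done < "$nodes_file"
}
```

For example, check_cluster alice nodes.txt prints one line per node where alice is missing and nothing when the account is visible everywhere.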

Spark submit to YARN as another user

Is it possible to submit a spark job to a yarn cluster and choose, either with the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script containing the user.
PS: Is it still possible if the cluster has a Kerberos configuration (and the script has a keytab)?
For a non-kerberized cluster: export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
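An alternative that avoids the export/unset pair is to set the variable for a single command only. A small sketch (run_as_hadoop_user is a hypothetical helper name, not a Hadoop or Spark command):

```shell
# Run one command with HADOOP_USER_NAME set, without touching the
# environment of the calling shell.
run_as_hadoop_user() {
  hadoop_user="$1"
  shift
  env HADOOP_USER_NAME="$hadoop_user" "$@"
}
```

Usage would look like run_as_hadoop_user zorro spark-submit ..., after which HADOOP_USER_NAME is still unset in the rest of the script.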
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (which probably depend on your default ticket) would be something along these lines:
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM
spark-submit ...........
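One way to keep that temporary ticket cache from leaking into the rest of the script is to wrap the steps above in a subshell function. This is only a sketch: with_private_ccache is a hypothetical name, and it assumes your kinit accepts a FILE: cache whose (empty) file was created by mktemp.

```shell
# Run a command with its own Kerberos credential cache. The function body
# is a subshell, so KRB5CCNAME never changes in the calling shell, and the
# cache file is removed once the command finishes.
with_private_ccache() (
  cache_file=$(mktemp "${TMPDIR:-/tmp}/krb5cc_temp_XXXXXX") || exit 1
  export KRB5CCNAME="FILE:$cache_file"
  "$@"
  status=$?
  rm -f "$cache_file"
  exit "$status"
)
```

A call might look like with_private_ccache sh -c 'kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM && spark-submit ...', which returns the exit status of the wrapped command.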
For a non-kerberized cluster you can add a Spark conf as:
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=<user_name>
Another (much safer) approach is to use proxy authentication: basically you create a service account and then allow it to impersonate other users.
$ spark-submit --help 2>&1 | grep proxy
--proxy-user NAME User to impersonate when submitting the application.
Assuming Kerberized / secured cluster.
I mentioned it's much safer because you don't need to store (and manage) keytabs of all the users you will have to impersonate.
To enable impersonation, there are several settings you'd need to enable on Hadoop side to tell which account(s) can impersonate which users or groups and on which servers. Let's say you have created svc_spark_prd service account/ user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names for servers which are allowed to submit impersonated Spark applications. * is allowed (meaning any host) but not recommended.
Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list users or groups that svc_spark_prd is allowed to impersonate. * is allowed but not recommended for any user/group.
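To make the shape of those settings concrete, here is a small sketch that prints the corresponding property entries in the usual Hadoop XML configuration format (these settings normally live in core-site.xml). The helper name, the example host and the example group are made up for illustration:

```shell
# Hypothetical helper: print <property> entries allowing one service
# account to impersonate members of the given groups from the given hosts.
proxyuser_props() {
  svc="$1"
  allowed_hosts="$2"
  allowed_groups="$3"
  cat <<EOF
<property>
  <name>hadoop.proxyuser.${svc}.hosts</name>
  <value>${allowed_hosts}</value>
</property>
<property>
  <name>hadoop.proxyuser.${svc}.groups</name>
  <value>${allowed_groups}</value>
</property>
EOF
}
```

For instance, proxyuser_props svc_spark_prd edge01.example.com analysts prints two entries you could paste inside the configuration element, restricting impersonation to one edge node and one group instead of using *.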
Also, check out documentation on proxy authentication.
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
If your user exists, you can still launch your spark-submit with
su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you run kinit as this user it should be fine.
If you can't use su because you don't want to supply the password, see this Stack Overflow answer:
how to run script as another user without password

Launching map-reduce job as a different user

Installing cosmos-gui

Can you help me with installing cosmos-gui? I think you are one of the developers behind cosmos? Am I right?
We have already installed Cosmos, and now we want to install cosmos-gui.
In the link below, I found the install guide:
Under subchapter “Prerequisites” is written
A couple of sudoer users, one within the storage cluster and another one within the computing clusters, are required. Through these users, the cosmos-gui will remotely run certain administration commands such as new users creation, HDFS userspaces provision, etc. The access through these sudoer users will be authenticated by means of private keys.
What is meant by the above? Must I create a sudo user for both the computing and the storage cluster? And for that, do I need to install a MySQL DB?
And under subchapter “Installing the GUI.”
Before continuing, remember to add the RSA key fingerprints of the Namenodes accessed by the GUI. These fingerprints are automatically added to /home/cosmos-gui/.ssh/known_hosts if you try an ssh access to the Namenodes for the first time.
I can’t make any sense of the above. Can you give me a step-by-step plan?
I hope you can help me.
First of all, a reminder about the Cosmos architecture:
There is a storage cluster based on HDFS.
There is a computing cluster based on shared Hadoop or based on Sahara; that's up to the administrator.
There is a services node for the storage cluster, a special node not storing data but exposing storage-related services such as HttpFS for data I/O. It is the entry point to the storage cluster.
There is a services node for the computing cluster, a special node not involved in the computations but exposing computing-related services such as Hive or Oozie. It is the entry point to the computing cluster.
There is another machine hosting the GUI, not belonging to any cluster.
That said, the paragraphs you mention try to explain the following:
Since the GUI needs to perform certain sudo operations on the storage and computing clusters in order to create user accounts, a sudoer user must be created on both services nodes. The GUI will use these sudoer users to perform the required operations remotely over ssh.
Regarding the RSA fingerprints: since the operations the GUI performs on the services nodes are executed over ssh, the fingerprints the servers send back when you ssh into them must be included in the .ssh/known_hosts file. You may do this manually, or simply by ssh'ing into the services nodes for the first time (you will be asked whether to add the fingerprints to the file).
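If you prefer to script that step rather than answer the interactive prompt, something along these lines should work with the standard OpenSSH tools. It is only a sketch: add_known_host is a hypothetical helper, and ssh-keyscan trusts whatever key the host presents, so you should still compare the fingerprint with the one the server administrator gives you.

```shell
# Append a host's RSA public key to a known_hosts file, unless the host
# already has an entry there. Defaults to the calling user's known_hosts.
add_known_host() {
  host="$1"
  known_hosts="${2:-$HOME/.ssh/known_hosts}"
  # ssh-keygen -F prints the matching entries, if there are any
  if [ -n "$(ssh-keygen -F "$host" -f "$known_hosts" 2>/dev/null)" ]; then
    return 0
  fi
  ssh-keyscan -t rsa "$host" >> "$known_hosts"
}
```

Run as the user that owns the GUI process, a call such as add_known_host storage-services-node (the hostname is a placeholder) would populate that user's .ssh/known_hosts.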
MySQL appears in the requirements because that section lists all the prerequisites in general; the items are not necessarily related to each other. In this particular case, MySQL is needed to store the account information.
We are always improving the documentation, and we'll try to explain this better in the next release.
