Spark submit to YARN as another user - hadoop

Is it possible to submit a Spark job to a YARN cluster and choose, either on the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script containing the user.
PS: is it still possible if the cluster has a Kerberos configuration (and the script a keytab)?

For a non-kerberized cluster: export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
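As a sketch (for a non-kerberized cluster, with the user name zorro and the spark-submit arguments as placeholders), the wrapper script could look like this, with the real submission left as a comment:

```shell
#!/bin/sh
# Sketch: submit as a different HDFS user on a NON-kerberized cluster,
# then restore the shell's default identity afterwards.
export HADOOP_USER_NAME=zorro
echo "submitting as ${HADOOP_USER_NAME}"
# spark-submit --master yarn --class com.example.App app.jar   # real submission goes here
unset HADOOP_USER_NAME
echo "HADOOP_USER_NAME is now: ${HADOOP_USER_NAME:-<unset>}"
```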
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (which probably depend on your default ticket) would be something along these lines...
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM
spark-submit ...........
kdestroy

For a non-kerberized cluster you can add a Spark conf as:
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=<user_name>

Another (much safer) approach is to use proxy authentication - basically you create a service account and then allow it to impersonate other users.
$ spark-submit --help 2>&1 | grep proxy
--proxy-user NAME User to impersonate when submitting the application.
Assuming Kerberized / secured cluster.
I mentioned it's much safer because you don't need to store (and manage) keytabs for all the users you will have to impersonate.
To enable impersonation, there are several settings you'd need to enable on the Hadoop side to tell which account(s) can impersonate which users or groups and on which servers. Let's say you have created the svc_spark_prd service account/user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names of the servers which are allowed to submit impersonated Spark applications. * is allowed but not recommended for any host.
Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list users or groups that svc_spark_prd is allowed to impersonate. * is allowed but not recommended for any user/group.
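For illustration, the corresponding core-site.xml entries might look like this (a sketch only; the host names and group names are hypothetical):

```xml
<!-- core-site.xml on the NameNode / ResourceManager; values are examples only -->
<property>
  <name>hadoop.proxyuser.svc_spark_prd.hosts</name>
  <value>edge1.example.com,edge2.example.com</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_spark_prd.groups</name>
  <value>analysts,etl_users</value>
</property>
```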
Also, check out documentation on proxy authentication.
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
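A minimal sketch of the submit side (the user name, jar name and keytab path are made up; on a kerberized cluster you would kinit as the service account first):

```shell
# Sketch: build the proxy-user submission command; names are placeholders.
# On a kerberized cluster, first obtain a ticket for the service account, e.g.:
#   kinit -kt /etc/security/keytabs/svc_spark_prd.keytab svc_spark_prd@MY.REALM
PROXY_USER=zorro
CMD="spark-submit --proxy-user $PROXY_USER --master yarn app.jar"
echo "$CMD"   # a real script would execute this instead of echoing it
```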

If your user exists, you can still launch your spark-submit with
su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you run kinit as this user it should be fine.
If you can't use su because you don't want the password, I invite you to see this stackoverflow answer:
how to run script as another user without password

Related

How to login to real time company hadoop cluster?

I am new to the Hadoop environment. I joined a company and was given KT and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?
Not really clear what you're logging into. You should ask your coworkers for advice.
However, sounds like you have a Kerberos keytab, and you would run
kinit -kt key.kt
There might be additional arguments necessary there, such as what's referred to as a principal, but only the cluster administrators can answer what that needs to be.
To verify your ticket is active
klist
Usually you will have Edge Nodes, i.e. client nodes, installed with all the clients, like
HDFS Client
Sqoop Client
Hive Client etc.
You need to get the hostnames/IP addresses for these machines. If you are using Windows you can use PuTTY to log in to these nodes, either with a username and password or with the .ppk file provided for those nodes.
Any company, in my view, will have an infrastructure team which configures LDAP with the Hadoop cluster, allowing all the users in by providing/adding your ID to the group roles.
And btw, are you using Cloudera/MapR/Hortonworks? Every distribution has its own way and best practices.
I am assuming KT means knowledge transfer. Also the project document is about the application and not the Hadoop Cluster/Infra.
I would follow the following procedure:
1) Find out the name of the edge-node (also called client node) from your team or your TechOps. Also find out if you will be using some generic linux user (say "develteam") or you would have to get a user created on the edge-node.
2) Assuming you are accessing from Windows, install some ssh client (like putty).
3) Log in to the edge node using the credentials (for generic user or specific user as in #1).
4) Run following command to check you are on Hadoop Cluster:
> hadoop version
5) Try hive shell by typing:
> hive
6) Try running following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop config for that cluster. You would most probably not have write permissions, but maybe you can cat the following files to get an idea of the cluster:
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
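As a sketch of what "getting an idea of the cluster" from those files can look like (the file content is inlined here so the example is self-contained; on a real edge node the configs usually live under /etc/hadoop/conf/):

```shell
# Write a sample core-site.xml so the extraction step below is runnable anywhere;
# on a cluster you would point at /etc/hadoop/conf/core-site.xml instead.
cat > /tmp/core-site-sample.xml <<'EOF'
<configuration>
  <property>
    <name>fs.defaultFS</name>
    <value>hdfs://namenode.example.com:8020</value>
  </property>
</configuration>
EOF
# Pull out the NameNode URI the cluster is configured with
grep -o 'hdfs://[^<]*' /tmp/core-site-sample.xml
```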

how users should work with ambari cluster

My question is pretty trivial but I didn't find anyone actually asking it.
We have an Ambari cluster with Spark, Storm, HBase and HDFS (among other things).
I don't understand how a user that wants to use that cluster uses it.
For example, a user wants to copy a file to HDFS, run a spark-shell or create a new table in the hbase shell.
Should he get a local account on the server that runs the corresponding service? Shouldn't he use a 3rd party machine (his own laptop for example)?
If so, how should one use hadoop fs? There is no way to specify the server IP like spark-shell has.
What is the normal/right/expected way to run all these tasks from a user perspective?
Thanks.
The expected way to run the described tasks from the command line is as follows.
First, gain access to the command line of a server that has the required clients installed for the services you want to use, e.g. HDFS, Spark, HBase et cetera.
During the process of provisioning a cluster via Ambari, it is possible to define one or more servers where the clients will be installed.
(In my case, I decided to install the clients on all servers.)
Afterwards, one way to figure out which servers have the required clients installed is to check the Hosts view in Ambari, which lists the installed clients for each host.
Once you have installed the clients on one or more servers, these servers will be able to utilize the services of your cluster via the command line.
Just to be clear, the utilization of a service by a client is location-independent from the server where the service is actually running.
Second, make sure that you are compliant with the security mechanisms of your cluster. In relation to HDFS, this could influence which users you are allowed to use and which directories you can access by using them. If you do not use security mechanisms like e.g. Kerberos, Ranger and so on, you should be able to directly run your stated tasks from the command line.
Third, execute your tasks via command line.
Here is a short example of how to access HDFS without considering security mechanisms:
ssh user@hostxyz # Connect to the server that has the required HDFS client installed
hdfs dfs -ls /tmp # Command to list the contents of the HDFS tmp directory
Take a look at Ambari Views, especially the Files view, which allows browsing HDFS.

Why another user is required to run hadoop?

I have a question regarding Hadoop configuration:
why do we need to create a user for running Hadoop? Can we not run Hadoop as the root user?
Yes, you can run it as root.
It is not a requirement to have a dedicated user for Hadoop but having one with lesser privileges than root is considered a good practice. It helps in separating Hadoop processes from other services running on the same machine.
This is not Hadoop-specific; it's common good practice in IT to have dedicated users for running daemons, both for security reasons (for example, in Hadoop, if you run MapReduce daemons as root, a malicious user could launch a MapReduce job which deletes not only HDFS data but also operating-system data) and for better control. Take a look at this:
https://unix.stackexchange.com/questions/29159/why-is-it-recommended-to-create-a-group-and-user-for-some-applications
It is not at all required to create a new user to run Hadoop. Also, the hadoop user need not be (should not be) in the sudoers file or a root user [ref]. Your login user for the machine can also act as a hadoop user. But as mentioned by @Luis and @franklinsijo, it is good practice to have a specific user for a specific service.

Hadoop User Addition in Secured Cluster

We are using a kerberized CDH cluster. While adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distro cluster. But with the newly added user IDs, we are not able to execute map-reduce/YARN jobs; they throw a "user not found" exception.
When I researched through this, I came across a link https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html , which says to execute the yarn jobs in the secured cluster, we might need to have the corresponding user in all the nodes as the secure containers execute under the credentials of the job user.
So we added the corresponding userID to all the nodes and the jobs are getting executed.
If this is the case and the cluster has around 100+ nodes, user provisioning for each user ID would become a tedious job.
Can anyone please suggest any other effective way, if you came across the same scenario in your project implementation?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another one is to use a configuration management tool (Chef, Puppet) to sync /etc/passwd and /etc/group on your cluster at regular intervals (1 hr - 1 day), or use a cron job to do this.
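For illustration, the cron-job variant boils down to something like this (sample passwd-format input is inlined; a real script would read /etc/passwd on a reference node and push the entries out to the other nodes):

```shell
# Sketch: list the local accounts (UID >= 1000) that would need to exist on every node.
users=$(awk -F: '$3 >= 1000 {print $1}' <<'EOF'
root:x:0:0:root:/root:/bin/bash
hdfs:x:498:498::/var/lib/hadoop-hdfs:/bin/bash
alice:x:1001:1001::/home/alice:/bin/bash
EOF
)
echo "$users"   # these are the entries a config-management tool or cron job would sync
```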
Otherwise you can buy or use open source Linux/UNIX user mapping services like Centrify (commercial), VAS (commercial), FreeIPA (free) or SSSD (free).
If you have an Active Directory server or LDAP server use the Hadoop LDAP user mappings.
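A sketch of what the LDAP group mapping looks like in core-site.xml (the server URL and bind account are hypothetical; check your distribution's docs for the full set of properties):

```xml
<!-- core-site.xml: resolve Hadoop users' groups via LDAP; values are examples only -->
<property>
  <name>hadoop.security.group.mapping</name>
  <value>org.apache.hadoop.security.LdapGroupsMapping</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.url</name>
  <value>ldap://ldap.example.com:389</value>
</property>
<property>
  <name>hadoop.security.group.mapping.ldap.bind.user</name>
  <value>cn=hadoop-bind,dc=example,dc=com</value>
</property>
```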
References:
https://community.hortonworks.com/questions/57394/what-are-the-best-practises-for-unix-user-mapping.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_sg_ldap_grp_mappings.html

kerberos ticket and delegation token usage in Hadoop Oozie shell action

I am new to Hadoop and trying to understand why my Oozie shell action is not taking the new ticket even after doing kinit. Here is my scenario.
I log in using my ID "A", and have a Kerberos ticket for my ID. I submit an Oozie workflow with a shell action using my ID.
Inside oozie shell action I do another kinit to obtain the ticket for ID "B".
Only this id "B" has access to some HDFS file. kinit is working fine since klist showed the ticket for id "B". Now when I read the HDFS file that only B has access to, I get permission denied error saying "A" does not have permission to access this file.
But when I do the same thing from linux cli, outside oozie, after I do kinit and take ticket for "B", I am able to read the HDFS file as "B".
But the same step is not working inside oozie shell action and hadoop fs commands always seem to work as the user that submitted the oozie workflow rather than the user for which kerberos ticket is present.
Can someone please explain why this is happening? I am not able to understand this.
In the same shell action, though the hadoop fs command failed to change to user "B", the hbase shell works as user "B". Just for testing, I created an hbase table that only "A" has access to, and added an hbase shell get command on this table to the action. If I do kinit -kt for user "B" and take its ticket, this fails too, saying "B" does not have access to this table. So I think hbase is taking the new ticket instead of the delegation token of the user submitting the Oozie workflow. When I don't do kinit -kt inside the shell action, the hbase command succeeds.
If I do kinit, I could not even run Hive queries; it says "A" does not have execute access to some directories like /tmp/B/ that only "B" has access to. So I could not understand how Hive is working - whether it takes the delegation token created when the Oozie workflow is submitted, or the new ticket created for the new user.
Can someone please help me understand the above scenario? Which Hadoop services take a new ticket for authentication and which commands take the delegation token (like hadoop fs commands)? Is this how it should work, or am I doing something wrong?
I just don't understand why the same hadoop fs command worked outside Oozie as a different user but is not working inside the Oozie shell action, even after kinit.
When does this delegation token actually get created? Does it get created only when the Oozie workflow is submitted, or also when I issue hadoop fs commands?
Thank you!
In theory -- Oozie automatically transfers the credentials of the submitter (i.e. A) to the YARN containers running the job. You don't have to care about kinit, because, in fact, it's too late for that.
You are not supposed to impersonate another user inside an Oozie job; that would defeat the purpose of strict Kerberos authentication.
In practice it's more tricky -- (1) the core Hadoop services (HDFS, YARN) check the Kerberos token just once, then they create a "delegation token" that is shared among all nodes and all services.
(2) the oozie service user has special privileges: it can do a kind of Hadoop "sudo", so that it connects to YARN as oozie but YARN creates the "delegation token" for the job submitter (i.e. A), and that's it, you can't change that token.
(3) well, actually you can use an alternate token, but only with some custom Java code that explicitly creates a UserGroupInformation object for an alternate user. The Hadoop command-line interfaces don't do that.
(4) what about non-core Hadoop, e.g. HBase or the Hive Metastore, or non-Hadoop stuff, e.g. ZooKeeper? They don't use "delegation tokens" at all. Either you manage the UserGroupInformation explicitly in Java code, or the default Kerberos ticket is used at connect time.
That's why your HBase shell worked, and if you had used Beeline (the JDBC thin client) instead of Hive (the legacy fat client) it would probably have worked too.
(5) Oozie tries to fill that gap with specific <credentials> options for Hive, Beeline ("Hive2" action), HBase, etc; I'm not sure how that works but it has to imply a non-default Kerberos ticket cache, local to your job containers.
We've found it possible to become another Kerberos principal once an Oozie workflow is launched. We have to run a shell action that then runs Java with a custom -Djava.security.auth.login.config=custom_jaas.conf, which provides a JVM kinit'ed as someone else. This is along the lines of Samson's (3), although this kinit can even be for a completely different realm.
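For what it's worth, a custom_jaas.conf along those lines could look roughly like this (the entry name, keytab path and principal are all hypothetical and depend on how the Java code looks the entry up):

```
// custom_jaas.conf: log the JVM in from a keytab instead of the submitter's ticket
Client {
    com.sun.security.auth.module.Krb5LoginModule required
    useKeyTab=true
    keyTab="/home/jobuser/.keytabs/other_user.keytab"
    principal="other_user@OTHER.REALM"
    storeKey=true
    doNotPrompt=true;
};
```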