Kerberos ticket and delegation token usage in Hadoop Oozie shell action

I am new to Hadoop and trying to understand why my Oozie shell action does not pick up the new ticket even after I run kinit. Here is my scenario.
I log in using my ID "A" and have a Kerberos ticket for that ID. I submit an Oozie workflow with a shell action using my ID.
Inside the Oozie shell action I run another kinit to obtain a ticket for ID "B".
Only ID "B" has access to a certain HDFS file. kinit works fine, since klist shows the ticket for ID "B". But when I read the HDFS file that only B has access to, I get a permission denied error saying "A" does not have permission to access this file.
When I do the same thing from the Linux CLI, outside Oozie, after I do kinit and get a ticket for "B", I am able to read the HDFS file as "B".
The same step does not work inside the Oozie shell action: hadoop fs commands always seem to run as the user that submitted the Oozie workflow rather than the user for which a Kerberos ticket is present.
Can someone please explain why this is happening? I am not able to understand this.
In the same shell action, although the hadoop fs command did not switch to user "B", hbase shell works as user B. Just for testing, I created an HBase table that only "A" has access to and added an hbase shell get command on this table to the action. If I do kinit -kt for user "B" and get its ticket, the get fails, saying "B" does not have access to this table. So I think HBase uses the new ticket instead of the delegation token of the user who submitted the Oozie workflow. When I don't do kinit -kt inside the shell action, the hbase command succeeds.
If I do kinit, I cannot even run Hive queries: they fail saying "A" does not have execute access to some directories like /tmp/B/ that only "B" has access to. So I cannot tell how Hive works: whether it uses the delegation token created when the Oozie workflow is submitted or the new ticket created for the new user.
Can someone please help me understand the above scenario? Which Hadoop services take the new ticket for authentication and which commands take the delegation token (like hadoop fs commands)? Is this how it is supposed to work, or am I doing something wrong?
I just don't understand why the same hadoop fs command works outside Oozie as a different user but does not work inside the Oozie shell action, even after kinit.
When does this delegation token actually get created? Does it get created only when the Oozie workflow is submitted, or also when I issue hadoop fs commands?
Thank you!

In theory -- Oozie automatically transfers the credentials of the submitter (i.e. A) to the YARN containers running the job. You don't have to care about kinit, because, in fact, it's too late for that.
You are not supposed to impersonate another user inside of an Oozie job; that would defeat the purpose of strict Kerberos authentication.
In practice it's more tricky -- (1) the core Hadoop services (HDFS, YARN) check the Kerberos ticket just once, then they create a "delegation token" that is shared among all nodes and all services.
(2) the oozie service user has special privileges: it can do a kind of Hadoop "sudo", so that it connects to YARN as oozie but YARN creates the "delegation token" for the job submitter (i.e. A), and that's it, you can't change that token.
(3) well, actually you can use an alternate token, but only with some custom Java code that explicitly creates a UserGroupInformation object for an alternate user. The Hadoop command-line interfaces don't do that.
(4) what about non-core Hadoop, i.e. HBase or Hive Metastore, or non-Hadoop stuff, i.e. Zookeeper? They don't use "delegation tokens" at all. Either you explicitly manage the UserGroupInformation in Java code, or the default Kerberos ticket is used at connect time.
That's why your HBase shell worked, and if you had used Beeline (the JDBC thin client) instead of Hive (the legacy fat client) it would probably have worked too.
(5) Oozie tries to fill that gap with specific <credentials> options for Hive, Beeline ("Hive2" action), HBase, etc; I'm not sure how that works but it has to imply a non-default Kerberos ticket cache, local to your job containers.
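
To see which of the two credential sources discussed above is present inside a shell action, a small diagnostic can help. This is only a sketch: it assumes YARN exposes the job's token file to the container through the HADOOP_TOKEN_FILE_LOCATION environment variable (the variable that UserGroupInformation reads), and describe_credentials is a made-up helper name.

```shell
#!/bin/sh
# Print the credential sources visible to this shell action.
# Assumption: YARN sets HADOOP_TOKEN_FILE_LOCATION in job containers.
# KRB5CCNAME, when set, names the Kerberos ticket cache used by kinit/klist.
describe_credentials() {
    if [ -n "${HADOOP_TOKEN_FILE_LOCATION:-}" ]; then
        echo "delegation token file: ${HADOOP_TOKEN_FILE_LOCATION}"
    else
        echo "delegation token file: none"
    fi
    echo "ticket cache: ${KRB5CCNAME:-default}"
}

describe_credentials

# Also list the tickets when the Kerberos client tools are installed.
if command -v klist >/dev/null 2>&1; then
    klist || true
fi
```

If a token file is reported even though klist shows a ticket for "B", that would be consistent with hadoop fs still acting as the submitter "A".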

We've found it possible to become another kerb principal once an oozie workflow is launched. We have to run a shell action that then runs java with a custom -Djava.security.auth.login.config=custom_jaas.conf that will then provide a jvm kinit'ed as someone else. This is along the lines of Samson's (3), although this kinit can be even a completely different realm.

Related

Oozie credentials kerberos

Oozie credentials for a Kerberos cluster work well with Hive and HBase.
Consider an example where I have an Oozie shell action that reads HDFS files. Oozie credentials do not help in such a case.
In an Oozie workflow that combines different actions, it gets very awkward to use credentials in some places and kinit (using a keytab and principal) in others.
Is there an alternative way to access HDFS with Oozie credentials?

How to login to real time company hadoop cluster?

I am new to the Hadoop environment. I joined a company and was given KT and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?
Not really clear what you're logging into. You should ask your coworkers for advice.
However, it sounds like you have a Kerberos keytab, in which case you would run
kinit -kt key.kt <principal>
The principal is site-specific (klist -kt key.kt lists the principals stored in the keytab), and there might be additional arguments necessary; only the cluster administrators can answer what those need to be.
To verify your ticket is active, run
klist
Usually you will have edge nodes, i.e. client nodes, with all the clients installed, such as
HDFS Client
Sqoop Client
Hive Client etc.
You need to get the hostnames/IP addresses of these machines. If you are using Windows, you can use PuTTY to log in to these nodes, using either a username and password or the .ppk file provided for those nodes.
In my view, any company will have an infrastructure team that integrates LDAP with the Hadoop cluster and grants users access by adding their IDs to the appropriate groups/roles.
And by the way, are you using Cloudera/MapR/Hortonworks? Every distribution has its own way and best practices.
I am assuming KT means knowledge transfer. Also the project document is about the application and not the Hadoop Cluster/Infra.
I would follow this procedure:
1) Find out the name of the edge node (also called client node) from your team or your TechOps. Also find out whether you will be using some generic Linux user (say "develteam") or whether you need to get a user created on the edge node.
2) Assuming you are accessing from Windows, install an SSH client (like PuTTY).
3) Log in to the edge node using the credentials (for the generic user or the specific user, as in #1).
4) Run the following command to check that you are on the Hadoop cluster:
> hadoop version
5) Try the Hive shell by typing:
> hive
6) Try running the following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop config for that cluster. You will most probably not have write permissions, but maybe you can cat the following files to get an idea of the cluster:
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
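
As a small companion to the last step, the check for those four files can be scripted. This is a sketch under assumptions: the fallback directory /etc/hadoop/conf is only a common convention and may differ on your cluster, HADOOP_CONF_DIR is honored when set, and list_hadoop_configs is a made-up helper name.

```shell
#!/bin/sh
# list_hadoop_configs [DIR]
# Report which of the usual Hadoop config files are readable in DIR.
# DIR defaults to $HADOOP_CONF_DIR, then to /etc/hadoop/conf (assumed path).
list_hadoop_configs() (
    conf_dir="${1:-${HADOOP_CONF_DIR:-/etc/hadoop/conf}}"
    for name in core-site.xml hdfs-site.xml yarn-site.xml mapred-site.xml; do
        if [ -r "${conf_dir}/${name}" ]; then
            echo "readable: ${conf_dir}/${name}"
        else
            echo "not readable: ${conf_dir}/${name}"
        fi
    done
)

list_hadoop_configs
```

Any file reported as readable can then be inspected with cat or less.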

As a Hadoop Regular User, Is There a Way to See Details about Running Jobs?

I do not have access to any CLI on any of the Hadoop nodes, but I have access to the cluster via Hue and Jupyter. The engineering team has also configured the Hadoop UI that shows New, Running, Submitted, Finished, etc. applications. However, it appears all Hive jobs have a generic name, for instance, something like this:
HIVE-f23fa1a1-4444-4ab2-1c44-12345a123456
or similar, and when I click on the application_id, I get a "Failed to read the attempts of the application" error (even for my own jobs). Similarly, Spark jobs, which you can normally name using setAppName, are all generically named "Spark-something" because the Spark context is already initialized when Jupyter is brought up on an edge node (i.e. I can't set a name because one already exists).
Is there a way for an unprivileged Hadoop user to see what job is actually running (i.e. the Hive query or the Spark / Hadoop command), without having some sort of CLI privilege?
I have tried using a few urls that I suspect have job information in them, for instance:
http://cluster_master:<portnum>/history/application_1234123412341234_12345/jobs/ or
http://cluster_master:<portnum>/jobs/application_1234123412341234_12345/
but neither attempt returns any details about the job itself (even things I named myself within the Hive / Spark context using setAppName).
Please let me know if there's a better way to ask this question. I am relatively new to Hadoop/Spark. All the reference docs and SO answers I've found assume CLI or privileged access and I can't find any documentation in either Spark or Hadoop that applies to this problem.

How does impersonation in hadoop work

I am trying to understand how impersonation works in a Hadoop environment.
I found a few resources like:
About doAs and proxy users: hadoop-kerberos-guide
and about tokens: delegation-tokens.
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
1) The user does a kinit and runs an end-user-facing program like beeline, spark-submit, etc.
2) The program is app-specific and gets service tickets for HDFS.
3) It then gets tokens for all the services it may need during job execution and saves the tokens in an HDFS directory.
4) The program then connects to a job executor (using a service ticket for the job executor??), e.g. YARN, with the job info and the token path.
5) The job executor gets the token and initializes the UGI; all communication with HDFS is done using the token, and Kerberos tickets are not used.
Is the above high-level understanding correct? (I have more follow-up queries.)
Can the token mechanism be skipped, using only Kerberos at each layer? If so, any resources will help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop (tokens) but supports Kerberos.
Thanks & regards
-Sri

Spark submit to yarn as a another user

Is it possible to submit a spark job to a yarn cluster and choose, either with the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script containing the user.
PS: is it still possible if the cluster has a Kerberos configuration (and the script has a keytab)?
For a non-kerberized cluster: export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
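
The set-then-unset pattern can also be scoped with a subshell, so that nothing has to be undone afterwards. A minimal sketch for the non-kerberized case; run_as_hadoop_user is a made-up helper name, and the echo command is only a stand-in for the real spark-submit call.

```shell
#!/bin/sh
# run_as_hadoop_user USER COMMAND [ARGS...]
# The function body is a subshell, so the exported HADOOP_USER_NAME
# never reaches the calling shell and needs no explicit unset.
run_as_hadoop_user() (
    HADOOP_USER_NAME="$1"
    export HADOOP_USER_NAME
    shift
    "$@"
)

# Stand-in command; replace it with your real spark-submit invocation.
run_as_hadoop_user zorro sh -c 'echo "submitting as $HADOOP_USER_NAME"'
```

After the call returns, the rest of the script still sees whatever HADOOP_USER_NAME it had before (usually none).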
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (that probably depend on your default ticket) would be something along these lines:
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM
spark-submit ...........
kdestroy
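
The same idea of a private, throwaway ticket cache can be wrapped in a helper so the cache is removed even when the job fails. This is a sketch, not a tested recipe for any particular cluster: with_private_ccache is a made-up name, mktemp picks the cache file name instead of the PID, and the cache file is simply deleted rather than passed to kdestroy.

```shell
#!/bin/sh
# with_private_ccache COMMAND [ARGS...]
# Run COMMAND with KRB5CCNAME pointing at a fresh temporary cache file,
# then delete that file. The subshell body keeps the caller's own
# KRB5CCNAME (and default ticket) untouched.
with_private_ccache() (
    cache_file=$(mktemp "${TMPDIR:-/tmp}/krb5cc_temp_XXXXXX") || exit 1
    KRB5CCNAME="FILE:${cache_file}"
    export KRB5CCNAME
    status=0
    "$@" || status=$?
    rm -f "$cache_file"
    exit "$status"
)

# Stand-in command; a real call would run kinit -kt and spark-submit here.
with_private_ccache sh -c 'echo "private cache: $KRB5CCNAME"'
```

A real invocation would pass something like sh -c 'kinit -kt <keytab> <principal> && spark-submit ...' as the command, with the keytab and principal from the lines above.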
For a non-kerberized cluster you can add a Spark conf as:
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=<user_name>
Another (much safer) approach is to use proxy authentication: basically you create a service account and then allow it to impersonate other users.
$ spark-submit --help 2>&1 | grep proxy
--proxy-user NAME User to impersonate when submitting the application.
This assumes a Kerberized / secured cluster.
I said it's much safer because you don't need to store (and manage) the keytabs of all the users you will have to impersonate.
To enable impersonation, there are several settings you need to enable on the Hadoop side to tell which account(s) can impersonate which users or groups, and from which servers. Let's say you have created the svc_spark_prd service account / user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names of the servers which are allowed to submit impersonated Spark applications. * (any host) is allowed but not recommended.
Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list the users or groups that svc_spark_prd is allowed to impersonate. * (any user/group) is allowed but not recommended.
Also, check out documentation on proxy authentication.
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
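
For illustration, the two proxy-user settings described above could look like the following core-site.xml entries. The snippet is a shell helper that merely prints them; proxyuser_properties is a made-up name, and the host and group values are invented placeholders, not recommendations.

```shell
#!/bin/sh
# proxyuser_properties ACCOUNT HOST_LIST GROUP_LIST
# Print core-site.xml <property> entries that let ACCOUNT impersonate
# members of GROUP_LIST when connecting from HOST_LIST.
proxyuser_properties() {
    printf '<property>\n  <name>hadoop.proxyuser.%s.hosts</name>\n  <value>%s</value>\n</property>\n' "$1" "$2"
    printf '<property>\n  <name>hadoop.proxyuser.%s.groups</name>\n  <value>%s</value>\n</property>\n' "$1" "$3"
}

# Hypothetical edge-node hostnames and group name.
proxyuser_properties svc_spark_prd "edge01.example.com,edge02.example.com" "analysts"
```

With such entries in place on the cluster, the service account would authenticate as itself and name the end user through the --proxy-user option shown above.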
If your user exists, you can still launch your spark-submit with
su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you do a kinit with this user it should be fine.
If you can't use su because you don't want to enter the password, see this Stack Overflow answer:
how to run script as another user without password
