How does impersonation in hadoop work - hadoop

I am trying to understand how impersonation works in a Hadoop environment.
I found a few resources:
About doAs and proxy users: hadoop-kerberos-guide
About tokens: delegation-tokens
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
1) The user does a kinit and executes an end-user-facing program such as beeline or spark-submit.
2) The program is application-specific and gets service tickets for HDFS.
3) It then gets tokens for all the services it may need during job execution and saves the tokens in an HDFS directory.
4) The program then connects to a job executor, e.g. YARN (using a service ticket for the job executor??), passing the job info and the token path.
5) The job executor gets the token and initializes UGI; all communication with HDFS is done using the token, and Kerberos tickets are not used.
Is the above high-level understanding correct? (I have more follow-up queries.)
Can the token mechanism be skipped so that only Kerberos is used at each layer? If so, any resources will help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop (tokens) but supports Kerberos.
Thanks & regards
-Sri

Related

kerberos ticket and delegation token usage in Hadoop Oozie shell action

I am new to Hadoop and trying to understand why my Oozie shell action does not pick up the new ticket even after I do kinit. Here is my scenario.
I log in using my ID "A" and have a Kerberos ticket for it. I submit an Oozie workflow with a shell action using my ID.
Inside oozie shell action I do another kinit to obtain the ticket for ID "B".
Only this id "B" has access to some HDFS file. kinit is working fine since klist showed the ticket for id "B". Now when I read the HDFS file that only B has access to, I get permission denied error saying "A" does not have permission to access this file.
But when I do the same thing from linux cli, outside oozie, after I do kinit and take ticket for "B", I am able to read the HDFS file as "B".
But the same step is not working inside oozie shell action and hadoop fs commands always seem to work as the user that submitted the oozie workflow rather than the user for which kerberos ticket is present.
Can someone please explain why this is happening? I am not able to understand this.
In the same shell action, though the hadoop fs command failed to switch to user "B", hbase shell works as user "B". Just for testing, I created an HBase table that only "A" has access to, and added an hbase shell get command on this table. If I do kinit -kt for user "B" and get its ticket, this fails too, saying "B" does not have access to this table. So I think HBase is taking the new ticket instead of the delegation token of the user submitting the Oozie workflow. When I don't do kinit -kt inside the shell action, the hbase command succeeds.
If I do kinit, I cannot even run Hive queries: they fail saying "A" does not have execute access to some directories like /tmp/B/ that only "B" has access to. So I could not tell whether Hive is using the delegation token created when the Oozie workflow is submitted or the new ticket created for the new user.
Can someone please help me understand the above scenario? Which Hadoop services take the new ticket for authentication and which commands take the delegation token (like hadoop fs commands)? Is this how it is supposed to work, or am I doing something wrong?
I just don't understand why the same hadoop fs command worked as a different user outside Oozie but does not work inside an Oozie shell action, even after kinit.
When does this delegation token actually get created? Is it created only when the Oozie workflow is submitted, or also when I issue hadoop fs commands?
Thank you!
In theory -- Oozie automatically transfers the credentials of the submitter (i.e. A) to the YARN containers running the job. You don't have to care about kinit, because, in fact, it's too late for that.
You are not supposed to impersonate another user inside of an Oozie job; that would defeat the purpose of strict Kerberos authentication.
In practice it's more tricky -- (1) the core Hadoop services (HDFS, YARN) check the Kerberos ticket just once, then they create a "delegation token" that is shared among all nodes and all services.
(2) the oozie service user has special privileges: it can do a kind of Hadoop "sudo", so that it connects to YARN as oozie but YARN creates the "delegation token" for the job submitter (i.e. A), and that's it. You can't alter that token.
(3) well, actually you can use an alternate token, but only with some custom Java code that explicitly creates a UserGroupInformation object for an alternate user. The Hadoop command-line interfaces don't do that.
(4) what about non-core Hadoop, i.e. HBase or Hive Metastore, or non-Hadoop stuff, i.e. ZooKeeper? The delegation tokens above do not cover them. Either you explicitly manage the UserGroupInformation in Java code, or the default Kerberos ticket is used at connect time.
That's why your HBase shell worked, and if you had used Beeline (the JDBC thin client) instead of Hive (the legacy fat client) it would probably have worked too.
(5) Oozie tries to fill that gap with specific <credentials> options for Hive, Beeline ("Hive2" action), HBase, etc; I'm not sure how that works but it has to imply a non-default Kerberos ticket cache, local to your job containers.
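The Hadoop "sudo" in point (2) is what the Hadoop documentation calls a proxy user. As a rough illustration only, the toy Python model below mimics the kind of check a service makes before letting a trusted account such as oozie act on behalf of the job submitter. The rule layout loosely follows the hadoop.proxyuser.<name>.hosts and hadoop.proxyuser.<name>.groups settings; the function name, host names, group names and group lookup are all invented for this sketch, and real Hadoop performs the check in its own Java code.

```python
# Toy model of a proxy-user ("impersonation") check. Not Hadoop code.
# PROXY_RULES mirrors the idea of hadoop.proxyuser.<name>.hosts/.groups;
# every name and value below is hypothetical.
PROXY_RULES = {
    "oozie": {
        "hosts": {"oozie-host.example.com"},  # where the trusted user may connect from
        "groups": {"analysts"},               # whose members it may impersonate
    },
}

USER_GROUPS = {"A": {"analysts"}, "B": {"finance"}}  # hypothetical group lookup


def may_impersonate(real_user, effective_user, remote_host):
    """Return True if real_user may act as effective_user from remote_host."""
    rule = PROXY_RULES.get(real_user)
    if rule is None:
        return False  # not configured as a proxy user at all
    host_ok = "*" in rule["hosts"] or remote_host in rule["hosts"]
    groups = USER_GROUPS.get(effective_user, set())
    group_ok = "*" in rule["groups"] or bool(groups & rule["groups"])
    return host_ok and group_ok


print(may_impersonate("oozie", "A", "oozie-host.example.com"))  # allowed by the rule
print(may_impersonate("A", "B", "oozie-host.example.com"))      # A is not a proxy user
```

In this model an ordinary user such as A can never become B, which matches the behaviour described in the question: only the configured service account gets to choose the effective user.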
We've found it possible to become another kerb principal once an oozie workflow is launched. We have to run a shell action that then runs java with a custom -Djava.security.auth.login.config=custom_jaas.conf that will then provide a jvm kinit'ed as someone else. This is along the lines of Samson's (3), although this kinit can be even a completely different realm.

How to successfully make a hive jdbc call inside a mapper in MR job where the cluster is secured by Kerberos

I am writing a utility that is a map reduce job where the reducer makes calls to various databases and Hive is one of them.
Our cluster is kerberized.
I am doing kinit before kicking off the MR job, but when the reducer runs, it fails with the error "No valid credentials provided (Mechanism level: Failed to find any Kerberos tgt)".
This indicates that it doesn't have a valid ticket. I tried to get a delegation token for the Hive service in the MR driver, but it failed because the Hive service account is not allowed to impersonate my user.
I don't want to copy the keytab file onto all worker nodes. I want to either make the delegation token work or pass the credentials from the MR driver to the mappers and reducers.
Can anyone suggest another method to get a valid ticket in the mappers and reducers to make Hive JDBC calls?
Thanks,
Arun.

Flume-ng hdfs security

I'm new to Hadoop and Flume NG and I need some help.
I don't understand how HDFS security is implemented.
Here are lines from configuration from Flume User Guide:
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
Does it mean that anyone who knows my hdfs path can write any data to my hdfs?
The question is from some time ago, but I'll try to answer it for any other developer dealing with Flume and HDFS security.
Flume's HDFS sink just needs the endpoint where the data is going to be persisted. Whether such an endpoint is secured or not depends entirely on Hadoop, not on Flume.
The Hadoop ecosystem has several tools and systems for implementing security; focusing on the native ones, we are talking about authentication and authorization.
Authentication is based on Kerberos and, as with any other auth mechanism, it is the process of determining whether someone or something is, in fact, who or what it declares itself to be. So with authentication enabled it is not enough to know an HDFS user name; you have to demonstrate that you own that user by first authenticating against Kerberos and obtaining a ticket. Authentication may be password-based or keytab-based; you can see keytabs as "certificate files" containing the authentication keys.
Authorization can be implemented at the file system level, by deciding which permissions each folder or file within HDFS has. Thus, if a certain file has only 600 permissions, then only its owner will be able to read or write it. Other authorization mechanisms like Hadoop ACLs can be used.
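To make the 600 example concrete, here is a small Python sketch of the classic owner/group/other permission check. It is a simplified model (no superuser, no ACLs), the function name can_access and the user and group names are made up for illustration, and HDFS performs its own equivalent check inside the NameNode.

```python
import stat


def can_access(mode, owner, group, user, user_groups, want):
    """Simplified POSIX-style check; want is 'r', 'w' or 'x'."""
    bits = {
        "r": (stat.S_IRUSR, stat.S_IRGRP, stat.S_IROTH),
        "w": (stat.S_IWUSR, stat.S_IWGRP, stat.S_IWOTH),
        "x": (stat.S_IXUSR, stat.S_IXGRP, stat.S_IXOTH),
    }[want]
    if user == owner:
        return bool(mode & bits[0])  # only the owner bits apply
    if group in user_groups:
        return bool(mode & bits[1])  # only the group bits apply
    return bool(mode & bits[2])      # everyone else


mode = 0o600
print(stat.filemode(stat.S_IFREG | mode))                               # -rw-------
print(can_access(mode, "flume", "hadoop", "flume", {"hadoop"}, "w"))    # True
print(can_access(mode, "flume", "hadoop", "mallory", {"users"}, "w"))   # False
```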
That being said, if you have a look at the Flume HDFS sink, you'll see that there are a couple of parameters about Kerberos:
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
In Kerberos terminology, a principal is a unique identity to which Kerberos can assign tickets. Thus, for each user enabled in HDFS you will need a principal registered in Kerberos. The keytab, as previously said, is a container for the authentication keys a certain principal owns.
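Principal names usually have the shape primary/instance@REALM, where the instance part is optional. The Python sketch below splits such a name into its parts; the example principal and realm are invented, and the parser ignores the escaping rules that real Kerberos libraries handle.

```python
from typing import NamedTuple, Optional


class Principal(NamedTuple):
    primary: str
    instance: Optional[str]
    realm: str


def parse_principal(name: str) -> Principal:
    """Split 'primary[/instance]@REALM'; simplified, no escape handling."""
    local, sep, realm = name.rpartition("@")
    if not sep or not local or not realm:
        raise ValueError(f"not a full principal name: {name!r}")
    primary, _, instance = local.partition("/")
    return Principal(primary, instance or None, realm)


# Hypothetical principal for a Flume agent host.
print(parse_principal("flume/agent1.example.com@EXAMPLE.COM"))
```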
Thus, if you want to secure your HDFS then install Kerberos, create principals and keytabs for each enabled user and configure the HDFS sink properly. In addition, change the permissions appropriately in your HDFS.

Regarding Hadoop Security via Kerberos

I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff (https://www.youtube.com/watch?v=KD2Q-2ToloE)
1) The Apache doc uses the word "Token" whereas general documentation on the internet uses the term "Ticket".
Are Token and Ticket the same?
2) The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks.
This makes it possible for an unauthorized client to read a data block as
long as she can supply its block ID. It’s also possible for anyone to write
arbitrary data blocks to DataNodes."
My thoughts on this:-
I can fetch the block Id from file path using the command:-
hadoop@Studio-1555:/opt/hadoop/hadoop-1.0.2/bin$ ./hadoop fsck /hadoop/mapred/system/jobtracker.info -files -blocks
FSCK started by hadoop from /127.0.0.1 for path /hadoop/mapred/system/jobtracker.info at Mon Jul 09 06:57:14 EDT 2012
/hadoop/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
0. blk_-9148080207111019586_1001 len=4 repl=1
As I am authorized to access the file jobtracker.info, I was able to find its block ID using the above command.
My idea is to add some offset to this block ID and write to that DataNode.
How can I explicitly specify the block ID while writing a file to HDFS? (What is the command?)
Is there any other way to write arbitrary data blocks to DataNodes?
Please tell me if my approach is wrong.
Are Token and Ticket same ?
No. Tickets are issued by Kerberos and then servers in Hadoop (NameNode or JobTracker) issue tokens to provide authentication within the Hadoop cluster. Hadoop does not rely on Kerberos to authenticate running tasks, for instance, but uses its own tokens that were issued based on the Kerberos tickets.
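The ticket/token split can be pictured with a toy Python model: the client proves its identity once (standing in for the Kerberos exchange), the server hands back a token whose password is an HMAC over the token's identifier, and later requests are checked against that HMAC with no KDC involved. This is only a sketch under stated assumptions: the class, method and field names are invented, HMAC-SHA256 is this sketch's choice, and real Hadoop tokens carry more fields and have renewal rules that are left out here.

```python
import hashlib
import hmac
import json
import os
import time


class ToyNameNode:
    """Issues and verifies HMAC-signed tokens. Not Hadoop code."""

    def __init__(self):
        self._master_key = os.urandom(32)  # known only to the issuing service

    def _sign(self, identifier: bytes) -> bytes:
        return hmac.new(self._master_key, identifier, hashlib.sha256).digest()

    def issue_token(self, kerberos_authenticated_user, renewer, lifetime_s=3600):
        # Assumed to be reached only after a successful Kerberos handshake.
        identifier = json.dumps({
            "owner": kerberos_authenticated_user,
            "renewer": renewer,
            "expires": time.time() + lifetime_s,
        }, sort_keys=True).encode()
        return identifier, self._sign(identifier)

    def authenticate(self, identifier: bytes, password: bytes):
        """Return the token owner, or None if the token is forged or expired."""
        if not hmac.compare_digest(self._sign(identifier), password):
            return None
        fields = json.loads(identifier)
        if fields["expires"] < time.time():
            return None
        return fields["owner"]


nn = ToyNameNode()
ident, pw = nn.issue_token("A", renewer="jobtracker")
print(nn.authenticate(ident, pw))                          # A (no ticket needed)
print(nn.authenticate(ident.replace(b'"A"', b'"B"'), pw))  # None (tampered)
```

The point of the design is that a running task only needs to carry the identifier and password; it never needs a Kerberos ticket or a keytab of its own.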
The Apache doc also says "DataNodes do not enforce any access control on accesses to its data blocks."
I'm guessing you're taking that from the JIRA where access control was provided (https://issues.apache.org/jira/browse/HADOOP-4359) via BlockAccessTokens. Assuming this is turned on - which it should be in a secure cluster - one cannot access a block on a datanode without such a token, which is issued by the NameNode after authentication and authorization via Kerberos and HDFS' own file system permissions.
How can I access the Datanode and write data arbitrarily ?
I am not sure what you mean here. Do you mean when the user does not have permission? As Jacob mentioned, you will not get a valid BlockAccessToken unless the user has permission to access the data based on the file system permissions, assuming that you have a secure Hadoop cluster.
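Why "adding some offset to the block ID" does not help can be sketched with a toy Python model of a block access token. It assumes, as described above, that the NameNode issues the token only after its permission check passes and that DataNodes can verify it using a key they share with the NameNode. All function names, the key value and the token layout are hypothetical; key distribution and rotation are omitted.

```python
import hashlib
import hmac
import time

SHARED_KEY = b"toy-key-shared-by-namenode-and-datanodes"  # hypothetical


def _mac(user, block_id, mode, expires):
    msg = f"{user}|{block_id}|{mode}|{expires}".encode()
    return hmac.new(SHARED_KEY, msg, hashlib.sha256).hexdigest()


def namenode_issue(user, block_id, mode, allowed_users, ttl_s=600):
    """Issue a token only if the (toy) permission check passes."""
    if user not in allowed_users:
        raise PermissionError(f"{user} may not access block {block_id}")
    expires = int(time.time()) + ttl_s
    return {"user": user, "block_id": block_id, "mode": mode,
            "expires": expires, "mac": _mac(user, block_id, mode, expires)}


def datanode_accepts(token, block_id, mode):
    """DataNode side: recompute the MAC, then compare the requested block and mode."""
    expected = _mac(token["user"], token["block_id"], token["mode"], token["expires"])
    return (hmac.compare_digest(expected, token["mac"])
            and token["block_id"] == block_id
            and token["mode"] == mode
            and token["expires"] > time.time())


blk = -9148080207111019586  # block ID from the fsck output in the question
token = namenode_issue("hadoop", blk, "READ", allowed_users={"hadoop"})
print(datanode_accepts(token, blk, "READ"))       # True
print(datanode_accepts(token, blk + 1, "READ"))   # False: token names another block
forged = dict(token, block_id=blk + 1)            # client edits the token itself
print(datanode_accepts(forged, blk + 1, "READ"))  # False: MAC no longer matches
```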

Hadoop Security

I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff ( https://www.youtube.com/watch?v=KD2Q-2ToloE)
After learning from these resources I have come to a conclusion which I am representing through a diagram.
Scenario: a user logs on to his computer, gets authenticated by Kerberos, and submits a MapReduce job.
(Please read the description of the diagram; it hardly needs 5 minutes of your time.)
I would like to explain the diagram and ask questions related to a few steps (in bold).
Numbers on a yellow background represent the entire flow (numbers 1 to 19).
DT (red background) represents the Delegation Token.
BAT (green background) represents the Block Access Token.
JT (brown background) represents the Job Token.
Steps 1, 2, 3 and 4 represent:
Request for a TGT (Ticket Granting Ticket).
Request for a service ticket for the NameNode.
Question 1) Where should the KDC be located? Can it be on the machine where my NameNode or JobTracker is running?
Steps 5, 6, 7, 8 and 9 represent:
Show the service ticket to the NameNode and get an acknowledgement.
The NameNode issues a Delegation Token (red).
The user names the token renewer (in this case the JobTracker).
Question 2) The user submits this Delegation Token along with the job to the JobTracker. Will the Delegation Token be shared with the TaskTrackers?
Steps 10, 11, 12, 13 and 14 represent:
Ask the KDC for a service ticket for the JobTracker and receive it.
Show this ticket to the JobTracker and get an ACK from the JobTracker.
Submit the job and the Delegation Token to the JobTracker.
Steps 15, 16 and 17 represent:
Generate a Block Access Token and spread it across all DataNodes.
Send the block ID and Block Access Token to the JobTracker, which passes them on to the TaskTracker.
Question 3) Who asks the NameNode for the Block Access Token and block ID: the JobTracker or the TaskTracker?
Sorry, I missed number 18 by mistake.
Step 19 represents:
The JobTracker generates a Job Token (brown) and passes it to the TaskTrackers.
Question 4) Can I conclude that there will be one Delegation Token per user, which is distributed throughout the cluster, and one Job Token per job? So a user will have only one Delegation Token and many Job Tokens (equal to the number of jobs he has submitted).
Please tell me if I missed something or was wrong at some point in my explanation.
Steps to follow to make sure Hadoop is secure
Install Kerberos on a server accessible to all cluster nodes.
yum install krb5-server
yum install krb5-workstation
yum install krb5-libs
Modify the KDC server configuration file to set up the ACL file and admin keytab file for the host.
/var/kerberos/krb5kdc/kdc.conf
Modify the configuration file /etc/krb5.conf to set up the KDC host and admin server.
Create the database on the KDC host
$ kdb5_util create -r host_name -s
Add administrators to the ACL file
vi /var/kerberos/krb5kdc/kadm5.acl
Add the admin principal 'admin/admin@host_name' to that file
Add the admin principal
$ kadmin.local -q "addprinc admin/admin@host_name"
Install Kerberos clients on all Cluster Nodes
yum install krb5-workstation
Copy krb5.conf to all cluster nodes
Make sure to enable Secure mode in Hadoop by setting required configurations
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
Verify :
Log in as a normal user to the cluster gateway or a node where user keytabs are deployed
Run "kinit -k -t /location/of/keytab_file username@host_name"
Then run HDFS commands or MapReduce jobs to verify the cluster is secured
These are the basic steps to make sure Kerberos is enabled in your cluster.
Hadoop security mostly uses Kerberos for authentication and Sentry or Ranger for authorization.
Gateways such as Knox are used for perimeter security.
http://commandstech.com/latest-hadoop-admin-interview-questions/

Resources