I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through basic Kerberos stuff (https://www.youtube.com/watch?v=KD2Q-2ToloE)
After learning from these resources I have come to a conclusion which I am representing through a diagram.
Scenario: a user logs on to his computer, is authenticated by Kerberos, and submits a MapReduce job.
(Please read the description of the diagram; it hardly needs 5 minutes of your time.)
I would like to explain the diagram and ask questions related to a few steps (in bold).
Numbers on a yellow background represent the entire flow (numbers 1 to 19)
DT (with a red background) represents a Delegation Token
BAT (with a green background) represents a Block Access Token
JT (with a brown background) represents a Job Token
Steps 1, 2, 3 and 4 represent:
Request a TGT (Ticket Granting Ticket).
Request a service ticket for the NameNode.
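On the client side these first steps roughly correspond to a kinit; the Hadoop client then requests service tickets transparently when it contacts the NameNode. A minimal sketch (the principal and realm are placeholders, not from the diagram):
$ kinit hdfsuser@EXAMPLE.COM    # steps 1 and 2: obtain a TGT from the KDC
$ klist                         # list cached tickets; the NameNode service ticket shows up here once it has been obtained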
Question 1) Where should the KDC be located? Can it be on the machine where my NameNode or JobTracker is present?
Steps 5, 6, 7, 8 and 9 represent:
Show the service ticket to the NameNode and get an acknowledgement.
The NameNode will issue a Delegation Token (red).
The user will tell it the token renewer (in this case the JobTracker).
Question 2) The user submits this Delegation Token along with the job to the JobTracker. Will the Delegation Token be shared with the TaskTracker?
Steps 10, 11, 12, 13 and 14 represent:
Ask for a service ticket for the JobTracker and get the service ticket from the KDC.
Show this ticket to the JobTracker and get an ACK from the JobTracker.
Submit the job + Delegation Token to the JobTracker.
Steps 15, 16 and 17 represent:
Generate Block Access Tokens and spread them across all DataNodes.
Send the block ID and Block Access Token to the JobTracker, and the JobTracker will pass them on to the TaskTracker.
Question 3) Who will ask the NameNode for the Block Access Token and block ID,
the JobTracker or the TaskTracker?
Sorry, I missed number 18 by mistake.
Step 19 represents:
The JobTracker generates a Job Token (brown) and passes it to the TaskTrackers.
Question 4) Can I conclude that there will be one Delegation Token per user, distributed throughout the cluster,
and one Job Token per job? So a user will have only one Delegation Token and many Job Tokens (equal to the number of jobs submitted by him).
Please tell me if I missed something or was wrong at some point in my explanation.
Steps to follow to make sure Hadoop is secure:
Install Kerberos in any server accessible to all cluster nodes.
yum install krb5-server
yum install krb5-workstation
yum install krb5-libs
Modify the configuration file on the KDC server to set up the ACL file and admin keytab file for the host:
/var/kerberos/krb5kdc/kdc.conf
Modify the configuration file /etc/krb5.conf to set up the KDC host and admin server.
Create the database on the KDC host:
$ kdb5_util create -r REALM_NAME -s
Add administrators to the ACL file:
vi /var/kerberos/krb5kdc/kadm5.acl (the acl_file path configured in kdc.conf)
Add the admin principal 'admin/admin@REALM_NAME' in that file.
Add the admin principal:
$ kadmin.local -q "addprinc admin/admin@REALM_NAME"
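Beyond the admin principal, a secure cluster also needs service principals and keytabs for the Hadoop daemons (the SecureMode documentation linked below describes them). A rough sketch with kadmin.local, where the realm, hostnames and keytab paths are placeholders only:
$ systemctl start krb5kdc kadmin     # make sure the KDC and admin services are running (on systemd-based systems)
$ kadmin.local -q "addprinc -randkey nn/namenode.example.com@EXAMPLE.COM"
$ kadmin.local -q "addprinc -randkey dn/datanode1.example.com@EXAMPLE.COM"
$ kadmin.local -q "ktadd -k /etc/security/keytabs/nn.service.keytab nn/namenode.example.com@EXAMPLE.COM"
$ kadmin.local -q "ktadd -k /etc/security/keytabs/dn.service.keytab dn/datanode1.example.com@EXAMPLE.COM"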
Install Kerberos clients on all Cluster Nodes
yum install krb5-workstation
Copy krb5.conf to all cluster nodes
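One way to push the file, assuming password-less SSH as root and a hypothetical cluster_nodes.txt listing the node hostnames:
$ for host in $(cat cluster_nodes.txt); do scp /etc/krb5.conf root@"$host":/etc/krb5.conf; done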
Make sure to enable Secure mode in Hadoop by setting required configurations
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
Verify:
Log in as a normal user to the cluster gateway or a node where user keytabs are deployed.
Run "kinit -kt /location/of/keytab_file username@REALM_NAME"
and run HDFS commands or MapReduce jobs to verify that the cluster is secured.
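For example (the keytab path, principal and example jar location are placeholders and vary by distribution):
$ kinit -kt /etc/security/keytabs/alice.keytab alice@EXAMPLE.COM
$ klist
$ hdfs getconf -confKey hadoop.security.authentication   # should print "kerberos" on a secured cluster
$ hdfs dfs -ls /
$ yarn jar /usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar pi 2 10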
These are the basic steps to make sure Kerberos is enabled in your cluster.
Hadoop security mostly uses Kerberos for authentication and Sentry for authorization.
Gateways like Knox and tools like Ranger are also used for security.
I am new to the Hadoop environment. I joined a company and was given a KT (knowledge transfer) and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?
Not really clear what you're logging into. You should ask your coworkers for advice.
However, it sounds like you have a Kerberos keytab, and you would run
kinit -kt key.kt
There might be additional arguments necessary there, such as what's referred to as a principal, but only the cluster administrators can answer what that needs to be.
To verify your ticket is active
klist
Usually you will have edge nodes, i.e. client nodes, installed with all the clients like:
HDFS Client
Sqoop Client
Hive Client etc.
You need to get the hostnames/IP addresses of these machines. If you are using Windows you can use PuTTY to log in to these nodes, either with a username and password or with the .ppk file provided for those nodes.
In my view, any company will have an infrastructure team which configures LDAP with the Hadoop cluster and grants access to users by adding their IDs to the group roles.
And by the way, are you using Cloudera/MapR/Hortonworks? Every distribution has its own way and best practices.
I am assuming KT means knowledge transfer. Also, the project document is about the application and not the Hadoop cluster/infra.
I would follow the following procedure:
1) Find out the name of the edge node (also called a client node) from your team or your TechOps. Also find out whether you will be using some generic Linux user (say "develteam") or whether you will have to get a user created on the edge node.
2) Assuming you are accessing from Windows, install an SSH client (like PuTTY).
3) Log in to the edge node using the credentials (for the generic or specific user as in #1).
4) Run the following command to check that you are on the Hadoop cluster:
> hadoop version
5) Try the hive shell by typing:
> hive
6) Try running the following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop config for that cluster. You will most probably not have write permissions, but maybe you can cat the following files to get an idea of the cluster (see the example after this list):
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
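For instance, assuming the client configuration lives under /etc/hadoop/conf (the exact path varies by distribution):
$ cat /etc/hadoop/conf/core-site.xml
$ hdfs getconf -confKey fs.defaultFS                     # the cluster's default filesystem URI
$ hdfs getconf -confKey hadoop.security.authentication   # "kerberos" if the cluster is secured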
I am trying to understand how impersonation works in the Hadoop environment.
I found a few resources like:
About doAs and proxy users- hadoop-kerberos-guide
and about tokens- delegation-tokens.
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
The user does a kinit and executes an end-user-facing program like beeline, spark-submit, etc.
The program is app specific and gets service tickets for HDFS.
It then gets tokens for all the services it may need during the job execution and saves the tokens in an HDFS directory.
The program then connects to a job executor (using a service ticket for the job executor??), e.g. YARN, with the job info and the token path.
The job executor gets the token and initializes UGI, and all communication with HDFS is done using the token; Kerberos tickets are not used.
Is the above high-level understanding correct? (I have more follow-up queries.)
Can the token mechanism be skipped so that only Kerberos is used at each layer? If so, any resources would help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop (tokens) but supports Kerberos.
We are using a Kerberized CDH cluster. While adding a user to the cluster, we used to add the user only to the gateway/edge nodes, as in any Hadoop distro cluster. But with the newly added user IDs, we are not able to execute MapReduce/YARN jobs; they throw a "user not found" exception.
When I researched this, I came across a link https://hadoop.apache.org/docs/stable/hadoop-yarn/hadoop-yarn-site/SecureContainer.html , which says that to execute YARN jobs in a secured cluster, we might need to have the corresponding user on all the nodes, since the secure containers execute under the credentials of the job user.
So we added the corresponding user ID to all the nodes, and the jobs are getting executed.
If this is the case and the cluster has around 100+ nodes, user provisioning for each user ID would become a tedious job.
Can anyone suggest a more effective way, if you have come across the same scenario in your project implementation?
There are several approaches ordered by difficulty (from simple to painful).
One is to have a job-runner user that everyone uses to run jobs.
Another one is to use a configuration management tool (Chef, Puppet) to sync /etc/passwd and /etc/group on your cluster at regular intervals (1 hr - 1 day), or use a cron job to do this (a minimal sketch follows after this list).
Otherwise you can buy or use open source Linux/UNIX user mapping services like Centrify (commercial), VAS (commercial), FreeIPA (free) or SSSD (free).
If you have an Active Directory server or LDAP server use the Hadoop LDAP user mappings.
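As a rough sketch of the sync approach mentioned above (the node list file, user name and UID are made up for illustration; keep UIDs consistent across nodes):
$ for host in $(cat worker_nodes.txt); do ssh root@"$host" "id newuser || useradd -u 1505 newuser"; done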
References:
https://community.hortonworks.com/questions/57394/what-are-the-best-practises-for-unix-user-mapping.html
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/cm_sg_ldap_grp_mappings.html
Is it possible to submit a Spark job to a YARN cluster and choose, either on the command line or inside the jar, which user will "own" the job?
The spark-submit will be launched from a script containing the user.
PS: is it still possible if the cluster has a Kerberos configuration (and the script a keytab)?
For a non-kerberized cluster: export HADOOP_USER_NAME=zorro before submitting the Spark job will do the trick.
Make sure to unset HADOOP_USER_NAME afterwards, if you want to revert to your default credentials in the rest of the shell script (or in your interactive shell session).
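For example (the application jar is a placeholder):
export HADOOP_USER_NAME=zorro
spark-submit --master yarn --deploy-mode cluster my_app.jar
unset HADOOP_USER_NAME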
For a kerberized cluster, the clean way to impersonate another account without trashing your other jobs/sessions (that probably depend on your default ticket) would be something along these lines:
export KRB5CCNAME=FILE:/tmp/krb5cc_$(id -u)_temp_$$
kinit -kt ~/.protectedDir/zorro.keytab zorro@MY.REALM
spark-submit ...........
kdestroy
For a non-kerberized cluster you can add a Spark conf as:
--conf spark.yarn.appMasterEnv.HADOOP_USER_NAME=<user_name>
Another (much safer) approach is to use proxy authentication - basically you create a service account and then allow it to impersonate other users.
$ spark-submit --help 2>&1 | grep proxy
--proxy-user NAME User to impersonate when submitting the application.
Assuming a Kerberized / secured cluster.
I mentioned it's much safer because you don't need to store (and manage) keytabs of all the users you will have to impersonate.
To enable impersonation, there are several settings you need to enable on the Hadoop side to tell which account(s) can impersonate which users or groups and from which servers. Let's say you have created an svc_spark_prd service account/user.
hadoop.proxyuser.svc_spark_prd.hosts - list of fully-qualified domain names of servers which are allowed to submit impersonated Spark applications. * is allowed but not recommended for any host.
Also specify either hadoop.proxyuser.svc_spark_prd.users or hadoop.proxyuser.svc_spark_prd.groups to list the users or groups that svc_spark_prd is allowed to impersonate. * is allowed but not recommended for any user/group.
Also, check out documentation on proxy authentication.
Apache Livy for example uses this approach to submit Spark jobs on behalf of other end users.
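A usage sketch, assuming svc_spark_prd has a keytab and the proxyuser settings above are in place (the keytab path, realm, impersonated user and jar are placeholders):
$ kinit -kt /etc/security/keytabs/svc_spark_prd.keytab svc_spark_prd@EXAMPLE.COM
$ spark-submit --proxy-user alice --master yarn --deploy-mode cluster my_app.jar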
If your user exists, you can still launch your spark-submit with
su $my_user -c "spark-submit [...]"
I am not sure about the Kerberos keytab, but if you do a kinit with this user it should be fine.
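For instance, something like this (the keytab path, realm and jar are placeholders; it assumes my_user can read its own keytab):
su - $my_user -c "kinit -kt /home/$my_user/$my_user.keytab $my_user@EXAMPLE.COM && spark-submit --master yarn my_app.jar"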
If you can't use su because you don't want the password, I invite you to see this stackoverflow answer:
how to run script as another user without password
I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff (https://www.youtube.com/watch?v=KD2Q-2ToloE)
1) The Apache doc uses the word "Token" whereas the general docs on the internet use the term "Ticket".
Are Token and Ticket the same?
2) The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks.
This makes it possible for an unauthorized client to read a data block as
long as she can supply its block ID. It’s also possible for anyone to write
arbitrary data blocks to DataNodes."
My thoughts on this:
I can fetch the block ID from the file path using the command:
hadoop#Studio-1555:/opt/hadoop/hadoop-1.0.2/bin$ ./hadoop fsck /hadoop/mapred/system/jobtracker.info -files -blocks
FSCK started by hadoop from /127.0.0.1 for path /hadoop/mapred/system/jobtracker.info at Mon Jul 09 06:57:14 EDT 2012
/hadoop/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
0. blk_-9148080207111019586_1001 len=4 repl=1
As I was authorized to access the file jobtracker.info, I was able to find its block ID using the above command.
I think that if I add some offset to this block ID, I could write to that DataNode.
How can I explicitly mention the block ID while writing a file to HDFS? (What is the command?)
Is there any other way to write arbitrary data blocks to DataNodes?
Please tell me if my approach is wrong.
Are Token and Ticket the same?
No. Tickets are issued by Kerberos and then servers in Hadoop (NameNode or JobTracker) issue tokens to provide authentication within the Hadoop cluster. Hadoop does not rely on Kerberos to authenticate running tasks, for instance, but uses its own tokens that were issued based on the Kerberos tickets.
The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks."
I'm guessing you're taking that from the JIRA where access control was provided (https://issues.apache.org/jira/browse/HADOOP-4359) via BlockAccessTokens. Assuming this is turned on - which it should be in a secure cluster - one cannot access a block on a datanode without such a token, which is issued by the NameNode after authentication and authorization via Kerberos and HDFS' own file system permissions.
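One way to check whether this enforcement is turned on in a given cluster (assuming a configured HDFS client) is:
$ hdfs getconf -confKey dfs.block.access.token.enable   # "true" on a secure cluster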
How can I access the DataNode and write data arbitrarily?
I am not sure what you mean here. Do you mean when the user does not have permission? As Jacob mentioned,
you will not get a valid BlockAccessToken unless the user has the permissions to access the data based on the file system permissions, assuming you have a secure Hadoop cluster.