Regarding Hadoop Security via Kerberos

I am trying to learn how Kerberos can be implemented in Hadoop.
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff (https://www.youtube.com/watch?v=KD2Q-2ToloE)
1) The Apache doc uses the word "Token", whereas the general documentation on the internet uses the term "Ticket".
Are Token and Ticket the same?
2) The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks.
This makes it possible for an unauthorized client to read a data block as
long as she can supply its block ID. It’s also possible for anyone to write
arbitrary data blocks to DataNodes."
My thoughts on this:
I can fetch the block ID from the file path using the command:
hadoop@Studio-1555:/opt/hadoop/hadoop-1.0.2/bin$ ./hadoop fsck /hadoop/mapred/system/jobtracker.info -files -blocks
FSCK started by hadoop from /127.0.0.1 for path /hadoop/mapred/system/jobtracker.info at Mon Jul 09 06:57:14 EDT 2012
/hadoop/mapred/system/jobtracker.info 4 bytes, 1 block(s): OK
0. blk_-9148080207111019586_1001 len=4 repl=1
As I was authorized to access this file jobtracker.info, I was able to find its blockID using the above command.
I think that if I add some offset to this block ID, I could write arbitrary data to that DataNode.
How can I explicitly specify the block ID while writing a file to HDFS? (What is the command?)
Any other way to write arbitrary data blocks to DataNodes?
Please tell me if my approach is wrong.

Are Token and Ticket the same?
No. Tickets are issued by Kerberos and then servers in Hadoop (NameNode or JobTracker) issue tokens to provide authentication within the Hadoop cluster. Hadoop does not rely on Kerberos to authenticate running tasks, for instance, but uses its own tokens that were issued based on the Kerberos tickets.
The Apache doc also says: "DataNodes do not enforce any access control on accesses to its data blocks."
I'm guessing you're taking that from the JIRA where access control was provided (https://issues.apache.org/jira/browse/HADOOP-4359) via BlockAccessTokens. Assuming this is turned on - which it should be in a secure cluster - one cannot access a block on a datanode without such a token, which is issued by the NameNode after authentication and authorization via Kerberos and HDFS' own file system permissions.
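In practice, that check is controlled by the block access token setting in hdfs-site.xml; a minimal sketch (only this one property is shown, the rest of a secure setup is omitted):
<property>
  <name>dfs.block.access.token.enable</name>
  <value>true</value>
</property>
With this enabled on a Kerberized cluster, a client that merely knows a block ID is refused by the DataNode, because it cannot present a valid token signed with the secret shared between the NameNode and the DataNodes.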

How can I access the DataNode and write data arbitrarily?
I am not sure what you mean here. Do you mean when the user does not have permission? As Jacob mentioned,
you will not get a valid BlockAccessToken unless the user has permission to access the data, based on the file system permissions, assuming that you have a secure Hadoop cluster.

Related

How to log in to a real-time company Hadoop cluster?

I am new to the Hadoop environment. I just joined a company and was given a KT (knowledge transfer) and the required documents for the project. They asked me to log in to the cluster and start work immediately. Can anyone suggest the steps to log in?
Not really clear what you're logging into. You should ask your coworkers for advice.
However, it sounds like you have a Kerberos keytab, and you would run
kinit -kt key.kt
There might be additional arguments necessary there, such as what's referred to as a principal, but only the cluster administrators can answer what that needs to be.
To verify your ticket is active, run:
klist
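The output will look roughly like this (the cache location, principal and realm below are placeholders):
Ticket cache: FILE:/tmp/krb5cc_1000
Default principal: yourname@EXAMPLE.COM

Valid starting       Expires              Service principal
07/09/2022 06:57:14  07/09/2022 16:57:14  krbtgt/EXAMPLE.COM@EXAMPLE.COM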
Usually you will have Edge Nodes, i.e. client nodes, installed with all the clients like:
HDFS Client
Sqoop Client
Hive Client etc.
You need to get the hostnames/IP addresses for these machines. If you are using Windows, you can use PuTTY to log in to these nodes, either with a username and password or with the .ppk file provided for those nodes.
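For example, from a terminal-based SSH client (the hostname and username below are made up; your admins will give you the real ones):
ssh your_user@edgenode01.yourcompany.com
In PuTTY you would instead enter that hostname in the session settings and point it at the provided .ppk key for authentication.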
In my view, any company will have an infrastructure team that configures LDAP with the Hadoop cluster and grants users access by adding their IDs to the appropriate group roles.
And by the way, are you using Cloudera/MapR/Hortonworks? Every distribution has its own way and best practices.
I am assuming KT means knowledge transfer, and that the project document is about the application and not the Hadoop cluster/infrastructure.
I would follow the following procedure:
1) Find out the name of the edge node (also called client node) from your team or your TechOps. Also find out whether you will be using some generic Linux user (say "develteam") or whether you will have to get a user created on the edge node.
2) Assuming you are accessing from Windows, install some SSH client (like PuTTY).
3) Log in to the edge node using the credentials (for generic user or specific user as in #1).
4) Run following command to check you are on Hadoop Cluster:
> hadoop version
5) Try hive shell by typing:
> hive
6) Try running following HDFS command:
> hdfs dfs -ls /
7) Ask a team member where to find the Hadoop config for that cluster. You will most probably not have write permissions, but maybe you can cat the following files to get an idea of the cluster (see the example after this list):
core-site.xml
hdfs-site.xml
yarn-site.xml
mapred-site.xml
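For example, on many distributions the client configuration lives under /etc/hadoop/conf (this path is an assumption; it varies by distribution and install method):
cat /etc/hadoop/conf/core-site.xml
cat /etc/hadoop/conf/hdfs-site.xml
cat /etc/hadoop/conf/yarn-site.xml
cat /etc/hadoop/conf/mapred-site.xml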

How does impersonation in Hadoop work?

I am trying to understand how impersonation works in a Hadoop environment.
I found a few resources like:
About doAs and proxy users- hadoop-kerberos-guide
and about tokens- delegation-tokens.
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
The user does a kinit and executes an end-user-facing program like beeline, spark-submit, etc.
The program is app specific and gets service tickets for HDFS.
It then gets tokens for all the services it may need during the job execution and saves the tokens in an HDFS directory.
The program then connects to a job executor (using a service ticket for the job executor??), e.g. YARN, with the job info and the token path.
The job executor gets the token and initializes UGI, and all communication with HDFS is done using the token; Kerberos tickets are not used.
Is the above high level understanding correct? (I have more follow up queries.)
Can the token mechanism be skipped and only Kerberos used at each layer? If so, any resources would help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop tokens but supports Kerberos.
Thanks & regards
-Sri
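For reference, a minimal sketch of the proxy-user (doAs) configuration that this impersonation flow relies on, assuming a hypothetical superuser named svc_connector that is allowed to impersonate end users (the name and the wildcard values are illustrative only); this goes in core-site.xml on the cluster:
<property>
  <name>hadoop.proxyuser.svc_connector.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.svc_connector.groups</name>
  <value>*</value>
</property>
On the client side, the superuser typically authenticates with its own Kerberos credentials and then wraps calls in UserGroupInformation.createProxyUser(endUser, UserGroupInformation.getLoginUser()).doAs(...), so the services see the end user while the Kerberos authentication was done by the superuser.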

Check permission in HDFS

I'm totally new to Hadoop. One of the SAS users has a problem saving a file from SAS Enterprise Guide to Hadoop, and I've been asked to check whether permissions in HDFS have been granted properly, i.e. to make sure users are allowed to move data from one side and add it to the other.
Where should I check for this on the SAS servers? Is it a file, and how can I check it?
A detailed answer would be much appreciated.
Thanks.
This question is too vague, but I can offer a few suggestions. First off, the SAS Enterprise Guide user should have a resulting SAS log from their job with any errors.
The Hadoop cluster distribution, version, services being used (for example Knox, Sentry, or Ranger security products must be set up), and authentication (Kerberos) all make a difference. I will assume you are not having Kerberos issues and are not running Knox, Sentry, Ranger, etc., and that you are using core Hadoop with no Kerberos. If you need help with those you must be more specific.
1 - You have to check permissions on the Hadoop side for this. You have to know where they are putting the data into Hadoop. These are paths in HDFS, not the server's file system.
If connecting to Hive and not specifying any options, it is likely the /user/hive/warehouse or /user/username folder.
2 - The Hadoop sticky bit, enabled by default, prevents users from writing to /tmp in HDFS. Some SAS programs write to the /tmp folder in HDFS to save metadata, along with other information.
Run the following command on a Hadoop node to check basic permissions in HDFS.
hadoop fs -ls /
You should see the /tmp folder along with its permissions. If the /tmp folder has a "t" at the end, the sticky bit is set, e.g. drwxrwxrwt. If the permissions are drwxrwxrwx, then the sticky bit isn't set, which is good for ruling out permissions issues.
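For example, with the sticky bit set the /tmp line looks roughly like the first line below, and without it like the second (owner, group, date and size are placeholders):
drwxrwxrwt   - hdfs supergroup          0 2016-01-01 12:00 /tmp
drwxrwxrwx   - hdfs supergroup          0 2016-01-01 12:00 /tmp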
If you have a sticky bit set on /tmp, which is usually the case by default, then you must either remove it or set an HDFS TEMP directory in the SAS program's libname for the Hadoop cluster.
Please see the following SAS/Access to Hadoop Guide about the libname options at SAS/ACCESS® 9.4 for Relational Databases: Reference, Ninth Edition | LIBNAME Statement Specifics for Hadoop
To remove/change the Hadoop sticky bit, see the following article or your Hadoop vendor's documentation: Configuring Hadoop Security in CDH 5, Step 14: Set the Sticky Bit on HDFS Directories. You will want to do the opposite of this article to remove the sticky bit, though.
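As a sketch, the sticky bit is controlled by the leading digit of the octal mode, so running the following as the HDFS superuser sets it and clears it respectively (verify with hadoop fs -ls / afterwards):
hadoop fs -chmod 1777 /tmp
hadoop fs -chmod 0777 /tmp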
3 - SAS + Authentication + Users
If your Hadoop cluster is secured using Kerberos, then each SAS user must have a valid Kerberos ticket to talk to any Hadoop service. There are a number of guides on the SAS Hadoop Support page about Kerberos, along with other resources. With Kerberos they need a Kerberos ticket, not a username or password.
SAS 9.4 Support For Hadoop Reference
If you are not using Kerberos then you have either the Hadoop default of no authentication, or possibly some services such as Hive could have LDAP enabled.
If you don't have LDAP enabled then you can use any Hadoop username in the libname statement to connect, such as hive, hdfs, or yarn. You do not need to enter any password, and this user doesn't have to be the SAS user account. This is because the default Hadoop configuration does not require authentication. You can use another account, such as one you might create for the SAS user in your Hadoop cluster. If you do this you must create a /user/username folder in HDFS by running something like the following as the HDFS superuser (or one with permissions in Hadoop), and then set the ownership to the user.
hadoop fs -mkdir /user/sasdemo
hadoop fs -chown sasdemo:sasusers /user/sasdemo
Then you can check to make sure it exists with
hadoop fs -ls /user/
Basically, whichever user they have in the libname statement in their SAS program must have a user home folder in Hadoop. The Hadoop service users will have one created by default on install, but you will need to create them for any additional users.
If you are using LDAP with Hadoop (not too common from what I've seen), then you will have to have the LDAP username along with a password for the user account in the libname statement. I believe you can encode the password if you like.
Testing Connections to Hadoop from SAS Program
You can modify the following SAS code to do a basic test that puts one of the sashelp datasets into Hadoop over a serial connection to HiveServer2 from SAS Enterprise Guide. This is only a very basic test, but it should prove you can write to Hadoop.
libname myhive hadoop server=hiveserver.example.com port=10000 schema=default user=hive;
data myhive.cars;set sashelp.cars;run;
Then if you want you can use the Hadoop client of your choice to find the data in Hadoop in the location you stored it, likely /user/hive/warehouse.
hadoop fs -ls /user/hive/warehouse
And/Or you should be able to run a proc contents in SAS Enterprise Guide to display the contents of the Hadoop Hive table you just put into Hadoop.
PROC CONTENTS DATA=myhive.cars;run;
Hope this helps, good luck!
To find the proper groups that can access files in HDFS, we need to check Sentry.
The file ACLs are described in Sentry, so if you want to grant or revoke access for anyone, it can be done through it.
In the Sentry view, the file location is on the left-hand side and the groups' ACLs are on the right-hand side.

Flume-ng hdfs security

I'm new to Hadoop and Flume NG and I need some help.
I don't understand how HDFS security is implemented.
Here are the relevant configuration lines from the Flume User Guide:
# properties of hdfs-Cluster1-sink
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
Does it mean that anyone who knows my hdfs path can write any data to my hdfs?
The question is from some time ago, but I'll try to answer it for any other developer dealing with Flume and HDFS security.
Flume's HDFS sink just needs the endpoint where the data is going to be persisted. Whether such an endpoint is secured or not depends entirely on Hadoop, not on Flume.
The Hadoop ecosystem has several tools and systems for implementing security, but focusing on the native elements, we are talking about the authentication and authorization methods.
Authentication is based on Kerberos, and as with any other auth mechanism, it is the process of determining whether someone or something is, in fact, who or what it is declared to be. So, with authentication enabled it is not enough to know an HDFS user name; you have to demonstrate you own that user by previously authenticating against Kerberos and obtaining a ticket. Authentication may be password-based or keytab-based; you can see the keytabs as "certificate files" containing the authentication keys.
Authorization can be implemented at the file system level, by deciding which permissions each folder or file within HDFS has. Thus, if a certain file has only 600 permissions, then only its owner will be able to read or write it. Other authorization mechanisms like Hadoop ACLs can also be used.
That being said, if you have a look at the Flume HDFS sink, you'll see that there are a couple of parameters about Kerberos:
hdfs.kerberosPrincipal – Kerberos user principal for accessing secure HDFS
hdfs.kerberosKeytab – Kerberos keytab for accessing secure HDFS
In Kerberos terminology, a principal is a unique identity to which Kerberos can assign tickets. Thus, for each enabled user at HDFS you will need a principal registered in Kerberos. The keytab, as previously said, is a container for the authentication keys a certain principal owns.
Thus, if you want to secure your HDFS then install Kerberos, create principals and keytabs for each enabled user and configure the HDFS sink properly. In addition, change the permissions appropriately in your HDFS.
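Putting that together with the configuration from the question, a secured sink would look something like this (the principal and keytab path are placeholders for whatever your Kerberos admin issues):
agent_foo.sinks.hdfs-Cluster1-sink.type = hdfs
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.path = hdfs://namenode/flume/webdata
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.kerberosPrincipal = flume/flumehost01.example.com@EXAMPLE.COM
agent_foo.sinks.hdfs-Cluster1-sink.hdfs.kerberosKeytab = /etc/security/keytabs/flume.keytab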

Hadoop Security

I am trying to learn "How Kerberos can be implemented in Hadoop?"
I have gone through this doc https://issues.apache.org/jira/browse/HADOOP-4487
I have also gone through Basic Kerberos stuff ( https://www.youtube.com/watch?v=KD2Q-2ToloE)
After learning from these resources I have come to a conclusion which I am representing through a diagram.
Scenario: A user logs on to his computer, gets authenticated by Kerberos, and submits a MapReduce job.
(Please read the description of the diagram; it hardly needs 5 minutes of your time.)
I would like to explain the diagram and ask questions related to a few steps (in bold).
Numbers on a yellow background represent the entire flow (numbers 1 to 19).
DT (with red background ) represents Delegation Token
BAT (with green Background) represents Block Access Token
JT (with Brown Background) represents Job Token
Steps 1, 2, 3 and 4 represent:
Request for a TGT (Ticket Granting Ticket)
Request for a service Ticket for Name Node.
Question 1) Where should the KDC be located? Can it be on the machine where my NameNode or JobTracker is present?
Steps 5, 6, 7, 8 and 9 represent:
Show the service ticket to the NameNode, get an acknowledgement.
The NameNode will issue a Delegation Token (red).
The user specifies the token renewer (in this case it is the JobTracker).
Question 2) The user submits this Delegation Token along with the job to the JobTracker. Will the Delegation Token be shared with the TaskTracker?
Steps 10, 11, 12, 13 and 14 represent:
Ask for a service ticket for the JobTracker, get the service ticket from the KDC.
Show this ticket to the JobTracker and get an ACK from the JobTracker.
Submit Job + Delegation Token to JobTracker.
Steps 15, 16 and 17 represent:
Generate the Block Access Token and spread it across all DataNodes.
Send the block ID and Block Access Token to the JobTracker, and the JobTracker will pass them on to the TaskTracker.
Question 3) Who will ask the NameNode for the BlockAccessToken and block ID:
the JobTracker or the TaskTracker?
Sorry, I missed number 18 by mistake.
Step 19 represents:
The JobTracker generates a Job Token (brown) and passes it to the TaskTrackers.
Question 4) Can I conclude that there will be one Delegation Token per user, which will be distributed throughout the cluster, and one Job Token per job? So a user will have only one Delegation Token and many Job Tokens (equal to the number of jobs submitted by him).
Please tell me if I missed something or I was wrong at some point in my explanation.
Steps to follow to make sure Hadoop is secure
Install Kerberos in any server accessible to all cluster nodes.
yum install krb5-server
yum install krb5-workstation
yum install krb5-libs
Modify the KDC server configuration file to set up the ACL file, admin keytab file, and the host:
/var/kerberos/krb5kdc/kdc.conf
Modify the configuration file /etc/krb5.conf to set up the KDC host and admin server.
Create the database on the KDC host:
$ kdb5_util create -r realm_name -s
Add administrators to the ACL file:
vi /var/kerberos/krb5kdc/kadm5.acl
Add the admin principal 'admin/admin@realm_name' in that file.
Add the admin principal:
$ addprinc admin/admin@realm_name
Install Kerberos clients on all Cluster Nodes
yum install krb5-workstation
Copy krb5.conf to all cluster nodes
Make sure to enable secure mode in Hadoop by setting the required configurations (a minimal core-site.xml sketch follows these steps):
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html
Verify:
Log in as a normal user to the cluster gateway or node where user keytabs are deployed.
Run "kinit -k -t /location/of/keytab_file username@realm_name"
and run HDFS commands or MapReduce jobs to verify the cluster is secured.
These are the basic steps to make sure Kerberos is enabled in your cluster.
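For the secure-mode step above, the core Hadoop switches live in core-site.xml; a minimal sketch (the SecureMode page linked above covers the full set of per-service principals and keytab settings):
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>
</property>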
Hadoop security mostly uses Kerberos for authentication and Sentry for authorization.
Gateways like Knox and tools like Ranger are used for other security aspects.
