How to get a Kerberos ticket instead of a delegation token in Hadoop MapReduce? - hadoop

I'm a Java user. When I submit a job to Hadoop MapReduce, Kerberos is used to authenticate to Hadoop, and upon success a delegation token is created and passed along with the job submission instead of the Kerberos ticket (for security reasons, as stated by Hadoop). The job then runs as me, but the job itself needs to use Kerberos to send requests to services outside Hadoop. Since I don't have a Kerberos TGT on the Hadoop side, I can't get the service tickets.
Is there any way I can pass the Kerberos ticket along with the job? (I know it might be dangerous, since we don't want to pass secrets around.) JobConf can pass string-to-string pairs to Hadoop, so would I have to convert the TGT to a JSON string and convert it back while the job is running?
Or is it possible to re-form the TGT from the delegation token?
I tried to Google this but found little information. Could anyone help? Thank you.
**Edited:**
It looks like there is no easy way of doing this without passing the TGT to Hadoop, so I am going to try the following method: pass the TGT as a string via the job configuration map (which only carries strings), and convert the string back to a TGT object when the job runs on Hadoop. My concern is that I would be passing credentials over the network, which is not a best practice and is one of the very reasons Hadoop does not pass Kerberos tickets around. If I can re-use the re-formed TGT on Hadoop to get service tickets, I will try to encrypt the TGT string as much as possible to reduce the security risk.
So before starting a job on the local machine, the code would look something like this:
import sun.security.krb5.Credentials;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

Credentials tgt = Credentials.acquireTGTFromCache(null, null); // make sure kinit has been done before this
String tgtStr = convertToJsonString(tgt);                      // need to implement this
Configuration conf = new Configuration();
conf.set("tgtStr", tgtStr);                                    // the job config only carries strings
Job job = Job.getInstance(conf, "Test");
job.setJarByClass(Test.class);                                 // Test contains the map/reduce logic
job.submit();
job.waitForCompletion(true);
Then, inside the task that Hadoop runs, the code would look something like this:
Configuration conf = context.getConfiguration();    // 'context' is the task's Mapper/Reducer context
String tgtStr = conf.get("tgtStr");
Credentials tgt = reformTGTFromString(tgtStr);       // need to implement this
Credentials serviceTicket = Credentials.acquireServiceCreds(servicePrincipal, tgt); // gets a ticket for any target service principal
So I need to implement two functions: one to serialize the TGT object (Credentials) to a string, and one to re-form the object from that string.
Does anyone know a better solution for this? Thanks.

Please see the design at http://carfield.com.hk:8080/document/distributed/hadoop-security-design.pdf, if you have not done so already.
Or is it possible to re-form the TGT from the delegation token?
No. Delegation tokens are issued by the Hadoop NameNode, and while they are based on the Kerberos authentication, they are independent of it and you cannot derive the Kerberos TGT from them.
In the original design, we considered using Kerberos alone (without any additional tokens), which would have made your plan easy, but decided against it for these reasons:
Performance: thousands of M/R tasks may need to get Kerberos tickets at the same time.
Renewal: Kerberos credentials need to be renewed before they expire, which is a problem for scheduled jobs.
Delegation tokens don't depend on Kerberos and can be coupled with non-Kerberos authentication mechanisms (such as SSL) used at the edge.
In your case, you can use a private distributed cache and send the forwardable TGT. I think this will be OK, but it needs some more thought. Obviously you need to make sure your implementation is secure: your tickets should have the minimally necessary lifetime, IP channel bindings should be used if possible, and the use of the tickets should be restricted to authorized processes. A sketch of the distributed-cache route is shown below.
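A minimal sketch of that private-distributed-cache route, assuming the ticket material has already been serialized and encrypted elsewhere; the helper class, the HDFS path and the cache-file name are hypothetical:

import java.net.URI;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;
import org.apache.hadoop.mapreduce.Job;

public final class PrivateCacheHelper {
    // Writes the (already encrypted) ticket bytes to an owner-only HDFS file and registers it
    // as a cache file, so the NodeManagers localize it as a private cache entry.
    public static void shipTicket(Job job, byte[] encryptedTicket, Path ticketFile) throws Exception {
        FileSystem fs = FileSystem.get(job.getConfiguration());
        FSDataOutputStream out = fs.create(ticketFile, true);
        out.write(encryptedTicket);
        out.close();
        fs.setPermission(ticketFile, new FsPermission((short) 0600)); // owner-only => private localization
        job.addCacheFile(new URI(ticketFile.toUri() + "#tgt.bin"));   // visible to tasks as ./tgt.bin
    }
}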

By disassembling the Credentials fields, converting them to strings with a Base64 encoder, forming a JSON string, and passing it to Hadoop via the config map or the distributed cache suggested by RVM, and then re-forming the Credentials object in the job running on Hadoop, I can get back the Kerberos TGT and successfully obtain any service tickets with it. So this method works; the only thing to be very careful about is encrypting the keys that are passed over the network.
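For reference, a minimal sketch of the serialize/deserialize step, using the public javax.security.auth.kerberos.KerberosTicket class (which is Serializable) rather than the internal sun.security.krb5.Credentials used above; converting between the two classes and encrypting the resulting string are left out and would still need to be added:

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.util.Base64;
import javax.security.auth.kerberos.KerberosTicket;

public final class TicketCodec {
    // Serialize the ticket to a Base64 string so it fits in a job configuration value.
    public static String toBase64(KerberosTicket ticket) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(ticket);
        }
        return Base64.getEncoder().encodeToString(bos.toByteArray());
    }

    // Rebuild the ticket on the task side from the Base64 string.
    public static KerberosTicket fromBase64(String encoded) throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (KerberosTicket) ois.readObject();
        }
    }
}

Whatever the serialization format, the resulting string contains the session key, so it must be encrypted before it is placed in the job configuration or the distributed cache.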

First of all, your account has to have delegation enabled, and the service ticket has to be requested as forwardable. If that is all true, Hadoop can retrieve the delegated credential from the GSSContext and construct a new one on your behalf. With that new TGT it will be able to perform further steps. Use Wireshark to check the ticket for Hadoop.
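A minimal JGSS sketch of the acceptor-side step this answer describes, assuming a single-round-trip context establishment and the default acceptor credential; error handling and the transport of the client token are omitted:

import org.ietf.jgss.GSSContext;
import org.ietf.jgss.GSSCredential;
import org.ietf.jgss.GSSException;
import org.ietf.jgss.GSSManager;

public final class DelegationCheck {
    // Accepts the client's GSS token and returns the delegated (forwarded) credential,
    // or null if the client did not delegate one.
    public static GSSCredential acceptAndGetDelegated(byte[] clientToken) throws GSSException {
        GSSManager manager = GSSManager.getInstance();
        GSSContext context = manager.createContext((GSSCredential) null); // default acceptor credential
        context.acceptSecContext(clientToken, 0, clientToken.length);     // ignoring any response token
        return context.getCredDelegState() ? context.getDelegCred() : null;
    }
}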

Related

How to set User-Agent (prefix) for every upload request to S3 from Amazon EMR application

AWS has requested that the product I'm working on identifies requests that it makes to our users' S3 resources on their behalf so they can assess its impact.
To accomplish this, we have to set the User-Agent header for every upload request made against an S3 bucket from an EMR application. I'm wondering how this can be achieved.
Hadoop's documentation mentions the fs.s3a.user.agent.prefix property (core-default.xml). However, the s3a protocol seems to be deprecated (Work with Storage and File Systems), so I'm not sure if this property will work.
To give a bit more context on what I need to do: with the AWS Java SDK, it is possible to set the User-Agent header's prefix, for example:
AWSCredentials credentials;          // initialized elsewhere
ClientConfiguration conf = new ClientConfiguration()
        .withUserAgentPrefix("APN/1.0 PARTNER/1.0 PRODUCT/1.0");
AmazonS3Client client = new AmazonS3Client(credentials, conf);
Then every request's User-Agent HTTP header will have a value similar to: APN/1.0 PARTNER/1.0 PRODUCT/1.0, aws-sdk-java/1.11.234 Linux/4.15.0-58-generic Java_HotSpot(TM)_64-Bit_Server_VM/25.201-b09 java/1.8.0_201. I need to achieve something similar when uploading files from an EMR application.
S3A is not deprecated in ASF Hadoop; I will argue that it is now ahead of what EMR's own connector will do. If you are using EMR you may be able to use it; otherwise you get to work with what they implement.
FWIW, in S3A we're looking at what it would take to dynamically change the header for a specific query, so you can go beyond specific users to specific Hive/Spark queries in shared clusters. It would be fairly complex to do, though, as you need it to be a per-request setting.
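For clusters where the s3a:// connector is actually in use (ASF Hadoop rather than EMRFS), a minimal sketch of setting the prefix programmatically; the bucket name is hypothetical:

import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

Configuration conf = new Configuration();
conf.set("fs.s3a.user.agent.prefix", "APN/1.0 PARTNER/1.0 PRODUCT/1.0");
// Requests issued through this FileSystem instance carry the prefix in their User-Agent header.
FileSystem fs = FileSystem.get(URI.create("s3a://my-bucket/"), conf);

The same property can also be set in core-site.xml or passed at submission time with -D.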
The solution in my case was to include an awssdk_config_default.json file inside the JAR submitted to the EMR job. This file is used by the AWS SDK to allow developers to override some settings.
I added this JSON file to the JAR submitted to EMR with this content:
{
"userAgentTemplate": "APN/1.0 PARTNER/1.0 PRODUCT/1.0 aws-sdk-{platform}/{version} {os.name}/{os.version} {java.vm.name}/{java.vm.version} java/{java.version}{language.and.region}{additional.languages} vendor/{java.vendor}"
}
Note: passing the fs.s3a.user.agent.prefix property to the EMR job didn't work. AWS EMR uses EMRFS when handling files stored in S3, and EMRFS uses the AWS SDK. I realized this because of an exception occasionally thrown in AWS EMR; part of its stack trace was:
Caused by: java.lang.ExceptionInInitializerError: null
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createAndTrack(TemporaryDirectoriesGenerator.java:144)
at com.amazon.ws.emr.hadoop.fs.files.TemporaryDirectoriesGenerator.createTemporaryDirectories(TemporaryDirectoriesGenerator.java:93)
at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.create(S3NativeFileSystem.java:616)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:932)
at org.apache.hadoop.fs.FileSystem.create(FileSystem.java:825)
at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.create(EmrFileSystem.java:217)
at org.apache.hadoop.mapred.TextOutputFormat.getRecordWriter(TextOutputFormat.java:135)
I'm posting the answer here for future reference. Some interesting links:
The class in AWS SDK that uses this configuration file: InternalConfig.java
https://stackoverflow.com/a/31173739/1070393
EMRFS

How does Kerberos handle multiple TGT request in the same node for the same service and for the same client?

From my understanding of the Kerberos architecture, a client needs to get a Ticket-Granting Ticket (TGT) from the Authentication Server to be able to interact with a service. A TGT contains:
client ID
client network address
ticket validity period
client/TGS session key.
I got this from here
Let's imagine I have a master workflow which contains Pig, Hive and Spark files; I will need three different TGTs, one per service, to use them all successfully.
One of the elements in the TGT is the ticket validity period. Let's imagine this is set to 8 hours.
From my understanding, if the master workflow needs, let's say, 10 hours to complete, it may fail after the 8th hour, since the validity of the ticket will be over.
So, as I understand it, it will be necessary to refresh this TGT every 8 hours to communicate with the service without issues.
Now I was thinking, as a possible approach, of having a background process refresh this TGT every 8 hours, so that the client always has a valid TGS session key for any necessary service.
A possible problem with this approach is that there may be a gap between refreshes, even a 30-second or 1-minute gap caused by some delay, which may leave the client with an invalid TGS session key.
My question: is it possible to refresh this TGS session key every 6 hours, which means getting a new TGT while the previous one is still valid? And what happens if you make this TGT request when a valid one still exists? Is the old one replaced/discarded, are both stored on the client, or is this new request just ignored?
I am completely new at this, so if there are other ways to handle this issue, please let me know.
Yes. You can update your program to use a keytab rather than relying on a TGT already existing in the cache. This is done by using the UserGroupInformation class from the Hadoop security package.
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.UserGroupInformation
import org.apache.hadoop.security.UserGroupInformation.AuthenticationMethod

val configuration = new Configuration
configuration.addResource("/etc/hadoop/conf/hdfs-site.xml")
UserGroupInformation.setConfiguration(configuration)
UserGroupInformation.getCurrentUser.setAuthenticationMethod(AuthenticationMethod.KERBEROS)
UserGroupInformation
  .loginUserFromKeytabAndReturnUGI(
    "hadoop.kerberos.principal",            // the Kerberos principal name
    "path of hadoop.kerberos.keytab file")  // the keytab file location
  .doAs(new PrivilegedExceptionAction[Unit]() {
    override def run(): Unit = {
      // logic that should run as the keytab principal
    }
  })
Above we specify the name of our service principal and the path to the keytab file we generated. As long as that keytab is valid our program will use the desired service principal for all actions, regardless of whether or not the user running the program has already authenticated and received a TGT.
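For long-running or scheduled work, the keytab login can also be refreshed in place instead of relying on a background kinit. A minimal sketch, assuming the same principal and a hypothetical keytab path (shown in Java, but the same calls work from Scala):

import org.apache.hadoop.security.UserGroupInformation;

UserGroupInformation ugi = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "hadoop.kerberos.principal", "/path/to/hadoop.kerberos.keytab"); // hypothetical keytab path
// Cheap to call before each block of Kerberos-authenticated work: it re-logs in
// from the keytab when the current TGT is close to expiry, so there is no gap
// of the kind described in the question.
ugi.checkTGTAndReloginFromKeytab();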

How to understand the process of Kerberos (over Hadoop)?

I have deployed Kerberos in a Hadoop cluster. According to the theory, the KDC will verify that you are who you claim to be, based on your private key.
However, using the system confused me. For example, if you need access to HDFS, all you need to do is enter "kinit hdfs@MY.REALM" and the password from a client. Then you get a ticket and can manipulate HDFS as the superuser "hdfs".
Is this the real process of Kerberos? If the user is only verified by a password, why don't we just build a list inside the server and require the user to enter a username/password? Where is the private key mentioned in the theory? Can anyone explain this to me, please?

How to use the ResourceManager web interface as an user

Every time I try to use the Hadoop ResourceManager web interface (http://resource-manger.host:8088/cluster/) I show up logged in as dr.who.
My question: how can I log in as another user? In this case I want to log in as myself and have a higher level of privileges than dr.who.
The user information is obtained from HttpServletRequest#getRemoteUser().
1. If you deployed an insecure cluster, the simplest way to pass the username to the server is via a URL parameter. For example, http://localhost:8088/cluster?user.name=babu
2. If you deployed a secure cluster, you are probably using Kerberos authentication. You can use kinit to get a Kerberos TGT, then configure the browser to negotiate (network.negotiate-auth.trusted-uris for Firefox, and --auth-server-whitelist for Chromium; there are lots of answers about this).
For more information, you can check the official Hadoop documentation (https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/HttpAuthentication.html).
You should set the access control list by changing the default configuration of:
yarn.resourcemanager.zk-acl
from
world:anyone:rwcda
to something else, which is cluster-specific.
This property sets the ACLs the ResourceManager uses for the znode structure in which it stores its internal state.

Running a map reduce job as a different user

I have a web application which interacts with Hadoop (Cloudera cdh3u6). A particular user operation should launch a new MapReduce job in the cluster.
The cluster is not a secure cluster, but it uses simple group authentication - so if I ssh to it as myself, I can launch MR jobs from the command line.
In the web application, I'm using the ToolRunner to run my job:
MyMapReduceWrapperClass mr = new MyMapReduceWrapperClass();
ToolRunner.run(mr, null);
// inside the run implementation of my wrapper class :
Job job = new Job(conf, "job title");
//set up stuff removed
job.submit();
Currently this job is submitted as the user that launched the web application server (Tomcat) process, and that user is a special local account on this web server that doesn't have permissions to send jobs to the cluster.
Ideally I'd like to be able to get some kind of identity from the user and pass it along, so that as different users were interacting with the web app / service we could see who was invoking what jobs. Skipping over the issues of how to actually coordinate those credential services, I'm not even clear on where it would go.
I see that a Job has a getCredentials() option, but from reading about the token / Kerberos material in there I have the impression that it is for secured clusters (which I think ours is not), and I don't think my web server has Kerberos installed anyway (that could be fixed, though). It also sounds like the intended use case is to add secrets that a MapReduce job might need while running in order to access other services, not to run the job as someone else.
I also see that the (older?) JobConf class has a setUser(String name) method, which seems promising, even though I don't know where it would require a password or anything; but I can't find much information or documentation on that function. I tried it out and it had no effect: the job was still submitted as the Tomcat user.
Are there other avenues to explore or research? I am out of key words to Google. I would prefer to not have the option "Just give your tomcat user permissions on the cluster" - I don't manage that asset and I don't expect that request to fly. If however that literally is my only option I'd like to understand why that is, so that I can argue the need, having the right information.
You can use the UserGroupInformation class like this:
import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

UserGroupInformation ugi = UserGroupInformation.createRemoteUser(username);
ugi.doAs(new PrivilegedExceptionAction<MyMapReduceWrapperClass>() {
    public MyMapReduceWrapperClass run() throws Exception {
        MyMapReduceWrapperClass mr = new MyMapReduceWrapperClass();
        ToolRunner.run(mr, null);   // the job is submitted as 'username' rather than the Tomcat user
        return mr;
    }
});
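On a simple-auth cluster like the one described, createRemoteUser is usually enough. For completeness, a hedged sketch of the Kerberized variant, where the web app authenticates with its own keytab and impersonates the end user as a proxy user; the principal and keytab path are hypothetical, and the cluster must allow impersonation via the hadoop.proxyuser.* settings:

import java.security.PrivilegedExceptionAction;
import org.apache.hadoop.security.UserGroupInformation;

UserGroupInformation realUser = UserGroupInformation.loginUserFromKeytabAndReturnUGI(
        "webapp/host@EXAMPLE.COM", "/path/to/webapp.keytab");   // hypothetical principal and keytab
UserGroupInformation proxyUgi = UserGroupInformation.createProxyUser(username, realUser);
proxyUgi.doAs(new PrivilegedExceptionAction<Void>() {
    public Void run() throws Exception {
        // submit the job here; it runs as 'username' while authenticating as the web app's principal
        return null;
    }
});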

Resources