Running a map reduce job as a different user - hadoop

I have a web application which interacts with Hadoop. (Cloudera cdh3u6) A particular user operation should launch a new Map Reduce job in the cluster.
The cluster is not a secure cluster, but it uses simple group authentication - so if I ssh to it as myself, I can launch MR jobs from the command line.
In the web application, I'm using the ToolRunner to run my job:
MyMapReduceWrapperClass mr = new MyMapReduceWrapperClass();
ToolRunner.run(mr, null);

// inside the run() implementation of my wrapper class:
Job job = new Job(conf, "job title");
// set-up stuff removed
job.submit();
Currently this job is submitted as the user that launched the web application server (Tomcat) process, and that user is a special local account on this web server that doesn't have permissions to send jobs to the cluster.
Ideally I'd like to be able to get some kind of identity from the user and pass it along, so that as different users interact with the web app / service we can see who is invoking which jobs. Setting aside how to actually coordinate those credentials, I'm not even clear on where such an identity would go.
I see that a Job has a getCredentials() method, but from reading about the token / Kerberos material there I have the impression it is meant for secured clusters (which I think ours is not) - not to mention that I don't think my web server has Kerberos installed. That could be fixed, though. But it also sounds like the intended use case is to add secrets that a map reduce job might need while running in order to access other services - not to run the job as someone else.
I also see that the (older?) JobConf class has a setUser(String name) method, which seems promising - even though I don't know where it would require a password or the like - but I can't find much documentation on it. I tried it out and it had no effect; the job was still submitted as the Tomcat user.
Are there other avenues to explore or research? I am out of keywords to Google. I would prefer not to settle for "just give your Tomcat user permissions on the cluster" - I don't manage that asset and I don't expect that request to fly. If that literally is my only option, though, I'd like to understand why, so that I can argue the need with the right information.

You can use the UserGroupInformation class like this:
UserGroupInformation ugi = UserGroupInformation.createRemoteUser(username);
ugi.doAs(new PrivilegedExceptionAction<Object>() {
    public Object run() throws Exception {
        MyMapReduceWrapperClass mr = new MyMapReduceWrapperClass();
        ToolRunner.run(mr, null);
        return mr;
    }
});
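To tie this back to the web application, here is a sketch of how the username might be obtained from the incoming request and handed to that snippet. It is only an illustration: it assumes the web container authenticates the caller (so getRemoteUser() is non-null); otherwise substitute whatever identity your application tracks, and MyMapReduceWrapperClass is the Tool from the question.
import java.security.PrivilegedExceptionAction;
import javax.servlet.http.HttpServletRequest;
import org.apache.hadoop.security.UserGroupInformation;
import org.apache.hadoop.util.ToolRunner;

public class JobSubmissionHelper {

    // Submits the MR job as the identity of the web user instead of the Tomcat account.
    public void launchJobFor(HttpServletRequest request) throws Exception {
        // getRemoteUser() is only non-null when the container has authenticated the caller;
        // otherwise substitute whatever identity your application tracks for the current user.
        String username = request.getRemoteUser();
        UserGroupInformation ugi = UserGroupInformation.createRemoteUser(username);
        ugi.doAs(new PrivilegedExceptionAction<Integer>() {
            public Integer run() throws Exception {
                // Same submission code as above, now executed as the remote user.
                return ToolRunner.run(new MyMapReduceWrapperClass(), null);
            }
        });
    }
}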

Related

How to use the ResourceManager web interface as a user

Every time I try to use the Hadoop ResourceManager web interface (http://resource-manger.host:8088/cluster/) I show up logged in as dr.who.
My question: how can I log in as another user? In this case I want to log in as myself and have a higher level of privileges than dr.who.
The user information is obtained from HttpServletRequest#getRemoteUser().
1. If you deployed an insecure cluster, the simplest way to pass the username to the server is via a URL parameter. For example, http://localhost:8088/cluster?user.name=babu
2. If you deployed a secure cluster, you probably use Kerberos authentication. You can use kinit to get a Kerberos TGT, then configure the browser to negotiate (network.negotiate-auth.trusted-uris for Firefox, and --auth-server-whitelist for Chromium; there are plenty of answers about this elsewhere).
For more information, see the official Hadoop documentation: https://hadoop.apache.org/docs/r2.7.2/hadoop-project-dist/hadoop-common/HttpAuthentication.html
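For reference, the user.name parameter only applies with the "simple" HTTP authentication type; a minimal core-site.xml sketch based on that documentation could look like the following (the values shown are assumptions for an insecure cluster, not a recommended setup):
<property>
  <name>hadoop.http.authentication.type</name>
  <!-- "simple" (pseudo) authentication; would be "kerberos" on a secured cluster -->
  <value>simple</value>
</property>
<property>
  <name>hadoop.http.authentication.simple.anonymous.allowed</name>
  <!-- when false, web requests must carry ?user.name=... instead of showing up as the anonymous/static user -->
  <value>false</value>
</property>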
You should set the access control list by changing the default configuration of
yarn.resourcemanager.zk-acl
from the default
world:anyone:rwcda
to something cluster-specific.
This property controls the ACLs the ResourceManager uses on the znode structure where it stores its internal state.
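A sketch of what that change could look like in yarn-site.xml; the value below only illustrates the scheme:id:permissions format and is an assumption, so pick an ACL that matches your cluster's ZooKeeper security setup:
<property>
  <name>yarn.resourcemanager.zk-acl</name>
  <!-- default is world:anyone:rwcda; restrict it, e.g. to a sasl or digest ACL -->
  <value>sasl:rm:rwcda</value>
</property>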

How to pass parameters from web requests to a Spring Boot YARN application

I'm using spring-boot and spring-boot-yarn to submit yarn applications to a cluster.
My use-case is close to the one described in this tutorial https://github.com/spring-guides/gs-yarn-basic.
The only difference is that my 'client' is supposed to be a web application and submit the yarn jobs when web requests are made.
The problem I have is that web requests to the 'client' web-application provide parameters I need to pass down to the yarn job.
In the above tutorial, parameters are passed as command-line arguments to the appmaster / container, specified in application.yml. In my case this approach does not work since I have a different set of parameters for each YARN job.
Is there a way to pass dynamic parameters to yarn jobs without hard-coding them in application.yml?
The original idea was to prevent "rogue" users or applications from passing properties that would then automatically end up in command-line options, potentially doing harm within a Hadoop cluster.
It's worth checking my answer in Spring Boot Yarn - Passing Command line arguments to see if this is what you want.
Having said that, you are not the first person to ask this or to "complain" that it is too difficult or unclear how to do it. We're going to make this much easier in future releases, mostly because it just seems to be what users want to do.
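For context, a rough sketch of what a web-facing client looks like with spring-boot-yarn, following the gs-yarn-basic guide linked in the question. The controller wiring, mapping path, and parameter name here are assumptions, and the YarnClient bean is assumed to be auto-configured as in the guide; note that submitApplication() takes no per-request arguments, which is exactly the limitation discussed in the linked answer.
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.web.bind.annotation.RequestMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;
import org.springframework.yarn.client.YarnClient;

@RestController
public class JobSubmitController {

    @Autowired
    private YarnClient yarnClient; // assumed to be auto-configured by spring-boot-yarn, as in the guide

    @RequestMapping("/submit")
    public String submit(@RequestParam("someArg") String someArg) {
        // The launch context arguments still come from application.yml at this point,
        // so someArg cannot simply be handed to the appmaster/container here.
        yarnClient.submitApplication();
        return "submitted";
    }
}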

Open a JDBC connection in a specific AS400 subsystem

I have a web service that calls some stored procedures on an AS400 via JTOpen.
What I would like is for the connections used to call the stored procedures to be opened in a specific subsystem with a specific user, instead of the default qusrwrk/quser.
I think I should be able to clone the qusrwrk subsystem so it starts with a specific user, but what I cannot figure out is the mechanism to open the connection in the specific subsystem.
I guess there should be a property at connection level to say subsystem=MySubsystem.
But unfortunately I haven't found that property.
Any hint would be appreciated.
Flavio
Let the system take care of the subsystem the database server job is started in.
You should just focus on the application (which is what IBM i excels in).
If need be, you can tweak subsystem parameters for QUSRWRK to improve performance by allocating memory, etc.
The system uses a pool of prestarted jobs as described in the FAQ: When I do WRKACTJOB, why is the host server job running under QUSER instead of the profile specified on the AS400 object?
To improve performance, the host server jobs are prestarted jobs running under QUSER. When the Toolbox connects to a host server job in order to perform an API call, run a command, etc, a request is sent from the Toolbox to an available prestarted job. This request includes the user profile specified on the AS400 object that represents the connection. The host server job receives the request and swaps to the specified user profile before it runs the request. The host server itself originally runs under the QUSER profile, so output from the WRKACTJOB command will show the job as being owned by QUSER. However, the job is in fact running under the profile specified on the request. To determine what profile is being used for any given host server job, you can do one of three things:
1. Display the job log for that job and find the message indicating which user profile is used as a result of the swap.
2. Work with the job and display job status attributes to view the current user profile.
3. Use Navigator for i to view all of the server jobs, which will list the current user of each job. You can also use Navigator for i to look at the server jobs being used by a particular user.
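For illustration, this is roughly what a JTOpen JDBC connection looks like: the user profile is specified on the connection (or on the AS400 object), and the prestarted QUSRWRK/QUSER host server job swaps to it, so there is no subsystem=... style connection property to set. Host name and credentials below are placeholders.
import java.sql.Connection;
import java.sql.DriverManager;

public class As400ConnectionExample {
    public static void main(String[] args) throws Exception {
        // Register the JTOpen JDBC driver.
        Class.forName("com.ibm.as400.access.AS400JDBCDriver");

        // The user profile given here is the one the prestarted host server job swaps to;
        // the job itself still shows up under QUSER in WRKACTJOB.
        Connection conn = DriverManager.getConnection(
                "jdbc:as400://MYSYSTEM", "MYUSER", "MYPASSWORD");

        // ... call stored procedures via conn.prepareCall(...) ...
        conn.close();
    }
}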

How to get Kerberos instead of delegation token in Hadoop mapReduce?

I'm a Java user. When I submit a job to Hadoop MapReduce, Kerberos is used to authenticate to Hadoop, and upon success a delegation token is created and passed with the job submission instead of the Kerberos ticket (for security reasons, as stated by Hadoop). The job now runs as me, but the job itself needs to use Kerberos to send requests to services outside Hadoop. Since I don't have a Kerberos TGT on Hadoop, I can't get the service ticket.
Is there any way I can pass the Kerberos ticket with the job? (I know it might be dangerous, since we don't want to pass the secret around.) JobConf can pass string-to-string pairs to Hadoop, but would I have to convert the TGT to a JSON string and convert it back while the job is running?
Or is it possible to use the delegation token to reform the TGT?
I tried to Google this but found little information. Can anyone help? Thank you.
**Edited:**
It looks like there is no easy way of doing this without passing the TGT to Hadoop, so I am going to try passing the TGT as a string via the job config map (strings only) and converting the string back into a TGT object when the job runs on Hadoop. The concern is that I will be passing credentials over the network, which is not best practice and is one of the very reasons Hadoop does not pass Kerberos tickets around. If I can reuse the reformed TGT passed to Hadoop to get service tickets, I will try to encrypt the TGT string as much as possible to limit the security exposure.
So before starting a job on the local machine, the code would look like:
import sun.security.krb5.Credentials;
Credentials tgt = Credentials.acquireTGTFromCache(null, null); // Make sure kinit is done before this
String tgtStr = tgt.convertToJsonString(); //Need to implement this
Job job = new Job("Test");
JobConf jobConf = job.getJobConf();
jobConf.set("tgtStr", tgtStr);
job.addTask(Test.class, "run", null);
job.submit();
job.waitForCompletion(true);
Then the function in the job for Hadoop to run would be like:
Configuration conf = TaskContext.get().getConfiguration();
String tgtStr = conf.get("tgtStr");
Credentials tgt = reformTGTFromString(tgtStr);//Need to implement this
Credentials serviceTicket = Credentials.acquireServiceCreds(servicePrincipal, tgt); //This is to get any service ticket
So I need to implement two functions: one to stream the TGT object (Credentials) to a string, and one to reform it back into an object.
Anyone knows a better solution for this? Thanks.
Please see the design at http://carfield.com.hk:8080/document/distributed/hadoop-security-design.pdf, if you have not done so already.
Or is it possible to use the delegation token to reform the TGT?
No, delegation tokens are issued by the Hadoop NameNode and, while they are based on Kerberos authentication, they are independent of it; you cannot derive the Kerberos TGT from them.
In the original design we considered using Kerberos alone (without any additional tokens), which would have made your plan easy, but decided against it for these reasons:
1. Performance: thousands of M/R tasks may need to get Kerberos tickets at the same time.
2. Kerberos credentials need to be renewed before they expire; for scheduled jobs this would be an issue.
3. Delegation tokens don't depend on Kerberos and can be coupled with non-Kerberos authentication mechanisms (such as SSL) used at the edge.
In your case, you can use a private distributed cache and send the forwardable TGT. I think this will be OK, but I need to think about it some more. Obviously you need to make sure your implementation is secure: your tickets should have the minimum necessary lifetime, IP channel bindings should be used if possible, and the use of the tickets should be restricted to authorized processes.
By disassembling the Credentials fields and converting them to Strings with a Base64 encoder, forming a JSON string, passing it to Hadoop via the config map or the distributed cache suggested by RVM, and then reforming the Credentials object in the job running on Hadoop, I can get back the Kerberos TGT and successfully obtain any service ticket with it. So this method works; the one thing to be very cautious about is encrypting the keys that are passed over the network.
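As an illustration of the encode/decode step described above (a minimal sketch, not the exact code used): it relies only on java.util.Base64 from the standard JDK (Java 8+) and assumes you have already pulled the raw ticket material out of the credentials object, for example the ASN.1 ticket bytes and the session key. The JSON assembly and the reconstruction of the credentials object are left out, since they depend on which Kerberos classes you use.
import java.util.Base64;

public class TicketCodec {

    // Encode one raw field of the ticket (e.g. the encoded ticket bytes or the session key)
    // so it can be carried inside a string-only config map or a JSON document.
    public static String toBase64(byte[] raw) {
        return Base64.getEncoder().encodeToString(raw);
    }

    // Reverse step, run inside the task before the credentials object is rebuilt.
    public static byte[] fromBase64(String encoded) {
        return Base64.getDecoder().decode(encoded);
    }
}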
First of all, your account has to have delegation enabled, and the service ticket has to request a forwardable ticket. If that is all true, Hadoop has to retrieve the delegated credential from the GSSContext and construct a new one on your behalf. With that new TGT it will be able to perform further steps. Use Wireshark to check the ticket sent to Hadoop.

Spring Batch Admin: Schedule new jobs through web GUI

A newbie question on Spring Batch Admin.
My requirement is that the user should be able to schedule new jobs (passing some parameters for the job functionality) through a web UI. These jobs should be persistent and repetitive, and it should be possible to cancel or delete them. Also, it should be possible to generate a report of the last run jobs and to list all existing jobs with their next run dates.
Perhaps my most important requirement is that this should be possible "on the fly", without redeploying the web application or restarting the server.
Can this be done using Spring Batch Admin? (I see that the guide talks about uploading an XML file to add a job, but that seems tedious; if there is an API, why shouldn't we be able to create a job on the fly through the Batch Admin web UI?) Or does JDK Timer or Quartz support it?
Once a job has been created, it can't be deleted, but it can be stopped. Allowing deletion from the DB is a risky operation: Spring Batch might have already started the job execution while the DB has not been updated yet, and if the job is removed at that moment you end up with an inconsistency.
Scheduling a new job is described in Launch Job. It is not possible to create new types of jobs, as jobs can generally have complicated configuration which is parsed only once, when the Spring context is loaded.
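For the launching side, here is a minimal sketch using the standard Spring Batch API; the class, field, and parameter names are assumptions for illustration, and the Job itself must already be defined in the application context, which is why new job types cannot be created on the fly.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobParameters;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;

public class JobLaunchExample {

    private final JobLauncher jobLauncher;
    private final Job reportJob; // a job already defined in the Spring context

    public JobLaunchExample(JobLauncher jobLauncher, Job reportJob) {
        this.jobLauncher = jobLauncher;
        this.reportJob = reportJob;
    }

    public JobExecution launch(String reportDate) throws Exception {
        // Parameters supplied at launch time, e.g. from a web request or a scheduler trigger.
        JobParameters params = new JobParametersBuilder()
                .addString("reportDate", reportDate)
                .addLong("timestamp", System.currentTimeMillis()) // makes the job instance unique
                .toJobParameters();
        return jobLauncher.run(reportJob, params);
    }
}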
Dynamic deployment (on the fly) of jobs and configurations, without requiring a server restart, is a feature we implemented in Trooper Batch Profile - it is not exactly Spring Batch Admin but builds on it. You continue to write your jobs using Spring Batch; only the container changes, since in Trooper you would use its Batch profile runtime. Screenshots and features are here: https://github.com/regunathb/Trooper/wiki/Writing-Batch-jobs-in-Trooper
I think we can deploy each Spring Batch job as its own SBA instance. I mean each batch job would be compiled into a WAR file, and we deploy them together on the server. In this way, we have the following URLs to monitor the jobs:
http://bactchjobserver/job1
http://bactchjobserver/job2
http://bactchjobserver/job3
http://bactchjobserver/job4
But the downside is that each WAR file inevitably contains its own lib files, which makes each WAR around 10MB in size.
At the same time, I tried manually adding new-job.xml to war-file\WEB-INF\classes\META-INF\spring\batch\jobs, and new-job.jar to war-file\WEB-INF\lib, without stopping JBoss. It works: the new job shows up in the SBA UI and is runnable.
But obviously this would require a lot of maintenance and troubleshooting, so it is not really a workable approach.
