Launching map-reduce job as a different user - hadoop

Is there any way to launch a map-reduce job as a different user, without using secure impersonation? The target cluster may or may not be secured.
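If the cluster runs with simple authentication (no Kerberos), the client-supplied user name is trusted, so a job can be submitted under another identity with a remote-user UGI and doAs; on a secured cluster the same pattern needs real credentials (for example a keytab login) for that user. A minimal sketch, assuming an unsecured target cluster; the user name and paths are illustrative:

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.security.UserGroupInformation;

    public class SubmitAsOtherUser {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // With simple authentication the cluster trusts the client-supplied name,
            // so a remote-user UGI is enough to submit the job as "batchuser".
            // On a Kerberos-secured cluster you would instead need real credentials, e.g.
            // UserGroupInformation.loginUserFromKeytabAndReturnUGI("batchuser@REALM", "/path/to.keytab").
            UserGroupInformation ugi = UserGroupInformation.createRemoteUser("batchuser");

            ugi.doAs((PrivilegedExceptionAction<Void>) () -> {
                Job job = Job.getInstance(conf, "job-as-batchuser");
                job.setJarByClass(SubmitAsOtherUser.class);
                FileInputFormat.addInputPath(job, new Path("/data/in"));    // illustrative paths
                FileOutputFormat.setOutputPath(job, new Path("/data/out"));
                job.waitForCompletion(true);
                return null;
            });
        }
    }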

Related

How does impersonation in hadoop work

I am trying to understand how impersonation works in hadoop environment.
I found a few resources like:
About doAs and proxy users- hadoop-kerberos-guide
and about tokens- delegation-tokens.
But I was not able to connect all the dots with respect to the full flow of operations.
My current understanding is:
1. The user does a kinit and executes an end-user-facing program like beeline, spark-submit, etc.
2. The program is app-specific and gets service tickets for HDFS.
3. It then gets tokens for all the services it may need during the job execution and saves the tokens in an HDFS directory.
4. The program then connects to a job executor (using a service ticket for the job executor??), e.g. YARN, with the job info and the token path.
5. The job executor gets the tokens, initializes a UGI, and all further communication with HDFS is done using the tokens; Kerberos tickets are not used.
Is the above high-level understanding correct? (I have more follow-up queries.)
Can the token mechanism be skipped and only Kerberos used at each layer? If so, any resources would help.
My final aim is to write a Spark connector with impersonation support for a data storage system which does not use Hadoop tokens but supports Kerberos.
Thanks & regards
-Sri
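For reference, the proxy-user ("doAs") pattern described in the hadoop-kerberos-guide linked above looks roughly like the sketch below. It assumes the service principal has been whitelisted via the hadoop.proxyuser.* settings; the principal, keytab path, and user names are illustrative.

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserExample {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();

            // The long-running service logs in with its own Kerberos identity
            // (principal and keytab path are illustrative).
            UserGroupInformation service =
                UserGroupInformation.loginUserFromKeytabAndReturnUGI(
                    "svc/host@EXAMPLE.COM", "/etc/security/keytabs/svc.keytab");

            // It then impersonates the end user. The NameNode only authorizes this if
            // hadoop.proxyuser.svc.hosts and hadoop.proxyuser.svc.groups allow it.
            UserGroupInformation proxy = UserGroupInformation.createProxyUser("enduser", service);

            proxy.doAs((PrivilegedExceptionAction<Void>) () -> {
                // Everything in here runs as "enduser"; HDFS checks permissions against
                // that identity, and any delegation tokens fetched here belong to it.
                FileSystem fs = FileSystem.get(conf);
                fs.listStatus(new Path("/user/enduser"));
                return null;
            });
        }
    }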

Why is another user required to run Hadoop?

I have a question regarding hadoop configuration
Why do we need to create a user for running Hadoop? Can we not run Hadoop as the root user?
Yes, you can run it as root.
It is not a requirement to have a dedicated user for Hadoop, but having one with fewer privileges than root is considered good practice. It helps separate the Hadoop processes from other services running on the same machine.
This is not Hadoop-specific; it is common good practice in IT to have dedicated users for running daemons, for security reasons (for example, in Hadoop, if you run the MapReduce daemons as root, a malicious user could launch a MapReduce job that deletes not only HDFS data but also operating system data), for better control, etc. Take a look at this:
https://unix.stackexchange.com/questions/29159/why-is-it-recommended-to-create-a-group-and-user-for-some-applications
It is not at all required to create a new user to run Hadoop. Also, the Hadoop user need not be (and should not be) in the sudoers file or a root user [ref]. Your login user for the machine can also act as the Hadoop user. But as mentioned by @Luis and @franklinsijo, it is good practice to have a specific user for a specific service.

Difference between job, application, task, task attempt logs in Hadoop, Oozie

I'm running an Oozie job with multiple actions and there's a part I could not make work. In the process of troubleshooting I'm overwhelmed by lots of logs.
In the YARN UI (yarn.resourcemanager.webapp.address in yarn-site.xml, normally on port 8088), there are the application_<app_id> logs.
In the Job History Server (yarn.log.server.url in yarn-site.xml, ours on port 19888), there are the job_<job_id> logs. (These job logs should also show up in Hue's Job Browser, right?)
In Hue's Oozie workflow editor, there are the task and task_attempt logs (not sure if they're the same; everything's a mixed-up soup to me already), which redirect to the Job Browser if you click here and there.
Can someone explain the difference between these things from a Hadoop/Oozie architectural standpoint?
P.S.
I've seen container_<container_id> in the logs as well. Might as well include this in your explanation, in relation to the things above.
In terms of YARN, the programs that are run on a cluster are called applications. In terms of MapReduce they are called jobs. So, if you are running MapReduce on YARN, job and application are the same thing (if you take a close look, job ids and application ids are the same).
A MapReduce job consists of several tasks (they could be either map or reduce tasks). If a task fails, it is launched again on another node; those are task attempts.
A container is a YARN term. It is a unit of resource allocation. For example, a MapReduce task would run in a single container.
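As a rough illustration of how these identifiers line up (the timestamp and sequence values below are made up), the different ids are just different prefixes and suffixes around the same cluster timestamp and sequence number:

    // Each id hangs off the same cluster timestamp and sequence number,
    // which is why a MapReduce job id and its YARN application id "match".
    public class IdNaming {
        public static void main(String[] args) {
            long clusterTimestamp = 1408112819488L; // made-up value
            int sequence = 2;

            String applicationId = String.format("application_%d_%04d", clusterTimestamp, sequence);
            String jobId         = String.format("job_%d_%04d", clusterTimestamp, sequence);
            String taskId        = jobId.replace("job_", "task_") + "_m_000000";   // first map task
            String attemptId     = taskId.replace("task_", "attempt_") + "_0";     // its first attempt

            System.out.println(applicationId); // application_1408112819488_0002
            System.out.println(jobId);         // job_1408112819488_0002
            System.out.println(taskId);        // task_1408112819488_0002_m_000000
            System.out.println(attemptId);     // attempt_1408112819488_0002_m_000000_0
            // Container ids follow the application id too, e.g. container_1408112819488_0002_01_000001.
        }
    }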

User Accounts for Hadoop Daemons

I read the Hadoop in Secure Mode documentation. Right now all my daemons are running under one single account. The doc suggests running different daemons under different accounts. What is the purpose of doing that?
Link to Hadoop in Secure Mode Doc
It's generally good practice to use separate dedicated service accounts for different server processes where possible. This limits attack surface in the event that an attacker compromises one of the processes. For example, if an attacker compromised process A, then the attacker could do things like access files owned by the account running process A. If process B used the same account as process A, then files created by process B would also be compromised. By using a separate account for process B, we can limit the impact of a vulnerability.
Aside from that general principle, there are other considerations specific to the implementation of Hadoop that make it desirable to use separate accounts.
HDFS has a concept of the super-user. The HDFS super-user is the account that is running the NameNode process. The super-user has special privileges to run HDFS administration commands and access all files in HDFS, regardless of the permission settings on those files. YARN and MapReduce daemons do not require HDFS super-user privilege. They can operate as an unprivileged user of HDFS, accessing only files for which they have permission. Running everything with the same account would unintentionally escalate privileges for the YARN and MapReduce daemons.
When running in secured mode, the YARN NodeManager utilizes the LinuxContainerExecutor to launch container processes as the user who submitted the YARN application. This works by using a special setuid executable, which allows the user running the NodeManager to switch to running a process as the user who submitted the application. This ensures that users submitting applications cannot escalate privileges by running code in the context of a different user account. However, setuid executables themselves are powerful tools that can cause privilege escalation problems if used incorrectly. The LinuxContainerExecutor documentation describes very specific steps to take in setting the permissions and configuration for this setuid executable. If a separate account was not used for running the YARN daemons, then this setuid executable would have to be made accessible to a larger set of accounts, which would increase the attack surface.

How to use custom pool assignment for FairScheduler in Hadoop?

I am trying to take advantage of multiple pools in FairScheduler. But all my jobs are submitted by a single agent process and therefore all belong to the same user.
I have set mapred.fairscheduler.poolnameproperty to scheduler.pool.name and then in each job I set "scheduler.pool.name" to a specific pool from pools.xml that I want to use for that job.
I can see on the job configuration web page that both properties have the expected values, and the scheduler web page shows all the pools I am trying to use. However, all jobs are still running in the pool %username%, where username is the name of the user that was used to submit them.
I am running Hadoop version 0.20.1 from the Cloudera distribution.
Any ideas how to make my jobs run in a pool that does not depend on the name of the user who submitted the job?
It looks like restarting the JobTracker was not sufficient to apply the new configuration. After restarting all TaskTrackers and the JobTracker, pool assignment works as expected.
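For reference, a minimal sketch of how a job can set the per-job pool property with the classic 0.20 API, assuming mapred.fairscheduler.poolnameproperty is set to scheduler.pool.name on the JobTracker as described in the question; the pool name and paths are illustrative:

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitToPool {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitToPool.class);
            conf.setJobName("report-job");

            // mapred-site.xml on the JobTracker is assumed to contain
            //   mapred.fairscheduler.poolnameproperty = scheduler.pool.name
            // so the FairScheduler reads the pool from this per-job property
            // instead of defaulting to the submitting user's name.
            conf.set("scheduler.pool.name", "reports"); // pool defined in pools.xml (illustrative)

            FileInputFormat.setInputPaths(conf, new Path("/data/in"));  // illustrative paths
            FileOutputFormat.setOutputPath(conf, new Path("/data/out"));

            JobClient.runJob(conf);
        }
    }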

Resources