How to use OozieClient.doAs - hadoop

I'm trying to start an Oozie workflow from a web service. One of the actions should delete and create some folders. Concretely, I want to empty a folder before starting a Java action (which actually serves as a driver for a MapReduce job). I know that the Java action has a "prepare" section where you can specify a path to delete, but I need to delete all the files in a folder while keeping the folder itself. That's why I'm using an fs action to delete the folder and then recreate it.
The problem is that when I run this using oozieClient.run, I get an exception about permissions, since I'm running the workflow as the root user.
I found that I can use OozieClient.doAs to impersonate a specific user, but for some reason I'm not able to use it: I get an internal Oozie exception.
Can anyone show me how to run a workflow as a specific user, or at least point me to a good example?

One way to run an Oozie job as a specific user is to secure Oozie with Kerberos against Active Directory. You can then authenticate as any Active Directory user and run the Oozie job as that authenticated user.
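As for the API itself, OozieClient.doAs wraps a Callable so that the client calls inside it are issued as the given user. Below is a minimal sketch; the Oozie URL, application path, and user name are placeholders, and on a secured Oozie server the calling user must also be allowed to impersonate others (typically via the oozie.service.ProxyUserService.proxyuser.* settings in oozie-site.xml), otherwise the server will reject the request:

    import java.util.Properties;
    import java.util.concurrent.Callable;

    import org.apache.oozie.client.OozieClient;

    public class RunWorkflowAsUser {
        public static void main(String[] args) throws Exception {
            // Placeholder URL and application path.
            final OozieClient client = new OozieClient("http://oozie-host:11000/oozie");

            final Properties conf = client.createConfiguration();
            conf.setProperty(OozieClient.APP_PATH, "hdfs://namenode/user/someuser/app");
            conf.setProperty(OozieClient.USER_NAME, "someuser"); // the user to run the workflow as

            // Everything called inside doAs is sent to the server as "someuser".
            String jobId = OozieClient.doAs("someuser", new Callable<String>() {
                public String call() throws Exception {
                    return client.run(conf);
                }
            });
            System.out.println("Started workflow " + jobId);
        }
    }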

Related

Nifi path is invalid

I am running a NiFi server. I have a ListFile processor that works perfectly fine on this path: /tmp/nifi/info/daily. There I can run the processor without any issue.
For a specific reason I had to create another ListFile, whose input directory is /tmp/nifi/info/last_month. When I add this second path, NiFi says the path doesn't exist.
I checked the permissions with ls -l and they are exactly the same, with the same user and group, so I'm confused:
drwxr-xr-x. 2 nifi hadoop
I even tried restarting NiFi, but that didn't help. Is there any way to test (other than trying input paths in the config) which access NiFi has? Why doesn't it see the folder?
Thanks.
As Ben Yaakobi mentioned, I had forgotten to create the folder on every node.

How to run spark-jobs outside the bin folder of spark-2.1.1-bin-hadoop2.7

I have an existing spark-job whose function is to connect to a Kafka server, fetch the data, and store it in Cassandra tables. This spark-job currently runs on the server from inside spark-2.1.1-bin-hadoop2.7/bin, but whenever I try to run this spark-job from another location, it does not run. The spark-job contains some JavaRDD-related code.
Is there any chance I can also run this spark-job from outside that directory, by adding a dependency in the pom or something else?
whenever I try to run this spark-job from another location, it does not run
spark-job is a custom launcher script for a Spark application, perhaps with some additional command-line options and packages. Open it, review the content and fix the issue.
If it's too hard to figure out what spark-job does and there's no one nearby to help you out, it's likely time to throw it away and replace it with the good ol' spark-submit.
Why don't you use it in the first place?!
Read up on spark-submit in Submitting Applications.
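Since the question mentions adding a dependency in the pom, one option worth noting: Spark also ships a spark-launcher module (artifact spark-launcher_2.11 for Spark 2.1.x) that wraps spark-submit programmatically, so the job can be started from any directory as long as the Spark home is known. A rough sketch, with every path and class name a placeholder:

    import org.apache.spark.launcher.SparkAppHandle;
    import org.apache.spark.launcher.SparkLauncher;

    public class LaunchFromAnywhere {
        public static void main(String[] args) throws Exception {
            SparkAppHandle handle = new SparkLauncher()
                    .setSparkHome("/opt/spark-2.1.1-bin-hadoop2.7")   // placeholder install path
                    .setAppResource("/path/to/your-spark-job.jar")    // placeholder application jar
                    .setMainClass("com.example.KafkaToCassandraJob")  // placeholder main class
                    .setMaster("local[*]")
                    .startApplication();

            // Wait until the launched application reaches a terminal state.
            while (!handle.getState().isFinal()) {
                Thread.sleep(1000);
            }
            System.out.println("Final state: " + handle.getState());
        }
    }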

Better way to store password during oozie spark job workflow

I have an Oozie workflow which executes a Spark job that needs usernames and passwords to connect to various servers. Right now I pass them in the workflow.xml as arguments:
username
password
It's (of course) a bad way to do this as it makes the password visible. What is the standard way to obfuscate the password in such a case?
Thanks!
Sqoop is an interesting case, as you can see in its documentation:
- at first there was just the --password command-line option, followed by the password in plain text (yuck!)
- then --password-file was introduced, pointing to a file that contains the password; it's a clear improvement because (a) when running on an edge node, the password itself is not visible to anyone running a ps command, and (b) when running in Oozie, you can upload the file once to HDFS, then tell Oozie to download it to the CWD of the YARN container running your job with a <file> option, so the password is not visible to anyone who inspects the job definition. But don't forget to restrict access to that file, both on the edge node and on HDFS, otherwise the password could still be compromised.
- finally, an optional Credential Store was introduced in Hadoop, and Sqoop supports it natively (although you now have the issue of protecting the password you use to connect to the Credential Store...); see the sketch just below
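For the Credential Store route, reading a protected secret from Java goes through Hadoop's Configuration. A brief sketch in which the provider path and alias name are purely illustrative (the alias would be created beforehand with the hadoop credential create command):

    import org.apache.hadoop.conf.Configuration;

    public class ReadStoredPassword {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Normally set once in core-site.xml; shown inline here for completeness.
            conf.set("hadoop.security.credential.provider.path",
                     "jceks://hdfs/user/me/credentials.jceks"); // illustrative provider path
            char[] password = conf.getPassword("db.password.alias"); // illustrative alias
            if (password != null) {
                // ... use the password, then wipe it from memory
                java.util.Arrays.fill(password, '\0');
            }
        }
    }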
Similarly, for Spark (and also for any custom Java / Scala / Python app) I strongly suggest that you store all sensitive information in a "properties" file, restrict access to that file, then pass it as a command-line argument to your program.
It will also make your life easier if you have distinct Dev / Test / Prod environments -- the Oozie script and "properties" filename will be exactly the same, but the actual props will be environment-specific.
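A minimal sketch of the properties-file approach; the file path comes in as the first program argument (for example shipped to the container with a <file> element in the Oozie action), and the key names are purely illustrative:

    import java.io.FileInputStream;
    import java.io.InputStream;
    import java.util.Properties;

    public class LoadJobCredentials {
        public static void main(String[] args) throws Exception {
            // args[0] is the path to a restricted-access properties file.
            Properties props = new Properties();
            try (InputStream in = new FileInputStream(args[0])) {
                props.load(in);
            }
            String user = props.getProperty("db.username");     // illustrative key
            String password = props.getProperty("db.password"); // illustrative key
            // ... open connections with user/password instead of hard-coding them
        }
    }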

Jenkins Job (In Windows environment) not able to access Shared locations

I am trying to schedule a batch job in Jenkins (Windows environment) for a Windows EXE program (implemented in .NET).
This program refers to a shared location on the network (viz. \shared network.net\sample path) for reading from and writing to files.
When I run this program independently, outside of Jenkins, it works fine, as it runs under my login, a user who actually has access to the shared path.
However, when I run it through Jenkins, there is an access issue. From my program logs I found that it runs as the 'NT AUTHORITY\SYSTEM' user.
I need to make the Jenkins job run under a particular user's credentials, one that has the relevant access to the shared path.
Please advise.
The Authorize Project Plugin allows you to run a job as a specific user.
Or, if you are executing from a bat script, you should be able to change the user in your script before running your program.
Several options:
Use "net use" to map the network location under the job's session using your credentials.
In your Windows slave you can go to services-> Jenkins slave->properties. there under "Log On" section you can specify the user you want the service to run under.
I would definitely go with the first option as it is much more manageable (tomorrow you'll replace your slave and have to do it all over again, instead of just migrating the job and mapping the session again).
Good Luck!

How to run "hadoop jar" as another user?

hadoop jar uses the name of the currently logged-in user. Is there a way to change this without adding a new system user?
There is, through a feature called Secure Impersonation, which lets one user submit on behalf of another (that user must exist, though). If you're running as the Hadoop superuser, it's as simple as setting the environment variable $HADOOP_PROXY_USER.
If you want to impersonate a user which doesn't exist, you'll have to do the above and then implement your own AuthenticationHandler.
If you don't have to impersonate too many users, I find it easiest to just create those users on the namenode and use secure impersonation in my scripts.
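For completeness, the same proxy-user mechanism is also available programmatically through UserGroupInformation. A minimal sketch; the user name is a placeholder, and the real (super)user must be whitelisted through the hadoop.proxyuser.<superuser>.hosts/groups properties in core-site.xml:

    import java.security.PrivilegedExceptionAction;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.security.UserGroupInformation;

    public class ProxyUserSketch {
        public static void main(String[] args) throws Exception {
            final Configuration conf = new Configuration();

            // The logged-in (super)user does the impersonating.
            UserGroupInformation realUser = UserGroupInformation.getLoginUser();
            UserGroupInformation proxyUser =
                    UserGroupInformation.createProxyUser("alice", realUser); // "alice" is a placeholder

            // Hadoop calls made inside doAs are performed as the proxied user.
            proxyUser.doAs((PrivilegedExceptionAction<Void>) () -> {
                FileSystem fs = FileSystem.get(conf);
                System.out.println(fs.exists(new Path("/user/alice")));
                return null;
            });
        }
    }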
