Hadoop: Pseudo Distributed mode for multiple users

I appreciate your help in advance.
I have setup Hadoop in Pseudo Distributed mode using the root user credentials. I want to provide access to multiple users (let us say hadoop1, hadoop2, etc) to be able to submit and run MapReduce jobs on this cluster. How do we get this done?
What I have done so far:
> - Setup Hadoop to run in Pseudo-distributed mode
> - Used "root" user credentials to set this up.
> - Added users hadoop1 and hadoop2 to a group called "hadoop".
> - Added root also to be part of the group "hadoop".
> - Created a folder called hdfstmp and set this as the path for hadoop.tmp.dir.
> - Started the cluster using bin/start-all.sh
> - Ran MapReduce jobs using hadoop1 and hadoop2 users.
I got the error below:
Exception in thread "main" java.io.IOException: Permission denied
at java.io.UnixFileSystem.createFileExclusively(Native Method)
at java.io.File.createNewFile(File.java:1006)
at java.io.File.createTempFile(File.java:1989)
at org.apache.hadoop.util.RunJar.main(RunJar.java:119)
To overcome this error, I gave group "hadoop" rwx permissions to folder hdfstmp. The permissions on this folder look like drwxrwxr-x.
Submitted MapReduce jobs using the hadoop1 and hadoop2 user logins. The jobs ran fine without any errors.
However, if I do a stop-all.sh and then a start-all.sh, the DataNode (and occasionally even the NameNode) does not start up. When I check the logs, I see the error below:
2013-09-21 16:43:54,518 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: Invalid directory in dfs.data.dir: Incorrect permission for /data/hdfstmp/dfs/data, expected: rwxr-xr-x, while actual: rwxrwxr-x
Now, without the change to the group permissions on the hdfstmp directory, MR jobs submitted by different users do not run. But when the NameNode gets restarted, I get the issue above.
How do I overcome this issue? What is the best practice here?
Also, is there a way to monitor the jobs that are being submitted by the different users? I am assuming the Web UI should allow me to do this. Please confirm.
I appreciate any assistance you can provide me on this issue. Thanks.
Regards

Adding a dedicated Hadoop system user
We will use a dedicated Hadoop user account for running Hadoop. While that is not required, it is recommended because it helps to separate the Hadoop installation from other software applications and user accounts running on the same machine (think: security, permissions, backups, etc.).
#addgroup hadoop
#adduser --ingroup hadoop hadoop1
#adduser --ingroup hadoop hadoop2
This will add the users hadoop1 and hadoop2 and the group hadoop to your local machine.
Change the ownership of your Hadoop installation directory (here hadoop1 is assumed to be the user that will start the daemons):
chown -R hadoop1:hadoop hadoop
And lastly, change the permissions on the Hadoop temporary directory.
If your temp directory is /app/hadoop/tmp
#mkdir -p /app/hadoop/tmp
#chown hadoop1:hadoop /app/hadoop/tmp
And if you want to tighten up security, change the chmod from 755 to 750:
#chmod 750 /app/hadoop/tmp
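For the specific conflict in the question above (every submitting user needs write access to the local hadoop.tmp.dir for RunJar, while the DataNode insists on rwxr-xr-x for its data directory), one workable layout is to make only the top-level temporary directory group-writable, leave the dfs subtree owned by the daemon user with 755, and give each user a home directory in HDFS. A rough sketch, assuming Hadoop 1.x, hadoop.tmp.dir pointing at /app/hadoop/tmp, and the daemons started by hadoop1; adjust the user names and paths to your own setup:

# Top-level tmp dir: group hadoop may create files here (setgid keeps the group on new subdirs)
chgrp -R hadoop /app/hadoop/tmp
chmod 2775 /app/hadoop/tmp
# The dfs subtree must stay private to the daemon user, or the DataNode refuses to start
chown -R hadoop1:hadoop /app/hadoop/tmp/dfs
chmod 755 /app/hadoop/tmp/dfs/data

# Give every cluster user a staging area in HDFS (run as the user that started the NameNode)
bin/hadoop fs -mkdir /user/hadoop1 /user/hadoop2
bin/hadoop fs -chown hadoop1:hadoop /user/hadoop1
bin/hadoop fs -chown hadoop2:hadoop /user/hadoop2
# Shared HDFS temp dir with the sticky bit (skip the mkdir if it already exists)
bin/hadoop fs -mkdir /tmp
bin/hadoop fs -chmod 1777 /tmp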

Related

Spark/Hadoop can't read root files

I'm trying to read, through Spark, a file inside a folder that only I (and root) can read/write. First I start the shell with:
spark-shell --master yarn-client
then I run:
val base = sc.textFile("file:///mount/bases/FOLDER_LOCKED/folder/folder/file.txt")
base.take(1)
And got the following error:
2018-02-19 13:40:20,835 WARN scheduler.TaskSetManager:
Lost task 0.0 in stage 0.0 (TID 0, mydomain, executor 1):
java.io.FileNotFoundException: File file: /mount/bases/FOLDER_LOCKED/folder/folder/file.txt does not exist
at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:611)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
...
I suspect that, since YARN/Hadoop was launched by the user hadoop, it can't go into this folder to get the file. How can I solve this?
Note: this folder can't be opened up to other users because it holds private data.
EDIT1: /mount/bases is network storage, mounted over a CIFS connection.
EDIT2: HDFS and YARN were launched by the user hadoop.
Since hadoop is the user that launched HDFS and YARN, it is the user that will try to open the file inside a job, so it must be authorized to access this folder. Fortunately, Hadoop checks which user is executing the job before allowing access to a folder/file, so you are not taking any risk there.
Well, if it had been an access-related issue with the file, you would have got 'access denied' as the error. In this particular scenario, I think the file you are trying to read is not present at all, or has a slightly different name (typo). Just double-check the file name.
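To tell the two cases apart (a file the executors' user cannot read versus a file that really is missing or misnamed), it can help to check the path as the same account that runs the YARN/Spark daemons; remember too that with --master yarn-client the path must be readable on the executor nodes, not just on the machine running the shell. A quick check, assuming those daemons run as the user hadoop as stated in EDIT2 (adjust the user and path to your setup):

# Can the daemon user see and read the file at all?
sudo -u hadoop ls -l /mount/bases/FOLDER_LOCKED/folder/folder/file.txt
sudo -u hadoop head -n 1 /mount/bases/FOLDER_LOCKED/folder/folder/file.txt

# On a CIFS mount the effective permissions come from the mount options
# (uid=, gid=, file_mode=, dir_mode=), so also check how the share was mounted
mount | grep /mount/bases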

spark history server does not show jobs or stages

We are trying to use the Spark history server to further improve our Spark jobs. The Spark job correctly writes its event log into HDFS, and the Spark history server can also access this event log: we do see the job in the Spark history server's job listing, but aside from the environment variables and executors, everything is empty...
Any ideas on how we can make the spark history server show everything (we really want to see the DAG for instance) ?
We are using spark 1.4.1.
Thanks.
I had a similar issue. I browse the history server through SSH port forwarding. After granting read permission on all the files in the log directory, the jobs appeared in my history server!
cd {SPARK_EVENT_LOG_DIR}
chmod +r * # grant the read permission to all users for all files
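If permissions are not the culprit, it is also worth confirming that the applications and the history server point at the same event log directory. A sketch of the relevant settings, assuming the logs live in HDFS under /spark-events (that directory name is only an example) and that $SPARK_HOME points at your Spark 1.4.1 install:

# Event-log directory in HDFS, writable by every user that submits jobs
hadoop fs -mkdir /spark-events
hadoop fs -chmod 1777 /spark-events

# Make the applications and the history server agree on the location
cat >> "$SPARK_HOME/conf/spark-defaults.conf" <<'EOF'
spark.eventLog.enabled           true
spark.eventLog.dir               hdfs:///spark-events
spark.history.fs.logDirectory    hdfs:///spark-events
EOF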

Permission denied issue in mapreduce?

I have tried the command below.
hadoop jar /home/cloudera/workspace/para.jar word.Paras examples/wordcount /home/cloudera/Desktop/words/output
The MapReduce job starts, but then it shows the error below. Can anyone please help with this issue?
15/11/04 10:33:57 INFO mapred.JobClient: Task Id : attempt_201511040935_0008_m_000002_0, Status : FAILED
org.apache.hadoop.security.AccessControlException: Permission denied: user=cloudera, access=WRITE, inode="/":hdfs:supergroup:drwxr-xr-x
Do I need to change anything in a config file or in Cloudera Manager?
The exception suggests that you are trying to write to the HDFS root directory "/", which you (user cloudera) do not have permission to do.
Without knowing what your specific jar does:
I guess that the last argument ("/home/cloudera/Desktop/words/output") is where you wish to place the output.
I guess this is supposed to be within HDFS where /home does not exist.
Try to change this to somewhere where you can write, possibly "/user/cloudera/words/output"
There is a set of default directories to be created before you start using the Hadoop cluster.
Run the following; it should show you those directories:
$ hadoop fs -ls /
For example, if you want to run jobs as the user cloudera, you need the following in HDFS:
/user/cloudera -- the user running the program
/user/hadoop -- your hadoop file system user
/user/mapred -- your mapred user
/tmp -- temporary directory, which needs permission 1777 (chmod 1777, run as the hdfs user)
HTH.
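A sketch of creating those directories, assuming the HDFS superuser is hdfs, as the inode owner in the error above suggests (run once as an administrator; adjust the user names to your cluster):

# Home directory for the submitting user
sudo -u hdfs hadoop fs -mkdir /user/cloudera
sudo -u hdfs hadoop fs -chown cloudera:cloudera /user/cloudera

# Shared temporary directory with the sticky bit
sudo -u hdfs hadoop fs -mkdir /tmp
sudo -u hdfs hadoop fs -chmod 1777 /tmp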
The last argument that you are passing should be an output path in HDFS, not a path on the local file system.
As you are running as the cloudera user, you can point it to /user/cloudera/words/output. But first you need to check whether /user/cloudera exists in HDFS and whether you have write permission on it, by issuing the following:
hadoop fs -ls /user/
Once you have it change your command to following:
hadoop jar /home/cloudera/workspace/para.jar word.Paras examples/wordcount <path_where_you_have_write_permission_in_HDFS>
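For example, once /user/cloudera exists and is writable by you, the job from the question could be rerun with the output path suggested above (the jar, class, and input path are unchanged from the question):

# Confirm the home directory exists and is owned by cloudera
hadoop fs -ls /user/

# Rerun with an HDFS output path the cloudera user can write to
hadoop jar /home/cloudera/workspace/para.jar word.Paras examples/wordcount /user/cloudera/words/output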

"Permission denied" for almost everything after a successful ssh into gcloud instance that was created using bdutil

I just created an instance and deployed a cluster using bdutil. SSH works fine, as I can ssh into the instance using ./bdutil shell.
When I try to access directories such as hadoop, hdfs, etc., it throws an error:
Permission Denied
The terminal prompt appears like this: username@hadoop-m $. I know hadoop-m is the name of the instance. What is the username? It shows my name, but I don't know where it got this from or what the password is.
I am using Ubuntu to ssh into the instance.
I'm not a Hadoop expert, so I can only answer a bit generally. On GCE, when you ssh in, gcloud creates a username from your Google account name. Hadoop directories such as hadoop or hdfs are probably owned by a different user. Try using sudo chmod to give your username permission to read/write the directories you need.
To elaborate on Jeff's answer, bdutil-deployed clusters set up the user hadoop as the Hadoop admin (this 'admin' user may differ on other Hadoop systems, where Hadoop admin accounts may be split into separate users hdfs, yarn, mapred, etc.). Note that bdutil clusters should work without needing to deal with Hadoop admin stuff for normal jobs, but if you need to access those Hadoop directories, you can either do:
sudo su hadoop
or
sudo su
to open a shell as hadoop or root, respectively. Or as Jeff mentions, you can sudo chmod to grant broader access to your own username.
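A few concrete commands along those lines, assuming a bdutil-deployed cluster where the Hadoop admin account is hadoop as described above (the local path at the end is only an example):

# Which account did gcloud log you in as, and what groups is it in?
whoami
id

# Inspect HDFS as the Hadoop admin user
sudo -u hadoop hadoop fs -ls /

# Or grant your own account read access to a local directory owned by hadoop
sudo chmod -R o+rX /home/hadoop/hadoop-install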

Hadoop on CentOS streaming example with python - permission denied on /mapred/local/taskTracker

I have been able to set up the streaming example with a Python mapper & reducer. The mapred folder location is /mapred/local/taskTracker.
Both the root & mapred users have ownership of this folder & its subfolders.
However, when I run my streaming job, it creates maps but no reduces, and gives the following error:
Cannot Run Program
/mapred/local/taskTracker/root/jobcache/job_201303071607_0035/attempt_201303071607_0035_m_000001_3/work/./mapper1.py
Permission Denied
I noticed that though I have provided a+rwx permissions to /mapred/local/taskTracker and all its subdirectories, when MapReduce creates the temp folders for this job, those folders do not have rwx for all users... and hence I get the permission denied error.
I have been looking through forum threads on this, and though there are threads mentioning the same error, I could not find any responses with resolutions.
Any help would be greatly appreciated.
I assume that you run your Hadoop daemons as the user root. In that case the permissions of newly created files are determined by root's umask. However, you must not change the umask for root.
If you'd like to run the cluster and MapReduce jobs as different users, it would be better to start the Hadoop daemons as a user hadoop and the MapReduce jobs as a user mapreduce. However, both users should belong to the same group, i.e. hadoop. Furthermore, the umask for the user hadoop should be set accordingly, as sketched below.
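A sketch of that layout, assuming a fresh CentOS box where the daemons will run as hadoop and streaming jobs will be submitted as mapreduce (the user names, the install path, and the umask value are examples, not requirements):

# Shared group plus the two accounts
groupadd hadoop
useradd -g hadoop hadoop
useradd -g hadoop mapreduce

# Hand the installation and the local TaskTracker directories to the daemon user
chown -R hadoop:hadoop /usr/lib/hadoop /mapred/local/taskTracker

# Group-friendly umask for the daemon user, so task directories it creates stay group-accessible
echo "umask 0002" >> /home/hadoop/.bashrc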
