Hadoop user/folder permissions - hadoop

I want to create a directory for each user.
I looked at several how-tos and they say different things.
I want it to be as easy as possible (I don't care about encryption, as users will log in to the machine using their SSH keys).
I've found this small guide:
hadoop user file permissions
But I have a few questions:
Do I need to create directories and users on each slave/node machine too?
What is the /user/myuser folder exactly? Is it supposed to be the /opt/hadoop/dfs/name/data (dfs.data.dir) folder from the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file?
Do I also need to give/create a dfs.name.dir directory for each user?
After I create the users and directories, do I need to put some parameters in each user's .bashrc file or give them specific permissions to use the hadoop commands (put/delete files, create directories, and so on)?
Anything else I forgot?
P.S.
My Hadoop works with Spark, if that matters.

Do I need to create the folders and users on each slave/node machine too?
No. It is enough to create the folders on the master, either through a program or simply by running hadoop fs -mkdir /foo.
What is this /user/myuser folder exactly? Is it supposed to be the /opt/hadoop/dfs/name/data (dfs.data.dir) folder in the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file?
The folder is what you'd expect of a standard user's directory under /home on Linux: the user running the job/task/program has permissions in their own folder. Note that these directories are not created by default by HDFS unless the users are added using something like Apache Ambari or Hue.
Do I also need to give/create a dfs.name.dir dir for each user?
You do not! They all share the same DFS.
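For example, a minimal sketch of creating one such home directory, assuming you run it as the HDFS superuser and that "myuser" stands in for a real username:
hadoop fs -mkdir -p /user/myuser              # create the home directory in HDFS
hadoop fs -chown myuser:myuser /user/myuser   # hand ownership to the user
hadoop fs -chmod 750 /user/myuser             # optional: keep it private to the user and their group
Repeat (or loop) this once per user on the master; nothing needs to be done on the slaves.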

Related

Why does h2o require write access on hdfs root directory?

I'm seeing the error message
Job setup failed : org.apache.hadoop.security.AccessControlException: Permission denied: user=airflow, access=WRITE, inode="/":hdfs:hdfs:drwxr-xr-x at org.apache.hadoop.hdfs.server.namenode.FSPermissionChecker.check(FSPermissionChecker.java:399) at ...
when trying to start the h2o cluster (h2o-3.28.0.1-hdp3.1). That is, it appears that it does not like that the root HDFS dir hdfs:/// does not have write permissions for my user (giving write access to my user via Ranger does appear to fix the problem), but this seems wrong.
From past experience, I've seen this in cases where the launching user does not have write permissions to their own hdfs:///user/<username> folder, but it seems odd to me that h2o wants the user to have write access over the entire top-level HDFS dir. Is this normal? Can I change it?
Possibly related: after starting the cluster, I can't kill it manually in the YARN ResourceManager UI or by killing the PID; instead I need to go to the h2o cluster URL and use the admin tab to shut down the cluster. Any ideas why this happens?
I found the problem. I can't find the docs or the other post detailing this right now, but basically, when running the hadoop jar h2odriver.jar ... command, there is an optional parameter called -output where you would normally put some HDFS location that h2o will write to (from what I can recall, this is a legacy directory that is not especially important).
I had forgotten that this is an HDFS location and put a local temp folder's absolute path there. The error occurred because h2o was trying to create that folder by creating the entire path in HDFS leading up to it, which requires being able to write from the HDFS root dir. The correct value would be something like /user/<username>.
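For illustration, a hedged sketch of the launch command; the -nodes and -mapperXmx values are placeholders and h2o_out is a hypothetical directory name:
# wrong: a local absolute path, which gets treated as an HDFS path rooted at /
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output /home/<username>/tmp/h2o_out
# right: a location under the launching user's HDFS home directory
hadoop jar h2odriver.jar -nodes 3 -mapperXmx 6g -output /user/<username>/h2o_out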

hdfs or hadoop command to sync files or folders from local to HDFS

I have local files which get added daily, so I want to sync the newly added files to HDFS.
I tried the command below, but it does a complete copy each time; I want a command which copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package, and perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script to compare source/target file times and then overwrite only the newer files.
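If you go the shell-script route, here is a minimal sketch that copies only the local files that do not yet exist in HDFS (it checks existence rather than modification times; the paths are the ones from the question):
#!/usr/bin/env bash
SRC=/home/user/files
DST=/data/files
for f in "$SRC"/*; do
  name=$(basename "$f")
  # hdfs dfs -test -e returns 0 when the target already exists in HDFS
  if ! hdfs dfs -test -e "$DST/$name"; then
    hdfs dfs -put "$f" "$DST/"
  fi
done
To also pick up files that changed since the last run, you would need to compare timestamps or sizes as well, for example by parsing hdfs dfs -stat output.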

Hadoop copying file to hadoop filesystem

I have copied a file from the local filesystem to HDFS and the file got copied to /user/hduser/in:
hduser@vagrant:/usr/local/hadoop/hadoop-1.2.1$ bin/hadoop fs -copyFromLocal /home/hduser/afile in
Question:
1. How does Hadoop by default copy the file to this directory, /user/hduser/in? Where is this mapping specified in the conf file?
If you write the command like above, the file gets copied to your user's HDFS home directory, which is /user/<username> by default. See also here: HDFS Home Directory.
You can use an absolute pathname (one starting with "/") just like in a Linux filesystem, if you want to write the file to a different location.
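To make the difference concrete, a small sketch, assuming the target directories exist and you are logged in as hduser:
bin/hadoop fs -copyFromLocal /home/hduser/afile in        # relative path: ends up in /user/hduser/in
bin/hadoop fs -copyFromLocal /home/hduser/afile /data/in  # absolute path: ends up at /data/in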
Are you using a default VM? Basically, if you configure Hadoop from binaries without using a preconfigured yum package, it doesn't have a default path. But if you install via yum from Hortonworks or use a Cloudera VM, it comes with a default path, I guess.
Check the default filesystem URI in the configuration (fs.defaultFS, set in core-site.xml). "/" will point to the base URI set there, and any folder mentioned in the command without a leading "/" will be resolved against your home path under that filesystem.
Hadoop picks up the default path defined in the configuration and writes the data there.
(The original answer included an image illustrating how writes work in HDFS.)
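If you just want to see which filesystem "/" resolves to, a hedged sketch (on Hadoop 2.x and later; on 1.x the configuration key is fs.default.name):
hdfs getconf -confKey fs.defaultFS
# prints something like hdfs://namenode-host:8020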

hadoop file system change directory command

I was going through the Hadoop fs commands list. I am a little perplexed not to find any "cd" command in hadoop fs.
Why is that? It might sound like a silly question to Hadoop users, but as a beginner I cannot understand why there is no cd command at the hadoop fs level.
Think about it like this:
Hadoop has a special file system called HDFS which runs on top of an existing (say, Linux) file system. There is no concept of a current or present working directory, a.k.a. pwd.
Let's say we have the following structure in HDFS:
d1/
d2/
f1
d3/
f2
d4/
f3
You can cd in your Linux file system to move from one directory to another, but do you think changing directory in Hadoop would make sense? HDFS is like a virtual file system, and you don't interact with HDFS directly except via the hadoop command or a job tracker.
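In practice you simply pass a path to every command instead of cd-ing into a directory; a small sketch:
hadoop fs -ls /      # list the HDFS root
hadoop fs -ls /d1    # "enter" a directory by naming it explicitly
hadoop fs -ls        # no path at all lists your HDFS home directory, /user/<you>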
HDFS provides various features that make accessing HDFS (the Hadoop Filesystem) easy on local machines or edge nodes. You have the option to mount HDFS using either of the following methods. Once the Hadoop file system is mounted on your machine, you may use the cd command to browse through it (it's like mounting a remote network filesystem such as NAS); see the sketch after this list:
Fuse-DFS (available from Hadoop 0.20 onwards)
NFSv3 gateway access to HDFS data (available from Hadoop 2.2.0)
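As a hedged sketch of the NFSv3 gateway option (the gateway must already be running; the host name and mount point are placeholders):
sudo mkdir -p /mnt/hdfs
sudo mount -t nfs -o vers=3,proto=tcp,nolock <nfs-gateway-host>:/ /mnt/hdfs
cd /mnt/hdfs/user/myuser   # a plain cd now works on the mounted tree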

Hadoop on CentOS streaming example with python - permission denied on /mapred/local/taskTracker

I have been able to set up the streaming example with a Python mapper and reducer. The mapred folder location is /mapred/local/taskTracker.
Both the root and mapred users have ownership of this folder and its subfolders.
However, when I run my streaming job it creates the maps but no reduces, and gives the following error:
Cannot Run Program
/mapred/local/taskTracker/root/jobcache/job_201303071607_0035/attempt_201303071607_0035_m_000001_3/work/./mapper1.py
Permission Denied
I noticed that although I have given a+rwx permissions to /mapred/local/taskTracker and all its subdirectories, when MapReduce creates the temp folders for this job, the folders do not have rwx for all users, and hence I get the permission denied error.
I have been looking for forum threads on this, and though there are threads mentioning the same error, I could not find any responses with resolutions.
Any help would be greatly appreciated.
I assume that you run your Hadoop daemons as the root user. In this case the permissions of newly created files are determined by root's umask. However, you must not change the umask for root.
If you'd like to run the MapReduce jobs and the cluster as different users, it would be better to start the Hadoop daemons as a hadoop user and the MapReduce jobs as a mapreduce user. Both users should belong to the same group, e.g. hadoop, and the umask for the hadoop user should be set accordingly.
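A minimal sketch of that setup on the Linux side (the user and group names follow the suggestion above; adjust the umask to taste):
sudo groupadd hadoop
sudo useradd -m -g hadoop hadoop       # owns and starts the Hadoop daemons
sudo useradd -m -g hadoop mapreduce    # submits the MapReduce jobs
echo 'umask 002' | sudo tee -a /home/hadoop/.bashrc   # group-writable files for the daemon user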
