Mount HAR using HDFS-Fuse - hadoop

Is it possible to mount a Hadoop Archive File when using hdfs-fuse-dfs?
I followed the Cloudera notes for setting up hdfs-fuse-dfs and am able to mount HDFS. I can browse HDFS as expected. However, our HDFS contains .har files. Within the fuse mount I can see the .har files, but I am not able to access the files inside them (aside from viewing the part-0, _index, etc. files).
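For reference (this is general Hadoop usage, not something stated in the thread, and the archive path is a placeholder): outside the fuse mount, HAR contents are normally addressed through the har:// scheme on the HDFS shell:
# List the files stored inside a Hadoop Archive via the har:// filesystem
hadoop fs -ls har:///user/hduser/archive.har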

Related

hdfs or hadoop command to sync the files or folder between local to hdfs

I have local files which get added to daily, so I want to sync these newly added files to HDFS.
I tried the command below, but it always does a complete copy. I want a command which copies only the newly added files:
$ hdfs dfs -cp /home/user/files/* /data/files/*
You can use hsync.
https://github.com/alexholmes/hsync
It's Alex's custom package and perhaps useful on a dev box, but it could be hard to deploy in a production environment. I am looking for a similar solution, but for now this seems to be the closest. The other option is to write your own shell script that compares source/target file times and then copies only the newer files, as sketched below.
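A rough sketch of that shell-script option, reusing the paths from the question; note it only copies files that are new by name and does not compare modification times:
#!/bin/bash
# Copy only the local files that do not yet exist in HDFS
SRC=/home/user/files
DEST=/data/files
for f in "$SRC"/*; do
  name=$(basename "$f")
  # hdfs dfs -test -e returns 0 if the path already exists in HDFS
  if ! hdfs dfs -test -e "$DEST/$name"; then
    hdfs dfs -put "$f" "$DEST/"
  fi
done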

Hadoop user/folder permissions

I want to create a directory for each user.
I looked at several how-tos and they say different things.
I want it to be as easy as possible (I don't care about encryption, as users will log in to the machine using their SSH keys).
I've found this small guide:
hadoop user file permissions
But I have a few questions:
Do I need to create the directories and users on each slave/node machine too?
What is the /user/myuser folder exactly? Is it supposed to be the /opt/hadoop/dfs/name/data (dfs.data.dir) folder from the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file?
Do I also need to give/create a dfs.name.dir directory for each user?
After I create the users and directories, do I need to put some params in each user's .bashrc file or give them specific permissions to use the hadoop commands (put/delete files, create dirs, for example)?
Anything else I forgot?
P.S.
My Hadoop runs with Spark, if that matters.
Do I need to create the folders and users on each slave/node machine too?
No. It is enough to create the folders on the master, either through a program or simply using hadoop fs -mkdir /foo.
What is this /user/myuser folder exactly? Is it supposed to be the /opt/hadoop/dfs/name/data (dfs.data.dir) folder from the $HADOOP_HOME/etc/hadoop/hdfs-site.xml file?
The folder is what you'd expect of a standard user's directory under /home on Linux. The user running the job/task/program has permissions in their own folder. Note that these directories are not created by default by HDFS unless the users are added using something like Apache Ambari or Hue.
Do I also need to give/create a dfs.name.dir directory for each user?
You do not! They all share the same DFS.
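A minimal sketch of what creating such a per-user directory typically looks like, run on the master as the HDFS superuser (myuser is a placeholder):
# Create the user's HDFS home directory
hadoop fs -mkdir -p /user/myuser
# Give the user ownership so they can put/delete files and create dirs in it
hadoop fs -chown myuser:myuser /user/myuser
# Optionally restrict access to the owner
hadoop fs -chmod 700 /user/myuser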

Basic issue in copying files from hive or hadoop to local directory due to wrong nomenclature

I'm trying to copy a file that is hosted both within Hive and on HDFS onto my local computer, but I can't seem to figure out the right call/terminology to use to refer to my local path. All the online explanations describe it solely as "path to local". I'm trying to copy this into a folder at C/Users/PC/Desktop/data and am using the following attempts:
In HDFS
hdfs dfs -copyToLocal /user/w205/staging /C/Users/PC/Desktop/data
In Hive
INSERT OVERWRITE LOCAL DIRECTORY 'c/users/pc/desktop/data/' SELECT * FROM lacountyvoters;
How should I be invoking the local directory in this case?
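For illustration only (not an answer from the thread, and /home/pc/data is a hypothetical destination on a Linux client): the local target is a plain filesystem path as seen by the machine running the command, with no scheme prefix:
# HDFS shell: the second argument is an ordinary local path
hdfs dfs -copyToLocal /user/w205/staging /home/pc/data
The LOCAL DIRECTORY in the Hive statement is likewise a plain path on the machine executing the query.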

how to save data in HDFS with spark?

I want to use Spark Streaming to retrieve data from Kafka. Now I want to save this data to a remote HDFS. I know that I have to use the function saveAsTextFile. However, I don't know precisely how to specify the path.
Is it correct if I write this:
myDStream.foreachRDD(frm -> {
    frm.saveAsTextFile("hdfs://ip_addr:9000//home/hadoop/datanode/myNewFolder");
});
where ip_addr is the IP address of my remote HDFS server,
/home/hadoop/datanode/ is the DataNode HDFS directory created when I installed Hadoop (I don't know if I have to specify this directory), and
myNewFolder is the folder where I want to save my data.
Thanks in advance.
Yassir
The path has to be a directory in HDFS.
For example, if you want to save the files inside a folder named myNewFolder under the root path / in HDFS, the path to use would be hdfs://namenode_ip:port/myNewFolder/.
On execution of the Spark job this directory myNewFolder will be created.
The DataNode data directory, which is given for dfs.datanode.data.dir in hdfs-site.xml, is used to store the blocks of the files you store in HDFS; it should not be referenced as an HDFS directory path.
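To make that concrete (namenode_ip:port and myNewFolder are the placeholders from the answer above), the output can be checked from the HDFS shell once a batch has been written:
# List the job output; saveAsTextFile produces part-NNNNN files inside the directory
hdfs dfs -ls hdfs://namenode_ip:port/myNewFolder/
# Peek at one of the part files
hdfs dfs -cat hdfs://namenode_ip:port/myNewFolder/part-00000 | head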

Hadoop copying file to hadoop filesystem

I have copied a file from local to the HDFS filesystem and the file got copied -- /user/hduser/in
hduser@vagrant:/usr/local/hadoop/hadoop-1.2.1$ bin/hadoop fs -copyFromLocal /home/hduser/afile in
Question:
1. How does Hadoop by default copy the file to this directory -- /user/hduser/in? Where is this mapping specified in the conf file?
If you write the command like above, the file gets copied to your user's HDFS home directory, which is /user/<username>. See also here: HDFS Home Directory.
You can use an absolute pathname (one starting with "/"), just like in a Linux filesystem, if you want to write the file to a different location.
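For example (with /home/hduser/afile taken from the question and /data assumed to be an existing HDFS directory):
# Relative destination: ends up in the user's HDFS home, i.e. /user/hduser/in
bin/hadoop fs -copyFromLocal /home/hduser/afile in
# Absolute destination: ends up exactly where specified
bin/hadoop fs -copyFromLocal /home/hduser/afile /data/afile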
Are you using a default VM? Basically, if you configure Hadoop from binaries without using a preconfigured yum package, it doesn't have a default path. But if you use yum via a Hortonworks or Cloudera VM, it comes with a default path, I guess.
Check core-site.xml (fs.defaultFS) to see the default filesystem URI. "/" will point to the base URI set there. Any path given in the command without a leading "/" is resolved relative to your HDFS home directory.
Hadoop picks the default path defined in the configuration (fs.defaultFS) and writes the data there.
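One way to check which base URI is actually in effect (a standard command on Hadoop 2.x and later, shown as an illustration):
# Prints the configured default filesystem, e.g. hdfs://namenode:9000
hdfs getconf -confKey fs.defaultFS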
