HDFS copyFromLocal slow when using ssh - hadoop

I am using ssh to issue an HDFS copyFromLocal command like this (in a script):
ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"
However, I am observing very peculiar behavior: this ssh command can take anywhere from 20 to 25 minutes for a 9 GB file, yet if I simply delete the file from HDFS and rerun the command, it always completes within 4 minutes.
Transferring the same file from one HDFS cluster to another also takes only around 3-4 minutes, even when I change the block size between the source and destination clusters.
I am using EC2 servers for the HDFS cluster, running Hadoop 2.7.6.
I am not sure why it takes so long to copy the file from the local file system to HDFS the first time.
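For anyone trying to reproduce the measurement, here is a minimal sketch that times both runs, assuming the same ubuntu@namenode_server host and data/file.csv path as above:
# first copy (the slow case described above)
time ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"
# delete the target, then time the rerun (the fast case)
ssh ubuntu@namenode_server "hdfs dfs -rm -skipTrash /file.csv"
time ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"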

Related

Curl, Kerberos authenticated file copy on hadoop

We need to copy files between HDFS folders, at an HDFS location. We are currently using a curl command, as shown below, in a shell script loop.
/usr/bin/curl -v --negotiate -u : -X PUT "<hnode>:<port>/webhdfs/v1/busy/rg/stg/"$1"/"$table"/"$table"_"$3".dsv?op=RENAME&destination=/busy/rg/data/"$1"/"$table"/"$table"_$date1.dsv"
However, this performs a move. We need a copy, so that the file is retained at the original staging location.
I wanted to know whether there is a corresponding curl operation. Instead of op=RENAME&destination, what else could work?
WebHDFS alone does not offer a copy operation in its interface. The WebHDFS interface provides lower-level file system primitives. A copy operation is a higher-level application that uses those primitive operations to accomplish its work.
The implementation of hdfs dfs -cp against a webhdfs: URL essentially combines op=OPEN and op=CREATE calls to complete the copy. You could potentially re-implement a subset of that logic in your script. If you want to pursue that direction, the CopyCommands class is a good starting point in the Apache Hadoop codebase for seeing how that works.
Here is a starting point for how this could work. There is an existing file at /hello1 that we want to copy to /hello2. This script calls curl to open /hello1 and pipes the output to another curl command, which creates /hello2, using stdin as the input source.
> hdfs dfs -ls /hello*
-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1
> curl -sS -L 'http://localhost:9870/webhdfs/v1/hello1?op=OPEN' |
> curl -sS -L -X PUT -d @- 'http://localhost:9870/webhdfs/v1/hello2?op=CREATE&user.name=cnauroth'
> hdfs dfs -ls /hello*
-rw-r--r-- 3 cnauroth supergroup 6 2017-07-06 09:15 /hello1
-rw-r--r-- 3 cnauroth supergroup 5 2017-07-06 09:20 /hello2
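If the cluster requires Kerberos, as in your original command, the same OPEN-to-CREATE pipe can be tried with SPNEGO negotiation instead of user.name. A hedged sketch, with the host, port and file paths as placeholders; --data-binary is used instead of -d so newlines in the .dsv file are preserved:
curl -sS -L --negotiate -u : '<hnode>:<port>/webhdfs/v1/busy/rg/stg/file.dsv?op=OPEN' |
curl -sS -L --negotiate -u : -X PUT --data-binary @- '<hnode>:<port>/webhdfs/v1/busy/rg/data/file.dsv?op=CREATE'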
But my requirement is to connect from an external unix box, log in to HDFS automatically via Kerberos, and then move the files within HDFS; hence the curl.
Another option could be a client-only Hadoop installation on your external host. You would have an installation of the Hadoop software and the same configuration files from the Hadoop cluster, and then you could issue the hdfs dfs -cp commands instead of running curl commands against HDFS.
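A minimal sketch of that approach, assuming a keytab is available on the external host and reusing the staging and destination paths (and shell variables) from the curl command in the question; the keytab path and principal are placeholders:
# authenticate to the Kerberized cluster non-interactively
kinit -kt /path/to/user.keytab user@EXAMPLE.REALM
# copy within HDFS, leaving the source file in place at the staging location
hdfs dfs -cp /busy/rg/stg/"$1"/"$table"/"$table"_"$3".dsv /busy/rg/data/"$1"/"$table"/"$table"_"$date1".dsv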
I don't know which distribution you use; if it is Cloudera, try BDR (the Backup and Disaster Recovery module) via its REST APIs.
I used it to copy files/folders within a Hadoop cluster and across Hadoop clusters, and it works against encrypted zones (TDE) as well.

HDFS space consumed: "hdfs dfs -du /" vs "hdfs dfsadmin -report"

Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /", I always get less space consumed than what "hdfs dfsadmin -report" shows (the "DFS Used" line). Is there data that du does not take into account?
The Hadoop file system provides reliable storage by placing copies of the data on several nodes. The number of copies is the replication factor, and it is usually greater than one.
The command hdfs dfs -du / shows the space consumed by your data without replication.
The command hdfs dfsadmin -report (the "DFS Used" line) shows the actual disk usage, taking data replication into account, so it should be several times bigger than the number obtained from the dfs -du command.
How HDFS Storage works in brief:
Say the replication factor = 3 (the default)
Data file size = 10 GB (e.g. xyz.log)
HDFS will take 10x3 = 30 GB to store that file
Depending on which command you use, you will get different values for the space occupied by HDFS (10 GB vs 30 GB), as the commands below illustrate
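A quick way to see both numbers side by side, using standard HDFS commands (the path is a placeholder):
# logical size: what the data occupies before replication
hdfs dfs -du -s -h /
# physical usage: the "DFS Used" figures, which count every replica
hdfs dfsadmin -report | grep "DFS Used"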
If you are on a recent version of Hadoop, try the following command. In my case this works very well on Hortonworks Data Platform (HDP) 2.3.* and above, and it should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = quota, -h = human readable values, -v = verbose)
This command will show the following fields in the output.
QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Where
CONTENT_SIZE = real file size without replication (10GB) and
SPACE_QUOTA = space occupied in HDFS to save the file (30GB)
Notes:
Control the replication factor by modifying the "dfs.replication" property in the hdfs-site.xml file under the conf/ directory of the default Hadoop installation (changing this via Ambari/Cloudera Manager is recommended if you have a multi-node cluster); see the commands after these notes.
There are other commands to check storage space, e.g. hadoop fsck and hadoop dfs -dus.
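A hedged sketch of checking and changing replication from the command line; the path is a placeholder, and note that -setrep changes replication for an existing file rather than the cluster-wide default in hdfs-site.xml:
# show the configured default replication factor
hdfs getconf -confKey dfs.replication
# change replication for an existing file and wait for it to take effect
hdfs dfs -setrep -w 2 /path/to/xyz.log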

Move zip files from one server to hdfs?

What is the best approach to move files from a Linux box to HDFS? Should I use Flume or ssh?
SSH Command:
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"
The only problem with ssh is that I need to enter the password every time; I need to check how to authenticate without typing the password.
Can Flume move files straight to HDFS from one server?
Maybe you can set up passwordless ssh and then transfer files without entering a password.
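For example, a minimal passwordless-ssh setup, reusing the host and file names from the command above:
# generate a key pair if you do not already have one
ssh-keygen -t rsa -b 4096
# install the public key on the Hadoop node so ssh stops prompting for a password
ssh-copy-id user@hadoopdatanode.com
# now the original command runs unattended
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"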
Maybe you can create a script, in Python for example, which does the job for you.
You could install a Hadoop client on the Linux box that has the files. Then you could "hdfs dfs -put" your data directly from that box to the Hadoop cluster.

Failed to copy file from FTP to HDFS

I have an FTP server (F), a standalone Linux box (S) and a Hadoop cluster (C). The current file flow is F->S->C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp attempt, being a job, is killed by timeout, and the logs (hadoop job -logs) only say it was killed by timeout. I tried wget against the FTP server from one of the nodes of C and it worked. What could be the reason, and is there any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
wget -O - writes the download to stdout, and the single - tells HDFS put to read from stdin.
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, because it treats the source as a file in the local file system and does not take the scheme you are passing into account. Refer to the javadoc: FileSystem.
DistCp is only for large intra- or inter-cluster copies (to be read as Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet, or write a program to read from FTP and write to HDFS.
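If the goal is simply to avoid landing the file on S, a hedged single-pipe alternative (credentials and paths are placeholders) is to let curl stream the FTP download straight into HDFS, since curl writes to stdout by default:
curl -s ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
This still has to run somewhere that has a Hadoop client and can reach both the FTP server and the cluster, for example one of the nodes of C, since wget to the FTP server worked there.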

hadoop getmerge to another machine

Is it possible to store the output of the hadoop dfs -getmerge command on another machine?
The reason is that there is not enough space on my local machine: the job output is 100 GB and my local storage is 60 GB.
Another possible reason is that I want to process the output with another program on a different machine, and I don't want to transfer it twice (HDFS -> local FS -> remote machine). I just want HDFS -> remote machine.
I am looking for something similar to how scp works, like:
hadoop dfs -getmerge /user/hduser/Job-output user#someIP:/home/user/
Alternatively, I would also like to get the HDFS data from a remote host to my local machine.
Could unix pipelines be used on this occasion?
For those who are not familiar with Hadoop, I am just looking for a way to replace the local dir parameter (/user/hduser/Job-output) in this command with a directory on a remote machine.
This will do exactly what you need:
hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat >mergedOutput.txt"
fs -cat will read all files in sequence and output them to stdout.
ssh will pipe them into a file on the remote machine (note that scp will not accept stdin as input)
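For the reverse direction also asked about (pulling the merged HDFS data from a remote Hadoop host down to the local machine), the same idea works with the ssh direction flipped; a sketch in which the Hadoop host name is a placeholder:
ssh user@hadoop-host.com "hadoop fs -cat /user/hduser/Job-output/*" > mergedOutput.txt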
