hadoop getmerge to another machine - hadoop

Is it possible to store the output of the hadoop dfs -getmerge command to another machine?
The reason is that there is not enough space on my local machine. The job output is 100GB and my local storage is 60GB.
Another possible reason could be that I want to process the output in another program locally, in another machine and I don't want to transfer it twice (HDFS-> local FS -> remote machine). I just want (HDFS -> remote machine).
I am looking for something similar to how scp works, like:
hadoop dfs -getmerge /user/hduser/Job-output user@someIP:/home/user/
Alternatively, I would also like to get the HDFS data from a remote host to my local machine.
Could Unix pipelines be used here?
For those who are not familiar with Hadoop, I am just looking for a way to replace the local directory parameter of this command with a directory on a remote machine.

This will do exactly what you need:
hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat >mergedOutput.txt"
fs -cat reads all the files in sequence and writes them to stdout.
ssh passes them to a file on the remote machine (note that scp does not accept stdin as input).
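The same pipeline can be wrapped in a small shell function. This is only a sketch and not part of the original answer: the function name stream_hdfs_dir and the HADOOP_BIN and SSH_BIN override variables are hypothetical, added so the two commands can be swapped out. It assumes the machine running it has a working hadoop client and ssh access to the remote host.

```shell
# Stream every file under an HDFS directory into one file on a remote host.
# Nothing is written to the local disk: data flows HDFS -> pipe -> ssh.
# HADOOP_BIN and SSH_BIN are hypothetical overrides (defaults: hadoop, ssh).
stream_hdfs_dir() {
    hdfs_dir=$1      # e.g. /user/hduser/Job-output
    remote=$2        # e.g. user@remotehost.com
    remote_file=$3   # e.g. mergedOutput.txt
    # The glob is quoted so hadoop expands it against HDFS, not the local shell.
    "${HADOOP_BIN:-hadoop}" fs -cat "${hdfs_dir}/*" |
        "${SSH_BIN:-ssh}" "$remote" "cat > '${remote_file}'"
}
```

Usage would look like: stream_hdfs_dir /user/hduser/Job-output user@remotehost.com mergedOutput.txt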

Related

how to copy file from remote server to HDFS

I have a remote server and an authenticated Hadoop environment.
I want to copy a file from the remote server to the Hadoop machine and into HDFS.
Please advise on an efficient approach or HDFS command for copying files from a remote server to HDFS.
Any example would be helpful.
The ordinary way to copy a file to the remote server itself is:
scp -rp file remote_server:/tmp
but this approach does not support copying directly to HDFS.
You can try this:
ssh remote-server "hadoop fs -put - /tmp/file" < file
By "remote server", do you mean that it is not in the same network as the Hadoop nodes? If so, you can scp the file from the remote machine to the local file system of a Hadoop node and then use the -put or -copyFromLocal command to move it into HDFS.
Example: hadoop fs -put file-name hdfs://namenode-uri/path-to-hdfs
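The streaming variant, where the file never lands on the Hadoop node's local disk, can be sketched as a function. The name push_to_hdfs and the SSH_BIN override are hypothetical; the sketch assumes the host reached over ssh has the hadoop client on its PATH and that the target HDFS path contains no single quotes.

```shell
# Send a local file into HDFS through a host that has the hadoop client.
# "hadoop fs -put -" reads the file content from stdin on the far side.
# SSH_BIN is a hypothetical override (default: ssh).
push_to_hdfs() {
    local_file=$1   # file on the machine running this function
    gateway=$2      # e.g. remote-server, a host with hadoop installed
    hdfs_path=$3    # destination path in HDFS, e.g. /tmp/file
    "${SSH_BIN:-ssh}" "$gateway" "hadoop fs -put - '${hdfs_path}'" < "$local_file"
}
```

Usage would look like: push_to_hdfs file remote-server /tmp/file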

HDFS copyFromLocal slow when using ssh

I am using ssh to issue a copyFromLocal command of HDFS like this (in a script):
ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"
However, I am observing very peculiar behavior. This ssh command can take a variable time from 20 min to 25 min for a 9GB file. However, if I simply delete the file from HDFS and rerun the command, it always executes within 4 min.
The transfer of the file also takes around 3-4 min when moving it from one HDFS cluster to another as well (even when I change the block size between source and destination clusters).
I am using EC2 servers for the HDFS cluster. I am using Hadoop 2.7.6.
Not sure why it takes such a long time to copy the file from local file system to HDFS the first time.

Move zip files from one server to hdfs?

What is the best approach for moving files from a Linux box to HDFS? Should I use Flume or ssh?
SSH command:
cat kali.txt | ssh user@hadoopdatanode.com "hdfs dfs -put - /data/kali.txt"
The only problem with SSH is that I have to enter the password every time; I need to find out how to log in without typing a password.
Can Flume move files straight to HDFS from one server?
You could set up passwordless SSH and then transfer the files without entering a password.
You could also write a script, in Python for example, that does the job for you.
You could install the Hadoop client on the Linux box that has the files. Then you could run "hdfs dfs -put" to send your data directly from that box to the Hadoop cluster.
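For the passwordless SSH suggestion, a common setup is key-based login with OpenSSH. The following is a sketch under assumptions, not something the answers spell out: the function name setup_passwordless_ssh and the KEYGEN_BIN and COPY_ID_BIN overrides are hypothetical, and it creates a key without a passphrase, which may not suit every security policy.

```shell
# One-time setup for key-based ssh login, so scripted transfers need no password prompt.
# KEYGEN_BIN and COPY_ID_BIN are hypothetical overrides (defaults: ssh-keygen, ssh-copy-id).
setup_passwordless_ssh() {
    remote=$1                                # e.g. user@hadoopdatanode.com
    key_file=${2:-$HOME/.ssh/id_ed25519}     # private key path; the .pub file sits next to it
    # Create a key pair only if none exists yet; -N "" means an empty passphrase.
    if [ ! -f "$key_file" ]; then
        "${KEYGEN_BIN:-ssh-keygen}" -q -t ed25519 -N "" -f "$key_file" || return 1
    fi
    # Install the public key in the remote account's authorized_keys.
    # This step still asks for the password one last time.
    "${COPY_ID_BIN:-ssh-copy-id}" -i "${key_file}.pub" "$remote"
}
```

After running setup_passwordless_ssh user@hadoopdatanode.com once, the cat ... | ssh ... pipeline should run without a password prompt.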

Hadoop fs getmerge to remote server/machine due to low disk space

I have the same question as this other post:
hadoop getmerge to another machine
but the answer does not work for me
To summarize what I want to do: get merge (or get the files) from the hadoop cluster, and NOT copy to the local machine (due to low or no disk space), but directly transfer them to a remote machine. I have my public key in the remote machine authorized keys list, so no password authentication is necessary.
My usual command on the local machine is (which merges and puts the file onto the local server/machine as a gzip file):
hadoop fs -getmerge folderName.on.cluster merged.files.in.that.folder.gz
I tried as in the other post:
hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com:/storage | "cat > mergedoutput.txt"
This did not work for me. I get errors like these:
Pseudo-terminal will not be allocated because stdin is not a terminal.
ssh: Could not resolve hostname user@remotehost.com:/storage /: Name or service not known
and I tried it the other way
ssh user@remotehost.com:/storage "hadoop fs -cat folderName.on.cluster/*" | cat > mergedoutput.txt
Then:
-bash: cat > mergedoutput.txt: command not found
Pseudo-terminal will not be allocated because stdin is not a terminal.
-bash: line 1: syntax error near unexpected token `('
Any help is appreciated. I also don't need to do -getmerge, I could also do -get and then just merge the files once copied over to the remote machine. Another alternative is if there is a way I can run a command on the remote server to directly copy the file from the hadoop cluster server.
Thanks
Figured it out:
hadoop fs -cat folderName.on.cluster/* | ssh user@remotehost.com "cd storage; cat > mergedoutput.txt"
This is what works for me. Thanks to @vefthym for the help.
This merges the files in the directory on the Hadoop cluster onto the remote host without copying them to the local host (which is pretty full already). Before writing the file, I need to change to the directory where I want it to end up, hence the cd storage; before the cat.
I'm glad that you found my question useful!
I think your problem is just in the ssh call, not in the solution that you describe. It worked perfectly for me. By the way, in the first command, you have an extra '|' character. What do you get if you just type ssh user@remotehost.com? Do you type a name or an IP? If you type a name, it should exist in the /etc/hosts file.
Based on this post, I guess you are using cygwin and have some misconfiguration. Apart from the accepted solution, check whether you have installed the openssh cygwin package, as the second best answer suggests.
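The working form passes the whole remote command, including the cd and the redirection, as one quoted string. A local illustration of that behaviour, using sh -c as a stand-in for ssh (both hand the string to a shell on the receiving side); the temporary directory and the part-0000x sample lines are made up for the demo:

```shell
# sh -c stands in for "ssh user@remotehost.com": the quoted string is run
# by the receiving shell, so cd and > apply there, not in the calling shell.
workdir=$(mktemp -d)
mkdir "$workdir/storage"
printf 'part-00000\npart-00001\n' |
    sh -c "cd '$workdir/storage'; cat > mergedoutput.txt"
# The merged file ends up inside storage/, as in the working command.
cat "$workdir/storage/mergedoutput.txt"
```

By contrast, writing host:/storage is scp syntax that ssh does not understand, and piping into a bare quoted string makes the local shell look for a command literally named "cat > mergedoutput.txt".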

Failed to copy file from FTP to HDFS

I have FTP server (F [ftp]), linux box(S [standalone]) and hadoop cluster (C [cluster]). The current files flow is F->S->C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp one, being a job, is killed by a timeout. The logs (hadoop job -logs) only say that it was killed by a timeout. I tried to wget the file from the FTP server on one of the nodes of C and it worked. What could be the reason, and is there any hint on how to figure it out?
Pipe it through stdin:
wget -q -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The single - tells HDFS put to read from stdin.
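As a reusable form of this pipe, here is a minimal sketch. The function name ftp_to_hdfs and the WGET_BIN and HADOOP_BIN overrides are hypothetical. Note that wget only writes the download to stdout when given -O -, and that the bytes still pass through the machine running the pipeline, although they are never written to its disk.

```shell
# Stream a file from an FTP URL straight into HDFS without a local copy.
# WGET_BIN and HADOOP_BIN are hypothetical overrides (defaults: wget, hadoop).
ftp_to_hdfs() {
    url=$1          # e.g. ftp://user:password@ftpserver/absolute_path_to_file
    hdfs_path=$2    # destination path in HDFS
    # -q silences progress output; -O - sends the downloaded bytes to stdout.
    "${WGET_BIN:-wget}" -q -O - "$url" |
        "${HADOOP_BIN:-hadoop}" fs -put - "$hdfs_path"
}
```

The function's exit status is that of hadoop fs -put, so a failed download can still leave an empty or partial file in HDFS; in bash, set -o pipefail makes the wget failure visible.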
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used, because the source is treated as a file in the local file system. It does not take into account the scheme you are trying to pass. Refer to the javadoc: FileSystem
DistCp is only meant for large intra-cluster or inter-cluster copies (between Hadoop clusters, i.e. HDFS). Again, it cannot get data from FTP. The two-step process is still your best bet. Alternatively, write a program that reads from FTP and writes to HDFS.

Resources