when will `hdfs dfs -appendToFile - $remote_file` flush buffer? - hadoop

If I use hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile to read from stdin, when does the local buffer get flushed to the remote file?
Does it depend on the buffer size or on elapsed time?
Thanks!
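Not an answer from the thread, but one way to observe the behaviour empirically is to feed data slowly through the same command and watch the remote file size grow; this is only a sketch, and the loop, the 5-second interval, and the path are arbitrary examples:
( while true; do date; sleep 5; done ) | hdfs dfs -appendToFile - hdfs://nn.example.com/hadoop/hadoopfile &
watch -n 5 'hdfs dfs -stat %b hdfs://nn.example.com/hadoop/hadoopfile'   # %b prints the file size in bytes
Whenever the reported size jumps, the appended data has become visible in HDFS.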

Related

Cannot see the file I have uploaded to Hadoop

pscp tells me it was successful:
pscp -P 22 part-00003 username@172.31.143.131:/home/username/lab2_hdfs
username@172.31.143.131's password:
part-00000 | 758 kB | 758.9 kB/s | ETA: 00:00:00 | 100%
But I don't see it in Hadoop when I run hdfs dfs -ls. Why?
HDFS is not the same as your local filesystem, and you can't upload files to HDFS using SCP.
According to your command, you have just transferred your local file to a remote host, into a remote directory (/home/username/lab2_hdfs). At that stage HDFS wasn't involved at all and therefore doesn't know about the new file.
You may have a look at articles like Hadoop: copy a local file to HDFS and use commands like
hadoop fs -put part-00003 /hdfs/path
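For completeness, a minimal sketch of the resulting two-step flow, assuming the remote host 172.31.143.131 has a Hadoop client configured and that the target HDFS directory (/user/username/lab2 is just an example name) already exists:
pscp -P 22 part-00003 username@172.31.143.131:/home/username/lab2_hdfs
ssh username@172.31.143.131 "hdfs dfs -put /home/username/lab2_hdfs/part-00003 /user/username/lab2/"
ssh username@172.31.143.131 "hdfs dfs -ls /user/username/lab2"
The first command copies the file to the remote host's local disk, the second pushes it from there into HDFS, and the third verifies that it is now visible to hdfs dfs -ls.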

HDFS copyFromLocal slow when using ssh

I am using ssh to issue an HDFS copyFromLocal command like this (in a script):
ssh -t ubuntu@namenode_server "hdfs dfs -copyFromLocal data/file.csv /file.csv"
However, I am observing very peculiar behavior. This ssh command can take anywhere from 20 to 25 minutes for a 9 GB file. However, if I simply delete the file from HDFS and rerun the command, it always finishes within 4 minutes.
The transfer also takes around 3-4 minutes when moving the file from one HDFS cluster to another (even when I change the block size between the source and destination clusters).
I am using EC2 servers for the HDFS cluster. I am using Hadoop 2.7.6.
I am not sure why it takes so long to copy the file from the local file system to HDFS the first time.

HDFS space consumed: "hdfs dfs -du /" vs "hdfs dfsadmin -report"

Which tool is the right one to measure HDFS space consumed?
When I sum up the output of "hdfs dfs -du /" I always get a smaller amount of space consumed than "hdfs dfsadmin -report" reports (the "DFS Used" line). Is there data that du does not take into account?
The Hadoop file system provides reliable storage by putting copies of the data on several nodes. The number of copies is the replication factor; usually it is greater than one.
The command hdfs dfs -du / shows the space consumed by your data without replication.
The command hdfs dfsadmin -report (the DFS Used line) shows the actual disk usage, taking data replication into account. So it should be several times bigger than the number you get from the dfs -du command.
How HDFS storage works, in brief:
Let's say the replication factor = 3 (the default)
Data file size = 10GB (e.g. xyz.log)
HDFS will take 10 x 3 = 30GB to store that file
Depending on the command you use, you will get different values for the space occupied in HDFS (10GB vs 30GB)
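A quick way to see both numbers side by side on a running cluster (a sketch; the path / and the grep pattern are only illustrative):
hdfs dfs -du -s -h /
hdfs dfsadmin -report | grep "DFS Used"
The first command reports the logical size of your data without replication; the "DFS Used" lines of the report include every replica.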
If you are on a recent version of Hadoop, try the following command. In my case this works very well on Hortonworks Data Platform (HDP) 2.3.* and above, and it should also work on Cloudera's latest platform.
hadoop fs -count -q -h -v /path/to/directory
(-q = show quotas, -h = human-readable values, -v = show a header line)
This command will show the following fields in the output.
QUOTA REMAINING_QUOTA SPACE_QUOTA REMAINING_SPACE_QUOTA DIR_COUNT FILE_COUNT CONTENT_SIZE FILE_NAME
Where
CONTENT_SIZE = real file size without replication (10GB) and
SPACE_QUOTA = space occupied in HDFS to save the file (30GB)
Notes:
Control the replication factor by modifying the "dfs.replication" property in the hdfs-site.xml file under the conf/ directory of the default Hadoop installation directory. Changing it through Ambari/Cloudera Manager is recommended if you have a multi-node cluster.
There are other commands to check storage space, e.g. hadoop fsck and hadoop dfs -dus.
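As a related sketch (not from the answer above), the replication factor of data that is already in HDFS can be inspected and changed from the shell without editing hdfs-site.xml; /path/to/file is a placeholder:
hdfs dfs -stat %r /path/to/file
hdfs dfs -setrep -w 2 /path/to/file
The first command prints the current replication factor of the file; the second changes it to 2 and waits until re-replication has finished.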

Failed to copy file from FTP to HDFS

I have an FTP server (F), a Linux box (S), and a Hadoop cluster (C). The current file flow is F -> S -> C. I am trying to improve performance by skipping S.
The current flow is:
wget ftp://user:password@ftpserver/absolute_path_to_file
hadoop fs -copyFromLocal path_to_file path_in_hdfs
I tried:
hadoop fs -cp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
and:
hadoop distcp ftp://user:password@ftpserver/absolute_path_to_file path_in_hdfs
Both hang. The distcp one, being a job, is killed by a timeout; the logs (hadoop job -logs) only say it was killed by the timeout. I tried to wget from the FTP server on one of C's nodes and that worked. What could be the reason, and is there any hint on how to figure it out?
Pipe it through stdin:
wget -O - ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs
The -O - makes wget write the download to stdout instead of a local file, and the single - tells hadoop fs -put to read from stdin.
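The same idea works with curl, which streams to stdout by default; a sketch, assuming it is run on a machine with both FTP access and a configured Hadoop client (for example one of C's nodes):
curl ftp://user:password@ftpserver/absolute_path_to_file | hadoop fs -put - path_in_hdfs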
hadoop fs -cp ftp://user:password@ftpserver.com/absolute_path_to_file path_in_hdfs
This cannot be used: the source is treated as a file on the local file system, and the ftp:// scheme you are trying to pass is not taken into account. Refer to the javadoc: FileSystem.
distcp is only for large intra- or inter-cluster copies (to be read as Hadoop clusters, i.e. HDFS). Again, it cannot pull data from FTP. The two-step process is still your best bet, or write a program that reads from FTP and writes to HDFS.

hadoop getmerge to another machine

Is it possible to store the output of the hadoop dfs -getmerge command on another machine?
The reason is that there is not enough space on my local machine. The job output is 100GB and my local storage is 60GB.
Another possible reason is that I want to process the output locally in another program, on another machine, and I don't want to transfer it twice (HDFS -> local FS -> remote machine). I just want HDFS -> remote machine.
I am looking for something similar to how scp works, like:
hadoop dfs -getmerge /user/hduser/Job-output user@someIP:/home/user/
Alternatively, I would also like to get the HDFS data from a remote host to my local machine.
Could Unix pipes be used on this occasion?
For those who are not familiar with hadoop, I am just looking for a way to replace a local dir parameter (/user/hduser/Job-output) in this command with a directory on a remote machine.
This will do exactly what you need:
hadoop fs -cat /user/hduser/Job-output/* | ssh user@remotehost.com "cat > mergedOutput.txt"
fs -cat will read all files in sequence and write them to stdout.
ssh will pass them to a file on the remote machine (note that scp will not accept stdin as input).
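For the opposite direction also mentioned in the question (pulling the merged HDFS data from a remote Hadoop host down to the local machine), the same trick works with the ssh direction reversed; a sketch, where user@hadoophost and the paths are placeholders:
ssh user@hadoophost "hadoop fs -cat /user/hduser/Job-output/*" > mergedOutput.txt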
