Check file size in hdfs - shell

I am able to retrieve the size of a hdfs file using the following command :
hadoop fs -du -s /user/demouser/first/prod123.txt | cut -d ' ' -f 1
which is giving me the output as 82(which is in bytes).
Now i want to merge this file with another file only if its size is less that 100 MB. I am using shell script to write all these commands in a single file.
How do I convert it into MB and then compare the size? Is there any specific command for that?

Simply use :
hdfs dfs -du -h /path/to/file
I tried the same on my cluster by copying your command. Only possible mistake is that you're using hadoop fs, just use hdfs dfs and make sure you're logged in as a HDFS user.

Related

how to send files to hdfs while keeping their basename

Someone suggest to me, what's the best solution to shipp files from different sources and store them in hdfs based on their names. My situation is :
I have a server that has large number of files and I need to send them to HDFS.
Actually I used flume, in its config I tried spooldir and ftp as sources, but both of them has disadvantages.
So any idea, how to do that ?
Use the hadoop put command:
put
Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | .. ].
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”
Copying fails if the file already exists, unless the -f flag is given.
Options:
-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix .COPYING.
Examples:
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put

Find out actual disk usage in HDFS

Is there a way to find out how much space is consumed in HDFS?
I used
hdfs dfs -df
but it seems to be not relevant cause after deleting huge amount of data with
hdfs dfs -rm -r -skipTrash
the previous comand displays changes not at once but after several minutes (I need up-to-date disk usage info).
To see the space consumed by a particular folder try:
hadoop fs -du -s /folder/path
And if you want to see the usage, space consumed, space available, etc. of the whole HDFS:
hadoop dfsadmin -report
hadoop cli is deprecated. Use hdfs instead.
Folder wise :
sudo -u hdfs hdfs dfs -du -h /
Cluster wise :
sudo -u hdfs hdfs dfsadmin -report
hadoop fs -count -q /path/to/directory

How to change replication factor while running copyFromLocal command?

I'm not asking how to set replication factor in hadoop for a folder/file. I know following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking, how do I set the replication factor, other than default (which is 4 in my scenario), while copying data from local. I'm running following command,
hadoop fs -copyFromLocal <src> <dest>
When I run above commands, it copies the data from src to dest path with replication factor as 4. But I want to make replication factor as 1 while copying data but not after copying is complete. Bascially I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? or I've first copy data with replication factor 4 and then run setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."

How to unzip file in hadoop?

I was trying to unzip a zip file, stored in Hadoop file system, & store it back in hadoop file system. I tried following commands, but none of them worked.
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp
I get errors like gzip: stdin has more than one entry--rest ignored, cat: Unable to write to output stream., Error: Could not find or load main class put on terminal, when I run those commands. Any help?
Edit 1: I don't have access to UI. So, only command lines are allowed. Unzip/gzip utils are installed on my hadoop machine. I'm using Hadoop 2.4.0 version.
To unzip a gzipped (or bzipped) file, I use the following
hdfs dfs -cat /data/<data.gz> | gzip -d | hdfs dfs -put - /data/
If the file sits on your local drive, then
zcat <infile> | hdfs dfs -put - /data/
I use most of the times hdfs fuse mounts for this
So you could just do
$ cd /hdfs_mount/somewhere/
$ unzip file_in_hdfs.zip
http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_topic_28.html
Edit 1/30/16: In case if you use hdfs ACLs: In some cases fuse mounts don't adhere to hdfs ACLs, so you'll be able to do file operations that are permitted by basic unix access privileges. See https://issues.apache.org/jira/browse/HDFS-6255, comments at the bottom that I recently asked to reopen.
To stream the data through a pipe to hadoop, you need to use the hdfs command.
cat mydatafile | hdfs dfs -put - /MY/HADOOP/FILE/PATH/FILENAME.EXTENSION
gzip use -c to read data from stdin
hadoop fs -put doesnt support read the data from stdin
I tried a lots of things and would help.I cant find the zip input support of hadoop.So it left me no choice but download the hadoop file to local fs ,unzip it and upload to hdfs again.

How to store the actual name of a /*url*?

I'm converting a script to HDFS (Hadoop) and I have this cmd:
tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls
With HDFS I need to get the file using -get and this works.
bin/hadoop dfs -get $seedsDir/*url* .
However I don't know what downloaded file name is, let alone that I wanted to store in $local_seedsDir/url.
Can I know?
KISS tells me:
bin/hadoop dfs -get $seedsDir/*url* $local_seedsDir/urls
i.e. just name the file as urls locally.
url=`echo bin/hadoop dfs -get urls-input/MR6/*url* .`
then tail and head to extract from url the actual file name and store it in $urls
rm $urls
But otherwise, just KISS

Resources