How to store the actual name of a $seedsDir/*url* file? - bash

I'm converting a script to use HDFS (Hadoop) and I have this command:
tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls
With HDFS I need to get the file using -get and this works.
bin/hadoop dfs -get $seedsDir/*url* .
However, I don't know what the downloaded file's name is, and I want to store it in $local_seedsDir/url.
How can I find out the name?
KISS tells me:
bin/hadoop dfs -get $seedsDir/*url* $local_seedsDir/urls
i.e. just name the file as urls locally.

One option is to capture the expanded command with:
url=`echo bin/hadoop dfs -get urls-input/MR6/*url* .`
then use tail and head to extract the actual file name from $url, store it in $urls, and remove the downloaded copy afterwards with:
rm $urls
But otherwise, just KISS
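If the goal is just to know which HDFS file the glob matched, an alternative sketch is to ask HDFS for the listing first and keep the resolved path in a variable. This assumes the glob matches exactly one file whose name contains no spaces; $seedsDir and $local_seedsDir are the variables from the question.
# resolve the glob on the HDFS side; the last field of each listing line is the path
hdfs_url_file=$(bin/hadoop dfs -ls "$seedsDir/*url*" | awk 'NF>=8 {print $NF}' | head -n1)
echo "matched HDFS file: $hdfs_url_file"
# fetch it under a fixed local name, as in the KISS variant
bin/hadoop dfs -get "$hdfs_url_file" "$local_seedsDir/urls"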

Related

Check file size in hdfs

I am able to retrieve the size of an HDFS file using the following command:
hadoop fs -du -s /user/demouser/first/prod123.txt | cut -d ' ' -f 1
which gives me the output 82 (which is in bytes).
Now I want to merge this file with another file only if its size is less than 100 MB. I am using a shell script to write all these commands in a single file.
How do I convert it into MB and then compare the size? Is there any specific command for that?
Simply use:
hdfs dfs -du -h /path/to/file
I tried the same on my cluster by copying your command. The only possible mistake is that you're using hadoop fs; just use hdfs dfs and make sure you're logged in as an HDFS user.
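If what you need is the actual comparison against 100 MB inside the shell script, one hedged sketch is to skip the unit conversion entirely and compare byte counts. The path is the one from the question; 104857600 is 100 * 1024 * 1024 bytes, and awk is used instead of cut because -du may separate its columns with more than one space.
size=$(hadoop fs -du -s /user/demouser/first/prod123.txt | awk '{print $1}')
if [ "$size" -lt 104857600 ]; then
    echo "smaller than 100 MB - ok to merge"
    # ... merge commands would go here ...
else
    echo "100 MB or larger - skipping merge"
fi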

how to send files to hdfs while keeping their basename

Can someone suggest to me the best solution to ship files from different sources and store them in HDFS based on their names? My situation is:
I have a server that holds a large number of files and I need to send them to HDFS.
Actually I used Flume; in its config I tried spooldir and ftp as sources, but both of them have disadvantages.
So, any idea how to do that?
Use the hadoop put command:
put
Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc1> .. ] <dst>
Copy single src, or multiple srcs from local file system to the destination file system. Also reads input from stdin and writes to destination file system if the source is set to “-”
Copying fails if the file already exists, unless the -f flag is given.
Options:
-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk, Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix .COPYING.
Examples:
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
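As a hedged sketch of how -put keeps basenames when the destination is a directory, something along these lines could work; the directory names below are placeholders, not from the question.
local_dir=/data/incoming        # placeholder: where the files arrive on the server
hdfs_dir=/user/demouser/landing # placeholder: target directory in HDFS
hadoop fs -mkdir -p "$hdfs_dir"
for f in "$local_dir"/*; do
    # because the destination is a directory, each file keeps its basename;
    # -f overwrites a file of the same name if it is already there
    hadoop fs -put -f "$f" "$hdfs_dir/"
done
A single hadoop fs -put -f "$local_dir"/* "$hdfs_dir/" would also work; the loop just makes it easier to log or retry individual files.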

How to change replication factor while running copyFromLocal command?

I'm not asking how to set the replication factor in hadoop for a folder/file. I know the following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how I set the replication factor, other than the default (which is 4 in my scenario), while copying data from local. I'm running the following command,
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while copying the data, not after copying is complete. Basically I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I have to first copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."
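A hedged sketch combining that property with copyFromLocal, plus a quick check that the new file really has replication factor 1; the paths are placeholders, and %r in hadoop fs -stat prints a file's replication factor.
hadoop fs -D dfs.replication=1 -copyFromLocal localfile.csv /user/demouser/localfile.csv
# verify the replication factor of the file just written
hadoop fs -stat %r /user/demouser/localfile.csv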

How to unzip file in hadoop?

I was trying to unzip a zip file, stored in the Hadoop file system, & store it back in the Hadoop file system. I tried the following commands, but none of them worked.
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp
When I run those commands I get errors like gzip: stdin has more than one entry--rest ignored, cat: Unable to write to output stream., and Error: Could not find or load main class put. Any help?
Edit 1: I don't have access to a UI, so only the command line is an option. The unzip/gzip utilities are installed on my Hadoop machine. I'm using Hadoop version 2.4.0.
To unzip a gzipped (or bzipped) file, I use the following
hdfs dfs -cat /data/<data.gz> | gzip -d | hdfs dfs -put - /data/
If the file sits on your local drive, then
zcat <infile> | hdfs dfs -put - /data/
Most of the time I use hdfs fuse mounts for this.
So you could just do
$ cd /hdfs_mount/somewhere/
$ unzip file_in_hdfs.zip
http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_topic_28.html
Edit 1/30/16: In case you use hdfs ACLs: in some cases fuse mounts don't adhere to hdfs ACLs, so you'll be able to do file operations that are permitted by the basic unix access privileges. See https://issues.apache.org/jira/browse/HDFS-6255, the comments at the bottom, which I recently asked to reopen.
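As a hedged way to see that caveat in practice, you can compare what HDFS itself reports with what the fuse mount exposes; the paths reuse the example above and assume the mount maps the HDFS root to /hdfs_mount, and -getfacl needs Hadoop 2.4+.
# HDFS view: permission bits plus any extended ACL entries
hdfs dfs -getfacl /somewhere/file_in_hdfs.zip
# fuse mount view: only the basic unix permission bits show up here
ls -l /hdfs_mount/somewhere/file_in_hdfs.zip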
To stream the data through a pipe to hadoop, you need to use the hdfs command.
cat mydatafile | hdfs dfs -put - /MY/HADOOP/FILE/PATH/FILENAME.EXTENSION
gzip can use -c to read the data from stdin, but hadoop fs -put doesn't support reading the data from stdin.
I tried a lot of things and nothing would help. I can't find zip input support in Hadoop, so it left me no choice but to download the file from HDFS to the local fs, unzip it, and upload it to HDFS again.
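A hedged sketch of that "download, unzip, re-upload" fallback, using /tmp/test.zip from the question and a local scratch directory as a placeholder:
mkdir -p /tmp/zipwork
hadoop fs -get /tmp/test.zip /tmp/zipwork/test.zip   # copy the archive out of HDFS
unzip -o /tmp/zipwork/test.zip -d /tmp/zipwork/out   # extract locally
hadoop fs -put -f /tmp/zipwork/out/* /tmp/           # push the extracted files back
rm -rf /tmp/zipwork                                  # clean up the local scratch space
If the archive holds a single entry and the funzip utility happens to be installed, a streaming variant such as hadoop fs -cat /tmp/test.zip | funzip | hadoop fs -put - /tmp/test.txt may avoid the local round trip, since funzip (unlike gzip) understands the zip format on stdin.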

Hadoop -copyFromLocal cannot find input file

sudo -u hdfs hadoop fs -copyFromLocal input.csv input.csv
copyFromLocal: `input.csv': No such file or directory
Can anyone tell me the exact reason why I am getting this kind of error? I gave all permissions to the input.csv file and I even changed the owner to hdfs. I am new to Hadoop and HBase.
In this case you are trying to read the file as the hdfs user, which may not have permission to view this file. To test, do this:
sudo -u hdfs cat input.csv
If you get permission denied, you either need to change the permissions of this file so the hdfs user can read it (or if it already has read permissions, move the file to a directory that the hdfs user can read), or use a different user that has permission to access the local and the remote directories/files.
You need to make sure that user hdfs has read permission to all the parent directories of input.csv along the path.
syntax: hadoop dfs -copyFromLocal completelocalfilesystempath hdfspath
Example: let input.csv be in the local path /usr/examples and the hdfs path where it needs to be copied be /usr/input; then the command will be
hadoop dfs -copyFromLocal /usr/examples/input.csv /usr/input/
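To check the parent-directory permissions mentioned above, a quick hedged sketch follows; namei comes from util-linux and may not be available everywhere, and the path reuses the example.
# show owner/permissions of every path component; the hdfs user needs
# execute permission on each directory and read permission on the file itself
namei -l /usr/examples/input.csv
# or simply test whether the hdfs user can actually read it
sudo -u hdfs cat /usr/examples/input.csv > /dev/null && echo "readable by hdfs"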
To copy files you can run it like this:
cat input.csv | sudo -u hdfs hadoop fs -put - input.csv
