How to send files to HDFS while keeping their basename - FTP

Can someone suggest the best solution to ship files from different sources and store them in HDFS based on their names? My situation is:
I have a server that has a large number of files and I need to send them to HDFS.
I actually used Flume; in its config I tried spooldir and ftp as sources, but both of them have disadvantages.
So, any idea how to do that?

Use the hadoop put command:
put
Usage: hadoop fs -put [-f] [-p] [-l] [-d] [ - | <localsrc> .. ] <dst>
Copy single src, or multiple srcs from the local file system to the destination file system. Also reads input from stdin and writes to the destination file system if the source is set to "-".
Copying fails if the file already exists, unless the -f flag is given.
Options:
-p : Preserves access and modification times, ownership and the permissions. (assuming the permissions can be propagated across filesystems)
-f : Overwrites the destination if it already exists.
-l : Allow DataNode to lazily persist the file to disk. Forces a replication factor of 1. This flag will result in reduced durability. Use with care.
-d : Skip creation of temporary file with the suffix .COPYING.
Examples:
hadoop fs -put localfile /user/hadoop/hadoopfile
hadoop fs -put -f localfile1 localfile2 /user/hadoop/hadoopdir
hadoop fs -put -d localfile hdfs://nn.example.com/hadoop/hadoopfile
hadoop fs -put - hdfs://nn.example.com/hadoop/hadoopfile Reads the input from stdin.
Exit Code:
Returns 0 on success and -1 on error.
https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/FileSystemShell.html#put
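For the original question, a minimal sketch of a shell loop that uploads every file from a local directory while keeping each file's basename; the directory paths here are assumptions, not from the question:

```shell
#!/bin/sh
# Upload each file from a local landing directory to HDFS,
# keeping the file's basename. SRC_DIR and DST_DIR are hypothetical paths.
SRC_DIR=/data/incoming
DST_DIR=/user/hadoop/incoming
for f in "$SRC_DIR"/*; do
  # -f overwrites an existing destination file instead of failing
  hadoop fs -put -f "$f" "$DST_DIR/$(basename "$f")"
done
```

Scheduling a loop like this (e.g. via cron) is the simplest replacement for a Flume spooldir source when only the filenames need to be preserved.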

Related

Check file size in hdfs

I am able to retrieve the size of an HDFS file using the following command:
hadoop fs -du -s /user/demouser/first/prod123.txt | cut -d ' ' -f 1
which gives me the output 82 (which is in bytes).
Now I want to merge this file with another file only if its size is less than 100 MB. I am using a shell script to write all these commands in a single file.
How do I convert it into MB and then compare the size? Is there any specific command for that?
Simply use:
hdfs dfs -du -h /path/to/file
I tried the same on my cluster by copying your command. The only possible mistake is that you're using hadoop fs; just use hdfs dfs and make sure you're logged in as an HDFS user.
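For the "convert and compare" part inside the shell script, one approach is to skip the unit conversion entirely and compare the byte count against 100 MB expressed in bytes. A sketch, using the file path from the question (the merge step itself is left out):

```shell
#!/bin/sh
# Compare an HDFS file's size against a 100 MB limit, working in bytes
# so the raw output of -du -s can be used directly.
FILE=/user/demouser/first/prod123.txt
LIMIT=$((100 * 1024 * 1024))   # 100 MB in bytes
SIZE=$(hadoop fs -du -s "$FILE" | awk '{print $1}')
if [ "$SIZE" -lt "$LIMIT" ]; then
  echo "under 100 MB, ok to merge"
fi
```

Using awk instead of cut avoids problems when the du output is separated by multiple spaces or tabs.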

Hadoop Crontab Put

I'm trying to program a simple task with crontab: copying some files from local to HDFS. My code is this:
#!/bin/ksh
ANIO=$(date +"%Y")
MES=$(date +"%m")
DIA=$(date +"%d")
HORA=$(date +"%H")
# LOCAL AND HDFS DIRECTORIES
DIRECTORIO_LOCAL="/home/cloudera/bicing/data/$ANIO/$MES/$DIA/stations"$ANIO$MES$DIA$HORA"*"
DIRECTORIO_HDFS="/bicing/data/$ANIO/$MES/$DIA/"
# Test if the destination directory exists and create it if necessary
echo "hdfs dfs -test -d $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -test -d $DIRECTORIO_HDFS
if [ $? != 0 ]
then
echo "hdfs dfs -mkdir -p $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -mkdir -p $DIRECTORIO_HDFS
fi
# Upload the files to HDFS
echo "hdfs dfs -put $DIRECTORIO_LOCAL $DIRECTORIO_HDFS">>/home/cloudera/bicing/data/logFile
hdfs dfs -put $DIRECTORIO_LOCAL $DIRECTORIO_HDFS
As you can see it's quite simple: it only defines the folder variables, creates the directory in HDFS (if it doesn't exist) and copies the files from local to HDFS.
The script works if I launch it directly in the Terminal, but when I schedule it with crontab it doesn't "put" the files in HDFS.
Moreover, the script creates a "logFile" with the commands that should have been executed. When I copy them into the Terminal they work perfectly.
hdfs dfs -test -d /bicing/data/2015/12/10/
hdfs dfs -mkdir -p /bicing/data/2015/12/10/
hdfs dfs -put /home/cloudera/bicing/data/2015/12/10/stations2015121022* /bicing/data/2015/12/10/
I have checked the directories and files, but I can't find the key to solve it.
Thanks in advance!!!
When you execute these commands on the console, they run fine because "HADOOP_HOME" is set. But when the cron job runs, the "HADOOP_HOME" environment variable is most likely not available.
You can resolve this problem in 2 ways:
In the script, add the following statements at the beginning. This will put the Hadoop commands on your PATH (the original listing used backslashes and semicolons, which don't work in a ksh script on Linux; the PATH separator is a colon and the directory separator a forward slash).
export HADOOP_HOME={Path to your HADOOP_HOME}
export PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin
You can also update your .profile (present in $HOME/.profile) or .kshrc (present in $HOME/.kshrc) to include the HADOOP paths.
That should solve your problem.
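As a concrete illustration of the second way, a hypothetical crontab entry that sources the profile before running the script, so the interactive environment is reproduced under cron (the script path and schedule are assumptions):

```shell
# m h dom mon dow   command
# Source the profile so HADOOP_HOME and PATH are set, then run the
# upload script hourly, logging output for debugging.
0 * * * * . "$HOME/.profile"; /home/cloudera/bicing/upload.sh >> /home/cloudera/bicing/cron.log 2>&1
```

Redirecting both stdout and stderr to a log file makes the "works in Terminal, fails in cron" class of problem much easier to diagnose.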

How to change replication factor while running copyFromLocal command?

I'm not asking how to set the replication factor in Hadoop for a folder/file. I know the following command works flawlessly for existing files & folders.
hadoop fs -setrep -R -w 3 <folder-path>
I'm asking how to set the replication factor, other than the default (which is 4 in my scenario), while copying data from local. I'm running the following command,
hadoop fs -copyFromLocal <src> <dest>
When I run the above command, it copies the data from src to the dest path with a replication factor of 4. But I want the replication factor to be 1 while copying the data, not after copying is complete. Basically I want something like this,
hadoop fs -setrep -R 1 -copyFromLocal <src> <dest>
I tried it, but it didn't work. So, can it be done? Or do I have to first copy the data with replication factor 4 and then run the setrep command?
According to this post and this post (both asking different questions), this command seems to work:
hadoop fs -D dfs.replication=1 -copyFromLocal <src> <dest>
The -D option means "Use value for given property."

How to unzip file in hadoop?

I was trying to unzip a zip file stored in the Hadoop file system and store it back in the Hadoop file system. I tried the following commands, but none of them worked.
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop fs -put - /tmp
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp/
hadoop fs -cat /tmp/test.zip|gzip -d|hadoop put - /tmp
I get errors like gzip: stdin has more than one entry--rest ignored, cat: Unable to write to output stream., and Error: Could not find or load main class put when I run those commands. Any help?
Edit 1: I don't have access to a UI, so only the command line is available. The unzip/gzip utilities are installed on my Hadoop machine. I'm using Hadoop version 2.4.0.
To unzip a gzipped (or bzipped) file, I use the following:
hdfs dfs -cat /data/<data.gz> | gzip -d | hdfs dfs -put - /data/
If the file sits on your local drive, then:
zcat <infile> | hdfs dfs -put - /data/
I mostly use HDFS fuse mounts for this, so you could just do:
$ cd /hdfs_mount/somewhere/
$ unzip file_in_hdfs.zip
http://www.cloudera.com/content/www/en-us/documentation/archive/cdh/4-x/4-7-1/CDH4-Installation-Guide/cdh4ig_topic_28.html
Edit 1/30/16: In case you use HDFS ACLs: in some cases fuse mounts don't adhere to HDFS ACLs, so you'll be able to do file operations that are permitted by basic Unix access privileges. See https://issues.apache.org/jira/browse/HDFS-6255, comments at the bottom, which I recently asked to reopen.
To stream the data through a pipe to hadoop, you need to use the hdfs command.
cat mydatafile | hdfs dfs -put - /MY/HADOOP/FILE/PATH/FILENAME.EXTENSION
gzip with -c reads data from stdin and writes to stdout.
hadoop fs -put doesn't support reading the data from stdin.
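The root of the "more than one entry" error is that a .zip file is a multi-entry archive, not a gzip stream, so gzip -d cannot unpack it. Assuming the Info-ZIP funzip utility is installed (it is not part of Hadoop) and the archive holds a single file, a streaming sketch that never touches local disk would be:

```shell
# funzip extracts the first entry of a zip archive read from stdin
# and writes it to stdout, so it can sit in the middle of the pipe.
# /tmp/test.zip is the path from the question; the output name is hypothetical.
hadoop fs -cat /tmp/test.zip | funzip | hadoop fs -put - /tmp/test_unzipped
```

For archives with several entries this only recovers the first one; in that case extracting locally (or via a fuse mount, as suggested above) is still needed.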
I tried a lot of things and nothing would help. I can't find zip input support in Hadoop, so it left me no choice but to download the Hadoop file to the local fs, unzip it, and upload it to HDFS again.

How to store the actual name of a /*url*?

I'm converting a script to HDFS (Hadoop) and I have this command:
tail -n+$indexedPlus1 $seedsDir/*url* | head -n$it_size > $it_seedsDir/urls
With HDFS I need to get the file using -get, and this works:
bin/hadoop dfs -get $seedsDir/*url* .
However, I don't know what the downloaded file's name is, let alone that I wanted to store it in $local_seedsDir/urls.
Can I know it?
KISS tells me:
bin/hadoop dfs -get $seedsDir/*url* $local_seedsDir/urls
i.e. just name the file urls locally.
url=`echo bin/hadoop dfs -get urls-input/MR6/*url* .`
then use tail and head to extract the actual file name from $url and store it in $urls
rm $urls
But otherwise, just KISS
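If the actual matched name is still needed, one sketch (assuming a single match; $seedsDir and $local_seedsDir are the question's variables, the listing format is that of hadoop dfs -ls) is to ask HDFS for the listing first and take the path from its last column:

```shell
#!/bin/sh
# List the matching file in HDFS, take the full path from the listing's
# last column (tail skips the "Found N items" header line), then
# download it locally under its own basename.
SRC=$(bin/hadoop dfs -ls $seedsDir/*url* | awk '{print $NF}' | tail -n1)
bin/hadoop dfs -get "$SRC" "$local_seedsDir/$(basename "$SRC")"
```

This keeps the original basename locally, while the plain -get into $local_seedsDir/urls remains the simpler option when the name doesn't matter.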
