Copy files from multiple Hadoop directories to an edge node folder - bash

I have multiple directories in Hadoop, as follows:
/env/hdfsdata/ob/sample/partfile..
/env/hdfsdata/ob/sample_1/partfile..
/env/hdfsdata/ob/sample_2/partfile..
I am new to Hadoop and shell scripting, and I'm looking for a way to copy the files present in the sample directories (sample*) onto an edge node folder location. The files should be named as follows, assuming sample is the prefix for the file name:
sample.txt
sample_1.txt
sample_2.txt
Once the files are copied onto the edge node location, the respective directories have to be deleted in Hadoop. I have tried listing the directories using wildcards and then processing them with a shell script and the cat command, but I'm running into a "no such file or directory" error.

Use getmerge to create one file from many
#!/bin/bash

# merge all part files under /env/hdfsdata/ob/<name> into a single local <name>.txt
dl() {
    FILENAME=$1
    BASE_DIR='/env/hdfsdata/ob'
    hadoop fs -getmerge "${BASE_DIR}/${FILENAME}/*" "${FILENAME}.txt"
}

FILENAME='sample'
dl "${FILENAME}"          # sample
for i in $(seq 2); do
    dl "${FILENAME}_${i}" # sample_1, sample_2
done
As for being "new to hadoop and shell scripting": you can use Java/Python/etc. to do the exact same thing.
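The question also asks for the source directories to be removed once the copy succeeds; that isn't covered above, so here is a minimal sketch of that extra step (dl_and_clean is a made-up name, and a non-empty local file is used as a rough success check):
# hypothetical variant: merge, then delete the HDFS directory only if the
# merged local file exists and is non-empty
dl_and_clean() {
    FILENAME=$1
    BASE_DIR='/env/hdfsdata/ob'
    hadoop fs -getmerge "${BASE_DIR}/${FILENAME}/*" "${FILENAME}.txt"
    if [ -s "${FILENAME}.txt" ]; then
        hadoop fs -rm -r "${BASE_DIR}/${FILENAME}"
    fi
}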

Related

How to delete all the directories with a word in their names in HDFS? (using the command line)

I want to delete all the directories in HDFS with a specific word in their names. Note that the directories are in different locations under a common parent directory. Is there a way to do this?
I have tried the following, but it didn't work:
hdfs dfs -rm -r /user/myUser/*toFind*
The result was:
rm: `/user/myUser/toFind': No such file or directory
This command works fine for me; my cluster is CDH 5.6.0 with Apache Hadoop 2.6.0.
Are you sure there is any file or directory whose name contains "toFind"? Also note that this wildcard does not recurse into subdirectories.
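If the matching directories sit at different depths, one possible workaround (a sketch, assuming /user/myUser is the common parent, "toFind" is the word, and the paths contain no spaces) is to list recursively and filter, rather than relying on a single-level wildcard:
# list everything under the parent recursively, keep only directories (lines
# starting with 'd') whose path contains the word, then delete each one
# (-r is GNU xargs' no-run-if-empty flag)
hdfs dfs -ls -R /user/myUser \
    | awk '$1 ~ /^d/ && $8 ~ /toFind/ {print $8}' \
    | xargs -r -n1 hdfs dfs -rm -r
Replacing the final hdfs dfs -rm -r with echo first is a cheap way to preview what would be deleted.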

How to interpret Hadoop Grep command output

This is a very basic question concerning the output files generated from running the Grep utility against an HDFS directory. Essentially, I've included the Grep command inside a simple shell script, which is supposed to search this directory for a given string, passed as a parameter to the script. The contents of the script are as follows:
#!/bin/bash
set -e
cd "$HADOOP_HOME"
bin/hadoop org.apache.hadoop.examples.Grep \
    "hdfs://localhost:9000/user/hduser" "hdfs://localhost:9000/user/hduser/out" "$1"
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
bin/hadoop fs -rm -r "hdfs://localhost:9000/user/hduser/out"
The results sent to the HDFS out directory are copied across to a local directory in the second-to-last line. I've deliberately placed two files in this HDFS directory, only one of which contains multiple instances of the string I'm searching for. What ends up in my /opt/data/out directory are the following two files:
_SUCCESS
part-r-00000
The jobs look like they ran successfully; however, the only content I'm seeing across both files is in the part-r-00000 file, and it's literally the following:
29472 e
I suppose I was naively hoping to see the filename where the string was located, and perhaps a count of the number of times it occurred.
My question is: how and where are these values typically returned from the Hadoop Grep command? I've looked through the console output while the MapReduce jobs were running, and there's no reference to the file name where the search string is stored. Any pointers as to how I can access this information would be appreciated, as I'm unsure how to interpret "29472 e".
If I understand correctly:
You have some job output in HDFS, which you copy to your local filesystem.
You are then trying to get the count of a string in those files.
In that case, add the following after the line below:
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
grep -c "$1" /opt/data/out/*
This will do what you expect: it prints each file name along with the count of matching lines found in that file.
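If the goal is instead to see which of the original HDFS input files contains the string, one hedged alternative (paths taken from the question; it streams each file to the client and assumes paths without spaces, so it only makes sense for small inputs) is to cat each input file and count matches locally:
# sketch: print "<hdfs file>: <number of matching lines>" for each input file
for f in $(hadoop fs -ls "hdfs://localhost:9000/user/hduser" | awk '{print $8}'); do
    count=$(hadoop fs -cat "$f" 2>/dev/null | grep -c "$1")
    echo "$f: $count"
done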

hadoop distcp issue while copying a single file

(Note: I need to use distcp to get parallelism)
I have 2 files in the /user/bhavesh folder and 1 file in the /user/bhavesh1 folder.
Copying the 2 files from /user/bhavesh to the /user/uday folder works fine; this creates the /user/uday folder.
Copying the 1 file from /user/bhavesh1 to /user/uday1 creates a file named uday1 instead of a folder.
What I need is: if there is one file /user/bhavesh1/emp1.csv, it should create /user/uday1/emp1.csv (uday1 should be created as a directory). Any suggestion or help is highly appreciated.
In Unix systems, when you copy a single file and give a destination directory name ending with /, the destination directory will be created; however, the hadoop fs -cp command will fail if the destination directory is missing.
When it comes to hdfs distcp, a trailing / on file/dir names is ignored when the source is a single file. One workaround is to create the destination directory before executing the distcp command; you may add the -p option to -mkdir to avoid a "directory already exists" error.
hadoop fs -mkdir -p /user/uday1 ; hadoop distcp /user/bhavesh1/emp*.csv /user/uday1/
This works for both a single file and multiple files in the source directory.
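A small hedged wrapper around that workaround (the function name is made up; the paths are the ones from the question):
#!/bin/bash
# sketch: make sure the destination exists as a directory before distcp runs
distcp_into_dir() {
    src=$1
    dest_dir=$2
    hadoop fs -mkdir -p "$dest_dir"
    hadoop distcp "$src" "$dest_dir/"
}

distcp_into_dir '/user/bhavesh1/emp*.csv' /user/uday1   # the glob is resolved against HDFS by distcp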

Add multiple files to distributed cache in HIVE

I currently have an issue adding a folder's contents to Hive's distributed cache. I can successfully add multiple files to the distributed cache in Hive using:
ADD FILE /folder/file1.ext;
ADD FILE /folder/file2.ext;
ADD FILE /folder/file3.ext;
etc.
I also see that there is an ADD FILES (plural) option, which in my mind means you could specify a directory like ADD FILES /folder/; and everything in the folder gets included (this works with Hadoop Streaming's -files option). But this does not work with Hive. Right now I have to explicitly add each file.
Am I doing this wrong? Is there a way to add a whole folder's contents to the distributed cache?
P.S. I tried wildcards (ADD FILE /folder/* and ADD FILES /folder/*), but those fail too.
Edit:
As of Hive 0.11 this is now supported, so:
ADD FILE /folder
now works.
What I'm doing now is passing the folder location to the Hive script as a parameter:
$ hive -f my-query.hql -hiveconf folder=/folder
and in the my-query.hql file:
ADD FILE ${hiveconf:folder}
Nice and tidy now!
Add doesn't support directories, but as a workaround you can zip the files and then add the zip to the distributed cache as an archive (ADD ARCHIVE my.zip). When the job is running, the content of the archive is unpacked into the local job directory on the slave nodes (see the mapred.job.classpath.archives property).
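A hedged sketch of that archive route (the zip name and paths are made up for illustration):
# zip the folder's files (-j stores them without the directory prefix), then
# register the archive from the Hive CLI; it gets unpacked on the worker nodes
zip -j /tmp/my.zip /folder/*
hive -e "ADD ARCHIVE /tmp/my.zip; LIST ARCHIVES;"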
If the number of files you want to pass is relatively small and you don't want to deal with archives, you can also write a small script that generates the ADD FILE commands for all the files in a given directory. E.g.:
#!/bin/bash
# list.sh: print an ADD FILE statement for every file in the given directory
if [ ! "$1" ]; then
    echo "Directory is missing!"
    exit 1
fi
ls -d "$1"/* | while read f; do echo "ADD FILE $f;"; done
Then invoke it from the Hive shell and execute the generated output:
!/home/user/list.sh /path/to/files
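If copy-pasting the generated output is a nuisance, one hedged alternative (the file names here are made up) is to write it to a file and let the Hive CLI run it as an initialization script:
# generate the ADD FILE statements once, then pass them to hive as an init file
/home/user/list.sh /path/to/files > /tmp/add_files.hql
hive -i /tmp/add_files.hql -f my-query.hql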
Well, in my case, I had to move a folder with child folders and files in it.
I used ADD ARCHIVE xxx.gz, which added the archive but did not unpack it on the slave machines.
Instead, ADD FILE <folder_name_without_trailing_slash> actually copies the whole folder recursively to the slaves.
Credit: the comments helped with the debugging.
Hope this helps!

Shell script to move files into a hadoop cluster

This may have been answered somewhere but I haven't found it yet.
I have a simple shell script that I'd like to use to move log files into my Hadoop cluster. The script will be called by Logrotate on a daily basis.
It fails with the following error: "/user/qradar: cannot open `/user/qradar' (No such file or directory)".
#!/bin/bash
#use today's date and time
day=$(date +%Y-%m-%d)
#change to log directory
cd /var/log/qradar
#move and add time date to file name
mv qradar.log qradar$day.log
#load file into variable
#copy file from local to hdfs cluster
if [ -f qradar$day.log ]
then
    file=qradar$day.log
    hadoop dfs -put /var/log/qradar/&file /user/qradar
else
    echo "failed to rename and move the file into the cluster" >> /var/log/messages
fi
The directory /user/qradar does exist and can be listed with the Hadoop file commands.
I can also manually move the file into the correct directory using the Hadoop file commands. Can I move files into the cluster in this manner? Is there a better way?
Any thoughts and comments are welcome.
Thanks
Is the &file a typo in the hadoop dfs -put line?
If not, then this is likely your problem: you're running the command hadoop dfs -put /var/log/qradar/ in the background (the ampersand runs the command in the background), followed by the command file /user/qradar, which the shell looks up on the local path.
My guess is you meant for the following (dollar rather than ampersand):
hadoop dfs -put /var/log/qradar/$file /user/qradar
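For reference, a hedged version of the question's full script with that fix applied (same paths as the question; the only intentional changes are $file in place of &file plus quoting):
#!/bin/bash
# use today's date in the rotated file name
day=$(date +%Y-%m-%d)

# rename today's log, then push it into HDFS
cd /var/log/qradar
mv qradar.log "qradar$day.log"

if [ -f "qradar$day.log" ]
then
    file="qradar$day.log"
    hadoop dfs -put "/var/log/qradar/$file" /user/qradar
else
    echo "failed to rename and move the file into the cluster" >> /var/log/messages
fi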
