Shell script to move files into a hadoop cluster - shell

This may have been answered somewhere but I haven't found it yet.
I have a simple shell script that I'd like to use to move log files into my Hadoop cluster. The script will be called by Logrotate on a daily basis.
It fails with the following error: "/user/qradar: cannot open `/user/qradar' (No such file or directory)".
#!/bin/bash
#use today's date and time
day=$(date +%Y-%m-%d)
#change to log directory
cd /var/log/qradar
#move and add time date to file name
mv qradar.log qradar$day.log
#load file into variable
#copy file from local to hdfs cluster
if [ -f qradar$day.log ]
then
file=qradar$day.log
hadoop dfs -put /var/log/qradar/&file /user/qradar
else
echo "failed to rename and move the file into the cluster" >> /var/log/messages
fi
The directory /user/qradar does exist and can be listed with the Hadoop file commands.
I can also manually move the file into the correct directory using the Hadoop file commands. Can I move files into the cluster in this manner? Is there a better way?
Any thoughts and comments are welcome.
Thanks

Is the &file a typo in the hadoop dfs -put line?
If not, this is likely your problem. The ampersand ends the first command and runs it in the background, so the shell runs hadoop dfs -put /var/log/qradar/ in the background and then runs a second command, file /user/qradar. The file utility looks for /user/qradar on the local filesystem, which produces the error you see.
My guess is you meant for the following (dollar rather than ampersand):
hadoop dfs -put /var/log/qradar/$file /user/qradar
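A minimal sketch of the corrected upload step, covering only the put line and not the rename. The HADOOP variable and the put_log helper are hypothetical names introduced here; HADOOP defaults to "echo hadoop" so the sketch only prints the command it would run.

```shell
#!/bin/bash
# HADOOP is a hypothetical override: it defaults to "echo hadoop" so this
# sketch only prints the command. Set HADOOP=hadoop to run it for real.
HADOOP="${HADOOP:-echo hadoop}"
LOG_DIR=/var/log/qradar

put_log() {
    # $1 = local file name, $2 = HDFS destination directory
    # "$1" is expanded with a dollar sign; an ampersand here would split the command.
    $HADOOP dfs -put "$LOG_DIR/$1" "$2"
}

day=$(date +%Y-%m-%d)
file="qradar$day.log"
put_log "$file" /user/qradar
```

Quoting "$LOG_DIR/$1" also keeps the path intact if a file name ever contains spaces.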

Related

Hadoop error when outputting the grep results to a new file in a different directory

I'm trying to read the contents of a few files, use grep to find the lines matching my search query, and then write the results into a folder in another directory. I get the error "No such file or directory". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hadoop fs -cat sends its output to your local machine, and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data; it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then re-write the new file to HDFS then I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, by using something like Spark or Hive to transform data efficiently. Or failing that just do a hadoop dfs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked
hadoop fs -cat /FinalDataset/c*.txt | grep 2015-01-* | hadoop fs -put - /energydata/2015/01/output.txt
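One detail in the commands above: the pattern 2015-01-* is unquoted, so the shell may expand it as a glob, and as a regular expression -* means "zero or more hyphens". A small sketch with a quoted, anchored pattern follows. The filter_month helper is hypothetical, the sample records are made up, and the sketch assumes each record starts with a YYYY-MM-DD date, which the post does not state.

```shell
#!/bin/bash
# filter_month is a hypothetical helper. It assumes every record begins
# with a YYYY-MM-DD date (an assumption, not confirmed by the post).
filter_month() {
    # $1 = year-month prefix, e.g. 2015-01; the quotes stop shell globbing
    grep "^$1-"
}

# Made-up sample records standing in for the output of `hadoop fs -cat`.
printf '%s\n' '2015-01-03,1.2' '2015-02-01,3.4' '2014-12-31,0.9' | filter_month 2015-01
```

In the real pipeline the helper would sit between hadoop fs -cat and hadoop fs -put -, exactly where grep is in the command above.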

Copy files from Hadoop multiple directories to edge node folder

I have multiple directories in Hadoop, as follows:
/env/hdfsdata/ob/sample/partfile..
/env/hdfsdata/ob/sample_1/partfile..
/env/hdfsdata/ob/sample_2/partfile..
I am new to Hadoop and shell scripting and am looking for a way to copy the files in each sample directory (sample*) to a folder on the edge node. The files should be named as follows, assuming sample is the prefix of the file name:
sample.txt
sample_1.txt
sample_2.txt
Once the files are copied to the edge node, the respective directories have to be deleted in Hadoop. I have tried listing the directories using wildcards and then processing them with a shell script and the cat command, but I get a "no such directory found" error.
Use getmerge to create one file from many
#!/bin/bash
dl() {
FILENAME=$1
BASE_DIR='/env/hdfsdata/ob'
hadoop fs -getmerge "${BASE_DIR}/${FILENAME}/*" "${FILENAME}.txt"
}
FILENAME='sample'
dl "${FILENAME}" # sample
for i in `seq 2`; do
dl "${FILENAME}_${i}" # sample_1, sample_2
done
Regarding "new to hadoop and shell scripting":
You can use Java, Python, etc. to do the exact same thing.
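The question also asks for the HDFS directories to be deleted after the copy. A rough sketch of one way to do that: remove each directory only when its getmerge succeeded. The HADOOP variable and the fetch_and_remove helper are hypothetical names; HADOOP defaults to "echo hadoop" so that nothing is deleted until you set HADOOP=hadoop on purpose.

```shell
#!/bin/bash
# HADOOP is a hypothetical override; the default only prints the commands.
HADOOP="${HADOOP:-echo hadoop}"
BASE_DIR='/env/hdfsdata/ob'

fetch_and_remove() {
    name=$1
    # Merge the directory into one local file, then delete the HDFS
    # directory only if the merge step returned success.
    $HADOOP fs -getmerge "${BASE_DIR}/${name}" "${name}.txt" &&
        $HADOOP fs -rm -r "${BASE_DIR}/${name}"
}

for name in sample sample_1 sample_2; do
    fetch_and_remove "$name"
done
```

Chaining with && means a failed download leaves the source directory in place.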

Bashscript upload files to hdfs

I am trying to create a bash script to upload files from the local edge node filesystem to HDFS. I was wondering what a good way to add a timestamp to the file name would be. I am having some problems getting the timestamp to work.
#!/bin/bash
echo Running upload script to hdfs...
timestamp(){date +"%T"}
hdfs dfs -put /home/myname/folder1/* /user/myname/example_1_$(timestamp).txt
hdfs dfs -put /home/myname/folder2/* /user/myname/example_2_$(timestamp).txt
Using date +%T will not work, because its output contains : characters (for example 11:12:45) and creating file names with a : character is not possible in HDFS. See Hadoop-3275.
Try this command in the script:
hdfs dfs -put /home/myname/folder1/* /user/myname/example_1_`date +%H%M%S`.txt
This will create filename like /user/myname/example_1_111245.txt.
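Separately, the timestamp function in the question is not valid bash as written: the braces need surrounding whitespace and the last command inside them needs a ; or a newline. A small sketch of a working definition, reusing the colon-free format from above (the target variable is only for illustration):

```shell
#!/bin/bash
# A space after "{" and a ";" before "}" make this a valid one-line function.
timestamp() { date +%H%M%S; }

# Build the HDFS target name; the path is the one used in the question.
target="/user/myname/example_1_$(timestamp).txt"
echo "$target"
```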

How do I remove a file from HDFS

I am learning Hadoop and I have never worked on Unix before, so I am facing a problem here. What I am doing is:
$ hadoop fs -mkdir -p /user/user_name/abcd
Now I am going to put a ready-made file named file.txt in HDFS:
$ hadoop fs -put file.txt /user/user_name/abcd
The file gets stored in HDFS, since it shows up when running the -ls command.
Now I want to remove this file from HDFS. How should I do this? What command should I use?
If you run the command hadoop fs -usage you'll get a look at what commands the filesystem supports, and with hadoop fs -help you'll get a more in-depth description of them.
For removing files the command is simply -rm, with -r specified for recursively removing folders (-f additionally suppresses the error when the path does not exist). Read the command descriptions and try them out.
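Applied to the paths from the question, that could look like the sketch below. HADOOP, remove_file and remove_dir are hypothetical names; HADOOP defaults to "echo hadoop" so the commands are only printed until you set HADOOP=hadoop.

```shell
#!/bin/bash
# HADOOP is a hypothetical override; the default only prints the commands.
HADOOP="${HADOOP:-echo hadoop}"

# Remove a single file.
remove_file() { $HADOOP fs -rm "$1"; }

# Remove a directory and everything below it.
remove_dir() { $HADOOP fs -rm -r "$1"; }

remove_file /user/user_name/abcd/file.txt
remove_dir /user/user_name/abcd
```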

Hadoop commands

I have Hadoop installed in this location
/usr/local/hadoop$
Now I want to list the files inside the dfs. The command I used is :
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
This gave me the files in the dfs
Found 3 items
drwxr-xr-x - hduser supergroup 0 2014-03-20 03:53 /user/hduser/gutenberg
drwxr-xr-x - hduser supergroup 0 2014-03-24 22:34 /user/hduser/mytext-output
-rw-r--r-- 1 hduser supergroup 126 2014-03-24 22:30 /user/hduser/text.txt
Next time, I tried the same command in a different manner
hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls
It also gave me the same result.
Could someone please explain why both work, even though the hadoop command is run from different folders? I hope my question is clear. What is the difference between these two?
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls
In Unix an executable can be run in two ways: by giving its absolute or relative path, or by name if its directory is listed in the PATH variable.
To execute bin/hadoop dfs -ls you must be inside the directory /usr/local/hadoop. Alternatively, /usr/local/hadoop/bin/hadoop dfs -ls works from any directory.
PATH is an environment variable that holds the list of directories searched for executables; a typical default is /usr/local/bin:/bin:/usr/bin:/usr/local/sbin:/usr/sbin:/sbin. Whenever you run a command such as ls or mkdir, it is found in one of the directories in PATH. When you type hadoop, it is taken from /usr/local/hadoop/bin/ because you have added that directory to PATH. Use the following command to check the value of your PATH variable:
echo $PATH
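The two ways of running an executable can be tried without a Hadoop install. The sketch below uses a throwaway directory and a stub script; the name hadoop-demo and the demo_root variable are made up for illustration.

```shell
#!/bin/bash
# Create a temporary "installation" with a stub executable in its bin/ folder.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bin"
printf '#!/bin/sh\necho "stub called"\n' > "$demo_root/bin/hadoop-demo"
chmod +x "$demo_root/bin/hadoop-demo"

# 1) Relative path: only works while the current directory is $demo_root.
cd "$demo_root" && bin/hadoop-demo

# 2) By name: works from any directory once bin/ is listed in PATH.
PATH="$demo_root/bin:$PATH"
cd / && hadoop-demo

# Show which file the shell resolves the name to.
command -v hadoop-demo
```

Both calls run the same file; only the way the shell locates it differs.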
You set the Hadoop path (HADOOP_HOME) globally in your ~/.bashrc file so that Hadoop commands work anywhere in the terminal.
In both cases you get the same result because HADOOP_HOME/bin is already on the PATH set in your .bashrc file. You can check the entry with nano ~/.bashrc. This is done for convenience, so that commands can be executed from the terminal irrespective of the current directory.
If you remove the HADOOP_HOME/bin entry from the .bashrc file, the second form (plain hadoop) will no longer work.
First of all, you are executing the same command. bin/hadoop in the Hadoop installation directory and hadoop are the same executable. To confirm this, check your .bashrc file, where you must have added the Hadoop executable path.
When you call hadoop, you are calling /usr/local/hadoop/bin/hadoop.
If your question is about the output of ls:
you are executing ls on the Hadoop file system, not on the local file system, so it shows the content available in the Hadoop file system. For more details, go to localhost:50070 and check the content.
In both cases you get the same result because HADOOP_HOME/bin is already set in your .bashrc file. In Unix an executable can be run in two ways: by giving its absolute or relative path, or by name if it is in the system's executable path.
To run "bin/hadoop dfs -ls" you must be inside the directory /usr/local/hadoop.
To locate the executable file associated with the hadoop command, just run:
which hadoop
This will print out the location of the hadoop command used.
At a certain point during the Hadoop installation you configure the HDFS filesystem, and eventually you format it using hdfs namenode -format. From that point on, dfs refers not to your own filesystem but to the HDFS filesystem. When you execute hadoop dfs -ls it displays the user's home directory on HDFS. It doesn't matter where you are located on the host filesystem when you execute the command, because the host filesystem is not being used.
However, it is possible not to configure HDFS, in which case Hadoop uses the local filesystem. Either way hadoop dfs -ls displays the contents of the user's home directory.
On that note, if you remove the user directory /user/hduser and execute hadoop dfs -ls, it will give you an error because the user directory does not exist.
Source:
https://amodernstory.com/2014/09/23/installing-hadoop-on-mac-osx-yosemite/
Another:
This is related to OS and not Hadoop.
Whenever you run a command without an explicit path, the OS searches the locations listed in the PATH variable. In your case, during the installation of Hadoop you must have set variables something like the following in your user profile (.bashrc or .profile):
export HADOOP_INSTALL=/usr/local/hadoop
export PATH=$HADOOP_INSTALL/bin:$PATH
So whenever you type the following, the shell looks in $HADOOP_INSTALL/bin, because you have added that directory to the PATH variable:
hadoop dfs -ls
And when you type the following, it uses a path relative to the current folder, which in your snippet is /usr/local/hadoop; the file bin/hadoop sits just under it:
bin/hadoop dfs -ls
Hence in both cases the same file is executed; it is located via the PATH variable in one case and via a path relative to the current directory in the other.
It's because of the PATH variable you set while configuring Hadoop. PATH contains the bin directory of your Hadoop home, so even if you do not specify the bin/ path, the command is found through PATH.
Both commands do the same thing.
Check your /etc/bashrc or /root/.bashrc
There you will find HADOOP_HOME set and its bin path added to the PATH variable.
With this set, we can execute the hadoop commands from anywhere on the command line. It has no other use.
It works because you are not executing the shell's "ls" command; you are passing "ls" as a parameter to the "hadoop" command. In both cases the exact same command with the same parameters is executed, i.e. 'hadoop dfs -ls'. The only difference is that in one case you qualify it with a path and in the other you do not, and the latter works because the directory containing 'hadoop' must be in your $PATH environment variable.
The command hadoop dfs -ls is the equivalent of the ls command for the HDFS file system. We can treat it like the ls command in Linux/Unix.
When you are logged in as the hadoop user,
just type hado in the terminal and press the TAB key. If it completes to hadoop, the hadoop command is on your PATH,
so your ~/.bashrc file is also set up properly.
It means that when you use this command from any directory
hadoop dfs -ls /
it will list the files and directories at the root of HDFS.
The hadoop executable is in the bin/ folder of the installation, so it is the exact same command as long as that bin/ directory is listed in $PATH. PATH is usually set in the ~/.bashrc file; try cat ~/.bashrc to take a look.
There is one simple concept in Linux: the .bashrc file is tied to the terminal. When you open a terminal, it loads all the variables from the .bashrc file. Any executable whose directory is appended to the PATH variable defined in .bashrc can be run from whatever directory you are currently in.
Now to answer your question: since hadoop fs -ls / works for you, the executable is in your PATH. In the latter case, bin/hadoop fs -ls /, you are manually pointing to the folder where the executable lives. This is the same as running any executable in Linux.
Just set all the hadoop, yarn and other paths in the .bashrc file, and the commands will run from anywhere.
standard hadoop>bin/hadoop fs -ls
See the Hadoop forum below for more details on Hadoop.
http://tekzak.com/forum/viewforum.php?f=2&sid=5d01e2e3c27aebc6e7ee95447ef328a4
The bin directory under HADOOP_HOME has probably been added to PATH. In that case the hadoop command works regardless of the directory it is executed from.
Otherwise you will have to give the absolute path of the installation followed by bin/hadoop dfs -ls:
Absolute_path/bin/hadoop dfs -ls
Because you set your Hadoop path in the .bashrc file of your hadoop user, you don't need to navigate to the bin folder. The command that works from the bin folder also works from the current user's home folder.
The command hadoop fs -ls lists the files and directories in your home directory in the Hadoop file system (HDFS) rather than in your current local directory. As long as nothing in that HDFS directory changes, the result of this command stays the same.
You already added hadoop/bin to the PATH when you installed Hadoop on your machine, so the command resolves to the same executable wherever you run it on your system.
Therefore, since you did not change anything in HDFS and you used the same command, you get exactly the same result.
Both commands, bin/hadoop dfs -ls and hadoop dfs -ls, work for you because the directory of the hadoop executable (/usr/local/hadoop/bin/hadoop) is in the $PATH variable for the user "hduser".
To understand it further, you can open a terminal and remove the value ($HADOOP_HOME/bin or /usr/local/hadoop/bin) from $PATH using the export command in Linux. If you do this, the second command (hadoop dfs -ls) won't work for that terminal session.
In your case hadoop is present at bin/ (within /usr/local/hadoop), so you may execute it as bin/hadoop from /usr/local/hadoop (which is the current location in the above example).
You are also able to execute it directly without specifying the relative/absolute path because the location of hadoop is added to PATH.
You may check this by running which hadoop and printing PATH (echo $PATH).
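The experiment of taking the directory out of PATH can be sketched with a stub instead of a real Hadoop install. The names hadoop-demo, demo_root and saved_path are hypothetical; the sketch restores the earlier PATH value rather than editing the string in place.

```shell
#!/bin/bash
# Stub "installation" with one executable in bin/.
demo_root=$(mktemp -d)
mkdir -p "$demo_root/bin"
printf '#!/bin/sh\necho ok\n' > "$demo_root/bin/hadoop-demo"
chmod +x "$demo_root/bin/hadoop-demo"

saved_path=$PATH
export PATH="$demo_root/bin:$PATH"
command -v hadoop-demo            # found by name through PATH

export PATH=$saved_path           # drop the entry for this session
command -v hadoop-demo || echo "hadoop-demo: no longer found by name"

"$demo_root/bin/hadoop-demo"      # the explicit path still works
```

The change only affects the current shell session; a new terminal reads .bashrc again.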
Depending on how Hadoop is installed, its binaries may be added to the /usr/bin folder.
Any binary in /bin, /sbin or /usr/bin is available from any directory and to any user in Unix, because those directories are on the default PATH. See https://askubuntu.com/questions/571617/what-is-the-purpose-of-the-bin-directory
Just to add, there are differences between the folders /bin, /sbin, etc.; they are explained here (https://askubuntu.com/questions/308045/differences-between-bin-sbin-usr-bin-usr-sbin-usr-local-bin-usr-local)
It's because of the Linux environment setup in .bashrc:
1) export HADOOP_PREFIX=/usr/local/hadoop/
2) export PATH=$PATH:$HADOOP_PREFIX/bin
After doing this we need to run the command
exec bash
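Because .bashrc is read by every new interactive shell, a plain PATH=$PATH:... line appends the same entry again each time the file is sourced. A sketch of a guarded append follows; path_append is a hypothetical helper, not something Hadoop provides.

```shell
#!/bin/bash
# path_append (hypothetical helper) adds a directory to PATH only if it
# is not listed yet, so re-sourcing ~/.bashrc does not grow PATH.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                # already present, nothing to do
        *) PATH="$PATH:$1" ;;
    esac
}

HADOOP_PREFIX=/usr/local/hadoop     # install location from the question
path_append "$HADOOP_PREFIX/bin"
path_append "$HADOOP_PREFIX/bin"    # second call is a no-op
export PATH
echo "$PATH"
```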
It's quite likely that you have exported $HADOOP_HOME/bin in the PATH variable.
For example, on EMR it would be
export HADOOP_PREFIX=/usr/lib/hadoop
export PATH=$PATH:$HADOOP_PREFIX/bin
You can check the PATH to find out.
It is because you already exported the Hadoop path when you installed Hadoop. Now you can either go to the exact Hadoop path or just type hadoop from wherever you are. It will work both ways.
I will answer your question from 2 perspectives.
hduser@ubuntu:/usr/local/hadoop$ bin/hadoop dfs -ls
hduser@ubuntu:/usr/local/hadoop$ hadoop dfs -ls
How do these 2 commands work in the same way?
Because your environment is aware of the hadoop commands: your $PATH variable contains the bin directory of the Hadoop installation.
Why do both return the same result?
Because you are listing a directory that resides in HDFS. When you execute the hadoop dfs -ls command, it lists the items in the current user's home directory on HDFS. In your case it lists the contents of the hduser user directory.
Hope it answers your question.
Both commands give the same result because the Hadoop home path was already set up when Hadoop was installed.
So it works whether you give the full path to the executable or just type the command name.

Resources