This is a very basic question concerning the output files generated from running the Grep utility against an HDFS directory. Essentially, I've included the grep command inside a simple shell script, which is supposed to search this directory for a given string - which is a parameter to the script. The contents of the script are as follows:
#!/bin/bash
set -e
cd "$HADOOP_HOME"
bin/hadoop org.apache.hadoop.examples.Grep \
    "hdfs://localhost:9000/user/hduser" "hdfs://localhost:9000/user/hduser/out" "$1"
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
bin/hadoop fs -rm -r "hdfs://localhost:9000/user/hduser/out"
The results sent to the HDFS out directory are copied across to a local directory by the second-to-last line. I've deliberately placed two files in this HDFS directory, only one of which contains multiple instances of the string I'm searching for. What ends up in my /opt/data/out directory are the following two files.
_SUCCESS
part-r-00000
The jobs look like they ran successfully; however, the only content I'm seeing between both files is in the "part-r-00000" file, and it's literally the following.
29472 e
I suppose I was naively hoping to see the filename where the string was located, and perhaps a count of the number of times it occurred.
My question is: how and where are these values typically returned from the Hadoop grep command? I've looked through the console output while the MapReduce jobs were running, and there's no reference to the file name containing the search string. Any pointers as to how I can access this information would be appreciated, as I'm unsure how to interpret "29472 e".
If I understand correctly:
You have some job output in HDFS, which you copy to your local filesystem.
You then want to count occurrences of a string in those files.
In that case, add the following after this line in your script:
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
grep -c $1 /opt/data/out/*
This command does what you expect: it prints each file name along with the count of matching lines in that file.
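As a quick local illustration (with made-up files and contents), grep -c given multiple files prints one filename:count pair per file:

```shell
# minimal local demo of the grep -c output format (hypothetical files)
mkdir -p /tmp/grep_demo
printf 'error here\nall good\nerror again\n' > /tmp/grep_demo/a.txt
printf 'all good\n' > /tmp/grep_demo/b.txt
grep -c error /tmp/grep_demo/a.txt /tmp/grep_demo/b.txt
# prints:
# /tmp/grep_demo/a.txt:2
# /tmp/grep_demo/b.txt:0
```

One caveat: grep -c counts matching lines, not total occurrences, so a line containing the string twice still counts once; use grep -o with wc -l if you need occurrence counts.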
Related
I'm trying to read the contents of a few files and, using grep, find the lines with my search query, then output the results into a folder in another directory. I get the error "No such file or directory exists". I have created the folder structure and the text file.
hadoop fs -cat /Final_Dataset/c*.txt | grep 2015-01-* > /energydata/2015/01/01.txt
ERROR:
-bash: /energydata/2015/01/01.txt: No such file or directory
> /energydata/2015/01/01.txt means that the output is being redirected to a local file. hadoop fs -cat sends output to your local machine, and at that point you're no longer operating within Hadoop. grep simply acts on a stream of data; it doesn't care (or know) where it came from.
You need to make sure that /energydata/2015/01/ exists locally before you run this command. You can create it with mkdir -p /energydata/2015/01/.
If you're looking to pull certain records from a file on HDFS and then re-write the new file to HDFS, I'd suggest not cat-ing the file and instead keeping the processing entirely on the cluster, using something like Spark or Hive to transform data efficiently. Failing that, just do a hadoop fs -put <local_path> /energydata/2015/01/01.txt.
The following CLI command worked
hadoop fs -cat /FinalDataset/c*.txt | grep 2015-01-* | hadoop fs -put - /energydata/2015/01/output.txt
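For the local-redirect variant, here is a sketch (with hypothetical local paths, and plain cat standing in for hadoop fs -cat so it runs without a cluster) that creates the target directory first and quotes the grep pattern so the shell doesn't glob-expand it:

```shell
# hypothetical local paths; cat stands in for 'hadoop fs -cat' here
mkdir -p /tmp/energydata/2015/01
printf '2015-01-01,5kWh\n2014-12-31,3kWh\n' > /tmp/c1.txt
# quote the pattern: unquoted, 2015-01-* could be expanded by the shell
cat /tmp/c1.txt | grep '2015-01-' > /tmp/energydata/2015/01/01.txt
cat /tmp/energydata/2015/01/01.txt
# prints: 2015-01-01,5kWh
```

The mkdir -p is the actual fix for the "No such file or directory" error: the shell creates the redirect target file, but only if its parent directory already exists.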
Does the Hadoop filesystem shell support moving an empty directory?
Assume that I have an empty directory and execute the command below.
hadoop fs -mv /user/abc/* /user/xyz/*
When I am executing the above command , it is giving me the error
'/user/abc/*' does not exists.
However, if I put some data inside /user/abc, it executes successfully.
Does anyone know how to handle an empty directory?
Is there any alternative way to execute the above command without getting an error?
hadoop fs -mv /user/abc/* /user/xyz
The destination path doesn't need the /* suffix.
I think you want to rename the directory. You can also use this:
hadoop fs -mv /user/abc /user/xyz
Because your xyz directory is empty, you don't get an error; but if xyz already contains files, you may get an error as well.
This answer should be correct I believe.
hadoop fs -mv /user/abc /user/xyz
'*' is a wildcard, so it looks for any file inside the folder. When nothing is found, it returns the error.
As per the documentation for the move command:
When you move a file, all links to other files remain intact, except when you move it to a different file system.
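The wildcard behavior is easy to reproduce locally (hypothetical paths; the shell's glob expansion stands in for the analogous expansion Hadoop performs over HDFS paths). When a glob matches nothing, the literal pattern is passed through and the command fails:

```shell
# demo: a glob over an empty directory matches nothing (hypothetical path)
mkdir -p /tmp/glob_demo/abc
ls /tmp/glob_demo/abc/* 2>&1 || true
# the shell passes the literal '/tmp/glob_demo/abc/*' through to ls,
# which reports "No such file or directory" -- analogous to the HDFS error
```

Dropping the wildcard and moving the directory itself, as in the answer above, sidesteps the problem entirely.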
I am trying to determine whether a number of paths exist in HDFS using: hdfs dfs -test -e filename
However, as I begin to use wildcard * characters in a path to search through all of my directories to find a specific file (for example: /*/*/*/*/*/fileName, etc.), this significantly slows down the process as the test function searches through all directories. While I need this power to be able to find files (I am using a company hadoop cluster where I do not know where a specific file is stored), I would like to know if there is any way to speed up the process similar to this question about php.
I originally used hdfs dfs -test -e filename as opposed to hdfs dfs -ls filename to test that a file exists after reading this link, because I could easily determine whether a file existed without throwing an error, but I am willing to change my code given a better alternative.
I would like to know the best (least time consuming) way to determine if a file exists in hadoop fs given a variable file path location.
Can I perform the search starting from the most current modified directory given the variable file path location?
Is it possible to terminate the search once the file has been found or will the search continue running until all paths have been checked?
Since I am only checking for file existence before I pass the file into a mapreduce job to avoid errors when the job tries to read from the file, should I forget about this time consuming check and simply attempt to catch an error thrown when a file path does not exist?
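On the last point, the usual pattern is to rely on the exit status of the test command rather than parsing any output. A generic sketch (using the local test -e as a stand-in for hdfs dfs -test -e, which signals existence through its exit status in the same way, so it runs without a cluster):

```shell
# exit-status existence check; on a cluster you'd use: hdfs dfs -test -e "$path"
path=/tmp/exists_demo.txt
touch "$path"
if test -e "$path"; then
  echo "found: $path"
else
  echo "missing: $path"
fi
# prints: found: /tmp/exists_demo.txt
```

Skipping the upfront check and simply letting the MapReduce job fail on a missing input is also viable, but a cheap exit-status check like this keeps the error handling in your submission script.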
Created a folder [LOAN_DATA] with below command
hadoop fs -mkdir hdfs://masterNode:8020/tmp/hadoop-hadoop/dfs/LOAN_DATA
Now using the web UI when I list the contents of directory /tmp/hadoop-hadoop/dfs, it shows LOAN_DATA.
But when I want to store some data from a TXT file into the LOAN_DATA directory using put or copyFromLocal, I get
put: Unknown command
Command used:
hadoop fs –put '/home/hadoop/my_work/Acquisition_2012Q1.txt' hdfs://masterNode:8020/tmp/hadoop-hadoop/dfs/LOAN_DATA
How to resolve this issue?
This issue may occur when you copy-paste a command and run it. It is because of a change in character set (often disguised as a font change) in the document from which it was copied.
For example:
If you copy/paste and execute the command -
hdfs dfs -put workflow.xml /testfile/workflow.xml
You may get-
–put: Unknown command
OR
–p-t: Unknown command
This happens because the copy was made from a UTF-8 document, and the - or u (or any of the characters) you copied may belong to a different character set even though it looks the same.
So just type the command on the terminal (don't copy/paste) and you should be fine.
Alternatively, if you are running a shell script which was copied from
some other editor then run a dos2unix on the script before running it
on the Linux terminal.
Eg: dos2unix <shell_script.sh>
I tried your command, and it appears there is a typo in it: 'hadoop fs –put ...'.
Instead of '–put', use '-put' (or '-copyFromLocal'). The problem is with '–'; the correct character is '-'. Given that, the error is obvious :-)
Here is my example (using a get command instead of put):
$ hadoop fs –get /tmp/hadoop-data/output/* /tmp/hadoop-data/output/
–get: Unknown command
$ hadoop fs -get /tmp/hadoop-data/output/* /tmp/hadoop-data/output/
get: `/tmp/hadoop-data/output/part-r-00000': File exists
Anand's answer is, of course, correct. But it might not have been a typo but rather a subtle trap. Often when people are learning a new technology, they copy and paste commands from websites and blogs. Often, what was originally entered as a hyphen will be copied as a dash. Dashes differ from hyphens only in that they are a tad longer, so the mistake is hard to spot, but since a dash is a completely different character, the command is wrong, that is, "not found".
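One quick way to catch this trap (a local sketch with a hypothetical script file) is to search the script for the UTF-8 byte sequence of the en dash, the character most commonly substituted for a hyphen:

```shell
# create a script containing a smart dash (en dash, U+2013) for the demo
printf 'hadoop fs \342\200\223put file.txt /dest\n' > /tmp/dash_demo.sh
# search for the en dash's UTF-8 bytes (octal 342 200 223)
grep -n "$(printf '\342\200\223')" /tmp/dash_demo.sh
# prints: 1:hadoop fs –put file.txt /dest
```

If grep prints a line number, the script contains a smart dash and should be retyped or run through dos2unix as suggested above.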
I have created a folder to drop the result file from a Pig process using the STORE command. It works the first time, but the second time it complains that the folder already exists. What is the best practice for this situation? Documentation is sparse on this topic.
My next step will be to rename the folder to the original file name, to reduce the impact of this. Any thoughts?
You can execute fs commands from within Pig, and should be able to delete the directory by issuing an fs -rmr command before running the STORE command:
fs -rmr dir
STORE A into 'dir' using PigStorage();
The only subtlety is that the fs command doesn't expect quotes around the directory name, whereas the STORE command does.