Reduce the time to determine hdfs dfs file existence - performance

I am trying to determine whether a number of paths exist in HDFS using: hdfs dfs -test -e filename
However, as soon as I use wildcard (*) characters in a path to search through all of my directories for a specific file (for example: /*/*/*/*/*/fileName, etc.), the process slows down significantly because the test has to search through every directory. I need this power to be able to find files (I am using a company Hadoop cluster where I do not know where a specific file is stored), but I would like to know if there is any way to speed up the process, similar to this question about PHP.
I originally used hdfs dfs -test -e filename rather than hdfs dfs -ls filename to test whether a file exists, after reading this link, because it let me check for existence without an error being thrown, but I am willing to change my code given a better alternative.
I would like to know the best (least time-consuming) way to determine whether a file exists in HDFS, given a variable file path location.
Can I perform the search starting from the most recently modified directory, given the variable file path location?
Is it possible to terminate the search once the file has been found, or will the search continue running until all paths have been checked?
Since I am only checking for file existence before passing the file into a MapReduce job (to avoid errors when the job tries to read from the file), should I forget about this time-consuming check and simply attempt to catch the error thrown when a file path does not exist?
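If you end up going with the last option, a minimal bash sketch of "just run the job and handle the failure" might look like the following; the jar name, main class, and output path are placeholders, and the exact wording of the error reported for a missing or unmatched input path can vary between Hadoop versions:
# placeholders: my-job.jar, com.example.MyJob and the output path are not real names
input='/*/*/*/*/*/fileName'   # single quotes stop the local shell from expanding the glob
output=/user/me/output
if ! hadoop jar my-job.jar com.example.MyJob "$input" "$output" > job.log 2>&1; then
    # FileInputFormat usually reports a missing input as "Input path does not exist"
    # or, for an unmatched glob, "matches 0 files"; adjust the pattern to your version
    if grep -Eq 'Input path does not exist|matches 0 files' job.log; then
        echo "no input matched $input, skipping job" >&2
    else
        echo "job failed for another reason; see job.log" >&2
        exit 1
    fi
fi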

Related

How to interpret Hadoop Grep command output

This is a very basic question concerning the output files generated from running the Grep utility inside an HDFS directory. Essentially, I've included the grep command inside a simple shell script, which is supposed to search this directory for a given string, which is passed as a parameter to the script. The contents of the script are as follows:
#!/bin/bash
set -e
cd "$HADOOP_HOME"
# run the Grep example over the user directory, writing results to an out directory in HDFS
bin/hadoop org.apache.hadoop.examples.Grep \
    "hdfs://localhost:9000/user/hduser" "hdfs://localhost:9000/user/hduser/out" "$1"
# copy the results to the local file system and remove the HDFS output directory
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
bin/hadoop fs -rm -r "hdfs://localhost:9000/user/hduser/out"
The results sent to the HDFS out directory are copied across to a local directory in the second-to-last line. I've deliberately placed two files in this HDFS directory, only one of which contains multiple instances of the string I'm searching for. What ends up in my /opt/data/out directory are the following two files:
_SUCCESS
part-r-00000
The jobs look like they ran successfully; however, the only content I'm seeing across both files is in the part-r-00000 file, and it is literally the following:
29472 e
I suppose I was naively hoping to see the filename where the string was located, and perhaps a count of the number of times it occurred.
My question is: how and where are these values typically returned from the hadoop grep command? I've looked through the console output while the MapReduce jobs were running, and there's no reference to the file name where the search string was found. Any pointers as to how I can access this information would be appreciated, as I'm unsure how to interpret "29472 e".
As I understand it, you have some job output in HDFS, which you copy to your local file system, and you then want to count occurrences of a string in those files.
In that case, add the following grep command after this line in your script:
bin/hadoop fs -get "hdfs://localhost:9000/user/hduser/out/*" "/opt/data/out/"
grep -c $1 /opt/data/out/*
This command will do what you expect: it will print each file name along with the number of matching lines found in that file.
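For example (the search string and counts here are hypothetical), grep -c prints one filename:count pair per file, counting matching lines:
# hypothetical search string and counts
$ grep -c hello /opt/data/out/*
/opt/data/out/_SUCCESS:0
/opt/data/out/part-r-00000:3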

How to place a file downloaded from a webpage directly into HDFS, without using the local file system?

I need some help. I am downloading a file from a webpage using Python code, placing it in the local file system, then transferring it into HDFS using the put command and performing operations on it.
But there may be situations where the file is very large and downloading it to the local file system is not the right approach, so I want the file to be downloaded directly into HDFS without using the local file system at all.
Can anyone suggest which would be the best method to proceed?
If there are any errors in my question, please correct me.
You can pipe it directly from a download to avoid writing it to disk, e.g.:
curl server.com/my/file | hdfs dfs -put - destination/file
The - parameter to -put tells it to read from stdin (see the documentation).
This will still route the download through your local machine, though, just not through your local file system. If you want to download the file without using your local machine at all, you can write a map-only MapReduce job whose tasks accept, for example, an input file containing a list of files to download, then fetch them and stream out the results. Note that this requires your cluster to have open access to the internet, which is generally not desirable.
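If you do go the map-only route, one way to sketch it without writing Java is Hadoop Streaming. The jar location, paths, and file names below are assumptions (the streaming jar location varies by distribution), and this only works sensibly for text content:
# urls.txt in HDFS holds one URL per line; each map task fetches its share of URLs
# with curl, and the fetched text becomes the job output in HDFS (no reducers)
hadoop jar "$HADOOP_HOME"/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -D mapreduce.job.reduces=0 \
    -input /user/me/urls.txt \
    -output /user/me/downloaded \
    -mapper 'xargs -n1 curl -sL'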

Combine Map output for directory to one file

I have a requirement where I have to merge the output of the mappers for a directory into a single file. Let's say I have a directory A which contains 3 files.
../A/1.txt
../A/2.txt
../A/3.txt
I need to run a mapper to process these files, which should generate one output file. I know a reducer will do that, but I don't want to use reducer logic.
OR
Can I have only one mapper process all the files under a directory?
If you set up fuse to mount your HDFS to a local directory, then your output can be the mounted filesystem.
For example, I have our HDFS mounted to /mnt/hdfs locally. I run the following command and it works great:
hadoop fs -getmerge /reports/some_output /mnt/hdfs/reports/some_output.txt
Of course, there are other reasons to use fuse to mount HDFS to a local directory, but this was a nice side effect for us.
Can I have only one mapper process all the files under a directory?
Have you looked into CombineFileInputFormat? Felix Ren-Chyan Chern writes about setting it up in some detail.

Hadoop Pig cannot store to an existing folder

I have created a folder to drop the result file from a Pig process using the STORE command. It works the first time, but the second time it complains that the folder already exists. What is the best practice for this situation? Documentation is sparse on this topic.
My next step will be to rename the folder to the original file name, to reduce the impact of this. Any thoughts?
You can execute fs commands from within Pig, and should be able to delete the directory by issuing a fs -rmr command before running the STORE command:
fs -rmr dir
STORE A into 'dir' using PigStorage();
The only subtlety is that the fs command doesn't expect quotes around the directory name, whereas the STORE command does.

How can I run the wordCount example in Hadoop?

I'm trying to run the following example in hadoop: http://hadoop.apache.org/common/docs/current/mapred_tutorial.html
However, I don't understand the commands that are being used, specifically how to create an input file, upload it to HDFS, and then run the word count example.
I'm trying the following command:
bin/hadoop fs -put inputFolder/inputFile inputHDFS/
however it says
put: File inputFolder/inputFile does not exist
I have this folder inside the hadoop folder (the directory that contains "bin"), so why is this happening?
thanks :)
Hopefully this isn't overkill:
Assuming you've installed hadoop (in either local, distributed, or pseudo-distributed mode), you have to make sure hadoop's bin directory and other miscellaneous settings are in your path. On Linux/Mac this is a simple matter of adding the following to one of your shell files (~/.bashrc, ~/.zshrc, ~/.bash_profile, etc. - depending on your setup and preferences):
export HADOOP_INSTALL_DIR=/path/to/hadoop # /opt/hadoop or /usr/local/hadoop, for example
export JAVA_HOME=/path/to/jvm
export PATH=$PATH:$HADOOP_INSTALL_DIR/bin
export PATH=$PATH:$HADOOP_INSTALL_DIR/sbin
Then run exec $SHELL or reload your terminal. To verify that the hadoop command is available, type hadoop version and check that no errors are raised. Assuming you followed the instructions on how to set up a single node cluster and started the hadoop services with the start-all.sh command, you should be good to go:
In pseudo-dist mode, your file system pretends to be HDFS. So just reference any path like you would with any other linux command, like cat or grep. This is useful for testing, and you don't have to copy anything around.
With an actual HDFS running, I use the copyFromLocal command (I find that it just works):
$ hadoop fs -copyFromLocal ~/data/testfile.txt /user/hadoopuser/data/
Here I've assumed you're performing the copy on a machine that is part of the cluster. Note that if your hadoopuser is the same as your unix username, you can drop the /user/hadoopuser/ part - it is implicitly assumed that everything is done inside your HDFS user dir. Also, if you're using a client machine to run commands on a cluster (you can do that too!), know that you'll need to pass the cluster's configuration using the -conf flag right after hadoop fs, like:
# assumes your username is the same as the one on HDFS, as explained earlier
$ hadoop fs -conf ~/conf/hadoop-cluster.xml -copyFromLocal ~/data/testfile.txt data/
For the input file, you can use any file(s) that contain text. I used some random files from the Gutenberg site.
Lastly, to run the wordcount example (which comes as a jar in the hadoop distribution), just run the command:
$ hadoop jar /path/to/hadoop-*-examples.jar wordcount /user/hadoopuser/data/ /user/hadoopuser/output/wc
This will read everything in the data/ folder (it can have one or many files) and write everything to the output/wc folder, all on HDFS. If you run this in pseudo-dist mode, there is no need to copy anything; just point it to the proper input and output dirs. Make sure the wc dir doesn't exist beforehand or your job will crash (it cannot write over an existing dir). See this for a better wordcount breakdown.
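For example, before re-running the job you can delete the previous output directory (the path matches the example above; older Hadoop releases use -rmr instead of -rm -r):
# remove the previous run's output so the job can create the directory again
$ hadoop fs -rm -r /user/hadoopuser/output/wc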
Again, all this assumes you've made it through the setup stages successfully (no small feat).
Hope this wasn't too confusing - good luck!
