I am trying to edit a large file on a Hadoop cluster and trim white space and special characters like ¦, *, #, " etc. from the file.
I don't want to copyToLocal and use sed, as I have thousands of such files to edit.
MapReduce is perfect for this. Good thing you have it in HDFS!
You say you think you can solve your problem with sed. If that's the case, then Hadoop Streaming would be a good choice for a one-off.
$ hadoop jar /path/to/hadoop/hadoop-streaming.jar \
-D mapred.reduce.tasks=0 \
-input MyLargeFiles \
-output outputdir \
-mapper "sed ..."
This will fire up a MapReduce job that applies your sed command to every line of every file. Since there are thousands of files, you'll have several mapper tasks hitting them at once, and the output goes straight back into the cluster.
Note that I set the number of reducers to 0 here because none is really needed. If you want your output to be one file, then use one reducer, but don't specify -reducer; I think that uses the identity reducer and effectively just creates one output file. The mapper-only version is definitely faster.
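For example, a sed expression along these lines should do it; the exact character class is an assumption based on the characters listed in the question (and ¦ may need tweaking depending on your locale). Test it locally first, and if the quoting inside -mapper gets awkward, put the sed call in a small script and ship it with -file:
$ echo '  some ¦ text * with # junk "  ' | \
    sed -e 's/[¦*#"]//g' -e 's/^[[:space:]]*//;s/[[:space:]]*$//'
some  text  with  junk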
Another option, which I don't think is as good but doesn't require MapReduce and is still better than copyToLocal, is to stream the data through the client node and push it back up without hitting local disk. Here's an example:
$ hadoop fs -cat MyLargeFile.txt | sed '...' | hadoop fs -put - outputfile.txt
The - in hadoop fs -put tells it to take data from stdin instead of a file.
I have an 8.8G file on the Hadoop cluster that I'm trying to extract certain lines from for testing purposes.
Seeing that Apache Hadoop 2.6.0 has no split command, how am I able to do it without having to download the file?
If the file was on a linux server I would've used:
$ csplit filename %2015-07-17%
The previous command works as desired. Is something close to that possible on Hadoop?
You could use a combination of unix and hdfs commands.
hadoop fs -cat filename.dat | head -250 > /redirect/filename
Or, if the last kilobyte of the file suffices, you could use this:
hadoop fs -tail filename.dat > /redirect/filename
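If you actually need the lines matching the pattern from your csplit example rather than a fixed number of lines, the same cat-and-pipe trick should work; the paths below are placeholders, and the second form writes the result straight back to HDFS instead of local disk:
hadoop fs -cat filename.dat | grep '2015-07-17' > /redirect/filename
hadoop fs -cat filename.dat | grep '2015-07-17' | hadoop fs -put - /user/you/extracted.dat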
This is probably a question about stream processing, but I am not able to find an elegant solution using awk.
I am running an m/r job scheduled to run once a day, but there can be multiple HDFS directories it needs to run on. For example, if 3 input directories were uploaded to HDFS for the day, then 3 m/r jobs need to run, one for each directory.
So I need a solution where I can extract the filenames from the results of:
hdfs dfs -ls /user/xxx/17-03-15*
Then iterate over the filenames, launching one m/r job for each file.
Thanks
Browsing more on the issue, I found that Hadoop provides a configuration setting for this. Here are the details.
Also, I was just having a syntax issue, and this simple awk command did what I wanted:
files=`hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}'`
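From there, a plain shell loop can launch one job per path. This is only a sketch: the jar name, main class and output-path convention below are placeholders for whatever your actual m/r job expects.
files=`hdfs dfs -ls /user/hduser/17-03-15* | awk '{print $8}'`
for f in $files; do
    # one job per input directory; your-job.jar and com.example.MyJob are made up
    hadoop jar your-job.jar com.example.MyJob "$f" "${f}_out"
done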
I have several files in my Hadoop cluster (on HDFS). I want to see the last 5 lines of every file. Is there a simple command to do so?
If you want to see the last 5 lines specifically (and not any more or any less) of a file in HDFS, you can use the following command, but it's not very efficient:
hadoop fs -cat /your/file/with/path | tail -5
Here's a more efficient command within hadoop, but it returns the last kilobyte of the data, not a user-specified number of lines:
hadoop fs -tail /your/file/with/path
Here's a reference to the hadoop tail command : http://hadoop.apache.org/docs/r0.18.3/hdfs_shell.html#tail
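Since you want this for every file, a small loop over the directory listing should cover that as well; /your/dir is a placeholder, and this assumes column 8 of the hadoop fs -ls output holds the file path:
for f in `hadoop fs -ls /your/dir | awk '{print $8}'`; do
    echo "=== $f ==="
    hadoop fs -cat "$f" | tail -5
done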
I am new to Hadoop and I am trying to figure out a way to do the following:
I have multiple input image files.
I have binary executables that process these files.
These binary executables write text files as output.
I have a folder that contains all of these executables.
I have a script which runs all of these executables in certain order, passing image location as arguments.
My question is this: can I use Hadoop Streaming to process these images via these binaries and spit out the results from the text files?
I am currently trying this.
I have my Hadoop cluster running. I uploaded my binaries and my images onto HDFS.
I have set up a script which, when Hadoop runs it, should change directory into the folder with the images and execute another script which runs the binaries.
Then the script spits out the results via stdout.
However, I can't figure out how to have my map script change into the image folder on HDFS and then execute the other script.
Can someone give me a hint?
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/7posLarge \
-output /user/hduser/output5 \
-mapper RunHadoopJob.sh \
-verbose
And my RunHadoopJob.sh:
#!/bin/bash
cd /user/hduser/7posLarge/;
/user/hduser/RunSFM/RunSFM.sh;
My HDFS looks like this:
hadoop fs -ls
Warning: $HADOOP_HOME is deprecated.
Found 4 items
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:32 /user/hduser/7posLarge
drwxr-xr-x - hduser supergroup 0 2012-11-28 17:39 /user/hduser/RunSFM
drwxr-xr-x - root supergroup 0 2012-11-30 14:32 /user/hduser/output5
I know this is not the standard use of MapReduce. I am simply looking for a way to easily spin up multiple jobs of the same program with different inputs, without writing much overhead code. It seems like this is possible, looking at the Hadoop Streaming documentation:
"How do I use Hadoop Streaming to run an arbitrary set of
(semi-)independent tasks?
Often you do not need the full power of Map Reduce, but only need to
run multiple instances of the same program - either on different parts
of the data, or on the same data, but with different parameters. You
can use Hadoop Streaming to do this. "
If this is not possible, is there another tool, on Amazon AWS for example, that can do this for me?
UPDATE:
Looks like there are similar solutions but I have trouble following them. They are here and here.
There are several issues when dealing with Hadoop Streaming and binary files:
Hadoop doesn't itself know how to process image files.
Mappers take their input from stdin line by line, so you need to create an intermediate shell script that writes the image data from stdin to a temporary file, which is then passed to the executable (a minimal sketch of such a wrapper follows after this list).
Just passing the directory location to the executables is not really efficient, since in that case you'll lose data locality. I don't want to repeat the already well-answered questions on this topic, so here are the links:
Using Amazon MapReduce/Hadoop for Image Processing
Hadoop: how to access (many) photo images to be processed by map/reduce?
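For illustration only, such a wrapper might look something like the sketch below; the binary name is made up, and it assumes the whole image arrives on stdin as one record:
#!/bin/bash
# dump whatever arrives on stdin into a temp file, hand that file to the
# (hypothetical) binary, which writes its text output to stdout
TMPFILE=$(mktemp /tmp/image.XXXXXX)
cat > "$TMPFILE"
./MyImageBinary "$TMPFILE"
rm -f "$TMPFILE"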
Another approach would be to transform the image files into splittable SequenceFiles, i.e. each record would be one image in the SequenceFile. Then, using this as the input format, the mappers would call the executables on each record they get. Note that you have to provide the executables to the TaskTracker nodes beforehand with the correct file permissions so that they can be invoked from Java code.
Some more information on this topic:
Hadoop: Example process to generating a SequenceFile with image binaries to be processed in map/reduce
I was able to use a "hack" to get a prototype of a workaround working.
I am still trying this out, and I don't think it will work on an elastic cluster, since you would have to recompile your binaries for your cluster's OS architecture. But if you have a private cluster, this may be a solution.
Using hadoop streaming you can package your binaries in .jar files and ship them to the node, where they will get unpacked before your script runs.
I have my images in pics.jar and my program, which processes all images found in the directory it is started from, in BinaryProgramFolder.jar. Inside that folder I have a script which launches the program.
My streaming job ships the images and the binary program + scripts to the node and starts them. Again, this is a workaround hack... not a "real" solution to the problem.
So,
sudo ./hadoop/bin/hadoop jar ../hduser/hadoop/contrib/streaming/hadoop-streaming-1.1.0.jar \
-archives 'hdfs://master:54310/user/hduser/pics.jar#pics','hdfs://master:54310/user/hduser/BinaryProgramFolder.jar#BinaryProgramFolder' \
-numReduceTasks 0 \
-file /home/hduser/RunHadoopJob.sh \
-input /user/hduser/input.txt \
-output /user/hduser/output \
-mapper RunHadoopJob.sh \
-verbose
Filler input file text.txt:
Filler text for streaming job.
RunHadoopJob.sh:
#!/bin/bash
cp -Hr BinaryProgramFolder ./pics; #copy the unpacked program folder (BinaryProgramFolder is a symlink created by -archives) into your pics directory
cd ./pics;
./BinaryProgramFolder/BinaryProgramLauncScript.sh <params>; #launch your program via the symlinked folder; I also used a script to launch my binary, which sat in the same folder as the launch script
NOTE: you must first put your program and images into a jar archive and then copy them to the HDFS. Use hadoop fs -copyFromLocal ./<file location> ./<hadoop fs location>
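Roughly, assuming the images live in a local pics/ directory and the program in a local BinaryProgramFolder/ directory (those local names are assumptions), the packaging and upload could look like:
jar cf pics.jar -C pics/ .
jar cf BinaryProgramFolder.jar -C BinaryProgramFolder/ .
hadoop fs -copyFromLocal pics.jar /user/hduser/pics.jar
hadoop fs -copyFromLocal BinaryProgramFolder.jar /user/hduser/BinaryProgramFolder.jar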
I am working with Hadoop 0.20.2 and would like to concatenate two files into one using the -cat shell command if possible (source: http://hadoop.apache.org/common/docs/r0.19.2/hdfs_shell.html)
Here is the command I'm submitting (names have been changed):
/path/path/path/hadoop-0.20.2> bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv > /user/username/folder/outputdirectory/
It returns bash: /user/username/folder/outputdirectory/: No such file or directory
I also tried creating that directory and then running it again -- I still got the 'no such file or directory' error.
I have also tried using the -cp command to copy both into a new folder, and -getmerge to combine them, but had no luck with getmerge either.
The reason for doing this in hadoop is that the files are massive and would take a long time to download, merge, and re-upload outside of hadoop.
The error relates to you trying to redirect the standard output of the command back to HDFS. There are ways you can do this, using the hadoop fs -put command with the source argument being a hyphen:
bin/hadoop fs -cat /user/username/folder/csv1.csv /user/username/folder/csv2.csv | hadoop fs -put - /user/username/folder/output.csv
-getmerge also outputs to the local file system, not HDFS
Unfortunately there is no efficient way to merge multiple files into one (unless you want to look into Hadoop 'appending', but in your version of Hadoop that is disabled by default and potentially buggy) without copying the files to one machine and then back into HDFS, whether you do that with:
a custom MapReduce job with a single reducer and a custom mapper/reducer that retains the file ordering (remember that each line will be sorted by the keys, so your key will need to be some combination of the input file name and line number, and the value will be the line itself), or
the FsShell commands, depending on your network topology - i.e. does your client console have a good-speed connection to the datanodes? This is certainly the least effort on your part, and it will probably complete quicker than an MR job doing the same (everything has to go through one machine anyway, so why not your local console?).
To concatenate all files in the folder to an output file:
hadoop fs -cat myfolder/* | hadoop fs -put - myfolder/output.txt
If you have multiple folders on HDFS and you want to concatenate the files in each of those folders, you can use a shell script to do this. (Note: this is not very efficient and can be slow.)
Syntax:
for i in `hadoop fs -ls <folder> | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/<outputfilename>; done
e.g.:
for i in `hadoop fs -ls my-job-folder | cut -d' ' -f19`; do hadoop fs -cat $i/* | hadoop fs -put - $i/output.csv; done
Explanation:
You basically loop over all the folders and cat each folder's contents into an output file on HDFS.