I have installed a Hadoop 0.20.2 single-node cluster on Ubuntu 10.04 and ran an example following the tutorial I found on this site:
http://www.dscripts.net/wiki/setup-hadoop-ubuntu-single-node
Now I am trying to run the Sort example on Hadoop. It needs sequence files as input. Could anyone please help me run the Sort example, or give me some more information on how to generate the sequence files it needs as input?
Thank you in advance.. ;-)
Running Sort Benchmark
To use the sort example as a benchmark, generate 10 GB/node of random data using RandomWriter, then sort that data using the sort example. This provides a sort benchmark that scales with the size of the cluster. By default, the sort example uses 1.0 * capacity for the number of reduces; depending on your cluster, you may see better results at 1.75 * capacity.
The commands are:
$> bin/hadoop jar hadoop-*-examples.jar randomwriter /path/randFiles
$> bin/hadoop jar hadoop-*-examples.jar sort /path/randFiles /path/resultFile
The first command will generate the unsorted data in the /path/randFiles directory. The second command will read that data, sort it, and write it to the /path/resultFile directory.
Take a look at the RandomWriter example. It is a job that outputs sequence files of random data. The key is the job.setOutputFormat(SequenceFileOutputFormat.class) line, which specifies the output format.
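If you just want to generate a small sequence file by hand to experiment with, here is a rough sketch using SequenceFile.Writer (the class name, path, record count, and record sizes below are made up; RandomWriter is the more realistic data generator for the benchmark):

import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;

public class WriteRandomSeqFile {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Hypothetical output path; the sort example can then read this directory.
        Path out = new Path(args.length > 0 ? args[0] : "/path/randFiles/part-00000");

        SequenceFile.Writer writer =
            SequenceFile.createWriter(fs, conf, out, BytesWritable.class, BytesWritable.class);
        Random random = new Random();
        try {
            // Write a small number of random key/value records.
            for (int i = 0; i < 1000; i++) {
                byte[] key = new byte[10];
                byte[] value = new byte[100];
                random.nextBytes(key);
                random.nextBytes(value);
                writer.append(new BytesWritable(key), new BytesWritable(value));
            }
        } finally {
            writer.close();
        }
    }
}

You can then point the sort example at the directory containing this file, just like in the commands above.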
I am trying to install Open edX Insights on an Azure instance. Both the LMS and Insights are on the same box. As part of the installation, I have installed Hadoop, Hive, etc. through a yml script. The next instruction is to test the Hadoop installation, and for that the document asks to compute the value of "pi". For that they have given the following command:
hadoop jar hadoop*/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 2 100
But after I run this command, it gives the following error:
hadoop#MillionEdx:~$ hadoop jar hadoop*/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 2 100
Unknown program 'hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar' chosen.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files.
bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
I tried many things, like giving the fully qualified name for pi, but I never got the value of pi. Please suggest some workaround.
Thanks in advance.
You need to check that your Java version is compatible with Hadoop MapReduce 2.7.2.
This error can occur because the jar file does not match the Java version.
I'm running the Mahout Streaming K-Means algorithm on a cluster and I'm getting only one file as output.
I'm new to Mahout/Hadoop, but if I understood correctly there should be more than one file, since the job is split across multiple nodes.
If I'm correct, why isn't that so in my case?
Could it be that I have too little data, so the processing is done on one machine? Or have I messed something up when running the job (paths for Hadoop or something like that), and is that the reason why it runs on a single machine?
Hadoop manages data chunking (i.e., splitting a file into multiple blocks).
This means that from your perspective (i.e., from HDFS), there is one file. However, on the datanodes' file systems, there are many.
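As a side note, you can see this for yourself from Java. A rough sketch (the file path argument is just a placeholder) that asks HDFS how one logical file is laid out as blocks:

import java.util.Arrays;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ShowBlocks {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // One logical HDFS file...
        FileStatus status = fs.getFileStatus(new Path(args[0]));
        // ...backed by possibly many physical blocks on different datanodes.
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("offset=" + block.getOffset()
                    + " length=" + block.getLength()
                    + " hosts=" + Arrays.toString(block.getHosts()));
        }
    }
}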
I am trying to run k-means using Hadoop. I want to save the centroids of the clusters calculated in the cleanup method of the Reducer to some file, say centroids.txt. Now, I would like to know what will happen if multiple reducers' cleanup methods start at the same time and all of them try to write to this file simultaneously. Will it be handled internally? If not, is there a way to synchronize this task?
Note that this is not the output file of my reducer. It is an additional file that I am maintaining to keep track of the centroids. I am using a BufferedWriter from the reducer's cleanup method to do this.
Yes, you are right: you cannot achieve that using the existing framework. cleanup will be called many times, and you cannot synchronize the writes. Possible approaches you can follow are:
1. Call merge after a successful job:
hadoop fs -getmerge <src> <localdst> [addnl]
2. Clearly specify where your output file(s) should go. Use this folder as input to your next job.
3. Chain one more MR job, where map and reduce don't change the data and the partitioner assigns all data to a single reducer (see the sketch below).
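For approach 3, a minimal sketch of such a partitioner (the class name and the key/value types are just an example):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every record to partition 0, so the chained job produces a single output file.
public class SingleReducerPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        return 0;
    }
}

In practice, simply calling job.setNumReduceTasks(1) on the chained job has the same effect, since one reduce task means one output file.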
Each reducer writes to a separate file. Multiple reducers can never modify the same file.
Since the centroids are relatively few, you can write them into ZooKeeper. If you had a high read/write load, you would probably need HBase (which you could also use here, but it would be overkill).
Also note that there are several k-means implementations for Hadoop, such as Mahout's. Some alternatives are more efficient than MapReduce, like Apache Hama, which uses BSP, or Spark, which runs in memory.
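A rough sketch of the ZooKeeper suggestion above (the connection string, the znode paths, and the string encoding of a centroid are all assumptions for illustration):

import java.nio.charset.StandardCharsets;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class CentroidStore {
    public static void main(String[] args) throws Exception {
        // Hypothetical connection string; adjust to your ZooKeeper ensemble.
        ZooKeeper zk = new ZooKeeper("localhost:2181", 30000, event -> { });

        // Make sure the parent node exists.
        if (zk.exists("/kmeans", false) == null) {
            zk.create("/kmeans", new byte[0], ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each reducer can create its own sequential node, so there is no
        // shared-file contention like with a single centroids.txt.
        byte[] centroid = "1.0,2.5,3.7".getBytes(StandardCharsets.UTF_8);
        zk.create("/kmeans/centroid-", centroid,
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT_SEQUENTIAL);

        zk.close();
    }
}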
I am very new to the Hadoop world and struggling to achieve one simple task.
Can anybody please tell me how to get the top N values in the word count example using only MapReduce code?
I do not want to use any Hadoop command for this simple task.
You have two obvious options:
Have two MapReduce jobs:
WordCount: counts all the words (pretty much the example exactly)
TopN: A MapReduce job that finds the top N of something (here are some examples: source code, blog post)
Have the output of WordCount write to HDFS. Then, have TopN read that output. This is called job chaining and there are a number of ways to solve this problem: oozie, bash scripts, firing two jobs from your driver, etc.
The reason you need two jobs is you are doing two aggregations: one is word count, and the second is topN. Typically in MapReduce each aggregation requires its own MapReduce job.
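For the "firing two jobs from your driver" variant, here is a rough sketch of the driver (the class name and paths are made up, and the mapper/reducer classes are left out as placeholders):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountThenTopN {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path counts = new Path(args[1]);  // intermediate word counts
        Path topN = new Path(args[2]);    // final top-N output

        Job wordCount = Job.getInstance(conf, "word count");
        wordCount.setJarByClass(WordCountThenTopN.class);
        // wordCount.setMapperClass(...); wordCount.setReducerClass(...); etc.
        FileInputFormat.addInputPath(wordCount, input);
        FileOutputFormat.setOutputPath(wordCount, counts);
        if (!wordCount.waitForCompletion(true)) {
            System.exit(1);  // stop the chain if the first aggregation fails
        }

        Job topNJob = Job.getInstance(conf, "top n");
        topNJob.setJarByClass(WordCountThenTopN.class);
        // topNJob.setMapperClass(...); topNJob.setReducerClass(...); etc.
        FileInputFormat.addInputPath(topNJob, counts);  // reads the first job's output
        FileOutputFormat.setOutputPath(topNJob, topN);
        System.exit(topNJob.waitForCompletion(true) ? 0 : 1);
    }
}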
The second obvious option: first, have your WordCount job run on the data, then use some bash to pull the top N out.
hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n20
sort -n -k2 -r says "sort numerically by column #2, in descending order". head -n20 pulls the top twenty.
This is the better option for WordCount, just because WordCount will probably only output on the order of thousands or tens of thousands of lines, and you don't need a MapReduce job for that. Remember that just because you have Hadoop around doesn't mean you should solve all your problems with Hadoop.
One non-obvious version, which is tricky but a mix of both of the above...
Write a WordCount MapReduce job, but in the Reducer do something like in the TopN MapReduce jobs I showed you earlier. Then, have each reducer output only the TopN results from that reducer.
So, if you are doing the top 10, each reducer will output 10 results. If you have 30 reducers, you'll output 300 results.
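A rough sketch of what that reducer could look like (the class name is mine; it assumes the map side emits (word, 1) pairs exactly like the standard WordCount, and ties on the count simply overwrite each other, which is good enough for a sketch):

import java.io.IOException;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class TopNWordReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private static final int N = 10;
    // count -> word, sorted by count
    private final TreeMap<Integer, String> top = new TreeMap<Integer, String>();

    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        top.put(sum, word.toString());
        if (top.size() > N) {
            top.remove(top.firstKey());  // drop the smallest count
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Each reducer emits only its own local top N; a final pass (bash or a
        // second job) merges these per-reducer lists into the global top N.
        for (Map.Entry<Integer, String> entry : top.descendingMap().entrySet()) {
            context.write(new Text(entry.getValue()), new IntWritable(entry.getKey()));
        }
    }
}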
Then, do the same thing as in option #2 with bash:
hadoop fs -cat /output/of/wordcount/part* | sort -n -k2 -r | head -n10
This should be faster because you are only postprocessing a fraction of the results.
This is the fastest way I can think of doing this, but it's probably not worth the effort.
I am new to Hadoop and just installed Oracle's VirtualBox and Hortonworks' Sandbox. I then downloaded the latest version of Hadoop and imported the jar files into my Java program. I copied a sample WordCount program and created a new jar file. I ran this jar file as a job using the Sandbox. The WordCount works perfectly fine as expected. However, on my job status page, I see that the number of mappers for my input file is determined as 28. My input file contains the following line:
Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.
How is the total number of mappers determined as 28?
I added the line below to my wordcount.java program to check:
FileInputFormat.setMaxInputSplitSize(job, 2);
Also, I would like to know if the input file can contain only 2 rows. For example, suppose I have an input file like the one below:
row1,row2,row3,row4,row5,row6.......row20
Should I split the input file into 20 different files each having only 2 rows?
HDFS blocks and MapReduce splits are 2 different things. Blocks are a physical division of the data, while a split is just a logical division made during an MR job. It is the duty of the InputFormat to create the splits from a given set of data, and the number of mappers is decided based on the number of splits. When you use setMaxInputSplitSize, you overrule this behavior and give a split size of your own. But giving a very small value to setMaxInputSplitSize would be overkill, as there will be a lot of very small splits, and you'll end up having a lot of unnecessary map tasks.
Actually, I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WordCount program. Also, it looks like you have misread the 2 here. It is not the number of lines in a file; it is the split size, as a long (in bytes), which you would like to have for your MR job. You can have any number of lines in the file you are going to use as your MR input.
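For what it's worth, if you ever do want to control the split size, the value is a byte count. A small sketch (the class name and the 64 MB figure are just examples, not something from your setup):

import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
    public static void configureSplits(Job job) {
        // Split size is given in bytes (a long), not in lines.
        long sixtyFourMb = 64L * 1024 * 1024;
        FileInputFormat.setMaxInputSplitSize(job, sixtyFourMb);
        // With a cap this large, a one-line input file still produces a single
        // split, and therefore a single map task.
    }
}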
Does this sound OK?
That means your input file is split into roughly 28 parts (blocks) in HDFS, since you said 28 map tasks were scheduled; but there may not be 28 parallel map tasks in total. Parallelism will depend on the number of slots you have in your cluster. I'm talking in terms of Apache Hadoop; I don't know whether Hortonworks made any modifications to this.
Hadoop likes to work with large files, so do you really want to split your input file into 20 different files?