I am trying to install Open Edx Insights on Azure instance. Both LMS and Insights are on same box. As part of installation, I have installed Hadoop, hive etc through yml script. Now the next instruction is to test the hadoop installation and for that document asks to compute value of "pi". For that they have given following command:
hadoop jar hadoop*/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 2 100
But after I run this command it gives error as :
hadoop#MillionEdx:~$ hadoop jar hadoop*/share/hadoop/mapreduce/hadoop-mapreduce-examples*.jar pi 2 100
Unknown program 'hadoop/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.7.2.jar' chosen.
Valid program names are:
aggregatewordcount: An Aggregate based map/reduce program that counts the words in the input files.
aggregatewordhist: An Aggregate based map/reduce program that computes the histogram of the words in the input files. bbp: A map/reduce program that uses Bailey-Borwein-Plouffe to compute exact digits of Pi.
dbcount: An example job that count the pageview counts from a database.
distbbp: A map/reduce program that uses a BBP-type formula to compute exact bits of Pi.
grep: A map/reduce program that counts the matches of a regex in the input.
join: A job that effects a join over sorted, equally partitioned datasets
multifilewc: A job that counts words from several files.
pentomino: A map/reduce tile laying program to find solutions to pentomino problems.
pi: A map/reduce program that estimates Pi using a quasi-Monte Carlo method.
randomtextwriter: A map/reduce program that writes 10GB of random textual data per node.
randomwriter: A map/reduce program that writes 10GB of random data per node.
secondarysort: An example defining a secondary sort to the reduce.
sort: A map/reduce program that sorts the data written by the random writer.
sudoku: A sudoku solver.
teragen: Generate data for the terasort
terasort: Run the terasort
teravalidate: Checking results of terasort
wordcount: A map/reduce program that counts the words in the input files.
wordmean: A map/reduce program that counts the average length of the words in the input files.
wordmedian: A map/reduce program that counts the median length of the words in the input files.
wordstandarddeviation: A map/reduce program that counts the standard deviation of the length of the words in the input files.
I tried many things like tried giving fully qualified name for Pi but i never got value of pi. Please suggest some work around.
Thanks in advance.
need to check for java version compatible with Hadoop-mapreduce 2.7.2.
This can be because for wrong jar file with respective to java version.
Related
When executing a MR job, Hadoop divides the input data into N Splits and then starts the corresponding N Map programs to process them separately.
1.How is the data divided (splited into different inputSplits)?
2.How is Split scheduled (how do you decide which TaskTracker machine the Map program that handles Split should run on)?
3.How to read the divided data?
4.How Reduce task assigned ?
In hadoop1.X
In hadoop 2.x
Both of the questions has some relationship , so I asked them together , you can show which of the part you are good at .
thanks in advance .
Data is stored/read in HDFS Blocks of a predefined size, and read by various RecordReader types by using byte scanners, and knowing how many bytes to read in order to determine when an InputSplit needs to be returned.
A good exercise to understand it better is to implement your own RecordReader and create small and large files of one small record, one large record, and many records. In the many records case, you try to split a record across two blocks, but that test case should be the same as one large record over two blocks.
Reduce tasks can be set by the client of the MapReduce action.
As of Hadoop 2 + YARN, that image is outdated
I am new to hadoop and just installed oracle's virtualbox and hortonworks' sandbox. I then, downloaded the latest version of hadoop and imported the jar files into my java program. I copied a sample wordcount program and created a new jar file. I run this jar file as a job using sandbox. The wordcount works perfectly fine as expected. However, in my job status page, I see the number of mappers to my input file is determined as 28. In my input file, I have the following line.
Ramesh is studying at XXXXXXXXXX XX XXXXX XX XXXXXXXXX.
How is the total mappers determined as 28?
I added the below line into my wordcount.java program to check.
FileInputFormat.setMaxInputSplitSize(job, 2);
Also, I would like to know if the input file can contain only 2 rows. (i.e.) Suppose if I have an input file, like below.
row1,row2,row3,row4,row5,row6.......row20
Should I split the input file into 20 different files each having only 2 rows?
HDFS block and MapReduce splits are 2 different things. Blocks are physical division of data while a Split is just a logical division done during a MR job. It is the duty of InputFormat to create the Splits from a given set data and based on the number of Splits the number of Mappers is decided. When you use setMaxInputSplitSize, you overrule this behavior and give a Split size of your own. But giving a very small value to setMaxInputSplitSize would be an overkill as there will be a lot of very small Splits, and you'll end up having a lot of unnecessary Map tasks.
Actually I don't see any need for you to use FileInputFormat.setMaxInputSplitSize(job, 2); in your WC program. Also,it looks like you have mistaken the 2 here. It is not the number of lines in a file. It is the Split size, in long, which you would like to have for your MR job. You can have any number of lines in the file which you are going to use as your MR input.
Does this sound OK?
That means your input file is split into roughly 28 parts(blocks) in HDFS since, you said 28 map tasks were scheduled- but, not may not be total 28 parallel map task though. Parallelism will depend on the number of slots you'll have in your cluster. I'm talking in terms of Apache Hadoop. I don't know if Horton works did nay modification to this.
Hadoop likes to work with Large files, so, do you want to split your input file to 20 different files?
I am new to Map-reduce and I am using Hadoop Pipes. I have an input file which contains the number of records, one per line. I have written one simple program to print those lines in which three words are common. In map function I have emitted the word as a key and record as a value and compared those records in reduce function. Then I compared Hadoop's performance with simple C++ program in which I read the records from file and split it into words and load the data in map. Map contains word as a key and record as a value. After loading all the data, I compared that data. But I found that for doing the same task Hadoop Map-reduce takes long time compared with plain C++ program. When I run my program on hadoop it takes about 37 minutes where as it takes only about 5 minutes for simple C++ program. Please, somebody help me to figure out whether I am doing wrong somewhere? Our application needs performance.
There are several points which should be made here:
Hadoop is not high performance - it is scalable. Local program doing the same on small data set will always outperform hadoop. So its usage makes sense only when you want to run on cluster on machine and enjoy Hadoop's parallel processing.
Hadoop streaming is also not best thing performance wise since there are task switches per line. In many cases native hadoop program written in Java will have better performance
I am having clarification regarding using Hadoop for large file size around 2 million. I have file data that consists of 2 million lines for which I want to split each line as single file, copy it in Hadoop File System and do perform calculation of term frequency using Mahout. Mahout uses map-reduce computation in a distributed fashion. But for this, say If I have a file that consist of 2 million lines, I want to take each line as a document for calculation of term-frequency. I will finally have one directory where I will have 2 million documents, each document consist of single line. Will this create n-maps for n-files, here 2 million maps for the process. This takes lot of time for computation. Is there is any alternative way of representing documents for faster computation.
2 millions files is a lot for hadoop. More then that - running 2 million tasks will have roughly 2M seconds overhead, what means a few days of small cluster work.
I think that the problem is of algorithmic nature - how to map your computation to the map reduce paradigm in the way that you will have modest number of mappers. Please drop a few lines about task you need, and I might suggest algorithm.
Mahout has implementation for calcualating TF and IDF for text.
check mahout liberary for it,
and splitting each line as a file is not good idea in hadoop map reduce framework.
I have installed Hadoop single-node cluster 0.20.2 on Ubuntu 10.04 and run an example using the material of the tutorial I found on this site:
http://www.dscripts.net/wiki/setup-hadoop-ubuntu-single-node
Now I am trying to run the Sort example on Hadoop. It needs Sequential files as input. Could anyone please help me running the Sort example? (or giving me some more info on how to generate the Sequential files as input).
Thank you in advance.. ;-)
Running Sort Benchmark
To use the sort example as a benchmark, generate 10GB/node of random data using RandomWriter. Then sort the data using the sort example. This provides a sort benchmark that scales depending on the size of the cluster. By default, the sort example uses 1.0 * capacity for the number of reduces and depending on your cluster you may see better results at 1.75 * capacity.
The commands are:
$> bin/hadoop jar hadoop-*-examples.jar randomwriter /path/randFiles
$> bin/hadoop jar hadoop-*-examples.jar sort /path/randFiles /path/resultFile
The first command will generate the unsorted data in the rand directory. The second command will read that data, sort it, and write into the rand-sort directory.
Take a look at the RandomWriter example. It is a job that outputs a sequence file using random data. The key is the job.setOutputFormat(SequenceFileOutputFormat.class) line that specifies the output format.