I have chosen the Hadoop MapReduce framework as my seminar topic, and as a beginner I have very little knowledge about it.
My doubts are as follows:
Do the JobTracker and TaskTracker come as software or hardware?
In one of the examples, MapReduce was used to find the occurrences of all words in some (say 3) files. Where do you give the query, i.e. "find occurrences of word", in the MapReduce program? How does map know what is to be done?
a) Do the JobTracker and TaskTracker come as software or hardware?
Both are software, i.e., services (daemons) running on the nodes of a cluster.
b) Where do you give the query, i.e. "find occurrences of word", in the MapReduce program? How does map know what is to be done?
There is no query language unless you are using Hive. You basically implement your logic inside the Mapper and Reducer classes.
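For example, here is a minimal sketch of the classic word count logic using the org.apache.hadoop.mapreduce API; the "query" is simply whatever you code inside map() and reduce():

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Emits (word, 1) for every word in every input line.
public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            word.set(tokens.nextToken());
            context.write(word, ONE);   // this IS the "query" logic
        }
    }
}

// Sums the 1s for each word; the framework groups values by key for us.
class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable v : values) {
            sum += v.get();
        }
        context.write(key, new IntWritable(sum));
    }
}
```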
If you want to prepare for the seminar, go through the first two chapters of Hadoop: The Definitive Guide.
Get to know what kinds of problems Hadoop solves and its architecture.
Then if you do not understand anything, I am happy to help :)
Related
I'm currently migrating from the Hadoop MR paradigm to Apache Spark, and there are a few doubts that come to my mind regarding advanced efficiency patterns beyond the usual "map and reduce" basic workflow.
In the well-known book by Lin and Dyer (2010), the "in-mapper combiner" pattern is introduced, which can significantly improve efficiency in many applications.
E.g., the canonical word count example in Hadoop, where we normally emit (word, 1) tuples to be combined further downstream, can be greatly improved if local aggregation of (word, n) tuples is performed before emitting. Although combiners can fulfil this behaviour, my experience is that using local variables in each mapper together with Hadoop's setup and cleanup methods can lead to higher computational savings (here is a nice tutorial).
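For reference, the pattern looks roughly like this in the Hadoop Java API (a sketch only; it assumes each map task's local vocabulary fits in memory):

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// In-mapper combining: aggregate counts locally and emit once per word
// in cleanup(), instead of emitting (word, 1) for every occurrence.
public class InMapperCombinerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private Map<String, Integer> counts;

    @Override
    protected void setup(Context context) {
        counts = new HashMap<>();   // one buffer per map task
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
            counts.merge(tokens.nextToken(), 1, Integer::sum);
        }
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Emit the locally aggregated (word, n) pairs once, at task end.
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
        }
    }
}
```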
Inside the Spark world I could not find anything similar, just the so-called map-side aggregation, which is equivalent to Hadoop's local combiner. Given the previous example, I wonder whether it can be translated into Spark by using map functions.
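Something like the following mapPartitions-based sketch is what I have in mind (a sketch only, assuming Spark 2.x's iterator-based Java API; whether this is the idiomatic equivalent is exactly my question):

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import scala.Tuple2;

public class InPartitionWordCount {
    public static JavaPairRDD<String, Integer> count(JavaRDD<String> lines) {
        // mapPartitionsToPair plays the role of the in-mapper combiner:
        // one HashMap per partition, emitted once when the partition ends,
        // instead of one (word, 1) tuple per occurrence.
        return lines.mapPartitionsToPair(partition -> {
            Map<String, Integer> counts = new HashMap<>();
            while (partition.hasNext()) {
                for (String word : partition.next().split("\\s+")) {
                    counts.merge(word, 1, Integer::sum);
                }
            }
            return counts.entrySet().stream()
                    .map(e -> new Tuple2<>(e.getKey(), e.getValue()))
                    .iterator();
        })
        // Final cross-partition aggregation; reduceByKey also performs
        // map-side combining of its own before the shuffle.
        .reduceByKey(Integer::sum);
    }
}
```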
I have a program whose work I wish to split up across many computers. Is this something I can accomplish with Hadoop or MapReduce, and if so, how do I begin using it? Does it cost money to use that many computers?
You can split your program depending on the nature of your algorithm. You should split the input data and, on each node, apply your program to a subset of that input. That is, you should implement data parallelism: each node executes the same program, but on a smaller input.
My advice: take a look at the book "Hadoop: The Definitive Guide"; the first two chapters can help you understand things better.
If you want to try simple MapReduce programs such as WordCount, you can download the Hortonworks sandbox, which you can install on a virtual machine to get a single-node Hadoop installation very quickly. Here is the link: http://hortonworks.com/products/hortonworks-sandbox/
I want to share some data between all the splits in the Hadoop framework. More specifically, I have a file that contains a lot of terms that I want to search for, and I need to write out how many times each term appears in each document. The problem is that if a word does not appear in some split, I need to return 0 for that file, and I can't pass the terms I'm searching for to all the nodes processing the splits. Can anybody give me some ideas?
Generally, the DistributedCache is the way to share data across nodes. However, since it is deprecated, check this answer.
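In Hadoop 2 the same mechanism survives on the Job API itself. Here is a minimal sketch with a hypothetical terms file: every mapper loads the shared term list in setup(), so each node sees the same data.

```java
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashSet;
import java.util.Set;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TermSearchMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

    private final Set<String> terms = new HashSet<>();

    // At job-submission time (hypothetical HDFS path):
    //   job.addCacheFile(new URI("/user/me/terms.txt#terms"));
    // Hadoop ships the file to every node before the tasks start.

    @Override
    protected void setup(Context context) throws IOException {
        // The '#terms' fragment above creates a symlink named "terms"
        // in each task's working directory.
        try (BufferedReader reader = new BufferedReader(new FileReader("terms"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                terms.add(line.trim());
            }
        }
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
            if (terms.contains(token)) {
                context.write(new Text(token), new LongWritable(1));
            }
        }
    }
}
```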
Hello,
I am learning Hadoop, and after reading the material found on the net (tutorials, MapReduce concepts, Hive, Pig, and so on) and developing some small applications with those, I would like to learn about the real-world usages of these technologies.
What everyday software do we use that is based upon the Hadoop stack?
If you use the internet, there is a good chance that you are indirectly impacted by Hadoop/MapReduce, from Google Search to Facebook to LinkedIn, etc. Here are some interesting links to find out how widespread Hadoop/MR usage is:
Mapreduce & Hadoop Algorithms in Academic Papers (4th update – May 2011)
10 ways big data changes everything
One thing to note is that Hadoop/MR is not an efficient solution for every problem. Also consider other distributed programming models, such as those based on BSP (Bulk Synchronous Parallel).
Happy Hadooping!!!
Here are some sample MapReduce examples which will be helpful for beginners:
1. Word Count
2. SQL aggregation using MapReduce
3. SQL aggregation on multiple fields using MapReduce
URL - http://hadoopdeveloperguide.blogspot.in/
OK, I am attempting to learn Hadoop and MapReduce. I really want to start with MapReduce, and what I find are many, many simplified examples of mappers and reducers, etc. However, I seem to be missing something.
While an example showing how to count the occurrences of a word in a document is simple to understand, it does not really help me solve any "real world" problems. Does anybody know of a good tutorial on implementing MapReduce in a pseudo-realistic situation? Say, for instance, I want to use Hadoop and MapReduce on top of a data store similar to AdventureWorks. Now I want to get orders for a given product in the month of May. How would that look from a Hadoop/MapReduce perspective? (I realize this may not be the type of problem MapReduce is intended to solve, but it just came to mind quickly.)
Any direction would help.
The book Hadoop: The Definitive Guide is a good place to start. The introductory chapters should be really useful to you to figure out where MapReduce is useful and when you should use it. The more advanced chapters have plenty of more realistic examples than word count.
If you want to dive deeper, you may want to check out Data-Intensive Text Processing with MapReduce. This definitely has plenty of "real-world" use cases, but it doesn't sound like you are interested in doing text processing.
For your particular example, the main things to realize are:
The map phase is mostly for parsing, transforming, and filtering data. Think of it as a record-by-record, shared-nothing approach to processing. In word count, this is parsing the line and splitting out the words.
The reduce phase is all about aggregation: counting, averaging, min/max, etc. In word count, this is counting up the instances of the word.
So, if you want all the records for a given product in the month of May, you could use a map-only job to filter through all the data and keep only the records you want, as in the sketch below. However, you really should read about what Hadoop is useful for. A question that would fit Hadoop better would be: give me a count of how many times every item was purchased in every month (to build a matrix, perhaps). Very rarely are you looking for specific records like you suggest.
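Here is a minimal sketch of that map-only filter; the CSV layout and product ID are hypothetical, and setting the number of reducers to 0 is what makes the job map-only:

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Map-only job: keep order records for one product in one month.
public class OrderFilterMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical CSV layout: productId,orderDate(yyyy-MM-dd),...
        String[] fields = value.toString().split(",");
        if (fields.length > 1
                && fields[0].equals("BK-M68B-42")       // hypothetical product id
                && fields[1].startsWith("2012-05")) {   // May
            context.write(NullWritable.get(), value);   // pass the record through
        }
    }
}

// In the driver: job.setNumReduceTasks(0);
// With no reducers, map output is written directly to HDFS.
```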
If you are looking for a more real-time access platform, you should check out HBase once you are done learning about Hadoop.
Hadoop can be used for a wide variety of problems. Check this blog entry from atbrox. Also, there is a lot of information on the internet about Hadoop and MapReduce and it's easy to get lost. So, here is the consolidated list of resources on Hadoop.
BTW, Hadoop: The Definitive Guide, 3rd edition, is due in May. It looks like it also covers MRv2 (NextGen MapReduce) and includes more case studies. The 2nd edition is worth it too, as mentioned by orangeoctopus.
MapReduce can be a complex topic, so I found it easier to understand by applying its approach to a simple problem. Then I go on to describe how MapReduce makes it straightforward to solve the same problem on a cluster. You can take a look at my article here: Intro to Parallel Processing with MapReduce.
Let me know if you think this article makes it easier to understand MapReduce and Hadoop.