Objects from memory as input for Hadoop/MapReduce? - hadoop

I am working on parallelizing an algorithm that roughly does the following:
1. Read several text documents with a total of 10k words.
2. Create an object for every word in the text corpus.
3. Create a pair between all word-objects (yes, O(n^2)) and return the most frequent pairs.
I would like to parallelize step 3 by creating the pairs between the first 1000 word-objects and the rest on the first machine, between the second 1000 word-objects and the rest on the next machine, etc.
My question is how to pass the objects created in step 2 to the Mapper. As far as I am aware, I would need input files for this and hence would need to serialize the objects (though I haven't worked with this before). Is there a direct way to pass the objects to the Mapper?
Thanks in advance for the help
Evgeni
UPDATE
Thank you for reading my question. Serialization seems to be the best way to solve this (see java.io.Serializable). Furthermore, I found this tutorial useful for reading data from serialized objects into Hadoop: http://www.cs.brown.edu/~pavlo/hadoop/

How about parallelizing all the steps? Use the text documents from your step 1 as input to your Mapper. Create the object for every word in the Mapper. In the Mapper, your key-value pair will be the word-object pair (or object-word, depending on what you are doing). The Reducer can then count the unique pairs.
Hadoop will take care of bringing all identical keys together in the same Reducer.
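The flow described in this answer can be sketched without Hadoop as a plain Python simulation of map, shuffle, and reduce. This is only an illustration under assumptions that are not in the original question: a "pair" is taken to mean two distinct words occurring in the same document, and the function names (map_pairs, shuffle, reduce_count) are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def map_pairs(document):
    """Mapper: emit (pair, 1) for every unordered pair of distinct words in one document."""
    words = sorted(set(document.split()))
    for pair in combinations(words, 2):
        yield pair, 1


def shuffle(mapped):
    """Stand-in for the framework's shuffle: collect all values that share a key."""
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    return groups


def reduce_count(key, values):
    """Reducer: count how often one pair was emitted."""
    return key, sum(values)


# Hypothetical sample corpus; each string stands for one input document.
documents = ["a b c", "a b", "b c d"]
mapped = [kv for doc in documents for kv in map_pairs(doc)]
counts = dict(reduce_count(key, values) for key, values in shuffle(mapped).items())

# Most frequent pairs first; ties are broken alphabetically.
top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:2]
print(top)
```

Because the word objects are created inside the mapper, nothing has to be serialized up front; only the emitted pairs travel between the map and reduce stages.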

Use Twitter's protobuf support (elephant-bird). Convert each word into a protobuf object and process it however you want. Protobufs are also much faster and lighter than default Java serialization. See Kevin Weil's presentation on this: http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

Related

Why does MapReduce bother mapping every value to 1 in the map step?

I'm trying to figure out MapReduce and so far I think I'm gaining an okay understanding.
However, one thing confuses me. In every example and explanation of MapReduce I can find, the map step maps all values to 1. For instance, in the most common example (counting occurrences of words in a string), the Map section splits up each word and then maps it to the value 1.
The Reduce section then combines/reduces like words, adding up the number of times they occur so that they map to N instead of 1 (N being how many times the word appears).
What I don't understand is: why even bother mapping them to 1 in the first place? It seems like they will ALWAYS map to 1. Why not just split them apart, and then in the Reduce step, do the mapping there, and sum everything up at the same time?
I'm sure there must be a good reason that I just can't think of. Thanks!
(this question is about MapReduce as a concept in general, not necessarily about Hadoop or any other specific technology or implementation)
The mapper's output depends on your use case. In word count, we want the mapper to split the input into individual words and emit one occurrence for each word. The mapper is called for every key-value pair in its input split; here that is once per line, where the key is the offset and the value is the entire line. Grouping is performed before the reducer is invoked, so all occurrences of the same word are grouped and each occurrence (1 here) is counted. Emitting 1 as the mapper output is not a hard rule. If you have seen the data set example in Hadoop: The Definitive Guide, the mapper there emits the year and the temperature. The use case is to group by year and find the max/min temperature. For a basic understanding, you can think of the key as the grouping parameter. Happy learning!
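To make the point concrete, here is a small Python simulation (not Hadoop code) in which the same grouping machinery serves both cases: word count, where the mapper emits 1, and a maximum-temperature job, where it emits a real value. The helper run_job and the "year,temperature" record layout are assumptions made for this sketch.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Simulate one MapReduce pass: map, group by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(values) for key, values in groups.items()}


def word_mapper(line):
    # The value 1 means "one occurrence of this word".
    for word in line.split():
        yield word, 1


def temperature_mapper(record):
    # Here the value carries real data, so it is not 1.
    year, temperature = record.split(",")
    yield year, int(temperature)


word_counts = run_job(["to be or not to be"], word_mapper, sum)
max_temps = run_job(["1949,111", "1950,22", "1949,78"], temperature_mapper, max)
print(word_counts)
print(max_temps)
```

The grouping step never looks at the values, which is why the mapper has to supply one explicitly even when it is always 1.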

Is (key,value) pair in Hadoop always ('text',1)?

I am new to Hadoop.
Can you please tell about (key/value) pair? Is the value always one? Is the output of the reduce step always a (key/value) pair? If yes, how is that (key/value) data used further?
Please help me.
I guess you are asking about the 'one' value in the (key,value) pair because of the wordcount example in the Hadoop tutorials. So, the answer is no, it is not always 'one'.
The Hadoop implementation of MapReduce works by passing (key,value) pairs through the entire workflow, from the input to the output:
Map step: Generally speaking (there are other cases, depending on the input format), the mappers process the data within their assigned splits line by line; each line is passed to the map method as a (key,value) pair consisting of the offset of the line within the split (the key) and the line itself (the value). The mappers then produce another (key,value) pair as output, whose meaning depends on the mapping function you are implementing; sometimes it will be a variable key and a fixed value (e.g. in wordcount, the key is the word, and the value is always 'one'); other times the value will be the length of the line, or the sum of all the words starting with a prefix... whatever you can imagine; the key may be a word, a fixed custom key...
Reduce step: Typically the reducer receives lists of (key,value) pairs produced by the mappers whose key is the same (this depends on the combiner class you are using, of course, but this is generally speaking). The reducers then produce another (key,value) pair as output; again, this depends on the logic of your application. Typically, the reducer is used to aggregate all the values for the same key.
This is a very rough, quick, and undetailed explanation. I encourage you to read some official documentation about it, or specialized literature such as this.
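As a rough illustration of the map step described above, the following Python sketch (a simulation, not the Hadoop API) feeds (offset, line) pairs to a mapper that uses a fixed custom key and a variable value, then aggregates in a reducer. The names read_split, length_mapper, and sum_reducer are hypothetical.

```python
def read_split(text):
    """Yield (byte offset, line) pairs, similar in spirit to a line-oriented input format."""
    offset = 0
    for line in text.splitlines(keepends=True):
        yield offset, line.rstrip("\n")
        offset += len(line.encode("utf-8"))


def length_mapper(offset, line):
    # The offset key is ignored; a fixed custom key carries a variable value.
    yield "total_chars", len(line)


def sum_reducer(key, values):
    # Aggregate all values that share the same key.
    return key, sum(values)


text = "hello\nmap reduce\n"
inputs = list(read_split(text))
mapped = [kv for offset, line in inputs for kv in length_mapper(offset, line)]
result = sum_reducer("total_chars", [value for _, value in mapped])
print(inputs)
print(result)
```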
I hope you have started learning MapReduce with the Wordcount example.
A key/value pair is the record entity that MapReduce accepts for execution. The InputFormat classes that read records from the source and the OutputFormat classes that commit results operate only on records in key/value format.
The key/value format is the best-suited representation of records to pass through the different stages of the map-partition-sort-combine-shuffle-merge-sort-reduce lifecycle of MapReduce. Please refer to
http://www.thecloudavenue.com/2012/09/why-does-hadoop-uses-kv-keyvalue-pairs.html
The key/value data types can be anything. The Text/IntWritable key/value pair you used is the best fit for wordcount, but the types can be anything your requirements call for.
Kindly spend some time reading Hadoop: The Definitive Guide or the Yahoo tutorials to get a better understanding. Happy learning...

How do I process a 2-D array, one per file, using Hadoop MapReduce?

I need to read and process a file as a single unit, not line by line, and it's not clear how you'd do this in a Hadoop MapReduce application. What I need to do is to read the first line of the file as a header, which I can use as my key, and the following lines as data to build a 2-D data array, which I can use as my value. I'll then do some analysis on the entire 2-D array of data (i.e. the value).
Below is how I'm planning to tackle this problem, and I would very much appreciate comments if this doesn't look reasonable or if there's a better way to go about this (this is my first serious MapReduce application so I'm probably making rookie mistakes):
My text file inputs contain one line with station information (name, lat/lon, ID, etc.) and then one or more lines containing a year value (i.e. 1956) plus 12 monthly values (i.e. 0.3 2.8 4.7 ...) separated by spaces. I have to do my processing over the entire array of monthly values [number_of_years][12] so each individual line is meaningless in isolation.
Create a custom key class, making it implement WritableComparable. This will hold the header information from the initial line of the input text files.
Create a custom input format class in which a) the isSplitable() method returns false, and b) the getRecordReader() method returns a custom record reader that knows how to read a file split and turn it into my custom key and value classes.
Create a mapper class which does the analysis on the input value (the 2-D array of monthly values) and outputs the original key (the station header info) and an output value (a 2-D array of analysis values). There'll only be a wrapper reducer class since there's no real reduction to be done.
It's not clear that this is a good/correct application of the map reduce approach a) since I'm doing analysis on a single value (the data array) mapped to a single key, and b) since there is never more than a single value (data array) per key then no real reduction will ever need to be performed. Another issue is that the files I'm processing are relatively small, much less than the default 64MB split size. With this being the case perhaps the first task is instead to consolidate the input files into a sequence file, as shown in the SmallFilesToSequenceFileConverter example in the Definitive Hadoop O'Reilly book (p. 194 in the 2nd Edition)?
Thanks in advance for your comments and/or suggestions!
Your coding plan looks spot on; I would do the same thing.
You will benefit from Hadoop if you provide a lot of input files to the job, as each file will have its own InputSplit, and in Hadoop the number of executed mappers is the same as the number of input splits.
Too many small files will cause too much memory use on the HDFS NameNode. To consolidate the files you can use SequenceFiles or Hadoop Archives (the Hadoop equivalent of tar); see the docs. With har files (Hadoop Archives), each small file will have its own Mapper.
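The parsing that the custom record reader would have to do (first line as key, remaining lines as a 2-D array) can be sketched in plain Python. This is not Hadoop code; parse_station_file, the sample station line, and the annual-mean "analysis" are assumptions made for the illustration.

```python
def parse_station_file(text):
    """Treat a whole file as one record: header line -> key, remaining lines -> 2-D array."""
    lines = [line for line in text.splitlines() if line.strip()]
    header = lines[0].strip()
    years, values = [], []
    for line in lines[1:]:
        fields = line.split()
        row = [float(field) for field in fields[1:]]
        if len(row) != 12:
            raise ValueError(f"expected 12 monthly values, got {len(row)}")
        years.append(int(fields[0]))
        values.append(row)
    return header, years, values


# Hypothetical file content in the layout described in the question.
sample = "\n".join([
    "STATION_A 45.0 -93.0 001",
    "1956 " + " ".join(["0.5"] * 12),
    "1957 " + " ".join(["1.5"] * 12),
])

key, years, data = parse_station_file(sample)
# Placeholder analysis over the whole array: one mean per year.
annual_means = [sum(row) / len(row) for row in data]
print(key, years, annual_means)
```

In a real job this logic would live in the record reader (producing the key and value) and the mapper (doing the analysis), with isSplitable() returning false so that a file is never cut in the middle.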

Is the input to a Hadoop reduce function complete with regards to its key?

I'm looking at solutions to a problem that involves reading keyed data from more than one file. In a single map step I need all the values for a particular key in the same place at the same time. I see in White's book the discussion about "the shuffle" and am tempted to wonder whether, when you come out of merging and the input to a reducer is sorted by key, all the data for a key is there, and whether you can count on that.
The bigger picture is that I want to do a poor-man's triple-store federation, and the triples I want to load into an in-memory store don't all come from the same file. It's a vertical (?) partition where the values for a particular key are in different files. Said another way, the columns for a complete record each come from different files. Does Hadoop re-assemble that, at least for a single key at a time?
In short: yes. In a Hadoop job, the partitioner chooses which reducer receives which (key, value) pairs. Quote from the Yahoo tutorial section on partitioning: "It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same". This is also necessary for many of the types of algorithms typically solved with map reduce (such as distributed sorting, which is what you're describing).
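A small Python simulation of that guarantee: a deterministic partition function sends every value for a key to the same reducer, regardless of which file (and therefore which mapper) produced it. The crc32-based partition function is a stand-in chosen for this sketch, not Hadoop's own partitioner, and the file names and tab-separated layout are hypothetical.

```python
import zlib
from collections import defaultdict


def partition(key, num_reducers):
    """Deterministic hash partitioner: the same key always maps to the same partition."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def mapper(file_name, line):
    # Each file holds one "column" of the record, keyed by subject.
    subject, value = line.split("\t")
    yield subject, (file_name, value)


files = {
    "names.tsv": ["s1\tAlice", "s2\tBob"],
    "ages.tsv": ["s2\t41", "s1\t30"],
}

num_reducers = 3
reducer_inputs = [defaultdict(list) for _ in range(num_reducers)]
for file_name, lines in files.items():
    for line in lines:
        for key, value in mapper(file_name, line):
            reducer_inputs[partition(key, num_reducers)][key].append(value)

# Each reducer sees the complete set of values for every key it owns.
records = {}
for groups in reducer_inputs:
    for key, values in groups.items():
        records[key] = dict(values)
print(records)
```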

Sorting the values before they are sent to the reducer

I'm thinking about building a small testing application in hadoop to get the hang of the system.
The application I have in mind will be in the realm of doing statistics.
I want to have "The 10 worst values for each key" from my reducer function (where I must assume the possibility of a huge number of values for some keys).
What I have planned is that the values that go into my reducer will basically be the combination of "The actual value" and "The quality/relevance of the actual value".
Based on the relevance I "simply" want to take the 10 worst/best values and output them from the reducer.
How do I go about doing that (assuming a huge number of values for a specific key)?
Is there a way that I can sort all values BEFORE they are sent into the reducer (and simply stop reading the input when I have read the first 10) or must this be done differently?
Can someone here point me to a piece of example code I can have a look at?
Update: I found two interesting Jira issues HADOOP-485 and HADOOP-686.
Does anyone have a code fragment showing how to use this in the Hadoop 0.20 API?
This definitely sounds like a secondary sort problem. Take a look at "Hadoop: The Definitive Guide" from O'Reilly, if you like. You can also access it online. It describes a pretty good implementation.
I implemented it myself too. Basically it works this way:
The partitioner makes sure that all key-value pairs with the same key go to one single reducer. Nothing special here.
But there is also the GroupingComparator, which forms groupings. One group is passed as an iterator to one reduce() call, so a partition can contain multiple groupings, while the number of partitions should equal the number of reducers. The grouping also lets you do some sorting, since it implements a compareTo method.
With this method, you can make sure that the 10 best/worst/highest/lowest (however you define it) keys reach the reducer first. So after you have read these 10 keys, you can leave the reduce method without any further iterations.
Hope that was helpful :-)
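The idea can be simulated in plain Python (this is not the Hadoop API): records carry a composite key of (natural key, score), they are sorted on the full composite key, grouped on the natural key only, and the "reducer" stops after the first n values of each group. The function name and sample records are hypothetical.

```python
from itertools import groupby, islice


def secondary_sort_top_n(records, n):
    """records: iterable of (natural_key, score, data) tuples."""
    # Sort on the composite key (natural_key, score), as a sort comparator would.
    ordered = sorted(records, key=lambda record: (record[0], record[1]))
    result = {}
    # Group on the natural key only, as a grouping comparator would.
    for natural_key, group in groupby(ordered, key=lambda record: record[0]):
        # The "reducer" reads the first n values and ignores the rest of the group.
        result[natural_key] = [(score, data) for _, score, data in islice(group, n)]
    return result


records = [
    ("k1", 5, "e"), ("k1", 1, "a"), ("k2", 7, "x"),
    ("k1", 3, "c"), ("k2", 2, "y"),
]
lowest_two = secondary_sort_top_n(records, 2)
print(lowest_two)
```

The benefit in a real job is that the framework does the sorting during the shuffle, so the reducer never has to hold all values for a key in memory.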
It sounds like you want to use a Combiner, which defines what to do with the values you create on the Map side before they are sent to the Reducer, but after they are grouped by key.
The combiner is often set to just be the reducer class (so you reduce on the map side, and then again on the reduce side).
Take a look at how the wordCount example uses the combiner to pre-compute partial counts:
http://wiki.apache.org/hadoop/WordCount
Update
Here's what I have in mind for your problem; it's possible I misunderstood what you are trying to do, though.
Every mapper emits <key, {score, data}> pairs.
The combiner gets a partial set of these pairs, <key, [set of {score, data}]>, does a local sort (still on the mapper nodes), and outputs <key, [sorted set of top 10 local {score, data}]> pairs.
The reducer will get <key, [set of top-10-sets]> -- all it has to do is perform the merge step of sort-merge (no sorting needed) for each of the members of the value sets, and stop merging when the first 10 values are pulled.
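A minimal Python sketch of this combiner-then-merge plan, assuming "top" means the lowest scores and that each {score, data} value is represented as a (score, data) tuple. The function names and sample data are hypothetical, and this simulates the idea rather than using Hadoop's Combiner API.

```python
import heapq
from itertools import islice


def combine_top_n(pairs, n):
    """Combiner: per key, keep the n lowest (score, data) values from one mapper, sorted."""
    by_key = {}
    for key, scored in pairs:
        by_key.setdefault(key, []).append(scored)
    return {key: heapq.nsmallest(n, values) for key, values in by_key.items()}


def reduce_top_n(sorted_lists, n):
    """Reducer: merge the already sorted local lists and stop after n values."""
    return list(islice(heapq.merge(*sorted_lists), n))


# Hypothetical output of two mappers for the same key "k".
mapper_a = [("k", (5, "e")), ("k", (1, "a")), ("k", (9, "i"))]
mapper_b = [("k", (2, "b")), ("k", (7, "g"))]

local_a = combine_top_n(mapper_a, 2)
local_b = combine_top_n(mapper_b, 2)
overall = reduce_top_n([local_a["k"], local_b["k"]], 2)
print(overall)
```

This only works when each value's score is final at map time; if the score has to be accumulated across mappers, the local lists cannot be trimmed safely.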
update 2
So, now that we know that the rank is cumulative and, as a result, you can't filter the data early by using combiners, the only option is to do what you suggested: get a secondary sort going. You've found the right tickets; there is an example of how to do this in Hadoop 0.20 in src/examples/org/apache/hadoop/examples/SecondarySort.java (or, if you don't want to download the whole source tree, you can look at the example patch in https://issues.apache.org/jira/browse/HADOOP-4545 )
If I understand the question properly, you'll need to use a TotalOrderPartitioner.
