I am trying to execute the following query and it is taking forever to load data as only a single reducer is used for the second job.
INSERT INTO TABLE ddb_table
SELECT * FROM data_dump sort by rank desc LIMIT 1000000;
Two jobs are created for the above query. The first job runs pretty fast, as it uses 80 mappers and about 22 reducers. The second job's mappers are fast, but it is super slow because it uses a single reducer.
I tried to increase the reducer count with set mapred.reduce.tasks=35, but interestingly it was applied only to the first job, not to the second.
Why is a single reducer used? Is it because of the sort by clause?
How can I set max reducers?
Is there a better way of doing it?
I'm not positive, but my intuition is that it's because of the "limit", not the "sort by". In fact, "sort by" explicitly sorts only within each reducer, so you will not get a total ordering.
The issue is that if there are multiple reducers, they aren't coordinated enough to be able to know when they've reached 1000000 records. So to do limit, it must be only one reducer, which maintains a count of the number of records, and stops outputting new ones once the limit is reached.
In fact, even if it were possible to do "sort by" and "limit" with multiple reducers, you could get different output on different runs, depending on which reducer runs fastest, so I don't think what you're trying to do here makes sense in the first place.
That is just the way sorting with the default Partitioner works in Hadoop. Default partitioning uses hashcode mod the number of reducers, so if you want 35 reducers, then you will get 35 output files, each sorted, but with overlapping ranges. For example, if you have keys starting with the alpha characters [a..z]: file1 (a1,a2,a15,d3,d5,f6), file2 (a3,a5,b1,z3), etc.
In order to avoid the overlapping key ranges, you either need one Reducer or you need to make your partitioner more aware of the nature of the keys, for example make your partitioner direct all of the keys with the same first character into the same partition. There will then be multiple files in the output, but none of the ranges will overlap. E.g. file1 (a1,a2,a3,a5,a15), file2 (b1), file3 (...), file4 (d3,d6), etc.
This works for me when I use standard Hadoop jobs or Apache Pig. Unfortunately I do not have Hive experience, but you could try to use dynamic partitioning on the table you are inserting into.
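For what it's worth, here is a minimal sketch of the kind of partitioner described above, assuming the new org.apache.hadoop.mapreduce API and Text keys (the class name is made up). Contiguous blocks of the alphabet go to consecutive reducers, so each reducer's sorted output covers a disjoint key range:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical range partitioner: keys are bucketed by their first character,
// and contiguous blocks of the alphabet map to consecutive reducers, so the
// sorted output files do not have overlapping key ranges.
public class FirstCharPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        char first = Character.toLowerCase(key.toString().charAt(0));
        if (first < 'a' || first > 'z') {
            return 0; // anything non-alphabetic falls into the first bucket
        }
        return (first - 'a') * numPartitions / 26;
    }
}

It would be registered with job.setPartitionerClass(FirstCharPartitioner.class). Note that this only guarantees non-overlapping ranges; it does nothing to balance the load if your keys are skewed toward a few letters.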
Related
As we know, during the shuffle phase of Hadoop each reducer reads data from all of the mappers' output (intermediate data).
Now, we also know that by default Hash-Partitioning is used for reducers.
My question is: how would we implement a different partitioning algorithm, e.g. a locality-aware one?
In short, you should not do it.
First, you have no control over where the mappers and reducers are executed on the cluster, so even if the complete output of a single mapper goes to a single reducer, there is a high probability that they will be on different hosts and the data will be transferred over the network.
Second, to make a reducer process the whole output of a mapper, you first have to make the mapper process the right part of the information, which means you would have to preprocess the data by partitioning it and then run a single mapper and a single reducer for each partition. This preprocessing would itself consume so many resources that it is mostly pointless.
And finally, why do you need it? The main concept of MapReduce is manipulating key-value pairs, and a reducer in general should aggregate the list of values emitted by the mappers for the same key. That is why hash partitioning is used: to distribute N keys between K reducers. Using a different type of partitioner is a really rare case. If you need data locality, you might prefer to work with an MPP database rather than Hadoop, for example.
If you really need a custom partitioner, here's an example of how one can be implemented: http://hadooptutorial.wikispaces.com/Custom+partitioner. Nothing special: just return a reducer number based on the key and value passed in and the number of reducers. Using the hash code of the host name modulo (%) the number of reducers would make the whole output of a single mapper go to a single reducer. Alternatively, you might use the process PID % the number of reducers. But before doing this, check whether you really need this behavior at all.
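If you do conclude that you need it, a minimal sketch of the host-name idea might look like the following (the Text key/value types and the class name are assumptions; as said above, this is rarely a good idea):

import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Hypothetical partitioner: the partition is derived from the host the map
// task runs on, so everything a mapper emits lands on the same reducer.
public class HostHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        try {
            String host = InetAddress.getLocalHost().getHostName();
            return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
        } catch (UnknownHostException e) {
            return 0; // fall back to a single partition if the host name is unavailable
        }
    }
}

Keep in mind that this breaks the usual guarantee that all values for one key meet in the same reduce() call, which is exactly the point made above.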
I know that during the intermediate steps between mapper and reducer, hadoop will sort and partition the data on its way to the reducer.
Since I am dealing with already partitioned data in my input to the mapper, is there a way to take advantage of it and possibly accelerate the intermediate processing so no more sorting or grouping-by will take place?
Adding some details:
As I store data on S3, let's say I only have two files in my bucket. The first file stores records for the lower half of the user IDs, the other file stores records for the upper half of the user IDs. The data in each file is not necessarily sorted, but it is guaranteed that all data pertaining to a user is located in the same file.
Such as:
\mybucket\file1
\mybucket\file2
File1 content:
User1,ValueX
User3,ValueY
User1,ValueZ
User1,ValueAZ
File2 content:
User9,ValueD
User7,ValueB
User7,ValueD
User8,ValueB
From what I read, I can use a streaming job with two mappers, and each mapper will consume one of the two files in its entirety. Is this true?
Next,
Let's say the mapper will output a unique Key just once, with the associated value being the number of occurrences of that Key (which I realize is more of a reducer responsibility, but just for our example here).
Can the sorting and partitioning of those output keys from the Mapper be disabled, letting them fly freely to the reducer(s)?
Or to give another example:
Imagine all my input data contains just one line for each Unique Key, and I don't need that data to be sorted in the final output of the reducer. I just want to Hash the Value for each Key. Can I disable that sorting and partitioning step before the reducer?
Although for the files shown above you'll get 2 mappers, this can't always be guaranteed. The number of mappers depends on the number of InputSplits created from the input data. If your files are big, you might get more than one mapper per file.
Partitioning is merely a way to tell which key/value pair goes to which reducer. If you disable it, you either need some other way to do this or you'll end up with performance degradation, as the input to the reducers will be uneven: a particular reducer might get all of the input, or a particular reducer might get no input at all. I can't see any performance gain here. Of course, if you think a custom partitioner fits your situation better, you can definitely do that, but skipping partitioning doesn't sound logical to me. The default partitioning behavior relies on hashing: after a mapper emits its output, the keys are hashed to find out which set of key/value pairs goes to which reducer.
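For reference, the default behaviour described here boils down to something like the following simplified sketch of a hash partitioner (Hadoop's stock HashPartitioner does essentially this):

import org.apache.hadoop.mapreduce.Partitioner;

// Simplified hash partitioning: the key's hash code, masked to stay
// non-negative, taken modulo the number of reduce tasks.
public class SimpleHashPartitioner<K, V> extends Partitioner<K, V> {
    @Override
    public int getPartition(K key, V value, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}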
And if your data is already sorted and you want to skip the sorting phase in your MR job, you might find the patch provided in response to this JIRA useful. The issue is not closed yet, but it would definitely help you get started.
HTH
I have a file with over 300000 lines that is the input to a MapReduce job, and I want the job to process only the first 1000 lines of this file. Is there a good way to limit the number of records sent to the reducer?
A simple identity reducer is all I need to write out my output. Currently, the reducer writes out as many lines as there are in the input.
First, make sure your MapReduce program is set to use only one reducer. It has to be explicitly set, otherwise Hadoop might choose some other number, and then there's no good way to coordinate between reduce tasks to make sure they don't emit more than 1000 total. Then you can simply maintain an instance variable in your Reducer class that counts how many records it has seen, and stop emitting once it has seen 1000.
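A minimal sketch of that counting identity reducer (the class name and Text types are assumptions; the limit is hard-coded for the example):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Identity reducer that stops emitting once 1000 records have been written.
// Only correct when the job runs with exactly one reduce task.
public class LimitReducer extends Reducer<Text, Text, Text, Text> {
    private static final int LIMIT = 1000;
    private int emitted = 0;

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            if (emitted >= LIMIT) {
                return; // limit reached, drop everything else
            }
            context.write(key, value);
            emitted++;
        }
    }
}

And in the driver, job.setNumReduceTasks(1); so that the counter is effectively global.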
The other, probably simpler, way to do it would be to shorten your input file. Just delete the lines you don't need.
It's also worth noting that hive and pig are both frameworks that will do this type of thing for you. Writing "raw" MapReduce code is rare in practice. Most people use one of those two.
My job doesn't require sorting, just aggregating information per key. So I am wondering whether it is possible to disable sorting of all that information in order to increase performance.
Note: I can't set the reducer count to zero because I need to aggregate data from many mappers. I'm just not interested in a sorted result within one reducer.
One of the main purposes of sorting the map output is that, when the tuples reach the reducer, the framework has to build a (key, list of values) group to invoke the reduce task. With a sorted map output it can build each list with a single sequential scan (when it sees a different key, it simply starts a new list); if the map output were not sorted, it would have to scan the whole output to collect the values that share the same key.
No. Sorting in MapReduce is essentially performed for internal purposes, not so that the end results come out sorted.
Sorted input ensures good performance when creating the list of values for each unique key, which is fed as the (key, list of values) argument when calling the reduce() function.
Shuffling and sorting in Hadoop MapReduce are not performed at all if you specify zero reducers (setNumReduceTasks(0)).
and
The number of reducers can be set to 0 in the driver class with job.setNumReduceTasks(0). This means there is no reduce phase, only a map phase; it is called a map-only job.
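For completeness, a minimal map-only driver sketch (the job name and the identity mapper are placeholders; substitute your own map logic):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "map-only job");
        job.setJarByClass(MapOnlyDriver.class);
        // The base Mapper class is an identity mapper; replace it with your own.
        job.setMapperClass(Mapper.class);
        job.setOutputKeyClass(LongWritable.class); // byte offset key from TextInputFormat
        job.setOutputValueClass(Text.class);

        // Zero reduce tasks: map output is written straight to the output path,
        // so no partitioning, shuffling or sorting happens at all.
        job.setNumReduceTasks(0);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

As the question above notes, this only helps when no cross-mapper aggregation is needed at all.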
We have a large dataset to analyze with multiple reduce functions.
All of the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each different function. Multiple outputs are mentioned in this article for hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
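As a rough sketch of that single-reducer variant, here is what it could look like with the newer org.apache.hadoop.mapreduce.lib.output.MultipleOutputs class from later Hadoop releases (the two "reduce functions" shown, a count and a total length, are just placeholders):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// One reducer running several "reduce functions", each writing to its own named output.
public class MultiFunctionReducer extends Reducer<Text, Text, Text, Text> {
    private MultipleOutputs<Text, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(Text key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        long count = 0;
        long totalLength = 0;
        for (Text value : values) {
            count++;
            totalLength += value.getLength();
        }
        // Placeholder "reduce functions": each result goes to a different file set.
        outputs.write("counts", key, new Text(Long.toString(count)));
        outputs.write("lengths", key, new Text(Long.toString(totalLength)));
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        outputs.close();
    }
}

Each named output also has to be registered in the driver, e.g. MultipleOutputs.addNamedOutput(job, "counts", TextOutputFormat.class, Text.class, Text.class), and likewise for "lengths".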
Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should be different, since it decides which reducer the data goes to.
You can write the output multiple times in the mapper, emitting ($i, $key) as the key (where $i identifies the i-th reducer and $key is your original key). Then you need to add a Partitioner to make sure these n records are distributed among the reducers based on $i, and a GroupingComparator to group the records by the original $key.
It's possible to do that, but not in a trivial way within one MR job.
You may use composite keys. Let's say you need two kinds of reducers, 'R1' and 'R2'. Add their ids as a prefix to your output keys in the mapper, so a key 'K' becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass the values to an implementation of R1 or R2 based on the prefix.
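A minimal sketch of that prefix scheme (the key extraction, the "count" and "join" bodies, and the class names are all placeholders):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// The mapper emits each record twice, tagging the key with the id of the
// logical reduce function that should handle it.
class TaggingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object offset, Text line, Context context)
            throws IOException, InterruptedException {
        String key = line.toString().split(",")[0];  // hypothetical key extraction
        context.write(new Text("R1:" + key), line);
        context.write(new Text("R2:" + key), line);
    }
}

// The reducer strips the tag and dispatches to one of two code paths.
class DispatchingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text taggedKey, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        String tagged = taggedKey.toString();
        Text key = new Text(tagged.substring(3));    // drop the "R1:" / "R2:" prefix
        if (tagged.startsWith("R1:")) {
            long count = 0;                          // placeholder reduce function #1
            for (Text ignored : values) {
                count++;
            }
            context.write(key, new Text(Long.toString(count)));
        } else {
            StringBuilder joined = new StringBuilder(); // placeholder reduce function #2
            for (Text value : values) {
                joined.append(value.toString()).append(';');
            }
            context.write(key, new Text(joined.toString()));
        }
    }
}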
I guess you want to run different reducers in a chain. In Hadoop, 'multiple reducers' means running multiple instances of the same reducer. I would propose that you run one reducer at a time, providing a trivial map function for all of them except the first one. To minimize data transfer time, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But: your infrastructure has to support multiple reducers, meaning that you have to
have more than one cpu available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to match some specifications. Without knowing what you exactly want to do, I only can give broad tips:
the map output keys either have to be evenly partitionable by % numReducers OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random-partitioner ...
the data must be reduce-able in the partitioned format ... (references needed?)
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...
Have a look too at the Combiner class, which is the local reducer. It means that you can already aggregate (reduce) in memory over the partial data emitted by the map.
A very nice example is the WordCount example. The map emits each word as the key with a count of 1: (word, 1). The Combiner gets partial data from the map and locally emits (word, partial count). The Reducer does exactly the same, but now some of the (combined) word counts are already >1. This saves bandwidth.
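A condensed WordCount sketch showing that wiring; since summing counts is associative, the same class can serve as both the combiner and the reducer:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    // Map: emit (word, 1) for every token in the line.
    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Reduce (and combine): sum the partial counts for each word.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable value : values) {
                sum += value.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    // In the driver: the same class serves as combiner and reducer,
    // so partial sums are merged on the map side before the shuffle.
    public static void configure(Job job) {
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
    }
}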
I still don't get your problem. You can use the following sequence:
database-->map-->reduce(use cat or None depending on requirement)
Then store the data representation you have extracted.
If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.
Also, your use of the MapReduce paradigm for the given problem is incorrect: using a single map function and multiple "different" reduce functions makes no sense. It shows that you are just using map to pass data out to different machines to do different things, and you don't require Hadoop or any other special architecture for that.