I have ~5000 entries in my Hadoop input file, but I know in advance that some of the lines will take much longer to process than others (in the map stage).
(Mainly because I need to download a file from Amazon S3, and the size of the file will vary between tasks)
I want the biggest map tasks to be processed first, so that all my Hadoop nodes finish working at roughly the same time.
Is there a way to do that with Hadoop? Or do I need to rework the whole thing? (I am new to Hadoop)
Thanks!
Well, if you implement your own custom InputFormat (the getSplits() method contains the logic for split creation), then theoretically you could achieve what you want.
BUT you have to take special care, because the order in which the InputFormat returns the splits is not the order in which Hadoop will process them.
There is split re-ordering code inside the JobClient:
// sort the splits into order based on size, so that the biggest
// go first
Arrays.sort(array, new NewSplitComparator());
which makes the whole thing trickier.
But you could implement a custom InputFormat plus a custom InputSplit and make InputSplit#getLength() dependent on the expected execution time.
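For illustration, here is a minimal sketch of that idea using the new mapreduce API. WeightedSplit and estimatedCost are made-up names, and the cost estimate (e.g. the expected S3 object size) is something you would compute yourself in getSplits(); also note that once getLength() no longer reports a byte length, your RecordReader must not rely on it for byte ranges.

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

// Hypothetical split that reports an estimated processing cost instead of the
// byte length, so the framework's "biggest splits first" sort schedules the
// most expensive tasks early.
public class WeightedSplit extends FileSplit {

  private long estimatedCost;           // e.g. expected S3 download size

  public WeightedSplit() {}             // required for Writable deserialization

  public WeightedSplit(Path file, long start, long length,
                       String[] hosts, long estimatedCost) {
    super(file, start, length, hosts);
    this.estimatedCost = estimatedCost;
  }

  @Override
  public long getLength() {             // this is what the split-size sort looks at
    return estimatedCost;
  }

  @Override
  public void write(DataOutput out) throws IOException {
    super.write(out);
    out.writeLong(estimatedCost);
  }

  @Override
  public void readFields(DataInput in) throws IOException {
    super.readFields(in);
    estimatedCost = in.readLong();
  }
}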
I don't see a value for the reducers in Hadoop in the following scenario:
The Map Tasks generate unique keys (because we could merge the Map and Reduce functionality together)
The output size of the Map Tasks is too big (this will exhaust the memory if we wait for the reducers to begin their work)
If we have any functionality that doesn't need grouping and sorting of the keys
Please correct me if I am wrong.
And if someone could give me a real example of the benefits of reducers and when they should be used, I would appreciate it.
A reducer is beneficial (or required) when you need to do operations like aggregation or grouping.
FYI: the reducer is meant for grouping the different values of a key that come from different mappers. So for a use case that does not require grouping/aggregation, there is no point in using a reducer (you can set the number of reducers to zero, meaning a map-only job).
One quick use case I can think of: you want to randomly split a big file into multiple part files. In this case you supply the big file (let's say 100 GB) to a map-only job. Each map reads a chunk of the file and writes it out as a part file.
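A rough sketch of such a map-only job, assuming the newer mapreduce API (MapOnlyJob and PassThroughMapper are illustrative names). Setting the number of reduce tasks to zero means every mapper writes its output straight to HDFS as a part file, with no shuffle or sort:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MapOnlyJob {

  // Each mapper handles one chunk of the big input file and writes it out as-is.
  public static class PassThroughMapper
      extends Mapper<LongWritable, Text, NullWritable, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      context.write(NullWritable.get(), line);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "map-only split");
    job.setJarByClass(MapOnlyJob.class);
    job.setMapperClass(PassThroughMapper.class);
    job.setNumReduceTasks(0);                    // zero reducers => map-only job
    job.setOutputKeyClass(NullWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}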
What do I need to do to have smaller/larger blocks in Hadoop?
Concretely, I want a larger number of mappers, each getting a smaller piece of data to work on. It seems that I need to decrease the block size, but I'm confused (I'm new to Hadoop): do I need to do something while putting the file on HDFS, do I need to specify something related to the input split size, or both?
I'm sharing the cluster, so I cannot change global settings; I need this on a per-job basis, if possible. And I'm running the job from code (later from Oozie, possibly).
What a mapper runs is controlled by the input split, and it is completely up to you how you specify it. The HDFS block size has nothing to do with it (other than the fact that most splitters use the block size as a basic 'block' for creating the input splits, in order to achieve good data locality). You can write your own splitter that takes an HDFS block and splits it into 100 splits, if you so fancy. Also look at Change File Split size in Hadoop.
Now that being said, the wisdom of doing that ('many mappers with small splits') is highly questionable. Everybody else is trying to do the opposite (create few mappers with aggregated splits). See Dealing with Hadoop's small files problem, The Small Files Problem, Amazon Elastic MapReduce Deep Dive and Best Practices and so on.
You don't really have to decrease the block size to get more mappers that each process a smaller amount of data.
You don't have to modify the HDFS block size (dfs.blocksize); leave it at the default global value from your cluster configuration.
You may set the mapreduce.input.fileinputformat.split.maxsize property in your job configuration to a value lower than the block size.
The input splits will be calculated using this value, and one mapper will be launched for every input split calculated.
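A minimal sketch of that per-job setting from driver code (SmallSplitsJob and the 32 MB figure are just illustrative):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SmallSplitsJob {
  public static Job create() throws Exception {
    Configuration conf = new Configuration();
    // Ask for at most ~32 MB per input split, so e.g. a 128 MB block yields ~4
    // mappers. This only affects this job; dfs.blocksize stays untouched.
    conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 32L * 1024 * 1024);
    Job job = Job.getInstance(conf, "more-mappers");
    // The same thing via the new-API helper:
    // FileInputFormat.setMaxInputSplitSize(job, 32L * 1024 * 1024);
    return job;
  }
}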
I have read through the tutorials from Apache and Yahoo on DistributedCache. I am still confused about one thing though. Suppose I have a file which I want to be copied to all data nodes. So, I use
DistributedCache.addCacheFile(new URI(hdfsPath),job) in the job Driver to make the file available. Then, I call DistributedCache.getLocalCacheFiles(job) inside my Mapper.
Now, I want to create an array on the data node based on the contents of this file so that each time map() runs, it can access the elements of the array. Can I do this? I am confused because if I read the cached file and create the array within the Mapper class, it seems like it would create the array for each new input to the Mapper rather than just once per Mapper. How does this part actually work (i.e., where/when should I create the array)?
There are a few concepts mixed here.
The datanode has nothing to do directly with the DistributedCache; that is a concept of the MapReduce layer.
The desire to reuse the same derivative of the cached file between mappers somewhat contradicts the functional nature of the MR paradigm: mappers should be logically independent.
What you want is a kind of optimization which makes sense if preprocessing the cached file for the mappers is relatively expensive.
You can do it to some extent by saving the preprocessed data in a static variable, lazily evaluating it, and setting Hadoop to reuse virtual machines between tasks. It is not a solution in the "MR" spirit, but it should work.
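A sketch of that pattern (LookupMapper is a made-up name, and the "preprocessing" here is just reading lines into a list): the parsed data lives in a static field, is initialized lazily in setup(), and is shared by subsequent tasks in the same JVM when JVM reuse is enabled (mapred.job.reuse.jvm.num.tasks).

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class LookupMapper extends Mapper<LongWritable, Text, Text, Text> {

  private static volatile List<String> lookup;   // shared per JVM, not per map() call

  @Override
  protected void setup(Context context) throws IOException {
    if (lookup == null) {
      synchronized (LookupMapper.class) {
        if (lookup == null) {
          // Read the locally cached copy of the file distributed by the driver.
          Path[] cached = DistributedCache.getLocalCacheFiles(context.getConfiguration());
          List<String> lines = new ArrayList<String>();
          BufferedReader in = new BufferedReader(new FileReader(cached[0].toString()));
          String line;
          while ((line = in.readLine()) != null) {
            lines.add(line);                       // parse/preprocess as needed
          }
          in.close();
          lookup = lines;
        }
      }
    }
  }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    // The lookup array is already built here; use it for every input record.
    context.write(value, new Text(Integer.toString(lookup.size())));
  }
}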
A better solution would be to preprocess the cached file into a form whose consumption by the mapper is cheap.
Let's assume the whole idea is a kind of optimization; otherwise reading and processing the file for each mapper is just fine.
It can be stated that if preparing the file for each mapper is much cheaper than the map processing itself, or much cheaper than the mapper's startup overhead, we are fine.
By "form" I mean the format of the file, one which can be very efficiently converted into the in-memory structure we need. For example, if we need to search the data, we can store it already sorted. That saves us sorting it each time, which is usually much more expensive than sequential reading from disk.
If in your case it is a modest number of properties (say, thousands), I can assume that reading and initializing them is not significant compared to a single mapper's work.
Scenario:
I have a subset of a database and a data warehouse. I have brought both of them onto HDFS.
I want to analyse the result based on the subset and the data warehouse.
(In short, for each record in the subset I have to scan each and every record in the data warehouse.)
Question:
I want to do this task using a Map-Reduce algorithm, but I don't understand how to take both files as input in the mapper, or how to handle both files in the map phase of map-reduce.
Please suggest an approach so that I am able to do this.
Check Section 3.5 (Relational Joins) in Data-Intensive Text Processing with MapReduce for map-side joins, reduce-side joins and memory-backed joins. In any case, the MultipleInputs class is used to have multiple mappers process different files in a single job.
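As a hedged sketch of a reduce-side join wired up with MultipleInputs (ReduceSideJoin, both mappers, the reducer, and the assumption that the join key is the first comma-separated field are all illustrative):

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.MultipleInputs;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class ReduceSideJoin {

  // Tags each subset record with its join key (key extraction is up to you).
  public static class SubsetMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(v.toString().split(",")[0]), new Text("S:" + v));
    }
  }

  // Tags each data warehouse record the same way, with a different marker.
  public static class WarehouseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable k, Text v, Context ctx)
        throws IOException, InterruptedException {
      ctx.write(new Text(v.toString().split(",")[0]), new Text("W:" + v));
    }
  }

  // Sees all subset and warehouse records sharing one join key together.
  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      for (Text v : values) {
        ctx.write(key, v);            // compare/combine the two sides here
      }
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "reduce-side join");
    job.setJarByClass(ReduceSideJoin.class);
    MultipleInputs.addInputPath(job, new Path(args[0]), TextInputFormat.class, SubsetMapper.class);
    MultipleInputs.addInputPath(job, new Path(args[1]), TextInputFormat.class, WarehouseMapper.class);
    job.setReducerClass(JoinReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileOutputFormat.setOutputPath(job, new Path(args[2]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}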
FYI, you could use Apache Sqoop to import DB into HDFS.
Some time ago I wrote a Hadoop map reduce for one of my classes. I was scanning several IMDb databases and producing merged information about actors (basically the name, biography and films they acted in were spread across different databases). I think you can use the same approach I used for my homework:
I wrote a separate map reduce that turned every database file into the same format, just placing a two-letter prefix in front of every row it produced, so I could tell apart 'BI' (biography), 'MV' (movies) and so on. Then I used all these produced files as input for my last map reduce, which processed them and grouped them in the desired way.
I am not even sure that you need this much work if you are really going to scan every line of the data warehouse. Maybe in that case you can just do the scan either in the map or the reduce phase (depending on what additional processing you want to do). My suggestion assumes that you actually need to filter the data warehouse based on the subsets; if so, it might work for you.
We have a large dataset to analyze with multiple reduce functions.
All the reduce algorithms work on the same dataset generated by the same map function. Reading the large dataset costs too much to do it every time; it would be better to read it only once and pass the mapped data to multiple reduce functions.
Can I do this with Hadoop? I've searched the examples and the intarweb but I could not find any solutions.
Maybe a simple solution would be to write a job that doesn't have a reduce function. So you would pass all the mapped data directly to the output of the job. You just set the number of reducers to zero for the job.
Then you would write a job for each different reduce function that works on that data. This would mean storing all the mapped data on the HDFS though.
Another alternative might be to combine all your reduce functions into a single Reducer which outputs to multiple files, using a different output for each function. Multiple outputs are mentioned in this article for Hadoop 0.19. I'm pretty sure that this feature is broken in the new mapreduce API released with 0.20.1, but you can still use it in the older mapred API.
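A rough sketch of that with the older mapred API's MultipleOutputs (MultiFunctionReducer and the named outputs "sums" and "counts" are made-up placeholders for your own reduce functions):

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;
import org.apache.hadoop.mapred.lib.MultipleOutputs;

// One reducer, several logical "reduce functions", each writing to its own named output.
public class MultiFunctionReducer extends MapReduceBase
    implements Reducer<Text, Text, Text, Text> {

  private MultipleOutputs mos;

  @Override
  public void configure(JobConf conf) {
    mos = new MultipleOutputs(conf);
  }

  public void reduce(Text key, Iterator<Text> values,
                     OutputCollector<Text, Text> out, Reporter reporter)
      throws IOException {
    while (values.hasNext()) {
      Text value = values.next();
      // apply each reduce function and route its result to its named output
      mos.getCollector("sums", reporter).collect(key, value);
      mos.getCollector("counts", reporter).collect(key, value);
    }
  }

  @Override
  public void close() throws IOException {
    mos.close();
  }
}

// In the driver (JobConf-based):
// MultipleOutputs.addNamedOutput(conf, "sums",   TextOutputFormat.class, Text.class, Text.class);
// MultipleOutputs.addNamedOutput(conf, "counts", TextOutputFormat.class, Text.class, Text.class);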
Are you expecting every reducer to work on exactly the same mapped data? At the very least the "key" should be different, since it decides which reducer a record goes to.
You can write the output multiple times in the mapper, using a composite of $i and $key as the output key (where $i identifies the i-th reducer and $key is your original key). Then you need to add a Partitioner to make sure these n records are distributed among reducers based on $i, and a GroupingComparator to group records by the original $key.
It's possible to do that, but not in a trivial way within a single MR job.
You may use composite keys. Let's say you need two kinds of reducers, 'R1' and 'R2'. Add their ids as a prefix to your output keys in the mapper, so a key 'K' becomes 'R1:K' or 'R2:K'.
Then, in the reducer, pass the values to an implementation of R1 or R2 based on the prefix.
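A minimal sketch of that prefixing idea (PrefixedKeyJob is an illustrative name, and the assumption that the record's key is its first tab-separated field is made up):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PrefixedKeyJob {

  // Emit every record twice, once per logical reduce function, with the
  // function id ('R1' or 'R2') prefixed to the key.
  public static class TaggingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // hypothetical: assume the record's key is the first tab-separated field
      String k = value.toString().split("\t")[0];
      ctx.write(new Text("R1:" + k), value);
      ctx.write(new Text("R2:" + k), value);
    }
  }

  // Strip the prefix and dispatch to the corresponding reduce logic.
  public static class DispatchingReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text taggedKey, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String k = taggedKey.toString();
      if (k.startsWith("R1:")) {
        // ... reduce function R1 over the values for key k.substring(3)
      } else if (k.startsWith("R2:")) {
        // ... reduce function R2 over the values for key k.substring(3)
      }
    }
  }
}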
I guess you want to run different reducers in a chain. In Hadoop, "multiple reducers" means running multiple instances of the same reducer. I would propose you run one reducer at a time, providing a trivial (identity) map function for all jobs except the first one. To minimize the time spent on data transfer, you can use compression.
Of course you can define multiple reducers. For the Job (Hadoop 0.20) just add:
job.setNumReduceTasks(<number>);
But: your infrastructure has to support multiple reducers, meaning that you have to
have more than one CPU available
adjust mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml accordingly
And of course your job has to match some specifications. Without knowing what exactly you want to do, I can only give broad tips:
the map output keys either have to be partitionable by % numReducers, OR you have to define your own partitioner:
job.setPartitionerClass(...)
for example with a random partitioner (see the sketch after this list) ...
the data must be reduce-able in the partitioned format ... (references needed?)
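Here is a minimal sketch of such a custom partitioner (RandomPartitioner is a made-up example; note that random partitioning breaks per-key grouping, so it only fits jobs whose reduce step does not need all values of a key in one reducer):

import java.util.Random;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Default behaviour is hash(key) % numReduceTasks; override it only when that
// distribution does not suit your data.
public class RandomPartitioner extends Partitioner<Text, Text> {
  private final Random random = new Random();

  @Override
  public int getPartition(Text key, Text value, int numReduceTasks) {
    // Spread records evenly regardless of the key.
    return random.nextInt(numReduceTasks);
  }
}

// Register it on the job:
// job.setPartitionerClass(RandomPartitioner.class);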
You'll get multiple output files, one for each reducer. If you want a sorted output, you have to add another job reading all files (multiple map-tasks this time ...) and writing them sorted with only one reducer ...
Have a look, too, at the Combiner class, which acts as a local reducer. It means that you can already aggregate (reduce) in memory over partial data emitted by the map.
A very nice example is the WordCount example. The map emits each word as key with a count of 1: (word, 1). The combiner gets partial data from the map and locally emits (word, partial count). The reducer does exactly the same, but now some (combined) word counts are already > 1. This saves bandwidth.
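For concreteness, here is a WordCount-style sum reducer that can be registered both as the combiner and as the reducer (IntSumReducer follows the standard example's naming; the driver wiring at the bottom is shown as comments):

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Registered once as the combiner (local partial sums on the map side) and
// once as the reducer (final sums), it saves bandwidth exactly as described above.
public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private final IntWritable result = new IntWritable();

  @Override
  protected void reduce(Text word, Iterable<IntWritable> counts, Context ctx)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable c : counts) {
      sum += c.get();   // on the combiner these are mostly 1s; on the reducer, partial sums
    }
    result.set(sum);
    ctx.write(word, result);
  }
}

// Driver wiring:
// job.setCombinerClass(IntSumReducer.class);
// job.setReducerClass(IntSumReducer.class);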
I still don't get your problem. You can use the following sequence:
database-->map-->reduce(use cat or None depending on requirement)
then store the data representation you have extracted.
If you are saying that it is small enough to fit in memory, then storing it on disk shouldn't be an issue.
Also, your use of the MapReduce paradigm for the given problem is incorrect: using a single map function and multiple "different" reduce functions makes no sense. It shows that you are just using map to distribute data to different machines to do different things; you don't require Hadoop or any other special architecture for that.