I have a massive, static dataset and a function f to apply to it.
The computation has the form reduce(map(f, dataset)), so the MapReduce skeleton fits. However, I don't want to scatter the data on every request (and ideally I want to take advantage of indexing to speed up f). Is there a MapReduce implementation that addresses this general case?
I've taken a look at IterativeMapReduce and it may do the job, but it seems to address a slightly different case, and the code isn't available yet.
Hadoop's MapReduce (and all the other MapReduce frameworks inspired by Google's) doesn't scatter the data on every request.
I'm currently migrating from the Hadoop MR paradigm to Apache Spark, and a few doubts come to mind regarding advanced efficiency patterns beyond the usual basic "map and reduce" workflow.
In this well-known book (Lin and Dyer 2010), the "in-mapper combiner" pattern is introduced, which can significantly improve efficiency in many applications.
For example, the canonical word count in Hadoop, where we normally emit (word, 1) tuples to be combined further downstream, can be greatly improved if local aggregation is performed and (word, n) tuples are emitted instead. Although combiners can achieve this behaviour, my experience is that using local variables in each mapper along with Hadoop's setup and cleanup methods can lead to higher computational savings (here is a nice tutorial).
In the Spark world I could not find anything similar, just the so-called map-side aggregation, which is equivalent to Hadoop's local combiner. Given the previous example, I wonder whether it can be translated into Spark using map functions.
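A minimal sketch of what I have in mind, assuming the Spark 2.x+ Java API: each partition aggregates counts locally in a HashMap before emitting (word, n) pairs, analogous to a per-mapper map plus setup/cleanup in Hadoop, with a final reduceByKey merging across partitions. Paths and names here are made up.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import scala.Tuple2;

public class InMapperCombinerSketch {
    public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("in-mapper-combiner-sketch").setMaster("local[*]");
        try (JavaSparkContext sc = new JavaSparkContext(conf)) {
            JavaRDD<String> words = sc.textFile("hdfs:///tmp/input")   // hypothetical input path
                    .flatMap(line -> Arrays.asList(line.split("\\s+")).iterator());

            JavaPairRDD<String, Integer> counts = words.mapPartitionsToPair(it -> {
                // Per-partition local aggregation, like an in-mapper combiner.
                Map<String, Integer> local = new HashMap<>();
                while (it.hasNext()) {
                    local.merge(it.next(), 1, Integer::sum);
                }
                List<Tuple2<String, Integer>> out = new ArrayList<>(local.size());
                local.forEach((w, n) -> out.add(new Tuple2<>(w, n)));
                return out.iterator();
            }).reduceByKey(Integer::sum);   // final merge across partitions

            counts.saveAsTextFile("hdfs:///tmp/output");   // hypothetical output path
        }
    }
}
```

Note that reduceByKey already performs map-side aggregation on its own; the explicit mapPartitionsToPair version mainly matters when you want per-partition state or logic beyond a simple pairwise combine.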
I have written K-Means clustering code in MapReduce on Hadoop. If I have a small number of clusters, say 2, and the data is very large, the whole data set is divided into two groups and each reducer receives too many values for a particular key, i.e., the cluster centroid. How do I solve this?
Note: I use the iterative approach to calculate new centers.
Algorithmically, there is not much you can do, as this behaviour is inherent to the algorithm you describe. Your only option, in this respect, is to use more clusters and divide your data among more reducers, but that yields a different result.
So the only thing you can do, in my opinion, is compress. And I do not just mean using one of Hadoop's compression codecs.
For instance, you could find a compact representation of your data, e.g., give an integer id to each element and only pass this id to the reducers. This will save network traffic (store elements as VIntWritables, or define your own VIntArrayWritable extending ArrayWritable) and memory on each reducer.
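A minimal sketch of such a VIntArrayWritable (the no-arg constructor is required so Hadoop can instantiate it during deserialization):

```java
import org.apache.hadoop.io.ArrayWritable;
import org.apache.hadoop.io.VIntWritable;

// Array of variable-length-encoded ints, usable as a map output value type.
public class VIntArrayWritable extends ArrayWritable {
    public VIntArrayWritable() {
        super(VIntWritable.class);           // needed for deserialization
    }
    public VIntArrayWritable(VIntWritable[] values) {
        super(VIntWritable.class, values);
    }
}
```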
In the case of k-means, I think a combiner is not applicable, but if it is, it would greatly reduce the network traffic and the reducers' overhead.
EDIT: It seems that you CAN use a combiner if you follow this iterative implementation. Please edit your question to describe the algorithm that you have implemented.
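If you do follow that iterative implementation, a combiner could look roughly like this. This is a hypothetical sketch that assumes 2-D points and that the mapper emits (clusterId, "x,y,1") as text; the combiner collapses them into one partial sum per cluster per map task, and the reducer only merges partial sums to compute new centroids.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class KMeansCombiner extends Reducer<IntWritable, Text, IntWritable, Text> {
    @Override
    protected void reduce(IntWritable clusterId, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        double sumX = 0, sumY = 0;
        long count = 0;
        for (Text v : values) {                      // each value is "x,y,count"
            String[] parts = v.toString().split(",");
            sumX += Double.parseDouble(parts[0]);
            sumY += Double.parseDouble(parts[1]);
            count += Long.parseLong(parts[2]);
        }
        // Emit one partial sum per cluster for this map task.
        context.write(clusterId, new Text(sumX + "," + sumY + "," + count));
    }
}
```

Summing coordinates and counts is associative and commutative, which is why the same logic is safe to run as a combiner before the reducer.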
If you have too much shuffle data, you will run into OOM issues.
Try splitting the dataset into smaller chunks and tuning
yarn.app.mapreduce.client.job.retry-interval
AND
mapreduce.reduce.shuffle.retry-delay.max.ms
so that there are more splits, but the job's retry intervals are long enough that there are no OOM issues.
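A minimal sketch of setting these on the job configuration. The values are placeholders, not recommendations, and the split-size property is my own assumption for what "smaller chunks" could mean here.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class ShuffleTuningSketch {
    public static void main(String[] args) throws IOException {
        Configuration conf = new Configuration();
        conf.setLong("yarn.app.mapreduce.client.job.retry-interval", 60_000L);     // 60 s between client retries
        conf.setLong("mapreduce.reduce.shuffle.retry-delay.max.ms", 120_000L);     // up to 2 min between shuffle retries
        conf.setLong("mapreduce.input.fileinputformat.split.maxsize", 64L << 20);  // ~64 MB splits => more, smaller chunks
        Job job = Job.getInstance(conf, "shuffle-tuning-sketch");
        // ... set mapper/reducer/input/output as usual, then job.waitForCompletion(true)
    }
}
```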
Though MapReduce may not be the best way to implement algorithms used in image processing, just out of curiosity, which would be the simplest ones to implement if I were to try them out as a beginner?
Hadoop is really well suited for large amounts of IO. So for example, you could make a job that blurs an image, using the algorithm in the fork/join tutorial.
To do this, you'd create a MapReduce job with the following characteristics:
A custom, non-splittable input format for each image (see the sketch below).
A Mapper implementation that does the blurring.
An Identity Reducer.
Here's a good post that should get you started.
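For the non-splittable input format, a rough sketch could look like the following: each image file becomes a single record handed to one mapper. The class names are hypothetical; the record reader loads the whole file into a BytesWritable.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.JobContext;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class WholeImageInputFormat extends FileInputFormat<NullWritable, BytesWritable> {

    @Override
    protected boolean isSplitable(JobContext context, Path file) {
        return false;  // never split an image across mappers
    }

    @Override
    public RecordReader<NullWritable, BytesWritable> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        return new WholeFileRecordReader();
    }

    // Reads the entire file into a single BytesWritable value.
    public static class WholeFileRecordReader extends RecordReader<NullWritable, BytesWritable> {
        private FileSplit split;
        private Configuration conf;
        private final BytesWritable value = new BytesWritable();
        private boolean processed = false;

        @Override
        public void initialize(InputSplit split, TaskAttemptContext context) {
            this.split = (FileSplit) split;
            this.conf = context.getConfiguration();
        }

        @Override
        public boolean nextKeyValue() throws IOException {
            if (processed) {
                return false;
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(conf);
            try (FSDataInputStream in = fs.open(file)) {
                IOUtils.readFully(in, contents, 0, contents.length);
            }
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public NullWritable getCurrentKey() { return NullWritable.get(); }

        @Override
        public BytesWritable getCurrentValue() { return value; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { }
    }
}
```

The mapper would then decode the bytes, apply the blur, and write the result, with an identity reducer (or no reducer at all) passing the output through.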
(I only have conceptual knowledge of NoSQL, no working experience)
I am aware of the following types of NoSQL databases:
key-value, column family, document databases (Aggregates)
graph databases
Is the Map-Reduce paradigm applicable to all of them? My guess would be no, since Map-Reduce is often discussed in terms of keys and values, but since the distinction between different NoSQL stores isn't so clean-cut, I am wondering where Map-Reduce is and isn't applicable. And since I'm in the process of evaluating which DB to use for a few app ideas, I'd like to understand whether it's possible to achieve large-scale processing regardless of which store I use.
Support for map reduce probably shouldn't be the thing on which to base your choice of a datastore.
Firstly, map reduce isn't the only way to do large-scale data processing. For example, MongoDB implemented map reduce support early (in v1), but later added its Aggregation Framework, which is much more general and subsumes many tasks that would otherwise use map reduce.
Map reduce is just one paradigm for processing large data sets. Use it only if your application needs to process a large number of data records with a mapper and then needs to combine results together with a reducer. That's all it really does. As to when the paradigm is applicable and when it is not, simply look at your use case. Do you need to manipulate all of your records consistently and then combine the results? Or is there another way to phrase your problem?
Take a look at the Mongo aggregation framework for examples where aggregation is a simpler alternative for problems that it would be overkill to force into a map-reduce shape.
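For instance, a group-and-count that is often shown as a map-reduce example is a short pipeline with aggregation. This is a hypothetical sketch assuming the MongoDB 4.x Java sync driver; the database and collection names are made up.

```java
import java.util.Arrays;

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Accumulators;
import com.mongodb.client.model.Aggregates;
import org.bson.Document;

public class AggregationSketch {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> orders =
                    client.getDatabase("shop").getCollection("orders");  // hypothetical names
            // Count documents per status: no mapper or reducer needed.
            orders.aggregate(Arrays.asList(
                    Aggregates.group("$status", Accumulators.sum("count", 1))))
                  .forEach(doc -> System.out.println(doc.toJson()));
        }
    }
}
```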
It should also help give you insight into your question of whether you can do large-scale data processing without map-reduce, to which the answer is yes. Clearly map-reduce is good for making search indexes, but many problems on large data sets benefit from other paradigms.
A web search on "alternatives to map reduce" will also be helpful.
From the little I have read, I understand that Hadoop is great for the following class of problems: answering one massive question by distributing the computation across potentially many nodes.
Was Hadoop designed to solve problems that involve multiple calculations on the same dataset, each with different parameters? For example, simulating different scenarios based on the same master dataset but with different parameters (e.g., testing a data mining model on the same data set by spawning multiple iterations of the simulation, each with a different set of parameters, and finding the best model).
E.g., for a weather prediction model that has a set of rules with different weights, does Hadoop support running the same model with each "node" using different weight values on a learning set, and comparing the prediction results to find the best model?
Or is this something that Hadoop was simply not designed to do?
It's not really something it was designed to do. Typically you would want to distribute different parts of the dataset to different nodes; this is one of the main ideas behind Hadoop: split a huge dataset over multiple nodes and bring the computation to the data. However, it can definitely still be accomplished without jumping through too many hoops.
I'm not sure how familiar you are with the MapReduce paradigm, but you could think of the parameters of your models as the "input" to your map task. Put the dataset in some location in HDFS, and write a MapReduce job such that the map tasks read in the dataset from HDFS, then evaluate the model using the given parameters. All map outputs get sent to one reducer, which simply outputs the parameters that gave the highest score. If you make the number of input files (model parameters) equal to the number of nodes, you should get one map task per node, which is what you want.
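A rough sketch of that layout, assuming a hypothetical evaluateModel() helper that loads the shared dataset from HDFS and scores one parameter set. Each input line holds one parameter set; all (score, params) pairs go to a single reducer that keeps the best one.

```java
import java.io.IOException;

import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ParameterSweep {

    public static class EvalMapper
            extends Mapper<LongWritable, Text, NullWritable, Text> {
        @Override
        protected void map(LongWritable offset, Text paramLine, Context context)
                throws IOException, InterruptedException {
            double score = evaluateModel(paramLine.toString());  // hypothetical helper
            // Send (score, params) to the single reducer under one key.
            context.write(NullWritable.get(), new Text(score + "\t" + paramLine));
        }

        private double evaluateModel(String params) {
            // Placeholder: read the dataset from HDFS and score the model here.
            return 0.0;
        }
    }

    public static class BestModelReducer
            extends Reducer<NullWritable, Text, Text, DoubleWritable> {
        @Override
        protected void reduce(NullWritable key, Iterable<Text> scoredParams, Context context)
                throws IOException, InterruptedException {
            double best = Double.NEGATIVE_INFINITY;
            String bestParams = null;
            for (Text t : scoredParams) {
                String[] parts = t.toString().split("\t", 2);
                double score = Double.parseDouble(parts[0]);
                if (score > best) {
                    best = score;
                    bestParams = parts[1];
                }
            }
            // Output only the parameter set with the highest score.
            context.write(new Text(bestParams), new DoubleWritable(best));
        }
    }
}
```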