Map-Reduce - only applicable to key-value NoSQL data models?

(I only have conceptual knowledge of NoSQL, no working experience)
I am aware of the following types of NoSQL databases:
key-value, column family, document databases (Aggregates)
graph databases
Is the Map-Reduce paradigm applicable to all? My guess would be no since Map-Reduce is often discussed in terms of keys and values, but since the distinction between different NoSQL stores isn't so clean-cut, I am wondering where Map-Reduce is and isn't applicable. And since I'm in the process of evaluating which DB to use for a few app ideas I have, I should think whether it's possible to achieve large scale processing regardless of which store I use.

Support for map reduce probably shouldn't be the thing on which to base your choice of a datastore.
Firstly, map reduce isn't the only way to do large-scale data processing. For example, MongoDB implemented map reduce support early (in v1), but later added their Aggregation Framework which was much more general, subsuming many tasks that would make use of map reduce.
Map reduce is just one paradigm for processing large data sets. Use it only if your application needs to process a large number of data records with a mapper and then needs to combine results together with a reducer. That's all it really does. As to when the paradigm is applicable and when it is not, simply look at your use case. Do you need to manipulate all of your records consistently and then combine the results? Or is there another way to phrase your problem?
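To make the "mapper, then reducer" shape concrete, here is a minimal single-process sketch. It is not tied to any particular datastore; the `map_reduce` helper, the `orders` data and the field layout are all made up for illustration.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group emitted pairs by key, reduce each group."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical records: (customer, order_total)
orders = [("alice", 30), ("bob", 12), ("alice", 5), ("carol", 7), ("bob", 8)]


def mapper(order):
    # Emit one (key, value) pair per record.
    customer, total = order
    yield customer, total


def reducer(customer, totals):
    # Combine all values that share a key.
    return sum(totals)


totals_by_customer = map_reduce(orders, mapper, reducer)
print(totals_by_customer)
```

If your problem does not naturally split into "emit pairs per record" and "combine per key", that is usually a sign that another paradigm fits better.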
Take a look at the MongoDB aggregation framework for examples where aggregation is the simpler alternative; forcing many of those problems into a map-reduce shape would be overkill.
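As a rough illustration of that style, the sketch below shows the shape of a pipeline built from the `$match` and `$group` stages, next to a plain-Python equivalent. The collection contents and field names are assumptions, and the pipeline is only defined as data here, not run against a server.

```python
from collections import Counter

# Hypothetical documents, standing in for a MongoDB collection.
orders = [
    {"customer": "alice", "status": "paid", "total": 30},
    {"customer": "bob", "status": "paid", "total": 12},
    {"customer": "alice", "status": "refunded", "total": 5},
    {"customer": "bob", "status": "paid", "total": 8},
]

# Shape of an aggregation pipeline using the $match and $group stages.
# With pymongo this list would be passed to collection.aggregate(pipeline);
# it is shown here only as data and is not executed.
pipeline = [
    {"$match": {"status": "paid"}},
    {"$group": {"_id": "$customer", "total": {"$sum": "$total"}}},
]

# Plain-Python equivalent of the pipeline above: filter, then sum per customer.
paid_totals = Counter()
for order in orders:
    if order["status"] == "paid":
        paid_totals[order["customer"]] += order["total"]

print(dict(paid_totals))
```

The declarative pipeline says what to compute per group, without asking you to write separate map and reduce functions.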
It should also help give you insight into your question of whether you can do large-scale data processing without map-reduce, to which the answer is yes. Clearly map-reduce is good for making search indexes, but many problems on large data sets benefit from other paradigms.
A web search on "alternatives to map reduce" will also be helpful.

Related

Most efficient storage format for HDFS data

I have to store a lot of data on dedicated storage servers in HDFS. This is a kind of archive for historical data. The data being stored is row oriented and has tens of different kinds of fields. Some of them are Strings, some are Integers, and there are also a few Floats, Shorts, ArrayLists and a Map.
The idea is that the data will be scanned from time to time using a MapReduce or Spark job.
Currently I am storing them as SequenceFiles with NullWritable as keys and a custom WritableComparable class as values. This custom class has all of these fields defined.
I would like to achieve two goals. One is to optimize the size of the data, as it is getting really big: I have to add new servers every few weeks and the costs are constantly growing. The other is to make it easier to add new fields; in the current state, if I wanted to add a new field I would have to rewrite all of the old data.
I tried to achieve this by using an EnumMap inside this class. It gave quite good results, as it allows adding new fields easily, and the size of the data was also reduced by 20% (the reason is that a lot of fields in a record are often empty). But the code I wrote looks awful, and it gets even uglier when I try to add Lists and Maps to this EnumMap. It's OK for data of the same type, but trying to combine all of the fields is a nightmare.
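The EnumMap idea boils down to storing only the fields that are actually set and filling in defaults on read. A minimal sketch of that idea, using JSON purely for illustration (it is not the SequenceFile, Avro or Parquet wire format, and the schema and field names are hypothetical):

```python
import json

# Hypothetical schema: field name -> value that counts as "empty".
SCHEMA = {"name": "", "age": 0, "score": 0.0, "tags": [], "attrs": {}}


def encode_sparse(record):
    """Keep only fields whose value differs from the schema default."""
    sparse = {k: v for k, v in record.items() if v != SCHEMA.get(k)}
    return json.dumps(sparse, separators=(",", ":")).encode("utf-8")


def decode_sparse(blob, schema=SCHEMA):
    """Fill in defaults for any field missing from the stored bytes."""
    record = {
        k: (v.copy() if isinstance(v, (list, dict)) else v)
        for k, v in schema.items()
    }
    record.update(json.loads(blob.decode("utf-8")))
    return record


record = {"name": "sensor-7", "age": 0, "score": 0.0, "tags": [], "attrs": {}}
dense = json.dumps(record, separators=(",", ":")).encode("utf-8")
sparse = encode_sparse(record)
print(len(dense), len(sparse))

# Adding a field later: old blobs decode with the new field's default,
# so the old data does not have to be rewritten.
NEW_SCHEMA = dict(SCHEMA, region="")
print(decode_sparse(sparse, NEW_SCHEMA))
```

The saving depends entirely on how many fields are empty per record, which matches the observation that mostly-empty records shrank noticeably.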
So I thought of some other popular formats. I have tried Avro and Parquet, but the size of the data is almost exactly the same as with SequenceFiles and the custom class before I tried the EnumMap. So they solve the problem of adding new fields without rewriting old data, but I feel there is more potential to optimize the size of the data.
One more thing I still have to check is, of course, the time it takes to load the data (this will also tell me whether it's OK to use bzip2 compression or whether I have to go back to gzip because of performance). But before I proceed with this, I was wondering whether someone could suggest another solution or offer a hint.
Thanks in advance for all comments.
Most of your approach seems good. I just decided to add some of my thoughts in this answer.
The data being stored is row oriented and has tens of different kinds
of fields. Some of them are Strings, some are Integers, and there are also
a few Floats, Shorts, ArrayLists and a Map.
None of the types you have mentioned here are any more complex than the data types supported by Spark, so I wouldn't bother changing the data types in any way.
achieve two goals. One is to optimize the size of the data, as it is
getting really big: I have to add new servers every few weeks and
the costs are constantly growing.
By adding servers, are you also adding compute? Storage should be relatively cheap, and I'm wondering if you are adding compute with your servers, which you don't really need. You should only be paying to store and retrieve data. Consider a simple object store like S3 that only charges you for storage space and gives a free quota of access requests (GET/PUT/POST) - I believe about 1000 requests are free and it costs only ~$10 for a terabyte of storage per month.
The other is to make it easier to add new fields; in the current
state, if I wanted to add a new field I would have to rewrite
all of the old data.
If you have a use case where you will be writing to the files more often than reading them, I'd recommend not storing them on HDFS, which is better suited to write-once, read-many applications. That said, I'd recommend starting with Parquet, since I think you will need a file format that allows slicing and dicing the data. Avro is also a good choice, as it also supports schema evolution, but it is better suited to complex structures where you need to specify the schema and want easier serialization and deserialization with Java objects.
One more thing I still have to check is, of course, the time it
takes to load the data (this will also tell me whether it's OK to use bzip2
compression or whether I have to go back to gzip because of performance)
Bzip2 has the highest compression, but is also the slowest, so I'd recommend it only if the data isn't used or queried frequently. Gzip's compression is comparable to Bzip2's, and it is faster. Also consider Snappy compression, as it balances performance and storage and can support splittable files for certain file types (Parquet or Avro), which is useful for map-reduce jobs.
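Rather than trusting general rules, it is worth measuring the size versus speed trade-off on a sample of your own data. A minimal sketch using only Python's standard library (so it covers gzip and bzip2 but not Snappy, which needs a third-party package); the sample rows are made up:

```python
import bz2
import gzip
import time

# Hypothetical row-oriented sample; replace with a slice of real data.
rows = [f"{i},user{i % 50},{i * 3 % 7},status_ok\n" for i in range(20000)]
raw = "".join(rows).encode("utf-8")


def measure(name, compress, decompress):
    """Return (name, compressed size in bytes, compression time in seconds)."""
    start = time.perf_counter()
    packed = compress(raw)
    elapsed = time.perf_counter() - start
    assert decompress(packed) == raw  # round trip must be lossless
    return name, len(packed), elapsed


results = [
    measure("gzip", gzip.compress, gzip.decompress),
    measure("bz2", bz2.compress, bz2.decompress),
]
for name, size, seconds in results:
    print(f"{name}: {size} bytes ({size / len(raw):.1%} of raw), {seconds:.3f}s")
```

The numbers will vary with the data and the machine, so treat the output as a comparison for your own workload rather than a general ranking.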

How does CouchDB 1.6 inherently take advantage of map-reduce when it is a single-server database?

I am new to CouchDB. While going through the CouchDB 1.6 documentation, I learned that it is a single-server database, so I was wondering how map-reduce inherently takes advantage of that.
If I need to scale this database, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS does?
I learned that CouchDB 2.0 plans to bring a clustering feature, but I could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and as implemented in Hadoop with the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  // Index employee records by their supervisor's identifier; no value is needed.
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, null);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs at least of queries (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than what fits in memory at once. Because the output of the "reduce" function is limited in size, it further encourages efficient processing of a large set of data into an efficiently accessed index.
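A rough mental model of that "stored map results" idea, sketched in Python rather than CouchDB's JavaScript. A sorted list stands in for the B-tree, the documents are invented, and none of this reflects how CouchDB is actually implemented internally:

```python
import bisect

# Hypothetical documents; field names mirror the map function shown earlier.
docs = [
    {"_id": "e1", "isEmployeeRecord": True, "supervisor": {"identifier": "s1"}},
    {"_id": "e2", "isEmployeeRecord": True, "supervisor": {"identifier": "s2"}},
    {"_id": "e3", "isEmployeeRecord": True, "supervisor": {"identifier": "s1"}},
    {"_id": "x9", "isEmployeeRecord": False},
]


def map_doc(doc):
    # Sees one document at a time, like a view's map function.
    if doc.get("isEmployeeRecord"):
        yield doc["supervisor"]["identifier"], 1


def build_view(docs, map_fn):
    """Apply map_fn to each document and keep the emitted rows sorted by key."""
    rows = [(key, doc["_id"], value) for doc in docs for key, value in map_fn(doc)]
    rows.sort()
    return rows


def query(view, key):
    """Find all rows for a key with binary search instead of a full scan."""
    keys = [row[0] for row in view]
    lo = bisect.bisect_left(keys, key)
    hi = bisect.bisect_right(keys, key)
    return view[lo:hi]


view = build_view(docs, map_doc)
rows_for_s1 = query(view, "s1")
count_for_s1 = sum(value for _, _, value in rows_for_s1)  # a "count" style reduce
print(rows_for_s1, count_for_s1)
```

The point of the sketch is that the expensive part (running the map function) happens when the index is built, while a query is only a lookup into already sorted rows.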
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear or solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

Joins in MapReduce

While going through the Hadoop in Action book I came across several classes for reduce-side joins, among them DataJoinMapperBase, TaggedMapOutput and DataJoinReducerBase.
But when I searched Google for the join concept on Hadoop, none of the results were based on those classes. Instead, people implemented their own logic, and many solutions were based on MultipleInputs.
Now my question is: which is the better approach for joins on Hadoop? What could be done to achieve better results? Any suggestions?
You could try the Pangool library; it makes reduce-side joins very easy. Map-side joins are just a memory lookup.
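To show what the two join styles mean, here is a minimal in-memory sketch. It does not use Pangool's API or the Hadoop DataJoin classes; the function names and the sample data are hypothetical.

```python
from collections import defaultdict

# Hypothetical inputs: two datasets sharing a customer id.
customers = [(1, "alice"), (2, "bob"), (3, "carol")]
orders = [(1, "book"), (1, "pen"), (3, "lamp"), (4, "orphan")]


def reduce_side_join(left, right):
    """Tag each record with its source, group by key, pair up within each group."""
    groups = defaultdict(lambda: {"L": [], "R": []})
    for key, value in left:
        groups[key]["L"].append(value)
    for key, value in right:
        groups[key]["R"].append(value)
    joined = []
    for key in sorted(groups):
        # The "reducer": sees every record for one key, from both sources.
        for left_value in groups[key]["L"]:
            for right_value in groups[key]["R"]:
                joined.append((key, left_value, right_value))
    return joined


def map_side_join(small, large):
    """Load the small side into memory and look each large-side key up."""
    lookup = dict(small)
    return [(key, lookup[key], value) for key, value in large if key in lookup]


print(reduce_side_join(customers, orders))
print(map_side_join(customers, orders))
```

The map-side version only works when one side fits in memory on every mapper; the reduce-side version handles two large inputs at the cost of shuffling both.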

Motivation behind an implicit sort after the mapper phase in map-reduce

I am trying to understand why map-reduce does an implicit sort during the shuffle and sort phase, on both the map side and the reduce side, which shows up as a mixture of in-memory and on-disk sorting (and can be really expensive for large sets of data).
My concern is that performance is a significant consideration when running map-reduce jobs, and implicitly sorting by key before passing the mapper output to the reducer will have a great impact on performance when dealing with large sets of data.
I understand that sorting can be a boon in cases where it is explicitly required, but that is not always the case. So why does the concept of implicit sorting exist in Hadoop Map-Reduce?
For any kind of reference to what I am talking about while mentioning the shuffle and sort phase feel free to give a brief reading to the post : Map-Reduce: Shuffle and Sort on my blog: Hadoop-Some Salient Understandings
One possible explanation, which came to my mind much later after posting this question, is:
The sorting is done just to bring together all the records corresponding to a particular key, so that all the records for that key may be sent to a single reducer (the default partitioning logic in Hadoop Map-Reduce). So sorting all the records by key after the mapper phase simply gathers the records for each key together, and the sorted order of the keys is a by-product that only certain use cases, such as sorting large sets of data, actually rely on.
It would be great if people could confirm whether they think the same. Thanks.
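The grouping effect described in that explanation can be sketched in a few lines: once the pairs are sorted by key, all values for a key are adjacent, so a reducer can process one key at a time in a single pass. This is only an illustration with invented data, not Hadoop's shuffle implementation.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical mapper output arriving in arbitrary order.
mapper_output = [("b", 2), ("a", 1), ("c", 5), ("a", 4), ("b", 3)]


def reduce_sorted(pairs, reducer):
    """Sort by key, then stream through once; each key's values are contiguous."""
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, reducer(value for _, value in group)


result = list(reduce_sorted(mapper_output, sum))
print(result)
```

Because equal keys are contiguous after the sort, the reducer never has to hold more than one key's values at a time, which matters when the data does not fit in memory.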

Efficient MapReduce when dealing with streams of queries to the same dataset

I have a massive, static dataset and a function to apply to it.
f is in the form reduce(map(f, dataset)), so I would use the MapReduce skeleton. However, I don't want to scatter the data at each request (and ideally I want to take advantage of indexing in order to speed up f). Is there a MapReduce implementation that addresses this general case?
I've taken a look at IterativeMapReduce and maybe it does the job, but it seems to address a slightly different case, and the code isn't available yet.
Hadoop's MapReduce (and all the other map-reduce skeletons inspired by Google's) doesn't scatter the data all the time.
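The idea of keeping the data in place and sending only the computation to it can be sketched as follows. The `ResidentDataset` class is purely hypothetical, and the sketch assumes the reduce function is associative with `initial` as its identity value.

```python
from functools import reduce


class ResidentDataset:
    """Hypothetical holder that partitions the data once and reuses the partitions."""

    def __init__(self, data, num_partitions):
        # Partitioning happens a single time, not once per query.
        self.partitions = [data[i::num_partitions] for i in range(num_partitions)]

    def query(self, map_fn, reduce_fn, initial):
        # Each partition is reduced locally; only the small partial results move.
        partials = [
            reduce(reduce_fn, map(map_fn, part), initial) for part in self.partitions
        ]
        return reduce(reduce_fn, partials, initial)


dataset = ResidentDataset(list(range(1, 101)), num_partitions=4)
total = dataset.query(lambda x: x, lambda a, b: a + b, 0)
sum_of_squares = dataset.query(lambda x: x * x, lambda a, b: a + b, 0)
print(total, sum_of_squares)
```

Each new query reuses the same partitions, so the per-request cost is the computation plus moving a handful of partial results.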
