Joins in MapReduce - Hadoop

While going through the Hadoop in Action book I came across several classes related to reduce-side joins, among them DataJoinMapperBase, TaggedMapOutput, and DataJoinReducerBase.
But when I searched Google for join techniques on Hadoop, none of the examples were based on the classes above. Instead they implemented their own logic, and many were based on MultipleInputs.
Now my question is: which is the better approach for joins on Hadoop? What could be done to achieve better results? Any suggestions on this?

You could try the Pangool library; it makes reduce-side joins very easy. Map-side joins are essentially just an in-memory lookup against the smaller dataset.
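To make that concrete, here is a minimal sketch of a reduce-side join in the Hadoop Streaming style, i.e. plain Python scripts reading stdin rather than the DataJoin classes from the book; the inputs and field names are invented for illustration. Each map pass tags its records with the source they came from, the framework sorts and groups by the join key, and the reducer pairs up the two tagged sides.
#!/usr/bin/env python
# Reduce-side join sketch in Hadoop Streaming style (illustrative only).
# Assumes two tab-separated inputs: "users" (user_id, name) and "orders" (user_id, amount).
import sys
from itertools import groupby

def mapper(source):
    # Tag every record with its source so the reducer can tell the sides apart.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("\t")
        print("\t".join([fields[0], source] + fields[1:]))

def reducer():
    # The shuffle delivers records sorted by join key; combine the two tagged sides.
    rows = (line.rstrip("\n").split("\t") for line in sys.stdin)
    for key, group in groupby(rows, key=lambda r: r[0]):
        users, orders = [], []
        for row in group:
            (users if row[1] == "users" else orders).append(row[2:])
        for u in users:
            for o in orders:
                print("\t".join([key] + u + o))

if __name__ == "__main__":
    if sys.argv[1] == "map":
        mapper(sys.argv[2])   # run as "map users" or "map orders"
    else:
        reducer()
A map-side join, by contrast, would load the smaller input (here, users) into a dictionary in each map task and look up the join key per order record; that is the "memory lookup" mentioned above, and it avoids the shuffle entirely.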

Related

What algorithms could i implement to improve the general design and performance of a database?

I'm working on a project for university. Do you know what kind of algorithms I could implement that would help with the proper design and general performance of a database? So far I have come up with an algorithm that can help the user pick candidate keys, and also an algorithm for normalization up to 3NF. Do you have any other ideas or suggestions? Thanks.
This is like asking how to make a car more efficient. It's such a broad question that it's essentially unanswerable. There are so many moving parts in a car, and each one has its own problems; you really need to understand what each component is doing. In the case of databases, you need to understand the data before you try to fix it. And if you want a good answer, you have to ask the right questions.
A good question should include context on what you are working with, and what you are trying to do. And when it comes to data manipulation, the details are extremely important. How is your data represented? What kind of infrastructure are you working with? What purpose does the data serve, and what processes use this data? If you are working with floating point numbers, are your processes tolerant of small rounding errors? Would your organization even let you make changes to how the data is stored?
In general, adding algorithms to improve data performance is probably largely unnecessary. Databases are designed out of the box to be simple and efficient. If there were a known method to increase efficiency in general without any drawbacks, there's no reason why the designers of the system wouldn't have implemented it already.
I am posting this as an answer only because there is no way to say it in the comment section. You need to understand a basic principle of database design and data model construction: what is your database for? That is the main question, and believe it or not, even people with experience make the same mistake.
As you were saying, 3NF can be good for OLTP systems, but it would be horrendous for data warehouses or reporting databases, where the queries are huge and run as big batch operations. In those systems denormalization almost always offers better results.
Once you know what your database is for, you can start to apply some "best practices", but even here there is a lot of room for interpretation, and worse, the same principles can be good in one place and very bad in another. Here is an example from my own experience.
Eight years ago I started a project in which we had to design a database for a financial application. After some analysis, we decided to use a star model, or dimension-fact model. We decided to create indexes (including bitmap indexes) for some tables, even though we had to rebuild them during batch loads to avoid performance degradation.
The funny thing is that after some months I realised the indexes were useless, as the users were running queries that accessed the whole data set, mostly analytics and aggregation. Consequence: I dropped all the indexes.
Is that a good thing to do in general? No, it is not, but in my scenario it was the best choice, and performance increased a lot, both in the batch loads and in the user experience.
In summary, as an old friend of mine who worked in Oracle Support used to tell me: "Performance is an art, my friend, not a science."
There are too many database algorithms to list, but below is a structured way of thinking about classes of algorithms that affect database performance.
Algorithm analysis is a helpful way of categorizing and thinking about many database performance problems. While most performance problems are solved with best practices and trial-and-error, we'll never truly understand why one solution is better than another without understanding the algorithms behind them. Below is a list of functions that describe the algorithmic complexity of different database operations, ordered from fastest to slowest.
O(1/N) – Batching to reduce overhead for bulk collect, sequences, fetching rows
O(1) – Hashing for hash partitioning, hash clusters, hash joins
O(LOG(N)) – Index access for b-trees
1/((1-P)+P/N) – Amdahl's Law and its implications for parallelizing large data warehouse workloads
O(N) – Full table scans, hash joins (in theory)
O(N*LOG(N)) – Full table scan versus repeated index reads, sorting, global versus local indexes, gathering statistics (distinct approximations and partition birthday problems)
O(N^2) – Cross joins, nested loops, parsing
O(N!) – Join order
O(∞) – The optimizer (satisficing and avoiding the halting problem)
One suggestion - based on the way you phrased your questions and comments, you're thinking of a database as merely a place to store data. But the most interesting parts of a database come into play when you think of it as a joining machine. There's not much to optimize about data sitting around; the real work happens when data is combined.
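To make the "joining machine" point and a couple of entries in the list above concrete, here is a small, self-contained Python sketch (not from the book, and with made-up data) contrasting a nested-loop join, which is O(N^2) for two inputs of size N, with a hash join, which is roughly O(N).
# Illustrative comparison of two join strategies from the list above.
# Both join "orders" to "customers" on customer_id; the data is invented.
customers = [(i, f"name-{i}") for i in range(1000)]
orders = [(i % 1000, i * 10) for i in range(1000)]

def nested_loop_join(customers, orders):
    # O(N^2): every order is compared against every customer.
    return [(c_id, name, amount)
            for (c_id, name) in customers
            for (o_cid, amount) in orders
            if c_id == o_cid]

def hash_join(customers, orders):
    # O(N): build a hash table on one side, probe it with the other.
    by_id = {c_id: name for (c_id, name) in customers}
    return [(o_cid, by_id[o_cid], amount)
            for (o_cid, amount) in orders
            if o_cid in by_id]

assert sorted(nested_loop_join(customers, orders)) == sorted(hash_join(customers, orders))
The O(N!) entry is about a different level of the problem: the optimizer has to pick the order in which pairs of tables are joined, and the number of possible orders grows factorially with the number of tables.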
The above list is based on Chapter 16 of my book, Pro Oracle SQL Development. You can read an early version of the entire chapter for free here. While the chapter mostly stands alone, it requires an advanced understanding of Oracle. But each of the topics could be the basis for a lifetime of academic study, so you only need to pick one.

Which structure is better for my cms platform

I am designing our CMS platform and have run into the question of how to combine response data. I know two ways, solution 1 and solution 2. Which is better, and why? Does anyone have a better way?
Solution1:
Solution2:
You will have to share more context with us to get a definite answer, but with the information available I can give you the following.
Solution 1 is tightly coupled; coupled entities can be a problem in the long run, but it will be easier to maintain if the entities are coupled in the correct manner.
Solution 2 has better separation of concerns, but you will have to clearly identify the entities and will have multiple files to maintain.
Usually a single entity represents a single table in the database, along with that table's constraints.

How couchdb 1.6 inherently take advantage of Map reduce when it is Single Server Database

I am new to CouchDB. While going through the documentation for CouchDB 1.6, I learned that it is a single-server database, so I was wondering how map reduce inherently takes advantage of it.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware, like HDFS does?
I learned that CouchDB 2.0 is planning to bring a clustering feature, but could not find proper documentation on this.
Can you please help me understand how exactly files get stored and accessed internally?
Really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index, "map" and optionally "reduce", form very simple building blocks that are easy to reason about, at least once your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say: I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
function (doc) {
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
}
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregate functions purposes, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs, at least for queries (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently-accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the output of the "reduce" function is expected to stay small, it further encourages efficient processing of a large set of data into an efficiently-accessed index.
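As a purely conceptual illustration of that last point (this is plain Python, not CouchDB's actual JavaScript view API or its incremental B-tree machinery, and the documents are made up), a view amounts to mapping each document to (key, value) pairs, grouping by key, and optionally reducing each group:
# Conceptual model of a CouchDB-style view, using invented employee documents.
from collections import defaultdict

docs = [
    {"isEmployeeRecord": True, "name": "Ann", "supervisor": {"identifier": "u1"}},
    {"isEmployeeRecord": True, "name": "Bob", "supervisor": {"identifier": "u1"}},
    {"name": "not an employee record"},
]

def map_fn(doc):
    # Mirrors the JavaScript map function above: one doc in, zero or more (key, value) out.
    if doc.get("isEmployeeRecord"):
        yield doc["supervisor"]["identifier"], doc["name"]

def reduce_fn(values):
    # An aggregate, e.g. a count of employees per supervisor.
    return len(values)

index = defaultdict(list)
for doc in docs:                      # "map" runs one document at a time
    for key, value in map_fn(doc):
        index[key].append(value)

by_supervisor = dict(index)                            # what a ?key=... query reads
counts = {k: reduce_fn(v) for k, v in index.items()}   # what the reduce adds on top
print(by_supervisor)  # {'u1': ['Ann', 'Bob']}
print(counts)         # {'u1': 2}
CouchDB stores the mapped rows (and partial reduce values) in a B-tree so that repeated queries and incremental updates stay cheap; the sketch above only shows the logical result, not that storage layer, which is what the articles below describe.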
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear or solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea comes into play more. So if you have designed an app to work with CouchDB 1.x, it may scale in the newer versions without further intervention on your part.

Efficient Data Structure for Query Sync

I have giant lists of search queries with cached image results on a few different servers, and I want to sync the queries efficiently. I know that one way would be to do it in two steps: first comparing the queries, and second, syncing only the non-identical results. Instead, though, I'd like it to be faster and more efficient by exchanging only a small, fixed amount of data and then syncing non-identical results based on that data (it's fine if it happens to sync a small number of identical results).
What kind of data structure for these queries would be recommended to accomplish this? I've been looking at https://en.wikipedia.org/wiki/List_of_data_structures to try to get a better idea, but I don't have a lot of experience in algorithms so I could really use some direction. I'm planning to do this in C++ if that needs to be taken into consideration. All suggestions appreciated, thanks.
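One way to realise the "small fixed amount of data" idea described above is to hash each query into one of a fixed number of buckets, exchange only the per-bucket digests, and re-sync just the buckets whose digests differ. This is only an illustrative Python sketch of that idea under those assumptions (Merkle trees and Bloom filters are the usual refinements, and the same structure translates directly to C++), not the only possible answer.
# Bucket-digest sketch: exchange a fixed number of hashes instead of every query.
import hashlib

NUM_BUCKETS = 256  # the amount of data exchanged is fixed, regardless of list size

def bucket_digests(queries):
    # Assign each query to a bucket, then hash each bucket's sorted contents.
    buckets = [[] for _ in range(NUM_BUCKETS)]
    for q in queries:
        idx = int(hashlib.sha256(q.encode()).hexdigest(), 16) % NUM_BUCKETS
        buckets[idx].append(q)
    digests = [hashlib.sha256("\n".join(sorted(b)).encode()).hexdigest() for b in buckets]
    return digests, buckets

def buckets_to_sync(local_queries, remote_digests):
    # Only buckets whose digests differ need their queries and cached results exchanged.
    local_digests, local_buckets = bucket_digests(local_queries)
    return [local_buckets[i] for i in range(NUM_BUCKETS) if local_digests[i] != remote_digests[i]]

# Example: the two peries differ by one query, so only one bucket gets re-synced.
digests_b, _ = bucket_digests(["cats", "dogs"])
print(buckets_to_sync(["cats", "dogs", "sunsets"], digests_b))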

Map-Reduce - only applicable to key-value NoSql data models?

(I only have conceptual knowledge of NoSQL, no working experience)
I am aware of the following types of NoSQL databases:
key-value, column family, document databases (Aggregates)
graph databases
Is the Map-Reduce paradigm applicable to all of them? My guess would be no, since Map-Reduce is often discussed in terms of keys and values, but since the distinction between different NoSQL stores isn't so clean-cut, I am wondering where Map-Reduce is and isn't applicable. And since I'm in the process of evaluating which DB to use for a few app ideas I have, I should consider whether it's possible to achieve large-scale processing regardless of which store I use.
Support for map reduce probably shouldn't be the thing on which to base your choice of a datastore.
Firstly, map reduce isn't the only way to do large-scale data processing. For example, MongoDB implemented map reduce support early (in v1), but later added their Aggregation Framework which was much more general, subsuming many tasks that would make use of map reduce.
Map reduce is just one paradigm for processing large data sets. Use it only if your application needs to process a large number of data records with a mapper and then needs to combine results together with a reducer. That's all it really does. As to when the paradigm is applicable and when it is not, simply look at your use case. Do you need to manipulate all of your records consistently and then combine the results? Or is there another way to phrase your problem?
Take a look at the Mongo aggregation framework for examples where aggregation is used as a simpler alternative for problems that it would be overkill to force into a map-reduce shape.
It should also help give you insight into your question of whether you can do large-scale data processing without map-reduce, to which the answer is yes. Clearly map-reduce is good for making search indexes, but many problems on large data sets benefit from other paradigms.
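To make that aggregation-framework alternative concrete, here is a small sketch using PyMongo; the database, collection, and field names are invented. A grouped count that classic map-reduce would express as a mapper emitting (status, 1) and a reducer summing those 1s becomes a single pipeline stage:
# Counting orders per status with MongoDB's aggregation framework (PyMongo).
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # hypothetical database and collection

pipeline = [
    {"$group": {"_id": "$status", "count": {"$sum": 1}}},  # the map+reduce equivalent
    {"$sort": {"count": -1}},
]
for row in orders.aggregate(pipeline):
    print(row["_id"], row["count"])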
A web search on "alternatives to map reduce" will also be helpful.
