Pure Spark vs Spark SQL for querying data on HDFS - hadoop

I have (tabular) data on an HDFS cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future, with other data. So, my question:
What factors should be taken into account when choosing between (pure) Spark and Spark SQL to implement such a task?
Here are the selection factors I could think of:
1. Familiarity with the language:
In my case, I am more of a data analyst than a DB guy, so this would lead me to use Spark: I am more comfortable thinking about how to (efficiently) implement data selection in Java/Scala than in SQL. This, however, depends mostly on the query.
2. Serialization:
I think that one can run a Spark SQL query without sending a home-made jar plus its dependencies to the Spark workers (?). But then the returned data is raw and has to be converted locally.
3. Efficiency:
I have no idea what differences there are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provide some insight?

About point 3: depending on your input format, the way the data is scanned can differ between pure Spark and Spark SQL. For example, if your input format has multiple columns but you need only a few of them, Spark SQL can skip reading the columns you don't need, whereas this is a bit trickier to achieve in pure Spark.
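To make the column-skipping idea concrete, here is a toy sketch in plain JavaScript. It is not Spark code: the table, the column names and the `scanColumns` helper are all hypothetical, and it only assumes that the data is laid out column by column, so a scan can touch just the columns a query asks for.

```javascript
// Hypothetical table stored column by column (one array per column).
const columnarTable = {
  userId: [1, 2, 3],
  country: ["FR", "DE", "FR"],
  payload: ["x".repeat(1000), "y".repeat(1000), "z".repeat(1000)]
};

// Read only the requested columns and count how many values were touched.
function scanColumns(table, wanted) {
  let valuesRead = 0;
  const rowCount = table[wanted[0]].length;
  const rows = [];
  for (let i = 0; i < rowCount; i++) {
    const row = {};
    for (const name of wanted) {
      row[name] = table[name][i];
      valuesRead++;
    }
    rows.push(row);
  }
  return { rows, valuesRead };
}

// A query that needs two columns never touches the large "payload" column.
const pruned = scanColumns(columnarTable, ["userId", "country"]);
const full = scanColumns(columnarTable, Object.keys(columnarTable));
console.log(pruned.valuesRead, full.valuesRead);
```

With a row-oriented layout every scan would look like the `full` case, which is roughly why the input format matters here.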
On top of that, Spark SQL has a query optimizer: when you use a DataFrame or a query statement, the resulting query goes through the optimizer so that it is executed more efficiently.
Spark SQL does not exclude Spark; combining the two probably gives the best results.

Related

How does CouchDB 1.6 inherently take advantage of map reduce when it is a single-server database?

I am new to CouchDB. While going through the documentation of CouchDB 1.6, I learned that it is a single-server DB, so I was wondering how map reduce can inherently take advantage of it.
If I need to scale this DB, do I need to add more RAID hardware, or will it work on commodity hardware like HDFS does?
I learned that CouchDB 2.0 plans to bring a clustering feature, but I could not find proper documentation on it.
Can you please help me understand how exactly files get stored and accessed internally?
I really appreciate your help.
I think your question is something like this:
"MapReduce is … a parallel, distributed algorithm on a cluster." [shortened from MapReduce article on Wikipedia]
But CouchDB 1.x is not a clustered database.
So what does CouchDB mean by using the term "map reduce"?
This is a reasonable question.
The historical use of "MapReduce", as described by Google in this paper using that stylized term and as implemented in Hadoop using the same styling, implies parallel processing over a dataset that may be too large for a single machine to handle.
But that's not how CouchDB 1.x works. View index "map" and "reduce" processing happens not just on a single machine, but even on a single thread! As dch (a longtime contributor to the core CouchDB project) explains in his answer at https://stackoverflow.com/a/12725497/179583:
The issue is that eventually, something has to operate in serial to build the B~tree in such a way that range queries across the indexed view are efficient. … It does seem totally wacko the first time you realise that the highly parallelisable map-reduce algorithm is being operated sequentially, wat!
So: what benefit does map/reduce bring to single-server CouchDB? Why were CouchDB 1.x view indexes built around it?
The benefit is that the two functions a developer can provide for each index ("map" and, optionally, "reduce") form very simple building blocks that are easy to reason about, at least after your indexes are designed.
What I mean is this:
With e.g. the SQL query language, you focus on what data you need — not on how much work it takes to find it. So you might have unexpected performance problems, that may or may not be solved by figuring out the right columns to add indexes on, etc.
With CouchDB, the so-called NoSQL approach is taken to an extreme. You have to think explicitly about how each document or set of documents "should be" found. You say, I want to be able to find all the "employee" documents whose "supervisor" field matches a certain identifier. So now you have to write a map function:
    function (doc) {
      // Index only employee records, keyed by their supervisor's identifier.
      if (doc.isEmployeeRecord) emit(doc.supervisor.identifier);
    }
And then you have to query it like:
GET http://couchdb.local:5984/personnel/_design/my_indexes/_view/by_supervisor?key=SOME_UUID
In SQL you might simply say something like:
    SELECT * FROM personnel WHERE supervisor = ?
So what's the advantage to the CouchDB way? Well, in the SQL case this query could be slow if you don't have an index on the supervisor column. In the CouchDB case, you can't really make an unoptimized query by accident — you always have to figure out a custom view first!
(The "reduce" function that you provide to a CouchDB view is usually used for aggregation, like counting or averaging across multiple documents.)
If you think this is a dubious advantage, you are not alone. Personally I found designing my own indexes via a custom "map function" and sometimes a "reduce function" to be an interesting challenge, and it did pay off in knowing the scaling costs of queries at least (not so much for replications…).
So don't think of a CouchDB view so much as being "MapReduce" (in the stylized sense) but just as providing efficiently accessible storage for the results of running [].map(…).reduce(…) across a set of data. Because the "map" function is applied to only one document at a time, the total set of data can be bigger than fits in memory at once. Because the "reduce" function is limited in its output size, it further encourages efficient processing of a large set of data into an efficiently accessed index.
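The [].map(…).reduce(…) picture can be sketched in plain JavaScript. This is only an in-memory illustration under stated assumptions, not how CouchDB is implemented: the sample documents are hypothetical, `mapDoc`, `buildIndex` and `countForKey` are made-up helper names, and a sorted array stands in for the on-disk B-tree.

```javascript
// Hypothetical sample documents; field names follow the map function above.
const docs = [
  { _id: "e1", isEmployeeRecord: true, supervisor: { identifier: "boss-a" } },
  { _id: "e2", isEmployeeRecord: true, supervisor: { identifier: "boss-b" } },
  { _id: "e3", isEmployeeRecord: true, supervisor: { identifier: "boss-a" } },
  { _id: "x1", isEmployeeRecord: false }
];

// "map" step: each document is examined on its own and may emit rows.
function mapDoc(doc) {
  const rows = [];
  const emit = (key, value) => rows.push({ key, value, id: doc._id });
  if (doc.isEmployeeRecord) emit(doc.supervisor.identifier, 1);
  return rows;
}

// Build the "index": every emitted row, kept sorted by key.
function buildIndex(allDocs) {
  return allDocs
    .flatMap(mapDoc)
    .sort((a, b) => (a.key < b.key ? -1 : a.key > b.key ? 1 : 0));
}

// "reduce" step: fold the values of the rows matching a key into one number.
function countForKey(index, key) {
  return index
    .filter((row) => row.key === key)
    .map((row) => row.value)
    .reduce((sum, v) => sum + v, 0);
}

const index = buildIndex(docs);
console.log(countForKey(index, "boss-a"));
```

Note that `mapDoc` never sees more than one document, which is the property that lets the full data set exceed memory.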
If you want to learn a bit more about how the indexes generated in CouchDB are stored, you might find these articles interesting:
The Power of B-trees
CouchDB's File Format is brilliantly simple and speed-efficient (at the cost of disk space).
Technical Details, View Indexes
You may have noticed, and I am sorry, that I do not actually have a clear/solid answer as to what the actual advantages and reasons were! I did not design or implement CouchDB; I was only an avid user for many years.
Maybe the bigger advantage is that, in systems like Couchbase and CouchDB 2.x, the "parallel friendliness" of the map/reduce idea may come into play more. So then if you have designed an app to work in CouchDB 1.x it may then scale in the newer version without further intervention on your part.

Monolithic ETL to distributed/scalable solution and OLAP cube to Elasticsearch/Solr

I am a relative newbie to big data processing, looking for some specific guidance from the SO community.
We are currently set up with a monolithic/sequential ETL; needless to say, it does not scale as our data grows. What are our options (distributing and parallelizing, sure, but I need specifics)? I have played with Hadoop and it may be appropriate to use here, but I am wondering what some of the other options out there are. Maybe something that's easier to transition to for a database developer?
Somewhat related to the question above: we also have an OLAP cube for aggregated data. Are Elasticsearch or Solr good candidates for replacing an OLAP cube? Has anyone successfully done this? What are the gotchas?
We are currently working on the same kind of use case.
Our approach may be useful:
step 1: we Sqoop the data from the databases into HDFS
step 2: ETL logic in Pig scripts
step 3: build an index on the aggregated table data in Solr
step 4: search Solr through a web interface
In our use case we are developing Pig jobs that perform the transformation logic and store the results in final folders incrementally. Later, the MR indexer tool indexes the data into Solr. We are using Cloudera Search. Let me know if you need anything else.

Map-Reduce - only applicable to key-value NoSQL data models?

(I only have conceptual knowledge of NoSQL, no working experience)
I am aware of the following types of NoSQL databases:
key-value, column family, document databases (Aggregates)
graph databases
Is the Map-Reduce paradigm applicable to all? My guess would be no since Map-Reduce is often discussed in terms of keys and values, but since the distinction between different NoSQL stores isn't so clean-cut, I am wondering where Map-Reduce is and isn't applicable. And since I'm in the process of evaluating which DB to use for a few app ideas I have, I should think whether it's possible to achieve large scale processing regardless of which store I use.
Support for map reduce probably shouldn't be the thing on which to base your choice of a datastore.
Firstly, map reduce isn't the only way to do large-scale data processing. For example, MongoDB implemented map reduce support early (in v1), but later added their Aggregation Framework which was much more general, subsuming many tasks that would make use of map reduce.
Map reduce is just one paradigm for processing large data sets. Use it only if your application needs to process a large number of data records with a mapper and then needs to combine results together with a reducer. That's all it really does. As to when the paradigm is applicable and when it is not, simply look at your use case. Do you need to manipulate all of your records consistently and then combine the results? Or is there another way to phrase your problem?
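As a rough illustration of that mapper-then-reducer shape, here is a single-process sketch in plain JavaScript. The records and the `mapReduce` helper are hypothetical; a real framework would run the same three phases distributed across machines.

```javascript
// Hypothetical input records.
const records = [
  { page: "/home", bytes: 120 },
  { page: "/about", bytes: 80 },
  { page: "/home", bytes: 200 }
];

// Mapper: one record in, zero or more [key, value] pairs out.
const mapper = (rec) => [[rec.page, rec.bytes]];

// Reducer: all values for one key in, one combined value out.
const reducer = (key, values) => values.reduce((a, b) => a + b, 0);

function mapReduce(input, mapFn, reduceFn) {
  // Map phase: apply the mapper to every record independently.
  const pairs = input.flatMap(mapFn);

  // Shuffle phase: group the emitted values by key.
  const groups = new Map();
  for (const [key, value] of pairs) {
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(value);
  }

  // Reduce phase: combine each key's values into a single result.
  const result = {};
  for (const [key, values] of groups) {
    result[key] = reduceFn(key, values);
  }
  return result;
}

const bytesPerPage = mapReduce(records, mapper, reducer);
console.log(bytesPerPage);
```

If your problem does not decompose into an independent per-record step followed by a per-key combination like this, it is probably a sign that another paradigm fits better.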
Take a look at the Mongo aggregation framework for examples where aggregation is a simpler alternative for many problems that would be overkill to force into a map-reduce shape.
It should also help give you insight into your question of whether you can do large-scale data processing without map-reduce, to which the answer is yes. Clearly map-reduce is good for making search indexes, but many problems on large data sets benefit from other paradigms.
A web search on "alternatives to map reduce" will also be helpful.

Hive vs Pig when performing Joins

I have some scripts which process my website's logs. I have loaded this data into multiple tables in Hive. I run these scripts on a daily basis to analyze the traffic.
Lately I am seeing that the Hive queries which I have written in these scripts are taking too much time. Earlier, it used to take around 10-15 minutes to generate the reports, but now it takes hours to do the same.
I analyzed the data and the dataset has only grown by around 5-10%.
One of my friends told me that Hive is not good when it comes to joining multiple Hive tables and that I should switch my scripts to Pig. Is Hive bad at joining tables when compared to Pig?
Is Hive bad at joining tables
No. Hive is actually pretty good, but sometimes it takes a bit of playing around with the query optimizer.
Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
If you're thinking about using Pig, I think your choice should not be motivated only by performance considerations. In my experience there is no quantifiable gain in using Pig: I have used both over the past years, and in terms of performance there is no clear winner.
What Pig gives you however is more transparency when defining what kind of join you want to use instead of relying on some (sometimes obscure) optimizer hints.
In the end, Pig or Hive doesn't really matter; it just depends on how you decide to optimize your queries. If you're considering switching to Pig, I would first really analyze what your processing needs are, as you'll probably break even in terms of performance. Here is a good post if you want to compare the two.

HBase Inner join and coprocessors

I am planning to do a project implementing all aggregation operations in HBase, but I don't know how difficult it is. I have only 6 months to complete the project. Should I go forward with it? I am planning to do it in Java. I know that there are already some aggregation functions, but there are no INNER JOIN-like queries yet. I am planning to implement that type of query. I don't know whether it's a blunder or a bluff.
I think technically we should distinguish two types of joins:
a) One small table + one big table. By small table I mean a table which can be cached in the memory of each node without seriously affecting cluster operation. In this case a join using a coprocessor should be possible by putting the small table in a hash map and iterating over the node-local part of the big table's data, producing the join results this way. In Hive's terms this is called a "map" join: http://www.facebook.com/note.php?note_id=470667928919.
b) Two big tables. I do not think it is viable to get this to production quality in a short time frame. I would say that such functionality is the realm of MPP databases and a serious part of their IP.
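Case (a) can be sketched as a plain hash join. The snippet below is a language-neutral illustration in JavaScript, not HBase coprocessor code: the tables, the column names and the `hashJoin` helper are hypothetical, and `employees` stands in for the node-local slice of the big table.

```javascript
// Hypothetical small table, assumed to fit in memory on every node.
const departments = [
  { deptId: 1, name: "Engineering" },
  { deptId: 2, name: "Sales" }
];

// Hypothetical node-local slice of the big table.
const employees = [
  { empId: 10, deptId: 1 },
  { empId: 11, deptId: 2 },
  { empId: 12, deptId: 3 },
  { empId: 13, deptId: 1 }
];

function hashJoin(smallRows, bigRows, key) {
  // Build phase: load the small table into a hash map keyed by the join column.
  const lookup = new Map();
  for (const row of smallRows) {
    if (!lookup.has(row[key])) lookup.set(row[key], []);
    lookup.get(row[key]).push(row);
  }

  // Probe phase: stream over the big table once and emit matching pairs.
  // Rows without a match are dropped, which gives inner-join semantics.
  const joined = [];
  for (const bigRow of bigRows) {
    for (const smallRow of lookup.get(bigRow[key]) || []) {
      joined.push({ ...smallRow, ...bigRow });
    }
  }
  return joined;
}

const joinedRows = hashJoin(departments, employees, "deptId");
console.log(joinedRows);
```

Each node only needs its own part of the big table plus a full copy of the small one, so no data from the big table has to move between nodes; that is exactly what stops working in case (b).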
It is definitely harder in HBase than doing it in an RDBMS or a different Hadoop technology like PIG or Hive.

Resources