I have discovered some significant performance differences (in real runtime as well as CPU time) between Pig and Hive and am looking for ways to get to the bottom of them. I have used both languages' explain features (Hive: the EXPLAIN keyword; Pig: pig -e 'explain -script explain.pig') to compare and contrast the generated syntax trees and the logical, physical, and map-reduce plans. However, both appear to be doing the same things. The job tracker, however, shows a difference in the number of map and reduce tasks launched (I subsequently ensured that both use the same number of map and reduce tasks, and the performance difference remains). My question therefore is: in what other ways can I analyze what is going on (possibly at a lower level / bytecode level)?
EDIT: I am running the TPC-H benchmarks from the TPC (available at https://issues.apache.org/jira/browse/PIG-2397 and https://issues.apache.org/jira/browse/HIVE-600). However, even simpler scripts show quite a large performance difference. For example:
SELECT (dataset.age * dataset.gpa + 3) AS F1,
(dataset.age/dataset.gpa - 1.5) AS F2
FROM dataset
WHERE dataset.gpa > 0;
I still need to fully evaluate the TPC-H benchmarks (I will update this later); however, the results for the simpler scripts are detailed in this document: https://www.dropbox.com/s/16u3kx852nu6waw/output.pdf
(jpg: http://i.imgur.com/1j1rCWS.jpg )
I have read some of the source code of Pig and Hive before, so I can share some opinions.
Since I was focusing on the join implementations, I can provide some details about how Pig and Hive implement joins. Hive's join implementation is less efficient than Pig's. I have no idea why Hive needs to create so many objects in its join implementation (such operations are very slow and should be avoided), and I think that is why Hive performs joins more slowly than Pig. If you are interested, you can check the CommonJoinOperator code yourself. So my guess is that Pig is usually more efficient because of its higher-quality code.
Related
What steps can one think through to logically conclude which APIs or commands to use for time efficiency?
For example: empirically, I found joining DataFrames through SQL API calls to be ~30% more time efficient than using native Scala commands.
df1.join(df2, df1("k") === df2("k"), "inner")
sqlContext.sql("SELECT * FROM df1 JOIN df2 ON df1.k = df2.k")
What are the first principles involved when determining the optimal command?
Performance comparisons in big data are notoriously tricky because there are too many factors you cannot control.
Use explain to see the logical and physical execution plans. If the two are the same for the DSL and Spark SQL versions, then Spark will do exactly the same work. I expect the plans for both statements above to be the same and, hence, the observed difference to be due to other factors, e.g., machine resource use by other processes during the test run, pre-caching between runs, etc.
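For example, a minimal Scala sketch of such a check (my own illustration, using tiny stand-in DataFrames with a join column k, as in the question) prints the extended plan for both formulations so they can be compared side by side:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object PlanComparison {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("plan-comparison").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Tiny stand-ins for the df1/df2 from the question (hypothetical schema).
    val df1 = Seq((1, "a"), (2, "b")).toDF("k", "v1")
    val df2 = Seq((1, "x"), (3, "y")).toDF("k", "v2")

    // DSL form: print the extended (logical + physical) plan.
    df1.join(df2, df1("k") === df2("k"), "inner").explain(true)

    // SQL form: register temp tables and explain the equivalent statement.
    df1.registerTempTable("df1")
    df2.registerTempTable("df2")
    sqlContext.sql("SELECT * FROM df1 JOIN df2 ON df1.k = df2.k").explain(true)

    sc.stop()
  }
}

If the two physical plans are identical, any remaining runtime gap almost certainly comes from the environment rather than from the API used.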
During job execution, use the Spark UI to see what's going on.
I'm currently migrating from the Hadoop MR paradigm to Apache Spark, and a few doubts come to my mind regarding advanced efficiency patterns beyond the usual basic "map and reduce" workflow.
In the well-known book by Lin and Dyer (2010), the "in-mapper combiner" pattern is introduced, which can significantly improve efficiency in many applications.
For example, the canonical word-count example in Hadoop, where we normally emit (word, 1) tuples to be combined later, can be greatly improved if local aggregation into (word, n) tuples is performed before emitting. Although combiners can fulfil this behaviour, my experience is that using local variables per mapper together with Hadoop's setup() and cleanup() methods can lead to higher computational savings (here is a nice tutorial).
In the Spark world I could not find anything similar, only the so-called map-side aggregation, which is equivalent to Hadoop's local combiner. Given the previous example, I wonder whether it can be translated into Spark using map functions.
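For what it's worth, here is a minimal Scala sketch (my own illustration, not from the original post; the input path is hypothetical) of how the in-mapper-combining idea can be expressed in Spark: mapPartitions plays the role of the per-mapper state that setup()/cleanup() manage in Hadoop, and reduceByKey then merges the per-partition partial counts.

import org.apache.spark.{SparkConf, SparkContext}
import scala.collection.mutable

object InMapperCombineWordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("in-mapper-combine").setMaster("local[*]"))
    val lines = sc.textFile("hdfs:///tmp/input.txt")  // hypothetical input path

    // Local aggregation per partition: one mutable map per partition is the
    // analogue of the mapper-level state used by the in-mapper combiner.
    val partialCounts = lines.mapPartitions { iter =>
      val counts = mutable.HashMap.empty[String, Long].withDefaultValue(0L)
      for (line <- iter; word <- line.split("\\s+") if word.nonEmpty)
        counts(word) += 1L
      counts.iterator
    }

    // Merge the per-partition (word, n) tuples across the cluster.
    val wordCounts = partialCounts.reduceByKey(_ + _)

    wordCounts.take(10).foreach(println)
    sc.stop()
  }
}

Note that reduceByKey and aggregateByKey already perform map-side combining on their own, so the explicit mapPartitions step mainly pays off when the churn of per-record (word, 1) objects is itself the bottleneck.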
I have (tabular) data on an HDFS cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future with other data. So, my question is:
What are the factors to take into account when choosing between (pure) Spark and Spark SQL for implementing such a task?
Here are the selection factors I could think of:
Familiarity with language:
In my case, I am more of a data analyst than a DB guy, so this would lead me toward Spark: I am more comfortable thinking about how to (efficiently) implement data selection in Java/Scala than in SQL. This, however, depends mostly on the query.
Serialization:
I think that one can run a Spark SQL query without shipping a home-made jar plus its dependencies to the Spark workers (?). But then the returned data are raw and have to be converted locally.
Efficiency:
I have no idea what differences there are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provide some insight?
Regarding point 3 (efficiency): depending on your input format, the way the data is scanned can differ between pure Spark and Spark SQL. For example, if your input format has many columns but you need only a few of them, Spark SQL can skip retrieving the unneeded ones (column pruning), whereas this is a bit trickier to achieve in pure Spark.
On top of that, Spark SQL has a query optimizer: whether you use the DataFrame API or a SQL statement, the resulting query goes through the optimizer so that it is executed more efficiently.
Spark SQL does not exclude Spark; combined usage probably gives the best results.
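To make the combined usage concrete, here is a minimal Scala sketch (my own illustration; the Parquet path, schema, and "best score per user" logic are hypothetical): Spark SQL handles the scan, so the projection and filter are pushed into the Parquet reader and go through the optimizer, and the RDD API then takes over for the custom part.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object MixedUsage {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("mixed-usage").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)

    // Spark SQL side: only the two needed columns are read from the wide
    // Parquet file (column pruning), and the filter goes through the optimizer.
    val narrow = sqlContext.read.parquet("hdfs:///data/events")  // hypothetical path
      .select("user_id", "score")
      .filter("score > 0")

    // Pure-Spark side: drop to the RDD API for logic that is awkward in SQL,
    // e.g. keep each user's best score.
    val perUser = narrow.rdd
      .map(row => (row.getAs[String]("user_id"), row.getAs[Double]("score")))
      .reduceByKey((a, b) => math.max(a, b))

    perUser.take(10).foreach(println)
    sc.stop()
  }
}

The DataFrame could equally be registered as a temp table and queried with sqlContext.sql; either way the scan benefits from the optimizer while the custom logic stays in plain Scala.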
I have some scripts that process my website's logs. I have loaded this data into multiple tables in Hive, and I run these scripts on a daily basis to analyze the traffic.
Lately I have been seeing that the Hive queries in these scripts are taking too much time. Earlier it used to take around 10-15 minutes to generate the reports, but now it takes hours to do the same.
I analyzed the data, and the dataset has only grown by around 5-10%.
One of my friends suggested that Hive is not good when it comes to joining multiple Hive tables and that I should switch my scripts to Pig. Is Hive bad at joining tables compared to Pig?
Is Hive bad at joining tables
No. Hive is actually pretty good, but sometimes it takes a bit of playing around with the query optimizer.
Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
If you're thinking about using Pig, I think your choice should not be motivated only by performance considerations. In my experience there is no quantifiable gain in using Pig: I have used both over the past years, and in terms of performance there is no clear winner.
What Pig gives you, however, is more transparency when defining what kind of join you want to use, instead of relying on (sometimes obscure) optimizer hints.
In the end, whether you use Pig or Hive doesn't really matter; it just depends on how you decide to optimize your queries. If you're considering switching to Pig, I would first carefully analyze what your processing needs are, as you'll probably come out about even in terms of performance. Here is a good post if you want to compare the two.
I am planning a project to implement all aggregation operations in HBase, but I don't know how difficult that is, and I have only 6 months to complete the project. Should I go forward with it? I am planning to do it in Java. I know that some aggregation functions already exist, but there are currently no INNER JOIN-like queries, and that is the kind of query I plan to implement. I don't know whether this is a reasonable idea or a blunder.
I think technically we should distinguish two types of joins:
a) One small table + one big table. By "small table" I mean a table that can be cached in the memory of each node without seriously affecting cluster operation. In this case a join using a coprocessor should be possible: put the small table into a hash map, iterate over the node-local part of the big table's data, and produce the join results that way (see the sketch after this list). In Hive terms this is called a "map join": http://www.facebook.com/note.php?note_id=470667928919.
b) Two big tables. I do not think it is viable to get this to production quality in a short time frame. I would say that such functionality is the realm of MPP databases and a serious part of their IP.
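To make option (a) concrete, here is a rough client-side Scala sketch of the idea (not a coprocessor; the table names, column family, and join column are hypothetical): cache the small table in a hash map keyed on the join column, then stream over the big table and probe the map.

import org.apache.hadoop.hbase.{HBaseConfiguration, TableName}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Scan}
import org.apache.hadoop.hbase.util.Bytes
import scala.collection.JavaConverters._

object HBaseMapJoinSketch {
  def main(args: Array[String]): Unit = {
    val conn = ConnectionFactory.createConnection(HBaseConfiguration.create())
    try {
      val cf     = Bytes.toBytes("cf")        // hypothetical column family
      val keyCol = Bytes.toBytes("join_key")  // hypothetical join column

      // Step 1: load the small table entirely into an in-memory hash map.
      val small = conn.getTable(TableName.valueOf("small_table"))
      val smallByKey = small.getScanner(new Scan()).asScala
        .map(r => Bytes.toString(r.getValue(cf, keyCol)) -> Bytes.toString(r.getRow))
        .toMap

      // Step 2: stream over the big table and probe the map for each row.
      val big = conn.getTable(TableName.valueOf("big_table"))
      for (row <- big.getScanner(new Scan()).asScala) {
        val k = Bytes.toString(row.getValue(cf, keyCol))
        smallByKey.get(k).foreach { smallRowKey =>
          // Emit the joined pair; a real implementation would project columns here.
          println(s"$k: ${Bytes.toString(row.getRow)} x $smallRowKey")
        }
      }
    } finally conn.close()
  }
}

A coprocessor-based version would run the same probe logic region-locally on each region server instead of in the client, but the hash-map-plus-scan structure stays the same.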
It is definitely harder to do in HBase than in an RDBMS or in a different Hadoop technology like Pig or Hive.