HBase Inner join and coprocessors - hadoop

I am planning to do a project for implementing all aggregation operations in HBase. But I don’t know about its difficulty. I have only 6 months for completing that project. Should I go forward with it? I am planning to do it in java. I know that there are already some aggregation functions. But there in no INNER JOIN like queries now. I am planning to implement such type of queries. I don't know it’s a blunder or bluff.

I think technically we should distinguish two types of joins:
a) One small table + One Big Table. By small table I mean table which can be cached in memory of each node w/o seriously affecting cluster operation. In this case Join using coprocessor should be be possible by putting small table in the hash map, iterating over the node local part of the data of the big table and this way producing join results. In the Hive's term it is called "map" join http://www.facebook.com/note.php?note_id=470667928919.
b) Two big tables. I do not think it is viable to get it production quality in short time frame. I might state that such functionality is realm of MPP databases and serious part of their IP.

It is definitely harder in HBase than doing it in an RDBMS or a different Hadoop technology like PIG or Hive.

Related

Hive union all efficiency and best practice

I have a hive efficiency question. I have 2 massive queries that need to be filtered, joined with mapping tables, and unioned. All the joins are identical for both tables. Would it be more efficient to union them before applying the joins to the combined table or to apply the joins to each massive query individually then union the results? Does it make a difference?
I tried the second way and the query ran for 24 hours before I killed it. I feel like I did everything I could to optimize it except potentially rearrange the union statement. On the one hand, I feel like it should not matter because the number or rows being joined by the mapping table is the same and since everything is palatalized, it should take roughly the same amount of time. On the other hand, maybe by doing the union first, it should guarantee that the two big queries are given full system resources before the joins are run. Then again, that might mean that there are only 2 jobs running at a time so the system is not being fully used or something.
I simply do not know enough about how hive and it's multi-threading works. Anybody have any ideas?
There is no such best practice. Both approaches are applicable. Subqueries in UNION ALL are running as parallel jobs. So join before union will work as parallel tasks with smaller datasets, tez can optimize execution and common joined tables will be read only once in single mapper stage for each table.
Also you can avoid joins for some subqueries for example if their keys are not applicable for join.
Join with union-ed bigger dataset also may work with very high parallelism depending on your settings (bytes per reducer for example), optimizer also may rewrite query plan. So I suggest you to check both methods, measure speed, study plan and check if you can change something. Change, measure, study plan... repeat
Few more suggestions:
Try to limit datasets before joining them. If your join multiplies rows then analytics and aggregation may work slower on bigger datasets and first approach may be preferable if you can apply analytics/aggregation before union.

Pure spark vs spark SQL for quering data on HDFS

I have (tabular) data on a hdfs cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future, with other data. And so, question:
What are the factors to take into account to choose where to use (pure) Spark and where to use Spark-SQL when implementing such task?
Here is the selection factors I could think of:
Familiarity with language:
In my case, I am more of a data-analyst than a DB guy, so this would lead me to use spark: I am more comfortable to think of how to (efficiently) implement data selection in Java/Scala than in SQL. This however depends mostly on the query.
Serialization:
I think that one can run Spark-SQL query without sending home-made-jar+dep to the spark worker (?). But then, returned data are raw and should be converted locally.
Efficiency:
I have no idea what differences there are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provides some insight?
About point 3, depending on your input-format, the way in which the data is scanned can be different when you use a pure-Spark vs Spark SQL. For example if your input format has multiple columns, but you need only few of them, it's possible to skip the retrieval using Spark SQL, whereas this is a bit trickier to achieve in pure Spark.
On top of that Spark SQL has a query optimizer, when using DataFrame or a query statement, the resulting query will go through the optimizer such that it will be executed more efficiently.
Spark SQL does not exclude Spark; combined usage is probably for the best results.

Hive vs Pig when performing Joins

I have some scripts which process my website's logs. I have loaded this data into multiple tables in Hive. I run these scripts on daily basis to do the analysis of the traffic.
Lately I am seeing that the hive queries which I have written in these scripts is taking too much time. Earlier, it used to take around 10-15 mins to generate the reports, but now it takes hours to do the same.
I did the analysis of the data and its around 5-10% of increase in dataset.
One of my friends suggested me that Hive is not good when it comes to joining multiple hive tables and I should switch my scripts to Pig. Is Hive bad at joining tables when compared to Pig?
Is Hive bad at joining tables
No. Hive is actually pretty good, but sometimes it takes a bit playing around with the query optimizer.
Depending on which version of Hive you use, you may need to provide hints in your query to tell the optimizer to join the data using a certain algorithm. You can find some details about different hints here.
If you're thinking about using Pig, I think your choice should not be motivated only by performance considerations. In my experience there is no quantifiable gain in using Pig, I have used both over the past years, and in terms of performance there is no clear winner.
What Pig gives you however is more transparency when defining what kind of join you want to use instead of relying on some (sometimes obscure) optimizer hints.
In the end, Pig or Hive doesn't really matter, it just depends how you decide to optimize your queries. If you're considering switching to Pig, I would first really analyze what your needs in terms of processing are, as you'll probably fall even in terms of performance. Here is a good post if you want to compare the 2.

What are the deciding factors for the order of Tables when joining amongst them?

I know that when joining across multiple tables, performance is dependent upon the order in which they are joined. What factors should I consider when joining tables?
Most modern RDBM's optimize the query based upon which tables are joined, the indexes used, table statistics, etc. They rarely, if ever, differ in their final execution plan based upon the order of the joins in the query.
SQL is designed to be declarative; you specify what you want, not (in most cases) how to get it. While there are things like index hints that can allow you to direct the optimizer to use or avoid specific indexes, by and large you can leave that work to the engine and be about the business of writing your queries.
In the end, running different versions of your queries within SQL Server Management Studio and viewing the actual execution plans is the only way to tell if order can truly make a difference.
As far as I know, the join order has no effect on query performance. The query engine will parse the query and execute it in the way it believes is the most efficient. If you want, try writing the query using different join orders and look at the execution plan. They should be the same.
See this article: http://sql-4-life.blogspot.com/2009/03/order-of-inner-joins.html

Oracle Hierarchical Query Performance

We're looking at using Oracle Hierarchical queries to model potentially very large tree structures (potentially infinitely wide, and depth of 30+). My understanding is that hierarchal queries provide a method to write recursively joining SQL but they it does not provide any real performance enhancements over if you were to manually write an equivalent query... is this the case? What sort of experiences have people had, performance wise, with using oracle hierarchical queries?
Well the short answer is that without the hierarchical extension (connect by) you couldn't write a recursive query. You could programmitically issue many queries which were recurisively linked.
The rule of thumb with everything database is, especially oracle, is that if you can issue your result in a single query it will almost always be faster than doing it programatically.
My experiences have been with much smaller sets, so I can't speak for how well heirarchical queries will perform for large sets.
When doing these tree retrievals, you typically have these options
Query everything and assemble the tree on the client side.
Perform one query for each level of the tree, building on what you know that you need from the previous query results
Use the built in stuff Oracle provides (START WITH,CONNECT BY PRIOR).
Doing it all in the database will reduce unnecessary round trips or wasteful queries that pull too much data.
Try partitioning the data within you hierarchical table and then limiting the partition included in the query.
CREATE TABLE
loopy
(key NUMBER, key_hier number, info VARCHAR2, part NUMBER)
PARTITION BY
RANGE (part)
(
PARTITION low VALUES LESS THAN (1000),
PARTITION mid VALUES LESS THAN (10000),
PARTITION high VALUES LESS THAN (MAXVALUE)
);
SELECT
info
FROM
loopy PARTITION(mid)
CONNECT BY
key = key_hier
START WITH
key = <some value>;
The interesting problem now becomes your partitioning strategy. Oracle provides several options.
I've seen that using connect by can be slow but compared to what? There isn't really another option except building a result set using recursive PL/SQL calls (slower) or doing it on your client side.
You could try separating your data into a mapping (hierarchy definition) and lookup tables (the display data) and then joining them back together. I guess I wouldn't expect much of a gain assuming you are getting the hierarchy data from indexed fields but its worth a try.
Have you tried it using the connect by yet? I'm a big fan of trying different variations.

Resources