Goto statement or iteration in HiveQL - hadoop

How do you do iterations in HiveQL? If there is no way to implement a loop, is there a GOTO kind of statement?
I need to perform the same execution four times with four parameter values, one iteration for each. Is there a way to do this without using a script?

You could use SparkSQL, or any other Hive client in a language that supports procedural expressions, and have both options. HiveQL alone offers neither.
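For illustration, a minimal PySpark sketch of the procedural-client approach, assuming a Hive-enabled SparkSession; the table and column names (my_source, my_target, part_id) are placeholders, not from the question:

# Sketch: drive the same HiveQL statement four times from a procedural Hive client (PySpark).
# my_source, my_target and part_id are hypothetical names.
from pyspark.sql import SparkSession

spark = SparkSession.builder.enableHiveSupport().getOrCreate()

for part_id in [1, 2, 3, 4]:
    spark.sql(
        "INSERT INTO my_target "
        "SELECT * FROM my_source WHERE part_id = {p}".format(p=part_id)
    )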

Related

How to determine which APIs to use for the code to be time efficient in Spark

What are the steps one can think through to logically conclude which APIs, or commands in general, to use for time efficiency?
For example: empirically, I found joining DataFrames through SQL API calls to be ~30% more time efficient than using native Scala commands.
df1.join(df2, df1.k == df2.k, how='inner')
sqlContext.sql('SELECT * FROM df1 JOIN df2 ON df1.k = df2.k')
What are the first principles involved when determining the optimal command?
Performance comparisons in big data are notoriously tricky because there are too many factors you cannot control.
Use explain to see the logical and physical execution plans. If the plans are the same for the DSL and SparkSQL versions, Spark will do exactly the same work. I expect the plans for both statements above to be identical and, hence, the observed difference to be due to other factors, e.g., machine resource use by other processes during the test run, caching between runs, etc.
During job execution, use the Spark UI to see what's going on.
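For example, a quick way to compare the plans of the two statements above (a sketch assuming df1 and df2 are DataFrames that have also been registered as temporary tables named df1 and df2):

# Sketch: print the extended (logical + physical) plan for both versions of the join.
df1.join(df2, df1.k == df2.k, how='inner').explain(True)                     # DSL version
sqlContext.sql('SELECT * FROM df1 JOIN df2 ON df1.k = df2.k').explain(True)  # SQL version
# If the physical plans are identical, both statements do exactly the same work.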

Pure Spark vs Spark SQL for querying data on HDFS

I have (tabular) data on an HDFS cluster and need to do some slightly complex querying on it. I expect to face the same situation many times in the future, with other data. So, my question:
What are the factors to take into account when choosing between (pure) Spark and Spark-SQL for implementing such a task?
Here are the selection factors I could think of:
Familiarity with language:
In my case, I am more of a data analyst than a DB guy, so this would lead me to use Spark: I am more comfortable thinking about how to (efficiently) implement data selection in Java/Scala than in SQL. This, however, depends mostly on the query.
Serialization:
I think that one can run a Spark-SQL query without shipping a home-made JAR plus its dependencies to the Spark workers (?). But then the returned data are raw and have to be converted locally.
Efficiency:
I have no idea what differences there are between the two.
I know this question might be too general for SO, but maybe not. So, could anyone with more knowledge provide some insight?
About point 3: depending on your input format, the way the data is scanned can differ between pure Spark and Spark SQL. For example, if your input format has many columns but you need only a few of them, Spark SQL can skip retrieving the unused ones (column pruning), whereas this is a bit trickier to achieve in pure Spark.
On top of that, Spark SQL has a query optimizer: whether you use the DataFrame API or a SQL statement, the resulting query goes through the optimizer so that it is executed more efficiently.
Spark SQL does not exclude Spark; combined usage probably gives the best results.
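A minimal sketch of the column-pruning point, assuming a Parquet source; the paths and column names are illustrative:

# Sketch: with a columnar source, Spark SQL reads only the columns the query needs.
df = sqlContext.read.parquet('hdfs:///data/events')
df.select('a', 'b').where(df.c > 0).explain(True)   # plan shows only columns a, b, c being scanned

# With a plain RDD, each full record is read and parsed before you can drop fields:
rdd = sc.textFile('hdfs:///data/events.csv')
pairs = rdd.map(lambda line: line.split(',')).map(lambda f: (f[0], f[1]))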

Too much if-else in an Oracle procedure: good or bad for performance?

In a procedure, I need to decide the value of a column using many if-else conditions. The script starts with a
FOR rec IN (SELECT ...) LOOP
and decides among many different values that the rec fields can take in each iteration. For some cases there might be just 2 or 3 variable assignments, while other cases involve calling a separate procedure or doing an INSERT.
Is this the best approach to decision making, performance-wise, when the loop's SELECT statement returns many records, or are there better alternatives? What if I adopt the chain-of-responsibility pattern in writing this procedure? Would that increase performance or just make things even worse?
Regards.
If possible, stay inside SQL.
For bigger volumes, avoid IF, as the work will be serialized on the PL/SQL side.
Express the logic as a CASE inside your SELECT if possible.
For example, with SQL parallelization the statement can then be handled by multiple processes for different rows at the same time.
Hope it helps,
Igor
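To illustrate the CASE-inside-SELECT suggestion above, a minimal sketch; the table and column names are hypothetical, not from the question:

-- Sketch: one set-based statement instead of a row-by-row IF/ELSIF loop.
-- orders, order_flags, status and priority are hypothetical names.
INSERT INTO order_flags (order_id, flag)
SELECT o.order_id,
       CASE
         WHEN o.status = 'NEW' AND o.priority > 5 THEN 'URGENT'
         WHEN o.status = 'NEW'                    THEN 'NORMAL'
         ELSE 'IGNORE'
       END
FROM orders o;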

Large query, multiple tables, old vs new JOIN syntax

I have a large query that joins around 20 tables (mostly outer joins). It uses the older Oracle join syntax, with comma-separated tables in the FROM clause and (+) markers in the WHERE clause for the outer joins.
We noticed that it is consuming a lot of server memory. We are trying several things, one of which is to convert this query to the newer ANSI syntax, since ANSI syntax allows better control over the order of JOINs and states the JOIN predicates explicitly where they are applied.
Does converting the query from an older syntax to the newer ANSI syntax help in reducing the amount of data processed, for such large queries spanning a good number of tables?
In my experience, it does not - it generates identical execution plans. That said, the newer JOIN syntax does allow you to do things that you can't do with the old syntax. I would recommend converting it for that reason, and for clarity; the ANSI syntax is just so much easier to read (at least for me). Once converted, you can then compare execution plans.
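For reference, the same outer join written in both styles; the emp/dept tables are only illustrative:

-- Old Oracle style: comma join with the (+) outer-join marker in the WHERE clause.
SELECT e.ename, d.dname
FROM   emp e, dept d
WHERE  e.deptno = d.deptno(+);

-- ANSI style: the join predicate is stated where the join happens.
SELECT e.ename, d.dname
FROM   emp e
LEFT OUTER JOIN dept d ON e.deptno = d.deptno;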
DCookie said all there is to say about ANSI syntax.
However, if you outer join 20 tables, it is no wonder you consume a lot of server memory. Cutting the query down into smaller subqueries might improve performance. That way, not all tables have to be read into memory, then joined in memory, then filtered, and only then reduced to the columns you need.
Reversing this order will at least save memory, although it will not necessarily improve execution speed.
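One way to cut the query down, as a sketch with illustrative table names, is to factor part of it into a named subquery with a WITH clause so it is filtered and reduced before the big join:

-- Sketch: filter and reduce one input first, then join the smaller result.
-- customers, orders and their columns are illustrative names.
WITH recent_orders AS (
  SELECT o.customer_id, o.order_id, o.order_date
  FROM   orders o
  WHERE  o.order_date >= DATE '2015-01-01'
)
SELECT c.name, r.order_id, r.order_date
FROM   customers c
LEFT OUTER JOIN recent_orders r ON r.customer_id = c.customer_id;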
As DCookie mentioned, both versions should produce identical execution plans. I would start by looking at the current query's execution plan and figuring out what is actually taking up the memory. A quick look at DBMS_XPLAN.DISPLAY_CURSOR output should be a good start. Once you know exactly what part of the query you are trying to improve, then you can analyze if switching to ANSI style joins will do anything to help you reach your end goal.
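A quick way to get that output, assuming you can rerun the statement and stay in the same session:

-- Sketch: first run the problem query with runtime statistics enabled, e.g.
--   SELECT /*+ gather_plan_statistics */ ... your 20-table join ...
-- then, in the same session, display the actual plan and row counts:
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY_CURSOR(NULL, NULL, 'ALLSTATS LAST'));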

In a PL/SQL Oracle stored procedure, trying to use analytic functions

Is it better to use cursors or analytic functions in a stored procedure for performance?
It is better to use pure SQL in a set-based fashion than to use cursors to process data RBAR (row by agonizing row). This is because the context switching between the PL/SQL and SQL engines is an overhead. Cursors and all their additional code are also an overhead.
Analytic functions are an excellent extension to SQL which allow us to do things in a single SELECT statement that previously would have required procedural code.
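For instance, a sketch of an analytic query doing what might otherwise require a cursor loop; the emp table and its columns are illustrative:

-- Sketch: ranking and a per-group total computed in one SELECT, no procedural loop.
SELECT empno,
       deptno,
       sal,
       RANK() OVER (PARTITION BY deptno ORDER BY sal DESC) AS sal_rank,
       SUM(sal) OVER (PARTITION BY deptno)                 AS dept_sal_total
FROM   emp;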
Of course, if you're looking to process large amounts of data then bulk collection and the FORALL statement are definitely the best approach. If you need to use the LIMIT clause then explicit cursors are unavoidable.
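As a sketch of that pattern, with hypothetical src_table/dest_table names:

-- Sketch: explicit cursor + BULK COLLECT ... LIMIT + FORALL, batching 1000 rows at a time.
DECLARE
  CURSOR c_src IS SELECT id, val FROM src_table;
  TYPE t_ids  IS TABLE OF src_table.id%TYPE;
  TYPE t_vals IS TABLE OF src_table.val%TYPE;
  l_ids  t_ids;
  l_vals t_vals;
BEGIN
  OPEN c_src;
  LOOP
    FETCH c_src BULK COLLECT INTO l_ids, l_vals LIMIT 1000;
    EXIT WHEN l_ids.COUNT = 0;               -- nothing left to process
    FORALL i IN 1 .. l_ids.COUNT
      INSERT INTO dest_table (id, val) VALUES (l_ids(i), l_vals(i));
  END LOOP;
  CLOSE c_src;
END;
/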
It's always preferable to not use cursors if you don't have to.
Using a pure SQL statement is always a better choice than a cursor. Cursors are used when it is cumbersome to process the data using pure SQL.
With a cursor we process one row at a time. With bulk collection we reduce context switching, but we are still processing only a few thousand rows per cycle using the LIMIT clause. Using pure SQL, it is possible to insert or update many more rows, provided enough resources are available; the rollback segment is a limiting factor. In one project I wrote very complex pure SQL to insert millions of rows in a very large data warehousing environment without failure. The DBA had made a very large rollback segment available. I could not achieve that kind of performance with bulk collection.
Please be aware that most companies will not give you that kind of facility, so take a decision based on the available resources. You can use a combination of both.
